This article is part of the AIOps for Big Data series.
The heart of artificial intelligence is data, and companies with a lot of it are constantly working to boost what it can do for them. Beyond that simple fact, though, many continue to wonder what it means to say that data is involved in development practices like AIOps (AI for IT operations), or how data can be used in place of human and even machine-driven analytics.
Further questions abound. How can we use machine learning algorithms together with big data-based business operation and maintenance platforms? How can machine learning enhance alarm filtering, anomaly monitoring, automated repairs, and other tasks to truly liberate operation and maintenance?
Faced with these questions, Alibaba is moving away from a belief in AIOps as a long-term evolution toward a data-centric approach to IT. In that spirit, the group has departed from common industry practices and invested in a robust foundation in DataOps, an automated, process-oriented methodology that data analysts use to improve the quality and shorten the cycle time of data analysis.
In this article, we look at the challenges and opportunities facing operation and maintenance teams as they move beyond outdated practices and into the data-driven era.
From ScriptOps to AIOps, Level by Level
Scripted operation and maintenance
- Script replaces manual operation
- Execution: human + script
- Decision-making: human
Automated operation and maintenance
- Most operation and maintenance work is done automatically or by processes
- Execution: human + system
- Decision-making: human
Highly automated + single-point intelligence
- Operation and maintenance is done by data system construction
- Execution: human + system (80%)
- Decision-making: human + system (20%)
L4: DataOps (advanced)
Highly automated + series intelligence
- Main operation and maintenance scenes are implemented by processes and free of intervention
- Execution: human + system (95%)
- Decision-making: human + system (80%)
Fully automatic smart operation and maintenance
- Can be easily adjusted between cost, quality, and efficiency
- Execution: system (100%)
- Decision-making: human + system (95%)
Rough Beginnings: ScriptOps
Operation and maintenance work requires a high level of skill, and its scope exceeds that of most other IT fields. Nevertheless, many think of it as limited to releases, modifications, alerts, and device migration, a view that reflects the dated practices known collectively as ScriptOps.
In some ways, this perception is understandable. All big Internet companies begin as small companies where these issues (and all manner of other problems) threaten the company’s survival. Pressure and the pursuit of short-term results, though, have led many to rely on simplistic solutions from online technical forums or even personal blogs, leaving a legacy of misunderstanding that today’s professionals must move beyond.
A Case for ToolOps
The view described above is more than an outside misconception. Anyone who has led newcomers in the field is likely aware of their tendency to deploy one-click batch release software, one-click cleanup, interactive wizard execution, or other “black screen” scripts. Often, they simply re-implement some such solution according to their personal sense of it, failing to grasp the potential for mishap in different scenarios. This invites inefficiencies and security risks, and the history of the Internet is riddled with the disastrous consequences of mistakes as simple as typing in the wrong characters.
Today, it is better understood that novices should not be left to run free on systems they only partly understand. Instead, there is an ongoing push to merge more and more functional scripts into workable tools that can ensure the effective handover of the capabilities they provide. This approach is known as ToolOps.
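To make the idea of folding a script into a tool more concrete, here is a minimal Python sketch of a "one-click cleanup" wrapped in guardrails. It is an illustration only, not a tool described in this article: the `CleanupTool` class, the `allowed_root` and `min_age_seconds` parameters, and the one-week retention default are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import List, Optional
import time


class UnsafeTargetError(ValueError):
    """Raised when a cleanup target is the allowed root itself or lies outside it."""


@dataclass(frozen=True)
class CleanupTool:
    # Hypothetical settings: the only directory tree this tool may touch,
    # and the minimum file age (one week here) before a file is removable.
    allowed_root: Path
    min_age_seconds: float = 7 * 24 * 3600

    def _validate(self, target: Path) -> Path:
        root = self.allowed_root.resolve()
        resolved = target.resolve()
        # Reject the root itself and anything outside it, such as "/" typed by mistake.
        if resolved == root or root not in resolved.parents:
            raise UnsafeTargetError(f"refusing to clean {resolved}")
        return resolved

    def plan(self, target: Path, now: Optional[float] = None) -> List[Path]:
        """List the files that would be deleted, without deleting anything."""
        resolved = self._validate(target)
        now = time.time() if now is None else now
        return sorted(
            p for p in resolved.rglob("*")
            if p.is_file() and now - p.stat().st_mtime >= self.min_age_seconds
        )

    def run(self, target: Path, dry_run: bool = True,
            now: Optional[float] = None) -> List[Path]:
        """Delete old files under target; dry_run=True (the default) only reports."""
        doomed = self.plan(target, now)
        if not dry_run:
            for path in doomed:
                path.unlink()
        return doomed
```

In this sketch a newcomer would call `tool.run(Path("/data/app/logs"))` (a hypothetical path) to preview the result, and would have to pass `dry_run=False` explicitly to delete anything. A mistyped target such as `/` raises `UnsafeTargetError` instead of running.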
Shifting to Platform-Based DevOps
When an Internet company’s commercial success raises the scale of its operations, quantitative changes begin to create qualitative changes at the data level. Today, operation and maintenance at a large Internet company demands entirely new computing practices, and simply adding staff is not a solution.
Put another way, when an application grows from hundreds of platform units to tens or hundreds of thousands, data processing changes from a simple matter of CPU, memory, and mechanical hard disks to an elaborate mix of GPUs, FPGAs, ASICs, Optane SSDs, and other hardware, software, and big data distributions.
As issues threaten a large platform’s business and resources, data workers often face tasks bordering on the impossible. At such times, the operation and maintenance job description more closely resembles:
- Global architecture planning
- Resource operation and cost optimization
- Automated platform development
- Stability protection
- Massive data analysis
- Any number of unforeseen scenarios…
For Alibaba, developing platforms to assist operation and maintenance workers in these cases is now a given.
Entering the DataOps Phase
As Alibaba’s business has grown, its operation and maintenance capabilities have likewise grown in depth and precision. Through software engineering and data-based innovation, operation and maintenance tools must adapt to handle ultra-large-scale distributed cluster management and improve the stability, efficiency, and cost of the overall product. This presents tremendous challenges for operation and maintenance personnel and demands a high level of skill from them.
Simultaneously, the broader industry has also evolved toward a prevailing concept of AIOps. The field as a whole is pushing for greater awareness of these practices, driven by the idea that a powerful algorithm can replace the intelligence now afforded by human labor. Total automation is still an ambitious goal beyond today’s reality, much like driverless transit.
At Alibaba, the prevailing thought is that if an algorithm is the kernel, then its value depends on the amount of engineering devoted to implementing it. This essentially describes the thinking behind the DataOps stage, in which data figures in all operation and maintenance goals, and data-driven operation and maintenance has been effectively implemented.
The following image illustrates the aforementioned comparison to autonomous driving.