To help researchers and developers in academia and beyond, Alibaba has now published its Cluster Data V2018 for all to explore.
In IT articles about internet applications, you will often find words such as “large scale” and “mass requests”. These applications all run in large-scale data centers, and readers generally have many questions about those environments. For example, what is the operating status of each machine in the data center? What kinds of applications are running? What are the characteristics of these applications? With the exception of a few senior experts, it is rare for students and corporate researchers to understand these details.
Today, Alibaba shares a real dataset from its computer clusters: Alibaba Cluster Data V2018 (linked at the end of this article). This dataset provides a detailed record of the servers and running tasks in Alibaba’s production cluster. With the release of this data, Alibaba hopes to engage its peers in academia and the wider industry and to promote further development within the industry.
Drawing on insights from Lin Shi, a technical expert at the Ali System Software Division, this article offers an in-depth introduction to this unique dataset, as well as findings from academic research with the previous year’s data.
Releasing the Dataset: A Resource for Exploration
In 2015, Alibaba began trying to deploy latency-insensitive batch computing tasks and latency-sensitive online services to the same batch of machines in its data center, with the goals of putting redundant resources to full use and improving the overall utilization of these machines.
After more than three years of trial demonstrations, structural adjustments, and optimization of resource isolation, this program has now moved into the mass production phase. Average resource utilization of clusters has since increased from 10% to 45% using colocation technology (i.e., the technology that allows online services and batch workloads to run on the same machine). In addition, through various optimization methods, more tasks can now be run in the data center, and the average cost per 10,000 transactions during the 11.11 Global Shopping Festival has been reduced by 17%.
Though some readers will understand this achievement and how it was accomplished, many are likely to have questions such as what exactly the computer cluster looks like after optimization, or what colocation means and why it matters.
In releasing the Alibaba Cluster Data V2018 dataset, Alibaba hopes to help interested students and researchers answer those questions and better understand large-scale data centers. Using the dataset, individuals can learn more about how Alibaba has increased resource utilization to 45% via colocation; how many tasks Alibaba runs each day; and what the resource requirements of the business are. How individuals use this dataset depends entirely on their needs.
Making Use of the Data
At 50 GB compressed and 270 GB uncompressed, Alibaba Cluster Data V2018 is a large dataset with many different uses. Across six files, it covers 4000 servers, the online application containers running on them, and the operation of offline computing tasks for up to eight days.
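Because the files are far too large to load into memory at once, one practical approach is to stream them row by row while they are still compressed. The following is a minimal sketch of that idea in Python. It assumes the data is distributed as gzip-compressed, header-less CSV, and the three-column layout in the sample (machine ID, timestamp, CPU utilization) is hypothetical rather than the dataset's documented schema.

```python
import csv
import gzip
import io


def mean_of_column(gz_source, column_index):
    """Stream a gzip-compressed, header-less CSV and average one numeric column.

    gz_source may be a file path or a binary file object. Rows are read one
    at a time, so memory use stays flat regardless of file size. Rows whose
    value is missing or non-numeric are skipped.
    """
    total = 0.0
    count = 0
    with gzip.open(gz_source, mode="rt", newline="") as text:
        for row in csv.reader(text):
            if column_index >= len(row):
                continue
            try:
                value = float(row[column_index])
            except ValueError:
                continue
            total += value
            count += 1
    return total / count if count else None


# Tiny in-memory stand-in for a usage file (hypothetical columns:
# machine_id, timestamp, cpu_util_percent). One row has a missing value.
sample = "m_1,0,20\nm_1,10,40\nm_2,0,\nm_2,10,60\n"
compressed = io.BytesIO(gzip.compress(sample.encode("utf-8")))
print(mean_of_column(compressed, 2))
```

The same function accepts a path to a file on disk in place of the in-memory buffer, which avoids ever writing the uncompressed data out.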
With Alibaba Cluster Data V2018, individuals can do the following:
· Understand the servers and the characteristics of running tasks in today’s advanced data centers.
· Experiment with various algorithms for task management and cluster optimization in scheduling and operations, and write papers about them.
· Use the data to practice data analysis and to uncover techniques and methodologies that have not yet been explored.
The following are examples of questions the data can be used to answer:
· E-commerce businesses face different pressures during daytime and nighttime. How can we improve overall resource utilization with these peaks and valleys?
· How many dependencies does Alibaba’s longest DAG have?
· How long does a container typically exist?
· What is the typical lifetime of a computational task?
· Multiple instances of a task are theoretically similar to each other, but do they all run at the same time?
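Questions about lifetimes and instance timing reduce to simple aggregations once start and end times are available. The sketch below illustrates this with a handful of made-up records; the field names, units, and values are assumptions chosen for illustration and do not come from the dataset.

```python
from collections import defaultdict
from statistics import median

# Hypothetical instance records: (task_id, start_second, end_second).
# Illustrative values only, not taken from the actual trace.
records = [
    ("task_a", 0, 30),
    ("task_a", 5, 45),
    ("task_a", 60, 90),
    ("task_b", 10, 20),
    ("task_b", 10, 25),
]


def median_lifetime(rows):
    """Median of (end - start) across all records."""
    return median(end - start for _, start, end in rows)


def start_spread_by_task(rows):
    """For each task, the gap between its earliest and latest instance start.

    A spread of 0 means all instances of that task started together.
    """
    starts = defaultdict(list)
    for task_id, start, _ in rows:
        starts[task_id].append(start)
    return {task: max(s) - min(s) for task, s in starts.items()}


print(median_lifetime(records))
print(start_spread_by_task(records))
```

In this toy data, the instances of task_b start together while those of task_a are spread over a minute, which is the kind of pattern the last question above asks about.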
In fact, academics have already used Alibaba cluster data for a range of analysis and research. In 2017 and 2018, a number of prominent academic papers were published based on the first wave of data shared in Alibaba Cluster Data V2017. The following are examples of academic uses of the dataset, many of which have been featured at academic conferences such as OSDI.
LegoOS: A Disseminated, Distributed OS for Hardware Resource Disaggregation — Yizhou Shan, Yutong Huang, Yilun Chen, and Yiying Zhang; Purdue University. (OSDI’18 Best Paper Award recipient)
Imbalance in the Cloud: An Analysis on Alibaba Cluster Trace — Chengzhi Lu et al. (BIGDATA 2017)
Looking Forward: Updates in V2018
There are two major differences that distinguish the new V2018 dataset from the previous year’s V2017 dataset featured in the above research.
First, the V2018 dataset includes the DAG information of some of our production batch workloads. Offline computing jobs, like those commonly run on MapReduce, Hadoop, Spark, and Flink, are arranged in the form of a Directed Acyclic Graph (DAG), which shows parallelism, dependencies between tasks, and so on. As such, this dataset is now the largest available DAG dataset from an actual production environment.
The following figure illustrates an example of a DAG:
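To make the idea of dependencies concrete, a job's DAG can be represented as a mapping from each task to the tasks it depends on, and the longest dependency chain can then be found by visiting tasks in topological order. The following Python sketch shows one way to do this; the job and its task names are hypothetical, and how the dataset itself encodes dependencies is not assumed here.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical job: each task maps to the list of tasks it depends on.
job = {
    "map_1": [],
    "map_2": [],
    "join_3": ["map_1", "map_2"],
    "reduce_4": ["join_3"],
    "reduce_5": ["map_2"],
}


def longest_chain(dependencies):
    """Number of tasks on the longest dependency chain of a DAG.

    Raises graphlib.CycleError if the graph contains a cycle.
    """
    depth = {}
    # static_order() yields each task only after all of its dependencies.
    for task in TopologicalSorter(dependencies).static_order():
        parents = dependencies.get(task, [])
        depth[task] = 1 + max((depth[p] for p in parents), default=0)
    return max(depth.values(), default=0)


def dependency_count(dependencies):
    """Total number of dependency edges in the job."""
    return sum(len(parents) for parents in dependencies.values())


print(longest_chain(job))
print(dependency_count(job))
```

Tasks that share no chain, such as map_1 and map_2 in this example, are free to run in parallel, which is the parallelism a DAG makes visible.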
Second, the V2018 dataset is considerably larger in scale than the previous set, which contained data for approximately 1300 machines over a 24-hour period. The new version includes data for 4000 machines over a period of 8 days.
With so much having been accomplished using the V2017 dataset, Alibaba looks forward to the developments to come as researchers begin work with the new dataset in 2019.
Alibaba Cluster Data V2018