Ali Mei Guide: Recently, Alibaba officially released the open source address of Mars, a distributed scientific computing engine. Developers can access the source code on Github and participate in the development.
Mars breaks through the relational algebra-based computing model of existing large data computing engines, introduces distributed technology into the field of scientific computing/numerical computing, and greatly expands the scale and efficiency of scientific computing. It has been applied to Alibaba and its cloud customers'business and production scenarios.
Next, we will introduce Mars's design intention and technical framework in detail, hoping to share with you.
Scientific calculation, i.e. numerical calculation, refers to the application of computers to deal with mathematical calculation problems encountered in scientific research and engineering technology. For example, image processing, machine learning, in-depth learning and many other fields will be used in scientific computing. Many languages and libraries provide scientific computing tools. Among them, Numpy excels in its concise and easy-to-use grammar and powerful performance, and on this basis forms a huge technology stack (shown below).
Numpy's core concept, multidimensional arrays, is the basis of various upper-level tools. Multidimensional arrays are also called tensors. Compared with two-dimensional tables/matrices, tensors have more powerful expressive ability. Therefore, the popular deep learning framework is also widely based on tensor data structure.
With the boom of machine learning/deep learning sweeping in, the concept of tensor is becoming more and more well known, and the scale requirement of tensor for general computing is increasing day by day. But the reality is that such excellent scientific computing libraries as Numpy still remain in the era of stand-alone computers and can not break through the bottleneck of scale. Nowadays, the popular distributed computing engine is not born for scientific computing either. The mismatch of upper interfaces makes it difficult to write scientific computing tasks using traditional SQL/MapReduce. The execution engine itself does not optimize scientific computing, which makes the computing efficiency unsatisfactory.
Based on the current situation of scientific computing, the research and development team of MaxCompute, the Alibaba Unified Big Data Computing Platform, has gone through more than a year of research and development to break the boundaries of big data and scientific computing, complete the first version and open source.
Mars, a unified distributed computing framework based on tensors. Using Mars for scientific computing not only reduces thousands of lines of code from MapReduce to Mars, but also greatly improves performance. At present, Mars has realized the tensor part, that is, numpy distributed, and realized 70% of the common numpy interfaces. Subsequently, in the version of Mars 0.2, Pandas is being distributed, which will provide a fully compatible interface to build the whole ecosystem.
As a new generation of super-large-scale scientific computing engine, Mars has not only entered the era of distributed computing, but also made it possible for large data to be used efficiently in scientific computing.
Mars provides Numpy-compatible interface through tensor module. Users can transplant existing Numpy-based code into Mars by replacing import, and directly obtain tens of thousands of times larger scale than the original, while improving processing capacity by tens of times. Currently, Mars implements about 70% of the common Numpy interfaces.
Making Full Use of GPU Acceleration
In addition, Mars extends Numpy to take full advantage of the achievements of GPU in the field of scientific computing. When creating tensors, subsequent calculations can be performed on the GPU by specifying GPU = True. For example:
Mars also supports two-dimensional sparse matrices, which can be created by specifying sparse = True. Take eye interface as an example, it creates a unit diagonal matrix, which has value only on the diagonal line and is 0 at other locations, so we can store it sparsely.
Next, we introduce Mars's system design, so that you can understand how Mars makes scientific computing tasks automatically parallelized and has powerful performance.
Divide and rule:tile
Mars usually divides and conquers scientific computing tasks. Given a tensor, Mars automatically divides it into small Chunks on each dimension for processing separately. For all operators implemented by Mars, automatic task segmentation and parallelism are supported. This automatic segmentation process is called tile in Mars.
For example, given a 1000 * 2000 tensor, if the chunk size on each dimension is 500, then the tensor will be tiled to 2 * 4 for a total of 8 chunks. For subsequent operators, such as additions and summations, tiles are also performed automatically. The tile process of a tensor operation is shown in the following figure.
Delayed execution and Fusion optimization
At present, the code written by Mars needs to explicitly call execute trigger, which is based on the delayed execution mechanism of Mars. Users do not need any actual data calculation when writing intermediate code. The advantage of this is that more optimization can be done to the intermediate process, so that the whole task can be better executed. At present, Mars mainly uses fusion optimization, which combines multiple operations into one execution.
Various scheduling modes
Mars supports multiple schedules:
Multithread mode: Mars can use multithreading to schedule and execute Chunk-level graphs locally. For Numpy, most operators are executed by single thread. Using this scheduling method alone, Mars can obtain tiled execution graph on a single machine, break through the memory limitation of Numpy, and make full use of all CPU/GPU resources on a single machine to achieve performance several times faster than Numpy.
Single-machine cluster mode: Mars can start the whole distributed runtime on a single machine and use multi-process to accelerate task execution; this mode is suitable for simulation of development and debugging oriented to distributed environment.
Distributed: Mars can start one or more schedulers, as well as multiple workers, schedulers will schedule Chunk-level operators to each worker to execute.
Below is the Mars distributed execution architecture:
When Mars is executed in a distributed way, multiple schedulers and workers are started. In the figure, there are three schedulers and five workers. These schedulers form a consistent hash loop. The user creates a session explicitly or implicitly on the client, assigns Session Actor to one of the schedulers according to the consistency hash, and then submits a tensor calculation through execute. The GraphActor is created to manage the execution of the tensor, which is tiled into a chunk-level graph in GraphActor. Assuming there are three chunks, three OperandActors will be created on the scheduler to correspond to each other. These OperandActors are submitted to each worker for execution based on whether their dependencies are completed and whether the cluster resources are sufficient. When all OperandActors are completed, the GraphActor task is notified, and then the client can pull the data to display or draw.
Inward and outward expansion
The flexible tiled execution diagram of Mars, combined with various scheduling modes, can make the code written by the same Mars scale in and out at will. Scaling inward to a single machine can use multi-core to perform scientific computing tasks in parallel; scaling outward to a distributed cluster can support thousands of worker scale to complete tasks that a single machine can not accomplish in any way.
In a real scenario, we encounter the computational requirement of huge matrix multiplication. We need to complete the multiplication of two matrices, each of which is 100 billion yuan, and the size is about 2.25T. Through five lines of code, Mars uses 1600 CU (200 workers, 32 G memory per worker) to complete the calculation in two and a half hours. Prior to this, similar computing could only be done by using MapReduce to write more than 1,000 lines of code simulation. It took 9,000 CU and 10 hours to complete the same task.
Let's look at two more comparisons. The following figure is to multiply each element of the 3.6 billion data matrix by two over and over again. The Red Cross Represents Numpy's computing time. The green solid line is Mars'computing time, and the blue dotted line is the theoretical computing time. You can see that single Mars is several times faster than Numpy, and with the increase of Worker, almost linear acceleration ratio can be obtained.
The following figure is to further expand the scale of calculation, expand the data to 14.4 billion yuan, add these elements, multiply by two, and then sum up. At this time, the input data has 115G, single Numpy can not complete the operation, Mars can still complete the operation, and with the increase of machines can get a good acceleration ratio.