On August 22, 2019, Tencent's first AI open source project, Angel, was officially released in version 3.0. Angel 3.0 attempts to create a full-stack machine learning platform that covers all phases of machine learning: feature engineering, model training, hyper-parameter tuning, and model services.
Angel is a distributed machine learning platform based on parametric server architecture of Tencent Open Source. It is dedicated to solving sparse data model training and large-scale graph data analysis. It is jointly developed by Tencent and Peking University, taking into account the high availability and academics of the industry. The innovation of the world. Currently, it is the Linux Deep Learning Foundation Incubation Project. Compared with similar platforms such as TensorFlow, PyTorch and Spark, she has the following characteristics:
Angel is a high-performance distributed machine learning platform based on the parameter server (PS) concept. It has a flexible and customizable function PS Function (PSF) that can push some calculations down to the PS side. The PS's well-developed scale-out capabilities allow Angel to efficiently process billions of models.
Angel has a math library specifically optimized for handling high-dimensional sparse features with performance up to 10 times better than the breeze math library. Angel's PS and built-in algorithm kernel are built on top of the math library.
Angel is good at recommending models and graphs related to network models (such as social network analysis). The following figure is a comparison of Angel and several industry mainstream platforms in terms of sparse data, model dimensions, performance, depth models, and ecological construction. Tensorflow and PyTouch have significant advantages in deep learning and ecological construction, but the processing power in sparse data and high-dimensional models is relatively insufficient, and Angel is just complementary to them. The 3.0 version of PyTorch On Angel tries to PyTorch and Angel. The advantages are combined.
Angel's comparison with mainstream platforms in the industry
Since its launch in Tencent in early 2016, Angel has been used in WeChat payment, QQ, Tencent video, Tencent social advertising and user portrait mining.
In June 2017, Angel made a low-profile open source on GitHub. Open source for two weeks, this project has been harvested on GitHub 183 Watch, 1693 Star, 389 Fork, and attracted many industry engineers to pay attention and contribution.
In September 2018, Angel 2.0 was released, supporting the training of 100 billion model dimensions, and the algorithm library was also richer. The deep learning algorithm and graph algorithm were introduced for the first time. In the same year, Angel joined the Deep Learning Foundation of Linux (now renamed the LF AI Foundation), and in conjunction with the well-established operations of the Foundation, the fully upgraded Angel 2.0 continues to interact with the international open source community and is committed to making machines Learning technology is easier to get started with research and application goals.
Let's take a look at the new features that are worth paying attention to in the Angel 3.0 milestone.
Angel System ArchitectureThe Angel 3.0 system architecture is shown below:
Angel 3.0 architecture
Angel's self-developed high-performance math library is the foundation of the entire system. Angel's PS function and built-in algorithm kernel are based on this math library.
Angel PS provides efficient, stable and flexible parameter storage and exchange services. In version 3.0, we extended the Angel PS functionality so that it can store any type of object. A typical example is the use of Angel PS to store a large number of complex objects during the implementation of the graph algorithm.
MLcore is Angel's self-developed set of algorithmic kernels that support automatic derivation and can be defined and run using JSON configuration files. In addition, in version 3.0, Angel also integrated PyTorch as a calculation engine. Above the Compute Engine layer are computational frameworks that can be thought of as a container for computational engines and currently support three computational frameworks: native Angel, Spark On Angel (SONA), and PyTorch On Angel (PyTONA), which can make Spark Switch seamlessly to the Angel platform with PyTorch users. The top level is two common components: AutoML and model services.
Angel 3.0 new features
An overview of Angel 3.0 (red for new features, white for existing but continuous improvements)
The image above provides a holistic view of the Angel 3.0 features. Angel 3.0 attempts to build a full-stack machine learning platform with features that cover all phases of machine learning: feature engineering, model training, hyper-parameter tuning, and model services.
Angel's feature engineering module is based on Spark development and enhances Spark's feature selection capabilities, while feature feature intersection and re-indexing for automatic feature generation. These components can be seamlessly integrated into Spark's pipeline. In order to make the whole system more intelligent, Angel 3.0 has added the function of hyperparameter adjustment. Currently, it supports 3 algorithms: random search, grid search and Bayesian optimization. In terms of model services, Angel 3.0 provides a cross-platform component. Angel Serving, Angel Serving not only meets Angel's own needs, but also provides model services for other platforms.
In terms of ecology, Angel also tried to assign the parameter server (PS, Parameter Server) capability to other computing platforms, and has completed the construction of two platforms, Spark On Angel and PyTorch On Angel. Each of these platforms has its advantages and focus. Spark On Angel uses Angel's built-in algorithmic core, which is responsible for machine learning algorithms and basemap algorithms in common recommended areas. PyTorch On Angel uses PyTorch as the core of computing, and is mainly responsible for recommending domain deep learning algorithms and graph deep learning algorithms.
Automatic feature engineering
Feature engineering, such as feature crossing and selection, is important for machine learning applications in industry. Spark provides some feature selection operators, but there are still some limitations. Angel provides more feature selection operators based on Spark:
Most online recommendation systems often choose linear algorithms, such as logistic regression as a machine learning model, but logistic regression requires complex feature engineering to achieve higher precision, which makes automatic feature synthesis essential. However, existing automated high-order feature synthesis methods bring dimensional disasters. To solve this problem, Angel implemented a method of iteratively generating high-order synthetic features. Each iteration consists of two phases:
Here are the iteration steps:
Finally, the composite features are stitched together with the original features.
Automatic feature engineering process
As shown in the figure above, this feature synthesis method linearly increases the number of features, avoiding dimensional disasters. Experiments on the Higgs data set show that the synthesized features can effectively improve the model accuracy (as shown in Table 1).
Table 1 Feature synthesis effect
Spark On Angel (SONA)
In Angel 3.0, we made a major optimization of Spark On Angel, adding these new features:
Spark On Angel architecture
In addition to these large features, we are continuing to improve Spark On Angel's algorithm library: new algorithms such as Deep & Cross Network (DCN) and Attention Factorization Machines (AFM); and existing algorithms A lot of optimizations have been done, such as refactoring the LINE and K-Core algorithms, and the performance and stability of the refactored algorithms have been greatly improved.
As can be seen from the figure below, the algorithm in Spark On Angel is significantly different from the algorithm in Spark. For example, the algorithm based on Spark On Angel is mainly for the recommendation and graph fields, but the algorithm in Spark is more general.
Comparison between Spark and Spark On Angel algorithms
Spark On Angel algorithm example
The above figure provides an example of a distributed algorithm based on Spark On Angel, which mainly includes the following steps:
After the training is completed, Spark On Angel will showcase a variety of model metrics such as accuracy, ROC curve and AUC. The user can save the trained model for next use.
Spark On Angel and TensorFlow performance comparison
We compared the performance of Spark On Angel and TensorFlow using the same resources and datasets on two popular recommendation algorithms, Deep & Wide and DeepFM. As shown in the figure above, Spark On Angel is 3 times faster than TensorFlow on the Deep & Wide algorithm, while TensorFlow runs slightly faster on the DeepFM algorithm.
PyTorch On Angel (PyTONA)
PyTorch On Angel is a new feature in Angel 3.0 that is designed to solve large-scale graph representation learning and deep learning model training.
In the past few years, the graph convolutional neural network (GNN) has developed rapidly, a series of research papers and related algorithms have emerged: such as GCN, GraphSAGE and GAT, research and test results show that they can learn more than traditional graph representation. Good extraction of graph features. Tencent has a large social network (QQ and WeChat), and has a large demand for analysis of graph data. The graph indicates that learning is the basis of these analyses. Therefore, Tencent has a strong demand for GNN, which is why we develop PyTorch On. One of the main reasons for Angel.
The representation of large-scale graphs faces two major challenges: the first challenge comes from the storage and access of hyperscale graph structures, which requires the system not only to survive but also to provide efficient access interfaces, such as the need to provide efficient The interface to access the two-hop neighbor of any node; the second challenge comes from the GNN calculation process, which requires an efficient auto-derivation module.
By analyzing Angel's own situation and the existing systems in the industry, we have the following conclusions:
In order to combine the advantages of both, we developed the PyTorch On Angel platform based on Angel PS. The basic idea is to use Angel PS to store large models, and use Spark as a distributed scheduling platform for PyTorch, which is called in Spark's Executor. PyTorch to complete the calculation.The architecture of PyTorch On Angel is shown below:
PyTorch On Angel System Architecture
PyTorch On Angel has 3 main components:
Of course, these details are encapsulated, and algorithm developers and users don't need to know. Developing a new algorithm on the PyTorch On Angel platform requires only a focus on the algorithm logic, which is not much different from developing a stand-alone PyTorch algorithm. An example of the implementation of a 2-layer GCN algorithm is given below:
Example of implementing GCN on PyTorch On Angel
After the algorithm is developed, save the code as a pt file and submit the pt file to the PyTorch On Angel platform for distributed training.
We have implemented many algorithms on PyTorch On Angel: including algorithms commonly recommended in the recommended fields (FM, DeepFM, Wide & Deep, xDeepFM, AttentionFM, DCN and PNN, etc.) and GNN algorithms (GCN and GraphSAGE). In subsequent iterations of the version, we will further enrich PyTorch On Angel's algorithm library.
Combining the advantages of PyTorch and Angel, PyTorch On Angel has great advantages in algorithm performance: the performance of the deep learning algorithm common in the recommended field can reach more than 4 times that of TensorFlow; for GNN algorithm, the performance is far better than the current one. The same type of platform open source in the industry (specific performance data will be published in the open source community). The following figure is a comparison test on the public dataset criteo kaggle2014 (45 million training samples, 1 million features):
PyTorch On Angel and TensorFlow performance comparison test
In addition to the performance advantages, PyTorch On Angel has a big advantage is the ease of use. As shown in the figure "PyTorch On Angel System Architecture": PyTorch runs in Spark's Executor, which enables seamless interfacing of Spark graph data preprocessing and PyTorch model training, and completes the entire calculation process in one program.
Automatic hyperparameter adjustment
There are two ways to adjust the traditional hyperparameters (as shown below):
Grid search and random search
Bayesian optimization, unlike traditional model-free methods, uses a less expensive surrogate function to approximate the original objective function. In Bayesian optimization, the agent function generates the probability mean and variance of the hyperparameter combination. The availability function will then evaluate the expected loss or improvement of the hyperparameter combination. Such a probabilistic interpretation method enables Bayesian optimization to find a better solution of the objective function using much less overhead.
Angel 3.0 includes two traditional methods and Bayesian algorithm optimization. For Bayesian optimization, Angel implements the following features:
Since the computational overhead of each evaluation objective function may be large, if it is observed that the candidate hyperparameter combination does not perform well in the first few iterations, these candidate hyperparameter combinations can be stopped early. This early stop strategy was implemented in the Angel 3.0 release.
Table 2 is an experiment in the logistic regression algorithm. The adjusted hyperparameters are the learning speed and the learning speed decay rate. The results show that the Bayesian optimization performance is better than the random search and the grid search, while the random search results are slightly better than the grid. search for.
Table 2 Comparison of the effects of different hyperparameter automatic condition methods
To meet the needs of efficient model services in a production environment, we implemented the Angel Serving subsystem in Angel 3.0, a scalable, high-performance machine learning model service system that is a full-stack machine learning platform. Angel's upper service portal allows the Angel Eco to form a closed loop. The figure below shows the architectural design of Angel Serving.
Angel Serving Architecture
The main features of Angel Serving include:
Table 3 Comparison of Angel Serving and TensorFlow Serving Performance
Table 3 shows the comparison of Angel Serving and TensorFlow Serving performance. We used a DeepFM model with 1 million features to send 100,000 prediction requests to the service. The total time spent on Angel Serving and TensorFlow Serving is 56 seconds and 59 seconds respectively. The average response time for both service systems is 2 milliseconds. Angel Serving's QPS is 1,900, while TensorFlow Serving's QPS is 1,800. The above results show that Angel Serving is comparable or even better than TensorFlow Serving.
Angel 3.0 supports Kubernetes so it can run on the cloud.
As shown in the chart below, in the past 12 months, Angel's number of tasks within Tencent has grown significantly, with an increase of 150%. It is worth mentioning that Spark On Angel's number of tasks has increased by 10 times. In order to make Spark On Angel easier to use, the 3.0 version has greatly upgraded Spark On Angel. Within Tencent, Angel's businesses include Tencent Video, Tencent News and WeChat.
Tencent internal Angel tasks
Angel officially maintains a QQ group to communicate with external developers. Statistics on group users indicate:
Angel open source user
Angel open source
Angel's statistics on GitHub and papers by Angel
Since the open source in June 2017, Angel has received more attention. As of now, Angel has more than 4200 Stars on GitHub and more than 1000 Fork. The Angel project currently has a total of 38 code contributors, and the other includes an 8-bit committer, which submitted more than 2000 commits in total. Tencent's total number of projects on GitHub has exceeded 80, covering AI, cloud computing, security and other fields, with a total of more than 230,000 Stars.
From 1.0 to 3.0, Angel has undergone tremendous changes, from a single model training platform to a machine learning process, including its own general-purpose computing platform, with more than 500,000 lines of code. For subsequent maintenance and ease of use, Angel is split into 8 sub-projects and placed in the Angel-ML directory (https://github.com/Angel-ML): angel, PyTorch On Angel, sona (Spark On Angel) , serving, automl, mlcore, math2 and format, these sub-projects are described in detail above.
Tencent short video recommendation
Short video recommendation data processing flow
The above picture shows a use case of the Tencent short video department. The user's video playback log and context information are forwarded to Kafka in real time, and the streaming data engine Storm subscribes to Kafka's data. Storm is a real-time feature generator that takes user portraits and video information from an offline key-value store and stitches the two together to generate features. The generated features are transmitted to the online training system to update the online model; at the same time, these features are also dumped to HDFS as input for offline training. The offline model is usually used to initialize the online training system. When an abnormality occurs, the offline model can also be used to reset the online system.
The recommended algorithm used in this case is FM, with 2.4 billion training samples and a feature dimension of 63611. It takes more than 10 hours to train on Spark and 1 hour after applying Angel.
Financial anti-fraud data processing
Financial fraud detection is a common case of large-scale graph learning. Its network data is heterogeneous and contains several different types of edges:
Financial scammers often share devices and Wi-Fi to generate communities by extending edge relationships. The fast unfolding algorithm on Angel can effectively discover these communities. Downstream fraud risk models can use the user portraits and network characteristics of these communities as input to learn and push to anti-fraud strategies. The graph data contains 1.5 billion nodes and 20 billion edges, and the implementation based on Spark GraphX takes 20 hours, while Angel takes only 5 hours.