
Huawei deeply interprets DaVinci architecture: 3D Cube computing engine accelerates computing

via: 博客园     time: 2019/8/22 15:01:13

Author: A fan

IT House, August 22: Huawei has officially launched the Kirin 810 chip, which uses Huawei's self-developed Da Vinci architecture. Its AI score ranks among the top three on the AI Benchmark list published by ETH Zurich (the Swiss Federal Institute of Technology in Zurich).

Huawei also said in the article that on August 23 the Da Vinci-based AI chip Ascend 910 will be officially released for commercial use, and that the new-generation open-source AI computing framework MindSpore will be unveiled at the same time.

Huawei China today published an in-depth explainer on the Da Vinci architecture. The content of that explainer follows.


Origin: Why build the Da Vinci architecture?

Huawei predicts that by 2025 the number of smart terminals worldwide will reach 40 billion, the penetration rate of intelligent assistants will reach 90%, and the utilization rate of enterprise data will reach 86%. It is foreseeable that in the near future AI, as a general-purpose technology, will greatly increase productivity and change every organization and every industry. To let AI work across multiple platforms and scenarios, Huawei designed the Da Vinci computing architecture to provide powerful AI computing power under different size and power-consumption constraints.

First sight: the core strengths of the Da Vinci architecture

The Da Vinci architecture is a new computing architecture that Huawei developed in-house around the characteristics of AI computation. It offers high computing power and high energy efficiency, is flexible and tailorable, and is an important foundation for bringing intelligence to all things. Specifically, the Da Vinci architecture uses a 3D Cube to accelerate matrix operations, dramatically increasing AI computing power per unit of power consumption. Each AI Core can perform 4096 MAC operations in one clock cycle, an order-of-magnitude improvement over traditional CPUs and GPUs.


3D Cube

In addition, to improve the completeness of AI computation and the computational efficiency of different scenarios, the Da Vinci architecture also integrates several other computing units, such as vector, scalar, and hardware accelerator units. It supports computation at multiple precisions, meeting the data precision requirements of both training and inference, and so covers the full range of AI scenarios.

Digging deeper: the AI hard power of the Da Vinci architecture

Popular Science 1: What are the common types of AI operations?

Before looking at the technology in the Da Vinci architecture, let's first clarify a few data objects used in AI computation:

  • Scalar: consists of a single number
  • Vector: consists of a set of one-dimensional ordered numbers, each number identified by an index
  • Matrix: consists of a set of two-dimensional ordered numbers, each number identified by two indices
  • Tensor: consists of a set of n-dimensional ordered numbers, each number identified by n indices

The core of AI computation is matrix multiplication: a row of the left matrix is multiplied element by element with a column of the right matrix, and the sum of the products is written to the result matrix. Across scalar, vector, and matrix operations, the compute density increases in turn, which places higher demands on the hardware's AI computing power.

A typical neural network model is very computationally intensive, and 99% of the computation is matrix multiplication. In other words, improving the efficiency of matrix multiplication raises AI computing power the most. This is also the core of the Da Vinci architecture's design: increasing matrix multiplication throughput at minimal computational cost to achieve higher AI energy efficiency.
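The row-times-column rule described above can be sketched in a few lines of plain Python. This is only an illustration of the arithmetic, not Huawei's implementation; the function name `matmul` and the MAC counter are assumptions introduced here.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows and count the MACs used.

    Illustrative only: `matmul` and the `macs` counter are names chosen
    for this sketch, not part of any Huawei API.
    """
    rows, inner, cols = len(a), len(b), len(b[0])
    if any(len(row) != inner for row in a):
        raise ValueError("inner dimensions do not match")
    result = [[0] * cols for _ in range(rows)]
    macs = 0
    for i in range(rows):            # pick one row of the left matrix
        for j in range(cols):        # pick one column of the right matrix
            acc = 0
            for k in range(inner):   # multiply element pairs and accumulate
                acc += a[i][k] * b[k][j]
                macs += 1            # one multiply-accumulate (MAC)
            result[i][j] = acc
    return result, macs


product, macs = matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])
print(product)  # [[19, 22], [43, 50]]
print(macs)     # 8
```

Each inner-loop step is one multiply-accumulate, so multiplying two N*N matrices takes N*N*N MACs; that cubic count is what a dedicated matrix unit is meant to absorb.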

Popular Science 2: How the units divide the work, and how the Da Vinci Core achieves efficient AI computation

At HUAWEI CONNECT 2018, Huawei introduced the AI chip Ascend 310, which marked the debut of the Da Vinci architecture. The Ascend 310 is equivalent to the NPU in an AI chip.

The Da Vinci Core is only one part of the NPU, and the Da Vinci Core itself is subdivided into many units, including the core 3D Cube, the Vector computing unit, and the Scalar computing unit. Each is responsible for different computing tasks in a parallel computing model, and together they ensure efficient processing of AI computation.


The 3D Cube matrix multiplication unit: matrix multiplication is the core of AI computation, and this part of the work is done by the 3D Cube. The buffers L0A, L0B, and L0C store the input and output matrix data; they feed data to the Cube unit and hold its results.

Although the Cube delivers very high computing power, it can only perform matrix multiplication, and many other types of computation rely on the Vector computing unit. The Vector unit's instruction set is relatively rich, covering a wide range of basic computation types and many custom ones.

The Scalar unit is mainly responsible for the scalar operations of the AI Core. It can be regarded as a small CPU that handles loop control for the whole program, branch decisions, address and parameter calculation for Cube and Vector instructions, and basic arithmetic.
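The division of labor among the three units can be pictured with a small toy model. The sketch below is an assumption-laden illustration, not the actual AI Core programming model: the function names, the tile size constant, and the choice of ReLU as the vector operation are all hypothetical. Only the buffer names L0A, L0B, and L0C and the three roles come from the text.

```python
TILE = 16  # assumed tile edge, chosen to match the 16*16*16 Cube in this article


def cube_mac(l0a, l0b, l0c):
    """Cube role (model): l0c += l0a x l0b for one TILE*TILE*TILE block."""
    for i in range(TILE):
        for j in range(TILE):
            acc = l0c[i][j]
            for k in range(TILE):
                acc += l0a[i][k] * l0b[k][j]
            l0c[i][j] = acc


def vector_relu(row):
    """Vector role (model): an element-wise operation the Cube cannot do."""
    return [x if x > 0 else 0 for x in row]


def load_tile(m, r0, c0):
    """Copy the TILE*TILE block whose top-left corner is (r0, c0)."""
    return [row[c0:c0 + TILE] for row in m[r0:r0 + TILE]]


def matmul_relu(a, b):
    """Scalar role (model): loop control and tile address calculation."""
    n = len(a)
    if n % TILE:
        raise ValueError("this sketch needs square matrices with n a multiple of TILE")
    out = [[0] * n for _ in range(n)]
    cube_calls = 0
    for r in range(0, n, TILE):
        for c in range(0, n, TILE):
            l0c = [[0] * TILE for _ in range(TILE)]   # output accumulator
            for k in range(0, n, TILE):
                l0a = load_tile(a, r, k)              # left input tile
                l0b = load_tile(b, k, c)              # right input tile
                cube_mac(l0a, l0b, l0c)
                cube_calls += 1
            for i in range(TILE):                     # post-process with the vector role
                out[r + i][c:c + TILE] = vector_relu(l0c[i])
    return out, cube_calls


n = 32
identity = [[1 if i == j else 0 for j in range(n)] for i in range(n)]
b = [[i - j for j in range(n)] for i in range(n)]
out, calls = matmul_relu(identity, b)
print(out[3][1], out[1][3])  # 2 0
print(calls)                 # 8
```

In this model a 32*32 product needs 2*2*2 = 8 Cube invocations, with the scalar loops only computing tile addresses and the vector step handling the non-matrix work.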

Popular Science 3: What are the unique advantages of the 3D Cube computing approach?

Unlike the earlier scalar and vector computing modes, Huawei's Da Vinci architecture is built around a high-performance 3D Cube computing engine, which accelerates matrix operations, greatly increases AI computing power per unit area, and fully unlocks the computing potential of on-device AI. Take the multiplication A*B of two N*N matrices as an example: with N 1D MAC units, N^2 cycles are required; with one 2D array of N^2 MACs, N cycles are required; with one 3D Cube of edge N (N^3 MACs), only 1 cycle is required.
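The cycle counts in this example follow from dividing the N^3 multiply-accumulates of an N*N matrix product by the number of MAC units available. A minimal sketch of that arithmetic follows, assuming every MAC unit completes one operation per cycle and is always kept busy (idealized assumptions, not a statement about real hardware); the function name is hypothetical.

```python
def cycles_for_matmul(n, mac_units):
    """Idealized cycles to multiply two n*n matrices with `mac_units` MACs.

    Assumes one MAC per unit per cycle and perfect utilization.
    """
    total_macs = n ** 3
    return -(-total_macs // mac_units)  # ceiling division


n = 16
for name, units in [("N 1D MACs", n),
                    ("N^2 2D MAC array", n ** 2),
                    ("N^3 3D Cube", n ** 3)]:
    print(name, cycles_for_matmul(n, units))
# N 1D MACs 256
# N^2 2D MAC array 16
# N^3 3D Cube 1
```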


The number of computing units in the figure is only illustrative; actual designs are flexible.

The Da Vinci architecture greatly increases computing power. The 16*16*16 3D Cube significantly improves data utilization and shortens computing cycles, giving faster and stronger AI computation. For example, to perform the same 4096 operations, a 2D structure needs an array of 64 rows * 64 columns, while the 3D Cube needs only a 16*16*16 structure. The drawbacks of the 64*64 structure are long computing cycles, high latency, and low utilization.
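One way to see why the same 4096 MACs behave differently in the two layouts is a rough cycle model. The sketch below rests on assumptions that are not from the article: the 2D array is modeled as producing one rank-1 update of a 64*64 output tile per cycle, the Cube as finishing one 16*16*16 block per cycle, and data movement and pipelining are ignored. The function names are hypothetical.

```python
def ceil_div(x, y):
    return -(-x // y)


def array_2d(m, k, n, side=64):
    """Assumed 2D model: one rank-1 update of a side*side output tile per cycle."""
    tiles = ceil_div(m, side) * ceil_div(n, side)
    cycles = tiles * k
    utilization = (m * k * n) / (cycles * side * side)
    return cycles, utilization


def cube_3d(m, k, n, edge=16):
    """Assumed 3D model: one edge*edge*edge block per cycle."""
    cycles = ceil_div(m, edge) * ceil_div(k, edge) * ceil_div(n, edge)
    utilization = (m * k * n) / (cycles * edge ** 3)
    return cycles, utilization


assert 64 * 64 == 16 * 16 * 16 == 4096  # same MAC budget in both layouts

print(array_2d(16, 16, 16))  # (16, 0.0625)
print(cube_3d(16, 16, 16))   # (1, 1.0)
print(array_2d(64, 64, 64))  # (64, 1.0)
print(cube_3d(64, 64, 64))   # (64, 1.0)
```

Under these assumptions the two layouts match on a large 64*64*64 product, but on a small 16*16*16 product the 64*64 array takes 16 cycles at about 6% utilization while the Cube takes one cycle, which is consistent with the latency and utilization drawbacks described above.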

This characteristic of the Da Vinci architecture is also reflected in the Kirin 810. As the first mobile SoC with a Da Vinci architecture NPU, the Kirin 810 delivers powerful AI computing power, with optimal energy efficiency per unit area and industry-leading FP16 precision and INT8 quantization precision.

The Kirin 810 supports Huawei's self-developed intermediate operator format with an open IR, and the number of supported operators exceeds 240, a leading figure in the industry. More operators, support for open-source frameworks, and a more complete toolchain let developers quickly convert and integrate models built on different AI frameworks. This greatly improves the compatibility and ease of use of the Huawei HiAI mobile computing platform, raises developer efficiency, saves time and cost, and accelerates the rollout of more AI applications.

Foresight: the Da Vinci architecture unlocks unlimited possibilities for AI

Because it is flexible and scalable, the Da Vinci architecture can serve device-side, edge-side, and cloud scenarios. It can be used in scenarios ranging from as little as tens of milliwatts up to training workloads of hundreds of watts, providing optimal computing power across all scenarios.


Taking the Ascend chips as an example: Ascend-Nano can be used in IoT devices such as headset phones; Ascend-Tiny and Ascend-Lite handle AI processing on smartphones; on portable devices such as notebook computers, where more computing power is required, the Ascend 310 (Ascend-Mini) provides it; on edge servers, AI computation is performed by Multi-Ascend 310; and ultra-complex cloud data computation is handled by the Ascend 910 (Ascend-Max), with computing power of up to 256 TFLOPS@FP16. It is precisely the flexible, tailorable, and energy-efficient nature of the Da Vinci architecture that makes AI computation possible across all of these scenarios.

Choosing to develop a unified architecture was also a critical decision. Its advantage is clear: it greatly benefits developers. Because the Da Vinci architecture is uniform, developers need to develop and debug an operator only once to use it across cloud, edge, and device platforms, which greatly reduces migration cost. Not only is the development language unified, but the training and inference frameworks are unified as well. Developers can run heavy model training on local and cloud servers and then place lightweight inference work on mobile devices, with a consistent development experience throughout.


After breakthroughs in computing power and technology, AI will be widely used in smart cities, autonomous driving, smart new retail, robotics, industrial manufacturing, cloud computing AI services and other scenarios. In the future, AI will be applied to a wider range of areas and gradually cover all aspects of life.
