Imagination Technologies, which once dominated the mobile GPU IP market, still holds 36 percent of that market and 43 percent of the automotive GPU IP market. Its recent release of a series of new products is not only a demonstration of strength but also enough to draw renewed attention from peers to this old rival.
On November 13, Imagination released the IMG Series4, its latest, third-generation neural network accelerator (NNA), a product that took two years to develop. Its new multi-core architecture can deliver 600 TOPS (trillion operations per second) or even higher performance, aimed mainly at advanced driver-assistance systems (ADAS) and autonomous driving applications.
With Imagination, long known for low-power products, now introducing a high-performance AI accelerator, will it shake NVIDIA's lead in the autonomous-driving chip market?
An AI accelerator two years in the making
Imagination launched its first-generation neural network accelerator (NNA), the PowerVR 2NX, in 2017, the year AI took off, with single-core performance of 1 to 4.1 TOPS. In 2018 it released the PowerVR 3NX, with single-core performance ranging from 0.6 to 10 TOPS and multi-core configurations scaling from 20 to 160 TOPS.
Two years later, Imagination has launched its third-generation NNA, the Series4. Single-core performance is further improved: each core delivers 12.5 TOPS at less than one watt. Compared with the previous two generations, the new product line emphasizes a new multi-core architecture that supports flexible allocation and synchronization of workloads across cores to achieve higher performance.
Gilberto Rodriguez, director of product management at Imagination Technologies, said, "Our software provides fine-grained control and increases flexibility by batching, splitting, and scheduling multiple workloads, and can be used on any number of cores. Series4 can be configured in clusters of 2, 4, 6, or 8 cores. An 8-core cluster delivers 100 TOPS of compute, and a solution with six 8-core clusters delivers 600 TOPS."
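The scaling arithmetic in the quote can be sketched as a quick check (the function name and the ideal linear-scaling assumption are ours, not Imagination's):

```python
# Back-of-the-envelope check of the Series4 scaling figures quoted above.
# Assumes per-core performance (12.5 TOPS) scales linearly across cores
# and clusters, as the article describes.

PER_CORE_TOPS = 12.5

def cluster_tops(cores_per_cluster: int, clusters: int = 1) -> float:
    """Total TOPS for a configuration, assuming ideal linear scaling."""
    if cores_per_cluster not in (2, 4, 6, 8):
        raise ValueError("Series4 clusters come in 2, 4, 6, or 8 cores")
    return PER_CORE_TOPS * cores_per_cluster * clusters

print(cluster_tops(8))              # one 8-core cluster  -> 100.0 TOPS
print(cluster_tops(8, clusters=6))  # six 8-core clusters -> 600.0 TOPS
```

Real-world scaling is rarely perfectly linear, but the quoted 100 and 600 TOPS figures are exactly the linear products of the per-core number.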
The Series4 NNA delivers more than 20 times the performance of an embedded GPU and 1,000 times that of an embedded CPU.
As for why Imagination is launching such a high-performance AI accelerator, Gilberto Rodriguez said, "ADAS and autonomous driving place high compute demands on chips. L2 features such as driver monitoring or voice/gesture control require around 10 TOPS, L3-L4 autonomous driving requires 50-100 TOPS, and L5 requires more than 500 TOPS."
"Although there are AI chips on the market that meet the needs of autonomous driving, their power consumption is not ideal. We therefore spent two years understanding and evaluating customer needs, and building on our first two generations of low-power products, we launched the high-performance, low-power Series4. We took autonomous driving as the main market, but the product can also be applied to data centers and desktop GPUs," said Andrew Grant, senior director of vision and artificial intelligence at Imagination Technologies.
How does the Series4 balance low power consumption with 600 TOPS of performance?
It should be pointed out that the headline figures of 100 TOPS per cluster, power efficiency above 30 TOPS per watt, and performance density above 12 TOPS/mm^2 assume implementation on a 5nm process node. Gilberto Rodriguez also mentioned that to reach higher compute with multiple clusters, Imagination provides a multi-cluster collaboration mechanism, but customers still need to do some design work at the application layer.
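Taken together, those ratios bound what a single 100 TOPS cluster would cost in power and silicon area. A small illustrative calculation (variable names and the treatment of the figures as exact bounds are ours):

```python
# What the quoted ratios imply for one 100 TOPS Series4 cluster on 5nm.
# Illustrative arithmetic only; the article gives ">30 TOPS/W" and
# ">12 TOPS/mm^2", treated here as exact for the bound.

cluster_tops = 100.0
tops_per_watt = 30.0
tops_per_mm2 = 12.0

max_power_w = cluster_tops / tops_per_watt   # upper bound on cluster power
max_area_mm2 = cluster_tops / tops_per_mm2   # upper bound on cluster area

print(f"power <= {max_power_w:.2f} W")       # power <= 3.33 W
print(f"area  <= {max_area_mm2:.2f} mm^2")   # area  <= 8.33 mm^2
```

In other words, the quoted efficiency figures imply a 100 TOPS cluster in roughly 3 watts and under 10 mm^2, which is the crux of the low-power claim.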
The scalability of the flexible multi-core architecture is what gives the Series4 its high performance, but for high-performance chips, and especially AI chips, power control is just as important. AI chips process huge volumes of data, and moving that data consumes far more power than processing it. A high-performance AI chip must therefore find ways to minimize data movement, reduce latency, and save bandwidth.
To reduce latency, Imagination groups cores into clusters of two, four, six, or eight. All cores in a cluster can cooperate to process one task in parallel, cutting processing delay and shortening response time. Of course, the cores need not always execute a batched task together: each core can also run a different network independently.
Increasing the number of cores improves performance and reduces latency
Different cores can also operate independently
The bigger highlight of the Series4 is its bandwidth-saving Imagination Tensor Tiling (ITT) technology, a patent-pending technique new to the 4 Series. Tensor tiling exploits the locality of data dependencies to keep intermediate data in on-chip memory, minimizing transfers to external memory and reducing bandwidth by up to 90 percent compared with the previous generation.
Specifically, multiple layers of a neural network run in the accelerator's hardware pipeline as fused kernels, and the feature maps passed between fused kernels normally have to be exchanged through external memory. Tiling makes full use of tightly coupled SRAM to fuse more layers; once more layers are fused, fewer feature maps need to cross the external-memory boundary, improving efficiency and saving bandwidth.
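A toy model makes the fusion argument concrete: external traffic is driven by how many inter-layer feature maps must round-trip through DRAM, and fusing consecutive layers removes the maps inside each fused group. The feature-map sizes below are hypothetical, not taken from any real network:

```python
# Toy model of why fusing layers saves external-memory bandwidth.
# Sizes (MB) of the feature maps between successive layers of a
# hypothetical network; the numbers and model are illustrative only.

feature_maps_mb = [8.0, 8.0, 4.0, 4.0, 2.0, 2.0, 1.0]

def external_traffic(fuse_group: int) -> float:
    """External-memory traffic (MB) when every `fuse_group` consecutive
    layers are fused: only feature maps at group boundaries are written
    out and read back (counted twice: one write + one read)."""
    boundary = feature_maps_mb[fuse_group - 1::fuse_group]
    return 2 * sum(boundary)

print(external_traffic(1))  # no fusion, every map hits DRAM -> 58.0
print(external_traffic(4))  # fuse groups of 4 layers        -> 8.0
```

Even in this crude model, fusing four layers at a time cuts external traffic by roughly 86 percent, which is the same order as the "up to 90%" figure Imagination quotes.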
The batching and splitting mentioned alongside tensor tiling also deserve explanation. Batching distributes a large number of small network tasks across the independent NNA cores, one per core, improving parallel throughput. Splitting divides a single task along multiple dimensions so that all NNA cores perform one inference together, reducing inference latency; ideally, the throughput of this cooperative parallel processing matches that of independent concurrent processing. Splitting is well suited to the layers of a large network.
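The trade-off between the two modes can be sketched in a few lines (core count and timings are hypothetical, and ideal linear scaling is assumed as above):

```python
# Sketch of the two multi-core scheduling modes described above:
# "batching" runs independent small inferences, one per core, while
# "splitting" divides one inference across all cores.
# The 8-core count matches a Series4 cluster; timings are made up.

CORES = 8

def batch_latency(task_ms: float, n_tasks: int) -> float:
    """Batching: tasks run concurrently, CORES at a time.
    Throughput scales with cores, but each task still takes task_ms."""
    waves = -(-n_tasks // CORES)  # ceiling division
    return waves * task_ms

def split_latency(task_ms: float) -> float:
    """Splitting: one task divided across all cores; latency drops
    roughly linearly under the ideal-scaling assumption."""
    return task_ms / CORES

print(batch_latency(16.0, 8))  # 8 small tasks, one wave -> 16.0 ms
print(split_latency(16.0))     # one task split 8 ways   -> 2.0 ms
```

Both modes finish the same total work per unit time under ideal scaling; the difference is that splitting minimizes the latency of each individual inference, which is what matters for real-time driving decisions.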
Of course, tensor tiling splits are generated by the compiler Imagination provides, so developers do not have to perform them manually. The NNA's performance-analysis tools also help schedule and allocate AI tasks more effectively.
Can tensor tiling reduce data movement as well as save bandwidth? "The answer is yes," Gilberto Rodriguez told Lei Feng. "On the one hand, tensor tiling reduces the amount of intermediate data transferred over the memory bus. On the other, it reduces the number of times the network's weights must be re-fetched by the processor cores, which effectively cuts data movement."
On top of the hardware sits a toolchain: Imagination's workflow of offline and online tools lets developers deploy faster.
Will NVIDIA meet a new competitor in autonomous driving?
NVIDIA launched its in-vehicle computing platform in 2015 and has iterated on it ever since, and it currently holds an advantageous position in the autonomous-driving chip market. But NVIDIA's strength is desktop GPUs: its platforms deliver high performance, yet their power consumption may not be friendly to battery-powered electric vehicles. That is an opening for Imagination, whose strength lies in mobile devices with strict power budgets.
Unlike NVIDIA, Imagination is an IP provider and does not sell chips directly, so it can work with leading automotive disruptors, tier-one suppliers, OEMs, and SoC makers to bring competitive products to market. To help partners enter this market and ship automotive-grade products more quickly, the Series4 also includes IP-level safety features, and its design process conforms to ISO 26262, the industry safety standard for addressing risk in automotive electronics.
The new Series4 NNA can perform neural network inference safely without sacrificing performance; hardware safety mechanisms protect the compiled network, its execution, and the data-processing pipeline.
Andrew Grant revealed that licensing has already begun, that more than one customer has signed up, and that the product will be fully available on the market by December 2020.
This means the autonomous-driving chip market will see more competitive products. Lei Feng believes that Imagination's stronger GPU-plus-NNA portfolio will help more companies entering this market launch more competitive products. Last month, Imagination released its latest-generation IMG B-Series high-performance GPU IP, a multi-core architecture spanning four families of cores with 33 configurations.
A more general-purpose GPU paired with a more dedicated AI accelerator clearly offers more choices for high-performance computing. Interestingly, NVIDIA also relies on a strong combination of GPUs and AI-accelerating Tensor Cores.
ABI Research expects demand for ADAS to roughly triple by 2027, and the auto industry has turned its attention to fully automated cars and self-driving taxis. The combination of high performance, low latency, and high energy efficiency will be key in the evolution from L2 and L3 ADAS to L4 and L5 fully automated driving.
Given this huge market opportunity, how will these two companies with such similar chip-product strengths compete?