On July 30, 2020, MLPerf released the results of its third training benchmark round, MLPerf Training v0.7. NVIDIA's DGX SuperPOD system, based on the A100 Tensor Core GPU released in May, broke eight performance records, raising the bar for the many AI chip companies hoping to build processors that beat NVIDIA's GPUs.
By contrast, the second-generation IPU GC200, released by Graphcore on July 15, deserves NVIDIA's vigilance. The reason is, of course, not simply that the second-generation IPU, built on TSMC's 7 nm process, has a transistor density 10% higher than the NVIDIA A100 GPU.
Rather, it is that Graphcore's second-generation IPU outperforms the A100 GPU on many mainstream models, which means the two will compete head-on in very large data centers.
The IPU may also show greater advantages in some emerging AI applications in the future.
Compared with the GPU, the IPU delivers up to 100x performance improvements
At present, AI applications are mainly focused on computer vision (CV). In CV, according to benchmarks of Google's latest EfficientNet model, the IPU's inference throughput can reach 15 times that of the GPU, and training performance improves by 7 times.
In inference on ResNeXt-101, an improved variant of ResNet, the IPU increases throughput by 7 times and reduces latency by about 24 times. In training ResNeXt-50, the IPU's throughput is about 30% higher than the GPU's.
In addition, on BERT-Base, the most popular NLP model, the IPU doubles throughput at the same latency, cuts training time by 25% to 36.3 hours, and reduces power consumption by 20%.
In an MCMC training model, the IPU's performance is 15 times that of the GPU, shortening training time by a factor of 15. In a VAE training model, performance improves by 4.8 times and training time shrinks by the same factor.
There are also the sales-forecasting and recommendation models attracting attention at present. Compared with the GPU, the IPU achieves up to a 6x performance improvement in training MLP models for sales-data analysis, and a 2.5x improvement in training deep-autoencoder recommendation models.
The IPU is also better at grouped convolution kernels: the smaller the group dimension, the more pronounced the IPU's performance advantage. Generally speaking, throughput improves by 4x to 100x.
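To make the grouped-convolution point concrete, here is a minimal sketch (not Graphcore code; the channel and kernel sizes are illustrative assumptions) of how splitting a convolution into groups cuts the arithmetic per output position, which lowers arithmetic intensity and shifts the bottleneck toward memory access — the regime where on-chip-memory architectures like the IPU claim an edge:

```python
# Multiply-accumulate (MAC) count per output spatial position for a
# k x k 2-D convolution, optionally split into `groups` groups
# (ResNeXt-style grouped convolution).

def conv_macs(c_in: int, c_out: int, k: int, groups: int = 1) -> int:
    """Each output channel only sees c_in/groups input channels."""
    assert c_in % groups == 0 and c_out % groups == 0
    return (c_in // groups) * c_out * k * k

dense   = conv_macs(256, 256, 3)             # standard convolution
grouped = conv_macs(256, 256, 3, groups=32)  # 32 groups, as in ResNeXt

print(dense, grouped, dense // grouped)  # 589824 18432 32
```

The arithmetic drops by exactly the group count, so a GPU's dense compute units sit increasingly idle as groups shrink, while per-MAC memory traffic stays the same.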
Three technological breakthroughs of IPU
The multi-dimensional comparison between the IPU and the GPU across current AI applications makes the IPU's advantages clear, and those advantages are closely tied to Graphcore's breakthroughs in compute, data, and communication.
At the compute core of Graphcore's newly released second-generation IPU, the Colossus Mk2 GC200, the number of independent IPU-Tile units grows from 1,216 to 1,472, with 8,832 threads able to execute in parallel. In-Processor Memory increases from the previous generation's 300 MB to 900 MB, and memory bandwidth per IPU reaches 47.5 TB/s.
The chip also has an IPU-Exchange and a PCIe Gen4 interface for interacting with the host, as well as IPU-Links providing 320 GB/s of chip-to-chip interconnect.
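A quick back-of-envelope check of the Mk2 GC200 figures quoted above (tile, thread, and memory counts are from the article; the per-tile values are derived, approximate, and assume memory is spread evenly across tiles):

```python
# Derived per-tile figures for the Colossus Mk2 GC200.
tiles = 1472
threads = 8832
in_processor_memory_mb = 900

threads_per_tile = threads // tiles            # hardware threads per tile
sram_per_tile_kib = in_processor_memory_mb / tiles * 1024

print(threads_per_tile)                        # 6
print(round(sram_per_tile_kib))                # ~626 KiB of SRAM per tile
```

The even split — a few hundred KiB of SRAM next to every tile — is what distinguishes this design from a GPU's shared off-chip HBM.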
Three typical application scenarios illustrate the compute-level gains of the second-generation IPU over the first: BERT-Large training is 9.3 times faster, 3-layer BERT inference is 8.5 times faster, and EfficientNet-B3 is 7.4 times faster. The second-generation IPU has twice the peak compute of the first, and across typical CV and NLP models it shows an average 8x performance improvement over the first generation.
This performance improvement owes much to the In-Processor Memory growing from 300 MB to 900 MB. Luo Xu, head of technology applications at Graphcore China, told Leiphone that the additional in-processor storage in the Mk2 IPU is mainly used for storing model activations and weights. Because the space occupied by program code stored in the processor is basically the same as on the first-generation IPU, the effective storage available for model weights and activations is more than 6 times larger.
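The "more than 6x" claim implies something about the resident program footprint. Assuming (as the article states) that code occupies roughly the same space on both generations, a footprint c satisfying (900 − c) / (300 − c) ≥ 6 requires c ≥ 180 MB — a small worked check:

```python
# Ratio of memory usable for weights/activations, Mk2 (900 MB) vs Mk1 (300 MB),
# for a given program-code footprint c (in MB) that is identical on both chips.
# Solving (900 - c) / (300 - c) >= 6:  900 - c >= 1800 - 6c  ->  c >= 180.

def usable_ratio(code_mb: float) -> float:
    return (900 - code_mb) / (300 - code_mb)

print(usable_ratio(180))   # exactly 6.0x at a 180 MB code footprint
print(usable_ratio(200))   # 7.0x once the footprint exceeds 180 MB
```

So the quoted ">6x" is consistent with the program image taking up well over half of the Mk1's 300 MB.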
But 300 MB of in-processor storage was itself a great engineering challenge; what challenges come with scaling to 900 MB? Luo Xu addressed this point as well.
Compared with NVIDIA's DGX-A100, which is built on eight of the latest A100 GPUs, a system composed of eight Graphcore IPU-M2000s delivers 12 times the FP32 compute, 3 times the AI compute, and 10 times the AI storage. On price, the eight IPU-M2000s cost $259,600 versus $199,000 for a DGX-A100, giving Graphcore a clear price-performance advantage.
From an application perspective, in EfficientNet-B4 image-classification training, eight IPU-M2000s (each a 1U box integrating four GC200 IPUs) match the performance of sixteen DGX-A100s, which reflects a price advantage of more than 10x.
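The ">10x price advantage" follows directly from the list prices and the EfficientNet-B4 equivalence quoted above; a quick check of the arithmetic:

```python
# Price/performance using the article's figures: eight IPU-M2000s ($259,600
# total) are claimed to match sixteen DGX-A100s ($199,000 each) on
# EfficientNet-B4 training throughput.

price_8x_m2000 = 259_600              # USD, eight IPU-M2000s
price_dgx_a100 = 199_000              # USD, one DGX-A100

equivalent_dgx_cost = 16 * price_dgx_a100
advantage = equivalent_dgx_cost / price_8x_m2000

print(f"{advantage:.1f}x")            # 12.3x for the same throughput
```

At roughly 12x, the claim of "more than 10 times" checks out on these numbers.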
On the data side, Graphcore proposed the concept of IPU Exchange Memory. Compared with the HBM technology NVIDIA currently uses, each IPU-M2000 can provide nearly 100 times the bandwidth and about 10 times the capacity through Exchange Memory, which is very helpful for many complex AI models and algorithms.
With these breakthroughs in compute and data, the IPU can deliver 10 to 50 times better performance than the GPU in native sparse computing. On data- and compute-intensive workloads the GPU performs very well, but as data becomes sparse — whether statically or dynamically — the IPU's advantage over the GPU grows increasingly significant.
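A toy sketch (illustrative only, not Graphcore's implementation) of why sparsity changes the balance: a dense dot product always performs n multiplies, while a sparse representation touches only the non-zero entries, so the win scales with sparsity — provided the hardware can follow the irregular access pattern, which is where fine-grained architectures claim an advantage over wide SIMD units:

```python
# Dense vs sparse dot product over a vector that is 98% zeros.

def dense_dot(a, b):
    # Touches every element, zero or not.
    return sum(x * y for x, y in zip(a, b))

def sparse_dot(a_nonzero, b):
    # a_nonzero: list of (index, value) pairs for the non-zeros of a.
    return sum(v * b[i] for i, v in a_nonzero)

a = [0.0] * 100
a[3], a[42] = 2.0, 0.5
b = [1.0] * 100

a_sparse = [(i, v) for i, v in enumerate(a) if v != 0.0]
print(dense_dot(a, b), sparse_dot(a_sparse, b))  # 2.5 2.5
print(len(a_sparse))                             # only 2 multiplies needed
```

Same result, 50x less arithmetic — but the sparse path's memory accesses are irregular, which dense-matrix hardware handles poorly.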
Lu Tao, senior vice president and general manager of Graphcore China, noted that communication is also a key problem in large-scale data-center computing. To address it, Graphcore designed IPU-Fabric for AI scale-out. IPU-Fabric achieves an ultra-low-latency 2.8 Tbps fabric and can support horizontal scaling of up to 64,000 IPUs.
Lu Tao explained that IPU-Fabric consists of three networks: IPU-Link, IPU Gateway Link, and IPU over Fabric. IPU-Link provides communication between IPUs within a rack; IPU Gateway Link provides the network between racks, enabling rack-to-rack scale-out; and IPU over Fabric can combine IPU clusters and x86 clusters into a very flexible, low-latency, high-performance network.
Combining these breakthroughs in compute, data, and communication makes it possible to build a large-scale, scalable IPU-POD system. The form of IPU-POD used for supercomputing is the IPU-POD64, the basic building block of an IPU-POD. Each IPU-POD64 cabinet holds 64 IPUs, providing 16 PFLOPS of compute and about 58 GB of In-Processor Memory, with up to 7 TB of streaming memory in total.
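The IPU-POD64 aggregate figures are internally consistent with the per-chip numbers quoted earlier; a quick sanity check (assuming 64 GC200 IPUs per pod, each with 900 MB of In-Processor Memory):

```python
# Sanity-checking the IPU-POD64 figures from the article.
ipus = 64
per_ipu_memory_mb = 900
pod_pflops = 16

total_memory_gb = ipus * per_ipu_memory_mb / 1000   # decimal GB
ai_tflops_per_ipu = pod_pflops * 1000 / ipus

print(total_memory_gb)      # 57.6 GB -> matches the quoted ~58 GB
print(ai_tflops_per_ipu)    # 250 TFLOPS of AI compute per GC200
```

The implied 250 TFLOPS of AI compute per GC200 is also used below when checking the energy-efficiency figure.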
Within the IPU-POD, therefore, it is very important to decouple AI compute from logic control, making the system easy to deploy and keeping network latency very low. It can support very large algorithm models and very secure multi-tenant use.
Why does Graphcore deserve NVIDIA's attention?
From the standpoint of power consumption, different scenarios differ somewhat, but overall a single IPU-M2000 draws 1.1 kW as a complete system, equivalent to an energy efficiency of 0.9 TFLOPS/W per IPU processor. Among comparable high-performance AI compute products for the data center, that is higher than the 0.7 TFLOPS/W of the A100 GPU and the 0.71 TFLOPS/W of Huawei's Ascend 910.
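The quoted 0.9 TFLOPS/W can be reproduced from the article's own figures (1.1 kW per IPU-M2000, four GC200s per box, and the ~250 TFLOPS of AI compute per GC200 implied by the IPU-POD64 numbers):

```python
# Deriving the IPU-M2000's energy-efficiency figure.
power_w = 1100                 # total system draw of one IPU-M2000
ipus_per_m2000 = 4
tflops_per_ipu = 250           # implied by 16 PFLOPS / 64 IPUs per pod

efficiency = ipus_per_m2000 * tflops_per_ipu / power_w

print(f"{efficiency:.2f} TFLOPS/W")   # 0.91 -> matches the quoted ~0.9
```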
In other words, Graphcore will compete head-on with NVIDIA in large-scale data centers. Leiphone believes that, even more than competition from other GPU makers, NVIDIA should not ignore Graphcore's IPU — particularly since Graphcore has always stressed that the IPU is designed for AI, and that its target applications are AI workloads that CPUs and GPUs handle poorly.
This can also be seen in Graphcore's software and ecosystem building. As a general-purpose AI processor, the IPU supports both training and inference and provides a unified software platform. The latest Poplar SDK 1.2 has three highlights: first, integration with more advanced machine-learning frameworks; second, further opening of low-level APIs so developers can do specific tuning of network performance; third, added framework support, including PyTorch and Keras, along with optimized convolution and sparse libraries.
In addition, the full development framework supports the three mainstream operating systems — Ubuntu, RedHat, and CentOS — lowering the barrier for developers. At the same time, beyond further opening the low-level APIs, Graphcore has open-sourced the code of Poplar's PopLibs libraries. All of this work is meant to let developers innovate on the IPU and to build the IPU's competitive advantage in new application fields.
Furthermore, Graphcore provides free IPU access for commercial users, universities and research institutions, and individual developers. In China, the Graphcore IPU developer cloud is deployed on Kingsoft Cloud, using three kinds of IPU products: the IPU-POD64, Inspur's IPU server (NF5568M5), and Dell's IPU server (DSS8440).
Leiphone has learned that, at present, it is mainly commercial users and universities applying to use the Graphcore IPU developer cloud, with few individual researchers.
The IPU developer cloud supports training and inference for some of the most advanced and complex AI models. Advanced computer vision is represented by machine-vision models such as ResNeXt and EfficientNet. Time-series analysis models such as LSTM and GRU are widely used in natural language processing, advertising recommendation, and financial algorithms. Ranking and recommendation models such as deep autoencoders also perform very well, as do probabilistic models and MCMC-based algorithmic-trading models.
There is also a key question affecting large-scale commercial adoption of the IPU: what are the yield and cost of a second-generation IPU with up to 900 MB of on-chip storage?
Lu Tao said: "From the first generation to the second, we have used a distributed storage architecture, which lets us control product yield very well. So even 900 MB of in-processor memory will not have a significant impact on cost."
Graphcore, which already has several cloud partners, is building an innovation community in China and developing its ecosystem through both hardware and software. How will Graphcore compete with NVIDIA through cooperation with OEMs and channel partners?