Google announced the sub-millisecond face detection algorithm BlazeFace, another breakthrough in face detection!

via: 博客园 (cnblogs)     time: 2019/7/30 8:16:10

Lei Feng Network AI Technology Review: Google recently released BlazeFace, a lightweight face detector tailored for mobile GPU inference that runs in under a millisecond. It runs at 200 to 1000+ FPS on flagship devices and suits many tasks that need a fast, accurate face region as input, such as 2D/3D facial keypoint and geometry estimation, facial feature and expression classification, and face segmentation. Google published a paper describing the work, which Lei Feng Network (public account: Lei Feng Network) AI Technology Review has compiled as follows.

Introduction to BlazeFace

In recent years, various architectural improvements in deep neural networks have made real-time object detection possible. In mobile applications, real-time object detection is often the first step in a video processing pipeline, followed by task-specific components such as segmentation, tracking, or geometry inference. The object detection model must therefore run as fast as possible, preferably with performance well above the standard real-time benchmark.

We propose a new face detection framework called BlazeFace, adapted from the Single Shot MultiBox Detector (SSD) framework and optimized for inference on mobile GPUs. Our main contributions are:

1. Inference speed

  • A very compact feature extractor convolutional neural network, structurally related to MobileNetV1/V2 and designed specifically for lightweight object detection.
  • A new GPU-friendly anchor scheme, modified from SSD, aimed at effective GPU utilization. Anchors (called priors in SSD terminology) are predefined static bounding boxes that serve as the basis for the network's predicted adjustments and determine the prediction granularity.

2. Prediction quality

  • A tie resolution strategy that replaces non-maximum suppression and gives more stable, smoother tie resolution between overlapping predictions.

Face detection for AR pipelines

Although the proposed framework applies to a variety of object detection tasks, in this article we focus on detecting faces in a mobile phone camera viewfinder. Because the focal lengths and typical sizes of captured objects differ, we built separate models for the front and rear cameras.

In addition to predicting axis-aligned face rectangles, the BlazeFace model also produces six facial keypoint coordinates (eye centers, ear tragions, mouth center, and nose tip), which let us estimate the face rotation (roll angle). This makes it possible to pass a rotated face rectangle to later task-specific stages of the video processing pipeline, relaxing the requirement for significant translation and rotation invariance in those stages.
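As a minimal sketch of how a roll angle could be derived from the keypoints, the function below measures the tilt of the line through the two eye centers. Using only the eye centers, and the name `roll_angle_degrees`, are assumptions made for illustration; the article does not say which keypoints the authors use for this estimate.

```python
import math


def roll_angle_degrees(eye_a, eye_b):
    """Tilt of the line from eye_a to eye_b relative to the image x-axis.

    Points are (x, y) in image coordinates, with y growing downward.
    eye_a is the eye that appears further left in the image.
    A level face gives 0 degrees.
    """
    dx = eye_b[0] - eye_a[0]
    dy = eye_b[1] - eye_a[1]
    return math.degrees(math.atan2(dy, dx))


level = roll_angle_degrees((40, 60), (80, 60))
tilted = roll_angle_degrees((40, 60), (80, 100))
print(level, tilted)
```

Rotating the face rectangle by the negative of this angle would bring the eyes back to a horizontal line before the crop is handed to a later stage.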

Model structure and design

The BlazeFace model architecture is built around the four important design considerations discussed below.

1. Enlarging the receptive field

Although most modern convolutional neural network architectures (including MobileNet, https://arxiv.org/pdf/1704.04861.pdf) tend to use 3×3 convolution kernels throughout the model graph, we note that the cost of a depthwise separable convolution is dominated by its pointwise part. On an s×s×c input tensor, a k×k depthwise convolution involves s^2·c·k^2 multiply-add operations, while the subsequent 1×1 convolution into d output channels involves s^2·c·d such operations, which is d/k^2 times the cost of the depthwise part.

In practice, on an Apple iPhone X using Apple's Metal Performance Shaders implementation, a 3×3 depthwise convolution in 16-bit floating point arithmetic takes 0.07 ms for a 56×56×128 tensor, while the subsequent 1×1 convolution from 128 to 128 channels is 4.3 times slower at 0.3 ms (the gap is smaller than the pure operation count suggests because of fixed costs and memory access factors).
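The operation counts above can be checked with a few lines of arithmetic. This is a sketch using the formulas from the text and the 56×56×128 tensor with 128 output channels from the iPhone X example; the function names are illustrative only.

```python
def depthwise_macs(s, c, k):
    """Multiply-adds of a k x k depthwise convolution on an s x s x c tensor."""
    return s * s * c * k * k


def pointwise_macs(s, c, d):
    """Multiply-adds of a 1 x 1 convolution from c to d channels on an s x s map."""
    return s * s * c * d


s, c, d = 56, 128, 128
for k in (3, 5):
    dw = depthwise_macs(s, c, k)
    pw = pointwise_macs(s, c, d)
    # The ratio equals d / k^2, as stated in the text.
    print(f"k={k}: depthwise={dw:,} pointwise={pw:,} ratio={pw / dw:.2f}")
```

For k=3 the pointwise part needs about 14.22 times as many multiply-adds as the depthwise part, far more than the measured 4.3 times gap in running time, which matches the remark about fixed costs and memory access. For k=5 the ratio is still 5.12, so the pointwise part keeps dominating even with the larger kernel.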

This observation means that increasing the kernel size of the depthwise part is relatively cheap. We use 5×5 kernels in the bottlenecks of the model architecture, trading the larger kernel for a much smaller number of bottlenecks needed to reach a given receptive field size. The resulting BlazeBlock comes in the two variants shown below:

Single BlazeBlock (left) and double BlazeBlock (right)
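The trade-off between kernel size and depth can be illustrated with the standard receptive field formula for a stack of stride-1 convolutions, 1 + n(k − 1). This is a simplified sketch: it ignores strides and the internal structure of the BlazeBlocks, and the target size of 17 is an arbitrary example, not a figure from the paper.

```python
import math


def receptive_field(num_layers, k):
    """Receptive field of num_layers stacked k x k convolutions with stride 1."""
    return 1 + num_layers * (k - 1)


def layers_needed(target_rf, k):
    """Fewest stacked k x k stride-1 convolutions whose receptive field reaches target_rf."""
    return math.ceil((target_rf - 1) / (k - 1))


target = 17  # arbitrary example size
for k in (3, 5):
    n = layers_needed(target, k)
    print(f"{k}x{k}: {n} layers -> receptive field {receptive_field(n, k)}")
```

Under these assumptions, 5×5 kernels reach the same receptive field with half as many layers as 3×3 kernels, so half as many of the expensive pointwise convolutions are needed.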

2. Feature extractor

As a concrete example, we focus on the feature extractor of the front camera model. It has to handle a smaller range of object scales, so it has lower computational requirements. The extractor takes a 128×128 pixel RGB input and consists of a 2D convolution followed by 5 single BlazeBlocks and 6 double BlazeBlocks. The complete layout is shown in the table below. The maximum tensor depth (channel resolution) is 96 and the lowest spatial resolution is 8×8 (in contrast to SSD, which reduces the resolution all the way down to 1×1).

BlazeFace feature extractor network structure

3. Anchor scheme

SSD-like object detection models rely on predefined fixed-size base bounding boxes called priors, or anchors in Faster R-CNN terminology. A set of regression (and possibly classification) parameters, such as center offset and size adjustments, is predicted for each anchor. These parameters are used to adjust the predefined anchor position into a tight bounding rectangle.
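A minimal sketch of how predicted regression parameters adjust an anchor is given below. The parameterization (center offsets measured in units of the anchor size, sizes scaled through an exponential) is the common Faster R-CNN style convention and is an assumption here; the article does not specify the exact encoding BlazeFace uses. The `Box` type and `decode` function are hypothetical names.

```python
import math
from typing import NamedTuple


class Box(NamedTuple):
    cx: float  # center x
    cy: float  # center y
    w: float   # width
    h: float   # height


def decode(anchor: Box, dx: float, dy: float, dw: float, dh: float) -> Box:
    """Turn an anchor plus predicted adjustments into a bounding box.

    Assumed encoding: (dx, dy) shift the center by a fraction of the anchor
    size; (dw, dh) scale the width and height by exp(dw) and exp(dh).
    """
    return Box(
        cx=anchor.cx + dx * anchor.w,
        cy=anchor.cy + dy * anchor.h,
        w=anchor.w * math.exp(dw),
        h=anchor.h * math.exp(dh),
    )


anchor = Box(cx=0.5, cy=0.5, w=0.2, h=0.2)
print(decode(anchor, 0.0, 0.0, 0.0, 0.0))  # all-zero adjustments return the anchor itself
print(decode(anchor, 0.5, 0.0, math.log(2.0), 0.0))
```

With all-zero parameters the anchor is returned unchanged, which is why anchors act as the starting point for every prediction.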

The usual practice is to define anchors at multiple resolution levels according to the range of object scales, with aggressive downsampling also serving as a way to save computation. A typical SSD model makes predictions from feature maps of size 1×1, 2×2, 4×4, 8×8, and 16×16. However, the success of the Pooling Pyramid Network (PPN) architecture (https://arxiv.org/pdf/1807.03284.pdf) implies that additional computation may be redundant once the feature map reaches a certain resolution.

A key feature specific to GPU as opposed to CPU computation is a noticeable fixed cost of dispatching each layer's computation, which becomes relatively significant for the deep, low-resolution layers inherent to popular CPU-tailored architectures. For example, in one experiment we observed that out of 4.9 ms of MobileNetV1 inference time, only 3.9 ms was spent in actual GPU computation.

With this in mind, we adopted an alternative anchor scheme that stops at the 8×8 feature map size without further downsampling (Figure 2). We replaced the 2 anchors per pixel at each of the 8×8, 4×4, and 2×2 resolutions with 6 anchors per pixel at 8×8. Because human face aspect ratios vary little, restricting the anchors to a 1:1 aspect ratio was found to be sufficient for accurate face detection.

Anchor calculation, SSD (left) and BlazeFace (right)
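To make the difference between the two layouts concrete, the sketch below counts anchors in a multi-resolution pyramid and generates anchor centers for a single grid. The normalized coordinates, cell-centered placement, and function names are assumptions for illustration; the number of anchors per cell is left as a parameter.

```python
def pyramid_anchor_count(grid_sizes, anchors_per_cell):
    """Total anchors when every cell of every square grid carries the same number."""
    return sum(g * g * anchors_per_cell for g in grid_sizes)


def grid_anchor_centers(grid_size, anchors_per_cell):
    """Normalized (cx, cy) anchor centers for one square grid.

    Each cell contributes anchors_per_cell anchors sharing the cell center;
    with a fixed 1:1 aspect ratio they would differ only in scale.
    """
    centers = []
    for row in range(grid_size):
        for col in range(grid_size):
            cx = (col + 0.5) / grid_size
            cy = (row + 0.5) / grid_size
            centers.extend([(cx, cy)] * anchors_per_cell)
    return centers


# 2 anchors per cell spread over the 8x8, 4x4 and 2x2 levels.
print(pyramid_anchor_count([8, 4, 2], 2))
# The same 2 anchors per cell on the 8x8 level alone.
print(len(grid_anchor_centers(8, 2)))
```

Moving the anchors of the coarser levels onto the 8×8 grid removes the layers that produce the 4×4 and 2×2 maps, and with them their per-layer dispatch cost.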

4. Post-processing

Since our feature extractor does not reduce the resolution below 8×8, the number of anchors overlapping a given object grows significantly with the object size. In a typical non-maximum suppression scheme, only one of these anchors wins and is used as the final output of the algorithm. When such a model is applied to consecutive video frames, the predictions tend to fluctuate between different anchors and show temporal jitter that is noticeable to a human viewer.

To minimize this effect, we replace the suppression algorithm with a blending strategy that estimates the regression parameters of a bounding box as a weighted mean over the overlapping predictions. This adds virtually no cost to the original NMS algorithm. For the face detection task, this adjustment increases accuracy by 10%.
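The blending idea can be sketched as a small change to greedy NMS: instead of keeping only the top-scoring box of each overlapping group, the group is averaged. Using the confidence scores as weights and an IoU threshold of 0.3 are assumptions made for this sketch; the article only states that a weighted average over overlapping predictions is used. Scores are assumed to be positive.

```python
def iou(a, b):
    """Intersection over union of two (x_min, y_min, x_max, y_max) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def blend_detections(boxes, scores, iou_threshold=0.3):
    """Greedy grouping as in NMS, but each group is merged by a score-weighted mean."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    results = []
    while order:
        top = order[0]
        # The top box always belongs to its own group.
        group = [i for i in order
                 if i == top or iou(boxes[top], boxes[i]) >= iou_threshold]
        total = sum(scores[i] for i in group)
        blended = tuple(
            sum(scores[i] * boxes[i][k] for i in group) / total for k in range(4)
        )
        results.append((blended, scores[top]))
        order = [i for i in order if i not in group]
    return results


boxes = [(0, 0, 10, 10), (1, 0, 11, 10), (50, 50, 60, 60)]
scores = [0.75, 0.25, 0.9]
for box, score in blend_detections(boxes, scores):
    print(box, score)
```

In this example the two overlapping boxes merge into one box whose left edge sits at 0.25, between the two inputs, rather than snapping to either one. That is the behaviour that reduces frame-to-frame switching between anchors.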

We quantify the amount of jitter by passing several slightly offset versions of the same input image into the network and observing how the model outputs (adjusted for the offset) are affected. With the modified tie resolution strategy, the jitter metric (defined as the root mean squared difference between the predictions for the original and the shifted inputs) drops by 40% on our front camera dataset and by 30% on the rear camera dataset, which contains smaller faces.
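A sketch of such a jitter measurement follows, applied to a single predicted point such as a box center. Compensating each prediction for the known shift before comparing, and the function name `jitter_rms`, are assumptions about how the definition in the text would be applied.

```python
import math


def jitter_rms(reference, shifted_predictions, shifts):
    """Root mean squared difference between original and shifted predictions.

    reference: (x, y) predicted on the original image.
    shifted_predictions: (x, y) predicted on each shifted copy of the image.
    shifts: the (dx, dy) offset that was applied to produce each copy.
    """
    squared = []
    for (px, py), (dx, dy) in zip(shifted_predictions, shifts):
        # Undo the known shift so a perfectly stable model scores zero.
        ex = (px - dx) - reference[0]
        ey = (py - dy) - reference[1]
        squared.append(ex * ex + ey * ey)
    return math.sqrt(sum(squared) / len(squared))


shifts = [(1, 0), (0, 1)]
print(jitter_rms((10, 10), [(11, 10), (10, 11)], shifts))  # stable model
print(jitter_rms((10, 10), [(11, 10), (10, 12)], shifts))  # one prediction is off by 1 px
```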


We trained our model on a dataset of 66K images. To evaluate the experimental results, we used a geographically diverse data set consisting of 2K images.

For the front camera model, only faces that occupy more than 20% of the image area are considered, which reflects the intended use case (the threshold for the rear camera model is 5%).

The regression parameter errors were normalized by the inter-ocular distance (IOD) for scale invariance, and the median absolute error was measured to be 7.4% of IOD. The jitter metric evaluated by the procedure described above was 3% of IOD.
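The normalization can be sketched as follows: each absolute error is divided by the distance between the two eye centers of the same face, so that large and small faces contribute on the same scale. The sample layout and function names are hypothetical, and the numbers in the example are made up for illustration.

```python
import math
from statistics import median


def inter_ocular_distance(eye_a, eye_b):
    """Euclidean distance between the two eye centers, in pixels."""
    return math.dist(eye_a, eye_b)


def median_error_percent_of_iod(samples):
    """Median absolute error as a percentage of IOD.

    samples: iterable of (predicted, ground_truth, eye_a, eye_b), where
    predicted and ground_truth are one regression parameter in pixels and
    eye_a, eye_b are the ground-truth (x, y) eye centers of that face.
    """
    normalized = [
        abs(pred - truth) / inter_ocular_distance(eye_a, eye_b)
        for pred, truth, eye_a, eye_b in samples
    ]
    return 100.0 * median(normalized)


samples = [
    (102.0, 100.0, (40, 50), (80, 50)),  # 2 px error, IOD 40
    (100.0, 100.0, (40, 50), (80, 50)),  # exact
    (96.0, 100.0, (0, 0), (0, 40)),      # 4 px error, IOD 40
]
print(median_error_percent_of_iod(samples))
```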

Figure 4 shows the average precision (AP) metric (with the standard 0.5 intersection-over-union bounding box match threshold) and the mobile GPU inference time of the proposed frontal face detection network, compared with a MobileNetV2-based object detector (MobileNetV2-SSD). We use the TensorFlow Lite GPU backend in 16-bit floating point mode as the framework for evaluating inference time.

Front camera face detection performance

Figure 5 puts the GPU inference speed of the two network models in perspective across more flagship devices:

Inference speed across multiple mobile devices

Figure 6 shows how much the regression parameter prediction quality degrades because of the smaller model size. As described in the following section, this does not necessarily lead to a proportional drop in the quality of the whole AR pipeline.

Quality of regression parameter prediction


These models can run on a full image or video frame and can serve as the first step of almost any face-related computer vision application, such as 2D/3D facial keypoint, contour, or surface geometry estimation, facial feature or expression classification, and face region segmentation. The subsequent task in the computer vision pipeline can therefore be defined in terms of a proper facial crop. Combined with the few facial keypoint estimates provided by BlazeFace, this crop can also be rotated so that the face inside it is centered, scale-normalized, and has a roll angle close to zero. This removes the requirement for significant translation and rotation invariance from the task-specific model, allowing better allocation of computational resources.

We illustrate this approach with the example of face contour estimation. Figure 7 shows how the output of BlazeFace, that is, the predicted bounding box and the six facial keypoints (red), is further refined by a more complex face contour estimation model applied to a slightly expanded crop.

Pipeline example: BlazeFace output in red, task-specific model output in green

The detailed keypoints yield a finer bounding box estimate (green), which is reused for tracking in the subsequent frame without running the face detector. To detect failures of this computation-saving strategy, the contour model can also check whether a face is actually present and reasonably aligned in the rectangular crop it was given. Whenever this condition is violated, the BlazeFace face detector is run on the entire video frame again.
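The control flow of this detect-then-track strategy can be sketched as a loop that calls the detector only when no valid crop is available. The callables `detect_face` and `refine_contour` are hypothetical stand-ins for the two models, and re-running the detector on the next frame rather than the current one is a simplification made for this sketch.

```python
def run_pipeline(frames, detect_face, refine_contour):
    """Track a face across frames, running the detector only when needed.

    detect_face(frame) -> crop, or None if no face is found.
    refine_contour(frame, crop) -> (refined_crop, aligned), where aligned
    tells whether a face is present and reasonably aligned in the crop.
    Returns the per-frame outputs and the number of detector calls.
    """
    crop = None
    detector_calls = 0
    outputs = []
    for frame in frames:
        if crop is None:
            crop = detect_face(frame)
            detector_calls += 1
            if crop is None:
                outputs.append(None)
                continue
        refined_crop, aligned = refine_contour(frame, crop)
        if aligned:
            outputs.append(refined_crop)
            crop = refined_crop  # reuse the refined box for the next frame
        else:
            outputs.append(None)
            crop = None          # tracking lost: fall back to the detector
    return outputs, detector_calls


# Stub models: tracking fails on frame 2 only.
frames = [0, 1, 2, 3, 4]
outputs, calls = run_pipeline(
    frames,
    detect_face=lambda frame: ("box", frame),
    refine_contour=lambda frame, crop: (("box", frame), frame != 2),
)
print(outputs, calls)
```

With these stubs the detector runs on frame 0 and again on frame 3, after tracking is lost on frame 2; the other frames are handled by the task-specific model alone.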

Paper links:

