Lei Feng Network AI Technology Review: Google recently released BlazeFace, a lightweight, sub-millisecond face detector tailored for mobile GPU inference. It runs at 200 to 1000+ FPS on flagship devices and can be used in many tasks that require a fast and accurate face region as input, such as 2D/3D facial keypoint or geometry estimation, facial feature or expression classification, and face region segmentation. Google published a paper presenting the results, which Lei Feng Network AI Technology Review has compiled as follows.
Introduction to BlazeFace
In recent years, various architectural improvements in deep neural networks have made real-time object detection possible. In mobile applications, real-time object detection is usually the first step in a video processing pipeline, followed by task-specific components such as segmentation, tracking or geometry inference. Object detection inference therefore has to run as fast as possible, preferably with performance well above the standard real-time benchmark.
We propose a new face detection framework called BlazeFace, which builds on the Single Shot MultiBox Detector (SSD) framework and is optimized for mobile GPU inference. Our main contributions are:
1. Related to inference speed
2. Related to prediction quality
Face detection for AR
Although this framework is applicable to a variety of object detection tasks, in this article we focus on detecting faces in a mobile phone camera viewfinder. Because of the different focal lengths and typical sizes of the captured objects, we built separate models for the front and rear cameras.
In addition to predicting axis-aligned face rectangles, the BlazeFace model also produces six facial keypoint coordinates (for the eye centers, ears, mouth center, and nose tip), which allow us to estimate the face rotation (roll angle). This makes it possible to pass a rotated face rectangle to later task-specific stages of the video processing pipeline, relaxing the requirement for significant translation and rotation invariance in subsequent processing steps.
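As an illustration of how a roll angle could be derived from the keypoints, the following minimal sketch takes the angle of the line through the two eye centers. The function name, the choice of the eye keypoints, and the image coordinate convention (x to the right, y downward) are assumptions made for this example and are not taken from the paper.

```python
import math

def roll_angle_degrees(left_eye, right_eye):
    """Angle between the eye-to-eye line and the image x-axis, in degrees.

    Assumes (x, y) pixel coordinates with y pointing down, where left_eye is
    the eye that appears further left in the image (hypothetical convention).
    """
    dx = right_eye[0] - left_eye[0]
    dy = right_eye[1] - left_eye[1]
    return math.degrees(math.atan2(dy, dx))

# A level face: both eyes at the same height.
level = roll_angle_degrees((40.0, 60.0), (80.0, 60.0))
# A tilted face: the right eye sits lower in the image than the left one.
tilted = roll_angle_degrees((40.0, 60.0), (80.0, 100.0))
```

Rotating the face rectangle by the negative of this angle would bring the eye line back to horizontal before the crop is handed to a later stage.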
Model structure and design
The BlazeFace model architecture is built around the four important design considerations discussed below.
1. Enlarging the receptive field
Although most modern convolutional neural network architectures (including MobileNet, https://arxiv.org/pdf/1704.04861.pdf) tend to use 3×3 convolution kernels throughout the model graph, we note that depthwise separable convolution computations are dominated by their pointwise parts. On an s×s×c input tensor, a k×k depthwise convolution involves s^2·c·k^2 multiply-add operations, while the subsequent 1×1 convolution into d output channels involves s^2·c·d multiply-add operations, which is d/k^2 times the cost of the depthwise part.
In practice, on an Apple iPhone X with the Metal Performance Shaders implementation, a 3×3 depthwise convolution in 16-bit floating point arithmetic takes 0.07 ms for a 56×56×128 tensor, while the subsequent 1×1 convolution from 128 to 128 channels is 4.3 times slower at 0.3 ms (the difference is not fully proportional to the pure arithmetic operation count because of fixed costs and memory access factors).
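The operation counts above can be checked with a few lines of arithmetic. The sketch below plugs the 56×56×128 tensor with 128 output channels into the two formulas; the function names are illustrative only.

```python
def depthwise_madds(s, c, k):
    """Multiply-adds of a k x k depthwise convolution on an s x s x c tensor."""
    return s * s * c * k * k

def pointwise_madds(s, c, d):
    """Multiply-adds of a 1 x 1 convolution from c to d channels on an s x s map."""
    return s * s * c * d

s, c, d = 56, 128, 128
dw3 = depthwise_madds(s, c, 3)   # 3,612,672
dw5 = depthwise_madds(s, c, 5)   # 10,035,200
pw = pointwise_madds(s, c, d)    # 51,380,224

# Arithmetic ratio pointwise / depthwise is d / k^2 = 128 / 9, about 14.2,
# while the measured ratio quoted in the text is only 4.3.
arithmetic_ratio = pw / dw3

# Growing the depthwise kernel from 3x3 to 5x5 raises the total cost of the
# separable convolution by a factor of (25 + 128) / (9 + 128), about 1.12.
total_growth = (dw5 + pw) / (dw3 + pw)
```

In terms of pure operation count, the larger kernel adds roughly 12% to the separable convolution while nearly tripling the work of the inexpensive depthwise part.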
This observation implies that increasing the kernel size of the depthwise part is relatively cheap. We use 5×5 kernels in the model architecture, trading the larger kernel for a substantial reduction in the number of bottlenecks required to reach a given receptive field size. The resulting BlazeBlock comes in the two variants shown below:
Single BlazeBlock (left) and double BlazeBlock (right)
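To see why a larger kernel reduces the number of layers needed, consider the simplified case of stacked stride-1 convolutions without dilation, where each k×k layer widens the receptive field by k - 1 pixels. This is a sketch under that assumption only; the actual network also uses strided blocks, which enlarge the receptive field faster.

```python
import math

def receptive_field(num_layers, kernel_size):
    """Receptive field of num_layers stacked stride-1 k x k convolutions."""
    return 1 + num_layers * (kernel_size - 1)

def layers_needed(target_rf, kernel_size):
    """Fewest stacked stride-1 k x k layers whose receptive field reaches target_rf."""
    return math.ceil((target_rf - 1) / (kernel_size - 1))

layers_3x3 = layers_needed(17, 3)  # 8 layers of 3x3
layers_5x5 = layers_needed(17, 5)  # 4 layers of 5x5
```

Under this simplification, 5×5 kernels reach a 17-pixel receptive field with half as many layers as 3×3 kernels.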
2. Feature extractor
As a concrete example, we focus on the feature extractor of the front camera model. It has to account for a smaller range of object scales and therefore has lower computational requirements. The extractor takes a 128×128 pixel RGB input and consists of a 2D convolution followed by 5 single BlazeBlocks and 6 double BlazeBlocks. The complete layout is shown in the table below. The maximum tensor depth (channel resolution) is 96 and the minimum spatial resolution is 8×8 (in contrast to SSD, which reduces the resolution all the way down to 1×1).
BlazeFace feature extractor network structure
3. Anchor scheme
SSD-like object detection models rely on predefined fixed-size base bounding boxes called priors, or anchors in Faster R-CNN terminology. A set of regression (and possibly classification) parameters, such as center offset and size adjustments, is predicted for each anchor. These parameters are used to adjust the predefined anchor position into a tight bounding rectangle.
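The adjustment of an anchor by its predicted parameters can be sketched as follows. The parameterization used here (center offsets scaled by the anchor size, log-scale size adjustments) is one common choice in SSD-style detectors and is an assumption of this example; the text does not state which encoding BlazeFace uses.

```python
import math
from typing import NamedTuple

class Box(NamedTuple):
    cx: float  # center x
    cy: float  # center y
    w: float   # width
    h: float   # height

def decode(anchor, dx, dy, dw, dh):
    """Turn an anchor plus predicted regression parameters into a bounding box.

    dx, dy shift the center in units of the anchor size; dw, dh rescale the
    width and height exponentially (hypothetical encoding for illustration).
    """
    return Box(
        cx=anchor.cx + dx * anchor.w,
        cy=anchor.cy + dy * anchor.h,
        w=anchor.w * math.exp(dw),
        h=anchor.h * math.exp(dh),
    )

anchor = Box(cx=0.5, cy=0.5, w=0.2, h=0.2)
unchanged = decode(anchor, 0.0, 0.0, 0.0, 0.0)       # all-zero offsets keep the anchor
shifted = decode(anchor, 0.5, 0.0, math.log(2), 0.0)  # half a width right, twice as wide
```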
The usual practice is to define anchors at multiple resolution levels according to the range of object scales, with aggressive downsampling also serving as a means of computational resource optimization. A typical SSD model uses predictions from 1×1, 2×2, 4×4, 8×8 and 16×16 feature map sizes. However, the success of the Pooling Pyramid Network (PPN) architecture (https://arxiv.org/pdf/1807.03284.pdf) suggests that additional computation may be redundant once a certain feature map resolution has been reached.
A key feature specific to GPU computation, as opposed to CPU computation, is the noticeable fixed cost of dispatching each layer computation. This cost becomes relatively significant for the deep low-resolution layers inherent in popular CPU-tailored architectures. For example, in one experiment we observed that out of 4.9 ms of MobileNetV1 inference time, only 3.9 ms were spent in actual GPU computation.
With this in mind, we adopted an alternative anchor scheme that stops at the 8×8 feature map size without further downsampling (Figure 2). We replaced the 2 anchors per pixel at each of the 8×8, 4×4 and 2×2 resolutions with 6 anchors at 8×8. Because human face aspect ratios vary little, fixing the anchors to a 1:1 aspect ratio was found to be sufficient for accurate face detection.
Anchor computation: SSD (left) and BlazeFace (right)
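The difference between the two anchor layouts can be sketched by generating anchor centers on a regular grid. This is an illustration only: the helper name, the normalized coordinates, and the per-cell count of 6 for the single-level layout (2 anchors from each of the three merged levels) are assumptions of this example, and anchor sizes are omitted.

```python
def grid_anchors(grid_size, anchors_per_cell):
    """Anchor centers (cx, cy) in normalized [0, 1] image coordinates."""
    centers = []
    for row in range(grid_size):
        for col in range(grid_size):
            cx = (col + 0.5) / grid_size
            cy = (row + 0.5) / grid_size
            # Every anchor of a cell shares the same center; only sizes differ.
            centers.extend([(cx, cy)] * anchors_per_cell)
    return centers

# Pyramid-style tail: 2 anchors per cell at the 8x8, 4x4 and 2x2 levels.
pyramid = [a for g in (8, 4, 2) for a in grid_anchors(g, 2)]
# Single-level layout: all anchors placed on the 8x8 grid.
single_level = grid_anchors(8, 6)

print(len(pyramid), len(single_level))  # 168 384
```

The single-level layout needs no layers below 8×8, which avoids dispatching additional low-resolution layers on the GPU.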
4. Post-processing
Since our feature extractor does not reduce the resolution below 8×8, the number of anchors overlapping a given object increases significantly with the object size. In a typical non-maximum suppression (NMS) scheme, only one of these anchors wins and is used as the final output of the algorithm. When such a model is applied to consecutive video frames, the predictions tend to fluctuate between different anchors and exhibit temporal jitter (human-perceptible noise).
To minimize this effect, we replace the suppression algorithm with a blending strategy that estimates the regression parameters of a bounding box as a weighted mean over the overlapping predictions. It incurs virtually no additional cost compared with the original NMS algorithm. For our face detection task, this adjustment increased accuracy by 10%.
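A minimal sketch of such a blending step is shown below. It follows the usual greedy NMS loop, but instead of discarding the overlapping predictions it averages them. Weighting by detection score, the IoU threshold of 0.3, and the (x_min, y_min, x_max, y_max) box format are assumptions of this example, not details confirmed by the text.

```python
def iou(a, b):
    """Intersection over union of two (x_min, y_min, x_max, y_max) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def blend_detections(boxes, scores, iou_threshold=0.3):
    """Greedy clustering as in NMS, but each cluster is score-averaged.

    Scores are assumed to be strictly positive.
    """
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    results = []
    while order:
        top = order[0]
        # The top-scoring box plus every remaining box that overlaps it enough.
        cluster = [top] + [i for i in order[1:]
                           if iou(boxes[top], boxes[i]) > iou_threshold]
        total = sum(scores[i] for i in cluster)
        blended = tuple(
            sum(scores[i] * boxes[i][k] for i in cluster) / total
            for k in range(4)
        )
        results.append((blended, scores[top]))
        order = [i for i in order if i not in cluster]
    return results

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
scores = [0.9, 0.6, 0.8]
blended = blend_detections(boxes, scores)
```

In this example the first two boxes merge into a single box near (0.4, 0.4, 10.4, 10.4), while the isolated third box passes through unchanged.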
We quantify the amount of jitter by feeding the network several slightly offset versions of the same input image and observing how the model output (adjusted for the offset) is affected. After the change of tie resolution strategy, the jitter metric, defined as the root mean square difference between the predictions for the original and the shifted inputs, drops by 40% on our front camera dataset and by 30% on the rear camera dataset, which contains smaller faces.
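The jitter metric can be sketched as a root mean square over prediction differences. In this example each prediction is reduced to a single (x, y) point and the known input shift is subtracted before comparison; both simplifications, as well as the function name, are assumptions made for illustration.

```python
import math

def jitter_rms(original_preds, shifted_preds, shift=(0.0, 0.0)):
    """RMS distance between paired predictions after undoing a known shift.

    original_preds and shifted_preds are equal-length lists of (x, y) points;
    shift is the (dx, dy) offset that was applied to the shifted inputs.
    """
    squared = []
    for (x0, y0), (x1, y1) in zip(original_preds, shifted_preds):
        ex = (x1 - shift[0]) - x0
        ey = (y1 - shift[1]) - y0
        squared.append(ex * ex + ey * ey)
    return math.sqrt(sum(squared) / len(squared))

original = [(10.0, 10.0), (20.0, 20.0)]
# Inputs shifted by (1, 0); the predictions drift by (3, 0) and (0, 4) on top of that.
shifted = [(14.0, 10.0), (21.0, 24.0)]
jitter = jitter_rms(original, shifted, shift=(1.0, 0.0))  # sqrt((9 + 16) / 2)
```

A model whose predictions follow the input shift exactly would score zero on this metric.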
We trained our models on a dataset of 66K images. For evaluation, we used a geographically diverse dataset of 2K images.
For the front camera model, only faces that occupy more than 20% of the image area were considered, as dictated by the intended use case (the threshold for the rear camera model is 5%).
The regression parameter errors were normalized by the inter-ocular distance (IOD) for scale invariance, and the median absolute error was 7.4% of IOD. The jitter metric evaluated by the procedure described above was 3% of IOD.
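The IOD normalization can be illustrated with a short sketch. Each sample pairs a predicted and a true scalar value with the two ground-truth eye centers of that face; the sample layout and function names are hypothetical and chosen only for this example.

```python
import math
from statistics import median

def inter_ocular_distance(left_eye, right_eye):
    """Euclidean distance between the two eye centers, in pixels."""
    return math.dist(left_eye, right_eye)

def median_error_in_iod(samples):
    """Median absolute error expressed as a fraction of each face's IOD.

    samples: iterable of (predicted, true, left_eye, right_eye) tuples.
    """
    normalized = [
        abs(predicted - true) / inter_ocular_distance(left_eye, right_eye)
        for predicted, true, left_eye, right_eye in samples
    ]
    return median(normalized)

samples = [
    (105.0, 100.0, (0.0, 0.0), (50.0, 0.0)),   # 5 px error, IOD 50
    (98.0, 100.0, (0.0, 0.0), (100.0, 0.0)),   # 2 px error, IOD 100
    (110.0, 100.0, (0.0, 0.0), (40.0, 0.0)),   # 10 px error, IOD 40
]
error = median_error_in_iod(samples)
```

Dividing by the IOD makes errors on large, close-up faces comparable with errors on small, distant ones.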
Figure 4 shows the average precision (AP) metric (with the standard 0.5 intersection-over-union bounding box match threshold) and the mobile GPU inference time of the proposed frontal face detection network, compared with a MobileNetV2-based object detector (MobileNetV2-SSD). We used the TensorFlow Lite GPU backend in 16-bit floating point mode as the framework for inference time evaluation.
Front camera face detection performance
Figure 5 gives an overview of the GPU inference speed of the two network models on more flagship devices:
Inference speed across multiple mobile devices
Figure 6 shows how much the regression parameter prediction quality degrades because of the smaller model size. As described in the following section, this does not necessarily lead to a proportional degradation in the quality of the entire AR pipeline.
Quality of regression parameter prediction
These models can operate on full images or video frames and can serve as the first step of virtually any face-related computer vision application, such as 2D/3D facial keypoint, contour or surface geometry estimation, facial feature or expression classification, and face region segmentation. The subsequent task in the computer vision pipeline can therefore be defined in terms of a proper face crop. Combined with the few facial keypoint estimates provided by BlazeFace, this crop can also be rotated so that the face inside it is centered, scale-normalized and has a roll angle close to zero. This removes the requirement of significant translation and rotation invariance from the task-specific model, allowing for better allocation of computational resources.
We illustrate this approach with the example of face contour estimation. Figure 7 shows how the output of BlazeFace, i.e. the predicted bounding box and the six facial keypoints (red), is further refined by a more complex face contour estimation model applied to an expanded crop.
Pipeline example: red for BlazeFace output, green for task-specific model output
The detailed keypoints yield a finer bounding box estimate (green) that is reused for tracking in subsequent frames without running the face detector. To detect failures of this computation-saving strategy, the contour model can also detect whether the face is actually present and reasonably aligned in the provided rectangular crop. Whenever this condition is violated, the BlazeFace face detector is run on the entire video frame again.
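The control flow of this strategy can be sketched as a loop that only calls the detector when no valid crop is carried over from the previous frame. Here `detect` and `refine` are hypothetical callables standing in for the face detector and the task-specific model, and, as a simplification, a failed alignment check triggers detection on the next frame rather than re-running the current one.

```python
def run_pipeline(frames, detect, refine):
    """Track a face across frames, re-detecting only after a failed check.

    detect(frame) returns a crop or None.
    refine(frame, crop) returns (refined_crop, aligned), where aligned says
    whether a face is present and reasonably aligned in the given crop.
    """
    crop = None
    detector_runs = 0
    outputs = []
    for frame in frames:
        if crop is None:
            crop = detect(frame)
            detector_runs += 1
            if crop is None:  # no face found in this frame
                outputs.append(None)
                continue
        refined_crop, aligned = refine(frame, crop)
        # Keep the refined crop for the next frame only if the check passed.
        crop = refined_crop if aligned else None
        outputs.append(crop)
    return outputs, detector_runs

# Stub models: tracking fails on frame 2, so the detector runs on frames 0 and 3.
outputs, detector_runs = run_pipeline(
    frames=range(5),
    detect=lambda frame: ("detected", frame),
    refine=lambda frame, crop: (("tracked", frame), frame != 2),
)
```

With these stubs the detector runs twice over five frames, and the remaining frames reuse the crop refined on the previous frame.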