A few days ago, deepfake technology appeared in India's elections, where a candidate used AI-generated footage in campaign promotional materials. Although that campaign ended in fiasco, it shows that the AI face-swapping trend ignited by deepfakes is heating up. Yet while deepfakes grow ever more convincing, anti-deepfake technology has lagged behind. Recently, Microsoft Research Asia proposed a "face X-ray" method for detecting face-swapped images.
The technique is described in the paper "Face X-Ray for More General Face Forgery Detection." According to the paper, such tools can help prevent face-swapped images from being abused.
Unlike existing methods, it can accurately detect "unseen" forgeries: no matter which algorithm synthesized an image, it can be detected without method-specific training.
Overview of generating training samples
More specifically, the method generates a grayscale image showing whether a given input can be decomposed into a blend of two images from different sources. After all, most face-swapping approaches work by blending a generated face into an existing image.
This means face X-ray can not only judge whether a picture is a composite, but also point out where the blending occurred, performing identification and localization at once.
As shown above, the image below is clearly a composite.
The core idea of the algorithm is to recognize the distinctive marks each image carries. These marks have many causes: software factors such as compression algorithms, or hardware factors such as camera sensors.
Compared with other face-forgery detectors, face X-ray is more effective at recognizing previously unseen forged images and at reliably predicting the blended region.
Comparison with the experimental results of a binary detector
However, the paper also points out that the method relies on a blending step, so it may not apply to fully synthesized images and may be fooled by adversarial examples.
1. Related work
With the rapid development of fake-face technology, many algorithms can synthesize images, and the results are increasingly realistic. Since forged images may be misused, face-swap detection is an important research topic.
Such detection has been studied in academia, but most approaches are binary classifiers. Although they can reach 98% accuracy, they tend to overfit the manipulations seen in training: when processing images produced by unfamiliar methods, their performance drops significantly.
More specifically, the technology that distinguishes a live person from a photo is called liveness detection. Current techniques rely mainly on resolution, three-dimensional information, eye movement, and so on, because a re-photographed picture differs in quality and resolution from an image captured directly from a real person.
For video spoofing, cues such as three-dimensional information and lighting can be used to tell real from fake.
As for deployed applications, Google's Jigsaw once launched a photo-forensics tool called Assembler with seven detectors. Five were developed by university research teams in the United States and Italy, each responsible for detecting photos processed by a different type of manipulation, such as splicing or erasure.
The other two detectors were developed by Jigsaw's own team. One aims to identify deepfakes, the AI face swapping that has sparked heated discussion over the past two years; it uses machine learning to distinguish real images from deepfakes generated with StyleGAN.
For fake images, it marks the areas that may have been spliced. The face X-ray method targets what composite pictures have in common: splicing, that is, blending one picture into another. It detects the likely blended region, analyzes the differences, finds the image marks, and judges whether the picture is a composite.
2. Face X-ray algorithm details
A typical face synthesis method consists of three stages:
1. Detect the facial area;
2. Synthesize the desired target face;
3. Blend the target face into the original image.
It is worth noting that "self-supervision" here is in quotation marks: unlike the usual definition, it means the algorithm is not trained on any face-swapping database. As mentioned earlier, image marks come mainly from two sources, hardware and software. In an unmodified image, the marks produced by hardware and software are usually periodic or uniform across the whole picture. Once part of the image is altered, that uniformity breaks, so the marks can be used to judge whether the image is a composite. At the algorithm level, a composite image is defined as:

I_M = M ⊙ I_F + (1 − M) ⊙ I_B

where ⊙ denotes element-wise multiplication, I_F is the image providing the facial attributes, I_B is the image providing the background, and M is the mask delimiting the manipulated region, with each pixel's gray value between 0.0 and 1.0.
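The blending step can be sketched in a few lines of NumPy. This is a minimal illustration of the formula, not the paper's implementation; the function name and array shapes are assumptions:

```python
import numpy as np

def blend(foreground, background, mask):
    """Composite per the formula I_M = M * I_F + (1 - M) * I_B.

    mask holds per-pixel weights in [0.0, 1.0]; a soft edge in the mask
    is exactly what leaves a telltale blending boundary behind.
    """
    return mask * foreground + (1.0 - mask) * background
```

With a purely binary mask, the foreground replaces the face region outright; fractional mask values near the edge mix the two sources, and that mixing is the trace face X-ray looks for.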
For example, define the face X-ray as an image B: if the input is a composite, B displays the blended region; if the input is a real image, B is 0 at every pixel. Concretely, B can be computed from the mask as B = 4 · M ⊙ (1 − M), which vanishes wherever M is exactly 0 or 1 and peaks along the soft blending boundary.
In essence, the purpose of face X-ray is to test whether an image decomposes into two images from different sources. After all, images from different sources carry subtle differences that the human eye cannot find but a computer can.
In other words, face X-ray is a computational representation of those differences, and it cares only about the blending boundary.
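The boundary-only behavior falls out of the definition B = 4·M·(1−M): a sketch, with illustrative names, shows that B is zero inside pure foreground or background and maximal (1.0) where the mask is exactly 0.5:

```python
import numpy as np

def face_xray(mask):
    # B = 4 * M * (1 - M): 0 wherever M is exactly 0 or 1,
    # peaking at 1.0 on the soft boundary where M = 0.5.
    return 4.0 * mask * (1.0 - mask)
```

For a real image there is no blending mask at all (M ≡ 0), so B is all zeros, matching the definition in the text.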
Then comes the "self-supervised" learning module. The difficulty of this part is obtaining the corresponding training data from real pictures alone. It is divided into three steps.
1. Given a real image, search for another image to serve as its variant. Facial landmarks are used as the matching criterion, and the search is based on Euclidean distance.
2. Generate a mask to delimit the "fake" region.
3. Obtain the blended image through the first formula above, then derive the blending boundary according to the second formula. In practice, the labeled data are generated dynamically as training proceeds, and the framework is trained in a self-supervised way. Operating only on real images can thus generate a large amount of training data.
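The three steps can be sketched end to end. The landmark matching, mask, and two formulas follow the text; the function names, array shapes, and the B = 4·M·(1−M) boundary form are assumptions of this sketch:

```python
import numpy as np

def nearest_by_landmarks(query_lm, gallery_lms):
    # Step 1: find the real image whose facial landmarks are closest to
    # the query's, measured by Euclidean distance over all landmark points.
    dists = np.sqrt(((gallery_lms - query_lm) ** 2).sum(axis=(1, 2)))
    return int(np.argmin(dists))

def make_training_pair(foreground, background, mask):
    # Steps 2-3: given a mask delimiting the "fake" region, build the
    # blended image (first formula) and its boundary label (second formula).
    blended = mask * foreground + (1.0 - mask) * background
    xray = 4.0 * mask * (1.0 - mask)
    return blended, xray
```

Because each pair is generated on the fly from real images during training, an effectively unlimited supply of labeled samples is available without any face-swap dataset.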
During training, since deep learning has strong representation-learning ability, the researchers adopt a convolutional-neural-network framework: the input is an image and the output is its face X-ray, from which a blending probability (whether the image is real) is predicted. Widely used loss functions are employed; for the face X-ray itself, a cross-entropy loss measures prediction accuracy. Overall, face X-ray needs no artifact knowledge tied to any specific face-manipulation technique, and the algorithm behind it can be trained without fake images generated by any such method.
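The cross-entropy term on the predicted face X-ray can be written as follows. This is a sketch assuming sigmoid outputs in (0, 1); the paper's network architecture and exact loss weighting are not reproduced here:

```python
import numpy as np

def xray_loss(pred, target, eps=1e-7):
    # Average binary cross-entropy over all pixels of the face X-ray.
    pred = np.clip(pred, eps, 1.0 - eps)  # guard against log(0)
    bce = -(target * np.log(pred) + (1.0 - target) * np.log(1.0 - pred))
    return float(bce.mean())
```

A standard cross-entropy term on the real/fake probability is added on top; together they train the network without any manipulation-specific artifact knowledge.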
In the experiments, the researchers trained face X-ray on FaceForensics++ together with an additional training set of blended images built from real images; training used only the "real" images in the database, never the fakes. FaceForensics++ is a large video corpus containing more than 1,000 original clips manipulated with four state-of-the-art facial manipulation methods: DeepFakes, Face2Face, FaceSwap, and NeuralTextures.
The generalization ability of face X-ray was then evaluated on four datasets: FaceForensics++, DeepfakeDetection, the Deepfake Detection Challenge (DFDC), and Celeb-DF.
Generalization capability assessment
First, the face X-ray detection model is evaluated using the same training set and training strategy as Xception. To obtain an accurate ground-truth face X-ray, a pair of real and fake images is given, with the real image as the background and the face-swapped image as the foreground. For a fair comparison, binary-classification results are also reported. The results are as follows:
For unseen face-swap methods, using only a binary classifier leads to performance degradation.
In addition, the improved generalization ability comes mainly from two parts: 1. detecting the face X-ray rather than manipulation-specific artifacts; 2. building a large number of training samples from real images. The results show that high detection accuracy can be achieved using self-supervised data alone.
Benchmark results for unknown datasets
The test results are reported as AUC, AP, and EER. The framework shown in the figure below outperforms the baselines. If additional face-swapped images are used for training, performance improves even when they are distributed differently from the test set.
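For readers unfamiliar with these metrics, they can be computed from detector scores as below. This is a pure-NumPy sketch (labels are 1 for fake, and a higher score means "more likely fake"); production code would typically use a library such as scikit-learn:

```python
import numpy as np

def roc_points(labels, scores):
    # Sweep thresholds from high to low, accumulating true/false positives.
    order = np.argsort(-scores)
    hits = labels[order]
    tpr = np.cumsum(hits) / max(hits.sum(), 1)
    fpr = np.cumsum(1 - hits) / max((1 - hits).sum(), 1)
    return fpr, tpr

def auc(labels, scores):
    # Area under the ROC curve, via the trapezoid rule.
    fpr, tpr = roc_points(labels, scores)
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0))

def eer(labels, scores):
    # Equal error rate: where false-positive and false-negative rates meet.
    fpr, tpr = roc_points(labels, scores)
    fnr = 1.0 - tpr
    i = int(np.argmin(np.abs(fnr - fpr)))
    return float((fpr[i] + fnr[i]) / 2.0)
```

AP (average precision) is computed analogously from the precision-recall curve. Higher AUC and AP are better; lower EER is better.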
The figure below shows visual examples for various types of face-swapped images. The ground truth is obtained by computing the difference between the fake face and the real image, converting it to grayscale, and normalizing. As shown, face X-ray reflects the ground truth well.
Blending boundary predicted by the algorithm
Comparison with current work
Recently, some related work has also addressed the generalization problem and tried to solve it to some extent. FWA likewise uses a self-supervised method to create negative samples from real images; however, its goal is only to describe the face warping artifacts that exist specifically in deepfake-generated video.
(Tables 3–5 are shown as figures; Table 6 can be disregarded.)
The other works in the table above try to learn intrinsic representations, while MTDS learns detection and localization simultaneously. In comparison, face X-ray surpasses the existing state of the art.
Analysis of the proposed framework
The overall goal of data augmentation in self-supervised data generation is to provide a large number of diverse blended images, so that the model can detect all kinds of tampered images.
In this part, the authors study two important augmentation strategies: a) mask deformation, which brings greater variation to the shape of the face X-ray; b) color correction, which produces more realistic blended images. Both strategies are important for producing diverse, high-quality data samples, and both help network training.
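The two strategies might be sketched as follows. The exact deformation and color transfer used in the paper are richer, so treat the operations here (a small random shift for the mask, mean/std matching for color) as illustrative stand-ins only:

```python
import numpy as np

def deform_mask(mask, rng):
    # a) Mask deformation (stand-in): jitter the mask by a small random
    # shift so the blending boundary's shape varies between samples.
    dy, dx = rng.integers(-2, 3, size=2)
    return np.roll(mask, (int(dy), int(dx)), axis=(0, 1))

def color_correct(fg, bg, mask):
    # b) Color correction (stand-in): match the foreground's mean and
    # standard deviation to the background's inside the blend region,
    # so the composite looks more photometrically consistent.
    region = mask > 0.5
    mu_f, sd_f = fg[region].mean(), fg[region].std() + 1e-7
    mu_b, sd_b = bg[region].mean(), bg[region].std() + 1e-7
    return (fg - mu_f) / sd_f * sd_b + mu_b
```

Applying a random pick of such augmentations per sample keeps the generated blends from collapsing into one easily memorized pattern.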
In addition, during self-supervised data generation, different blending types are used to construct the test data, and the model trained with alpha blending is evaluated against them. The results are shown in the figure below.
One More Thing
Face X-ray works wonders on "semi-synthetic" images, but it has two limitations. The first concerns purely synthetic images: because no blending marks exist to recognize, face X-ray cannot handle them. This is the point noted earlier: "this method relies on a blending step, so it may not be suitable for fully synthesized images."
The second limitation is that if someone trains adversarial examples specifically against this algorithm, face X-ray may also fail.
In addition, like other face-swap detection technologies, this one is sensitive to image resolution: if the resolution is low, the face X-ray detection rate will be low.
Left: Guo Baining. Right: Chen Dong
Q: What is the solution to face X-ray's inability to accurately recognize fully synthesized images and adversarial examples?
A: We are still researching this. One plan is to work on detecting background details, because synthetic pictures generally have rough background processing. Another idea is to train the algorithm by comparing real images with fake ones: a celebrity or other frequently photographed person has a unique identity signature, and that signature can also serve as training data to improve the algorithm.
Q: Can face X-ray recognize a face photo modified by an image-inpainting tool?
A: The focus of face X-ray's work is not to judge whether a picture is the original, but to distinguish "real" from "fake"; after all, fake videos and pictures have a very negative impact on society.

Q: How will the algorithm be deployed? When can it be integrated into applications?

A: Our algorithmic breakthrough is progress just made; it will take a while before concrete applications are implemented.