CVPR, hosted by the IEEE (Institute of Electrical and Electronics Engineers), the world's largest nonprofit professional technical society, has been the most influential and comprehensive top-tier academic conference in computer vision over the past decade. In the 2017 Google Scholar Metrics rankings of publications, CVPR ranked first in computer vision. According to Lei Feng Net's AI Technology Review, CVPR reviewed 2,620 papers this year and accepted 783 of them, an acceptance rate of roughly 30% (783/2,620 ≈ 29.9%). The acceptance rate for oral presentations was only 2.65%.
Dr. Liu Wei, director of computer vision at Tencent AI Lab, said: "CVPR oral presentations generally cover the most cutting-edge research topics and are influential in both academia and industry. Every year they come from the world's best-known universities and technology companies, such as Stanford University and Google."
This year, six papers from Tencent AI Lab were accepted to CVPR. Below, Lei Feng Net's AI Technology Review summarizes each of them in turn.
Paper 1: "Real-Time Video Style Transfer" (Real Time Neural Style Transfer for Videos)
Recent work has shown that feed-forward convolutional neural networks can perform fast image transformations. Researchers from Tsinghua University and Tencent AI Lab take this a step further: they use a feed-forward network to stylize video while keeping the stylized frames temporally consistent. The authors train the feed-forward network by requiring its outputs for consecutive frames both to preserve the target style and to remain temporally continuous. More specifically, they propose a hybrid loss that combines the content information of each input frame, the style information of the style image, and the temporal information of consecutive frames. To compute the temporal loss during training, they introduce a new two-frame synergic training mechanism. Compared with applying existing image style transfer directly to video, and unlike the earlier video style transfer method that relies on time-consuming on-the-fly optimization, the new approach preserves temporal continuity between frames and eliminates flicker, so that video stylization runs in real time with high quality and consistency and produces more visually pleasing results.
Paper link: Real Time Neural Style Transfer for Videos
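The hybrid loss described above can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation. It assumes that feature maps from some fixed loss network are already available as arrays, uses the common Gram-matrix formulation of style loss, and assumes the previous stylized frame has already been warped into the current frame (the `warped_prev` input, for example via optical flow) with `mask` marking pixels where that warp is reliable. The weights `alpha`, `beta` and `lam` are arbitrary placeholders.

```python
import numpy as np

def gram(feat: np.ndarray) -> np.ndarray:
    """Normalized Gram matrix of a (C, H*W) feature map: channel-to-channel correlations."""
    c, n = feat.shape
    return feat @ feat.T / (c * n)

def content_loss(out_feat: np.ndarray, in_feat: np.ndarray) -> float:
    """Keep the stylized frame's features close to the input frame's features."""
    return float(np.mean((out_feat - in_feat) ** 2))

def style_loss(out_feat: np.ndarray, style_feat: np.ndarray) -> float:
    """Match feature correlations (Gram matrices) to those of the style image."""
    return float(np.sum((gram(out_feat) - gram(style_feat)) ** 2))

def temporal_loss(out_t: np.ndarray, warped_prev: np.ndarray, mask: np.ndarray) -> float:
    """Penalize changes between consecutive stylized frames where the (assumed) warp is reliable."""
    return float(np.sum(mask * (out_t - warped_prev) ** 2) / out_t.size)

def two_frame_loss(out_feats, in_feats, style_feat, out_next, warped_prev, mask,
                   alpha=1.0, beta=10.0, lam=100.0) -> float:
    """Hybrid loss for a frame pair (t, t+1): content and style terms on both frames,
    plus a temporal term between the stylized frame t+1 and the warped stylized frame t."""
    spatial = sum(alpha * content_loss(o, i) + beta * style_loss(o, style_feat)
                  for o, i in zip(out_feats, in_feats))
    return spatial + lam * temporal_loss(out_next, warped_prev, mask)
```

Training on pairs of frames, rather than single frames, is what makes the temporal term computable at all: it needs two network outputs at once.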
Paper 2: "WSISA: A Survival Prediction Method Based on Pathology Images" (WSISA: Making Survival Prediction from Whole Slide Histopathological Images)
The University of Texas at Arlington and Tencent AI Lab propose a method for predicting patient survival from pathology images, aimed at supporting precise, personalized medicine in the era of big data. Image-based precision medicine has long been a topic in computer vision, with the goal of helping cancer patients receive better treatment. However, the gigapixel resolution of histopathological whole-slide images (WSIs) makes the computational burden of traditional survival models prohibitive in this setting. Such models usually require manual annotation of discriminative regions of interest (ROIs), which means a computer cannot learn directly from the discriminative patches of a gigapixel image. In addition, because of tumor heterogeneity, a small number of patches cannot fully represent a patient's survival status. Training samples for survival prediction are also usually scarce. Together, these factors make survival prediction difficult. In this paper, the authors propose WSISA, an effective analysis framework that overcomes these difficulties: an annotation-free method that predicts patient survival from whole-size pathology images. First, patches are extracted from each WSI by adaptive sampling and then grouped into clusters. The authors then train an aggregation model on the cluster-level predictions of deep convolutional survival (DeepConvSurv) models to produce patient-level predictions. Unlike existing image-based survival models, this approach can extract and exploit all discriminative patches on a WSI for prediction, which had not been done before in this field. Using this method, the authors studied survival prediction for glioma and non-small-cell lung cancer on three datasets.
The results show that the WSISA framework substantially improves prediction accuracy.
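The cluster-then-aggregate idea can be sketched as below. This is only an illustration under stated assumptions: patch feature vectors are given, plain k-means stands in for the paper's clustering step, and the per-patch risk scores (`risks`) are assumed to come from hypothetical per-cluster DeepConvSurv models. Weighting each cluster's mean risk by the share of a patient's patches that fall in that cluster is our own simple choice of aggregation, not necessarily the paper's.

```python
import numpy as np

def kmeans(X: np.ndarray, k: int, iters: int = 20):
    """Plain k-means on patch features X of shape (n, f); returns labels and centers."""
    centers = X[:k].astype(float)  # deterministic init (first k patches), for illustration only
    for _ in range(iters):
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)  # (n, k)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels, centers

def patient_features(patient_ids: np.ndarray, labels: np.ndarray, risks: np.ndarray, k: int):
    """One k-dim vector per patient: mean patch risk in each cluster, weighted by cluster share."""
    feats = {}
    for pid in np.unique(patient_ids):
        mine = patient_ids == pid
        vec = np.zeros(k)
        for j in range(k):
            sel = mine & (labels == j)
            if sel.any():
                vec[j] = risks[sel].mean() * sel.sum() / mine.sum()
        feats[str(pid)] = vec
    return feats

# Toy data: four patches from two patients, with hypothetical per-patch risk scores.
X = np.array([[0.0, 0.0], [10.0, 10.0], [0.1, 0.0], [10.0, 10.1]])
labels, _ = kmeans(X, 2)
ids = np.array(["a", "a", "a", "b"])
risks = np.array([0.2, 0.8, 0.4, 0.5])
feats = patient_features(ids, labels, risks, 2)
```

The resulting per-patient vectors could then be passed to a conventional patient-level survival model. Because every patch contributes through its cluster, no manual ROI annotation is needed in this sketch.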
"Self-Taught Learning for Weakly Supervised Object Localization" - "Self-Taught Learning for Weakly Supervised Object Localization"
This paper, jointly published by the National University of Singapore and Tencent AI Lab, proposes a new deep self-taught learning method in which the detector itself improves the quality of its training samples and thereby continuously strengthens its own performance, addressing the training-sample quality bottleneck in weakly supervised object localization. Most existing weakly supervised localization (WSL) methods learn detectors from discriminative features identified by classifiers trained with image-level supervision. However, these features carry no spatial location information, so the samples they provide to the detector are of poor quality. To overcome this, the paper proposes a deep self-taught learning approach in which the detector learns to acquire reliable object-level samples and uses them as the basis for retraining itself. As the detector's detection ability improves and the location information it provides becomes more accurate, the quality of its training samples improves in turn. To bootstrap this process, the paper proposes a seed sample acquisition method that obtains reliable positive samples for initializing the detector through image-to-object transfer and dense subgraph discovery. The authors further propose an online supportive sample harvesting scheme that dynamically selects the most confident positive samples and provides a refined training procedure for the detector. To prevent the detector from getting trapped by overfitting during training, the authors also introduce a mechanism to guide the self-taught learning process. Experiments on PASCAL VOC 2007 and 2012 show that the approach significantly outperforms existing methods.
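The online harvesting step can be sketched as a loop that alternates between scoring region proposals with the current detector and retraining on the most confident ones. This is a generic self-training skeleton under our own assumptions, not the paper's exact scheme: `score_fn` and `retrain_fn` are hypothetical callables standing in for a real detector, and the top-`k`/`min_score` rule is a simplification of the paper's confidence-based selection and guidance mechanism.

```python
from typing import Callable, Dict, List, Tuple

Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2)

def harvest_positives(scores: Dict[str, List[Tuple[Box, float]]],
                      k: int = 1, min_score: float = 0.5) -> Dict[str, List[Box]]:
    """Keep each image's top-k proposals whose current detector score is at least min_score."""
    selected = {}
    for image_id, scored in scores.items():
        ranked = sorted(scored, key=lambda bs: bs[1], reverse=True)
        selected[image_id] = [box for box, s in ranked[:k] if s >= min_score]
    return selected

def self_taught_rounds(detector, proposals: Dict[str, List[Box]],
                       score_fn: Callable, retrain_fn: Callable,
                       rounds: int = 3, k: int = 1, min_score: float = 0.5):
    """Alternate: score proposals with the current detector, harvest positives, retrain.

    `detector` is assumed to be initialized elsewhere (for example from seed samples).
    """
    for _ in range(rounds):
        scores = {img: [(box, score_fn(detector, img, box)) for box in boxes]
                  for img, boxes in proposals.items()}
        detector = retrain_fn(detector, harvest_positives(scores, k, min_score))
    return detector
```

The point of the loop is the feedback: a better detector yields better positives, which in turn yield a better detector, which is why a guard against drifting onto bad samples is needed in practice.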
Paper 4: "Diverse Image Annotation"
King Abdullah University of Science and Technology (KAUST) in Saudi Arabia and Tencent AI Lab present a new automatic image annotation method: describe as much of the image's information as possible with a small number of tags, and make full use of the semantic relations between tags so that automatic annotations are closer to human annotations. The goal of DIA (diverse image annotation) is to describe an image with a limited number of tags, so the chosen tags must cover as much useful information as possible. Compared with conventional image annotation, DIA requires tags that are not only representative of the image but also complementary to one another, reducing redundancy. To achieve this, the authors formulate DIA as a subset selection problem based on a conditional DPP (determinantal point process) model, which captures both representativeness and diversity. By further exploring the semantic hierarchy and synonyms among the candidate tags, they choose appropriate semantic paths: two tags on the same semantic hierarchy path, or two synonyms, may not be selected together. This constraint is embedded in the sampling algorithm of the conditional DPP model. Traditional annotation methods choose tags based only on how well they represent the whole image (measured by precision, recall and F1 score) and ignore tag diversity, so the proposed method is a major improvement over them. A separate human subject study shows that the annotations produced by this method are closer to human annotations, and experiments on two benchmark datasets confirm that its diverse image annotations are more satisfactory.
Paper link: Diverse Image Annotation
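The constrained subset selection can be illustrated with a greedy approximation to DPP MAP inference. This is a sketch, not the paper's sampling algorithm: we build the kernel as L = diag(q) S diag(q) from hypothetical per-tag quality scores `q` (representativeness) and a tag similarity matrix `S` (diversity), and we model the semantic constraint as a set of tag pairs (same hierarchy path, or synonyms) that may not be selected together. The tags, scores and similarities in the usage lines are made up.

```python
import numpy as np

def greedy_constrained_dpp(quality, similarity, k, conflicts=frozenset()):
    """Greedily pick up to k tags maximizing log det of the DPP kernel submatrix.

    quality:    (n,) per-tag relevance scores (representativeness)
    similarity: (n, n) positive semidefinite tag similarity (diversity)
    conflicts:  set of frozenset({i, j}) pairs that must not be selected together
    """
    q = np.asarray(quality, dtype=float)
    L = q[:, None] * np.asarray(similarity, dtype=float) * q[None, :]  # diag(q) S diag(q)
    selected = []
    for _ in range(k):
        best, best_val = None, -np.inf
        for i in range(len(q)):
            if i in selected or any(frozenset((i, j)) in conflicts for j in selected):
                continue
            idx = selected + [i]
            sign, logdet = np.linalg.slogdet(L[np.ix_(idx, idx)])
            if sign > 0 and logdet > best_val:
                best, best_val = i, logdet
        if best is None:  # no admissible tag left
            break
        selected.append(best)
    return selected

tags = ["dog", "puppy", "grass", "animal"]
q = [0.9, 0.85, 0.7, 0.6]
S = [[1.0, 0.9, 0.1, 0.5],
     [0.9, 1.0, 0.1, 0.5],
     [0.1, 0.1, 1.0, 0.1],
     [0.5, 0.5, 0.1, 1.0]]
conflicts = {frozenset({0, 1}), frozenset({0, 3})}  # dog/puppy: synonyms; dog/animal: same path
print([tags[i] for i in greedy_constrained_dpp(q, S, 3, conflicts)])  # ['dog', 'grass']
```

Even without the constraint, the determinant already penalizes near-duplicates such as "dog" and "puppy"; the explicit conflict set turns that soft preference into a hard rule, which mirrors how the paper embeds the semantic constraint into selection.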
"Designing 3D Objects with Single and Multiple Images Using Symmetry and / or Manhattan Properties" & mdash; & mdash; Exploiting Symmetry and / or Manhattan Properties for 3D Object Structure Estimation from Single and Multiple Images
This paper by Tencent AI Lab, Johns Hopkins University and the University of California, Los Angeles presents a method for estimating the 3D structure of objects from single and multiple images using symmetry and/or Manhattan properties. Many man-made objects have intrinsic symmetry and Manhattan structure. Assuming an orthographic projection model, the paper proposes a method for estimating 3D structure from symmetry and Manhattan properties when the input is a single image or multiple images of objects from the same category, such as different cars. The analysis shows that, for a single image, Manhattan properties alone are sufficient to recover the camera projection, after which symmetry can be used to recover the 3D structure. However, Manhattan properties are often hard to extract from a single image because of occlusion. In the multiple-image case, the object's symmetry can still be exploited and Manhattan axes are no longer needed. Building on this idea, the authors propose a new rigid structure-from-motion method that exploits object symmetry and takes multiple images from the same category as input to estimate 3D object structure. Experiments on the Pascal3D+ dataset show that the method clearly outperforms baseline methods.
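Under orthographic projection, symmetry gives a simple geometric constraint that hints at why it helps recover camera and structure: segments joining mirror-symmetric keypoints all project to parallel image segments, whose common direction depends only on the camera rotation. The sketch below merely demonstrates this property with made-up keypoints and a symmetry plane assumed to be x = 0; it is not the authors' rigid structure-from-motion algorithm, and the rotation convention is our own choice.

```python
import numpy as np

def rotation(yaw: float, pitch: float, roll: float) -> np.ndarray:
    """Rotation matrix R = Rz(yaw) @ Ry(pitch) @ Rx(roll)."""
    cz, sz = np.cos(yaw), np.sin(yaw)
    cy, sy = np.cos(pitch), np.sin(pitch)
    cx, sx = np.cos(roll), np.sin(roll)
    Rz = np.array([[cz, -sz, 0], [sz, cz, 0], [0, 0, 1]])
    Ry = np.array([[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]])
    Rx = np.array([[1, 0, 0], [0, cx, -sx], [0, sx, cx]])
    return Rz @ Ry @ Rx

def orthographic_project(points: np.ndarray, R: np.ndarray) -> np.ndarray:
    """Orthographic camera: rotate, then drop depth (keep the first two rows of R)."""
    return points @ R[:2].T

# Hypothetical keypoints on one side of an object whose symmetry plane is x = 0.
left = np.array([[1.0, 0.5, 0.2], [0.8, -0.4, 1.0], [1.2, 0.1, -0.7]])
right = left * np.array([-1.0, 1.0, 1.0])  # mirror images across x = 0
R = rotation(0.3, -0.5, 0.2)
d = orthographic_project(left, R) - orthographic_project(right, R)  # image segments between pairs

# Each segment equals 2 * x_i * R[:2, 0], so all segments share one image direction.
print(np.allclose(d, 2 * left[:, [0]] * R[:2, 0]))  # True
```

Because the shared direction is the projected first column of R, observing several symmetric pairs constrains the camera rotation, which is one intuition for why symmetry is useful in this setting.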
Paper 6: "SCA-CNN: An Attention Model in Convolutional Neural Networks" (SCA-CNN: Spatial and Channel-wise Attention in Convolutional Networks for Image Captioning)
The paper "SCA-CNN: An Attention Model in Convolutional Neural Networks", published by Zhejiang University, Columbia University, Shandong University, Tencent AI Lab and the National University of Singapore, targets image captioning: it uses a convolutional network to generate textual descriptions dynamically and proposes an attention model that is both spatial and channel-wise. Visual attention has already been applied successfully to structured prediction tasks such as image captioning and visual question answering. Existing visual attention models are usually spatial: attention is modeled as spatial probabilities that reweight the last convolutional-layer feature map of a convolutional neural network (CNN) encoding the input image. However, the authors argue that such spatial attention does not necessarily conform to the attention mechanism, namely a dynamic feature extractor that combines contextual fixations over time, because CNN features are naturally spatial, channel-wise and multi-layer. In this paper, the authors introduce a novel convolutional neural network, SCA-CNN, that incorporates spatial and channel-wise attention into a CNN. For image captioning, SCA-CNN dynamically modulates the sentence-generation context in multi-layer feature maps, encoding two aspects of visual attention: where (the attended spatial locations) and what (the attended channels). The authors evaluate SCA-CNN on three benchmark image captioning datasets, Flickr8K, Flickr30K and MSCOCO, and the evaluation shows that it clearly outperforms existing methods.
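The channel-then-spatial attention idea can be sketched in NumPy as below. This is a simplified, single-layer illustration with random weights: the parameter names and shapes (`wc`, `Whc`, `w_beta`, `Ws`, `Whs`, `w_alpha`, attention size `k`) are our own assumptions, `h` stands for the caption decoder's previous hidden state, and the modulated features are formed by simple elementwise reweighting. It is not the authors' code.

```python
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    e = np.exp(x - x.max())
    return e / e.sum()

def channel_spatial_attention(V: np.ndarray, h: np.ndarray, p: dict):
    """V: (m, C) feature map with m = H*W locations; h: (d,) decoder hidden state.

    Returns modulated features (m, C), spatial weights alpha (m,), channel weights beta (C,).
    """
    # Channel-wise attention ("what"): score each channel from its spatially pooled value.
    v_mean = V.mean(axis=0)                                 # (C,)
    b = np.tanh(np.outer(v_mean, p["wc"]) + p["Whc"] @ h)   # (C, k)
    beta = softmax(b @ p["w_beta"])                         # (C,)
    Vc = V * beta                                           # reweight channels
    # Spatial attention ("where"): score each location of the channel-weighted map.
    a = np.tanh(Vc @ p["Ws"] + p["Whs"] @ h)                # (m, k)
    alpha = softmax(a @ p["w_alpha"])                       # (m,)
    return Vc * alpha[:, None], alpha, beta

rng = np.random.default_rng(0)
m, C, d, k = 6, 4, 3, 5  # locations, channels, hidden size, attention size (all hypothetical)
params = {"wc": rng.standard_normal(k), "Whc": rng.standard_normal((k, d)),
          "w_beta": rng.standard_normal(k), "Ws": rng.standard_normal((C, k)),
          "Whs": rng.standard_normal((k, d)), "w_alpha": rng.standard_normal(k)}
V = rng.standard_normal((m, C))
h = rng.standard_normal(d)
X, alpha, beta = channel_spatial_attention(V, h, params)
```

Applying channel attention first and then computing spatial attention on the channel-weighted map is one possible ordering; the reverse order is equally easy to express with the same building blocks.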
Compiled by Lei Feng Net