CVPR 2019 will be held in Long Beach, USA in June. The conference received 5,165 submissions this year, of which 1,299 were accepted. Tencent had 58 papers accepted at this CVPR, including 25 from Tencent Youtu Lab and 33 from Tencent AI Lab. Below is a detailed introduction to the 25 papers accepted from Tencent Youtu Lab.
1. Unsupervised Person Re-identification by Soft Multilabel Learning
Unsupervised person re-identification based on soft multilabel learning
Compared with supervised person re-identification (RE-ID), unsupervised RE-ID has attracted increasing attention because of its better scalability. However, without pairwise labels across non-overlapping camera views, learning discriminative information remains very challenging. To overcome this problem, we propose a deep model for unsupervised RE-ID based on soft multilabel learning. The idea is to assign a soft label (a likelihood vector over real reference labels) to each unlabeled person by comparing the person with a set of known reference people from an auxiliary domain. Based on the consistency between the visual features and the soft labels of unlabeled target pairs, we propose a soft multilabel-guided hard negative mining method to learn a discriminative embedding. Since most target pairs are cross-view pairs, we further propose cross-view consistent soft multilabel learning to keep the labels consistent across camera views. To make soft label learning efficient, we introduce reference agent learning. Our method is evaluated on Market-1501 and DukeMTMC-reID and significantly outperforms the current best unsupervised RE-ID methods.
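The two core steps above — assigning soft multilabels against reference agents, then mining hard negatives where features agree but labels disagree — can be sketched as follows. This is an illustrative toy, not the paper's implementation; the thresholds, the histogram-intersection agreement measure, and the agent shapes are all assumptions.

```python
import numpy as np

def soft_multilabel(features, ref_agents):
    """Compare each unlabeled feature against reference agents and
    return a soft multilabel (a likelihood vector over references)."""
    logits = features @ ref_agents.T             # (N, K) similarity scores
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    exp = np.exp(logits)
    return exp / exp.sum(axis=1, keepdims=True)  # softmax over references

def mine_hard_negatives(features, labels, sim_thresh=0.9, agree_thresh=0.5):
    """Pairs that look alike in feature space but whose soft multilabels
    disagree are treated as hard negatives."""
    hard_pairs = []
    n = len(features)
    for i in range(n):
        for j in range(i + 1, n):
            feat_sim = float(features[i] @ features[j])
            label_agree = float(np.minimum(labels[i], labels[j]).sum())  # histogram intersection
            if feat_sim > sim_thresh and label_agree < agree_thresh:
                hard_pairs.append((i, j))
    return hard_pairs

rng = np.random.default_rng(0)
feats = rng.normal(size=(8, 16))
feats /= np.linalg.norm(feats, axis=1, keepdims=True)
agents = rng.normal(size=(4, 16))
labels = soft_multilabel(feats, agents)
print(labels.shape)  # (8, 4), each row sums to 1
```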
2. Visual Tracking via Adaptive Spatially-Regularized Correlation Filters
Visual tracking based on adaptive spatially-regularized correlation filters
In this paper, an adaptive spatially-regularized correlation filter algorithm is proposed to optimize the filter weights and the spatial regularization matrix simultaneously. First, the proposed adaptive spatial regularization mechanism efficiently learns a spatial weight map that adapts to appearance changes of the target, yielding more robust tracking results. Second, the proposed model can be solved efficiently by alternating iterations, where each subproblem has a closed-form solution. Third, the proposed tracker uses two correlation filter models to estimate the target's position and scale respectively, which effectively reduces computational cost while maintaining high localization accuracy. Extensive experiments on standard benchmarks show that the proposed algorithm matches state-of-the-art trackers while running at real-time speed.
3. Adversarial Attacks Beyond the Image Space
Adversarial attacks beyond the image space
Generating adversarial examples is an important way to understand the working mechanisms of deep neural networks. Most existing methods produce perturbations in the image space, i.e., they modify each pixel independently. In this paper, we focus on a subset of adversarial examples that correspond to meaningful changes of three-dimensional physical properties, such as rotation, translation, and lighting conditions. Arguably, these attacks raise a more serious concern, because they demonstrate that simple perturbations of 3D objects and scenes in the real world can also cause neural networks to misclassify.
For the tasks of classification and visual question answering, we extend existing neural networks, which receive 2D input, by prepending a rendering module. Our pipeline renders a 3D scene (the physical space) into a 2D image (the image space), which the network then maps to a prediction (the output space). Adversarial perturbations generated this way go beyond the image space and have clear meanings in the 3D physical world. Although image-space attacks can be interpreted as changes of pixel albedo, we show that they cannot be well explained in the physical space, where effects are typically non-local. Attacks in the physical space are possible but much harder than those in the image space: they have lower success rates and require larger perturbations.
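The physical-space attack pipeline described above — perturb a 3D parameter, render, classify, ascend the loss — can be sketched with a stand-in differentiable renderer. Everything here is a toy under stated assumptions: `toy_render` is a hypothetical placeholder for a real differentiable rendering module, and the classifier is an untrained linear model.

```python
import torch

torch.manual_seed(0)

def toy_render(theta):
    """Stand-in for a differentiable renderer: maps a physical parameter
    (a rotation angle) to a 2D image. Real work would use a proper
    differentiable rendering pipeline."""
    xs = torch.linspace(-1, 1, 16)
    xx, yy = torch.meshgrid(xs, xs, indexing="ij")
    # rotate a simple oriented pattern by theta
    return torch.sin(3.0 * (xx * torch.cos(theta) + yy * torch.sin(theta)))

classifier = torch.nn.Sequential(torch.nn.Flatten(0), torch.nn.Linear(256, 2))
theta = torch.tensor(0.3, requires_grad=True)  # the physical parameter under attack
target = torch.tensor(0)

for _ in range(20):
    logits = classifier(toy_render(theta)).unsqueeze(0)
    loss = torch.nn.functional.cross_entropy(logits, target.unsqueeze(0))
    loss.backward()
    with torch.no_grad():
        theta += 0.05 * theta.grad.sign()  # ascend: perturb the physical parameter
        theta.grad.zero_()

print(float(theta))  # the perturbed rotation angle
```

The key point is that the gradient flows through the renderer back to the 3D parameter, so the perturbation lives in the physical space rather than in the pixels.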
4. Learning Context Graph for Person Search
A person search model based on context graph networks
This paper is a joint work led by Tencent Youtu Lab and Shanghai Jiao Tong University.
In recent years, deep neural networks have achieved great success in person search. However, these methods usually rely on the appearance of a single person and still struggle with pose changes, illumination changes, occlusion, and so on. This paper presents a new person search model based on context information. The proposed model takes the other pedestrians in the scene as context and uses a graph convolutional model to capture how this context affects the target person. We set new records on two well-known person search datasets, CUHK-SYSU and PRW, achieving top-1 person search results.
5. Underexposed Photo Enhancement using Deep Illumination Estimation
Underexposed photo enhancement based on deep illumination estimation
This paper introduces a new end-to-end network for enhancing underexposed photos. Instead of directly learning an image-to-image mapping as in previous work, we introduce an intermediate illumination representation into our network to associate the input with the expected enhancement result, which strengthens the network's ability to learn complex photographic adjustments from expert-retouched input/output image pairs. Based on this model, we formulate a loss function that imposes constraints and priors on the intermediate illumination. We also prepare a new dataset of 3,000 underexposed image pairs and train the network to learn a rich variety of illumination adjustments effectively. With these measures, our network can recover clear details, sharp contrast, and natural color in the enhanced results. Extensive experiments on the MIT-Adobe FiveK benchmark and our new dataset show that our network effectively handles images that were previously difficult to enhance.
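The role of the intermediate illumination can be seen in a Retinex-style sketch: if the observed image is the scene multiplied by an illumination map, then dividing by the (predicted) illumination recovers the enhanced scene. This is a minimal illustration of the idea, not the paper's network; here the illumination map is simply given.

```python
import numpy as np

def enhance(image, illumination, eps=1e-3):
    """Retinex-style recovery: I = S * L  =>  S = I / L.
    In the paper the illumination would come from the network;
    here it is supplied directly."""
    return np.clip(image / np.maximum(illumination, eps), 0.0, 1.0)

# a dark image whose true scene is mid-gray, dimmed by a uniform illumination
scene = np.full((4, 4, 3), 0.5)
illum = np.full((4, 4, 1), 0.25)  # underexposure factor
dark = scene * illum
restored = enhance(dark, illum)
print(restored.mean())  # ≈ 0.5, the original scene brightness
```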
6. Homomorphic Latent Space Interpolation for Unpaired Image-to-image Translation
Unpaired image-to-image translation based on homomorphic latent space interpolation
Generative adversarial networks have achieved great success in unpaired image-to-image translation. Cycle consistency makes it possible to model the relationship between two distinct domains without paired data. In this paper, we propose an alternative framework, built on latent space interpolation, that considers the intermediate region between the two domains during image translation. The framework is based on the fact that, in a flat and smooth latent space, there are many paths connecting two sample points. Properly choosing the interpolation path allows certain image attributes to change, which is very useful for generating intermediate images between the two domains. We also show that the framework can be applied to multi-domain and multi-modal translation. Extensive experiments demonstrate its generality and applicability to various tasks.
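The path-selection idea can be sketched with two latent paths between the same pair of codes: a direct line, and a detour through an anchor code. In the paper the intermediate codes would be decoded into images; here we only build the paths. The anchor-based `detour_path` is an illustrative assumption, not the paper's exact construction.

```python
import numpy as np

def lerp_path(z_a, z_b, steps):
    """Points along the straight line between two latent codes."""
    ts = np.linspace(0.0, 1.0, steps)[:, None]
    return (1.0 - ts) * z_a + ts * z_b

def detour_path(z_a, z_b, z_anchor, steps):
    """A second path through an anchor code: changing the path changes
    which attributes vary in the intermediate images."""
    half = steps // 2
    return np.vstack([lerp_path(z_a, z_anchor, half),
                      lerp_path(z_anchor, z_b, steps - half)])

z_a, z_b = np.zeros(8), np.ones(8)
direct = lerp_path(z_a, z_b, 5)
print(direct[2])  # midpoint of the direct path: all entries 0.5
```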
7. X2CT-GAN: Reconstructing CT from Biplanar X-Rays with Generative Adversarial Networks
A biplanar X-ray to CT generation system based on generative adversarial networks
CT imaging provides a three-dimensional view that helps doctors examine patients' tissues and organs and assists in diagnosing disease. However, compared with X-ray imaging, CT exposes patients to a larger radiation dose and costs more. Traditional 3D CT reconstruction requires a large number of X-ray projections captured around the object, which cannot be done on a conventional X-ray machine. In this paper, we propose an innovative method based on generative adversarial networks that reconstructs a realistic 3D CT volume from only two orthogonal 2D X-ray images. Its core innovations include a dimension-raising generator network and a multi-view feature fusion algorithm. Experiments and quantitative analysis show that this method outperforms other baselines on 2D X-ray to 3D CT reconstruction, and visualizations of the reconstructed volumes show that it provides more realistic details. In practice, without changing the existing X-ray imaging workflow, our method can give doctors an additional CT-like 3D volume to help them diagnose more accurately.
8. Semantic Regeneration Network
Semantic Regeneration Network
In this paper, we study a fundamental problem of visual context inference with deep generative models: extending image borders with plausible structure and details. This seemingly simple task actually faces many key technical challenges and has its own unique properties. The two main difficulties of the task are the size expansion and the one-sided constraints. We propose a Semantic Regeneration Network with several dedicated contributions and use multiple spatially-related losses to address these problems. Our final results contain highly consistent structures and high-quality textures. We experiment extensively with possible alternatives and related methods, and we explore the potential of our method for a variety of interesting applications that can benefit research in several fields.
9. Towards Accurate One-Stage Object Detection with AP-Loss
Accurate one-stage object detection with an AP loss
One-stage object detectors are usually trained by jointly optimizing a classification loss and a localization loss. However, because of the large number of anchor boxes, the effectiveness of the classification loss is severely limited by the extreme foreground-background class imbalance. This paper proposes a new training framework to address this problem. We replace the classification task in one-stage detectors with a ranking task, and use Average Precision (AP), the evaluation metric of ranking problems, as the loss function. Since the AP loss is non-differentiable and non-convex, it cannot be optimized directly by gradient descent. We therefore propose a novel optimization algorithm that combines the error-driven update scheme of perceptron learning with backpropagation in deep networks. We verify the good convergence of the algorithm both theoretically and experimentally. Experiments show that, without changing the network architecture, the AP loss significantly improves state-of-the-art one-stage detectors over various classification-loss baselines on multiple datasets.
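The error-driven idea can be illustrated with a simplified sketch: for each positive anchor, measure how much each negative out-ranks it (through a smoothed step function), and use those "primary terms" directly as an update signal instead of a true gradient. This is a heavily simplified version of the scheme for illustration only; the paper's exact normalization and interpolated-AP details differ.

```python
import numpy as np

def ap_and_error_driven_grad(scores, labels, delta=0.5):
    """Simplified sketch: AP as a ranking objective with a
    perceptron-style, error-driven update in place of a gradient."""
    def H(x):  # piecewise-linear smoothed step on [-delta, delta]
        return np.clip(x / (2 * delta) + 0.5, 0.0, 1.0)

    pos = np.where(labels == 1)[0]
    neg = np.where(labels == 0)[0]
    grad = np.zeros_like(scores)
    ap_terms = []
    for i in pos:
        hij = H(scores[neg] - scores[i])            # how much each negative out-ranks positive i
        rank_pos = 1 + H(scores[pos] - scores[i]).sum() - H(0.0)  # rank among positives (self excluded)
        rank_all = rank_pos + hij.sum()
        ap_terms.append(rank_pos / rank_all)
        x = hij / rank_all                          # primary terms: the error signal
        grad[neg] += x                              # push wrongly-ranked negatives down...
        grad[i] -= x.sum()                          # ...and the positive up (via s <- s - lr * grad)
    return float(np.mean(ap_terms)), grad

scores = np.array([2.0, 0.5, 1.0, -1.0])
labels = np.array([1, 0, 1, 0])
ap, grad = ap_and_error_driven_grad(scores, labels)
print(round(ap, 3))  # 1.0: all positives already out-rank all negatives
```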
10. Amodal Instance Segmentation through KINS Dataset
Amodal instance segmentation based on the KINS dataset
Amodal instance segmentation is a new direction of instance segmentation that aims to imitate the human ability to segment each object instance including its invisible, occluded parts. This task requires reasoning about the complete structure of objects. Although important and forward-looking, the task lacks large-scale, carefully annotated data, because marking the invisible parts correctly and consistently is difficult; this has been a major obstacle to exploring this frontier of visual recognition. In this paper, we augment KITTI with instance-level annotations for eight categories, producing the KITTI INStance dataset (KINS). We propose a new multi-task framework with Multi-Branch Coding (MBC), which combines information at different recognition levels to infer the invisible parts. Extensive experiments show that MBC effectively improves both amodal and inmodal segmentation. The KINS dataset and our proposed method will be released publicly.
11. Pyramidal Person Re-IDentification via Multi-Loss Dynamic Training
Pyramidal person re-identification based on multi-loss dynamic training
Most existing person re-identification methods rely heavily on accurate body detection to keep targets aligned. In complex real-world scenes, however, current detectors cannot guarantee such accuracy, which inevitably degrades re-identification performance. In this paper, we propose a novel coarse-to-fine pyramid model that relaxes the requirement on bounding-box accuracy. The pyramid model integrates local, global, and intermediate transitional information, and can match targets effectively at different scales even when they are poorly aligned. In addition, to learn discriminative identity representations, we propose a dynamic training framework that seamlessly coordinates the two loss functions and extracts the appropriate information from them. We achieve the best results on three datasets; notably, on the most challenging one, CUHK03, we surpass the previous best method by 9.5 percentage points.
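A coarse-to-fine pyramid over a person feature map can be sketched as every run of consecutive horizontal strips, pooled into one branch each — local strips at the bottom, the global feature at the top, and transitional parts in between. A minimal sketch, assuming mean-pooling and four basic strips; the paper's branch design and losses are richer.

```python
import numpy as np

def pyramid_parts(feat, num_strips=4):
    """Split a (C, H, W) feature map into a coarse-to-fine pyramid:
    every run of consecutive basic horizontal strips is one branch."""
    C, H, W = feat.shape
    edges = np.linspace(0, H, num_strips + 1).astype(int)
    branches = []
    for lo in range(num_strips):
        for hi in range(lo + 1, num_strips + 1):
            region = feat[:, edges[lo]:edges[hi], :]
            branches.append(region.mean(axis=(1, 2)))  # global average pool per branch
    return np.stack(branches)                          # (num_branches, C)

feat = np.random.default_rng(2).normal(size=(16, 8, 4))
parts = pyramid_parts(feat)
print(parts.shape)  # (10, 16): 4 + 3 + 2 + 1 pyramid branches
```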
12. Dynamic Scene Deblurring with Parameter Selective Sharing and Nested Skip Connections
An image deblurring algorithm based on selective parameter sharing and nested skip connections
Dynamic scene deblurring is a challenging low-level vision problem because the blur at each pixel is caused by multiple factors, including camera motion and object motion. Recently, methods based on deep convolutional networks have made great progress on this problem. By analyzing network parameter strategies against the parameter-independence and full parameter-sharing alternatives, we propose a selective parameter sharing scheme. Within each scale sub-network, we propose a nested skip-connection structure for the nonlinear transformation modules. In addition, following the established procedure for generating blurred data, we build a larger dataset and train a better deblurring network on it. Experiments show that our selective parameter sharing, nested skip connections, and new dataset each improve performance, achieving state-of-the-art deblurring results.
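One plausible form of a nested skip-connection module is sketched below in PyTorch: residual units whose inputs and outputs are additionally summed through nested shortcut paths. This is a minimal sketch of the general idea, assuming a simple conv-ReLU-conv unit; the paper's exact module layout may differ.

```python
import torch
import torch.nn as nn

class NestedSkipBlock(nn.Module):
    """Sketch of a nested skip-connection module: each unit sees the sum
    of all earlier features, and the block output sums them all, easing
    gradient flow through the nonlinear transformation stage."""
    def __init__(self, channels, num_units=3):
        super().__init__()
        self.units = nn.ModuleList(
            nn.Sequential(
                nn.Conv2d(channels, channels, 3, padding=1),
                nn.ReLU(inplace=True),
                nn.Conv2d(channels, channels, 3, padding=1),
            )
            for _ in range(num_units)
        )

    def forward(self, x):
        feats = [x]
        for unit in self.units:
            feats.append(unit(sum(feats)))  # nested shortcuts into every unit
        return sum(feats)                   # nested shortcuts out of every unit

block = NestedSkipBlock(8)
y = block(torch.randn(1, 8, 16, 16))
print(y.shape)  # same shape as the input
```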
13. Learning Shape-Aware Embedding for Scene Text Detection
A scene text detection method based on instance segmentation and embedding features
Because natural scenes are complex and highly variable, detecting arbitrarily-shaped text in them is very challenging. This paper proposes a solution for detecting text of arbitrary shapes. Specifically, we treat text detection as an instance segmentation problem and propose a segmentation-based framework that represents different text instances as separate connected regions. To distinguish different text instances, our method maps image pixels into an embedding feature space, where pixels belonging to the same text instance lie close to each other and pixels from different instances lie far apart. In addition, our proposed Shape-Aware loss lets the model adapt its training to the widely varying aspect ratios of text instances and to the narrow gaps between nearby instances. We also propose a new post-processing algorithm, allowing our method to produce accurate predictions. Experimental results demonstrate the effectiveness of our work on three challenging datasets (ICDAR15, MSRA-TD500, and CTW1500).
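The pull-close/push-apart embedding objective described above can be sketched as a standard discriminative embedding loss. This is a generic illustration, not the paper's Shape-Aware loss, which further reweights terms by instance aspect ratios and inter-instance gaps; the margins here are arbitrary.

```python
import numpy as np

def pull_push_loss(embeddings, instance_ids, margin_pull=0.5, margin_push=3.0):
    """Pixels of one text instance are pulled toward their instance mean;
    different instance means are pushed apart beyond a margin."""
    ids = np.unique(instance_ids)
    means = np.stack([embeddings[instance_ids == k].mean(axis=0) for k in ids])

    pull = 0.0
    for m, k in zip(means, ids):
        d = np.linalg.norm(embeddings[instance_ids == k] - m, axis=1)
        pull += np.mean(np.maximum(d - margin_pull, 0.0) ** 2)
    pull /= len(ids)

    push = 0.0
    for a in range(len(ids)):
        for b in range(a + 1, len(ids)):
            d = np.linalg.norm(means[a] - means[b])
            push += max(margin_push - d, 0.0) ** 2
    return pull + push

rng = np.random.default_rng(3)
emb = np.vstack([rng.normal(0.0, 0.1, (20, 4)), rng.normal(5.0, 0.1, (20, 4))])
ids = np.array([0] * 20 + [1] * 20)
print(pull_push_loss(emb, ids))  # near zero: clusters are tight and far apart
```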
14. PointWeb: Enhancing Local Neighborhood Features for Point Cloud Processing
PointWeb: enhancing point cloud processing with local neighborhood features
15. Associatively Segmenting Instances and Semantics in Point Clouds
Jointly segmenting instances and semantics in point clouds
A 3D point cloud describes a real scene in detail and intuitively, yet how to segment the diverse elements of such an informative 3D scene is rarely discussed. In this paper, we first introduce a simple and flexible framework that segments instances and semantics in point clouds simultaneously. We then propose two ways to let the two tasks benefit from each other, achieving a win-win performance improvement. Specifically, instance segmentation benefits from semantic segmentation by learning semantic-aware instance embedding vectors; at the same time, the semantic features of points belonging to the same instance are fused together to produce more accurate per-point semantic predictions. Our method substantially outperforms the current state-of-the-art 3D instance segmentation method and also yields a significant improvement in 3D semantic segmentation.
Code and models are open-sourced: https://github.com/WXinlong/ASIS
16. Cyclic Guidance for Weakly Supervised Joint Detection and Segmentation
Weakly supervised joint detection and segmentation based on cyclic guidance
This paper is led by Tencent Youtu Laboratory and Professor Ji Rongrong of Xiamen University.
For the first time, we propose to combine weakly supervised detection and weakly supervised segmentation in a multi-task learning framework, letting the two tasks improve each other based on their complementary failure modes. This cross-task enhancement helps both tasks escape poor local minima. Our approach, WS-JDS, has two branches sharing the same backbone model, one per task. During learning, we propose a cyclic guidance paradigm and a dedicated loss function so that the two branches improve each other. Experimental results show that the algorithm improves the performance of both tasks.
17. ROI Pooled Correlation Filters for Visual Tracking
Correlation filter tracking based on region-of-interest pooling
ROI-based pooling has been successful in object detection and related fields. Because it compresses the model while retaining its localization accuracy, pooling is well suited to visual tracking. Although ROI pooling has proven effective in other areas, it has not been well exploited in correlation filtering. Motivated by this, we propose a novel correlation filter algorithm with ROI pooling for robust visual tracking. Through rigorous mathematical derivation, we prove that ROI pooling in a correlation filter can be achieved by imposing additional constraints on the learned filter, so the pooling operation can be performed without explicitly extracting training samples. We further propose an efficient correlation filter formulation and a Fourier-domain algorithm for solving the objective function. We evaluate the proposed tracker on OTB-2013, OTB-2015, and VOT-2017, and extensive experimental results demonstrate its effectiveness.
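For background, the closed-form Fourier-domain correlation filter that this line of work builds on can be sketched in a few lines (a MOSSE-style single-channel filter; the paper's contribution — the ROI-pooling constraints on the learned filter — is not reproduced here).

```python
import numpy as np

def train_filter(patch, target_response, lam=1e-2):
    """Closed-form correlation filter in the Fourier domain:
    H* = (G . conj(F)) / (F . conj(F) + lambda)."""
    F = np.fft.fft2(patch)
    G = np.fft.fft2(target_response)
    return (G * np.conj(F)) / (F * np.conj(F) + lam)

def respond(H, patch):
    """Correlate a new patch with the learned filter."""
    return np.real(np.fft.ifft2(H * np.fft.fft2(patch)))

size = 32
yy, xx = np.mgrid[0:size, 0:size]
gauss = np.exp(-((yy - size // 2) ** 2 + (xx - size // 2) ** 2) / 8.0)
patch = np.random.default_rng(4).normal(size=(size, size))

H = train_filter(patch, gauss)
resp = respond(H, patch)
peak = np.unravel_index(resp.argmax(), resp.shape)
print(peak)  # (16, 16): the response peaks at the target centre
```

In tracking, the same `respond` call on a new frame's patch gives the shifted peak, i.e., the target's displacement.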
18. Exploiting Kernel Sparsity and Entropy for Interpretable CNN Compression
A neural network compression method based on the sparsity and density entropy of convolution kernels
This paper is led by Tencent Youtu Laboratory and Professor Ji Rongrong of Xiamen University.
From the perspective of network interpretability, we analyze the redundancy of convolutional neural network feature maps and find that a feature map's importance depends on its sparsity and information richness. Computing these quantities directly on the feature maps, however, incurs a huge computational overhead. To overcome this problem, we establish the relationship between a feature map and its corresponding 2D convolution kernels: the sparsity and density entropy of the kernels represent the importance of the corresponding feature map, yielding a score function for feature-map importance. On this basis, we compress the model by finer-grained convolution kernel clustering instead of traditional pruning. Extensive experimental results show that our compression method based on kernel sparsity and density entropy achieves higher compression rates with better accuracy.
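The two kernel-side quantities can be sketched as follows: L1 sparsity per kernel, and an entropy computed from a nearest-neighbour density estimate over the kernel set. How the paper combines them into its final indicator differs; the weighting here is purely illustrative.

```python
import numpy as np

def kse_score(kernels, k=3, alpha=1.0):
    """Illustrative kernel sparsity-and-entropy score: each 2D kernel is
    scored by its L1 sparsity, modulated by the entropy of a kNN density
    estimate over all kernels. Low scores would mark feature maps that
    are cheap to remove."""
    n = len(kernels)
    flat = kernels.reshape(n, -1)
    sparsity = np.abs(flat).sum(axis=1)                       # information amount per kernel

    dists = np.linalg.norm(flat[:, None] - flat[None, :], axis=2)
    knn = np.sort(dists, axis=1)[:, 1:k + 1].sum(axis=1)      # kNN distances (self excluded)
    p = knn / knn.sum()
    entropy = -(p * np.log(p + 1e-12)).sum()                  # diversity of the kernel set

    return sparsity * (1.0 + alpha * entropy)

kernels = np.random.default_rng(5).normal(size=(8, 3, 3))
scores = kse_score(kernels)
print(scores.shape)  # one score per kernel
```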
19. MMFace: A Multi-Metric Regression Network for Unconstrained Face Reconstruction
MMFace: Multimetric Regression Network for Unconstrained 3D Face Reconstruction
This paper proposes a multi-metric regression network for unconstrained 3D face reconstruction. The core idea is to use a voxel regression sub-network to generate an intermediate representation of the facial geometry from the input image, and then to regress the parameters of a 3D face morphable model from this intermediate representation. We constrain the regression with multiple metrics covering face identity, expression, head pose, and voxels, which makes our algorithm robust to exaggerated expressions, large head poses, partial occlusion, and complex lighting. Compared with current mainstream algorithms, our method shows significant improvements on the public 3D face datasets LS3D-W and Florence, and it can also be applied directly to video sequences.
20. Towards Optimal Structured CNN Pruning via Generative Adversarial Learning
An optimal structured convolutional neural network pruning method based on generative adversarial learning
This paper is led by Tencent Youtu Laboratory and Professor Ji Rongrong of Xiamen University.
We propose an optimal structured network pruning method based on generative adversarial learning. By training the pruned network with its redundant heterogeneous structures end-to-end in an unsupervised manner, we effectively address the low pruning efficiency, lack of relaxation, and strong label dependence of traditional structured pruning methods. The method attaches a soft mask to each structure of the model and imposes a sparsity constraint on the masks to characterize each structure's redundancy. To learn the model parameters and masks jointly, we construct a new label-free structured pruning objective within a generative adversarial learning framework, and use a fast iterative shrinkage-thresholding algorithm (FISTA) to solve the optimization problem and remove redundant structures reliably. Extensive experimental results show that the proposed method outperforms state-of-the-art structured pruning methods.
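The soft-mask mechanism can be illustrated on a toy problem: a FISTA loop whose soft-thresholding step drives the masks of redundant structures to exactly zero. `grad_fn` stands in for backprop through the real network; the quadratic toy loss and all constants are illustrative assumptions.

```python
import numpy as np

def soft_threshold(x, tau):
    """Proximal operator of the l1 penalty used in ISTA/FISTA."""
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

def fista_mask(grad_fn, mask0, lam=0.1, lr=0.5, iters=50):
    """Sketch of learning a sparse soft mask: gradient step on the task
    loss, then soft-thresholding, with FISTA momentum."""
    m, m_prev, t = mask0.copy(), mask0.copy(), 1.0
    for _ in range(iters):
        g = grad_fn(m)
        m_new = soft_threshold(m - lr * g, lr * lam)
        t_new = (1 + np.sqrt(1 + 4 * t * t)) / 2
        m, m_prev, t = m_new + (t - 1) / t_new * (m_new - m_prev), m_new, t_new
    return m_prev  # the last proximal point (exactly sparse)

# toy loss: keep masks near a target; some targets are (near) zero
target = np.array([1.0, 0.8, 0.05, 0.0])
mask = fista_mask(lambda m: m - target, np.ones(4), lam=0.1)
print(np.round(mask, 2))  # small-target entries are thresholded to exactly 0
```

Structures whose mask reaches exactly zero can then be removed from the network, which is what makes the relaxation "structured".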
21. Semantic Component Decomposition for Face Attribute Manipulation
Face Attribute Editing Based on Semantic Component Decomposition
Recently, deep neural networks have been widely applied to facial attribute editing. However, two main problems remain: poor visual quality and the difficulty for users to control the results. These limit the applicability of existing methods, because users may have different editing preferences for different facial attributes. In this paper, we address these problems with a semantic-component-based model. The model decomposes a facial attribute into several semantic components, each corresponding to a specific facial region. This not only allows users to control the editing strength of each part according to their preferences, but also effectively removes unwanted editing effects. Furthermore, each semantic component consists of two basic elements, which determine the editing effect and the editing region respectively. This property enables even finer-grained interactive control. Experiments show that our model not only produces high-quality results but also supports effective user interaction.
22. Memory-Attended Recurrent Network for Video Captioning
A memory-attended recurrent network for video captioning
Conventional video captioning models follow the encoder-decoder framework: the input video is first encoded and then decoded into the corresponding caption. The limitation of this approach is that the model can only attend to the single video currently being processed. In practice, a word or phrase may appear in many different but similar videos, so encoder-decoder methods cannot capture the contextual semantics of a word across multiple related videos. To overcome this limitation, we propose a memory-attended recurrent network with a dedicated memory structure that captures, for each word in the vocabulary, its corresponding semantics across all related videos. Our model can therefore understand the semantics of each word more comprehensively and deeply, improving the quality of the generated captions. In addition, the designed memory structure can evaluate the coherence between adjacent words. Extensive experiments show that our model produces higher-quality video captions than existing models.
23. Distilled Person Re-identification: Towards a More Scalable System
Distilled person re-identification: towards a more scalable system
Person re-identification (Re-ID), which compares pedestrians across non-overlapping camera views, has made considerable progress in the supervised setting with abundant labeled data. However, scalability remains the bottleneck for large-scale deployment. We consider the scalability of Re-ID from three aspects: (1) reducing label supervision to cut annotation cost; (2) reusing existing knowledge to cut transfer cost; and (3) using lightweight models to cut inference cost. To address all three, we propose a multi-teacher adaptive similarity distillation framework that transfers knowledge from multiple teacher models to a customized lightweight student model using only a small number of labeled target-domain identities and no source-domain data. To select teacher models effectively and complete the knowledge transfer, we propose a Log-Euclidean similarity distillation loss and further integrate an Adaptive Knowledge Aggregator. Extensive experimental evaluations demonstrate the scalability of the method, whose performance is comparable to the best unsupervised and semi-supervised Re-ID methods currently available.
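One plausible reading of a Log-Euclidean similarity distillation loss is sketched below: the student mimics the teacher's pairwise-similarity structure, with the distance measured between matrix logarithms of the (regularized, symmetric positive-definite) similarity matrices. The SPD regularization and the plain Frobenius distance are assumptions for illustration, not the paper's exact formulation.

```python
import numpy as np

def logm_spd(S):
    """Matrix logarithm of a symmetric positive-definite matrix
    via eigendecomposition."""
    w, V = np.linalg.eigh(S)
    return (V * np.log(w)) @ V.T

def log_euclidean_distill(feat_teacher, feat_student, eps=1e-3):
    """Distance between the log-similarity matrices of teacher and
    student features over the same batch of images."""
    def spd_sim(F):
        F = F / np.linalg.norm(F, axis=1, keepdims=True)
        S = F @ F.T                       # cosine-similarity Gram matrix (PSD)
        return S + eps * np.eye(len(F))   # regularize to strictly positive-definite
    diff = logm_spd(spd_sim(feat_teacher)) - logm_spd(spd_sim(feat_student))
    return float(np.linalg.norm(diff) ** 2)

rng = np.random.default_rng(6)
t = rng.normal(size=(5, 32))
print(log_euclidean_distill(t, t))  # 0.0 when the student matches the teacher
```

Minimizing this loss over the student's parameters transfers the teacher's similarity structure without needing the teacher's labels.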
24. DSFD: Dual Shot Face Detector
Dual Shot face detector
This paper is a collaboration between PCALab at the School of Computer Science and Engineering, Nanjing University of Science and Technology, and Tencent Youtu Lab.
In recent years, convolutional neural networks have achieved great success in face detection. However, these methods still struggle with variations in scale, pose, occlusion, expression, illumination, and so on. This paper proposes a new method that tackles three key aspects of face detection: better feature learning, progressive loss design, and anchor-assignment-based data augmentation. First, we propose a Feature Enhance Module that extends the single-shot architecture into a dual-shot structure to strengthen the features. Second, we adopt a Progressive Anchor Loss that assigns anchor sets of different scales to the two shots to drive feature learning more effectively. Finally, we use an Improved Anchor Matching strategy to provide better initialization for the regressor. Since these techniques all relate to the dual-shot design, we name the method Dual Shot Face Detector (DSFD). We set new records on the five evaluation subsets of two well-known face detection benchmarks, WIDER FACE and FDDB, achieving top-1 face detection results.
25. 3D Motion Decomposition for RGBD Future Dynamic Scene Synthesis
Synthesizing future dynamic RGBD scenes based on 3D motion decomposition
A future frame of a video is formed by projecting the 3D scene, under both camera ego-motion and object motion, onto a 2D image. Fundamentally, therefore, accurately predicting future video requires understanding the 3D motion and geometry of the scene. In this paper, we propose an RGBD future scene prediction model based on 3D motion decomposition. We first predict the camera's ego-motion and the motion of the foreground objects; these are combined to generate the future 3D scene, which is then projected onto the 2D camera plane to synthesize future motion, RGB frames, and depth maps. The system can also incorporate semantic segmentation information to predict future semantic maps. Results on KITTI and Driving show that our approach outperforms the current best methods for future RGBD scene prediction.