Image description generation (image captioning) is a research area at the intersection of computer vision and natural language processing, and one that has become especially active in the current wave of deep learning. In August of this year, Tencent AI Lab took first place in Microsoft's MS COCO Image Captioning challenge with a self-developed reinforcement learning algorithm, surpassing Microsoft, Google, IBM, and other participating companies.
MS COCO (Common Objects in Context, http://cocodataset.org/) is an image dataset published and maintained by Microsoft. It hosts four competition tasks: object detection, keypoint detection, image segmentation, and image captioning. Because these visual tasks are among the most discussed and representative in computer vision, MS COCO is one of the most important benchmarks for image understanding and analysis. Among them, the Captioning task, which requires deeper joint understanding and analysis of images and text, is more challenging than the other three, and has therefore attracted broad participation from industry (Google, IBM, Microsoft) and academia (UC Berkeley, Stanford University). To date, a total of 80 teams have joined the competition.
In general, image captioning aims to give machines the human ability to understand an image and describe its perceived content in natural language. Image description generation can help people with visual impairments understand images, and it gives an image a far richer description than a simple tag, so the task has broad practical value.
From an academic standpoint, image description generation requires understanding not only the image but also natural language. It is an interdisciplinary, cross-modal research topic, and an important exploration of extending the learning ability of deep neural networks across multiple data domains. As a result, many technology companies and research institutions work on this task, including Google, Microsoft, IBM, Snapchat, the Universities of Montreal and Toronto, UC Berkeley, Stanford University, Baidu, and others.
Recently, Tencent AI Lab developed a new reinforcement learning algorithm to further strengthen its image description generation model. The model uses the encoder-decoder framework and introduces an attention mechanism. Building on earlier research on spatial and channel-wise attention, AI Lab designed a new network model that introduces a multi-stage attention mechanism. The encoder uses an existing image convolutional neural network (CNN), such as VGG, Inception, or ResNet, to encode a given image into vectors that contain the image's semantic information.
These vectors represent semantic information of the image at different scales, such as global semantics and multi-scale local semantics. The decoder uses the widely adopted long short-term memory model (LSTM) to decode the global and local semantic vectors produced by the encoder into a textual sentence describing the image content. It is in this decoding process that AI Lab innovates with a multi-stage attention mechanism: local semantic information at different scales of the image is embedded into the generation of each word through attention modules at different stages, and at the same time these attention modules must balance the attention signal strengths introduced at the different scales.
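The multi-stage idea can be illustrated with a minimal numpy sketch of a single decoder step: soft attention is applied first over coarse local features and then over fine local features, with the second stage conditioned on the first stage's context. All names here (`attend`, the feature grids, the weight matrices) are hypothetical stand-ins for illustration; the actual model uses a CNN encoder and an LSTM decoder rather than random vectors.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attend(query, features, W):
    """Soft attention: score each local feature region against the
    decoder state (query), then return the attention-weighted average."""
    scores = features @ (W @ query)          # one scalar score per region
    weights = softmax(scores)                # normalized attention weights
    context = weights @ features             # weighted sum of region features
    return context, weights

rng = np.random.default_rng(0)
d = 8                                        # feature / state dimension
h = rng.normal(size=d)                      # current LSTM decoder state (stand-in)

# Local CNN features at two scales, e.g. a coarse 4-region grid
# and a finer 16-region grid (hypothetical sizes).
coarse = rng.normal(size=(4, d))
fine = rng.normal(size=(16, d))
W1, W2 = rng.normal(size=(d, d)), rng.normal(size=(d, d))

# Stage 1: attend over coarse features; stage 2: refine with fine features,
# conditioning the query on the stage-1 context (the multi-stage idea).
ctx1, w1 = attend(h, coarse, W1)
ctx2, w2 = attend(h + ctx1, fine, W2)

# A word predictor would consume [h, ctx1, ctx2]; here we just inspect shapes.
fused = np.concatenate([h, ctx1, ctx2])
print(fused.shape)
```

Each stage's weights form a distribution over image regions, so each generated word can draw on different regions at different granularities.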
In addition to introducing a multi-stage attention mechanism, the reinforcement learning algorithm developed by AI Lab further improves the training of the constructed network model. Training with the traditional cross-entropy loss does not directly optimize the evaluation metrics of image description generation, such as BLEU, METEOR, ROUGE, CIDEr, and SPICE, and these metrics are not differentiable, so they cannot be used as loss functions for gradient-based training. To address this, AI Lab uses a reinforcement learning algorithm to train the network model to optimize these metrics directly.
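The non-differentiability is easy to see from how these metrics are computed: they match discrete tokens between a generated sentence and reference sentences, so there is no gradient path back to the model. As a simplified illustration, here is clipped unigram precision, the most basic ingredient of BLEU (the full BLEU score also uses higher-order n-grams and a brevity penalty):

```python
from collections import Counter

def unigram_precision(candidate, reference):
    """Clipped unigram precision: the fraction of candidate words that
    also appear in the reference, where each reference word can be
    matched at most as many times as it occurs there."""
    cand, ref = Counter(candidate), Counter(reference)
    matched = sum(min(n, ref[w]) for w, n in cand.items())
    return matched / max(len(candidate), 1)

ref = "a man rides a horse on the beach".split()
cand = "a man is riding a horse".split()
print(unigram_precision(cand, ref))  # 4 of 6 words match -> 0.666...
```

Because the score depends on exact token matches, an infinitesimal change to the model's output probabilities leaves the sampled sentence, and hence the score, unchanged; this is why a policy-gradient approach is needed.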
The training process can be summarized in the following stages:
Given an image, generate the corresponding sentence with the deep network model;
Compare the generated sentence with the annotated reference sentences to compute the corresponding metric;
Use reinforcement learning to construct the gradient information for the network model, and perform gradient descent to complete the final network optimization.
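The three stages above can be sketched as a REINFORCE-style policy-gradient loop. This is a deliberately tiny stand-in, not the actual method: the "model" is a single categorical distribution over a toy vocabulary instead of a CNN-LSTM, the "metric" is a made-up word-overlap reward instead of CIDEr, and the constant baseline is a crude simplification (self-critical training, for instance, uses the greedy decode's score as the baseline).

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["a", "man", "rides", "horse", "dog", "cat"]
reference = {"a", "man", "rides", "horse"}

def reward(words):
    """Toy stand-in for a captioning metric such as CIDEr:
    the fraction of generated words found in the reference caption."""
    return sum(w in reference for w in words) / len(words)

# Toy "model": an unconditional categorical distribution over the vocabulary.
logits = np.zeros(len(vocab))
lr, T = 0.5, 4                               # learning rate, caption length

for step in range(200):
    probs = np.exp(logits) / np.exp(logits).sum()
    idx = rng.choice(len(vocab), size=T, p=probs)  # stage 1: sample a caption
    r = reward([vocab[i] for i in idx])            # stage 2: score it vs. reference
    # Stage 3: REINFORCE gradient of the expected reward w.r.t. the logits:
    # grad log p(word) = onehot - probs, scaled by the baselined reward.
    grad = np.zeros_like(logits)
    for i in idx:
        onehot = np.eye(len(vocab))[i]
        grad += (onehot - probs) * (r - 0.5)       # 0.5: crude constant baseline
    logits += lr * grad                            # gradient ascent on the reward

probs = np.exp(logits) / np.exp(logits).sum()
print(probs.round(3))
```

After training, the probability mass shifts toward the reference words: captions containing them receive above-baseline rewards, so their log-probabilities are pushed up, exactly the mechanism that lets a non-differentiable metric drive the gradient update.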
After sufficient training, the image description generation model developed by Tencent AI Lab ranked first in the Captioning task of Microsoft's MS COCO challenge, surpassing Microsoft, Google, IBM, and other technology companies.
References:
O. Vinyals, A. Toshev, S. Bengio, and D. Erhan, "Show and Tell: A Neural Image Caption Generator", CVPR 2015.
S. J. Rennie, E. Marcheret, Y. Mroueh, J. Ross, and V. Goel, "Self-critical Sequence Training for Image Captioning", CVPR 2017.
S. Liu, Z. Zhu, N. Ye, S. Guadarrama, and K. Murphy, "Improved Image Captioning via Policy Gradient Optimization of SPIDEr", ICCV 2017.
Z. Ren, X. Wang, N. Zhang, X. Lv, and L.-J. Li, "Deep Reinforcement Learning-Based Image Captioning with Embedding Reward", CVPR 2017.
H. Fang, S. Gupta, F. Iandola, R. Srivastava, L. Deng, P. Dollár, J. Gao, X. He, M. Mitchell, J. Platt, C. L. Zitnick, and G. Zweig, "From Captions to Visual Concepts and Back", CVPR 2015.
K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhutdinov, R. Zemel, and Y. Bengio, "Show, Attend and Tell: Neural Image Caption Generation with Visual Attention", ICML 2015.
J. Donahue, L. A. Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell, "Long-term Recurrent Convolutional Networks for Visual Recognition and Description", CVPR 2015.
A. Karpathy and L. Fei-Fei, "Deep Visual-Semantic Alignments for Generating Image Descriptions", CVPR 2015.
J. Mao, W. Xu, Y. Yang, J. Wang, and A. L. Yuille, "Deep Captioning with Multimodal Recurrent Neural Networks (m-RNN)", ICLR 2015.
L. Chen, H. Zhang, J. Xiao, L. Nie, J. Shao, W. Liu, and T.-S. Chua, "SCA-CNN: Spatial and Channel-wise Attention in Convolutional Networks for Image Captioning", CVPR 2017.