In August 2017, Tencent's AI Lab ranked first in the Image Captioning task of Microsoft's MS COCO challenge with its self-developed reinforcement learning algorithm for image description generation, a popular research area at the intersection of computer vision and NLP. Surpassing Microsoft, Google, IBM, and other participating companies, the result reflects the lab's technical strength at the forefront of this area of AI.
MS COCO (Common Objects in Context, http://cocodataset.org/) is an image dataset published and maintained by Microsoft. It hosts four competition tasks: object detection, keypoints, image segmentation, and captioning. Because these visual tasks are among the most discussed and representative in computer vision, MS COCO is one of the most important benchmarks for image understanding and analysis. Among the four, the Captioning task requires a deeper understanding and analysis of both images and text, making it more challenging than the other three; it has therefore attracted broad attention from industry (Google, IBM, Microsoft) and academia (UC Berkeley, Stanford University). To date, a total of 80 teams have joined the competition.
In general, image captioning aims to give machines the human ability to understand an image and to describe its perceived content in natural language. Image description generation lets machines help people with visual impairments understand images, and it gives an image a richer description than a simple tag, so the task has broad practical implications. From an academic standpoint, research on image description generation requires understanding not only images but also natural language. It is an interdisciplinary, cross-modal research subject, and it is also an important exploration of extending the learning ability of deep neural networks to multiple data domains. As a result, many tech companies and research institutes work on this task, including Google, Microsoft, IBM, Snapchat, the Universities of Montreal and Toronto, UC Berkeley, Stanford University, Baidu, and others.
Recently, Tencent AI Lab developed a new reinforcement learning algorithm to further enhance the capability of its image description generation model, as shown in the figure above. The model uses the encoder-decoder framework and introduces an attention mechanism. Building on previous research on spatial and channel-wise attention, AI Lab constructed a new network model that introduces a multi-stage attention mechanism. The encoder, using an existing image convolutional neural network (CNN) such as VGG, Inception, or ResNet, encodes a given image into vectors that contain its semantic information. These vectors can represent semantics at different scales, such as global semantics and multi-scale local semantics. The decoder, using the widely adopted long short-term memory model (LSTM), decodes the global and local semantic vectors produced by the encoder to generate textual sentences describing the image content. It is in this decoding process that AI Lab innovatively applies a multi-stage attention mechanism: local semantic information at different scales of the image is embedded into the generation of each word through attention modules at different stages. At the same time, the attention modules must weigh the attention signal strengths introduced at the different stages and scales.
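The attention step at the heart of such a decoder can be sketched in a few lines of NumPy. This is a minimal illustration of soft attention over local features at two scales, not the lab's actual model: the feature-grid sizes, dimensions, and variable names are assumptions for the example.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a score vector.
    e = np.exp(x - x.max())
    return e / e.sum()

def attend(query, features):
    """Soft attention: score each local feature vector against the
    decoder state (query), normalize with softmax, and return the
    attention-weighted sum as a context vector."""
    scores = features @ query        # (N,) one score per image region
    weights = softmax(scores)        # attention distribution over regions
    context = weights @ features     # (D,) attended context vector
    return context, weights

rng = np.random.default_rng(0)
D = 8                                 # feature dimension (illustrative)
coarse = rng.normal(size=(49, D))     # e.g. a 7x7 grid of local features
fine = rng.normal(size=(196, D))      # e.g. a 14x14 grid at a finer scale
h = rng.normal(size=D)                # current LSTM hidden state (query)

# Two attention stages over features at different scales; both context
# vectors feed into the prediction of the next word at this time step.
ctx_coarse, w_coarse = attend(h, coarse)
ctx_fine, w_fine = attend(h, fine)
step_input = np.concatenate([ctx_coarse, ctx_fine])
```

In a full model the scoring function would itself be learned and the contexts combined with the LSTM state before the word softmax; the sketch only shows how per-scale attention weights select local semantics for each generated word.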
In addition to introducing the multi-stage attention mechanism, the reinforcement learning algorithm developed by AI Lab further improves the training of the constructed network model. Training with the traditional cross-entropy loss does not directly optimize the evaluation metrics of image description generation, such as BLEU, METEOR, ROUGE, CIDEr, and SPICE, and these metrics are not differentiable when used as loss functions. To address this issue, AI Lab uses a reinforcement learning algorithm to train the network model to optimize these metrics directly. The training process can be summarized as follows: given an image, generate the corresponding sentence through the deep network model; compare the generated sentence with the labeled reference sentences to compute the corresponding metric scores; use reinforcement learning to construct the gradient for the deep network model; and perform gradient descent to optimize the network. Finally, with sufficient training, the image description generation model developed by Tencent's AI Lab ranked first in the Captioning task of Microsoft's MS COCO, surpassing Microsoft, Google, IBM, and other technology companies.
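The policy-gradient idea behind such training can be sketched for a single decoding step. This is an illustrative toy, not the lab's algorithm: the vocabulary size, the `toy_reward` stand-in for a sentence-level metric like CIDEr, and the greedy-decode baseline (in the spirit of self-critical sequence training) are all assumptions for the example.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(1)
V = 5                                 # toy vocabulary size
logits = rng.normal(size=V)           # decoder scores at one time step
probs = softmax(logits)

# Sample a word (exploration) and greedily decode one (baseline).
sampled = rng.choice(V, p=probs)
greedy = int(np.argmax(probs))

def toy_reward(word):
    # Stand-in for a non-differentiable metric such as CIDEr; in
    # practice the reward scores the whole generated caption against
    # the labeled reference sentences.
    return 1.0 if word == 2 else 0.0

# Advantage: sampled reward minus the baseline reward, which reduces
# the variance of the gradient estimate.
advantage = toy_reward(sampled) - toy_reward(greedy)

# REINFORCE estimate of the gradient of expected reward w.r.t. logits:
# (r - b) * d log p(sampled) / d logits = (r - b) * (onehot - probs)
onehot = np.eye(V)[sampled]
grad_logits = advantage * (onehot - probs)

# One gradient-ascent step on the logits (i.e. descent on -reward).
logits_new = logits + 0.1 * grad_logits
```

Because the metric enters only through the scalar reward, it never needs to be differentiable; the gradient flows through the log-probability of the sampled words, which is exactly what lets these models optimize BLEU- or CIDEr-style scores directly.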
[1] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan, "Show and Tell: A Neural Image Caption Generator", CVPR 2015.
[2] S. J. Rennie, E. Marcheret, Y. Mroueh, J. Ross, and V. Goel, "Self-critical Sequence Training for Image Captioning", CVPR 2017.
[3] S. Liu, Z. Zhu, N. Ye, S. Guadarrama, and K. Murphy, "Improved Image Captioning via Policy Gradient Optimization of SPIDEr", ICCV 2017.
[4] Z. Ren, X. Wang, N. Zhang, X. Lv, and L.-J. Li, "Deep Reinforcement Learning-Based Image Captioning with Embedding Reward", CVPR 2017.
[5] H. Fang, S. Gupta, F. Iandola, R. Srivastava, L. Deng, P. Dollár, J. Gao, X. He, M. Mitchell, J. Platt, C. L. Zitnick, and G. Zweig, "From Captions to Visual Concepts and Back", CVPR 2015.
[6] K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhutdinov, R. Zemel, and Y. Bengio, "Show, Attend and Tell: Neural Image Caption Generation with Visual Attention", ICML 2015.
[7] J. Donahue, L. A. Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell, "Long-term Recurrent Convolutional Networks for Visual Recognition and Description", CVPR 2015.
[8] A. Karpathy and L. Fei-Fei, "Deep Visual-Semantic Alignments for Generating Image Descriptions", CVPR 2015.
[9] J. Mao, W. Xu, Y. Yang, J. Wang, and A. L. Yuille, "Deep Captioning with Multimodal Recurrent Neural Networks (m-RNN)", ICLR 2015.
[10] L. Chen, H. Zhang, J. Xiao, L. Nie, J. Shao, W. Liu, and T.-S. Chua, "SCA-CNN: Spatial and Channel-wise Attention in Convolutional Networks for Image Captioning", CVPR 2017.