Deep Learning for the Automatic Description of Images in Natural Language – Image Captioning

In the context of Computer Vision, Image Captioning (IC) refers to the automatic generation of textual descriptions for digital images. The task involves not only recognizing the objects in an image, but also describing their properties and the relationships and interactions between them, all expressed in syntactically and semantically correct natural language. In short, the main steps in automatically generating a textual description of an image are: (a) extracting the visual information from the image, and (b) “translating” it into an adequate and meaningful text. Recent developments in deep neural networks and Deep Learning have led to remarkable progress in IC as well, substantially improving the quality of the generated descriptions. Convolutional Neural Networks (CNN) are a natural choice for obtaining compact vector representations of image features, while Recurrent Neural Networks (RNN), in particular Long Short-Term Memory (LSTM) networks, are used to decode these representations into natural-language sentences. In this paper we present an overview of recent Deep Learning-based techniques and methods used in the IC field, and we detail and analyze, as a case study, one of the best-performing ones, which uses an encoder-decoder architecture combined with a visual attention mechanism that focuses on the relevant regions of the image when generating each new word of the output sequence.
Keywords: image captioning, machine learning, deep learning, deep neural network, convolutional network, recurrent network, LSTM, encoder-decoder, attention mechanism
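As an illustration of the attention step mentioned in the abstract, the following is a minimal sketch of soft visual attention and not the implementation analyzed in the paper. It assumes that the encoder has already produced one feature vector per image region and that the decoder exposes a hidden state of the same dimension. The dot-product scoring, the function names `softmax` and `attend`, and the toy values are all illustrative assumptions; attention-based captioning models typically learn the scoring function instead.

```python
import math


def softmax(scores):
    """Turn raw scores into weights that are positive and sum to 1."""
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(region_features, hidden_state):
    """Soft attention over image regions (hypothetical helper).

    region_features: list of feature vectors, one per image region
                     (assumed to come from a CNN encoder).
    hidden_state:    current decoder state (assumed to come from an LSTM).
    Returns the attention weights and the weighted context vector.
    """
    # Assumption: relevance is scored with a plain dot product.
    scores = [
        sum(f * h for f, h in zip(region, hidden_state))
        for region in region_features
    ]
    weights = softmax(scores)
    dim = len(region_features[0])
    # Context vector: attention-weighted sum of the region features.
    context = [
        sum(w * region[d] for w, region in zip(weights, region_features))
        for d in range(dim)
    ]
    return weights, context


# Toy example: three regions with 2-dimensional features.
regions = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
hidden = [2.0, 0.0]
weights, context = attend(regions, hidden)
print(weights, context)
```

In a full captioning model, the context vector would be fed to the decoder together with the previously generated word, and the weights would be recomputed for every new word in the output sequence.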