Visual Question Answering (VQA) is an emerging topic that aims to automatically answer natural-language questions about a given image. Together with image captioning, VQA is the main point of contact between the two heterogeneous communities of Natural Language Processing (NLP) and Computer Vision. VQA datasets contain different types of questions that require reasoning about several facets of the problem, including, among others, object localization, attribute detection, activity classification, scene understanding, counting, and scene text understanding. The variety of these sub-tasks makes the topic particularly challenging.
Most approaches to VQA are based on Deep Learning and adopt Convolutional Neural Networks (CNNs) to represent images and Recurrent Neural Networks (RNNs) to represent sentences or phrases. The extracted visual and textual feature vectors are then jointly embedded, typically by concatenation, element-wise sum, or element-wise product, to infer the answer. Almost all VQA techniques also use an attention mechanism to select the salient image regions that are most useful for answering correctly. In this seminar, the problem of Visual Question Answering will be described together with the most popular datasets and their structure. We will then focus on the most interesting state-of-the-art techniques, analyzing their strengths and weaknesses, in order to outline the next steps needed to improve current results.
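The joint-embedding step mentioned above can be illustrated with a minimal sketch. The toy vectors and the `fuse` helper below are hypothetical; in a real system `v` would come from a CNN and `q` from an RNN, each with hundreds of dimensions, and the fused vector would feed an answer classifier.

```python
# Toy illustration of the three common fusion strategies for combining
# a visual feature vector v and a textual feature vector q in VQA.
# (Hypothetical helper, not taken from any specific VQA system.)

def fuse(v, q, mode="concat"):
    """Jointly embed visual (v) and textual (q) feature vectors."""
    if mode == "concat":
        return v + q  # list concatenation: [v; q]
    if mode == "sum":
        return [a + b for a, b in zip(v, q)]  # element-wise sum
    if mode == "product":
        return [a * b for a, b in zip(v, q)]  # element-wise product
    raise ValueError(f"unknown fusion mode: {mode}")

v = [0.2, 0.5, 0.1]   # toy image features (stand-in for CNN output)
q = [1.0, 0.0, 2.0]   # toy question features (stand-in for RNN output)

print(fuse(v, q, "concat"))   # [0.2, 0.5, 0.1, 1.0, 0.0, 2.0]
print(fuse(v, q, "sum"))      # [1.2, 0.5, 2.1]
print(fuse(v, q, "product"))  # [0.2, 0.0, 0.2]
```

Note that concatenation preserves both modalities but doubles the dimensionality, while the element-wise operations keep the dimensionality fixed at the cost of requiring both vectors to have the same size.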