TY  - CONF
AU  - Ali Furkan Biten
AU  - R. Tito
AU  - Andres Mafla
AU  - Lluis Gomez
AU  - Marçal Rusiñol
AU  - C.V. Jawahar
AU  - Ernest Valveny
AU  - Dimosthenis Karatzas
A2  - ICCV
PY  - 2019//
TI  - Scene Text Visual Question Answering
BT  - 18th IEEE International Conference on Computer Vision
SP  - 4291
EP  - 4301
N2  - Current visual question answering datasets do not consider the rich semantic information conveyed by text within an image. In this work, we present a new dataset, ST-VQA, that aims to highlight the importance of exploiting high-level semantic information present in images as textual cues in the Visual Question Answering process. We use this dataset to define a series of tasks of increasing difficulty for which reading the scene text in the context provided by the visual information is necessary to reason and generate an appropriate answer. We propose a new evaluation metric for these tasks to account both for reasoning errors as well as shortcomings of the text recognition module. In addition, we put forward a series of baseline methods, which provide further insight to the newly released dataset, and set the scene for further research.
UR  - https://ieeexplore.ieee.org/document/9011031
L1  - http://refbase.cvc.uab.es/files/BTM2019b.pdf
UR  - http://dx.doi.org/10.1109/ICCV.2019.00439
N1  - DAG; 600.129; 600.135; 601.338; 600.121
ID  - Ali Furkan Biten2019
ER  -