TY  - CONF
AU  - Emanuele Vivoli
AU  - Ali Furkan Biten
AU  - Andres Mafla
AU  - Dimosthenis Karatzas
AU  - Lluis Gomez
A2  - ECCVW
PY  - 2022//
TI  - MUST-VQA: MUltilingual Scene-text VQA
T2  - LNCS
BT  - Proceedings European Conference on Computer Vision Workshops
SP  - 345–358
VL  - 13804
KW  - Visual question answering
KW  - Scene text
KW  - Translation robustness
KW  - Multilingual models
KW  - Zero-shot transfer
KW  - Power of language models
N2  - In this paper, we present a framework for Multilingual Scene Text Visual Question Answering that deals with new languages in a zero-shot fashion. Specifically, we consider the task of Scene Text Visual Question Answering (STVQA) in which the question can be asked in different languages and is not necessarily aligned to the scene text language. Thus, we first introduce a natural step towards a more generalized version of STVQA: MUST-VQA. Accounting for this, we discuss two evaluation scenarios in the constrained setting, namely IID and zero-shot, and we demonstrate that the models can perform on a par in the zero-shot setting. We further provide extensive experimentation and show the effectiveness of adapting multilingual language models to STVQA tasks.
UR  - https://link.springer.com/chapter/10.1007/978-3-031-25069-9_23
L1  - http://refbase.cvc.uab.es/files/VBM2022.pdf
UR  - http://dx.doi.org/10.1007/978-3-031-25069-9_23
N1  - DAG; 302.105; 600.155; 611.002
ID  - Emanuele Vivoli2022
ER  -