TY  - STD
AU  - Souhail Bakkali
AU  - Sanket Biswas
AU  - Zuheng Ming
AU  - Mickael Coustaty
AU  - Marçal Rusiñol
AU  - Oriol Ramos Terrades
AU  - Josep Llados
PY  - 2023//
TI  - TransferDoc: A Self-Supervised Transferable Document Representation Learning Model Unifying Vision and Language
N2  - The field of visual document understanding has witnessed a rapid growth in emerging challenges and powerful multi-modal strategies. However, they rely on an extensive amount of document data to learn their pretext objectives in a "pre-train-then-fine-tune" paradigm and thus, suffer a significant performance drop in real-world online industrial settings. One major reason is the over-reliance on OCR engines to extract local positional information within a document page. Therefore, this hinders the model's generalizability, flexibility and robustness due to the lack of capturing global information within a document image. We introduce TransferDoc, a cross-modal transformer-based architecture pre-trained in a self-supervised fashion using three novel pretext objectives. TransferDoc learns richer semantic concepts by unifying language and visual representations, which enables the production of more transferable models. Besides, two novel downstream tasks have been introduced for a "closer-to-real" industrial evaluation scenario where TransferDoc outperforms other state-of-the-art approaches.
UR  - https://arxiv.org/abs/2309.05756
L1  - http://refbase.cvc.uab.es/files/BBM2023.pdf
N1  - DAG
ID  - Souhail Bakkali2023
ER  -