TY  - JOUR
AU  - Souhail Bakkali
AU  - Zuheng Ming
AU  - Mickael Coustaty
AU  - Marçal Rusiñol
AU  - Oriol Ramos Terrades
PY  - 2023//
TI  - VLCDoC: Vision-Language Contrastive Pre-Training Model for Cross-Modal Document Classification
T2  - PR
JO  - Pattern Recognition
SP  - 109419
VL  - 139
N2  - Multimodal learning from document data has achieved great success lately as it allows to pre-train semantically meaningful features as a prior into a learnable downstream approach. In this paper, we approach the document classification problem by learning cross-modal representations through language and vision cues, considering intra- and inter-modality relationships. Instead of merging features from different modalities into a common representation space, the proposed method exploits high-level interactions and learns relevant semantic information from effective attention flows within and across modalities. The proposed learning objective is devised between intra- and inter-modality alignment tasks, where the similarity distribution per task is computed by contracting positive sample pairs while simultaneously contrasting negative ones in the common feature representation space. Extensive experiments on public document classification datasets demonstrate the effectiveness and the generalization capacity of our model on both low-scale and large-scale datasets.
SN  - 0031-3203
L1  - http://refbase.cvc.uab.es/files/BMC2022.pdf
UR  - http://dx.doi.org/10.1016/j.patcog.2023.109419
N1  - DAG; 600.140; 600.121
ID  - Souhail Bakkali2023
ER  -