TY  - CONF
AU  - Stepan Simsa
AU  - Milan Sulc
AU  - Michal Uricar
AU  - Yash Patel
AU  - Ahmed Hamdi
AU  - Matej Kocian
AU  - Matyas Skalicky
AU  - Jiri Matas
AU  - Antoine Doucet
AU  - Mickael Coustaty
AU  - Dimosthenis Karatzas
A2  - ICDAR
PY  - 2023//
TI  - DocILE Benchmark for Document Information Localization and Extraction
T2  - LNCS
BT  - 17th International Conference on Document Analysis and Recognition
SP  - 147–166
VL  - 14188
KW  - Document AI
KW  - Information Extraction
KW  - Line Item Recognition
KW  - Business Documents
KW  - Intelligent Document Processing
N2  - This paper introduces the DocILE benchmark with the largest dataset of business documents for the tasks of Key Information Localization and Extraction and Line Item Recognition. It contains 6.7k annotated business documents, 100k synthetically generated documents, and nearly 1M unlabeled documents for unsupervised pre-training. The dataset has been built with knowledge of domain- and task-specific aspects, resulting in the following key features: (i) annotations in 55 classes, which surpasses the granularity of previously published key information extraction datasets by a large margin; (ii) Line Item Recognition represents a highly practical information extraction task, where key information has to be assigned to items in a table; (iii) documents come from numerous layouts and the test set includes zero- and few-shot cases as well as layouts commonly seen in the training set. The benchmark comes with several baselines, including RoBERTa, LayoutLMv3 and DETR-based Table Transformer, applied to both tasks of the DocILE benchmark, with results shared in this paper, offering a quick starting point for future work. The dataset, baselines and supplementary material are available at https://github.com/rossumai/docile.
UR  - https://link.springer.com/chapter/10.1007/978-3-031-41679-8_9
L1  - http://refbase.cvc.uab.es/files/SSU2023.pdf
N1  - DAG
ID  - Stepan Simsa2023
ER  -