toggle visibility Search & Display Options

Select All    Deselect All
 |   | 
Details
   print
  Record Links
Author (up) Stepan Simsa; Milan Sulc; Michal Uricar; Yash Patel; Ahmed Hamdi; Matej Kocian; Matyas Skalicky; Jiri Matas; Antoine Doucet; Mickael Coustaty; Dimosthenis Karatzas edit   pdf
url  openurl
  Title DocILE Benchmark for Document Information Localization and Extraction Type Conference Article
  Year 2023 Publication 17th International Conference on Document Analysis and Recognition Abbreviated Journal  
  Volume 14188 Issue Pages 147–166  
  Keywords Document AI; Information Extraction; Line Item Recognition; Business Documents; Intelligent Document Processing  
  Abstract This paper introduces the DocILE benchmark with the largest dataset of business documents for the tasks of Key Information Localization and Extraction and Line Item Recognition. It contains 6.7k annotated business documents, 100k synthetically generated documents, and nearly 1M unlabeled documents for unsupervised pre-training. The dataset has been built with knowledge of domain- and task-specific aspects, resulting in the following key features: (i) annotations in 55 classes, which surpasses the granularity of previously published key information extraction datasets by a large margin; (ii) Line Item Recognition represents a highly practical information extraction task, where key information has to be assigned to items in a table; (iii) documents come from numerous layouts and the test set includes zero- and few-shot cases as well as layouts commonly seen in the training set. The benchmark comes with several baselines, including RoBERTa, LayoutLMv3 and DETR-based Table Transformer; applied to both tasks of the DocILE benchmark, with results shared in this paper, offering a quick starting point for future work. The dataset, baselines and supplementary material are available at https://github.com/rossumai/docile.  
  Address San Jose; CA; USA; August 2023  
  Corporate Author Thesis  
  Publisher Place of Publication Editor  
  Language Summary Language Original Title  
  Series Editor Series Title Abbreviated Series Title LNCS  
  Series Volume Series Issue Edition  
  ISSN ISBN Medium  
  Area Expedition Conference ICDAR  
  Notes DAG Approved no  
  Call Number Admin @ si @ SSU2023 Serial 3903  
Permanent link to this record
Select All    Deselect All
 |   | 
Details
   print

Save Citations:
Export Records: