toggle visibility Search & Display Options

Select All    Deselect All
 |   | 
  Record Links
Author (up) Ruben Tito; Dimosthenis Karatzas; Ernest Valveny edit   pdf
doi  openurl
  Title Hierarchical multimodal transformers for Multi-Page DocVQA Type Journal Article
  Year 2023 Publication Pattern Recognition Abbreviated Journal PR  
  Volume 144 Issue Pages 109834  
  Abstract Document Visual Question Answering (DocVQA) refers to the task of answering questions from document images. Existing work on DocVQA only considers single-page documents. However, in real scenarios documents are mostly composed of multiple pages that should be processed altogether. In this work we extend DocVQA to the multi-page scenario. For that, we first create a new dataset, MP-DocVQA, where questions are posed over multi-page documents instead of single pages. Second, we propose a new hierarchical method, Hi-VT5, based on the T5 architecture, that overcomes the limitations of current methods to process long multi-page documents. The proposed method is based on a hierarchical transformer architecture where the encoder summarizes the most relevant information of every page and then, the decoder takes this summarized information to generate the final answer. Through extensive experimentation, we demonstrate that our method is able, in a single stage, to answer the questions and provide the page that contains the relevant information to find the answer, which can be used as a kind of explainability measure.  
  Corporate Author Thesis  
  Publisher Place of Publication Editor  
  Language Summary Language Original Title  
  Series Editor Series Title Abbreviated Series Title  
  Series Volume Series Issue Edition  
  ISSN ISSN 0031-3203 ISBN Medium  
  Area Expedition Conference  
  Notes DAG; 600.155; 600.121 Approved no  
  Call Number Admin @ si @ TKV2023 Serial 3825  
Permanent link to this record
Select All    Deselect All
 |   | 

Save Citations:
Export Records: