|
Ruben Tito, Dimosthenis Karatzas, & Ernest Valveny. (2023). Hierarchical multimodal transformers for Multi-Page DocVQA. PR - Pattern Recognition, 144, 109834.
Abstract: Document Visual Question Answering (DocVQA) refers to the task of answering questions from document images. Existing work on DocVQA only considers single-page documents. However, in real scenarios documents are mostly composed of multiple pages that should be processed altogether. In this work we extend DocVQA to the multi-page scenario. For that, we first create a new dataset, MP-DocVQA, where questions are posed over multi-page documents instead of single pages. Second, we propose a new hierarchical method, Hi-VT5, based on the T5 architecture, that overcomes the limitations of current methods to process long multi-page documents. The proposed method is based on a hierarchical transformer architecture where the encoder summarizes the most relevant information of every page and then, the decoder takes this summarized information to generate the final answer. Through extensive experimentation, we demonstrate that our method is able, in a single stage, to answer the questions and provide the page that contains the relevant information to find the answer, which can be used as a kind of explainability measure.
|
|
|
Pau Riba, Josep Llados, & Alicia Fornes. (2020). Hierarchical graphs for coarse-to-fine error tolerant matching. PRL - Pattern Recognition Letters, 134, 116–124.
Abstract: During the last years, graph-based representations are experiencing a growing usage in visual recognition and retrieval due to their ability to capture both structural and appearance-based information. Thus, they provide a greater representational power than classical statistical frameworks. However, graph-based representations leads to high computational complexities usually dealt by graph embeddings or approximated matching techniques. Despite their representational power, they are very sensitive to noise and small variations of the input image. With the aim to cope with the time complexity and the variability present in the generated graphs, in this paper we propose to construct a novel hierarchical graph representation. Graph clustering techniques adapted from social media analysis have been used in order to contract a graph at different abstraction levels while keeping information about the topology. Abstract nodes attributes summarise information about the contracted graph partition. For the proposed representations, a coarse-to-fine matching technique is defined. Hence, small graphs are used as a filtering before more accurate matching methods are applied. This approach has been validated in real scenarios such as classification of colour images or retrieval of handwritten words (i.e. word spotting).
Keywords: Hierarchical graph representation; Coarse-to-fine graph matching; Graph-based retrieval
|
|
|
Josep Llados, & Enric Marti. (1999). Graph-edit algorithms for hand-drawn graphical document recognition and their automatic introduction. Machine Graphics & Vision journal, special issue on Graph transformation, .
|
|
|
Josep Llados, & Gemma Sanchez. (2004). Graph Matching vs. Graph Parsing in Graphics Recognition: A Combined Approach. IJPRAI - International Journal of Pattern Recognition and Artificial Intelligence, 455–473.
|
|
|
Jaume Gibert, Ernest Valveny, & Horst Bunke. (2012). Graph Embedding in Vector Spaces by Node Attribute Statistics. PR - Pattern Recognition, 45(9), 3072–3083.
Abstract: Graph-based representations are of broad use and applicability in pattern recognition. They exhibit, however, a major drawback with regards to the processing tools that are available in their domain. Graphembedding into vectorspaces is a growing field among the structural pattern recognition community which aims at providing a feature vector representation for every graph, and thus enables classical statistical learning machinery to be used on graph-based input patterns. In this work, we propose a novel embedding methodology for graphs with continuous nodeattributes and unattributed edges. The approach presented in this paper is based on statistics of the node labels and the edges between them, based on their similarity to a set of representatives. We specifically deal with an important issue of this methodology, namely, the selection of a suitable set of representatives. In an experimental evaluation, we empirically show the advantages of this novel approach in the context of different classification problems using several databases of graphs.
Keywords: Structural pattern recognition; Graph embedding; Data clustering; Graph classification
|
|