2021 |
|
Ruben Tito, Dimosthenis Karatzas and Ernest Valveny. 2021. Document Collection Visual Question Answering. 16th International Conference on Document Analysis and Recognition.778–792. (LNCS.)
Abstract: Current tasks and methods in Document Understanding aims to process documents as single elements. However, documents are usually organized in collections (historical records, purchase invoices), that provide context useful for their interpretation. To address this problem, we introduce Document Collection Visual Question Answering (DocCVQA) a new dataset and related task, where questions are posed over a whole collection of document images and the goal is not only to provide the answer to the given question, but also to retrieve the set of documents that contain the information needed to infer the answer. Along with the dataset we propose a new evaluation metric and baselines which provide further insights to the new dataset and task.
Keywords: Document collection; Visual Question Answering
|
|
|
Ruben Tito, Minesh Mathew, C.V. Jawahar, Ernest Valveny and Dimosthenis Karatzas. 2021. ICDAR 2021 Competition on Document Visual Question Answering. 16th International Conference on Document Analysis and Recognition.635–649.
Abstract: In this report we present results of the ICDAR 2021 edition of the Document Visual Question Challenges. This edition complements the previous tasks on Single Document VQA and Document Collection VQA with a newly introduced on Infographics VQA. Infographics VQA is based on a new dataset of more than 5, 000 infographics images and 30, 000 question-answer pairs. The winner methods have scored 0.6120 ANLS in Infographics VQA task, 0.7743 ANLSL in Document Collection VQA task and 0.8705 ANLS in Single Document VQA. We present a summary of the datasets used for each task, description of each of the submitted methods and the results and analysis of their performance. A summary of the progress made on Single Document VQA since the first edition of the DocVQA 2020 challenge is also presented.
|
|
|
Sanket Biswas, Pau Riba, Josep Llados and Umapada Pal. 2021. Beyond Document Object Detection: Instance-Level Segmentation of Complex Layouts. IJDAR, 24, 269–281.
Abstract: Information extraction is a fundamental task of many business intelligence services that entail massive document processing. Understanding a document page structure in terms of its layout provides contextual support which is helpful in the semantic interpretation of the document terms. In this paper, inspired by the progress of deep learning methodologies applied to the task of object recognition, we transfer these models to the specific case of document object detection, reformulating the traditional problem of document layout analysis. Moreover, we importantly contribute to prior arts by defining the task of instance segmentation on the document image domain. An instance segmentation paradigm is especially important in complex layouts whose contents should interact for the proper rendering of the page, i.e., the proper text wrapping around an image. Finally, we provide an extensive evaluation, both qualitative and quantitative, that demonstrates the superior performance of the proposed methodology over the current state of the art.
|
|
|
Sanket Biswas, Pau Riba, Josep Llados and Umapada Pal. 2021. DocSynth: A Layout Guided Approach for Controllable Document Image Synthesis. 16th International Conference on Document Analysis and Recognition.555–568. (LNCS.)
Abstract: Despite significant progress on current state-of-the-art image generation models, synthesis of document images containing multiple and complex object layouts is a challenging task. This paper presents a novel approach, called DocSynth, to automatically synthesize document images based on a given layout. In this work, given a spatial layout (bounding boxes with object categories) as a reference by the user, our proposed DocSynth model learns to generate a set of realistic document images consistent with the defined layout. Also, this framework has been adapted to this work as a superior baseline model for creating synthetic document image datasets for augmenting real data during training for document layout analysis tasks. Different sets of learning objectives have been also used to improve the model performance. Quantitatively, we also compare the generated results of our model with real data using standard evaluation metrics. The results highlight that our model can successfully generate realistic and diverse document images with multiple objects. We also present a comprehensive qualitative analysis summary of the different scopes of synthetic image generation tasks. Lastly, to our knowledge this is the first work of its kind.
|
|
|
Sanket Biswas, Pau Riba, Josep Llados and Umapada Pal. 2021. Graph-Based Deep Generative Modelling for Document Layout Generation. 16th International Conference on Document Analysis and Recognition.525–537. (LNCS.)
Abstract: One of the major prerequisites for any deep learning approach is the availability of large-scale training data. When dealing with scanned document images in real world scenarios, the principal information of its content is stored in the layout itself. In this work, we have proposed an automated deep generative model using Graph Neural Networks (GNNs) to generate synthetic data with highly variable and plausible document layouts that can be used to train document interpretation systems, in this case, specially in digital mailroom applications. It is also the first graph-based approach for document layout generation task experimented on administrative document images, in this case, invoices.
|
|
2020 |
|
Alicia Fornes, Josep Llados and Joana Maria Pujadas-Mora. 2020. Browsing of the Social Network of the Past: Information Extraction from Population Manuscript Images. Handwritten Historical Document Analysis, Recognition, and Retrieval – State of the Art and Future Trends. World Scientific.
|
|
|
Andres Mafla, Sounak Dey, Ali Furkan Biten, Lluis Gomez and Dimosthenis Karatzas. 2020. Fine-grained Image Classification and Retrieval by Combining Visual and Locally Pooled Textual Features. IEEE Winter Conference on Applications of Computer Vision.
Abstract: Text contained in an image carries high-level semantics that can be exploited to achieve richer image understanding. In particular, the mere presence of text provides strong guiding content that should be employed to tackle a diversity of computer vision tasks such as image retrieval, fine-grained classification, and visual question answering. In this paper, we address the problem of fine-grained classification and image retrieval by leveraging textual information along with visual cues to comprehend the existing intrinsic relation between the two modalities. The novelty of the proposed model consists of the usage of a PHOC descriptor to construct a bag of textual words along with a Fisher Vector Encoding that captures the morphology of text. This approach provides a stronger multimodal representation for this task and as our experiments demonstrate, it achieves state-of-the-art results on two different tasks, fine-grained classification and image retrieval.
|
|
|
Anjan Dutta, Pau Riba, Josep Llados and Alicia Fornes. 2020. Hierarchical Stochastic Graphlet Embedding for Graph-based Pattern Recognition. NEUCOMA, 32, 11579–11596.
Abstract: Despite being very successful within the pattern recognition and machine learning community, graph-based methods are often unusable because of the lack of mathematical operations defined in graph domain. Graph embedding, which maps graphs to a vectorial space, has been proposed as a way to tackle these difficulties enabling the use of standard machine learning techniques. However, it is well known that graph embedding functions usually suffer from the loss of structural information. In this paper, we consider the hierarchical structure of a graph as a way to mitigate this loss of information. The hierarchical structure is constructed by topologically clustering the graph nodes and considering each cluster as a node in the upper hierarchical level. Once this hierarchical structure is constructed, we consider several configurations to define the mapping into a vector space given a classical graph embedding, in particular, we propose to make use of the stochastic graphlet embedding (SGE). Broadly speaking, SGE produces a distribution of uniformly sampled low-to-high-order graphlets as a way to embed graphs into the vector space. In what follows, the coarse-to-fine structure of a graph hierarchy and the statistics fetched by the SGE complements each other and includes important structural information with varied contexts. Altogether, these two techniques substantially cope with the usual information loss involved in graph embedding techniques, obtaining a more robust graph representation. This fact has been corroborated through a detailed experimental evaluation on various benchmark graph datasets, where we outperform the state-of-the-art methods.
|
|
|
Arnau Baro, Alicia Fornes and Carles Badal. 2020. Handwritten Historical Music Recognition by Sequence-to-Sequence with Attention Mechanism. 17th International Conference on Frontiers in Handwriting Recognition.
Abstract: Despite decades of research in Optical Music Recognition (OMR), the recognition of old handwritten music scores remains a challenge because of the variabilities in the handwriting styles, paper degradation, lack of standard notation, etc. Therefore, the research in OMR systems adapted to the particularities of old manuscripts is crucial to accelerate the conversion of music scores existing in archives into digital libraries, fostering the dissemination and preservation of our music heritage. In this paper we explore the adaptation of sequence-to-sequence models with attention mechanism (used in translation and handwritten text recognition) and the generation of specific synthetic data for recognizing old music scores. The experimental validation demonstrates that our approach is promising, especially when compared with long short-term memory neural networks.
|
|
|
Asma Bensalah, Jialuo Chen, Alicia Fornes, Cristina Carmona_Duarte, Josep Llados and Miguel A. Ferrer. 2020. Towards Stroke Patients' Upper-limb Automatic Motor Assessment Using Smartwatches. International Workshop on Artificial Intelligence for Healthcare Applications.476–489.
Abstract: Assessing the physical condition in rehabilitation scenarios is a challenging problem, since it involves Human Activity Recognition (HAR) and kinematic analysis methods. In addition, the difficulties increase in unconstrained rehabilitation scenarios, which are much closer to the real use cases. In particular, our aim is to design an upper-limb assessment pipeline for stroke patients using smartwatches. We focus on the HAR task, as it is the first part of the assessing pipeline. Our main target is to automatically detect and recognize four key movements inspired by the Fugl-Meyer assessment scale, which are performed in both constrained and unconstrained scenarios. In addition to the application protocol and dataset, we propose two detection and classification baseline methods. We believe that the proposed framework, dataset and baseline results will serve to foster this research field.
|
|