toggle visibility Search & Display Options

Select All    Deselect All
 |   | 
Details
  Records Links
Author Mohamed Ali Souibgui; Sanket Biswas; Andres Mafla; Ali Furkan Biten; Alicia Fornes; Yousri Kessentini; Josep Llados; Lluis Gomez; Dimosthenis Karatzas edit  url
openurl 
  Title Text-DIAE: a self-supervised degradation invariant autoencoder for text recognition and document enhancement Type Conference Article
  Year 2023 Publication Proceedings of the 37th AAAI Conference on Artificial Intelligence Abbreviated Journal  
  Volume 37 Issue 2 Pages  
  Keywords Representation Learning for Vision; CV Applications; CV Language and Vision; ML Unsupervised; Self-Supervised Learning  
  Abstract In this paper, we propose a Text-Degradation Invariant Auto Encoder (Text-DIAE), a self-supervised model designed to tackle two tasks, text recognition (handwritten or scene-text) and document image enhancement. We start by employing a transformer-based architecture that incorporates three pretext tasks as learning objectives to be optimized during pre-training without the usage of labelled data. Each of the pretext objectives is specifically tailored for the final downstream tasks. We conduct several ablation experiments that confirm the design choice of the selected pretext tasks. Importantly, the proposed model does not exhibit limitations of previous state-of-the-art methods based on contrastive losses, while at the same time requiring substantially fewer data samples to converge. Finally, we demonstrate that our method surpasses the state-of-the-art in existing supervised and self-supervised settings in handwritten and scene text recognition and document image enhancement. Our code and trained models will be made publicly available at https://github.com/dali92002/SSL-OCR  
  Address (up)  
  Corporate Author Thesis  
  Publisher Place of Publication Editor  
  Language Summary Language Original Title  
  Series Editor Series Title Abbreviated Series Title  
  Series Volume Series Issue Edition  
  ISSN ISBN Medium  
  Area Expedition Conference AAAI  
  Notes DAG Approved no  
  Call Number Admin @ si @ SBM2023 Serial 3848  
Permanent link to this record
 

 
Author Mohamed Ali Souibgui; Pau Torras; Jialuo Chen; Alicia Fornes edit  url
openurl 
  Title An Evaluation of Handwritten Text Recognition Methods for Historical Ciphered Manuscripts Type Conference Article
  Year 2023 Publication 7th International Workshop on Historical Document Imaging and Processing Abbreviated Journal  
  Volume Issue Pages 7-12  
  Keywords  
  Abstract This paper investigates the effectiveness of different deep learning HTR families, including LSTM, Seq2Seq, and transformer-based approaches with self-supervised pretraining, in recognizing ciphered manuscripts from different historical periods and cultures. The goal is to identify the most suitable method or training techniques for recognizing ciphered manuscripts and to provide insights into the challenges and opportunities in this field of research. We evaluate the performance of these models on several datasets of ciphered manuscripts and discuss their results. This study contributes to the development of more accurate and efficient methods for recognizing historical manuscripts for the preservation and dissemination of our cultural heritage.  
  Address (up)  
  Corporate Author Thesis  
  Publisher Place of Publication Editor  
  Language Summary Language Original Title  
  Series Editor Series Title Abbreviated Series Title  
  Series Volume Series Issue Edition  
  ISSN ISBN Medium  
  Area Expedition Conference HIP  
  Notes DAG Approved no  
  Call Number Admin @ si @ STC2023 Serial 3849  
Permanent link to this record
 

 
Author Pau Torras; Mohamed Ali Souibgui; Sanket Biswas; Alicia Fornes edit  url
openurl 
  Title Segmentation-Free Alignment of Arbitrary Symbol Transcripts to Images Type Conference Article
  Year 2023 Publication Document Analysis and Recognition – ICDAR 2023 Workshops Abbreviated Journal  
  Volume 14193 Issue Pages 83-93  
  Keywords Historical Manuscripts; Symbol Alignment  
  Abstract Developing arbitrary symbol recognition systems is a challenging endeavour. Even using content-agnostic architectures such as few-shot models, performance can be substantially improved by providing a number of well-annotated examples into training. In some contexts, transcripts of the symbols are available without any position information associated to them, which enables using line-level recognition architectures. A way of providing this position information to detection-based architectures is finding systems that can align the input symbols with the transcription. In this paper we discuss some symbol alignment techniques that are suitable for low-data scenarios and provide an insight on their perceived strengths and weaknesses. In particular, we study the usage of Connectionist Temporal Classification models, Attention-Based Sequence to Sequence models and we compare them with the results obtained on a few-shot recognition system.  
  Address (up)  
  Corporate Author Thesis  
  Publisher Place of Publication Editor  
  Language Summary Language Original Title  
  Series Editor Series Title Abbreviated Series Title LNCS  
  Series Volume Series Issue Edition  
  ISSN ISBN Medium  
  Area Expedition Conference ICDAR  
  Notes DAG Approved no  
  Call Number Admin @ si @ TSS2023 Serial 3850  
Permanent link to this record
 

 
Author Marwa Dhiaf; Mohamed Ali Souibgui; Kai Wang; Yuyang Liu; Yousri Kessentini; Alicia Fornes; Ahmed Cheikh Rouhou edit   pdf
url  openurl
  Title CSSL-MHTR: Continual Self-Supervised Learning for Scalable Multi-script Handwritten Text Recognition Type Miscellaneous
  Year 2023 Publication Arxiv Abbreviated Journal  
  Volume Issue Pages  
  Keywords  
  Abstract Self-supervised learning has recently emerged as a strong alternative in document analysis. These approaches are now capable of learning high-quality image representations and overcoming the limitations of supervised methods, which require a large amount of labeled data. However, these methods are unable to capture new knowledge in an incremental fashion, where data is presented to the model sequentially, which is closer to the realistic scenario. In this paper, we explore the potential of continual self-supervised learning to alleviate the catastrophic forgetting problem in handwritten text recognition, as an example of sequence recognition. Our method consists in adding intermediate layers called adapters for each task, and efficiently distilling knowledge from the previous model while learning the current task. Our proposed framework is efficient in both computation and memory complexity. To demonstrate its effectiveness, we evaluate our method by transferring the learned model to diverse text recognition downstream tasks, including Latin and non-Latin scripts. As far as we know, this is the first application of continual self-supervised learning for handwritten text recognition. We attain state-of-the-art performance on English, Italian and Russian scripts, whilst adding only a few parameters per task. The code and trained models will be publicly available.  
  Address (up)  
  Corporate Author Thesis  
  Publisher Place of Publication Editor  
  Language Summary Language Original Title  
  Series Editor Series Title Abbreviated Series Title  
  Series Volume Series Issue Edition  
  ISSN ISBN Medium  
  Area Expedition Conference  
  Notes DAG Approved no  
  Call Number Admin @ si @ DSW2023 Serial 3851  
Permanent link to this record
 

 
Author Ruben Perez Tito edit  isbn
openurl 
  Title Exploring the role of Text in Visual Question Answering on Natural Scenes and Documents Type Book Whole
  Year 2023 Publication PhD Thesis, Universitat Autonoma de Barcelona-CVC Abbreviated Journal  
  Volume Issue Pages  
  Keywords  
  Abstract Visual Question Answering (VQA) is the task where given an image and a natural language question, the objective is to generate a natural language answer. At the intersection between computer vision and natural language processing, this task can be seen as a measure of image understanding capabilities, as it requires to reason about objects, actions, colors, positions, the relations between the different elements as well as commonsense reasoning, world knowledge, arithmetic skills and natural language understanding. However, even though the text present in the images conveys important semantically rich information that is explicit and not available in any other form, most VQA methods remained illiterate, largely
ignoring the text despite its potential significance. In this thesis, we set out on a journey to bring reading capabilities to computer vision models applied to the VQA task, creating new datasets and methods that can read, reason and integrate the text with other visual cues in natural scene images and documents.
In Chapter 3, we address the combination of scene text with visual information to fully understand all the nuances of natural scene images. To achieve this objective, we define a new sub-task of VQA that requires reading the text in the image, and highlight the limitations of the current methods. In addition, we propose a new architecture that integrates both modalities and jointly reasons about textual and visual features. In Chapter 5, we shift the domain of VQA with reading capabilities and apply it on scanned industry document images, providing a high-level end-purpose perspective to Document Understanding, which has been
primarily focused on digitizing the document’s contents and extracting key values without considering the ultimate purpose of the extracted information. For this, we create a dataset which requires methods to reason about the unique and challenging elements of documents, such as text, images, tables, graphs and complex layouts, to provide accurate answers in natural language. However, we observed that explicit visual features provide a slight contribution in the overall performance, since the main information is usually conveyed within the text and its position. In consequence, in Chapter 6, we propose VQA on infographic images, seeking for document images with more visually rich elements that require to fully exploit visual information in order to answer the questions. We show the performance gap of
different methods when used over industry scanned and infographic images, and propose a new method that integrates the visual features in early stages, which allows the transformer architecture to exploit the visual features during the self-attention operation. Instead, in Chapter 7, we apply VQA on a big collection of single-page documents, where the methods must find which documents are relevant to answer the question, and provide the answer itself. Finally, in Chapter 8, mimicking real-world application problems where systems must process documents with multiple pages, we address the multipage document visual question answering task. We demonstrate the limitations of existing methods, including models specifically designed to process long sequences. To overcome these limitations, we propose
a hierarchical architecture that can process long documents, answer questions, and provide the index of the page where the information to answer the question is located as an explainability measure.
 
  Address (up)  
  Corporate Author Thesis Ph.D. thesis  
  Publisher IMPRIMA Place of Publication Editor Ernest Valveny  
  Language Summary Language Original Title  
  Series Editor Series Title Abbreviated Series Title  
  Series Volume Series Issue Edition  
  ISSN ISBN 978-84-124793-5-5 Medium  
  Area Expedition Conference  
  Notes DAG Approved no  
  Call Number Admin @ si @ Per2023 Serial 3967  
Permanent link to this record
 

 
Author Alloy Das; Sanket Biswas; Umapada Pal; Josep Llados edit   pdf
url  openurl
  Title Diving into the Depths of Spotting Text in Multi-Domain Noisy Scenes Type Miscellaneous
  Year 2023 Publication ARXIV Abbreviated Journal  
  Volume Issue Pages  
  Keywords  
  Abstract When used in a real-world noisy environment, the capacity to generalize to multiple domains is essential for any autonomous scene text spotting system. However, existing state-of-the-art methods employ pretraining and fine-tuning strategies on natural scene datasets, which do not exploit the feature interaction across other complex domains. In this work, we explore and investigate the problem of domain-agnostic scene text spotting, i.e., training a model on multi-domain source data such that it can directly generalize to target domains rather than being specialized for a specific domain or scenario. In this regard, we present the community a text spotting validation benchmark called Under-Water Text (UWT) for noisy underwater scenes to establish an important case study. Moreover, we also design an efficient super-resolution based end-to-end transformer baseline called DA-TextSpotter which achieves comparable or superior performance over existing text spotting architectures for both regular and arbitrary-shaped scene text spotting benchmarks in terms of both accuracy and model efficiency. The dataset, code and pre-trained models will be released upon acceptance.  
  Address (up)  
  Corporate Author Thesis  
  Publisher Place of Publication Editor  
  Language Summary Language Original Title  
  Series Editor Series Title Abbreviated Series Title  
  Series Volume Series Issue Edition  
  ISSN ISBN Medium  
  Area Expedition Conference  
  Notes DAG Approved no  
  Call Number Admin @ si @ DBP2023 Serial 3979  
Permanent link to this record
 

 
Author Souhail Bakkali; Sanket Biswas; Zuheng Ming; Mickael Coustaty; Marçal Rusiñol; Oriol Ramos Terrades; Josep Llados edit   pdf
url  openurl
  Title TransferDoc: A Self-Supervised Transferable Document Representation Learning Model Unifying Vision and Language Type Miscellaneous
  Year 2023 Publication Arxiv Abbreviated Journal  
  Volume Issue Pages  
  Keywords  
  Abstract The field of visual document understanding has witnessed a rapid growth in emerging challenges and powerful multi-modal strategies. However, they rely on an extensive amount of document data to learn their pretext objectives in a ``pre-train-then-fine-tune'' paradigm and thus, suffer a significant performance drop in real-world online industrial settings. One major reason is the over-reliance on OCR engines to extract local positional information within a document page. Therefore, this hinders the model's generalizability, flexibility and robustness due to the lack of capturing global information within a document image. We introduce TransferDoc, a cross-modal transformer-based architecture pre-trained in a self-supervised fashion using three novel pretext objectives. TransferDoc learns richer semantic concepts by unifying language and visual representations, which enables the production of more transferable models. Besides, two novel downstream tasks have been introduced for a ``closer-to-real'' industrial evaluation scenario where TransferDoc outperforms other state-of-the-art approaches.  
  Address (up)  
  Corporate Author Thesis  
  Publisher Place of Publication Editor  
  Language Summary Language Original Title  
  Series Editor Series Title Abbreviated Series Title  
  Series Volume Series Issue Edition  
  ISSN ISBN Medium  
  Area Expedition Conference  
  Notes DAG Approved no  
  Call Number Admin @ si @ BBM2023 Serial 3995  
Permanent link to this record
 

 
Author Mohamed Ali Souibgui; Y.Kessentini edit   pdf
url  doi
openurl 
  Title DE-GAN: A Conditional Generative Adversarial Network for Document Enhancement Type Journal Article
  Year 2022 Publication IEEE Transactions on Pattern Analysis and Machine Intelligence Abbreviated Journal TPAMI  
  Volume 44 Issue 3 Pages 1180-1191  
  Keywords  
  Abstract Documents often exhibit various forms of degradation, which make it hard to be read and substantially deteriorate the performance of an OCR system. In this paper, we propose an effective end-to-end framework named Document Enhancement Generative Adversarial Networks (DE-GAN) that uses the conditional GANs (cGANs) to restore severely degraded document images. To the best of our knowledge, this practice has not been studied within the context of generative adversarial deep networks. We demonstrate that, in different tasks (document clean up, binarization, deblurring and watermark removal), DE-GAN can produce an enhanced version of the degraded document with a high quality. In addition, our approach provides consistent improvements compared to state-of-the-art methods over the widely used DIBCO 2013, DIBCO 2017 and H-DIBCO 2018 datasets, proving its ability to restore a degraded document image to its ideal condition. The obtained results on a wide variety of degradation reveal the flexibility of the proposed model to be exploited in other document enhancement problems.  
  Address (up) 1 March 2022  
  Corporate Author Thesis  
  Publisher Place of Publication Editor  
  Language Summary Language Original Title  
  Series Editor Series Title Abbreviated Series Title  
  Series Volume Series Issue Edition  
  ISSN ISBN Medium  
  Area Expedition Conference  
  Notes DAG; 602.230; 600.121; 600.140 Approved no  
  Call Number Admin @ si @ SoK2022 Serial 3454  
Permanent link to this record
 

 
Author Francesco Brughi; Debora Gil; Llorenç Badiella; Eva Jove Casabella; Oriol Ramos Terrades edit   pdf
doi  isbn
openurl 
  Title Exploring the impact of inter-query variability on the performance of retrieval systems Type Conference Article
  Year 2014 Publication 11th International Conference on Image Analysis and Recognition Abbreviated Journal  
  Volume 8814 Issue Pages 413–420  
  Keywords  
  Abstract This paper introduces a framework for evaluating the performance of information retrieval systems. Current evaluation metrics provide an average score that does not consider performance variability across the query set. In this manner, conclusions lack of any statistical significance, yielding poor inference to cases outside the query set and possibly unfair comparisons. We propose to apply statistical methods in order to obtain a more informative measure for problems in which different query classes can be identified. In this context, we assess the performance variability on two levels: overall variability across the whole query set and specific query class-related variability. To this end, we estimate confidence bands for precision-recall curves, and we apply ANOVA in order to assess the significance of the performance across different query classes.  
  Address (up) Algarve; Portugal; October 2014  
  Corporate Author Thesis  
  Publisher Springer International Publishing Place of Publication Editor  
  Language Summary Language Original Title  
  Series Editor Series Title Abbreviated Series Title LNCS  
  Series Volume Series Issue Edition  
  ISSN 0302-9743 ISBN 978-3-319-11757-7 Medium  
  Area Expedition Conference ICIAR  
  Notes IAM; DAG; 600.060; 600.061; 600.077; 600.075 Approved no  
  Call Number Admin @ si @ BGB2014 Serial 2559  
Permanent link to this record
 

 
Author Miquel Ferrer; F. Serratosa; Ernest Valveny edit  openurl
  Title On the Relation Between the Median Graph and the Maximum Common Subgraph of a Set of Graphs Type Book Chapter
  Year 2007 Publication Abbreviated Journal  
  Volume Issue Pages  
  Keywords  
  Abstract  
  Address (up) Alicante (Spain)  
  Corporate Author Thesis  
  Publisher Place of Publication Editor  
  Language Summary Language Original Title  
  Series Editor Series Title Abbreviated Series Title  
  Series Volume Series Issue Edition  
  ISSN ISBN Medium  
  Area Expedition Conference  
  Notes DAG Approved no  
  Call Number DAG @ dag @ FSV2007 Serial 790  
Permanent link to this record
Select All    Deselect All
 |   | 
Details

Save Citations:
Export Records: