toggle visibility Search & Display Options

Select All    Deselect All
 |   | 
Details
  Records Links
Author Souhail Bakkali; Sanket Biswas; Zuheng Ming; Mickael Coustaty; Marçal Rusiñol; Oriol Ramos Terrades; Josep Llados edit   pdf
url  openurl
  Title TransferDoc: A Self-Supervised Transferable Document Representation Learning Model Unifying Vision and Language Type Miscellaneous
  Year 2023 Publication Arxiv Abbreviated Journal  
  Volume Issue Pages  
  Keywords  
  Abstract The field of visual document understanding has witnessed a rapid growth in emerging challenges and powerful multi-modal strategies. However, they rely on an extensive amount of document data to learn their pretext objectives in a ``pre-train-then-fine-tune'' paradigm and thus, suffer a significant performance drop in real-world online industrial settings. One major reason is the over-reliance on OCR engines to extract local positional information within a document page. Therefore, this hinders the model's generalizability, flexibility and robustness due to the lack of capturing global information within a document image. We introduce TransferDoc, a cross-modal transformer-based architecture pre-trained in a self-supervised fashion using three novel pretext objectives. TransferDoc learns richer semantic concepts by unifying language and visual representations, which enables the production of more transferable models. Besides, two novel downstream tasks have been introduced for a ``closer-to-real'' industrial evaluation scenario where TransferDoc outperforms other state-of-the-art approaches.  
  Address  
  Corporate Author Thesis  
  Publisher Place of Publication Editor  
  Language Summary Language Original Title  
  Series Editor Series Title Abbreviated Series Title  
  Series Volume Series Issue Edition  
  ISSN ISBN Medium  
  Area Expedition Conference  
  Notes DAG Approved no  
  Call Number (down) Admin @ si @ BBM2023 Serial 3995  
Permanent link to this record
 

 
Author Ayan Banerjee; Sanket Biswas; Josep Llados; Umapada Pal edit   pdf
url  openurl
  Title GraphKD: Exploring Knowledge Distillation Towards Document Object Detection with Structured Graph Creation Type Miscellaneous
  Year 2024 Publication Arxiv Abbreviated Journal  
  Volume Issue Pages  
  Keywords  
  Abstract Object detection in documents is a key step to automate the structural elements identification process in a digital or scanned document through understanding the hierarchical structure and relationships between different elements. Large and complex models, while achieving high accuracy, can be computationally expensive and memory-intensive, making them impractical for deployment on resource constrained devices. Knowledge distillation allows us to create small and more efficient models that retain much of the performance of their larger counterparts. Here we present a graph-based knowledge distillation framework to correctly identify and localize the document objects in a document image. Here, we design a structured graph with nodes containing proposal-level features and edges representing the relationship between the different proposal regions. Also, to reduce text bias an adaptive node sampling strategy is designed to prune the weight distribution and put more weightage on non-text nodes. We encode the complete graph as a knowledge representation and transfer it from the teacher to the student through the proposed distillation loss by effectively capturing both local and global information concurrently. Extensive experimentation on competitive benchmarks demonstrates that the proposed framework outperforms the current state-of-the-art approaches. The code will be available at: this https URL.  
  Address  
  Corporate Author Thesis  
  Publisher Place of Publication Editor  
  Language Summary Language Original Title  
  Series Editor Series Title Abbreviated Series Title  
  Series Volume Series Issue Edition  
  ISSN ISBN Medium  
  Area Expedition Conference  
  Notes DAG Approved no  
  Call Number (down) Admin @ si @ BBL2024b Serial 4023  
Permanent link to this record
 

 
Author Ayan Banerjee; Sanket Biswas; Josep Llados; Umapada Pal edit  url
doi  openurl
  Title SemiDocSeg: Harnessing Semi-Supervised Learning for Document Layout Analysis Type Journal Article
  Year 2024 Publication International Journal on Document Analysis and Recognition Abbreviated Journal IJDAR  
  Volume Issue Pages  
  Keywords Document layout analysis; Semi-supervised learning; Co-Occurrence matrix; Instance segmentation; Swin transformer  
  Abstract Document Layout Analysis (DLA) is the process of automatically identifying and categorizing the structural components (e.g. Text, Figure, Table, etc.) within a document to extract meaningful content and establish the page's layout structure. It is a crucial stage in document parsing, contributing to their comprehension. However, traditional DLA approaches often demand a significant volume of labeled training data, and the labor-intensive task of generating high-quality annotated training data poses a substantial challenge. In order to address this challenge, we proposed a semi-supervised setting that aims to perform learning on limited annotated categories by eliminating exhaustive and expensive mask annotations. The proposed setting is expected to be generalizable to novel categories as it learns the underlying positional information through a support set and class information through Co-Occurrence that can be generalized from annotated categories to novel categories. Here, we first extract features from the input image and support set with a shared multi-scale feature acquisition backbone. Then, the extracted feature representation is fed to the transformer encoder as a query. Later on, we utilize a semantic embedding network before the decoder to capture the underlying semantic relationships and similarities between different instances, enabling the model to make accurate predictions or classifications with only a limited amount of labeled data. Extensive experimentation on competitive benchmarks like PRIMA, DocLayNet, and Historical Japanese (HJ) demonstrate that this generalized setup obtains significant performance compared to the conventional supervised approach.  
  Address June 2024  
  Corporate Author Thesis  
  Publisher Place of Publication Editor  
  Language Summary Language Original Title  
  Series Editor Series Title Abbreviated Series Title  
  Series Volume Series Issue Edition  
  ISSN ISBN Medium  
  Area Expedition Conference  
  Notes DAG Approved no  
  Call Number (down) Admin @ si @ BBL2024a Serial 4001  
Permanent link to this record
 

 
Author Ayan Banerjee; Sanket Biswas; Josep Llados; Umapada Pal edit  url
openurl 
  Title SwinDocSegmenter: An End-to-End Unified Domain Adaptive Transformer for Document Instance Segmentation Type Conference Article
  Year 2023 Publication 17th International Conference on Document Analysis and Recognition Abbreviated Journal  
  Volume 14187 Issue Pages 307–325  
  Keywords  
  Abstract Instance-level segmentation of documents consists in assigning a class-aware and instance-aware label to each pixel of the image. It is a key step in document parsing for their understanding. In this paper, we present a unified transformer encoder-decoder architecture for en-to-end instance segmentation of complex layouts in document images. The method adapts a contrastive training with a mixed query selection for anchor initialization in the decoder. Later on, it performs a dot product between the obtained query embeddings and the pixel embedding map (coming from the encoder) for semantic reasoning. Extensive experimentation on competitive benchmarks like PubLayNet, PRIMA, Historical Japanese (HJ), and TableBank demonstrate that our model with SwinL backbone achieves better segmentation performance than the existing state-of-the-art approaches with the average precision of 93.72, 54.39, 84.65 and 98.04 respectively under one billion parameters. The code is made publicly available at: github.com/ayanban011/SwinDocSegmenter .  
  Address San Jose; CA; USA; August 2023  
  Corporate Author Thesis  
  Publisher Place of Publication Editor  
  Language Summary Language Original Title  
  Series Editor Series Title Abbreviated Series Title LNCS  
  Series Volume Series Issue Edition  
  ISSN ISBN Medium  
  Area Expedition Conference ICDAR  
  Notes DAG Approved no  
  Call Number (down) Admin @ si @ BBL2023 Serial 3893  
Permanent link to this record
 

 
Author Dena Bazazian edit  isbn
openurl 
  Title Fully Convolutional Networks for Text Understanding in Scene Images Type Book Whole
  Year 2018 Publication PhD Thesis, Universitat Autonoma de Barcelona-CVC Abbreviated Journal  
  Volume Issue Pages  
  Keywords  
  Abstract Text understanding in scene images has gained plenty of attention in the computer vision community and it is an important task in many applications as text carries semantically rich information about scene content and context. For instance, reading text in a scene can be applied to autonomous driving, scene understanding or assisting visually impaired people. The general aim of scene text understanding is to localize and recognize text in scene images. Text regions are first localized in the original image by a trained detector model and afterwards fed into a recognition module. The tasks of localization and recognition are highly correlated since an inaccurate localization can affect the recognition task.
The main purpose of this thesis is to devise efficient methods for scene text understanding. We investigate how the latest results on deep learning can advance text understanding pipelines. Recently, Fully Convolutional Networks (FCNs) and derived methods have achieved a significant performance on semantic segmentation and pixel level classification tasks. Therefore, we took benefit of the strengths of FCN approaches in order to detect text in natural scenes. In this thesis we have focused on two challenging tasks of scene text understanding which are Text Detection and Word Spotting. For the task of text detection, we have proposed an efficient text proposal technique in scene images. We have considered the Text Proposals method as the baseline which is an approach to reduce the search space of possible text regions in an image. In order to improve the Text Proposals method we combined it with Fully Convolutional Networks to efficiently reduce the number of proposals while maintaining the same level of accuracy and thus gaining a significant speed up. Our experiments demonstrate that this text proposal approach yields significantly higher recall rates than the line based text localization techniques, while also producing better-quality localization. We have also applied this technique on compressed images such as videos from wearable egocentric cameras. For the task of word spotting, we have introduced a novel mid-level word representation method. We have proposed a technique to create and exploit an intermediate representation of images based on text attributes which roughly correspond to character probability maps. Our representation extends the concept of Pyramidal Histogram Of Characters (PHOC) by exploiting Fully Convolutional Networks to derive a pixel-wise mapping of the character distribution within candidate word regions. We call this representation the Soft-PHOC. Furthermore, we show how to use Soft-PHOC descriptors for word spotting tasks through an efficient text line proposal algorithm. To evaluate the detected text, we propose a novel line based evaluation along with the classic bounding box based approach. We test our method on incidental scene text images which comprises real-life scenarios such as urban scenes. The importance of incidental scene text images is due to the complexity of backgrounds, perspective, variety of script and language, short text and little linguistic context. All of these factors together makes the incidental scene text images challenging.
 
  Address November 2018  
  Corporate Author Thesis Ph.D. thesis  
  Publisher Ediciones Graficas Rey Place of Publication Editor Dimosthenis Karatzas;Andrew Bagdanov  
  Language Summary Language Original Title  
  Series Editor Series Title Abbreviated Series Title  
  Series Volume Series Issue Edition  
  ISSN ISBN 978-84-948531-1-1 Medium  
  Area Expedition Conference  
  Notes DAG; 600.121 Approved no  
  Call Number (down) Admin @ si @ Baz2018 Serial 3220  
Permanent link to this record
 

 
Author Arnau Baro edit  isbn
openurl 
  Title Reading Music Systems: From Deep Optical Music Recognition to Contextual Methods Type Book Whole
  Year 2022 Publication PhD Thesis, Universitat Autonoma de Barcelona-CVC Abbreviated Journal  
  Volume Issue Pages  
  Keywords  
  Abstract The transcription of sheet music into some machine-readable format can be carried out manually. However, the complexity of music notation inevitably leads to burdensome software for music score editing, which makes the whole process
very time-consuming and prone to errors. Consequently, automatic transcription
systems for musical documents represent interesting tools.
Document analysis is the subject that deals with the extraction and processing
of documents through image and pattern recognition. It is a branch of computer
vision. Taking music scores as source, the field devoted to address this task is
known as Optical Music Recognition (OMR). Typically, an OMR system takes an
image of a music score and automatically extracts its content into some symbolic
structure such as MEI or MusicXML.
In this dissertation, we have investigated different methods for recognizing a
single staff section (e.g. scores for violin, flute, etc.), much in the same way as most text recognition research focuses on recognizing words appearing in a given line image. These methods are based in two different methodologies. On the one hand, we present two methods based on Recurrent Neural Networks, in particular, the
Long Short-Term Memory Neural Network. On the other hand, a method based on Sequence to Sequence models is detailed.
Music context is needed to improve the OMR results, just like language models
and dictionaries help in handwriting recognition. For example, syntactical rules
and grammars could be easily defined to cope with the ambiguities in the rhythm.
In music theory, for example, the time signature defines the amount of beats per
bar unit. Thus, in the second part of this dissertation, different methodologies
have been investigated to improve the OMR recognition. We have explored three
different methods: (a) a graphic tree-structure representation, Dendrograms, that
joins, at each level, its primitives following a set of rules, (b) the incorporation of Language Models to model the probability of a sequence of tokens, and (c) graph neural networks to analyze the music scores to avoid meaningless relationships between music primitives.
Finally, to train all these methodologies, and given the method-specificity of
the datasets in the literature, we have created four different music datasets. Two of them are synthetic with a modern or old handwritten appearance, whereas the
other two are real handwritten scores, being one of them modern and the other
old.
 
  Address  
  Corporate Author Thesis Ph.D. thesis  
  Publisher IMPRIMA Place of Publication Editor Alicia Fornes  
  Language Summary Language Original Title  
  Series Editor Series Title Abbreviated Series Title  
  Series Volume Series Issue Edition  
  ISSN ISBN 978-84-124793-8-6 Medium  
  Area Expedition Conference  
  Notes DAG; Approved no  
  Call Number (down) Admin @ si @ Bar2022 Serial 3754  
Permanent link to this record
 

 
Author Jon Almazan; Ernest Valveny; Alicia Fornes edit  doi
openurl 
  Title Deforming the Blurred Shape Model for Shape Description and Recognition Type Conference Article
  Year 2011 Publication 5th Iberian Conference on Pattern Recognition and Image Analysis Abbreviated Journal  
  Volume 6669 Issue Pages 1-8  
  Keywords  
  Abstract This paper presents a new model for the description and recognition of distorted shapes, where the image is represented by a pixel density distribution based on the Blurred Shape Model combined with a non-linear image deformation model. This leads to an adaptive structure able to capture elastic deformations in shapes. This method has been evaluated using thee different datasets where deformations are present, showing the robustness and good performance of the new model. Moreover, we show that incorporating deformation and flexibility, the new model outperforms the BSM approach when classifying shapes with high variability of appearance.  
  Address Las Palmas de Gran Canaria. Spain  
  Corporate Author Thesis  
  Publisher Springer-Verlag Place of Publication Berlin Editor Jordi Vitria; Joao Miguel Raposo; Mario Hernandez  
  Language Summary Language Original Title  
  Series Editor Series Title Abbreviated Series Title LNCS  
  Series Volume Series Issue Edition  
  ISSN ISBN Medium  
  Area Expedition Conference IbPRIA  
  Notes DAG; Approved no  
  Call Number (down) Admin @ si @ AVF2011 Serial 1732  
Permanent link to this record
 

 
Author David Aldavert; Marçal Rusiñol; Ricardo Toledo edit   pdf
doi  openurl
  Title Automatic Static/Variable Content Separation in Administrative Document Images Type Conference Article
  Year 2017 Publication 14th International Conference on Document Analysis and Recognition Abbreviated Journal  
  Volume Issue Pages  
  Keywords  
  Abstract In this paper we present an automatic method for separating static and variable content from administrative document images. An alignment approach is able to unsupervisedly build probabilistic templates from a set of examples of the same document kind. Such templates define which is the likelihood of every pixel of being either static or variable content. In the extraction step, the same alignment technique is used to match
an incoming image with the template and to locate the positions where variable fields appear. We validate our approach on the public NIST Structured Tax Forms Dataset.
 
  Address Kyoto; Japan; November 2017  
  Corporate Author Thesis  
  Publisher Place of Publication Editor  
  Language Summary Language Original Title  
  Series Editor Series Title Abbreviated Series Title  
  Series Volume Series Issue Edition  
  ISSN ISBN Medium  
  Area Expedition Conference ICDAR  
  Notes DAG; 600.084; 600.121 Approved no  
  Call Number (down) Admin @ si @ ART2017 Serial 3001  
Permanent link to this record
 

 
Author David Aldavert; Marçal Rusiñol; Ricardo Toledo; Josep Llados edit  doi
openurl 
  Title A Study of Bag-of-Visual-Words Representations for Handwritten Keyword Spotting Type Journal Article
  Year 2015 Publication International Journal on Document Analysis and Recognition Abbreviated Journal IJDAR  
  Volume 18 Issue 3 Pages 223-234  
  Keywords Bag-of-Visual-Words; Keyword spotting; Handwritten documents; Performance evaluation  
  Abstract The Bag-of-Visual-Words (BoVW) framework has gained popularity among the document image analysis community, specifically as a representation of handwritten words for recognition or spotting purposes. Although in the computer vision field the BoVW method has been greatly improved, most of the approaches in the document image analysis domain still rely on the basic implementation of the BoVW method disregarding such latest refinements. In this paper, we present a review of those improvements and its application to the keyword spotting task. We thoroughly evaluate their impact against a baseline system in the well-known George Washington dataset and compare the obtained results against nine state-of-the-art keyword spotting methods. In addition, we also compare both the baseline and improved systems with the methods presented at the Handwritten Keyword Spotting Competition 2014.  
  Address  
  Corporate Author Thesis  
  Publisher Springer Berlin Heidelberg Place of Publication Editor  
  Language Summary Language Original Title  
  Series Editor Series Title Abbreviated Series Title  
  Series Volume Series Issue Edition  
  ISSN 1433-2833 ISBN Medium  
  Area Expedition Conference  
  Notes DAG; ADAS; 600.055; 600.061; 601.223; 600.077; 600.097 Approved no  
  Call Number (down) Admin @ si @ ART2015 Serial 2679  
Permanent link to this record
 

 
Author David Aldavert; Marçal Rusiñol; Ricardo Toledo; Josep Llados edit   pdf
doi  openurl
  Title Integrating Visual and Textual Cues for Query-by-String Word Spotting Type Conference Article
  Year 2013 Publication 12th International Conference on Document Analysis and Recognition Abbreviated Journal  
  Volume Issue Pages 511 - 515  
  Keywords  
  Abstract In this paper, we present a word spotting framework that follows the query-by-string paradigm where word images are represented both by textual and visual representations. The textual representation is formulated in terms of character $n$-grams while the visual one is based on the bag-of-visual-words scheme. These two representations are merged together and projected to a sub-vector space. This transform allows to, given a textual query, retrieve word instances that were only represented by the visual modality. Moreover, this statistical representation can be used together with state-of-the-art indexation structures in order to deal with large-scale scenarios. The proposed method is evaluated using a collection of historical documents outperforming state-of-the-art performances.  
  Address Washington; USA; August 2013  
  Corporate Author Thesis  
  Publisher Place of Publication Editor  
  Language Summary Language Original Title  
  Series Editor Series Title Abbreviated Series Title  
  Series Volume Series Issue Edition  
  ISSN 1520-5363 ISBN Medium  
  Area Expedition Conference ICDAR  
  Notes DAG; ADAS; 600.045; 600.055; 600.061 Approved no  
  Call Number (down) Admin @ si @ ART2013 Serial 2224  
Permanent link to this record
Select All    Deselect All
 |   | 
Details

Save Citations:
Export Records: