|
Ayan Banerjee, Sanket Biswas, Josep Llados, & Umapada Pal. (2024). SemiDocSeg: Harnessing Semi-Supervised Learning for Document Layout Analysis. IJDAR - International Journal on Document Analysis and Recognition, .
Abstract: Document Layout Analysis (DLA) is the process of automatically identifying and categorizing the structural components (e.g. Text, Figure, Table, etc.) within a document to extract meaningful content and establish the page's layout structure. It is a crucial stage in document parsing, contributing to their comprehension. However, traditional DLA approaches often demand a significant volume of labeled training data, and the labor-intensive task of generating high-quality annotated training data poses a substantial challenge. In order to address this challenge, we proposed a semi-supervised setting that aims to perform learning on limited annotated categories by eliminating exhaustive and expensive mask annotations. The proposed setting is expected to be generalizable to novel categories as it learns the underlying positional information through a support set and class information through Co-Occurrence that can be generalized from annotated categories to novel categories. Here, we first extract features from the input image and support set with a shared multi-scale feature acquisition backbone. Then, the extracted feature representation is fed to the transformer encoder as a query. Later on, we utilize a semantic embedding network before the decoder to capture the underlying semantic relationships and similarities between different instances, enabling the model to make accurate predictions or classifications with only a limited amount of labeled data. Extensive experimentation on competitive benchmarks like PRIMA, DocLayNet, and Historical Japanese (HJ) demonstrate that this generalized setup obtains significant performance compared to the conventional supervised approach.
Keywords: Document layout analysis; Semi-supervised learning; Co-Occurrence matrix; Instance segmentation; Swin transformer
|
|
|
Jon Almazan, Albert Gordo, Alicia Fornes, & Ernest Valveny. (2014). Segmentation-free Word Spotting with Exemplar SVMs. PR - Pattern Recognition, 47(12), 3967–3978.
Abstract: In this paper we propose an unsupervised segmentation-free method for word spotting in document images. Documents are represented with a grid of HOG descriptors, and a sliding-window approach is used to locate the document regions that are most similar to the query. We use the Exemplar SVM framework to produce a better representation of the query in an unsupervised way. Then, we use a more discriminative representation based on Fisher Vector to rerank the best regions retrieved, and the most promising ones are used to expand the Exemplar SVM training set and improve the query representation. Finally, the document descriptors are precomputed and compressed with Product Quantization. This offers two advantages: first, a large number of documents can be kept in RAM memory at the same time. Second, the sliding window becomes significantly faster since distances between quantized HOG descriptors can be precomputed. Our results significantly outperform other segmentation-free methods in the literature, both in accuracy and in speed and memory usage.
Keywords: Word spotting; Segmentation-free; Unsupervised learning; Reranking; Query expansion; Compression
|
|
|
Sounak Dey, Palaiahnakote Shivakumara, K.S. Raghunanda, Umapada Pal, Tong Lu, G. Hemantha Kumar, et al. (2017). Script independent approach for multi-oriented text detection in scene image. NEUCOM - Neurocomputing, 242, 96–112.
Abstract: Developing a text detection method which is invariant to scripts in natural scene images is a challeng- ing task due to different geometrical structures of various scripts. Besides, multi-oriented of text lines in natural scene images make the problem more challenging. This paper proposes to explore ring radius transform (RRT) for text detection in multi-oriented and multi-script environments. The method finds component regions based on convex hull to generate radius matrices using RRT. It is a fact that RRT pro- vides low radius values for the pixels that are near to edges, constant radius values for the pixels that represent stroke width, and high radius values that represent holes created in background and convex hull because of the regular structures of text components. We apply k -means clustering on the radius matrices to group such spatially coherent regions into individual clusters. Then the proposed method studies the radius values of such cluster components that are close to the centroid and far from the cen- troid to detect text components. Furthermore, we have developed a Bangla dataset (named as ISI-UM dataset) and propose a semi-automatic system for generating its ground truth for text detection of arbi- trary orientations, which can be used by the researchers for text detection and recognition in the future. The ground truth will be released to public. Experimental results on our ISI-UM data and other standard datasets, namely, ICDAR 2013 scene, SVT and MSRA data, show that the proposed method outperforms the existing methods in terms of multi-lingual and multi-oriented text detection ability.
|
|
|
Sangheeta Roy, Palaiahnakote Shivakumara, Namita Jain, Vijeta Khare, Anjan Dutta, Umapada Pal, et al. (2018). Rough-Fuzzy based Scene Categorization for Text Detection and Recognition in Video. PR - Pattern Recognition, 80, 64–82.
Abstract: Scene image or video understanding is a challenging task especially when number of video types increases drastically with high variations in background and foreground. This paper proposes a new method for categorizing scene videos into different classes, namely, Animation, Outlet, Sports, e-Learning, Medical, Weather, Defense, Economics, Animal Planet and Technology, for the performance improvement of text detection and recognition, which is an effective approach for scene image or video understanding. For this purpose, at first, we present a new combination of rough and fuzzy concept to study irregular shapes of edge components in input scene videos, which helps to classify edge components into several groups. Next, the proposed method explores gradient direction information of each pixel in each edge component group to extract stroke based features by dividing each group into several intra and inter planes. We further extract correlation and covariance features to encode semantic features located inside planes or between planes. Features of intra and inter planes of groups are then concatenated to get a feature matrix. Finally, the feature matrix is verified with temporal frames and fed to a neural network for categorization. Experimental results show that the proposed method outperforms the existing state-of-the-art methods, at the same time, the performances of text detection and recognition methods are also improved significantly due to categorization.
Keywords: Rough set; Fuzzy set; Video categorization; Scene image classification; Video text detection; Video text recognition
|
|
|
Alicia Fornes, Josep Llados, Gemma Sanchez, & Dimosthenis Karatzas. (2010). Rotation Invariant Hand-Drawn Symbol Recognition based on a Dynamic Time Warping Model. IJDAR - International Journal on Document Analysis and Recognition, 13(3), 229–241.
Abstract: One of the major difficulties of handwriting symbol recognition is the high variability among symbols because of the different writer styles. In this paper, we introduce a robust approach for describing and recognizing hand-drawn symbols tolerant to these writer style differences. This method, which is invariant to scale and rotation, is based on the dynamic time warping (DTW) algorithm. The symbols are described by vector sequences, a variation of the DTW distance is used for computing the matching distance, and K-Nearest Neighbor is used to classify them. Our approach has been evaluated in two benchmarking scenarios consisting of hand-drawn symbols. Compared with state-of-the-art methods for symbol recognition, our method shows higher tolerance to the irregular deformations induced by hand-drawn strokes.
|
|