Author: Jon Almazan
Title: Learning to Represent Handwritten Shapes and Words for Matching and Recognition
Type: Book Whole
Year: 2014
Publication: PhD Thesis, Universitat Autonoma de Barcelona-CVC
Abstract: Writing is one of the most important forms of communication and, for centuries, handwriting has been the most reliable way to preserve knowledge. However, despite the development of printing houses and electronic devices, handwriting is still broadly used for taking notes, doing annotations, or sketching ideas.
Transferring the ability to understand handwritten text or recognize handwritten shapes to computers has been the goal of much research, given its importance for many different fields. However, designing good representations for handwritten shapes, e.g. symbols or words, is very challenging due to the large variability of these kinds of shapes. One consequence of working with handwritten shapes is that we need representations to be robust, i.e., able to adapt to large intra-class variability. We need representations to be discriminative, i.e., able to capture the differences between classes. And we need representations to be efficient, i.e., fast to compute and compare. Unfortunately, current techniques of handwritten shape representation for matching and recognition fail to fulfill some or all of these requirements.
In this thesis we focus on the problem of learning to represent handwritten shapes for retrieval and recognition tasks. Concretely, in the first part of the thesis, we focus on the general problem of representing any kind of handwritten shape. We first present a novel shape descriptor based on a deformable grid that deals with large deformations by adapting to the shape, and whose cells can be used to extract different features. We then propose to use this descriptor to learn statistical models, based on the Active Appearance Model, that jointly learn the variability in structure and texture of a given class. In the second part, we focus on a concrete application: representing handwritten words for the tasks of word spotting, where the goal is to find all instances of a query word in a dataset of images, and recognition. First, we address the segmentation-free problem and propose an unsupervised, sliding-window-based approach that achieves state-of-the-art results on two public datasets. Second, we address the more challenging multi-writer problem, where the variability in words increases dramatically. We describe an approach in which both word images and text strings are embedded in a common vectorial subspace, where those that represent the same word lie close together. This is achieved by a combination of label embedding, attribute learning, and common subspace regression. It leads to a low-dimensional, unified representation of word images and strings, resulting in a method that allows one to perform image and text searches, as well as image transcription, in a unified framework. We evaluate our methods on several public datasets of both handwritten documents and natural images, showing results comparable to or better than the state of the art on spotting and recognition tasks.
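
A minimal sketch of the common-subspace retrieval idea described above, assuming a toy pyramidal character-occurrence embedding for strings; the attribute construction, the two-level pyramid, and all names here are illustrative choices, not the thesis' exact representation:

    import numpy as np

    ALPHABET = "abcdefghijklmnopqrstuvwxyz"

    def string_embedding(word, levels=2):
        """Binary attribute vector: does character c occur in region r of the word?"""
        word = word.lower()
        feats = []
        for level in range(1, levels + 1):
            for region in range(level):
                lo, hi = region / level, (region + 1) / level
                chunk = {word[i] for i in range(len(word))
                         if lo <= (i + 0.5) / len(word) < hi}
                feats.extend(1.0 if c in chunk else 0.0 for c in ALPHABET)
        return np.array(feats)

    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

    # Query-by-string: rank a toy lexicon against the query embedding.
    lexicon = ["handwriting", "handwritten", "recognition", "matching"]
    query = string_embedding("handwriten")  # misspelled query still lands close
    ranked = sorted(lexicon, key=lambda w: -cosine(query, string_embedding(w)))
    print(ranked)  # 'handwritten' and 'handwriting' should rank first

In the thesis the image side is mapped into the same space by attribute learning and subspace regression; the string side above is enough to show why retrieval then reduces to nearest-neighbour search.
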
 
Thesis: Ph.D. thesis
Publisher: Ediciones Graficas Rey
Editors: Ernest Valveny; Alicia Fornes
Notes: DAG; 600.077
Call Number: Admin @ si @ Alm2014 (Serial 2572)

Author: David Aldavert
Title: Efficient and Scalable Handwritten Word Spotting on Historical Documents using Bag of Visual Words
Type: Book Whole
Year: 2021
Publication: PhD Thesis, Universitat Autonoma de Barcelona-CVC
Abstract: Word spotting can be defined as the pattern recognition task aimed at locating and retrieving a specific keyword within a document image collection without explicitly transcribing the whole corpus. It is particularly interesting in scenarios where Optical Character Recognition performs poorly or cannot be used at all. This thesis focuses on such a scenario: word spotting on historical handwritten documents that have been written by a single author or by multiple authors with similar calligraphy.
This problem requires a visual signature that is robust to image artifacts, flexible enough to accommodate script variations, and efficient enough to retrieve information rapidly. To this end, we have developed a set of word spotting methods built on the well-known Bag-of-Visual-Words (BoVW) representation. This representation has gained popularity in the document image analysis community for characterizing handwritten words in an unsupervised manner. However, most approaches in this field rely on a basic BoVW configuration and disregard complex encodings and spatial representations. We determine which BoVW configurations provide the best performance boost to a spotting system.
Then, we extend segmentation-based word spotting, where word candidates are given a priori, to segmentation-free spotting. The proposed approach seeds the document images with overlapping word location candidates and characterizes them with a BoVW signature. Retrieval is achieved by comparing the query and candidate signatures and returning the locations with the highest consensus. This is a simple but powerful approach that requires a more compact signature than in a segmentation-based scenario. We first project the BoVW signature into a reduced semantic topic space and then compress it further using Product Quantizers. The resulting signature requires only a few dozen bytes, allowing us to index thousands of pages on a common desktop computer. The final system still yields performance comparable to the state of the art despite all the information lost during the compression phases.
Afterwards, we study how to combine different modalities of information in order to create a query-by-X spotting system, where words are indexed using one information modality and queried using another. We consider three information modalities: visual, textual, and audio. Our proposal is to create a latent feature space where semantically related features are projected onto the same topics, thus creating a new feature space in which information from different modalities can be compared. Later, we consider the codebook generation and descriptor encoding problem. The codebooks used to encode BoVW signatures are usually created with an unsupervised clustering algorithm, which requires testing multiple parameters to determine the best configuration for a given document collection. We propose a semantic clustering algorithm that allows estimating the best parameters from data. Since gathering annotated data is costly, we use synthetically generated word images. The resulting codebook is database agnostic, i.e., it yields good performance on document collections that use the same script. We also propose the use of an additional codebook to approximate descriptors and reduce the descriptor encoding complexity to sub-linear.
Finally, we focus on the problem of signature dimensionality. We propose a new symbol probability signature in which each bin represents the probability that a certain symbol is present at a certain location of the word image. This signature is extremely compact and, combined with compression techniques, can represent word images with just a few bytes per signature.
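
A toy sketch of the two compression stages described above: local descriptors are encoded into a BoVW histogram against a k-means codebook, and the histogram is then compressed with a product quantizer. The codebook sizes, the random stand-in descriptors, and the use of scikit-learn's KMeans are illustrative assumptions, not the thesis' configuration:

    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)

    # Stage 1: BoVW encoding against a k-means codebook of local descriptors.
    local_descriptors = rng.normal(size=(500, 64))        # stand-in for dense SIFT
    codebook = KMeans(n_clusters=32, n_init=3, random_state=0).fit(local_descriptors)

    def bovw_signature(descs):
        """Hard-assignment BoVW histogram, L1-normalized."""
        hist = np.bincount(codebook.predict(descs), minlength=32).astype(float)
        return hist / max(hist.sum(), 1.0)

    # Stage 2: product quantization. Split each signature into sub-vectors and
    # replace every sub-vector by the id of its nearest sub-codebook centroid.
    train_sigs = np.stack([bovw_signature(rng.normal(size=(40, 64)))
                           for _ in range(200)])
    n_sub, bits = 4, 4
    sub_dim = train_sigs.shape[1] // n_sub
    pq = [KMeans(n_clusters=2 ** bits, n_init=1, random_state=0)
          .fit(train_sigs[:, i * sub_dim:(i + 1) * sub_dim]) for i in range(n_sub)]

    def compress(sig):
        return [int(pq[i].predict(sig[None, i * sub_dim:(i + 1) * sub_dim])[0])
                for i in range(n_sub)]

    sig = bovw_signature(rng.normal(size=(40, 64)))
    print(compress(sig))   # 4 four-bit codes (2 bytes) instead of 32 floats
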
 
Address: April 2021
Thesis: Ph.D. thesis
Publisher: Ediciones Graficas Rey
Editors: Marçal Rusiñol; Josep Llados
ISBN: 978-84-122714-5-4
Notes: DAG; 600.121
Call Number: Admin @ si @ Ald2021 (Serial 3601)

Author: Partha Pratim Roy
Title: Multi-Oriented and Multi-Scaled Text Character Analysis and Recognition in Graphical Documents and their Applications to Document Image Retrieval
Type: Book Whole
Year: 2010
Publication: PhD Thesis, Universitat Autonoma de Barcelona-CVC
Abstract: With the advent of Document Image Analysis and Recognition (DIAR), indexing and retrieval of graphics-rich documents has emerged as an important line of research. It aims at finding relevant documents by segmenting and recognizing the text and graphics components embedded in non-standard layouts, where commercial OCRs cannot be applied due to the complexity. This thesis focuses on text information extraction in graphical documents and on the retrieval of such documents using text information.
Automatic text recognition in graphical documents (maps, engineering drawings, etc.) involves many challenges because text characters are usually printed in a multi-oriented and multi-scale way alongside different graphical objects. Text characters are used to annotate graphical curve lines and hence often follow curvilinear paths. For OCR of such documents, individual text lines and their corresponding words/characters need to be extracted.
For the recognition of multi-font, multi-scale and multi-oriented characters, we have proposed a feature descriptor for character shape that uses angular information from contour pixels to achieve rotation invariance. To improve the efficiency of OCR, an approach for segmenting multi-oriented touching strings into individual characters is also discussed. Convex-hull-based background information is used to split a touching string into possible primitive segments, and these primitive segments are later merged into an optimal segmentation using dynamic programming. To overcome the touching/overlapping of text with graphical lines, a character spotting approach using SIFT and skeleton information is included. Afterwards, we propose a novel method to extract individual curvilinear text lines using the foreground and background information of the text characters, with a water reservoir concept used to exploit the background information.
We have also formulated methodologies for graphical document retrieval using query words and seals. The retrieval approaches work on the recognition results of individual components in the document. Given a query text, the system extracts positional knowledge from the query word and uses it to generate hypothetical locations in the document. Documents are also indexed based on automatic detection of seals in cluttered backgrounds. A seal is characterized by scale- and rotation-invariant spatial feature descriptors computed from labelled text characters, and a concept based on the Generalized Hough Transform is used to locate the seal in documents.
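
A minimal sketch of a rotation-invariant angular descriptor in the spirit of the one described above: for each contour pixel, the angle subtended by its neighbours along the contour is computed and histogrammed. The neighbourhood step and bin count are illustrative choices, not the thesis' exact descriptor:

    import numpy as np

    def angular_histogram(contour, step=5, bins=16):
        """contour: (N, 2) array of ordered contour points."""
        n = len(contour)
        angles = []
        for i in range(n):
            v1 = contour[(i - step) % n] - contour[i]
            v2 = contour[(i + step) % n] - contour[i]
            cosang = (v1 @ v2) / (np.linalg.norm(v1) * np.linalg.norm(v2) + 1e-12)
            angles.append(np.arccos(np.clip(cosang, -1.0, 1.0)))
        hist, _ = np.histogram(angles, bins=bins, range=(0, np.pi))
        return hist / max(hist.sum(), 1)

    # The interior angle at each contour point is unchanged when the whole
    # shape rotates, so the histogram is rotation invariant by construction.
    theta = np.linspace(0, 2 * np.pi, 100, endpoint=False)
    circle = np.stack([np.cos(theta), np.sin(theta)], axis=1)
    print(angular_histogram(circle))
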
 
Thesis: Ph.D. thesis
Publisher: Ediciones Graficas Rey
Editors: Josep Llados; Umapada Pal
ISBN: 978-84-937261-7-1
Call Number: Admin @ si @ Roy2010 (Serial 1455)

Author: Michal Drozdzal
Title: Sequential image analysis for computer-aided wireless endoscopy
Type: Book Whole
Year: 2014
Publication: PhD Thesis, Universitat de Barcelona-CVC
Abstract: Wireless Capsule Endoscopy (WCE) is a technique for inner visualization of the entire small intestine and thus offers an interesting perspective on intestinal motility. The two major drawbacks of this technique are: 1) the huge amount of data acquired by WCE makes motility analysis tedious, and 2) since the capsule is the first tool that offers complete inner visualization of the small intestine, the exact importance of the observed events is still an open issue. Therefore, in this thesis, a novel computer-aided system for intestinal motility analysis is presented. The goal of the system is to provide a physician with an easily comprehensible visual description of motility-related intestinal events. To do so, several tools based either on computer vision concepts or on machine learning techniques are presented. A method for transforming the 3D video signal into a holistic image of intestinal motility, called a motility bar, is proposed. The method calculates the mapping from video to image that is optimal from the intestinal motility point of view.
To characterize intestinal motility, methods for automatic extraction of motility information from WCE are presented. Two of them are based on the motility bar and two on frame-per-frame analysis. In particular, four algorithms dealing with intestinal contraction detection, lumen size estimation, intestinal content characterization and wrinkle frame detection are proposed and validated. The results of the algorithms are converted into sequential features using an online statistical test designed to work with multivariate data streams. To this end, we propose a novel concentration inequality that is introduced into a robust adaptive windowing algorithm for multivariate data streams. The algorithm is used to obtain a robust representation of segments with constant intestinal motility activity. The obtained sequential features are shown to be discriminative for the problem of abnormal motility characterization.
Finally, we tackle the problem of efficient labeling. To this end, we incorporate active learning concepts into the problems present in WCE data and propose two approaches. The first is based on the concepts of sequential learning and the second adapts partition-based active learning to an error-free labeling scheme. Together, these steps provide an extensive visual description of intestinal motility that can be used by an expert as a decision support system.
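
A simplified, univariate sketch of the adaptive-windowing test described above: a growing window of recent values is cut whenever the means of its two halves differ by more than a Hoeffding-style bound. The thesis generalizes this kind of test to multivariate streams with a novel concentration inequality; the scalar bound and toy stream below are only for illustration:

    import math
    import random

    def hoeffding_cut(w0, w1, delta=0.01):
        """True if the two sub-window means differ more than chance allows."""
        n0, n1 = len(w0), len(w1)
        m = 1.0 / (1.0 / n0 + 1.0 / n1)                  # harmonic sample size
        eps = math.sqrt(math.log(2.0 / delta) / (2.0 * m))
        return abs(sum(w0) / n0 - sum(w1) / n1) > eps

    random.seed(0)
    stream = [random.gauss(0.2, 0.05) for _ in range(300)] + \
             [random.gauss(0.8, 0.05) for _ in range(300)]  # regime shift at t=300

    window = []
    for t, x in enumerate(stream):
        window.append(x)
        half = len(window) // 2
        if half >= 10 and hoeffding_cut(window[:half], window[half:]):
            print(f"motility change detected around t={t}")
            window = window[half:]                 # keep only the recent regime
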
 
Thesis: Ph.D. thesis
Publisher: Ediciones Graficas Rey
Editor: Petia Radeva
ISBN: 978-84-940902-3-3
Notes: MILAB
Call Number: Admin @ si @ Dro2014 (Serial 2486)

Author: Ivet Rafegas
Title: Color in Visual Recognition: from flat to deep representations and some biological parallelisms
Type: Book Whole
Year: 2017
Publication: PhD Thesis, Universitat Autonoma de Barcelona-CVC
Abstract: Visual recognition is one of the main problems in computer vision; it attempts to solve image understanding by deciding what objects are in images. This problem can be computationally solved using relevant sets of visual features, such as edges, corners, color, or more complex object parts. This thesis contributes to the question of how color features should be represented for recognition tasks.

Image features can be extracted following two different approaches. The first approach defines handcrafted image descriptors, followed by a learning scheme to classify the content (named flat schemes in Kruger et al. (2013)). In this approach, perceptual considerations are habitually used to define efficient color features. Here we propose a new flat color descriptor based on the extension of color channels, which boosts the representation of spatio-chromatic contrast and surpasses state-of-the-art approaches. However, flat schemes lack generality and fall far short of the capabilities of biological systems. The second approach evolves these flat schemes into a hierarchical process, as in the visual cortex, including an automatic process to learn optimal features. These deep schemes, and more specifically Convolutional Neural Networks (CNNs), have shown impressive performance on various vision problems. However, the internal representations obtained as a result of automatic learning are poorly understood. In this thesis we propose a new methodology to explore the internal representation of trained CNNs by defining the Neuron Feature, a visualization of the intrinsic features encoded in each individual neuron. Additionally, inspired by physiological techniques, we propose to compute different neuron selectivity indexes (e.g., color, class, orientation or symmetry, amongst others) to label and classify the full CNN neuron population in order to understand the learned representations.

Finally, using the proposed methodology, we present an in-depth study of how color is represented in a specific CNN, trained for object recognition, that competes with primate representational abilities (Cadieu et al. (2014)). We found several parallelisms with biological visual systems: (a) a significant number of color-selective neurons throughout all the layers; (b) an opponent and low-frequency representation of color-oriented edges, with a denser sampling of frequency selectivity in brightness than in color in the first layer, as in V1; (c) a denser sampling of color hue in the second layer, aligned with the hue maps observed in V2; (d) a strong entanglement of color and shape in all layers, from basic features in shallower layers (V1 and V2) to object and background shapes in deeper layers (V4 and IT); and (e) a strong correlation between neuron color selectivity and color dataset bias.
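
A minimal sketch of one such selectivity index: compare a neuron's summed response to image patches against its response to greyscale versions of the same patches, so that a value near 1 flags a colour-selective neuron. The toy "neuron" below is a stand-in for a real CNN unit, and the exact index definition in the thesis may differ:

    import numpy as np

    def to_grayscale(img):
        """Replace each pixel by its luminance, keeping the 3-channel shape."""
        g = img @ np.array([0.299, 0.587, 0.114])
        return np.repeat(g[..., None], 3, axis=-1)

    def color_selectivity(neuron, patches):
        """1 - (response without colour / response with colour), clipped at 0."""
        a_color = sum(neuron(p) for p in patches)
        a_gray = sum(neuron(to_grayscale(p)) for p in patches)
        return max(0.0, 1.0 - a_gray / max(a_color, 1e-12))

    # Toy "neuron" firing for red-vs-green/blue contrast: index should be ~1,
    # since the signal it responds to vanishes in the greyscale patches.
    def red_neuron(img):
        return float(np.clip(img[..., 0] - img[..., 1:].mean(-1), 0, None).mean())

    rng = np.random.default_rng(0)
    patches = [rng.random((8, 8, 3)) for _ in range(32)]
    print(color_selectivity(red_neuron, patches))
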
 
Address: November 2017
Thesis: Ph.D. thesis
Publisher: Ediciones Graficas Rey
Editor: Maria Vanrell
ISBN: 978-84-945373-7-0
Notes: CIC
Call Number: Admin @ si @ Raf2017 (Serial 3100)

Author: Ruben Perez Tito
Title: Exploring the role of Text in Visual Question Answering on Natural Scenes and Documents
Type: Book Whole
Year: 2023
Publication: PhD Thesis, Universitat Autonoma de Barcelona-CVC
Abstract: Visual Question Answering (VQA) is the task where, given an image and a natural language question, the objective is to generate a natural language answer. At the intersection of computer vision and natural language processing, this task can be seen as a measure of image understanding capabilities, as it requires reasoning about objects, actions, colors, positions and the relations between the different elements, as well as commonsense reasoning, world knowledge, arithmetic skills and natural language understanding. However, even though the text present in images conveys important, semantically rich information that is explicit and not available in any other form, most VQA methods have remained illiterate, largely ignoring the text despite its potential significance. In this thesis, we set out on a journey to bring reading capabilities to computer vision models applied to the VQA task, creating new datasets and methods that can read, reason about, and integrate text with other visual cues in natural scene images and documents.
In Chapter 3, we address the combination of scene text with visual information to fully understand all the nuances of natural scene images. To achieve this objective, we define a new sub-task of VQA that requires reading the text in the image, and we highlight the limitations of current methods. In addition, we propose a new architecture that integrates both modalities and jointly reasons about textual and visual features. In Chapter 5, we shift the domain of VQA with reading capabilities to scanned industry document images, providing a high-level, end-purpose perspective to Document Understanding, which has primarily focused on digitizing a document's contents and extracting key values without considering the ultimate purpose of the extracted information. For this, we create a dataset that requires methods to reason about the unique and challenging elements of documents, such as text, images, tables, graphs and complex layouts, in order to provide accurate answers in natural language. However, we observed that explicit visual features contribute only slightly to overall performance, since the main information is usually conveyed by the text and its position. In consequence, in Chapter 6, we propose VQA on infographic images, seeking document images with more visually rich elements that require fully exploiting visual information in order to answer the questions. We show the performance gap between different methods on industry-scanned versus infographic images, and we propose a new method that integrates the visual features in early stages, allowing the transformer architecture to exploit them during the self-attention operation. In Chapter 7, we instead apply VQA to a large collection of single-page documents, where the methods must find which documents are relevant to answer the question and provide the answer itself. Finally, in Chapter 8, mimicking real-world applications in which systems must process documents with multiple pages, we address the multipage document visual question answering task. We demonstrate the limitations of existing methods, including models specifically designed to process long sequences. To overcome these limitations, we propose a hierarchical architecture that can process long documents, answer questions, and provide the index of the page where the answer is located as an explainability measure.
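
A structural sketch of the hierarchical answer-page idea described above: each page is summarized into a vector, pages are scored against the question, and the best page both grounds the answer and is returned as the explainability signal. Note that encode and answer_from_page below are hypothetical stand-ins for trained modules, not real APIs or the thesis' actual architecture:

    import numpy as np

    def encode(text):
        """Hypothetical text/page encoder (random but deterministic per input)."""
        rng = np.random.default_rng(abs(hash(text)) % (2 ** 32))
        return rng.normal(size=64)

    def answer_from_page(page, question):
        """Hypothetical single-page QA module."""
        return f"(answer extracted from: {page[:40]}...)"

    def multipage_vqa(pages, question):
        q = encode(question)
        scores = [float(encode(p) @ q) for p in pages]
        best = int(np.argmax(scores))
        # Return the answer plus the evidence page index as explainability.
        return answer_from_page(pages[best], question), best

    answer, page_idx = multipage_vqa(["page one text", "page two text"],
                                     "What is the total amount?")
    print(answer, "| evidence page:", page_idx)
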
 
Thesis: Ph.D. thesis
Publisher: IMPRIMA
Editor: Ernest Valveny
ISBN: 978-84-124793-5-5
Notes: DAG
Call Number: Admin @ si @ Per2023 (Serial 3967)

Author: Ali Furkan Biten
Title: A Bitter-Sweet Symphony on Vision and Language: Bias and World Knowledge
Type: Book Whole
Year: 2022
Publication: PhD Thesis, Universitat Autonoma de Barcelona-CVC
Abstract: Vision and language are broadly regarded as cornerstones of intelligence. Even though they have different aims (language serves communication and the transmission of information, while vision serves the construction of mental representations of our surroundings for navigating and interacting with objects), they cooperate and depend on one another in many tasks we perform effortlessly, a reliance actively being studied in various computer vision tasks, e.g. image captioning, visual question answering, image-sentence retrieval and phrase grounding, to name a few. All of these tasks share the inherent difficulty of aligning the two modalities while remaining robust to language priors and the various biases present in the datasets. One of the ultimate goals of vision and language research is to be able to inject world knowledge while getting rid of the biases that come with the datasets. In this thesis, we mainly focus on two vision and language tasks, namely Image Captioning and Scene-Text Visual Question Answering (STVQA).
In both domains, we start by defining a new task that requires the utilization of world knowledge, and in both tasks we find that the models commonly employed are prone to biases that exist in the data. Concretely, we introduce new tasks, uncover several problems that impede performance at each level, and provide remedies or possible solutions in each chapter: i) we define a new task that moves beyond Image Captioning to Image Interpretation, utilizing Named Entities as a form of world knowledge; ii) we study the object hallucination problem in classic Image Captioning systems and develop an architecture-agnostic solution; iii) we define a sub-task of Visual Question Answering that requires reading the text in the image (STVQA), where we highlight the limitations of current models; iv) we propose an architecture for the STVQA task that can point to the answer in the image, and show how to combine it with classic VQA models; v) we show how far language alone can get us in STVQA and discover yet another bias that causes models to disregard the image while doing Visual Question Answering.
 
Thesis: Ph.D. thesis
Publisher: IMPRIMA
Editors: Dimosthenis Karatzas; Lluis Gomez
ISBN: 978-84-124793-5-5
Notes: DAG
Call Number: Admin @ si @ Bit2022 (Serial 3755)

Author: Ferran Diego
Title: Probabilistic Alignment of Video Sequences Recorded by Moving Cameras
Type: Book Whole
Year: 2011
Publication: PhD Thesis, Universitat Autonoma de Barcelona-CVC
Abstract: Video alignment consists of integrating multiple video sequences recorded independently into a single video sequence. This means registering them both in time (synchronizing frames) and in space (image registration) so that the two video sequences can be fused or compared pixel-wise. In spite of being relatively unknown, many applications today may benefit from the availability of robust and efficient video alignment methods. For instance, video surveillance requires integrating video sequences recorded of the same scene at different times in order to detect changes. The problem of aligning videos has been addressed before, but only in the relatively simple cases of fixed or rigidly attached cameras and simultaneous acquisition. In addition, most works rely on restrictive assumptions that reduce the problem's difficulty, such as linear time correspondence or knowledge of the complete trajectories of corresponding scene points in the images; to some extent, these assumptions limit the practical applicability of the solutions developed until now. In this thesis, we focus on the challenging problem of aligning sequences recorded at different times from independently moving cameras following similar but not coincident trajectories. More precisely, this thesis covers four studies that advance the state of the art in video alignment. First, we focus on analyzing and developing a probabilistic framework for video alignment, that is, a principled way to integrate multiple observations and prior information. Two different approaches are presented that exploit the combination of several purely visual features (image intensities, visual words and a dense motion field descriptor) with global positioning system (GPS) information. Second, we reformulate the problem into a single alignment framework, since previous works on video alignment adopt a divide-and-conquer strategy, i.e., first solve the synchronization and then register corresponding frames; this also generalizes the 'classic' case of a fixed geometric transform and linear time mapping. Third, we exploit the time domain of the video sequences directly in order to avoid exhaustive cross-frame search, which provides relevant information for learning the temporal mapping between pairs of video sequences. Finally, we adapt these methods to the online setting for road detection and vehicle geolocation. The qualitative and quantitative results presented in this thesis on a variety of real-world pairs of video sequences show that the proposed method is robust to varying imaging conditions, different image content (e.g., incoming and outgoing vehicles), variations in camera velocity, and different scenarios (indoor and outdoor), going beyond the state of the art. Moreover, the online video alignment has been successfully applied to road detection and vehicle geolocation, achieving promising results.
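
The thesis pursues a probabilistic framework rather than exhaustive search; as a simpler, textbook illustration of the non-linear time correspondence that video synchronization must recover, here is classic dynamic time warping between two frame-descriptor sequences (a baseline for intuition, not the thesis' method):

    import numpy as np

    def dtw_alignment(a, b):
        """a: (n, d), b: (m, d) frame descriptors; returns synchronized frame pairs."""
        n, m = len(a), len(b)
        cost = np.full((n + 1, m + 1), np.inf)
        cost[0, 0] = 0.0
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                d = np.linalg.norm(a[i - 1] - b[j - 1])
                cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1],
                                     cost[i - 1, j - 1])
        # Backtrack the cheapest monotone path: the non-linear time mapping.
        path, i, j = [], n, m
        while i > 0 and j > 0:
            path.append((i - 1, j - 1))
            i, j = min([(i - 1, j), (i, j - 1), (i - 1, j - 1)],
                       key=lambda ij: cost[ij])
        return path[::-1]

    rng = np.random.default_rng(1)
    seq = rng.normal(size=(20, 8))
    print(dtw_alignment(seq, seq[::2])[:5])  # second "camera" runs twice as fast
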
 
Thesis: Ph.D. thesis
Publisher: Ediciones Graficas Rey
Editor: Joan Serrat
Notes: ADAS
Call Number: Admin @ si @ Die2011 (Serial 1787)

Author: Marçal Rusiñol
Title: Geometric and Structural-based Symbol Spotting. Application to Focused Retrieval in Graphic Document Collections
Type: Book Whole
Year: 2009
Publication: PhD Thesis, Universitat Autonoma de Barcelona-CVC
Abstract: Usually, pattern recognition systems consist of two main parts: on the one hand, data acquisition, and on the other, the classification of these data into a certain category. In order to recognize which category a certain query element belongs to, a set of pattern models must be provided beforehand. An off-line learning stage is needed to train the classifier and to offer a robust classification of the patterns. Within the pattern recognition field, we are interested in the recognition of graphics and, in particular, in the analysis of documents rich in graphical information. In this context, one of the main concerns is whether the proposed systems remain scalable with respect to the data volume, so that they can handle growing amounts of symbol models. To avoid working with a database of reference symbols, symbol spotting and on-the-fly symbol recognition methods have been introduced in the past years.

Generally speaking, the symbol spotting problem can be defined as the identification of a set of regions of interest in a document image that are likely to contain an instance of a certain queried symbol, without explicitly applying the whole pattern recognition scheme. Our application framework consists of indexing a collection of graphic-rich document images. This collection is queried by example with a single instance of the symbol to look for, and by means of symbol spotting methods we retrieve the regions of interest where the symbol is likely to appear within the documents. Such applications are known as focused retrieval methods.

For the focused retrieval application to handle large collections of documents, efficient access to the large volume of stored information must be provided. We use indexing strategies to efficiently retrieve, by similarity, the locations where a certain part of the symbol appears. In that scenario, graphical patterns are used as indices for accessing and navigating the collection of documents. These indexing mechanisms allow the user to search for similar elements using graphical information rather than textual queries.

In this thesis we present a spotting architecture and different methods that together build a complete focused retrieval application for graphic-rich document collections. In addition, a protocol to evaluate the performance of symbol spotting systems in terms of recognition ability, location accuracy and scalability is proposed.
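
A minimal sketch of the indexing idea described above: local geometric descriptors of document primitives are quantized into hashable keys held in an inverted file, and at query time the symbol's own keys vote for document regions. The quantization grid and voting bins are illustrative assumptions, not the thesis' exact scheme:

    from collections import defaultdict

    index = defaultdict(list)   # descriptor key -> [(doc_id, x, y), ...]

    def quantize(descriptor, step=0.25):
        """Coarse grid quantization turns a float vector into a hashable key."""
        return tuple(round(v / step) for v in descriptor)

    def add_primitive(doc_id, x, y, descriptor):
        index[quantize(descriptor)].append((doc_id, x, y))

    def spot(query_descriptors):
        """Each matching primitive votes for a coarse spatial bin of a document."""
        votes = defaultdict(int)
        for d in query_descriptors:
            for doc_id, x, y in index.get(quantize(d), []):
                votes[(doc_id, x // 100, y // 100)] += 1
        return sorted(votes.items(), key=lambda kv: -kv[1])  # likeliest regions first

    add_primitive("plan-07", 120, 340, [0.1, 0.9])
    add_primitive("plan-07", 130, 350, [0.1, 0.8])
    print(spot([[0.1, 0.9], [0.12, 0.81]]))
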
 
Address: Barcelona (Spain)
Thesis: Ph.D. thesis
Publisher: Ediciones Graficas Rey
Editor: Josep Llados
Notes: DAG
Call Number: DAG @ dag @ Rus2009 (Serial 1264)

Author: Andres Mafla
Title: Leveraging Scene Text Information for Image Interpretation
Type: Book Whole
Year: 2022
Publication: PhD Thesis, Universitat Autonoma de Barcelona-CVC
Abstract: Until recently, most computer vision models remained illiterate, largely ignoring the semantically rich and explicit information contained in scene text. Recent progress in scene text detection and recognition has allowed exploring its role in a diverse set of open computer vision problems, e.g. image classification, image-text retrieval, image captioning and visual question answering, to name a few. The explicit semantics of scene text require specific modeling, similar to language. However, scene text is a particular signal that has to be interpreted from a comprehensive perspective that encapsulates all the visual cues in an image. Incorporating this information is a straightforward task for humans, but if we are unfamiliar with a language or script, achieving a complete world understanding is impossible (e.g. visiting a foreign country with a different alphabet). Despite the importance of scene text, modeling it requires considering the several ways in which it interacts with an image, processing and fusing an additional modality. In this thesis, we mainly focus on two tasks: scene-text-based fine-grained image classification and cross-modal retrieval. In both tasks we identify existing limitations in current approaches and propose plausible solutions. Concretely, in each chapter: i) we define a compact way to embed scene text that generalizes to unseen words at training time while running in real time; ii) we incorporate the previously learned scene text embedding to create an image-level descriptor that overcomes optical character recognition (OCR) errors and is well suited to the fine-grained image classification task; iii) we design a region-level reasoning network that learns the interaction, through semantics, among salient visual regions and scene text instances; iv) we employ scene text information in image-text matching and introduce the Scene Text Aware Cross-Modal retrieval (StacMR) task, gathering a dataset that incorporates scene text and designing a model suited to the newly studied modality; v) we identify the drawbacks of current retrieval metrics in cross-modal retrieval and propose an image captioning metric as a way of better evaluating semantics in retrieved results. Ample experimentation shows that incorporating such semantics into a model yields better semantic results while requiring significantly less data to converge.
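
A minimal sketch of fusing the two modalities described above: OCR'd scene-text tokens are embedded with a simple character-occurrence vector, pooled, and concatenated with the image's visual descriptor before classification. The pooling choice and descriptor sizes are illustrative assumptions, not the models proposed in the thesis:

    import numpy as np

    ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789"

    def char_embedding(token):
        """Binary character-occurrence vector; robust to unseen words."""
        vec = np.zeros(len(ALPHABET))
        for ch in token.lower():
            if ch in ALPHABET:
                vec[ALPHABET.index(ch)] = 1.0
        return vec

    def fuse(visual_descriptor, ocr_tokens):
        """Pool token embeddings and concatenate with the visual descriptor."""
        if ocr_tokens:
            text = np.max([char_embedding(t) for t in ocr_tokens], axis=0)
        else:
            text = np.zeros(len(ALPHABET))   # images with no readable text
        return np.concatenate([visual_descriptor, text])

    feat = fuse(np.random.rand(128), ["cafe", "espresso"])
    print(feat.shape)   # (164,) joint image-level descriptor
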
 
Thesis: Ph.D. thesis
Publisher: IMPRIMA
Editors: Dimosthenis Karatzas; Lluis Gomez
ISBN: 978-84-124793-6-2
Notes: DAG
Call Number: Admin @ si @ Maf2022 (Serial 3756)

Author: Bonifaz Stuhr
Title: Towards Unsupervised Representation Learning: Learning, Evaluating and Transferring Visual Representations
Type: Book Whole
Year: 2023
Publication: PhD Thesis, Universitat Autonoma de Barcelona-CVC
Abstract: Unsupervised representation learning aims at finding methods that learn representations from data without annotation-based signals. Abstaining from annotations not only leads to economic benefits but may, and to some extent already does, result in advantages regarding the representation's structure, robustness, and generalizability to different tasks. In the long run, unsupervised methods are expected to surpass their supervised counterparts due to the reduction of human intervention and the inherently more general setup that does not bias the optimization towards an objective originating from specific annotation-based signals. While major advantages of unsupervised representation learning have recently been observed in natural language processing, supervised methods still dominate in vision domains for most tasks. In this dissertation, we contribute to the field of unsupervised (visual) representation learning from three perspectives:

(i) Learning representations: We design unsupervised, backpropagation-free Convolutional Self-Organizing Neural Networks (CSNNs) that utilize self-organization- and Hebbian-based learning rules to learn convolutional kernels and masks to achieve deeper backpropagation-free models. Thereby, we observe that backpropagation-based and -free methods can suffer from an objective function mismatch between the unsupervised pretext task and the target task, which can lead to performance decreases for the target task.

(ii) Evaluating representations: We build upon the widely used (non-)linear evaluation protocol to define pretext- and target-objective-independent metrics for measuring the objective function mismatch. With these metrics, we evaluate various pretext and target tasks and disclose dependencies of the objective function mismatch concerning different parts of the training and model setup.

(iii) Transferring representations: We contribute CARLANE, the first 3-way sim-to-real domain adaptation benchmark for 2D lane detection. We adopt several well-known unsupervised domain adaptation methods as baselines and propose a method based on prototypical cross-domain self-supervised learning. Finally, we focus on pixel-based unsupervised domain adaptation and contribute a content-consistent unpaired image-to-image translation method that utilizes masks, global and local discriminators, and similarity sampling to mitigate content inconsistencies, as well as feature-attentive denormalization to fuse content-based statistics into the generator stream. In addition, we propose the cKVD metric to incorporate class-specific content inconsistencies into perceptual metrics for measuring translation quality.
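
A minimal sketch of backpropagation-free, Hebbian kernel learning in the spirit of point (i): a single kernel is updated on image patches with Oja's rule, drifting toward a principal direction of the input statistics without any gradient from a loss. The single-kernel setup, patch size, and learning rate are illustrative, not the CSNN architecture itself:

    import numpy as np

    rng = np.random.default_rng(0)
    patches = rng.normal(size=(2000, 5 * 5))   # flattened 5x5 image patches
    kernel = rng.normal(size=25)
    kernel /= np.linalg.norm(kernel)

    lr = 0.01
    for x in patches:
        y = kernel @ x                          # neuron response (Hebbian "post")
        kernel += lr * y * (x - y * kernel)     # Oja's rule: Hebb term + decay
    print(np.linalg.norm(kernel))               # stays ~1: implicit normalization
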
Thesis: Ph.D. thesis
Publisher: IMPRIMA
Editors: Jordi Gonzalez; Jurgen Brauer
ISBN: 978-84-126409-6-0
Notes: ISE
Call Number: Admin @ si @ Stu2023 (Serial 3966)

Author: Josep M. Gonfaus
Title: Towards Deep Image Understanding: From pixels to semantics
Type: Book Whole
Year: 2012
Publication: PhD Thesis, Universitat Autonoma de Barcelona-CVC
Abstract: Understanding the content of images is one of the greatest challenges of computer vision. Recognizing the objects appearing in images and identifying and interpreting their actions are the main purposes of Image Understanding. This thesis seeks to identify what is present in a picture by categorizing and locating all the objects in the scene.
Images are composed of pixels, and one possibility is to assign each pixel an object category, which is commonly known as semantic segmentation. By incorporating contextual information as a cue, we are able to resolve ambiguity between categories at the pixel level. We propose three levels of scale in order to resolve such ambiguity.
Another way to represent objects is object detection. In this case, the aim is to recognize and localize the whole object by accurately placing a bounding box around it. We present two new approaches. The first focuses on improving the object representation of deformable part models with the concept of factorized appearances. The second addresses the issue of reducing the computational cost of multi-class recognition. The results have been validated on several commonly used datasets, earning international recognition and reaching the state of the art in the field.
 
Thesis: Ph.D. thesis
Publisher: Ediciones Graficas Rey
Editors: Jordi Gonzalez; Theo Gevers
Notes: ISE
Call Number: Admin @ si @ Gon2012 (Serial 2208)

Author: Jordi Roca
Title: Constancy and inconstancy in categorical colour perception
Type: Book Whole
Year: 2012
Publication: PhD Thesis, Universitat Autonoma de Barcelona-CVC
Abstract: Recognising objects is perhaps the most important task an autonomous system, either biological or artificial, needs to perform. In the context of human vision, this is partly achieved by recognizing the colour of surfaces despite changes in the wavelength distribution of the illumination, a property called colour constancy. Correct surface colour recognition may be adequately accomplished by colour category matching without the need to match colours precisely; therefore, categorical colour constancy is likely to play an important role in successful object identification. The main aim of this work is to study the relationship between colour constancy and categorical colour perception. Previous studies of colour constancy have shown the influence of factors such as the spatio-chromatic properties of the background, individual observers' performance, semantics, etc.; however, these influences have received very little systematic study. To this end, we developed a new approach to colour constancy that includes individual observers' categorical perception, the categorical structure of the background, and their interrelations, resulting in a more comprehensive characterization of the phenomenon. In our study, we first developed a new method to analyse the categorical structure of 3D colour space, which allowed us to characterize individual categorical colour perception and to quantify inter-individual variations in terms of the shape and centroid location of 3D categorical regions. Second, we developed a new colour constancy paradigm, termed chromatic setting, which allows measuring the precise location of nine categorically relevant points in colour space under immersive illumination. Additionally, from these measurements we derived a new colour constancy index that takes into account the magnitude and orientation of the chromatic shift, memory effects and the interrelations among colours, together with a model of colour naming tuned to each observer and adaptation state. Our results lead to the following conclusions: (1) there exist large inter-individual variations in the categorical structure of colour space, and thus colour naming ability varies significantly, but this is not well predicted by low-level chromatic discrimination ability; (2) analysis of the average colour naming space suggests the need for three additional basic colour terms (turquoise, lilac and lime) for optimal colour communication; (3) chromatic setting improves the precision of more complex linear colour constancy models and suggests that mechanisms other than cone gain might be best suited to explain colour constancy; (4) the categorical structure of colour space is broadly stable under illuminant changes for categorically balanced backgrounds; (5) categorical inconstancy exists for categorically unbalanced backgrounds, indicating that categorical information perceived in the initial stages of adaptation may constrain further categorical perception.
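
A minimal sketch of a colour constancy index of the kind discussed above, in the classic Brunswik-ratio style: 1 means the observer's setting fully discounts the illuminant shift, 0 means it merely tracks it. The thesis' own index additionally weighs the orientation of the chromatic shift and memory effects; the chromaticity numbers below are made up for illustration:

    import numpy as np

    def constancy_index(reference, illuminant_shifted, observer_setting):
        """All arguments are 2-vectors in a chromaticity plane."""
        full_shift = np.linalg.norm(illuminant_shifted - reference)
        residual = np.linalg.norm(observer_setting - reference)
        return 1.0 - residual / max(full_shift, 1e-12)

    ref = np.array([0.31, 0.33])       # surface chromaticity under neutral light
    shifted = np.array([0.38, 0.36])   # same surface, unadapted, under new light
    setting = np.array([0.32, 0.335])  # what the observer actually matched
    print(constancy_index(ref, shifted, setting))   # close to 1: good constancy
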
Thesis: Ph.D. thesis
Editors: Maria Vanrell; C. Alejandro Parraga
Notes: CIC
Call Number: Admin @ si @ Roc2012 (Serial 2893)

Author: Shiqi Yang
Title: Towards Source-Free Domain Adaption of Neural Networks in an Open World
Type: Book Whole
Year: 2023
Publication: PhD Thesis, Universitat Autonoma de Barcelona-CVC
Abstract: Though they achieve great success, deep neural networks typically require a huge amount of labeled data for training. However, collecting labeled data is often laborious and expensive. It would, therefore, be ideal if the knowledge obtained from label-rich datasets could be transferred to unlabeled data. However, deep networks are weak at generalizing to unseen domains, even when the differences between datasets are only subtle. In real-world situations, a typical factor impairing the model's generalization ability is the distribution shift between data from different domains, a long-standing problem usually termed (unsupervised) domain adaptation.
A crucial requirement of these domain adaptation methods is access to the source domain data during the adaptation process to the target domain. Such access to the source data of a trained source model is often impossible in real-world applications, for example when deploying domain adaptation algorithms on mobile devices with limited computational capacity, or in situations where data privacy rules restrict access to the source domain data. Without access to the source domain data, existing methods suffer from inferior performance. Thus, in this thesis, we investigate domain adaptation without source data (termed source-free domain adaptation) in multiple scenarios, focusing on image classification tasks.
We first study the source-free domain adaptation problem in a closed-set setting, where the label space of the different domains is identical. Accessing only the pretrained source model, we propose to address source-free domain adaptation from the perspective of unsupervised clustering, building on nearest-neighborhood clustering. In this way, we can turn the challenging source-free domain adaptation task into a type of clustering problem. The final optimization objective is an upper bound containing only two simple terms, which can be explained as discriminability and diversity. We show that this allows us to relate several other methods in domain adaptation, unsupervised clustering and contrastive learning through the perspective of discriminability and diversity.
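
A minimal sketch of the two-term objective described above: each target sample is encouraged to agree with the predictions of its nearest neighbours (discriminability) while the batch-level mean prediction is kept high-entropy to avoid collapse (diversity). This is written against raw softmax outputs for illustration and is not the exact upper bound derived in the thesis:

    import numpy as np

    def adaptation_loss(probs, features, k=3, eps=1e-12):
        """probs: (n, c) softmax outputs; features: (n, d) L2-normalized embeddings."""
        sims = features @ features.T
        np.fill_diagonal(sims, -np.inf)
        neighbours = np.argsort(-sims, axis=1)[:, :k]
        # Discriminability: each sample's prediction should match its neighbours'.
        discr = -np.mean([np.log(probs[i] @ probs[j] + eps)
                          for i in range(len(probs)) for j in neighbours[i]])
        # Diversity: the mean prediction should stay high-entropy (no collapse).
        mean_p = probs.mean(axis=0)
        div = float(mean_p @ np.log(mean_p + eps))
        return discr + div

    rng = np.random.default_rng(0)
    f = rng.normal(size=(16, 8)); f /= np.linalg.norm(f, axis=1, keepdims=True)
    p = rng.dirichlet(np.ones(5), size=16)
    print(adaptation_loss(p, f))
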
 
Thesis: Ph.D. thesis
Publisher: IMPRIMA
Editor: Joost
ISBN: 978-84-126409-3-9
Notes: LAMP
Call Number: Admin @ si @ Yan2023 (Serial 3963)

Author: Sergio Escalera; Ralf Herbrich
Title: The NeurIPS’18 Competition: From Machine Learning to Intelligent Conversations
Type: Book Whole
Year: 2020
Publication: The Springer Series on Challenges in Machine Learning
Abstract: This volume presents the results of the Neural Information Processing Systems Competition track at the 2018 NeurIPS conference. The competition followed the same format as the 2017 competition track for NIPS. Out of 21 submitted proposals, eight were selected, spanning the areas of robotics, health, computer vision, natural language processing, systems and physics. Competitions have become an integral part of advancing the state of the art in artificial intelligence (AI). They exhibit one important difference from benchmarks: competitions test a system end-to-end rather than evaluating only a single component, and they assess the practicability of an algorithmic solution in addition to its feasibility.
Editors: Sergio Escalera; Ralf Herbrich
ISSN: 2520-1328
ISBN: 978-3-030-29134-1
Notes: HuPBA; not mentioned
Call Number: Admin @ si @ HeE2020 (Serial 3328)