Publicacions CVC -- Query Results

<< 1 2 3 4 5 6 7 8 9 10 >> [11–14]

Details

Records
Author	Jose Elias Yauri
Title	Deep Learning Based Data Fusion Approaches for the Assessment of Cognitive States on EEG Signals			Type	Book Whole
Year	2023	Publication	PhD Thesis, Universitat Autonoma de Barcelona-CVC	Abbreviated Journal
Volume		Issue		Pages
Keywords
Abstract	For millennia, the study of the couple brain-mind has fascinated the humanity in order to understand the complex nature of cognitive states. A cognitive state is the state of the mind at a specific time and involves cognition activities to acquire and process information for making a decision, solving a problem, or achieving a goal. While normal cognitive states assist in the successful accomplishment of tasks; on the contrary, abnormal states of the mind can lead to task failures due to a reduced cognition capability. In this thesis, we focus on the assessment of cognitive states by means of the analysis of ElectroEncephaloGrams (EEG) signals using deep learning methods. EEG records the electrical activity of the brain using a set of electrodes placed on the scalp that output a set of spatiotemporal signals that are expected to be correlated to a specific mental process. From the point of view of artificial intelligence, any method for the assessment of cognitive states using EEG signals as input should face several challenges. On the one hand, one should determine which is the most suitable approach for the optimal combination of the multiple signals recorded by EEG electrodes. On the other hand, one should have a protocol for the collection of good quality unambiguous annotated data, and an experimental design for the assessment of the generalization and transfer of models. In order to tackle them, first, we propose several convolutional neural architectures to perform data fusion of the signals recorded by EEG electrodes, at raw signal and feature levels. Four channel fusion methods, easy to incorporate into any neural network architecture, are proposed and assessed. Second, we present a method to create an unambiguous dataset for the prediction of cognitive mental workload using serious games and an Airbus-320 flight simulator. Third, we present a validation protocol that takes into account the levels of generalization of models based on the source and amount of test data. Finally, the approaches for the assessment of cognitive states are applied to two use cases of high social impact: the assessment of mental workload for personalized support systems in the cockpit and the detection of epileptic seizures. The results obtained from the first use case show the feasibility of task transfer of models trained to detect workload in serious games to real flight scenarios. The results from the second use case show the generalization capability of our EEG channel fusion methods at k-fold cross-validation, patient-specific, and population levels.
Address
Corporate Author				Thesis	Ph.D. thesis
Publisher	IMPRIMA	Place of Publication		Editor	Aura Hernandez;Debora Gil
Language		Summary Language		Original Title
Series Editor		Series Title		Abbreviated Series Title
Series Volume		Series Issue		Edition
ISSN		ISBN		Medium
Area		Expedition		Conference
Notes	IAM			Approved	no
Call Number	Admin @ si @ Yau2023			Serial	3962
Permanent link to this record



Author	Shiqi Yang
Title	Towards Source-Free Domain Adaption of Neural Networks in an Open World			Type	Book Whole
Year	2023	Publication	PhD Thesis, Universitat Autonoma de Barcelona-CVC	Abbreviated Journal
Volume		Issue		Pages
Keywords
Abstract	Though they achieve great success, deep neural networks typically require a huge amount of labeled data for training. However, collecting labeled data is often laborious and expensive. It would, therefore, be ideal if the knowledge obtained from label-rich datasets could be transferred to unlabeled data. However, deep networks are weak at generalizing to unseen domains, even when the differences are only subtle between the datasets. In real-world situations, a typical factor impairing the model generalization ability is the distribution shift between data from different domains, which is a long-standing problem usually termed as (unsupervised) domain adaptation. A crucial requirement in the methodology of these domain adaptation methods is that they require access to source domain data during the adaptation process to the target domain. Accessibility to the source data of a trained source model is often impossible in real-world applications, for example, when deploying domain adaptation algorithms on mobile devices where the computational capacity is limited or in situations where data privacy rules limit access to the source domain data. Without access to the source domain data, existing methods suffer from inferior performance. Thus, in this thesis, we investigate domain adaptation without source data (termed as source-free domain adaptation) in multiple different scenarios that focus on image classification tasks. We first study the source-free domain adaptation problem in a closed-set setting, where the label space of different domains is identical. Only accessing the pretrained source model, we propose to address source-free domain adaptation from the perspective of unsupervised clustering. We achieve this based on nearest neighborhood clustering. In this way, we can transfer the challenging source-free domain adaptation task to a type of clustering problem. The final optimization objective is an upper bound containing only two simple terms, which can be explained as discriminability and diversity. We show that this allows us to relate several other methods in domain adaptation, unsupervised clustering and contrastive learning via the perspective of discriminability and diversity.
Address
Corporate Author				Thesis	Ph.D. thesis
Publisher	IMPRIMA	Place of Publication		Editor	Joost
Language		Summary Language		Original Title
Series Editor		Series Title		Abbreviated Series Title
Series Volume		Series Issue		Edition
ISSN		ISBN	978-84-126409-3-9	Medium
Area		Expedition		Conference
Notes	LAMP			Approved	no
Call Number	Admin @ si @ Yan2023			Serial	3963
Permanent link to this record



Author	Yi Xiao
Title	Advancing Vision-based End-to-End Autonomous Driving			Type	Book Whole
Year	2023	Publication	PhD Thesis, Universitat Autonoma de Barcelona-CVC	Abbreviated Journal
Volume		Issue		Pages
Keywords
Abstract	In autonomous driving, artificial intelligence (AI) processes the traffic environment to drive the vehicle to a desired destination. Currently, there are different paradigms that address the development of AI-enabled drivers. On the one hand, we find modular pipelines, which divide the driving task into sub-tasks such as perception, maneuver planning, and control. On the other hand, we find end-to-end driving approaches that attempt to learn the direct mapping of raw data from input sensors to vehicle control signals. The latter are relatively less studied but are gaining popularity as they are less demanding in terms of data labeling. Therefore, in this thesis, our goal is to investigate end-to-end autonomous driving. We propose to evaluate three approaches to tackle the challenge of end-to-end autonomous driving. First, we focus on the input, considering adding depth information as complementary to RGB data, in order to mimic the human being’s ability to estimate the distance to obstacles. Notice that, in the real world, these depth maps can be obtained either from a LiDAR sensor, or a trained monocular depth estimation module, where human labeling is not needed. Then, based on the intuition that the latent space of end-to-end driving models encodes relevant information for driving, we use it as prior knowledge for training an affordancebased driving model. In this case, the trained affordance-based model can achieve good performance while requiring less human-labeled data, and it can provide interpretability regarding driving actions. Finally, we present a new pure vision-based end-to-end driving model termed CIL++, which is trained by imitation learning. CIL++ leverages modern best practices, such as a large horizontal field of view and a self-attention mechanism, which are contributing to the agent’s understanding of the driving scene and bringing a better imitation of human drivers. Using training data without any human labeling, our model yields almost expert performance in the CARLA NoCrash benchmark and could rival SOTA models that require large amounts of human-labeled data.
Address
Corporate Author				Thesis	Ph.D. thesis
Publisher	IMPRIMA	Place of Publication		Editor	Antonio Lopez
Language		Summary Language		Original Title
Series Editor		Series Title		Abbreviated Series Title
Series Volume		Series Issue		Edition
ISSN		ISBN	978-84-126409-4-6	Medium
Area		Expedition		Conference
Notes	ADAS			Approved	no
Call Number	Admin @ si @ Xia2023			Serial	3964
Permanent link to this record



Author	Diego Velazquez
Title	Towards Robustness in Computer-based Image Understanding			Type	Book Whole
Year	2023	Publication	PhD Thesis, Universitat Autonoma de Barcelona-CVC	Abbreviated Journal
Volume		Issue		Pages
Keywords
Abstract	This thesis embarks on an exploratory journey into robustness in deep learning, with a keen focus on the intertwining facets of generalization, explainability, and edge cases within the realm of computer vision. In deep learning, robustness epitomizes a model’s resilience and flexibility, grounded on its capacity to generalize across diverse data distributions, explain its predictions transparently, and navigate the intricacies of edge cases effectively. The challenges associated with robust generalization are multifaceted, encompassing the model’s performance on unseen data and its defense against out-of-distribution data and adversarial attacks. Bridging this gap, the potential of Embedding Propagation (EP) for improving out-of-distribution generalization is explored. EP is depicted as a powerful tool facilitating manifold smoothing, which in turn fortifies the model’s robustness against adversarial onslaughts and bolsters performance in few-shot and self-/semi-supervised learning scenarios. In the labyrinth of deep learning models, the path to robustness often intersects with explainability. As model complexity increases, so does the urgency to decipher their decision-making processes. Acknowledging this, the thesis introduces a robust framework for evaluating and comparing various counterfactual explanation methods, echoing the imperative of explanation quality over quantity and spotlighting the intricacies of diversifying explanations. Simultaneously, the deep learning landscape is fraught with edge cases – anomalies in the form of small objects or rare instances in object detection tasks that defy the norm. Confronting this, the thesis presents an extension of the DETR (DEtection TRansformer) model to enhance small object detection. The devised DETR-FP, embedding the Feature Pyramid technique, demonstrating improvement in small objects detection accuracy, albeit facing challenges like high computational costs. With emergence of foundation models in mind, the thesis unveils EarthView, the largest scale remote sensing dataset to date, built for the self-supervised learning of a robust foundational model for remote sensing. Collectively, these studies contribute to the grand narrative of robustness in deep learning, weaving together the strands of generalization, explainability, and edge case performance. Through these methodological advancements and novel datasets, the thesis calls for continued exploration, innovation, and refinement to fortify the bastion of robust computer vision.
Address
Corporate Author				Thesis	Ph.D. thesis
Publisher	IMPRIMA	Place of Publication		Editor	Jordi Gonzalez;Josep M. Gonfaus;Pau Rodriguez
Language		Summary Language		Original Title
Series Editor		Series Title		Abbreviated Series Title
Series Volume		Series Issue		Edition
ISSN		ISBN	978-81-126409-5-3	Medium
Area		Expedition		Conference
Notes	ISE			Approved	no
Call Number	Admin @ si @ Vel2023			Serial	3965
Permanent link to this record



Author	Bonifaz Stuhr
Title	Towards Unsupervised Representation Learning: Learning, Evaluating and Transferring Visual Representations			Type	Book Whole
Year	2023	Publication	PhD Thesis, Universitat Autonoma de Barcelona-CVC	Abbreviated Journal
Volume		Issue		Pages
Keywords
Abstract	Unsupervised representation learning aims at finding methods that learn representations from data without annotation-based signals. Abstaining from annotations not only leads to economic benefits but may – and to some extent already does – result in advantages regarding the representation’s structure, robustness, and generalizability to different tasks. In the long run, unsupervised methods are expected to surpass their supervised counterparts due to the reduction of human intervention and the inherently more general setup that does not bias the optimization towards an objective originating from specific annotation-based signals. While major advantages of unsupervised representation learning have been recently observed in natural language processing, supervised methods still dominate in vision domains for most tasks. In this dissertation, we contribute to the field of unsupervised (visual) representation learning from three perspectives: (i) Learning representations: We design unsupervised, backpropagation-free Convolutional Self-Organizing Neural Networks (CSNNs) that utilize self-organization- and Hebbian-based learning rules to learn convolutional kernels and masks to achieve deeper backpropagation-free models. Thereby, we observe that backpropagation-based and -free methods can suffer from an objective function mismatch between the unsupervised pretext task and the target task. This mismatch can lead to performance decreases for the target task. (ii) Evaluating representations: We build upon the widely used (non-)linear evaluation protocol to define pretext- and target-objective-independent metrics for measuring the objective function mismatch. With these metrics, we evaluate various pretext and target tasks and disclose dependencies of the objective function mismatch concerning different parts of the training and model setup. (iii) Transferring representations: We contribute CARLANE, the first 3-way sim-to-real domain adaptation benchmark for 2D lane detection. We adopt several well-known unsupervised domain adaptation methods as baselines and propose a method based on prototypical cross-domain self-supervised learning. Finally, we focus on pixel-based unsupervised domain adaptation and contribute a content-consistent unpaired image-to-image translation method that utilizes masks, global and local discriminators, and similarity sampling to mitigate content inconsistencies, as well as feature-attentive denormalization to fuse content-based statistics into the generator stream. In addition, we propose the cKVD metric to incorporate class-specific content inconsistencies into perceptual metrics for measuring translation quality.
Address
Corporate Author				Thesis	Ph.D. thesis
Publisher	IMPRIA	Place of Publication		Editor	Jordi Gonzalez;Jurgen Brauer
Language		Summary Language		Original Title
Series Editor		Series Title		Abbreviated Series Title
Series Volume		Series Issue		Edition
ISSN		ISBN	978-84-126409-6-0	Medium
Area		Expedition		Conference
Notes	ISE			Approved	no
Call Number	Admin @ si @ Stu2023			Serial	3966
Permanent link to this record



Author	Ruben Perez Tito
Title	Exploring the role of Text in Visual Question Answering on Natural Scenes and Documents			Type	Book Whole
Year	2023	Publication	PhD Thesis, Universitat Autonoma de Barcelona-CVC	Abbreviated Journal
Volume		Issue		Pages
Keywords
Abstract	Visual Question Answering (VQA) is the task where given an image and a natural language question, the objective is to generate a natural language answer. At the intersection between computer vision and natural language processing, this task can be seen as a measure of image understanding capabilities, as it requires to reason about objects, actions, colors, positions, the relations between the different elements as well as commonsense reasoning, world knowledge, arithmetic skills and natural language understanding. However, even though the text present in the images conveys important semantically rich information that is explicit and not available in any other form, most VQA methods remained illiterate, largely ignoring the text despite its potential significance. In this thesis, we set out on a journey to bring reading capabilities to computer vision models applied to the VQA task, creating new datasets and methods that can read, reason and integrate the text with other visual cues in natural scene images and documents. In Chapter 3, we address the combination of scene text with visual information to fully understand all the nuances of natural scene images. To achieve this objective, we define a new sub-task of VQA that requires reading the text in the image, and highlight the limitations of the current methods. In addition, we propose a new architecture that integrates both modalities and jointly reasons about textual and visual features. In Chapter 5, we shift the domain of VQA with reading capabilities and apply it on scanned industry document images, providing a high-level end-purpose perspective to Document Understanding, which has been primarily focused on digitizing the document’s contents and extracting key values without considering the ultimate purpose of the extracted information. For this, we create a dataset which requires methods to reason about the unique and challenging elements of documents, such as text, images, tables, graphs and complex layouts, to provide accurate answers in natural language. However, we observed that explicit visual features provide a slight contribution in the overall performance, since the main information is usually conveyed within the text and its position. In consequence, in Chapter 6, we propose VQA on infographic images, seeking for document images with more visually rich elements that require to fully exploit visual information in order to answer the questions. We show the performance gap of different methods when used over industry scanned and infographic images, and propose a new method that integrates the visual features in early stages, which allows the transformer architecture to exploit the visual features during the self-attention operation. Instead, in Chapter 7, we apply VQA on a big collection of single-page documents, where the methods must find which documents are relevant to answer the question, and provide the answer itself. Finally, in Chapter 8, mimicking real-world application problems where systems must process documents with multiple pages, we address the multipage document visual question answering task. We demonstrate the limitations of existing methods, including models specifically designed to process long sequences. To overcome these limitations, we propose a hierarchical architecture that can process long documents, answer questions, and provide the index of the page where the information to answer the question is located as an explainability measure.
Address
Corporate Author				Thesis	Ph.D. thesis
Publisher	IMPRIMA	Place of Publication		Editor	Ernest Valveny
Language		Summary Language		Original Title
Series Editor		Series Title		Abbreviated Series Title
Series Volume		Series Issue		Edition
ISSN		ISBN	978-84-124793-5-5	Medium
Area		Expedition		Conference
Notes	DAG			Approved	no
Call Number	Admin @ si @ Per2023			Serial	3967
Permanent link to this record



Author	David Geronimo
Title	A Global Approach to Vision-Based Pedestrian Detection for Advanced Driver Assistance Systems			Type	Book Whole
Year	2010	Publication	PhD Thesis, Universitat Autonoma de Barcelona-CVC	Abbreviated Journal
Volume		Issue		Pages
Keywords
Abstract	At the beginning of the 21th century, traffic accidents have become a major problem not only for developed countries but also for emerging ones. As in other scientific areas in which Artificial Intelligence is becoming a key actor, advanced driver assistance systems, and concretely pedestrian protection systems based on Computer Vision, are becoming a strong topic of research aimed at improving the safety of pedestrians. However, the challenge is of considerable complexity due to the varying appearance of humans (e.g., clothes, size, aspect ratio, shape, etc.), the dynamic nature of on-board systems and the unstructured moving environments that urban scenarios represent. In addition, the required performance is demanding both in terms of computational time and detection rates. In this thesis, instead of focusing on improving specific tasks as it is frequent in the literature, we present a global approach to the problem. Such a global overview starts by the proposal of a generic architecture to be used as a framework both to review the literature and to organize the studied techniques along the thesis. We then focus the research on tasks such as foreground segmentation, object classification and refinement following a general viewpoint and exploring aspects that are not usually analyzed. In order to perform the experiments, we also present a novel pedestrian dataset that consists of three subsets, each one addressed to the evaluation of a different specific task in the system. The results presented in this thesis not only end with a proposal of a pedestrian detection system but also go one step beyond by pointing out new insights, formalizing existing and proposed algorithms, introducing new techniques and evaluating their performance, which we hope will provide new foundations for future research in the area.
Address	Antonio Lopez;Krystian Mikolajczyk;Jaume Amores;Dariu M. Gavrila;Oriol Pujol;Felipe Lumbreras
Corporate Author				Thesis	Ph.D. thesis
Publisher	Ediciones Graficas Rey	Place of Publication		Editor	Antonio Lopez;Krystian Mikolajczyk;Jaume Amores;Dariu M. Gavrila;Oriol Pujol;Felipe Lumbreras
Language		Summary Language		Original Title
Series Editor		Series Title		Abbreviated Series Title
Series Volume		Series Issue		Edition
ISSN		ISBN	978-84-936529-5-1	Medium
Area		Expedition		Conference
Notes	ADAS			Approved	no
Call Number	ADAS @ adas @ Ger2010			Serial	1279
Permanent link to this record



Author	Jiaolong Xu
Title	Domain Adaptation of Deformable Part-based Models			Type	Book Whole
Year	2015	Publication	PhD Thesis, Universitat Autonoma de Barcelona-CVC	Abbreviated Journal
Volume		Issue		Pages
Keywords
Abstract	On-board pedestrian detection is crucial for Advanced Driver Assistance Systems (ADAS). An accurate classication is fundamental for vision-based pedestrian detection. The underlying assumption for learning classiers is that the training set and the deployment environment (testing) follow the same probability distribution regarding the features used by the classiers. However, in practice, there are dierent reasons that can break this constancy assumption. Accordingly, reusing existing classiers by adapting them from the previous training environment (source domain) to the new testing one (target domain) is an approach with increasing acceptance in the computer vision community. In this thesis we focus on the domain adaptation of deformable part-based models (DPMs) for pedestrian detection. As a prof of concept, we use a computer graphic based synthetic dataset, i.e. a virtual world, as the source domain, and adapt the virtual-world trained DPM detector to various real-world dataset. We start by exploiting the maximum detection accuracy of the virtual-world trained DPM. Even though, when operating in various real-world datasets, the virtualworld trained detector still suer from accuracy degradation due to the domain gap of virtual and real worlds. We then focus on domain adaptation of DPM. At the rst step, we consider single source and single target domain adaptation and propose two batch learning methods, namely A-SSVM and SA-SSVM. Later, we further consider leveraging multiple target (sub-)domains for progressive domain adaptation and propose a hierarchical adaptive structured SVM (HA-SSVM) for optimization. Finally, we extend HA-SSVM for the challenging online domain adaptation problem, aiming at making the detector to automatically adapt to the target domain online, without any human intervention. All of the proposed methods in this thesis do not require revisiting source domain data. The evaluations are done on the Caltech pedestrian detection benchmark. Results show that SA-SSVM slightly outperforms A-SSVM and avoids accuracy drops as high as 15 points when comparing with a non-adapted detector. The hierarchical model learned by HA-SSVM further boosts the domain adaptation performance. Finally, the online domain adaptation method has demonstrated that it can achieve comparable accuracy to the batch learned models while not requiring manually label target domain examples. Domain adaptation for pedestrian detection is of paramount importance and a relatively unexplored area. We humbly hope the work in this thesis could provide foundations for future work in this area.
Address	April 2015
Corporate Author				Thesis	Ph.D. thesis
Publisher		Place of Publication		Editor	Antonio Lopez
Language		Summary Language		Original Title
Series Editor		Series Title		Abbreviated Series Title
Series Volume		Series Issue		Edition
ISSN		ISBN	978-84-943427-1-4	Medium
Area		Expedition		Conference
Notes	ADAS; 600.076			Approved	no
Call Number	Admin @ si @ Xu2015			Serial	2631
Permanent link to this record



Author	Cesar de Souza
Title	Action Recognition in Videos: Data-efficient approaches for supervised learning of human action classification models for video			Type	Book Whole
Year	2018	Publication	PhD Thesis, Universitat Autonoma de Barcelona-CVC	Abbreviated Journal
Volume		Issue		Pages
Keywords
Abstract	In this dissertation, we explore different ways to perform human action recognition in video clips. We focus on data efficiency, proposing new approaches that alleviate the need for laborious and time-consuming manual data annotation. In the first part of this dissertation, we start by analyzing previous state-of-the-art models, comparing their differences and similarities in order to pinpoint where their real strengths come from. Leveraging this information, we then proceed to boost the classification accuracy of shallow models to levels that rival deep neural networks. We introduce hybrid video classification architectures based on carefully designed unsupervised representations of handcrafted spatiotemporal features classified by supervised deep networks. We show in our experiments that our hybrid model combine the best of both worlds: it is data efficient (trained on 150 to 10,000 short clips) and yet improved significantly on the state of the art, including deep models trained on millions of manually labeled images and videos. In the second part of this research, we investigate the generation of synthetic training data for action recognition, as it has recently shown promising results for a variety of other computer vision tasks. We propose an interpretable parametric generative model of human action videos that relies on procedural generation and other computer graphics techniques of modern game engines. We generate a diverse, realistic, and physically plausible dataset of human action videos, called PHAV for “Procedural Human Action Videos”. It contains a total of 39,982 videos, with more than 1,000 examples for each action of 35 categories. Our approach is not limited to existing motion capture sequences, and we procedurally define 14 synthetic actions. We then introduce deep multi-task representation learning architectures to mix synthetic and real videos, even if the action categories differ. Our experiments on the UCF-101 and HMDB-51 benchmarks suggest that combining our large set of synthetic videos with small real-world datasets can boost recognition performance, outperforming fine-tuning state-of-the-art unsupervised generative models of videos.
Address	April 2018
Corporate Author				Thesis	Ph.D. thesis
Publisher	Ediciones Graficas Rey	Place of Publication		Editor	Antonio Lopez;Naila Murray
Language		Summary Language		Original Title
Series Editor		Series Title		Abbreviated Series Title
Series Volume		Series Issue		Edition
ISSN		ISBN		Medium
Area		Expedition		Conference
Notes	ADAS; 600.118			Approved	no
Call Number	Admin @ si @ Sou2018			Serial	3127
Permanent link to this record



Author	David Aldavert
Title	Efficient and Scalable Handwritten Word Spotting on Historical Documents using Bag of Visual Words			Type	Book Whole
Year	2021	Publication	PhD Thesis, Universitat Autonoma de Barcelona-CVC	Abbreviated Journal
Volume		Issue		Pages
Keywords
Abstract	Word spotting can be defined as the pattern recognition tasked aimed at locating and retrieving a specific keyword within a document image collection without explicitly transcribing the whole corpus. Its use is particularly interesting when applied in scenarios where Optical Character Recognition performs poorly or can not be used at all. This thesis focuses on such a scenario, word spotting on historical handwritten documents that have been written by a single author or by multiple authors with a similar calligraphy. This problem requires a visual signature that is robust to image artifacts, flexible to accommodate script variations and efficient to retrieve information in a rapid manner. For this, we have developed a set of word spotting methods that on their foundation use the well known Bag-of-Visual-Words (BoVW) representation. This representation has gained popularity among the document image analysis community to characterize handwritten words in an unsupervised manner. However, most approaches on this field rely on a basic BoVW configuration and disregard complex encoding and spatial representations. We determine which BoVW configurations provide the best performance boost to a spotting system. Then, we extend the segmentation-based word spotting, where word candidates are given a priori, to segmentation-free spotting. The proposed approach seeds the document images with overlapping word location candidates and characterizes them with a BoVW signature. Retrieval is achieved comparing the query and candidate signatures and returning the locations that provide a higher consensus. This is a simple but powerful approach that requires a more compact signature than in a segmentation-based scenario. We first project the BoVW signature into a reduced semantic topics space and then compress it further using Product Quantizers. The resulting signature only requires a few dozen bytes, allowing us to index thousands of pages on a common desktop computer. The final system still yields a performance comparable to the state-of-the-art despite all the information loss during the compression phases. Afterwards, we also study how to combine different modalities of information in order to create a query-by-X spotting system where, words are indexed using an information modality and queries are retrieved using another. We consider three different information modalities: visual, textual and audio. Our proposal is to create a latent feature space where features which are semantically related are projected onto the same topics. Creating thus a new feature space where information from different modalities can be compared. Later, we consider the codebook generation and descriptor encoding problem. The codebooks used to encode the BoVW signatures are usually created using an unsupervised clustering algorithm and, they require to test multiple parameters to determine which configuration is best for a certain document collection. We propose a semantic clustering algorithm which allows to estimate the best parameter from data. Since gather annotated data is costly, we use synthetically generated word images. The resulting codebook is database agnostic, i. e. a codebook that yields a good performance on document collections that use the same script. We also propose the use of an additional codebook to approximate descriptors and reduce the descriptor encoding complexity to sub-linear. Finally, we focus on the problem of signatures dimensionality. We propose a new symbol probability signature where each bin represents the probability that a certain symbol is present a certain location of the word image. This signature is extremely compact and combined with compression techniques can represent word images with just a few bytes per signature.
Address	April 2021
Corporate Author				Thesis	Ph.D. thesis
Publisher	Ediciones Graficas Rey	Place of Publication		Editor	Marçal Rusiñol;Josep Llados
Language		Summary Language		Original Title
Series Editor		Series Title		Abbreviated Series Title
Series Volume		Series Issue		Edition
ISSN		ISBN	978-84-122714-5-4	Medium
Area		Expedition		Conference
Notes	DAG; 600.121			Approved	no
Call Number	Admin @ si @ Ald2021			Serial	3601
Permanent link to this record



Author	Parichehr Behjati Ardakani
Title	Towards Efficient and Robust Convolutional Neural Networks for Single Image Super-Resolution			Type	Book Whole
Year	2022	Publication	PhD Thesis, Universitat Autonoma de Barcelona-CVC	Abbreviated Journal
Volume		Issue		Pages
Keywords
Abstract	Single image super-resolution (SISR) is an important task in image processing which aims to enhance the resolution of imaging systems. Recently, SISR has witnessed great strides with the rapid development of deep learning. Recent advances in SISR are mostly devoted to designing deeper and wider networks to enhance their representation learning capacity. However, as the depth of networks increases, deep learning-based methods are faced with the challenge of computational complexity in practice. Moreover, most existing methods rarely leverage the intermediate features and also do not discriminate the computation of features by their frequencial components, thereby achieving relatively low performance. Aside from the aforementioned problems, another desired ability is to upsample images to arbitrary scales using a single model. Most current SISR methods train a dedicated model for each target resolution, losing generality and increasing memory requirements. In this thesis, we address the aforementioned issues and propose solutions to them: i) We present a novel frequency-based enhancement block which treats different frequencies in a heterogeneous way and also models inter-channel dependencies, which consequently enrich the output feature. Thus it helps the network generate more discriminative representations by explicitly recovering finer details. ii) We introduce OverNet which contains two main parts: a lightweight feature extractor that follows a novel recursive framework of skip and dense connections to reduce low-level feature degradation, and an overscaling module that generates an accurate SR image by internally constructing an overscaled intermediate representation of the output features. Then, to solve the problem of reconstruction at arbitrary scale factors, we introduce a novel multi-scale loss, that allows the simultaneous training of all scale factors using a single model. iii) We propose a directional variance attention network which leverages a novel attention mechanism to enhance features in different channels and spatial regions. Moreover, we introduce a novel procedure for using attention mechanisms together with residual blocks to facilitate the preservation of finer details. Finally, we demonstrate that our approaches achieve considerably better performance than previous state-of-the-art methods, in terms of both quantitative and visual quality.
Address	April, 2022
Corporate Author				Thesis	Ph.D. thesis
Publisher		Place of Publication		Editor	Jordi Gonzalez;Xavier Roca;Pau Rodriguez
Language		Summary Language		Original Title
Series Editor		Series Title		Abbreviated Series Title
Series Volume		Series Issue		Edition
ISSN		ISBN	978-84-124793-1-7	Medium
Area		Expedition		Conference
Notes	ISE			Approved	no
Call Number	Admin @ si @ Beh2022			Serial	3713
Permanent link to this record



Author	Muhammad Anwer Rao
Title	Color for Object Detection and Action Recognition			Type	Book Whole
Year	2013	Publication	PhD Thesis, Universitat Autonoma de Barcelona-CVC	Abbreviated Journal
Volume		Issue		Pages
Keywords
Abstract	Recognizing object categories in real world images is a challenging problem in computer vision. The deformable part based framework is currently the most successful approach for object detection. Generally, HOG are used for image representation within the part-based framework. For action recognition, the bag-of-word framework has shown to provide promising results. Within the bag-of-words framework, local image patches are described by SIFT descriptor. Contrary to object detection and action recognition, combining color and shape has shown to provide the best performance for object and scene recognition. In the first part of this thesis, we analyze the problem of person detection in still images. Standard person detection approaches rely on intensity based features for image representation while ignoring the color. Channel based descriptors is one of the most commonly used approaches in object recognition. This inspires us to evaluate incorporating color information using the channel based fusion approach for the task of person detection. In the second part of the thesis, we investigate the problem of object detection in still images. Due to high dimensionality, channel based fusion increases the computational cost. Moreover, channel based fusion has been found to obtain inferior results for object category where one of the visual varies significantly. On the other hand, late fusion is known to provide improved results for a wide range of object categories. A consequence of late fusion strategy is the need of a pure color descriptor. Therefore, we propose to use Color attributes as an explicit color representation for object detection. Color attributes are compact and computationally efficient. Consequently color attributes are combined with traditional shape features providing excellent results for object detection task. Finally, we focus on the problem of action detection and classification in still images. We investigate the potential of color for action classification and detection in still images. We also evaluate different fusion approaches for combining color and shape information for action recognition. Additionally, an analysis is performed to validate the contribution of color for action recognition. Our results clearly demonstrate that combining color and shape information significantly improve the performance of both action classification and detection in still images.
Address	Barcelona
Corporate Author				Thesis	Ph.D. thesis
Publisher	Ediciones Graficas Rey	Place of Publication		Editor	Antonio Lopez;Joost Van de Weijer
Language		Summary Language		Original Title
Series Editor		Series Title		Abbreviated Series Title
Series Volume		Series Issue		Edition
ISSN		ISBN		Medium
Area		Expedition		Conference
Notes	ADAS			Approved	no
Call Number	Admin @ si @ Rao2013			Serial	2281
Permanent link to this record



Author	Javier Marin
Title	Pedestrian Detection Based on Local Experts			Type	Book Whole
Year	2013	Publication	PhD Thesis, Universitat Autonoma de Barcelona-CVC	Abbreviated Journal
Volume		Issue		Pages
Keywords
Abstract	During the last decade vision-based human detection systems have started to play a key rolein multiple applications linked to driver assistance, surveillance, robot sensing and home automation. Detecting humans is by far one of the most challenging tasks in Computer Vision. This is mainly due to the high degree of variability in the human appearanceassociated to the clothing, pose, shape and size. Besides, other factors such as cluttered scenarios, partial occlusions, or environmental conditions can make the detection task even harder. Most promising methods of the state-of-the-art rely on discriminative learning paradigms which are fed with positive and negative examples. The training data is one of the most relevant elements in order to build a robust detector as it has to cope the large variability of the target. In order to create this dataset human supervision is required. The drawback at this point is the arduous effort of annotating as well as looking for such claimed variability. In this PhD thesis we address two recurrent problems in the literature. In the first stage,we aim to reduce the consuming task of annotating, namely, by using computer graphics. More concretely, we develop a virtual urban scenario for later generating a pedestrian dataset. Then, we train a detector using this dataset, and finally we assess if this detector can be successfully applied in a real scenario. In the second stage, we focus on increasing the robustness of our pedestrian detectors under partial occlusions. In particular, we present a novel occlusion handling approach to increase the performance of block-based holistic methods under partial occlusions. For this purpose, we make use of local experts via a RandomSubspaceMethod (RSM) to handle these cases. If the method infers a possible partial occlusion, then the RSM, based on performance statistics obtained from partially occluded data, is applied. The last objective of this thesis is to propose a robust pedestrian detector based on an ensemble of local experts. To achieve this goal, we use the random forest paradigm, where the trees act as ensembles an their nodesare the local experts. In particular, each expert focus on performing a robust classification ofa pedestrian body patch. This approach offers computational efficiency and far less design complexity when compared to other state-of-the-artmethods, while reaching better accuracy
Address	Barcelona
Corporate Author				Thesis	Ph.D. thesis
Publisher	Ediciones Graficas Rey	Place of Publication		Editor	Antonio Lopez;Jaume Amores
Language		Summary Language		Original Title
Series Editor		Series Title		Abbreviated Series Title
Series Volume		Series Issue		Edition
ISSN		ISBN		Medium
Area		Expedition		Conference
Notes	ADAS			Approved	no
Call Number	Admin @ si @ Mar2013			Serial	2280
Permanent link to this record



Author	Wenjuan Gong
Title	3D Motion Data aided Human Action Recognition and Pose Estimation			Type	Book Whole
Year	2013	Publication	PhD Thesis, Universitat Autonoma de Barcelona-CVC	Abbreviated Journal
Volume		Issue		Pages
Keywords
Abstract	In this work, we explore human action recognition and pose estimation prob- lems. Different from traditional works of learning from 2D images or video sequences and their annotated output, we seek to solve the problems with ad- ditional 3D motion capture information, which helps to fill the gap between 2D image features and human interpretations. We first compare two different schools of approaches commonly used for 3D pose estimation from 2D pose configuration: modeling and learning methods. By looking into experiments results and considering our problems, we fixed a learning method as the following approaches to do pose estimation. We then establish a framework by adding a module of detecting 2D pose configuration from images with varied background, which widely extend the application of the approach. We also seek to directly estimate 3D poses from image features, instead of estimating 2D poses as a intermediate module. We explore a robust input feature, which combined with the proposed distance measure, provides a solution for noisy or corrupted inputs. We further utilize the above method to estimate weak poses,which is a concise representation of the original poses by using dimension deduction technologies, from image features. Weak pose space is where we calculate vocabulary and label action types using a bog of words pipeline. Temporal information of an action is taken into consideration by considering several consecutive frames as a single unit for computing vocabulary and histogram assignments.
Address	Barcelona
Corporate Author				Thesis	Ph.D. thesis
Publisher	Ediciones Graficas Rey	Place of Publication		Editor	Jordi Gonzalez;Xavier Roca
Language		Summary Language		Original Title
Series Editor		Series Title		Abbreviated Series Title
Series Volume		Series Issue		Edition
ISSN		ISBN		Medium
Area		Expedition		Conference
Notes	ISE			Approved	no
Call Number	Admin @ si @ Gon2013			Serial	2279
Permanent link to this record



Author	Murad Al Haj
Title	Looking at Faces: Detection, Tracking and Pose Estimation			Type	Book Whole
Year	2013	Publication	PhD Thesis, Universitat Autonoma de Barcelona-CVC	Abbreviated Journal
Volume		Issue		Pages
Keywords
Abstract	Humans can effortlessly perceive faces, follow them over space and time, and decode their rich content, such as pose, identity and expression. However, despite many decades of research on automatic facial perception in areas like face detection, expression recognition, pose estimation and face recognition, and despite many successes, a complete solution remains elusive. This thesis is dedicated to three problems in automatic face perception, namely face detection, face tracking and pose estimation. In face detection, an initial simple model is presented that uses pixel-based heuristics to segment skin locations and hand-crafted rules to determine the locations of the faces present in an image. Different colorspaces are studied to judge whether a colorspace transformation can aid skin color detection. The output of this study is used in the design of a more complex face detector that is able to successfully generalize to different scenarios. In face tracking, a framework that combines estimation and control in a joint scheme is presented to track a face with a single pan-tilt-zoom camera. While this work is mainly motivated by tracking faces, it can be easily applied atop of any detector to track different objects. The applicability of this method is demonstrated on simulated as well as real-life scenarios. The last and most important part of this thesis is dedicate to monocular head pose estimation. In this part, a method based on partial least squares (PLS) regression is proposed to estimate pose and solve the alignment problem simultaneously. The contributions of this work are two-fold: 1) demonstrating that the proposed method achieves better than state-of-the-art results on the estimation problem and 2) developing a technique to reduce misalignment based on the learned PLS factors that outperform multiple instance learning (MIL) without the need for any re-training or the inclusion of misaligned samples in the training process, as normally done in MIL.
Address	Barcelona
Corporate Author				Thesis	Ph.D. thesis
Publisher	Ediciones Graficas Rey	Place of Publication		Editor	Jordi Gonzalez;Xavier Roca
Language		Summary Language		Original Title
Series Editor		Series Title		Abbreviated Series Title
Series Volume		Series Issue		Edition
ISSN		ISBN		Medium
Area		Expedition		Conference
Notes	ISE			Approved	no
Call Number	Admin @ si @ Haj2013			Serial	2278
Permanent link to this record