
Author Victoria Ruiz; Angel Sanchez; Jose F. Velez; Bogdan Raducanu
  Title Waste Classification with Small Datasets and Limited Resources Type Book Chapter
  Year 2022 Publication ICT Applications for Smart Cities. Intelligent Systems Reference Library Abbreviated Journal  
  Volume 224 Issue Pages 185-203  
  Keywords  
  Abstract Automatic waste recycling has become a very important societal challenge nowadays, raising people’s awareness of the need for a cleaner environment and a more sustainable lifestyle. With the transition to Smart Cities, and thanks to advanced ICT solutions, this problem has received a new impulse. The waste recycling focus has shifted from general waste treatment facilities to an individual responsibility, where each person should become aware of selective waste separation. The surge of mobile devices, accompanied by a significant increase in computational power, has potentiated and facilitated this individual role. An automated image-based waste classification mechanism can help with more efficient recycling and a reduction of contamination from residuals. Despite the good results achieved with deep learning methodologies for this task, their Achilles’ heel is that they require large neural networks which need significant computational resources for training and are therefore not suitable for mobile devices. To circumvent this apparently intractable problem, we rely on knowledge distillation to transfer the network’s knowledge from a larger network (called the ‘teacher’) to a smaller, more compact one (referred to as the ‘student’), thus making image classification feasible on a device with limited resources. For evaluation, we considered as ‘teachers’ large architectures such as InceptionResNet or DenseNet and as ‘students’ several configurations of the MobileNets. We used the publicly available TrashNet dataset to demonstrate that the distillation process does not significantly affect the performance (e.g. classification accuracy) of the student network.  
  Address September 2022  
  Corporate Author Thesis  
  Publisher Springer Place of Publication Editor  
  Language Summary Language Original Title  
  Series Editor Series Title Abbreviated Series Title ISRL  
  Series Volume Series Issue Edition  
  ISSN ISBN 978-3-031-06306-0 Medium  
  Area Expedition Conference  
  Notes LAMP Approved no  
  Call Number Admin @ si @ Serial 3813  
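
As a companion to the abstract above, here is a minimal sketch of a teacher-student knowledge distillation objective. The record does not give the exact loss the authors used, so this follows the standard soft-target formulation; the function name and the T/alpha values are illustrative assumptions.

import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.7):
    # Soft term: match the teacher's temperature-softened class distribution.
    # T*T rescales gradients so this term stays comparable across temperatures.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)
    # Hard term: ordinary cross-entropy against the ground-truth labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

In such a setup the teacher (e.g. an InceptionResNet) runs frozen in eval mode, and only the student (e.g. a MobileNet) receives gradient updates.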
 

 
Author Aitor Alvarez-Gila
  Title Self-supervised learning for image-to-image translation in the small data regime Type Book Whole
  Year 2022 Publication PhD Thesis, Universitat Autonoma de Barcelona-CVC Abbreviated Journal  
  Volume Issue Pages  
  Keywords Computer vision; Neural networks; Self-supervised learning; Image-to-image mapping; Probabilistic programming  
  Abstract The mass irruption of Deep Convolutional Neural Networks (CNNs) in computer vision since 2012 led to a dominance of the image understanding paradigm consisting of an end-to-end, fully supervised learning workflow over large-scale annotated datasets. This approach proved to be extremely useful at solving a myriad of classic and new computer vision tasks with unprecedented performance —often surpassing that of humans—, at the expense of vast amounts of human-labeled data, extensive computational resources and the disposal of all of our prior knowledge on the task at hand. Even though simple transfer learning methods, such as fine-tuning, have achieved remarkable impact, their success when the amount of labeled data in the target domain is small is limited. Furthermore, the non-static nature of data generation sources will often result in data distribution shifts that degrade the performance of deployed models. As a consequence, there is a growing demand for methods that can exploit elements of prior knowledge and sources of information other than the manually generated ground truth annotations of the images during the network training process, so that models can adapt to new domains that constitute, if not a small data regime, at least a small labeled data regime. This thesis targets such a few- or no-labeled-data scenario in three distinct image-to-image mapping learning problems. It contributes various approaches that leverage our previous knowledge of different elements of the image formation process: We first present a data-efficient framework for both defocus and motion blur detection, based on a model able to produce realistic synthetic local degradations. The framework comprises a self-supervised, a weakly-supervised and a semi-supervised instantiation, depending on the absence or availability and the nature of human annotations, and outperforms fully-supervised counterparts in a variety of settings. Our knowledge of color image formation is then used to gather input and target ground truth image pairs for the RGB to hyperspectral image reconstruction task. We make use of a CNN to tackle this problem, which, for the first time, allows us to exploit spatial context and achieve state-of-the-art results given a limited hyperspectral image set. In our last contribution to the subfield of data-efficient image-to-image transformation problems, we present the novel semi-supervised task of zero-pair cross-view semantic segmentation: we consider the case of relocation of the camera in an end-to-end trained and deployed monocular, fixed-view semantic segmentation system often found in industry. Under the assumption that we are allowed to obtain an additional set of synchronized but unlabeled image pairs of new scenes from both original and new camera poses, we present ZPCVNet, a model and training procedure that enables the production of dense semantic predictions in either source or target views at inference time. The lack of existing suitable public datasets for developing this approach led us to the creation of MVMO, a large-scale Multi-View, Multi-Object path-traced dataset with per-view semantic segmentation annotations. We expect MVMO to propel future research in the exciting, under-developed fields of cross-view and multi-view semantic segmentation.
Last, in a piece of applied research with direct application in the context of process monitoring of an Electric Arc Furnace (EAF) in a steelmaking plant, we also consider the problem of simultaneously estimating the temperature and spectral emissivity of distant hot emissive samples. To that end, we design our own capturing device, which integrates three point spectrometers covering a wide range of the ultraviolet, visible, and infrared spectra and is capable of registering the radiance signal incoming from an 8 cm diameter spot located up to 20 m away. We then define a physically accurate radiative transfer model that comprises the effects of atmospheric absorbance, of the optical system transfer function, and of the sample temperature and spectral emissivity themselves. We solve this inverse problem without the need for annotated data using a probabilistic programming-based Bayesian approach, which yields full posterior distribution estimates of the involved variables that are consistent with laboratory-grade measurements.  
  Address July, 2019  
  Corporate Author Thesis Ph.D. thesis  
  Publisher Place of Publication Editor Joost Van de Weijer; Estibaliz Garrote  
  Language Summary Language Original Title  
  Series Editor Series Title Abbreviated Series Title  
  Series Volume Series Issue Edition  
  ISSN ISBN Medium  
  Area Expedition Conference  
  Notes LAMP Approved no  
  Call Number Admin @ si @ Alv2022 Serial 3716  
 

 
Author Eduardo Aguilar; Bhalaji Nagarajan; Beatriz Remeseiro; Petia Radeva
  Title Bayesian deep learning for semantic segmentation of food images Type Journal Article
  Year 2022 Publication Computers and Electrical Engineering Abbreviated Journal CEE  
  Volume 103 Issue Pages 108380  
  Keywords Deep learning; Uncertainty quantification; Bayesian inference; Image segmentation; Food analysis  
  Abstract Deep learning has provided promising results in various applications; however, algorithms tend to be overconfident in their predictions, even though they may be entirely wrong. Particularly for critical applications, the model should provide answers only when it is very sure of them. This article presents a Bayesian version of two different state-of-the-art semantic segmentation methods to perform multi-class segmentation of foods and estimate the uncertainty about the given predictions. The proposed methods were evaluated on three public pixel-annotated food datasets. As a result, we can conclude that Bayesian methods improve the performance achieved by the baseline architectures and, in addition, provide information to improve decision-making. Furthermore, based on the extracted uncertainty map, we proposed three measures to rank the images according to the degree of noisy annotations they contained. Note that the top 135 images ranked by one of these measures include more than half of the worst-labeled food images.  
  Address October 2022  
  Corporate Author Thesis  
  Publisher Science Direct Place of Publication Editor  
  Language Summary Language Original Title  
  Series Editor Series Title Abbreviated Series Title  
  Series Volume Series Issue Edition  
  ISSN ISBN Medium  
  Area Expedition Conference  
  Notes MILAB Approved no  
  Call Number Admin @ si @ ANR2022 Serial 3763  
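
The record does not reproduce the paper's exact Bayesian formulation, but a common way to obtain per-pixel uncertainty maps of the kind described is Monte Carlo dropout; the sketch below shows that generic technique, with the sample count and the entropy summary as illustrative choices.

import torch

def mc_dropout_segmentation(model, image, n_samples=20):
    # Keep dropout active at test time; in practice batch-norm layers
    # would be forced back to eval() so their running statistics stay fixed.
    model.train()
    with torch.no_grad():
        probs = torch.stack([torch.softmax(model(image), dim=1)
                             for _ in range(n_samples)])   # (S, B, C, H, W)
    mean = probs.mean(dim=0)                                # (B, C, H, W)
    # Predictive entropy as a per-pixel uncertainty map.
    entropy = -(mean * torch.log(mean + 1e-8)).sum(dim=1)   # (B, H, W)
    return mean.argmax(dim=1), entropy

Images whose uncertainty maps carry unusually high aggregate entropy can then be ranked as candidates for noisy annotations, in the spirit of the measures the abstract mentions.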
 

 
Author Aitor Alvarez-Gila; Joost Van de Weijer; Yaxing Wang; Estibaliz Garrote
  Title MVMO: A Multi-Object Dataset for Wide Baseline Multi-View Semantic Segmentation Type Conference Article
  Year 2022 Publication 29th IEEE International Conference on Image Processing Abbreviated Journal  
  Volume Issue Pages  
  Keywords multi-view; cross-view; semantic segmentation; synthetic dataset  
  Abstract We present MVMO (Multi-View, Multi-Object dataset): a synthetic dataset of 116,000 scenes containing randomly placed objects of 10 distinct classes and captured from 25 camera locations in the upper hemisphere. MVMO comprises photorealistic, path-traced image renders, together with semantic segmentation ground truth for every view. Unlike existing multi-view datasets, MVMO features wide baselines between cameras and a high density of objects, which lead to large disparities, heavy occlusions and view-dependent object appearance. Single-view semantic segmentation is hindered by self- and inter-object occlusions, which additional viewpoints could help resolve. We therefore expect MVMO to propel research in multi-view semantic segmentation and cross-view semantic transfer. We also provide baselines that show that new research is needed in such fields to exploit the complementary information of multi-view setups.  
  Address Bordeaux, France; October 2022  
  Corporate Author Thesis  
  Publisher Place of Publication Editor  
  Language Summary Language Original Title  
  Series Editor Series Title Abbreviated Series Title  
  Series Volume Series Issue Edition  
  ISSN ISBN Medium  
  Area Expedition Conference ICIP  
  Notes LAMP Approved no  
  Call Number Admin @ si @ AWW2022 Serial 3781  
 

 
Author Ruben Ballester; Xavier Arnal Clemente; Carles Casacuberta; Meysam Madadi; Ciprian Corneanu
  Title Towards explaining the generalization gap in neural networks using topological data analysis Type Miscellaneous
  Year 2022 Publication Arxiv Abbreviated Journal  
  Volume Issue Pages  
  Keywords  
  Abstract Understanding how neural networks generalize on unseen data is crucial for designing more robust and reliable models. In this paper, we study the generalization gap of neural networks using methods from topological data analysis. For this purpose, we compute homological persistence diagrams of weighted graphs constructed from neuron activation correlations after a training phase, aiming to capture patterns that are linked to the generalization capacity of the network. We compare the usefulness of different numerical summaries from persistence diagrams and show that a combination of some of them can accurately predict and partially explain the generalization gap without the need for a test set. Evaluation on two computer vision recognition tasks (CIFAR10 and SVHN) shows competitive generalization gap prediction when compared against state-of-the-art methods.  
  Address  
  Corporate Author Thesis  
  Publisher Place of Publication Editor  
  Language Summary Language Original Title  
  Series Editor Series Title Abbreviated Series Title  
  Series Volume Series Issue Edition  
  ISSN ISBN Medium  
  Area Expedition Conference  
  Notes HUPBA; not mentioned Approved no  
  Call Number Admin @ si @ BAC2022 Serial 3821  
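
To make the persistence machinery above concrete, here is a minimal, self-contained sketch of 0-dimensional persistence for a graph weighted by activation correlations. The paper's diagrams are more general (dedicated TDA libraries would be used in practice), and the distance definition below is an assumption.

import numpy as np

def zeroth_persistence_deaths(corr):
    # Filtration by distance = 1 - |correlation|: strongly correlated
    # neurons connect early. Each merge of two components records a death.
    n = corr.shape[0]
    dist = 1.0 - np.abs(corr)
    edges = sorted((dist[i, j], i, j) for i in range(n) for j in range(i + 1, n))
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    deaths = []
    for w, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            deaths.append(w)  # a component born at 0 dies at filtration value w
    return np.array(deaths)

Numerical summaries such as the mean or standard deviation of these death values are the kind of persistence-diagram statistics the abstract combines to predict the generalization gap.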
 

 
Author Arnau Baro
  Title Reading Music Systems: From Deep Optical Music Recognition to Contextual Methods Type Book Whole
  Year 2022 Publication PhD Thesis, Universitat Autonoma de Barcelona-CVC Abbreviated Journal  
  Volume Issue Pages  
  Keywords  
  Abstract The transcription of sheet music into some machine-readable format can be carried out manually. However, the complexity of music notation inevitably leads to burdensome software for music score editing, which makes the whole process very time-consuming and prone to errors. Consequently, automatic transcription systems for musical documents represent interesting tools.
Document analysis is the subject that deals with the extraction and processing of documents through image and pattern recognition; it is a branch of computer vision. Taking music scores as the source, the field devoted to addressing this task is known as Optical Music Recognition (OMR). Typically, an OMR system takes an image of a music score and automatically extracts its content into some symbolic structure such as MEI or MusicXML.
In this dissertation, we have investigated different methods for recognizing a single staff section (e.g. scores for violin, flute, etc.), much in the same way as most text recognition research focuses on recognizing words appearing in a given line image. These methods are based on two different methodologies. On the one hand, we present two methods based on Recurrent Neural Networks, in particular the Long Short-Term Memory Neural Network. On the other hand, a method based on Sequence-to-Sequence models is detailed.
Music context is needed to improve the OMR results, just like language models and dictionaries help in handwriting recognition. For example, syntactical rules and grammars could be easily defined to cope with ambiguities in the rhythm; in music theory, the time signature defines the number of beats per bar unit. Thus, in the second part of this dissertation, different methodologies have been investigated to improve OMR recognition. We have explored three different methods: (a) a graphic tree-structure representation, Dendrograms, that joins, at each level, its primitives following a set of rules; (b) the incorporation of Language Models to model the probability of a sequence of tokens; and (c) graph neural networks that analyze the music scores to avoid meaningless relationships between music primitives.
Finally, to train all these methodologies, and given the method-specificity of the datasets in the literature, we have created four different music datasets. Two of them are synthetic with a modern or old handwritten appearance, whereas the other two are real handwritten scores, one of them modern and the other old.
 
  Address  
  Corporate Author Thesis Ph.D. thesis  
  Publisher IMPRIMA Place of Publication Editor Alicia Fornes  
  Language Summary Language Original Title  
  Series Editor Series Title Abbreviated Series Title  
  Series Volume Series Issue Edition  
  ISSN ISBN 978-84-124793-8-6 Medium  
  Area Expedition Conference  
  Notes DAG; Approved no  
  Call Number Admin @ si @ Bar2022 Serial 3754  
 

 
Author Arnau Baro; Carles Badal; Pau Torras; Alicia Fornes
  Title Handwritten Historical Music Recognition through Sequence-to-Sequence with Attention Mechanism Type Conference Article
  Year 2022 Publication 3rd International Workshop on Reading Music Systems (WoRMS2021) Abbreviated Journal  
  Volume Issue Pages 55-59  
  Keywords Optical Music Recognition; Digits; Image Classification  
  Abstract Despite decades of research in Optical Music Recognition (OMR), the recognition of old handwritten music scores remains a challenge because of the variabilities in the handwriting styles, paper degradation, lack of standard notation, etc. Therefore, the research in OMR systems adapted to the particularities of old manuscripts is crucial to accelerate the conversion of music scores existing in archives into digital libraries, fostering the dissemination and preservation of our music heritage. In this paper we explore the adaptation of sequence-to-sequence models with attention mechanism (used in translation and handwritten text recognition) and the generation of specific synthetic data for recognizing old music scores. The experimental validation demonstrates that our approach is promising, especially when compared with long short-term memory neural networks.  
  Address July 23, 2021, Alicante (Spain)  
  Corporate Author Thesis  
  Publisher Place of Publication Editor  
  Language Summary Language Original Title  
  Series Editor Series Title Abbreviated Series Title  
  Series Volume Series Issue Edition  
  ISSN ISBN Medium  
  Area Expedition Conference WoRMS  
  Notes DAG; 600.121; 600.162; 602.230; 600.140 Approved no  
  Call Number Admin @ si @ BBT2022 Serial 3734  
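
The record does not detail the exact architecture, so as a reference point the sketch below shows the standard additive (Bahdanau-style) attention block that sequence-to-sequence recognition models of this kind build on; all dimensions are illustrative.

import torch
import torch.nn as nn

class AdditiveAttention(nn.Module):
    def __init__(self, enc_dim, dec_dim, attn_dim):
        super().__init__()
        self.W_enc = nn.Linear(enc_dim, attn_dim)
        self.W_dec = nn.Linear(dec_dim, attn_dim)
        self.v = nn.Linear(attn_dim, 1)

    def forward(self, enc_feats, dec_state):
        # enc_feats: (B, T, enc_dim) encoder features along the staff;
        # dec_state: (B, dec_dim) current decoder hidden state.
        scores = self.v(torch.tanh(self.W_enc(enc_feats)
                                   + self.W_dec(dec_state).unsqueeze(1))).squeeze(-1)
        weights = torch.softmax(scores, dim=1)                # (B, T)
        context = (weights.unsqueeze(-1) * enc_feats).sum(1)  # (B, enc_dim)
        return context, weights

At each decoding step the context vector is combined with the previous music-symbol embedding before predicting the next token.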
 

 
Author Parichehr Behjati Ardakani
  Title Towards Efficient and Robust Convolutional Neural Networks for Single Image Super-Resolution Type Book Whole
  Year 2022 Publication PhD Thesis, Universitat Autonoma de Barcelona-CVC Abbreviated Journal  
  Volume Issue Pages  
  Keywords  
  Abstract Single image super-resolution (SISR) is an important task in image processing which aims to enhance the resolution of imaging systems. Recently, SISR has witnessed great strides with the rapid development of deep learning. Recent advances in SISR are mostly devoted to designing deeper and wider networks to enhance their representation learning capacity. However, as the depth of networks increases, deep learning-based methods are faced with the challenge of computational complexity in practice. Moreover, most existing methods rarely leverage the intermediate features and also do not discriminate the computation of features by their frequency components, thereby achieving relatively low performance. Aside from the aforementioned problems, another desired ability is to upsample images to arbitrary scales using a single model. Most current SISR methods train a dedicated model for each target resolution, losing generality and increasing memory requirements. In this thesis, we address the aforementioned issues and propose solutions to them: i) We present a novel frequency-based enhancement block which treats different frequencies in a heterogeneous way and also models inter-channel dependencies, which consequently enriches the output features. Thus it helps the network generate more discriminative representations by explicitly recovering finer details. ii) We introduce OverNet, which contains two main parts: a lightweight feature extractor that follows a novel recursive framework of skip and dense connections to reduce low-level feature degradation, and an overscaling module that generates an accurate SR image by internally constructing an overscaled intermediate representation of the output features. Then, to solve the problem of reconstruction at arbitrary scale factors, we introduce a novel multi-scale loss that allows the simultaneous training of all scale factors using a single model. iii) We propose a directional variance attention network which leverages a novel attention mechanism to enhance features in different channels and spatial regions. Moreover, we introduce a novel procedure for using attention mechanisms together with residual blocks to facilitate the preservation of finer details. Finally, we demonstrate that our approaches achieve considerably better performance than previous state-of-the-art methods, in terms of both quantitative and visual quality.  
  Address April, 2022  
  Corporate Author Thesis Ph.D. thesis  
  Publisher Place of Publication Editor Jordi Gonzalez; Xavier Roca; Pau Rodriguez  
  Language Summary Language Original Title  
  Series Editor Series Title Abbreviated Series Title  
  Series Volume Series Issue Edition  
  ISSN ISBN 978-84-124793-1-7 Medium  
  Area Expedition Conference  
  Notes ISE Approved no  
  Call Number Admin @ si @ Beh2022 Serial 3713  
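
The thesis record mentions a multi-scale loss enabling one model to train for all scale factors simultaneously but does not spell it out; a minimal sketch of the general idea follows, where the scale set, the L1 choice, and the scale-conditioned forward pass are all assumptions.

import torch.nn.functional as F

def multi_scale_loss(model, lr_image, hr_images, scales=(2, 3, 4)):
    # hr_images: dict mapping scale factor -> ground-truth image at that scale.
    total = 0.0
    for s in scales:
        sr = model(lr_image, scale=s)  # hypothetical scale-conditioned forward
        total = total + F.l1_loss(sr, hr_images[s])
    return total / len(scales)

Training each step against several target resolutions is what removes the need for one dedicated model per scale factor.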
 

 
Author David Berga; Xavier Otazu
  Title A neurodynamic model of saliency prediction in v1 Type Journal Article
  Year 2022 Publication Neural Computation Abbreviated Journal NEURALCOMPUT  
  Volume 34 Issue 2 Pages 378-414  
  Keywords  
  Abstract Lateral connections in the primary visual cortex (V1) have long been hypothesized to be responsible for several visual processing mechanisms such as brightness induction, chromatic induction, visual discomfort, and bottom-up visual attention (also named saliency). Many computational models have been developed to independently predict these and other visual processes, but no computational model has been able to reproduce all of them simultaneously. In this work, we show that a biologically plausible computational model of lateral interactions of V1 is able to simultaneously predict saliency and all the aforementioned visual processes. Our model's architecture (NSWAM) is based on Penacchio's neurodynamic model of lateral connections of V1. It is defined as a network of firing rate neurons, sensitive to visual features such as brightness, color, orientation, and scale. We tested NSWAM saliency predictions using images from several eye tracking data sets. We show that the accuracy of predictions obtained by our architecture, using shuffled metrics, is similar to other state-of-the-art computational methods, particularly with synthetic images (CAT2000-Pattern and SID4VAM) that mainly contain low-level features. Moreover, we outperform other biologically inspired saliency models that are specifically designed to exclusively reproduce saliency. We show that our biologically plausible model of lateral connections can simultaneously explain different visual processes present in V1 (without applying any type of training or optimization and keeping the same parameterization for all the visual processes). This can be useful for the definition of a unified architecture of the primary visual cortex.  
  Address  
  Corporate Author Thesis  
  Publisher Place of Publication Editor  
  Language Summary Language Original Title  
  Series Editor Series Title Abbreviated Series Title  
  Series Volume Series Issue Edition  
  ISSN ISBN Medium  
  Area Expedition Conference  
  Notes NEUROBIT; 600.128; 600.120 Approved no  
  Call Number Admin @ si @ BeO2022 Serial 3696  
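
Penacchio's model, on which NSWAM is based, is considerably richer, but the underlying family of dynamics is the classic firing-rate network; a minimal Euler-integration sketch follows, with all parameters illustrative.

import numpy as np

def simulate_firing_rates(W, inp, dt=1e-3, tau=0.01, steps=1000):
    # Integrates tau * dr/dt = -r + f(W r + input), with rectification as f.
    f = lambda x: np.maximum(x, 0.0)
    r = np.zeros(W.shape[0])
    for _ in range(steps):
        r = r + (dt / tau) * (-r + f(W @ r + inp))
    return r

In a V1-style model, W would encode the lateral excitatory and inhibitory connectivity between units tuned to brightness, color, orientation and scale, and the converged rates are read out as the saliency (or induction) response.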
 

 
Author Asma Bensalah; Alicia Fornes; Cristina Carmona-Duarte; Josep Llados
  Title Easing Automatic Neurorehabilitation via Classification and Smoothness Analysis Type Conference Article
  Year 2022 Publication Intertwining Graphonomics with Human Movements. 20th International Conference of the International Graphonomics Society, IGS 2022 Abbreviated Journal  
  Volume 13424 Issue Pages 336-348  
  Keywords Neurorehabilitation; Upper-limb; Movement classification; Movement smoothness; Deep learning; Jerk  
  Abstract Assessing the quality of movements for post-stroke patients during the rehabilitation phase is vital given that there is no standard stroke rehabilitation plan for all patients; in fact, it depends basically on the patient’s functional independence and their progress along the rehabilitation sessions. To tackle this challenge and make neurorehabilitation more agile, we propose an automatic assessment pipeline that starts by recognising patients’ movements by means of a shallow deep learning architecture, then measures the movement quality using the jerk measure and related measures. A particularity of this work is that the dataset used is clinically relevant, since it represents movements inspired by Fugl-Meyer, a well-known upper-limb clinical assessment scale for stroke patients. We show that it is possible to detect the contrast between healthy and patient movements in terms of smoothness, besides reaching conclusions about the patients’ progress during the rehabilitation sessions that correspond to the clinicians’ findings about each case.  
  Address June 7-9, 2022, Las Palmas de Gran Canaria, Spain  
  Corporate Author Thesis  
  Publisher Place of Publication Editor  
  Language Summary Language Original Title  
  Series Editor Series Title Abbreviated Series Title LNCS  
  Series Volume Series Issue Edition  
  ISSN ISBN Medium  
  Area Expedition Conference IGS  
  Notes DAG; 600.121; 600.162; 602.230; 600.140 Approved no  
  Call Number Admin @ si @ BFC2022 Serial 3738  
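
Of the 'jerk measure and related measures' the abstract refers to, a standard choice is the log dimensionless jerk; below is a small sketch for a sampled 3-D upper-limb trajectory (the exact measure used in the paper may differ).

import numpy as np

def log_dimensionless_jerk(position, dt):
    # position: (N, 3) array of hand/wrist coordinates sampled every dt seconds.
    vel = np.gradient(position, dt, axis=0)
    acc = np.gradient(vel, dt, axis=0)
    jerk = np.gradient(acc, dt, axis=0)
    duration = dt * (len(position) - 1)
    v_peak = np.linalg.norm(vel, axis=1).max()
    integral = np.trapz(np.linalg.norm(jerk, axis=1) ** 2, dx=dt)
    # More negative values indicate a less smooth movement.
    return -np.log(integral * duration ** 3 / v_peak ** 2)

Comparing this value across rehabilitation sessions gives the kind of smoothness-based progress signal the paper contrasts with clinicians' findings.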
 

 
Author Sonia Baeza; Debora Gil; I. Garcia Olive; M. Salcedo; J. Deportos; Carles Sanchez; Guillermo Torres; G. Moragas; Antoni Rosell
  Title A novel intelligent radiomic analysis of perfusion SPECT/CT images to optimize pulmonary embolism diagnosis in COVID-19 patients Type Journal Article
  Year 2022 Publication EJNMMI Physics Abbreviated Journal EJNMMI-PHYS  
  Volume 9 Issue 1, Article 84 Pages 1-17  
  Keywords  
  Abstract Background: COVID-19 infection, especially in cases with pneumonia, is associated with a high rate of pulmonary embolism (PE). In patients with contraindications for CT pulmonary angiography (CTPA) or non-diagnostic CTPA, perfusion single-photon emission computed tomography/computed tomography (Q-SPECT/CT) is a diagnostic alternative. The goal of this study is to develop a radiomic diagnostic system to detect PE based only on the analysis of Q-SPECT/CT scans.
Methods: This radiomic diagnostic system is based on a local analysis of Q-SPECT/CT volumes that includes both CT and Q-SPECT values for each volume point. We present a combined approach that uses radiomic features extracted from each scan as input into a fully connected classification neural network that optimizes a weighted cross-entropy loss trained to discriminate between three different types of image patterns (pixel sample level): healthy lungs (control group), PE and pneumonia. Four types of models using different configurations of parameters were tested.
Results: The proposed radiomic diagnostic system was trained on 20 patients (4,927 sets of samples of three types of image patterns) and validated in a group of 39 patients (4,410 sets of samples of three types of image patterns). In the training group, COVID-19 infection corresponded to 45% of the cases and 51.28% in the test group. In the test group, the best model for determining different types of image patterns with PE presented a sensitivity, specificity, positive predictive value and negative predictive value of 75.1%, 98.2%, 88.9% and 95.4%, respectively. The best model for detecting pneumonia presented a sensitivity, specificity, positive predictive value and negative predictive value of 94.1%, 93.6%, 85.2% and 97.6%, respectively. The area under the curve (AUC) was 0.92 for PE and 0.91 for pneumonia. When the results obtained at the pixel sample level are aggregated into regions of interest, the sensitivity of the PE increases to 85%, and all metrics improve for pneumonia.
Conclusion: This radiomic diagnostic system was able to identify the different lung imaging patterns and is a first step toward a comprehensive intelligent radiomic system to optimize the diagnosis of PE by Q-SPECT/CT.
 
  Address 5 December 2022  
  Corporate Author Thesis  
  Publisher Springer Place of Publication Editor  
  Language Summary Language Original Title  
  Series Editor Series Title Abbreviated Series Title  
  Series Volume Series Issue Edition  
  ISSN ISBN Medium  
  Area Expedition Conference  
  Notes IAM Approved no  
  Call Number Admin @ si @ BGG2022 Serial 3759  
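
The abstract describes a fully connected network over per-point radiomic features trained with a weighted cross-entropy over three image patterns; here is a minimal sketch, where the layer sizes and class weights are assumptions.

import torch
import torch.nn as nn

class RadiomicClassifier(nn.Module):
    # Three patterns: healthy lung (control), PE, pneumonia.
    def __init__(self, n_features, n_classes=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features, 64), nn.ReLU(),
            nn.Linear(64, 32), nn.ReLU(),
            nn.Linear(32, n_classes),
        )

    def forward(self, x):
        return self.net(x)

# Weighted cross-entropy compensates for class imbalance at the pixel-sample
# level; these weights are illustrative (e.g. inverse class frequency).
criterion = nn.CrossEntropyLoss(weight=torch.tensor([0.5, 2.0, 1.5]))

Per-point predictions can then be aggregated into regions of interest, which is the step the abstract reports as improving sensitivity.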
 

 
Author Ali Furkan Biten; Lluis Gomez; Dimosthenis Karatzas
  Title Let there be a clock on the beach: Reducing Object Hallucination in Image Captioning Type Conference Article
  Year 2022 Publication Winter Conference on Applications of Computer Vision Abbreviated Journal  
  Volume Issue Pages 1381-1390  
  Keywords Measurement; Training; Visualization; Analytical models; Computer vision; Computational modeling; Training data  
  Abstract Explaining an image with missing or non-existent objects is known as object bias (hallucination) in image captioning. This behaviour is quite common in state-of-the-art captioning models and is not desirable to humans. To decrease object hallucination in captioning, we propose three simple yet efficient training augmentation methods for sentences, which require no new training data and no increase in model size. By extensive analysis, we show that the proposed methods can significantly diminish our models’ object bias on hallucination metrics. Moreover, we experimentally demonstrate that our methods decrease the dependency on visual features. All of our code, configuration files and model weights are available online.
 
  Address Virtual; Waikoloa; Hawaii; USA; January 2022  
  Corporate Author Thesis  
  Publisher Place of Publication Editor  
  Language Summary Language Original Title  
  Series Editor Series Title Abbreviated Series Title  
  Series Volume Series Issue Edition  
  ISSN ISBN Medium  
  Area Expedition Conference WACV  
  Notes DAG; 600.155; 302.105 Approved no  
  Call Number Admin @ si @ BGK2022 Serial 3662  
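
The 'hallucination metrics' the abstract evaluates against are not named in the record; the standard one for this problem is CHAIR-style object hallucination, sketched below (object extraction from the caption, e.g. noun matching against a synonym list, is left out for brevity).

def hallucination_rate(caption_objects, image_objects):
    # Fraction of objects mentioned in the generated caption that are
    # absent from the image's ground-truth object set.
    mentioned = set(caption_objects)
    if not mentioned:
        return 0.0
    return len(mentioned - set(image_objects)) / len(mentioned)

# e.g. hallucination_rate({'clock', 'beach'}, {'beach', 'umbrella'}) -> 0.5

Lower is better; the paper's sentence augmentations aim to drive this rate down without touching the model architecture or the training data volume.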
 

 
Author Josep Brugues Pujolras; Lluis Gomez; Dimosthenis Karatzas
  Title A Multilingual Approach to Scene Text Visual Question Answering Type Conference Article
  Year 2022 Publication Document Analysis Systems.15th IAPR International Workshop, (DAS2022) Abbreviated Journal  
  Volume Issue Pages 65-79  
  Keywords Scene text; Visual question answering; Multilingual word embeddings; Vision and language; Deep learning  
  Abstract Scene Text Visual Question Answering (ST-VQA) has recently emerged as a hot research topic in Computer Vision. Current ST-VQA models have great potential for many types of applications but lack the ability to perform well on more than one language at a time due to the lack of multilingual data, as well as the use of monolingual word embeddings for training. In this work, we explore the possibility of obtaining bilingual and multilingual VQA models. In that regard, we use an already established VQA model that uses monolingual word embeddings as part of its pipeline and substitute them with FastText and BPEmb multilingual word embeddings that have been aligned to English. Our experiments demonstrate that it is possible to obtain bilingual and multilingual VQA models with a minimal loss in performance in languages not used during training, as well as a multilingual model trained in multiple languages that matches the performance of the respective monolingual baselines.  
  Address La Rochelle, France; May 22–25, 2022  
  Corporate Author Thesis  
  Publisher Place of Publication Editor  
  Language Summary Language Original Title  
  Series Editor Series Title Abbreviated Series Title  
  Series Volume Series Issue Edition  
  ISSN ISBN Medium  
  Area Expedition Conference DAS  
  Notes DAG; 611.004; 600.155; 601.002 Approved no  
  Call Number Admin @ si @ BGK2022b Serial 3695  
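
As a sketch of the substitution the paper describes, replacing monolingual word embeddings with vectors that live in a shared cross-lingual space: the aggregation below (plain averaging) is an illustrative stand-in for however the VQA pipeline actually consumes its question embedding.

import numpy as np

def embed_question(tokens, aligned_vectors, dim=300):
    # aligned_vectors: token -> vector in a space aligned to English,
    # e.g. from FastText or BPEmb multilingual embeddings as in the paper.
    vecs = [aligned_vectors[t] for t in tokens if t in aligned_vectors]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

Because the vectors for, say, Catalan and English words land near each other in the aligned space, a model trained on English questions can process other languages with only a minimal performance loss.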
 

 
Author Ali Furkan Biten
  Title A Bitter-Sweet Symphony on Vision and Language: Bias and World Knowledge Type Book Whole
  Year 2022 Publication PhD Thesis, Universitat Autonoma de Barcelona-CVC Abbreviated Journal  
  Volume Issue Pages  
  Keywords  
  Abstract Vision and Language are broadly regarded as cornerstones of intelligence. Even though language and vision have different aims – language having the purpose of communication and transmission of information, and vision having the purpose of constructing mental representations around us to navigate and interact with objects – they cooperate and depend on one another in many tasks we perform effortlessly. This reliance is actively being studied in various Computer Vision tasks, e.g. image captioning, visual question answering, image-sentence retrieval, phrase grounding, just to name a few. All of these tasks share the inherent difficulty of aligning the two modalities while being robust to language priors and the various biases existing in the datasets. One of the ultimate goals of vision and language research is to be able to inject world knowledge while getting rid of the biases that come with the datasets. In this thesis, we mainly focus on two vision and language tasks, namely Image Captioning and Scene-Text Visual Question Answering (STVQA).
In both domains, we start by defining a new task that requires the utilization of world knowledge, and in both tasks we find that the models commonly employed are prone to biases that exist in the data. Concretely, we introduce new tasks, discover several problems that impede performance at each level, and provide remedies or possible solutions in each chapter: i) We define a new task to move beyond Image Captioning to Image Interpretation that can utilize Named Entities in the form of world knowledge. ii) We study the object hallucination problem in classic Image Captioning systems and develop an architecture-agnostic solution. iii) We define a sub-task of Visual Question Answering that requires reading the text in the image (STVQA), where we highlight the limitations of current models. iv) We propose an architecture for the STVQA task that can point to the answer in the image and show how to combine it with classic VQA models. v) We show how far language can get us in STVQA and discover yet another bias which causes the models to disregard the image while doing Visual Question Answering.
 
  Address  
  Corporate Author Thesis Ph.D. thesis  
  Publisher IMPRIMA Place of Publication Editor Dimosthenis Karatzas;Lluis Gomez  
  Language Summary Language Original Title  
  Series Editor Series Title Abbreviated Series Title  
  Series Volume Series Issue Edition  
  ISSN ISBN 978-84-124793-5-5 Medium  
  Area Expedition Conference  
  Notes DAG Approved no  
  Call Number Admin @ si @ Bit2022 Serial 3755  
 

 
Author Joakim Bruslund Haurum; Meysam Madadi; Sergio Escalera; Thomas B. Moeslund
  Title Multi-Task Classification of Sewer Pipe Defects and Properties Using a Cross-Task Graph Neural Network Decoder Type Conference Article
  Year 2022 Publication Winter Conference on Applications of Computer Vision Abbreviated Journal  
  Volume Issue Pages 2806-2817  
  Keywords Vision Systems; Applications Multi-Task Classification  
  Abstract The sewerage infrastructure is one of the most important and expensive infrastructures in modern society. In order to manage the sewerage infrastructure efficiently, automated sewer inspection has to be utilized. However, while sewer defect classification has been investigated for decades, little attention has been given to classifying sewer pipe properties such as water level, pipe material, and pipe shape, which are needed to evaluate the level of sewer pipe deterioration.
In this work we classify sewer pipe defects and properties concurrently and present a novel decoder-focused multi-task classification architecture, the Cross-Task Graph Neural Network (CT-GNN), which refines the disjointed per-task predictions using cross-task information. The CT-GNN architecture extends the traditional disjointed task-heads decoder by utilizing a cross-task graph and unique class node embeddings. The cross-task graph can either be determined a priori based on the conditional probability between the task classes or determined dynamically using self-attention. CT-GNN can be added to any backbone and trained end-to-end at a small increase in the parameter count. We achieve state-of-the-art performance on all four classification tasks in the Sewer-ML dataset, improving defect classification and water level classification by 5.3 and 8.0 percentage points, respectively. We also outperform the single-task methods as well as other multi-task classification approaches while introducing 50 times fewer parameters than previous model-focused approaches.
 
  Address  
  Corporate Author Thesis  
  Publisher Place of Publication Editor  
  Language Summary Language Original Title  
  Series Editor Series Title Abbreviated Series Title  
  Series Volume Series Issue Edition  
  ISSN ISBN Medium  
  Area Expedition Conference WACV  
  Notes HUPBA; no proj Approved no  
  Call Number Admin @ si @ BME2022 Serial 3638  
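
A minimal sketch of the cross-task message-passing idea described above, for the a-priori variant where the adjacency comes from conditional probabilities between task classes; the dimensions, the GRU update, and the single-layer design are illustrative assumptions rather than the published architecture.

import torch
import torch.nn as nn

class CrossTaskGNNLayer(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.msg = nn.Linear(dim, dim)
        self.update = nn.GRUCell(dim, dim)

    def forward(self, node_feats, adj):
        # node_feats: (N, dim), one embedding per class across all tasks.
        # adj: (N, N) row-normalized conditional-probability adjacency.
        messages = adj @ self.msg(node_feats)
        return self.update(messages, node_feats)

Refined class-node embeddings are then read back into each task head, letting, for example, the water-level prediction inform the defect prediction and vice versa.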