|
Josep Brugues Pujolras, Lluis Gomez, & Dimosthenis Karatzas. (2022). A Multilingual Approach to Scene Text Visual Question Answering. In Document Analysis Systems. 15th IAPR International Workshop (DAS 2022) (pp. 65–79).
Abstract: Scene Text Visual Question Answering (ST-VQA) has recently emerged as a hot research topic in Computer Vision. Current ST-VQA models have great potential for many types of applications but lack the ability to perform well on more than one language at a time due to the lack of multilingual data, as well as the use of monolingual word embeddings for training. In this work, we explore the possibility of obtaining bilingual and multilingual VQA models. In that regard, we use an already established VQA model that uses monolingual word embeddings as part of its pipeline and substitute them with FastText and BPEmb multilingual word embeddings that have been aligned to English. Our experiments demonstrate that it is possible to obtain bilingual and multilingual VQA models with a minimal loss in performance in languages not used during training, as well as a multilingual model trained in multiple languages that matches the performance of the respective monolingual baselines.
Keywords: Scene text; Visual question answering; Multilingual word embeddings; Vision and language; Deep learning
|
|
|
Josep Llados, Felipe Lumbreras, & X. Varona. (1999). A multidocument platform for automatic reading of identity cards.
|
|
|
Maria Vanrell, Jordi Vitria, & Xavier Roca. (1997). A multidimensional scaling approach to explore the behavior of a texture perception algorithm. Machine Vision and Applications, 9, 262–271.
|
|
|
Debora Gil, & Guillermo Torres. (2020). A multi-shape loss function with adaptive class balancing for the segmentation of lung structures. In 34th International Congress and Exhibition on Computer Assisted Radiology & Surgery.
|
|
|
Guillermo Torres, & Debora Gil. (2020). A multi-shape loss function with adaptive class balancing for the segmentation of lung structures. IJCAR - International Journal of Computer Assisted Radiology and Surgery, 15(1), S154–55.
|
|
|
Agnes Borras, & Josep Llados. (2008). A Multi-Scale Layout Descriptor Based on Delaunay Triangulation for Image Retrieval. In 3rd International Conference on Computer Vision Theory and Applications VISAPP (2) 2008 (Vol. 2, pp. 139–144).
|
|
|
Judit Martinez, Eva Costa, P. Herreros, F. Javier Sanchez, & Ramon Baldrich. (2003). A Modular and Scalable Architecture for PC-Based Real-Time Vision Systems. Real-Time Imaging (IF: 0.512), 9, 99–112.
|
|
|
Marçal Rusiñol. (2006). A Model of Vectorial Signatures in Terms of Expressive Sub-Shapes: Symbol Indexation in Technical Documents.
|
|
|
Ernest Valveny, & Enric Marti. (2003). A model for image generation and symbol recognition through the deformation of lineal shapes. PRL - Pattern Recognition Letters, 24(15), 2857–2867.
Abstract: We describe a general framework for the recognition of distorted images of lineal shapes, which relies on three items: a model to represent lineal shapes and their deformations, a model for the generation of distorted binary images and the combination of both models in a common probabilistic framework, where the generation of deformations is related to an internal energy, and the generation of binary images to an external energy. Then, recognition consists in the minimization of a global energy function, performed by using the EM algorithm. This general framework has been applied to the recognition of hand-drawn lineal symbols in graphic documents.
|
|
|
Daniel Ponsa. (2001). A model based pedestrian tracking review.
|
|
|
V. Valev, & Petia Radeva. (1992). A Method of Solving Pattern or Image Recognition Problems by Learning Boolean Formulas.
|
|
|
Francesco Ciompi, Oriol Pujol, & Petia Radeva. (2010). A meta-learning approach to Conditional Random Fields using Error-Correcting Output Codes. In 20th International Conference on Pattern Recognition (pp. 710–713).
Abstract: We present a meta-learning framework for the design of potential functions for Conditional Random Fields. The design of both node potential and edge potential is formulated as a classification problem where margin classifiers are used. The set of state transitions for the edge potential is treated as a set of different classes, thus defining a multi-class learning problem. The Error-Correcting Output Codes (ECOC) technique is used to deal with the multi-class problem. Furthermore, the point defined by the combination of margin classifiers in the ECOC space is interpreted in a probabilistic manner, and the obtained distance values are then converted into potential values. The proposed model exhibits very promising results when applied to two real detection problems.
|
|
|
Sergio Vera, Miguel Angel Gonzalez Ballester, & Debora Gil. (2012). A medial map capturing the essential geometry of organs. In ISBI Workshop on Open Source Medical Image Analysis software (pp. 1691–1694). IEEE.
Abstract: Medial representations are powerful tools for describing and parameterizing the volumetric shape of anatomical structures. Accurate computation of one pixel wide medial surfaces is mandatory. Those surfaces must represent faithfully the geometry of the volume. Although morphological methods produce excellent results in 2D, their complexity and quality drop across dimensions, due to a more complex description of pixel neighborhoods. This paper introduces a continuous operator for accurate and efficient computation of medial structures of arbitrary dimension. Our experiments show its higher performance for medical imaging applications in terms of simplicity of medial structures and capability for reconstructing the anatomical volume.
Keywords: Medial Surface Representation, Volume Reconstruction, Geometry, Image reconstruction, Liver, Manifolds, Shape, Surface morphology, Surface reconstruction
|
|
|
Olivier Penacchio, Xavier Otazu, Arnold J. Wilkins, & Sara M. Haigh. (2023). A mechanistic account of visual discomfort. FN - Frontiers in Neuroscience, 17.
Abstract: Much of the neural machinery of the early visual cortex, from the extraction of local orientations to contextual modulations through lateral interactions, is thought to have developed to provide a sparse encoding of contour in natural scenes, allowing the brain to process efficiently most of the visual scenes we are exposed to. Certain visual stimuli, however, cause visual stress, a set of adverse effects ranging from simple discomfort to migraine attacks, and epileptic seizures in the extreme, all phenomena linked with an excessive metabolic demand. The theory of efficient coding suggests a link between excessive metabolic demand and images that deviate from natural statistics. Yet, the mechanisms linking energy demand and image spatial content in discomfort remain elusive. Here, we used theories of visual coding that link image spatial structure and brain activation to characterize the response to images observers reported as uncomfortable in a biologically based neurodynamic model of the early visual cortex that included excitatory and inhibitory layers to implement contextual influences. We found three clear markers of aversive images: a larger overall activation in the model, a less sparse response, and a more unbalanced distribution of activity across spatial orientations. When the ratio of excitation over inhibition was increased in the model, a phenomenon hypothesised to underlie interindividual differences in susceptibility to visual discomfort, the three markers of discomfort progressively shifted toward values typical of the response to uncomfortable stimuli. Overall, these findings propose a unifying mechanistic explanation for why there are differences between images and between observers, suggesting how visual input and idiosyncratic hyperexcitability give rise to abnormal brain responses that result in visual stress.
|
|
|
Gemma Sanchez, Josep Llados, & K. Tombre. (2002). A mean string algorithm to compute the average among a set of 2D shapes. PRL - Pattern Recognition Letters, 23(1-3), 203–214.
|
|