Javier Varona, & Juan J. Villanueva. (1996). Neural networks as spatial filters for image processing: Neurofilters.
|
Javier Varona, & Juan J. Villanueva. (1997). NeuroFilters: Neural Networks for Image Processing.
|
Javier Varona, Jordi Gonzalez, Xavier Roca, & Juan J. Villanueva. (2000). iTrack: Image-based Probabilistic Tracking of People. In 15th International Conference on Pattern Recognition (Vol. 3, pp. 1122–1125).
|
Javier Varona, Jordi Gonzalez, Xavier Roca, & Juan J. Villanueva. (2000). Automatic Selection of Keyframes for Activity Recognition.
|
Javier Varona, Jordi Gonzalez, Xavier Roca, & Juan J. Villanueva. (2003). Appearance Tracking for Video Surveillance.
|
Javier Varona, Jordi Gonzalez, Ignasi Rius, & Juan J. Villanueva. (2008). Importance of Detection for Video Surveillance Applications. Optical Engineering, 47(8), 087201 (9 pages).
|
Javier Varona, Antoni Jaume-i-Capo, Jordi Gonzalez, & Francisco Jose Perales. (2008). Toward Natural Interaction through Visual Recognition of Body Gestures in Real-Time. Interacting with Computers, doi:10.1016/j.intcom.2008.10.001 (available online).
|
Javier Varona, A. Pujol, & Juan J. Villanueva. (1999). Visual Tracking in Application Domains.
|
Javier Varona, A. Pujol, & Juan J. Villanueva. (2000). Visual Tracking in Application Domains.
|
Javier Varona. (2001). Seguimiento visual robusto en entornos complejos [Robust visual tracking in complex environments]. PhD thesis.
|
Javier Selva, Anders S. Johansen, Sergio Escalera, Kamal Nasrollahi, Thomas B. Moeslund, & Albert Clapes. (2023). Video transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(11), 12922–12943.
Abstract: Transformer models have shown great success handling long-range interactions, making them a promising tool for modeling video. However, they lack inductive biases and scale quadratically with input length. These limitations are further exacerbated when dealing with the high dimensionality introduced by the temporal dimension. While there are surveys analyzing the advances of Transformers for vision, none focus on an in-depth analysis of video-specific designs. In this survey, we analyze the main contributions and trends of works leveraging Transformers to model video. Specifically, we delve into how videos are handled at the input level first. Then, we study the architectural changes made to deal with video more efficiently, reduce redundancy, re-introduce useful inductive biases, and capture long-term temporal dynamics. In addition, we provide an overview of different training regimes and explore effective self-supervised learning strategies for video. Finally, we conduct a performance comparison on the most common benchmark for Video Transformers (i.e., action classification), finding them to outperform 3D ConvNets even with less computational complexity.
Keywords: Artificial Intelligence; Computer Vision; Self-Attention; Transformers; Video Representations
|
Javier Rodenas, Bhalaji Nagarajan, Marc Bolaños, & Petia Radeva. (2022). Learning Multi-Subset of Classes for Fine-Grained Food Recognition. In 7th International Workshop on Multimedia Assisted Dietary Management (pp. 17–26).
Abstract: Food image recognition is a complex computer vision task, because of the large number of fine-grained food classes. Fine-grained recognition tasks focus on learning subtle discriminative details to distinguish similar classes. In this paper, we introduce a new method to improve the classification of classes that are more difficult to discriminate based on Multi-Subsets learning. Using a pre-trained network, we organize classes in multiple subsets using a clustering technique. Later, we embed these subsets in a multi-head model structure. This structure has three distinguishable parts. First, we use several shared blocks to learn the generalized representation of the data. Second, we use multiple specialized blocks focusing on specific subsets that are difficult to distinguish. Lastly, we use a fully connected layer to weight the different subsets in an end-to-end manner by combining the neuron outputs. We validated our proposed method using two recent state-of-the-art vision transformers on three public food recognition datasets. Our method was successful in learning the confused classes better and we outperformed the state-of-the-art on the three datasets.
|
Javier Marin, & Sergio Escalera. (2021). SSSGAN: Satellite Style and Structure Generative Adversarial Networks. Remote Sensing, 13(19), 3984.
Abstract: This work presents Satellite Style and Structure Generative Adversarial Network (SSSGAN), a generative model of high-resolution satellite imagery to support image segmentation. Based on spatially adaptive denormalization modules (SPADE) that modulate the activations with respect to the segmentation map structure, in addition to global descriptor vectors that capture the semantic information with respect to Open Street Maps (OSM) classes, this model is able to produce consistent aerial imagery. By decoupling the generation of aerial images into a structure map and a carefully defined style vector, we were able to improve the realism and geodiversity of the synthesis with respect to the state-of-the-art baseline. Therefore, the proposed model allows us to control the generation not only with respect to the desired structure, but also with respect to a geographic area.
|
Javier Marin, David Vazquez, David Geronimo, & Antonio Lopez. (2010). Learning Appearance in Virtual Scenarios for Pedestrian Detection. In 23rd IEEE Conference on Computer Vision and Pattern Recognition (pp. 137–144).
Abstract: Detecting pedestrians in images is a key functionality to avoid vehicle-to-pedestrian collisions. The most promising detectors rely on appearance-based pedestrian classifiers trained with labelled samples. This paper addresses the following question: can a pedestrian appearance model learnt in virtual scenarios work successfully for pedestrian detection in real images? Our experiments suggest a positive answer, which is a new and relevant conclusion for research in pedestrian detection. More specifically, we record training sequences in virtual scenarios and then appearance-based pedestrian classifiers are learnt using HOG and linear SVM. We test such classifiers on a publicly available dataset provided by Daimler AG for pedestrian detection benchmarking. This dataset contains real-world images acquired from a moving car. The obtained result is compared with the one given by a classifier learnt using samples coming from real images. The comparison reveals that, although virtual samples were not specially selected, both virtual and real based training give rise to classifiers of similar performance.
Keywords: Pedestrian Detection; Domain Adaptation
|