|
Fahad Shahbaz Khan, Joost Van de Weijer, Muhammad Anwer Rao, Andrew Bagdanov, Michael Felsberg and Jorma. 2018. Scale coding bag of deep features for human attribute and action recognition. MVAP, 29(1), 55–71.
Abstract: Most approaches to human attribute and action recognition in still images are based on image representation in which multi-scale local features are pooled across scale into a single, scale-invariant encoding. Both in bag-of-words and the recently popular representations based on convolutional neural networks, local features are computed at multiple scales. However, these multi-scale convolutional features are pooled into a single scale-invariant representation. We argue that entirely scale-invariant image representations are sub-optimal and investigate approaches to scale coding within a bag of deep features framework. Our approach encodes multi-scale information explicitly during the image encoding stage. We propose two strategies to encode multi-scale information explicitly in the final image representation. We validate our two scale coding techniques on five datasets: Willow, PASCAL VOC 2010, PASCAL VOC 2012, Stanford-40 and Human Attributes (HAT-27). On all datasets, the proposed scale coding approaches outperform both the scale-invariant method and the standard deep features of the same network. Further, combining our scale coding approaches with standard deep features leads to consistent improvement over the state of the art.
Keywords: Action recognition; Attribute recognition; Bag of deep features
|
|
|
Daniel Hernandez, Antonio Espinosa, David Vazquez, Antonio Lopez and Juan C. Moure. 2021. 3D Perception With Slanted Stixels on GPU. TPDS, 32(10), 2434–2447.
Abstract: This article presents a GPU-accelerated software design of the recently proposed model of Slanted Stixels, which represents the geometric and semantic information of a scene in a compact and accurate way. We reformulate the measurement depth model to reduce the computational complexity of the algorithm, relying on the confidence of the depth estimation and the identification of invalid values to handle outliers. The proposed massively parallel scheme and data layout for the irregular computation pattern that corresponds to a Dynamic Programming paradigm is described and carefully analyzed in performance terms. Performance is shown to scale gracefully on current generation embedded GPUs. We assess the proposed methods in terms of semantic and geometric accuracy as well as run-time performance on three publicly available benchmark datasets. Our approach achieves real-time performance with high accuracy for 2048 × 1024 image sizes and 4 × 4 Stixel resolution on the low-power embedded GPU of an NVIDIA Tegra Xavier.
Keywords: Daniel Hernandez-Juarez; Antonio Espinosa; David Vazquez; Antonio M. Lopez; Juan C. Moure
|
|
|
Sudeep Katakol, Basem Elbarashy, Luis Herranz, Joost Van de Weijer and Antonio Lopez. 2021. Distributed Learning and Inference with Compressed Images. TIP, 30, 3069–3083.
Abstract: Modern computer vision requires processing large amounts of data, both while training the model and/or during inference, once the model is deployed. Scenarios where images are captured and processed in physically separated locations are increasingly common (e.g. autonomous vehicles, cloud computing). In addition, many devices suffer from limited resources to store or transmit data (e.g. storage space, channel capacity). In these scenarios, lossy image compression plays a crucial role to effectively increase the number of images collected under such constraints. However, lossy compression entails some undesired degradation of the data that may harm the performance of the downstream analysis task at hand, since important semantic information may be lost in the process. Moreover, we may only have compressed images at training time but are able to use original images at inference time, or vice versa, and in such a case, the downstream model suffers from covariate shift. In this paper, we analyze this phenomenon, with a special focus on vision-based perception for autonomous driving as a paradigmatic scenario. We see that loss of semantic information and covariate shift do indeed exist, resulting in a drop in performance that depends on the compression rate. In order to address the problem, we propose dataset restoration, based on image restoration with generative adversarial networks (GANs). Our method is agnostic to both the particular image compression method and the downstream task; and has the advantage of not adding additional cost to the deployed models, which is particularly important in resource-limited devices. The presented experiments focus on semantic segmentation as a challenging use case, cover a broad range of compression rates and diverse datasets, and show how our method is able to significantly alleviate the negative effects of compression on the downstream visual task.
|
|
|
Yi Xiao, Felipe Codevilla, Akhil Gurram, Onay Urfalioglu and Antonio Lopez. 2020. Multimodal end-to-end autonomous driving. TITS, 1–11.
Abstract: A crucial component of an autonomous vehicle (AV) is the artificial intelligence (AI) is able to drive towards a desired destination. Today, there are different paradigms addressing the development of AI drivers. On the one hand, we find modular pipelines, which divide the driving task into sub-tasks such as perception and maneuver planning and control. On the other hand, we find end-to-end driving approaches that try to learn a direct mapping from input raw sensor data to vehicle control signals. The later are relatively less studied, but are gaining popularity since they are less demanding in terms of sensor data annotation. This paper focuses on end-to-end autonomous driving. So far, most proposals relying on this paradigm assume RGB images as input sensor data. However, AVs will not be equipped only with cameras, but also with active sensors providing accurate depth information (e.g., LiDARs). Accordingly, this paper analyses whether combining RGB and depth modalities, i.e. using RGBD data, produces better end-to-end AI drivers than relying on a single modality. We consider multimodality based on early, mid and late fusion schemes, both in multisensory and single-sensor (monocular depth estimation) settings. Using the CARLA simulator and conditional imitation learning (CIL), we show how, indeed, early fusion multimodality outperforms single-modality.
|
|
|
Fei Yang, Luis Herranz, Joost Van de Weijer, Jose Antonio Iglesias, Antonio Lopez and Mikhail Mozerov. 2020. Variable Rate Deep Image Compression with Modulated Autoencoder. SPL, 27, 331–335.
Abstract: Variable rate is a requirement for flexible and adaptable image and video compression. However, deep image compression methods (DIC) are optimized for a single fixed rate-distortion (R-D) tradeoff. While this can be addressed by training multiple models for different tradeoffs, the memory requirements increase proportionally to the number of models. Scaling the bottleneck representation of a shared autoencoder can provide variable rate compression with a single shared autoencoder. However, the R-D performance using this simple mechanism degrades in low bitrates, and also shrinks the effective range of bitrates. To address these limitations, we formulate the problem of variable R-D optimization for DIC, and propose modulated autoencoders (MAEs), where the representations of a shared autoencoder are adapted to the specific R-D tradeoff via a modulation network. Jointly training this modulated autoencoder and the modulation network provides an effective way to navigate the R-D operational curve. Our experiments show that the proposed method can achieve almost the same R-D performance of independent models with significantly fewer parameters.
|
|
|
Akhil Gurram, Onay Urfalioglu, Ibrahim Halfaoui, Fahd Bouzaraa and Antonio Lopez. 2020. Semantic Monocular Depth Estimation Based on Artificial Intelligence. ITSM, 13(4), 99–103.
Abstract: Depth estimation provides essential information to perform autonomous driving and driver assistance. A promising line of work consists of introducing additional semantic information about the traffic scene when training CNNs for depth estimation. In practice, this means that the depth data used for CNN training is complemented with images having pixel-wise semantic labels where the same raw training data is associated with both types of ground truth, i.e., depth and semantic labels. The main contribution of this paper is to show that this hard constraint can be circumvented, i.e., that we can train CNNs for depth estimation by leveraging the depth and semantic information coming from heterogeneous datasets. In order to illustrate the benefits of our approach, we combine KITTI depth and Cityscapes semantic segmentation datasets, outperforming state-of-the-art results on monocular depth estimation.
|
|
|
Zhijie Fang and Antonio Lopez. 2019. Intention Recognition of Pedestrians and Cyclists by 2D Pose Estimation. TITS, 21(11), 4773–4783.
Abstract: Anticipating the intentions of vulnerable road users (VRUs) such as pedestrians and cyclists is critical for performing safe and comfortable driving maneuvers. This is the case for human driving and, thus, should be taken into account by systems providing any level of driving assistance, from advanced driver assistant systems (ADAS) to fully autonomous vehicles (AVs). In this paper, we show how the latest advances on monocular vision-based human pose estimation, i.e. those relying on deep Convolutional Neural Networks (CNNs), enable to recognize the intentions of such VRUs. In the case of cyclists, we assume that they follow traffic rules to indicate future maneuvers with arm signals. In the case of pedestrians, no indications can be assumed. Instead, we hypothesize that the walking pattern of a pedestrian allows to determine if he/she has the intention of crossing the road in the path of the ego-vehicle, so that the ego-vehicle must maneuver accordingly (e.g. slowing down or stopping). In this paper, we show how the same methodology can be used for recognizing pedestrians and cyclists' intentions. For pedestrians, we perform experiments on the JAAD dataset. For cyclists, we did not found an analogous dataset, thus, we created our own one by acquiring and annotating videos which we share with the research community. Overall, the proposed pipeline provides new state-of-the-art results on the intention recognition of VRUs.
|
|
|
Jose Luis Gomez, Gabriel Villalonga and Antonio Lopez. 2021. Co-Training for Deep Object Detection: Comparing Single-Modal and Multi-Modal Approaches. SENS, 21(9), 3185.
Abstract: Top-performing computer vision models are powered by convolutional neural networks (CNNs). Training an accurate CNN highly depends on both the raw sensor data and their associated ground truth (GT). Collecting such GT is usually done through human labeling, which is time-consuming and does not scale as we wish. This data-labeling bottleneck may be intensified due to domain shifts among image sensors, which could force per-sensor data labeling. In this paper, we focus on the use of co-training, a semi-supervised learning (SSL) method, for obtaining self-labeled object bounding boxes (BBs), i.e., the GT to train deep object detectors. In particular, we assess the goodness of multi-modal co-training by relying on two different views of an image, namely, appearance (RGB) and estimated depth (D). Moreover, we compare appearance-based single-modal co-training with multi-modal. Our results suggest that in a standard SSL setting (no domain shift, a few human-labeled data) and under virtual-to-real domain shift (many virtual-world labeled data, no human-labeled data) multi-modal co-training outperforms single-modal. In the latter case, by performing GAN-based domain translation both co-training modalities are on par, at least when using an off-the-shelf depth estimation model not specifically trained on the translated images.
Keywords: co-training; multi-modality; vision-based object detection; ADAS; self-driving
|
|
|
Xavier Soria, Angel Sappa and Riad I. Hammoud. 2018. Wide-Band Color Imagery Restoration for RGB-NIR Single Sensor Images. SENS, 18(7), 2059.
Abstract: Multi-spectral RGB-NIR sensors have become ubiquitous in recent years. These sensors allow the visible and near-infrared spectral bands of a given scene to be captured at the same time. With such cameras, the acquired imagery has a compromised RGB color representation due to near-infrared bands (700–1100 nm) cross-talking with the visible bands (400–700 nm).
This paper proposes two deep learning-based architectures to recover the full RGB color images, thus removing the NIR information from the visible bands. The proposed approaches directly restore the high-resolution RGB image by means of convolutional neural networks. They are evaluated with several outdoor images; both architectures reach a similar performance when evaluated in different
scenarios and using different similarity metrics. Both of them improve the state of the art approaches.
Keywords: RGB-NIR sensor; multispectral imaging; deep learning; CNNs
|
|
|
Javier Marin and Sergio Escalera. 2021. SSSGAN: Satellite Style and Structure Generative Adversarial Networks. Remote Sensing, 13(19), 3984.
Abstract: This work presents Satellite Style and Structure Generative Adversarial Network (SSGAN), a generative model of high resolution satellite imagery to support image segmentation. Based on spatially adaptive denormalization modules (SPADE) that modulate the activations with respect to segmentation map structure, in addition to global descriptor vectors that capture the semantic information in a vector with respect to Open Street Maps (OSM) classes, this model is able to produce
consistent aerial imagery. By decoupling the generation of aerial images into a structure map and a carefully defined style vector, we were able to improve the realism and geodiversity of the synthesis with respect to the state-of-the-art baseline. Therefore, the proposed model allows us to control the generation not only with respect to the desired structure, but also with respect to a geographic area.
|
|