Fernando Vilariño, Debora Gil, & Petia Radeva. (2004). A Novel FLDA Formulation for Numerical Stability Analysis. In P. R. and I. A. J. Vitrià (Ed.), Recent Advances in Artificial Intelligence Research and Development (Vol. 113, pp. 77–84). IOS Press.
Abstract: Fisher Linear Discriminant Analysis (FLDA) is one of the most popular techniques used in classification applying dimensional reduction. The numerical scheme involves the inversion of the within-class scatter matrix, which makes FLDA potentially ill-conditioned when it becomes singular. In this paper we present a novel explicit formulation of FLDA in terms of the eccentricity ratio and eigenvector orientations of the within-class scatter matrix. An analysis of this function will characterize those situations where FLDA response is not reliable because of numerical instability. This can solve common situations of poor classification performance in computer vision.
Keywords: Supervised Learning; Linear Discriminant Analysis; Numerical Stability; Computer Vision
|
David Geronimo, Angel Sappa, Daniel Ponsa, & Antonio Lopez. (2010). 2D-3D based on-board pedestrian detection system. CVIU - Computer Vision and Image Understanding, 114(5), 583–595.
Abstract: During the next decade, on-board pedestrian detection systems will play a key role in the challenge of increasing traffic safety. The main target of these systems, to detect pedestrians in urban scenarios, implies overcoming difficulties like processing outdoor scenes from a mobile platform and searching for aspect-changing objects in cluttered environments. This makes such systems combine techniques in the state-of-the-art Computer Vision. In this paper we present a three module system based on both 2D and 3D cues. The first module uses 3D information to estimate the road plane parameters and thus select a coherent set of regions of interest (ROIs) to be further analyzed. The second module uses Real AdaBoost and a combined set of Haar wavelets and edge orientation histograms to classify the incoming ROIs as pedestrian or non-pedestrian. The final module loops again with the 3D cue in order to verify the classified ROIs and with the 2D in order to refine the final results. According to the results, the integration of the proposed techniques gives rise to a promising system.
Keywords: Pedestrian detection; Advanced Driver Assistance Systems; Horizon line; Haar wavelets; Edge orientation histograms
|
Miquel Ferrer, Dimosthenis Karatzas, Ernest Valveny, I. Bardaji, & Horst Bunke. (2011). A Generic Framework for Median Graph Computation based on a Recursive Embedding Approach. CVIU - Computer Vision and Image Understanding, 115(7), 919–928.
Abstract: The median graph has been shown to be a good choice to obtain a represen- tative of a set of graphs. However, its computation is a complex problem. Recently, graph embedding into vector spaces has been proposed to obtain approximations of the median graph. The problem with such an approach is how to go from a point in the vector space back to a graph in the graph space. The main contribution of this paper is the generalization of this previ- ous method, proposing a generic recursive procedure that permits to recover the graph corresponding to a point in the vector space, introducing only the amount of approximation inherent to the use of graph matching algorithms. In order to evaluate the proposed method, we compare it with the set me- dian and with the other state-of-the-art embedding-based methods for the median graph computation. The experiments are carried out using four dif- ferent databases (one semi-artificial and three containing real-world data). Results show that with the proposed approach we can obtain better medi- ans, in terms of the sum of distances to the training graphs, than with the previous existing methods.
Keywords: Median Graph, Graph Embedding, Graph Matching, Structural Pattern Recognition
|
Bhaskar Chakraborty, Michael Holte, Thomas B. Moeslund, & Jordi Gonzalez. (2012). Selective Spatio-Temporal Interest Points. CVIU - Computer Vision and Image Understanding, 116(3), 396–410.
Abstract: Recent progress in the field of human action recognition points towards the use of Spatio-TemporalInterestPoints (STIPs) for local descriptor-based recognition strategies. In this paper, we present a novel approach for robust and selective STIP detection, by applying surround suppression combined with local and temporal constraints. This new method is significantly different from existing STIP detection techniques and improves the performance by detecting more repeatable, stable and distinctive STIPs for human actors, while suppressing unwanted background STIPs. For action representation we use a bag-of-video words (BoV) model of local N-jet features to build a vocabulary of visual-words. To this end, we introduce a novel vocabulary building strategy by combining spatial pyramid and vocabulary compression techniques, resulting in improved performance and efficiency. Action class specific Support Vector Machine (SVM) classifiers are trained for categorization of human actions. A comprehensive set of experiments on popular benchmark datasets (KTH and Weizmann), more challenging datasets of complex scenes with background clutter and camera motion (CVC and CMU), movie and YouTube video clips (Hollywood 2 and YouTube), and complex scenes with multiple actors (MSR I and Multi-KTH), validates our approach and show state-of-the-art performance. Due to the unavailability of ground truth action annotation data for the Multi-KTH dataset, we introduce an actor specific spatio-temporal clustering of STIPs to address the problem of automatic action annotation of multiple simultaneous actors. Additionally, we perform cross-data action recognition by training on source datasets (KTH and Weizmann) and testing on completely different and more challenging target datasets (CVC, CMU, MSR I and Multi-KTH). This documents the robustness of our proposed approach in the realistic scenario, using separate training and test datasets, which in general has been a shortcoming in the performance evaluation of human action recognition techniques.
|
Susana Alvarez, Anna Salvatella, Maria Vanrell, & Xavier Otazu. (2012). Low-dimensional and Comprehensive Color Texture Description. CVIU - Computer Vision and Image Understanding, 116(I), 54–67.
Abstract: Image retrieval can be dealt by combining standard descriptors, such as those of MPEG-7, which are defined independently for each visual cue (e.g. SCD or CLD for Color, HTD for texture or EHD for edges).
A common problem is to combine similarities coming from descriptors representing different concepts in different spaces. In this paper we propose a color texture description that bypasses this problem from its inherent definition. It is based on a low dimensional space with 6 perceptual axes. Texture is described in a 3D space derived from a direct implementation of the original Julesz’s Texton theory and color is described in a 3D perceptual space. This early fusion through the blob concept in these two bounded spaces avoids the problem and allows us to derive a sparse color-texture descriptor that achieves similar performance compared to MPEG-7 in image retrieval. Moreover, our descriptor presents comprehensive qualities since it can also be applied either in segmentation or browsing: (a) a dense image representation is defined from the descriptor showing a reasonable performance in locating texture patterns included in complex images; and (b) a vocabulary of basic terms is derived to build an intermediate level descriptor in natural language improving browsing by bridging semantic gap
|
Jordi Gonzalez, Thomas B. Moeslund, & Liang Wang. (2012). Semantic Understanding of Human Behaviors in Image Sequences: From video-surveillance to video-hermeneutics. CVIU - Computer Vision and Image Understanding, 116(3), 305–306.
Abstract: Purpose: Atheromatic plaque progression is affected, among others phenomena, by biomechanical, biochemical, and physiological factors. In this paper, the authors introduce a novel framework able to provide both morphological (vessel radius, plaque thickness, and type) and biomechanical (wall shear stress and Von Mises stress) indices of coronary arteries.Methods: First, the approach reconstructs the three-dimensional morphology of the vessel from intravascular ultrasound (IVUS) and Angiographic sequences, requiring minimal user interaction. Then, a computational pipeline allows to automatically assess fluid-dynamic and mechanical indices. Ten coronary arteries are analyzed illustrating the capabilities of the tool and confirming previous technical and clinical observations.Results: The relations between the arterial indices obtained by IVUS measurement and simulations have been quantitatively analyzed along the whole surface of the artery, extending the analysis of the coronary arteries shown in previous state of the art studies. Additionally, for the first time in the literature, the framework allows the computation of the membrane stresses using a simplified mechanical model of the arterial wall.Conclusions: Circumferentially (within a given frame), statistical analysis shows an inverse relation between the wall shear stress and the plaque thickness. At the global level (comparing a frame within the entire vessel), it is observed that heavy plaque accumulations are in general calcified and are located in the areas of the vessel having high wall shear stress. Finally, in their experiments the inverse proportionality between fluid and structural stresses is observed.
|
Bhaskar Chakraborty, Jordi Gonzalez, & Xavier Roca. (2013). Large scale continuous visual event recognition using max-margin Hough transformation framework. CVIU - Computer Vision and Image Understanding, 117(10), 1356–1368.
Abstract: In this paper we propose a novel method for continuous visual event recognition (CVER) on a large scale video dataset using max-margin Hough transformation framework. Due to high scalability, diverse real environmental state and wide scene variability direct application of action recognition/detection methods such as spatio-temporal interest point (STIP)-local feature based technique, on the whole dataset is practically infeasible. To address this problem, we apply a motion region extraction technique which is based on motion segmentation and region clustering to identify possible candidate “event of interest” as a preprocessing step. On these candidate regions a STIP detector is applied and local motion features are computed. For activity representation we use generalized Hough transform framework where each feature point casts a weighted vote for possible activity class centre. A max-margin frame work is applied to learn the feature codebook weight. For activity detection, peaks in the Hough voting space are taken into account and initial event hypothesis is generated using the spatio-temporal information of the participating STIPs. For event recognition a verification Support Vector Machine is used. An extensive evaluation on benchmark large scale video surveillance dataset (VIRAT) and as well on a small scale benchmark dataset (MSR) shows that the proposed method is applicable on a wide range of continuous visual event recognition applications having extremely challenging conditions.
|
Tadashi Araki, Nobutaka Ikeda, Nilanjan Dey, Sayan Chakraborty, Luca Saba, Dinesh Kumar, et al. (2015). A comparative approach of four different image registration techniques for quantitative assessment of coronary artery calcium lesions using intravascular ultrasound. CMPB - Computer Methods and Programs in Biomedicine, 118(2), 158–172.
|
Antonio Hernandez, Sergio Escalera, & Stan Sclaroff. (2016). Poselet-basedContextual Rescoring for Human Pose Estimation via Pictorial Structures. IJCV - International Journal of Computer Vision, 118(1), 49–64.
Abstract: In this paper we propose a contextual rescoring method for predicting the position of body parts in a human pose estimation framework. A set of poselets is incorporated in the model, and their detections are used to extract spatial and score-related features relative to other body part hypotheses. A method is proposed for the automatic discovery of a compact subset of poselets that covers the different poses in a set of validation images while maximizing precision. A rescoring mechanism is defined as a set-based boosting classifier that computes a new score for each body joint detection, given its relationship to detections of other body joints and mid-level parts in the image. This new score is incorporated in the pictorial structure model as an additional unary potential, following the recent work of Pishchulin et al. Experiments on two benchmarks show comparable results to Pishchulin et al. while reducing the size of the mid-level representation by an order of magnitude, reducing the execution time by 68 % accordingly.
Keywords: Contextual rescoring; Poselets; Human pose estimation
|
Cristina Palmero, Albert Clapes, Chris Bahnsen, Andreas Møgelmose, Thomas B. Moeslund, & Sergio Escalera. (2016). Multi-modal RGB-Depth-Thermal Human Body Segmentation. IJCV - International Journal of Computer Vision, 118(2), 217–239.
Abstract: This work addresses the problem of human body segmentation from multi-modal visual cues as a first stage of automatic human behavior analysis. We propose a novel RGB–depth–thermal dataset along with a multi-modal segmentation baseline. The several modalities are registered using a calibration device and a registration algorithm. Our baseline extracts regions of interest using background subtraction, defines a partitioning of the foreground regions into cells, computes a set of image features on those cells using different state-of-the-art feature extractions, and models the distribution of the descriptors per cell using probabilistic models. A supervised learning algorithm then fuses the output likelihoods over cells in a stacked feature vector representation. The baseline, using Gaussian mixture models for the probabilistic modeling and Random Forest for the stacked learning, is superior to other state-of-the-art methods, obtaining an overlap above 75 % on the novel dataset when compared to the manually annotated ground-truth of human segmentations.
Keywords: Human body segmentation; RGB ; Depth Thermal
|
Hannes Mueller, Andre Groeger, Jonathan Hersh, Andrea Matranga, & Joan Serrat. (2021). Monitoring war destruction from space using machine learning. PNAS - Proceedings of the National Academy of Sciences of the United States of America, 118(23), e2025400118.
Abstract: Existing data on building destruction in conflict zones rely on eyewitness reports or manual detection, which makes it generally scarce, incomplete, and potentially biased. This lack of reliable data imposes severe limitations for media reporting, humanitarian relief efforts, human-rights monitoring, reconstruction initiatives, and academic studies of violent conflict. This article introduces an automated method of measuring destruction in high-resolution satellite images using deep-learning techniques combined with label augmentation and spatial and temporal smoothing, which exploit the underlying spatial and temporal structure of destruction. As a proof of concept, we apply this method to the Syrian civil war and reconstruct the evolution of damage in major cities across the country. Our approach allows generating destruction data with unprecedented scope, resolution, and frequency—and makes use of the ever-higher frequency at which satellite imagery becomes available.
|
Jiaolong Xu, Sebastian Ramos, David Vazquez, & Antonio Lopez. (2016). Hierarchical Adaptive Structural SVM for Domain Adaptation. IJCV - International Journal of Computer Vision, 119(2), 159–178.
Abstract: A key topic in classification is the accuracy loss produced when the data distribution in the training (source) domain differs from that in the testing (target) domain. This is being recognized as a very relevant problem for many
computer vision tasks such as image classification, object detection, and object category recognition. In this paper, we present a novel domain adaptation method that leverages multiple target domains (or sub-domains) in a hierarchical adaptation tree. The core idea is to exploit the commonalities and differences of the jointly considered target domains.
Given the relevance of structural SVM (SSVM) classifiers, we apply our idea to the adaptive SSVM (A-SSVM), which only requires the target domain samples together with the existing source-domain classifier for performing the desired adaptation. Altogether, we term our proposal as hierarchical A-SSVM (HA-SSVM).
As proof of concept we use HA-SSVM for pedestrian detection, object category recognition and face recognition. In the former we apply HA-SSVM to the deformable partbased model (DPM) while in the rest HA-SSVM is applied to multi-category classifiers. We will show how HA-SSVM is effective in increasing the detection/recognition accuracy with respect to adaptation strategies that ignore the structure of the target data. Since, the sub-domains of the target data are not always known a priori, we shown how HA-SSVM can incorporate sub-domain discovery for object category recognition.
Keywords: Domain Adaptation; Pedestrian Detection
|
Dena Bazazian, Raul Gomez, Anguelos Nicolaou, Lluis Gomez, Dimosthenis Karatzas, & Andrew Bagdanov. (2019). Fast: Facilitated and accurate scene text proposals through fcn guided pruning. PRL - Pattern Recognition Letters, 119, 112–120.
Abstract: Class-specific text proposal algorithms can efficiently reduce the search space for possible text object locations in an image. In this paper we combine the Text Proposals algorithm with Fully Convolutional Networks to efficiently reduce the number of proposals while maintaining the same recall level and thus gaining a significant speed up. Our experiments demonstrate that such text proposal approaches yield significantly higher recall rates than state-of-the-art text localization techniques, while also producing better-quality localizations. Our results on the ICDAR 2015 Robust Reading Competition (Challenge 4) and the COCO-text datasets show that, when combined with strong word classifiers, this recall margin leads to state-of-the-art results in end-to-end scene text recognition.
|
Pau Riba, Andreas Fischer, Josep Llados, & Alicia Fornes. (2021). Learning graph edit distance by graph neural networks. PR - Pattern Recognition, 120, 108132.
Abstract: The emergence of geometric deep learning as a novel framework to deal with graph-based representations has faded away traditional approaches in favor of completely new methodologies. In this paper, we propose a new framework able to combine the advances on deep metric learning with traditional approximations of the graph edit distance. Hence, we propose an efficient graph distance based on the novel field of geometric deep learning. Our method employs a message passing neural network to capture the graph structure, and thus, leveraging this information for its use on a distance computation. The performance of the proposed graph distance is validated on two different scenarios. On the one hand, in a graph retrieval of handwritten words i.e. keyword spotting, showing its superior performance when compared with (approximate) graph edit distance benchmarks. On the other hand, demonstrating competitive results for graph similarity learning when compared with the current state-of-the-art on a recent benchmark dataset.
|
Xavier Perez Sala, Fernando De la Torre, Laura Igual, Sergio Escalera, & Cecilio Angulo. (2017). Subspace Procrustes Analysis. IJCV - International Journal of Computer Vision, 121(3), 327–343.
Abstract: Procrustes Analysis (PA) has been a popular technique to align and build 2-D statistical models of shapes. Given a set of 2-D shapes PA is applied to remove rigid transformations. Then, a non-rigid 2-D model is computed by modeling (e.g., PCA) the residual. Although PA has been widely used, it has several limitations for modeling 2-D shapes: occluded landmarks and missing data can result in local minima solutions, and there is no guarantee that the 2-D shapes provide a uniform sampling of the 3-D space of rotations for the object. To address previous issues, this paper proposes Subspace PA (SPA). Given several
instances of a 3-D object, SPA computes the mean and a 2-D subspace that can simultaneously model all rigid and non-rigid deformations of the 3-D object. We propose a discrete (DSPA) and continuous (CSPA) formulation for SPA, assuming that 3-D samples of an object are provided. DSPA extends the traditional PA, and produces unbiased 2-D models by uniformly sampling different views of the 3-D object. CSPA provides a continuous approach to uniformly sample the space of 3-D rotations, being more efficient in space and time. Experiments using SPA to learn 2-D models of bodies from motion capture data illustrate the benefits of our approach.
|