Hugo Bertiche, Meysam Madadi, & Sergio Escalera. (2022). Neural Cloth Simulation. ACMTGraph - ACM Transactions on Graphics, 41(6), 1–14.
Abstract: We present a general framework for the garment animation problem through unsupervised deep learning inspired by physically based simulation. Existing trends in the literature already explore this possibility. Nonetheless, these approaches do not handle cloth dynamics. Here, we propose the first methodology able to learn realistic cloth dynamics in an unsupervised manner, and hence a general formulation for neural cloth simulation. The key to achieving this is to adapt an existing optimization scheme for motion from simulation-based methodologies to deep learning. Then, analyzing the nature of the problem, we devise an architecture able to automatically disentangle static and dynamic cloth subspaces by design. We show how this improves model performance. Additionally, this opens the possibility of a novel motion augmentation technique that greatly improves generalization. Finally, we show that it also allows controlling the level of motion in the predictions. This is a useful, never-before-seen tool for artists. We provide a detailed analysis of the problem to establish the bases of neural cloth simulation and guide future research into the specifics of this domain.
ACM Transactions on Graphics, Volume 41, Issue 6, December 2022, Article No.: 220, pp 1–
|
Joakim Bruslund Haurum, Meysam Madadi, Sergio Escalera, & Thomas B. Moeslund. (2022). Multi-scale hybrid vision transformer and Sinkhorn tokenizer for sewer defect classification. AC - Automation in Construction, 144, 104614.
Abstract: A crucial part of image classification consists of capturing non-local spatial semantics of image content. This paper describes the multi-scale hybrid vision transformer (MSHViT), an extension of the classical convolutional neural network (CNN) backbone, for multi-label sewer defect classification. To better model spatial semantics in the images, features are aggregated at different scales non-locally through the use of a lightweight vision transformer, and a smaller set of tokens is produced through a novel Sinkhorn clustering-based tokenizer using distinct cluster centers. The proposed MSHViT and Sinkhorn tokenizer were evaluated on the Sewer-ML multi-label sewer defect classification dataset, showing consistent performance improvements of up to 2.53 percentage points.
Keywords: Sewer Defect Classification; Vision Transformers; Sinkhorn-Knopp; Convolutional Neural Networks; Closed-Circuit Television; Sewer Inspection
|
Swathikiran Sudhakaran, Sergio Escalera, & Oswald Lanz. (2023). Gate-Shift-Fuse for Video Action Recognition. TPAMI - IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(9), 10913–10928.
Abstract: Convolutional Neural Networks are the de facto models for image recognition. However, 3D CNNs, the straightforward extension of 2D CNNs for video recognition, have not achieved the same success on standard action recognition benchmarks. One of the main reasons for this reduced performance of 3D CNNs is the increased computational complexity, which requires large-scale annotated datasets to train them at scale. 3D kernel factorization approaches have been proposed to reduce the complexity of 3D CNNs. Existing kernel factorization approaches follow hand-designed and hard-wired techniques. In this paper we propose Gate-Shift-Fuse (GSF), a novel spatio-temporal feature extraction module which controls interactions in spatio-temporal decomposition and learns to adaptively route features through time and combine them in a data-dependent manner. GSF leverages grouped spatial gating to decompose the input tensor and channel weighting to fuse the decomposed tensors. GSF can be inserted into existing 2D CNNs to convert them into efficient and high-performing spatio-temporal feature extractors, with negligible parameter and compute overhead. We perform an extensive analysis of GSF using two popular 2D CNN families and achieve state-of-the-art or competitive performance on five standard action recognition benchmarks.
Keywords: Action Recognition; Video Classification; Spatial Gating; Channel Fusion
|
Guillermo Torres, Debora Gil, Antoni Rosell, S. Mena, & Carles Sanchez. (2023). Virtual Radiomics Biopsy for the Histological Diagnosis of Pulmonary Nodules – Intermediate Results of the RadioLung Project. IJCARS - International Journal of Computer Assisted Radiology and Surgery.
|
Kunal Biswas, Palaiahnakote Shivakumara, Umapada Pal, Tong Lu, Michel Blumenstein, & Josep Llados. (2023). Classification of aesthetic natural scene images using statistical and semantic features. MTAP - Multimedia Tools and Applications, 82(9), 13507–13532.
Abstract: Aesthetic image analysis is essential for improving the performance of multimedia image retrieval systems, especially from a repository of social media and multimedia content stored on mobile devices. This paper presents a novel method for classifying aesthetic natural scene images by studying the naturalness of image content using statistical features, and reading text in the images using semantic features. Unlike existing methods that focus only on image quality with human information, the proposed approach focuses on image features as well as text-based semantic features without human intervention to reduce the gap between subjectivity and objectivity in the classification. The aesthetic classes considered in this work are (i) Very Pleasant, (ii) Pleasant, (iii) Normal and (iv) Unpleasant. The naturalness is represented by features of focus, defocus, perceived brightness, perceived contrast, blurriness and noisiness, while semantics are represented by text recognition, description of the images and labels of images, profile pictures, and banner images. Furthermore, a deep learning model is proposed in a novel way to fuse statistical and semantic features for the classification of aesthetic natural scene images. Experiments on our own dataset and the standard datasets demonstrate that the proposed approach achieves 92.74%, 88.67% and 83.22% average classification rates on our own dataset, AVA dataset and CUHKPQ dataset, respectively. Furthermore, a comparative study of the proposed model with the existing methods shows that the proposed method is effective for the classification of aesthetic social media images.
|
Javier Selva, Anders S. Johansen, Sergio Escalera, Kamal Nasrollahi, Thomas B. Moeslund, & Albert Clapes. (2023). Video transformers: A survey. TPAMI - IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(11), 12922–12943.
Abstract: Transformer models have shown great success handling long-range interactions, making them a promising tool for modeling video. However, they lack inductive biases and scale quadratically with input length. These limitations are further exacerbated when dealing with the high dimensionality introduced by the temporal dimension. While there are surveys analyzing the advances of Transformers for vision, none focus on an in-depth analysis of video-specific designs. In this survey, we analyze the main contributions and trends of works leveraging Transformers to model video. Specifically, we delve into how videos are handled at the input level first. Then, we study the architectural changes made to deal with video more efficiently, reduce redundancy, re-introduce useful inductive biases, and capture long-term temporal dynamics. In addition, we provide an overview of different training regimes and explore effective self-supervised learning strategies for video. Finally, we conduct a performance comparison on the most common benchmark for Video Transformers (i.e., action classification), finding them to outperform 3D ConvNets even with less computational complexity.
Keywords: Artificial Intelligence; Computer Vision; Self-Attention; Transformers; Video Representations
|
Ruben Tito, Dimosthenis Karatzas, & Ernest Valveny. (2023). Hierarchical multimodal transformers for Multi-Page DocVQA. PR - Pattern Recognition, 144, 109834.
Abstract: Document Visual Question Answering (DocVQA) refers to the task of answering questions from document images. Existing work on DocVQA only considers single-page documents. However, in real scenarios documents are mostly composed of multiple pages that should be processed altogether. In this work we extend DocVQA to the multi-page scenario. For that, we first create a new dataset, MP-DocVQA, where questions are posed over multi-page documents instead of single pages. Second, we propose a new hierarchical method, Hi-VT5, based on the T5 architecture, that overcomes the limitations of current methods to process long multi-page documents. The proposed method is based on a hierarchical transformer architecture where the encoder summarizes the most relevant information of every page and then, the decoder takes this summarized information to generate the final answer. Through extensive experimentation, we demonstrate that our method is able, in a single stage, to answer the questions and provide the page that contains the relevant information to find the answer, which can be used as a kind of explainability measure.
|
Souhail Bakkali, Zuheng Ming, Mickael Coustaty, Marçal Rusiñol, & Oriol Ramos Terrades. (2023). VLCDoC: Vision-Language Contrastive Pre-Training Model for Cross-Modal Document Classification. PR - Pattern Recognition, 139, 109419.
Abstract: Multimodal learning from document data has achieved great success lately as it allows pre-training semantically meaningful features as a prior into a learnable downstream approach. In this paper, we approach the document classification problem by learning cross-modal representations through language and vision cues, considering intra- and inter-modality relationships. Instead of merging features from different modalities into a common representation space, the proposed method exploits high-level interactions and learns relevant semantic information from effective attention flows within and across modalities. The proposed learning objective is devised between intra- and inter-modality alignment tasks, where the similarity distribution per task is computed by contracting positive sample pairs while simultaneously contrasting negative ones in the common feature representation space. Extensive experiments on public document classification datasets demonstrate the effectiveness and the generalization capacity of our model on both low-scale and large-scale datasets.
|
Juan Borrego-Carazo, Carles Sanchez, David Castells, Jordi Carrabina, & Debora Gil. (2022). A benchmark for the evaluation of computational methods for bronchoscopic navigation. IJCARS - International Journal of Computer Assisted Radiology and Surgery, 17(1).
|
Antoni Rosell, Sonia Baeza, S. Garcia-Reina, JL. Mate, Ignasi Guasch, I. Nogueira, et al. (2022). EP01.05-001 Radiomics to Increase the Effectiveness of Lung Cancer Screening Programs. Radiolung Preliminary Results. JTO - Journal of Thoracic Oncology, 17(9), S182.
|
Antoni Rosell, Sonia Baeza, S. Garcia-Reina, JL. Mate, Ignasi Guasch, I. Nogueira, et al. (2022). Radiomics to increase the effectiveness of lung cancer screening programs. Radiolung preliminary results. ERJ - European Respiratory Journal, 60(66).
|
Ruben Tito, Dimosthenis Karatzas, & Ernest Valveny. (2023). Hierarchical multimodal transformers for Multipage DocVQA. PR - Pattern Recognition, 144, 109834.
Abstract: Existing work on DocVQA only considers single-page documents. However, in real applications documents are mostly composed of multiple pages that should be processed altogether. In this work, we propose a new multimodal hierarchical method, Hi-VT5, that overcomes the limitations of current methods to process long multipage documents. In contrast to previous hierarchical methods that focus on different semantic granularity (He et al., 2021) or different subtasks (Zhou et al., 2022) used in image classification, our method is a hierarchical transformer architecture where the encoder learns to summarize the most relevant information of every page and then the decoder uses this summarized representation to generate the final answer, following a bottom-up approach. Moreover, due to the lack of multipage DocVQA datasets, we also introduce MP-DocVQA, an extension of SP-DocVQA where questions are posed over multipage documents instead of single pages. Through extensive experimentation, we demonstrate that Hi-VT5 is able, in a single stage, to answer the questions and provide the page that contains the answer, which can be used as a kind of explainability measure.
|
Bhalaji Nagarajan, Marc Bolaños, Eduardo Aguilar, & Petia Radeva. (2023). Deep ensemble-based hard sample mining for food recognition. JVCIR - Journal of Visual Communication and Image Representation, 95, 103905.
Abstract: Deep neural networks represent a compelling technique to tackle complex real-world problems, but are over-parameterized and often suffer from over- or under-confident estimates. Deep ensembles have shown better parameter estimations and often provide reliable uncertainty estimates that contribute to the robustness of the results. In this work, we propose a new metric to identify samples that are hard to classify. Our metric is defined as coincidence score for deep ensembles which measures the agreement of its individual models. The main hypothesis we rely on is that deep learning algorithms learn the low-loss samples better compared to large-loss samples. In order to compensate for this, we use controlled over-sampling on the identified “hard” samples using proper data augmentation schemes to enable the models to learn those samples better. We validate the proposed metric using two public food datasets on different backbone architectures and show the improvements compared to the conventional deep neural network training using different performance metrics.
|
Cristhian A. Aguilera-Carrasco, Luis Felipe Gonzalez-Böhme, Francisco Valdes, Francisco Javier Quitral Zapata, & Bogdan Raducanu. (2023). A Hand-Drawn Language for Human–Robot Collaboration in Wood Stereotomy. ACCESS - IEEE Access, 11, 100975–100985.
Abstract: This study introduces a novel, hand-drawn language designed to foster human-robot collaboration in wood stereotomy, central to carpentry and joinery professions. Based on skilled carpenters’ line and symbol etchings on timber, this language signifies the location, geometry of woodworking joints, and timber placement within a framework. A proof-of-concept prototype has been developed, integrating object detectors, keypoint regression, and traditional computer vision techniques to interpret this language and enable an extensive repertoire of actions. Empirical data attests to the language’s efficacy, with the successful identification of a specific set of symbols on various wood species’ sawn surfaces, achieving a mean average precision (mAP) exceeding 90%. Concurrently, the system can accurately pinpoint critical positions that facilitate robotic comprehension of carpenter-indicated woodworking joint geometry. The positioning error, approximately 3 pixels, meets industry standards.
|
G. Gasbarri, Matias Bilkis, E. Roda Salichs, & J. Calsamiglia. (2024). Sequential hypothesis testing for continuously-monitored quantum systems. Quantum, 8, 1289.
Abstract: We consider a quantum system that is being continuously monitored, giving rise to a measurement signal. From such a stream of data, information needs to be inferred about the underlying system's dynamics. Here we focus on hypothesis testing problems and put forward the usage of sequential strategies where the signal is analyzed in real time, allowing the experiment to be concluded as soon as the underlying hypothesis can be identified with a certified prescribed success probability. We analyze the performance of sequential tests by studying the stopping-time behavior, showing a considerable advantage over currently-used strategies based on a fixed predetermined measurement time.
|