|
Ruben Tito, Dimosthenis Karatzas, & Ernest Valveny. (2023). Hierarchical multimodal transformers for Multipage DocVQA. PR - Pattern Recognition, 144(109834).
Abstract: Existing work on DocVQA only considers single-page documents. However, in real applications documents are mostly composed of multiple pages that should be processed altogether. In this work, we propose a new multimodal hierarchical method Hi-VT5, that overcomes the limitations of current methods to process long multipage documents. In contrast to previous hierarchical methods that focus on different semantic granularity (He et al., 2021) or different subtasks (Zhou et al., 2022) used in image classification. Our method is a hierarchical transformer architecture where the encoder learns to summarize the most relevant information of every page and then, the decoder uses this summarized representation to generate the final answer, following a bottom-up approach. Moreover, due to the lack of multipage DocVQA datasets, we also introduce MP-DocVQA, an extension of SP-DocVQA where questions are posed over multipage documents instead of single pages. Through extensive experimentation, we demonstrate that Hi-VT5 is able, in a single stage, to answer the questions and provide the page that contains the answer, which can be used as a kind of explainability measure.
|
|
|
G. Gasbarri, Matias Bilkis, E. Roda Salichs, & J. Calsamiglia. (2024). Sequential hypothesis testing for continuously-monitored quantum systems. Quantum, 8(1289).
Abstract: We consider a quantum system that is being continuously monitored, giving rise to a measurement signal. From such a stream of data, information needs to be inferred about the underlying system's dynamics. Here we focus on hypothesis testing problems and put forward the usage of sequential strategies where the signal is analyzed in real time, allowing the experiment to be concluded as soon as the underlying hypothesis can be identified with a certified prescribed success probability. We analyze the performance of sequential tests by studying the stopping-time behavior, showing a considerable advantage over currently-used strategies based on a fixed predetermined measurement time.
|
|
|
P. Canals, Simone Balocco, O. Diaz, J. Li, A. Garcia Tornel, M. Olive Gadea, et al. (2023). A fully automatic method for vascular tortuosity feature extraction in the supra-aortic region: unraveling possibilities in stroke treatment planning. CMIG - Computerized Medical Imaging and Graphics, 104(102170).
Abstract: Vascular tortuosity of supra-aortic vessels is widely considered one of the main reasons for failure and delays in endovascular treatment of large vessel occlusion in patients with acute ischemic stroke. Characterization of tortuosity is a challenging task due to the lack of objective, robust and effective analysis tools. We present a fully automatic method for arterial segmentation, vessel labelling and tortuosity feature extraction applied to the supra-aortic region. A sample of 566 computed tomography angiography scans from acute ischemic stroke patients (aged 74.8 ± 12.9, 51.0% females) were used for training, validation and testing of a segmentation module based on a U-Net architecture (162 cases) and a vessel labelling module powered by a graph U-Net (566 cases). Successively, 30 cases were processed for testing of a tortuosity feature extraction module. Measurements obtained through automatic processing were compared to manual annotations from two observers for a thorough validation of the method. The proposed feature extraction method presented similar performance to the inter-rater variability observed in the measurement of 33 geometrical and morphological features of the arterial anatomy in the supra-aortic region. This system will contribute to the development of more complex models to advance the treatment of stroke by adding immediate automation, objectivity, repeatability and robustness to the vascular tortuosity characterization of patients.
Keywords: Artificial intelligence; Deep learning; Stroke; Thrombectomy; Vascular feature extraction; Vascular tortuosity
|
|
|
Danna Xue, Javier Vazquez, Luis Herranz, Yang Zhang, & Michael S Brown. (2023). Integrating High-Level Features for Consistent Palette-based Multi-image Recoloring. CGF - Computer Graphics Forum, .
Abstract: Achieving visually consistent colors across multiple images is important when images are used in photo albums, websites, and brochures. Unfortunately, only a handful of methods address multi-image color consistency compared to one-to-one color transfer techniques. Furthermore, existing methods do not incorporate high-level features that can assist graphic designers in their work. To address these limitations, we introduce a framework that builds upon a previous palette-based color consistency method and incorporates three high-level features: white balance, saliency, and color naming. We show how these features overcome the limitations of the prior multi-consistency workflow and showcase the user-friendly nature of our framework.
|
|
|
Qingshan Chen, Zhenzhen Quan, Yujun Li, Chao Zhai, & Mikhail Mozerov. (2023). An Unsupervised Domain Adaption Approach for Cross-Modality RGB-Infrared Person Re-Identification. IEEE-SENS - IEEE Sensors Journal, 23(24).
Abstract: Dual-camera systems commonly employed in surveillance serve as the foundation for RGB-infrared (IR) cross-modality person re-identification (ReID). However, significant modality differences give rise to inferior performance compared to single-modality scenarios. Furthermore, most existing studies in this area rely on supervised training with meticulously labeled datasets. Labeling RGB-IR image pairs is more complex than labeling conventional image data, and deploying pretrained models on unlabeled datasets can lead to catastrophic performance degradation. In contrast to previous solutions that focus solely on cross-modality or domain adaptation issues, this article presents an end-to-end unsupervised domain adaptation (UDA) framework for the cross-modality person ReID, which can simultaneously address both of these challenges. This model employs source domain classes, target domain clusters, and unclustered instance samples for the training, maximizing the comprehensive use of the dataset. Moreover, it addresses the problem of mismatched clustering labels between the two modalities in the target domain by incorporating a label matching module that reassigns reliable clusters with labels, ensuring correspondence between different modality labels. We construct the loss function by incorporating distinctiveness loss and multiplicity loss, both of which are determined by the similarity of neighboring features in the predicted feature space and the difference between distant features. This approach enables efficient feature clustering and cluster class assignment to occur concurrently. Eight UDA cross-modality person ReID experiments are conducted on three real datasets and six synthetic datasets. The experimental results unequivocally demonstrate that the proposed model outperforms the existing state-of-the-art algorithms to a significant degree. Notably, in RegDB → RegDB_light, the Rank-1 accuracy exhibits a remarkable improvement of 8.24%.
Keywords: Q. Chen, Z. Quan, Y. Li, C. Zhai and M. G. Mozerov
|
|
|
Qingshan Chen, Zhenzhen Quan, Yifan Hu, Yujun Li, Zhi Liu, & Mikhail Mozerov. (2023). MSIF: multi-spectrum image fusion method for cross-modality person re-identification. IJMLC - International Journal of Machine Learning and Cybernetics, .
Abstract: Sketch-RGB cross-modality person re-identification (ReID) is a challenging task that aims to match a sketch portrait drawn by a professional artist with a full-body photo taken by surveillance equipment to deal with situations where the monitoring equipment is damaged at the accident scene. However, sketch portraits only provide highly abstract frontal body contour information and lack other important features such as color, pose, behavior, etc. The difference in saliency between the two modalities brings new challenges to cross-modality person ReID. To overcome this problem, this paper proposes a novel dual-stream model for cross-modality person ReID, which is able to mine modality-invariant features to reduce the discrepancy between sketch and camera images end-to-end. More specifically, we propose a multi-spectrum image fusion (MSIF) method, which aims to exploit the image appearance changes brought by multiple spectrums and guide the network to mine modality-invariant commonalities during training. It only processes the spectrum of the input images without adding additional calculations and model complexity, which can be easily integrated into other models. Moreover, we introduce a joint structure via a generalized mean pooling (GMP) layer and a self-attention (SA) mechanism to balance background and texture information and obtain the regional features with a large amount of information in the image. To further shrink the intra-class distance, a weighted regularized triplet (WRT) loss is developed without introducing additional hyperparameters. The model was first evaluated on the PKU Sketch ReID dataset, and extensive experimental results show that the Rank-1/mAP accuracy of our method is 87.00%/91.12%, reaching the current state-of-the-art performance. To further validate the effectiveness of our approach in handling cross-modality person ReID, we conducted experiments on two commonly used IR-RGB datasets (SYSU-MM01 and RegDB). The obtained results show that our method achieves competitive performance. These results confirm the ability of our method to effectively process images from different modalities.
|
|
|
Olivier Penacchio, Xavier Otazu, Arnold J Wilkings, & Sara M. Haigh. (2023). A mechanistic account of visual discomfort. FN - Frontiers in Neuroscience, 17.
Abstract: Much of the neural machinery of the early visual cortex, from the extraction of local orientations to contextual modulations through lateral interactions, is thought to have developed to provide a sparse encoding of contour in natural scenes, allowing the brain to process efficiently most of the visual scenes we are exposed to. Certain visual stimuli, however, cause visual stress, a set of adverse effects ranging from simple discomfort to migraine attacks, and epileptic seizures in the extreme, all phenomena linked with an excessive metabolic demand. The theory of efficient coding suggests a link between excessive metabolic demand and images that deviate from natural statistics. Yet, the mechanisms linking energy demand and image spatial content in discomfort remain elusive. Here, we used theories of visual coding that link image spatial structure and brain activation to characterize the response to images observers reported as uncomfortable in a biologically based neurodynamic model of the early visual cortex that included excitatory and inhibitory layers to implement contextual influences. We found three clear markers of aversive images: a larger overall activation in the model, a less sparse response, and a more unbalanced distribution of activity across spatial orientations. When the ratio of excitation over inhibition was increased in the model, a phenomenon hypothesised to underlie interindividual differences in susceptibility to visual discomfort, the three markers of discomfort progressively shifted toward values typical of the response to uncomfortable stimuli. Overall, these findings propose a unifying mechanistic explanation for why there are differences between images and between observers, suggesting how visual input and idiosyncratic hyperexcitability give rise to abnormal brain responses that result in visual stress.
|
|
|
Armin Mehri, Parichehr Behjati, Dario Carpio, & Angel Sappa. (2023). SRFormer: Efficient Yet Powerful Transformer Network for Single Image Super Resolution. ACCESS - IEEE Access, 11.
Abstract: Recent breakthroughs in single image super resolution have investigated the potential of deep Convolutional Neural Networks (CNNs) to improve performance. However, CNNs based models suffer from their limited fields and their inability to adapt to the input content. Recently, Transformer based models were presented, which demonstrated major performance gains in Natural Language Processing and Vision tasks while mitigating the drawbacks of CNNs. Nevertheless, Transformer computational complexity can increase quadratically for high-resolution images, and the fact that it ignores the original structures of the image by converting them to the 1D structure can make it problematic to capture the local context information and adapt it for real-time applications. In this paper, we present, SRFormer, an efficient yet powerful Transformer-based architecture, by making several key designs in the building of Transformer blocks and Transformer layers that allow us to consider the original structure of the image (i.e., 2D structure) while capturing both local and global dependencies without raising computational demands or memory consumption. We also present a Gated Multi-Layer Perceptron (MLP) Feature Fusion module to aggregate the features of different stages of Transformer blocks by focusing on inter-spatial relationships while adding minor computational costs to the network. We have conducted extensive experiments on several super-resolution benchmark datasets to evaluate our approach. SRFormer demonstrates superior performance compared to state-of-the-art methods from both Transformer and Convolutional networks, with an improvement margin of 0.1∼0.53dB . Furthermore, while SRFormer has almost the same model size, it outperforms SwinIR by 0.47% and inference time by half the time of SwinIR. The code will be available on GitHub.
|
|
|
Ayan Banerjee, Sanket Biswas, Josep Llados, & Umapada Pal. (2024). SemiDocSeg: Harnessing Semi-Supervised Learning for Document Layout Analysis. IJDAR - International Journal on Document Analysis and Recognition, .
Abstract: Document Layout Analysis (DLA) is the process of automatically identifying and categorizing the structural components (e.g. Text, Figure, Table, etc.) within a document to extract meaningful content and establish the page's layout structure. It is a crucial stage in document parsing, contributing to their comprehension. However, traditional DLA approaches often demand a significant volume of labeled training data, and the labor-intensive task of generating high-quality annotated training data poses a substantial challenge. In order to address this challenge, we proposed a semi-supervised setting that aims to perform learning on limited annotated categories by eliminating exhaustive and expensive mask annotations. The proposed setting is expected to be generalizable to novel categories as it learns the underlying positional information through a support set and class information through Co-Occurrence that can be generalized from annotated categories to novel categories. Here, we first extract features from the input image and support set with a shared multi-scale feature acquisition backbone. Then, the extracted feature representation is fed to the transformer encoder as a query. Later on, we utilize a semantic embedding network before the decoder to capture the underlying semantic relationships and similarities between different instances, enabling the model to make accurate predictions or classifications with only a limited amount of labeled data. Extensive experimentation on competitive benchmarks like PRIMA, DocLayNet, and Historical Japanese (HJ) demonstrate that this generalized setup obtains significant performance compared to the conventional supervised approach.
Keywords: Document layout analysis; Semi-supervised learning; Co-Occurrence matrix; Instance segmentation; Swin transformer
|
|
|
Zahra Raisi-Estabragh, Carlos Martin-Isla, Louise Nissen, Liliana Szabo, Victor M. Campello, Sergio Escalera, et al. (2023). Radiomics analysis enhances the diagnostic performance of CMR stress perfusion: a proof-of-concept study using the Dan-NICAD dataset. FCM - Frontiers in Cardiovascular Medicine, .
|
|
|
Adrien Pavao, Isabelle Guyon, Anne-Catherine Letournel, Dinh-Tuan Tran, Xavier Baro, Hugo Jair Escalante, et al. (2023). CodaLab Competitions: An Open Source Platform to Organize Scientific Challenges. JMLR - Journal of Machine Learning Research, .
Abstract: CodaLab Competitions is an open source web platform designed to help data scientists and research teams to crowd-source the resolution of machine learning problems through the organization of competitions, also called challenges or contests. CodaLab Competitions provides useful features such as multiple phases, results and code submissions, multi-score leaderboards, and jobs running
inside Docker containers. The platform is very flexible and can handle large scale experiments, by allowing organizers to upload large datasets and provide their own CPU or GPU compute workers.
|
|
|
Anders Skaarup Johansen, Kamal Nasrollahi, Sergio Escalera, & Thomas B. Moeslund. (2023). Who Cares about the Weather? Inferring Weather Conditions for Weather-Aware Object Detection in Thermal Images. AS - Applied Sciences, 13(18).
Abstract: Deployments of real-world object detection systems often experience a degradation in performance over time due to concept drift. Systems that leverage thermal cameras are especially susceptible because the respective thermal signatures of objects and their surroundings are highly sensitive to environmental changes. In this study, two types of weather-aware latent conditioning methods are investigated. The proposed method aims to guide two object detectors, (YOLOv5 and Deformable DETR) to become weather-aware. This is achieved by leveraging an auxiliary branch that predicts weather-related information while conditioning intermediate layers of the object detector. While the conditioning methods proposed do not directly improve the accuracy of baseline detectors, it can be observed that conditioned networks manage to extract a weather-related signal from the thermal images, thus resulting in a decreased miss rate at the cost of increased false positives. The extracted signal appears noisy and is thus challenging to regress accurately. This is most likely a result of the qualitative nature of the thermal sensor; thus, further work is needed to identify an ideal method for optimizing the conditioning branch, as well as to further improve the accuracy of the system.
Keywords: thermal; object detection; concept drift; conditioning; weather recognition
|
|
|
Henry Velesaca, Gisel Bastidas-Guacho, Mohammad Rouhani, & Angel Sappa. (2024). Multimodal image registration techniques: a comprehensive survey. MTAP - Multimedia Tools and Applications, .
Abstract: This manuscript presents a review of state-of-the-art techniques proposed in the literature for multimodal image registration, addressing instances where images from different modalities need to be precisely aligned in the same reference system. This scenario arises when the images to be registered come from different modalities, among the visible and thermal spectral bands, 3D-RGB, or flash-no flash, or NIR-visible. The review spans different techniques from classical approaches to more modern ones based on deep learning, aiming to highlight the particularities required at each step in the registration pipeline when dealing with multimodal images. It is noteworthy that medical images are excluded from this review due to their specific characteristics, including the use of both active and passive sensors or the non-rigid nature of the body contained in the image.
|
|
|
Trevor Canham, Javier Vazquez, D Long, Richard F. Murray, & Michael S Brown. (2021). Noise Prism: A Novel Multispectral Visualization Technique. 31st Color and Imaging Conference, .
Abstract: A novel technique for visualizing multispectral images is proposed. Inspired by how prisms work, our method spreads spectral information over a chromatic noise pattern. This is accomplished by populating the pattern with pixels representing each measurement band at a count proportional to its measured intensity. The method is advantageous because it allows for lightweight encoding and visualization of spectral information
while maintaining the color appearance of the stimulus. A four alternative forced choice (4AFC) experiment was conducted to validate the method’s information-carrying capacity in displaying metameric stimuli of varying colors and spectral basis functions. The scores ranged from 100% to 20% (less than chance given the 4AFC task), with many conditions falling somewhere in between at statistically significant intervals. Using this data, color and texture difference metrics can be evaluated and optimized to predict the legibility of the visualization technique.
|
|
|
Mingyi Yang, Fei Yang, Luka Murn, Marc Gorriz Blanch, Juil Sock, Shuai Wan, et al. (2024). Task-Switchable Pre-Processor for Image Compression for Multiple Machine Vision Tasks. IEEE Transactions on Circuits and Systems for Video Technology, .
Abstract: Visual content is increasingly being processed by machines for various automated content analysis tasks instead of being consumed by humans. Despite the existence of several compression methods tailored for machine tasks, few consider real-world scenarios with multiple tasks. In this paper, we aim to address this gap by proposing a task-switchable pre-processor that optimizes input images specifically for machine consumption prior to encoding by an off-the-shelf codec designed for human consumption. The proposed task-switchable pre-processor adeptly maintains relevant semantic information based on the specific characteristics of different downstream tasks, while effectively suppressing irrelevant information to reduce bitrate. To enhance the processing of semantic information for diverse tasks, we leverage pre-extracted semantic features to modulate the pixel-to-pixel mapping within the pre-processor. By switching between different modulations, multiple tasks can be seamlessly incorporated into the system. Extensive experiments demonstrate the practicality and simplicity of our approach. It significantly reduces the number of parameters required for handling multiple tasks while still delivering impressive performance. Our method showcases the potential to achieve efficient and effective compression for machine vision tasks, supporting the evolving demands of real-world applications.
Keywords: M Yang, F Yang, L Murn, MG Blanch, J Sock, S Wan, F Yang, L Herranz
|
|