|
Qingshan Chen, Zhenzhen Quan, Yifan Hu, Yujun Li, Zhi Liu, & Mikhail Mozerov. (2023). MSIF: multi-spectrum image fusion method for cross-modality person re-identification. IJMLC - International Journal of Machine Learning and Cybernetics.
Abstract: Sketch-RGB cross-modality person re-identification (ReID) is a challenging task that aims to match a sketch portrait drawn by a professional artist with a full-body photo taken by surveillance equipment, to deal with situations where the monitoring equipment at the accident scene is damaged. However, sketch portraits only provide highly abstract frontal body contour information and lack other important features such as color, pose, and behavior. The difference in saliency between the two modalities brings new challenges to cross-modality person ReID. To overcome this problem, this paper proposes a novel dual-stream model for cross-modality person ReID, which is able to mine modality-invariant features to reduce the discrepancy between sketch and camera images end-to-end. More specifically, we propose a multi-spectrum image fusion (MSIF) method, which aims to exploit the image appearance changes brought by multiple spectra and guide the network to mine modality-invariant commonalities during training. It only processes the spectrum of the input images, adding no extra computation or model complexity, and can be easily integrated into other models. Moreover, we introduce a joint structure via a generalized mean pooling (GMP) layer and a self-attention (SA) mechanism to balance background and texture information and obtain the most informative regional features in the image. To further shrink the intra-class distance, a weighted regularized triplet (WRT) loss is developed without introducing additional hyperparameters. The model was first evaluated on the PKU Sketch ReID dataset, and extensive experimental results show that the Rank-1/mAP accuracy of our method is 87.00%/91.12%, reaching the current state-of-the-art performance. To further validate the effectiveness of our approach in handling cross-modality person ReID, we conducted experiments on two commonly used IR-RGB datasets (SYSU-MM01 and RegDB). The obtained results show that our method achieves competitive performance. These results confirm the ability of our method to effectively process images from different modalities.
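Note: the generalized mean pooling layer named in this abstract can be illustrated with a minimal sketch. It assumes the standard generalized-mean formulation, (mean of x^p)^(1/p), applied to one channel of a feature map; the function name `gem_pool`, the default exponent `p=3.0`, and the clamp `eps` are illustrative assumptions and are not taken from the paper.

```python
def gem_pool(feature_map, p=3.0, eps=1e-6):
    """Generalized mean pooling over one channel's spatial activations.

    feature_map: 2-D list of non-negative activations (hypothetical input).
    p: pooling exponent; p=1 is average pooling, large p approaches max pooling.
    eps: lower clamp that keeps the power well defined for zero activations.
    """
    # Flatten the spatial grid and clamp each activation from below.
    values = [max(v, eps) for row in feature_map for v in row]
    # Average the p-th powers, then take the p-th root.
    mean_of_powers = sum(v ** p for v in values) / len(values)
    return mean_of_powers ** (1.0 / p)


# One channel with a single strong response among weak background values.
channel = [[0.1, 0.2], [0.3, 2.0]]
average_like = gem_pool(channel, p=1.0)   # behaves like average pooling
default_gem = gem_pool(channel, p=3.0)    # leans toward the strong response
max_like = gem_pool(channel, p=10.0)      # close to max pooling
```

Raising `p` moves the pooled value from the channel average toward its maximum, which is one way a single layer can trade off background against salient regions.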
|
|
|
Shida Beigpour, Christian Riess, Joost Van de Weijer, & Elli Angelopoulou. (2014). Multi-Illuminant Estimation with Conditional Random Fields. TIP - IEEE Transactions on Image Processing, 23(1), 83–95.
Abstract: Most existing color constancy algorithms assume uniform illumination. However, in real-world scenes, this is often not the case. Thus, we propose a novel framework for estimating the colors of multiple illuminants and their spatial distribution in the scene. We formulate this problem as an energy minimization task within a conditional random field over a set of local illuminant estimates. In order to quantitatively evaluate the proposed method, we created a novel data set of two-dominant-illuminant images comprising laboratory, indoor, and outdoor scenes. Unlike prior work, our database includes accurate pixel-wise ground truth illuminant information. The performance of our method is evaluated on multiple data sets. Experimental results show that our framework clearly outperforms single illuminant estimators as well as a recently proposed multi-illuminant estimation approach.
Keywords: color constancy; CRF; multi-illuminant
|
|
|
Xinhang Song, Shuqiang Jiang, & Luis Herranz. (2017). Multi-Scale Multi-Feature Context Modeling for Scene Recognition in the Semantic Manifold. TIP - IEEE Transactions on Image Processing, 26(6), 2721–2735.
Abstract: Before the big data era, scene recognition was often approached with two-step inference using localized intermediate representations (objects, topics, and so on). One such approach is the semantic manifold (SM), in which patches and images are modeled as points in a semantic probability simplex. Patch models are learned with weak supervision via image labels, which leads to the problem of scene categories co-occurring in this semantic space. Fortunately, each category has its own co-occurrence patterns that are consistent across the images in that category. Thus, discovering and modeling these patterns is critical to improving the recognition performance in this representation. Since the emergence of large data sets, such as ImageNet and Places, these approaches have been relegated in favor of the much more powerful convolutional neural networks (CNNs), which can automatically learn multi-layered representations from the data. In this paper, we address many limitations of the original SM approach and related works. We propose discriminative patch representations using neural networks and further propose a hybrid architecture in which the semantic manifold is built on top of multiscale CNNs. Both representations can be computed significantly faster than the Gaussian mixture models of the original SM. To combine multiple scales, spatial relations, and multiple features, we formulate rich context models using Markov random fields. To solve the optimization problem, we analyze global and local approaches, where a top-down hierarchical algorithm has the best performance. Experimental results show that jointly exploiting different types of contextual relations consistently improves the recognition accuracy.
|
|
|
Xiangyang Li, Luis Herranz, & Shuqiang Jiang. (2020). Multifaceted Analysis of Fine-Tuning in Deep Model for Visual Recognition. ACM - ACM Transactions on Data Science.
Abstract: In recent years, convolutional neural networks (CNNs) have achieved impressive performance in various visual recognition scenarios. CNNs trained on large labeled datasets can not only obtain strong performance on most challenging benchmarks but also provide powerful representations, which can be applied to a wide range of other tasks. However, the requirement of massive amounts of data to train deep neural networks is a major drawback of these models, as the data available is usually limited or imbalanced. Fine-tuning (FT) is an effective way to transfer knowledge learned in a source dataset to a target task. In this paper, we introduce and systematically investigate several factors that influence the performance of fine-tuning for visual recognition. These factors include parameters for the retraining procedure (e.g., the initial learning rate of fine-tuning) and the distribution of the source and target data (e.g., the number of categories in the source dataset, the distance between the source and target datasets), among others. We quantitatively and qualitatively analyze these factors, evaluate their influence, and present many empirical observations. The results reveal insights into how fine-tuning changes CNN parameters and provide useful and evidence-backed intuitions about how to implement fine-tuning for computer vision tasks.
|
|
|
Marçal Rusiñol, Volkmar Frinken, Dimosthenis Karatzas, Andrew Bagdanov, & Josep Llados. (2014). Multimodal page classification in administrative document image streams. IJDAR - International Journal on Document Analysis and Recognition, 17(4), 331–341.
Abstract: In this paper, we present a page classification application in a banking workflow. The proposed architecture represents administrative document images by merging visual and textual descriptions. The visual description is based on a hierarchical representation of the pixel intensity distribution. The textual description uses latent semantic analysis to represent document content as a mixture of topics. Several off-the-shelf classifiers and different strategies for combining visual and textual cues have been evaluated. A final step uses an n-gram model of the page stream allowing a finer-grained classification of pages. The proposed method has been tested in a real large-scale environment and we report results on a dataset of 70,000 pages.
Keywords: Digital mail room; Multimodal page classification; Visual and textual document description
|
|