|   | 
Details
   web
Records
Author W.Win; B.Bao; Q.Xu; Luis Herranz; Shuqiang Jiang
Title Editorial Note: Efficient Multimedia Processing Methods and Applications Type Miscellaneous
Year 2019 Publication Multimedia Tools and Applications Abbreviated Journal MTAP
Volume 78 Issue 1 Pages
Keywords
Abstract
Address
Corporate Author Thesis
Publisher Place of Publication Editor
Language Summary Language Original Title
Series Editor Series Title Abbreviated Series Title
Series Volume Series Issue Edition
ISSN ISBN Medium
Area Expedition Conference
Notes (down) LAMP; 600.141; 600.120 Approved no
Call Number Admin @ si @ WBX2019 Serial 3257
Permanent link to this record
 

 
Author Lichao Zhang
Title Towards end-to-end Networks for Visual Tracking in RGB and TIR Videos Type Book Whole
Year 2019 Publication PhD Thesis, Universitat Autonoma de Barcelona-CVC Abbreviated Journal
Volume Issue Pages
Keywords
Abstract In the current work, we identify several problems of current tracking systems. The lack of large-scale labeled datasets hampers the usage of deep learning, especially end-to-end training, for tracking in TIR images. Therefore, many methods for tracking on TIR data are still based on hand-crafted features. This situation also happens in multi-modal tracking, e.g. RGB-T tracking. Another reason, which hampers the development of RGB-T tracking, is that there exists little research on the fusion mechanisms for combining information from RGB and TIR modalities. One of the crucial components of most trackers is the update module. For the currently existing end-to-end tracking architecture, e.g, Siamese trackers, the online model update is still not taken into consideration at the training stage. They use no-update or a linear update strategy during the inference stage. While such a hand-crafted approach to updating has led to improved results, its simplicity limits the potential gain likely to be obtained by learning to update.

To address the data-scarcity for TIR and RGB-T tracking, we use image-to-image translation to generate a large-scale synthetic TIR dataset. This dataset allows us to perform end-to-end training for TIR tracking. Furthermore, we investigate several fusion mechanisms for RGB-T tracking. The multi-modal trackers are also trained in an end-to-end manner on the synthetic data. To improve the standard online update, we pose the updating step as an optimization problem which can be solved by training a neural network. Our approach thereby reduces the hand-crafted components in the tracking pipeline and sets a further step in the direction of a complete end-to-end trained tracking network which also considers updating during optimization.
Address November 2019
Corporate Author Thesis Ph.D. thesis
Publisher Ediciones Graficas Rey Place of Publication Editor Joost Van de Weijer;Abel Gonzalez;Fahad Shahbaz Khan
Language Summary Language Original Title
Series Editor Series Title Abbreviated Series Title
Series Volume Series Issue Edition
ISSN ISBN 978-84-1210011-1-9 Medium
Area Expedition Conference
Notes (down) LAMP; 600.141; 600.120 Approved no
Call Number Admin @ si @ Zha2019 Serial 3393
Permanent link to this record
 

 
Author Yaxing Wang
Title Transferring and Learning Representations for Image Generation and Translation Type Book Whole
Year 2020 Publication PhD Thesis, Universitat Autonoma de Barcelona-CVC Abbreviated Journal
Volume Issue Pages
Keywords
Abstract Image generation is arguably one of the most attractive, compelling, and challenging tasks in computer vision. Among the methods which perform image generation, generative adversarial networks (GANs) play a key role. The most common image generation models based on GANs can be divided into two main approaches. The first one, called simply image generation takes random noise as an input and synthesizes an image which follows the same distribution as the images in the training set. The second class, which is called image-to-image translation, aims to map an image from a source domain to one that is indistinguishable from those in the target domain. Image-to-image translation methods can further be divided into paired and unpaired image-to-image translation based on whether they require paired data or not. In this thesis, we aim to address some challenges of both image generation and image-to-image generation.GANs highly rely upon having access to vast quantities of data, and fail to generate realistic images from random noise when applied to domains with few images. To address this problem, we aim to transfer knowledge from a model trained on a large dataset (source domain) to the one learned on limited data (target domain). We find that both GANs andconditional GANs can benefit from models trained on large datasets. Our experiments show that transferring the discriminator is more important than the generator. Using both the generator and discriminator results in the best performance. We found, however, that this method suffers from overfitting, since we update all parameters to adapt to the target data. We propose a novel architecture, which is tailored to address knowledge transfer to very small target domains. Our approach effectively exploreswhich part of the latent space is more related to the target domain. Additionally, the proposed method is able to transfer knowledge from multiple pretrained GANs. Although image-to-image translation has achieved outstanding performance, it still facesseveral problems. First, for translation between complex domains (such as translations between different modalities) image-to-image translation methods require paired data. We show that when only some of the pairwise translations have been seen (i.e. during training), we can infer the remaining unseen translations (where training pairs are not available). We propose a new approach where we align multiple encoders and decoders in such a way that the desired translation can be obtained by simply cascadingthe source encoder and the target decoder, even when they have not interacted during the training stage (i.e. unseen). Second, we address the issue of bias in image-to-image translation. Biased datasets unavoidably contain undesired changes, which are dueto the fact that the target dataset has a particular underlying visual distribution. We use carefully designed semantic constraints to reduce the effects of the bias. The semantic constraint aims to enforce the preservation of desired image properties. Finally, current approaches fail to generate diverse outputs or perform scalable image transfer in a single model. To alleviate this problem, we propose a scalable and diverse image-to-image translation. We employ random noise to control the diversity. The scalabitlity is determined by conditioning the domain label.computer vision, deep learning, imitation learning, adversarial generative networks, image generation, image-to-image translation.
Address January 2020
Corporate Author Thesis Ph.D. thesis
Publisher Ediciones Graficas Rey Place of Publication Editor Joost Van de Weijer;Abel Gonzalez;Luis Herranz
Language Summary Language Original Title
Series Editor Series Title Abbreviated Series Title
Series Volume Series Issue Edition
ISSN ISBN 978-84-121011-5-7 Medium
Area Expedition Conference
Notes (down) LAMP; 600.141; 600.120 Approved no
Call Number Admin @ si @ Wan2020 Serial 3397
Permanent link to this record
 

 
Author Xiangyang Li; Luis Herranz; Shuqiang Jiang
Title Multifaceted Analysis of Fine-Tuning in Deep Model for Visual Recognition Type Journal
Year 2020 Publication ACM Transactions on Data Science Abbreviated Journal ACM
Volume Issue Pages
Keywords
Abstract In recent years, convolutional neural networks (CNNs) have achieved impressive performance for various visual recognition scenarios. CNNs trained on large labeled datasets can not only obtain significant performance on most challenging benchmarks but also provide powerful representations, which can be used to a wide range of other tasks. However, the requirement of massive amounts of data to train deep neural networks is a major drawback of these models, as the data available is usually limited or imbalanced. Fine-tuning (FT) is an effective way to transfer knowledge learned in a source dataset to a target task. In this paper, we introduce and systematically investigate several factors that influence the performance of fine-tuning for visual recognition. These factors include parameters for the retraining procedure (e.g., the initial learning rate of fine-tuning), the distribution of the source and target data (e.g., the number of categories in the source dataset, the distance between the source and target datasets) and so on. We quantitatively and qualitatively analyze these factors, evaluate their influence, and present many empirical observations. The results reveal insights into what fine-tuning changes CNN parameters and provide useful and evidence-backed intuitions about how to implement fine-tuning for computer vision tasks.
Address
Corporate Author Thesis
Publisher Place of Publication Editor
Language Summary Language Original Title
Series Editor Series Title Abbreviated Series Title
Series Volume Series Issue Edition
ISSN ISBN Medium
Area Expedition Conference
Notes (down) LAMP; 600.141; 600.120 Approved no
Call Number Admin @ si @ LHJ2020 Serial 3423
Permanent link to this record
 

 
Author Xinhang Song; Haitao Zeng; Sixian Zhang; Luis Herranz; Shuqiang Jiang
Title Generalized Zero-shot Learning with Multi-source Semantic Embeddings for Scene Recognition Type Conference Article
Year 2020 Publication 28th ACM International Conference on Multimedia Abbreviated Journal
Volume Issue Pages
Keywords
Abstract Recognizing visual categories from semantic descriptions is a promising way to extend the capability of a visual classifier beyond the concepts represented in the training data (i.e. seen categories). This problem is addressed by (generalized) zero-shot learning methods (GZSL), which leverage semantic descriptions that connect them to seen categories (e.g. label embedding, attributes). Conventional GZSL are designed mostly for object recognition. In this paper we focus on zero-shot scene recognition, a more challenging setting with hundreds of categories where their differences can be subtle and often localized in certain objects or regions. Conventional GZSL representations are not rich enough to capture these local discriminative differences. Addressing these limitations, we propose a feature generation framework with two novel components: 1) multiple sources of semantic information (i.e. attributes, word embeddings and descriptions), 2) region descriptions that can enhance scene discrimination. To generate synthetic visual features we propose a two-step generative approach, where local descriptions are sampled and used as conditions to generate visual features. The generated features are then aggregated and used together with real features to train a joint classifier. In order to evaluate the proposed method, we introduce a new dataset for zero-shot scene recognition with multi-semantic annotations. Experimental results on the proposed dataset and SUN Attribute dataset illustrate the effectiveness of the proposed method.
Address Virtual; October 2020
Corporate Author Thesis
Publisher Place of Publication Editor
Language Summary Language Original Title
Series Editor Series Title Abbreviated Series Title
Series Volume Series Issue Edition
ISSN ISBN Medium
Area Expedition Conference ACM
Notes (down) LAMP; 600.141; 600.120 Approved no
Call Number Admin @ si @ SZZ2020 Serial 3465
Permanent link to this record
 

 
Author Kai Wang; Luis Herranz; Anjan Dutta; Joost Van de Weijer
Title Bookworm continual learning: beyond zero-shot learning and continual learning Type Conference Article
Year 2020 Publication Workshop TASK-CV 2020 Abbreviated Journal
Volume Issue Pages
Keywords
Abstract We propose bookworm continual learning(BCL), a flexible setting where unseen classes can be inferred via a semantic model, and the visual model can be updated continually. Thus BCL generalizes both continual learning (CL) and zero-shot learning (ZSL). We also propose the bidirectional imagination (BImag) framework to address BCL where features of both past and future classes are generated. We observe that conditioning the feature generator on attributes can actually harm the continual learning ability, and propose two variants (joint class-attribute conditioning and asymmetric generation) to alleviate this problem.
Address Virtual; August 2020
Corporate Author Thesis
Publisher Place of Publication Editor
Language Summary Language Original Title
Series Editor Series Title Abbreviated Series Title
Series Volume Series Issue Edition
ISSN ISBN Medium
Area Expedition Conference ECCVW
Notes (down) LAMP; 600.141; 600.120 Approved no
Call Number Admin @ si @ WHD2020 Serial 3466
Permanent link to this record
 

 
Author Yaxing Wang; Abel Gonzalez-Garcia; Luis Herranz; Joost Van de Weijer
Title Controlling biases and diversity in diverse image-to-image translation Type Journal Article
Year 2021 Publication Computer Vision and Image Understanding Abbreviated Journal CVIU
Volume 202 Issue Pages 103082
Keywords
Abstract JCR 2019 Q2, IF=3.121
The task of unpaired image-to-image translation is highly challenging due to the lack of explicit cross-domain pairs of instances. We consider here diverse image translation (DIT), an even more challenging setting in which an image can have multiple plausible translations. This is normally achieved by explicitly disentangling content and style in the latent representation and sampling different styles codes while maintaining the image content. Despite the success of current DIT models, they are prone to suffer from bias. In this paper, we study the problem of bias in image-to-image translation. Biased datasets may add undesired changes (e.g. change gender or race in face images) to the output translations as a consequence of the particular underlying visual distribution in the target domain. In order to alleviate the effects of this problem we propose the use of semantic constraints that enforce the preservation of desired image properties. Our proposed model is a step towards unbiased diverse image-to-image translation (UDIT), and results in less unwanted changes in the translated images while still performing the wanted transformation. Experiments on several heavily biased datasets show the effectiveness of the proposed techniques in different domains such as faces, objects, and scenes.
Address
Corporate Author Thesis
Publisher Place of Publication Editor
Language Summary Language Original Title
Series Editor Series Title Abbreviated Series Title
Series Volume Series Issue Edition
ISSN ISBN Medium
Area Expedition Conference
Notes (down) LAMP; 600.141; 600.109; 600.147 Approved no
Call Number Admin @ si @ WGH2021 Serial 3464
Permanent link to this record
 

 
Author Shiqi Yang; Yaxing Wang; Joost Van de Weijer; Luis Herranz; Shangling Jui
Title Generalized Source-free Domain Adaptation Type Conference Article
Year 2021 Publication 19th IEEE International Conference on Computer Vision Abbreviated Journal
Volume Issue Pages 8958-8967
Keywords
Abstract Domain adaptation (DA) aims to transfer the knowledge learned from a source domain to an unlabeled target domain. Some recent works tackle source-free domain adaptation (SFDA) where only a source pre-trained model is available for adaptation to the target domain. However, those methods do not consider keeping source performance which is of high practical value in real world applications. In this paper, we propose a new domain adaptation paradigm called Generalized Source-free Domain Adaptation (G-SFDA), where the learned model needs to perform well on both the target and source domains, with only access to current unlabeled target data during adaptation. First, we propose local structure clustering (LSC), aiming to cluster the target features with its semantically similar neighbors, which successfully adapts the model to the target domain in the absence of source data. Second, we propose sparse domain attention (SDA), it produces a binary domain specific attention to activate different feature channels for different domains, meanwhile the domain attention will be utilized to regularize the gradient during adaptation to keep source information. In the experiments, for target performance our method is on par with or better than existing DA and SFDA methods, specifically it achieves state-of-the-art performance (85.4%) on VisDA, and our method works well for all domains after adapting to single or multiple target domains.
Address Virtual; October 2021
Corporate Author Thesis
Publisher Place of Publication Editor
Language Summary Language Original Title
Series Editor Series Title Abbreviated Series Title
Series Volume Series Issue Edition
ISSN ISBN Medium
Area Expedition Conference
Notes (down) LAMP; 600.120; 600.147 Approved no
Call Number Admin @ si @ YWW2021 Serial 3605
Permanent link to this record
 

 
Author Sudeep Katakol; Luis Herranz; Fei Yang; Marta Mrak
Title DANICE: Domain adaptation without forgetting in neural image compression Type Conference Article
Year 2021 Publication Conference on Computer Vision and Pattern Recognition Workshops Abbreviated Journal
Volume Issue Pages 1921-1925
Keywords
Abstract Neural image compression (NIC) is a new coding paradigm where coding capabilities are captured by deep models learned from data. This data-driven nature enables new potential functionalities. In this paper, we study the adaptability of codecs to custom domains of interest. We show that NIC codecs are transferable and that they can be adapted with relatively few target domain images. However, naive adaptation interferes with the solution optimized for the original source domain, resulting in forgetting the original coding capabilities in that domain, and may even break the compatibility with previously encoded bitstreams. Addressing these problems, we propose Codec Adaptation without Forgetting (CAwF), a framework that can avoid these problems by adding a small amount of custom parameters, where the source codec remains embedded and unchanged during the adaptation process. Experiments demonstrate its effectiveness and provide useful insights on the characteristics of catastrophic interference in NIC.
Address Virtual; June 2021
Corporate Author Thesis
Publisher Place of Publication Editor
Language Summary Language Original Title
Series Editor Series Title Abbreviated Series Title
Series Volume Series Issue Edition
ISSN ISBN Medium
Area Expedition Conference CVPRW
Notes (down) LAMP; 600.120; 600.141; 601.379 Approved no
Call Number Admin @ si @ KHY2021 Serial 3568
Permanent link to this record
 

 
Author Kai Wang; Luis Herranz; Joost Van de Weijer
Title Continual learning in cross-modal retrieval Type Conference Article
Year 2021 Publication 2nd CLVISION workshop Abbreviated Journal
Volume Issue Pages 3628-3638
Keywords
Abstract Multimodal representations and continual learning are two areas closely related to human intelligence. The former considers the learning of shared representation spaces where information from different modalities can be compared and integrated (we focus on cross-modal retrieval between language and visual representations). The latter studies how to prevent forgetting a previously learned task when learning a new one. While humans excel in these two aspects, deep neural networks are still quite limited. In this paper, we propose a combination of both problems into a continual cross-modal retrieval setting, where we study how the catastrophic interference caused by new tasks impacts the embedding spaces and their cross-modal alignment required for effective retrieval. We propose a general framework that decouples the training, indexing and querying stages. We also identify and study different factors that may lead to forgetting, and propose tools to alleviate it. We found that the indexing stage pays an important role and that simply avoiding reindexing the database with updated embedding networks can lead to significant gains. We evaluated our methods in two image-text retrieval datasets, obtaining significant gains with respect to the fine tuning baseline.
Address Virtual; June 2021
Corporate Author Thesis
Publisher Place of Publication Editor
Language Summary Language Original Title
Series Editor Series Title Abbreviated Series Title
Series Volume Series Issue Edition
ISSN ISBN Medium
Area Expedition Conference CVPRW
Notes (down) LAMP; 600.120; 600.141; 600.147; 601.379 Approved no
Call Number Admin @ si @ WHW2021 Serial 3566
Permanent link to this record
 

 
Author Aymen Azaza; Joost Van de Weijer; Ali Douik; Javad Zolfaghari Bengar; Marc Masana
Title Saliency from High-Level Semantic Image Features Type Journal
Year 2020 Publication SN Computer Science Abbreviated Journal SN
Volume 1 Issue 4 Pages 1-12
Keywords
Abstract Top-down semantic information is known to play an important role in assigning saliency. Recently, large strides have been made in improving state-of-the-art semantic image understanding in the fields of object detection and semantic segmentation. Therefore, since these methods have now reached a high-level of maturity, evaluation of the impact of high-level image understanding on saliency estimation is now feasible. We propose several saliency features which are computed from object detection and semantic segmentation results. We combine these features with a standard baseline method for saliency detection to evaluate their importance. Experiments demonstrate that the proposed features derived from object detection and semantic segmentation improve saliency estimation significantly. Moreover, they show that our method obtains state-of-the-art results on (FT, ImgSal, and SOD datasets) and obtains competitive results on four other datasets (ECSSD, PASCAL-S, MSRA-B, and HKU-IS).
Address
Corporate Author Thesis
Publisher Place of Publication Editor
Language Summary Language Original Title
Series Editor Series Title Abbreviated Series Title
Series Volume Series Issue Edition
ISSN ISBN Medium
Area Expedition Conference
Notes (down) LAMP; 600.120; 600.109; 600.106 Approved no
Call Number Admin @ si @ AWD2020 Serial 3503
Permanent link to this record
 

 
Author Aymen Azaza
Title Context, Motion and Semantic Information for Computational Saliency Type Book Whole
Year 2018 Publication PhD Thesis, Universitat Autonoma de Barcelona-CVC Abbreviated Journal
Volume Issue Pages
Keywords
Abstract The main objective of this thesis is to highlight the salient object in an image or in a video sequence. We address three important—but in our opinion
insufficiently investigated—aspects of saliency detection. Firstly, we start
by extending previous research on saliency which explicitly models the information provided from the context. Then, we show the importance of
explicit context modelling for saliency estimation. Several important works
in saliency are based on the usage of object proposals. However, these methods
focus on the saliency of the object proposal itself and ignore the context.
To introduce context in such saliency approaches, we couple every object
proposal with its direct context. This allows us to evaluate the importance
of the immediate surround (context) for its saliency. We propose several
saliency features which are computed from the context proposals including
features based on omni-directional and horizontal context continuity. Secondly,
we investigate the usage of top-downmethods (high-level semantic
information) for the task of saliency prediction since most computational
methods are bottom-up or only include few semantic classes. We propose
to consider a wider group of object classes. These objects represent important
semantic information which we will exploit in our saliency prediction
approach. Thirdly, we develop a method to detect video saliency by computing
saliency from supervoxels and optical flow. In addition, we apply the
context features developed in this thesis for video saliency detection. The
method combines shape and motion features with our proposed context
features. To summarize, we prove that extending object proposals with their
direct context improves the task of saliency detection in both image and
video data. Also the importance of the semantic information in saliency
estimation is evaluated. Finally, we propose a newmotion feature to detect
saliency in video data. The three proposed novelties are evaluated on standard
saliency benchmark datasets and are shown to improve with respect to
state-of-the-art.
Address October 2018
Corporate Author Thesis Ph.D. thesis
Publisher Ediciones Graficas Rey Place of Publication Editor Joost Van de Weijer;Ali Douik
Language Summary Language Original Title
Series Editor Series Title Abbreviated Series Title
Series Volume Series Issue Edition
ISSN ISBN 978-84-945373-9-4 Medium
Area Expedition Conference
Notes (down) LAMP; 600.120 Approved no
Call Number Admin @ si @ Aza2018 Serial 3218
Permanent link to this record
 

 
Author Xinhang Song; Shuqiang Jiang; Luis Herranz
Title Multi-Scale Multi-Feature Context Modeling for Scene Recognition in the Semantic Manifold Type Journal Article
Year 2017 Publication IEEE Transactions on Image Processing Abbreviated Journal TIP
Volume 26 Issue 6 Pages 2721-2735
Keywords
Abstract Before the big data era, scene recognition was often approached with two-step inference using localized intermediate representations (objects, topics, and so on). One of such approaches is the semantic manifold (SM), in which patches and images are modeled as points in a semantic probability simplex. Patch models are learned resorting to weak supervision via image labels, which leads to the problem of scene categories co-occurring in this semantic space. Fortunately, each category has its own co-occurrence patterns that are consistent across the images in that category. Thus, discovering and modeling these patterns are critical to improve the recognition performance in this representation. Since the emergence of large data sets, such as ImageNet and Places, these approaches have been relegated in favor of the much more powerful convolutional neural networks (CNNs), which can automatically learn multi-layered representations from the data. In this paper, we address many limitations of the original SM approach and related works. We propose discriminative patch representations using neural networks and further propose a hybrid architecture in which the semantic manifold is built on top of multiscale CNNs. Both representations can be computed significantly faster than the Gaussian mixture models of the original SM. To combine multiple scales, spatial relations, and multiple features, we formulate rich context models using Markov random fields. To solve the optimization problem, we analyze global and local approaches, where a top-down hierarchical algorithm has the best performance. Experimental results show that exploiting different types of contextual relations jointly consistently improves the recognition accuracy.
Address
Corporate Author Thesis
Publisher Place of Publication Editor
Language Summary Language Original Title
Series Editor Series Title Abbreviated Series Title
Series Volume Series Issue Edition
ISSN ISBN Medium
Area Expedition Conference
Notes (down) LAMP; 600.120 Approved no
Call Number Admin @ si @ SJH2017a Serial 2963
Permanent link to this record
 

 
Author Weiqing Min; Shuqiang Jiang; Jitao Sang; Huayang Wang; Xinda Liu; Luis Herranz
Title Being a Supercook: Joint Food Attributes and Multimodal Content Modeling for Recipe Retrieval and Exploration Type Journal Article
Year 2017 Publication IEEE Transactions on Multimedia Abbreviated Journal TMM
Volume 19 Issue 5 Pages 1100 - 1113
Keywords
Abstract This paper considers the problem of recipe-oriented image-ingredient correlation learning with multi-attributes for recipe retrieval and exploration. Existing methods mainly focus on food visual information for recognition while we model visual information, textual content (e.g., ingredients), and attributes (e.g., cuisine and course) together to solve extended recipe-oriented problems, such as multimodal cuisine classification and attribute-enhanced food image retrieval. As a solution, we propose a multimodal multitask deep belief network (M3TDBN) to learn joint image-ingredient representation regularized by different attributes. By grouping ingredients into visible ingredients (which are visible in the food image, e.g., “chicken” and “mushroom”) and nonvisible ingredients (e.g., “salt” and “oil”), M3TDBN is capable of learning both midlevel visual representation between images and visible ingredients and nonvisual representation. Furthermore, in order to utilize different attributes to improve the intermodality correlation, M3TDBN incorporates multitask learning to make different attributes collaborate each other. Based on the proposed M3TDBN, we exploit the derived deep features and the discovered correlations for three extended novel applications: 1) multimodal cuisine classification; 2) attribute-augmented cross-modal recipe image retrieval; and 3) ingredient and attribute inference from food images. The proposed approach is evaluated on the constructed Yummly dataset and the evaluation results have validated the effectiveness of the proposed approach.
Address
Corporate Author Thesis
Publisher Place of Publication Editor
Language Summary Language Original Title
Series Editor Series Title Abbreviated Series Title
Series Volume Series Issue Edition
ISSN ISBN Medium
Area Expedition Conference
Notes (down) LAMP; 600.120 Approved no
Call Number Admin @ si @ MJS2017 Serial 2964
Permanent link to this record
 

 
Author Luis Herranz; Shuqiang Jiang; Ruihan Xu
Title Modeling Restaurant Context for Food Recognition Type Journal Article
Year 2017 Publication IEEE Transactions on Multimedia Abbreviated Journal TMM
Volume 19 Issue 2 Pages 430 - 440
Keywords
Abstract Food photos are widely used in food logs for diet monitoring and in social networks to share social and gastronomic experiences. A large number of these images are taken in restaurants. Dish recognition in general is very challenging, due to different cuisines, cooking styles, and the intrinsic difficulty of modeling food from its visual appearance. However, contextual knowledge can be crucial to improve recognition in such scenario. In particular, geocontext has been widely exploited for outdoor landmark recognition. Similarly, we exploit knowledge about menus and location of restaurants and test images. We first adapt a framework based on discarding unlikely categories located far from the test image. Then, we reformulate the problem using a probabilistic model connecting dishes, restaurants, and locations. We apply that model in three different tasks: dish recognition, restaurant recognition, and location refinement. Experiments on six datasets show that by integrating multiple evidences (visual, location, and external knowledge) our system can boost the performance in all tasks.
Address
Corporate Author Thesis
Publisher Place of Publication Editor
Language Summary Language Original Title
Series Editor Series Title Abbreviated Series Title
Series Volume Series Issue Edition
ISSN ISBN Medium
Area Expedition Conference
Notes (down) LAMP; 600.120 Approved no
Call Number Admin @ si @ HJX2017 Serial 2965
Permanent link to this record