PT Unknown
AU Lluis Gomez; Yash Patel; Marçal Rusiñol; Dimosthenis Karatzas; C.V. Jawahar
TI Self-supervised learning of visual features through embedding images into text topic spaces
BT 30th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017)
PY 2017
DI 10.1109/CVPR.2017.218
AB End-to-end training from scratch of current deep architectures for new computer vision problems would require ImageNet-scale datasets, and this is not always possible. In this paper we present a method that is able to take advantage of freely available multi-modal content to train computer vision algorithms without human supervision. We put forward the idea of performing self-supervised learning of visual features by mining a large-scale corpus of multi-modal (text and image) documents. We show that discriminative visual features can be learnt efficiently by training a CNN to predict the semantic context in which a particular image is most likely to appear as an illustration. For this we leverage the hidden semantic structures discovered in the text corpus with a well-known topic modeling technique. Our experiments demonstrate state-of-the-art performance in image classification, object detection, and multi-modal retrieval compared to recent self-supervised or naturally supervised approaches.
ER
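
To make the pipeline sketched in the abstract concrete, the following is a minimal illustration of the core idea: fit a topic model (LDA) on the text side of a multi-modal corpus, then train a CNN, from scratch and without human labels, to predict the topic distribution of the document each image illustrates. The library choices (gensim, PyTorch/torchvision), the toy corpus, NUM_TOPICS, and the soft cross-entropy loss are assumptions made for illustration, not the authors' exact configuration.

# Sketch of the abstract's method: LDA topics on text serve as soft
# self-supervised targets for a CNN trained on the paired images.
# gensim/torchvision choices, NUM_TOPICS, and the loss are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision.models as models
from gensim import corpora
from gensim.models import LdaModel

NUM_TOPICS = 10  # assumed; the paper evaluates several topic-space sizes

# 1. Topic modeling on the text side (toy corpus for illustration).
texts = [
    ["football", "match", "goal", "league"],
    ["election", "vote", "parliament", "policy"],
    ["galaxy", "telescope", "orbit", "star"],
]
dictionary = corpora.Dictionary(texts)
bows = [dictionary.doc2bow(t) for t in texts]
lda = LdaModel(bows, num_topics=NUM_TOPICS, id2word=dictionary)

def topic_target(bow):
    # Dense topic-probability vector: the CNN's self-supervised soft target.
    dense = torch.zeros(NUM_TOPICS)
    for topic_id, prob in lda.get_document_topics(bow, minimum_probability=0.0):
        dense[topic_id] = float(prob)
    return dense

# 2. A CNN trained from scratch to regress images onto the topic space.
cnn = models.resnet18(weights=None)
cnn.fc = nn.Linear(cnn.fc.in_features, NUM_TOPICS)
optimizer = torch.optim.SGD(cnn.parameters(), lr=0.01, momentum=0.9)

# Dummy batch: each random "image" stands in for the illustration paired
# with one of the documents above.
images = torch.randn(3, 3, 224, 224)
targets = torch.stack([topic_target(b) for b in bows])

# Soft cross-entropy between predicted and LDA topic distributions.
log_probs = F.log_softmax(cnn(images), dim=1)
loss = -(targets * log_probs).sum(dim=1).mean()
loss.backward()
optimizer.step()
print(f"loss: {loss.item():.4f}")

After training at corpus scale, the convolutional layers of such a network carry general-purpose visual features; the transfer experiments the abstract reports (image classification, object detection, multi-modal retrieval) evaluate exactly those features.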