Francesco Pelosin, Saurav Jha, Andrea Torsello, Bogdan Raducanu, & Joost Van de Weijer. (2022). Towards exemplar-free continual learning in vision transformers: an account of attention, functional and weight regularization. In IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW).
Abstract: In this paper, we investigate the continual learning of Vision Transformers (ViT) for the challenging exemplar-free scenario, with special focus on how to efficiently distill the knowledge of its crucial self-attention mechanism (SAM). Our work takes an initial step towards a surgical investigation of SAM for designing coherent continual learning methods in ViTs. We first carry out an evaluation of established continual learning regularization techniques. We then examine the effect of regularization when applied to two key enablers of SAM: (a) the contextualized embedding layers, for their ability to capture well-scaled representations with respect to the values, and (b) the prescaled attention maps, for carrying value-independent global contextual information. We depict the perks of each distilling strategy on two image recognition benchmarks (CIFAR100 and ImageNet-32) – while (a) leads to a better overall accuracy, (b) helps enhance the rigidity by maintaining competitive performances. Furthermore, we identify the limitation imposed by the symmetric nature of regularization losses. To alleviate this, we propose an asymmetric variant and apply it to the pooled output distillation (POD) loss adapted for ViTs. Our experiments confirm that introducing asymmetry to POD boosts its plasticity while retaining stability across (a) and (b). Moreover, we acknowledge low forgetting measures for all the compared methods, indicating that ViTs might be naturally inclined continual learners.
Keywords: Learning systems; Weight measurement; Image recognition; Surgery; Benchmark testing; Transformers; Stability analysis
|
|
C. Alejandro Parraga, Xavier Otazu, & Arash Akbarinia. (2019). Modelling symmetry perception with banks of quadrature convolutional Gabor kernels. In 42nd edition of the European Conference on Visual Perception (p. 224).
Abstract: Mirror symmetry is a property more likely to be encountered in animals than in medium scale vegetation or inanimate objects in the natural world. This might be the reason why the human visual system (HVS) has evolved to detect it quickly and robustly. Indeed, the perception of symmetry assists higher-level visual processes that are crucial for survival such as target recognition and identification irrespective of position and location. Although the task of detecting symmetrical objects seems effortless to us, it is very challenging for computers (to the extent that it has been proposed as a robust “captcha” by Funk & Liu in 2016). Indeed, the exact mechanism of symmetry detection in primates is not well understood: fMRI studies have shown that symmetrical shapes activate specific higher-level areas of the visual cortex (Sasaki et al., 2005) and similarly, a large body of psychophysical experiments suggests that symmetry perception is critically influenced by low-level mechanisms (Treder, 2010). In this work we attempt to find plausible low-level mechanisms that might form the basis for symmetry perception. Our simple model is made from banks of (i) odd-symmetric Gabors (resembling edge-detecting V1 neurons); and (ii) banks of larger odd- and even-symmetric Gabors (resembling higher visual cortex neurons), that pool signals from the 'edge image'. As reported previously (Akbarinia et al., ECVP 2017), the convolution of the symmetrical lines with the two Gabor kernels of alternative phase produces a minimum in one and a maximum in the other (Osorio, 1996), and the rectification and combination of these signals create lines which hint at mirror symmetry in natural images. We improved the algorithm by combining these signals across several spatial scales. Our preliminary results suggest that such multiscale combination of convolutional operations might form the basis for much of the operation of the HVS in terms of symmetry detection and representation.
|
|
Olivier Penacchio, Xavier Otazu, & Laura Dempere-Marco. (2013). A Neurodynamical Model of Brightness Induction in V1. Plos - PloS ONE, 8(5), e64086.
Abstract: Brightness induction is the modulation of the perceived intensity of an area by the luminance of surrounding areas. Recent neurophysiological evidence suggests that brightness information might be explicitly represented in V1, in contrast to the more common assumption that the striate cortex is an area mostly responsive to sensory information. Here we investigate possible neural mechanisms that offer a plausible explanation for such a phenomenon. To this end, a neurodynamical model which is based on neurophysiological evidence and focuses on the part of V1 responsible for contextual influences is presented. The proposed computational model successfully accounts for well-known psychophysical effects for static contexts and also for brightness induction in dynamic contexts defined by modulating the luminance of surrounding areas. This work suggests that intra-cortical interactions in V1 could, at least partially, explain brightness induction effects and reveals how a common general architecture may account for several different fundamental processes, such as visual saliency and brightness induction, which emerge early in the visual processing pathway.
|
|
Olivier Penacchio, Xavier Otazu, A. Wilkins, & J. Harris. (2015). Uncomfortable images prevent lateral interactions in the cortex from providing a sparse code. In European Conference on Visual Perception ECVP2015.
|
|
Olivier Penacchio, Xavier Otazu, Arnold J. Wilkins, & Sara M. Haigh. (2023). A mechanistic account of visual discomfort. FN - Frontiers in Neuroscience, 17.
Abstract: Much of the neural machinery of the early visual cortex, from the extraction of local orientations to contextual modulations through lateral interactions, is thought to have developed to provide a sparse encoding of contour in natural scenes, allowing the brain to process efficiently most of the visual scenes we are exposed to. Certain visual stimuli, however, cause visual stress, a set of adverse effects ranging from simple discomfort to migraine attacks, and epileptic seizures in the extreme, all phenomena linked with an excessive metabolic demand. The theory of efficient coding suggests a link between excessive metabolic demand and images that deviate from natural statistics. Yet, the mechanisms linking energy demand and image spatial content in discomfort remain elusive. Here, we used theories of visual coding that link image spatial structure and brain activation to characterize the response to images observers reported as uncomfortable in a biologically based neurodynamic model of the early visual cortex that included excitatory and inhibitory layers to implement contextual influences. We found three clear markers of aversive images: a larger overall activation in the model, a less sparse response, and a more unbalanced distribution of activity across spatial orientations. When the ratio of excitation over inhibition was increased in the model, a phenomenon hypothesised to underlie interindividual differences in susceptibility to visual discomfort, the three markers of discomfort progressively shifted toward values typical of the response to uncomfortable stimuli. Overall, these findings propose a unifying mechanistic explanation for why there are differences between images and between observers, suggesting how visual input and idiosyncratic hyperexcitability give rise to abnormal brain responses that result in visual stress.
|
|
C. Alejandro Parraga, Olivier Penacchio, & Maria Vanrell. (2011). Retinal Filtering Matches Natural Image Statistics at Low Luminance Levels. PER - Perception, 40, 96.
Abstract: The assumption that the retina’s main objective is to provide a minimum entropy representation to higher visual areas (i.e. the efficient coding principle) allows one to predict retinal filtering in space–time and colour (Atick, 1992 Network 3 213–251). This is achieved by considering the power spectrum of natural images (which is proportional to 1/f²) and the suppression of retinal and image noise. However, most studies consider images within a limited range of lighting conditions (e.g. near noon) whereas the visual system’s spatial filtering depends on light intensity and the spatiochromatic properties of natural scenes depend on the time of day. Here, we explore whether the dependence of visual spatial filtering on luminance matches the changes in the power spectrum of natural scenes at different times of the day. Using human cone-activation based naturalistic stimuli (from the Barcelona Calibrated Images Database), we show that for a range of luminance levels, the shape of the retinal CSF reflects the slope of the power spectrum at low spatial frequencies. Accordingly, the retina implements the filtering which best decorrelates the input signal at every luminance level. This result is in line with the body of work that places efficient coding as a guiding neural principle.
|
|
C. Alejandro Parraga, Jordi Roca, Dimosthenis Karatzas, & Sophie Wuerger. (2014). Limitations of visual gamma corrections in LCD displays. Dis - Displays, 35(5), 227–239.
Abstract: A method for estimating the non-linear gamma transfer function of liquid–crystal displays (LCDs) without the need for a photometric measurement device was described by Xiao et al. (2011) [1]. It relies on observers’ judgments of visual luminance by presenting eight half-tone patterns with luminances from 1/9 to 8/9 of the maximum value of each colour channel. These half-tone patterns were distributed over the screen both over the vertical and horizontal viewing axes. We conducted a series of photometric and psychophysical measurements (consisting of the simultaneous presentation of half-tone patterns in each trial) to evaluate whether the angular dependency of the light generated by three different LCD technologies would bias the results of these gamma transfer function estimations. Our results show that there are significant differences between the gamma transfer functions measured and produced by observers at different viewing angles. We suggest appropriate modifications to the Xiao et al. paradigm to counterbalance these artefacts which also have the advantage of shortening the amount of time spent in collecting the psychophysical measurements.
Keywords: Display calibration; Psychophysics; Perceptual; Visual gamma correction; Luminance matching; Observer-based calibration
|
|
C. Alejandro Parraga, Jordi Roca, & Maria Vanrell. (2011). Do Basic Colors Influence Chromatic Adaptation? VSS - Journal of Vision, 11(11), 85.
Abstract: Color constancy (the ability to perceive colors as relatively stable under different illuminants) is the result of several mechanisms spread across different neural levels and responding to several visual scene cues. It is usually measured by estimating the perceived color of a grey patch under an illuminant change. In this work, we test whether chromatic adaptation (without a reference white or grey) could be driven by certain colors, specifically those corresponding to the universal color terms proposed by Berlin and Kay (1969). To this end we have developed a new psychophysical paradigm in which subjects adjust the color of a test patch (in CIELab space) to match their memory of the best example of a given color chosen from the universal terms list (grey, red, green, blue, yellow, purple, pink, orange and brown). The test patch is embedded inside a Mondrian image and presented on a calibrated CRT screen inside a dark cabin. All subjects were trained to “recall” their most exemplary colors reliably from memory and asked to always produce the same basic colors when required under several adaptation conditions. These include achromatic and colored Mondrian backgrounds, under a simulated D65 illuminant and several colored illuminants. A set of basic colors was measured for each subject under neutral conditions (achromatic background and D65 illuminant) and used as “reference” for the rest of the experiment. The colors adjusted by the subjects in each adaptation condition were compared to the reference colors under the corresponding illuminant and a “constancy index” was obtained for each of them. Our results show that for some colors the constancy index was better than for grey. The set of best adapted colors in each condition was common to a majority of subjects and depended on the chromaticity of the illuminant and the chromatic background considered.
|
|
Guim Perarnau, Joost Van de Weijer, Bogdan Raducanu, & Jose Manuel Alvarez. (2016). Invertible conditional GANs for image editing. In 30th Annual Conference on Neural Information Processing Systems Workshops.
Abstract: Generative Adversarial Networks (GANs) have recently been shown to successfully approximate complex data distributions. A relevant extension of this model is conditional GANs (cGANs), where the introduction of external information makes it possible to determine specific representations of the generated images. In this work, we evaluate encoders to invert the mapping of a cGAN, i.e., mapping a real image into a latent space and a conditional representation. This makes it possible, for example, to reconstruct and modify real images of faces conditioning on arbitrary attributes. Additionally, we evaluate the design of cGANs. The combination of an encoder with a cGAN, which we call Invertible cGAN (IcGAN), makes it possible to re-generate real images with deterministic complex modifications.
|
|
Ivet Rafegas. (2013). Exploring Low-Level Vision Models. Case Study: Saliency Prediction (Vol. 175). Master's thesis.
|
|
Ivet Rafegas. (2017). Color in Visual Recognition: from flat to deep representations and some biological parallelisms (Maria Vanrell, Ed.). Ph.D. thesis, Ediciones Graficas Rey.
Abstract: Visual recognition is one of the main problems in computer vision that attempts to solve image understanding by deciding what objects are in images. This problem can be computationally solved by using relevant sets of visual features, such as edges, corners, color or more complex object parts. This thesis contributes to how color features have to be represented for recognition tasks.
Image features can be extracted following two different approaches. A first approach is defining handcrafted descriptors of images which is then followed by a learning scheme to classify the content (named flat schemes in Kruger et al. (2013)). In this approach, perceptual considerations are habitually used to define efficient color features. Here we propose a new flat color descriptor based on the extension of color channels to boost the representation of spatio-chromatic contrast that surpasses state-of-the-art approaches. However, flat schemes present a lack of generality far away from the capabilities of biological systems. A second approach proposes evolving these flat schemes into a hierarchical process, like in the visual cortex. This includes an automatic process to learn optimal features. These deep schemes, and more specifically Convolutional Neural Networks (CNNs), have shown an impressive performance to solve various vision problems. However, there is a lack of understanding about the internal representation obtained, as a result of automatic learning. In this thesis we propose a new methodology to explore the internal representation of trained CNNs by defining the Neuron Feature as a visualization of the intrinsic features encoded in each individual neuron. Additionally, and inspired by physiological techniques, we propose to compute different neuron selectivity indexes (e.g., color, class, orientation or symmetry, amongst others) to label and classify the full CNN neuron population to understand learned representations.
Finally, using the proposed methodology, we show an in-depth study on how color is represented on a specific CNN, trained for object recognition, that competes with primate representational abilities (Cadieu et al. (2014)). We found several parallelisms with biological visual systems: (a) a significant number of color-selective neurons throughout all the layers; (b) an opponent and low frequency representation of color oriented edges and a higher sampling of frequency selectivity in brightness than in color in the first layer, as in V1; (c) a higher sampling of color hue in the second layer aligned to observed hue maps in V2; (d) a strong color and shape entanglement in all layers from basic features in shallower layers (V1 and V2) to object and background shapes in deeper layers (V4 and IT); and (e) a strong correlation between neuron color selectivities and color dataset bias.
|
|
Ivet Rafegas, & Maria Vanrell. (2016). Color spaces emerging from deep convolutional networks. In 24th Color and Imaging Conference (pp. 225–230).
Abstract: Award for the best interactive session
Defining color spaces that provide a good encoding of spatio-chromatic properties of color surfaces is an open problem in color science [8, 22]. Related to this, in computer vision the fusion of color with local image features has been studied and evaluated [16]. In human vision research, the cells which are selective to specific color hues along the visual pathway are also a focus of attention [7, 14]. In line with these research aims, in this paper we study how color is encoded in a deep Convolutional Neural Network (CNN) that has been trained on more than one million natural images for object recognition. These convolutional nets achieve impressive performance in computer vision, and rival the representations in the human brain. In this paper we explore how color is represented in a CNN architecture that can give some intuition about efficient spatio-chromatic representations. In convolutional layers the activation of a neuron is related to a spatial filter that combines spatio-chromatic representations. We use an inverted version of it to explore the properties. Using a series of unsupervised methods we classify different types of neurons depending on the color axes they define and we propose an index of color-selectivity of a neuron. We estimate the main color axes that emerge from this trained net and we prove that color-selectivity of neurons decreases from early to deeper layers.
|
|
Ivet Rafegas, & Maria Vanrell. (2016). Colour Visual Coding in trained Deep Neural Networks. In European Conference on Visual Perception.
|
|
Ivet Rafegas, & Maria Vanrell. (2017). Color representation in CNNs: parallelisms with biological vision. In ICCV Workshop on Mutual Benefits of Cognitive and Computer Vision.
Abstract: Convolutional Neural Networks (CNNs) trained for object recognition tasks present representational capabilities approaching those of primate visual systems [1]. This provides a computational framework to explore how image features are efficiently represented. Here, we dissect a trained CNN [2] to study how color is represented. We use a classical methodology from physiology: measuring the selectivity index of individual neurons to specific features. We use ImageNet Dataset [20] images and synthetic versions of them to quantify color tuning properties of artificial neurons to provide a classification of the network population. We identify three main levels of color representation showing some parallelisms with biological visual systems: (a) a decomposition in a circular hue space to represent single color regions with a wider hue sampling beyond the first layer (V2); (b) the emergence of opponent low-dimensional spaces in early stages to represent color edges (V1); and (c) a strong entanglement between color and shape patterns representing object-parts (e.g. wheel of a car), object-shapes (e.g. faces) or object-surrounds configurations (e.g. blue sky surrounding an object) in deeper layers (V4 or IT).
|
|
Mohamed Ramzy Ibrahim, Robert Benavente, Daniel Ponsa, & Felipe Lumbreras. (2024). SWViT-RRDB: Shifted Window Vision Transformer Integrating Residual in Residual Dense Block for Remote Sensing Super-Resolution. In 19th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications.
Abstract: Remote sensing applications, impacted by acquisition season and sensor variety, require high-resolution images. Transformer-based models improve satellite image super-resolution but are less effective than convolutional neural networks (CNNs) at extracting local details, crucial for image clarity. This paper introduces SWViT-RRDB, a new deep learning model for satellite imagery super-resolution. The SWViT-RRDB, combining transformer with convolution and attention blocks, overcomes the limitations of existing models by better representing small objects in satellite images. In this model, a pipeline of residual fusion group (RFG) blocks is used to combine the multi-headed self-attention (MSA) with residual in residual dense block (RRDB). This combines global and local image data for better super-resolution. Additionally, an overlapping cross-attention block (OCAB) is used to enhance fusion and allow interaction between neighboring pixels to maintain long-range pixel dependencies across the image. The SWViT-RRDB model and its larger variants outperform state-of-the-art (SoTA) models on two different satellite datasets in terms of PSNR and SSIM.
|
|
Christophe Rigaud, Dimosthenis Karatzas, Joost Van de Weijer, Jean-Christophe Burie, & Jean-Marc Ogier. (2013). An active contour model for speech balloon detection in comics. In 12th International Conference on Document Analysis and Recognition (pp. 1240–1244).
Abstract: Comic books constitute an important cultural heritage asset in many countries. Digitization combined with subsequent comic book understanding would enable a variety of new applications, including content-based retrieval and content retargeting. Document understanding in this domain is challenging as comics are semi-structured documents, combining semantically important graphical and textual parts. Few studies have been done in this direction. In this work we detail a novel approach for closed and non-closed speech balloon localization in scanned comic book pages, an essential step towards a fully automatic comic book understanding. The approach is compared with existing methods for closed balloon localization found in the literature and results are presented.
|
|
Christophe Rigaud, Dimosthenis Karatzas, Joost Van de Weijer, Jean-Christophe Burie, & Jean-Marc Ogier. (2013). Automatic text localisation in scanned comic books. In Proceedings of the International Conference on Computer Vision Theory and Applications (pp. 814–819).
Abstract: Comic books constitute an important cultural heritage asset in many countries. Digitization combined with subsequent document understanding enables direct content-based search as opposed to metadata-only search (e.g. album title or author name). Few studies have been done in this direction. In this work we detail a novel approach for the automatic text localization in scanned comic book pages, an essential step towards a fully automatic comic book understanding. We focus on speech text as it is semantically important and represents the majority of the text present in comics. The approach is compared with existing methods of text localization found in the literature and results are presented.
Keywords: Text localization; comics; text/graphic separation; complex background; unstructured document
|
|
Muhammad Anwer Rao, Fahad Shahbaz Khan, Joost Van de Weijer, & Jorma Laaksonen. (2016). Combining Holistic and Part-based Deep Representations for Computational Painting Categorization. In 6th International Conference on Multimedia Retrieval.
Abstract: Automatic analysis of visual art, such as paintings, is a challenging inter-disciplinary research problem. Conventional approaches only rely on global scene characteristics by encoding holistic information for computational painting categorization. We argue that such approaches are sub-optimal and that discriminative common visual structures provide complementary information for painting classification. We present an approach that encodes both the global scene layout and discriminative latent common structures for computational painting categorization. The regions of interest are automatically extracted, without any manual part labeling, by training class-specific deformable part-based models. Both holistic regions and regions of interest are then described using multi-scale dense convolutional features. These features are pooled separately using Fisher vector encoding and concatenated afterwards in a single image representation. Experiments are performed on a challenging dataset with 91 different painters and 13 diverse painting styles. Our approach outperforms the standard method, which only employs the global scene characteristics. Furthermore, our method achieves state-of-the-art results outperforming a recent multi-scale deep features based approach [11] by 6.4% and 3.8% respectively on artist and style classification.
|
|