TY  - CONF
AU  - Alejandro Cartas
AU  - Jordi Luque
AU  - Petia Radeva
AU  - Carlos Segura
AU  - Mariella Dimiccoli
A2  - ICCVW
PY  - 2019//
TI  - Seeing and Hearing Egocentric Actions: How Much Can We Learn?
BT  - IEEE International Conference on Computer Vision Workshops
SP  - 4470
EP  - 4480
N2  - Our interaction with the world is an inherently multimodal experience. However, the understanding of human-to-object interactions has historically been addressed by focusing on a single modality. In particular, a limited number of works have considered integrating the visual and audio modalities for this purpose. In this work, we propose a multimodal approach for egocentric action recognition in a kitchen environment that relies on audio and visual information. Our model combines a sparse temporal sampling strategy with a late fusion of audio, spatial, and temporal streams. Experimental results on the EPIC-Kitchens dataset show that multimodal integration leads to better performance than unimodal approaches. In particular, we achieved a 5.18% improvement over the state of the art on verb classification.
UR  - https://ieeexplore.ieee.org/document/9022020
UR  - http://dx.doi.org/10.1109/ICCVW.2019.00548
N1  - MILAB; no proj
ID  - Alejandro Cartas2019
ER  -