
Emotion recognition from mid-level features

  • a Open University of Catalonia, Rambla del Poblenou 156, Barcelona, Spain
  • b Computer Vision Center, Edifici O Bellaterra, Barcelona 08193, Spain

Highlights

We automatically classify emotions from facial video sequences.

We use Action Units as intermediate features for emotion learning.

We fuse structural and appearance information to classify emotions.

We propose the use of the Histogram of Action Units for emotion classification.

Subtle positive emotions can be automatically inferred with close to human accuracy.


Abstract

In this paper we present a study on the use of Action Units as mid-level features for automatically recognizing basic and subtle emotions. We propose a representation model based on mid-level facial muscular movement features. We encode these movements dynamically using the Facial Action Coding System, and propose to use these intermediate features based on Action Units (AUs) to classify emotions. AU activations are detected by fusing a set of spatiotemporal geometric and appearance features. The algorithm is validated in two applications: (i) the recognition of 7 basic emotions using the publicly available Cohn-Kanade database, and (ii) the inference of subtle emotional cues in the Newscast database. In this second scenario, we consider emotions that are perceived cumulatively over longer periods of time. In particular, we automatically classify whether video shots from public TV news channels refer to good or bad news. To deal with the different video lengths we propose a Histogram of Action Units, computed using a sliding-window strategy over the frame sequences. Our approach achieves accuracies close to human perception.

Keywords

  • Facial expression;
  • Emotion recognition;
  • Action units;
  • Computer vision

1. Introduction

Dynamic facial behavior is a rich source of information for conveying emotions. In any communication, we infer much of the information in a message from the speaker’s facial expression. For instance, the severity or kindness of a message is often perceived from relevant changes in expressiveness [32], the preference in simple binary choices can be inferred from low-intensity facial expressions [17], and low-level communicative notions such as the positivity or negativity of a message are usually inferred by human perceivers. See Fig. 1 for an example of the communicative power of facial expressions: one of these two people is telling good news, and it is straightforward for a human observer to guess who.

In computer vision, facial expression analysis is usually tackled using the Facial Action Coding System (FACS), developed by Ekman and Friesen [10]. FACS accurately and unambiguously defines a set of atomic, isolated facial movements that people are able to execute. In particular, FACS describes 44 main Action Units (AUs), each characterized by its location and intensity. Ekman and Friesen also defined six basic emotions [9] that are common across different cultures, namely anger, disgust, fear, happiness, sadness and surprise.

In this paper we use AUs to construct low-dimensional feature vectors from video sequences, and then use these features to recognize emotions. Our goal is to go beyond recognizing the basic emotions. In particular, we present a system capable of inferring the subtle emotional information illustrated in the example of Fig. 1.

To detect the AU activations we combine two kinds of facial dynamics representations. The first is a geometric descriptor that encodes the evolution of the facial structure over time, and the second is an appearance descriptor based on local binary patterns on three orthogonal planes (LBP-TOP). Sections 3.2 and 3.3 give the details of these representations, respectively. We combine these two descriptors to detect the activation of particular AUs and then construct a mid-level feature vector from these AU detections.
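As an illustration of this pipeline, the sketch below (not taken from the paper) assumes the geometric and LBP-TOP descriptors have already been computed per sequence, fuses them by simple concatenation, and trains one binary SVM detector per AU. The function names, the fusion-by-concatenation choice and the RBF-SVM classifier are illustrative assumptions, not the paper's exact design.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC


def train_au_detectors(geom_feats, app_feats, au_labels, au_ids):
    """Train one binary detector per Action Unit on fused descriptors.

    geom_feats : (n_sequences, d_geom) geometric descriptors per sequence
    app_feats  : (n_sequences, d_app)  LBP-TOP appearance descriptors
    au_labels  : (n_sequences, n_aus)  binary AU activation ground truth
    au_ids     : list of AU identifiers, e.g. [1, 2, 4, 6, 7, 12, ...]
    """
    # Feature-level fusion by concatenation (an illustrative choice; the
    # paper's actual fusion scheme is described in its later sections).
    fused = np.hstack([geom_feats, app_feats])
    detectors = {}
    for j, au in enumerate(au_ids):
        clf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
        clf.fit(fused, au_labels[:, j])
        detectors[au] = clf
    return detectors


def au_activation_vector(detectors, geom_feat, app_feat):
    """Binary mid-level feature vector: one entry per AU, 1 if detected active."""
    fused = np.hstack([geom_feat, app_feat]).reshape(1, -1)
    return np.array([int(detectors[au].predict(fused)[0]) for au in detectors])
```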

In our experiments on automatic emotion recognition we consider two different scenarios. The first is the recognition of the basic emotions. In this setting we construct a binary feature vector from the AU detections which indicates, at each component, whether the corresponding AU is active or not. We test the recognition of basic emotions on the publicly available Cohn-Kanade database [15] and our results show that, indeed, AUs are suitable mid-level representations for automatically detecting basic emotions.
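The following minimal sketch (illustrative, not the paper's exact setup) shows how such binary AU activation vectors could be mapped to the basic emotion classes with a standard off-the-shelf classifier; the linear SVM and the variable names are assumptions.

```python
import numpy as np
from sklearn.svm import LinearSVC

# X_train : (n_sequences, n_aus) binary AU activation vectors (1 = AU active)
# y_train : (n_sequences,) emotion labels, e.g. the 7 Cohn-Kanade classes
def train_basic_emotion_classifier(X_train, y_train):
    clf = LinearSVC()  # any standard multi-class classifier could be used here
    clf.fit(X_train, y_train)
    return clf

# Prediction for a new sequence given its binary AU vector x of shape (n_aus,):
# emotion = train_basic_emotion_classifier(X_train, y_train).predict(x[None, :])[0]
```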

The second scenario is the detection of subtle emotions. In particular, we consider the problem of detecting positive attitudes from a speaker. To the best of our knowledge, there is no publicly available dataset for evaluating attitudes as subtle emotions. However, this setting has many potential applications in fields where understanding the feelings of a subject matters (for instance in online education or in medicine). To validate the use of AUs as intermediate representations for detecting positive or negative attitudes, we collected a database of videos from different public news channels. We call this dataset the Newscast Dataset. In the context of newscasts, the communication is intended to be as neutral as possible. However, when news anchors deliver good news, they also convey it with their faces. We manually labeled 150 samples according to the semantic content of the news (good news vs. bad news). These data objectively link the non-verbal facial attitude with the message outcome, and will be made available to the community.

Fig. 1. Who is telling us good news? Notice that the first person is transmitting much more positiveness with her facial expression than the second one.

In this second scenario, in order to deal with different video lengths, we propose a descriptor based on the Histogram of Action Units (HAU), as detailed in Section 5.2. The HAU allows us to encode cumulative evidence of the presence of complex emotions. As a secondary contribution, we propose to extend the feature set with second-order statistics of the Action Unit activations. The resulting normalized histograms are fed into standard machine learning classifiers. The experimental results show accuracies very close to human performance.
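A minimal sketch of how such a sliding-window HAU could be computed is given below, assuming binary per-frame AU detections are available; the window and stride values and the use of pairwise co-activations as the second-order statistic are illustrative assumptions rather than the paper's exact definitions.

```python
import numpy as np
from itertools import combinations


def histogram_of_action_units(frame_aus, window=30, stride=15):
    """Fixed-length HAU descriptor for a video of arbitrary length.

    frame_aus : (n_frames, n_aus) binary matrix of per-frame AU detections.
    Returns per-AU activation frequencies followed by pairwise co-activation
    frequencies, normalized by the number of windows.
    """
    n_frames, n_aus = frame_aus.shape
    pairs = list(combinations(range(n_aus), 2))
    first_order = np.zeros(n_aus)
    second_order = np.zeros(len(pairs))
    n_windows = 0

    # Slide a window over the sequence; short videos yield a single window.
    for start in range(0, max(n_frames - window + 1, 1), stride):
        w = frame_aus[start:start + window]
        active = (w.sum(axis=0) > 0).astype(float)  # AUs active in this window
        first_order += active
        second_order += np.array([active[i] * active[j] for i, j in pairs])
        n_windows += 1

    return np.concatenate([first_order, second_order]) / n_windows

# The fixed-length, normalized HAU vectors can then be fed to any standard
# classifier, independently of the original video length.
```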

2. Related work

Our approach involves two stages: the detection of each AU separately, and the classification of emotions using the AU activation feature vector as building blocks.

2.1. Action Units detection

We rely on a well-known methodology based on the Facial Action Coding System [10]. The FACS system comprises 44 Action Units (AUs). Each AU corresponds to a set of muscle movements that anatomically define expression descriptors, which can be used for higher-level emotion recognition tasks.