IdenNet: Identity-Aware Facial Action Unit Detection
Cheng-Hao Tu, Chih-Yuan Yang, Jane Yung-jen Hsu
Intelligent Agent (iAgents) Lab
Computer Science and Information Engineering
Outline
➢ Introduction and Motivation
➢ Related Work
➢ Proposed Method
➢ Experiments
➢ Conclusion
Facial Action Units (AUs)
● Facial expression is a fast, non-verbal channel that conveys our emotions and intentions.
● Facial Action Units (AUs) denote the fundamental actions of individual muscles or groups of muscles, whose combinations describe more than 7,000 observable facial expressions.
Problem: Identity-based Differences
Motivation - Limited Subjects of AU-annotated Datasets
● Labeling AUs is time-consuming and requires expert knowledge, so existing AU-annotated datasets contain only a limited number of subjects.
Dataset       | Labels        | Number of Subjects | Number of Samples
BP4D          | AUs           | 41                 | 328 videos (148,562 frames)
DISFA         | AUs           | 27                 | 27 videos (4 mins, ~129,600 frames)
UNBC-McMaster | AUs, Pain     | 25                 | 200 videos (48,398 frames)
AMFED         | AUs, Interest | <=242              | 242 videos (1 min)
Idea - Utilizing ID-annotated Datasets
Dataset  | Number of Subjects | Number of Samples
LFW      | 5,749              | 13,233
WDRef    | 2,995              | 99,773
CelebA   | 10,177             | 202,599
VGGFACE  | 2,622              | 2.6M
VGGFACE2 | 9,131              | 3.3M
● We aim to utilize identity-rich face datasets to improve AU detection.
SVTPT: Support Vector-based Transductive Parameter Transfer (ICMI 2014)
This method learns multiple linear classifiers (Support Vector Machine) using labeled source data. For an unlabeled test image, the method transfers those classifier parameters to create a new linear classifier.
This method runs slowly because it needs to compare a test image with all training images.
DRML: Deep Region and Multi-label Learning for Facial Action Unit Detection (CVPR 2016)
This method develops a region layer to simultaneously learn regional information and AU co-occurrence.
This method does not use any additional dataset to train the model.
ROINet: Action Unit Detection with Region Adaptation, Multi-labeling Learning and Optimal Temporal Fusing (CVPR 2017)
This method selects 20 regions of interest to detect AUs, and fuses temporal information when the inputs are frame sequences.
This method does not use additional datasets.
The network size is large because it uses VGG conv1 to conv12 to extract features.
E-Net+ATF: Identity-based Adversarial Training of Deep CNNs for Facial Action Unit Recognition (BMVC 2018)
E-Net means a VGG19-based network containing enhancing layers, which is extended from ROINet.
ATF (adversarial training framework) means the parameter updating strategy for its three sub-networks: E-Net, AU, and ID.
It shares a similar idea with ours: reducing identity-caused differences.
Differences: it uses VGG (large and generic) while ours uses LightCNN (small and face-specific), and ours uses an additional identity-rich dataset.
LightCNN: A Light CNN for Deep Face Representation with Noisy Labels (IEEE TIFS 2018)
This method develops MFM (Max-Feature-Map), a variation of the maxout activation function, to handle noisy data and reduce model size and computational load.
MFM-based small networks work well for face identification problems.
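A minimal sketch of the MFM operator on a channels-first feature map; the (channels, height, width) layout and the toy shapes are assumptions for illustration, not LightCNN's actual implementation:

```python
import numpy as np

def max_feature_map(x):
    """Max-Feature-Map (MFM): split the channel axis in half and keep the
    element-wise maximum of the two halves, so 2k input channels become k
    output channels. Layout is assumed to be (channels, height, width)."""
    c = x.shape[0]
    assert c % 2 == 0, "MFM needs an even number of channels"
    return np.maximum(x[: c // 2], x[c // 2:])

# toy example: 4 channels of 2x2 features -> 2 channels
x = np.arange(16, dtype=float).reshape(4, 2, 2)
y = max_feature_map(x)  # y.shape == (2, 2, 2)
```

Unlike ReLU, MFM acts as a competitive selector between feature maps and halves the channel count, which is one reason the resulting networks stay small.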
Ideas
● To remove identity-caused differences
○ We subtract identity-based features from AU-based features.
● To make the network small and robust
○ We adopt LightCNN and its pre-trained weights as our feature extractor.
○ We use ID-annotated, noisy images to train the network.
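A toy sketch of the subtraction idea, under the simplifying assumption that an observed feature is the sum of an expression component and an identity component:

```python
import numpy as np

rng = np.random.default_rng(0)

expr = rng.normal(size=8)   # shared expression component (e.g. a smile)
id_a = rng.normal(size=8)   # identity component of subject A
id_b = rng.normal(size=8)   # identity component of subject B

# observed AU-based features mix expression with identity
feat_a = expr + id_a
feat_b = expr + id_b

# distance between the two subjects before and after subtracting
# their identity-based features
before = float(np.linalg.norm(feat_a - feat_b))
after = float(np.linalg.norm((feat_a - id_a) - (feat_b - id_b)))  # collapses to 0
```

In this toy model the residual is exactly the shared expression component; in the real network the identity features come from the face-clustering branch, so the cancellation is only approximate.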
Face Clustering Task
● We train our network to “learn a mapping from face images to a compact Euclidean space where distances directly correspond to a measure of face similarity.” (Schroff et al. 2015)
● We use the same triplet loss proposed by FaceNet.
[Figure: the triplet loss pulls the anchor sample (given identity) close to the positive sample (same identity) and pushes it far from the negative sample (different identity).]
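A minimal sketch of a FaceNet-style triplet loss on embedding vectors; the margin value 0.2 is an assumption:

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Squared-distance triplet loss: require the anchor-positive distance
    to be smaller than the anchor-negative distance by at least `margin`."""
    d_pos = float(np.sum((anchor - positive) ** 2))
    d_neg = float(np.sum((anchor - negative) ** 2))
    return max(d_pos - d_neg + margin, 0.0)

a = np.array([1.0, 0.0])   # anchor (given identity)
p = np.array([1.0, 0.1])   # positive sample (same identity, nearby)
n = np.array([-1.0, 0.0])  # negative sample (different identity, far)
loss = triplet_loss(a, p, n)  # triplet already satisfied -> 0.0
```

Triplets that already satisfy the margin contribute zero loss; swapping the positive and negative samples produces a large positive loss that drives the embedding apart.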
AU Detection Task
● We adopt cross-entropy loss for AU detection, which is widely used by existing AU detection methods.
t-SNE visualization of 6,000 frames from 6 identities (1,000 frames per identity); black: AU12 labeled, gray: AU12 not labeled
Optimization
● We combine AU- and ID-annotated datasets in our end-to-end training procedure with the total loss L_total = λ · L_cluster + (1 − λ) · L_AU.
● For ID-annotated batches, we set λ = 1 because AU labels are unavailable.
● For AU/ID-annotated batches, we set λ = 0.5.
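The batch-dependent weighting described above can be sketched as follows; the name `lam` for the batch weight and the binary cross-entropy form of the AU loss are assumptions based on the slide text:

```python
import numpy as np

def au_loss(y_true, p):
    """Multi-label binary cross-entropy over per-AU probabilities."""
    p = np.clip(p, 1e-7, 1 - 1e-7)
    return float(-np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p)))

def total_loss(l_cluster, l_au, lam):
    """Convex combination of the face-clustering and AU-detection losses.
    lam = 1.0 for ID-annotated batches (no AU labels), 0.5 for AU/ID batches."""
    return lam * l_cluster + (1.0 - lam) * l_au
```

With `lam = 1.0` the AU term vanishes entirely, so ID-only batches update only the clustering branch, while mixed batches split the gradient evenly between the two tasks.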
Network Architecture
● We adopt 4 convolutional layers from LightCNN and use its weights pretrained on the CASIA-WebFace dataset.
● We randomly initialize our Face Clustering and AU Detection layers and use CelebA as our ID-annotated dataset to train the network.
Datasets
Dataset                 | BP4D (AU/ID) | DISFA (AU/ID) | UNBC-McMaster (AU/ID) | CelebA (ID only)
Year of release         | 2014         | 2013          | 2011                  | 2015
Number of identities    | 41           | 27            | 25                    | 10,177
Number of frames        | 148,572      | ~129,600      | 48,398                | 202,599
Number of annotated AUs | 27           | 12            | 10                    | none
(Example images omitted.)
Metrics
F1-measure, the harmonic mean of recall and precision
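As a quick reference, the F1-measure computed from true positives, false positives, and false negatives; the counts in the example are illustrative, not from the paper:

```python
def f1_measure(tp, fp, fn):
    """F1: harmonic mean of precision (tp / (tp + fp)) and recall (tp / (tp + fn))."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

score = f1_measure(tp=8, fp=2, fn=2)  # precision = recall = 0.8 -> F1 = 0.8
```

F1 is preferred over accuracy for AU detection because most AUs are inactive in most frames, so a trivial all-negative predictor would score high accuracy but zero F1.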
BP4D within-dataset
DISFA within-dataset
UNBC-McMaster within-dataset
Trained on BP4D / Test on DISFA
Trained on BP4D / Test on UNBC-McMaster
Summary
● We propose a lightweight, effective network to address the AU detection problem.
● It builds on two existing methods, LightCNN as a compact feature extractor and FaceNet's triplet loss for face clustering, and uses a feature subtraction to reduce identity-caused differences for AU detection.
● It achieves state-of-the-art performance in both within-dataset and cross-dataset scenarios.