Motion Magnification of Facial Micro-expressions
Sumit Gogia
Massachusetts Institute of Technology
77 Massachusetts Avenue, Cambridge, MA
Runpeng Liu
Massachusetts Institute of Technology
77 Massachusetts Avenue, Cambridge, MA
[email protected]
[email protected]
December 8, 2014
Figure 1: Motion magnification of facial micro-expressions generated by “frustration” (male) and “disgust”
(female) emotional cues. (a) Top: frame in original video sequence. Bottom: spatiotemporal YT slices of video along
eye and lip profiles marked (yellow) above. (b-c) Results of applying linear Eulerian motion magnification [9] and
phase-based motion magnification [7], respectively, to the video sequence in (a). Subtle lip and eye movements
associated with micro-expressions are amplified (top), as visualized by YT spatiotemporal slices (bottom).
1 Introduction

A common issue faced when judging emotional reactions, such as in psychology appointments and high-stakes interrogation rooms, is the inability to accurately detect brief, subtle expressions called micro-expressions. These micro-expressions frequently indicate true emotional response, but are often difficult for human vision to perceive [2].

To tackle this problem, there have been attempts to frame micro-expression detection computationally, with the hope that machine vision resources can parse the small differences that human visual systems toss out ([4], [5], [6], [10]). In particular, machine learning has been employed to good effect in emotion recognition from facial images and videos [1]. The current models, though, cannot distinguish the expressions of a wide variety of people well, and cannot accurately detect and classify micro-expressions. In addition, in high-sensitivity environments such as psychological treatment and interrogation, it may not be desirable to fully automate the emotion detection procedure.

These issues motivate methods that sit between fully automated solutions and pure human perception for analyzing emotion. In this paper, we develop such a method relying on motion magnification of video sequences, a topic with recent and promising development ([3], [9]). Motion magnification, particularly Eulerian motion magnification ([7], [8], [9]), has been utilized to great effect in revealing subtle motions in videos, e.g. for amplification of physiological features such as heart pulse, breath rate, and pupil dilation ([9], [7]). Accordingly, we use motion magnification to magnify subtle facial expressions and make them visible to humans.
Fundamentally, the Eulerian method uses temporal filtering to amplify variation in a fixed spatial region across a
sequence of video frames. It has been shown in [9] that
this Eulerian process can also be used to approximate
spatial translation in 2D, and thereby amplify small
motions in the same video.
We formulate the task of magnifying micro-expressions
by detailing the linear and phase-based Eulerian motion
processing methods. We describe filtering parameters
used in these methods and delineate the connections between facial motion cues and human emotions. These are
used to specify spatiotemporal filters for extracting motion from different facial features. We lastly evaluate our
proposed approach against a small dataset of facial expression video sequences we collected, and examine the
relevance of collected data to target emotion detection environments. Our results indicate that motion magnification can be used to amplify a variety of subtle facial expressions and make them visible to humans.
2 Background

2.1 Lagrangian Motion Magnification

Lagrangian methods inspired the initial solutions to motion magnification, and are used in many optical flow and feature-tracking algorithms. For solutions in this approach, optical flow vectors are computed for each pixel in an image frame so as to track moving features in time and space through a video sequence. As such, Lagrangian methods are best suited for representing large, broad movements in a video. They generally fail to amplify small changes arising from very subtle motions.

Another significant drawback of the approach is that much computation must be expended to ensure that motion is amplified smoothly; even with diligence, artifacts tend to appear, as shown in [3]. Lastly, we note that the approach is also sensitive to noise, as tracking explicit pixels under noise is an issue. We will see that the Eulerian approach mitigates some of these effects.

While solutions under the Lagrangian approach rely on global optimization of flow vectors and smoothness parameters, the Eulerian method of motion magnification analyzes temporal changes of pixel values in a spatially localized manner, making it less computationally expensive. In addition, it is capable of approximating and magnifying small-scale spatial motion [9, 7], indicating its applicability to magnification of micro-expressions.

2.2 Eulerian Motion Magnification

In the first step of Eulerian motion magnification, videos are decomposed into different spatial pyramid levels (Figure 2a). Spatial decomposition allows modularity in signal amplification, so that bands that best approximate desired motions receive greater amplification, and those that encode unwanted artifacts receive little amplification. Recent works in Eulerian motion magnification have used various types of spatial pyramids for this step, such as Laplacian pyramids by Wu et al. [9], complex steerable pyramids by Wadhwa et al. [7], and Riesz pyramids in [8].

Next, a temporal bandpass filter is applied to each spatial band in order to extract time-series signals responding to a frequency range of interest (Figure 2b). The bandwidth and frequency range of the temporal filter are modular, and can be fine-tuned for various applications such as selective magnification of motion factors in a video. In our experiments, modular temporal filters are used to extract movement from different facial features.

After applying the desired temporal filter to each spatial pyramid level, the bands are amplified before being added back to the original signal (Figure 2c). These amplification factors are also modular, and can be specified as a function of the spatial frequency band. After amplification, the spatial bands are collapsed to form a reconstructed video with the desired temporal frequencies amplified (Figure 2d).
Figure 2: Spatiotemporal processing pipeline for general Eulerian video magnification framework. (a) Input video is
decomposed into different spatial frequency bands (i.e. pyramid levels). (b) User-specified temporal bandpass filter is
applied to each spatial pyramid level. (c) Each spatial band is amplified by a motion magnification factor, α, before
being added back to original signal. (d) Spatial pyramids are collapsed to reconstruct motion-magnified output video.
Figure adapted from [9].
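To make the pipeline in Figure 2 concrete, the following is a minimal sketch in Python with NumPy and OpenCV of the four stages (decompose, temporally filter, amplify, collapse). It is illustrative only and is not the implementation evaluated in this paper; the pyramid depth, passband, and amplification factor are placeholder values, and grayscale float frames of identical size are assumed.

```python
# Minimal sketch of the Figure 2 pipeline; not the authors' MATLAB code.
import numpy as np
import cv2

def laplacian_pyramid(frame, levels):
    """Decompose one grayscale frame into Laplacian levels plus a low-pass residual (Figure 2a)."""
    pyramid, current = [], frame.astype(np.float32)
    for _ in range(levels):
        down = cv2.pyrDown(current)
        up = cv2.pyrUp(down, dstsize=(current.shape[1], current.shape[0]))
        pyramid.append(current - up)
        current = down
    pyramid.append(current)                      # low-pass residual
    return pyramid

def ideal_bandpass(stack, fs, low, high):
    """Ideal temporal bandpass along the time axis of a (time, h, w) stack (Figure 2b)."""
    freqs = np.fft.rfftfreq(stack.shape[0], d=1.0 / fs)
    spectrum = np.fft.rfft(stack, axis=0)
    spectrum[(freqs < low) | (freqs > high)] = 0
    return np.fft.irfft(spectrum, n=stack.shape[0], axis=0)

def magnify(frames, fs=30.0, levels=4, low=0.5, high=4.0, alpha=10.0):
    """Bandpass each spatial level over time, amplify, and collapse the pyramids (Figure 2c-d)."""
    pyramids = [laplacian_pyramid(f, levels) for f in frames]
    bandpassed = []
    for lvl in range(levels):                    # leave the low-pass residual unamplified
        stack = np.stack([p[lvl] for p in pyramids])
        bandpassed.append(alpha * ideal_bandpass(stack, fs, low, high))
    output = []
    for t, pyr in enumerate(pyramids):
        levels_t = [pyr[l] + bandpassed[l][t] for l in range(levels)] + [pyr[levels]]
        frame = levels_t[-1]
        for lap in reversed(levels_t[:-1]):      # collapse from coarse to fine
            frame = cv2.pyrUp(frame, dstsize=(lap.shape[1], lap.shape[0])) + lap
        output.append(frame)
    return output
```

A practical implementation would also attenuate α at the finest pyramid levels, following the bounds discussed in the next two subsections.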
2.2.1 Linear Magnification

Linear motion magnification, examined in [9], was the first method proposed under the Eulerian approach. In linear motion magnification, variations of pixel values over time are considered the motion-coherent signals, and so are extracted by temporal bandpass filtering and amplified.

Theoretical justification for this coherence between pixel value and motion is given in [9]. The authors used a first-order Taylor series approximation of image motion to show that the amplification of temporally bandpassed pixel values could approximate motion, and derived a bound on the amplification factor:

(1 + α)δ(t) < λ/8,

where α is the magnification factor, δ(t) is the motion signal, and λ is the spatial wavelength of the moving signal. This bound is then applied to determine how much to amplify the motion-coherent signal at each spatial band for smooth motion magnification.

This method was shown by its inventors to magnify subtle motion signals effectively; however, as seen through the derived bound, the range of suitable α values is small for high spatial frequency signals. In addition, with high α values noise can be amplified significantly, since amplification is applied directly to the pixel values.
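As a quick illustration of how this bound constrains the choice of α per spatial band, the helper below computes the largest admissible factor for a given spatial wavelength and expected motion amplitude; the numbers are illustrative examples, not values used in our experiments.

```python
def max_linear_alpha(wavelength, delta):
    """Largest alpha satisfying (1 + alpha) * delta < wavelength / 8 (linear-method bound)."""
    return wavelength / (8.0 * delta) - 1.0

# A coarse band (wavelength 64 px) tolerates much more amplification than a
# fine band (8 px) for the same 0.25 px motion:
print(max_linear_alpha(64.0, 0.25))   # 31.0
print(max_linear_alpha(8.0, 0.25))    # 3.0
```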
2.2.2 Phase-based Magnification

In phase-based magnification, as opposed to linear motion magnification, the image phase over time is taken to be the motion-coherent signal, following from a relation between signal phase and translation. To operate on phase as well as spatial subbands, a complex steerable pyramid is employed; each phase signal at each spatial subband and orientation is then temporally bandpassed and amplified.

The method has theoretical justification similar to that for linear motion magnification, though a Fourier series expansion is used instead to expose the relation between translation and phase. Again, a corresponding bound on the amplification factor was found:

αδ(t) < λ/4,

with α the magnification factor, δ the motion signal, and λ the corresponding wavelength for the frequency extracted by the processed subband.

The phase-based method extends the range of appropriate amplification factors and has improved noise performance relative to linear magnification, allowing for more accurate magnification of small motions and better modular control over magnification of larger motions. The main drawback of the phase-based approach is performance efficiency, as computing the complex steerable pyramid at different scales and orientations takes substantially more time than computing the Laplacian pyramid in the linear approach.
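The phase/translation relation underlying this method can be seen in one dimension with a plain Fourier transform (the actual method uses a complex steerable pyramid, which localizes phase in space and scale). The toy sketch below shifts a signal by one sample, scales the per-frequency phase difference, and recovers a proportionally larger shift; it is an illustration of the principle, not the authors' algorithm.

```python
import numpy as np

N = 256
x = np.exp(-0.5 * ((np.arange(N) - 100.0) / 6.0) ** 2)   # "frame" with a bump at index 100
y = np.roll(x, 1)                                         # next "frame": subtle 1-sample motion

alpha = 20
X, Y = np.fft.fft(x), np.fft.fft(y)
dphi = np.angle(Y * np.conj(X))                           # per-frequency phase difference
magnified = np.fft.ifft(X * np.exp(1j * (1 + alpha) * dphi)).real

# The 1-sample motion becomes a (1 + alpha)-sample motion after magnification.
print(np.argmax(x), np.argmax(y), np.argmax(magnified))   # 100 101 121
```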
2.3 Micro-expressions

Micro-expressions are brief facial expressions that occur when people unconsciously repress or deliberately conceal emotions. They match facial expressions that occur when emotion is naturally expressed, and so their detection and magnification is valuable for recognizing emotion in high-stakes environments. Unfortunately, micro-expressions are difficult for humans to detect due to their brevity and subtlety in the spatial domain. It is for these reasons that micro-expressions are a natural candidate for motion magnification.

In order to formulate the magnification task, a description of facial expressions in space is required, particularly for specifying the spatiotemporal frequencies to magnify. This description has been given much study, and accepted results have been determined for the 6 universal emotions. We list them in Table 1 and visualize them in Figure 3, as found in [1].

Emotion        Description
Happiness      Raising of mouth corners
Sadness        Lowering of mouth corners
Surprise       Brow arch; eyes open wide
Anger          Eyes bulging; brows lowered
Frustration    Lowering of mouth corner; slanted lips; eye twitch
Disgust        Brow ridge wrinkled; head withdrawal

Table 1: Summary of facial feature motion cues associated with each of the six universal emotions, as specified in [1].
3 Methods

3.1 Generation of Micro-Expression Sequences

We collected video sequences of facial expressions from 10 volunteer undergraduate students at the Massachusetts Institute of Technology (MIT). A DSLR camera recording at 30 fps was trained on subjects' faces to videotape their responses to various emotion words. Subjects were instructed to remain as motionless and emotionless as possible during the experiment, except when cued by emotional keywords to generate a particular micro-expression.

Verbal cues of 6 universal emotions (happiness, sadness, disgust, frustration, anger, surprise) as classified by psychological research [1] were given at 15-second intervals to elicit corresponding micro-expressions from each subject.

Though micro-expressions often occur involuntarily or unconsciously in real-life situations, it was difficult to reproduce this effect in an experimental setting. Instead, subjects were advised to imagine a high-stakes situation in which there would be great incentive to conceal their true feelings or emotional responses to a sensitive topic. This instruction would motivate subjects to make their facial expressions as brief and subtle as possible, so that any motion of facial features would be nearly imperceptible to the human eye, but suitable for applying motion magnification.
Figure 3: A visualization of the 6 universal facial expressions as they form in space and time; taken from [1].
Finally, in selecting appropriate video footage for processing by the motion magnification algorithms described previously, we discarded any sequences in which a subject's facial expressions were trivially perceptible by the naked eye (i.e. too melodramatic to be considered a micro-expression), or in which large head movements would lead to unwanted artifacts in a motion-magnified video.

3.2 Application of Motion Magnification
We applied linear and phase-based Eulerian motion magnification, as formulated by Wu et al. [9] and Wadhwa et al. [7] respectively, to the micro-expression sequences we collected. In specifying temporal frequency ranges that will best amplify motions of different facial features (i.e. head, mouth/lip, brow, eye/pupil), we follow the heuristic that motions of low temporal frequencies correspond to larger facial features (subtle but broad head movements); motions of mid-range temporal frequencies correspond to brow and lip movements; and motions of high temporal frequencies correspond to sudden motions of small facial features (eye/pupil movements). Rough estimates of the temporal frequency benchmarks we used for magnifying motion in these facial features are specified in Table 2.
Facial feature    Temporal frequency range
Head              < 1.0 Hz
Brow and Lip      1.0–5.0 Hz
Eye/Pupil         > 5.0 Hz

Table 2: Estimated temporal frequency benchmarks for magnifying motion of different facial features in our video sequences. Values are hypothesized based roughly on the size and scale of motion to be observed at each location.

Next, we specified temporal bandpass filter parameters for motion-magnifying subtle facial expressions corresponding to each of the 6 universal emotions. We synthesize the qualitative descriptions of facial motion cues in Table 1 with the quantified estimates of optimal frequency ranges in Table 2 to generate temporal filters (specified by low frequency cutoff, ωl, and high frequency cutoff, ωh) for the six emotions.

For example, in magnifying motion of facial features corresponding to the micro-expression of disgust, we chose a temporal bandpass filter of [ωl, ωh] = [0.5, 4.0] Hz that might amplify subtle movement due to "head withdrawal" (< 1.0 Hz; Table 2) and "brow wrinkling" (1.0–5.0 Hz; Table 2). For frustration, we used a temporal bandpass of [ωl, ωh] = [1.5, 6.0] to extract possible movement due to "slanted lips" (1.0–5.0 Hz) and "eye twitching" (> 5.0 Hz). The full set of hypothesized temporal filter parameters is summarized in Table 3.
Micro-expression    Temporal Bandpass [ωl, ωh] (Hz)
Happiness           [1.0, 3.0]
Sadness             [1.0, 3.0]
Disgust             [0.5, 4.0]
Frustration         [1.5, 6.0]
Anger               [1.5, 8.0]
Surprise            [2.0, 8.0]

Table 3: Temporal bandpass filtering parameters used to magnify video sequences of each micro-expression. Values are hypothesized based on the descriptions of facial motion cues in Table 1 and the temporal frequency ranges in Table 2.
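As a sketch of how the hypothesized bands in Table 3 could be realized in software at our 30 fps frame rate, the snippet below builds a Butterworth bandpass per emotion with SciPy. The filter family and order here are illustrative stand-ins chosen for the example, not the exact filters used in [9] and [7].

```python
import numpy as np
from scipy.signal import butter, filtfilt

FS = 30.0                       # camera frame rate (fps)
BANDS = {                       # [omega_l, omega_h] in Hz, from Table 3
    "happiness": (1.0, 3.0), "sadness": (1.0, 3.0), "disgust": (0.5, 4.0),
    "frustration": (1.5, 6.0), "anger": (1.5, 8.0), "surprise": (2.0, 8.0),
}

def temporal_bandpass(series, emotion):
    """Bandpass a (time, ...) array of pixel or subband values for one emotion's band."""
    low, high = BANDS[emotion]
    b, a = butter(2, [low / (FS / 2), high / (FS / 2)], btype="bandpass")
    return filtfilt(b, a, series, axis=0)

# e.g. filter a 200-frame sequence of intensities with the "disgust" band
signal = np.random.rand(200, 64, 64)
filtered = temporal_bandpass(signal, "disgust")
```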
4 Results and Discussion
We used the linear Eulerian magnification implementation from [9] and our own unoptimized MATLAB implementation of phase-based motion processing to magnify the collected video sequences for each facial expression on a quad-core laptop with 32 GB of RAM. While our code for phase-based motion processing took roughly 100 seconds to run on each 200-frame video sequence, with the suitable optimizations discussed in [7] the magnification could be run in real time. All code, as well as both the original and magnified video sequences, is available upon request.
Using the filtering parameters hypothesized in Table 3, the magnification results were good, with magnification clearly visible for all subjects over all emotions (Figure 4). In frames from the motion-magnified videos, we observe facial motion cues corresponding well to those described in Table 1. For example, in frustration (Figure 1), we observe slanting of the lips and eye twitching in the motion-magnified frame that are nearly imperceptible in the source frame. These amplified facial motions are consistent with those presented in Table 1 and described in [1].

Artifacting and noise amplification were more visible after linear magnification, as expected, but not significantly detrimental to video quality (Figure 1b). For both magnification methods, happiness and sadness expressions required more magnification to reach a level of distinctiveness similar to that of the other expressions. The results as a whole indicate that magnification of micro-expressions is achievable by the methods proposed.

4.1 Usage of Simulated Expressions

A notable issue with the results is that they were obtained for artificial data; that is, micro-expressions were simulated by people on command, whereas true micro-expressions arise during emotional concealment or repression. It would of course be desirable to apply the methods described to real data.

However, setting up an environment in which subjects are forced to react in emotionally charged situations, particularly the high-stakes environments that are necessary for micro-expressions to appear, is difficult and proved infeasible in the time allotted for this project.

Despite this drawback, we believe that the data obtained appropriately approximate genuine production of micro-expressions, and at least serve as subtle and brief facial motions. That these sequences can clearly be magnified indicates that the same holds for true micro-expressions, and possibly for normal expressions as well.

4.2 Head Motion Environments

In our experiments, we requested that subjects remain completely motionless except when making facial expressions. While in many applicable real-life environments, such as interrogation rooms, this may be a feasible constraint, it is also likely that the subject is in a more natural state and so exhibits continual small head motions. We noted that for subjects with large head motions the magnification was fairly unhelpful, as the expression motion was entangled with the head motion and difficult to separate. It would then be prudent to see whether expression magnification remains acceptable in the presence of small head motions.

One approach which may be helpful for this case, and would only improve the usefulness of the approach in this paper, would be to separate out the pieces of the face important to the expression, namely the eyes and mouth. Isolating the motion of these components may not help with separating head motion from expression motion, but it would allow better perceptual focus for the viewer.
Figure 4: Sample results after applying phase-based Eulerian motion magnification on facial micro-expressions corresponding to the six universal emotions. (left) shows frame from source video and (right) shows corresponding frame
in motion-magnified video. The magnified facial motions e.g. (a) raising of lip corners; (b) lowering of lip corners;
(c) brow wrinkling; (d) eyes bulging; (e) slanted lips and eye twitching; (f) eyes open wide correspond well to motion
cues described in Table 1.
4.3 Comparison to Full Automation and Human Perception
A natural question the reader may have concerns the usefulness of this approach in comparison to a fully automated system, or to a professional trained in recognizing and understanding facial expressions.
While a fully automated system is useful for unbiased emotion recognition, and current methods do achieve good results [10], they do not achieve the extremely high accuracy required for many high-stakes application environments. In some cases, an incorrect decision could be tremendously costly, such as the decision to release a serial killer after interrogation. As humans have proven to be effective in understanding emotion from expression, the method proposed has an advantage over a fully automated system. Using both in tandem can also prove useful.

Another drawback of fully automated systems is that current machine learning approaches require a significant amount of robust training data to train the necessary classifiers. The time needed for data collection and model training can become an issue with the fully automated approach.

On the other end, professionals trained in recognizing facial expressions may have limited availability in some target applications. Our proposed approach can serve as an effective substitute for these professionals, as well as support professional opinions, a quality valuable in high-stakes environments.
5 Conclusion
We showed the feasibility of amplifying nearly imperceptible facial expressions by applying Eulerian motion
magnification to collected video sequences of microexpressions. In our proposed method, we hypothesize
appropriate temporal filtering parameters for magnifying
motion of different facial features. Our motion-magnified
results correspond well to accepted facial motion cues of
the six universal emotions classified by psychological research. This approach sits well between fully automated
methods for emotion detection, which may not be suitable for high-stakes environments, and pure human perception, which may be unable to detect subtle changes in
facial features at the scale of micro-expressions.
6 Acknowledgements
We thank Cole Graham, Harlin Lee, Rebecca Shi,
Melanie Abrams, Staly Chin, Norman Cao, Eurah Ko,
Aditya Gopalan, Ryan Fish, Sophie Mori, Kevin Hu, and
Olivia Chong for volunteering in our micro-expression
experiments.
References
[1] M. J. Black and Y. Yacoob. Recognizing facial expressions
in image sequences using local parameterized models of
image motion. Int. Journal of Computer Vision, 25(1):23–
48, 1997.
[2] P. Ekman. Lie catching and microexpressions. The philosophy of deception, pages 118–133, 2009.
[3] C. Liu, A. Torralba, W. T. Freeman, F. Durand, and E. H.
Adelson. Motion magnification. In ACM Transactions on
Graphics (TOG), volume 24, pages 519–526. ACM, 2005.
[4] T. Pfister, X. Li, G. Zhao, and M. Pietikainen. Recognising spontaneous facial micro-expressions. Computer Vision (ICCV), pages 519–526, 2011.
[5] M. Shreve, S. Godavarthy, V. Manohar, D. Goldgof, and
S. Sarkar. Towards macro-and micro-expression spotting
in video using strain patterns. Applications of Computer
Vision (WACV), 2009.
[6] M. Shreve, S. Godavarthy, V. Manohar, D. Goldgof, and
S. Sarkar. Macro-and micro-expression spotting in long
videos using spatio-temporal strain. IEEE Conference on
Automatic Face and Gesture Recognition, 2011.
[7] N. Wadhwa, M. Rubinstein, F. Durand, and W. T. Freeman.
Phase-based video motion processing. ACM Transactions
on Graphics (TOG), 32(4):80, 2013.
[8] N. Wadhwa, M. Rubinstein, F. Durand, and W. T. Freeman.
Riesz pyramids for fast phase-based video magnification.
2014.
[9] H.-Y. Wu, M. Rubinstein, E. Shih, J. V. Guttag, F. Durand,
and W. T. Freeman. Eulerian video magnification for revealing subtle changes in the world. ACM Trans. Graph.,
31(4):65, 2012.
[10] Q. Wu, X. Shen, and X. Fu. The machine knows what you are hiding: An automatic micro-expression recognition system. Computing and Intelligent Systems, 2011.