FACULTY OF ENGINEERING
Department of Electronics and Informatics (ETRO)

A MULTIMODAL APPROACH TO AUDIOVISUAL TEXT-TO-SPEECH SYNTHESIS

Dissertation submitted in fulfilment of the requirements for the degree of Doctor in de Ingenieurswetenschappen (Doctor in Engineering) by ir. Wesley Mattheyses

Advisor: prof. dr. ir. Werner Verhelst
Brussels, June 2013

EXAMINING COMMITTEE
prof. Bart de Boer - Vrije Universiteit Brussel - Chair
prof. Rik Pintelon - Vrije Universiteit Brussel - Vice-chair
prof. Hichem Sahli - Vrije Universiteit Brussel - Secretary
prof. Barry-John Theobald - University of East Anglia - Member
dr. Juergen Schroeter - AT&T Labs - Member
prof. Werner Verhelst - Vrije Universiteit Brussel - Advisor

Preface

Back in the days when I was still studying to become an engineer, I honestly never thought I would find myself writing this PhD thesis. It is not that I had a clear goal in mind for my professional career after graduating, but I was always sure that pure research was not really my cup of tea. Things started to change in my final year, when I worked on my master thesis on auditory text-to-speech synthesis with prof. Werner Verhelst as supervisor. I was still not a big fan of staring at the Matlab editor from morning till evening, but on the other hand it was fascinating to see how something as complicated as a human speech signal could be mimicked by running programming code I had written myself. When prof. Verhelst offered me the possibility to continue working on my thesis subject in the form of a PhD, it took a few days before I could convince myself that it might indeed be possible for me to actually get a PhD and that I should at least try to take up the challenge. Looking back, I am very glad I took this unique once-in-a-lifetime opportunity that was offered to me.

The first year as a PhD student confronted me with the biggest challenge I had ever experienced. Everybody who went from secondary school to university knows the feeling: everything that seemed hard to do in secondary school suddenly seems negligible in comparison with what is expected from you in higher education. I experienced this terrifying feeling twice: when starting a PhD, the amount of knowledge you are missing to fulfil your tasks seems insurmountable and the list of things to do seems endless and infeasible. In addition, you know that no one around you has the time or the knowledge to solve the problems for you. Once seated behind your computer, the only thing to do is dig into the literature (Google and the Internet become your very best friends), let your brain work overtime, and start putting in place those very first bricks that eventually have to make up the huge castle representing your PhD.

After one year of working as a PhD student, I was lucky to be appointed as a teaching assistant (AAP) at ETRO. From that moment on, the workload became even larger, since I became responsible for teaching first-year engineering students how to write program code. This made me find out that I really enjoy teaching and transferring skills to other people.
It also made me realize that it is a challenge to let first-year students find their focus and motivation (and to be silent while you are explaining something), and that it is not easy to clearly explain things that are not straightforward, such as programming. Being assigned as a teaching assistant slowed down my PhD research, but on the other hand it was a blessing that I did not have to worry about funding, and the teaching offered a nice change from working with my good friends Matlab and Visual Studio.

Like anything else in life, the process of getting a PhD is a bumpy road with lots of ups and downs. Only researchers know the frustration of working for months on an optimization, only to find out in the end that it does not improve the results a single bit. Debugging and the unavoidable system crashes have altogether probably cost me a year of my life. On the other hand, research also offers incomparable highs when your code finally produces the desired output or when an experiment yields the hoped-for results. If I look back upon the 6.5 years I spent in research, these years constitute a very valuable lesson for life in general, teaching you to never give up trying to reach your goals, to pull yourself back together after a failure and to keep believing in the things you do.

Although the process of getting a PhD often feels like a lonely quest in which you are thrown back on your own resources, the opposite is true, and it is important to never forget the people around you and their valuable contributions. Therefore, I have to thank prof. Werner Verhelst for convincing me to start my PhD, for finding the necessary funding, for elaborating together on the proposed audiovisual speech synthesis approach and for carefully revising my publications and this thesis. I also thank my colleagues at ETRO and at DSSP for the nice working atmosphere I have been able to experience. Getting through a stressful day of failing programming code is only possible when it can be alternated with interesting, pleasant and especially funny conversations with the people around you. For this I thank my co-workers at building K and at building Ke, especially offices Ke3.13 and Ke3.14. Gio, Selma, Lukas, Tomas: I cannot imagine better office co-workers than you guys. I feel we have become real friends and that we will keep on meeting each other in our post-VUB life. Lukas and Tomas, you guys always succeeded in making me laugh with your (sometimes somewhat special) sense of humour! If I go through my picture collection, it is amazing how many events and parties we have already attended together (Tomas, you’re looking good in every single picture . . . sort of). You guys also made the Interspeech 2011 conference in Firenze probably the nicest conference ever (recall “Luca de la Tacci”, “Tommaso Blanchetti” and “Matteo di Mattei”). I hope we continue having fun together in the future and that we stay closely in touch. Obviously, colleagues are not only necessary for the appropriate working atmosphere, they also supply valuable advice and support to your research. I am very grateful to Lukas “Mr C++” Latacz for all the work he spent on the development of the linguistic front-end, the Festival backbone, and other parts of the system. Countless times he helped me with implementing the C++ part of the system, and I can honestly say I could not have developed the system as it is now without him.
I am also very thankful to Tomas “Eagle-Eye” Dekens, who always meticulously tested my subjective perception experiments. I also thank the other colleagues, friends and family who regularly participated in the subjective evaluations, such as Lukas, Selma, Gio, Yorgos, Jan L., Bruno, Henk, Pieter, Mikael, Chris, Jan G., Eric and Jenny. I am also grateful to Yorgos, who assisted in performing high-quality audiovisual recordings at the ETRO recording studio. These recordings would also never have been possible without the cooperation of Britt, Annick, Evelien and Kaat, who were willing to act as voice talent. In addition, I should not forget to thank Mike, Joris, Isabel and Bruno, who shared the joy (and the frustration) of teaching at the university, and to thank prof. Jacques Tiberghien, prof. Ann Dooms and prof. Jan Lemeire for their confidence in me for teaching their topics. In this final stage of my PhD, I would also like to show my gratitude to prof. Barry-John Theobald, dr. Juergen Schroeter, prof. Bart de Boer, prof. Rik Pintelon, prof. Hichem Sahli and prof. Werner Verhelst for being part of the PhD committee and for dedicating their time to reviewing this dissertation.

Finally, I would like to take this opportunity to thank the most important people in my life for the unconditional love and support they have given me. I would like to thank my wife Annick for brightening up my life and for offering me exactly the warmth, the joy, the distraction and the love that made me go on after hard and discouraging periods in the PhD research. Furthermore, words cannot express my gratitude for my parents, Eric and Jenny, who supported and motivated me from the beginning of my studies at the VUB until the end of my PhD. They kept on enduring my complaints and my doubts, and every time they succeeded in motivating me to keep going. During the past twelve years they organized their lives around lessons, exams and especially the NMBS train schedule (which is quite a burden). I cannot thank them enough for this. I would like to conclude my words of gratitude with a special mention of my father, who over the years spent more time in the car waiting for me at the train station than anybody should have to bear in a lifetime. Thanks for getting me home, dad!

Wesley Mattheyses
Spring 2013

Publications

Journal papers
◦ W. Mattheyses, L. Latacz and W. Verhelst, “On the importance of audiovisual coherence for the perceived quality of synthesized visual speech”, EURASIP Journal on Audio, Speech, and Music Processing, SI: Animating Virtual Speakers or Singers from Audio: Lip-Synching Facial Animation, 2009.
◦ W. Mattheyses, L. Latacz and W. Verhelst, “Comprehensive Many-to-Many Phoneme-to-Viseme Mapping and its Application for Concatenative Visual Speech Synthesis”, Speech Communication, Vol.55(7-8), pp.857-876, 2013.

Conference papers

First author
◦ W. Mattheyses, W. Verhelst and P. Verhoeve, “Robust Pitch Marking For Prosodic Modification Of Speech Using TD-Psola”, Proc. SPS-DARTS - IEEE BENELUX/DSP Valley Signal Processing Symposium, pp.43-46, 2006.
◦ W. Mattheyses, L. Latacz, Y.O. Kong and W. Verhelst, “A Flemish Voice for the Nextens Text-To-Speech System”, Proc. Fifth Slovenian and First International Language Technologies Conference, 2006.
◦ W. Mattheyses, L. Latacz, W. Verhelst and H. Sahli, “Multimodal Unit Selection for 2D Audiovisual Text-to-Speech Synthesis”, International workshop on Machine Learning for Multimodal Interaction, Springer Lecture Notes in Computer Science, Vol.5237, pp.125-136, 2008.
◦ W. Mattheyses, L. Latacz and W. Verhelst, “Multimodal Coherency Issues in Designing and Optimizing Audiovisual Speech Synthesis Techniques”, Proc. International Conference on Auditory-visual Speech Processing, pp.47-52, 2009.
◦ W. Mattheyses, L. Latacz and W. Verhelst, “Active Appearance Models for Photorealistic Visual Speech Synthesis”, Proc. Interspeech, pp.1113-1116, 2010.
◦ W. Mattheyses, L. Latacz and W. Verhelst, “Optimized Photorealistic Audiovisual Speech Synthesis Using Active Appearance Modeling”, Proc. International Conference on Auditory-visual Speech Processing, pp.148-153, 2010.
◦ W. Mattheyses, L. Latacz and W. Verhelst, “Automatic Viseme Clustering for Audiovisual Speech Synthesis”, Proc. Interspeech, pp.2173-2176, 2011.
◦ W. Mattheyses, L. Latacz and W. Verhelst, “Auditory and Photo-realistic Audiovisual Speech Synthesis for Dutch”, Proc. International Conference on Auditory-visual Speech Processing, pp.55-60, 2011.

Other
◦ S. Yilmazyildiz, W. Mattheyses, G. Patsis and W. Verhelst, “Expressive Speech Recognition and Synthesis as Enabling Technologies for Affective Robot-Child Communication”, Advances in Multimedia Information Processing - PCM06, Springer Lecture Notes in Computer Science, Vol.4261, pp.1-8, 2006.
◦ L. Latacz, Y.O. Kong, W. Mattheyses and W. Verhelst, “Novel Text-to-Speech Reading Modes for Educational Applications”, Proc. ProRISC/IEEE Benelux Workshop on Circuits, Systems and Signal Processing, pp.148-153, 2006.
◦ L. Latacz, Y.O. Kong, W. Mattheyses and W. Verhelst, “An Overview of the VUB Entry for the 2008 Blizzard Challenge”, Proc. Blizzard Challenge 2008, 2008.
◦ L. Latacz, W. Mattheyses and W. Verhelst, “The VUB Blizzard Challenge 2009 Entry”, Proc. Blizzard Challenge 2009, 2009.
◦ L. Latacz, W. Mattheyses and W. Verhelst, “The VUB Blizzard Challenge 2010 Entry: Towards Automatic Voice Building”, Proc. Blizzard Challenge 2010, 2010.
◦ S. Yilmazyildiz, L. Latacz, W. Mattheyses and W. Verhelst, “Expressive Gibberish Speech Synthesis for Affective Human-Computer Interaction”, Proc. International Conference on Text, Speech and Dialogue, pp.584-590, 2010.
◦ L. Latacz, W. Mattheyses and W. Verhelst, “Joint Target and Join Cost Weight Training for Unit Selection Synthesis”, Proc. Interspeech, pp.321-324, 2011.

Abstracts
◦ W. Mattheyses, L. Latacz and W. Verhelst, “2D Audiovisual Text-to-Speech Synthesis for Human-Machine Interaction”, Proc. Speech and Face to Face Communication, pp.24-25, 2008.
◦ W. Mattheyses and W. Verhelst, “Photorealistic 2D Audiovisual Text-to-Speech Synthesis using Active Appearance Models”, Proc. ACM / SSPNET International Symposium on Facial Analysis and Animation, p.13, 2010.

Synopsis

Speech has always been the most important means of communication between humans. When a message is conveyed, it is encoded in two separate signals: an auditory speech signal and a visual speech signal. The auditory speech signal consists of a series of speech sounds that are produced by the human speech production system. In order to generate different sounds, the parameters of this speech production system are varied.
Since some of the human articulators are visible to an observer (e.g., the lips, the teeth and the tongue), the variations of these visible articulators while uttering the speech sounds define a visual speech signal. It is well known that an optimal conveyance of the message requires that both the auditory and the visual speech signal can be perceived by the receiver. During the last decades the development of advanced computer systems has led to the current situation in which the vast majority of appliances, from industrial machinery to small household devices, are computer-controlled. This implies that nowadays people interact countless times with computer systems in everyday situations. Since the ultimate goal is to make this interaction feel completely natural and familiar, the optimal way to interact with a machine is by means of speech. Similar to the speech communication between humans, the most appropriate human-machine interaction consists of audiovisual speech signals. In order to allow the machine to transfer a spoken message towards its users, the device has to contain a so-called audiovisual speech synthesizer. This is a system that is capable of generating a novel audiovisual speech signal, typically from text input (so-called audiovisual text-to-speech (AVTTS) synthesis). Audiovisual speech synthesis has been a popular research topic in the last decade. The synthetic auditory speech mode, created by the synthesizer, consists of a waveform that resembles as closely as possible an original acoustic speech signal uttered by a human. The synthetic visual speech signal displays a virtual speaker exhibiting the speech gestures that match the synthetic auditory speech information. The great majority of the AVTTS synthesizers perform the synthesis in separate stages: in the first stages the auditory and the visual speech signals are synthesized consecutively and often completely independently, after which both synthetic speech modes are synchronized and multiplexed. Unfortunately, this strategy is unable to optimize the level of audiovisual coherence in the output signal. This motivates the development of a single-phase AVTTS synthesis approach, in which both speech modes are generated simultaneously, which makes it possible to maximize the coherence between the two synthetic speech signals. In this work a single-phase AVTTS synthesis technique was developed that constructs the desired speech signal by concatenating audiovisual speech segments that were selected from a database containing original audiovisual speech recordings from a single speaker. By selecting segments containing an original combination of auditory and visual speech information, the original coherence between both speech modes is copied as much as possible to the synthetic speech signal. Obviously, the simultaneous synthesis of the auditory and the visual speech entails some additional difficulties in optimizing the individual quality of both synthetic speech modes. Nevertheless, through subjective perception experiments it was concluded that the maximization of the level of audiovisual coherence is indeed necessary for achieving an optimal perception of the synthetic audiovisual speech signal. In the next part of the work it was investigated how the quality of the synthetic speech synthesized by the AVTTS system could be enhanced. To this end, the individual quality of the synthetic visual speech mode was improved, while ensuring not to affect the audiovisual coherence.
The original visual speech from the database was parameterized using an Active Appearance Model. This allows many optimizations, such as a normalization of the original speech data and a smoothing of the synthetic visual speech without affecting the visual articulation strength. Next, by the construction of a new extensive Dutch audiovisual speech database, the first-ever system capable of high-quality photorealistic audiovisual speech synthesis for Dutch was developed. In a final part of this work it was investigated how the AVTTS synthesis techniques can be adopted to create a novel visual speech signal matching an original auditory speech signal and its text transcript. For visual-only synthesis, the speech information can be described by means of either phoneme or viseme labels. The attainable synthesis quality using phonemes was compared with the synthesis quality attained using both standardized and speaker-dependent many-to-one phoneme-to-viseme mappings. In addition, novel context-dependent many-to-many phoneme-to-viseme mapping strategies were investigated and evaluated for synthesis. It was found that these novel viseme labels more accurately describe the visual speech information compared to phonemes and that they enhance the attainable synthesis quality in case only a limited amount of original speech data is available. Samenvatting Gesproken communicatie, bestaande uit een auditief en een visueel spraaksignaal, is altijd al de belangrijkste vorm van menselijke interactie geweest. Een optimale overdracht van de boodschap is enkel mogelijk indien zowel het auditieve als het visuele signaal adequaat kunnen worden waargenomen door de ontvanger. Vandaag de dag interageren we talloze keren met computersystemen in dagdagelijkse situaties. Het uiteindelijke doel bestaat erin om deze communicatie zo natuurlijk en vertrouwd mogelijk te laten overkomen. Dit impliceert dat de computersystemen best interageren met hun gebruikers door middel van gesproken communicatie. Net zoals de interactie tussen mensen onderling, zal de meest optimale vorm van mens-machine communicatie bestaan uit audiovisuele spraaksignalen. Om het mogelijk te maken het computersysteem een gesproken bericht te laten verzenden naar zijn gebruikers is een audiovisueel tekst-naar-spraak systeem, hetwelk in staat is om een nieuw audiovisueel spraaksignaal aan te maken gebaseerd op een gegeven tekst, noodzakelijk. Deze thesis focust op een spraaksynthese waarbij beide spraakmodaliteiten tegelijkertijd worden gesynthetiseerd. De voorgestelde synthesestrategie maakt het gewenste spraaksignaal aan door het samenvoegen van audiovisuele spraaksegmenten, bestaande uit een originele combinatie van akoestische en visuele spraakinformatie. Dit leidt tot een maximale audiovisuele coherentie tussen beide synthetische spraakmodaliteiten. Spraaksynthese van hoge kwaliteit is bereikt door middel van verscheidene optimalisaties, zoals een normalisatie van de originele visuele spraakdata en een smoothing van het gesynthetiseerde visuele spraaksignaal waarbij de audiovisuele coherentie zo weinig mogelijk wordt aangetast. Met behulp van een nieuwe, uitgebreide en geoptimaliseerde Nederlandstalige audiovisuele spraakdatabase is het allereerste tekst-naar-spraak systeem gerealiseerd dat in staat is om fotorealistische Nederlandstalige audiovisuele spraak te genereren van hoge kwaliteit. 
Door middel van subjectieve perceptie-experimenten is vastgesteld dat de maximalisatie van het niveau van audiovisuele coherentie inderdaad noodzakelijk is om een optimale waarneming van de synthetische spraak te bekomen. In het geval dat er reeds een akoestisch signaal beschikbaar is, volstaat een automatische generatie van de visuele spraakmodaliteit. Hierbij kan de spraakinformatie worden beschreven aan de hand van zowel fonemen als visemen. De haalbare synthesekwaliteit gebruik makende van een foneem-gebaseerde spraaklabeling is vergeleken met de haalbare synthesekwaliteit gebruik makende van een labeling gebaseerd op zowel gestandaardiseerde als spreker-afhankelijke veel-op-een relaties tussen fonemen en visemen. Tot slot zijn er ook nieuwe context-afhankelijke veel-op-veel relaties tussen fonemen en visemen opgesteld. Door middel van objectieve en subjectieve evaluaties is aangetoond dat deze nieuwe viseemdefinities leiden tot een verbetering van de visuele spraaksynthese.

Supplementary Data

Supplementary data associated with this dissertation can be found online at http://www.etro.vub.ac.be/Personal/wmatthey/phd_demo.htm
Audiovisual samples illustrating the following sections are supplied:
Section 3.3.2 - Databases for synthesis
Section 3.5 - Concatenated audiovisual segments
Section 3.6 - Evaluation of the audiovisual speech synthesis strategy
Section 3.7 - Evaluation of audiovisual optimal coupling
Section 4.4 - Optimized audiovisual speech synthesis
Section 4.5 - Evaluation of the AAM-based AVTTS approach
Section 5.2 - Dutch audiovisual database “AVKH”
Section 5.3 - AVTTS synthesis for Dutch
Section 5.4.1 - Turing test
Section 5.4.2 - Single-phase vs Two-phase audiovisual speech synthesis
Section 6.4 - Evaluation of many-to-one mapping schemes for English
Section 6.6 - Evaluation of many-to-many mapping schemes for Dutch

Contents

Preface
Publications
Synopsis
Samenvatting
Supplementary Data
Contents
List of Figures
List of Tables
Abbreviations
1 Introduction
  1.1 Spoken communication
  1.2 Speech signals
    1.2.1 Human speech production
    1.2.2 Description of auditory speech
  1.3 Multimodality of speech
  1.4 Synthetic speech
    1.4.1 Motivation
    1.4.2 A brief history on human-machine interaction using auditory speech
    1.4.3 Multimodality of spoken human-machine communication
    1.4.4 Applications
  1.5 Audiovisual speech synthesis at the VUB
    1.5.1 Thesis outline
    1.5.2 Contributions
2 Generation of synthetic visual speech
  2.1 Facial animation and visual speech synthesis
  2.2 An overview on visual speech synthesis
    2.2.1 Input requirements
    2.2.2 Output modality
    2.2.3 Output dimensions
    2.2.4 Photorealism
    2.2.5 Definition of the visual articulators and their variations
      2.2.5.1 Speech synthesis in 3D
      2.2.5.2 Speech synthesis in 2D
      2.2.5.3 Standardization: FACS and MPEG-4
    2.2.6 Prediction of the target speech gestures
      2.2.6.1 Coarticulation
      2.2.6.2 Rule-based synthesis
      2.2.6.3 Concatenative synthesis
      2.2.6.4 Synthesis based on statistical prediction
  2.3 Positioning of this thesis in the literature
3 Single-phase concatenative AVTTS synthesis
  3.1 Motivation
  3.2 A concatenative audiovisual text-to-speech synthesizer
    3.2.1 General text-to-speech workflow
    3.2.2 Concatenative single-phase AVTTS synthesis
  3.3 Database preparation
    3.3.1 Requirements
    3.3.2 Databases used for synthesis
    3.3.3 Post-processing
      3.3.3.1 Phonemic segmentation
      3.3.3.2 Symbolic features
      3.3.3.3 Acoustic features
      3.3.3.4 Visual features
  3.4 Audiovisual segment selection
    3.4.1 Minimization of a global cost function
    3.4.2 Target costs
      3.4.2.1 Phonemic match
      3.4.2.2 Symbolic costs
      3.4.2.3 Safety costs
    3.4.3 Join costs
      3.4.3.1 Auditory join costs
      3.4.3.2 Visual join costs
    3.4.4 Weight optimization
      3.4.4.1 Cost scaling
      3.4.4.2 Weight distribution
  3.5 Audiovisual concatenation
    3.5.1 A visual mouth-signal and a visual background-signal
    3.5.2 Audiovisual synchrony
    3.5.3 Audio concatenation
    3.5.4 Video concatenation
  3.6 Evaluation of the audiovisual speech synthesis strategy
    3.6.1 Single-phase and two-phase synthesis approaches
    3.6.2 Evaluation of the audiovisual coherence
      3.6.2.1 Method and subjects
      3.6.2.2 Test strategies
      3.6.2.3 Samples and results
      3.6.2.4 Discussion
    3.6.3 Evaluation of the perceived naturalness
      3.6.3.1 Method and subjects
      3.6.3.2 Test strategies
      3.6.3.3 Samples and results
      3.6.3.4 Discussion
    3.6.4 Conclusions
  3.7 Audiovisual optimal coupling
    3.7.1 Concatenation optimization
      3.7.1.1 Maximal coherence
      3.7.1.2 Maximal smoothness
      3.7.1.3 Maximal synchrony
    3.7.2 Perception of non-uniform audiovisual asynchrony
    3.7.3 Objective smoothness assessment
    3.7.4 Subjective evaluation
    3.7.5 Conclusions
  3.8 Summary and conclusions
4 Enhancing the visual synthesis using AAMs
  4.1 Introduction and motivation
  4.2 Facial image modeling
  4.3 Audiovisual speech synthesis using AAMs
    4.3.1 Motivation
    4.3.2 Synthesis overview
    4.3.3 Database preparation and model training
    4.3.4 Segment selection
      4.3.4.1 Target costs
      4.3.4.2 Join costs
    4.3.5 Segment concatenation
  4.4 Improving the synthesis quality
    4.4.1 Parameter classification
    4.4.2 Database normalization
    4.4.3 Differential smoothing
    4.4.4 Spectral smoothing
  4.5 Evaluation of the AAM-based AVTTS approach
    4.5.1 Visual speech-only
      4.5.1.1 Test setup
      4.5.1.2 Participants and results
      4.5.1.3 Discussion
    4.5.2 Audiovisual speech
      4.5.2.1 Test setup
      4.5.2.2 Participants and results
      4.5.2.3 Discussion
  4.6 Summary and conclusions
5 High-quality AVTTS synthesis for Dutch
  5.1 Motivation
  5.2 Database construction
    5.2.1 Text selection
      5.2.1.1 Domain-specific
      5.2.1.2 Open domain
      5.2.1.3 Additional data
    5.2.2 Recordings
    5.2.3 Post-processing
      5.2.3.1 Acoustic signals
      5.2.3.2 Video signals
  5.3 AVTTS synthesis for Dutch
  5.4 Evaluation of the Dutch AVTTS system
    5.4.1 Turing Test
      5.4.1.1 Introduction
      5.4.1.2 Test set-up and test samples
      5.4.1.3 Participants and results
      5.4.1.4 Discussion
    5.4.2 Comparison between single-phase and two-phase audiovisual speech synthesis
      5.4.2.1 Motivation
      5.4.2.2 Method and samples
      5.4.2.3 Subjects and results
      5.4.2.4 Discussion
  5.5 Summary and conclusions
6 Context-dependent visemes
  6.1 Introduction
    6.1.1 Motivation
    6.1.2 Concatenative VTTS synthesis
  6.2 Visemes
    6.2.1 The concept of visemes
    6.2.2 Visemes for the Dutch language
  6.3 Phoneme-to-viseme mapping for visual speech synthesis
    6.3.1 Application of visemes in VTTS systems
    6.3.2 Discussion
    6.3.3 Problem statement
  6.4 Evaluation of Nx1 mapping schemes for English
    6.4.1 Design of many-to-one phoneme-to-viseme mapping schemes
    6.4.2 Experiment
    6.4.3 Conclusions
  6.5 Many-to-many phoneme-to-viseme mapping schemes
    6.5.1 Tree-based clustering
      6.5.1.1 Decision trees
      6.5.1.2 Decision features
      6.5.1.3 Pre-cluster
      6.5.1.4 Clustering into visemes
      6.5.1.5 Objective candidate test
    6.5.2 Towards a useful many-to-many mapping scheme
      6.5.2.1 Decreasing the number of visemes
      6.5.2.2 Evaluation of the final NxM visemes
  6.6 NxM visemes for concatenative visual speech synthesis
    6.6.1 Application in a large-database system
    6.6.2 Application in limited-database systems
      6.6.2.1 Limited databases
      6.6.2.2 Evaluation of the segment selection
      6.6.2.3 Evaluation of the synthetic visual speech
  6.7 Summary and conclusions
7 Conclusions
  7.1 Brief summary
  7.2 General conclusions
  7.3 Future work
    7.3.1 Enhancing the audiovisual synthesis quality
    7.3.2 Adding expressions and emotions
    7.3.3 Future evaluations
A The Viterbi algorithm
B English phonemes
C English visemes

List of Figures

1.1 The human speech production system
1.2 The talking computer HAL-9000
1.3 Wheatstone’s Talking Machine
1.4 Fictional rudimentary synthetic visual speech
1.5 Examples of various visual speech synthesis systems
1.6 Intelligibility scores as a function of acoustic degradation
1.7 The uncanny valley effect
2.1 Examples of mechanically generated visual speech
2.2 Georges Demeny’s “Phonoscope”
2.3 Pioneering realistic computer-generated facial expressions
2.4 Two approaches for audiovisual text-to-speech synthesis
2.5 Beyond pure 2D/3D synthesis
2.6 Various examples of synthetic 2D visual speech
2.7 Various examples of synthetic 3D visual speech
2.8 Anatomy-based facial models
2.9 The VICON motion capture system
2.10 2D visual speech synthesis
2.11 The facial feature points defined in the MPEG-4 standard
2.12 Visual speech synthesis using articulation rules to define keyframes
2.13 Modelling visual coarticulation using the Cohen-Massaro model
2.14 Visual speech synthesis based on the concatenation of segments of original speech data
2.15 Visual speech synthesis based on statistical prediction of visual features
3.1 Overview of the AVTTS synthesis
3.2 Diphone-based unit selection
3.3 Overview of the audiovisual unit selection approach
3.4 Example frames from the AVBS audiovisual database
3.5 Example frames from the LIPS2008 audiovisual database
3.6 Landmarks indicating the various parts of the face
3.7 Detection of the teeth and the mouth-cavity (1)
3.8 Detection of the teeth and the mouth-cavity (2)
3.9 Detection of the teeth and the mouth-cavity (3)
3.10 Unit selection synthesis using target and join costs
3.11 Unit selection trellis
3.12 Target costs applied in the AVTTS synthesis
3.13 Join costs applied in the AVTTS synthesis
3.14 Join cost histograms
3.15 Merging the mouth-signal with the background signal
3.16 Auditory concatenation artifacts
3.17 Pitch-synchronous audio concatenation
3.18 Visual concatenation by image morphing
3.19 Audiovisual consistence test results
3.20 Naturalness test results
3.21 Audiovisual optimal coupling: methods
3.22 Audiovisual optimal coupling: resulting signals
3.23 Objective smoothness measures
4.1 A point-light visual speech signal
4.2 AAM-based image modeling
4.3 AVTTS synthesis using an active appearance model
4.4 AAM-based representation of the original visual speech data
4.5 AAM sub-trajectory concatenation
4.6 Relation between AAM parameters and physical properties
4.7 Speech-correlation of the AAM shape and texture parameters
4.8 AAM Normalization
4.9 Spectral smoothing of a parameter trajectory
4.10 Evaluation of AAM-based AVTTS synthesis: visual speech-only
4.11 Evaluation of AAM-based AVTTS synthesis: audiovisual speech
5.1 Overview of the recording setup
5.2 Some details of the recording setup
5.3 Example frames from the AVKH audiovisual speech database
5.4 Landmark information for the AVKH database
5.5 AAM resynthesis for the AVKH database
5.6 AAM modelling of the complete face
5.7 Final output of the Dutch AVTTS system
5.8 Ratio of incorrect answers obtained by the experts and the non-experts
5.9 Ratio of incorrect answers for each type of sentence
5.10 Comparison between single-phase and two-phase audiovisual synthesis
6.1 Subjective evaluation of the Nx1 phoneme-to-viseme mappings
6.2 Pre-cluster features
6.3 Candidate test results for the tree-based visemes
6.4 Candidate test results for the final visemes
6.5 Relation between the visual speech synthesis stages and the objective measures
6.6 Evaluation of the segment selection using a large database
6.7 Evaluation of the segment selection using a large database (2)
6.8 Evaluation of the segment selection using a limited database
6.9 Evaluation of the segment selection using a limited database (2)
6.10 DTW-based evaluation of the final synthesis result
6.11 Subjective test results evaluating the synthesis quality using various NxM visemes
6.12 Subjective test results evaluating the synthesis quality using the most optimal NxM visemes
7.1 Facial expressions related to a happy emotion
A.1 Unit selection trellis
A.2 Unit selection trellis costs

List of Tables

1.1 Example of a many-to-one phoneme-to-viseme mapping for English
3.1 Symbolic database features
3.2 Test strategies for the audiovisual consistence test
3.3 Test strategies for the naturalness test
3.4 Detection of local audiovisual asynchrony
3.5 Various optimal coupling configurations
3.6 Subjective evaluation of the optimal coupling approaches
4.1 Normalization test results: visual speech-only
4.2 Normalization test results: audiovisual speech
4.3 Differential visual concatenation smoothing
4.4 Subjective trajectory filtering experiment
4.5 Optimal filter settings
5.1 Turing test results
5.2 Turing test results for the non-experts
6.1 Subjective evaluation of the Nx1 phoneme-to-viseme mappings for English: Wilcoxon signed-rank analysis
6.2 Decision tree configurations
6.3 Mapping from tree-based visemes to final NxM visemes
6.4 Construction of limited databases
6.5 Subjective test results evaluating the synthesis quality using various NxM visemes: Wilcoxon signed-rank analysis
6.6 Subjective test results evaluating the synthesis quality using the most optimal NxM visemes: Wilcoxon signed-rank analysis
B.1 English phone set and classifications
C.1 Many-to-one phoneme-to-viseme mappings for English

Abbreviations

Abbreviation - Expansion - Section
AAM - Active Appearance Model - 4.2
ANN - Artificial Neural Network - 2.2.1
AVTTS - Audiovisual Text-to-Speech - 2.2.2
C/V - Consonant/Vowel - 6.5.1.2
CGI - Computer Generated Imagery - 2.1
EM-PCA - Expectation-Maximisation Principal Component Analysis - 2.2.5.1
FACS - Facial Action Coding System - 2.2.5.3
FAP - Facial Action Parameter - 2.2.5.3
FFT - Fast Fourier Transform - 4.4.4
HMM - Hidden Markov Model - 1.4.2
ICA - Independent Component Analysis - 2.2.5.1
LPC - Linear Predictive Coding - 2.2.1
LSP - Line Spectral Pairs - 2.2.1
MFCC - Mel-Frequency Cepstrum Coefficient - 2.2.1
MOS - Mean Opinion Score - 3.6.2.1
Nx1 - Many-to-One - 6.2
NxM - Many-to-Many - 6.2
PCA - Principal Component Analysis - 2.2.1
PEC - Phonemic Equivalence Class - 6.2.1
TTS - Text-to-Speech - 2.2.1
VTTS - Visual Text-to-Speech - 6.1

1 Introduction

1.1 Spoken communication

One of the main driving forces behind the development of mankind is its capability to effectively transfer thoughts and ideas by means of sound signals.
Where at the dawn of man this communication was nothing more than a way to express basic instincts like fear and excitement by means of some grunts and growls (as can nowadays still be observed in some higher mammals), the communication between humans slowly but surely evolved towards a real spoken language. A complicated message can be passed from one individual towards many others by uttering a series of sounds of which the subtle variations can be perceived by the receiver. When both the sender and the receiver agree on the linkage between such a series of sound segments and real-life concepts, a successful transfer of the information is possible. Throughout history, countless spoken languages have been employed, of which only a few are still practised today. Each of these languages exhibits its own rules and agreements on the nature of its particular speech sounds and their semantic meaning. They define the elementary building blocks of which a speech signal must be constructed in order to successfully transfer a message from the speaker to the receiver. These building blocks are called phones and phonemes, two concepts that can easily be mixed up, although there is a subtle yet important difference between them. A phone is an elementary sound that can be produced by the human speech production system. A speech signal is generated by consecutively altering the parameters of this speech production system in order to create the auditory speech signal, which consists of a sequence of such phones. On the other hand, a spoken language is defined by its characteristic phonemes. Each phoneme is a unique label for a set of phones that are cognitively equivalent for the particular language. In an auditory speech signal, the replacement of a single phone, corresponding to phoneme X, with another phone that corresponds to a phoneme other than X, causes a change in the message conveyed by the speech signal. Two phones that correspond to the same phoneme are called allophones. Replacing a single phone in a speech signal by one of its allophones does not change the semantic meaning of the speech, although it may sound strange or less intelligible. Since spoken language has always been the primary means of communication between humans, the interest in this topic has existed since the beginning of science. While research in the field of anatomy and physiology has been trying to figure out how the human body is able to produce the various phones, the field of phonology has been trying to describe spoken languages by identifying their characteristic phonemes, their prosodic properties (pitch, timing, accents) and the semantic meaning of these features.

1.2 Speech signals

1.2.1 Human speech production

The part of the human body that is responsible for the production of speech sounds is a complicated system that consists of two major components: organs of phonation and organs of articulation. The phonation system creates the air flow that will eventually leave the body and causes a change in air pressure which is needed to carry the spoken message. The two most important organs for phonation are the lungs and the larynx. The first one is responsible for the production of the necessary air pressure, while the latter controls the vibration of the vocal folds. The articulatory organs are used to create the resonances and modulations that are necessary to shape the air flow in order to achieve the target speech sound.
Important articulators are the lower jaw, the tongue, the lips and the velum. The exact manner in which a particular sound is produced is a very complicated process, for which all the different phonation and articulation organs cooperate. The speech production can be described in terms of three stages in which the sounds are created: the subglottal tract, the vocal tract and the (para-)nasal cavities (illustrated in figure 1.1).

Figure 1.1: The human speech production system [Benesty et al., 2008].

1.2.2 Description of auditory speech

When people talk, the successive alterations of the properties of their phonation and articulation organs result in the production of the target sequence of speech sounds. These sounds can be classified either as consonants or as vowels. Vowels are those sounds that have no audible friction caused by the narrowing or obstruction of some part of the upper vocal tract. Different vowels can be generated by varying the degree of lip aperture, the degree of lip rounding and the placement of the tongue within the oral cavity. On the other hand, sounds that do exhibit an audible friction or closure at some point within the upper vocal tract are called consonants. Consonant sounds vary by the place in the vocal tract where the airflow is obstructed (e.g., at the lips, the teeth, the velum, the glottis, etc.). Each place of articulation produces a different set of consonants, which are further distinguished by the kind of friction that is applied (a full closure or a particular degree of aperture). The larynx also plays an important role in the discrimination of the different phones. When the vocal folds are relaxed during the traversal of the air flow, so-called voiceless or unvoiced sounds are created. On the other hand, when the vocal folds are tightened, the traversal of the air flow causes them to vibrate. This vibration influences the shape of the air flow in such a way that it will exhibit a periodic behaviour. Speech sounds that are uttered while the vocal folds are vibrating are called voiced sounds. Vowels are always voiced, but consonants can be either voiced or unvoiced. Often two different consonants are discerned by only changing the voicing property while all other parameters of the speech production system are the same (e.g., the English phoneme /s/ from the word “some” is unvoiced, while its voiced counterpart is the phoneme /z/ as heard in the English word “zoo”).

To encode a message by means of speech sounds, it is not only the particular sequence of generated phonemes that determines the semantic meaning of the speech signal. The speech sounds also exhibit particular prosodic properties [Smalley, 1963]. The pitch of a voiced speech sound is determined by its fundamental frequency. It is controlled by the larynx, which regulates the degree of tightening of the vocal folds. Pitch variations are often used to assign expression to the speech; for instance, an English sentence can be transformed into a question just by raising the pitch of the sounds at the end. Another prosodic property is the duration of the speech sounds. For instance, by lengthening a particular word from a sentence, stress can be assigned to it. Finally, speech sounds also exhibit an energy (or amplitude) property. This is controlled by the lungs, which can vary the strength of the air flow. This property is mainly used in combination with pitch variations to assign stress to specific parts of the sentence.
1.3 Multimodality of speech

In the previous section it was briefly explained how the cooperative work of multiple human organs is capable of producing speech sounds. Some of these organs, such as the lungs, the larynx, and the nasal cavities, are not visible from the outside. On the other hand, articulators like the lips are clearly visible when looking at a speaker’s face. In addition, other articulators such as the tongue, the teeth, and even the velum are in some cases visible, depending on the particular sounds that are being produced. The exact appearance of these visible articulators is highly correlated with the uttered phone. This implies that spoken communication should not be seen as a unimodal signal consisting of solely auditory information, since the variations of the visible articulators (lips, cheeks, tongue, teeth, etc.) when uttering the speech sounds define a visible speech signal that encodes the message as well. Consequently, the speech message is encoded in both an auditory and a visual speech signal.

One can wonder if this double encoding is worth investigating, since it undeniably contains a lot of redundancy. It could be that the visible articulatory gestures are just a side-effect of the speech production process, meaning that the visible speech mode adds no extra information to the communication channel. Let’s take a look at the everyday use of spoken language. Imagine that someone starts talking to a colleague who is reading an interesting dissertation. The reader will instinctively look up from the text and gaze at the talker’s face. Maybe this is just an act of politeness, since we all tell our children to do so, but it is likely that this habit originates from the fact that we have to make an effort to understand each other as well as possible. Don’t we also tell our children not to talk with their hand in front of their mouth in order to be more intelligible? Similarly, when we are having a conversation with someone at a place where there is a lot of background noise, we will always try to keep our eyes fixed on our companion’s mouth in order to better understand his/her words. This effect is also noticeable when speaking to a group of people at a busy party where the speech gets polluted by lots of other speech sounds. In that case, we understand the person from the group we are looking at much better than the others. The visual clues also assist in focusing our attention, since they help to match each of the audible speech sounds with the correct speaker [Plenge and Tilse, 1975]. From these examples it is clear that the multimodal coding of the speech message is actively used to improve the quality of the communication. The auditory speech is considered the primary communication channel, but also receiving the visible speech information helps to better understand the message [Massaro and Cohen, 1990] [Summerfield, 1992] [Schwippert and Benoit, 1997] [Benoit et al., 2000] [Schwartz et al., 2004] [Van Wassenhove et al., 2005], especially in conditions where the auditory speech is polluted with noise [Erber, 1975] [MacLeod and Summerfield, 1987] [MacLeod and Summerfield, 1990] [Grant et al., 1998] [Bernstein et al., 2004] [Ma et al., 2009]. The most extreme example of this is the fact that hearing-impaired people can be trained to understand a speech message by only receiving the visual speech information (so-called lip-reading) [Jeffers and Barley, 1971] [Woods, 1986] [Summerfield, 1992] [Bernstein et al., 2000].
Being able to notice the changing appearance of the visual articulators during the uttering of speech sounds increases the intelligibility of the speech. In addition, including the visual speech mode in the communication adds more expression and extra metacognitive information to the speech as well [Pelachaud et al., 1991] [Swerts and Krahmer, 2005] [Granstrom and House, 2005]. Comparable to auditory prosody, some degree of visual prosody is added to the speech signal in order to assign stress or prominence to particular parts of the speech [Swerts and Krahmer, 2006] or to add an emotion to the message [Grant, 1969] [Ekman et al., 1972] [Schmidt and Cohn, 2001]. This visual prosody can either add new expressive information to the message or it can be used to strengthen the effect of the auditory prosody [Graf et al., 2002] [Al Moubayed et al., 2010]. Typical visual prosodic gestures are movements of the eyebrows [Granstrom et al., 1999] [Krahmer et al., 2002] and subtle head movements like shakes or nods [Fisher, 1969] [Hadar et al., 1983] [Bernstein et al., 1989] [Munhall et al., 2004]. Another means of visual prosody is eye gaze, which can be linked with grammatical and conversational clues [Argyle and Cook, 1976] [Vatikiotis-Bateson et al., 1998]. Much research has been conducted to identify typical configurations of the eyebrows, the mouth, the eyes, and even the skin (e.g., wrinkles) that can be matched with basic emotions such as joy, sadness, fear, anger, surprise and disgust. By arranging the facial appearance towards one of these configurations, a speaker is able to emphasize the emotion that corresponds to the message that is conveyed [Wilting et al., 2006] [Gordon and Hibberts, 2011].

Receiving the visual speech mode helps to better understand the message and it enhances the expressiveness of the spoken communication. In addition, the concept of confidence should be considered. Let’s take another look at the everyday use of speech with the example of a salesman who is trying to sell one of his products. Will a customer be more likely to purchase when the salesman performs his sales talk over the telephone or when the salesman and the customer have a face-to-face conversation? In the latter case customers will feel more confident about the purchase since they have been able to observe the speaker while listening to his well-prepared discourse. By noticing the subtle clues in the visual speech mode (e.g., frowns or eye gaze) they are (or at least they assume that they are) more capable of correctly determining whether the salesman is trustworthy or not. A similar effect is noticeable when people are having an argument with someone over the telephone, which is more likely to escalate in comparison with a face-to-face discussion, since in the auditory speech-only case they have to make assumptions about the emotional state of their speaking partner and about the intended expression of his/her words. From these examples it is clear that the multimodality of speech is not only a side-effect of which people take advantage whenever possible. In their effort to convey the message as efficiently as possible, humans will always try to use both the auditory and the visual communication channel. When the circumstances somehow disrupt this multimodality, we feel less satisfied with the communication since we are less assured of a correct conveyance of the message.
From the previous paragraphs it is clear that speech should be seen as a multimodal means of communication. When the sender of the message utters the speech sounds, the multimodality intrinsically appears through the variations of his/her visual articulators. On the receiver’s side, the auditory speech information is captured by the ears and the visual speech information is captured by the eyes. Once captured, these two information streams are sent to the brain. Research on the human perception of audiovisual speech information can be considered final evidence for the truly multimodal nature of speech communication, since it has been shown that the brain does not separately decode the auditory and the visual speech information. Instead, the captured speech is analyzed by means of a complex multimodal decoding of both information streams at once, in which the decoding of the auditory information is influenced by the captured visual information and vice versa [Skipper et al., 2007] [Campbell, 2008] [Benoit et al., 2010]. The best known indication for the existence of such a combined decoding is the so-called McGurk effect [McGurk and MacDonald, 1976]. This effect occurs when displaying audiovisual speech fragments of which the auditory and the visual information originate from different sources. For example, observers reported hearing a /ta/ when a sound track containing the syllable /pa/ is dubbed onto a video track of a mouth producing /ka/. A similar effect is noticed when an auditory syllable /ba/ is dubbed with the visual gestures for /ga/. In that case, most people report hearing the syllable /da/. Another such effect, called visual capture, occurs when listeners who are perceiving unmatched auditory-visual speech report hearing the visually presented syllable instead of the auditory syllable that was presented [Andersen, 2010]. Since speech is a truly multimodal signal, it is obvious that the analysis of a speech signal should inspect both the auditory and the visual speech mode. For this purpose, in section 1.1 the concept of a phoneme was explained. Likewise, a visual speech signal can also be split up into elementary building blocks. These atomic units of visual speech are called visemes [Fisher, 1968]. For each language a typical set of visemes can be determined, which describes the distinct visual appearances that occur when uttering speech sounds of that particular language. Such a representative viseme set can be constructed by collecting for each distinct phoneme its typical visual representation. However, some phoneme groups will exhibit a similar visual representation since their articulatory production only differs in terms of invisible features (e.g., voicing or nasality). This implies that a language is defined by a smaller number of visemes in comparison to the number of phonemes. A mapping table from phonemes to visemes will thus exhibit a many-to-one behaviour. Later on, in chapter 6, the mapping from phonemes to visemes will be extensively described and investigated, since much can be said about the naive concept of such a many-to-one mapping. Nevertheless, at this point it is sufficient to consider a viseme as an elementary building block of a visual speech signal and to remember that from each phoneme sequence describing an auditory speech mode, a matching viseme sequence describing the corresponding visual speech mode can be determined by a many-to-one mapping table such as the one illustrated in table 1.1.
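Computationally, such a many-to-one mapping is nothing more than a lookup table. The small sketch below is a minimal illustration of the idea, not the implementation used in this thesis; the handful of entries follows the grouping shown in table 1.1 and the SAMPA-like phoneme symbols are chosen purely for the example.

```python
# Minimal sketch: a many-to-one phoneme-to-viseme mapping as a plain dictionary.
# The entries follow the grouping illustrated in table 1.1; the phoneme symbols
# are SAMPA-like labels used for illustration only.
PHONEME_TO_VISEME = {
    "p": 1, "b": 1, "m": 1,      # bilabials share a single visual appearance
    "f": 2, "v": 2,              # labiodentals
    "t": 4, "d": 4,
    "k": 5, "g": 5,
    "s": 7, "z": 7,
    "A:": 10, "e": 11, "I": 12,
}

def phonemes_to_visemes(phoneme_sequence):
    """Map a phoneme sequence onto its viseme sequence (many-to-one)."""
    return [PHONEME_TO_VISEME[p] for p in phoneme_sequence]

# Two different phoneme sequences can collapse onto the same viseme sequence:
print(phonemes_to_visemes(["b", "I", "t"]))  # [1, 12, 4]
print(phonemes_to_visemes(["p", "I", "d"]))  # [1, 12, 4]
```

The last two lines make the loss of information explicit: distinct phoneme sequences can map onto an identical viseme sequence, which is precisely why the naive many-to-one concept is re-examined in chapter 6.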
Table 1.1: Example of a many-to-one phoneme-to-viseme mapping for English.
Viseme   Phonemes     Example
1        p, b, m      put, bed, mill
2        f, v         far, voice
3        T, D         think, that
4        t, d         tip, doll
5        k, g         call, gas
6        tS, dZ, S    chair, join, she
7        s, z         sir, zeal
8        n, l         lot, not
9        r            red
10       A:           car
11       e            bed
12       I            tip
13       Q            top
14       U            book
1.4 Synthetic speech 1.4.1 Motivation Throughout history, speech has been a means of communication used exclusively for interaction between humans. Apart from a few animals such as dogs that understand short uttered commands, no non-human living creature nor any machinery has been able to understand or produce complicated speech messages. Up until a few decades ago, the communication with machines was solely based on levers, switches, gauges, indicator lamps and other mechanical input and output devices. Over the last decades, however, this situation has evolved drastically as the development of advanced computer technologies has led to powerful computer systems that are capable of processing complicated calculations. In parallel, these computer systems have become more and more visible in common everyday situations. At present, cars, heavy machinery, vending machines, medical devices and even the simplest home appliances such as fridges and central heating systems are computer controlled. For every such device, a human-machine interface is needed that enables users to control the machine and makes it possible for the device to give feedback to its users. Simple commands (e.g., turn off/turn on) can easily be passed to the system by a switch or a button. Similarly, elementary feedback can be returned by basic alphanumeric displays or indicator lights. For more complicated messages, however, these simple interfaces are inadequate or they would make the use of the device cumbersome. The ultimate goal should be that the computer systems that surround us are perfectly integrated in everyday life, which would make them in some way “invisible”: their usage should feel natural and intuitive so that users would “forget” that they are communicating with a highly complicated computer system. This can be achieved when the interaction between human and machine reaches the level of interaction amongst humans themselves. Consequently, the communication between humans and computers should mainly consist of the primary means of communication between humans, namely speech signals. Not only will this make the interaction with the machine feel natural and familiar, it will also improve the accessibility of the technology, since everybody with speaking and/or hearing capabilities will be able to use the device without having to learn its particular interface. It is interesting to see that, like many other recent developments, human-machine interaction based on speech was predicted by science fiction a long time ago. Figure 1.2: Detail of the user interface of the talking computer HAL-9000 from “2001: A Space Odyssey”. In Fredric Brown’s short story “Answer” (1954), the question “Is there a god?” is passed to the Universe’s supercomputer by means of a voice command. The computer answers, by means of auditory speech, “Yes, now there is a god”. Another well-known example is Arthur C. Clarke’s novel “2001: A Space Odyssey” (1968), in which the interaction with the spaceship Discovery One goes through HAL-9000, a speaking computer which can understand voice commands.
In fact, the use of speech to communicate with machines is so straightforward that it has already been mentioned in countless books and motion pictures. 1.4.2 A brief history of human-machine interaction using auditory speech To allow two-way speech-based human-machine communication, the computer system should be able to understand the user’s voice commands and it has to know how to generate a new speech signal in order to return its answer to the user. These two requirements have resulted in two important research domains in the field of speech technology: automatic speech recognition and speech synthesis. In automatic speech recognition it is investigated how to translate a given waveform into its corresponding phoneme sequence [Rabiner and Juang, 1993]. In the early days, studies from this domain involved a rudimentary recognition of isolated words [Davis et al., 1952]. However, in order to design a system that can be used for human-machine interaction, two major challenges had to be overcome. First, the system has to be capable of recognizing unrestricted continuous speech instead of fixed-vocabulary words. In addition, it should recognize speech uttered by any given speaker and not only speech from those speakers that were used to construct or train the system. Nowadays the automatic recognition of continuous speech is performed using sophisticated machine learning techniques like Artificial Neural Networks (ANN, [Anderson and Davis, 1995]) [Lippmann, 1989] or by use of statistical models like Hidden Markov Models (HMM, [Baum et al., 1970]) [Baker, 1975] [Ferguson, 1980] [Rabiner, 1989]. Figure 1.3: Wheatstone’s Talking Machine [Flanagan, 1972]. The domain of speech synthesis investigates how a computer system can create a new waveform based on a target sequence of phonemes [Dutoit, 1997] [Taylor, 2009]. It may come as a surprise that the very first “talking machine” was already invented by Wolfgang Von Kempelen at the end of the 18th century [Von Kempelen, 1791] [Dudley and Tarnoczy, 1950]. The essential components of his machine were a pressure chamber for mimicking the lungs, a vibrating reed acting as the vocal cords, and a leather tube that acted as the vocal tract. When it was controlled by a practised human operator, the machine was able to produce a series of sounds that quite closely resembled human speech. A few decades later, the system was improved by Charles Wheatstone in 1837 (see figure 1.3) [Bowers, 2001]. The very first electrical systems that were designed to synthesize human-like speech signals were only able to generate isolated vowel sounds [Stewart, 1922]. The VODER system can be considered the first electrical system able to produce continuous speech [Dudley et al., 1939]. It was a human-operated synthesizer that consisted of a wrist bar for selecting a voicing or noise source and a foot pedal to control the fundamental frequency. The source signal was routed through 10 bandpass filters of which the output levels were controlled by the operator’s fingers. From that point in time speech synthesis became a popular research topic in which, over the years, various approaches have been investigated. Formant synthesizers create the target synthetic speech by generating a new waveform that mimics the known formants in human speech sounds [Fant, 1953] [Lawrence, 1953] [Kelly and Gerstman, 1961]. Formants are resonant peaks that can be seen in the spectrum of human speech signals, originating from the vocal tract, which has for each particular
configuration several major resonant frequencies. An alternative synthesis approach is articulatory synthesis, in which a new waveform is created by modelling the different components of the human vocal tract and the articulation processes occurring there [Dunn, 1950] [Rosen, 1958] [Kelly and Lochbaum, 1962]. Formant synthesis and articulatory synthesis are examples of so-called rule-based synthesis. In a first synthesis stage, a rule-based synthesizer estimates the properties of the target synthetic speech signal based on predefined knowledge about the uttering of the target phonemes (e.g., the corresponding configurations of the human speech production system) or about the properties of their corresponding speech signals (e.g., spectrum, formants, timing, etc.). Afterwards, in a second synthesis stage a new waveform is constructed based on these estimations. A different approach to the problem of speech synthesis is taken by the so-called data-driven synthesizers. In this strategy, the target synthetic speech signal is constructed by reusing original speech information. A first attempt to synthesize speech by selecting and concatenating diphones (i.e., two consecutive phones) from a predefined database was described by Dixon and Maxey [Dixon and Maxey, 1968]. The major challenge of such concatenative synthesis is the realization of a smooth synthetic speech signal containing no audible join artefacts. Since the optimal point for concatenating two speech signals is found to be at the most stable part of a phone (i.e., the sample at the middle of the phoneme instance) [Peterson et al., 1958], for many years concatenative synthesizers were based on the selection of diphones from a diphone database. Several speech modification techniques for improving the concatenations as well as the prosody of the synthetic speech have been proposed, such as TD-PSOLA [Moulines and Charpentier, 1990] and MBROLA [Dutoit, 1996]. As the processing power of computer systems grew and data storage capabilities became larger, concatenative synthesis evolved towards unit selection synthesis. In this approach the speech segments are selected from a database containing continuous original speech from a single speaker [Hunt and Black, 1996]. This has the advantage that longer original speech segments can be used to construct the target synthetic speech, which reduces the number of concatenations needed. As an alternative to concatenative speech synthesis, other data-driven approaches make use of statistical models, trained on original speech signals, to construct the target synthetic speech. The best known example is HMM-based synthesis, in which in a first stage Hidden Markov Models are trained on the correspondences between acoustic features of original speech from a single speaker and the phonemic transcript of the speech signals. Afterwards, these models can predict new acoustic features corresponding to the target phoneme sequence [Zen et al., 2009]. Recently, hybrid data-driven approaches have gained popularity. These systems first estimate the acoustic features of the target speech using statistical models, after which these estimations are used to perform a unit selection synthesis using a database containing original speech signals [Ling and Wang, 2007]. This section only provides a brief overview of the research on auditory speech synthesis systems; as a concrete illustration, the core idea behind unit selection is sketched below.
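The sketch below shows, under simplifying assumptions, how a unit selection synthesizer can combine a target cost (how well a candidate database unit matches the requested phoneme) with a concatenation or join cost (how smoothly two units connect) and search for the cheapest unit sequence with dynamic programming. It is a minimal illustration only: the unit representation, the cost definitions and the toy database are invented for the example and do not correspond to any particular system mentioned above or to the synthesizer developed in this thesis.

```python
# Minimal unit selection sketch (illustrative only). A database unit carries a
# phoneme label and a small feature vector; the costs below are placeholders.
from dataclasses import dataclass

@dataclass
class Unit:
    phoneme: str
    features: tuple  # e.g., (duration in ms, pitch in Hz) of the original segment

def target_cost(unit, target_phoneme):
    # How well a candidate unit matches the requested phoneme (0 = perfect match).
    return 0.0 if unit.phoneme == target_phoneme else 10.0

def join_cost(left, right):
    # How smoothly two units concatenate: here, a simple feature distance.
    return sum(abs(a - b) for a, b in zip(left.features, right.features))

def unit_selection(targets, database):
    """Select one database unit per target phoneme, minimizing the summed
    target and concatenation costs with dynamic programming."""
    candidates = [[u for u in database if u.phoneme == p] or database for p in targets]
    best = [(target_cost(u, targets[0]), [u]) for u in candidates[0]]
    for t in range(1, len(targets)):
        new_best = []
        for u in candidates[t]:
            cost, path = min(((c + join_cost(seq[-1], u), seq) for c, seq in best),
                             key=lambda x: x[0])
            new_best.append((cost + target_cost(u, targets[t]), path + [u]))
        best = new_best
    return min(best, key=lambda x: x[0])[1]

# Toy database with two candidate realizations of /a/ that differ in pitch;
# the selection prefers the one that joins more smoothly with its neighbours.
db = [Unit("h", (80, 120)), Unit("a", (100, 118)), Unit("a", (95, 180)),
      Unit("l", (70, 122)), Unit("o", (110, 121))]
print(unit_selection(["h", "a", "l", "o"], db))
```

In a real system the costs incorporate many more linguistic, prosodic and spectral features and the candidate sets are heavily pruned for efficiency; the principle of trading off target fit against join smoothness, however, remains the same.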
The interested reader is referred to [Schroeder, 1993] and [Klatt, 1987] for an extensive overview of acoustic speech synthesis research and the different approaches that have been studied; the latter reference also includes many interesting sound samples from the early systems. 1.4.3 Multimodality of spoken human-machine communication The previous sections explained why automatic speech recognition and speech synthesis are necessary for improving the interaction between humans and machines. Let’s take another look at the prophecies on human-machine interaction found in science fiction. It can be noticed that in many cases the authors and movie directors opted to assign some sort of virtual “face” to the computer system. In addition, when the fictional computer system is speaking, some kind of visual cues are displayed. These can be seen as an elementary type of synthetic visual speech. For instance, in Stanley Kubrick’s film version of “2001: A Space Odyssey” the speaking computer HAL-9000 is presented as a red light (see figure 1.2). When the computer talks, a close-up of this light is displayed, giving the impression that the audience is looking at its “face”. Another common practice is the displaying of a graphical representation corresponding to the auditory voice of the machine. A well-known example of this is the smart car KITT featured in the television series “Knight Rider” (see figure 1.4). These are just fictional examples; however, they do indicate that people seem to expect some sort of visual speech signal in the communication with machines as well. Recall that in section 1.3 it was shown that speech communication between humans is a truly multimodal means of communication, consisting of both an auditory and a visual mode. Consequently, it can be expected that optimal human-machine interaction will be feasible only when this communication consists of audiovisual speech as well [Chen and Rao, 1998]. When the machine is at the receiver’s side of the communication, the multimodal human-machine interaction is based on automatic audiovisual speech recognition. Studies on this subject have indicated that the accuracy with which a computer system is able to correctly translate an auditory speech signal into its corresponding phoneme sequence can be increased when visual speech information is given to the system as well [Potamianos et al., 2004]. The visual speech information usually consists of a video recording of the speaker’s face. It is analysed by the system in order to determine important visual features such as the opening of the lips. An accurate estimation of the phoneme sequence is possible by combining these visual features with the acoustic features of the auditory speech signal [Nefian et al., 2002]. Figure 1.4: Fictional rudimentary synthetic visual speech used in the television series “Knight Rider”. Similarly, when the computer system has to transmit a message towards the user, through audiovisual speech synthesis it is possible for the computer to display both a synthetic auditory and a synthetic visual speech signal. The concept of synthetic auditory speech is more or less unambiguously defined as a waveform that resembles a human auditory speech signal as closely as possible; numerous variations on the concept of synthetic visual speech are possible, however (see figure 1.5). For instance, the visual speech signal can appear photorealistic or it can display a cartoon-like representation.
It can display a complete talking head or it can just simulate a speaking mouth. The visual speech can be either 2D- or 3D-rendered and its level of detail can vary from a simple opening/closing of the mouth (e.g., cartoons like “South Park”) to an accurate simulation of the various visual articulators. For realistic visual speech synthesis, the system has to model the exterior of the face containing the lips, the chin and the cheeks, as well as the interior of the mouth, especially the teeth and the tongue. In the next chapter an extensive overview of the various approaches for generating synthetic visual speech will be given. It can be noticed that much similarity exists between the various visual speech synthesis strategies and the approaches for auditory speech synthesis. For instance, rule-based visual synthesizers create a new visual speech signal based on prior knowledge about the visual appearances of phonemes. This knowledge can be used to estimate an appropriate visual counterpart for every target phoneme, from which a continuous synthetic visual speech signal can be constructed by means of interpolation. Alternatively, concatenative visual speech synthesizers reuse original visual speech data to create a new visual speech signal by selecting and concatenating the most appropriate segments from a database containing original visual speech information. Another approach is to synthesize the target visual speech using a statistical prediction model (e.g., a Hidden Markov Model) that has been trained on a dataset of original (audio)visual speech samples. The next chapter will extensively elaborate on these various visual speech synthesis strategies. Figure 1.5: Examples of various visual speech synthesis systems. From left to right: 3D articulatory synthesis [Birkholz et al., 2006], 2D photorealistic synthesis [Ezzat et al., 2002] and 3D model-based synthesis [Cohen and Massaro, 1990]. Over the last two decades, many studies assessed the use of original and/or synthetic audiovisual speech in various human-computer interaction scenarios [Walker et al., 1994] [Sproull et al., 1996] [Cohen et al., 1996] [Pandzic et al., 1999] [Dehn and Van Mulken, 2000] [Ostermann and Millen, 2000] [Geiger et al., 2003] [Agelfors et al., 2006] [Ouni et al., 2006] [Weiss et al., 2010]. One of the important conclusions of these studies is that the addition of a high-quality, realistic synthetic visual speech signal to a (synthetic or original) auditory speech signal improves the overall intelligibility of the speech (visualized in figure 1.6). This is especially true when the intelligibility of the auditory speech itself is degraded. In addition, it has been shown that people react more positively and are more engaged when the computer interacts through audiovisual speech. Also, the results obtained in numerous perception experiments show that the displaying of a realistic talking face makes the computer more human-like, causes the users to be more comfortable in interacting with the system and increases the degree to which users trust the system. From these findings it can be concluded that for an optimal communication from the machine towards the user, the speech should indeed consist of both an auditory and a visual speech mode. Figure 1.6: Intelligibility scores as a function of acoustic degradation, depending on the mode of presentation [Le Goff et al., 1994].
From bottom to top: audio alone, audio and the animation of an elementary synthetic lip model, audio and the animation of a non-photorealistic 3D face model, audio and original 2D visual speech. It has been shown that intelligibility scores increase when a more realistic and more accurate synthetic visual speech signal is displayed (see figure 1.6) [Benoit and Le Goff, 1998]. On the other hand, it has to be ensured that the presented (synthetic) visual speech is appropriate for the particular target application. For instance, a suitable visual speech signal for a system interacting with children could appear cartoon-like, since it is mainly the entertainment value of the system that is important to draw the children’s attention. On the other hand, for more general applications intended for either professional or entertainment purposes, an important aspect of the synthetic visual speech that determines its applicability is the exhibited degree of realism. It is evident that optimal circumstances for human-machine interaction are feasible when a 100% realistic visual speech signal is displayed. Unfortunately, this degree of realism cannot be reached by any visual speech synthesis strategy known to date, although current state-of-the-art 3D rendering techniques are capable of generating near-realistic static representations of a virtual speaker. Surprisingly, it has been noticed that a high but not perfect degree of realism can result in a worse user experience compared with less realistic visual speech signals (such as cartoon-like 2D or 3D speech). This effect, called the uncanny valley effect (see figure 1.7), was first noticed in the field of humanoid robotics, where it was found that the more realistic a robot appears, the more sensitive human observers are to subtle flaws or shortcomings in the design [Mori, 1970]. This effect also holds in the field of visual speech synthesis, since human observers were found to readily dislike a near-realistic synthesis due to the presence of a few brief or subtle unnatural mouth appearances in the signal [Theobald and Matthews, 2012]. More generally, people tend to dislike a near-realistic synthesis that “tries to fool them” by realistically mimicking an original speaker when it can still be noticed that the presented speech signal originates from a synthesizer [Tinwell et al., 2011]. In contrast, flaws in explicitly non-realistic synthetic visual speech signals are more easily forgiven by human observers, provided that the movements of the visible articulators are correctly simulated. From a psychological point of view, this can be explained by the fact that the almost-realistic virtual characters are perceived as strange, abnormal or “spooky” real people, whereas the non-realistic characters clearly originate from a virtual world, which makes the observers feel more comfortable. Bridging the uncanny valley imposes a major challenge for visual speech synthesis research, since a high degree of realism of the synthetic speech is necessary to provide an optimal communication channel between the machine and its users. Figure 1.7: The uncanny valley effect. For 2D cartoon-based synthesis the appearance of the mouth area varies among a limited set of drawn mouth representations (e.g., an open/closed mouth represented by a disc/line).
3D model-based synthesis is capable of exhibiting very natural movements of the visual articulators, represented by variations of a 3D polygon mesh without texture. 2D photorealistic synthesizers mimic original video recordings of a person uttering speech, and 3D photorealistic synthesis uses a 3D model onto which a photorealistic texture is mapped. Example figures taken from [Anime Studio, 2013] [Karlsson et al., 2003] [Liu and Ostermann, 2011] [Albrecht et al., 2002]. 1.4.4 Applications Considering the rapidly increasing number of computer systems people interact with in everyday situations, countless applications for speech-based human-machine interaction are conceivable. The use of speech to transfer a message towards the computer system mainly serves to improve the accessibility of the device. For instance, several functions in a modern car can be triggered by voice command in order to permit the driver to keep his/her hands on the steering wheel and to maintain focus on the traffic. Voice-controlled devices can also help elderly or physically impaired people in using the appliance, since commands can be passed with a minimal physical effort. The accuracy of these speech-controlled applications can be enhanced by incorporating the visual speech mode in the communication. This can also increase the level of interaction between the system and its users. For instance, based on the observed facial expressions, the computer can estimate the emotional state of its user and it can react in an appropriate way. This can, for instance, enhance the communication between the computer system and young children [Yilmazyildiz et al., 2006]. On the other hand, speech-based communication from the machine towards its users is advantageous for both the ease of interaction with the device and the applicability of the system in common everyday tasks. It helps in making the computer system more human-like, especially when the synthetic auditory speech is accompanied by a good-quality visual speech signal [Pandzic et al., 1999]. Nowadays, auditory-only speech synthesis is already used in various applications, such as the reading of text messages in cell phones, automatic telephone exchanges and satellite navigation systems. For all these applications, a logical next step is the extension towards communication by means of audiovisual speech, which will improve the intelligibility of the synthetic speech and enhance the accessibility for hearing-impaired users. The addition of a synthetic visual speech mode can also be used to improve the intelligibility of original or synthesized announcements in train stations or airports. Audiovisual speech synthesis can be used to create talking avatars or virtual assistants that enrich the user experience on personal computers, portable devices, websites and social media [Gibbs et al., 1993] [Noma et al., 2000] [Cosatto et al., 2003]. Audiovisual speech synthesis can also be applied in the entertainment sector. Whereas nowadays the speaking gestures of animated characters are almost completely hand-crafted, an automatic prediction of these gestures would speed up the animation process. In addition, synthetic audiovisual speech can be used for remote-teaching applications. A virtual teacher, displayed as a high-quality speaking head or person, will help to draw the student’s attention in comparison with the displaying of plain text [Johnson et al., 2000].
Another example can be found in the field of video telephony and video conferencing, which are becoming increasingly popular these days. Note that the transmission of high-quality audiovisual speech requires high data rates, since the video signal containing the visual speech must have a resolution and a frame rate that are adequate for preserving the fine details of the speech information. However, the transmission of audiovisual speech is also feasible in a low-bandwidth scenario by transmitting only the textual information, after which a new audiovisual speech signal is generated locally at the receiver’s side. Alternatively, when a model is used to describe the visual speech (see next chapter), model parameters corresponding to the target message can be predicted at the sender’s side, after which only these parameters need to be transmitted to the receiver to allow a local generation of the visible speech. Synthetic audiovisual speech can also be applied in the health-care sector [Massaro, 2003] [Engwall et al., 2004]. For instance, after an accident or surgery, speech therapy involving exercises demonstrated and supervised by a speech therapist may be necessary in order to regain normal speech function. The use of audiovisual speech synthesis for this purpose could drastically reduce the workload, since custom speech samples for use during therapy can be generated beforehand. Similarly, an application using audiovisual speech synthesis can be designed that allows patients to practise speech production on their own in an individual training scheme. In addition, audiovisual speech synthesizers can be used to generate speech samples for miscellaneous speech perception experiments. This avoids the time-consuming and costly audiovisual recordings that would otherwise be necessary for these experiments. Moreover, speech synthesis is able to produce series of highly consistent speech samples, which is very hard to achieve when the speech samples are gathered during multiple recording sessions. Apart from the examples mentioned in this section, many other applications that involve audiovisual speech synthesis are imaginable. It is quite possible that within a few years we will live in a world where the car park ticket machine, the train that brings us to work and the fridge in our kitchen all interact with us by means of synthetic auditory speech while displaying their own typical virtual talking agent. 1.5 Audiovisual speech synthesis at the VUB This thesis describes the research on audiovisual speech synthesis that was performed at the Vrije Universiteit Brussel (VUB) in the Laboratory for Digital Speech and Audio Processing (DSSP). The study has resulted in an audiovisual speech synthesis system that is capable of generating high-quality photorealistic audiovisual speech signals based on a given English or Dutch text. The research originated from the observation that speech is a truly multimodal means of communication that people practise every day of their life. Consequently, they are extremely skilled in perceiving this type of audiovisual information, which implies that a quality perception of synthetic audiovisual speech is only feasible when the two synthetic speech modes closely resemble original speech signals and, in addition, when the level of coherence between these two information streams is as high as found in original audiovisual speech.
The research focusses on synthesis strategies that allow the optimization of both these features. It also investigates the influence of the level of audiovisual coherence on the perceived speech quality, since this aspect is often disregarded by the audiovisual speech synthesizers described in the literature. 1.5.1 Thesis outline Chapter 2 gives a comprehensive overview of the diverse (audio)visual speech synthesis strategies that have been described in the literature. It explains the various aspects that distinguish these synthesis approaches and it positions the synthesis strategy developed in this thesis within the literature. Next, chapter 3 describes the proposed audiovisual speech synthesis strategy and the experiments that were conducted to evaluate the influence of the level of audiovisual coherence on the perceived speech quality. Chapter 4 explains how the attainable synthesis quality of the audiovisual speech synthesizer was enhanced by increasing the individual quality of the synthetic visual speech mode. Subsequently, chapter 5 explains how the synthesis quality was further enhanced by the construction of a new, extensive audiovisual speech database for the Dutch language. For some applications, an (original) auditory speech signal is already available, which means that instead of audiovisual speech synthesis, a visual-only speech synthesis is required in order to generate the accompanying visual speech mode. Chapter 6 elaborates on the use of many-to-one phoneme-to-viseme mappings for this purpose. It also describes the construction and the evaluation of novel many-to-many phoneme-to-viseme mapping schemes. Finally, chapter 7 concludes the thesis by discussing the results obtained and by elaborating on possible future additions to the research. 1.5.2 Contributions Some of the important scientific and technical contributions made by this thesis include:
◦ The development of a unit selection-based audiovisual text-to-speech synthesis approach that is able to maximize the level of audiovisual coherence in the synthetic speech.
◦ The development of a set of audiovisual selection costs and of an audiovisual concatenation technique that allow the synthesis of audiovisual speech of which the quality is sufficient to draw important conclusions on the proposed synthesis approach.
◦ Subjective evaluations that point out that for an optimal perception of the synthetic auditory and the synthetic visual speech, a maximal level of audiovisual coherence is mandatory.
◦ The enhancement of the quality of the synthetic visual speech by employing a model-based parameterization of the speech in order to normalize the database, to employ a diversified concatenation smoothing, and to apply a spectral smoothing to the synthetic visual speech information.
◦ The configuration of a set-up that is appropriate for recording audiovisual speech databases for speech synthesis purposes. This involves, among other things, a strategy for maintaining constant recording conditions throughout the database and an illumination set-up that allows careful feature tracking in the post-processing stage.
◦ The development of the first-ever system that is able to perform high-quality photorealistic audiovisual text-to-speech synthesis for the Dutch language.
◦ An evaluation of the use of standardized and speaker-specific many-to-one phoneme-to-viseme mappings for concatenative visual speech synthesis.
◦ The development of context-dependent viseme labels and the evaluation of their applicability for concatenative visual speech synthesis.
2 Generation of synthetic visual speech 2.1 Facial animation and visual speech synthesis The previous chapter explained how the increasing number of everyday-life interactions with computer systems entails the need for strategies that allow the generation of high-quality synthetic speech. It also explained that optimal synthetic speech should consist of both an auditory and a visual speech mode. Section 1.4.2 briefly elaborated on the various strategies for synthesizing auditory speech. This chapter focuses on the diverse approaches for generating synthetic visual speech that have been the subject of investigation from the early days until the present. From a historical point of view, the very first visual speech synthesis approaches emerged in the pre-computer era. Similar to auditory mechanical talking machines like the one designed by Von Kempelen [Von Kempelen, 1791], the first attempts at (audio)visual speech synthesis consisted of human-operated mechanical constructions that mimicked the human vocal tract in order to produce speech sounds. Simultaneously, the operator could animate components of the machine that resembled visible human articulators (e.g., wooden lips). A famous example of such a machine was the “Wonderful Talking Machine” which was presented by Joseph Faber in 1845 (see figure 2.1) [Lindsay, 1997]. Obviously, the synthetic speech that is of interest for the great majority of modern applications is computer-generated. However, mechanically generated visual speech gestures are nowadays still a critical aspect in the development of humanoid robots, whose synthetic face and articulators should exhibit appropriate variations that correspond to the robot’s voice (see figure 2.1). Figure 2.1: Examples of mechanically generated visual speech. From left to right: the “Wonderful Talking Machine” [Lindsay, 1997], the humanoid robot “KOBIAN”, which is able to mimic human facial emotions and expressions [Endo et al., 2010], and the humanoid robot “HRP-4C”, which is able to produce realistic facial expressions [Nakaoka et al., 2009]. Long before research on the automatic generation of synthetic visual speech existed, hand-crafted visual speech animation was well-known. In the art of cartooning and 2D animation pictures, visual speech was simulated by successively displaying mouth appearances from a limited set of predefined images. For example, a minimal set consisted of a closed mouth (e.g., represented by a line) and an open mouth (e.g., represented by a disc). The set of reference images could be augmented with other variations like the displaying of the tongue or the teeth. The very first use of this technique is credited to Georges Demeny (1892), who used his “Phonoscope” to successively display 12 “chronophotographs” (an ancestor of the transparency slide) containing speech movements (see figure 2.2) [Demeny, 1892]. Later, the animation technique became increasingly popular after the release of popular shorts like Walt Disney’s “Steamboat Willie” (1928). Cartoon animation inspired the very first automatic techniques for generating visual speech. Erber et al. [Erber and Filippo, 1978] used an oscilloscope for displaying a line drawing that represented various lip shapes occurring while uttering speech. Likewise, the displaying of simple vector graphics was used by Montgomery et al.
[Montgomery, 1980] and Brooke et al. [Brooke and Summerfield, 1983] to generate sequences of lip shapes. Obviously, such rudimentary 2D visual speech signals were of limited practical use due to their lack of realism. Fortunately, the available computing power later increased and more complicated computer-based graphics generation became possible. Computer-based generation of visual speech can be seen as a sub-problem in the field of facial animation [Deng and Noh, 2007]. Facial animation studies the design of virtual faces as well as methods to vary the appearance of these faces in order to create human-like facial expressions. Figure 2.2: Georges Demeny’s “Phonoscope” displaying early visual speech animation [Demeny, 1892]. Two important categories of facial expressions can be discerned: expressions that illustrate the emotional state of a person and expressions that are linked with the production of speech sounds. From this perspective, visual speech synthesis can be defined as the generation of a sequence of synthetic facial expressions that are linked with the uttering of a given sequence of speech sounds. Facial animation is a branch of computer-generated imagery (CGI) that became increasingly popular after the release of the animated short film “Tony de Peltrie” by Philippe Bergeron and Pierre Lachapelle in 1985 (see figure 2.3) [Bergeron and Lachapelle, 1985]. This was the first computer-generated animation displaying a human character that exhibits realistic facial expressions in order to show emotions and to accompany auditory speech. In this animated short, the facial expressions were produced by photographing an actor with a control grid on his face, and then matching points to those on a 3D computer-generated face (itself obtained by digitizing a clay model). Similar early animated shorts that initiated the development of computer-based facial animation are “Rendez-vous Montreal” by Thalmann in 1987 and “Sextone for President” by Kleiser in 1988 (illustrated in figure 2.3). As this thesis focuses on the synthesis of visual speech, the current chapter will mainly discuss the various strategies for generating facial expressions that correspond to the uttering of speech sounds. However, in chapter 1 it was explained that certain facial gestures are used to stress or convey an emotion in the speech information (i.e., the visual prosody). Therefore, it should be noted that an ideal visual speech synthesizer has to be capable of mimicking both speech-related gestures and facial expressions that add an emotion to the communication. Figure 2.3: Snapshots from pioneering animations showing realistic computer-generated facial expressions. From left to right: “Tony de Peltrie” (1985), “Rendez-vous Montreal” (1987), and “Sextone for President” (1988). Throughout the years many strategies for the generation of synthetic visual speech have been described [Bailly et al., 2003] [Theobald, 2007]. Classifying these diverse approaches is not an easy task, since many different aspects can be used for typifying each of the proposed strategies. In the remainder of this section a brief overview of such characteristic properties is given. The next section will then elaborate on each of these aspects and will provide various examples from the literature. Input requirements: A description of the target visual speech has to be given to the synthesis system.
This can be accomplished by means of a phoneme (or viseme) sequence or by means of plain text (so-called phoneme-driven or text-driven systems). Alternatively, the speech synthesizer can be designed to generate synthetic visual speech corresponding to an auditory speech signal that is given as input to the system (so-called speech-driven systems). Output modality: Most visual speech synthesis systems generate only a video signal or a sequence of video frames containing the target visual speech. However, some text-driven systems generate both a synthetic auditory and a synthetic visual speech signal. These systems are generally referred to as audiovisual speech synthesis systems. Output dimensions: The synthetic visual speech can be rendered in either two or three dimensions. 2D-based synthesis usually displays a frontal view of the talking head, while 3D-based synthesis uses 3D rendering approaches to permit free movement around the talking head. Photorealism: The synthetic visual speech can appear photorealistic, which means that it is intended to appear as human-like as possible. On the other hand, some systems generate 2D cartoon-like visual speech or render a 3D model using solid colours instead of photorealistic textures. Definition of the visual articulators and their variations: It was explained earlier that visual speech synthesis can be considered a sub-problem in the field of facial animation. Each visual speech synthesizer has to adopt a facial animation technique in order to represent the virtual speaker and to define the possible variations that allow the mimicking of speech gestures. Note that most literature overviews on visual speech synthesis use this property to classify the various proposed synthesis strategies. A wide variety of animation approaches exists. For instance, 3D-based rendering needs the definition of a 3D polygon mesh that models the mouth or the complete face/head of the virtual speaker. In addition, it must define multiple variations of this mesh that can be used to mimic speech gestures. A similar graphics rendering can be used to generate a 2D representation of the virtual speaker. On the other hand, 2D-based facial animation can also be achieved by reusing original video recordings of a human speaker. In that case, the various speech gestures are defined by the labelling of the original visual speech data. Prediction of the target speech gestures: Whatever facial animation strategy is used to describe the visual speech information, each synthesis system needs to estimate the target visual speech gestures based on the input data. Various strategies for this prediction have been proposed, such as predefining correspondences between phonemes and visemes (so-called rule-based systems), a statistical modelling of input-output correspondences (e.g., speech-driven synthesis) or the reuse of appropriate original speech data. 2.2 An overview on visual speech synthesis 2.2.1 Input requirements In order to generate the target visual speech signal, the majority of the synthesis systems require the sequence of phonemes/visemes that must be uttered. Such a phoneme sequence can be directly given as input to the system [Pearce et al., 1986] or the synthesizer’s input can be plain text. In the latter case, these so-called text-to-speech (TTS) synthesis systems will in a first synthesis stage determine a target phoneme sequence based on the given textual information [Dutoit, 1997].
Many TTS systems also predict the prosodic properties of the target speech (e.g., phoneme durations, stress, etc.). Another category of synthesizers generates the novel visual speech signal based on an auditory speech signal that is given as input to the system. These speech-driven systems estimate the target facial expressions based on features extracted from the auditory input signal. For this purpose a training database is used to train a statistical model on the correspondences between these auditory speech features and their corresponding visual features. After training, this model is used to predict the target visual features corresponding to a novel audio segment that is given as input. The predicted visual features can then be used to drive the facial animation. Various types of auditory features have been used. For example, Mel-frequency Cepstrum Coefficients (MFCC, [Mermelstein, 1976]) were used by Massaro et al. [Massaro et al., 1999], by Theobald et al. [Theobald and Wilkinson, 2007] [Theobald et al., 2008], and by Wang et al. [Wang et al., 2010]. Other potential auditory features include Line Spectral Pairs (LSP, [Deller et al., 1993]) [Hsieh and Chen, 2006], Linear Prediction Coefficients (LPC, [Rabiner and Schafer, 1978]) [Eisert et al., 1997] [Du and Lin, 2002] or filter-bank output coefficients [Gutierrez-Osuna et al., 2005]. The definition of the visual features depends heavily on the nature of the synthetic visual speech (e.g., 2D-based or 3D-based) and the manner in which it is represented by the synthesizer (i.e., the chosen facial animation strategy). For instance, when landmark points describing the location of the lips are known, the geometric dimensions of the mouth can be used to describe the visual speech information [Hsieh and Chen, 2006]. Alternatively, when the visual speech is described by a parameterized 3D model (see further in section 2.2.5.1), the model’s control parameters are highly suited as visual features [Massaro et al., 1999]. Other systems use a mathematical model to parameterize the visual speech signal in order to obtain useful visual features. For example, Brooke et al. [Brooke and Scott, 1998] use Principal Component Analysis (PCA, [Pearson, 1901]) and Theobald et al. [Theobald and Wilkinson, 2007] use an Active Appearance Model (AAM, [Cootes et al., 2001]). Diverse approaches have been suggested to learn the mapping from auditory to visual features, such as a Hidden Markov Model (HMM, [Baum et al., 1970]) [Brand, 1999] [Arb, 2001] [Bozkurt et al., 2007], an Artificial Neural Network (ANN, [Anderson and Davis, 1995]) [Eisert et al., 1997] [Massaro et al., 1999], regression techniques [Hsieh and Chen, 2006], Gaussian-mixture models [Chen, 2001], switching linear dynamical systems [Englebienne et al., 2008] or switching shared Gaussian process dynamical models [Deena et al., 2010]. Note that speech-driven visual speech synthesis can also be realized using a hybrid analysis/synthesis approach. In this strategy, an auditory speech signal is given as input to the system, from which in a first stage its corresponding phoneme sequence is determined using speech recognition. Afterwards, in a second stage this phoneme sequence is used as input for the actual visual speech synthesis [Lewis and Parke, 1987] [Lewis, 1991] [Bregler et al., 1997] [Hong et al., 2001] [Ypsilos et al., 2004] [Jiang et al., 2008].
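Independently of the specific features and learning techniques used in the systems cited above, the core of such a learned audio-to-visual mapping can be illustrated with a toy example. The sketch below fits a simple linear least-squares regression from placeholder "acoustic" frames to placeholder "visual" parameters; the random training data stands in for a real parallel audiovisual corpus, and the feature dimensionalities (13 acoustic, 4 visual) are arbitrary choices made only for the example. Real systems would use MFCC-like acoustic features, AAM-based or geometric visual parameters, and far more powerful models such as HMMs or neural networks.

```python
import numpy as np

# Toy speech-driven mapping: learn a linear regression from per-frame acoustic
# features to per-frame visual parameters (illustrative only; the data below is
# random and merely stands in for a real parallel audiovisual training corpus).
rng = np.random.default_rng(0)
n_frames, n_audio, n_visual = 500, 13, 4

A_train = rng.normal(size=(n_frames, n_audio))               # "acoustic" frames
true_map = rng.normal(size=(n_audio, n_visual))
V_train = A_train @ true_map + 0.05 * rng.normal(size=(n_frames, n_visual))

# Training stage: fit the audio-to-visual mapping with ordinary least squares.
W, *_ = np.linalg.lstsq(A_train, V_train, rcond=None)

# Synthesis stage: predict visual parameter trajectories for new audio frames;
# these trajectories would then drive the chosen facial animation model.
A_new = rng.normal(size=(10, n_audio))
V_pred = A_new @ W
print(V_pred.shape)  # (10, 4): one visual parameter vector per audio frame
```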
A similar recognition-based approach was proposed by Verma et al. [Verma et al., 2003] and by Lei et al. [Lei et al., 2003], in which the speech recognition stage is designed to directly estimate a sequence of visemes instead of phonemes. In fact, the actual synthesis stages of these systems can be considered text-driven visual speech synthesizers. Finally, a last category of visual speech synthesizers is based on the cloning of visual speech. These systems generate a new synthetic visual speech signal by mimicking the speech gestures that are detected in another visual speech signal that is given as input to the system. This way, several “virtual actors” can be animated by a single recording of a human speaker uttering speech sequences [Escher and Thalmann, 1997] [Pighin et al., 1998] [Gao et al., 1998] [Goto et al., 2001] [Chang and Ezzat, 2005]. 2.2.2 Output modality The previous chapter explained that speech is a multimodal means of communication. In practice, however, speech is often transmitted as a unimodal signal containing only an auditory mode (e.g., a telephone conversation). In contrast, only in rare cases is a speech signal consisting solely of visual speech information used. This is in line with the fact that humans are very well practised in understanding auditory-only speech, while intelligibility scores for visual-only speech are much lower [Ronnberg et al., 1998]. Only (hearing-impaired) people who have spent a lot of practice time increasing their lip-reading skills are able to (partially) understand visual-only speech [Jeffers and Barley, 1971]. From this observation it is clear that for almost all applications, once the synthetic visual speech signal has been created by the visual speech synthesis system, it will be multiplexed with an auditory speech signal before it is presented to a user. For a speech-driven visual speech synthesis system this workflow is obvious, since in this particular case the desired audiovisual speech consists of a combination of the synthetic visual output speech and the auditory speech that was given as input to the system. For text-driven synthesis, however, multiple workflows are possible. In some applications an original auditory speech fragment is available. In this case, the text or phoneme sequence that corresponds to this original speech signal must be given as input to the visual speech synthesizer. In addition, the system needs to know the timing properties of the auditory speech signal (i.e., the duration of each phoneme) in order to generate a synchronous synthetic visual speech signal. Multiplexing the generated visual speech with the original auditory speech then gives the target audiovisual speech signal. In other applications, the desired audiovisual speech needs to be generated from only textual information. This means that both a synthetic auditory and a synthetic visual speech signal have to be synthesized. Figure 2.4: Two approaches for audiovisual text-to-speech synthesis. Most systems adopt the strategy illustrated on top, in which the synthetic audiovisual speech is generated in two distinct stages. A truly audiovisual synthesis should synthesize both the audio and the video in a single stage, as illustrated in the bottom figure.
The great majority of the systems found in the literature tackle this problem by a two-phase synthesis, where in a first stage the synthetic auditory speech is generated by an auditory speech synthesizer. This auditory synthesizer also provides the target phoneme sequence and the corresponding phoneme durations. In the second stage, this information is given as input to a visual speech synthesizer that creates the synchronized synthetic visual speech. Afterwards, the two synthetic speech modes are multiplexed in order to create the desired audiovisual speech. In contrast, the audiovisual text-to-speech (AVTTS) synthesis can be performed in a single phase when the synthetic auditory and the synthetic visual mode are generated at the same time, as illustrated in figure 2.4. Such single-phase systems can be considered truly audiovisual synthesizers. On the other hand, although the systems that apply a two-phase synthesis are often also referred to as “audiovisual speech synthesizers”, it is more correct to consider these systems as two separate synthesizers jointly performing the AVTTS synthesis. In many cases, the auditory and the visual speech synthesizer were even developed independently of each other. Therefore, this chapter will only consider the visual speech synthesis stage of these two-phase synthesis systems. Schroeter et al. presented an overview of the workflow of two-phase AVTTS systems in [Schroeter et al., 2000]. Many implementations of this strategy can be found in the literature. For instance, the 3D talking head LUCIA [Cosi et al., 2003] converts text to Italian audiovisual speech by using the Festival auditory speech synthesizer [Black et al., 2013] to perform the first stage of the synthesis. The Festival system is also used by King et al. for generating English audiovisual speech [King and Parent, 2005]. Another example is the system by Cosatto et al. [Cosatto et al., 2000], which uses the AT&T auditory TTS system [Beutnagel et al., 1999] for realizing 2D photorealistic AVTTS synthesis, and the system by Albrecht et al. [Albrecht et al., 2002], which uses the MARY TTS system [Schroder and Trouvain, 2003]. Many other two-phase AVTTS implementations exist, such as the synthesizers developed by Goyal et al. [Goyal et al., 2000] and by Zelezny et al. [Zelezny et al., 2006]. This is in contrast with the single-phase AVTTS approach, of which only a few implementations can be found. In 1988, Hill et al. developed an early single-phase AVTTS system based on articulatory synthesis [Hill et al., 1988]. Tamura et al. realized single-phase audiovisual TTS synthesis by jointly modelling auditory and visual speech features [Tamura et al., 1999]. Other exploratory studies, focusing on single-phase concatenative audiovisual synthesis (see further in section 2.2.6.3), were conducted by Hallgren et al. [Hallgren and Lyberg, 1998], Minnis et al. [Minnis and Breen, 2000], Bailly et al. [Bailly et al., 2002], Shiraishi et al. [Shiraishi et al., 2003] and Fagel [Fagel, 2006]. 2.2.3 Output dimensions The numerous visual speech synthesis systems that are described in the literature produce a variety of visual speech signals, which can be coarsely divided into 2D-rendered and 3D-rendered signals. 3D-based visual speech synthesizers use 3D rendering techniques from the field of CGI, modelling the virtual speaker as a 3D polygon mesh consisting of vertices and their connecting edges.
The 3D effect is realized by casting shadow effects on the model based on a virtual illumination source. The realism of the virtual speaker can be increased by adding detailed texture information for simulating skin and wrinkles, eyes, eyebrows, etc. Most 3D-based systems model the complete face or even the whole head of the virtual speaker, although some synthesizers only model the lips/mouth (e.g., [Guiard-Marigny et al., 1996]). The major benefit of synthesizing 3D-rendered synthetic visual speech is the possibility of free movement around the virtual speaker. Because of this, the synthetic speech is applicable in countless virtual surroundings like virtual worlds (e.g., Second Life [Second Life, 2013]), computer games, and 3D animation pictures. In addition, 3D-based facial animation offers a convenient way to add visual prosody to the synthetic speech, since gestures like head movements and eyebrow raises can easily be mimicked by alterations of the 3D mesh. The design of a high-quality 3D facial model or head model is a time-consuming task, especially when realism is important. Fortunately, this process can be partly automated by creating dense meshes based on 3D scans of real persons (e.g., Cyberware scanners [Cyberware Scanning Products, 2013]). Note, however, that the rendering of detailed 3D models requires heavy calculations, which limits the synthesizer’s applicability to computer systems that offer sufficient computing power. Another important consideration is the fact that the use of realistic 3D-rendered synthetic speech imposes an extra difficulty in bridging the “Uncanny Valley” (see section 1.4.3), since with 3D-rendered visual speech even static appearances of the virtual speaker (on top of which synthetic speech movements will be imposed) can be perceived as “almost but not quite good enough” human-like. Other visual speech synthesizers generate a 2D visual speech signal. The majority of these systems aim to mimic standard 2D video recordings by pursuing photorealism. An obvious downside of 2D-based speech synthesis is its limited applicability in virtual worlds or surroundings, since these are mostly rendered in 3D. On the other hand, a 2D visual speech signal can be applied in numerous other applications due to its similarity with standard television broadcasts and motion pictures. For instance, a video signal displaying a frontal view of the virtual speaker can simulate a virtual newsreader or announcer. In addition, a 2D photorealistic representation of the virtual talking agent is the most optimal technique for simulating a real (familiar) person, which is useful in applications such as a virtual teacher or low-bandwidth video conferencing. In comparison with 3D-based speech synthesis, in a 2D-based approach it is easier to create a virtual speaker that exhibits a very high static realism, since people are very familiar with standard 2D video recordings of real persons. Therefore, a high-quality photorealistic 2D visual speech synthesis is more likely to bridge the Uncanny Valley in comparison with 3D-based speech synthesis. Of course, the major challenge remains to accurately mimic the speech gestures on top of this realistic speaker representation. A few systems have been developed that cannot be classified as either 2D-based or 3D-based synthesis. Cosatto et al.
created a visual speech synthesis system that produces synthetic visual speech signals resembling standard 2D video recordings, while it permits some limited head movements of the virtual speaker as well [Cosatto, 2002]. These movements can be user-defined or can be predicted based on the target speech information [Graf et al., 2002]. Note that the movement of the speaker’s head causes a 3D motion of some important visual articulators like the lips and the cheeks. By using a rudimentary 3D head model, Cosatto et al. were able to mimic these movements by affine transformations on 2D textures, as illustrated in figure 2.5.

Figure 2.5: Modelling 3D motion using 2D texture samples (left) [Cosatto, 2002] and visual speech synthesis using 3D screens (right) [Al Moubayed et al., 2012].

Another such system has been developed by Theobald et al. [Theobald et al., 2003]. This system generates the target visual speech in 2D, but by using this synthetic 2D visual speech as a texture map for a 3D facial polygon mesh (describing the face of the same speaker that was used to model the 2D speech), a 3D representation of the synthetic speech becomes possible. The resulting speech should be considered as “2.5D”, since there is no speech-correlated depth variation of the 3D shape. Another category of systems that cannot be classified as either 2D-based or 3D-based makes use of a 3D screen on which a visual speech signal is projected. By shaping these screens in the form of a human face, “true” 3D speech synthesis is possible, which can for instance be applied in the development of humanoid robots (see figure 2.5) [Kuratate et al., 2011] [Al Moubayed et al., 2012].

2.2.4 Photorealism

The visual speech signal that is generated by the synthesis system ought to exhibit speech gestures that mimic as closely as possible the gestures that can be seen in original visual speech. Independently of the realism of these synthetic speech movements (i.e., the dynamic realism), the synthetic visual speech signal also exhibits some degree of photorealism. A high degree of photorealism implies that a static pose of the virtual speaker (i.e., the static realism that can be seen in a single video frame from the visual speech signal) appears very close to a (recording of a) real human. The manner in which photorealism can be achieved is highly dependent on the dimensionality of the synthetic visible speech (see section 2.2.3). For 2D-based synthesis, a photorealistic speech signal appears close to standard television broadcast and video recordings, while 2D non-photorealistic speech signals appear cartoon-like. Possible applications for such cartoon-like 2D visual speech synthesis are the automation of the animation process for 2D animation pictures (which are nowadays increasingly overshadowed by the success of 3D animation pictures) and various situations involving interaction between the computer system and small children.

Figure 2.6: Various examples of synthetic 2D visual speech. From left to right, animation of a painting [Blanz et al., 2003], 2D photorealistic visual speech synthesis using a mathematical model to describe the video signal [Theobald et al., 2004], and 2D photorealistic visual speech synthesis by reusing 2D texture samples [Cosatto and Graf, 2000].

There are a few 2D non-photorealistic speech synthesizers described in the scientific literature.
For instance, in the early days 2D speech gestures were generated using oscilloscopes [Erber and Filippo, 1978] and vector graphics devices [Montgomery, 1980] [Brooke and Summerfield, 1983]. In more recent times, speech synthesis systems sometimes generate a 2D representation based on lines or dots to verify a synthesis concept, which can later on be extended for generating a more realistic speech signal [Arslan and Talkin, 1999] [Tamura et al., 1999] [Arb, 2001]. In addition, some synthesis approaches have been developed to animate paintings and drawings, obviously resulting in 2D non-photorealistic speech [Perng et al., 1998] [Lin et al., 1999] [Brand, 1999] [Blanz et al., 2003]. Note, however, that some of these techniques internally use a 3D-based representation of the speech and that these systems permit more photorealistic results by animating a picture instead of a drawing. In contrast to 2D non-photorealistic synthesizers, over the years many systems for generating 2D photorealistic visual speech have been developed [Waters and Levergood, 1993] [Bregler et al., 1997] [Ezzat and Poggio, 2000] [Aharon and Kimmel, 2004] [Melenchon et al., 2009]. The various synthesis strategies adopted by these and other systems will be discussed in the following sections. As was already mentioned in section 2.2.3, these synthetic photorealistic 2D visual speech signals are applicable in numerous applications since they will appear familiar to the observers due to their resemblance to standard television broadcast and motion pictures. The previous section explained how 3D-based visual speech synthesizers apply 3D rendering techniques to model the virtual speaker. This involves the definition of a 3D polygon mesh, which consists of multiple vertices and their connecting edges. A 3D surface can be created by colouring the faces that are defined by these edges. The level of photorealism of a 3D rendered virtual speaker depends on the 2.2. An overview on visual speech synthesis 34 level of detail and the density of the polygon mesh as well as on the accuracy in which the faces of the mesh are colourized. Due to a limited computing power, the meshes that were applied for facial animation in the early days did not contain much detail, nevertheless they were able to model important gestures such as lip movement and eyebrow raises. The first 3D model that could be used to mimic facial motions was defined by Parke in 1972 [Parke, 1972] (see figure 2.7). Since then, this model has been adopted and improved by many researchers [Hill et al., 1988] [Cohen and Massaro, 1990] [Beskow, 1995]. In current times, the computing power has grown exponentially, allowing to model much more details of the face. Modern systems apply a detailed colouring of the faces of the polygon mesh in order to render an appealing 3D representation of the virtual speaker [Dey et al., 2010]. In addition, photorealism can be achieved by “covering” the 3D surface with a photorealistic texture map [Heckbert, 1986] [Ostermann et al., 1998]. This texture can be sampled from one or multiple photographs [Ip and Yin, 1996] [Hallgren and Lyberg, 1998] [Pighin et al., 1998], it can be captured together with a 3D depth scan [Kuratate et al., 1998], or a full-head cylindrical texture can be obtained by scanning 360 degrees around the head of a real human [Escher and Thalmann, 1997] [Hong et al., 2001] [Elisei et al., 2001]. 
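As a concrete illustration of the texture-mapping step mentioned above, the sketch below colours a point inside one triangle of the mesh by interpolating the texture coordinates of its three vertices (barycentric interpolation) and looking the result up in a photograph. It is a bare-bones sketch under simplifying assumptions (a single triangle, nearest-pixel lookup), not the rendering pipeline of any of the cited systems.

```python
import numpy as np

def barycentric(p, a, b, c):
    """Barycentric coordinates of 2D point p inside triangle (a, b, c)."""
    v0, v1, v2 = b - a, c - a, p - a
    d00, d01, d11 = v0 @ v0, v0 @ v1, v1 @ v1
    d20, d21 = v2 @ v0, v2 @ v1
    denom = d00 * d11 - d01 * d01
    v = (d11 * d20 - d01 * d21) / denom
    w = (d00 * d21 - d01 * d20) / denom
    return np.array([1.0 - v - w, v, w])

def sample_texture(p, tri_xy, tri_uv, texture):
    """Colour at screen point p: interpolate the three vertices' texture
    coordinates and fetch the nearest pixel from the texture photograph."""
    weights = barycentric(p, *tri_xy)      # how strongly each vertex influences p
    u, v = weights @ tri_uv                # interpolated texture coordinate in [0, 1]
    h, w_img = texture.shape[:2]
    return texture[int(v * (h - 1)), int(u * (w_img - 1))]
```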
Finally, it is also worth mentioning that there are reports on systems developed to generate a very rudimental 3D visual speech signal (e.g., only vertices or a non-colourized polygon mesh) intended to verify a synthesis concept which can later on be extended for generating a more realistic speech signal [Galanes et al., 1998] [Bailly et al., 2002] [Engwall, 2002]. 2.2.5 Definition of the visual articulators and their variations Section 2.1 explained that the synthesis of visual speech can be seen as a subproblem in the field of facial animation. Research on facial animation aims to develop strategies for accurately representing the human face and all the facial motions humans are capable of. Each visual speech synthesis system has to adopt a facial animation strategy in order to represent the virtual speaker. While the accuracy of the speech gestures of the virtual speaker is determined by the quality of their estimation based on the input data (see further in section 2.2.6), the appearance and the properties of synthesizer’s output speech will be greatly dependent on the manner in which the facial animation is performed. Given the wide range of possible facial animation approaches, this section separately describes the facial animation strategies that have been applied for 2D-based and 3D-based visual speech synthesis. 2.2.5.1 Speech synthesis in 3D Sections 2.2.3 and 2.2.4 explained that 3D-based visual speech synthesis requires the definition of a 3D polygon mesh that describes the visual articulators. The quality of the polygon mesh and the added texture information is crucial to achieve 2.2. An overview on visual speech synthesis 35 Figure 2.7: Various examples of synthetic 3D visual speech. The top row shows the pioneering model of Parke [Parke, 1982], the bottom row shows, from left to right, the non-photorealistic 3D talking head MASSY [Fagel and Clemens, 2004], the photorealistic talking head LUCIA which adds a texture map to the 3D polygon mesh [Cosi et al., 2003], and a photorealistic 3D facial image resulting from a 3D depth and texture scan [Kuratate et al., 1998]. 2.2. An overview on visual speech synthesis 36 a (photo)realistic representation of the virtual speaker (i.e., static realism). In addition, in order to be able to achieve a high dynamic realism by an accurate prediction of the target speech gestures (see further in section 2.2.6), the chosen facial animation approach has to define an appropriate collection of deformations of the facial model that can be used to mimic the appropriate speech gestures. In the pioneering work of Parke [Parke, 1972] [Parke, 1975] [Parke, 1982] the facial model is hand-crafted by mimicking the geometry of the human face (see figure 2.7). The facial deformations are directly parameterized: the model’s control parameters act directly on the vertices/edges of the polygon mesh. As such, visual speech gestures can be mimicked by varying those control parameters that are linked to the important visual articulators. For instance, there is a parameter that defines the jaw rotation (which determines the mouth opening), there is a parameter describing the lip protrusion, and there are parameters defining a translation of the mouth corners. Many other facial animation approaches can be seen as descendants of Parke’s directly parameterized facial model, such as the talking head Baldi [Cohen and Massaro, 1990] (see figure 1.5) and its extension by LeGoff et al. 
[Le Goff and Benoit, 1996], and the talking heads developed by Beskow [Beskow, 1995] and Fagel et al. [Fagel and Clemens, 2004] (see figure 2.7). One of the important additions that were made to Parke’s initial model is the addition of 3D representations of the tongue and the teeth. These directly parameterized facial animation approaches are often referred to as terminal analogue systems. Note that this label was classically given to formant-based auditory speech synthesizers, which generate a novel waveform based on a manually predefined spectral information. A second strategy for defining the 3D polygon mesh and its deformations is the so-called anatomy-based approach. In contrast with the terminal-analogue facial animation approach, in an anatomy-based model the deformations of the polygon mesh are not directly parameterized. Instead, the facial motions are mimicked by modelling the anatomy of the human face: bones, muscles and skin. Platt et al. developed a strategy to mimic facial expression by modelling the elasticity of the human face by a mass-spring system [Platt and Badler, 1981]. This way, facial deformations can be created by applying a fictional force on some vertices of the mesh, after which it can be calculated how these forces will propagate further to cause variations in other vertices too. An alternative approach has been suggested by Waters, in which various facial muscles and the effect of their activation on the facial appearance are modelled [Waters, 1987]. Facial gestures are mimicked by activating a subset of these virtual muscles and calculating the combined effect of each muscle pulling particular vertices/edges of the polygon mesh towards a predefined point where the virtual muscle is attached to the bone. In those days, Water’s muscular model showed great potential, indicated by the fact that it was 2.2. An overview on visual speech synthesis 37 adopted by the entertainment industry as well. For instance, Pixar’s animated short “Tin Toy” (1988) [Pixar Animation Studios, 2013] featured a realistic baby character whose facial expressions were modelled using a Waters-style facial model. The muscular model has been extended towards a coupled skin-muscle model [Terzopoulos and Waters, 1993] and a multi-layered anatomy model [Lee et al., 1995]. It has also been fine-tuned for simulating speech-related facial expressions [Waters and Frisbie, 1995]. Other muscle-based facial animation schemes for generating expressive visual speech were described by Uz et al. [Uz et al., 1998] and by Edge et al. [Edge and Maddock, 2001]. A hybrid muscle-based approach was described by King et al. [King and Parent, 2005]. In this system, some of the model parameters describe anatomical properties like muscle activations while other parameters of the face model directly parameterize features like the jaw opening and the location of the tip of the tongue. An advanced anatomy-based facial model was developed by Kahler et al., which models the face using three layers: the skull, the muscles and the skin [Kahler et al., 2001] (see figure 2.8). Alternatively, Sifakis et al. designed a complex muscle-based model of the human head, for which the relationship between muscle activation and facial expressions were determined using real-life motion captured data and a finite element tetrahedral mesh [Sifakis et al., 2005] (see figure 2.8). 
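The following sketch captures, in a deliberately simplified form, the idea behind such muscle-based deformation: a contraction pulls the skin vertices near the muscle's insertion towards the point where the muscle attaches to the bone, with an influence that fades with distance. It is not Waters' actual formulation (which uses angular and radial zones of influence); the falloff and gain used here are illustrative assumptions.

```python
import numpy as np

def apply_linear_muscle(vertices, attachment, insertion, activation,
                        influence_radius, falloff=2.0, gain=0.25):
    """Pull mesh vertices toward the muscle's bone attachment point.

    vertices   : (N, 3) array of mesh vertex positions
    attachment : (3,) point where the virtual muscle attaches to the bone
    insertion  : (3,) point where the muscle inserts into the skin
    activation : scalar in [0, 1], strength of the contraction
    """
    displaced = vertices.copy()
    # Only vertices near the muscle's skin insertion are affected.
    dist = np.linalg.norm(vertices - insertion, axis=1)
    affected = dist < influence_radius
    # Vertices closer to the insertion move more (simple distance falloff).
    weight = (1.0 - dist[affected] / influence_radius) ** falloff
    pull = attachment - vertices[affected]          # direction toward the bone
    displaced[affected] += activation * gain * weight[:, None] * pull
    return displaced
```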
The 3D facial animation techniques discussed so far all required a prior manual definition of the deformations of the polygon mesh: the terminal analogue systems directly parameterize the displacements of the vertices and the anatomy-based models parameterize muscular and/or skin actions/forces which cause a variation of the 3D vertices. Another technique for determining the mesh deformations needed to mimic human facial expressions is the so-called performance-driven strategy, which makes use of captured facial gestures from original visual speech data [Williams, 1990]. In general, this technique first requires a 3D polygon mesh that describes a human face/head, which can be hand-crafted or automatically determined using a 3D scanner. Next, original facial gestures are captured and mapped on the polygon mesh. This way, speech-related deformations of the mesh can be learned, which can later on be reused to animate the virtual speaker. Note that there have been reports on the use of performance-driven facial animation using anatomy-based facial models as well [Terzopoulos and Waters, 1993] [Sifakis et al., 2006]. In that particular case, the captured speech gestures will not be used to directly determine the possible mesh deformations, but they allow to estimate the muscular/skin actions occurring in real speech and their effect on the appearance of the speaker’s face. Performancedriven animation techniques can also be applied using a terminal-analogue facial animation scheme. In that case, the captured speech motions are used to deduce articulatory rules that map speech information on parameter configurations of the facial model [Fagel and Clemens, 2004]. The facial movements can be tracked from regular video recordings of a human speaker using image processing techniques like 2.2. An overview on visual speech synthesis 38 Figure 2.8: Anatomy-based facial models. The top row illustrates the musclelayer of the model by Kahler et al. [Kahler et al., 2001], the bottom row illustrates the facial muscles and the finite element tetrahedral mesh described by Sifakis et al. [Sifakis et al., 2005]. 2.2. An overview on visual speech synthesis 39 Figure 2.9: Capturing original facial motions for performance-driven facial animation using the VICON motion capture system [Deng and Neumann, 2008]. snakes [Terzopoulos and Waters, 1993] or key-point tracking [Escher and Thalmann, 1997]. Alternatively, the motions can be captured by tracking markers that are attached to the speaker’s face. These markers can be coloured [Elisei et al., 2001] or fluorescent [Hallgren and Lyberg, 1998] [Minnis and Breen, 2000]. The tracking of the markers can be achieved in 2D by applying image processing on the recorded video frames [Kalberer and Van Gool, 2001] [Muller et al., 2005], in 3D using multiple cameras [Ma et al., 2006] [Zelezny et al., 2006], or by using 3D motion capture systems (e.g., the VICON system [Vicon Systems, 2013] (figure 2.9)) [Cao et al., 2004] [Deng et al., 2006]. A variation to this technique is optoelectronic motion capture, where the facial deformations are tracked by sensors attached to the speaker’s face [Kuratate et al., 1998]. The correspondences between the original speech recordings and the facial model can also be calculated using an analysisby-synthesis approach, for which no facial markers are needed [Reveret et al., 2000]. 
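A minimal sketch of the performance-driven idea, under the simplifying assumption that every tracked marker coincides with one mesh vertex: per-frame marker displacements relative to a neutral capture are copied onto the corresponding vertices of the neutral mesh, so that recorded speech gestures can be replayed (or later reused) on the model. Markers falling between vertices would require an additional interpolation step that is omitted here.

```python
import numpy as np

def retarget(neutral_mesh, vertex_ids, neutral_markers, marker_frames):
    """neutral_mesh    : (N, 3) vertices of the facial model in rest position
       vertex_ids      : (M,) index of the mesh vertex behind each facial marker
       neutral_markers : (M, 3) marker positions captured with a neutral face
       marker_frames   : (T, M, 3) tracked marker positions for T video frames
    Returns T deformed copies of the mesh driven by the captured gestures."""
    animated = np.repeat(neutral_mesh[None, :, :], len(marker_frames), axis=0)
    displacements = marker_frames - neutral_markers[None, :, :]   # (T, M, 3)
    animated[:, vertex_ids, :] += displacements
    return animated
```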
In performance-driven facial animation, an easy mapping from the captured speech gestures to mesh deformations is feasible when the facial markers or the tracked key-points correspond to vertices of the 3D mesh (e.g., such marker positions are standardized in MPEG-4; see further in section 2.2.5.3). In many cases, the captured original speech movements are mapped onto a mathematical model. This way, a parameterization of the original speech gestures (and of the analogous deformations of the 3D polygon mesh) is feasible. For instance, a PCA calculation is performed by the synthesizers developed by Galanes et al. [Galanes et al., 1998], Kuratate et al. [Kuratate et al., 1998], Kalberer et al. [Kalberer and Van Gool, 2001], Elisei et al. [Elisei et al., 2001], and Kshirsagar et al. [Kshirsagar and Magnenat-Thalmann, 2003]. Expectation-Maximization PCA (EM-PCA, [Roweis, 1998]) is applied by Ma et al. [Ma et al., 2006] and by Deng et al. [Deng et al., 2006]. Independent Component Analysis (ICA, [Hyvarinen et al., 2001]) is applied by Muller et al. [Muller et al., 2005] and a Wavelet decomposition [Vidakovic, 2008] is used by Edge et al. [Edge and Hilton, 2006]. The captured data can also be used for learning auditory-visual correlations for speech-driven visual synthesis. An interesting example was described by Badin et al., where electromagnetic articulography was used to capture motion data from the inner articulators like the tongue, the lower incisors and the boundaries between the vermillion and the skin in the midsagittal plane [Badin et al., 2010]. An HMM was trained on the correspondences between this data and auditory features. The trained HMM could then be used for a speech-driven animation of a 3D model illustrating the interior of the human speech production system.

2.2.5.2 Speech synthesis in 2D

2D-based visual speech synthesis aims to create a novel speech signal resembling standard 2D video recordings or animations. Note that an original 2D representation of a human speaker can simply be obtained from a photograph and that original 2D speech gestures can easily be gathered using standard video recordings. Therefore, 2D-based speech synthesizers will often rely on such recordings of original speech. This means that, in contrast with 3D-based synthesizers, 2D-based speech synthesis does not necessarily involve the construction of a graphical model and associated rendering techniques. Where in the case of 3D-based synthesizers the synthesis problem can mostly be split up into a facial animation problem (how are the face and its deformations modelled) and a speech gesture prediction problem (which speech gestures need to be rendered), in the case of 2D-based visual speech synthesis it is less straightforward to make a similar separation since the rendering of the 2D synthetic speech often automatically follows from the prediction of the speech gestures (see further in section 2.2.6). A first category of 2D-based visual speech synthesis systems defines the visual speech information by a set of still images of the virtual speaker. In the early days, these were hand-crafted representations of the lips [Erber and Filippo, 1978] [Montgomery, 1980]. More recently, Scott et al. [Scott et al., 1994], Ritter et al. [Ritter et al., 1999], Ezzat et al. [Ezzat and Poggio, 2000] (see figure 2.10), Noh et al. [Noh and Neumann, 2000], Goyal et al. [Goyal et al., 2000], and Verma et al.
[Verma et al., 2003] described a synthesis approach in which the virtual speaker is defined by a consistent set of photographs of an original speaker uttering speech fragments. This set is constructed to contain an example image of all typical mouth appearances occurring when uttering speech in the target language. A more advanced technique has been described by Cosatto et al. [Cosatto and Graf, 1998], in which a more extensive set of static mouth images is gathered from the recordings of a human speaker. These images are used to populate a multidimensional grid based on the geometric properties of the mouth. Another extension to the image-based definition 2.2. An overview on visual speech synthesis 41 of the virtual speaker was developed by Tiddeman et al., who built a system that is able to generate from a single given photograph a set of images containing various mouth appearances that can be used to define the virtual speaker [Tiddeman and Perrett, 2002]. Where the aforementioned systems require a set of photographs of an original speaker, an alternative approach uses pre-recorded video fragments of an original speaker to define the virtual speaker. A pioneering work is the Video Rewrite system by Bregler et al. [Bregler et al., 1997] which creates a novel visual speech signal by reusing triphone-sized original video fragments. Similarly, systems described by Cosatto et al. [Cosatto and Graf, 2000] [Cosatto et al., 2000], Shiraishi et al. [Shiraishi et al., 2003], Weiss [Weiss, 2004], and Liu et al. [Liu and Ostermann, 2009] use arbitrary-sized video fragments to construct the visual speech. Instead of directly using data from images or video recordings to create the virtual speaker, some systems mathematically model the original 2D visual speech information and use this model-based representation instead. Note that only a few text-driven 2D visual speech synthesis systems apply this technique. An Active Appearance Model (AAM) is used in the synthesizers by Theobald et al. [Theobald et al., 2003] [Theobald et al., 2004] (see figure 2.6) and by Melenchon et al. [Melenchon et al., 2009], while the visual speech is mapped on a Multidimensional Morphable Model (MMM) in the system by Ezzat et al. [Ezzat et al., 2002] (see figure 1.5). On the other hand, as was already mentioned in section 2.2.1, many speech-driven 2D visual speech synthesizers use a mathematical model to describe the visual speech since it permits an easy mapping from auditory to visual parameters. For instance, Principal Component Analysis is used by Brooke et al. [Brooke and Scott, 1998] and by Wang et al. [Wang et al., 2010], AAMs are used by Cosker et al. [Cosker et al., 2003], by Englebienne et al. [Englebienne et al., 2008] and by Deena et al. [Deena et al., 2010], and Shape Appearance Dependence Mapping (SADM) is used by Du et al. [Du and Lin, 2002]. Finally, there are also reports on systems that use a graphical model to render the 2D synthetic visual speech, similar to the facial animation strategies that are used by 3D-based visual speech synthesizers. For instance, a rendering approach based of a 2D wireframe and its associated texture information was used in the DECface system by Waters et al. [Waters and Levergood, 1993] (see figure 2.10). Similarly, a wireframe and associated texture samples copied from an original photograph are used to generate visual speech from a single given image in the system by Lin et al. 
(which uses a 2D wireframe) [Lin et al., 1999] and in the Voice Puppetry system by Brand (which uses a 3D wireframe) [Brand, 1999]. 2.2. An overview on visual speech synthesis 42 Figure 2.10: 2D visual speech synthesis using a 2D wireframe and a corresponding texture map (left) [Waters and Levergood, 1993] and 2D visual speech synthesis based on a limited set of photographs (right) [Ezzat and Poggio, 2000]. 2.2.5.3 Standardization: FACS and MPEG-4 From the previous paragraphs it is clear that there exists a huge variation in techniques for representing the virtual speaker. Each of these approaches has its own strong points and weaknesses. Unfortunately, such a diversity of methods makes it very hard to compare or to combine multiple systems and it forms a barrier for collaborative research. Therefore, some standardizations on the topic of facial animation have been defined. In 1978, Ekman et al. published their work on the so-called Facial Action Coding System (FACS) [Ekman and Friesen, 1978]. This coding system defines numerous Action Units, each corresponding to a contraction or relaxation of one or more facial muscles. The FACS methodology models each human facial expression by one or more Action Units. The standard has been proven to be useful for both psychologists (analysis of human expressions) and animators (synthesis of human expressions). It has also been successfully applied for automatic facial expression analysis [Fasel and Luettin, 2003]. In the field of automatic facial animation, the FACS has been the driving force behind the development of the anatomy-based facial models of Platt et al. [Platt and Badler, 1981], Waters [Waters, 1987] and Terzopoulos et al. [Terzopoulos and Waters, 1993]. These models were designed to mimic particular Action Units, which could then be combined to create meaningful and realistic facial expressions. In addition, Pelachaud developed a system for generating synthetic visual speech and realistic emotions using the FACS [Pelachaud, 1991]. This system was later on extended to split the FACS-based facial animation into independent phonemic, intonational, informational, and affective elements [Pelachaud et al., 1996]. More recently, the interest in using the FACS for generating visual speech has tempered. The reason for this is two-fold. First, the majority of the modern facial models designed for visual speech synthesis purposes is not designed based on human anatomy, but high detailed polygon meshes and their corresponding tex- 2.2. An overview on visual speech synthesis 43 tures are mostly automatically determined by 3D scanning techniques. In addition, natural mesh deformations are learned by advanced 3D motion capture, which is faster and easier than a detailed manual definition of the numerous facial muscles and their effect on the face appearance. A second reason is that the FACS is not optimized for modelling visual speech gestures: the FACS offers a lot of Action Units to accurately mimic emphatic expressions, however it is very hard to use these Action Units to simulate all the detailed mouth gestures corresponding to speech uttering. Where the standardization defined by the FACS is based on the biomechanics of the face, a second standardization for facial animation has been defined in the MPEG-4 standard [MPEG, 2013] which is derived from the geometric properties of the human face [Ostermann, 1998] [Abrantes and Pereira, 1999] [Pandzic and Forchheimer, 2003]. 
MPEG-4 is an object-based multimedia compression standard, which allows different audiovisual objects in a scene to be encoded independently. These visual objects may have a natural or synthetic content, including arbitrarily shaped video objects, special synthetic objects such as the human face and body, and generic 2D/3D objects composed of primitives like rectangles/spheres or indexed face sets that define an object surface by means of vertices and surface patches. The MPEG-4 standard foresees that talking heads will play an important role in future customer service applications. To this end, MPEG-4 enables integration of face animation with multimedia communications and allows face animation over low bit rate communication channels. The standard specifies a face model in its neutral state, a number of feature points on this neutral face as reference points and a set of Facial Animation Parameters (FAPs), each corresponding to a particular facial action deforming the face model away from its neutral state. This way, a facial animation sequence can be generated by deforming the neutral face model according to some specified FAP values at each time instant. The value for a particular FAP indicates the magnitude of the corresponding action (a minimal sketch of this deformation scheme is given below). MPEG-4 specifies 84 feature points on the neutral face (see figure 2.11). The main purpose of these feature points is to provide spatial references for defining FAPs. The 68 FAPs are categorized into 10 groups related to parts of the face. The FAPs represent a complete set of basic facial actions including head motion and control over the tongue, eyes and mouth. Most FAPs represent low-level gestures such as head or eyeball rotation around a particular axis. In addition, the FAP set contains two high-level parameters corresponding to the realization of visemes and expressions, respectively. The expression FAP is used to deform the face towards one of the six primary expressions (anger, fear, joy, disgust, sadness and surprise). The viseme FAP is used to deform the face towards a representative configuration that matches one of the 15 predefined (English) visemes. The use of this viseme FAP makes it possible to generate visual speech by consecutively deforming the face model towards viseme representations that correspond to the target speech.

Figure 2.11: The facial feature points defined in the MPEG-4 standard [MPEG, 2013].

Many 3D-based visual speech synthesis systems have adopted the MPEG-4 standard for describing the visual speech information. For instance, Cosi et al. developed a 3D photorealistic talking head based on the MPEG-4 facial coding standard [Cosi et al., 2003]. In this system, an advanced smoothing between the consecutive viseme representations was implemented to achieve natural-looking speech gestures (see further in section 2.2.6.2). An alternative implementation of an MPEG-4 talking head is described by Pelachaud et al., which involves a (pseudo-)anatomy-based facial model of which the feature points follow the MPEG-4 standard [Pelachaud et al., 2001]. In this system, the two high-level FAPs are not implemented, as speech gestures and expressions are simulated by the researchers’ own animation rules. Facial animation based on the MPEG-4 standard is often used in performance-driven visual speech animation approaches. In that case, the facial markers on the original speaker’s face are placed in conformance with the feature points described in MPEG-4.
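A minimal sketch of FAP-driven deformation as described above: a neutral set of feature points is displaced, per frame, by the active parameters. The three-entry parameter table is a hypothetical stand-in; the actual standard defines 68 FAPs over 84 feature points and normalizes their amplitudes with face-specific measurement units, which is omitted here.

```python
import numpy as np

# Hypothetical mini "FAP table": each entry maps a low-level parameter to the
# index of the feature point it moves and the unit direction of that movement.
FAP_TABLE = {
    "open_jaw":         (0, np.array([0.0, -1.0, 0.0])),
    "stretch_l_corner": (1, np.array([-1.0, 0.0, 0.0])),
    "stretch_r_corner": (2, np.array([+1.0, 0.0, 0.0])),
}

def apply_faps(neutral_points, fap_values):
    """Deform a neutral set of facial feature points by the given FAP values.
    neutral_points: (P, 3) array; fap_values: dict FAP-name -> magnitude."""
    deformed = neutral_points.copy()
    for name, magnitude in fap_values.items():
        idx, direction = FAP_TABLE[name]
        deformed[idx] += magnitude * direction
    return deformed

def animate(neutral_points, fap_frames):
    """fap_frames: list of FAP-value dicts, one per video frame."""
    return [apply_faps(neutral_points, frame) for frame in fap_frames]
```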
This way, the captured speech motions can be directly mapped to displacements of the vertices of an MPEG-4 based polygon mesh [Beskow and Nordenberg, 2005] [Gutierrez-Osuna et al., 2005]. From these mapped deformations, variations in FAP values corresponding to the uttering of speech can be learned [Eisert et al., 1997] [Tao et al., 2009]. The major drawback of using the MPEG-4 standard to simulate visual speech gestures is the fact that in such a system the FAPs are at the same time geometrical degrees-of-freedom and articulatory degrees-of-freedom. Since they originate from the modelling of geometrical deformations of the face, they compose a less-optimal base-set for 2.2. An overview on visual speech synthesis 45 constructing articulatory gestures. A possible workaround this problem has been proposed by Vignoli et al., who designed a facial animation scheme based on socalled Articulatory Parameters (including mouth height, mouth width, protrusion and jaw rotation) [Vignoli and Braccini, 1999]. An MPEG-4 compliant animation is achieved by mapping these Articulatory Parameters to FAPs. 2.2.6 Prediction of the target speech gestures Section 2.2.5 elaborated on various strategies for representing the virtual speaker. The selection of such a strategy is an important step in the design of a visual speech synthesis system, since the chosen facial animation technique not only determines the static realism of the synthetic visual speech but it also defines the synthetic gestures that can be imposed on the virtual speaker. On the other hand, this section elaborates on the prediction of the target speech gestures based on the system’s input data. In other words, this section explains how the facial models that have been described in section 2.2.5 can be used to generate an appropriate sequence of speech gestures. An accurate prediction of these gestures is necessary to achieve synthetic visual speech exhibiting a high level of dynamic realism. Section 2.2.1 explained that, based on the synthesizer’s input requirements, two main categories of visual speech synthesis systems can be discerned, namely text-driven and speech-driven approaches. The problem of predicting the speech gestures based on auditory speech input was already addressed in section 2.2.1. These systems learn in a prior training stage the correspondences between auditory and visual speech features. After training, visual features corresponding to an unseen auditory input signal can be estimated, from which a new sequence of speech gestures is determined. From this point onwards, this section will only focus on the estimation of speech gestures based on textual information (i.e., text-driven visual speech synthesis). 2.2.6.1 Coarticulation Before elaborating on the prediction of the target speech gestures, the concept of coarticulation must be explained. Coarticulation refers to the way in which the realization of a speech sound is influenced by its neighbouring sounds in a spoken message [Kent and Minifie, 1977] [Keating, 1988]. Forward or anticipatory coarticulation is mainly caused by high-level articulatory planning and occurs when the articulation of a speech segment is affected by other segments that are not yet realized. On the other hand, backward or preservatory coarticulation (also known as “carry-over” coarticulation) is mainly caused by inertia in the biomechanical structures of the vocal tract which causes the articulation at some point in time to be affected by the articulation of speech segments at an earlier 2.2. 
An overview on visual speech synthesis 46 point in time. Note that coarticulation may not be seen as a pure side-effect since it also serves a communicative purpose: it makes the speech signal more robust to noise by introducing redundancies, since the phonetic information is spread out over time. Many studies have tried to explain and to model the effect of coarticulation on the uttering of speech sounds. Two important approaches can be discerned, namely look-ahead models and time-locked models. Look-ahead models allow the beginning of an anticipatory coarticulatory gesture at the earliest possible time allowed by the articulatory constrains of other segments in the utterance. A well-known look-ahead model is the numerical model of Ohman [Ohman, 1967]. This model splits the speech articulation in vocalic and consonant gestures. Every articulatory parameter is defined as a numerical function over time, of which the value is dependent on pure vocalic gestures onto which a consonant gesture is superimposed. The consonant has an associated temporal blend function that dictates how its shape should blend with the vowel gesture over time. It also has a spatial coarticulation function that dictates to what degree different parts of the vocal tract should deviate from the underlying vowel shape. On the other hand, time-locked models assume that articulatory gestures are independent entities which are combined in an approximately additive fashion. They allow the onset of a gesture to occur a fixed time before the onset of the associated speech segment, regardless of the timing of other segments in the utterance. A well-known time-locked coarticulation model is Lofqvist’s gestural model [Lofqvist, 1990], in which each speech segment has dominance over the vocal articulators which increases and then decreases over time during articulation. Adjacent segments will have overlapping dominance functions which dictates the blending over time of the articulatory commands related to these segments. The height of the dominance function at the peak determines to what degree the segment is subject to coarticulation. Another gestural model of speech production was described by Browman et al. [Browman and Goldstein, 1992]. In their approach to articulatory phonology, gestures are dynamic articulatory structures that can be specified by a group of related vocal tract variables. Syllable-sized coarticulation effects are modelled by phasing consonant and vowel gestures with respect to one other. The basic relationship is that initial consonants are coordinated with vowel gesture onset, and final consonants with vowel gesture offset. This results in organisations in which there is substantial temporal overlap between movements associated with vowel and consonant gestures. Coarticulation effects are noticeable in both the auditory and the visual speech mode. In the auditory mode, it leads to smooth spectral transitions from one speech segment to the other. Auditory speech synthesis has to mimic these transitions to avoid “jerky” synthetic speech. Rule-based auditory synthesizers such as articulatory or formant synthesis systems predict for each target phoneme a corresponding speech sound. To achieve high-quality synthetic speech, these systems have to 2.2. An overview on visual speech synthesis 47 integrate a coarticulation model to simulate the transitions between the consecutive predicted articulatory or spectral properties. 
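As a numerical illustration of the vowel-plus-superimposed-consonant idea behind Ohman's model described above, the sketch below blends a consonant target shape into an underlying vowel-to-vowel trajectory using a temporal blend function and a spatial coarticulation function. The raised-cosine blend and the linear coarticulation profile in the usage example are arbitrary illustrative choices, not Ohman's fitted functions.

```python
import numpy as np

def ohman_track(vowel_track, consonant_shape, blend, coart):
    """vowel_track     : (T, X) underlying vowel-to-vowel articulatory trajectory
                         (X positions along the vocal tract, T time samples)
       consonant_shape : (X,) target vocal-tract shape of the consonant
       blend           : (T,) temporal blend function, 0 = pure vowel, 1 = full consonant
       coart           : (X,) spatial coarticulation: how much each part of the
                         tract may deviate from the underlying vowel shape"""
    deviation = consonant_shape[None, :] - vowel_track
    return vowel_track + blend[:, None] * coart[None, :] * deviation

# Usage: a consonant gesture peaking in the middle of a vowel-to-vowel transition.
T, X = 100, 20
vowels = np.linspace(np.full(X, 0.2), np.full(X, 0.8), T)   # slow V-to-V movement
blend = 0.5 - 0.5 * np.cos(2 * np.pi * np.arange(T) / T)    # rises, peaks, falls
coart = np.linspace(1.0, 0.1, X)                            # front of tract deviates most
print(ohman_track(vowels, np.full(X, 0.0), blend, coart).shape)   # (100, 20)
```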
In visual speech, coarticulation effects are even more pronounced than in the auditory speech mode. Particular gestures, like lip protrusion, have been found to influence neighbouring articulatory gestures up to several phonemes before and after the actual speech segment they are intended for. In a visual speech signal, the inertia of the visual articulators can be directly noticed. For instance, preservatory coarticulation is noticeable when a speech gesture continues after uttering a particular sound segment while the other gestures needed to create this sound are already completed. An example of this effect is the presence of lip protrusion during the /s/ segment of the English word “boots”. Moreover, anticipatory coarticulation can be seen in the visual speech signal when a visible gesture of a speech segment occurs in advance of the other articulatory components of the segment. An example of such anticipatory coarticulation is the pre-rounding of the lips in order to utter the English sound /uw/: in the word “school” the lip rounding can already be noticed while the sounds /s/ or /k/ are still being uttered. As will be explained further on, many visual speech synthesis systems adopt a rule-based strategy, in which at particular time instants the properties of the visual speech are predicted (e.g., at the middle of each phoneme or viseme). Similar to the rule-based auditory synthesizers, these visual speech synthesis systems have to implement a strategy for creating smooth and natural transitions between the consecutive predicted appearances by mimicking the visual coarticulation effects. Note, however, that in the field of visual speech synthesis a noticeable trend exists towards concatenation-based synthesis (see further in section 2.2.6.3). A similar shift has already been made in the field of auditory synthesis. One of the benefits of such a concatenative synthesis approach is the fact that coarticulation effects can be automatically included in the synthetic speech. Indeed, original transitions between adjacent phonemes can be seen in the synthetic speech signal by reusing segments of original speech recordings that are longer than a single phone.

2.2.6.2 Rule-based synthesis

Analogous to the rule-based approaches for auditory speech synthesis, rule-based visual speech synthesizers generate the synthetic speech by estimating its properties using predefined rules. In general, only a few particular frames of the output video signal will be predicted directly. Therefore, rule-based synthesis is often referred to as keyframe-based synthesis. In most systems the predicted keyframes are located at the middle of each phoneme or viseme of the target speech. The rule-based synthesis approach can be split up into two stages. In an initial offline stage, the synthesis rules are determined. This means that for each instance from a set of predefined synthesis targets (e.g., all phonemes/visemes of the target language) at least one typical configuration of the visual articulators is defined. The way in which such a typical configuration is described is greatly dependent on the chosen facial animation strategy (see section 2.2.5).

Figure 2.12: Visual speech synthesis using articulation rules to define keyframes (black). The other video frames (white) are interpolated.
In a second stage, synthesis of novel speech signals is feasible by composing a sequence of predefined configurations based on the textual input information. The target visual speech signal can then be generated by interpolating between the predicted keyframes in order to attain a smooth signal that is in synchrony with the imposed duration of each speech segment. As was explained in section 2.2.6.1, for synthesizing high-quality speech sequences this interpolation should mimic the visual coarticulation effects. A general overview of the rule-based synthesis approach is illustrated in figure 2.12. Early rule-based visual speech synthesis, using Parke’s directly parameterized facial animation model [Parke, 1975], was developed by Pearce et al. [Pearce et al., 1986], Lewis et al. [Lewis and Parke, 1987], Hill et al. [Hill et al., 1988] and Cohen et al. [Cohen and Massaro, 1990]. These approaches specify for each instance from a set of representative phonemes a set of typical parameter values for the 3D facial model. As such, a series of keyframes can be determined based on a given target sequence of phonemes. Smoothing between these keyframes is performed by a interpolation of the parameter values of the model. A similar approach was followed by Guiard-Marigny et al. [Guiard-Marigny et al., 1996], in which the synthetic lips are described by algebraic equations. Because of this, interpolation between consecutive keyframes can be easily mathematically achieved. Note that, although all these mentioned systems are capable of generating smooth facial animations, no real solution to mimic visual coarticulations is mentioned in these strategies. In order to create natural transitions between the consecutive phoneme or viseme representations found in the predicted keyframes, the interpolation strategy should 2.2. An overview on visual speech synthesis 49 generate intermediate frames that mimic visual coarticulations. One approach for this is to mimic the biomechanics of the face. Such an interpolation technique is found in the DECface system [Waters and Levergood, 1993], which predefines for each instance of a representative set of visemes a fixed configuration of the 2D wireframe that is used to render the visual speech. From this set of static shapes, a sequence of 2D keyframes is composed based on the input text. In order to achieve realistic keyframe transitions, the system models the dynamics of the mouth movements by representing each node of the wireframe by a position, a mass, and a velocity. The interpolation between the keyframes is calculated by applying fictional forces on the nodes and calculating their propagation through the wireframe by mimicking the elastic behaviour of facial tissue. A similar anatomy-based interpolation was used between keyframes containing a 3D polygon mesh by Hong et al. [Hong et al., 2001]. Obviously, such an anatomy-based interpolation is also feasible for rule-based visual speech synthesis systems that adopt anatomy-based facial animation schemes. For instance, the system by Uz et al. [Uz et al., 1998] defines for each representative phoneme a set of animation parameters defining the muscle contractions and the jaw rotation. By use of these rules a series of keyframes is composed, after which smooth animations are achieved by a cosine-based interpolation between the keyframes. 
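A compact sketch of this keyframe-plus-interpolation workflow follows, using a made-up three-parameter articulation table and a cosine ease between keyframes (one of the simple interpolation schemes mentioned above). A real system would of course use the full parameter set of its facial model and a proper coarticulation model instead of this plain ease curve.

```python
import numpy as np

# Hypothetical articulation rules: one facial-model parameter vector per
# representative phoneme (e.g. jaw rotation, lip protrusion, mouth-corner offset).
RULES = {
    "b": np.array([0.05, 0.2, 0.0]),
    "u": np.array([0.30, 0.9, -0.1]),
    "t": np.array([0.20, 0.1, 0.1]),
}

def keyframes(phonemes, durations, fps=25):
    """Place one keyframe at the temporal midpoint of each phoneme."""
    times, values, t0 = [], [], 0.0
    for ph, dur in zip(phonemes, durations):
        times.append(t0 + dur / 2.0)
        values.append(RULES[ph])
        t0 += dur
    frame_times = np.arange(0.0, t0, 1.0 / fps)
    return np.array(times), np.array(values), frame_times

def cosine_interpolate(key_times, key_values, frame_times):
    """Fill in the non-key frames with a cosine ease between adjacent keyframes."""
    out = np.empty((len(frame_times), key_values.shape[1]))
    for i, t in enumerate(frame_times):
        j = np.clip(np.searchsorted(key_times, t) - 1, 0, len(key_times) - 2)
        a = np.clip((t - key_times[j]) / (key_times[j + 1] - key_times[j]), 0.0, 1.0)
        w = 0.5 - 0.5 * np.cos(np.pi * a)       # smooth ease-in / ease-out
        out[i] = (1 - w) * key_values[j] + w * key_values[j + 1]
    return out

kt, kv, ft = keyframes(["b", "u", "t"], [0.08, 0.20, 0.12])
trajectory = cosine_interpolate(kt, kv, ft)     # (frames, parameters)
```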
Another approach for mimicking coarticulations while interpolating between the predicted keyframes is to adopt a comprehensive model that describes the various visual coarticulation effects. Pelachaud et al. proposed a look-ahead model to simulate coarticulation effects in visual speech synthesis using Action Units from the FACS [Pelachaud et al., 1991]. In their system, phonemes are assigned a high or low deformability rank. When synthesizing a new utterance, forward and backward coarticulation rules are applied so that a phoneme takes the lip shape of a less deformable phoneme forwards or backwards in the target phoneme sequence. This is calculated in three stages, where the first one computes the ideal lip shapes, after which in two additional stages temporal and spatial muscle actions are computed based on constraints such as the contraction and relaxation time of the involved facial muscles. Conflicting muscle actions are then resolved by use of a table of Action Units similarities. Another implementation of the look-ahead coarticulation model for visual speech synthesis purposes was described by Beskow [Beskow, 1995]. This system uses Parke’s facial animation strategy, where a 3D model of the tongue was added. Each phoneme is assigned a target vector of articulatory control parameters. To allow the targets to be influenced by coarticulation, the target vector may be under-specified, i.e. some parameter values can be left undefined. If a target is left undefined, the value is inferred from the phonemic context using interpolation, followed by a smoothing of the resulting trajectory. 2.2. An overview on visual speech synthesis 50 One of the most adopted strategies for estimating visual coarticulation is the so-called Cohen-Massaro model [Cohen and Massaro, 1993], which is based on the time-locked gestural model of Lofqvist [Lofqvist, 1990] and was originally designed to interpolate between keyframe parameter values of a terminal-analogue facial animation system. In this model, each synthetic speech segment (i.e., each keyframe) is assigned a target vector of parameter values. Overlapping temporal dominance functions are used to blend the target values over time. The dominance functions take the shape of a pair of negative exponential functions, one rising and one falling. The height of the peak and the rate in which the dominance rises and falls are free parameters that can be adjusted for each representative phoneme and articulatory control parameter. An illustration of this strategy is given in figure 2.13. To implement the Cohen-Massaro coarticulation model, for each representative phoneme or viseme the parameters of the dominance functions have to be estimated. Le Goff et al. described a terminal-analogue rule-based visual speech synthesis system that interpolates between keyframes using a slightly improved version of the Cohen-Massaro coarticulation model [Le Goff, 1997]. In their approach, the parameters of the dominance functions are automatically determined from original speech recordings. A downside of the Cohen-Massaro model is that it offers no way to ensure that particular target parameter values are reached. In some cases this is necessary, for instance at a bilabial stop where the reaching of full mouth closure is crucial. To overcome this problem, Cosi et al. augmented the Cohen-Massaro model with a resistance function that can be used to suppress the dominance of segments surrounding such a critical target [Cosi et al., 2002]. 
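A minimal numerical sketch of the basic Cohen-Massaro blending described above (without the resistance extension of Cosi et al.): each segment contributes a target value weighted by a dominance function built from a rising and a falling negative exponential, and the resulting trajectory is the dominance-weighted average, so predicted targets are not necessarily reached. Parameter names and shapes are illustrative.

```python
import numpy as np

def dominance(t, center, peak, rate_rise, rate_fall):
    """Negative-exponential dominance of one speech segment around its center."""
    dt = t - center
    return np.where(dt < 0,
                    peak * np.exp(rate_rise * dt),    # rising branch before the center
                    peak * np.exp(-rate_fall * dt))   # falling branch after the center

def cohen_massaro_track(t, centers, targets, peaks, rises, falls):
    """Blend per-segment target values into one smooth parameter trajectory.
    t: (T,) frame times; the other arguments hold one entry per speech segment."""
    D = np.stack([dominance(t, c, p, r, f)
                  for c, p, r, f in zip(centers, peaks, rises, falls)])   # (S, T)
    weights = D / D.sum(axis=0, keepdims=True)        # normalized dominance per frame
    return weights.T @ np.asarray(targets, dtype=float)

# Usage: three segments with centers at 0.1 s, 0.25 s and 0.4 s.
t = np.linspace(0.0, 0.5, 50)
track = cohen_massaro_track(t, centers=[0.1, 0.25, 0.4], targets=[0.2, 0.9, 0.1],
                            peaks=[1.0, 0.6, 1.0], rises=[30, 20, 30], falls=[30, 20, 30])
```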
This interpolation scheme was used in the rule-based visual speech synthesis system LUCIA [Cosi et al., 2003]. Many other implementations of the Cohen-Massaro model for rule-based visual speech synthesis exist. For instance, Fagel et al. captured original speech motions to estimate the dominance functions of the coarticulation model [Fagel and Clemens, 2004]. These captured motions were also used to learn articulation rules for a terminal-analogue facial animation scheme. A similar data-driven training of the Cohen-Massaro model for interpolating keyframes described by MPEG-4 FAPs was described by Beskow et al. [Beskow and Nordenberg, 2005]. In that system, the FAP-based articulation rules were also learned from original speech data. The Cohen-Massaro model has also been used to interpolate between keyframes described by anatomy-based facial animation schemes [Albrecht et al., 2002]. Furthermore, Lin et al. described a system that uses the Cohen-Massaro model to interpolate between predefined configurations of a 2D wireframe [Lin et al., 1999]. An interesting extension of the Cohen-Massaro model was proposed by King et al. [King and Parent, 2005]. In their approach, for each viseme a typical trajectory of model parameters was hand-crafted (instead of a single set of parameter values). Based on the target phoneme sequence, a sequence of parameter sub-trajectories is composed. Smooth trajectories are obtained by 2.2. An overview on visual speech synthesis 51 Figure 2.13: Modelling visual coarticulation using the Cohen-Massaro model [Beskow, 2004]. The dominance functions of the speech segments define the interpolation of the facial model parameter between keyframes. Note that not all predicted keyframe-values will be reached. interpolation using decaying dominance functions. When the visual speech synthesis is based on performance-driven facial animation (see section 2.2.5.1), the original speech data can be used to learn articulation rules and coarticulation behavior. For instance, Muller et al. described a technique in which motion capture data is modelled using ICA, from which for each representative viseme the mean parameter values and an “uncertainty” of this mean representation are calculated [Muller et al., 2005]. For generating new visual speech, a series of keyframes is generated based on the target phoneme sequence, which are then interpolated by fitting fourth order splines. Coarticulation is modelled by defining an attraction force from each keyframe to the interpolation curve that is inversely proportional to the uncertainty of the mean representation of the corresponding viseme. In a strategy proposed by Deng et al., the transitions between representative visemes are learned from motion capture data in a prior training phase [Deng et al., 2006]. When synthesizing new speech, a series of appropriate keyframes is constructed which are then interpolated using the corresponding trained coarticulation rules. Revret et al. implemented Ohman’s numerical coarticulation model for a rule-based visual speech synthesis based on motion capture data [Reveret et al., 2000]. The captured original speech gestures were used to estimate the values of the coarticulation model, such as the coarticulation coefficients and the temporal functions guiding the blending of consonants and the underlying vowel track. 2.2. 
An overview on visual speech synthesis 52 Instead of defining a single articulation rule for each representative phoneme, a more extensive set of rules can be learned to predict the keyframes. Galanes et al. developed such a rule set using a tree-based clustering of 3D motion capture data [Galanes et al., 1998]. For each distinct phoneme, several typical representations are collected based on the properties of the phonetic context. This way, coarticulation is automatically included in the articulation rules. To synthesize novel speech, the same tree is traversed to determine a keyframe for each target phoneme, after which interpolation using splines is performed to create smooth parameter trajectories. Another approach for defining context-dependent articulation rules was suggested by De Martino et al. [De Martino et al., 2006]. In this approach, 3D motion capture trajectories corresponding to the uttering of original CVCV and diphthong samples are gathered, after which by means of k-means clustering [Lloyd, 1982] important groups of similar visual phoneme representations are distinguished. From these context-dependent viseme definitions, the keyframe mouth dimensions and jaw opening corresponding to a novel phoneme sequence can be predicted. These predictions are then used to animate a 3D model of the virtual speaker. Rule-based synthesis is also a popular approach for synthesizing 2D synthetic visual speech from text. To this end, the system needs to define for each representative phoneme or viseme a typical 2D representation of the virtual speaker (e.g., a picture of an original speaker uttering the particular speech sound). From these typical representations, a series of keyframes is composed based on the target phoneme sequence, after which an interpolation between these 2D keyframes is needed to achieve a smooth video signal. Scott et al. [Scott et al., 1994] created the “Actors” system which interpolated between a series of photograph-based keyframes using image morphing techniques [Wolberg, 1998]. Unfortunately, this morphing step required a hand-crafted definition of the various morph targets in each keyframe. A more automated morphing between 2D keyframes is feasible using optical flow techniques [Horn and Schunck, 1981] [Barron et al., 1994]. This type of keyframe interpolation is used in the rule-based 2D visual speech synthesis systems by Ezzat et al. [Ezzat and Poggio, 2000], Goyal et al. [Goyal et al., 2000], and Verma et al. [Verma et al., 2003]. Similar rule-based synthesis approaches were described by Noh et al. [Noh and Neumann, 2000], where Radial Basis Functions [Broomhead and Lowe, 1988] are used to interpolate between the keyframes, and by Tiddeman et al. [Tiddeman and Perrett, 2002], where the interpolation is achieved by texture mapping and alpha blending [Porter and Duff, 1984]. A semi-automatic technique to gather an appropriate set of 2D images from original speech recordings in order to define the synthesis rules was described by Yang et al. [Yang et al., 2000]. In order to include coarticulation effects in the articulation rules, context-dependent 2D keyframe prototypes are manually extracted from original speech recordings by Costa et al. [Costa and De Martino, 2010]. To determine which articulation rules 2.2. An overview on visual speech synthesis 53 are necessary, original motion capture speech data was analysed [De Martino et al., 2006]. 
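A sketch of the clustering step used in such data-driven rule definition: plain Lloyd's k-means over per-phone feature vectors (e.g. mouth width, mouth height and jaw opening measured at phone centres), grouping similar context-dependent realizations. The choice of features is an illustrative assumption, not the exact feature set of the cited systems.

```python
import numpy as np

def kmeans(samples, k, n_iter=50, seed=0):
    """Plain Lloyd's algorithm. samples: (N, D) feature vectors, one per phone
    instance in the motion-capture corpus. Returns centroids and labels."""
    rng = np.random.default_rng(seed)
    centroids = samples[rng.choice(len(samples), k, replace=False)].astype(float)
    for _ in range(n_iter):
        # Assign every sample to its nearest centroid.
        d = np.linalg.norm(samples[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Move each centroid to the mean of its members.
        for j in range(k):
            members = samples[labels == j]
            if len(members):
                centroids[j] = members.mean(axis=0)
    return centroids, labels
```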
The major drawback of all these mentioned 2D interpolation approaches is that no articulatory constraints are taken into account when creating the intermediate video frames. In other words, by smoothing the transition between two keyframes, new configurations of the virtual speaker are generated that may or may not exist in original visual speech. To avoid unnatural interpolated speaker configurations, Melenchon et al. [Melenchon et al., 2009] developed a 2D visual speech synthesis system that can be classified as either rule-based or concatenative (see further in section 2.2.6.3). In this system, for each distinct phoneme many typical representations are gathered (instead of just one as in the other rule-based systems). To generate new visible speech, for each target phoneme the best representation is selected based on the distance between each candidate representation and the representation that was selected for the previous phoneme in the target phoneme sequence. When all keyframes are determined, continuous speech is generated by an interpolation technique in which a smooth transition from one keyframe to the next is constructed by reusing original recorded video frames. As such, every speaker configuration between two keyframes will exhibit static realism. A final category of rule-based visual speech synthesizers can be classified as articulatory synthesis systems. Similar to articulatory auditory speech synthesizers, these systems do not directly predict the result of speech production (i.e., formants (auditory synthesis) or speaker appearances (visual synthesis)) but rather the configurations of the human speech production system (i.e., the manner in which the speech signal is produced). Visual articulatory systems are mainly used to illustrate the mechanism of speech production, which can for instance be applied in speech therapy applications. An example is the system by Birkholz et al. [Birkholz et al., 2006] (see figure 1.5), which is in fact a rule-based speech synthesis system based on a terminal-analogue animation scheme. In this system, the speech signal is rendered using a directly parameterized polygon mesh that represents the lips, the tongue, the teeth, the upper and lower cover, and the glottis. A new speech signal is generated by predicting keyframe parameter values, which are interpolated using dominance functions to mimic coarticulation. A similar approach has been followed by Engwall, in which captured original speech gestures are used to learn rules for animating a 3D tongue model from text [Engwall, 2001].

2.2.6.3 Concatenative synthesis

Although at present rule-based approaches are still adopted, over the last decade a trend is noticeable towards concatenative visual speech synthesis strategies. A similar transition already took place in the field of auditory synthesis, where rule-based techniques such as articulatory or formant synthesis have largely been superseded by concatenative synthesis approaches.

Figure 2.14: Visual speech synthesis based on the concatenation of segments of original speech data. Each output video frame is copied from the speech database.

A concatenative speech synthesizer needs to be provided with a database containing original speech recordings from a single speaker. To synthesize novel speech, the synthesizer searches in the database for suitable segments that (partially) match the target phoneme sequence.
Once an optimal set of original segments is determined, these segments are concatenated to create the final synthetic speech signal. An important factor is the size of the database segments that are available for selection. Older systems often use a database containing diphone recordings. This way, the size of the database can be kept limited as it suffices to contain at least one instance of each possible combination of every two phonemes or visemes that exist in the target language. The selected diphones are usually concatenated at the middle of the first and the second phoneme or viseme, respectively. As such, each transition between two consecutive phonemes or visemes in the output speech will consist of original speech data. This way, coarticulation effects are copied from the original speech to the synthetic speech. When more data storage and stronger computing power is available, the concatenative synthesis can be improved by selecting longer segments (triphones, syllables, words, etc.) from a database of continuous original speech. In this so-called unit-selection approach, fewer concatenations are needed to create a synthetic sentence, which reduces the chance for concatenation artefacts. In addition, reusing longer original segments permits of copying extensive coarticulation effects (extending over multiple phonemes) from the database to the synthetic speech. A general overview of the concatenative synthesis approach is illustrated in figure 2.14. The major benefit of a concatenative synthesis approach is the fact that a maximal amount of original speech data is reused for generating the novel synthetic speech. Because of this, modelling the coarticulation effects becomes superfluous since original transitions between phonemes or visemes are copied from the original speech data. In addition, the synthetic visual speech will exhibit a high degree of static realism since only a limited number of output frames are newly generated 2.2. An overview on visual speech synthesis 55 during synthesis (e.g., for smoothing the concatenations). This is in contrast with rule-based synthesis approaches, in which many new video frames are generated for interpolating between the predicted keyframes. Such a generation of new frames involves the danger that these frames exhibit unrealistic speaker configurations that are non-existing in original speech. The drawback of concatenative synthesis is its large data footprint and the strong computing power that is required to perform the segment selection calculations. However, at present time this is only a possible issue with small-scale systems like cell-phones, automotive applications, etc. Even more, the current large bandwidth capabilities allow a distant calculation where the hand-held device only needs to send the synthesis request to a server and display the synthetic speech after receiving the server’s response. Concatenative synthesis requires a beforehand recording of original speech data. For concatenative synthesis approaches using a 3D-based facial animation scheme, a performance-driven facial animation strategy is necessary, in which the speechrelated deformations of the facial model are copied from motion capture data of an original speaker. Exploratory studies on the concatenation of polyphones described in terms of 3D polygon mesh configurations were described by Hallgren et al. [Hallgren and Lyberg, 1998] and Kuratate et al. [Kuratate et al., 1998]. Edge et al. 
proposed a unit selection approach based on so-called “dynamic phonemes”, which can be seen as phonemes in a particular phonemic context [Edge and Hilton, 2006]. The visual context of each phoneme was also taken into account by Breen et al. by performing a concatenative synthesis based on di-visemes [Breen et al., 1996]. Each di-viseme corresponds to some units that describe variations of a 3D polygon mesh. A unit selection synthesis based on 3D motion capture data was described by Minnis et al. [Minnis and Breen, 2000]. In their system, the selection of variable-length original speech segments depends on the correspondence between the phonemic context of the candidate segment and the phonemic context of the target phoneme. A similar system was proposed by Cao et al., in which the longest possible segments are selected from the database in order to minimize the number of concatenations [Cao et al., 2004]. In the system by Ma et al., the captured 3D motions are organized in a graph indicating the cost of each possible transition between the recorded phoneme instances [Ma et al., 2006]. Based on the target phoneme sequence, an optimal path through this graph that traverses the necessary nodes is searched. A similar approach has been described by Deng et al., in which an optimal path through all recorded phoneme instances is constructed to create a trajectory of facial animation parameters that corresponds to both a target phoneme sequence and to time-evolving expressive properties [Deng and Neumann, 2008]. Note that any facial animation model can be animated using concatenative synthesis, given an appropriate database of original speech gestures. For instance, Engwall investigated on the diphone-based concatenation of captured articulation 2.2. An overview on visual speech synthesis 56 data for animating a 3D tongue model [Engwall, 2002]. From sections 2.2.6.2 and 2.2.6.3 it is clear that 3D motion capture data can be used for learning articulation rules as well as for direct reusage when generating the synthetic visual speech. Bailly et al. evaluated a synthesis approach based on the concatenation of audiovisual diphones represented by 3D model parameters [Bailly et al., 2002]. The attained synthetic visual speech was found to be superior to a rule-based synthesis approach for which the articulation rules were trained on the same original speech data as was used in the concatenative synthesis. For an optimal transfer of the original visual coarticulation effects from the original speech to the synthetic speech, Kshirsagar et al. proposed a technique that selects and concatenates syllable-length original speech segments [Kshirsagar and Magnenat-Thalmann, 2003]. These syllables are described in terms of facial movement parameters resulting from a mathematical analysis of facial motion capture data. The use of syllables is motivated by the fact that most coarticulation occurs within the boundaries of a syllable. An interesting concatenative synthesis approach using an anatomy-based facial animation scheme was proposed by Sifakis et al. [Sifakis et al., 2006]. Based on motion capture data, a database of sentences and the parameter trajectories corresponding to the muscle activations needed to utter these sentences were constructed. By segmenting these muscle-parameter trajectories based on phoneme boundaries, so-called psysemes were defined. To synthesize new speech, an optimal sequence of such psysemes is selected and concatenated. 
From these concatenated muscle-parameter trajectories, a novel facial animation sequence is generated. As the proposed synthesis strategy creates smooth muscle-activation trajectories by selecting and concatenating original speech segments based on muscle activation (instead of selecting segments based on appearance like other methods do), visual coarticulation is taken into account since such coarticulation effects are due to the inability of the facial muscles to instantaneous change their activation level. In comparison with 3D motion capture techniques, gathering a database of 2D original visual speech is much easier. The synthetic speech signal is directly generated from reusing original video frames from the visual speech database. A pioneering work is the Video Rewrite system by Bregler et al., in which a new video sequence is constructed by reusing original triphone-sized video fragments from the database [Bregler et al., 1997]. A similar approach based on the selection of variable-length segments was proposed by Shiraishi et al. [Shiraishi et al., 2003] and by Fagel [Fagel, 2006]. A system by Arslan et al. [Arslan and Talkin, 1999] selects phoneme-sized 2D segments from the database, where each phoneme instance is represented by its phonemic context up to 5 phonemes backward and forward. The distance between two such phonemic contexts is calculated based on the measured 2.2. An overview on visual speech synthesis 57 similarity between the mean visual representations of every two distinct phonemes in the database. A similar approach was also used by Theobald et al. to select phoneme-sized original speech segments from a database containing visual speech fragments that are mapped on AAM parameters [Theobald et al., 2004]. Some systems construct the synthetic speech signal by a frame-by-frame selection from the database. Each new frame that is added to the synthetic video sequence is selected based on various aspects, such as phonetic/visemic matching with the target phoneme sequence and the continuity of the resulting visual speech signal. Examples of such systems are described by Weiss [Weiss, 2004] and by Liu et al. [Liu and Ostermann, 2009]. The last system was also extended to select video frames from a database containing expressive visual speech fragments as well [Liu and Ostermann, 2011]. An interesting approach to concatenative visual speech synthesis was proposed by Jiang et al. [Jiang et al., 2008]. Their speech-driven synthesizer uses a database of audiovisual di-viseme instances, learned from original audiovisual speech. The input auditory speech is first translated in a sequence target di-visemes, after which an appropriate sequence of database di-visemes is collected to create the output visual speech. For each target di-viseme, the system selects from all matching database instances the most suitable one by measuring the similarity between the spectral information of the database auditory speech signal and the spectral information of the input auditory speech signal. Smooth animation is ensured by taking also the ease of the concatenation of two consecutive database segments into account. An important contribution to the field of 2D photorealistic visual speech synthesis is due to Cosatto & Graf. The first version of their system implemented a hybrid rule-based/concatenation-based approach [Cosatto and Graf, 1998]. In this system, for each representative phoneme a typical set of mouth parameters (width, position upper lip and position lower lip) is determined. 
Similar to other rule-based approaches, to synthesize a new speech signal, for each target phoneme a keyframe is defined by its predicted mouth parameters. A grid is populated with mouth appearances sampled from original visual speech recordings. Each dimension of the grid represents a mouth parameter and each grid entry contains multiple mouth samples. As such, for each keyframe a representative mouth sample can be selected from the populated grid. Interpolation between the keyframes is achieved by interpolating the keyframe mouth parameters and selecting for each intermediate parameter set the most corresponding mouth sample from the grid. The Cohen-Massaro coarticulation model is used by calculating the interpolated parameter values based on an exponentially decaying dominance function that is defined for each representative phoneme. The authors also suggest a concatenativebased interpolation strategy, in which common coarticulations are not generated 2.2. An overview on visual speech synthesis 58 using keyframe interpolation but in which the intermediate frames are created by reusing sequences of mouth parameters that have been measured in original speech fragments. In a later version of the system, such a data-based interpolation was used to predict the mouth parameters for every output frame [Cosatto and Graf, 2000]. Then, a frame-based unit selection procedure is performed, in which for each output frame a set of candidate mouth samples is gathered from the database based on their similarity with the predicted mouth parameters for that frame. From each set of candidate mouth instances, one final instance is selected by maximizing the overall smoothness of the synthetic visual speech. Note that this synthesis strategy can be seen as a hybrid rule-based/concatenation-based approach, in which the concatenative synthesis stage is based on target speech features predicted by the rule-based synthesis stage. Later on this visual speech synthesizer evolved towards a truly concatenative system (omitting the rule-based prediction stage) where variable-length video sequences are selected from a visual speech corpus based on target and join costs [Cosatto et al., 2000]. Similar to unit selection-based auditory synthesis [Hunt and Black, 1996], the target costs express how good the candidate segment matches the target segment, while the join costs express the ease in which consecutive selected segments can be concatenated. Finally, in another implementation the unit selection-based synthesizer was extended to minimally select triphone-sized segments in order to speed-up the selection process [Huang et al., 2002]. 2.2.6.4 Synthesis based on statistical prediction Section 2.2.1 described how statistical modelling (e.g., using an HMM) can be used to synthesize novel visual speech based on a given auditory speech signal. However, such a statistical modelling can also be used to synthesize novel speech from text input. This technique has been applied for both auditory and visual speech synthesis purposes. In general, prediction-based speech synthesis requires that in a prior training stage a prediction model is built by learning the correspondences between captured properties of original speech and the corresponding phoneme sequence or viseme sequence. The prediction model has to take both static correspondences (the relationship between the observed features and the corresponding phoneme or viseme) and dynamic properties (the transitions between feature sets) into account. 
After training, the model can predict new parameter trajectories based on a target phoneme or viseme sequence. Sampling these trajectories gives a prediction of the target speech features for each frame of the output speech signal. In an alternative approach, the statistical model only predicts target features for a limited set of keyframes, after which an interpolation is performed to acquire target features for each output frame. These synthesizers can be seen as hybrid rule-based/statistical model-based systems, which learn their articulation rules by statistically modelling 2.2. An overview on visual speech synthesis 59 Phoneme sequence + Timings Trained prediction model FR0 FR1 FR2 FR3 FR4 FR5 FR6 FR7 FR8 FR9 time Figure 2.15: Visual speech synthesis based on statistical prediction of visual features. Note that some systems only predict features for a limited number of keyframes, after which an extra interpolation is required to acquire a set of target features for each output frame. features derived from original speech fragments. A general overview of the synthesis approach is illustrated in figure 2.15. The benefit of speech synthesis based on statistical prediction is the fact that it combines the advantages of both rule-based and concatenative synthesis: observed original (co-)articulations can be reused without the need to explicitly model this behaviour, while the synthesizer’s data footprint is still small since no original speech data needs to be stored after the training stage. The downside is the fact that the original speech data must be parameterized in order to be able to train the model. Thus, the synthetic speech signal is not constructed directly from original speech data but it is regenerated from the predicted features. This possibly leads to a degraded signal quality. Tamura et al. proposed a visual speech synthesis strategy based on visual features predicted by an HMM [Tamura et al., 1998]. In this system, simple geometrical features describe the visual speech by using 2D landmark points indicating the lip shape. Syllables were chosen as basic speech synthesis unit, where for each syllable a four-state left-to-right model with single Gaussian diagonal output distributions and no skips was trained. HMMs are also used in the system by Zelezny et al., in which a phoneme is used as basic synthesis unit [Zelezny et al., 2006]. Each phoneme is modelled using a five-state left-to-right HMM with three central emitting states. The visual speech was parameterized in terms of 3D landmark points around the lips and the chin. Note that in this system, the HMM is only used to predict some particular keyframes of the output speech. Afterwards, smooth trajectories are calculated using an interpolation based on the Cohen-Massaro coarticulation model. Govokhina et al. also trained phone-based HMMs for visual speech synthesis purposes [Govokhina et al., 2006a]. This system uses articulatory parameters derived from a PCA analysis on 3D motion capture data as speech features. An alternative approach was proposed by Malcangi, who describes a system that statistically 2.2. An overview on visual speech synthesis 60 predicts keyframe values using ANNs [Malcangi, 2010]. Afterwards, smooth trajectories are obtained using an interpolation based on fuzzy logic [Klir and Yuan, 1995]. 
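Since several of the systems above smooth their predicted keyframes with the Cohen-Massaro coarticulation model, the following sketch shows the basic idea of dominance-based blending: each segment contributes a target value weighted by an exponentially decaying dominance function centred on that segment. The exponential shape, the parameters and the toy targets are simplifying assumptions for illustration, not the exact published model.

```python
"""A minimal sketch of coarticulation via dominance functions,
loosely inspired by the Cohen-Massaro model referred to above."""
import numpy as np

def dominance(t, centre, strength=1.0, rate=30.0):
    """Negative-exponential dominance of a segment, peaking at its centre."""
    return strength * np.exp(-rate * np.abs(t - centre))

def blend_targets(t, centres, targets, strengths, rate=30.0):
    """Dominance-weighted average of the per-segment targets at time t."""
    d = np.array([dominance(t, c, s, rate) for c, s in zip(centres, strengths)])
    return (d[:, None] * np.asarray(targets)).sum(axis=0) / d.sum()

centres   = [0.05, 0.20, 0.35]        # segment centres (s)
targets   = [[0.2], [0.9], [0.1]]     # e.g. lip-opening targets per segment
strengths = [1.0, 0.5, 1.0]           # a weakly dominant middle segment
frames = np.arange(0.0, 0.40, 0.02)   # 50 fps output
trajectory = np.array([blend_targets(t, centres, targets, strengths) for t in frames])
print(trajectory[:3, 0])              # smooth onset influenced by neighbouring targets
```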
An interesting approach for synthesizing 2D visual speech was proposed by Ezzat et al., in which a multidimensional morphable model was used to model each video frame from the original speech recordings [Ezzat et al., 2002]. Such a model is built by selecting a reference image and a set of images containing key mouth shapes. Then, the optical flows that morph each key image to the reference image are calculated. A novel frame is defined in the model space by a set of shape parameters and a set of appearance parameters. The shape parameters define the linear contribution of the original optical flow vectors that, when applied to the reference image, generate a set of morphed images, while the appearance parameters define the contribution of these morphed images in the synthesis of the target frame. Each video frame from a recording of original visual speech was projected in the model space. Afterwards, from all original frames corresponding to a particular phoneme the shape and the appearance parameters were gathered. By doing so, each phoneme is represented by two multidimensional Gaussians (one for the shape and one for the appearance). The synthesis of novel speech, based on a target phoneme sequence, is solved as a regularization problem since a trajectory through the model space is searched in order to minimize both a target term and a smoothness term. Coarticulation is modelled via the magnitude of the measured variance for each phoneme. A small variance means that the trajectory must pass through that region in the phoneme space, and hence neighbouring phonemes have little coarticulatory influence. On the other hand, a large variance means that the trajectory has a lot of flexibility in choosing a path through a particular phonetic region, and hence it may choose to pass through regions which are closer to a phoneme’s neighbours. The phoneme will thus experience strong coarticulatory effects. The downside of the strategy proposed by Ezzat et al. is that each viseme is characterized by only static features. Kim et al. extended the technique for generating 3D model parameters, for which not only static but also dynamic properties of each phoneme instance from the original speech fragments were used to train the model [Kim and Ko, 2007]. Recently, some hybrid synthesis approaches that are based on both statistical modelling and reusing original speech data have been proposed [Govokhina et al., 2006b] [Tao et al., 2009] [Wang et al., 2010]. Note that a similar hybrid strategy has also been proposed for generating synthetic auditory speech (see section 1.4.2). In a first stage of the hybrid synthesis, target features describing the synthetic speech are predicted using a trained statistical model. In a second stage, these predictions are used to select appropriate segments from a database containing original speech fragments. Govokhina et al. proposed such a hybrid text-driven synthesis method in which the HMM-based synthesis stage is performed using context-dependent 2.3. Positioning of this thesis in the literature 61 phoneme models [Govokhina et al., 2006b]. The hybrid synthesis was found to outperform both HMM-only and concatenative-only synthesis approaches. Wang et al. proposed a hybrid strategy to synthesize visual speech from auditory speech input [Wang et al., 2010]. In the training stage, an HMM is trained on the correspondences between auditory and visual features of original audiovisual speech recordings. 
Then, in a first synthesis stage, given a novel auditory input, the trained HMM predicts a set of target visual features for each output frame. In a second synthesis stage, a frame-based unit selection is performed, where the target cost is calculated as the distance between the candidate original frame and the frame predicted by the HMM. An alternative speech-driven hybrid synthesis approach was proposed by Tao et al., in which sub-sequences of original visual speech are selected from a database based on a target cost that is calculated using a Fused-HMM [Tao et al., 2009]. The Fused-HMM models the joint probabilistic distribution of the novel audio input and the candidate visual deformations. 2.3 Positioning of this thesis in the literature From the literature overview given in section 2.2 is it clear that a wide variety of approaches for achieving (audio)visual speech synthesis can be adopted. A particular category of systems that shows much potential are the synthesizers that generate both a synthetic auditory and a synthetic visual speech mode from a given text input (i.e., audiovisual text-to-speech synthesis systems). The reason for this is twofold. First, there exist countless applications, such as virtual announcers and virtual teachers, for which these synthesizers can be adopted: AVTTS synthesis is the most optimal technique to realize speech-based communication from a computer system towards its users. Second, the generation of the auditory and the visual speech mode by the same system permits to enhance the level of audiovisual coherence in the synthetic speech as much as possible. For this purpose, a single-phase audiovisual speech synthesis approach is favourable (see section 2.2.2). It is remarkable that such single-phase AVTTS strategies have only been adopted in some exploratory studies (see section 2.2.2 for references). Since humans are very experienced in simultaneously perceiving auditory and visual speech information, they are very sensitive to the coherence between these two information streams. The most important coherence-related feature that an AVTTS system needs to address is the synchrony between the two synthetic speech modes. Synchronous speech modes can be generated by both single-phase and two-phase AVTTS synthesizers. In general, a two-phase system will first generate the synthetic auditory mode, after which the phoneme durations found in this signal are imposed on the durations of the visemes in the synthetic visual speech signal (or vice-versa). Obviously, audiovisual synchrony can be achieved by single- 2.3. Positioning of this thesis in the literature 62 phase AVTTS systems as well, since both synthetic speech modes are generated simultaneously. However, synchrony is not the only feature that determines the overall level of audiovisual coherence. For instance, both the auditory and the visual speech mode contain coarticulation effects. In original audiovisual speech, these coarticulations occur simultaneously in both speech modes. However, when the synthetic speech modes are synthesized separately, the auditory and the visual coarticulations are introduced independently. It is impossible to predict how these fragments of auditory and visual speech information will be perceived when they are presented audiovisually to an observer. 
More in general, notwithstanding that a well-built two-phase AVTTS synthesizer is able to generate synchronous auditory and visual speech signals which both exhibit on their own high-quality and natural speech sounds/gestures, such a two-phase synthesis is unable to ensure that the audiovisual coherence between both synthetic speech modes is sufficient for a high-quality and natural perception of the multiplexed audiovisual speech signal. Humans are very well trained to match auditory and visual speech. Therefore, the challenge for an AVTTS system is to create an auditory speech mode of which the human observers believe that it could indeed have been generated by the virtual speaker’s speech gestures that are displayed in the accompanying visual speech signal. For this purpose, a single-phase AVTTS approach is the most favourable synthesis strategy. This thesis evaluates the benefits of a single-phase AVTTS synthesis approach over the more conventional two-phase synthesis strategy. Similar to most modern speech synthesizers, a concatenative synthesis strategy is adopted in which original audiovisual articulations and coarticulations are copied from original speech recordings to the synthetic speech signal (see section 2.2.6.3). Section 2.2.3 elaborated on the differences between 2D-based and 3D-based synthesis strategies. This thesis adopts a 2D-based synthesis approach. The reason for this is two-fold. First, as section 2.2.5 described, a 2D-based synthesis does not require the construction and implementation of advanced facial models and their associated rendering techniques, since the virtual speaker can be directly rendered from original 2D speech recordings. Moreover, gathering original 2D speech data is much easier to perform in comparison with 3D motion capture techniques. A second reason to opt for a photorealistic 2D-based synthesis approach is that its output speech resembles standard television broadcast and video recordings (see section 2.2.4), two categories of audiovisual speech signals that people are very familiar with. This is advantageous when conducting subjective perception experiments in which the participants have to rate or compare samples containing synthetic audiovisual speech. The major downside of 2D-based visual speech synthesis is its limited applicability in virtual surroundings and its limited power to create new expressions. However, this is not an issue since the main goal of this thesis is to investigate efficient 2.3. Positioning of this thesis in the literature 63 strategies for performing single-phase AVTTS synthesis and the general evaluation of a single-phase AVTTS synthesis approach in comparison with more traditional two-phase synthesis strategies. Possible important synthesis paradigms resulting from this thesis can in future research still be adopted in 3D facial animation schemes and/or in systems that also incorporate additional facial expressions for mimicking visual prosody and the emotional state of the virtual speaker. This thesis describes the development of a 2D photorealistic single-phase concatenative AVTTS synthesizer. Single-phase concatenative audiovisual speech synthesis using 3D motion data has already been mentioned in the studies by Hallgren et al. [Hallgren and Lyberg, 1998], Minnis et al. [Minnis and Breen, 2000] and Bailly et al. [Bailly et al., 2002]. Unfortunately, these studies select the appropriate audiovisual speech segments from the database based on auditory features only. 
Obviously, this will result in sub-optimal synthetic visual speech signals, although all studies report that the attained synthesis quality benefits from the fact that synchronous and coherent original audiovisual speech data is applied. The study of Minnis et al. also mentions a visual concatenation strategy that takes the importance of each phoneme for the purpose of lip-readability into account. Only two systems that apply a 2D photorealistic single-phase concatenative AVTTS synthesis have been described in the literature. Shiraisi et al. developed a system to synthesize Japanese audiovisual speech using a database of 500 original sentences [Shiraishi et al., 2003]. Both a single-phase approach (in which audiovisual speech segments are selected from the database) and a two-phase approach (in which auditory and visual segments are independently selected from the database) were implemented. The smoothness and the naturalness of the resulting visual speech mode were assessed, from which it was found that a unimodal selection of the original visual speech segments resulted in higher-quality synthetic visual speech. This is a rather obvious result, since no visual features are taken into account during the audiovisual segment selection. Unfortunately, no assessment of the resulting audiovisual synthetic speech was made. In addition, the authors do not mention any strategy to smooth the concatenations, indicating that the resulting (audio-)visual speech is likely to contain noticeable concatenation artefacts (given the limited size of the provided speech database) which interfere with a subjective evaluation of the quality of the system. Another approach similar to the one that is investigated in this thesis was described by Fagel [Fagel, 2006]. In this system, audiovisual segments are selected from a database containing original German speech fragments. The longest possible segments are selected in order to minimize the number of concatenations. The smoothness between two candidate segments is determined using both auditory and visual features. Unfortunately, the system applies no technique to smooth the concatenated speech, which results in a jerky output signal: despite the fact that the recorded text corpus was optimized to contain about 820 distinct 2.3. Positioning of this thesis in the literature 64 diphones, the complete database contained only 2000 phones which is fairly limited for unit selection-based speech synthesis. The system was evaluated by measuring intelligibility scores for consonant-vowel and vowel-consonant sequences in three modalities: auditory, visual, and audiovisual speech. Both natural and synthesized speech was evaluated and in all samples the auditory mode was contaminated with noise. It was found that for all modalities, the recognition of the synthetic sequences was as good as the recognition of the original sequences. Unfortunately, the author does not mention any conclusion on the comparison between a single-phase and a two-phase audiovisual speech synthesis approach. Therefore, such a comparison will be the primary goal of this thesis. Note that in order to allow meaningful subjective evaluations of the AVTTS strategy, high quality synthetic auditory and visual speech signals have to be generated. For this purpose, much attention should be given to the design of the database containing original speech fragments, to the audiovisual segment selection technique, and also to a concatenation strategy that is able to create smooth synthetic speech signals. 
Recall from section 2.2.6.3 that high quality synthesis should also pay attention to successfully transfer coarticulation effects from the original speech to the synthetic speech. 3 Single-phase concatenative AVTTS synthesis 3.1 Motivation As explained in section 2.3, this thesis aims to evaluate the single-phase audiovisual speech synthesis approach. To this end, in the first part of the research a concatenative single-phase AVTTS synthesizer will be developed. Afterwards, the benefits of the single-phase approach will be evaluated and the single-phase synthesis strategy will be compared with the more traditional two-phase synthesis paradigm, in which the synthetic auditory and the synthetic visual speech mode are generated separately. 3.2 3.2.1 A concatenative audiovisual text-to-speech synthesizer General text-to-speech workflow The general workflow in which the AVTTS system translates the text input into an audiovisual speech signal is very similar to standard auditory unit selection text-to-speech synthesis. The synthesis process can be split-up in two stages, where in a high-level synthesis stage the input text is processed to acquire an appropriate collection of parameters and descriptions that can be used by the low-level synthesis stage to create the actual auditory/visual speech signals. An overview of the AVTTS synthesis process is given in figure 3.1. 65 3.2. A concatenative audiovisual text-to-speech synthesizer Text Normalisation Tokenisation 66 Part-of-speech tagging Syntactic parsing Assign prosody model Token-to-sound rules Lexicon Postlex rules Phonemic transcription Assign timings and f0-contour Sound/Image synthesis Audiovisual speech Figure 3.1: Overview of the AVTTS synthesis. High-level synthesis steps are indicated by rectangles and the low-level synthesis stage is indicated by an ellipse. The high-level synthesis stage, also known as the linguistic front-end, first normalizes the input text by converting it into a set of known tokens. For instance, abbreviations are expanded and numbers are written down using plain words. Then, each word from the target speech is typified using a part-of-speech tagger (indicating the nouns, verbs, adverbs, etc.) and possibly by a syntactic parser (to provide information about the inter-word relationships in a sentence, such as subject, direct object, etc.). Using this data, a prosody model is constructed for each target utterance, describing variations in pitch and timings, assigning accents to speech segments, and predicting phrase-breaks between words. This prosodic information can for instance be expressed by means of “tone-and-break indices” (ToBi), which indicate the variations in speech rate and pitch going from word to word or from syllable to syllable [Pitrelli et al., 1994]. The sequence of input tokens and the part-of-speech/syntactic information is also used to construct a target phoneme sequence. For this a lexicon is used that contains for each word of the target language its phonemic transcription. Note that each language contains several words that have the same spelling but a different pronunciation. The distinction in the pronunciation of the word can be due to the phonemic context (e.g., the English word “the” which sounds different when uttered before a consonant of before a vowel) or it can be due to multiple semantic meanings of the word (so-called heteronyms, e.g., the English word “refuse” which can be 3.2. A concatenative audiovisual text-to-speech synthesizer 67 either a noun or a verb). 
Especially for heteronyms it is crucial that the correct phonemic transcript is applied by the speech synthesizer in order to convey the correct semantic information in the synthetic speech. Based on the part-of-speech tagging and on the syntactic information, it should be possible to select for each heteronym found in the input text the intended entry from the lexicon. On the other hand, pronunciation variations due to phonemic context can be defined in so-called postlex rules, which are applied for locally fine-tuning the phonemic transcript after the complete input text has been processed. It is impossible to avoid that some words from the input text are missing in the lexicon (e.g., names or foreign expressions). For these particular words, a phonemic transcription is estimated by a predefined set of token-to-sound rules (also known as grapheme-to-phoneme rules). Once the final target phoneme sequence has been determined, the assigned prosody model can be used to predict for each individual phoneme a target duration. In addition, an f0-contour that models the target pitch for each speech segment can be constructed. Finally, the target phoneme sequence and its associated prosodic parameters are given as input to the low-level synthesizer which then constructs the appropriate physical speech signals. 3.2.2 Concatenative single-phase AVTTS synthesis The synthesis paradigm for the low-level synthesis stage that is adopted in this thesis is unit selection concatenative synthesis. As section 2.2.6.3 described, in this synthesis approach the synthetic speech is constructed by the concatenation of original speech segments that are selected from a database containing original speech recordings. In most modern systems, the segments are selected from continuous original speech, which permits the selection of segments containing multiple consecutive original phones. This way, both original coarticulations and original prosody can be copied from the original speech to the synthetic speech. Original segments exhibiting appropriate prosodic properties can be selected from the database by using selection criteria that are linked with prosody, such as the position in the sentence, the position in the syllable and syllable stress (see further in section 3.4.2.2). Because of this, the AVTTS system does not necessarily need to predict a prosody model and its associated timing and pitch parameters for each utterance. This simplifies the synthesis workflow and it also minimizes the need for modification of the selected original segments. This is advantageous since additional modifications to the original speech, such as a time-scaling in order to match the target phoneme durations, can result in irregular signals that degrade the quality of the synthetic speech [Campbell and Black, 1996]. This thesis focusses on the development of the low-level synthesis stage. As backbone of the synthesis system, the Festival framework is used [Black et al., 3.2. A concatenative audiovisual text-to-speech synthesizer 68 2013]. The Festival framework offers modules for each step of the TTS synthesis process, as well as an environment to connect the various modules in order to attain a fully-operational speech synthesizer. Festival also allows to integrate user-defined synthesis modules in the synthesis workflow, a functionality that can for instance be applied to combine new low-level synthesis algorithms with the high-level synthesizer of the official Festival release. 
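As a rough illustration of the high-level front-end steps described above (normalisation, lexicon lookup, heteronym disambiguation via part-of-speech, postlex rules and a letter-to-sound fallback), the following toy sketch chains these steps together. The tiny lexicon, the rules and all names are hypothetical and are not taken from Festival or NeXTeNS.

```python
"""A highly simplified sketch of a linguistic front-end; a real system
implements each of these steps with far richer models."""
import re

LEXICON = {                      # (word, part-of-speech) -> phoneme sequence
    ("refuse", "noun"): ["R", "EH", "F", "Y", "UW", "S"],
    ("refuse", "verb"): ["R", "IH", "F", "Y", "UW", "Z"],
    ("the", None):      ["DH", "AH"],
    ("fish", None):     ["F", "IH", "SH"],
}

def normalise(text):
    """Expand a few token types; real normalisation also handles abbreviations etc."""
    text = re.sub(r"\b2\b", "two", text)
    return text.lower().split()

def letter_to_sound(word):
    """Crude grapheme-to-phoneme fallback for out-of-lexicon words."""
    return [ch.upper() for ch in word if ch.isalpha()]

def postlex(phonemes, next_word_starts_with_vowel):
    """Example postlex rule: 'the' becomes DH IY before a vowel."""
    if phonemes == ["DH", "AH"] and next_word_starts_with_vowel:
        return ["DH", "IY"]
    return phonemes

def transcribe(tokens, pos_tags):
    out = []
    for i, (tok, pos) in enumerate(zip(tokens, pos_tags)):
        ph = LEXICON.get((tok, pos)) or LEXICON.get((tok, None)) or letter_to_sound(tok)
        nxt = tokens[i + 1][0] in "aeiou" if i + 1 < len(tokens) else False
        out.append(postlex(ph, nxt))
    return out

print(transcribe(normalise("the old man will refuse the fish"),
                 [None, None, None, None, "verb", None, None]))
```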
The modules of the Festival system are written in C++, while the backbone of the system uses a Scheme interpreter to pass data from one module to the other. This thesis describes two audiovisual speech synthesizers, targeting English and Dutch, respectively. For the English synthesis, original Festival high-level synthesis modules were used, while for the Dutch synthesis some high-level modules of the NeXTeNS TTS system were applied [Kerkhoff and Marsi, 2002]. The research described in this thesis was conducted in parallel with research on auditory-only TTS synthesis within the same research laboratory. In that research, both new high-level and low-level synthesis techniques for generating auditory speech from text input are investigated [Latacz et al., 2007] [Latacz et al., 2010] [Latacz et al., 2011]. Over time, many of the new high-level synthesis modules that have been developed in the auditory TTS research were included in the AVTTS synthesizer that is described in this thesis. Note that the details on these modules are beyond the scope of this thesis, as they are mainly used to compose a high-quality set of parameters and descriptions that are given as input to the low-level synthesis stage (which is the actual subject of this thesis). For an overview of the English high-level synthesis that is used in the AVTTS system the interested reader is referred to [Latacz et al., 2008]. In addition, details on the Dutch high-level synthesis stage that is used in the AVTTS system are found in [Mattheyses et al., 2011a].
The low-level single-phase concatenative synthesizer (from this point on the "low-level" label will be dropped, assuming that the input text has been translated into its corresponding phoneme sequence) generates the synthetic audiovisual speech by concatenating original audiovisual speech segments selected from an audiovisual speech database. By jointly selecting and concatenating auditory and visual speech data, a maximal audiovisual coherence is retained in the synthetic speech. The base unit that is used in the selection process is a diphone. This means that the input phoneme sequence is split up into consecutive diphones, for each of which a set of matching candidate segments (in terms of phonemic transcript) is gathered from the database. Then, for each target diphone one final original speech segment is selected from its matching candidates based on the optimization of a global selection cost that is calculated using both auditory and visual features (see further in section 3.4). When the database contains a speech fragment that matches the target phoneme sequence over multiple successive phones (e.g., when a complete word of the input text is found in the database speech), in many cases all the consecutive diphone segments that make up this original fragment are selected for the corresponding target phoneme sub-sequence (i.e., the whole original segment is copied to the synthetic speech).
Figure 3.2: Diphone-based unit selection for the word "fish" (phoneme sequence F IH SH; diphone targets _-F, F-IH, IH-SH and SH-_). The original speech data that is eventually copied to the synthetic speech is indicated in green. Phoneme labels are in the Arpabet notation [CMU, 2013] and "_" represents the silence phoneme.
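To make the diphone-based target construction concrete, the sketch below splits the phoneme sequence of the word "fish" from figure 3.2 into diphone targets and gathers phonemically matching candidates from a toy database index. The index contents and all names are hypothetical stand-ins for the real corpus metadata.

```python
"""A sketch of diphone target construction and candidate gathering."""

SIL = "_"

def to_diphone_targets(phonemes):
    """F IH SH  ->  (_,F) (F,IH) (IH,SH) (SH,_)"""
    padded = [SIL] + list(phonemes) + [SIL]
    return list(zip(padded[:-1], padded[1:]))

def gather_candidates(targets, diphone_index):
    """For each diphone target, collect all database segments with the same
    phonemic transcript; an empty list signals that a back-off (e.g. to a
    single phone) is needed."""
    return [diphone_index.get(t, []) for t in targets]

# Hypothetical index: diphone -> list of (sentence id, position) occurrences.
diphone_index = {
    (SIL, "F"): [("s012", 0), ("s101", 7)],
    ("F", "IH"): [("s012", 1)],
    ("IH", "SH"): [("s012", 2), ("s044", 3)],
    ("SH", SIL): [("s044", 4)],
}

targets = to_diphone_targets(["F", "IH", "SH"])
for tgt, cands in zip(targets, gather_candidates(targets, diphone_index)):
    print(tgt, "->", cands)
```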
However, in the general unit selection paradigm, it is not necessarily so that the longest possible segment is always selected, since apart from signal continuity other features such as the extended phonemic context and a match with the target prosody are taken into account. Once a final original segment has been selected for each diphone target, these segments are concatenated in order to construct the synthetic speech signal. This concatenation involves both the joining of waveforms and the joining of video frame sequences. As was explained in section 2.2.6.3, the diphones are concatenated at the middle of the first and the second phone, respectively. This way, the concatenation takes place in the most stable part of each phone and the original transition between the two phones (i.e., the original local coarticulation) is copied to the synthetic speech (see figure 3.2). In some cases it can occur that a target diphone is not found in the database, especially when the applied database is small and not optimized to contain at least one instance of each diphone existing in the target language. In that case, a back-off 3.2. A concatenative audiovisual text-to-speech synthesizer High-level synthesis Audiovisual unit selection Audiovisual concatenation audiovisual units Pitch-synchronous crossfade & image metamorphosis Original combinations of auditory & visual speech Audiovisual speech database Text 70 Waveforms & video sequences Figure 3.3: Overview of the audiovisual unit selection synthesis. takes place in which the synthesizer selects a single phone from the database. In the concatenation stage, the phone-sized segment is concatenated at the phone boundaries. Since the use of such phone-sized segments leads to less optimal results as compared to diphone-sized segments, back-offs should be avoided by optimizing the synthesis database. After the concatenation stage, the final audiovisual speech is obtained by a simple multiplexing of the concatenated auditory and the concatenated visual speech signal, without the need for any additional signal processing. This is in contrast to the two-phase AVTTS approach, in which an additional synchronization of the two separately synthesized speech modes is needed. The general workflow of the low-end synthesis process is illustrated in figure 3.3. In the following sections the various steps of the audiovisual unit selection synthesis will be discussed in more detail. 3.3. Database preparation 3.3 3.3.1 71 Database preparation Requirements In order to perform concatenative speech synthesis, a database containing original audiovisual speech recordings must be created and provided to the synthesizer. This is an important off-line step since the properties and the quality of this database for a great deal determine the attainable synthesis quality. Since the speech synthesis involves the concatenation of original speech segments that are extracted from multiple randomly-located parts of the database, it is crucial that the original speech data is consistent throughout the whole dataset. Therefore, speech data from a single speaker is used and it is attempted that the audiovisual recording conditions remain constant during the recording session(s). This thesis aims to design an AVTTS system that creates a 2D photorealistic synthetic visual speech signal displaying a frontal view of the virtual speaker (i.e., “newsreader-style”). Therefore, the original visual speech should be recorded in a similar fashion. 
An important issue to take into account are head movements. Even when the original speaker was instructed to keep his/her head as steady as possible, some small variations of the position of the face toward the camera are unavoidable. Some researchers have tried to overcome this problem by fixing the speaker’s head in a canvas [Fagel and Clemens, 2004] or by using a head-mounted camera [Theobald, 2003]. Unfortunately, such solutions often result in a less optimal video quality or are unable to capture a natural appearance of the complete face. Even more, when recording large databases the speaker should be able to sit adequately comfortable, which is impossible when his/her head is fixed in a stiff construction. The video signal itself should have a sufficiently high resolution and a sufficiently high frame rate to capture all subtle speech movements. In addition, it should be ensured that the signal allows a quality post-processing stage, in which for instance a spatial segmentation of the image data in each video frame is performed (separating the face from the background and/or indicating the position of various visual articulators). On the other hand, it should also be ensured that the audio recordings contain only minimal background noise and that the used microphones allow a natural voice reproduction. 3.3.2 Databases used for synthesis In this stage of the research, two audiovisual databases were used. A first preliminary audiovisual speech corpus “AVBS” containing 53 Dutch sentences from weather forecasts was recorded in a quiet room on the university campus. The audiovisual speech was recorded at a resolution of 704x576 pixels at 25 progressive frames per second. The audio was recorded by a lavalier microphone at 44100Hz. Two example frames of this database are given in figure 3.4. Obviously, this database 3.3. Database preparation 72 Figure 3.4: Example frames from the “AVBS” audiovisual database. is too limited to attain high quality synthesis results. Nevertheless, is has been very useful to design and test various high-level and low-level synthesis modules by synthesizing Dutch sentences from the limited domain of weather forecasts. In 2008 the LIPS visual speech synthesis challenge was organized to assess and compare various visual speech synthesis strategies using the same original speech data [Theobald et al., 2008]. With this event an English audiovisual speech database suitable for concatenative audiovisual speech synthesis was released. A great part of the work described in this thesis was conducted using this “LIPS2008” dataset. The database consists of audiovisual “newsreader-style” recordings of a native English female speaker uttering 278 English sentences from the phonetically-balanced Messiah corpus [Theobald, 2003]. The visual speech was recorded at 25 interlaced frames per second in portrait orientation at a resolution of 288x720 pixels. After post-processing, the final visual speech signals consisted of 50 progressive frames per second at a resolution of 576x720 pixels. The acoustic speech signal was captured using a boom-microphone near the subject and was stored with 16 bits/sample at a sampling frequency of 44100Hz. Two example frames from the LIPS2008 corpus are given in figure 3.5. 3.3.3 Post-processing In order to be able to use a speech database for concatenative speech synthesis purposes, appropriate meta-data describing various aspects of the speech contained in the database has to be calculated. 
These features will be used by the synthesizer to select the most appropriate sequence of original speech segments that compose the target synthetic speech. 3.3. Database preparation 73 Figure 3.5: Example frames from the “LIPS2008” audiovisual database. 3.3.3.1 Phonemic segmentation In order to determine which database segments are matching the target speech description, the original speech must be phonemically segmented. To this end, the original auditory speech is analysed and each phoneme boundary is indicated. Afterwards, the original visual speech signal is synchronously segmented by positioning the viseme boundaries at those video frames that are closest to the phoneme boundaries in the corresponding acoustic signal. Note that an exact match between these two boundaries is impossible since the sample rate of a video signal is much lower compared to the sample rate of an audio signal. In general, the phonemic segmentation of an auditory speech signal is performed by a speech recognition tool in forced-alignment mode. This means that both the acoustic signal and its corresponding phoneme sequence are given as input to the recognizer, after which for each sentence an optimal set of phoneme boundaries is calculated. For the preliminary Dutch database AVBS, the phonemic segmentation was obtained using the SPRAAK toolkit [Demuynck et al., 2008]. The LIPS2008 database was already provided with a hand-corrected phonemic segmentation created using HTK [Young et al., 2006]. 3.3.3.2 Symbolic features It was described in section 3.2.2 that in the case of unit selection-based synthesis the synthesis system does not have to directly estimate the prosodic features of the output speech, since segments containing an appropriate original prosody can be copied from the database to the synthetic speech. For this, the segment selection has to take prosody-dependent features into account. To this end, for each phone in the original speech multiple symbolic features were calculated based on phonemic, prosodic and linguistic properties, such as part-of-speech, lexical stress, syllable type, 3.3. Database preparation 74 etc. A complete list of these features is given in table 3.1. Note that some features were determined for the neighbouring phones/syllables/words as well. The symbolic features can be used in the segment selection process to force the selection towards original segments exhibiting appropriate prosodic features such as pitch, stress and duration (see further in section 3.4.2.2). Table 3.1: Symbolic database features. Features with a are also calculated for the neighboring phones, syllables or words. Neighboring syllables are restricted to the syllables of the current word. Three neighbors on the left and three on the right are taken into account. Level phone phone phone syllable syllable syllable syllable syllable syllable syllable syllable syllable syllable word word word word word word word word word word 3.3.3.3 Feature Phonemic identity Pause type (if silence) Position in syllable Phoneme sequence Lexical stress ToBI accent Is accented Onset and coda type Onset, nucleus and coda size Distance to next/previous stressed syllable (in terms of syllables) Nbr. stressed syllables until next/prev. phrase break Distance to next/previous accented syllable (in terms of syllables) Nbr. accented syllables until next/prev. phrase break Position in phrase Part of speech Is content word Has accented syllable(s) Is capitalized Position in phrase Token punctuation Token prepunctuation Nbr. 
words until next/prev. phrase break Nbr. content words until next/prev. phrase break Acoustic features Several acoustic features describing the auditory speech signal were determined. The acoustic signal was divided into 32ms frames with 8ms frame-shift, after which 3.3. Database preparation 75 12 MFCCs were calculated to parameterize the spectral information of each frame. In addition, for each sentence a series of pitch-markers was determined, indicating each pitch period in the (voiced) segments of the speech. This information is useful in case the speech has to be pitch-modified or time-scaled by algorithms such as PSOLA [Moulines and Charpentier, 1990]. Moreover, these pitch-markers are used in the acoustic concatenation strategy (see further in section 3.5.3). The pitchmarkers are calculated using a dynamic programming approach. Summarized, a crude estimation for each marker is calculated using an average magnitude difference function. Then, for each estimation a final marker position is determined by selecting the most appropriate marker from a set of candidate markers. For more details on this pitch-marking strategy the interested reader is referred to [Mattheyses et al., 2006] since this algorithm is beyond the scope of this thesis. Based on the distance between consecutive pitch-markers, for each sentence a pitch contour is calculated. Sampling this contour at the middle of each phoneme defines a pitch feature for each database segment. Finally, an energy feature is calculated by measuring the spectral energy of the acoustic signal in a window of 1024 samples (24ms) around the centre of each phoneme. 3.3.3.4 Visual features The most important step in the post-processing of the visual speech recordings is the tracking of key points throughout the video signal. These landmarks indicate the position of various visual articulators and other parts of the speaker’s face in each video frame (illustrated in figure 3.6). The key point tracking was based on both a general facial feature tracker developed at the Vrije Universiteit Brussel [Hou et al., 2007] and an AAM-based tracker that was kindly provided by prof. Barry-John Theobald (University of East Anglia) [Theobald, 2003]. The AAM-based tracker performed best since the recorded video frames make up a uniform sequence of images from which a manually landmarked subset can be used to train the tracker. Based on these landmarks, the mouth-region of each video frame was extracted (using a fixed-size rectangular area around the mouth). These mouth regions were mathematically parameterized by a PCA analysis. These calculations resulted in a set of “eigenfaces” and defined for each database frame a set of PCA coefficients that reconstruct the grayscaled version of the mouth area of that frame by a linear combination of the eigenfaces. While key point tracking is useful to locate in each frame visually important areas such as the lips and the cheeks, it cannot be used to identify the teeth or the tongue since these are not visible in each recorded video frame. Therefore, an image processing technique was developed to track these facial features throughout the database. In a first step, for each frame the mouth is extracted based on the 3.3. Database preparation 76 Figure 3.6: Landmarks indicating the various parts of the face. landmarks that indicate the position of the lips. Both horizontally and vertically a margin of only a few pixels from the most outside landmark is used for this crop. 
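Before turning to the colour-channel analysis of these mouth crops, the acoustic post-processing described above can be summarized in a short sketch: 32 ms analysis frames with an 8 ms shift, a per-frame energy measure, a pitch value sampled at the centre of each phone, and the mapping of phone boundaries onto video frame indices used for the visual segmentation. The signal, the precomputed pitch contour and all names are dummy placeholders; MFCC extraction and pitch-marking themselves would be delegated to dedicated tools.

```python
"""A sketch of the framing, energy, pitch-sampling and boundary mapping
described in this section; all inputs are dummy placeholders."""
import numpy as np

def frame_signal(x, sr, frame_ms=32.0, shift_ms=8.0):
    frame_len = int(sr * frame_ms / 1000.0)
    shift = int(sr * shift_ms / 1000.0)
    n_frames = max(0, 1 + (len(x) - frame_len) // shift)
    return np.stack([x[i * shift: i * shift + frame_len] for i in range(n_frames)])

def frame_energies(x, sr):
    frames = frame_signal(x, sr) * np.hanning(int(sr * 0.032))
    return (frames ** 2).sum(axis=1)

def pitch_at_phone_centres(pitch_times, pitch_values, phone_bounds):
    """phone_bounds: list of (start, end) in seconds; sample f0 at each centre."""
    centres = [(s + e) / 2.0 for s, e in phone_bounds]
    return np.interp(centres, pitch_times, pitch_values)

def phone_to_video_frames(start_s, end_s, fps=50.0):
    """Map acoustic phone boundaries onto the nearest video frame indices."""
    return int(round(start_s * fps)), int(round(end_s * fps))

sr = 44100
x = np.random.randn(sr)                        # one second of dummy audio
print(frame_energies(x, sr).shape)             # one energy value per 8 ms frame
print(pitch_at_phone_centres([0.0, 0.5, 1.0], [200.0, 180.0, 190.0],
                             [(0.05, 0.20), (0.20, 0.42)]))
print(phone_to_video_frames(0.05, 0.20))
```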
Then, the coloured mouth-image is split into a blue, a green, and a red channel. In order to measure the area of the video frame representing visible teeth, the number of pixels in the blue channel that have an intensity value above a predefined threshold is calculated. This threshold is manually determined in such a way that the intensity measure results in a value close to zero when the detection is applied to video frames containing no visible teeth. The blue channel is chosen as this channel contains the least intensity information from the lips, the tongue and the skin. Note that this detection strategy only works under these particular circumstances, where each video frame is captured under the same recording conditions (e.g., external lighting, camera settings, etc.). The use of a similar technique to detect the presence of the tongue in a video frame is hard to realize since the tongue exhibits a variable appearance by moving forwards and backwards in the mouth. Therefore, it was decided to measure the visibility of the mouth-cavity (the dark area inside an open mouth when no tongue is displayed) instead. This is achieved in a similar fashion to the detection of the teeth, only this time the red channel is used and all pixels showing an intensity value below a predefined threshold are counted. This way, the mouth-cavity measure indirectly measures the tongue behaviour, since a high value will be obtained when the mouth appears wide open and no tongue is visible (since the tongue appears reddish, it mostly affects the pixel intensities in the red channel). The teeth and mouth-cavity detection is illustrated in figures 3.7, 3.8 and 3.9.
Figure 3.7: Detection of the teeth and the mouth-cavity (1). In the blue channel, the lips/skin/tongue contribute less to the pixel intensities as compared to the standard grayscale image, which improves the detection of the teeth (high pixel intensities). Likewise, detecting low pixel intensities in the red channel helps to detect the amount of mouth-cavity that is not blocked by the tongue.
Figure 3.8: Detection of the teeth and the mouth-cavity (2). This figure shows the histograms of the red channel of the five mouth representations displayed in figure 3.7. The summed histogram values below the threshold (indicated by the red line) are an appropriate representation of the amount of visible mouth-cavity.
Figure 3.9: Detection of the teeth and the mouth-cavity (3). This figure shows the histograms of the blue channel of the five mouth representations displayed in figure 3.7. The summed histogram values above the threshold (indicated by the red line) are an appropriate representation of the amount of visible teeth.
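The teeth and mouth-cavity measures described above essentially reduce to counting thresholded pixels in the blue and red channels of the mouth crop. A minimal sketch is given below; the threshold values are placeholders, whereas in the thesis they are tuned manually per database.

```python
"""A sketch of the teeth and mouth-cavity measures: pixels in the blue channel
above a threshold (visible teeth) and pixels in the red channel below a
threshold (dark mouth cavity). Threshold values are placeholders."""
import numpy as np

TEETH_THRESHOLD = 180    # blue-channel intensity above which a pixel counts as teeth
CAVITY_THRESHOLD = 60    # red-channel intensity below which a pixel counts as cavity

def teeth_and_cavity(mouth_rgb):
    """mouth_rgb: (H, W, 3) uint8 crop of the mouth region (R, G, B order)."""
    red = mouth_rgb[:, :, 0].astype(np.int32)
    blue = mouth_rgb[:, :, 2].astype(np.int32)
    teeth_area = int(np.count_nonzero(blue > TEETH_THRESHOLD))
    cavity_area = int(np.count_nonzero(red < CAVITY_THRESHOLD))
    return teeth_area, cavity_area

# Dummy crop: a dark "open mouth" with a bright patch standing in for teeth.
crop = np.full((40, 60, 3), 30, dtype=np.uint8)
crop[5:15, 20:40] = 220
print(teeth_and_cavity(crop))   # (200, 2200): bright patch vs. remaining dark pixels
```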
The desired output speech is defined by a series of targets. Each target has the size of a basic synthesis unit, which is a diphone in the case of the proposed AVTTS synthesis approach, and describes the ideal database segment needed to construct the synthetic speech. For each target, a set of phonemically matching candidate segments is gathered from the database. The distance between a candidate segment and the corresponding target defines the total target cost associated with that candidate. This distance is usually calculated using multiple features, each defining a sub-target cost. Since each target describes a diphone, the sub-target costs are calculated by comparing the features of the first and the second phone of the target with the features of the first and the second phone of the candidate segment, respectively.

Apart from searching for original speech segments closely matching the target speech, the segment selection algorithm has to take into account the ease with which two original speech segments can be joined together. To this end, the segment selection takes join costs into account that indicate the smoothness of the signal resulting from the concatenation of each pair of candidate segments corresponding to two consecutive targets. Similar to the calculation of the total target cost, the total join cost is calculated by comparing multiple features of the candidate segments, where each comparison defines its own sub-join cost. Note that the total join cost is always zero when the two candidate segments are adjacent in the database, since if those segments were selected they could be copied as a whole from the database to the synthetic speech (no concatenation is needed).

The total target cost of a candidate segment u_i matching a synthesis target t_i can be written as the weighted sum of k sub-target costs:

$$C_{total}^{target}(t_i, u_i) = \frac{\sum_{j=1}^{k} \omega_j^{target}\, C_j^{target}(t_i, u_i)}{\sum_{j=1}^{k} \omega_j^{target}} \qquad (3.1)$$

in which ω_j^target represents the weight factor of the j-th target cost. The various target costs C_j^target that are used by the synthesizer are discussed in section 3.4.2. Similarly, the join cost associated with the transition from candidate segment u_i to candidate segment u_{i+1} can be written as the weighted sum of l sub-join costs:

$$C_{total}^{join}(u_i, u_{i+1}) = \frac{\sum_{j=1}^{l} \omega_j^{join}\, C_j^{join}(u_i, u_{i+1})}{\sum_{j=1}^{l} \omega_j^{join}} \qquad (3.2)$$

in which ω_j^join represents the weight factor of the j-th join cost. The various join costs C_j^join that are used by the synthesizer are discussed in section 3.4.3. Using these two expressions, the total cost for synthesizing a sentence that is composed of T targets t_1, t_2, ..., t_T by concatenating candidate segments u_1, u_2, ..., u_T can be written as:

$$C(t_1, \ldots, t_T, u_1, \ldots, u_T) = \alpha \left[ \sum_{i=1}^{T} C_{total}^{target}(t_i, u_i) \right] + \sum_{i=1}^{T-1} C_{total}^{join}(u_i, u_{i+1}) \qquad (3.3)$$

in which α is a parameter that controls the importance of the total target cost over the total join cost. The most appropriate set of candidate segments is the sequence (û_1, û_2, ..., û_T) that minimizes equation 3.3. Searching for this optimal set is a complicated problem, since for every target multiple candidates exist, which leads to an enormous number of possible sequences, as illustrated in figure 3.11. Therefore, a dynamic programming approach known as the Viterbi search [Viterbi, 1967] is applied to efficiently find the optimal sequence of database segments. The Viterbi algorithm is explained in detail in appendix A.
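To make the role of equations 3.1 to 3.3 in the Viterbi search more concrete, the following minimal Python sketch performs the dynamic programming over precomputed cost values. The function name and the data layout (one list of total target costs per target, one matrix of total join costs per pair of consecutive targets) are assumptions made for this illustration only; they do not reflect the actual C++ implementation of the synthesizer.

```python
def viterbi_unit_selection(target_costs, join_costs, alpha=1.0):
    """Sketch of the Viterbi search that minimizes equation 3.3.

    target_costs[i][m]  : total target cost of candidate m for target i   (eq. 3.1)
    join_costs[i][m][n] : total join cost of joining candidate m of target i
                          to candidate n of target i+1                    (eq. 3.2)
    alpha               : balance between target and join costs           (eq. 3.3)
    """
    T = len(target_costs)
    # best[i][n]: lowest accumulated cost of any path ending in candidate n
    # of target i; back[i][n]: the best predecessor of that candidate.
    best = [[alpha * c for c in target_costs[0]]]
    back = [[None] * len(target_costs[0])]
    for i in range(1, T):
        row, ptr = [], []
        for n, tc in enumerate(target_costs[i]):
            # Choose the predecessor m that minimizes accumulated cost + join cost.
            m = min(range(len(best[i - 1])),
                    key=lambda m: best[i - 1][m] + join_costs[i - 1][m][n])
            row.append(best[i - 1][m] + join_costs[i - 1][m][n] + alpha * tc)
            ptr.append(m)
        best.append(row)
        back.append(ptr)
    # Backtrack the optimal sequence of candidate indices (one per target).
    n = min(range(len(best[-1])), key=lambda n: best[-1][n])
    path = [n]
    for i in range(T - 1, 0, -1):
        n = back[i][n]
        path.append(n)
    return list(reversed(path))
```

In the actual system the join cost of two candidate segments that are adjacent in the database is simply zero, which automatically favours the selection of longer contiguous stretches of original speech.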
Since the AVTTS synthesizer is designed to select and concatenate audiovisual speech segments, the total selection cost has to force the selection towards original segments that are optimal in the auditory as well as in the visual mode. To this end, both auditory and visual sub-costs are applied, and equations 3.1 and 3.2 can be written as:

$$C_{total}^{target}(t_i, u_i) = \frac{\sum_{j=1}^{k_a} \omega_j^{target,a}\, C_j^{target,a}(t_i, u_i) + \sum_{j=1}^{k_v} \omega_j^{target,v}\, C_j^{target,v}(t_i, u_i)}{\sum_{j=1}^{k_a} \omega_j^{target,a} + \sum_{j=1}^{k_v} \omega_j^{target,v}}$$

$$C_{total}^{join}(u_i, u_{i+1}) = \frac{\sum_{j=1}^{l_a} \omega_j^{join,a}\, C_j^{join,a}(u_i, u_{i+1}) + \sum_{j=1}^{l_v} \omega_j^{join,v}\, C_j^{join,v}(u_i, u_{i+1})}{\sum_{j=1}^{l_a} \omega_j^{join,a} + \sum_{j=1}^{l_v} \omega_j^{join,v}} \qquad (3.4)$$

with label a denoting audio-related values and label v denoting video-related values.

Figure 3.11: A trellis illustrating the unit selection problem. Each target t has many associated candidate segments u. From each candidate segment u_ij matching target t_i the transition to every candidate segment matching target t_{i+1} must be considered.

High-quality audiovisual synthesis can be achieved by minimizing equation 3.3 only if accurate sub-costs and an appropriate weighting between these multiple sub-costs are defined. This is not trivial, since it is likely that the segment that is optimal for constructing the synthetic auditory speech will often not be the preferable segment for constructing the synthetic visual speech, and vice versa. The following two sections elaborate on the various sub-costs that are applied in the proposed AVTTS synthesis approach.

Figure 3.12: Target costs applied in the AVTTS synthesis. Costs marked with an * are assigned an infinitely high weight factor.

3.4.2 Target costs

A target cost C^target(t_i, u_i) indicates to which extent a candidate database segment u_i matches the target speech segment t_i. An overview of the various target costs used by the AVTTS system is given in figure 3.12.

3.4.2.1 Phonemic match

Section 3.2.2 explained that for each target the synthesizer searches the database for candidate segments that phonemically match the target phoneme sequence. This technique already involves a "hidden" target cost, since in the general unit selection paradigm [Hunt and Black, 1996] each database segment is considered as a candidate unit. The candidate selection technique applied in the AVTTS system assumes a binary target cost C_phon.match(t_i, u_i) based on the phonemic matching between the target segment t_i and the database segment u_i: when the segment from the database has the same phoneme label as the target speech segment, the value of the cost is set to 0; otherwise, the cost is assigned the value 1. This target cost is given an infinitely high associated weight. Most auditory unit selection synthesizers are implemented this way, since for auditory synthesis it cannot be afforded to include an incorrect phoneme in the synthetic speech. This explains why the hidden target cost is employed in the AVTTS synthesizer as well, since in the proposed single-phase AVTTS approach auditory and visual segments are selected together from the database.
Note, however, that for visual-only speech synthesis it is possible to use non-phonemically matching database segments due to the many-to-one behaviour of the mapping from phonemes to visemes. This allows the selection of a candidate segment whose phonemic transcript does not match the target phoneme sequence, provided that each phoneme of the candidate segment is from the same viseme class as its corresponding target phoneme. The advantages and disadvantages of this technique will be discussed later on in chapter 6.

3.4.2.2 Symbolic costs

The synthesizer adopts multiple symbolic target costs C_symb^target(t_i, u_i) to guide the selection towards database segments that exhibit appropriate prosodic features. These symbolic target costs are calculated using the symbolic features that were discussed in section 3.3.3.2 and are assigned a binary value (zero or one) based on the match between the feature value for target segment t_i and the feature value for candidate segment u_i. Various subsets of all features listed in table 3.1 have been evaluated (e.g., the subset used in the Festival Multisyn TTS synthesizer [Clark et al., 2007]), from which it was concluded that a minimal set of symbolic target costs should at least contain cost values based on:

Context phoneme name
The context of the database segment is compared with the context of the target in terms of phoneme identity. Six binary values are assigned, based on the matching of the phonemes found one, two and three steps forward and backward in the target/database phoneme sequence. These target costs encourage the synthesizer to select original segments that were uttered in a context similar to the one described by the target speech, since this way appropriate longer coarticulation effects can be copied from the database to the synthetic speech.

Silence type of the previous and next segment
The silence type is either "none" (no silence), "light" (short phrase break), or "heavy" (long phrase break, for example after a comma). This cost is needed since the uttering of a phoneme can be influenced by the vicinity of a pause or phrase break.

Syllable name
This feature encourages the synthesizer to select database segments that are located in the same syllable as described in the target phoneme sequence. This helps to copy the appropriate coarticulations from the original speech, since such coarticulation effects are most pronounced within a syllable.

Syllable stress
This feature encourages the synthesizer to select database segments that are located in a stressed syllable in case the corresponding target syllable is stressed too, and vice versa. The syllable level is used for this cost since this is the most appropriate level to assign stress-related features.

Part-of-speech (word level)
This feature encourages the synthesizer to select database segments that are located in a word that was assigned the same part-of-speech label as the corresponding word in the target sequence. This cost is useful when an entire word from the target phoneme sequence is found in the database. In many cases, the whole original speech signal representing this word will be selected, since its consecutive candidate segments all contribute a zero join cost.
In that case, it is necessary to inspect the part-of-speech information of the original speech segment: when the part-of-speech label of the word in the database does not match the target part-of-speech, it is likely that an incorrect original prosody is copied to the synthetic speech signal.

Position in phrase (word level)
The position of a word in a phrase often determines its prosodic properties (especially pitch-related properties, since each type of phrase exhibits its own typical f0-contour). Therefore, when selecting longer segments from the database this cost promotes the selection of segments that are more likely to exhibit an appropriate prosody.

Punctuation
The prosodic properties of a sentence are highly dependent on the punctuation (e.g., commas, colons, question marks, etc.). Therefore, the selection of a database segment that matches the punctuation in the input text (e.g., both followed by a question mark) is rewarded with a lower target cost value.

3.4.2.3 Safety costs

The quality of the synthesized speech is highly dependent on the accuracy of the database meta-data, since it is this meta-data that is used to calculate the various selection cost values. In addition, the phonemic segmentation of the database has to be very precise in order to be able to copy the correct pieces of acoustic/visual data from the database to the synthetic speech. Unfortunately, the automatic phonemic segmentation of speech data is never error-free (while correcting it manually would take a massive amount of work and time). For instance, it can occur that the speech recognizer misplaces the boundary between two consecutive phonemes. Moreover, it is possible that the original speaker made a mistake while uttering the database sentences, such as pronouncing an incorrect phoneme or inadequately articulating a particular phoneme instance. When these flaws are not manually detected in the (post-)recording stage, the automatic phonemic segmentation of such a sentence is likely to result in unpredictable errors. This is why during the construction of expensive databases for commercial TTS systems (e.g., Acapela [Acapela, 2013], Nuance [Nuance, 2013], etc.) a considerable amount of the development time is spent on the manual inspection of the automatically generated segmentation and database meta-data.

An alternative, automatic technique to avoid synthesis errors caused by flaws in the database is applied by the AVTTS system proposed in this thesis, for which an extra set of "safety" target costs has been developed to minimize the chance of selecting a candidate segment that is likely to contain such database errors. A first safety target cost C_hard-pruning(t_i, u_i) is based on an offline analysis of the database in which, for each distinct phoneme, its most extreme instances (i.e., its outliers) are marked as "suspicious" segments. These segments are restricted from selection by assigning an infinitely high weight to a "safety" target cost that is assigned the value one when the candidate segment u_i has been marked as "suspicious" and the value zero otherwise. In order to mark particular database segments as "suspicious", all database instances of a particular phoneme are compared to each other using both auditory and visual features. To this end, for each feature all instances of a particular phoneme are gathered from the database and each instance i is characterized by its mean distance d_i from all other instances.
The actual way in which d_i is calculated depends on the feature that is being used in the analysis. Then, the overall mean μ_d and the standard deviation σ_d of these mean distances are calculated. "Suspicious" segments that possibly contain a database error are those segments for which equation 3.5 holds:

$$|d_i - \mu_d| > \lambda \cdot \sigma_d \qquad (3.5)$$

with λ a factor that controls the number of segments to restrict from selection. This calculation is performed for each distinct phoneme present in the database. A first series of "suspicious" labels was calculated by describing each phoneme instance based on its acoustic properties. To this end, each instance was segmented into 25ms frames, after which each frame was represented by a feature vector containing MFCC, pitch and energy information. The distance between two phoneme instances was calculated as the frame-wise distance between the corresponding feature vectors after time-aligning both instances (for more details on this the interested reader is referred to [Latacz et al., 2009]). Another series of "suspicious" labels was calculated on visual features. To this end, each phoneme instance was identified by the PCA coefficients of the video frame that is closest to the middle of the instance. The distance between two phoneme instances was calculated as the Euclidean distance between their corresponding PCA coefficients.

Note that this safety target cost is likely to eliminate some extreme phoneme instances that were correctly segmented/analysed as well. In general, this is not a problem, since these particular segments will be inappropriate for most target speech sequences anyway. On the other hand, it should be ensured that only a few instances of each phoneme are labelled as "suspicious", since deviant instances could still be needed to synthesize particular irregular coarticulations or prosody configurations. Therefore, in equation 3.5 the parameter λ is used to ensure that the number of "suspicious" segments is sufficiently small compared to the total database size.

The "suspicious" labelling of the database defines a so-called hard pruning, in which the labelled segments are completely excluded from selection. On the other hand, a soft pruning of the database could also be advantageous, in which the selection of some particular segments is strongly discouraged but not prohibited. The AVTTS system applies such a soft pruning by performing an additional analysis of the database in which the duration of each segment is evaluated in a similar fashion as the analysis to determine the "suspicious" segments. This way, for each distinct phoneme those instances exhibiting an atypical duration are assigned a "suspicious-duration" label. A second safety target cost C_soft-pruning(t_i, u_i) is defined, which is assigned the value one when a candidate segment u_i was assigned such a "suspicious-duration" label and the value zero in all other cases. By assigning this target cost a high (but not infinite) weight, it can be ensured that these "suspicious-duration" segments are only selected in case no other options are possible.

3.4.3 Join costs

A join cost C^join(u_i, u_{i+1}) indicates to which extent two candidate segments u_i and u_{i+1} can be concatenated without creating disturbing concatenation artefacts. An overview of the various join costs used by the AVTTS system is given in figure 3.13.
3.4.3.1 Auditory join costs

Auditory join costs promote the selection of candidate segments (for consecutive targets) of which the auditory speech modes can be smoothly concatenated. To this end, the continuity of various acoustic features (see section 3.3.3.3) at the concatenation point is evaluated. A first important auditory join cost measures the spectral smoothness by calculating the Euclidean distance between the MFCC values at both sides of the concatenation point:

$$C_{MFCC}(u_i, u_{i+1}) = \sqrt{\sum_{n=1}^{N_{MFCC}} \left( MFCC_i(n) - MFCC_{i+1}(n) \right)^2} \qquad (3.6)$$

with N_MFCC the number of MFCC values used to describe the spectral information, and MFCC_i(n) and MFCC_{i+1}(n) the MFCC values of the last audio frame of segment u_i and the first audio frame of segment u_{i+1}, respectively.

Figure 3.13: Join costs applied in the AVTTS synthesis.

In addition, a second join cost calculates the difference in spectral energy between both segments:

$$C_{energy}(u_i, u_{i+1}) = |E_i - E_{i+1}| \qquad (3.7)$$

with E_i and E_{i+1} the energy features of u_i and u_{i+1}, respectively. A third auditory join cost takes pitch levels into account by calculating the absolute difference in logarithmic f0 between the two sides of a join:

$$C_{pitch}(u_i, u_{i+1}) = \left| \log(f0_i) - \log(f0_{i+1}) \right| \qquad (3.8)$$

with f0_i and f0_{i+1} the pitch-marker-based pitch values measured at the end of segment u_i and at the beginning of segment u_{i+1}, respectively. If the phone at the join position is voiceless, the value of C_pitch(u_i, u_{i+1}) is set to zero.

3.4.3.2 Visual join costs

Similar to the auditory join costs, the visual join costs promote the selection of database segments that allow a smooth concatenation of their visual speech modes. To this end, the continuity of various visual features (see section 3.3.3.4) at the concatenation point is evaluated. A first visual join cost measures the "shape" similarity at both sides of the join by comparing the positions of the landmarks denoting the lips of the original speaker. The value of this cost is calculated as the summed Euclidean distance between every two corresponding mouth landmarks. Before calculating these distances, both frames at the join position (and their corresponding landmark positions) are aligned in order to improve the concatenation quality (see further in section 3.5.4):

$$C_{landmark}(u_i, u_{i+1}) = \sum_{m=1}^{N_L} \sqrt{\left( \hat{x}_i(m) - \hat{x}_{i+1}(m) \right)^2 + \left( \hat{y}_i(m) - \hat{y}_{i+1}(m) \right)^2} \qquad (3.9)$$

with x̂_i/ŷ_i and x̂_{i+1}/ŷ_{i+1} the vectors containing the coordinates of the landmarks of the last video frame of segment u_i and the first video frame of segment u_{i+1}, respectively, after the spatial alignment of segments u_i and u_{i+1}. N_L represents the number of landmarks used in the calculation.

Apart from the continuity of the shape information, it is also important that the "appearance" of the virtual speaker varies smoothly around the concatenation point. To this end, a second visual join cost is calculated as the difference in the amount of visible teeth between the two frames at the join position. Similarly, another visual join cost measures the difference in the amount of visible mouth cavity between these two frames:

$$C_{teeth}(u_i, u_{i+1}) = |TE_i - TE_{i+1}|, \qquad C_{cavity}(u_i, u_{i+1}) = |CA_i - CA_{i+1}| \qquad (3.10)$$

with TE_i and TE_{i+1} the amount of teeth visible in the last video frame of segment u_i and the first video frame of segment u_{i+1}, respectively.
Similarly, CA_i and CA_{i+1} represent the amount of mouth cavity visible in the last video frame of segment u_i and the first video frame of segment u_{i+1}, respectively. Finally, a fourth visual join cost measures the mathematical continuity of the concatenated visual speech by calculating the Euclidean distance between the PCA coefficients of both frames at the join position:

$$C_{PCA}(u_i, u_{i+1}) = \sqrt{\sum_{n=1}^{N_{PCA}} \left( PCA_i(n) - PCA_{i+1}(n) \right)^2} \qquad (3.11)$$

with N_PCA the number of PCA coefficients used to describe each video frame, and PCA_i(n) and PCA_{i+1}(n) the PCA coefficients of the last video frame of segment u_i and the first video frame of segment u_{i+1}, respectively.

3.4.4 Weight optimization

From the previous two sections it is clear that the total cost that corresponds to the selection of a particular candidate segment involves the calculation of many separate sub-costs. A specific weight is assigned to each sub-cost, after which the total cost is given by the weighted sum of all sub-costs (see equation 3.4). In the proposed joint audio/video selection strategy, these weights are not only used to specify the relative importance of each sub-cost over the other sub-costs, but they also determine the relative importance of the auditory sub-costs over the visual sub-costs. In addition, recall that the importance of the total target cost over the total join cost can be adjusted by the factor α in equation 3.3. Good quality segment selection is only feasible when an appropriate configuration of all these various weights is applied.

3.4.4.1 Cost scaling

Sections 3.4.2 and 3.4.3 explained how the various sub-costs are calculated. Since each cost is calculated on different features, every sub-cost will exhibit its own typical range of cost values. In order to be able to easily specify the contribution of the various sub-costs to the total selection cost, each sub-cost is scaled with a scaling factor that adjusts its possible cost values to a range that lies approximately between zero and one. To determine these scaling factors, an extensive set of typical cost values is gathered for each sub-cost. Sub-target cost values can be learned by synthesizing random speech samples and registering all calculated cost values for each sub-target cost. However, in the current set-up of the AVTTS system only binary target costs are used, which are always assigned the value zero or one. Because of this, typical cost values must be learned only for the sub-join costs. To this end, for each distinct phoneme a fixed number of instances is uniformly sampled from the database. Cost values are collected by calculating the sub-join cost between every two gathered instances of a particular phoneme. From the histograms describing the learned sub-join cost values, three categories of join costs can be identified (see figure 3.14). A first category of join costs results in symmetrical Gaussian-distributed cost values (e.g., C_PCA). A second category results in asymmetric Gaussian distributions (e.g., C_MFCC and C_landmark), while a third category exhibits an exponentially decaying behaviour, for which in the majority of the cases a low cost value is assigned (e.g., C_pitch and C_teeth). For each cost, a scaling factor is determined that maps the 95% lowest gathered cost values onto the range [0,1].
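As a rough illustration of this scaling step, the factor for one sub-join cost can be derived from the 95th percentile of the gathered cost values, for example as in the sketch below (the function name and the use of NumPy are assumptions made for this sketch, not the actual implementation):

```python
import numpy as np

def scaling_factor(cost_samples, keep_fraction=0.95):
    # cost_samples: sub-join cost values gathered by comparing uniformly
    # sampled instances of the same phoneme (see the text above).
    samples = np.asarray(cost_samples, dtype=float)
    upper = np.percentile(samples, keep_fraction * 100.0)  # 95th percentile
    # Map the lowest 95% of the observed cost values onto [0, 1].
    return 1.0 / upper if upper > 0 else 1.0

# A scaled sub-join cost is then simply: c_scaled = scaling_factor(history) * c
```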
3.4.4.2 Weight distribution

Once an appropriate scaling factor has been determined for each sub-cost, a particular sub-cost C_i can be given twice the importance of sub-cost C_j by assigning it a weight ω_i = 2ω_j. Empirically determining an optimal set of weights is a very time-consuming task. Therefore, the weight optimization was split into several stages. In a first stage, an appropriate weight distribution among the auditory join sub-costs and among the visual join sub-costs was determined using small informal perception tests in which the attained synthesis quality for multiple random test sentences was compared using several weight configurations. For the auditory join sub-costs, it was found that the MFCC-cost should be assigned an increased weight compared to the pitch-cost and the energy-cost. A similar conclusion could be made for the PCA-based visual join cost in comparison with the other visual sub-join costs.

Figure 3.14: Join cost histograms, indicating three different behaviours.

Next, the overall influence of the auditory join costs in comparison to the visual join costs was evaluated. To this end, a small perception test was conducted in which 6 participants (all speech technology experts) were shown 10 pairs of audiovisual speech samples synthesized using the joint audio/video selection approach. Each sample contained a standard-length English sentence and the original combinations of auditory and visual speech were selected from the LIPS2008 database. One sample of each pair was synthesized using only auditory join costs, while the other sample contained a synthesis of the same sentence for which only visual join costs were taken into account. All other synthesis parameters were the same for both samples. The subjects were asked to write down their preference for one of the two samples using a 5-point comparative MOS-scale [-2,2]. The results obtained were analysed using a Wilcoxon signed-rank test, which indicated that the samples synthesized using only auditory join costs were preferred over the samples synthesized using only visual join costs (Z = −5.0 ; p < 0.001). From this small experiment it can be concluded that the smoothness of the auditory speech mode appears to be more crucial than the smoothness of the visual mode. As a consequence, the total weight assigned to the auditory join costs should be higher than the total weight assigned to the visual join costs.

All binary target costs were assigned the same weight, except for the costs calculated on the phonemic match between the candidate context and the target context. These particular target costs were triangularly weighted to assign more influence to the matching of the context close to the segment and less influence to the matching of the context further away from the segment. Finally, a last parameter that must be determined is the factor α in equation 3.3, which sets the relative influence of the total target cost compared to the total join cost. To this end, a value for α that balances these two influences was calculated by collecting a large number of total target cost values and total join cost values occurring when synthesizing an arbitrary set of sentences. From these gathered values, an appropriate value for α was computed as the ratio of the mean total join cost over the mean total target cost.
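A small sketch of this balancing computation is given below; the function name and the assumption that the logged cost values are available as plain sequences are illustrative only:

```python
import numpy as np

def balance_alpha(total_target_costs, total_join_costs):
    # Cost values logged while synthesizing an arbitrary set of sentences.
    # Choosing alpha as this ratio makes the target term and the join term
    # of equation 3.3 contribute equally on average.
    return np.mean(total_join_costs) / np.mean(total_target_costs)
```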
Once all weight factors have been determined, the first expression from equation 3.4 can be written as:

$$C_{total}^{target}(t_i, u_i) = \omega_1 C_{phon.match}(t_i, u_i) + \omega_2 C_{hard\text{-}pruning}(t_i, u_i) + \omega_3 C_{soft\text{-}pruning}(t_i, u_i) + \frac{\sum_{j=1}^{N_s} \omega_j^{symb}\, C_j^{symb}(t_i, u_i)}{\sum_{j=1}^{N_s} \omega_j^{symb}} \qquad (3.12)$$

In equation 3.12, all costs are binary costs, N_s represents the number of symbolic costs (discussed in section 3.4.2.2) that are used in the calculation, ω_1 = ω_2 = ∞, ω_3 = 1000, and all ω_j^symb are equal to 1, except for the costs based on the phonemic context (up to three phonemes before/after the target/candidate segment), which are triangularly weighted using the values (0.5, 0.25, 0.125). Note that, in order to speed up the unit selection process, an efficient implementation adds for each target segment t_i only those database segments u_i to the list of candidate segments for which C_phon.match(t_i, u_i) = C_hard-pruning(t_i, u_i) = 0. The costs C_phon.match and C_hard-pruning can then be omitted in the Viterbi search.

Likewise, the second expression from equation 3.4 can be written as:

$$C_{total}^{join}(u_i, u_{i+1}) = \frac{\omega_1 \hat{C}_{MFCC}(u_i, u_{i+1}) + \omega_2 \hat{C}_{pitch}(u_i, u_{i+1}) + \omega_3 \hat{C}_{energy}(u_i, u_{i+1}) + \omega_4 \hat{C}_{landmark}(u_i, u_{i+1}) + \omega_5 \hat{C}_{teeth}(u_i, u_{i+1}) + \omega_6 \hat{C}_{cavity}(u_i, u_{i+1}) + \omega_7 \hat{C}_{PCA}(u_i, u_{i+1})}{\omega_1 + \omega_2 + \omega_3 + \omega_4 + \omega_5 + \omega_6 + \omega_7} \qquad (3.13)$$

with ω_1 = 5, ω_2 = ω_3 = 2, ω_4 = ω_5 = ω_6 = 1 and ω_7 = 3. Ĉ represents the scaled value of the original cost C (see section 3.4.4.1). Note that it is likely that the chosen weight distribution is only sub-optimal. A better set of weights could be learned automatically by a parameter optimization technique. In the laboratory's auditory-only TTS research, such an automatic weight optimization strategy has been developed, which learns multiple context-dependent weight configurations [Latacz et al., 2011]. Unfortunately, only a variable benefit was gained from this automatic weight training, since the attained synthesis quality was still fluctuating between consecutive syntheses. Nevertheless, it would be an interesting future initiative to design and evaluate such an automatic weight optimization technique to learn the balancing between auditory and visual selection costs as well. In addition, note that the applied visual join costs contain some redundancy: features such as the amount of visible teeth and the amount of visible mouth cavity are described by the PCA parameter values as well. However, it was opted to include all these sub-costs in order to be able to separately fine-tune the influence of these aspects on the total join cost.

3.5 Audiovisual concatenation

Once the optimal sequence of database segments matching the target speech has been determined, these audiovisual speech signals need to be concatenated in order to construct the desired synthetic speech signal. This requires two parallel concatenation actions, since for every two consecutive selected segments both the acoustic signals and the video signals need to be joined together. For each speech mode, a concatenation strategy is needed that smooths the concatenated signal around the concatenation point in order to avoid jerky synthetic speech. In addition, it has to be ensured that this smoothing does not produce unnatural speech signals, since otherwise observers will still be able to notice the join positions.
3.5.1 A visual mouth-signal and a visual background-signal

Despite the fact that the visual speech from the database displays the complete face of the original speaker, it should be noted that all the selection costs mentioned in section 3.4 focus on the mouth area of the video frames only. Obviously, this part of each frame contains the major share of the speech-related information. This is especially true for the visual speech from the LIPS2008 database, since while recording this dataset the original speaker was asked to utter the text while maintaining a neutral visual prosody as much as possible. However, since the original speaker's head was not mechanically fixed, slight head movements are present in the original visual speech data. This makes it very hard to smoothly concatenate visual speech segments containing the complete face of the speaker, as this would require a 3D rotation and translation of the face towards its "mean" position in front of the camera. Therefore, the AVTTS system focuses on synthesizing a mouth-signal that matches the target visual speech, after which this synthetic mouth-signal is merged with a background signal that contains the other parts of the face of the virtual speaker. These background signals are original sequences extracted from the database, of which it was ensured that they exhibit a neutral visual prosody. When a new mouth-signal has been constructed by the concatenation of the selected database segments, each frame of this signal is aligned with its corresponding frame from the background sequence, after which a hand-crafted mask is used to smoothly merge the two video streams, as illustrated in figure 3.15.

Figure 3.15: The left panel shows the result of the merging of the mouth-signal with the background signal. The right panel shows the background signal in gray and the mouth-signal in colour.

3.5.2 Audiovisual synchrony

As was explained in section 3.2.2, the joining of the selected segments takes place at the middle of the two overlapping phonemes (see figure 3.2). The exact join position in the auditory speech mode will always coincide with a pitch-marker, as this allows a pitch-synchronous concatenation smoothing (see further in section 3.5.3). The most straightforward technique would be to select, in the two overlapping phones, the pitch-marker that is closest to the phone centre as join position. Instead, the AVTTS system optimizes each join by calculating the best pair of pitch-markers (one marker for each overlapping phone) that minimizes the spectral distance between the parts of the acoustic signals that will be overlapped during the concatenation process [Conkie and Isard, 1996]. These optimal pitch-markers are searched for in a small window around the middle of each overlapping phone (typically 4 consecutive pitch-markers are evaluated for each phone). Once the exact join position is determined in the auditory mode, a video frame must be selected in each corresponding video signal as join position in the visual speech mode. Since the sample rate of the acoustic signal is much higher than the sample rate of the video signal, the join position in the visual speech mode cannot be determined with the same accuracy as the pitch-marker-based optimization strategy that was applied for the auditory mode.
Note, however, that in order to successfully copy the original audiovisual coherence from the two selected original segments to the concatenated synthetic speech, it is important that the audiovisual synchronization is preserved. To this end, for each concatenation the join position in the visual speech mode is positioned as closely as possible to the join position in the auditory mode. This still causes some degree of audiovisual asynchrony, since the join position in the visual mode will always be located a small time extent before or after the corresponding join position in the auditory mode. It is well known that in audiovisual speech perception, human observers are very sensitive to the auditory speech information leading the visual speech information. On the other hand, there seems to exist quite some tolerance for the video signal leading the auditory signal [Summerfield, 1992] [Grant and Greenberg, 2001] [Grant et al., 2004] [Van Wassenhove et al., 2007] [Carter et al., 2010]. The AVTTS system exploits this property to optimize the concatenation of the selected audiovisual segments by ensuring that throughout the whole concatenated audiovisual signal, the original combinations of auditory and visual speech are always desynchronized by the smallest possible video lead, i.e., between zero and one video frame (40ms for a video signal containing 25 frames per second). More details on the exact implementation of this technique are given further on in this chapter.

3.5.3 Audio concatenation

To smooth the concatenation of two acoustic signals, a small section of both signals is overlapped and cross-faded. When the join takes place in a voiced speech segment, it has to be ensured that the periodicity is not affected by the smoothing technique. For instance, figure 3.16 illustrates the concatenation of two speech segments representing the diphones "b-o" and "o-m". It shows that around the join position there is quite a large dissimilarity between the two signals, although both represent the same phoneme /o/. The figure shows that the usage of a standard cross-fade technique results in the creation of some anomalous pitch periods around the concatenation point, which causes noticeable concatenation artefacts in the output speech.

Figure 3.16: Auditory concatenation artifacts. The rectangle indicates erroneous pitch periods resulting from the cross-fading of the two waveforms.

To successfully smooth the acoustic concatenations, the AVTTS system applies a pitch-synchronous cross-fade technique. When the two segments that are concatenated are referred to as A and B, the join technique first extracts from both signals a number of pitch periods (typically 2 to 5) around the pitch-marker that was selected as the optimal join position, producing short segments a and b. Then, the pitch of signals a and b is altered using PSOLA [Moulines and Charpentier, 1990] in such a way that the two resulting signals â and b̂ exhibit exactly the same pitch contour. The initial pitch value of these signals is chosen equal to the original pitch level measured in signal A at the time instance at which segment a was extracted. The pitch value at the end of â and b̂ is chosen equal to the original pitch value measured in signal B at the end of the time interval from which segment b was extracted. The pitch contour of â and b̂ evolves linearly from the pitch level at the beginning to the pitch level at the end of the signals.
The concatenation of segments A and B is performed by overlapping and cross-fading the pitch-synchronized signals â and b̂ using a Hanning function. This strategy, illustrated in figure 3.17, minimizes the creation of irregular pitch periods and preserves the periodicity in the concatenated signal as much as possible.

Figure 3.17: Pitch-synchronous audio concatenation. The upper panel illustrates the two signals A and B that need to be concatenated, the middle panel illustrates the pitch-synchronized waveforms â and b̂, and the lower panel illustrates the resulting concatenated signal after cross-fading.

3.5.4 Video concatenation

Similar to the acoustic concatenation technique, the approach for joining the visual speech signals of two selected database segments has to smooth the concatenated signal around the join position in order to avoid jerky synthetic visual speech. To this end, the frames at the end of the first and at the beginning of the second overlapping video segment are replaced by a sequence of new intermediate video frames. It is obvious that these intermediate frames cannot be generated by a simple image cross-fade, since, for instance, at the middle of the cross-fade the intermediate frame would consist of two different original mouth configurations that are each 50% visible. This easily results in erroneous mouth representations such as frames displaying "double" lips, the visibility of teeth together with a closed mouth, etc.

A first step in the concatenation procedure is the spatial alignment of the two video segments. To this end, the pixels in each frame of the second video segment are translated in such a way that the speaker's mouth in the first frame of the second segment is aligned with the mouth in the last frame of the first video segment. To determine the translation parameters, an alignment centre is calculated for both frames from the facial landmark positions. The translation is then defined by the vector that connects these two alignment centres.

Next, image morphing techniques are used to smooth the transition between the two aligned video segments. Image morphing is a widely used technique for creating a transformation between two arbitrary images [Wolberg, 1998]. It consists of a combination of a stepwise image warp and a stepwise cross-dissolve. To perform an image morph, the correspondence between the two input images has to be denoted by means of pairs of feature primitives. A common approach is to define a mesh as feature primitive for both input images (so-called mesh-warping) [Wolberg, 1990]. A careful definition of such meshes has been proven to result in a high-quality metamorphosis; however, the construction of these meshes is not always straightforward and often very time-consuming. Fortunately, when the morphing technique is applied to smooth the visual concatenations in the AVTTS system, every image given as input to the morph algorithm is a frame from the speech database. This means that for each morph input an appropriate mesh can be automatically defined by using the frame's facial landmark positions as mesh intersections (see figure 3.18). Since these landmarks indicate the important visual articulators, the resulting meshes adequately describe feature primitives for morphing.
This way, for every concatenation the appropriate new frames (typically 2 or 4) that realize the transition of the mouth region from the first video segment toward the second video segment can be generated, as illustrated in figure 3.18. Note that some segments that are selected from the database will be fairly short and will contain only a few video frames. When such a short segment has been added to the output video signal, the concatenation of the next database segments to the output frame sequence can entail an interpolation of a frame that was already interpolated during a previous concatenation. This way, the concatenation smoothing is likely to smooth the short segments in such a way that they become "invisible" in the output visual speech. This is necessary to avoid over-articulation effects in the synthetic visual speech information.

Figure 3.18: Example of the video concatenation technique using the "AVBS" database. The two frames shown in the middle of the lower panel were generated by image morphing and replace the segments' original boundary frames in order to ensure the signal continuity during the transition from the first segment to the second segment. The frame on the left and the frame on the right of the lower panel were used as input for the morph calculations. A detail of the landmark data and the morph meshes derived from these landmarks is shown in the top panel.

3.6 Evaluation of the audiovisual speech synthesis strategy

This section describes the experiments that were conducted in order to evaluate the proposed single-phase AVTTS synthesis approach. It is especially interesting to assess the influence of the joint auditory/visual segment selection on the perception of the synthesizer output. The quality assessment of audiovisual speech includes various aspects such as speech intelligibility, naturalness, and acceptance ratio measures. All these aspects can be individually evaluated for the auditory mode and for the visual mode, or the multiplexed audiovisual signal can be evaluated as a whole. Particularly the quality of the multiplexed audiovisual speech is important, since it is this signal that will be presented to a user in a possible future application of the speech synthesis system. The major benefit of the proposed single-phase audiovisual unit selection synthesis is the fact that the synthetic speech shows original combinations of auditory and visual information. This way a maximal audiovisual coherence in the synthetic speech is attainable, but on the other hand the multimodal selection strategy reduces the flexibility to optimize each speech mode individually in comparison with a separate synthesis of the auditory and the visual speech signals. Therefore, it should be investigated whether an enhanced audiovisual coherence indeed positively influences the perception of the synthetic audiovisual speech. If so, the reduced flexibility of the single-phase approach would be justified.

3.6.1 Single-phase and two-phase synthesis approaches

To evaluate the proposed single-phase concatenative AVTTS approach, the synthesis techniques described earlier in this chapter were used to synthesize novel English audiovisual sentences from text input. For comparison purposes, a corresponding two-phase AVTTS synthesis strategy was developed.
In this strategy, a unimodal auditory TTS synthesis is performed in a first stage, producing a synthetic auditory speech signal that matches the target text. Then, in a second synthesis stage a synthetic visual speech signal is generated using the synthesis techniques described earlier in this chapter, but for which only visual selection costs are taken into account (see equation 3.4). When both synthetic speech modes have been generated, each phoneme in the synthetic auditory signal is uniformly time-scaled using WSOLA [Verhelst and Roelands, 1993] to match the duration of the corresponding segment in the synthetic visual speech. After this synchronization step the two separately synthesized signals are multiplexed to create the final synthetic audiovisual speech. When time-scaling the phonemes in the synthetic auditory speech, it has to be ensured that the time-scale factors are sufficiently close to one (= no scaling) in order to avoid a degradation of the speech quality. To this end, during the second synthesis stage in which the visual speech is synthesized, an additional target cost C_dur(t_i, u_i) is applied that measures the difference in duration between each candidate speech segment u_i matching target t_i and the duration of the corresponding auditory speech segment that was selected for target t_i during the auditory synthesis stage. A low value is assigned to C_dur when these durations are much alike, since the selection of that candidate segment would afterwards require only a minor time-scaling in the synchronization stage.

3.6.2 Evaluation of the audiovisual coherence

A first subjective experiment was designed to measure the degree to which audiovisual mismatches between the two modes of a synthetic audiovisual speech signal are detected by human observers. Such mismatches can be classified as synchrony issues, caused by an inaccurate synchronization of the two information streams, or as incoherence issues, which are due to the different origins, or the unimodal processing, of the auditory and the visual speech information that is shown simultaneously to the observer. In theory, every synthesis approach should be able to minimize the number of audiovisual synchrony issues. In the proposed single-phase audiovisual concatenative synthesis this is achieved by positioning the boundaries of the auditory and the visual speech segments that are copied from the database such that in the synthetic audiovisual speech the visual speech information always leads the auditory speech information by a time extent between zero and one video frame duration (see section 3.5.2). In the two-phase synthesis strategy described in section 3.6.1 the audiovisual asynchrony is kept minimal by accurately time-scaling the synthetic auditory speech mode to match the timings of the synthetic visual speech mode. On the other hand, the number of audiovisual incoherences that are likely to occur in the synthetic audiovisual speech is dependent on the chosen synthesis approach. Such incoherences are minimized by the joint audio/video selection in the single-phase strategy. This cannot be achieved when both output speech modes are synthesized separately. This means that a subjective evaluation of the single-phase synthesis strategy should only assess the extent to which the participants notice audiovisual incoherences in the presented audiovisual speech.
Note, however, that while perceiving a continuous speech signal it is very hard for an observer to distinguish between audiovisual incoherence issues and audiovisual asynchronies. Therefore, a more general question was assessed in the experiment by evaluating to which extent the participants found the two synthetic speech modes to be consistent. The participants were asked to take both the level of audiovisual synchrony and the level of audiovisual coherence into account. This is because it is likely that some incoherences in the audiovisual speech will be perceived as synchrony issues by the test subjects.

3.6.2.1 Method and subjects

Medium-length audiovisual English sentences were displayed to the test subjects, who were asked to rate the overall level of consistence between the presented auditory and visual speech modes. It was stressed that they should only rate the audiovisual consistence, and not, for instance, the smoothness or the naturalness of the speech. The subjects were asked to use a 5-point Mean Opinion Score (MOS) scale [1,5], with rating 5 meaning "perfect consistence" and rating 1 meaning "heavily distorted consistence". There was no time limit and the participants could play and replay each sample any time they wanted. The samples were presented on a standard LCD screen, placed at normal working distance from the viewers. The video signals had a resolution of 532x550 pixels at 50 frames per second and they were displayed at 100% size. The acoustic signal was played through high-quality headphones using flat equalizer settings. Eleven subjects (8 male and 3 female) participated in this test, seven of whom were experienced in speech processing. Six of the subjects were aged between 20 and 30 years; the other subjects were between 35 and 57 years of age. None of them were native English speakers, but it was ensured that all participants had a good command of the English language.

3.6.2.2 Test strategies

Four types of speech samples were used in this evaluation (see table 3.2), each sample containing a single English sentence extracted from the text transcript of the LIPS2008 speech database. The first group, called "ORI" ("original"), contained original audiovisual speech samples from the LIPS2008 database. A second group of samples, called "MUL" ("multimodal"), was synthesized using the proposed single-phase joint audio/video unit selection synthesis. To synthesize a sentence, the AVTTS system was provided with the LIPS2008 audiovisual database from which the particular sentence that had to be synthesized was excluded each time. The third group of test samples, called "SAV" ("separate audio/video"), was created by synthesizing the auditory and the visual speech mode separately using the two-phase synthesis approach described in section 3.6.1. Both the auditory and the visual speech mode were synthesized using the LIPS2008 database. The only difference between the two synthesis stages was that for the auditory synthesis only auditory selection costs were used and for the visual speech synthesis only visual selection costs were used. A fourth group of samples, referred to as "SVO" ("switch voice"), was also created by the two-phase AVTTS approach, but a different TTS system was used in each synthesis stage. The auditory mode was synthesized using the laboratory's auditory TTS system [Latacz et al., 2008] provided with the CMU ARCTIC database of an English female speaker [Kominek and Black, 2004].
This database is commonly used in TTS research and its length of 52 minutes of continuous speech allows higher-quality acoustic synthesis compared to the LIPS2008 database. The visual mode of the SVO samples was synthesized in the same way as the visual mode of the SAV samples, by using the LIPS2008 database. Note that the audiovisual synthesis strategy that was used to generate the SVO samples is similar to most other AVTTS approaches found in the literature, in which two different systems and databases are used to create the auditory and the visual mode of the synthetic audiovisual speech. All samples, including the files from group ORI, were (re-)coded using the Xvid codec [Xvid, 2013] with fixed quality settings in order to attain a homogeneous image quality among all samples. Note that all files were created fully automatically and no manual correction was involved in any of the synthesis or synchronization steps.

Table 3.2: Test strategies for the audiovisual consistence test.
ORI - Origin A: original LIPS2008 audio; Origin V: original LIPS2008 video; Description: original AV signal
MUL - Origin A: audiovisual unit selection on LIPS2008 db; Origin V: audiovisual unit selection on LIPS2008 db; Description: concatenated original AV combinations
SAV - Origin A: auditory unit selection on LIPS2008 db; Origin V: visual unit selection on LIPS2008 db; Description: separate A/V synthesis using the same db
SVO - Origin A: auditory unit selection on ARCTIC db; Origin V: visual unit selection on LIPS2008 db; Description: separate A/V synthesis using different dbs

3.6.2.3 Samples and results

Fifteen sample sentences with a mean word count of 15.8 words were randomly selected from the LIPS2008 database transcript and were synthesized for each of the groups ORI, MUL, SAV & SVO. Each participant was shown a subset containing 20 samples (5 sentences, each synthesized using the four different techniques). While distributing the sample sentences among the participants, each sentence was used as many times as possible. The order in which the various versions of a sentence were shown to the participants was randomized. Figure 3.19 summarizes the test results obtained. A Friedman test indicated significant differences among the answers reported for each test group (χ²(3) = 117 ; p < 0.001). An analysis using Wilcoxon signed-rank tests indicated that all differences among the test groups were significant (p < 0.001), except for the difference between the MUL and the SAV group (Z = −0.701 ; p = 0.483). Further analysis of the test results, using Mann-Whitney U test statistics, showed no difference between the overall ratings of the speech technology experts and the ratings given by the non-experts (Z = −0.505 ; p = 0.614). No significant difference was found between the ratings given by the male and the female participants (Z = −0.695 ; p = 0.487). Some participants consistently reported higher ratings compared to other participants, although this difference was not found to be significant by a Kruskal-Wallis test (χ²(10) = 16.0 ; p = 0.099). This might have been prevented by showing the participants some training samples indicating a "good" and a "bad" sample.

3.6.2.4 Discussion

For each group of samples an estimation of the actual audiovisual consistence/coherence can be made. For group ORI, a perfect coherence is expected since these samples are original audiovisual speech recordings.
Samples from group MUL are composed of concatenated original combinations of audio and video. Therefore, at the time instances between the concatenation points they exhibit the original coherence as found in the database recordings. Only at the join positions is an exact calculation of the audiovisual coherence impossible, since at these time instants the signal consists of an interpolated auditory speech signal accompanied by an interpolated visual speech signal. For the SAV and the SVO samples, almost perfect audiovisual synchrony should be attained by the synchronization step during synthesis; however, audiovisual incoherences are likely to occur since non-original combinations of auditory and visual speech are presented.

Figure 3.19: Box plot showing the results obtained for the audiovisual consistence test.

The results of the experiment show that the perceived audiovisual consistence does differ between the groups. From the significant difference between the ratings for group ORI and group MUL it appears that it is hard for a human observer to judge only the audiovisual coherence aspect without being influenced by the overall smoothness and naturalness of the speech modes themselves. Also, the perception of the audiovisual consistence of the MUL samples could be affected by the moderate loss of multimodal coherence at the join positions. Between groups MUL and SAV no significant difference was found. In order to explain this result, the selected segments from the LIPS2008 database that were used to construct both speech modes of each sample from the SAV group were compared. It appeared that for many sentences more than 70% of the selected segments were identical for both speech modes. The reason for this is that for both the auditory and the visual synthesis phoneme-based speech labels were used (instead of visemes), together with the fact that the LIPS2008 database only contains about 25 minutes of original speech. The use of such a small database implies that most of the time only a few candidate segments matching a longer target (syllable, word, etc.) are available. Because of this, very often the same original segment gets selected for both the acoustic and the visual synthesis, regardless of the configuration of the selection costs (the selection of long segments is favoured since these add a zero join cost to the global selection cost). It was calculated that for the SAV samples, on average, around 50% of the video frames are accompanied by the original matching audio from the database. This could explain why the SAV group scored almost as well as the MUL group in the subjective assessment. This result also indicates that the synchronization step in the two-phase synthesis approach is indeed able to appropriately synchronize the two speech modes that were synthesized separately. Keeping this in mind, it is remarkable that significantly better ratings were found for the MUL group in comparison with the SVO group. Since it can be safely assumed that the SVO samples contain two speech modes that have been appropriately synchronized, the reason for their degraded ratings has to be found in the fact that these samples are completely composed of non-original combinations of auditory and visual speech information, which apparently resulted in noticeable audiovisual mismatches.
This can be understood by the fact that the displayed auditory and visual speech information result from different repetitions of the same phoneme sequence by two distinct speakers exhibiting different speaking accents.

3.6.3 Evaluation of the perceived naturalness

From the previous experiment it can be concluded that the single-phase AVTTS synthesis approach reduces the number of noticeable mismatches between the two synthetic speech modes. Subsequently, a new experiment has to investigate how this influences the perceived quality of the synthetic speech signals. In order to generate highly natural audiovisual speech, two aspects need to be optimized. First, both the auditory and the visual speech mode must individually exhibit high-quality speech that closely resembles original speech signals. In addition, it is necessary that a human observer feels very familiar with the synchronous observation of these two information streams. A possible test scenario would be to present the test subjects with audiovisual speech fragments, synthesized using both single-phase and two-phase synthesis strategies, and to assess the overall perceived level of naturalness of the audiovisual speech. However, such ratings would show a lot of variability since each test subject would rate the samples following his/her own personal feeling of which aspect is the most important: the individual quality of the auditory/visual speech or the naturalness of the audiovisual observation of these signals. Also, it has to be taken into account that the limited size of the LIPS2008 database does not allow high quality auditory speech synthesis. Because of this, it is likely that the overall level of naturalness would be rated rather low, which makes it more difficult to draw important conclusions from the test results obtained.

On the other hand, the main goal of the experiment is to evaluate whether the reduced flexibility of the single-phase synthesis approach to optimize the individual synthetic speech modes is justified by the benefits of the increased audiovisual coherence between the two synthetic speech modes. Therefore, a test scenario can be developed to directly evaluate the effect of the level of audiovisual coherence on the perceived naturalness of the synthetic speech. An ideal scenario would be to evaluate the perceived naturalness of multiple groups of audiovisual speech samples designed such that the level of audiovisual coherence varies among the groups while the individual quality of both the synthetic auditory and the synthetic visual speech mode is the same for all groups. Unfortunately, it is not clear how such samples can be realized in practice. Therefore, an alternative test scenario was used in which several types of audiovisual speech signals were created using various concatenative synthesis strategies. It was ensured that the individual quality of the visual speech mode was the same for all groups. During the subjective assessment the perceived naturalness of this visual speech mode was evaluated. This makes it possible to determine the influence of the audiovisual presentation of a unimodal speech signal on its subjective quality assessment. Both the impact of the degree of audiovisual coherence and the impact of the individual quality of the corresponding speech mode can be evaluated.
3.6.3.1 Method and subjects

The participants were asked to rate the naturalness of the mouth movements displayed in the audiovisual speech fragments. It was stressed that for this experiment they should only rate the visual speech mode. A 5-point MOS scale [1,5] was used, with rating 5 meaning that the mouth variations are as smooth and as correct as original visual speech and rating 1 meaning that the movements considerably differ from the expected visual speech. The same subjects who participated in the experiment described in section 3.6.2 contributed to this test. The setup of the subjective evaluation procedure was the same as described in section 3.6.2.1.

3.6.3.2 Test strategies

Five different types of samples were generated for this experiment, as summarized in table 3.3. Four sample types (ORI, MUL, SAV and SVO) were similar to the samples used in the previous experiment (see section 3.6.2.2). Since for this experiment the quality of the synthetic visual speech mode has to be equal for all sample types, the samples were created in such a way that for each group the same original visual speech segments were used to construct the synthetic visual speech mode. The samples from the MUL group were synthesized using the single-phase AVTTS approach for which only visual selection costs were applied. The extra target cost Cdur (see section 3.6.1) was included, for which for each sentence the original timings from its version in the LIPS2008 database were used as reference. Next, similar to the previous experiment, auditory speech signals were generated using the LIPS2008 and the ARCTIC databases to create the auditory speech mode of the SAV and SVO samples, respectively. Audiovisual synchronization was obtained by time-scaling these acoustic signals using WSOLA. In addition, a fifth group of samples, referred to as "RES" ("resynth"), was added. For these samples, the same visual speech mode as applied in the MUL, SVO and SAV samples was used. The auditory mode consisted of original auditory speech from the LIPS2008 database, synchronized with the corresponding visual speech signals using WSOLA. Note that the approach that was used to create the RES samples is a common visual speech synthesis approach when a novel visual speech signal needs to be generated to accompany an already existing auditory speech signal (although in that case the visual speech mode is time-scaled instead).

3.6.3.3 Samples and results

Fifteen sample sentences with a mean word count of 15.8 words were randomly selected from the LIPS2008 database transcript and were synthesized for each of the five groups ORI, MUL, SAV, SVO & RES. Each participant was shown a subset containing 20 samples (4 sentences, each synthesized using the 5 different techniques). While distributing the samples among the participants, each sentence was used as many times as possible. The order in which the various versions of a sentence were shown to the participants was randomized. Figure 3.20 summarizes the test results obtained. A Friedman test indicated significant differences among the answers reported for each test group (χ2(4) = 103; p < 0.001). An analysis using Wilcoxon signed-rank tests indicated that the ORI samples were rated significantly better than all other sample groups (p < 0.001). In addition, the SVO samples were rated significantly worse than the samples from groups MUL, SAV and RES (p < 0.005).
No significant differences were found between the ratings for the groups MUL, SAV and RES. Further analysis of the test results, using Mann-Whitney U test statistics, showed no difference between the overall ratings of the speech technology experts and the ratings given by the non-experts (Z = −1.58; p = 0.114). No significant difference was found between the ratings given by the male and the female participants (Z = −1.48; p = 0.138). Some participants consistently reported higher ratings than other participants; this difference was found to be significant by a Kruskal-Wallis test (χ2(10) = 23.2; p = 0.010). This could probably have been prevented by first showing the participants some training samples illustrating a "good" and a "bad" sample.

Table 3.3: Test strategies for the naturalness test.

Group | Origin A | Origin V | Description
ORI | Original LIPS2008 audio | Original LIPS2008 video | Original AV signal
MUL | Audiovisual unit selection on LIPS2008 database (video costs) | Audiovisual unit selection on LIPS2008 database (video costs) | Concatenated original AV combinations
SAV | Auditory unit selection on LIPS2008 database (audio costs) | Visual unit selection on LIPS2008 database (video costs) | Separate A/V synthesis using same database
SVO | Auditory unit selection on ARCTIC database | Visual unit selection on LIPS2008 database (video costs) | Separate A/V synthesis using different databases
RES | Original LIPS2008 audio | Visual unit selection on LIPS2008 database (video costs) | Original audio and synthesized video

Figure 3.20: Box plot showing the results obtained for the naturalness test.

3.6.3.4 Discussion

For all but the ORI samples, the visual speech mode was synthesized by reusing the same segments from the LIPS2008 database. This implies that any difference in the perceived quality of the visual speech mode is caused by the properties of the auditory speech that was played along with the visual speech. The results obtained show a clear preference for the MUL samples compared to the SVO samples. Note that the individual quality of the auditory speech mode of the SVO samples is at least as high as the individual quality of the auditory mode of the MUL samples, since the auditory mode of the SVO samples is synthesized using acoustic selection costs and a more extensive speech database. From this it can be concluded that the perceived naturalness of the visual speech mode of the SVO samples was degraded by the non-original combinations of auditory/visual speech information that were used to construct these samples. On the other hand, only a small decrease in perceived naturalness can be noticed between the MUL and the SAV samples (Wilcoxon signed-rank analysis; Z = −1.78; p = 0.076). As explained earlier in section 3.6.2.4, the samples of these two groups are probably too similar to lead to important perception differences. However, it is interesting to see that the more appropriate synthesis set-up used to create the auditory mode of the SAV samples (acoustic costs instead of visual costs) certainly did not improve the test results obtained for this group. The test results also contain slightly higher ratings for the MUL samples compared to the results obtained for the RES group. This difference was not found to be significant, but a trend is noticeable (Wilcoxon signed-rank analysis; Z = −1.90; p = 0.058).
Since the auditory mode of the RES samples consists of original auditory speech signals, its quality is much higher than the quality of the auditory mode of the MUL samples. Since the MUL samples scored at least as well as (and even slightly better than) the RES samples, it can be concluded that for a high-quality perception of the visual speech mode, a high level of audiovisual coherence is at least as important as the individual quality of the accompanying auditory speech. In addition, the comparison between the MUL and the SVO samples showed that the perceived quality of a synthetic speech mode can be strongly affected when it is presented audiovisually with a less consistent accompanying speech mode.

3.6.4 Conclusions

This thesis proposes a single-phase AVTTS approach that is able to maximize the audiovisual coherence in the synthetic speech signal. Two experiments were conducted in order to assess the benefits of this joint audio/video segment selection strategy. A first test measured the perceived audiovisual consistence resulting from different synthesis strategies. It showed that human observers tend to underestimate this coherence when the displayed speech signals are synthetic and clearly distinguishable from original speech. Perhaps this is due to a moderate loss of coherence around the concatenation points. On the other hand, the highest level of audiovisual consistence is perceived when the speech is synthesized using the single-phase audiovisual concatenative synthesis. The more standard approach, in which both synthetic speech modes are synthesized separately, was found to easily result in a degraded perceived audiovisual consistence. A second experiment investigated how the perceived quality of a (synthetic) visual speech mode can be affected by the audiovisual presentation of the speech signal. The results obtained showed that this quality can be seriously degraded when the consistence between the two presented speech modes is reduced. In addition, it was found that the influence of the individual quality of the accompanying auditory speech mode seems to be only of secondary importance.

The standard two-phase synthesis approach in which the auditory and the visual speech modes are synthesized separately (generally using different databases and different synthesis techniques) is likely to cause audiovisual mismatches that cannot be prevented by an accurate synchronization of the two synthetic speech signals, since they are due to the fact that the two information streams originate from different repetitions of the same phoneme sequence (usually by two distinct speakers that are likely to exhibit different speaking accents). The experiments described in this section indicate that these mismatches reduce the perceived audiovisual coherence and that they are likely to degrade the perceived naturalness of the synthetic speech modes. From this, it can be concluded that a major requirement for an audiovisual speech synthesis system is to maximize the level of coherence between the two synthetic speech modes.
The speech synthesizer obviously also has to optimize the individual quality of both synthetic speech modes, but it has to be ensured that every optimization technique that increases these individual qualities does not affect the level of audiovisual coherence in the audiovisual output speech. Otherwise, it is likely that the benefits gained from the optimization technique are cancelled out by the audiovisual way of presenting the synthetic speech modes. The experiments described in this section encourage further investigation of the single-phase audiovisual segment selection technique, since this approach is indeed able to maximize the coherence between both synthetic speech modes. On the other hand, at this point the attainable synthesis quality of both speech modes is still too low compared to original speech recordings. Therefore, the AVTTS synthesis strategy will have to be extended to improve the individual quality of the synthetic auditory and the synthetic visual speech, while ensuring that the level of coherence between these two output speech modes is minimally affected.

3.7 Audiovisual optimal coupling

Section 3.5 described that the concatenation of the audiovisual speech segments that are selected from the database requires two join actions: one in the auditory mode and one in the visual mode. It was explained that the auditory signals are concatenated using a pitch-synchronous cross-fade that preserves the periodicity of voiced speech sounds in the concatenated signal. The visual modes of the selected segments are smoothly joined by generating interpolation frames using an image morphing technique. In the previous section it was concluded that any optimization of the audiovisual synthesis strategy intended to enhance the individual quality of the synthetic auditory or visual mode should be designed so as not to affect the coherence between these two speech modes. This section elaborates on an optimization of the single-phase AVTTS approach that enhances the individual quality of the synthetic visual speech mode. Unfortunately, the optimization technique also decreases the coherence between both synthetic speech modes, which means that a trade-off will have to be made.

3.7.1 Concatenation optimization

Section 3.5.2 explained that the AVTTS system separately optimizes each acoustic concatenation by positioning the exact join position at a time instant that coincides with a pitch-marker. Around the theoretical concatenation point (the centre of the first and the second overlapping phone, respectively), an optimal pair of pitch-markers (one marker in the first phone and one marker in the second phone) is found by minimizing a spectral distance measure. This way, the transition takes place between two signals that are maximally similar at the concatenation point, which improves the smoothness of the resulting concatenated auditory speech signal. A similar technique could be employed to enhance the concatenation quality of the visual modes of the selected database segments as well. This would require for each concatenation a separate optimization of the join position in the visual speech mode. Such an optimization is, for instance, also applied in the "Video Rewrite" system [Bregler et al., 1997]. For this purpose, three different approaches were developed, each discussed in detail in the remainder of this section.
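To make the acoustic join optimization above concrete, the sketch below searches, around the theoretical concatenation point, for the pair of pitch-markers whose surrounding speech frames are most similar. The plain log-magnitude spectral distance, the fixed window length and the example marker positions are illustrative assumptions; the actual system may use a different spectral measure.

```python
# A minimal sketch of selecting the optimal pitch-marker pair for an acoustic join.
import numpy as np

def spectral_distance(frame_a, frame_b):
    """Euclidean distance between the log-magnitude spectra of two speech frames."""
    window = np.hanning(len(frame_a))
    spec_a = np.log(np.abs(np.fft.rfft(frame_a * window)) + 1e-9)
    spec_b = np.log(np.abs(np.fft.rfft(frame_b * window)) + 1e-9)
    return np.linalg.norm(spec_a - spec_b)

def optimal_pitch_marker_pair(sig_a, marks_a, sig_b, marks_b, win=256):
    """Return the pitch-marker pair (one per phone) whose surrounding frames are
    maximally similar, so that the pitch-synchronous cross-fade is smoothest."""
    best_pair, best_dist = None, np.inf
    for ma in marks_a:                               # candidate markers, first phone
        frame_a = sig_a[ma:ma + win]
        if len(frame_a) < win:
            continue
        for mb in marks_b:                           # candidate markers, second phone
            frame_b = sig_b[mb:mb + win]
            if len(frame_b) < win:
                continue
            dist = spectral_distance(frame_a, frame_b)
            if dist < best_dist:
                best_pair, best_dist = (ma, mb), dist
    return best_pair

# Example with synthetic signals and hypothetical pitch-marker positions
rng = np.random.default_rng(0)
a, b = rng.normal(size=16000), rng.normal(size=16000)
print(optimal_pitch_marker_pair(a, [7800, 7950, 8100], b, [7900, 8050, 8200]))
```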
3.7.1.1 Maximal coherence

The standard approach for determining for each concatenation the exact join position in the visual mode was already briefly described in section 3.5.2: the synthesizer tries to minimize the asynchrony between the concatenated auditory and the concatenated visual speech information as much as possible (see figure 3.21). Since the sample rate of the auditory speech is much higher than the sample rate of the visual speech, it is impossible to ensure that the join position in the visual mode perfectly coincides with the optimized join position in the auditory mode. Because of this, for each selected database segment the exact length of its auditory speech signal will be different from the length of its accompanying visual speech signal. For a database segment i, the difference between the length of its acoustic signal L_audio(i) and the length of its video signal L_video(i) can be written as

∆L(i) = L_audio(i) − L_video(i)    (3.14)

The auditory and the visual speech information that correspond to a particular audiovisual segment are copied from the original recordings contained in the database. The database time instant at which the extraction of the acoustic speech information corresponding to segment i starts can be denoted as t_start^audio(i). Similarly, the time instant at which the extraction of the visual information starts can be written as t_start^video(i). In general, due to the difference in sample rate between the acoustic and the visual signal, t_start^video(i) will be different from t_start^audio(i). This means that after the audiovisual segment i is added to the concatenated audiovisual speech signal, the speech modes of segment i are shifted with respect to each other by a value ∆t_start(i):

∆t_start(i) = t_start^audio(i) − t_start^video(i)    (3.15)

This can easily be understood if segment i is the first segment from the sequence that constructs the output synthetic speech. If segment i is not the first segment from this sequence, both its speech modes are added to an audiovisual signal of which the length of its acoustic signal and the length of its visual signal are dissimilar due to the earlier concatenations. This means that the total shift async(i) between the original auditory and the original visual speech information corresponding to segment i in the final concatenated speech is given by

async(i) = Σ_{n=1}^{i−1} ∆L(n) − ∆t_start(i)    (3.16)

Note that in equation 3.16 a positive value of async(i) means that the visual speech information leads the auditory speech information. Obviously, these calculations assume that both speech modes of the original database recordings are correctly synchronized. Equation 3.16 indicates that the level of audiovisual synchrony in the final concatenated speech signal changes value after each concatenation point. In addition, it shows that the audiovisual asynchrony of segment i after concatenation is caused by two independent terms, determined by the properties of the previous segments (1, ..., i − 1) and the current segment i, respectively. This means that the value of async(i) can be confined to reasonable limits by selecting for each segment i a video frame as join position t_start^video(i) that lies in the vicinity of the auditory join position t_start^audio(i) and that maximally cancels the asynchrony caused by the difference between the lengths of the speech modes of the already concatenated audiovisual speech signal (i.e., the term Σ_{n=1}^{i−1} ∆L(n) in equation 3.16).
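A minimal sketch of this asynchrony bookkeeping is given below: it evaluates equations 3.15 and 3.16 for a set of candidate video frame boundaries near the auditory join position and keeps the one that leaves the smallest residual asynchrony. The frame boundaries and the numeric values in the example are illustrative assumptions.

```python
# Sketch of the asynchrony bookkeeping of equations 3.15-3.16 and of the
# coherence-preserving choice of the visual join position. `prev_delta_l_sum` is
# the accumulated sum of eq. 3.14 terms for the already concatenated segments.
def choose_video_join(prev_delta_l_sum, t_audio_start, candidate_video_starts):
    """Pick the video extraction start time (a frame boundary near the auditory
    join position) that keeps the accumulated asynchrony async(i) minimal."""
    best_start, best_async = None, None
    for t_video_start in candidate_video_starts:
        dt_start = t_audio_start - t_video_start      # eq. 3.15
        async_i = prev_delta_l_sum - dt_start         # eq. 3.16
        if best_async is None or abs(async_i) < abs(best_async):
            best_start, best_async = t_video_start, async_i
    return best_start, best_async

# Example: the already concatenated signal has accumulated a +12 ms audio/video
# length difference; the next segment's audio is extracted from t = 3.137 s and
# three nearby video frame boundaries (25 fps assumed) are considered as joins.
start, async_i = choose_video_join(0.012, 3.137, [3.08, 3.12, 3.16])
print(f"video join at t = {start:.2f} s, resulting async(i) = {async_i * 1000:+.1f} ms")
```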
Since it is generally assumed that human observers are more sensitive to a lead of the auditory speech information in front of the visual speech information than to a lead of the visual information in front of the acoustic information [Summerfield, 1992] [Grant et al., 2004], the "safest" concatenation strategy is the one that maximizes the audiovisual coherence in the concatenated speech signal by selecting for each visual concatenation a video frame as join position that ensures that

0 ≤ async(i) < 1 / Fs_video    (3.17)

with Fs_video the sample rate (frame rate) of the video signal.

3.7.1.2 Maximal smoothness

Similar to the optimal coupling technique that is applied for optimizing the acoustic concatenations, the smoothness of the concatenated visual speech can be enhanced by fine-tuning for each video concatenation the exact join position in order to maximize the similarity between the two video frames at which the concatenation takes place. In a first stage, for both phones that need to be joined some of the video frames in the vicinity of the corresponding auditory join position are selected as candidate "join-frames". Then, in a second stage two final frames are selected (one from each set of candidate join-frames) that minimize a visual distance measure (see figure 3.21). This visual distance is calculated in a similar way as the total visual join cost value (using the difference in teeth, mouth-cavity and PCA properties). Unfortunately, this optimization technique increases the possible values of async(i), since in this case the visual join position is not chosen to minimize equation 3.16 (see figure 3.22). The visual optimal coupling strategy is controlled by three parameters: the maximal allowed local audio lead (the minimal value of async(i)), the maximal allowed local video lead (the maximal value of async(i)), and a search-length parameter that defines the number of video frames in each phone that are considered as candidate join-frames. The search-length parameter influences both terms of equation 3.16: it determines the maximal audiovisual asynchrony in a segment caused by the difference between the time instants at which its two speech modes are extracted from the database (equation 3.15), and it also determines to which extent the length of the auditory and the visual mode of the already concatenated speech signal can be altered (this can be seen in figure 3.22). Since the value of async(i) is confined between its maximal and its minimal limit, most of the time the set of candidate join-frames that is selected for each of the two phones will not be centered around the auditory join position. This is due to the fact that the maximal allowed value of async(i) can be chosen larger in magnitude than the minimal allowed value, because of the asymmetric human sensitivity to audiovisual asynchrony.

3.7.1.3 Maximal synchrony

Where the approaches described in section 3.7.1.1 and section 3.7.1.2 maximize the audiovisual coherence and the audiovisual smoothness, respectively, an in-between approach exists that is able to enhance the smoothness of the synthetic visual speech without introducing extra audiovisual asynchronies in the concatenated speech segments. The first stage of this in-between strategy is similar to the "maximal smoothness" approach, since for both phones that need to be joined a set of candidate join-frames is selected around the corresponding join position in the auditory mode.
In the second stage, from both sets of candidate join-frames a final frame is selected by minimizing the visual join cost for this particular concatenation. In contrast with the "maximal smoothness" approach, in this case only those pairs of join-frames are considered that do not add an extra audiovisual asynchrony to the database segment. This is possible when for a segment i the visual join position is chosen such that the contribution of the term ∆t_start(i) to async(i) is cancelled by the modification of the term Σ_{n=1}^{i−1} ∆L(n), i.e., by the alteration of the length of the auditory and the visual mode of the already concatenated audiovisual speech signal (this can be seen in figure 3.21). This approach evaluates for each concatenation fewer combinations of join-frames compared to the "maximal smoothness" approach and thus offers less freedom in optimizing the smoothness of the visual concatenations. Only one parameter adjusts the optimal coupling technique: the search-length determining the number of candidate join-frames that are selected in each phone. An important observation is that even when a large search-length is applied, no extra audiovisual asynchrony is introduced; however, at the join positions non-original combinations of auditory and visual speech information are created (as illustrated in figure 3.22). This means that the proposed optimization technique is likely to degrade the overall level of audiovisual coherence in the concatenated audiovisual speech signal.

3.7.2 Perception of non-uniform audiovisual asynchrony

In order to obtain appropriate parameter settings for the proposed optimal coupling techniques, the maximal allowed level of local audiovisual asynchrony must be determined. Literature on the effects of a uniform audiovisual asynchrony on the human perception of audiovisual speech signals mentions −50 ms and +200 ms as tolerable bounds within which the asynchrony level is not noticed by an observer [Grant et al., 2004]. On the other hand, in the proposed audiovisual synthesis the level of audiovisual synchrony in the concatenated audiovisual speech is not constant (it changes after each concatenation point). Since no exact tolerance for this particular type of audiovisual asynchrony could be found in the literature, a subjective perception test was conducted from which appropriate parameter settings for the optimal coupling approaches can be inferred. To this end, it was investigated to which extent local audiovisual asynchronies can be introduced in an audiovisual speech signal without being noticed by a human observer. The speech samples were generated by resynthesizing sentences from the LIPS2008 database using the AVTTS system and the "maximal smoothness" optimal coupling approach (the speech data corresponding to the target original sentence was excluded from selection). Equation 3.16 was used to calculate the occurring levels of audiovisual asynchrony in each synthesized sentence. An appropriate subset of samples was collected from the synthesis results, covering the target range of maximal/minimal local asynchronies that needed to be evaluated. For each sample from the selected subset, a second version was synthesized using the "maximal coherence" optimal coupling approach. These new syntheses were used as baseline samples since they exhibit no significant audiovisual asynchrony.
The two versions of each sentence (with/without local audiovisual asynchronies) were shown pairwise to the test subjects, who were asked to report which of the two samples they preferred in terms of synchrony between the two presented speech modes. The participants were instructed to answer "no difference" if no significant difference in audiovisual synchrony between the two samples could be noticed. Seven people participated in the experiment, three of whom were experienced in speech processing. The test results obtained are summarized in table 3.4, in which the test samples are grouped based on the minimal and maximal occurring local asynchrony level. It shows a detection ratio of less than 20% for the samples in which the audio lead is always lower than 0.04 s and the video lead is always lower than 0.2 s. It was opted to use these values as parameter settings for the "maximal smoothness" optimal coupling approach.

Figure 3.21: Three approaches for optimal audiovisual coupling. The two audiovisual signals that need to be joined and the optimized auditory join positions are indicated. The top panel shows the "maximal coherence" method in which the visual join positions are close to the auditory join positions. The middle panel illustrates the "maximal smoothness" approach, in which for both signals a set of candidate join frames A1-A5 and B1-B5 are selected, from which the most optimal pair is calculated by minimizing the visual join cost. The bottom panel illustrates the in-between approach in which only candidate pairs A1-B1, A2-B2, etc. are considered.

Figure 3.22: Resulting signals obtained by the three proposed optimal coupling techniques. The top panel shows that the audiovisual coherence is maximized by the "maximal coherence" approach. The middle panel shows that an extended audiovisual asynchrony can occur by employing the "maximal smoothness" approach. The bottom panel shows that the "maximal synchrony" approach maintains the audiovisual synchrony but introduces some unseen combinations of auditory and visual speech information at the join position (indicated by the arrows).

Table 3.4: Detection of local audiovisual asynchrony (detection ratio per combination of minimal and maximal local asynchrony; negative values denote an audio lead).

Min desync \ Max desync | 0 s | 0.1 s | 0.2 s | 0.4 s
0 s | – | 0% | 15% | 46%
-0.04 s | 0% | 0% | 0% | no samples
-0.08 s | 26% | 10% | 20% | 40%
-0.15 s | 90% | 100% | 60% | 100%

Neither the number of participants in the experiment nor the number of test samples in each group of table 3.4 was large enough to exactly define thresholds for noticing non-uniform audiovisual asynchronies. Nevertheless, the results obtained are sufficient for determining suitable parameter values for the optimal coupling technique. The subjective experiment can also be seen as a preliminary study on the general effects of a time-varying audiovisual asynchrony on audiovisual speech perception. The results obtained indicate that the thresholds for noticing non-uniform audiovisual asynchronies are quite similar to the detection thresholds for uniform audiovisual asynchrony that are mentioned in the literature. It seems to be the case that the duration of an audiovisual asynchrony occurring in an audiovisual speech signal has little influence on its detection by human observers, since the subjective experiment showed that even very short asynchronies (occurring when short segments are selected from the database) were noticed by the test subjects.
This result is in agreement with earlier experiments investigating audiovisual perception effects such as the McGurk effect [McGurk and MacDonald, 1976], in which it was found that even audiovisual mismatches with a duration of only a single phoneme can drastically affect the intelligibility and/or the perceived quality of the audiovisual speech signal.

3.7.3 Objective smoothness assessment

Before evaluating the effects of the proposed audiovisual optimal coupling approaches on the human perception of the concatenated audiovisual speech signals, it was objectively assessed to which extent the three approaches are able to smooth the visual concatenations. To this end, eleven English sentences (mean word count = 15 words) were synthesized using the AVTTS system provided with the LIPS2008 database. For each sentence, six different configurations for the optimization of the audiovisual concatenations were used, as described in table 3.5. For each synthesized sample, the smoothness of the visual speech mode was objectively assessed. To this end, the synthetic visual speech signals were automatically analysed in order to calculate for each video frame a set of facial landmarks indicating the lips of the virtual speaker, and a set of PCA coefficients that models the mouth area of the frame. This metadata was derived in the same way as in the analysis of the original visual speech information contained in the database (see section 3.3.3.4). A smoothness measure was defined as a linear combination of the summed Euclidean distances between the landmark positions and the Euclidean distance between the PCA coefficients, calculated for every two consecutive frames of the synthetic visual speech located at the join positions. A single measure for each synthesized sentence was calculated as the sum of all distance measures that were calculated for the sentence, divided by the number of database segments that were used to construct the sentence.

Table 3.5: Various optimal coupling configurations. (SL = search-length parameter)

Group | Method | SL | Min Async. | Max Async.
I | max coherence | – | – | –
II | max smoothness | 0.20 s | -0.04 s | 0.20 s
III | max smoothness | 0.20 s | -0.05 s | 0.35 s
IV | max synchrony | 0.08 s | – | –
V | max synchrony | 0.20 s | – | –
VI | max synchrony | 0.40 s | – | –

Figure 3.23 shows the objective smoothness levels obtained.

Figure 3.23: Objective smoothness measures for various optimal coupling approaches. A lower value indicates a smoother visual speech signal.

The results show that both the "maximal smoothness" (groups II & III) and the "maximal synchrony" (groups IV, V & VI) strategies resulted in smoother synthetic visual speech signals compared to the "maximal coherence" technique (group I). A statistical analysis using ANOVA with repeated measures and Greenhouse-Geisser correction indicated significant differences among the values obtained for each group (F(2.37, 23.7) = 14.127; p < 0.001). An analysis using paired-sample t-tests indicated that the smoothness of the samples from group I was significantly worse than the smoothness of the samples from the other groups (p ≤ 0.006). On the other hand, as can be noticed from figure 3.23, only a minor improvement of the smoothness of the synthetic visual speech is measured when more extreme audiovisual asynchronies or longer audiovisual incoherences are allowed: no significant difference between the values for groups II-VI was found.
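A minimal sketch of the objective smoothness measure defined above is given below. The array shapes, the join-frame indices and the linear-combination weight are illustrative assumptions, not the exact values used in the experiment.

```python
# Per-segment smoothness measure: landmark and PCA distances at the join frames.
import numpy as np

def sentence_smoothness(landmarks, pca_coeffs, join_frames, n_segments, alpha=1.0):
    """landmarks: (n_frames, n_points, 2) lip landmark positions
    pca_coeffs: (n_frames, n_comp) PCA coefficients of the mouth area
    join_frames: indices f such that frames f and f+1 straddle a concatenation
    Returns the per-segment smoothness value (lower = smoother)."""
    total = 0.0
    for f in join_frames:
        # summed Euclidean distances between corresponding landmark points
        d_shape = np.linalg.norm(landmarks[f + 1] - landmarks[f], axis=1).sum()
        # Euclidean distance between the PCA coefficient vectors
        d_pca = np.linalg.norm(pca_coeffs[f + 1] - pca_coeffs[f])
        total += d_shape + alpha * d_pca
    return total / n_segments

# Example with random data for a 100-frame sentence built from 8 segments
rng = np.random.default_rng(0)
lm = rng.normal(size=(100, 20, 2))
pca = rng.normal(size=(100, 12))
print(sentence_smoothness(lm, pca, join_frames=[14, 31, 47, 60, 72, 85, 93], n_segments=8))
```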
This also means that the "maximal smoothness" approach did not really outperform the "maximal synchrony" optimization approach, despite the fact that the "maximal smoothness" approach allows additional audiovisual asynchronies in order to increase the freedom in optimizing the visual join position.

3.7.4 Subjective evaluation

To assess the effect of the various audiovisual optimal coupling techniques on the perception of the synthetic audiovisual speech, a subjective perception experiment was performed. Groups I, II and V from table 3.5 were selected to represent the three proposed optimal coupling approaches. Eleven standard-length English sentences were synthesized using the LIPS2008 database and the optimal coupling settings of groups I, II and V. For each of these sentences, two sample pairs were shown to the test subjects. One sample pair contained a synthesis from group I and a synthesis from group II, the other sample pair contained the same synthesis from group I and a synthesis from group V. The participants were asked to report which of the two samples of each pair they preferred. They were told to pay attention especially to the smoothness of the mouth movements and to the overall level of audiovisual coherence, but it was up to them to decide which aspect of the audiovisual speech they found the most important for rating the samples. They were informed that the quality of the auditory speech mode was the same for both samples of each comparison pair. If the test subjects had no preference for one of the two samples, they were asked to answer "no difference". Nine people participated in this test (3 female, 6 male, aged 20-56), 6 of whom were experienced in speech processing. The preference scores obtained are summarized in table 3.6.

Table 3.6: Subjective evaluation of the optimal coupling approaches.

Test | Preference | Count
Group I - Group II | Group I > Group II | 38
Group I - Group II | Group I = Group II | 39
Group I - Group II | Group I < Group II | 22
Group I - Group V | Group I > Group V | 19
Group I - Group V | Group I = Group V | 60
Group I - Group V | Group I < Group V | 20

The difference between groups I-II and the difference between groups I-V were analysed using a Wilcoxon signed-rank test. The results obtained indicate that in neither of the two comparisons did the test subjects show a preference for the samples in which the smoothness of the visual speech mode was optimized. Samples from groups I and V were rated equally good, with many answers reporting "no difference" (Z = −0.160; p = 0.873). On the other hand, it appeared that the participants disliked the optimized samples from group II in comparison with the samples from group I (Z = −2.07; p = 0.039). A manual inspection of the answer sheets and feedback from the participants pointed out several explanations for these results. Firstly, it can be noticed that the answers differ heavily among the participants: some subjects tended to generally like or dislike the optimized sample in each pair, while other participants very often reported not noticing any difference between the presented samples. Furthermore, many participants reported that they often did notice that the smoothness of the visual speech mode of one of the two samples had been improved. Unfortunately, in many of these cases the optimized sample exhibited an affected audiovisual coherence, which motivated the test subjects to report a preference for the non-optimized sample of the comparison pair. This explains why the samples from group II were rated worse than the samples from group I, since these samples contain time-varying audiovisual asynchronies. This is quite an unexpected result, since the minimal and maximal asynchrony levels were chosen in the range that was hardly noticed by human observers in the subjective test described in section 3.7.2.
Perhaps these parameters should have been chosen more conservatively; however, lowering these thresholds would leave only little freedom to optimize the visual concatenations. In contrast, the samples from group V were not rated worse than the samples from group I, which means that the introduction of short local audiovisual incoherences at the join positions is not as easily noticed as a time-varying audiovisual asynchrony. On the other hand, the improved smoothness of the visual speech mode of the samples from group V did not manage to increase their subjective quality rating, which means that the benefits of the individual optimization of the visual speech mode were cancelled by the decrease of the level of audiovisual coherence. Some participants indeed reported that the smoothed visual speech appeared "mumbled" compared to the accompanying auditory speech signal, which exhibited stronger articulations.

3.7.5 Conclusions

This section studied the audiovisual concatenation problem by investigating an optimal coupling technique that calculates the appropriate join positions in both the auditory and the visual speech mode. The proposed optimization techniques are designed to smooth the synthetic visual speech by introducing a time-varying audiovisual asynchrony or some local audiovisual incoherences in the concatenated audiovisual speech. Results from a subjective perception experiment indicate that earlier published values for just noticeable audiovisual asynchrony hold in the non-uniform case as well (i.e., they hold for both constant and time-varying audiovisual asynchrony levels). A possible explanation for this resides in the fact that human speech perception is to a great extent based on predictions. Through everyday speech communication, humans learn what is to be considered a "normal" speech signal. Every aspect of a synthetic speech signal that does not conform to these normal speech patterns will be immediately noticed. Since time-varying audiovisual asynchronies do not exist in original speech signals, it can be expected that there exists no temporal window in which humans are less sensitive to the audiovisual asynchrony in multimodal speech perception.

Objective measures showed that the optimization of the visual join positions indeed enhances the smoothness of the synthetic visual speech. However, a subjective experiment assessing the effects of these optimizations on the perception of the concatenated audiovisual speech showed no indication that the observers preferred the smoothed synthesis samples over the samples that were synthesized to contain a maximal coherence between both synthetic speech modes. Quite often the benefit of the optimal coupling approach, i.e., an improved individual quality of the synthetic visual speech mode, was cancelled by a noticeable decrease of the audiovisual coherence in the synthetic audiovisual speech. The proposed audiovisual optimal coupling techniques appear to cause some sort of disturbing under-articulation effect, since some rapid variations in the auditory mode are not seen in the corresponding video mode.
These findings are in line with the results from the experiments described in section 3.6, where it was concluded that a maximal audiovisual coherence is crucial in order to attain a high-quality perception of the synthetic audiovisual speech signal. The avoidance of any mismatch between both synthetic speech modes appears to be at least as important as the individual optimization of one of the two speech modes.

3.8 Summary and conclusions

The great majority of the audiovisual text-to-speech synthesis systems that are described in the literature adopt a two-phase synthesis approach, in which the synthetic auditory and the synthetic visual speech are synthesized separately. The downside of this synthesis strategy is its inability to maximize the level of coherence between the two output speech modes. To overcome this problem, a single-phase audiovisual speech synthesis approach is proposed, in which the synthetic audiovisual speech is generated by concatenating original combinations of auditory and visual speech that are selected from a pre-recorded speech database. Auditory and visual selection costs are used to select original speech segments that match the target speech as closely as possible in both speech modes. To concatenate the selected segments, an advanced join technique is used that smooths the concatenations by generating appropriate intermediate pitch periods and video frames. The proposed single-phase AVTTS approach was subjectively compared with a common two-phase synthesis strategy in which the auditory and the visual speech are synthesized separately using two different databases. These experiments indicated a reduction of the perceived audiovisual speech quality when the level of coherence between the two presented speech modes is lowered. Because of this, the single-phase synthesis results were preferred over the two-phase synthesis results. In order to improve the individual quality of the synthetic visual speech mode, multiple audiovisual optimal coupling techniques were designed and evaluated. These techniques are able to improve the smoothness of the synthetic visual speech signal at the expense of a degraded audiovisual synchrony and/or coherence. However, a subjective evaluation pointed out that the proposed optimization techniques are unable to enhance the perceived audiovisual speech quality. This result again indicates the importance of the level of audiovisual coherence for the perception of the audiovisual speech information.

People are very familiar with perceiving audiovisual speech signals, since this kind of communication is used countless times in their daily life. When perceiving (audiovisual) speech, the observer continuously makes predictions about the received information. This means that even the shortest and smallest errors in the speech communication will be directly noticed. Such local errors have been found to degrade the perceived quality of much longer speech signals in which they occur [Theobald and Matthews, 2012]. This raises serious problems when a natural perception of synthesized speech is aimed for, since it is a major challenge to design a TTS system that is able to generate a completely error-free synthetic speech signal. For audiovisual speech synthesis, the problem becomes even more challenging, since in that case not only the two individual speech modes but also the combination of these two signals should be perceived as error-free.
When perceiving a synthetic audiovisual speech signal of which the intermodal coherence is affected, a typical problem that occurs is that the observers do not believe that the virtual speaker they see in the visual speech mode actually uttered the auditory speech information they hear in the corresponding acoustic speech mode. For instance, this was noticed when evaluating the audiovisual optimal coupling techniques: when the visual speech mode of the test samples was smoothed individually, the visual speech easily appeared "mumbled" in comparison with the more pronounced articulations that were present in the accompanying auditory speech mode. A similar problem is likely to occur when a two-phase synthesis approach is applied. These systems tend to generate a "safe" synthetic visual speech signal that contains for each target phoneme its most typical visual representation (e.g., systems that apply a simple many-to-one mapping from phonemes to visemes (see section 1.3 or chapter 6), rule-based synthesizers that apply the same synthesis rule for each instance of the same phoneme (see section 2.2.6), etc.). This implies that some of the atypical articulations (e.g., very pronounced ones) that are present in the synthetic acoustic speech mode will lack an appropriate visual counterpart in the audiovisual output speech. When evaluating the single-phase synthesis approach it was noticed that this effect holds for non-optimal parts of the synthetic speech as well, since it could be observed that it is preferable that very fast or sudden (non-optimal) articulations occurring in one speech mode have a similar counterpart in the other speech mode as well. Obviously, this is infeasible when the two synthetic speech modes are synthesized separately.

Another downside of a two-phase AVTTS synthesis approach is the fact that a post-synchronization of the two synthetic speech modes is required. As was shown in the perception experiments, this synchronization is feasible by non-uniformly time-stretching the signals in order to align the boundaries of each phoneme with the boundaries of its corresponding viseme. However, since allophones can exhibit a specific kinematics profile, for some parts of the speech signal this synchronization step is likely to affect the speech quality. For instance, the lengthening of an allophone can be due to a decrease in speech rate, pre-boundary lengthening, lexical stress, or emphatic accentuation. When each phone of a speech signal is individually time-stretched, the kinematics of the phones are altered, which can lead to a degraded transmission of the speech information and to a decrease of the level of audiovisual coherence [Bailly et al., 2003].

The results obtained motivate further investigation of the single-phase synthesis approach, since this is the most convenient technique to ensure that the perceived quality of the synthetic audiovisual speech is not affected by audiovisual coherence issues. Unfortunately, the subjective experiments indicated that the attainable synthesis quality of the proposed AVTTS approach is too limited to accurately mimic original audiovisual speech. Therefore, optimizations to the audiovisual synthesis have to be developed that enhance the quality of the synthetic auditory and the synthetic visual speech. From the results obtained in this chapter it is known that it will have to be ensured that these optimizations do not affect the coherence between these two synthetic speech modes.
Note that this thesis will mainly focus on the enhancement of the synthetic visual speech mode, as various strategies for improving the auditory speech quality were developed in the scope of the laboratory's parallel research on auditory text-to-speech synthesis. Some of the techniques, experiments and results mentioned in this chapter have been published in [Mattheyses et al., 2008], [Mattheyses et al., 2009a] and [Mattheyses et al., 2009b].

4 Enhancing the visual synthesis using AAMs

4.1 Introduction and motivation

The results described in the previous chapter offer strong motivation to investigate the single-phase AVTTS synthesis approach further. Unfortunately, the proposed synthesis set-up resulted in synthetic audiovisual speech signals that did not resemble original audiovisual speech closely enough to prevent human observers from distinguishing between the two. Therefore, additional improvements to the synthesis technique are needed. Recall from the previous chapter that it has to be ensured that none of these optimizations significantly affects the audiovisual coherence in the synthetic output speech. There already exists a wide research area that investigates the improvement of auditory speech synthesis. One such research project is conducted in parallel with the research described in this thesis [Latacz et al., 2010] [Latacz et al., 2011]. Therefore, this thesis focuses on various developments to enhance the quality of the synthetic visual speech mode.

A first experiment was conducted to investigate exactly which features caused the observers to distinguish between the synthesized and the original visual speech. Multiple original sentences from the LIPS2008 database were resynthesized, after which for each generated video frame its facial landmarks were tracked, similar to the analysis performed on the original database speech (see section 3.3.3.4). Then, using these landmark points, for each sentence both the original and the synthesized visual speech information was represented using point-light signals, as illustrated in figure 4.1.

Figure 4.1: A point-light visual speech signal.

Such point-light signals display only the kinematics of the visual speech. It has been shown that human observers are quite sensitive to the coherence between an acoustic speech signal and the variation of facial flesh points (indicated here using point-lights) [Bailly et al., 2002]. The synthetic point-light signals were synchronized with the original acoustic speech from the database by aligning the phoneme boundaries in both signals. The time-scaling of the point-light signals was achieved by adding video frames (by calculating intermediate point-light positions) or by removing video frames. Then, the original auditory speech was played synchronously with both the original and the synthetic point-light signals. In an informal perception test it was found that observers performed much worse in distinguishing between the original and the synthesized point-light signals than in distinguishing between original and synthesized "real" visual speech signals. From this observation it can be concluded that the kinematics of the synthetic visual speech signals do a reasonably good job in mimicking the original speech kinematics.
This means that in order to enhance the quality of the synthetic visual speech, a greater effort has to be made to improve the smoothness and the naturalness of the total appearance of the synthetic visual speech (e.g., teeth visibility, tongue visibility, colour and lighting continuity, etc.). In order to improve the quality of the synthetic visual speech, a more detailed analysis of the original visual speech data is needed. Processing this data as a series of static images makes it very hard to differentiate between the analysis and/or synthesis of aspects concerning the speech movements (e.g., lip and jaw movements) and aspects concerning the overall appearance of the mouth area (e.g., visibility of the teeth, colours, shadows, etc.). Such a differentiation would allow the design of techniques that remove the jerkiness from the overall appearance of the virtual speaker, while the amplitude of the displayed speech kinematics is maintained to avoid under-articulation effects that cause a "mumbled" perception of the synthetic visual speech mode (e.g., this was noticed in the subjective assessment of the audiovisual optimal coupling techniques (see section 3.7)).

4.2 Facial image modeling

A convenient technique to analyse the original visual speech recordings is to parameterize each captured video frame. This way, the frame-by-frame variations of the parameter values compose parameter trajectories that describe the visual speech information. Some image parameterizations not only mathematically describe the image data in terms of parameter values, they also allow an image to be reconstructed from a given set of parameter values. This category of parameterizations is often referred to as image modelling techniques. Section 3.3.3.4 already mentioned that the original visual speech was modelled by a PCA calculation. To this end, a PCA analysis was performed on the image data gathered by extracting the mouth area from each original video frame. This analysis determines for each original video frame an associated vector of PCA coefficients. It also calculates a set of so-called "eigenfaces", which can be linearly combined to recreate any particular original video frame (using that frame's PCA coefficients as combination weights). Unfortunately, a PCA analysis parameterizes the visual speech information contained in each video frame "as a whole", since it treats the facial images as standard mathematical matrices containing the (grayscale) pixel values. This means that a PCA-based analysis of the original speech data is unable to differentiate between the parameterization of the kinematics and the parameterization of the appearance of the visual speech.

An extension to PCA for modelling a set of similar images makes use of a so-called 2D Active Appearance Model (AAM) [Edwards et al., 1998b] [Cootes et al., 2001]. Similar to other modelling techniques, an AAM is able to project an image into a model-space, meaning that the original image data can be represented by means of its corresponding model parameter values. In addition, when the AAM has been appropriately trained using hand-labelled ground-truth data, it is possible to generate a new image from a set of unseen model parameter values. In contrast to plain PCA calculations on the pixel values of the image, an AAM models two separate aspects of the image: the shape information and the texture information.
The shape of an image is defined by a set of landmark points that indicate the position of certain objects that are present in each image that is used to build the AAM. The texture of an image is determined by its (RGB) pixel values, which are sampled over triangles defined by the landmark points that denote the shape information of the image. The sampling of these pixel values is performed on the shape-normalized equivalent of the image: before sampling the triangles, a warped version of the image is calculated by aligning its landmark points to the mean shape of the AAM (i.e., the mean value of every landmark point sampled over all training images).

An AAM is built from a set of ground-truth images, of which the shape of each image is hand-labelled by a manual positioning of the appropriate landmark points. The vector containing the landmark positions that correspond to a particular image is called the shape S of that image. In addition, its texture T is defined by the vector containing the pixel values of its shape-normalized equivalent. From all training shapes S_i, the mean shape S_m is calculated and a PCA calculation is performed on the normalized shapes Ŝ_i with

Ŝ_i = S_i − S_m    (4.1)

This PCA calculation returns a set of "eigenshapes" P_s which determine the shape model of the AAM. Likewise, the mean training texture T_m is calculated and a second PCA calculation is performed on the normalized training textures T̂_i with

T̂_i = T_i − T_m    (4.2)

This returns the "eigentextures" P_t which define the texture model of the AAM. After the AAM has been built, any unseen image with shape S and texture T can be projected on the AAM by searching iteratively for the most appropriate model parameters (shape parameters B_s and texture parameters B_t) that reconstruct the original shape and the original texture using the shape model and the texture model, respectively:

S_recon = S_m + P_s B_s,   T_recon = T_m + P_t B_t    (4.3)

Several approaches exist for optimizing the values of B_s and B_t to ensure that

S_recon ≈ S,   T_recon ≈ T    (4.4)

These techniques are beyond the scope of this thesis, and the interested reader is referred to [Cootes et al., 2001]. After projection on the AAM, the original image information has been parameterized by means of the vectors B_s and B_t, as illustrated in figure 4.2. In addition, the trained AAM is capable of calculating, from an unseen set of shape parameters B_s^new and texture parameters B_t^new, a new shape S^new and a new texture T^new by means of equation 4.3. From S^new and T^new a new image can be generated by warping the shape-normalized texture T^new (aligned with the mean shape S_m) towards the new shape S^new.

Figure 4.2: AAM-based image modelling.

For some applications, it is convenient that an image is represented by a single set of model parameters, of which each individual parameter value determines both shape and texture properties. To this end, from the shape model and the texture model of the AAM, a combined AAM is calculated which can be used to transform image data into a vector of so-called "combined AAM" parameter values (and vice versa) [Edwards et al., 1998a].
For some applications, it is convenient that an image is represented by a single set of model parameters, of which each individual parameter value determines both shape and texture properties. To this end, from the shape model and the texture model of the AAM, a combined AAM is calculated which can be used to transform image data into a vector of so-called "combined AAM" parameter values (and vice versa) [Edwards et al., 1998a]. To build this combined model, the shape parameters B_s and the texture parameters B_t of each training image are concatenated to create the vector B_{concat}

B_{concat} = \begin{pmatrix} W_s B_s \\ B_t \end{pmatrix}    (4.5)

W_s is a weighting matrix that corrects the difference in magnitude between B_s (which models landmark coordinate positions) and B_t (which models pixel intensities). W_s scales the variance of the shape parameters of the training images to equal the variance of the texture parameters of the training images. Then, a PCA calculation is performed on the B_{concat} vectors of the training images, resulting in a collection of eigenvectors denoted as Q. Each concatenated vector B_{concat} can be written as a linear combination of these eigenvectors:

B_{concat} = Q c    (4.6)

The vector c describes the combined model parameter values of the image data that was used to construct B_{concat}. Q can be written as

Q = \begin{pmatrix} Q_s \\ Q_t \end{pmatrix}    (4.7)

The original image data can be directly reconstructed from a combined parameter vector c by substituting equations 4.3 and 4.5 in equation 4.6:

\begin{cases} S_{recon} = S_m + P_s W_s^{-1} Q_s c \\ T_{recon} = T_m + P_t Q_t c \end{cases}    (4.8)
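Continuing the previous sketch, the combined model of equations 4.5-4.8 could be computed roughly as follows. For simplicity the weighting W_s is taken here as a single scalar that balances the total shape and texture variance, which is one common simplification of the weighting matrix; as before, all names are illustrative.

    import numpy as np

    def build_combined_model(B_s_all, B_t_all, retained_variance=0.97):
        """B_s_all, B_t_all: (N_train, n_shape) / (N_train, n_texture) parameters."""
        # scalar W_s: scale the shape parameters so their total variance matches
        # the total variance of the texture parameters
        w_s = np.sqrt(B_t_all.var(axis=0).sum() / B_s_all.var(axis=0).sum())
        B_concat = np.hstack([w_s * B_s_all, B_t_all])       # eq. 4.5
        # the parameters are (approximately) zero-mean because they come from
        # centered PCA projections, so the PCA reduces to a plain SVD
        _, s, vt = np.linalg.svd(B_concat, full_matrices=False)
        var = s ** 2
        k = int(np.searchsorted(np.cumsum(var) / var.sum(), retained_variance)) + 1
        Q = vt[:k].T                                          # eq. 4.6: B_concat = Q c
        n_shape = B_s_all.shape[1]
        return w_s, Q[:n_shape], Q[n_shape:]                  # eq. 4.7: Q_s, Q_t

    def reconstruct_from_combined(c, w_s, Q_s, Q_t, S_m, P_s, T_m, P_t):
        """Eq. 4.8: shape and texture directly from combined parameters c."""
        S_recon = S_m + P_s @ (Q_s @ c) / w_s
        T_recon = T_m + P_t @ (Q_t @ c)
        return S_recon, T_recon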
4.3 Audiovisual speech synthesis using AAMs

4.3.1 Motivation

From sections 4.1 and 4.2 it can be concluded that the use of AAMs can make a significant contribution to the enhancement of the (audio-)visual speech synthesis strategy, since an AAM offers an individual parameterization of the shape and the texture information of each original video frame. When all video frames from an original visual speech recording are represented by their shape and their texture parameters, these consecutive parameter values define parameter trajectories that describe the visual speech information. The shape-trajectories describe the variations of the shape information in the visual speech signal. This means that these shape-trajectories can be seen as a representation of the kinematics of the speech information (i.e., the movements of the various visual articulators). Similarly, the variation of the texture in the visual speech is described by the texture-trajectories. These trajectories can be seen as a representation of the variation of the appearance of the virtual speaker throughout the original speech recordings (i.e., the visibility of teeth/tongue inside the mouth, changes in illumination on the skin, etc.). Because of this, the AAM-based parameterization of the original speech data offers exactly the representation that makes it possible to diversify the processing between the kinematics-related and the appearance-related visual speech aspects, which is needed to improve the attainable synthetic visual speech quality (as was suggested in section 4.1).

The use of AAMs for visual speech synthesis is not new. Theobald et al. developed a concatenative visual speech synthesizer that selects original segments of visual speech information from a database containing AAM-mapped original speech recordings [Theobald, 2003] [Theobald et al., 2004]. Melenchon et al. developed a rule-based synthesis approach in which AAMs are used to represent the original speech data that is used to learn multiple visual pronunciation rules for each phoneme [Melenchon et al., 2009]. AAMs have also been used for speech-driven visual speech synthesis, since the AAM-based representation of the visual speech offers a convenient set of visual features that can be linked to auditory features of the corresponding auditory speech [Cosker et al., 2003] [Theobald and Wilkinson, 2008] [Englebienne et al., 2008].

In this thesis, however, it is investigated how the use of AAMs can enhance the individual quality of the synthetic visual speech mode generated by the single-phase audiovisual speech synthesis system. From the previous chapter it is known that it will have to be verified that this unimodal optimization does not affect the level of audiovisual coherence between the two synthetic speech modes. When an AAM is used to represent the original visual speech data from the speech database, the model must first be trained on a set of hand-labelled training images. Afterwards, any unseen image can be projected into the model-space by calculating the most appropriate model parameters that reconstruct the image. Note that an ideal reconstruction is only feasible for the training images. For any other image, there will always exist a difference between the original pixel values and the pixel values of the regenerated image. A smaller reconstruction error is feasible for images that are similar to at least one of the training images that were used to train the AAM. Therefore, it is a challenge to build an AAM that is able to appropriately model and reconstruct each frame of the original visual speech recordings. In addition, when the original visual speech data is reconstructed from trajectories of model parameters, the resulting speech information will be slightly different from the original speech data. Therefore, it has to be investigated whether this modification has an influence on the perception of the speech data when it is shown to an observer. In addition, it should be assessed to what extent the AAM modelling of the visual speech data affects the level of audiovisual coherence in the speech signal that is created by multiplexing the original auditory speech with the regenerated visual speech signal.

4.3.2 Synthesis overview

The AAM-based AVTTS synthesis procedure is similar to the synthesis approach that was described in the previous chapter. The synthesizer selects original combinations of auditory and visual speech from the database, after which these segments are audiovisually concatenated to construct the desired synthetic speech. The major difference is that in this case, the original video recordings are modelled using an AAM, which means that the original visual speech data that is provided to the synthesizer consists of trajectories of AAM parameters instead of video frames. The selection of an original video segment corresponds to the extraction of a sub-trajectory from the database, while the concatenation of two selected video fragments is performed by joining the extracted sub-trajectories. In a final stage, the concatenated sub-trajectories are sampled at the output video frame rate, after which a new image is generated from each sampled set of AAM parameter values. These newly generated images then construct the video frame sequence of the synthetic visual speech.

Figure 4.3: AVTTS synthesis using an active appearance model.
The individual parameterization of the various aspects of the visual speech data makes it possible to optimize the visual synthesis in each synthesis stage. An overview of the synthesis approach is given in figure 4.3.

4.3.3 Database preparation and model training

Most of the time, AAMs are used in visual speech analysis and synthesis to model images representing the complete face of the original/virtual speaker. This approach was also followed by the AAM-based visual speech synthesizers that were mentioned in section 4.3.1. Recall from section 3.5.1 that in the initial AVTTS approach only the mouth area of the video frames is synthesized in accordance with the target phoneme sequence. Afterwards, this mouth-signal is merged with a background signal to create the final output visual speech signal. A similar approach is followed in the AAM-based AVTTS approach: the AAM is trained only on the mouth area of the video frames from the LIPS2008 database. This way, the AAM only models the most important speech-related variations and does not have to model, for instance, variations of the eyes or the eyebrows contained in the original speech recordings. The shape information of each frame is determined by 29 landmarks that indicate the outside of the lips and 18 landmarks that indicate the inside of the lips. In addition, 14 landmarks are used to indicate the position of the cheeks and the neck, and 3 additional landmarks are used to indicate the position of the nose. The landmarks corresponding to a typical frame are visualised in figure 4.4. All texture information inside the convex hull denoted by the landmarks is modelled by the AAM.

In order to efficiently implement the AAM-based analysis and synthesis, the AAM-API library was used [Stegmann et al., 2003]. This is a freely available C++ programming API that offers the core functions to perform AAM training, AAM projection, and AAM reconstruction. The library was extended to allow the use of AAMs for the specific purpose of visual speech synthesis. For instance, the memory usage during the process of AAM training was optimized in order to be able to build complex models on a large set of high-resolution training images. In addition, the library was extended to be able to read a given set of model parameter trajectories and reconstruct a sequence of mouth images of which the displayed mouth configurations are appropriately aligned to create a smooth mouth animation. To this end, all output frames are aligned using the mean value of the six inner landmark points on the outside of the upper lip. Also, in order to optimize the reconstruction of an image displaying a closed mouth, it was ensured that all landmarks indicating the inside of the upper lip are located above the corresponding landmarks indicating the inside of the lower lip.

Building a high-quality AAM is not a straightforward task. Since the AAM has to be able to accurately describe each original video frame by means of model parameter values, it is important that the training images that are used to build the AAM sufficiently cover all possible variations present in the database. In addition, the hand-crafted landmark information of these training images should be as consistent as possible, since all variations that are present in this manually determined shape data are regarded as "ground-truth" and will be modelled by the AAM (this also includes the variations caused by an inconsistent or erratic manual landmark positioning).
In order to obtain a collection of training images with associated shape data that satisfies these criteria, an iterative technique was developed to build a high-quality AAM that preserves as much original image detail as possible, while requiring only a limited amount of manual labour. In a first step, 20 frames from the database were landmarked manually. This subset of 20 frames was selected manually to ensure that it contains many different mouth representations (open/closed mouth, visible teeth, visible mouth-cavity, etc.). From these frames and their associated shape information, an "initial" AAM was built. Afterwards, this trained AAM was used to calculate the shape and the texture parameter values of each frame of 100 sentences, selected randomly from the database. A k-means clustering was performed on all the calculated AAM parameter values to determine 50 visually distant frames, of which the shape information was re-labelled manually. These frames were used to train an improved "intermediate" AAM. The intermediate AAM was applied to calculate a set of model parameter values for every video frame contained in the speech database. Then, by means of a k-means clustering on these new model parameter values, 160 visually distant frames were selected as training set for the "final" AAM. For each of these frames, the corresponding landmark positions were automatically calculated from their shape parameter values under the intermediate AAM. These landmark positions were checked manually and corrected if necessary, after which the final AAM was built from these 160 frames and their manually adjusted shape information.

Section 4.2 explained that the shape model and the texture model of the AAM are determined by PCA calculations on the training data. This implies that each resulting eigenshape or eigentexture corresponds to a particular degree of model variation. A standard approach in PCA analysis is to compress the model-based representation of the original data by omitting some of the least important eigenvectors (and their corresponding parameter values). This technique was applied by designing the final AAM to retain 97% of the variation contained in the final training set, which resulted in 8 eigenvectors that represent the shape model and 134 eigenvectors that represent the texture model. It was checked that no difference in image quality could be noticed between images regenerated using the final AAM and images regenerated using an AAM that was built to model 100% of the variation of the final training set. Finally, a combined model was calculated from the shape model and the texture model of the final AAM (see section 4.2). By omitting 3% of the total variation, 94 eigenvectors were needed to represent this combined AAM model.

The final AAM was used to project all frames of the speech database into the model-space. To this end, for each frame the corresponding shape parameter values, texture parameter values, and combined model parameter values were calculated. In addition, for each frame the delta-shape, delta-texture and delta-combined parameters were calculated in order to parameterize the variation of the image information.
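The clustering step that picks visually distant frames for (re-)labelling could look roughly like the sketch below, assuming all database frames have already been projected to AAM parameters with the current ("initial" or "intermediate") model. The function name and the use of scikit-learn are illustrative choices, not a description of the actual implementation.

    import numpy as np
    from sklearn.cluster import KMeans

    def select_training_frames(frame_params, n_frames=160, seed=0):
        """frame_params: (n_database_frames, n_params) AAM parameters per frame.
        Returns the indices of the frames closest to the k-means centroids,
        i.e. a set of mutually distant mouth configurations."""
        km = KMeans(n_clusters=n_frames, n_init=10, random_state=seed)
        km.fit(frame_params)
        selected = []
        for centroid in km.cluster_centers_:
            dists = np.linalg.norm(frame_params - centroid, axis=1)
            selected.append(int(np.argmin(dists)))
        return sorted(set(selected))

    # usage (hypothetical): params = project_all_frames(video); idx = select_training_frames(params)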
The AAM-API functions were configured to use the Fixed Jacobian Matrix Estimate technique to iteratively search for the most appropriate model parameters that describe a particular original video frame (equation 4.3). Details on this technique can be found in [Cootes and Taylor, 2001] and [Stegmann et al., 2003]. An example of an original frame extracted from the database, its automatically determined shape information, and the AAM reconstruction using its shape and texture parameter values is given in figure 4.4.

Figure 4.4: AAM-based representation of the original visual speech data, illustrating the original image (left), its landmarking denoting the associated shape information (middle), and the reconstructed image from its shape and texture parameter values (right).

The proposed approach for building the AAM has the benefit that the final AAM is trained on a large number of images that are selected to represent each typical configuration found in the original speech data. As a consequence, it is able to appropriately model the full variety of visual speech information that is contained in the speech database. The initial AAM and the intermediate AAM are necessary to allow an accurate selection of these representative final training images. Since the shape information (i.e., the landmark positions) of the final training images is automatically generated from shape parameter values corresponding to the intermediate AAM, this data is more consistent compared to a manually landmarked collection of images.

The AAM-based parameterization of the visual speech from the LIPS2008 database makes some of the visual meta-data mentioned in section 3.3.3.4 superfluous, such as the original landmark positions (that were determined by the original landmark tracker) and the PCA-based parameterization. On the other hand, the measures for the amount of visible teeth and visible mouth-cavity are still useful in the AAM-based AVTTS system, since they provide a more direct indication of these two visual features than the AAM texture parameter values. Obviously, the symbolic and the acoustic meta-data, described in sections 3.3.3.2 and 3.3.3.3, respectively, are still applied in the AAM-based AVTTS approach in order to describe the linguistic/prosodic and the acoustic properties of the original audiovisual speech recordings.

4.3.4 Segment selection

The selection costs that are applied in the AAM-based AVTTS system are similar to the various sub-costs that were used in the initial AVTTS synthesis strategy (see section 3.4 for an overview). Candidate segments are selected from the database based on their phonemic match with the target phoneme sequence. Next, target costs and join costs are applied to select for each synthesis target one final database segment to construct the output audiovisual speech.

4.3.4.1 Target costs

The target costs have to force the selection towards original segments that resemble the synthesis targets as closely as possible. Binary target costs are used to select original segments that exhibit the appropriate prosodic features (see section 3.4.2.2). In addition, "safety" target costs are used to minimize the selection of erroneous segments (see section 3.4.2.3). A novel strategy of labelling "suspicious" original segments was introduced using the combined AAM parameter values as the visual feature for calculating equation 3.5.
In contrast to the initial AVTTS system, the AAM-based AVTTS approach includes an additional target cost that takes the visual coarticulation effects into account by promoting the selection of original segments of which the extended visual context is similar to the visual context of the corresponding synthesis target. For instance, when the candidate segment is preceded by a vowel that is associated with a wide mouth opening, the transition effects that are likely to be present in the candidate segment are suited for copying to the synthetic speech in case the corresponding target is also preceded by a vowel that exhibits a pronounced mouth opening. Therefore, in addition to the symbolic target costs that express the phonemic matching between the target context and the candidate context, the "visual context" target cost has to express the visemic matching between the target context and the candidate context.

The visemic matching between two phoneme sequences is often calculated as a binary cost, for which all phonemes are first given a unique viseme label (based on their most common visual representation). Afterwards, the viseme labels of the corresponding phones from both sequences can be compared. Unfortunately, an accurate definition of these viseme labels is far from straightforward, since many phonemes exhibit a variable visual representation due to visual coarticulation effects (see also further on in this thesis). Therefore, it was opted to calculate the visual context target cost using a visual difference matrix that expresses the visual distance between every two phonemes present in the database (as was proposed by Arslan and Talkin [Arslan and Talkin, 1999]). It is important that this matrix is calculated ad hoc for the particular original speech data that is used for synthesis, since each speaker exhibits his/her own personal speaking style and visual coarticulation effects have been found to be speaker-specific [Lesner and Kricos, 1981].

To calculate the visual difference matrix, for every distinct phoneme all its instances in the database are gathered. For each instance, the combined AAM parameters of the video frame located at the middle of the instance are sampled. From these values, means M_{ij} and variances S_{ij} are calculated, where index i corresponds to the various phonemes and index j corresponds to the distinct model parameters. For a particular phoneme i, the sum of all the variances of the model parameters, \sum_j S_{ij}, expresses how much the visual appearance of that phoneme is affected by visual coarticulation effects. Two phonemes can be considered comparable in terms of visual representation if their mean representations are alike and, in addition, if these mean visual representations are sufficiently reliable (i.e., if small summed variances were measured for the visual representation of these phonemes).
Therefore, two matrices are calculated, which express for each pair of phonemes (p, q) the Euclidean difference between their mean visual representations and the sum of the variances of their visual representations, respectively:

\begin{cases} D^M_{pq} = \sqrt{\sum_j (M_{pj} - M_{qj})^2} \\ D^S_{pq} = \sum_j S_{pj} + \sum_j S_{qj} \end{cases}    (4.9)

Dividing each matrix by its largest element produces the scaled matrices \hat{D}^M_{pq} and \hat{D}^S_{pq}, after which the final difference matrix D is constructed by:

D_{pq} = 2 \hat{D}^M_{pq} + \hat{D}^S_{pq}    (4.10)

Matrix D can be applied to calculate the visual context target cost C_{viscon} for a candidate segment u, matching a given target phoneme sequence t, by comparing the three phonemes located before (u-n) and after (u+n) the segment u in the database (i.e., the visual context of the candidate segment) with the visual context of the synthesis target:

C_{viscon}(t, u) = \sum_{n=1}^{3} (4-n) D_{(t-n,\,u-n)} + \sum_{n=1}^{3} (4-n) D_{(t+n,\,u+n)}    (4.11)

In equation 4.11 the factor (4-n) defines a triangular weighting of the calculated visual distances. Finally, for the AAM-based synthesis the first expression from equation 3.4 can be written as:

C^{target}_{total}(t_i, u_i) = \omega_1 C_{phon.match}(t_i, u_i) + \omega_2 C_{hard-pruning}(t_i, u_i) + \omega_3 C_{soft-pruning}(t_i, u_i) + \frac{\omega_4 \hat{C}_{viscon}(t_i, u_i) + \sum_{j=1}^{N} \omega_j^{symb} C_j^{symb}(t_i, u_i)}{\omega_4 + \sum_{j=1}^{N} \omega_j^{symb}}    (4.12)

Note that this equation is very similar to equation 3.12. \hat{C}_{viscon} represents the scaled value of C_{viscon} (see section 3.4.4.1). The values for the weights \omega_1, \omega_2, \omega_3, and \omega_j^{symb} are the same as described in section 3.4.4.2. The weight factor of the visual context cost is chosen such that the term \omega_4 \hat{C}_{viscon} is equally important as the summed symbolic costs \sum_{j=1}^{N} \omega_j^{symb} C_j^{symb}. To this end, mean values for \hat{C}_{viscon} and \sum_{j=1}^{N} \omega_j^{symb} C_j^{symb} are learned from multiple random syntheses. The value for \omega_4 can then be calculated from the ratio of these two measures.
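A minimal sketch of the visual difference matrix (equations 4.9-4.10) and the visual context target cost (equation 4.11), assuming the per-phoneme means M[i, j] and variances S[i, j] of the combined AAM parameters have already been collected; the names are illustrative.

    import numpy as np

    def visual_difference_matrix(M, S):
        """M, S: (n_phonemes, n_params) per-phoneme mean / variance of the
        combined AAM parameters sampled at the phoneme centres."""
        # Euclidean distance between mean representations (eq. 4.9, D^M)
        D_M = np.linalg.norm(M[:, None, :] - M[None, :, :], axis=2)
        # summed variances of both phonemes (eq. 4.9, D^S)
        s = S.sum(axis=1)
        D_S = s[:, None] + s[None, :]
        # scale by the largest element and combine (eq. 4.10)
        return 2.0 * D_M / D_M.max() + D_S / D_S.max()

    def visual_context_cost(D, target_ctx, candidate_ctx):
        """target_ctx / candidate_ctx: phoneme indices at the context offsets
        (-3, -2, -1, +1, +2, +3) around the target and the candidate segment."""
        weights = [1, 2, 3, 3, 2, 1]          # triangular weighting (4 - |n|), eq. 4.11
        return sum(w * D[t, c]
                   for w, t, c in zip(weights, target_ctx, candidate_ctx))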
4.3.4.2 Join costs

The join costs have to ensure that the final sequence of selected database segments can be appropriately concatenated to construct the desired synthetic speech signal. The AAM-based AVTTS system uses both auditory and visual join costs to ensure smooth concatenations in both speech modes. The auditory join costs C_{MFCC}(u_i, u_{i+1}), C_{pitch}(u_i, u_{i+1}) and C_{energy}(u_i, u_{i+1}) are based on the difference between the MFCC, pitch, and energy features of the waveforms at the join position, respectively (see section 3.4.3.1).

In addition, the AVTTS system employs four separate visual join costs. A first join cost C_{teeth}(u_i, u_{i+1}) was also used in the initial AVTTS strategy and expresses the difference between the amount of teeth that is visible in the two video frames at the join position (see section 3.4.3.2). This join cost is useful since a sudden "jump" of this feature around a join position easily causes noticeable concatenation artefacts. Three additional visual join costs are calculated on the AAM parameter values of the video frames at the join position. A first cost calculates the Euclidean difference between the shape parameter values of these frames. Another cost calculates the Euclidean difference between the combined AAM parameter values of these frames. A final visual join cost is calculated as the Euclidean difference between the delta-combined AAM parameters of the video frames at the join position. This cost is included since these delta values express how the original visual features are varying in the segments u_i and u_{i+1} that need to be concatenated. When this variation can be maintained across the join position, smooth and natural variations will be perceived in the synthetic speech. The three AAM-based costs can be written as:

\begin{cases} C_{shape}(u_i, u_{i+1}) = \|B_{s,i} - B_{s,i+1}\| \\ C_{combined}(u_i, u_{i+1}) = \|c_i - c_{i+1}\| \\ C_{\Delta combined}(u_i, u_{i+1}) = \|\Delta c_i - \Delta c_{i+1}\| \end{cases}    (4.13)

with B_{s,i} the shape parameters, c_i the combined AAM parameters, and \Delta c_i the delta-combined parameters of the last video frame of segment u_i. Similarly, the parameter values B_{s,i+1}, c_{i+1} and \Delta c_{i+1} are sampled at the first video frame of segment u_{i+1}. Note that the difference between the texture parameter values of the join frames does not define a separate join cost. The reason for this is that the most critical texture-related feature, namely the continuity of the teeth appearance, is already separately taken into account by the teeth join cost. Other texture-related aspects are taken into account by comparing the combined model parameter values. In addition, further on in section 4.4.3 it will be explained that the texture information is heavily smoothed around the concatenation points anyway. For the AAM-based synthesis, the second expression from equation 3.4 can be written as:

C^{join}_{total}(u_i, u_{i+1}) = \frac{\omega_1 \hat{C}_{MFCC} + \omega_2 \hat{C}_{pitch} + \omega_3 \hat{C}_{energy} + \omega_4 \hat{C}_{teeth} + \omega_5 \hat{C}_{shape} + \omega_6 \hat{C}_{combined} + \omega_7 \hat{C}_{\Delta combined}}{\omega_1 + \omega_2 + \omega_3 + \omega_4 + \omega_5 + \omega_6 + \omega_7}    (4.14)

where each cost is evaluated for the segment pair (u_i, u_{i+1}) and \hat{C} represents the scaled value of the original cost C (see section 3.4.4.1). The join cost weights (\omega_1, ..., \omega_7) were optimized manually in a similar way as described in section 3.4.4, which resulted in the values \omega_1 = 5, \omega_2 = 2, \omega_3 = \omega_4 = 1, \omega_5 = 2, and \omega_6 = \omega_7 = 1. A value for the factor \alpha in equation 3.3 was determined such that the total target cost and the total join cost contribute equally to the total selection cost associated with the selection of a particular candidate segment.
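A minimal sketch of the three AAM-based join costs of equation 4.13 and the weighted total of equation 4.14, assuming the per-frame shape, combined and delta-combined parameters of the candidate segments are available as arrays; the data layout and names are illustrative.

    import numpy as np

    def aam_join_costs(seg_a, seg_b):
        """seg_a, seg_b: dicts with per-frame parameter arrays for two candidate
        segments; the join compares the last frame of seg_a with the first
        frame of seg_b (eq. 4.13)."""
        c_shape = np.linalg.norm(seg_a["shape"][-1] - seg_b["shape"][0])
        c_comb = np.linalg.norm(seg_a["combined"][-1] - seg_b["combined"][0])
        c_delta = np.linalg.norm(seg_a["delta_combined"][-1]
                                 - seg_b["delta_combined"][0])
        return c_shape, c_comb, c_delta

    def total_join_cost(scaled_costs, weights=(5, 2, 1, 1, 2, 1, 1)):
        """Eq. 4.14: weighted average of the scaled auditory and visual join
        costs, ordered (MFCC, pitch, energy, teeth, shape, combined, delta)."""
        return sum(w * c for w, c in zip(weights, scaled_costs)) / sum(weights)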
4.3.5 Segment concatenation

Joining the selected original audiovisual speech segments requires a concatenation in the auditory mode and a concatenation in the visual mode. The AAM-based AVTTS system performs the concatenation of the acoustic signals by a pitch-synchronous cross-fade, as was explained earlier in section 3.5.3. Recall that in the initial AVTTS approach, the concatenation of the visual speech segments was smoothed by substituting the original video frames around the join position with intermediate video frames generated using image morphing techniques. The AAM-based AVTTS approach has the benefit that a parameterization of the original visual speech is available. Instead of concatenating the video data directly, the selected original visual speech segments can be joined by the concatenation of the parameter sub-trajectories that correspond to the selected visual speech signals. Obviously, this allows a much more efficient way to smooth the visual concatenations, since the visual speech can be modified by adjusting the concatenated parameter trajectories around the join position. The concatenation of two selected visual speech segments involves a separate join calculation for all sub-trajectories that describe the visual speech information. The segments that need to be joined are partially overlapped to calculate appropriate parameter values at the concatenation point.

The AAM-based AVTTS system overlaps both sub-trajectories by exactly one video frame. Then, the concatenated trajectory is smoothed by adjusting the parameter values of the frames in the vicinity of the join position. When the two visual segments that need to be concatenated are denoted as \alpha and \beta, the sub-trajectories, corresponding to a particular AAM parameter, that describe these two segments can be written as (B_1^{\alpha}, B_2^{\alpha}, ..., B_m^{\alpha}) and (B_1^{\beta}, B_2^{\beta}, ..., B_n^{\beta}), respectively, given that segment \alpha contains m frames from its beginning until the frame that was selected as join position in \alpha, and that segment \beta contains n frames from the video frame that was selected as join position in \beta until the end of the segment. The joining of these two segments results in a joined segment J that is described by the parameter trajectory (B_1^J, B_2^J, ..., B_{m+n-1}^J). The parameter value at the join position, B_m^J, is calculated by the overlap of the two boundary frames:

B_m^J = \frac{B_m^{\alpha} + B_1^{\beta}}{2}    (4.15)

When a parameter S is used to denote the smoothing strength, the resulting parameter values of the concatenated trajectory are calculated as follows, for all k with 1 \le k \le m+n-1 and k \ne m:

B_k^J = \begin{cases} B_k^{\alpha} & \text{if } 1 \le k < m-S \\ \frac{m-k}{S+1} B_k^{\alpha} + \frac{(S+1)-(m-k)}{S+1} B_m^J & \text{if } m-S \le k < m \\ \frac{(S+1)-(k-m)}{S+1} B_m^J + \frac{k-m}{S+1} B_{k-m+1}^{\beta} & \text{if } m < k \le m+S \\ B_{k-m+1}^{\beta} & \text{if } m+S < k \le m+n-1 \end{cases}    (4.16)

Figure 4.5: Example of the concatenation of two sub-trajectories. The two original sub-trajectories are shown in black. The coloured lines indicate the concatenated trajectories that were calculated using S = 1 (blue), S = 3 (red), and S = 6 (green) as smoothing strength. Note that all concatenated trajectories pass through the interpolated value at the join position.

For each AAM parameter, equations 4.15 and 4.16 are used to concatenate the corresponding sub-trajectories. This makes it possible to optimize the concatenation smoothing strength for each AAM parameter individually, since for each parameter a separate value for the smoothing strength S can be applied. For instance, as was concluded in section 4.1, it is opportune to increase the value of S for joining sub-trajectories corresponding to texture parameters in comparison with the concatenation of sub-trajectories corresponding to shape parameters. This way, the texture variation in the concatenated visual speech can be adequately smoothed to achieve a continuous appearance of the virtual speaker across the concatenation points, while the shape information is smoothed less strongly to ensure that the visual articulations are appropriately pronounced. An example of the concatenation of two sub-trajectories is given in figure 4.5.

When all parameter sub-trajectories of all selected database segments have been concatenated, each concatenated trajectory is sampled at the output video frame rate. A new sequence of images can then be generated by the inverse AAM projection of the sampled parameter values. These new images describe the animation of the mouth area in accordance with the target speech. The merging of this video signal with the background signal creates the final full-face synthetic visual speech (see section 3.5.1).
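A compact sketch of this trajectory concatenation (equations 4.15 and 4.16), assuming the two sub-trajectories of a single AAM parameter are given as 1-D NumPy arrays that both include their respective join frame; this is an illustration under those assumptions, not the AAM-API code used in the system.

    import numpy as np

    def concatenate_subtrajectories(traj_a, traj_b, S=3):
        """Join two sub-trajectories of one AAM parameter (eqs. 4.15-4.16).
        traj_a ends at the join frame of the first segment, traj_b starts at
        the join frame of the second segment; S is the smoothing strength."""
        m, n = len(traj_a), len(traj_b)
        joined = np.empty(m + n - 1)
        joined[:m] = traj_a
        joined[m:] = traj_b[1:]
        join_value = 0.5 * (traj_a[-1] + traj_b[0])      # eq. 4.15
        joined[m - 1] = join_value
        # linear interpolation towards the join value on both sides (eq. 4.16)
        for offset in range(1, S + 1):
            w = offset / (S + 1.0)                       # weight of the original value
            if m - 1 - offset >= 0:
                joined[m - 1 - offset] = (w * traj_a[-1 - offset]
                                          + (1 - w) * join_value)
            if m - 1 + offset < m + n - 1:
                joined[m - 1 + offset] = (w * traj_b[offset]
                                          + (1 - w) * join_value)
        return joined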
4.4 Improving the synthesis quality

4.4.1 Parameter classification

The AAM-based representation of the original visual speech contained in the database makes it possible to independently parameterize the shape-related information and the texture-related information, which can for instance be used to apply a different concatenation smoothing strength to these two aspects (see section 4.3.5). However, new possibilities to enhance the attainable synthesis quality emerge when a separate description of distinct shape-aspects and distinct texture-aspects is feasible. This would allow speech-related shape/texture variations (e.g., lip movements, changes in teeth/tongue visibility, etc.) to be treated separately from the other variations present in the database (e.g., deviations of the head orientation, illumination changes, etc.). To this end, a technique to classify each shape/texture parameter in terms of its correlation with the speech is proposed.

Section 4.2 explained that the AAM parameters are each linked to an eigenvector, resulting from PCA calculations on the shape information and on the texture information contained in the set of images that was used to train the model. A manual inspection of the various parameters of the AAM that was trained on the LIPS2008 database indicated that many of these parameters/eigenvectors can be linked to a particular physical property. For example, the first shape parameter of the AAM influences the amplitude of the mouth-opening, while the second shape parameter is linked to a (limited) head rotation. Likewise, the first texture parameter influences the appearance of shadows on the face of the speaker, while the second texture parameter controls the presence of teeth in the image (see figure 4.6).

Figure 4.6: Relation between AAM parameters and physical properties. The two top rows indicate the speech-related first shape parameter and second texture parameter that influence the mouth opening and the appearance of visible teeth, respectively. The two bottom rows indicate the non-speech-related second shape parameter and first texture parameter that influence the head rotation and the casting of shadows on the face, respectively.

Two separate criteria were designed to identify the correlation between each model parameter and the speech information. A first measure is based on the knowledge that the visual representations of multiple instances of the same phoneme will be much alike. Since this is more valid for some phonemes than for others due to visual coarticulation effects, all distinct phonemes that are present in the database are processed consecutively, after which the mean behaviour calculated over all phonemes is taken as the final measure (see further on in this section). It can safely be assumed that in general the visual representations of two random database instances of the same phoneme are more similar than the visual representations of two phones that are completely randomly selected from the database. Therefore, it can be assumed that when a parameter is sufficiently correlated with the speech information, its values sampled at multiple database instances of the same phoneme will be more similar compared with its values sampled at random database locations.

In a first step of the analysis, for every distinct phoneme all its instances in the database are gathered.
For each instance, the shape/texture parameters of the video frame located at the middle of the instance are sampled. From these values, means M_{ij} and variances S_{ij} are calculated, where index i corresponds to the various phonemes and index j corresponds to the distinct model parameters. Then, for each phoneme i the shape/texture parameter values of a set of video frames randomly selected from the database are gathered. The size of this set of random frames is the same as the number of instances of phoneme i that exist in the database. The mean and the variance of the random parameter set are denoted as M^{rand}_{ij} and S^{rand}_{ij}, respectively. Next, the relative differences D^{var}_{ij} between the values S_{ij} and S^{rand}_{ij} are calculated:

D^{var}_{ij} = \frac{S^{rand}_{ij} - S_{ij}}{S^{rand}_{ij}}    (4.17)

Finally, a single measure for each model parameter is acquired by calculating the mean variance difference over all phonemes:

D^{var}_j = \frac{\sum_i D^{var}_{ij}}{N_p}    (4.18)

with N_p the number of distinct phonemes in the database. D^{var}_j expresses for each parameter j the relative difference between its overall variation and its intra-phoneme variation. This means that highly speech-correlated parameters will exhibit larger values for D^{var}_j than other parameters. The values for D^{var}_j that were measured for the AAM that was trained on the LIPS2008 database are visualized in figure 4.7.

Figure 4.7: Values for D^{var} for the 8 shape parameters (top panel) and the 8 most important texture parameters (lower panel) of the AAM trained on the LIPS2008 database. Compare the values obtained with the physical meaning of the two most important shape and texture parameters visualized in figure 4.6.
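The first classification measure (equations 4.17 and 4.18) could be computed along the following lines, assuming the per-phoneme mid-frame parameters and the parameters of all database frames are available; names are illustrative.

    import numpy as np

    def d_var(phoneme_frame_params, all_frame_params, seed=0):
        """Eqs. 4.17-4.18: per-parameter speech-correlation measure.
        phoneme_frame_params: dict phoneme -> (n_instances, n_params) array of
        parameters sampled at the middle frame of each instance.
        all_frame_params: (n_frames, n_params) parameters of every database frame."""
        rng = np.random.default_rng(seed)
        ratios = []
        for params in phoneme_frame_params.values():
            s_phone = params.var(axis=0)
            # equally many randomly chosen frames as there are instances
            idx = rng.choice(len(all_frame_params), size=len(params), replace=False)
            s_rand = all_frame_params[idx].var(axis=0)
            ratios.append((s_rand - s_phone) / s_rand)      # eq. 4.17
        return np.mean(ratios, axis=0)                      # eq. 4.18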
In a second approach for determining which of the AAM parameters are the most correlated with the speech information, some random original sentences from the LIPS2008 database are resynthesized using the AAM-based AVTTS system (the original speech corresponding to these sentences is excluded from selection). Then, for each sentence the synthesized parameter trajectories are synchronized with their corresponding original database trajectories. The synchronization is performed by time-scaling each synthesized phoneme sub-trajectory such that its length matches the duration of the corresponding original phoneme sub-trajectory. For each sentence n and for each parameter j, the distance D^{syn}_{nj} between the original trajectory T^{ori}_{nj} and the synchronized synthetic trajectory T^{syn}_{nj} is calculated. Note that the magnitude of the parameter values of the most important AAM parameters (i.e., the parameters that model most of the variance contained in the training set) is higher than the magnitude of the other parameter values. This means that for the most important parameters, the magnitude of the measured distances D^{syn}_{nj} will be higher compared to the value of D^{syn}_{nj} calculated for the other parameters. In order to properly compare the values of D^{syn}_{nj} for multiple values of j, the influence of the difference in magnitude between the model parameters must be cancelled. To this end, every original trajectory is scaled to unit variance and zero mean (denoted as \hat{T}^{ori}_{nj}). In addition, the mean and variance of the corresponding synthesized trajectory T^{syn}_{nj} are scaled using the mean and the variance of T^{ori}_{nj}, resulting in the trajectory \hat{T}^{syn}_{nj}. Then, the distance between the original and the synthesized trajectory is calculated as the Euclidean difference between the scaled trajectories:

D^{syn}_{nj} = \sqrt{\sum_{f=1}^{N_f} \left( \hat{T}^{ori}_{nj}(f) - \hat{T}^{syn}_{nj}(f) \right)^2}    (4.19)

with N_f the number of video frames in the synthesized sentence n. This way, the influence of the magnitude of the original trajectories is eliminated and a minimal value is measured when the trajectories T^{ori}_{nj} and T^{syn}_{nj} are similar in terms of mean, variation, and shape. To eliminate the influence of the global synthesis quality of a particular sentence, for each sentence the measured differences for all parameters are scaled between zero and one:

\hat{D}^{syn}_{nj} = \frac{D^{syn}_{nj}}{\max_j D^{syn}_{nj}}    (4.20)

Finally, a single value for each parameter is calculated as the mean value over all sentences:

D^{syn}_j = \frac{\sum_{n=1}^{N_s} \hat{D}^{syn}_{nj}}{N_s}    (4.21)

with N_s the number of synthesized sentences. The value D^{syn}_j will be larger for parameters that are not correlated with the speech, since the values D^{syn}_{nj} were calculated by comparing the parameter values of video frames corresponding to two different (synchronized) database instances of the same phoneme. Because these comparison pairs are determined using speech synthesis, each pair is selected by minimizing the selection costs, which implies that both phoneme instances are similar in terms of visual context, linguistic properties, etc. This means that it can be assumed that their visual representations are much alike and that a smaller difference will be measured between the synchronized original and synthesized parameter trajectories for those model parameters that are the most correlated with the speech information. The following sections will explain how the parameter classification by means of the measures D^{var}_j and D^{syn}_j can be applied to improve the visual speech synthesis.
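The second measure (equations 4.19-4.21) could be sketched as follows, assuming the synthesized trajectories have already been time-aligned to their originals; note the scaling convention described above, where the synthesized trajectory is normalized with the statistics of the corresponding original trajectory.

    import numpy as np

    def d_syn(original_trajs, synthesized_trajs):
        """original_trajs / synthesized_trajs: lists (one entry per resynthesized
        sentence) of (n_frames, n_params) arrays, with the synthesized
        trajectories already synchronized to the originals."""
        per_sentence = []
        for ori, syn in zip(original_trajs, synthesized_trajs):
            mean, std = ori.mean(axis=0), ori.std(axis=0)
            ori_hat = (ori - mean) / std            # unit variance, zero mean
            syn_hat = (syn - mean) / std            # scaled with the original statistics
            d = np.sqrt(((ori_hat - syn_hat) ** 2).sum(axis=0))   # eq. 4.19
            per_sentence.append(d / d.max())        # eq. 4.20
        return np.mean(per_sentence, axis=0)        # eq. 4.21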
4.4.2 Database normalization

The attainable quality of data-driven speech synthesis strongly depends on the properties and the quality of the speech database that is provided to the synthesizer. Furthermore, while recording an audiovisual speech database, it is nearly impossible to retain exactly the same recording conditions throughout the whole database. For instance, the LIPS2008 database contains some small changes of the head orientation of the original speaker, some variations in illumination, and some colour shifts. Although these variations are subtle, they can cause serious concatenation artefacts: since these features are not correlated with the speech information, they are not taken into account by the selection costs (not even by the cost that demands a phonemic match between the target and the candidate segment), which means that sudden "jumps" of these features at the concatenation points in the concatenated visual speech signal are unavoidable. A possible solution would be to include these features in the selection (join) costs. However, this would be disadvantageous for the attainable synthesis quality, since the segment selection should select the best original segments based on the appropriateness of their speech-related information only. A better approach is to directly reduce the amount of undesired variations in the speech database. This is feasible using the parameter classification described in section 4.4.1, since many non-speech related database variations can be removed by assigning their associated model parameters a constant value over the whole database. An appropriate normalization value is zero, since all-zero model parameters generate the mean AAM image (equation 4.3).

To determine which model parameters are the most appropriate to normalize, the measures D^{var}_j and D^{syn}_j are combined. First, for both measures the 30% shape/texture parameters least correlated with the speech are selected. Then, from these selected parameters, a final set is chosen as the parameters that were selected by both criteria, augmented with those parameters that were selected by only one measure and that represent less than 1% of the model variation. For the AAM trained on the LIPS2008 database, this resulted in the selection of 1 shape parameter and 35 texture parameters for normalization. An example of an original video frame, reconstructed from its original and from its normalized model parameter values, is given in figure 4.8.

Figure 4.8: Reconstruction of a database frame using its original model parameter values (left) and its normalized model parameter values (right).

When the shape and the texture parameters of all database frames have been normalized, their corresponding combined model parameters can be normalized as well. To this end, for each original frame a new set of combined parameter values is calculated from its normalized shape and its normalized texture parameter values through equation 4.6. These normalized combined parameters allow a more accurate calculation of the visual selection costs, such as the join cost values (see section 4.3.4.2) and the visual distance matrix that is applied to calculate the visual context target cost (see section 4.3.4.1).
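A rough sketch of this selection and normalization step, assuming the two measures and the per-parameter fraction of model variation are available as arrays; the thresholds follow the description above and the names are illustrative.

    import numpy as np

    def select_parameters_to_normalize(d_var, d_syn, variation_fraction, keep=0.30):
        """d_var: higher = more speech-correlated; d_syn: higher = less
        speech-correlated; variation_fraction: share of the model variation
        represented by each parameter.  Returns indices to set to zero."""
        n = len(d_var)
        k = int(np.ceil(keep * n))
        least_var = set(np.argsort(d_var)[:k])       # 30% lowest D_var
        least_syn = set(np.argsort(d_syn)[-k:])      # 30% highest D_syn
        both = least_var & least_syn
        only_one = least_var ^ least_syn
        small = {j for j in only_one if variation_fraction[j] < 0.01}
        return sorted(both | small)

    def normalize_database(params, normalize_idx):
        """Set the selected (non-speech related) parameters to zero everywhere."""
        params = params.copy()
        params[:, normalize_idx] = 0.0
        return params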
To investigate the effect of the database normalization technique on the perceived quality of the synthesized visual speech, a subjective perception experiment was conducted. Fifteen medium-length English sentences were synthesized by the AAM-based AVTTS system using both the original and the normalized version of the AAM-projected LIPS2008 database. Two versions of each synthesis sample were created: a "mute" sample containing only visual speech and an audiovisual version that contained both synthetic speech modes. The samples were shown pairwise to the participants, who were asked to report their opinion on the visual speech mode of the presented speech samples. They were instructed to pay attention to the smoothness and to the naturalness of the mouth movements/appearances and, for the audiovisual samples, to the coherence between the auditory and the visual speech. Each comparison pair contained two syntheses of the same sentence, based on the original and on the normalized version of the LIPS2008 database, respectively. The text transcript of each sample was given to the participants and the order of the samples in each comparison pair was randomized. A 5-point comparative MOS scale [-2,2] was used to express the preference of the test subjects for the first or for the second sample of each pair. Nine participants (7 male, 2 female) joined the experiment; one of them was aged above 50 and the others were aged between 21 and 30. Seven participants can be considered speech technology experts. The results obtained using the visual speech-only samples are given in table 4.1 and the results obtained using the audiovisual samples are given in table 4.2.

An analysis using Wilcoxon signed-rank tests pointed out that in both the visual speech-only experiment (Z = -7.40 ; p < 0.001) and the audiovisual experiment (Z = -8.27 ; p < 0.001) the samples created using the normalized version of the database were given significantly better ratings than the samples created using the original database. The results obtained in the visual speech-only experiment show that the proposed database normalization strategy indeed smooths the synthetic visual speech by removing some variations from the database. In addition, the audiovisual experiment shows that the removed variations were indeed not related to the speech information, since the smoothing does not adversely affect the audiovisual perception quality.

Table 4.1: Evaluation of the database normalization strategy using the visual speech-only samples.

    Normalized > Original     89
    Normalized < Original     12
    Normalized = Original     34
    Total                    135

Table 4.2: Evaluation of the database normalization strategy using the audiovisual samples.

    Normalized > Original     90
    Normalized < Original      5
    Normalized = Original     40
    Total                    135

4.4.3 Differential smoothing

Section 4.3.5 explained that the visual speech information from the selected database segments is successfully concatenated by a smooth joining of the corresponding parameter sub-trajectories. Each concatenation consists of an overlap of the sub-trajectories at the join position and an interpolation of the original sub-trajectories around the join position in order to smooth the transition (see equations 4.15 and 4.16). A major benefit of the AAM-based synthesis approach is that the strength of the concatenation smoothing (defined by parameter S in equation 4.16) can be diversified between the shape and the texture information. A strong smoothing (high value for S) is applied for joining the texture parameter sub-trajectories in order to avoid a jerky appearance of the virtual speaker in the concatenated visual speech signal. On the other hand, a weaker smoothing (low value for S) is applied for the concatenation of the shape parameter sub-trajectories in order to avoid visual under-articulation effects.

A further improvement to the visual speech synthesis quality is possible when the concatenation smoothing strength is also diversified among the various shape/texture parameters themselves. Section 4.4.1 elaborated on two criteria that express the correlation between the model parameters and the visual speech information. Obviously, a strong smoothing of parameters that are closely linked to speech gestures will easily result in an "over-smoothed" perception of the synthetic visual speech. On the other hand, the model parameters that are less related to speech gestures can safely be smoothed without affecting the visual articulation strength. In addition, as was mentioned earlier, the concatenated parameter trajectories of the less speech-related model parameters are more likely to contain steep "jumps" at the join positions, since these parameters model variations that are not (or less) taken into account during the segment selection stage. Therefore, both the shape and the texture parameters are split up into two groups according to their correlation with the speech information, after which for each visual concatenation a stronger concatenation smoothing is applied to join the sub-trajectories of parameters belonging to the least speech-correlated group.
The classification of the shape/texture parameters is performed using the measures D^{var}_j and D^{syn}_j (see section 4.4.1). First, for both measures the 30% shape/texture parameters most correlated with the speech are selected. Then, from these selected parameters a final "strongly speech-correlated group" is chosen as the parameters that were selected by both criteria, augmented with the parameters that were selected by only one measure and that represent more than 1% of the model variation. For the classification of the texture parameters, two extra criteria are added, based on the correlation between the parameter values and the amount of visible teeth and the amount of visible mouth-cavity in each video frame, respectively. For both these criteria, the 5 most correlated parameters are implicitly added to the strongly speech-correlated group.

The two previous techniques diversify the strength of the concatenation smoothing among the various model parameters. A final optimization is possible when the smoothing strength that is applied for each model parameter is adjusted for each concatenation individually. It has been explained earlier that the visual representation of some phonemes is more variable than that of others, since some phonemes are more affected by visual coarticulation from neighbouring phones. This means that in a visual speech signal, the typical visual representation of some phonemes will always be clearly noticeable, while some other phonemes are most of the time "invisible", since their corresponding speech gestures are strongly affected by coarticulation effects [Jackson and Singampalli, 2009]. This inspires a classification of each phoneme as either "normal", "protected" (always visible), or "invisible" (almost never visible). Examples for English are the /t/ phoneme, which can be labelled as "invisible", and the /f/ phoneme, which should be labelled as "protected" (see appendix B for a complete overview of the classification of the English phonemes used in the AVTTS system). The protected/normal/invisible classification of the phonemes of a language can be constructed based on prior articulatory knowledge. For instance, Jackson and Singampalli [Jackson and Singampalli, 2009] investigated the critical articulators for each particular English phoneme and the degree to which articulators are allowed to relax during speech production. In this work, however, it was opted to perform an ad-hoc classification for the particular speaker that was used to construct the synthesizer's speech database. Therefore, a hand-crafted protected/normal/invisible classification for the phonemes from the LIPS2008 database was constructed based on both prior articulatory knowledge and the variability of the visual representations of each phoneme i, expressed by \sum_j S_{ij} (see section 4.3.4.1). The classification is used to optimize the concatenation smoothing by applying a stronger smoothing strength in case the join takes place in an "invisible" phoneme (to avoid over-articulation) and by applying a weaker smoothing strength when the join takes place in a "protected" phoneme (to avoid under-articulation).
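As an illustration of how these differentiations could be combined at synthesis time, the following sketch selects a smoothing strength S per parameter and per join. The numeric values follow the settings summarized in table 4.3 below; the grouping labels are assumed to have been computed as described above, and the lookup can be combined with the concatenate_subtrajectories sketch given earlier.

    # smoothing strength S indexed by (parameter type, speech correlation,
    # phoneme type at the join); values as summarized in table 4.3
    SMOOTHING_STRENGTH = {
        ("shape", "high"):   {"protected": 1, "normal": 1, "invisible": 3},
        ("shape", "low"):    {"protected": 2, "normal": 3, "invisible": 5},
        ("texture", "high"): {"protected": 1, "normal": 3, "invisible": 5},
        ("texture", "low"):  {"protected": 3, "normal": 5, "invisible": 7},
    }

    def smoothing_strength(param_type, speech_correlation, phoneme_type):
        """param_type: 'shape' or 'texture'; speech_correlation: 'high' or 'low';
        phoneme_type: 'protected', 'normal' or 'invisible'."""
        return SMOOTHING_STRENGTH[(param_type, speech_correlation)][phoneme_type]

    # e.g. a weakly speech-correlated texture parameter joined inside an
    # "invisible" phoneme is smoothed with S = 7:
    # S = smoothing_strength("texture", "low", "invisible")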
From the previous it can be concluded that the applied visual concatenation smoothing strength is optimized by three distinct differentiations: the smoothing strength is adjusted based on the type of model parameter (shape/texture), it is adjusted based on the correlation of the model parameter with the speech, and it is fine-tuned based on the type of phoneme in which the concatenation takes place. Optimal values for parameter S (equation 4.16), which defines the concatenation smoothing strength, were empirically deduced to optimize the attainable visual speech synthesis quality and are summarized in table 4.3.

Table 4.3: Various visual concatenation smoothing strengths S (see equation 4.16).

    Type      Speech Correlation   Phoneme Type   S
    Shape     High                 Protected      1
    Shape     High                 Normal         1
    Shape     High                 Invisible      3
    Shape     Low                  Protected      2
    Shape     Low                  Normal         3
    Shape     Low                  Invisible      5
    Texture   High                 Protected      1
    Texture   High                 Normal         3
    Texture   High                 Invisible      5
    Texture   Low                  Protected      3
    Texture   Low                  Normal         5
    Texture   Low                  Invisible      7

4.4.4 Spectral smoothing

When the video frames of an audiovisual speech signal are represented by means of AAM parameters, the consecutive values of a single model parameter throughout a sentence (i.e., its trajectory) can be seen as a data signal, sampled at the video frame rate, that contains some part of the visual speech information that corresponds to the uttering of the sentence. By calculating the Fast Fourier Transform (FFT) of the parameter trajectory, the spectral content of this visual speech information can be analysed. To investigate the typical spectral content of the AAM parameter trajectories, a random subset of sentences from the LIPS2008 database was resynthesized using the AAM-based AVTTS system. This made it possible to compare the spectral content of the original trajectories with the spectral content of the corresponding resynthesized trajectories. This analysis showed that typically a resynthesized trajectory contains more energy at higher frequencies in comparison with the corresponding original trajectory. This high-frequency energy is likely to be caused by the visual concatenations that were necessary to construct the synthetic speech signals: around the join positions some changes in shape/texture information can occur that are more abrupt than the variations occurring in original visual speech signals.

Recall that in the synthetic visual speech each concatenation has been smoothed by the advanced smoothing approach that was discussed in section 4.4.3. Unfortunately, not all unnaturally fast variations can be removed from the synthetic speech by the proposed concatenation smoothing strategy, since global settings for the concatenation smoothing strength are applied. It can happen that two particular consecutive selected database segments are visually very distant and that their concatenation would require a stronger smoothing in comparison with the other concatenations in order to avoid visual "over-articulation" effects (which was actually one of the major complaints reported by the participants who evaluated the initial AVTTS approach). From this observation it can be concluded that the AAM-based AVTTS approach can be improved by an additional smoothing of the concatenated visual speech signal, in which only those parts of the signal that do need an extra smoothing are significantly modified. Such a smoothing technique was developed which suppresses unnaturally rapid variations in the concatenated visual speech by modifying its spectral information derived from its AAM-based representation.
The smoothing technique makes use of a well-designed low-pass filter that limits the amount of high-frequency energy in the spectrum of the synthetic visual speech such that its spectral envelope resembles the spectral envelope of original visual speech signals. The design of these low-pass filters is critical, since it has to be ensured that the cutoff frequency is adequately low to remove the unnaturally fast variations from the speech, while on the other hand as little useful speech information as possible should be removed from the signal. Optimal filter settings were found by assessing the perceived effects of such a low-pass filtering on original visual speech trajectories.

In a first step, for each AAM parameter multiple low-pass filters were designed. To this end, the spectral information of each model parameter was gathered from 100 random database sentences. For each parameter, the mean of the measured spectra was calculated to determine an estimate for its common spectral content as seen in original visual speech signals. Based on these mean spectra, for each parameter multiple cutoff frequencies were calculated, preserving 90, 80, 70, 60 and 50 percent of the original spectral energy, respectively. For each of these cutoff frequencies, a low-pass filter was designed using the Parks-McClellan optimal FIR filter design technique implemented in Matlab [Matlab, 2013]. This makes it possible to filter the shape/texture information contained in a visual speech signal by filtering each parameter trajectory individually using its corresponding filter coefficients.
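A minimal sketch of this filter-design step is given below. The thesis work used Matlab's Parks-McClellan design; the sketch uses SciPy's equivalent equiripple routine (signal.remez) instead, applies the filter with zero-phase filtering so the trajectory is not delayed (an illustrative choice, not necessarily the original implementation), and uses a synthetic placeholder trajectory and an assumed 25 fps frame rate.

    import numpy as np
    from scipy import signal

    def cutoff_for_energy(mean_spectrum, freqs, fraction=0.9):
        """Frequency below which the given fraction of the spectral energy lies."""
        energy = np.cumsum(mean_spectrum ** 2)
        return freqs[np.searchsorted(energy, fraction * energy[-1])]

    def design_lowpass(cutoff_hz, fs, numtaps=51, transition_hz=2.0):
        """Equiripple (Parks-McClellan) low-pass FIR filter via scipy.signal.remez."""
        bands = [0, cutoff_hz, cutoff_hz + transition_hz, fs / 2]
        return signal.remez(numtaps, bands, [1, 0], fs=fs)

    def smooth_trajectory(trajectory, taps):
        """Zero-phase filtering, so the smoothed trajectory is not delayed."""
        return signal.filtfilt(taps, [1.0], trajectory)

    # placeholder trajectory standing in for one AAM parameter of one sentence,
    # assuming a 25 fps video frame rate
    fs = 25.0
    rng = np.random.default_rng(0)
    frames = np.arange(300)
    traj = np.sin(2 * np.pi * 1.5 * frames / fs) + 0.05 * rng.normal(size=frames.size)
    spectrum = np.abs(np.fft.rfft(traj))
    freqs = np.fft.rfftfreq(traj.size, d=1.0 / fs)
    fc = cutoff_for_energy(spectrum, freqs, fraction=0.9)
    smoothed = smooth_trajectory(traj, design_lowpass(fc, fs))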
Based on the results of the subjective experiment, an optimal filter configuration for smoothing the concatenated synthetic visual speech can be determined: the least conservative configuration that was found not to significantly affect original speech signals is most suited to reduce the unnatural high-frequency information while modifying only a minimal amount of useful speech information.

Table 4.4: Test results obtained by the subjective evaluation of the low-pass filtered original parameter trajectories. The last column indicates for each filter configuration the percentage of the total evaluations that reported that the filtered sample was at least as good as the original sample.
Shape filter   Texture filter   % OK
none           60               93
90             60               90
80             60               64
70             none             86
70             70               79
70             60               36

Similar to the concatenation smoothing strength, the strength of the spectral smoothing is diversified among the various AAM parameters. To this end, less conservative filters are applied to the texture parameter trajectories than to the shape parameter trajectories. In addition, the shape/texture parameters are split into two groups based on their correlation with the speech information (the same groups as described in section 4.4.3 were used). A stronger spectral smoothing is applied to the trajectories of the least speech-correlated model parameters than to the trajectories of the parameters that are most correlated with the speech information. Table 4.5 summarizes the filter settings that are used in the AAM-based AVTTS strategy. An illustration of the spectral smoothing technique is given in figure 4.9.

Table 4.5: Optimal filter settings for modifying the synthetic visual speech.
Type      Speech Correlation   Filter
Shape     High                 90
Shape     Low                  70
Texture   High                 80
Texture   Low                  70

Figure 4.9: Spectral smoothing of a parameter trajectory (parameter value plotted against frame index). The black curve illustrates part of an original trajectory of a database sentence. The red curve represents the corresponding synthesized trajectory that was generated by the AAM-based AVTTS system; the original database text was used as input and the corresponding database speech was excluded from selection. The phoneme durations in the synthesized trajectory were synchronized with the phoneme durations in the original speech. The green curve shows the spectrally smoothed version of the synthesized trajectory. The blue rectangles indicate typical benefits of the spectral smoothing technique: local discontinuities are removed and "overshoots" are suppressed. The red square shows a more severe discrepancy between the original (black) and synthesized (red) curves. Unfortunately, such differences cannot be removed by a "safe" spectral smoothing strength.

4.5 Evaluation of the AAM-based AVTTS approach

This section describes two subjective perception experiments that evaluate the proposed AAM-based AVTTS system. The attainable synthesis quality of this synthesis strategy is compared with the synthesis quality of the initial AVTTS approach and with original speech samples. The first experiment involves visual speech-only samples and the second experiment involves audiovisual speech samples.
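Both experiments below analyse the collected MOS ratings with a Friedman test followed by pairwise Wilcoxon signed-rank tests. Purely as an illustration (this is not the analysis script used for the thesis), such an analysis could be run with SciPy as sketched here; the assumed data layout, one row per rated sentence/participant combination and one column per test group, is a hypothetical choice.

import numpy as np
from scipy.stats import friedmanchisquare, wilcoxon

def analyse_mos_ratings(ratings):
    """ratings: array of shape (n_evaluations, 3); columns hold the ORI, OLD and AAM scores."""
    ratings = np.asarray(ratings, dtype=float)
    ori, old, aam = ratings[:, 0], ratings[:, 1], ratings[:, 2]

    # Omnibus test: is there any difference among the three groups?
    chi2, p_omnibus = friedmanchisquare(ori, old, aam)
    print(f"Friedman: chi2(2) = {chi2:.1f}, p = {p_omnibus:.4f}")

    # Post-hoc pairwise comparisons on the paired ratings.
    for name, (a, b) in {"ORI vs OLD": (ori, old),
                         "ORI vs AAM": (ori, aam),
                         "AAM vs OLD": (aam, old)}.items():
        stat, p = wilcoxon(a, b)
        print(f"{name}: W = {stat:.1f}, p = {p:.4f}")
    return p_omnibus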
4.5.1 Visual speech-only

4.5.1.1 Test setup

A first experiment evaluated whether the AAM-based synthesis approach is able to improve the individual quality of the synthetic visual speech mode compared with the initial AVTTS synthesizer (which was described in chapter 3). Ten English sentences were randomly selected from the LIPS2008 database. The samples of a reference test group, referred to as group "ORI", were created by generating for each sentence a new mouth-signal by the inverse AAM-projection of the parameter trajectories contained in the database. Afterwards, this mouth-signal was merged with a background video signal displaying the other parts of the face of the virtual speaker (similar to what is done by the AVTTS system when synthesizing a novel sentence). A second category of samples, referred to as group "OLD", was created by resynthesizing each sentence using the initial AVTTS system. The original database text was used as input and the original speech data corresponding to each particular sentence was excluded from selection. A third group of samples, referred to as group "AAM", was created by resynthesizing each sentence using the AAM-based AVTTS system. Optimal settings for the various enhancements to the AAM-based synthesis (see section 4.4) were applied. For all three groups, the test samples used in this experiment were "mute" speech signals containing only a visual speech mode. To this end, for groups OLD and AAM the synthetic auditory speech mode that was synthesized simultaneously with the visual mode was removed from the speech sample.

The 30 samples (3 groups of 10 sentences each) were shown consecutively to the participants. The order of both the sentences and the test groups was randomized. The participants were asked to rate the naturalness of the displayed mouth movements on a 10-point MOS scale [0.5, 5], with rating 0.5 meaning that the visual speech appears very unnatural (low quality) and rating 5 meaning that the visual speech looks completely natural (excellent quality). The text corresponding to the speech in each sample was shown. An example audiovisual recording from the LIPS2008 database was given to illustrate the speaking style of the original speaker. The participants were given some points of interest to take into account while evaluating the samples, such as "Are the variations of the lips, the teeth and the tongue in correspondence with the given text?" and "Are the displayed speech gestures similar to the gestures seen in original visual speech and are they as smooth as you would expect them to be in an original speech signal?".

4.5.1.2 Participants and results

Nine people participated in the experiment (6 male, 3 female); two of them were aged [56-58] and the others [21-35]. Six of them can be considered speech technology experts. The results obtained are visualized in figure 4.10.

Figure 4.10: Boxplot summarizing the ratings obtained for each test group in the visual speech-only experiment.

A Friedman test indicated significant differences among the answers reported for each test group (χ2(2) = 103; p < 0.001). An analysis using Wilcoxon signed-rank tests indicated that the ratings obtained for the ORI group were significantly higher than the ratings obtained for the OLD group (Z = −7.52; p < 0.001). The ratings obtained for the AAM group were also significantly higher than the ratings obtained for the OLD group (Z = −7.60; p < 0.001).
No significant difference could be measured between the ratings for the ORI and the AAM group (Z = −0.740; p = 0.459).

4.5.1.3 Discussion

This experiment assessed for each category of samples the perceived individual quality of the visual speech mode. The results obtained unequivocally show that the AAM-based AVTTS synthesizer performs better in mimicking an original visual speech signal than the initial AVTTS synthesis approach. This is due to the fact that the AAM-based approach is able to generate a smoother signal without affecting the visual articulation strength. On the other hand, it appears that even the samples from the ORI group were not always given excellent ratings. This is probably due to the moderate loss of image detail caused by the inverse AAM projection. In addition, the merging of the mouth-signal with the background video might also have reduced the similarity with an original video recording. Note, however, that the participants considered these flaws less disturbing than the limited smoothness of the visual speech in the OLD group (as indicated by the lower ratings obtained for that group). Earlier in this thesis it was argued that it has to be ensured that no optimization of the AVTTS synthesis approach affects the audiovisual coherence of the audiovisual output signal. Therefore, another experiment needs to be conducted in which the perceived quality of the audiovisual speech signals is assessed instead.

4.5.2 Audiovisual speech

4.5.2.1 Test setup

A second perception experiment was conducted in which the same categories of test samples that were used in the visual speech-only experiment were evaluated: group ORI, group OLD, and group AAM (see section 4.5.1.1). In this test, the complete audiovisual signals were shown to the participants. To this end, the visual speech data from the previously used samples from the ORI group was displayed simultaneously with the corresponding original auditory speech from the LIPS2008 database. In addition, the visual speech modes of the previously used samples from the OLD group and the AAM group were reunited with their corresponding synthetic auditory speech signals (which had been generated synchronously with the synthetic visual speech by the audiovisual segment selection procedure). Three separate aspects of the audiovisual speech quality were evaluated in the experiment:

A - Naturalness of the mouth movements
This aspect is exactly the same as the property that was evaluated in the visual speech-only experiment. The same points of interest as mentioned in section 4.5.1.1 were given to the test subjects. They were instructed to rate the visual speech individually by ignoring the auditory speech mode of each sample. However, they were instructed not to turn off the volume, so that the presented auditory speech mode could still (unconsciously) influence their ratings.

B - Audiovisual coherence
The participants were asked to rate the coherence between the presented auditory and visual speech modes. They were told that the key question to answer was "Is it plausible that the woman who is displayed in the video could have actually produced the auditory speech that you hear?".

C - Quality and acceptability of the audiovisual speech
The last aspect was a high-level evaluation of all aspects concerning the presented audiovisual speech samples.
The participants were asked to what extent they liked the speech sample in general and how suitable they considered the sample for use in a real-world avatar application. It was explained that for this rating, they had to ask themselves the following question: "Is the multimodal speech (audio + video + combination audio-video) sufficiently understandable, clear and natural as you would expect it to be for a real application?". A 10-point MOS scale [0.5, 5] was used to rate each aspect, with rating 0.5 meaning that the visual speech appears very unnatural (aspect A), that the audiovisual coherence is very poor (aspect B), or that the sample is not suited for use in a real application due to a very low overall quality (aspect C). For each aspect, rating 5 means that the sample appears like an ideal high-quality original audiovisual speech signal.

4.5.2.2 Participants and results

Eight people participated in the experiment (5 male, 3 female), one of them aged 58 and the others aged [21-31]. Five of them can be considered speech technology experts. The results obtained are visualized in figure 4.11.

For aspect A, a Friedman test indicated significant differences among the answers reported for each test group (χ2(2) = 125; p < 0.001). An analysis using Wilcoxon signed-rank tests pointed out that the ratings obtained for the ORI group were significantly higher than the ratings obtained for the AAM group (Z = −5.89; p < 0.001) and the ratings obtained for the OLD group (Z = −7.66; p < 0.001). In addition, the ratings obtained for the AAM group were significantly higher than the ratings obtained for the OLD group (Z = −7.13; p < 0.001). A similar analysis was conducted for the results obtained for aspect B. A Friedman test indicated significant differences among the answers reported for each test group (χ2(2) = 103; p < 0.001). An analysis using Wilcoxon signed-rank tests pointed out that the ratings obtained for the ORI group were significantly higher than the ratings obtained for the AAM group (Z = −7.19; p < 0.001) and the ratings obtained for the OLD group (Z = −7.57; p < 0.001). In addition, the ratings obtained for the AAM group were significantly higher than the ratings obtained for the OLD group (Z = −2.43; p = 0.015). Finally, for aspect C, a Friedman test indicated significant differences among the answers reported for each test group (χ2(2) = 133; p < 0.001). An analysis using Wilcoxon signed-rank tests pointed out that the ratings obtained for the ORI group were significantly higher than the ratings obtained for the AAM group (Z = −7.46; p < 0.001) and the ratings obtained for the OLD group (Z = −7.84; p < 0.001). In addition, the ratings obtained for the AAM group were significantly higher than the ratings obtained for the OLD group (Z = −5.22; p < 0.001).

4.5.2.3 Discussion

Aspect A evaluated exactly the same property as the subjective test described in section 4.5.1. However, if the results obtained for these two experiments are compared (figures 4.10 and 4.11 (top)), an important difference is noticeable. Whereas for the visual speech-only samples the ORI group and the AAM group were rated equally high, in the audiovisual test the visual speech mode of the samples from the ORI group was rated higher than the visual speech mode of the samples from the AAM group.
This means that the participants, although instructed to take only the visual speech mode into account, were unconsciously influenced by the accompanying auditory speech mode while rating the samples. This explains the higher ratings for the ORI group in the audiovisual experiment, since these samples contained original auditory speech, while the auditory speech mode of the AAM group consisted of less optimal synthesized auditory speech signals. In the audiovisual test, the ORI group was given very high ratings (mean = 4.6, median = 5), which means that the inverse AAM projection of the database trajectories is capable of generating sufficiently accurate image data for use in an audiovisual speech signal. Both in the visual speech-only and in the audiovisual experiment, the AAM group was given higher ratings than the OLD group. This means that the AAM-based AVTTS approach indeed improves the individual quality of the synthetic visual speech without affecting the coherence between the two synthetic speech modes (since the observed enhancement holds in the audiovisual case as well).

Figure 4.11: Boxplots summarizing for each aspect the ratings obtained for the three categories of audiovisual speech samples. From top to bottom: aspect A (naturalness of the speech gestures), aspect B (audiovisual coherence), and aspect C (overall acceptability).

Aspect B evaluated the perceived level of audiovisual coherence. A similar evaluation of the speech signals created by the initial AVTTS system was already described in section 3.6.2. Similar to the results obtained in that experiment, the ratings obtained for aspect B show that the audiovisual coherence observed in synthesized speech is lower than the audiovisual coherence observed in original speech signals. This can be due to the local audiovisual incoherences that occur around the join positions, but it is also likely that the subjective perception of the level of audiovisual coherence in synthesized speech is affected by the overall lower degree of naturalness of this category of speech signals. On the other hand, the audiovisual coherence of the ORI group was rated very high (mean = 4.7, median = 5), which means that the inverse AAM projection that was needed to reconstruct the original visual speech mode does not affect the perceived coherence between the original speech modes. The audiovisual coherence of the samples from the AAM group was rated higher than the coherence of the samples from the OLD group, which means that the AAM-based optimizations to the synthesis strategy increase the perceived coherence between the two synthetic speech modes. This is perhaps an unexpected result, since the level of audiovisual coherence in the output of the initial AVTTS system is in fact slightly higher, as these samples contain original video recordings instead of AAM-reconstructed visual signals. This again indicates that the individual quality of the presented speech modes influences the perceived level of audiovisual coherence between these two signals. Aspect C evaluated the overall quality of the presented audiovisual speech and its applicability in a real application. As could be expected, the ORI group was given the highest ratings, since especially the auditory speech mode of these samples is much better than the synthesized auditory mode of the samples from the OLD and the AAM group.
Since the ORI group was given very high ratings (mean = 4.8, median = 5), the AAM-based representation of the original visual speech information appears to be appropriate for use in real applications. Recall that each sample from the ORI group was constructed by merging the regenerated original mouth-signal with a background video signal. The ratings obtained in this experiment indicate that this approach can safely be applied without affecting the acceptability of the final visual speech signal. In addition, since the ratings obtained for the AAM group were higher than the ratings obtained for the OLD group, it appears that the AAM-based optimizations to the AVTTS strategy improve the overall quality of the synthetic audiovisual speech by an appropriate unimodal enhancement of the synthetic visual speech mode. Unfortunately, the ratings obtained for the AAM group (mean = 3.1, median = 3) were still lower than the ratings obtained for the original speech samples from the ORI group, which means that further improvements to the AVTTS synthesis technique are needed to reach the quality level of original speech signals.

4.6 Summary and conclusions

Chapter 3 proposed a single-phase AVTTS synthesis approach that is promising for achieving high-quality audiovisual speech synthesis, since it is able to maximise the coherence between the synthetic auditory and the synthetic visual speech mode. This chapter elaborated on an optimization of the synthesis strategy that enhances the individual quality of the synthetic visual speech mode. The proposed optimization only minimally affects the coherence between both synthetic output speech modes. An Active Appearance Model is used to describe the original visual speech recordings, which makes it possible to parameterize the shape and the texture properties of the original visual speech information individually. The AAM-based AVTTS synthesizer constructs the synthetic output speech by concatenating original combinations of auditory speech and AAM parameter sub-trajectories. The AAM-based representation of the original visual speech makes it possible to define accurate selection costs that take visual coarticulation effects into account. Additional optimizations to the visual synthesis have been developed, such as the removal of undesired non-speech related variations from the original visual speech corpus by a normalisation of the AAM parameters. In addition, a diversified visual concatenation smoothing strength increases the continuity and the smoothness of the synthetic visual speech signals without affecting the visual articulation strength. Finally, a spectral smoothing technique removes over-articulations that can occur in the concatenated visual speech signal. The synthetic speech produced by the AAM-based AVTTS system was subjectively evaluated and compared with original AAM-reconstructed audiovisual speech signals and with the output of the initial AVTTS system that was proposed in chapter 3. The experiments showed that the AAM-based representation of the original visual speech is appropriate to regenerate accurate visual speech information. The results obtained also show that the AAM-based synthesis improves the quality of the synthetic visual speech mode compared to visual speech synthesized using the initial AVTTS approach.
Moreover, it has been shown that the audiovisual speech quality attainable by the AAM-based AVTTS system is higher than that attainable by the initial AVTTS approach. Unfortunately, the synthetic audiovisual speech is still clearly distinguishable from original audiovisual speech recordings. This is especially true for the auditory mode of the synthetic speech. Recall that the LIPS2008 database, which has been used to evaluate the AAM-based AVTTS approach, only contains about 23 minutes of original English speech recordings. This is far below the general rule of thumb that a database suited for auditory concatenative speech synthesis should contain between one and two hours of original continuous speech in order to be able to synthesize high-quality speech samples. Therefore, a more extensive speech database will be necessary in order to increase the overall synthesis quality of the AAM-based AVTTS system. Not only should this database contain substantially more original speech data than the LIPS2008 database, it should also exhibit higher-quality visual speech recordings that capture all the fine details of the original visual speech information. Some of the techniques, experiments and results mentioned in this chapter have been published in [Mattheyses et al., 2010a] and [Mattheyses et al., 2010b].

5 High-quality AVTTS synthesis for Dutch

5.1 Motivation

The observers who rated the synthetic audiovisual speech signals created by the AAM-based AVTTS synthesis strategy proposed in the previous chapter reported two major issues that distinguished the synthesized speech from original speech signals. First, the synthetic auditory speech mode appeared too jerky and it often exhibited non-optimal prosody. Furthermore, some details were missing in the presentation of the virtual speaker, such as a sharp representation of the teeth and the tongue. This implied that the synthesized video signal could only be presented compactly to the observers (i.e., displayed at a small screen size or in a small video window). These problems can to a large extent be resolved by providing the synthesizer with an improved audiovisual speech database that contains more and higher-quality audiovisual speech recordings than the LIPS2008 database. Unfortunately, only very few good-quality audiovisual speech databases are available for TTS research, partly because every database appropriate for the AVTTS system has to exhibit very specific properties (as explained in section 3.3.1): single speaker, fixed head orientation, fixed recording conditions, etc. It can also be noticed that the great majority of the research on auditory/visual/audiovisual speech synthesis reported in the literature involves synthesis for the English language. Apart from some commercial black-box multilingual auditory TTS systems, in recent research only the NeXTeNS project [Kerkhoff and Marsi, 2002] focuses on the Dutch language for academic TTS synthesis. This implies that speech databases suited for speech synthesis research for Dutch are also very scarce. For research in the field of Dutch audiovisual speech synthesis, no such database exists at all.
For these reasons, it was decided to build a completely new Dutch audiovisual speech database that is suited for performing high-quality audiovisual speech synthesis by means of the AAM-based AVTTS synthesizer that was proposed in the previous chapter.

5.2 Database construction

This section describes the various steps needed to construct the new audiovisual speech database, such as the preparation of the text corpus, the recording of the original audiovisual speech and the post-processing of the data to make speech synthesis possible. It was decided to design the database in two parts: one part that can be used for limited domain synthesis and another part that is suitable for the synthesis of sentences from the open domain. Limited domain speech synthesis means that the speech database that is provided to the synthesizer mainly contains sentences from one typical domain (e.g., football reports, expressions for a talking clock, etc.). This has the benefit that when the target sentence, given as input to the TTS synthesizer, also fits the limited domain of the speech database, many database segments matching each synthesis target can be found. This leads to the selection of longer original segments that exhibit highly appropriate prosodic features, which makes it possible to attain a high-quality synthesis result. By partly designing the new Dutch database as a limited domain database, it becomes possible to investigate the attainable synthesis quality in both the open domain and the limited domain.

5.2.1 Text selection

5.2.1.1 Domain-specific

The domain-specific sentences were taken from a corpus containing one year of Flemish weather forecasts, kindly provided by the Royal Meteorological Institute of Belgium (RMI). A subset of 450 sentences was uniformly sampled from this weather corpus. As the original corpus was chronologically organized, this uniform sampling resulted in a subset covering weather forecasts from each meteorological season. In addition, a collection of all important words involving the weather was gathered, such as "regen" (rain), "sneeuw" (snow), "onweer" (thunderstorm), etc. These words were added to the recording text using carrier sentences. Finally, some slot-and-filler type sentences were included in the recording text, such as "Op vrijdag ligt de minimum temperatuur tussen 9 en 10 graden" (On Friday, the minimum temperature will be between 9 and 10 degrees).

5.2.1.2 Open domain

The open domain sentences were selected from the Leipzig Corpora Collection [Biemann et al., 2007], which contains 100,000 Dutch sentences taken from various sources such as newspapers, magazines, cooking recipes, etc. To analyse this text corpus, two separate Dutch lexicons were used. The Kunlex lexicon is a Dutch lexicon that is provided with the NeXTeNS distribution [Kerkhoff and Marsi, 2002]. Unfortunately, this lexicon is optimized for Northern Dutch (the variant of Dutch spoken in The Netherlands) and not for Flemish (the variant of Dutch spoken in Belgium). For this reason, the text corpus was also analysed using the Fonilex lexicon [Mertens and Vercammen, 1998]. This lexicon, originally constructed for speech recognition purposes, is based on the Northern Dutch Celex lexicon [Baayen et al., 1995], but it contains Flemish pronunciations and it lists pronunciation variants whenever these exist.
The original Fonilex lexicon was converted into a format that is more suitable for use with the AVTTS system, which involved the following changes:

Phone set adaptation: The original phone set used in the Fonilex lexicon was adjusted by adding false and real diphthongs, as these sounds are sensitive to concatenation mismatches. On the other hand, glottal stops were removed from the phone set, as these are quite rare in standard Flemish pronunciation. Also, additional diacritic symbols (which indicate whether a phone is nasalized, long or voiceless) were not taken into account.

Adding part-of-speech information: The Fonilex lexicon does not list part-of-speech information for each word, which is needed in order to distinguish homographs (see section 3.2.1). This information was extracted from the original Celex lexicon and added to the adapted Fonilex lexicon.

Adding syllabification: As the original Fonilex lexicon does not list syllable boundaries, a simple rule-based syllabification algorithm was implemented to add syllable information to the lexicon. Unfortunately, the original syllabification information from the Celex lexicon could not be transferred to the adapted Fonilex lexicon, since for too many entries the Fonilex and the Celex phoneme transcripts differed too much.

Adding additional entries: About 400 words that occur in the final text corpus that was selected for recording were manually added to the lexicon. These words included numbers, abbreviations and missing compound words.

To select an appropriate subset of sentences from the original large text corpus, multiple greedy text selection algorithms were used [Van Santen and Buchsbaum, 1997]; a minimal sketch of such a greedy selection is given below. While selecting the text, all possible pronunciation variants of each lexicon entry were taken into account. It was ensured that no sentence was selected twice, and the selection of long sentences (e.g., longer than 25 words) was discouraged, since these sentences are harder to utter without speaking errors. In a first stage, two subsets were extracted from the corpus in order to attain full phoneme coverage, using the Kunlex and the adapted Fonilex lexicon, respectively. Similarly, two other subsets were extracted in order to attain complete silence-phoneme and phoneme-silence diphone coverage. The speech recordings corresponding to these four subsets, containing 75 sentences in total, will have to be added to each future sub-database to ensure a minimal coverage for speech synthesis in Dutch. To select additional sentences for recording, only the adapted Fonilex lexicon was used. The text corpus was split into two equal parts and two selection algorithms were applied to each part. A first algorithm selected a subset of sentences that maximally covers all diphones existing in original Dutch speech. Subsequently, a second selection algorithm was based on diphone frequency: sentences were selected such that the relative diphone frequency in the selected text subset is similar to the diphone frequency in the complete corpus. In a final stage, a manually selected subset of the text corpus was added to the recording text. This manual selection was based on the avoidance of given names, numbers, dates and other irregular words in the sentences. The final text subset contained about 1500 distinct Dutch diphones, which is the same as the number of distinct diphones that was found in the original text corpus. Note that in theory, more diphones exist in the Dutch language.
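As an illustration of the coverage-driven selection mentioned above, the following sketch greedily picks, at each step, the sentence that adds the most not-yet-covered diphones. It is a simplified stand-in for the algorithms of [Van Santen and Buchsbaum, 1997], and the assumed data layout (each candidate sentence mapped to the set of diphones it contains) is hypothetical.

def greedy_diphone_selection(candidates):
    """candidates: dict mapping sentence text -> set of diphone strings it contains."""
    covered, selected = set(), []
    remaining = dict(candidates)
    while remaining:
        # Pick the sentence that contributes the largest number of new diphones.
        best, best_gain = None, 0
        for text, diphones in remaining.items():
            gain = len(diphones - covered)
            if gain > best_gain:
                best, best_gain = text, gain
        if best is None:          # no sentence adds new coverage: stop selecting
            break
        selected.append(best)
        covered |= remaining.pop(best)
        # Constraints such as skipping sentences longer than 25 words would go here.
    return selected, covered

A frequency-balancing pass, in the spirit of the second selection algorithm, could replace the gain criterion by the reduction in distance between the subset's diphone distribution and that of the full corpus.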
By counting all diphones defined by the Fonilex lexicon (inter-word, intra-word, and word-silence/silence-word), about 2200 distinct diphones were found. However, not all of these diphones occur in original speech, since not all word-word combinations are possible. In fact, 1500 is a large number of diphones compared to the databases used in most other research. For instance, in [Smits et al., 2003] the perception of diphones in the Dutch language was investigated using only about 1100 distinct diphones.

5.2.1.3 Additional data

With a view to further research on speech synthesis, two extra subsets were added to the recording text. Six paragraphs taken from online news reports were added, which were shown to the speaker both as isolated sentences and as complete paragraphs. These speech recordings can be used to explore the difference between synthesizing isolated sentences and synthesizing whole paragraphs at once. In addition, a few sentences containing typical "fillers" were added to the recording text, such as "Ik kan ook ... X ... hoor" ("I can ... X ... too") with X representing a laugh, a cough, a gulp, a sigh, etc. These fillers can be used to increase the expressiveness of the synthetic speech.

5.2.2 Recordings

The database was recorded in a professional audiovisual studio located at the university campus [AV Lab, 2013]. The voice talent was a 23-year-old native Dutch-speaking woman, a semi-professional speaker who received a degree in discourse. The speech was recorded in newsreader style, which means that the speaker sits on a fixed chair in front of the camera. The text cues were given to the speaker in batches of five isolated sentences using a prompter. The acoustic speech signal was recorded using multiple microphones placed on a stand in front of the speaker (out of sight of the camera): two TRAM small-diaphragm omni-directional condenser microphones placed next to each other, a Neumann U87 large-diaphragm condenser microphone in cardioid mode, and an Audio-Technica AT897 hyper-cardioid microphone. The visual speech signal was recorded using a Sony PMW-EX3 camera at 59.94 progressive frames per second and a resolution of 1280x720 pixels. The camera was swivelled to portrait orientation. The focus, exposure and colour balance were manually calibrated and kept constant throughout the recordings. The talent was recorded in front of a blue screen, on which several markers had been attached around the speaker's head. In addition, some markers were placed on the neck of the speaker. The reverberation time (RT60) of the recording room was tuned to a low value of 150 ms at 1000 Hz. Some pictures illustrating the recording setup are given in figures 5.1 and 5.2. The recordings took two complete days, throughout which the recording conditions were kept as constant as possible. In total, more than 2 TB of speech data was captured. The audio was sampled at 48 kHz and stored as WAV files using 24 bits per sample. The video was stored both as raw uncompressed video data and as H.264-compressed AVI files. The database was manually segmented at sentence level, in the course of which some erroneous recordings were omitted. Finally, the total amount of available speech data consisted of 1199 audiovisual sentences (138 minutes) from the open domain and 536 audiovisual sentences (52 minutes) from the limited domain of weather forecasts. From this point on, the database will be referred to as the "AVKH" dataset.
Two example frames from the database are shown in figure 5.3.

Figure 5.1: Overview of the recording setup. The top figure shows the sound-proof recording room, the acoustic panels that adjust the room RT60, the fixed camera with the attached prompter, the lights and the reflector screen that illuminate the voice talent, and the sound-proof window at the back that gives sight to the control room. The bottom figure shows the positioning of the voice talent. Notice the uniform illumination of the face, the positioning of the microphones and the separately illuminated blue-key background with attached yellow markers.

Figure 5.2: Some details of the recording setup. The top figure shows the four microphones that were used to record the auditory speech. The middle figure shows the real-time visual feedback that allowed the voice talent to reposition herself. The bottom figure shows the control-room monitoring of the recorded audiovisual signals and the controlling of the prompter.

Figure 5.3: Example frames from the "AVKH" audiovisual speech database.

5.2.3 Post-processing

5.2.3.1 Acoustic signals

As section 5.2.2 explained, the auditory speech was recorded using multiple microphones. This has the benefit that, after the recording sessions, the most appropriate microphone signal can be selected as the final auditory database speech. This way, the database will exhibit the best possible acoustic signal quality, and in addition, there always exists a backup signal in case one of the recorded acoustic signals turns out to be disrupted (e.g., by a microphone failure, interference artefacts in the signal, etc.). In a first step, the acoustic signals of the two small TRAM microphones were summed in order to increase their signal quality. After a manual inspection of the recorded acoustic signals, it was decided to use the acoustic signal recorded by the Neumann U87 microphone to construct the database audio, as this signal exhibited the clearest and most natural voice reproduction.

The acoustic signals contained in the database were analysed to obtain the appropriate meta-data describing the auditory speech information that can be used by the AVTTS synthesizer (see section 3.3.3). In a first stage, each sentence was phonemically segmented and labelled using the open-source speech recognition toolkit SPRAAK [Demuynck et al., 2008]. To this end, a standard 5-state left-to-right context-independent phone model was used, without skip states. The acoustic signals were divided into 25 ms frames with a 5 ms frame shift, after which for each frame 12 MFCCs and their first and second order derivatives were extracted (the audio was downsampled to 16 kHz for this calculation). A baseline multi-speaker acoustic model was used to bootstrap the acoustic model training. After the phonemic labelling stage, the appropriate symbolic features were gathered for each phone-sized segment of the database (see table 3.1). Further analysis consisted of the training of a speaker-dependent phrase break model based on the silences occurring in the recorded auditory speech signals [Latacz et al., 2008]. Several acoustic parameters were calculated, such as the minimum (100 Hz) and maximum (300 Hz) f0 of the recorded auditory speech, MFCC coefficients, pitch mark locations, f0 contours, and energy information. Finally, the acoustic signals were described by means of STRAIGHT parameters [Kawahara et al., 1999].
The STRAIGHT features (spectrum, aperiodicity and f0) were calculated with a 5 ms frame shift. Apart from the parameters for minimum and maximum f0, the default STRAIGHT settings were used.

5.2.3.2 Video signals

Analysis of the mouth area

In order to be able to use the AVKH database in an AAM-based synthesis approach, such as proposed in chapter 4, the recorded visual speech information has to be parameterized using an AAM. To this end, a new AAM was built to represent the mouth area of the captured video frames. This approach is similar to the AAM that was built on the frames from the LIPS2008 database (see section 4.3.3); note, however, that the visual speech from the AVKH database is captured at a higher resolution and contains much more image detail thanks to the higher-quality recording set-up. This makes it possible to build an AAM that is capable of generating a synthetic visual speech signal containing a more detailed representation of the virtual speaker. Obviously, this is only feasible when the model is trained using an appropriate set of training images provided with very accurate ground-truth shape information.

The shape information consisted of 28 landmarks that indicate the outer lip contour and 12 landmarks that indicate the inner lip contour. It was ensured that the distribution of the landmarks on the outer lip contour was consistent over the training set. The location of the face is denoted by 6 landmarks indicating the position of the cheeks, the chin and the nasal septum, together with 3 landmark points located on coloured markers that were put on the neck of the speaker. Finally, in order to be able to accurately model and reconstruct the appearance of the teeth, 5 landmarks were used to indicate the location of the upper incisors. In addition, the landmarks denoting the inside of the upper lip were positioned with respect to the location of the upper incisors, and the landmarks denoting the inside of the lower lip were positioned with respect to the location of the lower incisors and the lower canines. Note that the teeth of the original speaker are not visible in every recorded video frame. For those training images displaying no (upper) teeth, the upper-incisor landmarks were positioned one pixel below the corresponding landmarks on the inside of the upper lip. Similar to the technique that is used to optimize the reconstruction of a closed mouth (see section 4.3.3), while generating an image from a set of model parameters it is ensured that the upper-incisor landmarks are always located at least one pixel below the corresponding landmarks on the inside of the upper lip. All texture information inside the convex hull denoted by the landmarks is modelled by the AAM. Figure 5.4 illustrates an original video frame and its associated landmark positions.

Figure 5.4: Landmark information for the AVKH database. The left panel illustrates an original video frame and its shape information that is modelled by the AAM. The right panel shows a detail of the frame illustrating the shape information associated with the mouth of the speaker.

To build the AAM, the iterative technique that was described in section 4.3.3 was used. The model was built to retain 98% of the shape information and 99% of the texture information from the training set, resulting in 13 shape parameters and 120 texture parameters that define the AAM.
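The "retain x% of the variation" choice is what fixes the number of model parameters. Purely as an illustration of that idea, and not of the actual AAM training procedure, the sketch below uses scikit-learn's PCA, which accepts the fraction of variance to retain directly; the assumed input layout (one row of stacked landmark coordinates or warped pixel values per training frame) is hypothetical.

import numpy as np
from sklearn.decomposition import PCA

def fit_retained_variance_model(training_vectors, retain=0.98):
    """training_vectors: array of shape (n_frames, n_features)."""
    pca = PCA(n_components=retain)   # keep just enough components for `retain` variance
    parameters = pca.fit_transform(training_vectors)
    kept = pca.explained_variance_ratio_.sum()
    print(f"{pca.n_components_} components retain {kept:.3f} of the variance")
    return pca, parameters

# Projecting a new frame vector and reconstructing an approximation of it:
#   p = model.transform(frame_vector[None, :])
#   approx = model.inverse_transform(p)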
In addition, a combined model was calculated (omitting 1% of the variation), which resulted in 96 combined model parameters. Using the trained AAM, the mouth area of all recorded video frames was described by a set of shape/texture parameter values and by a set of combined model parameter values. Then, the normalization technique that was described in section 4.4.2 was applied in order to remove some of the non-speech related and thus undesired variations from the database. This resulted in the normalization of 4 shape parameters and 31 texture parameters. The normalization of the AAM parameters particularly removed small variations in the orientation of the speaker's face. Some typical original frames and the images regenerated from their corresponding AAM parameter values are given in figure 5.5.

Figure 5.5: The figures on the left are details of some typical original video frames from the AVKH database. The figures on the right show for each frame its reconstruction by the inverse AAM-projection of its parameter values. Much effort went into the definition of a consistent landmarking strategy that allows a detailed reproduction of the inside of the mouth.

In addition to the parameterization of the visual speech using AAMs, some other features of the recorded video frames were calculated, each expressing a particular visual property. A first feature consisted of the geometrical dimensions of the mouth of the original speaker (width and height), derived from the shape information associated with each video frame. Next, for each frame a ratio for the amount of visible teeth and a ratio for the amount of visible mouth cavity were determined using the histogram-based detection that was described in section 3.3.3.4. Finally, a feature that expresses for each video frame the visible amount of upper teeth (which are visible much more frequently than the lower teeth) was calculated from the frame's associated shape information.

Analysis of the complete face

In section 3.5.1 it was explained that the output video signal, generated in correspondence with the target speech by the AVTTS system, only displays the animation of the mouth area of the virtual speaker. This signal needs to be merged with a background video signal that displays the other parts of the face of the virtual speaker. Recall that for the speech synthesis based on the LIPS2008 database, this background signal was constructed using original video sequences from the database. For the synthesis based on the AVKH database, a more advanced strategy was developed that makes it easy to generate multiple custom background signals. To this end, a second AAM was built to parameterize the complete face of the original speaker. When this "face" AAM is used to represent a subset of the database by means of face parameter trajectories, new background signals can be created by concatenating multiple original face sub-trajectories, followed by the inverse AAM projection of the concatenated face parameter trajectories. This way, any target background behaviour (e.g., displaying an eye blink at predefined time instants) can be generated, as long as the behaviour is found in the original database and is modelled by the face AAM. The shape information used to build the face AAM consisted of 8 landmarks that indicate the eyes, 6 landmarks that indicate the eyebrows, 3 landmarks that indicate the nose and 3 landmarks that indicate the chin. In addition, 11 landmarks indicate the edge of the face in order to avoid a blurred reconstructed edge when the head position changes. To ensure that the complete face, including some parts of the background, is modelled by the AAM, 13 additional landmarks were used to define the convex hull denoting the texture information that is modelled by the AAM.
These landmarks are positioned on coloured markers that were put on the background and on the neck of the speaker. Note that the face AAM does not model shape information corresponding to the mouth area of the image, since this area was already accurately modelled by the mouth AAM. The face AAM was built using a similar iterative procedure as was used to build the "mouth" AAM. The model was built to retain 99% of the shape information and 99% of the texture information from the training set, resulting in 24 shape parameters and 27 texture parameters that define the AAM. Figure 5.6 illustrates an original video frame from the database and its corresponding face shape information, as well as the reconstruction of the original video frame by the inverse AAM projection of its face parameter values.

Figure 5.6: AAM-based representation of the complete face, illustrating an original frame from the database (left), its associated shape information (middle), and the reconstructed image from its face model parameter values (right).

5.3 AVTTS synthesis for Dutch

In order to perform audiovisual text-to-speech synthesis for Dutch, the AVKH database and its associated meta-data are provided to the AAM-based AVTTS system that was proposed in chapter 4. The synthesizer must also be provided with a Dutch front-end that performs the necessary linguistic processing on the input text (see section 3.2). To this end, a novel Dutch front-end was constructed by combining particular modules from the NeXTeNS TTS system [Kerkhoff and Marsi, 2002] with new modules that were designed in the scope of the laboratory's auditory TTS research. The details of these linguistic modules are beyond the scope of this thesis and the interested reader is referred to [Mattheyses et al., 2011a], [Latacz et al., 2008], and [Latacz, TBP]. Based on the parameters predicted by the front-end, the Dutch AVTTS system creates a novel audiovisual speech signal by concatenating audiovisual speech segments, containing an original combination of acoustic and visual speech information, that were selected from the AVKH audiovisual speech database. The same techniques for segment selection and segment concatenation that were discussed in section 4.3 are applied. All synthesis parameters (e.g., the factors that scale each selection cost between zero and one) were recalculated to obtain optimal values for synthesis based on the AVKH database. In addition, the optimizations to the synthesis that were discussed in sections 4.4.3 and 4.4.4 are applied to enhance the quality of the synthetic visual speech mode. The synthetic visual speech signal, displaying the variations of the mouth area in accordance with the target speech, is merged with a background video displaying the other parts of the face of the virtual speaker. This background video is generated using the face AAM, as was described in section 5.2.3.2. The background video is designed to exhibit a neutral visual prosody, only very limited head movements, and some random eye blinks.
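The following is a minimal sketch of how such a background trajectory could be assembled from face-AAM sub-trajectories; the data structures and the blink-scheduling logic are hypothetical illustrations of the strategy described here, not the thesis implementation (which would additionally smooth the joins as in section 4.4.3).

import numpy as np

def build_background_trajectory(neutral, blink, n_frames, n_blinks=2, rng=None):
    """neutral: (m, n_params) face-AAM sub-trajectory with neutral visual prosody.
    blink:   (k, n_params) face-AAM sub-trajectory containing one eye blink.
    Returns an (n_frames, n_params) background trajectory for one utterance."""
    rng = rng if rng is not None else np.random.default_rng()
    # Tile the neutral behaviour until the target length is reached
    # (a real implementation would smooth these concatenation points).
    reps = int(np.ceil(n_frames / len(neutral)))
    background = np.tile(neutral, (reps, 1))[:n_frames].copy()
    # Insert blinks at random instants, making sure every blink has
    # finished before the end of the utterance is reached.
    latest_start = n_frames - len(blink)
    if latest_start > 0:
        for start in sorted(rng.integers(0, latest_start, size=n_blinks)):
            background[start:start + len(blink)] = blink
    return background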
The time span between consecutive eye blinks is chosen similar to that observed between the eye blinks in the original speech, and it is ensured that each eye blink has ended before the end of the speech signal is reached. After both video signals have been merged, a final synthesis step consists in applying a chroma-key mask to extract the virtual speaker from the recording background, as illustrated in figure 5.7.

Figure 5.7: The left panel shows the merging of the mouth-signal (coloured) with the background-signal (grey); both are regenerated from AAM parameter values. The middle panel shows the resulting merged signal, and the right panel shows the final output of the AVTTS system after applying a chroma-key mask.

5.4 Evaluation of the Dutch AVTTS system

There were two arguments that motivated the construction of the AVKH database. First, the combination of this database with the AAM-based AVTTS synthesizer makes up the most advanced and most extensive academic AVTTS system for Dutch known to date. This enables numerous future research projects for which auditory, visual, or audiovisual speech synthesis for Dutch is necessary. Second, it is interesting to evaluate to which extent the quality of the synthetic audiovisual speech generated by the AVTTS system increases when the synthesizer is provided with a larger and higher-quality database than the LIPS2008 dataset.

A systematic comparison between the quality of the "LIPS2008-based" AVTTS synthesis output and the quality of the "AVKH-based" AVTTS synthesis output is very hard to realize, since both signals are very different. However, it is immediately clear that when the synthesizer is provided with the AVKH database, the synthesized video signals display more detailed visual speech information: the synthetic visual speech mode can be presented at a higher resolution (i.e., a larger video screen size) and it contains more accurate representations of the mouth area of the virtual speaker (especially the appearances of the lips, the teeth and the tongue have improved). Furthermore, by informally comparing the overall quality of the LIPS2008-based and the AVKH-based synthetic audiovisual speech signals, it is easily noticed that the use of the AVKH database drastically improved the audiovisual synthesis quality. Obviously, the highest quality is attained when the target sentence is from the limited domain of weather forecasts. When the target sentence is not from within this limited domain, a more fluctuating output quality can be observed: in general the attained quality is more than acceptable, although the quality has been found to suddenly drop for particular target sentences (this is especially true for the synthetic auditory speech mode). This is a common problem in the field of auditory speech synthesis: even the smallest local error in the synthetic speech signal degrades the perceived quality of the whole signal [Theobald and Matthews, 2012]. This means that a human observer will almost always be able to discern between real and synthesized audio(visual) speech samples, since even high-end state-of-the-art auditory TTS systems are unable to completely avoid local imperfections such as a sporadic concatenation artefact or the selection of an original segment that does not exhibit the optimal prosodic features. Only when the presented samples are fairly short is an almost perfect mimicking of original auditory speech possible.
To enhance (i.e., stabilize) the quality of the synthetic auditory speech mode generated by the Dutch AVTTS system, a manual optimization of the AVKH database is needed to check and correct all meta-data, such as the phoneme boundaries, the selection of the correct pronunciation variant for each word, the pitch mark locations, etc. Unfortunately, this is a very time-consuming task with little scientific contribution, which is why it is seldom performed for non-commercial synthesis systems.

No extensive subjective perception test was performed to assess the overall quality of the synthetic audiovisual speech created by the Dutch AVTTS system, since such an evaluation requires an appropriate baseline signal. Original speech fragments are not very suitable for this purpose since, as explained earlier in this section, it can be predicted that the test subjects would often be able to discern the synthesized samples from the original samples due to local artefacts in the auditory speech mode. Another interesting evaluation would be to compare the attainable synthesis quality with other state-of-the-art photorealistic AVTTS synthesis systems. However, such a comparison is far from straightforward, since no standard baseline system has been defined yet. Also, a fairer comparison would be to compare the attainable synthesis quality of various AVTTS approaches using the same original speech data (e.g., as was done in the LIPS2008 challenge [Theobald et al., 2008]). Finally, a formal comparison between the AVKH-based synthetic audiovisual speech and the LIPS2008-based synthetic audiovisual speech is also not essential: due to the huge difference in quality between the two speech databases, the test subjects would certainly prefer the syntheses based on the AVKH database. In addition, such a test would require the synthesis of the same sentences using both databases, which is impossible since the databases are used for synthesis in English and in Dutch, respectively. Therefore, it was decided to use an alternative approach to evaluate the attainable synthesis quality using the AVKH database.

5.4.1 Turing Test

5.4.1.1 Introduction

From the previous chapters it is known that the AAM-based AVTTS synthesis strategy is able to generate an audiovisual speech signal that exhibits a maximal coherence between both synthetic speech modes. This means that an appropriate estimate of the overall quality of the audiovisual output signal can be obtained by separately evaluating the individual quality of the auditory and the visual speech mode. In the scope of this thesis, an evaluation of the synthetic visual speech is described. Additional evaluations of the synthetic auditory speech are described in the scope of the laboratory's auditory TTS research [Latacz, TBP]. In order to evaluate the individual quality of the synthetic visual speech mode, a Turing scenario was applied. In this test strategy, the participants are shown several audiovisual speech samples that contain either original or synthesized speech signals. The participants have to report for each presented sample whether they believe it is original or synthesized speech, which means that for each answer there exists a 50% chance of guessing right and a 50% chance of guessing wrong.
The closer the overall percentage of wrong answers gets to 50%, the stronger the evidence that the test subjects could not distinguish between the original and the synthesized speech signals.

5.4.1.2 Test set-up and test samples

Fifteen sentences from the open domain were randomly selected from the AVKH database transcript. For each of these sentences, two test samples were generated. The first sample consisted of original audiovisual speech signals. It was constructed by first directly copying the acoustic signal from the database. To obtain the visual speech mode, the parameter trajectories from the database were inverse AAM-projected to generate a new sequence of video frames. This sequence defined the mouth-signal, which was then merged with a background signal created by the face AAM. Afterwards, a chroma-key mask was applied to create the final visual speech mode of the "original" test samples. This approach ensures that the original test samples exhibited the same image quality and a similar speaker representation as the video signals synthesized by the AVTTS system. To create a synthesized version of each test sentence, the AAM-based AVTTS system was provided with the complete AVKH database. The original database transcript was used as text input and the original speech data corresponding to each particular sentence was excluded from selection. After synthesis, only the visual speech mode of each synthetic audiovisual signal was used. These synthetic visual speech signals were synchronized with the corresponding original speech signals by time-scaling the synthesized parameter trajectories such that the duration of each phoneme in the synthetic speech matches the duration of the corresponding phoneme in the original speech. The final "synthetic" samples were created by multiplexing the time-scaled synthetic visual speech signals with the corresponding original acoustic speech signals.

Thirty samples (2 groups of 15 samples each) were shown consecutively to the participants. The question asked was simple: "Do you think the presented speech is original speech or synthesized speech?". The test subjects were informed that for each sample the auditory speech mode contained an original speech recording. It was stressed that no assumptions could be made about the number of original/synthesized samples used in the experiment. Two additional "original" samples were generated and displayed to the participants before the start of the experiment. This way, the subjects could familiarize themselves with the particular speech signals used in the experiment. The subjects were told to process the samples in order, without replaying earlier samples or revising earlier answers. They were allowed to play each sample at most three times. The order of the sentences and the order of the sample types was randomized.

5.4.1.3 Participants and results

Twenty-seven people participated in the experiment (15 male and 12 female, aged [23-60]). Seven of them can be considered speech experts or are very familiar with the synthetic speech produced by the AVTTS system (e.g., through repeated participation in earlier perception experiments involving the AVTTS synthesizer). All participants were native Dutch speakers. Table 5.1 summarizes the results obtained.

Table 5.1: Turing test results.
5.4.1.4 Discussion

A first observation is that more "original" responses than "synthesized" responses were obtained. This is reflected in the higher percentage of correct answers for the original samples compared to that for the synthesized samples. In total, 62% of the reported answers were correct. The results obtained were analysed using a by-subject (test subjects) and a by-item (test sentences) analysis to evaluate the hypothesis that the answers reported were completely random (i.e., the subjects could only guess about the nature of the samples presented). Both the by-subject analysis (t-test; df = 26; t = −5.45; p < 0.001) and the by-item analysis (t-test; df = 29; t = −4.32; p < 0.001) indicated that the answers reported cannot be considered completely random. Nevertheless, the results summarized in table 5.1 indicate that the subjects found it really difficult to distinguish a synthesized sample from an original sample. No significant difference was found between the answers reported by the male participants and those reported by the female participants (t-test; df = 24.9; t = −0.076; p = 0.94). In addition, by comparing the results obtained for the participants aged above 40 (13 subjects) and the participants aged under 40 (14 subjects), no significant influence of age on the subjects' performance could be found (t-test; df = 24.7; t = 0.137; p = 0.892).

On the other hand, the ratio of correct answers obtained by the "AVTTS experts" was significantly higher than the ratio of correct answers obtained by the non-experts (t-test; df = 7.98; t = −3.42; p = 0.009). This is visualized in figure 5.8.

Figure 5.8: Ratio of incorrect answers obtained by the experts and by the non-experts in the Turing test.

This difference is easy to explain, since the experts were much more familiar with the concept of synthetic audiovisual speech. Since they already knew the strong and the weak points of the AVTTS system, they could focus on particular aspects that help to distinguish the synthetic samples from the original samples. In fact, the results obtained for the non-expert participants can be considered more important, since this group better represents a general user of a future application of the AVTTS system. The results obtained for the non-experts are separately summarized in table 5.2.

Table 5.2: Turing test results for the non-experts.

Type          Total   Correct   % Correct
original      300     190       63%
synthesized   300     156       52%
total         600     346       58%

Figure 5.9 visualizes the ratio of wrong answers, obtained from the non-experts, for both the original and the synthesized samples used in the experiment.

Figure 5.9: Ratio of incorrect answers for each type of sentence obtained in the Turing test. Only the answers obtained from the non-experts are displayed.

Both a by-subject analysis (t-test; df = 19; t = −4.39; p < 0.001) and a by-item analysis (t-test; df = 29; t = −2.32; p = 0.027) indicated that the answers reported cannot be considered completely random. Nevertheless, notice from table 5.2 that 52% of the synthesized samples were perceived as an original speech signal. Given the rather large number of evaluations performed in the experiment, it can be concluded that the synthetic visual speech mode generated by the AVTTS system almost perfectly mimics the variations seen in original visual speech recordings.
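As an aside, the by-subject analyses reported above amount to testing each participant's proportion of correct answers against the 50% chance level. A minimal sketch of such a test is shown below; the response matrix is a random placeholder, not the actual experimental data.

```python
import numpy as np
from scipy import stats

# One row per participant: 1 = correct answer, 0 = wrong answer
# (illustrative placeholder data: 27 subjects, 30 samples each).
rng = np.random.default_rng(0)
responses = rng.integers(0, 2, size=(27, 30))

# Proportion of correct answers per subject.
per_subject = responses.mean(axis=1)

# One-sample t-test of the by-subject proportions against the
# 50% chance level (df = number of subjects - 1).
t_stat, p_value = stats.ttest_1samp(per_subject, popmean=0.5)
print(f"by-subject: t = {t_stat:.2f}, p = {p_value:.3g}")

# A by-item analysis works the same way, but averages over the
# participants for each of the 30 presented samples instead.
per_item = responses.mean(axis=0)
t_item, p_item = stats.ttest_1samp(per_item, popmean=0.5)
print(f"by-item: t = {t_item:.2f}, p = {p_item:.3g}")
```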
A manual inspection of the test results showed that some particular test samples were much more easily detected as being non-original compared to the other samples. It was found that the majority of these "bad" samples contained noticeable synthesis artefacts, such as erroneous video frames resulting from a bad AAM reconstruction. If these kinds of problems can be prevented in the future (e.g., by optimizing the database meta-data), even better results in the Turing experiment should be attainable.

Note that this experiment evaluated the quality of synthetic audiovisual speech signals containing a non-original combination of auditory and visual speech information. As was concluded earlier in this thesis, this is likely to have a negative effect on the perceived quality of the presented speech samples. This means that the perceived overall quality of the synthetic visual speech signals used in the Turing experiment would probably increase when these signals are displayed together with their corresponding synthetic auditory speech mode. On the other hand, evaluating the synthetic audiovisual speech in a Turing scenario would make the synthetic samples easier to detect (compared to the test results described above), since any synthesis artefact in either the auditory or the visual mode would be an immediate clue. In addition, because humans tend to be more sensitive to imperfections in the auditory speech mode, it is very hard for synthetic auditory or synthetic audiovisual speech to pass a Turing test unless short speech samples (e.g., isolated words or syllables) are presented.

5.4.2 Comparison between single-phase and two-phase audiovisual speech synthesis

5.4.2.1 Motivation

Recall that one of the main goals of this thesis is a general evaluation of the single-phase AVTTS synthesis paradigm. Section 3.6 described a subjective experiment that compared the perceived quality of synthetic audiovisual speech generated by various AVTTS synthesis approaches. In that experiment, the quality of the synthetic visual speech mode was rated highest when the visual speech was presented together with the most coherent auditory speech mode. The experiment motivated the further development of the single-phase AVTTS approach, since it was found that the standard two-phase synthesis approach, in which separate synthesizers and separate speech databases are used to generate the two synthetic speech modes, is likely to affect the perceived quality of the synthetic speech when the separately synthesized speech modes are synchronized and shown audiovisually to an observer. Note, however, that no subjective evaluation of the audiovisual speech quality was made, since the speech database that was used in the experiment was insufficient to generate a high-quality auditory speech mode. In addition, probably due to the limited size of the speech database used, no significant difference was measured between the synthesis quality of the proposed single-phase synthesis strategy and the synthesis quality of a two-phase synthesis approach that uses the same audiovisual speech database for the separate generation of both speech modes.
At this point in the research, the attainable synthesis quality has improved significantly, due to several optimizations such as the use of a parameterization of the visual speech information and the construction of a new, extensive audiovisual speech database. This makes it possible to perform a new subjective comparison between the attainable audiovisual synthesis quality using the proposed concatenative single-phase synthesis strategy and a two-phase synthesis strategy using original speech data from the same speaker in both synthesis stages.

5.4.2.2 Method and samples

The goal of this experiment is to evaluate the attainable audiovisual synthesis quality using a single-phase and a comparable two-phase AVTTS synthesis approach. To generate the speech samples, the Dutch version of the AVTTS system was provided with speech data from the AVKH speech database. The speech samples representing the single-phase synthesis were generated using the audiovisual unit selection approach described in section 5.3. A two-phase speech synthesis strategy was derived from the single-phase synthesizer, using the same principles as described in section 3.6.1. The only difference is that for the current experiment, the two separately synthesized speech modes are multiplexed by time-scaling the synthetic visual speech signal instead of using WSOLA to time-scale the auditory speech signal. The time-scaling of the synthetic visual speech signal is achieved by time-stretching the synthesized AAM parameter trajectories in order to impose the phoneme durations from the synthetic auditory speech on the viseme durations in the synthetic visual speech. This constitutes a "safer" synchronization, since it was noticed that time-scaling the auditory speech mode generates noticeable distortions more easily than time-scaling the visual speech signal.

The two-phase synthesizer was employed to generate a series of representative speech samples, of which the auditory speech mode was synthesized using both the limited domain speech data and half of the open domain speech data of the AVKH database. The visual speech mode was generated using the other half of the open domain speech data of the AVKH database. This strategy has the benefit that the two-phase samples are guaranteed to consist entirely of non-original combinations of acoustic and visual speech information. Given the large size of the AVKH database, half of this database still contains sufficient data to perform high-quality speech synthesis. The limited domain speech data was used to generate the auditory speech mode because auditory speech synthesis is generally more dependent on the availability of appropriate original speech data than the concatenative synthesis of visual speech. Likewise, the representative samples for the single-phase synthesis were synthesized using the same original speech data as was used for generating the auditory speech mode of the samples representing the two-phase synthesis approach. Fifteen medium-length (mean word count = 13) sentences from the open domain and fifteen medium-length (mean word count = 15) sentences from the limited domain of weather forecasts were manually constructed. The sentences were semantically meaningful and it was ensured that the sentences (or large parts of the sentences) did not appear in the transcript of the AVKH database.
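As a rough illustration of the synchronization used for the two-phase samples, the sketch below linearly resamples the AAM parameter sub-trajectory of each synthesized segment so that it matches the duration of the corresponding phoneme in the synthetic auditory speech. The frame rate and the helper functions are assumptions for illustration, not the exact implementation used in the system.

```python
import numpy as np

FPS = 25.0  # assumed video frame rate

def time_scale_segment(params: np.ndarray, target_dur: float) -> np.ndarray:
    """Linearly resample one AAM parameter sub-trajectory
    (n_frames x n_params) so that it lasts target_dur seconds."""
    n_in = params.shape[0]
    n_out = max(1, int(round(target_dur * FPS)))
    t_in = np.linspace(0.0, 1.0, n_in)
    t_out = np.linspace(0.0, 1.0, n_out)
    # Interpolate every AAM parameter dimension independently.
    return np.stack(
        [np.interp(t_out, t_in, params[:, k]) for k in range(params.shape[1])],
        axis=1,
    )

def synchronize(visual_segments, phoneme_durations):
    """Impose the auditory phoneme durations on the visual segments
    and concatenate the time-scaled sub-trajectories."""
    scaled = [time_scale_segment(seg, dur)
              for seg, dur in zip(visual_segments, phoneme_durations)]
    return np.vstack(scaled)
```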
For each sentence, a single-phase (MUL) and a two-phase (SAV) speech sample were generated (the labels "MUL" and "SAV" are analogous to the labels used in section 3.6). In contrast to the experiment described in section 3.6.3, the participants were asked to rate the overall quality of the presented audiovisual speech signals. They had to decide for themselves which aspect of the audiovisual speech (e.g., the individual quality of the acoustic/visual speech, the audiovisual presentation of both speech modes, etc.) they found the most important. The key question to answer was: "How much does the sample resemble original audiovisual speech?". And, similarly: "How much do you like the audiovisual speech for usage in a real-world application (e.g., reading the text messages on your cell phone)?". The samples were shown pairwise to the participants, who had to use a 5-point comparative MOS scale [-2,2] to express their preference for one of the two samples. They were instructed to answer "0" when they had no clear preference. The sequence of the sample types in each comparison pair was randomized.

5.4.2.3 Subjects and results

Seven people participated in the experiment (5 male, 2 female, aged [24-61]), three of whom can be considered speech technology experts. The results obtained are visualized in figure 5.10.

Figure 5.10: Comparison between single-phase and two-phase audiovisual synthesis. The histogram shows the participants' preference for the SAV or the MUL sample on a 5-point scale [-2,2].

The results clearly show that the test subjects preferred the samples generated by the single-phase synthesis approach. A Wilcoxon signed-rank analysis indicated a significant difference between the ratings obtained for the MUL samples and the ratings obtained for the SAV samples (Z = −6.59; p < 0.001). The difference between the ratings for the MUL samples and the ratings for the SAV samples is stronger for the limited domain samples than for the sentences from the open domain (Mann-Whitney U test; Z = −2.21; p = 0.027). Nevertheless, for both types of sentences the MUL group was rated significantly better than the SAV group (Wilcoxon signed-rank tests; p ≤ 0.001).

5.4.2.4 Discussion

The results obtained unequivocally indicate that the single-phase synthesis strategy is the most preferable approach to perform audiovisual speech synthesis, since this strategy even outperforms a two-phase synthesis approach that uses comparable original speech data in both synthesis stages. This result is in line with the results of the experiment described in section 3.6, where it was concluded that a two-phase synthesis approach is likely to affect the perceived individual quality of a synthetic speech mode when presented audiovisually to an observer. Note that the individual quality of the synthetic speech modes of the SAV samples should be at least as high as the individual quality of the speech modes of the MUL samples, since for the synthesis of the auditory speech mode of the SAV samples only auditory features were considered when calculating the selection costs. Similarly, the synthesis of the visual speech mode of the SAV samples was based solely on visual features.
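A minimal sketch of how the paired comparative ratings reported above can be analysed with the tests mentioned in the previous subsection, assuming one preference score in [-2, 2] per sample pair (positive values indicating a preference for MUL); the data shown are placeholders, not the experimental ratings.

```python
import numpy as np
from scipy.stats import wilcoxon, mannwhitneyu

# One comparative MOS score in [-2, 2] per presented pair;
# positive = preference for the MUL (single-phase) sample.
ratings = np.array([1, 2, 0, 1, 2, 1, -1, 2, 1, 0, 2, 1])       # placeholder
domain = np.array(["ld", "ld", "od", "od", "ld", "od",
                   "ld", "ld", "od", "od", "ld", "od"])          # placeholder

# Do the ratings differ significantly from "no preference" (0)?
stat, p = wilcoxon(ratings)
print(f"Wilcoxon signed-rank: W = {stat}, p = {p:.3g}")

# Is the preference stronger for limited-domain (ld) sentences
# than for open-domain (od) sentences?
stat_u, p_u = mannwhitneyu(ratings[domain == "ld"],
                           ratings[domain == "od"],
                           alternative="two-sided")
print(f"Mann-Whitney U: U = {stat_u}, p = {p_u:.3g}")
```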
The increased difference between the ratings for both sample types, noticed for the limited domain sentences, could be explained by the fact that for these sentences the individual quality of the visual speech mode of the MUL samples is possibly a bit higher than that of the visual speech mode of the SAV samples: for the MUL samples both synthetic speech modes can be constructed by concatenating large segments from the limited domain section of the AVKH database. On the other hand, this difference does not exist for the synthetic auditory mode: it is clearly noticeable that for both the MUL and the SAV samples, the overall audiovisual quality of the limited domain sentences is higher than the quality of the other sentences. Since the limited domain sentences are closer to original speech, it may have been the case that the participants were more sensitive to subtle incoherences between both presented speech modes. This could also explain the larger difference between the ratings for both sample types that was observed for the limited domain sentences.

5.5 Summary and conclusions

This chapter elaborated on the extension of the AAM-based AVTTS system, proposed in chapter 4, towards synthesis in the Dutch language. For this purpose, a new, extensive Dutch audiovisual speech database was recorded. The recorded speech signals were processed in order to enable audiovisual speech synthesis using the Dutch version of the AVTTS system. A detailed AAM was built that is able to accurately reconstruct the mouth area of the original video frames. In addition, appropriate background signals can be generated using an additional face AAM. The quality of the synthetic audiovisual speech generated from the new database is significantly higher than the attainable synthesis quality using the LIPS2008 database. The individual quality of the synthetic visual speech mode was assessed in a Turing experiment. It appeared that especially observers who are not familiar with speech synthesis are almost unable to distinguish between original and synthesized visual speech signals. The enhanced synthesis quality made it possible to perform a new comparison between single-phase and two-phase concatenative audiovisual speech synthesis approaches. The experiment showed that observers prefer synthetic audiovisual speech signals generated by a single-phase approach over audiovisual speech samples generated by a two-phase synthesis strategy. This can be seen as a final proof of the importance of audiovisual speech synthesis approaches that, apart from pursuing the highest possible individual acoustic and visual speech quality, also aim to maximize the level of audiovisual coherence between the two synthetic speech modes. Some of the techniques, experiments and results mentioned in this chapter have been published in [Mattheyses et al., 2011a].

6 Context-dependent visemes

6.1 Introduction

6.1.1 Motivation

Up until this point, this thesis discussed the problem of audiovisual text-to-speech synthesis, in which a given text is translated into a novel audiovisual speech signal. Section 1.4.4 elaborated on various applications for this kind of synthesizer. On the other hand, there are also many scenarios in which the synthesizer is only needed to generate a synthetic visual speech signal, which is later on multiplexed with an already existing auditory speech signal.
When the text transcript corresponding to this auditory speech signal is used as input for the synthesizer, the system can be referred to as a visual text-to-speech (VTTS) synthesizer. Recall from chapter 2 that such a VTTS system is used to perform the second synthesis stage in a two-phase AVTTS approach: the VTTS synthesizer creates the synthetic visual speech mode that is later on multiplexed with the already obtained synthetic auditory speech signal. On the other hand, many applications exist for which a visual speech signal must be generated to accompany an original auditory speech signal instead. For instance, in video telephony or in remote teaching, the visual speech can be locally rendered to accompany the original transmitted acoustic speech. This makes it possible to attract the attention of the audience by displaying a virtual speaker, while only (low-bandwidth) acoustic signals need to be transmitted over the network. Another example for which the accurate synthesis of visual speech signals is crucial is computer-assisted pronunciation training. In this scenario, a speech therapy patient will independently use the VTTS system to generate examples of visual articulations.

From the previous chapters it is known that the generation of a synthetic auditory speech signal that perfectly mimics original speech is very hard to realize. Therefore, many professional applications still opt to use original acoustic speech signals instead. On the other hand, an optimal communication should consist of audiovisual speech and not acoustic-only speech signals. Unfortunately, audiovisual speech recordings are much harder to realize and much more expensive than acoustic-only speech recordings. Since it was found in the previous chapters that observers are slightly less sensitive to small imperfections in a synthetic visual speech signal than to imperfections in a synthetic auditory speech signal, VTTS synthesis constitutes a useful solution for enhancing the quality of the communication: it synthesizes a visual speech mode that can be displayed together with the original acoustic speech signal.

Section 5.4.1 described a Turing experiment that was conducted to assess the individual quality of the synthetic visual speech generated by the AAM-based AVTTS system. To generate the test samples, the AVTTS system was converted into a VTTS system: the visual speech mode of the synthetic audiovisual speech was synchronized and multiplexed with the corresponding original auditory speech signal from the database. From the results obtained in the Turing experiment it appears that this unit selection-based VTTS approach is able to synthesize speech signals that are very similar to original speech signals. Therefore, it is interesting to further investigate how well the techniques that were developed in the scope of the AVTTS system can be applied to perform high-quality VTTS synthesis instead. An important observation is that, whereas for AVTTS synthesis phonemes have to be used to describe the speech information, for VTTS synthesis visemes can also be used to describe both the target speech and the speech signals contained in the database. It is interesting to investigate the particular behaviour and the attainable synthesis quality of both speech labelling approaches. To this end, the mapping from phoneme labels to viseme labels that maximizes the attainable VTTS synthesis quality will be explored.
Given the results obtained earlier in this thesis, the most preferable speech labelling technique will have to allow the synthesis of maximally smooth and natural visual articulations as well as the generation of visual speech information that is highly coherent with the original acoustic speech signal that will eventually be played together with the synthetic visual speech.

6.1.2 Concatenative VTTS synthesis

The unit selection-based VTTS synthesizer that is employed to investigate the use of viseme speech labels for visual speech synthesis is very similar to the AAM-based AVTTS synthesizer that was discussed in chapter 4 and chapter 5. The main difference is the alternative set of selection costs that is applied by the VTTS synthesizer. The hidden target cost that enforces the selection of database segments phonemically matching the target speech is altered so that it also allows the selection of segments of which each phoneme is from the same viseme class as the corresponding target phoneme. In addition, the binary cost that rewards the matching of the phonemic context between the candidate and the target is omitted. The overall weight of the other binary linguistic costs (see section 3.4.2.2) is lowered with respect to the weight of the visemic context cost that was described in section 4.3.4.1. The difference matrix that is needed to calculate this cost (equation 4.10) is constructed for each particular set of speech labels (phoneme labels and various viseme labels) used in the experiments. An additional target cost is included that promotes the selection of database segments that require only a minor time-scaling to attain synchronization with the given auditory speech signal (see section 3.6.1). The total join cost is calculated using the same visual join costs that are used in the AVTTS system (see section 4.3.4.2). For obvious reasons, all auditory join costs are omitted. Once the optimal sequence of database segments has been selected, the database sub-trajectories are concatenated using the same concatenation approach that is used in the AVTTS system (see section 4.3.5). All optimizations to the quality of the synthetic visual speech that were discussed in section 4.4 are applied for the VTTS synthesis as well. Finally, the concatenated trajectories are synchronized with the auditory speech signal by time-scaling the sub-trajectories of each synthesized visual speech segment to match the duration of the corresponding auditory speech segment. The final output speech is created by generating the appropriate mouth-signal by the inverse AAM-projection of the synthesized parameter trajectories, merging this signal with the background video signal displaying the other parts of the face of the virtual speaker, and multiplexing this final visual speech signal with the given auditory speech signal.
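To make the modified selection step more tangible, the sketch below assembles a viseme-based target cost in the spirit described above. The weights, the dictionary-based difference-matrix lookup and the duration term are illustrative assumptions, not the exact costs implemented in the system.

```python
import numpy as np

def target_cost(candidate, target, viseme_of, diff_matrix,
                w_ctx=1.0, w_dur=0.5):
    """Schematic target cost for viseme-based unit selection.

    candidate/target: dicts with 'label', 'prev', 'next' and 'duration'.
    viseme_of: speech-label -> viseme-class mapping used for labelling.
    diff_matrix: pairwise visual dissimilarity between speech labels.
    """
    # Hidden target cost: the candidate must belong to the same viseme
    # class as the target, otherwise it is not a valid candidate.
    if viseme_of[candidate["label"]] != viseme_of[target["label"]]:
        return np.inf

    # Visemic context cost: visual dissimilarity between the labels of
    # the neighbouring segments of the candidate and of the target.
    ctx = (diff_matrix[candidate["prev"]][target["prev"]]
           + diff_matrix[candidate["next"]][target["next"]])

    # Duration cost: prefer segments that need only minor time-scaling
    # to be synchronized with the given auditory speech.
    dur = abs(np.log(candidate["duration"] / target["duration"]))

    return w_ctx * ctx + w_dur * dur
```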
6.2 Visemes

6.2.1 The concept of visemes

Among all sorts of auditory speech processing applications, the concept of a phoneme as a basic speech unit is well established. Auditory speech signals are segmented into a sequence of phonemes for both speech analysis and speech synthesis goals. The properties of such a phoneme set are language-dependent, but its definition is nowadays well standardized for many languages. Similarly, the processing of visual speech signals needs the definition of an atomic unit of visual speech. Fisher introduced the term viseme to identify the visual counterpart of several consonant phonemes [Fisher, 1968]. Visemes can be considered as the particular facial and oral positions that a speaker displays when uttering phonemes. This implies that a unique viseme is defined by the typical articulatory gestures (mouth opening, lip protrusion, jaw movement, etc.) that are needed to produce a particular phoneme [Saenko, 2004]. On the other hand, an alternative definition that is widely used in the literature is to define a viseme as a group of phonemes that exhibit a similar visual representation. This second definition entails that there is not a one-to-one mapping between phonemes and visemes. This can be understood from the fact that not all articulators that are needed to utter phonemes are visible to an observer. For instance, the English /k/ and /g/ phonemes are created by raising the back of the tongue to touch the roof of the mouth, which is a gesture that cannot be noticed visually. In addition, some phoneme pairs differ only in terms of voicing (e.g., English /v/ and /f/) or in terms of nasality. These two properties cannot be distinguished in the visual domain, which means that such phoneme pairs will have the same appearance in the visual speech mode. As a consequence, the mapping from phonemes to visemes should behave like a many-to-one (Nx1) relationship, where visibly similar phonemes are mapped to the same viseme.

The construction of such an Nx1 mapping scheme has been the subject of much research. Two different approaches can be distinguished. The first approach is based on the phonetic properties of the different phonemes of a language. Based on articulatory rules (e.g., place of articulation, position of the lips, etc.) and expert knowledge, a prediction of the visual appearance of a phoneme can be made [Jeffers and Barley, 1971] [Aschenberner and Weiss, 2005]. This way, visemes can be defined by grouping those phonemes for which the visually important articulation properties match. Alternatively, in a second approach the set of visemes for a particular language is determined by various data-driven methods. These strategies involve the recording of real (audio)visual speech data, after which the captured visual speech mode is further analysed. To this end, the most common analysis approach is to conduct some kind of subjective perception test. Such an experiment involves participants who try to match fragments of the recorded visual speech to their audible counterparts [Binnie et al., 1974] [Montgomery and Jackson, 1983] [Owens and Blazek, 1985] [Eberhardt et al., 1990]. The nature (consonant/vowel, position in word, etc.) and the size (phoneme, diphone, triphone, etc.) of the speech fragments used varies among the studies. For instance, in the pioneering study of Fisher [Fisher, 1968], the participants were asked to lip-read the initial and final consonants of an utterance using a forced-error approach (the set of possible responses did not contain the correct answer). The responses of these kinds of perception experiments are often used to generate a confusion matrix, denoting which phonemes are visibly confused with which other phonemes. From this confusion matrix, groups of visibly similar phonemes (i.e., visemes) can be determined.

The benefit of the human-involved data-driven approaches to determine the phoneme-to-viseme mapping scheme is the fact that they measure exactly what needs to be modeled: the perceived visual similarity among phonemes.
On the other hand, conducting these perception experiments is time-consuming and the results obtained depend on the lip reading capabilities of the test subjects. In order to process more speech data in a less time-consuming way, the analysis of the recorded speech samples can be approached from a mathematical point of view [Turkmani, 2007]. In a study by Rogozan [Rogozan, 1999], basic geometrical properties of the speaker's mouth (height, width and opening) during the uttering of some test sentences were determined. The clustering of these properties led to a grouping of the phonemes into 13 visemes. Similarly, in studies by Hazen et al. [Hazen et al., 2004] and Melenchon et al. [Melenchon et al., 2007] a clustering was applied on the PCA coefficients that were calculated on the video frames of the recorded visual speech. The downside of these mathematical analyses lies in the fact that they assume that the mathematical difference between two visual speech segments corresponds to the actual difference that a human observer would notice when comparing the same two segments. Unfortunately, this correlation between objective and subjective distances is far from straightforward.

Several Nx1 phoneme-to-viseme mapping schemes have been defined in the literature. Most mappings agree on the clustering of some phonemes (e.g., the English /p/, /b/ and /m/ phonemes), although many differences between the mappings exist. In addition, the number of visemes defined by these Nx1 mappings varies among the different schemes. A standardized viseme mapping table for English has been defined in MPEG-4 [Pandzic and Forchheimer, 2003], consisting of 14 different visemes augmented with a "silence" viseme (see also section 2.2.5.3). Although this viseme set has been applied for visual speech recognition (e.g., [Yu et al., 2010]) as well as for visual speech synthesis applications (see section 2.2.5.3 for examples), many other viseme sets are still used. For instance, various Nx1 phoneme-to-viseme mappings were applied for visual speech synthesis in [Ezzat and Poggio, 2000] [Verma et al., 2003] [Ypsilos et al., 2004] [Bozkurt et al., 2007], while other Nx1 viseme labelling approaches have been applied for (audio-)visual speech recognition purposes [Visser et al., 1999] [Potamianos et al., 2004] [Cappelletta and Harte, 2012].

It is not straightforward to determine the exact number of visemes that is needed to accurately describe the visual speech information. In a study by Auer [Auer and Bernstein, 1997] the concept of a phonemic equivalence class (PEC) was introduced. Such a PEC can be seen as an equivalent of a viseme, since it is used to group phonemes that are visibly similar. In that study, words from the English lexicon were transcribed using these PECs in order to assess their distinctiveness. The number of PECs was varied between 1, 2, 10, 12, 19 and 28. It was concluded that the use of at least 12 PECs resulted in a sufficiently unique transcription of the words. Note, however, that when optimizing the number of visemes used in a phoneme-to-viseme mapping, the target application (e.g., human speech recognition, machine-based speech recognition, speech synthesis, etc.) should also be taken into account. In addition, it has been shown that the best phoneme-to-viseme mapping (and, as a consequence, the number of visemes) should be constructed speaker-dependently [Lesner and Kricos, 1981].
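To make the notion of an Nx1 mapping concrete, the toy sketch below relabels a phoneme string with viseme classes. The groupings shown are a small illustrative subset only; they are not the MPEG-4 set nor any of the published mappings.

```python
# Illustrative (partial) Nx1 phoneme-to-viseme grouping; a real mapping
# table covers every phoneme of the target language.
NX1_MAPPING = {
    "p": "V_bilabial", "b": "V_bilabial", "m": "V_bilabial",
    "f": "V_labiodental", "v": "V_labiodental",
    "k": "V_velar", "g": "V_velar",
    "sil": "V_silence",
}

def phonemes_to_visemes(phonemes):
    """Relabel a phoneme sequence with Nx1 viseme classes."""
    return [NX1_MAPPING[p] for p in phonemes]

print(phonemes_to_visemes(["sil", "b", "m", "f", "k", "sil"]))
# ['V_silence', 'V_bilabial', 'V_bilabial', 'V_labiodental', 'V_velar', 'V_silence']
```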
A constant among almost all phoneme-to-viseme mapping schemes found in the literature is that the nature of the mapping is many-to-one. At first glance, this seems reasonable since it fits with the definition of a viseme as a group of phonemes that appear visibly the same. On the other hand, a visual speech signal often exhibits strong coarticulation effects (see section 2.2.6.1). Both forward and backward coarticulations make the visual appearance of a particular phone in a sentence dependent not only on the corresponding phoneme's articulation properties but also on the nature of its neighbouring phones in the sentence. This means that a single phoneme can exhibit various visible representations, which implies that it can be mapped on several different visemes. Consequently, a comprehensive phoneme-to-viseme mapping scheme should be a many-to-many (NxM) relationship [Jackson, 1988]. Unfortunately, only very little research has been performed on the construction of such NxM mapping tables. A first step in this direction can be found in a study by Mattys [Mattys et al., 2002], where some of the phoneme equivalence classes from [Auer and Bernstein, 1997] were redefined by taking the phonetic context of consonants into account.

Finally, it can be argued that even a viseme set that is computed as an NxM mapping from phonemes is insufficient to accurately and efficiently define atomic units of visual speech. Instead of an a priori segmentation of the speech in terms of phonemes, the segmentation of a visual speech signal could be performed by taking only the visual features into account. For example, in a study by Hilder et al. [Hilder et al., 2010] such a segmentation is proposed, which has the benefit that the different allophones of a phoneme can get a different viseme label and that the visual coarticulation is automatically taken into account. It was shown that this strategy leads to a clustering of the speech segments into visemes which is optimal in the sense that the inter-cluster distance is much larger than the intra-cluster distance. Unfortunately, it is far from straightforward to use such a viseme set for visual speech analysis or synthesis purposes: there is no direct mapping from phoneme labels to viseme labels and the phone boundaries in the auditory mode of a speech segment do not coincide with the viseme boundaries in the visual mode. Recently, in a study by Taylor et al. [Taylor et al., 2012] the segmentation technique proposed in [Hilder et al., 2010] was employed to analyse a large corpus of English audiovisual speech. This led to the definition of 150 so-called dynamic visemes, elementary units of visual speech that span on average the length of 2-3 phonemes. A target phoneme sequence can be translated into a sequence of dynamic visemes by inspecting both the phoneme/dynamic viseme combinations occurring in the original text corpus and the similarity between the target phoneme duration and the duration of the dynamic viseme. Optimizing the synthesis is not straightforward, since many sequences of dynamic visemes can be employed to match a particular target phoneme sequence. In addition, to achieve audiovisual synchronisation the boundaries of the dynamic visemes need to be aligned with the target phoneme boundaries.

6.2.2 Visemes for the Dutch language

In comparison with the English language, the number of available reports on visemes for the Dutch language is limited.
Whereas for English some kind of standardization for Nx1 visemes exists in the MPEG-4 standard, for Dutch only a phoneme set has been standardized. A first study on visemes for Dutch was performed by Eggermont [Eggermont, 1964], where some CVC syllables were the subject of an audiovisual perception experiment. In addition, Corthals [Corthals, 1984] describes a phoneme-to-viseme grouping using phonetic expert knowledge. Finally, Van Son et al. [Van Son et al., 1994] define a new Nx1 phoneme-to-viseme mapping scheme that is constructed using the experimental results of new perception tests in combination with the few conclusions on Dutch visemes that could be found in earlier literature.

6.3 Phoneme-to-viseme mapping for visual speech synthesis

6.3.1 Application of visemes in VTTS systems

Section 2.2 contained a detailed overview of the various techniques that are used to synthesize visual speech. It discussed the various approaches for predicting the appropriate speech gestures based on the input text. This section summarizes the basic concepts of each of these techniques and discusses for each technique in which way a phoneme-to-viseme mapping scheme can be adopted to describe the speech information used by the synthesizer.

A first technique is adopted by the so-called rule-based synthesizers, which assign to each target phoneme a typical configuration of the virtual speaker. For instance, in a 3D-based synthesis approach these configurations can be expressed by means of parameter values of a parameterized 3D model. Alternatively, a 2D-based synthesizer can assign to each target phoneme a still image of an original speaker uttering that particular phoneme. Rule-based synthesizers generate the final synthetic visual speech signal by interpolating between the predicted keyframes. In order to cover the complete target language, a rule-based system has to know the mapping from any phoneme of that language to its typical visual representation, i.e., it has to define a complete phoneme-to-viseme mapping table. The actual speaker configuration that corresponds to each viseme label can be manually pre-defined (using a system-specific viseme set or a standardized viseme set such as described in MPEG-4) or it can be copied from original speech recordings. Almost all rule-based visual speech synthesizers adopt an Nx1 phoneme-to-viseme mapping scheme, which reduces the number of rules needed to cover all phonemes of the target language. In a rule-based synthesis approach using an Nx1 mapping scheme, the visual coarticulation effects need to be mimicked in the keyframe interpolation stage. To this end, a coarticulation model such as the Cohen-Massaro model [Cohen and Massaro, 1993] or Ohman's model [Ohman, 1967] is adopted. There have been only a few reports on the use of NxM phoneme-to-viseme mapping tables for rule-based visual speech synthesis. An example is the exploratory study by Galanes et al. [Galanes et al., 1998], in which regression trees are used to analyse a database of 3D motion capture data in order to design prototype configurations for context-dependent visemes. To synthesize a novel visual speech signal, the same regression trees are used to perform an NxM phoneme-to-viseme mapping that determines for each input phoneme a typical configuration of the 3D landmarks, taking its target phonetic context into account.
Afterwards, the keyframes are interpolated using splines instead of a coarticulation model, since coarticulation effects were already taken into account during the phoneme-to-viseme mapping stage. Another synthesis system that uses pre-defined context-dependent visemes was suggested by De Martino et al. [De Martino et al., 2006]. In their approach, 3D motion capture trajectories corresponding to the uttering of original CVCV and diphthong samples are gathered, after which important groups of similar visual phoneme representations are distinguished by means of k-means clustering. From these context-dependent visemes, the keyframe mouth dimensions corresponding to a novel phoneme sequence can be predicted. These predictions are then used to animate a 3D model of the virtual speaker. In follow-up research, for each context-dependent viseme identified by the clustering of the 3D motion capture data, a 2D still image of an original speaker is selected to define the articulation rules for a 2D photorealistic speech synthesizer [Costa and De Martino, 2010].

In contrast with rule-based synthesizers, unit selection synthesizers construct the novel speech signal by concatenating speech segments selected from a database containing original visual speech recordings. In this approach, no interpolation is needed since all output frames consist of visual speech information copied from the database. The selection of the visual speech segments can be based on the target/database matching of either phonemes (e.g., [Bregler et al., 1997] [Theobald et al., 2004] [Deng and Neumann, 2008]) or visemes (e.g., [Breen et al., 1996] [Liu and Ostermann, 2011]). From the literature it can be noticed that almost all viseme-based unit selection visual speech synthesizers apply an Nx1 phoneme-to-viseme mapping to label the database speech and to translate the target phoneme sequence into a target viseme sequence. In unit selection synthesis, original coarticulations are copied from the database to the output speech by concatenating original segments longer than one phoneme/viseme. In addition, extended visual coarticulations can be taken into account by selecting those original speech segments of which the visual context (i.e., the visual properties of their neighboring phonemes/visemes) matches the visual context of the corresponding target speech segment (this motivated the addition of the "visual context target cost" to the AVTTS system (see section 4.3.4.1)).

The third strategy for estimating the desired speech gestures from the text input is the use of a statistical prediction model that has been trained on the correspondences between visual speech features and a phoneme/viseme-based labelling of the speech signal. Such a trained model can predict the target features of the synthesizer's output frames for an unseen phoneme/viseme sequence given as input. A common strategy is the use of HMMs to predict the target visual features. Such an HMM usually models the visual features of each phoneme/viseme sampled at 3 to 5 distinct time instances along the phoneme/viseme duration. It can be trained using both static and dynamic observation vectors, i.e., the visual feature values and their temporal derivatives. Similar to most selection-based visual speech synthesizers, prediction-based synthesizers often use a phoneme-based segmentation of the speech, for which the basic training/prediction unit can be, for instance, a single phoneme or a syllable.
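As an illustration of such static plus dynamic observation vectors, the sketch below augments a visual parameter trajectory with its frame-to-frame derivatives; the simple difference operator is an illustrative choice (HMM toolkits typically use regression windows instead).

```python
import numpy as np

def add_delta_features(params: np.ndarray) -> np.ndarray:
    """Stack static visual features (n_frames x n_params) with their
    temporal derivatives, as used to train HMM-based predictors."""
    # First-order numerical derivative along the time axis as the
    # dynamic (delta) feature.
    delta = np.gradient(params, axis=0)
    return np.hstack([params, delta])

# Example: a 100-frame trajectory of 20 combined AAM parameters
# becomes a 100 x 40 observation sequence.
obs = add_delta_features(np.random.randn(100, 20))
print(obs.shape)  # (100, 40)
```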
Diphones are used as basic synthesis units in a study by Govokhina et al. [Govokhina et al., 2006a], in which it was concluded that the use of both static and dynamic features to train an HMM improved the synthesis quality, since this allows coarticulations to be learned by the prediction model as well. In follow-up research, their system was extended to allow some small asynchronies between the phoneme transitions and the transition points between the speech segments in the visual mode. This way, the HMM is capable of modelling some anticipatory visual coarticulation effects more accurately, since these occur in the visual speech mode before the corresponding phoneme is heard in the auditory mode [Govokhina et al., 2007].

6.3.2 Discussion

Each approach for predicting the visual speech information based on a target phoneme sequence, discussed in detail in section 2.2 and summarized in section 6.3.1, can be implemented using either phonemes or visemes as atomic speech units. In theory, a viseme-based approach should be superior since this type of labelling is more suited to identify visual speech information. It can be noticed that almost all viseme-based visual speech synthesizers that have been described in the literature use an Nx1 phoneme-to-viseme mapping scheme. For rule-based synthesizers, such an Nx1 mapping is advantageous since it reduces the number of rules needed to cover the whole target language. Moreover, the application of an Nx1 mapping scheme is useful for unit selection synthesizers as well, since it reduces the database size needed to provide a sufficient number of database segments that match a given target speech segment. Similarly, the use of an Nx1 mapping scheme reduces the minimal number of original sentences needed to train the prediction models of prediction-based synthesis systems. On the other hand, when an Nx1 phoneme-to-viseme mapping scheme is applied, an additional modelling of the visual coarticulation effects is needed. To this end, many coarticulation models have been proposed for use in rule-based visual speech synthesis. In the case of unit selection synthesis, which is currently the most appropriate technique to produce very realistic synthetic speech signals, visual coarticulations have to be copied from original speech recordings. It is obvious that the accuracy of this information transfer increases when the labelling of the original and the target speech data intrinsically describes these coarticulations. This is feasible when context-dependent visemes are used to label the target and the database speech, i.e., when an NxM phoneme-to-viseme mapping scheme is applied.

6.3.3 Problem statement

For unit selection synthesis, there exists a trade-off between the various approaches for labelling the speech data. The use of an Nx1 phoneme-to-viseme mapping increases the number of database segments that match a target speech segment, which means that it is more likely that for each target segment a highly suited database segment can be found. On the other hand, when this type of speech labels is used, appropriate original visual coarticulations can only be selected by means of accurate selection costs and by selecting long original segments. When context-dependent visemes are used, the visual coarticulation effects are much better described, both in the database speech and in the target speech.
Unfortunately, such an NxM mapping increases the number of distinct speech labels and thus decreases the number of database segments that match a target segment. Note that visual unit selection synthesis can also be performed using phoneme-based speech labels. Although phonemes are less suited to describe visual speech information, they may help to enhance the perceived quality when the synthetic visual speech is shown audiovisually to an observer, since the use of phoneme labels increases the audiovisual coherence. In the remainder of this chapter, the effect of all these possible speech labelling approaches on the quality of the synthetic visual speech is investigated. To this end, the unit selection-based VTTS synthesizer described in section 6.1.2 is used to synthesize visual speech signals using phonemes, Nx1 visemes and NxM visemes to describe the target and the database speech. For this, accurate NxM mapping schemes have to be developed, since only very few of these mappings can be found in the literature.

6.4 Evaluation of many-to-one phoneme-to-viseme mapping schemes for English

6.4.1 Design of many-to-one phoneme-to-viseme mapping schemes

In this section the standardized Nx1 mapping that is described in MPEG-4 [MPEG, 2013] is evaluated for use in concatenative visual speech synthesis. Since this mapping scheme is designed for English, the English version of the visual speech synthesis system was used, provided with the LIPS2008 audiovisual speech database. Based on the description in MPEG-4, the English phoneme set that was originally used to segment the LIPS2008 corpus has been mapped on 14 visemes, augmented with one silence viseme. The mapping of those phonemes that are not mentioned in the MPEG-4 standard was based on their visual and/or articulatory resemblance with other phones.

The MPEG-4 mapping scheme is designed to be a "best-for-all-speakers" phoneme-to-viseme mapping. However, for use in data-driven visual speech synthesis, the phoneme-to-viseme mapping should be optimized for the particular speaker of the synthesizer's database. To define such a speaker-dependent mapping, the AAM-based representations of the mouth region of the video frames from the database were used. In a first step, for every distinct phoneme present in the database all its instances were gathered. Then, the combined model parameter values of the frame located at the middle of each instance were sampled. From the collected parameter values, a speaker-specific mean visual representation of each phoneme was calculated. A hierarchical clustering analysis was performed on these mean parameter values to determine which phonemes are visibly similar for the speaker's speaking style. Using the dendrogram, a tree diagram that visualizes the arrangement of the clusters produced by the clustering algorithm, five important levels could be discerned in the hierarchical clustering procedure. Consequently, five different phoneme-to-viseme mappings were selected. They define 7, 9, 11, 19 and 22 visemes, respectively. Each of these viseme sets contains a "silence" viseme on which only the silence phoneme is mapped. The viseme mappings are summarized in appendix C.
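A hedged sketch of this speaker-dependent grouping step: the mean combined AAM parameter vector of each phoneme is clustered hierarchically and the resulting dendrogram is cut at a chosen number of viseme classes. The linkage criterion and the placeholder data are assumptions for illustration, not the settings used in the thesis.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# One mean combined AAM parameter vector per phoneme
# (placeholder data: 45 phonemes x 12 parameters).
phonemes = [f"ph{i}" for i in range(45)]
mean_params = np.random.randn(45, 12)

# Hierarchical (agglomerative) clustering of the mean representations;
# plotting the dendrogram of Z helps to pick meaningful cut levels.
Z = linkage(mean_params, method="ward")

# Cut the dendrogram at a chosen number of visemes, e.g. 9 classes.
labels = fcluster(Z, t=9, criterion="maxclust")

# Resulting speaker-dependent Nx1 phoneme-to-viseme mapping.
sd_mapping = {ph: f"viseme_{c}" for ph, c in zip(phonemes, labels)}
```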
6.4.2 Experiment

Both the MPEG-4 mapping scheme and the speaker-dependent mappings were the subject of an experimental evaluation. A random selection of original sentences from the database was resynthesized using the visual speech synthesis system discussed in section 6.1.2, for which both the synthesis targets and the speech database were labelled using the various Nx1 viseme sets. The synthesis parameters (selection costs, concatenation smoothing, etc.) were the same for all strategies. A reference synthesis strategy was added, for which the standard English phoneme set (containing 45 entries) was used to label the database and the synthesis targets. For every synthesis, the target original sentence was excluded from selection. The original database transcript was used as text input and the original database auditory speech was used as audio input.

A subjective evaluation of the syntheses was conducted, using four labelling strategies: speech synthesized using the speaker-dependent mappings on 9 (group "SD9") and 22 (group "SD22") visemes, speech synthesized using the MPEG-4 mapping (group "MPEG4"), and speech synthesized using standard phoneme-based labels (group "PHON"). In addition, extra reference samples were added (group "ORI"), for which the original AAM trajectories from the database were used to resynthesize the visual speech. The samples were shown pairwise to the participants. Six different comparisons were considered, as shown in figure 6.1. The sequence of the comparison types as well as the sequence of the sample types within each pair were randomized. The test consisted of 50 sample pairs: 14 comparisons containing an ORI sample and 36 comparisons between two actual syntheses. The same sentences were used for each comparison group. Thirteen people (9 male, 4 female, aged [22-59]) participated in the experiment, 8 of whom can be considered speech technology experts. The participants were asked to give their preference for one of the two samples of each pair using a 5-point comparative MOS scale [-2,2]. They were instructed to answer "0" if they had no clear preference for one of the two samples. The test instructions told the participants to pay attention both to the naturalness of the mouth movements and to how well these movements cohere with the auditory speech that is played along with the video. The key question of the test read as follows: "How much are you convinced that the person you see in the sample actually produces the auditory speech that you hear in the sample?".

The results obtained are visualized in figure 6.1.

Figure 6.1: Subjective evaluation of the Nx1 phoneme-to-viseme mappings for English. The histograms show for each comparison the participants' preference for the left/right sample type on a 5-point scale [-2,2].

The results of an analysis using Wilcoxon signed-rank tests are given in table 6.1.

Table 6.1: Subjective evaluation of the Nx1 phoneme-to-viseme mappings for English. Wilcoxon signed-rank analysis.

Comparison      Z       Sign.
MPEG4 - ORI     -7.79   p < 0.001
PHON - ORI      -6.44   p < 0.001
SD9 - MPEG4     -1.94   p = 0.052
SD22 - MPEG4    -1.64   p = 0.102
PHON - MPEG4    -5.24   p < 0.001
PHON - SD22     -5.04   p < 0.001

The results show that the participants were clearly in favour of the syntheses based on phonemes compared to the viseme-based syntheses. Furthermore, a higher perceived quality was attained by increasing the number of visemes.
It can be concluded that the speaker-dependent mappings perform similarly to the standardized mapping scheme, since the SD9 group scores worse than the MPEG4 group (which uses 15 distinct visemes), while the SD22 group scores better than the MPEG4 group. The results also show that the synthesized samples are still distinguishable from natural visual speech, although for this aspect, too, the phoneme-based synthesis outperforms the MPEG-4-based approach.

6.4.3 Conclusions

Both standardized and speaker-dependent Nx1 phoneme-to-viseme mapping schemes for English were constructed and applied for concatenative visual speech synthesis. In theory, such a viseme-based synthesis should outperform the phoneme-based synthesis since it multiplies the number of candidate segments for selection, while the reduced number of distinct speech labels can be justified by the fact that there exists redundancy in a phoneme-based labelling of visual speech (this is the reason for the Nx1 behaviour of the mapping). However, the synthesis based on phonemes resulted in higher subjective ratings than the syntheses based on visemes. In addition, the results obtained show that the synthesis quality increases when more distinct visemes are defined. These results raise some questions about the Nx1 viseme-based approach that is widely applied in visual speech synthesis. For audiovisual speech synthesis, it was already shown that a phoneme-based speech labelling is preferable, since it allows the selection of multimodal segments from the database, which maximizes the audiovisual coherence in the synthetic multimodal output speech (see chapter 3). From the current results it appears that a similar phoneme-based synthesis is preferable for visual-only synthesis as well. However, it could be that the many-to-one phoneme-to-viseme mappings insufficiently describe all the details of the visual speech information. Although the synthesizer mimicked the visual coarticulation effects by applying a target cost based on the visual context, it is likely that higher-quality viseme-based synthesis results can be achieved by using a many-to-many phoneme-to-viseme mapping instead, which describes the visual coarticulation effects already in the speech labelling itself.

6.5 Many-to-many phoneme-to-viseme mapping schemes

In order to construct an NxM phoneme-to-viseme mapping scheme, an extensive set of audiovisual speech data must be analysed to investigate the relationship between the visual appearances of the mouth area and the phonemic transcription of the speech. Since the resulting mapping schemes will eventually be evaluated for use in concatenative visual speech synthesis, it was opted to construct a speaker-dependent mapping by analysing a speech database that can be used for synthesis purposes too (i.e., a consistent dataset from a single speaker). Unfortunately, the amount of data in the English LIPS2008 database is insufficient for an accurate analysis of all distinct phonemes in all possible phonetic or visemic contexts. Therefore, the Dutch language was used instead, since this makes it possible to use the AVKH database (see chapter 5) for investigating the phoneme-to-viseme mapping.
This dataset contains 1199 audiovisual sentences (138 min) from the open domain and 536 audiovisual sentences (52 min) from the limited domain of weather forecasts, which should be sufficient to analyse the multiple visual representations that can occur for each Dutch phoneme.

6.5.1 Tree-based clustering

6.5.1.1 Decision trees

To construct the phoneme-to-viseme mapping scheme, the data from the Dutch speech database was analysed by clustering the visual appearances of the phoneme instances in the database. Each instance was represented by three sets of combined AAM parameters, corresponding to the video frames at 25%, 50% and 75% of the duration of the phoneme instance, respectively. This three-point sampling was chosen to integrate dynamics in the measure, as the mouth appearance can vary during the uttering of a phoneme. As a clustering tool, multi-dimensional decision trees [Breiman et al., 1984] were used, similar to the technique suggested in [Galanes et al., 1998]. A decision tree is a data analysis tool that is able to cluster training data based on a number of decision features describing this data. To build such a tree, first a measure for the impurity in the training data must be defined. For this purpose, the distance d(p_i, p_j) between phoneme instances p_i and p_j was defined as the weighted sum of the Euclidean differences between the combined AAM parameters (c) of the video frames at 25%, 50% and 75% of the length of both instances:

d(p_i, p_j) = \frac{1}{4}\left\|c_i^{25} - c_j^{25}\right\| + \left\|c_i^{50} - c_j^{50}\right\| + \frac{1}{4}\left\|c_i^{75} - c_j^{75}\right\|    (6.1)

Next, consider a subset Z containing N phoneme instances. Equation 6.2 expresses for instance p_i its mean distance from the other instances in Z:

\mu_i = \frac{\sum_{j=1}^{N} d(p_i, p_j)}{N - 1}    (6.2)

Let \sigma_i denote the variance of these distances. For every instance in Z the value of \mu_i is calculated, from which the smallest value is selected as \mu_{best}. A final measure I_Z for the impurity of subset Z can then be calculated by

I_Z = N\,(\mu_{best} + \lambda\,\sigma_{best})    (6.3)

in which \lambda is a scaling factor. When the decision tree is constructed, the training data is split by asking questions about the properties of the data instances. At each split, the best question is chosen in order to minimize the impurity in the data (i.e., the sum of the impurities of all subsets). A tree-like structure is obtained since at each split new branches are created in order to group similar data instances. In each next step, each branch itself is further split by asking other questions that in turn minimize the impurity in the data. In the first steps of the tree building, the branching is based on big differences among the data instances, while the final splitting steps are based on only minor differences. For some branches, some of the last splitting steps can be superfluous. However, other branches do need many splitting steps in order to accurately cluster their data instances. A stop size must be defined as the minimal number of instances that are needed in a cluster. Obviously, the splitting also stops when no more improvement of the impurity can be found.
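The sketch below implements the distance and impurity measures of equations 6.1-6.3 (with the weighting as printed above) together with a greedy split-selection step; the data layout and the question representation are illustrative assumptions, not the actual tree-building code.

```python
import numpy as np

def distance(pi, pj):
    """Eq. 6.1: weighted sum of Euclidean distances between the combined
    AAM parameters sampled at 25%, 50% and 75% of both instances."""
    return (0.25 * np.linalg.norm(pi["c25"] - pj["c25"])
            + np.linalg.norm(pi["c50"] - pj["c50"])
            + 0.25 * np.linalg.norm(pi["c75"] - pj["c75"]))

def impurity(subset, lam=2.0):
    """Eqs. 6.2-6.3: impurity of a subset Z of phoneme instances."""
    n = len(subset)
    if n < 2:
        return 0.0
    dists = np.array([[distance(a, b) for b in subset] for a in subset])
    mu = dists.sum(axis=1) / (n - 1)           # eq. 6.2 for every instance
    best = int(np.argmin(mu))                  # instance with smallest mu
    # Variance of the distances from the selected instance to the others.
    sigma_best = dists[best][np.arange(n) != best].var()
    return n * (mu[best] + lam * sigma_best)   # eq. 6.3

def best_split(subset, questions):
    """Greedy step of the tree building: pick the yes/no question that
    minimizes the summed impurity of the two resulting subsets."""
    return min(
        questions,
        key=lambda q: impurity([p for p in subset if q(p)])
                      + impurity([p for p in subset if not q(p)]),
    )
```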
6.5.1.2 Decision features

To build the decision trees, each phoneme instance must be characterized by an appropriate set of features. Various possible features can be used, of which the identity of the phoneme (i.e., its name or corresponding symbol) is the most straightforward. Another key feature is a consonant/vowel (C/V) classification of the data instance. In addition, a set of phonetic features can be linked to each instance, based on the phonetic properties of the corresponding phoneme: vowel length, vowel type (short, long, diphthong, schwa), vowel height, vowel frontness, lip rounding, consonant type (plosive, fricative, affricate, nasal, liquid, trill), consonant place of articulation (labial, alveolar, palatal, labio-dental, dental, velar, glottal) and consonant voicing. Note that these features have been determined by phonetic knowledge of Dutch phonemes and that it can be expected that not all of them have an explicit influence on the visual representation of the phoneme. In addition, for each Dutch phoneme an additional set of purely visual features was calculated. To this end, several properties of the visual speech were measured throughout the database: mouth height, mouth width, the visible amount of teeth and the visible amount of mouth cavity (the dark area inside an open mouth). For each of these measures, the 49 distinct Dutch phonemes were labelled based on their mean value for that measure (labels "−−", "−", "+" and "++" were used for this). For example, the phoneme /a/ (the long "a" from the word "daar") has value "++" for both the mouth-height and the teeth feature, while the phoneme /o/ (the long "o" from the word "door") has value "++" for the mouth-height feature but value "−−" for the teeth feature.

In order to construct a many-to-many mapping scheme, the tree-based clustering has to be able to model the visual coarticulation effects. Therefore, not only features of the particular phoneme instance itself but also features concerning its neighbouring instances are used to describe the data instance. This way, instances of a single Dutch phoneme can be mapped on different visemes, depending on their context in the sentence.

[Figure 6.2: Difference between using the phoneme identity (left) and the C/V property (right) as pre-cluster feature.]

6.5.1.3 Pre-cluster

The complete database contains about 120000 phoneme instances. Obviously, it would require a very complex calculation to perform the tree-based clustering on the complete dataset at once. A common approach in decision tree analysis is to select a particular feature for pre-clustering the data: the data instances are first grouped based on this feature, after which a separate tree-based clustering is performed on each of these groups. Two different options for this pre-cluster feature were investigated. In a first approach, the identity of the phoneme corresponding to each instance was chosen as pre-cluster feature. This implies that for each Dutch phoneme, a separate tree will be constructed. In another approach, the consonant/vowel property was used to pre-cluster the data. This way, only two large trees are calculated: a first one to cluster the data instances corresponding to a vowel and another tree to cluster the instances corresponding to a consonant. This second approach makes it possible for two different Dutch phonemes, each in a particular context, to be mapped on the same tree-based viseme, as is illustrated in figure 6.2.

6.5.1.4 Clustering into visemes

Once a pre-cluster feature has been selected, it has to be decided which features are used to build the decision trees. Many configurations are possible, since features from the instance itself as well as features from its neighbours can be applied.
In addition, a stop-size has to be chosen, which corresponds to the minimal number of data instances that should reside in a node. This parameter has to be chosen small enough to ensure an in-depth analysis of the data. On the other hand, an end-node of the decision tree is characterized by the mean representation of its data instances. Therefore, the minimal number of instances in an end-node should be adequate to cope with inaccuracies in the training data (e.g., local phonemic segmentation errors). After extensive testing (using experiments similar to those described in the next section), two final configurations for the tree-based clustering were defined, as described in table 6.2.

Table 6.2: Tree configurations A and B that were used to build the decision trees that map phonemes to visemes.
  Pre-cluster feature:                   A: phoneme identity | B: C/V classification
  Features of the current instance:      A: none | B: phoneme identity, phonetic features, visible features
  Features of neighbouring instances:    A: phonetic features, visible features | B: phoneme identity, phonetic features, visible features

For the clustering calculations, the distance between two data samples was calculated using equation 6.1 and the impurity of the data was expressed as in equation 6.3 (using λ = 2). In configuration A, a separate tree is built for each Dutch phoneme. Each of these trees defines a partitioning of all the training instances of a single phoneme based on their context. This context is described using both the set of phonetic and the set of visible features that were described in section 6.5.1.2, which should be sufficient to model the influence of the context on the dynamics of the current phoneme. Alternatively, in configuration B only two large trees are calculated. As these trees are built using a huge amount of training data (for each tree, a maximum of 30000 uniformly sampled data instances was chosen), all possible features of both the instance itself and its neighbouring instances are given as input to the tree-building algorithm. Although describing a data instance by both its phoneme identity and its phonetic/visible features introduces some redundancy, this way a maximal number of features is available to efficiently and rapidly decrease the impurity in the large data set; the clustering algorithm itself will determine which features to use for this purpose. For both configurations A and B, the trees were built using stop-sizes of 25 and 50 instances, resulting in trees "A25", "A50", "B25" and "B50".
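The greedy splitting that produces these trees can be sketched as follows. This is a simplified illustration only (binary yes/no questions on categorical features, reusing the impurity() measure sketched above); the helper names and data layout are assumptions rather than the actual implementation.

```python
# Simplified sketch of the greedy tree growing described above. Each instance is a
# dict with its categorical decision features ("features") and its three AAM vectors
# ("aam"); "questions" is a list of (feature, value) yes/no questions.

def best_split(instances, questions, stop_size):
    parent = impurity([x["aam"] for x in instances])
    best = None
    for feat, value in questions:
        yes = [x for x in instances if x["features"].get(feat) == value]
        no = [x for x in instances if x["features"].get(feat) != value]
        if len(yes) < stop_size or len(no) < stop_size:
            continue                                   # respect the minimal cluster size
        score = impurity([x["aam"] for x in yes]) + impurity([x["aam"] for x in no])
        if score < parent and (best is None or score < best[0]):
            best = (score, feat, value, yes, no)
    return best                                        # None: no admissible question improves the impurity

def grow_tree(instances, questions, stop_size=25):
    split = best_split(instances, questions, stop_size)
    if split is None:
        return {"leaf": instances}                     # end-node, i.e. one tree-based viseme
    _, feat, value, yes, no = split
    return {"question": (feat, value),
            "yes": grow_tree(yes, questions, stop_size),
            "no": grow_tree(no, questions, stop_size)}
```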
6.5.1.5 Objective candidate test

As was already mentioned in the previous section, the large number of possible tree configurations imposes the need for an objective measure to assess the quality of the tree-based mapping from phonemes to visemes. An objective test was designed in which a number of database sentences are resynthesized using the concatenative visual speech synthesizer. For every synthesis, the target original sentence is excluded from the database. The original database transcript is used as text input and the original database auditory speech is used as audio input. Both the database labelling and the description of the synthesis targets are written in terms of the tree-based visemes. As usual, the synthesis involves a set of candidate segments being determined for each synthesis target. To calculate a quality measure for the applied speech labelling, these candidate segments are first ranked in terms of their total target cost. Next, the n best candidates are selected and their distance from the ground truth is measured using the three-point distance that was described in equation 6.1. Finally, for each synthesis target a single error value is calculated by computing the mean distance over these n best candidates. Since the resynthesis of a fixed group of database sentences using different speech labels defines corresponding synthesis targets for each of these label sets (as the original target phoneme sequence is the same for each approach), the calculation of the mean candidate quality for each synthesis target produces paired-sample data that can be used to compare the accuracy of the different speech labelling approaches.

Using this objective measure, tree configurations A and B (see section 6.5.1.4) could be identified as high quality clustering approaches. In figure 6.3 their performance is visualized, using the n = 50 best candidates and omitting the silence targets from the calculation since they are 1x1 mapped on the silence viseme. Figure 6.3 also shows the influence of including the visual context cost in the calculation of the total target cost (see section 6.1.2). Two reference methods were added to the experiment. The first reference result, referred to as "PHON", was measured using a phoneme-based description of the database and the synthesis targets. For the second reference approach, referred to as "STDVIS", an Nx1 phoneme-to-viseme mapping for Dutch was constructed, based on the 11 viseme classifications described by Van Son et al. [Van Son et al., 1994]. These Dutch "standardized" visemes are based on both subjective perception experiments and prior phonetic knowledge about the uttering of Dutch phonemes.

[Figure 6.3: Candidate test results obtained for a synthesis based on phonemes, a synthesis based on standardized Nx1 visemes and multiple syntheses based on tree-based NxM visemes (mean distances). The "CC" values were obtained by incorporating the visual context target cost to determine the n-best candidates.]

A statistical analysis on the values obtained without using the visual context target cost, using ANOVA with repeated measures and Greenhouse-Geisser correction, indicated significant differences among the values obtained for each group (F(3.70, 5981) = 281; p < 0.001). An analysis using paired-sample t-tests indicated that phonemes describe the synthesis targets more accurately than the standard Nx1 visemes (p < 0.001). This result is in line with the results that were obtained for English (see section 6.4.2), where a synthesis based on phonemes outperformed all syntheses based on Nx1 visemes. In addition, all tree-based labelling approaches perform significantly better than both the phoneme-based labelling and the standard Nx1 viseme-based labelling (p < 0.001). Thus, unlike the Nx1 visemes, the tree-based NxM phoneme-to-viseme mappings define an improved description of the visual speech information in comparison with phonemes. The results obtained show only minor, non-significant differences between the various tree configurations. In addition, an improvement of the candidates can be noticed when the context target cost is applied, especially for the phoneme-based and Nx1 viseme-based labels. As this context cost is used to model the visual coarticulation effects, it is logical that its usage has less influence on the results obtained for the NxM viseme-based labels, as they intrinsically model the visual coarticulation themselves. Note that, even when the context target cost was applied, the NxM viseme labels performed significantly better than the phoneme-based and Nx1 viseme-based labels (p < 0.001).
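The per-target error value of this candidate test could be computed as in the sketch below, which reuses the instance_distance() sketch given earlier; the data layout is again illustrative rather than the actual implementation.

```python
import numpy as np

# Sketch of the candidate test: for one synthesis target, rank the candidate segments
# by their total target cost, keep the n best and report their mean three-point
# distance (equation 6.1) to the ground-truth realisation of that target.
def candidate_error(candidate_aams, target_costs, ground_truth_aam, n=50):
    order = np.argsort(target_costs)[:n]               # n-best candidates by total target cost
    dists = [instance_distance(candidate_aams[i], ground_truth_aam) for i in order]
    return float(np.mean(dists))

# Repeating this for a fixed set of resynthesised sentences under each labelling
# scheme yields one error value per target and per scheme, i.e. the paired samples
# that are compared with repeated-measures ANOVA and paired t-tests above.
```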
6.5.2 Towards a useful many-to-many mapping scheme

6.5.2.1 Decreasing the number of visemes

Using the decision trees, the phoneme instances from the Dutch speech database were clustered into small subsets. The viseme corresponding to a given phoneme instance is determined by the traversal of such a tree based on various properties of the instance itself and on the properties of its context. Given the extensive amount of training data and the large number of decision features, the tree-based clustering results in a large number of distinct visemes: 1050 for the A25 tree, 650 for the A50 tree, 800 for the B25 tree and 412 for the B50 tree. These big numbers are partly caused by the fact that an extensive analysis with many splitting steps was applied during the tree building. This is necessary since some pre-clusters contain a large number of diverse data instances. On the other hand, for the splitting of the data instances from some of the other pre-clusters (e.g., pre-clusters corresponding to less common phonemes), fewer splitting steps would have been sufficient. Consequently, the tree-based splitting has not only resulted in a large number of end-nodes but also in an "over-splitting" of some parts of the dataset. Another reason for the large number of tree-based visemes is the fact that the pre-clustering step makes it impossible for the tree-clustering algorithm to combine similar data instances from different pre-clusters into the same node. Therefore, it can be assumed that for each tree configuration, many of its tree-based visemes are in fact similar enough to be considered as one single viseme.

The standardized Nx1 viseme mapping identifies 11 distinct visual appearances for Dutch. A good quality automatic viseme classification can be expected to define at most a few more visemes. Since the tree-based clustering results in the definition of a much larger number of visemes, more useful NxM phoneme-to-viseme mapping schemes were constructed by performing a new clustering on the tree-based visemes themselves. First, a general description of each tree-based viseme defined by a particular tree configuration was determined. To this end, for each end-node of the tree a mean set of combined AAM parameter values was calculated, sampled over all phoneme instances that reside in this node. Since the original phoneme instances were sampled at three distinct points, the tree-based visemes are also described by three sets of combined AAM parameter values (describing the visual appearance at 25%, 50% and 75% of the viseme's duration). Next, based on their combined AAM parameters, all tree-based visemes were clustered using a k-means clustering approach. Note that for k-means clustering, the number of clusters has to be determined beforehand. Estimating this number using a heuristic approach, in which the k-means clustering is performed successively with an increasing cluster count and the final number of clusters is chosen graphically at the step where the marginal gain in the percentage of variance explained by the clusters drops, resulted in about 20 clusters for all tree configurations.
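A sketch of this cluster-count heuristic is given below. It assumes scikit-learn is available and uses the k-means inertia to express the fraction of variance explained; in practice the elbow is read off a plot, and the vector layout is an assumption.

```python
import numpy as np
from sklearn.cluster import KMeans

# Sketch of the elbow-type heuristic described above: run k-means for an increasing
# number of clusters and track the fraction of variance explained; the final cluster
# count is chosen where the marginal gain drops.
def explained_variance_curve(viseme_vectors, k_values):
    """viseme_vectors: (n_tree_visemes, n_dims) array, one mean AAM description per tree-based viseme."""
    total = ((viseme_vectors - viseme_vectors.mean(axis=0)) ** 2).sum()
    curve = []
    for k in k_values:
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(viseme_vectors)
        curve.append((k, 1.0 - km.inertia_ / total))   # fraction of variance explained by k clusters
    return curve

# e.g.: for k, ev in explained_variance_curve(vectors, range(2, 61)): print(k, round(ev, 3))
```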
For the k-means clustering calculations, two different distances between the tree-based visemes were defined: the Euclidean difference between the combined AAM parameters of the frames at the middle of the visemes, and the weighted sum of the distances between the combined AAM parameters of the frames at 25%, 50% and 75% of the duration of the visemes (equation 6.1). Using the first distance measure, clusterings into 11 and 20 visemes were calculated, chosen to match the standard Nx1 viseme mapping and the outcome of the heuristic method, respectively. In addition, using the three-point distance measure, an extra clustering into 50 distinct visemes was calculated. A larger number of clusters was chosen here since this distance measure incorporates the dynamics of the visemes, which is likely to result in the existence of more distinct visemes. Also, this number is the same as the number of Dutch phonemes that were used for the initial database labelling, which might be useful for comparison later on. For reasons of time, only the A25 and B25 trees were processed, since these define a more in-depth initial segmentation of the training data. In addition, a single clustering of the visemes defined by the B50 tree was performed for verification, as described in table 6.3.

Table 6.3: Mapping from tree-based visemes to final NxM visemes.
             11 clusters   20 clusters   50 clusters
  Tree A25   A25_11        A25_20        A25_50
  Tree B25   B25_11        B25_20        B25_50
  Tree B50   -             B50_20        -

6.5.2.2 Evaluation of the final NxM visemes

In order to evaluate the final NxM phoneme-to-viseme mapping schemes, an evaluation of the n-best synthesis candidates similar to the one used for the tree-based visemes (see section 6.5.1.5) was performed. Phoneme-based speech labels, standard Nx1 viseme-based labels and a labelling using the tree-based visemes A25 and B25 were added as references. Figure 6.4 illustrates the test results. The n = 50 best candidates were used, silences were omitted from the calculation, and the visual context was not taken into account when calculating the total target cost.

[Figure 6.4: Candidate test results obtained for a synthesis based on phonemes, a synthesis based on standardized Nx1 visemes and multiple syntheses based on the final NxM visemes (mean distances). Some results obtained using tree-based NxM visemes are added for comparison purposes.]

A statistical analysis using ANOVA with repeated measures and Greenhouse-Geisser correction indicated significant differences among the values obtained for each group (F(5.36, 13409) = 233; p < 0.001). An analysis using paired-sample t-tests indicated that all final NxM visemes significantly outperform a labelling based on phonemes and a labelling based on the standard Nx1 visemes (p < 0.001). This is an important result since, unlike for the tree-based visemes, these final NxM mapping schemes define only a limited number of distinct speech labels.
The NxM mappings on 11 and 20 distinct visemes result in a more accurate speech labelling than a phoneme-based labelling, despite the fact that they use less than half the number of distinct labels. In addition, it can be seen that for both configuration A25 and configuration B25, better test results are obtained when more distinct NxM viseme labels are defined (p < 0.001). This means that the NxM mappings on 50 visemes are indeed modelling some extra differences compared with the NxM mappings on 20 visemes. From the test results it can also be concluded that the corresponding mappings derived from tree configurations A and B perform comparably. When the results obtained for the final NxM visemes are compared with the results obtained for the tree-based visemes, the latter perform best (p < 0.001). However, this difference can still be considered rather small given the fact that for the tree-based visemes, the number of distinct speech labels is up to a factor of 50 higher than for the final NxM visemes (e.g., 20 distinct labels for the A25_20 approach versus 1050 distinct labels for the A25 mapping scheme).

6.6 Application of many-to-many visemes for concatenative visual speech synthesis

[Figure 6.5: Relation between the visual speech synthesis stages and the objective measures. The diagram shows stage A (candidate units are determined for the synthesis targets from the database, using target costs), stage B (the final unit is selected from the candidates by minimizing target and join costs) and stage C (the output speech is produced by concatenation, optimization and synchronization). In section 6.5.1.5 and section 6.5.2.2 the accuracy of the speech labels was tested in stage A, while the attainable synthesis quality can be measured by evaluating the final selection in stage B.]

6.6.1 Application in a large-database system

In a first series of experiments, the final NxM viseme labelling was applied for concatenative visual speech synthesis using the AAM-based VTTS system provided with the open-domain part of the AVKH audiovisual speech database. The experiments are similar to the candidate test that was described in section 6.5.1.5, except that for actual synthesis it is not the quality of the n-best candidate segments but the quality of the final selected segment that is important. This final selection is based on the minimization of both target and join costs (see figure 6.5). Given the large size of the speech database, many candidates are available for each synthesis target. In order to reduce the calculation time, the synthesizer uses only the 700 best candidates (in terms of total target cost) for further selection. In order to objectively assess the attainable synthesis quality using a particular speech labelling, a synthesis experiment was conducted in which 200 randomly selected sentences from the database were resynthesized. For every synthesis, the target original sentence was excluded from the database. The original database transcript was used as text input and the original database auditory speech was used as audio input. During synthesis, the synthesizer selects for each synthesis target the most optimal segment from the database. An objective measure was calculated by comparing each of these selected segments with its corresponding ground-truth segment, using the weighted distance that was described in equation 6.1. Silences were omitted from the calculation.
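The final selection in stage B, which this experiment evaluates, amounts to a dynamic-programming search over the pruned candidate lists. The sketch below illustrates this; target_cost() and join_cost() stand in for the synthesizer's actual (weighted, multi-part) selection costs and are assumptions rather than the real implementation.

```python
import numpy as np

# Minimal dynamic-programming sketch of stage B: pick one candidate per synthesis
# target so that the summed target and join costs over the utterance are minimal.
# Candidate lists are assumed to be pre-pruned to the n-best by target cost
# (700 in the experiment above).
def select_units(candidate_lists, target_cost, join_cost):
    prev_scores = np.array([target_cost(c) for c in candidate_lists[0]])
    backptr = []
    for t in range(1, len(candidate_lists)):
        scores, ptrs = [], []
        for c in candidate_lists[t]:
            trans = prev_scores + np.array([join_cost(p, c) for p in candidate_lists[t - 1]])
            best = int(np.argmin(trans))          # cheapest predecessor for this candidate
            ptrs.append(best)
            scores.append(trans[best] + target_cost(c))
        prev_scores = np.array(scores)
        backptr.append(ptrs)
    # Trace back the cheapest path through the candidate lattice.
    path = [int(np.argmin(prev_scores))]
    for ptrs in reversed(backptr):
        path.append(ptrs[path[-1]])
    path.reverse()
    return [candidate_lists[t][i] for t, i in enumerate(path)]
```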
Figure 6.6 illustrates the test results obtained for some important NxM viseme-based labelling schemes that have been described in section 6.5.2 and for a baseline system using phoneme-based speech labels.

[Figure 6.6: Evaluation of the segment selection using a large database (mean distances). Both the results obtained using a phoneme-based speech labelling and the results obtained using multiple final NxM viseme sets are shown.]

A statistical analysis using ANOVA with repeated measures and Huynh-Feldt correction indicated significant differences among the values obtained for each group (F(2.89, 31378) = 81.2; p < 0.001). Section 6.5.2.2 described how an objective experiment pointed out that the best mean candidate quality (measured over the 50 best candidates) is attained by a synthesis based on NxM visemes. The current experiment, however, shows a different behaviour for the quality of the final selected segment. In this case, a selection based on phonemes performs as well as the best result that was obtained using NxM viseme labels. An analysis using paired-sample t-tests indicated that it performs significantly better than the syntheses based on NxM viseme labels that use 11 or 20 distinct speech labels (p < 0.001). This can be understood as follows. The main reason why a synthesis would profit from the use of only a few distinct speech labels is the increased number of candidates for each synthesis target. For the current experiment, however, the synthesizer's database is very large, resulting in a huge number of candidates for each synthesis target for each of the different speech labelling strategies. This justifies the use of a larger number of distinct speech labels, since it will refine the segment selection provided that the labelling is sufficiently accurate.

In the current experiment the phoneme-based system performed as well as the system using 50 distinct NxM visemes. From section 6.5.2.2 it is known that the phoneme-based labelling is less accurate than this particular NxM viseme-based labelling. For synthesis, however, the selection of the final segment from all possible candidates is based on both target and join costs, meaning that the selection of a high quality final segment from an overall lower-quality set of candidate segments is possible when accurate selection costs are used. Moreover, there is no reason to assume that the final selected segment will be one of the n-best candidate segments based on the target cost alone. This could explain why, for the current test, none of the syntheses based on NxM visemes is able to outperform the synthesis based on phoneme labels. To check this assumption, the same 200 sentences were synthesized again with another set of selection costs: the target cost based on the context was omitted and the influence of the join costs was reduced in favour of the influence of the target costs. Both the phoneme-based speech labelling and the NxM viseme-based labelling that scored best in the previous test were used.
Obviously, the quality of the final selected segments decreased in this new synthesis setup, as visualized in figure 6.7.

[Figure 6.7: Evaluation of the segment selection using a large database: optimal selection costs (the two leftmost results) and alternative selection costs (the two rightmost results) (mean distances).]

More importantly, for this new synthesis the NxM viseme-based result was found to be significantly better than the phoneme-based segment selection (paired t-test; t = 11.6; p < 0.001). So it appears that when non-optimal selection costs are applied, the more accurate labelling of the speech by means of the NxM visemes does improve the segment selection quality. One possible reason for this is that the join costs should partially model the visual coarticulation effects as well, since they push the selection towards a segment that fits well with its neighbouring segments in the synthetic speech signal. From these results it can be concluded that for synthesis using a large database, the use of more distinct speech labels than the theoretical minimum (11 for Dutch) is preferable. In addition, given the large number of candidates that can be found for each target, a precise definition of the selection costs is able to conceal the differences between the accuracy of the different speech labelling approaches. Note, however, that the use of NxM viseme-based speech labels for synthesis using a large database can speed up the synthesis process. During synthesis, the heaviest calculation consists of the dynamic search among the consecutive sets of candidate segments to minimize the global selection cost. In addition, it has been shown that the use of viseme-based labels improves the overall quality of the n-best candidate segments. Therefore, the use of viseme labels permits a reduction of the number of candidate segments in comparison with a synthesis based on phoneme labels. Consequently, it will be easier to determine an optimal set of final selected segments, which results in reduced synthesis times.

6.6.2 Application in limited-database systems

In the previous section the use of NxM visemes for concatenative visual speech synthesis using a large database was evaluated. In practice, however, most visual speech synthesis systems use a much smaller database from which the speech segments are selected. The main reason why the use of Nx1 visemes for visual speech synthesis purposes is well-established is the limited amount of speech data that is necessary to cover all visemes or di-visemes of the target language. Therefore it is useful to evaluate the use of NxM viseme-based speech labels in such a limited-database system. In this case, the number of available candidates for each synthesis target will be much smaller than in the large-database system that was tested in section 6.6.1. It is interesting to evaluate how this affects the quality of the syntheses based on the various speech labelling approaches.

6.6.2.1 Limited databases

From the open-domain part of the AVKH audiovisual database, several subsets were selected, each defining a new limited database. The selection of these subsets was performed by a sentence selection algorithm that ensures that for each of the speech labelling approaches under test (phoneme, standard Nx1 viseme and several NxM visemes), the subset contains at least n instances of each distinct phoneme/viseme that is defined in the label sets; a sketch of such a selection is given below. The subsets obtained are summarized in table 6.4. For n = 3, the sentence selection was run twice, which resulted in the distinct subsets DB1 and DB2.

Table 6.4: Construction of limited databases.
  Database name   n   Database size
  DB1             3   33 sent.
  DB2             3   42 sent.
  DB3             4   47 sent.
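The text does not detail the sentence selection algorithm itself; the greedy variant sketched below is one plausible way to obtain such coverage-constrained subsets and is given purely as an illustration.

```python
from collections import Counter

# Illustrative greedy sketch of a coverage-driven sentence selection (the actual
# algorithm used for DB1-DB3 is not specified). label_sets maps each labelling scheme
# to the set of its distinct labels; each sentence is a dict mapping a scheme name to
# the list of labels occurring in that sentence.
def select_subset(sentences, label_sets, n=3):
    counts = {scheme: Counter() for scheme in label_sets}
    chosen = []

    def missing(scheme, lab):
        return max(0, n - counts[scheme][lab])

    def total_deficit():
        return sum(missing(s, lab) for s in label_sets for lab in label_sets[s])

    while total_deficit() > 0:
        best, best_gain = None, 0
        for i, sent in enumerate(sentences):
            if i in chosen:
                continue
            # How much would adding this sentence reduce the remaining deficit?
            gain = sum(min(c, missing(s, lab))
                       for s in label_sets
                       for lab, c in Counter(sent[s]).items())
            if gain > best_gain:
                best, best_gain = i, gain
        if best is None:
            break                        # remaining labels cannot be covered by the data
        chosen.append(best)
        for s in label_sets:
            counts[s].update(sentences[best][s])
    return chosen                        # indices of the selected sentences
```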
6.6.2.2 Evaluation of the segment selection

The attainable synthesis quality using these limited databases was assessed by the same objective test that was described in section 6.6.1. In a first run, 120 sentences were resynthesized using DB1. Various NxM viseme-based labelling approaches as well as two baseline systems (phoneme-based and standard Nx1 viseme-based) were used. The results obtained are visualized in figure 6.8.

[Figure 6.8: Evaluation of the selected segments using the limited database DB1 (mean distances). The results obtained using a phoneme-based speech labelling, the results obtained using the standardized Nx1 viseme labelling and the results obtained using various final NxM viseme sets are shown.]

A statistical analysis using ANOVA with repeated measures and Huynh-Feldt correction indicated significant differences among the values obtained for each group (F(6.62, 71849) = 50.7; p < 0.001). An analysis using paired-sample t-tests indicated that for a synthesis using a limited database, all NxM viseme-based labelling approaches result in the selection of significantly better segments than a synthesis based on phonemes or Nx1 visemes (p < 0.001). In addition, in this test the viseme labels using 11 and 20 distinct visemes scored better than the label sets that use 50 distinct visemes (p < 0.001). This could be explained by the fact that, due to the limited amount of available speech data, the mappings on 50 visemes result in considerably fewer candidate segments for each synthesis target in comparison with the approaches that use fewer distinct speech labels. This assumption is in line with the results that were shown in figure 6.6, where it was found that when a large amount of speech data is available, the viseme sets using 50 distinct labels perform best. It is worth mentioning that for the current test, the viseme sets using 50 distinct labels still performed significantly better than the phoneme-based labelling, which uses the same number of distinct speech labels. Similarly, the syntheses based on 11 distinct NxM visemes performed significantly better than the synthesis based on the 11 standard Nx1 visemes. These are important results, since they show that an NxM viseme labelling does help to improve the segment selection quality in the case where not that much speech data is available. To verify these results, similar experiments were conducted using other sets of target sentences and other limited databases (see figure 6.9 for some examples).

[Figure 6.9: Evaluation of the selected segments using the limited database DB2 (upper panel) and the limited database DB3 (lower panel) (mean distances).]

For all these experiments, results similar to the results discussed above were obtained.
6.6.2.3 Evaluation of the synthetic visual speech

While the tests described in section 6.6.1 and section 6.6.2.2 evaluated the effect of the speech labelling approach on the quality of the segment selection, and thus on the attainable speech quality, some final experiments were conducted in order to assess the achieved quality of the visual speech synthesizer as a whole (denoted as stage C in figure 6.5). For this, instead of evaluating the quality of the segment selection itself, the quality of the final visual output speech resulting from the concatenation of the selected segments is assessed. This concatenation involves some optimizations, such as a smoothing of the parameter trajectories at the concatenation points and an additional smoothing by a low-pass filtering technique (see section 4.4). In addition, the concatenated speech is non-uniformly time-scaled to achieve synchronization with the target segmentation of the auditory speech. It is interesting to evaluate the effect of these final synthesis steps on the observed differences between the different speech labelling approaches.

Objective evaluation

To objectively assess the quality of the synthesized speech, 70 randomly selected sentences from the full-size database were resynthesized. The VTTS system was provided with the limited database DB1. It was ensured that the target sentences were not part of this database. The original database transcript was used as text input and the original database auditory speech was used as audio input. Optimal settings were applied for the concatenation smoothing and for the other synthesis optimizations. To measure the quality of the synthesized visual speech, the final combined AAM parameter trajectories describing the output speech were compared with the ground-truth trajectories from the speech database. As distance measure, a dynamic time warping (DTW) cost was used. Dynamic time warping [Myers et al., 1980] is a time-series similarity measure that minimizes the effects of shifting and distortion in time by allowing elastic transformation of the time series in order to detect similar shapes with different phases. A DTW from a time series X = (x_1, x_2, ..., x_n) to a time series Y = (y_1, y_2, ..., y_m) first involves the calculation of the local cost matrix L representing all pairwise differences between X and Y. A warping path between X and Y can be defined as a series of tuples (x_i, y_j) that defines the correspondences from elements of X to elements of Y. When this warping path satisfies certain criteria, such as a boundary condition, a monotonicity condition and a step-size condition, it defines a valid warp from series X towards series Y (the interested reader is referred to [Senin, 2008] for a detailed explanation of this technique). A warping cost can be associated with each warping path by adding all local costs collected by traversing matrix L along the warping path. Through dynamic programming, the DTW algorithm searches for the optimal warping path between X and Y that minimizes this warping cost. This optimal warping path defines a useful distance measure between series X and Y through its associated warping cost. For the current experiment, a synthesized sentence was assessed by first calculating for each combined AAM parameter trajectory a DTW distance measure as the cost of warping the synthesized parameter trajectory towards the ground-truth trajectory. This value is then normalized by the length of the warping path in order to cancel out the influence of the length of the sentence. For each sentence, a final distance measure was calculated as the weighted sum of these normalized DTW costs, where the weights were chosen according to the total model variance that is explained by each AAM parameter.
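A minimal sketch of this per-trajectory measure is given below. It implements only the basic DTW recursion with unit steps and the path-length normalisation described above; the exact step conditions and weighting used in the experiments may differ.

```python
import numpy as np

# Minimal DTW sketch for one parameter trajectory: accumulate the cheapest warping
# path through the local cost matrix and return the path-length-normalised cost,
# as used above to compare a synthesised trajectory with its ground truth.
def dtw_cost(x, y):
    """x, y: 1-D arrays holding one combined-AAM parameter trajectory each."""
    n, m = len(x), len(y)
    L = np.abs(x[:, None] - y[None, :])                # local cost matrix
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i, j] = L[i - 1, j - 1] + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    # Recover the warping-path length by tracing back the cheapest predecessors.
    i, j, steps = n, m, 0
    while (i, j) != (0, 0):
        steps += 1
        moves = {(i - 1, j - 1): D[i - 1, j - 1], (i - 1, j): D[i - 1, j], (i, j - 1): D[i, j - 1]}
        i, j = min(moves, key=moves.get)
    return float(D[n, m]) / steps                      # normalise by the path length

# A sentence-level score is then the weighted sum of these normalised costs over all
# AAM parameters, weighted by the model variance each parameter explains.
```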
The use of a DTW cost has been suggested in [Theobald and Matthews, 2012], where it was concluded that this distance correlates well with subjective evaluations, as it measures the frame-wise distance between the synthetic and ground-truth visual speech as well as the temporal relationship between these two signals. The results obtained are visualized in figure 6.10.

[Figure 6.10: DTW-based evaluation of the final synthesis result (mean distances). The results obtained using a phoneme-based speech labelling, the results obtained using the standardized Nx1 viseme set and the results obtained using various final NxM viseme sets are shown.]

A statistical analysis using ANOVA with repeated measures and Huynh-Feldt correction indicated significant differences among the values obtained for each group (F(3.40, 254) = 4.05; p = 0.006). An analysis using paired-sample t-tests indicated that the quality of the synthesis output based on NxM visemes is higher than the quality of the synthesis output based on phonemes (p = 0.069 (B25_11), p = 0.01 (B25_20), p = 0.048 (B25_50)) and on Nx1 visemes (p = 0.025 (B25_11), p = 0.002 (B25_20), p = 0.020 (B25_50)). The best results were obtained using the B25_20 visemes, although the differences among the different NxM viseme sets were not found to be significant. These results are in line with the evaluations of the segment selection quality that were described in section 6.6.2.2. However, in the current test the labelling approach using 50 distinct visemes scores as well as the other NxM viseme-based approaches. It appears that the concatenations and optimizations in the final stages of the synthesis partly cover the quality differences between the various speech labelling approaches that were measured at the segment selection stage.

Subjective evaluation

In addition to the objective evaluation, two subjective perception experiments were performed in order to compare the achieved synthesis quality using the different speech labelling approaches. For this, 20 randomly selected sentences from the full-size database were resynthesized, using the limited database DB1 as selection data set. It was ensured that the target sentences were not part of this database. The original database transcript was used as text input and the original database auditory speech was used as audio input. Optimal settings were used for the concatenation smoothing and for the other synthesis optimizations. For verification purposes, the DTW-based objective evaluation that was described in section 6.6.2.3 was repeated for the 20 sentences that were used in the subjective experiments, which resulted in observations comparable to the results that were obtained using the larger test set. In a first subjective test, the differences between the NxM viseme sets B25_11, B25_20, B25_50 and A25_20 were investigated.
10 people participated in the experiment (8 male, 2 female; 9 of them aged 24-32, 1 aged 60), of whom 7 can be considered speech technology experts. The samples were shown pairwise to the participants, considering all comparisons among the four approaches under test. Both the order of the comparison types and the order of the sample types within each pair were randomized. The participants were asked to give their preference for one of the two samples of each pair using a 5-point comparative MOS scale [-2, 2]. They were instructed to answer "0" if they had no clear preference for one of the two samples. The test instructions told the participants to pay attention both to the naturalness of the mouth movements and to how well these movements cohere with the auditory speech that is played along with the video. The key question of the test read as follows: "How much are you convinced that the person you see in the sample actually produces the auditory speech that you hear in the sample?". The results of the test are visualized in figure 6.11, and the results of an analysis using Wilcoxon signed-rank tests are given in table 6.5.

[Figure 6.11: Subjective test results evaluating the synthesis quality using various NxM visemes. The histograms show for each comparison the participants' preference for the left/right sample type on a 5-point scale [-2, 2].]

Table 6.5: Subjective test results evaluating the synthesis quality using various NxM visemes. Wilcoxon signed-rank analysis.
  Comparison          Z         Sign.
  B25_11 - B25_20    -1.54      p = 0.122
  B25_20 - B25_50    -0.586     p = 0.558
  B25_11 - B25_50    -0.607     p = 0.544
  B25_20 - A25_20    -1.23      p = 0.22

The results show a slight preference for the syntheses using the B25_20 and B25_50 visemes, but none of the differences between the methods were shown to be significant (see table 6.5). This result is in line with the results obtained in the objective DTW-based evaluation. In conclusion, the B25_20 labels were selected as the most preferable viseme set for synthesis using database DB1, due to the slight preference for this method in both the objective and the subjective test. In addition, this labelling fits best with the assumption that an automatic viseme classification should identify more than 11 visemes but fewer than the number of phonemes.

In a final experiment, the NxM viseme labelling approach B25_20 was subjectively compared with a phoneme-based and an Nx1 viseme-based synthesis. A new perception experiment was conducted, with a set-up similar to the previous experiment. In the current test all comparisons between the three approaches under test were evaluated by 11 participants (9 male, 2 female, aged 24-60). Six of them can be considered speech technology experts. Figure 6.12 visualizes the test results obtained, and the results of an analysis using Wilcoxon signed-rank tests are given in table 6.6.

[Figure 6.12: Subjective test results evaluating the synthesis quality using the most optimal NxM visemes, the standardized Nx1 visemes and a phoneme-based speech labelling. The histograms show for each comparison the participants' preference for the left/right sample type on a 5-point scale [-2, 2].]

Table 6.6: Subjective test results evaluating the synthesis quality using the most optimal NxM visemes. Wilcoxon signed-rank analysis.
  Comparison          Z         Sign.
  PHON - B25_20      -2.08      p = 0.037
  PHON - STDVIS      -0.064     p = 0.949
  B25_20 - STDVIS    -3.17      p = 0.002
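For completeness, the kind of analysis reported in tables 6.5 and 6.6 can be reproduced with a standard statistical package; the sketch below assumes SciPy and uses invented scores purely for illustration.

```python
from scipy.stats import wilcoxon

# Sketch of the analysis behind tables 6.5 and 6.6, assuming SciPy is available.
# Each entry is one comparative MOS score on the [-2, 2] scale for one sample pair of
# a given comparison; the numbers below are made up for illustration only.
scores_B25_20_vs_STDVIS = [1, 0, 2, 1, 0, 1, -1, 2, 0, 1, 1]
stat, p = wilcoxon(scores_B25_20_vs_STDVIS)   # tests the scores against "no preference"
print(f"W = {stat}, p = {p:.3f}")             # SciPy reports the W statistic rather than a Z value
```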
The results obtained show behaviour similar to the results from the DTW-based objective evaluation that was described in section 6.6.2.3. The synthesis based on the B25_20 NxM visemes was rated significantly better than the synthesis based on phonemes and the synthesis based on standard Nx1 visemes. On the other hand, from the histograms it is clear that for many comparison pairs the test subjects answered "no difference". Also, in a substantial number of cases the phoneme-based synthesis was preferred over the NxM viseme-based synthesis. Feedback from the test subjects and a manual inspection of their answers pointed out that they often found it difficult to assess the difference between the two test samples of each comparison pair. This was mainly due to small local errors in the synthetic visual speech. These local audiovisual incoherences degraded the perceived quality of the whole sample, even if the sample was of overall higher quality than the other test sample of the comparison pair. This observation is in line with earlier observations in similar subjective perception experiments described in this thesis.

6.7 Summary and conclusions

For some time now, the use of visemes to label visual speech data has been well-established. This labelling approach is often used in visual speech analysis or synthesis systems, which construct the mapping from phonemes to visemes as a many-to-one relationship. In this chapter the usage of both standardized and speaker-dependent English many-to-one phoneme-to-viseme mappings in concatenative visual speech synthesis was evaluated. A subjective experiment showed that the viseme-based syntheses were unable to increase (or even decreased) the attained synthesis quality compared to a phoneme-based synthesis. This is likely explained by the limited power of many-to-one phoneme-to-viseme mappings to accurately describe the visual speech information. As every instance of a particular phoneme is mapped on the same viseme, these viseme labels are incapable of describing visual coarticulation effects. This implies the need for a many-to-many phoneme-to-viseme mapping scheme, in which on the one hand instances of different phonemes can be mapped on the same viseme, and on the other hand multiple instances of the same phoneme can be mapped on different visemes.

Using a large Dutch audiovisual speech database, a novel approach to construct many-to-many phoneme-to-viseme mapping schemes (i.e., so-called "context-dependent" visemes) was designed. In a first step, decision trees were trained in order to cluster the visual appearances of the phoneme instances from the speech database. The mapping from phonemes to these tree-based visemes is based on several properties of the phoneme instance itself and on properties of its neighbouring phoneme instances. Several tree configurations were evaluated, from which two final approaches were selected. An objective evaluation of these tree-based speech labels showed that they are indeed able to describe the visual speech information more accurately than phoneme-based or many-to-one viseme-based approaches. However, the tree-based viseme sets contain too many distinct labels for practical use in visual speech analysis or synthesis applications.
Therefore, a second clustering step was performed, in which the tree-based visemes were further clustered into limited viseme sets defining 11, 20 or 50 distinct labels. Again these viseme labelling approaches were objectively evaluated, from which it could be concluded that they do improve the visual speech labelling over phonemes or many-to-one phoneme-to-viseme mappings. This is an important result, since the new viseme labels permit an easy analysis or synthesis of visual speech, as they need less than half the number of distinct labels to describe the speech information as compared to phoneme-based labels.

In the second part of this research, the use of the different speech labelling approaches for concatenative visual speech synthesis was assessed. A first conclusion that could be drawn is that, in case an extensive database is provided to the synthesizer, the influence of the speech labelling used (phonemes or many-to-many visemes) on the attainable synthesis quality is rather limited. Good quality segment selection can be achieved either way, provided that appropriate selection costs are applied. Furthermore, it was found that the more speech data is available for selection, the higher the number of distinct speech labels that should be used for an optimal synthesis result (probably there exists an upper limit to this number, but this was not investigated due to its limited practical use). The experiments also showed that for synthesis using a large database, a many-to-many viseme-based speech labelling could be preferable in order to reduce the synthesis time, since the more accurate speech labelling permits a stronger pruning of the number of candidate segments for each target speech segment. Next, the behaviour of the synthesis was evaluated in case only a limited amount of speech data is available for selection. In this case, the selection based on many-to-many phoneme-to-viseme mappings does improve the segment selection quality in comparison with phoneme-based or many-to-one phoneme-to-viseme based systems. Finally, the output synthetic visual speech signals resulting from syntheses based on the different speech labelling approaches were evaluated. In an objective evaluation it was found that, when only a limited database is provided to the synthesizer, a synthetic speech signal that is closer to the ground truth is achievable when a speech labelling based on a many-to-many phoneme-to-viseme mapping is applied. To verify this result, a subjective perception experiment was conducted. The results obtained from this subjective test show that human observers indeed prefer the syntheses based on a many-to-many phoneme-to-viseme mapping scheme over syntheses based on phonemes or on a many-to-one phoneme-to-viseme mapping.

This chapter explained how many-to-many phoneme-to-viseme mappings can be constructed, and their improved accuracy in describing the visual speech information has been shown. This is an important result, since this kind of viseme definition was still absent from the literature. As neither for English nor for other common languages a reference many-to-many mapping scheme has been defined, many-to-one phoneme-to-viseme mappings are typically still used for a variety of applications. Although the novel many-to-many mappings described in this chapter were constructed for Dutch, a similar approach can be used to design many-to-many mapping schemes for English or other languages.
This chapter also discussed the effect of applying the novel phoneme-to-viseme mappings for concatenative speech synthesis. Similarly, it would be very interesting to investigate how the use of the many-to-many phoneme-to-viseme mappings influences the performance of other applications in the field of visual speech analysis and synthesis as well. For instance, many-to-many phoneme-to-viseme mappings have a high potential for usage in rule-based visual speech synthesis (see section 2.2.6.2). In that particular application, the necessary prior definition of the various model configurations that correspond to a particular viseme is simplified by the limited number of distinct visemes that are used. Based on a given target phoneme sequence, an accurate target viseme sequence can be constructed using the phoneme-to-viseme mapping scheme. This way, visual coarticulation effects are directly modelled in the target speech description. In addition, the synthesis workflow may become simpler, since the additional tweaking of the synthesized model parameters to simulate the visual coarticulation effects, which is standard for model-based synthesis, may become superfluous. Some of the techniques, experiments and results mentioned in this chapter have been published in [Mattheyses et al., 2011b] and [Mattheyses et al., 2013].

7 Conclusions

7.1 Brief summary

Nowadays, it is possible to produce powerful computer systems at a limited production cost. This causes a wide variety of devices, from heavy industrial machinery to small household appliances, to be controlled by a computer system that also arranges the communication between the device and its users. People already interact with countless computer systems in everyday situations, and this kind of human-machine communication will become even more important in the near future. In the most optimal scenario, the interaction with the devices feels as familiar and as natural as the way in which humans communicate among themselves. Speech has always been the most important means of communication between humans. Because of this, speech can be considered the optimal means of communication between a user and a computer system too. This kind of interaction enhances both the naturalness and the ease of the user experience, and it also increases the accessibility of the computer system. One of the requirements for allowing speech-based human-machine interaction is the capability of the computer system to generate novel speech signals containing any arbitrary spoken message. Speech is a truly multimodal means of communication: the message is conveyed in both an auditory and a visual speech signal. The auditory speech information is encoded in a waveform that contains a sequence of speech sounds produced by the human speech production system. Some of the articulators of this speech production system are visible to an observer looking at the face of the speaker. The variations of these visual articulators, occurring while uttering the speech, define the visual speech signal. It is well known that an optimal conveyance of the message requires that both the auditory and the visual speech signal can be perceived by the receiver. Similarly, the use of audiovisual speech is the most optimal way for a computer system to transmit a message towards its users. To this end, the system has to generate a new waveform that contains the appropriate speech sounds.
In addition, it has to generate a new video signal that displays a virtual speaker exhibiting the speech gestures that correspond to the synthetic auditory speech information. When the target speech message is given as input to the synthesizer by means of text, the speech generation process is referred to as "audiovisual text-to-speech synthesis". Currently, data-driven synthesis is considered the most efficient approach for generating high quality synthetic speech. The standard strategy is to perform the audiovisual text-to-speech synthesis in multiple stages. In a first step, the synthetic auditory speech is generated by an auditory text-to-speech system. Next, the synthetic visual speech signal is generated by another synthesis system that performs unimodal visual speech synthesis. Finally, both synthetic speech signals are synchronized and multiplexed to create the final audiovisual speech signal. This strategy allows the synthesis of high-quality auditory and visual speech signals; however, it results in the presentation of non-original combinations of auditory and visual speech information to the observer. This means that the level of coherence between both synthetic speech modes will be lower than the multimodal coherence seen in original audiovisual speech. To allow an optimization of the audiovisual coherence in the synthetic speech, a single-phase synthesis strategy that simultaneously generates the auditory and the visual speech signal is favourable. Surprisingly, such a synthesis strategy has only been adopted in some exploratory studies.

In the first part of this thesis, a single-phase audiovisual text-to-speech synthesizer was developed, which adopts a data-driven unit selection synthesis strategy to create a photorealistic 2D audiovisual speech signal. The synthesizer constructs the synthetic audiovisual speech by concatenating audiovisual speech segments that are selected from a database containing original audiovisual speech recordings. By concatenating original combinations of auditory and visual speech information, original audiovisual articulations are seen in the synthetic speech. This ensures a very high level of audiovisual coherence in the output speech signal. When longer speech segments, containing multiple consecutive phones, are copied from the database to the synthetic speech, original audiovisual coarticulations are seen in the synthesized speech too. Audiovisual selection costs are employed to select segments that exhibit appropriate properties in both the acoustic and the visual domain. The auditory and the visual concatenations are smoothed by calculating intermediate sound samples and video frames.

Subjective perception experiments pointed out that the level of coherence between both modes of an audiovisual speech signal indeed influences the perceived quality of the speech. It was observed that the perceived quality of a visual speech signal is rated highest when it is presented together with an acoustic speech signal that is in perfect coherence with the displayed speech gestures. This means that an audiovisual text-to-speech system should not only aim to enhance the individual quality of both output speech modes, but should also aim for a maximal level of coherence between these two speech modalities. It is necessary that an observer truly believes that the virtual speaker, displayed in the synthetic visual speech, actually uttered the speech sounds that are heard in the presented auditory speech signal.
For instance, this requires that for every non-standard articulation present in the auditory speech mode, even a non-optimal one, an appropriate visual counterpart is seen in the audiovisual speech as well (and vice versa). These observations encouraged the further development of the single-phase audiovisual text-to-speech synthesis approach. In an attempt to enhance the individual quality of the synthetic visual speech mode, multiple audiovisual optimal coupling techniques were developed. These approaches are able to increase the smoothness of the generated visual speech at the expense of a lowered level of audiovisual coherence. It was found that the observed benefits of this optimization do not hold in the audiovisual case, since the introduced local audiovisual asynchronies and/or the unimodal reduction of the articulation strength in the visual speech mode affected the perceived audiovisual speech quality. These results indicate that any optimization of the proposed audiovisual text-to-speech synthesis technique must be verified not to affect the level of audiovisual coherence in the output speech.

The next part of the research investigated the enhancement of the synthetic speech quality by parameterizing the original visual speech recordings using an active appearance model. This allows the calculation of accurate selection costs that also take visual coarticulation effects into account. In addition, the speech database was normalized by removing non-speech related and thus undesired variations from the original visual speech recordings. The visual concatenations were further enhanced by diversifying the smoothing strength among the model parameters and by separately optimizing the smoothing strength for each particular concatenation point. This way, a smooth and natural appearing synthetic visual speech signal can be generated without a significant reduction of the visual articulation strength (and thus preserving the level of audiovisual coherence). Furthermore, unnaturally fast articulations are filtered from the synthetic visual speech by a spectral smoothing technique. Subjective perception experiments proved that the proposed model-based audiovisual text-to-speech synthesis indeed produces a higher quality synthetic visual speech signal compared to the visual speech generated by the initial single-phase synthesis system. In addition, since the increase in observed speech quality holds in the audiovisual case as well, it can be concluded that the proposed optimization techniques allow the synthetic visual speech quality to be improved independently, without significantly affecting the level of audiovisual coherence in the output speech.

Another enhancement of the synthesis quality was achieved by providing the synthesizer with a new extensive Dutch audiovisual speech database, containing high quality auditory and visual speech recordings. The database was recorded using an innovative audiovisual recording set-up that optimizes the speech recordings for usage in audiovisual speech synthesis. This allowed the development of the first-ever system capable of high-quality photorealistic audiovisual speech synthesis for Dutch. From a Turing test scenario it was concluded that the database allows the synthesis of speech gestures that are almost indistinguishable from the speech gestures seen in original visual speech. In addition, the use of the new database unequivocally enhanced the attainable audiovisual speech quality too.
Another enhancement of the synthesis quality was achieved by providing the synthesizer with a new, extensive Dutch audiovisual speech database containing high-quality auditory and visual speech recordings. The database was recorded using an innovative audiovisual recording set-up that optimizes the speech recordings for use in audiovisual speech synthesis. This made it possible to develop the first system capable of high-quality photorealistic audiovisual speech synthesis for Dutch. From a Turing test scenario it was concluded that the database allows the synthesis of speech gestures that are almost indistinguishable from the speech gestures seen in original visual speech. In addition, the use of the new database unequivocally enhanced the attainable audiovisual speech quality too. This made it possible to perform a final evaluation of the single-phase audiovisual speech synthesis approach. A subjective experiment compared the audiovisual speech quality generated by the single-phase synthesizer with that generated by a comparable two-phase synthesis. The two-phase synthesis generated both output modalities separately, by an individual segment selection on coherent original audiovisual speech data. The experiment showed that observers prefer the single-phase synthesis results, especially when the synthetic speech modalities are very close to original speech signals. In the final part of the thesis, it was investigated how suitable the proposed techniques for audiovisual text-to-speech synthesis are for unimodal visual text-to-speech synthesis as well. In addition, the common practice of using a many-to-one phoneme-to-viseme mapping for visual speech synthesis was evaluated. It was found that neither standardized nor speaker-specific many-to-one phoneme-to-viseme mappings enhance the visual speech synthesis quality compared to a synthesis based on phoneme speech labels. This motivated the construction of novel many-to-many phoneme-to-viseme mapping schemes. By analysing the Dutch audiovisual speech database using decision trees and k-means clustering techniques, multiple sets of “context-dependent” viseme labels for Dutch were defined. It was shown that these context-dependent visemes describe the visual speech information more accurately than phonemes and than many-to-one viseme labels. In addition, it was shown that the context-dependent viseme labels are the most preferable speech labeling approach for concatenative visual speech synthesis.

7.2 General conclusions

Speech synthesis has been the subject of much research over the years. The synthesis of high-quality speech is not a straightforward task, since humans are extremely experienced in using speech as a means of communication. This implies that people are very well acquainted with the perception of original speech signals. Consequently, it is very hard to generate a synthetic auditory or a synthetic visual speech signal that perfectly mimics original speech information. The problem becomes even harder when audiovisual speech is synthesized: in that case it is not only necessary that both synthetic speech modes closely resemble an original speech signal conveying the same speech message, but the audiovisual presentation of the two synthetic speech modes must also appear natural to an observer. This means that the coherence between the presented auditory and the presented visual speech information must be high enough to make the observer believe that the virtual speaker indeed uttered the acoustic waveform that plays together with the video signal. The results obtained in this thesis indicate that the most preferable synthesis strategy to achieve high-quality audiovisual text-to-speech synthesis consists in simultaneously generating both synthetic speech modes. This way, the level of audiovisual coherence in the synthetic speech can be maximized to ensure that the individual quality of the synthetic speech modes is not affected when they are presented audiovisually to an observer.
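To make this coupling concrete, the sketch below shows one way a single-phase selection can score a candidate segment jointly in both modalities, so that a segment that fits well acoustically but poorly visually (or vice versa) is penalised within one and the same search. The feature names, dictionary layout, and weights are purely illustrative assumptions, not the thesis implementation.

```python
import numpy as np

# Illustrative sketch (hypothetical feature names and weights): a joint
# audiovisual selection cost combining acoustic and visual target costs with
# acoustic and visual join costs in a single score.

def audiovisual_cost(target, candidate, prev_candidate, w):
    cost = 0.0
    cost += w["tc_audio"] * np.linalg.norm(target["audio"] - candidate["audio"])
    cost += w["tc_visual"] * np.linalg.norm(target["visual"] - candidate["visual"])
    if prev_candidate is not None:
        cost += w["jc_audio"] * np.linalg.norm(
            prev_candidate["audio_end"] - candidate["audio_start"])
        cost += w["jc_visual"] * np.linalg.norm(
            prev_candidate["visual_end"] - candidate["visual_start"])
    return cost
```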
As in auditory-only speech synthesis, the current focus in the field of visual speech synthesis is on data-driven synthesis approaches: concatenative synthesizers that directly reuse the original speech data, and prediction-based synthesizers that train complex statistical prediction models on the original speech data. Both techniques are suited to performing a single-phase audiovisual speech synthesis. A single-phase concatenative synthesis approach was developed in this thesis, which reuses original audiovisual articulations and coarticulations to create the synthetic speech. On the other hand, single-phase prediction-based synthesis can be achieved by predicting the auditory and the visual speech features simultaneously from the target speech description, by means of a prediction model that has been trained on original audiovisual speech data. Obviously, the simultaneous synthesis of both speech modes entails some additional difficulties in maximizing the quality of the synthetic audiovisual speech. Therefore, future research in the field should focus on how to combine the state-of-the-art techniques for unimodal auditory and unimodal visual speech synthesis in order to achieve a high-quality single-phase synthesis of audiovisual speech. Up until now, single-phase audiovisual speech synthesis has mainly been the topic of exploratory studies. This thesis constitutes one of the first efforts that not only suggests a promising single-phase synthesis approach, but also aims to develop improvements to the synthesis in order to attain a synthesis quality that allows relevant conclusions to be drawn. Hopefully, the research described in this work will inspire other researchers to further explore the single-phase synthesis approach as well. For instance, it (partially) inspired researchers at the Université de Lorraine (LORIA) in their development of a single-phase concatenative audiovisual speech synthesizer that creates a 3D-rendered visual speech signal [Toutios et al., 2010b] [Toutios et al., 2010a] [Musti et al., 2011]. That system simultaneously selects from an audiovisual speech database original auditory speech information and PCA coefficients modelling original 3D facial landmark variations. In applications in which an original auditory speech signal has already been provided, a unimodal visual text-to-speech synthesis can be used to generate the corresponding visual speech mode. Surprisingly, at present the common technique in this field still uses many-to-one phoneme-to-viseme mappings to label the visual speech information. This is in contradiction with the well-known fact that the mapping from phonemes to visemes behaves more like a many-to-many mapping, due to the variable visual representation of each phoneme caused by visual coarticulation effects. The results obtained in this thesis indeed raise serious questions about the use of the standardized viseme set defined in the MPEG-4 standard. It was shown that a more accurate labeling of the visual speech is feasible using context-dependent viseme labels. This technique makes it possible to perform good-quality concatenative visual speech synthesis using only a limited amount of original speech data. It can be expected that the use of context-dependent viseme labels will enhance the attainable synthesis quality of other visual speech synthesis strategies as well. Furthermore, context-dependent visemes are promising for use in the field of visual speech analysis too.
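The thesis derives such viseme labels with decision trees and k-means clustering on AAM features; the following is a minimal, self-contained sketch of the clustering step only, applied to placeholder data. The context labels, feature values, and the use of scikit-learn are assumptions for illustration, not the actual analysis pipeline.

```python
import numpy as np
from sklearn.cluster import KMeans

# Illustrative sketch: cluster mean AAM parameter vectors of phone realisations
# (labelled here with a hypothetical "phone+right-context" notation) so that
# visually similar realisations end up in the same data-driven viseme class.

phone_ids = ["t+a", "t+o", "p+a", "p+o"]          # hypothetical context labels
aam_means = np.random.rand(len(phone_ids), 12)    # placeholder AAM features

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(aam_means)
visemes = dict(zip(phone_ids, kmeans.labels_))
print(visemes)   # e.g. {'t+a': 0, 't+o': 1, ...} -> context-dependent groups
```

In a real setting the number of clusters would be chosen per scheme (e.g. the 7-, 9-, 11-, 17- and 22-group mappings listed in appendix C), and a decision tree over phonetic context can then be grown to predict the cluster label for unseen contexts.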
7.3 Future work

7.3.1 Enhancing the audiovisual synthesis quality

The audiovisual text-to-speech synthesizer, provided with the extensive Dutch audiovisual speech database, is capable of generating high-quality audiovisual speech signals. Chapter 4 elaborated on various optimizations that enhance the individual quality of the synthetic visual speech mode. Similarly, some techniques found in state-of-the-art auditory speech synthesis systems can be incorporated in the audiovisual speech synthesis in order to enhance the individual quality of the synthetic auditory speech mode. For instance, some of the advances that were made in the laboratory’s auditory text-to-speech research can be transferred to the audiovisual domain, such as the semi-supervised annotation and prediction of perceptual prominence, word accents, and prosodic phrase breaks. Note that for each enhancement it has to be verified that it does not significantly affect the coherence between the synthetic visual mode and the optimized synthetic auditory speech mode. One of the most difficult tasks in optimizing the synthesis parameters of a concatenative synthesizer is the definition of an appropriate weight distribution for the selection costs. In the proposed audiovisual text-to-speech system, these weights were optimized manually. However, it is likely that a more appropriate set of weights can be calculated using an automatic parameter optimization technique. The difficulty is that such techniques require an objective “error” measure that denotes the quality of the synthesis obtained using a particular weight distribution. This error measure has to take the individual quality of both the synthetic auditory and the synthetic visual speech into account [Toutios et al., 2011]. This means that the relative importance of both modes must first be assessed in order to define the error measure. In addition, it is likely that the synthesis quality can be further optimized by learning multiple weight distributions, which allows the most suitable set of selection cost weights to be used for each target speech segment. Such an approach was already explored by learning context-dependent selection cost weights for the laboratory’s auditory speech synthesizer [Latacz et al., 2011]. Another interesting strategy that can be ported from the auditory domain to the audiovisual domain is the “hybrid” synthesis technique, in which a first stage estimates the properties of the synthetic speech using a prediction-based synthesizer. Afterwards, a second stage consists of the physical synthesis, in which these predictions are used as target descriptions for a concatenative synthesis system. This way, the prediction-based system’s power to accurately estimate the desired speech properties is combined with the power of concatenative synthesis to generate realistic output signals by reusing original speech data. In the audiovisual case, the ideal system would simultaneously predict the acoustic and the visual speech features (i.e., a single-phase approach) by means of a statistical model that was trained on audiovisual speech data. Afterwards, a concatenative synthesis approach similar to the audiovisual speech synthesizer described in this thesis can be employed to realize the actual physical synthesis. This requires the calculation of target selection costs that measure the audiovisual distance between the candidate speech segments and the predicted properties of the output speech.
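In its simplest form, the automatic weight optimization discussed above could be a direct search over weight vectors scored by an objective error measure. The sketch below is such a minimal random-search loop; the synthesize and error_measure callables are hypothetical placeholders (a batch synthesis of held-out sentences and an objective audiovisual error, respectively), and the whole approach is an assumption rather than the method used in the thesis, where the weights were tuned manually.

```python
import numpy as np

# Illustrative sketch of automatic selection-cost weight tuning by random
# search. `synthesize(w)` and `error_measure(result)` are hypothetical hooks:
# the former synthesizes held-out sentences with weight vector w, the latter
# returns an objective error combining acoustic and visual distances to the
# corresponding natural recordings.

def tune_weights(synthesize, error_measure, n_costs, n_trials=100, seed=0):
    rng = np.random.default_rng(seed)
    best_w, best_err = None, np.inf
    for _ in range(n_trials):
        w = rng.uniform(0.0, 1.0, size=n_costs)   # candidate weight vector
        err = error_measure(synthesize(w))
        if err < best_err:
            best_w, best_err = w, err
    return best_w, best_err
```

More refined strategies (grid search, gradient-free optimizers, or learning context-dependent weight sets as in [Latacz et al., 2011]) would replace the random sampling, but the need for a well-defined objective error measure remains the same.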
7.3.2 Adding expressions and emotions

The proposed audiovisual speech synthesizer uses prosody-related binary target costs in order to copy the prosodic patterns found in the speech database to the synthetic speech. Since both the LIPS2008 database and the AVKH database contain expression-free speech recordings exhibiting a neutral prosody, the synthesized speech shows neither expressions nor emotions. The background video signal, which displays the parts of the face of the virtual speaker that are not located in the mouth area, is designed to display a neutral visual prosody too. The reason is that a neutral visual prosody is a safe choice: adding more expressiveness to the face can increase the naturalness of the synthetic speech, but it also potentially harms the speech quality when the expressions themselves are not perceived as natural or when the displayed expressions are not optimally correlated with the conveyed speech message. The synthesis of expressive speech is a “hot topic” in the fields of auditory and visual speech synthesis. These techniques can also be ported to the audiovisual domain: expressiveness can be added to the auditory speech signal to make the synthetic speech more “lively” and real, while the synthetic visual speech signal is highly suited to adding a particular emotional state to the message. Some exploratory work was already conducted to investigate strategies for adding the synthesis of expressions and emotions to the audiovisual text-to-speech synthesis system described in this thesis. To this end, additional speech data was recorded (using the audiovisual recording set-up discussed in chapter 5) during which the speaker simulated happy and sad emotions. Some example frames from these recordings are given in figure 7.1. Figure 7.1 illustrates that a change in the emotional state of the speaker causes variations both in the visual articulations (seen in the mouth area of the video frames) and in the appearance of the other parts of the face (e.g., eyes, eyebrows, etc.). Two separate strategies are needed to mimic such original expressions in the synthetic visual speech generated by the audiovisual text-to-speech synthesizer. First, the variations of the mouth area have to be synthesized not only based on the target phoneme sequence, but also based on the target emotion or expression. This means that the speech database provided to the synthesizer should contain even more repetitions of each phoneme, since the synthesizer has to be able to select all variations based on context, prosody and expression. Next, a “mouth” AAM has to be trained on this expressive original speech data. It will be a challenge to construct an AAM that is capable of accurately regenerating all original mouth appearances, since for the expressive speech many more variations must be modelled than for the AAMs that were trained on the neutral speech databases in chapter 4 and chapter 5. Furthermore, in order to synthesize emotional visual speech, the necessary visual prosody has to be added to the background video signal. To this end, the “face” AAM could be extended to model variations of the face corresponding to emotions/expressions as well. Afterwards, it will have to be investigated how to generate realistic transitions between consecutive expressions. It might be necessary to explore other parameterizations that allow physical properties to be matched to one single parameter (and vice versa).
This way, the displayed visual prosody could be influenced based on the literature on the relation between emotions and facial gestures [Grant, 1969] [Swerts and Krahmer, 2005] [Granstrom and House, 2005] [Gordon and Hibberts, 2011].

Figure 7.1: Facial expressions related to a happy emotion. Notice that both the appearance of the mouth area and the appearance of the “background” face vary in relation to the expression.

Once the new expressive original audiovisual speech data has been appropriately parameterized, new selection costs will be necessary that promote the selection of segments exhibiting the desired expression. Especially in the auditory mode it will be a challenge to synthesize realistic emotional prosodic patterns. In addition, an appropriate approach for denoting the targeted expressions in the synthetic speech will have to be investigated.

7.3.3 Future evaluations

Throughout this thesis, many evaluations of the synthetic visual and the synthetic audiovisual speech signals have been mentioned. For instance, objective measures were defined to assess the smoothness of the visual speech and to calculate the distance between a synthesized and an original version of the same sentence. In addition, many subjective evaluations were performed that used MOS or comparative MOS ratings to denote the perceived speech quality. Apart from these evaluation strategies, other test strategies are possible that are likely to offer useful information about the attained synthesis quality. In chapter 3, a subjective assessment of the audiovisual coherence was performed. It was concluded that human observers are often unable to distinguish between synchrony issues and coherence issues. Furthermore, it appeared that the perceived level of synchrony/coherence is influenced by the quality of the displayed audiovisual speech. A possible solution would be to perform such evaluations objectively, by measuring the mathematical correspondence between the auditory and the visual speech information. To this end, various strategies are possible, such as the measures proposed in [Slaney and Covell, 2001] and in [Bredin and Chollet, 2007]. A general overview of interesting audiovisual correlation measures is given in [Bredin and Chollet, 2006]. Such correlation measures could be employed to compare the level of audiovisual coherence between original and various synthesized audiovisual speech signals.
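As a rough illustration of such an objective coherence measure, the sketch below computes the canonical correlation between time-aligned acoustic and visual feature streams. It is only an assumed, minimal implementation in the spirit of the correlation measures cited above; the feature choices (e.g. MFCCs and AAM parameters) and the use of scikit-learn are illustrative assumptions.

```python
import numpy as np
from sklearn.cross_decomposition import CCA

# Illustrative sketch: canonical correlation analysis between time-aligned
# acoustic features (e.g. MFCCs) and visual features (e.g. AAM parameters)
# as a rough objective indicator of audiovisual coherence.

def av_coherence(audio_feats, visual_feats, n_components=1):
    cca = CCA(n_components=n_components)
    a, v = cca.fit_transform(audio_feats, visual_feats)
    # correlation of the first pair of canonical variates
    return np.corrcoef(a[:, 0], v[:, 0])[0, 1]

# audio_feats and visual_feats are (frames, dims) arrays sampled at the same rate
score = av_coherence(np.random.rand(500, 13), np.random.rand(500, 10))
```

Comparing such scores for original recordings and for various synthesized versions of the same sentences would give an objective counterpart to the subjective coherence assessments described above.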
This thesis focused on the synthesis of full-length sentences. This kind of speech sample was also used in the subjective perception experiments. Unfortunately, it was noticed that local errors in the synthetic speech signal often degrade the perceived quality of the whole sentence. This easily disrupts the subjective experiment, since the other parts of the sentence might have been of very good quality. Therefore, one option is to assess the quality of the synthetic speech on a much smaller scale, by evaluating the synthetic articulations of isolated words or sounds. An interesting approach was suggested by Cosker et al. [Cosker et al., 2004], which involves the generation of real and synthetic McGurk stimuli. A McGurk stimulus consists of a short sound sample in which the auditory and the visual speech originate from different phoneme sequences. The audiovisual presentation of these speech modes can cause the perception of yet another speech sound (the so-called McGurk effect [McGurk and MacDonald, 1976]). By measuring the difference between the number of McGurk effects observed with the original stimuli and the number observed with the synthesized stimuli, an estimate of the articulation quality of the speech synthesizer can be made. One of the main benefits of this test strategy is that it constitutes a subjective evaluation of the synthetic auditory speech, the synthetic visual speech, and the combination of these two signals. It could also be interesting to use it to compare the performance of a single-phase synthesis strategy with the more conventional two-phase audiovisual text-to-speech synthesis approach. Chapter 6 discussed the use of phoneme and viseme speech labels for visual speech synthesis. The quality of the resulting synthetic visual speech signals was subjectively evaluated by presenting the signals in combination with original auditory speech signals. This is a valid evaluation, since these audiovisual test samples are similar to the speech signals that would be shown to a user when the visual speech synthesis system is used in a real application. On the other hand, it would also be interesting to perform a separate subjective evaluation of the individual quality of the synthetic visual speech mode (generated by either a visual or an audiovisual speech synthesizer). In comparison with an individual evaluation of auditory speech, such a visual speech-only evaluation is not straightforward, since it is very hard for an observer to judge the quality of a presented unimodal visual speech signal. The reason is that people are generally not capable of comprehending the message conveyed in a visual speech-only signal. A possible solution would be to only use test subjects who possess above-average lip-reading skills (often these are hearing-impaired people). A quality measure for the synthetic visual speech could then consist of the intelligibility score obtained for the synthesized visual speech samples compared to the intelligibility score obtained for the original visual speech samples. One possible problem is that, even for experienced lip-readers, the recognition rate for visual speech-only sentences without context is rather low. This will make it harder to obtain significant differences between the intelligibility scores measured for multiple types of synthesized or original visual speech samples.

A The Viterbi algorithm

The Viterbi algorithm [Viterbi, 1967] is a dynamic programming solution that is used to find the optimal path through a trellis. In unit selection synthesis, it is used to select, from each set of candidate database segments matching a target speech segment, one final database segment with which to construct the synthetic speech. The principle is illustrated in figure A.1, where for each of the $T$ target segments $t_i$ a set of $N$ candidate segments $u_{ij}$ is gathered. The optimal path through the trellis is determined by minimizing a global selection cost that is calculated from target costs and join costs, as illustrated in figure A.2. The target costs $TC$ measure the match between a target segment $t_i$ and a candidate segment $u_{ij}$, and the join costs $JC$ express the cost of moving from a candidate segment matching target $t_i$ to a candidate segment matching target $t_{i+1}$.
Since a high-quality unit selection approach should consider at least about 200 candidate segments per target, synthesizing a standard-length sentence that is made up of 100 diphones involves $200^{100}$ possible sequences to evaluate. Obviously, this number is far too high to perform the unit selection calculation in reasonable time. Therefore, the Viterbi algorithm only evaluates those sequences that can possibly be the optimal one. It is based on the following two principles:
◦ For a single node $u_{ij}$ somewhere in the trellis, only the best path leading to this node needs to be remembered. If it turns out that this particular node $u_{ij}$ is in fact on the global best path, then the node matching the preceding target that was on the best path towards $u_{ij}$ is also on the global best path.
◦ The global best path can only be found if all targets are processed from the beginning to the end. At any given target $t_i$, when moving forward through the trellis, it is possible to find the node $u_{ij}$ with the lowest total selection cost up to this point. By a back-trace from this node, the nodes from all the previous targets that are on the best path towards $u_{ij}$ can be found. However, no matter how low the total selection cost associated with node $u_{ij}$ may be, there is no guarantee that this node will end up on the global best path when the full back-trace from target $t_T$ is performed.

Figure A.1: A trellis illustrating the unit selection problem.
Figure A.2: The various costs associated with the unit selection trellis.

The Viterbi search significantly reduces the time needed to calculate the global best path. When the average time to calculate a target cost is written as $O_{TC}$ and the average time to calculate a join cost is written as $O_{JC}$, the total time to find the global best path through a trellis with $T$ targets and $N$ nodes for each target is $T \times (N \times O_{TC} + N^2 \times O_{JC})$. (For instance, assume that the computer system is able to calculate $10^6$ cost values per second and that the synthesizer uses 4 distinct selection costs. When the best sequence matching $T = 100$ diphones (a standard sentence) must be calculated from $N = 200$ candidates for each target, the straightforward approach that calculates the total cost for each possible sequence would take $200^{100}/250000 \approx 5 \cdot 10^{224}$ seconds to complete. A Viterbi search would only take about 4 seconds to find the optimal path.)

A possible implementation of the Viterbi search goes as follows. Consider a node $u_{ij}$ matching target $t_i$ and a node $u_{(i+1)k}$ matching target $t_{i+1}$. The cost $C_{step}$ of moving from node $u_{ij}$ to node $u_{(i+1)k}$ is calculated as the target cost of node $u_{(i+1)k}$ plus the join cost between these two nodes:
$C_{step}(u_{ij}, u_{(i+1)k}) = TC_{i+1}(k) + JC_i(j, k)$ (A.1)
The total cost $C_{tot}$ associated with arriving in node $u_{(i+1)k}$ via node $u_{ij}$ is then:
$C_{tot}(u_{(i+1)k}) = C_{tot}(u_{ij}) + C_{step}(u_{ij}, u_{(i+1)k})$ (A.2)
Using equation A.2, the total cost for arriving in node $u_{(i+1)k}$ via each possible node corresponding to target $t_i$ can be calculated. This way, the most optimal node matching target $t_i$ for arriving in node $u_{(i+1)k}$ can be determined and remembered. If it later turns out that node $u_{(i+1)k}$ is actually on the global best path, the node matching target $t_i$ that is on the global best path is also known, since it is the node that was remembered as the most optimal way to arrive in node $u_{(i+1)k}$. This principle is repeated from the first target towards the last target. For each node of each target, the best node from the previous target to reach it is remembered, together with the total cost associated with reaching it via this optimal node. When the last target $t_T$ is processed, the node $u_{T\hat{k}}$ that has the lowest total cost of reaching it is chosen as the last node on the global best path. Then, a back-trace occurs in which the node matching target $t_{T-1}$ that was remembered as the most optimal for reaching node $u_{T\hat{k}}$ is added to the global best path. This step is repeated until finally a node matching the first target is added to the global best path, after which the most optimal set of candidate segments that minimizes the global selection cost is known.
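The forward pass and back-trace described above can be summarized in a few lines of code. The following is a minimal sketch, not the thesis implementation; it assumes that the candidates are supplied per target and that target_cost and join_cost are user-provided functions implementing the selection costs.

```python
# Minimal sketch of the Viterbi unit-selection search described above.
# Assumptions: `candidates[i]` is the list of candidate segments for target i,
# and target_cost(t, u) / join_cost(u_prev, u_next) are user-supplied functions.

def viterbi_unit_selection(targets, candidates, target_cost, join_cost):
    """Return the candidate sequence minimising total target + join cost."""
    T = len(targets)
    # total[i][j]: lowest total cost of any path ending in candidate j of target i
    # back[i][j]:  index of the predecessor candidate on that best path
    total = [[target_cost(targets[0], u) for u in candidates[0]]]
    back = [[None] * len(candidates[0])]

    for i in range(1, T):
        row_total, row_back = [], []
        for u_next in candidates[i]:
            tc = target_cost(targets[i], u_next)
            # Only the best way of reaching u_next needs to be remembered.
            best_j, best_cost = min(
                ((j, total[i - 1][j] + join_cost(u_prev, u_next) + tc)
                 for j, u_prev in enumerate(candidates[i - 1])),
                key=lambda x: x[1])
            row_total.append(best_cost)
            row_back.append(best_j)
        total.append(row_total)
        back.append(row_back)

    # Back-trace from the cheapest final node to recover the global best path.
    j = min(range(len(candidates[-1])), key=lambda k: total[-1][k])
    path = [j]
    for i in range(T - 1, 0, -1):
        j = back[i][j]
        path.append(j)
    path.reverse()
    return [candidates[i][j] for i, j in enumerate(path)]
```

The nested loops make the $T \times (N \times O_{TC} + N^2 \times O_{JC})$ behaviour of the search explicit: each of the $N$ candidates per target is scored once against its target and once against every candidate of the previous target.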
B English phonemes

This appendix illustrates the phoneme set that is used by the AVTTS system to perform synthesis for English. It also indicates the classification of the various phonemes that is used to optimize the concatenation smoothing strength for each join individually (see section 4.4.3). Phonemes labelled as protected (P) are less likely to be affected by visual coarticulation effects and have to be clearly visible in the synthetic visual speech to avoid under-articulation. On the other hand, phonemes labelled as invisible (I) are often strongly affected by coarticulation effects. These phonemes can be smoothed more heavily since they should not always be visible in the synthetic speech to avoid over-articulations. The other phonemes are labelled as normal (N).

Table B.1: The English phone set used by the AVTTS system. The first column lists the phonemes in the MRPA notation used by Festival [Black et al., 2013]. The second column lists the phonemes in the standard SAMPA notation [Wells, 1997]. The third column shows an example use of each phoneme and the last column illustrates its classification normal/protected/invisible.

mrpa | sampa | example | class.
# |  | (silence) | N
p | p | put | P
b | b | but | P
t | t | ten | I
d | d | den | I
k | k | can | I
m | m | man | P
n | n | not | N
l | l | like | N
r | r | run | N
f | f | full | P
v | v | very | P
s | s | some | N
z | z | zeal | N
h | h | hat | I
w | w | went | P
g | g | game | I
ch | tS | chain | N
jh | dZ | Jane | N
ng | N | long | N
th | T | thin | N
dh | D | then | N
sh | S | ship | N
zh | Z | measure | N
y | j | yes | N
ii | i: | bean | N
aa | A: | barn | N
oo | O: | born | P
uu | u: | boon | N
@@ | 3: | burn | N
i | I | pit | N
e | e | pet | N
a | { | pat | N
uh | V | putt | N
o | Q | pot | N
u | U | good | N
@ | @ | about | N
ei | eI | bay | N
ai | aI | buy | P
oi | OI | boy | N
ou | @U | no | N
au | aU | now | N
i@ | I@ | peer | N
e@ | e@ | pair | N
u@ | U@ | poor | P

C English visemes

This appendix illustrates the speaker-dependent many-to-one phoneme-to-viseme mapping that was constructed by a hierarchical clustering analysis on the combined AAM parameter values of the video frames from the LIPS2008 database. Table C.1 lists for each English phoneme the viseme group it matches in the mapping schemes on 7, 9, 11, 17, and 22 visemes, respectively. The standardized phoneme-to-viseme mapping defined in MPEG-4 is also given.
Table C.1: Many-to-one phoneme-to-viseme mappings for English. The phonemes are listed using the MRPA notation [Black et al., 2013].

phoneme | 7 | 9 | 11 | 17 | 22 | MPEG
ch | I | I | I | I | I | I
sh | I | I | I | I | I | I
jh | I | I | I | I | I | I
zh | I | I | I | II | II | I
b | II | II | II | III | III | II
m | II | II | II | III | III | II
p | II | II | II | III | III | II
w | II | III | III | IV | IV | VII
h | III | IV | IV | V | V | IX
ii | III | IV | IV | V | V | III
i@ | III | IV | IV | V | V | III
e | III | IV | IV | VI | VI | IV
a | III | IV | IV | VI | VI | V
ei | III | IV | IV | VI | VI | IV
e@ | III | IV | IV | VI | VI | IV
ai | III | IV | IV | VII | VII | V
au | III | IV | IV | VII | VII | VII
o | IV | V | V | VIII | VIII | VI
u@ | IV | V | V | VIII | VIII | VII
oi | IV | V | V | VIII | VIII | VI
oo | IV | V | V | IX | IX | VI
uu | V | VI | VI | X | X | VII
u | V | VI | VI | X | X | VI
y | V | VI | VI | X | XI | III
n | V | VI | VII | XI | XII | VIII
i | V | VI | VII | XI | XII | III
@ | V | VI | VII | XI | XII | IV
k | V | VI | VII | XI | XIII | IX
g | V | VI | VII | XI | XIII | IX
l | V | VI | VII | XI | XIII | VIII
ng | V | VI | VII | XI | XIII | IX
uh | V | VI | VII | XII | XIV | VII
ou | V | VI | VII | XII | XIV | VII
aa | V | VI | VII | XII | XIV | V
@@ | V | VI | VII | XII | XV | IV
th | VI | VII | VIII | XIII | XVI | X
dh | VI | VII | VIII | XIII | XVI | X
t | VI | VII | VIII | XIV | XVII | XI
d | VI | VII | VIII | XIV | XVII | XI
s | VI | VII | VIII | XIV | XVIII | XII
z | VI | VII | VIII | XIV | XVIII | XII
f | VI | VIII | IX | XV | XIX | XIII
v | VI | VIII | IX | XV | XX | XIII
r | VI | VIII | X | XVI | XXI | XIV
(silence) | VII | IX | XI | XVII | XXI | XV

Bibliography

[Abrantes and Pereira, 1999] Abrantes, G. and Pereira, F. (1999). Mpeg-4 facial animation technology: Survey, implementation, and results. IEEE Transactions on Circuits and Systems for Video Technology, 9(2):290–305. [Acapela, 2013] Acapela (2013). Online: http://www.acapela-group.com/index.php. [Agelfors et al., 2006] Agelfors, E., Beskow, J., Karlsson, I., Kewley, J., Salvi, G., and Thomas, N. (2006). User evaluation of the synface talking head telephone. In Miesenberger, K., Klaus, J., Zagler, W., and Karshmer, A., editors, Computers Helping People with Special Needs, pages 579–586. Springer. [Aharon and Kimmel, 2004] Aharon, M. and Kimmel, R. (2004). Representation analysis and synthesis of lip images using dimensionality reduction. International Journal of Computer Vision, 67(3):297–312. [Al Moubayed et al., 2010] Al Moubayed, S., Beskow, J., Granstrom, B., and House, D. (2010). Audio-visual prosody: Perception, detection, and synthesis of prominence. In Esposito, A., Esposito, A. M., Martone, R., Muller, V., and Scarpetta, G., editors, Toward Autonomous, Adaptive, and Context-Aware Multimodal Interfaces. Theoretical and Practical Issues, pages 55–71. Springer Berlin Heidelberg. [Al Moubayed et al., 2012] Al Moubayed, S., Beskow, J., Skantze, G., and Granstrom, B. (2012). Furhat: A back-projected human-like robot head for multiparty human-machine interaction. Lecture Notes in Computer Science, 7403:114–130. [Albrecht et al., 2002] Albrecht, I., Haber, J., Kahler, K., Schroder, M., and Seidel, H.-P. (2002). May i talk to you? :-) – facial animation from text. In Proc. Pacific Graphics, pages 77–86. [Andersen, 2010] Andersen, T. S. (2010). The mcgurk illusion in the oddity task. In Proc. International Conference on Auditory-visual Speech Processing, pages paper S2–3. [Anderson and Davis, 1995] Anderson, J. and Davis, J. (1995). An introduction to neural networks. MIT Press. [Anime Studio, 2013] Anime Studio (2013). Online: http://anime.smithmicro.com/. [Arb, 2001] Arb, H. A. (2001). Hidden Markov Models for Visual Speech Synthesis in Limited Data Environments. PhD thesis, Air Force Institute of Technology. [Argyle and Cook, 1976] Argyle, M. and Cook, M. (1976). Gaze and Mutual Gaze. Cambridge University Press. [Arslan and Talkin, 1999] Arslan, L. M. and Talkin, D. (1999).
Codebook based face point trajectory synthesis algorithm using speech input. Speech Communication, 27(2):81–93. [Aschenberner and Weiss, 2005] Aschenberner, B. and Weiss, C. (2005). Phonemeviseme mapping for german video-realistic audio-visual-speech-synthesis. Technical report, IKP Bonn. [Auer and Bernstein, 1997] Auer, Jr, E. and Bernstein, L. E. (1997). Speechreading and the structure of the lexicon: computationally modeling the effects of reduced phonetic distinctiveness on lexical uniqueness. Journal of the Acoustical Society of America, 102(6):3704–3710. [AV Lab, 2013] AV Lab (2013). The audio-visual laboratory of etro. Online: http: //www.etro.vub.ac.be/Research/Nosey_Elephant_Studios. [Baayen et al., 1995] Baayen, R., Piepenbrock, R., and Gulikers, L. (1995). The celex lexical database (release 2). Technical Report celex, Linguistic Data Consortium, University of Pennsylvania. [Badin et al., 2010] Badin, P., Youssef, A., Bailly, G., Elisei, F., and Hueber, T. (2010). Visual articulatory feedback for phonetic correction in second language learning. In Proc. Workshop on Second Language Studies: Acquisition, Learning, Education and Technology, pages 1–10. [Bailly et al., 2003] Bailly, G., Brar, M., Elisei, F., and Odisio, M. (2003). Audiovisual speech synthesis. International Journal of Speech Technology, 6(4):331–346. [Bailly et al., 2002] Bailly, G., Gibert, G., and Odisio, M. (2002). Evaluation of movement generation systems using the point-light technique. In Proc. IEEE Workshop onSpeech Synthesis, pages 27–30. [Baker, 1975] Baker, J. (1975). The dragon system. IEEE Transactions on Acoustics, Speech, and Signal Processing, 23(1):24–29. [Barron et al., 1994] Barron, J., Fleet, D., and Beauchemin, S. (1994). Performance of optical flow techniques. International journal of computer vision, 12(1):43–77. BIBLIOGRAPHY 249 [Baum et al., 1970] Baum, L., Petrie, T., Soules, G., and Weiss, N. (1970). A maximization technique occurring in the statistical analysis of probabilistic functions of markov chains. Annals of mathematical statistics, 41(1):164–171. [Benesty et al., 2008] Benesty, J., Sondhi, M. M., and Huang, Y., editors (2008). Springer Handbook of Speech Processing. Springer. [Benoit and Le Goff, 1998] Benoit, C. and Le Goff, B. (1998). Audio-visual speech synthesis from french text: Eight years of models, designs and evaluation at the icp. Speech Communication, 26(1):117–129. [Benoit et al., 2000] Benoit, C., Pelachaud, C., claude Martin, J., claude Martin, J., Schomaker, L., and Suhm, B. (2000). Audio-visual and multimodal speech systems. In Gibbon, D., I.Mertins, and R.Moore, editors, Handbook of multimodal and spoken dialogue systems: Resources, terminology and product evaluation. Kluwer Academic. [Benoit et al., 2010] Benoit, M. M., Raij, T., Lin, F.-H., Jskelinen, I. P., and Stufflebeam, S. (2010). Primary and multisensory cortical activity is correlated with audiovisual percepts. Human Brain Mapping, 31(4):526–538. [Bergeron and Lachapelle, 1985] Bergeron, P. and Lachapelle, P. (1985). Controlling facial expressions and body movements in the computer generated animated short tony de peltrie. In Siggraph Tutorial Notes. [Bernstein et al., 2004] Bernstein, L., Auer, E., and Moore, J. (2004). Audiovisual speech binding: convergence or association. In Calvert, G., Spence, C., and Stein, B., editors, The handbook of multisensory processes, pages 203–223. MIT Press. [Bernstein et al., 1989] Bernstein, L. E., Eberhardt, S. P., and Demorest, M. E. (1989). 
Single-channel vibrotactile supplements to visual perception of intonation and stress. Journal of the Acoustical Society of America, 85(1):397–405. [Bernstein et al., 2000] Bernstein, L. E., Tucker, P. E., and Demorest, M. E. (2000). Speech perception without hearing. Attention, Perception, & Psychophysics, 62(2):233–252. [Beskow, 1995] Beskow, J. (1995). Rule-based visual speech synthesis. In Proc. European Conference on Speech Communication and Technology, pages 299–302. [Beskow, 2004] Beskow, J. (2004). Trainable articulatory control models for visual speech synthesis. International Journal of Speech Technology, 7(4):335–349. [Beskow and Nordenberg, 2005] Beskow, J. and Nordenberg, M. (2005). Datadriven synthesis of expressive visual speech using an mpeg-4 talking head. In BIBLIOGRAPHY 250 Proc. Annual Conference of the International Speech Communication Association (Interspeech), pages 793–796. [Beutnagel et al., 1999] Beutnagel, M., Conkie, A., Schroeter, J., Stylianou, Y., and Syrdal, A. (1999). The at&t next-gen tts system. In Proc. Joint Meeting of ASA, EAA, and DAGA, pages 18–24. [Biemann et al., 2007] Biemann, C., Heyer, G., Quasthoff, U., and Richter, M. (2007). The leipzig corpora collection–monolingual corpora of standard size. In Proc. Corpus Linguistics, pages 113–126. [Binnie et al., 1974] Binnie, C. A., Montgomery, A. A., and Jackson, P. L. (1974). Auditory and visual contributions to the perception of consonants. Journal of Speech and Hearing Research, 17(4):619–630. [Birkholz et al., 2006] Birkholz, P., Jackel, D., and Kroger, K. (2006). Construction and control of a three-dimensional vocal tract model. In Proc. IEEE International Conference on Acoustics, Speech and Signal Processing, volume 1, pages 873–876. [Black et al., 2013] Black, A., Taylor, P., and Caley, R. (2013). The festival speech synthesis system. Online: http://www.cstr.ed.ac.uk/projects/festival. html. [Blanz et al., 2003] Blanz, V., Basso, C., Poggio, T., and Vetter, T. (2003). Reanimating faces in images and video. Computer graphics forum, 22(3):641–650. [Bowers, 2001] Bowers, B. (2001). Sir Charles Wheatstone FRS: 1802-1875. Inspec/Iee. [Bozkurt et al., 2007] Bozkurt, E., Erdem, C., Erzin, E., Erdem, T., and Ozkan, M. (2007). Comparison of phoneme and viseme based acoustic units for speech driven realistic lip animation. In Proc. Signal Processing and Communications Applications, pages 1–4. [Brand, 1999] Brand, M. (1999). Voice puppetry. In Proc. Annual conference on Computer graphics and interactive techniques, pages 21–28. [Bredin and Chollet, 2006] Bredin, H. and Chollet, G. (2006). Measuring audio and visual speech synchrony: methods and applications. In Proc. IET International Conference on Visual Information Engineering, pages 255–260. [Bredin and Chollet, 2007] Bredin, H. and Chollet, G. (2007). Audio-visual speech synchrony measure for talking-face identity verification. In Proc. IEEE International Conference on Acoustics, Speech and Signal Processing, volume 2, pages 233–236. BIBLIOGRAPHY 251 [Breen et al., 1996] Breen, A. P., Bowers, E., and Welsh, W. (1996). An investigation into the generation of mouth shapes for a talking head”. In Proc. International Conference on Spoken Language Processing, pages 2159–2162. [Bregler et al., 1997] Bregler, C., Covell, M., and Slaney, M. (1997). Video rewrite: driving visual speech with audio. In Proc. Annual Conference on Computer Graphics and Interactive Techniques, pages 353–360. 
[Breiman et al., 1984] Breiman, L., Friedman, J., Olshen, R., and Stone, C. (1984). Classification and Regression Trees. Wadsworth and Brooks, Monterey, CA. [Brooke and Scott, 1998] Brooke, N. and Scott, S. (1998). Two- and threedimensional audio-visual speech synthesis. In Proc. International Conference on Auditory-visual Speech Processing, pages 213–220. [Brooke and Summerfield, 1983] Brooke, N. and Summerfield, Q. (1983). Analysis, synthesis, and perception of visible articulatory movements. Journal of Phonetics, 11:63–76. [Broomhead and Lowe, 1988] Broomhead, D. and Lowe, D. (1988). Radial basis functions, multi-variable functional interpolation and adaptive networks. Technical report, Royal Signals and Radar Establishment. [Browman and Goldstein, 1992] Browman, C. P. and Goldstein, L. (1992). Articulatory phonology: An overview. Phonetica, 49(3-4):155–180. [Campbell and Black, 1996] Campbell, N. and Black, A. (1996). Prosody and the selection of source units for concatenative synthesis. Progress in speech synthesis, 3:279–292. [Campbell, 2008] Campbell, R. (2008). The processing of audio-visual speech: empirical and neural bases. Philosophical Transactions of the Royal Society of London, 363:1001–1010. [Cao et al., 2004] Cao, Y., Faloutsos, P., Kohler, E., and Pighin, F. (2004). Realtime speech motion synthesis from recorded motions. In Proc. ACM SIGGRAPH/Eurographics symposium on Computer animation, pages 345–353. [Cappelletta and Harte, 2012] Cappelletta, L. and Harte, H. (2012). Phoneme-toviseme mapping for visual speech recognition. In Proc. International Conference on Patter Recognition Applications and Methods, pages 322–329. [Carter et al., 2010] Carter, E. J., Sharan, L., Trutoiu, L., Matthews, I., and Hodgins, J. K. (2010). Perceptually motivated guidelines for voice synchronization in film. ACM Transactions on Applied Perception, 7(4):1–12. BIBLIOGRAPHY 252 [Chang and Ezzat, 2005] Chang, Y.-J. and Ezzat, T. (2005). Transferable videorealistic speech animation. In Proc. ACM SIGGRAPH/Eurographics symposium on Computer animation, pages 143–151. [Chen, 2001] Chen, T. (2001). Audiovisual speech processing. IEEE Signal Processing Magazine, 18(1):9–21. [Chen and Rao, 1998] Chen, T. and Rao, R. R. (1998). Audio-visual integration in multimodal communication. Proceedings of the IEEE, 86(5):837–852. [Clark et al., 2007] Clark, R., Richmond, K., and King, S. (2007). Multisyn: Opendomain unit selection for the festival speech synthesis system. Speech Communication, 49(4):317–330. [CMU, 2013] CMU (2013). The carnegie mellon university pronouncing dictionary. Online: http://www.speech.cs.cmu.edu/cgi-bin/cmudict. [Cohen and Massaro, 1990] Cohen, M. and Massaro, D. (1990). Synthesis of visible speech. Behavior Research Methods, 22(2):260–263. [Cohen et al., 1996] Cohen, M., R.Walker, and Massaro, D. (1996). Perception of synthetic visual speech. In Speechreading by Humans and Machines: Models, Systems and Applications, pages 154–168. Springer-Verlag. [Cohen and Massaro, 1993] Cohen, M. M. and Massaro, D. W. (1993). Modeling coarticulation in synthetic visual speech. In Thalmann, N. M. and Thalmann, D., editors, Models and Techniques in Computer Animation, pages 139–156. SpringerVerlag. [Conkie and Isard, 1996] Conkie, A. and Isard, S. D. (1996). Optimal coupling of diphones. In Santen, J. P. H., Sproat, R. W., Olive, J. P., and Hirschberg, editors, Progress in Speech Synthesis. Springer. [Cootes et al., 2001] Cootes, T., Edwards, G., and Taylor, C. (2001). Active appearance models. 
IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(6):681–685. [Cootes and Taylor, 2001] Cootes, T. and Taylor, C. (2001). Constrained active appearance models. In Proc. Computer Vision, pages 748–754. [Corthals, 1984] Corthals, P. (1984). Een eenvoudige visementaxonomie voor spraakafzien. Tijdschrift voor Logopedie en Audiologie, 14(3):126–134. [Cosatto, 2002] Cosatto, E. (2002). Sample-Based Talking-Head Synthesis. PhD thesis, Swiss Federal Institute of Technology. BIBLIOGRAPHY 253 [Cosatto and Graf, 1998] Cosatto, E. and Graf, H. (1998). Sample-based synthesis of photo-realistic talking heads. In Proc. Computer Animation, pages 103–110. [Cosatto and Graf, 2000] Cosatto, E. and Graf, H. P. (2000). Photo-realistic talking-heads from image samples. IEEE Transactions on Multimedia, 2(3):152– 163. [Cosatto et al., 2003] Cosatto, E., Ostermann, J., Graf, H. P., and Schroeter, J. (2003). Lifelike talking faces for interactive services. Proceedings of the IEEE, 91(9):1406–1429. [Cosatto et al., 2000] Cosatto, E., Potamianos, G., and Graf, H. P. (2000). Audiovisual unit selection for the synthesis of photo-realistic talking-heads. In Proc. IEEE International Conference on Multimedia and Expo, pages 619–622. [Cosi et al., 2002] Cosi, P., Caldognetto, E., Perin, G., and Zmarich, C. (2002). Labial coarticulation modeling for realistic facial animation. In Proc. IEEE International Conference on Multimodal Interfaces, pages 505–510. [Cosi et al., 2003] Cosi, P., Fusaro, A., and Tisato, G. (2003). Lucia: A new italian talking-head based on a modified cohen-massaros labial coarticulation model. In Proc. European Conference on Speech Communication and Technology, pages 127–132. [Cosker et al., 2003] Cosker, D., Marshall, D., Rosin, P., and Hicks, Y. (2003). Video realistic talking heads using hierarchical non-linear speech-appearance models. In Proc. Mirage, pages 2–7. [Cosker et al., 2004] Cosker, D., Paddock, S., Marshall, D., Rosin, P. L., and Rushton, S. (2004). Towards perceptually realistic talking heads: models, methods and mcgurk. In Proc. Applied perception in graphics and visualization, pages 151–157. [Costa and De Martino, 2010] Costa, P. and De Martino, J. (2010). Compact 2d facial animation based on context-dependent visemes. In Proc. ACM/SSPNET International Symposium on Facial Analysis and Animation, pages 20–20. [Cyberware Scanning Products, 2013] Cyberware Scanning Products (2013). Online: http://www.cyberware.com/. [Davis et al., 1952] Davis, K., Biddulph, R., and Balashek, S. (1952). Automatic recognition of spoken digits. Journal of the Acoustical Society of America, 24(6):637–642. [De Martino et al., 2006] De Martino, J., Pini Magalhaes, L., and Violaro, F. (2006). Facial animation based on context-dependent visemes. Computers & Graphics, 30(6):971–980. BIBLIOGRAPHY 254 [Deena et al., 2010] Deena, S., Hou, S., and Galata, A. (2010). Visual speech synthesis by modelling coarticulation dynamics using a non-parametric switching state-space model. In Proc. International Conference on Multimodal Interfaces, pages 1–8. [Dehn and Van Mulken, 2000] Dehn, D. and Van Mulken, S. (2000). The impact of animated interface agents: a review of empirical research. International Journal of Human-Computer Studies, 52(1):1–22. [Deller et al., 1993] Deller, J., Proakis, J., and Hansen, J. (1993). Discrete-time processing of speech signals. Macmillan publishing company. [Demeny, 1892] Demeny, G. (1892). Les photographies parlantes. La Nature, 1:311. 
[Demuynck et al., 2008] Demuynck, K., Roelens, J., Van Compernolle, D., and Wambacq, P. (2008). Spraak: an open source speech recognition and automaticannotation kit. In Proc. Annual Conference of the International Speech Communication Association (Interspeech), pages 495–495. [Deng and Neumann, 2008] Deng, Z. and Neumann, U. (2008). Expressive speech animation synthesis with phoneme-level controls. Computer Graphics Forum, 27(8):2096–2113. [Deng et al., 2006] Deng, Z., Neumann, U., Lewis, J., Kim, T., Bulut, M., and Narayanan, S. (2006). Expressive facial animation synthesis by learning speech co-articulation and expression. IEEE Transaction on Visualization and Computer Graphics, 12(6):2006. [Deng and Noh, 2007] Deng, Z. and Noh, J. (2007). Computer facial animation: A survey. In Deng, Z. and Neumann, U., editors, Data-Driven 3D Facial Animation, pages 1–28. Springer. [Dey et al., 2010] Dey, P., Maddock, S., and Nicolson, R. (2010). Evaluation of a viseme-driven talking head. In Proc. Theory and Practice of Computer Graphics, pages 139–142. [Dixon and Maxey, 1968] Dixon, N. and Maxey, H. (1968). Terminal analog synthesis of continuous speech using the diphone method of segment assembly. IEEE Transactions on Audio and Electroacoustics, 16(1):40–50. [Du and Lin, 2002] Du, Y. and Lin, X. (2002). Realistic mouth synthesis based on shape appearance dependence mapping. Pattern Recognition Letters, 23(14):1875–1885. [Dudley et al., 1939] Dudley, H., Riesz, R., and Watkins, S. (1939). A synthetic speaker. Journal of the Franklin Institute, 227(6):739–764. BIBLIOGRAPHY 255 [Dudley and Tarnoczy, 1950] Dudley, H. and Tarnoczy, T. (1950). The speaking machine of wolfgang von kempelen. Journal of the Acoustical Society of America, 22(2):151–166. [Dunn, 1950] Dunn, H. K. (1950). The calculation of vowel resonances, and an electrical vocal tract. Journal of the Acoustical Society of America, 22(6):740– 753. [Dutoit, 1996] Dutoit, T. (1996). The mbrola project: towards a set of high quality speech synthesizers free of use for non commercial purposes. In Proc. Fourth International Conference on Spoken Language, pages 1393–1396. [Dutoit, 1997] Dutoit, T. (1997). An introduction to text-to-speech synthesis. Kluwer Academic. [Eberhardt et al., 1990] Eberhardt, S. P., Bernstein, L. E., Demorest, M. E., and Goldstein, Jr, M. (1990). Speechreading sentences with single-channel vibrotactile presentation of voice fundamental frequency. Journal of the Acoustical Society of America, 88(3):1274–1285. [Edge and Maddock, 2001] Edge, J. and Maddock, S. (2001). Expressive visual speech using geometric muscle functions. In Proc. Eurographics UK, pages 11–18. [Edge and Hilton, 2006] Edge, J. D. and Hilton, A. (2006). Visual speech synthesis from 3d video. In Proc. European Conference Visual Media Production, pages 174–179. [Edwards et al., 1998a] Edwards, G., Lanitis, A., Taylor, C., and Cootes, T. (1998a). Statistical models of face images - improving specificity. Image and Vision Computing, 16(3):203–211. [Edwards et al., 1998b] Edwards, G. J., Taylor, C. J., and Cootes, T. F. (1998b). Interpreting face images using active appearance models. In Proc. IEEE International Conference on Automatic Face and Gesture Recognition, pages 300–305. [Eggermont, 1964] Eggermont, J. (1964). Taalverwerving bij een groep dove kinderen. Een experimenteel onderzoek naar de betekenis van een geluidsmethode voor het spraakafzien. Wolters. [Eisert et al., 1997] Eisert, P., Chaudhuri, S., and Girod, B. (1997). 
Speech driven synthesis of talking head sequences. In Proc. Workshop 3D Image Analysis and Synthesis, pages 51–56. [Ekman and Friesen, 1978] Ekman, P. and Friesen, W. (1978). Facial Action Coding System (FACS): A Technique for the Measurement of Facial Action. Consulting Psychologists Press, Stanford University. BIBLIOGRAPHY 256 [Ekman et al., 1972] Ekman, P., Friesen, W. V., and Ellsworth, P. (1972). Emotion in the Human Face: Guidelines for Research and an Integration of Findings. Pergamon Press. [Elisei et al., 2001] Elisei, F., Odisio, M., G., B., and Badin, P. (2001). Creating and controlling video-realistic talking heads. In Proc. Auditory-Visual Speech Processing Workshop, pages 90–97. [Endo et al., 2010] Endo, N., Endo, K., Zecca, M., and Takanishi, A. (2010). Modular design of emotion expression humanoid robot kobian. In Schiehlen, W. and Parenti-Castelli, V., editors, ROMANSY 18 - Robot Design, Dynamics and Control, pages 465–472. Springer. [Englebienne et al., 2008] Englebienne, G., Cootes, T., and Rattray, M. (2008). A probabilistic model for generating realistic lip movements from speech. In Platt, J., Koller, D., Singer, Y., and Roweis, S., editors, Advances in Neural Information Processing Systems 20, pages 401–408. MIT Press. [Engwall, 2001] Engwall, O. (2001). Making the tongue model talk: merging mri & ema measurements. In Proc. Eurospeech, volume 1, pages 261–264. [Engwall, 2002] Engwall, O. (2002). Evaluation of a system for concatenative articulatory visual speech synthesis. In Proc. International Conference on Spoken Language Processing, pages 665–668. [Engwall et al., 2004] Engwall, O., Wik, P., Beskow, J., and Granstrom, G. (2004). Design strategies for a virtual language tutor. In Proc. of International Conference on Spoken Language Processing, volume 3, pages 1693–1696. [Erber, 1975] Erber, N. P. (1975). Auditory-visual perception of speech. Journal of Speech and Hearing Disorders, 40(4):481–492. [Erber and Filippo, 1978] Erber, N. P. and Filippo, C. L. D. (1978). Voice/mouth synthesis and tactual/visual perception of /pa, ba, ma/. Journal of the Acoustical Society of America, 64(4):1015–1019. [Escher and Thalmann, 1997] Escher, M. and Thalmann, N. (1997). Automatic 3d cloning and real-time animation of a human face. In Proc. Computer Animation, pages 58–66. [Ezzat et al., 2002] Ezzat, T., Geiger, G., and Poggio, T. (2002). Trainable videorealistic speech animation. In Proc. Annual conference on Computer graphics and interactive techniques, pages 388–398. BIBLIOGRAPHY 257 [Ezzat and Poggio, 2000] Ezzat, T. and Poggio, T. (2000). Visual speech synthesis by morphing visemes. International Journal of Computer Vision, SI: learning and vision at the center for biological and computational learning, 38(1):45–57. [Fagel, 2006] Fagel, S. (2006). Joint audio-visual unit selection - the javus speech synthesizer. In Proc. International Conference on Speech and Computer, pages 503–506. [Fagel and Clemens, 2004] Fagel, S. and Clemens, C. (2004). An articulation model for audiovisual speech synthesis – determination, adjustment, evaluation. Speech Communication, 44(1):141–154. [Fant, 1953] Fant, G. (1953). Speech communication research. Technical report, Royal Swedish Academy of Engineering Sciences. [Fasel and Luettin, 2003] Fasel, B. and Luettin, J. (2003). Automatic facial expression analysis: a survey. Pattern Recognition, 36(1):259–275. [Ferguson, 1980] Ferguson, J. (1980). Hidden markov analysis: An introduction. In Hidden Markov Modelsfor Speech. 
Institute for Defense Analyses, Princeton. [Fisher, 1968] Fisher, C. (1968). Confusions among visually perceived consonants. Journal of Speech and Hearing Research, 11(4):796–804. [Fisher, 1969] Fisher, C. G. (1969). The visibility of terminal pitch contour. Journal of Speech and Hearing Research, 12(2):379–382. [Flanagan, 1972] Flanagan, J. (1972). Speech analysis, synthesis and perception. Springer-Verlag. [Galanes et al., 1998] Galanes, F., Unverferth, J., Arslan, L., and Talkin, D. (1998). Generation of lip-synched synthetic faces from phonetically clustered face movement data. In Proc. International Conference on Auditory-visual Speech Processing, pages 191–194. [Gao et al., 1998] Gao, L., Mukigawa, Y., and Ohta, Y. (1998). Synthesis of facial images with lip motion from several real views. In Proc. IEEE International Conference on Automatic Face and Gesture Recognition, pages 181–186. [Geiger et al., 2003] Geiger, G., Ezzat, T., and Poggio, T. (2003). Perceptual evaluation of video-realistic speech. Technical report, MIT Artificial Intelligence Laboratory. [Gibbs et al., 1993] Gibbs, S., Breiteneder, C., De Mey, V., and Papathomas, M. (1993). Video widgets and video actors. In Proc. ACM symposium on user interface software and technology, pages 179–185. BIBLIOGRAPHY 258 [Gordon and Hibberts, 2011] Gordon, M. S. and Hibberts, M. (2011). Audiovisual speech from emotionally expressive and lateralized faces. Quarterly Journal of Experimental Psychology, 64(4):730–750. [Goto et al., 2001] Goto, T., Kshirsagar, S., and Magnenat-Thalmann, N. (2001). Automatic face cloning and animation using real-time facial feature tracking and speech acquisition. IEEE Signal Processing Magazine, 18(3):17–25. [Govokhina et al., 2007] Govokhina, O., Bailly, G., and Breton, G. (2007). Learning optimal audiovisual phasing for a hmm-based control model for facial animation. In Proc. ISCA Workshop on Speech Synthesis, pages 1–4. [Govokhina et al., 2006a] Govokhina, O., Bailly, G., Breton, G., and Bagshaw, P. (2006a). Evaluation de systèmes de génération de mouvements faciaux. In Proc. Journées d’Etudes sur la Parole, pages 305–308. [Govokhina et al., 2006b] Govokhina, O., Bailly, G., Breton, G., and Bagshaw, P. C. (2006b). Tda: a new trainable trajectory formation system for facial animation. In Proc. Annual Conference of the International Speech Communication Association (Interspeech), pages 2474–2477. [Goyal et al., 2000] Goyal, U., Kapoor, A., and Kalra, P. (2000). Text-to-audiovisual speech synthesizer. In Proc. International Conference on Virtual Worlds, pages 256–269. [Graf et al., 2002] Graf, H. P., Cosatto, E., Strom, V., and Huang, F. J. (2002). Visual prosody: Facial movements accompanying speech. In Proc. International Conference on Automatic Face and Gesture Recognition, pages 396–401. [Granstrom and House, 2005] Granstrom, B. and House, D. (2005). Audiovisual representation of prosody in expressive speech communication. Speech communication, 46(3):473–484. [Granstrom et al., 1999] Granstrom, B., House, D., and Lundeberg, M. (1999). Prosodic cues in multimodal speech perception. In Proc. International Congress of Phonetic Sciences, pages 655–658. [Grant, 1969] Grant, E. C. (1969). Human facial expression. Man, 4(4):525–536. [Grant and Greenberg, 2001] Grant, K. W. and Greenberg, S. (2001). Speech intelligibility derived from asynchrounous processing of auditory-visual information. In Proc. Audio-Visual Speech Processing Workshop, pages 132–137. [Grant et al., 2004] Grant, K. 
W., Van Wassenhove, V., and Poeppel, D. (2004). Detection of auditory (cross-spectral) and auditory-visual (cross-modal) synchrony. Speech Communication, 44(1):43–53. BIBLIOGRAPHY 259 [Grant et al., 1998] Grant, K. W., Walden, B. E., and Seitz, P. F. (1998). Auditoryvisual speech recognition by hearing-impaired subjects: consonant recognition, sentence recognition, and auditory-visual integration. Journal of the Acoustical Society of America, 103(5):2677–2690. [Guiard-Marigny et al., 1996] Guiard-Marigny, T., Tsingos, N., Adjoudani, A., Benoit, C., and Gascuel, M.-P. (1996). 3d models of the lips for realistic speech animation. In Proc. Computer Animation, pages 80–89. [Gutierrez-Osuna et al., 2005] Gutierrez-Osuna, R., Kakumanu, P. K., Esposito, A., Garcia, O. N., Bojorquez, A., Castillo, J. L., and Rudomin, I. (2005). Speechdriven facial animation with realistic dynamics. IEEE Transactions on Multimedia, 7(1):33–42. [Hadar et al., 1983] Hadar, U., Steiner, T. J., Grant, E. C., and Rose, F. C. (1983). Head movement correlates of juncture and stress at sentence level. Language and Speech, 26(2):117–129. [Hallgren and Lyberg, 1998] Hallgren, A. and Lyberg, B. (1998). Visual speech synthesis with concatenative speech. In Proc. Auditory Visual Speech Processing, pages 181–183. [Hazen et al., 2004] Hazen, T., Saenko, K., La, C., and Glass, J. (2004). A segmentbased audio-visual speech recognizer: data collection, development and initial experiments. In Proc. International conference on Multimodal interfaces, pages 235–242. [Heckbert, 1986] Heckbert, P. (1986). Survey of texture mapping. IEEE Computer Graphics and Applications, 6(11):56–67. [Hilder et al., 2010] Hilder, S., Theobald, B., and Harvey, R. (2010). In pursuit of visemes. In Proc. International Conference on Auditory-Visual Speech Processing, pages 154–159. [Hill et al., 1988] Hill, D. R., Pearce, A., and Wyvill, B. (1988). Animating speech: an automated approach using speech synthesised by rules. The Visual Computer, 3(5):277–289. [Hong et al., 2001] Hong, P., Wen, Z., and Huang, T. (2001). Iface: a 3d synthetic talking face. International Journal of Image and Graphics, 1(1):19–26. [Horn and Schunck, 1981] Horn, B. K. P. and Schunck, B. G. (1981). Determining optical flow. Artificial Intelligence, 17:185–203. BIBLIOGRAPHY 260 [Hou et al., 2007] Hou, Y., Sahli, H., Ilse, R., Zhang, Y., and Zhao, R. (2007). Robust shape-based head tracking. Lecture Notes in Computer Science, 4678:340– 351. [Hsieh and Chen, 2006] Hsieh, C. and Chen, Y. (2006). Partial linear regression for speech-driven talking head application. Signal Processing: Image Communication, 21(1):1–12. [Huang et al., 2002] Huang, F. J., Cosatto, E., and Graf, H. P. (2002). Triphone based unit selection for concatenative visual speech synthesis. In Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing, volume 2, pages 2037–2040. [Hunt and Black, 1996] Hunt, A. J. and Black, A. W. (1996). Unit selection in a concatenative speech synthesis system using a large speech database. In Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing, pages 373–376. [Hyvarinen et al., 2001] Hyvarinen, A., Karhunen, J., and Oja, E. (2001). Independent component analysis. Wiley & Sons. [Ip and Yin, 1996] Ip, H. H. S. and Yin, L. (1996). Constructing a 3d individualized head model from two orthogonal views. The visual computer, 12(5):254–266. [Jackson, 1988] Jackson, P. (1988). 
[Jackson and Singampalli, 2009] Jackson, P. J. and Singampalli, V. D. (2009). Statistical identification of articulation constraints in the production of speech. Speech Communication, 51(8):695–710.
[Jeffers and Barley, 1971] Jeffers, J. and Barley, M. (1971). Speechreading (Lipreading). Charles C Thomas Pub Ltd.
[Jiang et al., 2008] Jiang, D., Ravyse, I., Sahli, H., and Verhelst, W. (2008). Speech driven realistic mouth animation based on multi-modal unit selection. Journal on Multimodal User Interfaces, 2:157–169.
[Johnson et al., 2000] Johnson, W. L., Rickel, J. W., and Lester, J. C. (2000). Animated pedagogical agents: Face-to-face interaction in interactive learning environments. International Journal of Artificial Intelligence in Education, 11(1):47–78.
[Kahler et al., 2001] Kahler, K., Haber, J., and Seidel, H.-P. (2001). Geometry-based muscle modeling for facial animation. In Proc. Graphics Interface, pages 37–46.
[Kalberer and Van Gool, 2001] Kalberer, G. and Van Gool, L. (2001). Face animation based on observed 3d speech dynamics. In Proc. Computer Animation, pages 20–251.
[Karlsson et al., 2003] Karlsson, I., Faulkner, A., and Salvi, G. (2003). Synface - a talking face telephone. In Proc. European Conference on Speech Communication and Technology, pages 1297–1300.
[Kawahara et al., 1999] Kawahara, H., Masuda-Katsuse, I., and de Cheveigne, A. (1999). Restructuring speech representations using a pitch-adaptive time–frequency smoothing and an instantaneous-frequency-based f0 extraction: Possible role of a repetitive structure in sounds. Speech Communication, 27(3):187–207.
[Keating, 1988] Keating, P. (1988). Underspecification in phonetics. Phonology, 5(2):275–292.
[Kelly and Gerstman, 1961] Kelly, J. and Gerstman, L. (1961). An artificial talker driven from a phonetic input. Journal of the Acoustical Society of America, 33(6):835–835.
[Kelly and Lochbaum, 1962] Kelly, J. and Lochbaum, C. (1962). Speech synthesis. In Proc. Fourth International Conference on Acoustics, pages 1–4.
[Kent and Minifie, 1977] Kent, R. and Minifie, F. (1977). Coarticulation in recent speech production models. Journal of Phonetics, 5(2):115–133.
[Kerkhoff and Marsi, 2002] Kerkhoff, J. and Marsi, E. (2002). Nextens: a new open source text-to-speech system for dutch. In Proc. Meeting of Computational Linguistics in the Netherlands.
[Kim and Ko, 2007] Kim, I. and Ko, H. (2007). 3d lip-synch generation with data-faithful machine learning. In Proc. Computer Graphics Forum, volume 26, pages 295–301.
[King and Parent, 2005] King, S. and Parent, R. (2005). Creating speech-synchronized animation. IEEE Transactions on Visualization and Computer Graphics, 11(3):341–352.
[Klatt, 1987] Klatt, D. (1987). Review of text-to-speech conversion for english. Journal of the Acoustical Society of America, 82(3):737–793.
[Klir and Yuan, 1995] Klir, G. and Yuan, B. (1995). Fuzzy sets and fuzzy logic. Prentice Hall.
[Kominek and Black, 2004] Kominek, J. and Black, A. (2004). The cmu arctic speech databases. In Proc. ISCA Workshop on Speech Synthesis, pages 223–224.
[Krahmer et al., 2002] Krahmer, E., Ruttkay, Z., Swerts, M., and Wesselink, W. (2002). Pitch, eyebrows and the perception of focus. In Proc. Speech Prosody, pages 443–446.
[Kshirsagar and Magnenat-Thalmann, 2003] Kshirsagar, S. and Magnenat-Thalmann, N. (2003). Visyllable based speech animation. Computer Graphics Forum, 22(3):631–639.
[Kuratate et al., 2011] Kuratate, T., Pierce, B., and Cheng, G. (2011). Mask-bot: A life-size talking head animated robot for av speech and human-robot communication research. In Proc. International Conference on Auditory-Visual Speech Processing, pages 111–116.
[Kuratate et al., 1998] Kuratate, T., Yehia, H., and Vatikiotis-Bateson, E. (1998). Kinematics-based synthesis of realistic talking faces. In Proc. International Conference on Auditory-Visual Speech Processing, pages 185–190.
[Latacz, TBP] Latacz, L. (TBP). Speech Synthesis: Towards Automated Voice Building And Use in Clinical and Educational Applications (Unpublished). PhD thesis, Vrije Universiteit Brussel.
[Latacz et al., 2008] Latacz, L., Kong, Y., Mattheyses, W., and Verhelst, W. (2008). An overview of the vub entry for the 2008 blizzard challenge. In Proc. Blizzard Challenge 2008.
[Latacz et al., 2007] Latacz, L., Kong, Y., and Verhelst, W. (2007). Unit selection synthesis using long non-uniform units and phonemic identity matching. In Proc. ISCA Workshop on Speech Synthesis, pages 270–275.
[Latacz et al., 2009] Latacz, L., Mattheyses, W., and Verhelst, W. (2009). The vub blizzard challenge 2009 entry. In Proc. Blizzard Challenge 2009.
[Latacz et al., 2010] Latacz, L., Mattheyses, W., and Verhelst, W. (2010). The vub blizzard challenge 2010 entry: Towards automatic voice building. In Proc. Blizzard Challenge 2010.
[Latacz et al., 2011] Latacz, L., Mattheyses, W., and Verhelst, W. (2011). Joint target and join cost weight training for unit selection synthesis. In Proc. Annual Conference of the International Speech Communication Association (Interspeech), pages 321–324.
[Lawrence, 1953] Lawrence, W. (1953). The synthesis of speech from signals which have a low information rate. In Communication theory, pages 460–469. Butterworths, London.
[Le Goff, 1997] Le Goff, B. (1997). Automatic modeling of coarticulation in text-to-visual speech synthesis. In Proc. European Conference on Speech Communication and Technology, pages 1667–1670.
[Le Goff and Benoit, 1996] Le Goff, B. and Benoit, C. (1996). A text-to-audiovisual speech synthesizer for french. In Proc. International Conference on Spoken Language Processing, pages 2163–2166.
[Le Goff et al., 1994] Le Goff, B., Guiard-Marigny, T., Cohen, M., and Benoit, C. (1994). Real-time analysis-synthesis and intelligibility of talking faces. In Proc. ESCA/IEEE Workshop on Speech Synthesis, pages 53–56.
[Lee et al., 1995] Lee, Y., Terzopoulos, D., and Waters, K. (1995). Realistic modeling for facial animation. In Proc. Annual conference on Computer graphics and interactive techniques, pages 55–62.
[Lei et al., 2003] Lei, X., Dongmei, J., Ravyse, I., Verhelst, W., Sahli, H., Slavova, V., and Rongchun, Z. (2003). Context dependent viseme models for voice driven animation. In Proc. EURASIP Conference focused on video/image processing and multimedia communications, pages 649–654.
[Lesner and Kricos, 1981] Lesner, S. and Kricos, P. B. (1981). Visual vowel and diphthong perception across speakers. Journal of the Academy of Rehabilitative Audiology, 14:252–258.
[Lewis, 1991] Lewis, J. (1991). Automated lip-sync: Background and techniques. Journal of Visualization and Computer Animation, 2(4):118–122.
[Lewis and Parke, 1987] Lewis, J. P. and Parke, F. I. (1987). Automated lip-synch and speech synthesis for character animation. In Proc. SIGCHI/GI conference on Human factors in computing systems and graphics interface, pages 143–147.
[Lin et al., 1999] Lin, I.-C., Hung, C.-S., Yang, T.-J., and Ouhyoung, M. (1999). A speech driven talking head system based on a single face image. In Proc. Conference on Computer Graphics and Applications, pages 43–49.
[Lindsay, 1997] Lindsay, D. (1997). Talking head. American Heritage of Invention & Technology, Summer 1997:57–63.
[Ling and Wang, 2007] Ling, Z. and Wang, R. (2007). Hmm-based hierarchical unit selection combining kullback-leibler divergence with likelihood criterion. In Proc. IEEE International Conference on Acoustics, Speech and Signal Processing, volume 4, pages 1245–1248.
[Lippmann, 1989] Lippmann, R. (1989). Review of neural networks for speech recognition. Neural computation, 1(1):1–38.
[Liu and Ostermann, 2009] Liu, K. and Ostermann, J. (2009). Optimization of an image-based talking head system. EURASIP Journal on Audio, Speech and Music Processing, SI: Animating Virtual Speakers or Singers from Audio: Lip-Synching Facial Animation:174192.
[Liu and Ostermann, 2011] Liu, K. and Ostermann, J. (2011). Realistic facial expression synthesis for an image-based talking head. In Proc. IEEE International Conference on Multimedia and Expo, pages 1–6.
[Lloyd, 1982] Lloyd, S. (1982). Least squares quantization in pcm. IEEE Transactions on Information Theory, 28(2):129–137.
[Lofqvist, 1990] Lofqvist, A. (1990). Speech as audible gestures. In Hardcastle, W. and Marchal, A., editors, Speech Production and Speech Modeling, pages 289–322. Kluwer Academic Publishers.
[Ma et al., 2006] Ma, J., Cole, R., Pellom, B., Ward, W., and Wise, B. (2006). Accurate visible speech synthesis based on concatenating variable length motion capture data. IEEE Transactions on Visualization and Computer Graphics, 12(2):266–276.
[Ma et al., 2009] Ma, W. J., Zhou, X., Ross, L. A., Foxe, J. J., and Parra, L. C. (2009). Lip-reading aids word recognition most in moderate noise: a bayesian explanation using high-dimensional feature space. PLoS One, 4(3):e4638.
[MacLeod and Summerfield, 1987] MacLeod, A. and Summerfield, Q. (1987). Quantifying the contribution of vision to speech perception in noise. British Journal of Audiology, 21:131–141.
[MacLeod and Summerfield, 1990] MacLeod, A. and Summerfield, Q. (1990). A procedure for measuring auditory and audio-visual speech-reception thresholds for sentences in noise: rationale, evaluation, and recommendations for use. British Journal of Audiology, 24(1):29–43.
[Malcangi, 2010] Malcangi, M. (2010). Text-driven avatars based on artificial neural networks and fuzzy logic. International journal of computers, 4(2):61–69.
[Massaro et al., 1999] Massaro, D., Beskow, J., Cohen, M., Fry, C., and Rodriguez, T. (1999). Picture my voice: Audio to visual speech synthesis using artificial neural networks. In Proc. International Conference on Auditory-visual Speech Processing, pages 133–138.
[Massaro and Cohen, 1990] Massaro, D. and Cohen, M. M. (1990). Perception of synthesized audible and visible speech. Psychological Science, 1(1):55–63.
[Massaro, 2003] Massaro, D. W. (2003). A computer-animated tutor for spoken and written language learning. In Proc. International Conference on Multimodal Interfaces, pages 172–175.
[Matlab, 2013] Matlab (2013). Online documentation. Online: http://www.mathworks.nl/help/signal/ref/firpm.html.
[Mattheyses et al., 2009a] Mattheyses, W., Latacz, L., and Verhelst, W. (2009a). Multimodal coherency issues in designing and optimizing audiovisual speech synthesis techniques. In Proc. International Conference on Auditory-visual Speech Processing, pages 47–52.
[Mattheyses et al., 2009b] Mattheyses, W., Latacz, L., and Verhelst, W. (2009b). On the importance of audiovisual coherence for the perceived quality of synthesized visual speech. EURASIP Journal on Audio, Speech and Music Processing, SI: Animating Virtual Speakers or Singers from Audio: Lip-Synching Facial Animation:169819.
[Mattheyses et al., 2010a] Mattheyses, W., Latacz, L., and Verhelst, W. (2010a). Active appearance models for photorealistic visual speech synthesis. In Proc. Annual Conference of the International Speech Communication Association (Interspeech), pages 1113–1116.
[Mattheyses et al., 2010b] Mattheyses, W., Latacz, L., and Verhelst, W. (2010b). Optimized photorealistic audiovisual speech synthesis using active appearance modeling. In Proc. International Conference on Auditory-visual Speech Processing, pages 148–153.
[Mattheyses et al., 2011a] Mattheyses, W., Latacz, L., and Verhelst, W. (2011a). Auditory and photo-realistic audiovisual speech synthesis for dutch. In Proc. International Conference on Auditory-Visual Speech Processing, pages 55–60.
[Mattheyses et al., 2011b] Mattheyses, W., Latacz, L., and Verhelst, W. (2011b). Automatic viseme clustering for audiovisual speech synthesis. In Proc. Annual Conference of the International Speech Communication Association (Interspeech), pages 2173–2176.
[Mattheyses et al., 2013] Mattheyses, W., Latacz, L., and Verhelst, W. (2013). Comprehensive many-to-many phoneme-to-viseme mapping and its application for concatenative visual speech synthesis. Speech Communication, 55(7-8):857–876.
[Mattheyses et al., 2008] Mattheyses, W., Latacz, L., Verhelst, W., and Sahli, H. (2008). Multimodal unit selection for 2d audiovisual text-to-speech synthesis. Lecture Notes In Computer Science, 5237:125–136.
[Mattheyses et al., 2006] Mattheyses, W., Verhelst, W., and Verhoeve, P. (2006). Robust pitch marking for prosodic modification of speech using td-psola. In Proc. Annual IEEE BENELUX/DSP Valley Signal Processing Symposium, pages 43–46.
[Mattys et al., 2002] Mattys, S., Bernstein, L., and Auer Jr., E. (2002). Stimulus-based lexical distinctiveness as a general word-recognition mechanism. Perception and Psychophysics, 64(4):667–679.
[McGurk and MacDonald, 1976] McGurk, H. and MacDonald, J. (1976). Hearing lips and seeing voices. Nature, 264(5588):746–748.
[Melenchon et al., 2009] Melenchon, J., Martinez, E., De La Torre, F., and Montero, J. (2009). Emphatic visual speech synthesis. IEEE Transactions on Audio, Speech, and Language Processing, 17(3):459–468.
[Melenchon et al., 2007] Melenchon, J., Simo, J., Cobo, G., and Martinez, E. (2007). Objective viseme extraction and audiovisual uncertainty: Estimation limits between auditory and visual modes. In Proc. International Conference on Auditory-Visual Speech Processing, pages 191–194.
[Mermelstein, 1976] Mermelstein, P. (1976). Distance measures for speech recognition, psychological and instrumental. Pattern recognition and artificial intelligence, 116:91–103.
[Mertens and Vercammen, 1998] Mertens, P. and Vercammen, F. (1998). Fonilex manual. Technical report, K.U.Leuven CCL.
[Minnis and Breen, 2000] Minnis, S. and Breen, A. P. (2000). Modeling visual coarticulation in synthetic talking heads using a lip motion unit inventory with concatenative synthesis. In Proc. International Conference on Spoken Language Processing, pages 759–762.
[Montgomery, 1980] Montgomery, A. A. (1980). Development of a model for generating synthetic animated lip shapes. Journal of the Acoustical Society of America, 68(S1):S58–S59.
[Montgomery and Jackson, 1983] Montgomery, A. A. and Jackson, P. L. (1983). Physical characteristics of the lips underlying vowel lipreading performance. Journal of the Acoustical Society of America, 73(6):2134–2144.
[Mori, 1970] Mori, M. (1970). The uncanny valley. Energy, 7(4):33–35.
[Moulines and Charpentier, 1990] Moulines, E. and Charpentier, F. (1990). Pitch-synchronous waveform processing techniques for text-to-speech synthesis using diphones. Speech Communication, 9(5):453–467.
[MPEG, 2013] MPEG (2013). ISO-IEC-14496-2. Online: http://www.iso.org.
[Muller et al., 2005] Muller, P., Kalberer, G., Proesmans, M., and Van Gool, L. (2005). Realistic speech animation based on observed 3d face dynamics. IEE Proceedings on Vision, Image and Signal Processing, 152(4):491–500.
[Munhall et al., 2004] Munhall, K. G., Jones, J. A., Callan, D. E., Kuratate, T., and Vatikiotis-Bateson, E. (2004). Visual prosody and speech intelligibility: head movement improves auditory speech perception. Psychological Science, 15(2):133–137.
[Musti et al., 2011] Musti, U., Colotte, V., Toutios, A., and Ouni, S. (2011). Introducing visual target cost within an acoustic-visual unit-selection speech synthesizer. In Proc. International Conference on Auditory-Visual Speech Processing, pages 49–55.
[Myers et al., 1980] Myers, C., Rabiner, L., and Rosenberg, A. (1980). Performance tradeoffs in dynamic time warping algorithms for isolated word recognition. IEEE Transactions on Acoustics, Speech, and Signal Processing, 28(6):623–635.
[Nakaoka et al., 2009] Nakaoka, S., Kanehiro, F., Miura, K., Morisawa, M., Fujiwara, K., Kaneko, K., Kajita, S., and Hirukawa, H. (2009). Creating facial motions of cybernetic human hrp-4c. In Proc. IEEE-RAS International Conference on Humanoid Robots, pages 561–567.
[Nefian et al., 2002] Nefian, A., Liang, L., Pi, X., Xiaoxiang, L., Mao, C., and Murphy, K. (2002). A coupled hmm for audio-visual speech recognition. In Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing, volume 2, pages 2013–2016.
[Noh and Neumann, 2000] Noh, J. and Neumann, U. (2000). Talking faces. In Proc. IEEE International Conference on Multimedia and Expo, volume 2, pages 627–630.
[Noma et al., 2000] Noma, T., Zhao, L., and Badler, N. I. (2000). Design of a virtual human presenter. Computer Graphics and Applications, 20(4):79–85.
[Nuance, 2013] Nuance (2013). Online: http://netherlands.nuance.com/bedrijven/oplossing/Spraak-naar-tekst/index.htm.
[Ohman, 1967] Ohman, S. E. (1967). Numerical model of coarticulation. Journal of the Acoustical Society of America, 41(2):310–320.
[Ostermann, 1998] Ostermann, J. (1998). Animation of synthetic faces in mpeg-4. In Proc. Computer Animation, pages 49–55.
[Ostermann et al., 1998] Ostermann, J., Chen, L., and Huang, T. (1998). Animated talking head with personalized 3d head model. Journal of VLSI Signal Processing, 20(1):97–105.
[Ostermann and Millen, 2000] Ostermann, J. and Millen, D. (2000). Talking heads and synthetic speech: An architecture for supporting electronic commerce. In Proc. IEEE International Conference on Multimedia and Expo, pages 71–74.
[Ouni et al., 2006] Ouni, S., Cohen, M., Ishak, H., and Massaro, D. (2006). Visual contribution to speech perception: Measuring the intelligibility of animated talking heads. EURASIP Journal on Audio, Speech, and Music Processing, 2007:047891.
[Owens and Blazek, 1985] Owens, E. and Blazek, B. (1985). Visemes observed by hearing-impaired and normal-hearing adult viewers. Journal of Speech and Hearing Research, 28:381–393.
[Pandzic and Forchheimer, 2003] Pandzic, I. and Forchheimer, R. (2003). MPEG-4 Facial Animation: The Standard, Implementation and Applications. John Wiley & Sons Inc.
[Pandzic et al., 1999] Pandzic, I. S., Ostermann, J., and Millen, D. R. (1999). User evaluation: Synthetic talking faces for interactive services. The Visual Computer, 15(7):330–340.
[Parke, 1982] Parke, F. (1982). Parametric models for facial animation. Computer Graphics and Applications, 2(9):61–68.
[Parke, 1972] Parke, F. I. (1972). Computer generated animation of faces. In Proc. ACM annual conference, pages 451–457.
[Parke, 1975] Parke, F. I. (1975). A model for human faces that allows speech synchronized animation. Computers & Graphics, 1(1):3–4.
[Pearce et al., 1986] Pearce, A., Wyvill, B., Wyvill, G., and Hill, D. (1986). Speech and expression: a computer solution to face animation. In Proc. Graphics and Vision Interface, pages 136–140.
[Pearson, 1901] Pearson, K. (1901). On lines and planes of closest fit to systems of points in space. Philosophical Magazine and Journal of Science, 2(11):559–572.
[Pelachaud, 1991] Pelachaud, C. (1991). Communication and Coarticulation in Facial Animation. PhD thesis, University of Pennsylvania.
[Pelachaud et al., 1996] Pelachaud, C., Badler, N., and Steedman, M. (1996). Generating facial expressions for speech. Cognitive science, 20(1):1–46.
[Pelachaud et al., 1991] Pelachaud, C., Badler, N. I., and Steedman, M. (1991). Linguistic issues in facial animation. In Proc. Computer Animation, pages 15–30.
[Pelachaud et al., 2001] Pelachaud, C., Magno-Caldognetto, E., Zmarich, C., and Cosi, P. (2001). Modelling an italian talking head. In Proc. International Conference on Auditory-Visual Speech Processing, pages 72–77.
[Perng et al., 1998] Perng, W., Wu, Y., and Ouhyoung, M. (1998). Image talk: a real time synthetic talking head using one single image with chinese text-to-speech capability. In Proc. Pacific Conference on Computer Graphics and Applications, pages 140–148.
[Peterson et al., 1958] Peterson, G., Wang, W., and Sivertsen, E. (1958). Segmentation techniques in speech synthesis. Journal of the Acoustical Society of America, 30(7):682–683.
[Pighin et al., 1998] Pighin, F., Hecker, J., Lischinski, D., Szeliski, R., and Salesin, D. H. (1998). Synthesizing realistic facial expressions from photographs. In Proc. Annual conference on Computer graphics and interactive techniques, pages 75–84.
[Pitrelli et al., 1994] Pitrelli, J., Beckman, M., and Hirschberg, J. (1994). Evaluation of prosodic transcription labeling reliability in the tobi framework. In Proc. International Conference on Spoken Language Processing, pages 123–126.
[Pixar Animation Studios, 2013] Pixar Animation Studios (2013). Online: http://www.pixar.com/.
[Platt and Badler, 1981] Platt, S. M. and Badler, N. I. (1981). Animating facial expressions. Computer Graphics, 15(3):245–252.
[Plenge and Tilse, 1975] Plenge, G. and Tilse, U. (1975). The cocktail party effect with and without conflicting visual clues. In Proc. Audio Engineering Society Convention, pages L–11.
[Porter and Duff, 1984] Porter, T. and Duff, T. (1984). Compositing digital images. SIGGRAPH Computer Graphics, 18(3):253–259.
[Potamianos et al., 2004] Potamianos, G., Neti, C., Luettin, J., and Matthews, I. (2004). Audio-visual automatic speech recognition: An overview. In Bailly, G., Vatikiotis-Bateson, E., and Perrier, P., editors, Issues in Visual and Audio-Visual Speech Processing. MIT Press.
[Rabiner, 1989] Rabiner, L. (1989). A tutorial on hidden markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2):257–286.
[Rabiner and Juang, 1993] Rabiner, L. and Juang, B.-H. (1993). Fundamentals of Speech Recognition. Prentice Hall.
[Rabiner and Schafer, 1978] Rabiner, L. and Schafer, R. (1978). Digital processing of speech signals. Prentice-Hall, Englewood Cliffs.
[Reveret et al., 2000] Reveret, L., Bailly, G., and Badin, P. (2000). Mother: A new generation of talking heads providing a flexible articulatory control for video-realistic speech animation. In Proc. International Conference on Spoken Language Processing, pages 755–758.
[Ritter et al., 1999] Ritter, M., Meier, U., Yang, J., and Waibel, A. (1999). Face translation: A multimodal translation agent. In Proc. International Conference on Auditory-Visual Speech Processing, page paper 28.
[Rogozan, 1999] Rogozan, A. (1999). Discriminative learning of visual data for audiovisual speech recognition. International Journal on Artificial Intelligence Tools, 8(1):43–52.
[Ronnberg et al., 1998] Ronnberg, J., Samuelsson, S., and Lyxell, B. (1998). Conceptual constraints in sentence-based lipreading in the hearing-impaired. In Campbell, R., Dodd, B., and Burnham, D., editors, Hearing by eye II: Advances in the psychology of speechreading and auditory–visual speech, pages 143–153. Psychology Press.
[Rosen, 1958] Rosen, G. (1958). Dynamic analog speech synthesizer. Journal of the Acoustical Society of America, 30(3):201–209.
[Roweis, 1998] Roweis, S. (1998). Em algorithms for pca and spca. Advances in neural information processing systems, 10:626–632.
[Saenko, 2004] Saenko, E. (2004). Articulatory features for robust visual speech recognition. Master's thesis, Massachusetts Institute of Technology.
[Schmidt and Cohn, 2001] Schmidt, K. L. and Cohn, J. F. (2001). Human facial expressions as adaptations: Evolutionary questions in facial expression research. American Journal of Physical Anthropology, S33:3–24.
[Schroder and Trouvain, 2003] Schroder, M. and Trouvain, J. (2003). The german text-to-speech synthesis system mary: A tool for research, development and teaching. International Journal of Speech Technology, 6(4):365–377.
[Schroeder, 1993] Schroeder, M. (1993). A brief history of synthetic speech. Speech Communication, 13(1):231–237.
[Schroeter et al., 2000] Schroeter, J., Ostermann, J., Graf, H. P., Beutnagel, M. C., Cosatto, E., Syrdal, A. K., Conkie, A., and Stylianou, Y. (2000). Multimodal speech synthesis. In Proc. IEEE International Conference on Multimedia and Expo, pages 571–578.
[Schwartz et al., 2004] Schwartz, J.-L., Berthommier, F., and Savariaux, C. (2004). Seeing to hear better: evidence for early audio-visual interactions in speech identification. Cognition, 93(2):69–78.
[Schwippert and Benoit, 1997] Schwippert, C. and Benoit, C. (1997). Audiovisual intelligibility of an androgynous speaker. In Proc. International Conference on Auditory-visual Speech Processing, pages 81–84.
[Scott et al., 1994] Scott, K., Kagels, D., Watson, S., Rom, H., Wright, J., Lee, M., and Hussey, K. (1994). Synthesis of speaker facial movement to match selected speech sequences. In Proc. Australian Conference on Speech Science and Technology, pages 620–625.
[Second Life, 2013] Second Life (2013). Online: http://secondlife.com/.
[Senin, 2008] Senin, P. (2008). Dynamic time warping algorithm review. Technical report, Information and Computer Science Department, University of Hawaii, Honolulu.
[Shiraishi et al., 2003] Shiraishi, T., Toda, T., Kawanami, H., Saruwatari, H., and Shikano, K. (2003). Simple designing methods of corpus-based visual speech synthesis. In Proc. Annual Conference of the International Speech Communication Association (Interspeech), pages 2241–2244.
[Sifakis et al., 2005] Sifakis, E., Neverov, I., and Fedkiw, R. (2005). Automatic determination of facial muscle activations from sparse motion capture marker data. ACM Transactions on Graphics, 24(3):417–425.
[Sifakis et al., 2006] Sifakis, E., Selle, A., Robinson-Mosher, A., and Fedkiw, R. (2006). Simulating speech with a physics-based facial muscle model. In Proc. ACM SIGGRAPH/Eurographics symposium on Computer animation, pages 261–270.
[Skipper et al., 2007] Skipper, J. I., van Wassenhove, V., Nusbaum, H. C., and Small, S. L. (2007). Hearing lips and seeing voices: how cortical areas supporting speech production mediate audiovisual speech perception. Cerebral Cortex, 17(10):2387–2399.
[Slaney and Covell, 2001] Slaney, M. and Covell, M. (2001). Facesync: A linear operator for measuring synchronization of video facial images and audio tracks. Advances in Neural Information Processing Systems, 14:814–820.
[Smalley, 1963] Smalley, W. (1963). Manual of Articulatory Phonetics. Practical Anthropology.
[Smits et al., 2003] Smits, R., Warner, N., McQueen, J., and Cutler, A. (2003). Unfolding of phonetic information over time: A database of dutch diphone perception. Journal of the Acoustical Society of America, 113(1):563–573.
[Sproull et al., 1996] Sproull, L., Subramani, M., Kiesler, S., Walker, J., and Waters, K. (1996). When the interface is a face. Human-Computer Interaction, 11(2):97–124.
[Stegmann et al., 2003] Stegmann, M. B., Ersboll, B. K., and Larsen, R. (2003). Fame - a flexible appearance modeling environment. IEEE Transactions on Medical Imaging, 22(10):1319–1331.
[Stewart, 1922] Stewart, J. (1922). An electrical analogue of the vocal organs. Nature, 110:311–312.
[Summerfield, 1992] Summerfield, Q. (1992). Lipreading and audio-visual speech perception. Philosophical Transactions of the Royal Society of London: Biological Sciences, 335(1273):71–78.
[Swerts and Krahmer, 2005] Swerts, M. and Krahmer, E. (2005). Audiovisual prosody and feeling of knowing. Journal of Memory and Language, 53(1):81–94.
[Swerts and Krahmer, 2006] Swerts, M. and Krahmer, E. (2006). The importance of different facial areas for signalling visual prominence. In Proc. Annual Conference of the International Speech Communication Association (Interspeech), pages paper 1289–Tue3WeO.3.
[Tamura et al., 1999] Tamura, M., Kondo, S., Masuko, T., and Kobayashi, T. (1999). Text-to-audio-visual speech synthesis based on parameter generation from hmm. In Proc. European Conference on Speech Communication and Technology, pages 959–962.
[Tamura et al., 1998] Tamura, M., Masuko, T., Kobayashi, T., and Tokuda, K. (1998). Visual speech synthesis based on parameter generation from hmm: Speech-driven and text-and-speech-driven approaches. In Proc. International Conference on Auditory-Visual Speech Processing, pages 221–226.
[Tao et al., 2009] Tao, J., Xin, L., and Yin, P. (2009). Realistic visual speech synthesis based on hybrid concatenation method. IEEE Transactions on Audio, Speech, and Language Processing, 17(3):469–477.
[Taylor, 2009] Taylor, P. (2009). Text-to-speech synthesis. Cambridge University Press.
[Taylor et al., 2012] Taylor, S. L., Mahler, M., Theobald, B.-J., and Matthews, I. (2012). Dynamic units of visual speech. In Proc. ACM SIGGRAPH/Eurographics conference on Computer Animation, pages 275–284.
[Terzopoulos and Waters, 1993] Terzopoulos, D. and Waters, K. (1993). Analysis and synthesis of facial image sequences using physical and anatomical models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 15(6):569–579.
[Theobald, 2007] Theobald, B. (2007). Audiovisual speech synthesis. In Proc. International Congress on Phonetic Sciences, pages 285–290.
[Theobald et al., 2008] Theobald, B., Fagel, S., Bailly, G., and Elisei, F. (2008). Lips2008: Visual speech synthesis challenge. In Proc. Annual Conference of the International Speech Communication Association (Interspeech), pages 1875–1878.
[Theobald and Matthews, 2012] Theobald, B. and Matthews, I. (2012). Relating objective and subjective performance measures for aam-based visual speech synthesis. IEEE Transactions on Audio, Speech, and Language Processing, 20(8):2378–2387.
[Theobald et al., 2003] Theobald, B., Matthews, I., Glauert, J., Bangham, A., and Cawley, G. (2003). 2.5d visual speech synthesis using appearance models. In Proc. British Machine Vision Conference, pages 42–52.
[Theobald and Wilkinson, 2007] Theobald, B. and Wilkinson, N. (2007). A real-time speech-driven talking head using active appearance models. In Proc. International Conference on Auditory-visual Speech Processing, volume 7, pages 22–28.
[Theobald, 2003] Theobald, B.-J. (2003). Visual Speech Synthesis using Shape and Appearance Models. PhD thesis, University of East Anglia.
[Theobald et al., 2004] Theobald, B.-J., Bangham, J. A., Matthews, I. A., and Cawley, G. C. (2004). Near-videorealistic synthetic talking faces: implementation and evaluation. Speech Communication, 44(1):127–140.
[Theobald and Wilkinson, 2008] Theobald, B.-J. and Wilkinson, N. (2008). A probabilistic trajectory synthesis system for synthesising visual speech. In Proc. Annual Conference of the International Speech Communication Association (Interspeech), pages 1857–1860.
[Tiddeman and Perrett, 2002] Tiddeman, B. and Perrett, D. (2002). Prototyping and transforming visemes for animated speech. In Proc. Computer Animation, pages 248–251.
[Tinwell et al., 2011] Tinwell, A., Grimshaw, M., Nabi, D., and Williams, A. (2011). Facial expression of emotion and perception of the uncanny valley in virtual characters. Computers in Human Behavior, 27(2):741–749.
[Toutios et al., 2011] Toutios, A., Musti, U., Ouni, S., and Colotte, V. (2011). Weight optimization for bimodal unit-selection talking head synthesis. In Proc. Annual Conference of the International Speech Communication Association (Interspeech), pages 2249–2252.
[Toutios et al., 2010a] Toutios, A., Musti, U., Ouni, S., Colotte, V., Wrobel-Dautcourt, B., and Berger, M.-O. (2010a). Setup for acoustic-visual speech synthesis by concatenating bimodal units. In Proc. Annual Conference of the International Speech Communication Association (Interspeech), pages 486–489.
[Toutios et al., 2010b] Toutios, A., Musti, U., Ouni, S., Colotte, V., Wrobel-Dautcourt, B., and Berger, M.-O. (2010b). Towards a true acoustic-visual speech synthesis. In Proc. International Conference on Auditory-Visual Speech Processing.
[Turkmani, 2007] Turkmani, A. (2007). Visual Analysis of Viseme Dynamics. PhD thesis, University of Surrey.
[Uz et al., 1998] Uz, B., Gudukbay, U., and Ozguc, B. (1998). Realistic speech animation of synthetic faces. In Proc. Computer Animation, pages 111–118.
[Van Santen and Buchsbaum, 1997] Van Santen, J. and Buchsbaum, A. (1997). Methods for optimal text selection. In Proc. Eurospeech, pages 553–556.
[Van Son et al., 1994] Van Son, N., Huiskamp, T., Bosman, A., and Smoorenburg, G. (1994). Viseme classifications of dutch consonants and vowels. Journal of the Acoustical Society of America, 96(3):1341–1355.
[Van Wassenhove et al., 2005] Van Wassenhove, V., Grant, K. W., and Poeppel, D. (2005). Visual speech speeds up the neural processing of auditory speech. Proceedings of the National Academy of Sciences of the United States of America, 102(4):1181–1186.
[Van Wassenhove et al., 2007] Van Wassenhove, V., Grant, K. W., and Poeppel, D. (2007). Temporal window of integration in auditory-visual speech perception. Neuropsychologia, 45(3):598–607.
[Vatikiotis-Bateson et al., 1998] Vatikiotis-Bateson, E., Eigsti, I. M., Yano, S., and Munhall, K. G. (1998). Eye movement of perceivers during audiovisual speech perception. Perception and Psychophysics, 60(6):926–940.
[Verhelst and Roelands, 1993] Verhelst, W. and Roelands, M. (1993). An overlap-add technique based on waveform similarity (wsola) for high quality time-scale modification of speech. In Proc. IEEE international conference on Acoustics, speech, and signal processing, pages 554–557.
[Verma et al., 2003] Verma, A., Rajput, N., and Subramaniam, L. (2003). Using viseme based acoustic models for speech driven lip synthesis. In Proc. IEEE International Conference on Acoustics, Speech and Signal Processing, pages 720–723.
[Vicon Systems, 2013] Vicon Systems (2013). Online: http://www.vicon.com/.
[Vidakovic, 2008] Vidakovic, B. (2008). Statistical Modeling by Wavelets. Wiley.
[Vignoli and Braccini, 1999] Vignoli, F. and Braccini, C. (1999). A text-speech synchronization technique with applications to talking heads. In Proc. International Conference on Auditory-Visual Speech Processing, page paper 22.
[Visser et al., 1999] Visser, M., Poel, M., and Nijholt, A. (1999). Classifying visemes for automatic lipreading. In Proc. International Workshop on Text, Speech and Dialogue, pages 349–352.
[Viterbi, 1967] Viterbi, A. (1967). Error bounds for convolutional codes and an asymptotically optimum decoding algorithm. IEEE Transactions on Information Theory, 13(2):260–269.
[Von Kempelen, 1791] Von Kempelen, W. (1791). Mechanismus der menschlichen Sprache nebst der Beschreibung seiner sprechenden Maschine. J. B. Degen.
[Walker et al., 1994] Walker, J., Sproull, L., and Subramani, R. (1994). Using a human face in an interface. In Proc. SIGCHI conference on Human factors in computing systems, pages 85–91.
[Wang et al., 2010] Wang, L., Han, W., Qian, X., and Soong, F. (2010). Photo-real lips synthesis with trajectory-guided sample selection. In Proc. ISCA Workshop on Speech Synthesis, pages 217–222.
[Waters, 1987] Waters, K. (1987). A muscle model for animating three-dimensional facial expressions. Computer Graphics, 21(4):17–24.
[Waters and Frisbie, 1995] Waters, K. and Frisbie, J. (1995). A coordinated muscle model for speech animation. In Proc. Graphics Interface, pages 163–163.
[Waters and Levergood, 1993] Waters, K. and Levergood, T. (1993). Decface: An automatic lip-synchronization algorithm for synthetic faces. Technical report, DEC Cambridge Research Laboratory.
[Weiss et al., 2010] Weiss, B., Kuhnel, C., Wechsung, I., Fagel, S., and Moller, S. (2010). Quality of talking heads in different interaction and media contexts. Speech Communication, 52(6):481–492.
[Weiss, 2004] Weiss, C. (2004). A framework for data-driven video-realistic audiovisual speech synthesis. In Proc. International Conference on Language Resources and Evaluation.
[Wells, 1997] Wells, J. (1997). Sampa computer readable phonetic alphabet. In Gibbon, D., Moore, R., and Winski, R., editors, Handbook of standards and resources for spoken language systems. Berlin and New York: Mouton de Gruyter.
[Williams, 1990] Williams, L. (1990). Performance-driven facial animation. Computer Graphics, 24(4):235–242.
[Wilting et al., 2006] Wilting, J., Krahmer, E., and Swerts, M. (2006). Real vs acted emotional speech. In Proc. Annual Conference of the International Speech Communication Association (Interspeech), pages paper 1093–Tue1A3O.4.
[Wolberg, 1990] Wolberg, G. (1990). Digital Image Warping. IEEE Computer Society Press.
[Wolberg, 1998] Wolberg, G. (1998). Image morphing: a survey. The visual computer, 14(8):360–372.
[Woods, 1986] Woods, J. C. (1986). Lipreading: a guide for beginners. John Murray.
[Xvid, 2013] Xvid (2013). Online: http://www.xvid.org/.
[Yang et al., 2000] Yang, J., Xiao, J., and Ritter, M. (2000). Automatic selection of visemes for image-based visual speech synthesis. In Proc. IEEE International Conference on Multimedia and Expo, pages 1081–1084.
[Yilmazyildiz et al., 2006] Yilmazyildiz, S., Mattheyses, W., Patsis, Y., and Verhelst, W. (2006). Expressive speech recognition and synthesis as enabling technologies for affective robot-child communication. Lecture Notes in Computer Science, 4261:1–8.
[Young et al., 2006] Young, S. J., Evermann, G., Gales, M. J. F., Hain, T., Kershaw, D., Moore, G., Odell, J., Ollason, D., Povey, D., Valtchev, V., and Woodland, P. C. (2006). The HTK Book, version 3.4. Cambridge University Engineering Department.
[Ypsilos et al., 2004] Ypsilos, I., Hilton, A., Turkmani, A., and Jackson, P. (2004). Speech-driven face synthesis from 3d video. In Proc. 3D Data Processing, Visualization and Transmission Workshop, pages 58–65.
[Yu et al., 2010] Yu, D., Ghita, O., Sutherland, A., and Whelan, P. (2010). A novel visual speech representation and hmm classification for visual speech recognition. IPSJ Transactions on Computer Vision and Applications, 2:25–38.
[Zelezny et al., 2006] Zelezny, M., Krnoul, Z., Cisar, P., and Matousek, J. (2006). Design, implementation and evaluation of the czech realistic audio-visual speech synthesis. Signal Processing, 86(12):3657–3673.
[Zen et al., 2009] Zen, H., Tokuda, K., and Black, A. (2009). Statistical parametric speech synthesis. Speech Communication, 51(11):1039–1064.