FACULTY OF ENGINEERING
Department of Electronics and Informatics (ETRO)

A MULTIMODAL APPROACH TO AUDIOVISUAL TEXT-TO-SPEECH SYNTHESIS

Dissertation submitted in fulfilment of the requirements for the degree of Doctor in de Ingenieurswetenschappen (Doctor in Engineering) by ir. Wesley Mattheyses

Advisor: prof. dr. ir. Werner Verhelst
Brussels, June 2013

EXAMINING COMMITTEE
prof. Bart de Boer - Vrije Universiteit Brussel - Chair
prof. Rik Pintelon - Vrije Universiteit Brussel - Vice-chair
prof. Hichem Sahli - Vrije Universiteit Brussel - Secretary
prof. Barry-John Theobald - University of East Anglia - Member
dr. Juergen Schroeter - AT&T Labs - Member
prof. Werner Verhelst - Vrije Universiteit Brussel - Advisor

Preface

Back in the days when I was still studying to become an engineer, I honestly never thought I would find myself writing this PhD thesis. It is not that I had a clear goal in mind for my professional career after graduating, but I was always sure that pure research was not really my cup of tea. Things started to change in my final year, when I worked on my master thesis on auditory text-to-speech synthesis with prof. Werner Verhelst as supervisor. I was still not a big fan of staring at the Matlab editor from morning till evening, but on the other hand it was fascinating to see how something as complicated as a human speech signal could be mimicked by running programming code I had written myself. When prof. Verhelst offered me the possibility to continue working on my thesis subject in the form of a PhD, it took a few days before I could convince myself that it might indeed be possible for me to actually get a PhD and that I should at least try to take up the challenge. Looking back, I am very glad I took this unique once-in-a-lifetime opportunity that was offered to me.

The first year as a PhD student confronted me with the biggest challenge I had ever experienced. Everybody who went from secondary school to university knows the feeling: everything that seemed hard to do in secondary school suddenly seems negligible in comparison with what is expected from you in higher education. I experienced this terrifying feeling twice: when starting a PhD, the amount of knowledge you are missing to fulfil your tasks seems insurmountable and the list of things to do seems endless and infeasible. In addition, you know that no one around you has the time or the knowledge to solve the problems for you. Once seated behind your computer, the only thing to do is dig into the literature (Google and the Internet become your very best friends), let your brain work overtime, and start putting in place those very first bricks that eventually have to make up the huge castle representing your PhD.

After one year of working as a PhD student, I was lucky to be appointed as a teaching assistant (AAP) at ETRO. From that moment on, the workload became even larger, since I became responsible for teaching first-year engineering students how to write program code. This made me find out that I really enjoy teaching and transferring skills to other people.
It also made me realize that it is a challenge to let first-year students find their focus and motivation (and to be silent while you are explaining something), and that it is not easy to clearly explain things that are not straightforward, such as programming. Being assigned as a teaching assistant slowed down my PhD research, but on the other hand it was a blessing that I did not have to worry about funding, and the teaching offered a nice change from working with my good friends Matlab and Visual Studio.

Like anything else in life, the process of getting a PhD is a bumpy road with lots of ups and downs. Only researchers know the frustration of working for months on an optimization, only to find out in the end that it does not improve the results a single bit. Debugging and the unavoidable system crashes have altogether probably cost me a year of my life. On the other hand, research also offers incomparable highs when your code finally produces the desired output or when an experiment yields the hoped-for results. If I look back upon the 6.5 years I spent in research, these years constitute a very valuable lesson for life in general, teaching you to never give up trying to reach your goals, to pull yourself back together after a failure and to keep believing in the things you do.

Although the process of getting a PhD often feels like a lonely quest in which you are thrown back on your own resources, the opposite is true, and it is important to never forget the people around you and their valuable contributions. Therefore, I have to thank prof. Werner Verhelst for convincing me to start my PhD, for finding the necessary funding, for elaborating together on the proposed audiovisual speech synthesis approach and for carefully revising my publications and this thesis. I also thank my colleagues at ETRO and at DSSP for the nice working atmosphere I have been able to experience. Getting through a stressful day of failing programming code is only possible when it can be alternated with interesting, pleasant and especially funny conversations with the people around you. For this I thank my co-workers at building K and at building Ke, especially offices Ke3.13 and Ke3.14. Gio, Selma, Lukas, Tomas: I cannot imagine better office co-workers than you guys. I feel we have become real friends and that we will keep on meeting each other in our post-VUB life. Lukas and Tomas, you guys always succeeded in making me laugh with your (sometimes somewhat special) sense of humour! If I go through my picture collection, it is amazing how many events and parties we have already attended together (Tomas, you’re looking good in every single picture . . . sort of). You guys also made the Interspeech 2011 conference in Firenze probably the nicest conference ever (recall “Luca de la Tacci”, “Tommaso Blanchetti” and “Matteo di Mattei”). I hope we continue having fun together in the future and that we stay closely in touch. Obviously, colleagues are not only necessary for the appropriate working atmosphere, they also supply valuable advice and support to your research. I am very grateful to Lukas “Mr C++” Latacz for all the work he spent on the development of the linguistic front-end, the Festival backbone, and other parts of the system. Countless times he helped me with implementing the C++ part of the system, and I can honestly say I could not have developed the system as it is now without him.
I am also very thankful to Tomas “Eagle-Eye” Dekens, who always meticulously tested my subjective perception experiments. I also thank the other colleagues, friends and family who regularly participated in the subjective evaluations, such as Lukas, Selma, Gio, Yorgos, Jan L., Bruno, Henk, Pieter, Mikael, Chris, Jan G., Eric and Jenny. I am also grateful to Yorgos, who assisted in performing high-quality audiovisual recordings at the ETRO recording studio. These recordings would also never have been possible without the cooperation of Britt, Annick, Evelien and Kaat, who were willing to act as voice talent. In addition, I should not forget to thank Mike, Joris, Isabel and Bruno, who shared the joy (and the frustration) of teaching at the university, and to thank prof. Jacques Tiberghien, prof. Ann Dooms and prof. Jan Lemeire for their confidence in me for teaching their topics. In this final stage of my PhD, I would also like to show my gratitude to prof. Barry-John Theobald, dr. Juergen Schroeter, prof. Bart de Boer, prof. Rik Pintelon, prof. Hichem Sahli and prof. Werner Verhelst for being part of the PhD committee and for dedicating their time to reviewing this dissertation.

Finally, I would like to take this opportunity to thank the most important people in my life for the unconditional love and support they have given me. I would like to thank my wife Annick for brightening up my life and for offering me exactly the warmth, the joy, the distraction and the love that made me go on after hard and discouraging periods in the PhD research. Furthermore, words cannot express my gratitude for my parents, Eric and Jenny, who supported and motivated me from the beginning of my studies at the VUB until the end of my PhD. They kept on enduring my complaints and my doubts, and every time they succeeded in motivating me to keep going. During the past twelve years they organized their lives around lessons, exams and especially the NMBS train schedule (which is quite a burden). I cannot thank them enough for this. I would like to conclude my words of gratitude with a special mention of my father, who over the years spent more time in the car waiting for me at the train station than anybody should have to bear in a lifetime. Thanks for getting me home, dad!

Wesley Mattheyses
Spring 2013

Publications

Journal papers
◦ W. Mattheyses, L. Latacz and W. Verhelst, “On the importance of audiovisual coherence for the perceived quality of synthesized visual speech”, EURASIP Journal on Audio, Speech, and Music Processing, SI: Animating Virtual Speakers or Singers from Audio: Lip-Synching Facial Animation, 2009.
◦ W. Mattheyses, L. Latacz and W. Verhelst, “Comprehensive Many-to-Many Phoneme-to-Viseme Mapping and its Application for Concatenative Visual Speech Synthesis”, Speech Communication, Vol.55(7-8), pp.857-876, 2013.

Conference papers

First author
◦ W. Mattheyses, W. Verhelst and P. Verhoeve, “Robust Pitch Marking For Prosodic Modification Of Speech Using TD-Psola”, Proc. SPS-DARTS - IEEE BENELUX/DSP Valley Signal Processing Symposium, pp.43-46, 2006.
◦ W. Mattheyses, L. Latacz, Y.O. Kong and W. Verhelst, “A Flemish Voice for the Nextens Text-To-Speech System”, Proc. Fifth Slovenian and First International Language Technologies Conference, 2006.
◦ W. Mattheyses, L. Latacz, W. Verhelst and H. Sahli, “Multimodal Unit Selection for 2D Audiovisual Text-to-Speech Synthesis”, International workshop on Machine Learning for Multimodal Interaction, Springer Lecture Notes in Computer Science, Vol.5237, pp.125-136, 2008.
◦ W. Mattheyses, L. Latacz and W. Verhelst, “Multimodal Coherency Issues in Designing and Optimizing Audiovisual Speech Synthesis Techniques”, Proc. International Conference on Auditory-visual Speech Processing, pp.47-52, 2009.
◦ W. Mattheyses, L. Latacz and W. Verhelst, “Active Appearance Models for Photorealistic Visual Speech Synthesis”, Proc. Interspeech, pp.1113-1116, 2010.
◦ W. Mattheyses, L. Latacz and W. Verhelst, “Optimized Photorealistic Audiovisual Speech Synthesis Using Active Appearance Modeling”, Proc. International Conference on Auditory-visual Speech Processing, pp.148-153, 2010.
◦ W. Mattheyses, L. Latacz and W. Verhelst, “Automatic Viseme Clustering for Audiovisual Speech Synthesis”, Proc. Interspeech, pp.2173-2176, 2011.
◦ W. Mattheyses, L. Latacz and W. Verhelst, “Auditory and Photo-realistic Audiovisual Speech Synthesis for Dutch”, Proc. International Conference on Auditory-visual Speech Processing, pp.55-60, 2011.

Other
◦ S. Yilmazyildiz, W. Mattheyses, G. Patsis and W. Verhelst, “Expressive Speech Recognition and Synthesis as Enabling Technologies for Affective Robot-Child Communication”, Advances in Multimedia Information Processing - PCM06, Springer Lecture Notes in Computer Science, Vol.4261, pp.1-8, 2006.
◦ L. Latacz, Y.O. Kong, W. Mattheyses and W. Verhelst, “Novel Text-to-Speech Reading Modes for Educational Applications”, Proc. ProRISC/IEEE Benelux Workshop on Circuits, Systems and Signal Processing, pp.148-153, 2006.
◦ L. Latacz, Y.O. Kong, W. Mattheyses and W. Verhelst, “An Overview of the VUB Entry for the 2008 Blizzard Challenge”, Proc. Blizzard Challenge 2008, 2008.
◦ L. Latacz, W. Mattheyses and W. Verhelst, “The VUB Blizzard Challenge 2009 Entry”, Proc. Blizzard Challenge 2009, 2009.
◦ L. Latacz, W. Mattheyses and W. Verhelst, “The VUB Blizzard Challenge 2010 Entry: Towards Automatic Voice Building”, Proc. Blizzard Challenge 2010, 2010.
◦ S. Yilmazyildiz, L. Latacz, W. Mattheyses and W. Verhelst, “Expressive Gibberish Speech Synthesis for Affective Human-Computer Interaction”, Proc. International Conference on Text, Speech and Dialogue, pp.584-590, 2010.
◦ L. Latacz, W. Mattheyses and W. Verhelst, “Joint Target and Join Cost Weight Training for Unit Selection Synthesis”, Proc. Interspeech, pp.321-324, 2011.

Abstracts
◦ W. Mattheyses, L. Latacz and W. Verhelst, “2D Audiovisual Text-to-Speech Synthesis for Human-Machine Interaction”, Proc. Speech and Face to Face Communication, pp.24-25, 2008.
◦ W. Mattheyses and W. Verhelst, “Photorealistic 2D Audiovisual Text-to-Speech Synthesis using Active Appearance Models”, Proc. ACM / SSPNET International Symposium on Facial Analysis and Animation, p.13, 2010.

Synopsis

Speech has always been the most important means of communication between humans. When a message is conveyed, it is encoded in two separate signals: an auditory speech signal and a visual speech signal. The auditory speech signal consists of a series of speech sounds that are produced by the human speech production system. In order to generate different sounds, the parameters of this speech production system are varied.
Since some of the human articulators are visible to an observer (e.g., the lips, the teeth and the tongue), the variations of these visible articulators while uttering the speech sounds define a visual speech signal. It is well known that an optimal conveyance of the message requires that both the auditory and the visual speech signal can be perceived by the receiver. During the last decades the development of advanced computer systems has led to the current situation in which the vast majority of appliances, from industrial machinery to small household devices, are computer-controlled. This implies that nowadays people interact countless times with computer systems in everyday situations. Since the ultimate goal is to make this interaction feel completely natural and familiar, the optimal way to interact with a machine is by means of speech. Similar to the speech communication between humans, the most appropriate human-machine interaction consists of audiovisual speech signals. In order to allow the machine to transfer a spoken message towards its users, the device has to contain a so-called audiovisual speech synthesizer. This is a system that is capable of generating a novel audiovisual speech signal, typically from text input (so-called audiovisual text-to-speech (AVTTS) synthesis). Audiovisual speech synthesis has been a popular research topic in the last decade. The synthetic auditory speech mode, created by the synthesizer, consists of a waveform that resembles as closely as possible an original acoustic speech signal uttered by a human. The synthetic visual speech signal displays a virtual speaker exhibiting the speech gestures that match the synthetic auditory speech information. The great majority of the AVTTS synthesizers perform the synthesis in separate stages: in the first stages the auditory and the visual speech signals are synthesized consecutively and often completely independently, after which both synthetic speech modes are synchronized and multiplexed. Unfortunately, this strategy is unable to optimize the level of audiovisual coherence in the output signal. This motivates the development of a single-phase AVTTS synthesis approach, in which both speech modes are generated simultaneously, which makes it possible to maximize the coherence between the two synthetic speech signals. In this work a single-phase AVTTS synthesis technique was developed that constructs the desired speech signal by concatenating audiovisual speech segments that were selected from a database containing original audiovisual speech recordings from a single speaker. By selecting segments containing an original combination of auditory and visual speech information, the original coherence between both speech modes is copied as much as possible to the synthetic speech signal. Obviously, the simultaneous synthesis of the auditory and the visual speech entails some additional difficulties in optimizing the individual quality of both synthetic speech modes. Nevertheless, through subjective perception experiments it was concluded that the maximization of the level of audiovisual coherence is indeed necessary for achieving an optimal perception of the synthetic audiovisual speech signal. In the next part of the work it was investigated how the quality of the synthetic speech synthesized by the AVTTS system could be enhanced. To this end, the individual quality of the synthetic visual speech mode was improved, while ensuring not to affect the audiovisual coherence.
The original visual speech from the database was parameterized using an Active Appearance Model. This allows many optimizations, such as a normalization of the original speech data and a smoothing of the synthetic visual speech without affecting the visual articulation strength. Next, by the construction of a new extensive Dutch audiovisual speech database, the first-ever system capable of high-quality photorealistic audiovisual speech synthesis for Dutch was developed. In a final part of this work it was investigated how the AVTTS synthesis techniques can be adopted to create a novel visual speech signal matching an original auditory speech signal and its text transcript. For visual-only synthesis, the speech information can be described by means of either phoneme or viseme labels. The attainable synthesis quality using phonemes was compared with the synthesis quality attained using both standardized and speaker-dependent many-to-one phoneme-to-viseme mappings. In addition, novel context-dependent many-to-many phoneme-to-viseme mapping strategies were investigated and evaluated for synthesis. It was found that these novel viseme labels more accurately describe the visual speech information compared to phonemes and that they enhance the attainable synthesis quality in case only a limited amount of original speech data is available. Samenvatting Gesproken communicatie, bestaande uit een auditief en een visueel spraaksignaal, is altijd al de belangrijkste vorm van menselijke interactie geweest. Een optimale overdracht van de boodschap is enkel mogelijk indien zowel het auditieve als het visuele signaal adequaat kunnen worden waargenomen door de ontvanger. Vandaag de dag interageren we talloze keren met computersystemen in dagdagelijkse situaties. Het uiteindelijke doel bestaat erin om deze communicatie zo natuurlijk en vertrouwd mogelijk te laten overkomen. Dit impliceert dat de computersystemen best interageren met hun gebruikers door middel van gesproken communicatie. Net zoals de interactie tussen mensen onderling, zal de meest optimale vorm van mens-machine communicatie bestaan uit audiovisuele spraaksignalen. Om het mogelijk te maken het computersysteem een gesproken bericht te laten verzenden naar zijn gebruikers is een audiovisueel tekst-naar-spraak systeem, hetwelk in staat is om een nieuw audiovisueel spraaksignaal aan te maken gebaseerd op een gegeven tekst, noodzakelijk. Deze thesis focust op een spraaksynthese waarbij beide spraakmodaliteiten tegelijkertijd worden gesynthetiseerd. De voorgestelde synthesestrategie maakt het gewenste spraaksignaal aan door het samenvoegen van audiovisuele spraaksegmenten, bestaande uit een originele combinatie van akoestische en visuele spraakinformatie. Dit leidt tot een maximale audiovisuele coherentie tussen beide synthetische spraakmodaliteiten. Spraaksynthese van hoge kwaliteit is bereikt door middel van verscheidene optimalisaties, zoals een normalisatie van de originele visuele spraakdata en een smoothing van het gesynthetiseerde visuele spraaksignaal waarbij de audiovisuele coherentie zo weinig mogelijk wordt aangetast. Met behulp van een nieuwe, uitgebreide en geoptimaliseerde Nederlandstalige audiovisuele spraakdatabase is het allereerste tekst-naar-spraak systeem gerealiseerd dat in staat is om fotorealistische Nederlandstalige audiovisuele spraak te genereren van hoge kwaliteit. 
Door middel van subjectieve perceptie-experimenten is vastgesteld dat de maximalisatie van het niveau van audiovisuele coherentie inderdaad noodzakelijk is om een optimale waarneming van de synthetische spraak te bekomen. In het geval dat er reeds een akoestisch signaal beschikbaar is, volstaat een automatische generatie van de visuele spraakmodaliteit. Hierbij kan de spraakinformatie worden beschreven aan de hand van zowel fonemen als visemen. De haalbare synthesekwaliteit gebruik makende van een foneem-gebaseerde spraaklabeling is vergeleken met de haalbare synthesekwaliteit gebruik makende van een labeling gebaseerd op zowel gestandaardiseerde als spreker-afhankelijke veel-op-een relaties tussen fonemen en visemen. Tot slot zijn er ook nieuwe context-afhankelijke veel-op-veel relaties tussen fonemen en visemen opgesteld. Door middel van objectieve en subjectieve evaluaties is aangetoond dat deze nieuwe viseemdefinities leiden tot een verbetering van de visuele spraaksynthese.

Supplementary Data

Supplementary data associated with this dissertation can be found online at http://www.etro.vub.ac.be/Personal/wmatthey/phd_demo.htm
Audiovisual samples illustrating the following sections are supplied:
Section 3.3.2 - Databases for synthesis
Section 3.5 - Concatenated audiovisual segments
Section 3.6 - Evaluation of the audiovisual speech synthesis strategy
Section 3.7 - Evaluation of audiovisual optimal coupling
Section 4.4 - Optimized audiovisual speech synthesis
Section 4.5 - Evaluation of the AAM-based AVTTS approach
Section 5.2 - Dutch audiovisual database “AVKH”
Section 5.3 - AVTTS synthesis for Dutch
Section 5.4.1 - Turing test
Section 5.4.2 - Single-phase vs Two-phase audiovisual speech synthesis
Section 6.4 - Evaluation of many-to-one mapping schemes for English
Section 6.6 - Evaluation of many-to-many mapping schemes for Dutch

Contents

Preface
Publications
Synopsis
Samenvatting
Supplementary Data
Contents
List of Figures
List of Tables
Abbreviations
1 Introduction
  1.1 Spoken communication
  1.2 Speech signals
    1.2.1 Human speech production
    1.2.2 Description of auditory speech
  1.3 Multimodality of speech
  1.4 Synthetic speech
    1.4.1 Motivation
    1.4.2 A brief history on human-machine interaction using auditory speech
    1.4.3 Multimodality of spoken human-machine communication
    1.4.4 Applications
  1.5 Audiovisual speech synthesis at the VUB
    1.5.1 Thesis outline
    1.5.2 Contributions
2 Generation of synthetic visual speech
  2.1 Facial animation and visual speech synthesis
  2.2 An overview on visual speech synthesis
    2.2.1 Input requirements
    2.2.2 Output modality
    2.2.3 Output dimensions
    2.2.4 Photorealism
    2.2.5 Definition of the visual articulators and their variations
      2.2.5.1 Speech synthesis in 3D
      2.2.5.2 Speech synthesis in 2D
      2.2.5.3 Standardization: FACS and MPEG-4
    2.2.6 Prediction of the target speech gestures
      2.2.6.1 Coarticulation
      2.2.6.2 Rule-based synthesis
      2.2.6.3 Concatenative synthesis
      2.2.6.4 Synthesis based on statistical prediction
  2.3 Positioning of this thesis in the literature
3 Single-phase concatenative AVTTS synthesis
  3.1 Motivation
  3.2 A concatenative audiovisual text-to-speech synthesizer
    3.2.1 General text-to-speech workflow
    3.2.2 Concatenative single-phase AVTTS synthesis
  3.3 Database preparation
    3.3.1 Requirements
    3.3.2 Databases used for synthesis
    3.3.3 Post-processing
      3.3.3.1 Phonemic segmentation
      3.3.3.2 Symbolic features
      3.3.3.3 Acoustic features
      3.3.3.4 Visual features
  3.4 Audiovisual segment selection
    3.4.1 Minimization of a global cost function
    3.4.2 Target costs
      3.4.2.1 Phonemic match
      3.4.2.2 Symbolic costs
      3.4.2.3 Safety costs
    3.4.3 Join costs
      3.4.3.1 Auditory join costs
      3.4.3.2 Visual join costs
    3.4.4 Weight optimization
      3.4.4.1 Cost scaling
      3.4.4.2 Weight distribution
  3.5 Audiovisual concatenation
    3.5.1 A visual mouth-signal and a visual background-signal
    3.5.2 Audiovisual synchrony
    3.5.3 Audio concatenation
    3.5.4 Video concatenation
  3.6 Evaluation of the audiovisual speech synthesis strategy
    3.6.1 Single-phase and two-phase synthesis approaches
    3.6.2 Evaluation of the audiovisual coherence
      3.6.2.1 Method and subjects
      3.6.2.2 Test strategies
      3.6.2.3 Samples and results
      3.6.2.4 Discussion
    3.6.3 Evaluation of the perceived naturalness
      3.6.3.1 Method and subjects
      3.6.3.2 Test strategies
      3.6.3.3 Samples and results
      3.6.3.4 Discussion
    3.6.4 Conclusions
  3.7 Audiovisual optimal coupling
    3.7.1 Concatenation optimization
      3.7.1.1 Maximal coherence
      3.7.1.2 Maximal smoothness
      3.7.1.3 Maximal synchrony
    3.7.2 Perception of non-uniform audiovisual asynchrony
    3.7.3 Objective smoothness assessment
    3.7.4 Subjective evaluation
    3.7.5 Conclusions
  3.8 Summary and conclusions
4 Enhancing the visual synthesis using AAMs
  4.1 Introduction and motivation
  4.2 Facial image modeling
  4.3 Audiovisual speech synthesis using AAMs
    4.3.1 Motivation
    4.3.2 Synthesis overview
    4.3.3 Database preparation and model training
    4.3.4 Segment selection
      4.3.4.1 Target costs
      4.3.4.2 Join costs
    4.3.5 Segment concatenation
  4.4 Improving the synthesis quality
    4.4.1 Parameter classification
    4.4.2 Database normalization
    4.4.3 Differential smoothing
    4.4.4 Spectral smoothing
  4.5 Evaluation of the AAM-based AVTTS approach
    4.5.1 Visual speech-only
      4.5.1.1 Test setup
      4.5.1.2 Participants and results
      4.5.1.3 Discussion
    4.5.2 Audiovisual speech
      4.5.2.1 Test setup
      4.5.2.2 Participants and results
      4.5.2.3 Discussion
  4.6 Summary and conclusions
5 High-quality AVTTS synthesis for Dutch
  5.1 Motivation
  5.2 Database construction
    5.2.1 Text selection
      5.2.1.1 Domain-specific
      5.2.1.2 Open domain
      5.2.1.3 Additional data
    5.2.2 Recordings
    5.2.3 Post-processing
      5.2.3.1 Acoustic signals
      5.2.3.2 Video signals
  5.3 AVTTS synthesis for Dutch
  5.4 Evaluation of the Dutch AVTTS system
    5.4.1 Turing Test
      5.4.1.1 Introduction
      5.4.1.2 Test set-up and test samples
      5.4.1.3 Participants and results
      5.4.1.4 Discussion
    5.4.2 Comparison between single-phase and two-phase audiovisual speech synthesis
      5.4.2.1 Motivation
      5.4.2.2 Method and samples
      5.4.2.3 Subjects and results
      5.4.2.4 Discussion
  5.5 Summary and conclusions
6 Context-dependent visemes
  6.1 Introduction
    6.1.1 Motivation
    6.1.2 Concatenative VTTS synthesis
  6.2 Visemes
    6.2.1 The concept of visemes
    6.2.2 Visemes for the Dutch language
  6.3 Phoneme-to-viseme mapping for visual speech synthesis
    6.3.1 Application of visemes in VTTS systems
    6.3.2 Discussion
    6.3.3 Problem statement
  6.4 Evaluation of Nx1 mapping schemes for English
    6.4.1 Design of many-to-one phoneme-to-viseme mapping schemes
    6.4.2 Experiment
    6.4.3 Conclusions
  6.5 Many-to-many phoneme-to-viseme mapping schemes
    6.5.1 Tree-based clustering
      6.5.1.1 Decision trees
      6.5.1.2 Decision features
      6.5.1.3 Pre-cluster
      6.5.1.4 Clustering into visemes
      6.5.1.5 Objective candidate test
    6.5.2 Towards a useful many-to-many mapping scheme
      6.5.2.1 Decreasing the number of visemes
      6.5.2.2 Evaluation of the final NxM visemes
  6.6 NxM visemes for concatenative visual speech synthesis
    6.6.1 Application in a large-database system
    6.6.2 Application in limited-database systems
      6.6.2.1 Limited databases
      6.6.2.2 Evaluation of the segment selection
      6.6.2.3 Evaluation of the synthetic visual speech
  6.7 Summary and conclusions
7 Conclusions
  7.1 Brief summary
  7.2 General conclusions
  7.3 Future work
    7.3.1 Enhancing the audiovisual synthesis quality
    7.3.2 Adding expressions and emotions
    7.3.3 Future evaluations
A The Viterbi algorithm
B English phonemes
C English visemes

List of Figures

1.1 The human speech production system
1.2 The talking computer HAL-9000
1.3 Wheatstone’s Talking Machine
1.4 Fictional rudimentary synthetic visual speech
1.5 Examples of various visual speech synthesis systems
1.6 Intelligibility scores as a function of acoustic degradation
1.7 The uncanny valley effect
2.1 Examples of mechanically generated visual speech
2.2 Georges Demeny’s “Phonoscope”
2.3 Pioneering realistic computer-generated facial expressions
2.4 Two approaches for audiovisual text-to-speech synthesis
2.5 Beyond pure 2D/3D synthesis
2.6 Various examples of synthetic 2D visual speech
2.7 Various examples of synthetic 3D visual speech
2.8 Anatomy-based facial models
2.9 The VICON motion capture system
2.10 2D visual speech synthesis
2.11 The facial feature points defined in the MPEG-4 standard
2.12 Visual speech synthesis using articulation rules to define keyframes
2.13 Modelling visual coarticulation using the Cohen-Massaro model
2.14 Visual speech synthesis based on the concatenation of segments of original speech data
2.15 Visual speech synthesis based on statistical prediction of visual features
3.1 Overview of the AVTTS synthesis
3.2 Diphone-based unit selection
3.3 Overview of the audiovisual unit selection approach
3.4 Example frames from the AVBS audiovisual database
3.5 Example frames from the LIPS2008 audiovisual database
3.6 Landmarks indicating the various parts of the face
3.7 Detection of the teeth and the mouth-cavity (1)
3.8 Detection of the teeth and the mouth-cavity (2)
3.9 Detection of the teeth and the mouth-cavity (3)
3.10 Unit selection synthesis using target and join costs
3.11 Unit selection trellis
3.12 Target costs applied in the AVTTS synthesis
3.13 Join costs applied in the AVTTS synthesis
3.14 Join cost histograms
3.15 Merging the mouth-signal with the background signal
3.16 Auditory concatenation artifacts
3.17 Pitch-synchronous audio concatenation
3.18 Visual concatenation by image morphing
3.19 Audiovisual consistence test results
3.20 Naturalness test results
3.21 Audiovisual optimal coupling: methods
3.22 Audiovisual optimal coupling: resulting signals
3.23 Objective smoothness measures
4.1 A point-light visual speech signal
4.2 AAM-based image modeling
4.3 AVTTS synthesis using an active appearance model
4.4 AAM-based representation of the original visual speech data
4.5 AAM sub-trajectory concatenation
4.6 Relation between AAM parameters and physical properties
4.7 Speech-correlation of the AAM shape and texture parameters
4.8 AAM Normalization
4.9 Spectral smoothing of a parameter trajectory
4.10 Evaluation of AAM-based AVTTS synthesis: visual speech-only
4.11 Evaluation of AAM-based AVTTS synthesis: audiovisual speech
5.1 Overview of the recording setup
5.2 Some details of the recording setup
5.3 Example frames from the AVKH audiovisual speech database
5.4 Landmark information for the AVKH database
5.5 AAM resynthesis for the AVKH database
5.6 AAM modelling of the complete face
5.7 Final output of the Dutch AVTTS system
5.8 Ratio of incorrect answers obtained by the experts and the non-experts
5.9 Ratio of incorrect answers for each type of sentence
5.10 Comparison between single-phase and two-phase audiovisual synthesis
6.1 Subjective evaluation of the Nx1 phoneme-to-viseme mappings
6.2 Pre-cluster features
6.3 Candidate test results for the tree-based visemes
6.4 Candidate test results for the final visemes
6.5 Relation between the visual speech synthesis stages and the objective measures
6.6 Evaluation of the segment selection using a large database
6.7 Evaluation of the segment selection using a large database (2)
6.8 Evaluation of the segment selection using a limited database
6.9 Evaluation of the segment selection using a limited database (2)
6.10 DTW-based evaluation of the final synthesis result
6.11 Subjective test results evaluating the synthesis quality using various NxM visemes
6.12 Subjective test results evaluating the synthesis quality using the most optimal NxM visemes
7.1 Facial expressions related to a happy emotion
A.1 Unit selection trellis
A.2 Unit selection trellis costs

List of Tables

1.1 Example of a many-to-one phoneme-to-viseme mapping for English
3.1 Symbolic database features
3.2 Test strategies for the audiovisual consistence test
3.3 Test strategies for the naturalness test
3.4 Detection of local audiovisual asynchrony
3.5 Various optimal coupling configurations
3.6 Subjective evaluation of the optimal coupling approaches
4.1 Normalization test results: visual speech-only
4.2 Normalization test results: audiovisual speech
4.3 Differential visual concatenation smoothing
4.4 Subjective trajectory filtering experiment
4.5 Optimal filter settings
5.1 Turing test results
5.2 Turing test results for the non-experts
6.1 Subjective evaluation of the Nx1 phoneme-to-viseme mappings for English: Wilcoxon signed-rank analysis
6.2 Decision tree configurations
6.3 Mapping from tree-based visemes to final NxM visemes
6.4 Construction of limited databases
6.5 Subjective test results evaluating the synthesis quality using various NxM visemes: Wilcoxon signed-rank analysis
6.6 Subjective test results evaluating the synthesis quality using the most optimal NxM visemes: Wilcoxon signed-rank analysis
B.1 English phone set and classifications
C.1 Many-to-one phoneme-to-viseme mappings for English

Abbreviations

Abbreviation - Expansion - Section
AAM - Active Appearance Model - 4.2
ANN - Artificial Neural Network - 2.2.1
AVTTS - Audiovisual Text-to-Speech - 2.2.2
C/V - Consonant/Vowel - 6.5.1.2
CGI - Computer Generated Imagery - 2.1
EM-PCA - Expectation-Maximisation Principal Component Analysis - 2.2.5.1
FACS - Facial Action Coding System - 2.2.5.3
FAP - Facial Action Parameter - 2.2.5.3
FFT - Fast Fourier Transform - 4.4.4
HMM - Hidden Markov Model - 1.4.2
ICA - Independent Component Analysis - 2.2.5.1
LPC - Linear Predictive Coding - 2.2.1
LSP - Line Spectral Pairs - 2.2.1
MFCC - Mel-Frequency Cepstrum Coefficient - 2.2.1
MOS - Mean Opinion Score - 3.6.2.1
Nx1 - Many-to-One - 6.2
NxM - Many-to-Many - 6.2
PCA - Principal Component Analysis - 2.2.1
PEC - Phonemic Equivalence Class - 6.2.1
TTS - Text-to-Speech - 2.2.1
VTTS - Visual Text-to-Speech - 6.1

1 Introduction

1.1 Spoken communication

One of the main driving forces behind the development of mankind is its capability to effectively transfer thoughts and ideas by means of sound signals.
Where at the dawn of man this communication was nothing more than a way to express basic instincts like fear and excitement by means of some grunts and growls (as can nowadays still be observed in some higher mammals), the communication between humans slowly but surely evolved towards a real spoken language. A complicated message can be passed from one individual towards many others by uttering a series of sounds of which the subtle variations can be perceived by the receiver. When both the sender and the receiver agree on the linkage between such a series of sound segments and real-life concepts, a successful transfer of the information is possible. Throughout history, countless spoken languages have been employed, of which only a few are still practised today. Each of these languages exhibits its own rules and agreements on the nature of its particular speech sounds and their semantic meaning. They define the elementary building blocks of which a speech signal must be constructed in order to successfully transfer a message from the speaker to the receiver. These building blocks are called phones and phonemes, two concepts that can easily be mixed up, although there is a subtle yet important difference between them. A phone is an elementary sound that can be produced by the human speech production system. A speech signal is generated by consecutively altering the parameters of this speech production system in order to create the auditory speech signal, which consists of a sequence of such phones. On the other hand, a spoken language is defined by its characteristic phonemes. Each phoneme is a unique label for a set of phones that are cognitively equivalent for the particular language. In an auditory speech signal, the replacement of a single phone, corresponding to phoneme X, with another phone that corresponds to a phoneme other than X, causes a change in the message conveyed by the speech signal. Two phones that correspond to the same phoneme are called allophones. Replacing a single phone in a speech signal by one of its allophones does not change the semantic meaning of the speech, although it may sound strange or less intelligible. Since spoken language has always been the primary means of communication between humans, the interest in this topic has existed since the beginning of science. While research in the field of anatomy and physiology has been trying to figure out how the human body is able to produce the various phones, the field of phonology has been trying to describe spoken languages by identifying their characteristic phonemes, their prosodic properties (pitch, timing, accents) and the semantic meaning of these features.

1.2 Speech signals

1.2.1 Human speech production

The part of the human body that is responsible for the production of speech sounds is a complicated system that consists of two major components: organs of phonation and organs of articulation. The phonation system creates the air flow that will eventually leave the body and causes a change in air pressure which is needed to carry the spoken message. The two most important organs for phonation are the lungs and the larynx. The first one is responsible for the production of the necessary air pressure, while the latter controls the vibration of the vocal folds. The articulatory organs are used to create the resonances and modulations that are necessary to shape the air flow in order to achieve the target speech sound.
Important articulators are the lower jaw, the tongue, the lips and the velum. The exact manner in which a particular sound is produced is a very complicated process, for which all the different phonation and articulation organs cooperate. The speech production can be described in terms of three stages in which the sounds are created: the subglottal tract, the vocal tract and the (para-)nasal cavities (illustrated in figure 1.1).

Figure 1.1: The human speech production system [Benesty et al., 2008].

1.2.2 Description of auditory speech

When people talk, the successive alterations of the properties of their phonation and articulation organs result in the production of the target sequence of speech sounds. These sounds can be classified either as consonants or as vowels. Vowels are those sounds that have no audible friction caused by the narrowing or obstruction of some part of the upper vocal tract. Different vowels can be generated by varying the degree of lip aperture, the degree of lip rounding and the placement of the tongue within the oral cavity. On the other hand, sounds that do exhibit an audible friction or closure at some point within the upper vocal tract are called consonants. Consonant sounds vary by the place in the vocal tract where the airflow is obstructed (e.g., at the lips, the teeth, the velum, the glottis, etc.). Each place of articulation produces a different set of consonants, which are further distinguished by the kind of friction that is applied (a full closure or a particular degree of aperture). The larynx also plays an important role in the discrimination of the different phones. When the vocal folds are relaxed during the traversal of the air flow, so-called voiceless or unvoiced sounds are created. On the other hand, when the vocal folds are tightened, the traversal of the air flow causes them to vibrate. This vibration influences the shape of the air flow in such a way that it will exhibit a periodic behaviour. Speech sounds that are uttered while the vocal folds are vibrating are called voiced sounds. Vowels are always voiced, but consonants can be either voiced or unvoiced. Often two different consonants are discerned by only changing the voicing property while all other parameters of the speech production system are the same (e.g., the English phoneme /s/ from the word “some” is unvoiced, while its voiced counterpart is the phoneme /z/ as heard in the English word “zoo”).

To encode a message by means of speech sounds, it is not only the particular sequence of generated phonemes that determines the semantic meaning of the speech signal. The speech sounds also exhibit particular prosodic properties [Smalley, 1963]. The pitch of a voiced speech sound is determined by its fundamental frequency. It is controlled by the larynx, which regulates the degree of tightening of the vocal folds. Pitch variations are often used to assign expression to the speech; for instance, an English sentence can be transformed into a question just by raising the pitch of the sounds at the end. Another prosodic property is the duration of the speech sounds. For instance, by lengthening a particular word from a sentence, stress can be assigned to it. Finally, speech sounds also exhibit an energy (or amplitude) property. This is controlled by the lungs, which can vary the strength of the air flow. This property is mainly used in combination with pitch variations to assign stress to specific parts of the sentence.
1.3 Multimodality of speech

In the previous section it was briefly explained how the cooperative work of multiple human organs is capable of producing speech sounds. Some of these organs, such as the lungs, the larynx, and the nasal cavities, are not visible from the outside. On the other hand, articulators like the lips are clearly visible when looking at a speaker’s face. In addition, other articulators such as the tongue, the teeth, and even the velum are in some cases visible, depending on the particular sounds that are being produced. The exact appearance of these visible articulators is highly correlated with the uttered phone. This implies that spoken communication should not be seen as a unimodal signal consisting of solely auditory information, since the variations of the visible articulators (lips, cheeks, tongue, teeth, etc.) when uttering the speech sounds define a visible speech signal that encodes the message as well. Consequently, the speech message is encoded in both an auditory and a visual speech signal.

One can wonder if this double encoding is worth investigating, since it undeniably contains a lot of redundancy. It could be that the visible articulatory gestures are just a side-effect of the speech production process, meaning that the visible speech mode adds no extra information to the communication channel. Let’s take a look at the everyday use of spoken language. Imagine that someone starts talking to a colleague who is reading an interesting dissertation. The reader will instinctively look up from the text and gaze at the talker’s face. Maybe this is just an act of politeness, since we all tell our children to do so, but it is likely that this habit originates from the fact that we have to make an effort to understand each other as well as possible. Don’t we also tell our children not to talk with their hand in front of their mouth in order to be more intelligible? Similarly, when we are having a conversation with someone at a place where there is a lot of background noise, we will always try to keep our eyes fixed on our companion’s mouth in order to better understand his/her words. This effect is also noticeable when speaking to a group of people at a busy party where the speech gets polluted by lots of other speech sounds. In that case, we understand the person from the group we are looking at much better than the others. The visual clues also assist in focusing our attention, since they help to match each of the audible speech sounds with the correct speaker [Plenge and Tilse, 1975]. From these examples it is clear that the multimodal coding of the speech message is actively used to improve the quality of the communication. The auditory speech is considered the primary communication channel, but also receiving the visible speech information helps to better understand the message [Massaro and Cohen, 1990] [Summerfield, 1992] [Schwippert and Benoit, 1997] [Benoit et al., 2000] [Schwartz et al., 2004] [Van Wassenhove et al., 2005], especially in conditions where the auditory speech is polluted with noise [Erber, 1975] [MacLeod and Summerfield, 1987] [MacLeod and Summerfield, 1990] [Grant et al., 1998] [Bernstein et al., 2004] [Ma et al., 2009]. The most extreme example of this is the fact that hearing-impaired people can be trained to understand a speech message by only receiving the visual speech information (so-called lip-reading) [Jeffers and Barley, 1971] [Woods, 1986] [Summerfield, 1992] [Bernstein et al., 2000].
Being able to notice the changing appearance of the visual articulators during the uttering of speech sounds increases the intelligibility of the speech. In addition, including the visual speech mode in the communication adds more expression and extra metacognitive information to the speech as well [Pelachaud et al., 1991] [Swerts and Krahmer, 2005] [Granstrom and House, 2005]. Comparable to auditory prosody, some degree of visual prosody is added to the speech signal in order to assign stress or prominence to particular parts of the speech [Swerts and Krahmer, 2006] or to add an emotion to the message [Grant, 1969] [Ekman et al., 1972] [Schmidt and Cohn, 2001]. This visual prosody can either add new expressive information to the message or it can be used to strengthen the effect of the auditory prosody [Graf et al., 2002] [Al Moubayed et al., 2010]. Typical visual prosodic gestures are movements of the eyebrows [Granstrom et al., 1999] [Krahmer et al., 2002] and subtle head movements like shakes or nods [Fisher, 1969] [Hadar et al., 1983] [Bernstein et al., 1989] [Munhall et al., 2004]. Another means of visual prosody is eye gaze, which can be linked with grammatical and conversational clues [Argyle and Cook, 1976] [Vatikiotis-Bateson et al., 1998]. Much research has been conducted to identify typical configurations of the eyebrows, the mouth, the eyes, and even the skin (e.g., wrinkles) that can be matched with basic emotions such as joy, sadness, fear, anger, surprise and disgust. By arranging the facial appearance towards one of these configurations, a speaker is able to emphasize the emotion that corresponds to the message that is conveyed [Wilting et al., 2006] [Gordon and Hibberts, 2011].

Receiving the visual speech mode helps to better understand the message and it enhances the expressiveness of the spoken communication. In addition, the concept of confidence should be considered. Let’s take another look at the everyday use of speech with the example of a salesman who is trying to sell one of his products. Will a customer be more likely to purchase when the salesman performs his sales talk over the telephone or when the salesman and the customer have a face-to-face conversation? In the latter case customers will feel more confident about the purchase since they have been able to observe the speaker while listening to his well-prepared discourse. By noticing the subtle clues in the visual speech mode (e.g., frowns or eye gaze) they are (or at least they assume that they are) more capable of correctly determining whether the salesman is trustworthy or not. A similar effect is noticeable when people are having an argument with someone over the telephone, which is more likely to escalate in comparison with a face-to-face discussion, since in the auditory speech-only case they have to make assumptions about the emotional state of their speaking partner and about the intended expression of his/her words. From these examples it is clear that the multimodality of speech is not only a side-effect of which people take advantage whenever possible. In their effort to convey the message as efficiently as possible, humans will always try to use both the auditory and the visual communication channel. When the circumstances somehow disrupt this multimodality, we feel less satisfied with the communication since we are less assured of a correct conveyance of the message.
From the previous paragraphs it is clear that speech should be seen as a multimodal means of communication. When the sender of the message utters the speech sounds, the multimodality intrinsically appears through the variations of his/her visual articulators. On the receiver’s side, the auditory speech information is captured by the ears and the visual speech information is captured by the eyes. Once captured, these two information streams are sent to the brain. Research on the human perception of audiovisual speech information can be considered final evidence for the truly multimodal nature of speech communication, since it has been shown that the brain does not separately decode the auditory and the visual speech information. Instead, the captured speech is analyzed by means of a complex multimodal decoding of both information streams at once, in which the decoding of the auditory information is influenced by the captured visual information and vice versa [Skipper et al., 2007] [Campbell, 2008] [Benoit et al., 2010]. The best known indication for the existence of such a combined decoding is the so-called McGurk effect [McGurk and MacDonald, 1976]. This effect occurs when displaying audiovisual speech fragments of which the auditory and the visual information originate from different sources. For example, observers reported hearing a /ta/ when a sound track containing the syllable /pa/ is dubbed onto a video track of a mouth producing /ka/. A similar effect is noticed when an auditory syllable /ba/ is dubbed with the visual gestures for /ga/. In that case, most people report hearing the syllable /da/. Another such effect, called visual capture, occurs when listeners who are perceiving unmatched auditory-visual speech report hearing the visually presented syllable instead of the auditory syllable that was presented [Andersen, 2010]. Since speech is a truly multimodal signal, it is obvious that the analysis of a speech signal should inspect both the auditory and the visual speech mode. For this purpose, in section 1.1 the concept of a phoneme was explained. Likewise, a visual speech signal can also be split up into elementary building blocks. These atomic units of visual speech are called visemes [Fisher, 1968]. For each language a typical set of visemes can be determined, which describes the distinct visual appearances that occur when uttering speech sounds of that particular language. Such a representative viseme set can be constructed by collecting for each distinct phoneme its typical visual representation. However, some phoneme groups will exhibit a similar visual representation since their articulatory production only differs in terms of invisible features (e.g., voicing or nasality). This implies that a language is defined by a smaller number of visemes in comparison to the number of phonemes. A mapping table from phonemes to visemes will thus exhibit a many-to-one behaviour. Later on, in chapter 6, the mapping from phonemes to visemes will be extensively described and investigated, since much can be said about the naive concept of such a many-to-one mapping. Nevertheless, at this point it is sufficient to consider a viseme as an elementary building block of a visual speech signal and to remember that from each phoneme sequence describing an auditory speech mode, a matching viseme sequence describing the corresponding visual speech mode can be determined by a many-to-one mapping table such as the one illustrated in table 1.1.
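Computationally, such a many-to-one mapping is nothing more than a lookup table. The small sketch below is a minimal illustration of the idea, not the implementation used in this thesis; the handful of entries follows the grouping shown in table 1.1 and the SAMPA-like phoneme symbols are chosen purely for the example.

```python
# Minimal sketch: a many-to-one phoneme-to-viseme mapping as a plain dictionary.
# The entries follow the grouping illustrated in table 1.1; the phoneme symbols
# are SAMPA-like labels used for illustration only.
PHONEME_TO_VISEME = {
    "p": 1, "b": 1, "m": 1,      # bilabials share a single visual appearance
    "f": 2, "v": 2,              # labiodentals
    "t": 4, "d": 4,
    "k": 5, "g": 5,
    "s": 7, "z": 7,
    "A:": 10, "e": 11, "I": 12,
}

def phonemes_to_visemes(phoneme_sequence):
    """Map a phoneme sequence onto its viseme sequence (many-to-one)."""
    return [PHONEME_TO_VISEME[p] for p in phoneme_sequence]

# Two different phoneme sequences can collapse onto the same viseme sequence:
print(phonemes_to_visemes(["b", "I", "t"]))  # [1, 12, 4]
print(phonemes_to_visemes(["p", "I", "d"]))  # [1, 12, 4]
```

The last two lines make the loss of information explicit: distinct phoneme sequences can map onto an identical viseme sequence, which is precisely why the naive many-to-one concept is re-examined in chapter 6.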
Table 1.1: Example of a many-to-one phoneme-to-viseme mapping for English.
Viseme   Phonemes     Example
1        p, b, m      put, bed, mill
2        f, v         far, voice
3        T, D         think, that
4        t, d         tip, doll
5        k, g         call, gas
6        tS, dZ, S    chair, join, she
7        s, z         sir, zeal
8        n, l         lot, not
9        r            red
10       A:           car
11       e            bed
12       I            tip
13       Q            top
14       U            book
1.4 Synthetic speech 1.4.1 Motivation Throughout history, speech has been a means of communication used exclusively for interaction between humans. Apart from a few animals such as dogs that understand short uttered commands, no non-human living creature nor any machinery has been able to understand or produce complicated speech messages. Up until a few decades ago, the communication with machines was solely based on levers, switches, gauges, indicator lamps and other mechanical input and output devices. Over the last decades, however, this situation has evolved drastically as the development of advanced computer technologies has led to powerful computer systems that are capable of processing complicated calculations. In parallel, these computer systems have become more and more visible in common everyday situations. At present, cars, heavy machinery, vending machines, medical devices and even the simplest home appliances such as fridges and central heating systems are computer controlled. For every such device, a human-machine interface is needed that enables users to control the machine and makes it possible for the device to give feedback to its users. Simple commands (e.g., turn off/turn on) can easily be passed to the system by a switch or a button. Similarly, elementary feedback can be returned by basic alphanumeric displays or indicator lights. For more complicated messages, however, these simple interfaces are inadequate or they would make the use of the device cumbersome. The ultimate goal should be that the computer systems that surround us are perfectly integrated in everyday life, which would make them in some way “invisible”: their usage should feel natural and intuitive so that users would “forget” that they are communicating with a highly complicated computer system. This can be achieved when the interaction between human and machine reaches the level of interaction amongst humans themselves. Consequently, the communication between humans and computers should mainly consist of the primary means of communication between humans, namely speech signals. Not only will this make the interaction with the machine feel natural and familiar, it will also improve the accessibility of the technology, since everybody with speaking and/or hearing capabilities will be able to use the device without having to learn its particular interface. It is interesting to see that, like many other recent developments, human-machine interaction based on speech was predicted by science fiction a long time ago. Figure 1.2: Detail of the user interface of the talking computer HAL-9000 from “2001: A Space Odyssey”. In Fredric Brown’s short story “Answer” (1954), the question “Is there a god?” is passed to the Universe’s supercomputer by means of a voice command. The computer answers, by means of auditory speech, “Yes, now there is a god”. Another well-known example is Arthur C. Clarke’s novel “2001: A Space Odyssey” (1968), in which the interaction with the spaceship Discovery One goes through HAL-9000, a speaking computer which can understand voice commands.
In fact, the use of speech to communicate with machines is so straightforward that it has already been mentioned in countless books and motion pictures. 1.4.2 A brief history of human-machine interaction using auditory speech To allow two-way speech-based human-machine communication, the computer system should be able to understand the user’s voice commands and it has to know how to generate a new speech signal in order to return its answer to the user. These two requirements have resulted in two important research domains in the field of speech technology: automatic speech recognition and speech synthesis. In automatic speech recognition it is investigated how to translate a given waveform into its corresponding phoneme sequence [Rabiner and Juang, 1993]. In the early days, studies from this domain involved a rudimentary recognition of isolated words [Davis et al., 1952]. However, in order to design a system that can be used for human-machine interaction, two major challenges had to be overcome. First, the system has to be capable of recognizing unrestricted continuous speech instead of fixed-vocabulary words. In addition, it should recognize speech uttered by any given speaker and not only speech from those speakers that were used to construct or train the system. Nowadays the automatic recognition of continuous speech is performed using sophisticated machine learning techniques like Artificial Neural Networks (ANN, [Anderson and Davis, 1995]) [Lippmann, 1989] or by use of statistical models like Hidden Markov Models (HMM, [Baum et al., 1970]) [Baker, 1975] [Ferguson, 1980] [Rabiner, 1989]. Figure 1.3: Wheatstone’s Talking Machine [Flanagan, 1972]. The domain of speech synthesis investigates how a computer system can create a new waveform based on a target sequence of phonemes [Dutoit, 1997] [Taylor, 2009]. It may come as a surprise that the very first “talking machine” was already invented by Wolfgang Von Kempelen at the end of the 18th century [Von Kempelen, 1791] [Dudley and Tarnoczy, 1950]. The essential components of his machine were a pressure chamber for mimicking the lungs, a vibrating reed acting as the vocal cords, and a leather tube that acted as the vocal tract. When it was controlled by a practised human operator, the machine was able to produce a series of sounds that quite closely resembled human speech. A few decades later, the system was improved by Charles Wheatstone in 1837 (see figure 1.3) [Bowers, 2001]. The very first electrical systems that were designed to synthesize human-like speech signals were only able to generate isolated vowel sounds [Stewart, 1922]. The VODER system can be considered the first electrical system able to produce continuous speech [Dudley et al., 1939]. It was a human-operated synthesizer that consisted of a wrist bar for selecting a voicing or noise source and a foot pedal to control the fundamental frequency. The source signal was routed through 10 bandpass filters of which the output levels were controlled by the operator’s fingers. From that point in time speech synthesis became a popular research topic in which, over the years, various approaches have been investigated. Formant synthesizers create the target synthetic speech by generating a new waveform that mimics the known formants in human speech sounds [Fant, 1953] [Lawrence, 1953] [Kelly and Gerstman, 1961]. Formants are resonant peaks that can be seen in the spectrum of human speech signals, originating from the vocal tract, which has for each particular
configuration several major resonant frequencies. An alternative synthesis approach is articulatory synthesis, in which a new waveform is created by modelling the different components of the human vocal tract and the articulation processes occurring there [Dunn, 1950] [Rosen, 1958] [Kelly and Lochbaum, 1962]. Formant synthesis and articulatory synthesis are examples of so-called rule-based synthesis. In a first synthesis stage, a rule-based synthesizer estimates the properties of the target synthetic speech signal based on predefined knowledge about the uttering of the target phonemes (e.g., the corresponding configurations of the human speech production system) or about the properties of their corresponding speech signals (e.g., spectrum, formants, timing, etc.). Afterwards, in a second synthesis stage a new waveform is constructed based on these estimations. A different approach to the problem of speech synthesis is taken by the so-called data-driven synthesizers. In this strategy, the target synthetic speech signal is constructed by reusing original speech information. A first attempt to synthesize speech by selecting and concatenating diphones (i.e., two consecutive phones) from a predefined database was described by Dixon and Maxey [Dixon and Maxey, 1968]. The major challenge of such concatenative synthesis is the realization of a smooth synthetic speech signal containing no audible join artefacts. Since the optimal point for concatenating two speech signals is found to be at the most stable part of a phone (i.e., the sample at the middle of the phoneme instance) [Peterson et al., 1958], for many years concatenative synthesizers were based on the selection of diphones from a diphone database. Several speech modification techniques for improving the concatenations as well as the prosody of the synthetic speech have been proposed, such as TD-PSOLA [Moulines and Charpentier, 1990] and MBROLA [Dutoit, 1996]. As the processing power of computer systems grew and data storage capabilities became larger, concatenative synthesis evolved towards unit selection synthesis. In this approach the speech segments are selected from a database containing continuous original speech from a single speaker [Hunt and Black, 1996]. This has the advantage that longer original speech segments can be used to construct the target synthetic speech, which reduces the number of concatenations needed. As an alternative to concatenative speech synthesis, other data-driven approaches make use of statistical models, trained on original speech signals, to construct the target synthetic speech. The best known example is HMM-based synthesis, in which in a first stage Hidden Markov Models are trained on the correspondences between acoustic features of original speech from a single speaker and the phonemic transcript of the speech signals. Afterwards, these models can predict new acoustic features corresponding to the target phoneme sequence [Zen et al., 2009]. Recently, hybrid data-driven approaches have gained popularity. These systems first estimate the acoustic features of the target speech using statistical models, after which these estimations are used to perform a unit selection synthesis using a database containing original speech signals [Ling and Wang, 2007]. This section only provides a brief overview of the research on auditory speech synthesis systems; as a concrete illustration, the core idea behind unit selection is sketched below.
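The sketch below shows, under simplifying assumptions, how a unit selection synthesizer can combine a target cost (how well a candidate database unit matches the requested phoneme) with a concatenation or join cost (how smoothly two units connect) and search for the cheapest unit sequence with dynamic programming. It is a minimal illustration only: the unit representation, the cost definitions and the toy database are invented for the example and do not correspond to any particular system mentioned above or to the synthesizer developed in this thesis.

```python
# Minimal unit selection sketch (illustrative only). A database unit carries a
# phoneme label and a small feature vector; the costs below are placeholders.
from dataclasses import dataclass

@dataclass
class Unit:
    phoneme: str
    features: tuple  # e.g., (duration in ms, pitch in Hz) of the original segment

def target_cost(unit, target_phoneme):
    # How well a candidate unit matches the requested phoneme (0 = perfect match).
    return 0.0 if unit.phoneme == target_phoneme else 10.0

def join_cost(left, right):
    # How smoothly two units concatenate: here, a simple feature distance.
    return sum(abs(a - b) for a, b in zip(left.features, right.features))

def unit_selection(targets, database):
    """Select one database unit per target phoneme, minimizing the summed
    target and concatenation costs with dynamic programming."""
    candidates = [[u for u in database if u.phoneme == p] or database for p in targets]
    best = [(target_cost(u, targets[0]), [u]) for u in candidates[0]]
    for t in range(1, len(targets)):
        new_best = []
        for u in candidates[t]:
            cost, path = min(((c + join_cost(seq[-1], u), seq) for c, seq in best),
                             key=lambda x: x[0])
            new_best.append((cost + target_cost(u, targets[t]), path + [u]))
        best = new_best
    return min(best, key=lambda x: x[0])[1]

# Toy database with two candidate realizations of /a/ that differ in pitch;
# the selection prefers the one that joins more smoothly with its neighbours.
db = [Unit("h", (80, 120)), Unit("a", (100, 118)), Unit("a", (95, 180)),
      Unit("l", (70, 122)), Unit("o", (110, 121))]
print(unit_selection(["h", "a", "l", "o"], db))
```

In a real system the costs incorporate many more linguistic, prosodic and spectral features and the candidate sets are heavily pruned for efficiency; the principle of trading off target fit against join smoothness, however, remains the same.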
The interested reader is referred to [Schroeder, 1993] and [Klatt, 1987] for an extensive overview of acoustic speech synthesis research and the different approaches that have been studied; the latter reference also includes many interesting sound samples from the early systems. 1.4.3 Multimodality of spoken human-machine communication The previous sections explained why automatic speech recognition and speech synthesis are necessary for improving the interaction between humans and machines. Let’s take another look at the prophecies on human-machine interaction found in science fiction. It can be noticed that in many cases the authors and movie directors opted to assign some sort of virtual “face” to the computer system. In addition, when the fictional computer system is speaking, some kind of visual cues are displayed. These can be seen as an elementary type of synthetic visual speech. For instance, in Stanley Kubrick’s film version of “2001: A Space Odyssey” the speaking computer HAL-9000 is presented as a red light (see figure 1.2). When the computer talks, a close-up of this light is displayed, giving the impression that the audience is looking at its “face”. Another common practice is the displaying of a graphical representation corresponding to the auditory voice of the machine. A well-known example of this is the smart car KITT featured in the television series “Knight Rider” (see figure 1.4). These are just fictional examples; however, they do indicate that people seem to expect some sort of visual speech signal in the communication with machines as well. Recall that in section 1.3 it was shown that speech communication between humans is a truly multimodal means of communication, consisting of both an auditory and a visual mode. Consequently, it can be expected that optimal human-machine interaction will be feasible only when this communication consists of audiovisual speech as well [Chen and Rao, 1998]. When the machine is at the receiver’s side of the communication, the multimodal human-machine interaction is based on automatic audiovisual speech recognition. Studies on this subject have indicated that the accuracy with which a computer system is able to correctly translate an auditory speech signal into its corresponding phoneme sequence can be increased when visual speech information is given to the system as well [Potamianos et al., 2004]. The visual speech information usually consists of a video recording of the speaker’s face. It is analysed by the system in order to determine important visual features such as the opening of the lips. An accurate estimation of the phoneme sequence is possible by combining these visual features with the acoustic features of the auditory speech signal [Nefian et al., 2002]. Figure 1.4: Fictional rudimentary synthetic visual speech used in the television series “Knight Rider”. Similarly, when the computer system has to transmit a message towards the user, through audiovisual speech synthesis it is possible for the computer to display both a synthetic auditory and a synthetic visual speech signal. The concept of synthetic auditory speech is more or less unambiguously defined as a waveform that resembles a human auditory speech signal as closely as possible; numerous variations on the concept of synthetic visual speech are possible, however (see figure 1.5). For instance, the visual speech signal can appear photorealistic or it can display a cartoon-like representation.
It can display a complete talking head or it can just simulate a speaking mouth. The visual speech can be either 2D- or 3D-rendered and its level of detail can vary from a simple opening/closing of the mouth (e.g., cartoons like “South Park”) to an accurate simulation of the various visual articulators. For realistic visual speech synthesis, the system has to model the exterior of the face containing the lips, the chin and the cheeks, as well as the interior of the mouth, especially the teeth and the tongue. In the next chapter an extensive overview of the various approaches for generating synthetic visual speech will be given. It can be noticed that much similarity exists between the various visual speech synthesis strategies and the approaches for auditory speech synthesis. For instance, rule-based visual synthesizers create a new visual speech signal based on prior knowledge about the visual appearances of phonemes. This knowledge can be used to estimate an appropriate visual counterpart for every target phoneme, from which a continuous synthetic visual speech signal can be constructed by means of interpolation. Alternatively, concatenative visual speech synthesizers reuse original visual speech data to create a new visual speech signal by selecting and concatenating the most appropriate segments from a database containing original visual speech information. Another approach is to synthesize the target visual speech using a statistical prediction model (e.g., a Hidden Markov Model) that has been trained on a dataset of original (audio)visual speech samples. The next chapter will extensively elaborate on these various visual speech synthesis strategies. Figure 1.5: Examples of various visual speech synthesis systems. From left to right: 3D articulatory synthesis [Birkholz et al., 2006], 2D photorealistic synthesis [Ezzat et al., 2002] and 3D model-based synthesis [Cohen and Massaro, 1990]. Over the last two decades, many studies assessed the use of original and/or synthetic audiovisual speech in various human-computer interaction scenarios [Walker et al., 1994] [Sproull et al., 1996] [Cohen et al., 1996] [Pandzic et al., 1999] [Dehn and Van Mulken, 2000] [Ostermann and Millen, 2000] [Geiger et al., 2003] [Agelfors et al., 2006] [Ouni et al., 2006] [Weiss et al., 2010]. One of the important conclusions of these studies is that the addition of a high-quality, realistic synthetic visual speech signal to a (synthetic or original) auditory speech signal improves the overall intelligibility of the speech (visualized in figure 1.6). This is especially true when the intelligibility of the auditory speech itself is degraded. In addition, it has been shown that people react more positively and are more engaged when the computer interacts through audiovisual speech. Also, the results obtained in numerous perception experiments show that the displaying of a realistic talking face makes the computer more human-like, causes the users to be more comfortable in interacting with the system and increases the degree to which users trust the system. From these findings it can be concluded that for an optimal communication from the machine towards the user, the speech should indeed consist of both an auditory and a visual speech mode. Figure 1.6: Intelligibility scores as a function of acoustic degradation, depending on the mode of presentation [Le Goff et al., 1994].
From bottom to top: audio alone, audio and the animation of an elementary synthetic lip model, audio and the animation of a non-photorealistic 3D face model, audio and original 2D visual speech. It has been shown that intelligibility scores increase when a more realistic and more accurate synthetic visual speech signal is displayed (see figure 1.6) [Benoit and Le Goff, 1998]. On the other hand, it has to be ensured that the presented (synthetic) visual speech is appropriate for the particular target application. For instance, a suitable visual speech signal for a system interacting with children could appear cartoon-like, since it is mainly the entertainment value of the system that is important to draw the children’s attention. On the other hand, for more general applications intended for either professional or entertainment purposes, an important aspect of the synthetic visual speech that determines its applicability is the exhibited degree of realism. It is evident that optimal circumstances for human-machine interaction are feasible when a 100% realistic visual speech signal is displayed. Unfortunately, this degree of realism cannot be reached by any visual speech synthesis strategy known to date, although current state-of-the-art 3D rendering techniques are capable of generating near-realistic static representations of a virtual speaker. Surprisingly, it has been noticed that a high but not perfect degree of realism can result in a worse user experience compared with less realistic visual speech signals (such as cartoon-like 2D or 3D speech). This effect, called the uncanny valley effect (see figure 1.7), was first noticed in the field of humanoid robotics, where it was found that the more realistic a robot appears, the more sensitive human observers are to subtle flaws or shortcomings in the design [Mori, 1970]. This effect also holds in the field of visual speech synthesis, since human observers were found to readily dislike a near-realistic synthesis due to the presence of a few brief or subtle unnatural mouth appearances in the signal [Theobald and Matthews, 2012]. More generally, people tend to dislike a near-realistic synthesis that “tries to fool them” by realistically mimicking an original speaker when it can still be noticed that the presented speech signal originates from a synthesizer [Tinwell et al., 2011]. In contrast, flaws in explicitly non-realistic synthetic visual speech signals are more easily forgiven by human observers, provided that the movements of the visible articulators are correctly simulated. From a psychological point of view, this can be explained by the fact that the almost-realistic virtual characters are perceived as strange, abnormal or “spooky” real people, whereas the non-realistic characters clearly originate from a virtual world, which makes the observers feel more comfortable. Bridging the uncanny valley imposes a major challenge for visual speech synthesis research, since a high degree of realism of the synthetic speech is necessary to provide an optimal communication channel between the machine and its users. Figure 1.7: The uncanny valley effect. For 2D cartoon-based synthesis the appearance of the mouth area varies among a limited set of drawn mouth representations (e.g., an open/closed mouth represented by a disc/line).
3D model-based synthesis is capable of exhibiting very natural movements of the visual articulators, represented by variations of a 3D polygon mesh without texture. 2D photorealistic synthesizers mimic original video recordings of a person uttering speech, and 3D photorealistic synthesis uses a 3D model onto which a photorealistic texture is mapped. Example figures taken from [Anime Studio, 2013] [Karlsson et al., 2003] [Liu and Ostermann, 2011] [Albrecht et al., 2002]. 1.4.4 Applications Considering the rapidly increasing number of computer systems people interact with in everyday situations, countless applications for speech-based human-machine interaction are conceivable. The use of speech to transfer a message towards the computer system mainly serves to improve the accessibility of the device. For instance, several functions in a modern car can be triggered by voice command in order to permit the driver to keep his/her hands on the steering wheel and to maintain focus on the traffic. Voice-controlled devices can also help elderly or physically impaired people in using the appliance, since commands can be passed with a minimal physical effort. The accuracy of these speech-controlled applications can be enhanced by incorporating the visual speech mode in the communication. This can also increase the level of interaction between the system and its users. For instance, based on the observed facial expressions, the computer can estimate the emotional state of its user and it can react in an appropriate way. This can, for instance, enhance the communication between the computer system and young children [Yilmazyildiz et al., 2006]. On the other hand, speech-based communication from the machine towards its users is advantageous for both the ease of interaction with the device and the applicability of the system in common everyday tasks. It helps in making the computer system more human-like, especially when the synthetic auditory speech is accompanied by a good-quality visual speech signal [Pandzic et al., 1999]. Nowadays, auditory-only speech synthesis is already used in various applications, such as the reading of text messages in cell phones, automatic telephone exchanges and satellite navigation systems. For all these applications, a logical next step is the extension towards communication by means of audiovisual speech, which will improve the intelligibility of the synthetic speech and enhance the accessibility for hearing-impaired users. The addition of a synthetic visual speech mode can also be used to improve the intelligibility of original or synthesized announcements in train stations or airports. Audiovisual speech synthesis can be used to create talking avatars or virtual assistants that enrich the user experience on personal computers, portable devices, websites and social media [Gibbs et al., 1993] [Noma et al., 2000] [Cosatto et al., 2003]. Audiovisual speech synthesis can also be applied in the entertainment sector. Whereas nowadays the speaking gestures of animated characters are almost completely hand-crafted, an automatic prediction of these gestures would speed up the animation process. In addition, synthetic audiovisual speech can be used for remote-teaching applications. A virtual teacher, displayed as a high-quality speaking head or person, will help to draw the student’s attention in comparison with the displaying of plain text [Johnson et al., 2000].
Another example can be found in the field of video telephony and video conferencing, which are becoming increasingly popular these days. Note that the transmission of high-quality audiovisual speech requires high data rates, since the video signal containing the visual speech must have a resolution and a frame rate that are adequate for preserving the fine details of the speech information. However, the transmission of audiovisual speech is also feasible in a low-bandwidth scenario by transmitting only the textual information, after which a new audiovisual speech signal is generated locally at the receiver’s side. Alternatively, when a model is used to describe the visual speech (see next chapter), model parameters corresponding to the target message can be predicted at the sender’s side, after which only these parameters need to be transmitted to the receiver to allow a local generation of the visible speech. Synthetic audiovisual speech can also be applied in the health-care sector [Massaro, 2003] [Engwall et al., 2004]. For instance, after an accident or surgery, speech therapy involving exercises demonstrated and supervised by a speech therapist may be necessary in order to regain normal speech function. The use of audiovisual speech synthesis for this purpose could drastically reduce the workload, since custom speech samples for use during therapy can be generated beforehand. Similarly, an application using audiovisual speech synthesis can be designed that allows patients to practise speech production on their own in an individual training scheme. In addition, audiovisual speech synthesizers can be used to generate speech samples for miscellaneous speech perception experiments. This avoids the time-consuming and costly audiovisual recordings that would otherwise be necessary for these experiments. Moreover, speech synthesis is able to produce series of highly consistent speech samples, which is very hard to achieve when the speech samples are gathered during multiple recording sessions. Apart from the examples mentioned in this section, many other applications that involve audiovisual speech synthesis are imaginable. It is quite possible that within a few years we will live in a world where the car park ticket machine, the train that brings us to work and the fridge in our kitchen all interact with us by means of synthetic auditory speech while displaying their own typical virtual talking agent. 1.5 Audiovisual speech synthesis at the VUB This thesis describes the research on audiovisual speech synthesis that was performed at the Vrije Universiteit Brussel (VUB) in the Laboratory for Digital Speech and Audio Processing (DSSP). The study has resulted in an audiovisual speech synthesis system that is capable of generating high-quality photorealistic audiovisual speech signals based on a given English or Dutch text. The research originated from the observation that speech is a truly multimodal means of communication that people practise every day of their life. Consequently, they are extremely skilled in perceiving this type of audiovisual information, which implies that a quality perception of synthetic audiovisual speech is only feasible when the two synthetic speech modes closely resemble original speech signals and, in addition, when the level of coherence between these two information streams is as high as found in original audiovisual speech.
The research focusses on synthesis strategies that allow the optimization of both these features. It also investigates the influence of the level of audiovisual coherence on the perceived speech quality, since this aspect is often disregarded by the audiovisual speech synthesizers described in the literature. 1.5.1 Thesis outline Chapter 2 gives a comprehensive overview of the diverse (audio)visual speech synthesis strategies that have been described in the literature. It explains the various aspects that distinguish these synthesis approaches and it positions the synthesis strategy developed in this thesis within the literature. Next, chapter 3 describes the proposed audiovisual speech synthesis strategy and the experiments that were conducted to evaluate the influence of the level of audiovisual coherence on the perceived speech quality. Chapter 4 explains how the attainable synthesis quality of the audiovisual speech synthesizer was enhanced by increasing the individual quality of the synthetic visual speech mode. Subsequently, chapter 5 explains how the synthesis quality was further enhanced by the construction of a new, extensive audiovisual speech database for the Dutch language. For some applications, an (original) auditory speech signal is already available, which means that instead of audiovisual speech synthesis, a visual-only speech synthesis is required in order to generate the accompanying visual speech mode. Chapter 6 elaborates on the use of many-to-one phoneme-to-viseme mappings for this purpose. It also describes the construction and the evaluation of novel many-to-many phoneme-to-viseme mapping schemes. Finally, chapter 7 concludes the thesis by discussing the results obtained and by elaborating on possible future additions to the research. 1.5.2 Contributions Some of the important scientific and technical contributions made by this thesis include:
◦ The development of a unit selection-based audiovisual text-to-speech synthesis approach that is able to maximize the level of audiovisual coherence in the synthetic speech.
◦ The development of a set of audiovisual selection costs and of an audiovisual concatenation technique that allow the synthesis of audiovisual speech of which the quality is sufficient to draw important conclusions on the proposed synthesis approach.
◦ Subjective evaluations that point out that for an optimal perception of the synthetic auditory and the synthetic visual speech, a maximal level of audiovisual coherence is mandatory.
◦ The enhancement of the quality of the synthetic visual speech by employing a model-based parameterization of the speech in order to normalize the database, to employ a diversified concatenation smoothing, and to apply a spectral smoothing to the synthetic visual speech information.
◦ The configuration of a set-up that is appropriate for recording audiovisual speech databases for speech synthesis purposes. This involves, among other things, a strategy for maintaining constant recording conditions throughout the database and an illumination set-up that allows careful feature tracking in the post-processing stage.
◦ The development of the first-ever system that is able to perform high-quality photorealistic audiovisual text-to-speech synthesis for the Dutch language.
◦ An evaluation of the use of standardized and speaker-specific many-to-one phoneme-to-viseme mappings for concatenative visual speech synthesis.
◦ The development of context-dependent viseme labels and the evaluation of their applicability for concatenative visual speech synthesis.
2 Generation of synthetic visual speech 2.1 Facial animation and visual speech synthesis The previous chapter explained how the increasing number of everyday-life interactions with computer systems entails the need for strategies that allow the generation of high-quality synthetic speech. It also explained that optimal synthetic speech should consist of both an auditory and a visual speech mode. Section 1.4.2 briefly elaborated on the various strategies for synthesizing auditory speech. This chapter focuses on the diverse approaches for generating synthetic visual speech that have been the subject of investigation from the early days until the present. From a historical point of view, the very first visual speech synthesis approaches emerged in the pre-computer era. Similar to auditory mechanical talking machines like the one designed by Von Kempelen [Von Kempelen, 1791], the first attempts at (audio)visual speech synthesis consisted of human-operated mechanical constructions that mimicked the human vocal tract in order to produce speech sounds. Simultaneously, the operator could animate components of the machine that resembled visible human articulators (e.g., wooden lips). A famous example of such a machine was the “Wonderful Talking Machine” which was presented by Joseph Faber in 1845 (see figure 2.1) [Lindsay, 1997]. Obviously, the synthetic speech that is of interest for the great majority of modern applications is computer-generated. However, mechanically generated visual speech gestures are nowadays still a critical aspect in the development of humanoid robots, whose synthetic face and articulators should exhibit appropriate variations that correspond to the robot’s voice (see figure 2.1). Figure 2.1: Examples of mechanically generated visual speech. From left to right: the “Wonderful Talking Machine” [Lindsay, 1997], the humanoid robot “KOBIAN”, which is able to mimic human facial emotions and expressions [Endo et al., 2010], and the humanoid robot “HRP-4C”, which is able to produce realistic facial expressions [Nakaoka et al., 2009]. Long before research on the automatic generation of synthetic visual speech existed, hand-crafted visual speech animation was well-known. In the art of cartooning and 2D animation pictures, visual speech was simulated by successively displaying mouth appearances from a limited set of predefined images. For example, a minimal set consisted of a closed mouth (e.g., represented by a line) and an open mouth (e.g., represented by a disc). The set of reference images could be augmented with other variations like the displaying of the tongue or the teeth. The very first use of this technique is credited to Georges Demeny (1892), who used his “Phonoscope” to successively display 12 “chronophotographs” (an ancestor of the transparency slide) containing speech movements (see figure 2.2) [Demeny, 1892]. Later, the animation technique became increasingly popular after the release of popular shorts like Walt Disney’s “Steamboat Willie” (1928). Cartoon animation inspired the very first automatic techniques for generating visual speech. Erber et al. [Erber and Filippo, 1978] used an oscilloscope for displaying a line drawing that represented various lip shapes occurring while uttering speech. Likewise, the displaying of simple vector graphics was used by Montgomery et al.
[Montgomery, 1980] and Brooke et al. [Brooke and Summerfield, 1983] to generate sequences of lip shapes. Obviously, such rudimentary 2D visual speech signals were of limited practical use due to their lack of realism. Fortunately, the available computing power later increased and more complicated computer-based graphics generation became possible. Computer-based generation of visual speech can be seen as a sub-problem in the field of facial animation [Deng and Noh, 2007]. Facial animation studies the design of virtual faces as well as methods to vary the appearance of these faces in order to create human-like facial expressions. Figure 2.2: Georges Demeny’s “Phonoscope” displaying early visual speech animation [Demeny, 1892]. Two important categories of facial expressions can be discerned: expressions that illustrate the emotional state of a person and expressions that are linked with the production of speech sounds. From this perspective, visual speech synthesis can be defined as the generation of a sequence of synthetic facial expressions that are linked with the uttering of a given sequence of speech sounds. Facial animation is a branch of computer-generated imagery (CGI) that became increasingly popular after the release of the animated short film “Tony de Peltrie” by Philippe Bergeron and Pierre Lachapelle in 1985 (see figure 2.3) [Bergeron and Lachapelle, 1985]. This was the first computer-generated animation displaying a human character that exhibits realistic facial expressions in order to show emotions and to accompany auditory speech. In this animated short, the facial expressions were produced by photographing an actor with a control grid on his face, and then matching points to those on a 3D computer-generated face (itself obtained by digitizing a clay model). Similar early animated shorts that initiated the development of computer-based facial animation are “Rendez-vous Montreal” by Thalmann in 1987 and “Sextone for President” by Kleiser in 1988 (illustrated in figure 2.3). As this thesis focuses on the synthesis of visual speech, the current chapter will mainly discuss the various strategies for generating facial expressions that correspond to the uttering of speech sounds. However, in chapter 1 it was explained that certain facial gestures are used to stress or convey an emotion in the speech information (i.e., the visual prosody). Therefore, it should be noted that an ideal visual speech synthesizer has to be capable of mimicking both speech-related gestures and facial expressions that add an emotion to the communication. Figure 2.3: Snapshots from pioneering animations showing realistic computer-generated facial expressions. From left to right: “Tony de Peltrie” (1985), “Rendez-vous Montreal” (1987), and “Sextone for President” (1988). Throughout the years many strategies for the generation of synthetic visual speech have been described [Bailly et al., 2003] [Theobald, 2007]. Classifying these diverse approaches is not an easy task, since many different aspects can be used for typifying each of the proposed strategies. In the remainder of this section a brief overview of such characteristic properties is given. The next section will then elaborate on each of these aspects and will provide various examples from the literature. Input requirements: A description of the target visual speech has to be given to the synthesis system.
This can be accomplished by means of a phoneme (or viseme) sequence or by means of plain text (so-called phoneme-driven or text-driven systems). Alternatively, the speech synthesizer can be designed to generate synthetic visual speech corresponding to an auditory speech signal that is given as input to the system (so-called speech-driven systems). Output modality: Most visual speech synthesis systems generate only a video signal or a sequence of video frames containing the target visual speech. However, some text-driven systems generate both a synthetic auditory and a synthetic visual speech signal. These systems are generally referred to as audiovisual speech synthesis systems. Output dimensions: The synthetic visual speech can be rendered in either two or three dimensions. 2D-based synthesis usually displays a frontal view of the talking head, while 3D-based synthesis uses 3D rendering approaches to permit free movement around the talking head. Photorealism: The synthetic visual speech can appear photorealistic, which means that it is intended to appear as human-like as possible. On the other hand, some systems generate 2D cartoon-like visual speech or render a 3D model using solid colours instead of photorealistic textures. Definition of the visual articulators and their variations: It was explained earlier that visual speech synthesis can be considered a sub-problem in the field of facial animation. Each visual speech synthesizer has to adopt a facial animation technique in order to represent the virtual speaker and to define the possible variations that allow the mimicking of speech gestures. Note that most literature overviews on visual speech synthesis use this property to classify the various proposed synthesis strategies. A wide variety of animation approaches exists. For instance, 3D-based rendering needs the definition of a 3D polygon mesh that models the mouth or the complete face/head of the virtual speaker. In addition, it must define multiple variations of this mesh that can be used to mimic speech gestures. A similar graphics rendering can be used to generate a 2D representation of the virtual speaker. On the other hand, 2D-based facial animation can also be achieved by reusing original video recordings of a human speaker. In that case, the various speech gestures are defined by the labelling of the original visual speech data. Prediction of the target speech gestures: Whatever facial animation strategy is used to describe the visual speech information, each synthesis system needs to estimate the target visual speech gestures based on the input data. Various strategies for this prediction have been proposed, such as predefining correspondences between phonemes and visemes (so-called rule-based systems), a statistical modelling of input-output correspondences (e.g., speech-driven synthesis) or the reuse of appropriate original speech data. 2.2 An overview on visual speech synthesis 2.2.1 Input requirements In order to generate the target visual speech signal, the majority of the synthesis systems require the sequence of phonemes/visemes that must be uttered. Such a phoneme sequence can be directly given as input to the system [Pearce et al., 1986] or the synthesizer’s input can be plain text. In the latter case, these so-called text-to-speech (TTS) synthesis systems will in a first synthesis stage determine a target phoneme sequence based on the given textual information [Dutoit, 1997].
Many TTS systems also predict the prosodic properties of the target speech (e.g., phoneme durations, stress, etc.). Another category of synthesizers generates the novel visual speech signal based on an auditory speech signal that is given as input to the system. These speech-driven systems estimate the target facial expressions based on features extracted from the auditory input signal. For this purpose a training database is used to train a statistical model on the correspondences between these auditory speech features and their corresponding visual features. After training, this model is used to predict the target visual features corresponding to a novel audio segment that is given as input. The predicted visual features can then be used to drive the facial animation. Various types of auditory features have been used. For example, Mel-frequency Cepstrum Coefficients (MFCC, [Mermelstein, 1976]) were used by Massaro et al. [Massaro et al., 1999], by Theobald et al. [Theobald and Wilkinson, 2007] [Theobald et al., 2008], and by Wang et al. [Wang et al., 2010]. Other potential auditory features include Line Spectral Pairs (LSP, [Deller et al., 1993]) [Hsieh and Chen, 2006], Linear Prediction Coefficients (LPC, [Rabiner and Schafer, 1978]) [Eisert et al., 1997] [Du and Lin, 2002] or filter-bank output coefficients [Gutierrez-Osuna et al., 2005]. The definition of the visual features depends heavily on the nature of the synthetic visual speech (e.g., 2D-based or 3D-based) and the manner in which it is represented by the synthesizer (i.e., the chosen facial animation strategy). For instance, when landmark points describing the location of the lips are known, the geometric dimensions of the mouth can be used to describe the visual speech information [Hsieh and Chen, 2006]. Alternatively, when the visual speech is described by a parameterized 3D model (see further in section 2.2.5.1), the model’s control parameters are highly suited as visual features [Massaro et al., 1999]. Other systems use a mathematical model to parameterize the visual speech signal in order to obtain useful visual features. For example, Brooke et al. [Brooke and Scott, 1998] use Principal Component Analysis (PCA, [Pearson, 1901]) and Theobald et al. [Theobald and Wilkinson, 2007] use an Active Appearance Model (AAM, [Cootes et al., 2001]). Diverse approaches have been suggested to learn the mapping from auditory to visual features, such as a Hidden Markov Model (HMM, [Baum et al., 1970]) [Brand, 1999] [Arb, 2001] [Bozkurt et al., 2007], an Artificial Neural Network (ANN, [Anderson and Davis, 1995]) [Eisert et al., 1997] [Massaro et al., 1999], regression techniques [Hsieh and Chen, 2006], Gaussian-mixture models [Chen, 2001], switching linear dynamical systems [Englebienne et al., 2008] or switching shared Gaussian process dynamical models [Deena et al., 2010]. Note that speech-driven visual speech synthesis can also be realized using a hybrid analysis/synthesis approach. In this strategy, an auditory speech signal is given as input to the system, from which in a first stage its corresponding phoneme sequence is determined using speech recognition. Afterwards, in a second stage this phoneme sequence is used as input for the actual visual speech synthesis [Lewis and Parke, 1987] [Lewis, 1991] [Bregler et al., 1997] [Hong et al., 2001] [Ypsilos et al., 2004] [Jiang et al., 2008].
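Independently of the specific features and learning techniques used in the systems cited above, the core of such a learned audio-to-visual mapping can be illustrated with a toy example. The sketch below fits a simple linear least-squares regression from placeholder "acoustic" frames to placeholder "visual" parameters; the random training data stands in for a real parallel audiovisual corpus, and the feature dimensionalities (13 acoustic, 4 visual) are arbitrary choices made only for the example. Real systems would use MFCC-like acoustic features, AAM-based or geometric visual parameters, and far more powerful models such as HMMs or neural networks.

```python
import numpy as np

# Toy speech-driven mapping: learn a linear regression from per-frame acoustic
# features to per-frame visual parameters (illustrative only; the data below is
# random and merely stands in for a real parallel audiovisual training corpus).
rng = np.random.default_rng(0)
n_frames, n_audio, n_visual = 500, 13, 4

A_train = rng.normal(size=(n_frames, n_audio))               # "acoustic" frames
true_map = rng.normal(size=(n_audio, n_visual))
V_train = A_train @ true_map + 0.05 * rng.normal(size=(n_frames, n_visual))

# Training stage: fit the audio-to-visual mapping with ordinary least squares.
W, *_ = np.linalg.lstsq(A_train, V_train, rcond=None)

# Synthesis stage: predict visual parameter trajectories for new audio frames;
# these trajectories would then drive the chosen facial animation model.
A_new = rng.normal(size=(10, n_audio))
V_pred = A_new @ W
print(V_pred.shape)  # (10, 4): one visual parameter vector per audio frame
```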
A similar recognition-based approach was proposed by Verma et al. [Verma et al., 2003] and by Lei et al. [Lei et al., 2003], in which the speech recognition stage is designed to directly estimate a sequence of visemes instead of phonemes. In fact, the actual synthesis stages of these systems can be considered text-driven visual speech synthesizers. Finally, a last category of visual speech synthesizers is based on the cloning of visual speech. These systems generate a new synthetic visual speech signal by mimicking the speech gestures that are detected in another visual speech signal that is given as input to the system. This way, several “virtual actors” can be animated by a single recording of a human speaker uttering speech sequences [Escher and Thalmann, 1997] [Pighin et al., 1998] [Gao et al., 1998] [Goto et al., 2001] [Chang and Ezzat, 2005]. 2.2.2 Output modality The previous chapter explained that speech is a multimodal means of communication. In practice, however, speech is often transmitted as a unimodal signal containing only an auditory mode (e.g., a telephone conversation). In contrast, only in rare cases is a speech signal consisting solely of visual speech information used. This is in line with the fact that humans are very well practised in understanding auditory-only speech, while intelligibility scores for visual-only speech are much lower [Ronnberg et al., 1998]. Only (hearing-impaired) people who have spent a lot of practice time increasing their lip-reading skills are able to (partially) understand visual-only speech [Jeffers and Barley, 1971]. From this observation it is clear that for almost all applications, once the synthetic visual speech signal has been created by the visual speech synthesis system, it will be multiplexed with an auditory speech signal before it is presented to a user. For a speech-driven visual speech synthesis system this workflow is obvious, since in this particular case the desired audiovisual speech consists of a combination of the synthetic visual output speech and the auditory speech that was given as input to the system. For text-driven synthesis, however, multiple workflows are possible. In some applications an original auditory speech fragment is available. In this case, the text or phoneme sequence that corresponds to this original speech signal must be given as input to the visual speech synthesizer. In addition, the system needs to know the timing properties of the auditory speech signal (i.e., the duration of each phoneme) in order to generate a synchronous synthetic visual speech signal. Multiplexing the generated visual speech with the original auditory speech then gives the target audiovisual speech signal. In other applications, the desired audiovisual speech needs to be generated from only textual information. This means that both a synthetic auditory and a synthetic visual speech signal have to be synthesized. Figure 2.4: Two approaches for audiovisual text-to-speech synthesis. Most systems adopt the strategy illustrated on top, in which the synthetic audiovisual speech is generated in two distinct stages. A truly audiovisual synthesis should synthesize both the audio and the video in a single stage, as illustrated in the bottom figure.
The great majority of the systems found in the literature tackle this problem by a two-phase synthesis, where in a first stage the synthetic auditory speech is generated by an auditory speech synthesizer. This auditory synthesizer also provides the target phoneme sequence and the corresponding phoneme durations. In the second stage, this information is given as input to a visual speech synthesizer that creates the synchronized synthetic visual speech. Afterwards, the two synthetic speech modes are multiplexed in order to create the desired audiovisual speech. In contrast, the audiovisual text-to-speech (AVTTS) synthesis can be performed in a single phase when the synthetic auditory and the synthetic visual mode are generated at the same time, as illustrated in figure 2.4. Such single-phase systems can be considered truly audiovisual synthesizers. On the other hand, although the systems that apply a two-phase synthesis are often also referred to as “audiovisual speech synthesizers”, it is more correct to consider these systems as two separate synthesizers jointly performing the AVTTS synthesis. In many cases, the auditory and the visual speech synthesizer were even developed independently of each other. Therefore, this chapter will only consider the visual speech synthesis stage of these two-phase synthesis systems. Schroeter et al. presented an overview of the workflow of two-phase AVTTS systems in [Schroeter et al., 2000]. Many implementations of this strategy can be found in the literature. For instance, the 3D talking head LUCIA [Cosi et al., 2003] converts text to Italian audiovisual speech by using the Festival auditory speech synthesizer [Black et al., 2013] to perform the first stage of the synthesis. The Festival system is also used by King et al. for generating English audiovisual speech [King and Parent, 2005]. Another example is the system by Cosatto et al. [Cosatto et al., 2000], which uses the AT&T auditory TTS system [Beutnagel et al., 1999] for realizing 2D photorealistic AVTTS synthesis, and the system by Albrecht et al. [Albrecht et al., 2002], which uses the MARY TTS system [Schroder and Trouvain, 2003]. Many other two-phase AVTTS implementations exist, such as the synthesizers developed by Goyal et al. [Goyal et al., 2000] and by Zelezny et al. [Zelezny et al., 2006]. This is in contrast with the single-phase AVTTS approach, of which only a few implementations can be found. In 1988, Hill et al. developed an early single-phase AVTTS system based on articulatory synthesis [Hill et al., 1988]. Tamura et al. realized single-phase audiovisual TTS synthesis by jointly modelling auditory and visual speech features [Tamura et al., 1999]. Other exploratory studies, focusing on single-phase concatenative audiovisual synthesis (see further in section 2.2.6.3), were conducted by Hallgren et al. [Hallgren and Lyberg, 1998], Minnis et al. [Minnis and Breen, 2000], Bailly et al. [Bailly et al., 2002], Shiraishi et al. [Shiraishi et al., 2003] and Fagel [Fagel, 2006]. 2.2.3 Output dimensions The numerous visual speech synthesis systems that are described in the literature produce a variety of visual speech signals, which can be coarsely divided into 2D-rendered and 3D-rendered signals. 3D-based visual speech synthesizers use 3D rendering techniques from the field of CGI, modelling the virtual speaker as a 3D polygon mesh consisting of vertices and their connecting edges.
The 3D effect is realized by casting shadow effects on the model based on a virtual illumination source. The realism of the virtual speaker can be increased by adding detailed texture information for simulating skin and wrinkles, eyes, eyebrows, etc. Most 3D-based systems model the complete face or even the whole head of the virtual speaker, although some synthesizers only model the lips/mouth (e.g., [Guiard-Marigny et al., 1996]). The major benefit of synthesizing 3D-rendered synthetic visual speech is the possibility of free movement around the virtual speaker. Because of this, the synthetic speech is applicable in countless virtual surroundings like virtual worlds (e.g., Second Life [Second Life, 2013]), computer games, and 3D animation pictures. In addition, 3D-based facial animation offers a convenient way to add visual prosody to the synthetic speech, since gestures like head movements and eyebrow raises can easily be mimicked by alterations of the 3D mesh. The design of a high-quality 3D facial model or head model is a time-consuming task, especially when realism is important. Fortunately, this process can be partly automated by creating dense meshes based on 3D scans of real persons (e.g., Cyberware scanners [Cyberware Scanning Products, 2013]). Note, however, that the rendering of detailed 3D models requires heavy calculations, which limits the synthesizer’s applicability to computer systems that offer sufficient computing power. Another important consideration is the fact that the use of realistic 3D-rendered synthetic speech imposes an extra difficulty in bridging the “Uncanny Valley” (see section 1.4.3), since with 3D-rendered visual speech even static appearances of the virtual speaker (on top of which synthetic speech movements will be imposed) can be perceived as “almost but not quite good enough” human-like. Other visual speech synthesizers generate a 2D visual speech signal. The majority of these systems aim to mimic standard 2D video recordings by pursuing photorealism. An obvious downside of 2D-based speech synthesis is its limited applicability in virtual worlds or surroundings, since these are mostly rendered in 3D. On the other hand, a 2D visual speech signal can be applied in numerous other applications due to its similarity with standard television broadcasts and motion pictures. For instance, a video signal displaying a frontal view of the virtual speaker can simulate a virtual newsreader or announcer. In addition, a 2D photorealistic representation of the virtual talking agent is the most optimal technique for simulating a real (familiar) person, which is useful in applications such as a virtual teacher or low-bandwidth video conferencing. In comparison with 3D-based speech synthesis, in a 2D-based approach it is easier to create a virtual speaker that exhibits a very high static realism, since people are very familiar with standard 2D video recordings of real persons. Therefore, a high-quality photorealistic 2D visual speech synthesis is more likely to bridge the Uncanny Valley in comparison with 3D-based speech synthesis. Of course, the major challenge remains to accurately mimic the speech gestures on top of this realistic speaker representation. A few systems have been developed that cannot be classified as either 2D-based or 3D-based synthesis. Cosatto et al.
created a visual speech synthesis system that produces synthetic visual speech signals resembling standard 2D video recordings, while it permits some limited head movements of the virtual speaker as well [Cosatto, 2002]. These movements can be user-defined or can be predicted based on the target speech information [Graf et al., 2002]. Note that the movement of the speaker’s head causes a 3D motion of some important visual articulators like the lips and the cheeks. By using a rudimentary 3D head model, Cosatto et al. were able to mimic these movements by affine transformations on 2D textures, as illustrated in figure 2.5.

Figure 2.5: Modelling 3D motion using 2D texture samples (left) [Cosatto, 2002] and visual speech synthesis using 3D screens (right) [Al Moubayed et al., 2012].

Another such system has been developed by Theobald et al. [Theobald et al., 2003]. This system generates the target visual speech in 2D, but by using this synthetic 2D visual speech as a texture map for a 3D facial polygon mesh (describing the face of the same speaker that was used to model the 2D speech), a 3D representation of the synthetic speech becomes possible. The resulting speech should be considered as “2.5D”, since there is no speech-correlated depth variation of the 3D shape. Another category of systems that cannot be classified as either 2D-based or 3D-based makes use of a 3D screen on which a visual speech signal is projected. By shaping these screens in the form of a human face, “true” 3D speech synthesis is possible, which can for instance be applied in the development of humanoid robots (see figure 2.5) [Kuratate et al., 2011] [Al Moubayed et al., 2012].

2.2.4 Photorealism

The visual speech signal that is generated by the synthesis system ought to exhibit speech gestures that mimic as closely as possible the gestures that can be seen in original visual speech. Independently of the realism of these synthetic speech movements (i.e., the dynamic realism), the synthetic visual speech signal also exhibits some degree of photorealism. A high degree of photorealism implies that a static pose of the virtual speaker (i.e., the static realism that can be seen in a single video frame from the visual speech signal) appears very close to a (recording of a) real human. The manner in which photorealism can be achieved is highly dependent on the dimensionality of the synthetic visible speech (see section 2.2.3). For 2D-based synthesis, a photorealistic speech signal appears close to standard television broadcast and video recordings, while 2D non-photorealistic speech signals appear cartoon-like. Possible applications for such cartoon-like 2D visual speech synthesis are the automation of the animation process for 2D animation pictures (which are nowadays increasingly overshadowed by the success of 3D animation pictures) and various situations involving interaction between the computer system and small children.

Figure 2.6: Various examples of synthetic 2D visual speech. From left to right, animation of a painting [Blanz et al., 2003], 2D photorealistic visual speech synthesis using a mathematical model to describe the video signal [Theobald et al., 2004], and 2D photorealistic visual speech synthesis by reusing 2D texture samples [Cosatto and Graf, 2000].

There are a few 2D non-photorealistic speech synthesizers described in the scientific literature.
For instance, in the early days 2D speech gestures were generated using oscilloscopes [Erber and Filippo, 1978] and vector graphics devices [Montgomery, 1980] [Brooke and Summerfield, 1983]. In more recent times, speech synthesis systems sometimes generate a 2D representation based on lines or dots to verify a synthesis concept, which can later on be extended for generating a more realistic speech signal [Arslan and Talkin, 1999] [Tamura et al., 1999] [Arb, 2001]. In addition, some synthesis approaches have been developed to animate paintings and drawings, obviously resulting in 2D non-photorealistic speech [Perng et al., 1998] [Lin et al., 1999] [Brand, 1999] [Blanz et al., 2003]. Note, however, that some of these techniques internally use a 3D-based representation of the speech and that these systems permit more photorealistic results by animating a picture instead of a drawing. In contrast to 2D non-photorealistic synthesizers, over the years many systems for generating 2D photorealistic visual speech have been developed [Waters and Levergood, 1993] [Bregler et al., 1997] [Ezzat and Poggio, 2000] [Aharon and Kimmel, 2004] [Melenchon et al., 2009]. The various synthesis strategies adopted by these and other systems will be discussed in the following sections. As was already mentioned in section 2.2.3, these synthetic photorealistic 2D visual speech signals are applicable in numerous applications since they will appear familiar to the observers due to their resemblance to standard television broadcast and motion pictures. The previous section explained how 3D-based visual speech synthesizers apply 3D rendering techniques to model the virtual speaker. This involves the definition of a 3D polygon mesh, which consists of multiple vertices and their connecting edges. A 3D surface can be created by colouring the faces that are defined by these edges. The level of photorealism of a 3D rendered virtual speaker depends on the 2.2. An overview on visual speech synthesis 34 level of detail and the density of the polygon mesh as well as on the accuracy in which the faces of the mesh are colourized. Due to a limited computing power, the meshes that were applied for facial animation in the early days did not contain much detail, nevertheless they were able to model important gestures such as lip movement and eyebrow raises. The first 3D model that could be used to mimic facial motions was defined by Parke in 1972 [Parke, 1972] (see figure 2.7). Since then, this model has been adopted and improved by many researchers [Hill et al., 1988] [Cohen and Massaro, 1990] [Beskow, 1995]. In current times, the computing power has grown exponentially, allowing to model much more details of the face. Modern systems apply a detailed colouring of the faces of the polygon mesh in order to render an appealing 3D representation of the virtual speaker [Dey et al., 2010]. In addition, photorealism can be achieved by “covering” the 3D surface with a photorealistic texture map [Heckbert, 1986] [Ostermann et al., 1998]. This texture can be sampled from one or multiple photographs [Ip and Yin, 1996] [Hallgren and Lyberg, 1998] [Pighin et al., 1998], it can be captured together with a 3D depth scan [Kuratate et al., 1998], or a full-head cylindrical texture can be obtained by scanning 360 degrees around the head of a real human [Escher and Thalmann, 1997] [Hong et al., 2001] [Elisei et al., 2001]. 
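As a concrete illustration of the texture-mapping step mentioned above, the sketch below colours a point inside one triangle of the mesh by interpolating the texture coordinates of its three vertices (barycentric interpolation) and looking the result up in a photograph. It is a bare-bones sketch under simplifying assumptions (a single triangle, nearest-pixel lookup), not the rendering pipeline of any of the cited systems.

```python
import numpy as np

def barycentric(p, a, b, c):
    """Barycentric coordinates of 2D point p inside triangle (a, b, c)."""
    v0, v1, v2 = b - a, c - a, p - a
    d00, d01, d11 = v0 @ v0, v0 @ v1, v1 @ v1
    d20, d21 = v2 @ v0, v2 @ v1
    denom = d00 * d11 - d01 * d01
    v = (d11 * d20 - d01 * d21) / denom
    w = (d00 * d21 - d01 * d20) / denom
    return np.array([1.0 - v - w, v, w])

def sample_texture(p, tri_xy, tri_uv, texture):
    """Colour at screen point p: interpolate the three vertices' texture
    coordinates and fetch the nearest pixel from the texture photograph."""
    weights = barycentric(p, *tri_xy)      # how strongly each vertex influences p
    u, v = weights @ tri_uv                # interpolated texture coordinate in [0, 1]
    h, w_img = texture.shape[:2]
    return texture[int(v * (h - 1)), int(u * (w_img - 1))]
```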
Finally, it is also worth mentioning that there are reports on systems developed to generate a very rudimental 3D visual speech signal (e.g., only vertices or a non-colourized polygon mesh) intended to verify a synthesis concept which can later on be extended for generating a more realistic speech signal [Galanes et al., 1998] [Bailly et al., 2002] [Engwall, 2002]. 2.2.5 Definition of the visual articulators and their variations Section 2.1 explained that the synthesis of visual speech can be seen as a subproblem in the field of facial animation. Research on facial animation aims to develop strategies for accurately representing the human face and all the facial motions humans are capable of. Each visual speech synthesis system has to adopt a facial animation strategy in order to represent the virtual speaker. While the accuracy of the speech gestures of the virtual speaker is determined by the quality of their estimation based on the input data (see further in section 2.2.6), the appearance and the properties of synthesizer’s output speech will be greatly dependent on the manner in which the facial animation is performed. Given the wide range of possible facial animation approaches, this section separately describes the facial animation strategies that have been applied for 2D-based and 3D-based visual speech synthesis. 2.2.5.1 Speech synthesis in 3D Sections 2.2.3 and 2.2.4 explained that 3D-based visual speech synthesis requires the definition of a 3D polygon mesh that describes the visual articulators. The quality of the polygon mesh and the added texture information is crucial to achieve 2.2. An overview on visual speech synthesis 35 Figure 2.7: Various examples of synthetic 3D visual speech. The top row shows the pioneering model of Parke [Parke, 1982], the bottom row shows, from left to right, the non-photorealistic 3D talking head MASSY [Fagel and Clemens, 2004], the photorealistic talking head LUCIA which adds a texture map to the 3D polygon mesh [Cosi et al., 2003], and a photorealistic 3D facial image resulting from a 3D depth and texture scan [Kuratate et al., 1998]. 2.2. An overview on visual speech synthesis 36 a (photo)realistic representation of the virtual speaker (i.e., static realism). In addition, in order to be able to achieve a high dynamic realism by an accurate prediction of the target speech gestures (see further in section 2.2.6), the chosen facial animation approach has to define an appropriate collection of deformations of the facial model that can be used to mimic the appropriate speech gestures. In the pioneering work of Parke [Parke, 1972] [Parke, 1975] [Parke, 1982] the facial model is hand-crafted by mimicking the geometry of the human face (see figure 2.7). The facial deformations are directly parameterized: the model’s control parameters act directly on the vertices/edges of the polygon mesh. As such, visual speech gestures can be mimicked by varying those control parameters that are linked to the important visual articulators. For instance, there is a parameter that defines the jaw rotation (which determines the mouth opening), there is a parameter describing the lip protrusion, and there are parameters defining a translation of the mouth corners. Many other facial animation approaches can be seen as descendants of Parke’s directly parameterized facial model, such as the talking head Baldi [Cohen and Massaro, 1990] (see figure 1.5) and its extension by LeGoff et al. 
[Le Goff and Benoit, 1996], and the talking heads developed by Beskow [Beskow, 1995] and Fagel et al. [Fagel and Clemens, 2004] (see figure 2.7). One of the important additions that were made to Parke’s initial model is the addition of 3D representations of the tongue and the teeth. These directly parameterized facial animation approaches are often referred to as terminal analogue systems. Note that this label was classically given to formant-based auditory speech synthesizers, which generate a novel waveform based on a manually predefined spectral information. A second strategy for defining the 3D polygon mesh and its deformations is the so-called anatomy-based approach. In contrast with the terminal-analogue facial animation approach, in an anatomy-based model the deformations of the polygon mesh are not directly parameterized. Instead, the facial motions are mimicked by modelling the anatomy of the human face: bones, muscles and skin. Platt et al. developed a strategy to mimic facial expression by modelling the elasticity of the human face by a mass-spring system [Platt and Badler, 1981]. This way, facial deformations can be created by applying a fictional force on some vertices of the mesh, after which it can be calculated how these forces will propagate further to cause variations in other vertices too. An alternative approach has been suggested by Waters, in which various facial muscles and the effect of their activation on the facial appearance are modelled [Waters, 1987]. Facial gestures are mimicked by activating a subset of these virtual muscles and calculating the combined effect of each muscle pulling particular vertices/edges of the polygon mesh towards a predefined point where the virtual muscle is attached to the bone. In those days, Water’s muscular model showed great potential, indicated by the fact that it was 2.2. An overview on visual speech synthesis 37 adopted by the entertainment industry as well. For instance, Pixar’s animated short “Tin Toy” (1988) [Pixar Animation Studios, 2013] featured a realistic baby character whose facial expressions were modelled using a Waters-style facial model. The muscular model has been extended towards a coupled skin-muscle model [Terzopoulos and Waters, 1993] and a multi-layered anatomy model [Lee et al., 1995]. It has also been fine-tuned for simulating speech-related facial expressions [Waters and Frisbie, 1995]. Other muscle-based facial animation schemes for generating expressive visual speech were described by Uz et al. [Uz et al., 1998] and by Edge et al. [Edge and Maddock, 2001]. A hybrid muscle-based approach was described by King et al. [King and Parent, 2005]. In this system, some of the model parameters describe anatomical properties like muscle activations while other parameters of the face model directly parameterize features like the jaw opening and the location of the tip of the tongue. An advanced anatomy-based facial model was developed by Kahler et al., which models the face using three layers: the skull, the muscles and the skin [Kahler et al., 2001] (see figure 2.8). Alternatively, Sifakis et al. designed a complex muscle-based model of the human head, for which the relationship between muscle activation and facial expressions were determined using real-life motion captured data and a finite element tetrahedral mesh [Sifakis et al., 2005] (see figure 2.8). 
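The following sketch captures, in a deliberately simplified form, the idea behind such muscle-based deformation: a contraction pulls the skin vertices near the muscle's insertion towards the point where the muscle attaches to the bone, with an influence that fades with distance. It is not Waters' actual formulation (which uses angular and radial zones of influence); the falloff and gain used here are illustrative assumptions.

```python
import numpy as np

def apply_linear_muscle(vertices, attachment, insertion, activation,
                        influence_radius, falloff=2.0, gain=0.25):
    """Pull mesh vertices toward the muscle's bone attachment point.

    vertices   : (N, 3) array of mesh vertex positions
    attachment : (3,) point where the virtual muscle attaches to the bone
    insertion  : (3,) point where the muscle inserts into the skin
    activation : scalar in [0, 1], strength of the contraction
    """
    displaced = vertices.copy()
    # Only vertices near the muscle's skin insertion are affected.
    dist = np.linalg.norm(vertices - insertion, axis=1)
    affected = dist < influence_radius
    # Vertices closer to the insertion move more (simple distance falloff).
    weight = (1.0 - dist[affected] / influence_radius) ** falloff
    pull = attachment - vertices[affected]          # direction toward the bone
    displaced[affected] += activation * gain * weight[:, None] * pull
    return displaced
```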
The 3D facial animation techniques discussed so far all required a prior manual definition of the deformations of the polygon mesh: the terminal analogue systems directly parameterize the displacements of the vertices and the anatomy-based models parameterize muscular and/or skin actions/forces which cause a variation of the 3D vertices. Another technique for determining the mesh deformations needed to mimic human facial expressions is the so-called performance-driven strategy, which makes use of captured facial gestures from original visual speech data [Williams, 1990]. In general, this technique first requires a 3D polygon mesh that describes a human face/head, which can be hand-crafted or automatically determined using a 3D scanner. Next, original facial gestures are captured and mapped on the polygon mesh. This way, speech-related deformations of the mesh can be learned, which can later on be reused to animate the virtual speaker. Note that there have been reports on the use of performance-driven facial animation using anatomy-based facial models as well [Terzopoulos and Waters, 1993] [Sifakis et al., 2006]. In that particular case, the captured speech gestures will not be used to directly determine the possible mesh deformations, but they allow to estimate the muscular/skin actions occurring in real speech and their effect on the appearance of the speaker’s face. Performancedriven animation techniques can also be applied using a terminal-analogue facial animation scheme. In that case, the captured speech motions are used to deduce articulatory rules that map speech information on parameter configurations of the facial model [Fagel and Clemens, 2004]. The facial movements can be tracked from regular video recordings of a human speaker using image processing techniques like 2.2. An overview on visual speech synthesis 38 Figure 2.8: Anatomy-based facial models. The top row illustrates the musclelayer of the model by Kahler et al. [Kahler et al., 2001], the bottom row illustrates the facial muscles and the finite element tetrahedral mesh described by Sifakis et al. [Sifakis et al., 2005]. 2.2. An overview on visual speech synthesis 39 Figure 2.9: Capturing original facial motions for performance-driven facial animation using the VICON motion capture system [Deng and Neumann, 2008]. snakes [Terzopoulos and Waters, 1993] or key-point tracking [Escher and Thalmann, 1997]. Alternatively, the motions can be captured by tracking markers that are attached to the speaker’s face. These markers can be coloured [Elisei et al., 2001] or fluorescent [Hallgren and Lyberg, 1998] [Minnis and Breen, 2000]. The tracking of the markers can be achieved in 2D by applying image processing on the recorded video frames [Kalberer and Van Gool, 2001] [Muller et al., 2005], in 3D using multiple cameras [Ma et al., 2006] [Zelezny et al., 2006], or by using 3D motion capture systems (e.g., the VICON system [Vicon Systems, 2013] (figure 2.9)) [Cao et al., 2004] [Deng et al., 2006]. A variation to this technique is optoelectronic motion capture, where the facial deformations are tracked by sensors attached to the speaker’s face [Kuratate et al., 1998]. The correspondences between the original speech recordings and the facial model can also be calculated using an analysisby-synthesis approach, for which no facial markers are needed [Reveret et al., 2000]. 
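A minimal sketch of the performance-driven idea, under the simplifying assumption that every tracked marker coincides with one mesh vertex: per-frame marker displacements relative to a neutral capture are copied onto the corresponding vertices of the neutral mesh, so that recorded speech gestures can be replayed (or later reused) on the model. Markers falling between vertices would require an additional interpolation step that is omitted here.

```python
import numpy as np

def retarget(neutral_mesh, vertex_ids, neutral_markers, marker_frames):
    """neutral_mesh    : (N, 3) vertices of the facial model in rest position
       vertex_ids      : (M,) index of the mesh vertex behind each facial marker
       neutral_markers : (M, 3) marker positions captured with a neutral face
       marker_frames   : (T, M, 3) tracked marker positions for T video frames
    Returns T deformed copies of the mesh driven by the captured gestures."""
    animated = np.repeat(neutral_mesh[None, :, :], len(marker_frames), axis=0)
    displacements = marker_frames - neutral_markers[None, :, :]   # (T, M, 3)
    animated[:, vertex_ids, :] += displacements
    return animated
```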
In performance-driven facial animation, an easy mapping from the captured speech gestures to mesh deformations is feasible when the facial markers or the tracked key-points correspond to vertices of the 3D mesh (e.g., such marker positions are standardized in MPEG-4; see further in section 2.2.5.3). In many cases, the captured original speech movements are mapped onto a mathematical model. This way, a parameterization of the original speech gestures (and of the analogous deformations of the 3D polygon mesh) is feasible. For instance, a PCA calculation is performed by the synthesizers developed by Galanes et al. [Galanes et al., 1998], Kuratate et al. [Kuratate et al., 1998], Kalberer et al. [Kalberer and Van Gool, 2001], Elisei et al. [Elisei et al., 2001], and Kshirsagar et al. [Kshirsagar and Magnenat-Thalmann, 2003]. Expectation-Maximization PCA (EM-PCA, [Roweis, 1998]) is applied by Ma et al. [Ma et al., 2006] and by Deng et al. [Deng et al., 2006]. Independent Component Analysis (ICA, [Hyvarinen et al., 2001]) is applied by Muller et al. [Muller et al., 2005] and a Wavelet decomposition [Vidakovic, 2008] is used by Edge et al. [Edge and Hilton, 2006]. The captured data can also be used for learning auditory-visual correlations for speech-driven visual synthesis. An interesting example was described by Badin et al., where electromagnetic articulography was used to capture motion data from the inner articulators like the tongue, the lower incisors and the boundaries between the vermillion and the skin in the midsagittal plane [Badin et al., 2010]. An HMM was trained on the correspondences between this data and auditory features. The trained HMM could then be used for a speech-driven animation of a 3D model illustrating the interior of the human speech production system.

2.2.5.2 Speech synthesis in 2D

2D-based visual speech synthesis aims to create a novel speech signal resembling standard 2D video recordings or animations. Note that an original 2D representation of a human speaker can simply be obtained from a photograph and that original 2D speech gestures can easily be gathered using standard video recordings. Therefore, 2D-based speech synthesizers will often rely on such recordings of original speech. This means that, in contrast with 3D-based synthesizers, 2D-based speech synthesis does not necessarily involve the construction of a graphical model and associated rendering techniques. Where in the case of 3D-based synthesizers the synthesis problem can mostly be split up into a facial animation problem (how are the face and its deformations modelled) and a speech gesture prediction problem (which speech gestures need to be rendered), in the case of 2D-based visual speech synthesis it is less straightforward to make a similar separation since the rendering of the 2D synthetic speech often automatically follows from the prediction of the speech gestures (see further in section 2.2.6). A first category of 2D-based visual speech synthesis systems defines the visual speech information by a set of still images of the virtual speaker. In the early days, these were hand-crafted representations of the lips [Erber and Filippo, 1978] [Montgomery, 1980]. More recently, Scott et al. [Scott et al., 1994], Ritter et al. [Ritter et al., 1999], Ezzat et al. [Ezzat and Poggio, 2000] (see figure 2.10), Noh et al. [Noh and Neumann, 2000], Goyal et al. [Goyal et al., 2000], and Verma et al.
[Verma et al., 2003] described a synthesis approach in which the virtual speaker is defined by a consistent set of photographs of an original speaker uttering speech fragments. This set is constructed to contain an example image of all typical mouth appearances occurring when uttering speech in the target language. A more advanced technique has been described by Cosatto et al. [Cosatto and Graf, 1998], in which a more extensive set of static mouth images is gathered from the recordings of a human speaker. These images are used to populate a multidimensional grid based on the geometric properties of the mouth. Another extension to the image-based definition 2.2. An overview on visual speech synthesis 41 of the virtual speaker was developed by Tiddeman et al., who built a system that is able to generate from a single given photograph a set of images containing various mouth appearances that can be used to define the virtual speaker [Tiddeman and Perrett, 2002]. Where the aforementioned systems require a set of photographs of an original speaker, an alternative approach uses pre-recorded video fragments of an original speaker to define the virtual speaker. A pioneering work is the Video Rewrite system by Bregler et al. [Bregler et al., 1997] which creates a novel visual speech signal by reusing triphone-sized original video fragments. Similarly, systems described by Cosatto et al. [Cosatto and Graf, 2000] [Cosatto et al., 2000], Shiraishi et al. [Shiraishi et al., 2003], Weiss [Weiss, 2004], and Liu et al. [Liu and Ostermann, 2009] use arbitrary-sized video fragments to construct the visual speech. Instead of directly using data from images or video recordings to create the virtual speaker, some systems mathematically model the original 2D visual speech information and use this model-based representation instead. Note that only a few text-driven 2D visual speech synthesis systems apply this technique. An Active Appearance Model (AAM) is used in the synthesizers by Theobald et al. [Theobald et al., 2003] [Theobald et al., 2004] (see figure 2.6) and by Melenchon et al. [Melenchon et al., 2009], while the visual speech is mapped on a Multidimensional Morphable Model (MMM) in the system by Ezzat et al. [Ezzat et al., 2002] (see figure 1.5). On the other hand, as was already mentioned in section 2.2.1, many speech-driven 2D visual speech synthesizers use a mathematical model to describe the visual speech since it permits an easy mapping from auditory to visual parameters. For instance, Principal Component Analysis is used by Brooke et al. [Brooke and Scott, 1998] and by Wang et al. [Wang et al., 2010], AAMs are used by Cosker et al. [Cosker et al., 2003], by Englebienne et al. [Englebienne et al., 2008] and by Deena et al. [Deena et al., 2010], and Shape Appearance Dependence Mapping (SADM) is used by Du et al. [Du and Lin, 2002]. Finally, there are also reports on systems that use a graphical model to render the 2D synthetic visual speech, similar to the facial animation strategies that are used by 3D-based visual speech synthesizers. For instance, a rendering approach based of a 2D wireframe and its associated texture information was used in the DECface system by Waters et al. [Waters and Levergood, 1993] (see figure 2.10). Similarly, a wireframe and associated texture samples copied from an original photograph are used to generate visual speech from a single given image in the system by Lin et al. 
(which uses a 2D wireframe) [Lin et al., 1999] and in the Voice Puppetry system by Brand (which uses a 3D wireframe) [Brand, 1999]. 2.2. An overview on visual speech synthesis 42 Figure 2.10: 2D visual speech synthesis using a 2D wireframe and a corresponding texture map (left) [Waters and Levergood, 1993] and 2D visual speech synthesis based on a limited set of photographs (right) [Ezzat and Poggio, 2000]. 2.2.5.3 Standardization: FACS and MPEG-4 From the previous paragraphs it is clear that there exists a huge variation in techniques for representing the virtual speaker. Each of these approaches has its own strong points and weaknesses. Unfortunately, such a diversity of methods makes it very hard to compare or to combine multiple systems and it forms a barrier for collaborative research. Therefore, some standardizations on the topic of facial animation have been defined. In 1978, Ekman et al. published their work on the so-called Facial Action Coding System (FACS) [Ekman and Friesen, 1978]. This coding system defines numerous Action Units, each corresponding to a contraction or relaxation of one or more facial muscles. The FACS methodology models each human facial expression by one or more Action Units. The standard has been proven to be useful for both psychologists (analysis of human expressions) and animators (synthesis of human expressions). It has also been successfully applied for automatic facial expression analysis [Fasel and Luettin, 2003]. In the field of automatic facial animation, the FACS has been the driving force behind the development of the anatomy-based facial models of Platt et al. [Platt and Badler, 1981], Waters [Waters, 1987] and Terzopoulos et al. [Terzopoulos and Waters, 1993]. These models were designed to mimic particular Action Units, which could then be combined to create meaningful and realistic facial expressions. In addition, Pelachaud developed a system for generating synthetic visual speech and realistic emotions using the FACS [Pelachaud, 1991]. This system was later on extended to split the FACS-based facial animation into independent phonemic, intonational, informational, and affective elements [Pelachaud et al., 1996]. More recently, the interest in using the FACS for generating visual speech has tempered. The reason for this is two-fold. First, the majority of the modern facial models designed for visual speech synthesis purposes is not designed based on human anatomy, but high detailed polygon meshes and their corresponding tex- 2.2. An overview on visual speech synthesis 43 tures are mostly automatically determined by 3D scanning techniques. In addition, natural mesh deformations are learned by advanced 3D motion capture, which is faster and easier than a detailed manual definition of the numerous facial muscles and their effect on the face appearance. A second reason is that the FACS is not optimized for modelling visual speech gestures: the FACS offers a lot of Action Units to accurately mimic emphatic expressions, however it is very hard to use these Action Units to simulate all the detailed mouth gestures corresponding to speech uttering. Where the standardization defined by the FACS is based on the biomechanics of the face, a second standardization for facial animation has been defined in the MPEG-4 standard [MPEG, 2013] which is derived from the geometric properties of the human face [Ostermann, 1998] [Abrantes and Pereira, 1999] [Pandzic and Forchheimer, 2003]. 
MPEG-4 is an object-based multimedia compression standard, which allows different audiovisual objects in a scene to be encoded independently. These visual objects may have a natural or synthetic content, including arbitrarily shaped video objects, special synthetic objects such as the human face and body, and generic 2D/3D objects composed of primitives like rectangles/spheres or indexed face sets that define an object surface by means of vertices and surface patches. The MPEG-4 standard foresees that talking heads will play an important role in future customer service applications. To this end, MPEG-4 enables integration of face animation with multimedia communications and allows face animation over low bit rate communication channels. The standard specifies a face model in its neutral state, a number of feature points on this neutral face as reference points and a set of Facial Animation Parameters (FAPs), each corresponding to a particular facial action deforming the face model away from its neutral state. This way, a facial animation sequence can be generated by deforming the neutral face model according to some specified FAP values at each time instant. The value for a particular FAP indicates the magnitude of the corresponding action (a minimal sketch of this deformation scheme is given below). MPEG-4 specifies 84 feature points on the neutral face (see figure 2.11). The main purpose of these feature points is to provide spatial references for defining FAPs. The 68 FAPs are categorized into 10 groups related to parts of the face. The FAPs represent a complete set of basic facial actions including head motion and control over the tongue, eyes and mouth. Most FAPs represent low-level gestures such as head or eyeball rotation around a particular axis. In addition, the FAP set contains two high-level parameters corresponding to the realization of visemes and expressions, respectively. The expression FAP is used to deform the face towards one of the six primary expressions (anger, fear, joy, disgust, sadness and surprise). The viseme FAP is used to deform the face towards a representative configuration that matches one of the 15 predefined (English) visemes. The use of this viseme FAP makes it possible to generate visual speech by consecutively deforming the face model towards viseme representations that correspond to the target speech.

Figure 2.11: The facial feature points defined in the MPEG-4 standard [MPEG, 2013].

Many 3D-based visual speech synthesis systems have adopted the MPEG-4 standard for describing the visual speech information. For instance, Cosi et al. developed a 3D photorealistic talking head based on the MPEG-4 facial coding standard [Cosi et al., 2003]. In this system, an advanced smoothing between the consecutive viseme representations was implemented to achieve natural-looking speech gestures (see further in section 2.2.6.2). An alternative implementation of an MPEG-4 talking head is described by Pelachaud et al., which involves a (pseudo-)anatomy-based facial model of which the feature points follow the MPEG-4 standard [Pelachaud et al., 2001]. In this system, the two high-level FAPs are not implemented, as speech gestures and expressions are simulated by the researchers’ own animation rules. Facial animation based on the MPEG-4 standard is often used in performance-driven visual speech animation approaches. In that case, the facial markers on the original speaker’s face are placed in conformance with the feature points described in MPEG-4.
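A minimal sketch of FAP-driven deformation as described above: a neutral set of feature points is displaced, per frame, by the active parameters. The three-entry parameter table is a hypothetical stand-in; the actual standard defines 68 FAPs over 84 feature points and normalizes their amplitudes with face-specific measurement units, which is omitted here.

```python
import numpy as np

# Hypothetical mini "FAP table": each entry maps a low-level parameter to the
# index of the feature point it moves and the unit direction of that movement.
FAP_TABLE = {
    "open_jaw":         (0, np.array([0.0, -1.0, 0.0])),
    "stretch_l_corner": (1, np.array([-1.0, 0.0, 0.0])),
    "stretch_r_corner": (2, np.array([+1.0, 0.0, 0.0])),
}

def apply_faps(neutral_points, fap_values):
    """Deform a neutral set of facial feature points by the given FAP values.
    neutral_points: (P, 3) array; fap_values: dict FAP-name -> magnitude."""
    deformed = neutral_points.copy()
    for name, magnitude in fap_values.items():
        idx, direction = FAP_TABLE[name]
        deformed[idx] += magnitude * direction
    return deformed

def animate(neutral_points, fap_frames):
    """fap_frames: list of FAP-value dicts, one per video frame."""
    return [apply_faps(neutral_points, frame) for frame in fap_frames]
```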
This way, the captured speech motions can be directly mapped to displacements of the vertices of an MPEG-4 based polygon mesh [Beskow and Nordenberg, 2005] [Gutierrez-Osuna et al., 2005]. From these mapped deformations, variations in FAP values corresponding to the uttering of speech can be learned [Eisert et al., 1997] [Tao et al., 2009]. The major drawback of using the MPEG-4 standard to simulate visual speech gestures is the fact that in such a system the FAPs are at the same time geometrical degrees-of-freedom and articulatory degrees-of-freedom. Since they originate from the modelling of geometrical deformations of the face, they compose a less-optimal base-set for 2.2. An overview on visual speech synthesis 45 constructing articulatory gestures. A possible workaround this problem has been proposed by Vignoli et al., who designed a facial animation scheme based on socalled Articulatory Parameters (including mouth height, mouth width, protrusion and jaw rotation) [Vignoli and Braccini, 1999]. An MPEG-4 compliant animation is achieved by mapping these Articulatory Parameters to FAPs. 2.2.6 Prediction of the target speech gestures Section 2.2.5 elaborated on various strategies for representing the virtual speaker. The selection of such a strategy is an important step in the design of a visual speech synthesis system, since the chosen facial animation technique not only determines the static realism of the synthetic visual speech but it also defines the synthetic gestures that can be imposed on the virtual speaker. On the other hand, this section elaborates on the prediction of the target speech gestures based on the system’s input data. In other words, this section explains how the facial models that have been described in section 2.2.5 can be used to generate an appropriate sequence of speech gestures. An accurate prediction of these gestures is necessary to achieve synthetic visual speech exhibiting a high level of dynamic realism. Section 2.2.1 explained that, based on the synthesizer’s input requirements, two main categories of visual speech synthesis systems can be discerned, namely text-driven and speech-driven approaches. The problem of predicting the speech gestures based on auditory speech input was already addressed in section 2.2.1. These systems learn in a prior training stage the correspondences between auditory and visual speech features. After training, visual features corresponding to an unseen auditory input signal can be estimated, from which a new sequence of speech gestures is determined. From this point onwards, this section will only focus on the estimation of speech gestures based on textual information (i.e., text-driven visual speech synthesis). 2.2.6.1 Coarticulation Before elaborating on the prediction of the target speech gestures, the concept of coarticulation must be explained. Coarticulation refers to the way in which the realization of a speech sound is influenced by its neighbouring sounds in a spoken message [Kent and Minifie, 1977] [Keating, 1988]. Forward or anticipatory coarticulation is mainly caused by high-level articulatory planning and occurs when the articulation of a speech segment is affected by other segments that are not yet realized. On the other hand, backward or preservatory coarticulation (also known as “carry-over” coarticulation) is mainly caused by inertia in the biomechanical structures of the vocal tract which causes the articulation at some point in time to be affected by the articulation of speech segments at an earlier 2.2. 
An overview on visual speech synthesis 46 point in time. Note that coarticulation may not be seen as a pure side-effect since it also serves a communicative purpose: it makes the speech signal more robust to noise by introducing redundancies, since the phonetic information is spread out over time. Many studies have tried to explain and to model the effect of coarticulation on the uttering of speech sounds. Two important approaches can be discerned, namely look-ahead models and time-locked models. Look-ahead models allow the beginning of an anticipatory coarticulatory gesture at the earliest possible time allowed by the articulatory constrains of other segments in the utterance. A well-known look-ahead model is the numerical model of Ohman [Ohman, 1967]. This model splits the speech articulation in vocalic and consonant gestures. Every articulatory parameter is defined as a numerical function over time, of which the value is dependent on pure vocalic gestures onto which a consonant gesture is superimposed. The consonant has an associated temporal blend function that dictates how its shape should blend with the vowel gesture over time. It also has a spatial coarticulation function that dictates to what degree different parts of the vocal tract should deviate from the underlying vowel shape. On the other hand, time-locked models assume that articulatory gestures are independent entities which are combined in an approximately additive fashion. They allow the onset of a gesture to occur a fixed time before the onset of the associated speech segment, regardless of the timing of other segments in the utterance. A well-known time-locked coarticulation model is Lofqvist’s gestural model [Lofqvist, 1990], in which each speech segment has dominance over the vocal articulators which increases and then decreases over time during articulation. Adjacent segments will have overlapping dominance functions which dictates the blending over time of the articulatory commands related to these segments. The height of the dominance function at the peak determines to what degree the segment is subject to coarticulation. Another gestural model of speech production was described by Browman et al. [Browman and Goldstein, 1992]. In their approach to articulatory phonology, gestures are dynamic articulatory structures that can be specified by a group of related vocal tract variables. Syllable-sized coarticulation effects are modelled by phasing consonant and vowel gestures with respect to one other. The basic relationship is that initial consonants are coordinated with vowel gesture onset, and final consonants with vowel gesture offset. This results in organisations in which there is substantial temporal overlap between movements associated with vowel and consonant gestures. Coarticulation effects are noticeable in both the auditory and the visual speech mode. In the auditory mode, it leads to smooth spectral transitions from one speech segment to the other. Auditory speech synthesis has to mimic these transitions to avoid “jerky” synthetic speech. Rule-based auditory synthesizers such as articulatory or formant synthesis systems predict for each target phoneme a corresponding speech sound. To achieve high-quality synthetic speech, these systems have to 2.2. An overview on visual speech synthesis 47 integrate a coarticulation model to simulate the transitions between the consecutive predicted articulatory or spectral properties. 
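As a numerical illustration of the vowel-plus-superimposed-consonant idea behind Ohman's model described above, the sketch below blends a consonant target shape into an underlying vowel-to-vowel trajectory using a temporal blend function and a spatial coarticulation function. The raised-cosine blend and the linear coarticulation profile in the usage example are arbitrary illustrative choices, not Ohman's fitted functions.

```python
import numpy as np

def ohman_track(vowel_track, consonant_shape, blend, coart):
    """vowel_track     : (T, X) underlying vowel-to-vowel articulatory trajectory
                         (X positions along the vocal tract, T time samples)
       consonant_shape : (X,) target vocal-tract shape of the consonant
       blend           : (T,) temporal blend function, 0 = pure vowel, 1 = full consonant
       coart           : (X,) spatial coarticulation: how much each part of the
                         tract may deviate from the underlying vowel shape"""
    deviation = consonant_shape[None, :] - vowel_track
    return vowel_track + blend[:, None] * coart[None, :] * deviation

# Usage: a consonant gesture peaking in the middle of a vowel-to-vowel transition.
T, X = 100, 20
vowels = np.linspace(np.full(X, 0.2), np.full(X, 0.8), T)   # slow V-to-V movement
blend = 0.5 - 0.5 * np.cos(2 * np.pi * np.arange(T) / T)    # rises, peaks, falls
coart = np.linspace(1.0, 0.1, X)                            # front of tract deviates most
print(ohman_track(vowels, np.full(X, 0.0), blend, coart).shape)   # (100, 20)
```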
In visual speech, coarticulation effects are even more pronounced than in the auditory speech mode. Particular gestures, like lip protrusion, have been found to influence neighbouring articulatory gestures up to several phonemes before and after the actual speech segment they are intended for. In a visual speech signal, the inertia of the visual articulators can be directly noticed. For instance, preservatory coarticulation is noticeable when a speech gesture continues after uttering a particular sound segment while the other gestures needed to create this sound are already completed. An example of this effect is the presence of lip protrusion during the /s/ segment of the English word “boots”. Moreover, anticipatory coarticulation can be seen in the visual speech signal when a visible gesture of a speech segment occurs in advance of the other articulatory components of the segment. An example of such anticipatory coarticulation is the pre-rounding of the lips in order to utter the English sound /uw/: in the word “school” the lip rounding can already be noticed while the sounds /s/ or /k/ are still being uttered. As will be explained further on, many visual speech synthesis systems adopt a rule-based strategy, in which at particular time instants the properties of the visual speech are predicted (e.g., at the middle of each phoneme or viseme). Similar to the rule-based auditory synthesizers, these visual speech synthesis systems have to implement a strategy for creating smooth and natural transitions between the consecutive predicted appearances by mimicking the visual coarticulation effects. Note, however, that in the field of visual speech synthesis a noticeable trend exists towards concatenation-based synthesis (see further in section 2.2.6.3). A similar shift has already been made in the field of auditory synthesis. One of the benefits of such a concatenative synthesis approach is the fact that coarticulation effects can be automatically included in the synthetic speech. Indeed, original transitions between adjacent phonemes can be seen in the synthetic speech signal by reusing segments of original speech recordings that are longer than a single phone.

2.2.6.2 Rule-based synthesis

Analogous to the rule-based approaches for auditory speech synthesis, rule-based visual speech synthesizers generate the synthetic speech by estimating its properties using predefined rules. In general, only a few particular frames of the output video signal will be predicted directly. Therefore, rule-based synthesis is often referred to as keyframe-based synthesis. In most systems the predicted keyframes are located at the middle of each phoneme or viseme of the target speech. The rule-based synthesis approach can be split up into two stages. In an initial offline stage, the synthesis rules are determined. This means that for each instance from a set of predefined synthesis targets (e.g., all phonemes/visemes of the target language) at least one typical configuration of the visual articulators is defined. The way in which such a typical configuration is described is greatly dependent on the chosen facial animation strategy (see section 2.2.5).

Figure 2.12: Visual speech synthesis using articulation rules to define keyframes (black). The other video frames (white) are interpolated.
In a second stage, synthesis of novel speech signals is feasible by composing a sequence of predefined configurations based on the textual input information. The target visual speech signal can then be generated by interpolating between the predicted keyframes in order to attain a smooth signal that is in synchrony with the imposed duration of each speech segment. As was explained in section 2.2.6.1, for synthesizing high-quality speech sequences this interpolation should mimic the visual coarticulation effects. A general overview of the rule-based synthesis approach is illustrated in figure 2.12. Early rule-based visual speech synthesis, using Parke’s directly parameterized facial animation model [Parke, 1975], was developed by Pearce et al. [Pearce et al., 1986], Lewis et al. [Lewis and Parke, 1987], Hill et al. [Hill et al., 1988] and Cohen et al. [Cohen and Massaro, 1990]. These approaches specify for each instance from a set of representative phonemes a set of typical parameter values for the 3D facial model. As such, a series of keyframes can be determined based on a given target sequence of phonemes. Smoothing between these keyframes is performed by a interpolation of the parameter values of the model. A similar approach was followed by Guiard-Marigny et al. [Guiard-Marigny et al., 1996], in which the synthetic lips are described by algebraic equations. Because of this, interpolation between consecutive keyframes can be easily mathematically achieved. Note that, although all these mentioned systems are capable of generating smooth facial animations, no real solution to mimic visual coarticulations is mentioned in these strategies. In order to create natural transitions between the consecutive phoneme or viseme representations found in the predicted keyframes, the interpolation strategy should 2.2. An overview on visual speech synthesis 49 generate intermediate frames that mimic visual coarticulations. One approach for this is to mimic the biomechanics of the face. Such an interpolation technique is found in the DECface system [Waters and Levergood, 1993], which predefines for each instance of a representative set of visemes a fixed configuration of the 2D wireframe that is used to render the visual speech. From this set of static shapes, a sequence of 2D keyframes is composed based on the input text. In order to achieve realistic keyframe transitions, the system models the dynamics of the mouth movements by representing each node of the wireframe by a position, a mass, and a velocity. The interpolation between the keyframes is calculated by applying fictional forces on the nodes and calculating their propagation through the wireframe by mimicking the elastic behaviour of facial tissue. A similar anatomy-based interpolation was used between keyframes containing a 3D polygon mesh by Hong et al. [Hong et al., 2001]. Obviously, such an anatomy-based interpolation is also feasible for rule-based visual speech synthesis systems that adopt anatomy-based facial animation schemes. For instance, the system by Uz et al. [Uz et al., 1998] defines for each representative phoneme a set of animation parameters defining the muscle contractions and the jaw rotation. By use of these rules a series of keyframes is composed, after which smooth animations are achieved by a cosine-based interpolation between the keyframes. 
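A compact sketch of this keyframe-plus-interpolation workflow follows, using a made-up three-parameter articulation table and a cosine ease between keyframes (one of the simple interpolation schemes mentioned above). A real system would of course use the full parameter set of its facial model and a proper coarticulation model instead of this plain ease curve.

```python
import numpy as np

# Hypothetical articulation rules: one facial-model parameter vector per
# representative phoneme (e.g. jaw rotation, lip protrusion, mouth-corner offset).
RULES = {
    "b": np.array([0.05, 0.2, 0.0]),
    "u": np.array([0.30, 0.9, -0.1]),
    "t": np.array([0.20, 0.1, 0.1]),
}

def keyframes(phonemes, durations, fps=25):
    """Place one keyframe at the temporal midpoint of each phoneme."""
    times, values, t0 = [], [], 0.0
    for ph, dur in zip(phonemes, durations):
        times.append(t0 + dur / 2.0)
        values.append(RULES[ph])
        t0 += dur
    frame_times = np.arange(0.0, t0, 1.0 / fps)
    return np.array(times), np.array(values), frame_times

def cosine_interpolate(key_times, key_values, frame_times):
    """Fill in the non-key frames with a cosine ease between adjacent keyframes."""
    out = np.empty((len(frame_times), key_values.shape[1]))
    for i, t in enumerate(frame_times):
        j = np.clip(np.searchsorted(key_times, t) - 1, 0, len(key_times) - 2)
        a = np.clip((t - key_times[j]) / (key_times[j + 1] - key_times[j]), 0.0, 1.0)
        w = 0.5 - 0.5 * np.cos(np.pi * a)       # smooth ease-in / ease-out
        out[i] = (1 - w) * key_values[j] + w * key_values[j + 1]
    return out

kt, kv, ft = keyframes(["b", "u", "t"], [0.08, 0.20, 0.12])
trajectory = cosine_interpolate(kt, kv, ft)     # (frames, parameters)
```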
Another approach for mimicking coarticulations while interpolating between the predicted keyframes is to adopt a comprehensive model that describes the various visual coarticulation effects. Pelachaud et al. proposed a look-ahead model to simulate coarticulation effects in visual speech synthesis using Action Units from the FACS [Pelachaud et al., 1991]. In their system, phonemes are assigned a high or low deformability rank. When synthesizing a new utterance, forward and backward coarticulation rules are applied so that a phoneme takes the lip shape of a less deformable phoneme forwards or backwards in the target phoneme sequence. This is calculated in three stages, where the first one computes the ideal lip shapes, after which in two additional stages temporal and spatial muscle actions are computed based on constraints such as the contraction and relaxation time of the involved facial muscles. Conflicting muscle actions are then resolved by use of a table of Action Units similarities. Another implementation of the look-ahead coarticulation model for visual speech synthesis purposes was described by Beskow [Beskow, 1995]. This system uses Parke’s facial animation strategy, where a 3D model of the tongue was added. Each phoneme is assigned a target vector of articulatory control parameters. To allow the targets to be influenced by coarticulation, the target vector may be under-specified, i.e. some parameter values can be left undefined. If a target is left undefined, the value is inferred from the phonemic context using interpolation, followed by a smoothing of the resulting trajectory. 2.2. An overview on visual speech synthesis 50 One of the most adopted strategies for estimating visual coarticulation is the so-called Cohen-Massaro model [Cohen and Massaro, 1993], which is based on the time-locked gestural model of Lofqvist [Lofqvist, 1990] and was originally designed to interpolate between keyframe parameter values of a terminal-analogue facial animation system. In this model, each synthetic speech segment (i.e., each keyframe) is assigned a target vector of parameter values. Overlapping temporal dominance functions are used to blend the target values over time. The dominance functions take the shape of a pair of negative exponential functions, one rising and one falling. The height of the peak and the rate in which the dominance rises and falls are free parameters that can be adjusted for each representative phoneme and articulatory control parameter. An illustration of this strategy is given in figure 2.13. To implement the Cohen-Massaro coarticulation model, for each representative phoneme or viseme the parameters of the dominance functions have to be estimated. Le Goff et al. described a terminal-analogue rule-based visual speech synthesis system that interpolates between keyframes using a slightly improved version of the Cohen-Massaro coarticulation model [Le Goff, 1997]. In their approach, the parameters of the dominance functions are automatically determined from original speech recordings. A downside of the Cohen-Massaro model is that it offers no way to ensure that particular target parameter values are reached. In some cases this is necessary, for instance at a bilabial stop where the reaching of full mouth closure is crucial. To overcome this problem, Cosi et al. augmented the Cohen-Massaro model with a resistance function that can be used to suppress the dominance of segments surrounding such a critical target [Cosi et al., 2002]. 
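A minimal numerical sketch of the basic Cohen-Massaro blending described above (without the resistance extension of Cosi et al.): each segment contributes a target value weighted by a dominance function built from a rising and a falling negative exponential, and the resulting trajectory is the dominance-weighted average, so predicted targets are not necessarily reached. Parameter names and shapes are illustrative.

```python
import numpy as np

def dominance(t, center, peak, rate_rise, rate_fall):
    """Negative-exponential dominance of one speech segment around its center."""
    dt = t - center
    return np.where(dt < 0,
                    peak * np.exp(rate_rise * dt),    # rising branch before the center
                    peak * np.exp(-rate_fall * dt))   # falling branch after the center

def cohen_massaro_track(t, centers, targets, peaks, rises, falls):
    """Blend per-segment target values into one smooth parameter trajectory.
    t: (T,) frame times; the other arguments hold one entry per speech segment."""
    D = np.stack([dominance(t, c, p, r, f)
                  for c, p, r, f in zip(centers, peaks, rises, falls)])   # (S, T)
    weights = D / D.sum(axis=0, keepdims=True)        # normalized dominance per frame
    return weights.T @ np.asarray(targets, dtype=float)

# Usage: three segments with centers at 0.1 s, 0.25 s and 0.4 s.
t = np.linspace(0.0, 0.5, 50)
track = cohen_massaro_track(t, centers=[0.1, 0.25, 0.4], targets=[0.2, 0.9, 0.1],
                            peaks=[1.0, 0.6, 1.0], rises=[30, 20, 30], falls=[30, 20, 30])
```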
This interpolation scheme was used in the rule-based visual speech synthesis system LUCIA [Cosi et al., 2003]. Many other implementations of the Cohen-Massaro model for rule-based visual speech synthesis exist. For instance, Fagel et al. captured original speech motions to estimate the dominance functions of the coarticulation model [Fagel and Clemens, 2004]. These captured motions were also used to learn articulation rules for a terminal-analogue facial animation scheme. A similar data-driven training of the Cohen-Massaro model for interpolating keyframes described by MPEG-4 FAPs was described by Beskow et al. [Beskow and Nordenberg, 2005]. In that system, the FAP-based articulation rules were also learned from original speech data. The Cohen-Massaro model has also been used to interpolate between keyframes described by anatomy-based facial animation schemes [Albrecht et al., 2002]. Furthermore, Lin et al. described a system that uses the Cohen-Massaro model to interpolate between predefined configurations of a 2D wireframe [Lin et al., 1999]. An interesting extension of the Cohen-Massaro model was proposed by King et al. [King and Parent, 2005]. In their approach, for each viseme a typical trajectory of model parameters was hand-crafted (instead of a single set of parameter values). Based on the target phoneme sequence, a sequence of parameter sub-trajectories is composed. Smooth trajectories are obtained by 2.2. An overview on visual speech synthesis 51 Figure 2.13: Modelling visual coarticulation using the Cohen-Massaro model [Beskow, 2004]. The dominance functions of the speech segments define the interpolation of the facial model parameter between keyframes. Note that not all predicted keyframe-values will be reached. interpolation using decaying dominance functions. When the visual speech synthesis is based on performance-driven facial animation (see section 2.2.5.1), the original speech data can be used to learn articulation rules and coarticulation behavior. For instance, Muller et al. described a technique in which motion capture data is modelled using ICA, from which for each representative viseme the mean parameter values and an “uncertainty” of this mean representation are calculated [Muller et al., 2005]. For generating new visual speech, a series of keyframes is generated based on the target phoneme sequence, which are then interpolated by fitting fourth order splines. Coarticulation is modelled by defining an attraction force from each keyframe to the interpolation curve that is inversely proportional to the uncertainty of the mean representation of the corresponding viseme. In a strategy proposed by Deng et al., the transitions between representative visemes are learned from motion capture data in a prior training phase [Deng et al., 2006]. When synthesizing new speech, a series of appropriate keyframes is constructed which are then interpolated using the corresponding trained coarticulation rules. Revret et al. implemented Ohman’s numerical coarticulation model for a rule-based visual speech synthesis based on motion capture data [Reveret et al., 2000]. The captured original speech gestures were used to estimate the values of the coarticulation model, such as the coarticulation coefficients and the temporal functions guiding the blending of consonants and the underlying vowel track. 2.2. 
An overview on visual speech synthesis 52 Instead of defining a single articulation rule for each representative phoneme, a more extensive set of rules can be learned to predict the keyframes. Galanes et al. developed such a rule set using a tree-based clustering of 3D motion capture data [Galanes et al., 1998]. For each distinct phoneme, several typical representations are collected based on the properties of the phonetic context. This way, coarticulation is automatically included in the articulation rules. To synthesize novel speech, the same tree is traversed to determine a keyframe for each target phoneme, after which interpolation using splines is performed to create smooth parameter trajectories. Another approach for defining context-dependent articulation rules was suggested by De Martino et al. [De Martino et al., 2006]. In this approach, 3D motion capture trajectories corresponding to the uttering of original CVCV and diphthong samples are gathered, after which by means of k-means clustering [Lloyd, 1982] important groups of similar visual phoneme representations are distinguished. From these context-dependent viseme definitions, the keyframe mouth dimensions and jaw opening corresponding to a novel phoneme sequence can be predicted. These predictions are then used to animate a 3D model of the virtual speaker. Rule-based synthesis is also a popular approach for synthesizing 2D synthetic visual speech from text. To this end, the system needs to define for each representative phoneme or viseme a typical 2D representation of the virtual speaker (e.g., a picture of an original speaker uttering the particular speech sound). From these typical representations, a series of keyframes is composed based on the target phoneme sequence, after which an interpolation between these 2D keyframes is needed to achieve a smooth video signal. Scott et al. [Scott et al., 1994] created the “Actors” system which interpolated between a series of photograph-based keyframes using image morphing techniques [Wolberg, 1998]. Unfortunately, this morphing step required a hand-crafted definition of the various morph targets in each keyframe. A more automated morphing between 2D keyframes is feasible using optical flow techniques [Horn and Schunck, 1981] [Barron et al., 1994]. This type of keyframe interpolation is used in the rule-based 2D visual speech synthesis systems by Ezzat et al. [Ezzat and Poggio, 2000], Goyal et al. [Goyal et al., 2000], and Verma et al. [Verma et al., 2003]. Similar rule-based synthesis approaches were described by Noh et al. [Noh and Neumann, 2000], where Radial Basis Functions [Broomhead and Lowe, 1988] are used to interpolate between the keyframes, and by Tiddeman et al. [Tiddeman and Perrett, 2002], where the interpolation is achieved by texture mapping and alpha blending [Porter and Duff, 1984]. A semi-automatic technique to gather an appropriate set of 2D images from original speech recordings in order to define the synthesis rules was described by Yang et al. [Yang et al., 2000]. In order to include coarticulation effects in the articulation rules, context-dependent 2D keyframe prototypes are manually extracted from original speech recordings by Costa et al. [Costa and De Martino, 2010]. To determine which articulation rules 2.2. An overview on visual speech synthesis 53 are necessary, original motion capture speech data was analysed [De Martino et al., 2006]. 
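A sketch of the clustering step used in such data-driven rule definition: plain Lloyd's k-means over per-phone feature vectors (e.g. mouth width, mouth height and jaw opening measured at phone centres), grouping similar context-dependent realizations. The choice of features is an illustrative assumption, not the exact feature set of the cited systems.

```python
import numpy as np

def kmeans(samples, k, n_iter=50, seed=0):
    """Plain Lloyd's algorithm. samples: (N, D) feature vectors, one per phone
    instance in the motion-capture corpus. Returns centroids and labels."""
    rng = np.random.default_rng(seed)
    centroids = samples[rng.choice(len(samples), k, replace=False)].astype(float)
    for _ in range(n_iter):
        # Assign every sample to its nearest centroid.
        d = np.linalg.norm(samples[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Move each centroid to the mean of its members.
        for j in range(k):
            members = samples[labels == j]
            if len(members):
                centroids[j] = members.mean(axis=0)
    return centroids, labels
```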
The major drawback of all these mentioned 2D interpolation approaches is that no articulatory constraints are taken into account when creating the intermediate video frames. In other words, by smoothing the transition between two keyframes, new configurations of the virtual speaker are generated that may or may not exist in original visual speech. To avoid unnatural interpolated speaker configurations, Melenchon et al. [Melenchon et al., 2009] developed a 2D visual speech synthesis system that can be classified as either rule-based or concatenative (see further in section 2.2.6.3). In this system, for each distinct phoneme many typical representations are gathered (instead of just one as in the other rule-based systems). To generate new visible speech, for each target phoneme the best representation is selected based on the distance between each candidate representation and the representation that was selected for the previous phoneme in the target phoneme sequence. When all keyframes are determined, continuous speech is generated by an interpolation technique in which a smooth transition from one keyframe to the next is constructed by reusing original recorded video frames. As such, every speaker configuration between two keyframes will exhibit static realism. A final category of rule-based visual speech synthesizers can be classified as articulatory synthesis systems. Similar to articulatory auditory speech synthesizers, these systems do not directly predict the result of speech production (i.e., formants (auditory synthesis) or speaker appearances (visual synthesis)) but rather the configurations of the human speech production system (i.e., the manner in which the speech signal is produced). Visual articulatory systems are mainly used to illustrate the mechanism of speech production, which can for instance be applied in speech therapy applications. An example is the system by Birkholz et al. [Birkholz et al., 2006] (see figure 1.5), which is in fact a rule-based speech synthesis system based on a terminal-analogue animation scheme. In this system, the speech signal is rendered using a directly parameterized polygon mesh that represents the lips, the tongue, the teeth, the upper and lower cover, and the glottis. A new speech signal is generated by predicting keyframe parameter values, which are interpolated using dominance functions to mimic coarticulation. A similar approach has been followed by Engwall, in which captured original speech gestures are used to learn rules for animating a 3D tongue model from text [Engwall, 2001].

2.2.6.3 Concatenative synthesis

Although at present rule-based approaches are still adopted, over the last decade a trend is noticeable towards concatenative visual speech synthesis strategies. A similar transition already took place in the field of auditory synthesis, where rule-based techniques such as articulatory or formant synthesis have largely been superseded by concatenative synthesis approaches.

Figure 2.14: Visual speech synthesis based on the concatenation of segments of original speech data. Each output video frame is copied from the speech database.

A concatenative speech synthesizer needs to be provided with a database containing original speech recordings from a single speaker. To synthesize novel speech, the synthesizer searches in the database for suitable segments that (partially) match the target phoneme sequence.
Once an optimal set of original segments is determined, these segments are concatenated to create the final synthetic speech signal. An important factor is the size of the database segments that are available for selection. Older systems often use a database containing diphone recordings. This way, the size of the database can be kept limited as it suffices to contain at least one instance of each possible combination of every two phonemes or visemes that exist in the target language. The selected diphones are usually concatenated at the middle of the first and the second phoneme or viseme, respectively. As such, each transition between two consecutive phonemes or visemes in the output speech will consist of original speech data. This way, coarticulation effects are copied from the original speech to the synthetic speech. When more data storage and stronger computing power is available, the concatenative synthesis can be improved by selecting longer segments (triphones, syllables, words, etc.) from a database of continuous original speech. In this so-called unit-selection approach, fewer concatenations are needed to create a synthetic sentence, which reduces the chance for concatenation artefacts. In addition, reusing longer original segments permits of copying extensive coarticulation effects (extending over multiple phonemes) from the database to the synthetic speech. A general overview of the concatenative synthesis approach is illustrated in figure 2.14. The major benefit of a concatenative synthesis approach is the fact that a maximal amount of original speech data is reused for generating the novel synthetic speech. Because of this, modelling the coarticulation effects becomes superfluous since original transitions between phonemes or visemes are copied from the original speech data. In addition, the synthetic visual speech will exhibit a high degree of static realism since only a limited number of output frames are newly generated 2.2. An overview on visual speech synthesis 55 during synthesis (e.g., for smoothing the concatenations). This is in contrast with rule-based synthesis approaches, in which many new video frames are generated for interpolating between the predicted keyframes. Such a generation of new frames involves the danger that these frames exhibit unrealistic speaker configurations that are non-existing in original speech. The drawback of concatenative synthesis is its large data footprint and the strong computing power that is required to perform the segment selection calculations. However, at present time this is only a possible issue with small-scale systems like cell-phones, automotive applications, etc. Even more, the current large bandwidth capabilities allow a distant calculation where the hand-held device only needs to send the synthesis request to a server and display the synthetic speech after receiving the server’s response. Concatenative synthesis requires a beforehand recording of original speech data. For concatenative synthesis approaches using a 3D-based facial animation scheme, a performance-driven facial animation strategy is necessary, in which the speechrelated deformations of the facial model are copied from motion capture data of an original speaker. Exploratory studies on the concatenation of polyphones described in terms of 3D polygon mesh configurations were described by Hallgren et al. [Hallgren and Lyberg, 1998] and Kuratate et al. [Kuratate et al., 1998]. Edge et al. 
proposed a unit selection approach based on so-called “dynamic phonemes”, which can be seen as phonemes in a particular phonemic context [Edge and Hilton, 2006]. The visual context of each phoneme was also taken into account by Breen et al. by performing a concatenative synthesis based on di-visemes [Breen et al., 1996]. Each di-viseme corresponds to some units that describe variations of a 3D polygon mesh. A unit selection synthesis based on 3D motion capture data was described by Minnis et al. [Minnis and Breen, 2000]. In their system, the selection of variable-length original speech segments depends on the correspondence between the phonemic context of the candidate segment and the phonemic context of the target phoneme. A similar system was proposed by Cao et al., in which the longest possible segments are selected from the database in order to minimize the number of concatenations [Cao et al., 2004]. In the system by Ma et al., the captured 3D motions are organized in a graph indicating the cost of each possible transition between the recorded phoneme instances [Ma et al., 2006]. Based on the target phoneme sequence, an optimal path through this graph that traverses the necessary nodes is searched. A similar approach has been described by Deng et al., in which an optimal path through all recorded phoneme instances is constructed to create a trajectory of facial animation parameters that corresponds to both a target phoneme sequence and to time-evolving expressive properties [Deng and Neumann, 2008]. Note that any facial animation model can be animated using concatenative synthesis, given an appropriate database of original speech gestures. For instance, Engwall investigated on the diphone-based concatenation of captured articulation 2.2. An overview on visual speech synthesis 56 data for animating a 3D tongue model [Engwall, 2002]. From sections 2.2.6.2 and 2.2.6.3 it is clear that 3D motion capture data can be used for learning articulation rules as well as for direct reusage when generating the synthetic visual speech. Bailly et al. evaluated a synthesis approach based on the concatenation of audiovisual diphones represented by 3D model parameters [Bailly et al., 2002]. The attained synthetic visual speech was found to be superior to a rule-based synthesis approach for which the articulation rules were trained on the same original speech data as was used in the concatenative synthesis. For an optimal transfer of the original visual coarticulation effects from the original speech to the synthetic speech, Kshirsagar et al. proposed a technique that selects and concatenates syllable-length original speech segments [Kshirsagar and Magnenat-Thalmann, 2003]. These syllables are described in terms of facial movement parameters resulting from a mathematical analysis of facial motion capture data. The use of syllables is motivated by the fact that most coarticulation occurs within the boundaries of a syllable. An interesting concatenative synthesis approach using an anatomy-based facial animation scheme was proposed by Sifakis et al. [Sifakis et al., 2006]. Based on motion capture data, a database of sentences and the parameter trajectories corresponding to the muscle activations needed to utter these sentences were constructed. By segmenting these muscle-parameter trajectories based on phoneme boundaries, so-called psysemes were defined. To synthesize new speech, an optimal sequence of such psysemes is selected and concatenated. 
From these concatenated muscle-parameter trajectories, a novel facial animation sequence is generated. As the proposed synthesis strategy creates smooth muscle-activation trajectories by selecting and concatenating original speech segments based on muscle activation (instead of selecting segments based on appearance like other methods do), visual coarticulation is taken into account since such coarticulation effects are due to the inability of the facial muscles to instantaneous change their activation level. In comparison with 3D motion capture techniques, gathering a database of 2D original visual speech is much easier. The synthetic speech signal is directly generated from reusing original video frames from the visual speech database. A pioneering work is the Video Rewrite system by Bregler et al., in which a new video sequence is constructed by reusing original triphone-sized video fragments from the database [Bregler et al., 1997]. A similar approach based on the selection of variable-length segments was proposed by Shiraishi et al. [Shiraishi et al., 2003] and by Fagel [Fagel, 2006]. A system by Arslan et al. [Arslan and Talkin, 1999] selects phoneme-sized 2D segments from the database, where each phoneme instance is represented by its phonemic context up to 5 phonemes backward and forward. The distance between two such phonemic contexts is calculated based on the measured 2.2. An overview on visual speech synthesis 57 similarity between the mean visual representations of every two distinct phonemes in the database. A similar approach was also used by Theobald et al. to select phoneme-sized original speech segments from a database containing visual speech fragments that are mapped on AAM parameters [Theobald et al., 2004]. Some systems construct the synthetic speech signal by a frame-by-frame selection from the database. Each new frame that is added to the synthetic video sequence is selected based on various aspects, such as phonetic/visemic matching with the target phoneme sequence and the continuity of the resulting visual speech signal. Examples of such systems are described by Weiss [Weiss, 2004] and by Liu et al. [Liu and Ostermann, 2009]. The last system was also extended to select video frames from a database containing expressive visual speech fragments as well [Liu and Ostermann, 2011]. An interesting approach to concatenative visual speech synthesis was proposed by Jiang et al. [Jiang et al., 2008]. Their speech-driven synthesizer uses a database of audiovisual di-viseme instances, learned from original audiovisual speech. The input auditory speech is first translated in a sequence target di-visemes, after which an appropriate sequence of database di-visemes is collected to create the output visual speech. For each target di-viseme, the system selects from all matching database instances the most suitable one by measuring the similarity between the spectral information of the database auditory speech signal and the spectral information of the input auditory speech signal. Smooth animation is ensured by taking also the ease of the concatenation of two consecutive database segments into account. An important contribution to the field of 2D photorealistic visual speech synthesis is due to Cosatto & Graf. The first version of their system implemented a hybrid rule-based/concatenation-based approach [Cosatto and Graf, 1998]. In this system, for each representative phoneme a typical set of mouth parameters (width, position upper lip and position lower lip) is determined. 
Similar to other rule-based approaches, to synthesize a new speech signal, for each target phoneme a keyframe is defined by its predicted mouth parameters. A grid is populated with mouth appearances sampled from original visual speech recordings. Each dimension of the grid represents a mouth parameter and each grid entry contains multiple mouth samples. As such, for each keyframe a representative mouth sample can be selected from the populated grid. Interpolation between the keyframes is achieved by interpolating the keyframe mouth parameters and selecting for each intermediate parameter set the most corresponding mouth sample from the grid. The Cohen-Massaro coarticulation model is used by calculating the interpolated parameter values based on an exponentially decaying dominance function that is defined for each representative phoneme. The authors also suggest a concatenativebased interpolation strategy, in which common coarticulations are not generated 2.2. An overview on visual speech synthesis 58 using keyframe interpolation but in which the intermediate frames are created by reusing sequences of mouth parameters that have been measured in original speech fragments. In a later version of the system, such a data-based interpolation was used to predict the mouth parameters for every output frame [Cosatto and Graf, 2000]. Then, a frame-based unit selection procedure is performed, in which for each output frame a set of candidate mouth samples is gathered from the database based on their similarity with the predicted mouth parameters for that frame. From each set of candidate mouth instances, one final instance is selected by maximizing the overall smoothness of the synthetic visual speech. Note that this synthesis strategy can be seen as a hybrid rule-based/concatenation-based approach, in which the concatenative synthesis stage is based on target speech features predicted by the rule-based synthesis stage. Later on this visual speech synthesizer evolved towards a truly concatenative system (omitting the rule-based prediction stage) where variable-length video sequences are selected from a visual speech corpus based on target and join costs [Cosatto et al., 2000]. Similar to unit selection-based auditory synthesis [Hunt and Black, 1996], the target costs express how good the candidate segment matches the target segment, while the join costs express the ease in which consecutive selected segments can be concatenated. Finally, in another implementation the unit selection-based synthesizer was extended to minimally select triphone-sized segments in order to speed-up the selection process [Huang et al., 2002]. 2.2.6.4 Synthesis based on statistical prediction Section 2.2.1 described how statistical modelling (e.g., using an HMM) can be used to synthesize novel visual speech based on a given auditory speech signal. However, such a statistical modelling can also be used to synthesize novel speech from text input. This technique has been applied for both auditory and visual speech synthesis purposes. In general, prediction-based speech synthesis requires that in a prior training stage a prediction model is built by learning the correspondences between captured properties of original speech and the corresponding phoneme sequence or viseme sequence. The prediction model has to take both static correspondences (the relationship between the observed features and the corresponding phoneme or viseme) and dynamic properties (the transitions between feature sets) into account. 
After training, the model can predict new parameter trajectories based on a target phoneme or viseme sequence. Sampling these trajectories gives a prediction of the target speech features for each frame of the output speech signal. In an alternative approach, the statistical model only predicts target features for a limited set of keyframes, after which an interpolation is performed to acquire target features for each output frame. These synthesizers can be seen as hybrid rule-based/statistical model-based systems, which learn their articulation rules by statistically modelling 2.2. An overview on visual speech synthesis 59 Phoneme sequence + Timings Trained prediction model FR0 FR1 FR2 FR3 FR4 FR5 FR6 FR7 FR8 FR9 time Figure 2.15: Visual speech synthesis based on statistical prediction of visual features. Note that some systems only predict features for a limited number of keyframes, after which an extra interpolation is required to acquire a set of target features for each output frame. features derived from original speech fragments. A general overview of the synthesis approach is illustrated in figure 2.15. The benefit of speech synthesis based on statistical prediction is the fact that it combines the advantages of both rule-based and concatenative synthesis: observed original (co-)articulations can be reused without the need to explicitly model this behaviour, while the synthesizer’s data footprint is still small since no original speech data needs to be stored after the training stage. The downside is the fact that the original speech data must be parameterized in order to be able to train the model. Thus, the synthetic speech signal is not constructed directly from original speech data but it is regenerated from the predicted features. This possibly leads to a degraded signal quality. Tamura et al. proposed a visual speech synthesis strategy based on visual features predicted by an HMM [Tamura et al., 1998]. In this system, simple geometrical features describe the visual speech by using 2D landmark points indicating the lip shape. Syllables were chosen as basic speech synthesis unit, where for each syllable a four-state left-to-right model with single Gaussian diagonal output distributions and no skips was trained. HMMs are also used in the system by Zelezny et al., in which a phoneme is used as basic synthesis unit [Zelezny et al., 2006]. Each phoneme is modelled using a five-state left-to-right HMM with three central emitting states. The visual speech was parameterized in terms of 3D landmark points around the lips and the chin. Note that in this system, the HMM is only used to predict some particular keyframes of the output speech. Afterwards, smooth trajectories are calculated using an interpolation based on the Cohen-Massaro coarticulation model. Govokhina et al. also trained phone-based HMMs for visual speech synthesis purposes [Govokhina et al., 2006a]. This system uses articulatory parameters derived from a PCA analysis on 3D motion capture data as speech features. An alternative approach was proposed by Malcangi, who describes a system that statistically 2.2. An overview on visual speech synthesis 60 predicts keyframe values using ANNs [Malcangi, 2010]. Afterwards, smooth trajectories are obtained using an interpolation based on fuzzy logic [Klir and Yuan, 1995]. 
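Since several of the systems above smooth their predicted keyframes with the Cohen-Massaro coarticulation model, the following sketch shows the basic idea of dominance-based blending: each segment contributes a target value weighted by an exponentially decaying dominance function centred on that segment. The exponential shape, the parameters and the toy targets are simplifying assumptions for illustration, not the exact published model.

```python
"""A minimal sketch of coarticulation via dominance functions,
loosely inspired by the Cohen-Massaro model referred to above."""
import numpy as np

def dominance(t, centre, strength=1.0, rate=30.0):
    """Negative-exponential dominance of a segment, peaking at its centre."""
    return strength * np.exp(-rate * np.abs(t - centre))

def blend_targets(t, centres, targets, strengths, rate=30.0):
    """Dominance-weighted average of the per-segment targets at time t."""
    d = np.array([dominance(t, c, s, rate) for c, s in zip(centres, strengths)])
    return (d[:, None] * np.asarray(targets)).sum(axis=0) / d.sum()

centres   = [0.05, 0.20, 0.35]        # segment centres (s)
targets   = [[0.2], [0.9], [0.1]]     # e.g. lip-opening targets per segment
strengths = [1.0, 0.5, 1.0]           # a weakly dominant middle segment
frames = np.arange(0.0, 0.40, 0.02)   # 50 fps output
trajectory = np.array([blend_targets(t, centres, targets, strengths) for t in frames])
print(trajectory[:3, 0])              # smooth onset influenced by neighbouring targets
```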
An interesting approach for synthesizing 2D visual speech was proposed by Ezzat et al., in which a multidimensional morphable model was used to model each video frame from the original speech recordings [Ezzat et al., 2002]. Such a model is built by selecting a reference image and a set of images containing key mouth shapes. Then, the optical flows that morph each key image to the reference image are calculated. A novel frame is defined in the model space by a set of shape parameters and a set of appearance parameters. The shape parameters define the linear contribution of the original optical flow vectors that, when applied to the reference image, generate a set of morphed images, while the appearance parameters define the contribution of these morphed images in the synthesis of the target frame. Each video frame from a recording of original visual speech was projected in the model space. Afterwards, from all original frames corresponding to a particular phoneme the shape and the appearance parameters were gathered. By doing so, each phoneme is represented by two multidimensional Gaussians (one for the shape and one for the appearance). The synthesis of novel speech, based on a target phoneme sequence, is solved as a regularization problem since a trajectory through the model space is searched in order to minimize both a target term and a smoothness term. Coarticulation is modelled via the magnitude of the measured variance for each phoneme. A small variance means that the trajectory must pass through that region in the phoneme space, and hence neighbouring phonemes have little coarticulatory influence. On the other hand, a large variance means that the trajectory has a lot of flexibility in choosing a path through a particular phonetic region, and hence it may choose to pass through regions which are closer to a phoneme’s neighbours. The phoneme will thus experience strong coarticulatory effects. The downside of the strategy proposed by Ezzat et al. is that each viseme is characterized by only static features. Kim et al. extended the technique for generating 3D model parameters, for which not only static but also dynamic properties of each phoneme instance from the original speech fragments were used to train the model [Kim and Ko, 2007]. Recently, some hybrid synthesis approaches that are based on both statistical modelling and reusing original speech data have been proposed [Govokhina et al., 2006b] [Tao et al., 2009] [Wang et al., 2010]. Note that a similar hybrid strategy has also been proposed for generating synthetic auditory speech (see section 1.4.2). In a first stage of the hybrid synthesis, target features describing the synthetic speech are predicted using a trained statistical model. In a second stage, these predictions are used to select appropriate segments from a database containing original speech fragments. Govokhina et al. proposed such a hybrid text-driven synthesis method in which the HMM-based synthesis stage is performed using context-dependent 2.3. Positioning of this thesis in the literature 61 phoneme models [Govokhina et al., 2006b]. The hybrid synthesis was found to outperform both HMM-only and concatenative-only synthesis approaches. Wang et al. proposed a hybrid strategy to synthesize visual speech from auditory speech input [Wang et al., 2010]. In the training stage, an HMM is trained on the correspondences between auditory and visual features of original audiovisual speech recordings. 
Then, in a first synthesis stage, given a novel auditory input, the trained HMM predicts a set of target visual features for each output frame. In a second synthesis stage, a frame-based unit selection is performed, where the target cost is calculated as the distance between the candidate original frame and the frame predicted by the HMM. An alternative speech-driven hybrid synthesis approach was proposed by Tao et al., in which sub-sequences of original visual speech are selected from a database based on a target cost that is calculated using a Fused-HMM [Tao et al., 2009]. The Fused-HMM models the joint probabilistic distribution of the novel audio input and the candidate visual deformations. 2.3 Positioning of this thesis in the literature From the literature overview given in section 2.2 is it clear that a wide variety of approaches for achieving (audio)visual speech synthesis can be adopted. A particular category of systems that shows much potential are the synthesizers that generate both a synthetic auditory and a synthetic visual speech mode from a given text input (i.e., audiovisual text-to-speech synthesis systems). The reason for this is twofold. First, there exist countless applications, such as virtual announcers and virtual teachers, for which these synthesizers can be adopted: AVTTS synthesis is the most optimal technique to realize speech-based communication from a computer system towards its users. Second, the generation of the auditory and the visual speech mode by the same system permits to enhance the level of audiovisual coherence in the synthetic speech as much as possible. For this purpose, a single-phase audiovisual speech synthesis approach is favourable (see section 2.2.2). It is remarkable that such single-phase AVTTS strategies have only been adopted in some exploratory studies (see section 2.2.2 for references). Since humans are very experienced in simultaneously perceiving auditory and visual speech information, they are very sensitive to the coherence between these two information streams. The most important coherence-related feature that an AVTTS system needs to address is the synchrony between the two synthetic speech modes. Synchronous speech modes can be generated by both single-phase and two-phase AVTTS synthesizers. In general, a two-phase system will first generate the synthetic auditory mode, after which the phoneme durations found in this signal are imposed on the durations of the visemes in the synthetic visual speech signal (or vice-versa). Obviously, audiovisual synchrony can be achieved by single- 2.3. Positioning of this thesis in the literature 62 phase AVTTS systems as well, since both synthetic speech modes are generated simultaneously. However, synchrony is not the only feature that determines the overall level of audiovisual coherence. For instance, both the auditory and the visual speech mode contain coarticulation effects. In original audiovisual speech, these coarticulations occur simultaneously in both speech modes. However, when the synthetic speech modes are synthesized separately, the auditory and the visual coarticulations are introduced independently. It is impossible to predict how these fragments of auditory and visual speech information will be perceived when they are presented audiovisually to an observer. 
More in general, notwithstanding that a well-built two-phase AVTTS synthesizer is able to generate synchronous auditory and visual speech signals which both exhibit on their own high-quality and natural speech sounds/gestures, such a two-phase synthesis is unable to ensure that the audiovisual coherence between both synthetic speech modes is sufficient for a high-quality and natural perception of the multiplexed audiovisual speech signal. Humans are very well trained to match auditory and visual speech. Therefore, the challenge for an AVTTS system is to create an auditory speech mode of which the human observers believe that it could indeed have been generated by the virtual speaker’s speech gestures that are displayed in the accompanying visual speech signal. For this purpose, a single-phase AVTTS approach is the most favourable synthesis strategy. This thesis evaluates the benefits of a single-phase AVTTS synthesis approach over the more conventional two-phase synthesis strategy. Similar to most modern speech synthesizers, a concatenative synthesis strategy is adopted in which original audiovisual articulations and coarticulations are copied from original speech recordings to the synthetic speech signal (see section 2.2.6.3). Section 2.2.3 elaborated on the differences between 2D-based and 3D-based synthesis strategies. This thesis adopts a 2D-based synthesis approach. The reason for this is two-fold. First, as section 2.2.5 described, a 2D-based synthesis does not require the construction and implementation of advanced facial models and their associated rendering techniques, since the virtual speaker can be directly rendered from original 2D speech recordings. Moreover, gathering original 2D speech data is much easier to perform in comparison with 3D motion capture techniques. A second reason to opt for a photorealistic 2D-based synthesis approach is that its output speech resembles standard television broadcast and video recordings (see section 2.2.4), two categories of audiovisual speech signals that people are very familiar with. This is advantageous when conducting subjective perception experiments in which the participants have to rate or compare samples containing synthetic audiovisual speech. The major downside of 2D-based visual speech synthesis is its limited applicability in virtual surroundings and its limited power to create new expressions. However, this is not an issue since the main goal of this thesis is to investigate efficient 2.3. Positioning of this thesis in the literature 63 strategies for performing single-phase AVTTS synthesis and the general evaluation of a single-phase AVTTS synthesis approach in comparison with more traditional two-phase synthesis strategies. Possible important synthesis paradigms resulting from this thesis can in future research still be adopted in 3D facial animation schemes and/or in systems that also incorporate additional facial expressions for mimicking visual prosody and the emotional state of the virtual speaker. This thesis describes the development of a 2D photorealistic single-phase concatenative AVTTS synthesizer. Single-phase concatenative audiovisual speech synthesis using 3D motion data has already been mentioned in the studies by Hallgren et al. [Hallgren and Lyberg, 1998], Minnis et al. [Minnis and Breen, 2000] and Bailly et al. [Bailly et al., 2002]. Unfortunately, these studies select the appropriate audiovisual speech segments from the database based on auditory features only. 
Obviously, this will result in sub-optimal synthetic visual speech signals, although all studies report that the attained synthesis quality benefits from the fact that synchronous and coherent original audiovisual speech data is applied. The study of Minnis et al. also mentions a visual concatenation strategy that takes the importance of each phoneme for the purpose of lip-readability into account. Only two systems that apply a 2D photorealistic single-phase concatenative AVTTS synthesis have been described in the literature. Shiraisi et al. developed a system to synthesize Japanese audiovisual speech using a database of 500 original sentences [Shiraishi et al., 2003]. Both a single-phase approach (in which audiovisual speech segments are selected from the database) and a two-phase approach (in which auditory and visual segments are independently selected from the database) were implemented. The smoothness and the naturalness of the resulting visual speech mode were assessed, from which it was found that a unimodal selection of the original visual speech segments resulted in higher-quality synthetic visual speech. This is a rather obvious result, since no visual features are taken into account during the audiovisual segment selection. Unfortunately, no assessment of the resulting audiovisual synthetic speech was made. In addition, the authors do not mention any strategy to smooth the concatenations, indicating that the resulting (audio-)visual speech is likely to contain noticeable concatenation artefacts (given the limited size of the provided speech database) which interfere with a subjective evaluation of the quality of the system. Another approach similar to the one that is investigated in this thesis was described by Fagel [Fagel, 2006]. In this system, audiovisual segments are selected from a database containing original German speech fragments. The longest possible segments are selected in order to minimize the number of concatenations. The smoothness between two candidate segments is determined using both auditory and visual features. Unfortunately, the system applies no technique to smooth the concatenated speech, which results in a jerky output signal: despite the fact that the recorded text corpus was optimized to contain about 820 distinct 2.3. Positioning of this thesis in the literature 64 diphones, the complete database contained only 2000 phones which is fairly limited for unit selection-based speech synthesis. The system was evaluated by measuring intelligibility scores for consonant-vowel and vowel-consonant sequences in three modalities: auditory, visual, and audiovisual speech. Both natural and synthesized speech was evaluated and in all samples the auditory mode was contaminated with noise. It was found that for all modalities, the recognition of the synthetic sequences was as good as the recognition of the original sequences. Unfortunately, the author does not mention any conclusion on the comparison between a single-phase and a two-phase audiovisual speech synthesis approach. Therefore, such a comparison will be the primary goal of this thesis. Note that in order to allow meaningful subjective evaluations of the AVTTS strategy, high quality synthetic auditory and visual speech signals have to be generated. For this purpose, much attention should be given to the design of the database containing original speech fragments, to the audiovisual segment selection technique, and also to a concatenation strategy that is able to create smooth synthetic speech signals. 
Recall from section 2.2.6.3 that high quality synthesis should also pay attention to successfully transfer coarticulation effects from the original speech to the synthetic speech. 3 Single-phase concatenative AVTTS synthesis 3.1 Motivation As explained in section 2.3, this thesis aims to evaluate the single-phase audiovisual speech synthesis approach. To this end, in the first part of the research a concatenative single-phase AVTTS synthesizer will be developed. Afterwards, the benefits of the single-phase approach will be evaluated and the single-phase synthesis strategy will be compared with the more traditional two-phase synthesis paradigm, in which the synthetic auditory and the synthetic visual speech mode are generated separately. 3.2 3.2.1 A concatenative audiovisual text-to-speech synthesizer General text-to-speech workflow The general workflow in which the AVTTS system translates the text input into an audiovisual speech signal is very similar to standard auditory unit selection text-to-speech synthesis. The synthesis process can be split-up in two stages, where in a high-level synthesis stage the input text is processed to acquire an appropriate collection of parameters and descriptions that can be used by the low-level synthesis stage to create the actual auditory/visual speech signals. An overview of the AVTTS synthesis process is given in figure 3.1. 65 3.2. A concatenative audiovisual text-to-speech synthesizer Text Normalisation Tokenisation 66 Part-of-speech tagging Syntactic parsing Assign prosody model Token-to-sound rules Lexicon Postlex rules Phonemic transcription Assign timings and f0-contour Sound/Image synthesis Audiovisual speech Figure 3.1: Overview of the AVTTS synthesis. High-level synthesis steps are indicated by rectangles and the low-level synthesis stage is indicated by an ellipse. The high-level synthesis stage, also known as the linguistic front-end, first normalizes the input text by converting it into a set of known tokens. For instance, abbreviations are expanded and numbers are written down using plain words. Then, each word from the target speech is typified using a part-of-speech tagger (indicating the nouns, verbs, adverbs, etc.) and possibly by a syntactic parser (to provide information about the inter-word relationships in a sentence, such as subject, direct object, etc.). Using this data, a prosody model is constructed for each target utterance, describing variations in pitch and timings, assigning accents to speech segments, and predicting phrase-breaks between words. This prosodic information can for instance be expressed by means of “tone-and-break indices” (ToBi), which indicate the variations in speech rate and pitch going from word to word or from syllable to syllable [Pitrelli et al., 1994]. The sequence of input tokens and the part-of-speech/syntactic information is also used to construct a target phoneme sequence. For this a lexicon is used that contains for each word of the target language its phonemic transcription. Note that each language contains several words that have the same spelling but a different pronunciation. The distinction in the pronunciation of the word can be due to the phonemic context (e.g., the English word “the” which sounds different when uttered before a consonant of before a vowel) or it can be due to multiple semantic meanings of the word (so-called heteronyms, e.g., the English word “refuse” which can be 3.2. A concatenative audiovisual text-to-speech synthesizer 67 either a noun or a verb). 
Especially for heteronyms it is crucial that the correct phonemic transcript is applied by the speech synthesizer in order to convey the correct semantic information in the synthetic speech. Based on the part-of-speech tagging and on the syntactic information, it should be possible to select for each heteronym found in the input text the intended entry from the lexicon. On the other hand, pronunciation variations due to phonemic context can be defined in so-called postlex rules, which are applied for locally fine-tuning the phonemic transcript after the complete input text has been processed. It is impossible to avoid that some words from the input text are missing in the lexicon (e.g., names or foreign expressions). For these particular words, a phonemic transcription is estimated by a predefined set of token-to-sound rules (also known as grapheme-to-phoneme rules). Once the final target phoneme sequence has been determined, the assigned prosody model can be used to predict for each individual phoneme a target duration. In addition, an f0-contour that models the target pitch for each speech segment can be constructed. Finally, the target phoneme sequence and its associated prosodic parameters are given as input to the low-level synthesizer which then constructs the appropriate physical speech signals. 3.2.2 Concatenative single-phase AVTTS synthesis The synthesis paradigm for the low-level synthesis stage that is adopted in this thesis is unit selection concatenative synthesis. As section 2.2.6.3 described, in this synthesis approach the synthetic speech is constructed by the concatenation of original speech segments that are selected from a database containing original speech recordings. In most modern systems, the segments are selected from continuous original speech, which permits the selection of segments containing multiple consecutive original phones. This way, both original coarticulations and original prosody can be copied from the original speech to the synthetic speech. Original segments exhibiting appropriate prosodic properties can be selected from the database by using selection criteria that are linked with prosody, such as the position in the sentence, the position in the syllable and syllable stress (see further in section 3.4.2.2). Because of this, the AVTTS system does not necessarily need to predict a prosody model and its associated timing and pitch parameters for each utterance. This simplifies the synthesis workflow and it also minimizes the need for modification of the selected original segments. This is advantageous since additional modifications to the original speech, such as a time-scaling in order to match the target phoneme durations, can result in irregular signals that degrade the quality of the synthetic speech [Campbell and Black, 1996]. This thesis focusses on the development of the low-level synthesis stage. As backbone of the synthesis system, the Festival framework is used [Black et al., 3.2. A concatenative audiovisual text-to-speech synthesizer 68 2013]. The Festival framework offers modules for each step of the TTS synthesis process, as well as an environment to connect the various modules in order to attain a fully-operational speech synthesizer. Festival also allows to integrate user-defined synthesis modules in the synthesis workflow, a functionality that can for instance be applied to combine new low-level synthesis algorithms with the high-level synthesizer of the official Festival release. 
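As a rough illustration of the high-level front-end steps described above (normalisation, lexicon lookup, heteronym disambiguation via part-of-speech, postlex rules and a letter-to-sound fallback), the following toy sketch chains these steps together. The tiny lexicon, the rules and all names are hypothetical and are not taken from Festival or NeXTeNS.

```python
"""A highly simplified sketch of a linguistic front-end; a real system
implements each of these steps with far richer models."""
import re

LEXICON = {                      # (word, part-of-speech) -> phoneme sequence
    ("refuse", "noun"): ["R", "EH", "F", "Y", "UW", "S"],
    ("refuse", "verb"): ["R", "IH", "F", "Y", "UW", "Z"],
    ("the", None):      ["DH", "AH"],
    ("fish", None):     ["F", "IH", "SH"],
}

def normalise(text):
    """Expand a few token types; real normalisation also handles abbreviations etc."""
    text = re.sub(r"\b2\b", "two", text)
    return text.lower().split()

def letter_to_sound(word):
    """Crude grapheme-to-phoneme fallback for out-of-lexicon words."""
    return [ch.upper() for ch in word if ch.isalpha()]

def postlex(phonemes, next_word_starts_with_vowel):
    """Example postlex rule: 'the' becomes DH IY before a vowel."""
    if phonemes == ["DH", "AH"] and next_word_starts_with_vowel:
        return ["DH", "IY"]
    return phonemes

def transcribe(tokens, pos_tags):
    out = []
    for i, (tok, pos) in enumerate(zip(tokens, pos_tags)):
        ph = LEXICON.get((tok, pos)) or LEXICON.get((tok, None)) or letter_to_sound(tok)
        nxt = tokens[i + 1][0] in "aeiou" if i + 1 < len(tokens) else False
        out.append(postlex(ph, nxt))
    return out

print(transcribe(normalise("the old man will refuse the fish"),
                 [None, None, None, None, "verb", None, None]))
```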
The modules of the Festival system are written in C++, while the backbone of the system uses a Scheme interpreter to pass data from one module to the other. This thesis describes two audiovisual speech synthesizers, targeting English and Dutch, respectively. For the English synthesis, original Festival high-level synthesis modules were used, while for the Dutch synthesis some high-level modules of the NeXTeNS TTS system were applied [Kerkhoff and Marsi, 2002]. The research described in this thesis was conducted in parallel with research on auditory-only TTS synthesis within the same research laboratory. In that research, both new high-level and low-level synthesis techniques for generating auditory speech from text input are investigated [Latacz et al., 2007] [Latacz et al., 2010] [Latacz et al., 2011]. Over time, many of the new high-level synthesis modules that have been developed in the auditory TTS research were included in the AVTTS synthesizer that is described in this thesis. Note that the details on these modules are beyond the scope of this thesis, as they are mainly used to compose a high-quality set of parameters and descriptions that are given as input to the low-level synthesis stage (which is the actual subject of this thesis). For an overview of the English high-level synthesis that is used in the AVTTS system the interested reader is referred to [Latacz et al., 2008]. In addition, details on the Dutch high-level synthesis stage that is used in the AVTTS system are found in [Mattheyses et al., 2011a].
The low-level single-phase concatenative synthesizer (from this point on the "low-level" label will be dropped, assuming that the input text has been translated into its corresponding phoneme sequence) generates the synthetic audiovisual speech by concatenating original audiovisual speech segments selected from an audiovisual speech database. By jointly selecting and concatenating auditory and visual speech data, a maximal audiovisual coherence is retained in the synthetic speech. The base unit that is used in the selection process is a diphone. This means that the input phoneme sequence is split up into consecutive diphones, for each of which a set of matching candidate segments (in terms of phonemic transcript) is gathered from the database. Then, for each target diphone one final original speech segment is selected from its matching candidates based on the optimization of a global selection cost that is calculated using both auditory and visual features (see further in section 3.4). When the database contains a speech fragment that matches the target phoneme sequence over multiple successive phones (e.g., when a complete word of the input text is found in the database speech), in many cases all the consecutive diphone segments that make up this original fragment are selected for the corresponding target phoneme sub-sequence (i.e., the whole original segment is copied to the synthetic speech).
Figure 3.2: Diphone-based unit selection for the word "fish" (phoneme sequence F IH SH; diphone targets _-F, F-IH, IH-SH and SH-_). The original speech data that is eventually copied to the synthetic speech is indicated in green. Phoneme labels are in the Arpabet notation [CMU, 2013] and "_" represents the silence phoneme.
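To make the diphone-based target construction concrete, the sketch below splits the phoneme sequence of the word "fish" from figure 3.2 into diphone targets and gathers phonemically matching candidates from a toy database index. The index contents and all names are hypothetical stand-ins for the real corpus metadata.

```python
"""A sketch of diphone target construction and candidate gathering."""

SIL = "_"

def to_diphone_targets(phonemes):
    """F IH SH  ->  (_,F) (F,IH) (IH,SH) (SH,_)"""
    padded = [SIL] + list(phonemes) + [SIL]
    return list(zip(padded[:-1], padded[1:]))

def gather_candidates(targets, diphone_index):
    """For each diphone target, collect all database segments with the same
    phonemic transcript; an empty list signals that a back-off (e.g. to a
    single phone) is needed."""
    return [diphone_index.get(t, []) for t in targets]

# Hypothetical index: diphone -> list of (sentence id, position) occurrences.
diphone_index = {
    (SIL, "F"): [("s012", 0), ("s101", 7)],
    ("F", "IH"): [("s012", 1)],
    ("IH", "SH"): [("s012", 2), ("s044", 3)],
    ("SH", SIL): [("s044", 4)],
}

targets = to_diphone_targets(["F", "IH", "SH"])
for tgt, cands in zip(targets, gather_candidates(targets, diphone_index)):
    print(tgt, "->", cands)
```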
However, in the general unit selection paradigm, it is not necessarily so that the longest possible segment is always selected, since apart from signal continuity other features such as the extended phonemic context and a match with the target prosody are taken into account. Once a final original segment has been selected for each diphone target, these segments are concatenated in order to construct the synthetic speech signal. This concatenation involves both the joining of waveforms and the joining of video frame sequences. As was explained in section 2.2.6.3, the diphones are concatenated at the middle of the first and the second phone, respectively. This way, the concatenation takes place in the most stable part of each phone and the original transition between the two phones (i.e., the original local coarticulation) is copied to the synthetic speech (see figure 3.2). In some cases it can occur that a target diphone is not found in the database, especially when the applied database is small and not optimized to contain at least one instance of each diphone existing in the target language. In that case, a back-off 3.2. A concatenative audiovisual text-to-speech synthesizer High-level synthesis Audiovisual unit selection Audiovisual concatenation audiovisual units Pitch-synchronous crossfade & image metamorphosis Original combinations of auditory & visual speech Audiovisual speech database Text 70 Waveforms & video sequences Figure 3.3: Overview of the audiovisual unit selection synthesis. takes place in which the synthesizer selects a single phone from the database. In the concatenation stage, the phone-sized segment is concatenated at the phone boundaries. Since the use of such phone-sized segments leads to less optimal results as compared to diphone-sized segments, back-offs should be avoided by optimizing the synthesis database. After the concatenation stage, the final audiovisual speech is obtained by a simple multiplexing of the concatenated auditory and the concatenated visual speech signal, without the need for any additional signal processing. This is in contrast to the two-phase AVTTS approach, in which an additional synchronization of the two separately synthesized speech modes is needed. The general workflow of the low-end synthesis process is illustrated in figure 3.3. In the following sections the various steps of the audiovisual unit selection synthesis will be discussed in more detail. 3.3. Database preparation 3.3 3.3.1 71 Database preparation Requirements In order to perform concatenative speech synthesis, a database containing original audiovisual speech recordings must be created and provided to the synthesizer. This is an important off-line step since the properties and the quality of this database for a great deal determine the attainable synthesis quality. Since the speech synthesis involves the concatenation of original speech segments that are extracted from multiple randomly-located parts of the database, it is crucial that the original speech data is consistent throughout the whole dataset. Therefore, speech data from a single speaker is used and it is attempted that the audiovisual recording conditions remain constant during the recording session(s). This thesis aims to design an AVTTS system that creates a 2D photorealistic synthetic visual speech signal displaying a frontal view of the virtual speaker (i.e., “newsreader-style”). Therefore, the original visual speech should be recorded in a similar fashion. 
An important issue to take into account are head movements. Even when the original speaker was instructed to keep his/her head as steady as possible, some small variations of the position of the face toward the camera are unavoidable. Some researchers have tried to overcome this problem by fixing the speaker’s head in a canvas [Fagel and Clemens, 2004] or by using a head-mounted camera [Theobald, 2003]. Unfortunately, such solutions often result in a less optimal video quality or are unable to capture a natural appearance of the complete face. Even more, when recording large databases the speaker should be able to sit adequately comfortable, which is impossible when his/her head is fixed in a stiff construction. The video signal itself should have a sufficiently high resolution and a sufficiently high frame rate to capture all subtle speech movements. In addition, it should be ensured that the signal allows a quality post-processing stage, in which for instance a spatial segmentation of the image data in each video frame is performed (separating the face from the background and/or indicating the position of various visual articulators). On the other hand, it should also be ensured that the audio recordings contain only minimal background noise and that the used microphones allow a natural voice reproduction. 3.3.2 Databases used for synthesis In this stage of the research, two audiovisual databases were used. A first preliminary audiovisual speech corpus “AVBS” containing 53 Dutch sentences from weather forecasts was recorded in a quiet room on the university campus. The audiovisual speech was recorded at a resolution of 704x576 pixels at 25 progressive frames per second. The audio was recorded by a lavalier microphone at 44100Hz. Two example frames of this database are given in figure 3.4. Obviously, this database 3.3. Database preparation 72 Figure 3.4: Example frames from the “AVBS” audiovisual database. is too limited to attain high quality synthesis results. Nevertheless, is has been very useful to design and test various high-level and low-level synthesis modules by synthesizing Dutch sentences from the limited domain of weather forecasts. In 2008 the LIPS visual speech synthesis challenge was organized to assess and compare various visual speech synthesis strategies using the same original speech data [Theobald et al., 2008]. With this event an English audiovisual speech database suitable for concatenative audiovisual speech synthesis was released. A great part of the work described in this thesis was conducted using this “LIPS2008” dataset. The database consists of audiovisual “newsreader-style” recordings of a native English female speaker uttering 278 English sentences from the phonetically-balanced Messiah corpus [Theobald, 2003]. The visual speech was recorded at 25 interlaced frames per second in portrait orientation at a resolution of 288x720 pixels. After post-processing, the final visual speech signals consisted of 50 progressive frames per second at a resolution of 576x720 pixels. The acoustic speech signal was captured using a boom-microphone near the subject and was stored with 16 bits/sample at a sampling frequency of 44100Hz. Two example frames from the LIPS2008 corpus are given in figure 3.5. 3.3.3 Post-processing In order to be able to use a speech database for concatenative speech synthesis purposes, appropriate meta-data describing various aspects of the speech contained in the database has to be calculated. 
These features will be used by the synthesizer to select the most appropriate sequence of original speech segments that compose the target synthetic speech. 3.3. Database preparation 73 Figure 3.5: Example frames from the “LIPS2008” audiovisual database. 3.3.3.1 Phonemic segmentation In order to determine which database segments are matching the target speech description, the original speech must be phonemically segmented. To this end, the original auditory speech is analysed and each phoneme boundary is indicated. Afterwards, the original visual speech signal is synchronously segmented by positioning the viseme boundaries at those video frames that are closest to the phoneme boundaries in the corresponding acoustic signal. Note that an exact match between these two boundaries is impossible since the sample rate of a video signal is much lower compared to the sample rate of an audio signal. In general, the phonemic segmentation of an auditory speech signal is performed by a speech recognition tool in forced-alignment mode. This means that both the acoustic signal and its corresponding phoneme sequence are given as input to the recognizer, after which for each sentence an optimal set of phoneme boundaries is calculated. For the preliminary Dutch database AVBS, the phonemic segmentation was obtained using the SPRAAK toolkit [Demuynck et al., 2008]. The LIPS2008 database was already provided with a hand-corrected phonemic segmentation created using HTK [Young et al., 2006]. 3.3.3.2 Symbolic features It was described in section 3.2.2 that in the case of unit selection-based synthesis the synthesis system does not have to directly estimate the prosodic features of the output speech, since segments containing an appropriate original prosody can be copied from the database to the synthetic speech. For this, the segment selection has to take prosody-dependent features into account. To this end, for each phone in the original speech multiple symbolic features were calculated based on phonemic, prosodic and linguistic properties, such as part-of-speech, lexical stress, syllable type, 3.3. Database preparation 74 etc. A complete list of these features is given in table 3.1. Note that some features were determined for the neighbouring phones/syllables/words as well. The symbolic features can be used in the segment selection process to force the selection towards original segments exhibiting appropriate prosodic features such as pitch, stress and duration (see further in section 3.4.2.2). Table 3.1: Symbolic database features. Features with a are also calculated for the neighboring phones, syllables or words. Neighboring syllables are restricted to the syllables of the current word. Three neighbors on the left and three on the right are taken into account. Level phone phone phone syllable syllable syllable syllable syllable syllable syllable syllable syllable syllable word word word word word word word word word word 3.3.3.3 Feature Phonemic identity Pause type (if silence) Position in syllable Phoneme sequence Lexical stress ToBI accent Is accented Onset and coda type Onset, nucleus and coda size Distance to next/previous stressed syllable (in terms of syllables) Nbr. stressed syllables until next/prev. phrase break Distance to next/previous accented syllable (in terms of syllables) Nbr. accented syllables until next/prev. phrase break Position in phrase Part of speech Is content word Has accented syllable(s) Is capitalized Position in phrase Token punctuation Token prepunctuation Nbr. 
words until next/prev. phrase break Nbr. content words until next/prev. phrase break Acoustic features Several acoustic features describing the auditory speech signal were determined. The acoustic signal was divided into 32ms frames with 8ms frame-shift, after which 3.3. Database preparation 75 12 MFCCs were calculated to parameterize the spectral information of each frame. In addition, for each sentence a series of pitch-markers was determined, indicating each pitch period in the (voiced) segments of the speech. This information is useful in case the speech has to be pitch-modified or time-scaled by algorithms such as PSOLA [Moulines and Charpentier, 1990]. Moreover, these pitch-markers are used in the acoustic concatenation strategy (see further in section 3.5.3). The pitchmarkers are calculated using a dynamic programming approach. Summarized, a crude estimation for each marker is calculated using an average magnitude difference function. Then, for each estimation a final marker position is determined by selecting the most appropriate marker from a set of candidate markers. For more details on this pitch-marking strategy the interested reader is referred to [Mattheyses et al., 2006] since this algorithm is beyond the scope of this thesis. Based on the distance between consecutive pitch-markers, for each sentence a pitch contour is calculated. Sampling this contour at the middle of each phoneme defines a pitch feature for each database segment. Finally, an energy feature is calculated by measuring the spectral energy of the acoustic signal in a window of 1024 samples (24ms) around the centre of each phoneme. 3.3.3.4 Visual features The most important step in the post-processing of the visual speech recordings is the tracking of key points throughout the video signal. These landmarks indicate the position of various visual articulators and other parts of the speaker’s face in each video frame (illustrated in figure 3.6). The key point tracking was based on both a general facial feature tracker developed at the Vrije Universiteit Brussel [Hou et al., 2007] and an AAM-based tracker that was kindly provided by prof. Barry-John Theobald (University of East Anglia) [Theobald, 2003]. The AAM-based tracker performed best since the recorded video frames make up a uniform sequence of images from which a manually landmarked subset can be used to train the tracker. Based on these landmarks, the mouth-region of each video frame was extracted (using a fixed-size rectangular area around the mouth). These mouth regions were mathematically parameterized by a PCA analysis. These calculations resulted in a set of “eigenfaces” and defined for each database frame a set of PCA coefficients that reconstruct the grayscaled version of the mouth area of that frame by a linear combination of the eigenfaces. While key point tracking is useful to locate in each frame visually important areas such as the lips and the cheeks, it cannot be used to identify the teeth or the tongue since these are not visible in each recorded video frame. Therefore, an image processing technique was developed to track these facial features throughout the database. In a first step, for each frame the mouth is extracted based on the 3.3. Database preparation 76 Figure 3.6: Landmarks indicating the various parts of the face. landmarks that indicate the position of the lips. Both horizontally and vertically a margin of only a few pixels from the most outside landmark is used for this crop. 
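Before turning to the colour-channel analysis of these mouth crops, the acoustic post-processing described above can be summarized in a short sketch: 32 ms analysis frames with an 8 ms shift, a per-frame energy measure, a pitch value sampled at the centre of each phone, and the mapping of phone boundaries onto video frame indices used for the visual segmentation. The signal, the precomputed pitch contour and all names are dummy placeholders; MFCC extraction and pitch-marking themselves would be delegated to dedicated tools.

```python
"""A sketch of the framing, energy, pitch-sampling and boundary mapping
described in this section; all inputs are dummy placeholders."""
import numpy as np

def frame_signal(x, sr, frame_ms=32.0, shift_ms=8.0):
    frame_len = int(sr * frame_ms / 1000.0)
    shift = int(sr * shift_ms / 1000.0)
    n_frames = max(0, 1 + (len(x) - frame_len) // shift)
    return np.stack([x[i * shift: i * shift + frame_len] for i in range(n_frames)])

def frame_energies(x, sr):
    frames = frame_signal(x, sr) * np.hanning(int(sr * 0.032))
    return (frames ** 2).sum(axis=1)

def pitch_at_phone_centres(pitch_times, pitch_values, phone_bounds):
    """phone_bounds: list of (start, end) in seconds; sample f0 at each centre."""
    centres = [(s + e) / 2.0 for s, e in phone_bounds]
    return np.interp(centres, pitch_times, pitch_values)

def phone_to_video_frames(start_s, end_s, fps=50.0):
    """Map acoustic phone boundaries onto the nearest video frame indices."""
    return int(round(start_s * fps)), int(round(end_s * fps))

sr = 44100
x = np.random.randn(sr)                        # one second of dummy audio
print(frame_energies(x, sr).shape)             # one energy value per 8 ms frame
print(pitch_at_phone_centres([0.0, 0.5, 1.0], [200.0, 180.0, 190.0],
                             [(0.05, 0.20), (0.20, 0.42)]))
print(phone_to_video_frames(0.05, 0.20))
```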
Then, the coloured mouth-image is split into a blue, a green, and a red channel. In order to measure the area of the video frame representing visible teeth, the number of pixels in the blue channel that have an intensity value above a predefined threshold is calculated. This threshold is manually determined in such a way that the intensity measure results in a value close to zero when the detection is applied to video frames containing no visible teeth. The blue channel is chosen as this channel contains the least intensity information from the lips, the tongue and the skin. Note that this detection strategy only works under these particular circumstances, where each video frame is captured under the same recording conditions (e.g., external lighting, camera settings, etc.). The use of a similar technique to detect the presence of the tongue in a video frame is hard to realize since the tongue exhibits a variable appearance by moving forwards and backwards in the mouth. Therefore, it was decided to measure the visibility of the mouth-cavity (the dark area inside an open mouth when no tongue is displayed) instead. This is achieved in a similar fashion to the detection of the teeth, only this time the red channel is used and all pixels showing an intensity value below a predefined threshold are counted. This way, the mouth-cavity measure indirectly measures the tongue behaviour, since a high value will be obtained when the mouth appears wide open and no tongue is visible (since the tongue appears reddish, it mostly affects the pixel intensities in the red channel). The teeth and mouth-cavity detection is illustrated in figures 3.7, 3.8 and 3.9.
Figure 3.7: Detection of the teeth and the mouth-cavity (1). In the blue channel, the lips/skin/tongue contribute less to the pixel intensities as compared to the standard grayscale image, which improves the detection of the teeth (high pixel intensities). Likewise, detecting low pixel intensities in the red channel helps to detect the amount of mouth-cavity that is not blocked by the tongue.
Figure 3.8: Detection of the teeth and the mouth-cavity (2). This figure shows the histograms of the red channel of the five mouth representations displayed in figure 3.7. The summed histogram values below the threshold (indicated by the red line) are an appropriate representation of the amount of visible mouth-cavity.
Figure 3.9: Detection of the teeth and the mouth-cavity (3). This figure shows the histograms of the blue channel of the five mouth representations displayed in figure 3.7. The summed histogram values above the threshold (indicated by the red line) are an appropriate representation of the amount of visible teeth.
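The teeth and mouth-cavity measures described above essentially reduce to counting thresholded pixels in the blue and red channels of the mouth crop. A minimal sketch is given below; the threshold values are placeholders, whereas in the thesis they are tuned manually per database.

```python
"""A sketch of the teeth and mouth-cavity measures: pixels in the blue channel
above a threshold (visible teeth) and pixels in the red channel below a
threshold (dark mouth cavity). Threshold values are placeholders."""
import numpy as np

TEETH_THRESHOLD = 180    # blue-channel intensity above which a pixel counts as teeth
CAVITY_THRESHOLD = 60    # red-channel intensity below which a pixel counts as cavity

def teeth_and_cavity(mouth_rgb):
    """mouth_rgb: (H, W, 3) uint8 crop of the mouth region (R, G, B order)."""
    red = mouth_rgb[:, :, 0].astype(np.int32)
    blue = mouth_rgb[:, :, 2].astype(np.int32)
    teeth_area = int(np.count_nonzero(blue > TEETH_THRESHOLD))
    cavity_area = int(np.count_nonzero(red < CAVITY_THRESHOLD))
    return teeth_area, cavity_area

# Dummy crop: a dark "open mouth" with a bright patch standing in for teeth.
crop = np.full((40, 60, 3), 30, dtype=np.uint8)
crop[5:15, 20:40] = 220
print(teeth_and_cavity(crop))   # (200, 2200): bright patch vs. remaining dark pixels
```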
The desired output speech is defined by a series of targets. Each target has the size of a basic synthesis unit, which is a diphone in the case of the proposed AVTTS synthesis approach, and describes the ideal database segment needed to construct the synthetic speech. For each target, a set of phonemically matching candidate segments is gathered from the database. The distance between a candidate segment and the corresponding target defines the total target cost associated with that candidate. This distance is usually calculated using multiple features, each defining a sub-target cost. Since each target describes a diphone, the sub-target costs are calculated by comparing the features of the first and the second phone of the target with the features of the first and the second phone of the candidate segment, respectively.

Apart from searching for original speech segments closely matching the target speech, the segment selection algorithm has to take into account the ease with which two original speech segments can be joined together. To this end, the segment selection takes join costs into account that indicate the smoothness of the signal resulting from the concatenation of each pair of candidate segments corresponding to two consecutive targets. Similar to the calculation of the total target cost, the total join cost is calculated by comparing multiple features of the candidate segments, where each comparison defines its own sub-join cost. Note that the total join cost is always zero when the two candidate segments are adjacent in the database, since if those segments were selected they could be copied as a whole from the database to the synthetic speech (no concatenation is needed).

The total target cost of a candidate segment u_i matching a synthesis target t_i can be written as the weighted sum of k sub-target costs:

$$C_{total}^{target}(t_i, u_i) = \frac{\sum_{j=1}^{k} \omega_j^{target}\, C_j^{target}(t_i, u_i)}{\sum_{j=1}^{k} \omega_j^{target}} \qquad (3.1)$$

in which ω_j^target represents the weight factor of the j-th target cost. The various target costs C_j^target that are used by the synthesizer are discussed in section 3.4.2. Similarly, the join cost associated with the transition from candidate segment u_i to candidate segment u_{i+1} can be written as the weighted sum of l sub-join costs:

$$C_{total}^{join}(u_i, u_{i+1}) = \frac{\sum_{j=1}^{l} \omega_j^{join}\, C_j^{join}(u_i, u_{i+1})}{\sum_{j=1}^{l} \omega_j^{join}} \qquad (3.2)$$

in which ω_j^join represents the weight factor of the j-th join cost. The various join costs C_j^join that are used by the synthesizer are discussed in section 3.4.3. Using these two expressions, the total cost for synthesizing a sentence that is composed of T targets t_1, t_2, ..., t_T by concatenating candidate segments u_1, u_2, ..., u_T can be written as:

$$C(t_1, \ldots, t_T, u_1, \ldots, u_T) = \alpha \left[ \sum_{i=1}^{T} C_{total}^{target}(t_i, u_i) \right] + \sum_{i=1}^{T-1} C_{total}^{join}(u_i, u_{i+1}) \qquad (3.3)$$

in which α is a parameter that controls the importance of the total target cost over the total join cost. The most appropriate set of candidate segments is the sequence (û_1, û_2, ..., û_T) that minimizes equation 3.3. Searching for this optimal set is a complicated problem, since for every target multiple candidates exist, which leads to an enormous number of possible sequences, as illustrated in figure 3.11. Therefore, a dynamic programming approach known as the Viterbi search [Viterbi, 1967] is applied to efficiently find the optimal sequence of database segments. The Viterbi algorithm is explained in detail in appendix A.
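To make the role of equations 3.1 to 3.3 in the Viterbi search more concrete, the following minimal Python sketch performs the dynamic programming over precomputed cost values. The function name and the data layout (one list of total target costs per target, one matrix of total join costs per pair of consecutive targets) are assumptions made for this illustration only; they do not reflect the actual C++ implementation of the synthesizer.

```python
def viterbi_unit_selection(target_costs, join_costs, alpha=1.0):
    """Sketch of the Viterbi search that minimizes equation 3.3.

    target_costs[i][m]  : total target cost of candidate m for target i   (eq. 3.1)
    join_costs[i][m][n] : total join cost of joining candidate m of target i
                          to candidate n of target i+1                    (eq. 3.2)
    alpha               : balance between target and join costs           (eq. 3.3)
    """
    T = len(target_costs)
    # best[i][n]: lowest accumulated cost of any path ending in candidate n
    # of target i; back[i][n]: the best predecessor of that candidate.
    best = [[alpha * c for c in target_costs[0]]]
    back = [[None] * len(target_costs[0])]
    for i in range(1, T):
        row, ptr = [], []
        for n, tc in enumerate(target_costs[i]):
            # Choose the predecessor m that minimizes accumulated cost + join cost.
            m = min(range(len(best[i - 1])),
                    key=lambda m: best[i - 1][m] + join_costs[i - 1][m][n])
            row.append(best[i - 1][m] + join_costs[i - 1][m][n] + alpha * tc)
            ptr.append(m)
        best.append(row)
        back.append(ptr)
    # Backtrack the optimal sequence of candidate indices (one per target).
    n = min(range(len(best[-1])), key=lambda n: best[-1][n])
    path = [n]
    for i in range(T - 1, 0, -1):
        n = back[i][n]
        path.append(n)
    return list(reversed(path))
```

In the actual system the join cost of two candidate segments that are adjacent in the database is simply zero, which automatically favours the selection of longer contiguous stretches of original speech.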
Since the AVTTS synthesizer is designed to select and concatenate audiovisual speech segments, the total selection cost has to force the selection towards original segments that are optimal in the auditory as well as in the visual mode. To this end, both auditory and visual sub-costs are applied, and equations 3.1 and 3.2 can be written as:

$$C_{total}^{target}(t_i, u_i) = \frac{\sum_{j=1}^{k_a} \omega_j^{target,a}\, C_j^{target,a}(t_i, u_i) + \sum_{j=1}^{k_v} \omega_j^{target,v}\, C_j^{target,v}(t_i, u_i)}{\sum_{j=1}^{k_a} \omega_j^{target,a} + \sum_{j=1}^{k_v} \omega_j^{target,v}}$$

$$C_{total}^{join}(u_i, u_{i+1}) = \frac{\sum_{j=1}^{l_a} \omega_j^{join,a}\, C_j^{join,a}(u_i, u_{i+1}) + \sum_{j=1}^{l_v} \omega_j^{join,v}\, C_j^{join,v}(u_i, u_{i+1})}{\sum_{j=1}^{l_a} \omega_j^{join,a} + \sum_{j=1}^{l_v} \omega_j^{join,v}} \qquad (3.4)$$

with label a denoting audio-related values and label v denoting video-related values.

Figure 3.11: A trellis illustrating the unit selection problem. Each target t has many associated candidate segments u. From each candidate segment u_ij matching target t_i the transition to every candidate segment matching target t_{i+1} must be considered.

High-quality audiovisual synthesis can be achieved by minimizing equation 3.3 only if accurate sub-costs and an appropriate weighting between these multiple sub-costs are defined. This is not trivial, since it is likely that the segment that is optimal for constructing the synthetic auditory speech will often not be the preferable segment for constructing the synthetic visual speech, and vice versa. The following two sections elaborate on the various sub-costs that are applied in the proposed AVTTS synthesis approach.

Figure 3.12: Target costs applied in the AVTTS synthesis. Costs marked with an * are assigned an infinitely high weight factor.

3.4.2 Target costs

A target cost C^target(t_i, u_i) indicates to which extent a candidate database segment u_i matches the target speech segment t_i. An overview of the various target costs used by the AVTTS system is given in figure 3.12.

3.4.2.1 Phonemic match

Section 3.2.2 explained that for each target the synthesizer searches the database for candidate segments that phonemically match the target phoneme sequence. This technique already involves a "hidden" target cost, since in the general unit selection paradigm [Hunt and Black, 1996] each database segment is considered as a candidate unit. The candidate selection technique applied in the AVTTS system assumes a binary target cost C_phon.match(t_i, u_i) based on the phonemic matching between the target segment t_i and the database segment u_i: when the segment from the database has the same phoneme label as the target speech segment, the value of the cost is set to 0; otherwise, the cost is assigned the value 1. This target cost is given an infinitely high associated weight. Most auditory unit selection synthesizers are implemented this way, since for auditory synthesis it cannot be afforded to include an incorrect phoneme in the synthetic speech. This explains why the hidden target cost is employed in the AVTTS synthesizer as well, since in the proposed single-phase AVTTS approach auditory and visual segments are selected together from the database.
Note, however, that for visual-only speech synthesis it is possible to use non-phonemically matching database segments due to the many-to-one behaviour of the mapping from phonemes to visemes. This allows the selection of a candidate segment whose phonemic transcript does not match the target phoneme sequence, provided that each phoneme of the candidate segment is from the same viseme class as its corresponding target phoneme. The advantages and disadvantages of this technique will be discussed later on in chapter 6.

3.4.2.2 Symbolic costs

The synthesizer adopts multiple symbolic target costs C_symb^target(t_i, u_i) to guide the selection towards database segments that exhibit appropriate prosodic features. These symbolic target costs are calculated using the symbolic features that were discussed in section 3.3.3.2 and are assigned a binary value (zero or one) based on the match between the feature value for target segment t_i and the feature value for candidate segment u_i. Various subsets of all features listed in table 3.1 have been evaluated (e.g., the subset used in the Festival Multisyn TTS synthesizer [Clark et al., 2007]), from which it was concluded that a minimal set of symbolic target costs should at least contain cost values based on:

Context phoneme name
The context of the database segment is compared with the context of the target in terms of phoneme identity. Six binary values are assigned, based on the matching of the phonemes found one, two and three steps forward and backward in the target/database phoneme sequence. These target costs encourage the synthesizer to select original segments that were uttered in a context similar to the one described by the target speech, since this way appropriate longer coarticulation effects can be copied from the database to the synthetic speech.

Silence type of the previous and next segment
The silence type is either "none" (no silence), "light" (short phrase break), or "heavy" (long phrase break, for example after a comma). This cost is needed since the uttering of a phoneme can be influenced by the vicinity of a pause or phrase break.

Syllable name
This feature encourages the synthesizer to select database segments that are located in the same syllable as described in the target phoneme sequence. This helps to copy the appropriate coarticulations from the original speech, since such coarticulation effects are most pronounced within a syllable.

Syllable stress
This feature encourages the synthesizer to select database segments that are located in a stressed syllable in case the corresponding target syllable is stressed too, and vice versa. The syllable level is used for this cost since this is the most appropriate level to assign stress-related features.

Part-of-speech (word level)
This feature encourages the synthesizer to select database segments that are located in a word that was assigned the same part-of-speech label as the corresponding word in the target sequence. This cost is useful when an entire word from the target phoneme sequence is found in the database. In many cases, the whole original speech signal representing this word will be selected, since its consecutive candidate segments all contribute a zero join cost.
In that case, it is necessary to inspect the part-of-speech information of the original speech segment: when the part-of-speech label of the word in the database does not match the target part-of-speech, it is likely that an incorrect original prosody is copied to the synthetic speech signal.

Position in phrase (word level)
The position of a word in a phrase often determines its prosodic properties (especially pitch-related properties, since each type of phrase exhibits its own typical f0-contour). Therefore, when selecting longer segments from the database this cost promotes the selection of segments that are more likely to exhibit an appropriate prosody.

Punctuation
The prosodic properties of a sentence are highly dependent on the punctuation (e.g., commas, colons, question marks, etc.). Therefore, the selection of a database segment that matches the punctuation in the input text (e.g., both followed by a question mark) is rewarded with a lower target cost value.

3.4.2.3 Safety costs

The quality of the synthesized speech is highly dependent on the accuracy of the database meta-data, since it is this meta-data that is used to calculate the various selection cost values. In addition, the phonemic segmentation of the database has to be very precise in order to be able to copy the correct pieces of acoustic/visual data from the database to the synthetic speech. Unfortunately, the automatic phonemic segmentation of speech data is never error-free (while correcting it manually would take a massive amount of work and time). For instance, it can occur that the speech recognizer misplaces the boundary between two consecutive phonemes. Moreover, it is possible that the original speaker made a mistake while uttering the database sentences, such as pronouncing an incorrect phoneme or inadequately articulating a particular phoneme instance. When these flaws are not manually detected in the (post-)recording stage, the automatic phonemic segmentation of such a sentence is likely to result in unpredictable errors. This is why during the construction of expensive databases for commercial TTS systems (e.g., Acapela [Acapela, 2013], Nuance [Nuance, 2013], etc.) a considerable amount of the development time is spent on the manual inspection of the automatically generated segmentation and database meta-data.

An alternative, automatic technique to avoid synthesis errors caused by flaws in the database is applied by the AVTTS system proposed in this thesis, for which an extra set of "safety" target costs has been developed to minimize the chance of selecting a candidate segment that is likely to contain such database errors. A first safety target cost C_hard-pruning(t_i, u_i) is based on an offline analysis of the database in which, for each distinct phoneme, its most extreme instances (i.e., its outliers) are marked as "suspicious" segments. These segments are restricted from selection by assigning an infinitely high weight to a "safety" target cost that is assigned the value one when the candidate segment u_i has been marked as "suspicious" and the value zero otherwise. In order to mark particular database segments as "suspicious", all database instances of a particular phoneme are compared to each other using both auditory and visual features. To this end, for each feature all instances of a particular phoneme are gathered from the database and each instance i is characterized by its mean distance d_i from all other instances.
The actual way in which d_i is calculated depends on the feature that is being used in the analysis. Then, the overall mean μ_d and the standard deviation σ_d of these mean distances are calculated. "Suspicious" segments that possibly contain a database error are those segments for which equation 3.5 holds:

$$|d_i - \mu_d| > \lambda \cdot \sigma_d \qquad (3.5)$$

with λ a factor that controls the number of segments to restrict from selection. This calculation is performed for each distinct phoneme present in the database. A first series of "suspicious" labels was calculated by describing each phoneme instance based on its acoustic properties. To this end, each instance was segmented into 25ms frames, after which each frame was represented by a feature vector containing MFCC, pitch and energy information. The distance between two phoneme instances was calculated as the frame-wise distance between the corresponding feature vectors after time-aligning both instances (for more details on this the interested reader is referred to [Latacz et al., 2009]). Another series of "suspicious" labels was calculated on visual features. To this end, each phoneme instance was identified by the PCA coefficients of the video frame that is closest to the middle of the instance. The distance between two phoneme instances was calculated as the Euclidean distance between their corresponding PCA coefficients.

Note that this safety target cost is likely to eliminate some extreme phoneme instances that were correctly segmented/analysed as well. In general, this is not a problem, since these particular segments will be inappropriate for most target speech sequences anyway. On the other hand, it should be ensured that only a few instances of each phoneme are labelled as "suspicious", since deviant instances could still be needed to synthesize particular irregular coarticulations or prosody configurations. Therefore, in equation 3.5 the parameter λ is used to ensure that the number of "suspicious" segments is sufficiently small compared to the total database size.

The "suspicious" labelling of the database defines a so-called hard pruning, in which the labelled segments are completely excluded from selection. On the other hand, a soft pruning of the database could also be advantageous, in which the selection of some particular segments is strongly discouraged but not prohibited. The AVTTS system applies such a soft pruning by performing an additional analysis of the database in which the duration of each segment is evaluated in a similar fashion as the analysis to determine the "suspicious" segments. This way, for each distinct phoneme those instances exhibiting an atypical duration are assigned a "suspicious-duration" label. A second safety target cost C_soft-pruning(t_i, u_i) is defined, which is assigned the value one when a candidate segment u_i was assigned such a "suspicious-duration" label and the value zero in all other cases. By assigning this target cost a high (but not infinite) weight, it can be ensured that these "suspicious-duration" segments are only selected in case no other options are possible.

3.4.3 Join costs

A join cost C^join(u_i, u_{i+1}) indicates to which extent two candidate segments u_i and u_{i+1} can be concatenated without creating disturbing concatenation artefacts. An overview of the various join costs used by the AVTTS system is given in figure 3.13.
3.4.3.1 Auditory join costs

Auditory join costs promote the selection of candidate segments (for consecutive targets) of which the auditory speech modes can be smoothly concatenated. To this end, the continuity of various acoustic features (see section 3.3.3.3) at the concatenation point is evaluated. A first important auditory join cost measures the spectral smoothness by calculating the Euclidean distance between the MFCC values at both sides of the concatenation point:

$$C_{MFCC}(u_i, u_{i+1}) = \sqrt{\sum_{n=1}^{N_{MFCC}} \left( MFCC_i(n) - MFCC_{i+1}(n) \right)^2} \qquad (3.6)$$

with N_MFCC the number of MFCC values used to describe the spectral information, and MFCC_i(n) and MFCC_{i+1}(n) the MFCC values of the last audio frame of segment u_i and the first audio frame of segment u_{i+1}, respectively.

Figure 3.13: Join costs applied in the AVTTS synthesis.

In addition, a second join cost calculates the difference in spectral energy between both segments:

$$C_{energy}(u_i, u_{i+1}) = |E_i - E_{i+1}| \qquad (3.7)$$

with E_i and E_{i+1} the energy features of u_i and u_{i+1}, respectively. A third auditory join cost takes pitch levels into account by calculating the absolute difference in logarithmic f0 between the two sides of a join:

$$C_{pitch}(u_i, u_{i+1}) = \left| \log(f0_i) - \log(f0_{i+1}) \right| \qquad (3.8)$$

with f0_i and f0_{i+1} the pitch-marker-based pitch values measured at the end of segment u_i and at the beginning of segment u_{i+1}, respectively. If the phone at the join position is voiceless, the value of C_pitch(u_i, u_{i+1}) is set to zero.

3.4.3.2 Visual join costs

Similar to the auditory join costs, the visual join costs promote the selection of database segments that allow a smooth concatenation of their visual speech modes. To this end, the continuity of various visual features (see section 3.3.3.4) at the concatenation point is evaluated. A first visual join cost measures the "shape" similarity at both sides of the join by comparing the positions of the landmarks denoting the lips of the original speaker. The value of this cost is calculated as the summed Euclidean distance between every two corresponding mouth landmarks. Before calculating these distances, both frames at the join position (and their corresponding landmark positions) are aligned in order to improve the concatenation quality (see further in section 3.5.4):

$$C_{landmark}(u_i, u_{i+1}) = \sum_{m=1}^{N_L} \sqrt{\left( \hat{x}_i(m) - \hat{x}_{i+1}(m) \right)^2 + \left( \hat{y}_i(m) - \hat{y}_{i+1}(m) \right)^2} \qquad (3.9)$$

with x̂_i/ŷ_i and x̂_{i+1}/ŷ_{i+1} the vectors containing the coordinates of the landmarks of the last video frame of segment u_i and the first video frame of segment u_{i+1}, respectively, after the spatial alignment of segments u_i and u_{i+1}. N_L represents the number of landmarks used in the calculation.

Apart from the continuity of the shape information, it is also important that the "appearance" of the virtual speaker varies smoothly around the concatenation point. To this end, a second visual join cost is calculated as the difference in the amount of visible teeth between the two frames at the join position. Similarly, another visual join cost measures the difference in the amount of visible mouth cavity between these two frames:

$$C_{teeth}(u_i, u_{i+1}) = |TE_i - TE_{i+1}|, \qquad C_{cavity}(u_i, u_{i+1}) = |CA_i - CA_{i+1}| \qquad (3.10)$$

with TE_i and TE_{i+1} the amount of teeth visible in the last video frame of segment u_i and the first video frame of segment u_{i+1}, respectively.
Similarly, CA_i and CA_{i+1} represent the amount of mouth cavity visible in the last video frame of segment u_i and the first video frame of segment u_{i+1}, respectively. Finally, a fourth visual join cost measures the mathematical continuity of the concatenated visual speech by calculating the Euclidean distance between the PCA coefficients of both frames at the join position:

$$C_{PCA}(u_i, u_{i+1}) = \sqrt{\sum_{n=1}^{N_{PCA}} \left( PCA_i(n) - PCA_{i+1}(n) \right)^2} \qquad (3.11)$$

with N_PCA the number of PCA coefficients used to describe each video frame, and PCA_i(n) and PCA_{i+1}(n) the PCA coefficients of the last video frame of segment u_i and the first video frame of segment u_{i+1}, respectively.

3.4.4 Weight optimization

From the previous two sections it is clear that the total cost that corresponds to the selection of a particular candidate segment involves the calculation of many separate sub-costs. A specific weight is assigned to each sub-cost, after which the total cost is given by the weighted sum of all sub-costs (see equation 3.4). In the proposed joint audio/video selection strategy, these weights are not only used to specify the relative importance of each sub-cost over the other sub-costs, but they also determine the relative importance of the auditory sub-costs over the visual sub-costs. In addition, recall that the importance of the total target cost over the total join cost can be adjusted by the factor α in equation 3.3. Good quality segment selection is only feasible when an appropriate configuration of all these various weights is applied.

3.4.4.1 Cost scaling

Sections 3.4.2 and 3.4.3 explained how the various sub-costs are calculated. Since each cost is calculated on different features, every sub-cost will exhibit its own typical range of cost values. In order to be able to easily specify the contribution of the various sub-costs to the total selection cost, each sub-cost is scaled with a scaling factor that adjusts its possible cost values to a range that lies approximately between zero and one. To determine these scaling factors, an extensive set of typical cost values is gathered for each sub-cost. Sub-target cost values can be learned by synthesizing random speech samples and registering all calculated cost values for each sub-target cost. However, in the current set-up of the AVTTS system only binary target costs are used, which are always assigned the value zero or one. Because of this, typical cost values must be learned only for the sub-join costs. To this end, for each distinct phoneme a fixed number of instances is uniformly sampled from the database. Cost values are collected by calculating the sub-join cost between every two gathered instances of a particular phoneme. From the histograms describing the learned sub-join cost values, three categories of join costs can be identified (see figure 3.14). A first category of join costs results in symmetrical Gaussian-distributed cost values (e.g., C_PCA). A second category results in asymmetric Gaussian distributions (e.g., C_MFCC and C_landmark), while a third category exhibits an exponentially decaying behaviour, for which in the majority of the cases a low cost value is assigned (e.g., C_pitch and C_teeth). For each cost, a scaling factor is determined that maps the 95% lowest gathered cost values onto the range [0,1].
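As a rough illustration of this scaling step, the factor for one sub-join cost can be derived from the 95th percentile of the gathered cost values, for example as in the sketch below (the function name and the use of NumPy are assumptions made for this sketch, not the actual implementation):

```python
import numpy as np

def scaling_factor(cost_samples, keep_fraction=0.95):
    # cost_samples: sub-join cost values gathered by comparing uniformly
    # sampled instances of the same phoneme (see the text above).
    samples = np.asarray(cost_samples, dtype=float)
    upper = np.percentile(samples, keep_fraction * 100.0)  # 95th percentile
    # Map the lowest 95% of the observed cost values onto [0, 1].
    return 1.0 / upper if upper > 0 else 1.0

# A scaled sub-join cost is then simply: c_scaled = scaling_factor(history) * c
```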
3.4.4.2 Weight distribution

Once an appropriate scaling factor has been determined for each sub-cost, a particular sub-cost C_i can be given twice the importance of sub-cost C_j by assigning it a weight ω_i = 2ω_j. Empirically determining an optimal set of weights is a very time-consuming task. Therefore, the weight optimization was split into several stages. In a first stage, an appropriate weight distribution among the auditory join sub-costs and among the visual join sub-costs was determined using small informal perception tests in which the attained synthesis quality for multiple random test sentences was compared using several weight configurations. For the auditory join sub-costs, it was found that the MFCC-cost should be assigned an increased weight compared to the pitch-cost and the energy-cost. A similar conclusion could be made for the PCA-based visual join cost in comparison with the other visual sub-join costs.

Figure 3.14: Join cost histograms, indicating three different behaviours.

Next, the overall influence of the auditory join costs in comparison to the visual join costs was evaluated. To this end, a small perception test was conducted in which 6 participants (all speech technology experts) were shown 10 pairs of audiovisual speech samples synthesized using the joint audio/video selection approach. Each sample contained a standard-length English sentence and the original combinations of auditory and visual speech were selected from the LIPS2008 database. One sample of each pair was synthesized using only auditory join costs, while the other sample contained a synthesis of the same sentence for which only visual join costs were taken into account. All other synthesis parameters were the same for both samples. The subjects were asked to write down their preference for one of the two samples using a 5-point comparative MOS-scale [-2,2]. The results obtained were analysed using a Wilcoxon signed-rank test, which indicated that the samples synthesized using only auditory join costs were preferred over the samples synthesized using only visual join costs (Z = −5.0 ; p < 0.001). From this small experiment it can be concluded that the smoothness of the auditory speech mode appears to be more crucial than the smoothness of the visual mode. As a consequence, the total weight assigned to the auditory join costs should be higher than the total weight assigned to the visual join costs.

All binary target costs were assigned the same weight, except for the costs calculated on the phonemic match between the candidate context and the target context. These particular target costs were triangularly weighted to assign more influence to the matching of the context close to the segment and less influence to the matching of the context further away from the segment. Finally, a last parameter that must be determined is the factor α in equation 3.3, which sets the relative influence of the total target cost compared to the total join cost. To this end, a value for α that balances these two influences was calculated by collecting a large number of total target cost values and total join cost values occurring when synthesizing an arbitrary set of sentences. From these gathered values, an appropriate value for α was computed as the ratio of the mean total join cost over the mean total target cost.
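A small sketch of this balancing computation is given below; the function name and the assumption that the logged cost values are available as plain sequences are illustrative only:

```python
import numpy as np

def balance_alpha(total_target_costs, total_join_costs):
    # Cost values logged while synthesizing an arbitrary set of sentences.
    # Choosing alpha as this ratio makes the target term and the join term
    # of equation 3.3 contribute equally on average.
    return np.mean(total_join_costs) / np.mean(total_target_costs)
```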
Once all weight factors have been determined, the first expression from equation 3.4 can be written as:

$$C_{total}^{target}(t_i, u_i) = \omega_1 C_{phon.match}(t_i, u_i) + \omega_2 C_{hard\text{-}pruning}(t_i, u_i) + \omega_3 C_{soft\text{-}pruning}(t_i, u_i) + \frac{\sum_{j=1}^{N_s} \omega_j^{symb}\, C_j^{symb}(t_i, u_i)}{\sum_{j=1}^{N_s} \omega_j^{symb}} \qquad (3.12)$$

In equation 3.12, all costs are binary costs, N_s represents the number of symbolic costs (discussed in section 3.4.2.2) that are used in the calculation, ω_1 = ω_2 = ∞, ω_3 = 1000, and all ω_j^symb are equal to 1, except for the costs based on the phonemic context (up to three phonemes before/after the target/candidate segment), which are triangularly weighted using the values (0.5, 0.25, 0.125). Note that, in order to speed up the unit selection process, an efficient implementation adds for each target segment t_i only those database segments u_i to the list of candidate segments for which C_phon.match(t_i, u_i) = C_hard-pruning(t_i, u_i) = 0. The costs C_phon.match and C_hard-pruning can then be omitted in the Viterbi search.

Likewise, the second expression from equation 3.4 can be written as:

$$C_{total}^{join}(u_i, u_{i+1}) = \frac{\omega_1 \hat{C}_{MFCC}(u_i, u_{i+1}) + \omega_2 \hat{C}_{pitch}(u_i, u_{i+1}) + \omega_3 \hat{C}_{energy}(u_i, u_{i+1}) + \omega_4 \hat{C}_{landmark}(u_i, u_{i+1}) + \omega_5 \hat{C}_{teeth}(u_i, u_{i+1}) + \omega_6 \hat{C}_{cavity}(u_i, u_{i+1}) + \omega_7 \hat{C}_{PCA}(u_i, u_{i+1})}{\omega_1 + \omega_2 + \omega_3 + \omega_4 + \omega_5 + \omega_6 + \omega_7} \qquad (3.13)$$

with ω_1 = 5, ω_2 = ω_3 = 2, ω_4 = ω_5 = ω_6 = 1 and ω_7 = 3. Ĉ represents the scaled value of the original cost C (see section 3.4.4.1). Note that it is likely that the chosen weight distribution is only sub-optimal. A better set of weights could be learned automatically by a parameter optimization technique. In the laboratory's auditory-only TTS research, such an automatic weight optimization strategy has been developed, which learns multiple context-dependent weight configurations [Latacz et al., 2011]. Unfortunately, only a variable benefit was gained from this automatic weight training, since the attained synthesis quality was still fluctuating between consecutive syntheses. Nevertheless, it would be an interesting future initiative to design and evaluate such an automatic weight optimization technique to learn the balancing between auditory and visual selection costs as well. In addition, note that the applied visual join costs contain some redundancy: features such as the amount of visible teeth and the amount of visible mouth cavity are described by the PCA parameter values as well. However, it was opted to include all these sub-costs in order to be able to separately fine-tune the influence of these aspects on the total join cost.

3.5 Audiovisual concatenation

Once the optimal sequence of database segments matching the target speech has been determined, these audiovisual speech signals need to be concatenated in order to construct the desired synthetic speech signal. This requires two parallel concatenation actions, since for every two consecutive selected segments both the acoustic signals and the video signals need to be joined together. For each speech mode, a concatenation strategy is needed that smooths the concatenated signal around the concatenation point in order to avoid jerky synthetic speech. In addition, it has to be ensured that this smoothing does not produce unnatural speech signals, since otherwise observers will still be able to notice the join positions.
3.5.1 A visual mouth-signal and a visual background-signal

Despite the fact that the visual speech from the database displays the complete face of the original speaker, it should be noted that all the selection costs mentioned in section 3.4 focus on the mouth area of the video frames only. Obviously, this part of each frame contains the major share of the speech-related information. This is especially true for the visual speech from the LIPS2008 database, since while recording this dataset the original speaker was asked to utter the text while maintaining a neutral visual prosody as much as possible. However, since the original speaker's head was not mechanically fixed, slight head movements are present in the original visual speech data. This makes it very hard to smoothly concatenate visual speech segments containing the complete face of the speaker, as this would require a 3D rotation and translation of the face towards its "mean" position in front of the camera. Therefore, the AVTTS system focuses on synthesizing a mouth-signal that matches the target visual speech, after which this synthetic mouth-signal is merged with a background signal that contains the other parts of the face of the virtual speaker. These background signals are original sequences extracted from the database, of which it was ensured that they exhibit a neutral visual prosody. When a new mouth-signal has been constructed by the concatenation of the selected database segments, each frame of this signal is aligned with its corresponding frame from the background sequence, after which a hand-crafted mask is used to smoothly merge the two video streams, as illustrated in figure 3.15.

Figure 3.15: The left panel shows the result of the merging of the mouth-signal with the background signal. The right panel shows the background signal in gray and the mouth-signal in colour.

3.5.2 Audiovisual synchrony

As was explained in section 3.2.2, the joining of the selected segments takes place at the middle of the two overlapping phonemes (see figure 3.2). The exact join position in the auditory speech mode will always coincide with a pitch-marker, as this allows a pitch-synchronous concatenation smoothing (see further in section 3.5.3). The most straightforward technique would be to select, in the two overlapping phones, the pitch-marker that is closest to the phone centre as join position. Instead, the AVTTS system optimizes each join by calculating the best pair of pitch-markers (one marker for each overlapping phone) that minimizes the spectral distance between the parts of the acoustic signals that will be overlapped during the concatenation process [Conkie and Isard, 1996]. These optimal pitch-markers are searched for in a small window around the middle of each overlapping phone (typically 4 consecutive pitch-markers are evaluated for each phone). Once the exact join position is determined in the auditory mode, a video frame must be selected in each corresponding video signal as join position in the visual speech mode. Since the sample rate of the acoustic signal is much higher than the sample rate of the video signal, the join position in the visual speech mode cannot be determined with the same accuracy as the pitch-marker-based optimization strategy that was applied for the auditory mode.
Note, however, that in order to successfully copy the original audiovisual coherence from the two selected original segments to the concatenated synthetic speech, it is important that the audiovisual synchronization is preserved. To this end, for each concatenation the join position in the visual speech mode is positioned as closely as possible to the join position in the auditory mode. This still causes some degree of audiovisual asynchrony, since the join position in the visual mode will always be located a small time extent before or after the corresponding join position in the auditory mode. It is well known that in audiovisual speech perception, human observers are very sensitive to the auditory speech information leading the visual speech information. On the other hand, there seems to exist quite some tolerance for the video signal leading the auditory signal [Summerfield, 1992] [Grant and Greenberg, 2001] [Grant et al., 2004] [Van Wassenhove et al., 2007] [Carter et al., 2010]. The AVTTS system exploits this property to optimize the concatenation of the selected audiovisual segments by ensuring that throughout the whole concatenated audiovisual signal, the original combinations of auditory and visual speech are always desynchronized by the smallest possible video lead, i.e., between zero and one video frame (40ms for a video signal containing 25 frames per second). More details on the exact implementation of this technique are given further on in this chapter.

3.5.3 Audio concatenation

To smooth the concatenation of two acoustic signals, a small section of both signals is overlapped and cross-faded. When the join takes place in a voiced speech segment, it has to be ensured that the periodicity is not affected by the smoothing technique. For instance, figure 3.16 illustrates the concatenation of two speech segments representing the diphones "b-o" and "o-m". It shows that around the join position there is quite a large dissimilarity between the two signals, although both represent the same phoneme /o/. The figure shows that the usage of a standard cross-fade technique results in the creation of some anomalous pitch periods around the concatenation point, which causes noticeable concatenation artefacts in the output speech.

Figure 3.16: Auditory concatenation artifacts. The rectangle indicates erroneous pitch periods resulting from the cross-fading of the two waveforms.

To successfully smooth the acoustic concatenations, the AVTTS system applies a pitch-synchronous cross-fade technique. When the two segments that are concatenated are referred to as A and B, the join technique first extracts from both signals a number of pitch periods (typically 2 to 5) around the pitch-marker that was selected as the optimal join position, producing short segments a and b. Then, the pitch of signals a and b is altered using PSOLA [Moulines and Charpentier, 1990] in such a way that the two resulting signals â and b̂ exhibit exactly the same pitch contour. The initial pitch value of these signals is chosen equal to the original pitch level measured in signal A at the time instance at which segment a was extracted. The pitch value at the end of â and b̂ is chosen equal to the original pitch value measured in signal B at the end of the time interval from which segment b was extracted. The pitch contour of â and b̂ evolves linearly from the pitch level at the beginning to the pitch level at the end of the signals.
The concatenation of segments A and B is performed by overlapping and cross-fading the pitch-synchronized signals â and b̂ using a Hanning function. This strategy, illustrated in figure 3.17, minimizes the creation of irregular pitch periods and preserves the periodicity in the concatenated signal as much as possible.

Figure 3.17: Pitch-synchronous audio concatenation. The upper panel illustrates the two signals A and B that need to be concatenated, the middle panel illustrates the pitch-synchronized waveforms â and b̂, and the lower panel illustrates the resulting concatenated signal after cross-fading.

3.5.4 Video concatenation

Similar to the acoustic concatenation technique, the approach for joining the visual speech signals of two selected database segments has to smooth the concatenated signal around the join position in order to avoid jerky synthetic visual speech. To this end, the frames at the end of the first and at the beginning of the second overlapping video segment are replaced by a sequence of new intermediate video frames. It is obvious that these intermediate frames cannot be generated by a simple image cross-fade, since, for instance, at the middle of the cross-fade the intermediate frame would consist of two different original mouth configurations that are each 50% visible. This easily results in erroneous mouth representations such as frames displaying "double" lips, the visibility of teeth together with a closed mouth, etc.

A first step in the concatenation procedure is the spatial alignment of the two video segments. To this end, the pixels in each frame of the second video segment are translated in such a way that the speaker's mouth in the first frame of the second segment is aligned with the mouth in the last frame of the first video segment. To determine the translation parameters, an alignment centre is calculated for both frames from the facial landmark positions. The translation is then defined by the vector that connects these two alignment centres.

Next, image morphing techniques are used to smooth the transition between the two aligned video segments. Image morphing is a widely used technique for creating a transformation between two arbitrary images [Wolberg, 1998]. It consists of a combination of a stepwise image warp and a stepwise cross-dissolve. To perform an image morph, the correspondence between the two input images has to be denoted by means of pairs of feature primitives. A common approach is to define a mesh as feature primitive for both input images (so-called mesh-warping) [Wolberg, 1990]. A careful definition of such meshes has been proven to result in a high-quality metamorphosis; however, the construction of these meshes is not always straightforward and often very time-consuming. Fortunately, when the morphing technique is applied to smooth the visual concatenations in the AVTTS system, every image given as input to the morph algorithm is a frame from the speech database. This means that for each morph input an appropriate mesh can be automatically defined by using the frame's facial landmark positions as mesh intersections (see figure 3.18). Since these landmarks indicate the important visual articulators, the resulting meshes adequately describe feature primitives for morphing.
This way, for every concatenation the appropriate new frames (typically 2 or 4) that realize the transition of the mouth region from the first video segment toward the second video segment can be generated, as illustrated in figure 3.18. Note that some segments that are selected from the database will be fairly short and will contain only a few video frames. When such a short segment has been added to the output video signal, the concatenation of the next database segments to the output frame sequence can entail an interpolation of a frame that was already interpolated during a previous concatenation. This way, the concatenation smoothing is likely to smooth the short segments in such a way that they become "invisible" in the output visual speech. This is necessary to avoid over-articulation effects in the synthetic visual speech information.

Figure 3.18: Example of the video concatenation technique using the "AVBS" database. The two frames shown in the middle of the lower panel were generated by image morphing and replace the segments' original boundary frames in order to ensure the signal continuity during the transition from the first segment to the second segment. The frame on the left and the frame on the right of the lower panel were used as input for the morph calculations. A detail of the landmark data and the morph meshes derived from these landmarks is shown in the top panel.

3.6 Evaluation of the audiovisual speech synthesis strategy

This section describes the experiments that were conducted in order to evaluate the proposed single-phase AVTTS synthesis approach. It is especially interesting to assess the influence of the joint auditory/visual segment selection on the perception of the synthesizer output. The quality assessment of audiovisual speech includes various aspects such as speech intelligibility, naturalness, and acceptance ratio measures. All these aspects can be individually evaluated for the auditory mode and for the visual mode, or the multiplexed audiovisual signal can be evaluated as a whole. Particularly the quality of the multiplexed audiovisual speech is important, since it is this signal that will be presented to a user in a possible future application of the speech synthesis system. The major benefit of the proposed single-phase audiovisual unit selection synthesis is the fact that the synthetic speech shows original combinations of auditory and visual information. This way a maximal audiovisual coherence in the synthetic speech is attainable, but on the other hand the multimodal selection strategy reduces the flexibility to optimize each speech mode individually in comparison with a separate synthesis of the auditory and the visual speech signals. Therefore, it should be investigated whether an enhanced audiovisual coherence indeed positively influences the perception of the synthetic audiovisual speech. If so, the reduced flexibility of the single-phase approach would be justified.

3.6.1 Single-phase and two-phase synthesis approaches

To evaluate the proposed single-phase concatenative AVTTS approach, the synthesis techniques described earlier in this chapter were used to synthesize novel English audiovisual sentences from text input. For comparison purposes, a corresponding two-phase AVTTS synthesis strategy was developed.
In this strategy, a unimodal auditory TTS synthesis is performed in a first stage, producing a synthetic auditory speech signal that matches the target text. Then, in a second synthesis stage a synthetic visual speech signal is generated using the synthesis techniques described earlier in this chapter, but for which only visual selection costs are taken into account (see equation 3.4). When both synthetic speech modes have been generated, each phoneme in the synthetic auditory signal is uniformly time-scaled using WSOLA [Verhelst and Roelands, 1993] to match the duration of the corresponding segment in the synthetic visual speech. After this synchronization step the two separately synthesized signals are multiplexed to create the final synthetic audiovisual speech. When time-scaling the phonemes in the synthetic auditory speech, it has to be ensured that the time-scale factors are sufficiently close to one (= no scaling) in order to avoid a degradation of the speech quality. To this end, during the second synthesis stage in which the visual speech is synthesized, an additional target cost C_dur(t_i, u_i) is applied that measures the difference in duration between each candidate speech segment u_i matching target t_i and the duration of the corresponding auditory speech segment that was selected for target t_i during the auditory synthesis stage. A low value is assigned to C_dur when these durations are much alike, since the selection of that candidate segment would afterwards require only a minor time-scaling in the synchronization stage.

3.6.2 Evaluation of the audiovisual coherence

A first subjective experiment was designed to measure the degree to which audiovisual mismatches between the two modes of a synthetic audiovisual speech signal are detected by human observers. Such mismatches can be classified as synchrony issues, caused by an inaccurate synchronization of the two information streams, or as incoherence issues, which are due to the different origins, or the unimodal processing, of the auditory and the visual speech information that is shown simultaneously to the observer. In theory, every synthesis approach should be able to minimize the number of audiovisual synchrony issues. In the proposed single-phase audiovisual concatenative synthesis this is achieved by positioning the boundaries of the auditory and the visual speech segments that are copied from the database such that in the synthetic audiovisual speech the visual speech information always leads the auditory speech information by a time extent between zero and one video frame duration (see section 3.5.2). In the two-phase synthesis strategy described in section 3.6.1 the audiovisual asynchrony is kept minimal by accurately time-scaling the synthetic auditory speech mode to match the timings of the synthetic visual speech mode. On the other hand, the number of audiovisual incoherences that are likely to occur in the synthetic audiovisual speech is dependent on the chosen synthesis approach. Such incoherences are minimized by the joint audio/video selection in the single-phase strategy. This cannot be achieved when both output speech modes are synthesized separately. This means that a subjective evaluation of the single-phase synthesis strategy should only assess the extent to which the participants notice audiovisual incoherences in the presented audiovisual speech.
Note, however, that while perceiving a continuous speech signal it is very hard for an observer to distinguish between audiovisual incoherence issues and audiovisual asynchronies. Therefore, a more general question was assessed in the experiment by evaluating to which extent the participants found the two synthetic speech modes to be consistent. The participants were asked to take both the level of audiovisual synchrony and the level of audiovisual coherence into account. This is because it is likely that some incoherences in the audiovisual speech will be perceived as synchrony issues by the test subjects.

3.6.2.1 Method and subjects

Medium-length audiovisual English sentences were displayed to the test subjects, who were asked to rate the overall level of consistence between the presented auditory and visual speech modes. It was stressed that they should only rate the audiovisual consistence, and not, for instance, the smoothness or the naturalness of the speech. The subjects were asked to use a 5-point Mean Opinion Score (MOS) scale [1,5], with rating 5 meaning "perfect consistence" and rating 1 meaning "heavily distorted consistence". There was no time limit and the participants could play and replay each sample any time they wanted. The samples were presented on a standard LCD screen, placed at normal working distance from the viewers. The video signals had a resolution of 532x550 pixels at 50 frames per second and they were displayed at 100% size. The acoustic signal was played through high-quality headphones using flat equalizer settings. Eleven subjects (8 male and 3 female) participated in this test, seven of whom were experienced in speech processing. Six of the subjects were aged between 20 and 30 years; the other subjects were between 35 and 57 years of age. None of them were native English speakers, but it was ensured that all participants had a good command of the English language.

3.6.2.2 Test strategies

Four types of speech samples were used in this evaluation (see table 3.2), each sample containing a single English sentence extracted from the text transcript of the LIPS2008 speech database. The first group, called "ORI" ("original"), contained original audiovisual speech samples from the LIPS2008 database. A second group of samples, called "MUL" ("multimodal"), was synthesized using the proposed single-phase joint audio/video unit selection synthesis. To synthesize a sentence, the AVTTS system was provided with the LIPS2008 audiovisual database from which the particular sentence that had to be synthesized was excluded each time. The third group of test samples, called "SAV" ("separate audio/video"), was created by synthesizing the auditory and the visual speech mode separately using the two-phase synthesis approach described in section 3.6.1. Both the auditory and the visual speech mode were synthesized using the LIPS2008 database. The only difference between the two synthesis stages was that for the auditory synthesis only auditory selection costs were used and for the visual speech synthesis only visual selection costs were used. A fourth group of samples, referred to as "SVO" ("switch voice"), was also created by the two-phase AVTTS approach, but a different TTS system was used in each synthesis stage. The auditory mode was synthesized using the laboratory's auditory TTS system [Latacz et al., 2008] provided with the CMU ARCTIC database of an English female speaker [Kominek and Black, 2004].
This database is commonly used in TTS research and its length of 52 minutes of continuous speech allows higher-quality acoustic synthesis compared to the LIPS2008 database. The visual mode of the SVO samples was synthesized in the same way as the visual mode of the SAV samples, by using the LIPS2008 database. Note that the audiovisual synthesis strategy that was used to generate the SVO samples is similar to most other AVTTS approaches found in the literature, in which two different systems and databases are used to create the auditory and the visual mode of the synthetic audiovisual speech. All samples, including the files from group ORI, were (re-)coded using the Xvid codec [Xvid, 2013] with fixed quality settings in order to attain a homogeneous image quality among all samples. Note that all files were created fully automatically and no manual correction was involved in any of the synthesis or synchronization steps.

Table 3.2: Test strategies for the audiovisual consistence test.
ORI - Origin A: original LIPS2008 audio; Origin V: original LIPS2008 video; Description: original AV signal
MUL - Origin A: audiovisual unit selection on LIPS2008 db; Origin V: audiovisual unit selection on LIPS2008 db; Description: concatenated original AV combinations
SAV - Origin A: auditory unit selection on LIPS2008 db; Origin V: visual unit selection on LIPS2008 db; Description: separate A/V synthesis using the same db
SVO - Origin A: auditory unit selection on ARCTIC db; Origin V: visual unit selection on LIPS2008 db; Description: separate A/V synthesis using different dbs

3.6.2.3 Samples and results

Fifteen sample sentences with a mean word count of 15.8 words were randomly selected from the LIPS2008 database transcript and were synthesized for each of the groups ORI, MUL, SAV & SVO. Each participant was shown a subset containing 20 samples (5 sentences, each synthesized using the four different techniques). While distributing the sample sentences among the participants, each sentence was used as many times as possible. The order in which the various versions of a sentence were shown to the participants was randomized. Figure 3.19 summarizes the test results obtained. A Friedman test indicated significant differences among the answers reported for each test group (χ²(3) = 117 ; p < 0.001). An analysis using Wilcoxon signed-rank tests indicated that all differences among the test groups were significant (p < 0.001), except for the difference between the MUL and the SAV group (Z = −0.701 ; p = 0.483). Further analysis of the test results, using Mann-Whitney U test statistics, showed no difference between the overall ratings of the speech technology experts and the ratings given by the non-experts (Z = −0.505 ; p = 0.614). No significant difference was found between the ratings given by the male and the female participants (Z = −0.695 ; p = 0.487). Some participants consistently reported higher ratings compared to other participants, although this difference was not found to be significant by a Kruskal-Wallis test (χ²(10) = 16.0 ; p = 0.099). This might have been prevented by showing the participants some training samples indicating a "good" and a "bad" sample.

3.6.2.4 Discussion

For each group of samples an estimation of the actual audiovisual consistence/coherence can be made. For group ORI, a perfect coherence is expected since these samples are original audiovisual speech recordings.
Samples from group MUL are composed of concatenated original combinations of audio and video. Therefore, at the time instances between the concatenation points they exhibit the original coherence as found in the database recordings. Only at the join positions is an exact calculation of the audiovisual coherence impossible, since at these time instants the signal consists of an interpolated auditory speech signal accompanied by an interpolated visual speech signal. For the SAV and the SVO samples, almost perfect audiovisual synchrony should be attained by the synchronization step during synthesis; however, audiovisual incoherences are likely to occur since non-original combinations of auditory and visual speech are presented.

Figure 3.19: Box plot showing the results obtained for the audiovisual consistence test.

The results of the experiment show that the perceived audiovisual consistence does differ between the groups. From the significant difference between the ratings for group ORI and group MUL it appears that it is hard for a human observer to judge only the audiovisual coherence aspect without being influenced by the overall smoothness and naturalness of the speech modes themselves. Also, the perception of the audiovisual consistence of the MUL samples could be affected by the moderate loss of multimodal coherence at the join positions. Between groups MUL and SAV no significant difference was found. In order to explain this result, the selected segments from the LIPS2008 database that were used to construct both speech modes of each sample from the SAV group were compared. It appeared that for many sentences more than 70% of the selected segments were identical for both speech modes. The reason for this is that for both the auditory and the visual synthesis phoneme-based speech labels were used (instead of visemes), together with the fact that the LIPS2008 database only contains about 25 minutes of original speech. The use of such a small database implies that most of the time only a few candidate segments matching a longer target (syllable, word, etc.) are available. Because of this, very often the same original segment gets selected for both the acoustic and the visual synthesis, regardless of the configuration of the selection costs (the selection of long segments is favoured since these add a zero join cost to the global selection cost). It was calculated that for the SAV samples, on average, around 50% of the video frames are accompanied by the original matching audio from the database. This could explain why the SAV group scored almost as well as the MUL group in the subjective assessment. This result also indicates that the synchronization step in the two-phase synthesis approach is indeed able to appropriately synchronize the two speech modes that were synthesized separately. Keeping this in mind, it is remarkable that significantly better ratings were found for the MUL group in comparison with the SVO group. Since it can be safely assumed that the SVO samples contain two speech modes that have been appropriately synchronized, the reason for their degraded ratings has to be found in the fact that these samples are completely composed of non-original combinations of auditory and visual speech information, which apparently resulted in noticeable audiovisual mismatches.
This can be understood by the fact that the displayed auditory and visual speech information result from different repetitions of the same phoneme sequence by two distinct speakers exhibiting different speaking accents.

3.6.3 Evaluation of the perceived naturalness

From the previous experiment it can be concluded that the single-phase AVTTS synthesis approach reduces the number of noticeable mismatches between the two synthetic speech modes. Subsequently, a new experiment has to investigate how this influences the perceived quality of the synthetic speech signals. In order to generate highly natural audiovisual speech, two aspects need to be optimized. First, both the auditory and the visual speech mode must individually exhibit high-quality speech that closely resembles original speech signals. In addition, it is necessary that a human observer feels very familiar with the synchronous observation of these two information streams. A possible test scenario would be to present the test subjects with audiovisual speech fragments, synthesized using both single-phase and two-phase synthesis strategies, and to assess the overall perceived level of naturalness of the audiovisual speech. However, such ratings would show a lot of variability since each test subject would rate the samples following his/her own personal feeling of which aspect is the most important: the individual quality of the auditory/visual speech or the naturalness of the audiovisual observation of these signals. Also, it has to be taken into account that the limited size of the LIPS2008 database does not allow high quality auditory speech synthesis. Because of this, it is likely that the overall level of naturalness would be rated rather low, which makes it more difficult to draw important conclusions from the test results obtained.

On the other hand, the main goal of the experiment is to evaluate whether the reduced flexibility of the single-phase synthesis approach to optimize the individual synthetic speech modes is justified by the benefits of the increased audiovisual coherence between the two synthetic speech modes. Therefore, a test scenario can be developed to directly evaluate the effect of the level of audiovisual coherence on the perceived naturalness of the synthetic speech. An ideal scenario would be to evaluate the perceived naturalness of multiple groups of audiovisual speech samples designed such that the level of audiovisual coherence varies among the groups while the individual quality of both the synthetic auditory and the synthetic visual speech mode is the same for all groups. Unfortunately, it is not clear how such samples can be realized in practice. Therefore, an alternative test scenario was used in which several types of audiovisual speech signals were created using various concatenative synthesis strategies. It was ensured that the individual quality of the visual speech mode was the same for all groups. During the subjective assessment the perceived naturalness of this visual speech mode was evaluated. This makes it possible to determine the influence of the audiovisual presentation of a unimodal speech signal on its subjective quality assessment. Both the impact of the degree of audiovisual coherence and the impact of the individual quality of the corresponding speech mode can be evaluated.
3.6.3.1 Method and subjects

The participants were asked to rate the naturalness of the mouth movements displayed in the audiovisual speech fragments. It was stressed that for this experiment they should only rate the visual speech mode. A 5-point MOS scale [1,5] was used, with rating 5 meaning that the mouth variations are as smooth and as correct as original visual speech and rating 1 meaning that the movements considerably differ from the expected visual speech. The same subjects who participated in the experiment described in section 3.6.2 contributed to this test. The setup of the subjective evaluation procedure was the same as described in section 3.6.2.1.

3.6.3.2 Test strategies

Five different types of samples were generated for this experiment, as summarized in table 3.3. Four sample types (ORI, MUL, SAV and SVO) were similar to the samples used in the previous experiment (see section 3.6.2.2). Since for this experiment the quality of the synthetic visual speech mode has to be equal for all sample types, the samples were created in such a way that for each group the same original visual speech segments were used to construct the synthetic visual speech mode. The samples from the MUL group were synthesized using the single-phase AVTTS approach for which only visual selection costs were applied. The extra target cost Cdur (see section 3.6.1) was included, for which for each sentence the original timings from its version in the LIPS2008 database were used as reference. Next, similar to the previous experiment, auditory speech signals were generated using the LIPS2008 and the ARCTIC databases to create the auditory speech mode of the SAV and SVO samples, respectively. Audiovisual synchronization was obtained by time-scaling these acoustic signals using WSOLA. In addition, a fifth group of samples, referred to as "RES" ("resynth"), was added. For these samples, the same visual speech mode as applied in the MUL, SVO and SAV samples was used. The auditory mode consisted of original auditory speech from the LIPS2008 database, synchronized with the corresponding visual speech signals using WSOLA. Note that the approach that was used to create the RES samples is a common visual speech synthesis approach when a novel visual speech signal needs to be generated to accompany an already existing auditory speech signal (although in that case the visual speech mode is time-scaled instead).

3.6.3.3 Samples and results

Fifteen sample sentences with a mean word count of 15.8 words were randomly selected from the LIPS2008 database transcript and were synthesized for each of the five groups ORI, MUL, SAV, SVO & RES. Each participant was shown a subset containing 20 samples (4 sentences, each synthesized using the 5 different techniques). While distributing the samples among the participants, each sentence was used as many times as possible. The order in which the various versions of a sentence were shown to the participants was randomized. Figure 3.20 summarizes the test results obtained. A Friedman test indicated significant differences among the answers reported for each test group (χ2(4) = 103; p < 0.001). An analysis using Wilcoxon signed-rank tests indicated that the ORI samples were rated significantly better than all other sample groups (p < 0.001). In addition, the SVO samples were rated significantly worse than the samples from groups MUL, SAV and RES (p < 0.005).
No significant differences were found between the ratings for the groups MUL, SAV and RES. Further analysis of the test results, using Mann-Whitney U test statistics, showed no difference between the overall ratings of the speech technology experts and the ratings given by the non-experts (Z = −1.58; p = 0.114). No significant difference was found between the ratings given by the male and the female participants (Z = −1.48; p = 0.138). Some participants consistently reported higher ratings than other participants; this difference was found to be significant by a Kruskal-Wallis test (χ2(10) = 23.2; p = 0.010). This could probably have been prevented by first showing the participants some training samples illustrating a "good" and a "bad" sample.

Table 3.3: Test strategies for the naturalness test.

Group | Origin A | Origin V | Description
ORI | Original LIPS2008 audio | Original LIPS2008 video | Original AV signal
MUL | Audiovisual unit selection on LIPS2008 database (video costs) | Audiovisual unit selection on LIPS2008 database (video costs) | Concatenated original AV combinations
SAV | Auditory unit selection on LIPS2008 database (audio costs) | Visual unit selection on LIPS2008 database (video costs) | Separate A/V synthesis using same database
SVO | Auditory unit selection on ARCTIC database | Visual unit selection on LIPS2008 database (video costs) | Separate A/V synthesis using different databases
RES | Original LIPS2008 audio | Visual unit selection on LIPS2008 database (video costs) | Original audio and synthesized video

Figure 3.20: Box plot showing the results obtained for the naturalness test.

3.6.3.4 Discussion

For all but the ORI samples, the visual speech mode was synthesized by reusing the same segments from the LIPS2008 database. This implies that any difference in the perceived quality of the visual speech mode is caused by the properties of the auditory speech that was played along with the visual speech. The results obtained show a clear preference for the MUL samples compared to the SVO samples. Note that the individual quality of the auditory speech mode of the SVO samples is at least as high as the individual quality of the auditory mode of the MUL samples, since the auditory mode of the SVO samples is synthesized using acoustic selection costs and a more extensive speech database. From this it can be concluded that the perceived naturalness of the visual speech mode of the SVO samples was degraded by the non-original combinations of auditory/visual speech information that were used to construct these samples. On the other hand, only a small decrease in perceived naturalness can be noticed between the MUL and the SAV samples (Wilcoxon signed-rank analysis; Z = −1.78; p = 0.076). As explained earlier in section 3.6.2.4, the samples of these two groups are probably too similar to lead to important perception differences. However, it is interesting to see that the more appropriate synthesis set-up used to create the auditory mode of the SAV samples (acoustic costs instead of visual costs) certainly did not improve the test results obtained for this group. The test results also contain slightly higher ratings for the MUL samples compared to the results obtained for the RES group. This difference was not found to be significant, but a trend is noticeable (Wilcoxon signed-rank analysis; Z = −1.90; p = 0.058).
Since the auditory mode of the RES samples consists of original auditory speech signals, its quality is much higher than the quality of the auditory mode of the MUL samples. Since the MUL samples scored at least as well as (and even slightly better than) the RES samples, it can be concluded that for a high-quality perception of the visual speech mode, a high level of audiovisual coherence is at least as important as the individual quality of the accompanying auditory speech. In addition, the comparison between the MUL and the SVO samples showed that the perceived quality of a synthetic speech mode can be strongly affected when it is presented audiovisually with a less consistent accompanying speech mode.

3.6.4 Conclusions

This thesis proposes a single-phase AVTTS approach that is able to maximize the audiovisual coherence in the synthetic speech signal. Two experiments were conducted in order to assess the benefits of this joint audio/video segment selection strategy. A first test measured the perceived audiovisual consistence resulting from different synthesis strategies. It showed that human observers tend to underestimate this coherence when the displayed speech signals are synthetic and clearly distinguishable from original speech. Perhaps this is due to a moderate loss of coherence around the concatenation points. On the other hand, the highest level of audiovisual consistence is perceived when the speech is synthesized using the single-phase audiovisual concatenative synthesis. The more standard approach, in which both synthetic speech modes are synthesized separately, was found to easily result in a degraded perceived audiovisual consistence. A second experiment investigated how the perceived quality of a (synthetic) visual speech mode can be affected by the audiovisual presentation of the speech signal. The results obtained showed that this quality can be seriously degraded when the consistence between the two presented speech modes is reduced. In addition, it was found that the influence of the individual quality of the accompanying auditory speech mode seems to be only of secondary importance.

The standard two-phase synthesis approach in which the auditory and the visual speech modes are synthesized separately (generally using different databases and different synthesis techniques) is likely to cause audiovisual mismatches that cannot be prevented by an accurate synchronization of the two synthetic speech signals, since they are due to the fact that the two information streams originate from different repetitions of the same phoneme sequence (usually by two distinct speakers that are likely to exhibit different speaking accents). The experiments described in this section indicate that these mismatches reduce the perceived audiovisual coherence and that they are likely to degrade the perceived naturalness of the synthetic speech modes. From this, it can be concluded that a major requirement for an audiovisual speech synthesis system is to maximize the level of coherence between the two synthetic speech modes.
The speech synthesizer obviously also has to optimize the individual quality of both synthetic speech modes, but it has to be ensured that every optimization technique that increases these individual qualities does not affect the level of audiovisual coherence in the audiovisual output speech. Otherwise, it is likely that the benefits gained from the optimization technique are cancelled out by the audiovisual way of presenting the synthetic speech modes. The experiments described in this section encourage further investigation of the single-phase audiovisual segment selection technique, since this approach is indeed able to maximize the coherence between both synthetic speech modes. On the other hand, at this point the attainable synthesis quality of both speech modes is still too low compared to original speech recordings. Therefore, the AVTTS synthesis strategy will have to be extended to improve the individual quality of the synthetic auditory and the synthetic visual speech, while ensuring that the level of coherence between these two output speech modes is minimally affected.

3.7 Audiovisual optimal coupling

Section 3.5 described that the concatenation of the audiovisual speech segments that are selected from the database requires two join actions: one in the auditory mode and one in the visual mode. It was explained that the auditory signals are concatenated using a pitch-synchronous cross-fade that preserves the periodicity of voiced speech sounds in the concatenated signal. The visual modes of the selected segments are smoothly joined by generating interpolation frames using an image morphing technique. In the previous section it was concluded that any optimization of the audiovisual synthesis strategy intended to enhance the individual quality of the synthetic auditory or visual mode should be designed so as not to affect the coherence between these two speech modes. This section elaborates on an optimization of the single-phase AVTTS approach that enhances the individual quality of the synthetic visual speech mode. Unfortunately, the optimization technique also decreases the coherence between both synthetic speech modes, which means that a trade-off will have to be made.

3.7.1 Concatenation optimization

Section 3.5.2 explained that the AVTTS system separately optimizes each acoustic concatenation by positioning the exact join position at a time instant that coincides with a pitch-marker. Around the theoretical concatenation point (the centre of the first and the second overlapping phone, respectively), an optimal pair of pitch-markers (one marker in the first phone and one marker in the second phone) is found by minimizing a spectral distance measure. This way, the transition takes place between two signals that are maximally similar at the concatenation point, which improves the smoothness of the resulting concatenated auditory speech signal. A similar technique could be employed to enhance the concatenation quality of the visual modes of the selected database segments as well. This would require for each concatenation a separate optimization of the join position in the visual speech mode. Such an optimization is, for instance, also applied in the "Video Rewrite" system [Bregler et al., 1997]. For this purpose, three different approaches were developed, each discussed in detail in the remainder of this section.
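To make the acoustic join optimization above concrete, the sketch below searches, around the theoretical concatenation point, for the pair of pitch-markers whose surrounding speech frames are most similar. The plain log-magnitude spectral distance, the fixed window length and the example marker positions are illustrative assumptions; the actual system may use a different spectral measure.

```python
# A minimal sketch of selecting the optimal pitch-marker pair for an acoustic join.
import numpy as np

def spectral_distance(frame_a, frame_b):
    """Euclidean distance between the log-magnitude spectra of two speech frames."""
    window = np.hanning(len(frame_a))
    spec_a = np.log(np.abs(np.fft.rfft(frame_a * window)) + 1e-9)
    spec_b = np.log(np.abs(np.fft.rfft(frame_b * window)) + 1e-9)
    return np.linalg.norm(spec_a - spec_b)

def optimal_pitch_marker_pair(sig_a, marks_a, sig_b, marks_b, win=256):
    """Return the pitch-marker pair (one per phone) whose surrounding frames are
    maximally similar, so that the pitch-synchronous cross-fade is smoothest."""
    best_pair, best_dist = None, np.inf
    for ma in marks_a:                               # candidate markers, first phone
        frame_a = sig_a[ma:ma + win]
        if len(frame_a) < win:
            continue
        for mb in marks_b:                           # candidate markers, second phone
            frame_b = sig_b[mb:mb + win]
            if len(frame_b) < win:
                continue
            dist = spectral_distance(frame_a, frame_b)
            if dist < best_dist:
                best_pair, best_dist = (ma, mb), dist
    return best_pair

# Example with synthetic signals and hypothetical pitch-marker positions
rng = np.random.default_rng(0)
a, b = rng.normal(size=16000), rng.normal(size=16000)
print(optimal_pitch_marker_pair(a, [7800, 7950, 8100], b, [7900, 8050, 8200]))
```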
3.7.1.1 Maximal coherence

The standard approach for determining for each concatenation the exact join position in the visual mode was already briefly described in section 3.5.2: the synthesizer tries to minimize the asynchrony between the concatenated auditory and the concatenated visual speech information as much as possible (see figure 3.21). Since the sample rate of the auditory speech is much higher than the sample rate of the visual speech, it is impossible to ensure that the join position in the visual mode perfectly coincides with the optimized join position in the auditory mode. Because of this, for each selected database segment the exact length of its auditory speech signal will be different from the length of its accompanying visual speech signal. For a database segment i, the difference between the length of its acoustic signal L_audio(i) and the length of its video signal L_video(i) can be written as

∆L(i) = L_audio(i) − L_video(i)    (3.14)

The auditory and the visual speech information that correspond to a particular audiovisual segment are copied from the original recordings contained in the database. The database time instant at which the extraction of the acoustic speech information corresponding to segment i starts can be denoted as t_start^audio(i). Similarly, the time instant at which the extraction of the visual information starts can be written as t_start^video(i). In general, due to the difference in sample rate between the acoustic and the visual signal, t_start^video(i) will be different from t_start^audio(i). This means that after the audiovisual segment i is added to the concatenated audiovisual speech signal, the speech modes of segment i are shifted with respect to each other by a value ∆t_start(i):

∆t_start(i) = t_start^audio(i) − t_start^video(i)    (3.15)

This can easily be understood if segment i is the first segment from the sequence that constructs the output synthetic speech. If segment i is not the first segment from this sequence, both its speech modes are added to an audiovisual signal of which the length of its acoustic signal and the length of its visual signal are dissimilar due to the earlier concatenations. This means that the total shift async(i) between the original auditory and the original visual speech information corresponding to segment i in the final concatenated speech is given by

async(i) = Σ_{n=1}^{i−1} ∆L(n) − ∆t_start(i)    (3.16)

Note that in equation 3.16 a positive value of async(i) means that the visual speech information leads the auditory speech information. Obviously, these calculations assume that both speech modes of the original database recordings are correctly synchronized. Equation 3.16 indicates that the level of audiovisual synchrony in the final concatenated speech signal changes value after each concatenation point. In addition, it shows that the audiovisual asynchrony of segment i after concatenation is caused by two independent terms, determined by the properties of the previous segments (1, ..., i − 1) and the current segment i, respectively. This means that the value of async(i) can be confined to reasonable limits by selecting for each segment i a video frame as join position t_start^video(i) that lies in the vicinity of the auditory join position t_start^audio(i) and that maximally cancels the asynchrony caused by the difference between the lengths of the speech modes of the already concatenated audiovisual speech signal (i.e., the term Σ_{n=1}^{i−1} ∆L(n) in equation 3.16).
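A minimal sketch of this asynchrony bookkeeping is given below: it evaluates equations 3.15 and 3.16 for a set of candidate video frame boundaries near the auditory join position and keeps the one that leaves the smallest residual asynchrony. The frame boundaries and the numeric values in the example are illustrative assumptions.

```python
# Sketch of the asynchrony bookkeeping of equations 3.15-3.16 and of the
# coherence-preserving choice of the visual join position. `prev_delta_l_sum` is
# the accumulated sum of eq. 3.14 terms for the already concatenated segments.
def choose_video_join(prev_delta_l_sum, t_audio_start, candidate_video_starts):
    """Pick the video extraction start time (a frame boundary near the auditory
    join position) that keeps the accumulated asynchrony async(i) minimal."""
    best_start, best_async = None, None
    for t_video_start in candidate_video_starts:
        dt_start = t_audio_start - t_video_start      # eq. 3.15
        async_i = prev_delta_l_sum - dt_start         # eq. 3.16
        if best_async is None or abs(async_i) < abs(best_async):
            best_start, best_async = t_video_start, async_i
    return best_start, best_async

# Example: the already concatenated signal has accumulated a +12 ms audio/video
# length difference; the next segment's audio is extracted from t = 3.137 s and
# three nearby video frame boundaries (25 fps assumed) are considered as joins.
start, async_i = choose_video_join(0.012, 3.137, [3.08, 3.12, 3.16])
print(f"video join at t = {start:.2f} s, resulting async(i) = {async_i * 1000:+.1f} ms")
```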
Since it is generally assumed that human observers are more sensitive to a lead of the auditory speech information in front of the visual speech information than to a lead of the visual information in front of the acoustic information [Summerfield, 1992] [Grant et al., 2004], the "safest" concatenation strategy is the one that maximizes the audiovisual coherence in the concatenated speech signal by selecting for each visual concatenation a video frame as join position that ensures that

0 ≤ async(i) < 1 / Fs_video    (3.17)

with Fs_video the sample rate (frame rate) of the video signal.

3.7.1.2 Maximal smoothness

Similar to the optimal coupling technique that is applied for optimizing the acoustic concatenations, the smoothness of the concatenated visual speech can be enhanced by fine-tuning for each video concatenation the exact join position in order to maximize the similarity between the two video frames at which the concatenation takes place. In a first stage, for both phones that need to be joined some of the video frames in the vicinity of the corresponding auditory join position are selected as candidate "join-frames". Then, in a second stage two final frames are selected (one from each set of candidate join-frames) that minimize a visual distance measure (see figure 3.21). This visual distance is calculated in a similar way as the total visual join cost value (using the difference in teeth, mouth-cavity and PCA properties). Unfortunately, this optimization technique increases the possible values of async(i), since in this case the visual join position is not chosen to minimize equation 3.16 (see figure 3.22). The visual optimal coupling strategy is controlled by three parameters: the maximal allowed local audio lead (the minimal value of async(i)), the maximal allowed local video lead (the maximal value of async(i)), and a search-length parameter that defines the number of video frames in each phone that are considered as candidate join-frames. The search-length parameter influences both terms of equation 3.16: it determines the maximal audiovisual asynchrony in a segment caused by the difference between the time instants at which its two speech modes are extracted from the database (equation 3.15), and it also determines to which extent the length of the auditory and the visual mode of the already concatenated speech signal can be altered (this can be seen in figure 3.22). Since the value of async(i) is confined between its maximal and its minimal limit, most of the time the set of candidate join-frames that is selected for each of the two phones will not be centered around the auditory join position. This is due to the fact that the maximal allowed value of async(i) can be chosen larger in magnitude than the minimal allowed value, because of the asymmetric human sensitivity to audiovisual asynchrony.

3.7.1.3 Maximal synchrony

Where the approaches described in section 3.7.1.1 and section 3.7.1.2 maximize the audiovisual coherence and the audiovisual smoothness, respectively, an in-between approach exists that is able to enhance the smoothness of the synthetic visual speech without introducing extra audiovisual asynchronies in the concatenated speech segments. The first stage of this in-between strategy is similar to the "maximal smoothness" approach, since for both phones that need to be joined a set of candidate join-frames is selected around the corresponding join position in the auditory mode.
In the second stage, from both sets of candidate join-frames a final frame is selected by minimizing the visual join cost for this particular concatenation. In contrast with the "maximal smoothness" approach, in this case only those pairs of join-frames are considered that do not add an extra audiovisual asynchrony to the database segment. This is possible when for a segment i the visual join position is chosen such that the contribution of the term ∆t_start(i) to async(i) is cancelled by the modification of the term Σ_{n=1}^{i−1} ∆L(n), i.e., by the alteration of the length of the auditory and the visual mode of the already concatenated audiovisual speech signal (this can be seen in figure 3.21). This approach evaluates for each concatenation fewer combinations of join-frames compared to the "maximal smoothness" approach and thus offers less freedom in optimizing the smoothness of the visual concatenations. Only one parameter adjusts the optimal coupling technique: the search-length determining the number of candidate join-frames that are selected in each phone. An important observation is that even when a large search-length is applied, no extra audiovisual asynchrony is introduced; however, at the join positions non-original combinations of auditory and visual speech information are created (as illustrated in figure 3.22). This means that the proposed optimization technique is likely to degrade the overall level of audiovisual coherence in the concatenated audiovisual speech signal.

3.7.2 Perception of non-uniform audiovisual asynchrony

In order to obtain appropriate parameter settings for the proposed optimal coupling techniques, the maximal allowed level of local audiovisual asynchrony must be determined. Literature on the effects of a uniform audiovisual asynchrony on the human perception of audiovisual speech signals mentions −50 ms and +200 ms as tolerable bounds within which the asynchrony level is not noticed by an observer [Grant et al., 2004]. On the other hand, in the proposed audiovisual synthesis the level of audiovisual synchrony in the concatenated audiovisual speech is not constant (it changes after each concatenation point). Since no exact tolerance for this particular type of audiovisual asynchrony could be found in the literature, a subjective perception test was conducted from which appropriate parameter settings for the optimal coupling approaches can be inferred. To this end, it was investigated to which extent local audiovisual asynchronies can be introduced in an audiovisual speech signal without being noticed by a human observer. The speech samples were generated by resynthesizing sentences from the LIPS2008 database using the AVTTS system and the "maximal smoothness" optimal coupling approach (the speech data corresponding to the target original sentence was excluded from selection). Equation 3.16 was used to calculate the occurring levels of audiovisual asynchrony in each synthesized sentence. An appropriate subset of samples was collected from the synthesis results, covering the target range of maximal/minimal local asynchronies that needed to be evaluated. For each sample from the selected subset, a second version was synthesized using the "maximal coherence" optimal coupling approach. These new syntheses were used as baseline samples since they exhibit no significant audiovisual asynchrony.
The two versions of each sentence (with/without local audiovisual asynchronies) were shown pairwise to the test subjects, who were asked to report which of the two samples they preferred in terms of synchrony between the two presented speech modes. The participants were instructed to answer "no difference" if no significant difference in audiovisual synchrony between the two samples could be noticed. Seven people participated in the experiment, three of whom were experienced in speech processing. The test results obtained are summarized in table 3.4, in which the test samples are grouped based on the minimal and maximal occurring local asynchrony level. It shows a detection ratio of less than 20% for the samples in which the audio lead is always lower than 0.04 s and the video lead is always lower than 0.2 s. It was opted to use these values as parameter settings for the "maximal smoothness" optimal coupling approach.

Figure 3.21: Three approaches for optimal audiovisual coupling. The two audiovisual signals that need to be joined and the optimized auditory join positions are indicated. The top panel shows the "maximal coherence" method in which the visual join positions are close to the auditory join positions. The middle panel illustrates the "maximal smoothness" approach, in which for both signals a set of candidate join frames A1-A5 and B1-B5 are selected, from which the most optimal pair is calculated by minimizing the visual join cost. The bottom panel illustrates the in-between approach in which only candidate pairs A1-B1, A2-B2, etc. are considered.

Figure 3.22: Resulting signals obtained by the three proposed optimal coupling techniques. The top panel shows that the audiovisual coherence is maximized by the "maximal coherence" approach. The middle panel shows that an extended audiovisual asynchrony can occur by employing the "maximal smoothness" approach. The bottom panel shows that the "maximal synchrony" approach maintains the audiovisual synchrony but introduces some unseen combinations of auditory and visual speech information at the join position (indicated by the arrows).

Table 3.4: Detection of local audiovisual asynchrony (detection ratio per combination of minimal and maximal local asynchrony; negative values denote an audio lead).

Min desync \ Max desync | 0 s | 0.1 s | 0.2 s | 0.4 s
0 s | – | 0% | 15% | 46%
-0.04 s | 0% | 0% | 0% | no samples
-0.08 s | 26% | 10% | 20% | 40%
-0.15 s | 90% | 100% | 60% | 100%

Neither the number of participants in the experiment nor the number of test samples in each group of table 3.4 was large enough to exactly define thresholds for noticing non-uniform audiovisual asynchronies. Nevertheless, the results obtained are sufficient for determining suitable parameter values for the optimal coupling technique. The subjective experiment can also be seen as a preliminary study on the general effects of a time-varying audiovisual asynchrony on audiovisual speech perception. The results obtained indicate that the thresholds for noticing non-uniform audiovisual asynchronies are quite similar to the detection thresholds for uniform audiovisual asynchrony that are mentioned in the literature. It seems to be the case that the duration of an audiovisual asynchrony occurring in an audiovisual speech signal has little influence on its detection by human observers, since the subjective experiment showed that even very short asynchronies (occurring when short segments are selected from the database) were noticed by the test subjects.
This result is in agreement with earlier experiments investigating audiovisual perception effects such as the McGurk effect [McGurk and MacDonald, 1976], in which it was found that even audiovisual mismatches with a duration of only a single phoneme can drastically affect the intelligibility and/or the perceived quality of the audiovisual speech signal.

3.7.3 Objective smoothness assessment

Before evaluating the effects of the proposed audiovisual optimal coupling approaches on the human perception of the concatenated audiovisual speech signals, it was objectively assessed to which extent the three approaches are able to smooth the visual concatenations. To this end, eleven English sentences (mean word count = 15 words) were synthesized using the AVTTS system provided with the LIPS2008 database. For each sentence, six different configurations for the optimization of the audiovisual concatenations were used, as described in table 3.5. For each synthesized sample, the smoothness of the visual speech mode was objectively assessed. To this end, the synthetic visual speech signals were automatically analysed in order to calculate for each video frame a set of facial landmarks indicating the lips of the virtual speaker, and a set of PCA coefficients that models the mouth area of the frame. This metadata was derived in the same way as in the analysis of the original visual speech information contained in the database (see section 3.3.3.4). A smoothness measure was defined as a linear combination of the summed Euclidean distances between the landmark positions and the Euclidean distance between the PCA coefficients, calculated for every two consecutive frames of the synthetic visual speech located at the join positions. A single measure for each synthesized sentence was calculated as the sum of all distance measures that were calculated for the sentence, divided by the number of database segments that were used to construct the sentence.

Table 3.5: Various optimal coupling configurations. (SL = search-length parameter)

Group | Method | SL | Min Async. | Max Async.
I | max coherence | – | – | –
II | max smoothness | 0.20 s | -0.04 s | 0.20 s
III | max smoothness | 0.20 s | -0.05 s | 0.35 s
IV | max synchrony | 0.08 s | – | –
V | max synchrony | 0.20 s | – | –
VI | max synchrony | 0.40 s | – | –

Figure 3.23 shows the objective smoothness levels obtained.

Figure 3.23: Objective smoothness measures for various optimal coupling approaches. A lower value indicates a smoother visual speech signal.

The results show that both the "maximal smoothness" (groups II & III) and the "maximal synchrony" (groups IV, V & VI) strategies resulted in smoother synthetic visual speech signals compared to the "maximal coherence" technique (group I). A statistical analysis using ANOVA with repeated measures and Greenhouse-Geisser correction indicated significant differences among the values obtained for each group (F(2.37, 23.7) = 14.127; p < 0.001). An analysis using paired-sample t-tests indicated that the smoothness of the samples from group I was significantly worse than the smoothness of the samples from the other groups (p ≤ 0.006). On the other hand, as can be noticed from figure 3.23, only a minor improvement of the smoothness of the synthetic visual speech is measured when more extreme audiovisual asynchronies or longer audiovisual incoherences are allowed: no significant difference between the values for groups II-VI was found.
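A minimal sketch of the objective smoothness measure defined above is given below. The array shapes, the join-frame indices and the linear-combination weight are illustrative assumptions, not the exact values used in the experiment.

```python
# Per-segment smoothness measure: landmark and PCA distances at the join frames.
import numpy as np

def sentence_smoothness(landmarks, pca_coeffs, join_frames, n_segments, alpha=1.0):
    """landmarks: (n_frames, n_points, 2) lip landmark positions
    pca_coeffs: (n_frames, n_comp) PCA coefficients of the mouth area
    join_frames: indices f such that frames f and f+1 straddle a concatenation
    Returns the per-segment smoothness value (lower = smoother)."""
    total = 0.0
    for f in join_frames:
        # summed Euclidean distances between corresponding landmark points
        d_shape = np.linalg.norm(landmarks[f + 1] - landmarks[f], axis=1).sum()
        # Euclidean distance between the PCA coefficient vectors
        d_pca = np.linalg.norm(pca_coeffs[f + 1] - pca_coeffs[f])
        total += d_shape + alpha * d_pca
    return total / n_segments

# Example with random data for a 100-frame sentence built from 8 segments
rng = np.random.default_rng(0)
lm = rng.normal(size=(100, 20, 2))
pca = rng.normal(size=(100, 12))
print(sentence_smoothness(lm, pca, join_frames=[14, 31, 47, 60, 72, 85, 93], n_segments=8))
```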
This also means that the "maximal smoothness" approach did not really outperform the "maximal synchrony" optimization approach, despite the fact that the "maximal smoothness" approach allows additional audiovisual asynchronies in order to increase the freedom in optimizing the visual join position.

3.7.4 Subjective evaluation

To assess the effect of the various audiovisual optimal coupling techniques on the perception of the synthetic audiovisual speech, a subjective perception experiment was performed. Groups I, II and V from table 3.5 were selected to represent the three proposed optimal coupling approaches. Eleven standard-length English sentences were synthesized using the LIPS2008 database and the optimal coupling settings of groups I, II and V. For each of these sentences, two sample pairs were shown to the test subjects. One sample pair contained a synthesis from group I and a synthesis from group II, the other sample pair contained the same synthesis from group I and a synthesis from group V. The participants were asked to report which of the two samples of each pair they preferred. They were told to pay attention especially to the smoothness of the mouth movements and to the overall level of audiovisual coherence, but it was up to them to decide which aspect of the audiovisual speech they found the most important for rating the samples. They were informed that the quality of the auditory speech mode was the same for both samples of each comparison pair. If the test subjects had no preference for one of the two samples, they were asked to answer "no difference". Nine people participated in this test (3 female, 6 male, aged 20-56), 6 of whom were experienced in speech processing. The preference scores obtained are summarized in table 3.6.

Table 3.6: Subjective evaluation of the optimal coupling approaches.

Test | Preference | Count
Group I - Group II | Group I > Group II | 38
Group I - Group II | Group I = Group II | 39
Group I - Group II | Group I < Group II | 22
Group I - Group V | Group I > Group V | 19
Group I - Group V | Group I = Group V | 60
Group I - Group V | Group I < Group V | 20

The difference between groups I-II and the difference between groups I-V were analysed using a Wilcoxon signed-rank test. The results obtained indicate that in neither of the two comparisons did the test subjects show a preference for the samples in which the smoothness of the visual speech mode was optimized. Samples from groups I and V were rated equally good, with many answers reporting "no difference" (Z = −0.160; p = 0.873). On the other hand, it appeared that the participants disliked the optimized samples from group II in comparison with the samples from group I (Z = −2.07; p = 0.039). A manual inspection of the answer sheets and feedback from the participants pointed out several explanations for these results. Firstly, it can be noticed that the answers differ heavily among the participants: some subjects tended to generally like or dislike the optimized sample in each pair, while other participants very often reported not noticing any difference between the presented samples. Furthermore, many participants reported that they often did notice that the smoothness of the visual speech mode of one of the two samples had been improved. Unfortunately, in many of these cases the optimized sample exhibited an affected audiovisual coherence, which motivated the test subjects to report a preference for the non-optimized sample of the comparison pair. This explains why the samples from group II were rated worse than the samples from group I, since these samples contain time-varying audiovisual asynchronies. This is quite an unexpected result, since the minimal and maximal asynchrony levels were chosen in the range that was hardly noticed by human observers in the subjective test described in section 3.7.2.
Perhaps these parameters should have been chosen more conservatively; however, lowering these thresholds would leave only little freedom to optimize the visual concatenations. In contrast, the samples from group V were not rated worse than the samples from group I, which means that the introduction of short local audiovisual incoherences at the join positions is not as easily noticed as a time-varying audiovisual asynchrony. On the other hand, the improved smoothness of the visual speech mode of the samples from group V did not manage to increase their subjective quality rating, which means that the benefits of the individual optimization of the visual speech mode were cancelled by the decrease of the level of audiovisual coherence. Some participants indeed reported that the smoothed visual speech appeared "mumbled" compared to the accompanying auditory speech signal, which exhibited stronger articulations.

3.7.5 Conclusions

This section studied the audiovisual concatenation problem by investigating an optimal coupling technique that calculates the appropriate join positions in both the auditory and the visual speech mode. The proposed optimization techniques are designed to smooth the synthetic visual speech by introducing a time-varying audiovisual asynchrony or some local audiovisual incoherences in the concatenated audiovisual speech. Results from a subjective perception experiment indicate that earlier published values for just noticeable audiovisual asynchrony hold in the non-uniform case as well (i.e., they hold for both constant and time-varying audiovisual asynchrony levels). A possible explanation for this resides in the fact that human speech perception is to a great extent based on predictions. Through everyday speech communication, humans learn what is to be considered a "normal" speech signal. Every aspect of a synthetic speech signal that does not conform to these normal speech patterns will be immediately noticed. Since time-varying audiovisual asynchronies do not exist in original speech signals, it can be expected that there exists no temporal window in which humans are less sensitive to the audiovisual asynchrony in multimodal speech perception.

Objective measures showed that the optimization of the visual join positions indeed enhances the smoothness of the synthetic visual speech. However, a subjective experiment assessing the effects of these optimizations on the perception of the concatenated audiovisual speech showed no indication that the observers preferred the smoothed synthesis samples over the samples that were synthesized to contain a maximal coherence between both synthetic speech modes. Quite often the benefit of the optimal coupling approach, i.e., an improved individual quality of the synthetic visual speech mode, was cancelled by a noticeable decrease of the audiovisual coherence in the synthetic audiovisual speech. The proposed audiovisual optimal coupling techniques appear to cause some sort of disturbing under-articulation effect, since some rapid variations in the auditory mode are not seen in the corresponding video mode.
These findings are in line with the results from the experiments described in section 3.6, where it was concluded that a maximal audiovisual coherence is crucial in order to attain a high-quality perception of the synthetic audiovisual speech signal. The avoidance of any mismatch between both synthetic speech modes appears to be at least as important as the individual optimization of one of the two speech modes.

3.8 Summary and conclusions

The great majority of the audiovisual text-to-speech synthesis systems that are described in the literature adopt a two-phase synthesis approach, in which the synthetic auditory and the synthetic visual speech are synthesized separately. The downside of this synthesis strategy is its inability to maximize the level of coherence between the two output speech modes. To overcome this problem, a single-phase audiovisual speech synthesis approach is proposed, in which the synthetic audiovisual speech is generated by concatenating original combinations of auditory and visual speech that are selected from a pre-recorded speech database. Auditory and visual selection costs are used to select original speech segments that match the target speech as closely as possible in both speech modes. To concatenate the selected segments, an advanced join technique is used that smooths the concatenations by generating appropriate intermediate pitch periods and video frames. The proposed single-phase AVTTS approach was subjectively compared with a common two-phase synthesis strategy in which the auditory and the visual speech are synthesized separately using two different databases. These experiments indicated a reduction of the perceived audiovisual speech quality when the level of coherence between the two presented speech modes is lowered. Because of this, the single-phase synthesis results were preferred over the two-phase synthesis results. In order to improve the individual quality of the synthetic visual speech mode, multiple audiovisual optimal coupling techniques were designed and evaluated. These techniques are able to improve the smoothness of the synthetic visual speech signal at the expense of a degraded audiovisual synchrony and/or coherence. However, a subjective evaluation pointed out that the proposed optimization techniques are unable to enhance the perceived audiovisual speech quality. This result again indicates the importance of the level of audiovisual coherence for the perception of the audiovisual speech information.

People are very familiar with perceiving audiovisual speech signals, since this kind of communication is used countless times in their daily life. When perceiving (audiovisual) speech, the observer continuously makes predictions about the received information. This means that even the shortest and smallest errors in the speech communication will be directly noticed. Such local errors have been found to degrade the perceived quality of much longer speech signals in which they occur [Theobald and Matthews, 2012]. This raises serious problems when a natural perception of synthesized speech is aimed for, since it is a major challenge to design a TTS system that is able to generate a completely error-free synthetic speech signal. For audiovisual speech synthesis, the problem becomes even more challenging, since in that case not only the two individual speech modes but also the combination of these two signals should be perceived as error-free.
When perceiving a synthetic audiovisual speech signal of which the intermodal coherence is affected, a typical problem that occurs is that the observers do not believe that the virtual speaker they see in the visual speech mode actually uttered the auditory speech information they hear in the corresponding acoustic speech mode. For instance, this was noticed when evaluating the audiovisual optimal coupling techniques: when the visual speech mode of the test samples was smoothed individually, the visual speech easily appeared "mumbled" in comparison with the more pronounced articulations that were present in the accompanying auditory speech mode. A similar problem is likely to occur when a two-phase synthesis approach is applied. These systems tend to generate a "safe" synthetic visual speech signal that contains for each target phoneme its most typical visual representation (e.g., systems that apply a simple many-to-one mapping from phonemes to visemes (see section 1.3 or chapter 6), rule-based synthesizers that apply the same synthesis rule for each instance of the same phoneme (see section 2.2.6), etc.). This implies that some of the atypical articulations (e.g., very pronounced ones) that are present in the synthetic acoustic speech mode will lack an appropriate visual counterpart in the audiovisual output speech. When evaluating the single-phase synthesis approach it was noticed that this effect holds for non-optimal parts of the synthetic speech as well, since it could be observed that it is preferable that very fast or sudden (non-optimal) articulations occurring in one speech mode have a similar counterpart in the other speech mode as well. Obviously, this is infeasible when the two synthetic speech modes are synthesized separately.

Another downside of a two-phase AVTTS synthesis approach is the fact that a post-synchronization of the two synthetic speech modes is required. As was shown in the perception experiments, this synchronization is feasible by non-uniformly time-stretching the signals in order to align the boundaries of each phoneme with the boundaries of its corresponding viseme. However, since allophones can exhibit a specific kinematics profile, for some parts of the speech signal this synchronization step is likely to affect the speech quality. For instance, the lengthening of an allophone can be due to a decrease in speech rate, pre-boundary lengthening, lexical stress, or emphatic accentuation. When each phone of a speech signal is individually time-stretched, the kinematics of the phones are altered, which can lead to a degraded transmission of the speech information and to a decrease of the level of audiovisual coherence [Bailly et al., 2003].

The results obtained motivate further investigation of the single-phase synthesis approach, since this is the most convenient technique to ensure that the perceived quality of the synthetic audiovisual speech is not affected by audiovisual coherence issues. Unfortunately, the subjective experiments indicated that the attainable synthesis quality of the proposed AVTTS approach is too limited to accurately mimic original audiovisual speech. Therefore, optimizations to the audiovisual synthesis have to be developed that enhance the quality of the synthetic auditory and the synthetic visual speech. From the results obtained in this chapter it is known that it will have to be ensured that these optimizations do not affect the coherence between these two synthetic speech modes.
Note that this thesis will mainly focus on the enhancement of the synthetic visual speech mode, as various strategies for improving the auditory speech quality were developed in the scope of the laboratory's parallel research on auditory text-to-speech synthesis. Some of the techniques, experiments and results mentioned in this chapter have been published in [Mattheyses et al., 2008], [Mattheyses et al., 2009a] and [Mattheyses et al., 2009b].

4 Enhancing the visual synthesis using AAMs

4.1 Introduction and motivation

The results described in the previous chapter offer strong motivation to investigate the single-phase AVTTS synthesis approach further. Unfortunately, the proposed synthesis set-up resulted in synthetic audiovisual speech signals that did not resemble original audiovisual speech closely enough to prevent human observers from distinguishing between the two. Therefore, additional improvements to the synthesis technique are needed. Recall from the previous chapter that it has to be ensured that none of these optimizations significantly affects the audiovisual coherence in the synthetic output speech. There already exists a wide research area that investigates the improvement of auditory speech synthesis. One such research project is conducted in parallel with the research described in this thesis [Latacz et al., 2010] [Latacz et al., 2011]. Therefore, this thesis focuses on various developments to enhance the quality of the synthetic visual speech mode.

A first experiment was conducted to investigate exactly which features caused the observers to distinguish between the synthesized and the original visual speech. Multiple original sentences from the LIPS2008 database were resynthesized, after which for each generated video frame its facial landmarks were tracked, similar to the analysis performed on the original database speech (see section 3.3.3.4). Then, using these landmark points, for each sentence both the original and the synthesized visual speech information was represented using point-light signals, as illustrated in figure 4.1.

Figure 4.1: A point-light visual speech signal.

Such point-light signals display only the kinematics of the visual speech. It has been shown that human observers are quite sensitive to the coherence between an acoustic speech signal and the variation of facial flesh points (indicated here using point-lights) [Bailly et al., 2002]. The synthetic point-light signals were synchronized with the original acoustic speech from the database by aligning the phoneme boundaries in both signals. The time-scaling of the point-light signals was achieved by adding video frames (by calculating intermediate point-light positions) or by removing video frames. Then, the original auditory speech was played synchronously with both the original and the synthetic point-light signals. In an informal perception test it was found that observers performed much worse in distinguishing between the original and the synthesized point-light signals than in distinguishing between original and synthesized "real" visual speech signals. From this observation it can be concluded that the kinematics of the synthetic visual speech signals do a reasonably good job in mimicking the original speech kinematics.
This means that in order to enhance the quality of the synthetic visual speech, a greater effort has to be made to improve the smoothness and the naturalness of the total appearance of the synthetic visual speech (e.g., teeth visibility, tongue visibility, colour and lighting continuity, etc.). In order to improve the quality of the synthetic visual speech, a more detailed analysis of the original visual speech data is needed. Processing this data as a series of static images makes it very hard to differentiate between the analysis and/or synthesis of aspects concerning the speech movements (e.g., lip and jaw movements) and aspects concerning the overall appearance of the mouth area (e.g., visibility of the teeth, colours, shadows, etc.). Such a differentiation would allow the design of techniques that remove the jerkiness from the overall appearance of the virtual speaker, while the amplitude of the displayed speech kinematics is maintained to avoid under-articulation effects that cause a "mumbled" perception of the synthetic visual speech mode (e.g., this was noticed in the subjective assessment of the audiovisual optimal coupling techniques (see section 3.7)).

4.2 Facial image modeling

A convenient technique to analyse the original visual speech recordings is to parameterize each captured video frame. This way, the frame-by-frame variations of the parameter values compose parameter trajectories that describe the visual speech information. Some image parameterizations not only mathematically describe the image data in terms of parameter values, they also allow an image to be reconstructed from a given set of parameter values. This category of parameterizations is often referred to as image modelling techniques. Section 3.3.3.4 already mentioned that the original visual speech was modelled by a PCA calculation. To this end, a PCA analysis was performed on the image data gathered by extracting the mouth area from each original video frame. This analysis determines for each original video frame an associated vector of PCA coefficients. It also calculates a set of so-called "eigenfaces", which can be linearly combined to recreate any particular original video frame (using that frame's PCA coefficients as combination weights). Unfortunately, a PCA analysis parameterizes the visual speech information contained in each video frame "as a whole", since it treats the facial images as standard mathematical matrices containing the (grayscale) pixel values. This means that a PCA-based analysis of the original speech data is unable to differentiate between the parameterization of the kinematics and the parameterization of the appearance of the visual speech.

An extension to PCA for modelling a set of similar images makes use of a so-called 2D Active Appearance Model (AAM) [Edwards et al., 1998b] [Cootes et al., 2001]. Similar to other modelling techniques, an AAM is able to project an image into a model-space, meaning that the original image data can be represented by means of its corresponding model parameter values. In addition, when the AAM has been appropriately trained using hand-labelled ground-truth data, it is possible to generate a new image from a set of unseen model parameter values. In contrast to plain PCA calculations on the pixel values of the image, an AAM models two separate aspects of the image: the shape information and the texture information.
The shape of an image is defined by a set of landmark points that indicate the position of certain objects that are present in each image that is used to build the AAM. The texture of an image is determined by its (RGB) pixel values, which are sampled over triangles defined by the landmark points that denote the shape information of the image. The sampling of these pixel values is performed on the shape-normalized equivalent of the image: before sampling the triangles, a warped version of the image is calculated by aligning its landmark points to the mean shape of the AAM (i.e., the mean value of every landmark point sampled over all training images).

An AAM is built from a set of ground-truth images, of which the shape of each image is hand-labelled by a manual positioning of the appropriate landmark points. The vector containing the landmark positions that correspond to a particular image is called the shape S of that image. In addition, its texture T is defined by the vector containing the pixel values of its shape-normalized equivalent. From all training shapes S_i, the mean shape S_m is calculated and a PCA calculation is performed on the normalized shapes Ŝ_i with

Ŝ_i = S_i − S_m    (4.1)

This PCA calculation returns a set of "eigenshapes" P_s which determine the shape model of the AAM. Likewise, the mean training texture T_m is calculated and a second PCA calculation is performed on the normalized training textures T̂_i with

T̂_i = T_i − T_m    (4.2)

This returns the "eigentextures" P_t which define the texture model of the AAM. After the AAM has been built, any unseen image with shape S and texture T can be projected on the AAM by searching iteratively for the most appropriate model parameters (shape parameters B_s and texture parameters B_t) that reconstruct the original shape and the original texture using the shape model and the texture model, respectively:

S_recon = S_m + P_s B_s,   T_recon = T_m + P_t B_t    (4.3)

Several approaches exist for optimizing the values of B_s and B_t to ensure that

S_recon ≈ S,   T_recon ≈ T    (4.4)

These techniques are beyond the scope of this thesis, and the interested reader is referred to [Cootes et al., 2001]. After projection on the AAM, the original image information has been parameterized by means of the vectors B_s and B_t, as illustrated in figure 4.2. In addition, the trained AAM is capable of calculating, from an unseen set of shape parameters B_s^new and texture parameters B_t^new, a new shape S^new and a new texture T^new by means of equation 4.3. From S^new and T^new a new image can be generated by warping the shape-normalized texture T^new (aligned with the mean shape S_m) towards the new shape S^new.

Figure 4.2: AAM-based image modelling.

For some applications, it is convenient that an image is represented by a single set of model parameters, of which each individual parameter value determines both shape and texture properties. To this end, from the shape model and the texture model of the AAM, a combined AAM is calculated which can be used to transform image data into a vector of so-called "combined AAM" parameter values (and vice versa) [Edwards et al., 1998a].
For some applications, it is convenient that an image is represented by a single set of model parameters, of which each individual parameter value determines both shape and texture properties. To this end, from the shape model and the texture model of the AAM, a combined AAM is calculated which can be used to transform image data into a vector of so-called "combined AAM" parameter values (and vice versa) [Edwards et al., 1998a]. To build this combined model, the shape parameters B_s and the texture parameters B_t of each training image are concatenated to create the vector B_{concat}

B_{concat} = \begin{pmatrix} W_s B_s \\ B_t \end{pmatrix}    (4.5)

W_s is a weighting matrix that corrects the difference in magnitude between B_s (which models landmark coordinate positions) and B_t (which models pixel intensities). W_s scales the variance of the shape parameters of the training images to equal the variance of the texture parameters of the training images. Then, a PCA calculation is performed on the B_{concat} vectors of the training images, resulting in a collection of eigenvectors denoted as Q. Each concatenated vector B_{concat} can be written as a linear combination of these eigenvectors:

B_{concat} = Q c    (4.6)

The vector c describes the combined model parameter values of the image data that was used to construct B_{concat}. Q can be written as

Q = \begin{pmatrix} Q_s \\ Q_t \end{pmatrix}    (4.7)

The original image data can be directly reconstructed from a combined parameter vector c by substituting equations 4.3 and 4.5 in equation 4.6:

\begin{cases} S_{recon} = S_m + P_s W_s^{-1} Q_s c \\ T_{recon} = T_m + P_t Q_t c \end{cases}    (4.8)
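Continuing the previous sketch, the combined model of equations 4.5-4.8 could be computed roughly as follows. For simplicity the weighting W_s is taken here as a single scalar that balances the total shape and texture variance, which is one common simplification of the weighting matrix; as before, all names are illustrative.

    import numpy as np

    def build_combined_model(B_s_all, B_t_all, retained_variance=0.97):
        """B_s_all, B_t_all: (N_train, n_shape) / (N_train, n_texture) parameters."""
        # scalar W_s: scale the shape parameters so their total variance matches
        # the total variance of the texture parameters
        w_s = np.sqrt(B_t_all.var(axis=0).sum() / B_s_all.var(axis=0).sum())
        B_concat = np.hstack([w_s * B_s_all, B_t_all])       # eq. 4.5
        # the parameters are (approximately) zero-mean because they come from
        # centered PCA projections, so the PCA reduces to a plain SVD
        _, s, vt = np.linalg.svd(B_concat, full_matrices=False)
        var = s ** 2
        k = int(np.searchsorted(np.cumsum(var) / var.sum(), retained_variance)) + 1
        Q = vt[:k].T                                          # eq. 4.6: B_concat = Q c
        n_shape = B_s_all.shape[1]
        return w_s, Q[:n_shape], Q[n_shape:]                  # eq. 4.7: Q_s, Q_t

    def reconstruct_from_combined(c, w_s, Q_s, Q_t, S_m, P_s, T_m, P_t):
        """Eq. 4.8: shape and texture directly from combined parameters c."""
        S_recon = S_m + P_s @ (Q_s @ c) / w_s
        T_recon = T_m + P_t @ (Q_t @ c)
        return S_recon, T_recon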
4.3 Audiovisual speech synthesis using AAMs

4.3.1 Motivation

From sections 4.1 and 4.2 it can be concluded that the use of AAMs can make a significant contribution to the enhancement of the (audio-)visual speech synthesis strategy, since an AAM offers an individual parameterization of the shape and the texture information of each original video frame. When all video frames from an original visual speech recording are represented by their shape and their texture parameters, these consecutive parameter values define parameter trajectories that describe the visual speech information. The shape-trajectories describe the variations of the shape information in the visual speech signal. This means that these shape-trajectories can be seen as a representation of the kinematics of the speech information (i.e., the movements of the various visual articulators). Similarly, the variation of the texture in the visual speech is described by the texture-trajectories. These trajectories can be seen as a representation of the variation of the appearance of the virtual speaker throughout the original speech recordings (i.e., the visibility of teeth/tongue inside the mouth, changes in illumination on the skin, etc.). Because of this, the AAM-based parameterization of the original speech data offers exactly the representation that makes it possible to diversify the processing between the kinematics-related and the appearance-related visual speech aspects, which is needed to improve the attainable synthetic visual speech quality (as was suggested in section 4.1).

The use of AAMs for visual speech synthesis is not new. Theobald et al. developed a concatenative visual speech synthesizer that selects original segments of visual speech information from a database containing AAM-mapped original speech recordings [Theobald, 2003] [Theobald et al., 2004]. Melenchon et al. developed a rule-based synthesis approach in which AAMs are used to represent the original speech data that is used to learn multiple visual pronunciation rules for each phoneme [Melenchon et al., 2009]. AAMs have also been used for speech-driven visual speech synthesis, since the AAM-based representation of the visual speech offers a convenient set of visual features that can be linked to auditory features of the corresponding auditory speech [Cosker et al., 2003] [Theobald and Wilkinson, 2008] [Englebienne et al., 2008].

In this thesis, however, it is investigated how the use of AAMs can enhance the individual quality of the synthetic visual speech mode generated by the single-phase audiovisual speech synthesis system. From the previous chapter it is known that it will have to be verified that this unimodal optimization does not affect the level of audiovisual coherence between the two synthetic speech modes. When an AAM is used to represent the original visual speech data from the speech database, the model must first be trained on a set of hand-labelled training images. Afterwards, any unseen image can be projected into the model-space by calculating the most appropriate model parameters that reconstruct the image. Note that an ideal reconstruction is only feasible for the training images. For any other image, there will always exist a difference between the original pixel values and the pixel values of the regenerated image. A smaller reconstruction error is feasible for images that are similar to at least one of the training images that were used to train the AAM. Therefore, it is a challenge to build an AAM that is able to appropriately model and reconstruct each frame of the original visual speech recordings. In addition, when the original visual speech data is reconstructed from trajectories of model parameters, the resulting speech information will be slightly different from the original speech data. Therefore, it has to be investigated whether this modification has an influence on the perception of the speech data when it is shown to an observer. In addition, it should be assessed to what extent the AAM modelling of the visual speech data affects the level of audiovisual coherence in the speech signal that is created by multiplexing the original auditory speech with the regenerated visual speech signal.

4.3.2 Synthesis overview

The AAM-based AVTTS synthesis procedure is similar to the synthesis approach that was described in the previous chapter. The synthesizer selects original combinations of auditory and visual speech from the database, after which these segments are audiovisually concatenated to construct the desired synthetic speech. The major difference is that in this case, the original video recordings are modelled using an AAM, which means that the original visual speech data that is provided to the synthesizer consists of trajectories of AAM parameters instead of video frames. The selection of an original video segment corresponds to the extraction of a sub-trajectory from the database, while the concatenation of two selected video fragments is performed by joining the extracted sub-trajectories. In a final stage, the concatenated sub-trajectories are sampled at the output video frame rate, after which a new image is generated from each sampled set of AAM parameter values. These newly generated images then construct the video frame sequence of the synthetic visual speech.

Figure 4.3: AVTTS synthesis using an active appearance model.
The individual parameterization of the various aspects of the visual speech data makes it possible to optimize the visual synthesis in each synthesis stage. An overview of the synthesis approach is given in figure 4.3.

4.3.3 Database preparation and model training

Most of the time, AAMs are used in visual speech analysis and synthesis to model images representing the complete face of the original/virtual speaker. This approach was also followed by the AAM-based visual speech synthesizers that were mentioned in section 4.3.1. Recall from section 3.5.1 that in the initial AVTTS approach only the mouth area of the video frames is synthesized in accordance with the target phoneme sequence. Afterwards, this mouth-signal is merged with a background signal to create the final output visual speech signal. A similar approach is followed in the AAM-based AVTTS approach: the AAM is trained only on the mouth area of the video frames from the LIPS2008 database. This way, the AAM only models the most important speech-related variations and does not have to model, for instance, variations of the eyes or the eyebrows contained in the original speech recordings. The shape information of each frame is determined by 29 landmarks that indicate the outside of the lips and 18 landmarks that indicate the inside of the lips. In addition, 14 landmarks are used to indicate the position of the cheeks and the neck, and 3 additional landmarks are used to indicate the position of the nose. The landmarks corresponding to a typical frame are visualised in figure 4.4. All texture information inside the convex hull denoted by the landmarks is modelled by the AAM.

In order to efficiently implement the AAM-based analysis and synthesis, the AAM-API library was used [Stegmann et al., 2003]. This is a freely available C++ programming API that offers the core functions to perform AAM training, AAM projection, and AAM reconstruction. The library was extended to allow the use of AAMs for the specific purpose of visual speech synthesis. For instance, the memory usage during the process of AAM training was optimized in order to be able to build complex models on a large set of high-resolution training images. In addition, the library was extended to be able to read a given set of model parameter trajectories and reconstruct a sequence of mouth images of which the displayed mouth configurations are appropriately aligned to create a smooth mouth animation. To this end, all output frames are aligned using the mean value of the six inner landmark points on the outside of the upper lip. Also, in order to optimize the reconstruction of an image displaying a closed mouth, it was ensured that all landmarks indicating the inside of the upper lip are located above the corresponding landmarks indicating the inside of the lower lip.

Building a high-quality AAM is not a straightforward task. Since the AAM has to be able to accurately describe each original video frame by means of model parameter values, it is important that the training images that are used to build the AAM sufficiently cover all possible variations present in the database. In addition, the hand-crafted landmark information of these training images should be as consistent as possible, since all variations that are present in this manually determined shape data are regarded as "ground-truth" and will be modelled by the AAM (this also includes the variations caused by an inconsistent or erratic manual landmark positioning).
In order to obtain a collection of training images with associated shape data that satisfies these criteria, an iterative technique was developed to build a high-quality AAM that preserves as much original image detail as possible, while requiring only a limited amount of manual labour. In a first step, 20 frames from the database were landmarked manually. This subset of 20 frames was selected manually to ensure that it contains many different mouth representations (open/closed mouth, visible teeth, visible mouth-cavity, etc.). From these frames and their associated shape information, an "initial" AAM was built. Afterwards, this trained AAM was used to calculate the shape and the texture parameter values of each frame of 100 sentences, selected randomly from the database. A k-means clustering was performed on all the calculated AAM parameter values to determine 50 visually distant frames, of which the shape information was re-labelled manually. These frames were used to train an improved "intermediate" AAM. The intermediate AAM was applied to calculate a set of model parameter values for every video frame contained in the speech database. Then, by means of a k-means clustering on these new model parameter values, 160 visually distant frames were selected as training set for the "final" AAM. For each of these frames, the corresponding landmark positions were automatically calculated from their shape parameter values under the intermediate AAM. These landmark positions were checked manually and corrected if necessary, after which the final AAM was built from these 160 frames and their manually adjusted shape information.

Section 4.2 explained that the shape model and the texture model of the AAM are determined by PCA calculations on the training data. This implies that each resulting eigenshape or eigentexture corresponds to a particular degree of model variation. A standard approach in PCA analysis is to compress the model-based representation of the original data by omitting some of the least important eigenvectors (and their corresponding parameter values). This technique was applied by designing the final AAM to retain 97% of the variation contained in the final training set, which resulted in 8 eigenvectors that represent the shape model and 134 eigenvectors that represent the texture model. It was checked that no difference in image quality could be noticed between images regenerated using the final AAM and images regenerated using an AAM that was built to model 100% of the variation of the final training set. Finally, a combined model was calculated from the shape model and the texture model of the final AAM (see section 4.2). By omitting 3% of the total variation, 94 eigenvectors were needed to represent this combined AAM model.

The final AAM was used to project all frames of the speech database into the model-space. To this end, for each frame the corresponding shape parameter values, texture parameter values, and combined model parameter values were calculated. In addition, for each frame the delta-shape, delta-texture and delta-combined parameters were calculated in order to parameterize the variation of the image information.
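The clustering step that picks visually distant frames for (re-)labelling could look roughly like the sketch below, assuming all database frames have already been projected to AAM parameters with the current ("initial" or "intermediate") model. The function name and the use of scikit-learn are illustrative choices, not a description of the actual implementation.

    import numpy as np
    from sklearn.cluster import KMeans

    def select_training_frames(frame_params, n_frames=160, seed=0):
        """frame_params: (n_database_frames, n_params) AAM parameters per frame.
        Returns the indices of the frames closest to the k-means centroids,
        i.e. a set of mutually distant mouth configurations."""
        km = KMeans(n_clusters=n_frames, n_init=10, random_state=seed)
        km.fit(frame_params)
        selected = []
        for centroid in km.cluster_centers_:
            dists = np.linalg.norm(frame_params - centroid, axis=1)
            selected.append(int(np.argmin(dists)))
        return sorted(set(selected))

    # usage (hypothetical): params = project_all_frames(video); idx = select_training_frames(params)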
The AAM-API functions were configured to use the Fixed Jacobian Matrix Estimate technique to iteratively search for the most appropriate model parameters that describe a particular original video frame (equation 4.3). Details on this technique can be found in [Cootes and Taylor, 2001] and [Stegmann et al., 2003]. An example of an original frame extracted from the database, its automatically determined shape information, and the AAM reconstruction using its shape and texture parameter values is given in figure 4.4.

Figure 4.4: AAM-based representation of the original visual speech data, illustrating the original image (left), its landmarking denoting the associated shape information (middle), and the reconstructed image from its shape and texture parameter values (right).

The proposed approach for building the AAM has the benefit that the final AAM is trained on a large number of images that are selected to represent each typical configuration found in the original speech data. As a consequence, it is able to appropriately model the full variety of visual speech information that is contained in the speech database. The initial AAM and the intermediate AAM are necessary to allow an accurate selection of these representative final training images. Since the shape information (i.e., the landmark positions) of the final training images is automatically generated from shape parameter values corresponding to the intermediate AAM, this data is more consistent compared to a manually landmarked collection of images.

The AAM-based parameterization of the visual speech from the LIPS2008 database makes some of the visual meta-data mentioned in section 3.3.3.4 superfluous, such as the original landmark positions (that were determined by the original landmark tracker) and the PCA-based parameterization. On the other hand, the measures for the amount of visible teeth and visible mouth-cavity are still useful in the AAM-based AVTTS system, since they provide a more direct indication of these two visual features than the AAM texture parameter values. Obviously, the symbolic and the acoustic meta-data, described in sections 3.3.3.2 and 3.3.3.3, respectively, are still applied in the AAM-based AVTTS approach in order to describe the linguistic/prosodic and the acoustic properties of the original audiovisual speech recordings.

4.3.4 Segment selection

The selection costs that are applied in the AAM-based AVTTS system are similar to the various sub-costs that were used in the initial AVTTS synthesis strategy (see section 3.4 for an overview). Candidate segments are selected from the database based on their phonemic match with the target phoneme sequence. Next, target costs and join costs are applied to select for each synthesis target one final database segment to construct the output audiovisual speech.

4.3.4.1 Target costs

The target costs have to force the selection towards original segments that resemble the synthesis targets as closely as possible. Binary target costs are used to select original segments that exhibit the appropriate prosodic features (see section 3.4.2.2). In addition, "safety" target costs are used to minimize the selection of erroneous segments (see section 3.4.2.3). A novel strategy of labelling "suspicious" original segments was introduced using the combined AAM parameter values as the visual feature for calculating equation 3.5.
In contrast to the initial AVTTS system, the AAM-based AVTTS approach includes an additional target cost that takes the visual coarticulation effects into account by promoting the selection of original segments of which the extended visual context is similar to the visual context of the corresponding synthesis target. For instance, when the candidate segment is preceded by a vowel that is associated with a wide mouth opening, the transition effects that are likely to be present in the candidate segment are suited for copying to the synthetic speech in case the corresponding target is also preceded by a vowel that exhibits a pronounced mouth opening. Therefore, in addition to the symbolic target costs that express the phonemic matching between the target context and the candidate context, the "visual context" target cost has to express the visemic matching between the target context and the candidate context.

The visemic matching between two phoneme sequences is often calculated as a binary cost, for which all phonemes are first given a unique viseme label (based on their most common visual representation). Afterwards, the viseme labels of the corresponding phones from both sequences can be compared. Unfortunately, an accurate definition of these viseme labels is far from straightforward, since many phonemes exhibit a variable visual representation due to visual coarticulation effects (see also further on in this thesis). Therefore, it was opted to calculate the visual context target cost using a visual difference matrix that expresses the visual distance between every two phonemes present in the database (as was proposed by Arslan and Talkin [Arslan and Talkin, 1999]). It is important that this matrix is calculated ad hoc for the particular original speech data that is used for synthesis, since each speaker exhibits his/her own personal speaking style and visual coarticulation effects have been found to be speaker-specific [Lesner and Kricos, 1981].

To calculate the visual difference matrix, for every distinct phoneme all its instances in the database are gathered. For each instance, the combined AAM parameters of the video frame located at the middle of the instance are sampled. From these values, means M_{ij} and variances S_{ij} are calculated, where index i corresponds to the various phonemes and index j corresponds to the distinct model parameters. For a particular phoneme i, the sum of all the variances of the model parameters, \sum_j S_{ij}, expresses how much the visual appearance of that phoneme is affected by visual coarticulation effects. Two phonemes can be considered comparable in terms of visual representation if their mean representations are alike and, in addition, if these mean visual representations are sufficiently reliable (i.e., if small summed variances were measured for the visual representation of these phonemes).
Therefore, two matrices are calculated, which express for each pair of phonemes (p, q) the Euclidean difference between their mean visual representations and the sum of the variances of their visual representations, respectively:

\begin{cases} D^M_{pq} = \sqrt{\sum_j (M_{pj} - M_{qj})^2} \\ D^S_{pq} = \sum_j S_{pj} + \sum_j S_{qj} \end{cases}    (4.9)

Dividing each matrix by its largest element produces the scaled matrices \hat{D}^M_{pq} and \hat{D}^S_{pq}, after which the final difference matrix D is constructed by:

D_{pq} = 2 \hat{D}^M_{pq} + \hat{D}^S_{pq}    (4.10)

Matrix D can be applied to calculate the visual context target cost C_{viscon} for a candidate segment u, matching a given target phoneme sequence t, by comparing the three phonemes located before (u-n) and after (u+n) the segment u in the database (i.e., the visual context of the candidate segment) with the visual context of the synthesis target:

C_{viscon}(t, u) = \sum_{n=1}^{3} (4-n) D_{(t-n,\,u-n)} + \sum_{n=1}^{3} (4-n) D_{(t+n,\,u+n)}    (4.11)

In equation 4.11 the factor (4-n) defines a triangular weighting of the calculated visual distances. Finally, for the AAM-based synthesis the first expression from equation 3.4 can be written as:

C^{target}_{total}(t_i, u_i) = \omega_1 C_{phon.match}(t_i, u_i) + \omega_2 C_{hard-pruning}(t_i, u_i) + \omega_3 C_{soft-pruning}(t_i, u_i) + \frac{\omega_4 \hat{C}_{viscon}(t_i, u_i) + \sum_{j=1}^{N} \omega_j^{symb} C_j^{symb}(t_i, u_i)}{\omega_4 + \sum_{j=1}^{N} \omega_j^{symb}}    (4.12)

Note that this equation is very similar to equation 3.12. \hat{C}_{viscon} represents the scaled value of C_{viscon} (see section 3.4.4.1). The values for the weights \omega_1, \omega_2, \omega_3, and \omega_j^{symb} are the same as described in section 3.4.4.2. The weight factor of the visual context cost is chosen such that the term \omega_4 \hat{C}_{viscon} is equally important as the summed symbolic costs \sum_{j=1}^{N} \omega_j^{symb} C_j^{symb}. To this end, mean values for \hat{C}_{viscon} and \sum_{j=1}^{N} \omega_j^{symb} C_j^{symb} are learned from multiple random syntheses. The value for \omega_4 can then be calculated from the ratio of these two measures.
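A minimal sketch of the visual difference matrix (equations 4.9-4.10) and the visual context target cost (equation 4.11), assuming the per-phoneme means M[i, j] and variances S[i, j] of the combined AAM parameters have already been collected; the names are illustrative.

    import numpy as np

    def visual_difference_matrix(M, S):
        """M, S: (n_phonemes, n_params) per-phoneme mean / variance of the
        combined AAM parameters sampled at the phoneme centres."""
        # Euclidean distance between mean representations (eq. 4.9, D^M)
        D_M = np.linalg.norm(M[:, None, :] - M[None, :, :], axis=2)
        # summed variances of both phonemes (eq. 4.9, D^S)
        s = S.sum(axis=1)
        D_S = s[:, None] + s[None, :]
        # scale by the largest element and combine (eq. 4.10)
        return 2.0 * D_M / D_M.max() + D_S / D_S.max()

    def visual_context_cost(D, target_ctx, candidate_ctx):
        """target_ctx / candidate_ctx: phoneme indices at the context offsets
        (-3, -2, -1, +1, +2, +3) around the target and the candidate segment."""
        weights = [1, 2, 3, 3, 2, 1]          # triangular weighting (4 - |n|), eq. 4.11
        return sum(w * D[t, c]
                   for w, t, c in zip(weights, target_ctx, candidate_ctx))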
4.3.4.2 Join costs

The join costs have to ensure that the final sequence of selected database segments can be appropriately concatenated to construct the desired synthetic speech signal. The AAM-based AVTTS system uses both auditory and visual join costs to ensure smooth concatenations in both speech modes. The auditory join costs C_{MFCC}(u_i, u_{i+1}), C_{pitch}(u_i, u_{i+1}) and C_{energy}(u_i, u_{i+1}) are based on the difference between the MFCC, pitch, and energy features of the waveforms at the join position, respectively (see section 3.4.3.1).

In addition, the AVTTS system employs four separate visual join costs. A first join cost C_{teeth}(u_i, u_{i+1}) was also used in the initial AVTTS strategy and expresses the difference between the amount of teeth that is visible in the two video frames at the join position (see section 3.4.3.2). This join cost is useful since a sudden "jump" of this feature around a join position easily causes noticeable concatenation artefacts. Three additional visual join costs are calculated on the AAM parameter values of the video frames at the join position. A first cost calculates the Euclidean difference between the shape parameter values of these frames. Another cost calculates the Euclidean difference between the combined AAM parameter values of these frames. A final visual join cost is calculated as the Euclidean difference between the delta-combined AAM parameters of the video frames at the join position. This cost is included since these delta values express how the original visual features are varying in the segments u_i and u_{i+1} that need to be concatenated. When this variation can be maintained across the join position, smooth and natural variations will be perceived in the synthetic speech. The three AAM-based costs can be written as:

\begin{cases} C_{shape}(u_i, u_{i+1}) = \|B_{s,i} - B_{s,i+1}\| \\ C_{combined}(u_i, u_{i+1}) = \|c_i - c_{i+1}\| \\ C_{\Delta combined}(u_i, u_{i+1}) = \|\Delta c_i - \Delta c_{i+1}\| \end{cases}    (4.13)

with B_{s,i} the shape parameters, c_i the combined AAM parameters, and \Delta c_i the delta-combined parameters of the last video frame of segment u_i. Similarly, the parameter values B_{s,i+1}, c_{i+1} and \Delta c_{i+1} are sampled at the first video frame of segment u_{i+1}. Note that the difference between the texture parameter values of the join frames does not define a separate join cost. The reason for this is that the most critical texture-related feature, namely the continuity of the teeth appearance, is already separately taken into account by the teeth join cost. Other texture-related aspects are taken into account by comparing the combined model parameter values. In addition, further on in section 4.4.3 it will be explained that the texture information is heavily smoothed around the concatenation points anyway. For the AAM-based synthesis, the second expression from equation 3.4 can be written as:

C^{join}_{total}(u_i, u_{i+1}) = \frac{\omega_1 \hat{C}_{MFCC} + \omega_2 \hat{C}_{pitch} + \omega_3 \hat{C}_{energy} + \omega_4 \hat{C}_{teeth} + \omega_5 \hat{C}_{shape} + \omega_6 \hat{C}_{combined} + \omega_7 \hat{C}_{\Delta combined}}{\omega_1 + \omega_2 + \omega_3 + \omega_4 + \omega_5 + \omega_6 + \omega_7}    (4.14)

where each cost is evaluated for the segment pair (u_i, u_{i+1}) and \hat{C} represents the scaled value of the original cost C (see section 3.4.4.1). The join cost weights (\omega_1, ..., \omega_7) were optimized manually in a similar way as described in section 3.4.4, which resulted in the values \omega_1 = 5, \omega_2 = 2, \omega_3 = \omega_4 = 1, \omega_5 = 2, and \omega_6 = \omega_7 = 1. A value for the factor \alpha in equation 3.3 was determined such that the total target cost and the total join cost contribute equally to the total selection cost associated with the selection of a particular candidate segment.
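A minimal sketch of the three AAM-based join costs of equation 4.13 and the weighted total of equation 4.14, assuming the per-frame shape, combined and delta-combined parameters of the candidate segments are available as arrays; the data layout and names are illustrative.

    import numpy as np

    def aam_join_costs(seg_a, seg_b):
        """seg_a, seg_b: dicts with per-frame parameter arrays for two candidate
        segments; the join compares the last frame of seg_a with the first
        frame of seg_b (eq. 4.13)."""
        c_shape = np.linalg.norm(seg_a["shape"][-1] - seg_b["shape"][0])
        c_comb = np.linalg.norm(seg_a["combined"][-1] - seg_b["combined"][0])
        c_delta = np.linalg.norm(seg_a["delta_combined"][-1]
                                 - seg_b["delta_combined"][0])
        return c_shape, c_comb, c_delta

    def total_join_cost(scaled_costs, weights=(5, 2, 1, 1, 2, 1, 1)):
        """Eq. 4.14: weighted average of the scaled auditory and visual join
        costs, ordered (MFCC, pitch, energy, teeth, shape, combined, delta)."""
        return sum(w * c for w, c in zip(weights, scaled_costs)) / sum(weights)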
4.3.5 Segment concatenation

Joining the selected original audiovisual speech segments requires a concatenation in the auditory mode and a concatenation in the visual mode. The AAM-based AVTTS system performs the concatenation of the acoustic signals by a pitch-synchronous cross-fade, as was explained earlier in section 3.5.3. Recall that in the initial AVTTS approach, the concatenation of the visual speech segments was smoothed by substituting the original video frames around the join position with intermediate video frames generated using image morphing techniques. The AAM-based AVTTS approach has the benefit that a parameterization of the original visual speech is available. Instead of concatenating the video data directly, the selected original visual speech segments can be joined by the concatenation of the parameter sub-trajectories that correspond to the selected visual speech signals. Obviously, this allows a much more efficient way to smooth the visual concatenations, since the visual speech can be modified by adjusting the concatenated parameter trajectories around the join position. The concatenation of two selected visual speech segments involves a separate join calculation for all sub-trajectories that describe the visual speech information. The segments that need to be joined are partially overlapped to calculate appropriate parameter values at the concatenation point.

The AAM-based AVTTS system overlaps both sub-trajectories by exactly one video frame. Then, the concatenated trajectory is smoothed by adjusting the parameter values of the frames in the vicinity of the join position. When the two visual segments that need to be concatenated are denoted as \alpha and \beta, the sub-trajectories, corresponding to a particular AAM parameter, that describe these two segments can be written as (B_1^{\alpha}, B_2^{\alpha}, ..., B_m^{\alpha}) and (B_1^{\beta}, B_2^{\beta}, ..., B_n^{\beta}), respectively, given that segment \alpha contains m frames from its beginning until the frame that was selected as join position in \alpha, and that segment \beta contains n frames from the video frame that was selected as join position in \beta until the end of the segment. The joining of these two segments results in a joined segment J that is described by the parameter trajectory (B_1^J, B_2^J, ..., B_{m+n-1}^J). The parameter value at the join position, B_m^J, is calculated by the overlap of the two boundary frames:

B_m^J = \frac{B_m^{\alpha} + B_1^{\beta}}{2}    (4.15)

When a parameter S is used to denote the smoothing strength, the resulting parameter values of the concatenated trajectory are calculated as follows, for all k with 1 \le k \le m+n-1 and k \ne m:

B_k^J = \begin{cases} B_k^{\alpha} & \text{if } 1 \le k < m-S \\ \frac{m-k}{S+1} B_k^{\alpha} + \frac{(S+1)-(m-k)}{S+1} B_m^J & \text{if } m-S \le k < m \\ \frac{(S+1)-(k-m)}{S+1} B_m^J + \frac{k-m}{S+1} B_{k-m+1}^{\beta} & \text{if } m < k \le m+S \\ B_{k-m+1}^{\beta} & \text{if } m+S < k \le m+n-1 \end{cases}    (4.16)

Figure 4.5: Example of the concatenation of two sub-trajectories. The two original sub-trajectories are shown in black. The coloured lines indicate the concatenated trajectories that were calculated using S = 1 (blue), S = 3 (red), and S = 6 (green) as smoothing strength. Note that all concatenated trajectories pass through the interpolated value at the join position.

For each AAM parameter, equations 4.15 and 4.16 are used to concatenate the corresponding sub-trajectories. This makes it possible to optimize the concatenation smoothing strength for each AAM parameter individually, since for each parameter a separate value for the smoothing strength S can be applied. For instance, as was concluded in section 4.1, it is opportune to increase the value of S for joining sub-trajectories corresponding to texture parameters in comparison with the concatenation of sub-trajectories corresponding to shape parameters. This way, the texture variation in the concatenated visual speech can be adequately smoothed to achieve a continuous appearance of the virtual speaker across the concatenation points, while the shape information is smoothed less strongly to ensure that the visual articulations are appropriately pronounced. An example of the concatenation of two sub-trajectories is given in figure 4.5.

When all parameter sub-trajectories of all selected database segments have been concatenated, each concatenated trajectory is sampled at the output video frame rate. A new sequence of images can then be generated by the inverse AAM projection of the sampled parameter values. These new images describe the animation of the mouth area in accordance with the target speech. The merging of this video signal with the background signal creates the final full-face synthetic visual speech (see section 3.5.1).
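A compact sketch of this trajectory concatenation (equations 4.15 and 4.16), assuming the two sub-trajectories of a single AAM parameter are given as 1-D NumPy arrays that both include their respective join frame; this is an illustration under those assumptions, not the AAM-API code used in the system.

    import numpy as np

    def concatenate_subtrajectories(traj_a, traj_b, S=3):
        """Join two sub-trajectories of one AAM parameter (eqs. 4.15-4.16).
        traj_a ends at the join frame of the first segment, traj_b starts at
        the join frame of the second segment; S is the smoothing strength."""
        m, n = len(traj_a), len(traj_b)
        joined = np.empty(m + n - 1)
        joined[:m] = traj_a
        joined[m:] = traj_b[1:]
        join_value = 0.5 * (traj_a[-1] + traj_b[0])      # eq. 4.15
        joined[m - 1] = join_value
        # linear interpolation towards the join value on both sides (eq. 4.16)
        for offset in range(1, S + 1):
            w = offset / (S + 1.0)                       # weight of the original value
            if m - 1 - offset >= 0:
                joined[m - 1 - offset] = (w * traj_a[-1 - offset]
                                          + (1 - w) * join_value)
            if m - 1 + offset < m + n - 1:
                joined[m - 1 + offset] = (w * traj_b[offset]
                                          + (1 - w) * join_value)
        return joined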
4.4 Improving the synthesis quality

4.4.1 Parameter classification

The AAM-based representation of the original visual speech contained in the database makes it possible to independently parameterize the shape-related information and the texture-related information, which can for instance be used to apply a different concatenation smoothing strength to these two aspects (see section 4.3.5). However, new possibilities to enhance the attainable synthesis quality emerge when a separate description of distinct shape-aspects and distinct texture-aspects is feasible. This would allow speech-related shape/texture variations (e.g., lip movements, changes in teeth/tongue visibility, etc.) to be treated separately from the other variations present in the database (e.g., deviations of the head orientation, illumination changes, etc.). To this end, a technique to classify each shape/texture parameter in terms of its correlation with the speech is proposed.

Section 4.2 explained that the AAM parameters are each linked to an eigenvector, resulting from PCA calculations on the shape information and on the texture information contained in the set of images that was used to train the model. A manual inspection of the various parameters of the AAM that was trained on the LIPS2008 database indicated that many of these parameters/eigenvectors can be linked to a particular physical property. For example, the first shape parameter of the AAM influences the amplitude of the mouth-opening, while the second shape parameter is linked to a (limited) head rotation. Likewise, the first texture parameter influences the appearance of shadows on the face of the speaker, while the second texture parameter controls the presence of teeth in the image (see figure 4.6).

Figure 4.6: Relation between AAM parameters and physical properties. The two top rows indicate the speech-related first shape parameter and second texture parameter that influence the mouth opening and the appearance of visible teeth, respectively. The two bottom rows indicate the non-speech-related second shape parameter and first texture parameter that influence the head rotation and the casting of shadows on the face, respectively.

Two separate criteria were designed to identify the correlation between each model parameter and the speech information. A first measure is based on the knowledge that the visual representations of multiple instances of the same phoneme will be much alike. Since this is more valid for some phonemes than for others due to visual coarticulation effects, all distinct phonemes that are present in the database are processed consecutively, after which the mean behaviour calculated over all phonemes is taken as the final measure (see further on in this section). It can safely be assumed that in general the visual representations of two random database instances of the same phoneme are more similar than the visual representations of two phones that are completely randomly selected from the database. Therefore, it can be assumed that when a parameter is sufficiently correlated with the speech information, its values sampled at multiple database instances of the same phoneme will be more similar compared with its values sampled at random database locations.

In a first step of the analysis, for every distinct phoneme all its instances in the database are gathered.
For each instance, the shape/texture parameters of the video frame located at the middle of the instance are sampled. From these values, means M_{ij} and variances S_{ij} are calculated, where index i corresponds to the various phonemes and index j corresponds to the distinct model parameters. Then, for each phoneme i the shape/texture parameter values of a set of video frames randomly selected from the database are gathered. The size of this set of random frames is the same as the number of instances of phoneme i that exist in the database. The mean and the variance of the random parameter set are denoted as M^{rand}_{ij} and S^{rand}_{ij}, respectively. Next, the relative differences D^{var}_{ij} between the values S_{ij} and S^{rand}_{ij} are calculated:

D^{var}_{ij} = \frac{S^{rand}_{ij} - S_{ij}}{S^{rand}_{ij}}    (4.17)

Finally, a single measure for each model parameter is acquired by calculating the mean variance difference over all phonemes:

D^{var}_j = \frac{\sum_i D^{var}_{ij}}{N_p}    (4.18)

with N_p the number of distinct phonemes in the database. D^{var}_j expresses for each parameter j the relative difference between its overall variation and its intra-phoneme variation. This means that highly speech-correlated parameters will exhibit larger values for D^{var}_j than other parameters. The values for D^{var}_j that were measured for the AAM that was trained on the LIPS2008 database are visualized in figure 4.7.

Figure 4.7: Values for D^{var} for the 8 shape parameters (top panel) and the 8 most important texture parameters (lower panel) of the AAM trained on the LIPS2008 database. Compare the values obtained with the physical meaning of the two most important shape and texture parameters visualized in figure 4.6.
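The first classification measure (equations 4.17 and 4.18) could be computed along the following lines, assuming the per-phoneme mid-frame parameters and the parameters of all database frames are available; names are illustrative.

    import numpy as np

    def d_var(phoneme_frame_params, all_frame_params, seed=0):
        """Eqs. 4.17-4.18: per-parameter speech-correlation measure.
        phoneme_frame_params: dict phoneme -> (n_instances, n_params) array of
        parameters sampled at the middle frame of each instance.
        all_frame_params: (n_frames, n_params) parameters of every database frame."""
        rng = np.random.default_rng(seed)
        ratios = []
        for params in phoneme_frame_params.values():
            s_phone = params.var(axis=0)
            # equally many randomly chosen frames as there are instances
            idx = rng.choice(len(all_frame_params), size=len(params), replace=False)
            s_rand = all_frame_params[idx].var(axis=0)
            ratios.append((s_rand - s_phone) / s_rand)      # eq. 4.17
        return np.mean(ratios, axis=0)                      # eq. 4.18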
In a second approach for determining which of the AAM parameters are the most correlated with the speech information, some random original sentences from the LIPS2008 database are resynthesized using the AAM-based AVTTS system (the original speech corresponding to these sentences is excluded from selection). Then, for each sentence the synthesized parameter trajectories are synchronized with their corresponding original database trajectories. The synchronization is performed by time-scaling each synthesized phoneme sub-trajectory such that its length matches the duration of the corresponding original phoneme sub-trajectory. For each sentence n and for each parameter j, the distance D^{syn}_{nj} between the original trajectory T^{ori}_{nj} and the synchronized synthetic trajectory T^{syn}_{nj} is calculated. Note that the magnitude of the parameter values of the most important AAM parameters (i.e., the parameters that model most of the variance contained in the training set) is higher than the magnitude of the other parameter values. This means that for the most important parameters, the magnitude of the measured distances D^{syn}_{nj} will be higher compared to the value of D^{syn}_{nj} calculated for the other parameters. In order to properly compare the values of D^{syn}_{nj} for multiple values of j, the influence of the difference in magnitude between the model parameters must be cancelled. To this end, every original trajectory is scaled to unit variance and zero mean (denoted as \hat{T}^{ori}_{nj}). In addition, the mean and variance of the corresponding synthesized trajectory T^{syn}_{nj} are scaled using the mean and the variance of T^{ori}_{nj}, resulting in the trajectory \hat{T}^{syn}_{nj}. Then, the distance between the original and the synthesized trajectory is calculated as the Euclidean difference between the scaled trajectories:

D^{syn}_{nj} = \sqrt{\sum_{f=1}^{N_f} \left( \hat{T}^{ori}_{nj}(f) - \hat{T}^{syn}_{nj}(f) \right)^2}    (4.19)

with N_f the number of video frames in the synthesized sentence n. This way, the influence of the magnitude of the original trajectories is eliminated and a minimal value is measured when the trajectories T^{ori}_{nj} and T^{syn}_{nj} are similar in terms of mean, variation, and shape. To eliminate the influence of the global synthesis quality of a particular sentence, for each sentence the measured differences for all parameters are scaled between zero and one:

\hat{D}^{syn}_{nj} = \frac{D^{syn}_{nj}}{\max_j D^{syn}_{nj}}    (4.20)

Finally, a single value for each parameter is calculated as the mean value over all sentences:

D^{syn}_j = \frac{\sum_{n=1}^{N_s} \hat{D}^{syn}_{nj}}{N_s}    (4.21)

with N_s the number of synthesized sentences. The value D^{syn}_j will be larger for parameters that are not correlated with the speech, since the values D^{syn}_{nj} were calculated by comparing the parameter values of video frames corresponding to two different (synchronized) database instances of the same phoneme. Because these comparison pairs are determined using speech synthesis, each pair is selected by minimizing the selection costs, which implies that both phoneme instances are similar in terms of visual context, linguistic properties, etc. This means that it can be assumed that their visual representations are much alike and that a smaller difference will be measured between the synchronized original and synthesized parameter trajectories for those model parameters that are the most correlated with the speech information. The following sections will explain how the parameter classification by means of the measures D^{var}_j and D^{syn}_j can be applied to improve the visual speech synthesis.
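The second measure (equations 4.19-4.21) could be sketched as follows, assuming the synthesized trajectories have already been time-aligned to their originals; note the scaling convention described above, where the synthesized trajectory is normalized with the statistics of the corresponding original trajectory.

    import numpy as np

    def d_syn(original_trajs, synthesized_trajs):
        """original_trajs / synthesized_trajs: lists (one entry per resynthesized
        sentence) of (n_frames, n_params) arrays, with the synthesized
        trajectories already synchronized to the originals."""
        per_sentence = []
        for ori, syn in zip(original_trajs, synthesized_trajs):
            mean, std = ori.mean(axis=0), ori.std(axis=0)
            ori_hat = (ori - mean) / std            # unit variance, zero mean
            syn_hat = (syn - mean) / std            # scaled with the original statistics
            d = np.sqrt(((ori_hat - syn_hat) ** 2).sum(axis=0))   # eq. 4.19
            per_sentence.append(d / d.max())        # eq. 4.20
        return np.mean(per_sentence, axis=0)        # eq. 4.21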
4.4.2 Database normalization

The attainable quality of data-driven speech synthesis strongly depends on the properties and the quality of the speech database that is provided to the synthesizer. Furthermore, while recording an audiovisual speech database, it is nearly impossible to retain exactly the same recording conditions throughout the whole database. For instance, the LIPS2008 database contains some small changes of the head orientation of the original speaker, some variations in illumination, and some colour shifts. Although these variations are subtle, they can cause serious concatenation artefacts: since these features are not correlated with the speech information, they are not taken into account by the selection costs (not even by the cost that demands a phonemic match between the target and the candidate segment), which means that sudden "jumps" of these features at the concatenation points in the concatenated visual speech signal are unavoidable. A possible solution would be to include these features in the selection (join) costs. However, this would be disadvantageous for the attainable synthesis quality, since the segment selection should select the best original segments based on the appropriateness of their speech-related information only. A better approach is to directly reduce the amount of undesired variations in the speech database. This is feasible using the parameter classification described in section 4.4.1, since many non-speech related database variations can be removed by assigning their associated model parameters a constant value over the whole database. An appropriate normalization value is zero, since all-zero model parameters generate the mean AAM image (equation 4.3).

To determine which model parameters are the most appropriate to normalize, the measures D^{var}_j and D^{syn}_j are combined. First, for both measures the 30% shape/texture parameters least correlated with the speech are selected. Then, from these selected parameters, a final set is chosen as the parameters that were selected by both criteria, augmented with those parameters that were selected by only one measure and that represent less than 1% of the model variation. For the AAM trained on the LIPS2008 database, this resulted in the selection of 1 shape parameter and 35 texture parameters for normalization. An example of an original video frame, reconstructed from its original and from its normalized model parameter values, is given in figure 4.8.

Figure 4.8: Reconstruction of a database frame using its original model parameter values (left) and its normalized model parameter values (right).

When the shape and the texture parameters of all database frames have been normalized, their corresponding combined model parameters can be normalized as well. To this end, for each original frame a new set of combined parameter values is calculated from its normalized shape and its normalized texture parameter values through equation 4.6. These normalized combined parameters allow a more accurate calculation of the visual selection costs, such as the join cost values (see section 4.3.4.2) and the visual distance matrix that is applied to calculate the visual context target cost (see section 4.3.4.1).
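A rough sketch of this selection and normalization step, assuming the two measures and the per-parameter fraction of model variation are available as arrays; the thresholds follow the description above and the names are illustrative.

    import numpy as np

    def select_parameters_to_normalize(d_var, d_syn, variation_fraction, keep=0.30):
        """d_var: higher = more speech-correlated; d_syn: higher = less
        speech-correlated; variation_fraction: share of the model variation
        represented by each parameter.  Returns indices to set to zero."""
        n = len(d_var)
        k = int(np.ceil(keep * n))
        least_var = set(np.argsort(d_var)[:k])       # 30% lowest D_var
        least_syn = set(np.argsort(d_syn)[-k:])      # 30% highest D_syn
        both = least_var & least_syn
        only_one = least_var ^ least_syn
        small = {j for j in only_one if variation_fraction[j] < 0.01}
        return sorted(both | small)

    def normalize_database(params, normalize_idx):
        """Set the selected (non-speech related) parameters to zero everywhere."""
        params = params.copy()
        params[:, normalize_idx] = 0.0
        return params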
To investigate the effect of the database normalization technique on the perceived quality of the synthesized visual speech, a subjective perception experiment was conducted. Fifteen medium-length English sentences were synthesized by the AAM-based AVTTS system using both the original and the normalized version of the AAM-projected LIPS2008 database. Two versions of each synthesis sample were created: a "mute" sample containing only visual speech and an audiovisual version that contained both synthetic speech modes. The samples were shown pairwise to the participants, who were asked to report their opinion on the visual speech mode of the presented speech samples. They were instructed to pay attention to the smoothness and to the naturalness of the mouth movements/appearances and, for the audiovisual samples, to the coherence between the auditory and the visual speech. Each comparison pair contained two syntheses of the same sentence, based on the original and on the normalized version of the LIPS2008 database, respectively. The text transcript of each sample was given to the participants and the order of the samples in each comparison pair was randomized. A 5-point comparative MOS scale [-2,2] was used to express the preference of the test subjects for the first or for the second sample of each pair. Nine participants (7 male, 2 female) joined the experiment; one of them was aged above 50 and the others were aged between 21 and 30. Seven participants can be considered speech technology experts. The results obtained using the visual speech-only samples are given in table 4.1 and the results obtained using the audiovisual samples are given in table 4.2.

An analysis using Wilcoxon signed-rank tests pointed out that in both the visual speech-only experiment (Z = -7.40 ; p < 0.001) and the audiovisual experiment (Z = -8.27 ; p < 0.001) the samples created using the normalized version of the database were given significantly better ratings than the samples created using the original database. The results obtained in the visual speech-only experiment show that the proposed database normalization strategy indeed smooths the synthetic visual speech by removing some variations from the database. In addition, the audiovisual experiment shows that the removed variations were indeed not related to the speech information, since the smoothing does not adversely affect the audiovisual perception quality.

Table 4.1: Evaluation of the database normalization strategy using the visual speech-only samples.

    Normalized > Original     89
    Normalized < Original     12
    Normalized = Original     34
    Total                    135

Table 4.2: Evaluation of the database normalization strategy using the audiovisual samples.

    Normalized > Original     90
    Normalized < Original      5
    Normalized = Original     40
    Total                    135

4.4.3 Differential smoothing

Section 4.3.5 explained that the visual speech information from the selected database segments is successfully concatenated by a smooth joining of the corresponding parameter sub-trajectories. Each concatenation consists of an overlap of the sub-trajectories at the join position and an interpolation of the original sub-trajectories around the join position in order to smooth the transition (see equations 4.15 and 4.16). A major benefit of the AAM-based synthesis approach is that the strength of the concatenation smoothing (defined by parameter S in equation 4.16) can be diversified between the shape and the texture information. A strong smoothing (high value for S) is applied for joining the texture parameter sub-trajectories in order to avoid a jerky appearance of the virtual speaker in the concatenated visual speech signal. On the other hand, a weaker smoothing (low value for S) is applied for the concatenation of the shape parameter sub-trajectories in order to avoid visual under-articulation effects.

A further improvement to the visual speech synthesis quality is possible when the concatenation smoothing strength is also diversified among the various shape/texture parameters themselves. Section 4.4.1 elaborated on two criteria that express the correlation between the model parameters and the visual speech information. Obviously, a strong smoothing of parameters that are closely linked to speech gestures will easily result in an "over-smoothed" perception of the synthetic visual speech. On the other hand, the model parameters that are less related to speech gestures can safely be smoothed without affecting the visual articulation strength. In addition, as was mentioned earlier, the concatenated parameter trajectories of the less speech-related model parameters are more likely to contain steep "jumps" at the join positions, since these parameters model variations that are not (or less) taken into account during the segment selection stage. Therefore, both the shape and the texture parameters are split up into two groups according to their correlation with the speech information, after which for each visual concatenation a stronger concatenation smoothing is applied to join the sub-trajectories of parameters belonging to the least speech-correlated group.
The classification of the shape/texture parameters is performed using the measures D^{var}_j and D^{syn}_j (see section 4.4.1). First, for both measures the 30% shape/texture parameters most correlated with the speech are selected. Then, from these selected parameters a final "strongly speech-correlated group" is chosen as the parameters that were selected by both criteria, augmented with the parameters that were selected by only one measure and that represent more than 1% of the model variation. For the classification of the texture parameters, two extra criteria are added, based on the correlation between the parameter values and the amount of visible teeth and the amount of visible mouth-cavity in each video frame, respectively. For both these criteria, the 5 most correlated parameters are implicitly added to the strongly speech-correlated group.

The two previous techniques diversify the strength of the concatenation smoothing among the various model parameters. A final optimization is possible when the smoothing strength that is applied for each model parameter is adjusted for each concatenation individually. It has been explained earlier that the visual representation of some phonemes is more variable than that of others, since some phonemes are more affected by visual coarticulation from neighbouring phones. This means that in a visual speech signal, the typical visual representation of some phonemes will always be clearly noticeable, while some other phonemes are most of the time "invisible", since their corresponding speech gestures are strongly affected by coarticulation effects [Jackson and Singampalli, 2009]. This inspires a classification of each phoneme as either "normal", "protected" (always visible), or "invisible" (almost never visible). Examples for English are the /t/ phoneme, which can be labelled as "invisible", and the /f/ phoneme, which should be labelled as "protected" (see appendix B for a complete overview of the classification of the English phonemes used in the AVTTS system). The protected/normal/invisible classification of the phonemes of a language can be constructed based on prior articulatory knowledge. For instance, Jackson and Singampalli [Jackson and Singampalli, 2009] investigated the critical articulators for each particular English phoneme and the degree to which articulators are allowed to relax during speech production. In this work, however, it was opted to perform an ad-hoc classification for the particular speaker that was used to construct the synthesizer's speech database. Therefore, a hand-crafted protected/normal/invisible classification for the phonemes from the LIPS2008 database was constructed based on both prior articulatory knowledge and the variability of the visual representations of each phoneme i, expressed by \sum_j S_{ij} (see section 4.3.4.1). The classification is used to optimize the concatenation smoothing by applying a stronger smoothing strength in case the join takes place in an "invisible" phoneme (to avoid over-articulation) and by applying a weaker smoothing strength when the join takes place in a "protected" phoneme (to avoid under-articulation).
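As an illustration of how these differentiations could be combined at synthesis time, the following sketch selects a smoothing strength S per parameter and per join. The numeric values follow the settings summarized in table 4.3 below; the grouping labels are assumed to have been computed as described above, and the lookup can be combined with the concatenate_subtrajectories sketch given earlier.

    # smoothing strength S indexed by (parameter type, speech correlation,
    # phoneme type at the join); values as summarized in table 4.3
    SMOOTHING_STRENGTH = {
        ("shape", "high"):   {"protected": 1, "normal": 1, "invisible": 3},
        ("shape", "low"):    {"protected": 2, "normal": 3, "invisible": 5},
        ("texture", "high"): {"protected": 1, "normal": 3, "invisible": 5},
        ("texture", "low"):  {"protected": 3, "normal": 5, "invisible": 7},
    }

    def smoothing_strength(param_type, speech_correlation, phoneme_type):
        """param_type: 'shape' or 'texture'; speech_correlation: 'high' or 'low';
        phoneme_type: 'protected', 'normal' or 'invisible'."""
        return SMOOTHING_STRENGTH[(param_type, speech_correlation)][phoneme_type]

    # e.g. a weakly speech-correlated texture parameter joined inside an
    # "invisible" phoneme is smoothed with S = 7:
    # S = smoothing_strength("texture", "low", "invisible")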
From the previous it can be concluded that the applied visual concatenation smoothing strength is optimized by three distinct differentiations: the smoothing strength is adjusted based on the type of model parameter (shape/texture), it is adjusted based on the correlation of the model parameter with the speech, and it is fine-tuned based on the type of phoneme in which the concatenation takes place. Optimal values for parameter S (equation 4.16), which defines the concatenation smoothing strength, were empirically deduced to optimize the attainable visual speech synthesis quality and are summarized in table 4.3.

Table 4.3: Various visual concatenation smoothing strengths S (see equation 4.16).

    Type      Speech Correlation   Phoneme Type   S
    Shape     High                 Protected      1
    Shape     High                 Normal         1
    Shape     High                 Invisible      3
    Shape     Low                  Protected      2
    Shape     Low                  Normal         3
    Shape     Low                  Invisible      5
    Texture   High                 Protected      1
    Texture   High                 Normal         3
    Texture   High                 Invisible      5
    Texture   Low                  Protected      3
    Texture   Low                  Normal         5
    Texture   Low                  Invisible      7

4.4.4 Spectral smoothing

When the video frames of an audiovisual speech signal are represented by means of AAM parameters, the consecutive values of a single model parameter throughout a sentence (i.e., its trajectory) can be seen as a data signal, sampled at the video frame rate, that contains some part of the visual speech information that corresponds to the uttering of the sentence. By calculating the Fast Fourier Transform (FFT) of the parameter trajectory, the spectral content of this visual speech information can be analysed. To investigate the typical spectral content of the AAM parameter trajectories, a random subset of sentences from the LIPS2008 database was resynthesized using the AAM-based AVTTS system. This made it possible to compare the spectral content of the original trajectories with the spectral content of the corresponding resynthesized trajectories. This analysis showed that typically a resynthesized trajectory contains more energy at higher frequencies in comparison with the corresponding original trajectory. This high-frequency energy is likely to be caused by the visual concatenations that were necessary to construct the synthetic speech signals: around the join positions some changes in shape/texture information can occur that are more abrupt than the variations occurring in original visual speech signals.

Recall that in the synthetic visual speech each concatenation has been smoothed by the advanced smoothing approach that was discussed in section 4.4.3. Unfortunately, not all unnaturally fast variations can be removed from the synthetic speech by the proposed concatenation smoothing strategy, since global settings for the concatenation smoothing strength are applied. It can happen that two particular consecutive selected database segments are visually very distant and that their concatenation would require a stronger smoothing in comparison with the other concatenations in order to avoid visual "over-articulation" effects (which was actually one of the major complaints reported by the participants who evaluated the initial AVTTS approach). From this observation it can be concluded that the AAM-based AVTTS approach can be improved by an additional smoothing of the concatenated visual speech signal, in which only those parts of the signal that do need an extra smoothing are significantly modified. Such a smoothing technique was developed which suppresses unnaturally rapid variations in the concatenated visual speech by modifying its spectral information derived from its AAM-based representation.
The smoothing technique makes use of a well-designed low-pass filter that limits the amount of high-frequency energy in the spectrum of the synthetic visual speech such that its spectral envelope resembles the spectral envelope of original visual speech signals. The design of these low-pass filters is critical, since it has to be ensured that the cutoff frequency is adequately low to remove the unnaturally fast variations from the speech, while on the other hand as little useful speech information as possible should be removed from the signal. Optimal filter settings were found by assessing the perceived effects of such a low-pass filtering on original visual speech trajectories.

In a first step, for each AAM parameter multiple low-pass filters were designed. To this end, the spectral information of each model parameter was gathered from 100 random database sentences. For each parameter, the mean of the measured spectra was calculated to determine an estimate for its common spectral content as seen in original visual speech signals. Based on these mean spectra, for each parameter multiple cutoff frequencies were calculated, preserving 90, 80, 70, 60 and 50 percent of the original spectral energy, respectively. For each of these cutoff frequencies, a low-pass filter was designed using the Parks-McClellan optimal FIR filter design technique implemented in Matlab [Matlab, 2013]. This makes it possible to filter the shape/texture information contained in a visual speech signal by filtering each parameter trajectory individually using its corresponding filter coefficients.
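A minimal sketch of this filter-design step is given below. The thesis work used Matlab's Parks-McClellan design; the sketch uses SciPy's equivalent equiripple routine (signal.remez) instead, applies the filter with zero-phase filtering so the trajectory is not delayed (an illustrative choice, not necessarily the original implementation), and uses a synthetic placeholder trajectory and an assumed 25 fps frame rate.

    import numpy as np
    from scipy import signal

    def cutoff_for_energy(mean_spectrum, freqs, fraction=0.9):
        """Frequency below which the given fraction of the spectral energy lies."""
        energy = np.cumsum(mean_spectrum ** 2)
        return freqs[np.searchsorted(energy, fraction * energy[-1])]

    def design_lowpass(cutoff_hz, fs, numtaps=51, transition_hz=2.0):
        """Equiripple (Parks-McClellan) low-pass FIR filter via scipy.signal.remez."""
        bands = [0, cutoff_hz, cutoff_hz + transition_hz, fs / 2]
        return signal.remez(numtaps, bands, [1, 0], fs=fs)

    def smooth_trajectory(trajectory, taps):
        """Zero-phase filtering, so the smoothed trajectory is not delayed."""
        return signal.filtfilt(taps, [1.0], trajectory)

    # placeholder trajectory standing in for one AAM parameter of one sentence,
    # assuming a 25 fps video frame rate
    fs = 25.0
    rng = np.random.default_rng(0)
    frames = np.arange(300)
    traj = np.sin(2 * np.pi * 1.5 * frames / fs) + 0.05 * rng.normal(size=frames.size)
    spectrum = np.abs(np.fft.rfft(traj))
    freqs = np.fft.rfftfreq(traj.size, d=1.0 / fs)
    fc = cutoff_for_energy(spectrum, freqs, fraction=0.9)
    smoothed = smooth_trajectory(traj, design_lowpass(fc, fs))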
Based on the results of the subjective experiment, an optimal filter configuration for smoothing the concatenated synthetic visual speech can be determined: the least conservative configuration that was found not to significantly affect original speech signals is most suited to reduce the unnatural high-frequency information while modifying only a minimal amount of useful speech information.

Table 4.4: Test results obtained by the subjective evaluation of the low-pass filtered original parameter trajectories. The last column indicates for each filter configuration the percentage of the total evaluations that reported that the filtered sample was at least as good as the original sample.
Shape filter   Texture filter   % OK
none           60               93
90             60               90
80             60               64
70             none             86
70             70               79
70             60               36

Similar to the concatenation smoothing strength, the strength of the spectral smoothing is diversified among the various AAM parameters. To this end, less conservative filters are applied to the texture parameter trajectories than to the shape parameter trajectories. In addition, the shape/texture parameters are split into two groups based on their correlation with the speech information (the same groups as described in section 4.4.3 were used). A stronger spectral smoothing is applied to the trajectories of the least speech-correlated model parameters than to the trajectories of the parameters that are most correlated with the speech information. Table 4.5 summarizes the filter settings that are used in the AAM-based AVTTS strategy. An illustration of the spectral smoothing technique is given in figure 4.9.

Table 4.5: Optimal filter settings for modifying the synthetic visual speech.
Type      Speech Correlation   Filter
Shape     High                 90
Shape     Low                  70
Texture   High                 80
Texture   Low                  70

Figure 4.9: Spectral smoothing of a parameter trajectory (parameter value plotted against frame index). The black curve illustrates part of an original trajectory of a database sentence. The red curve represents the corresponding synthesized trajectory that was generated by the AAM-based AVTTS system; the original database text was used as input and the corresponding database speech was excluded from selection. The phoneme durations in the synthesized trajectory were synchronized with the phoneme durations in the original speech. The green curve shows the spectrally smoothed version of the synthesized trajectory. The blue rectangles indicate typical benefits of the spectral smoothing technique: local discontinuities are removed and "overshoots" are suppressed. The red square shows a more severe discrepancy between the original (black) and synthesized (red) curves. Unfortunately, such differences cannot be removed by a "safe" spectral smoothing strength.

4.5 Evaluation of the AAM-based AVTTS approach

This section describes two subjective perception experiments that evaluate the proposed AAM-based AVTTS system. The attainable synthesis quality of this synthesis strategy is compared with the synthesis quality of the initial AVTTS approach and with original speech samples. The first experiment involves visual speech-only samples and the second experiment involves audiovisual speech samples.
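Both experiments below analyse the collected MOS ratings with a Friedman test followed by pairwise Wilcoxon signed-rank tests. Purely as an illustration (this is not the analysis script used for the thesis), such an analysis could be run with SciPy as sketched here; the assumed data layout, one row per rated sentence/participant combination and one column per test group, is a hypothetical choice.

import numpy as np
from scipy.stats import friedmanchisquare, wilcoxon

def analyse_mos_ratings(ratings):
    """ratings: array of shape (n_evaluations, 3); columns hold the ORI, OLD and AAM scores."""
    ratings = np.asarray(ratings, dtype=float)
    ori, old, aam = ratings[:, 0], ratings[:, 1], ratings[:, 2]

    # Omnibus test: is there any difference among the three groups?
    chi2, p_omnibus = friedmanchisquare(ori, old, aam)
    print(f"Friedman: chi2(2) = {chi2:.1f}, p = {p_omnibus:.4f}")

    # Post-hoc pairwise comparisons on the paired ratings.
    for name, (a, b) in {"ORI vs OLD": (ori, old),
                         "ORI vs AAM": (ori, aam),
                         "AAM vs OLD": (aam, old)}.items():
        stat, p = wilcoxon(a, b)
        print(f"{name}: W = {stat:.1f}, p = {p:.4f}")
    return p_omnibus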
4.5.1 Visual speech-only

4.5.1.1 Test setup

A first experiment evaluated whether the AAM-based synthesis approach is able to improve the individual quality of the synthetic visual speech mode compared with the initial AVTTS synthesizer (which was described in chapter 3). Ten English sentences were randomly selected from the LIPS2008 database. The samples of a reference test group, referred to as group "ORI", were created by generating for each sentence a new mouth-signal by the inverse AAM-projection of the parameter trajectories contained in the database. Afterwards, this mouth-signal was merged with a background video signal displaying the other parts of the face of the virtual speaker (similar to what is done by the AVTTS system when synthesizing a novel sentence). A second category of samples, referred to as group "OLD", was created by resynthesizing each sentence using the initial AVTTS system. The original database text was used as input and the original speech data corresponding to each particular sentence was excluded from selection. A third group of samples, referred to as group "AAM", was created by resynthesizing each sentence using the AAM-based AVTTS system. Optimal settings for the various enhancements to the AAM-based synthesis (see section 4.4) were applied. For all three groups, the test samples used in this experiment were "mute" speech signals containing only a visual speech mode. To this end, for groups OLD and AAM the synthetic auditory speech mode that was synthesized simultaneously with the visual mode was removed from the speech sample.

The 30 samples (3 groups of 10 sentences each) were shown consecutively to the participants. The order of both the sentences and the test groups was randomized. The participants were asked to rate the naturalness of the displayed mouth movements on a 10-point MOS scale [0.5, 5], with rating 0.5 meaning that the visual speech appears very unnatural (low quality) and rating 5 meaning that the visual speech looks completely natural (excellent quality). The text corresponding to the speech in each sample was shown. An example audiovisual recording from the LIPS2008 database was given to illustrate the speaking style of the original speaker. The participants were given some points of interest to take into account while evaluating the samples, such as "Are the variations of the lips, the teeth and the tongue in correspondence with the given text?" and "Are the displayed speech gestures similar to the gestures seen in original visual speech and are they as smooth as you would expect them to be in an original speech signal?".

4.5.1.2 Participants and results

Nine people participated in the experiment (6 male, 3 female); two of them were aged [56-58] and the others [21-35]. Six of them can be considered speech technology experts. The results obtained are visualized in figure 4.10.

Figure 4.10: Boxplot summarizing the ratings obtained for each test group in the visual speech-only experiment.

A Friedman test indicated significant differences among the answers reported for each test group (χ2(2) = 103; p < 0.001). An analysis using Wilcoxon signed-rank tests indicated that the ratings obtained for the ORI group were significantly higher than the ratings obtained for the OLD group (Z = −7.52; p < 0.001). The ratings obtained for the AAM group were also significantly higher than the ratings obtained for the OLD group (Z = −7.60; p < 0.001).
No significant difference could be measured between the ratings for the ORI and the AAM group (Z = −0.740; p = 0.459).

4.5.1.3 Discussion

This experiment assessed for each category of samples the perceived individual quality of the visual speech mode. The results obtained unequivocally show that the AAM-based AVTTS synthesizer performs better in mimicking an original visual speech signal than the initial AVTTS synthesis approach. This is due to the fact that the AAM-based approach is able to generate a smoother signal without affecting the visual articulation strength. On the other hand, it appears that even the samples from the ORI group were not always given excellent ratings. This is probably due to the moderate loss of image detail caused by the inverse AAM projection. In addition, the merging of the mouth-signal with the background video might also have reduced the similarity with an original video recording. Note, however, that the participants considered these flaws less disturbing than the limited smoothness of the visual speech in the OLD group (as indicated by the lower ratings obtained for that group). Earlier in this thesis it was argued that it has to be ensured that no optimization of the AVTTS synthesis approach affects the audiovisual coherence of the audiovisual output signal. Therefore, another experiment needs to be conducted in which the perceived quality of the audiovisual speech signals is assessed instead.

4.5.2 Audiovisual speech

4.5.2.1 Test setup

A second perception experiment was conducted in which the same categories of test samples that were used in the visual speech-only experiment were evaluated: group ORI, group OLD, and group AAM (see section 4.5.1.1). In this test, the complete audiovisual signals were shown to the participants. To this end, the visual speech data from the previously used samples from the ORI group was displayed simultaneously with the corresponding original auditory speech from the LIPS2008 database. In addition, the visual speech modes of the previously used samples from the OLD group and the AAM group were reunited with their corresponding synthetic auditory speech signals (which had been generated synchronously with the synthetic visual speech by the audiovisual segment selection procedure). Three separate aspects of the audiovisual speech quality were evaluated in the experiment:

A - Naturalness of the mouth movements
This aspect is exactly the same as the property that was evaluated in the visual speech-only experiment. The same points of interest as mentioned in section 4.5.1.1 were given to the test subjects. They were instructed to rate the visual speech individually by ignoring the auditory speech mode of each sample. However, they were instructed not to turn off the volume, so that the presented auditory speech mode could still (unconsciously) influence their ratings.

B - Audiovisual coherence
The participants were asked to rate the coherence between the presented auditory and visual speech modes. They were told that the key question to answer was "Is it plausible that the woman who is displayed in the video could have actually produced the auditory speech that you hear?".

C - Quality and acceptability of the audiovisual speech
The last aspect was a high-level evaluation of all aspects concerning the presented audiovisual speech samples.
The participants were asked to what extent they liked the speech sample in general and how suitable they considered the sample for use in a real-world avatar application. It was explained that for this rating, they had to ask themselves the following question: "Is the multimodal speech (audio + video + combination audio-video) sufficiently understandable, clear and natural as you would expect it to be for a real application?". A 10-point MOS scale [0.5, 5] was used to rate each aspect, with rating 0.5 meaning that the visual speech appears very unnatural (aspect A), that the audiovisual coherence is very poor (aspect B), or that the sample is not suited for use in a real application due to a very low overall quality (aspect C). For each aspect, rating 5 means that the sample appears like an ideal high-quality original audiovisual speech signal.

4.5.2.2 Participants and results

Eight people participated in the experiment (5 male, 3 female), one of them aged 58 and the others aged [21-31]. Five of them can be considered speech technology experts. The results obtained are visualized in figure 4.11.

For aspect A, a Friedman test indicated significant differences among the answers reported for each test group (χ2(2) = 125; p < 0.001). An analysis using Wilcoxon signed-rank tests pointed out that the ratings obtained for the ORI group were significantly higher than the ratings obtained for the AAM group (Z = −5.89; p < 0.001) and the ratings obtained for the OLD group (Z = −7.66; p < 0.001). In addition, the ratings obtained for the AAM group were significantly higher than the ratings obtained for the OLD group (Z = −7.13; p < 0.001). A similar analysis was conducted for the results obtained for aspect B. A Friedman test indicated significant differences among the answers reported for each test group (χ2(2) = 103; p < 0.001). An analysis using Wilcoxon signed-rank tests pointed out that the ratings obtained for the ORI group were significantly higher than the ratings obtained for the AAM group (Z = −7.19; p < 0.001) and the ratings obtained for the OLD group (Z = −7.57; p < 0.001). In addition, the ratings obtained for the AAM group were significantly higher than the ratings obtained for the OLD group (Z = −2.43; p = 0.015). Finally, for aspect C, a Friedman test indicated significant differences among the answers reported for each test group (χ2(2) = 133; p < 0.001). An analysis using Wilcoxon signed-rank tests pointed out that the ratings obtained for the ORI group were significantly higher than the ratings obtained for the AAM group (Z = −7.46; p < 0.001) and the ratings obtained for the OLD group (Z = −7.84; p < 0.001). In addition, the ratings obtained for the AAM group were significantly higher than the ratings obtained for the OLD group (Z = −5.22; p < 0.001).

4.5.2.3 Discussion

Aspect A evaluated exactly the same property as the subjective test described in section 4.5.1. However, if the results obtained for these two experiments are compared (figures 4.10 and 4.11 (top)), an important difference is noticeable. Whereas for the visual speech-only samples the ORI group and the AAM group were rated equally high, in the audiovisual test the visual speech mode of the samples from the ORI group was rated higher than the visual speech mode of the samples from the AAM group.
This means that the participants, although instructed to take only the visual speech mode into account, were unconsciously influenced by the accompanying auditory speech mode while rating the samples. This explains the higher ratings for the ORI group in the audiovisual experiment, since these samples contained original auditory speech, while the auditory speech mode of the AAM group consisted of less optimal synthesized auditory speech signals. In the audiovisual test, the ORI group was given very high ratings (mean = 4.6, median = 5), which means that the inverse AAM projection of the database trajectories is capable of generating sufficiently accurate image data for use in an audiovisual speech signal. Both in the visual speech-only and in the audiovisual experiment, the AAM group was given higher ratings than the OLD group. This means that the AAM-based AVTTS approach indeed improves the individual quality of the synthetic visual speech without affecting the coherence between the two synthetic speech modes (since the observed enhancement holds in the audiovisual case as well).

Figure 4.11: Boxplots summarizing for each aspect the ratings obtained for the three categories of audiovisual speech samples. From top to bottom: aspect A (naturalness of the speech gestures), aspect B (audiovisual coherence), and aspect C (overall acceptability).

Aspect B evaluated the perceived level of audiovisual coherence. A similar evaluation of the speech signals created by the initial AVTTS system was already described in section 3.6.2. Similar to the results obtained in that experiment, the ratings obtained for aspect B show that the audiovisual coherence observed in synthesized speech is lower than the audiovisual coherence observed in original speech signals. This can be due to the local audiovisual incoherences that occur around the join positions, but it is also likely that the subjective perception of the level of audiovisual coherence in synthesized speech is affected by the overall lower degree of naturalness of this category of speech signals. On the other hand, the audiovisual coherence of the ORI group was rated very high (mean = 4.7, median = 5), which means that the inverse AAM projection that was needed to reconstruct the original visual speech mode does not affect the perceived coherence between the original speech modes. The audiovisual coherence of the samples from the AAM group was rated higher than the coherence of the samples from the OLD group, which means that the AAM-based optimizations to the synthesis strategy increase the perceived coherence between the two synthetic speech modes. This is perhaps an unexpected result, since the level of audiovisual coherence in the output of the initial AVTTS system is in fact slightly higher, as these samples contain original video recordings instead of AAM-reconstructed visual signals. This again indicates that the individual quality of the presented speech modes influences the perceived level of audiovisual coherence between these two signals. Aspect C evaluated the overall quality of the presented audiovisual speech and its applicability in a real application. As could be expected, the ORI group was given the highest ratings, since especially the auditory speech mode of these samples is much better than the synthesized auditory mode of the samples from the OLD and the AAM group.
Since the ORI group was given very high ratings (mean = 4.8, median = 5), the AAM-based representation of the original visual speech information appears to be appropriate for use in real applications. Recall that each sample from the ORI group was constructed by merging the regenerated original mouth-signal with a background video signal. The ratings obtained in this experiment indicate that this approach can safely be applied without affecting the acceptability of the final visual speech signal. In addition, since the ratings obtained for the AAM group were higher than the ratings obtained for the OLD group, it appears that the AAM-based optimizations to the AVTTS strategy improve the overall quality of the synthetic audiovisual speech by an appropriate unimodal enhancement of the synthetic visual speech mode. Unfortunately, the ratings obtained for the AAM group (mean = 3.1, median = 3) were still lower than the ratings obtained for the original speech samples from the ORI group, which means that further improvements to the AVTTS synthesis technique are needed to reach the quality level of original speech signals.

4.6 Summary and conclusions

Chapter 3 proposed a single-phase AVTTS synthesis approach that is promising for achieving high-quality audiovisual speech synthesis, since it is able to maximise the coherence between the synthetic auditory and the synthetic visual speech mode. This chapter elaborated on an optimization of the synthesis strategy that enhances the individual quality of the synthetic visual speech mode. The proposed optimization only minimally affects the coherence between both synthetic output speech modes. An Active Appearance Model is used to describe the original visual speech recordings, which makes it possible to parameterize the shape and the texture properties of the original visual speech information individually. The AAM-based AVTTS synthesizer constructs the synthetic output speech by concatenating original combinations of auditory speech and AAM parameter sub-trajectories. The AAM-based representation of the original visual speech makes it possible to define accurate selection costs that take visual coarticulation effects into account. Additional optimizations to the visual synthesis have been developed, such as the removal of undesired non-speech related variations from the original visual speech corpus by a normalisation of the AAM parameters. In addition, a diversified visual concatenation smoothing strength increases the continuity and the smoothness of the synthetic visual speech signals without affecting the visual articulation strength. Finally, a spectral smoothing technique removes over-articulations that can occur in the concatenated visual speech signal. The synthetic speech produced by the AAM-based AVTTS system was subjectively evaluated and compared with original AAM-reconstructed audiovisual speech signals and with the output of the initial AVTTS system that was proposed in chapter 3. The experiments showed that the AAM-based representation of the original visual speech is appropriate to regenerate accurate visual speech information. The results obtained also show that the AAM-based synthesis improves the quality of the synthetic visual speech mode compared to visual speech synthesized using the initial AVTTS approach.
Moreover, it has been shown that the audiovisual speech quality attainable by the AAM-based AVTTS system is higher than that attainable by the initial AVTTS approach. Unfortunately, the synthetic audiovisual speech is still clearly distinguishable from original audiovisual speech recordings. This is especially true for the auditory mode of the synthetic speech. Recall that the LIPS2008 database, which has been used to evaluate the AAM-based AVTTS approach, only contains about 23 minutes of original English speech recordings. This is far below the general rule of thumb that a database suited for auditory concatenative speech synthesis should contain between one and two hours of original continuous speech in order to be able to synthesize high-quality speech samples. Therefore, a more extensive speech database will be necessary in order to increase the overall synthesis quality of the AAM-based AVTTS system. Not only should this database contain substantially more original speech data than the LIPS2008 database, it should also exhibit higher-quality visual speech recordings that capture all the fine details of the original visual speech information. Some of the techniques, experiments and results mentioned in this chapter have been published in [Mattheyses et al., 2010a] and [Mattheyses et al., 2010b].

5 High-quality AVTTS synthesis for Dutch

5.1 Motivation

The observers who rated the synthetic audiovisual speech signals created by the AAM-based AVTTS synthesis strategy proposed in the previous chapter reported two major issues that distinguished the synthesized speech from original speech signals. First, the synthetic auditory speech mode appeared too jerky and it often exhibited non-optimal prosody. Furthermore, some details were missing in the presentation of the virtual speaker, such as a sharp representation of the teeth and the tongue. This implied that the synthesized video signal could only be presented compactly to the observers (i.e., displayed at a small screen size or in a small video window). These problems can to a large extent be resolved by providing the synthesizer with an improved audiovisual speech database that contains more and higher-quality audiovisual speech recordings than the LIPS2008 database. Unfortunately, only very few good-quality audiovisual speech databases are available for TTS research, partly because every database appropriate for the AVTTS system has to exhibit very specific properties (as explained in section 3.3.1): single speaker, fixed head orientation, fixed recording conditions, etc. It can also be noticed that the great majority of the research on auditory/visual/audiovisual speech synthesis reported in the literature involves synthesis for the English language. Apart from some commercial black-box multilingual auditory TTS systems, in recent research only the NeXTeNS project [Kerkhoff and Marsi, 2002] focuses on the Dutch language for academic TTS synthesis. This implies that speech databases suited for speech synthesis research for Dutch are also very scarce. For research in the field of Dutch audiovisual speech synthesis, no such database exists at all.
For these reasons, it was decided to build a completely new Dutch audiovisual speech database that is suited for performing high-quality audiovisual speech synthesis by means of the AAM-based AVTTS synthesizer that was proposed in the previous chapter.

5.2 Database construction

This section describes the various steps needed to construct the new audiovisual speech database, such as the preparation of the text corpus, the recording of the original audiovisual speech and the post-processing of the data to make speech synthesis possible. It was decided to design the database in two parts: one part that can be used for limited domain synthesis and another part that is suitable for the synthesis of sentences from the open domain. Limited domain speech synthesis means that the speech database that is provided to the synthesizer mainly contains sentences from one typical domain (e.g., football reports, expressions for a talking clock, etc.). This has the benefit that when the target sentence, given as input to the TTS synthesizer, also fits the limited domain of the speech database, many database segments matching each synthesis target can be found. This leads to the selection of longer original segments that exhibit highly appropriate prosodic features, which makes it possible to attain a high-quality synthesis result. By partly designing the new Dutch database as a limited domain database, it becomes possible to investigate the attainable synthesis quality in both the open domain and the limited domain.

5.2.1 Text selection

5.2.1.1 Domain-specific

The domain-specific sentences were taken from a corpus containing one year of Flemish weather forecasts, kindly provided by the Royal Meteorological Institute of Belgium (RMI). A subset of 450 sentences was uniformly sampled from this weather corpus. As the original corpus was chronologically organized, this uniform sampling resulted in a subset covering weather forecasts from each meteorological season. In addition, a collection of all important words involving the weather was gathered, such as "regen" (rain), "sneeuw" (snow), "onweer" (thunderstorm), etc. These words were added to the recording text using carrier sentences. Finally, some slot-and-filler type sentences were included in the recording text, such as "Op vrijdag ligt de minimum temperatuur tussen 9 en 10 graden" (On Friday, the minimum temperature will be between 9 and 10 degrees).

5.2.1.2 Open domain

The open domain sentences were selected from the Leipzig Corpora Collection [Biemann et al., 2007], which contains 100,000 Dutch sentences taken from various sources such as newspapers, magazines, cooking recipes, etc. To analyse this text corpus, two separate Dutch lexicons were used. The Kunlex lexicon is a Dutch lexicon that is provided with the NeXTeNS distribution [Kerkhoff and Marsi, 2002]. Unfortunately, this lexicon is optimized for Northern Dutch (the variant of Dutch spoken in The Netherlands) and not for Flemish (the variant of Dutch spoken in Belgium). For this reason, the text corpus was also analysed using the Fonilex lexicon [Mertens and Vercammen, 1998]. This lexicon, originally constructed for speech recognition purposes, is based on the Northern Dutch Celex lexicon [Baayen et al., 1995], but it contains Flemish pronunciations and it lists pronunciation variants whenever these exist.
The original Fonilex lexicon was converted into a format that is more suitable for use with the AVTTS system, which involved the following changes:

Phone set adaptation: The original phone set used in the Fonilex lexicon was adjusted by adding false and real diphthongs, as these sounds are sensitive to concatenation mismatches. On the other hand, glottal stops were removed from the phone set, as these are quite rare in standard Flemish pronunciation. Also, additional diacritic symbols (which indicate whether a phone is nasalized, long or voiceless) were not taken into account.

Adding part-of-speech information: The Fonilex lexicon does not list part-of-speech information for each word, which is needed in order to distinguish homographs (see section 3.2.1). This information was extracted from the original Celex lexicon and added to the adapted Fonilex lexicon.

Adding syllabification: As the original Fonilex lexicon does not list syllable boundaries, a simple rule-based syllabification algorithm was implemented to add syllable information to the lexicon. Unfortunately, the original syllabification information from the Celex lexicon could not be transferred to the adapted Fonilex lexicon, since for too many entries the Fonilex and the Celex phoneme transcripts differed too much.

Adding additional entries: About 400 words that occur in the final text corpus that was selected for recording were manually added to the lexicon. These words included numbers, abbreviations and missing compound words.

To select an appropriate subset of sentences from the original large text corpus, multiple greedy text selection algorithms were used [Van Santen and Buchsbaum, 1997]; a minimal sketch of such a greedy selection is given below. While selecting the text, all possible pronunciation variants of each lexicon entry were taken into account. It was ensured that no sentence was selected twice, and the selection of long sentences (e.g., longer than 25 words) was discouraged, since these sentences are harder to utter without speaking errors. In a first stage, two subsets were extracted from the corpus in order to attain full phoneme coverage, using the Kunlex and the adapted Fonilex lexicon, respectively. Similarly, two other subsets were extracted in order to attain complete silence-phoneme and phoneme-silence diphone coverage. The speech recordings corresponding to these four subsets, containing 75 sentences in total, will have to be added to each future sub-database to ensure a minimal coverage for speech synthesis in Dutch. To select additional sentences for recording, only the adapted Fonilex lexicon was used. The text corpus was split into two equal parts and two selection algorithms were applied to each part. A first algorithm selected a subset of sentences that maximally covers all diphones existing in original Dutch speech. Subsequently, a second selection algorithm was based on diphone frequency: sentences were selected such that the relative diphone frequency in the selected text subset is similar to the diphone frequency in the complete corpus. In a final stage, a manually selected subset of the text corpus was added to the recording text. This manual selection was based on the avoidance of given names, numbers, dates and other irregular words in the sentences. The final text subset contained about 1500 distinct Dutch diphones, which is the same as the number of distinct diphones that was found in the original text corpus. Note that in theory, more diphones exist in the Dutch language.
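As an illustration of the coverage-driven selection mentioned above, the following sketch greedily picks, at each step, the sentence that adds the most not-yet-covered diphones. It is a simplified stand-in for the algorithms of [Van Santen and Buchsbaum, 1997], and the assumed data layout (each candidate sentence mapped to the set of diphones it contains) is hypothetical.

def greedy_diphone_selection(candidates):
    """candidates: dict mapping sentence text -> set of diphone strings it contains."""
    covered, selected = set(), []
    remaining = dict(candidates)
    while remaining:
        # Pick the sentence that contributes the largest number of new diphones.
        best, best_gain = None, 0
        for text, diphones in remaining.items():
            gain = len(diphones - covered)
            if gain > best_gain:
                best, best_gain = text, gain
        if best is None:          # no sentence adds new coverage: stop selecting
            break
        selected.append(best)
        covered |= remaining.pop(best)
        # Constraints such as skipping sentences longer than 25 words would go here.
    return selected, covered

A frequency-balancing pass, in the spirit of the second selection algorithm, could replace the gain criterion by the reduction in distance between the subset's diphone distribution and that of the full corpus.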
By counting all diphones defined by the Fonilex lexicon (inter-word, intra-word, and word-silence/silence-word), about 2200 distinct diphones were found. However, not all of these diphones occur in original speech, since not all word-word combinations are possible. In fact, 1500 is a large number of diphones compared to the databases used in most other research. For instance, in [Smits et al., 2003] the perception of diphones in the Dutch language was investigated using only about 1100 distinct diphones.

5.2.1.3 Additional data

With a view to further research on speech synthesis, two extra subsets were added to the recording text. Six paragraphs taken from online news reports were added, which were shown to the speaker both as isolated sentences and as complete paragraphs. These speech recordings can be used to explore the difference between synthesizing isolated sentences and synthesizing whole paragraphs at once. In addition, a few sentences containing typical "fillers" were added to the recording text, such as "Ik kan ook ... X ... hoor" ("I can ... X ... too") with X representing a laugh, a cough, a gulp, a sigh, etc. These fillers can be used to increase the expressiveness of the synthetic speech.

5.2.2 Recordings

The database was recorded in a professional audiovisual studio located at the university campus [AV Lab, 2013]. The voice talent was a 23-year-old native Dutch-speaking woman, a semi-professional speaker who received a degree in discourse. The speech was recorded in newsreader style, which means that the speaker sits on a fixed chair in front of the camera. The text cues were given to the speaker in batches of five isolated sentences using a prompter. The acoustic speech signal was recorded using multiple microphones placed on a stand in front of the speaker (out of sight of the camera): two TRAM small-diaphragm omni-directional condenser microphones placed next to each other, a Neumann U87 large-diaphragm condenser microphone in cardioid mode, and an Audio-Technica AT897 hyper-cardioid microphone. The visual speech signal was recorded using a Sony PMW-EX3 camera at 59.94 progressive frames per second and a resolution of 1280x720 pixels. The camera was swivelled to portrait orientation. The focus, exposure and colour balance were manually calibrated and kept constant throughout the recordings. The talent was recorded in front of a blue screen, on which several markers had been attached around the speaker's head. In addition, some markers were placed on the neck of the speaker. The reverberation time (RT60) of the recording room was tuned to a low value of 150 ms at 1000 Hz. Some pictures illustrating the recording setup are given in figures 5.1 and 5.2. The recordings took two complete days, throughout which the recording conditions were kept as constant as possible. In total, more than 2 TB of speech data was captured. The audio was sampled at 48 kHz and stored as WAV files using 24 bits per sample. The video was stored both as raw uncompressed video data and as H.264-compressed AVI files. The database was manually segmented at sentence level, in the course of which some erroneous recordings were omitted. Finally, the total amount of available speech data consisted of 1199 audiovisual sentences (138 minutes) from the open domain and 536 audiovisual sentences (52 minutes) from the limited domain of weather forecasts. From this point on, the database will be referred to as the "AVKH" dataset.
Two example frames from the database are shown in figure 5.3.

Figure 5.1: Overview of the recording setup. The top figure shows the sound-proof recording room, the acoustic panels that adjust the room RT60, the fixed camera with the attached prompter, the lights and the reflector screen that illuminate the voice talent, and the sound-proof window at the back that gives sight to the control room. The bottom figure shows the positioning of the voice talent. Notice the uniform illumination of the face, the positioning of the microphones and the separately illuminated blue-key background with attached yellow markers.

Figure 5.2: Some details of the recording setup. The top figure shows the four microphones that were used to record the auditory speech. The middle figure shows the real-time visual feedback that allowed the voice talent to reposition herself. The bottom figure shows the control-room monitoring of the recorded audiovisual signals and the controlling of the prompter.

Figure 5.3: Example frames from the "AVKH" audiovisual speech database.

5.2.3 Post-processing

5.2.3.1 Acoustic signals

As section 5.2.2 explained, the auditory speech was recorded using multiple microphones. This has the benefit that, after the recording sessions, the most appropriate microphone signal can be selected as the final auditory database speech. This way, the database will exhibit the best possible acoustic signal quality, and in addition, there always exists a backup signal in case one of the recorded acoustic signals turns out to be disrupted (e.g., by a microphone failure, interference artefacts in the signal, etc.). In a first step, the acoustic signals of the two small TRAM microphones were summed in order to increase their signal quality. After a manual inspection of the recorded acoustic signals, it was decided to use the acoustic signal recorded by the Neumann U87 microphone to construct the database audio, as this signal exhibited the clearest and most natural voice reproduction.

The acoustic signals contained in the database were analysed to obtain the appropriate meta-data describing the auditory speech information that can be used by the AVTTS synthesizer (see section 3.3.3). In a first stage, each sentence was phonemically segmented and labelled using the open-source speech recognition toolkit SPRAAK [Demuynck et al., 2008]. To this end, a standard 5-state left-to-right context-independent phone model was used, without skip states. The acoustic signals were divided into 25 ms frames with a 5 ms frame shift, after which for each frame 12 MFCCs and their first and second order derivatives were extracted (the audio was downsampled to 16 kHz for this calculation). A baseline multi-speaker acoustic model was used to bootstrap the acoustic model training. After the phonemic labelling stage, the appropriate symbolic features were gathered for each phone-sized segment of the database (see table 3.1). Further analysis consisted of the training of a speaker-dependent phrase break model based on the silences occurring in the recorded auditory speech signals [Latacz et al., 2008]. Several acoustic parameters were calculated, such as the minimum (100 Hz) and maximum (300 Hz) f0 of the recorded auditory speech, MFCC coefficients, pitch mark locations, f0 contours, and energy information. Finally, the acoustic signals were described by means of STRAIGHT parameters [Kawahara et al., 1999].
The STRAIGHT features (spectrum, aperiodicity and f0) were calculated with a 5 ms frame shift. Apart from the parameters for minimum and maximum f0, the default STRAIGHT settings were used.

5.2.3.2 Video signals

Analysis of the mouth area

In order to be able to use the AVKH database in an AAM-based synthesis approach, such as proposed in chapter 4, the recorded visual speech information has to be parameterized using an AAM. To this end, a new AAM was built to represent the mouth area of the captured video frames. This approach is similar to the AAM that was built on the frames from the LIPS2008 database (see section 4.3.3); note, however, that the visual speech from the AVKH database is captured at a higher resolution and contains much more image detail thanks to the higher-quality recording set-up. This makes it possible to build an AAM that is capable of generating a synthetic visual speech signal containing a more detailed representation of the virtual speaker. Obviously, this is only feasible when the model is trained using an appropriate set of training images provided with very accurate ground-truth shape information.

The shape information consisted of 28 landmarks that indicate the outer lip contour and 12 landmarks that indicate the inner lip contour. It was ensured that the distribution of the landmarks on the outer lip contour was consistent over the training set. The location of the face is denoted by 6 landmarks indicating the position of the cheeks, the chin and the nasal septum, together with 3 landmark points located on coloured markers that were put on the neck of the speaker. Finally, in order to be able to accurately model and reconstruct the appearance of the teeth, 5 landmarks were used to indicate the location of the upper incisors. In addition, the landmarks denoting the inside of the upper lip were positioned with respect to the location of the upper incisors, and the landmarks denoting the inside of the lower lip were positioned with respect to the location of the lower incisors and the lower canines. Note that the teeth of the original speaker are not visible in every recorded video frame. For those training images displaying no (upper) teeth, the upper-incisor landmarks were positioned one pixel below the corresponding landmarks on the inside of the upper lip. Similar to the technique that is used to optimize the reconstruction of a closed mouth (see section 4.3.3), while generating an image from a set of model parameters it is ensured that the upper-incisor landmarks are always located at least one pixel below the corresponding landmarks on the inside of the upper lip. All texture information inside the convex hull denoted by the landmarks is modelled by the AAM. Figure 5.4 illustrates an original video frame and its associated landmark positions.

Figure 5.4: Landmark information for the AVKH database. The left panel illustrates an original video frame and its shape information that is modelled by the AAM. The right panel shows a detail of the frame illustrating the shape information associated with the mouth of the speaker.

To build the AAM, the iterative technique that was described in section 4.3.3 was used. The model was built to retain 98% of the shape information and 99% of the texture information from the training set, resulting in 13 shape parameters and 120 texture parameters that define the AAM.
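The "retain x% of the variation" choice is what fixes the number of model parameters. Purely as an illustration of that idea, and not of the actual AAM training procedure, the sketch below uses scikit-learn's PCA, which accepts the fraction of variance to retain directly; the assumed input layout (one row of stacked landmark coordinates or warped pixel values per training frame) is hypothetical.

import numpy as np
from sklearn.decomposition import PCA

def fit_retained_variance_model(training_vectors, retain=0.98):
    """training_vectors: array of shape (n_frames, n_features)."""
    pca = PCA(n_components=retain)   # keep just enough components for `retain` variance
    parameters = pca.fit_transform(training_vectors)
    kept = pca.explained_variance_ratio_.sum()
    print(f"{pca.n_components_} components retain {kept:.3f} of the variance")
    return pca, parameters

# Projecting a new frame vector and reconstructing an approximation of it:
#   p = model.transform(frame_vector[None, :])
#   approx = model.inverse_transform(p)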
In addition, a combined model was calculated (omitting 1% of the variation), which resulted in 96 combined model parameters. Using the trained AAM, the mouth area of all recorded video frames was described by a set of shape/texture parameter values and by a set of combined model parameter values. Then, the normalization technique that was described in section 4.4.2 was applied in order to remove some of the non-speech related and thus undesired variations from the database. This resulted in the normalization of 4 shape parameters and 31 texture parameters. The normalization of the AAM parameters particularly removed small variations in the orientation of the speaker's face. Some typical original frames and the images regenerated from their corresponding AAM parameter values are given in figure 5.5.

Figure 5.5: The figures on the left are details of some typical original video frames from the AVKH database. The figures on the right show for each frame its reconstruction by the inverse AAM-projection of its parameter values. Much effort went into the definition of a consistent landmarking strategy that allows a detailed reproduction of the inside of the mouth.

In addition to the parameterization of the visual speech using AAMs, some other features of the recorded video frames were calculated, each expressing a particular visual property. A first feature consisted of the geometrical dimensions of the mouth of the original speaker (width and height), derived from the shape information associated with each video frame. Next, for each frame a ratio for the amount of visible teeth and a ratio for the amount of visible mouth cavity were determined using the histogram-based detection that was described in section 3.3.3.4. Finally, a feature that expresses for each video frame the visible amount of upper teeth (which are visible much more frequently than the lower teeth) was calculated from the frame's associated shape information.

Analysis of the complete face

In section 3.5.1 it was explained that the output video signal, generated in correspondence with the target speech by the AVTTS system, only displays the animation of the mouth area of the virtual speaker. This signal needs to be merged with a background video signal that displays the other parts of the face of the virtual speaker. Recall that for the speech synthesis based on the LIPS2008 database, this background signal was constructed using original video sequences from the database. For the synthesis based on the AVKH database, a more advanced strategy was developed that makes it easy to generate multiple custom background signals. To this end, a second AAM was built to parameterize the complete face of the original speaker. When this "face" AAM is used to represent a subset of the database by means of face parameter trajectories, new background signals can be created by concatenating multiple original face sub-trajectories, followed by the inverse AAM projection of the concatenated face parameter trajectories. This way, any target background behaviour (e.g., displaying an eye blink at predefined time instants) can be generated, as long as the behaviour is found in the original database and is modelled by the face AAM. The shape information used to build the face AAM consisted of 8 landmarks that indicate the eyes, 6 landmarks that indicate the eyebrows, 3 landmarks that indicate the nose and 3 landmarks that indicate the chin. In addition, 11 landmarks indicate the edge of the face in order to avoid a blurred reconstructed edge when the head position changes. To ensure that the complete face, including some parts of the background, is modelled by the AAM, 13 additional landmarks were used to define the convex hull denoting the texture information that is modelled by the AAM.
These landmarks are positioned on coloured markers that were put on the background and on the neck of the speaker. Note that the face AAM does not model shape information corresponding to the mouth area of the image, since this area was already accurately modelled by the mouth AAM. The face AAM was built using a similar iterative procedure as was used to build the "mouth" AAM. The model was built to retain 99% of the shape information and 99% of the texture information from the training set, resulting in 24 shape parameters and 27 texture parameters that define the AAM. Figure 5.6 illustrates an original video frame from the database and its corresponding face shape information, as well as the reconstruction of the original video frame by the inverse AAM projection of its face parameter values.

Figure 5.6: AAM-based representation of the complete face, illustrating an original frame from the database (left), its associated shape information (middle), and the reconstructed image from its face model parameter values (right).

5.3 AVTTS synthesis for Dutch

In order to perform audiovisual text-to-speech synthesis for Dutch, the AVKH database and its associated meta-data are provided to the AAM-based AVTTS system that was proposed in chapter 4. The synthesizer must also be provided with a Dutch front-end that performs the necessary linguistic processing on the input text (see section 3.2). To this end, a novel Dutch front-end was constructed by combining particular modules from the NeXTeNS TTS system [Kerkhoff and Marsi, 2002] with new modules that were designed in the scope of the laboratory's auditory TTS research. The details of these linguistic modules are beyond the scope of this thesis and the interested reader is referred to [Mattheyses et al., 2011a], [Latacz et al., 2008], and [Latacz, TBP]. Based on the parameters predicted by the front-end, the Dutch AVTTS system creates a novel audiovisual speech signal by concatenating audiovisual speech segments, containing an original combination of acoustic and visual speech information, that were selected from the AVKH audiovisual speech database. The same techniques for segment selection and segment concatenation that were discussed in section 4.3 are applied. All synthesis parameters (e.g., the factors that scale each selection cost between zero and one) were recalculated to obtain optimal values for synthesis based on the AVKH database. In addition, the optimizations to the synthesis that were discussed in sections 4.4.3 and 4.4.4 are applied to enhance the quality of the synthetic visual speech mode. The synthetic visual speech signal, displaying the variations of the mouth area in accordance with the target speech, is merged with a background video displaying the other parts of the face of the virtual speaker. This background video is generated using the face AAM, as was described in section 5.2.3.2. The background video is designed to exhibit a neutral visual prosody, only very limited head movements, and some random eye blinks.
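The following is a minimal sketch of how such a background trajectory could be assembled from face-AAM sub-trajectories; the data structures and the blink-scheduling logic are hypothetical illustrations of the strategy described here, not the thesis implementation (which would additionally smooth the joins as in section 4.4.3).

import numpy as np

def build_background_trajectory(neutral, blink, n_frames, n_blinks=2, rng=None):
    """neutral: (m, n_params) face-AAM sub-trajectory with neutral visual prosody.
    blink:   (k, n_params) face-AAM sub-trajectory containing one eye blink.
    Returns an (n_frames, n_params) background trajectory for one utterance."""
    rng = rng if rng is not None else np.random.default_rng()
    # Tile the neutral behaviour until the target length is reached
    # (a real implementation would smooth these concatenation points).
    reps = int(np.ceil(n_frames / len(neutral)))
    background = np.tile(neutral, (reps, 1))[:n_frames].copy()
    # Insert blinks at random instants, making sure every blink has
    # finished before the end of the utterance is reached.
    latest_start = n_frames - len(blink)
    if latest_start > 0:
        for start in sorted(rng.integers(0, latest_start, size=n_blinks)):
            background[start:start + len(blink)] = blink
    return background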
The time span between consecutive eye blinks is chosen similar to that observed between the eye blinks in the original speech, and it is ensured that each eye blink has ended before the end of the speech signal is reached. After both video signals have been merged, a final synthesis step consists in applying a chroma-key mask to extract the virtual speaker from the recording background, as illustrated in figure 5.7.

Figure 5.7: The left panel shows the merging of the mouth-signal (coloured) with the background-signal (grey); both are regenerated from AAM parameter values. The middle panel shows the resulting merged signal, and the right panel shows the final output of the AVTTS system after applying a chroma-key mask.

5.4 Evaluation of the Dutch AVTTS system

There were two arguments that motivated the construction of the AVKH database. First, the combination of this database with the AAM-based AVTTS synthesizer makes up the most advanced and most extensive academic AVTTS system for Dutch known to date. This enables numerous future research projects for which auditory, visual, or audiovisual speech synthesis for Dutch is necessary. Second, it is interesting to evaluate to which extent the quality of the synthetic audiovisual speech generated by the AVTTS system increases when the synthesizer is provided with a larger and higher-quality database than the LIPS2008 dataset.

A systematic comparison between the quality of the "LIPS2008-based" AVTTS synthesis output and the quality of the "AVKH-based" AVTTS synthesis output is very hard to realize, since both signals are very different. However, it is immediately clear that when the synthesizer is provided with the AVKH database, the synthesized video signals display more detailed visual speech information: the synthetic visual speech mode can be presented at a higher resolution (i.e., a larger video screen size) and it contains more accurate representations of the mouth area of the virtual speaker (especially the appearances of the lips, the teeth and the tongue have improved). Furthermore, by informally comparing the overall quality of the LIPS2008-based and the AVKH-based synthetic audiovisual speech signals, it is easily noticed that the use of the AVKH database drastically improved the audiovisual synthesis quality. Obviously, the highest quality is attained when the target sentence is from the limited domain of weather forecasts. When the target sentence is not from within this limited domain, a more fluctuating output quality can be observed: in general the attained quality is more than acceptable, although the quality has been found to suddenly drop for particular target sentences (this is especially true for the synthetic auditory speech mode). This is a common problem in the field of auditory speech synthesis: even the smallest local error in the synthetic speech signal degrades the perceived quality of the whole signal [Theobald and Matthews, 2012]. This means that a human observer will almost always be able to discern between real and synthesized audio(visual) speech samples, since even high-end state-of-the-art auditory TTS systems are unable to completely avoid local imperfections such as a sporadic concatenation artefact or the selection of an original segment that does not exhibit the optimal prosodic features. Only when the presented samples are fairly short is an almost perfect mimicking of original auditory speech possible.
To enhance (i.e., stabilize) the quality of the synthetic auditory speech mode generated by the Dutch AVTTS system, a manual optimization of the AVKH database is needed to check and correct all meta-data, such as the phoneme boundaries, the selection of the correct pronunciation variant for each word, the pitch mark locations, etc. Unfortunately, this is a very time-consuming task with little scientific contribution, which is why it is seldom performed for non-commercial synthesis systems.

No extensive subjective perception test was performed to assess the overall quality of the synthetic audiovisual speech created by the Dutch AVTTS system, since such an evaluation requires an appropriate baseline signal. Original speech fragments are not very suitable for this purpose since, as explained earlier in this section, it can be predicted that the test subjects would often be able to discern the synthesized samples from the original samples due to local artefacts in the auditory speech mode. Another interesting evaluation would be to compare the attainable synthesis quality with other state-of-the-art photorealistic AVTTS synthesis systems. However, such a comparison is far from straightforward, since no standard baseline system has been defined yet. Also, a fairer comparison would be to compare the attainable synthesis quality of various AVTTS approaches using the same original speech data (e.g., as was done in the LIPS2008 challenge [Theobald et al., 2008]). Finally, a formal comparison between the AVKH-based synthetic audiovisual speech and the LIPS2008-based synthetic audiovisual speech is also not essential: due to the huge difference in quality between the two speech databases, the test subjects would certainly prefer the syntheses based on the AVKH database. In addition, such a test would require the synthesis of the same sentences using both databases, which is impossible since the databases are used for synthesis in English and in Dutch, respectively. Therefore, it was decided to use an alternative approach to evaluate the attainable synthesis quality using the AVKH database.

5.4.1 Turing Test

5.4.1.1 Introduction

From the previous chapters it is known that the AAM-based AVTTS synthesis strategy is able to generate an audiovisual speech signal that exhibits a maximal coherence between both synthetic speech modes. This means that an appropriate estimate of the overall quality of the audiovisual output signal can be obtained by separately evaluating the individual quality of the auditory and the visual speech mode. In the scope of this thesis, an evaluation of the synthetic visual speech is described. Additional evaluations of the synthetic auditory speech are described in the scope of the laboratory's auditory TTS research [Latacz, TBP]. In order to evaluate the individual quality of the synthetic visual speech mode, a Turing scenario was applied. In this test strategy, the participants are shown several audiovisual speech samples that contain either original or synthesized speech signals. The participants have to report for each presented sample whether they believe it is original or synthesized speech, which means that for each answer there exists a 50% chance of guessing right and a 50% chance of guessing wrong.
The closer the overall percentage of wrong answers gets to 50%, the stronger the evidence that the test subjects could not distinguish between the original and the synthesized speech signals.

5.4.1.2 Test set-up and test samples

Fifteen sentences from the open domain were randomly selected from the AVKH database transcript. For each of these sentences, two test samples were generated. The first sample consisted of original audiovisual speech signals. It was constructed by first directly copying the acoustic signal from the database. To obtain the visual speech mode, the parameter trajectories from the database were inverse AAM-projected to generate a new sequence of video frames. This sequence defined the mouth-signal, which was then merged with a background signal created by the face AAM. Afterwards, a chroma-key mask was applied to create the final visual speech mode of the "original" test samples. This approach ensures that the original test samples exhibited the same image quality and a similar speaker representation as the video signals synthesized by the AVTTS system. To create a synthesized version of each test sentence, the AAM-based AVTTS system was provided with the complete AVKH database. The original database transcript was used as text input and the original speech data corresponding to each particular sentence was excluded from selection. After synthesis, only the visual speech mode of each synthetic audiovisual signal was used. These synthetic visual speech signals were synchronized with the corresponding original speech signals by time-scaling the synthesized parameter trajectories such that the duration of each phoneme in the synthetic speech matches the duration of the corresponding phoneme in the original speech. The final "synthetic" samples were created by multiplexing the time-scaled synthetic visual speech signals with the corresponding original acoustic speech signals.

Thirty samples (2 groups of 15 samples each) were shown consecutively to the participants. The question asked was simple: "Do you think the presented speech is original speech or synthesized speech?". The test subjects were informed that for each sample the auditory speech mode contained an original speech recording. It was stressed that no assumptions could be made about the number of original/synthesized samples used in the experiment. Two additional "original" samples were generated and displayed to the participants before the start of the experiment. This way, the subjects could familiarize themselves with the particular speech signals used in the experiment. The subjects were told to process the samples in order, without replaying earlier samples or revising earlier answers. They were allowed to play each sample at most three times. The order of the sentences and the order of the sample types was randomized.

5.4.1.3 Participants and results

Twenty-seven people participated in the experiment (15 male and 12 female, aged [23-60]). Seven of them can be considered speech experts or are very familiar with the synthetic speech produced by the AVTTS system (e.g., through repeated participation in earlier perception experiments involving the AVTTS synthesizer). All participants were native Dutch speakers. Table 5.1 summarizes the results obtained.

Table 5.1: Turing test results.
5.4.1.4 Discussion

A first observation is that more "original" responses than "synthesized" responses were obtained. This is reflected in the higher percentage of correct answers for the original samples compared to that for the synthesized samples. In total, 62% of the reported answers were correct. The results obtained were analysed using a by-subject (test subjects) and a by-item (test sentences) analysis to evaluate the hypothesis that the answers reported were completely random (i.e., the subjects could only guess about the nature of the samples presented). Both the by-subject analysis (t-test; df = 26; t = −5.45; p < 0.001) and the by-item analysis (t-test; df = 29; t = −4.32; p < 0.001) indicated that the answers reported cannot be considered completely random. Nevertheless, the results summarized in table 5.1 indicate that the subjects found it really difficult to distinguish a synthesized sample from an original sample. No significant difference was found between the answers reported by the male participants and those reported by the female participants (t-test; df = 24.9; t = −0.076; p = 0.94). In addition, by comparing the results obtained for the participants aged above 40 (13 subjects) and the participants aged under 40 (14 subjects), no significant influence of age on the subjects' performance could be found (t-test; df = 24.7; t = 0.137; p = 0.892).

On the other hand, the ratio of correct answers obtained by the "AVTTS experts" was significantly higher than the ratio of correct answers obtained by the non-experts (t-test; df = 7.98; t = −3.42; p = 0.009). This is visualized in figure 5.8.

Figure 5.8: Ratio of incorrect answers obtained by the experts and by the non-experts in the Turing test.

This difference is easy to explain, since the experts were much more familiar with the concept of synthetic audiovisual speech. Since they already knew the strong and the weak points of the AVTTS system, they could focus on particular aspects that help to distinguish the synthetic samples from the original samples. In fact, the results obtained for the non-expert participants can be considered more important, since this group better represents a general user of a future application of the AVTTS system. The results obtained for the non-experts are separately summarized in table 5.2.

Table 5.2: Turing test results for the non-experts.

Type          Total   Correct   % Correct
original      300     190       63%
synthesized   300     156       52%
total         600     346       58%

Figure 5.9 visualizes the ratio of wrong answers, obtained from the non-experts, for both the original and the synthesized samples used in the experiment.

Figure 5.9: Ratio of incorrect answers for each type of sentence obtained in the Turing test. Only the answers obtained from the non-experts are displayed.

Both a by-subject analysis (t-test; df = 19; t = −4.39; p < 0.001) and a by-item analysis (t-test; df = 29; t = −2.32; p = 0.027) indicated that the answers reported cannot be considered completely random. Nevertheless, notice from table 5.2 that 52% of the synthesized samples were perceived as an original speech signal. Given the rather large number of evaluations performed in the experiment, it can be concluded that the synthetic visual speech mode generated by the AVTTS system almost perfectly mimics the variations seen in original visual speech recordings.
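As an aside, the by-subject analyses reported above amount to testing each participant's proportion of correct answers against the 50% chance level. A minimal sketch of such a test is shown below; the response matrix is a random placeholder, not the actual experimental data.

```python
import numpy as np
from scipy import stats

# One row per participant: 1 = correct answer, 0 = wrong answer
# (illustrative placeholder data: 27 subjects, 30 samples each).
rng = np.random.default_rng(0)
responses = rng.integers(0, 2, size=(27, 30))

# Proportion of correct answers per subject.
per_subject = responses.mean(axis=1)

# One-sample t-test of the by-subject proportions against the
# 50% chance level (df = number of subjects - 1).
t_stat, p_value = stats.ttest_1samp(per_subject, popmean=0.5)
print(f"by-subject: t = {t_stat:.2f}, p = {p_value:.3g}")

# A by-item analysis works the same way, but averages over the
# participants for each of the 30 presented samples instead.
per_item = responses.mean(axis=0)
t_item, p_item = stats.ttest_1samp(per_item, popmean=0.5)
print(f"by-item: t = {t_item:.2f}, p = {p_item:.3g}")
```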
A manual inspection of the test results showed that some particular test samples were much more easily detected as being non-original compared to the other samples. It was found that the majority of these "bad" samples contained noticeable synthesis artefacts, such as erroneous video frames resulting from a bad AAM reconstruction. If these kinds of problems can be prevented in the future (e.g., by optimizing the database meta-data), even better results in the Turing experiment should be attainable.

Note that this experiment evaluated the quality of synthetic audiovisual speech signals containing a non-original combination of auditory and visual speech information. As was concluded earlier in this thesis, this is likely to have a negative effect on the perceived quality of the presented speech samples. This means that the perceived overall quality of the synthetic visual speech signals used in the Turing experiment would probably increase when these signals are displayed together with their corresponding synthetic auditory speech mode. On the other hand, evaluating the synthetic audiovisual speech in a Turing scenario would make the synthetic samples easier to detect (compared to the test results described above), since any synthesis artefact in either the auditory or the visual mode would be an immediate clue. In addition, because humans tend to be more sensitive to imperfections in the auditory speech mode, it is very hard for synthetic auditory or synthetic audiovisual speech to pass a Turing test unless short speech samples (e.g., isolated words or syllables) are presented.

5.4.2 Comparison between single-phase and two-phase audiovisual speech synthesis

5.4.2.1 Motivation

Recall that one of the main goals of this thesis is a general evaluation of the single-phase AVTTS synthesis paradigm. Section 3.6 described a subjective experiment that compared the perceived quality of synthetic audiovisual speech generated by various AVTTS synthesis approaches. In that experiment, the quality of the synthetic visual speech mode was rated highest when the visual speech was presented together with the most coherent auditory speech mode. The experiment motivated the further development of the single-phase AVTTS approach, since it was found that the standard two-phase synthesis approach, in which separate synthesizers and separate speech databases are used to generate the two synthetic speech modes, is likely to affect the perceived quality of the synthetic speech when the separately synthesized speech modes are synchronized and shown audiovisually to an observer. Note, however, that no subjective evaluation of the audiovisual speech quality was made, since the speech database that was used in the experiment was insufficient to generate a high-quality auditory speech mode. In addition, probably due to the limited size of the speech database used, no significant difference was measured between the synthesis quality of the proposed single-phase synthesis strategy and the synthesis quality of a two-phase synthesis approach that uses the same audiovisual speech database for the separate generation of both speech modes.
At this point in the research, the attainable synthesis quality has improved significantly, due to several optimizations such as the use of a parameterization of the visual speech information and the construction of a new, extensive audiovisual speech database. This makes it possible to perform a new subjective comparison between the attainable audiovisual synthesis quality using the proposed concatenative single-phase synthesis strategy and a two-phase synthesis strategy using original speech data from the same speaker in both synthesis stages.

5.4.2.2 Method and samples

The goal of this experiment is to evaluate the attainable audiovisual synthesis quality using a single-phase and a comparable two-phase AVTTS synthesis approach. To generate the speech samples, the Dutch version of the AVTTS system was provided with speech data from the AVKH speech database. The speech samples representing the single-phase synthesis were generated using the audiovisual unit selection approach described in section 5.3. A two-phase speech synthesis strategy was derived from the single-phase synthesizer, using the same principles as described in section 3.6.1. The only difference is that for the current experiment, the two separately synthesized speech modes are multiplexed by time-scaling the synthetic visual speech signal instead of using WSOLA to time-scale the auditory speech signal. The time-scaling of the synthetic visual speech signal is achieved by time-stretching the synthesized AAM parameter trajectories in order to impose the phoneme durations from the synthetic auditory speech on the viseme durations in the synthetic visual speech. This constitutes a "safer" synchronization, since it was noticed that time-scaling the auditory speech mode generates noticeable distortions more easily than time-scaling the visual speech signal.

The two-phase synthesizer was employed to generate a series of representative speech samples, of which the auditory speech mode was synthesized using both the limited domain speech data and half of the open domain speech data of the AVKH database. The visual speech mode was generated using the other half of the open domain speech data of the AVKH database. This strategy has the benefit that the two-phase samples are guaranteed to consist entirely of non-original combinations of acoustic and visual speech information. Given the large size of the AVKH database, half of this database still contains sufficient data to perform high-quality speech synthesis. The limited domain speech data was used to generate the auditory speech mode because auditory speech synthesis is generally more dependent on the availability of appropriate original speech data than the concatenative synthesis of visual speech. Likewise, the representative samples for the single-phase synthesis were synthesized using the same original speech data as was used for generating the auditory speech mode of the samples representing the two-phase synthesis approach. Fifteen medium-length (mean word count = 13) sentences from the open domain and fifteen medium-length (mean word count = 15) sentences from the limited domain of weather forecasts were manually constructed. The sentences were semantically meaningful and it was ensured that the sentences (or large parts of the sentences) did not appear in the transcript of the AVKH database.
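As a rough illustration of the synchronization used for the two-phase samples, the sketch below linearly resamples the AAM parameter sub-trajectory of each synthesized segment so that it matches the duration of the corresponding phoneme in the synthetic auditory speech. The frame rate and the helper functions are assumptions for illustration, not the exact implementation used in the system.

```python
import numpy as np

FPS = 25.0  # assumed video frame rate

def time_scale_segment(params: np.ndarray, target_dur: float) -> np.ndarray:
    """Linearly resample one AAM parameter sub-trajectory
    (n_frames x n_params) so that it lasts target_dur seconds."""
    n_in = params.shape[0]
    n_out = max(1, int(round(target_dur * FPS)))
    t_in = np.linspace(0.0, 1.0, n_in)
    t_out = np.linspace(0.0, 1.0, n_out)
    # Interpolate every AAM parameter dimension independently.
    return np.stack(
        [np.interp(t_out, t_in, params[:, k]) for k in range(params.shape[1])],
        axis=1,
    )

def synchronize(visual_segments, phoneme_durations):
    """Impose the auditory phoneme durations on the visual segments
    and concatenate the time-scaled sub-trajectories."""
    scaled = [time_scale_segment(seg, dur)
              for seg, dur in zip(visual_segments, phoneme_durations)]
    return np.vstack(scaled)
```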
For each sentence, a single-phase (MUL) and a two-phase (SAV) speech sample were generated (the labels "MUL" and "SAV" are analogous to the labels used in section 3.6). In contrast to the experiment described in section 3.6.3, the participants were asked to rate the overall quality of the presented audiovisual speech signals. They had to decide for themselves which aspect of the audiovisual speech (e.g., the individual quality of the acoustic/visual speech, the audiovisual presentation of both speech modes, etc.) they found the most important. The key question to answer was: "How much does the sample resemble original audiovisual speech?". And, similarly: "How much do you like the audiovisual speech for usage in a real-world application (e.g., reading the text messages on your cell phone)?". The samples were shown pairwise to the participants, who had to use a 5-point comparative MOS scale [-2,2] to express their preference for one of the two samples. They were instructed to answer "0" when they had no clear preference. The sequence of the sample types in each comparison pair was randomized.

5.4.2.3 Subjects and results

Seven people participated in the experiment (5 male, 2 female, aged [24-61]), three of whom can be considered speech technology experts. The results obtained are visualized in figure 5.10.

Figure 5.10: Comparison between single-phase and two-phase audiovisual synthesis. The histogram shows the participants' preference for the SAV or the MUL sample on a 5-point scale [-2,2].

The results clearly show that the test subjects preferred the samples generated by the single-phase synthesis approach. A Wilcoxon signed-rank analysis indicated a significant difference between the ratings obtained for the MUL samples and the ratings obtained for the SAV samples (Z = −6.59; p < 0.001). The difference between the ratings for the MUL samples and the ratings for the SAV samples is stronger for the limited domain samples than for the sentences from the open domain (Mann-Whitney U test; Z = −2.21; p = 0.027). Nevertheless, for both types of sentences the MUL group was rated significantly better than the SAV group (Wilcoxon signed-rank tests; p ≤ 0.001).

5.4.2.4 Discussion

The results obtained unequivocally indicate that the single-phase synthesis strategy is the most preferable approach to perform audiovisual speech synthesis, since this strategy even outperforms a two-phase synthesis approach that uses comparable original speech data in both synthesis stages. This result is in line with the results of the experiment described in section 3.6, where it was concluded that a two-phase synthesis approach is likely to affect the perceived individual quality of a synthetic speech mode when presented audiovisually to an observer. Note that the individual quality of the synthetic speech modes of the SAV samples should be at least as high as the individual quality of the speech modes of the MUL samples, since for the synthesis of the auditory speech mode of the SAV samples only auditory features were considered when calculating the selection costs. Similarly, the synthesis of the visual speech mode of the SAV samples was based solely on visual features.
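A minimal sketch of how the paired comparative ratings reported above can be analysed with the tests mentioned in the previous subsection, assuming one preference score in [-2, 2] per sample pair (positive values indicating a preference for MUL); the data shown are placeholders, not the experimental ratings.

```python
import numpy as np
from scipy.stats import wilcoxon, mannwhitneyu

# One comparative MOS score in [-2, 2] per presented pair;
# positive = preference for the MUL (single-phase) sample.
ratings = np.array([1, 2, 0, 1, 2, 1, -1, 2, 1, 0, 2, 1])       # placeholder
domain = np.array(["ld", "ld", "od", "od", "ld", "od",
                   "ld", "ld", "od", "od", "ld", "od"])          # placeholder

# Do the ratings differ significantly from "no preference" (0)?
stat, p = wilcoxon(ratings)
print(f"Wilcoxon signed-rank: W = {stat}, p = {p:.3g}")

# Is the preference stronger for limited-domain (ld) sentences
# than for open-domain (od) sentences?
stat_u, p_u = mannwhitneyu(ratings[domain == "ld"],
                           ratings[domain == "od"],
                           alternative="two-sided")
print(f"Mann-Whitney U: U = {stat_u}, p = {p_u:.3g}")
```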
The increased difference between the ratings for both sample types, noticed for the limited domain sentences, could be explained by the fact that for these sentences the individual quality of the visual speech mode of the MUL samples is possibly a bit higher than that of the visual speech mode of the SAV samples: for the MUL samples both synthetic speech modes can be constructed by concatenating large segments from the limited domain section of the AVKH database. On the other hand, this difference does not exist for the synthetic auditory mode: it is clearly noticeable that for both the MUL and the SAV samples, the overall audiovisual quality of the limited domain sentences is higher than the quality of the other sentences. Since the limited domain sentences are closer to original speech, it may have been the case that the participants were more sensitive to subtle incoherences between both presented speech modes. This could also explain the larger difference between the ratings for both sample types that was observed for the limited domain sentences.

5.5 Summary and conclusions

This chapter elaborated on the extension of the AAM-based AVTTS system, proposed in chapter 4, towards synthesis in the Dutch language. For this purpose, a new, extensive Dutch audiovisual speech database was recorded. The recorded speech signals were processed in order to enable audiovisual speech synthesis using the Dutch version of the AVTTS system. A detailed AAM was built that is able to accurately reconstruct the mouth area of the original video frames. In addition, appropriate background signals can be generated using an additional face AAM. The quality of the synthetic audiovisual speech generated from the new database is significantly higher than the attainable synthesis quality using the LIPS2008 database. The individual quality of the synthetic visual speech mode was assessed in a Turing experiment. It appeared that especially observers who are not familiar with speech synthesis are almost unable to distinguish between original and synthesized visual speech signals. The enhanced synthesis quality made it possible to perform a new comparison between single-phase and two-phase concatenative audiovisual speech synthesis approaches. The experiment showed that observers prefer synthetic audiovisual speech signals generated by a single-phase approach over audiovisual speech samples generated by a two-phase synthesis strategy. This can be seen as a final proof of the importance of audiovisual speech synthesis approaches that, apart from pursuing the highest possible individual acoustic and visual speech quality, also aim to maximize the level of audiovisual coherence between the two synthetic speech modes. Some of the techniques, experiments and results mentioned in this chapter have been published in [Mattheyses et al., 2011a].

6 Context-dependent visemes

6.1 Introduction

6.1.1 Motivation

Up until this point, this thesis discussed the problem of audiovisual text-to-speech synthesis, in which a given text is translated into a novel audiovisual speech signal. Section 1.4.4 elaborated on various applications for this kind of synthesizer. On the other hand, there are also many scenarios in which the synthesizer is only needed to generate a synthetic visual speech signal, which is later on multiplexed with an already existing auditory speech signal.
When the text transcript corresponding to this auditory speech signal is used as input for the synthesizer, the system can be referred to as a visual text-to-speech (VTTS) synthesizer. Recall from chapter 2 that such a VTTS system is used to perform the second synthesis stage in a two-phase AVTTS approach: the VTTS synthesizer creates the synthetic visual speech mode that is later on multiplexed with the already obtained synthetic auditory speech signal. On the other hand, many applications exist for which a visual speech signal must be generated to accompany an original auditory speech signal instead. For instance, in video telephony or in remote teaching, the visual speech can be locally rendered to accompany the original transmitted acoustic speech. This makes it possible to attract the attention of the audience by displaying a virtual speaker, while only (low-bandwidth) acoustic signals need to be transmitted over the network. Another example for which the accurate synthesis of visual speech signals is crucial is computer-assisted pronunciation training. In this scenario, a speech therapy patient will independently use the VTTS system to generate examples of visual articulations.

From the previous chapters it is known that the generation of a synthetic auditory speech signal that perfectly mimics original speech is very hard to realize. Therefore, many professional applications still opt to use original acoustic speech signals instead. On the other hand, an optimal communication should consist of audiovisual speech and not acoustic-only speech signals. Unfortunately, audiovisual speech recordings are much harder to realize and much more expensive than acoustic-only speech recordings. Since it was found in the previous chapters that observers are slightly less sensitive to small imperfections in a synthetic visual speech signal than to imperfections in a synthetic auditory speech signal, VTTS synthesis constitutes a useful solution for enhancing the quality of the communication: it synthesizes a visual speech mode that can be displayed together with the original acoustic speech signal.

Section 5.4.1 described a Turing experiment that was conducted to assess the individual quality of the synthetic visual speech generated by the AAM-based AVTTS system. To generate the test samples, the AVTTS system was converted into a VTTS system: the visual speech mode of the synthetic audiovisual speech was synchronized and multiplexed with the corresponding original auditory speech signal from the database. From the results obtained in the Turing experiment it appears that this unit selection-based VTTS approach is able to synthesize speech signals that are very similar to original speech signals. Therefore, it is interesting to further investigate how well the techniques that were developed in the scope of the AVTTS system can be applied to perform high-quality VTTS synthesis instead. An important observation is that, whereas for AVTTS synthesis phonemes have to be used to describe the speech information, for VTTS synthesis visemes can also be used to describe both the target speech and the speech signals contained in the database. It is interesting to investigate the particular behaviour and the attainable synthesis quality of both speech labelling approaches. To this end, the mapping from phoneme labels to viseme labels that maximizes the attainable VTTS synthesis quality will be explored.
Given the results obtained earlier in this thesis, the most preferable speech labelling technique will have to allow the synthesis of maximally smooth and natural visual articulations as well as the generation of visual speech information that is highly coherent with the original acoustic speech signal that will eventually be played together with the synthetic visual speech.

6.1.2 Concatenative VTTS synthesis

The unit selection-based VTTS synthesizer that is employed to investigate the use of viseme speech labels for visual speech synthesis is very similar to the AAM-based AVTTS synthesizer that was discussed in chapter 4 and chapter 5. The main difference is the alternative set of selection costs that is applied by the VTTS synthesizer. The hidden target cost that enforces the selection of database segments phonemically matching the target speech is altered so that it also allows the selection of segments of which each phoneme is from the same viseme class as the corresponding target phoneme. In addition, the binary cost that rewards the matching of the phonemic context between the candidate and the target is omitted. The overall weight of the other binary linguistic costs (see section 3.4.2.2) is lowered with respect to the weight of the visemic context cost that was described in section 4.3.4.1. The difference matrix that is needed to calculate this cost (equation 4.10) is constructed for each particular set of speech labels (phoneme labels and various viseme labels) used in the experiments. An additional target cost is included that promotes the selection of database segments that require only a minor time-scaling to attain synchronization with the given auditory speech signal (see section 3.6.1). The total join cost is calculated using the same visual join costs that are used in the AVTTS system (see section 4.3.4.2). For obvious reasons, all auditory join costs are omitted. Once the optimal sequence of database segments has been selected, the database sub-trajectories are concatenated using the same concatenation approach that is used in the AVTTS system (see section 4.3.5). All optimizations to the quality of the synthetic visual speech that were discussed in section 4.4 are applied for the VTTS synthesis as well. Finally, the concatenated trajectories are synchronized with the auditory speech signal by time-scaling the sub-trajectories of each synthesized visual speech segment to match the duration of the corresponding auditory speech segment. The final output speech is created by generating the appropriate mouth-signal by the inverse AAM-projection of the synthesized parameter trajectories, merging this signal with the background video signal displaying the other parts of the face of the virtual speaker, and multiplexing this final visual speech signal with the given auditory speech signal.
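To make the modified selection step more tangible, the sketch below assembles a viseme-based target cost in the spirit described above. The weights, the dictionary-based difference-matrix lookup and the duration term are illustrative assumptions, not the exact costs implemented in the system.

```python
import numpy as np

def target_cost(candidate, target, viseme_of, diff_matrix,
                w_ctx=1.0, w_dur=0.5):
    """Schematic target cost for viseme-based unit selection.

    candidate/target: dicts with 'label', 'prev', 'next' and 'duration'.
    viseme_of: speech-label -> viseme-class mapping used for labelling.
    diff_matrix: pairwise visual dissimilarity between speech labels.
    """
    # Hidden target cost: the candidate must belong to the same viseme
    # class as the target, otherwise it is not a valid candidate.
    if viseme_of[candidate["label"]] != viseme_of[target["label"]]:
        return np.inf

    # Visemic context cost: visual dissimilarity between the labels of
    # the neighbouring segments of the candidate and of the target.
    ctx = (diff_matrix[candidate["prev"]][target["prev"]]
           + diff_matrix[candidate["next"]][target["next"]])

    # Duration cost: prefer segments that need only minor time-scaling
    # to be synchronized with the given auditory speech.
    dur = abs(np.log(candidate["duration"] / target["duration"]))

    return w_ctx * ctx + w_dur * dur
```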
6.2 Visemes

6.2.1 The concept of visemes

Among all sorts of auditory speech processing applications, the concept of a phoneme as a basic speech unit is well established. Auditory speech signals are segmented into a sequence of phonemes for both speech analysis and speech synthesis goals. The properties of such a phoneme set are language-dependent, but its definition is nowadays well standardized for many languages. Similarly, the processing of visual speech signals needs the definition of an atomic unit of visual speech. Fisher introduced the term viseme to identify the visual counterpart of several consonant phonemes [Fisher, 1968]. Visemes can be considered as the particular facial and oral positions that a speaker displays when uttering phonemes. This implies that a unique viseme is defined by the typical articulatory gestures (mouth opening, lip protrusion, jaw movement, etc.) that are needed to produce a particular phoneme [Saenko, 2004]. On the other hand, an alternative definition that is widely used in the literature is to define a viseme as a group of phonemes that exhibit a similar visual representation. This second definition entails that there is not a one-to-one mapping between phonemes and visemes. This can be understood from the fact that not all articulators that are needed to utter phonemes are visible to an observer. For instance, the English /k/ and /g/ phonemes are created by raising the back of the tongue to touch the roof of the mouth, which is a gesture that cannot be noticed visually. In addition, some phoneme pairs differ only in terms of voicing (e.g., English /v/ and /f/) or in terms of nasality. These two properties cannot be distinguished in the visual domain, which means that such phoneme pairs will have the same appearance in the visual speech mode. As a consequence, the mapping from phonemes to visemes should behave like a many-to-one (Nx1) relationship, where visibly similar phonemes are mapped to the same viseme.

The construction of such an Nx1 mapping scheme has been the subject of much research. Two different approaches can be distinguished. The first approach is based on the phonetic properties of the different phonemes of a language. Based on articulatory rules (e.g., place of articulation, position of the lips, etc.) and expert knowledge, a prediction of the visual appearance of a phoneme can be made [Jeffers and Barley, 1971] [Aschenberner and Weiss, 2005]. This way, visemes can be defined by grouping those phonemes for which the visually important articulation properties match. Alternatively, in a second approach the set of visemes for a particular language is determined by various data-driven methods. These strategies involve the recording of real (audio)visual speech data, after which the captured visual speech mode is further analysed. To this end, the most common analysis approach is to conduct some kind of subjective perception test. Such an experiment involves participants who try to match fragments of the recorded visual speech to their audible counterparts [Binnie et al., 1974] [Montgomery and Jackson, 1983] [Owens and Blazek, 1985] [Eberhardt et al., 1990]. The nature (consonant/vowel, position in word, etc.) and the size (phoneme, diphone, triphone, etc.) of the speech fragments used varies among the studies. For instance, in the pioneering study of Fisher [Fisher, 1968], the participants were asked to lip-read the initial and final consonants of an utterance using a forced-error approach (the set of possible responses did not contain the correct answer). The responses of these kinds of perception experiments are often used to generate a confusion matrix, denoting which phonemes are visibly confused with which other phonemes. From this confusion matrix, groups of visibly similar phonemes (i.e., visemes) can be determined.

The benefit of the human-involved data-driven approaches to determine the phoneme-to-viseme mapping scheme is the fact that they measure exactly what needs to be modeled: the perceived visual similarity among phonemes.
On the other hand, conducting these perception experiments is time-consuming and the results obtained depend on the lip reading capabilities of the test subjects. In order to process more speech data in a less time-consuming way, the analysis of the recorded speech samples can be approached from a mathematical point of view [Turkmani, 2007]. In a study by Rogozan [Rogozan, 1999], basic geometrical properties of the speaker's mouth (height, width and opening) during the uttering of some test sentences were determined. The clustering of these properties led to a grouping of the phonemes into 13 visemes. Similarly, in studies by Hazen et al. [Hazen et al., 2004] and Melenchon et al. [Melenchon et al., 2007] a clustering was applied on the PCA coefficients that were calculated on the video frames of the recorded visual speech. The downside of these mathematical analyses lies in the fact that they assume that the mathematical difference between two visual speech segments corresponds to the actual difference that a human observer would notice when comparing the same two segments. Unfortunately, this correlation between objective and subjective distances is far from straightforward.

Several Nx1 phoneme-to-viseme mapping schemes have been defined in the literature. Most mappings agree on the clustering of some phonemes (e.g., the English /p/, /b/ and /m/ phonemes), although many differences between the mappings exist. In addition, the number of visemes defined by these Nx1 mappings varies among the different schemes. A standardized viseme mapping table for English has been defined in MPEG-4 [Pandzic and Forchheimer, 2003], consisting of 14 different visemes augmented with a "silence" viseme (see also section 2.2.5.3). Although this viseme set has been applied for visual speech recognition (e.g., [Yu et al., 2010]) as well as for visual speech synthesis applications (see section 2.2.5.3 for examples), many other viseme sets are still used. For instance, various Nx1 phoneme-to-viseme mappings were applied for visual speech synthesis in [Ezzat and Poggio, 2000] [Verma et al., 2003] [Ypsilos et al., 2004] [Bozkurt et al., 2007], while other Nx1 viseme labelling approaches have been applied for (audio-)visual speech recognition purposes [Visser et al., 1999] [Potamianos et al., 2004] [Cappelletta and Harte, 2012].

It is not straightforward to determine the exact number of visemes that is needed to accurately describe the visual speech information. In a study by Auer [Auer and Bernstein, 1997] the concept of a phonemic equivalence class (PEC) was introduced. Such a PEC can be seen as an equivalent of a viseme, since it is used to group phonemes that are visibly similar. In that study, words from the English lexicon were transcribed using these PECs in order to assess their distinctiveness. The number of PECs was varied between 1, 2, 10, 12, 19 and 28. It was concluded that the use of at least 12 PECs resulted in a sufficiently unique transcription of the words. Note, however, that when optimizing the number of visemes used in a phoneme-to-viseme mapping, the target application (e.g., human speech recognition, machine-based speech recognition, speech synthesis, etc.) should also be taken into account. In addition, it has been shown that the best phoneme-to-viseme mapping (and, as a consequence, the number of visemes) should be constructed speaker-dependently [Lesner and Kricos, 1981].
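To make the notion of an Nx1 mapping concrete, the toy sketch below relabels a phoneme string with viseme classes. The groupings shown are a small illustrative subset only; they are not the MPEG-4 set nor any of the published mappings.

```python
# Illustrative (partial) Nx1 phoneme-to-viseme grouping; a real mapping
# table covers every phoneme of the target language.
NX1_MAPPING = {
    "p": "V_bilabial", "b": "V_bilabial", "m": "V_bilabial",
    "f": "V_labiodental", "v": "V_labiodental",
    "k": "V_velar", "g": "V_velar",
    "sil": "V_silence",
}

def phonemes_to_visemes(phonemes):
    """Relabel a phoneme sequence with Nx1 viseme classes."""
    return [NX1_MAPPING[p] for p in phonemes]

print(phonemes_to_visemes(["sil", "b", "m", "f", "k", "sil"]))
# ['V_silence', 'V_bilabial', 'V_bilabial', 'V_labiodental', 'V_velar', 'V_silence']
```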
A constant among almost all phoneme-to-viseme mapping schemes found in the literature is that the nature of the mapping is many-to-one. At first glance, this seems reasonable since it fits with the definition of a viseme as a group of phonemes that appear visibly the same. On the other hand, a visual speech signal often exhibits strong coarticulation effects (see section 2.2.6.1). Both forward and backward coarticulations make the visual appearance of a particular phone in a sentence dependent not only on the corresponding phoneme's articulation properties but also on the nature of its neighbouring phones in the sentence. This means that a single phoneme can exhibit various visible representations, which implies that it can be mapped on several different visemes. Consequently, a comprehensive phoneme-to-viseme mapping scheme should be a many-to-many (NxM) relationship [Jackson, 1988]. Unfortunately, only very little research has been performed on the construction of such NxM mapping tables. A first step in this direction can be found in a study by Mattys [Mattys et al., 2002], where some of the phoneme equivalence classes from [Auer and Bernstein, 1997] were redefined by taking the phonetic context of consonants into account.

Finally, it can be argued that even a viseme set that is computed as an NxM mapping from phonemes is insufficient to accurately and efficiently define atomic units of visual speech. Instead of an a priori segmentation of the speech in terms of phonemes, the segmentation of a visual speech signal could be performed by taking only the visual features into account. For example, in a study by Hilder et al. [Hilder et al., 2010] such a segmentation is proposed, which has the benefit that the different allophones of a phoneme can get a different viseme label and that the visual coarticulation is automatically taken into account. It was shown that this strategy leads to a clustering of the speech segments into visemes which is optimal in the sense that the inter-cluster distance is much larger than the intra-cluster distance. Unfortunately, it is far from straightforward to use such a viseme set for visual speech analysis or synthesis purposes: there is no direct mapping from phoneme labels to viseme labels and the phone boundaries in the auditory mode of a speech segment do not coincide with the viseme boundaries in the visual mode. Recently, in a study by Taylor et al. [Taylor et al., 2012] the segmentation technique proposed in [Hilder et al., 2010] was employed to analyse a large corpus of English audiovisual speech. This led to the definition of 150 so-called dynamic visemes, elementary units of visual speech that span on average the length of 2-3 phonemes. A target phoneme sequence can be translated into a sequence of dynamic visemes by inspecting both the phoneme/dynamic viseme combinations occurring in the original text corpus and the similarity between the target phoneme duration and the duration of the dynamic viseme. Optimizing the synthesis is not straightforward, since many sequences of dynamic visemes can be employed to match a particular target phoneme sequence. In addition, to achieve audiovisual synchronisation the boundaries of the dynamic visemes need to be aligned with the target phoneme boundaries.

6.2.2 Visemes for the Dutch language

In comparison with the English language, the number of available reports on visemes for the Dutch language is limited.
Whereas for English some kind of standardization for Nx1 visemes exists in the MPEG-4 standard, for Dutch only a phoneme set has been standardized. A first study on visemes for Dutch was performed by Eggermont [Eggermont, 1964], where some CVC syllables were the subject of an audiovisual perception experiment. In addition, Corthals [Corthals, 1984] describes a phoneme-to-viseme grouping using phonetic expert knowledge. Finally, Van Son et al. [Van Son et al., 1994] define a new Nx1 phoneme-to-viseme mapping scheme that is constructed using the experimental results of new perception tests in combination with the few conclusions on Dutch visemes that could be found in earlier literature.

6.3 Phoneme-to-viseme mapping for visual speech synthesis

6.3.1 Application of visemes in VTTS systems

Section 2.2 contained a detailed overview of the various techniques that are used to synthesize visual speech. It discussed the various approaches for predicting the appropriate speech gestures based on the input text. This section summarizes the basic concepts of each of these techniques and discusses for each technique in which way a phoneme-to-viseme mapping scheme can be adopted to describe the speech information used by the synthesizer.

A first technique is adopted by the so-called rule-based synthesizers, which assign to each target phoneme a typical configuration of the virtual speaker. For instance, in a 3D-based synthesis approach these configurations can be expressed by means of parameter values of a parameterized 3D model. Alternatively, a 2D-based synthesizer can assign to each target phoneme a still image of an original speaker uttering that particular phoneme. Rule-based synthesizers generate the final synthetic visual speech signal by interpolating between the predicted keyframes. In order to cover the complete target language, a rule-based system has to know the mapping from any phoneme of that language to its typical visual representation, i.e., it has to define a complete phoneme-to-viseme mapping table. The actual speaker configuration that corresponds to each viseme label can be manually pre-defined (using a system-specific viseme set or a standardized viseme set such as described in MPEG-4) or it can be copied from original speech recordings. Almost all rule-based visual speech synthesizers adopt an Nx1 phoneme-to-viseme mapping scheme, which reduces the number of rules needed to cover all phonemes of the target language. In a rule-based synthesis approach using an Nx1 mapping scheme, the visual coarticulation effects need to be mimicked in the keyframe interpolation stage. To this end, a coarticulation model such as the Cohen-Massaro model [Cohen and Massaro, 1993] or Ohman's model [Ohman, 1967] is adopted. There have been only a few reports on the use of NxM phoneme-to-viseme mapping tables for rule-based visual speech synthesis. An example is the exploratory study by Galanes et al. [Galanes et al., 1998], in which regression trees are used to analyse a database of 3D motion capture data in order to design prototype configurations for context-dependent visemes. To synthesize a novel visual speech signal, the same regression trees are used to perform an NxM phoneme-to-viseme mapping that determines for each input phoneme a typical configuration of the 3D landmarks, taking its target phonetic context into account.
Afterwards, the keyframes are interpolated using splines instead of a coarticulation model, since coarticulation effects were already taken into account during the phoneme-to-viseme mapping stage. Another synthesis system that uses pre-defined context-dependent visemes was suggested by De Martino et al. [De Martino et al., 2006]. In their approach, 3D motion capture trajectories corresponding to the uttering of original CVCV and diphthong samples are gathered, after which important groups of similar visual phoneme representations are distinguished by means of k-means clustering. From these context-dependent visemes, the keyframe mouth dimensions corresponding to a novel phoneme sequence can be predicted. These predictions are then used to animate a 3D model of the virtual speaker. In follow-up research, for each context-dependent viseme identified by the clustering of the 3D motion capture data, a 2D still image of an original speaker is selected to define the articulation rules for a 2D photorealistic speech synthesizer [Costa and De Martino, 2010].

In contrast with rule-based synthesizers, unit selection synthesizers construct the novel speech signal by concatenating speech segments selected from a database containing original visual speech recordings. In this approach, no interpolation is needed since all output frames consist of visual speech information copied from the database. The selection of the visual speech segments can be based on the target/database matching of either phonemes (e.g., [Bregler et al., 1997] [Theobald et al., 2004] [Deng and Neumann, 2008]) or visemes (e.g., [Breen et al., 1996] [Liu and Ostermann, 2011]). From the literature it can be noticed that almost all viseme-based unit selection visual speech synthesizers apply an Nx1 phoneme-to-viseme mapping to label the database speech and to translate the target phoneme sequence into a target viseme sequence. In unit selection synthesis, original coarticulations are copied from the database to the output speech by concatenating original segments longer than one phoneme/viseme. In addition, extended visual coarticulations can be taken into account by selecting those original speech segments of which the visual context (i.e., the visual properties of their neighboring phonemes/visemes) matches the visual context of the corresponding target speech segment (this motivated the addition of the "visual context target cost" to the AVTTS system (see section 4.3.4.1)).

The third strategy for estimating the desired speech gestures from the text input is the use of a statistical prediction model that has been trained on the correspondences between visual speech features and a phoneme/viseme-based labelling of the speech signal. Such a trained model can predict the target features of the synthesizer's output frames for an unseen phoneme/viseme sequence given as input. A common strategy is the use of HMMs to predict the target visual features. Such an HMM usually models the visual features of each phoneme/viseme sampled at 3 to 5 distinct time instances along the phoneme/viseme duration. It can be trained using both static and dynamic observation vectors, i.e., the visual feature values and their temporal derivatives. Similar to most selection-based visual speech synthesizers, prediction-based synthesizers often use a phoneme-based segmentation of the speech, for which the basic training/prediction unit can be, for instance, a single phoneme or a syllable.
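As an illustration of such static plus dynamic observation vectors, the sketch below augments a visual parameter trajectory with its frame-to-frame derivatives; the simple difference operator is an illustrative choice (HMM toolkits typically use regression windows instead).

```python
import numpy as np

def add_delta_features(params: np.ndarray) -> np.ndarray:
    """Stack static visual features (n_frames x n_params) with their
    temporal derivatives, as used to train HMM-based predictors."""
    # First-order numerical derivative along the time axis as the
    # dynamic (delta) feature.
    delta = np.gradient(params, axis=0)
    return np.hstack([params, delta])

# Example: a 100-frame trajectory of 20 combined AAM parameters
# becomes a 100 x 40 observation sequence.
obs = add_delta_features(np.random.randn(100, 20))
print(obs.shape)  # (100, 40)
```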
Diphones are used as basic synthesis units in a study by Govokhina et al. [Govokhina et al., 2006a], in which it was concluded that the use of both static and dynamic features to train an HMM improved the synthesis quality, since this allows coarticulations to be learned by the prediction model as well. In follow-up research, their system was extended to allow some small asynchronies between the phoneme transitions and the transition points between the speech segments in the visual mode. This way, the HMM is capable of modelling some anticipatory visual coarticulation effects more accurately, since these occur in the visual speech mode before the corresponding phoneme is heard in the auditory mode [Govokhina et al., 2007].

6.3.2 Discussion

Each approach for predicting the visual speech information based on a target phoneme sequence, discussed in detail in section 2.2 and summarized in section 6.3.1, can be implemented using either phonemes or visemes as atomic speech units. In theory, a viseme-based approach should be superior since this type of labelling is more suited to identify visual speech information. It can be noticed that almost all viseme-based visual speech synthesizers that have been described in the literature use an Nx1 phoneme-to-viseme mapping scheme. For rule-based synthesizers, such an Nx1 mapping is advantageous since it reduces the number of rules needed to cover the whole target language. Moreover, the application of an Nx1 mapping scheme is useful for unit selection synthesizers as well, since it reduces the database size needed to provide a sufficient number of database segments that match a given target speech segment. Similarly, the use of an Nx1 mapping scheme reduces the minimal number of original sentences needed to train the prediction models of prediction-based synthesis systems. On the other hand, when an Nx1 phoneme-to-viseme mapping scheme is applied, an additional modelling of the visual coarticulation effects is needed. To this end, many coarticulation models have been proposed for use in rule-based visual speech synthesis. In the case of unit selection synthesis, which is currently the most appropriate technique to produce very realistic synthetic speech signals, visual coarticulations have to be copied from original speech recordings. It is obvious that the accuracy of this information transfer increases when the labelling of the original and the target speech data intrinsically describes these coarticulations. This is feasible when context-dependent visemes are used to label the target and the database speech, i.e., when an NxM phoneme-to-viseme mapping scheme is applied.

6.3.3 Problem statement

For unit selection synthesis, there exists a trade-off between the various approaches for labelling the speech data. The use of an Nx1 phoneme-to-viseme mapping increases the number of database segments that match a target speech segment, which means that it is more likely that for each target segment a highly suited database segment can be found. On the other hand, when this type of speech labels is used, appropriate original visual coarticulations can only be selected by means of accurate selection costs and by selecting long original segments. When context-dependent visemes are used, the visual coarticulation effects are much better described, both in the database speech and in the target speech.
Unfortunately, such an NxM mapping increases the number of distinct speech labels and thus decreases the number of database segments that match a target segment. Note that visual unit selection synthesis can also be performed using phoneme-based speech labels. Although phonemes are less suited to describe visual speech information, they may help to enhance the perceived quality when the synthetic visual speech is shown audiovisually to an observer, since the use of phoneme labels increases the audiovisual coherence. In the remainder of this chapter, the effect of all these possible speech labelling approaches on the quality of the synthetic visual speech is investigated. To this end, the unit selection-based VTTS synthesizer described in section 6.1.2 is used to synthesize visual speech signals using phonemes, Nx1 visemes and NxM visemes to describe the target and the database speech. For this, accurate NxM mapping schemes have to be developed, since only very few of these mappings can be found in the literature.

6.4 Evaluation of many-to-one phoneme-to-viseme mapping schemes for English

6.4.1 Design of many-to-one phoneme-to-viseme mapping schemes

In this section the standardized Nx1 mapping that is described in MPEG-4 [MPEG, 2013] is evaluated for use in concatenative visual speech synthesis. Since this mapping scheme is designed for English, the English version of the visual speech synthesis system was used, provided with the LIPS2008 audiovisual speech database. Based on the description in MPEG-4, the English phoneme set that was originally used to segment the LIPS2008 corpus has been mapped on 14 visemes, augmented with one silence viseme. The mapping of those phonemes that are not mentioned in the MPEG-4 standard was based on their visual and/or articulatory resemblance with other phones.

The MPEG-4 mapping scheme is designed to be a "best-for-all-speakers" phoneme-to-viseme mapping. However, for use in data-driven visual speech synthesis, the phoneme-to-viseme mapping should be optimized for the particular speaker of the synthesizer's database. To define such a speaker-dependent mapping, the AAM-based representations of the mouth region of the video frames from the database were used. In a first step, for every distinct phoneme present in the database all its instances were gathered. Then, the combined model parameter values of the frame located at the middle of each instance were sampled. From the collected parameter values, a speaker-specific mean visual representation of each phoneme was calculated. A hierarchical clustering analysis was performed on these mean parameter values to determine which phonemes are visibly similar for the speaker's speaking style. Using the dendrogram, a tree diagram that visualizes the arrangement of the clusters produced by the clustering algorithm, five important levels could be discerned in the hierarchical clustering procedure. Consequently, five different phoneme-to-viseme mappings were selected. They define 7, 9, 11, 19 and 22 visemes, respectively. Each of these viseme sets contains a "silence" viseme on which only the silence phoneme is mapped. The viseme mappings are summarized in appendix C.
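A hedged sketch of this speaker-dependent grouping step: the mean combined AAM parameter vector of each phoneme is clustered hierarchically and the resulting dendrogram is cut at a chosen number of viseme classes. The linkage criterion and the placeholder data are assumptions for illustration, not the settings used in the thesis.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# One mean combined AAM parameter vector per phoneme
# (placeholder data: 45 phonemes x 12 parameters).
phonemes = [f"ph{i}" for i in range(45)]
mean_params = np.random.randn(45, 12)

# Hierarchical (agglomerative) clustering of the mean representations;
# plotting the dendrogram of Z helps to pick meaningful cut levels.
Z = linkage(mean_params, method="ward")

# Cut the dendrogram at a chosen number of visemes, e.g. 9 classes.
labels = fcluster(Z, t=9, criterion="maxclust")

# Resulting speaker-dependent Nx1 phoneme-to-viseme mapping.
sd_mapping = {ph: f"viseme_{c}" for ph, c in zip(phonemes, labels)}
```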
6.4.2 Experiment

Both the MPEG-4 mapping scheme and the speaker-dependent mappings were the subject of an experimental evaluation. A random selection of original sentences from the database was resynthesized using the visual speech synthesis system discussed in section 6.1.2, for which both the synthesis targets and the speech database were labelled using the various Nx1 viseme sets. The synthesis parameters (selection costs, concatenation smoothing, etc.) were the same for all strategies. A reference synthesis strategy was added, for which the standard English phoneme set (containing 45 entries) was used to label the database and the synthesis targets. For every synthesis, the target original sentence was excluded from selection. The original database transcript was used as text input and the original database auditory speech was used as audio input.

A subjective evaluation of the syntheses was conducted, using four labelling strategies: speech synthesized using the speaker-dependent mappings on 9 (group "SD9") and 22 (group "SD22") visemes, speech synthesized using the MPEG-4 mapping (group "MPEG4"), and speech synthesized using standard phoneme-based labels (group "PHON"). In addition, extra reference samples were added (group "ORI"), for which the original AAM trajectories from the database were used to resynthesize the visual speech. The samples were shown pairwise to the participants. Six different comparisons were considered, as shown in figure 6.1. The sequence of the comparison types as well as the sequence of the sample types within each pair were randomized. The test consisted of 50 sample pairs: 14 comparisons containing an ORI sample and 36 comparisons between two actual syntheses. The same sentences were used for each comparison group. Thirteen people (9 male, 4 female, aged [22-59]) participated in the experiment, 8 of whom can be considered speech technology experts. The participants were asked to give their preference for one of the two samples of each pair using a 5-point comparative MOS scale [-2,2]. They were instructed to answer "0" if they had no clear preference for one of the two samples. The test instructions told the participants to pay attention both to the naturalness of the mouth movements and to how well these movements cohere with the auditory speech that is played along with the video. The key question of the test read as follows: "How much are you convinced that the person you see in the sample actually produces the auditory speech that you hear in the sample?".

The results obtained are visualized in figure 6.1.

Figure 6.1: Subjective evaluation of the Nx1 phoneme-to-viseme mappings for English. The histograms show for each comparison the participants' preference for the left/right sample type on a 5-point scale [-2,2].

The results of an analysis using Wilcoxon signed-rank tests are given in table 6.1.

Table 6.1: Subjective evaluation of the Nx1 phoneme-to-viseme mappings for English. Wilcoxon signed-rank analysis.

Comparison      Z       Sign.
MPEG4 - ORI     -7.79   p < 0.001
PHON - ORI      -6.44   p < 0.001
SD9 - MPEG4     -1.94   p = 0.052
SD22 - MPEG4    -1.64   p = 0.102
PHON - MPEG4    -5.24   p < 0.001
PHON - SD22     -5.04   p < 0.001

The results show that the participants were clearly in favour of the syntheses based on phonemes compared to the viseme-based syntheses. Furthermore, a higher perceived quality was attained by increasing the number of visemes.
It can be concluded that the speaker-dependent mappings perform similarly to the standardized mapping scheme, since the SD9 group scores worse than the MPEG4 group (which uses 15 distinct visemes), while the SD22 group scores better than the MPEG4 group. The results also show that the synthesized samples are still distinguishable from natural visual speech, although for this aspect, too, the phoneme-based synthesis outperforms the MPEG-4-based approach.

6.4.3 Conclusions

Both standardized and speaker-dependent Nx1 phoneme-to-viseme mapping schemes for English were constructed and applied for concatenative visual speech synthesis. In theory, such a viseme-based synthesis should outperform the phoneme-based synthesis since it multiplies the number of candidate segments for selection, while the reduced number of distinct speech labels can be justified by the fact that there exists redundancy in a phoneme-based labelling of visual speech (this is the reason for the Nx1 behaviour of the mapping). However, the synthesis based on phonemes resulted in higher subjective ratings than the syntheses based on visemes. In addition, the results obtained show that the synthesis quality increases when more distinct visemes are defined. These results raise some questions about the Nx1 viseme-based approach that is widely applied in visual speech synthesis. For audiovisual speech synthesis, it was already shown that a phoneme-based speech labelling is preferable, since it allows the selection of multimodal segments from the database, which maximizes the audiovisual coherence in the synthetic multimodal output speech (see chapter 3). From the current results it appears that a similar phoneme-based synthesis is preferable for visual-only synthesis as well. However, it could be that the many-to-one phoneme-to-viseme mappings insufficiently describe all the details of the visual speech information. Although the synthesizer mimicked the visual coarticulation effects by applying a target cost based on the visual context, it is likely that higher-quality viseme-based synthesis results can be achieved by using a many-to-many phoneme-to-viseme mapping instead, which describes the visual coarticulation effects already in the speech labelling itself.

6.5 Many-to-many phoneme-to-viseme mapping schemes

In order to construct an NxM phoneme-to-viseme mapping scheme, an extensive set of audiovisual speech data must be analysed to investigate the relationship between the visual appearances of the mouth area and the phonemic transcription of the speech. Since the resulting mapping schemes will eventually be evaluated for use in concatenative visual speech synthesis, it was opted to construct a speaker-dependent mapping by analysing a speech database that can be used for synthesis purposes too (i.e., a consistent dataset from a single speaker). Unfortunately, the amount of data in the English LIPS2008 database is insufficient for an accurate analysis of all distinct phonemes in all possible phonetic or visemic contexts. Therefore, the Dutch language was used instead, since this makes it possible to use the AVKH database (see chapter 5) for investigating the phoneme-to-viseme mapping.
This dataset contains 1199 audiovisual sentences (138 min) from the open domain and 536 audiovisual sentences (52 min) from the limited domain of weather forecasts, which should be sufficient to analyse the multiple visual representations that can occur for each Dutch phoneme.

6.5.1 Tree-based clustering

6.5.1.1 Decision trees

To construct the phoneme-to-viseme mapping scheme, the data from the Dutch speech database was analysed by clustering the visual appearances of the phoneme instances in the database. Each instance was represented by three sets of combined AAM parameters, corresponding to the video frames at 25%, 50% and 75% of the duration of the phoneme instance, respectively. This three-point sampling was chosen to integrate dynamics in the measure, as the mouth appearance can vary during the uttering of a phoneme. As a clustering tool, multi-dimensional decision trees [Breiman et al., 1984] were used, similar to the technique suggested in [Galanes et al., 1998]. A decision tree is a data analysis tool that is able to cluster training data based on a number of decision features describing this data. To build such a tree, first a measure for the impurity in the training data must be defined. For this purpose, the distance d(p_i, p_j) between phoneme instances p_i and p_j was defined as the weighted sum of the Euclidean differences between the combined AAM parameters (c) of the video frames at 25%, 50% and 75% of the length of both instances:

d(p_i, p_j) = \frac{1}{4}\left\|c_i^{25} - c_j^{25}\right\| + \left\|c_i^{50} - c_j^{50}\right\| + \frac{1}{4}\left\|c_i^{75} - c_j^{75}\right\|    (6.1)

Next, consider a subset Z containing N phoneme instances. Equation 6.2 expresses for instance p_i its mean distance from the other instances in Z:

\mu_i = \frac{\sum_{j=1}^{N} d(p_i, p_j)}{N - 1}    (6.2)

Let \sigma_i denote the variance of these distances. For every instance in Z the value of \mu_i is calculated, from which the smallest value is selected as \mu_{best}. A final measure I_Z for the impurity of subset Z can then be calculated by

I_Z = N\,(\mu_{best} + \lambda\,\sigma_{best})    (6.3)

in which \lambda is a scaling factor. When the decision tree is constructed, the training data is split by asking questions about the properties of the data instances. At each split, the best question is chosen in order to minimize the impurity in the data (i.e., the sum of the impurities of all subsets). A tree-like structure is obtained since at each split new branches are created in order to group similar data instances. In each next step, each branch itself is further split by asking other questions that in turn minimize the impurity in the data. In the first steps of the tree building, the branching is based on big differences among the data instances, while the final splitting steps are based on only minor differences. For some branches, some of the last splitting steps can be superfluous. However, other branches do need many splitting steps in order to accurately cluster their data instances. A stop size must be defined as the minimal number of instances that are needed in a cluster. Obviously, the splitting also stops when no more improvement of the impurity can be found.
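The sketch below implements the distance and impurity measures of equations 6.1-6.3 (with the weighting as printed above) together with a greedy split-selection step; the data layout and the question representation are illustrative assumptions, not the actual tree-building code.

```python
import numpy as np

def distance(pi, pj):
    """Eq. 6.1: weighted sum of Euclidean distances between the combined
    AAM parameters sampled at 25%, 50% and 75% of both instances."""
    return (0.25 * np.linalg.norm(pi["c25"] - pj["c25"])
            + np.linalg.norm(pi["c50"] - pj["c50"])
            + 0.25 * np.linalg.norm(pi["c75"] - pj["c75"]))

def impurity(subset, lam=2.0):
    """Eqs. 6.2-6.3: impurity of a subset Z of phoneme instances."""
    n = len(subset)
    if n < 2:
        return 0.0
    dists = np.array([[distance(a, b) for b in subset] for a in subset])
    mu = dists.sum(axis=1) / (n - 1)           # eq. 6.2 for every instance
    best = int(np.argmin(mu))                  # instance with smallest mu
    # Variance of the distances from the selected instance to the others.
    sigma_best = dists[best][np.arange(n) != best].var()
    return n * (mu[best] + lam * sigma_best)   # eq. 6.3

def best_split(subset, questions):
    """Greedy step of the tree building: pick the yes/no question that
    minimizes the summed impurity of the two resulting subsets."""
    return min(
        questions,
        key=lambda q: impurity([p for p in subset if q(p)])
                      + impurity([p for p in subset if not q(p)]),
    )
```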
6.5.1.2 Decision features

To build the decision trees, each phoneme instance must be characterized by an appropriate set of features. Various possible features can be used, of which the identity of the phoneme (i.e., its name or corresponding symbol) is the most straightforward. Another key feature is a consonant/vowel (C/V) classification of the data instance. In addition, a set of phonetic features can be linked to each instance, based on the phonetic properties of the corresponding phoneme: vowel length, vowel type (short, long, diphthong, schwa), vowel height, vowel frontness, lip rounding, consonant type (plosive, fricative, affricate, nasal, liquid, trill), consonant place of articulation (labial, alveolar, palatal, labio-dental, dental, velar, glottal) and consonant voicing. Note that these features have been determined by phonetic knowledge of Dutch phonemes and that it can be expected that not all of them have an explicit influence on the visual representation of the phoneme. In addition, for each Dutch phoneme an additional set of purely visual features was calculated. To this end, several properties of the visual speech were measured throughout the database: mouth height, mouth width, the visible amount of teeth and the visible amount of mouth cavity (the dark area inside an open mouth). For each of these measures, the 49 distinct Dutch phonemes were labelled based on their mean value for that measure (labels "−−", "−", "+" and "++" were used for this). For example, the phoneme /a/ (the long "a" from the word "daar") has value "++" for both the mouth-height and the teeth feature, while the phoneme /o/ (the long "o" from the word "door") has value "++" for the mouth-height feature but value "−−" for the teeth feature.

In order to construct a many-to-many mapping scheme, the tree-based clustering has to be able to model the visual coarticulation effects. Therefore, not only features of the particular phoneme instance itself but also features concerning its neighbouring instances are used to describe the data instance. This way, instances of a single Dutch phoneme can be mapped on different visemes, depending on their context in the sentence.

[Figure 6.2: Difference between using the phoneme identity (left) and the C/V property (right) as pre-cluster feature.]

6.5.1.3 Pre-cluster

The complete database contains about 120000 phoneme instances. Obviously, it would require a very complex calculation to perform the tree-based clustering on the complete dataset at once. A common approach in decision tree analysis is to select a particular feature for pre-clustering the data: the data instances are first grouped based on this feature, after which a separate tree-based clustering is performed on each of these groups. Two different options for this pre-cluster feature were investigated. In a first approach, the identity of the phoneme corresponding to each instance was chosen as pre-cluster feature. This implies that for each Dutch phoneme, a separate tree will be constructed. In another approach, the consonant/vowel property was used to pre-cluster the data. This way, only two large trees are calculated: a first one to cluster the data instances corresponding to a vowel and another tree to cluster the instances corresponding to a consonant. This second approach makes it possible for two different Dutch phonemes, each in a particular context, to be mapped on the same tree-based viseme, as is illustrated in figure 6.2.

6.5.1.4 Clustering into visemes

Once a pre-cluster feature has been selected, it has to be decided which features are used to build the decision trees. Many configurations are possible, since features from the instance itself as well as features from its neighbours can be applied.
In addition, a stop-size has to be chosen, which corresponds to the minimal number of data instances that should reside in a node. This parameter has to be chosen small enough to ensure an in-depth analysis of the data. On the other hand, an end-node of the decision tree is characterized by the mean representation of its data instances. Therefore, the minimal number of instances in an end-node should be adequate to cope with inaccuracies in the training data (e.g., local phonemic segmentation errors). After extensive testing (using experiments similar to those described in the next section), two final configurations for the tree-based clustering were defined, as described in table 6.2.

Table 6.2: Tree configurations A and B that were used to build the decision trees that map phonemes to visemes.
  Pre-cluster feature:                   A: phoneme identity | B: C/V classification
  Features of the current instance:      A: none | B: phoneme identity, phonetic features, visible features
  Features of neighbouring instances:    A: phonetic features, visible features | B: phoneme identity, phonetic features, visible features

For the clustering calculations, the distance between two data samples was calculated using equation 6.1 and the impurity of the data was expressed as in equation 6.3 (using λ = 2). In configuration A, a separate tree is built for each Dutch phoneme. Each of these trees defines a partitioning of all the training instances of a single phoneme based on their context. This context is described using both the set of phonetic and the set of visible features that were described in section 6.5.1.2, which should be sufficient to model the influence of the context on the dynamics of the current phoneme. Alternatively, in configuration B only two large trees are calculated. As these trees are built using a huge amount of training data (for each tree, a maximum of 30000 uniformly sampled data instances was chosen), all possible features of both the instance itself and its neighbouring instances are given as input to the tree-building algorithm. Although describing a data instance by both its phoneme identity and its phonetic/visible features introduces some redundancy, this way a maximal number of features is available to efficiently and rapidly decrease the impurity in the large data set; the clustering algorithm itself will determine which features to use for this purpose. For both configurations A and B, the trees were built using stop-sizes of 25 and 50 instances, resulting in trees "A25", "A50", "B25" and "B50".
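The greedy splitting that produces these trees can be sketched as follows. This is a simplified illustration only (binary yes/no questions on categorical features, reusing the impurity() measure sketched above); the helper names and data layout are assumptions rather than the actual implementation.

```python
# Simplified sketch of the greedy tree growing described above. Each instance is a
# dict with its categorical decision features ("features") and its three AAM vectors
# ("aam"); "questions" is a list of (feature, value) yes/no questions.

def best_split(instances, questions, stop_size):
    parent = impurity([x["aam"] for x in instances])
    best = None
    for feat, value in questions:
        yes = [x for x in instances if x["features"].get(feat) == value]
        no = [x for x in instances if x["features"].get(feat) != value]
        if len(yes) < stop_size or len(no) < stop_size:
            continue                                   # respect the minimal cluster size
        score = impurity([x["aam"] for x in yes]) + impurity([x["aam"] for x in no])
        if score < parent and (best is None or score < best[0]):
            best = (score, feat, value, yes, no)
    return best                                        # None: no admissible question improves the impurity

def grow_tree(instances, questions, stop_size=25):
    split = best_split(instances, questions, stop_size)
    if split is None:
        return {"leaf": instances}                     # end-node, i.e. one tree-based viseme
    _, feat, value, yes, no = split
    return {"question": (feat, value),
            "yes": grow_tree(yes, questions, stop_size),
            "no": grow_tree(no, questions, stop_size)}
```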
6.5.1.5 Objective candidate test

As was already mentioned in the previous section, the large number of possible tree configurations imposes the need for an objective measure to assess the quality of the tree-based mapping from phonemes to visemes. An objective test was designed in which a number of database sentences are resynthesized using the concatenative visual speech synthesizer. For every synthesis, the target original sentence is excluded from the database. The original database transcript is used as text input and the original database auditory speech is used as audio input. Both the database labelling and the description of the synthesis targets are written in terms of the tree-based visemes. As usual, the synthesis involves a set of candidate segments being determined for each synthesis target. To calculate a quality measure for the applied speech labelling, these candidate segments are first ranked in terms of their total target cost. Next, the n best candidates are selected and their distance from the ground truth is measured using the three-point distance that was described in equation 6.1. Finally, for each synthesis target a single error value is calculated by computing the mean distance over these n best candidates. Since the resynthesis of a fixed group of database sentences using different speech labels defines corresponding synthesis targets for each of these label sets (as the original target phoneme sequence is the same for each approach), the calculation of the mean candidate quality for each synthesis target produces paired-sample data that can be used to compare the accuracy of the different speech labelling approaches.

Using this objective measure, tree configurations A and B (see section 6.5.1.4) could be identified as high quality clustering approaches. In figure 6.3 their performance is visualized, using the n = 50 best candidates and omitting the silence targets from the calculation since they are 1x1 mapped on the silence viseme. Figure 6.3 also shows the influence of including the visual context cost in the calculation of the total target cost (see section 6.1.2). Two reference methods were added to the experiment. The first reference result, referred to as "PHON", was measured using a phoneme-based description of the database and the synthesis targets. For the second reference approach, referred to as "STDVIS", an Nx1 phoneme-to-viseme mapping for Dutch was constructed, based on the 11 viseme classifications described by Van Son et al. [Van Son et al., 1994]. These Dutch "standardized" visemes are based on both subjective perception experiments and prior phonetic knowledge about the uttering of Dutch phonemes.

[Figure 6.3: Candidate test results obtained for a synthesis based on phonemes, a synthesis based on standardized Nx1 visemes and multiple syntheses based on tree-based NxM visemes (mean distances). The "CC" values were obtained by incorporating the visual context target cost to determine the n-best candidates.]

A statistical analysis on the values obtained without using the visual context target cost, using ANOVA with repeated measures and Greenhouse-Geisser correction, indicated significant differences among the values obtained for each group (F(3.70, 5981) = 281; p < 0.001). An analysis using paired-sample t-tests indicated that phonemes describe the synthesis targets more accurately than the standard Nx1 visemes (p < 0.001). This result is in line with the results that were obtained for English (see section 6.4.2), where a synthesis based on phonemes outperformed all syntheses based on Nx1 visemes. In addition, all tree-based labelling approaches perform significantly better than both the phoneme-based labelling and the standard Nx1 viseme-based labelling (p < 0.001). Thus, unlike the Nx1 visemes, the tree-based NxM phoneme-to-viseme mappings define an improved description of the visual speech information in comparison with phonemes. The results obtained show only minor, non-significant differences between the various tree configurations. In addition, an improvement of the candidates can be noticed when the context target cost is applied, especially for the phoneme-based and Nx1 viseme-based labels. As this context cost is used to model the visual coarticulation effects, it is logical that its usage has less influence on the results obtained for the NxM viseme-based labels, as they intrinsically model the visual coarticulation themselves. Note that, even when the context target cost was applied, the NxM viseme labels performed significantly better than the phoneme-based and Nx1 viseme-based labels (p < 0.001).
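The per-target error value of this candidate test could be computed as in the sketch below, which reuses the instance_distance() sketch given earlier; the data layout is again illustrative rather than the actual implementation.

```python
import numpy as np

# Sketch of the candidate test: for one synthesis target, rank the candidate segments
# by their total target cost, keep the n best and report their mean three-point
# distance (equation 6.1) to the ground-truth realisation of that target.
def candidate_error(candidate_aams, target_costs, ground_truth_aam, n=50):
    order = np.argsort(target_costs)[:n]               # n-best candidates by total target cost
    dists = [instance_distance(candidate_aams[i], ground_truth_aam) for i in order]
    return float(np.mean(dists))

# Repeating this for a fixed set of resynthesised sentences under each labelling
# scheme yields one error value per target and per scheme, i.e. the paired samples
# that are compared with repeated-measures ANOVA and paired t-tests above.
```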
6.5.2 Towards a useful many-to-many mapping scheme

6.5.2.1 Decreasing the number of visemes

Using the decision trees, the phoneme instances from the Dutch speech database were clustered into small subsets. The viseme corresponding to a given phoneme instance is determined by the traversal of such a tree based on various properties of the instance itself and on the properties of its context. Given the extensive amount of training data and the large number of decision features, the tree-based clustering results in a large number of distinct visemes: 1050 for the A25 tree, 650 for the A50 tree, 800 for the B25 tree and 412 for the B50 tree. These big numbers are partly caused by the fact that an extensive analysis with many splitting steps was applied during the tree building. This is necessary since some pre-clusters contain a large number of diverse data instances. On the other hand, for the splitting of the data instances from some of the other pre-clusters (e.g., pre-clusters corresponding to less common phonemes), fewer splitting steps would have been sufficient. Consequently, the tree-based splitting has not only resulted in a large number of end-nodes but also in an "over-splitting" of some parts of the dataset. Another reason for the large number of tree-based visemes is the fact that the pre-clustering step makes it impossible for the tree-clustering algorithm to combine similar data instances from different pre-clusters into the same node. Therefore, it can be assumed that for each tree configuration, many of its tree-based visemes are in fact similar enough to be considered as one single viseme.

The standardized Nx1 viseme mapping identifies 11 distinct visual appearances for Dutch. A good quality automatic viseme classification can be expected to define at most a few more visemes. Since the tree-based clustering results in the definition of a much larger number of visemes, more useful NxM phoneme-to-viseme mapping schemes were constructed by performing a new clustering on the tree-based visemes themselves. First, a general description of each tree-based viseme defined by a particular tree configuration was determined. To this end, for each end-node of the tree a mean set of combined AAM parameter values was calculated, sampled over all phoneme instances that reside in this node. Since the original phoneme instances were sampled at three distinct points, the tree-based visemes are also described by three sets of combined AAM parameter values (describing the visual appearance at 25%, 50% and 75% of the viseme's duration). Next, based on their combined AAM parameters, all tree-based visemes were clustered using a k-means clustering approach. Note that for k-means clustering, the number of clusters has to be determined beforehand. Estimating this number using a heuristic approach, in which the k-means clustering is performed successively with an increasing cluster count and the final number of clusters is chosen graphically at the step where the marginal gain in the percentage of variance explained by the clusters drops, resulted in about 20 clusters for all tree configurations.
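A sketch of this cluster-count heuristic is given below. It assumes scikit-learn is available and uses the k-means inertia to express the fraction of variance explained; in practice the elbow is read off a plot, and the vector layout is an assumption.

```python
import numpy as np
from sklearn.cluster import KMeans

# Sketch of the elbow-type heuristic described above: run k-means for an increasing
# number of clusters and track the fraction of variance explained; the final cluster
# count is chosen where the marginal gain drops.
def explained_variance_curve(viseme_vectors, k_values):
    """viseme_vectors: (n_tree_visemes, n_dims) array, one mean AAM description per tree-based viseme."""
    total = ((viseme_vectors - viseme_vectors.mean(axis=0)) ** 2).sum()
    curve = []
    for k in k_values:
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(viseme_vectors)
        curve.append((k, 1.0 - km.inertia_ / total))   # fraction of variance explained by k clusters
    return curve

# e.g.: for k, ev in explained_variance_curve(vectors, range(2, 61)): print(k, round(ev, 3))
```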
For the k-means clustering calculations, two different distances between the tree-based visemes were defined: the Euclidean difference between the combined AAM parameters of the frames at the middle of the visemes, and the weighted sum of the distances between the combined AAM parameters of the frames at 25%, 50% and 75% of the duration of the visemes (equation 6.1). Using the first distance measure, clusterings into 11 and 20 visemes were calculated, chosen to match the standard Nx1 viseme mapping and the outcome of the heuristic method, respectively. In addition, using the three-point distance measure, an extra clustering into 50 distinct visemes was calculated. A larger number of clusters was chosen here since this distance measure incorporates the dynamics of the visemes, which is likely to result in the existence of more distinct visemes. Also, this number is the same as the number of Dutch phonemes that were used for the initial database labelling, which might be useful for comparison later on. For reasons of time, only the A25 and B25 trees were processed, since these define a more in-depth initial segmentation of the training data. In addition, a single clustering of the visemes defined by the B50 tree was performed for verification, as described in table 6.3.

Table 6.3: Mapping from tree-based visemes to final NxM visemes.
             11 clusters   20 clusters   50 clusters
  Tree A25   A25_11        A25_20        A25_50
  Tree B25   B25_11        B25_20        B25_50
  Tree B50   -             B50_20        -

6.5.2.2 Evaluation of the final NxM visemes

In order to evaluate the final NxM phoneme-to-viseme mapping schemes, an evaluation of the n-best synthesis candidates similar to the one used for the tree-based visemes (see section 6.5.1.5) was performed. Phoneme-based speech labels, standard Nx1 viseme-based labels and a labelling using the tree-based visemes A25 and B25 were added as references. Figure 6.4 illustrates the test results. The n = 50 best candidates were used, silences were omitted from the calculation, and the visual context was not taken into account when calculating the total target cost.

[Figure 6.4: Candidate test results obtained for a synthesis based on phonemes, a synthesis based on standardized Nx1 visemes and multiple syntheses based on the final NxM visemes (mean distances). Some results obtained using tree-based NxM visemes are added for comparison purposes.]

A statistical analysis using ANOVA with repeated measures and Greenhouse-Geisser correction indicated significant differences among the values obtained for each group (F(5.36, 13409) = 233; p < 0.001). An analysis using paired-sample t-tests indicated that all final NxM visemes significantly outperform a labelling based on phonemes and a labelling based on the standard Nx1 visemes (p < 0.001). This is an important result since, unlike for the tree-based visemes, these final NxM mapping schemes define only a limited number of distinct speech labels.
The NxM mappings on 11 and 20 distinct visemes result in a more accurate speech labelling than a phoneme-based labelling, despite the fact that they use less than half the number of distinct labels. In addition, it can be seen that for both configuration A25 and configuration B25, better test results are obtained when more distinct NxM viseme labels are defined (p < 0.001). This means that the NxM mappings on 50 visemes are indeed modelling some extra differences compared with the NxM mappings on 20 visemes. From the test results it can also be concluded that the corresponding mappings derived from tree configurations A and B perform comparably. When the results obtained for the final NxM visemes are compared with the results obtained for the tree-based visemes, the latter perform best (p < 0.001). However, this difference can still be considered rather small given the fact that for the tree-based visemes, the number of distinct speech labels is up to a factor of 50 higher than for the final NxM visemes (e.g., 20 distinct labels for the A25_20 approach versus 1050 distinct labels for the A25 mapping scheme).

6.6 Application of many-to-many visemes for concatenative visual speech synthesis

[Figure 6.5: Relation between the visual speech synthesis stages and the objective measures. The diagram shows stage A (candidate units are determined for the synthesis targets from the database, using target costs), stage B (the final unit is selected from the candidates by minimizing target and join costs) and stage C (the output speech is produced by concatenation, optimization and synchronization). In section 6.5.1.5 and section 6.5.2.2 the accuracy of the speech labels was tested in stage A, while the attainable synthesis quality can be measured by evaluating the final selection in stage B.]

6.6.1 Application in a large-database system

In a first series of experiments, the final NxM viseme labelling was applied for concatenative visual speech synthesis using the AAM-based VTTS system provided with the open-domain part of the AVKH audiovisual speech database. The experiments are similar to the candidate test that was described in section 6.5.1.5, except that for actual synthesis it is not the quality of the n-best candidate segments but the quality of the final selected segment that is important. This final selection is based on the minimization of both target and join costs (see figure 6.5). Given the large size of the speech database, many candidates are available for each synthesis target. In order to reduce the calculation time, the synthesizer uses only the 700 best candidates (in terms of total target cost) for further selection. In order to objectively assess the attainable synthesis quality using a particular speech labelling, a synthesis experiment was conducted in which 200 randomly selected sentences from the database were resynthesized. For every synthesis, the target original sentence was excluded from the database. The original database transcript was used as text input and the original database auditory speech was used as audio input. During synthesis, the synthesizer selects for each synthesis target the most optimal segment from the database. An objective measure was calculated by comparing each of these selected segments with its corresponding ground-truth segment, using the weighted distance that was described in equation 6.1. Silences were omitted from the calculation.
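The final selection in stage B, which this experiment evaluates, amounts to a dynamic-programming search over the pruned candidate lists. The sketch below illustrates this; target_cost() and join_cost() stand in for the synthesizer's actual (weighted, multi-part) selection costs and are assumptions rather than the real implementation.

```python
import numpy as np

# Minimal dynamic-programming sketch of stage B: pick one candidate per synthesis
# target so that the summed target and join costs over the utterance are minimal.
# Candidate lists are assumed to be pre-pruned to the n-best by target cost
# (700 in the experiment above).
def select_units(candidate_lists, target_cost, join_cost):
    prev_scores = np.array([target_cost(c) for c in candidate_lists[0]])
    backptr = []
    for t in range(1, len(candidate_lists)):
        scores, ptrs = [], []
        for c in candidate_lists[t]:
            trans = prev_scores + np.array([join_cost(p, c) for p in candidate_lists[t - 1]])
            best = int(np.argmin(trans))          # cheapest predecessor for this candidate
            ptrs.append(best)
            scores.append(trans[best] + target_cost(c))
        prev_scores = np.array(scores)
        backptr.append(ptrs)
    # Trace back the cheapest path through the candidate lattice.
    path = [int(np.argmin(prev_scores))]
    for ptrs in reversed(backptr):
        path.append(ptrs[path[-1]])
    path.reverse()
    return [candidate_lists[t][i] for t, i in enumerate(path)]
```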
Figure 6.6 illustrates the test results obtained for some important NxM viseme-based labelling schemes that have been described in section 6.5.2 and for a baseline system using phoneme-based speech labels.

[Figure 6.6: Evaluation of the segment selection using a large database (mean distances). Both the results obtained using a phoneme-based speech labelling and the results obtained using multiple final NxM viseme sets are shown.]

A statistical analysis using ANOVA with repeated measures and Huynh-Feldt correction indicated significant differences among the values obtained for each group (F(2.89, 31378) = 81.2; p < 0.001). Section 6.5.2.2 described how an objective experiment pointed out that the best mean candidate quality (measured over the 50 best candidates) is attained by a synthesis based on NxM visemes. The current experiment, however, shows a different behaviour for the quality of the final selected segment. In this case, a selection based on phonemes performs as well as the best result that was obtained using NxM viseme labels. An analysis using paired-sample t-tests indicated that it performs significantly better than the syntheses based on NxM viseme labels that use 11 or 20 distinct speech labels (p < 0.001). This can be understood as follows. The main reason why a synthesis would profit from the use of only a few distinct speech labels is the increased number of candidates for each synthesis target. For the current experiment, however, the synthesizer's database is very large, resulting in a huge number of candidates for each synthesis target for each of the different speech labelling strategies. This justifies the use of a larger number of distinct speech labels, since it will refine the segment selection provided that the labelling is sufficiently accurate.

In the current experiment the phoneme-based system performed as well as the system using 50 distinct NxM visemes. From section 6.5.2.2 it is known that the phoneme-based labelling is less accurate than this particular NxM viseme-based labelling. For synthesis, however, the selection of the final segment from all possible candidates is based on both target and join costs, meaning that the selection of a high quality final segment from an overall lower-quality set of candidate segments is possible when accurate selection costs are used. Moreover, there is no reason to assume that the final selected segment will be one of the n-best candidate segments based on the target cost alone. This could explain why, for the current test, none of the syntheses based on NxM visemes is able to outperform the synthesis based on phoneme labels. To check this assumption, the same 200 sentences were synthesized again with another set of selection costs: the target cost based on the context was omitted and the influence of the join costs was reduced in favour of the influence of the target costs. Both the phoneme-based speech labelling and the NxM viseme-based labelling that scored best in the previous test were used.
Obviously, the quality of the final selected segments decreased in this new synthesis setup, as visualized in figure 6.7.

[Figure 6.7: Evaluation of the segment selection using a large database: optimal selection costs (the two leftmost results) and alternative selection costs (the two rightmost results) (mean distances).]

More importantly, for this new synthesis the NxM viseme-based result was found to be significantly better than the phoneme-based segment selection (paired t-test; t = 11.6; p < 0.001). So it appears that when non-optimal selection costs are applied, the more accurate labelling of the speech by means of the NxM visemes does improve the segment selection quality. One possible reason for this is that the join costs should partially model the visual coarticulation effects as well, since they push the selection towards a segment that fits well with its neighbouring segments in the synthetic speech signal. From these results it can be concluded that for synthesis using a large database, the use of more distinct speech labels than the theoretical minimum (11 for Dutch) is preferable. In addition, given the large number of candidates that can be found for each target, a precise definition of the selection costs is able to conceal the differences between the accuracy of the different speech labelling approaches. Note, however, that the use of NxM viseme-based speech labels for synthesis using a large database can speed up the synthesis process. During synthesis, the heaviest calculation consists of the dynamic search among the consecutive sets of candidate segments to minimize the global selection cost. In addition, it has been shown that the use of viseme-based labels improves the overall quality of the n-best candidate segments. Therefore, the use of viseme labels permits a reduction of the number of candidate segments in comparison with a synthesis based on phoneme labels. Consequently, it will be easier to determine an optimal set of final selected segments, which results in reduced synthesis times.

6.6.2 Application in limited-database systems

In the previous section the use of NxM visemes for concatenative visual speech synthesis using a large database was evaluated. In practice, however, most visual speech synthesis systems use a much smaller database from which the speech segments are selected. The main reason why the use of Nx1 visemes for visual speech synthesis purposes is well-established is the limited amount of speech data that is necessary to cover all visemes or di-visemes of the target language. Therefore it is useful to evaluate the use of NxM viseme-based speech labels in such a limited-database system. In this case, the number of available candidates for each synthesis target will be much smaller than in the large-database system that was tested in section 6.6.1. It is interesting to evaluate how this affects the quality of the syntheses based on the various speech labelling approaches.

6.6.2.1 Limited databases

From the open-domain part of the AVKH audiovisual database, several subsets were selected, each defining a new limited database. The selection of these subsets was performed by a sentence selection algorithm that ensures that for each of the speech labelling approaches under test (phoneme, standard Nx1 viseme and several NxM visemes), the subset contains at least n instances of each distinct phoneme/viseme that is defined in the label sets; a sketch of such a selection is given below. The subsets obtained are summarized in table 6.4. For n = 3, the sentence selection was run twice, which resulted in the distinct subsets DB1 and DB2.

Table 6.4: Construction of limited databases.
  Database name   n   Database size
  DB1             3   33 sent.
  DB2             3   42 sent.
  DB3             4   47 sent.
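The text does not detail the sentence selection algorithm itself; the greedy variant sketched below is one plausible way to obtain such coverage-constrained subsets and is given purely as an illustration.

```python
from collections import Counter

# Illustrative greedy sketch of a coverage-driven sentence selection (the actual
# algorithm used for DB1-DB3 is not specified). label_sets maps each labelling scheme
# to the set of its distinct labels; each sentence is a dict mapping a scheme name to
# the list of labels occurring in that sentence.
def select_subset(sentences, label_sets, n=3):
    counts = {scheme: Counter() for scheme in label_sets}
    chosen = []

    def missing(scheme, lab):
        return max(0, n - counts[scheme][lab])

    def total_deficit():
        return sum(missing(s, lab) for s in label_sets for lab in label_sets[s])

    while total_deficit() > 0:
        best, best_gain = None, 0
        for i, sent in enumerate(sentences):
            if i in chosen:
                continue
            # How much would adding this sentence reduce the remaining deficit?
            gain = sum(min(c, missing(s, lab))
                       for s in label_sets
                       for lab, c in Counter(sent[s]).items())
            if gain > best_gain:
                best, best_gain = i, gain
        if best is None:
            break                        # remaining labels cannot be covered by the data
        chosen.append(best)
        for s in label_sets:
            counts[s].update(sentences[best][s])
    return chosen                        # indices of the selected sentences
```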
6.6.2.2 Evaluation of the segment selection

The attainable synthesis quality using these limited databases was assessed by the same objective test that was described in section 6.6.1. In a first run, 120 sentences were resynthesized using DB1. Various NxM viseme-based labelling approaches as well as two baseline systems (phoneme-based and standard Nx1 viseme-based) were used. The results obtained are visualized in figure 6.8.

[Figure 6.8: Evaluation of the selected segments using the limited database DB1 (mean distances). The results obtained using a phoneme-based speech labelling, the results obtained using the standardized Nx1 viseme labelling and the results obtained using various final NxM viseme sets are shown.]

A statistical analysis using ANOVA with repeated measures and Huynh-Feldt correction indicated significant differences among the values obtained for each group (F(6.62, 71849) = 50.7; p < 0.001). An analysis using paired-sample t-tests indicated that for a synthesis using a limited database, all NxM viseme-based labelling approaches result in the selection of significantly better segments than a synthesis based on phonemes or Nx1 visemes (p < 0.001). In addition, in this test the viseme labels using 11 and 20 distinct visemes scored better than the label sets that use 50 distinct visemes (p < 0.001). This could be explained by the fact that, due to the limited amount of available speech data, the mappings on 50 visemes result in considerably fewer candidate segments for each synthesis target in comparison with the approaches that use fewer distinct speech labels. This assumption is in line with the results that were shown in figure 6.6, where it was found that when a large amount of speech data is available, the viseme sets using 50 distinct labels perform best. It is worth mentioning that for the current test, the viseme sets using 50 distinct labels still performed significantly better than the phoneme-based labelling, which uses the same number of distinct speech labels. Similarly, the syntheses based on 11 distinct NxM visemes performed significantly better than the synthesis based on the 11 standard Nx1 visemes. These are important results, since they show that an NxM viseme labelling does help to improve the segment selection quality in the case where not that much speech data is available. To verify these results, similar experiments were conducted using other sets of target sentences and other limited databases (see figure 6.9 for some examples).

[Figure 6.9: Evaluation of the selected segments using the limited database DB2 (upper panel) and the limited database DB3 (lower panel) (mean distances).]

For all these experiments, results similar to the results discussed above were obtained.
6.6.2.3 Evaluation of the synthetic visual speech

While the tests described in section 6.6.1 and section 6.6.2.2 evaluated the effect of the speech labelling approach on the quality of the segment selection, and thus on the attainable speech quality, some final experiments were conducted in order to assess the achieved quality of the visual speech synthesizer as a whole (denoted as stage C in figure 6.5). For this, instead of evaluating the quality of the segment selection itself, the quality of the final visual output speech resulting from the concatenation of the selected segments is assessed. This concatenation involves some optimizations, such as a smoothing of the parameter trajectories at the concatenation points and an additional smoothing by a low-pass filtering technique (see section 4.4). In addition, the concatenated speech is non-uniformly time-scaled to achieve synchronization with the target segmentation of the auditory speech. It is interesting to evaluate the effect of these final synthesis steps on the observed differences between the different speech labelling approaches.

Objective evaluation

To objectively assess the quality of the synthesized speech, 70 randomly selected sentences from the full-size database were resynthesized. The VTTS system was provided with the limited database DB1. It was ensured that the target sentences were not part of this database. The original database transcript was used as text input and the original database auditory speech was used as audio input. Optimal settings were applied for the concatenation smoothing and for the other synthesis optimizations. To measure the quality of the synthesized visual speech, the final combined AAM parameter trajectories describing the output speech were compared with the ground-truth trajectories from the speech database. As distance measure, a dynamic time warping (DTW) cost was used. Dynamic time warping [Myers et al., 1980] is a time-series similarity measure that minimizes the effects of shifting and distortion in time by allowing elastic transformation of the time series in order to detect similar shapes with different phases. A DTW from a time series X = (x_1, x_2, ..., x_n) to a time series Y = (y_1, y_2, ..., y_m) first involves the calculation of the local cost matrix L representing all pairwise differences between X and Y. A warping path between X and Y can be defined as a series of tuples (x_i, y_j) that defines the correspondences from elements of X to elements of Y. When this warping path satisfies certain criteria, such as a boundary condition, a monotonicity condition and a step-size condition, it defines a valid warp from series X towards series Y (the interested reader is referred to [Senin, 2008] for a detailed explanation of this technique). A warping cost can be associated with each warping path by adding all local costs collected by traversing matrix L along the warping path. Through dynamic programming, the DTW algorithm searches for the optimal warping path between X and Y that minimizes this warping cost. This optimal warping path defines a useful distance measure between series X and Y through its associated warping cost. For the current experiment, a synthesized sentence was assessed by first calculating for each combined AAM parameter trajectory a DTW distance measure as the cost of warping the synthesized parameter trajectory towards the ground-truth trajectory. This value is then normalized by the length of the warping path in order to cancel out the influence of the length of the sentence. For each sentence, a final distance measure was calculated as the weighted sum of these normalized DTW costs, where the weights were chosen according to the total model variance that is explained by each AAM parameter.
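A minimal sketch of this per-trajectory measure is given below. It implements only the basic DTW recursion with unit steps and the path-length normalisation described above; the exact step conditions and weighting used in the experiments may differ.

```python
import numpy as np

# Minimal DTW sketch for one parameter trajectory: accumulate the cheapest warping
# path through the local cost matrix and return the path-length-normalised cost,
# as used above to compare a synthesised trajectory with its ground truth.
def dtw_cost(x, y):
    """x, y: 1-D arrays holding one combined-AAM parameter trajectory each."""
    n, m = len(x), len(y)
    L = np.abs(x[:, None] - y[None, :])                # local cost matrix
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i, j] = L[i - 1, j - 1] + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    # Recover the warping-path length by tracing back the cheapest predecessors.
    i, j, steps = n, m, 0
    while (i, j) != (0, 0):
        steps += 1
        moves = {(i - 1, j - 1): D[i - 1, j - 1], (i - 1, j): D[i - 1, j], (i, j - 1): D[i, j - 1]}
        i, j = min(moves, key=moves.get)
    return float(D[n, m]) / steps                      # normalise by the path length

# A sentence-level score is then the weighted sum of these normalised costs over all
# AAM parameters, weighted by the model variance each parameter explains.
```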
The use of a DTW cost has been suggested in [Theobald and Matthews, 2012], where it was concluded that this distance correlates well with subjective evaluations, as it measures the frame-wise distance between the synthetic and ground-truth visual speech as well as the temporal relationship between these two signals. The results obtained are visualized in figure 6.10.

[Figure 6.10: DTW-based evaluation of the final synthesis result (mean distances). The results obtained using a phoneme-based speech labelling, the results obtained using the standardized Nx1 viseme set and the results obtained using various final NxM viseme sets are shown.]

A statistical analysis using ANOVA with repeated measures and Huynh-Feldt correction indicated significant differences among the values obtained for each group (F(3.40, 254) = 4.05; p = 0.006). An analysis using paired-sample t-tests indicated that the quality of the synthesis output based on NxM visemes is higher than the quality of the synthesis output based on phonemes (p = 0.069 (B25_11), p = 0.01 (B25_20), p = 0.048 (B25_50)) and on Nx1 visemes (p = 0.025 (B25_11), p = 0.002 (B25_20), p = 0.020 (B25_50)). The best results were obtained using the B25_20 visemes, although the differences among the different NxM viseme sets were not found to be significant. These results are in line with the evaluations of the segment selection quality that were described in section 6.6.2.2. However, in the current test the labelling approach using 50 distinct visemes scores as well as the other NxM viseme-based approaches. It appears that the concatenations and optimizations in the final stages of the synthesis partly cover the quality differences between the various speech labelling approaches that were measured at the segment selection stage.

Subjective evaluation

In addition to the objective evaluation, two subjective perception experiments were performed in order to compare the achieved synthesis quality using the different speech labelling approaches. For this, 20 randomly selected sentences from the full-size database were resynthesized, using the limited database DB1 as selection data set. It was ensured that the target sentences were not part of this database. The original database transcript was used as text input and the original database auditory speech was used as audio input. Optimal settings were used for the concatenation smoothing and for the other synthesis optimizations. For verification purposes, the DTW-based objective evaluation that was described in section 6.6.2.3 was repeated for the 20 sentences that were used in the subjective experiments, which resulted in observations comparable to the results that were obtained using the larger test set. In a first subjective test, the differences between the NxM viseme sets B25_11, B25_20, B25_50 and A25_20 were investigated.
10 people participated in the experiment (8 male, 2 female; 9 of them aged 24-32, 1 aged 60), of whom 7 can be considered speech technology experts. The samples were shown pairwise to the participants, considering all comparisons among the four approaches under test. Both the order of the comparison types and the order of the sample types within each pair were randomized. The participants were asked to give their preference for one of the two samples of each pair using a 5-point comparative MOS scale [-2, 2]. They were instructed to answer "0" if they had no clear preference for one of the two samples. The test instructions told the participants to pay attention both to the naturalness of the mouth movements and to how well these movements cohere with the auditory speech that is played along with the video. The key question of the test read as follows: "How much are you convinced that the person you see in the sample actually produces the auditory speech that you hear in the sample?". The results of the test are visualized in figure 6.11, and the results of an analysis using Wilcoxon signed-rank tests are given in table 6.5.

[Figure 6.11: Subjective test results evaluating the synthesis quality using various NxM visemes. The histograms show for each comparison the participants' preference for the left/right sample type on a 5-point scale [-2, 2].]

Table 6.5: Subjective test results evaluating the synthesis quality using various NxM visemes. Wilcoxon signed-rank analysis.
  Comparison          Z         Sign.
  B25_11 - B25_20    -1.54      p = 0.122
  B25_20 - B25_50    -0.586     p = 0.558
  B25_11 - B25_50    -0.607     p = 0.544
  B25_20 - A25_20    -1.23      p = 0.22

The results show a slight preference for the syntheses using the B25_20 and B25_50 visemes, but none of the differences between the methods were shown to be significant (see table 6.5). This result is in line with the results obtained in the objective DTW-based evaluation. In conclusion, the B25_20 labels were selected as the most preferable viseme set for synthesis using database DB1, due to the slight preference for this method in both the objective and the subjective test. In addition, this labelling fits best with the assumption that an automatic viseme classification should identify more than 11 visemes but fewer than the number of phonemes.

In a final experiment, the NxM viseme labelling approach B25_20 was subjectively compared with a phoneme-based and an Nx1 viseme-based synthesis. A new perception experiment was conducted, with a set-up similar to the previous experiment. In the current test all comparisons between the three approaches under test were evaluated by 11 participants (9 male, 2 female, aged 24-60). Six of them can be considered speech technology experts. Figure 6.12 visualizes the test results obtained, and the results of an analysis using Wilcoxon signed-rank tests are given in table 6.6.

[Figure 6.12: Subjective test results evaluating the synthesis quality using the most optimal NxM visemes, the standardized Nx1 visemes and a phoneme-based speech labelling. The histograms show for each comparison the participants' preference for the left/right sample type on a 5-point scale [-2, 2].]

Table 6.6: Subjective test results evaluating the synthesis quality using the most optimal NxM visemes. Wilcoxon signed-rank analysis.
  Comparison          Z         Sign.
  PHON - B25_20      -2.08      p = 0.037
  PHON - STDVIS      -0.064     p = 0.949
  B25_20 - STDVIS    -3.17      p = 0.002
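For completeness, the kind of analysis reported in tables 6.5 and 6.6 can be reproduced with a standard statistical package; the sketch below assumes SciPy and uses invented scores purely for illustration.

```python
from scipy.stats import wilcoxon

# Sketch of the analysis behind tables 6.5 and 6.6, assuming SciPy is available.
# Each entry is one comparative MOS score on the [-2, 2] scale for one sample pair of
# a given comparison; the numbers below are made up for illustration only.
scores_B25_20_vs_STDVIS = [1, 0, 2, 1, 0, 1, -1, 2, 0, 1, 1]
stat, p = wilcoxon(scores_B25_20_vs_STDVIS)   # tests the scores against "no preference"
print(f"W = {stat}, p = {p:.3f}")             # SciPy reports the W statistic rather than a Z value
```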
The results obtained show behaviour similar to the results from the DTW-based objective evaluation that was described in section 6.6.2.3. The synthesis based on the B25_20 NxM visemes was rated significantly better than the synthesis based on phonemes and the synthesis based on standard Nx1 visemes. On the other hand, from the histograms it is clear that for many comparison pairs the test subjects answered "no difference". Also, in a substantial number of cases the phoneme-based synthesis was preferred over the NxM viseme-based synthesis. Feedback from the test subjects and a manual inspection of their answers pointed out that they often found it difficult to assess the difference between the two test samples of each comparison pair. This was mainly due to small local errors in the synthetic visual speech. These local audiovisual incoherences degraded the perceived quality of the whole sample, even if the sample was of overall higher quality than the other test sample of the comparison pair. This observation is in line with earlier observations in similar subjective perception experiments described in this thesis.

6.7 Summary and conclusions

For some time now, the use of visemes to label visual speech data has been well-established. This labelling approach is often used in visual speech analysis or synthesis systems, which construct the mapping from phonemes to visemes as a many-to-one relationship. In this chapter the usage of both standardized and speaker-dependent English many-to-one phoneme-to-viseme mappings in concatenative visual speech synthesis was evaluated. A subjective experiment showed that the viseme-based syntheses were unable to increase (or even decreased) the attained synthesis quality compared to a phoneme-based synthesis. This is likely explained by the limited power of many-to-one phoneme-to-viseme mappings to accurately describe the visual speech information. As every instance of a particular phoneme is mapped on the same viseme, these viseme labels are incapable of describing visual coarticulation effects. This implies the need for a many-to-many phoneme-to-viseme mapping scheme, in which on the one hand instances of different phonemes can be mapped on the same viseme, and on the other hand multiple instances of the same phoneme can be mapped on different visemes.

Using a large Dutch audiovisual speech database, a novel approach to construct many-to-many phoneme-to-viseme mapping schemes (i.e., so-called "context-dependent" visemes) was designed. In a first step, decision trees were trained in order to cluster the visual appearances of the phoneme instances from the speech database. The mapping from phonemes to these tree-based visemes is based on several properties of the phoneme instance itself and on properties of its neighbouring phoneme instances. Several tree configurations were evaluated, from which two final approaches were selected. An objective evaluation of these tree-based speech labels showed that they are indeed able to describe the visual speech information more accurately than phoneme-based or many-to-one viseme-based approaches. However, the tree-based viseme sets contain too many distinct labels for practical use in visual speech analysis or synthesis applications.
Therefore, a second clustering step was performed, in which the tree-based visemes were further clustered into limited viseme sets defining 11, 20 or 50 distinct labels. Again these viseme labelling approaches were objectively evaluated, from which it could be concluded that they do improve the visual speech labelling over phonemes or many-to-one phoneme-to-viseme mappings. This is an important result, since the new viseme labels permit an easy analysis or synthesis of visual speech, as they need less than half the number of distinct labels to describe the speech information as compared to phoneme-based labels.

In the second part of this research, the use of the different speech labelling approaches for concatenative visual speech synthesis was assessed. A first conclusion that could be drawn is that, in case an extensive database is provided to the synthesizer, the influence of the speech labelling used (phonemes or many-to-many visemes) on the attainable synthesis quality is rather limited. Good quality segment selection can be achieved either way, provided that appropriate selection costs are applied. Furthermore, it was found that the more speech data is available for selection, the higher the number of distinct speech labels that should be used for an optimal synthesis result (probably there exists an upper limit to this number, but this was not investigated due to its limited practical use). The experiments also showed that for synthesis using a large database, a many-to-many viseme-based speech labelling could be preferable in order to reduce the synthesis time, since the more accurate speech labelling permits a stronger pruning of the number of candidate segments for each target speech segment. Next, the behaviour of the synthesis was evaluated in case only a limited amount of speech data is available for selection. In this case, the selection based on many-to-many phoneme-to-viseme mappings does improve the segment selection quality in comparison with phoneme-based or many-to-one phoneme-to-viseme based systems. Finally, the output synthetic visual speech signals resulting from syntheses based on the different speech labelling approaches were evaluated. In an objective evaluation it was found that, when only a limited database is provided to the synthesizer, a synthetic speech signal that is closer to the ground truth is achievable when a speech labelling based on a many-to-many phoneme-to-viseme mapping is applied. To verify this result, a subjective perception experiment was conducted. The results obtained from this subjective test show that human observers indeed prefer the syntheses based on a many-to-many phoneme-to-viseme mapping scheme over syntheses based on phonemes or on a many-to-one phoneme-to-viseme mapping.

This chapter explained how many-to-many phoneme-to-viseme mappings can be constructed, and their improved accuracy in describing the visual speech information has been shown. This is an important result, since this kind of viseme definition was still absent from the literature. As neither for English nor for other common languages a reference many-to-many mapping scheme has been defined, many-to-one phoneme-to-viseme mappings are typically still used for a variety of applications. Although the novel many-to-many mappings described in this chapter were constructed for Dutch, a similar approach can be used to design many-to-many mapping schemes for English or other languages.
This chapter also discussed the effect of applying the novel phoneme-to-viseme mappings for concatenative speech synthesis. Similarly, it would be very interesting to investigate how the use of the many-to-many phoneme-to-viseme mappings influences the performance of other applications in the field of visual speech analysis and synthesis as well. For instance, many-to-many phoneme-to-viseme mappings have a high potential for usage in rule-based visual speech synthesis (see section 2.2.6.2). In that particular application, the necessary prior definition of the various model configurations that correspond to a particular viseme is simplified by the limited number of distinct visemes that are used. Based on a given target phoneme sequence, an accurate target viseme sequence can be constructed using the phoneme-to-viseme mapping scheme. This way, visual coarticulation effects are directly modelled in the target speech description. In addition, the synthesis workflow may become simpler, since the additional tweaking of the synthesized model parameters to simulate the visual coarticulation effects, which is standard for model-based synthesis, may become superfluous. Some of the techniques, experiments and results mentioned in this chapter have been published in [Mattheyses et al., 2011b] and [Mattheyses et al., 2013].

7 Conclusions

7.1 Brief summary

Nowadays, it is possible to produce powerful computer systems at a limited production cost. This causes a wide variety of devices, from heavy industrial machinery to small household appliances, to be controlled by a computer system that also arranges the communication between the device and its users. People already interact with countless computer systems in everyday situations, and this kind of human-machine communication will become even more important in the near future. In the most optimal scenario, the interaction with the devices feels as familiar and as natural as the way in which humans communicate among themselves. Speech has always been the most important means of communication between humans. Because of this, speech can be considered the optimal means of communication between a user and a computer system too. This kind of interaction enhances both the naturalness and the ease of the user experience, and it also increases the accessibility of the computer system. One of the requirements for allowing speech-based human-machine interaction is the capability of the computer system to generate novel speech signals containing any arbitrary spoken message. Speech is a truly multimodal means of communication: the message is conveyed in both an auditory and a visual speech signal. The auditory speech information is encoded in a waveform that contains a sequence of speech sounds produced by the human speech production system. Some of the articulators of this speech production system are visible to an observer looking at the face of the speaker. The variations of these visual articulators, occurring while uttering the speech, define the visual speech signal. It is well known that an optimal conveyance of the message requires that both the auditory and the visual speech signal can be perceived by the receiver. Similarly, the use of audiovisual speech is the most optimal way for a computer system to transmit a message towards its users. To this end, the system has to generate a new waveform that contains the appropriate speech sounds.
In addition, it has to generate a new video signal that displays a virtual speaker exhibiting the speech gestures that correspond to the synthetic auditory speech information. When the target speech message is given as input to the synthesizer by means of text, the speech generation process is referred to as "audiovisual text-to-speech synthesis". Currently, data-driven synthesis is considered the most efficient approach for generating high quality synthetic speech. The standard strategy is to perform the audiovisual text-to-speech synthesis in multiple stages. In a first step, the synthetic auditory speech is generated by an auditory text-to-speech system. Next, the synthetic visual speech signal is generated by another synthesis system that performs unimodal visual speech synthesis. Finally, both synthetic speech signals are synchronized and multiplexed to create the final audiovisual speech signal. This strategy allows the synthesis of high-quality auditory and visual speech signals; however, it results in the presentation of non-original combinations of auditory and visual speech information to the observer. This means that the level of coherence between both synthetic speech modes will be lower than the multimodal coherence seen in original audiovisual speech. To allow an optimization of the audiovisual coherence in the synthetic speech, a single-phase synthesis strategy that simultaneously generates the auditory and the visual speech signal is favourable. Surprisingly, such a synthesis strategy has only been adopted in some exploratory studies.

In the first part of this thesis, a single-phase audiovisual text-to-speech synthesizer was developed, which adopts a data-driven unit selection synthesis strategy to create a photorealistic 2D audiovisual speech signal. The synthesizer constructs the synthetic audiovisual speech by concatenating audiovisual speech segments that are selected from a database containing original audiovisual speech recordings. By concatenating original combinations of auditory and visual speech information, original audiovisual articulations are seen in the synthetic speech. This ensures a very high level of audiovisual coherence in the output speech signal. When longer speech segments, containing multiple consecutive phones, are copied from the database to the synthetic speech, original audiovisual coarticulations are seen in the synthesized speech too. Audiovisual selection costs are employed to select segments that exhibit appropriate properties in both the acoustic and the visual domain. The auditory and the visual concatenations are smoothed by calculating intermediate sound samples and video frames.

Subjective perception experiments pointed out that the level of coherence between both modes of an audiovisual speech signal indeed influences the perceived quality of the speech. It was observed that the perceived quality of a visual speech signal is rated highest when it is presented together with an acoustic speech signal that is in perfect coherence with the displayed speech gestures. This means that an audiovisual text-to-speech system should not only aim to enhance the individual quality of both output speech modes, but should also aim for a maximal level of coherence between these two speech modalities. It is necessary that an observer truly believes that the virtual speaker, displayed in the synthetic visual speech, actually uttered the speech sounds that are heard in the presented auditory speech signal.
For instance, this requires that for every non-standard articulation present in the auditory speech mode, even a non-optimal one, an appropriate visual counterpart is seen in the audiovisual speech as well (and vice versa). These observations encouraged the further development of the single-phase audiovisual text-to-speech synthesis approach. In an attempt to enhance the individual quality of the synthetic visual speech mode, multiple audiovisual optimal coupling techniques were developed. These approaches are able to increase the smoothness of the generated visual speech at the expense of a lowered level of audiovisual coherence. It was found that the observed benefits of this optimization do not hold in the audiovisual case, since the introduced local audiovisual asynchronies and/or the unimodal reduction of the articulation strength in the visual speech mode affected the perceived audiovisual speech quality. These results indicate that any optimization of the proposed audiovisual text-to-speech synthesis technique must be verified not to affect the level of audiovisual coherence in the output speech.

The next part of the research investigated the enhancement of the synthetic speech quality by parameterizing the original visual speech recordings using an active appearance model. This allows the calculation of accurate selection costs that also take visual coarticulation effects into account. In addition, the speech database was normalized by removing non-speech related and thus undesired variations from the original visual speech recordings. The visual concatenations were further enhanced by diversifying the smoothing strength among the model parameters and by separately optimizing the smoothing strength for each particular concatenation point. This way, a smooth and natural appearing synthetic visual speech signal can be generated without a significant reduction of the visual articulation strength (and thus preserving the level of audiovisual coherence). Furthermore, unnaturally fast articulations are filtered from the synthetic visual speech by a spectral smoothing technique. Subjective perception experiments proved that the proposed model-based audiovisual text-to-speech synthesis indeed produces a higher quality synthetic visual speech signal compared to the visual speech generated by the initial single-phase synthesis system. In addition, since the increase in observed speech quality holds in the audiovisual case as well, it can be concluded that the proposed optimization techniques allow the synthetic visual speech quality to be improved independently, without significantly affecting the level of audiovisual coherence in the output speech.

Another enhancement of the synthesis quality was achieved by providing the synthesizer with a new extensive Dutch audiovisual speech database, containing high quality auditory and visual speech recordings. The database was recorded using an innovative audiovisual recording set-up that optimizes the speech recordings for usage in audiovisual speech synthesis. This allowed the development of the first-ever system capable of high-quality photorealistic audiovisual speech synthesis for Dutch. From a Turing test scenario it was concluded that the database allows the synthesis of speech gestures that are almost indistinguishable from the speech gestures seen in original visual speech. In addition, the use of the new database unequivocally enhanced the attainable audiovisual speech quality too.
Another enhancement of the synthesis quality was achieved by providing the synthesizer with a new, extensive Dutch audiovisual speech database containing high-quality auditory and visual speech recordings. The database was recorded using an innovative audiovisual recording set-up that optimizes the speech recordings for use in audiovisual speech synthesis. This made it possible to develop the first system capable of high-quality photorealistic audiovisual speech synthesis for Dutch. From a Turing test scenario it was concluded that the database allows the synthesis of speech gestures that are almost indistinguishable from the speech gestures seen in original visual speech. In addition, the use of the new database unequivocally enhanced the attainable audiovisual speech quality too. This made it possible to perform a final evaluation of the single-phase audiovisual speech synthesis approach. A subjective experiment compared the audiovisual speech quality generated by the single-phase synthesizer with that generated by a comparable two-phase synthesis. The two-phase synthesis generated both output modalities separately, by an individual segment selection on coherent original audiovisual speech data. The experiment showed that observers prefer the single-phase synthesis results, especially when the synthetic speech modalities are very close to original speech signals. In the final part of the thesis, it was investigated how suitable the proposed techniques for audiovisual text-to-speech synthesis are for unimodal visual text-to-speech synthesis as well. In addition, the common practice of using a many-to-one phoneme-to-viseme mapping for visual speech synthesis was evaluated. It was found that neither standardized nor speaker-specific many-to-one phoneme-to-viseme mappings enhance the visual speech synthesis quality compared to a synthesis based on phoneme speech labels. This motivated the construction of novel many-to-many phoneme-to-viseme mapping schemes. By analysing the Dutch audiovisual speech database using decision trees and k-means clustering techniques, multiple sets of “context-dependent” viseme labels for Dutch were defined. It was shown that these context-dependent visemes describe the visual speech information more accurately than phonemes and than many-to-one viseme labels. In addition, it was shown that the context-dependent viseme labels are the most preferable speech labeling approach for concatenative visual speech synthesis.

7.2 General conclusions

Speech synthesis has been the subject of much research over the years. The synthesis of high-quality speech is not a straightforward task, since humans are extremely experienced in using speech as a means of communication. This implies that people are very well acquainted with the perception of original speech signals. Consequently, it is very hard to generate a synthetic auditory or a synthetic visual speech signal that perfectly mimics original speech information. The problem becomes even harder when audiovisual speech is synthesized: in that case it is not only necessary that both synthetic speech modes closely resemble an original speech signal conveying the same speech message, but the audiovisual presentation of the two synthetic speech modes must also appear natural to an observer. This means that the coherence between the presented auditory and the presented visual speech information must be high enough to make the observer believe that the virtual speaker indeed uttered the acoustic waveform that plays together with the video signal. The results obtained in this thesis indicate that the most preferable synthesis strategy to achieve high-quality audiovisual text-to-speech synthesis consists in simultaneously generating both synthetic speech modes. This way, the level of audiovisual coherence in the synthetic speech can be maximized to ensure that the individual quality of the synthetic speech modes is not affected when they are presented audiovisually to an observer.
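To make this coupling concrete, the sketch below shows one way a single-phase selection can score a candidate segment jointly in both modalities, so that a segment that fits well acoustically but poorly visually (or vice versa) is penalised within one and the same search. The feature names, dictionary layout, and weights are purely illustrative assumptions, not the thesis implementation.

```python
import numpy as np

# Illustrative sketch (hypothetical feature names and weights): a joint
# audiovisual selection cost combining acoustic and visual target costs with
# acoustic and visual join costs in a single score.

def audiovisual_cost(target, candidate, prev_candidate, w):
    cost = 0.0
    cost += w["tc_audio"] * np.linalg.norm(target["audio"] - candidate["audio"])
    cost += w["tc_visual"] * np.linalg.norm(target["visual"] - candidate["visual"])
    if prev_candidate is not None:
        cost += w["jc_audio"] * np.linalg.norm(
            prev_candidate["audio_end"] - candidate["audio_start"])
        cost += w["jc_visual"] * np.linalg.norm(
            prev_candidate["visual_end"] - candidate["visual_start"])
    return cost
```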
As in auditory-only speech synthesis, the current focus in the field of visual speech synthesis is on data-driven synthesis approaches: concatenative synthesizers that directly reuse the original speech data, and prediction-based synthesizers that train complex statistical prediction models on the original speech data. Both techniques are suited to performing a single-phase audiovisual speech synthesis. A single-phase concatenative synthesis approach was developed in this thesis, which reuses original audiovisual articulations and coarticulations to create the synthetic speech. On the other hand, single-phase prediction-based synthesis can be achieved by predicting the auditory and the visual speech features simultaneously from the target speech description, by means of a prediction model that has been trained on original audiovisual speech data. Obviously, the simultaneous synthesis of both speech modes entails some additional difficulties in maximizing the quality of the synthetic audiovisual speech. Therefore, future research in the field should focus on how to combine the state-of-the-art techniques for unimodal auditory and unimodal visual speech synthesis in order to achieve a high-quality single-phase synthesis of audiovisual speech. Up until now, single-phase audiovisual speech synthesis has mainly been the topic of exploratory studies. This thesis constitutes one of the first efforts that not only suggests a promising single-phase synthesis approach, but also aims to develop improvements to the synthesis in order to attain a synthesis quality that allows relevant conclusions to be drawn. Hopefully, the research described in this work will inspire other researchers to further explore the single-phase synthesis approach as well. For instance, it (partially) inspired researchers at the Université de Lorraine (LORIA) in their development of a single-phase concatenative audiovisual speech synthesizer that creates a 3D-rendered visual speech signal [Toutios et al., 2010b] [Toutios et al., 2010a] [Musti et al., 2011]. That system simultaneously selects from an audiovisual speech database original auditory speech information and PCA coefficients modelling original 3D facial landmark variations. In applications in which an original auditory speech signal has already been provided, a unimodal visual text-to-speech synthesis can be used to generate the corresponding visual speech mode. Surprisingly, at present the common technique in this field still uses many-to-one phoneme-to-viseme mappings to label the visual speech information. This is in contradiction with the well-known fact that the mapping from phonemes to visemes behaves more like a many-to-many mapping, due to the variable visual representation of each phoneme caused by visual coarticulation effects. The results obtained in this thesis indeed raise serious questions about the use of the standardized viseme set defined in the MPEG-4 standard. It was shown that a more accurate labeling of the visual speech is feasible using context-dependent viseme labels. This technique makes it possible to perform good-quality concatenative visual speech synthesis using only a limited amount of original speech data. It can be expected that the use of context-dependent viseme labels will enhance the attainable synthesis quality of other visual speech synthesis strategies as well. Furthermore, context-dependent visemes are promising for use in the field of visual speech analysis too.
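The thesis derives such viseme labels with decision trees and k-means clustering on AAM features; the following is a minimal, self-contained sketch of the clustering step only, applied to placeholder data. The context labels, feature values, and the use of scikit-learn are assumptions for illustration, not the actual analysis pipeline.

```python
import numpy as np
from sklearn.cluster import KMeans

# Illustrative sketch: cluster mean AAM parameter vectors of phone realisations
# (labelled here with a hypothetical "phone+right-context" notation) so that
# visually similar realisations end up in the same data-driven viseme class.

phone_ids = ["t+a", "t+o", "p+a", "p+o"]          # hypothetical context labels
aam_means = np.random.rand(len(phone_ids), 12)    # placeholder AAM features

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(aam_means)
visemes = dict(zip(phone_ids, kmeans.labels_))
print(visemes)   # e.g. {'t+a': 0, 't+o': 1, ...} -> context-dependent groups
```

In a real setting the number of clusters would be chosen per scheme (e.g. the 7-, 9-, 11-, 17- and 22-group mappings listed in appendix C), and a decision tree over phonetic context can then be grown to predict the cluster label for unseen contexts.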
7.3 Future work

7.3.1 Enhancing the audiovisual synthesis quality

The audiovisual text-to-speech synthesizer, provided with the extensive Dutch audiovisual speech database, is capable of generating high-quality audiovisual speech signals. Chapter 4 elaborated on various optimizations that enhance the individual quality of the synthetic visual speech mode. Similarly, some techniques found in state-of-the-art auditory speech synthesis systems can be incorporated in the audiovisual speech synthesis in order to enhance the individual quality of the synthetic auditory speech mode. For instance, some of the advances that were made in the laboratory’s auditory text-to-speech research can be transferred to the audiovisual domain, such as the semi-supervised annotation and prediction of perceptual prominence, word accents, and prosodic phrase breaks. Note that for each enhancement it has to be verified that it does not significantly affect the coherence between the synthetic visual mode and the optimized synthetic auditory speech mode. One of the most difficult tasks in optimizing the synthesis parameters of a concatenative synthesizer is the definition of an appropriate weight distribution for the selection costs. In the proposed audiovisual text-to-speech system, these weights were optimized manually. However, it is likely that a more appropriate set of weights can be calculated using an automatic parameter optimization technique. The difficulty is that such techniques require an objective “error” measure that denotes the quality of the synthesis obtained using a particular weight distribution. This error measure has to take the individual quality of both the synthetic auditory and the synthetic visual speech into account [Toutios et al., 2011]. This means that the relative importance of both modes must first be assessed in order to define the error measure. In addition, it is likely that the synthesis quality can be further optimized by learning multiple weight distributions, which allows the most suitable set of selection cost weights to be used for each target speech segment. Such an approach was already explored by learning context-dependent selection cost weights for the laboratory’s auditory speech synthesizer [Latacz et al., 2011]. Another interesting strategy that can be ported from the auditory domain to the audiovisual domain is the “hybrid” synthesis technique, in which a first stage estimates the properties of the synthetic speech using a prediction-based synthesizer. Afterwards, a second stage consists of the physical synthesis, in which these predictions are used as target descriptions for a concatenative synthesis system. This way, the prediction-based system’s power to accurately estimate the desired speech properties is combined with the power of concatenative synthesis to generate realistic output signals by reusing original speech data. In the audiovisual case, the ideal system would simultaneously predict the acoustic and the visual speech features (i.e., a single-phase approach) by means of a statistical model that was trained on audiovisual speech data. Afterwards, a concatenative synthesis approach similar to the audiovisual speech synthesizer described in this thesis can be employed to realize the actual physical synthesis. This requires the calculation of target selection costs that measure the audiovisual distance between the candidate speech segments and the predicted properties of the output speech.
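In its simplest form, the automatic weight optimization discussed above could be a direct search over weight vectors scored by an objective error measure. The sketch below is such a minimal random-search loop; the synthesize and error_measure callables are hypothetical placeholders (a batch synthesis of held-out sentences and an objective audiovisual error, respectively), and the whole approach is an assumption rather than the method used in the thesis, where the weights were tuned manually.

```python
import numpy as np

# Illustrative sketch of automatic selection-cost weight tuning by random
# search. `synthesize(w)` and `error_measure(result)` are hypothetical hooks:
# the former synthesizes held-out sentences with weight vector w, the latter
# returns an objective error combining acoustic and visual distances to the
# corresponding natural recordings.

def tune_weights(synthesize, error_measure, n_costs, n_trials=100, seed=0):
    rng = np.random.default_rng(seed)
    best_w, best_err = None, np.inf
    for _ in range(n_trials):
        w = rng.uniform(0.0, 1.0, size=n_costs)   # candidate weight vector
        err = error_measure(synthesize(w))
        if err < best_err:
            best_w, best_err = w, err
    return best_w, best_err
```

More refined strategies (grid search, gradient-free optimizers, or learning context-dependent weight sets as in [Latacz et al., 2011]) would replace the random sampling, but the need for a well-defined objective error measure remains the same.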
7.3.2 Adding expressions and emotions

The proposed audiovisual speech synthesizer uses prosody-related binary target costs in order to copy the prosodic patterns found in the speech database to the synthetic speech. Since both the LIPS2008 database and the AVKH database contain expression-free speech recordings exhibiting a neutral prosody, the synthesized speech shows neither expressions nor emotions. The background video signal, which displays the parts of the face of the virtual speaker that are not located in the mouth area, is designed to display a neutral visual prosody too. The reason is that a neutral visual prosody is a safe choice: adding more expressiveness to the face can increase the naturalness of the synthetic speech, but it also potentially harms the speech quality when the expressions themselves are not perceived as natural or when the displayed expressions are not optimally correlated with the conveyed speech message. The synthesis of expressive speech is a “hot topic” in the fields of auditory and visual speech synthesis. These techniques can also be ported to the audiovisual domain: expressiveness can be added to the auditory speech signal to make the synthetic speech more “lively” and real, while the synthetic visual speech signal is highly suited to adding a particular emotional state to the message. Some exploratory work was already conducted to investigate strategies for adding the synthesis of expressions and emotions to the audiovisual text-to-speech synthesis system described in this thesis. To this end, additional speech data was recorded (using the audiovisual recording set-up discussed in chapter 5) during which the speaker simulated happy and sad emotions. Some example frames from these recordings are given in figure 7.1. Figure 7.1 illustrates that a change in the emotional state of the speaker causes variations both in the visual articulations (seen in the mouth area of the video frames) and in the appearance of the other parts of the face (e.g., eyes, eyebrows, etc.). Two separate strategies are needed to mimic such original expressions in the synthetic visual speech generated by the audiovisual text-to-speech synthesizer. First, the variations of the mouth area have to be synthesized not only based on the target phoneme sequence, but also based on the target emotion or expression. This means that the speech database provided to the synthesizer should contain even more repetitions of each phoneme, since the synthesizer has to be able to select all variations based on context, prosody and expression. Next, a “mouth” AAM has to be trained on this expressive original speech data. It will be a challenge to construct an AAM that is capable of accurately regenerating all original mouth appearances, since for the expressive speech many more variations must be modelled than for the AAMs that were trained on the neutral speech databases in chapter 4 and chapter 5. Furthermore, in order to synthesize emotional visual speech, the necessary visual prosody has to be added to the background video signal. To this end, the “face” AAM could be extended to model variations of the face corresponding to emotions/expressions as well. Afterwards, it will have to be investigated how to generate realistic transitions between consecutive expressions. It might be necessary to explore other parameterizations that allow physical properties to be matched to one single parameter (and vice versa).
This way, the displayed visual prosody could be influenced based on the literature on the relation between emotions and facial gestures [Grant, 1969] [Swerts and Krahmer, 2005] [Granstrom and House, 2005] [Gordon and Hibberts, 2011].

Figure 7.1: Facial expressions related to a happy emotion. Notice that both the appearance of the mouth area and the appearance of the “background” face vary in relation to the expression.

Once the new expressive original audiovisual speech data has been appropriately parameterized, new selection costs will be necessary that promote the selection of segments exhibiting the desired expression. Especially in the auditory mode it will be a challenge to synthesize realistic emotional prosodic patterns. In addition, an appropriate approach for denoting the targeted expressions in the synthetic speech will have to be investigated.

7.3.3 Future evaluations

Throughout this thesis, many evaluations of the synthetic visual and the synthetic audiovisual speech signals have been mentioned. For instance, objective measures were defined to assess the smoothness of the visual speech and to calculate the distance between a synthesized and an original version of the same sentence. In addition, many subjective evaluations were performed that used MOS or comparative MOS ratings to denote the perceived speech quality. Apart from these evaluation strategies, other test strategies are possible that are likely to offer useful information about the attained synthesis quality. In chapter 3, a subjective assessment of the audiovisual coherence was performed. It was concluded that human observers are often unable to distinguish between synchrony issues and coherence issues. Furthermore, it appeared that the perceived level of synchrony/coherence is influenced by the quality of the displayed audiovisual speech. A possible solution would be to perform such evaluations objectively, by measuring the mathematical correspondence between the auditory and the visual speech information. To this end, various strategies are possible, such as the measures proposed in [Slaney and Covell, 2001] and in [Bredin and Chollet, 2007]. A general overview of interesting audiovisual correlation measures is given in [Bredin and Chollet, 2006]. Such correlation measures could be employed to compare the level of audiovisual coherence between original and various synthesized audiovisual speech signals.
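As a rough illustration of such an objective coherence measure, the sketch below computes the canonical correlation between time-aligned acoustic and visual feature streams. It is only an assumed, minimal implementation in the spirit of the correlation measures cited above; the feature choices (e.g. MFCCs and AAM parameters) and the use of scikit-learn are illustrative assumptions.

```python
import numpy as np
from sklearn.cross_decomposition import CCA

# Illustrative sketch: canonical correlation analysis between time-aligned
# acoustic features (e.g. MFCCs) and visual features (e.g. AAM parameters)
# as a rough objective indicator of audiovisual coherence.

def av_coherence(audio_feats, visual_feats, n_components=1):
    cca = CCA(n_components=n_components)
    a, v = cca.fit_transform(audio_feats, visual_feats)
    # correlation of the first pair of canonical variates
    return np.corrcoef(a[:, 0], v[:, 0])[0, 1]

# audio_feats and visual_feats are (frames, dims) arrays sampled at the same rate
score = av_coherence(np.random.rand(500, 13), np.random.rand(500, 10))
```

Comparing such scores for original recordings and for various synthesized versions of the same sentences would give an objective counterpart to the subjective coherence assessments described above.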
This thesis focused on the synthesis of full-length sentences. This kind of speech sample was also used in the subjective perception experiments. Unfortunately, it was noticed that local errors in the synthetic speech signal often degrade the perceived quality of the whole sentence. This easily disrupts the subjective experiment, since the other parts of the sentence might have been of very good quality. Therefore, one option is to assess the quality of the synthetic speech on a much smaller scale, by evaluating the synthetic articulations of isolated words or sounds. An interesting approach was suggested by Cosker et al. [Cosker et al., 2004], which involves the generation of real and synthetic McGurk stimuli. A McGurk stimulus consists of a short sound sample in which the auditory and the visual speech originate from different phoneme sequences. The audiovisual presentation of these speech modes can cause the perception of yet another speech sound (the so-called McGurk effect [McGurk and MacDonald, 1976]). By measuring the difference between the number of McGurk effects observed with the original stimuli and the number observed with the synthesized stimuli, an estimate of the articulation quality of the speech synthesizer can be made. One of the main benefits of this test strategy is that it constitutes a subjective evaluation of the synthetic auditory speech, the synthetic visual speech, and the combination of these two signals. It could also be interesting to use it to compare the performance of a single-phase synthesis strategy with the more conventional two-phase audiovisual text-to-speech synthesis approach. Chapter 6 discussed the use of phoneme and viseme speech labels for visual speech synthesis. The quality of the resulting synthetic visual speech signals was subjectively evaluated by presenting the signals in combination with original auditory speech signals. This is a valid evaluation, since these audiovisual test samples are similar to the speech signals that would be shown to a user when the visual speech synthesis system is used in a real application. On the other hand, it would also be interesting to perform a separate subjective evaluation of the individual quality of the synthetic visual speech mode (generated by either a visual or an audiovisual speech synthesizer). In comparison with an individual evaluation of auditory speech, such a visual speech-only evaluation is not straightforward, since it is very hard for an observer to judge the quality of a presented unimodal visual speech signal. The reason is that people are generally not capable of comprehending the message conveyed in a visual speech-only signal. A possible solution would be to only use test subjects who possess above-average lip-reading skills (often these are hearing-impaired people). A quality measure for the synthetic visual speech could then consist of the intelligibility score obtained for the synthesized visual speech samples compared to the intelligibility score obtained for the original visual speech samples. One possible problem is that, even for experienced lip-readers, the recognition rate for visual speech-only sentences without context is rather low. This will make it harder to obtain significant differences between the intelligibility scores measured for multiple types of synthesized or original visual speech samples.

A The Viterbi algorithm

The Viterbi algorithm [Viterbi, 1967] is a dynamic programming solution that is used to find the optimal path through a trellis. In unit selection synthesis, it is used to select, from each set of candidate database segments matching a target speech segment, one final database segment with which to construct the synthetic speech. The principle is illustrated in figure A.1, where for each of the $T$ target segments $t_i$ a set of $N$ candidate segments $u_{ij}$ is gathered. The optimal path through the trellis is determined by minimizing a global selection cost that is calculated from target costs and join costs, as illustrated in figure A.2. The target costs $TC$ measure the match between a target segment $t_i$ and a candidate segment $u_{ij}$, and the join costs $JC$ express the cost of moving from a candidate segment matching target $t_i$ to a candidate segment matching target $t_{i+1}$.
Since a high-quality unit selection approach should consider at least about 200 candidate segments per target, synthesizing a standard-length sentence that is made up of 100 diphones involves $200^{100}$ possible sequences to evaluate. Obviously, this number is far too high to perform the unit selection calculation in reasonable time. Therefore, the Viterbi algorithm only evaluates those sequences that can possibly be the optimal one. It is based on the following two principles:
◦ For a single node $u_{ij}$ somewhere in the trellis, only the best path leading to this node needs to be remembered. If it turns out that this particular node $u_{ij}$ is in fact on the global best path, then the node matching the preceding target that was on the best path towards $u_{ij}$ is also on the global best path.
◦ The global best path can only be found if all targets are processed from the beginning to the end. At any given target $t_i$, when moving forward through the trellis, it is possible to find the node $u_{ij}$ with the lowest total selection cost up to this point. By a back-trace from this node, the nodes from all the previous targets that are on the best path towards $u_{ij}$ can be found. However, no matter how low the total selection cost associated with node $u_{ij}$ may be, there is no guarantee that this node will end up on the global best path when the full back-trace from target $t_T$ is performed.

Figure A.1: A trellis illustrating the unit selection problem.
Figure A.2: The various costs associated with the unit selection trellis.

The Viterbi search significantly reduces the time needed to calculate the global best path. When the average time to calculate a target cost is written as $O_{TC}$ and the average time to calculate a join cost is written as $O_{JC}$, the total time to find the global best path through a trellis with $T$ targets and $N$ nodes for each target is $T \times (N \times O_{TC} + N^2 \times O_{JC})$. (For instance, assume that the computer system is able to calculate $10^6$ cost values per second and that the synthesizer uses 4 distinct selection costs. When the best sequence matching $T = 100$ diphones (a standard sentence) must be calculated from $N = 200$ candidates for each target, the straightforward approach that calculates the total cost for each possible sequence would take $200^{100}/250000 \approx 5 \cdot 10^{224}$ seconds to complete. A Viterbi search would only take about 4 seconds to find the optimal path.)

A possible implementation of the Viterbi search goes as follows. Consider a node $u_{ij}$ matching target $t_i$ and a node $u_{(i+1)k}$ matching target $t_{i+1}$. The cost $C_{step}$ of moving from node $u_{ij}$ to node $u_{(i+1)k}$ is calculated as the target cost of node $u_{(i+1)k}$ plus the join cost between these two nodes:
$C_{step}(u_{ij}, u_{(i+1)k}) = TC_{i+1}(k) + JC_i(j, k)$ (A.1)
The total cost $C_{tot}$ associated with arriving in node $u_{(i+1)k}$ via node $u_{ij}$ is then:
$C_{tot}(u_{(i+1)k}) = C_{tot}(u_{ij}) + C_{step}(u_{ij}, u_{(i+1)k})$ (A.2)
Using equation A.2, the total cost for arriving in node $u_{(i+1)k}$ via each possible node corresponding to target $t_i$ can be calculated. This way, the most optimal node matching target $t_i$ for arriving in node $u_{(i+1)k}$ can be determined and remembered. If it later turns out that node $u_{(i+1)k}$ is actually on the global best path, the node matching target $t_i$ that is on the global best path is also known, since it is the node that was remembered as the most optimal way to arrive in node $u_{(i+1)k}$. This principle is repeated from the first target towards the last target. For each node of each target, the best node from the previous target to reach it is remembered, together with the total cost associated with reaching it via this optimal node. When the last target $t_T$ is processed, the node $u_{T\hat{k}}$ that has the lowest total cost of reaching it is chosen as the last node on the global best path. Then, a back-trace occurs in which the node matching target $t_{T-1}$ that was remembered as the most optimal for reaching node $u_{T\hat{k}}$ is added to the global best path. This step is repeated until finally a node matching the first target is added to the global best path, after which the most optimal set of candidate segments that minimizes the global selection cost is known.
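The forward pass and back-trace described above can be summarized in a few lines of code. The following is a minimal sketch, not the thesis implementation; it assumes that the candidates are supplied per target and that target_cost and join_cost are user-provided functions implementing the selection costs.

```python
# Minimal sketch of the Viterbi unit-selection search described above.
# Assumptions: `candidates[i]` is the list of candidate segments for target i,
# and target_cost(t, u) / join_cost(u_prev, u_next) are user-supplied functions.

def viterbi_unit_selection(targets, candidates, target_cost, join_cost):
    """Return the candidate sequence minimising total target + join cost."""
    T = len(targets)
    # total[i][j]: lowest total cost of any path ending in candidate j of target i
    # back[i][j]:  index of the predecessor candidate on that best path
    total = [[target_cost(targets[0], u) for u in candidates[0]]]
    back = [[None] * len(candidates[0])]

    for i in range(1, T):
        row_total, row_back = [], []
        for u_next in candidates[i]:
            tc = target_cost(targets[i], u_next)
            # Only the best way of reaching u_next needs to be remembered.
            best_j, best_cost = min(
                ((j, total[i - 1][j] + join_cost(u_prev, u_next) + tc)
                 for j, u_prev in enumerate(candidates[i - 1])),
                key=lambda x: x[1])
            row_total.append(best_cost)
            row_back.append(best_j)
        total.append(row_total)
        back.append(row_back)

    # Back-trace from the cheapest final node to recover the global best path.
    j = min(range(len(candidates[-1])), key=lambda k: total[-1][k])
    path = [j]
    for i in range(T - 1, 0, -1):
        j = back[i][j]
        path.append(j)
    path.reverse()
    return [candidates[i][j] for i, j in enumerate(path)]
```

The nested loops make the $T \times (N \times O_{TC} + N^2 \times O_{JC})$ behaviour of the search explicit: each of the $N$ candidates per target is scored once against its target and once against every candidate of the previous target.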
B English phonemes

This appendix illustrates the phoneme set that is used by the AVTTS system to perform synthesis for English. It also indicates the classification of the various phonemes that is used to optimize the concatenation smoothing strength for each join individually (see section 4.4.3). Phonemes labelled as protected (P) are less likely to be affected by visual coarticulation effects and have to be clearly visible in the synthetic visual speech to avoid under-articulation. On the other hand, phonemes labelled as invisible (I) are often strongly affected by coarticulation effects. These phonemes can be smoothed more heavily since they should not always be visible in the synthetic speech to avoid over-articulations. The other phonemes are labelled as normal (N).

Table B.1: The English phone set used by the AVTTS system. The first column lists the phonemes in the MRPA notation used by Festival [Black et al., 2013]. The second column lists the phonemes in the standard SAMPA notation [Wells, 1997]. The third column shows an example use of each phoneme and the last column illustrates its classification normal/protected/invisible.

mrpa | sampa | example | class.
# |  | (silence) | N
p | p | put | P
b | b | but | P
t | t | ten | I
d | d | den | I
k | k | can | I
m | m | man | P
n | n | not | N
l | l | like | N
r | r | run | N
f | f | full | P
v | v | very | P
s | s | some | N
z | z | zeal | N
h | h | hat | I
w | w | went | P
g | g | game | I
ch | tS | chain | N
jh | dZ | Jane | N
ng | N | long | N
th | T | thin | N
dh | D | then | N
sh | S | ship | N
zh | Z | measure | N
y | j | yes | N
ii | i: | bean | N
aa | A: | barn | N
oo | O: | born | P
uu | u: | boon | N
@@ | 3: | burn | N
i | I | pit | N
e | e | pet | N
a | { | pat | N
uh | V | putt | N
o | Q | pot | N
u | U | good | N
@ | @ | about | N
ei | eI | bay | N
ai | aI | buy | P
oi | OI | boy | N
ou | @U | no | N
au | aU | now | N
i@ | I@ | peer | N
e@ | e@ | pair | N
u@ | U@ | poor | P

C English visemes

This appendix illustrates the speaker-dependent many-to-one phoneme-to-viseme mapping that was constructed by a hierarchical clustering analysis on the combined AAM parameter values of the video frames from the LIPS2008 database. Table C.1 lists for each English phoneme the viseme group it matches in the mapping schemes on 7, 9, 11, 17, and 22 visemes, respectively. The standardized phoneme-to-viseme mapping defined in MPEG-4 is also given.
Table C.1: Many-to-one phoneme-to-viseme mappings for English. The phonemes are listed using the MRPA notation [Black et al., 2013].

phoneme | 7 | 9 | 11 | 17 | 22 | MPEG
ch | I | I | I | I | I | I
sh | I | I | I | I | I | I
jh | I | I | I | I | I | I
zh | I | I | I | II | II | I
b | II | II | II | III | III | II
m | II | II | II | III | III | II
p | II | II | II | III | III | II
w | II | III | III | IV | IV | VII
h | III | IV | IV | V | V | IX
ii | III | IV | IV | V | V | III
i@ | III | IV | IV | V | V | III
e | III | IV | IV | VI | VI | IV
a | III | IV | IV | VI | VI | V
ei | III | IV | IV | VI | VI | IV
e@ | III | IV | IV | VI | VI | IV
ai | III | IV | IV | VII | VII | V
au | III | IV | IV | VII | VII | VII
o | IV | V | V | VIII | VIII | VI
u@ | IV | V | V | VIII | VIII | VII
oi | IV | V | V | VIII | VIII | VI
oo | IV | V | V | IX | IX | VI
uu | V | VI | VI | X | X | VII
u | V | VI | VI | X | X | VI
y | V | VI | VI | X | XI | III
n | V | VI | VII | XI | XII | VIII
i | V | VI | VII | XI | XII | III
@ | V | VI | VII | XI | XII | IV
k | V | VI | VII | XI | XIII | IX
g | V | VI | VII | XI | XIII | IX
l | V | VI | VII | XI | XIII | VIII
ng | V | VI | VII | XI | XIII | IX
uh | V | VI | VII | XII | XIV | VII
ou | V | VI | VII | XII | XIV | VII
aa | V | VI | VII | XII | XIV | V
@@ | V | VI | VII | XII | XV | IV
th | VI | VII | VIII | XIII | XVI | X
dh | VI | VII | VIII | XIII | XVI | X
t | VI | VII | VIII | XIV | XVII | XI
d | VI | VII | VIII | XIV | XVII | XI
s | VI | VII | VIII | XIV | XVIII | XII
z | VI | VII | VIII | XIV | XVIII | XII
f | VI | VIII | IX | XV | XIX | XIII
v | VI | VIII | IX | XV | XX | XIII
r | VI | VIII | X | XVI | XXI | XIV
(silence) | VII | IX | XI | XVII | XXI | XV

Bibliography

[Abrantes and Pereira, 1999] Abrantes, G. and Pereira, F. (1999). Mpeg-4 facial animation technology: Survey, implementation, and results. IEEE Transactions on Circuits and Systems for Video Technology, 9(2):290–305. [Acapela, 2013] Acapela (2013). Online: http://www.acapela-group.com/index.php. [Agelfors et al., 2006] Agelfors, E., Beskow, J., Karlsson, I., Kewley, J., Salvi, G., and Thomas, N. (2006). User evaluation of the synface talking head telephone. In Miesenberger, K., Klaus, J., Zagler, W., and Karshmer, A., editors, Computers Helping People with Special Needs, pages 579–586. Springer. [Aharon and Kimmel, 2004] Aharon, M. and Kimmel, R. (2004). Representation analysis and synthesis of lip images using dimensionality reduction. International Journal of Computer Vision, 67(3):297–312. [Al Moubayed et al., 2010] Al Moubayed, S., Beskow, J., Granstrom, B., and House, D. (2010). Audio-visual prosody: Perception, detection, and synthesis of prominence. In Esposito, A., Esposito, A. M., Martone, R., Muller, V., and Scarpetta, G., editors, Toward Autonomous, Adaptive, and Context-Aware Multimodal Interfaces. Theoretical and Practical Issues, pages 55–71. Springer Berlin Heidelberg. [Al Moubayed et al., 2012] Al Moubayed, S., Beskow, J., Skantze, G., and Granstrom, B. (2012). Furhat: A back-projected human-like robot head for multiparty human-machine interaction. Lecture Notes in Computer Science, 7403:114–130. [Albrecht et al., 2002] Albrecht, I., Haber, J., Kahler, K., Schroder, M., and Seidel, H.-P. (2002). May i talk to you? :-) – facial animation from text. In Proc. Pacific Graphics, pages 77–86. [Andersen, 2010] Andersen, T. S. (2010). The mcgurk illusion in the oddity task. In Proc. International Conference on Auditory-visual Speech Processing, pages paper S2–3. [Anderson and Davis, 1995] Anderson, J. and Davis, J. (1995). An introduction to neural networks. MIT Press. [Anime Studio, 2013] Anime Studio (2013). Online: http://anime.smithmicro.com/. [Arb, 2001] Arb, H. A. (2001). Hidden Markov Models for Visual Speech Synthesis in Limited Data Environments. PhD thesis, Air Force Institute of Technology. [Argyle and Cook, 1976] Argyle, M. and Cook, M. (1976). Gaze and Mutual Gaze. Cambridge University Press. [Arslan and Talkin, 1999] Arslan, L. M. and Talkin, D. (1999).
Codebook based face point trajectory synthesis algorithm using speech input. Speech Communication, 27(2):81–93. [Aschenberner and Weiss, 2005] Aschenberner, B. and Weiss, C. (2005). Phonemeviseme mapping for german video-realistic audio-visual-speech-synthesis. Technical report, IKP Bonn. [Auer and Bernstein, 1997] Auer, Jr, E. and Bernstein, L. E. (1997). Speechreading and the structure of the lexicon: computationally modeling the effects of reduced phonetic distinctiveness on lexical uniqueness. Journal of the Acoustical Society of America, 102(6):3704–3710. [AV Lab, 2013] AV Lab (2013). The audio-visual laboratory of etro. Online: http: //www.etro.vub.ac.be/Research/Nosey_Elephant_Studios. [Baayen et al., 1995] Baayen, R., Piepenbrock, R., and Gulikers, L. (1995). The celex lexical database (release 2). Technical Report celex, Linguistic Data Consortium, University of Pennsylvania. [Badin et al., 2010] Badin, P., Youssef, A., Bailly, G., Elisei, F., and Hueber, T. (2010). Visual articulatory feedback for phonetic correction in second language learning. In Proc. Workshop on Second Language Studies: Acquisition, Learning, Education and Technology, pages 1–10. [Bailly et al., 2003] Bailly, G., Brar, M., Elisei, F., and Odisio, M. (2003). Audiovisual speech synthesis. International Journal of Speech Technology, 6(4):331–346. [Bailly et al., 2002] Bailly, G., Gibert, G., and Odisio, M. (2002). Evaluation of movement generation systems using the point-light technique. In Proc. IEEE Workshop onSpeech Synthesis, pages 27–30. [Baker, 1975] Baker, J. (1975). The dragon system. IEEE Transactions on Acoustics, Speech, and Signal Processing, 23(1):24–29. [Barron et al., 1994] Barron, J., Fleet, D., and Beauchemin, S. (1994). Performance of optical flow techniques. International journal of computer vision, 12(1):43–77. BIBLIOGRAPHY 249 [Baum et al., 1970] Baum, L., Petrie, T., Soules, G., and Weiss, N. (1970). A maximization technique occurring in the statistical analysis of probabilistic functions of markov chains. Annals of mathematical statistics, 41(1):164–171. [Benesty et al., 2008] Benesty, J., Sondhi, M. M., and Huang, Y., editors (2008). Springer Handbook of Speech Processing. Springer. [Benoit and Le Goff, 1998] Benoit, C. and Le Goff, B. (1998). Audio-visual speech synthesis from french text: Eight years of models, designs and evaluation at the icp. Speech Communication, 26(1):117–129. [Benoit et al., 2000] Benoit, C., Pelachaud, C., claude Martin, J., claude Martin, J., Schomaker, L., and Suhm, B. (2000). Audio-visual and multimodal speech systems. In Gibbon, D., I.Mertins, and R.Moore, editors, Handbook of multimodal and spoken dialogue systems: Resources, terminology and product evaluation. Kluwer Academic. [Benoit et al., 2010] Benoit, M. M., Raij, T., Lin, F.-H., Jskelinen, I. P., and Stufflebeam, S. (2010). Primary and multisensory cortical activity is correlated with audiovisual percepts. Human Brain Mapping, 31(4):526–538. [Bergeron and Lachapelle, 1985] Bergeron, P. and Lachapelle, P. (1985). Controlling facial expressions and body movements in the computer generated animated short tony de peltrie. In Siggraph Tutorial Notes. [Bernstein et al., 2004] Bernstein, L., Auer, E., and Moore, J. (2004). Audiovisual speech binding: convergence or association. In Calvert, G., Spence, C., and Stein, B., editors, The handbook of multisensory processes, pages 203–223. MIT Press. [Bernstein et al., 1989] Bernstein, L. E., Eberhardt, S. P., and Demorest, M. E. (1989). 
Single-channel vibrotactile supplements to visual perception of intonation and stress. Journal of the Acoustical Society of America, 85(1):397–405. [Bernstein et al., 2000] Bernstein, L. E., Tucker, P. E., and Demorest, M. E. (2000). Speech perception without hearing. Attention, Perception, & Psychophysics, 62(2):233–252. [Beskow, 1995] Beskow, J. (1995). Rule-based visual speech synthesis. In Proc. European Conference on Speech Communication and Technology, pages 299–302. [Beskow, 2004] Beskow, J. (2004). Trainable articulatory control models for visual speech synthesis. International Journal of Speech Technology, 7(4):335–349. [Beskow and Nordenberg, 2005] Beskow, J. and Nordenberg, M. (2005). Datadriven synthesis of expressive visual speech using an mpeg-4 talking head. In BIBLIOGRAPHY 250 Proc. Annual Conference of the International Speech Communication Association (Interspeech), pages 793–796. [Beutnagel et al., 1999] Beutnagel, M., Conkie, A., Schroeter, J., Stylianou, Y., and Syrdal, A. (1999). The at&t next-gen tts system. In Proc. Joint Meeting of ASA, EAA, and DAGA, pages 18–24. [Biemann et al., 2007] Biemann, C., Heyer, G., Quasthoff, U., and Richter, M. (2007). The leipzig corpora collection–monolingual corpora of standard size. In Proc. Corpus Linguistics, pages 113–126. [Binnie et al., 1974] Binnie, C. A., Montgomery, A. A., and Jackson, P. L. (1974). Auditory and visual contributions to the perception of consonants. Journal of Speech and Hearing Research, 17(4):619–630. [Birkholz et al., 2006] Birkholz, P., Jackel, D., and Kroger, K. (2006). Construction and control of a three-dimensional vocal tract model. In Proc. IEEE International Conference on Acoustics, Speech and Signal Processing, volume 1, pages 873–876. [Black et al., 2013] Black, A., Taylor, P., and Caley, R. (2013). The festival speech synthesis system. Online: http://www.cstr.ed.ac.uk/projects/festival. html. [Blanz et al., 2003] Blanz, V., Basso, C., Poggio, T., and Vetter, T. (2003). Reanimating faces in images and video. Computer graphics forum, 22(3):641–650. [Bowers, 2001] Bowers, B. (2001). Sir Charles Wheatstone FRS: 1802-1875. Inspec/Iee. [Bozkurt et al., 2007] Bozkurt, E., Erdem, C., Erzin, E., Erdem, T., and Ozkan, M. (2007). Comparison of phoneme and viseme based acoustic units for speech driven realistic lip animation. In Proc. Signal Processing and Communications Applications, pages 1–4. [Brand, 1999] Brand, M. (1999). Voice puppetry. In Proc. Annual conference on Computer graphics and interactive techniques, pages 21–28. [Bredin and Chollet, 2006] Bredin, H. and Chollet, G. (2006). Measuring audio and visual speech synchrony: methods and applications. In Proc. IET International Conference on Visual Information Engineering, pages 255–260. [Bredin and Chollet, 2007] Bredin, H. and Chollet, G. (2007). Audio-visual speech synchrony measure for talking-face identity verification. In Proc. IEEE International Conference on Acoustics, Speech and Signal Processing, volume 2, pages 233–236. BIBLIOGRAPHY 251 [Breen et al., 1996] Breen, A. P., Bowers, E., and Welsh, W. (1996). An investigation into the generation of mouth shapes for a talking head”. In Proc. International Conference on Spoken Language Processing, pages 2159–2162. [Bregler et al., 1997] Bregler, C., Covell, M., and Slaney, M. (1997). Video rewrite: driving visual speech with audio. In Proc. Annual Conference on Computer Graphics and Interactive Techniques, pages 353–360. 
[Breiman et al., 1984] Breiman, L., Friedman, J., Olshen, R., and Stone, C. (1984). Classification and Regression Trees. Wadsworth and Brooks, Monterey, CA. [Brooke and Scott, 1998] Brooke, N. and Scott, S. (1998). Two- and threedimensional audio-visual speech synthesis. In Proc. International Conference on Auditory-visual Speech Processing, pages 213–220. [Brooke and Summerfield, 1983] Brooke, N. and Summerfield, Q. (1983). Analysis, synthesis, and perception of visible articulatory movements. Journal of Phonetics, 11:63–76. [Broomhead and Lowe, 1988] Broomhead, D. and Lowe, D. (1988). Radial basis functions, multi-variable functional interpolation and adaptive networks. Technical report, Royal Signals and Radar Establishment. [Browman and Goldstein, 1992] Browman, C. P. and Goldstein, L. (1992). Articulatory phonology: An overview. Phonetica, 49(3-4):155–180. [Campbell and Black, 1996] Campbell, N. and Black, A. (1996). Prosody and the selection of source units for concatenative synthesis. Progress in speech synthesis, 3:279–292. [Campbell, 2008] Campbell, R. (2008). The processing of audio-visual speech: empirical and neural bases. Philosophical Transactions of the Royal Society of London, 363:1001–1010. [Cao et al., 2004] Cao, Y., Faloutsos, P., Kohler, E., and Pighin, F. (2004). Realtime speech motion synthesis from recorded motions. In Proc. ACM SIGGRAPH/Eurographics symposium on Computer animation, pages 345–353. [Cappelletta and Harte, 2012] Cappelletta, L. and Harte, H. (2012). Phoneme-toviseme mapping for visual speech recognition. In Proc. International Conference on Patter Recognition Applications and Methods, pages 322–329. [Carter et al., 2010] Carter, E. J., Sharan, L., Trutoiu, L., Matthews, I., and Hodgins, J. K. (2010). Perceptually motivated guidelines for voice synchronization in film. ACM Transactions on Applied Perception, 7(4):1–12. BIBLIOGRAPHY 252 [Chang and Ezzat, 2005] Chang, Y.-J. and Ezzat, T. (2005). Transferable videorealistic speech animation. In Proc. ACM SIGGRAPH/Eurographics symposium on Computer animation, pages 143–151. [Chen, 2001] Chen, T. (2001). Audiovisual speech processing. IEEE Signal Processing Magazine, 18(1):9–21. [Chen and Rao, 1998] Chen, T. and Rao, R. R. (1998). Audio-visual integration in multimodal communication. Proceedings of the IEEE, 86(5):837–852. [Clark et al., 2007] Clark, R., Richmond, K., and King, S. (2007). Multisyn: Opendomain unit selection for the festival speech synthesis system. Speech Communication, 49(4):317–330. [CMU, 2013] CMU (2013). The carnegie mellon university pronouncing dictionary. Online: http://www.speech.cs.cmu.edu/cgi-bin/cmudict. [Cohen and Massaro, 1990] Cohen, M. and Massaro, D. (1990). Synthesis of visible speech. Behavior Research Methods, 22(2):260–263. [Cohen et al., 1996] Cohen, M., R.Walker, and Massaro, D. (1996). Perception of synthetic visual speech. In Speechreading by Humans and Machines: Models, Systems and Applications, pages 154–168. Springer-Verlag. [Cohen and Massaro, 1993] Cohen, M. M. and Massaro, D. W. (1993). Modeling coarticulation in synthetic visual speech. In Thalmann, N. M. and Thalmann, D., editors, Models and Techniques in Computer Animation, pages 139–156. SpringerVerlag. [Conkie and Isard, 1996] Conkie, A. and Isard, S. D. (1996). Optimal coupling of diphones. In Santen, J. P. H., Sproat, R. W., Olive, J. P., and Hirschberg, editors, Progress in Speech Synthesis. Springer. [Cootes et al., 2001] Cootes, T., Edwards, G., and Taylor, C. (2001). Active appearance models. 
IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(6):681–685. [Cootes and Taylor, 2001] Cootes, T. and Taylor, C. (2001). Constrained active appearance models. In Proc. Computer Vision, pages 748–754. [Corthals, 1984] Corthals, P. (1984). Een eenvoudige visementaxonomie voor spraakafzien. Tijdschrift voor Logopedie en Audiologie, 14(3):126–134. [Cosatto, 2002] Cosatto, E. (2002). Sample-Based Talking-Head Synthesis. PhD thesis, Swiss Federal Institute of Technology. BIBLIOGRAPHY 253 [Cosatto and Graf, 1998] Cosatto, E. and Graf, H. (1998). Sample-based synthesis of photo-realistic talking heads. In Proc. Computer Animation, pages 103–110. [Cosatto and Graf, 2000] Cosatto, E. and Graf, H. P. (2000). Photo-realistic talking-heads from image samples. IEEE Transactions on Multimedia, 2(3):152– 163. [Cosatto et al., 2003] Cosatto, E., Ostermann, J., Graf, H. P., and Schroeter, J. (2003). Lifelike talking faces for interactive services. Proceedings of the IEEE, 91(9):1406–1429. [Cosatto et al., 2000] Cosatto, E., Potamianos, G., and Graf, H. P. (2000). Audiovisual unit selection for the synthesis of photo-realistic talking-heads. In Proc. IEEE International Conference on Multimedia and Expo, pages 619–622. [Cosi et al., 2002] Cosi, P., Caldognetto, E., Perin, G., and Zmarich, C. (2002). Labial coarticulation modeling for realistic facial animation. In Proc. IEEE International Conference on Multimodal Interfaces, pages 505–510. [Cosi et al., 2003] Cosi, P., Fusaro, A., and Tisato, G. (2003). Lucia: A new italian talking-head based on a modified cohen-massaros labial coarticulation model. In Proc. European Conference on Speech Communication and Technology, pages 127–132. [Cosker et al., 2003] Cosker, D., Marshall, D., Rosin, P., and Hicks, Y. (2003). Video realistic talking heads using hierarchical non-linear speech-appearance models. In Proc. Mirage, pages 2–7. [Cosker et al., 2004] Cosker, D., Paddock, S., Marshall, D., Rosin, P. L., and Rushton, S. (2004). Towards perceptually realistic talking heads: models, methods and mcgurk. In Proc. Applied perception in graphics and visualization, pages 151–157. [Costa and De Martino, 2010] Costa, P. and De Martino, J. (2010). Compact 2d facial animation based on context-dependent visemes. In Proc. ACM/SSPNET International Symposium on Facial Analysis and Animation, pages 20–20. [Cyberware Scanning Products, 2013] Cyberware Scanning Products (2013). Online: http://www.cyberware.com/. [Davis et al., 1952] Davis, K., Biddulph, R., and Balashek, S. (1952). Automatic recognition of spoken digits. Journal of the Acoustical Society of America, 24(6):637–642. [De Martino et al., 2006] De Martino, J., Pini Magalhaes, L., and Violaro, F. (2006). Facial animation based on context-dependent visemes. Computers & Graphics, 30(6):971–980. BIBLIOGRAPHY 254 [Deena et al., 2010] Deena, S., Hou, S., and Galata, A. (2010). Visual speech synthesis by modelling coarticulation dynamics using a non-parametric switching state-space model. In Proc. International Conference on Multimodal Interfaces, pages 1–8. [Dehn and Van Mulken, 2000] Dehn, D. and Van Mulken, S. (2000). The impact of animated interface agents: a review of empirical research. International Journal of Human-Computer Studies, 52(1):1–22. [Deller et al., 1993] Deller, J., Proakis, J., and Hansen, J. (1993). Discrete-time processing of speech signals. Macmillan publishing company. [Demeny, 1892] Demeny, G. (1892). Les photographies parlantes. La Nature, 1:311. 
[Demuynck et al., 2008] Demuynck, K., Roelens, J., Van Compernolle, D., and Wambacq, P. (2008). Spraak: an open source speech recognition and automaticannotation kit. In Proc. Annual Conference of the International Speech Communication Association (Interspeech), pages 495–495. [Deng and Neumann, 2008] Deng, Z. and Neumann, U. (2008). Expressive speech animation synthesis with phoneme-level controls. Computer Graphics Forum, 27(8):2096–2113. [Deng et al., 2006] Deng, Z., Neumann, U., Lewis, J., Kim, T., Bulut, M., and Narayanan, S. (2006). Expressive facial animation synthesis by learning speech co-articulation and expression. IEEE Transaction on Visualization and Computer Graphics, 12(6):2006. [Deng and Noh, 2007] Deng, Z. and Noh, J. (2007). Computer facial animation: A survey. In Deng, Z. and Neumann, U., editors, Data-Driven 3D Facial Animation, pages 1–28. Springer. [Dey et al., 2010] Dey, P., Maddock, S., and Nicolson, R. (2010). Evaluation of a viseme-driven talking head. In Proc. Theory and Practice of Computer Graphics, pages 139–142. [Dixon and Maxey, 1968] Dixon, N. and Maxey, H. (1968). Terminal analog synthesis of continuous speech using the diphone method of segment assembly. IEEE Transactions on Audio and Electroacoustics, 16(1):40–50. [Du and Lin, 2002] Du, Y. and Lin, X. (2002). Realistic mouth synthesis based on shape appearance dependence mapping. Pattern Recognition Letters, 23(14):1875–1885. [Dudley et al., 1939] Dudley, H., Riesz, R., and Watkins, S. (1939). A synthetic speaker. Journal of the Franklin Institute, 227(6):739–764. BIBLIOGRAPHY 255 [Dudley and Tarnoczy, 1950] Dudley, H. and Tarnoczy, T. (1950). The speaking machine of wolfgang von kempelen. Journal of the Acoustical Society of America, 22(2):151–166. [Dunn, 1950] Dunn, H. K. (1950). The calculation of vowel resonances, and an electrical vocal tract. Journal of the Acoustical Society of America, 22(6):740– 753. [Dutoit, 1996] Dutoit, T. (1996). The mbrola project: towards a set of high quality speech synthesizers free of use for non commercial purposes. In Proc. Fourth International Conference on Spoken Language, pages 1393–1396. [Dutoit, 1997] Dutoit, T. (1997). An introduction to text-to-speech synthesis. Kluwer Academic. [Eberhardt et al., 1990] Eberhardt, S. P., Bernstein, L. E., Demorest, M. E., and Goldstein, Jr, M. (1990). Speechreading sentences with single-channel vibrotactile presentation of voice fundamental frequency. Journal of the Acoustical Society of America, 88(3):1274–1285. [Edge and Maddock, 2001] Edge, J. and Maddock, S. (2001). Expressive visual speech using geometric muscle functions. In Proc. Eurographics UK, pages 11–18. [Edge and Hilton, 2006] Edge, J. D. and Hilton, A. (2006). Visual speech synthesis from 3d video. In Proc. European Conference Visual Media Production, pages 174–179. [Edwards et al., 1998a] Edwards, G., Lanitis, A., Taylor, C., and Cootes, T. (1998a). Statistical models of face images - improving specificity. Image and Vision Computing, 16(3):203–211. [Edwards et al., 1998b] Edwards, G. J., Taylor, C. J., and Cootes, T. F. (1998b). Interpreting face images using active appearance models. In Proc. IEEE International Conference on Automatic Face and Gesture Recognition, pages 300–305. [Eggermont, 1964] Eggermont, J. (1964). Taalverwerving bij een groep dove kinderen. Een experimenteel onderzoek naar de betekenis van een geluidsmethode voor het spraakafzien. Wolters. [Eisert et al., 1997] Eisert, P., Chaudhuri, S., and Girod, B. (1997). 
Speech driven synthesis of talking head sequences. In Proc. Workshop 3D Image Analysis and Synthesis, pages 51–56. [Ekman and Friesen, 1978] Ekman, P. and Friesen, W. (1978). Facial Action Coding System (FACS): A Technique for the Measurement of Facial Action. Consulting Psychologists Press, Stanford University. BIBLIOGRAPHY 256 [Ekman et al., 1972] Ekman, P., Friesen, W. V., and Ellsworth, P. (1972). Emotion in the Human Face: Guidelines for Research and an Integration of Findings. Pergamon Press. [Elisei et al., 2001] Elisei, F., Odisio, M., G., B., and Badin, P. (2001). Creating and controlling video-realistic talking heads. In Proc. Auditory-Visual Speech Processing Workshop, pages 90–97. [Endo et al., 2010] Endo, N., Endo, K., Zecca, M., and Takanishi, A. (2010). Modular design of emotion expression humanoid robot kobian. In Schiehlen, W. and Parenti-Castelli, V., editors, ROMANSY 18 - Robot Design, Dynamics and Control, pages 465–472. Springer. [Englebienne et al., 2008] Englebienne, G., Cootes, T., and Rattray, M. (2008). A probabilistic model for generating realistic lip movements from speech. In Platt, J., Koller, D., Singer, Y., and Roweis, S., editors, Advances in Neural Information Processing Systems 20, pages 401–408. MIT Press. [Engwall, 2001] Engwall, O. (2001). Making the tongue model talk: merging mri & ema measurements. In Proc. Eurospeech, volume 1, pages 261–264. [Engwall, 2002] Engwall, O. (2002). Evaluation of a system for concatenative articulatory visual speech synthesis. In Proc. International Conference on Spoken Language Processing, pages 665–668. [Engwall et al., 2004] Engwall, O., Wik, P., Beskow, J., and Granstrom, G. (2004). Design strategies for a virtual language tutor. In Proc. of International Conference on Spoken Language Processing, volume 3, pages 1693–1696. [Erber, 1975] Erber, N. P. (1975). Auditory-visual perception of speech. Journal of Speech and Hearing Disorders, 40(4):481–492. [Erber and Filippo, 1978] Erber, N. P. and Filippo, C. L. D. (1978). Voice/mouth synthesis and tactual/visual perception of /pa, ba, ma/. Journal of the Acoustical Society of America, 64(4):1015–1019. [Escher and Thalmann, 1997] Escher, M. and Thalmann, N. (1997). Automatic 3d cloning and real-time animation of a human face. In Proc. Computer Animation, pages 58–66. [Ezzat et al., 2002] Ezzat, T., Geiger, G., and Poggio, T. (2002). Trainable videorealistic speech animation. In Proc. Annual conference on Computer graphics and interactive techniques, pages 388–398. BIBLIOGRAPHY 257 [Ezzat and Poggio, 2000] Ezzat, T. and Poggio, T. (2000). Visual speech synthesis by morphing visemes. International Journal of Computer Vision, SI: learning and vision at the center for biological and computational learning, 38(1):45–57. [Fagel, 2006] Fagel, S. (2006). Joint audio-visual unit selection - the javus speech synthesizer. In Proc. International Conference on Speech and Computer, pages 503–506. [Fagel and Clemens, 2004] Fagel, S. and Clemens, C. (2004). An articulation model for audiovisual speech synthesis – determination, adjustment, evaluation. Speech Communication, 44(1):141–154. [Fant, 1953] Fant, G. (1953). Speech communication research. Technical report, Royal Swedish Academy of Engineering Sciences. [Fasel and Luettin, 2003] Fasel, B. and Luettin, J. (2003). Automatic facial expression analysis: a survey. Pattern Recognition, 36(1):259–275. [Ferguson, 1980] Ferguson, J. (1980). Hidden markov analysis: An introduction. In Hidden Markov Modelsfor Speech. 
Institute for Defense Analyses, Princeton. [Fisher, 1968] Fisher, C. (1968). Confusions among visually perceived consonants. Journal of Speech and Hearing Research, 11(4):796–804. [Fisher, 1969] Fisher, C. G. (1969). The visibility of terminal pitch contour. Journal of Speech and Hearing Research, 12(2):379–382. [Flanagan, 1972] Flanagan, J. (1972). Speech analysis, synthesis and perception. Springer-Verlag. [Galanes et al., 1998] Galanes, F., Unverferth, J., Arslan, L., and Talkin, D. (1998). Generation of lip-synched synthetic faces from phonetically clustered face movement data. In Proc. International Conference on Auditory-visual Speech Processing, pages 191–194. [Gao et al., 1998] Gao, L., Mukigawa, Y., and Ohta, Y. (1998). Synthesis of facial images with lip motion from several real views. In Proc. IEEE International Conference on Automatic Face and Gesture Recognition, pages 181–186. [Geiger et al., 2003] Geiger, G., Ezzat, T., and Poggio, T. (2003). Perceptual evaluation of video-realistic speech. Technical report, MIT Artificial Intelligence Laboratory. [Gibbs et al., 1993] Gibbs, S., Breiteneder, C., De Mey, V., and Papathomas, M. (1993). Video widgets and video actors. In Proc. ACM symposium on user interface software and technology, pages 179–185. BIBLIOGRAPHY 258 [Gordon and Hibberts, 2011] Gordon, M. S. and Hibberts, M. (2011). Audiovisual speech from emotionally expressive and lateralized faces. Quarterly Journal of Experimental Psychology, 64(4):730–750. [Goto et al., 2001] Goto, T., Kshirsagar, S., and Magnenat-Thalmann, N. (2001). Automatic face cloning and animation using real-time facial feature tracking and speech acquisition. IEEE Signal Processing Magazine, 18(3):17–25. [Govokhina et al., 2007] Govokhina, O., Bailly, G., and Breton, G. (2007). Learning optimal audiovisual phasing for a hmm-based control model for facial animation. In Proc. ISCA Workshop on Speech Synthesis, pages 1–4. [Govokhina et al., 2006a] Govokhina, O., Bailly, G., Breton, G., and Bagshaw, P. (2006a). Evaluation de systèmes de génération de mouvements faciaux. In Proc. Journées d’Etudes sur la Parole, pages 305–308. [Govokhina et al., 2006b] Govokhina, O., Bailly, G., Breton, G., and Bagshaw, P. C. (2006b). Tda: a new trainable trajectory formation system for facial animation. In Proc. Annual Conference of the International Speech Communication Association (Interspeech), pages 2474–2477. [Goyal et al., 2000] Goyal, U., Kapoor, A., and Kalra, P. (2000). Text-to-audiovisual speech synthesizer. In Proc. International Conference on Virtual Worlds, pages 256–269. [Graf et al., 2002] Graf, H. P., Cosatto, E., Strom, V., and Huang, F. J. (2002). Visual prosody: Facial movements accompanying speech. In Proc. International Conference on Automatic Face and Gesture Recognition, pages 396–401. [Granstrom and House, 2005] Granstrom, B. and House, D. (2005). Audiovisual representation of prosody in expressive speech communication. Speech communication, 46(3):473–484. [Granstrom et al., 1999] Granstrom, B., House, D., and Lundeberg, M. (1999). Prosodic cues in multimodal speech perception. In Proc. International Congress of Phonetic Sciences, pages 655–658. [Grant, 1969] Grant, E. C. (1969). Human facial expression. Man, 4(4):525–536. [Grant and Greenberg, 2001] Grant, K. W. and Greenberg, S. (2001). Speech intelligibility derived from asynchrounous processing of auditory-visual information. In Proc. Audio-Visual Speech Processing Workshop, pages 132–137. [Grant et al., 2004] Grant, K. 
W., Van Wassenhove, V., and Poeppel, D. (2004). Detection of auditory (cross-spectral) and auditory-visual (cross-modal) synchrony. Speech Communication, 44(1):43–53. BIBLIOGRAPHY 259 [Grant et al., 1998] Grant, K. W., Walden, B. E., and Seitz, P. F. (1998). Auditoryvisual speech recognition by hearing-impaired subjects: consonant recognition, sentence recognition, and auditory-visual integration. Journal of the Acoustical Society of America, 103(5):2677–2690. [Guiard-Marigny et al., 1996] Guiard-Marigny, T., Tsingos, N., Adjoudani, A., Benoit, C., and Gascuel, M.-P. (1996). 3d models of the lips for realistic speech animation. In Proc. Computer Animation, pages 80–89. [Gutierrez-Osuna et al., 2005] Gutierrez-Osuna, R., Kakumanu, P. K., Esposito, A., Garcia, O. N., Bojorquez, A., Castillo, J. L., and Rudomin, I. (2005). Speechdriven facial animation with realistic dynamics. IEEE Transactions on Multimedia, 7(1):33–42. [Hadar et al., 1983] Hadar, U., Steiner, T. J., Grant, E. C., and Rose, F. C. (1983). Head movement correlates of juncture and stress at sentence level. Language and Speech, 26(2):117–129. [Hallgren and Lyberg, 1998] Hallgren, A. and Lyberg, B. (1998). Visual speech synthesis with concatenative speech. In Proc. Auditory Visual Speech Processing, pages 181–183. [Hazen et al., 2004] Hazen, T., Saenko, K., La, C., and Glass, J. (2004). A segmentbased audio-visual speech recognizer: data collection, development and initial experiments. In Proc. International conference on Multimodal interfaces, pages 235–242. [Heckbert, 1986] Heckbert, P. (1986). Survey of texture mapping. IEEE Computer Graphics and Applications, 6(11):56–67. [Hilder et al., 2010] Hilder, S., Theobald, B., and Harvey, R. (2010). In pursuit of visemes. In Proc. International Conference on Auditory-Visual Speech Processing, pages 154–159. [Hill et al., 1988] Hill, D. R., Pearce, A., and Wyvill, B. (1988). Animating speech: an automated approach using speech synthesised by rules. The Visual Computer, 3(5):277–289. [Hong et al., 2001] Hong, P., Wen, Z., and Huang, T. (2001). Iface: a 3d synthetic talking face. International Journal of Image and Graphics, 1(1):19–26. [Horn and Schunck, 1981] Horn, B. K. P. and Schunck, B. G. (1981). Determining optical flow. Artificial Intelligence, 17:185–203. BIBLIOGRAPHY 260 [Hou et al., 2007] Hou, Y., Sahli, H., Ilse, R., Zhang, Y., and Zhao, R. (2007). Robust shape-based head tracking. Lecture Notes in Computer Science, 4678:340– 351. [Hsieh and Chen, 2006] Hsieh, C. and Chen, Y. (2006). Partial linear regression for speech-driven talking head application. Signal Processing: Image Communication, 21(1):1–12. [Huang et al., 2002] Huang, F. J., Cosatto, E., and Graf, H. P. (2002). Triphone based unit selection for concatenative visual speech synthesis. In Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing, volume 2, pages 2037–2040. [Hunt and Black, 1996] Hunt, A. J. and Black, A. W. (1996). Unit selection in a concatenative speech synthesis system using a large speech database. In Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing, pages 373–376. [Hyvarinen et al., 2001] Hyvarinen, A., Karhunen, J., and Oja, E. (2001). Independent component analysis. Wiley & Sons. [Ip and Yin, 1996] Ip, H. H. S. and Yin, L. (1996). Constructing a 3d individualized head model from two orthogonal views. The visual computer, 12(5):254–266. [Jackson, 1988] Jackson, P. (1988). 
[Jackson and Singampalli, 2009] Jackson, P. J. and Singampalli, V. D. (2009). Statistical identification of articulation constraints in the production of speech. Speech Communication, 51(8):695–710.
[Jeffers and Barley, 1971] Jeffers, J. and Barley, M. (1971). Speechreading (Lipreading). Charles C Thomas Pub Ltd.
[Jiang et al., 2008] Jiang, D., Ravyse, I., Sahli, H., and Verhelst, W. (2008). Speech driven realistic mouth animation based on multi-modal unit selection. Journal on Multimodal User Interfaces, 2:157–169.
[Johnson et al., 2000] Johnson, W. L., Rickel, J. W., and Lester, J. C. (2000). Animated pedagogical agents: Face-to-face interaction in interactive learning environments. International Journal of Artificial Intelligence in Education, 11(1):47–78.
[Kahler et al., 2001] Kahler, K., Haber, J., and Seidel, H.-P. (2001). Geometry-based muscle modeling for facial animation. In Proc. Graphics Interface, pages 37–46.
[Kalberer and Van Gool, 2001] Kalberer, G. and Van Gool, L. (2001). Face animation based on observed 3d speech dynamics. In Proc. Computer Animation, pages 20–251.
[Karlsson et al., 2003] Karlsson, I., Faulkner, A., and Salvi, G. (2003). Synface - a talking face telephone. In Proc. European Conference on Speech Communication and Technology, pages 1297–1300.
[Kawahara et al., 1999] Kawahara, H., Masuda-Katsuse, I., and de Cheveigne, A. (1999). Restructuring speech representations using a pitch-adaptive time–frequency smoothing and an instantaneous-frequency-based f0 extraction: Possible role of a repetitive structure in sounds. Speech Communication, 27(3):187–207.
[Keating, 1988] Keating, P. (1988). Underspecification in phonetics. Phonology, 5(2):275–292.
[Kelly and Gerstman, 1961] Kelly, J. and Gerstman, L. (1961). An artificial talker driven from a phonetic input. Journal of the Acoustical Society of America, 33(6):835–835.
[Kelly and Lochbaum, 1962] Kelly, J. and Lochbaum, C. (1962). Speech synthesis. In Proc. Fourth International Conference on Acoustics, pages 1–4.
[Kent and Minifie, 1977] Kent, R. and Minifie, F. (1977). Coarticulation in recent speech production models. Journal of Phonetics, 5(2):115–133.
[Kerkhoff and Marsi, 2002] Kerkhoff, J. and Marsi, E. (2002). Nextens: a new open source text-to-speech system for dutch. In Proc. Meeting of Computational Linguistics in the Netherlands.
[Kim and Ko, 2007] Kim, I. and Ko, H. (2007). 3d lip-synch generation with data-faithful machine learning. In Proc. Computer Graphics Forum, volume 26, pages 295–301.
[King and Parent, 2005] King, S. and Parent, R. (2005). Creating speech-synchronized animation. IEEE Transactions on Visualization and Computer Graphics, 11(3):341–352.
[Klatt, 1987] Klatt, D. (1987). Review of text-to-speech conversion for english. Journal of the Acoustical Society of America, 82(3):737–793.
[Klir and Yuan, 1995] Klir, G. and Yuan, B. (1995). Fuzzy sets and fuzzy logic. Prentice Hall.
[Kominek and Black, 2004] Kominek, J. and Black, A. (2004). The cmu arctic speech databases. In Proc. ISCA Workshop on Speech Synthesis, pages 223–224.
[Krahmer et al., 2002] Krahmer, E., Ruttkay, Z., Swerts, M., and Wesselink, W. (2002). Pitch, eyebrows and the perception of focus. In Proc. Speech Prosody, pages 443–446.
[Kshirsagar and Magnenat-Thalmann, 2003] Kshirsagar, S. and Magnenat-Thalmann, N. (2003). Visyllable based speech animation. Computer Graphics Forum, 22(3):631–639.
[Kuratate et al., 2011] Kuratate, T., Pierce, B., and Cheng, G. (2011). Mask-bot: A life-size talking head animated robot for av speech and human-robot communication research. In Proc. International Conference on Auditory-Visual Speech Processing, pages 111–116.
[Kuratate et al., 1998] Kuratate, T., Yehia, H., and Vatikiotis-Bateson, E. (1998). Kinematics-based synthesis of realistic talking faces. In Proc. International Conference on Auditory-Visual Speech Processing, pages 185–190.
[Latacz, TBP] Latacz, L. (TBP). Speech Synthesis: Towards Automated Voice Building And Use in Clinical and Educational Applications (Unpublished). PhD thesis, Vrije Universiteit Brussel.
[Latacz et al., 2008] Latacz, L., Kong, Y., Mattheyses, W., and Verhelst, W. (2008). An overview of the vub entry for the 2008 blizzard challenge. In Proc. Blizzard Challenge 2008.
[Latacz et al., 2007] Latacz, L., Kong, Y., and Verhelst, W. (2007). Unit selection synthesis using long non-uniform units and phonemic identity matching. In Proc. ISCA Workshop on Speech Synthesis, pages 270–275.
[Latacz et al., 2009] Latacz, L., Mattheyses, W., and Verhelst, W. (2009). The vub blizzard challenge 2009 entry. In Proc. Blizzard Challenge 2009.
[Latacz et al., 2010] Latacz, L., Mattheyses, W., and Verhelst, W. (2010). The vub blizzard challenge 2010 entry: Towards automatic voice building. In Proc. Blizzard Challenge 2010.
[Latacz et al., 2011] Latacz, L., Mattheyses, W., and Verhelst, W. (2011). Joint target and join cost weight training for unit selection synthesis. In Proc. Annual Conference of the International Speech Communication Association (Interspeech), pages 321–324.
[Lawrence, 1953] Lawrence, W. (1953). The synthesis of speech from signals which have a low information rate. In Communication theory, pages 460–469. Butterworths, London.
[Le Goff, 1997] Le Goff, B. (1997). Automatic modeling of coarticulation in text-to-visual speech synthesis. In Proc. European Conference on Speech Communication and Technology, pages 1667–1670.
[Le Goff and Benoit, 1996] Le Goff, B. and Benoit, C. (1996). A text-to-audiovisual speech synthesizer for french. In Proc. International Conference on Spoken Language Processing, pages 2163–2166.
[Le Goff et al., 1994] Le Goff, B., Guiard-Marigny, T., Cohen, M., and Benoit, C. (1994). Real-time analysis-synthesis and intelligibility of talking faces. In Proc. ESCA/IEEE Workshop on Speech Synthesis, pages 53–56.
[Lee et al., 1995] Lee, Y., Terzopoulos, D., and Waters, K. (1995). Realistic modeling for facial animation. In Proc. Annual conference on Computer graphics and interactive techniques, pages 55–62.
[Lei et al., 2003] Lei, X., Dongmei, J., Ravyse, I., Verhelst, W., Sahli, H., Slavova, V., and Rongchun, Z. (2003). Context dependent viseme models for voice driven animation. In Proc. EURASIP Conference focused on video/image processing and multimedia communications, pages 649–654.
[Lesner and Kricos, 1981] Lesner, S. and Kricos, P. B. (1981). Visual vowel and diphthong perception across speakers. Journal of the Academy of Rehabilitative Audiology, 14:252–258.
[Lewis, 1991] Lewis, J. (1991). Automated lip-sync: Background and techniques. Journal of Visualization and Computer Animation, 2(4):118–122.
[Lewis and Parke, 1987] Lewis, J. P. and Parke, F. I. (1987). Automated lip-synch and speech synthesis for character animation. In Proc. SIGCHI/GI conference on Human factors in computing systems and graphics interface, pages 143–147.
[Lin et al., 1999] Lin, I.-C., Hung, C.-S., Yang, T.-J., and Ouhyoung, M. (1999). A speech driven talking head system based on a single face image. In Proc. Conference on Computer Graphics and Applications, pages 43–49.
[Lindsay, 1997] Lindsay, D. (1997). Talking head. American Heritage of Invention & Technology, Summer 1997:57–63.
[Ling and Wang, 2007] Ling, Z. and Wang, R. (2007). Hmm-based hierarchical unit selection combining kullback-leibler divergence with likelihood criterion. In Proc. IEEE International Conference on Acoustics, Speech and Signal Processing, volume 4, pages 1245–1248.
[Lippmann, 1989] Lippmann, R. (1989). Review of neural networks for speech recognition. Neural computation, 1(1):1–38.
[Liu and Ostermann, 2009] Liu, K. and Ostermann, J. (2009). Optimization of an image-based talking head system. EURASIP Journal on Audio, Speech and Music Processing, SI: Animating Virtual Speakers or Singers from Audio: Lip-Synching Facial Animation:174192.
[Liu and Ostermann, 2011] Liu, K. and Ostermann, J. (2011). Realistic facial expression synthesis for an image-based talking head. In Proc. IEEE International Conference on Multimedia and Expo, pages 1–6.
[Lloyd, 1982] Lloyd, S. (1982). Least squares quantization in pcm. IEEE Transactions on Information Theory, 28(2):129–137.
[Lofqvist, 1990] Lofqvist, A. (1990). Speech as audible gestures. In Hardcastle, W. and Marchal, A., editors, Speech Production and Speech Modeling, pages 289–322. Kluwer Academic Publishers.
[Ma et al., 2006] Ma, J., Cole, R., Pellom, B., Ward, W., and Wise, B. (2006). Accurate visible speech synthesis based on concatenating variable length motion capture data. IEEE Transactions on Visualization and Computer Graphics, 12(2):266–276.
[Ma et al., 2009] Ma, W. J., Zhou, X., Ross, L. A., Foxe, J. J., and Parra, L. C. (2009). Lip-reading aids word recognition most in moderate noise: a bayesian explanation using high-dimensional feature space. PLoS One, 4(3):e4638.
[MacLeod and Summerfield, 1987] MacLeod, A. and Summerfield, Q. (1987). Quantifying the contribution of vision to speech perception in noise. British Journal of Audiology, 21:131–141.
[MacLeod and Summerfield, 1990] MacLeod, A. and Summerfield, Q. (1990). A procedure for measuring auditory and audio-visual speech-reception thresholds for sentences in noise: rationale, evaluation, and recommendations for use. British Journal of Audiology, 24(1):29–43.
[Malcangi, 2010] Malcangi, M. (2010). Text-driven avatars based on artificial neural networks and fuzzy logic. International journal of computers, 4(2):61–69.
[Massaro et al., 1999] Massaro, D., Beskow, J., Cohen, M., Fry, C., and Rodriguez, T. (1999). Picture my voice: Audio to visual speech synthesis using artificial neural networks. In Proc. International Conference on Auditory-visual Speech Processing, pages 133–138.
[Massaro and Cohen, 1990] Massaro, D. and Cohen, M. M. (1990). Perception of synthesized audible and visible speech. Psychological Science, 1(1):55–63.
[Massaro, 2003] Massaro, D. W. (2003). A computer-animated tutor for spoken and written language learning. In Proc. International Conference on Multimodal Interfaces, pages 172–175.
[Matlab, 2013] Matlab (2013). Online documentation. Online: http://www.mathworks.nl/help/signal/ref/firpm.html.
[Mattheyses et al., 2009a] Mattheyses, W., Latacz, L., and Verhelst, W. (2009a). Multimodal coherency issues in designing and optimizing audiovisual speech synthesis techniques. In Proc. International Conference on Auditory-visual Speech Processing, pages 47–52.
[Mattheyses et al., 2009b] Mattheyses, W., Latacz, L., and Verhelst, W. (2009b). On the importance of audiovisual coherence for the perceived quality of synthesized visual speech. EURASIP Journal on Audio, Speech and Music Processing, SI: Animating Virtual Speakers or Singers from Audio: Lip-Synching Facial Animation:169819.
[Mattheyses et al., 2010a] Mattheyses, W., Latacz, L., and Verhelst, W. (2010a). Active appearance models for photorealistic visual speech synthesis. In Proc. Annual Conference of the International Speech Communication Association (Interspeech), pages 1113–1116.
[Mattheyses et al., 2010b] Mattheyses, W., Latacz, L., and Verhelst, W. (2010b). Optimized photorealistic audiovisual speech synthesis using active appearance modeling. In Proc. International Conference on Auditory-visual Speech Processing, pages 148–153.
[Mattheyses et al., 2011a] Mattheyses, W., Latacz, L., and Verhelst, W. (2011a). Auditory and photo-realistic audiovisual speech synthesis for dutch. In Proc. International Conference on Auditory-Visual Speech Processing, pages 55–60.
[Mattheyses et al., 2011b] Mattheyses, W., Latacz, L., and Verhelst, W. (2011b). Automatic viseme clustering for audiovisual speech synthesis. In Proc. Annual Conference of the International Speech Communication Association (Interspeech), pages 2173–2176.
[Mattheyses et al., 2013] Mattheyses, W., Latacz, L., and Verhelst, W. (2013). Comprehensive many-to-many phoneme-to-viseme mapping and its application for concatenative visual speech synthesis. Speech Communication, 55(7-8):857–876.
[Mattheyses et al., 2008] Mattheyses, W., Latacz, L., Verhelst, W., and Sahli, H. (2008). Multimodal unit selection for 2d audiovisual text-to-speech synthesis. Lecture Notes In Computer Science, 5237:125–136.
[Mattheyses et al., 2006] Mattheyses, W., Verhelst, W., and Verhoeve, P. (2006). Robust pitch marking for prosodic modification of speech using td-psola. In Proc. Annual IEEE BENELUX/DSP Valley Signal Processing Symposium, pages 43–46.
[Mattys et al., 2002] Mattys, S., Bernstein, L., and Auer Jr., E. (2002). Stimulus-based lexical distinctiveness as a general word-recognition mechanism. Perception and Psychophysics, 64(4):667–679.
[McGurk and MacDonald, 1976] McGurk, H. and MacDonald, J. (1976). Hearing lips and seeing voices. Nature, 264(5588):746–748.
[Melenchon et al., 2009] Melenchon, J., Martinez, E., De La Torre, F., and Montero, J. (2009). Emphatic visual speech synthesis. IEEE Transactions on Audio, Speech, and Language Processing, 17(3):459–468.
[Melenchon et al., 2007] Melenchon, J., Simo, J., Cobo, G., and Martinez, E. (2007). Objective viseme extraction and audiovisual uncertainty: Estimation limits between auditory and visual modes. In Proc. International Conference on Auditory-Visual Speech Processing, pages 191–194.
[Mermelstein, 1976] Mermelstein, P. (1976). Distance measures for speech recognition, psychological and instrumental. Pattern recognition and artificial intelligence, 116:91–103.
[Mertens and Vercammen, 1998] Mertens, P. and Vercammen, F. (1998). Fonilex manual. Technical report, K.U.Leuven CCL.
[Minnis and Breen, 2000] Minnis, S. and Breen, A. P. (2000). Modeling visual coarticulation in synthetic talking heads using a lip motion unit inventory with concatenative synthesis. In Proc. International Conference on Spoken Language Processing, pages 759–762.
[Montgomery, 1980] Montgomery, A. A. (1980). Development of a model for generating synthetic animated lip shapes. Journal of the Acoustical Society of America, 68(S1):S58–S59.
[Montgomery and Jackson, 1983] Montgomery, A. A. and Jackson, P. L. (1983). Physical characteristics of the lips underlying vowel lipreading performance. Journal of the Acoustical Society of America, 73(6):2134–2144.
[Mori, 1970] Mori, M. (1970). The uncanny valley. Energy, 7(4):33–35.
[Moulines and Charpentier, 1990] Moulines, E. and Charpentier, F. (1990). Pitch-synchronous waveform processing techniques for text-to-speech synthesis using diphones. Speech Communication, 9(5):453–467.
[MPEG, 2013] MPEG (2013). ISO-IEC-14496-2. Online: http://www.iso.org.
[Muller et al., 2005] Muller, P., Kalberer, G., Proesmans, M., and Van Gool, L. (2005). Realistic speech animation based on observed 3d face dynamics. IEE Proceedings on Vision, Image and Signal Processing, 152(4):491–500.
[Munhall et al., 2004] Munhall, K. G., Jones, J. A., Callan, D. E., Kuratate, T., and Vatikiotis-Bateson, E. (2004). Visual prosody and speech intelligibility: head movement improves auditory speech perception. Psychological Science, 15(2):133–137.
[Musti et al., 2011] Musti, U., Colotte, V., Toutios, A., and Ouni, S. (2011). Introducing visual target cost within an acoustic-visual unit-selection speech synthesizer. In Proc. International Conference on Auditory-Visual Speech Processing, pages 49–55.
[Myers et al., 1980] Myers, C., Rabiner, L., and Rosenberg, A. (1980). Performance tradeoffs in dynamic time warping algorithms for isolated word recognition. IEEE Transactions on Acoustics, Speech, and Signal Processing, 28(6):623–635.
[Nakaoka et al., 2009] Nakaoka, S., Kanehiro, F., Miura, K., Morisawa, M., Fujiwara, K., Kaneko, K., Kajita, S., and Hirukawa, H. (2009). Creating facial motions of cybernetic human hrp-4c. In Proc. IEEE-RAS International Conference on Humanoid Robots, pages 561–567.
[Nefian et al., 2002] Nefian, A., Liang, L., Pi, X., Xiaoxiang, L., Mao, C., and Murphy, K. (2002). A coupled hmm for audio-visual speech recognition. In Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing, volume 2, pages 2013–2016.
[Noh and Neumann, 2000] Noh, J. and Neumann, U. (2000). Talking faces. In Proc. IEEE International Conference on Multimedia and Expo, volume 2, pages 627–630.
[Noma et al., 2000] Noma, T., Zhao, L., and Badler, N. I. (2000). Design of a virtual human presenter. Computer Graphics and Applications, 20(4):79–85.
[Nuance, 2013] Nuance (2013). Online: http://netherlands.nuance.com/bedrijven/oplossing/Spraak-naar-tekst/index.htm.
[Ohman, 1967] Ohman, S. E. (1967). Numerical model of coarticulation. Journal of the Acoustical Society of America, 41(2):310–320.
[Ostermann, 1998] Ostermann, J. (1998). Animation of synthetic faces in mpeg-4. In Proc. Computer Animation, pages 49–55.
[Ostermann et al., 1998] Ostermann, J., Chen, L., and Huang, T. (1998). Animated talking head with personalized 3d head model. Journal of VLSI Signal Processing, 20(1):97–105.
[Ostermann and Millen, 2000] Ostermann, J. and Millen, D. (2000). Talking heads and synthetic speech: An architecture for supporting electronic commerce. In Proc. IEEE International Conference on Multimedia and Expo, pages 71–74.
[Ouni et al., 2006] Ouni, S., Cohen, M., Ishak, H., and Massaro, D. (2006). Visual contribution to speech perception: Measuring the intelligibility of animated talking heads. EURASIP Journal on Audio, Speech, and Music Processing, 2007:047891.
[Owens and Blazek, 1985] Owens, E. and Blazek, B. (1985). Visemes observed by hearing-impaired and normal-hearing adult viewers. Journal of Speech and Hearing Research, 28:381–393.
[Pandzic and Forchheimer, 2003] Pandzic, I. and Forchheimer, R. (2003). MPEG-4 Facial Animation: The Standard, Implementation and Applications. John Wiley & Sons Inc.
[Pandzic et al., 1999] Pandzic, I. S., Ostermann, J., and Millen, D. R. (1999). User evaluation: Synthetic talking faces for interactive services. The Visual Computer, 15(7):330–340.
[Parke, 1982] Parke, F. (1982). Parametric models for facial animation. Computer Graphics and Applications, 2(9):61–68.
[Parke, 1972] Parke, F. I. (1972). Computer generated animation of faces. In Proc. ACM annual conference, pages 451–457.
[Parke, 1975] Parke, F. I. (1975). A model for human faces that allows speech synchronized animation. Computers & Graphics, 1(1):3–4.
[Pearce et al., 1986] Pearce, A., Wyvill, B., Wyvill, G., and Hill, D. (1986). Speech and expression: a computer solution to face animation. In Proc. Graphics and Vision Interface, pages 136–140.
[Pearson, 1901] Pearson, K. (1901). On lines and planes of closest fit to systems of points in space. Philosophical Magazine and Journal of Science, 2(11):559–572.
[Pelachaud, 1991] Pelachaud, C. (1991). Communication and Coarticulation in Facial Animation. PhD thesis, University of Pennsylvania.
[Pelachaud et al., 1996] Pelachaud, C., Badler, N., and Steedman, M. (1996). Generating facial expressions for speech. Cognitive science, 20(1):1–46.
[Pelachaud et al., 1991] Pelachaud, C., Badler, N. I., and Steedman, M. (1991). Linguistic issues in facial animation. In Proc. Computer Animation, pages 15–30.
[Pelachaud et al., 2001] Pelachaud, C., Magno-Caldognetto, E., Zmarich, C., and Cosi, P. (2001). Modelling an italian talking head. In Proc. International Conference on Auditory-Visual Speech Processing, pages 72–77.
[Perng et al., 1998] Perng, W., Wu, Y., and Ouhyoung, M. (1998). Image talk: a real time synthetic talking head using one single image with chinese text-to-speech capability. In Proc. Pacific Conference on Computer Graphics and Applications, pages 140–148.
[Peterson et al., 1958] Peterson, G., Wang, W., and Sivertsen, E. (1958). Segmentation techniques in speech synthesis. Journal of the Acoustical Society of America, 30(7):682–683.
[Pighin et al., 1998] Pighin, F., Hecker, J., Lischinski, D., Szeliski, R., and Salesin, D. H. (1998). Synthesizing realistic facial expressions from photographs. In Proc. Annual conference on Computer graphics and interactive techniques, pages 75–84.
[Pitrelli et al., 1994] Pitrelli, J., Beckman, M., and Hirschberg, J. (1994). Evaluation of prosodic transcription labeling reliability in the tobi framework. In Proc. International Conference on Spoken Language Processing, pages 123–126.
[Pixar Animation Studios, 2013] Pixar Animation Studios (2013). Online: http://www.pixar.com/.
[Platt and Badler, 1981] Platt, S. M. and Badler, N. I. (1981). Animating facial expressions. Computer Graphics, 15(3):245–252.
[Plenge and Tilse, 1975] Plenge, G. and Tilse, U. (1975). The cocktail party effect with and without conflicting visual clues. In Proc. Audio Engineering Society Convention, pages L–11.
[Porter and Duff, 1984] Porter, T. and Duff, T. (1984). Compositing digital images. SIGGRAPH Computer Graphics, 18(3):253–259.
[Potamianos et al., 2004] Potamianos, G., Neti, C., Luettin, J., and Matthews, I. (2004). Audio-visual automatic speech recognition: An overview. In Bailly, G., Vatikiotis-Bateson, E., and Perrier, P., editors, Issues in Visual and Audio-Visual Speech Processing. MIT Press.
[Rabiner, 1989] Rabiner, L. (1989). A tutorial on hidden markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2):257–286.
[Rabiner and Juang, 1993] Rabiner, L. and Juang, B.-H. (1993). Fundamentals of Speech Recognition. Prentice Hall.
[Rabiner and Schafer, 1978] Rabiner, L. and Schafer, R. (1978). Digital processing of speech signals. Prentice-Hall, Englewood Cliffs.
[Reveret et al., 2000] Reveret, L., Bailly, G., and Badin, P. (2000). Mother: A new generation of talking heads providing a flexible articulatory control for video-realistic speech animation. In Proc. International Conference on Spoken Language Processing, pages 755–758.
[Ritter et al., 1999] Ritter, M., Meier, U., Yang, J., and Waibel, A. (1999). Face translation: A multimodal translation agent. In Proc. International Conference on Auditory-Visual Speech Processing, page paper 28.
[Rogozan, 1999] Rogozan, A. (1999). Discriminative learning of visual data for audiovisual speech recognition. International Journal on Artificial Intelligence Tools, 8(1):43–52.
[Ronnberg et al., 1998] Ronnberg, J., Samuelsson, S., and Lyxell, B. (1998). Conceptual constraints in sentence-based lipreading in the hearing-impaired. In Campbell, R., Dodd, B., and Burnham, D., editors, Hearing by eye II: Advances in the psychology of speechreading and auditory–visual speech, pages 143–153. Psychology Press.
[Rosen, 1958] Rosen, G. (1958). Dynamic analog speech synthesizer. Journal of the Acoustical Society of America, 30(3):201–209.
[Roweis, 1998] Roweis, S. (1998). Em algorithms for pca and spca. Advances in neural information processing systems, 10:626–632.
[Saenko, 2004] Saenko, E. (2004). Articulatory features for robust visual speech recognition. Master's thesis, Massachusetts Institute of Technology.
[Schmidt and Cohn, 2001] Schmidt, K. L. and Cohn, J. F. (2001). Human facial expressions as adaptations: Evolutionary questions in facial expression research. American Journal of Physical Anthropology, S33:3–24.
[Schroder and Trouvain, 2003] Schroder, M. and Trouvain, J. (2003). The german text-to-speech synthesis system mary: A tool for research, development and teaching. International Journal of Speech Technology, 6(4):365–377.
[Schroeder, 1993] Schroeder, M. (1993). A brief history of synthetic speech. Speech Communication, 13(1):231–237.
[Schroeter et al., 2000] Schroeter, J., Ostermann, J., Graf, H. P., Beutnagel, M. C., Cosatto, E., Syrdal, A. K., Conkie, A., and Stylianou, Y. (2000). Multimodal speech synthesis. In Proc. IEEE International Conference on Multimedia and Expo, pages 571–578.
[Schwartz et al., 2004] Schwartz, J.-L., Berthommier, F., and Savariaux, C. (2004). Seeing to hear better: evidence for early audio-visual interactions in speech identification. Cognition, 93(2):69–78.
[Schwippert and Benoit, 1997] Schwippert, C. and Benoit, C. (1997). Audiovisual intelligibility of an androgynous speaker. In Proc. International Conference on Auditory-visual Speech Processing, pages 81–84.
[Scott et al., 1994] Scott, K., Kagels, D., Watson, S., Rom, H., Wright, J., Lee, M., and Hussey, K. (1994). Synthesis of speaker facial movement to match selected speech sequences. In Proc. Australian Conference on Speech Science and Technology, pages 620–625.
[Second Life, 2013] Second Life (2013). Online: http://secondlife.com/.
[Senin, 2008] Senin, P. (2008). Dynamic time warping algorithm review. Technical report, Information and Computer Science Department, University of Hawaii, Honolulu.
[Shiraishi et al., 2003] Shiraishi, T., Toda, T., Kawanami, H., Saruwatari, H., and Shikano, K. (2003). Simple designing methods of corpus-based visual speech synthesis. In Proc. Annual Conference of the International Speech Communication Association (Interspeech), pages 2241–2244.
[Sifakis et al., 2005] Sifakis, E., Neverov, I., and Fedkiw, R. (2005). Automatic determination of facial muscle activations from sparse motion capture marker data. ACM Transactions on Graphics, 24(3):417–425.
[Sifakis et al., 2006] Sifakis, E., Selle, A., Robinson-Mosher, A., and Fedkiw, R. (2006). Simulating speech with a physics-based facial muscle model. In Proc. ACM SIGGRAPH/Eurographics symposium on Computer animation, pages 261–270.
[Skipper et al., 2007] Skipper, J. I., van Wassenhove, V., Nusbaum, H. C., and Small, S. L. (2007). Hearing lips and seeing voices: how cortical areas supporting speech production mediate audiovisual speech perception. Cerebral Cortex, 17(10):2387–2399.
[Slaney and Covell, 2001] Slaney, M. and Covell, M. (2001). Facesync: A linear operator for measuring synchronization of video facial images and audio tracks. Advances in Neural Information Processing Systems, 14:814–820.
[Smalley, 1963] Smalley, W. (1963). Manual of Articulatory Phonetics. Practical Anthropology.
[Smits et al., 2003] Smits, R., Warner, N., McQueen, J., and Cutler, A. (2003). Unfolding of phonetic information over time: A database of dutch diphone perception. Journal of the Acoustical Society of America, 113(1):563–573.
[Sproull et al., 1996] Sproull, L., Subramani, M., Kiesler, S., Walker, J., and Waters, K. (1996). When the interface is a face. Human-Computer Interaction, 11(2):97–124.
[Stegmann et al., 2003] Stegmann, M. B., Ersboll, B. K., and Larsen, R. (2003). Fame - a flexible appearance modeling environment. IEEE Transactions on Medical Imaging, 22(10):1319–1331.
[Stewart, 1922] Stewart, J. (1922). An electrical analogue of the vocal organs. Nature, 110:311–312.
[Summerfield, 1992] Summerfield, Q. (1992). Lipreading and audio-visual speech perception. Philosophical Transactions of the Royal Society of London: Biological Sciences, 335(1273):71–78.
[Swerts and Krahmer, 2005] Swerts, M. and Krahmer, E. (2005). Audiovisual prosody and feeling of knowing. Journal of Memory and Language, 53(1):81–94.
[Swerts and Krahmer, 2006] Swerts, M. and Krahmer, E. (2006). The importance of different facial areas for signalling visual prominence. In Proc. Annual Conference of the International Speech Communication Association (Interspeech), pages paper 1289–Tue3WeO.3.
[Tamura et al., 1999] Tamura, M., Kondo, S., Masuko, T., and Kobayashi, T. (1999). Text-to-audio-visual speech synthesis based on parameter generation from hmm. In Proc. European Conference on Speech Communication and Technology, pages 959–962.
[Tamura et al., 1998] Tamura, M., Masuko, T., Kobayashi, T., and Tokuda, K. (1998). Visual speech synthesis based on parameter generation from hmm: Speech-driven and text-and-speech-driven approaches. In Proc. International Conference on Auditory-Visual Speech Processing, pages 221–226.
[Tao et al., 2009] Tao, J., Xin, L., and Yin, P. (2009). Realistic visual speech synthesis based on hybrid concatenation method. IEEE Transactions on Audio, Speech, and Language Processing, 17(3):469–477.
[Taylor, 2009] Taylor, P. (2009). Text-to-speech synthesis. Cambridge University Press.
[Taylor et al., 2012] Taylor, S. L., Mahler, M., Theobald, B.-J., and Matthews, I. (2012). Dynamic units of visual speech. In Proc. ACM SIGGRAPH/Eurographics conference on Computer Animation, pages 275–284.
[Terzopoulos and Waters, 1993] Terzopoulos, D. and Waters, K. (1993). Analysis and synthesis of facial image sequences using physical and anatomical models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 15(6):569–579.
[Theobald, 2007] Theobald, B. (2007). Audiovisual speech synthesis. In Proc. International Congress on Phonetic Sciences, pages 285–290.
[Theobald et al., 2008] Theobald, B., Fagel, S., Bailly, G., and Elisei, F. (2008). Lips2008: Visual speech synthesis challenge. In Proc. Annual Conference of the International Speech Communication Association (Interspeech), pages 1875–1878.
[Theobald and Matthews, 2012] Theobald, B. and Matthews, I. (2012). Relating objective and subjective performance measures for aam-based visual speech synthesis. IEEE Transactions on Audio, Speech, and Language Processing, 20(8):2378–2387.
[Theobald et al., 2003] Theobald, B., Matthews, I., Glauert, J., Bangham, A., and Cawley, G. (2003). 2.5d visual speech synthesis using appearance models. In Proc. British Machine Vision Conference, pages 42–52.
[Theobald and Wilkinson, 2007] Theobald, B. and Wilkinson, N. (2007). A real-time speech-driven talking head using active appearance models. In Proc. International Conference on Auditory-visual Speech Processing, volume 7, pages 22–28.
[Theobald, 2003] Theobald, B.-J. (2003). Visual Speech Synthesis using Shape and Appearance Models. PhD thesis, University of East Anglia.
[Theobald et al., 2004] Theobald, B.-J., Bangham, J. A., Matthews, I. A., and Cawley, G. C. (2004). Near-videorealistic synthetic talking faces: implementation and evaluation. Speech Communication, 44(1):127–140.
[Theobald and Wilkinson, 2008] Theobald, B.-J. and Wilkinson, N. (2008). A probabilistic trajectory synthesis system for synthesising visual speech. In Proc. Annual Conference of the International Speech Communication Association (Interspeech), pages 1857–1860.
[Tiddeman and Perrett, 2002] Tiddeman, B. and Perrett, D. (2002). Prototyping and transforming visemes for animated speech. In Proc. Computer Animation, pages 248–251.
[Tinwell et al., 2011] Tinwell, A., Grimshaw, M., Nabi, D., and Williams, A. (2011). Facial expression of emotion and perception of the uncanny valley in virtual characters. Computers in Human Behavior, 27(2):741–749.
[Toutios et al., 2011] Toutios, A., Musti, U., Ouni, S., and Colotte, V. (2011). Weight optimization for bimodal unit-selection talking head synthesis. In Proc. Annual Conference of the International Speech Communication Association (Interspeech), pages 2249–2252.
[Toutios et al., 2010a] Toutios, A., Musti, U., Ouni, S., Colotte, V., Wrobel-Dautcourt, B., and Berger, M.-O. (2010a). Setup for acoustic-visual speech synthesis by concatenating bimodal units. In Proc. Annual Conference of the International Speech Communication Association (Interspeech), pages 486–489.
[Toutios et al., 2010b] Toutios, A., Musti, U., Ouni, S., Colotte, V., Wrobel-Dautcourt, B., and Berger, M.-O. (2010b). Towards a true acoustic-visual speech synthesis. In Proc. International Conference on Auditory-Visual Speech Processing.
[Turkmani, 2007] Turkmani, A. (2007). Visual Analysis of Viseme Dynamics. PhD thesis, University of Surrey.
[Uz et al., 1998] Uz, B., Gudukbay, U., and Ozguc, B. (1998). Realistic speech animation of synthetic faces. In Proc. Computer Animation, pages 111–118.
[Van Santen and Buchsbaum, 1997] Van Santen, J. and Buchsbaum, A. (1997). Methods for optimal text selection. In Proc. Eurospeech, pages 553–556.
[Van Son et al., 1994] Van Son, N., Huiskamp, T., Bosman, A., and Smoorenburg, G. (1994). Viseme classifications of dutch consonants and vowels. Journal of the Acoustical Society of America, 96(3):1341–1355.
[Van Wassenhove et al., 2005] Van Wassenhove, V., Grant, K. W., and Poeppel, D. (2005). Visual speech speeds up the neural processing of auditory speech. Proceedings of the National Academy of Sciences of the United States of America, 102(4):1181–1186.
[Van Wassenhove et al., 2007] Van Wassenhove, V., Grant, K. W., and Poeppel, D. (2007). Temporal window of integration in auditory-visual speech perception. Neuropsychologia, 45(3):598–607.
[Vatikiotis-Bateson et al., 1998] Vatikiotis-Bateson, E., Eigsti, I. M., Yano, S., and Munhall, K. G. (1998). Eye movement of perceivers during audiovisual speech perception. Perception and Psychophysics, 60(6):926–940.
[Verhelst and Roelands, 1993] Verhelst, W. and Roelands, M. (1993). An overlap-add technique based on waveform similarity (wsola) for high quality time-scale modification of speech. In Proc. IEEE international conference on Acoustics, speech, and signal processing, pages 554–557.
[Verma et al., 2003] Verma, A., Rajput, N., and Subramaniam, L. (2003). Using viseme based acoustic models for speech driven lip synthesis. In Proc. IEEE International Conference on Acoustics, Speech and Signal Processing, pages 720–723.
[Vicon Systems, 2013] Vicon Systems (2013). Online: http://www.vicon.com/.
[Vidakovic, 2008] Vidakovic, B. (2008). Statistical Modeling by Wavelets. Wiley.
[Vignoli and Braccini, 1999] Vignoli, F. and Braccini, C. (1999). A text-speech synchronization technique with applications to talking heads. In Proc. International Conference on Auditory-Visual Speech Processing, page paper 22.
[Visser et al., 1999] Visser, M., Poel, M., and Nijholt, A. (1999). Classifying visemes for automatic lipreading. In Proc. International Workshop on Text, Speech and Dialogue, pages 349–352.
[Viterbi, 1967] Viterbi, A. (1967). Error bounds for convolutional codes and an asymptotically optimum decoding algorithm. IEEE Transactions on Information Theory, 13(2):260–269.
[Von Kempelen, 1791] Von Kempelen, W. (1791). Mechanismus der menschlichen Sprache nebst der Beschreibung seiner sprechenden Maschine. J. B. Degen.
[Walker et al., 1994] Walker, J., Sproull, L., and Subramani, R. (1994). Using a human face in an interface. In Proc. SIGCHI conference on Human factors in computing systems, pages 85–91.
[Wang et al., 2010] Wang, L., Han, W., Qian, X., and Soong, F. (2010). Photo-real lips synthesis with trajectory-guided sample selection. In Proc. ISCA Workshop on Speech Synthesis, pages 217–222.
[Waters, 1987] Waters, K. (1987). A muscle model for animating three-dimensional facial expressions. Computer Graphics, 21(4):17–24.
[Waters and Frisbie, 1995] Waters, K. and Frisbie, J. (1995). A coordinated muscle model for speech animation. In Proc. Graphics Interface, pages 163–163.
[Waters and Levergood, 1993] Waters, K. and Levergood, T. (1993). Decface: An automatic lip-synchronization algorithm for synthetic faces. Technical report, DEC Cambridge Research Laboratory.
[Weiss et al., 2010] Weiss, B., Kuhnel, C., Wechsung, I., Fagel, S., and Moller, S. (2010). Quality of talking heads in different interaction and media contexts. Speech Communication, 52(6):481–492.
[Weiss, 2004] Weiss, C. (2004). A framework for data-driven video-realistic audiovisual speech synthesis. In Proc. International Conference on Language Resources and Evaluation.
[Wells, 1997] Wells, J. (1997). Sampa computer readable phonetic alphabet. In Gibbon, D., Moore, R., and Winski, R., editors, Handbook of standards and resources for spoken language systems. Berlin and New York: Mouton de Gruyter.
[Williams, 1990] Williams, L. (1990). Performance-driven facial animation. Computer Graphics, 24(4):235–242.
[Wilting et al., 2006] Wilting, J., Krahmer, E., and Swerts, M. (2006). Real vs acted emotional speech. In Proc. Annual Conference of the International Speech Communication Association (Interspeech), pages paper 1093–Tue1A3O.4.
[Wolberg, 1990] Wolberg, G. (1990). Digital Image Warping. IEEE Computer Society Press.
[Wolberg, 1998] Wolberg, G. (1998). Image morphing: a survey. The visual computer, 14(8):360–372.
[Woods, 1986] Woods, J. C. (1986). Lipreading: a guide for beginners. John Murray.
[Xvid, 2013] Xvid (2013). Online: http://www.xvid.org/.
[Yang et al., 2000] Yang, J., Xiao, J., and Ritter, M. (2000). Automatic selection of visemes for image-based visual speech synthesis. In Proc. IEEE International Conference on Multimedia and Expo, pages 1081–1084.
[Yilmazyildiz et al., 2006] Yilmazyildiz, S., Mattheyses, W., Patsis, Y., and Verhelst, W. (2006). Expressive speech recognition and synthesis as enabling technologies for affective robot-child communication. Lecture Notes in Computer Science, 4261:1–8.
[Young et al., 2006] Young, S. J., Evermann, G., Gales, M. J. F., Hain, T., Kershaw, D., Moore, G., Odell, J., Ollason, D., Povey, D., Valtchev, V., and Woodland, P. C. (2006). The HTK Book, version 3.4. Cambridge University Engineering Department.
[Ypsilos et al., 2004] Ypsilos, I., Hilton, A., Turkmani, A., and Jackson, P. (2004). Speech-driven face synthesis from 3d video. In Proc. 3D Data Processing, Visualization and Transmission Workshop, pages 58–65.
[Yu et al., 2010] Yu, D., Ghita, O., Sutherland, A., and Whelan, P. (2010). A novel visual speech representation and hmm classification for visual speech recognition. IPSJ Transactions on Computer Vision and Applications, 2:25–38.
[Zelezny et al., 2006] Zelezny, M., Krnoul, Z., Cisar, P., and Matousek, J. (2006). Design, implementation and evaluation of the czech realistic audio-visual speech synthesis. Signal Processing, 86(12):3657–3673.
[Zen et al., 2009] Zen, H., Tokuda, K., and Black, A. (2009). Statistical parametric speech synthesis. Speech Communication, 51(11):1039–1064.