Eurospeech 2003 Abstracts Book
Transcription
8th European Conference on Speech Communication and Technology
September 1-4, 2003 – Geneva, Switzerland

BOOK OF ABSTRACTS

Typeset by: Causal Productions Pty Ltd, www.causal.on.net, [email protected]

Table of Contents

Plenary Talks
SMoCa: Aurora Noise Robustness on SMALL Vocabulary Databases
SMoCb: ISCA Special Interest Group Session: "Hot Topics" in Speech Science & Technology
OMoCc: Speech Signal Processing I
OMoCd: Phonology & Phonetics I
PMoCe: Topics in Prosody & Emotional Speech
PMoCf: Language Modeling, Discourse & Dialog
PMoCg: Speech Synthesis: Unit Selection I
SMoDa: Aurora Noise Robustness on LARGE Vocabulary Databases
SMoDb: Multilingual Speech-to-Speech Translation
OMoDc: Prosody
OMoDd: Language Modeling
PMoDe: Speech Modeling & Features I
PMoDf: Speech Enhancement I
PMoDg: Spoken Dialog Systems I
OTuBa: Robust Speech Recognition - Noise Compensation
STuBb: Forensic Speaker Recognition
OTuBc: Emotion in Speech
OTuBd: Dialog System User & Domain Modeling
PTuBf: Phonology & Phonetics II
PTuBg: Speech Modeling & Features II
PTuBh: Topics in Speech Recognition & Segmentation
OTuCa: Robust Speech Recognition - Acoustic Modeling
STuCb: Advanced Machine Learning Algorithms for Speech & Language Processing
OTuCc: Speech Modeling & Features III
OTuCd: Multi-Modal Spoken Language Processing
PTuCe: Speech Coding & Transmission
PTuCf: Speech Recognition - Search & Lexicon Modeling
PTuCg: Speech Technology Applications
OTuDa: Robust Speech Recognition - Front-end Processing
STuDb: Spoken Language Processing for e-Inclusion
OTuDc: Speech Synthesis: Unit Selection II
OTuDd: Language & Accent Identification
PTuDe: Speech Enhancement II
PTuDf: Speech Recognition - Adaptation I
PTuDg: Speech Resources & Standards
OWeBa: Speech Recognition - Adaptation II
SWeBb: Towards Synthesizing Expressive Speech
OWeBc: Speaker Verification
OWeBd: Dialog System Generation
PWeBe: Speech Signal Processing II
PWeBf: Robust Speech Recognition I
PWeBg: Speech Recognition - Large Vocabulary I
PWeBh: Spoken Dialog Systems II
OWeCa: Speech Recognition - Large Vocabulary II
SWeCb: Robust Methods in Processing of Natural Language Dialogues
OWeCc: Speaker Identification
OWeCd: Speech Synthesis: Miscellaneous I
PWeCe: Speech Perception
PWeCf: Robust Speech Recognition II
PWeCg: Multi-Modal Processing & Speech Interface Design
OWeDb: Speech Recognition - Language Modeling
OWeDc: Speech Modeling & Features IV
SWeDd: Feature Analysis & Cross-Language Processing of Chinese Spoken Language
PWeDe: Speech Production & Physiology
PWeDf: Speech Synthesis: Voice Conversion & Miscellaneous Topics
PWeDg: Acoustic Modelling I
SThBb: Time is of the Essence - Dynamic Approaches to Spoken Language
OThBc: Topics in Speech Recognition
OThBd: Acoustic Modelling II
PThBe: Speaker & Language Recognition
PThBf: Robust Speech Recognition III
PThBg: Spoken Language Understanding & Translation
PThBh: Speech Signal Processing III
SThCb: Towards a Roadmap for Speech Technology
OThCc: Speech Signal Processing IV
OThCd: Speech Synthesis: Miscellaneous II
PThCe: Speaker Recognition & Verification
PThCf: Robust Speech Recognition IV
PThCg: Multi-Lingual Spoken Language Processing
PThCh: Interdisciplinary

PLENARY TALKS

Speech and Language Processing: Where Have We Been and Where Are We Going?
Kenneth Ward Church; AT&T Labs-Research, USA
Time: Tuesday 08:30 to 09:30, Venue: Room 1
Can we use the past to predict the future? Moore's Law is a great example: performance doubles and prices halve approximately every 18 months. This trend has held up well to the test of time and is expected to continue for some time. Similar arguments can be found in speech, demonstrating consistent progress over decades. Unfortunately, there are also cases where history repeats itself, as well as major dislocations, fundamental changes that invalidate fundamental assumptions. What will happen, for example, when petabytes become a commodity? Can demand keep up with supply? How much text and speech would it take to match this supply? Priorities will change. Search will become more important than coding and dictation.

Robust Speech Recognition Using Model-Based Feature Enhancement
Veronique Stouten, Hugo Van hamme, Kris Demuynck, Patrick Wambacq; Katholieke Universiteit Leuven, Belgium
Maintaining a high level of robustness for Automatic Speech Recognition (ASR) systems is especially challenging when the background noise has a time-varying nature. We have implemented a Model-Based Feature Enhancement (MBFE) technique that not only can easily be embedded in the feature extraction module of a recogniser, but also is intrinsically suited for the removal of non-stationary additive noise. To this end we combine statistical models of the cepstral feature vectors of both clean speech and noise, using a Vector Taylor Series approximation in the power spectral domain. Based on this combined HMM, a global MMSE-estimate of the clean speech is then calculated. Because of the scalability of the applied models, MBFE is flexible and computationally feasible. Recognition experiments with this feature enhancement technique on the Aurora2 connected digit recognition task showed significant improvements in the noise robustness of the HTK recogniser.
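The MBFE recipe above (statistical models of clean speech and noise combined through a VTS-style mismatch approximation, followed by an MMSE estimate of the clean features) can be illustrated with a small numerical sketch. The snippet below is a minimal, hedged reconstruction of that idea for a single frame, assuming a diagonal-covariance clean-speech GMM in the log-Mel domain, a single-Gaussian noise model and the common log-sum mismatch function; the function and variable names are illustrative and are not taken from the paper.

```python
import numpy as np

# Minimal sketch of model-based feature enhancement (illustrative, not the
# authors' code). Clean speech: diagonal GMM in the log-Mel domain. Noise:
# a single mean vector mu_n. Additive noise is approximated around each
# component mean with the log-sum function  y ~= x + log(1 + exp(n - x)).

def mmse_enhance(y, weights, mu_x, var_x, mu_n):
    """Return an MMSE estimate of the clean log-Mel vector for noisy frame y."""
    # Predicted noisy-speech mean per mixture component (zeroth-order expansion).
    mu_y = mu_x + np.log1p(np.exp(mu_n - mu_x))                 # (K, D)
    # Posterior of each component given the observed noisy frame.
    log_lik = -0.5 * (np.log(2.0 * np.pi * var_x)
                      + (y - mu_y) ** 2 / var_x).sum(axis=1)    # (K,)
    log_post = np.log(weights) + log_lik
    log_post -= log_post.max()
    post = np.exp(log_post)
    post /= post.sum()
    # Per-component clean estimate: remove the predicted mismatch from y,
    # then combine the estimates with the component posteriors.
    x_hat_k = y - (mu_y - mu_x)                                 # (K, D)
    return post @ x_hat_k                                       # (D,)

# Toy usage with random model parameters, purely for illustration.
rng = np.random.default_rng(0)
K, D = 4, 23
weights = np.full(K, 1.0 / K)
mu_x = rng.normal(size=(K, D))
var_x = np.full((K, D), 0.5)
mu_n = rng.normal(size=D)
x_hat = mmse_enhance(rng.normal(size=D), weights, mu_x, var_x, mu_n)
```

The paper itself combines HMMs for speech and noise and computes a global estimate over the utterance; the frame-wise GMM version here only conveys the shape of the computation.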
Auditory Principles in Speech Processing - Do Computers Need Silicon Ears?
Birger Kollmeier; Universität Oldenburg, Germany
Time: Wednesday 08:30 to 09:30, Venue: Room 1
A brief review is given of speech processing techniques that are based on auditory models, with an emphasis on applications of the "Oldenburg perception model", i.e., objective assessment of subjective sound quality for speech and audio codecs, automatic speech recognition, SNR estimation, and hearing aids.

Session: SMoCa– Oral
Aurora Noise Robustness on SMALL Vocabulary Databases
Time: Monday 13.30, Venue: Room 1
Chair: David Pearce, Motorola Labs, UK

Several HKU Approaches for Robust Speech Recognition and Their Evaluation on Aurora Connected Digit Recognition Tasks
Jian Wu, Qiang Huo; University of Hong Kong, China
Recently, we at The University of Hong Kong (HKU) have proposed several approaches based on stochastic vector mapping and switching linear Gaussian HMMs to compensate for environmental distortions in robust speech recognition. In this paper, we present a comparative study of these algorithms and report results of performance evaluation on the Aurora connected digits databases. By following the protocol specified by the organizer of the Eurospeech 2003 special session on Aurora tasks, the best performance we achieved on the Aurora2 database is a digit recognition error rate, averaged over all three test sets, of 5.53% and 6.28% for multi- and clean-condition training respectively. In a preliminary evaluation on the Aurora3 Finnish and Spanish databases, significant performance improvement is also achieved by our approach.

A Speech Processing Front-End with Eigenspace Normalization for Robust Speech Recognition in Noisy Automobile Environments
Kaisheng Yao, Erik Visser, Oh-Wook Kwon, Te-Won Lee; University of California at San Diego, USA
A new front-end processing scheme for robust speech recognition is proposed and evaluated on the multi-lingual Aurora 3 database. The front-end processing scheme consists of Mel-scaled spectral subtraction, speech segmentation, cepstral coefficient extraction, utterance-level frame dropping and eigenspace feature normalization. We also investigated performance on all language databases by post-processing features extracted by the ETSI advanced front-end with an additional eigenspace normalization module. This step consists of a linear PCA matrix feature transformation followed by mean and variance normalization of the transformed cepstral coefficients. In speech recognition experiments, our proposed front-end yielded better than 16 percent relative error rate reduction over the ETSI front-end on the Finnish language database. Also, more than 6% average relative error reduction was observed over all languages with the ETSI front-end augmented by eigenspace normalization.
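The eigenspace normalization module described in the abstract above (a linear PCA transform of the cepstral features followed by mean and variance normalization of the transformed coefficients) is compact enough to sketch. The following is an illustrative reconstruction of that two-step recipe, assuming utterance-level normalization statistics and invented function names; it is not the authors' implementation.

```python
import numpy as np

def fit_pca(train_feats, n_components):
    """Estimate a PCA projection from cepstral vectors pooled over training data.

    train_feats: (N, D) array of cepstral feature vectors.
    Returns the global mean (D,) and the projection matrix (D, n_components).
    """
    mean = train_feats.mean(axis=0)
    cov = np.cov(train_feats - mean, rowvar=False)
    eigval, eigvec = np.linalg.eigh(cov)
    order = np.argsort(eigval)[::-1][:n_components]   # leading eigenvectors
    return mean, eigvec[:, order]

def eigenspace_normalize(utt_feats, mean, proj, eps=1e-8):
    """Project one utterance into the PCA eigenspace, then apply per-utterance
    mean and variance normalization of the transformed coefficients."""
    z = (utt_feats - mean) @ proj          # (T, n_components)
    z -= z.mean(axis=0)                    # cepstral-mean-style normalization
    z /= (z.std(axis=0) + eps)             # variance normalization
    return z
```

Whether the normalization statistics are computed per utterance or per speaker, and whether the PCA matrix is trained on clean or multi-condition data, are choices the abstract leaves open; the sketch simply fixes one plausible combination.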
Average Instantaneous Frequency (AIF) and Average Log-Envelopes (ALE) for ASR with the Aurora 2 Database
Yadong Wang, Jesse Hansen, Gopi Krishna Allu, Ramdas Kumaresan; University of Rhode Island, USA
We have developed a novel approach to speech feature extraction based on a modulation model of a band-pass signal. Speech is processed by a bank of band-pass filters. At the output of the band-pass filters the signal is subjected to a log-derivative operation which naturally decomposes the band-pass signal into analytic (called α̇(t) + j α̂̇(t)) and anti-analytic (called β̇(t) + j β̂̇(t)) components. The average instantaneous frequency (AIF) and average log-envelope (ALE) are then extracted as coarse features at the output of each filter. Further, refined features may also be extracted from the analytic and anti-analytic components (but this is not done in this paper). We then evaluated the Aurora 2 task, where the noise corruption is synthetic. For clean training (compared to the mel-cepstrum front end, with a 3-mixture HMM back-end), our AIF/ALE front end achieves an average improvement of 13.97% with set A, 17.92% with set B, and -31.72% (negative) 'improvement' with set C. The overall improvement in accuracy rates for clean training is 7.97%. Although the improvements are modest, the novelty of the front-end and its potential for future enhancements are our strengths.

Maximum Likelihood Normalization for Robust Speech Recognition
Yiu-Pong Lai, Man-Hung Siu; Hong Kong University of Science & Technology, China
It is well known that additive and channel noise cause shift and scaling in MFCC features. Empirical normalization techniques to estimate and compensate for these effects, such as cepstral mean subtraction and variance normalization, have been shown to be useful. However, these empirical estimates may not be optimal. In this paper, we approach the problem from two directions: 1) use more robust MFCC-based features that are less sensitive to additive and channel noise, and 2) propose a maximum likelihood (ML) based approach to compensate for the noise effect. In addition, we propose the use of multi-class normalization in which different normalization factors can be applied to different phonetic units. The combination of the robust features and ML normalization is particularly useful for highly mismatched conditions in the Aurora 3 corpus, resulting in a 15.8% relative improvement in the highly mismatched case and a 10.4% relative improvement on average over the three conditions.

Adaptation of Acoustic Model Using the Gain-Adapted HMM Decomposition Method
Akira Sasou 1, Futoshi Asano 1, Kazuyo Tanaka 2, Satoshi Nakamura 3; 1 AIST, Japan; 2 University of Tsukuba, Japan; 3 ATR-SLT, Japan
In a real environment, it is essential to adapt acoustic models to variations in background noises in order to realize robust speech recognition. In this paper, we construct an extended acoustic model by combining a mismatch model with a clean acoustic model trained using only clean speech data. We assume the mismatch model conforms to a Gaussian distribution with time-varying population parameters. The proposed method adapts the extended acoustic model on-line to the unknown noises by estimating the time-varying population parameters using a Gaussian Mixture Model (GMM) and the Gain-Adapted Hidden Markov Model (GA-HMM) decomposition method. We performed recognition experiments under noisy conditions using the AURORA2 database in order to confirm the effectiveness of the proposed method.

ISCA Special Session: Hot Topics in Speech Synthesis
Gerard Bailly 1, Nick Campbell 2, Bernd Möbius 3; 1 ICP-CNRS, France; 2 ATR-HIS, Japan; 3 University of Stuttgart, Germany
What are the Hot Topics for speech synthesis? How will they differ in five years' time? ISCA's SynSIG presents a few suggestions. This paper attempts to identify the top five hot topics, based not on an analysis of what is being presented at current workshops and conferences, but rather on an analysis of what is NOT. It will be accompanied by results from a questionnaire polling SynSIG members' views and opinions.

Perceiving Emotions by Ear and by Eye
Beatrice de Gelder; Tilburg University, The Netherlands
Affective information is conveyed through visual as well as auditory perception. The present paper considers the integration of these channels of information, that is, the multisensory processing of emotion. Findings from behavioral, neuropsychological and imaging studies are reviewed.
Session: SMoCb– Oral ISCA Special Interest Group Session: "Hot Topics" in Speech Science & Technology Strategies for Automatic Multi-Tier Annotation of Spoken Language Corpora Time: Monday 13.30, Venue: Room 2 Chair: Valérie Hazan, University College London, UK Steven Greenberg; The Speech Institute, USA Spoken corpora of the future will be annotated at multiple levels of linguistic organization largely through automatic methods using a combination of sophisticated signal processing, statistical classifiers and expert knowledge. It is important that annotation tools be adaptable to a wide range of languages and speaking styles, as well as readily accessible to the speech research and technology communities around the world. This latter objective is of particular importance for minority languages, which are less likely to foster development of sophisticated speech technology without such universal access. Person Authentication by Voice: A Need for Caution Jean-François Bonastre 1 , Frédéric Bimbot 2 , Louis-Jean Boë 3 , Joseph P. Campbell 4 , Douglas A. Reynolds 4 , Ivan Magrin-Chagnolleau 5 ; 1 LIA-CNRS, France; 2 IRISA, France; 3 ICP-CNRS, France; 4 Massachusetts Institute of Technology, USA; 5 DDL-CNRS, France Why is the Special Structure of the Language Important for Chinese Spoken Language Processing? – Examples on Spoken Document Retrieval, Segmentation and Summarization Because of recent events and as members of the scientific community working in the field of speech processing, we feel compelled to publicize our views concerning the possibility of identifying or authenticating a person from his or her voice. The need for a clear and common message was indeed shown by the diversity of information that has been circulating on this matter in the media and general public over the past year. In a press release initiated by the AFCP and further elaborated in collaboration with the SpLC ISCA-SIG, the two groups herein discuss and present a summary of the current state of scientific knowledge and technological development in the field of speaker recognition, in accessible wording for nonspecialists. Our main conclusion is that, despite the existence of technological solutions to some constrained applications, at the present time, there is no scientific process that enables one to uniquely characterize a person’s voice or to identify with absolute certainty an individual from his or her voice. Lin-shan Lee, Yuan Ho, Jia-fu Chen, Shun-Chuan Chen; National Taiwan University, Taiwan The Chinese language is not only spoken by the largest population in the world, but quite different from many western languages with a very special structure. It is not alphabetic: large number of Chinese characters are ideographic symbols and pronounced as monosyllables. The open vocabulary nature, the flexible wording structure and the tone behavior are also good examples within the special structure. It is believed that better results and performance will be obtainable in developing Chinese spoken language processing technologies, if this special structure can be taken into account. In this paper, a set of “feature units” for Chinese spoken language processing is identified, and the retrieval, segmentation and summarization of Chinese spoken documents are taken as examples in analyzing the use of such “feature units”. Experimental results indicate that by careful considerations of the special structure and proper choice of the “feature units”, significantly better performance can be achieved. 
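The Lee et al. abstract above argues for syllable- and character-level "feature units" in Chinese spoken document retrieval but does not spell out the units themselves. As one concrete, purely illustrative reading of that idea, the sketch below indexes documents by overlapping bigram units over recognized monosyllabic tokens and ranks them by unit overlap with a query; the unit choice, the scoring, and all names and toy data are assumptions of ours, not the paper's method.

```python
from collections import Counter

def ngram_units(tokens, n=2):
    """Overlapping n-gram units from a sequence of recognized syllables or
    characters (monosyllabic units, as the abstract emphasizes)."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def retrieve(query_tokens, documents, n=2):
    """Rank documents by a simple overlap score between query and document
    n-gram unit counts. 'documents' maps a document id to its token sequence."""
    q = ngram_units(query_tokens, n)
    scores = {}
    for doc_id, tokens in documents.items():
        d = ngram_units(tokens, n)
        scores[doc_id] = sum(min(count, d[unit]) for unit, count in q.items())
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Toy usage with hypothetical pinyin-like syllable tokens.
docs = {
    "doc1": ["tai", "wan", "da", "xue", "yu", "yin", "chu", "li"],
    "doc2": ["zhong", "wen", "xin", "wen", "jian", "suo"],
}
print(retrieve(["yu", "yin", "chu", "li"], docs))
```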
En raison d’événements récents et en tant que membres de la communauté scientifique en traitement de la parole, l’AFCP et le SpLC, tous deux Groupes d’Intérêt Spécialisés de l’ISCA, ont collaboré à cet article pour présenter un résumé de l’état des connaissances scientifiques et du développement technologique en reconnaissance du locuteur, en des termes clairement accessibles à des non-spécialistes. La nécessité d’un message clair et commun concernant la possibilité d’identifier ou d’authentifier une personne par sa voix apparaît particulièrement nécessaire compte-tenu de la diversité des informations qui ont circulé dans les médias et dans l’opinion publique ces derniers mois. En dépit de l’existence de solutions technologiques pour quelques applications dans des contextes d’utilisation très contraints, nous tenons à affirmer qu’ à l’heure actuelle, il n’existe pas de processus scientifique qui permet de caractériser de façon unique la voix d’une personne, ou d’identifier avec certitude un individu à partir de sa voix. 2 Eurospeech 2003 Monday September 1-4, 2003 – Geneva, Switzerland frequency resolution. Two popularly used hearing aid algorithms, a two channel wide band system and a nine channel compression system, are simulated and are used to compensate the impaired auditory model. The responses of the compensated system, in terms of the acoustic-phonetics cues that characterise speech intelligibility, are analysed and compared with one another and with that of a normal auditory system. It is shown that although the nine channel compression algorithm performs better than the two channel system both the hearing aid algorithms distort severely the acousticphonetic cues. Session: OMoCc– Oral Speech Signal Processing I Time: Monday 13.30, Venue: Room 3 Chair: Hynek Hermansky, Oregon Graduate Institute of Science and Technology, USA Speech Analysis with the Short-Time Chirp Transform Frequency-Related Representation of Speech Luis Weruaga 1 , Marián Képesi 2 ; 1 Cartagena University of Technology, Spain; 2 Forschungszentrum Telekommunikation Wien, Austria Kuldip K. Paliwal 1 , Bishnu S. Atal 2 ; 1 Griffith University, Australia; 2 AT&T Labs-Research, USA Cepstral features derived from power spectrum are widely used for automatic speech recognition. Very little work, if any, has been done in speech research to explore phase-based representations. In this paper, an attempt is made to investigate the use of phase function in the analytic signal of critical-band filtered speech for deriving a representation of frequencies present in the speech signal. Results are presented which show the validity of this approach. The most popular time-frequency analysis tool, the Short-Time Fourier Transform, suffers from blurry harmonic representation when voiced speech undergoes changes in pitch. These relatively fast variations lead to inconsistent bins in frequency domain and cannot be accurately described by the Fourier analysis with high resolution both in time and frequency. In this paper a new analysis tool, called Short-Time Chirp Transform is presented, offering more precise time-frequency representation of speech signals. The base of this adaptive transform is composed of quadratic chirps that follow the pitch tendency segment-by-segment. Comparative results between the proposed STCT and popular time-frequency techniques reveal an improvement in time-frequency localization and finer spectral representation. 
Since the signal can be resynthesized from its STCT, the proposed method is also suitable for filtering purposes. Tracking a Moving Speaker Using Excitation Source Information Vikas C. Raykar 1 , Ramani Duraiswami 1 , B. Yegnanarayana 2 , S.R. Mahadeva Prasanna 2 ; 1 University of Maryland, USA; 2 Indian Institute of Technology, India Glottal Spectrum Based Inverse Filtering Microphone arrays are widely used to detect, locate, and track a stationary or moving speaker. The first step is to estimate the time delay, between the speech signals received by a pair of microphones. Conventional methods like generalized cross-correlation are based on the spectral content of the vocal tract system in the speech signal. The spectral content of the speech signal is affected due to degradations in the speech signal caused by noise and reverberation. However, features corresponding to the excitation source of speech are less affected by such degradations. This paper proposes a novel method to estimate the time delays using the excitation source information in speech. The estimated delays are used to get the position of the moving speaker. The proposed method is compared with the spectrum-based approach using real data from a microphone array setup. Ixone Arroabarren, Alfonso Carlosena; Universidad Publica de Navarra, Spain In this paper a new inverse filtering technique for the time-domain estimation of the glottal excitation is presented. This approach uses the DAP modeling for the vocal tract characterization, and a spectral model for the derivative of the glottal flow. This spectral model is based on the spectrum of the KLGLOTT88 model for the glottal source. The proposed procedure removes the glottal source from the spectrum of the speech signal in an accurate manner, particularly for high-pitched signals and singing voice, and the estimated glottal waveforms present less amount of formant ripple. En este trabajo se presenta una nueva técnica de filtrado inverso para la estimación de la fuente glotal. Dicha técnica combina la herramienta de cálculo de la respuesta de un sistema todo polos, basada en las muestras espectrales de la señal (DAP modeling), con un modelo espectral más preciso de la derivada de la fuente glotal. Este modelo espectral está basado en el espectro del modelo temporal para la fuente glotal KLGLOTT88. El algoritmo propuesto, elimina el efecto de la fuente en el espectro de la señal de habla de una manera más precisa, lo cual es de especial interés en señales con alta frecuencia fundamental y señales de canto. Como consecuencia de esto la estimación de la fuente glotal resultante del filtrado inverso presenta un menor rizado característico del efecto de los formantes. Tracking Vocal Tract Resonances Using an Analytical Nonlinear Predictor and a Target-Guided Temporal Constraint Li Deng, Issam Bazzi, Alex Acero; Microsoft Research, USA A technique for high-accuracy tracking of formants or vocal tract resonances is presented in this paper using a novel nonlinear predictor and using a target-directed temporal constraint. The nonlinear predictor is constructed from a parameter-free, discrete mapping function from the formant (frequencies and bandwidths) space to the LPC-cepstral space, with trainable residuals. We examine in this study the key role of vocal tract resonance targets in the tracking accuracy. 
Experimental results show that due to the use of the targets, the tracked formants in the consonantal regions (including closures and short pauses) of the speech utterance exhibit the same dynamic properties as for the vocalic regions, and reflect the underlying vocal tract resonances. The results also demonstrate the effectiveness of training the prediction-residual parameters and of incorporating the target-based constraint in obtaining high-accuracy formant estimates, especially for non-sonorant portions of speech. A Novel Method of Analysing and Comparing Responses of Hearing Aid Algorithms Using Auditory Time-Frequency Representation G.V. Kiran, T.V. Sreenivas; Indian Institute of Science, India A new and potentially important method for predicting, analysing and comparing responses of hearing aid algorithms is studied and presented here. This method is based on a time-frequency representation (TFR) generated by a computational auditory model. Hearing impairment is simulated by a change of parameters of the auditory model. To simulate the basilar membrane (BM) filtering part of the auditory model we propose a single parameter control version of the gammachirp filterbank and for simulating the neural processing in the auditory pathway we propose a signal processing model motivated by the physiological properties of the auditory nerve. This model then interprets the information processing in the auditory pathway through the use of a TFR called the auditory TFR (A-TFR) which matches the standard spectrogram in terms of both time and 3 Eurospeech 2003 Monday September 1-4, 2003 – Geneva, Switzerland Analysis and Modeling of Syllable Duration for Thai Speech Synthesis Session: OMoCd– Oral Phonology & Phonetics I Chatchawarn Hansakunbuntheung 1 , Virongrong Tesprasit 1 , Rungkarn Siricharoenchai 1 , Yoshinori Sagisaka 2 ; 1 NECTEC, Thailand; 2 Waseda University, Japan Time: Monday 13.30, Venue: Room 4 Chair: Dafydd Gibbon, Linguistics, Bielefeld, Germany Features of Contracted Syllables of Spontaneous Mandarin This paper describes the analysis results on the control factors of Thai syllable duration, and a statistical control model using linear regression technique. The analyses have been carried out both at a syllable level and at a phrase level. In a syllable level duration control, the effects of five Thai tones and syllable structures are investigated. To analyze syllable structure effects statistically, we applied the quantification theory with two linguistic factors: (1) phone categories by themselves, and (2) the categories grouped by articulatory similarities. In a phrase level, the effects of position in a phrase and syllable counts in a phrase were analyzed. The experimental results showed that tones, syllable structures, and position in a phrase play significant roles on syllable duration control. Syllable counts in a phrase slightly affects the syllable duration. These analysis results have been integrated into a statistical control model. The duration assignment precision of the proposed model is evaluated using 2480-word speech data. Total correlation 0.73 between predicted values and observed values for test set samples shows the fair precision of the proposed control model. Shu-Chuan Tseng; Academia Sinica, Taiwan Mandarin is a syllable-timed language whose syllable structure is quite simple [1]. 
In spontaneous Mandarin, because of rapid speech rate the structure of syllable may be changed, phonemes may be reduced and syllable boundaries as well as lexical tones may be merged. This fact has long been noticed, but no quantified empirical data were actually presented in the literature until now. This paper focuses on a special type of syllable reduction in spontaneous Mandarin caused by heavy coarticulation of phonemes across syllable boundaries, namely the phenomenon of syllable contraction. Contracted syllables result from segmental deletions and omission of syllable boundary. This paper reports a series of corpus-based results of analyses on contracted syllables in Mandarin conversation by taking account of phonological as well as non-phonological factors. Durational Characteristics of Hindi Stop Consonants Reaction Time as an Indicator of Discrete Intonational Contrasts in English K. Samudravijaya; Tata Institute of Fundamental Research, India Aoju Chen; University of Nijmegen, The Netherlands This paper reports a perceptual study using a semantically motivated identification task in which we investigated the nature of two pairs of intonational contrasts in English: (1) normal High accent vs. emphatic High accent; (2) early peak alignment vs. late peak alignment. Unlike previous inquiries, the present study employs an on-line method using the Reaction Time measurement, in addition to the measurement of response frequencies. Regarding the peak height continuum, the mean RTs are shortest for within-category identification but longest for across-category identification. As for the peak alignment contrast, no identification boundary emerges and the mean RTs only reflect a difference between peaks aligned with the vowel onset and peaks aligned elsewhere. We conclude that the peak height contrast is discrete but the previously claimed discreteness of the peak alignment contrast is not borne out. A study of the durational characteristics of Hindi stop consonants in spoken sentences was carried out. An annotated and time-aligned Hindi speech database was used in the experiment. The influences of aspiration, voicing and gemination on the durations of closure and post-release segments of plosives as well as the duration of the preceding vowel were studied. It was observed that the post-release duration of a plosive changes systematically with manner of articulation. However, due to its large variation in continuous speech, the post-release duration alone is not sufficient to identify the manner of articulation of Hindi stops as hypothesised in earlier studies. A low value of the ratio of the duration of a vowel to the closure duration of the following plosive is a reliable indicator of gemination in Hindi stop consonants in continuous speech. Quantity Comparison of Japanese and Finnish in Various Word Structures Session: PMoCe– Poster Topics in Prosody & Emotional Speech Toshiko Isei-Jaakkola; University of Helsinki, Finland The durational patterns of short and long vowels and consonants were investigated at the segmental and lexical level using variable syllable structures in Japanese and Finnish. The results showed that the Japanese segmental ratios between short and long in both vowels and consonants were longer than those of Finnish only when all segments were pooled. However, this was not necessarily true when observing their positions in different structures. 
Compared the lexical increase ratios based on the CVCV words, Japanese and Finnish showed regular patterns according to the word structures respectively. The Japanese patterns were very isochronical in any word structures, whereas the Finnish durational ratios stably decreased within the same moraic word structures with the same number of segments but different combinations of the same vowel and consonant. These results suggest that Japanese has a tendency to be more mora-counting than Finnish in temporal isochronity. Time: Monday 13.30, Venue: Main Hall, Level -1 Chair: Keikichi Hirose, Tokyo Univ., Japan Transforming F0 Contours Ben Gillett, Simon King; University of Edinburgh, U.K. Voice transformation is the process of transforming the characteristics of speech uttered by a source speaker, such that a listener would believe the speech was uttered by a target speaker. Training F0 contour generation models for speech synthesis requires a large corpus of speech. If it were possible to adapt the F0 contour of one speaker to sound like that of another speaker, using a small, easily obtainable parameter set, this would be extremely valuable. We present a new method for the transformation of F0 contours from one speaker to another based on a small linguistically motivated parameter set. The system performs a piecewise linear mapping using these parameters. A perceptual experiment clearly demonstrates that the presented system is at least as good as an existing technique for all speaker pairs, and that in many cases it is much better and almost as good as using the target F0 contour. Broad Focus Across Sentence Types in Greek Mary Baltazani; University of California at Los Angeles, USA In Greek main sentence stress is located on the rightmost constituent in ‘all new’ declaratives, but for all-new negatives, polar questions, and wh-questions it is located on the negative particle, main verb, and wh-word respectively. I discuss the implications of this pattern for the focus projection rules and for the accentedness of discourse new constituents. Evaluation of the Affect of Speech Intonation Using a Model of the Perception of Interval Dissonance and Harmonic Tension Norman D. Cook, Takeshi Fujisawa, Kazuaki Takami; Kansai University, Japan 4 Eurospeech 2003 Monday We report the application of a psychophysical model of pitch perception to the analysis of speech intonation. The model was designed to reproduce the empirical findings on the perception of musical phenomena (the dissonance/consonance of intervals and the tension/sonority of chords), but does not depend on specific musical scales or tuning systems. Application to intonation allows us to calculate the total dissonance and tension among the pitches in the speech utterance. In an experiment using the 144 utterances of 18 male and female subjects, we found greater dissonance and harmonic tension in sentences with negative affect, in comparison with sentences with positive affect. September 1-4, 2003 – Geneva, Switzerland segmentation cue, the results of several cross-modal fragment priming experiments reveal strong limitations to stress-based segmentation. When stress was pitted against phonotactic and coarticulatory cues, substantial effects of the latter two cues were found, but there was no evidence for stress-based segmentation. However, when the stimuli were presented in a background of noise, the pattern of results reversed: Strong syllables generated more priming than weak ones, regardless of coarticulation and phonotactics. 
Furthermore, a similar dependency was found between stress and lexicality. Priming was stronger when the prime was preceded by a real than a nonsense word, regardless of the stress pattern of the prime. Yet, again, a reversal in cue dominance was observed when the stimuli were played in noise. These results underscore the secondary role of stress-based segmentation in clear speech, and its efficiency in impoverished listening conditions. More generally, they call for an integrated, hierarchical, and signal-contingent approach to speech segmentation. A New Pitch Modeling Approach for Mandarin Speech Wen-Hsing Lai, Yih-Ru Wang, Sin-Horng Chen; National Chiao Tung University, Taiwan In this paper, a new approach to model syllable pitch contour for Mandarin speech is proposed. It takes the mean and shape of syllable pitch contour as two basic modeling units and considers several affecting factors that contribute to their variations. Parameters of the two models are automatically estimated by the EM algorithm. Experimental results showed that RMSEs of 0.551 ms and 0.614 ms in the reconstructed pitch were obtained for the closed and open tests, respectively. All inferred values of those affecting factors agreed well with our prior linguistic knowledge. Besides, the prosodic states automatically labeled by the pitch mean model provided useful cues to determine the prosodic phrase boundaries occurred at inter-syllable locations without punctuation marks. So it is a promising pitch modeling approach. Emotion Recognition by Speech Signals Oh-Wook Kwon, Kwokleung Chan, Jiucang Hao, Te-Won Lee; University of California at San Diego, USA For emotion recognition, we selected pitch, log energy, formant, mel-band energies, and mel frequency cepstral coefficients (MFCCs) as the base features, and added velocity/ acceleration of pitch and MFCCs to form feature streams. We extracted statistics used for discriminative classifiers, assuming that each stream is a onedimensional signal. Extracted features were analyzed by using quadratic discriminant analysis (QDA) and support vector machine (SVM). Experimental results showed that pitch and energy were the most important factors. Using two different kinds of databases, we compared emotion recognition performance of various classifiers: SVM, linear discriminant analysis (LDA), QDA and hidden Markov model (HMM). With the text-independent SUSAS database, we achieved the best accuracy of 96.3% for stressed/neutral style classification and 70.1% for 4-class speaking style classification using Gaussian SVM, which is superior to the previous results. With the speaker-independent AIBO database, we achieved 42.3% accuracy for 5-class emotion recognition. Bayesian Induction of Intonational Phrase Breaks P. Zervas, M. Maragoudakis, Nikos Fakotakis, George Kokkinakis; University of Patras, Greece For the present paper, a Bayesian probabilistic framework for the task of automatic acquisition of intonational phrase breaks was established. By considering two different conditional independence assumptions, the naïve Bayes and Bayesian networks approaches were regarded and evaluated against the CART algorithm, which has been previously used with success. A finite length window of minimal morphological and syntactic resources was incorporated, i.e. the POS label and the kind of phrase boundary, a novel syntactic feature that has not been applied to intonational phrase break detection before. 
This feature can be used in languages where syntactic parsers are not available and proves to be important, not only for the proposed Bayesian methodologies but for other algorithms, like CART. Trained on a 5500 word database, Bayesian networks proved to be the most effective in terms of precision (82,3%) and recall (77,2%) for predicting phrase breaks. Automatic Prosodic Prominence Detection in Speech Using Acoustic Features: An Unsupervised System Fabio Tamburini; University of Bologna, Italy This paper presents work in progress on the automatic detection of prosodic prominence in continuous speech. Prosodic prominence involves two different phonetic features: pitch accents, connected with fundamental frequency (F0) movements and syllable overall energy, and stress, which exhibits a strong correlation with syllable nuclei duration and mid-to-high-frequency emphasis. By measuring these acoustic parameters it is possible to build an automatic system capable of correctly identifying prominent syllables with an agreement, with human-tagged data, comparable with the interhuman agreement reported in the literature. This system does not require any training phase, additional information or annotation, it is not tailored to a specific set of data and can be easily adapted to different languages. Predicting the Perceptive Judgment of Voices in a Telecom Context: Selection of Acoustic Parameters T. Ehrette 1 , N. Chateau 1 , Christophe d’Alessandro 2 , V. Maffiolo 1 ; 1 France Télécom R&D, France; 2 LIMSI-CNRS, France Perception of vocal styles is of paramount importance in vocal server application as the global style of a telecom service is highly dependant on the voice used. In this work we develop tools for automatic inference of perceived vocal styles for a set of 100 vocal sequences. In a first stage, twenty subjective evaluation criteria have been identified by running perceptive experiments with naïve listeners. In a second stage, the vocal sequences have been parameterised using more than a hundred acoustic features representing prosody, spectral energy distribution, articulation and waveform. Then, regression analysis and neural networks are used for predicting the subjective score of each voice for each subjective criterion. The results show that the prediction error is generally low: it seems possible to predict automatically the perceived quality of the sequences. Moreover, the prediction error decreases when nonsignificant parameters are removed. Improved Emotion Recognition with Large Set of Statistical Features Vladimir Hozjan, Zdravko Kačič; University of Maribor, Slovenia This paper presents and discusses the speaker dependent emotion recognition with large set of statistical features. The speaker dependent emotion recognition gains in present the best accuracy performance. Recognition was performed on English, Slovenian, Spanish, and French InterFace emotional speech databases. All databases include 9 speakers. The InterFace databases include neutral speaking style and six emotions: disgust, surprise, joy, fear, anger and sadness. Speech features for emotion recognition were determined in two steps. In the first step, acoustical features were defined and in the second statistical features were calculated from acoustical features. Acoustical features are composed from pitch, derivative of pitch, energy, derivative of energy, duration of speech segments, Stress-Based Speech Segmentation Revisited Sven L. Mattys; University of Bristol, U.K. 
Although word stress is usually seen as a powerful speech- 5 Eurospeech 2003 Monday jitter, and shimmer. Statistical features are statistical presentations of acoustical features. In previous study feature vector was composed from 26 elements. In this study the feature vector was composed from 144 elements. The new feature set was called large set of statistical features. Emotion recognition was performed using artificial neural networks. Significant improvement was achieved for all speakers except for Slovenian male and second English male speaker were the improvement was about 2%. Large set of statistical features improve the accuracy of recognised emotion in average for about 18%. September 1-4, 2003 – Geneva, Switzerland In this study, we introduced a new model of how a human understands speech in real time and performed a cognitive experiment to investigate the unit for processing and understanding speech. In the model, first humans segment the acoustical signal into some acoustical units, and then the mental lexicon is accessed and searched for the segmented units. For this segmentation, we believe that prosody information must be used. In order to investigate how humans segment acoustical speech using only prosody, we performed an experiment in which participants listened to a pair of segmented speech materials, where each material was divided from the same speech material where the two segmentation positions differed from each other, and judged which material sounded more natural. On the basis of the results of this experiment, it is suggested that humans tend to segment speech based on the accent rules of Japanese, and that the introduced model is supported. Recognition of Intonation Patterns in Thai Utterance Patavee Charnvivit, Nuttakorn Thubthong, Ekkarit Maneenoi, Sudaporn Luksaneeyanawin, Somchai Jitapunkul; Chulalongkorn University, Thailand Language-Reconfigurable Universal Phone Recognition Thai intonation can be categorized as paralinguistic information of F0 contour of the utterance. There are three classes of intonation pattern in Thai, the Fall Class, the Rise Class, and the Convolution Class. This paper presents a method of intonation pattern recognition of Thai utterance. Two intonation feature contours, extracted from F0 contour, were proposed. The feature contours were converted to feature vector to use as input of neural network recognizer. The recognition results show that an average recognition rate is 63.4% for male speakers and 75.4% for female speakers. The recognizer can recognize the Fall Class from the others better than distinguish between the Rise Class and the Convolution Class. B.D. Walker, B.C. Lackey, J.S. Muller, P.J. Schone; U.S. Department of Defense, USA We illustrate the development of a universal phone recognizer for conversational telephone-quality speech. The acoustic models for this system were trained in a novel fashion and with a wide variety of language data, thus permitting it to recognize most of the world’s major phonemic categories. Moreover, with push-button ease, this recognizer can automatically reconfigure itself to apply the strongest language model in its inventory to whatever language it is used on. In this paper, we not only describe this system, but we also provide performance measurements for it using extensive testing material both from languages in its training set as well as from a language it has never seen. 
Surprisingly, the recognizer produces near-equivalent performance between the two types of data thus showing its true universality. This recognizer presents a viable solution for processing conversational, telephone-quality speech in any language – even in low-density languages. Use of Linguistic Information for Automatic Extraction of F0 Contour Generation Process Model Parameters Keikichi Hirose, Yusuke Furuyama, Shuichi Narusawa, Nobuaki Minematsu, Hiroya Fujisaki; University of Tokyo, Japan Emotion Recognition Using a Data-Driven Fuzzy Inference System A method was developed to utilize linguistic information (lexical accent types and syntactic boundaries) to improve the performance of the automatic extraction of the F0 contour generation process model commands. The extraction scheme is first to smooth the observed F0 contour by a piecewise 3rd order polynomial function and to locate accent command positions by taking the derivative of the function. If the results of automatic extraction differ from those estimated from the linguistic information, they are modified according to the several rules. The results showed that some errors could be corrected by the use of linguistic information, especially when the initial word of an accent phrase is type 0 (flat) accent. As a whole, the correct extraction rate (recall rate) was increased from 79.8% to 82.3% for phrase commands and from 81.6% to 85.9% for accent commands. Chul Min Lee, Shrikanth Narayanan; University of Southern California, USA The need and importance of automatically recognizing emotions from human speech has grown with the increasing role of humancomputer interaction applications. This paper explores the detection of domain-specific emotions using a fuzzy inference system to detect two emotion categories, negative and nonnegative emotions. The input features are a combination of segmental and suprasegmental acoustic information; feature sets are selected from a 21dimensional feature set and applied to the fuzzy classifier. Our fuzzy inference system is designed through a data-driven approach. The design of the fuzzy inference system has two phases: one for initialization for which fuzzy c-means method is used, and the other is fine-tuning of parameters of the fuzzy model. For fine-tuning, a well known neuro-fuzzy method are used. Results from on spoken dialog data from a call center application show that the optimized FIS with two rules (FIS-2) improves emotion classification by 63.0% for male data and 73.7% for female over previous results obtained using linear discriminant classifier. Potential Audiovisual Correlates of Contrastive Focus in French Marion Dohen, Hélène Lœvenbruck, Marie-Agnès Cathiard, Jean-Luc Schwartz; ICP-CNRS, France The long-term purpose of this study is to determine whether there are “visual” cues to prosody. An audiovisual corpus was recorded from a male native French speaker. The sentences had a subjectverb-object (SVO) syntactic structure. Four conditions were studied: focus on each phrase (S,V,O) and no focus. Normal and reiterant modes were recorded. We first measured F0, duration and intensity to validate the corpus. The pitch maximum over the utterance was generally on a focused syllable and duration and intensity were higher for the focused syllables. Then lip aperture and jaw opening were extracted from the video. The jaw opening maximum generally fell on one of the focused syllables, but peak velocity was more consistently correlated with focus. 
Moreover, lip closure duration was longer for the first segment of the focused phrase. We can therefore assume that there are visual aspects in prosody that may be used in communication. Effects of Voice Prosody by Computers on Human Behaviors Noriko Suzuki 1 , Yohei Yabuta 2 , Yugo Takeuchi 2 , Yasuhiro Katagiri 1 ; 1 ATR-MIS, Japan; 2 Shizuoka University, Japan This paper examines whether a human is aware of slight prosodic differences in a computer voice and changes his/her behaviors accordingly through interaction, when the prosodic difference carries informational significance. We conduct a route selection experiment, in which subjects were asked to find a route in a computer generated 3-D maze. The maze system occasionally provides a confirmation in response to the subject’s choice of a route. The prosodic characteristics of confirmation utterances are made to marginally change according to whether the route selected is the right route for reaching the goal or a wrong route that ends up How does Human Segment the Speech by Prosody ? Toshie Hatano, Yasuo Horiuchi, Akira Ichikawa; Chiba University, Japan 6 Eurospeech 2003 Monday in a cul de sac. In this experiment, subjects are able to pick up the difference and successfully navigate through the maze. This result demonstrates that subjects are sensitive to even a slight change in the voice’s prosodic characteristics and that computer voice prosody can affect the route selection behaviors of subjects. September 1-4, 2003 – Geneva, Switzerland one male and one female. The corpus was labeled on the syllabic level and analyzed using the Fujisaki model. Results show that the six tone types basically fall into two categories: Level, rising, curve and falling tone can be accurately modeled by using tone commands of positive or negative polarity. The so-called drop and broken tones, however, obviously require a special control causing creaky voice and in cases a very fast drop in F0 leading to temporary F0 halving or even quartering. In contrast to the drop tone, the broken tone exhibits an F0 rise and hence a positive tone command right after the creak occurs. Further observations suggest that drop and broken tone do not only differ from the other four tones with respect to their F0 characteristics, but also as to their much tenser articulation. A perception experiment performed with natural and resynthesized stimuli shows, inter alia, that tone 4 is most prone to confusion and that tone 6 obviously requires tense articulation as well as vocal fry to be identified reliably. An Investigation of Intensity Patterns for German Oliver Jokisch, Marco Kühne; Dresden University of Technology, Germany The perceived quality of synthetic speech strongly depends on its prosodic naturalness. Concerning the control of duration and fundamental frequency in a speech synthesis system, sophisticated models have been developed during the last decade. Speech intensity modeling is often considered as algorithmically and perceptually less important. Departing from a syllable-based, trainable prosody model the authors tested new factors of influence to improve the predicted intensity contour on phonemic level. Therefore, a German newsreader corpus has been analyzed with respect to typical intensity patterns. The f0-intensity interaction has the most significant influence and was perceptually evaluated by 32 listeners ranking 20 different stimuli. 
Japanese Prosodic Labeling Support System Utilizing Linguistic Information
Shinya Kiriyama, Yoshifumi Mitsuta, Yuta Hosokawa, Yoshikazu Hashimoto, Toshihiko Ito, Shigeyoshi Kitazawa; Shizuoka University, Japan
A prosodic labeling support system has been developed. Large-scale prosodic databases have been strongly desired for years; however, their construction depends on hand labeling because of the variety of prosody. We aim not at automating the whole labeling process, but at making the hand labeling work more efficient by providing the labelers with appropriate support information. Methods for auto-generating initial phoneme and prosodic labels utilizing linguistic information are proposed and evaluated. The experimental results showed that more than 70% of the prosodic labels were correctly generated, which proved the efficiency of the proposed methods. The results also yielded useful knowledge for supporting the labelers.

Segmental Durations Predicted with a Neural Network
João Paulo Teixeira 1, Diamantino Freitas 2; 1 Polytechnic Institute of Bragança, Portugal; 2 University of Porto, Portugal
This paper presents a segmental duration model applied to the European Portuguese language for TTS purposes. The model is based on a feed-forward neural network, trained with a back-propagation algorithm, and has as input a set of phonological and contextual features automatically extracted from the text. The relative importance of each feature, concerning the correlation with segmental durations and improvements in the performance of the model, is presented. Finally, the model is evaluated objectively and subjectively by a perceptual test.

Why and How to Control the Authentic Emotional Speech Corpora
Véronique Aubergé, Nicolas Audibert, Albert Rilliard; ICP-CNRS, France
Affects are expressed at different levels of speech: meta-linguistic (expressiveness) and linguistic (attitudes), both anchored in “linguistic time”, and para-linguistic (emotion expressions), anchored in the timing of the emotional causes. In an experimental approach, corpora are the basis of analysis. Most emotional corpora have been produced by acting/eliciting speakers on the one hand (with possibly strong control), while on the other hand they have been collected in “real life”. This paper proposes both a Wizard of Oz method and some tools (E-Wiz, with the Top Logic and Sound Teacher applications) in order to control the production of authentic data, separately for the three levels of affects.

Generation and Perception of F0 Markedness in Conversational Speech with Adverbs Expressing Degrees
Takumi Yamashita, Yoshinori Sagisaka; Waseda University, Japan
Aiming at natural F0 control for conversational speech synthesis, F0 characteristics are analyzed from both generation and perception viewpoints. By systematically designing conversational situations and utterances with adverb phrases expressing different degrees of markedness, their F0 characteristics are compared. The comparison shows consistent F0 control dependencies not only on the adverbs themselves but also on the attributes of neighboring adjective phrases. A strong positive/negative correlation is observed between the markedness of adverbs and F0 height when an adjective phrase with a positive/negative image follows the current adverb phrase. These consistencies have been perceptually confirmed by naturalness evaluation tests using the same two-phrase samples with different F0 heights. These results indicate the possibility of F0 control for natural conversational speech using lexical markedness information and adjacent word attributes.
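In its simplest form, the analysis described above reduces to correlating a markedness rating per adverb with the F0 height realised on it. The toy sketch below illustrates only that computation; all numbers are invented and the paper's actual measurement procedure is not reproduced.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length vectors."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    y = np.asarray(y, dtype=float) - np.mean(y)
    return float(np.dot(x, y) / np.sqrt(np.dot(x, x) * np.dot(y, y)))

# Hypothetical data: markedness rating of each adverb and the F0 peak (Hz)
# realised on it when followed by a positively-imaged adjective phrase.
markedness = [1, 2, 3, 4, 5, 2, 4, 3]
f0_peak_hz = [181, 190, 206, 222, 239, 194, 225, 208]
print(round(pearson_r(markedness, f0_peak_hz), 3))   # strongly positive for this toy data
```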
Prosodic Cues for Emotion Characterization in Real-Life Spoken Dialogs
Laurence Devillers 1, Ioana Vasilescu 2; 1 LIMSI-CNRS, France; 2 ENST-CNRS, France
This paper reports on an analysis of prosodic cues for emotion characterization in 100 natural spoken dialogs recorded at a telephone customer service center. The corpus was annotated with task-dependent emotion tags, which were validated by a perceptual test. Two F0 range parameters, one at the sentence level and the other at the subsegment level, emerge as the most salient cues for emotion classification. These parameters can differentiate between negative emotions (irritation/anger, anxiety/fear) and a neutral attitude, and they confirm trends illustrated by the perceptual experiment.

Quantitative Analysis and Synthesis of Syllabic Tones in Vietnamese
Hansjörg Mixdorff 1, Nguyen Hung Bach 2, Hiroya Fujisaki 3, Mai Chi Luong 2; 1 Berlin University of Applied Sciences, Germany; 2 National Centre for Science and Technology, Vietnam; 3 University of Tokyo, Japan
The current paper presents a preliminary study on the production and perception of syllabic tones of Vietnamese. A speech corpus consisting of fifty-two six-syllable sequences with various combinations of tones was uttered by two speakers of Standard Vietnamese, one male and one female. The corpus was labeled on the syllabic level and analyzed using the Fujisaki model. Results show that the six tone types basically fall into two categories: level, rising, curve and falling tones can be accurately modeled by using tone commands of positive or negative polarity. The so-called drop and broken tones, however, obviously require a special control causing creaky voice and, in some cases, a very fast drop in F0 leading to temporary F0 halving or even quartering. In contrast to the drop tone, the broken tone exhibits an F0 rise, and hence a positive tone command, right after the creak occurs. Further observations suggest that the drop and broken tones differ from the other four tones not only with respect to their F0 characteristics, but also in their much tenser articulation. A perception experiment performed with natural and resynthesized stimuli shows, inter alia, that tone 4 is most prone to confusion and that tone 6 obviously requires tense articulation as well as vocal fry to be identified reliably.

Session: PMoCf – Poster
Language Modeling, Discourse & Dialog
Time: Monday 13.30, Venue: Main Hall, Level -1
Chair: Peter Heeman, Oregon Graduate Int., USA

Disfluency Under Feedback and Time-Pressure
H.B.M. Nicholson 1, E.G. Bard 1, A.H. Anderson 2, M.L. Flecha-Garcia 1, D. Kenicer 2, L. Smallwood 2, J. Mullin 2, R.J. Lickley 3, Y. Chen 1; 1 University of Edinburgh, U.K.; 2 University of Glasgow, U.K.; 3 Queen Margaret University College, U.K.
Speakers engaging in dialogue with another conversationalist must create and execute plans with respect to the content of the utterance. An analysis of disfluencies from Map Task monologues shows that a speaker is influenced by the pressure to communicate with a distant listener. Speakers were also subject to time pressure, thereby increasing the cognitive burden of the overall task at hand. The duress of the speaker, as determined by disfluency rate, was examined across four conditions of variable feedback and timing. A surprising result was found that does not adhere to the predictions of the traditional views concerning collaboration in dialogue.

Towards the Automatic Generation of Mixed-Initiative Dialogue Systems from Web Content
Joseph Polifroni 1, Grace Chung 2, Stephanie Seneff 1; 1 Massachusetts Institute of Technology, USA; 2 Corporation for National Research Initiatives, USA
Through efforts over the past fifteen years, we have acquired a great deal of experience in designing spoken dialogue systems that provide access to large corpora of data in a variety of different knowledge domains, such as flights, hotels, restaurants, weather, etc.
In our recent research, we have begun to shift our focus towards developing tools that enable the rapid development of new applications. This paper addresses a novel approach that drives system design from the on-line knowledge resource. We were motivated by a desire to minimize the need for a pre-determined dialogue flow. In our approach, decisions on dialogue flow are made dynamically based on analyses of data, either prior to user interaction or during the dialogue itself. Automated methods, used to organize numeric and symbolic data, can be applied at every turn, as user constraints are being specified. This helps the user mine through large data sets to a few choices by allowing the system to synthesize intelligent summaries of the data, created on-the-fly at every turn. Moreover automatic methods are ultimately more robust against the frequent changes to on-line content. Simulations generating hundreds of dialogues have produced log files that allow us to assess and improve system behavior, including system responses and interactions with the dialogue flow. Together, these techniques are aimed towards the goal of instantiating new domains with little or no input from a human developer. Control in Task-Oriented Dialogues Peter A. Heeman, Fan Yang, Susan E. Strayer; Oregon Health & Science University, USA In this paper, we explore the mechanisms by which conversants control the direction of a dialogue. We find further evidence that control in task-oriented dialogues is subordinate to discourse structure. The initiator of a discourse segment has control; the non-initiator can contribute to the purpose of the segment, but this does not result in that person taking over control. The proposal has important implications for dialogue management, as it will pave the way for building dialogue systems that can engage in mixed initiative dialogues. The 300k LIMSI German Broadcast News Transcription System A Context Resolution Server for the Galaxy Conversational Systems Kevin McTait, Martine Adda-Decker; LIMSI-CNRS, France Edward Filisko, Stephanie Seneff; Massachusetts Institute of Technology, USA This paper describes improvements to the existing LIMSI German broadcast news transcription system, especially its extension from a 65k vocabulary to 300k words. Automatic speech recognition for German is more problematic than for a language such as English in that the inflectional morphology of German and its highly generative process of compounding lead to many more out of vocabulary words for a given vocabulary size. Experiments undertaken to tackle this problem and reduce the transcription error rate include bringing the language models up to date, improved pronunciation models, semi-automatically constructed pronunciation lexicons and increasing the size of the system’s vocabulary. The context resolution (CR) component of a conversational dialogue system is responsible for interpreting a user’s utterance in the context of previously spoken user utterances, spatial and temporal context, inference, and shared world knowledge. This paper describes a new and independent CR server for the GALAXY conversational system framework. Among the functionality provided by the CR server is the inheritance and masking of historical information, pragmatic verification, as well as reference and ellipsis resolution. The new server additionally features a process that attempts to reconstruct the intention of the user given a robust parse of an utterance. 
Design issues are described, followed by a description of each function in the context resolution process along with examples. The effectiveness of the CR server in various domains attests to its success as a module for context resolution.

Weighted Entropy Training for the Decision Tree Based Text-to-Phoneme Mapping
Jilei Tian 1, Janne Suontausta 1, Juha Häkkinen 2; 1 Nokia Research Center, Finland; 2 Nokia Mobile Phones, Finland
The pronunciation model providing the mapping from the written form of words to their pronunciations is called the text-to-phoneme (TTP) mapping. Such a mapping is commonly used in automatic speech recognition (ASR) as well as in text-to-speech (TTS) applications. Rule-based TTP mappings can be derived for structured languages, such as Finnish and Japanese. Data-driven TTP mappings are usually applied for non-structured languages such as English and Danish. Artificial neural network (ANN) and decision tree (DT) approaches are commonly applied in this task. Compared to the ANN methods, the DT methods usually provide more accurate pronunciation models. The DT methods can, however, lead to a set of models with a high memory footprint if the mappings between letters and phonemes are complex. In this paper, we present a weighted entropy training method for the DT based TTP mapping. Statistical information about the vocabulary is utilized in the training process in order to optimize the TTP performance for predefined memory requirements. The results obtained in the simulation experiments indicate that the memory requirements of the TTP models can be significantly reduced without degrading the mapping accuracy. The applicability of the approach is also verified in the speech recognition experiments.

Semantic and Dialogic Annotation for Automated Multilingual Customer Service
Hilda Hardy 1, Kirk Baker 2, Hélène Bonneau-Maynard 3, Laurence Devillers 3, Sophie Rosset 3, Tomek Strzalkowski 1; 1 University at Albany, USA; 2 Duke University, USA; 3 LIMSI-CNRS, France
One central goal of the AMITIÉS multilingual human-computer dialogue project is to create a dialogue management system capable of engaging the user in human-like conversation in a specific domain. To that end, we have developed new methods for the manual annotation of spoken dialogue transcriptions from European financial call centers. We have modified the DAMSL dialogic schema to create a dialogue act taxonomy appropriate for customer services. To capture the semantics, we use a domain-independent framework populated with domain-specific lists. We have designed a new flexible, platform-independent annotation tool, XDML Tool, and annotated several hundred dialogues in French and English. Inter-annotator agreement was moderate. We are using these data to design our dialogue system, and we hope that they will help us to derive appropriate dialogue strategies for novel situations.
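Inter-annotator agreement of the kind mentioned above is typically quantified with a chance-corrected measure such as Cohen's kappa. The sketch below is a generic illustration of that computation, not code from the AMITIÉS project; the dialogue-act label names are invented.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators on the same items."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (observed - expected) / (1.0 - expected)

# Hypothetical dialogue-act labels assigned by two annotators to six turns
ann1 = ["request", "inform", "inform", "confirm", "request", "inform"]
ann2 = ["request", "inform", "confirm", "confirm", "inform", "inform"]
print(round(cohen_kappa(ann1, ann2), 2))   # ~0.48, i.e. moderate agreement
```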
Word Class Modeling for Speech Recognition with Out-of-Task Words Using a Hierarchical Language Model Hierarchical Class N-Gram Language Models: Towards Better Estimation of Unseen Events in Speech Recognition Yoshihiko Ogawa 1 , Hirofumi Yamamoto 2 , Yoshinori Sagisaka 1 , Genichiro Kikui 2 ; 1 Waseda University, Japan; 2 ATR-SLT, Japan Imed Zitouni 1 , Olivier Siohan 2 , Chin-Hui Lee 3 ; 1 Lucent Technologies, USA; 2 IBM T.J. Watson Research Center, USA; 3 Georgia Institute of Technology, USA Out-of-vocabulary (OOV) problems are frequently seen when adapting a language model to another task where there are some observed word classes but few individual words, such as names, places and other proper nouns. Simple task adaptation cannot handle this problem properly. In this paper, for task dependent OOV words in the noun category, we adopt a hierarchical language model. In this modeling, the lower class model expressing word phonotactics does not require any additional task dependent corpora for training. It can be trained independent of the upper class model of conventional word class N-grams, as the proposed hierarchical model clearly separates Inter-word characteristics and Intra-word characteristics. This independent-layered training capability makes it possible to apply this model to general vocabularies and tasks in combination with conventional language model adaptation techniques. Speech recognition experiments showed a 19-point increase in word accuracy (from 54% to 73%) in the with-OOV sentences, and comparable accuracy (85%) in the without-OOV sentences, compared with a conventional adapted model. This improvement corresponds to the performance when all OOVs are ideally registered in a dictionary. In this paper, we show how a multi-level class hierarchy can be used to better estimate the likelihood of an unseen event. In classical backoff n-gram models, the (n-1)-gram model is used to estimate the probability of an unseen n-gram. In the approach we propose, we use a class hierarchy to define an appropriate context which is more general than the unseen n-gram but more specific than the (n-1)-gram. Each node in the hierarchy is a class containing all the words of the descendant nodes (classes). Hence, the closer a node is to the root, the more general the corresponding class is. We also investigate in this paper the impact of the hierarchy depth and the Turing’s discount coefficient on the performance of the model. We evaluate the backoff hierarchical n-gram models on WSJ database with two large vocabularies, 5, 000 and 20, 000 words. Experiments show up to 26% improvement on the unseen events perplexity and up to 12% improvement in the WER when a backoff hierarchical class trigram language model is used on an ASR test set with a relatively large number of unseen events. Compound Decomposition in Dutch Large Vocabulary Speech Recognition Incremental and Iterative Monolingual Clustering Algorithms Roeland Ordelman, Arjan van Hessen, Franciska de Jong; University of Twente, The Netherlands Sergio Barrachina, Juan Miguel Vilar; Universidad Jaume I, Spain This paper addresses compound splitting for Dutch in the context of broadcast news transcription. Language models were created using original text versions and text versions that were decomposed using a data-driven compound splitting algorithm. Language model performances were compared in terms of out-of- vocabulary rates and word error rates in a real-world broadcast news transcription task. 
It was concluded that compound splitting does improve ASR performance. Best results were obtained when frequent compounds were not decomposed.

To reduce the speech recognition error rate we can use better statistical language models. These models can be improved by grouping words into word equivalence classes. Clustering algorithms can be used to perform this word grouping automatically. We present an incremental clustering algorithm and two iterative clustering algorithms, and we compare them with previous algorithms. The experimental results show that the two iterative algorithms perform as well as previous ones. It should be pointed out that one of them, which uses the leaving-one-out technique, has the ability to automatically determine the optimum number of classes. These iterative algorithms are used by the incremental one. On the other hand, the proposed incremental algorithm achieves the best results of the compared algorithms; its behavior is the most regular under variation of the number of classes, and it can automatically determine the optimum number of classes.

Designing for Errors: Similarities and Differences of Disfluency Rates and Prosodic Characteristics Across Domains
Guergana Savova 1, Joan Bachenko 2; 1 Mayo Clinic, USA; 2 Linguistech Consortium, USA
This paper focuses on some characteristics of disfluencies in human-human (HHI) and human-computer (HCI) interaction corpora to outline similarities and differences. The main variables studied are disfluency rates and prosodic features. Structured, table-like input increases the disfluency rate in HCI and decreases it in HHI. Direct exposure (visibility) to the interface also increases the rate and gives speech a unique prosodic pattern of hyperarticulation. In most of the studied corpora, silences at the disfluency site are not predicted by syntactic rules. Similarities between HCI and HHI exist mainly in the prosodic realizations of the reparandum and the repair. The findings contribute to better understanding and modeling of disfluencies. Speech-based interfaces need to focus on communication types that are well understood and prone to good modeling.

Techniques for Effective Vocabulary Selection
Anand Venkataraman, Wen Wang; SRI International, USA
The vocabulary of a continuous speech recognition (CSR) system is a significant factor in determining its performance. In this paper, we present three principled approaches to selecting the target vocabulary for a particular domain by trading off between the target out-of-vocabulary (OOV) rate and the vocabulary size. We evaluate these approaches against an ad-hoc baseline strategy. Results are presented in the form of OOV rate graphs plotted against increasing vocabulary size for each technique.

Syllable Classification Using Articulatory-Acoustic Features
Mirjam Wester; University of Edinburgh, U.K.
This paper investigates the use of articulatory-acoustic features for the classification of syllables in TIMIT. The main motivation for this study is to circumvent the “beads-on-a-string” problem, i.e. the assumption that words can be described as a simple concatenation of phones. Posterior probabilities for articulatory-acoustic features are obtained from artificial neural nets and are used to classify speech within the scope of syllables instead of phones. This gives the opportunity to account for asynchronous feature changes, exploiting the strengths of the articulatory-acoustic features, instead of losing the potential by reverting to phones.

Recognition of Out-of-Vocabulary Words with Sub-Lexical Language Models
Lucian Galescu; Institute for Human and Machine Cognition, USA
A major source of recognition errors, out-of-vocabulary (OOV) words are also semantically important; recognizing them is, therefore, crucial for understanding. Success, so far, has been modest, even on very constrained tasks.
In this paper we present a new approach to unlimited vocabulary speech recognition based on using grapheme-to-phoneme correspondences for sub-lexical modeling of OOV words, and also some very encouraging results we obtained with our approach on a large vocabulary speech recognition task. September 1-4, 2003 – Geneva, Switzerland Session: PMoCg– Poster Speech Synthesis: Unit Selection I Time: Monday 13.30, Venue: Main Hall, Level -1 Chair: Beat Pfister, TIK, ETHZ, Zurich, Switzerland Unit Selection Based on Voice Recognition Yi Zhou 1 , Yiqing Zu 2 ; 1 Shanghai Jiaotong University, China; 2 Motorola China Research Center, China A Semantic Representation for Spoken Dialogs Hélène Bonneau-Maynard, Sophie Rosset; LIMSI-CNRS, France This paper describes a semantic annotation scheme for spoken dialog corpora. Manual semantic annotation of large corpora is tedious, expensive, and subject to inconsistencies. Consistency is a necessity to increase the usefulness of corpus for developing and evaluating spoken understanding models and for linguistics studies. A semantic representation, which is based on a concept dictionary definition, has been formalized and is described. Each utterance is divided into semantic segments and each segment is assigned with a 5-tuplets containing a mode, the underlying concept, the normalized form of the concept, the list of related segments, and an optional comment about the annotation. Based on this scheme, a tool was developed which ensures that the provided annotations respect the semantic representation. The tool includes interfaces for both the formal definition of the hierarchical concept dictionary and the annotation process. An experiment was conducted to assess inter-annotator agreement using both a human-human dialog corpus and a human-machine dialog corpus. For human-human dialogs, the agreement rate, computed on the triplets (mode, concept, value) is 61%, and the agreement rate on the concepts alone is 74%. For the human-machine dialogs, the percentage of agreement on the triplet is 83% and the correct concept identification rate is 93%. A Corpus-Based Decompounding Algorithm for German Lexical Modeling in LVCSR Martine Adda-Decker; LIMSI-CNRS, France In this paper a corpus-based decompounding algorithm is described and applied for German LVCSR. The decompounding algorithm contributes to address two major problems for LVCSR: lexical coverage and letter-to-sound conversion. The idea of the algorithm is simple: given a word start of length k only few different characters can continue an admissible word in the language. But concerning compounds, if word start k reaches a constituent word boundary, the set of successor characters can theoretically include any character. The algorithm has been applied to a 300M word corpus with 2.6M distinct words. 800k decomposition rules have been extracted automatically. OOV (out of vocabulary) word reductions of 25% to 50% relative have been achieved using word lists from 65k to 600k words. Pronunciation dictionaries have been developed for the LIMSI 300k German recognition system. As no language specific knowledge is required beyond the text corpus, the algorithm can apply more generally to any compounding language. 
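The successor-character idea behind the decompounding algorithm lends itself to a compact illustration: record, for every word prefix in a corpus word list, which characters can follow it, and hypothesise a split where that variety is large and both halves are themselves frequent words. The sketch below is a deliberately simplified stand-in for the LIMSI algorithm; the thresholds, the toy word list, and the scoring rule are all assumptions.

```python
from collections import Counter, defaultdict

def successor_variety(lexicon):
    """Map each prefix to the set of characters that can follow it in the lexicon."""
    successors = defaultdict(set)
    for word in lexicon:
        for k in range(1, len(word)):
            successors[word[:k]].add(word[k])
    return successors

def split_compound(word, lexicon, counts, min_variety=5):
    """Propose a single split point where successor variety spikes and both
    halves are themselves attested corpus words (very simplified heuristic)."""
    successors = successor_variety(lexicon)
    candidates = []
    for k in range(2, len(word) - 1):
        head, tail = word[:k], word[k:]
        if len(successors.get(head, ())) >= min_variety and head in counts and tail in counts:
            candidates.append((counts[head] + counts[tail], k))
    if not candidates:
        return [word]
    _, k = max(candidates)
    return [word[:k], word[k:]]

# Toy German-like corpus; a real system would use millions of word types and a
# higher variety threshold.
counts = Counter({"auto": 50, "bahn": 40, "haus": 30, "tür": 25, "hof": 15,
                  "bahnhof": 10, "autobahn": 5, "haustür": 3})
lexicon = set(counts)
print(split_compound("haustür", lexicon, counts, min_variety=1))   # ['haus', 'tür']
print(split_compound("auto", lexicon, counts, min_variety=1))      # ['auto']
```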
Modeling Cross-Morpheme Pronunciation Variations for Korean Large Vocabulary Continuous Speech Recognition Kyong-Nim Lee, Minhwa Chung; Sogang University, Korea In this paper, we describe a cross-morpheme pronunciation variation model which is especially useful for constructing morphemebased pronunciation lexicon for Korean LVCSR. There are a lot of pronunciation variations occurring at morpheme boundaries in continuous speech. Since phonemic context together with morphological category and morpheme boundary information affect Korean pronunciation variations, we have distinguished pronunciation variation rules according to the locations such as within a morpheme, across a morpheme boundary in a compound noun, across a morpheme boundary in an eojeol, and across an eojeol boundary. In 33K-morpheme Korean CSR experiment, an absolute improvement of 1.16% in WER from the baseline performance of 23.17% WER is achieved by modeling cross-morpheme pronunciation variations with a context-dependent multiple pronunciation lexicon. In this paper, we describe a perceptual voice recognition method to improve the naturalness of synthesized speech for Mandarin Chinese text-to-speech (TTS) baseline system. As a large TTS speech corpus, speech data always has different acoustic properties for different data recording conditions. Speech data recorded under different conditions can finally influence the naturalness of synthesized speech. Concerning this fact, we separate the speech data in a TTS corpus into several different voice classes based on an iterative voice recognition method, which is something like speaker recognition. Among each class, speech units will be considered to have the same voice characteristics. Based on the voice recognition result, a novel unit selection algorithm is performed to select better units to synthesize a more natural-sounding speech. Primary experiment shows the possibility and validity of the method. On Unit Analysis for Cantonese Corpus-Based TTS Jun Xu, Thomas Choy, Minghui Dong, Cuntai Guan, Haizhou Li; InfoTalk Technology, Singapore This paper reports a study of unit analysis for concatenative TTS, which usually has an inventory of hundreds of thousand of voice units. It is known that the quality of synthesis units is especially critical to the quality of resulting corpus-based TTS system. This research focuses on the analysis of a Chinese Cantonese unit inventory, which has been built earlier for open vocabulary Chinese Cantonese TTS tasks. The analysis results show that the exercise helps identify the sources of pronunciation deficiency and suggests ways of improvement to address quality issues. After taking remedy measures, subjective tests on improved system are carried out to validate the exercise. The test results are encouraging. Unit Selection in Concatenative TTS Synthesis Systems Based on Mel Filter Bank Amplitudes and Phonetic Context T. Lambert 1 , Andrew P. Breen 2 , Barry Eggleton 2 , Stephen J. Cox 1 , Ben P. Milner 1 ; 1 University of East Anglia, U.K.; 2 Nuance Communications, U.K. In concatenative text-to-speech (TTS) synthesis systems unit selection aims to reduce the number of concatenation points in the synthesized speech and make concatenation joins as smooth as possible. This research considers synthesis of completely new utterances from non-uniform units, whereby the most appropriate units, according to acoustic and phonetic criteria, are selected from a myriad of similar speech database candidates. 
A Viterbi-style algorithm dynamically selects the most suitable database units from a large speech database by considering concatenation and target costs. Concatenation costs are derived from mel filter bank amplitudes, whereas target costs are considered in terms of the phonemic and phonetic properties of required units. Within subjects and between subjects ANOVA [9] evaluation of listeners’ scores showed that the TTS system with this method of unit selection was preferred in 52% of test sentences. Text Design for TTS Speech Corpus Building Using a Modified Greedy Selection Baris Bozkurt 1 , Ozlem Ozturk 2 , Thierry Dutoit 3 ; 1 Multitel, Belgium; 2 Middle East Technical University, Turkey; 3 Faculté Polytechnique de Mons, Belgium Speech corpora design is one of the key issues in building high quality text to speech synthesis systems. Often read speech is used since it seems to be the easiest way to obtain a recorded speech corpus 10 Eurospeech 2003 Monday with highest control of the content. The main topic of this study is designing text for recording read speech corpora for concatenative text to speech systems. We will discuss application of the greedy algorithm for text selection by proposing a new way of implementing it and comparing with the standard implementation. Additionally, a text corpus design for Turkish TTS is presented. Discriminative Weight Training for Unit-Selection Based Speech Synthesis Seung Seop Park, Chong Kyu Kim, Nam Soo Kim; Seoul National University, Korea Concatenative speech synthesis by selecting units from large database has become popular due to its high quality in synthesized speech. The units are selected by minimizing the combination of target and join costs for a given sentence. In this paper, we propose a new approach to train the weight parameters associated with the cost functions used for unit selection in concatenative speech synthesis. We first view the unit selection as a classification problem, and apply the discriminative training technique which is found an efficient way to parameter estimation in speech recognition. Instead of defining an objective function which accounts for the subjective speech quality, we take the classification error as the objective function to be optimized. The classification error is approximated by a smooth function and the relevant parameters are updated by means of the gradient descent technique. The Application of Interactive Speech Unit Selection in TTS Systems Peter Rutten, Justin Fackrell; Rhetorical Systems Ltd., U.K. Speech unit selection algorithms have the task to find a single sequence of speech units that optimally fit the target transcription of an utterance that must be synthesized. In doing so, these algorithms ignore a very large number of possible alternative unit sequences that lead to alternative renderings of that utterance. In this paper we set out to explore these alternative unit sequences by introducing interactive unit selection. Interactive unit selection is based on feedback of a listener. To collect this feedback we implement two levels of control: an elaborate GUI, and a simple XML tag mechanism. The GUI offers access to unit selection with a granularity of a single speech unit, and allows a user to set prosodic constraints for the selection of alternative speech units. The XML tag mechanism operates on words, and allows the user to request an nth-best alternative selection. 
Results show that interactive unit selection succeeds in correcting most of the synthesis problems that occur in our default synthesis system, providing very detailed information that can be used to improve our run-time algorithms. This work not only provides a powerful research tool, it also leads to a number of commercial applications. The GUI can be used efficiently to improve speech synthesis off-line – to the extent that it eliminates the need to make special recordings for domain specific applications. The XML tag, on the other hand, can be used to quickly optimize the output of the system. On the Design of Cost Functions for Unit-Selection Speech Synthesis Francisco Campillo Díaz, Eduardo R. Banga; Universidad de Vigo, Spain The quality of the synthetic speech provided by concatenative speech systems depends heavily on the capability of accurately modeling the different characteristics of speech segments. Moreover, the relative significance or weighting of each feature in the unit selection process is a key point in the relationship between synthetic speech and human perception. In this paper we propose a new method for optimizing these weights, making a separate training according to the nature of the different parts of the cost function, i.e., the features referred to the phonetic context of the units and the features related to their prosodic characteristics. This work is mainly focused on the target cost function. September 1-4, 2003 – Geneva, Switzerland Kalman-Filter Based Join Cost for Unit-Selection Speech Synthesis Jithendra Vepa, Simon King; University of Edinburgh, U.K. We introduce a new method for computing join cost in unitselection speech synthesis which uses a linear dynamical model (also known as a Kalman filter) to model line spectral frequency trajectories. The model uses an underlying subspace in which it makes smooth, continuous trajectories. This subspace can be seen as an analogy for underlying articulator movement. Once trained, the model can be used to measure how well concatenated speech segments join together. The objective join cost is based on the error between model predictions and actual observations. We report correlations between this measure and mean listener scores obtained from a perceptual listening experiment. Our experiments use a state-of-the art unit-selection text-to-speech system: rVoice from Rhetorical Systems Ltd. Optimizing Integrated Cost Function for Segment Selection in Concatenative Speech Synthesis Based on Perceptual Evaluations Tomoki Toda, Hisashi Kawai, Minoru Tsuzaki; ATR-SLT, Japan This paper describes optimizing a cost function for segment selection in concatenative Text-to-Speech based on perceptual characteristics. We use the norm of a local cost for each segment as an integrated cost function for a segment sequence to consider both the degradation of naturalness over the entire synthetic speech and the local degradation. The cost function is optimized by adjusting not only the power coefficient of the norm but also weights for sub-costs so that the integrated cost corresponds better to perceptual scores determined by perceptual experiments. As a result, it is clarified that the correspondence of the cost can be improved to a greater degree by optimizing both the weights and the power coefficient than by optimizing either the weights or the power coefficient. However, it is also clarified that the correspondence is insufficient after optimizing the integrated cost function. 
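The integrated cost described above can be written as a p-norm of weighted local costs, with the sub-cost weights and the power coefficient as the quantities to be tuned against perceptual scores. A minimal sketch follows; the sub-cost values, weights and exponents are invented for illustration and do not reproduce the authors' optimization.

```python
import numpy as np

def local_cost(sub_costs, weights):
    """Weighted sum of sub-costs (e.g. target and concatenation terms) for one segment."""
    return float(np.dot(weights, sub_costs))

def integrated_cost(segment_sub_costs, weights, p=2.0):
    """Power mean over the local costs of a candidate segment sequence.

    p = 1 averages the degradation over the utterance, while larger p
    penalises the single worst segment or join more heavily.
    """
    locals_ = np.array([local_cost(c, weights) for c in segment_sub_costs])
    return float(np.mean(locals_ ** p) ** (1.0 / p))

# Toy comparison: uniform moderate degradation vs. one very bad join
weights = np.array([0.5, 0.3, 0.2])
evenly_poor = [[0.4, 0.4, 0.4]] * 4
one_bad_join = [[0.1, 0.1, 0.1]] * 3 + [[1.0, 1.0, 1.0]]
for p in (1.0, 4.0):
    print(p, integrated_cost(evenly_poor, weights, p), integrated_cost(one_bad_join, weights, p))
# With p = 1 the sequence with one bad join scores better; with p = 4 it scores worse,
# which is exactly the local-vs-overall trade-off the power coefficient controls.
```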
Automatic Segmentation for Czech Concatenative Speech Synthesis Using Statistical Approach with Boundary-Specific Correction Jindřich Matoušek, Daniel Tihelka, Josef Psutka; University of West Bohemia in Pilsen, Czech Republic This paper deals with the problems of automatic segmentation for the purposes of Czech concatenative speech synthesis. Statistical approach to speech segmentation using hidden Markov models (HMMs) is applied in the baseline system. Several improvements of this system are then proposed to get more accurate segmentation results. These enhancements mainly concern the various strategies of HMM initialization (flat-start initialization, hand-labeled or speaker independent HMM bootstrapping). Since HTK, the hidden Markov model toolkit, was utilized in our work, a correction of the output boundary placements is proposed to reflect speech parameterization mechanism. An objective comparison of various automatic methods and manual segmentation is performed to find out the best method. The best results were obtained for boundaryspecific statistical correction of the segmentation that resulted from bootstrapping with hand-labeled HMMs (96% segmentation accuracy in tolerance region 20 ms). Automatic Speech Segmentation and Verification for Concatenative Synthesis Chih-Chung Kuo, Chi-Shiang Kuo, Jau-Hung Chen, Sen-Chia Chang; Industrial Technology Research Institute, Taiwan This paper presents an automatic speech segmentation method based on HMM alignment and a categorized multiple-expert fine adjustment. The accuracy of syllable boundaries is significantly improved (72.8% and 51.9% for starting and ending boundaries of syllables, respectively) after the fine adjustment. Moreover, a novel phonetic verification method for checking inconsistency between text script and recorded speech are also proposed. Design and 11 Eurospeech 2003 Monday performance of confidence measures for both segmentation and verification are described, which manifests the automatic detection of problematic speech segments can be achieved. These methods together largely reduce human labor in construction of our new corpus-based TTS system. DTW-Based Phonetic Alignment Using Multiple Acoustic Features Sérgio Paulo, Luís C. Oliveira; INESC-ID/IST, Portugal This paper presents the results of our effort in improving the accuracy of a DTW-based automatic phonetic aligner. The adopted model assumes that the phonetic segment sequence is already known and so the goal is only to align the spoken utterance with a reference synthetic signal produced by waveform concatenation without prosodic modifications. Instead of using a single acoustic measure to compute the alignment cost function, our strategy uses a combination of acoustic features depending on the pair of phonetic segment classes being aligned. The results show that this strategy considerably reduces the segment boundary location errors, even when aligning synthetic and natural speech signals of different gender speakers. Evaluating and Correcting Phoneme Segmentation for Unit Selection Synthesis September 1-4, 2003 – Geneva, Switzerland a female speaker with frequent mid phrase rises. Speaker 3 was a male speaker with a similar f0 range to speaker 1 and with a measured prosodic style suitable for news and financial text. We apply the models created for speaker 2 (an inappropriate model) and speaker 3 (an appropriate model) to speaker 1 and compare the results. 
Three passages (of three to four sentences in length) from challenging prosodic genres (news report, poetry and personal email) were synthesised using the target speaker and each of the three models. The synthesised utterances were played to 15 native english subjects and rated using a 5 point MOS scale. In addition, 7 experienced speech engineers rated each word for errors on a three point scale: 1. Acceptable, 2. Poor, 3. Unacceptable. The results suggest that a large model from an appropriate speaker does not sound more natural or produce fewer errors than a smaller model generated from the individual speaker’s own data. In addition it shows that an inappropriate model does produce both less natural and more errors in the speech. High variance in both subject and materials analysis suggest both tests are far from ideal and that evaluation techniques for both error rate and naturalness need to improve. Learning Phrase Break Detection in Thai Text-to-Speech Virongrong Tesprasit 1 , Paisarn Charoenpornsawat 1 , Virach Sornlertlamvanich 2 ; 1 NECTEC, Thailand; 2 CRL Asia Research Center, Thailand John Kominek, Christina L. Bennett, Alan W. Black; Carnegie Mellon University, USA As part of improved support for building unit selection voices, the Festival speech synthesis system now includes two algorithms for automatic labeling of wavefile data. The two methods are based on dynamic time warping and HMM-based acoustic modeling. Our experiments show that DTW is more accurate 70% of the time, but is also more prone to gross labeling errors. HMM modeling exhibits a systematic bias of 15 ms. Combining both methods directs human labelers towards data most likely to be problematic. Control and Prediction of the Impact of Pitch Modification on Synthetic Speech Quality Esther Klabbers, Jan P.H. van Santen; Oregon Health & Science University, USA In order to use speech synthesis to generate highly expressive speech convincingly, the problem of poor prosody (both prediction and generation) needs to be overcome. In this paper we will show that with a simple annotation scheme using the notion of foot structure, we can more accurately predict the shape of local pitch contours. The assumption is that with a better selection mechanism we can reduce the amount of pitch modification required, thereby reducing speech degradation. In addition, we present a perceptual experiment that investigates the degradation introduced by pitch modification using the OGIresLPC algorithm. We correlated the weighted perceptual score with different pitch and delta pitch distances. The best combination of distance measures is able to explain 63% of the variance in the perceptual scores. Decreasing the pitch is shown to have a higher impact on perception than increasing the pitch. My Voice, Your Prosody: Sharing a Speaker Specific Prosody Model Across Speakers in Unit Selection TTS Matthew Aylett, Justin Fackrell, Peter Rutten; Rhetorical Systems Ltd., U.K. Data sparsity is a major problem for data driven prosodic models. Being able to share prosodic data across speakers is a potential solution to this problem. This paper explores this potential solution by addressing two questions: 1) Does a larger less sparse model from a different speaker produce more natural speech than a small sparse model built from the original speaker? 2)Does a different speaker’s larger model generate more unit selection errors than a small sparse model built from the original speaker? 
A unit selection approach is used to produce a lazy learning model of three English RP speaker’s f0 and durational parameters. Speaker 1 (the target speaker) had a much smaller database (approximately one quarter to one fifth the size) of the other two. Speaker 2 was One of the crucial problems in developing high quality Thai text-tospeech synthesis is to detect phrase break from Thai texts. Unlike English, Thai has no word boundary delimiter and no punctuation mark at the end of a sentence. It makes the problem more serious. Because when we detect phrase break incorrectly, it is not only producing unnatural speech but also creating the wrong meaning. In this paper, we apply machine learning algorithms namely C4.5 and RIPPER in detecting phrase break. These algorithms can learn useful features for locating a phrase break position. The features which are investigated in our experiments are collocations in different window sizes and the number of syllables before and after a word in question to a phrase break position. We compare the results from C4.5 and RIPPER with a based-line method (Part-of-Speech sequence model). The experiment shows that C4.5 and RIPPER appear to outperform the based-line method and RIPPER performs better accuracy results than C4.5. A Speech Model of Acoustic Inventories Based on Asynchronous Interpolation Alexander B. Kain, Jan P.H. van Santen; Oregon Health & Science University, USA We propose a speech model that describes acoustic inventories of concatenative synthesizers. The model has the following characteristics: (i) very compact representations and thus high compression ratios are possible, (ii) re-synthesized speech is free of concatenation errors, (iii) the degree of articulation can be controlled explicitly, and (iv) voice transformation is feasible with relatively few additional recordings of a target speaker. The model represents a speech unit as a synthesis of several types of features, each of which has been computed using non-linear, asynchronous interpolation of neighboring basis vectors associated with known phonemic identities. During analysis, basis vectors and transition weights are estimated under a strict diphone assumption using a dynamic time warping approach. During synthesis, the estimated transition weight values are modified to produce changes in duration and articulation effort. Corpus-Based Synthesis of Fundamental Frequency Contours of Japanese Using Automatically-Generated Prosodic Corpus and Generation Process Model Keikichi Hirose, Takayuki Ono, Nobuaki Minematsu; University of Tokyo, Japan We have been developing corpus-based synthesis of fundamental frequency (F0 ) contours for Japanese. Since, in our method, the synthesis is done under the constraint of F0 contour generation 12 Eurospeech 2003 Monday process model, a rather good quality is still kept even if the prediction process is done poorly. Although it was already shown that the synthesized F0 contours sounded as highly natural as those using heuristic rules carefully arranged by experts, the F0 model parameters for the training corpus were extracted with some manual processes. In the current paper, the automatically extracted parameters are used, and a good result is obtained. Also several features are added as the inputs to the statistical method to obtain better results. Some results on the accent phrase boundary prediction in the similar corpus-based framework are also shown. 
Session: SMoDa– Oral Aurora Noise Robustness on LARGE Vocabulary Databases Time: Monday 16.00, Venue: Room 1 Chair: David Pierce, Motorola Lab., UK Analysis of the Aurora Large Vocabulary Evaluations N. Parihar, Joseph Picone; Mississippi State University, USA In this paper, we analyze the results of the recent Aurora large vocabulary evaluations. Two consortia submitted proposals on speech recognition front ends for this evaluation: (1) Qualcomm, ICSI, and OGI (QIO), and (2) Motorola, France Telecom, and Alcatel (MFA). These front ends used a variety of noise reduction techniques including discriminative transforms, feature normalization, voice activity detection, and blind equalization. Participants used a common speech recognition engine to postprocess their features. In this paper, we show that the results of this evaluation were not significantly impacted by suboptimal recognition system parameter settings. Without any front end specific tuning, the MFA front end outperforms the QIO front end by 9.6% relative. With tuning, the relative performance gap increases to 15.8%. Both the mismatched microphone and additive noise evaluation conditions resulted in a significant degradation in performance for both front ends. Evaluation of Quantile Based Histogram Equalization with Filter Combination on the Aurora 3 and 4 Databases Florian Hilger, Hermann Ney; RWTH Aachen, Germany The recognition performance of automatic speech recognition systems can be improved by reducing the mismatch between training and test data during feature extraction. The approach described in this paper is based on estimating the signal’s cumulative density functions on the filter bank using a small number of quantiles. A two-step transformation is then applied to reduce the difference between these quantiles and the ones estimated on the training data. The first step is a power function transformation applied to each individual filter channel, followed by a linear combination of neighboring filters. On the Aurora 4 16kHz database the average word error rates could be reduced from 60.8% to 37.6% (clean training) and from 38.0% to 31.5% (multi condition training). Large Vocabulary Noise Robustness on Aurora4 Evaluation of Model-Based Feature Enhancement on the AURORA-4 Task Veronique Stouten, Hugo Van hamme, Jacques Duchateau, Patrick Wambacq; Katholieke Universiteit Leuven, Belgium In this paper we focus on the challenging task of noise robustness for large vocabulary Continuous Speech Recognition (LVCSR) systems in non-stationary noise environments. We have extended our Model-Based Feature Enhancement (MBFE) algorithm – that we earlier successfully applied to small vocabulary CSR in the AURORA-2 framework – to cope with the new demands that are imposed by the large vocabulary size in the AURORA-4 task. To incorporate a priori knowledge of the background noise, we combine scalable Hidden Markov Models (HMMs) of the cepstral feature vectors of both clean speech and noise, using a Vector Taylor Series approximation in the power spectral domain. Then, a global MMSE-estimate of the clean speech is calculated based on this combined HMM. This technique is easily embeddable in the feature extraction module of a recogniser and is intrinsically suited for the removal of non-stationary additive noise. Our approach is validated on the AURORA-4 task, revealing a significant gain in noise robustness over the baseline. Improved Feature Extraction Based on Spectral Noise Reduction and Nonlinear Feature Normalization José C. 
Segura, Javier Ramírez, Carmen Benítez, Ángel de la Torre, Antonio J. Rubio; Universidad de Granada, Spain This paper is mainly focused on showing experimental results of a feature extraction algorithm that combines spectral noise reduction and nonlinear feature normalization. The successfulness of this approach has been shown in a previous work, and in this one, we present several improvements that result in a performance comparable to that of the recently approved AFE for DSR. Noise reduction is now based on a Wiener filter instead of spectral subtraction. The voice activity detection based on the full-band energy has been replaced with a new one using spectral information. Relative improvements of 24.81% and 17.50% over our previous system are obtained for AURORA 2 and 3 respectively. Results for AURORA 2 are not as good as those for the AFE, but for AURORA 3 a relative improvement of 5.27% is obtained. Feature Compensation Technique for Robust Speech Recognition in Noisy Environments Young Joon Kim 1 , Hyun Woo Kim 2 , Woohyung Lim 1 , Nam Soo Kim 1 ; 1 Seoul National University, Korea; 2 Electronics and Telecommunications Research Institute, Korea In this paper, we analyze the problems of the existing interacting multiple model (IMM) and spectral subtraction (SS) approaches and propose a new approach to overcome the problems of these algorithms. Our approach combines the IMM and SS techniques based on a soft decision for speech presence. Results reported on AURORA2 database show that proposed approach shows 14.26% of average relative improvement compared to the IMM algorithm in the speech recognition experiments. Luca Rigazio, Patrick Nguyen, David Kryze, Jean-Claude Junqua; Panasonic Speech Technology Laboratory, USA This paper presents experiments of noise robust ASR on the Aurora4 database. The database is designed to test large vocabulary systems in presence of noise and channel distortions. A number of different model-based and signal-based noise robustness techniques have been tested. Results show that it is difficult to design a technique that is superior in every condition. Because of this we combined different techniques to improve results. Best results have been obtained when short time compensation / normalization methods are combined with long term environmental adaptation and robust acoustic models. The best average error rate obtained over the 52 conditions is 30.8%. This represents a 40% relative improvement compared to the baseline results [1]. September 1-4, 2003 – Geneva, Switzerland Session: SMoDb– Oral Multilingual Speech-to-Speech Translation Time: Monday 16.00, Venue: Room 2 Chair: Gianni Lazzari, Istituto Trentino di Cultura, Trento, Italy The Statistical Approach to Machine Translation and a Roadmap for Speech Translation Hermann Ney; RWTH Aachen, Germany During the last few years, the statistical approach has found widespread use in machine translation, in particular for spoken language. In many comparative evaluations of automatic speech translation, the statistical approach was found to be significantly supe- 13 Eurospeech 2003 Monday rior to the existing conventional approaches. The paper will present the main components of a statistical machine translation system (such as alignment and lexicon models, training procedure, generation of the target sentence) and summarize the progress made so far. We will conclude with a roadmap for future research on spoken language translation. Coupling vs. 
Unifying: Modeling Techniques for Speech-to-Speech Translation Yuqing Gao; IBM T.J. Watson Research Center, USA As a part of our effort to develop a unified computational framework for speech-to-speech translation, so that sub-optimizations or local optimizations can be avoided, we are developing direct models for speech recognition. In direct model, the focus is on the creation of one single integrated model p(text|acoustics), rather than a complex series of artifices, therefore various factors such as linguistics and language features, speaker or speaking rate differences, different acoustic conditions, can be applied to the joint optimization. In this paper we discuss how linguistic and semantic constraints are used in phoneme recognition. Speechalator: Two-Way Speech-to-Speech Translation on a Consumer PDA Alex Waibel 1 , Ahmed Badran 1 , Alan W. Black 1 , Robert Frederking 1 , Donna Gates 1 , Alon Lavie 1 , Lori Levin 1 , Kevin A. Lenzo 2 , Laura Mayfield Tomokiyo 2 , Jürgen Reichert 3 , Tanja Schultz 1 , Dorcas Wallace 1 , Monika Woszczyna 4 , Jing Zhang 3 ; 1 Carnegie Mellon University, USA; 2 Cepstral LLC, USA; 3 Mobile Technologies Inc., USA; 4 Multimodal Technologies Inc., USA September 1-4, 2003 – Geneva, Switzerland evaluation campaign, which will take place in 2003, will focus on written language translation by exploiting a large phrase-book parallel corpus covering several European and Asiatic languages. Creating Corpora for Speech-to-Speech Translation Genichiro Kikui, Eiichiro Sumita, Toshiyuki Takezawa, Seiichi Yamamoto; ATR-SLT, Japan This paper presents three approaches to creating corpora that we are working on for speech-to-speech translation in the travel conversation task. The first approach is to collect sentences that bilingual travel experts consider useful for people going-to/coming-from another country. The resulting English-Japanese aligned corpora are collectively called the basic travel expression corpus (BTEC), which is now being translated into several other languages. The second approach tries to expand this corpus by generating many “synonymous” expressions for each sentence. Although we can create large corpora by the above two approaches relatively cheaply, they may be different from utterances in actual conversation. Thus, as the third approach, we are collecting dialogue corpora by letting two people talk, each in his/her native language, through a speech-tospeech translation system. To concentrate on translation modules, we have replaced speech recognition modules with human typists. We will report some of the characteristics of these corpora as well. Session: OMoDc– Oral Prosody Time: Monday 16.00, Venue: Room 3 Chair: Eva Hajicova, Charles University in Prague, Czech Republic This paper describes a working two-way speech-to-speech translation system that runs in near real-time on a consumer handheld computer. It can translate from English to Arabic and Arabic to English in the domain of medical interviews. We describe the general architecture and frameworks within which we developed each of the components: HMM-based recognition, interlingua translation (both rule and statistically based), and unit selection synthesis. 
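Architecturally, such a system is a chain of three components with the interlingua as the pivot between the two languages. The sketch below only illustrates that decomposition; the class and method names are invented and do not correspond to the Speechalator implementation.

```python
from dataclasses import dataclass

@dataclass
class Utterance:
    language: str
    text: str

class Recognizer:
    def recognize(self, audio: bytes, language: str) -> Utterance:
        raise NotImplementedError

class InterlinguaTranslator:
    def parse(self, utt: Utterance) -> dict:            # source text -> interlingua frame
        raise NotImplementedError
    def generate(self, frame: dict, language: str) -> Utterance:
        raise NotImplementedError

class Synthesizer:
    def synthesize(self, utt: Utterance) -> bytes:
        raise NotImplementedError

def translate_turn(audio: bytes, src: str, tgt: str,
                   asr: Recognizer, mt: InterlinguaTranslator, tts: Synthesizer) -> bytes:
    """One direction of a two-way exchange: recognize, map through the
    interlingua, generate in the target language, and synthesize."""
    source = asr.recognize(audio, src)
    frame = mt.parse(source)
    target = mt.generate(frame, tgt)
    return tts.synthesize(target)
```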
Development of Phrase Translation Systems for Handheld Computers: From Concept to Field Horacio Franco 1 , Jing Zheng 1 , Kristin Precoda 1 , Federico Cesari 1 , Victor Abrash 1 , Dimitra Vergyri 1 , Anand Venkataraman 1 , Harry Bratt 1 , Colleen Richey 1 , Ace Sarich 2 ; 1 SRI International, USA; 2 Marine Acoustics, USA We describe the development and conceptual evolution of handheld spoken phrase translation systems, beginning with an initial unidirectional system for translation of English phrases, and later extending to a limited bidirectional phrase translation system between English and Pashto, a major language of Afghanistan. We review the challenges posed by such projects, such as the constraints imposed by the computational platform, to the limitations of the phrase translation approach when dealing with naïve respondents. We discuss our proposed solutions, in terms of architecture, algorithms, and software features, as well as some field experience by users of initial prototypes. Evaluation Frameworks for Speech Translation Technologies Marcello Federico; ITCirst, Italy This paper reports on activities carried out under the European project PF-STAR and within the CSTAR consortium, which aim at evaluating speech translation technologies. In PF-STAR, speech translation baselines developed by the partners and off-the-shelf commercial systems will be compared systematically on several language pairs and application scenarios. In CSTAR, evaluation campaigns will be organized, on a regular basis, to compare research baselines developed by the members of the consortium. The first Prosodic Analysis and Modeling of the NAGAUTA Singing to Synthesize its Prosodic Patterns from the Standard Notation Nobuaki Minematsu, Bungo Matsuoka, Keikichi Hirose; University of Tokyo, Japan NAGAUTA is a classical style of the Japanese singing. It has very original and unique prosodic patterns in its singing, where an abrupt and sharp change of F0 is always observed at a transition from a note to another. This F0 change is often found even where the transition is not accompanied by a change of tone. In this paper, we propose a model to synthesize this unique F0 pattern from the standard notation. Further, this paper shows an interesting phenomenon about power movements at the F0 changes. Acoustic analysis of NAGAUTA singing samples reveals that sharp increases of F0 and sharp decreases of power are observed synchronously. Although no discussion on physical mechanisms of this phenomenon is done in this paper, another model to generate this unique power pattern is also proposed. Evaluation experiments are done through listening and their results indicate high validity of the two proposed models. Statistical Evaluation of the Influence of Stress on Pitch Frequency and Phoneme Durations in Farsi Language D. Gharavian, S.M. Ahadi; Amirkabir University of Technology, Iran Stress is known to be an important prosodic feature of speech. The recognition of stressed speech has always been an important issue for speech researchers. On the other hand, providing a large corpus with the coverage of all different stressed conditions in a certain language is a difficult task. Farsi (Persian) has been no exception to this. In this research, our aim has been to evaluate the effect of stress on prosodic features of Farsi language, such as phoneme duration, pitch frequency and the pitch contour slope. These might be valuable in further research in speech recognition. 
As the main influence of stress is on vowels, the effect of stress on such parameters as duration, pitch frequency and pitch slope has been evaluated at the phoneme level for vowels.

Prosody Dependent Speech Recognition with Explicit Duration Modelling at Intonational Phrase Boundaries
K. Chen, S. Borys, Mark Hasegawa-Johnson, J. Cole; University of Illinois at Urbana-Champaign, USA
Does prosody help word recognition? In this paper, we propose a novel probabilistic framework in which word and phoneme are dependent on prosody in a way that improves word recognition. The prosodic attribute that we investigate in this study is the lengthening of speech segments in the vicinity of intonational phrase boundaries. An Explicit Duration Hidden Markov Model (EDHMM) is implemented to provide an accurate phoneme duration model. This study is conducted on the Boston University Radio News Corpus with prosodic boundaries marked using the ToBI labelling system. We found that lengthening of phrase-final rhymes can be reliably modelled by the EDHMM, which significantly improves the prosody-dependent acoustic modelling. Conversely, no systematic duration variation is found at phrase-initial position. With prosody dependence implemented in the acoustic model, pronunciation model and language model, both word recognition accuracy and boundary recognition accuracy are improved by 1% over systems without prosody dependence.

Prediction of Fujisaki Model’s Phrase Commands
João Paulo Teixeira 1, Diamantino Freitas 2, Hiroya Fujisaki 3; 1 Polytechnic Institute of Bragança, Portugal; 2 University of Porto, Portugal; 3 University of Tokyo, Japan
This paper presents a model to predict the phrase commands of the Fujisaki model of the F0 contour for the Portuguese language. The location of phrase commands in the text is governed by a set of weighted rules. The amplitude (Ap) and timing (T0) of the phrase commands are predicted by separate neural networks. The features for both neural networks are discussed. Finally, a comparison between target and predicted values is presented.
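For readers unfamiliar with the phrase commands mentioned in the Teixeira et al. abstract above, the following is a minimal numerical sketch of the standard Fujisaki phrase component whose amplitude Ap and onset time T0 their networks predict. It is not code from the paper; the value alpha = 3.0 /s and the example parameters are commonly assumed defaults used here purely for illustration.

    # Minimal sketch of one Fujisaki phrase command's contribution to ln F0.
    import numpy as np

    def phrase_component(t, Ap, T0, alpha=3.0):
        """Contribution of one phrase command to ln F0 at times t (seconds)."""
        tau = np.maximum(t - T0, 0.0)                      # response starts at T0
        return Ap * (alpha ** 2) * tau * np.exp(-alpha * tau)

    t = np.linspace(0.0, 3.0, 301)
    ln_f0 = np.log(120.0) + phrase_component(t, Ap=0.5, T0=0.2)   # 120 Hz base value
    print(np.exp(ln_f0[:5]))   # resulting F0 values in Hz near the phrase onset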
Corpus-Based Modeling of Naturalness Estimation in Timing Control for Non-Native Speech
Makiko Muto, Yoshinori Sagisaka, Takuro Naito, Daiju Maeki, Aki Kondo, Katsuhiko Shirai; Waseda University, Japan
In this paper, aiming at automatic estimation of the naturalness of timing control in non-native speech, we have analyzed the timing characteristics of non-native speech and correlated them with the corresponding subjective naturalness evaluation scores given by native speakers. Through statistical analyses of English speech data spoken by Japanese, with temporal naturalness scores ranging from one to five given by natives, we found a high correlation between these scores and the differences from native speech. These analyses provided a linear regression model in which naturalness of timing control is estimated from the differences from native speech in the durations of whole sentences, of individual content and function words, and of pauses. The estimation accuracy of the proposed naturalness evaluation model was tested on open data. The root mean square error of 0.64 between the scores predicted by the model and those given by the natives turned out to be comparable to the difference of 0.85 among the scores of the native listeners. The good correlation between model predictions and the natives’ judgments confirmed the appropriateness of the proposed model.

Perceptually-Related Acoustic-Prosodic Features of Phrase Finals in Spontaneous Speech
Carlos Toshinori Ishi, Parham Mokhtari, Nick Campbell; ATR-HIS, Japan
With the aim of automatically categorizing phrase-final tones, investigations are conducted on the relationship between acoustic-prosodic parameters and perceptual tone categories. Three types of acoustic parameters are proposed: one related to pitch movement within the phrase final, one related to pitch reset prior to the phrase final, and one related to the length of the phrase final. A classification tree is used to evaluate automatic categorization of phrase-final tone types, resulting in 76% correct classification for the best combination among the proposed acoustic parameters. Experiments are also conducted to verify the perceived degree of pitch change within a phrase final, and the perceived degree of pitch reset. While a good relationship is found between the perceptual scores and some of the acoustic parameters, our results also advocate a continuous rather than a categorical relationship between some of the phrase-final tone types considered.

Session: OMoDd – Oral
Language Modeling
Time: Monday 16.00, Venue: Room 4
Chair: Holger Schwenk, LIMSI-CNRS, France

Efficient Linear Combination for Distant n-Gram Models
David Langlois, Kamel Smaïli, Jean-Paul Haton; LORIA, France
The objective of this paper is to present a large study concerning the use of distant language models. In order to combine distant and classical models efficiently, an adaptation of the back-off principle is made. Also, we show the importance of each part of a history for the prediction. In fact, each sub-history is analyzed in order to estimate its importance in terms of prediction, and a weight is then associated with each class of sub-histories. Therefore, the combined models take into account the features of each part of the history and not the whole history, as is done in other works. The contribution of distant n-gram models in terms of perplexity is significant and improves the results by 12.8%. Making the linear combination depend on sub-histories achieves an improvement of 5.3% in comparison to classical linear combination.

Improving a Connectionist Based Syntactical Language Model
Ahmad Emami; Johns Hopkins University, USA
Using a connectionist model as one of the components of the Structured Language Model has led to significant improvements in perplexity and word error rate, mainly because of the connectionist model’s power in using longer contexts and its ability to combat the data sparseness problem. For its training, the SLM needs the syntactical parses of the word strings in the training data, provided by either humans or an external parser. In this paper we study the effect of training the connectionist based language model on the hidden parses hypothesized by the SLM itself. Since multiple partial parses are constructed for each word position, the model and the log-likelihood function take a form that necessitates a specific manner of training of the connectionist model. Experiments on the UPENN section of the Wall Street Journal corpus show significant improvements in perplexity.

Using Untranscribed User Utterances for Improving Language Models Based on Confidence Scoring
Mikio Nakano 1, Timothy J. Hazen 2; 1 NTT Corporation, Japan; 2 Massachusetts Institute of Technology, USA
This paper presents a method for reducing the effort of transcribing user utterances to develop language models for conversational speech recognition when a small number of transcribed and a large number of untranscribed utterances are available. The recognition hypotheses for the untranscribed utterances are classified according to their confidence scores, such that hypotheses with high confidence are used to enhance language model training. The utterances that receive low confidence can be scheduled to be manually transcribed first to improve the language model. The results of experiments using automatic transcription of the untranscribed user utterances show that the proposed methods are effective in achieving improvements in recognition accuracy while reducing the effort required for manual transcription.

Improved Chinese Broadcast News Transcription by Language Modeling with Temporally Consistent Training Corpora and Iterative Phrase Extraction
Pi-Chuan Chang, Shuo-Peng Liao, Lin-shan Lee; National Taiwan University, Taiwan
In this paper an iterative Chinese new-phrase extraction method based on intra-phrase association and context variation statistics is proposed. A Chinese language model enhancement framework including lexicon expansion is then developed. Extensive experiments for Chinese broadcast news transcription were then performed to explore the achievable improvements with respect to the degree of temporal consistency of the adaptation corpora. Very encouraging results were obtained, and a detailed analysis is discussed.

Language Model Adaptation Using Word Clustering
Shinsuke Mori, Masafumi Nishimura, Nobuyasu Itoh; IBM Japan Ltd., Japan
Building a stochastic language model (LM) for speech recognition requires a large corpus of the target task. For some tasks no sufficiently large corpus is available, and this is an obstacle to achieving high recognition accuracy. In this paper, we propose a method for building an LM with higher prediction power using large corpora from different tasks, rather than an LM estimated from a small corpus for the specific target task. In our experiment, we used transcriptions of air university lectures and articles from the Nikkei newspaper, and compared an existing interpolation-based method with our new method. The results show that our new method reduces perplexity by 9.71%.
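The interpolation-based baseline that the Mori et al. abstract compares against can be sketched, under generic assumptions, as a simple linear mixture of a target-task LM and a general-domain LM. This is not the authors' code; the toy probabilities and the weight lambda below are invented for illustration.

    # Minimal sketch of linear LM interpolation (the baseline method, not the
    # authors' word-clustering approach). All numbers are made up.
    def interpolate(p_target: float, p_general: float, lam: float = 0.7) -> float:
        """Linear interpolation of two LM probabilities for the same event."""
        return lam * p_target + (1.0 - lam) * p_general

    # P(w | h) from a small in-domain model and a large general-domain model:
    p_in, p_out = 0.012, 0.004
    print(interpolate(p_in, p_out))   # 0.0096

    # For a whole next-word distribution, the mixture stays normalized:
    p_target = {"train": 0.5, "ticket": 0.3, "please": 0.2}
    p_general = {"train": 0.2, "ticket": 0.2, "please": 0.6}
    mixed = {w: interpolate(p_target[w], p_general[w]) for w in p_target}
    print(mixed, sum(mixed.values()))   # sums to 1.0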
Hierarchical Topic Classification for Dialog Speech Recognition Based on Language Model Switching
Ian R. Lane 1, Tatsuya Kawahara 1, Tomoko Matsui 2, Satoshi Nakamura 3; 1 Kyoto University, Japan; 2 Institute of Statistical Mathematics, Japan; 3 ATR-SLT, Japan
A speech recognition architecture combining topic detection and topic-dependent language modeling is proposed. In this architecture, a hierarchical back-off mechanism is introduced to improve system robustness. Detailed topic models are applied when topic detection is confident, and wider models that cover multiple topics are applied in cases of uncertainty. In this paper, two topic detection methods are evaluated for the architecture: unigram likelihood and SVM (Support Vector Machine). On the ATR Basic Travel Expression corpus, both topic detection methods provide a comparable reduction in WER of 10.0% and 11.1%, respectively, over a single language model system. Finally, the proposed re-decoding approach is compared with an equivalent system based on re-scoring. It is shown that re-decoding is vital to provide optimal recognition performance.

Session: PMoDe – Poster
Speech Modeling & Features I
Time: Monday 16.00, Venue: Main Hall, Level -1
Chair: Hynek Hermansky, Oregon Graduate Institute of Science and Technology, USA

Beyond a Single Critical-Band in TRAP Based ASR
Pratibha Jain, Hynek Hermansky; Oregon Health & Science University, USA
TRAP-based ASR attempts to extract information from rather long (as long as 1 s) and narrow (one critical-band) patches (temporal patterns) of the time-frequency plane. We investigate the effect of combining temporal patterns of logarithmic critical-band energies from several adjacent bands. The frequency context is gradually increased from one critical-band to several critical-bands by using temporal patterns jointly from adjacent bands as input to the class-posterior estimators. We show that up to three critical-bands of frequency context are required for achieving higher recognition performance. This work also indicates that local band interaction is important for improved speech recognition performance.

Variational Bayesian GMM for Speech Recognition
Fabio Valente, Christian Wellekens; Institut Eurecom, France
In this paper, we explore the potential of Variational Bayesian (VB) learning for speech recognition problems. VB methods deal in a more rigorous way with model selection and are a generalization of MAP learning. VB training for Gaussian Mixture Models is less affected than EM-ML training by overfitting and singular solutions. We compare two types of Variational Bayesian Gaussian Mixture Models (VBGMM) with classical EM-ML GMMs in a phoneme recognition task on the TIMIT database. VB learning performs better than EM-ML learning and is less affected by the initial model guess.

Time Alignment for Scenario and Sounds with Voice, Music and BGM
Yamato Wada, Masahide Sugiyama; University of Aizu, Japan
This paper proposes a new time alignment method between a scenario and sounds containing voice, music and BGM (background music), in order to generate video captions automatically. The proposed time alignment method, the Voice-Music-Pause+BGM method, is based on the composition of voice and music models. The results of experiments to evaluate the proposed method show that it works about 10∼60 times better than conventional time alignment methods.
Linear Predictive Method with Low-Frequency Emphasis
Paavo Alku, Tom Bäckström; Helsinki University of Technology, Finland
An all-pole modeling technique, Linear Prediction with Low-frequency Emphasis (LPLE), which emphasizes the lower frequency range of speech, is presented. The method is based on first interpreting conventional linear predictive (LP) analyses of successive prediction orders with parallel structures using the concept of symmetric linear prediction. In these implementations, symmetric linear prediction is preceded by simple pre-filters of either low- or high-frequency characteristics. Combining those symmetric linear predictors that are not preceded by high-frequency pre-filters yields the LPLE predictor. It is proved that the all-pole filters computed by LPLE are always stable. The results show that the method is well suited when low-order all-pole models with improved modeling of the lowest formants are needed.

Efficient Quantization of Speech Excitation Parameters Using Temporal Decomposition
Phu Chien Nguyen, Masato Akagi; JAIST, Japan
In this paper, we investigate the application of the temporal decomposition (TD) technique to describe the temporal patterns of speech excitation parameter contours, i.e. gain, pitch, and voicing. We use a common set of event functions to describe the temporal structure of both spectral and excitation parameters, and then quantize them. Experimental results show that each speech excitation parameter contour can be well described by a set of excitation targets using the event functions obtained from TD analysis of line spectral frequency (LSF) parameters, with considerably low reconstruction error. Moreover, we can efficiently quantize the excitation targets by a combination of two uniform quantizers, one working directly on logarithmic excitation targets and the other working on the difference between the current and previous logarithmic excitation targets.

Distributed Genetic Algorithm to Discover a Wavelet Packet Best Basis for Speech Recognition
Robert van Kommer 1, Béat Hirsbrunner 2; 1 Swisscom Innovations, Switzerland; 2 University of Fribourg, Switzerland
In the learning process of speech modeling, many choices or settings are defined “a priori” or result from years of experimental work. In this paper, instead, a global learning scheme is proposed based on a Distributed Genetic Algorithm combined with a standard speech-modeling algorithm. The speech recognition models are now created out of a predefined space of solutions. Furthermore, this global scheme makes it possible to learn the speech models as well as the best feature extraction module. Experimental validation is performed on the task of discovering the wavelet packet best-basis decomposition, knowing that the “a priori” reference is the mel-scaled subband decomposition. Two experiments are presented: a reference system using a simulated fitness, and a second one that uses the speech recognition performance as the fitness value. In the latter, each element of the space is a connectionist system defined by a wavelet topology and its associated neural network.

New Model-Based HMM Distances with Applications to Run-Time ASR Error Estimation and Model Tuning
Chao-Shih Huang 1, Chin-Hui Lee 2, Hsiao-Chuan Wang 3; 1 Acer Inc., Taiwan; 2 Georgia Institute of Technology, USA; 3 National Tsing Hua University, Taiwan
We propose a novel model-based HMM distance computation framework to estimate run-time recognition errors and adapt recognition parameters without the need for any testing or adaptation data. The key idea is to use HMM distances between competing models to measure the confusability between phones in speech recognition. Starting with a set of simulated models in a given noise condition, the corresponding error rate can be estimated with a smooth approximation of the error count computed from the set of phone distances, without using any testing data. By minimizing the estimated error between the desired and simulated models, the target model parameters can also be adjusted without using any adaptation data. Experimental results show that the word errors estimated with the proposed framework closely resemble the errors obtained by running actual recognition experiments on a large testing set in a number of adverse conditions.
The adapted models also gave better recognition performance than that obtained with environment-matched models, especially in low signal-to-noise conditions.

Feature Selection for the Classification of Crosstalk in Multi-Channel Audio
Stuart N. Wrigley, Guy J. Brown, Vincent Wan, Steve Renals; University of Sheffield, U.K.
An extension to the conventional speech/nonspeech classification framework is presented for a scenario in which a number of microphones record the activity of speakers present at a meeting (one microphone per speaker). Since each microphone can receive speech from both the participant wearing the microphone (local speech) and other participants (crosstalk), the recorded audio can be broadly classified in four ways: local speech, crosstalk plus local speech, crosstalk alone, and silence. We describe a classifier in which a Gaussian mixture model (GMM) is used to model each class. A large set of potential acoustic features is considered, some of which have been employed in previous speech/nonspeech classifiers. A combination of two feature selection algorithms is used to identify the optimal feature set for each class. Results from the GMM classifier using the selected features are superior to those of a previously published approach.

A DTW-Based DAG Technique for Speech and Speaker Feature Analysis
Jingwei Liu; Tsinghua University, China
A DTW-based directed acyclic graph (DAG) optimization method is proposed to exploit the interaction information of speech and speaker in feature components. We introduce a DAG representation of intra-class samples based on the dynamic time warping (DTW) measure and propose two criteria based on the in-degree of the DAG. Combined with the (l - r) optimization algorithm, the DTW-based DAG model is applied to examine the feature subset information representing speech and speaker in text-dependent speaker identification and speaker-dependent speech recognition. The experimental results demonstrate the ability of our model to reveal the low-dimensional performance and the influence of speech and speaker information in different tasks, and the corresponding DTW recognition rates are also calculated for comparison.
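The dynamic time warping measure underlying the Liu abstract above can be sketched as the classic textbook DTW distance between two feature sequences. This is a generic illustration under simple assumptions, not the author's implementation, and the toy sequences are invented.

    # Generic DTW distance with Euclidean local cost; illustrative only.
    import numpy as np

    def dtw_distance(x: np.ndarray, y: np.ndarray) -> float:
        """Classic DTW between sequences x (n, d) and y (m, d)."""
        n, m = len(x), len(y)
        D = np.full((n + 1, m + 1), np.inf)
        D[0, 0] = 0.0
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                cost = np.linalg.norm(x[i - 1] - y[j - 1])
                D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
        return float(D[n, m])

    a = np.array([[0.0], [1.0], [2.0], [1.0]])
    b = np.array([[0.0], [1.0], [1.0], [2.0], [1.0]])
    print(dtw_distance(a, b))   # small value: b is nearly a time-warped copy of a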
Analysis of Voice Source Characteristics Using a Constrained Polynomial Model
Tokihiko Kaburagi, Koji Kawai; Kyushu Institute of Design, Japan
This paper presents a method for analyzing voice source characteristics from speech by simultaneously employing models of the vocal tract and the voice source signal. The vocal tract is represented as a linear filter based on the conventional all-pole assumption. The voice source signal, on the other hand, is represented by linearly overlapping a number of base signals obtained from a generalization of the Rosenberg model. The resulting voice source model is a polynomial function of time with fewer degrees of freedom than the polynomial order. By virtue of the linearity of both models, the optimal values of their parameters can be jointly determined when the instants of glottal opening and closing are given for each pitch period. We also present a temporal search method for these glottal events using the dynamic programming technique. Finally, experimental results are presented to show the applicability of the proposed method under several phonation conditions.

Feature Transformations and Combinations for Improving ASR Performance
Panu Somervuo, Barry Chen, Qifeng Zhu; International Computer Science Institute, USA
In this work, linear and nonlinear feature transformations were investigated in the ASR front end. Unsupervised transformations were based on principal component analysis and independent component analysis. Discriminative transformations were based on linear discriminant analysis and multilayer perceptron networks. The acoustic models were trained using a subset of the HUB5 training data and tested using the OGI Numbers corpus. The baseline feature vector consisted of PLP cepstrum and energy with first- and second-order deltas. None of the feature transformations could outperform the baseline when used alone, but an improvement in word error rate was gained when the baseline feature was combined with the feature transformation stream. Two combination methods were investigated: feature vector concatenation and n-best list combination using ROVER. The best results were obtained using the combination of the baseline PLP cepstrum and the feature transform based on a multilayer perceptron network. The word error rate in the number recognition task was reduced from 4.1 to 3.1.

Tone Pattern Discrimination Combining Parametric Modeling and Maximum Likelihood Estimation
Jinfu Ni, Hisashi Kawai; ATR-SLT, Japan
This paper presents a novel method for tone pattern discrimination derived by combining a functional fundamental frequency (F0) model for feature extraction with vector quantization and maximum likelihood estimation techniques. Tone patterns are represented in a parametric form based on the F0 model and clustered using the LBG algorithm. The mapping between lexical tones and acoustic patterns is statistically modeled and decoded by maximum likelihood estimation. Evaluation experiments are conducted on 469 Mandarin utterances (1.4 hours of read speech from a female native speaker) with varied analysis conditions of codebook sizes and tone contexts. Experimental results indicate the effectiveness of the method in both tone discrimination and detection of inconsistency between a lexical tone and its F0 pattern. The method is suitable for the prosodic labeling of a large-scale speech corpus.

On the Role of Intonation in the Organization of Mandarin Chinese Speech Prosody
Chiu-yu Tseng; Academia Sinica, Taiwan
This paper reports three perception experiments on intonation groups and the role of phrasal intonation in the organization of speech prosody. The goal is to help unlimited TTS achieve better naturalness. The experiments were also designed to complement previous extensive analyses of speech data. Using the PRAAT software and removing segmental information, humming experiments were conducted on extracted intonation groups ending in interrogative and declarative intonations, in both complete and edited forms. Results showed that (1) the phrasal or sentential intonation contour is less significant for Mandarin, (2) yes-no questions with utterance-final question particles are characterized by a rising pitch on the final syllable only, (3) the generally higher register exhibited in yes-no questions without utterance-final question particles is not the most salient cue for intonation, (4) utterance-final lengthening appears to be a salient perceptual cue for intonation identification, and (5) speech units larger than single sentences deserve more attention.
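The unsupervised PCA transformation and the feature-vector concatenation mentioned in the Somervuo, Chen & Zhu abstract above can be pictured with a rough Python sketch. The random stand-in "cepstral" features and the choice of 8 retained components are placeholders, not values from the paper.

    # Rough sketch of a PCA feature stream concatenated with a baseline stream.
    import numpy as np

    rng = np.random.default_rng(0)
    baseline = rng.normal(size=(1000, 39))          # stand-in for PLP cepstra + deltas

    # PCA via SVD of the mean-centered feature matrix
    centered = baseline - baseline.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    pca_stream = centered @ vt[:8].T                # keep the first 8 components

    # Combination by feature vector concatenation
    combined = np.hstack([baseline, pca_stream])
    print(combined.shape)                           # (1000, 47)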
September 1-4, 2003 – Geneva, Switzerland An Optimized Multi-Duration HMM for Spontaneous Speech Recognition harmonic product spectrum based feature is extracted in frequency domain while the autocorrelation and the average magnitude difference based methods work in time domain. The algorithms produce a measure of voicing for each time frame. The voicing measure was combined with the standard Mel Frequency Cepstral Coefficients (MFCC) using linear discriminant analysis to choose the most relevant features. Experiments have been performed on small and large vocabulary tasks. The three different voicing measures combined with MFCCs resulted in similar improvements in word error rate: improvements of up to 14% on the small-vocabulary task and improvements of up to 6% on the large-vocabulary task relative to using MFCC alone with the same overall number of parameters in the system. Yuichi Ohkawa, Akihiro Yoshida, Motoyuki Suzuki, Akinori Ito, Shozo Makino; Tohoku University, Japan Use of a CSP-Based Voice Activity Detector for Distant-Talking ASR In spontaneous speech, various speech style and speed changes can be observed, which are known to degrade speech recognition accuracy. Luca Armani, Marco Matassoni, Maurizio Omologo, Piergiorgio Svaizer; ITCirst, Italy In this paper, we describe an optimized multi-duration HMM (OMD). An OMD is a kind of multi-path HMM with at most two parallel paths. Each path is trained using speech samples with short or long phoneme duration. The thresholds to divide samples of phonemes are determined through phoneme recognition experiment. Not only the thresholds but also topologies of HMM are determined using the recognition result. This paper addresses the problem of voice activity detection for distant-talking speech recognition in noisy and reverberant environment. The proposed algorithm is based on the same Cross-power Spectrum Phase analysis that is used for talker location and tracking purposes. A normalized feature is derived, which is shown to be more effective than an energy-based one. The algorithm exploits that feature by dynamically updating the threshold as a non-linear average value computed during the preceding pause. Given a real multichannel database, recorded with the speaker at 2.5 meter distance from the microphones, experiments show that the proposed algorithm provides a relevant relative error rate reduction. Next, we parallelize OMD model with ordinary HMM trained by spontaneous speech and HMM trained by read speech in parallel. Using this ‘all-parallel’ model, 19.3% reduction of word error rate was obtained compared with the ordinary HMM trained with spontaneous speech. Speaker Recognition Using MPEG-7 Descriptors Maximum Conditional Mutual Information Projection for Speech Recognition Mohamed Kamal Omar, Mark Hasegawa-Johnson; University of Illinois at Urbana-Champaign, USA Hyoung-Gook Kim, Edgar Berdahl, Nicolas Moreau, Thomas Sikora; Technische Universität Berlin, Germany Our purpose is to evaluate the efficiency of MPEG-7 audio descriptors for speaker recognition. The upcoming MPEG-7 standard provides audio feature descriptors, which are useful for many applications. One example application is a speaker recognition system, in which reduced-dimension log-spectral features based on MPEG-7 descriptors are used to train hidden Markov models for individual speakers. 
The feature extraction based on MPEG-7 descriptors consists of three main stages: Normalized Audio Spectrum Envelope (NASE), Principal Component Analysis (PCA) and Independent Component Analysis (ICA). An experimental study is presented where the speaker recognition rates are compared for different feature extraction methods. Using ICA, we achieved better results than NASE and PCA in a speaker recognition system. A Comparative Study on Maximum Entropy and Discriminative Training for Acoustic Modeling in Automatic Speech Recognition Wolfgang Macherey, Hermann Ney; RWTH Aachen, Germany While Maximum Entropy (ME) based learning procedures have been successfully applied to text based natural language processing, there are only little investigations on using ME for acoustic modeling in automatic speech recognition. In this paper we show that the well known Generalized Iterative Scaling (GIS) algorithm can be used as an alternative method to discriminatively train the parameters of a speech recognizer that is based on Gaussian densities. The approach is compared with both a conventional maximum likelihood training and a discriminative training based on the Extended Baum algorithm. Experimental results are reported on a connected digit string recognition task. Linear discriminant analysis (LDA) in its original model-free formulation is best suited to classification problems with equal-covariance classes. Heteroscedastic discriminant analysis (HDA) removes this equal covariance constraint, and therefore is more suitable for automatic speech recognition (ASR) systems. However, maximizing HDA objective function does not correspond directly to minimizing the recognition error. In its original formulation, HDA solves a maximum likelihood estimation problem in the original feature space to calculate the HDA transformation matrix. Since the dimension of the original feature space in ASR problems is usually high, the estimation of the HDA transformation matrix becomes computationally expensive and requires a large amount of training data. This paper presents a generalization of LDA that solves these two problems. We start with showing that the calculation of the LDA projection matrix is a maximum mutual information estimation problem in the lower-dimensional space with some constraints on the model of the joint conditional and unconditional probability density functions (PDF) of the features, and then, by relaxing these constraints, we develop a dimensionality reduction approach that maximizes the conditional mutual information between the class identity and the feature vector in the lower-dimensional space given the recognizer model. Using this approach, we achieved 1% improvement in phoneme recognition accuracy compared to the baseline system. Improvement in recognition accuracy compared to both LDA and HDA approaches is also achieved. Extraction Methods of Voicing Feature for Robust Speech Recognition András Zolnay, Ralf Schlüter, Hermann Ney; RWTH Aachen, Germany In this paper, three different voicing features are studied as additional acoustic features for continuous speech recognition. The 18 Eurospeech 2003 Monday September 1-4, 2003 – Geneva, Switzerland (WCs) in each subband. The idea is based on that the change of WC variance in speech-dominated frames is larger than the change of WC variance in noise-dominated frames. We can define a weighting function for WCs in each subband so that WCs are preserved in speech-dominated frames and reduced in noise-dominated frames. 
Then a weighting function in terms of WC’s variance is derived. The experimental results show that the proposed method is more robust than that of SNR adjusted speech enhancement system. Session: PMoDf– Poster Speech Enhancement I Time: Monday 16.00, Venue: Main Hall, Level -1 Chair: Joaquin Gonzalez-Rodriguez, ATVS-DIAC-Univ. Politecnica de Madrid, Spain A Semi-Blind Source Separation Method for Hands-Free Speech Recognition of Multiple Talkers Microphone Array Voice Activity Detection and Noise Suppression Using Wideband Generalized Likelihood Ratio Panikos Heracleous 1 , Satoshi Nakamura 2 , Kiyohiro Shikano 1 ; 1 Nara Institute of Science and Technology, Japan; 2 ATR-SLT, Japan Ilyas Potamitis 1 , Eran Fishler 2 ; 1 University of Patras, Greece; 2 Princeton University, USA In this paper, we present a beamforming based semi-blind source separation technique, which can be applied efficiently for handsfree speech recognition of multiple talkers (including moving talkers, too). The main difference from the conventional blind source separation techniques lies in the fact that the proposed method does not attempt to separate explicitly the unknown signals in a pre-processing pass before speech recognition. In fact, localization of multiple talkers, separation of the signals, and speech recognition are integrated in a single pass. Each time frame, beams formed by a delay-and-sum beamformer are steered to every direction, and speech information is extracted. A modified Viterbi formula provides n-best hypotheses for each direction and word hypotheses. At the final frame, all hypotheses are clustered based on their direction information. The clusters, which correspond to the talkers include information about the recognized speech of the multiple talkers and about their direction. Experiments for recognition of two and three talkers showed very promising results. In the case of two talkers, and using simulated clean data we achieved for ‘top 5’ hypotheses a recognition rate of 95.02% on average, which is very promising result. Influence of the Waveguide Propagation on the Antenna Performance in a Car Cabin Leonid Krasny, Ali Khayrallah; Ericsson Research, USA This paper presents a novel array processing algorithm for noise reduction in a hands free car environment. The algorithm incorporates the spatial properties of the sound field in a car cabin and a constraint on allowable speech signal distortion. Our results indicate that the proposed algorithm gives substantial performance improvement of 15-20 dB in comparison with the conventional array processing which is based on a coherent model of the signal field. Multi-Speaker DOA Tracking Using Interactive Multiple Models and Probabilistic Data Association Ilyas Potamitis, George Tremoulis, Nikos Fakotakis; University of Patras, Greece The general problem addressed in this work is that of tracking the Direction of Arrival (DOA) of active moving speakers in the presence of background noise and moderate reverberation level in the acoustic field. In order to efficiently beamform each moving speaker on an extended basis we adapt the theory developed in the context of Multi-target Tracking for military and civilian applications to the context of microphone array. Our approach employs Wideband MUSIC and Interacting Multiple Model (IMM) estimators to estimate the DOAs of the speakers under different kinds of motion and sudden change in their course. Probabilistic Data Association (PDA) is used to disambiguate and resolve DOA measurements. 
The efficiency of the approach is illustrated on simulated and real room experiments dealing with the crossing trajectories of two speakers. Speech Enhancement Using Weighting Function Based on the Variance of Wavelet Coefficients Ching-Ta Lu, Hsiao-Chuan Wang; National Tsing Hua University, Taiwan There are few works on the problem of heavy noise corruption in wavelet-based speech enhancement. In this paper, a new method is introduced to adapt the weighting function for wavelet coefficients The subject of this work is the use of microphone arrays for speech activity detection and noise suppression in the case of a moving speaker. The approach is based on the generalized likelihood ratio test applied to the framework of far-field, wideband moving sources (W-GLRT). It is shown that under certain distributional assumptions the W-GLRT provides a unifying framework for evaluation of Direction of Arrival (DOA) measurements against spurious DOAs, probabilistic speech activity detection as well as noise suppression. As regards speech enhancement, we demonstrate the direct connection of W-GLRT with enhancement based on subspace methods. In addition, through the concept of directive a-priori SNR we demonstrate its indirect connection with Minimum Mean Square Error spectral (MMSE_SA) and log-spectral gain modification (MMSE_LSA). The efficiency of the approach is illustrated on a moving speaker where additive white Gaussian Noise (AWGN) is present in the acoustical field at very low SNRs. Adaptive Beamforming in Room with Reverberation Zoran Šarić 1 , Slobodan Jovičić 2 ; 1 Institute of Security, Serbia and Montenegro; 2 University of Belgrade, Serbia and Montenegro Microphone arrays are powerful tools for noise suppression in a reverberant room. Generalized Sidelobe Canceller (GSC) that exploits Minimum Variance (MV) criterion is efficient in interference suppression when there is no correlation between the desired signal and the interferences. Correlation between the desired signal and any of interference produces a desired signal cancellation and degradation of signal-to-noise ratio. This paper analyses the unwanted cancellation of the desired source. It shows that cancellation level of the desired signal is proportional to the correlation between the direct wave and the reflected waves. For prevention of a desired signal cancellation we suggest the GSC parameter estimation during the pauses of the desired signal. For this case it is analytically shown that there is no cancellation of the desired signal. The proposed algorithm was experimentally tested and compared with the Conventional Beamformer (CBF) and GSC. Experimental tests have shown the advantage of the proposed method. Perceptually-Constrained Generalized Singular Value Decomposition-Based Approach for Enhancing Speech Corrupted by Colored Noise Gwo-hwa Ju, Lin-shan Lee; National Taiwan University, Taiwan In a previous work, we have successfully integrated the transformation-based signal subspace technique with the generalized singular value decomposition (GSVD) algorithm to develop an improved speech enhancement framework [1]. In this paper, we further incorporate the perceptual masking effect of the psychoacoustics model as extra constraints of the previously proposed GSVD-based algorithm to obtain improved sound feature, and furthermore make sure the undesired residual noise to be nearly unperceivable. 
Both subjective listening tests and spectrogram-plot comparison showed that the closed-form solution developed here can offer significantly better speech quality than either the conventional spectral subtraction algorithm or the previously proposed GSVD-based technique, regardless of whether the additive noise is white or not. 19 Eurospeech 2003 Monday Blind Separation and Deconvolution for Convolutive Mixture of Speech Using SIMO-Model-Based ICA and Multichannel Inverse Filtering September 1-4, 2003 – Geneva, Switzerland Speech Segregation Based on Fundamental Event Information Using an Auditory Vocoder Toshio Irino 1 , Roy D. Patterson 2 , Hideki Kawahara 1 ; 1 Wakayama University, Japan; 2 Cambridge University, U.K. Hiroaki Yamajo, Hiroshi Saruwatari, Tomoya Takatani, Tsuyoki Nishikawa, Kiyohiro Shikano; Nara Institute of Science and Technology, Japan We propose a new two-stage blind separation and deconvolution (BSD) algorithm for a convolutive mixture of speech, in which a new Single-Input Multiple-Output (SIMO)-model-based ICA (SIMOICA) and blind multichannel inverse filtering are combined. SIMOICA can separate the mixed signals, not into monaural source signals but into SIMO-model-based signals from independent sources as they are at the microphones. After SIMO-ICA, a simple blind deconvolution technique for the SIMO model can be applied even when each source signal is temporally correlated. The simulation results reveal that the proposed method can successfully achieve the separation and deconvolution for a convolutive mixture of speech. Quality Enhancement of CELP Coded Speech by Using an MFCC Based Gaussian Mixture Model We present a new auditory method to segregate concurrent speech sounds. The system is based on an auditory vocoder developed to resynthesize speech from an auditory Mellin representation using the vocoder STRAIGHT. The auditory representation preserves fine temporal information, unlike conventional window-based processing, and this makes it possible to segregate speech sources with an event synchronous procedure. We developed a method to convert fundamental frequency information to estimate glottal pulse times so as to facilitate robust extraction of the target speech. The results show that the segregation is good even when the SNR is 0 dB; the extracted target speech was a little distorted but entirely intelligible, whereas the distracter speech was reduced to a non-speech sound that was not perceptually disturbing. So, this auditory vocoder has potential for speech enhancement in applications such as hearing aids. Time Delay Estimation Based on Hearing Characteristic D.G. Raza, C.F. Chan; City University of Hong Kong, China At low bit rates CELP coders present certain artifacts generally known as hoarse and muffing characteristics. An enhancement system is developed to lessen the effects of these artifacts in CELP coded speech. In enhancement system, the high frequency components (4kHz-8kHz) are reinserted to reduce the muffing characteristics. This is achieved by using an MFCC based Gaussian Mixture Model. The hoarse characteristics are reduced by re-synthesizing the CELP reproduced speech with harmonic plus noise model. The pair-wise listening experiment results show that the re-synthesized wideband speech is preferred over the CELP coded speech. The enhanced speech is affirmed to be pleasant to listen and exhibits the naturalness of the original wideband speech. 
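For readers unfamiliar with the conventional spectral subtraction baseline against which the Ju & Lee abstract above compares its GSVD-based method, the following is a minimal, generic textbook sketch with invented signals and parameters; it is not code from any of the papers in this session.

    # Minimal magnitude spectral subtraction on a single frame; illustrative only.
    import numpy as np

    def spectral_subtract(noisy_frame, noise_mag, floor=0.01):
        """Subtract a noise magnitude estimate from one frame's spectrum."""
        spec = np.fft.rfft(noisy_frame * np.hanning(len(noisy_frame)))
        mag, phase = np.abs(spec), np.angle(spec)
        clean_mag = np.maximum(mag - noise_mag, floor * mag)   # spectral floor
        return np.fft.irfft(clean_mag * np.exp(1j * phase), n=len(noisy_frame))

    rng = np.random.default_rng(1)
    frame = np.sin(2 * np.pi * 440 * np.arange(256) / 8000) + 0.3 * rng.normal(size=256)
    noise_estimate = 0.3 * np.sqrt(256 / 2) * np.ones(129)     # crude flat noise level
    print(spectral_subtract(frame, noise_estimate)[:4])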
Enhancement of Noisy Speech for Noise Robust Front-End and Speech Reconstruction at Back-End of DSR System Zhaoli Yan, Limin Du, Jianqiang Wei, Hui Zeng; Chinese Academy of Sciences, China This paper proposes a new time delay estimation model, Summary Cross-correlation Function (SCCF). It is based on a hearing model of the human ear, which comes from a pitch perception model. The inherent relation between some time delay estimation (TDE) and pitch perception method is mentioned, and propose an idea – some pitch perception models’ pre-processing can be used for references in TDE model and vice versa. The new TDE model is proposed based on this viewpoint. Then SCCF is analyzed further, and compares its performance with Phase Transform (PHTA) and Modified Crosspower Spectrum (M-CPSP). The simulated experiments show that the new model is more robust to noise than PHAT and M-CPSP. Parametric Multi-Band Automatic Gain Control for Noisy Speech Enhancement M. Stolbov, S. Koval, M. Khitrov; Speech Technology Center, Russia Hyoung-Gook Kim, Markus Schwab, Nicolas Moreau, Thomas Sikora; Technische Universität Berlin, Germany This paper presents a speech enhancement method for noise robust front-end and speech reconstruction at the back-end of Distributed Speech Recognition (DSR). The speech noise removal algorithm is based on a two stage noise filtering LSAHT by log spectral amplitude speech estimator (LSA) and harmonic tunneling (HT) prior to feature extraction. The noise reduced features are transmitted with some parameters, viz., pitch period, the number of harmonic peaks from the mobile terminal to the server along noise-robust mel-frequency cepstral coefficients. Speech reconstruction at the back end is achieved by sinusoidal speech representation. Finally, the performance of the system is measured by the segmental signalnoise ratio, MOS tests, and the recognition accuracy of an Automatic Speech Recognition (ASR) in comparison to other noise reduction methods. This report is devoted to a new approach to wide band nonstationary noise reduction and corrupted speech signal enhancement. The objective is to provide processed speech intelligibility and quality while maintaining computation simplicity. We present a new (non-subtractive) noise suppression method called multiband Automatic Gain Control (AGC). The proposed method is based on the introduction of a non-subtractive noise suppression model and multi band filter gain control. This model provides less residual noise and better speech quality over the Spectral Subtraction Method (SSM). Modification of multi-band AGC gain function allows easy introduce new useful feature called Spectral Contrasting. The report contains the discussion of AGC control parameters values. Experiments show that the proposed algorithms are effective in non-stationary noisy background for Signal-to-Noise Ratio (SNR) up to -6dB. Improved Kalman Filter-Based Speech Enhancement Neural Networks versus Codebooks in an Application for Bandwidth Extension of Speech Signals Jianqiang Wei, Limin Du, Zhaoli Yan, Hui Zeng; Chinese Academy of Sciences, China Bernd Iser, Gerhard Schmidt; Temic Speech Dialog Systems, Germany In this paper, a Kalman filter-based speech enhancement algorithm with some improvements of previous work is presented. A new technique based on spectral subtraction is used for separation speech and noise characteristics from noisy speech and for the computation of speech and noise autoregressive (AR) parameters. 
In order to obtain a Kalman filter output with high audible quality, a perceptual post-filter is placed at the output of the Kalman filter to smooth the enhanced speech spectra. Experiments indicate that this newly proposed method works well. This paper presents two versions of an algorithm for bandwidth extension of speech signals. We focus on the generation of the spectral envelope and compare the performance of two different approaches – neural networks versus codebooks – in terms of objective and subjective distortion measures. Wavelet-Based Perceptual Speech Enhancement Using Adaptive Threshold Estimation Essa Jafer, Abdulhussain E. Mahdi; University of Limerick, Ireland 20 Eurospeech 2003 Monday A new speech enhancement system, which is based on a timefrequency adaptive wavelet soft thresholding, is presented in this paper. The system utilises a Bark-scaled wavelet packet decomposition integrated into a modified Weiner filtering technique using a novel threshold estimation method based on a magnitude decisiondirected approach. First, a Bark-Scaled wavelet packet transform is used to decompose the speech signal into critical bands. Threshold estimation is then performed for each wavelet band according to an adaptive noise level-tracking algorithm. Finally, the speech is estimated by incorporating the computed threshold into a Wiener filtering process, using the magnitude decision-directed approach. The proposed speech enhancement technique has been tested with various stationary and non-stationary noise cases. Reported results show that the system is capable of a high-level of noise suppression while preserving the intelligibility and naturalness of the speech. A Trainable Speech Enhancement Technique Based on Mixture Models for Speech and Noise Noise Reduction Using Paired-Microphones on Non-Equally-Spaced Microphone Arrangement Mitsunori Mizumachi, Satoshi Nakamura; ATR-SLT, Japan A wide variety of microphone arrays have been developed, and the authors have also proposed a type of equally-spaced small-scale microphone array. In this approach, a paired-microphone is selected at each frequency to design a subtractive beamformer that can estimate a noise spectrum. This paper introduces a non-equallyspaced microphone arrangement, which might give more spatial information than equally-spaced microphones, with two criteria for selecting the most suitable paired-microphone. These criteria are based on noise reduction rate and spectral smoothness, assuming that objective signals are speech. The feasibility of both the nonequally-spaced array and the criterion on spectral smoothness are confirmed by computer simulation. Ilyas Potamitis, Nikos Fakotakis, George Kokkinakis; University of Patras, Greece Our work introduces a trainable speech enhancement technique that can directly incorporate information about the long-term, timefrequency characteristics of speech signals prior to the enhancement process. We approximate noise spectral magnitude from available recordings from the operational environment as well as clean speech from a clean database with mixtures of Gaussian pdfs using the Expectation-Maximization algorithm (EM). Subsequently, we apply the Bayesian inference framework to the degraded spectral coefficients and by employing Minimum Mean Square Error Estimation (MMSE) we derive a closed form solution for the spectral magnitude estimation task. We evaluate our technique with a focus on real, highly non-stationary noise types (e.g. 
passing-by aircraft noise) and demonstrate its efficiency at low SNRs. Perceptual Wavelet Adaptive Denoising of Speech Qiang Fu, Eric A. Wan; Oregon Health & Science University, USA This paper introduces a novel speech enhancement system based on a wavelet denoising framework. In this system, the noisy speech is first preprocessed using a generalized spectral subtraction method to initially lower the noise level with negligible speech distortion. A perceptual wavelet transform is then used to decompose the resulting speech signal into critical bands. Threshold estimation is implemented that is both time and frequency dependent, providing robustness to non-stationary and correlated noisy environments. Finally, to eliminate the “musical noise” artifact, we apply a modified Ephraim/Malah suppression rule to the thresholding operation – adaptive denoising. Both objective and subjective experiments prove that the new speech enhancement system is capable of significant noise reduction with little speech distortion. Enhancement of Speech in Multispeaker Environment B. Yegnanarayana 1 , S.R. Mahadeva Prasanna 1 , Mathew Magimai Doss 2 ; 1 Indian Institute of Technology, India; 2 IDIAP, Switzerland In this paper a method based on the excitation source information is proposed for enhancement of speech, degraded by speech from other speakers. Speech from multiple speakers is simultaneously collected over two spatially distributed microphones. Time-delay of each speaker with respect to the two microphones is estimated using the excitation source information. A weight function is derived for each speaker using the knowledge of the time-delay and the excitation source information. Linear prediction (LP) residuals of the microphone signals are processed separately using the weight functions. Speech signals are synthesized from the modified residuals. One speech signal per speaker is derived from each microphone signal. The synthesized speech signals of each speaker are combined to produce enhanced speech. Significant enhancement of the speech of one speaker relative to other was observed from the combined signal. September 1-4, 2003 – Geneva, Switzerland Session: PMoDg– Poster Spoken Dialog Systems I Time: Monday 16.00, Venue: Main Hall, Level -1 Chair: Antje Schweitzer, Universit"at Stuttgart, Germany Two Studies of Open vs. Directed Dialog Strategies in Spoken Dialog Systems Silke M. Witt, Jason D. Williams; Edify Corporation, USA This paper analyzes the behavior of callers responding to a speech recognition system when prompted either with an open or a directed dialog strategy. The results of two usability studies with different caller populations are presented. Differences between the results from the two studies are analyzed and are shown to arise from the differences in the domains. It is shown that it depends on the caller population whether an open or a directed dialog strategy is preferred. In addition, we examine the effect of additional informational prompts on the routability of caller utterances. The Queen’s Communicator: An Object-Oriented Dialogue Manager Ian O’Neill 1 , Philip Hanna 1 , Xingkun Liu 1 , Michael McTear 2 ; 1 Queen’s University Belfast, U.K.; 2 University of Ulster, U.K. This paper presents some of the main features of a prototype spoken dialogue manager (DM) that has been incorporated into the DARPA Communicator architecture. 
Developed in Java, the object components that constitute the DM separate generic from domainspecific dialogue behaviour in the interests of maintainability and extensibility. Confirmation strategies encapsulated in a high-level DiscourseManager determine the system’s behaviour across transactional domains, while rules of thumb encapsulated in a suite of domain experts enable the system to guide the user towards completion of particular transactions. We describe the nature of the generic confirmation strategy and the domain experts’ specialised dialogue behaviour. We describe how rules of thumb fire given certain combinations of user-supplied values – or in the light of the system’s own interaction with its database. RavenClaw: Dialog Management Using Hierarchical Task Decomposition and an Expectation Agenda Dan Bohus, Alexander I. Rudnicky; Carnegie Mellon University, USA We describe RavenClaw, a new dialog management framework developed as a successor to the Agenda [1] architecture used in the CMU Communicator. RavenClaw introduces a clear separation between task and discourse behavior specification, and allows rapid development of dialog management components for spoken dialog systems operating in complex, goal-oriented domains. The system development effort is focused entirely on the specification of the dialog task, while a rich set of domain-independent conversational behaviors are transparently generated by the dialog engine. To date, RavenClaw has been applied to five different domains allowing us 21 Eurospeech 2003 Monday to draw some preliminary conclusions as to the generality of the approach. We briefly describe our experience in developing these systems. Features for Tree Based Dialogue Course Management Klaus Macherey, Hermann Ney; RWTH Aachen, Germany In this paper, we introduce different features for dialogue course management and investigate their effect on the system’s behaviour for choosing the subsequent dialogue action during a dialogue session. Especially, we investigate whether the system is able to detect and resolve ambiguities, and if it always chooses that state which leads as quickly as possible to a final state that presumably meets the user’s request. The criteria and used data structures are independently from the underlying domain and can therefore be used for different applications of spoken dialogue systems. September 1-4, 2003 – Geneva, Switzerland Conceptual Decoding for Spoken Dialog Systems Yannick Estève, Christian Raymond, Frédéric Béchet, Renato De Mori; LIA-CNRS, France A search methodology is proposed for performing conceptual decoding process. Such a process provides the best sequence of word hypotheses according to a set of conceptual interpretations. The resulting models are combined in a network of Stochastic Finite State Transducers. This approach is a framework that tries to bridge the gap between speech recognition and speech understanding processes. Indeed, conceptual interpretations are generated according to both a semantic representation of the task and a system t belief which evolves according to the dialogue states. Preliminary experiments on the detection of semantic entities (mainly named entities) in a dialog application have shown that interesting results can be obtained even if the Word Error Rate is pretty high. 
Sentence Verification in Spoken Dialogue System Huei-Ming Wang, Yi-Chung Lin; Industrial Technology Research Institute, Taiwan Development of a Stochastic Dialog Manager Driven by Semantics Francisco Torres, Emilio Sanchis, Encarna Segarra; Universitat Politècnica de València, Spain We present an approach for the development of a dialog manager based on stochastic models for the representation of the dialogue structure and strategy. This dialog manager processes semantic representations and, when it is integrated with our understanding and answer generation modules, it performs natural language dialogs. It has been applied to a Spanish dialogue system which answers telephone queries about train timetables. Generation of Natural Response Timing Using Decision Tree Based on Prosodic and Linguistic Information Masashi Takeuchi, Norihide Kitaoka, Seiichi Nakagawa; Toyohashi University of Technology, Japan In spoken dialogue systems, sentence verification technique is very useful to avoid misunderstanding user’s intention by rejecting outof-domain or bad quality utterances. However, compared with word verification and concept verification, sentence verification has been seldom touched in the past. In this paper, we propose a sentence verification approach which uses discriminative features extracted from the edit operation sequence. Since the edit operation sequence indicates what kinds of errors (i.e., insertion, deletion and substitution errors) may occur in the hypothetical concept sequence, it conveys sentence-level information for evaluating the quality of system’s interpretation for the user’s utterance. In addition, a sentence verification criterion concerning precision and recall rates of hypothetical concepts is also proposed to pursue efficient and correct spoken dialogue interactions. Compared with the verification method using acoustic confidence measure, the proposed approach reduces 17.3% of errors. Detection and Recognition of Correction Utterance in Spontaneously Spoken Dialog If a dialog system can respond to the user as reasonable as a human, the interaction will be more smooth. Timing of response such as backchannels and turn-taking plays important role in such a smooth dialog as in human-human interaction. We are now developing a dialog system which can generate response timing in real time. In this paper, we introduce a response timing generator for such a dialog system. First, we analyzed conversations between two persons and extracted prosodic and linguistic information which had effects on the timing. Then we constructed a decision tree based on the features coming from the information and developed a timing generator using rules derived from the decision tree. The timing generator decides the action of the system at every 100ms in user’s pause. We evaluated the timing generator by subjective and objective evaluation. Child and Adult Speaker Adaptation During Error Resolution in a Publicly Available Spoken Dialogue System Linda Bell, Joakim Gustafson; Telia Research, Sweden This paper describes how speakers adapt their language during error resolution when interacting with the animated agent Pixie. A corpus of spontaneous human-computer interaction was collected at the Telecommunication museum in Stockholm, Sweden. Adult and children speakers were compared with respect to user behavior and strategies during error resolution. In this study, 16 adults and 16 children speakers were randomly selected from a corpus from almost 3.000 speakers. 
Detection and Recognition of Correction Utterance in Spontaneously Spoken Dialog Norihide Kitaoka, Naoko Kakutani, Seiichi Nakagawa; Toyohashi University of Technology, Japan Recently, the performance of speech recognition has improved drastically, and products with interfaces based on speech recognition have been realized. However, when we communicate with computers through a speech interface, misrecognition is inevitable, and it is difficult to recover from because of the immaturity of the interface. Users try to recover from misrecognition by repeating the same content, so detecting a user's repetition helps the system detect its misunderstanding and recover from the misrecognition. In this paper, we treat an utterance which includes repetitions as a correction and propose a method to detect correction utterances in spontaneously spoken dialog using word spotting based on DTW (dynamic time warping) and an N-best hypothesis overlap measure. As a result, we achieved a recall rate of 92.7% and a precision of 89.1%. Moreover, we tried to improve recognition accuracy using the detection: by choosing the vocabulary and grammar setup based on the detection, we improved recognition performance from 42.7% to 50.0% for correction utterances and from 70.5% to 77.9% for non-correction utterances. Child and Adult Speaker Adaptation During Error Resolution in a Publicly Available Spoken Dialogue System Linda Bell, Joakim Gustafson; Telia Research, Sweden This paper describes how speakers adapt their language during error resolution when interacting with the animated agent Pixie. A corpus of spontaneous human-computer interaction was collected at the Telecommunication museum in Stockholm, Sweden. Adult and child speakers were compared with respect to user behavior and strategies during error resolution. In this study, 16 adult and 16 child speakers were randomly selected from a corpus of almost 3,000 speakers. This sub-corpus was then analyzed in greater detail. Results indicate that adults and children use partly different strategies when their interactions with Pixie become problematic. Children tend to repeat the same utterance verbatim, altering certain phonetic features. Adults, on the other hand, often modify other aspects of their utterances such as lexicon and syntax. Results from the present study will be useful for constructing future spoken dialogue systems with improved error handling for adults as well as children. Topic-Specific Parser Design in an Air Travel Natural Language Understanding Application Chaitanya J.K. Ekanadham, Juan M. Huerta; IBM T.J. Watson Research Center, USA In this paper we contrast a traditional approach to semantic parsing for Natural Language Understanding applications, in which a single parser captures a whole application domain, with an alternative approach consisting of a collection of smaller parsers, each able to handle only a portion of the domain. We implement this topic-specific parsing strategy by fragmenting the training corpus into subject-specific subsets and developing from each subset a corresponding subject parser. We demonstrate this procedure on the DARPA Communicator task, and we observe that given an appropriate smoothing mechanism to overcome data sparseness, the set of subject-specific parsers performs as effectively (in accuracy terms) as the original parser. We present experiments both under supervised and unsupervised subject selection modes.
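Ekanadham and Huerta fragment the training corpus into subject-specific subsets and train one parser per subset, relying on smoothing to overcome data sparseness; as a rough illustration of that corpus-fragmentation idea, the sketch below routes a sentence to one of several add-one-smoothed unigram topic models (topics and sentences are invented, and the unigram models stand in for full semantic parsers).

```python
import math
from collections import Counter

# Hypothetical topic-tagged training fragments (not the Communicator data).
corpus = {
    "flight": ["book a flight to boston", "show flights from denver"],
    "hotel":  ["find a hotel in boston", "book a room for two nights"],
    "ground": ["i need a rental car", "arrange a taxi to the airport"],
}

models = {t: Counter(w for s in sents for w in s.split())
          for t, sents in corpus.items()}
vocab = {w for c in models.values() for w in c}

def score(sentence, topic):
    """Add-one-smoothed unigram log-probability under one topic model."""
    counts, total = models[topic], sum(models[topic].values())
    return sum(math.log((counts[w] + 1) / (total + len(vocab) + 1))
               for w in sentence.split())

def route(sentence):
    """Pick the topic-specific model that scores the sentence highest."""
    return max(models, key=lambda t: score(sentence, t))

print(route("book a flight from boston"))
```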
The Use of Confidence Measures in Vector Based Call-Routing Stephen J. Cox, Gavin Cawley; University of East Anglia, U.K. In previous work, we experimented with different techniques of vector-based call routing, using the transcriptions of the queries to compare algorithms. In this paper, we base the routing decisions on the recogniser output rather than transcriptions and examine the use of confidence measures (CMs) to combat the problems caused by the "noise" in the recogniser output. CMs are derived both for the words output from the recogniser and for the routings themselves, and are used to investigate improving both routing accuracy and routing confidence. Results are given for a 35-route retail store enquiry-point task. They suggest that although routing error is controlled by the recogniser error rate, confidence in routing decisions can be improved using these techniques. Multi-Channel Sentence Classification for Spoken Dialogue Language Modeling Frédéric Béchet 1 , Giuseppe Riccardi 2 , Dilek Z. Hakkani-Tür 2 ; 1 LIA-CNRS, France; 2 AT&T Labs-Research, USA In traditional language modeling, word prediction is based on the local context (e.g. n-gram). In spoken dialog, language statistics are affected by the multidimensional structure of the human-machine interaction. In this paper we investigate the statistical dependencies of users' responses with respect to the system's and user's channel. The system channel components are the prompt text, the dialogue history and the dialogue state. The user channel components are the Automatic Speech Recognition (ASR) transcriptions, the semantic classifier output and the sentence length. We describe an algorithm for language model rescoring using users' response classification. The user's response is first mapped into a multidimensional state and the state-specific language model is applied for ASR rescoring. We present perplexity and ASR results on the How May I Help You?sm corpus of 100K spoken dialogs. Robust Parsing of Utterances in Negotiative Dialogue Johan Boye, Mats Wirén; Telia Research, Sweden This paper presents an algorithm for domain-dependent parsing of utterances in negotiative dialogue. To represent such utterances, the algorithm outputs semantic expressions that are more expressive than propositional slot-filler structures. It is very fast and robust, yet precise and capable of correctly combining information from different utterance fragments. Flexible Speech Act Identification of Spontaneous Speech with Disfluency Chung-Hsien Wu, Gwo-Lang Yan; National Cheng Kung University, Taiwan This paper describes an approach for flexible speech act identification of spontaneous speech with disfluency. In this approach, semantic information, syntactic structure, and fragment features of an input utterance are statistically encapsulated into a proposed speech act hidden Markov model (SAHMM) to characterize the speech act. To deal with the disfluency problem in a sparse training corpus, an interpolation mechanism is exploited to re-estimate the state transition probability in SAHMM. Finally, the dialog system accepts the speech act with the best score and returns the corresponding response. Experiments were conducted to evaluate the proposed approach using a spoken dialogue system for the air travel information service. A testing database from 25 speakers containing 480 dialogues including 3038 sentences was collected and used for evaluation. Using the proposed approach, the experimental results show that the performance achieves 90.3% in speech act correct rate (SACR) and 85.5% in fragment correct rate (FCR) for fluent speech, and gains a significant improvement of 5.7% in SACR and 6.9% in FCR for disfluent speech compared to a baseline system that does not consider filled pauses.
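Wu and Yan re-estimate the SAHMM state transition probabilities with an interpolation mechanism to cope with sparse training data; their exact scheme is not given here, but the sketch below shows the standard linear interpolation between sparse maximum-likelihood bigram estimates and a context-free back-off distribution that such mechanisms are usually built on (the speech act labels are invented).

```python
from collections import Counter

# Hypothetical speech-act sequences (labels invented for illustration).
dialogs = [["greet", "request", "confirm", "close"],
           ["greet", "request", "request", "close"]]

bigrams = Counter((a, b) for d in dialogs for a, b in zip(d, d[1:]))
unigrams = Counter(a for d in dialogs for a in d)

def p_ml(prev, nxt):
    """Maximum-likelihood transition estimate, zero for unseen contexts."""
    ctx = sum(v for (a, _), v in bigrams.items() if a == prev)
    return bigrams[(prev, nxt)] / ctx if ctx else 0.0

def p_unigram(nxt):
    return unigrams[nxt] / sum(unigrams.values())

def p_interp(prev, nxt, lam=0.7):
    """Interpolated transition probability: sparse ML estimate backed off
    to the context-free act distribution."""
    return lam * p_ml(prev, nxt) + (1 - lam) * p_unigram(nxt)

for act in sorted(unigrams):
    print("P(%s | request) = %.3f" % (act, p_interp("request", act)))
```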
Automatic Induction of N-Gram Language Models from a Natural Language Grammar Stephanie Seneff, Chao Wang, Timothy J. Hazen; Massachusetts Institute of Technology, USA This paper details our work in developing a technique which can automatically generate class n-gram language models from natural language (NL) grammars in dialogue systems. The procedure eliminates the need for double maintenance of the recognizer language model and the NL grammar. The resulting language model adopts the standard class n-gram framework for computational efficiency. Moreover, both the n-gram classes and training sentences are enhanced with semantic/syntactic tags defined in the NL grammar, such that the trained language model preserves the distinctive statistics associated with different word senses. We have applied this approach in several different domains and languages, and have evaluated it on our most mature dialogue systems to assess its competitiveness with pre-existing n-gram language models. The speech recognition performances with the new language model are in fact the best we have achieved in both the JUPITER weather domain and the MERCURY flight reservation domain. Efficient Spoken Dialogue Control Depending on the Speech Recognition Rate and System's Database Kohji Dohsaka, Norihito Yasuda, Kiyoaki Aikawa; NTT Corporation, Japan We present dialogue control methods (the dual-cost method and the trial dual-cost method) that enable a spoken dialogue system to convey information to the user in as short a dialogue as possible, depending on the speech recognition rate and the content of its database. Both methods control a dialogue so as to minimize the sum of two costs: the confirmation cost (C-cost) and the information transfer cost (I-cost). The C-cost is the length of a subdialogue for confirming a user query, and the I-cost is the length of a system response generated after the confirmations. The dual-cost method can avoid the unnecessary confirmations that are inevitable in conventional methods. The trial dual-cost method is an improved version of the dual-cost method. Whereas the dual-cost method has the limitation that it generates a system response based only on the content of a query that the user has acknowledged in the confirmation subdialogue, the trial dual-cost method does not. Dialogue experiments show that the trial dual-cost method outperforms the dual-cost method and that both methods outperform conventional ones. Connectionist Classification and Specific Stochastic Models in the Understanding Process of a Dialogue System David Vilar, María José Castro, Emilio Sanchis; Universitat Politècnica de València, Spain In this paper we present an approach to the application of specific models to the understanding process of a dialogue system. The previous classification process is done by means of Multilayer Perceptrons, and Hidden Markov Models are used for the semantic modeling. The task consists of answering telephone queries about train timetables, prices and services for long distance trains in Spanish. A comparison between a global understanding model and the specific models is presented.
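The dual-cost method of Dohsaka et al. selects its confirmation behaviour by minimising the sum of the confirmation cost (C-cost) and the information transfer cost (I-cost); the sketch below illustrates that trade-off with invented cost estimates, not the authors' actual cost functions or dialogue model.

```python
def expected_c_cost(n_confirmations, turns_per_confirmation=1.0):
    """Length of the confirmation sub-dialogue (C-cost), in turns."""
    return n_confirmations * turns_per_confirmation

def expected_i_cost(n_confirmations, initial_db_matches=12):
    """Length of the system response generated after the confirmations
    (I-cost), in turns: each confirmed slot is assumed to narrow the
    database hits by a factor of three (an invented figure)."""
    return max(1, initial_db_matches // (3 ** n_confirmations))

def choose_strategy(max_confirmations=3):
    """Pick the number of confirmations that minimises C-cost + I-cost."""
    costs = {n: expected_c_cost(n) + expected_i_cost(n)
             for n in range(max_confirmations + 1)}
    best = min(costs, key=costs.get)
    return best, costs

print(choose_strategy())
```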
Robust Speech Understanding Based on Expected Discourse Plan Shin-ya Takahashi, Tsuyoshi Morimoto, Sakashi Maeda, Naoyuki Tsuruta; Fukuoka University, Japan This paper reports spoken dialogue experiments for elderly people in the home health care system we have developed. In spoken dialogue systems, it is important to decrease recognition errors. Recognition errors, however, cannot be completely avoided with current speech recognition techniques. In this paper, we propose a robust speech understanding technique based on expected discourse plans in order to improve recognition accuracy. First, we collect dialogue examples of elderly users through a Wizard-of-Oz (WOZ) experiment. Next, we conduct a recognition experiment on the collected elderly speech using the proposed technique. The experimental results demonstrate that this technique improved the sentence recognition rate from 69.1% to 74.3%, the word recognition rate from 80.3% to 81.7%, and the plan matching rate from 88.3% to 92.0%. Session: OTuBa– Oral Robust Speech Recognition - Noise Compensation Time: Tuesday 10.00, Venue: Room 1 Chair: Iain McCowan, IDIAP, Martigny, Switzerland Normalization of Time-Derivative Parameters Using Histogram Equalization Yasunari Obuchi 1 , Richard M. Stern 2 ; 1 Hitachi Ltd., Japan; 2 Carnegie Mellon University, USA In this paper we describe a new framework of feature compensation for robust speech recognition. We introduce Delta-Cepstrum Normalization (DCN), which normalizes not only cepstral coefficients but also their time-derivatives. In previous work, the mean and the variance of cepstral coefficients were normalized to reduce irrelevant information, but such normalization was not applied to time-derivative parameters because the reduction of irrelevant information was not sufficient. Histogram Equalization, however, provides better compensation and can be applied even to delta and delta-delta cepstra. We investigate various implementations of DCN, and show that we can achieve the best performance when the normalization of the cepstra and delta cepstra is made mutually interdependent. We evaluate the performance of DCN using speech data recorded with a PDA. DCN provides significant improvements compared to HEQ. We also examine the possibility of combining Vector Taylor Series (VTS) and DCN. Even though some combinations do not improve the performance of VTS, it is shown that the best combination gives better performance than VTS alone. Finally, the advantages of DCN in terms of computation speed are also discussed. Tree-Structured Noise-Adapted HMM Modeling for Piecewise Linear-Transformation-Based Adaptation Zhipeng Zhang 1 , Kiyotaka Otsuji 1 , Sadaoki Furui 2 ; 1 NTT DoCoMo Inc., Japan; 2 Tokyo Institute of Technology, Japan This paper proposes the application of tree-structured clustering to various noise samples or noisy speech in the framework of piecewise-linear transformation (PLT)-based noise adaptation. According to the clustering results, a noisy speech HMM is made for each node of the tree structure. Based on the likelihood maximization criterion, the HMM that best matches the input speech is selected by tracing the tree from top to bottom, and the selected HMM is further adapted by linear transformation. The proposed method is evaluated by applying it to a Japanese dialogue recognition system. The results confirm that the proposed method is effective in recognizing noise-added speech under various noise conditions. Feature Compensation Scheme Based on Parallel Combined Mixture Model Wooil Kim, Sungjoo Ahn, Hanseok Ko; Korea University, Korea This paper proposes an effective feature compensation scheme based on a speech model for achieving robust speech recognition. Conventional model-based methods require off-line training with a noisy speech database and are not suitable for online adaptation. In the proposed scheme, we can relax the off-line training with a noisy speech database by employing the parallel model combination technique for the estimation of correction factors. Applying the model combination process to the mixture model alone, as opposed to the entire HMM, makes online model combination possible. Exploiting the availability of a noise model from off-line sources, we accomplish the online adaptation via MAP (Maximum A Posteriori) estimation. In addition, an online channel estimation procedure is included within the proposed framework. Representative experimental results indicate that the suggested algorithm is effective in realizing robust speech recognition under the combined adverse conditions of additive background noise and channel distortion. A Comparison of Three Non-Linear Observation Models for Noisy Speech Features Jasha Droppo, Li Deng, Alex Acero; Microsoft Research, USA This paper reports our recent efforts to develop a unified, nonlinear, stochastic model for estimating and removing the effects of additive noise on speech cepstra. The complete system consists of prior models for speech and noise, an observation model, and an inference algorithm. The observation model quantifies the relationship between clean speech, noise, and the noisy observation. Since it is expressed in terms of the log Mel-frequency filter-bank features, it is non-linear. The inference algorithm is the procedure by which the clean speech and noise are estimated from the noisy observation. The most critical component of the system is the observation model. This paper derives a new approximation strategy and compares it with two existing approximations. It is shown that the new approximation uses half the calculation, and produces equivalent or improved word accuracy scores, when compared to previous techniques. We present noise-robust recognition results on the standard Aurora 2 task. Maximum Likelihood Sub-Band Weighting for Robust Speech Recognition Donglai Zhu 1 , Satoshi Nakamura 2 , Kuldip K. Paliwal 3 , Renhua Wang 1 ; 1 University of Science and Technology of China, China; 2 ATR-SLT, Japan; 3 Griffith University, Australia Sub-band speech recognition approaches have been proposed for robust speech recognition, where full-band power spectra are divided into several sub-bands and then likelihoods or cepstral vectors of the sub-bands are merged depending on their reliability. In conventional sub-band approaches, correlations across the sub-bands are not modeled and the merging weights can only be set experientially or estimated during training procedures, which may not match observed data. The methods further degrade performance for clean speech. We propose a novel sub-band approach in which frequency sub-bands are multiplied by weighting factors and merged, which considers sub-band dependence and proves to be more robust than both full-band and conventional sub-band approaches. Furthermore, the weighting factors can be obtained using maximum-likelihood estimation in order to minimize the mismatch between the trained models and the observed features. Finally, we evaluated our methods on both the Aurora2 task and the Resource Management task and showed consistent improvements on both tasks. A New Supervised-Predictive Compensation Scheme for Noisy Speech Recognition Khalid Daoudi, Murat Deviren; LORIA, France We present a new predictive compensation scheme which makes no assumption on how the noise sources alter the speech data and which does not rely on clean speech models. Rather, this new scheme makes the (realistic) assumption that speech databases recorded under different background noise conditions are available. The philosophy of this scheme is to process these databases in order to build a "tool" which will allow it to handle new noise conditions in a robust way. We evaluate the performance of this new compensation scheme on a connected digits recognition task and show that it can perform significantly better than multi-condition training, which is the most widely used technique in this kind of scenario. Session: STuBb– Oral Forensic Speaker Recognition Time: Tuesday 10.00, Venue: Room 2 Chair: Andrzej Drygajlo, EPFL, Switzerland Statistical Methods and Bayesian Interpretation of Evidence in Forensic Automatic Speaker Recognition Andrzej Drygajlo 1 , Didier Meuwly 2 , Anil Alexander 1 ; 1 EPFL, Switzerland; 2 Forensic Science Service, U.K. The goal of this paper is to establish a robust methodology for forensic automatic speaker recognition (FASR) based on sound statistical and probabilistic methods, and validated using databases recorded in real-life conditions. The interpretation of recorded speech as evidence in the forensic context presents particular challenges. The means proposed for dealing with them is through Bayesian inference and a corpus-based methodology. A probabilistic model – the odds form of Bayes' theorem and the likelihood ratio – seems to be an adequate tool for assisting forensic experts in the speaker recognition domain to interpret this evidence. In forensic speaker recognition, statistical modelling techniques are based on the distribution of various features pertaining to the suspect's speech and its comparison to the distribution of the same features in a reference population with respect to the questioned recording. In this paper, the state-of-the-art automatic, text-independent speaker recognition system, using a Gaussian mixture model (GMM), is adapted to the Bayesian interpretation (BI) framework to estimate the within-source variability of the suspected speaker and the between-sources variability, given the questioned recording. This double-statistical approach (BI-GMM) gives an adequate solution for the interpretation of the recorded speech as evidence in the judicial process. Robust Likelihood Ratio Estimation in Bayesian Forensic Speaker Recognition J. Gonzalez-Rodriguez, D. Garcia-Romero, M. Garcia-Gomar, D. Ramos-Castro, J. Ortega-Garcia; Universidad Politécnica de Madrid, Spain In this paper we summarize the Bayesian methodology for forensic analysis of the evidence in the speaker recognition area. We also describe the procedure to convert any speaker recognition system into a valuable forensic tool according to the Bayesian methodology. Furthermore, we study the difference between assessment of speaker recognition technology using DET curves and assessment of forensic systems by means of Tippett plots. Finally, we show several complete examples of our speaker recognition system in a forensic environment. Some experiments are presented where, using Ahumada-Gaudí speech data, we optimize the Likelihood Ratio computation procedure in order to be robust to inconsistencies in the estimation of the within- and between-sources statistical distributions. Results in the different tested situations, summarized in Tippett plots, show the adequacy of this approach to daily forensic work. Automated Speaker Recognition in Real World Conditions: Controlling the Uncontrollable Hirotaka Nakasone; Federal Bureau of Investigation, USA The current development of automatic speaker recognition technology may provide a new method to augment or replace the traditional method offered by qualified experts using aural and spectrographic analysis. The most promising of these automated technologies are based on statistical hypothesis testing methods involving likelihood ratios. The null hypothesis is generated using a universal background model composed of a large population of speakers. However, techniques with excellent performance in standardized evaluations (NIST trials) may not work perfectly in the real world. By defining and controlling the input speech samples carefully, we show quantitative differences in performance for different factors affecting a speaker population, and discuss on-going efforts to improve the accuracy rate for use in real world conditions. In this paper we will address two issues related to the factors that affect the system performance, namely the speech signal duration and the signal-to-noise ratio. Estimating the Weight of Evidence in Forensic Speaker Verification Beat Pfister, René Beutler; ETH Zürich, Switzerland In forensic casework, the application of automatic speaker verification (SV) aims to determine the likelihood ratio of a suspect being vs. not being the speaker of an incriminating speech recording. For that purpose, the likelihood of the anti-speaker has to be estimated from the speech of an adequate number of other speakers. In many cases, speech signals of such an anti-speaker population are not available and it is generally too expensive to make an appropriate collection. This paper presents a practical procedure for forensic SV which is based on a text-dependent SV system; instead of an anti-speaker population, a special speech database is used to calibrate the valuation scale for an individual case. Auditory-Instrumental Forensic Speaker Recognition Stefan Gfroerer; Bundeskriminalamt, Germany The most prominent part of forensic speech and audio processing is speaker recognition. A number of approaches to forensic speaker recognition (FSR) have been developed around the world that differ in terms of technical procedures, methodology and instrumentation, and also in terms of the probability scale on which the final conclusion is based. The BKA's approach to speaker recognition is a combination of classical phonetic analysis techniques, including analytical listening by an expert, and the use of signal processing techniques within an acoustic-phonetic framework. This combined auditory-instrumental method includes acoustic measurements of parameters which may be interpreted using statistical information on their distributions, e.g. probability distributions of average fundamental frequency for adult males and females, average syllable rates as indicators of speech rate, etc. In a voice comparison report the final conclusion is determined by a synopsis of the results from auditory and acoustic parameters, amounting to about eight to twelve on average, depending on the nature of the speech material. Results are given in the form of probability statements. The paper gives an overview of current procedures and specific problems of FSR.
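Several of the forensic abstracts above rely on the odds form of Bayes' theorem, where a likelihood ratio compares how well the evidence score fits the within-source and the between-sources score distributions; a minimal sketch is given below, assuming both distributions have already been modelled as single Gaussians (real systems estimate them from corpora, typically with GMMs), and the numbers are invented.

```python
import math

def gaussian_pdf(x, mean, std):
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2 * math.pi))

def likelihood_ratio(evidence_score, within, between):
    """LR = p(E | same speaker) / p(E | different speakers)."""
    return gaussian_pdf(evidence_score, *within) / gaussian_pdf(evidence_score, *between)

# Invented score distributions: (mean, std) of comparison scores when the
# suspect is / is not the source of the questioned recording.
within_source = (2.5, 0.8)     # suspect's own recordings vs. the trace
between_sources = (-0.5, 1.0)  # reference population vs. the trace

lr = likelihood_ratio(1.8, within_source, between_sources)
prior_odds = 1.0               # the court's prior odds, outside the expert's remit
print("LR = %.1f, posterior odds = %.1f" % (lr, lr * prior_odds))
```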
Earwitness Line-Ups: Effects of Speech Duration, Retention Interval and Acoustic Environment on Identification Accuracy J.H. Kerstholt 1 , E.J.M. Jansen 1 , A.G. van Amelsvoort 2 , A.P.A. Broeders 3 ; 1 TNO Human Factors, The Netherlands; 2 LSOP Police Knowledge and Expertise Centre, The Netherlands; 3 Netherlands Forensic Institute, The Netherlands An experiment was conducted to investigate the effects of retention interval, exposure duration and acoustic environment on speaker identification accuracy in voice line-ups. In addition, the relation between confidence assessments by participants and the test assistant and identification accuracy was explored. A total of 361 participants heard a single target voice in one of four exposure conditions (short or long speech sample, recorded only indoors or both indoors and outdoors). Half the participants were tested immediately after exposure to the target voice and half one week later. The results show that the target was correctly identified in 42% of cases. In the target-absent condition there were 51% false alarms. Acoustic environment did not affect identification accuracy. There was an interaction between speech duration and retention interval in the target-absent condition: after a one-week interval, listeners made fewer false identifications if the speech sample was long. No effects were found when participants were tested immediately. Only the confidence scores of the test assistant had predictive value. Taking the confidence score of the test assistant into account therefore increases the diagnostic value of the line-up. Session: OTuBc– Oral Emotion in Speech Time: Tuesday 10.00, Venue: Room 3 Chair: Elizabeth Shriberg, SRI, Menlo Park, USA Recognition of Emotions in Interactive Voice Response Systems Sherif Yacoub, Steve Simske, Xiaofan Lin, John Burns; Hewlett-Packard Laboratories, USA This paper reports emotion recognition results from speech signals, with particular focus on extracting emotion features from the short utterances typical of Interactive Voice Response (IVR) applications. We focus on distinguishing anger versus neutral speech, which is salient to call center applications. We also report on classification of other types of emotions such as sadness, boredom, happiness, and cold anger. We compare results from using neural networks, Support Vector Machines (SVM), K-Nearest Neighbors, and decision trees. We use a database from the Linguistic Data Consortium at the University of Pennsylvania, recorded by 8 actors expressing 15 emotions. Results indicate that hot anger and neutral utterances can be distinguished with over 90% accuracy. We show results from recognizing other emotions. We also illustrate which emotions can be clustered together using the selected prosodic features. Characteristics of Authentic Anger in Hebrew Speech Noam Amir, Shirley Ziv, Rachel Cohen; Tel Aviv University, Israel In this study we examine a number of characteristics of angry Hebrew speech. Whereas such studies are frequently carried out on acted speech, in this study we used recordings of participants in broadcast, politically oriented talk shows. The recordings were audited and rated for anger content by 11 listeners. 12 utterances judged to contain angry speech were then analyzed along with 12 utterances from the same speakers that were judged to contain neutral speech. Various statistics of the F0 curve and spectral tilt were calculated and correlated with the degree of anger, giving a number of interesting results: for example, though pitch range was significantly correlated to anger in general, pitch range was significantly negatively correlated to the degree of anger. A separate test was conducted, judging only the textual content of the utterances, to examine the degree to which it influenced the listening tests. After neutralizing for the textual content, some of the acoustic measures became weaker predictors of anger, whereas mean F0 remained the strongest indicator of anger. Spectral tilt also showed a significant decrease in angry speech. Prosody-Based Classification of Emotions in Spoken Finnish Tapio Seppänen, Eero Väyrynen, Juhani Toivanen; University of Oulu, Finland An emotional speech corpus of Finnish was collected that includes utterances of four emotional states of speakers. More than 40 prosodic features were derived and automatically computed for the speech samples. Statistical classification experiments with a kNN classifier and human listening tests indicate that emotion recognition performance comparable to human listeners can be achieved. We are not Amused – But How do You Know? User States in a Multi-Modal Dialogue System Anton Batliner, Viktor Zeißler, Carmen Frank, Johann Adelhardt, Rui P. Shi, Elmar Nöth; Universität Erlangen-Nürnberg, Germany For the multi-modal dialogue system SmartKom, emotional user states in a Wizard-of-Oz experiment, e.g. joyful, angry, or helpless, are annotated holistically and based purely on facial expressions; other phenomena (prosodic peculiarities, off-talk, i.e. speaking aside, etc.) are labelled as well. We present the correlations between these different annotations and report classification results using a large prosodic feature vector. The performance of the user state classification is not yet satisfactory; possible reasons and remedies are discussed. Frequency Distribution Based Weighted Sub-Band Approach for Classification of Emotional/Stressful Content in Speech Mandar A. Rahurkar, John H.L. Hansen; University of Colorado at Boulder, USA In this paper we explore the use of nonlinear Teager Energy Operator based features derived from multi-resolution sub-band analysis for classification of emotional/stressful speech. We propose a novel scheme for automatic sub-band weighting in an effort towards developing a generic algorithm for understanding emotion or stress in speech. We evaluate the proposed algorithm using a corpus of audio material from a military stressful Soldier of the Quarter Board evaluation panel. We establish classification performance of emotional/stressful speech using an open speaker set with open test tokens. With the new frequency distribution based scheme, we obtain a relative detection error reduction of 81.3% for stress speech and a 75.4% relative reduction in the neutral speech detection error rate. The results suggest an important step forward in establishing an effective processing scheme for developing generic models of neutral and emotional speech.
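Rahurkar and Hansen build their features on the nonlinear Teager Energy Operator, whose discrete form is Psi[x](n) = x(n)^2 - x(n-1)x(n+1); the sketch below computes a crude frame-level TEO feature with NumPy, leaving out the multi-resolution sub-band analysis and weighting that the paper actually uses, and the test signal is invented.

```python
import numpy as np

def teager_energy(x):
    """Discrete Teager Energy Operator: psi[n] = x[n]^2 - x[n-1]*x[n+1]."""
    x = np.asarray(x, dtype=float)
    return x[1:-1] ** 2 - x[:-2] * x[2:]

def frame_teo_feature(signal, frame_len=200, hop=80):
    """Mean Teager energy per frame, a crude stand-in for the paper's
    sub-band TEO features."""
    feats = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len]
        feats.append(teager_energy(frame).mean())
    return np.array(feats)

# Toy example: an amplitude- and frequency-modulated tone.
t = np.arange(8000) / 8000.0
sig = (1 + 0.3 * np.sin(2 * np.pi * 3 * t)) * np.sin(2 * np.pi * 440 * t)
print(frame_teo_feature(sig)[:5])
```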
Classifying Subject Ratings of Emotional Speech Using Acoustic Features Jackson Liscombe, Jennifer Venditti, Julia Hirschberg; Columbia University, USA This paper presents results from a study examining emotional speech using acoustic features and their use in automatic machine learning classification. In addition, we propose a classification scheme for the labeling of emotions on continuous scales. Our findings support those of previous research as well as indicate possible future directions utilizing spectral tilt and pitch contour to distinguish emotions in the valence dimension. Session: OTuBd– Oral Dialog System User & Domain Modeling Time: Tuesday 10.00, Venue: Room 4 Chair: Paul Dalsgaard, Center for PersonKommunikation (CPK) On-Line User Modelling in a Mobile Spoken Dialogue System Niels Ole Bernsen; University of Southern Denmark, Denmark The paper presents research on user modelling for an in-car spoken dialogue system, including the implementation of a generic user modelling module applied to the modelling of drivers' task objectives. Towards Dynamic Multi-Domain Dialogue Processing Botond Pakucs; KTH, Sweden This paper introduces SesaME, a generic dialogue management framework especially designed for supporting dynamic multi-domain dialogue processing. SesaME supports a multitude of highly distributed applications and facilitates simultaneous adaptation to individual users and their environment. The dynamic multi-domain dialogue processing is supported through the use of standardised and highly distributed domain descriptions. For fast, runtime handling of these domain descriptions a specially developed, dynamic plug-and-play solution is employed. This paper also describes how SesaME's functionality is evaluated within the framework of the PER demonstrator. User Modeling in Spoken Dialogue Systems for Flexible Guidance Generation Kazunori Komatani, Shinichi Ueno, Tatsuya Kawahara, Hiroshi G. Okuno; Kyoto University, Japan We address appropriate user modeling in order to generate cooperative responses to each user in spoken dialogue systems. Unlike previous studies that focus on users' knowledge or typical kinds of users, the proposed user model is more comprehensive. Specifically, we set up three dimensions of user models: skill level with the system, knowledge level of the target domain, and degree of hastiness. Moreover, the models are automatically derived by decision tree learning using real dialogue data. We obtained reasonable classification accuracy for all dimensions. Dialogue strategies based on the user modeling are implemented in the Kyoto city bus information system that has been developed at our laboratory. Experimental evaluation shows that the cooperative responses adaptive to individual users serve as good guidance for novice users without increasing the dialogue duration for skilled users. Empowering End Users to Personalize Dialogue Systems Through Spoken Interaction Empirical study of the syntax-prosody relation is hampered by the fact that current prosodic models are essentially linear, while syntactic structure is hierarchical. The present contribution describes a syntax-prosody comparison heuristic based on two new algorithms: Time Tree Induction (TTI), for building a prosodic treebank from time-annotated speech data, and Tree Similarity Indexing (TSI), for comparing syntactic trees with the prosodic trees. Two parametrisations of the TTI algorithm, for different tree branching conditions, are applied to sentences taken from a read-aloud narrative, and compared with parses of the same sentences, using the TSI.
In addition, null-hypotheses in the form of flat bracketing of the sentences are compared. A preference for iambic (heavy rightmost branch) grouping is found. The resulting quantitative evidence for syntaxprosody relations has applications in speech genre characterisation and in duration models for speech synthesis. Stephanie Seneff 1 , Grace Chung 2 , Chao Wang 1 ; 1 Massachusetts Institute of Technology, USA; 2 Corporation for National Research Initiatives, USA This paper describes recent advances we have made towards the goal of empowering end users to automatically expand the knowledge base of a dialogue system through spoken interaction, in order to personalize it to their individual needs. We describe techniques used to incrementally reconfigure a preloaded trained natural language grammar, as well as the lexicon and language models for the speech recognition system. We also report on advances in the technology to integrate a spoken pronunciation with a spoken spelling, in order to improve spelling accuracy. While the original algorithm was designed for a “speak and spell” input mode, we have shown here that the same methods can be applied to separately uttered spoken and spelled forms of the word. By concatenating the two waveforms, we can take advantage of the mutual constraints realized in an integrated composite FST. Using an OGI corpus of separately spoken and spelled names, we have demonstrated letter error rates of under 6% for in-vocabulary words and under 11% for words not contained in the training lexicon, a 44% reduction in error rate over that achieved without use of the spoken form. We anticipate applying this technique to unknown words embedded in a larger context, followed by solicited spellings. LET’S GO: Improving Spoken Dialog Systems for the Elderly and Non-Natives Antoine Raux, Brian Langner, Alan W. Black, Maxine Eskenazi; Carnegie Mellon University, USA With the recent improvements in speech technology, it is now possible to build spoken dialog systems that basically work. However, such systems are designed and tailored for the general population. When users come from less general sections of the population, such as the elderly and non-native speakers of English, the accuracy of dialog systems degrades. This paper describes Let’s Go, a dialog system specifically designed to allow dialog experiments to be carried out on the elderly and non-native speakers in order to better tune such systems for these important populations. Let’s Go is designed to provide Pittsburgh area bus information. The basic system is described and our initial experiments are outlined. Agents for Integrated Tutoring in Spoken Dialogue Systems Jaakko Hakulinen, Markku Turunen, Esa-Pekka Salonen; University of Tampere, Finland In this paper, we introduce the concept of integrated tutoring in speech applications. An integrated tutoring system teaches the use of a system to a user while he/she is using the system in a typical manner. Furthermore, we introduce the general principles of how to implement applications with integrated tutoring agents and present an example implementation for an existing e-mail system. The main innovation of the approach is that the tutoring agents are part of the application, but implemented in a way which makes it possible to plug them into the system without modifying it. This is possible due to a set of small, stateless agents and a shared Information Storage provided by our system architecture. 
Integrated tutoring agents are easily expandable and configurable, and general agents can be shared between applications. We have also received positive feedback about integrated tutoring in initial user tests conducted with the implementation. Session: PTuBf– Poster Phonology & Phonetics II Time: Tuesday 10.00, Venue: Main Hall, Level -1 Chair: Yoshinori Sagisaka, Waseda Univ., Japan Corpus-Based Syntax-Prosody Tree Matching Dafydd Gibbon; Universität Bielefeld, Germany A New Approach to Segment and Detect Syllables from High-Speed Speech D.W. Ying, W. Gao, W.Q. Wang; Chinese Academy of Sciences, China In this paper, we present a novel method to detect sound onsets and offsets, and apply it to detect and segment syllables from high-speed speech according to the characteristics of Mandarin. Our system detects onsets and offsets in 8 frequency bands with a two-layer integrate-and-fire neural network. The continuous speech is segmented based on the timing of onsets and offsets, and energy is used as an additional cue to locate the segmentation point. In order to improve the accuracy of segmentation, we introduce three time constraints by defining three refractory periods of neurons, which keep the syllable length from falling below a minimum. Although the boundaries between syllables in high-speed speech are not salient, our system can still segment individual syllables from speech robustly and accurately. Information Structure and Efficiency in Speech Production R.J.J.H. van Son, Louis C.W. Pols; University of Amsterdam, The Netherlands Speech is considered an efficient communication channel. This implies that the organization of utterances is such that more speaking effort is directed towards important parts than towards redundant parts. Based on a model of incremental word recognition, the importance of a segment is defined as its contribution to word disambiguation. This importance is measured as the segmental information content, in bits. On a labeled Dutch speech corpus it is then shown that crucial aspects of the information structure of utterances partition the segmental information content and explain 90% of the variance. Two measures of acoustical reduction, duration and spectral center of gravity, are correlated with the segmental information content in such a way that more important phonemes are less reduced. It is concluded that the organization of conventional information structure does indeed increase efficiency. Learning Rule Ranking by Dynamic Construction of Context-Free Grammars Using AND/OR Graphs Anna Corazza 1 , Louis ten Bosch 2 ; 1 University of Milan, Italy; 2 University of Nijmegen, The Netherlands This paper discusses a novel approach for the construction of a context-free grammar based on a sequential processing of sentences. The construction of the grammar is based on a search algorithm for the minimum weight subgraph in an AND/OR graph. Aspects of optimality and robustness are discussed. The algorithm plays an essential role in a model for adaptive learning of probabilistic ordering. The novelty in the proposed model is the combination of well-established methods from two different disciplines: graph theory and statistics. able" theme [p. 656]. We shall show that intonational marking of themes in German seems rather gradual. Themes in contrastive contexts have a significantly longer stressed vowel and a higher and longer rise, which results in a higher and more delayed peak than in non-contrastive themes. Moreover, speakers can use different strategies to signal the contrast. The set-up of this paper is mainly theoretical, and we follow a quite formal approach.
There is a close link with Optimality Theory, one of the mainstream approaches in phonology, and with graph theory. The resulting techniques, however, can be applied in a more general domain. Data were elicited by reading short paragraphs with a contrastive and non-contrastive pre-context. The use of many filler texts distracted subjects’ attention from the contrast so that the data may be regarded as highly natural. Implementing these prosodic features in speech synthesis systems might help to avoid unnatural exaggerated prosodic realisations. The Effect of Surrounding Phrase Lengths on Pause Duration Accentual Lengthening in Standard Chinese: Evidence from Four-Syllable Constituents Elena Zvonik, Fred Cummins; University College Dublin, Ireland Yiya Chen; University of Edinburgh, U.K. Little is known about the determining influences on the length of silent intervals at IP boundaries and no current models accurately predict their duration. The contribution of independent factors with different characteristic properties to pause duration needs to be explored. The present study seeks to investigate if pause duration is correlated with the length of sentences or phrases preceding and following a pause. We find that two independent factors – the length of an IP (intonational phrase) preceding a pause and the length of an IP following a pause combine superadditively. The probability of a pause being short (<300 ms) rises greatly if both the preceding and the following phrases are short(<=10 syllables). Statistical Estimation of Phoneme’s Most Stable Point Based on Universal Constraint This study examines the pattern of accentual lengthening (AL) over four-syllable mono-morphemic words in Standard Chinese (SC). I show that 1) the domain of AL in SC is best characterized as the constituent that is under focus; 2) the distribution of AL over a focused domain is non-uniform and there is a strong tendency of edge effect with the last syllable lengthened the most; and 3) different prosodic boundaries do not block but attenuate the spread of AL with different magnitudes. These results are also compared to the results of studies on AL in languages such as English and Dutch. While there are similarities of AL in these two typologically different languages, which open the possibility that some effects of AL are universal, there are clearly important differences in the way that AL is distributed over the focused constituent in different languages, due to the specific phonology of the language. Syllable Structure Based Phonetic Units for Context-Dependent Continuous Thai Speech Recognition Shigeki Okawa 1 , Katsuhiko Shirai 2 ; 1 Chiba Institute of Technology, Japan; 2 Waseda University, Japan In this paper, we present a statistical approach for phoneme extraction based on universal constraint. Inspired by former phonological studies, we assume a fictitious point in each phoneme that exhibits the most stable information to explain the phoneme’s existence. With the universal constraint of phoneme definitions, the point is statistically estimated by an iterative procedure to maximize the local likelihood using a large amount of speech data. We also mention a context dependent modeling of the proposed approach and its integration strategy to obtain more stability. The experimental results show favorable convergencies of both the fictitious points and their likelihoods, which give usefulness for the stable phoneme modeling. Independent Automatic Segmentation by Self-Learning Categorial Pronunciation Rules N. 
Beringer; Ludwig-Maximilians-Universität München , Germany The goal of this paper is to present a new method to automatically generate pronunciation rules for automatic segmentation of speech – the German MAUSER system. MAUSER is an algorithm which generates pronunciation rules independently of any domain dependent training data either by clustering and statistically weighting self-learned rules according to a small set of phonological rules clustered by categories or by reweighting “seen”’ phonological rules. By this method we are able to automatically segment cost-effectively large corpora of mainly unprompted speech. Prosodic Correlates of Contrastive and Non-Contrastive Themes in German Bettina Braun 1 , D. Robert Ladd 2 ; 1 Saarland University, Germany; 2 University of Edinburgh, U.K. Semantic theories on focus and information structure assume that there are different accent types for thematic (backward-looking, known) and rhematic (forward-looking, new) information in languages as English and German. According to Steedman [1], thematic material may only be intonationally marked (= bear a pitch accent), if it “contrasts with a different established or accommodat- Supphanat Kanokphara; NECTEC, Thailand Choice of the phonetic units for speech recognizer is a factor greatly affecting the system performance. Phonetic units are normally defined according to the acoustic properties in parts of speech. Nevertheless, with the limit of training data, too delicate acoustic properties are ignored. Syllable structure is one of the properties usually ignored in English phonetic units due to the structure complexity. Some language like Chinese successfully gets the benefit from incorporating this property in the phonetic units, as the language itself is naturally syllabic and has only small amount of subsegments (onsets, nuclei, and codas). Thai, as some point between English and Chinese, has more subsegments than Chinese but not as much as English. There are two main steps in this paper. First, prove that Thai phonetic units can be defined as a set of subsegments without any data sparseness problem. Second, demonstrate that subsegmental phonetic units give better accuracy rate from integrating the syllable structure information and reduce a lot of number of triphone units because of left and right context constraints in the syllable structure. An Acoustic Phonetic Analysis of Diphthongs in Ningbo Chinese Fang Hu; Chinese Academy of Social Sciences, China This paper describes the acoustic phonetic properties of diphthongs in Ningbo Chinese. Data from 20 speakers indicate that (1) falling diphthongs have both onset and offset steady states while rising diphthongs only have steady states on the offset element; (2) both falling and rising diphthongs begin from an onset frequency area close to their target vowels, but only the normal-length rising diphthongs reach the offset target, and falling and short rising diphthongs stop at somewhere before reaching the target; (3) diphthongs can be well characterized by the F2 rate of change as far the falling diphthongs are concerned, whereas data lack consistency when rising diphthongs are also taken into account. Results show that the temporal organization within diphthongs, formant patterns, and formant rate of change all contribute to the characterization of Ningbo diphthongs. 
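Hu characterises Ningbo diphthongs partly through the F2 rate of change; the sketch below estimates such a rate from a formant track by a straight-line fit (the track values are invented, and real measurements would come from a formant tracker).

```python
import numpy as np

def f2_rate_of_change(times_ms, f2_hz):
    """Slope of a straight-line fit to the F2 trajectory, in Hz per ms."""
    slope, _intercept = np.polyfit(times_ms, f2_hz, deg=1)
    return slope

# Invented F2 track for a falling-F2 diphthong: F2 moves from an /i/-like
# onset towards an /a/-like offset over about 120 ms.
times = np.linspace(0, 120, 13)  # ms
f2 = np.linspace(2200, 1400, 13) + np.random.default_rng(0).normal(0, 20, 13)

print("F2 rate of change: %.1f Hz/ms" % f2_rate_of_change(times, f2))
```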
28 Eurospeech 2003 Tuesday Latent Ability to Manipulate Phonemes by Japanese Preliterates in Roman Alphabet September 1-4, 2003 – Geneva, Switzerland the classification capability of related nonlinear features over broad phoneme classes. The results of these preliminary experiments indicate that the information carried by these novel nonlinear feature sets is important and useful. Takashi Otake, Yoko Sakamoto; Dokkyo University, Japan Recent studies in spoken word recognition show that Japanese listeners with or without alphabetic knowledge are accessible to phonemes during word activation. This suggests that even morabased language users can recognize a submoraic unit. The present study investigates a possibility of latent ability to manipulate phonemes to search and to construct new words by Japanese preliterates in Roman alphabet. Three experiments were conducted. In Experiment 1 it was tested whether they could search embedded words by deleting word initial consonants. In Experiments 2 and 3 it was tested whether they could construct new words by manipulating consonants and vowels at word initial and medial positions. The results show that they could successfully manage these tasks with high accuracy. These results suggests that they are likely to have latent ability to manipulate phonemes to search and to construct new words. The /i/-/a/-/u/-ness of Spoken Vowels Hartmut R. Pfitzinger; University of Munich, Germany This paper investigates acoustic, phonetic, and phonological representations of spoken vowels. For this purpose four experiments have been conducted. First, by drawing the analogy between the spectral energy distribution of vowels and the vowel space concept of Dependency Phonology, we achieve a new phonologically motivated vowel quality representation of spoken vowels which we name the /i/-/a/-/u/-ness. As a second step, it is shown that the extension of this approach is connected with the work of Pols, van der Kamp & Plomp 1969 [1] who, among other things, predicted formant frequencies from the spectral energy distribution of vowels. Third, the vowel quality relating to the IPA vowel diagram is derived directly from the spectral energy distribution. Finally, we compare this method with a formant and fundamental frequency based approach introduced by Pfitzinger 2003 [2]. While both the /i/-/a//u/-ness of vowels as well as the perceived vowel quality prediction are quite robust and therefore useful for both signal pre-processing and vowel quality research, the formant prediction achieved the lowest accuracy for the mapping to the IPA vowel diagram. Session: PTuBg– Poster Speech Modeling & Features II Time-Domain Based Temporal Processing with Application of Orthogonal Transformations Petr Motlíček, Jan Černocký; Brno University of Technology, Czech Republic In the paper, novel approach that efficiently extracts the temporal information of speech has been proposed. This algorithm is fully employed in time-domain, and the preprocessing blocks are well justified by psychoacoustic studies. The achieved results show the different properties of proposed algorithm compared to the traditional approach. The algorithm is advantageous in terms of possible modifications and computational inexpensiveness. Then, in our experiments, we have focused on different representation of time trajectories. Classical methods that are efficient in conventional feature extraction approaches showed not to be suitable to approximate temporal trajectories of speech. 
However, the application of some orthogonal transformations, such as discrete Fourier transform or discrete cosine transform, on top of previously derived temporal trajectories outperforms classification in original domain. In addition, these transformed features are very efficient to reduce the dimensionality of data. Recognition of Phoneme Strings Using TRAP Technique Petr Schwarz, Pavel Matějka, Jan Černocký; Brno University of Technology, Czech Republic We investigate and compare several techniques for automatic recognition of unconstrained context-independent phoneme strings from TIMIT and NTIMIT databases. Among the compared techniques, the technique based on TempoRAl Patterns (TRAP) achieves the best results in the clean speech, it achieves about 10% relative improvements against baseline system. Its advantage is also observed in the presence of mismatch between training and testing conditions. Issues such as the optimal length of temporal patterns in the TRAP technique and the effectiveness of mean and variance normalization of the patterns and the multi-band input the TRAP estimations, are also explored. Comparative Study on Hungarian Acoustic Model Sets and Training Methods Time: Tuesday 10.00, Venue: Main Hall, Level -1 Chair: Bojan Petek, University of Ljubljana, Slovenia Tibor Fegyó, Péter Mihajlik, Péter Tatai; Budapest University of Technology and Economics, Hungary A Computational Model of Arm Gestures in Conversation Dafydd Gibbon, Ulrike Gut, Benjamin Hell, Karin Looks, Alexandra Thies, Thorsten Trippel; Universität Bielefeld, Germany Currently no standardised gesture annotation systems are available. As a contribution towards solving this problem, CoGesT, a machine processable and human usable computational model for the annotation of a subset of conversational gestures is presented, its empirical and formal properties are detailed, and application areas are discussed. Nonlinear Analysis of Speech Signals: Generalized Dimensions and Lyapunov Exponents In recent speech recognition systems the base unit of recognition is generally the speech sound. To each speech sound an acoustic model is associated, whose parameters are estimated by statistical methods. The proper training data fundamentally determine the efficiency of the recognizer. Present day technology and computational capacity allow speech recognition systems to operate with large dictionaries and complex language models, but the quality of the basic pattern matching units has large influence on the reliability of the system. In our experiments presented here we investigated the effects of different training methods to the recognition accuracy; namely, the effect of increasing the number of speakers and the number of mixtures were examined in the case of pronunciation modeling and context independent models. F0 Estimation of One or Several Voices Alain de Cheveigné, Alexis Baskind; IRCAM-CNRS, France Vassilis Pitsikalis, Iasonas Kokkinos, Petros Maragos; National Technical University of Athens, Greece In this paper, we explore modern methods and algorithms from fractal/chaotic systems theory for modeling speech signals in a multidimensional phase space and extracting characteristic invariant measures like generalized fractal dimensions and Lyapunov exponents. 
Such measures can capture valuable information for the characterisation of the multidimensional phase space – which is closer to the true dynamics – since they are sensitive to the frequency with which the attractor visits different regions and the rate of exponential divergence of nearby orbits, respectively. Further we examine A methodology is presented for fundamental frequency estimation of one or more voices. The signal is modeled as the sum of one or more periodic signals, and the parameters estimated by search with interpolation. Accurate, reliable estimates are obtained for each frame without tracking or continuity constraints, and without the use of specific instrument models (although their use might further boost performance). In formal evaluation over a large database of speech, the single-voice algorithm outperformed the best competing methods by a factor of three. 29 Eurospeech 2003 Tuesday In Search Of Target Class Definition In Tandem Feature Extraction September 1-4, 2003 – Geneva, Switzerland GFA-HMM can achieve better performances over traditional HMM with the same amount of training data but much smaller number of model parameters. Sunil Sivadas, Hynek Hermansky; Oregon Health & Science University, USA In the tandem feature extraction scheme a Multi-Layer Perceptron (MLP) with softmax output layer is discriminatively trained to estimate context independent phoneme posterior probabilities on a labeled database. The outputs of the MLP after nonlinear transformation and Principal Component Analysis (PCA) are used as features in a Gaussian Mixture Model (GMM) based recognizer. The baseline tandem system is trained on 56 Context Independent (CI) phoneme targets. In this paper we examine alternatives to CI phoneme targets by grouping phonemes using apriori and and data-derived knowledge. On connected digit recognition task we achieve comparable performance to the baseline system using fewer data-derived classes. Segmentation of Speech for Speaker and Language Recognition André G. Adami, Hynek Hermansky; Oregon Health & Science University, USA Current Automatic Speech Recognition systems convert the speech signal into a sequence of discrete units, such as phonemes, and then apply statistical methods on the units to produce the linguistic message. Similar methodology has also been applied to recognize speaker and language, except that the output of the system can be the speaker or language information. Therefore, we propose the use of temporal trajectories of fundamental frequency and shortterm energy to segment and label the speech signal into a small set of discrete units that can be used to characterize speaker and/or language. The proposed approach is evaluated using the NIST Extended Data Speaker Detection task and the NIST Language Identification task. Feature Generation Based on Maximum Classification Probability for Improved Speech Recognition Xiang Li, Richard M. Stern; Carnegie Mellon University, USA Feature representation is a very important factor that has great effect on the performance of speech recognition systems. In this paper we focus on a feature generation process that is based on linear transformation of the original log-spectral representation. We first discuss several three popular linear transformation methods, MelFrequency Cepstral Coefficients (MFCC), Principal Component Analysis (PCA), and Linear Discriminant Analysis (LDA). 
We then propose a new method of linear transformation that maximizes the normalized acoustic likelihood of the most likely state sequences of training data, a measure that directly related to our ultimate objective of reducing Bayesian classification error rate in speech recognition. Experimental results show that the proposed method decreases the relative word error rate by more than 8.8% compared to the best implementation of LDA, and by more than 25.9% compared to MFCC features. Speech Recognition with a Generative Factor Analyzed Hidden Markov Model Learning Discriminative Temporal Patterns in Speech: Development of Novel TRAPS-Like Classifiers Barry Chen 1 , Shuangyu Chang 2 , Sunil Sivadas 3 ; 1 International Computer Science Institute, USA; 2 University of California at Berkeley, USA; 3 Oregon Health & Science University, USA Motivated by the temporal processing properties of human hearing, researchers have explored various methods to incorporate temporal and contextual information in ASR systems. One such approach, TempoRAl PatternS (TRAPS), takes temporal processing to the extreme and analyzes the energy pattern over long periods of time (500 ms to 1000 ms) within separate critical bands of speech. In this paper we extend the work on TRAPS by experimenting with two novel variants of TRAPS developed to address some shortcomings of the TRAPS classifiers. Both the Hidden Activation TRAPS (HATS) and Tonotopic Multi- Layer Perceptrons (TMLP) require 84% less parameters than TRAPS but can achieve significant phone recognition error reduction when tested on the TIMIT corpus under clean, reverberant, and several noise conditions. In addition, the TMLP performs training in a single stage and does not require critical band level training targets. Using these variants, we find that approximately 20 discriminative temporal patterns per critical band is sufficient for good recognition performance. In combination with a conventional PLP system, these TRAPS variants achieve significant additional performance improvements. Using Mutual Information to Design Class-Specific Phone Recognizers Patricia Scanlon 1 , Daniel P.W. Ellis 1 , Richard Reilly 2 ; 1 Columbia University, USA; 2 University College Dublin, Ireland Information concerning the identity of subword units such as phones cannot easily be pinpointed because it is broadly distributed in time and frequency. Continuing earlier work, we use Mutual Information as measure of the usefulness of individual timefrequency cells for various speech classification tasks, using the hand-annotations of the TIMIT database as our ground truth. Since different broad phonetic classes such as vowels and stops have such different temporal characteristics, we examine mutual information separately for each class, revealing structure that was not uncovered in earlier work; further structure is revealed by aligning the time-frequency displays of each phone at the center of their handmarked segments, rather than averaging across all possible alignments within each segment. Based on these results, we evaluate a range of vowel classifiers over the TIMIT test set and show that selecting input features according to the mutual information criteria can provides a significant increase in classification accuracy. Estimation of GMM in Voice Conversion Including Unaligned Data Helenca Duxans, Antonio Bonafonte; Universitat Politècnica de Catalunya, Spain Kaisheng Yao 1 , Kuldip K. 
Paliwal 2 , Te-Won Lee 1 ; 1 University of California at San Diego, USA; 2 Griffith University, Australia We present a generative factor analyzed hidden Markov model (GFA-HMM) for automatic speech recognition. In a traditional HMM, the observation vectors are represented by a mixture of Gaussians (MoG) that is dependent on a discrete-valued hidden state sequence. The GFA-HMM introduces a hierarchy of continuous-valued latent representations of the observation vectors, where latent vectors in one level are acoustic-unit dependent and latent vectors in a higher level are acoustic-unit independent. An expectation maximization (EM) algorithm is derived for maximum likelihood parameter estimation of the model. The GFA-HMM can achieve a much more compact representation of the intra-frame statistics of observation vectors than the traditional HMM. We conducted an experiment to show that the GFA-HMM can achieve better performance than the traditional HMM with the same amount of training data but a much smaller number of model parameters. Voice conversion consists in transforming a source speaker voice into a target speaker voice. There are many applications of voice conversion systems where the amount of training data from the source speaker and the target speaker is different. Usually, the amount of source data available is large, but it is desired to estimate the transformation with a small amount of target data. Systems based on joint Gaussian Mixture Models (GMM) are well suited to voice conversion [1], but they cannot deal with source data without its corresponding aligned target data. In this paper, two alternatives are studied to incorporate unaligned source data in the estimation of a GMM for a voice conversion task. It is shown that when only a limited amount of aligned parameters is available in the training step, including additional data from the source speaker alone increases the performance of the voice transformation. Trajectory Modeling Based on HMMs with the Explicit Relationship Between Static and Dynamic Features Keiichi Tokuda, Heiga Zen, Tadashi Kitamura; Nagoya Institute of Technology, Japan This paper shows that the HMM whose state output vector includes static and dynamic feature parameters can be reformulated as a trajectory model by imposing the explicit relationship between the static and dynamic features. The derived model, named trajectory HMM, can alleviate the limitations of HMMs: i) constant statistics within an HMM state and ii) the independence assumption of state output probabilities. We also derive a Viterbi-type training algorithm for the trajectory HMM. A preliminary speech recognition experiment based on N-best rescoring demonstrates that the training algorithm can improve the recognition performance significantly even though the trajectory HMM has the same parameterization as the standard HMM. On the Advantage of Frequency-Filtering Features for Speech Recognition with Variable Sampling Frequencies. Experiments with SpeechDatCar Databases Hermann Bauerecker, Climent Nadeu, Jaume Padrell; Universitat Politècnica de Catalunya, Spain When a speech recognition system has to work with signals corresponding to different sampling frequencies, multiple acoustic models may have to be maintained.
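Joint-GMM voice conversion of the kind cited above ([1]) is commonly realized as a mixture-weighted linear regression built from the joint source-target model. The sketch below shows one widely used form of that mapping under the assumption that a joint GMM has already been trained on aligned source-target vectors; it is not necessarily the exact formulation studied in the paper, and all variable names are illustrative.

    import numpy as np

    def convert_frame(x, weights, mu_x, mu_y, cov_xx, cov_yx):
        # Map one source vector x to the target space with a joint GMM.
        # weights: (M,), mu_x/mu_y: (M, d), cov_xx/cov_yx: (M, d, d),
        # all taken from a GMM trained on stacked [source; target] vectors.
        M, d = mu_x.shape
        resp = np.empty(M)
        for m in range(M):                     # posterior of each mixture given x
            diff = x - mu_x[m]
            inv = np.linalg.inv(cov_xx[m])
            expo = -0.5 * diff @ inv @ diff
            norm = np.sqrt(np.linalg.det(cov_xx[m]) * (2 * np.pi) ** d)
            resp[m] = weights[m] * np.exp(expo) / norm
        resp /= resp.sum()
        y = np.zeros(d)
        for m in range(M):                     # mixture-weighted linear regression
            y += resp[m] * (mu_y[m] +
                            cov_yx[m] @ np.linalg.inv(cov_xx[m]) @ (x - mu_x[m]))
        return y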
To avoid this drawback, the system can be trained at the highest expected sampling frequency and the acoustic models are posteriorly converted to a new sampling frequency. However, the usual mel-frequency cepstral coefficients are not well suited to this approach since they are not located in the frequency domain. For this reason, we propose in this paper to face that problem with the features resulting from frequency-filtering the logarithmic band energies. Experimental results are reported with SpeechDatCar databases, at 16 kHz, 11 kHz, and 8 kHz sampling rates, which show no degradation in terms of recognition performance for 11/8 kHz testing signals when the system, trained at 16 kHz, is converted, in an inexpensive way, to 11/8 kHz, instead of directly training the system at 11/8 kHz. Towards the Automatic Extraction of Fujisaki Model Parameters for Mandarin Harmonic Weighting for All-Pole Modeling of the Voiced Speech Davor Petrinovic; University of Zagreb, Croatia A new distance measure for all-pole modeling of voiced speech is introduced in this paper. It can easily be integrated within the concept of discrete Weighted Mean Square Error (WMSE) all-pole modeling, by a suitable choice of the modeling weights. The proposed weighting will address the problems such as: harmonic estimation reliability, perceptual significance of the harmonic and the model mismatch errors. The robust estimator is proposed, to reduce the effect of outliers caused by spectral nulls or additive non-speech contributions (e.g. background noise or music). It is demonstrated that the proposed all-pole estimation can significantly improve the performance of speech coders based on sinusoidal model, since the harmonic magnitudes are modeled better by the WMSE all-pole model. Estimation of Resonant Characteristics Based on AR-HMM Modeling and Spectral Envelope Conversion of Vowel Sounds Nobuyuki Nishizawa, Keikichi Hirose, Nobuaki Minematsu; University of Tokyo, Japan A new method was developed for accurately separating source and articulation filter characteristics of speech. This method is based on the AR-HMM modeling, where the residual waveform is expressed as the output sequence from an HMM. To realize an accurate analysis, a scheme of dividing HMM state was newly introduced. Using the AR-filter parameter values obtained through the analysis, we can construct a vocoder-type formant synthesizer, where the residual waveform is used as the excitation source. Through the listening test on the vowel sounds synthesized using AR-filter from a vowel and excitation waveform from another vowel, it was shown that a “flexible” synthesis with a high controllability on the acoustic parameters were possible by our formant synthesis configuration. Session: PTuBh– Poster Topics in Speech Recognition & Segmentation Hansjörg Mixdorff 1 , Hiroya Fujisaki 2 , Gao Peng Chen 3 , Yu Hu 3 ; 1 Berlin University of Applied Sciences, Germany; 2 University of Tokyo, Japan; 3 University of Science and Technology of China, China The generation of naturally-sounding F 0 contours in TTS enhances the intelligibility and perceived naturalness of synthetic speech. In earlier works the first author developed a linguistically motivated model of German intonation based on the quantitative Fujisaki model of the production process of F 0, and an automatic procedure for extracting the parameters from the F 0 contour which, however, was specific to German. 
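The frequency-filtering features advocated in the variable-sampling-frequency paper above are typically obtained by running a short FIR filter across the log filter-bank energies of each frame, which keeps the resulting parameters localized in frequency. A minimal sketch, assuming the commonly published filter H(z) = z - z^{-1} and edge replication at the band limits (both assumptions, not necessarily the authors' exact choices):

    import numpy as np

    def frequency_filtered_features(log_fbank):
        # log_fbank: (n_frames, n_bands) log filter-bank energies.
        # Applies H(z) = z - z^{-1} along the frequency axis, i.e.
        # FF[k] = logE[k+1] - logE[k-1], so each output stays frequency-localized.
        padded = np.pad(log_fbank, ((0, 0), (1, 1)), mode='edge')
        return padded[:, 2:] - padded[:, :-2]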
As has been shown by Fujisaki and his co-workers, parametrization of F0 contours of Mandarin requires negative tone commands, as well as a more precise control of F0 associated with the syllabic tones. This paper presents an approach to automatic parameter estimation for Mandarin, as well as first results concerning the accuracy of estimation. The paper also introduces a recently developed tool for editing Fujisaki parameters featuring resynthesis, which will soon be publicly available. Product of Gaussians as a Distributed Representation for Speech Recognition Time: Tuesday 10.00, Venue: Main Hall, Level -1 Chair: John Makhoul, BBN Technologies, USA Utterance Verification Under Distributed Detection and Fusion Framework Taeyoon Kim, Hanseok Ko; Korea University, Korea In this paper, we consider an application of the distributed detection and fusion framework to utterance verification (UV) and confidence measure (CM) objectives. We formulate UV as a distributed detection and Bayesian fusion problem by combining various individual UV methods. We essentially design an optimal fusion rule that achieves the minimum error rate. In the relevant isolated word OOV rejection experiments, the proposed method consistently outperforms the individual UV methods. Joint Estimation of Thresholds in a Bi-Threshold Verification Problem Simon Ho, Brian Mak; Hong Kong University of Science & Technology, China S.S. Airey, M.J.F. Gales; Cambridge University, U.K. Distributed representations allow the effective number of Gaussian components in a mixture model, or state of an HMM, to be increased without dramatically increasing the number of model parameters. Various forms of distributed representation have previously been investigated. In this work it is shown that the product of experts (PoE) framework may be viewed as a distributed representation when the individual experts are mixtures of Gaussians. However, in contrast to the standard PoE model, the individual experts are not required to be valid distributions, thus allowing additional flexibility in the component priors and variances. The performance of PoE models when used as a distributed representation on a large vocabulary speech recognition task, SwitchBoard, is evaluated. Verification problems are usually posed as a 2-class problem and the objective is to verify if an observation belongs to a class, say, A or its complement A’. However, we find that in a computer-assisted language learning application, because of the relatively low reliability of phoneme verification – with an equal-error-rate of more than 30% – a system built on a conventional phoneme verification algorithm needs to be improved. In this paper, we propose to cast the problem as a 3-class verification problem with the addition of an “in-between” class besides A and A’. As a result, there are two thresholds to be designed in such a system. Although one may determine the two thresholds independently, better performance can be obtained by a joint estimation of these thresholds by allowing small deviations from the specified false acceptance and false rejection rates. This paper describes a cost-based approach to do that. Furthermore, issues such as per-phoneme thresholds vs. phoneme-class thresholds, and the use of the bagging technique to improve the stability of thresholds, are investigated. Experimental results on a kids’ corpus show that cost-based thresholds and bagging improve verification performance. Confidence Measures for Phonetic Segmentation of Continuous Speech Samir Nefti 1 , Olivier Boëffard 1 , Thierry Moudenc 2 ; 1 IRISA, France; 2 France Télécom R&D, France In the context of text-to-speech synthesis, this contribution deals with the segmentation of speech into phone units.
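For readers unfamiliar with the Fujisaki model used in the Mandarin paper above, the log-F0 contour is generated by superposing a base value, phrase commands passed through one critically damped second-order filter, and accent/tone commands passed through another; for Mandarin, tone command amplitudes may be negative, as the abstract notes. The sketch below shows that superposition with illustrative parameter values rather than values estimated from data.

    import numpy as np

    def fujisaki_f0(t, fb, phrase_cmds, accent_cmds, alpha=2.0, beta=20.0):
        # ln F0(t) = ln Fb + sum_i Ap_i*Gp(t-T0_i)
        #                  + sum_j Aa_j*[Ga(t-T1_j) - Ga(t-T2_j)]
        # phrase_cmds: list of (T0, Ap); accent_cmds: list of (T1, T2, Aa),
        # where Aa may be negative for Mandarin tone commands.
        def Gp(x):   # phrase control: impulse response of a 2nd-order filter
            return np.where(x >= 0, alpha**2 * x * np.exp(-alpha * x), 0.0)
        def Ga(x):   # accent/tone control: step response with a ceiling of 0.9
            return np.where(x >= 0,
                            np.minimum(1 - (1 + beta * x) * np.exp(-beta * x), 0.9),
                            0.0)
        lnf0 = np.log(fb) * np.ones_like(t)
        for T0, Ap in phrase_cmds:
            lnf0 += Ap * Gp(t - T0)
        for T1, T2, Aa in accent_cmds:
            lnf0 += Aa * (Ga(t - T1) - Ga(t - T2))
        return np.exp(lnf0)

    # hypothetical example: one phrase command and one falling tone command
    # t = np.linspace(0, 2.0, 200)
    # f0 = fujisaki_f0(t, fb=120.0, phrase_cmds=[(0.0, 0.5)],
    #                  accent_cmds=[(0.3, 0.8, -0.3)])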
Using an HMM based segmentation system, we proceed to compare several phonelevel confidence measures to detect potential local mismatches between the phone labels and the acoustics. As well as serving this purpose, these confidence measures will help the system suggest a new local graph of hypotheses for the markovian segmentation system. We propose a new formulation of a frame-based posterior probability confidence measure which gives the best results for all of our experiments over a bench of six confidence measures. Adopting an hypothesis testing formulation, this posterior frame-based measure gives an EER of 12% for a randomly blurred test database. Using Confidence Measures and Domain Knowledge to Improve Speech Recognition Pascal Wiggers, Leon J.M. Rothkrantz; Delft University of Technology, The Netherlands In speech recognition domain knowledge is usually implemented by training specialized acoustic and language models. This requires large amounts of training data for the domain. When such data is not available there often still exists external knowledge, obtainable through other means, that might be used to constrain the search for likely utterances. This paper presents a number of methods to exploit such knowledge; an adaptive language model and a lattice rescoring approach based on Bayesian updating. To decide whether external knowledge is applicable a word level confidence measure is implemented. As a special case of the general problem station-to-station travel frequencies are considered to improve recognition accuracy in a train table dialog system. Experiments are described that test and compare the different techniques. Isolated Word Verification Using Cohort Word-Level Verification K. Thambiratnam, Sridha Sridharan; Queensland University of Technology, Australia Isolated Word Verification (IWV) is the task of verifying the occurrence of a keyword at a specified location within a speech stream. Typical applications of IWV are to reduce the number of incorrect results output by a speech recognizer or keyword spotter. Such algorithms are also vital in reducing the false alarm rate in many commercial applications of speech recognition, such as automated telephone transaction systems and audio database search engines. In this paper, we propose a new method of isolated word verification that we call Cohort Word-level Verification (CWV). The CWV method attempts to increase IWV performance by incorporating higher level linguistic and word level information into the selection of non-keyword models for verification. When used in conjunction with speech background model based IWV, we are able to achieve significant performance improvements for IWV of short words. September 1-4, 2003 – Geneva, Switzerland A New Approach to Minimize Utterance Verification Error Rate for a Specific Operating Point Wing-Hei Au, Man-Hung Siu; Hong Kong University of Science & Technology, China In many telephony applications that use speech recognition, it is important to identify and reject out-of-vocabulary words or utterances without keywords by means of utterance verification (UV). Typically, UV is performed based on the likelihood ratio of the target model versus an alternative model. The “goodness” of the models and the particular criteria used for estimating these models can have significant impact on its performance. Because the UV problem can be considered as a two-class classification problem, minimum classification error (MCE) training is a natural choice. 
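The likelihood-ratio style of verification running through several of the abstracts above reduces, per utterance or per phone, to comparing a duration-normalized log-likelihood difference against a threshold, or against two thresholds in the bi-threshold (3-class) formulation. A minimal sketch with hypothetical score inputs and threshold values:

    def verify(loglik_target, loglik_alternative, n_frames,
               accept_thr=0.0, reject_thr=-1.0):
        # Frame-normalized log likelihood ratio with an optional
        # "in-between" region, as in a bi-threshold verification setup.
        llr = (loglik_target - loglik_alternative) / max(n_frames, 1)
        if llr >= accept_thr:
            return 'accept'
        if llr < reject_thr:
            return 'reject'
        return 'uncertain'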
Earlier work has focused on MCE training to reduce total classification errors. In this paper, we extend the MCE approach to minimizing the error rates. In particular, we focus on the error rates at certain operating points and show how this can result in a significant EER reduction for phone verification on the TIMIT corpus and a non-native kids’ corpus. While the particular technique is developed for utterance verification, it can also be generalized to other verification tasks such as speaker verification. Continuous Speech Recognition and Verification Based on a Combination Score Binfeng Yan, Rui Guo, Xiaoyan Zhu; Tsinghua University, China In this paper we present a speech recognition and verification method based on the integration of likelihood and likelihood ratio. Speech recognition and verification are unified in a one-phase framework. A modified agglomerative hierarchical clustering algorithm is adopted to train the alternative model used in speech verification. In the decoding process, the likelihood ratio is combined with the likelihood to obtain the combination score used to search for the final results. Our experimental results show that the false-alarm rate decreases considerably with only a slight loss in accuracy. Impact of Word Graph Density on the Quality of Posterior Probability Based Confidence Measures Tibor Fabian, Robert Lieb, Günther Ruske, Matthias Thomae; Technical University of Munich, Germany Our new experimental results, presented in this paper, clearly demonstrate the dependence between word graph density and the quality of two different confidence measures. Both confidence measures are based on the computation of the posterior probabilities of the hypothesized words and apply the time alignment information of the word graph for confidence score accumulation. We show that the quality of the confidence scores of both confidence measures significantly increases for higher word graph densities. The analyses were carried out on two different German spontaneous speech corpora: the Verbmobil evaluation corpus [1] and the NaDia corpus. We achieved a relative reduction of the confidence error rate by up to 41.4%, compared to the baseline confidence error rate. The results lead us to propose performing the confidence score calculation – based on posterior probability accumulation – on higher word graph densities in order to obtain the best results. An Efficient Keyword Spotting Technique Using a Complementary Language for Filler Models Training Panikos Heracleous 1 , Tohru Shimizu 2 ; 1 Nara Institute of Science and Technology, Japan; 2 KDDI R&D Laboratories Inc., Japan The task of keyword spotting is to detect a set of keywords in the input continuous speech. In a keyword spotter, not only the keywords, but also the non-keyword intervals must be modeled. For this purpose, filler (or garbage) models are used. To date, most keyword spotters have been based on hidden Markov models (HMM). More specifically, a set of HMMs is used as garbage models. In this paper, a two-pass keyword spotting technique based on bilingual hidden Markov models is presented. In the first pass, our technique uses phonemic garbage models to represent the non-keyword intervals, and in the second pass the putative hits are verified using normalized scores. The main difference from similar approaches lies in the way the non-keyword intervals are modeled. In this work, the target language is Japanese, and English was chosen as the ‘garbage’ language for training the phonemic garbage models.
Experimental results on both clean and noisy telephone speech data showed higher performance compared with using a common set of acoustic models. Moreover, parameter tuning (e.g. word insertion penalty tuning) does not have a serious effect on the performance. For a vocabulary of 100 keywords and using clean telephone speech test data we achieved a 92.04% recognition rate with only a 7.96% false alarm rate, and without word insertion penalty tuning. Using noisy telephone speech test data we achieved a 87.29% recognition rate with only a 12.71% false alarm rate. Context-Sensitive Evaluation and Correction of Phone Recognition Output September 1-4, 2003 – Geneva, Switzerland Integrating Statistical and Rule-Based Knowledge for Continuous German Speech Recognition René Beutler, Beat Pfister; ETH Zürich, Switzerland A new approach to continuous speech recognition (CSR) for German is presented, which integrates both statistical knowledge (at the acoustic-phonetic level) and rule-based knowledge (at the word and sentence levels).We introduce a flexible framework allowing bidirectional processing and virtually any search strategy given an acoustic model and a context-free grammar. An implementation of this class of recognizers by means of a word spotter and an island chart parser is presented. A word recognition accuracy of 93.5% is reported on a speaker dependent recognition task with a 4k words dictionary. A Fast, Accurate and Stream-Based Speaker Segmentation and Clustering Algorithm An Vandecatseye, Jean-Pierre Martens; Ghent University, Belgium Michael Levit 1 , Hiyan Alshawi 1 , Allen Gorin 1 , Elmar Nöth 2 ; 1 AT&T Labs-Research, USA; 2 Universität Erlangen-Nürnberg, Germany In speech and language processing, information about the errors made by a learning system is commonly used to assess and improve its performance. Because of high computational complexity, the context of the errors is usually either ignored, or exploited in a simplistic form. The complexity becomes tractable, however, for phone recognition because of the small lexicon. For phone-based systems, an exhaustive modeling of local context is possible. Furthermore, recent research studies have shown phone recognition to be useful for several spoken language processing tasks. In this paper, we present a mechanism which learns patterns of context-sensitive errors from ASR-output aligned with the “true” phone transcriptions. We also show how this information, encoded as a context-sensitive weighted transducer, can provide a modest improvement to phone recognition accuracy even when no transcriptions are available for the domain of interest. Estimating Speech Recognition Error Rate Without Acoustic Test Data Yonggang Deng 1 , Milind Mahajan 2 , Alex Acero 2 ; 1 Johns Hopkins University, USA; 2 Microsoft Research, USA We address the problem of estimating the word error rate (WER) of an automatic speech recognition (ASR) system without using acoustic test data. This is an important problem which is faced by the designers of new applications which use ASR. Quick estimate of WER early in the design cycle can be used to guide the decisions involving dialog strategy and grammar design. Our approach involves estimating the probability distribution of the word hypotheses produced by the underlying ASR system given the text test corpus. A critical component of this system is a phonemic confusion model which seeks to capture the errors made by ASR on the acoustic data at a phonemic level. 
We use a confusion model composed of probabilistic phoneme sequence conversion rules which are learned from phonemic transcription pairs obtained by leave-one-out decoding of the training set. We show reasonably close estimation of WER when applying the system to test sets from different domains. Multigram-Based Grapheme-to-Phoneme Conversion for LVCSR M. Bisani, Hermann Ney; RWTH Aachen, Germany Many important speech recognition tasks feature an open, constantly changing vocabulary. (E.g. broadcast news transcription, spoken document retrieval, . . . ) Recognition of (new) words requires acoustic baseforms for them to be known. Commonly words are transcribed manually, which poses a major burden on vocabulary adaptation and inter-domain portability. In this work we investigate the possibility of applying a data-driven grapheme-to-phoneme converter to obtain the necessary phonetic transcriptions. Experiments were carried out on English and German speech recognition tasks. We study the relation between transcription quality and word error rate and show that manual transcription effort can be reduced significantly by this method with acceptable loss in performance. In this paper a new pre-processor for a free speech transcription system is described. It performs a speech/non-speech partition, a segmentation of the speech parts into speaker turns, and a clustering of the speaker turns. It works in a stream-based mode, and it is aiming for a high accuracy with a low delay and processing time. Experiments on the Hub4 Broadcast News corpus show that the newly proposed pre-processor is competitive with and in some respects better than the best systems published so far. The paper also describes attempts to raise the system performance by supplementing the standard MFCC features with prosodic features such as pitch and voicing evidence. A Sequential Metric-Based Audio Segmentation Method via the Bayesian Information Criterion Shi-sian Cheng, Hsin-Min Wang; Academia Sinica, Taiwan In this paper, we propose a sequential metric-based audio segmentation method that has the advantage of low computation cost of metric-based methods and the advantage of high accuracy of modelselection-based methods. There are two major differences between our method and the conventional metric-based methods:(1) Each changing point has multiple chances to be detected by different pairs of windows, rather than only once by its neighboring acoustic information.(2) By introducing the Bayesian Information Criterion(BIC) into the distance computation of two windows, we can deal with the thresholding issue more easily. We used five onehour broadcast news shows for experiments, and the experimental results show that our method performs as well as the modelselection-based methods, but with a lower computation cost. Sentence Boundary Detection in Arabic Speech Amit Srivastava, Francis Kubala; BBN Technologies, USA This paper presents an automatic system to detect sentence boundaries in speech recognition transcripts. Two systems were developed that use independent sources of information. One is a linguistic system that uses linguistic features in a statistical language model while the other is an acoustic system that uses prosodic features in a feed-forward neural network model. A third system was developed that combines the scores from the acoustic and the linguistic systems in a Maximum-Likelihood framework. 
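The BIC-based distance at the heart of the sequential segmentation method of Cheng and Wang above compares modeling two adjacent analysis windows with a single Gaussian versus one Gaussian per window; a positive ΔBIC favors a change point. A minimal sketch with full-covariance Gaussians and a hypothetical penalty weight:

    import numpy as np

    def delta_bic(x, y, lam=1.0):
        # x, y: (n1, d) and (n2, d) feature windows on either side of a
        # candidate change point; values > 0 favor an acoustic change.
        z = np.vstack([x, y])
        n1, n2, n = len(x), len(y), len(z)
        d = z.shape[1]
        logdet = lambda m: np.linalg.slogdet(np.cov(m, rowvar=False))[1]
        penalty = 0.5 * lam * (d + 0.5 * d * (d + 1)) * np.log(n)
        return 0.5 * (n * logdet(z) - n1 * logdet(x) - n2 * logdet(y)) - penalty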
All systems outlined in this paper are essentially language-independent, but all our experiments were conducted on Arabic Broadcast News speech recognition transcripts. Our experiments show that while the acoustic system outperforms the linguistic system, the combined system achieves the best performance at detecting sentence boundaries. Automated Transcription and Topic Segmentation of Large Spoken Archives Martin Franz, Bhuvana Ramabhadran, Todd Ward, Michael Picheny; IBM T.J. Watson Research Center, USA Digital archives have emerged as the pre-eminent method for capturing the human experience. Before such archives can be used efficiently, their contents must be described. The scale of such archives, along with the associated content mark-up cost, makes it impractical to provide access via purely manual means, but automatic technologies for search in spoken materials still have relatively limited capabilities. The NSF-funded MALACH project will use the world’s largest digital archive of video oral histories, collected by the Survivors of the Shoah Visual History Foundation (VHF), to make a quantum leap in the ability to access such archives by advancing the state of the art in Automated Speech Recognition (ASR), Natural Language Processing (NLP) and related technologies [1, 2]. This corpus consists of over 115,000 hours of unconstrained, natural speech from 52,000 speakers in 32 different languages, filled with disfluencies, heavy accents, age-related coarticulations, and un-cued speaker and language switching. This paper discusses some of the ASR and NLP tools and technologies that we have been building for the English speech in the MALACH corpus. We also discuss this new test bed while emphasizing the unique characteristics of this corpus. Automatic Disfluency Identification in Conversational Speech Using Multiple Knowledge Sources Yang Liu 1 , Elizabeth Shriberg 2 , Andreas Stolcke 2 ; 1 International Computer Science Institute, USA; 2 SRI International, USA Disfluencies occur frequently in spontaneous speech. Detection and correction of disfluencies can make automatic speech recognition transcripts more readable for human readers, and can aid downstream processing by machine. This work investigates a number of knowledge sources for disfluency detection, including acoustic-prosodic features, a language model (LM) to account for repetition patterns, a part-of-speech (POS) based LM, and rule-based knowledge. Different components are designed for different purposes in the system. Results show that detection of disfluency interruption points is best achieved by a combination of prosodic cues, word-based cues, and POS-based cues. The onset of a disfluency to be removed, in contrast, is best found using knowledge-based rules. Finally, detection of specific disfluency types can be aided by the modeling of word patterns. Topic Segmentation and Retrieval System for Lecture Videos Based on Spontaneous Speech Recognition Natsuo Yamamoto 1 , Jun Ogata 2 , Yasuo Ariki 1 ; 1 Ryukoku University, Japan; 2 AIST, Japan In this paper, we propose a method for segmenting continuous lecture speech into topics. A lecture includes several topics, but it is difficult to judge their boundaries. To solve this problem, transcriptions obtained by spontaneous speech recognition of the lecture speech are associated with the textbook used in the lecture. This method achieved high topic segmentation performance, with an average of 93.7%.
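The association between recognized lecture transcriptions and the textbook described just above can be approximated by simple term-vector matching: each window of transcript tokens is compared against each textbook section and assigned to the most similar one. This is only a hedged sketch of the general idea, not the authors' algorithm; tokenization and windowing are assumed to be done elsewhere.

    import numpy as np
    from collections import Counter

    def cosine(a, b):
        # cosine similarity between two term-count dictionaries
        keys = set(a) | set(b)
        va = np.array([a.get(k, 0) for k in keys], dtype=float)
        vb = np.array([b.get(k, 0) for k in keys], dtype=float)
        denom = np.linalg.norm(va) * np.linalg.norm(vb)
        return float(va @ vb / denom) if denom else 0.0

    def assign_topics(transcript_windows, textbook_sections):
        # both arguments are lists of token lists; returns, for each transcript
        # window, the index of the most similar textbook section
        section_vecs = [Counter(sec) for sec in textbook_sections]
        return [max(range(len(section_vecs)),
                    key=lambda i: cosine(Counter(win), section_vecs[i]))
                for win in transcript_windows]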
Incorporating this method, we constructed a system where we can view an interesting part of lecture videos, by specifying the chapters or sections as well as keywords. Session: OTuCa– Oral Robust Speech Recognition - Acoustic Modeling Time: Tuesday 13.30, Venue: Room 1 Chair: Richard Stern, CMU, USA Hybrid HMM/BN ASR System Integrating Spectrum and Articulatory Features Konstantin Markov 1 , Jianwu Dang 2 , Yosuke Iizuka 2 , Satoshi Nakamura 1 ; 1 ATR-SLT, Japan; 2 JAIST, Japan In this paper, we describe automatic speech recognition system where features extracted from human speech production system in form of articulatory movements data are effectively integrated in the acoustic model for improved recognition performance. The September 1-4, 2003 – Geneva, Switzerland system is based on the hybrid HMM/BN model, which allows for easy integration of different speech features by modeling probabilistic dependencies between them. In addition, features like articulatory movements, which are difficult or impossible to obtain during recognition, can be left hidden, in fact eliminating the need of their extraction. The system was evaluated in phoneme recognition task on small database consisting of three speakers’ data in speaker dependent and multi-speaker modes. In both cases, we obtained higher recognition rates compared to conventional, spectrum based HMM system with the same number of parameters. Context-Dependent Output Densities for Hidden Markov Models in Speech Recognition Georg Stemmer, Viktor Zeißler, Christian Hacker, Elmar Nöth, Heinrich Niemann; Universität Erlangen-Nürnberg, Germany In this paper we propose an efficient method to utilize context in the output densities of HMMs. State scores of a phone recognizer are integrated into the HMMs of a word recognizer which makes their output densities context-dependent. A significant reduction of the word error rate has been achieved when the approach is evaluated on a set of spontaneous speech utterances. As we can expect that context is more important for some phone models than for others, we further extend the approach by state-dependent weighting factors which are used to control the influence of the different information sources. A small additional improvement has been achieved. Time Adjustable Mixture Weights for Speaking Rate Fluctuation Takahiro Shinozaki, Sadaoki Furui; Tokyo Institute of Technology, Japan One of the most serious problems in spontaneous speech recognition is the degradation of recognition accuracy due to the speaking rate fluctuation in an utterance. This paper proposes a method for adjusting mixture weights of an HMM frame by frame depending on the local speaking rate. The proposed method is implemented using the Bayesian network framework. A hidden variable representing the variation of the “mode” of the speaking rate is introduced and its value controls the mixture weights of Gaussian mixtures. Model training and maximum probability assignment of the variables are conducted using the EM/GEM and inference algorithms for Bayesian networks. The Bayesian network is used to rescore the acoustic likelihood of the hypotheses in N-best lists. Experimental results show that the proposed method improves word accuracy by 1.6% for the absolute value on meeting speech given the speaking rate information, whereas improvement by a regression HMM is less significant. 
A Switching Linear Gaussian Hidden Markov Model and Its Application to Nonstationary Noise Compensation for Robust Speech Recognition Jian Wu, Qiang Huo; University of Hong Kong, China The Switching Linear Gaussian (SLG) model was proposed recently for time series data with nonlinear dynamics. In this paper, we present a new modelling approach, called SLGHMM, that uses a hybrid Dynamic Bayesian Network of SLG models and Continuous Density HMMs (CDHMMs) to compensate for the nonstationary distortion that may exist in the speech utterance to be recognized. With this representation, the CDHMMs (each modelling mainly the linguistic information of a speech unit) and a set of linear Gaussian models (each modelling a kind of stationary distortion) can be jointly learnt from multi-condition training data. Such an SLGHMM is able to model approximately the distribution of speech corrupted by switching-condition distortions. The effectiveness of the proposed approach is confirmed in noisy speech recognition experiments on the Aurora2 task. On Factorizing Spectral Dynamics for Robust Speech Recognition Vivek Tyagi, Iain A. McCowan, Hervé Bourlard, Hemant Misra; IDIAP, Switzerland In this paper, we introduce new dynamic speech features based on the modulation spectrum. These features, termed Mel-cepstrum Modulation Spectrum (MCMS), map the time trajectories of the spectral dynamics into a series of slow- and fast-moving orthogonal components, providing a more general and discriminative range of dynamic features than traditional delta and acceleration features. The features can be seen as the outputs of an array of band-pass filters spread over the cepstral modulation frequency range of interest. In experiments, it is shown that, as well as providing a slight improvement in clean conditions, these new dynamic features yield a significant increase in speech recognition performance in various noise conditions when compared directly to the standard temporal derivative features and RASTA-PLP features. Joint Model and Feature Based Compensation for Robust Speech Recognition Under Non-Stationary Noise Environments Chuan Jia, Peng Ding, Bo Xu; Chinese Academy of Sciences, China This paper presents a novel compensation approach, implemented in both the model and feature spaces, for non-stationary noise. Because non-stationary noise can be decomposed into a constant part and a residual noise part, our proposed scheme is performed in two steps: before recognition, an extended Jacobian adaptation (JA) is applied to adapt the speech models for the constant part of the noise; during recognition, the power spectra of the noisy speech are compensated to eliminate the effect of the residual noise part. As verified by experiments performed under different stationary and non-stationary noise environments, the proposed JA is superior to the basic JA and the joint approach is better than compensation in a single space. Session: STuCb– Oral Advanced Machine Learning Algorithms for Speech & Language Processing Time: Tuesday 13.30, Venue: Room 2 Chair: Mazin Rahim, AT&T Res., USA Robust Multi-Class Boosting Gunnar Rätsch; Fraunhofer FIRST, Germany Boosting approaches are based on the idea that high-quality learning algorithms can be formed by repeated use of a “weak-learner”, which is required to perform only slightly better than random guessing. It is known that Boosting can lead to drastic improvements compared to the individual weak-learner.
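Returning to the MCMS dynamic features described in the Tyagi et al. abstract above, such features amount to passing each cepstral-coefficient trajectory through a bank of FIR band-pass filters over the modulation-frequency range of interest; the standard delta feature is one special case of such a filter. The sketch below shows that filtering step only, with the filter impulse responses left as assumptions.

    import numpy as np

    def modulation_features(cepstra, filters):
        # cepstra: (n_frames, n_ceps) cepstral trajectories.
        # filters: list of 1-D FIR impulse responses, each acting as a band-pass
        # filter on the cepstral modulation spectrum (a simple difference filter
        # reproduces delta-style dynamics). Returns the filtered trajectories
        # stacked per filter.
        outs = []
        for h in filters:
            filtered = np.apply_along_axis(
                lambda c: np.convolve(c, h, mode='same'), 0, cepstra)
            outs.append(filtered)
        return np.concatenate(outs, axis=1)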
For two-class problems it has been shown that the original Boosting algorithm, called AdaBoost, is quite unaffected by overfitting. However, for the case of noisy data, it is also understood that AdaBoost can be improved considerably by introducing some regularization technique. In speech-related problems one often considers multi-class problems and Boosting formulations have been used successfully to solve them. I review existing multi-class boosting algorithms, which have been much less analyzed and explored than the two-class pendants. In this work I extend these methods to derive new boosting algorithms which are more robust against outliers and noise in the data and are able to exploit prior knowledge about relationships between the classes. Statistical Signal Processing with Nonnegativity Constraints Lawrence K. Saul, Fei Sha, Daniel D. Lee; University of Pennsylvania, USA Nonnegativity constraints arise frequently in statistical learning and pattern recognition. Multiplicative updates provide natural solutions to optimizations involving these constraints. One well known set of multiplicative updates is given by the Expectation- Maximization algorithm for hidden Markov models, as used in automatic speech recognition. Recently, we have derived similar algorithms for nonnegative deconvolution and nonnegative quadratic programming. These algorithms have applications to low-level problems in voice processing, such as fundamental frequency estimation, as well as high-level problems, such as the training of large margin classifiers. In this paper, we describe these algorithms and the ideas that connect them. Inline Updates for HMMs Ashutosh Garg 1 , Manfred K. Warmuth 2 ; 1 IBM Corporation, USA; 2 University of California at Santa Cruz, USA Weighted Automata Kernels – General Framework and Algorithms Corinna Cortes, Patrick Haffner, Mehryar Mohri; AT&T Labs-Research, USA Kernel methods have found in recent years wide use in statistical learning techniques due to their good performance and their computational efficiency in high-dimensional feature space. However, text or speech data cannot always be represented by the fixed-length vectors that the traditional kernels handle. We recently introduced a general kernel framework based on weighted transducers, rational kernels, to extend kernel methods to the analysis of variable-length sequences and weighted automata [5] and described their application to spoken-dialog applications. We presented a constructive algorithm for ensuring that rational kernels are positive definite symmetric, a property which guarantees the convergence of discriminant classification algorithms such as Support Vector Machines, and showed that many string kernels previously introduced in the computational biology literature are special instances of such positive definite symmetric rational kernels [4]. This paper reviews the essential results given in [5, 3, 4] and presents them in the form of a short tutorial. Most training algorithms for HMMs assume that the whole batch of observation sequences is given ahead of time. This is particularly the case for the standard EM algorithm. However, in many applications such as speech, the data is generated by a temporal process. Singer and Warmuth developed online updates for HMMs that process a single observation sequence in each update. In this paper we take this approach one step further and develop an inline update for training HMMs. 
Now the parameters are updated after processing a single symbol of the current observation sequence. The methodology for deriving the online and the new inline update is quite different from the standard EM motivation. We show experimentally on speech data that even when all observation sequences are available (batch mode), then the online update converges faster than the batch update, and the inline update converges even faster. The standard batch EM update exhibits the slowest convergence. Factorial Models and Refiltering for Speech Separation and Denoising Large Margin Methods for Label Sequence Learning Sam T. Roweis; University of Toronto, Canada Yasemin Altun, Thomas Hofmann; Brown University, USA This paper proposes the combination of several ideas, some old and some new, from machine learning and speech processing. We review the max approximation to log spectrograms of mixtures, show why this motivates a “refiltering” approach to separation and denoising, and then describe how the process of inference in factorial probabilistic models performs a computation useful for deriving the masking signals needed in refiltering. A particularly simple model, factorial-max vector quantization (MAXVQ), is introduced along with a branch-and-bound technique for efficient exact inference and applied to both denoising and monaural separation. Our approach represents a return to the ideas of Ephraim, Varga and Moore but applied to auditory scene analysis rather than to speech recognition. Label sequence learning is the problem of inferring a state sequence from an observation sequence, where the state sequence may encode a labeling, annotation or segmentation of the sequence. In this paper we give an overview of discriminative methods developed for this problem. Special emphasis is put on large margin methods by generalizing multiclass Support Vector Machines and AdaBoost to the case of label sequences. An experimental evaluation demonstrates the advantages over classical approaches like Hidden Markov Models and the competitiveness with methods like Conditional Random Fields. 35 Eurospeech 2003 Tuesday September 1-4, 2003 – Geneva, Switzerland ality reduction tasks in continuous speech recognition systems. A new type of feature transformation, LP transformation, is proposed and its performance is compared to those of LDA and PCA transformations. Session: OTuCc– Oral Speech Modeling & Features III Time: Tuesday 13.30, Venue: Room 3 Chair: Daniel Ellis, Columbia Univ., USA Distributed Speech Recognition on the WSJ Task Jan Stadermann, Gerhard Rigoll; Technische Universitaet Muenchen, Germany Band-Independent Speech-Event Categories for TRAP Based ASR Hynek Hermansky, Pratibha Jain; Oregon Health & Science University, USA Band-independent categories are investigated for feature estimation in ASR. These categories represent distinct speech-events manifested in frequency-localized temporal patterns of the speech signal. A universal, single estimator is proposed for estimating speechevent posterior probabilities using temporal patterns of criticalband energies for all the bands. The estimated posteriors are used as the input features (referred to as speech-event features) to a backend recognizer. These features are evaluated on continuous OGIDigits task. The features are also evaluated on Aurora-2 and Aurora3 tasks in a Distributed Speech Recognition (DSR) framework. 
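The LDA transform that several of the feature-transformation abstracts above compare against is the classical Fisher discriminant: find directions maximizing between-class scatter relative to within-class scatter and keep the leading ones. A minimal sketch of that computation, assuming labeled training features and not any particular paper's exact recipe:

    import numpy as np

    def lda_projection(X, y, n_components):
        # X: (n_samples, d) features, y: (n_samples,) class labels.
        mean = X.mean(axis=0)
        d = X.shape[1]
        Sw = np.zeros((d, d))   # within-class scatter
        Sb = np.zeros((d, d))   # between-class scatter
        for c in np.unique(y):
            Xc = X[y == c]
            mc = Xc.mean(axis=0)
            Sw += (Xc - mc).T @ (Xc - mc)
            diff = (mc - mean)[:, None]
            Sb += len(Xc) * (diff @ diff.T)
        # generalized eigenproblem Sb v = lambda Sw v
        eigvals, eigvecs = np.linalg.eig(np.linalg.solve(Sw, Sb))
        order = np.argsort(eigvals.real)[::-1]
        W = eigvecs.real[:, order[:n_components]]
        return X @ W, W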
These features are compared with earlier proposed broad-phonetic TRAPs features estimated from temporal patterns using independent estimators in each critical-band. Local Averaging and Differentiating of Spectral Plane for TRAP-Based ASR A comparison of traditional continuous speech recognizers with hybrid tied-posterior systems in distributed environments is presented for the first time on a challenging medium vocabulary task. We show how monophone and triphone systems are affected if speech features are sent over a wireless channel with limited bandwidth. The algorithms are evaluated on the Wall Street Journal database (WSJ0) and the results show that our monophone tiedposterior recognizer outperforms the traditional methods on this task by a dramatic reduction of the performance loss by a factor of 4 compared to non-distributed recognizers. Integrating Multilingual Articulatory Features into Speech Recognition Sebastian Stüker 1 , Florian Metze 1 , Tanja Schultz 2 , Alex Waibel 2 ; 1 Universität Karlsruhe, Germany; 2 Carnegie Mellon University, USA Minimum Variance Distortionless Response on a Warped Frequency Scale The use of articulatory features, such as place and manner of articulation, has been shown to reduce the word error rate of speech recognition systems under different conditions and in different settings. For example recognition systems based on features are more robust to noise and reverberation. In earlier work we showed that articulatory features can compensate for inter language variability and can be recognized across languages. In this paper we show that using cross- and multilingual detectors to support an HMM based speech recognition system significantly reduces the word error rate. By selecting and weighting the features in a discriminative way, we achieve an error rate reduction that lies in the same range as that seen when using language specific feature detectors. By combining feature detectors from many languages and training the weights discriminatively, we even outperform the case where only monolingual detectors are being used. Matthias Wölfel 1 , John McDonough 1 , Alex Waibel 2 ; 1 Universität Karlsruhe, Germany; 2 Carnegie Mellon University, USA Session: OTuCd– Oral Multi-Modal Spoken Language Processing František Grézl, Hynek Hermansky; Oregon Health & Science University, USA Local frequency and time averaging and differentiating operators, using three neighboring points of critical-band time-frequency plane, are used to process the plane prior to its use in TRAP-based ASR. In that way, five alternative TRAP-based ASR systems (the original one and the time/frequency integrated/ differentiated ones)are created. We show that the frequency differentiating operator improves performance of the TRAP-based ASR. In this work we propose a time domain technique to estimate an all-pole model based on the minimum variance distortionless response (MVDR) using a warped short time frequency axis such as the Mel scale. The use of the MVDR eliminates the overemphasis of harmonic peaks typically seen in medium and high pitched voiced speech when spectral estimation is based on linear prediction (LP). Moreover, warping the frequency axis prior to MVDR spectral estimation ensures more parameters in the spectral model are allocated to the low, as opposed to high, frequency regions of the spectrum, thereby mimicking the human auditory system. 
In a series of speech recognition experiments on the Switchboard Corpus (spontaneous English telephone speech), the proposed approach achieved a word error rate (WER) of 32.1% for female speakers, which is clearly superior to the 33.2% WER obtained by the usual combination of Mel warping and linear prediction. Improving the Efficiency of Automatic Speech Recognition by Feature Transformation and Dimensionality Reduction Xuechuan Wang, Douglas O’Shaughnessy; Université du Québec, Canada In speech recognition systems, feature extraction can be achieved in two steps: parameter extraction and feature transformation. Feature transformation is an important step. It can concentrate the energy distributions of a speech signal onto fewer dimensions than those of parameter extraction and thus reduce the dimensionality of the system. Linear Discriminant Analysis (LDA) and Principal Component Analysis (PCA) are the two popular feature transformation methods. This paper investigates their performances in dimension- Time: Tuesday 13.30, Venue: Room 4 Chair: Roger Moore, 20/20 Speech, United Kingdom Using Corpus-Based Methods for Spoken Access to News Texts on the Web Alexandra Klein 1 , Harald Trost 2 ; 1 Austrian Research Institute for Artificial Intelligence, Austria; 2 University of Vienna, Austria The system described in this paper relies both on a multimodal corpus and a written newspaper corpus for processing spoken and written user requests to Austrian news texts. Requests may be spontaneous spoken and written utterances as well as mouse clicks; user actions may concern actual search, but also control of the browser. Because of spontaneous utterances, a large vocabulary and multimodal interaction, interpreting the user request and generating an appropriate system response is often difficult. Apart from a controller module, the system uses data from two corpora for compensating the difficulties associated with the scenario. Multimodal user actions, which were collected in Wizard-of-Oz experiments, serve as a base for the identification of patterns in users’ spontaneous utterances. Furthermore, news documents are used for obtaining background knowledge which can contribute to query expansion whenever the interpretation of users’ utterances encounters ambiguity or underspecification concerning the search terms. 36 Eurospeech 2003 Tuesday Cross-Modal Informational Masking Due to Mismatched Audio Cues in a Speechreading Task Douglas S. Brungart 1 , Brian D. Simpson 1 , Alex Kordik 2 ; 1 Air Force Research Laboratory, USA; 2 Sytronics Inc., USA Although most known examples of cross-modal interactions in audio-visual speech perception involve a dominant visual signal that modifies the apparent audio signal heard by the observer, there may also be cases where an audio signal can alter the visual image seen by the observer. In this experiment, we examined the effects that different distracting audio signals had on an observer’s ability to speechread a color and number combination from a visual speech stimulus. When the distracting signal was noise, timereversed speech, or irrelevant continuous speech, speechreading performance was unaffected. However, when the distracting audio signal was speech that followed the same general syntax as the target speech but contained a different color and number combination, speechreading performance was dramatically reduced. 
This suggests that the amount of interference an audio signal causes in a speechreading task strongly depends on the semantic similarity of the target and masking phrases. The amount of interference did not, however, depend on the apparent similarity between the audio speech signal and the visible talker: masking phrases spoken by a talker who was different in sex than the visible talker interfered nearly as much with the speechreading task as masking phrases spoken by the same talker used in the visual stimulus. A second experiment that examined the effects of desynchronizing the audio and visual signals found that the amount of interference caused by the audio phrase decreased when it was time advanced or time delayed relative to the visual target, but that time shifts as large as 1 s were required before performance approached the level achieved with no audio signal. The results of these experiments are consistent with the existence of a kind of cross-modal “informational masking” that occurs when listeners who see one word and hear another are unable to correctly determine which word was present in the visual stimulus. September 1-4, 2003 – Geneva, Switzerland control and presentation of a multi-modal in-car e-mail system. A simple interface for reading e-mail was constructed, which could be controlled manually by pressing keyboard buttons, by speech through a Wizard of Oz setup, or both. The e-mail program was presented visually on a VDU, read to the driver through speech synthesis, or both. Results indicate that in this context subjective task load was highest when manual/visual interaction was used. A solution may be interaction through user-determined modality selection, as results indicate that subjects judge their load lowest and performance and preference highest among the tested conditions when they are able to select the modality. Some evaluation issues for multi-modal interfaces are discussed. Bayesian Networks for Spoken Dialogue Management in Multimodal Systems of Tour-Guide Robots Plamen Prodanov, Andrzej Drygajlo; EPFL, Switzerland In this paper, we propose a method based on Bayesian networks for interpretation of multimodal signals used in the spoken dialogue between a tour-guide robot and visitors in mass exhibition conditions. We report on experiments interpreting speech and laser scanner signals in the dialogue management system of the autonomous tour-guide robot RoboX, successfully deployed at the Swiss National Exhibition (Expo.02). A correct interpretation of a user’s (visitor’s) goal or intention at each dialogue state is a key issue for successful voice-enabled communication between tour-guide robots and visitors. To infer the visitors’ goal under the uncertainty intrinsic to these two modalities, we introduce Bayesian networks for combining noisy speech recognition with data from a laser scanner, which is independent of acoustic noise. Experiments with real data, collected during the operation of RoboX at Expo.02 demonstrate the effectiveness of the approach. Session: PTuCe– Poster Speech Coding & Transmission Audiovisual Speech Enhancement Based on the Association Between Speech Envelope and Video Features Time: Tuesday 13.30, Venue: Main Hall, Level -1 Chair: Isabel Trancoso, INESC ID / IST, Lisboa, Portugal Frédéric Berthommier; ICP-CNRS, France The low level acoustico-visual association reported by Yehia et al. (Speech Comm., 26(1):23-43, 1998) is exploited for audio-visual speech enhancement with natural video sequences. 
The aim of this study is to demonstrate that the redundant components of AV speech are extractible with a suitable representation which does not involve any categorization process. A comparative study is achieved between different types of audio features, including the initial Line Spectral Pairs (LSP) and 4-subbands envelope energy. A gain measure of the enhancement is applied for the comparison. The results clearly show that the coarse envelope features allows a better gain than the LSP. Robust Speech Interaction in a Mobile Environment Through the Use of Multiple and Different Media Input Types Optimization of Window and LSF Interpolation Factor for the ITU-T G.729 Speech Coding Standard Wai C. Chu, Toshio Miki; DoCoMo USA Labs, USA A gradient-descent based optimization procedure is applied to the window sequence used for linear prediction (LP) analysis of the ITUT G.729 CS-ACELP coder. By replacing the original window of the standard by the optimized versions, similar subjective quality is obtainable at reduced computational cost and / or lowered coding delay. In addition, an optimization strategy is described to find the line spectral frequency (LSF) interpolation factor. Likelihood Ratio Test with Complex Laplacian Model for Voice Activity Detection Joon-Hyuk Chang, Jong-Won Shin, Nam Soo Kim; Seoul National University, Korea Rainer Wasinger, Christoph Stahl, Antonio Krueger; DFKI GmbH, Germany Mobile and outdoor environments have long been out of reach for speech engines due to the performance limitations that were associated with portable devices, and the difficulties of processing speech in high-noise areas. This paper outlines an architecture for attaining robust speech recognition rates in a mobile pedestrian indoor/outdoor navigation environment, through the use of a media fusion knowledge component. Speech-Based, Manual-Visual, and Multi-Modal Interaction with an In-Car Computer – Evaluation of a Pilot Study This paper proposes a voice activity detector (VAD) based on the complex Laplacian model. With the use of a goodness-of-fit (GOF) test, it is discovered that the Laplacian model is more suitable to describe noisy speech distribution than the conventional Gaussian model. The likelihood ratio (LR) based on the Laplacian model is computed and then applied to the VAD operation. According to the experimental results, we can find that the Laplacian statistical model is more suitable for the VAD algorithm compared to the Gaussian model. Multi-Mode Quantization of Adjacent Speech Parameters Using a Low-Complexity Prediction Scheme Jani Nurminen; Nokia Research Center, Finland Rogier Woltjer, Wah Jin Tan, Fang Chen; Linköpings Universitet, Sweden This paper presents a pilot study comparing various modalities for This work addresses joint quantization of adjacent speech parameter values or vectors. The basic joint quantization scheme is improved by using a low-complexity predictor and by allowing the 37 Eurospeech 2003 Tuesday quantizer to operate in several modes. In addition, this paper introduces an efficient algorithm for training quantizers having the proposed structure. The algorithm is used for training a practical quantizer that is evaluated in the context of the quantization of the linear prediction coefficients. The simulation results indicate that the proposed quantizer clearly outperforms conventional quantizers both in an error-free environment and in erroneous conditions at all bit error rates included in the evaluation. 
Multi-Mode Matrix Quantizer for Low Bit Rate LSF Quantization
Ulpu Sinervo 1, Jani Nurminen 2, Ari Heikkinen 2, Jukka Saarinen 2; 1 Tampere University of Technology, Finland; 2 Nokia Research Center, Finland
In this paper, we introduce a novel method for quantization of line spectral frequencies (LSF) converted from m-th order linear prediction coefficients. In the proposed method, the interframe correlation of LSFs is exploited using matrix quantization, where N consecutive frames are quantized as one m-by-N matrix. The voicing-based multi-mode operation reduces the bit rate by taking advantage of the properties of the speech signal. That is, certain parts of a signal, such as unvoiced segments, can be quantized with smaller codebooks. With this method, very low variable bit rate LSF quantization is obtained. The proposed method is especially suitable for very low bit rate speech coders in which a short time delay is tolerable, and high but not necessarily transparent quality is sufficient.

Entropy-Optimized Channel Error Mitigation with Application to Speech Recognition Over Wireless
Victoria Sánchez, Antonio M. Peinado, Angel M. Gómez, José L. Pérez-Córdoba; Universidad de Granada, Spain
In this paper we propose an entropy-optimized channel error mitigation technique with a low computational complexity and moderate memory requirements, suitable for transmissions over wireless channels. We apply it to Distributed Speech Recognition (DSR), getting an improvement of around 3% in word accuracy over the recognition performance obtained by the mitigation technique proposed in the ETSI standard for DSR (ETSI ES 201 108 v1.1.2) for bad channel conditions (GSM EP3 error pattern).

Voicing Controlled Frame Loss Concealment for Adaptive Multi-Rate (AMR) Speech Frames in Voice-Over-IP
Frank Mertz 1, Hervé Taddei 2, Imre Varga 2, Peter Vary 1; 1 RWTH Aachen, Germany; 2 Siemens AG, Germany
In this paper we present a voicing controlled, speech parameter based frame loss concealment for frames that have been encoded with the Adaptive Multi-Rate (AMR) speech codec. The missing parameters are estimated by interpolation and extrapolation techniques that are chosen depending on the voicing state of the speech frames preceding and following the lost frames. The voicing controlled concealment outperforms the conventional extrapolation/muting based approach and it shows a consistent improvement over interpolation techniques that do not distinguish between voiced and unvoiced speech. The quality can be further improved if additional information about the predictor states of predictively encoded parameters is available from a redundant transmission in future packets.
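The voicing-controlled concealment strategy described above can be illustrated with a small sketch. This is a generic reconstruction under assumed inputs (per-frame parameter vectors and voicing flags), not the AMR-specific procedure of the paper:

```python
import numpy as np

def conceal(params_before, params_after, n_lost, voiced_before, voiced_after,
            damping=0.9):
    """Estimate n_lost lost parameter frames (e.g. LSF or gain vectors).
    Interpolate when the following good frame is known and both neighbours
    share the same voicing state; otherwise extrapolate the last good frame
    with progressive damping (fading towards silence)."""
    concealed = []
    if params_after is not None and voiced_before == voiced_after:
        for i in range(1, n_lost + 1):
            w = i / (n_lost + 1)              # linear interpolation weight
            concealed.append((1 - w) * params_before + w * params_after)
    else:
        frame = params_before.copy()
        for _ in range(n_lost):
            frame = damping * frame           # repeated, damped extrapolation
            concealed.append(frame.copy())
    return concealed
```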
Perceptual Irrelevancy Removal in Narrowband Speech Coding
Marja Lähdekorpi 1, Jani Nurminen 2, Ari Heikkinen 2, Jukka Saarinen 2; 1 Tampere University of Technology, Finland; 2 Nokia Research Center, Finland
A masking model originally designed for audio signals is applied to narrowband speech. The model is used to detect and remove the perceptually irrelevant, simultaneously masked frequency components of a speech signal. Objective measurements have shown that the modified speech signal can be coded more efficiently than the original signal. Furthermore, it has been confirmed through perceptual evaluation that the removal of these frequency components does not cause significant degradation of the speech quality; rather, it has consistently improved the output quality of two standardized speech codecs. Thus, the proposed irrelevancy removal technique can be used at the front end of a speech coder to achieve enhanced coding efficiency.

Very-Low-Rate Speech Compression by Indexation of Polyphones
Charles du Jeu, Maurice Charbit, Gérard Chollet; ENST-CNRS, France
Speech coding by indexation has proven to lower the rate of speech compression drastically. Based on the Automatic Language Independent Speech Processing (ALISP) approach that automatically segments the speech signal [1], we studied the possibility of optimising this rate as well as the quality of the re-synthesised signal, by using the text information corresponding to the speech signal and by implementing a new segmentation method. This led to the alignment of the speech with its phonetic transcription and to the use of polyphones, which finally increase output speech quality while keeping a bit rate between 400 bits/s and 600 bits/s. Typically, this can be used to store recorded alpha-numeric books for blind people, or to compress recorded courses for e-learning. Cell phone applications could also be considered.

Robust Jointly Optimized Multistage Vector Quantization for Speech Coding
Venkatesh Krishnan, David V. Anderson; Georgia Institute of Technology, USA
In this paper, a novel channel-optimized multistage vector quantization (COMSVQ) codec is presented in which the stage codebooks are jointly designed. The proposed codec uses a signal source and channel-dependent distortion measure to encode line spectral frequencies derived from segments of a speech signal. Simulation results are provided to demonstrate the consistent reduction in the spectral distortion obtained using the proposed codec as compared to the conventional sequentially-designed channel-matched multistage vector quantizer.

Polar Quantization of Sinusoids from Speech Signal Blocks
Harald Pobloth, Renat Vafin, W. Bastiaan Kleijn; KTH, Sweden
We introduce a block polar quantization (BPQ) procedure that minimizes a weighted distortion for a set of sinusoids representing one block of a signal. The minimization is done under a resolution constraint for the entire signal block. BPQ outperforms rectangular quantization, strictly polar quantization, and unrestricted polar quantization (UPQ), both when the Cartesian coordinates of the sinusoidal components are assumed to be Gaussian and for sinusoids found from speech data. In the case of speech data we found a significant performance gain (about 4 dB) over the best performing polar quantization (UPQ).

Transcoding Algorithm for G.723.1 and AMR Speech Coders: For Interoperability Between VoIP and Mobile Networks
Sung-Wan Yoon, Jin-Kyu Choi, Hong-Goo Kang, Dae-Hee Youn; Yonsei University, Korea
In this paper, an efficient transcoding algorithm between the G.723.1 and AMR speech coders is proposed for providing interoperability between IP and mobile networks. Transcoding is completed through three processing steps: line spectral pair (LSP) conversion, pitch interval conversion, and fast adaptive-codebook search. To maintain minimum distortion, parameters sensitive to quality, such as the adaptive and fixed codebooks, are re-estimated from synthesized target signals. To reduce overall complexity, other parameters are directly converted at the parameter level without running through the complete decoding process.
Objective and subjective preference tests verify that the proposed transcoding algorithm has quality equivalent to the conventional tandem approach. In addition, the proposed algorithm achieves a 20∼40% reduction in overall complexity over the tandem approach, with a shorter processing delay.

Quality-Complexity Trade-Off in Predictive LSF Quantization
Davorka Petrinovic, Davor Petrinovic; University of Zagreb, Croatia
In this paper several techniques are investigated for reducing the complexity and/or improving the quality of line spectrum frequency (LSF) quantization based on switched prediction (SP) and vector quantization (VQ). For switched prediction, a higher number of prediction matrices is proposed. The quality of the quantized speech is improved by a multi-candidate prediction and delayed decision algorithm. It is shown that quantizers with delayed decision can save up to one bit while still having similar or even lower complexity than the baseline quantizers with 2 switched matrices. By efficient implementation of prediction, lower complexity can be achieved through the use of prediction matrices with a reduced number of non-zero elements. By combining such sparse matrices and multiple prediction candidates, the best quality-complexity compromise quantizers can be obtained, as demonstrated by experimental results.

Variable Bit Rate Control with Trellis Diagram Approximation
Kei Kikuiri, Nobuhiko Naka, Tomoyuki Ohya; NTT DoCoMo Inc., Japan
In this paper, we present a variable bit rate control method for speech/audio coding, under the constraint that the total bit rate of a super-frame be constant. The proposed method uses a trellis diagram for optimizing the overall quality of the super-frame. In order to reduce the computational complexity, the trellis diagram uses an approximation that ignores the encoder memory state between different paths. Simulations on the AMR Wideband codec show that the proposed variable bit rate control achieves up to 4.3 dB improvement over constant rate coding in perceptually weighted SNR.

Multi-Rate Extension of the Scalable to Lossless PSPIHT Audio Coder
Mohammed Raad 1, Ian Burnett 1, Alfred Mertins 2; 1 University of Wollongong, Australia; 2 University of Oldenburg, Germany
This paper extends a scalable to lossless compression scheme to allow scalability in terms of sampling rate as well as quantization resolution. The scheme presented is an extension of a perceptually scalable scheme that scales to lossless compression, producing smooth objective scalability, in terms of SNR, until lossless compression is achieved. The scheme is built around the Perceptual SPIHT algorithm, which is a modification of the SPIHT algorithm. An analysis of the expected limitations of scaling across sampling rates is given, as well as lossless compression results showing the competitive performance of the presented technique.

Entropy Constrained Quantization of LSP Parameters
Turaj Zakizadeh Shabestary, Per Hedelin, Fredrik Nordén; Chalmers University of Technology, Sweden
Conventional procedures for spectrum coding for speech address fixed rate coding. For the variable rate case, we develop spectrum coding based on constrained entropy quantization. Our approach integrates high rate theory for Gaussian mixture modeling with lattices based on line spectrum pairs. The overall procedure utilizes a union of several lattices in order to enhance performance and to comply with source statistics.
We provide experimental results in terms of SD for different conditions and compare these with high rate lower bounds. One major advantage of our coding system concerns adaptivity, one design can operate at a variety of rates without re-training. Session: PTuCf– Poster Speech Recognition - Search & Lexicon Modeling Time: Tuesday 13.30, Venue: Main Hall, Level -1 Chair: Hermann Ney, Aachen University of Technology, Germany Named Entity Extraction from Japanese Broadcast News Towards Optimal Encoding for Classification with Applications to Distributed Speech Recognition Naveen Srinivasamurthy, Antonio Ortega, Shrikanth Narayanan; University of Southern California, USA In distributed classification applications, due to computational constraints, data acquired by low complexity clients is compressed and transmitted to a remote server for classification. In this paper the design of optimal quantization for distributed classification applications is considered and evaluated in the context of a speech recognition task. The proposed encoder minimizes the detrimental effect compression has on classification performance. Specifically, the proposed methods concentrate on designing low dimension encoders. Here individual encoders independently quantize sub-dimensions of a high dimension vector used for classification. The main novelty of the work is the introduction of mutual information as a metric for designing compression algorithms in classification applications. Given a rate constraint, the proposed algorithm minimizes the mutual information loss due to compression. Alternatively it ensures that the compressed data used for classification retains maximal information about the class labels. An iterative empirical algorithm (similar to the Lloyd algorithm) is provided to design quantizers for this new distortion measure. Additionally, mutual information is also used to propose a rate-allocation scheme where rates are allocated to the sub-dimensions of a vector (which are independently encoded) to satisfy a given rate constraint. The results obtained indicate that mutual information is a better metric (when compared to mean square error) for optimizing encoders used in distributed classification applications. In a distributed spoken names recognition task, the proposed mutual information based rate-allocation reduces by a factor of six the increase in WER due to compression when compared to a heuristic rate-allocation. Akio Kobayashi 1 , Franz J. Och 2 , Hermann Ney 3 ; 1 NHK Science & Technical Research Laboratories, Japan; 2 University of Southern California, USA; 3 RWTH Aachen, Germany This paper describes a method for named entity extraction from Japanese broadcast news. Our proposed named entity tagger gives entity categories for every character in order to deal with unknown words and entities correctly. This character-based tagger has models designed by maximum entropy modeling. We discuss the efficiency of the proposed tagger by comparison with a conventional word-based tagger. The results indicate that the capability of the taggers depends on the entity categories. Therefore, the features derived from both character and word contexts are required to obtain high performance of named entity extraction. Morpheme-Based Lexical Modeling for Korean Broadcast News Transcription Young-Hee Park, Dong-Hoon Ahn, Minhwa Chung; Sogang University, Korea In this paper, we describe our LVCSR system for Korean broadcast news transcription. 
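As a rough illustration of the mutual-information criterion in the preceding abstract, the sketch below estimates I(Q(X); C) from labelled training data and allocates bits greedily across sub-dimensions. It assumes simple per-dimension uniform quantizers and integer class labels, and is not the authors' iterative Lloyd-style design algorithm:

```python
import numpy as np

def mutual_information(cells, labels):
    """I(Q(X); C) in bits, estimated from co-occurrence counts of quantizer
    cell indices and integer class labels (both numpy integer arrays)."""
    joint = np.zeros((cells.max() + 1, labels.max() + 1))
    np.add.at(joint, (cells, labels), 1.0)
    joint /= joint.sum()
    pq = joint.sum(axis=1, keepdims=True)
    pc = joint.sum(axis=0, keepdims=True)
    nz = joint > 0
    return float((joint[nz] * np.log2(joint[nz] / (pq @ pc)[nz])).sum())

def uniform_cells(x, bits):
    """Cell index of each sample under a uniform quantizer with 2**bits levels."""
    edges = np.linspace(x.min(), x.max(), 2 ** bits + 1)[1:-1]
    return np.digitize(x, edges)

def allocate_bits(features, labels, total_bits):
    """Greedy rate allocation: give each successive bit to the sub-dimension
    whose refined quantizer gains the most information about the class."""
    bits = np.zeros(features.shape[1], dtype=int)
    for _ in range(total_bits):
        gains = []
        for d in range(features.shape[1]):
            before = mutual_information(uniform_cells(features[:, d], bits[d]), labels)
            after = mutual_information(uniform_cells(features[:, d], bits[d] + 1), labels)
            gains.append(after - before)
        bits[int(np.argmax(gains))] += 1
    return bits
```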
The main focus here is to find the most appropriate morpheme-based lexical model for Korean broadcast news recognition to deal with the inflectional flexibility of Korean. Since there are trade-offs between lexicon size and lexical coverage, and between the length of the lexical unit and WER, in our system we analyzed the training corpus to obtain a compact 24k-morpheme-based lexicon with 98.8% coverage. Then, the lexicon is optimized by combining morphemes using statistics of the training corpus under a monosyllable constraint or a maximum length constraint. In experiments, our system reduced the number of monosyllable morphemes, which are the most error-prone, from 52% to 29% of the lexicon and obtained 13.24% WER for anchor speech and 24.97% for reporter speech.

Graphemes are shown to give good coverage on all four languages and represent a large set of shared sub-word models. For all experiments, the acoustic models are trained from scratch in order not to use any prior phonetic knowledge. Finally, we show that for the Dutch and German tasks, the presented approach works well and may also help to decrease the word error rate below that obtained by monolingual acoustic models. For all four languages, adding language questions to the multilingual decision tree helps to improve the word error rate.

Data Driven Example Based Continuous Speech Recognition
Mathias De Wachter, Kris Demuynck, Dirk Van Compernolle, Patrick Wambacq; Katholieke Universiteit Leuven, Belgium
The dominant acoustic modeling methodology based on Hidden Markov Models is known to have certain weaknesses. Partial solutions to these flaws have been presented, but the fundamental problem remains: compression of the data into a compact HMM discards useful information such as time dependencies and speaker information. In this paper, we look at pure example based recognition as a solution to this problem. By replacing the HMM with the underlying examples, all information in the training data is retained. We show how information about speaker and environment can be used, introducing a new interpretation of adaptation. The basis for the recognizer is the well-known DTW algorithm, which has often been used for small tasks. However, large vocabulary speech recognition introduces new demands, resulting in an explosion of the search space. We show how this problem can be tackled using a data driven approach which selects appropriate speech examples as candidates for DTW alignment.

A Cross-Media Retrieval System for Lecture Videos
Atsushi Fujii 1, Katunobu Itou 2, Tomoyosi Akiba 2, Tetsuya Ishikawa 1; 1 University of Tsukuba, Japan; 2 AIST, Japan
We propose a cross-media lecture-on-demand system, in which users can selectively view specific segments of lecture videos by submitting text queries. Users can easily formulate queries by using the textbook associated with a target lecture, even if they cannot come up with effective keywords. Our system extracts the audio track from a target lecture video, generates a transcription by large vocabulary continuous speech recognition, and produces a text index. Experimental results showed that by adapting speech recognition to the topic of the lecture, the recognition accuracy increased and the retrieval accuracy was comparable with that obtained by human transcription.
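The example-based recognition described in the De Wachter et al. abstract above rests on DTW alignment against stored training examples rather than a compact HMM. Below is a minimal sketch of that matching step, with a naive Euclidean local cost and exhaustive search over examples; a real large-vocabulary system would need the data-driven example selection the paper describes:

```python
import numpy as np

def dtw_distance(a, b):
    """Dynamic time warping distance between feature sequences a (n x d) and
    b (m x d), using a Euclidean local cost and the standard three-step recursion."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m] / (n + m)          # length-normalised path cost

def recognise(utterance, examples):
    """examples: list of (label, feature_sequence) kept verbatim from training
    data instead of being compressed into an HMM; returns the closest label."""
    return min(examples, key=lambda ex: dtw_distance(utterance, ex[1]))[0]
```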
Large Vocabulary Speaker Independent Isolated Word Recognition for Embedded Systems Building a Test Collection for Speech-Driven Web Retrieval Sergey Astrov, Bernt Andrassy; Siemens AG, Germany Atsushi Fujii 1 , Katunobu Itou 2 ; 1 University of Tsukuba, Japan; 2 AIST, Japan In this paper the implementation of a word-stem based tree search for large vocabulary speaker independent isolated word recognition for embedded systems is presented. Two fast search algorithms combine the effectiveness of the tree structure for large vocabularies and the fast Viterbi search within the regular structures of word-stems. The algorithms are proved to be very effective for workstation and embedded platform realizations. In order to decrease the processing power the word-stem based tree search with frame dropping approach is used. The recognition speed was increased by a factor of 5 without frame dropping and by a factor of 10 with frame dropping in comparison to linear Viterbi search for isolated word recognition task with a vocabulary of 20102 words. Thus, the large vocabulary isolated word recognition becomes possible for embedded systems. Low-Latency Incremental Speech Transcription in the Synface Project Alexander Seward; KTH, Sweden In this paper, a real-time decoder for low-latency online speech transcription is presented. The system was developed within the Synface project, which aims to improve the possibilities for hard of hearing people to use conventional telephony by providing speechsynchronized multimodal feedback. This paper addresses the specific issues related to HMM-based incremental phone classification with real-time constraints. The decoding algorithm described in this work enables a trade-off to be made between improved recognition accuracy and reduced latency. By accepting a longer latency per output increment, more time can be ascribed to hypothesis look-ahead and by that improve classification accuracy. Experiments performed on the Swedish SpeechDat database show that it is possible to generate the same classification as is produced by non-incremental decoding using HTK, by adopting a latency of approx. 150 ms or more. Multilingual Acoustic Modeling Using Graphemes This paper describes a test collection (benchmark data) for retrieval systems driven by spoken queries. This collection was produced in the subtask of the NTCIR-3 Web retrieval task, which was performed in a TREC-style evaluation workshop. The search topics and document collection for the Web retrieval task were used to produce spoken queries and language models for speech recognition, respectively. We used this collection to evaluate the performance of our retrieval system. Experimental results showed that (a) the use of target documents for language modeling and (b) enhancement of the vocabulary size in speech recognition were effective in improving the system performance. Confidence Measure Driven Scalable Two-Pass Recognition Strategy for Large List Grammars Miroslav Novak 1 , Diego Ruiz 2 ; 1 IBM T.J. Watson Reseach Center, USA; 2 Université Catholique de Louvain, Belgium In this article we will discuss recognition performance on large list grammars, a class of tasks often encountered in telephony applications. In these tasks, the user makes a selection from a large list of choices (e.g. stock quotes, yellow pages, etc). 
Though the redundancy of the complete utterance is often high enough to achieve high recognition accuracy, the large search space presents a challenge for the recognizer, in particular when real-time, low-latency performance is required. We propose a confidence measure driven two-pass search strategy, exploiting the high mutual information between grammar states to improve pruning efficiency while minimizing the need for memory.

An Efficient, Fast Matching Approach Using Posterior Probability Estimates in Speech Recognition
Sherif Abdou, Michael S. Scordilis; University of Miami, USA
Acoustic fast matching is an effective technique to accelerate the search process in large vocabulary continuous speech recognition. This paper introduces a novel fast matching method. This method is based on the evaluation of future posterior probabilities for a look-ahead number of time frames in order to exclude unlikely phone models as early as possible during the search. In contrast to the likelihood scores used by more traditional fast matching methods, these posterior probabilities are more discriminative by nature, as they sum up to unity over all the possible models. By applying the proposed method we managed to reduce by 66% the decoding time consumed in our time-synchronous Viterbi decoder for a recognition task based on the Wall Street Journal database, with virtually no additional decoding errors.

S. Kanthak, Hermann Ney; RWTH Aachen, Germany
In this paper we combine grapheme-based sub-word units with multilingual acoustic modeling. We show that a global decision tree together with automatically generated grapheme questions eliminates manual effort completely. We also investigate the effects of additional language questions. We present experimental results on four corpora with different languages, namely the Dutch and French ARISE corpus, the Italian EUTRANS corpus and the German VERBMOBIL corpus.

Although multiple cues, such as different signal processing techniques and feature representations, have been used in speech recognition in adverse acoustic environments, how to maximally utilize the benefit of these cues is largely unsolved. In this paper, a novel search strategy is proposed. During parallel decoding of different feature streams, the intermediate outputs are cross-referenced to reduce pruning errors. Experimental results show that this method significantly improves recognition performance on a noisy large vocabulary continuous speech task.

Design of the CMU Sphinx-4 Decoder

On Lexicon Creation for Turkish LVCSR
Kadri Hacioglu 1, Bryan Pellom 1, Tolga Ciloglu 2, Ozlem Ozturk 2, Mikko Kurimo 3, Mathias Creutz 3; 1 University of Colorado at Boulder, USA; 2 Middle East Technical University, Turkey; 3 Helsinki University of Technology, Finland
In this paper, we address the lexicon design problem in Turkish large vocabulary speech recognition. Although we focus only on Turkish, the methods described here are general enough that they can be considered for other agglutinative languages like Finnish, Korean, etc. In an agglutinative language, several words can be created from a single root word using a rich collection of morphological rules. So, a virtually infinite size lexicon is required to cover the language if words are used as the basic units. The standard approach to this problem is to discover a number of primitive units so that a large set of words can be created by compounding those units.
Two broad classes of methods are available for splitting words into their sub-units; morphology-based and data-driven methods. Although the word splitting significantly reduces the out of vocabulary rate, it shrinks the context and increases acoustic confusibility. We have used two methods to address the latter. In one method, we use word counts to avoid splitting of high frequency lexical units, and in the other method, we recompound splits according to a probabilistic measure. We present experimental results that show the methods are very effective to lower the word error rate at the expense of lexicon size. Compiling Large-Context Phonetic Decision Trees into Finite-State Transducers Paul Lamere 1 , Philip Kwok 1 , William Walker 1 , Evandro Gouvêa 2 , Rita Singh 2 , Bhiksha Raj 3 , Peter Wolf 3 ; 1 Sun Microsystems Laboratories, USA; 2 Carnegie Mellon University, USA; 3 Mitsubishi Electric Research Laboratories, USA Sphinx-4 is an open source HMM-based speech recognition system written in the JavaT M programming language. The design of the Sphinx-4 decoder incorporates several new features in response to current demands on HMM-based large vocabulary systems. Some new design aspects include graph construction for multilevel parallel decoding with multiple feature streams without the use of compound HMMs, the incorporation of a generalized search algorithm that subsumes Viterbi decoding as a special case, token stack decoding for efficient maintenance of multiple paths during search, design of a generalized language HMM graph from grammars and language models of multiple standard formats, that can potentially toggle between flat search structure, tree search structure, etc. This paper describes a few of these design aspects, and reports some preliminary performance measures for speed and accuracy. A New Decoder Design for Large Vocabulary Turkish Speech Recognition Onur Çilingir 1 , Mübeccel Demirekler 2 ; 1 TÜBİTAK BİLTEN, Turkey; 2 Middle East Technical University, Turkey Stanley F. Chen; IBM T.J. Watson Research Center, USA Recent work has shown that the use of finite-state transducers (FST’s) has many advantages in large vocabulary speech recognition. Most past work has focused on the use of triphone phonetic decision trees. However, numerous applications use decision trees that condition on wider contexts; for example, many systems at IBM use 11-phone phonetic decision trees. Alas, large-context phonetic decision trees cannot be compiled straightforwardly into FST’s due to memory constraints. In this work, we discuss memory-efficient techniques for manipulating large-context phonetic decision trees in the FST framework. First, we describe a lazy expansion technique that is applicable when expanding small word graphs. For general applications, we discuss how to construct large-context transducers via a sequence of simple, efficient finite-state operations; we also introduce a memory-efficient implementation of determinization. An important problem in large vocabulary speech recognition for agglutinative languages like Turkish is the high out of vocabulary (OOV) rate caused by extensive number of distinct words. Recognition systems using words as the basic lexical elements have difficulty in dealing with such virtually unlimited vocabulary. We propose a new time-synchronous lexical tree decoder design using morphemes as the lexical elements. A key feature of the proposed decoder is the dynamic generation of the lexical tree according to the morphological rules. 
The architecture emulates word generation in the language and therefore allows very large vocabularies through the defined set of morphemes and morphotactical rules.

Session: PTuCg – Poster Speech Technology Applications
Time: Tuesday 13.30, Venue: Main Hall, Level -1
Chair: Jerome Bellegarda, Spoken Language Group, Apple Computer, Inc., USA

Automatic Summarization of Broadcast News Using Structural Features
Sameer Raj Maskey, Julia Hirschberg; Columbia University, USA
We present a method for summarizing broadcast news that is not affected by word errors in an automatic speech recognition transcription, using information about the structure of the news program. We construct a directed graphical model to represent the probability distribution and dependencies among the structural features, which we train by finding the values of the parameters of the conditional probability tables. We then rank segments of the test set and extract the highest ranked ones as a summary. We present the procedure and preliminary test results.

A Dynamic Cross-Reference Pruning Strategy for Multiple Feature Fusion at Decoder Run Time
Yonghong Yan 1, Chengyi Zheng 1, Jianping Zhang 2, Jielin Pan 2, Jiang Han 2, Jian Liu 2; 1 Oregon Health & Science University, USA; 2 Chinese Academy of Sciences, China

Automatic Speech Recognition with Sparse Training Data for Dysarthric Speakers
Phil Green, James Carmichael, Athanassios Hatzis, Pam Enderby, Mark Hawley, Mark Parker; University of Sheffield, U.K.
We describe an unusual ASR application: recognition of command words from severely dysarthric speakers, who have poor control of their articulators. The goal is to allow these clients to control assistive technology by voice. While this is a small-vocabulary, speaker-dependent, isolated-word application, the speech material is more variable than normal, and only a small amount of data is available for training. After training a CDHMM recogniser, it is necessary to predict its likely performance without using an independent test set, so that confusable words can be replaced by alternatives. We present a battery of measures of consistency and confusability, based on forced alignment, which can be used to predict recogniser performance. We show how these measures perform, and how they are presented to the clinicians who are the users of the system.

Evaluating Multiple LVCSR Model Combination in NTCIR-3 Speech-Driven Web Retrieval Task
Masahiko Matsushita 1, Hiromitsu Nishizaki 1, Takehito Utsuro 2, Yasuhiro Kodama 1, Seiichi Nakagawa 1; 1 Toyohashi University of Technology, Japan; 2 Kyoto University, Japan

Prediction of Sentence Importance for Speech Summarization Using Prosodic Parameters
Akira Inoue, Takayoshi Mikami, Yoichi Yamashita; Ritsumeikan University, Japan
Recent improvements in computer systems are increasing the amount of accessible speech data. Since speech media are not appropriate for quick scanning, the development of automatic summarization of lecture or meeting speech is expected. Spoken messages contain non-linguistic information, which is mainly expressed by prosody, while written text conveys only linguistic information. The prosodic information may therefore improve the quality of speech summarization. This paper describes a technique of using prosodic parameters as well as linguistic information to identify important sentences for speech summarization.
Several prosodic parameters related to F0, power and duration are extracted for each sentence in lecture speech. The importance of each sentence is predicted from the prosodic parameters and the linguistic information. We also tried to combine the prosodic parameters and the linguistic information by multiple regression analysis. The proposed methods are evaluated both on the correlation between the predicted scores of sentence importance and the preference scores given by subjects, and on the accuracy of extraction of important sentences. The combination of the prosodic parameters improves the quality of speech summarization.

An Automatic Singing Transcription System with Multilingual Singing Lyric Recognizer and Robust Melody Tracker
Chong-kai Wang 1, Ren-Yuan Lyu 1, Yuang-Chin Chiang 2; 1 Chang Gung University, Taiwan; 2 National Tsing Hua University, Taiwan
A singing transcription system which transcribes the human singing voice to musical notes is described in this paper. The fact that human singing rarely follows the standard musical scale makes it a challenge to implement such a system. This system utilizes some new methods to deal with the imprecise musical scale of a human singer's input voice, such as the spectral standard deviation used for note segmentation, Adaptive Round Semitone used for melody tracking, and a Tune Map acting as a musical grammar constraint in melody tracking. Furthermore, a large vocabulary speech recognizer performing the lyric recognition task is also added, which is a new trial in a singing transcription system.

Speech Shift: Direct Speech-Input-Mode Switching Through Intentional Control of Voice Pitch
Masataka Goto 1, Yukihiro Omoto 2, Katunobu Itou 1, Tetsunori Kobayashi 2; 1 AIST, Japan; 2 Waseda University, Japan
This paper describes a speech-input interface function, called speech shift, that enables a user to specify a speech-input mode by simply changing (shifting) voice pitch. While current speech-input interfaces have used only verbal information, we aimed at building a more user-friendly speech interface by making use of nonverbal information, the voice pitch. By intentionally controlling the pitch, a user can enter the same word with different meanings (functions) without explicitly changing the speech-input mode. Our speech-shift function implemented on a voice-enabled word processor, for example, can distinguish an utterance with a high pitch from one with a normal (low) pitch, and regard the former as voice-command-mode input (such as file-menu and edit-menu commands) and the latter as regular dictation-mode text input. Our experimental results from twenty subjects showed that the speech-shift function is effective, easy to use, and a labor-saving input method.

This paper studies speech-driven Web retrieval models which accept spoken search topics (queries) in the NTCIR-3 Web retrieval task. The major focus of this paper is on improving the speech recognition accuracy of spoken queries and thereby improving the retrieval accuracy in speech-driven Web retrieval. We experimentally evaluate techniques for combining the outputs of multiple LVCSR models in the recognition of spoken queries. As model combination techniques, we compare the SVM learning technique and conventional voting schemes such as ROVER. We show that the techniques of multiple LVCSR model combination can achieve improvement both in speech recognition and retrieval accuracies in speech-driven text retrieval.
We also show that model combination by SVM learning outperforms conventional voting schemes both in speech recognition and retrieval accuracies.

Semantic Object Synchronous Understanding in SALT for Highly Interactive User Interface
Kuansan Wang; Microsoft Research, USA
SALT is an industrial standard that enables speech input/output for Web applications. Although the core design is to make simple tasks easy, SALT gives designers ample fine-grained controls to create advanced user interfaces. The paper exploits a speech input mode in which SALT dynamically reports partial semantic parses while audio capturing is still ongoing. The semantic parses can be evaluated and the outcome reported immediately back to the user. The potential impact for dialog systems is that tasks conventionally performed in a system turn can now be carried out in the midst of a user turn, thereby presenting a significant departure from conventional turn-taking. To assess the efficacy of such a highly interactive interface, more user studies are undoubtedly needed. This paper demonstrates how SALT can be employed to facilitate such studies.

Information Retrieval Based Call Classification
Jan Kneissler, Anne K. Kienappel, Dietrich Klakow; Philips Research Laboratories, Germany
In this paper we describe a fully automatic call classification system for customer service selection. Call classification is based on one customer utterance following a “How may I help you” prompt. In particular, we introduce two new elements to our information retrieval based call classifier, which significantly improve the classification accuracy: the use of a priori term relevance based on class information, and classification confidence estimation. We describe the spontaneous speech recognizer as well as the classifier and investigate correlations between speech recognition and call classification accuracy.

Using Syllable-Based Indexing Features and Language Models to Improve German Spoken Document Retrieval
Martha Larson, Stefan Eickeler; Fraunhofer Institute for Media Communication, Germany
Spoken document collections with high word-type/word-token ratios and heterogeneous audio continue to constitute a challenge for information retrieval. The experimental results reported in this paper demonstrate that syllable-based indexing features can outperform word-based indexing features on such a domain, and that syllable-based speech recognition language models can successfully be used to generate syllable-based indexing features. Recognition is carried out with a 5k syllable language model and a 10k mixed-unit language model whose vocabulary consists of a mixture of words and syllables. Both language models make retrieval performance possible that is comparable to that attained when a large vocabulary word-based language model is used. Experiments are performed on a spoken document collection consisting of short German-language radio documentaries. First, the vector space model is applied to a known-item retrieval task and a similar-document search. Then, the known-item retrieval task is further explored with a Levenshtein-distance-based fuzzy word match.

An Empirical Text Transformation Method for Spontaneous Speech Synthesizers
Shiva Sundaram, Shrikanth Narayanan; University of Southern California, USA
Spontaneously spoken utterances are characterized by a number of lexical and non-lexical features. These features can also reflect speaker-specific characteristics.
A major factor that discriminates spontaneous speech from written text is the presence of paralinguistic features such as filled pauses (fillers), false starts, laughter, disfluencies and discourse markers that are beyond the framework of formal grammars. The speech recognition community has dealt with these variabilities by making provisions for them in language models, to improve recognition accuracy for spoken language. In another scenario, the analysis of these features could also be used for language processing/generation for the overall improvement of synthesized speech or machine response. Such synthesized spontaneous speech could be used for computer avatars and Speech User Interfaces (SUIs) where lengthy interactions with machines occur, and it is generally desired to mimic a particular speaker or speaking style. This problem of language generation involves capturing general characteristics of spontaneous speech and also speaker-specific traits. The usefulness of conventional language processing tools is limited by the availability of a training corpus. Hence, an empirical text processing technique with ideas motivated by psycholinguistics is proposed. Such an empirical technique could be included in the text analysis stage of a TTS system. The proposed technique is adaptable: it can be extended to mimic different speakers based on an individual’s speaking style and filler preferences.

A New Approach to Reducing Alarm Noise in Speech

recognition over Bluetooth is described. We simulate a Bluetooth environment and then incorporate its performance, in the form of packet loss ratio, into the speech recognition system. We show how intelligent framing of speech feature vectors, extracted by a fixed-point arithmetic front-end, together with an interpolation technique for lost vectors, can lead to a 50.48% relative improvement in recognition accuracy. This is achieved at a distance of 10 meters, around the maximum operating distance between a Bluetooth transmitter and a Bluetooth receiver.

Speech Starter: Noise-Robust Endpoint Detection by Using Filled Pauses
Koji Kitayama 1, Masataka Goto 2, Katunobu Itou 2, Tetsunori Kobayashi 1; 1 Waseda University, Japan; 2 AIST, Japan
In this paper we propose a speech interface function, called speech starter, that enables noise-robust endpoint (utterance) detection for speech recognition. When current speech recognizers are used in a noisy environment, a typical recognition error is caused by incorrect endpoints, because their automatic detection is likely to be disturbed by non-stationary noise. The speech starter function enables a user to specify the beginning of each utterance by uttering a filler with a filled pause, which is used as a trigger to start speech-recognition processes. Since filled pauses can be detected robustly in a noisy environment, practical endpoint detection is achieved. Speech starter also offers the advantage of providing a hands-free speech interface, and it is user-friendly because a speaker tends to utter filled pauses (e.g., “er...”) at the beginning of utterances when hesitating in human-human communication. Experimental results from a 10-dB-SNR noisy environment show that the recognition error rate with speech starter was lower than with conventional endpoint-detection methods.
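The lost-vector interpolation mentioned in the Bluetooth abstract above can be sketched as follows; this is an illustrative linear interpolation over a packet-loss mask, not the specific technique evaluated in that paper:

```python
import numpy as np

def repair_lost_vectors(frames, received):
    """frames: (T, D) array of feature vectors; received: boolean mask of length T.
    Lost vectors inside a burst are linearly interpolated between the nearest
    received neighbours; losses at the edges repeat the nearest good frame."""
    frames = frames.copy()
    good = np.flatnonzero(received)          # indices of frames that arrived
    for d in range(frames.shape[1]):
        frames[:, d] = np.interp(np.arange(len(frames)), good, frames[good, d])
    return frames
```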
Automatic Segmentation of Film Dialogues into Phonemes and Graphemes Gilles Boulianne, Jean-François Beaumont, Patrick Cardinal, Michel Comeau, Pierre Ouellet, Pierre Dumouchel; CRIM, Canada Yilmaz Gül 1 , Aladdin M. Ariyaeeinia 2 , Oliver Dewhirst 1 ; 1 Fulcrum Voice Technologies, U.K.; 2 University of Hertfordshire, U.K. This paper presents a new single channel noise reduction method for suppressing periodic alarm noise in telephony speech. The presence of background alarm noise can significantly detract from the intelligibility of telephony speech received by emergency services, and in particular, by the fire brigade control rooms. The attraction of the proposed approach is that it targets the alarm noise without affecting the speech signal. This is achieved through discriminating the alarm noise by appropriately modelling the contaminated speech. The effectiveness of this method is confirmed experimentally using a set of real speech data collected by the Kent Fire Brigade HQ (UK). Improved Name Recognition with User Modeling Dong Yu, Kuansan Wang, Milind Mahajan, Peter Mau, Alex Acero; Microsoft Research, USA Speech recognition of names in Personal Information Management (PIM) systems is an important yet difficult task. The difficulty arises from various sources: the large number of possible names that users may speak, different ways a person may be referred to, ambiguity when only first names are used, and mismatched pronunciations. In this paper we present our recent work on name recognition with User Modeling (UM), i.e., automatic modeling of user’s behavior patterns. We show that UM and our learning algorithm lead to significant improvement in the perplexity, Out Of Vocabulary rate, recognition speed, and accuracy of the top recognized candidate. The use of an exponential window reduces the perplexity by more than 30%. Speech Recognition Over Bluetooth Wireless Channels Ziad Al Bawab, Ivo Locher, Jianxia Xue, Abeer Alwan; University of California at Los Angeles, USA This paper studies the effect of Bluetooth wireless channels on distributed speech recognition. An approach for implementing speech In film post-production, efficient methods for re-recording a dialogue or dubbing in a new language require a precisely timealigned text, with individual letters time-coded to video frame resolution. Currently, this time alignment is performed by experts in a painstaking and slow process. To automate this process, we used CRIM’s large vocabulary HMM speech recognizer as a phoneme segmenter and measured its accuracy on typical film extracts in French and English. Our results reveal several characteristics of film dialogues, in addition to noise, that affect segmentation accuracy, such as speaking style or reverberant recordings. Despite these difficulties, an HMM-based segmenter trained on clean speech can still provide more than 89% acceptable phoneme boundaries on typical film extracts. We also propose a method which provides the correspondence between aligned phonemes and graphemes of the text. The method does not use explicit rules, but rather computes an optimal string alignment according to an edit-distance metric. Together, HMM phoneme segmentation and phoneme-grapheme correspondence meet the needs of film postproduction for a timealigned text, and make it possible to automate a large part of the current post-synch process. 
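The phoneme-grapheme correspondence step in the preceding abstract is a classical edit-distance alignment. Below is a minimal sketch under simplifying assumptions; a real system would replace the plain equality test with a phoneme-to-letter similarity measure rather than string identity:

```python
def align(phones, graphemes, sub_cost=1.0, gap_cost=1.0):
    """Optimal global alignment (edit distance with backtrace) between a
    phoneme sequence and the graphemes of the text; None marks an unmatched
    phoneme or letter (insertion/deletion)."""
    n, m = len(phones), len(graphemes)
    D = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = i * gap_cost
    for j in range(1, m + 1):
        D[0][j] = j * gap_cost
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            match = 0.0 if phones[i - 1] == graphemes[j - 1] else sub_cost
            D[i][j] = min(D[i - 1][j - 1] + match,
                          D[i - 1][j] + gap_cost,
                          D[i][j - 1] + gap_cost)
    pairs, i, j = [], n, m
    while i > 0 or j > 0:
        match = sub_cost if i == 0 or j == 0 or phones[i - 1] != graphemes[j - 1] else 0.0
        if i > 0 and j > 0 and D[i][j] == D[i - 1][j - 1] + match:
            pairs.append((phones[i - 1], graphemes[j - 1])); i, j = i - 1, j - 1
        elif i > 0 and D[i][j] == D[i - 1][j] + gap_cost:
            pairs.append((phones[i - 1], None)); i -= 1
        else:
            pairs.append((None, graphemes[j - 1])); j -= 1
    return list(reversed(pairs))
```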
Automated Closed-Captioning of Live TV Broadcast News in French
Julie Brousseau, Jean-François Beaumont, Gilles Boulianne, Patrick Cardinal, Claude Chapdelaine, Michel Comeau, Frédéric Osterrath, Pierre Ouellet; CRIM, Canada
This paper describes the system currently under development at CRIM whose aim is to provide real-time closed captioning of live TV broadcast news in Canadian French. This project is done in collaboration with TVA Network, a national TV broadcaster, and the RQST (a Québec association which promotes the use of subtitling). The automated closed-captioning system will use CRIM’s transducer-based large vocabulary French recognizer. The system will be totally integrated with the existing broadcaster’s equipment and working methods. First “on-air” use will take place in February 2004.

Automatic Construction of Unique Signatures and Confusable Sets for Natural Language Directory Assistance Applications
E.E. Jan, Benoît Maison, Lidia Mangu, Geoffrey Zweig; IBM T.J. Watson Research Center, USA
This paper addresses the problem of building natural language based grammars and language models for directory assistance applications that use automatic speech recognition. As input, one is given an electronic version of a standard phone book, and the output is a grammar or language model that will accept all the ways in which one might ask for a particular listing. We focus primarily on the problem of processing listings for businesses and government offices, but our techniques can be used to speech-enable other kinds of large listings (like book titles, catalog entries, etc.). We have applied these techniques to the business listings of a state in the Midwestern United States, and we present highly encouraging recognition results.

Named Entity Extraction from Word Lattices
James Horlock, Simon King; University of Edinburgh, U.K.
We present a method for named entity extraction from word lattices produced by a speech recogniser. Previous work by others on named entity extraction from speech has used either a manual transcript or 1-best recogniser output. We describe how a single Viterbi search can recover both the named entity sequence and the corresponding word sequence from a word lattice, and further that it is possible to trade off an increase in word error rate for improved named entity extraction.

Recent Enhancements in CU VOCAL for Chinese TTS-Enabled Applications
Helen M. Meng, Yuk-Chi Li, Tien-Ying Fung, Man-Cheuk Ho, Chi-Kin Keung, Tin-Hang Lo, Wai-Kit Lo, P.C. Ching; Chinese University of Hong Kong, China
CU VOCAL is a Cantonese text-to-speech (TTS) engine. We use a syllable-based concatenative synthesis approach to generate intelligible and natural synthesized speech [1]. This paper describes several recent enhancements in CU VOCAL.
First, we have augmented the syllable unit selection strategy with a positional feature. This feature specifies the relative location of a syllable in a sentence and serves to improve the quality of Cantonese tone realization. Second, we have developed the CU VOCAL SAPI engine, a version of the synthesizer that eases integration with applications using SAPI (Speech Application Programming Interface). We demonstrate the use of CU VOCAL SAPI in an electronic book (e-book) reader. Third, we have made an initial attempt to use the CU VOCAL SAPI engine in Web content authored with Speech Application Language Tags (SALT). The use of SALT tags can ease the task of invoking the Cantonese TTS service on webpages.

A Topic Classification System Based on Parametric Trajectory Mixture Models
William Belfield, Herbert Gish; BBN Technologies, USA
In this paper we address the problem of topic classification of speech data. Our concern in this paper is the situation in which there is no speech or phoneme recognizer available for the domain of the speech data. In this situation the only inputs for training the system are audio speech files labeled according to the topics of interest. The process that we follow in developing the topic classifier is that of data segmentation followed by the representation of the segments by polynomial trajectory models. The clustering of acoustically similar segments enables us to train a trajectory Gaussian mixture model that is used to label segments of both on-topic and off-topic data, and the labeled data enable us to create topic classifiers. The advantage of the approach that we are pursuing is that it is language and domain independent. We evaluated the performance of our approach with several classifiers and demonstrated positive results.

Evaluation of an Alert System for Selective Dissemination of Broadcast News
Isabel Trancoso 1, João P. Neto 1, Hugo Meinedo 1, Rui Amaral 2; 1 INESC-ID/IST, Portugal; 2 INESC-ID/IPS, Portugal
This paper describes the evaluation of the system for selective dissemination of Broadcast News that we developed in the context of the European project ALERT. Each component of the main processing block of our system was evaluated separately, using the ALERT corpus. Likewise, the user interface was also evaluated separately. Besides this modular evaluation, which will be briefly mentioned here as a reference, the system can also be evaluated as a whole, in a field trial from the point of view of a potential user. This is the main topic of this paper. The analysis of the main sources of problems hinted at a large number of issues that must be dealt with in order to improve the performance. In spite of these pending problems, we believe that having a fully operational system is a must for being able to address user needs in the future in this type of service.

Session: OTuDa – Oral Robust Speech Recognition - Front-end Processing
Time: Tuesday 16.00, Venue: Room 1
Chair: Sadaoki Furui, Tokyo Inst. of Technology, Japan

Model Based Noisy Speech Recognition with Environment Parameters Estimated by Noise Adaptive Speech Recognition with Prior
Kaisheng Yao 1, Kuldip K. Paliwal 2, Satoshi Nakamura 3; 1 University of California at San Diego, USA; 2 Griffith University, Australia; 3 ATR-SLT, Japan
We have proposed earlier a noise adaptive speech recognition approach for recognizing speech corrupted by nonstationary noise and channel distortion. In this paper, we extend this approach.
Instead of maximum likelihood estimation of environment parameters (as done in our previous work), the present method estimates environment parameters within a Bayesian framework that is capable of incorporating prior knowledge of the environment. Experiments are conducted on a database that contains digit utterances contaminated by channel distortion and nonstationary noise. Results show that this method performs better than the previous methods.

Low Complexity Joint Optimization of Excitation Parameters in Analysis-by-Synthesis Speech Coding
U. Mittal, J.P. Ashley, E.M. Cruz-Zeno; Motorola Labs, USA
Codebook searches in analysis-by-synthesis speech coders typically involve minimization of a perceptually weighted squared error signal. Minimization of the error over multiple codebooks is often done in a sequential manner, resulting in the choice of overall excitation parameters being sub-optimal. In this paper, we propose a joint excitation parameter optimization framework in which the associated complexity is slightly greater than that of the traditional sequential optimization, but with significant quality improvement. Moreover, the framework allows joint optimization to be easily incorporated into existing pulse codebook systems with little or no impact on the codebook search algorithms.

A Harmonic-Model-Based Front End for Robust Speech Recognition
Michael L. Seltzer 1, Jasha Droppo 2, Alex Acero 2; 1 Carnegie Mellon University, USA; 2 Microsoft Research, USA
Speech recognition accuracy degrades significantly when the speech has been corrupted by noise, especially when the system has been trained on clean speech. Many compensation algorithms have been developed which require reliable online noise estimates or a priori knowledge of the noise. In situations where such estimates or knowledge are difficult to obtain, these methods fail. We present a new robustness algorithm which avoids these problems by making no assumptions about the corrupting noise. Instead, we exploit properties inherent to the speech signal itself to denoise the recognition features. In this method, speech is decomposed into harmonic and noise-like components, which are then processed independently and recombined. By processing noise-corrupted speech in this manner we achieve significant improvements in recognition accuracy on the Aurora 2 task.

Audio-Visual Speech Recognition in Challenging Environments
Gerasimos Potamianos, Chalapathy Neti; IBM T.J. Watson Research Center, USA
Visual speech information is known to improve the accuracy and noise robustness of automatic speech recognizers. However, to date, all audio-visual ASR work has concentrated on “visually clean” data with limited variation in the speaker’s frontal pose, lighting, and background. In this paper, we investigate audio-visual ASR in two practical environments that present significant challenges to robust visual processing: (a) typical offices, where data are recorded by means of a portable PC equipped with an inexpensive web camera, and (b) automobiles, with data collected at three approximate speeds. The performance of all components of a state-of-the-art audio-visual ASR system is reported on these two sets and benchmarked against “visually clean” data recorded in a studio-like environment.
Not surprisingly, both audio- and visual-only ASR degrade, more than doubling their respective word error rates. Nevertheless, visual speech remains beneficial to ASR.

A New Perspective on Feature Extraction for Robust In-Vehicle Speech Recognition
Umit H. Yapanel, John H.L. Hansen; University of Colorado at Boulder, USA
The problem of reliable speech recognition for in-vehicle applications has recently emerged as a challenging research domain. This study focuses on the feature extraction stage of this problem. The approach is based on Minimum Variance Distortionless Response (MVDR) spectrum estimation. MVDR is used for robustly estimating the envelope of the speech signal and is shown to be very accurate and relatively less sensitive to additive noise. The proposed feature estimation process removes the traditional Mel-scaled filterbank as a perceptually motivated frequency partitioning. Instead, we directly warp the FFT power spectrum of speech. The word error rate (WER) is shown to decrease by 27.3% with respect to the MFCCs and 18.8% with respect to recently proposed PMCCs on an extended digit recognition task in real car environments. The proposed feature estimation approach is called PMVDR and is conclusively shown to be a better speech representation in real environments, with emphasis on time-varying car noise.

Session: STuDb – Oral Spoken Language Processing for e-Inclusion
Time: Tuesday 16.00, Venue: Room 2
Chair: Paul Dalsgaard, Center for PersonKommunikation (CPK)

Speech Recognition of Double Talk Using SAFIA-Based Audio Segregation
Toshiyuki Sekiya, Tetsuji Ogawa, Tetsunori Kobayashi; Waseda University, Japan
Double-talk recognition under a distant microphone condition, a serious problem in speech applications in a real environment, is realized through the use of modified SAFIA and acoustic model adaptation or training. The original SAFIA is a high-performance audio segregation method based on band selection using two directivity microphones. We have modified SAFIA by adopting array signal processing and have realized optimal directivity for SAFIA. We also used generalized harmonic analysis (GHA) instead of FFT for the spectral analysis in SAFIA to remove the effect of windowing, which causes sound-quality degradation in SAFIA. These modifications of SAFIA enable good segregation in a human auditory sense, but the quality is still insufficient for recognition. Because SAFIA causes some particular distortion, we used MLLR-based acoustic model adaptation and immunity training to be robust to the distortion of SAFIA. These efforts enabled 76.2% word accuracy under the condition that the SN ratio is 0 dB; this represents a 45% reduction in the error obtained in the case where only array signal processing was used, and a 30% error reduction compared with when only SAFIA-based audio segregation was used.

SYNFACE – A Talking Face Telephone
Inger Karlsson 1, Andrew Faulkner 2, Giampiero Salvi 1; 1 KTH, Sweden; 2 University College London, U.K.
The SYNFACE project has as its primary goal to make it easier for hearing-impaired people to use an ordinary telephone. This will be achieved by using a talking face connected to the telephone. The incoming speech signal will govern the speech movements of the talking face; hence the talking face will provide lip-reading support for the user. The project will define the visual speech information that supports lip-reading, and develop techniques to derive this information from the acoustic speech signal in near real time for three different languages: Dutch, English and Swedish.
This requires the development of automatic speech recognition methods that detect information in the acoustic signal that correlates with the speech movements. This information will govern the speech movements in a synthetic face and synchronise them with the acoustic speech signal. A prototype system is being constructed. The prototype contains results achieved so far in SYNFACE. This system will be tested and evaluated for the three languages by hearing-impaired users. SYNFACE is an IST project (IST-2001-33327) with partners from the Netherlands, UK and Sweden. SYNFACE builds on experiences gained in the Swedish Teleface project. CFA-BF: A Novel Combined Fixed/Adaptive Beamforming for Robust Speech Recognition in Real Car Environments A Voice-Driven Web Browser for Blind People Xianxian Zhang, John H.L. Hansen; University of Colorado at Boulder, USA Among a number of studies which have investigated various speech enhancement and processing schemes for in-vehicle speech systems, the delay-and-sum beamforming (DASB) and adaptive beamforming are two typical methods that both have their advantages and disadvantages. In this paper, we propose a novel combined fixed/adaptive beamforming solution (CFABF) based on previous work for speech enhancement and recognition in real moving car environments, which seeks to take advantage of both methods. The Boštjan Vesnicer, Janez Žibert, Simon Dobrišek, Nikola Pavešić, France Mihelič; University of Ljubljana, Slovenia A small self-voicing Web browser designed for blind users is presented. The Web browser was built from the GTK Web browser Dillo, which is a free software project in terms of the GNU general public license. Additional functionality has been introduced to this original browser in form of different modules. The browser operates in two different modes, browsing mode and dialogue mode. 45 Eurospeech 2003 Tuesday In browsing mode user navigates through structure of Web pages using mouse and/or keyboard. When in dialogue mode, the dialogue module offers different actions and the user chooses between them using either keyboard or spoken-commands which are recognized by the speech-recognition module. The content of the page is presented to the user by screen-reader module which uses textto-speech module for its output. The browser is capable of displaying all common Web pages that do not contain frames, java or flash animations. However, the best performance is achieved when pages comply with the recommendations set by the WAI. The browser has been developed in Linux operating system and later ported to Windows 9x/ME/NT/2000/XP platform. Currently it is being tested by members of the Slovenian blind people society. Any suggestions or wishes from them will be considered for inclusion in future versions of the browser. Exploiting Speech for Recognizing Elderly Users to Respond to Their Special Needs Christian Müller, Frank Wittig, Jörg Baus; Saarland University, Germany In this paper we show how to exploit raw speech data to gain higher level information about the user in a mobile context. In particular we introduce an approach for the estimation of age and gender using well known machine learning techniques. On the basis of this information, systems like for example a mobile pedestrian navigation system, can be made adaptive to the special needs of a specific user group (here the elderly). 
First we provide a motivation for why we consider such an adaptation necessary; then we outline some adaptation strategies that are adequate for mobile assistants. The major part of the paper is about (a) identifying and extracting features of speech that are relevant for age and gender estimation and (b) classifying a particular speaker, treating uncertainty, and updating the user model over time. Finally we provide a short outlook on current work.

Spoken Language and E-Inclusion

Alan F. Newell; University of Dundee, U.K.

Speech technology can help people with disabilities. Blind and non-speaking people were amongst the first to be provided with commercially available speech synthesis systems, and, to this day, represent a much higher percentage of users of this technology than their numbers would predict. Speech synthesis technology has, for example, transformed the lives of many blind people, but the success of speech output in allowing blind people to word process, browse the web, and use domestic appliances should not lull us into a false sense of security. In the main, these users were young, aware of their limitations and of the substantial potential impact of such technology on their lifestyles, and were generally highly motivated to make a success of their use of the technology. The speech community needs to be aware of the major differences between the young disabled people who have found speech technology so useful and the other groups of people who are excluded from “e-society”. An example is older people. They have a much greater range of characteristics than younger people, and these characteristics change more rapidly with time. Very importantly for speech technologists, most older people possess multiple minor disabilities, which can interact seriously, particularly in the context of human-machine communication. In addition, a relatively high proportion of older people also have a major disability.

Acoustic Normalization of Children’s Speech

Georg Stemmer, Christian Hacker, Stefan Steidl, Elmar Nöth; Universität Erlangen-Nürnberg, Germany

Young speakers are not represented adequately in current speech recognizers. In this paper we focus on the problem of adapting the acoustic front-end of a speech recognizer that has been trained on adults’ speech to achieve better performance on children’s speech. We introduce and evaluate a method to perform non-linear VTLN by an unconstrained data-driven optimization of the filterbank. A second approach normalizes the speaking rate of the young speakers with the PSOLA algorithm. Significant reductions in word error rate have been achieved.

Session: OTuDc– Oral Speech Synthesis: Unit Selection II
Time: Tuesday 16.00, Venue: Room 3
Chair: Alan Black, CMU, USA

Unit Size in Unit Selection Speech Synthesis

S.P. Kishore 1, Alan W. Black 2; 1 International Institute of Information Technology, India; 2 Carnegie Mellon University, USA

In this paper, we address the issue of the choice of unit size in unit selection speech synthesis. We discuss the development of a Hindi speech synthesizer and our experiments with different choices of units: syllable, diphone, phone and half phone. Perceptual tests conducted to evaluate the quality of the synthesizers with different unit sizes indicate that the syllable synthesizer performs better than the phone, diphone and half phone synthesizers, and that the half phone synthesizer performs better than the diphone and phone synthesizers.
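The unit-selection papers in this session all rest on the same underlying search: for each target position, pick a database unit so that a weighted sum of target and concatenation costs over the whole utterance is minimized, usually with a Viterbi-style dynamic program. The sketch below is a generic illustration of that search only, not the Hindi synthesizer described above or any other specific system; the candidate lists, cost functions and weights are placeholders that a real engine would supply.

```python
import numpy as np

def select_units(candidates, target_cost, concat_cost, w_t=1.0, w_c=1.0):
    """Viterbi search over candidate units, minimizing target + join costs.

    candidates  : list over target positions; each entry is a list of unit ids
    target_cost : function (position, unit) -> cost of the unit at that position
    concat_cost : function (prev_unit, unit) -> cost of joining the two units
    """
    T = len(candidates)
    best = [np.full(len(c), np.inf) for c in candidates]   # best path cost per candidate
    back = [np.zeros(len(c), dtype=int) for c in candidates]
    for j, u in enumerate(candidates[0]):
        best[0][j] = w_t * target_cost(0, u)
    for t in range(1, T):
        for j, u in enumerate(candidates[t]):
            joins = [best[t - 1][i] + w_c * concat_cost(v, u)
                     for i, v in enumerate(candidates[t - 1])]
            i_best = int(np.argmin(joins))
            best[t][j] = joins[i_best] + w_t * target_cost(t, u)
            back[t][j] = i_best
    # trace the cheapest path back through the candidate lattice
    path = [int(np.argmin(best[-1]))]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    path.reverse()
    return [candidates[t][j] for t, j in enumerate(path)]
```

Because the join cost couples neighbouring positions, a greedy per-position choice is not sufficient; the dynamic program keeps the search exact while remaining linear in the number of target positions.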
Restricted Unlimited Domain Synthesis Antje Schweitzer, Norbert Braunschweiler, Tanja Klankert, Bernd Möbius, Bettina Säuberlich; University of Stuttgart, Germany This paper describes the hybrid unit selection strategy for restricted domain synthesis in the SmartKom dialog system. Restricted domains are characterized as being biased toward domain specific utterances while being unlimited in terms of vocabulary size. This entails that unit selection in restricted domains must deal with both domain specific and open-domain material. The strategy presented here combines the advantages of two existing unit selection approaches, motivated by the claim that the phonological structure matching approach is advantageous for domain specific parts of utterances, while the acoustic clustering algorithm is more appropriate for open-domain material. This dichotomy is also reflected in the speech database, which consists of a domain specific and an open-domain part. The text material for the open-domain part was constructed to optimize coverage of diphones and phonemes in different contexts. Evaluation of Units Selection Criteria in Corpus-Based Speech Synthesis Hélène François, Olivier Boëffard; IRISA, France This work comes within the scope of concatenative speech synthesis. We propose a method to evaluate the criteria used in units selection methods. Usually criteria are evaluated in a comparative black-box way : the performance of a criterion are measured relatively to other criteria performances, thus evaluation is not always discriminant or formative. We present a glas-box method to measure the performances of a criterion in an absolute way. The principle is to explore the possible sequences of units able to synthesize a given target utterance, to assign to each sequence a value X of objective quality and a value Y related to the tested criterion ; then mutual information I(X;Y) is calculated to measure the explicative power of the criterion Y in relation to the quality variable X. Results are encouraging concerning criteria associated to units types, but combinatorial problems weigh heavy for criteria related to units instances. Combining Non-Uniform Unit Selection with Diphone Based Synthesis Michael Pucher 1 , Friedrich Neubarth 1 , Erhard Rank 1 , Georg Niklfeld 1 , Qi Guan 2 ; 1 ftw., Austria; 2 Siemens Österreich AG, Austria This paper describes the unit selection algorithm of a speech synthesis system, which selects the k-best paths over units from a relational unit database. The algorithm uses words and diphones as basic unit types. It is part of a customisable text-to-speech system designed for generating new prompts using a recorded speech corpus, with the option that the user can interactively optimise the results from the unit selection algorithm. This algorithm combines 46 Eurospeech 2003 Tuesday advantages of non-uniform unit selection algorithms and diphone inventory based speech synthesis. Evolutionary Weight Tuning Based on Diphone Pairs for Unit Selection Speech Synthesis Francesc Alías 1 , Xavier Llorà 2 ; 1 Ramon Llull University, Spain; 2 University of Illinois at Urbana-Champaign, USA Unit selection text-to-speech (TTS) conversion is an ongoing research for the speech synthesis community. This paper is focused on tuning the weights involved in the target and concatenation cost metrics. We propose a method for automatically adjusting these weights simultaneously by means of diphone and triphone pairs. 
This method is based on techniques provided by the evolutionary computation community, taking advantage of their robustness in noisy domains. The experiments and their analyses demonstrate its good performance on this problem, thus overcoming some constraints assumed by previous work and leading to an interesting new framework for further investigation.

Keeping Rare Events Rare

Ove Andersen, Charles Hoequist; Aalborg University, Denmark

It has been claimed that corpus-based TTS is unworkable because it is not practical to include representative units to cover all or most of the combinations of segments and prosodic characteristics found in general texts, a problem characterized as Large Numbers of Rare Events (LNRE). We argue that part of this problem is in its formulation, and that a closer look, including investigations into corpus-based TTS for Danish, shows that LNRE need not be a fatal problem for inventory design in corpus-based TTS.

Session: OTuDd– Oral Language & Accent Identification
Time: Tuesday 16.00, Venue: Room 4
Chair: Stephen Cox, Univ. of East Anglia

Acoustic, Phonetic, and Discriminative Approaches to Automatic Language Identification

E. Singer, P.A. Torres-Carrasquillo, T.P. Gleason, W.M. Campbell, Douglas A. Reynolds; Massachusetts Institute of Technology, USA

Formal evaluations conducted by NIST in 1996 demonstrated that systems that used parallel banks of tokenizer-dependent language models produced the best language identification performance. Since that time, other approaches to language identification have been developed that match or surpass the performance of phone-based systems. This paper describes and evaluates three techniques that have been applied to the language identification problem: phone recognition, Gaussian mixture modeling, and support vector machine classification. A recognizer that fuses the scores of three systems that employ these techniques produces a 2.7% equal error rate (EER) on the 1996 NIST evaluation set and a 2.8% EER on the NIST 2003 primary condition evaluation set. An approach to dealing with the problem of out-of-set data is also discussed.

Using Place Name Data to Train Language Identification Models

Stanley F. Chen, Benoît Maison; IBM T.J. Watson Research Center, USA

The language of origin of a name affects its pronunciation, so language identification is an important technology for speech synthesis and recognition. Previous work on this task has typically used training sets that are proprietary or limited in coverage. In this work, we investigate the use of a publicly available geographic database for training language ID models. We automatically cluster place names by language, and show that models trained from place name data are effective for language ID on person names. In addition, we compare several source-channel and direct models for language ID, and achieve a 24% reduction in error rate over a source-channel letter trigram model on a 26-way language ID task.

Use of Trajectory Models for Automatic Accent Classification

Pongtep Angkititrakul, John H.L. Hansen; University of Colorado at Boulder, USA

This paper describes a proposed automatic language accent identification system based on phoneme class trajectory models. Our focus is to preserve the discriminant information of the spectral evolution that belongs to each accent. Here, we describe two classification schemes based on stochastic trajectory models: supervised and unsupervised classification. For supervised classification, we assume the text of the spoken words is known and integrate this into the classification scheme. Unsupervised classification uses a Multi-Trajectory Template, which represents the global temporal evolution of each accent. No prior text knowledge of the input speech is required for the unsupervised scheme. We also conduct human-perceptual accent classification experiments for comparison with automatic system performance. The experiments are conducted on 3 foreign accents (Chinese, Thai, and Turkish) along with native American English. Our experimental evaluation shows that supervised classification outperforms unsupervised classification by 11.5%. In general, supervised classification performance increases to 80% correct accent discrimination as we increase the phoneme sequence to 11 accent-sensitive phonemes.

NIST 2003 Language Recognition Evaluation

Alvin F. Martin, Mark A. Przybocki; National Institute of Standards and Technology, USA

The 2003 NIST Language Recognition Evaluation was very similar to the last such NIST evaluation in 1996. It was intended to establish a new baseline of current performance capability for language recognition of conversational telephone speech and to lay the groundwork for further research efforts in the field. The primary evaluation data consisted of excerpts from conversations in twelve languages from the CallFriend Corpus. These test segments had durations of approximately three, ten, or thirty seconds. Six sites from three continents participated in the evaluation. The best performance results were significantly improved from those of the previous evaluation.

Language Identification Using Parallel Sub-Word Recognition – An Ergodic HMM Equivalence

V. Ramasubramanian, A.K.V. Sai Jayram, T.V. Sreenivas; Indian Institute of Science, India

Recently, we have proposed a parallel sub-word recognition (PSWR) system for language identification (LID) in a framework similar to the parallel phone recognition (PPR) approach in the literature, but without requiring phonetic labeling of the speech data in any of the languages in the LID task. In this paper, we show the theoretical equivalence of PSWR and ergodic-HMM (E-HMM) based LID. Here, the front-end sub-word recognizer (SWR) and back-end language model (LM) of each language in PSWR correspond to the states and state-transitions of the E-HMM in that language. This equivalence unifies the parallel phone (sub-word) recognition and ergodic-HMM approaches, which have been treated as two distinct frameworks in the LID literature so far, thus providing further insights into both these frameworks.
On a 6-language LID task using the OGITS database, the E-HMM system achieves performances comparable to the PSWR system, offering clear experimental validation of their equivalence. 47 Eurospeech 2003 Tuesday September 1-4, 2003 – Geneva, Switzerland On the Combination of Speech and Speaker Recognition Speech Enhancement for a Car Environment Using LP Residual Signal and Spectral Subtraction Mohamed Faouzi BenZeghiba, Hervé Bourlard; IDIAP, Switzerland A. Álvarez, V. Nieto, P. Gómez, R. Martínez; Universidad Politécnica de Madrid, Spain This paper investigates an approach that maximizes the joint posterior probability of the pronounced word and the speaker identity given the observed data. This probability can be expressed as a product of the posterior probability of the pronounced word estimated through an artificial neural network (ANN), and the likelihood of the data estimated through a Gaussian mixture model (GMM). We show that the posterior probabilities estimated through a speaker-dependent ANN, as usually done in the hybrid HMM/ANN systems, are reliable for speech recognition but they are less reliable for speaker recognition. To alleviate this problem, we thus study how this posterior probability can be combined with the likelihood derived from a speaker-dependent GMM model to improve the speaker recognition performance. We thus end up with a joint model that can be used for text-dependent speaker identification and for speech recognition (and mutually benefiting from each other). Handsfree speaker input is mandatory to enable safe operation in cars. In those scenarios robust speech recognition emerges as one of the key technologies to produce voice control car devices. Through this paper, we propose a method of processing speech degraded by reverberation and noise in an automobile environment. This approach involves analyzing the linear prediction error signal to produce a weight function suitable for being combined with spectral subtraction techniques. The paper includes also an evaluation of the performance of the algorithm in speech recognition experiments. The results show a reduction of more than 30% in word error rate when the new speech enhancement frontend is applied. Speech Enhancement and Improved Recognition Accuracy by Integrating Wavelet Transform and Spectral Subtraction Algorithm Gwo-hwa Ju, Lin-shan Lee; National Taiwan University, Taiwan Session: PTuDe– Poster Speech Enhancement II Time: Tuesday 16.00, Venue: Main Hall, Level -1 Chair: Maurizio Omologo, ITC-irst Improving Speech Intelligibility by Steady-State Suppression as Pre-Processing in Small to Medium Sized Halls Nao Hodoshima 1 , Takayuki Arai 1 , Tsuyoshi Inoue 1 , Keisuke Kinoshita 1 , Akiko Kusumoto 2 ; 1 Sophia University, Japan; 2 Portland VA Medical Center, USA One of the reasons that reverberation degrades speech intelligibility is the effect of overlap-masking, in which segments of an acoustic signal are affected by reverberation components of previous segments [Bolt et al., 1949]. To reduce the overlap-masking, Arai et al. suppressed steady-state portions having more energy, but which are less crucial for speech perception, and confirmed promising results for improving speech intelligibility [Arai et al., 2002]. Our goal is to provide a pre-processing filter for each auditorium. To explore the relationship between the effect of a pre-processing filter and reverberation conditions, we conducted a perceptual test with steady-state suppression under various reverberation conditions. 
The results showed that processed stimuli performed better than unprocessed ones and clear improvements were observed for reverberation conditions of 0.8 - 1.0s. We certified that steady-state suppression was an effective pre-processing method for improving speech intelligibility under reverberant conditions and proved the effect of overlap-masking. Enhancement of Hearing-Impaired Mandarin Speech Chen-Long Lee 1 , Ya-Ru Yang 1 , Wen-Whei Chang 1 , Yuan-Chuan Chiang 2 ; 1 National Chiao Tung University, Taiwan; 2 National Hsinchu Teachers College, Taiwan Spectral subtraction (SS) approach has been widely used for speech enhancement and recognition accuracy improvement, but becomes less effective when the additive noise is not white. In this paper, we propose to integrate wavelet transform and the SS algorithm. The spectrum of the additive noise in each frequency band obtained in this way can then be better approximated as white if the number of bands is large enough, and therefore the SS approach can be more effective. Experimental results based on three objective performance measures and spectrogram-plot comparison show that this new approach can provide better performance especially when the noise is non-white. Listening test results also indicate that the new algorithm can give more preferable sound quality and intelligibility than the conventional spectral subtraction algorithm. Moreover, the new approach also offers some reductions of the computational complexity when compared with the conventional SS algorithm. Multi-Referenced Correction of the Voice Timbre Distortions in Telephone Networks Gaël Mahé 1 , André Gilloire 2 ; 1 Université René Descartes – Paris V, France; 2 France Télécom R&D, France In a telephone link, the voice timbre is impaired by spectral distortions generated by the analog parts of the link. We first evaluate from a perceptual point of view an equalization method consisting in matching the long term spectrum of the processed signal to a reference spectrum. This evaluation shows a satisfying restoration of the timbre for most speakers. For some speakers however, a noticeable spectral distortion remains. That is why we propose a multi-referenced equalizer, based on a classification of speakers and using a different reference spectrum for each class. This leads to a decrease of the spectral distortion and, as a consequence, to a significant improvement of the timbre correction. Efficient Speech Enhancement Based on Left-Right HMM with State Sequence Detection Using LRT This paper presents a new voice conversion system that modifies misarticulations and prosodic deviations of the hearing-impaired Mandarin speech. The basic strategy is the detection and exploitation of characteristic features that distinguish the impaired speech from the normal speech at segmental and prosodic levels. For spectral conversion, cepstral coefficients were characterized under the form of a Gaussian mixture model with parameters converted using a mapping function that minimizes the spectral distortion between the impaired and normal speech. We also proposed a VQ-based approach to prosodic conversion that involves modifying the features extracted from the pitch contour by orthogonal polynomial transform. Experimental results indicate that the proposed system appears useful in enhancing the hearing-impaired Mandarin speech. J.J. Lee 1 , J.H. Lee 2 , K.Y. 
Lee 1 ; 1 SoongSil University, Korea; 2 Dong-Ah Broadcasting College, Korea Since the conventional HMM (Hidden Markov Model)-based speech enhancement methods try to improve speech quality by considering all states for the state transition, hence introduce huge computational loads inappropriate to real-time implementation. In the Left-Right HMM (LR-HMM), only the current and the next states are considered for a possible state transition so to reduce the computation complexity. We propose a new speech enhancement algorithm based on LR-HMM with state sequence detection using LRT (Likelihood Ratio Test). Experimental results show that the proposed method improves the speed up with little degradation of speech quality compared to the conventional method. 48 Eurospeech 2003 Tuesday Introduction of the CELP Structure of the GSM Coder in the Acoustic Echo Canceller for the GSM Network September 1-4, 2003 – Geneva, Switzerland while that of the proposed beamformer will be increased by only 0.95dB. Therefore, the passband of the proposed GSC beamformer can be extended without loss of performance. Speech Enhancement Using A-Priori Information H. Gnaba 1 , M. Turki-Hadj Alouane 1 , M. Jaidane-Saidane 1 , P. Scalart 2 ; 1 Ecole Nationale d’Ingénieurs de Tunis, Tunisia; 2 France Télécom R&D, France Sriram Srinivasan, Jonas Samuelsson, W. Bastiaan Kleijn; KTH, Sweden This paper presents a new structure of an Acoustic Echo Canceller (AEC) designed to operate in the Mobile Switching Center (MSC) of a GSM network. The purpose of such system is to cancel the echo for all the subscribers. Contrarily to the conventional AEC, the proposed combined AEC/CELP Predictor is able to take into account the non linearities introduced by the GSM speech coders/decoders. A short term predictor is used to model the behavior of the codecs. This new combined system presents higher performance compared to the conventional AEC. Extracting an AV Speech Source from a Mixture of Signals David Sodoyer 1 , Laurent Girin 1 , Christian Jutten 2 , Jean-Luc Schwartz 1 ; 1 ICP-CNRS, France; 2 LIS-CNRS, France We present a new approach to the source separation problem for multiple speech signals. Using the extra visual information of the face speaker, the method aims to extract an acoustic speech signal from other acoustic signals by exploiting its coherence with the speaker’s lip movements. We define a statistical model of the joint probability of visual and spectral audio input for quantifying the audio-visual coherence. Then, separation can be achieved by maximising this joint probability. Experiments on additive mixtures of 2, 3 and 5 sources show that the algorithm performs well, and systematically better than the classical BSS algorithm JADE. Speech Enhancement for Hands-Free Car Phones by Adaptive Compensation of Harmonic Engine Noise Components Henning Puder; Darmstadt University of Technology, Germany This paper presents a method for enhancing speech disturbed by car noise. The proposed method cancels the powerful harmonic components of engine noise by adaptive filtering which utilizes the known rpm signal available on the CAN bus in modern cars. The procedure can be used as a preprocessing method for classical broad-band noise reduction as it is able to cancel the engine noise – and thus a large amount of low-frequent car noise – without provoking speech distortion. 
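Puder's abstract above cancels the harmonic engine-noise components with an adaptive filter whose reference is derived from the rpm signal available on the CAN bus; the rest of the abstract concerns the step-size control that makes this safe for speech. As a rough, generic illustration of the idea only (not the author's algorithm), the sketch below adapts one pair of quadrature weights per engine order with a plain fixed-step LMS update; the engine orders, the step size and the rpm-to-phase conversion are assumptions made for the example.

```python
import numpy as np

def cancel_engine_harmonics(noisy, rpm, fs, orders=(1, 2, 4), mu=0.01):
    """Subtract sinusoids locked to engine orders using per-order LMS weights.

    noisy  : microphone samples (speech + engine noise)
    rpm    : engine speed for every sample (same length as noisy)
    fs     : sampling rate in Hz
    orders : engine orders (multiples of the rotation frequency) to cancel
    mu     : fixed LMS step size; a time-varying step size is the hard part in practice
    """
    phase = np.cumsum(2.0 * np.pi * (np.asarray(rpm) / 60.0) / fs)  # rotation phase
    weights = np.zeros((len(orders), 2))                            # cos/sin weight per order
    out = np.empty(len(noisy))
    for t in range(len(noisy)):
        refs = np.array([[np.cos(k * phase[t]), np.sin(k * phase[t])] for k in orders])
        estimate = float(np.sum(weights * refs))   # current harmonic noise estimate
        e = noisy[t] - estimate                    # error signal doubles as the enhanced sample
        weights += mu * e * refs                   # LMS weight update
        out[t] = e
    return out
```

Because the references are narrowband sinusoids, only the engine harmonics are subtracted and the broadband speech passes through largely untouched, which is why such a stage can precede a classical noise-reduction filter.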
The main part of the paper is dedicated to the step-size control of the utilized LMS algorithm necessary for a complete cancellation of the harmonics without speech distortion. Therefore, first a theoretically optimal step-size is determined and then a procedure is described which allows its determination in real applications. The paper concludes with a presentation of results obtained with this approach. Enhance Low-Frequency Suppression of GSC Beamforming Zhaorong Hou, Ying Jia; Intel China Research Center, China Usually the generalized sidelobe canceller (GSC) beamformer requires additional highpass pre-filtering due to insufficient suppression of low frequency directional interference, and it deteriorates the bandwidth quality of speech enhancement, especially for small size microphone array. This paper proposes a new GSC beamformer with multiple frequency dependent norm-constrained adaptive filters (FD-NCAF), which combine bin-wise constraint in low frequency band and norm constraint in high frequency band, to improve the performance of the adaptive interference canceller (AIC) for low frequency interference. Simulation on five testing signals shows that directional response of the proposed beamformer is less sensitive to the spectrum of interference than the full-band GSC. In the experiments based on real recordings, when the cut-off frequency of highpass pre-filtering extended to lower frequency, the residual directional interference of the full-band GSC will be increased by 3.37dB, In this paper, we present a speech enhancement technique that uses a-priori information about both speech and noise. The a-priori information consists of speech and noise spectral shapes stored in trained codebooks. The excitation variances of speech and noise are determined through the optimization of a criterion that finds the best fit between the noisy observation and the model represented by the two codebooks. The optimal spectral shapes and variances are used in a Wiener filter to obtain an estimate of clean speech. The method uses both a-priori and estimated noise information to perform well in stationary as well as non-stationary noise environments. The high computational complexity resulting from a full search of joint speech and noise codebooks is avoided through an iterative optimization procedure. Experiments indicate that the method significantly outperforms conventional enhancement techniques, especially for non-stationary noise. Blind Inversion of Multidimensional Functions for Speech Enhancement John Hogden 1 , Patrick Valdez 1 , Shigeru Katagiri 2 , Erik McDermott 2 ; 1 Los Alamos National Laboratory, USA; 2 NTT Corporation, Japan We discuss speech production in terms of a mapping from a lowdimensional articulator space to low-dimensional manifold embedded in a high-dimensional acoustic space. Our discussion highlights the advantages of using an articulatory representation of speech. We then summarize mathematical results showing that, because articulator motions are bandlimited, a large class of mappings from articulation to acoustics can be blindly inverted. Simulation results showing the power of the inversion technique are also presented. One of the most interesting simulation results is that some manyto-one mappings can also be inverted. These results explain earlier experimental results that the studied technique can recover articulator positions. 
We conclude that our technique has many advantages for speech processing, including invariance with respect to various nonlinearities and the ability to exploit context more easily. Convergence Improvement for Oversampled Subband Adaptive Noise and Echo Cancellation H.R. Abutalebi 1 , H. Sheikhzadeh 2 , R.L. Brennan 2 , G.H. Freeman 3 ; 1 Amirkabir University of Technology, Iran; 2 Dspfactory Ltd., Canada; 3 University of Waterloo, Canada The convergence rate of the Least Mean Square (LMS) algorithm is dependent on the eigenvalue distribution of the reference input correlation matrix. When adaptive filters are employed in low-delay over-sampled subband structures, colored subband signals considerably decelerate the convergence speed. Here, we propose and implement two promising techniques for improving the convergence rate based on: 1) Spectral emphasis and 2) Decimation of the subband signals. We analyze the effects of the proposed methods based on theoretical relationships between eigenvalue distribution and convergence characteristics. We also propose a combined decimation and spectral emphasis whitening technique that exploits the advantages of both methods to dramatically improve the convergence rate. Moreover, through decimation the combined whitening approach reduces the overall computation cost compared to subband LMS with no pre-processing. Presented theoretical and simulation results confirm the effectiveness of the proposed convergence improvement methods. A Speech Dereverberation Method Based on the MTF Concept Masashi Unoki, Keigo Sakata, Masato Akagi; JAIST, Japan This paper proposes a speech dereverberation method based on 49 Eurospeech 2003 Tuesday the MTF concept. This method can be used without measuring the impulse response of room acoustics. In the model, the power envelopes and carriers are decomposed from a reverberant speech signal using an N-channel filterbank and then are dereverberated in each respective channel. In the envelope dereverberation process, a power envelope inverse filtering method is used to dereverberate the envelopes. In the carrier regeneration process, a carrier generation method based on voiced/unvoiced speech from the estimated fundamental frequency (F0) is used. In this paper, we assume that F0 has been estimated accurately. We have carried out 15,000 simulations of dereverberation for reverberant speech signals to evaluate the proposed model. We found that the proposed model can accurately dereverberate not only the power envelopes but also the speech signal from the reverberant speech using regenerated carriers. Accuracy Improved Double-Talk Detector Based on State Transition Diagram SangGyun Kim, Jong Uk Kim, Chang D. Yoo; KAIST, Korea A double-talk detector (DTD) is generally used with an acoustic echo canceller (AEC) in pinpointing the region where far-end and nearend signal coexist. This region is called double-talk and during this region AEC usually freezes the adaptation. Decision variable used in DTD has a relatively longer transient time going from double-talk to single-talk than time going in opposite direction. Therefore, using a single threshold to pinpoint the location of double-talk region can be difficult. In this paper, a DTD based on a novel state transition diagram and a decision variable which requires minimal computational overhead is proposed to improve the accuracy of pinpointing the location. The use of different thresholds according to the state helps the DTD locate double-talk region more accurately. 
The proposed DTD algorithm is evaluated by obtaining a receiver operating characteristic (ROC) and is compared to that of Cho’s DTD. Perceptual Based Speech Enhancement for Normal-Hearing & Hearing-Impaired Individuals Ajay Natarajan, John H.L. Hansen, Kathryn Arehart, Jessica A. Rossi-Katz; University of Colorado at Boulder, USA This paper describes a new noise suppression scheme with the goal of improving speech-in-noise perception for hearing-impaired listeners. Following the work of Tsoukalas et al. (1997) [4], Arehart et al (2003) [3] implemented and evaluated a noise suppression algorithm based on an approach that used the auditory masked threshold in conjunction with a version of spectral subtraction to adjust the enhancement parameters based on the masked threshold of the noise across the frequency spectrum. That original formulation was based on masking properties of the normal auditory system, with its theoretical underpinnings based on MPEG-4 audio coding [6]. We describe here a revised formulation, which is more suitable for hearing aid applications and which addresses changes in masking that occur with cochlear hearing loss. In contrast to previous formulations, the algorithm described here is implemented with generalized minimum mean square error estimators, which provide improvements over spectral subtraction estimators [1]. Second, the frequency resolution of the cochlea is described with auditory filter equivalent rectangular bandwidths (ERBs) [2] rather than the critical band scale. Third, estimation of the auditory masked thresholds and masking spreading functions are adjusted to address elevated thresholds and broader auditory filters characteristic of cochlear hearing loss. Fourth, the current algorithm does not include the tonality offset developed for use in MPEG-4 audio coding applications. The scheme also shows an overall improvement of 11% in the Itakura-Saito distortion measure. Residual Echo Power Estimation for Speech Reinforcement Systems in Vehicles Alfonso Ortega, Eduardo Lleida, Enrique Masgrau; University of Zaragoza, Spain In acoustic echo cancellation systems, some residual echo exists after the acoustic echo canceller (AEC) due to the fact that the adaptive filter does not model exactly the impulse response of the Loudspeaker-Enclosure-Microphone (LEM) path. This is specially September 1-4, 2003 – Geneva, Switzerland important in feedback acoustic environments like speech reinforcement systems for cars where this residual echo can make the system become unstable. In order to suppress this residual echo remaining after the AEC, postfiltering is the most used technique. The optimal filter that ensures stability without attenuating the speech signal depends on the power spectral density (psd) of the residual echo that must be estimated. This paper presents a residual echo psd estimation method needed to obtain the optimal echo suppression filter in speech reinforcement systems for cars. Dual-Mode Wideband Speech Recovery from Narrowband Speech Yasheng Qian, Peter Kabal; McGill University, Canada The present public telephone networks trim o. the lowband (50300 Hz) and the highband (3400-7000 Hz) components of sounds. As a result, telephone speech is characterized by thin and muffled sounds, and degraded speaker identification. The lowband components are deterministically recoverable, while the missing highband can be recovered statistically. We develop an equalizer to restore the lowband parts. 
The highband parts are filled in using a linear prediction approach. The highband excitation is generated using a bandpass envelope modulated Gaussian signal and the spectral envelope is generated using a Gaussian Mixture Model. The mean log-spectrum distortion decreases by 0.96 dB, comparing to a previous method using wideband reconstruction with a VQ codebook mapping algorithm. Informal subjective tests show that the reconstructed wideband speech enhances lowband sounds and regenerates realistic highband components. A Robust Noise and Echo Canceller Khaldoon Al-Naimi, Christian Sturt, Ahmet Kondoz; University of Surrey, U.K. The performance of an echo canceller systems deployed in a practical communication environment (i.e. the presence of background noise and the possible double talk scenario) depends on an accurate Voice Activity Detector (VAD) and an effective filter coefficient adaptation strategies. Accuracy of the VAD, which affects the coefficient adaptation strategy, is itself affected by the presence of background noise. In this paper, a novel soft weighting approach is proposed to replace the VAD and filter coefficient adaptation strategy. The robustness of the echo canceller system is further improved through integrating it with a noise suppression algorithm. The integrated echo canceller and noise suppressor systems has shown excellent performances under double talk scenarios with SNR as low as 5 dB. Computational Auditory Scene Analysis by Using Statistics of High-Dimensional Speech Dynamics and Sound Source Direction Johannes Nix, Michael Kleinschmidt, Volker Hohmann; Universität Oldenburg, Germany A main task for computational auditory scene analysis (CASA) is to separate several concurrent speech sources. From psychoacoustics it is known that common onsets, common amplitude modulation and sound source direction are among the important cues which allow the separation for the human auditory system. A new algorithm for binaural signals is presented here, that performs statistical estimation of two speech sources by a state-space approach which integrates temporal and frequency-specific features of speech. It is based on a Sequential Monte Carlo (SMC) scheme and tracks magnitude spectra and direction on a frameby-frame basis. First results for estimating sound source direction and separating the envelopes of two voices are shown. The results indicate that the algorithm is able to localize two superimposed sound sources in a time scale of 50 ms. This is achieved by integrating measured high-dimensional statistics of speech. Also, the algorithm is able to track the short-time envelope and the shorttime magnitude spectra of both voices on a time scale of 10 - 40 ms. The algorithm presented in this paper is developed for but not restricted to use in binaural hearing aid applications, as it is based on two head-mounted microphone signals as input. It is conceptionally able to separate more than two voices and integrate additional cues. 50 Eurospeech 2003 Tuesday September 1-4, 2003 – Geneva, Switzerland have an average 34∼37%, 9% higher accuracy than the speakerindependent acoustic models, respectively. The experimental results of Korean phone and word recognition confirmed the significant performance increase in small adaptation utterances compared with without any speaker adaptation. 
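The dual-mode wideband recovery abstract above regenerates the missing 3.4-7 kHz band by shaping Gaussian excitation with a statistically estimated spectral envelope. The fragment below is only a schematic of that idea under simplifying assumptions: the highband all-pole envelope and gain are taken as given (in the paper they come from a Gaussian Mixture Model trained on wideband speech), and a crude FFT mask stands in for a proper synthesis filterbank.

```python
import numpy as np
from scipy.signal import lfilter

def regenerate_highband(nb_speech, hb_lpc, hb_gain, fs=16000):
    """Add a synthetic highband to a narrowband signal already resampled to fs.

    hb_lpc  : all-pole envelope coefficients [1, a1, ..., ap] for the highband
              (assumed available here; statistically estimated in practice)
    hb_gain : target RMS level of the regenerated band
    """
    excitation = np.random.randn(len(nb_speech))           # white Gaussian excitation
    highband = lfilter([1.0], hb_lpc, excitation)          # shape it with the envelope
    highband *= hb_gain / (np.sqrt(np.mean(highband ** 2)) + 1e-12)
    spec = np.fft.rfft(highband)                           # keep only the 3.4-7 kHz region
    freqs = np.fft.rfftfreq(len(highband), d=1.0 / fs)
    spec[(freqs < 3400) | (freqs > 7000)] = 0.0
    highband = np.fft.irfft(spec, n=len(highband))
    return nb_speech + highband
```

The lowband (50-300 Hz) case is different: as the abstract notes, it can be restored deterministically with an equalizer, so no statistical excitation model is needed there.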
Session: PTuDf– Poster Speech Recognition - Adaptation I Time: Tuesday 16.00, Venue: Main Hall, Level -1 Chair: Richard Stern, CMU, USA Vocal Tract Normalization as Linear Transformation of MFCC Reduction of Dimension of HMM Parameters Using ICA and PCA in MLLR Framework for Speaker Adaptation Michael Pitz, Hermann Ney; RWTH Aachen, Germany Jiun Kim, Jaeho Chung; Inha University, Korea We have shown previously that vocal tract normalization (VTN) results in a linear transformation in the cepstral domain. In this paper we show that Mel-frequency warping can equally well be integrated into the framework of VTN as linear transformation on the cepstrum. We show examples of transformation matrices to obtain VTN warped Mel-frequency cepstral coefficients (VTN-MFCC) as linear transformation of the original MFCC and discuss the effect of Mel-frequency warping on the Jacobian determinant of the transformation matrix. Finally we show that there is a strong interdependence of VTN and Maximum Likelihood Linear Regression (MLLR) for the case of Gaussian emission probabilities. We discuss how to reduce the number of inverse matrix and its dimensions requested in MLLR framework for speaker adaptation. To find a smaller set of variables with less redundancy, we employ PCA (principal component analysis) and ICA (independent component analysis) that would give as good a representation as possible. The amount of additional computation when PCA or ICA is applied is as small as it can be disregarded. The dimension of HMM parameters is reduced to about 1/3∼2/7 dimensions of SI (speaker independent) model parameter with which speech recognition system represents word recognition rate as much as ordinary MLLR framework. If dimension of SI model parameter is n , the amount of computation of inverse matrix in MLLR is proportioned to O(n4 ). So, compared with ordinary MLLR, the amount of total computation requested in speaker adaptation is reduced to about 1/80∼1/150. Non-Native Spontaneous Speech Recognition Through Polyphone Decision Tree Specialization Zhirong Wang, Tanja Schultz; Carnegie Mellon University, USA With more and more non-native speakers speaking in English, the fast and efficient adaptation to non-native English speech becomes a practical concern. The performance of speech recognition systems is consistently poor on non-native speech. The challenge for non-native speech recognition is to maximize the recognition performance with small amount of non-native data available. In this paper we report on the effectiveness of using polyphone decision tree specialization method for non-native speech adaptation and recognition. Several recognition results are presented by using nonnative speech from German speakers. Results obtained from the experiments demonstrate the feasibility of this method. Live Speech Recognition in Sports Games by Adaptation of Acoustic Model and Language Model Yasuo Ariki 1 , Takeru Shigemori 1 , Tsuyoshi Kaneko 1 , Jun Ogata 2 , Masakiyo Fujimoto 1 ; 1 Ryukoku University, Japan; 2 AIST, Japan This paper proposes a method to automatically extract keywords from baseball radio speech through LVCSR for highlight scene retrieval. For robust recognition, we employed acoustic and language model adaptation. In acoustic model adaptation, supervised and unsupervised adaptations were carried out using MLLR+MAP. By this two level adaptation, word accuracy was improved by 28%. In language model adaptation, language model fusion and pronunciation modification were carried out. 
This adaptation showed 13% improvement at word accuracy. Finally, by integrating both adaptations, 38% improvement was achieved at word accuracy level and 28% improvement at keyword accuracy level. Speaker Adaptation Using Regression Classes Generated by Phonetic Decision Tree-Based Successive State Splitting Se-Jin Oh 1 , Kwang-Dong Kim 1 , Duk-Gyoo Roh 1 , Woo-Chang Sung 2 , Hyun-Yeol Chung 2 ; 1 Korea Astronomy Observatory, Korea; 2 Yeungnam University, Korea In this paper, we propose a new generation of regression classes for MLLR speaker adaptation method using the PDTSSS algorithm so as to represent the characteristic of speaker effectively. This method extends the state splitting through clustering the context components of adaptation data into a tree structure. It enables to autonomously control a number of adaptation parameters (mean, variance) depending on the context information and the amount of adaptation utterances from a new speaker. Through the experiments, the phone and word recognition rates with adaptation Geometric Constrained Maximum Likelihood Linear Regression On Mandarin Dialect Adaptation Huayun Zhang, Bo Xu; Chinese Academy of Sciences, China This paper presents a geometric constrained transformation approach for fast acoustic adaptation, which improves the modeling resolution of the conventional Maximum Likelihood Linear Regression (MLLR). For this approach, the underlying geometry difference between the seed and the target spaces is exposed and quantified, and used as a prior knowledge to reconstruct refiner transforms. Ignoring dimensions that have minor affections to this difference, the transform could be constrained to a lower rank subspace. And only distortions within this subspace are to be refined in a cascaded process. Compared to previous cascade method, we employ a different parameterization and obtain a higher resolution. At the same time, since the geometric span for refiner transforms is highly controlled, it could be adapted quickly. So, it could achieve a better tradeoff between resolution and robustness. In Mandarin dialect adaptations, this approach provides 4∼9% word-error-rate relative decrease over MLLR and 3∼5% over previous cascade method correspondingly with varying amounts of data. Adapting Language Models for Frequent Fixed Phrases by Emphasizing N-Gram Subsets Tomoyosi Akiba 1 , Katunobu Itou 1 , Atsushi Fujii 2 ; 1 AIST, Japan; 2 University of Tsukuba, Japan In support of speech-driven question answering, we propose a method to construct N-gram language models for recognizing spoken questions with high accuracy. Question-answering systems receive queries that often consist of two parts: one conveys the query topic and the other is a fixed phrase used in query sentences. A language model constructed by using a target collection of QA, for example, newspaper articles, can model the former part, but cannot model the latter part appropriately. We tackle this problem as task adaptation from language models obtained from background corpora (e.g., newspaper articles) to the fixed phrases, and propose a method that does not use the task-specific corpus, which is often difficult to obtain, but instead uses only manually listed fixed phrases. The method emphasizes a subset of N-grams obtained from a background corpus that corresponds to fixed phrases specified by the list. Theoretically, this method can be regarded as maximizing a posteriori probability (MAP) estimation using the subset of the N-grams as a posteriori distribution. 
Some experiments show the effectiveness of our method. 51 Eurospeech 2003 Tuesday Learning Intra-Speaker Model Parameter Correlations from Many Short Speaker Segments Anne K. Kienappel; Philips Research Laboratories, Germany Very rapid speaker adaptation algorithms, such as eigenvoices or speaker clustering, typically rely on learning intra-speaker correlations of model parameters from the training data. On the base of this a-priori knowledge, many model parameters can be successfully adapted on the basis of few observations. However, eigenvoice training or speaker clustering is non-trivial with training databases containing many short speaker segments, where for each speaker the available data to detect intra-speaker correlations is sparse. We have trained eigenvoices that yield a small but significant word error rate reduction in on-line adaptation (i.e. self adaptation) for a telephony database with on average only 5 seconds of speech per speaker in training and test data. Modeling Cantonese Pronunciation Variation by Acoustic Model Refinement Patgi Kam 1 , Tan Lee 1 , Frank K. Soong 2 ; 1 Chinese University of Hong Kong, China; 2 ATR-SLT, Japan Pronunciation variations can be roughly classified into two types: a phone change or a sound change [1][2]. A phone change happens when a canonical phone is produced as a different phone. Such a change can be modeled by converting the baseform (standard) phone to a surfaceform (actual) phone. A sound change happens at a lower, phonetic or subphonetic level within a phone and it cannot be modeled well by either the baseform or the surfaceform phone alone. We propose here to refine the acoustic models to cope with sound changes by (1) sharing the Gaussian mixture components of HMM states in the baseform and the surfaceform models; (2) adapting the mixture components of the baseform models towards those of the surfaceform models; (3) selectively reconstructing new acoustic models through sharing or adapting. The proposed pronunciation modeling algorithms are generic and can, in principle, be applied to different languages. Specifically, they were tested in a Cantonese speech recognition database. Relative word error rate reductions of 5.45%, 2.53%, and 3.04% have been achieved using the three approaches, respectively. Performance Improvement of Rapid Speaker Adaptation Based on Eigenvoice and Bias Compensation Jong Se Park, Hwa Jeon Song, Hyung Soon Kim; Pusan National University, Korea In this paper, we propose the bias compensation methods and the eigenvoice method using the mean of dimensional eigenvoice to improve the performance of rapid speaker adaptation based on eigenvoice. Experimental results for vocabulary-independent word recognition task shows the proposed method yields improvements for a small adaptation data. We obtained 22∼30% relative improvement by the bias compensation methods, and obtained 41% relative improvement by the eigenvoice method using the mean of dimensional eigenvoice with only single adaptation word. Training Data Optimization for Language Model Adaptation September 1-4, 2003 – Geneva, Switzerland tion from two large variable quality out-domain data sets for our task. Then a new algorithm is proposed to adjust the n-gram distribution of the two data sets to that of a task-specific but small data set. We consider preventing over-fitting problem in adaptation. All resulting models are evaluated on the realistic application of email dictation. 
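Several of the language-model adaptation abstracts in this session, including the N-gram subset emphasis above and the out-of-domain data selection described just above, ultimately combine a small task-specific distribution with a large background one. The snippet below shows only the simplest such combination, linear interpolation with the mixture weight picked on held-out text; it is a generic baseline for orientation, not the method of any of these papers, and the probability functions are assumed to be supplied by existing models.

```python
import math

def interpolate(p_in, p_out, lam):
    """Adapted model: lam * in-domain probability + (1 - lam) * background probability."""
    return lambda word, hist: lam * p_in(word, hist) + (1.0 - lam) * p_out(word, hist)

def perplexity(p, heldout):
    """heldout is a list of (word, history) pairs from task-specific text."""
    log_sum = sum(math.log(max(p(w, h), 1e-12)) for w, h in heldout)
    return math.exp(-log_sum / len(heldout))

def tune_weight(p_in, p_out, heldout, grid=21):
    """Grid-search the interpolation weight that minimizes held-out perplexity."""
    candidates = [i / (grid - 1) for i in range(grid)]
    return min(candidates,
               key=lambda lam: perplexity(interpolate(p_in, p_out, lam), heldout))
```

A single global weight is the crudest choice; the papers here differ precisely in how the mixing is made history-dependent or restricted to selected N-gram subsets.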
Experiments show that each method achieves better performance, and the combined method achieves a perplexity reduction of 24% to 80%. Approaches to Foreign-Accented Speaker-Independent Speech Recognition Stefanie Aalburg, Harald Hoege; Siemens AG, Germany Current research in the area of foreign-accented speech recognition focusses either on acoustic model adaptation or speaker-dependent pronunciation variation modeling. In this paper both approaches are applied in parallel and in a speaker-independent fashion: the acoustic modeling part is based on a derived Hidden Markov Model (HMM) clustering algorithm and the lexicon adaptation is based on speaker-independent multiple pronunciation rules. The pronunciation rules are derived using phoneme-level pronunciation scores. Foreign-accented speech was simulated with Columbian Spanish and Spanish of Spain and the experiments showed an improved recognition performance for the acoustic modeling part and identical recognition results when adding pronunciation variants to the lexica. Both results are taken as indicators for an improved recognition performance when applied on real foreign-accented speech. The present limited availability of foreign-accented speech databases, however, clearly merits further investigations. Unsupervised Speaker Adaptation Based on HMM Sufficient Statistics in Various Noisy Environments Shingo Yamade, Akinobu Lee, Hiroshi Saruwatari, Kiyohiro Shikano; Nara Institute of Science and Technology, Japan Noise and speaker adaptation techniques are essential to realize robust speech recognition in noisy environments. In this paper, first, a noise robust speech recognition algorithm is implemented by superimposing a small quantity of noise data on spectral subtracted input speech. According to the recognition experiments, 30dB SNR noise superimposition on input speech after spectral subtraction increases the robustness against different noises significantly. Next, we apply this noise robust speech recognition to the unsupervised speaker adaptation algorithm based on HMM sufficient statistics in different noise environments. The HMM sufficient statistics for each speaker are calculated from 25dB SNR office noise added speech database beforehand. We evaluate successfully our proposed unsupervised speaker adaptation algorithm in noisy environments with 20k dictation task using 11 kinds of different noises, including office, car, exhibition, and crowd noises. Using Genetic Algorithms for Rapid Speaker Adaptation Fabrice Lauri, Irina Illina, Dominique Fohr, Filipp Korkmazsky; LORIA, France Xiaoshan Fang 1 , Jianfeng Gao 2 , Jianfeng Li 3 , Huanye Sheng 1 ; 1 Shanghai Jiao Tong University, China; 2 Microsoft Research Asia, China; 3 University of Science and Technology of China, China Language model (LM) adaptation is a necessary step when the LM is applied to speech recognition. The task of LM adaptation is to use out-domain data to improve in-domain model’s performance since the available in-domain (task-specific) data set is usually not large enough for LM training. LM adaptation faces two problems. One is the poor quality of the out-domain training data. The other is the mismatch between the n-gram distribution in out-domain data set and that in in-domain data set. This paper presents two methods, filtering and distribution adaptation, to solve them respectively. 
First, a bootstrapping method is presented to filter suitable por- This paper proposes two new approaches to rapid speaker adaptation of acoustic models by using genetic algorithms. Whereas conventional speaker adaptation techniques yield adapted models which represent local optimum solutions, genetic algorithms are capable to provide multiple optimal solutions, thereby delivering potentially more robust adapted models. We have investigated two different strategies of application of the genetic algorithm in the framework of speaker adaptation of acoustic models. The first approach (GA) consists in using a genetic algorithm to adapt the set of Gaussian means to a new speaker. The second approach (GA + EV) uses the genetic algorithm to enrich the set of speaker-dependant systems employed by the EigenVoices. Experiments with the Resource Management corpus show that, with one adaptation utterance, GA can improve the performances of a speaker-independent 52 Eurospeech 2003 Tuesday system as efficiently as EigenVoices. The method GA + EV outperforms EigenVoices. Structural State-Based Frame Synchronous Compensation Vincent Barreaud, Irina Illina, Dominique Fohr, Filipp Korkmazsky; LORIA, France In this paper we present improvements of a frame-synchronous noise compensation algorithm that uses Stochastic Matching approach to cope with time-varying unknown noise. We propose to estimate a hierarchical mapping function in parallel with Viterbi alignment. The structure of the transformation tree is build from the states of acoustical models. The objective of this hierarchical transformation is to better compensate non-linear distortions of the feature space. The technique is entirely general since no assumption is made on the nature, level and variation of noise. Our algorithm is evaluated on the VODIS database recorded in a moving car. For various tasks, proposed technique significantly outperforms classical compensation/ adaptation methods. Effect of Foreign Accent on Speech Recognition in the NATO N-4 Corpus Aaron D. Lawson 1 , David M. Harris 2 , John J. Grieco 3 ; 1 Research Associates for Defense Conversion, USA; 2 ACS Defense Inc., USA; 3 Air Force Research Laboratory, USA We present results from a series of 151 speech recognition experiments based on the N4 corpus of accented English speech, using a small vocabulary recognition system. These experiments looked at the impact of foreign accent on speech recognition, both within nonnative accented English and across different accents, with particular interest in using context free grammar technology to improve callsign identification. Results show that phonetic models built from foreign accented English are not less accurate than native ones at decoding novel data with the same accent. Cross accent recognition experiments show that phonetic models from a given accent group were 1.8 times less accurate in recognizing speech from a different accent. In contrast to other attempts to perform accurate recognition across accents, our approach of training very compact, accent-specific models (less than 3 hours of speech) provided very accurate results without the arduous task of adapting a phonetic dictionary to every accent. Duration Normalization and Hypothesis Combination for Improved Spontaneous Speech Recognition Jon P. Nedel, Richard M. 
Jon P. Nedel, Richard M. Stern; Carnegie Mellon University, USA When phone segmentations are known a priori, normalizing the duration of each phone has been shown to be effective in overcoming weaknesses in the duration modeling of Hidden Markov Models (HMMs). While we have observed potential relative reductions in word error rate (WER) of up to 34.6% with oracle segmentation information, it has been difficult to achieve significant improvement in WER with segmentation boundaries that are estimated blindly. In this paper, we present simple variants of our duration normalization algorithm, which make use of blindly-estimated segmentation boundaries to produce different recognition hypotheses for a given utterance. These hypotheses can then be combined for significant improvements in WER. With oracle segmentations, WER reductions of up to 38.5% are possible. With automatically derived segmentations, this approach has achieved a reduction in WER of 3.9% for the Broadcast News corpus, 6.2% for the spontaneous register of the MULT_REG corpus, and 7.7% for a spontaneous corpus of connected Spanish digits collected by Telefónica Investigación y Desarrollo.

On Divergence Based Clustering of Normal Distributions and Its Application to HMM Adaptation Tor André Myrvoll 1, Frank K. Soong 2; 1 NTNU, Norway; 2 ATR-SLT, Japan We present an algorithm for clustering multivariate normal distributions based upon the symmetric Kullback-Leibler divergence. The optimal mean vector and covariance matrix of the centroid normal distribution are derived, and a set of Riccati matrix equations is used to find the optimal covariance matrix. The solutions are found iteratively by alternating the intermediate mean and covariance solutions. Clustering performance of the new algorithm is shown to be superior to that of non-optimal sample mean and covariance solutions. It achieves a lower overall distortion and flatter distributions of pdf samples across clusters. The resultant optimal clusters were further tested on the Wall Street Journal database for adapting HMM parameters in a Structured Maximum A Posteriori Linear Regression (SMAPLR) framework. The recognition performance was significantly improved, and the word error rate was reduced from 32.6% for a non-optimal centroid (sample mean and covariance) to 27.6% and 27.5% for the diagonal and full covariance matrix cases, respectively.

Fast Incremental Adaptation Using Maximum Likelihood Regression and Stochastic Gradient Descent Sreeram V. Balakrishnan; IBM T.J. Watson Research Center, USA Adaptation to a new speaker or environment is becoming very important as speech recognition systems are deployed in unpredictable real-world situations. Constrained or Feature-space Maximum Likelihood Regression (fMLLR) [1] has proved to be especially effective for this purpose, particularly when used for incremental unsupervised adaptation [2]. Unfortunately, the standard implementation described in [1] and used by most authors since requires statistics that take O(n^3) operations to collect per frame. In addition, the statistics require O(n^3) space for storage, and the estimation of the feature transform matrix requires O(n^4) operations. This is an unacceptable cost for most embedded speech recognition systems. In this paper we show that the fMLLR objective function can be optimized using stochastic gradient descent in a way that achieves almost the same results as the standard implementation. All this is accomplished with an algorithm that requires only O(n^2) operations per frame and O(n^2) storage. This order-of-magnitude saving allows continuous adaptation to be implemented in most resource-constrained embedded speech recognition applications.

Maximum A Posteriori Linear Regression (MAPLR) Variance Adaptation for Continuous Density HMMs Wu Chou 1, Xiaodong He 2; 1 Avaya Labs Research, USA; 2 University of Missouri, USA In this paper, the theoretical framework of maximum a posteriori linear regression (MAPLR) based variance adaptation for continuous density HMMs is described. In our approach, a class of informative prior distributions for MAPLR based variance adaptation is identified, from which the closed-form solution of MAPLR based variance adaptation is obtained under its EM formulation. Effects of the proposed prior distribution in MAPLR based variance adaptation are characterized and compared with conventional maximum likelihood linear regression (MLLR) based variance adaptation. These findings provide a consistent Bayesian theoretical framework to incorporate prior knowledge in linear regression based variance adaptation. Experiments on large vocabulary speech recognition tasks were performed. The experimental results indicate that a significant performance gain over MLLR based variance adaptation can be obtained with the proposed approach.

Using Both Global and Local Hidden Markov Models for Automatic Speech Unit Segmentation Hong Zheng 1, Yiqing Lu 2; 1 CASCO (ALSTOM) Signal Ltd., China; 2 Motorola China Research Center, China This paper describes an effective method for automatic speech unit segmentation. Based on hidden Markov models (HMMs), an initial estimation of the segmentation derived from the explicit phonetic transcription is processed by our local HMM training algorithm. With reliable silence boundaries obtained by a silence detector, this algorithm tries different training methods to overcome the insufficient-training-data problem. The performance is tested on a Mandarin TTS speech corpus. The results show that using this method, a 14.98% improvement is achieved in the boundary detection error rate (deviation larger than 20 ms).

Session: PTuDg– Poster Speech Resources & Standards Time: Tuesday 16.00, Venue: Main Hall, Level -1 Chair: Bruce Millar, Australian National University, Australia

Tfarsdat – The Telephone Farsi Speech Database Mahmood Bijankhan 1, Javad Sheykhzadegan 2, Mahmood R. Roohani 2, Rahman Zarrintare 2, Seyyed Z. Ghasemi 1, Mohammad E. Ghasedi 2; 1 University of Tehran, Iran; 2 Research Center of Intelligent Signal Processing, Iran This paper describes ongoing research to create an acoustic-phonetic based telephone Farsi speech database, called “Tfarsdat”. It is compared with two LDC Farsi corpora, OGI and CallFriend, in terms of corpus dialectology. Up to now, we have recorded about 8 hours of monologue calls containing spontaneous and read speech for 64 speakers belonging to one of ten dialect regions. A hierarchical annotation system is used to transcribe the phoneme, word and sentence levels of the speech data. User software is written to access speech and label files efficiently using a menu-driven query system. We conducted two experiments to validate Tfarsdat statistically. Results showed the necessity of increasing the number of speakers and also of enhancing the quality of the annotation system.
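As a purely illustrative aside to the speech unit segmentation abstract above, the boundary detection error rate it reports (the fraction of boundaries deviating by more than 20 ms) can be sketched as follows; the nearest-boundary matching rule and all variable names are our own assumptions, not taken from the paper:

import numpy as np

def boundary_error_rate(ref_boundaries, hyp_boundaries, tol=0.020):
    """Fraction of reference phone boundaries whose nearest hypothesised
    boundary deviates by more than `tol` seconds (20 ms by default)."""
    ref = np.asarray(ref_boundaries, dtype=float)
    hyp = np.asarray(hyp_boundaries, dtype=float)
    # distance from each reference boundary to the closest hypothesis boundary
    dists = np.min(np.abs(ref[:, None] - hyp[None, :]), axis=1)
    return float(np.mean(dists > tol))

# Toy usage: boundary times in seconds.
ref = [0.12, 0.27, 0.40, 0.55]
hyp = [0.11, 0.30, 0.41, 0.58]
print(boundary_error_rate(ref, hyp))   # 0.5 -> two of four boundaries off by >20 ms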
Quality Control of Language Resources at ELRA Henk van den Heuvel 1, Khalid Choukri 2, Harald Höge 3, Bente Maegaard 4, Jan Odijk 5, Valerie Mapelli 2; 1 SPEX, The Netherlands; 2 ELRA/ELDA, France; 3 Siemens AG, Germany; 4 CST, Denmark; 5 ScanSoft Belgium, Belgium To promote quality control of its language resources, the European Language Resources Association (ELRA) installed a Validation Committee. This paper presents an overview of the current activities of the Committee: validation of language resources, standardisation, bug reporting, patches and updates of language resources, and dissemination of results.

Validation of Phonetic Transcriptions Based on Recognition Performance Christophe Van Bael 1, Diana Binnenpoorte 1, Helmer Strik 1, Henk van den Heuvel 2; 1 University of Nijmegen, The Netherlands; 2 SPEX, The Netherlands In fundamental linguistic as well as in speech technology research there is an increasing need for procedures to automatically generate and validate phonetic transcriptions. Whereas much research has already focused on the automatic generation of phonetic transcriptions, far less attention has been paid to the validation of such transcriptions. In the little research performed in this area, the estimation of the quality of (automatically generated) phonetic transcriptions is typically based on a comparison between these transcriptions and a human-made reference transcription. We believe, however, that the quality of phonetic transcriptions should ideally be estimated with the application in which the transcriptions will be used in mind, provided that the application is known at validation time. The application focused on in this paper is automatic speech recognition; the validation criterion is the word error rate. We achieved a higher accuracy with a recogniser trained on an automatically generated transcription than with a similar recogniser trained on a human-made transcription that resembled the human-made reference transcription more closely. This indicates that the traditional validation approach may not always be the optimal one.

Large Lexica for Speech-to-Speech Translation: From Specification to Creation Elviira Hartikainen 1, Giulio Maltese 2, Asunción Moreno 3, Shaunie Shammass 4, Ute Ziegenhain 5; 1 Nokia Research Center, Finland; 2 IBM Italy, Italy; 3 Universitat Politècnica de Catalunya, Spain; 4 Natural Speech Communication, Israel; 5 Siemens AG, Germany This paper presents the corpora collection and lexica creation for the purposes of Automatic Speech Recognition (ASR) and Text-to-Speech (TTS) that are needed in speech-to-speech translation (SST). These lexica will be specified, built and validated within the scope of the EU project LC-STAR (Lexica and Corpora for Speech-to-Speech Translation Components) during the years 2002-2005. Large lexica consisting of phonetic, prosodic and morpho-syntactic content will be provided with well-documented specifications for at least 12 languages [1]. This paper provides a short overview of speech-to-speech translation lexica in general as well as a summary of the LC-STAR project itself. More detailed information about the specification for the corpora collection and word extraction, as well as the specification and format of the lexica, is presented in later sections.

A Pronunciation Lexicon for Turkish Based on Two-Level Morphology Kemal Oflazer 1, Sharon Inkelas 2; 1 Sabancı University, Turkey; 2 University of California at Berkeley, USA This paper describes the implementation of a full-scale pronunciation lexicon for Turkish based on a two-level morphological analyzer. The system produces at its output a parallel representation of the pronunciation and the morphological analysis of the word form, so that morphological disambiguation can be used to disambiguate pronunciation when necessary. The pronunciation representation is based on the SAMPA standard and also encodes the position of the primary stress. The computation of the position of the primary stress depends on an interplay between any exceptional stress in root words and the stress properties of certain morphemes, and requires that a full morphological analysis be done. The system has been implemented using the XRCE Finite State Toolkit.

The Basque Speech_Dat (II) Database: A Description and First Test Recognition Results I. Hernaez, I. Luengo, E. Navas, M. Zubizarreta, I. Gaminde, J. Sanchez; University of the Basque Country, Spain In this work we present a telephone speech database for Basque, compliant with the guidelines of the SpeechDat project. The database contains 1060 calls from the fixed telephone network. We first describe the main aspects of the database design. We also present recognition results obtained using the database and a set of procedures following the language-independent reference recogniser commonly named RefRec.

Towards an Evaluation Standard for Speech Control Concepts in Real-World Scenarios Jens Maase 1, Diane Hirschfeld 2, Uwe Koloska 2, Timo Westfeld 3, Jörg Helbig 3; 1 Bosch und Siemens Hausgeräte GmbH, Germany; 2 voice INTER connect GmbH, Germany; 3 MediaInterface Dresden GmbH, Germany Speech control is still mainly evaluated through statistical performance measures (recognition rate, insertion rate, etc.) considering the performance of a speech recognizer under laboratory or artificial noise conditions. All these measures give no idea about the practical usability of a speech interface, since practical aspects concern more than the operational aspects of the speech recognizer inside a product. Since it was felt that no evaluation standard so far fulfills the practical requirements for speech-controlled products, this paper aims at the establishment of an open design and evaluation standard for speech control concepts in real-world scenarios. First, the behaviour of the users and the normal environmental conditions (typical noises) were evaluated in usability experiments. Data recordings were conducted, trying to capture these typical usage requirements in a special corpus (the Apollo corpus). Finally, a set of standard desktop as well as embedded speech recognizers were tested for their performance under these real-world conditions.

OrienTel: Recording Telephone Speech of Turkish Speakers in Germany
Chr. Draxler; Ludwig-Maximilians-Universität München, Germany OrienTel is a project to create telephone speech databases for both the local and the business languages of the Mediterranean and the Arab Emirates. In Germany, 300 Turkish speakers speaking German were to be recorded. The database is an extension of the SpeechDat databases. This paper outlines the recording setup, the recruitment strategy and the annotation procedure. Recruiting the speakers was a particular challenge, because none of the recruitment strategies used in previous SpeechDat projects in Germany worked and a new approach had to be found.

Spanish Broadcast News Transcription Gerhard Backfried, Roser Jaquemot Caldés; SAIL LABS Technology AG, Austria We describe the Sail Labs Media Mining System (MMS) aimed at the transcription of Castilian Spanish broadcast news. In contrast to previous systems, the focus of this system is on Spanish as spoken on the Iberian Peninsula as opposed to the Americas. We discuss the development of a Castilian Spanish broadcast-news corpus suitable for training the various system components of the MMS and report on the development of the speech-recognition component using the newly established corpora.

Implementation and Evaluation of a Text-to-Speech Synthesis System for Turkish Özgül Salor 1, Bryan Pellom 2, Mübeccel Demirekler 1; 1 Middle East Technical University, Turkey; 2 University of Colorado at Boulder, USA In this paper, a diphone-based Text-to-Speech (TTS) system for the Turkish language is presented. Turkish is the official language of Turkey, where it is the native language of 70 million people, and it is also widely spoken in Asia (Azerbaijan, Uzbekistan, Kazakhstan, Kyrgyzstan and Iran), Cyprus and the Balkans. The research has been done through a visiting internship at CSLR (the Center for Spoken Language Research, University of Colorado at Boulder) as part of an ongoing collaboration between CSLR and METU (Middle East Technical University), Department of Electrical and Electronics Engineering. The system is based on the Festival Speech Synthesis System. A diphone database has been designed for Turkish. Tools developed for quick diphone collection and segmentation are illustrated. The text analysis module and the methods used for the determination of segment durations and pitch contours are discussed in detail. A Diagnostic Rhyme Test (DRT) has been designed for Turkish to test the intelligibility of the output speech. The resulting TTS system is found to be 86.5% intelligible on average by 20 listeners. This is the first diphone-based Turkish TTS system whose intelligibility has been reported. We also believe that this paper will help researchers working on building TTS voices, especially those who work on agglutinative languages, since every step needed along the way is explained in detail.

Large Vocabulary Continuous Speech Recognition in Greek: Corpus and an Automatic Dictation System Vassilios Digalakis, Dimitrios Oikonomidis, D. Pratsolis, N. Tsourakis, C. Vosnidis, N. Chatzichrisafis, V. Diakoloukas; Technical University of Crete, Greece In this work, we present the creation of the first Greek Speech Corpus and the implementation of a Dictation System for workflow improvement in the field of journalism.
The current work was implemented under the project called Logotypografia (logos = speech and typografia = typography), sponsored by the General Secretariat of Research and Development of Greece. This paper presents the process of data collection (texts and recordings), waveform processing (transcriptions), the creation of the acoustic and language models, and the final integration into a fully functional dictation system. The evaluation of this system is also presented. The Logotypografia database, described here, is available from ELRA.

The LIUM-AVS Database: A Corpus to Test Lip Segmentation and Speechreading Systems in Natural Conditions Philippe Daubias, Paul Deléglise; Université du Maine, France We present here a new freely available audio-visual speech database. Contrary to other existing corpora, the LIUM-AVS corpus was recorded in conditions we qualify as natural, which are, according to us, much closer to real application conditions than other databases. This database was recorded without artificial lighting using an analog camcorder in camera mode. Images were stored digitally with no compression to keep the highest possible image quality. The LIUM-AVS database comprises two parts:
• PBS: Phonetically Balanced Sentences in French
• LET: Spelled letters (also in French)
These two parts contain sequences with both natural and blue lips. The whole database is released mainly to test and compare lip segmentation approaches on natural images, but speech recognition experiments may also be carried out using this corpus. For information on obtaining the LIUM-AVS database, please contact us through our webpage (http://www-lium.univ-lemans.fr/lium/avs-database).

The Czech Speech and Prosody Database Both for ASR and TTS Purposes Jáchym Kolář, Jan Romportl, Josef Psutka; University of West Bohemia in Pilsen, Czech Republic This paper describes the preparation of the first large Czech prosodic database, which should be useful both in automatic speech recognition (ASR) and in text-to-speech (TTS) synthesis. In the area of ASR we intend to use it for automatic punctuation annotation; in the area of TTS, for building a prosodic module for high-quality Czech synthesis. The database is based on the Czech Radio&TV Broadcast News Corpus (UWB B02) recorded at the University of West Bohemia. The configuration of the database includes recorded speech, raw and stylized F0 values, frame-level energy values, a word- and phoneme-level time alignment, and a linguistically motivated description of the prosodic data. A technique for prosodic data acquisition and stylization is described. A new tagset for the linguistic annotation of Czech prosody is proposed and used.

Construction of an Advanced In-Car Spoken Dialogue Corpus and its Characteristic Analysis Itsuki Kishida, Yuki Irie, Yukiko Yamaguchi, Shigeki Matsubara, Nobuo Kawaguchi, Yasuyoshi Inagaki; Nagoya University, Japan This paper describes an advanced spoken language corpus which has been constructed by enhancing an in-car speech database. The corpus has the following characteristic features: (1) Advanced tags: not only linguistic phenomena tags but also advanced discourse tags, such as sentential structures and utterance intentions, have been provided for the transcribed texts. (2) Large scale: the sentential structures and the intentions are currently provided for 45,053 phrases and 35,421 utterance units, respectively. (3) Multi-layer: the corpus consists of different levels of spoken language data, such as speech signals, transcribed texts, sentential structures, intentional markers and dialogue structures; moreover, they are related to each other. The multi-layered corpus allows a very wide variety of analyses of spontaneous spoken dialogue. This paper also reports the results of an investigation of the corpus, focusing especially on the relations between the syntactic style and the intentional style of spoken utterances.

Measuring the Readability of Automatic Speech-to-Text Transcripts
Douglas A. Jones, Florian Wolf, Edward Gibson, Elliott Williams, Evelina Fedorenko, Douglas A. Reynolds, Marc Zissman; Massachusetts Institute of Technology, USA This paper reports initial results from a novel psycholinguistic study that measures the readability of several types of speech transcripts. We define a four-part figure of merit to measure readability: accuracy of answers to comprehension questions, reaction time for passage reading, reaction time for question answering, and a subjective rating of passage difficulty. We present results from an experiment with 28 test subjects reading transcripts in four experimental conditions.

The NESPOLE! VoIP Multilingual Corpora in Tourism and Medical Domains Nadia Mana 1, Susanne Burger 2, Roldano Cattoni 1, Laurent Besacier 3, Victoria MacLaren 2, John McDonough 4, Florian Metze 4; 1 ITC-irst, Italy; 2 Carnegie Mellon University, USA; 3 CLIPS-IMAG Laboratory, France; 4 Universität Karlsruhe, Germany In this paper we present the multilingual VoIP (Voice over Internet Protocol networks) corpora collected for the second showcase of the Nespole! project in the tourism and medical domains. The corpora comprise over 20 hours of human-to-human monolingual dialogues in English, French, German and Italian: 66 dialogues in the tourism domain and 49 in the medical domain. We describe the data collection in detail (technical set-up, scenarios for each domain, recording procedure and data transcription), as well as corpus statistics and a preliminary data analysis.

From Switchboard to Fisher: Telephone Collection Protocols, Their Uses and Yields Christopher Cieri, David Miller, Kevin Walker; University of Pennsylvania, USA This paper describes several methodologies for collecting conversational telephone speech (CTS), comparing their design, goals and yields. We trace the evolution of the Switchboard protocol, including recent adaptations that have allowed for very cost-efficient data collection. We compare Switchboard to the CallHome and CallFriend protocols that have similarly produced CTS data for speech technology research. Finally, we introduce the new “Fisher” protocol, comparing its design and yield to the other protocols. We conclude with a summary of the data resources that result from each of the protocols described herein and that are generally available.

Development of the Estonian SpeechDat-Like Database Einar Meister, Jürgen Lasn, Lya Meister; Tallinn Technical University, Estonia A new database project was launched in Estonia last year. It aims at the collection of telephone speech from a large number of speakers for speech and speaker recognition purposes. Up to 2000 speakers are expected to participate in the recordings. SpeechDat databases, especially the Finnish SpeechDat, have been chosen as a prototype for the Estonian database. This means that the principles of corpus design, file formats, recording and labelling methods implemented by the SpeechDat consortium will be followed as closely as possible. The paper is a progress report on the project.

Towards a Repository of Digital Talking Books

Lexica and Corpora for Speech-to-Speech Translation: A Trilingual Approach David Conejero, Jesús Giménez, Victoria Arranz, Antonio Bonafonte, Neus Pascual, Núria Castell, Asunción Moreno; Universitat Politècnica de Catalunya, Spain Creation of lexica and corpora for Catalan, Spanish and US-English is described. A lexicon is being created for speech recognition and synthesis including relevant information.
The lexicon contains 50K common words selected to achieve a wide coverage on the chosen domains, and 50K additional entries including special application words, and proper nouns. Furthermore, a large trilingual spontaneous speech corpus has been created. These corpora, together with other available US-English data, have been translated into their counterpart languages. This is being used to investigate the language resources requirements for statistical machine translation. Se describe la creación de léxicos y corpus para el catalán, castellano e inglés hablado en Estados Unidos. Un léxico conteniendo información relevante para el reconocimiento y síntesis del habla está siendo creado. El léxico contiene 50.000 palabras comunes seleccionadas con el fin de lograr una amplia cobertura de los dominios escogidos, y 50.000 entradas adicionales que incluyen vocabulario específico, y nombres propios. Además, se han creado corpus orales para el catalán y el castellano. Estos corpus, junto con otros datos disponibles sobre inglés hablado en Estados Unidos, han sido traducidos a las otras dos lenguas con el propósito de generar un gran corpus trilingüe. Éste está siendo utilizado para investigar los requisitos de los recursos lingüísticos para la traduccíon automática estadística. António Serralheiro 1 , Isabel Trancoso 2 , Diamantino Caseiro 2 , Teresa Chambel 3 , Luís Carriço 3 , Nuno Guimarães 3 ; 1 INESC-ID/Academia Militar, Portugal; 2 INESC-ID/IST, Portugal; 3 LASIGE/FC, Portugal Considerable effort has been devoted at L2 F to increase and broaden our speech and text data resources. Digital Talking Books (DTB), comprising both speech and text data are, as such, an invaluable asset as multimedia resources. Furthermore, those DTB have been under a speech-to-text alignment procedure, either word or phone-based, to increase their potential in research activities. This paper thus describes the motivation and the method that we used to accomplish this goal for aligning DTBs. This alignment allows specific access interfaces for persons with special needs, and also tools for easily detecting and indexing units (words, sentences, topics) in the spoken books. The alignment tool was implemented in a Weighted Finite State Transducer framework, which provides an efficient way to combine different types of knowledge sources, such as alternative pronunciation rules. With this tool, a 2-hour long spoken book was aligned in a single step in much less than real time. Last but not least, new browsing interfaces, allowing improved access and data retrieval to and from the DTBs, are described in this paper. Shared Resources for Robust Speech-to-Text Technology Stephanie Strassel, David Miller, Kevin Walker, Christopher Cieri; University of Pennsylvania, USA This paper describes ongoing efforts at Linguistic Data Consortium to create shared resources for improved speech-to-text technology. Under the DARPA EARS program, technology providers are charged with creating STT systems whose outputs are substantially richer and much more accurate than is currently possible. These aggressive program goals motivate new approaches to corpus creation and distribution. EARS participants require multilingual broadcast and telephone speech data, transcripts and annotations at a much higher volume than for any previous program. 
While standard approaches to resource collection and creation are prohibitively expensive for this volume of material, within EARS new methods have been established to allow for the development of vast quantities of audio, transcripts and annotations. New distribution methods also provide for efficient deployment of needed resources to participating research sites as well as enabling eventual publication to a wider community of language researchers.

Session: OWeBa– Oral Speech Recognition - Adaptation II Time: Wednesday 10.00, Venue: Room 1 Chair: John Hansen, Colorado Univ., USA

Large Vocabulary Conversational Speech Recognition with a Subspace Constraint on Inverse Covariance Matrices Scott Axelrod, Vaibhava Goel, Brian Kingsbury, Karthik Visweswariah, Ramesh Gopinath; IBM T.J. Watson Research Center, USA This paper applies the recently proposed SPAM models for acoustic modeling in a Speaker Adaptive Training (SAT) context on large vocabulary conversational speech databases, including the Switchboard database. SPAM models are Gaussian mixture models in which a subspace constraint is placed on the precision and mean matrices (although this paper focuses on the case of unconstrained means). They include diagonal covariance, full covariance, MLLT, and EMLLT models as special cases. Adaptation is carried out with maximum likelihood estimation of the means and feature space under the SPAM model. This paper shows the first experimental evidence that SPAM models can achieve significant word-error-rate improvements over state-of-the-art diagonal covariance models, even when those diagonal models are given the benefit of choosing the optimal number of Gaussians (according to the Bayesian Information Criterion). This paper also is the first to apply SPAM models in a SAT context. All experiments are performed on the IBM “Superhuman” speech corpus, which is a challenging and diverse conversational speech test set that includes the Switchboard portion of the 1998 Hub5e evaluation data set.

Speaker Adaptation Based on Confidence-Weighted Training Gyucheol Jang, Minho Jin, Chang D. Yoo; KAIST, Korea This paper presents a novel method to enhance the performance of traditional speaker adaptation algorithms using a discriminative adaptation procedure based on a novel confidence measure and nonlinear weighting. Regardless of the distribution of the adaptation data, traditional model adaptation methods incorporate the adaptation data undiscriminatingly. When the data size is small and the parameter tying is extensive, adaptation based on outliers can be detrimental. A way to discriminate the contribution of each data point in the adaptation is to incorporate a confidence measure based on likelihood. We evaluate and compare the performance of the proposed weighted SMAP (WSMAP), which controls the contribution of each data point by sigmoid weighting using a novel confidence measure. The effectiveness of the proposed algorithm is experimentally verified by adapting native speaker models to a nonnative speaker environment using TIDIGIT.
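The confidence-weighted training abstract above gates each adaptation observation with a sigmoid weight derived from a likelihood-based confidence measure. The following is a minimal sketch of that general idea only, assuming a simple MAP-style mean update with made-up parameter names; it is not the authors' WSMAP implementation:

import numpy as np

def sigmoid_weight(confidence, midpoint=0.5, slope=10.0):
    """Map a per-frame confidence score in [0, 1] to a soft weight via a sigmoid."""
    return 1.0 / (1.0 + np.exp(-slope * (np.asarray(confidence) - midpoint)))

def weighted_mean_update(prior_mean, frames, confidences, tau=10.0):
    """MAP-like update of a Gaussian mean using confidence-weighted adaptation frames.

    tau acts as the prior weight: with little (or low-confidence) data the
    adapted mean stays close to the speaker-independent prior mean."""
    w = sigmoid_weight(confidences)                      # one weight per frame
    weighted_sum = (w[:, None] * frames).sum(axis=0)     # soft-counted data
    total_weight = w.sum()
    return (tau * prior_mean + weighted_sum) / (tau + total_weight)

# Toy usage: 5 adaptation frames of a 3-dimensional feature.
rng = np.random.default_rng(0)
frames = rng.normal(loc=1.0, size=(5, 3))
confidences = [0.9, 0.8, 0.2, 0.95, 0.1]   # low-confidence frames contribute little
print(weighted_mean_update(np.zeros(3), frames, confidences))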
Jacobian Adaptation Based on the Frequency-Filtered Spectral Energies Alberto Abad, Climent Nadeu, Javier Hernando, Jaume Padrell; Universitat Politècnica de Catalunya, Spain Jacobian Adaptation (JA) of the acoustic models is an efficient adaptation technique for robust speech recognition. Several improvements to JA have been proposed in recent years, either to generalize the Jacobian linear transformation to the case of a large noise mismatch between training and testing, or to extend the adaptation to other degrading factors, like channel distortion and vocal tract length. However, the JA technique has so far only been used with the conventional mel-frequency cepstral coefficients (MFCC). In this paper, the JA technique is applied to an alternative type of features, the Frequency-Filtered (FF) spectral energies, resulting in a more computationally efficient approach. Furthermore, in experimental tests with the Aurora1 database, this new approach has shown an improved recognition performance with respect to Jacobian adaptation with MFCCs.

Structural Linear Model-Space Transformations for Speaker Adaptation Driss Matrouf, Olivier Bellot, Pascal Nocera, Georges Linares, Jean-François Bonastre; LIA-CNRS, France Within the framework of speaker adaptation, a technique based on a tree structure and the maximum a posteriori criterion was proposed (SMAP). In SMAP, the parameter estimation at each node in the tree is based on the assumption that the mismatch between the training and adaptation data is a Gaussian PDF whose parameters are estimated using the Maximum Likelihood criterion. To avoid poor estimation accuracy of the transformation parameters due to insufficient adaptation data in a node, we propose a new technique based on the maximum a posteriori approach and the merging of Gaussian PDFs. The basic idea behind this new technique is to estimate affine transformations which bring the training acoustic models as close as possible to the test acoustic models, rather than transformations maximizing the likelihood of the adaptation data. In this manner, even with a very small amount of adaptation data, the parameter transformations are accurately estimated for means and variances. This adaptation strategy has shown a significant performance improvement on a large vocabulary speech recognition task, alone and combined with MLLR adaptation.

Minimum Classification Error (MCE) Model Adaptation of Continuous Density HMMs Xiaodong He 1, Wu Chou 2; 1 University of Missouri, USA; 2 Avaya Labs Research, USA In this paper, a framework of minimum classification error (MCE) model adaptation for continuous density HMMs is proposed based on the approach of the “super” string model. We show that the error rate minimization in the proposed approach can be formulated as maximizing a special ratio of two positive functions, and from that a general growth transform algorithm is derived for MCE based model adaptation. This algorithm departs from the generalized probability descent (GPD) algorithm, and it is well suited for model adaptation with a small amount of training data. The proposed approach is applied to linear regression based variance adaptation, and the closed-form solution for variance adaptation using MCE linear regression (MCELR) is derived. The MCELR approach is evaluated on large vocabulary speech recognition tasks. The relative performance gain is more than doubled on the standard (WSJ Spoke 3) database, compared to maximum likelihood linear regression (MLLR) based variance adaptation for the same amount of adaptation data.
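Several of the adaptation abstracts above (SMAP, MAPLR, MCELR) build on the standard MLLR-style affine transform of Gaussian means. Purely for orientation, a generic sketch of applying such a transform is given below; the names are illustrative and no claim is made to match any of the papers' estimation procedures:

import numpy as np

def apply_mllr_mean_transform(W, means):
    """Apply an MLLR-style affine transform to a set of Gaussian mean vectors.

    W is a d x (d+1) matrix [A | b]; each mean mu is mapped to A @ mu + b,
    i.e. W applied to the extended mean [mu; 1]. The position of the bias
    column is a convention that differs across papers."""
    d = means.shape[1]
    A, b = W[:, :d], W[:, d]
    return means @ A.T + b

# Toy usage: a shift-and-scale transform applied to three 2-D means.
W = np.hstack([1.1 * np.eye(2), np.array([[0.3], [-0.2]])])
means = np.array([[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]])
print(apply_mllr_mean_transform(W, means))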
Adapting Acoustic Models to New Domains and Conditions Using Untranscribed Data Asela Gunawardana, Alex Acero; Microsoft Research, USA This paper investigates the unsupervised adaptation of an acoustic model to a domain with mismatched acoustic conditions. We use techniques borrowed from the unsupervised training literature to adapt an acoustic model trained on the Wall Street Journal corpus to the Aurora-2 domain, which is composed of read digit strings over a simulated noisy telephone channel. We show that it is possible to use untranscribed in-domain data to get significant performance improvements, even when it is severely mismatched to the acoustic model training data.

Session: SWeBb– Oral Towards Synthesizing Expressive Speech Time: Wednesday 10.00, Venue: Room 2 Chair: Wael Hamza, IBM, USA

Towards Synthesising Expressive Speech; Designing and Collecting Expressive Speech Data Nick Campbell; ATR-HIS, Japan Corpus-based speech synthesis needs representative corpora of human speech if it is to meet the needs of everyday spoken interaction. This paper describes methods for recording such corpora, and details some difficulties (with their solutions) found in the use of spontaneous speech data for synthesis.

Is There an Emotion Signature in Intonational Patterns? And Can It be Used in Synthesis? Tanja Bänziger 1, Michel Morel 2, Klaus R. Scherer 1; 1 University of Geneva, Switzerland; 2 University of Caen, France Intonation is often considered to play an important role in the vocal communication of emotion. Early studies using pitch manipulation have supported this view. However, the properties of pitch contours involved in marking emotional state remain largely unidentified. In this contribution, a corpus of actor-generated utterances for 8 emotions was used to measure intonation (pitch contour) by identifying key features of the F0 contour. The data show that the profiles obtained vary reliably with respect to F0 level as a function of the degree of activation of the emotion concerned. However, there is little evidence for qualitatively different forms of profiles for different emotions. Results of recent collaborative studies on the use of the F0 patterns identified in this research with synthesized utterances are presented. The nature of the contribution of F0/pitch contours to emotional speech is discussed; it is argued that pitch contours have to be considered as configurations that acquire emotional meaning only through interaction with a linguistic and paralinguistic context.

Applications of Computer Generated Expressive Speech for Communication Disorders Jan P.H. van Santen, Lois Black, Gilead Cohen, Alexander B. Kain, Esther Klabbers, Taniya Mishra, Jacques de Villiers, Xiaochuan Niu; Oregon Health & Science University, USA This paper focuses on generation of expressive speech, specifically speech displaying vocal affect. Generating speech with vocal affect is important for diagnosis, research, and remediation for children with autism and developmental language disorders. However, because vocal affect involves many acoustic factors working together in complex ways, it is unlikely that we will be able to generate compelling vocal affect with traditional diphone synthesis. Instead, methods are needed that preserve as much of the original signals as possible.
We describe an approach to concatenative synthesis that attempts to combine the naturalness of unit selection based synthesis with the ability of diphone based synthesis to handle unrestricted input domains.

Multilayered Extensions to the Speech Synthesis Markup Language for Describing Expressiveness E. Eide, R. Bakis, W. Hamza, J. Pitrelli; IBM T.J. Watson Research Center, USA In this paper we discuss possible extensions to the Speech Synthesis Markup Language (SSML) to facilitate the generation of synthetic expressive speech. The proposed extensions are hierarchical in nature, allowing specification in terms of physical parameters such as instantaneous pitch, higher-level parameters such as ToBI labels, or abstract concepts such as emotions. Low-level tags tend to change their values frequently, even within a word, while the more abstract tags generally apply to whole words, sentences or paragraphs. We envision interfaces at different levels to serve different types of users; speech experts may want to use low-level interfaces while artists may prefer to interface with the TTS system at more abstract levels.

Unit Selection and Emotional Speech Alan W. Black; Carnegie Mellon University, USA Unit Selection Synthesis, where appropriate units are selected from large databases of natural speech, has greatly improved the quality of speech synthesis. But the quality improvement has come at a cost. The quality of the synthesis relies on the fact that little or no signal processing is done on the selected units; thus the style of the recording is maintained in the quality of the synthesis. The synthesis style is implicitly the style of the database. If we want more general flexibility we have to record more data in the desired style, which means that our already large unit databases must be made even larger. This paper gives examples of how to produce varied style and emotion using existing unit selection synthesis techniques and also highlights the limitations of generating truly flexible synthetic voices.

Session: OWeBc– Oral Speaker Verification Time: Wednesday 10.00, Venue: Room 3 Chair: Douglas Reynolds, MIT Lincoln Laboratory, USA

Speaker Verification Systems and Security Considerations David A. van Leeuwen; TNO Human Factors, The Netherlands In speaker verification technology, the security considerations are quite different from the performance measures that are usually studied. The security level of a system is generally expressed as the amount of effort it takes to make a successful break-in attempt. This paper discusses potential weaknesses of speaker verification systems and methods of exploiting these weaknesses, and suggests proper experiments for determining the security level of a speaker verification system.

Phonetic Class-Based Speaker Verification Matthieu Hébert, Larry P. Heck; Nuance Communications, USA Phonetic Class-based Speaker Verification (PCBV) is a natural refinement of the traditional single Gaussian Mixture Model (Single GMM) scheme. The aim is to accurately model the voice characteristics of a user on a per-phonetic-class basis. The paper briefly describes the implementation of a representation of the voice characteristics in a hierarchy of phonetic classes. We present a framework to easily study the effect of the modeling on PCBV. A thorough study of the effect of the modeling complexity, the amount of enrollment data and noise conditions is presented.
It is shown that Phoneme-based Verification (PBV), a special case of PCBV, is the optimal modeling scheme and consistently outperforms the state-of-the-art Single GMM modeling even in noisy environments. PBV achieves a 9% to 14% relative error rate reduction while cutting the speaker model size by 50% and the CPU cost by 2/3.

Voice Quality Modification for Emotional Speech Synthesis Christophe d’Alessandro, Boris Doval; LIMSI-CNRS, France Synthesis of expressive speech has demonstrated that convincing natural-sounding results are impossible to obtain without dealing with voice quality parameters. Time-domain and spectral-domain models of the voice source signal are presented. Then algorithms for the analysis and synthesis of voice quality are discussed, including modification of the periodic and aperiodic components. These algorithms may be useful for applications such as pre-processing of speech corpora, modification of voice quality parameters together with intonation in synthesis, and voice transformation.

An Evaluation of VTS and IMM for Speaker Verification in Noise Suhadi, Sorel Stan, Tim Fingscheidt, Christophe Beaugeant; Siemens AG, Germany The performance of speaker verification (SV) systems degrades rapidly in noise, rendering them unsuitable for security-critical applications in mobile phones, where false acceptance rates (FAR) of ∼10^-4 are required. However, less demanding applications, for which equal error rates (EER) comparable to the word error rates (WER) of speech recognizers are acceptable, could benefit from SV technology. In this paper we evaluate two feature-based noise compensation algorithms in the context of SV: vector Taylor series (VTS) combined with statistical linear approximation (SLA), and Kalman filter-based interacting multiple models (IMM). Tests with the YOHO database and the NTT-AT ambient noises show that EERs as low as 5%-10% in medium to high noise conditions can be achieved for a text-independent SV system.

Locally Recurrent Probabilistic Neural Network for Text-Independent Speaker Verification Todor Ganchev, Dimitris K. Tasoulis, Michael N. Vrahatis, Nikos Fakotakis; University of Patras, Greece This paper introduces Locally Recurrent Probabilistic Neural Networks (LRPNN) as an extension of the well-known Probabilistic Neural Networks (PNN). An LRPNN, in contrast to a PNN, is sensitive to the context in which events occur, and therefore identification of temporal or spatial correlations is attainable. Besides the definition of the LRPNN architecture, a fast three-step training method is proposed. The first two steps are identical to the training of traditional PNNs, while the third step is based on the Differential Evolution optimization method. Finally, the superiority of LRPNNs over PNNs on the task of text-independent speaker verification is demonstrated.

Learning to Boost GMM Based Speaker Verification Stan Z. Li, Dong Zhang, Chengyuan Ma, Heung-Yeung Shum, Eric Chang; Microsoft Research Asia, China The Gaussian mixture model (GMM) has proved to be an effective probabilistic model for speaker verification, and has been widely used in most state-of-the-art systems. In this paper, we introduce a new method for this task, using AdaBoost learning based on the GMM. The motivation is the following: while a GMM linearly combines a number of Gaussian models according to a set of mixing weights, we believe that there exists a better means of combining individual Gaussian mixture models.
The proposed AdaBoost-GMM method is non-parametric: a selected set of weak classifiers, each constructed from a single Gaussian model, is optimally combined to form a strong classifier, the optimality being in the sense of maximum margin. Experiments show that the boosted GMM classifier yields a 10.81% relative reduction in equal error rate for the same handsets and 11.24% for different handsets, a significant improvement over the baseline adapted GMM system.

Speaker Verification Based on G.729 and G.723.1 Coder Parameters and Handset Mismatch Compensation Eric W.M. Yu 1, Man-Wai Mak 1, Chin-Hung Sit 1, Sun-Yuan Kung 2; 1 Hong Kong Polytechnic University, China; 2 Princeton University, USA A novel technique for speaker verification over a communication network is proposed. The technique employs cepstral coefficients (LPCCs) derived from G.729 and G.723.1 coder parameters as feature vectors. Based on the LP coefficients derived from the coder parameters, LP residuals are reconstructed, and the verification performance is improved by taking account of the additional speaker-dependent information contained in the reconstructed residuals. This is achieved by adding the LPCCs of the LP residuals to the LPCCs derived from the coder parameters. To reduce the acoustic mismatch between different handsets, a technique combining a handset selector with stochastic feature transformation is employed. Experimental results based on 150 speakers show that the proposed technique outperforms approaches that only utilize the coder-derived LPCCs.

Session: OWeBd– Oral Dialog System Generation Time: Wednesday 10.00, Venue: Room 4 Chair: Rolf Carlson, KTH, Stockholm, Sweden

Should I Tell All?: An Experiment on Conciseness in Spoken Dialogue Stephen Whittaker 1, Marilyn Walker 1, Preetam Maloor 2; 1 University of Sheffield, U.K.; 2 University of Toronto, Canada Spoken dialogue systems have a strong requirement to produce concise and informative utterances. While interacting over a phone, users must both understand the system’s utterances and remember important facts that the system is providing. Thus most dialogue systems implement some combination of different techniques for (1) option selection: pruning the set of options; (2) information selection: selecting a subset of information to present about each option; and (3) aggregation: combining multiple items of information succinctly. We first describe how user models based on multi-attribute decision theory support domain-independent algorithms for both option selection and information selection. We then describe experiments to determine an optimal level of conciseness in information selection, i.e. how much information to include for an option. Our results show that (a) users are highly oriented to utterance conciseness; (b) the information selection algorithm is highly consistent with users’ judgments of conciseness; and (c) the appropriate level of conciseness is both user and dialogue strategy dependent.
The technique employs cepstral coefficients (LPCCs) derived from G.729 and G.723.1 coder parameters as feature vectors. Based on the LP coefficients derived from the coder parameters, LP residuals are reconstructed, and the verification performance is improved by taking account of the additional speakerdependent information contained in the reconstructed residuals. This is achieved by adding the LPCCs of the LP residuals to the LPCCs derived from the coder parameters. To reduce the acoustic mismatch between different handsets, a technique combining a handset selector with stochastic feature transformation is employed. Experimental results based on 150 speakers show that the proposed technique outperforms the approaches that only utilize the coder-derived LPCCs. Session: OWeBd– Oral Dialog System Generation A concept to speech generation was realized in an agent dialogue system, where an agent (a stuffed animal) walked around in a small room constructed on a computer display to complete some jobs with instructions from a user. The communication between the user and the agent was done through speech. If the agent could not complete the job because of some difficulties, it tried to solve the problems through conversations with the user. Different from other spoken dialogue systems, the speech output from the agent was generated directly from the concept, and was synthesized using higher linguistic information. This scheme could largely improve the prosodic quality of speech output. In order to realize the concept to speech conversion, the linguistic information was handled as a tree structure in the whole dialogue process. A Trainable Generator for Recommendations in Multimodal Dialog Marilyn Walker 1 , Rashmi Prasad 2 , Amanda Stent 3 ; 1 University of Sheffield, U.K.; 2 University of Pennsylvania, USA; 3 Stony Brook University, USA Time: Wednesday 10.00, Venue: Room 4 Chair: Rolf Carlson, KTH, Stockholm, Sweden Should I Tell All?: An Experiment on Conciseness in Spoken Dialogue Stephen Whittaker 1 , Marilyn Walker 1 , Preetam Maloor 2 ; 1 University of Sheffield, U.K.; 2 University of Toronto, Canada Spoken dialogue systems have a strong requirement to produce con- As the complexity of spoken dialogue systems has increased, there has been increasing interest in spoken language generation (SLG). SLG promises portability across application domains and dialogue situations through the development of application-independent linguistic modules. However in practice, rule-based SLGs often have to be tuned to the application. Recently, a number of research groups have been developing hybrid methods for spoken language generation, combining general linguistic modules with methods for training parameters for particular applications. This paper describes the 59 Eurospeech 2003 Wednesday use of boosting to train a sentence planner to generate recommendations for restaurants in MATCH, a multimodal dialogue system providing entertainment information for New York. Spoken Dialogue System for Queries on Appliance Manuals Using Hierarchical Confirmation Strategy Tatsuya Kawahara, Ryosuke Ito, Kazunori Komatani; Kyoto University, Japan We address a dialogue framework for queries on manuals of electric appliances with a speech interface. Users can make queries by unconstrained speech, from which keywords are extracted and matched to the items in the manual. As a result, so many items are usually obtained. 
Thus, we introduce an effective dialogue strategy which narrows down the items using a tree structure extracted from the manual. Three cost functions are presented and compared to minimize the number of dialogue turns. We have evaluated the system performance on a VTR manual query task. The average number of dialogue turns is reduced to 71% with our strategy, compared with a conventional method that makes confirmations in turn according to the matching likelihood. Thus, the proposed system helps users find their intended items more efficiently.

SAG: A Procedural Tactical Generator for Dialog Systems Dalina Kallulli; SAIL LABS Technology, Austria Widely used declarative approaches to generation, in which generation speed is a function of grammar size, are not optimal for real-time dialog systems. We argue that a procedural system like the one we present is potentially more efficient for time-critical real-world generation applications, as it provides fine-grained control of each processing step on the way from input to output representations. In this way the procedural behaviour of the generator can be tailored to the task at hand. During the generation process, the realizer generates flat deep structures from semantic-pragmatic expressions, then syntactic deep structures from the deep semantic-pragmatic structures, and from these syntactic deep structures, surface strings. Nine different generation levels can be distinguished and are described in the paper.

Session: PWeBe– Poster Speech Signal Processing II Time: Wednesday 10.00, Venue: Main Hall, Level -1 Chair: Matti Karjalainen, HUT, Finland

Optimization of the CELP Model in the LSP Domain Khosrow Lashkari, Toshio Miki; DoCoMo USA Labs, USA This paper presents a new Analysis-by-Synthesis (AbS) technique for the joint optimization of the excitation and model parameters based on minimizing the closed-loop synthesis error instead of the linear prediction error. By minimizing the synthesis error, the analysis and synthesis stages become more compatible. Using a gradient descent algorithm, the LSPs for a given excitation are optimized to minimize the error between the original and the synthesized speech. Since the optimization starts from the LPC solution, the synthesis error is guaranteed to be lower than that obtained using the LPC coefficients.
For the ITU G.729 codec, there is about 1 dB of improvement in segmental SNR for male and female speakers over 4 to 6 second long sentences. By adding an extra optimization step, the technique can be incorporated into LPC, multi-pulse LPC and CELP-type speech coders.

Transforming Voice Quality Ben Gillett, Simon King; University of Edinburgh, U.K. Voice transformation is the process of transforming the characteristics of speech uttered by a source speaker, such that a listener would believe the speech was uttered by a target speaker. In this paper we address the problem of transforming voice quality. We do not attempt to transform prosody. Our system has two main parts, corresponding to the two components of the source-filter model of speech production. The first component transforms the spectral envelope as represented by a linear prediction model. The transformation is achieved using a Gaussian mixture model, which is trained on aligned speech from the source and target speakers. The second part of the system predicts the spectral detail from the transformed linear prediction coefficients. A novel approach is proposed, which is based on a classifier and residual codebooks. On the basis of a number of performance metrics it outperforms existing systems.

DOA Estimation of Speech Signal Using Equilateral-Triangular Microphone Array Yusuke Hioka, Nozomu Hamada; Keio University, Japan In this contribution, we propose a DOA (Direction Of Arrival) estimation method for speech signals whose angular resolution is almost uniform with respect to the DOA. Our previous DOA estimation method [1] achieves high precision with only two microphones; however, its resolution degrades as the propagating direction moves away from the array broadside. In the proposed method, an equilateral-triangular microphone array is adopted, and subspace analysis is applied. The efficiency of the proposed method is shown by both simulation and experimental results.

Multi-Array Fusion for Beamforming and Localization of Moving Speakers Ilyas Potamitis, George Tremoulis, Nikos Fakotakis, George Kokkinakis; University of Patras, Greece In this work we deal with the fusion of the estimates of independent microphone arrays to produce an improved estimate of the Direction of Arrival (DOA) of one moving speaker, as well as localization coordinates of multiple moving speakers based on Time Delay Of Arrival (TDOA). Our approach (a) fuses measurements from independent arrays, (b) incorporates kinematic information of the speakers’ movement by using parallel Kalman filters, and (c) associates observations to specific speakers by using a Probabilistic Data Association (PDA) technique. We demonstrate that a network of arrays combined with statistical fusion techniques provides a consistent and coherent way to reduce the uncertainty and ambiguity of measurements. The efficiency of the approach is illustrated on a simulation dealing with beamforming one moving speaker on an extended basis and localization of two closely spaced moving speakers with crossing trajectories.

Integrated Pitch and MFCC Extraction for Speech Reconstruction and Speech Recognition Applications Xu Shao, Ben P. Milner, Stephen J. Cox; University of East Anglia, U.K. This paper proposes an integrated speech front-end for both speech recognition and speech reconstruction applications. Speech is first decomposed into a set of frequency bands by an auditory model. The output of this is then used to extract both robust pitch estimates and MFCC vectors. Initial tests used a 128-channel auditory model, but results show that this can be reduced significantly to between 23 and 32 channels. A detailed analysis of the pitch classification accuracy and the RMS pitch error shows the system to be more robust than both comb-function and LPC-based pitch extraction. Speech recognition results show that the auditory-based cepstral coefficients give very similar performance to conventional MFCCs. Spectrograms and informal listening tests also reveal that speech reconstructed from the auditory-based cepstral coefficients and pitch has similar quality to that reconstructed from conventional MFCCs and pitch.
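The integrated front-end abstract above derives cepstral coefficients from an auditory filterbank. For orientation only, a conventional textbook conversion from filterbank energies to MFCC-like cepstra (a generic recipe, not the authors' auditory model) can be sketched as follows:

import numpy as np
from scipy.fftpack import dct

def mfcc_from_filterbank(energies, num_ceps=13):
    """Convert per-frame filterbank energies (frames x channels) to cepstra.

    Standard recipe: log-compress the channel energies, then take a DCT-II
    and keep the first num_ceps coefficients."""
    log_energies = np.log(np.maximum(energies, 1e-10))   # floor to avoid log(0)
    cepstra = dct(log_energies, type=2, axis=1, norm='ortho')
    return cepstra[:, :num_ceps]

# Toy usage: 100 frames of a 23-channel filterbank output.
rng = np.random.default_rng(0)
fbank = rng.uniform(0.1, 1.0, size=(100, 23))
print(mfcc_from_filterbank(fbank).shape)   # (100, 13)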
Exploiting Time Warping in AMR-NB and AMR-WB Speech Coders
Lasse Laaksonen 1, Sakari Himanen 2, Ari Heikkinen 2, Jani Nurminen 2; 1 Tampere University of Technology, Finland; 2 Nokia Research Center, Finland
In this paper, a time warping algorithm is implemented and its performance is evaluated in the context of Adaptive Multi-Rate (AMR) wideband (WB) and narrowband (NB) speech coders. The aim of time warping is to achieve bit savings in the transmission of pitch information with no significant quality degradation. In the case of the AMR-NB and AMR-WB speech coders, these bit savings are 0.65-1.15 kbit/s depending on the mode. The performance of the modified AMR speech coders is verified by subjective and objective measures in error-free conditions. MOS tests show that only a slight, statistically insignificant degradation of speech quality is experienced when time warping is implemented.

A New Approach to Voice Activity Detection Based on Self-Organizing Maps
Stephan Grashey; Siemens AG, Germany
Accurate discrimination between speech and non-speech is an essential part of many tasks in speech processing systems. In this paper an approach to the classification part of a Voice Activity Detector (VAD) is presented. Some possible shortcomings of present VAD systems are described and a classification approach which overcomes these weaknesses is derived. This approach is based on a Self-Organizing Map (SOM), a neural network which is able to detect clusters within the feature space of its training data. Training of the classifier takes place in two steps: first the SOM has to be trained. When finished, it is used in the second training step to learn the mapping between its classes and the desired output "speech" resp. "non-speech". Experiments on a database containing audio samples obtained under different noisy conditions show the potential of the proposed algorithm.

A Clustering Approach to On-Line Audio Source Separation
Julien Bourgeois; DaimlerChrysler AG, Germany
We have developed an on-line separation method for audio signals. The adopted approach makes use of the time-frequency transform of the signals as a sparse decomposition. Since the sources for the most part do not overlap in the time-frequency domain, we get raw estimates of their individual mixing parameters from an analysis of the mixture ratios. We then obtain reliable mixing parameters by dynamically clustering these instantaneous estimates. The mixing parameters are used to separate the mixtures, even at time-frequency points where the sources overlap. In addition, even when the mixing parameters change over time, our approach is able to separate signals with only one pass through the data. We have evaluated this approach first on computer-generated anechoic mixtures and then on real echoic mixtures recorded in a car.
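To make the clustering idea concrete, here is a minimal sketch of time-frequency masking driven by clustered amplitude-ratio estimates for a two-source, two-microphone anechoic mixture. It only illustrates the general principle in the abstract (sparsity, clustering of instantaneous mixing estimates, masking); the on-line update rules and echoic handling of the paper are not reproduced, and all names and parameters are illustrative.

```python
import numpy as np

def separate_by_ratio_clustering(X1, X2, n_sources=2, n_iter=20):
    """X1, X2: complex STFTs (freq x time) of the two mixture channels.
    Returns a list of masked source estimates in the STFT domain."""
    # instantaneous mixing estimate at every time-frequency point:
    # the amplitude ratio between the two channels
    ratio = np.abs(X2) / (np.abs(X1) + 1e-12)
    feats = ratio.ravel()

    # simple 1-D k-means over the ratios (stand-in for on-line clustering)
    centers = np.quantile(feats, np.linspace(0.1, 0.9, n_sources))
    for _ in range(n_iter):
        labels = np.argmin(np.abs(feats[:, None] - centers[None, :]), axis=1)
        for k in range(n_sources):
            if np.any(labels == k):
                centers[k] = feats[labels == k].mean()

    # binary masks: each TF point is assigned to its nearest cluster
    labels = labels.reshape(X1.shape)
    return [np.where(labels == k, X1, 0.0) for k in range(n_sources)]

# toy usage with a random "STFT" just to show the call
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X1 = rng.normal(size=(64, 10)) + 1j * rng.normal(size=(64, 10))
    X2 = rng.normal(size=(64, 10)) + 1j * rng.normal(size=(64, 10))
    estimates = separate_by_ratio_clustering(X1, X2)
    print(len(estimates), estimates[0].shape)
```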
Estimation of Voice Source and Vocal Tract Characteristics Based on Multi-Frame Analysis
Yoshinori Shiga, Simon King; University of Edinburgh, U.K.
This paper presents a new approach for estimating the voice source and vocal tract filter characteristics of voiced speech. When the transfer function of a system is required in signal processing, the input and output of the system are experimentally observed and used to calculate the function. However, in the case of the source-filter separation we deal with in this paper, only the output (speech) is observed, and the characteristics of the system (vocal tract) and the input (voice source) must be estimated simultaneously. Hence the estimation becomes extremely difficult, and it is usually solved approximately using oversimplified models. We demonstrate that these characteristics are separable under the assumption that they are independently controlled by different factors. The separation is realised using an iterative approximation along with the Multi-frame Analysis method, which we have proposed to find spectral envelopes of voiced speech with minimum interference from the harmonic structure.

Estimating the Spectral Envelope of Voiced Speech Using Multi-Frame Analysis
Yoshinori Shiga, Simon King; University of Edinburgh, U.K.
This paper proposes a novel approach for estimating the spectral envelope of voiced speech independently of its harmonic structure. Because of the quasi-periodicity of voiced speech, its spectrum shows harmonic structure and only has energy at frequencies corresponding to integral multiples of F0. It is hence impossible to identify the transfer characteristics between adjacent harmonics. In order to resolve this problem, Multi-frame Analysis (MFA) is introduced. MFA estimates a spectral envelope using many portions of speech which are vocalised using the same vocal-tract shape. Since each of the portions usually has a different F0, and consequently a different harmonic structure, a number of harmonics can be obtained at various frequencies to form a spectral envelope. The method thereby gives a closer approximation to the vocal-tract transfer function.

A New Method for Pitch Prediction from Spectral Envelope and its Application in Voice Conversion
Taoufik En-Najjary 1, Olivier Rosec 1, Thierry Chonavel 2; 1 France Télécom R&D, France; 2 ENST Bretagne, France
This paper deals with the estimation of pitch from spectral envelope information only. The proposed method uses a Gaussian Mixture Model (GMM) to characterize the joint distribution of the spectral envelope parameters and pitch-normalized values. During the learning stage, the model parameters are estimated by means of the EM algorithm. Then, a regression is made which enables the determination of a pitch prediction function from spectral envelope coefficients. Some results are presented which show the accuracy of the proposed method in terms of pitch prediction. Finally, the application of this method in a voice conversion system is described.

Adaptive Noise Estimation Using Second Generation and Perceptual Wavelet Transforms
Essa Jafer, Abdulhussain E. Mahdi; University of Limerick, Ireland
This paper describes the implementation and performance evaluation of three noise estimation algorithms using two different signal decomposition methods: a second-generation wavelet transform and a perceptual wavelet packet transform. These algorithms, which do not require the use of a speech activity detector or signal statistics learning histograms, are: a smoothing-based adaptive technique, a minimum variance tracking-based technique and a quantile-based technique. The paper also proposes a new and robust noise estimation technique, which utilises a combination of the quantile-based and smoothing-based algorithms. The performance of the latter technique is then evaluated and compared to those of the above three noise estimation methods under various noise conditions. Reported results demonstrate that all four algorithms are capable of tracking both stationary and non-stationary noise adequately, but with varying degrees of accuracy.
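A quantile-based noise estimate of the kind mentioned above can be stated in a few lines: for each subband, the noise power is taken as a low quantile of the recently observed power, on the assumption that speech is absent in at least that fraction of frames. This is only a generic sketch of the principle; the wavelet-domain details and the combined smoothing/quantile scheme of the paper are not reproduced, and the quantile and window length are illustrative.

```python
import numpy as np

def quantile_noise_estimate(power, q=0.2, window=100):
    """power: (n_frames, n_bands) subband power per analysis frame.
    Returns an (n_frames, n_bands) running noise-power estimate taken as
    the q-quantile of the last `window` frames in each band."""
    n_frames, _ = power.shape
    noise = np.zeros_like(power)
    for t in range(n_frames):
        start = max(0, t - window + 1)
        noise[t] = np.quantile(power[start:t + 1], q, axis=0)
    return noise

# toy usage: a constant noise floor of 1.0 with a loud speech-like burst
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    p = np.ones((300, 4)) + 0.1 * rng.random((300, 4))
    p[50:60] += 20.0
    est = quantile_noise_estimate(p)
    print(est[100].round(2))   # stays near the 1.0 noise floor
```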
Maximum Likelihood Endpoint Detection with Time-Domain Features
Marco Orlandi, Alfiero Santarelli, Daniele Falavigna; ITC-irst, Italy
In this paper we propose an effective, robust and computationally low-cost HMM-based start-endpoint detector for speech recognisers. Our first attempts follow the classical scheme of a feature extractor and a Viterbi classifier (used for voice activity detection), followed by a post-processing stage, but the ultimate goal we pursue is a pure HMM-based architecture capable of performing the endpointing task. The features used for voice activity detection are energy and zero crossing rate, together with the AMDF (Average Magnitude Difference Function), which proves to be a valid alternative to energy; further, we study the impact on performance of grammar structures and training conditions. In the end, we set the basis for the investigation of pure HMM-based architectures.

Unified Analysis of Glottal Source Spectrum
Ixone Arroabarren, Alfonso Carlosena; Universidad Publica de Navarra, Spain
The spectral study of the glottal excitation has traditionally been based on a single time-domain mathematical model of the signal and the dependence of its spectrum on the time-domain parameters. In contrast to this approach, in this work the two most widely used time-domain models have been studied jointly, namely the KLGLOTT88 and the LF models. Their spectra are analyzed in terms of their dependence on the general glottal source parameters: open quotient, asymmetry coefficient and spectral tilt. As a result, it is shown that even though the mathematical expressions for the two models are quite different, they can be made to converge. The main difference found is that in the KLGLOTT88 model the asymmetry coefficient is not independent of the open quotient and the spectral tilt. Once this relationship has been identified and translated to the LF model, both models are shown to be equivalent in both the time and frequency domains.
Session: PWeBf – Poster
Robust Speech Recognition I
Time: Wednesday 10.00, Venue: Main Hall, Level -1
Chair: Christian Wellekens, Eurecom, France

A Hidden Markov Model-Based Missing Data Imputation Approach
Yu Luo, Limin Du; Chinese Academy of Sciences, China
The accuracy of an automatic speech recognizer degrades rapidly when speech is distorted by noise, so robustness against noise remains one of the main challenges. In this paper, a hidden Markov model (HMM) based data imputation approach is presented to improve speech recognition robustness against noise at the front-end of the recognizer. Considering the correlation between different filter-banks, the approach realizes missing data imputation with an HMM of L states, each of which has a Gaussian output distribution with a full covariance matrix. "Missing" data in speech filter-bank vector sequences are recovered by a MAP procedure from the locally optimal state path or the marginal Viterbi-decoded HMM state sequence. The potential of the approach was tested using a speaker-independent continuous Mandarin speech recognizer with a syllable loop of perplexity 402, for both Gaussian and babble noise, each at 6 different SNR levels ranging from 0 dB to 25 dB, showing a significant improvement in robustness against additive noise.

Integration of Noise Reduction Algorithms for Aurora2 Task
Takeshi Yamada 1, Jiro Okada 1, Kazuya Takeda 2, Norihide Kitaoka 3, Masakiyo Fujimoto 4, Shingo Kuroiwa 5, Kazumasa Yamamoto 6, Takanobu Nishiura 7, Mitsunori Mizumachi 8, Satoshi Nakamura 8; 1 University of Tsukuba, Japan; 2 Nagoya University, Japan; 3 Toyohashi University of Technology, Japan; 4 Ryukoku University, Japan; 5 University of Tokushima, Japan; 6 Shinshu University, Japan; 7 Wakayama University, Japan; 8 ATR-SLT, Japan
To achieve high recognition performance for a wide variety of noises and for a wide range of signal-to-noise ratios, this paper presents the integration of four noise reduction algorithms: spectral subtraction with smoothing in the time direction, temporal-domain SVD-based speech enhancement, GMM-based speech estimation and KLT-based comb-filtering. Recognition results on the Aurora2 task show that the effectiveness of these algorithms and their combinations strongly depends on noise conditions, and excessive noise reduction tends to degrade recognition performance in multi-condition training.

Classification with Free Energy at Raised Temperatures
Rita Singh 1, Manfred K. Warmuth 2, Bhiksha Raj 3, Paul Lamere 4; 1 Carnegie Mellon University, USA; 2 University of California at Santa Cruz, USA; 3 Mitsubishi Electric Research Laboratories, USA; 4 Sun Microsystems Laboratories, USA
In this paper we describe a generalized classification method for HMM-based speech recognition systems that uses free energy as a discriminant function rather than conventional probabilities. The discriminant function incorporates a single adjustable temperature parameter T. The computation of free energy can be motivated using an entropy regularization, where the entropy grows monotonically with the temperature. In the resulting generalized classification scheme, the values T = 0 and T = 1 give the conventional Viterbi and forward algorithms, respectively, as special cases. We show experimentally that if the test data are mismatched with the classifier, classification at temperatures higher than one can lead to significant improvements in recognition performance. The temperature parameter is far more effective in improving performance on mismatched data than a variance scaling factor, which is another apparent single adjustable parameter that has a very similar analytical form.
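The T = 0 and T = 1 limits mentioned in the abstract translate directly into a temperature-scaled log-sum-exp over path scores. The sketch below is a schematic illustration of that discriminant (the per-class path log-likelihoods are assumed to be given, e.g. from an HMM lattice); it is not the authors' code, and the toy numbers are made up.

```python
import numpy as np

def free_energy(path_loglikes, T):
    """Free energy of one class given its candidate path log-likelihoods.
    T -> 0 recovers the best path (Viterbi); T = 1 recovers the sum over
    paths (forward algorithm)."""
    path_loglikes = np.asarray(path_loglikes, dtype=float)
    if T <= 1e-8:
        return -np.max(path_loglikes)            # Viterbi limit
    scaled = path_loglikes / T
    m = np.max(scaled)                           # stable log-sum-exp
    return -T * (m + np.log(np.sum(np.exp(scaled - m))))

def classify(per_class_path_loglikes, T=1.0):
    """Pick the class with the lowest free energy."""
    energies = [free_energy(ll, T) for ll in per_class_path_loglikes]
    return int(np.argmin(energies))

# toy usage: two classes, three candidate state paths each
if __name__ == "__main__":
    classes = [[-10.0, -11.0, -12.0], [-9.5, -30.0, -30.0]]
    for T in (0.0, 1.0, 5.0):
        print(T, classify(classes, T))   # the decision changes as T is raised
```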
Flooring the Observation Probability for Robust ASR in Impulsive Noise
Pei Ding 1, Bertram E. Shi 2, Pascale Fung 2, Zhigang Cao 1; 1 Tsinghua University, China; 2 Hong Kong University of Science & Technology, China
Impulsive noise usually introduces sudden mismatches between the observation features and acoustic models trained on clean speech, which drastically degrades the performance of automatic speech recognition (ASR) systems. This paper presents a novel method to directly suppress the adverse effect of impulsive noise on recognition. In this method, according to the noise sensitivity of each feature dimension, the observation vector is divided into several sub-vectors, each of which is assigned a suitable flooring threshold. In the recognition stage, the observation probability of each feature sub-vector is floored at the Gaussian mixture level. Thus, the unreliable relative probability differences caused by impulsive noise are eliminated, and the expected correct state sequence recovers its priority of being chosen in decoding. Experimental evaluations on the Aurora2 database show that the proposed method achieves average error rate reductions (ERR) of 61.62% and 84.32% in simulated impulsive noise and machine-gun noise environments, respectively, while maintaining high performance for clean speech recognition.

Combination of Temporal Domain SVD Based Speech Enhancement and GMM Based Speech Estimation for ASR in Noise – Evaluation on the AURORA2 Task –
Masakiyo Fujimoto, Yasuo Ariki; Ryukoku University, Japan
In this paper, we propose a noise robust speech recognition method based on the combination of temporal-domain singular value decomposition (SVD) based speech enhancement and Gaussian mixture model (GMM) based speech estimation. The bottleneck of the GMM-based approach is the noise estimation problem. To address it, we incorporated adaptive noise estimation into the GMM-based approach. Furthermore, in order to obtain higher recognition accuracy, we employed a temporal-domain SVD-based speech enhancement method as a pre-processing module for the GMM-based approach. In addition, to reduce the influence of the noise included in the noisy speech, we introduced an adaptive over-subtraction factor into the SVD-based speech enhancement. Usually, a noise reduction method degrades the recognition rate because of spectral distortion caused by residual noise and over-estimation. To address this problem, acoustic model adaptation is employed by applying unsupervised MLLR to the distorted speech signal. In an evaluation on the AURORA2 tasks, our method showed a relative improvement on the clean-condition training task.
Additive Noise and Channel Distortion-Robust Parametrization Tool – Performance Evaluation on Aurora 2 & 3
Petr Fousek, Petr Pollák; Czech Technical University in Prague, Czech Republic
In this paper an HTK-compatible robust speech parametrization tool, CtuCopy, is presented. This tool allows the usage of several additive noise suppression preprocessing techniques, nonlinear spectrum transformation, RASTA-like filtration, and direct final feature computation. The tool is general and easily extendible, and it may also be used for speech enhancement purposes. In the second part, parametrizations combining extended spectral subtraction for additive noise suppression and LDA RASTA-like filtration for channel-distortion elimination with final computation of PLP cepstral coefficients are examined and evaluated on the Aurora 2 & 3 and Czech SpeechDat corpora. This comparison shows specific algorithm features and the differences in their behavior on the above-mentioned databases. PLP cepstral coefficients with both extended spectral subtraction and LDA RASTA-like filtration seem to be a good choice for noise robust parametrization.

Noise Robust Speech Parameterization Based on Joint Wavelet Packet Decomposition and Autoregressive Modeling
Bojan Kotnik, Zdravko Kačič, Bogomir Horvat; University of Maribor, Slovenia
In this paper a noise robust feature extraction algorithm using joint wavelet packet decomposition (WPD) and autoregressive (AR) modeling of the speech signal is presented. In contrast to the short-time Fourier transform (STFT) based time-frequency signal representation, a computationally efficient WPD can lead to a better representation of non-stationary parts of the speech signal (consonants). The vowels are well described with an AR model, as in LPC analysis. The separately extracted WPD- and AR-based features are combined with the use of modified principal component analysis (PCA) and a voiced/unvoiced decision to produce the final output feature vector. Noise robustness is improved by the application of the proposed wavelet-based denoising algorithm with a modified soft-thresholding procedure and voice activity detection. Speech recognition results on the Aurora 3 databases show a performance improvement of 47.6% relative to the standard MFCC front-end.

Database Adaptation for ASR in Cross-Environmental Conditions in the SPEECON Project
Christophe Couvreur 1, Oren Gedge 2, Klaus Linhard 3, Shaunie Shammass 2, Johan Vantieghem 1; 1 ScanSoft Belgium, Belgium; 2 Natural Speech Communication, Israel; 3 DaimlerChrysler AG, Germany
As part of the SPEECON corpora collection project, a software toolbox for transforming speech recordings made in a quiet environment with a close-talk microphone into far-talk noisy recordings has been developed. The toolbox allows speech recognizers to be trained for new acoustic environments without requiring an extensive data collection effort. This communication complements a previous article in which the adaptation toolbox was described in detail and preliminary experimental results were presented. Detailed experimental results on a database specifically collected for testing purposes show the performance improvements that can be obtained with the database adaptation toolbox in various far-talk and noisy conditions. The Hebrew corpus collected for SPEECON is also used to assess how close a recognizer trained on simulated data can get to a recognizer trained on real far-talk noisy data.
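The core transformation such a toolbox performs can be approximated very simply: convolve the close-talk signal with a room impulse response and add environment noise at a target SNR. The sketch below shows only this generic simulation step under those assumptions; the SPEECON toolbox itself is more elaborate, and all names and parameter values here are illustrative.

```python
import numpy as np

def simulate_far_talk(clean, room_ir, noise, snr_db):
    """Turn a close-talk recording into a simulated far-talk noisy one:
    reverberate with a room impulse response, then add noise at snr_db."""
    reverberant = np.convolve(clean, room_ir)[: len(clean)]
    noise = noise[: len(reverberant)]
    sig_power = np.mean(reverberant ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12
    gain = np.sqrt(sig_power / (noise_power * 10 ** (snr_db / 10.0)))
    return reverberant + gain * noise

# toy usage with synthetic signals
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    clean = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)
    ir = np.zeros(800)
    ir[0], ir[400] = 1.0, 0.5          # direct path plus one echo
    noise = rng.normal(size=16000)
    far = simulate_far_talk(clean, ir, noise, snr_db=10.0)
    print(far.shape)
```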
Robust Feature Extraction and Acoustic Modeling at Multitel: Experiments on the Aurora Databases
Stéphane Dupont, Christophe Ris; Multitel, Belgium
This paper summarizes some of the robust feature extraction and acoustic modeling technologies used at Multitel, together with their assessment on some of the ETSI Aurora reference tasks. Ongoing work and directions for further research are also presented. For feature extraction (FE), we are using PLP coefficients. Additive and convolutional noise are addressed using a cascade of spectral subtraction and temporal trajectory filtering. For acoustic modeling (AM), artificial neural networks (ANNs) are used for estimating the HMM state probabilities. At the junction of FE and AM, the multi-band structure provides a way to address the needs of robustness by targeting both processing levels. Robust features within sub-bands can be extracted using a form of discriminant analysis; in this work, this is obtained using sub-band ANN acoustic models. The robust sub-band features are then used for the estimation of state probabilities. These systems are evaluated on the Aurora tasks in comparison to the existing ETSI features. Our baseline system has performance similar to the ETSI advanced features coupled with the HTK back-end. On the Aurora 3 tasks, the multi-band system outperforms the best ETSI results with an average reduction of the word error rate of about 62% with respect to the baseline ETSI system and of about 18% with respect to the advanced ETSI system. This confirms previous positive experience with the multi-band architecture on other databases.

Autoregressive Modeling Based Feature Extraction for Aurora3 DSR Task
Petr Motlíček, Jan Černocký; Brno University of Technology, Czech Republic
Techniques for the analysis of speech that use autoregressive (all-pole) modeling approaches are presented here and compared to the generally known Mel-frequency cepstrum based feature extraction. In the paper we first focus on several possible applications of modeling speech power spectra that increase the performance of an ASR system, mainly in the case of a large mismatch between training and testing data. Then attention is paid to the different types of features that can be extracted from an all-pole model to reduce the overall word error rate. The results show that the generally used cepstrum-based features, which can be easily extracted from an all-pole model, are not the most suitable parameters for ASR where the input speech is corrupted by different types of real noise. Very good recognition performance was achieved, e.g., with discrete or selective all-pole modeling based approaches, or with decorrelated line spectral frequencies. The feature extraction techniques were tested on the SpeechDat-Car databases used for the front-end evaluation of advanced distributed speech recognition (DSR) systems.
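For readers unfamiliar with all-pole analysis, the basic step behind such features is estimating an AR model from the frame autocorrelation and reading the spectral envelope off it. Below is a generic Levinson-Durbin sketch of that step (not the discrete or selective all-pole variants the paper evaluates); the model order and frame are illustrative.

```python
import numpy as np

def levinson_durbin(r, order):
    """Solve for the prediction-error filter A(z) = 1 + a1 z^-1 + ... from
    autocorrelation values r[0..order]. Returns (a, residual_energy)."""
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        k = -np.dot(a[:i], r[i:0:-1]) / err      # reflection coefficient
        a_new = a.copy()
        a_new[1:i] += k * a[i - 1:0:-1]
        a_new[i] = k
        a, err = a_new, err * (1.0 - k * k)
    return a, err

def allpole_envelope(frame, order=12, n_fft=512):
    """Log all-pole spectral envelope of one windowed speech frame."""
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1: len(frame) + order]
    a, err = levinson_durbin(r, order)
    spectrum = np.fft.rfft(a, n_fft)
    return np.log(err + 1e-12) - 2.0 * np.log(np.abs(spectrum) + 1e-12)

# toy usage on a synthetic voiced-like frame
if __name__ == "__main__":
    n = 400
    t = np.arange(n)
    frame = np.sin(0.3 * t) + 0.5 * np.sin(0.9 * t)
    env = allpole_envelope(frame * np.hanning(n))
    print(env.shape, round(float(env.max()), 2))
```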
Evaluation on the Aurora 2 Database of Acoustic Models That Are Less Noise-Sensitive
Edmondo Trentin 1, Marco Matassoni 2, Marco Gori 1; 1 Università degli Studi di Siena, Italy; 2 ITC-irst, Italy
The Aurora 2 database may be used as a benchmark for the evaluation of algorithms under noisy conditions. In particular, the clean training/noisy test mode is aimed at evaluating models that are trained on clean data only, without further adjustment on the noisy data, i.e. under severe mismatch between the training and test conditions. While several researchers have proposed techniques at the front-end level to improve recognition performance over the reference hidden Markov model (HMM) baseline, investigations at the back-end level are sought. In this respect, the goal is to develop acoustic models that are intrinsically less noise-sensitive. This paper presents the word accuracy yielded by a non-parametric HMM with connectionist estimates of the emission probabilities, i.e. a neural network is applied instead of the usual parametric (Gaussian mixture) probability densities. A regularization technique, relying on a maximum-likelihood parameter grouping algorithm, is explicitly introduced to increase the generalization capability of the model and, in turn, its noise-robustness. Results show that a 15.43% relative word error rate reduction w.r.t. the Gaussian-mixture HMM is obtained by averaging over the different noises and SNRs of Aurora 2 test set A.

Revisiting Scenarios and Methods for Variable Frame Rate Analysis in Automatic Speech Recognition
J. Macías-Guarasa, J. Ordóñez, J.M. Montero, J. Ferreiros, R. Córdoba, L.F. D'Haro; Universidad Politécnica de Madrid, Spain
In this paper we present a revision and evaluation of some of the main methods used in variable frame rate (VFR) analysis applied to speech recognition systems. The work found in the literature in this area usually deals with restricted conditions and scenarios; we have revisited the main algorithmic alternatives and evaluated them within the same experimental framework, so that we have been able to establish objective considerations for each of them and select the most adequate strategy. We also show to what extent VFR analysis is useful in its three main application scenarios, namely reducing computational load, improving acoustic modelling, and handling additive noise conditions in the time domain. From our evaluation on a difficult telephone large vocabulary task, we establish that VFR analysis does not significantly improve the results obtained using traditional fixed frame rate (FFR) analysis, except when additive noise is present in the database, and especially for low SNRs.
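The basic mechanism evaluated in such VFR studies is easy to state: a frame is emitted only when the feature vector has moved far enough from the last emitted frame. The following is a toy sketch of that decision rule with an illustrative threshold, not any of the specific algorithms compared in the paper.

```python
import numpy as np

def variable_frame_rate(features, threshold=1.0):
    """features: (n_frames, dim) fixed-rate feature vectors.
    Returns the indices of frames kept by a simple distance-based VFR rule."""
    kept = [0]
    last = features[0]
    for t in range(1, len(features)):
        if np.linalg.norm(features[t] - last) > threshold:
            kept.append(t)
            last = features[t]
    return kept

# toy usage: a steady segment followed by a rapidly changing one
if __name__ == "__main__":
    steady = np.zeros((50, 13))
    changing = np.cumsum(np.ones((20, 13)), axis=0)
    idx = variable_frame_rate(np.vstack([steady, changing]), threshold=2.0)
    print(len(idx))   # far fewer frames are kept in the steady region
```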
Multitask Learning in Connectionist Robust ASR Using Recurrent Neural Networks
Shahla Parveen, Phil Green; University of Sheffield, U.K.
The use of prior knowledge in machine learning techniques has been shown to give better generalisation performance on unseen data. However, this idea has not so far been investigated for robust ASR. Training several related tasks simultaneously is called multitask learning (MTL): the extra tasks effectively incorporate prior knowledge. In this work we present an application of MTL to robust ASR. We have used an RNN architecture to integrate classification and enhancement of noisy speech in an MTL framework. Enhancement is used as an extra task to obtain higher recognition performance on unseen data. We report our results on an isolated word recognition task. The reduction in error rate relative to multi-condition training with HMMs for subway, babble, car and exhibition noises was 53.37%, 21.99%, 37.01% and 44.13%, respectively.

Confusion Matrix Based Entropy Correction in Multi-Stream Combination
Hemant Misra, Andrew Morris; IDIAP, Switzerland
An MLP classifier outputs a posterior probability for each class. With noisy data, classification becomes less certain, and the entropy of the posterior distribution tends to increase, providing a measure of classification confidence. However, at high noise levels, entropy can give a misleading indication of classification certainty. Very noisy data vectors may be classified systematically into classes which happen to be most noise-like, and the resulting confusion matrix shows a dense column for each noise-like class. In this article we show how this pattern of misclassification in the confusion matrix can be used to derive a linear correction to the MLP posterior estimates. We test the ability of this correction to reduce the problem of misleading confidence estimates and to enhance the performance of the entropy-based full-combination multi-stream approach. Better word error rates are achieved on the Numbers95 database at different levels of added noise. The correction performs significantly better at high SNRs.

Session: PWeBg – Poster
Speech Recognition - Large Vocabulary I
Time: Wednesday 10.00, Venue: Main Hall, Level -1
Chair: Alex Acero, Microsoft Research, USA

Large Vocabulary ASR for Spontaneous Czech in the MALACH Project
Josef Psutka 1, Pavel Ircing 1, J.V. Psutka 1, Vlasta Radová 1, William J. Byrne 2, Jan Hajič 3, Jirí Mírovsky 3, Samuel Gustman 4; 1 University of West Bohemia in Pilsen, Czech Republic; 2 Johns Hopkins University, USA; 3 Charles University, Czech Republic; 4 Survivors of the Shoah Visual History Foundation, USA
This paper describes LVCSR research into the automatic transcription of spontaneous Czech speech in the MALACH (Multilingual Access to Large Spoken Archives) project. This project attempts to provide improved access to the large multilingual spoken archives collected by the Survivors of the Shoah Visual History Foundation (VHF) (www.vhf.org) by advancing the state of the art in automated speech recognition. We describe a baseline ASR system and discuss the problems in language modeling that arise from the nature of Czech as a highly inflectional language that also exhibits diglossia between its written and spontaneous forms. The difficulties of this task are compounded by heavily accented, emotional and disfluent speech, along with frequent switching between languages. To overcome the limited amount of relevant language model data we use statistical techniques for selecting an appropriate training corpus from a large unstructured text collection, resulting in significant reductions in word error rate.

Active and Unsupervised Learning for Automatic Speech Recognition
Giuseppe Riccardi, Dilek Z. Hakkani-Tür; AT&T Labs-Research, USA
State-of-the-art speech recognition systems are trained using human transcriptions of speech utterances. In this paper, we describe a method to combine active and unsupervised learning for automatic speech recognition (ASR). The goal is to minimize the human supervision needed for training acoustic and language models and to maximize the performance given the transcribed and untranscribed data. Active learning aims at reducing the number of training examples to be labeled by automatically processing the unlabeled examples and then selecting the most informative ones with respect to a given cost function. For unsupervised learning, we utilize the remaining untranscribed data by using their ASR output and word confidence scores. Our experiments show that the amount of labeled data needed for a given word accuracy can be reduced by 75% by combining active and unsupervised learning.
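A schematic of how such a split might be made from recognizer confidence scores is shown below: the least confident utterances are sent for human transcription (active learning) and the most confident ones keep their ASR hypotheses as automatic labels (unsupervised learning). The thresholds, budget and selection criterion here are illustrative assumptions, not the paper's actual cost function.

```python
def split_for_training(utterances, labeling_budget, auto_label_threshold=0.9):
    """utterances: list of (utt_id, asr_hypothesis, confidence).
    Returns (ids to transcribe manually, automatically labeled pairs)."""
    ranked = sorted(utterances, key=lambda u: u[2])        # least confident first
    to_transcribe = [u[0] for u in ranked[:labeling_budget]]
    auto_labeled = [(u[0], u[1]) for u in ranked[labeling_budget:]
                    if u[2] >= auto_label_threshold]
    return to_transcribe, auto_labeled

# toy usage
if __name__ == "__main__":
    utts = [("u1", "call home", 0.55), ("u2", "check mail", 0.95),
            ("u3", "play music", 0.40), ("u4", "stop", 0.92)]
    manual, automatic = split_for_training(utts, labeling_budget=1)
    print(manual)      # ['u3']  -> sent to human transcribers
    print(automatic)   # high-confidence ASR output reused as training labels
```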
Perceptual MVDR-Based Cepstral Coefficients (PMCCs) for High Accuracy Speech Recognition
Umit H. Yapanel 1, Satya Dharanipragada 2, John H.L. Hansen 1; 1 University of Colorado at Boulder, USA; 2 IBM T.J. Watson Research Center, USA
This paper describes an accurate feature representation for continuous clean speech recognition. The main components of the technique involve performing a moderate-order Linear Predictive (LP) analysis and computing the Minimum Variance Distortionless Response (MVDR) spectrum from these LP coefficients. This feature representation, PMCCs, was earlier shown to yield superior performance over MFCCs for different noise conditions, with emphasis on car noise [1]. The performance improvement was then attributed to the better spectrum and envelope modeling properties of the MVDR methodology. This study shows that the representation is also quite efficient for clean speech recognition. In fact, PMCCs are shown to be a more accurate envelope representation and to reduce speaker variability. This, in turn, yields a 12.8% relative word error rate (WER) reduction on the combination of the Wall Street Journal (WSJ) Nov'92 dev/eval sets with respect to MFCCs. Accurate envelope modeling and reduction in speaker variability also lead to faster decoding, based on efficient pruning in the search stage. The total gain in decoding speed is 22.4% relative to the standard MFCC features. It is also shown that PMCCs are not very demanding in terms of computation when compared to MFCCs. Therefore, we conclude that the PMCC feature extraction scheme is a better representation of clean speech, as well as noisy speech, than the MFCC scheme.

A Discriminative Decision Tree Learning Approach to Acoustic Modeling
Sheng Gao 1, Chin-Hui Lee 2; 1 Institute for Infocomm Research, Singapore; 2 Georgia Institute of Technology, USA
The decision tree is a popular method for tying the states of a set of context-dependent phone HMMs for efficient and effective training of large acoustic models. A likelihood-based impurity function is commonly adopted. It is well known that maximizing likelihood does not result in maximal separation between the distributions in the leaves of the tree. To improve robustness, a discriminative decision tree learning approach is proposed. It embeds the MCE-GPD formulation in defining the impurity function so that discriminative information can be taken into account while optimizing the tree. We compare the proposed approach with conventional tree building using a Mandarin syllable recognition task. Our preliminary results show that the separation between the divided subspaces in the tree nodes is clearly enhanced, although there is a slight performance reduction.

Large Corpus Experiments for Broadcast News Recognition
Patrick Nguyen, Luca Rigazio, Jean-Claude Junqua; Panasonic Speech Technology Laboratory, USA
This paper investigates the use of a large corpus for the training of a Broadcast News speech recognizer. A vast body of speech recognition algorithms and mathematical machinery is aimed at smoothing estimates toward accurate modeling with scant amounts of data. In most cases, this research is motivated by a real need for more data. In Broadcast News, however, a large corpus is already available to all LDC members. Until recently, it has not been considered for acoustic training. We would like to pioneer the use of the largest speech corpus (1200 h) available for the purpose of acoustic training of speech recognition systems. To the best of our knowledge it is the largest-scale acoustic training ever considered in speech recognition. We obtain a performance improvement of 1.5% absolute WER over our best standard (200 h) training.
Performance Evaluation of Phonotactic and Contextual Onset-Rhyme Models for Speech Recognition of Thai Language
Somchai Jitapunkul, Ekkarit Maneenoi, Visarut Ahkuputra, Sudaporn Luksaneeyanawin; Chulalongkorn University, Thailand
This paper proposes two acoustic modelings of the onset-rhyme for speech recognition: the Phonotactic Onset-Rhyme Model (PORM) and the Contextual Onset-Rhyme Model (CORM). The models comprise a pair of onset and rhyme units, which makes up a syllable. An onset comprises an initial consonant and its transition towards the following vowel. Together with the onset, the rhyme consists of a steady vowel portion and a final consonant. Experiments have been carried out to find the proper acoustic model, which can accurately model Thai sounds and gives higher accuracy. Experimental results show that the onset-rhyme models outperform triphone models for both PORM and CORM. The PORM achieves 2.74% higher syllable accuracy than the CORM. Moreover, the onset-rhyme models are also more efficient in terms of system complexity than the triphone models.

Overlapped Di-Tone Modeling for Tone Recognition in Continuous Cantonese Speech
Yao Qian, Tan Lee, Yujia Li; Chinese University of Hong Kong, China
This paper presents a novel approach to tone recognition in continuous Cantonese speech based on overlapped di-tone Gaussian mixture models (ODGMM). The ODGMM is designed with special consideration of the fact that Cantonese tone identification relies more on the relative pitch level than on the pitch contour. A di-tone unit covers a group of two consecutive tone occurrences. The tone sequence carried by a Cantonese utterance can be considered as the connection of such di-tone units. Adjacent di-tone units overlap with each other by exactly one tone. For each di-tone unit, a GMM is trained with a 10-dimensional feature vector that characterizes the F0 movement within the unit. In particular, the di-tone models capture the relative deviation between the F0 levels of the two tones. A Viterbi decoding algorithm is adopted to search for the optimal tone sequence, under the phonological constraints on syllable-tone combination. Experimental results show the ODGMM approach significantly outperforms previously proposed methods for tone recognition in continuous Cantonese speech.

Speaker Model Selection Using Bayesian Information Criterion for Speaker Indexing and Speaker Adaptation
Masafumi Nishida 1, Tatsuya Kawahara 2; 1 Japan Science and Technology Corporation, Japan; 2 Kyoto University, Japan
This paper addresses unsupervised speaker indexing for discussion audio archives. We propose a flexible framework that selects an optimal speaker model (GMM or VQ) based on the Bayesian Information Criterion (BIC) according to the input utterances. The framework makes it possible to use a discrete model when the data are sparse, and to seamlessly switch to a continuous model after a large cluster is obtained. The speaker indexing is also applied to and evaluated on automatic speech recognition of discussions by adapting a speaker-independent acoustic model to each participant. It is demonstrated that indexing with our method is sufficiently accurate for speaker adaptation.
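The BIC-based choice between a discrete and a continuous speaker model reduces to comparing penalized log-likelihoods. A minimal sketch of that comparison is given below; the penalty weight, parameter counts and toy numbers are assumptions for illustration, not the paper's exact procedure.

```python
import math

def bic(log_likelihood, n_params, n_frames, penalty_weight=1.0):
    """Bayesian Information Criterion: penalized log-likelihood."""
    return log_likelihood - 0.5 * penalty_weight * n_params * math.log(n_frames)

def select_speaker_model(candidates, n_frames):
    """candidates: dict name -> (log_likelihood, n_params).
    Returns the name of the model with the highest BIC."""
    scores = {name: bic(ll, k, n_frames) for name, (ll, k) in candidates.items()}
    return max(scores, key=scores.get)

# toy usage: with little data the cheap VQ model wins, with more data the GMM
if __name__ == "__main__":
    few = {"vq": (-480.0, 64), "gmm": (-455.0, 512)}
    print(select_speaker_model(few, n_frames=200))     # 'vq'
    many = {"vq": (-48000.0, 64), "gmm": (-45500.0, 512)}
    print(select_speaker_model(many, n_frames=20000))  # 'gmm'
```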
Automatic Transcription of Football Commentaries in the MUMIS Project
Janienke Sturm 1, Judith M. Kessens 1, Mirjam Wester 2, Febe de Wet 1, Eric Sanders 1, Helmer Strik 1; 1 University of Nijmegen, The Netherlands; 2 University of Edinburgh, U.K.
This paper describes experiments carried out to automatically transcribe football commentaries in Dutch, English and German for multimedia indexing. Our results show that the high levels of stadium noise in the material create a task that is extremely difficult for conventional ASR. The baseline WERs vary from 83% to 94% for the three languages investigated. Employing state-of-the-art noise robustness techniques leads to relative reductions of 9-10% WER. Application-specific words such as players' names are recognized correctly in about 50% of cases. Although this result is substantially better than the overall result, it is inadequate. Much better results can be obtained if the football commentaries are recorded separately from the stadium noise. This would make the automatic transcriptions more useful for multimedia indexing.

On the Limits of Cluster-Based Acoustic Modeling
S. Douglas Peters; Nuance Communications, Canada
This article reports a two-part study of structured acoustic modeling of speech. First, speaker-independent clustering of speech material was used as the basis for practical cluster-based acoustic modeling. Each cluster's training material is applied to the adaptation of baseline hidden Markov model (HMM) parameters for recognition purposes. Further, the training material of each cluster is also used to train phone-level Gaussian mixture models (GMMs) for cluster identification. Test utterances are evaluated on all such models to identify an appropriate cluster or cluster combination. Experiments demonstrate that such cluster-based adaptation can yield accuracy gains over computationally similar baseline models. At the same time, these gains and those of similar methods found in the literature are modest. Hence, the second part of our study examined the limitations of the approach by considering utterance consistency: that is, the ability of acoustically derived cluster models to uniquely identify a single utterance. These second experiments show that arbitrary pieces of a given utterance are likely to be identified with different clusters, in opposition to an implicit assumption of cluster-based acoustic modeling.

Fitting Class-Based Language Models into Weighted Finite-State Transducer Framework
Pavel Ircing, Josef Psutka; University of West Bohemia in Pilsen, Czech Republic
In our paper we propose a general way of incorporating class-based language models with many-to-many word-to-class mapping into the finite-state transducer (FST) framework. Since class-based models alone usually do not improve recognition accuracy, we also present a method for efficient language model combination. An example of a word-to-class mapping based on morphological tags is also given. Several word-based and tag-based language models are tested in the task of transcribing Czech broadcast news. Results show that class-based models help to achieve a moderate improvement in recognition accuracy.
Large Vocabulary Taiwanese (Min-Nan) Speech Recognition Using Tone Features and Statistical Pronunciation Modeling
Dau-Cheng Lyu 1, Min-Siong Liang 1, Yuang-Chin Chiang 2, Chun-Nan Hsu 3, Ren-Yuan Lyu 1; 1 Chang Gung University, Taiwan; 2 National Tsing Hua University, Taiwan; 3 Academia Sinica, Taiwan
A large vocabulary Taiwanese (Min-nan) speech recognition system is described in this paper. Due to the severe multiple-pronunciation phenomenon in Taiwanese, partly caused by tone sandhi, a statistical pronunciation modeling technique based on tonal features is used. The system is speaker independent. It was trained on a bilingual Mandarin/Taiwanese speech corpus to alleviate the lack of a pure Taiwanese speech corpus. The search network is constructed from nodes of Chinese characters and directly outputs Chinese character strings. Experiments show that by using the approaches proposed in this paper, the character error rate decreases significantly from 21.50% to 11.97%.

Enhanced Tree Clustering with Single Pronunciation Dictionary for Conversational Speech Recognition
Hua Yu, Tanja Schultz; Carnegie Mellon University, USA
Modeling pronunciation variation is key for recognizing conversational speech. Rather than being limited to dictionary modeling, we argue that triphone clustering is an integral part of pronunciation modeling. We propose a new approach called enhanced tree clustering. This approach, in contrast to traditional decision-tree based state tying, allows parameter sharing across phonemes. We show that accurate pronunciation modeling can be achieved through efficient parameter sharing in the acoustic model. Combined with a single pronunciation dictionary, a 1.8% absolute word error rate improvement is achieved on Switchboard, a large vocabulary conversational speech recognition task.

Multi-Source Training and Adaptation for Generic Speech Recognition
Fabrice Lefevre, Jean-Luc Gauvain, Lori Lamel; LIMSI-CNRS, France
In recent years there has been a considerable amount of work devoted to porting speech recognizers to new tasks. Recognition systems are usually tuned to a particular task, and porting the system to a new task (or language) is both time-consuming and expensive. In this paper, issues in speech recognition portability are addressed, in particular the development of generic models for speech recognition. Multi-source training techniques aimed at enhancing the genericity of some wide-domain models are investigated. We show that multi-source training and adaptation can reduce the performance gap between task-independent and task-dependent acoustic models, and for some tasks even out-perform task-dependent acoustic models.

A New Spectral Transformation for Speaker Normalization
Pierre L. Dognin, Amro El-Jaroudi; University of Pittsburgh, USA
This paper proposes a new spectral transformation for speaker normalization. We use the Bilinear Transformation (BLT) to introduce a new frequency warping resulting from the mapping of a prototype Band-Pass (BP) filter into a general BP filter. This new transformation, called the "Band-Pass Transform" (BPT), offers two degrees of freedom, enabling complex warpings of the frequency axis that differ from previous work with the BLT. A procedure based on the Nelder-Mead algorithm is proposed to estimate the BPT parameters. Our experimental results include a detailed study of the performance of the BPT compared to other VTLN methods for a subset of speakers, and results on large test sets. BPT performs better than the other VTLN methods and offers a gain of 1.13% absolute on the Hub-5 English Eval01 set.
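For context, the simplest member of this family of warpings is the one-parameter all-pass (bilinear) warp commonly used for VTLN, whose closed form is shown below; the two-parameter band-pass transform of the paper generalizes this, and the code is only a generic illustration with made-up parameter values.

```python
import numpy as np

def bilinear_warp(omega, alpha):
    """First-order all-pass frequency warping, omega in radians (0..pi).
    alpha > 0 stretches low frequencies, alpha < 0 compresses them."""
    return omega + 2.0 * np.arctan(alpha * np.sin(omega) / (1.0 - alpha * np.cos(omega)))

def warp_spectrum(spectrum, alpha):
    """Resample a magnitude spectrum (uniform grid on 0..pi) onto the
    warped frequency axis by linear interpolation."""
    n = len(spectrum)
    omega = np.linspace(0.0, np.pi, n)
    return np.interp(bilinear_warp(omega, alpha), omega, spectrum)

# toy usage: warp a ramp-shaped "spectrum" to mimic a vocal-tract length change
if __name__ == "__main__":
    spec = np.linspace(1.0, 0.0, 129)
    print(warp_spectrum(spec, alpha=0.1)[:5].round(3))
```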
Toward Domain-Independent Conversational Speech Recognition
Brian Kingsbury, Lidia Mangu, George Saon, Geoffrey Zweig, Scott Axelrod, Vaibhava Goel, Karthik Visweswariah, Michael Picheny; IBM T.J. Watson Research Center, USA
We describe a multi-domain, conversational test set developed for IBM's Superhuman speech recognition project and our 2002 benchmark system for this task. Through the use of multi-pass decoding, unsupervised adaptation and the combination of hypotheses from systems using diverse feature sets and acoustic models, we achieve a word error rate of 32.0% on data drawn from voicemail messages, two-person conversations and multiple-person meetings.

Comparative Study of Boosting and Non-Boosting Training for Constructing Ensembles of Acoustic Models
Rong Zhang, Alexander I. Rudnicky; Carnegie Mellon University, USA
This paper compares the performance of Boosting and non-Boosting training algorithms in large vocabulary continuous speech recognition (LVCSR) using ensembles of acoustic models. Both algorithms demonstrated significant word error rate reductions on the CMU Communicator corpus. However, the two algorithms produced comparable improvements, even though one would expect the Boosting algorithm, which has a solid theoretical foundation, to work much better than the non-Boosting algorithm. Several voting schemes for hypothesis combination were evaluated, including weighted voting, un-weighted voting and ROVER.

Session: PWeBh – Poster
Spoken Dialog Systems II
Time: Wednesday 10.00, Venue: Main Hall, Level -1
Chair: Paul Heisterkamp, DaimlerChrysler, Germany

A Study on Domain Recognition of Spoken Dialogue Systems
T. Isobe, S. Hayakawa, H. Murao, T. Mizutani, Kazuya Takeda, Fumitada Itakura; Nagoya University, Japan
In this paper, we present a multi-domain spoken dialogue system equipped with the capability of parallel computation of the speech recognition engines that are assigned to each domain. The experimental system is set up to handle three different domains (restaurant information, weather reports, and news queries) for in-car use. All of these tasks are of an information retrieval nature. The domain of a particular utterance is determined based on the likelihood of each speech recognizer. In addition to the human-machine interaction, the synthesized voice of the route sub-system interrupts the dialogue frequently. Experimental evaluation yielded 95 percent recognition accuracy in selecting the task domain, based on a specially designed scoring method.
Domain Adaptation Augmented by State-Dependence in Spoken Dialog Systems
Wei He, Honglian Li, Baozong Yuan; Northern Jiaotong University, China
In the development of spoken dialog systems, domain adaptation and dialog state-dependent language models are usually researched separately. This paper proposes a new approach for domain adaptation augmented by dialog state-dependence, namely a dialog-turn based cache model that decays synchronously with dialog state changes. Through this approach it is simpler and faster to adapt a Chinese spoken dialog system to a new task. Two different tasks, train ticket reservation and park guidance, are selected as the target tasks in the experiments. Consistent reductions of perplexity and character error rate are observed during the adaptation.

SmartKom-Home – An Advanced Multi-Modal Interface to Home Entertainment
Thomas Portele 1, Silke Goronzy 2, Martin Emele 2, Andreas Kellner 1, Sunna Torge 2, Jürgen te Vrugt 1; 1 Philips Research Aachen, Germany; 2 Sony International (Europe) GmbH, Germany
This paper describes the SmartKom-Home system realized within the SmartKom project. It assists the user by means of a multi-modal dialogue system in the home environment. This involves the control of various devices and the access to services. SmartKom-Home is supposed to serve as a uniform interface to all these devices and services, so the user is freed from the necessity of understanding which of the devices to consult, how, and when, in order to fulfill complex wishes. We describe the setting of this scenario together with the hardware used. We furthermore discuss the specific requirements that evolve in a home environment and how they are handled in the project.

Methods to Improve Its Portability of A Spoken Dialog System Both on Task Domains and Languages
Yunbiao Xu 1, Fengying Di 1, Masahiro Araki 2, Yasuhisa Niimi 2; 1 Hangzhou University of Commerce, China; 2 Kyoto Institute of Technology, Japan
This paper presents methods to improve the portability of a spoken dialog system across both task domains and languages, which have been implemented in Chinese and Japanese for sightseeing and accommodation-seeking guidance tasks. The methods include case frame conversion, template-based text generation and a topic-frame driven dialog control scheme. The former two methods are for improving portability across languages, and the last one is for improving portability across domains. The case frame conversion is used for translating a source language case frame into a pivot language one. The template-based text generation is used for generating text responses in a particular language from abstract responses. The topic-frame driven dialog control scheme makes it possible to manage mixed-initiative dialog based on a set of task-dependent topic frames. The experiments showed that the proposed methods can be used to improve the portability of a dialog system across domains and languages.
Voxenter™ – Intelligent Voice Enabled Call Center for Hungarian
Tibor Fegyó 1, Péter Mihajlik 1, Máté Szarvas 2, Péter Tatai 1, Gábor Tatai 3; 1 Budapest University of Technology and Economics, Hungary; 2 Tokyo Institute of Technology, Japan; 3 AITIA Inc., Hungary
In this article we present a voice enabled call center which integrates our basic and applied research results on Hungarian speech recognition. Telephone interfaces, data storage and retrieval modules, and an intelligent dialog descriptor and manager module are also parts of the system. To evaluate the efficiency of the recognition and the dialog, a voice enabled call center was implemented and tested under real-life conditions. This article describes the main modules of the system and compares the results of the field tests with those of the laboratory testing.

Automatic Call-Routing Without Transcriptions
Qiang Huang, Stephen J. Cox; University of East Anglia, U.K.
Call-routing is now an established technology for automating customers' telephone queries. However, transcribing calls for training purposes for a particular application requires considerable human effort, and it would be preferable for the system to learn routes without transcriptions being provided. This paper introduces a technique for fully automatic routing. It is based on first identifying salient acoustic morphemes in a phonetic decoding of the input speech, followed by Linear Discriminant Analysis (LDA) to improve classification. Experimental results on an 18-route retail store enquiry point task using this technique are compared with results obtained using word-level transcriptions.

Jaspis2 – An Architecture for Supporting Distributed Spoken Dialogues
Markku Turunen, Jaakko Hakulinen; University of Tampere, Finland
In this paper, we introduce an architecture for a new generation of speech applications. The presented architecture is based on our previous work with multilingual speech applications and extends it by introducing support for synchronized distributed dialogues, which is needed in emerging application areas such as mobile and ubiquitous computing. The architecture supports coordinated distribution of dialogues, concurrent dialogues, system-level adaptation and a shared system context. The overall idea is to use interaction agents to distribute dialogues, use an evaluation mechanism to make them dynamically adaptive, and synchronize them by using a coordination mechanism with triggers and transactions. We present experiences from several applications written on top of the freely available architecture.

Development of a Bilingual Spoken Dialog System for Weather Information Retrieval
Janez Žibert 1, Sanda Martinčić-Ipšić 2, Melita Hajdinjak 1, Ivo Ipšić 2, France Mihelič 1; 1 University of Ljubljana, Slovenia; 2 University of Rijeka, Croatia
In this paper we present the strategy, current activities and results of a joint project on designing a spoken dialog system for Slovenian and Croatian weather information retrieval. We give a brief description of the system design, of the procedures we have performed in order to obtain domain-specific speech databases, of monolingual and bilingual speech recognition experiments, and of WOZ simulation experiments. Recognition results for Croatian and Slovenian speech are presented, as well as bilingual speech recognition results when using common acoustic models. We propose two different approaches to the language identification problem and show recognition results for the two acoustically similar languages. Results of dialog simulations, performed in order to gain insight into user behaviour when accessing a spoken dialog system, are also presented.

Improving "How May I Help You?" Systems Using the Output of Recognition Lattices
James Allen, David Attwater, Peter Durston, Mark Farrell; BTexact Technologies, U.K.
"How may I help you?" systems, in which a caller to a call centre is routed to one of a set of destinations using machine recognition of spontaneous natural language, present a difficult task. Previous BT "How May I Help You" work [1,2] used top-1 recognition results for classification, with much better results when tested on human transcriptions. Classifying using a recognition lattice was found to reduce the gap between results on transcriptions and recognition output. Using features generated from the lattice in addition to the top-1 recognition results gave an improvement in classification of 4% absolute over a baseline system using only the top-1 recognition result. This reduced the gap between classification performance on recognition and transcription by over 25%.

Incremental Learning of New User Formulations in Automatic Directory Assistance
M. Andorno 1, L. Fissore 2, P. Laface 1, M. Nigra 2, C. Popovici 2, F. Ravera 2, C. Vair 2; 1 Politecnico di Torino, Italy; 2 Loquendo, Italy
Directory Assistance for business listings is a challenging task: one of its main problems is that customers formulate their requests for the same listing with great variability. Since it is difficult to reliably predict the user formulations a priori, we have proposed a procedure for detecting, from field data, user formulations that were not foreseen by the designers. These formulations can be added, as variants, to the denominations already included in the system to reduce its failures. In this work, we propose an incremental procedure that is able to filter the huge number of calls routed to the operators, collected every month, and to detect a limited number of phonetic strings that can be included as new formulation variants in the system vocabulary. The results of our experiments, tested on 9 months of calls that the system was unable to serve automatically, show that the incremental procedure, using only the additional data collected every month, is able to stay close to the (upper bound) performance of the non-incremental one, and offers the possibility of periodically updating the system's formulation variants for every city.

The Development of a Multi-Purpose Spoken Dialogue System
João P. Neto, Nuno J. Mamede, Renato Cassaca, Luís C. Oliveira; INESC-ID/IST, Portugal
In this paper we describe a multi-purpose Spoken Dialogue System platform associated with two distinct applications: a home intelligent environment and remote access to information databases. These applications differ substantially in content and possible uses, but they give us the chance to develop a platform in which we are able to represent diverse services to be made accessible through a spoken interface. The implemented voice input/output capabilities and the level of service independence open a wide range of possibilities for the development of new applications using the current components of our Spoken Dialogue System.
Incremental Learning of New User Formulations in Automatic Directory Assistance
M. Andorno 1, L. Fissore 2, P. Laface 1, M. Nigra 2, C. Popovici 2, F. Ravera 2, C. Vair 2; 1 Politecnico di Torino, Italy; 2 Loquendo, Italy
Directory Assistance for business listings is a challenging task: one of its main problems is that customers formulate their requests for the same listing with great variability. Since it is difficult to reliably predict a priori the user formulations, we have proposed a procedure for detecting, from field data, user formulations that were not foreseen by the designers. These formulations can be added, as variants, to the denominations already included in the system to reduce its failures. In this work, we propose an incremental procedure that is able to filter a huge amount of calls routed to the operators, collected every month, and to detect a limited number of phonetic strings that can be included as new formulation variants in the system vocabulary. The results of our experiments, tested on 9 months of calls that the system was unable to serve automatically, show that the incremental procedure, using only the additional data collected every month, is able to stay close to the (upper bound) performance of the non-incremental one, and offers the possibility of periodically updating the system formulation variants of every city.
Dialog Systems for Automotive Environments
Julie A. Baca, Feng Zheng, Hualin Gao, Joseph Picone; Mississippi State University, USA
The Center for Advanced Vehicular Systems (CAVS), located at Mississippi State University (MSU), is collaborating with regional automotive manufacturers such as Nissan to advance telematics research. This paper describes work resulting from a research initiative to investigate the use of dialog systems in automotive environments, which includes in-vehicle driver as well as automotive manufacturing environments. We present recent results of an effort to develop an in-vehicle dialog prototype, preliminary to building a dialog system to assist in workforce training in automotive manufacturing. The overall system design is presented with focus on development of the semantic information needed by the natural language and dialog management modules. We describe data collection and analysis through which the information was derived. Through this process we reduced the parsing error rate by over 20% and system understanding errors to 3%.
The Development of a Multi-Purpose Spoken Dialogue System
João P. Neto, Nuno J. Mamede, Renato Cassaca, Luís C. Oliveira; INESC-ID/IST, Portugal
In this paper we describe a multi-purpose Spoken Dialogue System platform associated with two distinct applications: a home intelligent environment and remote access to information databases. These applications differ substantially in content and possible uses, but give us the chance to develop a platform on which we were able to represent diverse services made accessible through a spoken interface. The implemented voice input/output capabilities and the degree of service independence open a wide range of possibilities for the development of new applications using the current components of our Spoken Dialogue System.
The Dynamic, Multi-lingual Lexicon in SmartKom
Silke Goronzy, Zica Valsan, Martin Emele, Juergen Schimanowski; Sony International (Europe) GmbH, Germany
This paper describes the dynamic, multi-lingual lexicon that was developed in the SmartKom project. SmartKom is a multimodal dialogue system that is supposed to assist the user in many applications which are characterised by their highly dynamic contents. Because of this dynamic nature, the various modules of the dialogue system, ranging from speech recognition through analysis to synthesis, need one common knowledge source that takes care of the dynamic vocabularies that need to be processed. This central knowledge source is the lexicon. It is able to dynamically add and remove new words and to generate the pronunciations for these words. We also describe the class-based language model (LM) that is used in SmartKom and that is closely coupled with the lexicon. Evaluation results for this LM are also given. Furthermore, we describe our approach to dynamically generating pronunciations and give experimental results for the different classifiers we trained for this task.
Evaluating Discourse Understanding in Spoken Dialogue Systems
Ryuichiro Higashinaka, Noboru Miyazaki, Mikio Nakano, Kiyoaki Aikawa; NTT Corporation, Japan
This paper describes a method for creating an evaluation measure for discourse understanding in spoken dialogue systems. Discourse understanding means utterance understanding that takes the context into account. Since the measure needs to be determined based on its correlation with the system's performance, conventional measures, such as the concept error rate, cannot be easily applied. Using multiple linear regression analysis, we have previously shown that the weighted sum of various metrics concerning dialogue states can be used for the evaluation of discourse understanding in a single domain. This paper reports the progress of our work: verification of our approach by additional experiments in another domain. The support vector regression method performs better than the multiple linear regression method in creating the measure, indicating non-linearity in mapping the metrics to the system's performance. The results give strong support for our approach and hint at its suitability as a universal evaluation measure for discourse understanding.
Assessment of Spoken Dialogue System Usability – What are We really Measuring?
Lars Bo Larsen; Aalborg University, Denmark
Speech based interfaces have not experienced the breakthrough many have predicted during the last decade. This paper attempts to clarify some of the reasons why by investigating the currently applied methods of usability evaluation. Usability attributes especially important for speech based interfaces are identified and discussed. It is shown that subjective measures (even for widespread evaluation schemes, such as PARADISE) are mostly done in an ad hoc manner and are rarely validated.
A comparison is made between some well-known scales, and through an example application of the CCIR usability questionnaire it is shown how validation of the subjective measures can be performed.
Evaluation of a Speech-Driven Telephone Information Service Using the PARADISE Framework: A Closer Look at Subjective Measures
Paula M.T. Smeele, Juliette A.J.S. Waals; TNO Human Factors, The Netherlands
For the evaluation of a speech-driven telephone flight information service we applied the PARADISE model developed by Walker and colleagues [1] in order to gain insight into the factors affecting the user satisfaction of this service. We conducted an experiment in which participants were asked to call the service and book a flight. During the telephone conversations quantitative measures (e.g. total elapsed time, the number of system errors) were logged. After completion of the telephone calls, the participants judged quality-related aspects such as dialogue presentation and accessibility of the system. These subjective measures together represent a value for user satisfaction. Using multivariate linear regression, it was possible to derive a performance function with user satisfaction as the dependent variable and a combination of objective measures as independent variables. The results of the regression analysis also indicated that an extended definition of user satisfaction including a subjective measure 'Grade' provides a better prediction than the analysis based on the narrow definition used by Walker et al. Further, we investigated the correlation between the subjective measures by conducting a principal components analysis. The results showed that these measures fell into two groups. Implications are discussed.
Quantifying the Impact of System Characteristics on Perceived Quality Dimensions of a Spoken Dialogue Service
Sebastian Möller, Janto Skowronek; Ruhr-University Bochum, Germany
Developers of telephone services that rely on spoken dialogue systems would like to identify system characteristics influencing the quality perceived by the user, and to quantify the respective impact before the system is put into service. A laboratory experiment is described in which speech input, speech output, and confirmation characteristics of a restaurant information system were manipulated in a controlled way. Users' quality perceptions were collected by means of a specifically designed questionnaire. It is based on a recently developed taxonomy of quality aspects, and aims at capturing a multitude of perceptually relevant quality dimensions. Experimental results indicate that ASR performance affects a number of interaction parameters, and is a relatively well identifiable quality impact for the user. In contrast, speech output affects perceived quality on a number of different levels, up to global user satisfaction judgments. Potential reasons for these findings are discussed.
A Programmable Policy Manager for Conversational Biometrics
Ganesh N. Ramaswamy, Ran D. Zilca, Oleg Alecksandrovich; IBM T.J. Watson Research Center, USA
Conversational Biometrics combines acoustic speaker verification with conversational knowledge verification to make a more accurate identity decision. To manage the added level of complexity that the multi-modal user recognition approach introduces, this paper proposes the use of verification policies, in the form of Finite State Machines, which can be used to program a policy manager. Once a verification policy is written, the policy manager interprets the policy on the fly, and at every turn in the session decides dynamically whether to accept the user, reject the user, or continue to interact and collect more data. The policy manager allows for any number of verification engines to be plugged in, thereby adding to the flexibility of the framework. As a result, system developers have significant freedom to design user verification solutions, and a wide variety of application-specific, transaction-specific and user-specific constraints can be addressed using a generic system. The paper also describes a prototype implementation of a Conversational Biometrics solution with the proposed programmable policy manager.
Integration of Speaker Recognition into Conversational Spoken Dialogue Systems
Timothy J. Hazen, Douglas A. Jones, Alex Park, Linda C. Kukolich, Douglas A. Reynolds; Massachusetts Institute of Technology, USA
In this paper we examine the integration of speaker identification/verification technology into two dialogue systems developed at MIT: the Mercury air travel reservation system and the Orion task delegation system. These systems both utilize information collected from registered users that is useful in personalizing the system to specific users and that must be securely protected from imposters. Two speaker recognition systems, the MIT Lincoln Laboratory text-independent GMM based system and the MIT Laboratory for Computer Science text-constrained speaker-adaptive ASR-based system, are evaluated and compared within the context of these conversational systems.
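Several abstracts here and in the speaker-identification session below score a test utterance against a speaker model and normalize by subtracting the score of a background model. The sketch below shows that GMM-UBM log-likelihood-ratio computation in its simplest form; the mixture sizes, random toy data, and diagonal-covariance assumption are illustrative and do not reproduce any of the systems described in these abstracts.

# Minimal numpy sketch of GMM-UBM scoring for text-independent speaker verification:
# the claimed speaker's GMM score is normalised by subtracting the UBM score.
import numpy as np

def gmm_loglik(frames, weights, means, variances):
    """Average per-frame log-likelihood of a diagonal-covariance GMM.
    frames: (T, D); weights: (M,); means, variances: (M, D)."""
    T, D = frames.shape
    diff = frames[:, None, :] - means[None, :, :]                    # (T, M, D)
    log_norm = -0.5 * (D * np.log(2 * np.pi) + np.sum(np.log(variances), axis=1))
    log_exp = -0.5 * np.sum(diff ** 2 / variances[None, :, :], axis=2)
    log_mix = np.log(weights)[None, :] + log_norm[None, :] + log_exp  # (T, M)
    return np.mean(np.logaddexp.reduce(log_mix, axis=1))

def verification_score(frames, speaker_gmm, ubm):
    """Log-likelihood ratio: positive values favour the claimed speaker."""
    return gmm_loglik(frames, *speaker_gmm) - gmm_loglik(frames, *ubm)

# Toy usage with random 13-dimensional "cepstral" frames and 4-mixture models.
rng = np.random.default_rng(0)
frames = rng.normal(size=(200, 13))
ubm = (np.full(4, 0.25), rng.normal(size=(4, 13)), np.ones((4, 13)))
spk = (np.full(4, 0.25), rng.normal(size=(4, 13)), np.ones((4, 13)))
print(verification_score(frames, spk, ubm))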
Session: OWeCa – Oral Speech Recognition - Large Vocabulary II
Time: Wednesday 13.30, Venue: Room 1
Chair: John Bridle, Novauris Laboratories UK Ltd
Discriminative Optimization of Large Vocabulary Mandarin Conversational Speech Recognition System
Peng Ding, Zhenbiao Chen, Sheng Hu, Shuwu Zhang, Bo Xu; Chinese Academy of Sciences, China
This paper examines techniques of discriminative optimization for the acoustic model, including both HMM parameters and linear transforms, in the context of the HUB5 Mandarin large vocabulary speech recognition task, with the aim of partly solving the problems brought about by the sparseness and the highly ambiguous nature of telephone conversational speech data. Three techniques are studied: MMI training of the HMM acoustic parameters, MMI training of the Semi-Tied Covariance Model, and MMI Speaker Adaptive Training. Descriptions of our recognition system and the algorithms used in our experiments are detailed, followed by the corresponding results.
Speech Recognition with Dynamic Grammars Using Finite-State Transducers
Johan Schalkwyk 1, Lee Hetherington 2, Ezra Story 1; 1 SpeechWorks International, USA; 2 Massachusetts Institute of Technology, USA
Spoken language systems, ranging from interactive voice response (IVR) to mixed-initiative conversational systems, make use of a wide range of recognition grammars and vocabularies. The recognition grammars are either static (created at design time) or dynamic (dependent on database lookup at run time).
This paper examines the compilation of recognition grammars with an emphasis on the dynamic (changing) properties of the grammar and how these relate to context-dependent speech recognizers. By casting the problem in the algebra of finite-state transducers (FSTs) we can use the composition operator for fast-and-efficient compilation and splicing of dynamic recognition grammars within the context of a larger precompiled static grammar. FLaVoR: A Flexible Architecture for LVCSR Kris Demuynck, Tom Laureys, Dirk Van Compernolle, Hugo Van hamme; Katholieke Universiteit Leuven, Belgium This paper describes a new architecture for large vocabulary continuous speech recognition (LVCSR), which will be developed within the project FLaVoR (Flexible Large Vocabulary Recognition). The pro- 69 Eurospeech 2003 Wednesday posed architecture abandons the standard all-in-one search strategy with integrated acoustic, lexical and language model information. Instead, a modular framework is proposed which allows for the integration of more complex linguistic components. The search process consists of two layers. First, a pure acoustic-phonemic search generates a dense phoneme network enriched with meta-data. Then, the output of the first layer is used by sophisticated language technology components for word decoding in the second layer. Preliminary experiments prove the feasibility of the approach. An Architecture for Rapid Decoding of Large Vocabulary Conversational Speech September 1-4, 2003 – Geneva, Switzerland Session: SWeCb– Oral Robust Methods in Processing of Natural Language Dialogues Time: Wednesday 13.30, Venue: Room 2 Chair: Vincenzo Pallotta, EPFL, Switzerland Spoken Language Condensation in the 21st Century Klaus Zechner; Educational Testing Service, USA George Saon, Geoffrey Zweig, Brian Kingsbury, Lidia Mangu, Upendra Chaudhari; IBM T.J. Watson Research Center, USA This paper addresses the question of how to design a large vocabulary recognition system so that it can simultaneously handle a sophisticated language model, perform state-of-the-art speaker adaptation, and run in one times real time1 (1×RT). The architecture we propose is based on classical HMM Viterbi decoding, but uses an extremely fast initial speaker-independent decoding to estimate VTL warp factors, feature-space and model-space MLLR transformations that are used in a final speaker-adapted decoding. We present results on past Switchboard evaluation data that indicate that this strategy compares favorably to published unlimited-time systems (running in several hundred times real-time). Coincidentally, this is the system that IBM fielded in the 2003 EARS Rich Transcription evaluation. MMI-MAP and MPE-MAP for Acoustic Model Adaptation D. Povey, M.J.F. Gales, D.Y. Kim, P.C. Woodland; Cambridge University, U.K. This paper investigates the use of discriminative schemes based on the maximum mutual information (MMI) and minimum phone error (MPE) objective functions for both task and gender adaptation. A method for incorporating prior information into the discriminative training framework is described. If an appropriate form of prior distribution is used, then this may be implemented by simply altering the values of the counts used for parameter estimation. The prior distribution can be based around maximum likelihood parameter estimates, giving a technique known as I-smoothing, or for adaptation it can be based around a MAP estimate of the ML parameters, leading to MMI-MAP, or MPE-MAP. 
MMI-MAP is shown to be effective for task adaptation, where data from one task (Voicemail) is used to adapt a HMM set trained on another task (Switchboard). MPE-MAP is shown to be effective for generating gender-dependent models for Broadcast News transcription. Lattice Segmentation and Minimum Bayes Risk Discriminative Training Vlasios Doumpiotis, Stavros Tsakalidis, William J. Byrne; Johns Hopkins University, USA Modeling approaches are presented that incorporate discriminative training procedures in segmental Minimum Bayes-Risk decoding (SMBR). SMBR is used to segment lattices produced by a general automatic speech recognition (ASR) system into sequences of separate decision problems involving small sets of confusable words. We discuss two approaches to incorporating these segmented lattices in discriminative training. We investigate the use of acoustic models specialized to discriminate between the competing words in these classes which are then applied in subsequent SMBR rescoring passes. Refinement of the search space that allows the use of specialized discriminative models is shown to be an improvement over rescoring with conventionally trained discriminative models. While the field of Information Retrieval originally had the search for the most relevant documents in mind, it has become increasingly clear that in many instances, what the user wants is a piece of coherent information, derived from a set of relevant documents and possibly other sources. Reducing relevant documents, passages, and sentences to their core is the task of text summarization or information condensation. Applying text-based technologies to speech is not always workable and often not enough to capture speech specific phenomena. In this paper, we will contrast speech summarization with text summarization, give an overview of the history of speech summarization, its current state, and, finally, sketch possible avenues as well as remaining challenges in future research. Robust Methods in Automatic Speech Recognition and Understanding Sadaoki Furui; Tokyo Institute of Technology, Japan This paper overviews robust architecture and modeling techniques for automatic speech recognition and understanding. The topics include robust acoustic and language modeling for spontaneous speech recognition, unsupervised adaptation of acoustic and language models, robust architecture for spoken dialogue systems, multi-modal speech recognition, and speech understanding. This paper also discusses the most important research problems to be solved in order to achieve ultimate robust speech recognition and understanding systems. Parsing Spontaneous Speech Rodolfo Delmonte; Università Ca’ Foscari, Italy In this paper we will present work carried out lately on the 50,000 words Italian Spontaneous Speech Corpus called AVIP, under national project API, made available for free download from the website of the coordinator, the University of Naples. We will concentrate on the tuning of the parser for Italian which had been previously used to parse 100,000 words corpus of written Italian within the National Treebank initiative coordinated by ILC in Pisa. The parser receives as input the adequately transformed orthographic transcription of the dialogues making up the corpus, in which pauses, hesitations and other disfluencies have been turned into most likely corresponding punctuation marks, interjections or truncation of the word underlying the uttered segment. 
The most interesting phenomenon we will discuss is without any doubts “overlapping”, i.e. a speech event in which two people speak at the same time by uttering actual words or in some cases nonwords, when one of the speakers, usually the one which is not the current turntaker, interrupts the current speaker. This phenomenon takes place at a certain point in time where it has to be anchored to the speech signal but in order to be fully parsed and subsequently semantically interpreted, it needs to be referred semantically to a following turn. 70 Eurospeech 2003 Wednesday September 1-4, 2003 – Geneva, Switzerland Model Compression for GMM Based Speaker Recognition Systems trained on a large pool of speakers. Speaker models are then used to score the test data; they are normalized by subtracting the scores obtained with the background model. We find that this approach yields significant performance improvement when combined with a state-of-the-art speaker recognition system based on standard cepstral features. Furthermore, the improvement persists even after combination with lexical features. Finally, the improvement continues to increase with longer test sample durations, beyond the test duration at which standard system accuracy level off. Douglas A. Reynolds; Massachusetts Institute of Technology, USA Improved Speaker Verification Through Probabilistic Subspace Adaptation Session: OWeCc– Oral Speaker Identification Time: Wednesday 13.30, Venue: Room 3 Chair: Samy Bengio, IDIAP, Switzerland For large-scale deployments of speaker verification systems models size can be an important issue for not only minimizing storage requirements but also reducing transfer time of models over networks. Model size is also critical for deployments to small, portable devices. In this paper we present a new model compression technique for Gaussian Mixture Model (GMM) based speaker recognition systems. For GMM systems using adaptation from a background model, the compression technique exploits the fact that speaker models are adapted from a single speaker-independent model and not all parameters need to be stored. We present results on the 2002 NIST speaker recognition evaluation cellular telephone corpus and show that the compression technique provides a good tradeoff of compression ratio to performance loss. We are able to achieve a 56:1 compression (624KB → 11KB) with only a 3.2% relative increase in EER (9.1% → 9.4%). Simon Lucey, Tsuhan Chen; Carnegie Mellon University, USA In this paper we propose a new adaptation technique for improved text-independent speaker verification with limited amounts of training data using Gaussian mixture models (GMMs). The technique, referred to as probabilistic subspace adaptation (PSA), employs a probabilistic subspace description of how a client’s parametric representation (i.e. GMM) is allowed to vary. Our technique is compared to traditional maximum a posteriori (MAP) adaptation, or relevance adaptation (RA), and maximum likelihood eigendecomposition (MLED), or subspace adaptation (SA) techniques. Results are given on a subset of the XM2VTS databases for the task of text-independent speaker verification. The Awe and Mystery of T-Norm An Improved Model-Based Speaker Segmentation System Jiří Navrátil, Ganesh N. Ramaswamy; IBM T.J. Watson Research Center, USA Peng Yu, Frank Seide, Chengyuan Ma, Eric Chang; Microsoft Research Asia, China A popular score normalization technique termed T-norm is the central focus of this paper. 
Based on widely confirmed experimental observation regarding T-norm tilting the DET curves of speaker detection systems, we set out to identify the components taking role in this phenomenon. We claim that under certain local assumptions the T-norm performs a gaussianization of the individual true and impostor score populations and further derive conditions for clockwise and counter-clockwise DET rotations caused by this transform. In this paper, we report our recent work on speaker segmentation. Without a priori information about speaker number and speaker identities, the audio stream is segmented, and segments of the same speaker are grouped together. Speakers are represented by Gaussian Mixture Models (GMMs), then an HMM network is used for segmentation. However, unlike other model-based segmentation methods, the speaker GMMs are initialized using a simpler distance based segmentation algorithm. To group segments of identical speakers, a two-level clustering mechanism is introduced, which we found to achieve higher accuracy than direct distance based clustering methods. Our method significantly outperforms the best result reported at the 2002 Speaker Recognition Workshop. When tested on a professionally produced TV program set, our system reports only 3.5% frame errors. Gaussian Dynamic Warping (GDW) Method Applied to Text-Dependent Speaker Detection and Verification Jean-François Bonastre 1 , Philippe Morin 2 , Jean-Claude Junqua 2 ; 1 LIA-CNRS, France; 2 Panasonic Speech Technology Laboratory, USA This paper introduces a new acoustic modeling method called Gaussian Dynamic Warping (GDW). It is targeting real world applications such as voice-based entrance door security systems, the example presented in this paper. The proposed approach uses a hierarchical statistical framework with three levels of specialization for the acoustic modeling. The highest level of specialization is in addition responsible for the modeling of the temporal constraints via a specific Temporal Structure Information (TSI) component. The preliminary results show the ability of the GDW method to elegantly take into account the acoustic variability of speech while capturing important temporal constraints. Modeling Duration Patterns for Speaker Recognition Luciana Ferrer, Harry Bratt, Venkata R.R. Gadde, Sachin S. Kajarekar, Elizabeth Shriberg, Kemal Sönmez, Andreas Stolcke, Anand Venkataraman; SRI International, USA We present a method for speaker recognition that uses the duration patterns of speech units to aid speaker classification. The approach represents each word and/or phone by a feature vector comprised of either the durations of the individual phones making up the word, or the HMM states making up the phone. We model the vectors using mixtures of Gaussians. The speaker specific models are obtained through adaptation of a “background” model that is Session: OWeCd– Oral Speech Synthesis: Miscellaneous I Time: Wednesday 13.30, Venue: Room 4 Chair: Wolfgang Hess, IKP Universit"at Bonn, Germany A Latent Analogy Framework for Grapheme-to-Phoneme Conversion Jerome R. Bellegarda; Apple Computer Inc., USA Data-driven grapheme-to-phoneme conversion involves either (topdown) inductive learning or (bottom-up) pronunciation by analogy. As both approaches rely on local context information, they typically require some external linguistic knowledge, e.g., individual grapheme/phoneme correspondences. 
To avoid such supervision, this paper proposes an alternative solution, dubbed pronunciation by latent analogy, which adopts a more global definition of analogous events. For each out-of-vocabulary word, a neighborhood of globally relevant pronunciations is constructed through an appropriate data-driven mapping of its graphemic form. Phoneme transcription then proceeds via locally optimal sequence alignment and maximum likelihood position scoring. This method was successfully applied to the synthesis of proper names with a large diversity of origin. Conditional and Joint Models for Grapheme-to-Phoneme Conversion Stanley F. Chen; IBM T.J. Watson Research Center, USA 71 Eurospeech 2003 Wednesday In this work, we introduce several models for grapheme-tophoneme conversion: a conditional maximum entropy model, a joint maximum entropy n-gram model, and a joint maximum entropy n-gram model with syllabification. We examine the relative merits of conditional and joint models for this task, and find that joint models have many advantages. We show that the performance of our best model, the joint n-gram model, compares favorably with the best results for English grapheme-to-phoneme conversion reported in the literature, sometimes by a wide margin. In the latter part of this paper, we consider the task of merging pronunciation lexicons expressed in different phone sets. We show that models for grapheme-to-phoneme conversion can be adapted effectively to this task. Mixed-Lingual Text Analysis for Polyglot TTS Synthesis Beat Pfister, Harald Romsdorfer; ETH Zürich, Switzerland Text-to-speech (TTS) synthesis is more and more confronted with the language mixing phenomenon. An important step towards the solution of this problem and thus towards a so-called polyglot TTS system is an analysis component for mixed-lingual texts. In this paper it is shown how such an analyzer can be realized for a set of languages, starting from a corresponding set of monolingual analyzers which are based on DCGs and chart parsing. Identifying Speakers in Children’s Stories for Speech Synthesis 1 1 September 1-4, 2003 – Geneva, Switzerland Arabic in My Hand: Small-Footprint Synthesis of Egyptian Arabic Laura Mayfield Tomokiyo 1 , Alan W. Black 2 , Kevin A. Lenzo 1 ; 1 Cepstral LLC, USA; 2 Carnegie Mellon University, USA The research described in this paper addresses the dual concerns of synthesis of Arabic, a language that has shot to prominence in the past few years, and synthesis on a handheld device, realization of which presents difficult software engineering problems. Our system was developed in conjunction with the DARPA BABYLON project, and has been integrated with English synthesis, English and Arabic ASR, and machine translation on a single off-the-shelf PDA. We present a concatenative, general-domain Arabic synthesizer that runs 7 times faster than real time with a 9MB footprint. The voice itself was developed over only a few months, without access to costly prepared databases. It has been evaluated using standard test protocols with results comparable to those achieved by English voices of the same size with the same level of development. Session: PWeCe– Poster Speech Perception Time: Wednesday 13.30, Venue: Main Hall, Level -1 Chair: Anders Eriksson, Umea University, Sweden Schema-Based Modeling of Phonemic Restoration 2 Soundararajan Srinivasan, DeLiang Wang; Ohio State University, USA Jason Y. Zhang , Alan W. 
Black , Richard Sproat ; 1 Carnegie Mellon University, USA; 2 AT&T Labs Research, USA Choosing appropriate voices for synthesizing children’s stories requires text analysis techniques that can identify which portions of the text should be read by which speakers. Our work presents techniques to take raw text stories and automatically identify the quoted speech, identify the characters within the stories and assign characters to each quote. The resulting marked-up story may then be rendered with a standard speech synthesizer with appropriate voices for the characters. This paper presents each of the basic stages in identification, and the algorithms, both rule-driven and data-driven, used to achieve this. A variety of story texts are used to test our system. Results are presented with a discussion of the limitations and recommendations on how to improve speaker assignment in further texts. Experimental Tools to Evaluate Intelligibility of Text-to-Speech (TTS) Synthesis: Effects of Voice Gender and Signal Quality Phonemic restoration refers to the synthesis of masked phonemes in speech when sufficient lexical context is present. Current models for phonemic restoration however, make no use of lexical knowledge. Such models are inherently inadequate for restoring unvoiced phonemes and may be limited in their ability to restore voiced phonemes too. We present a predominantly top-down model for phonemic restoration. The model uses a missing data speech recognition system to recognize speech utterances as words and activates word templates corresponding to the words containing the masked phonemes. An activated template is dynamically time warped to the noisy word and is then used to restore the speech frames corresponding to the masked phoneme, thereby synthesizing it. The model is able to restore both voiced and unvoiced phonemes. Systematic testing shows that this model performs significantly better than a Kalman-filter based model. Perception of Voice-Individuality for Distortions of Resonance/Source Characteristics and Waveforms Hisao Kuwabara; Teikyo University of Science & Technology, Japan Catherine Stevens 1 , Nicole Lees 1 , Julie Vonwiller 2 ; 1 University of Western Sydney, Australia; 2 APPEN Speech Technology, Australia Two experiments are reported that constitute new methods for evaluation of text-to-speech (TTS) synthesis from the user’s perspective. Experiment 1, using sentence stimuli, and Experiment 2, using discrete word stimuli, investigate the effect of voice gender and signal quality on the intelligibility of three TTS synthesis systems from the user’s point of view. Accuracy scores and reaction time were recorded as on-line, implicit indices of intelligibility during phoneme detection tasks. It was hypothesized that male voice TTS would be more intelligible than female voice TTS, and that low quality signals would reduce intelligibility. Results indicate an interaction between voice gender and signal quality which is dependent on the TTS system. We suggest that intelligibility from the user’s perspective is modulated by several factors and there is a need to tailor systems to particular commercial applications. Methods to achieve commercially relevant evaluation of TTS synthesis are discussed. A perceptual study has been performed to investigate relationship between acoustic parameters and the voice-individuality making use of a pitch synchronous analysis-synthesis system. 
Voiceindividuality is involved in many acoustic parameters and the aim of this experiment is to examine how individual parameters affect the voice-individuality by separately giving them some distortions. Formant-frequency shift and bandwidth manipulations are given for spectral distortion, F0 -shift for source manipulation. As the waveform distortion, zero-crossing and center-clipping techniques are used. It has been found that formant-shift is very sensitive to voice-individuality change and F0 -shift and bandwidth manipulations are rather tolerant to the voice-individuality. The results of waveform manipulation reveal that the voice-individuality is kept more than the phonetic information for zero-crossing distortion and the results for center-clipping distortion are reverse. The Perceptual Cues of a High Level Pitch-Accent Pattern in Japanese: Pitch-accent Patterns and Duration Tsutomu Sato; Meiji Gakuin University, Japan It has been pointed out that a head-high pitch accent pattern (hereafter HLL) of a loanword in Japanese tends to be flattened and pronounced as a high level pitch-accent pattern (phonetically repre- 72 Eurospeech 2003 Wednesday sented here as HHH) by younger generation. This paper attempts to clarify how Japanese who are in their twenties distinguish a high level pitch-accent pattern from a level pitchaccent pattern (hereafter LHH) as a form of perception experiment, using speech synthesis techniques. In the first part of this paper, it will be shown how often HHH patterns appear among 10 Japanese college students’ production experiment and the F0 configurations will be investigated. Then, the result of perception experiment indicates that HHH patterns are taken for LHH patterns in faster speech. Finally, it will be suggested that durational differences caused by pitch-accent patterns can be perceptual cues in telling the pitch-accent patterns apart. Illusory Continuity of Intermittent Pure Tone in Binaural Listening and Its Dependency on Interaural Time Difference Mamoru Iwaki 1 , Norio Nakamura 2 ; 1 Niigata University, Japan; 2 AIST, Japan Illusory continuity is known as a psychoacoustical phenomenon in hearing; i.e. an intermittent pure tone may be perceived as if it was continuous, when it is padded with enough large white noise. There are many researches related to this issue in monaural listening. It is said that such illusory continuity is observed when there is no evidence of discontinuity and level of the white noise is large enough to mask the pure tone. In this paper, we investigated illusory continuity in binaural listening, and measured its threshold levels according to some interaural time differences (ITDs). The ITDs simulate a sense of direction about sound sources and give new information for evidence of discontinuity, which should be expected to promote the illusory continuity. As a result, the threshold level of illusory continuity in binaural listening depended on ITDs between tone target and noise masker. The increase of threshold level was minimum when the target and masker had the same ITD. 
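As a rough illustration of the stimulus manipulation described in the abstract above, the sketch below builds an intermittent tone with a noise masker whose interaural time difference is imposed as a sample delay on one channel. All parameter values (sample rate, tone frequency, burst timing, 500 µs ITD) are assumptions for illustration, not the authors' experimental settings.

# Minimal sketch of a binaural stimulus for an illusory-continuity test:
# an intermittent pure tone plus a noise masker whose ITD is a sample delay.
import numpy as np

FS = 44100  # sample rate in Hz (assumed)

def intermittent_tone(freq=1000.0, on=0.2, off=0.2, cycles=5):
    t_on = np.arange(int(on * FS)) / FS
    burst = np.sin(2 * np.pi * freq * t_on)
    gap = np.zeros(int(off * FS))
    return np.concatenate([np.concatenate([burst, gap]) for _ in range(cycles)])

def binaural_masker(n_samples, itd_us=500.0, level=0.5, seed=0):
    """White noise to both ears, with the right-ear copy lagging by itd_us microseconds."""
    rng = np.random.default_rng(seed)
    delay = int(round(itd_us * 1e-6 * FS))
    noise = rng.normal(scale=level, size=n_samples + delay)
    left = noise[delay:]        # leading ear
    right = noise[:n_samples]   # lagging ear (same noise, delayed)
    return left, right

tone = intermittent_tone()
left_noise, right_noise = binaural_masker(len(tone), itd_us=500.0)
stimulus = np.stack([tone + left_noise, tone + right_noise], axis=1)  # diotic tone, lateralised masker
print(stimulus.shape)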
CART-Based Factor Analysis of Intelligibility Reduction in Japanese English Nobuaki Minematsu 1 , Changchen Guo 2 , Keikichi Hirose 1 ; 1 University of Tokyo, Japan; 2 KTH, Sweden This study aims at automatically estimating probability of individual words of Japanese English (JE) being perceived correctly by American listeners and clarifying what kinds of (combinations of) segmental, prosodic, and linguistic errors in the words are more fatal to their correct perception. From a JE speech database, a balanced set of 360 utterances by 90 male speakers are firstly selected. Then, a listening experiment is done where 6 Americans are asked to transcribe all the utterances. Next, using speech and language technology, values of many segmental, prosodic, and linguistic attributes of the words are extracted. Finally, relation between transcription rate of each word and its attribute values is analyzed with Classification And Regression Tree (CART) method to predict probability of each of the JE words being transcribed correctly. The machine prediction is compared with the human prediction of seven teachers and this method is shown to be comparable to the best American teacher. This paper also describes differences in perceiving intelligibility of the pronunciation between American and Japanese teachers. Harmonic Alternatives to Sine-Wave Speech László Tóth, András Kocsor; Hungarian Academy of Sciences, Hungary Sine-wave speech (SWS) is a three-tone replica of speech, conventionally created by matching each constituent sinusoid in amplitude and frequency with the corresponding vocal tract resonance (formant). We propose an alternative technique where we take a high-quality multicomponent sinusoidal representation and decimate this model so that there are only three components per frame. In contrast to SWS, the resulting signal contains only components that were present in the original signal. Consequently it preserves the harmonic fine structure of voiced speech. Perceptual studies indicate that this signal is judged more natural and intelligible than SWS. Furthermore, its tonal artifacts can mostly be eliminated by the introduction of only a few additional components, which leads to an intriguing speculation about grouping issues. September 1-4, 2003 – Geneva, Switzerland Non-Intrusive Assessment of Perceptual Speech Quality Using a Self-Organising Map Dorel Picovici, Abdulhussain E. Mahdi; University of Limerick, Ireland A new output-based method for non-intrusive assessment of speech quality for voice communication system is proposed and its performance evaluated. The method is based on comparing the output speech to an appropriate reference representing the closest match from a pre-formulated codebook containing optimally clustered speech parameter vectors extracted from a large number of various undistorted clean speech records. The objective auditory distances between vectors of the distorted speech and their corresponding matching references are then measured and appropriately converted into an equivalent subjective score. The optimal clustering of the reference codebook is achieved by a dynamic k-means method. A self-organising map algorithm is used to match the distorted speech vectors to the references. Speech parameters derived from Bark spectrum analysis, Perceptual Linear Prediction (PLP), and Mel-Frequency Cepstral coefficients (MFCC) are used to provide speaker independent parametric representation of the speech signals as required by an output-based quality measure. 
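A minimal sketch of the output-based measurement idea just described: cluster clean-speech feature vectors into a codebook, match each frame of the degraded signal to its nearest codeword, and map the average distance to a quality score. Plain k-means stands in for the dynamic k-means/SOM of the paper, and the codebook size and score-mapping constants are assumptions.

# Minimal sketch (illustrative, not the authors' system) of output-based,
# reference-free quality estimation against a clean-speech codebook.
import numpy as np

def build_codebook(clean_features, n_codes=64, iters=20, seed=0):
    """Plain k-means clustering of clean-speech feature vectors (stand-in for a SOM)."""
    rng = np.random.default_rng(seed)
    codes = clean_features[rng.choice(len(clean_features), n_codes, replace=False)]
    for _ in range(iters):
        d = np.linalg.norm(clean_features[:, None, :] - codes[None, :, :], axis=2)
        assign = d.argmin(axis=1)
        for k in range(n_codes):
            members = clean_features[assign == k]
            if len(members):
                codes[k] = members.mean(axis=0)
    return codes

def quality_estimate(degraded_features, codes, a=4.5, b=2.0):
    """Average nearest-codeword distance, mapped linearly to a 1..4.5 MOS-like scale."""
    d = np.linalg.norm(degraded_features[:, None, :] - codes[None, :, :], axis=2)
    mean_dist = d.min(axis=1).mean()
    return float(np.clip(a - b * mean_dist, 1.0, 4.5))

# Toy usage with synthetic 12-dimensional "PLP/MFCC" vectors.
rng = np.random.default_rng(1)
clean = rng.normal(size=(2000, 12))
degraded = clean[:300] + rng.normal(scale=0.4, size=(300, 12))
codebook = build_codebook(clean)
print(quality_estimate(degraded, codebook))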
Inhibitory Priming Effect in Auditory Word Recognition: The Role of the Phonological Mismatch Length Between Primes and Targets Sophie Dufour, Ronald Peereman; LEAD-CNRS, France Three experiments examined lexical competition effects using the phonological priming paradigm in a shadowing task. Experiment 1 replicated Hamburger and Slowiaczek’ s [1] finding of an initial overlap inhibition when primes and targets share three phonemes (/böiz/-/böik/) but not when they share two phonemes (/böEz//böik/). This observation suggests that lexical competition depends on the number of shared phonemes between primes and targets. However, Experiment 2 showed that an overlap of two phonemes was sufficient to cause inhibition when the primes mismatched the targets only on the last phoneme (/bol/-/bot/). Conversely, using a three phonemes overlap, no inhibition was observed in Experiment 3 when the primes mismatched the targets on the last twophonemes (/bagEt/-/bagaj/). The data indicate that what essentially determines prime-target competition effects in word-form priming is the number of mismatching phonemes. Recognising ‘Real-Life’ Speech with SpeM: A Speech-Based Computational Model of Human Speech Recognition Odette Scharenborg, Louis ten Bosch, Lou Boves; University of Nijmegen, The Netherlands In this paper, we present a novel computational model of human speech recognition – called SpeM – based on the theory underlying Shortlist. We will show that SpeM, in combination with an automatic phone recogniser (APR), is able to simulate the human speech recognition process from the acoustic signal to the ultimate recognition of words. This joint model takes an acoustic speech file as input and calculates the activation flows of candidate words on the basis of the degree of fit of the candidate words with the input. Experiments showed that SpeM outperforms Shortlist on the recognition of ‘real-life’ input. Furthermore, SpeM performs only slightly worse than an off-the-shelf full-blown automatic speech recogniser in which all words are equally probable, while it provides a transparent computationally elegant paradigm for modelling word activations in human word recognition. The Effect of Speech Rate and Noise on Bilinguals’ Speech Perception: The Case of Native Speakers of Arabic in Israel Judith Rosenhouse 1 , Liat Kishon-Rabin 2 ; 1 Technion Israel Institute of Technology, Israel; 2 Tel-Aviv University, Israel Listening conditions affect bilinguals’ speech perception, but relatively little is known about the effect of the combination of several degrading listening conditions. We studied the combined effect of 73 Eurospeech 2003 Wednesday speech rate and background noise on bilinguals’ speech perception in their L1 and L2. Speech perception of twenty Israeli university students, native speakers of Arabic (L1), with Hebrew as L2, was tested. The tests consisted of CHABA sentences adapted to Hebrew and Arabic. In each language, speech perception was evaluated under four conditions: quiet + regular speaking rate, quiet + fast speaking rate, noise + regular speaking rate, and noise + fast speaking rate. Results show that under optimal conditions bilingual speakers of Arabic and Hebrew have similar achievements in Arabic (L1) and Hebrew (L2). Under difficult conditions, performance was poorer in L2 than in L1. The lowest scores were in the combined condition. This reflects bilinguals’ disadvantages when listening to L2. 
Subjective Evaluations for Perception of Speaker Identity Through Acoustic Feature Transplantations Oytun Turk 1 , Levent M. Arslan 2 ; 1 Sestek Inc., Turkey; 2 Bogazici University, Turkey Perception of speaker identity is an important characteristic of the human auditory system. This paper1 describes a subjective test for the investigation of the relevance of four acoustic features in this process: vocal tract, pitch, duration, and energy. PSOLA based methods provide the framework for the transplantations of these acoustic features between two speakers. The test database consists of different combinations of transplantation outputs obtained from a database of 8 speakers. Subjective decisions on speaker similarity indicate that the vocal tract is the most relevant feature for single feature transplantations. Pitch and duration possess similar significance whereas the energy is the least important acoustic feature. Vocal tract + pitch + duration transplantation results in the highest similarity to the target speaker. Vocal tract + pitch, vocal tract + duration + energy and vocal tract + duration transplantations also yield convincing results in transformation of the perceived speaker identity. Konuşmacı kimliği algılanması insan işitme sisteminin önemli özelliklerinden biridir. Bu çalışma, dört akustik özniteliğin konuşmacı kimliği algılanmasındaki önemlerini öznel bir deneyle incelemektedir: gırtlak yapısı, ses perdesi, süre ve enerji. Geliştirilen PSOLA tabanlı yöntemler bu özniteliklerin konuşmacılar arasında nakledilmesine olanak sağlamaktadır. Deneyde sekiz kişilik bir veri tabanındaki konuşmacı çiftlerinden elde edilen nakil çıktıları kullanılmıştır. Öznel deney sonuçları, konuşmacı kimliği algılanmasında tek başına en önemli özniteliğin gırtlak yapısı olduğunu göstermektedir. Gırtlak yapısı + ses perdesi + süre nakilleri, hedef konuşmacıya en benzer çıktının elde edilmesini sağlamıştır. Gırtlak yapısı + ses perdesi, gırtlak yapısı + süre + enerji nakilleri de konuşmacı kimliğinin dönüştürülmesi açısından başarılı sonuçlar vermiştir. Modelling Human Speech Recognition Using Automatic Speech Recognition Paradigms in SpeM Odette Scharenborg 1 , James M. McQueen 2 , Louis ten Bosch 1 , Dennis Norris 3 ; 1 University of Nijmegen, The Netherlands; 2 Max Planck Institute for Psycholinguistics, The Netherlands; 3 Medical Research Council Cognition and Brain Sciences Unit, U.K. September 1-4, 2003 – Geneva, Switzerland The Effect of Amplitude Compression on Wide Band Telephone Speech for Hearing-Impaired Elderly People Mutsumi Saito 1 , Kimio Shiraishi 2 , Kimitoshi Fukudome 2 ; 1 Fujitsu Kyushu Digital Technology Ltd., Japan; 2 Kyushu Institute of Design, Japan Recently, high-speed multimedia communication systems have become widespread. Not only conventional narrow band speech signal (up to 3.4 kHz) but also wide band speech signal (up to 7 kHz) can be transmitted through high-speed communication lines. Generally, the quality of wide band speech signal is high and its articulation score is good for normal-hearing people, but for elderly people who have hearing losses in higher frequencies, the effect of wide band speech is doubtful. Therefore, we investigated the effect of wide band phone speech on the elderly people’s speech perception in terms of articulation. And we also considered the effect of amplitude compression method that is used for hearing aids. Japanese 62 CV syllables were used as test speech samples. 
The original speech samples were re-sampled to narrow band speech (8 kHz sampling) and wide band speech (16 kHz sampling). All speech samples were processed with AMRCODEC (Adaptive Multi Rate COder-DECoder), which is a voice coding system available to both narrow band and wide band speech signals. Then, coded speech signals were processed with a multi-band amplitude compression method. Ratios of the compression in each frequency bands were determined according to the average value of subjects’ hearing levels. All subjects were native Japanese speakers, aged 68 to 72 years, and have hearing losses (more than 40 dB HL). From the results of the test, we found that combination of wide band speech and amplitude compression showed significant improvement of the articulation. Word Activation Model by Japanese School Children without Knowledge of Roman Alphabet Takashi Otake, Miki Komatsu; Dokkyo University, Japan Recent models in word recognition have assumed that a word activation device which is based upon phonemes is universal. The present study has attempted to investigate this proposal with Japanese school children without knowledge of Roman alphabet. The main question addressed in this study is to test whether Japanese school children without Roman alphabet could activate word candidates on the basis of phonemes. An experiment was conducted with 21 Japanese elementary school children who were preliterate in Roman letters, employing a word reconstruction task. The results show that regardless of absence of alphabetic knowledge they could reconstruct Japanese words just like Japanese adults. This suggests that the current word activation model may equally be applicable to Japanese children as well as adults who are mora-based language users. Multi-Resolution Auditory Scene Analysis: Robust Speech Recognition Using Pattern-Matching from a Noisy Signal Sue Harding 1 , Georg Meyer 2 ; 1 University of Sheffield, U.K.; 2 University of Liverpool, U.K. We have recently developed a new model of human speech recognition, based on automatic speech recognition techniques [1]. The present paper has two goals. First, we show that the new model performs well in the recognition of lexically ambiguous input. These demonstrations suggest that the model is able to operate in the same optimal way as human listeners. Second, we discuss how to relate the behaviour of a recogniser, designed to discover the optimum path through a word lattice, to data from human listening experiments. We argue that this requires a metric that combines both path-based and word-based measures of recognition performance. The combined metric varies continuously as the input speech signal unfolds over time. Unlike automatic speech recognition systems, humans can understand speech when other competing sounds are present Although the theory of auditory scene analysis (ASA) may help to explain this ability, some perceptual experiments show fusion of the speech signal under circumstances in which ASA principles might be expected to cause segregation. We propose a model of multi-resolution ASA that uses both high- and low- resolution representations of the auditory signal in parallel in order to resolve this conflict. The use of parallel representations reduces variability for pattern-matching while retaining the ability to identify and segregate low-level features of the signal. An important feature of the model is the assumption that features of the auditory signal are fused together unless there is good reason to segregate them. 
Speech is recognised by matching the low-resolution representation to previously learned speech templates without prior segregation of the signal into separate perceptual streams; this contrasts with the approach 74 Eurospeech 2003 Wednesday generally used by computational models of ASA. We describe an implementation of the multi-resolution model, using hidden Markov models, that illustrates the feasibility of this approach and achieves much higher identification performance than standard techniques used for computer recognition of speech mixed with other sounds. Investigation of Emotionally Morphed Speech Perception and its Structure Using a High Quality Speech Manipulation System Hisami Matsui, Hideki Kawahara; Wakayama University, Japan A series of perceptual experiments using morphed emotional speech sounds was conducted. A high-quality speech modification procedure STRAIGHT [1] extended to enable auditory morphing[2] was used for providing CD quality test stimuli. The test results indicated that naturalness of morphed speech samples were comparable to natural speech samples and resynthesized samples without any modifications when interpolated. It also indicated that the proposed morphing procedure enables to provide stimulus continuum between different emotional expressions. Partial morphing tests were also conducted to evaluate relative contributions and interdependence between spectral, temporal and source parameters. Usefulness of Phase Spectrum in Human Speech Perception Kuldip K. Paliwal, Leigh Alsteris; Griffith University, Australia Short-time Fourier transform of speech signal has two components: magnitude spectrum and phase spectrum. In this paper, relative importance of short-time magnitude and phase spectra on speech perception is investigated. Human perception experiments are conducted to measure intelligibility of speech tokens synthesized either from magnitude spectrum or phase spectrum. It is traditionally believed that magnitude spectrum plays a dominant role for shorter windows (20-30 ms); while phase spectrum is more important for longer windows (128-3500 ms). It is shown in this paper that even for shorter windows, phase spectrum can contribute to speech intelligibility as much as the magnitude spectrum if the shape of the window function is properly selected. Perception of English Lexical Stress by English and Japanese Speakers: Effect of Duration and “Realistic” Intensity Change September 1-4, 2003 – Geneva, Switzerland Physical and Perceptual Configurations of Japanese Fricatives from Multidimensional Scaling Analyses Won Tokuma; Seijo University, Japan This study investigates the correlations between physical and perceptual spaces of voiceless Japanese fricatives /f s S ç h/, using Multidimensional Scaling technique. The spatial configurations were constructed from spectral distance measures and perceptual similarity judgements. The results show that 2-dimensional solutions adequately account for the data and the correlations between the two spaces are high. The dimensions also corresponded to ‘sibilance’ and ‘place’ properties. The spectral analyses on fricative sections, excluding the transitions, seem to contain sufficient information for correct perceptual judgements. These results are highly comparable to those of English fricative study (Choo, 1999), and support a universal prototype theory, according to which the correct identification of speech segments depends on the perceived distance between speech stimuli and a prototype in perceptual spaces. 
An Acquisition Model of Speech Perception with Considerations of Temporal Information Ching-Pong Au; City University of Hong Kong, China Speech Perception of humans begins to develop as young as 6month-old or even earlier. The development of perception was suggested to be a self-organizing process driven by the linguistic environment to the infants [1]. Self-organizing maps have been widely used for modeling the perception development of infants [2]. However, in these models, temporal information within speech is ignored. Only single vowels or phones have little variations along time can be represented in this kind of models. In the present model, temporal information of speech can be captured by the self-feeding input preprocessors so that the sequence of speech components can be learnt by the self-organizing map. The acquisition of both the single vowels and diphthongs will be demonstrated in this paper. Session: PWeCf– Poster Robust Speech Recognition II Time: Wednesday 13.30, Venue: Main Hall, Level -1 Chair: Nelson Morgan, ICSI and UC Berkeley, USA Dynamic Channel Compensation Based on Maximum A Posteriori Estimation Shinichi Tokuma; Chuo University, Japan This study investigated the effect of duration and intensity on the perception of English lexical stress by native and non-native speakers of English. The spectral balance of intensity was manipulated in a “realistic” way suggested by Sluijter et al. [1], which is to increase intensity level in the higher frequency bands (above 500Hz) as shown in the realisation of vocal effort. A non-sense English word /n@:n@:/ embedded in a frame sentence was used as the stimuli of the perceptual experiment, where English speakers and two levels of Japanese learners of English (advanced and pre-intermediate) were asked to determine lexical stress locations. The result showed: (1) “realistically” manipulated intensity serves as a strong cue for lexical stress perception of English for all subject groups; (2) advanced Japanese learners of English are, like English speakers, sensitive to duration in lexical stress perception, whereas pre-intermediate Japanese learners are, to a very limited extent, duration-sensitive; and (3) intensity, if altered in a proper way, could be as significant a cue as duration in perceiving English lexical stress. French Intonational Rises and Their Role in Speech Seg Mentation [sic] Huayun Zhang, Zhaobing Han, Bo Xu; Chinese Academy of Sciences, China The degradation of speech recognition performance in real-life environments and through transmission channels is a main embarrassment for many speech-based applications around the world, especially when non-stationary noise and changing channel exist. In this paper, we extend our previous works on MaximumLikelihood (ML) dynamic channel compensation by introducing a phone-conditioned prior statistic model for the channel bias and applying Maximum A Posteriori (MAP) estimation technique. Compared to the ML based method, the new MAP based algorithm follows with the variations within channels more effectively. The average structural delay of the algorithm is decreased from 400ms to 200 ms, which means it works better for short utterance compensation (as in many real applications). An additional 7∼8% charactererror-rate relative reduction is observed in telephone-based Mandarin large vocabulary continuous speech recognition (LVCSR). In short utterance test, the word-error-rate relatively reduced 30%. 
Far-Field ASR on Inexpensive Microphones Pauline Welby; Ohio State University, USA The results of two perception experiments provide evidence that French listeners use the presence of an early intonational rise and its alignment to the beginning of a content word as a cue to speech segmentation. Laura Docio-Fernandez 1 , David Gelbart 2 , Nelson Morgan 2 ; 1 Universidad de Vigo, Spain; 2 International Computer Science Institute, USA For a connected digits speech recognition task, we have compared the performance of two inexpensive electret microphones with that of a single high quality PZM microphone. Recognition error rates were measured both with and without compensation techniques, where both single-channel and two-channel approaches were used. 75 Eurospeech 2003 Wednesday In all cases the task was recognition at a significant distance (2-6 feet) from the talker’s mouth. The results suggest that the wide variability in characteristics among inexpensive electret microphones can be compensated for without explicit quality control, and that this is particularly effective when both single-channel and twochannel techniques are used. In particular, the resulting performance for the inexpensive microphones used together is essentially equivalent to the expensive microphone, and better than for either inexpensive microphone used alone. Evaluation of ETSI Advanced DSR Front-End and Bias Removal Method on the Japanese Newspaper Article Sentences Speech Corpus September 1-4, 2003 – Geneva, Switzerland ever, it is difficult for the conventional approach to reduce nonstationary noise, although it is easy to robustly reduce stationary noise. To cope with this problem, we propose a new combination technique with microphone array steering and Fourier / wavelet spectral subtraction. Wavelet spectral subtraction promises to effectively reduce non-stationary noise, because the wavelet transform admits a variable time-frequency resolution on each frequency band. As a result of an evaluation experiment in a real room, we confirmed that the proposed combination technique provides better performance of the ASR (Automatic Speech Recognition) and NRR (Noise Reduction Rate) than the conventional combination technique. Environmental Sound Source Identification Based on Hidden Markov Model for Robust Speech Recognition Satoru Tsuge, Shingo Kuroiwa, Kenji Kita; University of Tokushima, Japan In October 2002, European Telecommunications Standards Institute (ETSI) recommended a standard Distributed Speech Recognition (DSR) advanced front-end, ETSI ES202 050 version 1.1.1 (ES202). Many studies use this front-end in noise environments on several languages on connected digit recognition tasks. However, we have not seen the reports of large vocabulary continuous speech recognition using this front-end on a Japanese speech corpus. Since the DSR system is used on several languages and tasks, we conducted large vocabulary continuous speech recognition experiments using ES202 on a Japanese speech corpus in noise environments. Experimental results show that ES202 has better recognition performance than previous DSR front-end, ETSI ES201 050 version 1.1.2 under all conditions. In addition, we focus on the influence on recognition performance of DSR with acoustic mismatches caused by input devices. DSR employs a vector quantization (VQ) algorithm for feature compression so that the VQ distortion is increased by these mismatches. Large VQ distortion increases the speech recognition error rate. 
Environmental Sound Source Identification Based on Hidden Markov Model for Robust Speech Recognition
Takanobu Nishiura 1, Satoshi Nakamura 2, Kazuhiro Miki 3, Kiyohiro Shikano 3; 1 Wakayama University, Japan; 2 ATR-SLT, Japan; 3 Nara Institute of Science and Technology, Japan
In real acoustic environments, humans communicate with each other through speech by focusing on the target speech among environmental sounds, and we can easily identify the target sound among the other environmental sounds. For hands-free speech recognition, the identification of the target speech among environmental sounds is imperative. This mechanism may also be important for a mobile robot that must sense its acoustic environment and communicate with humans. This paper therefore first proposes Hidden Markov Model (HMM)-based environmental sound source identification. Environmental sounds are modeled by three-state HMMs and evaluated using 92 kinds of environmental sounds; the identification accuracy was 95.4%. The paper also proposes a new HMM composition method that composes speech HMMs with an HMM of categorized environmental sounds for robust recognition of speech with added environmental sounds. The evaluation experiments confirmed that the proposed HMM composition outperforms the conventional HMM composition of speech HMMs with a noise (environmental sound) HMM trained on noise periods preceding the target speech in the captured signal.

Environment Adaptive Control of Noise Reduction Parameters for Improved Robustness of ASR
Chng Chin Soon 1, Bernt Andrassy 2, Josef Bauer 2, Günther Ruske 1; 1 Technical University of Munich, Germany; 2 Siemens AG, Germany
This paper describes an extension to an automatic speech recognition system that improves robustness to varying environments. A dedicated control unit derives an optimal set of parameters for a Wiener filter based noise reduction unit, aiming at maximum recognition performance in different environments. The input measure for the control unit is derived from the speech signal; apart from the SNR level, several other measures are investigated. The controlled parameters are closely related to the strength of the noise reduction. Several non-linear methods, such as tabulated references and neural networks, serve as the core of the control unit. Experiments on realistic hands-free as well as non-hands-free speech data show that the word error rate can be reduced by as much as 31% with the proposed methods, with an already optimized static configuration of the applied noise reduction serving as the baseline.

High-Likelihood Model based on Reliability Statistics for Robust Combination of Features: Application to Noisy Speech Recognition
Peter Jančovič, Münevver Köküer, Fionn Murtagh; Queen’s University Belfast, U.K.
This paper introduces a novel statistical approach to the combination of multiple features that assumes no knowledge about the identity of the noisy features. In a given set of features, some of the features may be dominated by noise. The proposed model deals with the uncertainty about the noisy features by deriving the joint probability of the subset of features with the highest probabilities. The core of the model lies in determining the number of features to be included in the feature subset: this is estimated by calculating the reliability of each feature, defined as its normalized probability, and evaluating the joint maximal reliability. For the evaluation, we used the TIDIGITS database for connected digit recognition. The utterances were corrupted by various types of additive noise, so that the number and identity of the noisy features varied over time (or changed suddenly). The experimental results show that the high-likelihood model achieves recognition performance similar to that obtained with full a-priori knowledge of the identity of the noisy features.
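The key computational step in the high-likelihood model is choosing, frame by frame, a subset of feature streams whose joint reliability is high. The sketch below illustrates only that selection idea, using per-feature Gaussian likelihoods and a simple above-average reliability rule as a stand-in; the paper's exact joint-maximal-reliability criterion and distribution parameters are not reproduced here.

import numpy as np

def select_reliable_features(frame, means, variances):
    # Select a high-reliability feature subset (illustrative sketch).
    # frame     : (D,) observed feature vector
    # means     : (D,) clean-speech means of each feature
    # variances : (D,) clean-speech variances of each feature
    # Reliability of a feature is its likelihood normalized over all features;
    # here a feature enters the subset if its reliability exceeds the uniform
    # baseline 1/D, i.e. it is more trustworthy than average.
    loglik = -0.5 * (np.log(2 * np.pi * variances) + (frame - means) ** 2 / variances)
    reliability = np.exp(loglik - np.logaddexp.reduce(loglik))   # sums to 1
    return np.flatnonzero(reliability > 1.0 / len(frame))

# Toy usage: feature index 2 is heavily corrupted and is left out of the subset.
obs = np.array([0.1, -0.2, 4.0, 0.05, -0.1])
print(select_reliable_features(obs, np.zeros(5), np.ones(5)))    # -> [0 1 3 4]

In a recognizer, the state likelihood of each frame would then be computed from the selected subset only, so corrupted streams do not drag down the acoustic score.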
Speech Enhancement with Microphone Array and Fourier / Wavelet Spectral Subtraction in Real Noisy Environments
Yuki Denda, Takanobu Nishiura, Hideki Kawahara; Wakayama University, Japan
It is very important to capture distant-talking speech with high quality for teleconferencing systems or voice-controlled systems. For this purpose, microphone array steering and Fourier spectral subtraction, for example, are ideal candidates, and a combination technique using both has also been proposed to improve performance. However, it is difficult for the conventional approach to reduce non-stationary noise, although it can robustly reduce stationary noise. To cope with this problem, we propose a new combination technique with microphone array steering and Fourier / wavelet spectral subtraction. Wavelet spectral subtraction promises to reduce non-stationary noise effectively, because the wavelet transform admits a variable time-frequency resolution in each frequency band. In an evaluation experiment in a real room, we confirmed that the proposed combination technique provides better ASR (Automatic Speech Recognition) performance and NRR (Noise Reduction Rate) than the conventional combination technique.

Noise Robust Digit Recognition with Missing Frames
Cenk Demiroglu, David V. Anderson; Georgia Institute of Technology, USA
Noise robustness is one of the most challenging problems in speech recognition research. In this work, we propose a noise robust and computationally simple system for small vocabulary speech recognition. We approach the noise robust digit recognition problem with the missing frames idea. The key point behind the missing frames idea is that frames with energies below a certain threshold are considered unreliable. We set these frames to a silence floor and treat them as silence frames. Performing this operation only in the decoding stage creates a large mismatch between the training and decoding conditions; to solve the mismatch problem, we apply the same thresholding algorithm to the training data before training. The algorithm adds negligible computational complexity at the front end and decreases the overall computational complexity. Moreover, it outperforms other computationally comparable, well-known methods. This makes the proposed system particularly suitable for real-time systems.

A Noise-Robust ASR Back-End Technique Based on Weighted Viterbi Recognition
Xiaodong Cui 1, Alexis Bernard 2, Abeer Alwan 1; 1 University of California at Los Angeles, USA; 2 Texas Instruments Inc., USA
The performance of speech recognition systems trained in quiet degrades significantly under noisy conditions. To address this problem, a Weighted Viterbi Recognition (WVR) algorithm whose weighting is a function of the SNR of each speech frame is proposed. Acoustic models trained on clean data and the acoustic front-end features are kept unchanged in this approach. Instead, a confidence/robustness factor is assigned to the output observation probability of each speech frame according to its SNR estimate during the Viterbi decoding stage. Comparative experiments are conducted with Weighted Viterbi Recognition and different front-end features such as MFCC, LPCC and PLP. Results show consistent improvements with all three feature vectors. For a reasonable amount of adaptation data, WVR outperforms environment adaptation using MLLR.
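In weighted Viterbi recognition the per-frame confidence enters the decoder by scaling the acoustic score of each frame, commonly realized as an exponent on the observation likelihood (a multiplier on the log-likelihood). The sketch below shows that mechanism in a minimal Viterbi pass; the SNR-to-weight mapping is a hypothetical placeholder, not the weighting function used in the paper.

import numpy as np

def weighted_viterbi(log_obs, log_trans, log_init, frame_snr_db):
    # Viterbi decoding with per-frame weighting of the acoustic score (sketch).
    # log_obs      : (T, S) log observation likelihoods per frame and state
    # log_trans    : (S, S) log transition probabilities
    # log_init     : (S,)   log initial state probabilities
    # frame_snr_db : (T,)   per-frame SNR estimates in dB
    gamma = np.clip(frame_snr_db / 20.0, 0.0, 1.0)   # hypothetical SNR-to-weight map
    T, S = log_obs.shape
    delta = log_init + gamma[0] * log_obs[0]
    back = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + log_trans          # score of (from-state, to-state)
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + gamma[t] * log_obs[t]   # weighted acoustic term
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

A frame with very low SNR gets a weight near zero, so the path through that frame is determined almost entirely by the transition model rather than by the unreliable acoustic evidence.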
September 1-4, 2003 – Geneva, Switzerland ent noise conditioned recognizers in terms of Word Error Rate (WER) and CPU usage. Results show that the model matching scheme using the knowledge extracted from the audio stream by Environmental Sniffing does a better job than a ROVER solution both in accuracy and computation. A relative 11.1% WER improvement is achieved with a relative 75% reduction in CPU resources. Energy Contour Extraction for In-Car Speech Recognition Tai-Hwei Hwang; Industrial Technology Research Institute, Taiwan The time derivatives of speech energy, such as the delta and the delta-delta log energy, have been known as critical features for automatic speech recognition (ASR). However, their discriminative ability in lower signal-to-noise ratio (SNR) could be limited or even becomes harmful because of the corruption of energy contour. By taking the advantage of the spectral characteristic of in-car noise, the speech energy contour is extracted from the high-pass filtered signal so as to reduce the distortion in the delta energy. Such filtering can be implemented by using a pre-emphasis-like filter or a summation of higher frequency band energies. A Chinese name recognition task is conducted to evaluate the proposed method by using real in-car speech and artificially generated one as the test data. As shown in the experimental results, the method is capable of improving the recognition accuracy of in-car speech in lower SNR as well as of the clean speech. Noise-Robust ASR by Using Distinctive Phonetic Features Approximated with Logarithmic Normal Distribution of HMM Voice Quality Normalization in an Utterance for Robust ASR Takashi Fukuda, Tsuneo Nitta; Toyohashi University of Technology, Japan Muhammad Ghulam, Takashi Fukuda, Tsuneo Nitta; Toyohashi University of Technology, Japan Various approaches focused on noise-robustness have been investigated with the aim of using an automatic speech recognition (ASR) system in practical environments. We have previously proposed a distinctive phonetic feature (DPF) parameter set for a noise-robust ASR system, which reduced the effect of high-level additive noise[1]. This paper describes an attempt to replace normal distributions (NDs) of DPFs with logarithmic normal distributions (LNDs) in HMMs because DPFs show skew symmetry, or positive and negative skewness. The HMM with the LNDs was firstly evaluated in comparison with a standard HMM with NDs in an experiment using an isolated spoken-word recognition task with clean speech. Then noise robustness was tested with four types of additive noise. In the case of DPFs as an input feature vector set, the proposed HMM with the LNDs can outperform the standard HMM with the NDs in the isolated spoken-word recognition task both with clean speech and with speech contaminated by additive noise. Furthermore, we achieved significant improvements over a baseline system with MFCC and dynamic feature-set when combining the DPFs with static MFCCs and ∆P. In this paper, we propose a novel method of normalizing the voice quality in an utterance for both clean speech and speech contaminated by noise. The normalization method is applied to the Nbest hypotheses from an HMM-based classifier, then an SM (Subspace Method)-based verifier tests the hypotheses after normalizing the monophone scores together with the HMM-based likelihood score. 
The HMM-SM-based speech recognition system was proposed previously [1, 2] and successfully implemented on a speakerindependent word recognition task and an OOV word rejection task. We extend the proposed system to a connected digit string recognition task by exploring the effect of the voice quality normalization in an utterance for robust ASR and compare it with the HMM-based recognition systems with utterance-level normalization, word-level normalization, monophone-level normalization, and state-level normalization. Experimental results performed on connected 4- digit strings showed that the word accuracy was significantly improved from 95.7% obtained by the typical HMM-based system with utterance-level normalization to 98.2% obtained by the HMM-SM-based system for clean speech, from 88.1% to 91.5% for noise-added speech with SNR=10dB, and from 72.4% to 76.4% for noise-added speech with SNR=5dB, while the other HMM-based systems also showed lower performances. Environmental Sniffing: Robust Digit Recognition for an In-Vehicle Environment Murat Akbacak, John H.L. Hansen; University of Colorado at Boulder, USA In this paper, we propose to integrate an Environmental Sniffing [1] framework, into an in-vehicle hands-free digit recognition task. The framework of Environmental Sniffing is focused on detection, classification and tracking changing acoustic environments. Here, we extend the framework to detect and track acoustic environmental conditions in a noisy-speech audio stream. Knowledge extracted about the acoustic environmental conditions is used to determine which environment dependent acoustic model to use. Critical Performance Rate (CPR), previously considered in [1], is formulated and calculated for this task. The sniffing framework is compared to a ROVER solution for automatic speech recognition (ASR) using differ- Noise-Robust Automatic Speech Recognition Using Orthogonalized Distinctive Phonetic Feature Vectors Takashi Fukuda, Tsuneo Nitta; Toyohashi University of Technology, Japan With the aim of using an automatic speech recognition (ASR) system in practical environments, various approaches focused on noiserobustness such as noise adaptation and reduction techniques have been investigated. We have previously proposed a distinctive phonetic feature (DPF) parameter set for a noise-robust ASR system, which reduced the effect of high-level additive noise[1]. This paper describes an attempt to apply an orthogonalized DPF parameter set as an input of HMMs. In our proposed method, orthogonal bases are calculated using conventional DPF vectors that represent 38 Japanese phonemes, then the Karhunen-Loeve transform (KLT) is used to orthogonalize the DPFs, output from a multilayer neural network (MLN), by using the orthogonal bases. In experiments, orthogonalized DPF parameters were firstly compared with original DPF parameters on an isolated spoken-word recognition task with clean speech. Noise robustness was then tested with four types of additive noise. The proposed orthogonalized DPFs can reduce the error 77 Eurospeech 2003 Wednesday rate in an isolated spoken-word recognition task both with clean speech and with speech contaminated by additive noise. Furthermore, we achieved significant improvements over a baseline system with MFCC and dynamic feature-set when combining the orthogonalized DPFs with conventional static MFCCs and ∆P. 
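The orthogonalization step described in the last abstract above is essentially a Karhunen-Loeve transform: bases are estimated from a reference set of DPF vectors and then used to decorrelate the network outputs before HMM modeling. A small sketch of that step follows; the dimensions and the randomly generated stand-ins for the 38-phoneme DPF table and the MLN outputs are illustrative assumptions only.

import numpy as np

def klt_basis(reference_vectors):
    # Estimate Karhunen-Loeve (PCA) bases from reference DPF vectors.
    # reference_vectors : (N, D) array, e.g. one DPF vector per phoneme.
    # Returns the eigenvectors of the covariance matrix, ordered by
    # decreasing eigenvalue, as columns of a (D, D) matrix.
    centered = reference_vectors - reference_vectors.mean(axis=0)
    cov = np.cov(centered, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)           # ascending eigenvalues
    return eigvecs[:, ::-1]                          # largest variance first

def orthogonalize(dpf_frames, basis):
    # Project DPF frames (e.g. MLN outputs) onto the KL bases.
    return dpf_frames @ basis

# Toy usage with hypothetical sizes: 38 phoneme prototypes, 15-dim DPFs.
rng = np.random.default_rng(2)
prototypes = rng.random((38, 15))
basis = klt_basis(prototypes)
frames = rng.random((100, 15))                       # stand-in for MLN outputs
decorrelated = orthogonalize(frames, basis)

Decorrelating the DPF streams in this way makes them a better fit for the diagonal-covariance Gaussians typically used in HMM output distributions.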
Language Model Accuracy and Uncertainty in Noise Cancelling in the Stochastic Weighted Viterbi Algorithm September 1-4, 2003 – Geneva, Switzerland by a second corpus which contains a record of subjects’ viewing habits over a two year period. Finally, the two corpora have been combined to create two information retrieval test sets. Two probabilistic information retrieval systems are described, and the results obtained on the PUMA IR test sets using these systems are presented. Towards a Personal Robot with Language Interface L. Seabra Lopes, António Teixeira, M. Rodrigues, D. Gomes, C. Teixeira, L. Ferreira, P. Soares, J. Girão, N. Sénica; Universidade de Aveiro, Portugal Nestor Becerra Yoma, Iván Brito, Jorge Silva; University of Chile, Chile In this paper, the Stochastic Weighted Viterbi (SWV) decoding is combined with language modeling, which in turn guides the Viterbi decoding in those intervals where the information provided by noisy frames is not reliable. In other words, the knowledge from higher layers (e.g. language model) compensates the low accuracy of the information provided by the acoustic-phonetic modeling where the original clean speech signal is not reliably estimated. Bigram and trigram language models are tested, and in combination with spectral subtraction, the SWV algorithm can lead to reductions as high as 20% or 45% in word error rate (WER) using a rough estimation of the additive noise made in a short non-speech interval. Also, the results presented here suggest that the higher the language model accuracy, the higher the improvement due to SWV. This paper proposes that the problem of noise robustness in speech recognition should be classified in two different contexts: firstly, at the acousticphonetic level only, as in small vocabulary tasks with flat language model; and, by integrating noise canceling with the information from higher layers. The development of robots capable of accepting instructions in terms of familiar concepts to the user is still a challenge. For these robots to emerge it’s essential the development of natural language interfaces, since this is regarded as the only interface acceptable for a machine which expected to have a high level of interactivity with Man. Our group has been involved for several years in the development of a mobile intelligent robot, named Carl, designed having in mind such tasks as serving food in a reception or acting as a host in an organization. The approach that has been followed in the design of Carl is based on an explicit concern with the integration of the major dimensions of intelligence, namely Communication, Action, Reasoning and Learning. This paper focuses on the multimodal human-robot language communication capabilities of Carl, since these have been significantly improved during the last year. Preference, Perception, and Task Completion of Open, Menu-Based, and Directed Prompts for Call Routing: A Case Study Jason D. Williams, Andrew T. Shaw, Lawrence Piano, Michael Abt; Edify Corporation, USA Session: PWeCg– Poster Multi-Modal Processing & Speech Interface Design Usability subjects’ success with and preference among Open, Menubased and Directed Strategy dialogs for a call routing application in the consumer retail industry are assessed. Each subject experienced two strategies and was asked for a preference. Task completion was assessed, and subjective feedback was taken through Likert-scale questions. 
Preference and task completion scores were highest for one of the Directed strategies; another directed strategy was least preferred and the open strategy had the lowest task completion score. Time: Wednesday 13.30, Venue: Main Hall, Level -1 Chair: Florian Schiel, Bavarian Archive for Speech Signals (BAS), M"unchen, Germany An Integrated System for Smart-Home Control of Appliances Based on Remote Speech Interaction Ilyas Potamitis, K. Georgila, Nikos Fakotakis, George Kokkinakis; University of Patras, Greece We present an integrated system that uses speech as a natural input modality to provide user-friendly access to information and entertainment devices installed in a real home environment. The system is based on a combination of beamforming techniques and speech recognition. The general problem addressed in this work is that of hands-free speech recognition in a reverberant room where users walk while engaged in conversation in the presence of different types of house-specific noisy conditions (e.g. TV/radio broadcast, interfering speakers, ventilator/air-condition noise, etc.). The paper focuses on implementation details and practical considerations concerning the integration of diverse technologies into a working system. A Spoken Language Interface to an Electronic Programme Guide Jianhong Jin 1 , Martin J. Russell 1 , Michael J. Carey 2 , James Chapman 3 , Harvey Lloyd-Thomas 4 , Graham Tattersall 5 ; 1 University of Birmingham, U.K.; 2 University of Bristol, U.K.; 3 BT Exact Technologies, U.K.; 4 Ensigma Technologies, U.K.; 5 Snape Signals Research, U.K. An Integrated Toolkit Deploying Speech Technology for Computer Based Speech Training with Application to Dysarthric Speakers Athanassios Hatzis 1 , Phil Green 1 , James Carmichael 1 , Stuart Cunningham 2 , Rebecca Palmer 1 , Mark Parker 1 , Peter O’Neill 2 ; 1 University of Sheffield, U.K.; 2 Barnsley District General Hospital NHS Trust, U.K. Computer based speech training systems aim to provide the client with customised tools for improving articulation based on audiovisual stimuli and feedback. They require the integration of various components of speech technology, such as speech recognition and transcription tools, and a database management system which supports multiple on-the-fly configurations of the speech training application. This paper describes the requirements and development of STRAPTk (www.dcs.shef.ac.uk/spandh/projects/straptk-Speech Training Application Toolkit) from the point of view of developers, clinicians, and clients in the domain of speech training for severely dysarthric speakers. Preliminary results from an extended field trial are presented. Towards Best Practices for Speech User Interface Design This paper describes research into the development of personalised spoken language interfaces to an electronic programme guide A substantial data collection exercise has been conducted, resulting in a corpus of nearly 10,000 spoken queries to an electronic programme guide by a total of 64 subjects. A substantial part of the corpus comprises recordings of many queries from a small number of ‘core’ subjects to facilitate research into personalisation and the construction of user profiles. This spoken query data is supported Bernhard Suhm; BBN Technologies, USA Designing speech interfaces is difficult. Research on spoken language systems and commercial application development has created a body of speech interface design knowledge. However, this knowledge is not easily accessible to practitioners. 
Few experts understand both speech recognition and human factors well enough to avoid the pitfalls of speech interface design. To facilitate the design 78 Eurospeech 2003 Wednesday of better speech interfaces, this paper presents a methodology to compile design guidelines for various classes of speech interfaces. Such guidelines enable practitioners to employ discount usability engineering methods to speech interfaces, including obtaining guidance during early stages of design, and heuristic evaluation. To illustrate our methodology, we apply it to generate a short list of ten guidelines for telephone spoken dialog systems. We demonstrate the usefulness of the guidelines with examples from our consulting practice, applying each guideline to improve poorly designed prompts. We believe this methodology can facilitate compiling the growing body of design knowledge to best practices for important classes of speech interfaces. Design and Evaluation of a Limited Two-Way Speech Translator David Stallard, John Makhoul, Frederick Choi, Ehry Macrostie, Premkumar Natarajan, Richard Schwartz, Bushra Zawaydeh; BBN Technologies, USA We present a limited speech translation system for English and colloquial Levantine Arabic, which we are currently developing as part of the DARPA Babylon program. The system is intended for question/answer communication between an English-speaking operator and an Arabic-speaking subject. It uses speech recognition to convert a spoken English question into text, and plays out a prerecorded speech file corresponding to the Arabic translation of this text. It then uses speech recognition to convert the Arabic reply into Arabic text, and does information extraction on this text to find the answer content, which is rendered into English. A novel aspect of our work is its use of a statistical classifier to extract information content from the Arabic text. We present evaluation results for both individual components and the end-to-end system. Multimodal Interaction on PDA’s Integrating Speech and Pen Inputs Sorin Dusan 1 , Gregory J. Gadbois 2 , James Flanagan 1 ; 1 Rutgers University, USA; 2 HandHeld Speech LLC, USA Recent efforts in the field of mobile computing are directed toward speech-enabling portable computers. This paper presents a method of multimodal interaction and an application which integrates speech and pen on mobile computers. The application is designed for documenting traffic accident diagrams by police. The novelty of this application is due to a) its method of fusing the speech and pen inputs, and b) its fully embedded speech engine. Preliminary experiments showed flexibility, versatility and increased naturalness and user satisfaction during multimodal interaction. Towards Multimodal Interaction with an Intelligent Room September 1-4, 2003 – Geneva, Switzerland shown at the 2003 North American International Auto Show in Detroit. The system, including a touch screen and a speech recognizer, is used for controlling several non-critical automobile operations, such as climate, entertainment, navigation, and telephone. The prototype implements a natural language spoken dialog interface integrated with an intuitive graphical user interface, as opposed to the traditional, speech only, command-and-control interfaces deployed in some of the vehicles currently on the market. Context Awareness Using Environmental Noise Classification L. Ma, D.J. Smith, Ben P. Milner; University of East Anglia, U.K. 
Context-awareness is essential to the development of adaptive information systems. Environmental noise can provide a rich source of information about the current context. We describe our approach for automatically sensing and recognising noise from typical environments of daily life, such as office, car and city street. In this paper we present our hidden Markov model based noise classifier. We describe the architecture of the system, compare classification results from the system with human listening tests, and discuss open issues in environmental noise classification for mobile computing. Simple Designing Methods of Corpus-Based Visual Speech Synthesis Tatsuya Shiraishi 1 , Tomoki Toda 2 , Hiromichi Kawanami 1 , Hiroshi Saruwatari 1 , Kiyohiro Shikano 1 ; 1 Nara Institute of Science and Technology, Japan; 2 ATR-SLT, Japan This paper describes simple designing methods of corpus-based visual speech synthesis. Our approach needs only a synchronous real image and speech database. Visual speech is synthesized by concatenating real image segments and speech segments selected from the database. In order to automatically perform all processes, e.g. feature extraction, segment selection and segment concatenation, we simply design two types of visual speech synthesis. One is synthesizing visual speech using synchronous real image and speech segments selected with only speech information. The other is using speech segment selection and image segment selection with features extracted from the database without processes by hand. We performed objective and subjective experiments to evaluate these designing methods. As a result, the latter method can synthesize visual speech more naturally than the former method. Comparing the Usability of a User Driven and a Mixed Initiative Multimodal Dialogue System for Train Timetable Information Janienke Sturm 1 , Ilse Bakx 2 , Bert Cranen 1 , Jacques Terken 2 ; 1 University of Nijmegen, The Netherlands; 2 University of Eindhoven, The Netherlands Petra Gieselmann 1 , Matthias Denecke 2 ; 1 Universität Karlsruhe, Germany; 2 Carnegie Mellon University, USA There is a great potential for combining speech and gestures to improve human computer interaction because this kind of communication resembles more and more the natural communication humans use every day with each other. Therefore, this paper is about the multimodal interaction consisting of speech and gestures in an intelligent room. The advantages of using multimodal systems are explained and we present the gesture recognizer and the dialogue system we use. We explain how the information from the different modalities is parsed and integrated in one semantic representation. The aim of the study presented in this paper was to compare the usability of a user driven and a mixed initiative user interface of a multimodal system for train timetable information. The evaluation shows that the effectiveness of the two interfaces does not differ significantly. However, as a result of the absence of spoken prompts and the obligatory use of buttons to provide values, the efficiency of the user driven interface is much higher than the efficiency of the mixed initiative interface. Although the user satisfaction was not significantly higher for the user driven interface, by far most people preferred the user driven interface to the mixed initiative interface. 
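The environmental noise classifier of Ma, Smith and Milner described earlier in this session follows a standard pattern: one model is trained per environment class and a noise clip is assigned to the class with the highest likelihood. The sketch below shows that decision rule with single diagonal-Gaussian class models standing in for per-class HMMs; the feature dimensions and class names are hypothetical simplifications.

import numpy as np

def train_class_models(features_by_class):
    # Fit one diagonal-Gaussian model per environment class
    # (a simplified stand-in for per-class HMMs).
    return {name: (feats.mean(axis=0), feats.var(axis=0) + 1e-6)
            for name, feats in features_by_class.items()}

def classify_clip(frames, models):
    # Assign a clip of feature frames to the environment whose model
    # gives the highest total log-likelihood.
    def loglik(x, mean, var):
        return np.sum(-0.5 * (np.log(2 * np.pi * var) + (x - mean) ** 2 / var))
    scores = {name: loglik(frames, m, v) for name, (m, v) in models.items()}
    return max(scores, key=scores.get)

# Toy usage with synthetic "office", "car" and "street" noise features.
rng = np.random.default_rng(3)
train = {"office": rng.normal(0.0, 1.0, (500, 12)),
         "car":    rng.normal(2.0, 1.5, (500, 12)),
         "street": rng.normal(-1.5, 2.0, (500, 12))}
models = train_class_models(train)
clip = rng.normal(2.0, 1.5, (80, 12))
print(classify_clip(clip, models))                   # expected: "car"

An HMM-based classifier replaces each Gaussian with a small left-to-right model so that the temporal texture of the noise, not just its average spectrum, contributes to the decision.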
Read My Tongue Movements: Bimodal Learning to Perceive and Produce Non-Native Speech /r/ and /l/ A Multimodal Conversational Interface for a Concept Vehicle Roberto Pieraccini 1 , Krishna Dayanidhi 1 , Jonathan Bloom 1 , Jean-Gui Dahan 1 , Michael Phillips 1 , Bryan R. Goodman 2 , K. Venkatesh Prasad 2 ; 1 SpeechWorks International, USA; 2 Ford Motor Co., USA Dominic W. Massaro, Joanna Light; University of California at Santa Cruz, USA This paper describes a prototype of a conversational system that was implemented on the Ford Model U Concept Vehicle and first This study investigated the effectiveness of Baldi for teaching nonnative phonetic contrasts, by comparing instruction illustrating the internal articulatory processes of the oral cavity versus instruction 79 Eurospeech 2003 Wednesday providing just the normal view of the tutor’s face. Eleven Japanese speakers of English as a second language were bimodally trained under both instruction methods to identify and produce American English /r/ and /l/ in a within-subject design. Speech identification and production improved under both training methods although training with a view of the internal articulators did not show an additional benefit. A generalization test showed that this learning transferred to the production of new words. Low Resource Lip Finding and Tracking Algorithm for Embedded Devices Jesús F. Guitarte Pérez 1 , Klaus Lukas 1 , Alejandro F. Frangi 2 ; 1 Siemens AG, Germany; 2 University of Zaragoza, Spain One of the best challenges in Lip Reading is to apply this technology to an embedded device. In the current solutions the high use of resources, especially in reference to visual processing, makes the implementation in a small device very difficult. In this article a new, efficient and straightforward algorithm for detection and tracking of lips is presented. Lip Finding and Tracking is the first step in visual processing for Lip Reading. In our approach the Lip Finding is performed between a small amount of blobs, which should fulfill a geometric restriction. In terms of computational power and memory the proposed algorithm meets the requirements of an embedded device; on average less than 4 MHz1 of CPU is required. This algorithm shows promising results in a realistic environment accomplishing successful lip finding and tracking in 94.2% of more than 4900 image frames. Detection and Separation of Speech Segment Using Audio and Video Information Fusion Futoshi Asano 1 , Yoichi Motomura 1 , Hideki Asoh 1 , Takashi Yoshimura 1 , Naoyuki Ichimura 1 , Kiyoshi Yamamoto 2 , Nobuhiko Kitawaki 2 , Satoshi Nakamura 3 ; 1 AIST, Japan; 2 Tsukuba University, Japan; 3 ATR-SLT, Japan In this paper, a method of detecting and separating speech events in a multiple-sound-source condition using audio and video information is proposed. For detecting speech events, sound localization using a microphone array and human tracking by stereo vision is combined by a Bayesian network. From the inference results of the Bayesian network, the information on the time and location of speech events can be known in a multiple-sound-source condition. Based on the detected speech event information, a maximum likelihood adaptive beamformer is constructed and the speech signal is separated from the background noise and interferences. 
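Once the time and direction of a speech event are known, the separation stage in the last abstract above steers the array toward the talker. The authors use a maximum likelihood adaptive beamformer; as a much simpler stand-in that shows only the steering idea, the sketch below implements a frequency-domain delay-and-sum beamformer for a uniform linear array. The geometry, sampling rate and steering angle are made-up example values.

import numpy as np

def delay_and_sum(mic_signals, mic_positions, angle_deg, fs, c=343.0):
    # Frequency-domain delay-and-sum beamformer for a linear array (sketch).
    # mic_signals   : (M, N) time signals, one row per microphone
    # mic_positions : (M,) microphone positions along the array axis, in metres
    # angle_deg     : steering direction measured from broadside, in degrees
    # fs            : sampling rate in Hz
    # Each channel is phase-shifted to time-align the wavefront arriving from
    # the chosen direction, then the channels are averaged.
    M, N = mic_signals.shape
    delays = mic_positions * np.sin(np.deg2rad(angle_deg)) / c   # seconds
    freqs = np.fft.rfftfreq(N, d=1.0 / fs)
    spectra = np.fft.rfft(mic_signals, axis=1)
    steering = np.exp(2j * np.pi * freqs[None, :] * delays[:, None])
    aligned = spectra * steering
    return np.fft.irfft(aligned.mean(axis=0), n=N)

# Toy usage: 4-mic array with 5 cm spacing, 16 kHz audio, steering to 30 degrees.
rng = np.random.default_rng(4)
signals = rng.normal(size=(4, 16000))
positions = np.arange(4) * 0.05
enhanced = delay_and_sum(signals, positions, 30.0, 16000)

An adaptive (e.g. maximum likelihood or minimum variance) beamformer additionally shapes nulls toward the interfering sources detected by the localization stage, rather than only adding the channels in phase.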
Resynthesis of 3D Tongue Movements from Facial Data Olov Engwall, Jonas Beskow; KTH, Sweden Simultaneous measurements of tongue and facial motion, using a combination of electromagnetic articulography (EMA) and optical motion tracking, are analysed to investigate the possibility to resynthesize the subject’s tongue movements with a parametrically controlled 3D model using the facial data only. The recorded material consists of 63 VCV words spoken by one Swedish subject. The tongue movements are resynthesized using a combination of a linear estimation to predict the tongue data from the face and an inversion procedure to determine the articulatory parameters of the model. Acquiring Lexical Information from Multilevel Temporal Annotations September 1-4, 2003 – Geneva, Switzerland and a method is presented for automatically accomplishing this task, and evaluated using German, Japanese and Anyi (W. Africa) corpora. LUCIA a New Italian Talking-Head Based on a Modified Cohen-Massaro’s Labial Coarticulation Model Piero Cosi, Andrea Fusaro, Graziano Tisato; ISTC-CNR, Italy LUCIA, a new Italian talking head based on a modified version of the Cohen-Massaro’s labial coarticulation model is described. A semiautomatic minimization technique, working on real cinematic data, acquired by the ELITE optoelectronic system, was used to train the dynamic characteristics of the model. LUCIA is an MPEG-4 standard facial animation system working on standard FAP visual parameters and speaking with the Italian version of FESTIVAL TTS. A Visual Context-Aware Multimodal System for Spoken Language Processing Niloy Mukherjee, Deb Roy; Massachusetts Institute of Technology, USA Recent psycholinguistic experiments show that acoustic and syntactic aspects of online speech processing are influenced by visual context through cross-modal influences. During interpretation of speech, visual context seems to steer speech processing and vice versa. We present a real-time multimodal system motivated by these findings that performs early integration of visual contextual information to recognize the most likely word sequences in spoken language utterances. The system first acquires a grammar and a visually grounded lexicon from a “show-and-tell” procedure where the training input consists of camera images consisting of sets of objects paired with verbal object descriptions. Given a new scene, the system generates a dynamic visually-grounded language model and drives a dynamic model of visual attention to steer speech recognition search paths towards more likely word sequences. Session: OWeDb– Oral Speech Recognition - Language Modeling Time: Wednesday 16.00, Venue: Room 2 Chair: Jean-Luc Gauvain, LIMSI, France Maximum Entropy Good-Turing Estimator for Language Modeling Juan P. Piantanida, Claudio F. Estienne; University of Buenos Aires, Argentina In this paper, we propose a new formulation of the classical GoodTuring estimator for n-gram language model. The new approach is based on defining a dynamic model for language production. Instead of assuming a fixed probability distribution of occurrence of an n-gram on the whole text, we propose a maximum entropy approximation of a time varying distribution. This approximation led us to a new distribution, which in turn is used to calculate expectations of the Good-Turing estimator. This defines a new estimator that we call Maximum Entropy Good-Turing estimator. 
Contrary to the classical Good-Turing estimator it needs neither expectations approximations nor windowing or other smoothing techniques. It also contains the well know discounting estimators as special cases. Performance is evaluated both in terms of perplexity and word error rate in an N-best re-scoring task. Also comparison to other classical estimators is performed. In all cases our approach performs significantly better than classical estimators. Exploiting Order-Preserving Perfect Hashing to Speedup N-Gram Language Model Lookahead Thorsten Trippel, Felix Sasaki, Benjamin Hell, Dafydd Gibbon; Universität Bielefeld, Germany The extraction of lexical information for machine readable lexica from multilevel annotations is addressed in this paper. Relations between these levels of annotation are used for sub-classification of lexical entries. A method for relating annotation units is presented, based on a temporal calculus. Relating the annotation units manually is error-prone, time consuming and tends to be inconsistent, Xiaolong Li, Yunxin Zhao; University of Missouri-Columbia, USA Minimum Perfect Hashing (MPH) has recently been shown successful in reducing Language Model (LM) lookahead time in LVCSR decoding. In this paper we propose to exploit the order-preserving (OP) property of a string-key based MPH function to further reduce 80 Eurospeech 2003 Wednesday hashing operation and speed up LM lookahead. A subtree structure is proposed for LM lookahead and an order-preserving MPH is integrated into the structure design. Subtrees are generated on demand and stored in caches. Experiments were performed on Switchboard data. By using the proposed method of OP MPH and subtree cache structure for both trigrams and backoff bigrams, the LM lookahead time was reduced by a factor of 2.9 in comparison with the baseline case of using MPH alone. Stem-Based Maximum Entropy Language Models for Inflectional Languages Dimitrios Oikonomidis, Vassilios Digalakis; Technical University of Crete, Greece In this work we build language models using three different training methods: n-gram, class-based and maximum entropy models. The main issue is the use of stem information to cope with the very large number of distinct words of an inflectional language, like Greek. We compare the three models with both perplexity and word error rate. We also examine thoroughly the perplexity differences of the three models on specific subsets of words. Combination of a Hidden Tag Model and a Traditional N-Gram Model: A Case Study in Czech Speech Recognition September 1-4, 2003 – Geneva, Switzerland morphosyntactic language model. The architecture of the recognition system is based on the weighted finite-state transducer (WFST) paradigm. Thanks to the flexible transducer-based architecture, the morphosyntactic component is integrated seamlessly with the basic modules with no need to modify the decoder itself. We compare the phoneme, morpheme, and word error-rates as well as the sizes of the recognition networks in two configurations. In one configuration we use only the N-gram model while in the other we use the combined model. The proposed stochastic morphosyntactic language model decreases the morpheme error rate by between 1.7 and 7.2% relatively when compared to the baseline trigram system. The morpheme error-rate of the best configuration is 18% and the best word error-rate is 22.3%. 
Session: OWeDc– Oral Speech Modeling & Features IV Time: Wednesday 16.00, Venue: Room 3 Chair: Katrin Kirchhoff, University of Washington, USA Locus Equations Determination Using the SpeechDat(II) Bojan Petek; University of Ljubljana, Slovenia Pavel Krbec, Petr Podveský, Jan Hajič; Charles University, Czech Republic A speech recognition system targeting high inflective languages is described that combines the traditional trigram language model and an HMM tagger, obtaining results superior to the trigram language model itself. An experiment in speech recognition of Czech has been performed with promising results. Unlimited Vocabulary Speech Recognition Based on Morphs Discovered in an Unsupervised Manner This paper presents a corpus-based approach to determination of locus equations for Slovenian language. The SpeechDat(II) spoken language database is analyzed first for all available target VCV contexts in order to yield candidate subsets for the acoustic-phonetic measurements. Only the VCVs embedded within judiciously chosen carrier utterances are then selected for the (F2vowel , F2onset ) measurements. The paper discusses challenges, methodology, and results obtained on the 1000-speaker Slovenian SpeechDat(II) database in the framework of /VbV/, /VdV/, and /VgV/-based determination of locus equations. A Memory-Based Approach to Cantonese Tone Recognition Vesa Siivola, Teemu Hirsimäki, Mathias Creutz, Mikko Kurimo; Helsinki University of Technology, Finland We study continuous speech recognition based on sub-word units found in an unsupervised fashion. For agglutinative languages like Finnish, traditional word-based n-gram language modeling does not work well due to the huge number of different word forms. We use a method based on the Minimum Description Length principle to split words statistically into subword units allowing efficient language modeling and unlimited vocabulary. The perplexity and speech recognition experiments on Finnish speech data show that the resulting model outperforms both word and syllable based trigram models. Compared to the word trigram model, the out-ofvocabulary rate is reduced from 20% to 0% and the word error rate from 56% to 32%. Tutkimme ohjaamattomasti löydettyihin sanaa lyhyempiin yksiköihin perustuvaa jatkuvan puheen tunnistusta. Perinteiset sanoihin perustuvat n-grammikielimallit toimivat huonosti agglutinatiivisille kielille kuten suomi, sillä näissä kielissä on erittäin paljon erilaisia sanamuotoja. Tässä työssä käytämme lyhyimpään kuvauspituuteen (Minimum Description Length, MDL) perustuvaa menetelmää sanojen tilastolliseen pilkkomiseen. Näin saamme tehokkaan kielimallin, jolla on rajoittamaton sanasto. Kokeet suomenkielisellä aineistolla osoittavat, että tämä malli toimii selvästi sekä sana- että tavupohjaisia malleja paremmin. Sanapohjaiseen trigrammimalliin verrattuna sanastosta puuttuvien sanojen osuus tippuu 20 prosentista nollaan prosenttiin ja puheentunnistimen sanavirhe 56 prosentista 32 prosenttiin. Evaluation of the Stochastic Morphosyntactic Language Model on a One Million Word Hungarian Dictation Task Máté Szarvas, Sadaoki Furui; Tokyo Institute of Technology, Japan Michael Emonts 1 , Deryle Lonsdale 2 ; 1 Sony Electronics, USA; 2 Brigham Young University, USA This paper introduces memory-based learning as a viable approach for Cantonese tone recognition. The memory-based learning algorithm employed here outperforms other documented current approaches for this problem, which is based on neural networks. 
Various numbers of tones and features are modeled to find the best method for feature selection and extraction. To further optimize this approach, experiments are performed to isolate the best feature weighting method, the best class voting weights method, and the best number of k-values to implement. Results and possible future work are discussed. Experimental Evaluation of the Relevance of Prosodic Features in Spanish Using Machine Learning Techniques David Escudero 1 , Valentín Cardeñoso 1 , Antonio Bonafonte 2 ; 1 Universidad de Valladolid, Spain; 2 Universitat Politècnica de Catalunya, Spain In this work, machine learning techniques have been applied for the assessment of the relevance of several prosodic features in TTS for Spanish. Using a two step correspondence between sets of prosodic features and intonation parameters, the influence of the number of different intonation patterns and the number and order of prosodic features is evaluated. The output of the trained classifiers is proposed as a labelling mechanism of intonation units which can be used to synthesize high quality pitch contours. The input output correspondence of the classifier also provides a bundle of relevant prosodic knowledge. Dominance Spectrum Based V/UV Classification and F0 Estimation In this article we evaluate our stochastic morphosyntactic language model (SMLM) on a Hungarian newspaper dictation task that requires modeling over 1 million different word forms. The proposed method is based on the use of morphemes as the basic recognition units and the combination of a morpheme N-gram model and a Tomohiro Nakatani 1 , Toshio Irino 2 , Parham Zolfaghari 1 ; 1 NTT Corporation, Japan; 2 Wakayama University, Japan 81 Eurospeech 2003 Wednesday This paper presents a new method for robust voiced/unvoiced segment (V/UV) classification and accurate fundamental frequency (F0 ) estimation in a noisy environment. For this purpose, we introduce the degree of dominance and dominance spectrum that are defined by instantaneous frequency. The degree of dominance allows us to evaluate the magnitude of individual harmonic components of speech signals relative to the background noise. The V/UV segments are robustly classified based on the capability of the dominance spectrum to extract the regularity in the harmonic structure. F0 is accurately determined based on fixed points corresponding to dominant harmonic components easily selected from the dominance spectrum. Experimental results show that the present method is better than the existing methods in terms of gross and fine F0 errors, and V/UV correct rates in the presence of background white and babble noise. Analysis and Modeling of F0 Contours of Portuguese Utterances Based on the Command-Response Model 1 1 September 1-4, 2003 – Geneva, Switzerland un incremento absoluto del 18.4% (10.3%) en exactitud con respecto a la prueba base. Estos descubrimientos proporcionan evidencias adicionales del potencial de los flujos descompuestos harmónicamente para dar mejoras en rendimiento y, sustancialmente, para realzar la exactitud del reconocimiento en ruido. 
Session: SWeDd– Oral Feature Analysis & Cross-Language Processing of Chinese Spoken Language Time: Wednesday 16.00, Venue: Room 4 Chair: Tao Jianhua, Chinese Academy of Sciences, Beijing Automatic Title Generation for Chinese Spoken Documents Considering the Special Structure of the Language Lin-shan Lee, Shun-Chuan Chen; National Taiwan University, Taiwan 2 Hiroya Fujisaki , Shuichi Narusawa , Sumio Ohno , Diamantino Freitas 3 ; 1 University of Tokyo, Japan; 2 Tokyo University of Technology, Japan; 3 University of Porto, Portugal This paper describes the results of a joint study on the applicability of the command-response model to F0 contours of European Portuguese, with an aim to use it in a TTS system. Analysis-bySynthesis of observed F0 contours of a number of utterances by five native speakers indicated that the model with provisions for both positive and negative accent commands applies quite well to all the utterances tested. The estimated commands are found to be closely related to the linguistic contents of the utterances. One of the features of European Portuguese found in utterances by the majority of speakers is the occurrence of a negative accent command at certain phrase-initial positions, and its perceptual significance is examined by an informal listening test, using stimuli synthesized both with and without negative accent commands. Covariation and Weighting of Harmonically Decomposed Streams for ASR Philip J.B. Jackson 1 , David M. Moreno 2 , Martin J. Russell 3 , Javier Hernando 2 ; 1 University of Surrey, U.K.; 2 Universitat Politècnica de Catalunya, Spain; 3 University of Birmingham, U.K. The purpose of automatic title generation is to understand a document and to summarize it with only several but readable words or phrases. It is important for browsing and retrieving spoken documents, which may be automatically transcribed, but it will be much more helpful if given the titles indicating the content subjects of the documents. On the other hand, the Chinese language is not only spoken by the largest population of the world, but with very special structure different from western languages. It is not alphabetic, with large number of distinct characters each pronounced as a monosyllable, while the total number of syllables is limited. In this paper, considering the special structure of the Chinese language, a set of “feature units” for Chinese spoken language processing is defined and the effects of the choice of these “feature units” on automatic title generation are analyzed with a new adaptive K nearestneighbor approach, proposed in a companion paper also submitted to this conference as the baseline. Statistical Speech-to-Speech Translation with Multilingual Speech Recognition and Bilingual-Chunk Parsing Bo Xu, Shuwu Zhang, Chengqing Zong; Chinese Academy of Sciences, China Decomposition of speech signals into simultaneous streams of periodic and aperiodic information has been successfully applied to speech analysis, enhancement, modification and recently recognition. This paper examines the effect of different weightings of the two streams in a conventional HMM system in digit recognition tests on the Aurora 2.0 database. Comparison of the results from using matched weights during training showed a small improvement of approximately 10% relative to unmatched ones, under clean test conditions. 
Principal component analysis of the covariation amongst the periodic and aperiodic features indicated that only 45 (51) of the 78 coefficients were required to account for 99% of the variance, for clean (multi-condition) training, which yielded an 18.4% (10.3%) absolute increase in accuracy with respect to the baseline. These findings provide further evidence of the potential for harmonically-decomposed streams to improve performance and substantially to enhance recognition accuracy in noise. La descomposición de señales del habla en flujos simultaneos de información periódica y aperiódica ha sido aplicada exitosamente al análisis, realce, modificación y, recientemente, reconocimiento del habla. Este artículo examina el efecto de diferentes ponderaciones de estos dos flujos en un sistema ‘HMM’ convencional de reconocimiento de dígitos con la base de datos Aurora 2.0. Bajo condiciones de prueba no ruidosas, la comparación de los resultados utilizando ponderaciones coincidentes durante entrenamiento y prueba mostró una pequeña mejora relativa de aproximadamente un 10% con respecto al caso de utilizar ponderaciones sólo en las puebas. El análisis de componentes principales de la covarianza entre los rasgos periódicos y aperiódicos indicó que sólo fueron requeridos 45 (51) de los 78 coeficientes para cubrir el 99% de la varianza, para entrenamento limpio (multicondicional), el cual produjo Initiated mainly from speech community, researches in speech to speech (S2S) translation have made steady progress in the past decade. Many approaches to S2S translation have been proposed continually. Among of them, corpus-dependent statistical strategies have been widely studied during recent years. In corpus-based translation methodology, rather than taking the corpus just as reference templates, more detailed or structural information should be exploited and integrated in statistical modeling. Under the statistical translation framework that provides very flexible way of integrating different prior or structural knowledge, we have conducted a series of R&D activities on S2S translation. In the most recent version, we have independently developed a prototype Chinese-English bi-directional S2S translation system with the supports of multilingual speech recognition and bilingual-Chunk based statistical translation techniques to meet the demand of Manos – a multilingual information service project for 2008 Beijing Olympic Games. This paper introduces our works in the research of multilingual S2S translation. Automatic Extraction of Bilingual Chunk Lexicon for Spoken Language Translation Limin Du, Boxing Chen; Chinese Academy of Sciences, China In language communication, an utterance may be segmented as a concatenation of chunks that are reasonable in syntax, meaningful in semantics, and composed of several words. Usually, the order of words within chunks is fixed, and the order of chunks within an utterance is rather flexible. The improvement of spoken language translation could benefit from using bilingual chunks. This paper presents a statistical algorithm to build the bilingual chunk-lexicon automatically from spoken language corpora. Several association measurements are set up as the criteria of the extraction. And local 82 Eurospeech 2003 Wednesday best algorithm, length ratio filtration and stop-word filtration are also incorporated to improve the performance. A bilingual chunklexicon was extracted from a corpus with precision of 86.0% and recall of 86.7%. 
The usability of the chunk-lexicon was then tested with an innovative framework for English-to-Chinese Spoken Language translation, resulted in translation accuracy of 81.83% and 78.69% for training and test sets respectively, measured with Levenshtein distance based similarity score. Multi-Scale Document Expansion in English-Mandarin Cross-Language Spoken Document Retrieval Wai-Kit Lo 1 , Yuk-Chi Li 1 , Gina Levow 2 , Hsin-Min Wang 3 , Helen M. Meng 1 ; 1 Chinese University of Hong Kong, China; 2 University of Chicago, USA; 3 Academia Sinica, Taiwan This paper presents the application of document expansion using a side collection to a cross-language spoken document retrieval (CLSDR) task to improve retrieval performance. Document expansion is applied to a series of English-Mandarin CL-SDR experiments using selected retrieval models (probabilistic belief network, vector space model, and HMM-based retrieval model). English textual queries are used to retrieve relevant documents from an archive of Mandarin radio broadcast news. We have devised a multi-scale approach for document expansion - a process that enriches the Mandarin spoken document collection in order to improve overall retrieval performance. A document is expanded by (i) first retrieving related documents on a character bigram scale, (ii) then extracting word units from such related documents as expansion terms to augment the original document and (iii) finally indexing all documents in the collection by means of character bigrams and those expanded terms by within-word character bigrams to prepare for future retrieval. Hence the document expansion approach is multi-scale as it involves both word and subword scales. Experimental results show that this approach achieves performance improvements up to 14% across several retrieval models. Mandarin Speech Prosody: Issues, Pitfalls and Directions Chiu-yu Tseng; Academia Sinica, Taiwan From the perspective of speech technology development for unlimited Mandarin Chinese TTS, two issues appear most impedimental: (1.) how to predict prosody from text, and (2.) how to achieve better naturalness for speech output. These impediments somewhat brought out the major pitfalls in related research, i.e., characteristics of Chinese connected speech and the overall rhythmic structure of speech flow. This paper discusses where the problems stem from and how some solutions could be found. We propose that for Mandarin, prosody research needs to include the following: (1.) characteristics of Mandarin connected speech that constitute the prosodic properties in speech flow, i.e., units and boundaries, (2.) scope and type of speech data collected, i.e., text other than isolated sentences, (3.) prosody in relation to speech planning, i.e., information other than lexical, syntactic and semantic, and (4.) an overall organization of prosody for speech flow, i.e., a framework that accommodate the above mentioned features. A Contrastive Investigation of Standard Mandarin and Accented Mandarin 1 September 1-4, 2003 – Geneva, Switzerland Standard Chinese. No significant difference exists on durations of initials and finals for these 20 speakers. And no phonological difference is found on four lexical tones. It seems that the prosodic difference is mainly on rhythmic or stress pattern. Emotion Control of Chinese Speech Synthesis in Natural Environment Jianhua Tao; Chinese Academy of Sciences, China Emotional speech analysis was normally conducted from the viewpoint of prosody and articulation features. 
Mandarin Speech Prosody: Issues, Pitfalls and Directions
Chiu-yu Tseng; Academia Sinica, Taiwan
From the perspective of speech technology development for unlimited Mandarin Chinese TTS, two issues appear most impedimental: (1.) how to predict prosody from text, and (2.) how to achieve better naturalness for speech output. These impediments have also brought out the major pitfalls in related research, i.e., the characteristics of Chinese connected speech and the overall rhythmic structure of speech flow. This paper discusses where the problems stem from and how some solutions could be found. We propose that for Mandarin, prosody research needs to include the following: (1.) characteristics of Mandarin connected speech that constitute the prosodic properties in speech flow, i.e., units and boundaries, (2.) the scope and type of speech data collected, i.e., text other than isolated sentences, (3.) prosody in relation to speech planning, i.e., information other than lexical, syntactic and semantic, and (4.) an overall organization of prosody for speech flow, i.e., a framework that accommodates the above-mentioned features.
A Contrastive Investigation of Standard Mandarin and Accented Mandarin
Aijun Li 1, Xia Wang 2; 1 Chinese Academy of Social Sciences, China; 2 Nokia Research Center, China
Segmental and supra-segmental acoustic features of standard and Shanghai-accented Mandarin were analyzed in the paper. The Shanghai-accented Mandarin was first classified into three categories, light, middle and heavy, by a statistical method and by dialectologists using subjective criteria. Investigations of initials, finals and tones were then carried out. The results show that Shanghainese speakers always mispronounce or modify certain initial and final phonemes; the heavier the accent, the more frequently the mispronunciations occur, and initials present more modifications than finals. Nine vowels are also compared phonetically for 10 Standard Chinese speakers and 10 Shanghai speakers with the middle class of accent. Additionally, retroflexed finals occur more than 10 times in Standard Chinese. No significant difference exists in the durations of initials and finals for these 20 speakers, and no phonological difference is found in the four lexical tones. It seems that the prosodic difference lies mainly in the rhythmic or stress pattern.
Emotion Control of Chinese Speech Synthesis in Natural Environment
Jianhua Tao; Chinese Academy of Sciences, China
Emotional speech analysis has normally been conducted from the viewpoint of prosodic and articulatory features. For an emotional speech synthesis system, however, two issues appear most important: (1) how to realize the acoustic features of the various emotional states, and (2) how to convey the emotion by combining text analysis with environment detection. To answer these two questions, both acoustic features and emotion focus are analyzed in the paper. Because of differences in background and culture, even the same emotion can have different meanings for different people in certain contexts. The paper also tries to explain whether there are special characteristics of Chinese emotional expression. Finally, the emotion control model is described and some of its rules are listed in a table. The influence of the environment is also classified and integrated into the system. At the end of the paper, the emotion synthesis results are evaluated and compared with previous work.
Session: PWeDe– Poster
Speech Production & Physiology
Time: Wednesday 16.00, Venue: Main Hall, Level -1
Chair: Hideki Kawahara, Wakayama University, Japan
Optimality Criteria in Inverse Problems for Tongue-Jaw Interaction
A.S. Leonov 1, V.N. Sorokin 2; 1 Moscow Physical Engineering Institute, Russia; 2 Russian Academy of Sciences, Russia
We consider the system of articulators “jaw – tip of the tongue” in order to investigate instant and integral optimality criteria in the variational approach to the solution of the speech inverse problem “from total displacement of articulators to their controls”. The required experimental data, i.e., the coordinates of the tip of the tongue and of the lower incisor, have been measured using the X-ray microbeam system together with EMGs of the masseter, longitudinalis superior and longitudinalis inferior. These data were recorded for sequences of the syllable /ta/ at different articulation rates, as well as for elevation and lowering of the tongue tip in non-speech mode. We analyze instant and integral criteria of work, kinetic energy, and elastic and inertial forces for the system. In speech mode, the total displacements of the tongue tip and the jaw are simulated perfectly by any of the instant and integral criteria mentioned above. At the same time, the own displacements of the tongue tip and the jaw are reproduced well only by means of the integral criteria. On the contrary, the own displacements in non-speech mode are reproduced satisfactorily only by the instant optimality criteria.
FEM Analysis Based on 3-D Time-Varying Vocal Tract Shape
Koji Sasaki 1, Nobuhiro Miki 2, Yoshikazu Miyanaga 1; 1 Hokkaido University, Japan; 2 Future University-Hakodate, Japan
We propose a computational method for time-varying spectra based on 3-D vocal tract shapes using the Finite Element Method (FEM). In order to obtain the time-varying spectra, we introduce an auto-mesh algorithm and interpolation. We show the vocal tract transfer function (VTTF) for continuously varying shapes.
Consideration of Muscle Co-Contraction in a Physiological Articulatory Model
Jianwu Dang 1, Kiyoshi Honda 2; 1 JAIST, Japan; 2 ATR-HIS, Japan
Physiological models of the speech organs must consider co-contraction of the muscles, a common phenomenon taking place during articulation. This study investigated co-contraction of the tongue muscles using a physiological articulatory model that replicates the midsagittal regions of the speech organs to simulate articulatory movements during speech [1,2]. The relation between muscle force and tongue movement obtained by the model simulation indicated that each muscle drives the tongue towards an equilibrium position (EP) corresponding to the magnitude of the activation forces. Contributions of the muscles to the tongue movement were evaluated by the distance between the equilibrium positions. Based on the EPs and the muscle contributions, an invariant mapping (the EP map) was established to link a spatial location to a muscle force. Co-contractions between agonist and antagonist muscles were simulated using the EP maps. The simulations demonstrated that coarticulation with multiple targets could be compatibly realized using the co-contraction mechanism. The implementation of the co-contraction mechanism enables relatively independent control over the tongue tip and body.
Robust Techniques for Pre- and Post-Surgical Voice Analysis
Claudia Manfredi 1, Giorgio Peretti 2; 1 University of Florence, Italy; 2 Civil Brescia Hospital, Italy
Objective measurement and tracking of the most relevant voice parameters are obtained for voice signals coming from patients who underwent a thyroplasty implant. Due to the strong noise component and high non-stationarity of the pre-surgical signal, robust methods are proposed, capable of recovering the fundamental frequency, tracking formants, and quantifying the degree of hoarseness as well as the patient’s functional recovery in an objective way. Thanks to its high-resolution properties, autoregressive parametric modelling is considered, with modifications required for the present application. The method is applied to sustained /a/ vowels, recorded from patients suffering from unilateral vocal cord paralysis. Pre- and post-surgical parameters are evaluated that allow the physician to quantify the effectiveness of the Montgomery thyroplasty implant.
Analysis of Lossy Vocal Tract Models for Speech Production
K. Schnell, A. Lacroix; Goethe-University Frankfurt am Main, Germany
Discrete-time tube models describe the propagation of plane sound waves through the vocal tract. Therefore they are important for speech analysis and production. In most cases discrete-time models without losses have been used. In this contribution loss effects are introduced by extended uniform tube elements modeling frequency-dependent losses. The parameters of these extended tube elements can be fitted to experimental and theoretical data of the loss effects of wall vibrations, viscosity and heat conduction.
For the analysis of speech sounds the parameters of a lossy vocal tract model are estimated from speech signals by an optimization algorithm. The spectrum of the analyzed speech can be approximated well by the estimated magnitude response of the lossy vocal tract model. Furthermore the estimated vocal tract areas show reasonable shapes. September 1-4, 2003 – Geneva, Switzerland is positively correlated with the total nasal airflow volume for the nasals. Estimation of Vocal Noise in Running Speech by Means of Bi-Directional Double Linear Prediction F. Bettens 1 , F. Grenez 1 , J. Schoentgen 2 ; 1 Université Libre de Bruxelles, Belgium; 2 National Fund for Scientific Research, Belgium The presentation concerns forward and backward double linear prediction of speech with a view to the characterization of vocal noise due to voice disorders. Bi-directional double linear prediction consists in a conventional short-term prediction followed by a distal inter-cycle prediction that enables removing inter-cycle correlations owing to voicing. The long-term prediction is performed forward and backward. The minimum of the forward and backward prediction error is a cue of vocal noise. The minimum backward and forward prediction error has been calculated for corpora involving connected speech and sustained vowels. Comparisons have been performed between the estimated vocal noise and the perceived hoarseness in steady vowel fragments, as well as between the estimated vocal noise in connected speech and sustained vowels produced by the same speakers. Visualisation of the Vocal Tract Based on Estimation of Vocal Area Functions and Formant Frequencies Abdulhussain E. Mahdi; University of Limerick, Ireland A system for visualisation of the vocal-tract shapes during vowel articulation has been designed and developed. The system generates the vocal tract configuration using a new approach based on extracting both the area functions and the formant frequencies form the acoustic speech signal. Using a linear prediction analysis, the vocal tract area functions and the first three formants are first estimated. The estimated area functions are then mapped to corresponding mid-sagittal distances and displayed as 2D vocal tract lateral graphics. The mapping process is based on a simple numerical algorithm and an accurate reference grid derived from x-rays for the pronunciation of a number English vowels uttered by different speakers. To compensate for possible errors in the estimated area functions due to variations in vocal tract length, the first two section distances are determined by the three formants. The formants are also used to adjust the rounding of the lips and the height of the jawbone. Results show high correlation with x-ray data and the PARAFAC analysis. The system could be useful as a visual sensory aid for speech training of the hearing-impaired. Reproducing Laryngeal Mechanisms with a Two-Mass Model Denisse Sciamarella, Christophe d’Alessandro; LIMSI-CNRS, France Evidence is produced for the correspondence between the oscillation regimes of an up-to-date two-mass model and laryngeal mechanisms. Features presented by experimental electroglottographic signals during transition between laryngeal mechanisms are shown to be reproduced by the model. 
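The visualisation abstract by Mahdi above rests on the classical chain from linear-prediction analysis to reflection coefficients to lossless-tube areas. The sketch below shows that chain only in its textbook form, assuming autocorrelation LP on a windowed frame and one common sign and ordering convention for the tube sections; the paper's formant-based corrections and mid-sagittal mapping are not reproduced here.

    import numpy as np

    def reflection_coefficients(frame, order=12):
        # autocorrelation LP analysis (Levinson-Durbin); the PARCOR coefficients
        # double as reflection coefficients of an equivalent lossless tube
        w = frame * np.hamming(len(frame))
        r = np.correlate(w, w, mode="full")[len(w) - 1:len(w) + order]
        a = np.array([1.0])
        err = r[0]
        ks = []
        for i in range(1, order + 1):
            acc = r[i] + np.dot(a[1:], r[i - 1:0:-1])
            k = -acc / err
            a = np.concatenate([a, [0.0]])
            a = a + k * a[::-1]              # step-up recursion for the LP polynomial
            err *= (1.0 - k * k)
            ks.append(k)
        return np.array(ks)

    def area_function(ks, end_area=1.0):
        # A_{i+1} = A_i * (1 + k_i) / (1 - k_i); the section ordering and the sign
        # of k depend on the chosen convention, so the areas are relative shapes only
        areas = [end_area]
        for k in ks:
            areas.append(areas[-1] * (1.0 + k) / (1.0 - k))
        return np.array(areas)

In the paper the first sections and the lip/jaw parameters are further corrected with the estimated formants before mapping to mid-sagittal distances; the sketch stops at the raw area shape.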
Temporal Properties of the Nasals and Nasalization in Cantonese Methods for Estimation of Glottal Pulses Waveforms Exciting Voiced Speech Beatrice Fung-Wah Khioe; City University of Hong Kong, China Milan Boštík, Milan Sigmund; Brno University of Technology, Czech Republic This paper is an investigation of the temporal properties of the nasals and vowel nasalization in Cantonese by analyzing synchronized nasal and oral airflows. The nasal airflow volume for the vowels in both oral and nasal contexts and for the syllable-final nasals [-m, -n, -N] were also obtained. Results show that (i) the vowel duration in the (C)VN syllables is negatively correlated with the duration for the following nasals [-m, -n, -N]; (ii) the vowel duration in the (C)VN syllables is positively correlated with the duration of nasalization; (iii) the vowel duration in the (C)VN syllables is positively correlated with the nasal airflow volume for the vowel and for the nasalized portion; (iv) the degree of nasalization is inversely correlated with the tongue height of the vowel; and (v) the nasal duration Nowadays, the most popular techniques of the speech processing are the recognition of all kinds (the speech, the speaker and the state of speaker recog.) and the text-to-speech synthesis. In both these domains, there are possibilities to use the glottal pulses waveforms. In the recognition techniques we can use them for the vocal cords description and then use it for the classification of speaker’s state (physiological or mental state) or for the classification of a speaker. In the text-to-speech techniques we can use them for the speech timbre changing. This paper describes some methods for obtaining of glottal pulses waveforms from recorded speech. There are several results obtained by application of described methods. 84 Eurospeech 2003 Wednesday Acoustic Modeling of American English Lateral Approximants Zhaoyan Zhang 1 , Carol Espy-Wilson 1 , Mark Tiede 2 ; 1 University of Maryland, USA; 2 Haskins Laboratories, USA A vocal tract model for an American English /l/ production with lateral channels and a supralingual side branch has been developed. Acoustic modeling of an /l/ production using MRI-derived vocal tract dimensions shows that both the lateral channels and the supralingual side branch contribute to the production of zeros in the F3 to F5 frequency range, thereby resulting in pole-zero clusters around 2-5 kHz in the spectrum of the /l/ sound. Translation and Rotation of the Cricothyroid Joint Revealed by Phonation-Synchronized High-Resolution MRI September 1-4, 2003 – Geneva, Switzerland This paper explores the estimation and mapping of probability models of formant parameter vectors for voice conversion. The formant parameter vectors consist of the frequency, bandwidth and intensity of resonance at formants. Formant parameters are derived from the coefficients of a linear prediction (LP) model of speech. The formant distributions are modelled with phoneme-dependent two-dimensional hidden Markov models with state Gaussian mixture densities. The HMMs are subsequently used for re-estimation of the formant trajectories of speech. Two alternative methods are explored for voice morphing. The first is a non-uniform frequency warping method and the second is based on spectral mapping via rotation of the formant vectors of the source towards those of the target. Both methods transform all formant parameters (Frequency, Bandwidth and Intensity). 
In addition, the factors that affect the selection of the warping ratios for the mapping function are presented. Experimental evaluation of voice morphing examples is presented. Perceptually Weighted Linear Transformations for Voice Conversion Sayoko Takano 1 , Kiyoshi Honda 1 , Shinobu Masaki 2 , Yasuhiro Shimada 2 , Ichiro Fujimoto 2 ; 1 ATR-HIS, Japan; 2 ATR-BAIC, Japan Hui Ye, Steve Young; Cambridge University, U.K. The action of the cricothyroid joint for regulating voice fundamental frequency is thought to have two components; rotation and translation. Its empirical verification, however, has faced methodological problems. This study examines the joint action by means of a phonation-synchronized high-resolution Magnetic Resonance Imaging (MRI) technique, which employs two technical improvements; a custom laryngeal coil to enhance image resolution and an external triggering method to synchronize the subject’s phonation and MRI scan. The obtained images were clear enough to demonstrate two actions of the joint; the cricoid cartilage rotates 5 degrees and the thyroid cartilage translated 1.25 mm in the range of half an octave. Voice conversion is a technique for modifying a source speaker’s speech to sound as if it was spoken by a target speaker. A popular approach to voice conversion is to apply a linear transformation to the spectral envelope. However, conventional parameter estimation based on least square error optimization does not necessarily lead to the best perceptual result. In this paper, a perceptually weighted linear transformation is presented which is based on the minimization of the perceptual spectral distance between the voices of the source and target speakers. The paper describes the new conversion algorithm and presents a preliminary evaluation of the performance of the method based on objective and subjective tests. Voice Conversion with Smoothed GMM and MAP Adaptation Session: PWeDf– Poster Speech Synthesis: Voice Conversion & Miscellaneous Topics Yining Chen 1 , Min Chu 2 , Eric Chang 2 , Jia Liu 1 , Runsheng Liu 1 ; 1 Tsinghua University, China; 2 Microsoft Research Asia, China Time: Wednesday 16.00, Venue: Main Hall, Level -1 Chair: Christophe D’Alessandro, LIMSI, France GMM-Based Voice Conversion Applied to Emotional Speech Synthesis Hiromichi Kawanami 1 , Yohei Iwami 1 , Tomoki Toda 2 , Hiroshi Saruwatari 1 , Kiyohiro Shikano 1 ; 1 Nara Institute of Science and Technology, Japan; 2 ATR-SLT, Japan Voice conversion method is applied to synthesizing emotional speech from standard reading (neutral) speech. Pairs of neutral speech and emotional speech are used for conversion rule training. The conversion adopts GMM (Gaussian Mixture Model) with DFW (Dynamic Frequency Warping). We also adopt STRAIGHT, the highquality speech analysis-synthesis algorithm. As conversion target emotions, (Hot) anger, (cold) sadness and (hot) happiness are used. The converted speech is evaluated objectively first using mel cepstrum distortion as a criterion. The result confirms the GMM-based voice conversion can reduce distortion between target speech and neutral speech. A subjective test is also carried out to investigate perceptual effect. From the viewpoint of influence of prosody, two kinds of prosody are used to synthesis. One is natural prosody extracted from neutral speech and the other is from emotional speech. The result shows that prosody mainly contribute to emotion and spectrum conversion can reinforce it. 
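The Kawanami et al. abstract above, like several other papers in this session, builds on GMM-based spectral mapping. Below is a minimal sketch of the standard joint-density GMM regression for one frame, assuming a GMM over stacked source/target feature vectors has already been trained elsewhere; the DFW and STRAIGHT components of the paper are omitted, and all variable names are illustrative.

    import numpy as np

    def _log_gauss(x, mean, cov):
        diff = x - mean
        _, logdet = np.linalg.slogdet(cov)
        return -0.5 * (len(x) * np.log(2 * np.pi) + logdet
                       + diff @ np.linalg.solve(cov, diff))

    def gmm_convert_frame(x, weights, means, covs, dx):
        # means[m] = [mu_x; mu_y], covs[m] = full joint covariance, dx = dim of x
        logp = np.array([np.log(w) + _log_gauss(x, mu[:dx], S[:dx, :dx])
                         for w, mu, S in zip(weights, means, covs)])
        post = np.exp(logp - logp.max())
        post /= post.sum()                           # P(m | x)
        y = np.zeros(len(means[0]) - dx)
        for p, mu, S in zip(post, means, covs):
            # conditional mean of the target features given x under component m
            y += p * (mu[dx:] + S[dx:, :dx] @ np.linalg.solve(S[:dx, :dx], x - mu[:dx]))
        return y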
Probability Models of Formant Parameters for Voice Conversion Dimitrios Rentzos 1 , Saeed Vaseghi 1 , Qin Yan 1 , Ching-Hsiang Ho 2 , Emir Turajlic 1 ; 1 Brunel University, U.K.; 2 Fortune Institute of Technology, Taiwan In most state-of-the-art voice conversion systems, speech quality of converted utterances is still unsatisfactory. In this paper, STRAIGHT analysis-synthesis framework is used to improve the quality. A smoothed GMM and MAP adaptation is proposed for spectrum conversion to avoid the overly smooth phenomenon in the traditional GMM method. Since frames are processed independently, the GMM based transformation function may generate discontinuous features. Therefore, a time domain low pass filter is applied on the transformation function during the conversion phase. The results of listening evaluations show that the quality of the speech converted by the proposed method is significantly better than that by the traditional GMM method. Meanwhile, speaker identifiability of the converted voice reaches 75%, even when the difference between the source speaker and the target speaker is not very large. A System for Voice Conversion Based on Adaptive Filtering and Line Spectral Frequency Distance Optimization for Text-to-Speech Synthesis Özgül Salor 1 , Mübeccel Demirekler 1 , Bryan Pellom 2 ; 1 Middle East Technical University, Turkey; 2 University of Colorado at Boulder, USA This paper proposes a new voice conversion algorithm that modifies the source speaker’s speech to sound as if produced by a target speaker. To date, most approaches for speaker transformation are based on mapping functions or codebooks. We propose a linear filtering based approach to the problem of mapping the spectral parameters of one speaker to those of the other. In the proposed method, the transformation is performed by filtering the source speaker’s Line Spectral Pair (LSP) frequencies to obtain the LSP frequency estimates of the target speaker. Speech signal is time-aligned into a sequence of HMM states. The filters are designed for each HMM state using the aligned data. We consider two methods for spectral conversion. A linear transformation for the LSP’s was obtained using the adaptive steepest gradient descent 85 Eurospeech 2003 Wednesday approach. Mean values of LSP’s are adjusted to match those of the target speaker. In order to prevent the LSP vectors from resulting in unstable vocal tract filters, weighted least square estimation is used. This approach optimizes differences between source and target LSP’s. Weights are inverses of the source LSP variances. This approach is integrated into a Time Domain Pitch Synchronous Overlap and Add (TD-PSOLA) analysis-synthesis framework. The algorithm is objectively evaluated using a distance measure based on the loglikelihood ratio of observing the input speech, given Gaussian mixture speaker models for both the source and the target voice. Results using the Gaussian mixture model formulated criteria demonstrate consistent transformation using a 5 speaker database. The algorithm offers promise for rapidly adapting text-to-speech systems to new voices. Speaker Conversion in ARX-Based Source-Formant Type Speech Synthesis Hiroki Mori, Hideki Kasuya; Utsunomiya University, Japan A speaker conversion framework for formant synthesis is proposed. With this framework, given a small set of a target speaker’s utterances, segmental features of an original speech can be converted to those of the given speaker. 
Unlike other speaker conversion frameworks, further voice quality modification can also be applied to the converted speech with conventional formant modification techniques. The parameter conversion is based on MLLR in the cepstral domain. The effect of parameter conversion can be seen from the graphical representation of formant placement. The results of an auditory experiment showed that most of the converted speech was perceived as being similar to that of target speakers. Implementing an SSML Compliant Concatenative TTS System Andrew P. Breen, Steve Minnis, Barry Eggleton; Nuance Communications, U.K. The W3C Speech Synthesis Markup Language (SSML) unifies a number of recent related markup languages that have emerged to fill the perceived need for increased, and standardized, user control over Text to Speech (TTS) engines. One of the main drivers for markup has been the increasing use of TTS engines as embedded components of specific applications – which means they are in a position to take advantage of additional knowledge about the text. Although SSML allows improved control over the text normalization process, most of the attention has focused on the level of prosody markup, especially since the prediction of the prosody is generally acknowledged as one of the most significant problems in TTS synthesis. Prosody control is by no means simple due to the large crossdependency between other related aspects of prosody. Prosody control is also of particular complexity for concatenative TTS systems. SSML is about much more than prosody control though – allowing high level engine control such as language switching and voice switching, and low level control such as phonetic input for words. Our experiences in implementing these diverse requirements of the SSML standard are discussed. Acoustic Variations of Focused Disyllabic Words in Mandarin Chinese: Analysis, Synthesis and Perception Zhenglai Gu, Hiroki Mori, Hideki Kasuya; Utsunomiya University, Japan The focus effects on acoustic correlates include both prosodic and segmental modifications. Analysis of 35 focused words in a carrier sentences uttered by 2 male and 3 female speakers has shown that: (1) there is a significant asymmetry of vowel duration as well as F0 range between the pre-stressed and post-stressed syllables, implying that different strategies are employed in the task of focusing disyllabic words, i.e., emphasizing the first syllable as well as weakening the second syllable for the former, but emphasizing the second syllable only for the latter; (2) the tonal combinations significantly affect the variations of both the vowel duration and F0 range; (3) the formant frequencies (F1, F2) are changed systematically in a way that that the formants of the vowels plotted in the (F1, F2) plane were stretched outwards. Perceptual validation of the relative importance of these acoustic cues for signaling a focal September 1-4, 2003 – Geneva, Switzerland word has been accomplished. Results of the perception experiment indicate that F0 is the dominant cue closely related to the judgment of focused word and the other two cues, duration and formant frequencies contribute less to the judgment. An Approach to Common Acoustical Pole and Zero Modeling of Consecutive Periods of Voiced Speech Pedro Quintana-Morales, Juan L. Navarro-Mesa; Universidad de Las Palmas de Gran Canaria, Spain In this paper the open and closed phases within a speech period are separately modeled as acoustical pole-zero filters. 
We approach the estimation of the coefficients associated to the poles and zeros by minimizing a cost function based on the reconstruction error. The cost function leads to a matrix formulation of the error for two time intervals where the error must be defined. This defines a framework that facilitates to model the phases associated to consecutive periods. We give a matrix formulation of the estimation process that let us to attain two main objectives. Firstly, estimate the common-pole structure of several consecutive periods and their particular zero structure. And secondly, estimate their common-pole-zero structure. The experiments are carried out over a speech database of five men and five women. The experiments are done in terms of the reconstruction error and its dependence on the period length and the order of the analysis. Estimating the Vocal-Tract Area Function and the Derivative of the Glottal Wave from a Speech Signal Huiqun Deng, Michael Beddoes, Rabab Ward, Murray Hodgson; University of British Columbia, Canada We present a new method for estimating the vocal-tract area functions from speech signals. First, we point out and correct a longstanding sign error in some literature related to the derivation of the acoustic reflection coefficients of the vocal tract from a speech signal. Next, to eliminate the influence of the glottal wave on the estimation of the vocal-tract filter, we estimate the vocal-tract filter and the derivative of the glottal wave simultaneously from a speech signal. From the vocal-tract filter obtained, we derive the vocal-tract area function. Our improvements to existing methods can be seen from the vocal-tract area functions obtained for vowel sounds /A/ and /i/, each produced by a female and a male subject. They are comparable with those obtained using the magnetic resonance imaging method. The derivatives of the glottal waves for these sounds are also presented, and they show very detailed structures. Glottal Closure Instant Synchronous Sinusoidal Model for High Quality Speech Analysis/Synthesis Parham Zolfaghari 1 , Tomohiro Nakatani 1 , Toshio Irino 2 , Hideki Kawahara 2 , Fumitada Itakura 3 ; 1 NTT Corporation, Japan; 2 Wakayama University, Japan; 3 Nagoya University, Japan In this paper, a glottal event synchronous sinusoidal model is proposed. A glottal event corresponds to the glottal closure instant (GCI), which is accurately estimated using group delay and fixed point analysis in the time domain using energy centroids. The GCI synchronous sinusoidal model allows adequate processing according to the inherent local properties of speech, resulting in phase matching between adjacent and corresponding harmonics that are essential for precise speech analysis. Frequency domain fixed points from mapping filter center frequencies to the instantaneous frequencies of the filter outputs result in highly accurate estimates of the constituent sinusoidal components. Adequate window selection and placement at the GCI is found to be important in obtaining stable sinusoidal components. We demonstrate that the GCI synchronous instantaneous frequency method allows a large reduction in spurious peaks in the spectrum and enables high quality synthesised speech. In speech quality evaluations, glottal synchronous analysis-synthesis results in a 0.4 improvement in MOS over conventional fixed frame rate analysis-synthesis. 
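As a toy illustration of the glottal-event-synchronous processing in the Zolfaghari et al. abstract above, the sketch below anchors two-period analysis frames at given GCIs, picks the strongest spectral peaks in each frame, and resynthesises by overlap-add. The GCI list is assumed to be supplied from elsewhere; the paper's group-delay GCI estimation and fixed-point instantaneous-frequency refinement are not reproduced.

    import numpy as np

    def frame_sinusoids(frame, sr, n_sines=30):
        # strongest local spectral maxima of one windowed frame
        w = np.hanning(len(frame))
        spec = np.fft.rfft(frame * w)
        mag = np.abs(spec)
        peaks = [i for i in range(1, len(mag) - 1)
                 if mag[i] >= mag[i - 1] and mag[i] > mag[i + 1]]
        peaks = sorted(peaks, key=lambda i: mag[i], reverse=True)[:n_sines]
        freqs = np.array(peaks) * sr / len(frame)
        amps = 2.0 * mag[peaks] / w.sum()
        phases = np.angle(spec[peaks])
        return freqs, amps, phases

    def gci_synchronous_resynthesis(x, gcis, sr, n_sines=30):
        # frames span two pitch periods (GCI i to GCI i+2), hop of one period
        y = np.zeros(len(x))
        for a, b in zip(gcis[:-2], gcis[2:]):
            frame = x[a:b]
            if len(frame) < 8:
                continue
            f, amp, ph = frame_sinusoids(frame, sr, n_sines)
            t = np.arange(len(frame)) / sr
            synth = sum(A * np.cos(2 * np.pi * fr * t + p)
                        for fr, A, p in zip(f, amp, ph))
            y[a:b] += np.hanning(len(frame)) * synth   # Hann overlap-add (~50% overlap)
        return y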
86 Eurospeech 2003 Wednesday Mixed Physical Modeling Techniques Applied to Speech Production Matti Karjalainen; Helsinki University of Technology, Finland The Kelly-Lochbaum transmission-line model of the vocal tract started the discrete-time modeling of speech production. More recently similar techniques have been developed in computer music towards a more generalized methodology. In this paper we will study the application of mixed physical modeling to speech production and speech synthesis. These approaches are Digital Waveguides (DWG), Finite Difference Time-Domain schemes (FDTD), and Wave Digital Filters (WDF). The equivalence and interconnectivity of these schemes is shown and flexible real-time synthesizers for articulatory type of speech production are demonstrated. An Expandable Web-Based Audiovisual Text-to-Speech Synthesis System Sascha Fagel, Walter F. Sendlmeier; Technische Universität Berlin, Germany The authors propose a framework for audiovisual speech synthesis systems [1] and present a first implementation of the framework [2], which is called MASSY - Modular Audiovisual Speech SYnthesizer. This paper describes how the audiovisual speech synthesis system, the ‘talking head’, works, how it can be integrated into webapplications, and why it is worthwhile using it. The presented applications use the wrapped audio synthesis, the phonetic and visual articulation modules, and a face module. One of the two already implemented visual articulation models, based on a dominance model for coarticulation, is used. The face is a 3D model described in VRML 97. The facial animation is described in a motion parameter model which is capable of realizing the most important visible articulation gestures [3][4]. MASSY is developed in the client-server paradigm. The server is easy to set up and does not need special or high performance hardware. The required bandwidth is low, and the client is an ordinary web browser with a freely available standard plug-in. The system is used for the evaluation of measured and predicted articulation models and is also suitable for the enhancement of human-computer-interfaces in applications like e.g. virtual tutors in e-learning environments, speech training, video conferencing, computer games, audiovisual information systems, virtual agents, and many more. A Reconstruction of Farkas Kempelen’s Speaking Machine P. Nikléczy, G. Olaszy; Hungarian Academy of Sciences, Hungary The first “speaking machine” of the world was created by the Hungarian polyhistor Farkas Kempelen. He can also be referred to as the first phonetician in the world. He went on improving his speaking machine for twenty-two years, and described the final version in a book published in 1791 in Vienna. The reconstruction was made based on this book. What we wanted to make was not just an exhibition piece but a machine that actually worked. Thus we can go back more than 200 years and study the working of one of the most precious instruments of the Baroque period. We can try out the ways of producing sounds that Kempelen wrote so many pages about in his book. The acoustic patterns of the machine’s speech can be studied by today’s sophisticated signal processing methods and prove or disprove Kempelen’s claims by measurement data. Besides these we took it to be an important task in terms of the history of science to contribute to our knowledge of the beginnings of phonetic research. 
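The Karjalainen abstract above is concerned with digital-waveguide (Kelly-Lochbaum) formulations of the vocal tract. A minimal scattering-junction simulation is sketched below, with one sample of delay per section and direction, a pressure-wave sign convention, and simple frequency-independent terminations; the termination coefficients r_glottis and r_lips and any area values passed in are illustrative assumptions, not values from the paper.

    import numpy as np

    def kelly_lochbaum(areas, excitation, r_glottis=0.97, r_lips=-0.9):
        areas = np.asarray(areas, dtype=float)
        # junction reflection coefficients for pressure waves travelling towards the lips
        r = (areas[:-1] - areas[1:]) / (areas[:-1] + areas[1:])
        M = len(areas)
        fwd = np.zeros(M)   # right-going wave at the lip-side end of each section
        bwd = np.zeros(M)   # left-going wave at the glottis-side end of each section
        out = np.zeros(len(excitation))
        for n, e in enumerate(excitation):
            new_fwd = np.zeros(M)
            new_bwd = np.zeros(M)
            # glottis end: excitation plus partial reflection of the returning wave
            new_fwd[0] = e + r_glottis * bwd[0]
            # interior scattering junctions between sections i and i+1
            for i in range(M - 1):
                new_fwd[i + 1] = (1 + r[i]) * fwd[i] - r[i] * bwd[i + 1]
                new_bwd[i] = r[i] * fwd[i] + (1 - r[i]) * bwd[i + 1]
            # lip end: partial reflection; the transmitted part is taken as output
            new_bwd[M - 1] = r_lips * fwd[M - 1]
            out[n] = (1 + r_lips) * fwd[M - 1]
            fwd, bwd = new_fwd, new_bwd          # one-sample propagation per section
        return out

Driving such a tube with an impulse and inspecting the FFT of the output gives the formant pattern implied by the chosen area function, which is one way to check the equivalence between the waveguide and filter views discussed in the abstract.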
Acoustic Model Selection and Voice Quality Assessment for HMM-Based Mandarin Speech Synthesis Wentao Gu, Keikichi Hirose; University of Tokyo, Japan This paper presents a preliminary study in implementing HMMbased Mandarin speech synthesis system, whose main advantage exists in generating various voices. A variety of acoustic unit September 1-4, 2003 – Geneva, Switzerland representations for Mandarin are compared to select an optimal acoustic model set. Syllabic vs. sub-syllabic, context-independent vs. context-dependent, toneless vs. tonal, initial-final vs. premetoneme models, and models with various numbers of states, are investigated respectively. To take the most advantage of HMM-based speech synthesis, some aspects affecting speaker adaptation quality, especially the selection of adaptation data size, are also studied. Modeling of Various Speaking Styles and Emotions for HMM-Based Speech Synthesis Junichi Yamagishi, Koji Onishi, Takashi Masuko, Takao Kobayashi; Tokyo Institute of Technology, Japan This paper presents an approach to realizing various emotional expressions and speaking styles in synthetic speech using HMMbased speech synthesis. We show two methods for modeling speaking styles and emotions. In the first method, called “style dependent modeling,” each speaking style and emotion is individually modeled. On the other hand, in the second method, called “style mixed modeling,” speaking style or emotion is treated as a contextual factor as well as phonetic, prosodic, and linguistic factors, and all speaking styles and emotions are modeled by a single acoustic model simultaneously. We chose four styles, that is, “reading,” “rough,” “joyful,” and “sad,” and compared those two modeling methods using these styles. From the results of subjective tests, it is shown that both modeling methods have almost the same performance, and that it is possible to synthesize speech with similar speaking styles and emotions to those of the recorded speech. In addition, it is also shown that the style mixed modeling can reduce the number of output distributions in comparison with the style dependent modeling. Towards the Development of a Brazilian Portuguese Text-to-Speech System Based on HMM R. da S. Maia 1 , Heiga Zen 1 , Keiichi Tokuda 1 , Tadashi Kitamura 1 , F.G.V. Resende Jr. 2 ; 1 Nagoya Institute of Technology, Japan; 2 Federal University of Rio de Janeiro, Brazil This paper describes the development of a Brazilian Portuguese text-to-speech system which applies a technique wherein speech is directly synthesized from hidden Markov models. In order to build the synthesizer a speech database was recorded and phonetically segmented. Furthermore, contextual informations about syllables, words, phrases, and utterances were determined, as well as questions for decision tree-based context clustering algorithms. The resulting system presents a fair reproduction of the prosody even when a small database is used for training. Grapheme to Phoneme Conversion and Dictionary Verification Using Graphonemes Paul Vozila, Jeff Adams, Yuliya Lobacheva, Ryan Thomas; Scansoft Inc., USA We present a novel data-driven language independent approach for grapheme to phoneme conversion, which achieves a phoneme error rate of 3.68% and a pronunciation error rate of 17.13% for English. We apply our stochastic model to the task of dictionary verification and conclude that it is able to detect spurious entries, which can then be examined and corrected by a human expert. 
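The graphoneme idea in the Vozila et al. abstract above can be pictured with a very small joint-sequence model. The sketch below assumes the training dictionary has already been aligned into (letter-chunk, phone-chunk) pairs, which is itself a non-trivial step, and uses only add-one-smoothed bigrams with a Viterbi search over spellings; it illustrates the unit, not the authors' model or its accuracy.

    import math
    from collections import defaultdict

    def train_graphonemes(aligned):
        # aligned: list of graphoneme sequences, each a list of (letters, phones) pairs
        uni, bi, by_letters = defaultdict(int), defaultdict(int), defaultdict(set)
        for seq in aligned:
            prev = ("<s>", "")
            for g in seq:
                uni[g] += 1
                bi[(prev, g)] += 1
                by_letters[g[0]].add(g)
                prev = g
        return uni, bi, by_letters

    def g2p(word, uni, bi, by_letters, max_len=3):
        start = ("<s>", "")
        best = {(0, start): (0.0, [])}   # (position, last graphoneme) -> (score, phones)
        for j in range(1, len(word) + 1):
            for i in range(max(0, j - max_len), j):
                for g in by_letters.get(word[i:j], ()):
                    for (pos, prev), (score, phones) in list(best.items()):
                        if pos != i:
                            continue
                        p = (bi.get((prev, g), 0) + 1.0) / (uni.get(prev, 0) + len(uni))
                        s = score + math.log(p)   # add-one smoothed bigram score
                        if (j, g) not in best or s > best[(j, g)][0]:
                            best[(j, g)] = (s, phones + [g[1]])
        finals = [v for (pos, _), v in best.items() if pos == len(word)]
        if not finals:
            return None
        _, phones = max(finals, key=lambda v: v[0])
        return " ".join(p for p in phones if p)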
Improving the Accuracy of Pronunciation Prediction for Unit Selection TTS Justin Fackrell, Wojciech Skut, Kathrine Hammervold; Rhetorical Systems Ltd., U.K. This paper describes a technique which improves the accuracy of pronunciation prediction for unit selection TTS. It does this by performing an orthography-based context-dependent lookup on the unit database. During synthesis, the pronunciations of words which have matching contexts in the unit database are determined. Pronunciations not found using this method are determined using traditional lexicon lookup and/or letter-to-sound rules. In its simplest form, the model involves a lookup based on left and right word con- 87 Eurospeech 2003 Wednesday text. A modified form, which backs-off to a lookup based on right context, is shown to have a much higher firing rate, and to produce more pronunciation variation. The technique is good at occasionally inhibiting vowel reduction; at choosing appropriate pronunciations in case of free variation; and at choosing the correct pronunciation for names. Its effectiveness is assessed by experiments on unseen data; by resynthesis; and by a listening test on sentences rich in reducible words. Detection of List-Type Sentences Taniya Mishra, Esther Klabbers, Jan P.H. van Santen; Oregon Health & Science University, USA In this paper, we explore a text type based scheme of text analysis, through the specific problem of detecting the list text type. This is important because TTS systems that can generate the very distinct F0 contour of lists sound more natural. The presented list detection algorithm uses part-of-speech tags as input, and detects lists by computing the alignment costs of clauses in a sentence. The algorithm detects lists with 80% accuracy. Session: PWeDg– Poster Acoustic Modelling I September 1-4, 2003 – Geneva, Switzerland with tied variances. Finally, scalar quantization is performed for the mean components of the models. With the proposed method, a memory saving of 77.6% was achieved compared with the original continuous density HMMs and 23.0% compared to the quantized parameter HMMs, respectively. The recognition performance of the resulted models was similar to what was obtained with the original continuous density HMMs in all tested environments. Nearest-Neighbor Search Algorithms Based on Subcodebook Selection and its Application to Speech Recognition José A.R. Fonollosa; Universitat Politècnica de Catalunya, Spain Vector quantization (VQ) is a efficient technique for data compression with a minimum distortion. VQ is widely used in applications as speech and image coding, speech recognition, and image retrieval. This paper presents a novel fast nearest-neighbor algorithm and shows its application to speech recognition. The proposed algorithm is based on a fast preselection that reduces the search to a limited number of code vectors. The presented results show that the computational cost of the VQ stage can be significantly reduced without affecting the performance of the speech recognizer. 
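The Fonollosa abstract above speeds up nearest-neighbour search by preselecting a few subcodebooks before the exhaustive comparison. Below is a small sketch of that two-level idea, assuming squared Euclidean distortion and using scipy's k-means only to build the coarse level; the number of groups and the n_probe setting are illustrative parameters, not the paper's.

    import numpy as np
    from scipy.cluster.vq import kmeans2

    class PreselectVQ:
        def __init__(self, codebook, n_groups=16):
            # coarse level: cluster the code vectors into subcodebooks
            self.codebook = np.asarray(codebook, dtype=float)
            self.centroids, labels = kmeans2(self.codebook, n_groups, minit="points")
            self.groups = [np.where(labels == g)[0] for g in range(n_groups)]

        def nearest(self, x, n_probe=2):
            # select the n_probe subcodebooks whose centroids are closest to x ...
            d_coarse = np.sum((self.centroids - x) ** 2, axis=1)
            probe = np.argsort(d_coarse)[:n_probe]
            candidates = np.concatenate([self.groups[g] for g in probe])
            # ... and search exhaustively only inside them
            d = np.sum((self.codebook[candidates] - x) ** 2, axis=1)
            i = int(np.argmin(d))
            return int(candidates[i]), float(d[i])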
Non-Linear Maximum Likelihood Feature Transformation for Speech Recognition Time: Wednesday 16.00, Venue: Main Hall, Level -1 Chair: Melvyn Hunt, Phonetic Systems UK Ltd, United Kingdom Mohamed Kamal Omar, Mark Hasegawa-Johnson; University of Illinois at Urbana-Champaign, USA A New Pitch Synchronous Time Domain Phoneme Recognizer Using Component Analysis and Pitch Clustering Ramon Prieto, Jing Jiang, Chi-Ho Choi; Stanford University, USA A new framework for time domain voiced phoneme recognition is shown. Each speech frame taken for training and recognition is bounded by consecutive glottal closures. A pre-processing stage is designed and implemented to model pitch synchronous frames with gaussian mixture models. Component analysis carried out on the data shows optimal performance with a very small number of components, requiring low computational power. We designed a new clustering technique that, using the pitch period, gives better results than other well known clustering algorithms like k-means. Mixed-Lingual Spoken Word Recognition by Using VQ Codebook Sequences of Variable Length Segments Hiroaki Kojima 1 , Kazuyo Tanaka 2 ; 1 AIST, Japan; 2 University of Tsukuba, Japan We are investigating unsupervised phone modeling. This paper describes a derivation method of VQ codebook sequences of variable length segments from spoken word samples, and also describes evaluation results by applying the method to mixed-lingual speech recognition tasks which include non-native speakers. The VQ codebook is generated based on a piecewise linear segmentation method which includes segmentation, alignment, reduction and clustering processes. Derived codebook sequences are evaluated by speaker independent recognition of a word set which is a mixture of English and Japanese word. Speech samples are uttered by both English and Japanese native speakers. The recognition rates of mixed-lingual 618 words by using a codebook consist of 128 codes are 89.7% for English native speakers and 79.4% for Japanese native speakers in average . Low Memory Acoustic Models for HMM Based Speech Recognition Tommi Lahti, Olli Viikki, Marcel Vasilache; Nokia Research Center, Finland In this paper, we propose a new approach to reduce the memory footprint of HMM based ASR systems. The proposed method involves three steps. Starting from the continuous density HMMs, mixture variances are tied using k-means based vector quantization. Next, the re-estimation of the resulted models is performed Most automatic speech recognition (ASR) systems use Hidden Markov model (HMM) with a diagonal-covariance Gaussian mixture model for the state-conditional probability density function. The diagonal-covariance Gaussian mixture can model discrete sources of variability like speaker variations, gender variations, or local dialect, but can not model continuous types of variability that account for correlation between the elements of the feature vector. In this paper, we present a transformation of the acoustic feature vector that minimizes an empirical estimate of the relative entropy between the likelihood based on the diagonal-covariance Gaussian mixture HMM model and the true likelihood. Based on this formulation, we provide a solution to the problem using volume-preserving maps; existing linear feature transform designs are shown to be special cases of the proposed solution. 
Since most of the acoustic features used in ASR are not linear functions of the sources of correlation in the speech signal, we use a non-linear transformation of the features to minimize this objective function. We describe an iterative algorithm to estimate the parameters of both the volume-preserving feature transformation and the HMM that jointly optimize the objective function for an HMM-based speech recognizer. Using this algorithm, we achieved 2% improvement in phoneme recognition accuracy compared to the baseline system. Our approach shows also improvement in recognition accuracy compared to previous linear approaches like linear discriminant analysis (LDA), maximum likelihood linear transform (MLLT), and independent component analysis (ICA). Automatic Generation of Context-Independent Variable Parameter Models Using Successive State and Mixture Splitting Soo-Young Suk, Ho-Youl Jung, Hyun-Yeol Chung; Yeungnam University, Korea A Speech and Character Combined Recognition System (SCCRS) is developed for working on PDA (Personal Digital Assistants) or on mobile devices. In SCCRS, feature extraction for speech and for character is carried out separately, but recognition is performed in an engine. The recognition engine employs essentially CHMM (Continuous Hidden Markov Model) structure and this CHMM consists of variable parameter topology in order to minimize the number of model parameters and to reduce recognition time. This model also adopts our proposed SSMS (Successive State and Mixture Splitting) for generating context independent model. SSMS optimizes the number of mixtures through splitting in mixture domain and the number of states through splitting in time domain. The recognition results show that the proposed SSMS method can reduce the total number of Gaussian up to 40.0% compared with the fixed parameter 88 Eurospeech 2003 Wednesday models at the same recognition performance in speech recognition system. Data Driven Generation of Broad Classes for Decision Tree Construction in Acoustic Modeling Andrej Žgank, Zdravko Kačič, Bogomir Horvat; University of Maribor, Slovenia A new data driven approach for phonetic broad class generation is proposed. The phonetic broad classes are used by tree based clustering procedure for node questions during the context dependent acoustic models generation for speech recognition. The data driven approach is based on phoneme confusion matrix, which is produced with the phoneme recogniser. Such approach enables the data driven method independency from particular language or phoneme set found in a database. Data driven broad classes generated with this method were compared to expert defined and randomly generated broad classes. The experiment was carried out with the Slovenian SpeechDat(II) database. Six different test configurations were included in the evaluation. Analysis of speech recognition results for different acoustic models showed that the proposed data driven method gives comparable or better results than standard method. An Efficient Integrated Gender Detection Scheme and Time Mediated Averaging of Gender Dependent Acoustic Models Peder A. Olsen, Satya Dharanipragada; IBM T.J. Watson Research Center, USA This paper discusses building gender dependent gaussian mixture models (GMMs) and how to integrate these with an efficient gender detection scheme. Gender specific acoustic models of half the size of a corresponding gender independent acoustic model substantially outperform the larger gender independent acoustic models. 
With perfect gender detection, gender dependent modeling should therefore yield higher recognition accuracy without consuming more memory. Furthermore, as certain phonemes are inherently gender independent (e.g. silence) much of the male and female specific acoustic models can be shared. This paper proposes how to discover which phonemes are inherently similar for male and female speakers and how to efficiently share this information between gender dependent GMMs. A highly accurate and computationally efficient gender detection scheme is suggested that takes advantage of computations inherently done in the speech recognizer. By making the gender assignment probabilistic an increase in word error rate (WER) seen for erroneously gender labeled speakers is avoided. The method of gender detection and probabilistic use of gender is novel and should be of interest beyond mere gender detection. The only requirement for the method to work is that the training data be appropriately labeled. Syllable-Based Acoustic Modeling for Japanese Spontaneous Speech Recognition This paper extends prior work in multi-stream modeling by introducing cross-stream observation dependencies and a new discriminative criterion for selecting such dependencies. Experimental results combining short-term PLP features with long-term TRAP features show gains associated with a multi-stream model with partial state asynchrony over a baseline HMM. Frame-based analyses show significant discriminant information in the added cross-stream dependencies, but so far there are only small gains in recognition accuracy. Pruning Transitions in a Hidden Markov Model with Optimal Brain Surgeon Brian Mak, Kin-Wah Chan; Hong Kong University of Science & Technology, China This paper concerns about reducing the topology of a hidden Markov model (HMM) for a given task. The purpose is two-fold: (1) to select a good model topology with improved generalization capability; and/or (2) to reduce the model complexity so as to save memory and computation costs. The first goal falls into the active research area of model selection. From the model-theoretic research community, various measures such as Bayesian information criterion, minimum description length, minimum message length have been proposed and used with some success. In this paper, we are considering another approach in which a well-performed HMM, though perhaps oversized, is optimally pruned so that the loss in the model training cost function is minimal. The method is known as Optimal Brain Surgeon (OBS) that has been used in the neural network (NN) community. The application of OBS to NN is a constrained optimization problem; its application to HMM is more involved and it becomes a quadratic programming problem with both equality and inequality constraints. The detailed formulation is presented, and the algorithm is shown effective by an example in which HMM state transitions are pruned. The reduced model also results in better generalization performance on unseen test data. Using Pitch Frequency Information in Speech Recognition Mathew Magimai-Doss, Todd A. Stephenson, Hervé Bourlard; IDIAP, Switzerland Automatic Speech Recognition (ASR) systems typically use smoothed spectral features as acoustic observations. In recent studies, it has been shown that complementing these standard features with pitch frequency could improve the system performance of the system [1, 2]. 
While previously proposed systems have been studied in the framework of HMM/GMMs, in this paper we study and compare different ways to include pitch frequency in state-ofthe-art hybrid HMM/ANN system. We have evaluated the proposed system on two different ASR tasks, namely, isolated word recognition and connected word recognition. Our results show that pitch frequency can indeed be used in ASR systems to improve the recognition performance. Hidden Feature Models for Speech Recognition Using Dynamic Bayesian Networks Jun Ogata 1 , Yasuo Ariki 2 ; 1 AIST, Japan; 2 Ryukoku University, Japan We study on a syllable-based acoustic modeling method for Japanese spontaneous speech recognition. Traditionally, morabased acoustic models have been adopted for Japanese read speech recognition systems. In this paper, syllable-based unit and morabased unit are clearly distinguished in their definition, and syllables are shown to be more suitable as an acoustic model for Japanese spontaneous speech recognition. In spontaneous speech, a vowel lengthening occurs frequently, and recognition accuracy is greatly affected by this phenomena. From this viewpoint, we propose an acoustic modeling technique that explicitly incorporates the vowel lengthening in syllable-based HMMs. Experimental results showed that the proposed model could exceed the performance of conventionally used cross-word triphone model and mora-based model in Japanese spontaneous speech recognition task. Cross-stream Observation Dependencies for Multi-Stream Speech Recognition September 1-4, 2003 – Geneva, Switzerland Karen Livescu 1 , James Glass 1 , Jeff Bilmes 2 ; 1 Massachusetts Institute of Technology, USA; 2 University of Washington, USA In this paper, we investigate the use of dynamic Bayesian networks (DBNs) to explicitly represent models of hidden features, such as articulatory or other phonological features, for automatic speech recognition. In previous work using the idea of hidden features, the representation has typically been implicit, relying on a single hidden state to represent a combination of features. We present a class of DBN-based hidden feature models, and show that such a representation can be not only more expressive but also more parsimonious. We also describe a way of representing the acoustic observation model with fewer distributions using a product of models, each corresponding to a subset of the features. Finally, we describe our recent experiments using hidden feature models on the Aurora 2.0 corpus. Özgür Çetin, Mari Ostendorf; University of Washington, USA 89 Eurospeech 2003 Thursday An Efficient Viterbi Algorithm on DBNs September 1-4, 2003 – Geneva, Switzerland and a variant distance measure. Compared to a baseline system using triphones as subword units and with minimal pronunciation variants, this method achieved a relative improvement of the word error rate by 10%. Wei Hu, Yimin Zhang, Qian Diao, Shan Huang; Intel China Research Center, China DBNs (Dynamic Bayesian Networks) [1] are powerful tool in modeling time-series data, and have been used in speech recognition recently [2,3,4]. The “decoding” task in speech recognition means to find the viterbi path [5](in graphical model community, “viterbi path” has the same meaning as MPE “Most Probable Explanation”) for a given acoustic observations. In this paper we describe a new algorithm utilizes a new data structure “backpointer”, which is produced in the “marginalization” procedure in probability inference. 
With these backpointers, the viterbi path can be found in a simple backtracking. We first introduce the concept of backpointer and backtracking; then give the algorithm to compute the viterbi path for DBNs based on backpointer and backtracking. We prove that the new algorithm is correct, faster and more memory saving comparison with old algorithm. Several experiments are conducted to demonstrate the effectiveness of the algorithm on several well known DBNs. We also test the algorithm on a real world DBN model that can recognize continuous digit numbers. Speech Recognition Based on Syllable Recovery Li Zhang, William Edmondson; University of Birmingham, U.K. This paper reports the results of syllable recovery from speech using an articulatory model of the syllable. The contribution of syllable recovery to the overall process of speech recognition is discussed and speech recognition results are presented. Session: SThBb– Oral Time is of the Essence - Dynamic Approaches to Spoken Language Time: Thursday 10.00, Venue: Room 2 Chair: Steve Greenberg, ICSI, USA Time is of the Essence – Dynamic Approaches to Spoken Language Steven Greenberg; The Speech Institute, USA Temporal dynamics provide a fruitful framework with which to examine the relation between information and spoken language. This paper serves as an introduction to the special Eurospeech session on “Time is of the Essence – Dynamic Approaches to Spoken Language,” providing historical and conceptual background germane to timing, as well as a discussion of its scientific and technological prospects. Dynamics is examined from the perspectives of perception, production, neurology, synthesis, recognition and coding, in an effort to define a prospective course for speech technology and research. Spectro-Temporal Interactions in Auditory and Auditory-Visual Speech Processing Ken W. Grant 1 , Steven Greenberg 2 ; 1 Walter Reed Army Medical Center, USA; 2 The Speech Institute, USA HARTFEX: A Multi-Dimensional System of HMM Based Recognisers for Articulatory Features Extraction Tarek Abu-Amer, Julie Carson-Berndsen; University College Dublin, Ireland HARTFEX is a novel system that employs several tiers of HMMs recognisers that work in parallel to extract multi-dimensions of articulatory features. The features segments on the different tiers overlap to account for the co-articulation phenomena. The overlap and precedence relation among features are applied to a phonological parser for further processing. HARTFEX system is built on a modified version of HTK toolkit that allows it to perform multithread multi-feature recognition. The system testing results are highly promising. The recognition accuracy for vowel is 98% and for rhotic is 93%. Current work investigates inherited interdependencies of extracting different feature sets. Automatic Baseform Generation from Acoustic Data Speech recognition often involves the face-to-face communication between two or more individuals. The combined influences of auditory and visual speech information leads to a remarkably robust signal that is greatly resistant to noise, reverberation, hearing loss, and other forms of signal distortion. Studies of auditoryvisual speech processing have revealed that speechreading interacts with audition in both the spectral and temporal domain. For example, not all speech frequencies are equal in their ability to supplement speechreading, with low-frequency speech cues providing more benefit than high-frequency speech cues. 
Additionally, in contrast to auditory speech processing which integrates information across frequency over relatively short time windows (20- 40 ms), auditory-visual speech processing appears to use relatively long time windows of integration (roughly 250 ms). In this paper, some of the basic spectral and temporal interactions between auditory and visual speech channels are enumerated and discussed. Brain Imaging Correlates of Temporal Quantization in Spoken Language Benoît Maison; IBM T.J. Watson Research Center, USA We describe two algorithms for generating pronunciation networks from acoustic data. One is based on raw phonetic recognition and the other uses the spelling of the words and the identification of their language of origin as guides. In both cases, a pruning and voting procedure distills the noisy phonetic sequences into pronunciation networks. Recognition experiments on two large, grammarbased, test sets show a reduction of sentence error rates between 2% and 14%, and of word error rate between 3% to 23% when the learned baseforms are added to our baseline lexicons. Data-Driven Pronunciation Modeling for ASR Using Acoustic Subword Units Thurid Spiess 1 , Britta Wrede 2 , Gernot A. Fink 1 , Franz Kummert 1 ; 1 Universität Bielefeld, Germany; 2 International Computer Science Institute, USA We describe a method to model pronunciation variation for ASR in a data-driven way, namely by use of automatically derived acoustic subword units. The inventory of units is designed so as to produce maximal separable pronunciation variants of words while at the same time only the most important variants for the particular application are trained. In doing so, the optimal number of variants per word is determined iteratively. All this is accomplished (almost) fully automatically by use of a state splitting algorithm David Poeppel; University of Maryland, USA Psychophysical research has established that temporal-integration windows of several different sizes are critical for the analysis of any acoustic speech signal. Recent work from our laboratory has examined speech processing in the human auditory cortex using both hemodynamic (fMRI, PET) and electromagnetic (MEG, EEG) recording techniques. These studies provide evidence for at least two distinct temporal scales relevant to the integration and processing of speech at the cortical level – a relatively short window of 25-50 ms and a longer window of 150- 300 ms. In addition to support for processing on these time scales, there is also evidence for hemispheric asymmetry in temporal quantization. Left auditory cortex shows enhanced sensitivity to rapid temporal changes (possibly associated with segmental and subsegmental perceptual analysis), while right auditory cortex is more sensitive to slower changes (possibly associated with syllabic rate processing and dynamics of pitch). Temporal Aspects of Articulatory Control Elliot Saltzman; Boston University, USA This contribution is focused on temporal aspects of articulatory control during the production of speech. We review a set of computational and experimental results whose focus is on intragestural, transgestural, and intergestural timing properties. The computa- 90 Eurospeech 2003 Thursday tional results are based on recent developments of the task-dynamic model of gestural patterning. 
These developments are focused on the shaping and relative timing of gestural activations, and on the manner in which relative timing among gestures can be interpreted and modeled in the context of systems of coupled nonlinear oscillators. Emphasis is placed on dynamical accounts of prosodic boundary influences on gestural activation patterns, and the manner in which intergestural coupling structures shape the timing patterns and stability properties of onset and coda clusters. The Temporal Organisation of Speech as Gauged by Speech Synthesis September 1-4, 2003 – Geneva, Switzerland Session: OThBc– Oral Topics in Speech Recognition Time: Thursday 10.00, Venue: Room 3 Chair: Sadaoki Furui, Tokyo Inst. of Technology, Japan A Comparison of the Data Requirements of Automatic Speech Recognition Systems and Human Listeners Roger K. Moore; 20/20 Speech Ltd., U.K. Brigitte Zellner Keller; Université de Lausanne, Switzerland The simulation of speech by means of speech synthesis involves, among other things, the ability to mimic typical delivery for different speech styles. This requires a realistic imitation of the manner in which speakers organize their information flow in time (i.e., word grouping boundaries), as well their speech rate with its variations. The originality of our model is grounded in two levels. First, it is assumed that the temporal component plays a dominant role in the simulation of speech rhythm, whereas in traditional language models, temporal issues are mostly put aside. Second, the outcome of our temporal modeling, based on statistical analysis and qualitative parameters, results from the harmonization of various layers (segmental, syllabic, phrasal). The benefit of a multidimensional model is the possibility of imposing subtle quantitative and qualitative effects at various levels, which is a key for respecting a specific language system as well as speech coherence and fluency for different speech styles. Localized Spectro-Temporal Features for Automatic Speech Recognition Michael Kleinschmidt; Universität Oldenburg, Germany Recent results from physiological and psychoacoustic studies indicate that spectrally and temporally localized time-frequency envelope patterns form a relevant basis of auditory perception. This motivates new approaches to feature extraction for automatic speech recognition (ASR) which utilize two-dimensional spectro-temporal modulation filters. The paper provides a motivation and a brief overview on the work related to Localized Spectro-Temporal Features (LSTF). It further focuses on the Gabor feature approach, where a feature selection scheme is applied to automatically obtain a suitable set of Gabor-type features for a given task. The optimized feature sets are examined in ASR experiments with respect to robustness and their statistical properties are analyzed. Modulation Spectral Filtering of Speech Les Atlas; University of Washington, USA Recent auditory physiological evidence points to a modulation frequency dimension in the auditory cortex. This dimension exists jointly with the tonotopic acoustic frequency dimension. Thus, audition can be considered as a relatively slowly-varying twodimensional representation, the “modulation spectrum,” where the first dimension is the well-known acoustic frequency and the second dimension is modulation frequency. We have recently developed a fully invertible analysis/synthesis approach for this modulation spectral transform. 
A general application of this approach is removal or modification of different modulation frequencies in audio or speech signals, which, for example, causes major changes in perceived dynamic character. A specific application of this modification is single-channel multiple-talker separation. Since the introduction of hidden Markov modelling there has been an increasing emphasis on data-driven approaches to automatic speech recognition. This derives from the fact that systems trained on substantial corpora readily outperform those that rely on more phonetic or linguistic priors. Similarly, extra training data almost always results in a reduction in word error rate - “there’s no data like more data”. However, despite this progress, contemporary systems are not able to fulfill the requirements demanded by many potential applications, and performance is still significantly short of the capabilities exhibited by human listeners. For these reasons, the R&D community continues to call for even greater quantities of data in order to train their systems. This paper addresses the issue of just how much data might be required in order to bring the performance of an automatic speech recognition system up to that of a human listener. Modeling Linguistic Features in Speech Recognition Min Tang, Stephanie Seneff, Victor W. Zue; Massachusetts Institute of Technology, USA This paper explores a new approach to speech recognition in which sub-word units are modeled in terms of linguistic features. Specifically, we have adopted a scheme of modeling separately the manner and place of articulation for these units. A novelty of our work is the use of a generalized definition of place of articulation that enables us to map both vowels and consonants into a common linguistic space. Modeling manner and place separately also allows us to explore a multi-stage recognition architecture, in which the search space is successively reduced as more detailed models are brought in. In the 8,000 word PhoneBook isolated word telephone speech recognition task, we show that such an approach can achieve a recognition WER that is 10% better than that achieved in the best results reported in the literature. This performance gain comes with improvements in search space and computation time as well. Impact of Audio Segmentation and Segment Clustering on Automated Transcription Accuracy of Large Spoken Archives Bhuvana Ramabhadran, Jing Huang, Upendra Chaudhari, Giridharan Iyengar, Harriet J. Nock; IBM T.J. Watson Research Center, USA This paper addresses the influence of audio segmentation and segment clustering on automatic transcription accuracy for large spoken archives. The work forms part of the ongoing MALACH project, which is developing advanced techniques for supporting access to the world’s largest digital archive of video oral histories collected in many languages from over 52000 survivors and witnesses of the Holocaust. We present several audio-only and audio-visual segmentation schemes, including two novel schemes: the first is iterative and audio-only, the second uses audio-visual synchrony. Unlike most previous work, we evaluate these schemes in terms of their impact upon recognition accuracy. Results on English interviews show the automatic segmentation schemes give performance comparable to (exorbitantly expensive and impractically lengthy) manual segmentation when using a single pass decoding strategy based on speaker-independent models. 
However, when using a multiple pass decoding strategy with adaptation, results are sensitive to both initial audio segmentation and the scheme for clustering segments prior to adaptation: the combination of our best automatic segmentation and clustering scheme has an error rate 8% worse (relative) to manual audio segmentation and clustering due to the occurrence of “speaker-impure” segments. 91 Eurospeech 2003 Thursday Learning Linguistically Valid Pronunciations from Acoustic Data September 1-4, 2003 – Geneva, Switzerland Session: OThBd– Oral Acoustic Modelling II Françoise Beaufays, Ananth Sankar, Shaun Williams, Mitch Weintraub; Nuance Communications, USA We describe an algorithm to learn word pronunciations from acoustic data. The algorithm jointly optimizes the pronunciation of a word using (a) the acoustic match of this pronunciation to the observed data, and (b) how “linguistically reasonable” the pronunciation is. Variations of word pronunciations in the recognition dictionary (which was created by linguists), are used to train a model of whether new hypothesized pronunciations are reasonable or not. The algorithm is well-suited for proper name pronunciation learning. Experiments on a corporate name dialing database show 40% error rate reduction with respect to a letter-to-phone pronunciation engine. Improvement of Non-Native Speech Recognition by Effectively Modeling Frequently Observed Pronunciation Habits Nobuaki Minematsu, Koichi Osaki, Keikichi Hirose; University of Tokyo, Japan In this paper, two techniques are proposed to enhance the nonnative (Japanese English) speech recognition performance. The first technique effectively integrates orthographic representation of a phoneme as an additional context in state clustering in training tied-state triphones. Non-native speakers often learned the target language not through their ears but through their eyes and it is easily assumed that their pronunciation of a phoneme may depend upon its grapheme. Here, correspondence between a vowel and its grapheme is automatically extracted and used as an additional context in the state clustering. The second technique elaborately couples a Japanese English acoustic model and a Japanese Japanese model to make a parallel model. When using triphones, mapping between the two models should be carefully trained because phoneme sets of both the models are different. Here, several phoneme recognition experiments are done to induce the mapping, and based upon the mapping, a tentative method of the coupling is examined. Results of LVCSR experiments show high validity of both the proposed methods. Non-Audible Murmur Recognition Yoshitaka Nakajima 1 , Hideki Kashioka 1 , Kiyohiro Shikano 1 , Nick Campbell 2 ; 1 Nara Institute of Science and Technology, Japan; 2 ATR-HIS, Japan Time: Thursday 10.00, Venue: Room 4 Chair: John Hansen, Colorado Univ., USA Variable Length Mixtures of Inverse Covariances Vincent Vanhoucke 1 , Ananth Sankar 2 ; 1 Stanford University, USA; 2 Nuance Communications, USA The mixture of inverse covariances model is a low-complexity, approximate decomposition of the inverse covariance matrices in a Gaussian mixture model which achieves high modeling accuracy with very good computational efficiency. In this model, the inverse covariances are decomposed into a linear combination of K shared prototype matrices. In this paper, we introduce an extension of this model which uses a variable number of prototypes per Gaussian for improved efficiency. 
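To make the decomposition concrete (in notation chosen here for illustration rather than taken from the paper), each Gaussian's precision matrix is written as a weighted sum of shared prototype matrices, so the expensive quadratic terms can be computed once per frame and reused:

\[
\Sigma_g^{-1} = \sum_{k=1}^{K} \lambda_{g,k}\, S_k,
\qquad
\log \mathcal{N}(x;\mu_g,\Sigma_g)
= -\tfrac{1}{2}\sum_{k=1}^{K} \lambda_{g,k}\, x^{\top} S_k x
  + x^{\top}\Sigma_g^{-1}\mu_g + c_g ,
\]

where the prototypes S_k are shared by all Gaussians and c_g collects the terms that do not depend on x. The variable-length extension described above allows the number of active weights λ_{g,k} to differ from Gaussian to Gaussian.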
The number of prototypes per Gaussian is optimized using a maximum likelihood criterion. This variable length model is shown to achieve significantly better accuracy at a given complexity level on several speech recognition tasks. Semi-Tied Full Deviation Matrices for Laplacian Density Models Christoph Neukirchen; Philips Research Laboratories, Germany The Philips speech recognition system uses mixtures of Laplacian densities with diagonal deviations to model acoustic feature vectors. Such an approach neglects the correlations between different feature components that typically exist in the acoustic vectors. This paper extends the conventional Laplacian approach to model the between-feature interdependencies explicitly. These extensions either lead to a full deviation matrix model or to an integrated feature space transformation similar to the semi-tied covariances for Gaussian densities. Both methods can be efficiently implemented by exploiting a strong tying of the feature transformations and the deviation matrices, respectively. The novel approach is evaluated on two different digit string recognition tasks. Acoustic Modeling with Mixtures of Subspace Constrained Exponential Models Karthik Visweswariah, Scott Axelrod, Ramesh Gopinath; IBM T.J. Watson Research Center, USA We propose a new style of practical input interface for the recognition of non-audible murmur (NAM), i.e., for the recognition of inaudible speech produced without vibration of the vocal folds. We developed a microphone attachment, which adheres to the skin, applying the principle of a medical stethoscope, found the ideal position for sampling flesh-conducted NAM sound vibration and retrained an acoustic model with NAM samples. Then using the Julius Japanese Dictation Toolkit, we tested the possibilities for practical use of this method in place of an external microphone for analyzing air-conducted voice sound. Additionally we propose laryngeal elevation index (LEI), a new index of prosody, which can show the prosody of NAM without F0, using simple processing of images from medical ultrasonography. We realized and defined NAM never used for input or communication and propose that we should make use of it for the interface of human-human and human-cybernetic machines. Gaussian distributions are usually parameterized with their natural parameters: the mean µ and the covariance Σ. They can also be re-parameterized as exponential models with canonical parameters P = Σ−1 and ψ = P µ. In this paper we consider modeling acoustics with mixtures of Gaussians parameterized with canonical parameters where the parameters are constrained to lie in a shared affine subspace. This class of models includes Gaussian models with various constraints on its parameters: diagonal covariances, MLLT models, and the recently proposed EMLLT and SPAM models. We describe how to perform maximum likelihood estimation of the subspace and parameters within a fixed subspace. In speech recognition experiments, we show that this model improves upon all of the above classes of models with roughly the same number of parameters and with little computational overhead. In particular we get 30-40% relative improvement over LDA+MLLT models when using roughly the same number of parameters. Discriminative Estimation of Subspace Precision and Mean (SPAM) Models Vaibhava Goel, Scott Axelrod, Ramesh Gopinath, Peder A. Olsen, Karthik Visweswariah; IBM T.J. 
Watson Research Center, USA The SPAM model was recently proposed as a very general method for modeling Gaussians with constrained means and covariances. It has been shown to yield significant error rate improvements over other methods of constraining covariances such as diagonal, semitied covariances, and extended maximum likelihood linear transformations. In this paper we address the problem of discriminative estimation of SPAM model parameters, in an attempt to further im- 92 Eurospeech 2003 Thursday prove its performance. We present discriminative estimation under two criteria: maximum mutual information (MMI) and an “errorweighted” training. We show that both these methods individually result in over 20% relative reduction in word error rate on a digit task over maximum likelihood (ML) estimated SPAM model parameters. We also show that a gain of as much as 28% relative can be achieved by combining these two discriminative estimation techniques. The techniques developed in this paper also apply directly to an extension of SPAM called subspace constrained exponential models. Model-Integration Rapid Training Based on Maximum Likelihood for Speech Recognition Shinichi Yoshizawa 1 , Kiyohiro Shikano 2 ; 1 Matsushita Electric Industrial Co. Ltd., Japan; 2 Nara Institute of Science and Technology, Japan Speech recognition technology has been widely used. Considering a training cost of an acoustic model, it is beneficial to reuse preexisting acoustic models for making a suitable one for various apparatus and application. However, a complex acoustic model for high CPU power does not work for low CPU power. And a simple model for fast-processing-demanded application does not work well for high-precision-demanded ones. Therefore, it is important to adjust a model complexity according to apparatus or application, such as a number of mixture of Gaussians. This paper describes a new model-integration-type of training for obtaining a required number of mixture of Gaussians. This training can alter a number of mixture into a required one according to a specification of apparatus or application. We propose a model integration rapid training based on maximum likelihood, and evaluate the recognition performance successfully. On the Use of Kernel PCA for Feature Extraction in Speech Recognition Amaro Lima, Heiga Zen, Yoshihiko Nankaku, Chiyomi Miyajima, Keiichi Tokuda, Tadashi Kitamura; Nagoya Institute of Technology, Japan This paper describes an approach for feature extraction in speech recognition systems using kernel principal component analysis (KPCA). This approach consists in representing speech features as the projection of the extracted speech features mapped into a feature space via a nonlinear mapping onto the principal components. The nonlinear mapping is implicitly performed using the kerneltrick, which is an useful way of not mapping the input space into a feature space explicitly, making this mapping computationally feasible. Better results were obtained by using this approach when compared to the standard technique. September 1-4, 2003 – Geneva, Switzerland Who Knows Carl Bildt? – And What if You don’t? Elisabeth Zetterholm 1 , Kirk P.H. Sullivan 2 , James Green 3 , Erik Eriksson 2 , Jan van Doorn 2 , Peter E. Czigler 4 ; 1 Lund University, Sweden; 2 Umeå University, Sweden; 3 University of Otago, New Zealand; 4 Örebro University, Sweden One problem with using speaker identification by witnesses in legal settings is that high quality imitations can result in speaker misidentification. 
A recent series of experiments has looked at listener acceptance of an imitation of a well known Swedish politician. Results showed that listener expectation of the topic of an imitated passage impacts on the acceptance or rejection of the imitation. The strength of that impact varied according to various listener characteristics, including age of listener. It is likely that age reflected the degree of familiarity with the voice that was being imitated. The present study has reanalyzed the data from Swedish listeners in the previous studies to look at performance according to self reports of whether the listeners were familiar with the politician. Results showed that the acceptance of the imitation by those listeners who reported knowing the politician was more influenced by the topic of the imitated passage than by those who reported not knowing him. Implications of this finding in regard to listeners’ choice of alternate voices in the line up are discussed. Improving the Competitiveness of Discriminant Neural Networks in Speaker Verification C. Vivaracho-Pascual 1 , J. Ortega-Garcia 2 , L. Alonso-Romero 3 , Q. Moro-Sancho 1 ; 1 Universidad de Valladolid, Spain; 2 Universidad Politécnica de Madrid, Spain; 3 Universidad de Salamanca, Spain The Artificial Neural Network (ANN) Multilayer Perceptron (MLP) has shown good performance levels as discriminant system in textindependent Speaker Verification (SV) tasks, as shown in our work presented at Eurospeech 2001. In this paper, substantial improvements with regard to that reference architecture are described. Firstly, a new heuristic method for selecting the impostors in the ANN training process is presented, eliminating the random nature of the system behaviour introduced by the traditional random selection. The use of the proposed selection method, together with an improvement in the classification stage based on a selective use of the network outputs to calculate the final sample score, and an optimisation of the MLP learning coefficient, allow an improvement of over 35% with regard to our reference system, reaching a final EER of 13% over the NIST-AHUMADA database. These promising results show that MLP as discriminant system can be competitive with respect to GMM-based SV systems. On the Fusion of Dissimilarity-Based Classifiers for Speaker Identification Session: PThBe– Poster Speaker & Language Recognition Tomi Kinnunen, Ville Hautamäki, Pasi Fränti; University of Joensuu, Finland Time: Thursday 10.00, Venue: Main Hall, Level -1 Chair: Larry Heck, Nuance Communication, USA Speaker Modeling from Selected Neighbors Applied to Speaker Recognition Yassine Mami, Delphine Charlet; France Télécom R&D, France This paper addresses the estimation of a speaker GMM through the selection and merging of a set of neighbors models for that speaker. The selection of the neighbors models is based on the likelihood score for the training data on a set of potential neighbor GMM. Once the neighbors models are selected, they are merged to give a model of the speaker, which can also be used as an a priori model for an adaptation phase. Experiments show that merging neighborhood models captures significant information about the speaker but doesn’t improve significantly compared to classical UBM-adapted GMM. In this work, we describe a speaker identification system that uses multiple supplementary information sources for computing a combined match score for the unknown speaker. 
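The reliability-weighted combination of scores from heterogeneous feature sets, as described in the abstract above and detailed in its continuation, can be sketched as follows. The feature-set names, weights and scores are illustrative assumptions only; a real system would use per-feature-set model log-likelihoods.

def fuse_scores(scores_by_set, reliability):
    """scores_by_set: {feature_set: {speaker_id: score}}; reliability: {feature_set: weight}."""
    speakers = next(iter(scores_by_set.values())).keys()
    fused = {spk: sum(reliability[f] * scores_by_set[f][spk] for f in scores_by_set)
             for spk in speakers}
    best = max(fused, key=fused.get)
    margin = fused[best] - max(v for k, v in fused.items() if k != best)  # crude confidence
    return best, margin

scores = {"mfcc": {"spk1": -41.2, "spk2": -44.0},
          "f0":   {"spk1": -3.1,  "spk2": -2.8}}
print(fuse_scores(scores, {"mfcc": 0.8, "f0": 0.2}))   # ('spk1', margin)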
Each speaker profile in the database consists of multiple feature vector sets that can vary in their scale, dimensionality, and the number of vectors. The evidence from a given feature set is weighted by its reliability that is set in a priori fashion. The confidence of the identification result is also estimated. The system is evaluated with a corpus of 110 Finnish speakers. The evaluated feature sets include mel-cepstrum, LPC-cepstrum, dynamic cepstrum, long-term averaged spectrum of /A/ vowel, and F0. Robust Speaker Identification Using Posterior Union Models Ji Ming 1 , Darryl Stewart 1 , Philip Hanna 1 , Pat Corr 1 , Jack Smith 1 , Saeed Vaseghi 2 ; 1 Queen’s University Belfast, U.K.; 2 Brunel University, U.K. This paper investigates the problem of speaker identification in noisy conditions, assuming that there is no prior knowledge about the noise. To confine the effect of the noise on recognition, we use a 93 Eurospeech 2003 Thursday multi-stream approach to characterize the speech signal, assuming that while all of the feature streams may be affected by the noise, there may be some streams that are less severely affected and thus still provide useful information about the speaker. Recognition decisions are based on the feature streams that are uncontaminated or least contaminated, thereby reducing the effect of the noise on recognition. We introduce a novel statistical method, the posterior union model, for selecting reliable feature streams. An advantage of the union model is that knowledge of the structure of the noise is not needed, thereby providing robustness to time-varying unpredictable noise corruption. We have tested the new method on the TIMIT database with additive corruption from real-world nonstationary noise; the results obtained are encouraging. “Syncpitch”: A Pseudo Pitch Synchronous Algorithm for Speaker Recognition Ran D. Zilca, Jiří Navrátil, Ganesh N. Ramaswamy; IBM T.J. Watson Research Center, USA Pitch mismatch between enrollment and testing is a common problem in speaker recognition systems. It is well known that the fine spectral structure related to fundamental frequency manifests itself in Mel cepstral features used for speaker recognition. Therefore pitch variations result in variation of the acoustic features, and potentially an increase in error rate. A previous study introduced a signal processing procedure termed depitch that attempts to remove pitch information from the speech signal by forcing every speech frame to be pitch synchronous and include a single pitch cycle. This paper presents a modification of the depitch algorithm, termed syncpitch, that performs pseudo pitch synchronous processing while still preserving the pitch information. The new algorithm has a relatively moderate effect on the speech signal. System combination of syncpitch with a baseline system is shown to improve speaker verification accuracy in experiments conducted on the 2002 NIST Speaker Recognition Evaluation data. A Method for On-Line Speaker Indexing Using Generic Reference Models Soonil Kwon, Shrikanth Narayanan; University of Southern California, USA On-line Speaker indexing is useful for multimedia applications such as meeting or teleconference archiving and browsing. It sequentially detects the points where a speaker identity changes in a multispeaker audio stream, and classifies each speaker segment. The main problem of on-line processing is that we can use only current and previous information in the data stream for any decisioning. 
To address this difficulty, we apply a predetermined reference speakerindependent model set. This set can be useful for more accurate speaker modeling and clustering without actual training of target data speaker models. Once a speaker-independent model is selected from the reference set, it is adapted into a speaker-dependent model progressively. Experiments were performed with HUB-4 Broadcast News Evaluation English Test Material(1999) and Speaker Recognition Benchmark NIST Speech(1999). Results showed that our new technique gave 96.5% indexing accuracy on a telephone conversation data source and 84.3% accuracy on a broadcast news source. Discriminative Training and Maximum Likelihood Detector for Speaker Identification M. Mihoubi, Gilles Boulianne, Pierre Dumouchel; CRIM, Canada This article describes a new approach for cues discrimination between speakers addressed to a speaker identification task. To this end, we make use of elements of decision theory. We propose to decompose the conventional feature space (MFCCs) into two subspaces which carry information about discriminative and confusable sections of the speech signal. The method is based on the idea that, instead of adapting the speakers models to a new test environment, we require the test utterance to fit the speakers models environment. Discriminative sections of training speech are used to estimate the probability density function (pdf) of a discriminative world model (DM), and confusable sections to estimate the probability density function of a confusion world model (CM). The two models are then used as a maximum likelihood detector (filter) September 1-4, 2003 – Geneva, Switzerland at the input of the recogniser. The method was experimented on highly mismatched telephone speech and achieves a considerable improvement (averaging 16% gain in performance) over the baseline GMM system. Novel Approaches for One- and Two-Speaker Detection Sachin S. Kajarekar 1 , André G. Adami 2 , Hynek Hermansky 2 ; 1 SRI International, USA; 2 Oregon Health & Science University, USA The paper reviews OGI submission for NIST 2002 speaker recognition evaluation. It describes the systems submitted for one- and two-speaker detection tasks and the post-evaluation improvements. In one-speaker detection system, we present a new design of a datadriven temporal filter. We show that using few broad phonetic categories improves the performance of speaker recognition system. In post evaluation experiments, we show that combinations with complementary features and modeling techniques significantly improve the performance of the GMM-based system. In two-speaker detection system, we present a structured approach to detect speaker in the conversations. Fusing High- and Low-Level Features for Speaker Recognition Joseph P. Campbell, Douglas A. Reynolds, Robert B. Dunn; Massachusetts Institute of Technology, USA The area of automatic speaker recognition has been dominated by systems using only short-term, low-level acoustic information, such as cepstral features. While these systems have produced low error rates, they ignore higher levels of information beyond low-level acoustics that convey speaker information. Recently published works have demonstrated that such high-level information can be used successfully in automatic speaker recognition systems by improving accuracy and potentially increasing robustness. 
Wide ranging high-level-feature-based approaches using pronunciation models, prosodic dynamics, pitch gestures, phone streams, and conversational interactions were explored and developed under the SuperSID project at the 2002 JHU CLSP Summer Workshop (WS2002): http://www.clsp.jhu.edu/ws2002/groups/supersid/. In this paper, we show how these novel features and classifiers provide complementary information and can be fused together to drive down the equal error rate on the 2001 NIST Extended Data Task to 0.2% – a 71% relative reduction in error over the previous state of the art. Score Normalisation Applied to Open-Set, Text-Independent Speaker Identification P. Sivakumaran 1 , J. Fortuna 2 , Aladdin M. Ariyaeeinia 2 ; 1 20/20 Speech Ltd., U.K.; 2 University of Hertfordshire, U.K. This paper presents an investigation into the relative effectiveness of various score normalisation methods for open-set, textindependent speaker identification. The paper describes the need for score normalisation in this case, and provides a detailed theoretical and experimental analysis of the methods that can be used for this purpose. The experimental investigations are based on the use of speech material drawn from 9 hours of recordings of different Broadcast News. The results clearly demonstrate the significance of improvement offered by score normalisation. It is shown that, amongst various normalisation methods considered, the unconstrained cohort normalisation method achieves the best performance in terms of reducing the errors associated with the open-set nature of the process. Furthermore, it is demonstrated that both the cohort and world model methods can offer very similar effectiveness, and also outperform the T-norm method in this particular case of speaker recognition. On the Number of Gaussian Components in a Mixture: An Application to Speaker Verification Tasks Mijail Arcienega, Andrzej Drygajlo; EPFL, Switzerland Despite all advances in the speaker recognition domain, Gaussian Mixture Models (GMM) remain the state-of-the-art modeling tech- 94 Eurospeech 2003 Thursday nique in speaker recognition systems. The key idea is to approximate the probability density function (pdf) of the feature vectors associated to a speaker with a weighted sum of Gaussian densities. Although the extremely efficient Expectation-Maximization (EM) algorithm can be used for estimating the parameters associated with this Gaussian mixture, there is no explicit method for predicting the best number of Gaussian components in the mixture (also called order of the model). This paper presents an attempt for determining the “optimal” number of components for a given feature database. September 1-4, 2003 – Geneva, Switzerland Session: PThBf– Poster Robust Speech Recognition III Time: Thursday 10.00, Venue: Main Hall, Level -1 Chair: Nelson Morgan, ICSI and UC Berkeley, USA Assessment of Dereverberation Algorithms for Large Vocabulary Speech Recognition Systems Koen Eneman, Jacques Duchateau, Marc Moonen, Dirk Van Compernolle, Hugo Van hamme; Katholieke Universiteit Leuven, Belgium Using Accent Information in ASR Models for Swedish Giampiero Salvi; KTH, Sweden In this study accent information is used in an attempt to improve acoustic models for automatic speech recognition (ASR). First, accent dependent Gaussian models were trained independently. The Bhattacharyya distance was then used in conjunction with agglomerative hierarchical clustering to define optimal strategies for merging those models. 
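A minimal sketch of the distance computation underlying the accent-model merging just described: the Bhattacharyya distance between two Gaussians, which can then drive standard agglomerative clustering. The dimensionality and the toy accent models are illustrative assumptions.

import numpy as np

def bhattacharyya(mu1, cov1, mu2, cov2):
    """Bhattacharyya distance between two multivariate Gaussians."""
    cov = 0.5 * (cov1 + cov2)
    diff = mu1 - mu2
    term1 = 0.125 * diff @ np.linalg.solve(cov, diff)
    term2 = 0.5 * np.log(np.linalg.det(cov) /
                         np.sqrt(np.linalg.det(cov1) * np.linalg.det(cov2)))
    return term1 + term2

# toy agglomerative step: find the two closest accent-dependent models to merge first
models = {"accent_a": (np.zeros(2), np.eye(2)),
          "accent_b": (np.array([0.1, 0.0]), np.eye(2)),
          "accent_c": (np.array([3.0, 3.0]), np.eye(2))}
pairs = [(bhattacharyya(*models[a], *models[b]), a, b)
         for a in models for b in models if a < b]
print(min(pairs))   # smallest distance -> first candidate pair to merge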
The resulting allophonic classes were analyzed and compared with the phonetic literature. Finally, accent “aware” models were built, in which the parametric complexity for each phoneme corresponds to the degree of variability across accent areas and to the amount of training data available for it. The models were compared to models with the same, but evenly spread, overall complexity showing in some cases a slight improvement in recognition accuracy. The performance of large vocabulary recognition systems, for instance in a dictation application, typically deteriorates severely when used in a reverberant environment. This can be partially avoided by adding a dereverberation algorithm as a speech signal preprocessing step. The purpose of this paper is to compare the effect of different speech dereverberation algorithms on the performance of a recognition system. Experiments were conducted on the Wall Street Journal dictation benchmark. Reverberation was added to the clean acoustic data in the benchmark both by simulation and by re-recording the data in a reverberant room. Moreover additive noise was added to investigate its effect on the dereverberation algorithms. We found that dereverberation based on a delay-and-sum beamforming algorithm has the best performance of the investigated algorithms. Estimating Japanese Word Accent from Syllable Sequence Using Support Vector Machine Analysis and Compensation of Packet Loss in Distributed Speech Recognition Using Interleaving Hideharu Nakajima, Masaaki Nagata, Hisako Asano, Masanobu Abe; NTT Corporation, Japan Ben P. Milner, A.B. James; University of East Anglia, U.K. This paper proposes two methods that estimate, from the word reading (syllable sequence), the place in the word where the accent should be placed (hereafter we call it “accent type”). Both methods use a statistical classifier; one directly estimates accent type, and the other first estimates tone high and low labels and then decides the accent type from the tone label sequence obtained before. Experiments show that both offer high accuracy in the estimation of accent type of Japanese proper names without the use of linguistic knowledge. The aim of this work is to improve the robustness of speech recognition systems operating in burst-like packet loss. First a set of highly artificial packet loss profiles are used to analyse their effect on both recognition performance and on the underlying feature vector stream. This indicates that the simple technique of vector repetition can make the recogniser robust to high percentages of packet loss, providing burst lengths are reasonably short. This leads to the proposal of interleaving the feature vector sequence, prior to packetisation, to disperse bursts of packet loss throughout the feature vector stream. PPRLM Optimization for Language Identification in Air Traffic Control Tasks Recognition results on the Aurora connected digits database show considerable accuracy gains across a range of packet losses and burst lengths. For example at a packet loss rate of 50% with an average burst length of 4 packets (corresponding to 8 static vectors) performance is increased from 49.4% to 88.5% with an increase in delay of 90ms. R. Córdoba, G. Prime, J. Macías-Guarasa, J.M. Montero, J. Ferreiros, J.M. Pardo; Universidad Politécnica de Madrid, Spain In this paper, we present the work done in language identification for two air traffic control speech recognizers, one for continuous speech and the other one for a command interface. 
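Returning to the packet-loss interleaving scheme of Milner and James above: a block interleaver that spreads consecutive feature vectors across packets can be sketched as follows. The block shape and the stand-in "vectors" are illustrative assumptions, not the configuration used in the paper.

def interleave(vectors, rows, cols):
    """Write feature vectors row-wise into a rows x cols block and read them out
    column-wise, so a burst of consecutive losses maps to scattered single losses."""
    assert len(vectors) == rows * cols
    return [vectors[r * cols + c] for c in range(cols) for r in range(rows)]

def deinterleave(block, rows, cols):
    out = [None] * (rows * cols)
    for i, v in enumerate(block):
        c, r = divmod(i, rows)
        out[r * cols + c] = v
    return out

vecs = list(range(12))                      # stand-ins for 12 feature vectors
sent = interleave(vecs, rows=3, cols=4)
assert deinterleave(sent, 3, 4) == vecs     # round-trip check
# a burst of 3 consecutive losses in 'sent' now hits vectors that were originally
# 4 frames apart, which simple vector repetition can conceal more easily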
The system is able to distinguish between Spanish and English. We will confirm the advantage of using PPRLM over PRLM. All previous studies show that PPRLM is the technique with the best performance despite of its drawbacks: more processing time and labeled data is needed. No work has been published regarding the optimum weights which should be given to the language models to optimize the performance of the language recognizer. This paper addresses this topic, providing three different approaches for weight selection in the language model score. We will also see that a trigram language model improves performance. The final results are very good even with very short segments of speech. Non-Linear Compression of Feature Vectors Using Transform Coding and Non-Uniform Bit Allocation Ben P. Milner; University of East Anglia, U.K. This paper uses transform coding for compressing feature vectors in distributed speech recognition applications. Feature vectors are first grouped together into non-overlapping blocks and a transformation applied. A non-uniform allocation of bits to the elements of the resultant matrix is based on their relative information content. Analysis of the amplitude distribution of these elements indicates that non-linear quantisation is more appropriate than linear quantisation. Comparative results, based on speech recognition accuracy, confirm this. RASTA filtering is also considered as is shown to reduce the temporal variation of the feature vector stream. Recognition tests demonstrate that compression to bits rates of 2400bps, 1200bps and 800bps has very little effect on recognition accuracy for both clean and noisy speech. For example at a bit rate of 1200bps, recognition accuracy is 98.0% compared to 98.6% with no compression. Predictive Hidden Markov Model Selection for Decision Tree State Tying Jen-Tzung Chien 1 , Sadaoki Furui 2 ; 1 National Cheng Kung University, Taiwan; 2 Tokyo Institute of Technology, Japan 95 Eurospeech 2003 Thursday September 1-4, 2003 – Geneva, Switzerland This paper presents a novel predictive information criterion (PIC) for hidden Markov model (HMM) selection. The PIC criterion is exploited to select the best HMMs, which provide the largest prediction information for generalization of future data. When the randomness of HMM parameters is expressed by a product of conjugate prior densities, the prediction information is derived without integral approximation. In particular, a multivariate t distribution is attained to characterize the prediction information corresponding to HMM mean vector and precision matrix. When performing HMM selection in tree structure HMMs, we develop a top-down prior/posterior propagation algorithm for estimation of structural hyperparameters. The prediction information is accordingly determined so as to choose the best HMM tree model. The parameters of chosen HMMs can be rapidly computed via maximum a posteriori (MAP) estimation. In the evaluation of continuous speech recognition using decision tree HMMs, the PIC model selection criterion performs better than conventional maximum likelihood and minimum description length criteria in building a compact tree structure with moderate tree size and higher recognition rate. Developing a real-life spoken dialogue system must face with many practical issues, where the out-of-vocabulary (OOV) words problem is one of the key difficulties. 
This paper presents the OOV detection mechanism based on the word confidence scoring developed for the d-Ear Attendant system, a spontaneous spoken dialogue system. In the d-Ear Attendant system, an explicit filler model is originally used to detect the presence of OOV words [1]. Although this approach has a satisfactory OOV detection rate, it badly degrades the accuracy of in-vocabulary (IV) detection by 4.4% absolute (from 97% to 92.6%). Such degradation will not be acceptable in a practical system. By using a few commonly used acoustic confidence features and some new context confidence features, our confidence measure method is not only able to detect word-level speech recognition errors, but also has a good ability to detect OOV words with an acceptable false alarm rate. For example, with a false rejection rate of 2.5%, a false acceptance rate of 26% is achieved.

Three Simultaneous Speech Recognition by Integration of Active Audition and Face Recognition for Humanoid

Hiroyuki Manabe, Akira Hiraiwa, Toshiaki Sugimura; NTT DoCoMo Inc., Japan

Kazuhiro Nakadai 1, Daisuke Matsuura 2, Hiroshi G. Okuno 3, Hiroshi Tsujino 4; 1 Japan Science and Technology Corporation, Japan; 2 Tokyo Institute of Technology, Japan; 3 Kyoto University, Japan; 4 Honda Research Institute Japan Co. Ltd., Japan

This paper addresses listening to three simultaneous talkers by a humanoid with two microphones. In such situations, sound separation and automatic speech recognition (ASR) of the separated speech are difficult, because the number of simultaneous talkers exceeds the number of microphones, the signal-to-noise ratio is quite low (around -3 dB), and the noise is not stable due to interfering voices. The humanoid audition system consists of sound separation, face recognition and ASR. Sound sources are separated by an active direction-pass filter (ADPF), which extracts sounds from a specified direction in real time. Since features of sounds separated by the ADPF vary according to the sound direction, ASR uses multiple direction- and speaker-dependent acoustic models. The system integrates the ASR results by using the sound direction and speaker information from face recognition, as well as confidence measures of the ASR results, to select the best one. The resulting system improves word recognition rates for three simultaneous utterances.

Mis-Recognized Utterance Detection Using Multiple Language Models Generated by Clustered Sentences

Katsuhisa Fujinaga 1, Hiroaki Kokubo 2, Hirofumi Yamamoto 2, Genichiro Kikui 2, Hiroshi Shimodaira 1; 1 JAIST, Japan; 2 ATR-SLT, Japan

This paper proposes a new method of detecting mis-recognized utterances based on a ROVER-like voting scheme. Although the ROVER approach is effective in improving recognition accuracy, it has two serious problems from a practical point of view: 1) it is difficult to construct multiple automatic speech recognition (ASR) systems, and 2) the computational cost increases with the number of ASR systems. To overcome these problems, a new method is proposed in which only a single acoustic engine is employed but multiple language models (LMs), consisting of a baseline (main) LM and sub LMs, are used. The sub LMs are generated from clustered sentences and used to rescore the word lattice given by the main LM. As a result, the computational cost is greatly reduced. In experiments, the proposed method achieved 18-point higher precision with a 10% loss of recall compared with the baseline, and 22-point higher precision with a 20% loss of recall.
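A toy sketch of the multiple-language-model voting idea in the last abstract above: each sub-LM rescoring "votes" on whether it agrees with the main-LM hypothesis, and utterances with too few votes are flagged as likely mis-recognitions. The agreement test, the threshold and the stand-in rescoring functions are illustrative assumptions, not the paper's lattice-rescoring setup.

def flag_misrecognitions(hypotheses, sub_lm_rescorers, min_votes=2):
    """hypotheses: {utt_id: main-LM 1-best string};
    sub_lm_rescorers: callables mapping utt_id -> 1-best string under a sub LM."""
    flagged = []
    for utt_id, main_best in hypotheses.items():
        votes = sum(1 for rescore in sub_lm_rescorers if rescore(utt_id) == main_best)
        if votes < min_votes:                 # too little agreement across sub LMs
            flagged.append(utt_id)
    return flagged

# hypothetical rescoring functions standing in for word-lattice rescoring with sub LMs
sub_lms = [lambda u: {"u1": "call taxi", "u2": "book room"}[u],
           lambda u: {"u1": "call taxi", "u2": "cook room"}[u],
           lambda u: {"u1": "call taxi", "u2": "look loom"}[u]]
print(flag_misrecognitions({"u1": "call taxi", "u2": "book room"}, sub_lms))  # ['u2']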
Using Word Confidence Measure for OOV Words Detection in a Spontaneous Spoken Dialog System Hui Sun 1 , Guoliang Zhang 1 , Fang Zheng 2 , Mingxing Xu 1 ; 1 Tsinghua University, China; 2 Beijing d-Ear Technologies Co. Ltd., China Speech Recognition Using EMG; Mime Speech Recognition The cellular phone offers significant benefits but causes several social problems. One such problem is phone use in places where people should not speak, such as trains and libraries. A communication style that would not require voiced speech has the potential to solve this problem. Speech recognition based on electromyography (EMG), which we call “Mime Speech Recognition” is proposed. It not only eases communication in socially sensitive environments, but also improves speech recognition accuracy in noisy environments. In this paper, we report that EMG yields stable and accurate recognition of 5 Japanese vowels uttered statically without generating voice. Moreover, the ability of EMG to handle consonants is described, and the feasibility of basing comprehensive speech recognition systems on EMG is shown. Automatic Generation of Non-Uniform Context-Dependent HMM Topologies Based on the MDL Criterion Takatoshi Jitsuhiro 1 , Tomoko Matsui 2 , Satoshi Nakamura 1 ; 1 ATR-SLT, Japan; 2 Institute of Statistical Mathematics, Japan We propose a new method of automatically creating non-uniform context-dependent HMM topologies by using the Minimum Description Length (MDL) criterion. Phonetic decision tree clustering is widely used, based on the Maximum Likelihood (ML) criterion, and creates only contextual variations. However, it also needs to empirically predetermine control parameters for use as stop criteria, for example, the total number of states. Furthermore, it cannot create topologies with various state lengths automatically. Therefore, we introduce the MDL criterion as split and stop criteria, and use the Successive State Splitting (SSS) algorithm as a method of generating contextual and temporal variations. This proposed method, the MDL-SSS, can automatically create proper topologies without such predetermined parameters. Experimental results show that the MDLSSS can automatically stop splitting and obtain more appropriate HMM topologies than the original one. Furthermore, we investigated the MDL-SSS combined with phonetic decision tree clustering, and this method can automatically obtain the best performance with any heuristic. Comparison of Effects of Acoustic and Language Knowledge on Spontaneous Speech Perception/Recognition Between Human and Automatic Speech Recognizer Norihide Kitaoka, Masahisa Shingu, Seiichi Nakagawa; Toyohashi University of Technology, Japan An automatic speech recognizer uses acoustic knowledge and linguistic knowledge. In large vocabulary speech recognition, acoustic knowledge is modeled by hidden Markov models (HMM), linguistic knowledge is modeled by N-gram (typically bi-gram or trigram), and these models are stochastically integrated. It is thought that 96 Eurospeech 2003 Thursday humans also integrate acoustic and linguistic knowledge of speech when perceiving continuous speech. Automatic speech recognition with HMM and N-gram is thought to roughly model the process of human perception. Although these models have drastically improved the performance of automatic speech recognition of well-formed read speech so far, they cannot deliver sufficient performance on spontaneous speech recognition tasks because of various particular phenomena of spontaneous speech. 
In this paper, we conducted simulation experiments of N-gram language models by combining human acoustic knowledge and instruction of local context and assured that using two words neighboring the target word was enough to improve the performance of recognition when we could use only local information as linguistic knowledge. We also assured that coarticulation affected the perception of short words. We then compared some language models on speech recognizer. We calculated acoustic scores with HMM and then linguistic scores calculated from a language model were added. We obtained 37.5% recognition rate only with acoustic model, whereas we obtained 51.0% with both acoustic and language models, thus the relative performance improvement was 36%. On the other hand, we obtained a 16.5% recognition rate only with the language model, so the acoustic model improved the performance relatively 209%. The performance of the language model on spontaneous speech is almost equal to that on read speech and thus, the improvements of the acoustic models is more effective than that of the language model. Using Statistical Language Modelling to Identify New Vocabulary in a Grammar-Based Speech Recognition System September 1-4, 2003 – Geneva, Switzerland acoustic levels. A potential difficulty with such a model is that advantages gained by the introduction of an articulatory layer might be compromised by limitations due to an insufficiently rich articulatory representation, or by compromises made for mathematical or computational expediency. This paper describes a simple model in which speech dynamics are modelled as linear trajectories in a formant-based ‘articulatory’ layer, and the articulatory-toacoustic mappings are linear. Phone classification results for TIMIT are presented for monophone and triphone systems with a phonelevel syntax. The results demonstrate that provided the intermediate representation is sufficiently rich, or a sufficiently large number of phone-class-dependent articulatory-to-acoustic mapping are employed, classification performance is not compromised. Presentamos un nuevo HMM multinivel en el que una representación ‘articulatoria’ intermedia se incluye entre el nivel de estados y el acústico de superficie. Una dificultad potencial con tal modelo es que las ventajas ganadas por la introducción de una capa articulatoria quizás sean cedidas por limitaciones debidas a una representación articulatoria insuficientemente rica, o por cesiones realizadas por conveniencia matemática o computacional. Este artículo describe un modelo sencillo en el cuál la dinámica del habla se modela como trayectorias lineales en una capa articulatoria basada en formantes, y las proyecciones acústico-articulatorias son lineales. Los resultados de la clasificación de fonemas para TIMIT se presentan para sistemas de monofonemas y trifonemas con una sintaxis a nivel de fonema. Los resultados demuestran que la representación intermedia es suficientemente rica, o se emplea un número suficientemente grande de proyecciones acústico-articulatorias dependiente de la clase de fonema, donde no se comprometen las prestaciones de la clasificación. Automatic Phone Set Extension with Confidence Measure for Spontaneous Speech Genevieve Gorrell; Linköping University, Sweden Spoken language recognition meets with difficulties when an unknown word is encountered. In addition to the new word being unrecognisable, its presence impacts on recognition performance on the surrounding words. 
The possibility is explored here of using a back-off statistical recogniser to allow recognition of out-ofvocabulary words in a grammar-based speech recognition system. This study shows that a statistical language model created from a corpus obtained using a grammar-based system and augmented with minimally-constrained domain-appropriate material allows extraction of words that are out of the vocabulary of the grammar in an unseen corpus with fairly high precision. A Source Model Mitigation Technique for Distributed Speech Recognition Over Lossy Packet Channels Ángel M. Gómez, Antonio M. Peinado, Victoria Sánchez, Antonio J. Rubio; Universidad de Granada, Spain In this paper, we develop a new mitigation technique for a distributed speech recognition system over IP. We have designed and tested several methods to improve the interpolation used in the Aurora DSR ETSI standard without any significant increase of computational cost at the decoder. These methods make use of the information contained in the data-source, because, in IP networks, unlike in cellular networks, no information is received during packet losses. When a packet loss occurs, the lost information can be reconstructed through estimations from the N nearest received packets. Due to the enormous amount of combinations from previous and next received speech vector sequences, we have developed a methodology that drastically reduces the amount of required estimations. The Effect of an Intermediate Articulatory Layer on the Performance of a Segmental HMM Martin J. Russell 1 , Philip J.B. Jackson 2 ; 1 University of Birmingham, U.K.; 2 University of Surrey, U.K. We present a novel multi-level HMM in which an intermediate ‘articulatory’ representation is included between the state and surface- Yi Liu, Pascale Fung; Hong Kong University of Science & Technology, China Extending the phone set is one common approach for dealing with phonetic confusions in spontaneous speech. We propose using likelihood ratio test as a confidence measure for automatic phone set extension to model phonetic confusions. We first extend the standard phone set using dynamic programming (DP) alignment to cover all possible phonetic confusions in training data. Likelihood ratio test is then used as a confidence measure to optimize the extended phonetic units to represent the acoustic samples between two standard phonetic units with high confusability. The optimum set of extended phonetic units is combined with the standard phone set to form a multiple pronunciation dictionary. The effectiveness of this approach is evaluated on spontaneous Mandarin telephony speech. It gives an encouraging 1.09% absolute syllable error rate reduction. Using the extended phone set provides a good balance between the demands of high resolution acoustic model and the available training data. Utterance Verification Using an Optimized k-Nearest Neighbour Classifier R. Paredes, A. Sanchis, E. Vidal, A. Juan; Universitat Politècnica de València, Spain Utterance verification can be seen as a conventional pattern classification problem in which a feature vector is obtained for each hypothesized word in order to classify it as either correct or incorrect. In this paper, we study the application to this problem of an optimized version of the k-Nearest Neighbour decision rule which also incorporates an adequate feature selection technique. Experiments are reported showing that it gives comparatively good results. 
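A minimal sketch of utterance verification as k-nearest-neighbour classification, as described in the abstract above. The two-dimensional confidence features, the training labels and the value of k are purely illustrative; the paper's optimized rule and feature selection are not reproduced here.

import numpy as np

def knn_verify(train_feats, train_labels, word_feat, k=3):
    """Classify a hypothesized word as correct (1) or incorrect (0) by majority
    vote among its k nearest labeled examples in feature space."""
    d = np.linalg.norm(train_feats - word_feat, axis=1)
    nearest = np.argsort(d)[:k]
    return int(np.round(train_labels[nearest].mean()))

# illustrative features: (acoustic confidence, language-model score)
X = np.array([[0.9, 0.8], [0.85, 0.7], [0.2, 0.3], [0.1, 0.4], [0.3, 0.2]])
y = np.array([1, 1, 0, 0, 0])          # 1 = correctly recognized, 0 = error
print(knn_verify(X, y, np.array([0.8, 0.75])))   # -> 1 (accept)
print(knn_verify(X, y, np.array([0.15, 0.25])))  # -> 0 (reject)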
La detección de errores de reconocimiento puede considerarse como un problema clásico de clasificación en dos clases, en el que para cada palabra reconocida se obtiene un vector de características que permite clasificarla como correcta o incorrecta. En este trabajo se estudia la aplicación a este problema de una técnica basada en una regla optimizada de clasificación por los k-vecinos más próximos. Esta técnica permite, además, seleccionar aquellas características que son más importantes en el proceso de clasificación. Los resultados obtenidos muestran que la aplicación de esta técnica consigue comparativamente buenos resultados. 97 Eurospeech 2003 Thursday September 1-4, 2003 – Geneva, Switzerland Session: PThBg– Poster Spoken Language Understanding & Translation ber of concepts handled in our mixed-initiative dialogue system, the proposed system achieves a considerable concept interpretation result on either a typed-in test set or a spoken test set. A high subframe recall rate also verifies an applicability of the proposed system. Time: Thursday 10.00, Venue: Main Hall, Level -1 Chair: Hélène Bonneau-Maynard, LIMSI-CNRS, France Discriminative Methods for Improving Named Entity Extraction on Speech Data Spoken Cross-Language Access to Image Collection via Captions James Horlock, Simon King; University of Edinburgh, U.K. Hsin-Hsi Chen; National Taiwan University, Taiwan In this paper we present a method of discriminatively training language models for spoken language understanding; we show improvements in named entity F-scores on speech data using these improved language models. A comparison between theoretical probabilities associated with manual markup and the actual probabilities of output markup is used to identify probabilities requiring adjustment. We present results which support our hypothesis that improvements in F-scores are possible by using either previously used training data or held out development data to improve discrimination amongst a set of N-gram language models. This paper presents a framework of using Chinese speech to access images via English captions. The formulation and the structure mapping rules of Chinese and English named entities are extracted from an NICT foreign location name corpus. For a named location, name part and keyword part are usually transliterated and translated, respectively. Keyword spotting identifies the keyword from speech queries and narrows down the search space of image collections. A scoring function is proposed to compute the similarity between speech query and annotated captions in terms of International Phonetic Alphabets. The experimental results show that the average rank and the mean reciprocal rank are 2.04 and 0.8322, respectively, which is very close to the best performance, i.e., 1, for both average rank and mean reciprocal rank. Understanding Process for Speech Recognition Salma Jamoussi, Kamel Smaïli, Jean-Paul Haton; LORIA, France The automatic speech understanding problem could be considered as an association problem between two different languages. At the entry, the request expressed in natural language and at the end, just before the interpretation stage, the same request is expressed in term of concepts. A concept represents a given meaning, it is defined by a set of words sharing the same semantic properties. In this paper, we propose a new Bayesian network based method to automatically extract the underlined concepts. We also propose a new approach for the vector representation of words. 
We finish this paper by a description of the postprocessing step during which, we label our sentences and we generate the corresponding SQL queries. This step allows us to validate our speech understanding approach by obtaining good results. In fact, a rate of 92.5% of well formed SQL requests has been achieved on the test corpus. Collecting Machine-Translation-Aided Bilingual Dialogues for Corpus-Based Speech Translation Toshiyuki Takezawa, Genichiro Kikui; ATR-SLT, Japan A huge bilingual corpus of English and Japanese is being built at ATR Spoken Language Translation Research Laboratories in order to enhance speech translation technology, so that people can use a portable translation system for traveling abroad, dining and shopping, as well as hotel situations. As a part of these corpus construction activities, we have been collecting dialogue data using an experimental translation system between English and Japanese. The purpose of this data collection is to study the communication behaviors and linguistic expressions preferred in front of such systems. We use human typists to transcribe the users’ utterances and input them into a machine translation system between English and Japanese instead of using speech recognition systems. In this paper, we present an overview of our activities and discussions based on the basic characteristics. Combination of Finite State Automata and Neural Network for Spoken Language Understanding Chai Wutiwiwatchai, Sadaoki Furui; Tokyo Institute of Technology, Japan This paper proposes a novel approach for spoken language understanding based on a combination of weighted finite state automata and an artificial neural network. The former machine acts as a robust parser, which extracts some semantic information called subframes from an input sentence, then the latter machine interprets a concept of the sentence by considering the existence of subframes and their scores obtained from the automata. With a large num- Improving Statistical Natural Concept Generation in Interlingua-Based Speech-to-Speech Translation Liang Gu, Yuqing Gao, Michael Picheny; IBM T.J. Watson Research Center, USA Natural concept generation is critical to statistical interlingua-based speech translation performance. To improve maximum-entropybased concept generation, a set of novel features and algorithms are proposed including features enabling model training on parallel corpora, employment of confidence thresholds and multiple sets of features. The concept generation error rate is reduced by 43%-50% in our speech translation corpus within limited domains. Improvements are also achieved in our experiments on speech-tospeech translation. How NLP Techniques can Improve Speech Understanding: ROMUS – A Robust Chunk Based Message Understanding System Using Link Grammars Jérôme Goulian, Jean-Yves Antoine, Franck Poirier; University of South-Brittany, France This paper discusses the issue of how a speech understanding system can be made robust against spontaneous speech phenomena (hesitations and repairs) as well as achieving a detailed analysis of spoken French. The Romus system is presented. It implements speech understanding in a two-stage process. The first stage achieves a finite-state shallow parsing that consists in segmenting the recognized sentence into basic units (spoken-adapted chunks). The second one, a Link Grammar parser, looks for inter-chunks dependencies in order to build a rich representation of the semantic structure of the utterance. 
These dependencies are mainly investigated at a pragmatic level through the consideration of a task concept hierarchy. Discussion about the approach adopted, its benefits and limitations, is based on the results of the system’s assessment carried out under different linguistic phenomena during an evaluation campaign held by the French CNRS. Discriminative Training of N-Gram Classifiers for Speech and Text Routing Ciprian Chelba, Alex Acero; Microsoft Research, USA We present a method for conditional maximum likelihood estimation of N-gram models used for text or speech utterance classification. The method employs a well known technique relying on a generalization of the Baum-Eagon inequality from polynomials to rational functions. The best performance is achieved for the 1-gram classifier where conditional maximum likelihood training reduces the class error rate over a maximum likelihood classifier by 45% relative. 98 Eurospeech 2003 Thursday Correction of Disfluencies in Spontaneous Speech Using a Noisy-Channel Approach Matthias Honal 1 , Tanja Schultz 2 ; 1 Universität Karlsruhe, Germany; 2 Carnegie Mellon University, USA In this paper we present a system which automatically corrects disfluencies such as repairs and restarts typically occurring in spontaneously spoken speech. The system is based on a noisy-channel model and its development requires no linguistic knowledge, but only annotated texts. Therefore, it has large potential for rapid deployment and the adaptation to new target languages. The experiments were conducted on spontaneously spoken dialogs from the English VERBMOBIL corpus where a recall of 77.2% and a precision of 90.2% was obtained. To demonstrate the feasibility of rapid adaptation additional experiments on the spontaneous Mandarin Chinese CallHome corpus were performed achieving 49.4% recall and 76.8% precision. Multi-class Extractive Voicemail Summarization Konstantinos Koumpis, Steve Renals; University of Sheffield, U.K. This paper is about a system that extracts principal content words from speech-recognized transcripts of voicemail messages and classifies them into proper names, telephone numbers, dates/times and ‘other’. The short text summaries generated are suitable for mobile messaging applications. The system uses a set of classifiers to identify the summary words, with each word being identified by a vector of lexical and prosodic features. The features are selected using Parcel, an ROC-based algorithm. We visually compare the role of a large number of individual features and discuss effective ways to combine them. We finally evaluate their performance on manual and automatic transcriptions derived from two different speech recognition systems. Active Labeling for Spoken Language Understanding September 1-4, 2003 – Geneva, Switzerland data is available. The first method augments the training data by using the machine-labeled call-types for the unlabeled utterances. The second method, instead, augments the classification model trained using the human-labeled utterances with the machine-labeled ones in a weighted manner. We have evaluated these methods using a call classification system used for AT&T natural dialog customer care system. For call classification, we have used a boosting algorithm. Our results indicate that it is possible to obtain the same classification performance by using 30% less labeled data when the unlabeled data is utilized. 
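The first method in the passage just above (augmenting the labeled training set with machine-labeled utterances) amounts to confidence-thresholded self-training. The generic skeleton below is a sketch under that reading; the classifier interface, confidence threshold and retraining step are assumptions, not the boosting-based setup used by the authors.

def self_train(train_texts, train_labels, unlabeled_texts, fit, predict_proba,
               threshold=0.9):
    """Generic self-training loop: label the unlabeled utterances with the current
    model and add only confidently machine-labeled ones back into the training set."""
    model = fit(train_texts, train_labels)
    added_texts, added_labels = [], []
    for text in unlabeled_texts:
        label, conf = predict_proba(model, text)   # machine label + confidence
        if conf >= threshold:
            added_texts.append(text)
            added_labels.append(label)
    # retrain on human-labeled plus confidently machine-labeled data
    return fit(train_texts + added_texts, train_labels + added_labels)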
This corresponds to a 1-1.5% absolute classification error rate reduction, using the same amount of labeled data. Noise Robustness in Speech to Speech Translation Fu-Hua Liu, Yuqing Gao, Liang Gu, Michael Picheny; IBM T.J. Watson Research Center, USA This paper describes various noise robustness issues in a speechto-speech translation system. We present quantitative measures for noise robustness in the context of speech recognition accuracy and speech-to-speech translation performance. To enhance noise immunity, we explore two approaches to improve the overall speech-to-speech translation performance. First, a multi-style training technique is used to tackle the issue of environmental degradation at the acoustic model level. Second, a pre-processing technique, CDCN, is exploited to compensate for the acoustic distortion at the signal level. Further improvement can be obtained by combining both schemes. In addition to recognition accuracy for speech recognition, this paper studies and examines how closely speech recognition accuracy is related the overall speech-to-speech recognition. When we apply the proposed schemes to an English-toChinese translation task, the word error rate for our speech recognition subsystem is substantially reduced by 28% relative, to 13.2% from 18.9% for test data of 15dB SNR. The corresponding BLEU score improves to 0.478 from 0.43 for the overall speech-to-speech translation. Similar improvements are also observed for a lower SNR condition. Example-Based Bi-Directional Chinese-English Machine Translation with Semi-Automatically Induced Grammars Gokhan Tur, Mazin Rahim, Dilek Z. Hakkani-Tür; AT&T Labs-Research, USA State-of-the-art spoken language understanding (SLU) systems are trained using human-labeled utterances, preparation of which is labor intensive and time consuming. Labeling is an error-prone process due to various reasons, such as labeler errors or imperfect description of classes. Thus, usually a second (or maybe more) pass(es) of labeling is required in order to check and fix the labeling errors and inconsistencies of the first (or earlier) pass(es). In this paper, we check the effect of labeling errors for statistical call classification and evaluate methods of finding and correcting these errors by checking minimum amount of data. We describe two alternative methods to speed up the labeling effort, one is based on the confidences obtained from a prior model and the other completely unsupervised. We call the labeling process employing one of these methods as active labelling. Active labeling aims to minimize the number of utterances to be checked again by automatically selecting the ones that are likely to be erroneous or inconsistent with the previously labeled examples. Although very same methods can be used as a postprocessing step to correct labeling errors, we only consider them as part of the labeling process. We have evaluated these active labelling methods using a call classification system used for AT&T natural dialog customer care system. Our results indicate that it is possible to find about 90% of the labeling errors or inconsistencies by checking just half the data. Exploiting Unlabeled Utterances for Spoken Language Understanding Gokhan Tur, Dilek Z. Hakkani-Tür; AT&T Labs-Research, USA State of the art spoken language understanding systems are trained using labeled utterances, which is labor intensive and time consuming to prepare. 
In this paper, we propose methods for exploiting the unlabeled data in a statistical call classification system within a natural language dialog system. The basic assumption is that some amount of labeled data and relatively larger chunks of unlabeled K.C. Siu, Helen M. Meng, C.C. Wong; Chinese University of Hong Kong, China We have previously developed a framework for bi-directional English-to-Chinese/Chinese-to-English machine translation using semi-automatically induced grammars from unannotated corpora. The framework adopts an example-based machine translation (EBMT) approach. This work reports on three extensions to the framework. First, we investigate the comparative merits of three distance metrics (Kullback-Leibler, Manhattan-Norm and Gini Index) for agglomerative clustering in grammar induction. Second, we seek an automatic evaluation method that can also consider multiple translation outputs generated for a single input sentence based on the BLEU metric. Third, our previous investigation shows that Chinese-to-English translation has lower performance due to incorrect use of English inflectional forms – a consequence of random selection among translation alternatives. We present an improved selection strategy that leverages information from the example parse trees in our EBMT paradigm. Spotting “Hot Spots” in Meetings: Human Judgments and Prosodic Cues Britta Wrede 1 , Elizabeth Shriberg 2 ; 1 International Computer Science Institute, USA; 2 SRI International, USA Recent interest in the automatic processing of meetings is motivated by a desire to summarize, browse, and retrieve important information from lengthy archives of spoken data. One of the most useful capabilities such a technology could provide is a way for users to locate “hot spots” or regions in which participants are highly involved in the discussion (e.g. heated arguments, points of excitement, etc.). We ask two questions about hot spots in meetings in the ICSI Meeting Recorder corpus. First, we ask whether involvement can be judged reliably by human listeners. Results show that despite the subjective nature of the task, raters show significant 99 Eurospeech 2003 Thursday agreement in distinguishing involved from non-involved utterances. Second, we ask whether there is a relationship between human judgments of involvement and automatically extracted prosodic features of the associated regions. Results show that there are significant differences in both F0 and energy between involved and noninvolved utterances. These findings suggest that humans do agree to some extent on the judgment of hot spots, and that acoustic-only cues could be used for automatic detection of hot spots in natural meetings. Combination of CFG and N-Gram Modeling in Semantic Grammar Learning Ye-Yi Wang, Alex Acero; Microsoft Research, USA SGStudio is a grammar authoring tool that eases semantic grammar development. It is capable of integrating different information sources and learning from annotated examples to induct CFG rules. In this paper, we investigate a modification to its underlying model by replacing CFG rules with n-gram statistical models. The new model is a composite of HMM and CFG. The advantages of the new model include its built-in robust feature and its scalability to an ngram classifier when the understanding does not involve slot filling. We devised a decoder for the model. Preliminary results show that the new model achieved 32% error reduction in high resolution understanding. 
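A minimal sketch of the general idea behind replacing brittle CFG rule bodies with n-gram statistics (scoring a slot filler by an exact rule match where one exists, and by a smoothed bigram otherwise) follows; the SlotScorer class, the add-one smoothing and the toy phrases are illustrative assumptions, not details taken from the paper.

from collections import defaultdict

class SlotScorer:
    """Toy scorer: exact CFG-style rule match, else a smoothed bigram fallback."""
    def __init__(self, rule_phrases, training_phrases, vocab_size=1000):
        self.rules = {tuple(p.split()) for p in rule_phrases}  # hand-written rule bodies
        self.vocab_size = vocab_size
        self.bigrams = defaultdict(int)
        self.unigrams = defaultdict(int)
        for phrase in training_phrases:  # annotated example fillers
            words = ["<s>"] + phrase.split() + ["</s>"]
            for w1, w2 in zip(words, words[1:]):
                self.bigrams[(w1, w2)] += 1
                self.unigrams[w1] += 1

    def _bigram_prob(self, w1, w2):
        # add-one smoothing so unseen word pairs still receive a small score
        return (self.bigrams[(w1, w2)] + 1) / (self.unigrams[w1] + self.vocab_size)

    def score(self, phrase):
        words = phrase.split()
        if tuple(words) in self.rules:  # precise but brittle CFG path
            return 1.0
        prob = 1.0                      # robust n-gram path
        for w1, w2 in zip(["<s>"] + words, words + ["</s>"]):
            prob *= self._bigram_prob(w1, w2)
        return prob

scorer = SlotScorer(["new york"], ["new york city", "new jersey"])
print(scorer.score("new york"), scorer.score("new york area"))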
Automatic Title Generation for Chinese Spoken Documents Using an Adaptive K Nearest-Neighbor Approach Shun-Chuan Chen, Lin-shan Lee; National Taiwan University, Taiwan The purpose of automatic title generation is to understand a document and to summarize it with only several but readable words or phrases. It is important for browsing and retrieving spoken documents, which may be automatically transcribed, but it will be much more helpful if given the titles indicating the content subjects of the documents. For title generation for Chinese language, additional problems such as word segmentation and key phrase extraction also have to be solved. In this paper, we developed a new approach of title generation for Chinese spoken documents. It includes key phrase extraction, topic classification, and a new title generation model based on an adaptive K nearest-neighbor concept. The tests were performed with a training corpus including 151,537 news stories in text form with human-generated titles and a testing corpus of 210 broadcast news stories. The evaluation included both objective F1 measures and 5-level subjective human evaluation. Very positive results were obtained. Speech Summarization Using Weighted Finite-State Transducers September 1-4, 2003 – Geneva, Switzerland Cross Domain Chinese Speech Understanding and Answering Based on Named-Entity Extraction Yun-Tien Lee, Shun-Chuan Chen, Lin-shan Lee; National Taiwan University, Taiwan Chinese language is not alphabetic, with flexible wording structure and large number of domain-specific terms generated every day for each domain. In this paper, a new approach for cross-domain Chinese speech understanding and answering is proposed based on named-entity extraction. This approach includes two parts: a speech query recognition (SQR) part and a speech understanding and answering (SUA) part. The huge quantities of news documents retrieved from the Web are used to construct domain-specific lexicons and language models for SQR. The named-entity extraction is used to construct a domain-specific named-entity database for SUA. It is found that by combining domain classifiers and named-entity extraction, we can not only understand cross-domain queries, but also find answers in a specific domain. Evaluation Method for Automatic Speech Summarization Chiori Hori 1 , Takaaki Hori 1 , Sadaoki Furui 2 ; 1 NTT Corporation, Japan; 2 Tokyo Institute of Technology, Japan We have proposed an automatic speech summarization approach that extracts words from transcription results obtained by automatic speech recognition (ASR) systems. To numerically evaluate this approach, the automatic summarization results are compared with manual summarization generated by humans through word extraction. We have proposed three metrics, weighted word precision, word strings precision and summarization accuracy (SumACCY), based on a word network created by merging manual summarization results. In this paper, we propose a new metric for automatic summarization results, weighted summarization accuracy (WSumACCY). This accuracy is weighted by the posterior probability of the manual summaries in the network to give the reliability of each answer extracted from the network. We clarify the goal of each metric and use these metrics to provide automatic evaluation results of the summarized speech. To compare the performance of each evaluation metric, correlations between the evaluation results using these metrics and subjective evaluation by hand are measured. 
It is confirmed that WSumACCY is an effective and robust measure for automatic summarization. An Information Theoretic Approach for Using Word Cluster Information in Natural Language Call Routing Li Li, Feng Liu, Wu Chou; Avaya Labs Research, USA Takaaki Hori, Chiori Hori, Yasuhiro Minami; NTT Corporation, Japan This paper proposes an integrated framework to summarize spontaneous speech into written-style compact sentences. Most current speech recognition systems attempt to transcribe whole spoken words correctly. However, recognition results of spontaneous speech are usually difficult to understand, even if the recognition is perfect, because spontaneous speech includes redundant information, and its style is different to that of written sentences. In particular, the style of spoken Japanese is very different to that of the written language. Therefore, techniques to summarize recognition results into readable and compact sentences are indispensable for generating captions or minutes from speech. Our speech summarization includes speech recognition, paraphrasing, and sentence compaction, which are integrated in a single Weighted Finite-State Transducer (WFST). This approach enables the decoder to employ all the knowledge sources in a one-pass search strategy and reduces the search errors, since all the constraints of the models are used from the beginning of the search. We conducted experiments on a 20kword Japanese lecture speech recognition and summarization task. Our approach yielded improvements in both recognition accuracy and summarization accuracy compared with other approaches that perform speech recognition and summarization separately. In this paper, an information theoretic approach for using word clusters in natural language call routing (NLCR) is proposed. This approach utilizes an automatic word class clustering algorithm to generate word classes from the word based training corpus. In our approach, the information gain (IG) based term selection is used to combine both word term and word class information in NLCR. A joint latent semantic indexing natural language understanding algorithm is derived and studied in NLCR tasks. Comparing with word term based approach, an average performance gain of 10.7% to 14.5% is observed averaged over various training and testing conditions. Unsupervised Topic Discovery Applied to Segmentation of News Transcriptions Sreenivasa Sista, Amit Srivastava, Francis Kubala, Richard Schwartz; BBN Technologies, USA Audio transcriptions from Automatic Speech Recognition systems are a continuous stream of words that are difficult to read. Segmenting these transcriptions into thematically distinct stories and categorizing the stories by topics increases readability and comprehensibility. However, manually defined topic categories are rarely available, and the cost of annotating a large corpus with thousands of distinct topics is high. We describe a procedure for applying the Unsupervised Topic Discovery (UTD) algorithm to the Thematic Story Segmentation procedure for segmenting broadcast news episodes 100 Eurospeech 2003 Thursday into stories and to assign these stories with automatic topic labels. We report our results of applying automatic topics for the task of story segmentation on a collection of news episodes in English and Arabic. Our results indicate that story segmentation performance with automatic topic annotations from UTD is at par with the performance with manual topic annotations. 
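A minimal sketch of lexical-cohesion story segmentation in this spirit follows; it is a generic TextTiling-style scorer, not the UTD procedure itself, and the window size, threshold and toy transcript are arbitrary illustrative choices.

import math
from collections import Counter

def cosine(a, b):
    num = sum(a[w] * b[w] for w in a.keys() & b.keys())
    den = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return num / den if den else 0.0

def story_boundaries(words, window=50, threshold=0.2):
    # hypothesise a story break wherever adjacent word windows share little vocabulary
    boundaries = []
    for i in range(window, len(words) - window + 1, window):
        left, right = Counter(words[i - window:i]), Counter(words[i:i + window])
        if cosine(left, right) < threshold:
            boundaries.append(i)
    return boundaries

transcript = ("the president met congress today " * 20 + "the team scored a late goal tonight " * 20).split()
print(story_boundaries(transcript))  # -> [100], the switch from politics to sport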
Session: PThBh– Poster Speech Signal Processing III
Time: Thursday 10.00, Venue: Main Hall, Level -1
Chair: Javier Hernando, Universitat Politecnica de Catalunya, Spain

Modulation Spectrum for Pitch and Speech Pause Detection
Olaf Schreiner; DaimlerChrysler AG, Germany
This paper describes a new approach to the speech pause detection problem. The goal is to safely decide for a given signal frame whether speech is present or not in order to switch an automatic speech recognizer on or off. The modulation spectrum is introduced as a method to determine the amount of voicing in a signal frame. This method is tested against two standard methods in pitch detection.

Local Regularity Analysis at Glottal Opening and Closure Instants in Electroglottogram Signal Using Wavelet Transform Modulus Maxima
Aïcha Bouzid 1, Noureddine Ellouze 2; 1 Superior Institute of Technological Studies of Sfax, Tunisia; 2 National School of Engineers of Tunis, Tunisia
This paper deals with the characterisation and detection of singularities in the electroglottogram (EGG) signal using wavelet transform modulus maxima. These singularities correspond to glottal opening and closure instants (GOIs and GCIs). Wavelets with one and two vanishing moments are applied to the EGG signal. We show that a wavelet with one vanishing moment is sufficient to detect the singularities of the EGG signal and to measure their regularities. The Lipschitz regularity at any point is the maximum slope of log2 of the wavelet transform modulus maxima as a function of log2 s along the maxima lines converging to that point. Local regularity measures allow us to conclude that the EGG signal is more regular at the glottal opening instant than at the glottal closure instant.

Improved Robustness of Automatic Speech Recognition Using a New Class Definition in Linear Discriminant Analysis
M. Schafföner, M. Katz, S.E. Krüger, A. Wendemuth; Otto-von-Guericke-University Magdeburg, Germany
This work discusses the improvements which can be expected when applying linear feature-space transformations based on Linear Discriminant Analysis (LDA) within automatic speech recognition (ASR). It is shown that different factors influence the effectiveness of LDA transformations. Most importantly, increasing the number of LDA classes by using time-aligned states of Hidden Markov Models instead of phonemes is necessary to obtain improvements predictably. An extension of LDA is presented which utilises the elementary Gaussian components of the mixture probability-density functions of the Hidden Markov Models' states to define actual Gaussian LDA classes. Experimental results on the TIMIT and WSJCAM0 recognition tasks are given, where relative improvements of the error rate of 3.2% and 3.9%, respectively, were obtained.

Voice Conversion Methods for Vocal Tract and Pitch Contour Modification
Oytun Turk 1, Levent M. Arslan 2; 1 Sestek Inc., Turkey; 2 Bogazici University, Turkey
This study proposes two new methods for detailed modeling and transformation of the vocal tract spectrum and the pitch contour. The first method (selective pre-emphasis) relies on band-pass filtering to perform vocal tract transformation. The second method (segmental pitch contour model) focuses on a more detailed modeling of pitch contours. Both methods are utilized in the design of a voice conversion algorithm based on codebook mapping. We compare them with existing vocal tract and pitch contour transformation methods and acoustic feature transplantations in subjective tests. The performance of the selective pre-emphasis based method is similar to the methods used in our previous work at higher sampling rates with a lower prediction order. The results also indicate that the segmental pitch contour model improves voice conversion performance.

Robust Energy Demodulation Based on Continuous Models with Application to Speech Recognition
Dimitrios Dimitriadis, Petros Maragos; National Technical University of Athens, Greece
In this paper, we develop improved schemes for simultaneous speech interpolation and demodulation based on continuous-time models. This leads to robust algorithms to estimate the instantaneous amplitudes and frequencies of the speech resonances and extract novel acoustic features for ASR. The continuous-time models retain the excellent time resolution of the ESAs based on discrete energy operators and perform better in the presence of noise. We also introduce a robust algorithm based on the ESAs for amplitude compensation of the filtered signals. Furthermore, we use robust nonlinear modulation features to enhance the classic cepstrum-based features and use the augmented feature set for ASR applications. ASR experiments show promising evidence that the robust modulation features improve recognition.

A Robust and Sensitive Word Boundary Decision Algorithm
Jong Uk Kim, SangGyun Kim, Chang D. Yoo; KAIST, Korea
A robust and sensitive word boundary decision algorithm for an automatic speech recognition (ASR) system is proposed. The algorithm uses a time-frequency feature to improve both robustness and sensitivity. The time-frequency features are passed through a bank of moving average filters for a temporary decision of the word boundary in each band. The decision results of each band are then passed through a median filter for the final decision. The adoption of the time-frequency feature improves the sensitivity, while the median filtering improves the robustness. The proposed algorithm uses an adaptive threshold based on the signal-to-noise ratio (SNR) in each band, which further improves the decision performance. Experimental results show that the proposed algorithm outperforms the robust algorithm of Q. Li et al.

A Novel Transcoding Algorithm for SMV and G.723.1 Speech Coders via Direct Parameter Transformation
Seongho Seo, Dalwon Jang, Sunil Lee, Chang D. Yoo; KAIST, Korea
In this paper, a novel transcoding algorithm for the Selectable Mode Vocoder (SMV) and the G.723.1 speech coder is proposed. In contrast to the conventional tandem transcoding algorithm, the proposed algorithm converts the parameters of one coder to the other without going through the decoding and encoding process.
The proposed algorithm is composed of four parts: the parameter decoding, Line Spectral Pair (LSP) conversion, pitch period conversion and rate selection. The evaluation results show that the proposed 101 Eurospeech 2003 Thursday algorithm achieves equivalent speech quality to that of tandem transcoding with reduced computational complexity and delay. A Novel Rate Selection Algorithm for Transcoding CELP-Type Codec and SMV September 1-4, 2003 – Geneva, Switzerland Estimation of the Parameters of the Quantitative Intonation Model with Continuous Wavelet Analysis Hans Kruschke, Michael Lenz; Dresden University of Technology, Germany Dalwon Jang, Seongho Seo, Sunil Lee, Chang D. Yoo; KAIST, Korea In this paper, we propose an efficient rate selection algorithm that can be used to transcode speech encoded by any code excited linear prediction (CELP)-type codec into a format compatible with selectable mode vocoder (SMV) via direct parameter transformation. The proposed algorithm performs rate selection using the CELP parameters. Simulation results show that while maintaining similar overall bit-rate compared to the rate selection algorithm of SMV, the proposed algorithm requires less computational load than that of SMV and does not degrade the quality of the transcoded speech. Subband-Based Acoustic Shock Limiting Algorithm on a Low-Resource DSP System Intonation generation in state-of-the-art speech synthesis requires the analysis of a large amount of data. Therefore reliable algorithms for the extraction of the parameters of an intonation model from a given F0 contour are required. This contribution proposes improvements concerning the extraction of the parameters of the quantitative intonation model developed by Fujisaki. The improvements are mainly based on the application of the continuous wavelet transform for the detection of accents and phrases in a F0 contour. A detailed explanation of the underlying idea of this approach is given and the implemented algorithm is described. Results prove that with the proposed method a significant improvement in the accuracy of the extracted parameters is achieved. Thereby the structure and the rules of the algorithm are kept relatively simple. Morphological Filtering of Speech Spectrograms in the Context of Additive Noise G. Choy, D. Hermann, R.L. Brennan, T. Schneider, H. Sheikhzadeh, E. Cornu; Dspfactory Ltd., Canada Acoustic Shock describes a condition where sudden loud acoustic signals in communication equipment causes hearing damage and discomfort to the users. To combat this problem, a subbandbased acoustic shock limiting (ASL) algorithm is proposed and implemented on an ultra low-power DSP system with an input-output latency of 6.5 msec. This algorithm processes the input signal in both the time and frequency domains. This approach allows the algorithm to detect sudden increases in sound level (time-domain), as well as frequency-selectively suppressing shock disturbances in frequency domain. The unaffected portion of the sound spectrum is thus preserved as much as possible. A simple ASL algorithm calibration procedure is proposed to satisfy different sound pressure level (SPL) limit requirements for various communication equipment. Acoustic test results show that the ASL algorithm limits acoustic shock signals to below specified SPL limits while preserving speech quality. Pitch Estimation Using Phase Locked Loops Patricia A. Pelle, Matias L. 
Capeletto; University of Buenos Aires, Argentina In this paper we present a new method for pitch estimation using a system based on phase-locked-loop devices. Three main blocks define our system. The aim of the first one is to make an harmonic decomposition of the speech signal. This stage is implemented using a band-pass filter bank and phase-locked-loops cascaded to the output of each filter. A second block enhances the harmonic corresponding to the fundamental frequency and attenuates all other harmonics. Finally a third stage re-synthesizes a new signal with high energy at the fundamental frequency and extracts pitch contour from that signal using another phase locked-loop. Performance is evaluated over two databases of laryngograph-labeled speech and compared to various well known pitch estimation algorithms. Performance Evaluation of IFAS-Based Fundamental Frequency Estimator in Noisy Environment Dhany Arifianto, Takao Kobayashi; Tokyo Institute of Technology, Japan In this paper, instantaneous frequency amplitude spectrum (IFAS)based fundamental frequency estimator is evaluated with speech signal corrupted by additive white gaussian noise. A key idea of the IFAS-based estimator is the use of degree of regularity of periodicity in spectrum of speech signal, de- fined by a quantity called harmonicity measure, for band selection in the fundamental frequency estimation. Several frequency band and window length selection methods based on harmonicity measure are assessed to find out better performance. It is shown that the performance of the IFASbased estimator is maintained at constant error rate about 1% from clean speech data up to 15 dB and about 11% at 0 dB SNR. For both female and male speakers, the IFAS-based estimator outperforms several well-known methods particularly at 0 dB SNR. Francisco Romero Rodriguez 1 , Wei M. Liu 2 , Nicholas W.D. Evans 2 , John S.D. Mason 2 ; 1 Escuela Superior de Ingenieros, Spain; 2 University of Wales Swansea, U.K. A recent approach to signal segmentation in additive noise [1, 2] uses features of small spectrogram sub-units accrued over the full spectrogram. The original work considered chirp signals in additive white Gaussian noise. This paper extends this work first by considering similar signals at different signal-to-noise ratios and then in the context of speech recognition. For the chirp case, a cost function based on spectrogram area is introduced and this indicates that the segmentation process is robust down to and below 0 dB SNR. For the speech experiments the objectives are again to assess the segmentation capabilities of the process. White Gaussian noise is added to clean speech and the segmentation process applied. The cost function now is automatic speech recognition (ASR) accuracy. After segmentation speech areas are set to one constant level and non-speech areas are set to a lower constant level, thereby assessing the segmentation process and the importance of spectral shape in ASR. For the ASR experiments the TIDigits database is used in a standard AURORA 2 configuration, under mis-matched test and training conditions. With 5 dB SNR for the test set only (clean training) a word accuracy of 56% is achieved. This compares with 16% when the same noisy test data is applied directly to the ASR system without segmentation. Thus the segmentation approach shows that spectral shapes alone (without normal spectral amplitude variations) leads to perhaps surprisingly good ASR results in noisy conditions. 
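A minimal sketch of the two-level flattening step just described, assuming a speech/non-speech mask has already been computed, follows; the constant levels and array shapes are illustrative, not the values used in the experiments.

import numpy as np

def two_level_spectrogram(spec, speech_mask, speech_level=1.0, nonspeech_level=0.1):
    # spec and speech_mask have shape (frames, bins); True marks cells judged to be speech
    return np.where(speech_mask, speech_level, nonspeech_level).astype(spec.dtype)

spec = np.abs(np.random.randn(100, 129))   # stand-in magnitude spectrogram
mask = np.zeros_like(spec, dtype=bool)
mask[20:60, :] = True                      # region the segmenter judged to contain speech
flat = two_level_spectrogram(spec, mask)
print(flat.shape, flat.min(), flat.max())  # only two constant levels remain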
The next stage is to include amplitude information along with appropriate noise compensation. Segmenting Multiple Concurrent Speakers Using Microphone Arrays Guillaume Lathoud, Iain A. McCowan, Darren C. Moore; IDIAP, Switzerland Speaker turn detection is an important task for many speech processing applications. However, accurate segmentation can be hard to achieve if there are multiple concurrent speakers (overlap), as is typically the case in multi-party conversations. In such cases, the location of the speaker, as measured using a microphone array, may provide greater discrimination than traditional spectral features. This was verified in previous work which obtained a global segmentation in terms of single speaker classes, as well as possible overlap combinations. However, such a global strategy suffers from an explosion of the number of overlap classes, as each possible combination of concurrent speakers must be modeled explicitly. In this paper, we propose two alternative schemes that produce an individual segmentation decision for each speaker, implicitly handling all overlapping speaker combinations. The proposed approaches also allow straightforward online implementations. Experiments are presented comparing the segmentation with that obtained using the previous system. 102 Eurospeech 2003 Thursday September 1-4, 2003 – Geneva, Switzerland Segmentation of Speech into Syllable-Like Units Session: OThCc– Oral Speech Signal Processing IV T. Nagarajan, Hema A. Murthy, Rajesh M. Hegde; Indian Institute of Technology, India In the development of a syllable-centric ASR system, segmentation of the acoustic signal into syllabic units is an important stage. This paper presents a minimum phase group delay based approach to segment spontaneous speech into syllable-like units. Here, three different minimum phase signals are derived from the short term energy functions of three sub-bands of speech signals, as if it were a magnitude spectrum. The experiments are carried out on Switchboard and OGI-MLTS corpus and the error in segmentation is found to be utmost 40msec for 85% of the syllable segments. Session: SThCb– Oral Towards a Roadmap for Speech Technology Time: Thursday 13.30, Venue: Room 2 Chair: Steven Krauwer, Utrecht University / ELSNET “Do not attempt to light with match!”: Some Thoughts on Progress and Research Goals in Spoken Dialog Systems Time: Thursday 13.30, Venue: Room 3 Chair: Ben Milner, School of Information Systems A Syllable Segmentation Algorithm for English and Italian Massimo Petrillo, Francesco Cutugno; Università degli Studi di Napoli “Federico II”, Italy In this paper we present a simple algorithm for speech syllabification. It is based on the detection of the most relevant energy maximums, using two different energy calculations: the former from the original signal, the latter from a low-pass filtered version. The system requires setting appropriate values for a number of parameter. The procedure to assign a proper value to each one is reduced to the minimization of a n-variable function, for which we use either a genetic algorithm and simulated annealing. Different estimation of parameters for both Italian and English was carried out. We found the English setting was also suitable for Italian but not the reverse. Modeling Speaking Rate for Voice Fonts Paul Heisterkamp; DaimlerChrysler AG, Germany In view of the current market consolidation in the speech recognition industry, we ask some questions as to what constitutes the ideas underlying the ‘roadmap’ metaphor. 
These questions challenge the traditional faith in ever more complex and ‘natural’ systems as the ultimate goals and keys to full commercial success of Spoken Dialog Systems. As we strictly obey that faith, we consider those questions ‘jesuitic’ rather than ‘heretical’. Mainly, we ask: Have we (i.e. the scientific and industrial communities) been promising the right things to the right people? We leave the question open for discussion, and only cast glimpses at potential alternatives. Multimodality and Speech Technology: Verbal and Non-Verbal Communication in Talking Agents Björn Granström, David House; KTH, Sweden This paper presents methods for the acquisition and modelling of verbal and non-verbal communicative signals for the use in animated talking agents. This work diverges from the traditional focus on the acoustics of speech in speech technology and will be of importance for the realization of future multimodal interfaces, some experimental examples of which are presented at the end of the paper. Roadmaps, Journeys and Destinations Speculations on the Future of Speech Technology Research Ronald A. Cole; University of Colorado at Boulder, USA This article presents thoughts on the future of speech technology research, and a vision of the near future in which computer interaction is characterized by natural face-to-face conversations with lifelike characters that speak, emote and gesture. A first generation of these perceptive animated interfaces are now under development in a project called the Colorado Literacy Tutor, which uses perceptive animated agents in a computer-based literacy program. Spoken Language Output: Realising the Vision Roger K. Moore; 20/20 Speech Ltd., U.K. Significant progress has taken place in ‘Spoken Language Output’ (SLO) R&D, yet there is still some way to go before it becomes a ubiquitous and widely deployed technology. This paper reviews the challenges facing SLO, using ‘Technology Roadmapping’ (TRM) to identify market drivers and future product concepts. It concludes with a summary of the behaviours that will be required in future SLO systems. Ashish Verma, Arun Kumar; Indian Institute of Technology, India Voice fonts are created and stored for a speaker, to be used to synthesize speech in the speaker’s voice. The most important descriptors of voice fonts are spectral envelope for acoustic units and prosodic features such as fundamental frequency and average speaking rate. In this paper, we present a new approach to model the speaking rate so that it can be easily incorporated in voice fonts and used for personality transformation. We model speaking rate in the form of average duration for various acoustic units and categories for the speaker. The speaking rate can be automatically extracted from a speech corpus in the speaker’s voice using the proposed approach. We show how the proposed approach can be implemented, and present its performance evaluation through various subjective tests. A New HMM-Based Approach to Broad Phonetic Classification of Speech Jouni Pohjalainen; Helsinki University of Technology, Finland A novel automatic method is introduced for classifying speech segments into broad phonetic categories using one or more hidden Markov models (HMMs) on long speech utterances. The general method is based on prior analysis of the acoustic features of speech and the properties of HMMs. Three example algorithms are implemented and applied to voiced-unvoiced-silence classification. 
The main advantages of the approach are that it does not require a separate training phase or training data, is adaptive, and that the classification results are automatically smoothed because of the Markov assumption of successive phonetic events. The method is especially applicable to speech recognition.

Acoustic Change Detection and Segment Clustering of Two-Way Telephone Conversations
Xin Zhong 1, Mark A. Clements 1, Sung Lim 2; 1 Georgia Institute of Technology, USA; 2 Fast-Talk Communications, USA
We apply the Bayesian information criterion (BIC) to unsupervised segmentation of two-way telephone conversations according to speaker turns, and then proceed to produce homogeneous clusters consisting of the resulting segments. Such clustering allows more accurate feature normalization and model adaptation for ASR-related tasks. In contrast to similar processing of broadcast data reported in previous work, we can safely assume there are two distinguishable acoustic environments in a call, but new challenges include a much faster changing rate, variation of speaking style by a talker, and the presence of crosstalk and non-meaningful sounds. The algorithm is tested on two-speaker telephone conversations with different genders and via different telephony networks (land-line and cellular). Using the purities of segments and final clusters as the performance measure, the BIC-based algorithm approaches the optimal result without requiring an iterative procedure.

Blind Normalization of Speech from Different Channels
David N. Levin; University of Chicago, USA
We show how to construct a channel-independent representation of speech that has propagated through a noisy reverberant channel. The method achieved greater channel-independence than cepstral mean normalization (CMN), and it was comparable to the combination of CMN and spectral subtraction (SS), despite the fact that no measurements of channel noise or reverberations were required (unlike SS).

Cycle Extraction for Perfect Reconstruction and Rate Scalability
Miguel Arjona Ramírez; University of São Paulo, Brazil
A cycle extractor is presented to be used in a speech coder independently from the coding stage. It samples cycle waveforms (CyWs) of the original prediction residual signal at their natural nonuniform rate. It is shown that perfect reconstruction is possible due to the interplay of these properties for two cycle length normalization and denormalization techniques. The coding stage is coupled to the cycle extractor in the analysis stage by an evolving waveform interpolator that may handle several interpolation methods and sampling rates for a variety of fixed and variable rate coders. The description of the extraction, evolution interpolation and synthesis stages is cast in discrete time. The upper performance bound is perfect reconstruction while the lower bound is equivalent to conventional waveform interpolation (WI) speech coding.

Speech Watermarking by Parametric Embedding with an ℓ∞ Fidelity Criterion
A.R. Gurijala, J.R. Deller Jr.; Michigan State University, USA
Parameter-embedded watermarking of speech signals is effected through slight perturbations of parametric models of some deeply-integrated dynamics of the signal. One of the objectives of the present research is to develop, within the parameter-embedding framework, quantifiable measures of fidelity of the stegosignal and of robustness of the watermark to attack. This paper advances previous developments on parameter-embedded watermarking by introducing a specific technique for watermark selection subject to a fidelity constraint. New results in set-theoretic filtering are used to obtain sets of allowable parameter perturbations (i.e., watermarks) subject to an ℓ∞ constraint on the error between the watermarked and original material. With respect to previous trial-and-error perturbation methods, the set-based parameter perturbation is not only quantified and systematic, it is found to be more robust, and to have a higher threshold of perceptibility with perturbation energy. After a brief review of the general parameter-embedding strategy, the new algorithm for set-theoretic watermark selection is presented. Experiments with real speech data are used to assess robustness and other performance properties. This work is being undertaken in support of the development of the National Gallery of the Spoken Word, a project of the Digital Libraries II Initiative.

Session: OThCd– Oral Speech Synthesis: Miscellaneous II
Time: Thursday 13.30, Venue: Room 4
Chair: Jan van Santen, OGI, USA

Adding Fricatives to the Portuguese Articulatory Synthesiser
António Teixeira, Luis M.T. Jesus, Roberto Martinez; Universidade de Aveiro, Portugal
First attempts at incorporating models of frication into an articulatory synthesizer, with a modular and flexible design, are presented. Although the synthesizer allows the user to choose different combinations of source types, noise volume velocity sources have been used to generate turbulence. Preliminary results indicate that the model is capturing essential characteristics of the transfer functions and spectral characteristics of fricatives. Results also show the potential of performing synthesis based on broad articulatory configurations of fricatives.

Using Acoustic Models to Choose Pronunciation Variations for Synthetic Voices
Christina L. Bennett, Alan W. Black; Carnegie Mellon University, USA
Within-speaker pronunciation variation is a well-known phenomenon; however, attempting to capture and predict a speaker's choice of pronunciations has been mostly overlooked in the field of speech synthesis. We propose a method to utilize acoustic modeling techniques from speech recognition in order to detect a speaker's choice between full and reduced pronunciations.

Comparative Analysis and Synthesis of Formant Trajectories of British and Broad Australian Accents
Qin Yan 1, Saeed Vaseghi 1, Ching-Hsiang Ho 2, Dimitrios Rentzos 1, Emir Turajlic 1; 1 Brunel University, U.K.; 2 Fortune Institute of Technology, Taiwan
The differences between the formant trajectories of British and broad Australian English accents are analysed and used for accent conversion. An improved formant model based on linear prediction (LP) feature analysis and a 2-D hidden Markov model (HMM) of formants is employed for estimation of the formant trajectories of vowels and diphthongs. Comparative analyses of the formant values, the formant trajectories and the formant target points of British and broad Australian accents are presented. A method for ranking the contribution of formants to accent identity is proposed whereby formants are ranked according to the normalised distances between formants across accents. The first two formants are considered more sensitive to accents than other formants. Finally, a set of experiments on accent conversion is presented to transform the broad Australian accent of a speaker to the British Received Pronunciation (RP) accent by formant mapping and prosody modification. Perceptual evaluations of accent conversion results illustrate that besides prosodic correlates such as pitch and duration, formants also play an important role in conveying accents.

A Hybrid Method Oriented to Concatenative Text-to-Speech Synthesis
Ignasi Iriondo, Francesc Alías, Javier Sanchis, Javier Melenchón; Ramon Llull University, Spain
In this paper we present a speech synthesis method for diphone-based text-to-speech systems. Its main goal is to achieve prosodic modifications that result in more natural-sounding synthetic speech. This improvement is especially useful for emotional speech synthesis, which requires high-quality prosodic modification. We present a hybrid method based on TD-PSOLA and the harmonic plus noise model, which incorporates a novel method to jointly modify pitch and time-scale. Preliminary results show an improvement in the synthetic speech quality when high pitch modification is required.

Custom-Tailoring TTS Voice Font – Keeping the Naturalness When Reducing Database Size
Yong Zhao, Min Chu, Hu Peng, Eric Chang; Microsoft Research Asia, China
This paper presents a framework for custom-tailoring the voice font in data-driven TTS systems. Three criteria for unit pruning, the prosodic outlier criterion, the importance criterion and the combination of the two, are proposed. The performance of voice fonts of different sizes, pruned with the three criteria, is evaluated by simulating speech synthesis over a large amount of text and estimating the naturalness with an objective measure at the same time. The result shows that the combined criterion performs the best among the three. The pre-estimated curve of naturalness vs. database size might be used as a reference for custom-tailoring a voice font. The naturalness remains almost unchanged when 50% of the instances are pruned off with the combined criterion.

Session: PThCe– Poster Speaker Recognition & Verification
Time: Thursday 13.30, Venue: Main Hall, Level -1
Chair: Samy Bengio, IDIAP, Martigny, Switzerland

New MAP Estimators for Speaker Recognition
P. Kenny, M. Mihoubi, Pierre Dumouchel; CRIM, Canada
We report the results of some experiments which demonstrate that eigenvoice MAP and eigenphone MAP are at least as effective as classical MAP for discriminative speaker modeling on SWITCHBOARD data.
We show how eigenvoice MAP can be modified to yield a new model-based channel compensation technique which we call eigenchannel MAP. When compared with multi-channel training, eigenchannel MAP was found to reduce speaker identification errors by 50%. A New SVM Approach to Speaker Identification and Verification Using Probabilistic Distance Kernels Pedro J. Moreno, Purdy P. Ho; Hewlett-Packard, USA One major SVM weakness has been the use of generic kernel functions to compute distances among data points. Polynomial, linear, and Gaussian are typical examples. They do not take full advantage of the inherent probability distributions of the data. Focusing on audio speaker identification and verification, we propose to explore the use of novel kernel functions that take full advantage of good probabilistic and descriptive models of audio data. We explore the use of generative speaker identification models such as Gaussian Mixture Models and derive a kernel distance based on the KullbackLeibler (KL) divergence between generative models. In effect our approach combines the best of both generative and discriminative methods. Our results show that these new kernels perform as well as baseline GMM classifiers and outperform generic kernel based SVM’s in both speaker identification and verification on two different audio databases. Adaptive Decision Fusion for Multi-Sample Speaker Verification Over GSM Networks Ming-Cheung Cheung 1 , Man-Wai Mak 1 , Sun-Yuan Kung 2 ; 1 Hong Kong Polytechnic University, China; 2 Princeton University, USA In speaker verification, a claimant may produce two or more utterances. In our previous study [1], we proposed to compute the optimal weights for fusing the scores of these utterances based on their score distribution and our prior knowledge about the score statistics estimated from the mean scores of the corresponding client speaker and some pseudo-impostors during enrollment. As the fusion weights depend on the prior scores, in this paper, we propose to adapt the prior scores during verification based on the likelihood of the claimant being an impostor. To this end, a pseudo-imposter September 1-4, 2003 – Geneva, Switzerland GMM score model is created for each speaker. During verification, the claimant’s scores are fed to the score model to obtain a likelihood for adapting the prior score. Experimental results based on the GSM-transcoded speech of 150 speakers from the HTIMIT corpus demonstrate that the proposed prior score adaptation approach provides a relative error reduction of 15% when compared with our previous approach where the prior scores are non-adaptive. Environment Adaptation for Robust Speaker Verification Kwok-Kwong Yiu 1 , Man-Wai Mak 1 , Sun-Yuan Kung 2 ; 1 Hong Kong Polytechnic University, China; 2 Princeton University, USA In speaker verification over public telephone networks, utterances can be obtained from different types of handsets. Different handsets may introduce different degrees of distortion to the speech signals. This paper attempts to combine a handset selector with (1) handset-specific transformations and (2) handset-dependent speaker models to reduce the effect caused by the acoustic distortion. 
Specifically, a number of Gaussian mixture models are independently trained to identify the most likely handset given a test utterance; then during recognition, the speaker model and background model are either transformed by MLLR-based handsetspecific transformation or respectively replaced by a handsetdependent speaker model and a handset-dependent background model whose parameters were adapted by reinforced learning to fit the new environment. Experimental results based on 150 speakers of the HTIMIT corpus show that environment adaptation based on both MLLR and reinforced learning outperforms the classical CMS, Hnorm and Tnorm approaches, with MLLR adaptation achieves the best performance. On Cohort Selection for Speaker Verification Yaniv Zigel, Arnon Cohen; Ben-Gurion University, Israel Speaker verification systems require some kind of background model to reliably perform the verification task. Several algorithms have been proposed for the selection of cohort models to form a background model. This paper proposes a new cohort selection method called the Close Impostor Clustering (CIC). The new method is shown to outperform several other methods in a textdependent verification task. Several normalization methods are also compared. With three cohort models and the best scorenormalization method, the CIC yielded an average Equal Error Rate (EER) of 0.8%, while the second best method (Maximally-Spread Close, MSC) yielded average EER of 1.1%. Speaker Characterization Using Principal Component Analysis and Wavelet Transform for Speaker Verification C. Tadj, A. Benlahouar; École de Technologie Supérieure, Canada In this paper, we investigate the use of the Wavelet Transform for text-dependent and text-independent Speaker Verification tasks. We have introduced a Principal Component Analysis based wavelet transform to perform frequencies segmentation with levels decomposition. A speaker dependent library tree has been built, corresponding to the best structure for a given speaker. The constructed tree is abstract and specific to every single speaker. Therefore the extracted parameters are more discriminative and appropriate for speaker verification applications. It has been compared to MFCC’s and other wavelet-based parameters. Experiments have been conducted using corpus, extracted from Yoho and Spidre Databases. This technique has shown robustness and 100% efficiency in both cases. Unsupervised Speaker Indexing Using Anchor Models and Automatic Transcription of Discussions Yuya Akita, Tatsuya Kawahara; Kyoto University, Japan We present unsupervised speaker indexing combined with auto- 105 Eurospeech 2003 Thursday matic speech recognition (ASR) for speech archives such as discussions. Our proposed indexing method is based on anchor models, by which we define a feature vector based on the similarity with speakers of a large scale speech database. Several techniques are introduced to improve discriminant ability. ASR is performed using the results of this indexing. No discussion corpus is available to train acoustic and language models. So we applied the speaker adaptation technique to the baseline acoustic model based on the indexing. We also constructed a language model by merging two models that cover different linguistic features. We achieved the speaker indexing accuracy of 93% and the significant improvement of ASR for real discussion data. A Statistical Approach to Assessing Speech and Voice Variability in Speaker Verification Klaus R. Scherer, D. Grandjean, T. Johnstone, G. 
Klasmeyer, Tanja Bänziger; University of Geneva, Switzerland Voice and speech parameters for a single speaker vary widely over different contexts, in particular in situations in which speakers are affected by stress or emotion or in which speech styles are used strategically. This high degree of intra-speaker variability presents a major challenge for speaker verification systems. Based on a largescale study in which different kinds of affective states were induced in over 100 speakers from three language groups, we use a statistical approach to identify speech and voice parameters that are likely to strongly vary as a function of the respective situation and affective state as well as those that tend to remain relatively stable. In addition, we evaluate the latter with respect to their potential to differentiate individual speakers. Automatic Singer Identification of Popular Music Recordings via Estimation and Modeling of Solo Vocal Signal Wei-Ho Tsai, Hsin-Min Wang, Dwight Rodgers; Academia Sinica, Taiwan This study presents an effective technique for automatically identifying the singer of a music recording. Since the vast majority of popular music contains background accompaniment during most or all vocal passages, directly acquiring isolated solo voice data for extracting the singer’s vocal characteristics is usually infeasible. To eliminate the interference of background music for singer identification, we leverage statistical estimation of a piece’s musical background to build a reliable model for the solo voice. Validity of the proposed singer identification system is confirmed via the experimental evaluations conducted on a 23-singer pop music database. A DP Algorithm for Speaker Change Detection Michele Vescovi 1 , Mauro Cettolo 2 , Romeo Rizzi 1 ; 1 Università degli Studi di Trento, Italy; 2 ITCirst, Italy The Bayesian Information Criterion (BIC) is a widely adopted method for audio segmentation; typically, it is applied within a sliding variable-size analysis window where single changes in the nature of the audio are locally searched. In this work, a dynamic programming algorithm which uses the BIC method for globally segmenting the input audio stream is described, analyzed, and experimentally evaluated. On the 2000 NIST Speaker Recognition Evaluation test set, the DP algorithm outperforms the local one by 2.4% (relative) F-score in the detection of changes, at the cost of being 38 times slower. September 1-4, 2003 – Geneva, Switzerland Automatic Estimation of Perceptual Age Using Speaker Modeling Techniques Nobuaki Minematsu, Keita Yamauchi, Keikichi Hirose; University of Tokyo, Japan This paper proposes a technique to estimate speakers’ perceptual age automatically only with acoustic information of their utterances. Firstly, we experimentally collected data of how old individual speakers in databases sound to listeners. Speech samples of approximately 500 male speakers with a very wide range of the real age were presented to listeners, who were asked to estimate the age only by hearing. Using the results, the perceptual age of the individual speakers was defined in two ways as label (averaged age over the listeners) and distribution. Then, each of the speakers was acoustically modeled by GMMs. Finally, the perceptual age of an input speaker was estimated as weighted sum of the perceptual age of all the other speakers in the databases, where the weight for speaker i was calculated as a function of likelihood score of the input speaker as speaker i. 
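A minimal sketch of such a likelihood-weighted estimate follows; the softmax-style weight function and the toy scores are illustrative assumptions, not necessarily the weighting used in the paper.

import math

def estimate_perceptual_age(log_likelihoods, reference_ages, scale=1.0):
    # log_likelihoods[i]: score of the input speaker under reference speaker i's GMM
    m = max(log_likelihoods)
    weights = [math.exp(scale * (ll - m)) for ll in log_likelihoods]
    total = sum(weights)
    return sum(w * age for w, age in zip(weights, reference_ages)) / total

print(estimate_perceptual_age([-1200.0, -1180.0, -1150.0], [25.0, 40.0, 62.0]))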
Experiments showed that correlation was about 0.9 between the perceptual age estimated by the listening test and that estimated by the proposed method. This paper also introduces some techniques to realize robust estimation of the perceptual age. Speaker Recognition Using Local Models Ryan Rifkin; Honda Research Institute, USA Many of the problems arising in speech processing are characterized by extremely large training and testing sets, constraining the kinds of models and algorithms that lead to tractable implementations. In particular, we would like the amount of processing associated with each test frame to be sublinear (i.e., logarithmic) in the number of training points. In this paper, we consider smoothed kernel regression models at each test frame, using only those training frames that are close to the desired test frame. The problem is made tractable via the use of approximate nearest neighbors techniques. The resulting system is conceptually simple, easy to implement, and fast, with performance comparable to more sophisticated methods. Preliminary results on a NIST speaker recognition task are presented, demonstrating the feasibility of the method. Dependence of GMM Adaptation on Feature Post-Processing for Speaker Recognition Robbie Vogt, Jason Pelecanos, Sridha Sridharan; Queensland University of Technology, Australia This paper presents a study on the relationship between feature post-processing and speaker modelling techniques for robust textindependent speaker recognition. A fully coupled target and background Gaussian mixture speaker model structure is used for hypothesis testing in this speaker model based recognition system. Two formulations of the Maximum a Posteriori (MAP) adaptation algorithm for Gaussian mixture models are considered. We contrast the standard single iteration adaptation algorithm to adaptation using multiple iterations. Three post-processing techniques for cepstral features are considered; feature warping, cepstral mean subtraction (CMS) and RelAtive SpecTrA (RASTA) processing. It is shown that the advantage gained through iterative MAP adaptation is dependent on the parameterisation technique used. Reasons for this dependency are discussed. Text-Independent Speaker Recognition by Speaker-Specific GMM and Speaker Adapted Syllable-Based HMM Seiichi Nakagawa, Wei Zhang; Toyohashi University of Technology, Japan SOM as Likelihood Estimator for Speaker Clustering Itshak Lapidot; IDIAP, Switzerland A new approach is presented for clustering the speakers from unlabeled and unsegmented conversation, when the number of speakers is unknown. In this approach, Self-Organizing-Map (SOM) is used as likelihood estimators for speaker model. For estimation of the number of clusters the Bayesian Information Criterion (BIC) is applied. This approach was tested on the NIST 1996 HUB-4 evaluation test in terms of speaker and cluster purities. Results indicate that the combined SOM-BIC approach can lead to better clustering results than the baseline system. We present a new text-independent speaker recognition method by combining speaker-specific Gaussian Mixture Model(GMM) with syllable-based HMM adapted by MLLR or MAP. The robustness of this speaker recognition method for speaking style’s change was evaluated. The speaker identification experiment using NTT database which consists of sentences data uttered at three speed modes (normal, fast and slow) by 35 Japanese speakers(22 males and 13 females) on five sessions over ten months was conducted. 
Each speaker uttered only 5 training utterances. We obtained an accuracy of 100% for text-independent speaker identification. This result was superior to some conventional methods on the same database.

On the Amount of Speech Data Necessary for Successful Speaker Identification
Aleš Padrta, Vlasta Radová; University of West Bohemia in Pilsen, Czech Republic
The paper deals with the dependence between speaker identification performance and the amount of test data. Three speaker identification procedures based on hidden Markov models (HMMs) of phonemes are presented here. One, which is quite commonly used in speaker recognition systems based on HMMs, uses the likelihood of the whole utterance for speaker identification. The other two, which are proposed in this paper, are based on the majority voting rule. The experiments were performed for two different situations: either both training and test data were obtained from the same channel, or they were obtained from different channels. All experiments show that the proposed speaker identification procedure based on the majority voting rule for sequences of phonemes allows us to reduce the amount of test data necessary for successful speaker identification.

Speaker Verification Based on the German VeriDat Database
Ulrich Türk, Florian Schiel; Ludwig-Maximilians-Universität München, Germany
This paper introduces the new German speaker verification (SV) database VeriDat as well as the system design, the baseline performance and the results of several experiments with our experimental SV framework. The main focus is on how typical problems with real-world telephone speech can be avoided automatically by rejecting inputs to the enrollment or test material. Possible splittings of the data sets according to network type and acoustical environment are tested in cheating experiments.

An Accurate Noise Compensation Algorithm in the Log-Spectral Domain for Robust Speech Recognition
Mohamed Afify; Cairo University, Egypt
This paper presents an algorithm for noise compensation in the log-spectral domain. The idea is based on the use of accurate approximations which allow theoretical derivation of the noisy speech statistics, and on using these statistics to define a compensation algorithm under a Gaussian mixture model assumption. The algorithm is tested on a digit database recorded in the car; the word recognition accuracies for the baseline (uncompensated), first-order VTS, the proposed method, and the matched test are 85.8%, 90.6%, 93.1%, and 93.9% respectively. This clearly indicates the performance gain due to the proposed technique.
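As background for the log-spectral compensation described in the Afify abstract above, the following is a minimal sketch of the standard additive-noise interaction in the log-Mel domain and the resulting shift of Gaussian mean vectors, in the spirit of VTS-style compensation. It is an illustration under invented toy values, not the algorithm evaluated in the paper; all variable names are hypothetical.

```python
import numpy as np

# Toy clean-speech GMM means and a noise mean in the log-Mel-spectral domain.
# These values are made up for illustration only.
mu_clean = np.array([[2.0, 3.5, 4.0],      # mean of mixture component 1
                     [1.0, 2.0, 2.5]])     # mean of mixture component 2
mu_noise = np.array([1.5, 1.5, 1.5])       # estimated log-Mel noise mean

def compensate_means(mu_x, mu_n):
    """Shift clean log-spectral means to approximate noisy-speech means.

    Uses the standard interaction y = x + log(1 + exp(n - x)), evaluated
    at the component mean (a low-order, VTS-style approximation).
    """
    return mu_x + np.log1p(np.exp(mu_n - mu_x))

mu_noisy = compensate_means(mu_clean, mu_noise)
print("clean means:\n", mu_clean)
print("compensated (noisy) means:\n", mu_noisy)
```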
Session: PThCf – Poster
Robust Speech Recognition IV
Time: Thursday 13.30, Venue: Main Hall, Level -1
Chair: Jean-Claude Junqua, Panasonic, USA

A Segment-Based Algorithm of Speech Enhancement for Robust Speech Recognition
Guokang Fu 1, Ta-Hsin Li 2; 1 IBM China Research Lab, China; 2 IBM T.J. Watson Research Center, USA
Accurate recognition of speech in noisy environments is still an obstacle to wider application of speech recognition technology. Noise reduction, which aims at cleaning the corrupted test signal to match the ideal training conditions, remains an effective approach to improving the accuracy of speech recognition in noisy environments. This paper introduces a new noise reduction algorithm that combines a tree-based segmentation method with maximum likelihood estimation to accommodate the nonstationarity of speech while efficiently suppressing possibly nonstationary noise. Numerical results are obtained from experiments on a speech recognition system, showing the effectiveness of the proposed algorithm in improving the accuracy of Chinese speech recognition.

Robust Multiple Resolution Analysis for Automatic Speech Recognition
Roberto Gemello 1, Franco Mana 1, Dario Albesano 1, Renato De Mori 2; 1 Loquendo, Italy; 2 LIA-CNRS, France
This paper investigates the potential of exploiting the redundancy implicit in Multi Resolution Analysis (MRA) for Automatic Speech Recognition (ASR) systems. Experiments, carried out with data collected from home telephones and in cars, confirm the proposed approach for exploiting this redundancy. Comparisons with the use of Mel Frequency-scaled Cepstral Coefficients (MFCCs) and JRASTA Perceptual Linear Prediction coefficients (JRASTA-PLP) indicate that executing Principal Component Analysis (PCA) on MRA features results in performance superior to the use of MFCCs and competitive with the use of JRASTA-PLP features. Experiments in noisy conditions, using the Italian component of the AURORA3 corpus, show a WER reduction of 15.7% when SNR-dependent Spectral Subtraction (SS) is performed on MRA-PCA features compared to when it is performed on JRASTA-PLP features. Furthermore, SS appears to be better than Soft Thresholding (ST).

A New Adaptive Long-Term Spectral Estimation Voice Activity Detector
Javier Ramírez, José C. Segura, Carmen Benítez, Ángel de la Torre, Antonio J. Rubio; Universidad de Granada, Spain
This paper presents an efficient voice activity detector (VAD) that is based on the estimation of the long-term spectral divergence (LTSD) between noise and speech periods. The proposed method decomposes the input signal into overlapped speech frames, uses a sliding window to compute the long-term spectral envelope and measures the speech/non-speech LTSD, thus yielding a highly discriminating decision rule and minimizing the average number of decision errors. In order to increase non-speech detection accuracy, the decision threshold is adapted to the measured noise energy, while a controlled hang-over is activated only when the observed signal-to-noise ratio (SNR) is low. An exhaustive analysis of the proposed VAD is carried out using the AURORA TIdigits and SpeechDat-Car (SDC) databases. The proposed VAD is compared to the most commonly used VADs in the field in terms of speech/non-speech detection and recognition performance. Experimental results demonstrate a sustained advantage over the G.729, AMR and AFE VADs.
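To make the long-term spectral divergence idea in the abstract above concrete, here is a minimal sketch of an LTSD-style decision: the long-term spectral envelope is taken as the per-bin maximum over a window of neighbouring frame spectra and compared against a noise spectrum estimate. Frame length, window span and the threshold below are invented for illustration and do not come from the paper.

```python
import numpy as np

def frame_spectra(x, frame_len=256, hop=128):
    """Split a signal into frames and return magnitude spectra (rows = frames)."""
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[i * hop:i * hop + frame_len] for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames * np.hanning(frame_len), axis=1))

def ltsd_decisions(spectra, noise_spectrum, span=3, threshold_db=6.0):
    """Speech/non-speech decision per frame from a long-term spectral divergence."""
    n_frames = spectra.shape[0]
    decisions = np.zeros(n_frames, dtype=bool)
    for t in range(n_frames):
        lo, hi = max(0, t - span), min(n_frames, t + span + 1)
        envelope = spectra[lo:hi].max(axis=0)          # long-term spectral envelope
        ltsd = 10.0 * np.log10(np.mean(envelope ** 2 / (noise_spectrum ** 2 + 1e-12)))
        decisions[t] = ltsd > threshold_db
    return decisions

# Synthetic example: white noise with a tone burst in the middle.
rng = np.random.default_rng(0)
sig = 0.05 * rng.standard_normal(16000)
sig[6000:10000] += np.sin(2 * np.pi * 440 * np.arange(4000) / 8000.0)

spec = frame_spectra(sig)
noise_est = spec[:10].mean(axis=0)   # assume the first frames are noise only
print(ltsd_decisions(spec, noise_est).astype(int))
```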
Robust Speech Recognition Using Non-Linear Spectral Smoothing
Michael J. Carey; University of Bristol, U.K.
A new, simple but robust method of front-end analysis, non-linear spectral smoothing (NLSS), is proposed. NLSS uses rank-order filtering to replace noisy low-level speech spectrum coefficients with values computed from adjacent spectral peaks. The resulting transformation bears significant similarities to masking in the auditory system. It can be used as an intermediate processing stage between the FFT and the filter-bank analyzer. It also produces features which can be cosine transformed and used by a pattern matcher. NLSS gives significant improvements in the performance of speech recognition systems in the presence of stationary noise: a reduction in error rate of typically 50%, or an increased tolerance to noise of 3 dB for the same error rate, in an isolated digit test on the Noisex database. Results on female speech were superior to those on male speech; female speech gave a recognition error rate of 1.1% at a 0 dB signal-to-noise ratio.

A Novel Use of Residual Noise Model for Modified PMC
Cailian Miao, Yangsheng Wang; Chinese Academy of Sciences, China
In this paper, a new approach based on model adaptation is proposed for the acoustic mismatch problem. A specific bias model, the residual noise model, is presented, which is a joint compensation model for additive and convolutive bias. The novel noise model is estimated in a maximum likelihood manner. In conjunction with Parallel Model Combination (PMC), it is effective in noisy environments. The experiments, implementing continuous Mandarin digit recognition in noisy environments, were conducted using Cambridge's HTK toolkit.

Robust Speech Recognition to Non-Stationary Noise Based on Model-Driven Approaches
Christophe Cerisara, Irina Illina; LORIA, France
Automatic speech recognition works quite well in clean conditions, and several algorithms have already been proposed to deal with stationary noise. The next challenge is to work with non-stationary noise. This paper studies this problem. We propose three algorithms for non-stationary noise adaptation: Static and Dynamic Optional Parallel Model Combination (OPMC), and one algorithm derived from the Missing Data framework. The combination of speech and noise is expressed in the spectral domain, and different ways to estimate the non-stationary noise model are studied. The proposed algorithms are tested on a telephone database with added background music at different SNRs. The best result is obtained using dynamic OPMC.

Towards Missing Data Recognition with Cepstral Features
Christophe Cerisara; LORIA, France
We study in this work the Missing Data Recognition (MDR) framework applied to a large vocabulary continuous speech recognition (LVCSR) task with cepstral models when the speech signal is corrupted by musical noise. We do not propose a full system that solves this difficult problem; rather, we present some of the issues involved and study some possible solutions to them. We focus in this work on the issues concerning the application of masks to cepstral models. We further identify possible errors and study how some of them affect the performance of the system.
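The missing-data work described above presupposes a reliability mask over the time-frequency plane. As a minimal, assumed illustration (not the authors' mask estimator), the sketch below labels a spectro-temporal cell as reliable when its local SNR against a stationary noise estimate exceeds a threshold; how such masks are then applied to cepstral models is the open issue the abstract discusses and is not shown here.

```python
import numpy as np

def reliability_mask(power_spec, noise_power, snr_threshold_db=0.0):
    """Mark time-frequency cells whose estimated local SNR exceeds a threshold.

    power_spec  : (frames, bins) noisy power spectrogram
    noise_power : (bins,) stationary noise power estimate
    Returns a boolean mask of the same shape as power_spec (True = reliable).
    """
    snr_db = 10.0 * np.log10(np.maximum(power_spec - noise_power, 1e-12) /
                             (noise_power + 1e-12))
    return snr_db > snr_threshold_db

# Toy example with made-up numbers.
rng = np.random.default_rng(1)
noise = np.full(8, 1.0)                            # flat noise power estimate
speech = np.zeros((5, 8)); speech[2, 2:5] = 20.0   # a burst of speech energy
noisy = speech + noise + 0.1 * rng.random((5, 8))
print(reliability_mask(noisy, noise).astype(int))
```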
Voicing Parameter and Energy Based Speech/Non-Speech Detection for Speech Recognition in Adverse Conditions
Arnaud Martin 1, Laurent Mauuary 2; 1 Université de Bretagne Sud, France; 2 France Télécom R&D, France
In adverse conditions, speech recognition performance decreases, in part due to imperfect speech/non-speech detection. In this paper, a new combination of a voicing parameter and energy for speech/non-speech detection is described. This combination avoids, in particular, noise detections in very noisy real-life environments and provides better performance for continuous speech recognition. This new speech/non-speech detection approach outperforms both noise-statistics-based [1] and Linear Discriminant Analysis (LDA) based [2] criteria in noisy environments and for continuous speech recognition applications.

Two Correction Models for Likelihoods in Robust Speech Recognition Using Missing Feature Theory
Hugo Van hamme; Katholieke Universiteit Leuven, Belgium
In Missing Feature Theory (MFT), it is assumed that some of the features that are extracted from an observation are missing or unreliable. Applied to spectral features for noisy speech recognition, the clean feature values are known to be less than the observed noisy features. Based on this inequality constraint, an HMM-state-dependent clean speech value of the missing features can be inferred through maximum likelihood estimation. This paper describes two observed biases of the likelihood evaluated at the estimate. Theoretical and experimental evidence is provided that an upper bound on the accuracy is improved by applying computationally simple corrections for the number of free variables in the likelihood maximization and for the global acoustic space density function.

On-Line Parametric Histogram Equalization Techniques for Noise Robust Embedded Speech Recognition
Hemmo Haverinen, Imre Kiss; Nokia Research Center, Finland
In this paper, two low-complexity histogram equalization algorithms are presented that significantly reduce the mismatch between training and testing conditions in HMM-based automatic speech recognizers. The proposed algorithms use Gaussian approximations for the initial and target distributions and perform a linear mapping between them. We show that even this simplified mapping can improve the noise robustness of ASR systems, while the associated computational load, memory requirements, and algorithmic delay are minimal. The proposed algorithms were evaluated in a multi-lingual speaker-independent isolated word recognition task, both without and in combination with on-line MAP acoustic model adaptation. The best results obtained showed an approximate 25/20% relative error-rate reduction without/with acoustic model adaptation.

Spectral Maxima Representation for Robust Automatic Speech Recognition
J. Sujatha, K.R. Prasanna Kumar, K.R. Ramakrishnan, N. Balakrishnan; Indian Institute of Science, India
In the context of automatic speech recognition, the popular Mel Frequency Cepstral Coefficients (MFCCs), though they perform very well as features under clean and matched environments, are observed to fail in mismatched conditions. The spectral maxima are often observed to preserve their locations and energies under noisy environments, but they are not represented explicitly by the MFCC features. This paper presents a framework for representing the maxima information for robust recognition in the presence of additive white Gaussian noise (WGN). For the task of phoneme-based Isolated Word Recognition (IWR) under different Signal-to-Noise Ratio (SNR) environments, the results show improved recognition performance. The cepstral features are computed from a spectrogram reconstructed by fitting Gaussians around the spectral maxima. In view of the inherent robustness and easy trackability of the maxima, this opens up interesting avenues towards a robust feature representation as well as preprocessing techniques.
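The Gaussian-to-Gaussian mapping in the Haverinen & Kiss histogram equalization abstract above reduces, per feature dimension, to a linear transform that matches test statistics to reference statistics. The sketch below shows that mapping with invented toy data; it illustrates the general technique only, not the authors' on-line embedded implementation.

```python
import numpy as np

def parametric_equalize(test_feats, ref_mean, ref_std):
    """Map test features so their per-dimension mean/std match reference statistics.

    test_feats : (frames, dims) feature matrix from the noisy test condition
    ref_mean, ref_std : (dims,) statistics estimated on the training data
    """
    test_mean = test_feats.mean(axis=0)
    test_std = test_feats.std(axis=0) + 1e-12
    return ref_mean + (ref_std / test_std) * (test_feats - test_mean)

# Toy data: "training" statistics and a mismatched "test" utterance.
rng = np.random.default_rng(2)
ref_mean, ref_std = np.zeros(3), np.ones(3)
test = rng.normal(loc=4.0, scale=2.5, size=(200, 3))   # shifted and scaled features

equalized = parametric_equalize(test, ref_mean, ref_std)
print("before:", test.mean(axis=0).round(2), test.std(axis=0).round(2))
print("after: ", equalized.mean(axis=0).round(2), equalized.std(axis=0).round(2))
```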
Compensation of Channel Distortion in Line Spectrum Frequency Domain
An-Tze Yu, Hsiao-Chuan Wang; National Tsing Hua University, Taiwan
This paper addresses the problem of channel effects in the line spectrum frequency (LSF) domain. The channel effect can be expressed in terms of the channel phase. The speech signal is represented by its inverse filter derived from LP analysis. Mean normalization of the inverse filters is then introduced to remove the channel distortion. Further study indicates that mean normalization of the inverse filters becomes mean subtraction in the phase domain. Based on this finding, two methods are proposed to compensate for the channel effect. Experiments on simulated channel-distorted speech are conducted to evaluate the effectiveness of the proposed methods. The experimental results show that the proposed methods can give significant improvements in speech recognition performance. The performance of the proposed methods is comparable to that of CMN using cepstral coefficients.

Missing Feature Theory Applied to Robust Speech Recognition Over IP Network
Toshiki Endo 1, Shingo Kuroiwa 2, Satoshi Nakamura 1; 1 ATR-SLT, Japan; 2 University of Tokushima, Japan
This paper addresses the problems involved in performing speech recognition over mobile and IP networks. The main problem is speech data loss caused by packet loss in the network. We present two missing-feature-based approaches that recover lost regions of speech data. These approaches are based on reconstruction of missing frames or on marginal distributions. For comparison, we also use a tacking method, which recognizes only received data. We evaluate these approaches with packet loss models, i.e., random loss and Gilbert loss models. The results show that the marginal-distributions-based approach is most effective in a packet loss environment; the degradation of word accuracy is only 5% when the packet loss rate is 30%, and only 3% when the mean burst loss length is 24 frames.

Comparative Experiments to Evaluate the Use of Auditory-Based Acoustic Distinctive Features and Formant Cues for Robust Automatic Speech Recognition in Low-SNR Car Environments
Hesham Tolba, Sid-Ahmed Selouani, Douglas O'Shaughnessy; Université du Québec, Canada
This paper presents an evaluation of the use of some auditory-based distinctive features and formant cues for robust automatic speech recognition (ASR) in the presence of highly interfering car noise. Comparative experiments have indicated that combining the classical MFCCs with some auditory-based acoustic distinctive cues and either the main formant magnitudes or the formant frequencies of a speech signal, using a multi-stream paradigm, leads to an improvement in recognition performance in noisy car environments. To test the use of the new multi-stream feature vector, a series of experiments on speaker-independent continuous-speech recognition was carried out using a noisy version of the TIMIT database. We found that the proposed multi-stream paradigm outperforms the conventional recognition process based on MFCCs in interfering car-noise environments over a wide range of SNRs.

Robust Speech Recognition Using Missing Feature Theory in the Cepstral or LDA Domain
Hugo Van hamme; Katholieke Universiteit Leuven, Belgium
When applying Missing Feature Theory to noise robust speech recognition, spectral features are labeled as either reliable or unreliable in the time-frequency plane. The acoustic model evaluation of the unreliable features is modified to express that their clean values are unknown or confined within bounds. Classically, MFT requires an assumption of statistical independence in the spectral domain, which deteriorates the accuracy on clean speech.
In this paper, MFT is expressed in any domain that is a linear transform of (log-)spectra, for example cepstra and their time derivatives. The acoustic model evaluation is recast as a nonnegative least squares problem. Approximate solutions are proposed and the success of the method is shown through experiments on the AURORA-2 database.

Bandwidth Mismatch Compensation for Robust Speech Recognition
Yuan-Fu Liao 1, Jeng-Shien Lin 1, Wei-Ho Tsai 2; 1 National Taipei University of Technology, Taiwan; 2 Academia Sinica, Taiwan
In this paper, an iterative bandwidth mismatch compensation (BMC) algorithm is proposed to alleviate the need for multiple pre-trained models when recognizing speech of different bandwidths. The BMC uses the concept of bandwidth extension, similar to speech enhancement approaches. However, it aims at directly improving recognition accuracy rather than speech intelligibility or quality, and it utilizes only the recognizer's hidden Markov models (HMMs) for both bandwidth mismatch compensation and recognition. The BMC first detects the bandwidth of the input speech signal based on a divergence measurement. An HMM/Gaussian mixture model (GMM) based method is then used to iteratively segment the input speech utterance and compensate the speech features. Experiments on severely bandwidth-mismatched conditions, i.e., training on an 8 kHz database and testing on 4 kHz or 5.5 kHz bandwidth data, have verified the effectiveness of the proposed approach.

Markov Chain Monte Carlo Methods for Noise Robust Feature Extraction Using the Autoregressive Model
Robert W. Morris, Jon A. Arrowood, Mark A. Clements; Georgia Institute of Technology, USA
In this paper, Markov Chain Monte Carlo techniques are applied to feature estimation for automatic speech recognition. By using these methods, it is possible to explore new possibilities in leveraging the autoregressive assumption for noise robust feature extraction. Two minimum mean square error estimators are compared that directly estimate the mean of the feature vectors. The first estimator uses the assumption that the speech is an autoregressive signal, while the second makes no assumptions about the speech spectrum. By creating samples from the posterior distribution, these methods also provide an elegant solution to finding feature variances. These variances can be used to create optimal temporal smoothers of the features as well as input for uncertain observation decoding. Testing on the Aurora2 database shows that autoregressive modeling provides additional information to improve speech recognition performance. In addition, both smoothing and uncertain observation decoding improve performance in this method.

A Comparative Study of Some Discriminative Feature Reduction Algorithms on the AURORA 2000 and the DaimlerChrysler In-Car ASR Tasks
Joan Marí Hilario, Fritz Class; DaimlerChrysler AG, Germany
A common practice in ASR for adding contextual information is to append consecutive feature frames into a single large feature vector. However, this increases the processing time in the acoustic modelling and may lead to poorly trained parameters. A possible solution is to use a Linear Discriminant Analysis (LDA) mapping to reduce the dimensionality of the feature, but this is not optimal, at least in the case where the LDA classes are HMM states. It is shown in this paper that the feature reduction problem is essentially a problem of approximating class posterior probabilities. These can be approximated using Neural Nets (NN). Several approaches using different choices for the classes and NN topology are presented and tested on the AURORA 2000 digit task and on our in-car task. Results on AURORA show a significant performance increase compared to LDA, but none of the NN-based approaches outperforms LDA on our in-car task.
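As a point of comparison for the feature-reduction discussion above, the sketch below stacks neighbouring frames into a long context vector and reduces it with a two-class Fisher LDA implemented directly in NumPy. The data, dimensions and two-class setup are toy assumptions; the abstract's NN-based posterior approximation is not shown.

```python
import numpy as np

def stack_frames(feats, context=2):
    """Append +/- context neighbouring frames to each frame (edges are clamped)."""
    n, d = feats.shape
    idx = np.clip(np.arange(n)[:, None] + np.arange(-context, context + 1), 0, n - 1)
    return feats[idx].reshape(n, d * (2 * context + 1))

def fisher_lda_direction(x0, x1):
    """Two-class Fisher discriminant direction w = Sw^{-1} (m1 - m0)."""
    m0, m1 = x0.mean(axis=0), x1.mean(axis=0)
    sw = np.cov(x0, rowvar=False) + np.cov(x1, rowvar=False)
    return np.linalg.solve(sw + 1e-6 * np.eye(sw.shape[0]), m1 - m0)

# Toy two-class data standing in for frames of two HMM states.
rng = np.random.default_rng(3)
class0 = rng.normal(0.0, 1.0, size=(300, 13))
class1 = rng.normal(0.7, 1.0, size=(300, 13))

s0, s1 = stack_frames(class0), stack_frames(class1)   # 13 -> 65 dimensions
w = fisher_lda_direction(s0, s1)
print("projected class means:", (s0 @ w).mean().round(3), (s1 @ w).mean().round(3))
```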
Session: PThCg – Poster
Multi-Lingual Spoken Language Processing
Time: Thursday 13.30, Venue: Main Hall, Level -1
Chair: Torbjørn Svendsen, NTNU, Trondheim, Norway

Recent Progress in the Decoding of Non-Native Speech with Multilingual Acoustic Models
V. Fischer, E. Janke, S. Kunzmann; IBM Pervasive Computing, Germany
In this paper we report on recent progress in the use of multilingual Hidden Markov Models for the recognition of non-native speech. While we have previously discussed the use of bilingual acoustic models and recognizer combination methods, we now seek to avoid the increased computational load imposed by methods such as ROVER by focusing on acoustic models that share training data from five languages. Our investigations concentrate on the determination of a proper model complexity and show the multilingual models' capability to handle cases where a non-native speaker is borrowing phones from his or her native language. Finally, using a limited amount of non-native speech for MLLR adaptation, we demonstrate the superiority of multilingual models even after adaptation.

An NN-Based Approach to Prosodic Information Generation for Synthesizing English Words Embedded in Chinese Text
Wei-Chih Kuo, Li-Feng Lin, Yih-Ru Wang, Sin-Horng Chen; National Chiao Tung University, Taiwan
In this paper, a neural network-based approach to generating proper prosodic information for spelling/reading English words embedded in background Chinese text is discussed. It expands an existing RNN-based prosodic information generator for Mandarin TTS into an RNN-MLP scheme for Mandarin-English mixed-lingual TTS. It first treats each English word as a Chinese word and uses the RNN, trained for Mandarin TTS, to generate a set of initial prosodic information for each syllable of the English word. It then refines the initial prosodic information by using additional MLPs. The resulting prosodic information is expected to be appropriate for English-word synthesis as well as to match well with that of the background Mandarin speech. Experimental results showed that the proposed RNN-MLP scheme performed very well. For English word spelling/reading, RMSEs of 41.8/78.2 ms, 30.8/26 ms, 0.65/0.45 ms/frame, and 3.06/4.9 dB were achieved in the open tests for the synthesized syllable duration, inter-syllable pause duration, pitch contour, and energy level, respectively. It is therefore a promising approach.

Speaker Adaptation for Non-Native Speakers Using Bilingual English Lexicon and Acoustic Models
S. Matsunaga, A. Ogawa, Yoshikazu Yamaguchi, A. Imamura; NTT Corporation, Japan
This paper proposes a supervised speaker adaptation method that is effective for both non-native (i.e. Japanese) and native English speakers' pronunciation of English speech. This method uses English and Japanese phoneme acoustic models and a pronunciation lexicon in which each word has both English and Japanese phoneme transcriptions. The same utterances are used for adaptation of both acoustic models.
A recognition system uses these two adapted acoustic models and the lexicon, and the highest-likelihood word sequence obtained by combining English- and Japanese-pronounced words is the recognition result. Continuous speech recognition experiments show that the proposed adaptation method greatly improves both Japanese-English and native-English recognition performance, and the system using bilingual adapted models achieves the highest accuracy for Japanese speakers among those using monolingual models, while maintaining the same performance level for native speakers as that of an English recognition system using an English adapted model.

Using the Web for Fast Language Model Construction in Minority Languages
Viet Bac Le 1, Brigitte Bigi 1, Laurent Besacier 1, Eric Castelli 2; 1 CLIPS-IMAG Laboratory, France; 2 MICA Center, Vietnam
The design and construction of a language model for minority languages is a hard task. By minority language, we mean a language with few available resources, especially for the statistical learning problem. In this paper, a new methodology for fast language model construction in minority languages is proposed. It is based on the use of Web resources to collect and build efficient textual corpora. By using some filtering techniques, this methodology allows a quick and efficient construction of a language model at a small cost in terms of computational and human resources. Our primary experiments have shown excellent performance of the Web language models versus newspaper language models using the proposed filtering methods on a majority language (French). Following the same approach for a minority language (Vietnamese), a valuable language model was constructed in three months with only 15% new development to modify some filtering tools.

An Approach to Multilingual Acoustic Modeling for Portable Devices
Yan Ming Cheng, Chen Liu, Yuan-Jun Wei, Lynette Melnar, Changxue Ma; Motorola Labs, USA
There is an increasing need to deploy speech recognition systems supporting multiple languages/dialects on portable devices worldwide. A common approach uses a collection of individual monolingual speech recognition systems as a solution. However, such an approach is not practical for handheld devices such as cell phones due to stringent restrictions on memory and computational resources. In this paper, we present a simple and effective method to develop multilingual acoustic models that achieve comparable performance relative to monolingual acoustic models but with only a fraction of the storage space of the combined monolingual acoustic model set.
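As a rough, assumed illustration of the kind of pipeline the Le et al. abstract above describes (filter Web text, then estimate an n-gram model), the sketch below filters a handful of made-up sentences and builds an add-one-smoothed bigram model. The real system's filtering rules and corpora are of course far richer; everything here is hypothetical.

```python
import re
from collections import Counter

def keep(sentence, min_words=3, max_words=30):
    """Very crude Web-text filter: length bounds and letters-only tokens."""
    words = sentence.lower().split()
    return (min_words <= len(words) <= max_words and
            all(re.fullmatch(r"[a-z']+", w) for w in words))

def train_bigram(sentences):
    """Return p(w2 | w1) under add-one smoothing, estimated from the sentences."""
    unigrams, bigrams = Counter(), Counter()
    for s in sentences:
        tokens = ["<s>"] + s.lower().split() + ["</s>"]
        unigrams.update(tokens[:-1])
        bigrams.update(zip(tokens[:-1], tokens[1:]))
    vocab = len(set(unigrams)) + 1
    return lambda w1, w2: (bigrams[(w1, w2)] + 1) / (unigrams[w1] + vocab)

corpus = ["the weather is nice today", "download now !!!",
          "the model is trained on web text"]
model = train_bigram([s for s in corpus if keep(s)])
print(model("the", "model"), model("the", "zebra"))
```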
Cross-Lingual Pronunciation Modelling for Indonesian Speech Recognition
Terrence Martin 1, Torbjørn Svendsen 2, Sridha Sridharan 1; 1 Queensland University of Technology, Australia; 2 Norwegian University of Science and Technology, Norway
The resources necessary to produce Automatic Speech Recognition systems for a new language are considerable, and for many languages these resources are not available. This emphasizes the need for the development of generic techniques which overcome this data shortage. Indonesian is one language which suffers from this problem and whose population and importance suggest it could benefit from speech-enabled technology. Accordingly, we investigate using English acoustic models to recognize Indonesian speech. The mapping process, in which the symbolic representation of the source language acoustic models is equated to the target language phonetic units, has typically been achieved using one-to-one mapping techniques. This mapping method does not allow for the incorporation of predictable allophonic variation in the lexicon. Accordingly, in this paper we present the use of cross-lingual pronunciation modelling to extract context-dependent mapping rules, which are subsequently used to produce a more accurate cross-lingual lexicon.

Language Model Adaptation Using Cross-Lingual Information
Woosung Kim, Sanjeev Khudanpur; Johns Hopkins University, USA
The success of statistical language modeling techniques is crucially dependent on the availability of a large amount of training text. For a language in which such large text collections are not available, methods have recently been proposed to take advantage of a resource-rich language, together with cross-lingual information retrieval and machine translation, to sharpen language models for the resource-deficient language. In this paper, we describe investigations into such language models for an automatic speech recognition system for Mandarin Broadcast News. By exploiting a large side-corpus of contemporaneous English news articles to adapt a static Chinese language model to the news story being transcribed, we demonstrate significant improvements in recognition accuracy. The improvement from using English text is greater when less Chinese text is available to estimate the static language model. We also compare our cross-lingual adaptation to monolingual topic-dependent language model adaptation, and achieve further gains by combining the two adaptation techniques.

Multilingual Phone Clustering for Recognition of Spontaneous Indonesian Speech Utilising Pronunciation Modelling Techniques
Eddie Wong 1, Terrence Martin 1, Torbjørn Svendsen 2, Sridha Sridharan 1; 1 Queensland University of Technology, Australia; 2 Norwegian University of Science and Technology, Norway
In this paper, a multilingual acoustic model set derived from English, Hindi, and Spanish is utilised to recognise speech in Indonesian. In order to achieve this task we adopt a two-tiered approach to perform the cross-lingual porting of the multilingual models to a new language. In the first stage, we use an entropy-based decision tree to merge similar phones from different languages into clusters to form a new multilingual model set. In the second stage, we propose the use of a cross-lingual pronunciation modelling technique to perform the mapping from the multilingual models to the Indonesian phone set. A set of mapping rules is derived from this process and is employed to convert the original Indonesian lexicon into a pronunciation lexicon expressed in terms of the multilingual model set. Preliminary experimental results show that, compared to the common knowledge-based approach, both of these techniques reduce the word error rate in a spontaneous speech recognition task.
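The two Indonesian abstracts above both hinge on rewriting a target-language lexicon in terms of source-language model units. The sketch below applies a simple one-to-one phone map to a tiny pronunciation lexicon, i.e., the baseline both papers improve on with context-dependent rules; the phone symbols and the map itself are invented for illustration.

```python
# Hypothetical one-to-one mapping from Indonesian phones to English model units.
phone_map = {"a": "aa", "i": "iy", "u": "uw", "e": "eh", "o": "ow",
             "k": "k", "t": "t", "s": "s", "p": "p", "n": "n", "m": "m", "r": "r"}

# Tiny made-up Indonesian lexicon: word -> list of phones.
lexicon = {"satu": ["s", "a", "t", "u"],
           "makan": ["m", "a", "k", "a", "n"]}

def map_lexicon(lexicon, phone_map):
    """Rewrite each pronunciation with source-language model units (one-to-one baseline)."""
    mapped = {}
    for word, phones in lexicon.items():
        mapped[word] = [phone_map.get(p, "<unk>") for p in phones]
    return mapped

for word, pron in map_lexicon(lexicon, phone_map).items():
    print(word, " ".join(pron))
```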
Language-Adaptive Persian Speech Recognition
Naveen Srinivasamurthy, Shrikanth Narayanan; University of Southern California, USA
Development of robust spoken language technology ideally relies on the availability of large amounts of data, preferably in the target domain and language. However, more often than not, speech developers need to cope with very little or no data, typically obtained from a different target domain. This paper focuses on developing techniques towards addressing this challenge. Specifically, we consider the case of developing a Persian language speech recognizer with sparse amounts of data. For language modeling, there are several potential sources of text data, e.g., available on the Internet, to help bootstrap initial models; however, acoustic data can be obtained only through tedious data collection efforts. The drawback of limited Persian acoustic data can be partially overcome by making use of acoustic data from languages that have vast resources, such as English (and other languages, if available). The phoneme sets, especially for diverse languages such as English and Persian, differ considerably. However, by incorporating knowledge-based as well as data-driven phoneme mappings, reliable Persian acoustic models can be trained using well-trained English models and small amounts of Persian re-training data. In our experiments, Persian models retrained from seed models created by data-driven phoneme mappings of English models resulted in a phoneme error rate of 19.80%, compared to a phoneme error rate of 20.35% when the Persian models were re-trained from seed models created from sparse Persian data.

Grapheme Based Speech Recognition
Mirjam Killer 1, Sebastian Stüker 2, Tanja Schultz 3; 1 ETH Zürich, Switzerland; 2 Universität Karlsruhe, Germany; 3 Carnegie Mellon University, USA
Large vocabulary speech recognition systems traditionally represent words in terms of subword units, usually phonemes. This paper investigates the potential of graphemes acting as subunits. In order to develop context dependent grapheme based speech recognizers, several decision tree based clustering procedures are performed and compared to each other. Grapheme based speech recognizers in three languages – English, German, and Spanish – are trained and compared to their phoneme based counterparts. The results show that for languages with a close grapheme-to-phoneme relation, grapheme based modeling is as good as phoneme based modeling. Furthermore, multilingual grapheme based recognizers are designed to investigate whether grapheme based information can be successfully shared among languages. Finally, some bootstrapping experiments for Swedish were performed to test the potential for rapid language deployment.

Session: PThCh – Poster
Interdisciplinary
Time: Thursday 13.30, Venue: Main Hall, Level -1
Chair: Mike McTear, University of Ulster at Jordanstown

Learning Chinese Tones
Valery A. Petrushin; Accenture, USA
This paper is devoted to developing techniques for improving the learning of foreign spoken languages. It presents a general framework for evaluating a student's spoken response, which is based on collecting experimental data about experts' and novices' performance and applying machine learning and knowledge management techniques to derive evaluation rules. The related speech analysis, visualization, and student response evaluation techniques are described. An experimental course for learning the tones of Standard Chinese (Mandarin) is discussed.

A Pronunciation Training System for Japanese Lexical Accents with Corrective Feedback in Learner's Voice
Keikichi Hirose, Frédéric Gendrin, Nobuaki Minematsu; University of Tokyo, Japan
A system was developed for teaching non-Japanese learners the pronunciation of Japanese lexical accents. The system first identifies word accent types in a learner's utterance using the F0 change between two adjacent morae as the feature parameter.
As the representative F0 value for a mora, we defined one that matches the perceived pitch well. The system notifies the user whether his/her pronunciation is good or not, and then generates audio and visual corrective feedback. Using the TD-PSOLA technique, the learner's utterance is modified in its prosodic features by reference to the teacher's features, and is offered to the learner as audio corrective feedback. Visual feedback is also offered to highlight the modifications that were made. Accent type pronunciation training experiments were conducted with 8 non-Japanese speakers, and the results showed that the training process could be facilitated by the feedback, especially when the learners were asked to pronounce sentences.

Considerations on Vowel Durations for Japanese CALL System
Taro Mouri, Keikichi Hirose, Nobuaki Minematsu; University of Tokyo, Japan
Due to various difficulties in pronunciation, utterances by non-native speakers may be lacking in fluency. Japanese pronunciation is said to exhibit mora-synchronism, and we therefore assume that disfluency may cause larger variations in vowel durations. Analyses of vowel (and CV) durations were conducted for Japanese sentence utterances by 2 non-Japanese speakers and one Japanese speaker (all female speakers). Larger variations were clearly observed in the non-Japanese utterances. Then, 10 Japanese speakers were asked to rate the non-Japanese utterances. Strong negative correlations were observed between durational variations and pronunciation ratings. Based on this result, a method was developed for automatic evaluation of non-Japanese utterances. The ratings by the method were shown to be close to those by native speakers. Also, in order to offer corrective feedback in the learner's voice, non-Japanese utterances were modified in their vowel durations by reference to native Japanese utterances. The modification was done using the TD-PSOLA scheme. The result of a listening test indicated some improvement in nativeness.

Influence of Recording Equipment on the Identification of Second Language Phoneme Contrasts
Hiroaki Kato 1, Masumi Nukina 2, Hideki Kawahara 2, Reiko Akahane-Yamada 1; 1 ATR-HIS, Japan; 2 Wakayama University, Japan
This paper investigates the perceptual quality of English words recorded by different types of microphones, to assess their suitability for Computer Assisted Language Learning (CALL) systems. English words minimally contrasting in /r/ and /l/, /b/ and /v/, or /s/ and /th/ were recorded from native female and male speakers of American English using six different microphones. The phonemic contrasts in these recordings were then evaluated by 14 native listeners of American English. The results showed that the identification of the /r/-/l/ contrast was unaltered by the difference in microphones, whereas that of the /s/-/th/ contrast dropped significantly with several headset microphones, and that of the /b/-/v/ contrast dropped with a tie-pin microphone. These findings suggest that some microphones are not appropriate for speech perception training. Finally, a post hoc equalization procedure was applied to compensate for the acoustic characteristics of the microphones tested, and this procedure was confirmed to be effective in recovering phonemic contrasts under several conditions.
Training a Confidence Measure for a Reading Tutor That Listens
Yik-Cheung Tam, Jack Mostow, Joseph E. Beck, Satanjeev Banerjee; Carnegie Mellon University, USA
One issue in a Reading Tutor that listens is to determine which words the student read correctly. We describe a confidence measure that uses a variety of features to estimate the probability that a word was read correctly. We trained two decision tree classifiers. The first classifier tries to fix insertion and substitution errors made by the speech decoder, while the second classifier tries to fix deletion errors. By applying the two classifiers together, we achieved a relative reduction in false alarm rate of 25.89% while holding the miscue detection rate constant.

Evaluating the Effect of Predicting Oral Reading Miscues
Satanjeev Banerjee, Joseph E. Beck, Jack Mostow; Carnegie Mellon University, USA
This paper extends and evaluates previously published methods for predicting likely miscues in children's oral reading in a Reading Tutor that listens. The goal is to improve the speech recognizer's ability to detect miscues while limiting the number of "false alarms" (correctly read words misclassified as incorrect). The "rote" method listens for specific miscues from a training corpus. The "extrapolative" method generalizes to predict other miscues on other words. We construct and evaluate a scheme that combines our rote and extrapolative models. This combined approach reduced false alarms by 0.52% absolute (12% relative) while simultaneously improving miscue detection by 1.04% absolute (4.2% relative) over our existing miscue prediction scheme.

VISPER II – Enhanced Version of the Educational Software for Speech Processing Courses
Miroslav Holada, Jan Nouza; Technical University of Liberec, Czech Republic
In this paper we describe a new version of the software tool developed for education and experimental work in the speech processing domain. Since 1997, when the original VISPER was released, we have added several new modules and options that give students a deeper look at the basic principles, methods and algorithms used mainly in speech recognition. Newly included modules allow for visualization of the Viterbi search algorithm implemented in either a sequential or a parallel way; they introduce the idea of beam search with pruning and guide students towards understanding the principle of word string recognition. The VISPER concept of a single graphic environment with mutually linked modules remains untouched. VISPER II is compatible with all recent versions of the MS Windows OS and is freely available.

The Use of Multiple Pause Information in Dependency Structure Analysis of Spoken Japanese Sentences
Meirong Lu, Kazuyuki Takagi, Kazuhiko Ozeki; University of Electro-Communications, Japan
There is a close relationship between prosody and syntax. In the field of speech synthesis, many investigations have been made into controlling prosody so that it conforms to the syntactic structure of the sentence. This paper is concerned with the inverse problem: recovery of syntactic structure with the help of prosodic information. In our past investigations, it was observed that the duration of the inter-phrase pause is the most effective among various prosodic features in dependency structure analysis of Japanese sentences. In those studies, only one kind of pause was used, i.e., the pause that immediately follows the phrase in question. In this paper, another kind of pause is also employed as a prosodic feature: the pause that immediately follows the succeeding phrase of the phrase in question.
It is shown that simultaneous use of the first and second pauses improves the parsing accuracy compared to the case where only the first pause is used.

A Neural Network Approach to Dependency Analysis of Japanese Sentences Using Prosodic Information
Kazuyuki Takagi, Mamiko Okimoto, Yoshio Ogawa, Kazuhiko Ozeki; University of Electro-Communications, Japan
Prosody and syntax are significantly related to each other, as has often been observed. In the field of speech synthesis, many efforts have been made to control prosody so that it reflects the syntactic structure of the sentence. However, the inverse problem, recovery of syntactic structure using prosodic information, has not been investigated as much. This paper focuses on the syntactic information contained in prosodic features extracted from read Japanese sentences, and describes a method of exploiting it in dependency structure analysis. In this paper, a multilayer perceptron is employed to estimate the conditional probability of the dependency distance of a phrase given its prosodic features, i.e., pause duration and F0 contour. Parsing accuracy was improved by combining the two different kinds of prosodic information with the perceptron.

Say-As Classification for Alphabetic Words in Japanese Texts
Hisako Asano, Masaaki Nagata, Masanobu Abe; NTT Corporation, Japan
Modern Japanese texts often include Western-sourced words written in the Roman alphabet. For example, a shopping directory in a web portal which lists more than 8,000 shops includes a total of 6,400 alphabetic words. As most of them are very new and idiosyncratic proper nouns, it is impractical to assume that all those alphabetic words can be registered in the word dictionary of a text-to-speech synthesis system; their pronunciations must be derived automatically. Our solution consists of two steps. Step 1 classifies each unknown alphabetic word into a say-as class (English, Japanese, French, Italian or English spell-out), which indicates how it is to be read, and Step 2 derives the pronunciation using the grapheme-to-phoneme conversion rules for the classified say-as class. This paper proposes a method for say-as classification (i.e. Step 1) that uses the Support Vector Machine. After some trial and error, we achieved 89.2% accuracy on web shop data, which we consider sufficient for practical use.

Automatic Transformation of Environmental Sounds into Sound-Imitation Words Based on Japanese Syllable Structure
Kazushi Ishihara, Yasushi Tsubota, Hiroshi G. Okuno; Kyoto University, Japan
Sound-imitation words, a sound-related subset of onomatopoeia, are important for computer-human interaction and automatic tagging of sound archives. The main problem in automatic recognition of sound-imitation words is that the literal representation of such words is dependent on listeners and influenced by a particular cultural history. Based on our preliminary experiments on such dependency and on sonority theory, we discovered that the process of transforming environmental sounds into syllable-structure expressions is mostly listener-independent, while that of transforming syllable-structure expressions into sound-imitation words is mostly listener-dependent and influenced by culture. This paper focuses on the former, listener-independent process and presents a three-stage architecture for the automatic transformation of environmental sounds into sound-imitation words: segmenting sound signals into syllables, identifying syllable structure as morae, and recognizing morae as phonemes.
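To illustrate the kind of surface features the say-as classification in the Asano et al. abstract above can rely on, the sketch below extracts character n-gram counts from alphabetic words and classifies them with a trivial nearest-centroid rule standing in for the paper's Support Vector Machine; the toy training words and class labels are invented.

```python
from collections import Counter

def char_ngrams(word, n=2):
    """Character n-gram counts with word-boundary markers."""
    padded = f"^{word.lower()}$"
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))

def centroid(words):
    """Aggregate n-gram counts over all training words of one class."""
    total = Counter()
    for w in words:
        total.update(char_ngrams(w))
    return total

def score(word, cls_centroid):
    """Simple overlap score between the word's n-grams and a class centroid."""
    return sum(min(c, cls_centroid[g]) for g, c in char_ngrams(word).items())

# Invented toy training data for two say-as classes.
training = {"english": ["market", "station", "center", "service"],
            "japanese_romaji": ["sakura", "tanaka", "ramen", "matsuri"]}
centroids = {cls: centroid(words) for cls, words in training.items()}

for test_word in ["garden", "sashimi"]:
    best = max(centroids, key=lambda cls: score(test_word, centroids[cls]))
    print(test_word, "->", best)
```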
Decision Tree-Based Simultaneous Clustering of Phonetic Contexts, Dimensions, and State Positions for Acoustic Modeling
Heiga Zen, Keiichi Tokuda, Tadashi Kitamura; Nagoya Institute of Technology, Japan
In this paper, a new decision tree-based clustering technique called the Phonetic, Dimensional and State Positional Decision Tree (PDS-DT) is proposed. In PDS-DT, phonetic contexts, dimensions and state positions are grouped simultaneously during decision tree construction. PDS-DT provides a complex distribution-sharing structure without any external control parameters. In speaker-independent continuous speech recognition experiments, PDS-DT achieved about 13%–15% error reduction over the phonetic decision tree-based state-tying technique.

A Statistical Method of Evaluating Pronunciation Proficiency for English Words Spoken by Japanese
Seiichi Nakagawa, Kazumasa Mori, Naoki Nakamura; Toyohashi University of Technology, Japan
In this paper, we propose a statistical method for evaluating the pronunciation proficiency of English words spoken by Japanese. We statistically analyze the utterances to find a combination of acoustic features that has a high correlation with an English teacher's score. We found that the likelihood ratio of English phoneme acoustic models to phoneme acoustic models adapted by Japanese was the best measure of pronunciation proficiency. The combination of the likelihood for American native models, the likelihood for English models adapted by Japanese, the best likelihood for arbitrary sequences of acoustic models, the phoneme recognition rate and the rate of speech is highly related to the English teacher's score. We obtained correlation coefficients of 0.81 with open data for vocabulary and 0.69 with open data for speakers at the five-word-set level, respectively. These coefficients were higher than the correlation between humans' scores, 0.65.

Author Index

A Aalburg, Stefanie . . . . . . . . . . . . . . . . . . . . . . . . . . 52 Abad, Alberto . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57 Abdou, Sherif . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40 Abe, Masanobu . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95 Abe, Masanobu . . . . . . . . . . . . . . . . . . . . . . . . . . . 112 Abrash, Victor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 Abt, Michael . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78 Abu-Amer, Tarek . . . . . . . . . . . . . . . . . . . . . . . . . . 90 Abutalebi, H.R. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49 Acero, Alex . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 Acero, Alex . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24 Acero, Alex . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33 Acero, Alex . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43 Acero, Alex . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44 Acero, Alex . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57 Acero, Alex . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98 Acero, Alex . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100 Adami, André G. . . . . . . . . . . . . . . . . . . . . . . . . . . . 30 Adami, André G. . . . . . . . . . . . . . . . . . . . . . . . . . . . 94 Adams, Jeff . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . 87 Adda-Decker, Martine . . . . . . . . . . . . . . . . . . . . . . 8 Adda-Decker, Martine . . . . . . . . . . . . . . . . . . . . . 10 Adelhardt, Johann . . . . . . . . . . . . . . . . . . . . . . . . . 26 Afify, Mohamed . . . . . . . . . . . . . . . . . . . . . . . . . . 107 Ahadi, S.M. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 Ahkuputra, Visarut . . . . . . . . . . . . . . . . . . . . . . . . 65 Ahn, Dong-Hoon . . . . . . . . . . . . . . . . . . . . . . . . . . 39 Ahn, Sungjoo . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24 Aikawa, Kiyoaki . . . . . . . . . . . . . . . . . . . . . . . . . . . 23 Aikawa, Kiyoaki . . . . . . . . . . . . . . . . . . . . . . . . . . . 68 Airey, S.S. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 Akagi, Masato . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 Akagi, Masato . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49 Akahane-Yamada, Reiko . . . . . . . . . . . . . . . . . 111 Akbacak, Murat . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77 Akiba, Tomoyosi . . . . . . . . . . . . . . . . . . . . . . . . . . 40 Akiba, Tomoyosi . . . . . . . . . . . . . . . . . . . . . . . . . . 51 Akita, Yuya . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105 Al Bawab, Ziad . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43 Albesano, Dario . . . . . . . . . . . . . . . . . . . . . . . . . . 107 Alecksandrovich, Oleg . . . . . . . . . . . . . . . . . . . . 69 Alexander, Anil . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25 Alías, Francesc. . . . . . . . . . . . . . . . . . . . . . . . . . . . .47 Alías, Francesc . . . . . . . . . . . . . . . . . . . . . . . . . . . 104 Alku, Paavo . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 Allen, James . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68 Allu, Gopi Krishna . . . . . . . . . . . . . . . . . . . . . . . . . . 1 Al-Naimi, Khaldoon . . . . . . . . . . . . . . . . . . . . . . . 50 Alonso-Romero, L. . . . . . . . . . . . . . . . . . . . . . . . . . 93 Alouane, M. Turki-Hadj . . . . . . . . . . . . . . . . . . . 49 Alshawi, Hiyan . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33 Alsteris, Leigh . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75 Altun, Yasemin . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35 Álvarez, A. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48 Alwan, Abeer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43 Alwan, Abeer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77 Amaral, Rui . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44 Amir, Noam . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26 Andersen, Ove . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47 Anderson, A.H. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 Anderson, David V. . . . . . . . . . . . . . . . . . . . . . . . . 38 Anderson, David V. . . . . . . . . . . . . . . . . . . . . . . . . 76 Andorno, M. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68 Andrassy, Bernt . . . . . . . . . . . . . . . . . . . . . . . . . . . 40 Andrassy, Bernt . . . . . . . . . . . . . . . . . . . . . . . . . . . 76 Angkititrakul, Pongtep . . . . . . . . . . . . . . . . . . . . 47 Antoine, Jean-Yves . . . . . . . . . . . . . . . . . . . . . . . . 98 Arai, Takayuki . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
48 Araki, Masahiro. . . . . . . . . . . . . . . . . . . . . . . . . . . .67 Arcienega, Mijail . . . . . . . . . . . . . . . . . . . . . . . . . . . 94 Arehart, Kathryn . . . . . . . . . . . . . . . . . . . . . . . . . . 50 Arifianto, Dhany . . . . . . . . . . . . . . . . . . . . . . . . . 102 Ariki, Yasuo . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34 Ariki, Yasuo . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51 Ariki, Yasuo . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63 Ariki, Yasuo . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89 Ariyaeeinia, Aladdin M. . . . . . . . . . . . . . . . . . . . 43 Ariyaeeinia, Aladdin M. . . . . . . . . . . . . . . . . . . . 94 Armani, Luca . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18 Arranz, Victoria . . . . . . . . . . . . . . . . . . . . . . . . . . . 56 Arroabarren, Ixone . . . . . . . . . . . . . . . . . . . . . . . . . 3 Arroabarren, Ixone . . . . . . . . . . . . . . . . . . . . . . . . 62 Arrowood, Jon A. . . . . . . . . . . . . . . . . . . . . . . . . 109 Arslan, Levent M. . . . . . . . . . . . . . . . . . . . . . . . . . . 74 September 1-4, 2003 – Geneva, Switzerland Arslan, Levent M. . . . . . . . . . . . . . . . . . . . . . . . . 101 Asano, Futoshi . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 Asano, Futoshi . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80 Asano, Hisako . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95 Asano, Hisako . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112 Ashley, J.P. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44 Asoh, Hideki . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80 Astrov, Sergey . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40 Atal, Bishnu S. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 Atlas, Les . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91 Attwater, David . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68 Au, Ching-Pong . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75 Au, Wing-Hei. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .32 Aubergé, Véronique . . . . . . . . . . . . . . . . . . . . . . . . 7 Audibert, Nicolas . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 Axelrod, Scott . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57 Axelrod, Scott . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66 Axelrod, Scott . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92 Axelrod, Scott . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92 Aylett, Matthew. . . . . . . . . . . . . . . . . . . . . . . . . . . .12 B Baca, Julie A. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68 Bach, Nguyen Hung . . . . . . . . . . . . . . . . . . . . . . . . . 7 Bachenko, Joan . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 Backfried, Gerhard . . . . . . . . . . . . . . . . . . . . . . . . 55 Bäckström, Tom . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 Badran, Ahmed . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 Bailly, Gerard . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 Baker, Kirk . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 Bakis, R. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58 Bakx, Ilse. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .79 Balakrishnan, N. . . . . . . . . . . . . . . . . . . . . . . . 
. . . 108 Balakrishnan, Sreeram V. . . . . . . . . . . . . . . . . . 53 Baltazani, Mary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 Banerjee, Satanjeev . . . . . . . . . . . . . . . . . . . . . . 111 Banerjee, Satanjeev . . . . . . . . . . . . . . . . . . . . . . 112 Banga, Eduardo R. . . . . . . . . . . . . . . . . . . . . . . . . . 11 Bänziger, Tanja . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58 Bänziger, Tanja . . . . . . . . . . . . . . . . . . . . . . . . . . 106 Bard, E.G. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 Barrachina, Sergio . . . . . . . . . . . . . . . . . . . . . . . . . . 9 Barreaud, Vincent . . . . . . . . . . . . . . . . . . . . . . . . . 53 Baskind, Alexis . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29 Batliner, Anton . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26 Bauer, Josef . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76 Bauerecker, Hermann . . . . . . . . . . . . . . . . . . . . . 31 Baus, Jörg . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46 Bazzi, Issam. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3 Beaufays, Françoise . . . . . . . . . . . . . . . . . . . . . . . 92 Beaugeant, Christophe . . . . . . . . . . . . . . . . . . . . 58 Beaumont, Jean-François . . . . . . . . . . . . . . . . . . 43 Beaumont, Jean-François . . . . . . . . . . . . . . . . . . 43 Béchet, Frédéric . . . . . . . . . . . . . . . . . . . . . . . . . . . 22 Béchet, Frédéric . . . . . . . . . . . . . . . . . . . . . . . . . . . 23 Beck, Joseph E. . . . . . . . . . . . . . . . . . . . . . . . . . . . 111 Beck, Joseph E. . . . . . . . . . . . . . . . . . . . . . . . . . . . 112 Beddoes, Michael . . . . . . . . . . . . . . . . . . . . . . . . . . 86 Belfield, William . . . . . . . . . . . . . . . . . . . . . . . . . . . 44 Bell, Linda . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22 Bellegarda, Jerome R. . . . . . . . . . . . . . . . . . . . . . 71 Bellot, Olivier . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57 Benítez, Carmen . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 Benítez, Carmen . . . . . . . . . . . . . . . . . . . . . . . . . 107 Benlahouar, A. . . . . . . . . . . . . . . . . . . . . . . . . . . . 105 Bennett, Christina L. . . . . . . . . . . . . . . . . . . . . . . 12 Bennett, Christina L. . . . . . . . . . . . . . . . . . . . . . 104 BenZeghiba, Mohamed Faouzi . . . . . . . . . . . . 48 Berdahl, Edgar . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18 Beringer, N. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28 Bernard, Alexis . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77 Bernsen, Niels Ole . . . . . . . . . . . . . . . . . . . . . . . . . 26 Berthommier, Frédéric . . . . . . . . . . . . . . . . . . . . 37 Besacier, Laurent . . . . . . . . . . . . . . . . . . . . . . . . . . 56 Besacier, Laurent . . . . . . . . . . . . . . . . . . . . . . . . . 110 Beskow, Jonas . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80 Bettens, F. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84 Beutler, René . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25 Beutler, René . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33 Bigi, Brigitte . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110 Bijankhan, Mahmood . . . . . . . . . . . . . . . . . . . . . . 54 Bilmes, Jeff . . . . . . . . . . . . . . . . . . . . . . . . 
Bimbot, Frédéric . . . 2
Binnenpoorte, Diana . . . 54
Bisani, M. . . . 33
Black, Alan W. . . . 12
Black, Alan W. . . . 14
Black, Alan W. . . . 27
Black, Alan W. . . . 46
Black, Alan W. . . . 58
Black, Alan W. . . . 72
Black, Alan W. . . . 72
Black, Alan W. . . . 104
Black, Lois . . . 58
Bloom, Jonathan . . . 79
Boë, Louis-Jean . . . 2
Boëffard, Olivier . . . 32
Boëffard, Olivier . . . 46
Bohus, Dan . . . 21
Bonafonte, Antonio . . . 30
Bonafonte, Antonio . . . 56
Bonafonte, Antonio . . . 81
Bonastre, Jean-François . . . 2
Bonastre, Jean-François . . . 57
Bonastre, Jean-François . . . 71
Bonneau-Maynard, Hélène . . . 8
Bonneau-Maynard, Hélène . . . 10
Borys, S. . . . 15
Boštík, Milan . . . 84
Boulianne, Gilles . . . 43
Boulianne, Gilles . . . 43
Boulianne, Gilles . . . 94
Bourgeois, Julien . . . 61
Bourlard, Hervé . . . 34
Bourlard, Hervé . . . 48
Bourlard, Hervé . . . 89
Bouzid, Aïcha . . . 101
Boves, Lou . . . 73
Boye, Johan . . . 23
Bozkurt, Baris . . . 10
Bratt, Harry . . . 14
Bratt, Harry . . . 71
Braun, Bettina . . . 28
Braunschweiler, Norbert . . . 46
Breen, Andrew P. . . . 10
Breen, Andrew P. . . . 86
Brennan, R.L. . . . 49
Brennan, R.L. . . . 102
Brito, Iván . . . 78
Broeders, A.P.A. . . . 25
Brousseau, Julie . . . 43
Brown, Guy J. . . . 17
Brungart, Douglas S. . . . 37
Burger, Susanne . . . 56
Burnett, Ian . . . 39
Burns, John . . . 26
Byrne, William J. . . . 64
Byrne, William J. . . . 70
C
Caldés, Roser Jaquemot . . . 55
Campbell, Joseph P. . . . 2
Campbell, Joseph P. . . . 94
Campbell, Nick . . . 2
Campbell, Nick . . . 15
Campbell, Nick . . . 57
Campbell, Nick . . . 92
Campbell, W.M. . . . 47
Campillo Díaz, Francisco . . . 11
Cao, Zhigang . . . 62
Capeletto, Matias L. . . . 102
Cardeñoso, Valentín . . . 81
Cardinal, Patrick . . . 43
Cardinal, Patrick . . . 43
Carey, Michael J. . . . 78
Carey, Michael J. . . . 107
Carlosena, Alfonso . . . 3
Carlosena, Alfonso . . . 62
Carmichael, James . . . 41
Carmichael, James . . . 78
Carriço, Luís . . . 56
Carson-Berndsen, Julie . . . 90
Caseiro, Diamantino . . . 56
Cassaca, Renato . . . 68
Castell, Núria . . . 56
Castelli, Eric . . . 110
Castro, María José . . . 23
Cathiard, Marie-Agnès . . . 6
Cattoni, Roldano . . . 56
Cawley, Gavin . . . 23
Cerisara, Christophe . . . 108
Cerisara, Christophe . . . 108
Černocký, Jan . . . 29
Černocký, Jan . . . 29
Černocký, Jan . . . 63
Cesari, Federico . . . 14
Çetin, Özgür . . . 89
Cettolo, Mauro . . . 106
Chambel, Teresa . . . 56
Chan, C.F. . . . 20
Chan, Kin-Wah . . . 89
Chan, Kwokleung . . . 5
Chan, Shuk Fong . . . 59
Chang, Eric . . . 59
Chang, Eric . . . 71
Chang, Eric . . . 85
Chang, Eric . . . 105
Chang, Joon-Hyuk . . . 37
Chang, Pi-Chuan . . . 16
Chang, Sen-Chia . . . 11
Chang, Shuangyu . . . 30
Chang, Wen-Whei . . . 48
Chapdelaine, Claude . . . 43
Chapman, James . . . 78
Charbit, Maurice . . . 38
Charlet, Delphine . . . 93
Charnvivit, Patavee . . . 6
Charoenpornsawat, Paisarn . . . 12
Chateau, N. . . . 5
Chatzichrisafis, N. . . . 55
Chaudhari, Upendra . . . 70
Chaudhari, Upendra . . . 91
Chelba, Ciprian . . . 98
Chen, Aoju . . . 4
Chen, Barry . . . 17
Chen, Barry . . . 30
Chen, Boxing . . . 82
Chen, Fang . . . 37
Chen, Gao Peng . . . 31
Chen, Hsin-Hsi . . . 98
Chen, Jau-Hung . . . 11
Chen, Jia-fu . . . 2
Chen, K. . . . 15
Chen, Shun-Chuan . . . 2
Chen, Shun-Chuan . . . 82
Chen, Shun-Chuan . . . 100
Chen, Shun-Chuan . . . 100
Chen, Sin-Horng . . . 5
Chen, Sin-Horng . . . 109
Chen, Stanley F. . . . 41
Chen, Stanley F. . . . 47
Chen, Stanley F. . . . 71
Chen, Tsuhan . . . 71
Chen, Y. . . . 8
Chen, Yining . . . 85
Chen, Yiya . . . 28
Chen, Zhenbiao . . . 69
Cheng, Shi-sian . . . 33
Cheng, Yan Ming . . . 110
Cheung, Ming-Cheung . . . 105
Chiang, Yuan-Chuan . . . 48
Chiang, Yuang-Chin . . . 42
Chiang, Yuang-Chin . . . 66
Chien, Jen-Tzung . . . 95
Ching, P.C. . . . 44
Choi, Chi-Ho . . . 88
Choi, Frederick . . . 79
Choi, Jin-Kyu . . . 38
Chollet, Gérard . . . 38
Chonavel, Thierry . . . 61
Chou, Wu . . . 53
Chou, Wu . . . 57
Chou, Wu . . . 100
Choukri, Khalid . . . 54
Choy, G. . . . 102
Choy, Thomas . . . 10
Chu, Min . . . 85
Chu, Min . . . 105
Chu, Wai C. . . . 37
Chung, Grace . . . 8
Chung, Grace . . . 27
Chung, Hyun-Yeol . . . 51
Chung, Hyun-Yeol . . . 88
Chung, Jaeho . . . 51
Chung, Minhwa . . . 10
Chung, Minhwa . . . 39
Church, Kenneth Ward . . . 1
Cieri, Christopher . . . 56
Cieri, Christopher . . . 56
Çilingir, Onur . . . 41
Ciloglu, Tolga . . . 41
Class, Fritz . . . 109
Clements, Mark A. . . . 103
Clements, Mark A. . . . 109
Cohen, Arnon . . . 105
Cohen, Gilead . . . 58
Cohen, Rachel . . . 26
Cole, J. . . . 15
Cole, Ronald A. . . . 103
Comeau, Michel . . . 43
Comeau, Michel . . . 43
Conejero, David . . . 56
Cook, Norman D. . . . 4
Corazza, Anna . . . 27
Córdoba, R. . . . 64
Córdoba, R. . . . 95
Cornu, E. . . . 102
Corr, Pat . . . 93
Cortes, Corinna . . . 35
Cosi, Piero . . . 80
Couvreur, Christophe . . . 63
Cox, Stephen J. . . . 10
Cox, Stephen J. . . . 23
Cox, Stephen J. . . . 60
Cox, Stephen J. . . . 67
Cranen, Bert . . . 79
Creutz, Mathias . . . 41
Creutz, Mathias . . . 81
Cruz-Zeno, E.M. . . . 44
Cui, Xiaodong . . . 77
Cummins, Fred . . . 28
Cunningham, Stuart . . . 78
Cutugno, Francesco . . . 103
Czigler, Peter E. . . . 93
D
Dahan, Jean-Gui . . . 79
d’Alessandro, Christophe . . . 5
d’Alessandro, Christophe . . . 58
d’Alessandro, Christophe . . . 84
Dang, Jianwu . . . 34
Dang, Jianwu . . . 83
Daoudi, Khalid . . . 24
Daubias, Philippe . . . 55
Dayanidhi, Krishna . . . 79
de Cheveigné, Alain . . . 29
de Gelder, Beatrice . . . 2
de Jong, Franciska . . . 9
de la Torre, Ángel . . . 13
de la Torre, Ángel . . . 107
Deléglise, Paul . . . 55
Deller Jr., J.R. . . . 104
Delmonte, Rodolfo . . . 70
Demirekler, Mübeccel . . . 41
Demirekler, Mübeccel . . . 55
Demirekler, Mübeccel . . . 85
Demiroglu, Cenk . . . 76
De Mori, Renato . . . 22
De Mori, Renato . . . 107
Demuynck, Kris . . . 1
Demuynck, Kris . . . 40
Demuynck, Kris . . . 69
Denda, Yuki . . . 76
Denecke, Matthias . . . 79
Deng, Huiqun . . . 86
Deng, Li . . . 3
Deng, Li . . . 24
Deng, Yonggang . . . 33
Devillers, Laurence . . . 7
Devillers, Laurence . . . 8
de Villiers, Jacques . . . 58
Deviren, Murat . . . 24
De Wachter, Mathias . . . 40
de Wet, Febe . . . 65
Dewhirst, Oliver . . . 43
Dharanipragada, Satya . . . 64
Dharanipragada, Satya . . . 89
D’Haro, L.F. . . . 64
Di, Fengying . . . 67
Diakoloukas, V. . . . 55
Diao, Qian . . . 90
Digalakis, Vassilios . . . 55
Digalakis, Vassilios . . . 81
Dimitriadis, Dimitrios . . . 101
Ding, Pei . . . 62
Ding, Peng . . . 35
Ding, Peng . . . 69
Dobrišek, Simon . . . 45
Docio-Fernandez, Laura . . . 75
Dognin, Pierre L. . . . 66
Dohen, Marion . . . 6
Dohsaka, Kohji . . . 23
Dong, Minghui . . . 10
Doss, Mathew Magimai . . . 21
Doumpiotis, Vlasios . . . 70
Doval, Boris . . . 58
Draxler, Chr. . . . 55
Droppo, Jasha . . . 24
Droppo, Jasha . . . 44
Drygajlo, Andrzej . . . 25
Drygajlo, Andrzej . . . 37
Drygajlo, Andrzej . . . 94
Du, Limin . . . 20
Du, Limin . . . 20
Du, Limin . . . 62
Du, Limin . . . 82
Duchateau, Jacques . . . 13
Duchateau, Jacques . . . 95
Dufour, Sophie . . . 73
du Jeu, Charles . . . 38
Dumouchel, Pierre . . . 43
Dumouchel, Pierre . . . 94
Dumouchel, Pierre . . . 105
Dunn, Robert B. . . . 94
Dupont, Stéphane . . . 63
Duraiswami, Ramani . . . 3
Durston, Peter . . . 68
Dusan, Sorin . . . 79
Dutoit, Thierry . . . 10
Duxans, Helenca . . . 30
E
Edmondson, William . . . 90
Eggleton, Barry . . . 10
Eggleton, Barry . . . 86
Ehrette, T. . . . 5
Eickeler, Stefan . . . 42
Eide, E. . . . 58
Ekanadham, Chaitanya J.K. . . . 22
El-Jaroudi, Amro . . . 66
Ellis, Daniel P.W. . . . 30
Ellouze, Noureddine . . . 101
Emami, Ahmad . . . 15
Emele, Martin . . . 67
Emele, Martin . . . 68
Emonts, Michael . . . 81
Enderby, Pam . . . 41
Endo, Toshiki . . . 108
Eneman, Koen . . . 95
Engwall, Olov . . . 80
En-Najjary, Taoufik . . . 61
Eriksson, Erik . . . 93
Escudero, David . . . 81
Eskenazi, Maxine . . . 27
Espy-Wilson, Carol . . . 85
Estève, Yannick . . . 22
Estienne, Claudio F. . . . 80
Evans, Nicholas W.D. . . . 102
F
Fabian, Tibor . . . 32
Fackrell, Justin . . . 11
Fackrell, Justin . . . 12
Fackrell, Justin . . . 87
Fagel, Sascha . . . 87
Fakotakis, Nikos . . . 5
Fakotakis, Nikos . . . 19
Fakotakis, Nikos . . . 21
Fakotakis, Nikos . . . 59
Fakotakis, Nikos . . . 60
Fakotakis, Nikos . . . 78
Falavigna, Daniele . . . 61
Fang, Xiaoshan . . . 52
Farrell, Mark . . . 68
Faulkner, Andrew . . . 45
Federico, Marcello . . . 14
Fedorenko, Evelina . . . 56
Fegyó, Tibor . . . 29
Fegyó, Tibor . . . 67
Ferreira, L. . . . 78
Ferreiros, J. . . . 64
Ferreiros, J. . . . 95
Ferrer, Luciana . . . 71
Filisko, Edward . . . 8
Fingscheidt, Tim . . . 58
Fink, Gernot A. . . . 90
Fischer, V. . . . 109
Fishler, Eran . . . 19
Fissore, L. . . . 68
Flanagan, James . . . 79
Flecha-Garcia, M.L. . . . 8
Fohr, Dominique . . . 52
Fohr, Dominique . . . 53
Fonollosa, José A.R. . . . 88
Fortuna, J. . . . 94
Fousek, Petr . . . 63
Franco, Horacio . . . 14
François, Hélène . . . 46
Frangi, Alejandro F. . . . 80
Frank, Carmen . . . 26
Fränti, Pasi . . . 93
Franz, Martin . . . 33
Frederking, Robert . . . 14
Freeman, G.H. . . . 49
Freitas, Diamantino . . . 7
Freitas, Diamantino . . . 15
Freitas, Diamantino . . . 82
Fu, Guokang . . . 107
Fu, Qiang . . . 21
Fujii, Atsushi . . . 40
Fujii, Atsushi . . . 40
Fujii, Atsushi . . . 51
Fujimoto, Ichiro . . . 85
Fujimoto, Masakiyo . . . 51
Fujimoto, Masakiyo . . . 62
Fujimoto, Masakiyo . . . 63
Fujinaga, Katsuhisa . . . 96
Fujisaki, Hiroya . . . 6
Fujisaki, Hiroya . . . 7
Fujisaki, Hiroya . . . 15
Fujisaki, Hiroya . . . 31
Fujisaki, Hiroya . . . 82
Fujisawa, Takeshi . . . 4
Fukuda, Takashi . . . 77
Fukuda, Takashi . . . 77
Fukuda, Takashi . . . 77
Fukudome, Kimitoshi . . . 74
Fung, Pascale . . . 62
Fung, Pascale . . . 97
Fung, Tien-Ying . . . 44
Furui, Sadaoki . . . 24
Furui, Sadaoki . . . 34
Furui, Sadaoki . . . 70
Furui, Sadaoki . . . 81
Furui, Sadaoki . . . 95
Furui, Sadaoki . . . 98
Furui, Sadaoki . . . 100
Furuyama, Yusuke . . . 6
Fusaro, Andrea . . . 80
G
Gadbois, Gregory J. . . . 79
Gadde, Venkata R.R. . . . 71
Gales, M.J.F. . . . 31
Gales, M.J.F. . . . 70
Galescu, Lucian . . . 9
Gaminde, I. . . . 54
Ganchev, Todor . . . 59
Gao, Hualin . . . 68
Gao, Jianfeng . . . 52
Gao, Sheng . . . 65
Gao, W. . . . 27
Gao, Yuqing . . . 14
Gao, Yuqing . . . 98
Gao, Yuqing . . . 99
Garcia-Gomar, M. . . . 25
Garcia-Romero, D. . . . 25
Garg, Ashutosh . . . 35
Gates, Donna . . . 14
Gauvain, Jean-Luc . . . 66
Gedge, Oren . . . 63
Gelbart, David . . . 75
Gemello, Roberto . . . 107
Gendrin, Frédéric . . . 111
Georgila, K. . . . 78
Gfroerer, Stefan . . . 25
Gharavian, D. . . . 14
Ghasedi, Mohammad E. . . . 54
Ghasemi, Seyyed Z. . . . 54
Ghulam, Muhammad . . . 77
Gibbon, Dafydd . . . 27
Gibbon, Dafydd . . . 29
Gibbon, Dafydd . . . 80
Gibson, Edward . . . 56
Gieselmann, Petra . . . 79
Gillett, Ben . . . 4
Gillett, Ben . . . 60
Gilloire, André . . . 48
Giménez, Jesús . . . 56
Girão, J. . . . 78
Girin, Laurent . . . 49
Gish, Herbert . . . 44
Glass, James . . . 89
Gleason, T.P. . . . 47
Gnaba, H. . . . 49
Goel, Vaibhava . . . 57
Goel, Vaibhava . . . 66
Goel, Vaibhava . . . 92
Gomes, D. . . . 78
Gómez, Angel M. . . . 38
Gómez, Angel M. . . . 97
Gómez, P. . . . 48
Gonzalez-Rodriguez, J. . . . 25
Goodman, Bryan R. . . . 79
Gopinath, Ramesh . . . 57
Gopinath, Ramesh . . . 92
Gopinath, Ramesh . . . 92
Gori, Marco . . . 64
Gorin, Allen . . . 33
Goronzy, Silke . . . 67
Goronzy, Silke . . . 68
Gorrell, Genevieve . . . 97
Goto, Masataka . . . 42
Goto, Masataka . . . 43
Goulian, Jérôme . . . 98
Gouvêa, Evandro . . . 41
Grandjean, D. . . . 106
Granström, Björn . . . 103
Grant, Ken W. . . . 90
Grashey, Stephan . . . 61
Green, James . . . 93
Green, Phil . . . 41
Green, Phil . . . 64
Green, Phil . . . 78
Greenberg, Steven . . . 2
Greenberg, Steven . . . 90
Greenberg, Steven . . . 90
Grenez, F. . . . 84
Grézl, František . . . 36
Grieco, John J. . . . 53
Gu, Liang . . . 98
Gu, Liang . . . 99
Gu, Wentao . . . 87
Gu, Zhenglai . . . 86
Guan, Cuntai . . . 10
Guan, Qi . . . 46
Guimarães, Nuno . . . 56
Guitarte Pérez, Jesús F. . . . 80
Gül, Yilmaz . . . 43
Gunawardana, Asela . . . 57
Guo, Changchen . . . 73
Guo, Rui . . . 32
Gurijala, A.R. . . . 104
Gustafson, Joakim . . . 22
Gustman, Samuel . . . 64
Gut, Ulrike . . . 29
H
Hacioglu, Kadri . . . 41
Hacker, Christian . . . 34
Hacker, Christian . . . 46
Haffner, Patrick . . . 35
Hajdinjak, Melita . . . 68
Hajič, Jan . . . 64
Hajič, Jan . . . 81
Hakkani-Tür, Dilek Z. . . . 23
Hakkani-Tür, Dilek Z. . . . 64
Hakkani-Tür, Dilek Z. . . . 99
Hakkani-Tür, Dilek Z. . . . 99
Häkkinen, Juha . . . 8
Hakulinen, Jaakko . . . 27
Hakulinen, Jaakko . . . 67
Hamada, Nozomu . . . 60
Hammervold, Kathrine . . . 87
Hamza, W. . . . 58
Han, Jiang . . . 41
Han, Zhaobing . . . 75
Hanna, Philip . . . 21
Hanna, Philip . . . 93
Hansakunbuntheung, Chatchawarn . . . 4
Hansen, Jesse . . . 1
Hansen, John H.L. . . . 26
Hansen, John H.L. . . . 45
Hansen, John H.L. . . . 45
Hansen, John H.L. . . . 47
Hansen, John H.L. . . . 50
Hansen, John H.L. . . . 64
Hansen, John H.L. . . . 77
Hao, Jiucang . . . 5
Harding, Sue . . . 74
Hardy, Hilda . . . 8
Harris, David M. . . . 53
Hartikainen, Elviira . . . 54
Hasegawa-Johnson, Mark . . . 15
Hasegawa-Johnson, Mark . . . 18
Hasegawa-Johnson, Mark . . . 88
Hashimoto, Yoshikazu . . . 7
Hatano, Toshie . . . 6
Haton, Jean-Paul . . . 15
Haton, Jean-Paul . . . 98
Hatzis, Athanassios . . . 41
Hatzis, Athanassios . . . 78
Hautamäki, Ville . . . 93
Haverinen, Hemmo . . . 108
Hawley, Mark . . . 41
Hayakawa, S. . . . 67
Hazen, Timothy J. . . . 15
Hazen, Timothy J. . . . 23
Hazen, Timothy J. . . . 69
He, Wei . . . 67
He, Xiaodong . . . 53
He, Xiaodong . . . 57
Hébert, Matthieu . . . 58
Heck, Larry P. . . . 58
Hedelin, Per . . . 39
Heeman, Peter A. . . . 8
Hegde, Rajesh M. . . . 103
Heikkinen, Ari . . . 38
Heikkinen, Ari . . . 38
Heikkinen, Ari . . . 61
Heisterkamp, Paul . . . 103
Helbig, Jörg . . . 54
Hell, Benjamin . . . 29
Hell, Benjamin . . . 80
Heracleous, Panikos . . . 19
Heracleous, Panikos . . . 32
Hermann, D. . . . 102
Hermansky, Hynek . . . 16
Hermansky, Hynek . . . 30
Hermansky, Hynek . . . 30
Hermansky, Hynek . . . 36
Hermansky, Hynek . . . 36
Hermansky, Hynek . . . 94
Hernaez, I. . . . 54
Hernando, Javier . . . 57
Hernando, Javier . . . 82
Hetherington, Lee . . . 69
Higashinaka, Ryuichiro . . . 68
Hilario, Joan Marí . . . 109
Hilger, Florian . . . 13
Himanen, Sakari . . . 61
Hioka, Yusuke . . . 60
Hiraiwa, Akira . . . 96
Hirose, Keikichi . . . 6
Hirose, Keikichi . . . 12
Hirose, Keikichi . . . 14
Hirose, Keikichi . . . 31
Hirose, Keikichi . . . 59
Hirose, Keikichi . . . 73
Hirose, Keikichi . . . 87
Hirose, Keikichi . . . 92
Hirose, Keikichi . . . 106
Hirose, Keikichi . . . 111
Hirose, Keikichi . . . 111
Hirsbrunner, Béat . . . 16
Hirschberg, Julia . . . 26
Hirschberg, Julia . . . 41
Hirschfeld, Diane . . . 54
Hirsimäki, Teemu . . . 81
Ho, Ching-Hsiang . . . 85
Ho, Ching-Hsiang . . . 104
Ho, Man-Cheuk . . . 44
Ho, Purdy P. . . . 105
Ho, Simon . . . 31
Ho, Yuan . . . 2
Hodgson, Murray . . . 86
Hodoshima, Nao . . . 48
Hoege, Harald . . . 52
Hoequist, Charles . . . 47
Hofmann, Thomas . . . 35
Hogden, John . . . 49
Höge, Harald . . . 54
Hohmann, Volker . . . 50
Holada, Miroslav . . . 112
Honal, Matthias . . . 99
Honda, Kiyoshi . . . 83
Honda, Kiyoshi . . . 85
Hori, Chiori . . . 100
Hori, Chiori . . . 100
Hori, Takaaki . . . 100
Hori, Takaaki . . . 100
Horiuchi, Yasuo . . . 6
Horlock, James . . . 44
Horlock, James . . . 98
Horvat, Bogomir . . . 63
Horvat, Bogomir . . . 89
Hosokawa, Yuta . . . 7
Hou, Zhaorong . . . 49
House, David . . . 103
Hozjan, Vladimir . . . 5
Hsu, Chun-Nan . . . 66
Hu, Fang . . . 28
Hu, Sheng . . . 69
Hu, Wei . . . 90
Hu, Yu . . . 31
Huang, Chao-Shih . . . 17
Huang, Jing . . . 91
Huang, Qiang . . . 67
Huang, Shan . . . 90
Huerta, Juan M. . . . 22
Huo, Qiang . . . 1
Huo, Qiang . . . 34
Hwang, Tai-Hwei . . . 77
I
Ichikawa, Akira . . . 6
Ichimura, Naoyuki . . . 80
Iizuka, Yosuke . . . 34
Illina, Irina . . . 52
Illina, Irina . . . 53
Illina, Irina . . . 108
Imamura, A. . . . 110
Inagaki, Yasuyoshi . . . 55
Inkelas, Sharon . . . 54
Inoue, Akira . . . 42
Inoue, Tsuyoshi . . . 48
Ipšić, Ivo . . . 68
Ircing, Pavel . . . 64
Ircing, Pavel . . . 66
Irie, Yuki . . . 55
Irino, Toshio . . . 20
Irino, Toshio . . . 81
Irino, Toshio . . . 86
Iriondo, Ignasi . . . 104
Isei-Jaakkola, Toshiko . . . 4
Iser, Bernd . . . 20
Ishi, Carlos Toshinori . . . 15
Ishihara, Kazushi . . . 112
Ishikawa, Tetsuya . . . 40
Isobe, T. . . . 67
Itakura, Fumitada . . . 67
Itakura, Fumitada . . . 86
Ito, Akinori . . . 18
Ito, Ryosuke . . . 60
Ito, Toshihiko . . . 7
Itoh, Nobuyasu . . . 16
Itou, Katunobu . . . 40
Itou, Katunobu . . . 40
Itou, Katunobu . . . 42
Itou, Katunobu . . . 43
Itou, Katunobu . . . 51
Iwaki, Mamoru . . . 73
Iwami, Yohei . . . 85
Iyengar, Giridharan . . . 91
J
Jackson, Philip J.B. . . . 82
Jackson, Philip J.B. . . . 97
Jafer, Essa . . . 20
Jafer, Essa . . . 61
Jaidane-Saidane, M. . . . 49
Jain, Pratibha . . . 16
Jain, Pratibha . . . 36
James, A.B. . . . 95
Jamoussi, Salma . . . 98
Jan, E.E. . . . 44
Jančovič, Peter . . . 76
Jang, Dalwon . . . 101
Jang, Dalwon . . . 102
Jang, Gyucheol . . . 57
Janke, E. . . . 109
Jansen, E.J.M. . . . 25
Jesus, Luis M.T. . . . 104
Jia, Chuan . . . 35
Jia, Ying . . . 49
Jiang, Jing . . . 88
Jin, Jianhong . . . 78
Jin, Minho . . . 57
Jitapunkul, Somchai . . . 6
Jitapunkul, Somchai . . . 65
Jitsuhiro, Takatoshi . . . 96
Johnstone, T. . . . 106
Jokisch, Oliver . . . 7
Jones, Douglas A. . . . 56
Jones, Douglas A. . . . 69
Jovičić, Slobodan . . . 19
Ju, Gwo-hwa . . . 19
Ju, Gwo-hwa . . . 48
Juan, A. . . . 97
Jung, Ho-Youl . . . 88
Junqua, Jean-Claude . . . 13
Junqua, Jean-Claude . . . 65
Junqua, Jean-Claude . . . 71
Jutten, Christian . . . 49
K
Kabal, Peter . . . 50
Kaburagi, Tokihiko . . . 17
Kačič, Zdravko . . . 5
Kačič, Zdravko . . . 63
Kačič, Zdravko . . . 89
Kain, Alexander B. . . . 12
Kain, Alexander B. . . . 58
Kajarekar, Sachin S. . . . 71
Kajarekar, Sachin S. . . . 94
Kakutani, Naoko . . . 22
Kallulli, Dalina . . . 60
Kam, Patgi . . . 52
Kaneko, Tsuyoshi . . . 51
Kang, Hong-Goo . . . 38
Kanokphara, Supphanat . . . 28
Kanthak, S. . . . 40
Karjalainen, Matti . . . 87
Karlsson, Inger . . . 45
Kashioka, Hideki . . . 92
Kasuya, Hideki . . . 86
Kasuya, Hideki . . . 86
Katagiri, Shigeru . . . 49
Katagiri, Yasuhiro . . . 6
Kato, Hiroaki . . . 111
Katz, M. . . . 101
Kawaguchi, Nobuo . . . 55
Kawahara, Hideki . . . 20
Kawahara, Hideki . . . 75
Kawahara, Hideki . . . 76
Kawahara, Hideki . . . 86
Kawahara, Tatsuya . . . 16
Kawahara, Tatsuya . . . 26
Kawahara, Tatsuya . . . 60
Kawahara, Tatsuya . . . 65
Kawahara, Tatsuya . . . 105
Kawahara, Hideki . . . 111
Kawai, Hisashi . . . 11
Kawai, Hisashi . . . 17
Kawai, Koji . . . 17
Kawanami, Hiromichi . . . 79
Kawanami, Hiromichi . . . 85
Kellner, Andreas . . . 67
Kenicer, D. . . . 8
Kenny, P. . . . 105
Képesi, Marián . . . 3
Kerstholt, J.H. . . . 25
Kessens, Judith M. . . . 65
Keung, Chi-Kin . . . 44
Khayrallah, Ali . . . 19
Khioe, Beatrice Fung-Wah . . . 84
Khitrov, M. . . . 20
Khudanpur, Sanjeev . . . 110
Kienappel, Anne K. . . . 42
Kienappel, Anne K. . . . 52
Kikui, Genichiro . . . 9
Kikui, Genichiro . . . 14
Kikui, Genichiro . . . 96
Kikui, Genichiro . . . 98
Kikuiri, Kei . . . 39
Killer, Mirjam . . . 111
Kim, Chong Kyu . . . 11
Kim, D.Y. . . . 70
Kim, Hyoung-Gook . . . 18
Kim, Hyoung-Gook . . . 20
Kim, Hyung Soon . . . 52
Kim, Hyun Woo . . . 13
Kim, Jiun . . . 51
Kim, Jong Uk . . . 50
Kim, Jong Uk . . . 101
Kim, Kwang-Dong . . . 51
Kim, Nam Soo . . . 11
Kim, Nam Soo . . . 13
Kim, Nam Soo . . . 37
Kim, SangGyun . . . 50
Kim, SangGyun . . . 101
Kim, Taeyoon . . . 31
Kim, Wooil . . . 24
Kim, Woosung . . . 110
Kim, Young Joon . . . 13
King, Simon . . . 4
King, Simon . . . 11
King, Simon . . . 44
King, Simon . . . 60
King, Simon . . . 61
King, Simon . . . 61
King, Simon . . . 98
Kingsbury, Brian . . . 57
Kingsbury, Brian . . . 66
Kingsbury, Brian . . . 70
Kinnunen, Tomi . . . 93
Kinoshita, Keisuke . . . 48
Kiran, G.V. . . . 3
Kiriyama, Shinya . . . 7
Kishida, Itsuki . . . 55
Kishon-Rabin, Liat . . . 73
. . 73 Kishore, S.P. . . . 46 Kiss, Imre . . . 108 Kita, Kenji . . . 76 Kitamura, Tadashi . . . 31 Kitamura, Tadashi . . . 87 Kitamura, Tadashi . . . 93 Kitamura, Tadashi . . . 112 Kitaoka, Norihide . . . 22 Kitaoka, Norihide . . . 22 Kitaoka, Norihide . . . 62 Kitaoka, Norihide . . . 96 Kitawaki, Nobuhiko . . . 80 Kitayama, Koji . . . 43 Kitazawa, Shigeyoshi . . . 7 Klabbers, Esther . . . 12 Klabbers, Esther . . . 58 Klabbers, Esther . . . 88 Klakow, Dietrich . . . 42 Klankert, Tanja . . . 46 Klasmeyer, G. . . . 106 Kleijn, W. Bastiaan . . . 38 Kleijn, W. Bastiaan . . . 49 Klein, Alexandra . . . 36 Kleinschmidt, Michael . . . 50 Kleinschmidt, Michael . . . 91 Kneissler, Jan . . . 42 Ko, Hanseok . . . 24 Ko, Hanseok . . . 31 Kobayashi, Akio . . . 39 Kobayashi, Takao . . . 87 Kobayashi, Takao . . . 102 Kobayashi, Tetsunori . . . 42 Kobayashi, Tetsunori . . . 43 Kobayashi, Tetsunori . . . 45 Kocsor, András . . . 73 Kodama, Yasuhiro . . . 42 Kojima, Hiroaki . . . 88 Kokkinakis, George . . . 5 Kokkinakis, George . . . 21 Kokkinakis, George . . . 60 Kokkinakis, George . . . 78 Kokkinos, Iasonas . . . 29 Kokubo, Hiroaki . . . 96 Köküer, Münevver . . . 76 Kolář, Jáchym . . . 55 Kollmeier, Birger . . . 1 Koloska, Uwe . . . 54 Komatani, Kazunori . . . 26 Komatani, Kazunori . . .
. . 60 Komatsu, Miki . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74 Kominek, John. . . . . . . . . . . . . . . . . . . . . . . . . . . . .12 Kondo, Aki . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 Kondoz, Ahmet . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50 Kordik, Alex . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37 Korkmazsky, Filipp . . . . . . . . . . . . . . . . . . . . . . . 52 Korkmazsky, Filipp . . . . . . . . . . . . . . . . . . . . . . . 53 Kotnik, Bojan . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63 Koumpis, Konstantinos . . . . . . . . . . . . . . . . . . . 99 Koval, S. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20 Krasny, Leonid . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 Krbec, Pavel. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .81 Krishnan, Venkatesh . . . . . . . . . . . . . . . . . . . . . . 38 Krueger, Antonio . . . . . . . . . . . . . . . . . . . . . . . . . . 37 Krüger, S.E. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101 Kruschke, Hans . . . . . . . . . . . . . . . . . . . . . . . . . . 102 Kryze, David . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 Kubala, Francis . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33 Kubala, Francis . . . . . . . . . . . . . . . . . . . . . . . . . . . 100 Kühne, Marco . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 Kukolich, Linda C. . . . . . . . . . . . . . . . . . . . . . . . . . 69 Kumar, Arun . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103 Kumaresan, Ramdas . . . . . . . . . . . . . . . . . . . . . . . . 1 Kummert, Franz . . . . . . . . . . . . . . . . . . . . . . . . . . . 90 Kung, Sun-Yuan . . . . . . . . . . . . . . . . . . . . . . . . . . . 59 Kung, Sun-Yuan . . . . . . . . . . . . . . . . . . . . . . . . . . 105 Kung, Sun-Yuan . . . . . . . . . . . . . . . . . . . . . . . . . . 105 Kunzmann, S. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109 Kuo, Chih-Chung . . . . . . . . . . . . . . . . . . . . . . . . . . 11 Kuo, Chi-Shiang . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 Kuo, Wei-Chih . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109 Kurimo, Mikko . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41 Kurimo, Mikko . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81 Kuroiwa, Shingo . . . . . . . . . . . . . . . . . . . . . . . . . . . 62 Kuroiwa, Shingo . . . . . . . . . . . . . . . . . . . . . . . . . . . 76 Kuroiwa, Shingo . . . . . . . . . . . . . . . . . . . . . . . . . . 108 Kusumoto, Akiko . . . . . . . . . . . . . . . . . . . . . . . . . . 48 Kuwabara, Hisao . . . . . . . . . . . . . . . . . . . . . . . . . . 72 Kwok, Philip . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41 Kwon, Oh-Wook . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 Kwon, Oh-Wook . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 Kwon, Soonil. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .94 L Laaksonen, Lasse . . . . . . . . . . . . . . . . . . . . . . . . . . 61 Lackey, B.C. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 Lacroix, A. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84 Ladd, D. Robert . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28 Laface, P. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68 Lähdekorpi, Marja . . . . . . . . . . . . . . . . . . . . . . . 
. . 38 Lahti, Tommi . . . 88 Lai, Wen-Hsing . . . 5 Lai, Yiu-Pong . . . 1 Lambert, T. . . . 10 Lamel, Lori . . . 66 Lamere, Paul . . . 41 Lamere, Paul . . . 62 Lane, Ian R. . . . 16 Langlois, David . . . 15 Langner, Brian . . . 27 Lapidot, Itshak . . . 106 Larsen, Lars Bo . . . 68 Larson, Martha . . . 42 Lashkari, Khosrow . . . 60 Lasn, Jürgen . . . 56 Lathoud, Guillaume . . . 102 Laureys, Tom . . . 69 Lauri, Fabrice . . . 52 Lavie, Alon . . . 14 Lawson, Aaron D. . . . 53 Le, Viet Bac . . . 110 Lee, Akinobu . . . 52 Lee, Chen-Long . . . 48 Lee, Chin-Hui . . . 9 Lee, Chin-Hui . . . 17 Lee, Chin-Hui . . . 65 Lee, Chul Min . . . 6 Lee, Daniel D. . . . 35 Lee, J.H. . . . 48 Lee, J.J. . . . 48 Lee, K.Y. . . . 48 Lee, Kyong-Nim . . . 10 Lee, Lin-shan . . . 2 Lee, Lin-shan . . . 16 Lee, Lin-shan . . . 19 Lee, Lin-shan . . . 48 Lee, Lin-shan . . . 82 Lee, Lin-shan . . . 100 Lee, Lin-shan . . . 100 Lee, Sunil . . . 101 Lee, Sunil . . . 102 Lee, Tan . . . 52 Lee, Tan . . . 65 Lee, Te-Won . . . 1 Lee, Te-Won . . .
5 Lee, Te-Won. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .30 Lee, Yun-Tien. . . . . . . . . . . . . . . . . . . . . . . . . . . . .100 Lees, Nicole . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72 Lefevre, Fabrice . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66 Lenz, Michael. . . . . . . . . . . . . . . . . . . . . . . . . . . . .102 Lenzo, Kevin A. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 Lenzo, Kevin A. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72 Leonov, A.S. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83 Levin, David N. . . . . . . . . . . . . . . . . . . . . . . . . . . . 104 Levin, Lori . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 Levit, Michael . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33 Levow, Gina . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83 Li, Aijun . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83 Li, Haizhou . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 Li, Honglian . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67 Li, Jianfeng . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52 Li, Li . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100 Li, Stan Z. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59 Li, Ta-Hsin . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107 Li, Xiang . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30 Li, Xiaolong . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80 Li, Yujia . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65 Li, Yuk-Chi . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44 Li, Yuk-Chi . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83 Liang, Min-Siong . . . . . . . . . . . . . . . . . . . . . . . . . . . 66 Liao, Shuo-Peng . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 Liao, Yuan-Fu . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109 Lickley, R.J. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 Lieb, Robert . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32 Light, Joanna . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79 Lim, Sung . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103 Lim, Woohyung . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 Lima, Amaro . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93 Lin, Jeng-Shien . . . . . . . . . . . . . . . . . . . . . . . . . . . 109 Lin, Li-Feng . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109 Lin, Xiaofan . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26 Lin, Yi-Chung . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22 Linares, Georges . . . . . . . . . . . . . . . . . . . . . . . . . . . 57 Linhard, Klaus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63 Liscombe, Jackson. . . . . . . . . . . . . . . . . . . . . . . . .26 Liu, Chen . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110 Liu, Feng . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100 Liu, Fu-Hua . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99 Liu, Jia . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85 Liu, Jian . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
41 Liu, Jingwei . . . 17 Liu, Runsheng . . . 85 Liu, Wei M. . . . 102 Liu, Xingkun . . . 21 Liu, Yang . . . 34 Liu, Yi . . . 97 Livescu, Karen . . . 89 Lleida, Eduardo . . . 50 Llorà, Xavier . . . 47 Lloyd-Thomas, Harvey . . . 78 Lo, Tin-Hang . . . 44 Lo, Wai-Kit . . . 44 Lo, Wai-Kit . . . 83 Lobacheva, Yuliya . . . 87 Locher, Ivo . . . 43 Lœvenbruck, Hélène . . . 6 Lonsdale, Deryle . . . 81 Looks, Karin . . . 29 Lu, Ching-Ta . . . 19 Lu, Meirong . . . 112 Lu, Yiqing . . . 54 Lucey, Simon . . . 71 Luengo, I. . . . 54 Lukas, Klaus . . . 80 Luksaneeyanawin, Sudaporn . . . 6 Luksaneeyanawin, Sudaporn . . . 65 Luo, Yu . . . 62 Luong, Mai Chi . . . 7 Lyu, Dau-Cheng . . . 66 Lyu, Ren-Yuan . . . 42 Lyu, Ren-Yuan . . . 66 M Ma, Changxue . . . 110 Ma, Chengyuan . . . 59 Ma, Chengyuan . . . 71 Ma, L. . . . 79 Maase, Jens . . . 54 Macherey, Klaus . . . 22 Macherey, Wolfgang . . . 18 Macías-Guarasa, J. . . . 64 Macías-Guarasa, J. . . . 95 MacLaren, Victoria . . . 56 Macrostie, Ehry . . . 79 Maeda, Sakashi . . . 23 Maegaard, Bente . . . 54 Maeki, Daiju . . . 15 Maffiolo, V. . . . 5 Magimai-Doss, Mathew . . .
89 Magrin-Chagnolleau, Ivan . . . . . . . . . . . . . . . . . . 2 Mahajan, Milind . . . . . . . . . . . . . . . . . . . . . . . . . . . 33 Mahajan, Milind . . . . . . . . . . . . . . . . . . . . . . . . . . . 43 Mahdi, Abdulhussain E. . . . . . . . . . . . . . . . . . . . 20 Mahdi, Abdulhussain E. . . . . . . . . . . . . . . . . . . . 61 Mahdi, Abdulhussain E. . . . . . . . . . . . . . . . . . . . 73 Mahdi, Abdulhussain E. . . . . . . . . . . . . . . . . . . . 84 Mahé, Gaël . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48 Maia, R. da S. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87 Maison, Benoît . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44 Maison, Benoît . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47 Maison, Benoît . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90 Mak, Brian . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 Mak, Brian . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89 Mak, Man-Wai . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59 Mak, Man-Wai . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105 Mak, Man-Wai . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105 Makhoul, John . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79 Makino, Shozo . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18 Maloor, Preetam . . . . . . . . . . . . . . . . . . . . . . . . . . . 59 Maltese, Giulio . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54 Mamede, Nuno J. . . . . . . . . . . . . . . . . . . . . . . . . . . 68 Mami, Yassine . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93 Mana, Franco . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107 Mana, Nadia . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56 Manabe, Hiroyuki . . . . . . . . . . . . . . . . . . . . . . . . . . 96 Maneenoi, Ekkarit . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 Maneenoi, Ekkarit . . . . . . . . . . . . . . . . . . . . . . . . . 65 Manfredi, Claudia . . . . . . . . . . . . . . . . . . . . . . . . . 84 Mangu, Lidia . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44 Mangu, Lidia . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66 Mangu, Lidia . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70 Mapelli, Valerie . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54 Maragos, Petros . . . . . . . . . . . . . . . . . . . . . . . . . . . 29 Maragos, Petros . . . . . . . . . . . . . . . . . . . . . . . . . . 101 Maragoudakis, M. . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 Markov, Konstantin . . . . . . . . . . . . . . . . . . . . . . . 34 Martens, Jean-Pierre . . . . . . . . . . . . . . . . . . . . . . . 33 Martin, Alvin F. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47 Martin, Arnaud . . . . . . . . . . . . . . . . . . . . . . . . . . . 108 Martin, Terrence . . . . . . . . . . . . . . . . . . . . . . . . . 110 Martin, Terrence . . . . . . . . . . . . . . . . . . . . . . . . . 110 Martinčić-Ipšić, Sanda . . . . . . . . . . . . . . . . . . . . . 68 Martínez, R. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48 Martinez, Roberto . . . . . . . . . . . . . . . . . . . . . . . . 104 Masaki, Shinobu . . . . . . . . . . . . . . . . . . . . . . . . . . . 85 Masgrau, Enrique . . . . . . . . . . . . . . . . . . . . . . . . . . 50 Maskey, Sameer Raj . . . . . . . . . . . . . . . . . . . . . . . 41 Mason, John S.D. . . . . . . . . . . . . . . 
. . 102 Massaro, Dominic W. . . . 79 Masuko, Takashi . . . 87 Matassoni, Marco . . . 18 Matassoni, Marco . . . 64 Matějka, Pavel . . . 29 Matoušek, Jindřich . . . 11 Matrouf, Driss . . . 57 Matsubara, Shigeki . . . 55 Matsui, Hisami . . . 75 Matsui, Tomoko . . . 16 Matsui, Tomoko . . . 96 Matsunaga, S. . . . 110 Matsuoka, Bungo . . . 14 Matsushita, Masahiko . . . 42 Matsuura, Daisuke . . . 96 Mattys, Sven L. . . . 5 Mau, Peter . . . 43 Mauuary, Laurent . . . 108 Mayfield Tomokiyo, Laura . . . 14 Mayfield Tomokiyo, Laura . . . 72 McCowan, Iain A. . . . 34 McCowan, Iain A. . . . 102 McDermott, Erik . . . 49 McDonough, John . . . 36 McDonough, John . . . 56 McQueen, James M. . . . 74 McTait, Kevin . . . 8 McTear, Michael . . . 21 Meinedo, Hugo . . . 44 Meister, Einar . . . 56 Meister, Lya . . . 56 Melenchón, Javier . . . 104 Melnar, Lynette . . . 110 Meng, Helen M. . . . 44 Meng, Helen M. . . . 59 Meng, Helen M. . . . 83 Meng, Helen M. . . . 99 Mertins, Alfred . . . 39 Mertz, Frank . . . 38 Metze, Florian . . . 36 Metze, Florian . . . 56 Meuwly, Didier . . . 25 Meyer, Georg . . . 74 Miao, Cailian . . . 107 Mihajlik, Péter . . . 29 Mihajlik, Péter . . . 67 Mihelič, France . . . 45 Mihelič, France . . . 68 Mihoubi, M. . . .
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94 Mihoubi, M. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105 Mikami, Takayoshi . . . . . . . . . . . . . . . . . . . . . . . . 42 Miki, Kazuhiro . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76 Miki, Nobuhiro . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83 Miki, Toshio . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37 Miki, Toshio . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60 Miller, David . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56 Miller, David . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56 Milner, Ben P. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 Milner, Ben P. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60 Milner, Ben P. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79 Milner, Ben P. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95 Milner, Ben P. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95 Minami, Yasuhiro . . . . . . . . . . . . . . . . . . . . . . . . 100 Minematsu, Nobuaki . . . . . . . . . . . . . . . . . . . . . . . . 6 Minematsu, Nobuaki . . . . . . . . . . . . . . . . . . . . . . 12 Minematsu, Nobuaki . . . . . . . . . . . . . . . . . . . . . . 14 Minematsu, Nobuaki . . . . . . . . . . . . . . . . . . . . . . 31 Minematsu, Nobuaki . . . . . . . . . . . . . . . . . . . . . . 59 Minematsu, Nobuaki . . . . . . . . . . . . . . . . . . . . . . 73 Minematsu, Nobuaki . . . . . . . . . . . . . . . . . . . . . . 92 Minematsu, Nobuaki . . . . . . . . . . . . . . . . . . . . . 106 Minematsu, Nobuaki . . . . . . . . . . . . . . . . . . . . . 111 Minematsu, Nobuaki . . . . . . . . . . . . . . . . . . . . . 111 Ming, Ji. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .93 Minnis, Steve . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86 Mírovsky, Jirí . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64 Mishra, Taniya . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58 Mishra, Taniya . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88 Misra, Hemant . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34 Misra, Hemant . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64 Mitsuta, Yoshifumi . . . . . . . . . . . . . . . . . . . . . . . . . 7 Mittal, U. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44 Mixdorff, Hansjörg. . . . . . . . . . . . . . . . . . . . . . . . . .7 Mixdorff, Hansjörg . . . . . . . . . . . . . . . . . . . . . . . . 31 Miyajima, Chiyomi . . . . . . . . . . . . . . . . . . . . . . . . 93 Miyanaga, Yoshikazu . . . . . . . . . . . . . . . . . . . . . . 83 Miyazaki, Noboru. . . . . . . . . . . . . . . . . . . . . . . . . .68 Mizumachi, Mitsunori . . . . . . . . . . . . . . . . . . . . . 21 Mizumachi, Mitsunori . . . . . . . . . . . . . . . . . . . . . 62 Mizutani, T. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67 Möbius, Bernd . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 Möbius, Bernd . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46 Mohri, Mehryar . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35 Mok, Oi Yan. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .59 Mokhtari, Parham . . . . . . . . . . . . . . . . . . . . . . . . . 15 Möller, Sebastian . . . . . . . . . . . . . . . . . . . . . . . . . . 69 Montero, J.M. . . . . . . . . . . . . . . . . . . . . . . . . . 
. . 64 Montero, J.M. . . . 95 Moonen, Marc . . . 95 Moore, Darren C. . . . 102 Moore, Roger K. . . . 91 Moore, Roger K. . . . 103 Moreau, Nicolas . . . 18 Moreau, Nicolas . . . 20 Morel, Michel . . . 58 Moreno, Asunción . . . 54 Moreno, Asunción . . . 56 Moreno, David M. . . . 82 Moreno, Pedro J. . . . 105 Morgan, Nelson . . . 75 Mori, Hiroki . . . 86 Mori, Hiroki . . . 86 Mori, Kazumasa . . . 112 Mori, Shinsuke . . . 16 Morimoto, Tsuyoshi . . . 23 Morin, Philippe . . . 71 Moro-Sancho, Q. . . . 93 Morris, Andrew . . . 64 Morris, Robert W. . . . 109 Mostow, Jack . . . 111 Mostow, Jack . . . 112 Motlíček, Petr . . . 29 Motlíček, Petr . . . 63 Motomura, Yoichi . . . 80 Moudenc, Thierry . . . 32 Mouri, Taro . . . 111 Mukherjee, Niloy . . . 80 Müller, Christian . . . 46 Muller, J.S. . . . 6 Mullin, J. . . . 8 Murao, H. . . . 67 Murtagh, Fionn . . . 76 Murthy, Hema A. . . . 103 Muto, Makiko . . . 15 Myrvoll, Tor André . . . 53 N Nadeu, Climent . . . 31 Nadeu, Climent . . . 57 Nagarajan, T. . . . 103 Nagata, Masaaki . . . 95 Nagata, Masaaki . . . 112 Naito, Takuro . . . 15 Naka, Nobuhiko . . . 39 Nakadai, Kazuhiro . . . 96 Nakagawa, Seiichi . . .
22 Nakagawa, Seiichi . . . . . . . . . . . . . . . . . . . . . . . . . 22 Nakagawa, Seiichi . . . . . . . . . . . . . . . . . . . . . . . . . 42 Nakagawa, Seiichi . . . . . . . . . . . . . . . . . . . . . . . . . 96 Nakagawa, Seiichi . . . . . . . . . . . . . . . . . . . . . . . . 106 Nakagawa, Seiichi . . . . . . . . . . . . . . . . . . . . . . . . 112 Nakajima, Hideharu . . . . . . . . . . . . . . . . . . . . . . . 95 Nakajima, Yoshitaka . . . . . . . . . . . . . . . . . . . . . . 92 Nakamura, Naoki . . . . . . . . . . . . . . . . . . . . . . . . 112 Nakamura, Norio . . . . . . . . . . . . . . . . . . . . . . . . . . 73 Nakamura, Satoshi . . . . . . . . . . . . . . . . . . . . . . . . . . 2 Nakamura, Satoshi . . . . . . . . . . . . . . . . . . . . . . . . 16 Nakamura, Satoshi . . . . . . . . . . . . . . . . . . . . . . . . 19 Nakamura, Satoshi . . . . . . . . . . . . . . . . . . . . . . . . 21 Nakamura, Satoshi . . . . . . . . . . . . . . . . . . . . . . . . 24 Nakamura, Satoshi . . . . . . . . . . . . . . . . . . . . . . . . 34 Nakamura, Satoshi . . . . . . . . . . . . . . . . . . . . . . . . 44 Nakamura, Satoshi . . . . . . . . . . . . . . . . . . . . . . . . 62 Nakamura, Satoshi . . . . . . . . . . . . . . . . . . . . . . . . 76 Nakamura, Satoshi . . . . . . . . . . . . . . . . . . . . . . . . 80 Nakamura, Satoshi . . . . . . . . . . . . . . . . . . . . . . . . 96 Nakamura, Satoshi . . . . . . . . . . . . . . . . . . . . . . . 108 Nakano, Mikio . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 Nakano, Mikio . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68 Nakasone, Hirotaka . . . . . . . . . . . . . . . . . . . . . . . 25 Nakatani, Tomohiro . . . . . . . . . . . . . . . . . . . . . . . 81 Nakatani, Tomohiro . . . . . . . . . . . . . . . . . . . . . . . 86 Nankaku, Yoshihiko . . . . . . . . . . . . . . . . . . . . . . . 93 Narayanan, Shrikanth . . . . . . . . . . . . . . . . . . . . . . 6 Narayanan, Shrikanth . . . . . . . . . . . . . . . . . . . . . 39 Narayanan, Shrikanth . . . . . . . . . . . . . . . . . . . . . 43 Narayanan, Shrikanth . . . . . . . . . . . . . . . . . . . . . 94 Narayanan, Shrikanth . . . . . . . . . . . . . . . . . . . . 111 Narusawa, Shuichi . . . . . . . . . . . . . . . . . . . . . . . . . . 6 Narusawa, Shuichi . . . . . . . . . . . . . . . . . . . . . . . . . 82 Natarajan, Ajay . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50 Natarajan, Premkumar . . . . . . . . . . . . . . . . . . . . 79 Navarro-Mesa, Juan L. . . . . . . . . . . . . . . . . . . . . . 86 Navas, E. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54 Navrátil, Jiří. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .71 Navrátil, Jiří. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .94 Nedel, Jon P. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53 Nefti, Samir . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32 Neti, Chalapathy . . . . . . . . . . . . . . . . . . . . . . . . . . . 45 Neto, João P. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44 Neto, João P. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68 Neubarth, Friedrich . . . . . . . . . . . . . . . . . . . . . . . 46 Neukirchen, Christoph . . . . . . . . . . . . . . . . . . . . 92 Newell, Alan F. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46 Ney, Hermann . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 Ney, Hermann . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
13 Ney, Hermann . . . 18 Ney, Hermann . . . 18 Ney, Hermann . . . 22 Ney, Hermann . . . 33 Ney, Hermann . . . 39 Ney, Hermann . . . 40 Ney, Hermann . . . 51 Nguyen, Patrick . . . 13 Nguyen, Patrick . . . 65 Nguyen, Phu Chien . . . 16 Ni, Jinfu . . . 17 Nicholson, H.B.M. . . . 8 Niemann, Heinrich . . . 34 Nieto, V. . . . 48 Nigra, M. . . . 68 Niimi, Yasuhisa . . . 67 Nikléczy, P. . . . 87 Niklfeld, Georg . . . 46 Nishida, Masafumi . . . 65 Nishikawa, Tsuyoki . . . 20 Nishimura, Masafumi . . . 16 Nishiura, Takanobu . . . 62 Nishiura, Takanobu . . . 76 Nishiura, Takanobu . . . 76 Nishizaki, Hiromitsu . . . 42 Nishizawa, Nobuyuki . . . 31 Nitta, Tsuneo . . . 77 Nitta, Tsuneo . . . 77 Nitta, Tsuneo . . . 77 Niu, Xiaochuan . . . 58 Nix, Johannes . . . 50 Nocera, Pascal . . . 57 Nock, Harriet J. . . . 91 Nordén, Fredrik . . . 39 Norris, Dennis . . . 74 Nöth, Elmar . . . 26 Nöth, Elmar . . . 33 Nöth, Elmar . . . 34 Nöth, Elmar . . . 46 Nouza, Jan . . . 112 Novak, Miroslav . . . 40 Nukinay, Masumi . . . 111 Nurminen, Jani . . . 37 Nurminen, Jani . . . 38 Nurminen, Jani . . . 38 Nurminen, Jani . . . 61 O Obuchi, Yasunari . . . 24 Och, Franz J. . . .
. . 39 Odijk, Jan . . . 54 Oflazer, Kemal . . . 54 Ogata, Jun . . . 34 Ogata, Jun . . . 51 Ogata, Jun . . . 89 Ogawa, A. . . . 110 Ogawa, Tetsuji . . . 45 Ogawa, Yoshihiko . . . 9 Ogawa, Yoshio . . . 112 Oh, Se-Jin . . . 51 Ohkawa, Yuichi . . . 18 Ohno, Sumio . . . 82 Ohya, Tomoyuki . . . 39 Oikonomidis, Dimitrios . . . 55 Oikonomidis, Dimitrios . . . 81 Okada, Jiro . . . 62 Okawa, Shigeki . . . 28 Okimoto, Mamiko . . . 112 Okuno, Hiroshi G. . . . 26 Okuno, Hiroshi G. . . . 96 Okuno, Hiroshi G. . . . 112 Olaszy, G. . . . 87 Oliveira, Luís C. . . . 12 Oliveira, Luís C. . . . 68 Olsen, Peder A. . . . 89 Olsen, Peder A. . . . 92 Omar, Mohamed Kamal . . . 18 Omar, Mohamed Kamal . . . 88 Omologo, Maurizio . . . 18 Omoto, Yukihiro . . . 42 O’Neill, Ian . . . 21 O’Neill, Peter . . . 78 Onishi, Koji . . . 87 Ono, Takayuki . . . 12 Ordelman, Roeland . . . 9 Ordóñez, J. . . . 64 Orlandi, Marco . . . 61 Ortega, Alfonso . . . 50 Ortega, Antonio . . . 39 Ortega-Garcia, J. . . . 25 Ortega-Garcia, J. . . . 93 Osaki, Koichi . . . 92 O’Shaughnessy, Douglas . . . 36 O’Shaughnessy, Douglas . . . 109 Ostendorf, Mari . . . 89 Osterrath, Frédéric . . . 43 Otake, Takashi . . . 29 Otake, Takashi . . .
. . . . . . . . . . . .74 Otsuji, Kiyotaka . . . . . . . . . . . . . . . . . . . . . . . . . . . 24 Ouellet, Pierre . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43 Ouellet, Pierre . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43 Ozeki, Kazuhiko . . . . . . . . . . . . . . . . . . . . . . . . . 112 Ozeki, Kazuhiko . . . . . . . . . . . . . . . . . . . . . . . . . 112 Ozturk, Ozlem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 Ozturk, Ozlem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41 P Padrell, Jaume . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 Padrell, Jaume . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57 Padrta, Aleš . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107 Pakucs, Botond . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26 Paliwal, Kuldip K. . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 Paliwal, Kuldip K. . . . . . . . . . . . . . . . . . . . . . . . . . . 24 Paliwal, Kuldip K. . . . . . . . . . . . . . . . . . . . . . . . . . . 30 Paliwal, Kuldip K. . . . . . . . . . . . . . . . . . . . . . . . . . . 44 Paliwal, Kuldip K. . . . . . . . . . . . . . . . . . . . . . . . . . . 75 Palmer, Rebecca . . . . . . . . . . . . . . . . . . . . . . . . . . . 78 Pan, Jielin . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41 Pardo, J.M. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95 Paredes, R. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97 Parihar, N. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 Park, Alex . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69 Park, Jong Se . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52 Park, Seung Seop . . . . . . . . . . . . . . . . . . . . . . . . . . 11 Park, Young-Hee . . . . . . . . . . . . . . . . . . . . . . . . . . . 39 Parker, Mark . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41 Parker, Mark . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78 Parveen, Shahla . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64 Pascual, Neus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56 Patterson, Roy D. . . . . . . . . . . . . . . . . . . . . . . . . . . 20 Paulo, Sérgio . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12 Pavešić, Nikola. . . . . . . . . . . . . . . . . . . . . . . . . . . . .45 Peereman, Ronald . . . . . . . . . . . . . . . . . . . . . . . . . 73 Peinado, Antonio M. . . . . . . . . . . . . . . . . . . . . . . . 38 Peinado, Antonio M. . . . . . . . . . . . . . . . . . . . . . . . 97 Pelecanos, Jason . . . . . . . . . . . . . . . . . . . . . . . . . 106 Pelle, Patricia A. . . . . . . . . . . . . . . . . . . . . . . . . . . 102 Pellom, Bryan . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41 Pellom, Bryan . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55 Pellom, Bryan . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85 Peng, Hu . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105 Peretti, Giorgio . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84 Pérez-Córdoba, José L. . . . . . . . . . . . . . . . . . . . . 38 Petek, Bojan . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81 Peters, S. Douglas . . . . . . . . . . . . . . . . . . . . . . . . . 66 Petrillo, Massimo . . . . . . . . . . . . . . . . . . . . . . . . . 103 Petrinovic, Davor . . . . . . . . . . . . . . . . . . . . . . . . 
. . 31 Petrinovic, Davor . . . 39 Petrinovic, Davorka . . . 39 Petrushin, Valery A. . . . 111 Pfister, Beat . . . 25 Pfister, Beat . . . 33 Pfister, Beat . . . 72 Pfitzinger, Hartmut R. . . . 29 Phillips, Michael . . . 79 Piano, Lawrence . . . 78 Piantanida, Juan P. . . . 80 Picheny, Michael . . . 33 Picheny, Michael . . . 66 Picheny, Michael . . . 98 Picheny, Michael . . . 99 Picone, Joseph . . . 13 Picone, Joseph . . . 68 Picovici, Dorel . . . 73 Pieraccini, Roberto . . . 79 Pitrelli, J. . . . 58 Pitsikalis, Vassilis . . . 29 Pitz, Michael . . . 51 Pobloth, Harald . . . 38 Podveský, Petr . . . 81 Poeppel, David . . . 90 Pohjalainen, Jouni . . . 103 Poirier, Franck . . . 98 Polifroni, Joseph . . . 8 Pollák, Petr . . . 63 Pols, Louis C.W. . . . 27 Popovici, C. . . . 68 Portele, Thomas . . . 67 Potamianos, Gerasimos . . . 45 Potamitis, Ilyas . . . 19 Potamitis, Ilyas . . . 19 Potamitis, Ilyas . . . 21 Potamitis, Ilyas . . . 60 Potamitis, Ilyas . . . 78 Povey, D. . . . 70 Prasad, K. Venkatesh . . . 79 Prasad, Rashmi . . . 59 Prasanna, S.R. Mahadeva . . . 3 Prasanna, S.R. Mahadeva . . . 21 Prasanna Kumar, K.R. . . . 108 Pratsolis, D. . . . 55 Precoda, Kristin . . . 14 Prieto, Ramon . . . 88 Prime, G. . . .
95 Prodanov, Plamen . . . . . . . . . . . . . . . . . . . . . . . . . 37 Przybocki, Mark A. . . . . . . . . . . . . . . . . . . . . . . . . 47 Psutka, Josef . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 Psutka, Josef . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55 Psutka, Josef . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64 Psutka, Josef . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66 Psutka, J.V. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64 Pucher, Michael . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46 Puder, Henning . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49 Rodriguez, Francisco Romero . . . . . . . . . . . 102 Roh, Duk-Gyoo . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51 Romportl, Jan . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55 Romsdorfer, Harald . . . . . . . . . . . . . . . . . . . . . . . 72 Roohani, Mahmood R. . . . . . . . . . . . . . . . . . . . . . 54 Rosec, Olivier . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61 Rosenhouse, Judith . . . . . . . . . . . . . . . . . . . . . . . 73 Rosset, Sophie . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 Rosset, Sophie . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 Rossi-Katz, Jessica A. . . . . . . . . . . . . . . . . . . . . . 50 Rothkrantz, Leon J.M. . . . . . . . . . . . . . . . . . . . . . 32 Roweis, Sam T. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35 Roy, Deb . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80 Rubio, Antonio J. . . . . . . . . . . . . . . . . . . . . . . . . . . 13 Rubio, Antonio J. . . . . . . . . . . . . . . . . . . . . . . . . . . 97 Rubio, Antonio J. . . . . . . . . . . . . . . . . . . . . . . . . . 107 Rudnicky, Alexander I. . . . . . . . . . . . . . . . . . . . . 21 Rudnicky, Alexander I. . . . . . . . . . . . . . . . . . . . . 66 Ruiz, Diego . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40 Ruske, Günther . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32 Ruske, Günther . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76 Russell, Martin J. . . . . . . . . . . . . . . . . . . . . . . . . . . 78 Russell, Martin J. . . . . . . . . . . . . . . . . . . . . . . . . . . 82 Russell, Martin J. . . . . . . . . . . . . . . . . . . . . . . . . . . 97 Rutten, Peter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 Rutten, Peter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12 Q Saarinen, Jukka . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38 Saarinen, Jukka . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38 Sagisaka, Yoshinori . . . . . . . . . . . . . . . . . . . . . . . . . 4 Sagisaka, Yoshinori . . . . . . . . . . . . . . . . . . . . . . . . . 7 Sagisaka, Yoshinori . . . . . . . . . . . . . . . . . . . . . . . . . 9 Sagisaka, Yoshinori . . . . . . . . . . . . . . . . . . . . . . . 15 Sai Jayram, A.K.V. . . . . . . . . . . . . . . . . . . . . . . . . . 47 Saito, Mutsumi . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74 Sakamoto, Yoko . . . . . . . . . . . . . . . . . . . . . . . . . . . 29 Sakata, Keigo . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49 Salonen, Esa-Pekka . . . . . . . . . . . . . . . . . . . . . . . . 27 Salor, Özgül . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55 Salor, Özgül . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85 Saltzman, Elliot . . . . . 
. . . . . . . . . . . . . . . . . . . . . . . 90 Salvi, Giampiero . . . . . . . . . . . . . . . . . . . . . . . . . . . 45 Salvi, Giampiero . . . . . . . . . . . . . . . . . . . . . . . . . . . 95 Samudravijaya, K. . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 Samuelsson, Jonas . . . . . . . . . . . . . . . . . . . . . . . . 49 Sanchez, J. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54 Sánchez, Victoria . . . . . . . . . . . . . . . . . . . . . . . . . . 38 Sánchez, Victoria . . . . . . . . . . . . . . . . . . . . . . . . . . 97 Sanchis, A. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97 Sanchis, Emilio . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22 Sanchis, Emilio . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23 Sanchis, Javier. . . . . . . . . . . . . . . . . . . . . . . . . . . .104 Sanders, Eric . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65 Sankar, Ananth . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92 Sankar, Ananth . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92 Santarelli, Alfiero . . . . . . . . . . . . . . . . . . . . . . . . . . 61 Saon, George . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66 Saon, George . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70 Šarić, Zoran . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 Sarich, Ace . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 Saruwatari, Hiroshi . . . . . . . . . . . . . . . . . . . . . . . . 20 Saruwatari, Hiroshi . . . . . . . . . . . . . . . . . . . . . . . . 52 Saruwatari, Hiroshi . . . . . . . . . . . . . . . . . . . . . . . . 79 Saruwatari, Hiroshi . . . . . . . . . . . . . . . . . . . . . . . . 85 Sasaki, Felix. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .80 Sasaki, Koji . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83 Sasou, Akira . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 Sato, Tsutomu . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72 Säuberlich, Bettina . . . . . . . . . . . . . . . . . . . . . . . . 46 Saul, Lawrence K. . . . . . . . . . . . . . . . . . . . . . . . . . . 35 Savova, Guergana . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 Scalart, P. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49 Scanlon, Patricia . . . . . . . . . . . . . . . . . . . . . . . . . . . 30 Schafföner, M. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101 Schalkwyk, Johan . . . . . . . . . . . . . . . . . . . . . . . . . . 69 Scharenborg, Odette . . . . . . . . . . . . . . . . . . . . . . 73 Scharenborg, Odette . . . . . . . . . . . . . . . . . . . . . . 74 Scherer, Klaus R. . . . . . . . . . . . . . . . . . . . . . . . . . . 58 Scherer, Klaus R. . . . . . . . . . . . . . . . . . . . . . . . . . 106 Schiel, Florian . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107 Schimanowski, Juergen . . . . . . . . . . . . . . . . . . . 68 Schlüter, Ralf . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18 Schmidt, Gerhard . . . . . . . . . . . . . . . . . . . . . . . . . . 20 Schneider, T. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102 Schnell, K. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84 Qian, Yao . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65 Qian, Yasheng . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50 Quintana-Morales, Pedro . . . . . . . . . . . . . . . . . . 
86 R Raad, Mohammed . . . . . . . . . . . . . . . . . . . . . . . . . 39 Radová, Vlasta . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64 Radová, Vlasta . . . . . . . . . . . . . . . . . . . . . . . . . . . 107 Rahim, Mazin . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99 Rahurkar, Mandar A. . . . . . . . . . . . . . . . . . . . . . . 26 Raj, Bhiksha . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41 Raj, Bhiksha . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62 Ramabhadran, Bhuvana . . . . . . . . . . . . . . . . . . . 33 Ramabhadran, Bhuvana . . . . . . . . . . . . . . . . . . . 91 Ramakrishnan, K.R. . . . . . . . . . . . . . . . . . . . . . . 108 Ramasubramanian, V. . . . . . . . . . . . . . . . . . . . . . 47 Ramaswamy, Ganesh N. . . . . . . . . . . . . . . . . . . . 69 Ramaswamy, Ganesh N. . . . . . . . . . . . . . . . . . . . 71 Ramaswamy, Ganesh N. . . . . . . . . . . . . . . . . . . . 94 Ramírez, Javier . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 Ramírez, Javier . . . . . . . . . . . . . . . . . . . . . . . . . . . 107 Ramírez, Miguel Arjona. . . . . . . . . . . . . . . . . .104 Ramos-Castro, D. . . . . . . . . . . . . . . . . . . . . . . . . . . 25 Rank, Erhard . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46 Rätsch, Gunnar . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35 Raux, Antoine . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27 Ravera, F. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68 Raykar, Vikas C. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 Raymond, Christian . . . . . . . . . . . . . . . . . . . . . . . 22 Raza, D.G. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20 Reichert, Jürgen . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 Reilly, Richard . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30 Renals, Steve. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .17 Renals, Steve. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .99 Rentzos, Dimitrios . . . . . . . . . . . . . . . . . . . . . . . . 85 Rentzos, Dimitrios . . . . . . . . . . . . . . . . . . . . . . . 104 Resende Jr., F.G.V. . . . . . . . . . . . . . . . . . . . . . . . . . 87 Reynolds, Douglas A. . . . . . . . . . . . . . . . . . . . . . . . 2 Reynolds, Douglas A.. . . . . . . . . . . . . . . . . . . . . .47 Reynolds, Douglas A.. . . . . . . . . . . . . . . . . . . . . .56 Reynolds, Douglas A.. . . . . . . . . . . . . . . . . . . . . .69 Reynolds, Douglas A.. . . . . . . . . . . . . . . . . . . . . .71 Reynolds, Douglas A.. . . . . . . . . . . . . . . . . . . . . .94 Riccardi, Giuseppe . . . . . . . . . . . . . . . . . . . . . . . . 23 Riccardi, Giuseppe . . . . . . . . . . . . . . . . . . . . . . . . 64 Richey, Colleen . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 Rifkin, Ryan . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106 Rigazio, Luca . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 Rigazio, Luca . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65 Rigoll, Gerhard . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36 Rilliard, Albert . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 Ris, Christophe . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63 Rizzi, Romeo . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106 Rodgers, Dwight . . . . . . . . . . . . . . . . . . . . . . . . . 106 Rodrigues, M. . . . . . . . . . . . . . 
. . . . . . . . . . . . . . . . . 78 S Schoentgen, J. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84 Schone, P.J. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 Schreiner, Olaf . . . . . . . . . . . . . . . . . . . . . . . . . . . 101 Schultz, Tanja . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 Schultz, Tanja . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36 Schultz, Tanja . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51 Schultz, Tanja . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66 Schultz, Tanja . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99 Schultz, Tanja . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111 Schwab, Markus . . . . . . . . . . . . . . . . . . . . . . . . . . . 20 Schwartz, Jean-Luc . . . . . . . . . . . . . . . . . . . . . . . . . 6 Schwartz, Jean-Luc . . . . . . . . . . . . . . . . . . . . . . . . 49 Schwartz, Richard . . . . . . . . . . . . . . . . . . . . . . . . . 79 Schwartz, Richard . . . . . . . . . . . . . . . . . . . . . . . . 100 Schwarz, Petr . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29 Schweitzer, Antje . . . . . . . . . . . . . . . . . . . . . . . . . . 46 Sciamarella, Denisse . . . . . . . . . . . . . . . . . . . . . . 84 Scordilis, Michael S. . . . . . . . . . . . . . . . . . . . . . . . 40 Seabra Lopes, L. . . . . . . . . . . . . . . . . . . . . . . . . . . . 78 Segarra, Encarna . . . . . . . . . . . . . . . . . . . . . . . . . . . 22 Segura, José C. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 Segura, José C. . . . . . . . . . . . . . . . . . . . . . . . . . . . 107 Seide, Frank. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .71 Sekiya, Toshiyuki . . . . . . . . . . . . . . . . . . . . . . . . . . 45 Selouani, Sid-Ahmed . . . . . . . . . . . . . . . . . . . . . 109 Seltzer, Michael L. . . . . . . . . . . . . . . . . . . . . . . . . . 44 Sendlmeier, Walter F. . . . . . . . . . . . . . . . . . . . . . . 87 Seneff, Stephanie . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 Seneff, Stephanie . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 Seneff, Stephanie . . . . . . . . . . . . . . . . . . . . . . . . . . 23 Seneff, Stephanie . . . . . . . . . . . . . . . . . . . . . . . . . . 27 Seneff, Stephanie . . . . . . . . . . . . . . . . . . . . . . . . . . 91 Sénica, N. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78 Seo, Seongho . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101 Seo, Seongho . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102 Seppänen, Tapio . . . . . . . . . . . . . . . . . . . . . . . . . . . 26 Serralheiro, António . . . . . . . . . . . . . . . . . . . . . . . 56 Seward, Alexander . . . . . . . . . . . . . . . . . . . . . . . . 40 Sha, Fei. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .35 Shabestary, Turaj Zakizadeh . . . . . . . . . . . . . 39 Shammass, Shaunie . . . . . . . . . . . . . . . . . . . . . . . 54 Shammass, Shaunie . . . . . . . . . . . . . . . . . . . . . . . 63 Shao, Xu . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60 Shaw, Andrew T. . . . . . . . . . . . . . . . . . . . . . . . . . . 78 Sheikhzadeh, H. . . . . . . . . . . . . . . . . . . . . . . . . . . . 49 Sheikhzadeh, H. . . . . . . . . . . . . . . . . . . . . . . . . . . 102 Sheng, Huanye . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52 Sheykhzadegan, Javad . . . . . . . . . . . . .
. . . . . . . 54 Shi, Bertram E. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62 Shi, Rui P. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26 Shiga, Yoshinori . . . . . . . . . . . . . . . . . . . . . . . . . . . 61 Shiga, Yoshinori . . . . . . . . . . . . . . . . . . . . . . . . . . . 61 Shigemori, Takeru . . . . . . . . . . . . . . . . . . . . . . . . . 51 Shikano, Kiyohiro . . . . . . . . . . . . . . . . . . . . . . . . . 19 Shikano, Kiyohiro . . . . . . . . . . . . . . . . . . . . . . . . . 20 Shikano, Kiyohiro . . . . . . . . . . . . . . . . . . . . . . . . . 52 Shikano, Kiyohiro . . . . . . . . . . . . . . . . . . . . . . . . . 76 Shikano, Kiyohiro . . . . . . . . . . . . . . . . . . . . . . . . . 79 Shikano, Kiyohiro . . . . . . . . . . . . . . . . . . . . . . . . . 85 Shikano, Kiyohiro . . . . . . . . . . . . . . . . . . . . . . . . . 92 Shikano, Kiyohiro . . . . . . . . . . . . . . . . . . . . . . . . . 93 Shimada, Yasuhiro . . . . . . . . . . . . . . . . . . . . . . . . 85 Shimizu, Tohru . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32 Shimodaira, Hiroshi . . . . . . . . . . . . . . . . . . . . . . . 96 Shin, Jong-Won . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37 Shingu, Masahisa . . . . . . . . . . . . . . . . . . . . . . . . . . 96 Shinozaki, Takahiro . . . . . . . . . . . . . . . . . . . . . . . 34 Shirai, Katsuhiko . . . . . . . . . . . . . . . . . . . . . . . . . . 15 Shirai, Katsuhiko . . . . . . . . . . . . . . . . . . . . . . . . . . 28 Shiraishi, Kimio . . . . . . . . . . . . . . . . . . . . . . . . . . . 74 Shiraishi, Tatsuya . . . . . . . . . . . . . . . . . . . . . . . . . 79 Shriberg, Elizabeth . . . . . . . . . . . . . . . . . . . . . . . . 34 Shriberg, Elizabeth . . . . . . . . . . . . . . . . . . . . . . . . 71 Shriberg, Elizabeth . . . . . . . . . . . . . . . . . . . . . . . . 99 Shum, Heung-Yeung. . . . . . . . . . . . . . . . . . . . . . .59 Sigmund, Milan . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84 Siivola, Vesa . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81 Sikora, Thomas . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18 Sikora, Thomas . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20 Silva, Jorge . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78 Simpson, Brian D. . . . . . . . . . . . . . . . . . . . . . . . . . 37 Simske, Steve . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26 Sinervo, Ulpu . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38 Singer, E. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47 Singh, Rita . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41 Singh, Rita . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62 Siohan, Olivier . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 Siricharoenchai, Rungkarn . . . . . . . . . . . . . . . . . 4 Sista, Sreenivasa . . . . . . . . . . . . . . . . . . . . . . . . . 100 Sit, Chin-Hung . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59 Siu, K.C. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99 Siu, Man-Hung . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 Siu, Man-Hung . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32 Sivadas, Sunil . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30 Sivadas, Sunil . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30 Sivakumaran, P. .
. . . . . . . . . . . . . . . . . . . . . . . . . . 94 Skowronek, Janto. . . . . . . . . . . . . . . . . . . . . . . . . .69 Skut, Wojciech . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87 Smaïli, Kamel . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 Smaïli, Kamel . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98 Smallwood, L. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 Smeele, Paula M.T. . . . . . . . . . . . . . . . . . . . . . . . . . 69 Smith, D.J. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79 Smith, Jack . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93 Soares, P. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78 Sodoyer, David . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49 Somervuo, Panu . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 Song, Hwa Jeon . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52 Sönmez, Kemal . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71 Soon, Chng Chin . . . . . . . . . . . . . . . . . . . . . . . . . . . 76 Soong, Frank K. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52 Soong, Frank K. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53 Sornlertlamvanich, Virach . . . . . . . . . . . . . . . . 12 Sorokin, V.N. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83 Spiess, Thurid . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90 Sproat, Richard . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72 Sreenivas, T.V. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 Sreenivas, T.V. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47 Sridharan, Sridha . . . . . . . . . . . . . . . . . . . . . . . . . . 32 Sridharan, Sridha . . . . . . . . . . . . . . . . . . . . . . . . 106 Sridharan, Sridha . . . . . . . . . . . . . . . . . . . . . . . . 110 Sridharan, Sridha . . . . . . . . . . . . . . . . . . . . . . . . 110 Srinivasamurthy, Naveen . . . . . . . . . . . . . . . . . 39 Srinivasamurthy, Naveen . . . . . . . . . . . . . . . . 111 Srinivasan, Soundararajan . . . . . . . . . . . . . . . . 72 Srinivasan, Sriram . . . . . . . . . . . . . . . . . . . . . . . . . 49 Srivastava, Amit . . . . . . . . . . . . . . . . . . . . . . . . . . . 33 Srivastava, Amit . . . . . . . . . . . . . . . . . . . . . . . . . . 100 Stadermann, Jan . . . . . . . . . . . . . . . . . . . . . . . . . . . 36 Stahl, Christoph . . . . . . . . . . . . . . . . . . . . . . . . . . . 37 Stallard, David . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79 Stan, Sorel . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58 Steidl, Stefan . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46 Stemmer, Georg . . . . . . . . . . . . . . . . . . . . . . . . . . . 34 Stemmer, Georg . . . . . . . . . . . . . . . . . . . . . . . . . . . 46 Stent, Amanda . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59 Stephenson, Todd A. . . . . . . . . . . . . . . . . . . . . . . 89 Stern, Richard M. . . . . . . . . . . . . . . . . . . . . . . . . . . 24 Stern, Richard M. . . . . . . . . . . . . . . . . . . . . . . . . . . 30 Stern, Richard M. . . . . . . . . . . . . . . . . . . . . . . . . . . 53 Stevens, Catherine. . . . . . . . . . . . . . . . . . . . . . . . .72 Stewart, Darryl . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93 Stolbov, M. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20 Stolcke, Andreas . . . . . . . . . . . . . . . 
. . . . . . . . . . . 34 Stolcke, Andreas . . . . . . . . . . . . . . . . . . . . . . . . . . 71 Story, Ezra . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69 Stouten, Veronique . . . . . . . . . . . . . . . . . . . . . . . . . 1 Stouten, Veronique . . . . . . . . . . . . . . . . . . . . . . . . 13 Strassel, Stephanie . . . . . . . . . . . . . . . . . . . . . . . . 56 Strayer, Susan E. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 Strik, Helmer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54 Strik, Helmer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65 Strzalkowski, Tomek . . . . . . . . . . . . . . . . . . . . . . . 8 Stüker, Sebastian . . . . . . . . . . . . . . . . . . . . . . . . . . 36 Stüker, Sebastian . . . . . . . . . . . . . . . . . . . . . . . . . 111 Sturm, Janienke . . . . . . . . . . . . . . . . . . . . . . . . . . . 65 Sturm, Janienke . . . . . . . . . . . . . . . . . . . . . . . . . . . 79 Sturt, Christian . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50 Sugimura, Toshiaki . . . . . . . . . . . . . . . . . . . . . . . . 96 Sugiyama, Masahide . . . . . . . . . . . . . . . . . . . . . . . 16 Suhadi . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58 Suhm, Bernhard . . . . . . . . . . . . . . . . . . . . . . . . . . . 78 Sujatha, J. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108 Suk, Soo-Young . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88 Sullivan, Kirk P.H. . . . . . . . . . . . . . . . . . . . . . . . . . 93 Sumita, Eiichiro . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 Sun, Hui . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96 Sundaram, Shiva. . . . . . . . . . . . . . . . . . . . . . . . . . .43 Sung, Woo-Chang . . . . . . . . . . . . . . . . . . . . . . . . . . 51 Suontausta, Janne . . . . . . . . . . . . . . . . . . . . . . . . . . 8 Suzuki, Motoyuki . . . . . . . . . . . . . . . . . . . . . . . . . . 18 Suzuki, Noriko . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 Svaizer, Piergiorgio . . . . . . . . . . . . . . . . . . . . . . . . 18 Svendsen, Torbjørn . . . . . . . . . . . . . . . . . . . . . . 110 Svendsen, Torbjørn . . . . . . . . . . . . . . . . . . . . . . 110 Szarvas, Máté . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67 Szarvas, Máté . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81 T Taddei, Hervé . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38 Tadj, C. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105 Tago, Junji . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59 Takagi, Kazuyuki . . . . . . . . . . . . . . . . . . . . . . . . 112 Takagi, Kazuyuki . . . . . . . . . . . . . . . . . . . . . . . . 112 Takahashi, Shin-ya . . . . . . . . . . . . . . . . . . . . . . . . 23 Takami, Kazuaki . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 Takano, Sayoko . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85 Takatani, Tomoya . . . . . . . . . . . . . . . . . . . . . . . . . 20 Takeda, Kazuya . . . . . . . . . . . . . . . . . . . . . . . . . . . 62 Takeda, Kazuya . . . . . . . . . . . . . . . . . . . . . . . . . . . 67 Takeuchi, Masashi . . . . . . . . . . . . . . . . . . . . . . . . . 22 Takeuchi, Yugo . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 Takezawa, Toshiyuki . . . . . . . . . . . . . . . . . . . . . . 14 Takezawa, Toshiyuki . . . . . . . . . . . . . . . . . . . . . .
98 Tam, Yik-Cheung . . . . . . . . . . . . . . . . . . . . . . . . . 111 Tamburini, Fabio . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 Tan, Wah Jin . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37 Tanaka, Kazuyo . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 Tanaka, Kazuyo . . . . . . . . . . . . . . . . . . . . . . . . . . . 88 Tang, Min . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91 Tao, Jianhua . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83 Tasoulis, Dimitris K. . . . . . . . . . . . . . . . . . . . . . . 59 Tatai, Gábor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67 Tatai, Péter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29 Tatai, Péter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67 Tattersall, Graham . . . . . . . . . . . . . . . . . . . . . . . . 78 Teixeira, António . . . . . . . . . . . . . . . . . . . . . . . . . . 78 Teixeira, António . . . . . . . . . . . . . . . . . . . . . . . . 104 Teixeira, C. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78 Teixeira, João Paulo . . . . . . . . . . . . . . . . . . . . . . . . 7 Teixeira, João Paulo . . . . . . . . . . . . . . . . . . . . . . . 15 ten Bosch, Louis . . . . . . . . . . . . . . . . . . . . . . . . . . . 27 ten Bosch, Louis . . . . . . . . . . . . . . . . . . . . . . . . . . . 73 ten Bosch, Louis . . . . . . . . . . . . . . . . . . . . . . . . . . . 74 Terken, Jacques . . . . . . . . . . . . . . . . . . . . . . . . . . . 79 Tesprasit, Virongrong . . . . . . . . . . . . . . . . . . . . . . 4 Tesprasit, Virongrong . . . . . . . . . . . . . . . . . . . . . 12 te Vrugt, Jürgen . . . . . . . . . . . . . . . . . . . . . . . . . . . 67 Thambiratnam, K. . . . . . . . . . . . . . . . . . . . . . . . . . 32 Thies, Alexandra . . . . . . . . . . . . . . . . . . . . . . . . . . 29 Thomae, Matthias . . . . . . . . . . . . . . . . . . . . . . . . . 32 Thomas, Ryan . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87 Thubthong, Nuttakorn . . . . . . . . . . . . . . . . . . . . . 6 Tian, Jilei . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 Tiede, Mark . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85 Tihelka, Daniel . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 Tisato, Graziano . . . . . . . . . . . . . . . . . . . . . . . . . . . 80 Toda, Tomoki . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 Toda, Tomoki . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79 Toda, Tomoki . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85 Toivanen, Juhani . . . . . . . . . . . . . . . . . . . . . . . . . . 26 Tokuda, Keiichi . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 Tokuda, Keiichi . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87 Tokuda, Keiichi . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93 Tokuda, Keiichi . . . . . . . . . . . . . . . . . . . . . . . . . . 112 Tokuma, Shinichi . . . . . . . . . . . . . . . . . . . . . . . . . . 75 Tokuma, Won . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75 Tolba, Hesham . . . . . . . . . . . . . . . . . . . . . . . . . . . 109 Torge, Sunna . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67 Torres, Francisco . . . . . . . . . . . . . . . . . . . . . . . . . . 22 Torres-Carrasquillo, P.A. . . . . . . . . . . . . . . . . . . 47 Tóth, László . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
73 Trancoso, Isabel . . . . . . . . . . . . . . . . . . . . . . . . . . . 44 Trancoso, Isabel . . . . . . . . . . . . . . . . . . . . . . . . . . . 56 Tremoulis, George . . . . . . . . . . . . . . . . . . . . . . . . . 19 Tremoulis, George . . . . . . . . . . . . . . . . . . . . . . . . . 60 Trentin, Edmondo . . . . . . . . . . . . . . . . . . . . . . . . . 64 Trippel, Thorsten. . . . . . . . . . . . . . . . . . . . . . . . . .29 Trippel, Thorsten. . . . . . . . . . . . . . . . . . . . . . . . . .80 Trost, Harald . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36 Tsai, Wei-Ho . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106 Tsai, Wei-Ho . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109 Tsakalidis, Stavros . . . . . . . . . . . . . . . . . . . . . . . . 70 Tseng, Chiu-yu . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 Tseng, Chiu-yu . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83 Tseng, Shu-Chuan . . . . . . . . . . . . . . . . . . . . . . . . . . 4 Tsourakis, N. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55 Tsubota, Yasushi . . . . . . . . . . . . . . . . . . . . . . . . . 112 Tsuge, Satoru . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76 Tsujino, Hiroshi . . . . . . . . . . . . . . . . . . . . . . . . . . . 96 Tsuruta, Naoyuki . . . . . . . . . . . . . . . . . . . . . . . . . . 23 Tsuzaki, Minoru . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 Tur, Gokhan . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99 Tur, Gokhan . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99 Turajlic, Emir . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85 Turajlic, Emir. . . . . . . . . . . . . . . . . . . . . . . . . . . . .104 Turk, Oytun . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74 Turk, Oytun . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101 Türk, Ulrich . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107 Turunen, Markku . . . . . . . . . . . . . . . . . . . . . . . . . . 27 Turunen, Markku . . . . . . . . . . . . . . . . . . . . . . . . . . 67 Tyagi, Vivek. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .34 U Ueno, Shinichi . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26 Unoki, Masashi . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49 Utsuro, Takehito . . . . . . . . . . . . . . . . . . . . . . . . . . 42 V Vafin, Renat. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .38 Vair, C. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68 Valdez, Patrick . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49 Valente, Fabio . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 Valsan, Zica . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68 van Amelsvoort, A.G. . . . . . . . . . . . . . . . . . . . . . . 25 Van Bael, Christophe . . . . . . . . . . . . . . . . . . . . . . 54 Van Compernolle, Dirk . . . . . . . . . . . . . . . . . . . . 40 Van Compernolle, Dirk . . . . . . . . . . . . . . . . . . . . 69 Van Compernolle, Dirk . . . . . . . . . . . . . . . . . . . . 95 Vandecatseye, An . . . . . . . . . . . . . . . . . . . . . . . . . 33 van den Heuvel, Henk . . . . . . . . . . . . . . . . . . . . . 54 van den Heuvel, Henk . . . . . . . . . . . . . . . . . . . . . 54 van Doorn, Jan . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93 Van hamme, Hugo . . . . . . . . . . . . . . . . . . . . . . . . . .
1 Van hamme, Hugo . . . . . . . . . . . . . . . . . . . . . . . . . 13 Van hamme, Hugo . . . . . . . . . . . . . . . . . . . . . . . . . 69 Van hamme, Hugo . . . . . . . . . . . . . . . . . . . . . . . . . 95 Van hamme, Hugo . . . . . . . . . . . . . . . . . . . . . . . 108 Van hamme, Hugo . . . . . . . . . . . . . . . . . . . . . . . 109 van Hessen, Arjan . . . . . . . . . . . . . . . . . . . . . . . . . . 9 Vanhoucke, Vincent . . . . . . . . . . . . . . . . . . . . . . . 92 van Kommer, Robert . . . . . . . . . . . . . . . . . . . . . . 16 van Leeuwen, David A. . . . . . . . . . . . . . . . . . . . . 58 van Santen, Jan P.H. . . . . . . . . . . . . . . . . . . . . . . . 12 van Santen, Jan P.H. . . . . . . . . . . . . . . . . . . . . . . . 12 van Santen, Jan P.H. . . . . . . . . . . . . . . . . . . . . . . . 58 van Santen, Jan P.H. . . . . . . . . . . . . . . . . . . . . . . . 88 van Son, R.J.J.H. . . . . . . . . . . . . . . . . . . . . . . . . . . . 27 Vantieghem, Johan . . . . . . . . . . . . . . . . . . . . . . . . 63 Varga, Imre . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38 Vary, Peter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38 Vaseghi, Saeed . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85 Vaseghi, Saeed . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93 Vaseghi, Saeed . . . . . . . . . . . . . . . . . . . . . . . . . . . 104 Vasilache, Marcel . . . . . . . . . . . . . . . . . . . . . . . . . . 88 Vasilescu, Ioana . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 Väyrynen, Eero . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26 Venditti, Jennifer . . . . . . . . . . . . . . . . . . . . . . . . . . 26 Venkataraman, Anand . . . . . . . . . . . . . . . . . . . . . . 9 Venkataraman, Anand . . . . . . . . . . . . . . . . . . . . 14 Venkataraman, Anand . . . . . . . . . . . . . . . . . . . . 71 Vepa, Jithendra . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 Vergyri, Dimitra . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 Verma, Ashish . . . . . . . . . . . . . . . . . . . . . . . . . . . 103 Vescovi, Michele . . . . . . . . . . . . . . . . . . . . . . . . . 106 Vesnicer, Boštjan . . . . . . . . . . . . . . . . . . . . . . . . . . 45 Vidal, E. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97 Viikki, Olli . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88 Vilar, David . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23 Vilar, Juan Miguel . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 Visser, Erik . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 Visweswariah, Karthik . . . . . . . . . . . . . . . . . . . . 57 Visweswariah, Karthik . . . . . . . . . . . . . . . . . . . . 66 Visweswariah, Karthik . . . . . . . . . . . . . . . . . . . . 92 Visweswariah, Karthik . . . . . . . . . . . . . . . . . . . . 92 Vivaracho-Pascual, C. . . . . . . . . . . . . . . . . . . . . . 93 Vogt, Robbie . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106 Vonwiller, Julie . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72 Vosnidis, C. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55 Vozila, Paul . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87 Vrahatis, Michael N. . . . . . . . . . . . . . . . . . . . . . . . 59 W Waals, Juliette A.J.S. . . . . . . . . . . . . . . . . . . . . . . . 69 Wada, Yamato . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 Waibel, Alex . . 
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 Waibel, Alex . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36 Waibel, Alex . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36 Walker, B.D. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 Walker, Kevin . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56 Walker, Kevin . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56 Walker, Marilyn . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59 Walker, Marilyn . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59 Walker, William . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41 Wallace, Dorcas . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 Wambacq, Patrick . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 Wambacq, Patrick . . . . . . . . . . . . . . . . . . . . . . . . . 13 Wambacq, Patrick . . . . . . . . . . . . . . . . . . . . . . . . . 40 Wan, Eric A. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21 Wan, Vincent . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 Wang, Chao . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23 Wang, Chao . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27 Wang, Chong-kai . . . . . . . . . . . . . . . . . . . . . . . . . . 42 Wang, DeLiang . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72 Wang, Hsiao-Chuan . . . . . . . . . . . . . . . . . . . . . . . 17 Wang, Hsiao-Chuan . . . . . . . . . . . . . . . . . . . . . . . 19 Wang, Hsiao-Chuan . . . . . . . . . . . . . . . . . . . . . . 108 Wang, Hsin-Min . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33 Wang, Hsin-Min . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83 Wang, Hsin-Min . . . . . . . . . . . . . . . . . . . . . . . . . . 106 Wang, Huei-Ming . . . . . . . . . . . . . . . . . . . . . . . . . . 22 Wang, Kuansan . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42 Wang, Kuansan . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43 Wang, Renhua . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24 Wang, Wen . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 Wang, W.Q. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27 Wang, Xia . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83 Wang, Xuechuan . . . . . . . . . . . . . . . . . . . . . . . . . . . 36 Wang, Yadong. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .1 Wang, Yangsheng . . . . . . . . . . . . . . . . . . . . . . . . 107 Wang, Ye-Yi . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100 Wang, Yih-Ru . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 Wang, Yih-Ru . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109 Wang, Zhirong . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51 Ward, Rabab . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86 Ward, Todd . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33 Warmuth, Manfred K. . . . . . . . . . . . . . . . . . . . . . 35 Warmuth, Manfred K. . . . . . . . . . . . . . . . . . . . . . 62 Wasinger, Rainer . . . . . . . . . . . . . . . . . . . . . . . . . . 37 Wei, Jianqiang . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20 Wei, Jianqiang . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20 Wei, Yuan-Jun . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110 Weintraub, Mitch . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . 92 Welby, Pauline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75 Wellekens, Christian . . . . . . . . . . . . . . . . . . . . . . 16 Wendemuth, A. . . . . . . . . . . . . . . . . . . . . . . . . . . . 101 Weruaga, Luis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 Wester, Mirjam . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 Wester, Mirjam . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65 Westfeld, Timo . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54 Whittaker, Stephen . . . . . . . . . . . . . . . . . . . . . . . . 59 Wiggers, Pascal . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32 Williams, Elliott . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56 Williams, Jason D. . . . . . . . . . . . . . . . . . . . . . . . . . 21 Williams, Jason D. . . . . . . . . . . . . . . . . . . . . . . . . . 78 Williams, Shaun . . . . . . . . . . . . . . . . . . . . . . . . . . . 92 Wirén, Mats . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23 Witt, Silke M.. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .21 Wittig, Frank . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46 Wolf, Florian . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56 Wolf, Peter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41 Wölfel, Matthias . . . . . . . . . . . . . . . . . . . . . . . . . . . 36 Woltjer, Rogier . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37 Wong, C.C. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99 Wong, Eddie . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110 Woodland, P.C. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70 Woszczyna, Monika . . . . . . . . . . . . . . . . . . . . . . . 14 Wrede, Britta. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .90 Wrede, Britta. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .99 Wrigley, Stuart N. . . . . . . . . . . . . . . . . . . . . . . . . . . 17 Wu, Chung-Hsien . . . . . . . . . . . . . . . . . . . . . . . . . . 23 Wu, Jian . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 Wu, Jian . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34 Wutiwiwatchai, Chai. . . . . . . . . . . . . . . . . . . . . . .98 X Xu, Bo . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35 Xu, Bo . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51 Xu, Bo . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69 Xu, Bo . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75 Xu, Bo . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82 Xu, Jun . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 Xu, Mingxing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96 Xu, Yunbiao . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67 Xue, Jianxia . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43 Y Yabuta, Yohei . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 Yacoub, Sherif . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26 Yamada, Takeshi . . . . . . . . . . . . . . . . . . . . . . . . . . 62 Yamade, Shingo . . . . . . . . . . . . . . . . . . . . . . . . . . . 52 Yamagishi, Junichi . . . . . . . . . . . . . . . . . . . . . . . . 87 Yamaguchi, Yoshikazu . . . . . . . . . . . . . . . . . .
110 Yamaguchi, Yukiko . . . . . . . . . . . . . . . . . . . . . . . . 55 Yamajo, Hiroaki . . . . . . . . . . . . . . . . . . . . . . . . . . . 20 Yamamoto, Hirofumi . . . . . . . . . . . . . . . . . . . . . . . 9 Yamamoto, Hirofumi . . . . . . . . . . . . . . . . . . . . . . 96 Yamamoto, Kazumasa . . . . . . . . . . . . . . . . . . . . 62 Yamamoto, Kiyoshi . . . . . . . . . . . . . . . . . . . . . . . 80 Yamamoto, Natsuo . . . . . . . . . . . . . . . . . . . . . . . . 34 Yamamoto, Seiichi. . . . . . . . . . . . . . . . . . . . . . . . .14 Yamashita, Takumi . . . . . . . . . . . . . . . . . . . . . . . . . 7 Yamashita, Yoichi . . . . . . . . . . . . . . . . . . . . . . . . . 42 Yamauchi, Keita . . . . . . . . . . . . . . . . . . . . . . . . . . 106 Yan, Binfeng . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32 Yan, Gwo-Lang. . . . . . . . . . . . . . . . . . . . . . . . . . . . .23 Yan, Qin . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85 Yan, Qin . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104 Yan, Yonghong . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41 Yan, Zhaoli . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20 Yan, Zhaoli . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20 Yang, Fan. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .8 Yang, Ya-Ru . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48 Yao, Kaisheng . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 Yao, Kaisheng . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30 Yao, Kaisheng . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44 Yapanel, Umit H. . . . . . . . . . . . . . . . . . . . . . . . . . . 45 Yapanel, Umit H. . . . . . . . . . . . . . . . . . . . . . . . . . . 64 Yasuda, Norihito . . . . . . . . . . . . . . . . . . . . . . . . . . 23 Ye, Hui . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85 Yegnanarayana, B. . . . . . . . . . . . . . . . . . . . . . . . . . . 3 Yegnanarayana, B. . . . . . . . . . . . . . . . . . . . . . . . . . 21 Ying, D.W.. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .27 Yip, Wing Lin . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59 Yiu, Kwok-Kwong . . . . . . . . . . . . . . . . . . . . . . . . 105 Yoma, Nestor Becerra . . . . . . . . . . . . . . . . . . . . . 78 Yoo, Chang D. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50 Yoo, Chang D. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57 Yoo, Chang D. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101 Yoo, Chang D. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101 Yoo, Chang D. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102 Yoon, Sung-Wan . . . . . . . . . . . . . . . . . . . . . . . . . . . 38 Yoshida, Akihiro . . . . . . . . . . . . . . . . . . . . . . . . . . 18 Yoshimura, Takashi . . . . . . . . . . . . . . . . . . . . . . . 80 Yoshizawa, Shinichi . . . . . . . . . . . . . . . . . . . . . . . 93 Youn, Dae-Hee . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38 Young, Steve . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85 Yu, An-Tze . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108 Yu, Dong . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43 Yu, Eric W.M.. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .59 Yu, Hua . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
66 Yu, Peng . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71 Yuan, Baozong . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67 Z Zarrintare, Rahman . . . . . . . . . . . . . . . . . . . . . . . 54 Zawaydeh, Bushra . . . . . . . . . . . . . . . . . . . . . . . . . 79 Zechner, Klaus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70 Zeißler, Viktor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26 Zeißler, Viktor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34 Zellner Keller, Brigitte . . . . . . . . . . . . . . . . . . . . . 91 Zen, Heiga . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 Zen, Heiga . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87 Zen, Heiga . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93 Zen, Heiga . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112 Zeng, Hui . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20 Zeng, Hui . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20 Zervas, P. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 Zetterholm, Elisabeth . . . . . . . . . . . . . . . . . . . . . 93 Žgank, Andrej . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89 Zhang, Dong . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59 Zhang, Guoliang . . . . . . . . . . . . . . . . . . . . . . . . . . . 96 Zhang, Huayun . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51 Zhang, Huayun . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75 Zhang, Jason Y. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72 Zhang, Jianping. . . . . . . . . . . . . . . . . . . . . . . . . . . .41 Zhang, Jing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 Zhang, Li . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90 Zhang, Rong . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66 Zhang, Shuwu . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69 Zhang, Shuwu . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82 Zhang, Wei . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106 Zhang, Xianxian . . . . . . . . . . . . . . . . . . . . . . . . . . . 45 Zhang, Yimin . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90 Zhang, Zhaoyan . . . . . . . . . . . . . . . . . . . . . . . . . . . 85 Zhang, Zhipeng . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24 Zhao, Yong . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105 Zhao, Yunxin . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80 Zheng, Chengyi . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41 Zheng, Fang. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .96 Zheng, Feng . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68 Zheng, Hong . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54 Zheng, Jing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 Zhong, Xin . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103 Zhou, Yi . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 Zhu, Donglai . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24 Zhu, Qifeng . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 Zhu, Xiaoyan. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .32 Žibert, Janez . . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . 45 Žibert, Janez . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68 Ziegenhain, Ute . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54 Zigel, Yaniv . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105 Zilca, Ran D. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69 Zilca, Ran D. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94 Zissman, Marc . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56 Zitouni, Imed . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 Ziv, Shirley . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26 Zolfaghari, Parham . . . . . . . . . . . . . . . . . . . . . . . . 81 Zolfaghari, Parham . . . . . . . . . . . . . . . . . . . . . . . . 86 Zolnay, András . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18 Zong, Chengqing . . . . . . . . . . . . . . . . . . . . . . . . . . 82 Zu, Yiqing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 Zubizarreta, M. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54 Zue, Victor W. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91 Zvonik, Elena . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28 Zweig, Geoffrey . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44 Zweig, Geoffrey . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66 Zweig, Geoffrey . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70