
Eurospeech 2003 Abstracts Book
8th European Conference on Speech
Communication and Technology
September 1-4, 2003 – Geneva, Switzerland
BOOK OF ABSTRACTS
Typeset by:
Causal Productions Pty Ltd
www.causal.on.net
[email protected]
Table of Contents
Page
Plenary Talks . . . 1
SMoCa . . . Aurora Noise Robustness on SMALL Vocabulary Databases . . . 1
SMoCb . . . ISCA Special Interest Group Session: "Hot Topics" in Speech Science & Technology . . . 2
OMoCc . . . Speech Signal Processing I . . . 3
OMoCd . . . Phonology & Phonetics I . . . 4
PMoCe . . . Topics in Prosody & Emotional Speech . . . 4
PMoCf . . . Language Modeling, Discourse & Dialog . . . 8
PMoCg . . . Speech Synthesis: Unit Selection I . . . 10
SMoDa . . . Aurora Noise Robustness on LARGE Vocabulary Databases . . . 13
SMoDb . . . Multilingual Speech-to-Speech Translation . . . 13
OMoDc . . . Prosody . . . 14
OMoDd . . . Language Modeling . . . 15
PMoDe . . . Speech Modeling & Features I . . . 16
PMoDf . . . Speech Enhancement I . . . 19
PMoDg . . . Spoken Dialog Systems I . . . 21
OTuBa . . . Robust Speech Recognition - Noise Compensation . . . 24
STuBb . . . Forensic Speaker Recognition . . . 25
OTuBc . . . Emotion in Speech . . . 26
OTuBd . . . Dialog System User & Domain Modeling . . . 26
PTuBf . . . Phonology & Phonetics II . . . 27
PTuBg . . . Speech Modeling & Features II . . . 29
PTuBh . . . Topics in Speech Recognition & Segmentation . . . 31
OTuCa . . . Robust Speech Recognition - Acoustic Modeling . . . 34
STuCb . . . Advanced Machine Learning Algorithms for Speech & Language Processing . . . 35
OTuCc . . . Speech Modeling & Features III . . . 36
OTuCd . . . Multi-Modal Spoken Language Processing . . . 36
PTuCe . . . Speech Coding & Transmission . . . 37
PTuCf . . . Speech Recognition - Search & Lexicon Modeling . . . 39
PTuCg . . . Speech Technology Applications . . . 41
OTuDa . . . Robust Speech Recognition - Front-end Processing . . . 44
STuDb . . . Spoken Language Processing for e-Inclusion . . . 45
OTuDc . . . Speech Synthesis: Unit Selection II . . . 46
OTuDd . . . Language & Accent Identification . . . 47
PTuDe . . . Speech Enhancement II . . . 48
PTuDf . . . Speech Recognition - Adaptation I . . . 51
PTuDg . . . Speech Resources & Standards . . . 54
OWeBa . . . Speech Recognition - Adaptation II . . . 57
SWeBb . . . Towards Synthesizing Expressive Speech . . . 57
OWeBc . . . Speaker Verification . . . 58
OWeBd . . . Dialog System Generation . . . 59
PWeBe . . . Speech Signal Processing II . . . 60
PWeBf . . . Robust Speech Recognition I . . . 62
PWeBg . . . Speech Recognition - Large Vocabulary I . . . 64
PWeBh . . . Spoken Dialog Systems II . . . 67
OWeCa . . . Speech Recognition - Large Vocabulary II . . . 69
SWeCb . . . Robust Methods in Processing of Natural Language Dialogues . . . 70
OWeCc . . . Speaker Identification . . . 71
OWeCd . . . Speech Synthesis: Miscellaneous I . . . 71
PWeCe . . . Speech Perception . . . 72
PWeCf . . . Robust Speech Recognition II . . . 75
PWeCg . . . Multi-Modal Processing & Speech Interface Design . . . 78
OWeDb . . . Speech Recognition - Language Modeling . . . 80
OWeDc . . . Speech Modeling & Features IV . . . 81
SWeDd . . . Feature Analysis & Cross-Language Processing of Chinese Spoken Language . . . 82
PWeDe . . . Speech Production & Physiology . . . 83
PWeDf . . . Speech Synthesis: Voice Conversion & Miscellaneous Topics . . . 85
PWeDg . . . Acoustic Modelling I . . . 88
SThBb . . . Time is of the Essence - Dynamic Approaches to Spoken Language . . . 90
OThBc . . . Topics in Speech Recognition . . . 91
OThBd . . . Acoustic Modelling II . . . 92
PThBe . . . Speaker & Language Recognition . . . 93
PThBf . . . Robust Speech Recognition III . . . 95
PThBg . . . Spoken Language Understanding & Translation . . . 98
PThBh . . . Speech Signal Processing III . . . 101
SThCb . . . Towards a Roadmap for Speech Technology . . . 103
OThCc . . . Speech Signal Processing IV . . . 103
OThCd . . . Speech Synthesis: Miscellaneous II . . . 104
PThCe . . . Speaker Recognition & Verification . . . 105
PThCf . . . Robust Speech Recognition IV . . . 107
PThCg . . . Multi-Lingual Spoken Language Processing . . . 109
PThCh . . . Interdisciplinary . . . 111
Eurospeech 2003
Plenary & Monday
PLENARY TALKS

Speech and Language Processing: Where Have We Been and Where Are We Going?
Kenneth Ward Church; AT&T Labs-Research, USA
Time: Tuesday 08:30 to 09:30, Venue: Room 1

Can we use the past to predict the future? Moore's Law is a great example: performance doubles and prices halve approximately every 18 months. This trend has held up well to the test of time and is expected to continue for some time. Similar arguments can be found in speech, demonstrating consistent progress over decades. Unfortunately, there are also cases where history repeats itself, as well as major dislocations: fundamental changes that invalidate fundamental assumptions. What will happen, for example, when petabytes become a commodity? Can demand keep up with supply? How much text and speech would it take to match this supply? Priorities will change. Search will become more important than coding and dictation.

Auditory Principles in Speech Processing - Do Computers Need Silicon Ears?
Birger Kollmeier; Universität Oldenburg, Germany
Time: Wednesday 08:30 to 09:30, Venue: Room 1

A brief review is given of speech processing techniques that are based on auditory models, with an emphasis on applications of the "Oldenburg perception model", i.e., objective assessment of subjective sound quality for speech and audio codecs, automatic speech recognition, SNR estimation, and hearing aids.

Session: SMoCa - Oral
Aurora Noise Robustness on SMALL Vocabulary Databases
Time: Monday 13.30, Venue: Room 1
Chair: David Pierce, Motorola Lab., UK

A Speech Processing Front-End with Eigenspace Normalization for Robust Speech Recognition in Noisy Automobile Environments
Kaisheng Yao, Erik Visser, Oh-Wook Kwon, Te-Won Lee; University of California at San Diego, USA

A new front-end processing scheme for robust speech recognition is proposed and evaluated on the multi-lingual Aurora 3 database. The front-end processing scheme consists of Mel-scaled spectral subtraction, speech segmentation, cepstral coefficient extraction, utterance-level frame dropping and eigenspace feature normalization. We also investigated performance on all language databases by post-processing features extracted by the ETSI advanced front-end with an additional eigenspace normalization module. This step consists of a linear PCA matrix feature transformation followed by mean and variance normalization of the transformed cepstral coefficients. In speech recognition experiments, our proposed front-end yielded better than a 16 percent relative error rate reduction over the ETSI front-end on the Finnish language database. Also, more than 6% average relative error reduction was observed over all languages with the ETSI front-end augmented by eigenspace normalization.

Average Instantaneous Frequency (AIF) and Average Log-Envelopes (ALE) for ASR with the Aurora 2 Database
Yadong Wang, Jesse Hansen, Gopi Krishna Allu, Ramdas Kumaresan; University of Rhode Island, USA

We have developed a novel approach to speech feature extraction based on a modulation model of a band-pass signal. Speech is processed by a bank of band-pass filters. At the output of the band-pass filters the signal is subjected to a log-derivative operation which naturally decomposes the band-pass signal into analytic (called α̇(t) + j α̇̂(t)) and anti-analytic (called β̇(t) + j β̇̂(t)) components. The average instantaneous frequency (AIF) and average log-envelope (ALE) are then extracted as coarse features at the output of each filter. Further refined features may also be extracted from the analytic and anti-analytic components (not done in this paper). We then evaluated our front-end on the Aurora 2 task, where noise corruption is synthetic. For clean training (compared to the mel-cepstrum front-end with a 3-mixture HMM back-end), our AIF/ALE front-end achieves an average improvement of 13.97% with set A, 17.92% with set B, and -31.72% (negative) 'improvement' with set C. The overall improvement in accuracy rates for clean training is 7.97%. Although the improvements are modest, the novelty of the front-end and its potential for future enhancements are our strengths.

Maximum Likelihood Normalization for Robust Speech Recognition
Yiu-Pong Lai, Man-Hung Siu; Hong Kong University of Science & Technology, China

It is well known that additive and channel noise cause shift and scaling in MFCC features. Empirical normalization techniques to estimate and compensate for these effects, such as cepstral mean subtraction and variance normalization, have been shown to be useful. However, these empirical estimates may not be optimal. In this paper, we approach the problem from two directions: 1) we use more robust MFCC-based features that are less sensitive to additive and channel noise, and 2) we propose a maximum likelihood (ML) based approach to compensate for the noise effect. In addition, we propose the use of multi-class normalization in which different normalization factors can be applied to different phonetic units. The combination of the robust features and ML normalization is particularly useful for the highly mismatched condition in the Aurora 3 corpus, resulting in a 15.8% relative improvement in the highly mismatched case and a 10.4% relative improvement on average over the three conditions.

Robust Speech Recognition Using Model-Based Feature Enhancement
Veronique Stouten, Hugo Van hamme, Kris Demuynck, Patrick Wambacq; Katholieke Universiteit Leuven, Belgium

Maintaining a high level of robustness for Automatic Speech Recognition (ASR) systems is especially challenging when the background noise has a time-varying nature. We have implemented a Model-Based Feature Enhancement (MBFE) technique that not only can easily be embedded in the feature extraction module of a recogniser, but also is intrinsically suited to the removal of non-stationary additive noise. To this end we combine statistical models of the cepstral feature vectors of both clean speech and noise, using a Vector Taylor Series approximation in the power spectral domain. Based on this combined HMM, a global MMSE-estimate of the clean speech is then calculated. Because of the scalability of the applied models, MBFE is flexible and computationally feasible. Recognition experiments with this feature enhancement technique on the Aurora 2 connected digit recognition task showed significant improvements in the noise robustness of the HTK recogniser.

Several HKU Approaches for Robust Speech Recognition and Their Evaluation on Aurora Connected Digit Recognition Tasks
Jian Wu, Qiang Huo; University of Hong Kong, China

Recently, we at The University of Hong Kong (HKU) have proposed several approaches based on stochastic vector mapping and switching linear Gaussian HMMs to compensate for environmental distortions in robust speech recognition. In this paper, we present a comparative study of these algorithms and report results of performance evaluation on the Aurora connected digits databases. Following the protocol specified by the organizer of the Eurospeech 2003 special session on Aurora tasks, the best performance we achieved on the Aurora 2 database is a digit recognition error rate, averaged over all three test sets, of 5.53% and 6.28% for multi- and clean-condition training respectively. In a preliminary evaluation on the Aurora 3 Finnish and Spanish databases, significant performance improvement is also achieved by our approach.
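Cepstral mean subtraction and variance normalization, the empirical baseline mentioned in the normalization abstract above, amount to standardizing each cepstral dimension over an utterance. A minimal illustrative sketch (not code from any of the papers; the toy feature matrix and function name are made up):

```python
import numpy as np

def cmvn(features):
    """Cepstral mean and variance normalization over one utterance.

    features: (num_frames, num_coeffs) array, e.g. MFCC vectors.
    Each coefficient dimension is shifted to zero mean and scaled to
    unit variance, compensating the shift and scaling that additive
    and channel noise induce in cepstral features.
    """
    mean = features.mean(axis=0)
    std = features.std(axis=0)
    return (features - mean) / np.maximum(std, 1e-8)  # guard flat dims

# Toy "utterance": 100 frames of 13 coefficients with an offset and gain,
# standing in for a channel-distorted MFCC sequence.
feats = np.random.default_rng(0).normal(loc=3.0, scale=2.0, size=(100, 13))
norm = cmvn(feats)
```

After normalization each coefficient column has zero mean and unit variance, regardless of the offset or gain applied to the input.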
Adaptation of Acoustic Model Using the Gain-Adapted HMM Decomposition Method
Akira Sasou 1, Futoshi Asano 1, Kazuyo Tanaka 2, Satoshi Nakamura 3; 1 AIST, Japan; 2 University of Tsukuba, Japan; 3 ATR-SLT, Japan

In a real environment, it is essential to adapt acoustic models to variations in background noise in order to realize robust speech recognition. In this paper, we construct an extended acoustic model by combining a mismatch model with a clean acoustic model trained using only clean speech data. We assume the mismatch model conforms to a Gaussian distribution with time-varying population parameters. The proposed method adapts the extended acoustic model on-line to unknown noises by estimating the time-varying population parameters using a Gaussian Mixture Model (GMM) and a Gain-Adapted Hidden Markov Model (GA-HMM) decomposition method. We performed recognition experiments under noisy conditions using the AURORA2 database in order to confirm the effectiveness of the proposed method.

Session: SMoCb - Oral
ISCA Special Interest Group Session: "Hot Topics" in Speech Science & Technology
Time: Monday 13.30, Venue: Room 2
Chair: Valérie Hazan, University College London, UK

ISCA Special Session: Hot Topics in Speech Synthesis
Gerard Bailly 1, Nick Campbell 2, Bernd Möbius 3; 1 ICP-CNRS, France; 2 ATR-HIS, Japan; 3 University of Stuttgart, Germany

What are the hot topics for speech synthesis? How will they differ in five years' time? ISCA's SynSIG presents a few suggestions. This paper attempts to identify the top five hot topics, based not on an analysis of what is being presented at current workshops and conferences, but rather on an analysis of what is NOT. It will be accompanied by results from a questionnaire polling SynSIG members' views and opinions.

Perceiving Emotions by Ear and by Eye
Beatrice de Gelder; Tilburg University, The Netherlands

Affective information is conveyed through visual as well as auditory perception. The present paper considers the integration of these channels of information, that is, the multisensory processing of emotion. Findings from behavioral, neuropsychological and imaging studies are reviewed.

Strategies for Automatic Multi-Tier Annotation of Spoken Language Corpora
Steven Greenberg; The Speech Institute, USA

Spoken corpora of the future will be annotated at multiple levels of linguistic organization, largely through automatic methods using a combination of sophisticated signal processing, statistical classifiers and expert knowledge. It is important that annotation tools be adaptable to a wide range of languages and speaking styles, as well as readily accessible to the speech research and technology communities around the world. This latter objective is of particular importance for minority languages, which are less likely to foster development of sophisticated speech technology without such universal access.

Person Authentication by Voice: A Need for Caution
Jean-François Bonastre 1, Frédéric Bimbot 2, Louis-Jean Boë 3, Joseph P. Campbell 4, Douglas A. Reynolds 4, Ivan Magrin-Chagnolleau 5; 1 LIA-CNRS, France; 2 IRISA, France; 3 ICP-CNRS, France; 4 Massachusetts Institute of Technology, USA; 5 DDL-CNRS, France

Because of recent events, and as members of the scientific community working in the field of speech processing, we feel compelled to publicize our views concerning the possibility of identifying or authenticating a person from his or her voice. The need for a clear and common message was shown by the diversity of information that has been circulating on this matter in the media and among the general public over the past year. In a press release initiated by the AFCP and further elaborated in collaboration with the SpLC ISCA-SIG, the two groups herein discuss and present a summary of the current state of scientific knowledge and technological development in the field of speaker recognition, in wording accessible to non-specialists. Our main conclusion is that, despite the existence of technological solutions for some constrained applications, at the present time there is no scientific process that enables one to uniquely characterize a person's voice or to identify with absolute certainty an individual from his or her voice.

Why is the Special Structure of the Language Important for Chinese Spoken Language Processing? - Examples on Spoken Document Retrieval, Segmentation and Summarization
Lin-shan Lee, Yuan Ho, Jia-fu Chen, Shun-Chuan Chen; National Taiwan University, Taiwan

The Chinese language is not only spoken by the largest population in the world, but also quite different from many Western languages, with a very special structure. It is not alphabetic: a large number of Chinese characters are ideographic symbols, pronounced as monosyllables. The open-vocabulary nature, the flexible wording structure and the tone behavior are also good examples of this special structure. It is believed that better results and performance will be obtainable in developing Chinese spoken language processing technologies if this special structure can be taken into account. In this paper, a set of "feature units" for Chinese spoken language processing is identified, and the retrieval, segmentation and summarization of Chinese spoken documents are taken as examples in analyzing the use of such "feature units". Experimental results indicate that with careful consideration of the special structure and proper choice of the "feature units", significantly better performance can be achieved.
frequency resolution. Two popularly used hearing aid algorithms,
a two channel wide band system and a nine channel compression
system, are simulated and are used to compensate the impaired auditory model. The responses of the compensated system, in terms
of the acoustic-phonetics cues that characterise speech intelligibility, are analysed and compared with one another and with that of a
normal auditory system. It is shown that although the nine channel
compression algorithm performs better than the two channel system both the hearing aid algorithms distort severely the acousticphonetic cues.
Session: OMoCc– Oral
Speech Signal Processing I
Time: Monday 13.30, Venue: Room 3
Chair: Hynek Hermansky, Oregon Graduate Institute of Science
and Technology, USA
Speech Analysis with the Short-Time Chirp
Transform
Frequency-Related Representation of Speech
Luis Weruaga 1 , Marián Képesi 2 ; 1 Cartagena
University of Technology, Spain; 2 Forschungszentrum
Telekommunikation Wien, Austria
Kuldip K. Paliwal 1 , Bishnu S. Atal 2 ; 1 Griffith
University, Australia; 2 AT&T Labs-Research, USA
Cepstral features derived from power spectrum are widely used for
automatic speech recognition. Very little work, if any, has been
done in speech research to explore phase-based representations.
In this paper, an attempt is made to investigate the use of phase
function in the analytic signal of critical-band filtered speech for
deriving a representation of frequencies present in the speech signal. Results are presented which show the validity of this approach.
The most popular time-frequency analysis tool, the Short-Time
Fourier Transform, suffers from blurry harmonic representation
when voiced speech undergoes changes in pitch. These relatively
fast variations lead to inconsistent bins in frequency domain and
cannot be accurately described by the Fourier analysis with high
resolution both in time and frequency. In this paper a new analysis tool, called Short-Time Chirp Transform is presented, offering
more precise time-frequency representation of speech signals. The
base of this adaptive transform is composed of quadratic chirps
that follow the pitch tendency segment-by-segment. Comparative
results between the proposed STCT and popular time-frequency
techniques reveal an improvement in time-frequency localization
and finer spectral representation. Since the signal can be resynthesized from its STCT, the proposed method is also suitable for
filtering purposes.
Tracking a Moving Speaker Using Excitation Source
Information
Vikas C. Raykar 1 , Ramani Duraiswami 1 , B.
Yegnanarayana 2 , S.R. Mahadeva Prasanna 2 ;
1
University of Maryland, USA; 2 Indian Institute of
Technology, India
Glottal Spectrum Based Inverse Filtering
Microphone arrays are widely used to detect, locate, and track a
stationary or moving speaker. The first step is to estimate the time
delay, between the speech signals received by a pair of microphones.
Conventional methods like generalized cross-correlation are based
on the spectral content of the vocal tract system in the speech signal. The spectral content of the speech signal is affected due to
degradations in the speech signal caused by noise and reverberation. However, features corresponding to the excitation source of
speech are less affected by such degradations. This paper proposes
a novel method to estimate the time delays using the excitation
source information in speech. The estimated delays are used to
get the position of the moving speaker. The proposed method is
compared with the spectrum-based approach using real data from
a microphone array setup.
Ixone Arroabarren, Alfonso Carlosena; Universidad
Publica de Navarra, Spain
In this paper a new inverse filtering technique for the time-domain
estimation of the glottal excitation is presented. This approach uses
the DAP modeling for the vocal tract characterization, and a spectral model for the derivative of the glottal flow. This spectral model
is based on the spectrum of the KLGLOTT88 model for the glottal
source. The proposed procedure removes the glottal source from
the spectrum of the speech signal in an accurate manner, particularly for high-pitched signals and singing voice, and the estimated
glottal waveforms present less amount of formant ripple.
En este trabajo se presenta una nueva técnica de filtrado inverso
para la estimación de la fuente glotal. Dicha técnica combina la
herramienta de cálculo de la respuesta de un sistema todo polos,
basada en las muestras espectrales de la señal (DAP modeling), con
un modelo espectral más preciso de la derivada de la fuente glotal.
Este modelo espectral está basado en el espectro del modelo temporal para la fuente glotal KLGLOTT88. El algoritmo propuesto, elimina el efecto de la fuente en el espectro de la señal de habla de una
manera más precisa, lo cual es de especial interés en señales con
alta frecuencia fundamental y señales de canto. Como consecuencia de esto la estimación de la fuente glotal resultante del filtrado
inverso presenta un menor rizado característico del efecto de los
formantes.
Tracking Vocal Tract Resonances Using an
Analytical Nonlinear Predictor and a Target-Guided
Temporal Constraint
Li Deng, Issam Bazzi, Alex Acero; Microsoft Research,
USA
A technique for high-accuracy tracking of formants or vocal tract
resonances is presented in this paper, using a novel nonlinear predictor and a target-directed temporal constraint. The nonlinear predictor is constructed from a parameter-free, discrete mapping function from the formant (frequency and bandwidth) space
to the LPC-cepstral space, with trainable residuals. We examine in
this study the key role of vocal tract resonance targets in tracking accuracy. Experimental results show that, due to the use of the
targets, the tracked formants in the consonantal regions (including
closures and short pauses) of the speech utterance exhibit the same
dynamic properties as in the vocalic regions, and reflect the underlying vocal tract resonances. The results also demonstrate the effectiveness of training the prediction-residual parameters and of incorporating the target-based constraint in obtaining high-accuracy
formant estimates, especially for non-sonorant portions of speech.
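A widely used analytical map from resonance frequencies and bandwidths to LPC cepstra, of the kind such a nonlinear predictor is built on, is c_n = (2/n) Σ_k exp(−πnb_k/fs) cos(2πnf_k/fs). A sketch follows; the resonance values are illustrative assumptions, not the paper's data:

```python
import math

def resonances_to_cepstrum(resonances, n_ceps, fs):
    """c_n = (2/n) * sum_k exp(-pi*n*b_k/fs) * cos(2*pi*n*f_k/fs),
    for resonances given as (frequency, bandwidth) pairs in Hz."""
    ceps = []
    for n in range(1, n_ceps + 1):
        c = sum(math.exp(-math.pi * n * b / fs) * math.cos(2 * math.pi * n * f / fs)
                for f, b in resonances)
        ceps.append(2.0 * c / n)
    return ceps

# Hypothetical resonance hypothesis for a vowel-like frame at fs = 8 kHz.
predicted = resonances_to_cepstrum(
    [(500.0, 80.0), (1500.0, 120.0), (2500.0, 200.0)], n_ceps=12, fs=8000.0)
```

A tracker of this kind can then score hypothesized (frequency, bandwidth) tuples by comparing such predicted cepstra, plus trained residuals, against the observed LPC cepstra of each frame.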
A Novel Method of Analysing and Comparing
Responses of Hearing Aid Algorithms Using
Auditory Time-Frequency Representation
G.V. Kiran, T.V. Sreenivas; Indian Institute of Science,
India
A new and potentially important method for predicting, analysing
and comparing responses of hearing aid algorithms is studied and
presented here. This method is based on a time-frequency representation (TFR) generated by a computational auditory model. Hearing
impairment is simulated by a change of parameters of the auditory
model. To simulate the basilar membrane (BM) filtering part of the
auditory model we propose a single parameter control version of the
gammachirp filterbank and for simulating the neural processing in
the auditory pathway we propose a signal processing model motivated by the physiological properties of the auditory nerve. This
model then interprets the information processing in the auditory
pathway through the use of a TFR called the auditory TFR (A-TFR)
which matches the standard spectrogram in terms of both time and
Eurospeech 2003
Monday
September 1-4, 2003 – Geneva, Switzerland
Session: OMoCd– Oral
Phonology & Phonetics I
Time: Monday 13.30, Venue: Room 4
Chair: Dafydd Gibbon, Linguistics, Bielefeld, Germany

Analysis and Modeling of Syllable Duration for Thai Speech Synthesis
Chatchawarn Hansakunbuntheung 1, Virongrong Tesprasit 1, Rungkarn Siricharoenchai 1, Yoshinori Sagisaka 2; 1 NECTEC, Thailand; 2 Waseda University, Japan
This paper describes the analysis results on the control factors of
Thai syllable duration, and a statistical control model using a linear
regression technique. The analyses were carried out both at the
syllable level and at the phrase level. At the syllable level, the effects of the five Thai tones and of syllable structure are investigated. To analyze syllable-structure effects statistically, we applied
quantification theory with two linguistic factors: (1) the phone categories themselves, and (2) the categories grouped by articulatory
similarities. At the phrase level, the effects of position in a phrase and
of the syllable count in a phrase were analyzed. The experimental results
showed that tones, syllable structures, and position in a phrase play
significant roles in syllable duration control, while the syllable count in a
phrase affects syllable duration only slightly. These analysis results
have been integrated into a statistical control model. The duration assignment precision of the proposed model was evaluated using
2480-word speech data. A total correlation of 0.73 between predicted
and observed values for test-set samples shows the fair precision of the proposed control model.

Features of Contracted Syllables of Spontaneous Mandarin
Shu-Chuan Tseng; Academia Sinica, Taiwan
Mandarin is a syllable-timed language whose syllable structure is
quite simple [1]. In spontaneous Mandarin, because of rapid speech
rate, the structure of syllables may change: phonemes may be
reduced, and syllable boundaries as well as lexical tones may be
merged. This fact has long been noticed, but no quantified empirical data had been presented in the literature until now. This
paper focuses on a special type of syllable reduction in spontaneous
Mandarin caused by heavy coarticulation of phonemes across syllable boundaries, namely the phenomenon of syllable contraction.
Contracted syllables result from segmental deletions and the omission
of syllable boundaries. This paper reports a series of corpus-based
analyses of contracted syllables in Mandarin conversation, taking into account phonological as well as non-phonological
factors.
Durational Characteristics of Hindi Stop Consonants
K. Samudravijaya; Tata Institute of Fundamental Research, India
A study of the durational characteristics of Hindi stop consonants in
spoken sentences was carried out. An annotated and time-aligned
Hindi speech database was used in the experiment. The influences
of aspiration, voicing and gemination on the durations of the closure
and post-release segments of plosives, as well as on the duration of the
preceding vowel, were studied. It was observed that the post-release
duration of a plosive changes systematically with the manner of articulation. However, due to its large variation in continuous speech, the
post-release duration alone is not sufficient to identify the manner
of articulation of Hindi stops, as hypothesised in earlier studies. A
low value of the ratio of the duration of a vowel to the closure duration of the following plosive is a reliable indicator of gemination
in Hindi stop consonants in continuous speech.

Reaction Time as an Indicator of Discrete Intonational Contrasts in English
Aoju Chen; University of Nijmegen, The Netherlands
This paper reports a perceptual study using a semantically motivated identification task in which we investigated the nature of two
pairs of intonational contrasts in English: (1) normal High accent
vs. emphatic High accent; (2) early peak alignment vs. late peak
alignment. Unlike previous inquiries, the present study employs an
on-line method using Reaction Time measurements, in addition
to the measurement of response frequencies. Regarding the peak
height continuum, the mean RTs are shortest for within-category
identification and longest for across-category identification. As for
the peak alignment contrast, no identification boundary emerges,
and the mean RTs only reflect a difference between peaks aligned
with the vowel onset and peaks aligned elsewhere. We conclude
that the peak height contrast is discrete, but the previously claimed
discreteness of the peak alignment contrast is not borne out.
Quantity Comparison of Japanese and Finnish in Various Word Structures
Toshiko Isei-Jaakkola; University of Helsinki, Finland
The durational patterns of short and long vowels and consonants
were investigated at the segmental and lexical levels using variable
syllable structures in Japanese and Finnish. The results showed that
the Japanese segmental ratios between short and long, in both vowels and consonants, were larger than those of Finnish only when all
segments were pooled. However, this was not necessarily true when
their positions in different structures were observed. When the lexical increase ratios based on CVCV words were compared, Japanese and Finnish
each showed regular patterns according to word structure. The Japanese patterns were highly isochronous in all word
structures, whereas the Finnish durational ratios decreased steadily
within moraic word structures with the same number of
segments but different combinations of the same vowel and consonant. These results suggest that Japanese tends to be
more mora-counting than Finnish in temporal isochrony.

Session: PMoCe– Poster
Topics in Prosody & Emotional Speech
Time: Monday 13.30, Venue: Main Hall, Level -1
Chair: Keikichi Hirose, Tokyo Univ., Japan
Transforming F0 Contours
Ben Gillett, Simon King; University of Edinburgh, U.K.
Voice transformation is the process of transforming the characteristics of speech uttered by a source speaker, such that a listener
would believe the speech was uttered by a target speaker. Training
F0 contour generation models for speech synthesis requires a large
corpus of speech. If it were possible to adapt the F0 contour of one
speaker to sound like that of another speaker, using a small, easily obtainable parameter set, this would be extremely valuable. We
present a new method for the transformation of F0 contours from
one speaker to another based on a small linguistically motivated
parameter set. The system performs a piecewise linear mapping
using these parameters. A perceptual experiment clearly demonstrates that the presented system is at least as good as an existing
technique for all speaker pairs, and that in many cases it is much
better and almost as good as using the target F0 contour.
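A minimal sketch of a piecewise linear F0 mapping, assuming the anchors are each speaker's F0 floor, median and ceiling (the paper's actual parameter set is linguistically motivated; the Hz values here are hypothetical):

```python
import numpy as np

def map_f0(f0, src_anchors, tgt_anchors):
    """Piecewise linear map sending source F0 anchor points onto target ones;
    values between anchors are interpolated linearly."""
    return np.interp(f0, src_anchors, tgt_anchors)

src = [80.0, 120.0, 200.0]   # hypothetical source floor/median/ceiling (Hz)
tgt = [140.0, 210.0, 320.0]  # hypothetical target anchors (Hz)
contour = np.array([80.0, 100.0, 120.0, 160.0, 200.0])
mapped = map_f0(contour, src, tgt)
```

Each source anchor lands exactly on its target counterpart, so the transformed contour occupies the target speaker's range while preserving the relative shape of the source speaker's movements.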
Broad Focus Across Sentence Types in Greek
Mary Baltazani; University of California at Los
Angeles, USA
In Greek, main sentence stress is located on the rightmost constituent in ‘all new’ declaratives, but in all-new negatives, polar
questions, and wh-questions it is located on the negative particle,
the main verb, and the wh-word, respectively. I discuss the implications of
this pattern for the focus projection rules and for the accentedness
of discourse-new constituents.
Evaluation of the Affect of Speech Intonation Using
a Model of the Perception of Interval Dissonance
and Harmonic Tension
Norman D. Cook, Takeshi Fujisawa, Kazuaki Takami;
Kansai University, Japan
We report the application of a psychophysical model of pitch perception to the analysis of speech intonation. The model was designed to reproduce the empirical findings on the perception of
musical phenomena (the dissonance/consonance of intervals and
the tension/sonority of chords), but does not depend on specific
musical scales or tuning systems. Application to intonation allows
us to calculate the total dissonance and tension among the pitches
in the speech utterance. In an experiment using the 144 utterances
of 18 male and female subjects, we found greater dissonance and
harmonic tension in sentences with negative affect, in comparison
with sentences with positive affect.
Stress-Based Speech Segmentation Revisited
Sven L. Mattys; University of Bristol, U.K.
Although word stress is usually seen as a powerful speech segmentation cue, the results of several cross-modal fragment
priming experiments reveal strong limitations to stress-based segmentation. When stress was pitted against phonotactic and coarticulatory cues, substantial effects of the latter two cues were found,
but there was no evidence for stress-based segmentation. However,
when the stimuli were presented in a background of noise, the pattern of results reversed: strong syllables generated more priming
than weak ones, regardless of coarticulation and phonotactics. Furthermore, a similar dependency was found between stress and lexicality. Priming was stronger when the prime was preceded by a
real word than by a nonsense word, regardless of the stress pattern of the
prime. Yet, again, a reversal in cue dominance was observed when
the stimuli were played in noise. These results underscore the secondary role of stress-based segmentation in clear speech, and its efficiency in impoverished listening conditions. More generally, they
call for an integrated, hierarchical, and signal-contingent approach
to speech segmentation.

A New Pitch Modeling Approach for Mandarin Speech
Wen-Hsing Lai, Yih-Ru Wang, Sin-Horng Chen; National Chiao Tung University, Taiwan
In this paper, a new approach to modeling the syllable pitch contour of
Mandarin speech is proposed. It takes the mean and the shape of the syllable pitch contour as two basic modeling units and considers several affecting factors that contribute to their variations. The parameters of the two models are estimated automatically by the EM algorithm. Experimental results showed that RMSEs of 0.551 ms and
0.614 ms in the reconstructed pitch period were obtained for the closed
and open tests, respectively. All inferred values of the affecting
factors agreed well with our prior linguistic knowledge. Moreover,
the prosodic states automatically labeled by the pitch-mean model
provided useful cues for determining the prosodic phrase boundaries
occurring at inter-syllable locations without punctuation marks. It
is therefore a promising pitch modeling approach.

Emotion Recognition by Speech Signals
Oh-Wook Kwon, Kwokleung Chan, Jiucang Hao, Te-Won Lee; University of California at San Diego, USA
For emotion recognition, we selected pitch, log energy, formants,
mel-band energies, and mel-frequency cepstral coefficients (MFCCs)
as the base features, and added the velocity/acceleration of pitch
and MFCCs to form feature streams. We extracted statistics used
for discriminative classifiers, assuming that each stream is a one-dimensional signal. Extracted features were analyzed using
quadratic discriminant analysis (QDA) and support vector machines
(SVM). Experimental results showed that pitch and energy were the
most important factors. Using two different kinds of databases,
we compared the emotion recognition performance of various classifiers: SVM, linear discriminant analysis (LDA), QDA and hidden
Markov models (HMM). With the text-independent SUSAS database,
we achieved a best accuracy of 96.3% for stressed/neutral style
classification and 70.1% for 4-class speaking style classification using a Gaussian SVM, which is superior to previous results. With
the speaker-independent AIBO database, we achieved 42.3% accuracy for 5-class emotion recognition.

Bayesian Induction of Intonational Phrase Breaks
P. Zervas, M. Maragoudakis, Nikos Fakotakis, George Kokkinakis; University of Patras, Greece
In the present paper, a Bayesian probabilistic framework for the
task of automatic acquisition of intonational phrase breaks was established. By considering two different conditional independence
assumptions, the naïve Bayes and Bayesian network approaches
were evaluated against the CART algorithm, which
has previously been used with success. A finite-length window of
minimal morphological and syntactic resources was incorporated,
i.e. the POS label and the kind of phrase boundary, a novel syntactic feature that has not been applied to intonational phrase break
detection before. This feature can be used in languages where syntactic parsers are not available, and it proves to be important not only
for the proposed Bayesian methodologies but also for other algorithms,
like CART. Trained on a 5500-word database, Bayesian networks
proved to be the most effective in terms of precision (82.3%) and
recall (77.2%) for predicting phrase breaks.

Automatic Prosodic Prominence Detection in Speech Using Acoustic Features: An Unsupervised System
Fabio Tamburini; University of Bologna, Italy
This paper presents work in progress on the automatic detection of
prosodic prominence in continuous speech. Prosodic prominence
involves two different phonetic features: pitch accents, connected
with fundamental frequency (F0) movements and overall syllable
energy, and stress, which exhibits a strong correlation with syllable-nucleus duration and mid-to-high-frequency emphasis. By measuring these acoustic parameters it is possible to build an automatic
system capable of correctly identifying prominent syllables, with
an agreement with human-tagged data comparable to the inter-human agreement reported in the literature. This system does not
require any training phase, additional information or annotation; it
is not tailored to a specific set of data and can easily be adapted to
different languages.

Predicting the Perceptive Judgment of Voices in a Telecom Context: Selection of Acoustic Parameters
T. Ehrette 1, N. Chateau 1, Christophe d’Alessandro 2, V. Maffiolo 1; 1 France Télécom R&D, France; 2 LIMSI-CNRS, France
Perception of vocal styles is of paramount importance in vocal
server applications, as the global style of a telecom service is highly
dependent on the voice used. In this work we develop tools for
automatic inference of perceived vocal styles for a set of 100 vocal sequences. In a first stage, twenty subjective evaluation criteria
were identified by running perceptive experiments with naïve
listeners. In a second stage, the vocal sequences were parameterised using more than a hundred acoustic features representing
prosody, spectral energy distribution, articulation and waveform.
Then, regression analysis and neural networks are used to predict
the subjective score of each voice for each subjective criterion. The results show that the prediction error is generally low:
it seems possible to automatically predict the perceived quality of
the sequences. Moreover, the prediction error decreases when non-significant parameters are removed.

Improved Emotion Recognition with Large Set of Statistical Features
Vladimir Hozjan, Zdravko Kačič; University of Maribor, Slovenia
This paper presents and discusses speaker-dependent emotion
recognition with a large set of statistical features. Speaker-dependent emotion recognition currently gives the best accuracy performance. Recognition was performed on the English, Slovenian, Spanish,
and French InterFace emotional speech databases. All databases
include 9 speakers. The InterFace databases include a neutral speaking style and six emotions: disgust, surprise, joy, fear, anger and
sadness. Speech features for emotion recognition were determined
in two steps. In the first step, acoustical features were defined, and
in the second, statistical features were calculated from the acoustical
features. The acoustical features are composed of pitch, the derivative
of pitch, energy, the derivative of energy, the duration of speech segments,
jitter, and shimmer. The statistical features are statistical representations
of the acoustical features. In a previous study the feature vector was composed of 26 elements; in this study it was composed of 144 elements. The new feature set is called the large set
of statistical features. Emotion recognition was performed using
artificial neural networks. A significant improvement was achieved
for all speakers except the Slovenian male and the second English male
speaker, where the improvement was about 2%. The large set of statistical features improves the accuracy of recognised emotion by about
18% on average.
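The two-step feature construction described above (acoustic tracks first, utterance-level statistics second) can be sketched as follows; the five statistics and the toy contours are stand-ins for the paper's 144-element set:

```python
import numpy as np

def track_stats(track):
    """Utterance-level statistics of one acoustic feature track."""
    t = np.asarray(track, dtype=float)
    return [t.mean(), t.std(), t.min(), t.max(), t.max() - t.min()]

def emotion_features(pitch, energy):
    """Statistical feature vector over pitch, energy and their frame deltas."""
    tracks = [np.asarray(pitch, dtype=float), np.diff(pitch),
              np.asarray(energy, dtype=float), np.diff(energy)]
    return np.concatenate([track_stats(t) for t in tracks])

fv = emotion_features(pitch=[120.0, 130.0, 125.0, 140.0],
                      energy=[60.0, 62.0, 61.0, 59.0])  # 4 tracks x 5 stats
```

The resulting fixed-length vector can then be fed to a classifier such as the artificial neural networks used in the paper.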
How does Human Segment the Speech by Prosody?
Toshie Hatano, Yasuo Horiuchi, Akira Ichikawa; Chiba University, Japan
In this study, we introduce a new model of how a human understands speech in real time and report a cognitive experiment investigating the unit for processing and understanding
speech. In the model, humans first segment the acoustic signal
into acoustic units, and the mental lexicon is then accessed
and searched for the segmented units. We believe that prosodic information must be used for this segmentation. To investigate how humans segment acoustic speech using only prosody,
we performed an experiment in which participants listened to a pair
of segmented speech materials, each derived from the same speech material but with two different segmentation positions, and judged which material sounded
more natural. On the basis of the results of this experiment, we
suggest that humans tend to segment speech according to the accent
rules of Japanese, which supports the introduced model.

Recognition of Intonation Patterns in Thai Utterance
Patavee Charnvivit, Nuttakorn Thubthong, Ekkarit Maneenoi, Sudaporn Luksaneeyanawin, Somchai Jitapunkul; Chulalongkorn University, Thailand
Thai intonation can be categorized as paralinguistic information carried by the
F0 contour of the utterance. There are three classes of intonation
pattern in Thai: the Fall Class, the Rise Class, and the Convolution
Class. This paper presents a method for recognizing the intonation patterns of Thai utterances. Two intonation feature contours, extracted
from the F0 contour, are proposed. The feature contours were converted to feature vectors used as input to a neural network recognizer. The recognition results show an average recognition
rate of 63.4% for male speakers and 75.4% for female speakers. The
recognizer distinguishes the Fall Class from the others better than it
distinguishes between the Rise Class and the Convolution Class.

Language-Reconfigurable Universal Phone Recognition
B.D. Walker, B.C. Lackey, J.S. Muller, P.J. Schone; U.S. Department of Defense, USA
We illustrate the development of a universal phone recognizer for
conversational telephone-quality speech. The acoustic models for
this system were trained in a novel fashion and with a wide variety of language data, thus permitting it to recognize most of the
world’s major phonemic categories. Moreover, with push-button
ease, this recognizer can automatically reconfigure itself to apply
the strongest language model in its inventory to whatever language
it is used on. In this paper, we not only describe this system but
also provide performance measurements for it using extensive
testing material, both from languages in its training set and
from a language it has never seen. Surprisingly, the recognizer produces near-equivalent performance between the two types of data,
thus showing its true universality. This recognizer presents a viable
solution for processing conversational, telephone-quality speech in
any language – even in low-density languages.

Use of Linguistic Information for Automatic Extraction of F0 Contour Generation Process Model Parameters
Keikichi Hirose, Yusuke Furuyama, Shuichi Narusawa, Nobuaki Minematsu, Hiroya Fujisaki; University of Tokyo, Japan
A method was developed to utilize linguistic information (lexical accent types and syntactic boundaries) to improve the performance
of the automatic extraction of F0 contour generation process
model commands. The extraction scheme first smooths the
observed F0 contour with a piecewise 3rd-order polynomial function
and locates accent command positions by taking the derivative
of the function. If the results of automatic extraction differ from
those estimated from the linguistic information, they are modified
according to several rules. The results showed that some errors
could be corrected by the use of linguistic information, especially
when the initial word of an accent phrase has a type 0 (flat) accent. As
a whole, the correct extraction rate (recall) was increased from
79.8% to 82.3% for phrase commands and from 81.6% to 85.9% for
accent commands.

Emotion Recognition Using a Data-Driven Fuzzy Inference System
Chul Min Lee, Shrikanth Narayanan; University of Southern California, USA
The need for and importance of automatically recognizing emotions
from human speech have grown with the increasing role of human-computer interaction applications. This paper explores the detection of domain-specific emotions using a fuzzy inference system with
two emotion categories, negative and non-negative emotions.
The input features are a combination of segmental and suprasegmental acoustic information; feature sets are selected from a 21-dimensional feature set and applied to the fuzzy classifier. Our
fuzzy inference system is designed through a data-driven approach.
The design of the fuzzy inference system has two phases: initialization, for which the fuzzy c-means method is used, and fine-tuning of the parameters of the fuzzy model, for which a
well-known neuro-fuzzy method is used. Results on spoken
dialog data from a call center application show that the optimized
FIS with two rules (FIS-2) improves emotion classification by 63.0%
for male data and 73.7% for female data over previous results obtained
using a linear discriminant classifier.

Potential Audiovisual Correlates of Contrastive Focus in French
Marion Dohen, Hélène Lœvenbruck, Marie-Agnès Cathiard, Jean-Luc Schwartz; ICP-CNRS, France
The long-term purpose of this study is to determine whether there
are “visual” cues to prosody. An audiovisual corpus was recorded
from a male native French speaker. The sentences had a subject-verb-object (SVO) syntactic structure. Four conditions were studied:
focus on each phrase (S, V, O) and no focus. Normal and reiterant
modes were recorded. We first measured F0, duration and intensity to validate the corpus. The pitch maximum over the utterance
was generally on a focused syllable, and duration and intensity were
higher for the focused syllables. Then lip aperture and jaw opening
were extracted from the video. The jaw opening maximum generally fell on one of the focused syllables, but peak velocity was more
consistently correlated with focus. Moreover, lip closure duration
was longer for the first segment of the focused phrase. We can
therefore assume that there are visual aspects of prosody that may
be used in communication.

Effects of Voice Prosody by Computers on Human Behaviors
Noriko Suzuki 1, Yohei Yabuta 2, Yugo Takeuchi 2, Yasuhiro Katagiri 1; 1 ATR-MIS, Japan; 2 Shizuoka University, Japan
This paper examines whether a human is aware of slight prosodic
differences in a computer voice and changes his/her behavior accordingly through interaction, when the prosodic difference carries
informational significance. We conducted a route selection experiment in which subjects were asked to find a route in a computer-generated 3-D maze. The maze system occasionally provides a
confirmation in response to the subject’s choice of a route. The
prosodic characteristics of the confirmation utterances are made to
change marginally according to whether the selected route is the
right route for reaching the goal or a wrong route that ends up
in a cul-de-sac. In this experiment, subjects were able to pick up
the difference and successfully navigate through the maze. This
result demonstrates that subjects are sensitive to even a slight
change in the voice’s prosodic characteristics and that computer
voice prosody can affect the route selection behavior of subjects.
one male and one female. The corpus was labeled on the syllabic
level and analyzed using the Fujisaki model. Results show that the
six tone types basically fall into two categories: Level, rising, curve
and falling tone can be accurately modeled by using tone commands
of positive or negative polarity. The so-called drop and broken
tones, however, obviously require a special control causing creaky
voice and in cases a very fast drop in F0 leading to temporary F0
halving or even quartering. In contrast to the drop tone, the broken
tone exhibits an F0 rise and hence a positive tone command right
after the creak occurs. Further observations suggest that drop and
broken tone do not only differ from the other four tones with respect to their F0 characteristics, but also as to their much tenser
articulation. A perception experiment performed with natural and
resynthesized stimuli shows, inter alia, that tone 4 is most prone to
confusion and that tone 6 obviously requires tense articulation as
well as vocal fry to be identified reliably.
An Investigation of Intensity Patterns for German
Oliver Jokisch, Marco Kühne; Dresden University of
Technology, Germany
The perceived quality of synthetic speech strongly depends on its
prosodic naturalness. Concerning the control of duration and fundamental frequency in a speech synthesis system, sophisticated
models have been developed during the last decade. Speech intensity modeling is often considered as algorithmically and perceptually less important. Departing from a syllable-based, trainable
prosody model the authors tested new factors of influence to improve the predicted intensity contour on phonemic level. Therefore,
a German newsreader corpus has been analyzed with respect to typical intensity patterns. The f0-intensity interaction has the most
significant influence and was perceptually evaluated by 32 listeners
ranking 20 different stimuli. Using an elementary, linear intensity
model, modified natural speech only slightly degrades about 0.3 at
the ITU-T conform MOS scale.
Japanese Prosodic Labeling Support System
Utilizing Linguistic Information
Shinya Kiriyama, Yoshifumi Mitsuta, Yuta Hosokawa,
Yoshikazu Hashimoto, Toshihiko Ito, Shigeyoshi
Kitazawa; Shizuoka University, Japan
A prosodic labeling support system has been developed. Largescale prosodic databases are strongly desired for years, however,
the construction of databases depend on hand labeling, because of
the variety of prosody. We aim at not automating the whole labeling
process, but making the hand labeling work more efficient by providing the labelers with the appropriate support information. The
methods of auto-generating initial phoneme and prosodic labels utilizing linguistic information are proposed and evaluated. The experimental results showed that more than 70% of prosodic labels
were correctly generated, and proved the efficiency of the proposed
methods. The results also yielded the useful knowledge to support
the labelers.
Segmental Durations Predicted with a Neural
Network
João Paulo Teixeira 1 , Diamantino Freitas 2 ;
1
Polytechnic Institute of Bragança, Portugal;
2
University of Porto, Portugal
This paper presents a segmental durations’ model applied to the European Portuguese language for TTS purposes. The model is based
on a feed-forward neural network, trained with a back-propagation
algorithm, and has as input a set of phonological and contextual
features, automatically extracted from the text. The relative importance of each feature, concerning the correlation with segmental
durations and improvements in the performance of the model, is
presented. Finally the model is evaluated objectively and subjectively by a perceptual test.
Why and How to Control the Authentic Emotional
Speech Corpora
Véronique Aubergé, Nicolas Audibert, Albert Rilliard;
ICP-CNRS, France
Generation and Perception of F0 Markedness in
Conversational Speech with Adverbs Expressing
Degrees
The affects are expressed in different levels of speech: metalinguistic (expressiveness), linguistic (attitudes), both anchored in
the “linguistic time”, and para-linguistic (emotions expressions) that
is anchored in the emotional causes timing. In an experimental approach, the corpus are the base of analysis. Main of emotional corpus have been produced by acting/elicitating speakers on one side
(with a possible strong control), and on the other side they have
been collected in “reallife”. This paper proposes both to generate a
Wizard of Oz method and some tools (E-Wiz and Top Logic, Sound
Teacher applications) in order to control the production of authentic data, separately for the three levels of affects.
Takumi Yamashita, Yoshinori Sagisaka; Waseda
University, Japan
Aiming at natural F0 control for conversational speech synthesis, F0
characteristics are analyzed from both generation and perception
viewpoints. By systematically designing conversational situations
and utterances with adverb phrases expressing different degree of
markedness, their F0 characteristics are compared. The comparison shows the consistent F0 control dependencies not only on adverbs themselves but also on the attribute of neighboring adjective phrases. Strong positive/negative correlation is observed between the markedness of adverbs and F0 height when an adjective
phrase with a positive/negative image is followed to the current adverb phrase. These consistencies have been perceptually confirmed
by naturalness evaluation tests using the same two-phrase samples
with different F0 heights. These results indicate the possibility of F0
control for natural conversational speech using lexical markedness
information and adjacent word attributes.
Prosodic Cues for Emotion Characterization in
Real-Life Spoken Dialogs
Laurence Devillers 1 , Ioana Vasilescu 2 ; 1 LIMSI-CNRS,
France; 2 ENST-CNRS, France
This paper reports on an analysis of prosodic cues for emotion
characterization in 100 natural spoken dialogs recorded at a telephone customer service center. The corpus annotated with taskdependent emotion tags which were validated by a perceptual test.
Two F0 range parameters, one at the sentence level and the other
at the subsegment level, emerge as the most salient cues for emotion classification. These parameters can differentiate between negative emotion (irritation/anger, anxiety/fear) and neutral attitude
and confirm trends illustrated by the perceptual experiment.
Quantitative Analysis and Synthesis of Syllabic
Tones in Vietnamese
Hansjörg Mixdorff 1 , Nguyen Hung Bach 2 , Hiroya
Fujisaki 3 , Mai Chi Luong 2 ; 1 Berlin University of
Applied Sciences, Germany; 2 National Centre for
Science and Technology, Vietnam; 3 University of
Tokyo, Japan
The current paper presents a preliminary study on the production
and perception of syllabic tones of Vietnamese. A speech corpus
consisting of fifty-two six-syllable sequences with various combinations of tones was uttered by two speakers of Standard Vietnamese,
Eurospeech 2003
Monday
September 1-4, 2003 – Geneva, Switzerland
Session: PMoCf – Poster
Language Modeling, Discourse & Dialog
Time: Monday 13.30, Venue: Main Hall, Level -1
Chair: Peter Heeman, Oregon Graduate Institute, USA

Disfluency Under Feedback and Time-Pressure
H.B.M. Nicholson 1 , E.G. Bard 1 , A.H. Anderson 2 , M.L. Flecha-Garcia 1 , D. Kenicer 2 , L. Smallwood 2 , J. Mullin 2 , R.J. Lickley 3 , Y. Chen 1 ; 1 University of Edinburgh, U.K.; 2 University of Glasgow, U.K.; 3 Queen Margaret University College, U.K.

Speakers engaging in dialogue with another conversationalist must create and execute plans with respect to the content of the utterance. An analysis of disfluencies from Map Task monologues shows that a speaker is influenced by the pressure to communicate with a distant listener. Speakers were also subject to time-pressure, thereby increasing the cognitive burden of the overall task at hand. The duress of the speaker, as determined by disfluency rate, was examined across four conditions of variable feedback and timing. A surprising result was found that does not adhere to the predictions of the traditional views concerning collaboration in dialogue.

Towards the Automatic Generation of Mixed-Initiative Dialogue Systems from Web Content
Joseph Polifroni 1 , Grace Chung 2 , Stephanie Seneff 1 ; 1 Massachusetts Institute of Technology, USA; 2 Corporation for National Research Initiatives, USA

Through efforts over the past fifteen years, we have acquired a great deal of experience in designing spoken dialogue systems that provide access to large corpora of data in a variety of knowledge domains, such as flights, hotels, restaurants, and weather. In our recent research, we have begun to shift our focus towards developing tools that enable the rapid development of new applications. This paper addresses a novel approach that drives system design from the on-line knowledge resource. We were motivated by a desire to minimize the need for a pre-determined dialogue flow. In our approach, decisions on dialogue flow are made dynamically based on analyses of data, either prior to user interaction or during the dialogue itself. Automated methods, used to organize numeric and symbolic data, can be applied at every turn as user constraints are being specified. This helps the user mine through large data sets down to a few choices by allowing the system to synthesize intelligent summaries of the data, created on the fly at every turn. Moreover, automated methods are ultimately more robust against the frequent changes to on-line content. Simulations generating hundreds of dialogues have produced log files that allow us to assess and improve system behavior, including system responses and interactions with the dialogue flow. Together, these techniques aim towards the goal of instantiating new domains with little or no input from a human developer.

Control in Task-Oriented Dialogues
Peter A. Heeman, Fan Yang, Susan E. Strayer; Oregon Health & Science University, USA

In this paper, we explore the mechanisms by which conversants control the direction of a dialogue. We find further evidence that control in task-oriented dialogues is subordinate to discourse structure. The initiator of a discourse segment has control; the non-initiator can contribute to the purpose of the segment, but this does not result in that person taking over control. The proposal has important implications for dialogue management, as it will pave the way for building dialogue systems that can engage in mixed-initiative dialogues.

The 300k LIMSI German Broadcast News Transcription System
Kevin McTait, Martine Adda-Decker; LIMSI-CNRS, France

This paper describes improvements to the existing LIMSI German broadcast news transcription system, in particular its extension from a 65k-word vocabulary to 300k words. Automatic speech recognition for German is more problematic than for a language such as English, in that the inflectional morphology of German and its highly generative process of compounding lead to many more out-of-vocabulary words for a given vocabulary size. Experiments undertaken to tackle this problem and reduce the transcription error rate include bringing the language models up to date, improving the pronunciation models, semi-automatically constructing pronunciation lexicons, and increasing the size of the system's vocabulary.

A Context Resolution Server for the Galaxy Conversational Systems
Edward Filisko, Stephanie Seneff; Massachusetts Institute of Technology, USA

The context resolution (CR) component of a conversational dialogue system is responsible for interpreting a user's utterance in the context of previously spoken user utterances, spatial and temporal context, inference, and shared world knowledge. This paper describes a new and independent CR server for the GALAXY conversational system framework. The functionality provided by the CR server includes the inheritance and masking of historical information, pragmatic verification, and reference and ellipsis resolution. The new server additionally features a process that attempts to reconstruct the intention of the user given a robust parse of an utterance. Design issues are described, followed by a description of each function in the context resolution process along with examples. The effectiveness of the CR server in various domains attests to its success as a module for context resolution.

Semantic and Dialogic Annotation for Automated Multilingual Customer Service
Hilda Hardy 1 , Kirk Baker 2 , Hélène Bonneau-Maynard 3 , Laurence Devillers 3 , Sophie Rosset 3 , Tomek Strzalkowski 1 ; 1 University at Albany, USA; 2 Duke University, USA; 3 LIMSI-CNRS, France

One central goal of the AMITIÉS multilingual human-computer dialogue project is to create a dialogue management system capable of engaging the user in human-like conversation in a specific domain. To that end, we have developed new methods for the manual annotation of spoken dialogue transcriptions from European financial call centers. We have modified the DAMSL dialogic schema to create a dialogue act taxonomy appropriate for customer services. To capture the semantics, we use a domain-independent framework populated with domain-specific lists. We have designed a new flexible, platform-independent annotation tool, XDML Tool, and annotated several hundred dialogues in French and English. Inter-annotator agreement was moderate. We are using these data to design our dialogue system, and we hope that they will help us to derive appropriate dialogue strategies for novel situations.

Weighted Entropy Training for the Decision Tree Based Text-to-Phoneme Mapping
Jilei Tian 1 , Janne Suontausta 1 , Juha Häkkinen 2 ; 1 Nokia Research Center, Finland; 2 Nokia Mobile Phones, Finland

The pronunciation model providing the mapping from the written form of words to their pronunciations is called the text-to-phoneme (TTP) mapping. Such a mapping is commonly used in automatic speech recognition (ASR) as well as in text-to-speech (TTS) applications. Rule-based TTP mappings can be derived for structured languages, such as Finnish and Japanese. Data-driven TTP mappings are usually applied for non-structured languages such as English and Danish. Artificial neural network (ANN) and decision tree (DT) approaches are commonly applied to this task. Compared to ANN methods, DT methods usually provide more accurate pronunciation models. DT methods can, however, lead to a set of models with a high memory footprint if the mappings between letters and phonemes are complex. In this paper, we present a weighted entropy training method for DT-based TTP mapping. Statistical information about the vocabulary is utilized in the training process in order to optimize TTP performance under predefined memory requirements. The results obtained in simulation experiments indicate that the memory requirements of the TTP models can be significantly reduced without degrading the mapping accuracy. The applicability of the approach is also verified in speech recognition experiments.
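As a rough illustration of the weighted-entropy idea in this abstract (our own minimal sketch, not the authors' formula), per-sample weights such as corpus-frequency counts can replace unit counts in the entropy used to score decision-tree splits:

```python
import math
from collections import defaultdict

def weighted_entropy(samples):
    """Entropy of phoneme labels where each (label, weight) sample
    contributes its weight, e.g. a corpus-frequency count, instead of 1."""
    totals = defaultdict(float)
    for label, weight in samples:
        totals[label] += weight
    total = sum(totals.values())
    return -sum((w / total) * math.log2(w / total) for w in totals.values())

def split_gain(parent, children):
    """Weighted information gain of splitting `parent` into `children`."""
    parent_total = sum(w for _, w in parent)
    remainder = sum(
        (sum(w for _, w in child) / parent_total) * weighted_entropy(child)
        for child in children
    )
    return weighted_entropy(parent) - remainder
```

A split would then be chosen to maximize this weighted gain rather than the unweighted one.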
Word Class Modeling for Speech Recognition with Out-of-Task Words Using a Hierarchical Language Model
Yoshihiko Ogawa 1 , Hirofumi Yamamoto 2 , Yoshinori Sagisaka 1 , Genichiro Kikui 2 ; 1 Waseda University, Japan; 2 ATR-SLT, Japan

Out-of-vocabulary (OOV) problems are frequently seen when adapting a language model to another task where some word classes are observed but few individual words, such as names, places and other proper nouns. Simple task adaptation cannot handle this problem properly. In this paper, for task-dependent OOV words in the noun category, we adopt a hierarchical language model. In this modeling, the lower-class model expressing word phonotactics does not require any additional task-dependent corpora for training. It can be trained independently of the upper-class model of conventional word class N-grams, as the proposed hierarchical model clearly separates inter-word and intra-word characteristics. This independent-layered training capability makes it possible to apply this model to general vocabularies and tasks in combination with conventional language model adaptation techniques. Speech recognition experiments showed a 19-point increase in word accuracy (from 54% to 73%) on sentences containing OOVs, and comparable accuracy (85%) on sentences without OOVs, compared with a conventional adapted model. This improvement corresponds to the performance obtained when all OOVs are ideally registered in a dictionary.

Hierarchical Class N-Gram Language Models: Towards Better Estimation of Unseen Events in Speech Recognition
Imed Zitouni 1 , Olivier Siohan 2 , Chin-Hui Lee 3 ; 1 Lucent Technologies, USA; 2 IBM T.J. Watson Research Center, USA; 3 Georgia Institute of Technology, USA

In this paper, we show how a multi-level class hierarchy can be used to better estimate the likelihood of an unseen event. In classical backoff n-gram models, the (n-1)-gram model is used to estimate the probability of an unseen n-gram. In the approach we propose, we use a class hierarchy to define an appropriate context which is more general than the unseen n-gram but more specific than the (n-1)-gram. Each node in the hierarchy is a class containing all the words of the descendant nodes (classes); hence, the closer a node is to the root, the more general the corresponding class is. We also investigate the impact of the hierarchy depth and the Turing discount coefficient on the performance of the model. We evaluate the backoff hierarchical n-gram models on the WSJ database with two large vocabularies, 5,000 and 20,000 words. Experiments show up to 26% improvement in perplexity on unseen events and up to 12% improvement in WER when a backoff hierarchical class trigram language model is used on an ASR test set with a relatively large number of unseen events.

Compound Decomposition in Dutch Large Vocabulary Speech Recognition
Roeland Ordelman, Arjan van Hessen, Franciska de Jong; University of Twente, The Netherlands

This paper addresses compound splitting for Dutch in the context of broadcast news transcription. Language models were created using original text versions and text versions that were decomposed using a data-driven compound splitting algorithm. Language model performances were compared in terms of out-of-vocabulary rates and word error rates in a real-world broadcast news transcription task. It was concluded that compound splitting does improve ASR performance. The best results were obtained when frequent compounds were not decomposed.

Incremental and Iterative Monolingual Clustering Algorithms
Sergio Barrachina, Juan Miguel Vilar; Universidad Jaume I, Spain

To reduce speech recognition error rates, we can use better statistical language models. These models can be improved by grouping words into word equivalence classes, and clustering algorithms can be used to perform this grouping automatically. We present an incremental clustering algorithm and two iterative clustering algorithms, and compare them with previous algorithms. The experimental results show that the two iterative algorithms perform as well as previous ones. One of them, which uses the leaving-one-out technique, can automatically determine the optimum number of classes. These iterative algorithms are used by the incremental one. The proposed incremental algorithm achieves the best results among the compared algorithms; its behavior is the most regular as the number of classes varies, and it can automatically determine the optimum number of classes.

Designing for Errors: Similarities and Differences of Disfluency Rates and Prosodic Characteristics Across Domains
Guergana Savova 1 , Joan Bachenko 2 ; 1 Mayo Clinic, USA; 2 Linguistech Consortium, USA

This paper examines characteristics of disfluencies in human-human (HHI) and human-computer (HCI) interaction corpora to outline similarities and differences. The main variables studied are disfluency rates and prosodic features. Structured, table-like input increases the disfluency rate in HCI and decreases it in HHI. Direct exposure (visibility) to the interface also increases the rate and gives speech a unique prosodic pattern of hyperarticulation. In most of the studied corpora, silences at the disfluency site are not predicted by syntactic rules. Similarities between HCI and HHI exist mainly in the prosodic realizations of the reparandum and the repair. The findings contribute to better understanding and modeling of disfluencies. Speech-based interfaces need to focus on communication types that are well understood and amenable to good modeling.

Techniques for Effective Vocabulary Selection
Anand Venkataraman, Wen Wang; SRI International, USA

The vocabulary of a continuous speech recognition (CSR) system is a significant factor in determining its performance. In this paper, we present three principled approaches to selecting the target vocabulary for a particular domain by trading off the target out-of-vocabulary (OOV) rate against vocabulary size. We evaluate these approaches against an ad hoc baseline strategy. Results are presented in the form of OOV rate graphs plotted against increasing vocabulary size for each technique.

Recognition of Out-of-Vocabulary Words with Sub-Lexical Language Models
Lucian Galescu; Institute for Human and Machine Cognition, USA

Out-of-vocabulary (OOV) words are a major source of recognition errors and are also semantically important; recognizing them is therefore crucial for understanding. Success so far has been modest, even on very constrained tasks. In this paper we present a new approach to unlimited vocabulary speech recognition based on using grapheme-to-phoneme correspondences for sub-lexical modeling of OOV words, together with some very encouraging results obtained with our approach on a large vocabulary speech recognition task.

Syllable Classification Using Articulatory-Acoustic Features
Mirjam Wester; University of Edinburgh, U.K.

This paper investigates the use of articulatory-acoustic features for the classification of syllables in TIMIT. The main motivation for this study is to circumvent the "beads-on-a-string" problem, i.e. the assumption that words can be described as a simple concatenation of phones. Posterior probabilities for articulatory-acoustic features are obtained from artificial neural nets and are used to classify speech within the scope of syllables instead of phones. This makes it possible to account for asynchronous feature changes, exploiting the strengths of the articulatory-acoustic features instead of losing their potential by reverting to phones.
A Semantic Representation for Spoken Dialogs
Hélène Bonneau-Maynard, Sophie Rosset; LIMSI-CNRS, France

This paper describes a semantic annotation scheme for spoken dialog corpora. Manual semantic annotation of large corpora is tedious, expensive, and subject to inconsistencies, yet consistency is necessary to increase the usefulness of a corpus for developing and evaluating spoken understanding models and for linguistic studies. A semantic representation based on a concept dictionary definition has been formalized and is described. Each utterance is divided into semantic segments, and each segment is assigned a 5-tuple containing a mode, the underlying concept, the normalized form of the concept, the list of related segments, and an optional comment about the annotation. Based on this scheme, a tool was developed which ensures that the provided annotations respect the semantic representation. The tool includes interfaces for both the formal definition of the hierarchical concept dictionary and the annotation process. An experiment was conducted to assess inter-annotator agreement using both a human-human dialog corpus and a human-machine dialog corpus. For the human-human dialogs, the agreement rate computed on the triplets (mode, concept, value) is 61%, and the agreement rate on the concepts alone is 74%. For the human-machine dialogs, the agreement rate on the triplets is 83% and the correct concept identification rate is 93%.

A Corpus-Based Decompounding Algorithm for German Lexical Modeling in LVCSR
Martine Adda-Decker; LIMSI-CNRS, France

In this paper a corpus-based decompounding algorithm is described and applied to German LVCSR. The decompounding algorithm helps address two major problems in LVCSR: lexical coverage and letter-to-sound conversion. The idea of the algorithm is simple: given a word start of length k, only a few different characters can continue an admissible word in the language; but for compounds, if the word start of length k reaches a constituent word boundary, the set of successor characters can theoretically include any character. The algorithm has been applied to a 300M-word corpus with 2.6M distinct words, and 800k decomposition rules have been extracted automatically. Relative OOV (out-of-vocabulary) word reductions of 25% to 50% have been achieved using word lists of 65k to 600k words. Pronunciation dictionaries have been developed for the LIMSI 300k German recognition system. As no language-specific knowledge is required beyond the text corpus, the algorithm can be applied more generally to any compounding language.

Modeling Cross-Morpheme Pronunciation Variations for Korean Large Vocabulary Continuous Speech Recognition
Kyong-Nim Lee, Minhwa Chung; Sogang University, Korea

In this paper, we describe a cross-morpheme pronunciation variation model which is especially useful for constructing a morpheme-based pronunciation lexicon for Korean LVCSR. Many pronunciation variations occur at morpheme boundaries in continuous speech. Since phonemic context, together with morphological category and morpheme boundary information, affects Korean pronunciation variations, we have distinguished pronunciation variation rules according to location: within a morpheme, across a morpheme boundary in a compound noun, across a morpheme boundary in an eojeol, and across an eojeol boundary. In a 33k-morpheme Korean CSR experiment, an absolute improvement of 1.16% in WER from the baseline performance of 23.17% WER is achieved by modeling cross-morpheme pronunciation variations with a context-dependent multiple pronunciation lexicon.

Session: PMoCg – Poster
Speech Synthesis: Unit Selection I
Time: Monday 13.30, Venue: Main Hall, Level -1
Chair: Beat Pfister, TIK, ETHZ, Zurich, Switzerland

Unit Selection Based on Voice Recognition
Yi Zhou 1 , Yiqing Zu 2 ; 1 Shanghai Jiaotong University, China; 2 Motorola China Research Center, China

In this paper, we describe a perceptual voice recognition method to improve the naturalness of synthesized speech in a Mandarin Chinese text-to-speech (TTS) baseline system. In a large TTS speech corpus, the speech data typically have different acoustic properties because of different recording conditions, which can ultimately influence the naturalness of the synthesized speech. To address this, we separate the speech data in a TTS corpus into several voice classes using an iterative voice recognition method, similar in spirit to speaker recognition. Within each class, speech units are considered to have the same voice characteristics. Based on the voice recognition result, a novel unit selection algorithm is performed to select better units and synthesize more natural-sounding speech. Preliminary experiments show the feasibility and validity of the method.
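The abstract does not specify the iterative voice-classification step in detail; as a generic sketch of such an assign-and-re-estimate loop (our own k-means-style illustration over hypothetical utterance-level feature vectors, not the authors' method):

```python
import random

def classify_voices(utterances, num_classes, iters=20, seed=0):
    """Iteratively assign utterance-level feature vectors to voice classes:
    alternate between assigning each vector to its nearest class model
    (here a simple centroid) and re-estimating the models, k-means style."""
    rng = random.Random(seed)
    centroids = [list(u) for u in rng.sample(utterances, num_classes)]
    assignment = [0] * len(utterances)
    for _ in range(iters):
        # assignment step: nearest centroid by squared Euclidean distance
        for i, u in enumerate(utterances):
            assignment[i] = min(
                range(num_classes),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(u, centroids[c])),
            )
        # update step: recompute each class model from its members
        for c in range(num_classes):
            members = [u for i, u in enumerate(utterances) if assignment[i] == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centroids
```

A real system would presumably use speaker-recognition-style features and models rather than raw centroids.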
On Unit Analysis for Cantonese Corpus-Based TTS
Jun Xu, Thomas Choy, Minghui Dong, Cuntai Guan,
Haizhou Li; InfoTalk Technology, Singapore
This paper reports a study of unit analysis for concatenative TTS, which usually has an inventory of hundreds of thousands of voice units. It is known that the quality of the synthesis units is especially critical to the quality of the resulting corpus-based TTS system. This research focuses on the analysis of a Chinese Cantonese unit inventory, which was built earlier for open-vocabulary Chinese Cantonese TTS tasks. The analysis results show that the exercise helps identify the sources of pronunciation deficiency and suggests ways to address quality issues. After taking remedial measures, subjective tests on the improved system were carried out to validate the exercise. The test results are encouraging.
Unit Selection in Concatenative TTS Synthesis
Systems Based on Mel Filter Bank Amplitudes and
Phonetic Context
T. Lambert 1 , Andrew P. Breen 2 , Barry Eggleton 2 ,
Stephen J. Cox 1 , Ben P. Milner 1 ; 1 University of East
Anglia, U.K.; 2 Nuance Communications, U.K.
In concatenative text-to-speech (TTS) synthesis systems, unit selection aims to reduce the number of concatenation points in the synthesized speech and to make concatenation joins as smooth as possible.
This research considers synthesis of completely new utterances
from non-uniform units, whereby the most appropriate units, according to acoustic and phonetic criteria, are selected from a myriad of similar speech database candidates. A Viterbi-style algorithm dynamically selects the most suitable database units from
a large speech database by considering concatenation and target
costs. Concatenation costs are derived from mel filter bank amplitudes, whereas target costs are considered in terms of the phonemic
and phonetic properties of required units.
Within-subjects and between-subjects ANOVA [9] evaluation of listeners' scores showed that the TTS system with this method of unit selection was preferred in 52% of test sentences.
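The Viterbi-style search over target and concatenation costs described above can be sketched as follows (a minimal illustration with arbitrary cost functions, not the authors' system):

```python
def select_units(targets, candidates, target_cost, concat_cost):
    """Viterbi-style search over candidate units: pick one candidate per
    target position minimizing summed target and concatenation costs."""
    # best[i][j] = cheapest cost of any path ending in candidate j at position i
    best = [[target_cost(targets[0], c) for c in candidates[0]]]
    back = [[None] * len(candidates[0])]
    for i in range(1, len(targets)):
        row, ptr = [], []
        for j, cand in enumerate(candidates[i]):
            costs = [
                best[i - 1][k] + concat_cost(prev, cand)
                for k, prev in enumerate(candidates[i - 1])
            ]
            k_best = min(range(len(costs)), key=costs.__getitem__)
            row.append(costs[k_best] + target_cost(targets[i], cand))
            ptr.append(k_best)
        best.append(row)
        back.append(ptr)
    # trace back the cheapest full path
    j = min(range(len(best[-1])), key=best[-1].__getitem__)
    path = [j]
    for i in range(len(targets) - 1, 0, -1):
        j = back[i][j]
        path.append(j)
    path.reverse()
    return path, min(best[-1])
```

In the paper's setting, the concatenation cost would come from mel filter bank amplitude distances and the target cost from phonemic/phonetic properties.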
Text Design for TTS Speech Corpus Building Using
a Modified Greedy Selection
Baris Bozkurt 1 , Ozlem Ozturk 2 , Thierry Dutoit 3 ; 1 Multitel, Belgium; 2 Middle East Technical University, Turkey; 3 Faculté Polytechnique de Mons, Belgium
Speech corpora design is one of the key issues in building high-quality text-to-speech synthesis systems. Read speech is often used, since it seems to be the easiest way to obtain a recorded speech corpus with the highest control over the content. The main topic of this study is designing text for recording read-speech corpora for concatenative text-to-speech systems. We discuss the application of the greedy algorithm to text selection, proposing a new way of implementing it and comparing it with the standard implementation. Additionally, a text corpus design for Turkish TTS is presented.
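The standard greedy algorithm mentioned above can be sketched as follows (an illustrative baseline implementation with toy character-pair "diphones"; the paper's modified variant is not reproduced here):

```python
def greedy_select(sentences, units_of):
    """Standard greedy text selection: repeatedly add the sentence that
    covers the most not-yet-covered units (e.g. diphones)."""
    covered, chosen = set(), []
    remaining = list(range(len(sentences)))
    while remaining:
        gains = [(len(units_of(sentences[i]) - covered), i) for i in remaining]
        gain, best = max(gains, key=lambda t: (t[0], -t[1]))
        if gain == 0:
            break  # nothing new to cover
        chosen.append(best)
        covered |= units_of(sentences[best])
        remaining.remove(best)
    return chosen, covered

def diphones(sentence):
    """Toy unit extractor: adjacent character pairs stand in for diphones."""
    s = sentence.replace(" ", "")
    return {s[i:i + 2] for i in range(len(s) - 1)}
```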
Discriminative Weight Training for Unit-Selection
Based Speech Synthesis
Seung Seop Park, Chong Kyu Kim, Nam Soo Kim;
Seoul National University, Korea
Concatenative speech synthesis by selecting units from a large database has become popular due to the high quality of the synthesized speech. The units are selected by minimizing the combination of
target and join costs for a given sentence. In this paper, we propose
a new approach to train the weight parameters associated with the
cost functions used for unit selection in concatenative speech synthesis. We first view the unit selection as a classification problem,
and apply the discriminative training technique, which has been found to be an efficient approach to parameter estimation in speech recognition. Instead
of defining an objective function which accounts for the subjective
speech quality, we take the classification error as the objective function to be optimized. The classification error is approximated by a
smooth function and the relevant parameters are updated by means
of the gradient descent technique.
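The approach above, smoothing the classification error and updating weights by gradient descent, might be sketched roughly as follows (our own simplified illustration with a hypothetical misclassification measure, not the authors' exact formulation):

```python
import math

def train_weights(data, weights, lr=0.1, epochs=50, gamma=4.0):
    """Minimize a sigmoid-smoothed classification error by gradient descent.
    Each training case is (correct_feats, competitor_feats): sub-cost vectors
    for the reference unit sequence and its best competitor. The
    misclassification measure d = w . (correct - competitor) should be
    negative; the smoothed error is sigmoid(gamma * d)."""
    for _ in range(epochs):
        for correct, competitor in data:
            diff = [c - r for c, r in zip(correct, competitor)]
            d = sum(w * x for w, x in zip(weights, diff))
            s = 1.0 / (1.0 + math.exp(-gamma * d))
            g = gamma * s * (1.0 - s)  # derivative of sigmoid(gamma * d) w.r.t. d
            # gradient step, keeping weights non-negative
            weights = [max(0.0, w - lr * g * x) for w, x in zip(weights, diff)]
    return weights
```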
The Application of Interactive Speech Unit
Selection in TTS Systems
Peter Rutten, Justin Fackrell; Rhetorical Systems Ltd.,
U.K.
Speech unit selection algorithms have the task of finding a single sequence of speech units that optimally fits the target transcription
of an utterance that must be synthesized. In doing so, these algorithms ignore a very large number of possible alternative unit
sequences that lead to alternative renderings of that utterance. In
this paper we set out to explore these alternative unit sequences by introducing interactive unit selection.
Interactive unit selection is based on feedback from a listener. To collect this feedback we implement two levels of control: an elaborate
GUI, and a simple XML tag mechanism. The GUI offers access to
unit selection with a granularity of a single speech unit, and allows
a user to set prosodic constraints for the selection of alternative
speech units. The XML tag mechanism operates on words, and allows the user to request an nth-best alternative selection.
Results show that interactive unit selection succeeds in correcting
most of the synthesis problems that occur in our default synthesis system, providing very detailed information that can be used
to improve our run-time algorithms. This work not only provides
a powerful research tool, it also leads to a number of commercial
applications. The GUI can be used efficiently to improve speech
synthesis off-line – to the extent that it eliminates the need to make
special recordings for domain specific applications. The XML tag,
on the other hand, can be used to quickly optimize the output of
the system.
On the Design of Cost Functions for Unit-Selection
Speech Synthesis
Francisco Campillo Díaz, Eduardo R. Banga;
Universidad de Vigo, Spain
The quality of the synthetic speech provided by concatenative
speech systems depends heavily on the capability of accurately
modeling the different characteristics of speech segments. Moreover, the relative significance or weighting of each feature in the
unit selection process is a key point in the relationship between
synthetic speech and human perception. In this paper we propose a
new method for optimizing these weights, performing separate training according to the nature of the different parts of the cost function, i.e., the features related to the phonetic context of the units
and the features related to their prosodic characteristics. This work
is mainly focused on the target cost function.
Kalman-Filter Based Join Cost for Unit-Selection
Speech Synthesis
Jithendra Vepa, Simon King; University of Edinburgh,
U.K.
We introduce a new method for computing join cost in unit-selection speech synthesis which uses a linear dynamical model
(also known as a Kalman filter) to model line spectral frequency
trajectories. The model uses an underlying subspace in which it
makes smooth, continuous trajectories. This subspace can be seen
as an analogy for underlying articulator movement. Once trained,
the model can be used to measure how well concatenated speech
segments join together. The objective join cost is based on the
error between model predictions and actual observations. We report correlations between this measure and mean listener scores
obtained from a perceptual listening experiment. Our experiments
use a state-of-the-art unit-selection text-to-speech system: rVoice
from Rhetorical Systems Ltd.
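As a simplified illustration of scoring a join by a filter's prediction error (a scalar toy model, not the authors' trained linear dynamical model over LSF trajectories):

```python
def kalman_join_cost(left_traj, right_traj, a=1.0, q=0.01, r=0.1):
    """Illustrative scalar Kalman filter: run over the trajectory of the
    left unit's final frames, then score how badly the filter's one-step
    prediction matches the right unit's first frame. A smooth join yields
    a small prediction error."""
    x, p = left_traj[0], 1.0  # state estimate and its variance
    for z in left_traj[1:]:
        # predict
        x, p = a * x, a * a * p + q
        # update with observation z
        k = p / (p + r)
        x, p = x + k * (z - x), (1.0 - k) * p
    # one-step prediction across the join
    x_pred = a * x
    return (right_traj[0] - x_pred) ** 2
```

The actual system would run a multivariate model over line spectral frequency vectors; the principle of "cost = model prediction error at the concatenation point" is the same.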
Optimizing Integrated Cost Function for Segment
Selection in Concatenative Speech Synthesis Based
on Perceptual Evaluations
Tomoki Toda, Hisashi Kawai, Minoru Tsuzaki;
ATR-SLT, Japan
This paper describes optimizing a cost function for segment selection in concatenative Text-to-Speech based on perceptual characteristics. We use the norm of a local cost for each segment as an
integrated cost function for a segment sequence to consider both
the degradation of naturalness over the entire synthetic speech and
the local degradation. The cost function is optimized by adjusting not only the power coefficient of the norm but also weights for
sub-costs so that the integrated cost corresponds better to perceptual scores determined by perceptual experiments. As a result, it is
clarified that the correspondence of the cost can be improved to a
greater degree by optimizing both the weights and the power coefficient than by optimizing either the weights or the power coefficient.
However, it is also clarified that the correspondence is insufficient
after optimizing the integrated cost function.
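The integrated cost described above, a norm over weighted local costs with an adjustable power coefficient, can be illustrated schematically (our own sketch of the general form, not the authors' trained parameters):

```python
def integrated_cost(local_costs, weights, p):
    """Integrated cost of a segment sequence as the p-norm of weighted
    local costs: each local cost is a weighted sum of sub-costs; a larger
    power coefficient p emphasizes the worst local degradation, while
    p = 1 averages degradation over the whole utterance."""
    total = sum(
        sum(w * c for w, c in zip(weights, cost)) ** p
        for cost in local_costs
    )
    return (total / len(local_costs)) ** (1.0 / p)
```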
Automatic Segmentation for Czech Concatenative
Speech Synthesis Using Statistical Approach with
Boundary-Specific Correction
Jindřich Matoušek, Daniel Tihelka, Josef Psutka;
University of West Bohemia in Pilsen, Czech Republic
This paper deals with the problems of automatic segmentation for the purposes of Czech concatenative speech synthesis. A statistical approach to speech segmentation using hidden Markov models (HMMs) is applied in the baseline system. Several improvements to this system are then proposed to obtain more accurate segmentation results. These enhancements mainly concern various strategies for HMM initialization (flat-start initialization, and bootstrapping from hand-labeled or speaker-independent HMMs). Since HTK, the hidden Markov model toolkit, was utilized in our work, a correction of the output boundary placements is proposed to reflect the speech parameterization mechanism. An objective comparison of various automatic methods and manual segmentation is performed to determine the best method. The best results were obtained with boundary-specific statistical correction of the segmentation that resulted from bootstrapping with hand-labeled HMMs (96% segmentation accuracy within a 20 ms tolerance region).
Automatic Speech Segmentation and Verification
for Concatenative Synthesis
Chih-Chung Kuo, Chi-Shiang Kuo, Jau-Hung Chen,
Sen-Chia Chang; Industrial Technology Research
Institute, Taiwan
This paper presents an automatic speech segmentation method
based on HMM alignment and a categorized multiple-expert fine
adjustment. The accuracy of syllable boundaries is significantly
improved (72.8% and 51.9% for starting and ending boundaries of
syllables, respectively) after the fine adjustment. Moreover, a novel phonetic verification method for checking inconsistencies between the text script and the recorded speech is also proposed. The design and performance of confidence measures for both segmentation and verification are described, demonstrating that automatic detection of problematic speech segments can be achieved. Together, these methods largely reduce the human labor required to construct our new corpus-based TTS system.
DTW-Based Phonetic Alignment Using Multiple
Acoustic Features
Sérgio Paulo, Luís C. Oliveira; INESC-ID/IST, Portugal
This paper presents the results of our effort in improving the accuracy of a DTW-based automatic phonetic aligner. The adopted
model assumes that the phonetic segment sequence is already
known and so the goal is only to align the spoken utterance with
a reference synthetic signal produced by waveform concatenation
without prosodic modifications. Instead of using a single acoustic measure to compute the alignment cost function, our strategy
uses a combination of acoustic features depending on the pair of
phonetic segment classes being aligned. The results show that this
strategy considerably reduces the segment boundary location errors, even when aligning synthetic and natural speech signals of
different gender speakers.
Evaluating and Correcting Phoneme Segmentation
for Unit Selection Synthesis
John Kominek, Christina L. Bennett, Alan W. Black;
Carnegie Mellon University, USA
As part of improved support for building unit selection voices, the Festival speech synthesis system now includes two algorithms for automatic labeling of wavefile data. The two methods are based on dynamic time warping and HMM-based acoustic modeling. Our experiments show that DTW is more accurate 70% of the time, but is also more prone to gross labeling errors. HMM modeling exhibits a systematic bias of 15 ms. Combining both methods directs human labelers towards the data most likely to be problematic.
Control and Prediction of the Impact of Pitch
Modification on Synthetic Speech Quality
Esther Klabbers, Jan P.H. van Santen; Oregon Health
& Science University, USA
In order to use speech synthesis to generate highly expressive speech convincingly, the problem of poor prosody (both prediction and generation) needs to be overcome. In this paper we show that with a simple annotation scheme using the notion of foot structure, we can more accurately predict the shape of local pitch contours. The assumption is that with a better selection mechanism we can reduce the amount of pitch modification required, thereby reducing speech degradation. In addition, we present a perceptual experiment that investigates the degradation introduced by pitch modification using the OGIresLPC algorithm. We correlated the weighted perceptual score with different pitch and delta-pitch distances. The best combination of distance measures is able to explain 63% of the variance in the perceptual scores. Decreasing the pitch is shown to have a higher impact on perception than increasing the pitch.
My Voice, Your Prosody: Sharing a Speaker
Specific Prosody Model Across Speakers in Unit
Selection TTS
Matthew Aylett, Justin Fackrell, Peter Rutten;
Rhetorical Systems Ltd., U.K.
Data sparsity is a major problem for data-driven prosodic models. Being able to share prosodic data across speakers is a potential solution to this problem. This paper explores that potential solution by addressing two questions: 1) Does a larger, less sparse model from a different speaker produce more natural speech than a small sparse model built from the original speaker? 2) Does a different speaker's larger model generate more unit selection errors than a small sparse model built from the original speaker?
A unit selection approach is used to produce a lazy learning model of three English RP speakers' f0 and durational parameters. Speaker 1 (the target speaker) had a much smaller database (approximately one quarter to one fifth the size) than the other two. Speaker 2 was a female speaker with frequent mid-phrase rises. Speaker 3 was a male speaker with a similar f0 range to speaker 1 and with a measured prosodic style suitable for news and financial text.
We apply the models created for speaker 2 (an inappropriate model) and speaker 3 (an appropriate model) to speaker 1 and compare the results. Three passages (of three to four sentences in length) from challenging prosodic genres (news report, poetry and personal email) were synthesised using the target speaker and each of the three models. The synthesised utterances were played to 15 native English subjects and rated on a 5-point MOS scale. In addition, 7 experienced speech engineers rated each word for errors on a three-point scale: 1. Acceptable, 2. Poor, 3. Unacceptable.
The results suggest that a large model from an appropriate speaker does not sound more natural or produce fewer errors than a smaller model generated from the individual speaker's own data. In addition, they show that an inappropriate model produces both less natural speech and more errors. High variance in both the subject and materials analyses suggests that both tests are far from ideal and that evaluation techniques for both error rate and naturalness need to improve.
Learning Phrase Break Detection in Thai
Text-to-Speech
Virongrong Tesprasit 1, Paisarn Charoenpornsawat 1,
Virach Sornlertlamvanich 2; 1 NECTEC, Thailand;
2 CRL Asia Research Center, Thailand
One of the crucial problems in developing high-quality Thai text-to-speech synthesis is detecting phrase breaks in Thai text. Unlike English, Thai has no word boundary delimiter and no punctuation mark at the end of a sentence, which makes the problem more serious: an incorrectly detected phrase break not only produces unnatural speech but can also create the wrong meaning. In this paper, we apply machine learning algorithms, namely C4.5 and RIPPER, to detecting phrase breaks. These algorithms can learn useful features for locating a phrase break position. The features investigated in our experiments are collocations in different window sizes and the number of syllables before and after the word in question relative to a phrase break position. We compare the results from C4.5 and RIPPER with a baseline method (a part-of-speech sequence model). The experiments show that C4.5 and RIPPER outperform the baseline method, and that RIPPER achieves better accuracy than C4.5.
A Speech Model of Acoustic Inventories Based on
Asynchronous Interpolation
Alexander B. Kain, Jan P.H. van Santen; Oregon
Health & Science University, USA
We propose a speech model that describes acoustic inventories of
concatenative synthesizers. The model has the following characteristics: (i) very compact representations and thus high compression ratios are possible, (ii) re-synthesized speech is free of concatenation errors, (iii) the degree of articulation can be controlled
explicitly, and (iv) voice transformation is feasible with relatively
few additional recordings of a target speaker. The model represents a speech unit as a synthesis of several types of features, each
of which has been computed using non-linear, asynchronous interpolation of neighboring basis vectors associated with known phonemic identities. During analysis, basis vectors and transition weights
are estimated under a strict diphone assumption using a dynamic
time warping approach. During synthesis, the estimated transition
weight values are modified to produce changes in duration and articulation effort.
Corpus-Based Synthesis of Fundamental Frequency
Contours of Japanese Using
Automatically-Generated Prosodic Corpus and
Generation Process Model
Keikichi Hirose, Takayuki Ono, Nobuaki Minematsu;
University of Tokyo, Japan
We have been developing corpus-based synthesis of fundamental
frequency (F0 ) contours for Japanese. Since, in our method, the
synthesis is done under the constraint of the F0 contour generation process model, rather good quality is maintained even when the prediction process performs poorly. Although it was already shown that
the synthesized F0 contours sounded as highly natural as those using heuristic rules carefully arranged by experts, the F0 model parameters for the training corpus were extracted with some manual
processes. In the current paper, automatically extracted parameters are used, and a good result is obtained. Several features are also added as inputs to the statistical method to obtain better results. Some results on accent phrase boundary prediction in a similar corpus-based framework are also shown.
Session: SMoDa – Oral
Aurora Noise Robustness on LARGE
Vocabulary Databases
Time: Monday 16.00, Venue: Room 1
Chair: David Pierce, Motorola Lab., UK
Analysis of the Aurora Large Vocabulary
Evaluations
N. Parihar, Joseph Picone; Mississippi State University,
USA
In this paper, we analyze the results of the recent Aurora large
vocabulary evaluations. Two consortia submitted proposals on
speech recognition front ends for this evaluation: (1) Qualcomm,
ICSI, and OGI (QIO), and (2) Motorola, France Telecom, and Alcatel (MFA). These front ends used a variety of noise reduction techniques including discriminative transforms, feature normalization,
voice activity detection, and blind equalization. Participants used
a common speech recognition engine to postprocess their features.
In this paper, we show that the results of this evaluation were not
significantly impacted by suboptimal recognition system parameter
settings. Without any front end specific tuning, the MFA front end
outperforms the QIO front end by 9.6% relative. With tuning, the
relative performance gap increases to 15.8%. Both the mismatched
microphone and additive noise evaluation conditions resulted in a
significant degradation in performance for both front ends.
Evaluation of Quantile Based Histogram
Equalization with Filter Combination on the Aurora
3 and 4 Databases
Florian Hilger, Hermann Ney; RWTH Aachen,
Germany
The recognition performance of automatic speech recognition systems can be improved by reducing the mismatch between training
and test data during feature extraction. The approach described in
this paper is based on estimating the signal's cumulative distribution functions on the filter bank using a small number of quantiles. A
two-step transformation is then applied to reduce the difference between these quantiles and the ones estimated on the training data.
The first step is a power function transformation applied to each
individual filter channel, followed by a linear combination of neighboring filters. On the Aurora 4 16 kHz database the average word
error rates could be reduced from 60.8% to 37.6% (clean training)
and from 38.0% to 31.5% (multi condition training).
Evaluation of Model-Based Feature Enhancement
on the AURORA-4 Task
Veronique Stouten, Hugo Van hamme, Jacques
Duchateau, Patrick Wambacq; Katholieke Universiteit
Leuven, Belgium
In this paper we focus on the challenging task of noise robustness for large vocabulary continuous speech recognition (LVCSR) systems in non-stationary noise environments. We have extended our Model-Based Feature Enhancement (MBFE) algorithm – which we earlier successfully applied to small vocabulary CSR in the AURORA-2 framework – to cope with the new demands imposed by the large vocabulary size of the AURORA-4 task. To incorporate a priori knowledge of the background noise, we combine scalable hidden Markov models (HMMs) of the cepstral feature vectors of both clean speech and noise, using a vector Taylor series approximation in the power spectral domain. A global MMSE estimate of the clean speech is then calculated based on this combined HMM. This technique is easily embeddable in the feature extraction module of a recogniser and is intrinsically suited for the removal of non-stationary additive noise. Our approach is validated on the AURORA-4 task, revealing a significant gain in noise robustness over the baseline.
Improved Feature Extraction Based on Spectral
Noise Reduction and Nonlinear Feature
Normalization
José C. Segura, Javier Ramírez, Carmen Benítez,
Ángel de la Torre, Antonio J. Rubio; Universidad de
Granada, Spain
This paper focuses on experimental results for a feature extraction algorithm that combines spectral noise reduction and nonlinear feature normalization. The success of this approach was shown in previous work; here, we present several improvements that result in performance comparable to that of the recently approved AFE for DSR. Noise reduction is now based on a Wiener filter instead of spectral subtraction. The voice activity detection based on full-band energy has been replaced with a new one using spectral information. Relative improvements of 24.81% and 17.50% over our previous system are obtained for AURORA 2 and 3, respectively. Results for AURORA 2 are not as good as those for the AFE, but for AURORA 3 a relative improvement of 5.27% is obtained.
Feature Compensation Technique for Robust
Speech Recognition in Noisy Environments
Young Joon Kim 1, Hyun Woo Kim 2, Woohyung Lim 1,
Nam Soo Kim 1; 1 Seoul National University, Korea;
2 Electronics and Telecommunications Research
Institute, Korea
In this paper, we analyze the problems of the existing interacting multiple model (IMM) and spectral subtraction (SS) approaches and propose a new approach that overcomes the problems of these algorithms. Our approach combines the IMM and SS techniques based on a soft decision for speech presence. Results reported on the AURORA-2 database show that the proposed approach achieves an average relative improvement of 14.26% over the IMM algorithm in speech recognition experiments.
Large Vocabulary Noise Robustness on Aurora4
Luca Rigazio, Patrick Nguyen, David Kryze,
Jean-Claude Junqua; Panasonic Speech Technology
Laboratory, USA
This paper presents experiments on noise-robust ASR on the Aurora4 database. The database is designed to test large vocabulary systems in the presence of noise and channel distortions. A number of different model-based and signal-based noise robustness techniques have been tested. The results show that it is difficult to design a technique that is superior in every condition. Because of this, we combined different techniques to improve results. The best results were obtained when short-time compensation/normalization methods are combined with long-term environmental adaptation and robust acoustic models. The best average error rate obtained over the 52 conditions is 30.8%. This represents a 40% relative improvement over the baseline results [1].
Session: SMoDb – Oral
Multilingual Speech-to-Speech Translation
Time: Monday 16.00, Venue: Room 2
Chair: Gianni Lazzari, Istituto Trentino di Cultura, Trento, Italy
The Statistical Approach to Machine Translation
and a Roadmap for Speech Translation
Hermann Ney; RWTH Aachen, Germany
During the last few years, the statistical approach has found
widespread use in machine translation, in particular for spoken language. In many comparative evaluations of automatic speech translation, the statistical approach was found to be significantly superior to the existing conventional approaches. The paper will present
the main components of a statistical machine translation system
(such as alignment and lexicon models, training procedure, generation of the target sentence) and summarize the progress made so
far. We will conclude with a roadmap for future research on spoken
language translation.
Coupling vs. Unifying: Modeling Techniques for
Speech-to-Speech Translation
Yuqing Gao; IBM T.J. Watson Research Center, USA
As a part of our effort to develop a unified computational framework for speech-to-speech translation, so that sub-optimizations or
local optimizations can be avoided, we are developing direct models for speech recognition. In a direct model, the focus is on the creation of one single integrated model p(text|acoustics), rather than a complex series of artifices; therefore, various factors such as linguistic and language features, speaker or speaking-rate differences, and different acoustic conditions can be applied to the joint optimization. In
this paper we discuss how linguistic and semantic constraints are
used in phoneme recognition.
Speechalator: Two-Way Speech-to-Speech
Translation on a Consumer PDA
Alex Waibel 1 , Ahmed Badran 1 , Alan W. Black 1 ,
Robert Frederking 1 , Donna Gates 1 , Alon Lavie 1 , Lori
Levin 1 , Kevin A. Lenzo 2 , Laura Mayfield Tomokiyo 2 ,
Jürgen Reichert 3 , Tanja Schultz 1 , Dorcas Wallace 1 ,
Monika Woszczyna 4 , Jing Zhang 3 ; 1 Carnegie Mellon
University, USA; 2 Cepstral LLC, USA; 3 Mobile
Technologies Inc., USA; 4 Multimodal Technologies
Inc., USA
This paper describes a working two-way speech-to-speech translation system that runs in near real-time on a consumer handheld computer. It can translate from English to Arabic and Arabic to English in the domain of medical interviews.
We describe the general architecture and the frameworks within which we developed each of the components: HMM-based recognition, interlingua translation (both rule-based and statistically based), and unit selection synthesis.
Development of Phrase Translation Systems for
Handheld Computers: From Concept to Field
Horacio Franco 1, Jing Zheng 1, Kristin Precoda 1,
Federico Cesari 1, Victor Abrash 1, Dimitra Vergyri 1,
Anand Venkataraman 1, Harry Bratt 1, Colleen
Richey 1, Ace Sarich 2; 1 SRI International, USA;
2 Marine Acoustics, USA
We describe the development and conceptual evolution of handheld spoken phrase translation systems, beginning with an initial unidirectional system for translation of English phrases and later extending to a limited bidirectional phrase translation system between English and Pashto, a major language of Afghanistan. We review the challenges posed by such projects, from the constraints imposed by the computational platform to the limitations of the phrase translation approach when dealing with naïve respondents. We discuss our proposed solutions, in terms of architecture, algorithms, and software features, as well as some field experience by users of initial prototypes.
Evaluation Frameworks for Speech Translation
Technologies
Marcello Federico; ITC-irst, Italy
This paper reports on activities carried out under the European project PF-STAR and within the C-STAR consortium, which aim at evaluating speech translation technologies. In PF-STAR, speech translation baselines developed by the partners and off-the-shelf commercial systems will be compared systematically on several language pairs and application scenarios. In C-STAR, evaluation campaigns will be organized on a regular basis to compare research baselines developed by the members of the consortium. The first evaluation campaign, which will take place in 2003, will focus on written language translation by exploiting a large phrase-book parallel corpus covering several European and Asian languages.
Creating Corpora for Speech-to-Speech Translation
Genichiro Kikui, Eiichiro Sumita, Toshiyuki Takezawa,
Seiichi Yamamoto; ATR-SLT, Japan
This paper presents three approaches to creating corpora that we are working on for speech-to-speech translation in the travel conversation task. The first approach is to collect sentences that bilingual travel experts consider useful for people going to or coming from another country. The resulting English-Japanese aligned corpora are collectively called the basic travel expression corpus (BTEC), which is now being translated into several other languages. The second approach tries to expand this corpus by generating many “synonymous” expressions for each sentence. Although we can create large corpora by the above two approaches relatively cheaply, they may differ from utterances in actual conversation. Thus, as the third approach, we are collecting dialogue corpora by letting two people talk, each in his or her native language, through a speech-to-speech translation system. To concentrate on the translation modules, we have replaced the speech recognition modules with human typists. We also report some of the characteristics of these corpora.
Session: OMoDc – Oral
Prosody
Time: Monday 16.00, Venue: Room 3
Chair: Eva Hajicova, Charles University in Prague, Czech Republic
Prosodic Analysis and Modeling of the NAGAUTA
Singing to Synthesize its Prosodic Patterns from
the Standard Notation
Nobuaki Minematsu, Bungo Matsuoka, Keikichi
Hirose; University of Tokyo, Japan
NAGAUTA is a classical style of Japanese singing. Its singing has very original and unique prosodic patterns, in which an abrupt and sharp change of F0 is always observed at the transition from one note to another. This F0 change is often found even where the transition is not accompanied by a change of tone. In this paper, we propose a model to synthesize this unique F0 pattern from the standard notation. Further, this paper shows an interesting phenomenon concerning power movements at the F0 changes. Acoustic analysis of NAGAUTA singing samples reveals that sharp increases of F0 and sharp decreases of power occur synchronously. Although the physical mechanisms of this phenomenon are not discussed in this paper, another model to generate this unique power pattern is also proposed. Evaluation experiments were conducted through listening, and their results indicate the high validity of the two proposed
models.
Statistical Evaluation of the Influence of Stress on
Pitch Frequency and Phoneme Durations in Farsi
Language
D. Gharavian, S.M. Ahadi; Amirkabir University of
Technology, Iran
Stress is known to be an important prosodic feature of speech. The
recognition of stressed speech has always been an important issue
for speech researchers. On the other hand, providing a large corpus with the coverage of all different stressed conditions in a certain
language is a difficult task. Farsi (Persian) has been no exception to
this. In this research, our aim has been to evaluate the effect of stress on prosodic features of the Farsi language, such as phoneme duration, pitch frequency and pitch contour slope. These might be valuable in further research in speech recognition. As the main influence of stress is on vowels, the effect of stress on parameters such as duration, pitch frequency and its slope has been evaluated at the phoneme level for vowels.
Prosody Dependent Speech Recognition with
Explicit Duration Modelling at Intonational Phrase
Boundaries
K. Chen, S. Borys, Mark Hasegawa-Johnson, J. Cole;
University of Illinois at Urbana-Champaign, USA
Does prosody help word recognition? In this paper, we propose a
novel probabilistic framework in which word and phoneme are dependent on prosody in a way that improves word recognition. The
prosody attribute that we investigate in this study is the lengthening of speech segments in the vicinity of intonational phrase
boundaries. An Explicit Duration Hidden Markov Model (EDHMM) is implemented to provide an accurate phoneme duration model. This study is conducted on the Boston University Radio News Corpus with prosodic boundaries marked using the ToBI labelling system. We found
that lengthening of the phrase final rhymes can be reliably modelled
by EDHMM, which significantly improves the prosody dependent
acoustic modelling. Conversely, no systematic duration variation is
found at phrase initial position. With prosody dependence implemented in the acoustic model, pronunciation model and language
model, both word recognition accuracy and boundary recognition
accuracy are improved by 1% over systems without prosody dependence.
Prediction of Fujisaki Model’s Phrase Commands
João Paulo Teixeira 1 , Diamantino Freitas 2 , Hiroya
Fujisaki 3 ; 1 Polytechnic Institute of Bragança,
Portugal; 2 University of Porto, Portugal; 3 University of
Tokyo, Japan
This paper presents a model to predict the phrase commands of the Fujisaki model of the F0 contour for the Portuguese language. The location of phrase commands in the text is governed by a set of weighted rules.
The amplitude (Ap) and timing (T0) of the phrase commands are
predicted in separate neural networks. The features for both neural networks are discussed. Finally a comparison between target
and predicted values is presented.
Corpus-Based Modeling of Naturalness Estimation
in Timing Control for Non-Native Speech
Makiko Muto, Yoshinori Sagisaka, Takuro Naito,
Daiju Maeki, Aki Kondo, Katsuhiko Shirai; Waseda
University, Japan
In this paper, aiming at automatic estimation of the naturalness of timing control in non-native speech, we have analyzed the timing characteristics of non-native speech and correlated them with the corresponding subjective naturalness evaluation scores given by native speakers. Through statistical analyses using English speech data spoken by Japanese speakers, with temporal naturalness scores ranging from one to five given by natives, we found high correlation between these scores and the differences from native speech. These analyses provided a linear regression model in which naturalness of timing control is estimated from the differences from native speech in the durations of overall sentences, individual content and function words, and pauses. The estimation accuracy of the proposed naturalness evaluation model was tested using open data. The root mean square error of 0.64 between scores predicted by the model and those given by the natives turned out to be comparable to the difference of 0.85 among the native listeners' own scores. The good correlation between model predictions and natives' judgments confirmed the appropriateness of the proposed model.
Perceptually-Related Acoustic-Prosodic Features of
Phrase Finals in Spontaneous Speech
Carlos Toshinori Ishi, Parham Mokhtari, Nick
Campbell; ATR-HIS, Japan
With the aim of automatically categorizing phrase final tones, investigations are conducted on the relationship between acoustic-prosodic parameters and perceptual tone categories. Three types of acoustic parameters are proposed: one related to pitch movement within the phrase final, one related to pitch reset prior to the phrase final, and one related to the length of the phrase final. A classification tree is used to evaluate automatic categorization of phrase final tone types, resulting in 76% correct classification for the best combination among the proposed acoustic parameters. Experiments are also conducted to verify the perceived degree of pitch change within a phrase final, and the perceived degree of pitch reset. While a good relationship is found between the perceptual scores and some of the acoustic parameters, our results also advocate a continuous rather than a categorical relationship between some of the phrase final tone-types considered.
Session: OMoDd – Oral
Language Modeling
Time: Monday 16.00, Venue: Room 4
Chair: Holger Schwenk, LIMSI-CNRS, France
Efficient Linear Combination for Distant n-Gram
Models
David Langlois, Kamel Smaïli, Jean-Paul Haton;
LORIA, France
The objective of this paper is to present a large study concerning the use of distant language models. In order to combine distant and classical models efficiently, an adaptation of the back-off principle is made. We also show the importance of each part of a history for the prediction. In fact, each sub-history is analyzed in order to estimate its importance in terms of prediction, and a weight is then associated to each class of sub-histories. The combined models therefore take into account the features of each part of the history rather than the whole history, as is done in other works. The contribution of distant n-gram models in terms of perplexity is significant and improves the results by 12.8%. Making the linear combination depend on sub-histories achieves an improvement of 5.3% in comparison to the classical linear combination.
Improving a Connectionist Based Syntactical
Language Model
Ahmad Emami; Johns Hopkins University, USA
Using a connectionist model as one of the components of the Structured Language Model has led to significant improvements in perplexity and word error rate, mainly because of the connectionist model's power in using longer contexts and its ability to fight the data sparseness problem. For its training, the SLM needs the syntactical parses of the word strings in the training data, provided by either humans or an external parser.
In this paper we study the effect of training the connectionist-based language model on the hidden parses hypothesized by the SLM itself. Since multiple partial parses are constructed for each word position, the model and the log-likelihood function will be in a form that necessitates a specific manner of training of the connectionist model. Experiments on the UPENN section of the Wall Street Journal corpus show significant improvements in perplexity.
Using Untranscribed User Utterances for Improving
Language Models Based on Confidence Scoring
Mikio Nakano 1 , Timothy J. Hazen 2 ; 1 NTT
Corporation, Japan; 2 Massachusetts Institute of
Technology, USA
This paper presents a method for reducing the effort of transcribing user utterances to develop language models for conversational
speech recognition when a small number of transcribed and a large
number of untranscribed utterances are available. The recognition
hypotheses for untranscribed utterances are classified according
to their confidence scores such that hypotheses with high confidence are used to enhance language model training. The utterances that receive low confidence can be scheduled to be manually
transcribed first to improve the language model. The results of experiments using automatic transcription of the untranscribed user
utterances show that the proposed methods are effective in achieving improvements in recognition accuracy while reducing the effort required for manual transcription.
15
Eurospeech 2003
Monday
Improved Chinese Broadcast News Transcription
by Language Modeling with Temporally Consistent
Training Corpora and Iterative Phrase Extraction
September 1-4, 2003 – Geneva, Switzerland
yields the LPLE predictor. It is proved that the all-pole filters computed by LPLE are always stable. The results show that the method
is well-suited when low-order all-pole models with improved modeling of the lowest formants are needed.
Pi-Chuan Chang, Shuo-Peng Liao, Lin-shan Lee;
National Taiwan University, Taiwan
Beyond a Single Critical-Band in TRAP Based ASR
In this paper an iterative Chinese new phrase extraction method
based on the intra-phrase association and context variation statistics is proposed. A Chinese language model enhancement framework including lexicon expansion is then developed. Extensive experiments for Chinese broadcast news transcription were then performed to explore the achievable improvements with respect to the
degree of temporal consistency for the adaptation corpora. Very
encouraging results were obtained and detailed analysis discussed.
Language Model Adaptation Using Word Clustering
Shinsuke Mori, Masafumi Nishimura, Nobuyasu Itoh;
IBM Japan Ltd., Japan
Building a stochastic language model (LM) for speech recognition
requires a large corpus of target tasks. For some tasks no enough
large corpus is available and this is an obstacle to achieving high
recognition accuracy. In this paper, we propose a method for building an LM with a higher prediction power using large corpora from
different tasks rather than an LM estimated from a small corpus for
a specific target task. In our experiment, we used transcriptions of
air university lectures and articles from Nikkei newspaper and compared an existing interpolation-based method and our new method.
The results show that our new method reduces perplexity by 9.71%.
Hierarchical Topic Classification for Dialog Speech Recognition Based on Language Model Switching
Ian R. Lane 1, Tatsuya Kawahara 1, Tomoko Matsui 2, Satoshi Nakamura 3; 1 Kyoto University, Japan; 2 Institute of Statistical Mathematics, Japan; 3 ATR-SLT, Japan
A speech recognition architecture combining topic detection and topic-dependent language modeling is proposed. In this architecture, a hierarchical back-off mechanism is introduced to improve system robustness. Detailed topic models are applied when topic detection is confident, and wider models that cover multiple topics are applied in cases of uncertainty. In this paper, two topic detection methods are evaluated for the architecture: unigram likelihood and SVM (Support Vector Machine). On the ATR Basic Travel Expression corpus, the two topic detection methods provide comparable reductions in WER of 10.0% and 11.1%, respectively, over a single language model system. Finally, the proposed re-decoding approach is compared with an equivalent system based on re-scoring. It is shown that re-decoding is vital for optimal recognition performance.

Pratibha Jain, Hynek Hermansky; Oregon Health & Science University, USA
TRAP-based ASR attempts to extract information from rather long (as long as 1 s) and narrow (one critical-band) patches (temporal patterns) of the time-frequency plane. We investigate the effect of combining temporal patterns of logarithmic critical-band energies from several adjacent bands. The frequency context is gradually increased from one critical-band to several critical-bands by using temporal patterns jointly from adjacent bands as input to the class-posterior estimators. We show that up to three critical-bands of frequency context is required for achieving higher recognition performance. This work also indicates that the interaction of local bands is important for improved speech recognition performance.

Variational Bayesian GMM for Speech Recognition
Fabio Valente, Christian Wellekens; Institut Eurecom, France
In this paper, we explore the potential of Variational Bayesian (VB) learning for speech recognition problems. VB methods deal in a more rigorous way with model selection and are a generalization of MAP learning. VB training for Gaussian Mixture Models is less affected than EM-ML training by overfitting and singular solutions. We compare two types of Variational Bayesian Gaussian Mixture Models (VBGMM) with classical EM-ML GMM in a phoneme recognition task on the TIMIT database. VB learning performs better than EM-ML learning and is less affected by the initial model guess.

Time Alignment for Scenario and Sounds with Voice, Music and BGM
Yamato Wada, Masahide Sugiyama; University of Aizu, Japan
This paper proposes a new time alignment method between a scenario and sounds with voice, music and BGM (background music) in order to generate video captions automatically. The proposed time alignment method, the Voice-Music-Pause+BGM method, is based on the composition of voice and music models. The results of experiments to evaluate the proposed method show that it works about 10-60 times better than the conventional time alignment methods.
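Scenario-to-sound alignment of this kind is at heart a dynamic-programming problem. A generic dynamic-time-warping sketch illustrates the idea (illustrative only; this is not the authors' Voice-Music-Pause+BGM method):

```python
def dtw(a, b, dist=lambda x, y: abs(x - y)):
    """Classic dynamic-time-warping alignment cost between two sequences,
    filling the cumulative-cost matrix row by row."""
    INF = float("inf")
    n, m = len(a), len(b)
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = dist(a[i - 1], b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # insertion
                                 cost[i][j - 1],      # deletion
                                 cost[i - 1][j - 1])  # match
    return cost[n][m]
```

Identical sequences align with zero cost; any mismatch adds its local distance to the total.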
Session: PMoDe– Poster
Speech Modeling & Features I
Time: Monday 16.00, Venue: Main Hall, Level -1
Chair: Hynek Hermansky, Oregon Graduate Institute of Science
and Technology, USA
Linear Predictive Method with Low-Frequency
Emphasis
Paavo Alku, Tom Bäckström; Helsinki University of
Technology, Finland
An all-pole modeling technique, Linear Prediction with Low-frequency Emphasis (LPLE), which emphasizes the lower frequency range of speech, is presented. The method is based on first interpreting conventional linear predictive (LP) analyses of successive prediction orders as parallel structures using the concept of symmetric linear prediction. In these implementations, symmetric linear prediction is preceded by simple pre-filters with either lowpass or highpass characteristics. Combining those symmetric linear predictors that are not preceded by high-frequency pre-filters
Efficient Quantization of Speech Excitation
Parameters Using Temporal Decomposition
Phu Chien Nguyen, Masato Akagi; JAIST, Japan
In this paper, we investigate the application of the temporal decomposition (TD) technique to describe the temporal patterns of speech
excitation parameter contours, i.e. gain, pitch, and voicing. We use
a common set of event functions to describe the temporal structure of both spectral and excitation parameters, and then quantize
them. Experimental results show that each speech excitation parameter contour can be well described by a set of excitation targets using the event functions obtained from TD analysis of line
spectral frequency (LSF) parameters, with considerably low reconstruction error. Moreover, we can efficiently quantize the excitation
targets by a combination of two uniform quantizers, one working
directly on logarithmic excitation targets and the other working on
the difference between current and previous logarithmic excitation
targets.
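The two-quantizer scheme described above can be sketched as follows (the step sizes are illustrative assumptions, not the paper's values):

```python
import math

def uniform_quantize(x, step):
    """Quantize x to the nearest multiple of step."""
    return step * round(x / step)

def quantize_targets(targets, step_abs=0.5, step_delta=0.25):
    """Two uniform quantizers in the spirit of the scheme above:
    the first excitation target is quantized directly in the log domain,
    and later targets are coded as quantized differences from the
    previous reconstructed log value."""
    coded, prev = [], None
    for t in targets:
        log_t = math.log(t)
        if prev is None:
            q = uniform_quantize(log_t, step_abs)
        else:
            q = prev + uniform_quantize(log_t - prev, step_delta)
        coded.append(math.exp(q))
        prev = q
    return coded
```

Coding differences lets the second quantizer use a finer step for the same bit budget, since successive targets are strongly correlated.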
Distributed Genetic Algorithm to Discover a
Wavelet Packet Best Basis for Speech Recognition
Robert van Kommer 1 , Béat Hirsbrunner 2 ; 1 Swisscom
Innovations, Switzerland; 2 University of Fribourg,
Switzerland
In the learning process of speech modeling, many choices or settings are defined “a priori” or result from years of experimental work. In this paper, instead, a global learning scheme is proposed based on a Distributed Genetic Algorithm combined with a standard speech-modeling algorithm. The speech recognition models are now created out of a predefined space of solutions. Furthermore, this global scheme makes it possible to learn the speech models
as well as the best feature extraction module. Experimental validation is performed on the task of discovering the Wavelet Packet
best basis decomposition, knowing that the “a priori” reference is
the mel-scaled subband decomposition. Two experiments are presented, a reference system using a simulated fitness and a second
one that uses the speech recognition performance as fitness value.
In the latter, each element of the space is a connectionist system
defined by a Wavelet topology and its associated Neural Network.
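A minimal genetic-algorithm loop of the kind described above can be sketched as follows (illustrative only: the bit strings stand in for a wavelet-packet basis selection, and the fitness callback stands in for recognition performance):

```python
import random

def evolve(fitness, n_bits=8, pop_size=12, generations=30, seed=0):
    """Elitist genetic algorithm over bit strings: keep the fitter half,
    refill the population with one-point crossover and rare mutation."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        survivors = pop[: pop_size // 2]
        children = []
        while len(survivors) + len(children) < pop_size:
            a, b = rng.sample(survivors, 2)
            cut = rng.randrange(1, n_bits)
            child = a[:cut] + b[cut:]
            if rng.random() < 0.1:        # occasional mutation
                i = rng.randrange(n_bits)
                child[i] ^= 1
            children.append(child)
        pop = survivors + children
    return max(pop, key=fitness)

# With a toy fitness (number of ones), high-fitness bit strings survive.
best = evolve(sum)
```

In the paper's setting, evaluating `fitness` means training and testing a recognizer per individual, which is why the algorithm is distributed.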
New Model-Based HMM Distances with
Applications to Run-Time ASR Error Estimation
and Model Tuning
Chao-Shih Huang 1 , Chin-Hui Lee 2 , Hsiao-Chuan
Wang 3 ; 1 Acer Inc., Taiwan; 2 Georgia Institute of
Technology, USA; 3 National Tsing Hua University,
Taiwan
We propose a novel model-based HMM distance computation framework to estimate run-time recognition errors and adapt recognition parameters without the need of using any testing or adaptation data. The key idea is to use HMM distances between competing models to measure the confusability between phones in speech
recognition. Starting with a set of simulated models in a given noise
condition, the corresponding error rate could be estimated with a
smooth approximation of the error count computed from the set
of phone distances without using any testing data. By minimizing
the estimated error between the desired and simulated models, the
target model parameters could also be adjusted without using any
adaptation data. Experimental results show that the word errors,
estimated with the proposed framework, closely resemble the errors obtained by running actual recognition experiments on a large
testing set in a number of adverse conditions. The adapted models
also gave better recognition performances than those obtained with
environment-matched models, especially in low signal-to-noise conditions.
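The idea of a smooth error count driven by inter-model distances can be caricatured as follows (the sigmoid mapping is an assumed form, not the authors' formula):

```python
import math

def expected_error(phone_distances, alpha=1.0):
    """Smooth (differentiable) error count over competing phone pairs:
    a pair whose models are close (small distance d) contributes nearly
    one error, while a distant, easily separable pair contributes
    almost nothing."""
    return sum(1.0 / (1.0 + math.exp(alpha * d)) for d in phone_distances)
```

Because the count is differentiable in the distances, model parameters can be adjusted to minimize it without running recognition on test data, which is the point of the framework.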
September 1-4, 2003 – Geneva, Switzerland

Feature Selection for the Classification of Crosstalk in Multi-Channel Audio
Stuart N. Wrigley, Guy J. Brown, Vincent Wan, Steve Renals; University of Sheffield, U.K.
An extension to the conventional speech / nonspeech classification framework is presented for a scenario in which a number of microphones record the activity of speakers present at a meeting (one microphone per speaker). Since each microphone can receive speech from both the participant wearing the microphone (local speech) and other participants (crosstalk), the recorded audio can be broadly classified in four ways: local speech, crosstalk plus local speech, crosstalk alone and silence. We describe a classifier in which a Gaussian mixture model (GMM) is used to model each class. A large set of potential acoustic features is considered, some of which have been employed in previous speech / nonspeech classifiers. A combination of two feature selection algorithms is used to identify the optimal feature set for each class. Results from the GMM classifier using the selected features are superior to those of a previously published approach.

A DTW-Based DAG Technique for Speech and Speaker Feature Analysis
Jingwei Liu; Tsinghua University, China
A DTW-based directed acyclic graph (DAG) optimization method is proposed to exploit the interaction information of speech and speaker in the feature components. We introduce the DAG representation of intra-class samples based on the dynamic time warping (DTW) measure and propose two criteria based on the in-degree of the DAG. Combined with the (l - r) optimization algorithm, the DTW-based DAG model is applied to discuss the feature subset information representing speech and speaker in text-dependent speaker identification and speaker-dependent speech recognition. The experimental results demonstrate the ability of our model to reveal the low-dimensional performance and the influence of speech and speaker information in different tasks, and the corresponding DTW recognition rates are also calculated for comparison.

Analysis of Voice Source Characteristics Using a Constrained Polynomial Model
Tokihiko Kaburagi, Koji Kawai; Kyushu Institute of Design, Japan
This paper presents a method for analyzing voice source characteristics from speech by simultaneously employing models of the vocal tract and the voice source signal. The vocal tract is represented as a linear filter based on the conventional all-pole assumption. The voice source signal, on the other hand, is represented by linearly superimposing multiple base signals obtained from a generalization of the Rosenberg model. The resulting voice source model is a polynomial function of time and has fewer degrees of freedom than the polynomial order. By virtue of the linearity of both models, the optimal values of their parameters can be jointly determined when the instants of glottal opening and closing are given for each pitch period. We also present a temporal search method for these glottal events using the dynamic programming technique. Finally, experimental results are presented to show the applicability of the proposed method under several phonation conditions.

Tone Pattern Discrimination Combining Parametric Modeling and Maximum Likelihood Estimation
Jinfu Ni, Hisashi Kawai; ATR-SLT, Japan
This paper presents a novel method for tone pattern discrimination derived by combining a functional fundamental frequency (F0) model for feature extraction with vector quantization and maximum likelihood estimation techniques. Tone patterns are represented in a parametric form based on the F0 model and clustered using the LBG algorithm. The mapping between lexical tones and acoustic patterns is statistically modeled and decoded by maximum likelihood estimation. Evaluation experiments are conducted on 469 Mandarin utterances (1.4 hours of read speech from a female native speaker) with varied analysis conditions of codebook sizes and tone contexts. Experimental results indicate the effectiveness of the method in both tone discrimination and detection of inconsistency between a lexical tone and its F0 pattern. The method is suitable for the prosodic labeling of large-scale speech corpora.

Feature Transformations and Combinations for Improving ASR Performance
Panu Somervuo, Barry Chen, Qifeng Zhu; International Computer Science Institute, USA
In this work, linear and nonlinear feature transformations were evaluated in the ASR front end. Unsupervised transformations were based on principal component analysis and independent component analysis. Discriminative transformations were based on linear discriminant analysis and multilayer perceptron networks. The acoustic models were trained using a subset of the HUB5 training data and tested on the OGI Numbers corpus. The baseline feature vector consisted of PLP cepstrum and energy with first- and second-order deltas. None of the feature transformations could outperform the baseline when used alone, but the word error rate improved when the baseline feature was combined with the feature transformation stream. Two combination methods were evaluated: feature vector concatenation and n-best list combination using ROVER. The best results were obtained with the combination of the baseline PLP cepstrum and the feature transform based on a multilayer perceptron network. The word error rate in the number recognition task was reduced from 4.1% to 3.1%.
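Of the two combination methods, n-best list combination with ROVER relies on word-level voting. A toy sketch of the voting stage (real ROVER first aligns the hypotheses into a word transition network; here they are assumed pre-aligned, with '@' as the null word):

```python
from collections import Counter

def rover_vote(hypotheses):
    """Majority vote per aligned slot across recognizer hypotheses;
    slots won by the null word '@' emit nothing."""
    out = []
    for slot in zip(*hypotheses):
        word, _ = Counter(slot).most_common(1)[0]
        if word != "@":
            out.append(word)
    return out
```

Combining systems whose errors differ, such as a PLP baseline and an MLP-transformed stream, is what makes this kind of voting pay off.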
On the Role of Intonation in the Organization of
Mandarin Chinese Speech Prosody
Chiu-yu Tseng; Academia Sinica, Taiwan
This paper reports 3 perception experiments on intonation groups
and the role of phrasal intonation in the organization of speech
prosody. The goal is to help unlimited TTS achieve better naturalness. The experiments were also designed to complement previous extensive analyses of speech data. Using the PRAAT software, segmental information was removed, and humming experiments were conducted on extracted intonation groups ending in interrogative and declarative intonation, in both complete and edited forms. Results showed that (1) the phrasal or sentential intonation contour is less significant for Mandarin; (2) yes-no questions with utterance question particles are characterized by a rising pitch on the final syllable only; (3) the generally higher register exhibited in yes-no questions without utterance-final question particles is not the most salient cue for intonation; (4) utterance-final lengthening appears to be a salient perceptual cue for intonation identification; and (5) speech units larger than single sentences deserve more attention.
Extraction Methods of Voicing Feature for Robust Speech Recognition
András Zolnay, Ralf Schlüter, Hermann Ney; RWTH Aachen, Germany
In this paper, three different voicing features are studied as additional acoustic features for continuous speech recognition. The harmonic-product-spectrum-based feature is extracted in the frequency domain, while the autocorrelation-based and average-magnitude-difference-based methods work in the time domain. The algorithms produce a measure of voicing for each time frame. The voicing measure was combined with the standard Mel Frequency Cepstral Coefficients (MFCC) using linear discriminant analysis to choose the most relevant features. Experiments have been performed on small- and large-vocabulary tasks. The three different voicing measures combined with MFCCs resulted in similar improvements in word error rate: improvements of up to 14% on the small-vocabulary task and up to 6% on the large-vocabulary task, relative to using MFCC alone with the same overall number of parameters in the system.

An Optimized Multi-Duration HMM for Spontaneous Speech Recognition
Yuichi Ohkawa, Akihiro Yoshida, Motoyuki Suzuki, Akinori Ito, Shozo Makino; Tohoku University, Japan
In spontaneous speech, various changes of speaking style and speed can be observed, which are known to degrade speech recognition accuracy. In this paper, we describe an optimized multi-duration HMM (OMD). An OMD is a kind of multi-path HMM with at most two parallel paths. Each path is trained using speech samples with short or long phoneme duration. The thresholds for dividing the samples of each phoneme are determined through phoneme recognition experiments. Not only the thresholds but also the topologies of the HMMs are determined using the recognition results. Next, we combine the OMD model in parallel with an ordinary HMM trained on spontaneous speech and an HMM trained on read speech. Using this ‘all-parallel’ model, a 19.3% reduction in word error rate was obtained compared with the ordinary HMM trained on spontaneous speech.

Use of a CSP-Based Voice Activity Detector for Distant-Talking ASR
Luca Armani, Marco Matassoni, Maurizio Omologo, Piergiorgio Svaizer; ITC-irst, Italy
This paper addresses the problem of voice activity detection for distant-talking speech recognition in noisy and reverberant environments. The proposed algorithm is based on the same Cross-power Spectrum Phase analysis that is used for talker localization and tracking purposes. A normalized feature is derived, which is shown to be more effective than an energy-based one. The algorithm exploits this feature by dynamically updating the threshold as a non-linear average value computed during the preceding pause. On a real multichannel database, recorded with the speaker at a distance of 2.5 meters from the microphones, experiments show that the proposed algorithm provides a significant relative error rate reduction.

Speaker Recognition Using MPEG-7 Descriptors
Hyoung-Gook Kim, Edgar Berdahl, Nicolas Moreau, Thomas Sikora; Technische Universität Berlin, Germany
Our purpose is to evaluate the efficiency of MPEG-7 audio descriptors for speaker recognition. The upcoming MPEG-7 standard provides audio feature descriptors, which are useful for many applications. One example application is a speaker recognition system, in which reduced-dimension log-spectral features based on MPEG-7 descriptors are used to train hidden Markov models for individual speakers. The feature extraction based on MPEG-7 descriptors consists of three main stages: Normalized Audio Spectrum Envelope (NASE), Principal Component Analysis (PCA) and Independent Component Analysis (ICA). An experimental study is presented in which the speaker recognition rates are compared for different feature extraction methods. Using ICA, we achieved better results than with NASE and PCA in a speaker recognition system.

A Comparative Study on Maximum Entropy and Discriminative Training for Acoustic Modeling in Automatic Speech Recognition
Wolfgang Macherey, Hermann Ney; RWTH Aachen, Germany
While Maximum Entropy (ME) based learning procedures have been successfully applied to text-based natural language processing, there have been few investigations into using ME for acoustic modeling in automatic speech recognition. In this paper we show that the well-known Generalized Iterative Scaling (GIS) algorithm can be used as an alternative method to discriminatively train the parameters of a speech recognizer based on Gaussian densities. The approach is compared with both a conventional maximum likelihood training and a discriminative training based on the Extended Baum algorithm. Experimental results are reported on a connected digit string recognition task.

Maximum Conditional Mutual Information Projection for Speech Recognition
Mohamed Kamal Omar, Mark Hasegawa-Johnson; University of Illinois at Urbana-Champaign, USA
Linear discriminant analysis (LDA) in its original model-free formulation is best suited to classification problems with equal-covariance classes. Heteroscedastic discriminant analysis (HDA) removes this equal-covariance constraint, and is therefore more suitable for automatic speech recognition (ASR) systems. However, maximizing the HDA objective function does not correspond directly to minimizing the recognition error. In its original formulation, HDA solves a maximum likelihood estimation problem in the original feature space to calculate the HDA transformation matrix. Since the dimension of the original feature space in ASR problems is usually high, the estimation of the HDA transformation matrix becomes computationally expensive and requires a large amount of training data. This paper presents a generalization of LDA that solves these two problems. We start by showing that the calculation of the LDA projection matrix is a maximum mutual information estimation problem in the lower-dimensional space with some constraints on the model of the joint conditional and unconditional probability density functions (PDF) of the features; then, by relaxing these constraints, we develop a dimensionality reduction approach that maximizes the conditional mutual information between the class identity and the feature vector in the lower-dimensional space given the recognizer model. Using this approach, we achieved a 1% improvement in phoneme recognition accuracy compared to the baseline system. Improvements in recognition accuracy compared to both the LDA and HDA approaches are also achieved.

Session: PMoDf– Poster
Speech Enhancement I
Time: Monday 16.00, Venue: Main Hall, Level -1
Chair: Joaquin Gonzalez-Rodriguez, ATVS-DIAC-Univ. Politecnica de Madrid, Spain

A Semi-Blind Source Separation Method for Hands-Free Speech Recognition of Multiple Talkers
Panikos Heracleous 1, Satoshi Nakamura 2, Kiyohiro Shikano 1; 1 Nara Institute of Science and Technology, Japan; 2 ATR-SLT, Japan
In this paper, we present a beamforming-based semi-blind source separation technique, which can be applied efficiently to hands-free speech recognition of multiple talkers (including moving talkers). The main difference from conventional blind source separation techniques lies in the fact that the proposed method does not attempt to separate the unknown signals explicitly in a pre-processing pass before speech recognition. In fact, localization of the multiple talkers, separation of the signals, and speech recognition are integrated in a single pass. In each time frame, beams formed by a delay-and-sum beamformer are steered in every direction, and speech information is extracted. A modified Viterbi formula provides n-best hypotheses for each direction and word hypotheses. At the final frame, all hypotheses are clustered based on their direction information. The clusters, which correspond to the talkers, include information about the recognized speech of the multiple talkers and about their directions. Experiments on recognition of two and three talkers showed very promising results. In the case of two talkers, using simulated clean data, we achieved an average ‘top 5’ hypothesis recognition rate of 95.02%, which is a very promising result.

Influence of the Waveguide Propagation on the Antenna Performance in a Car Cabin
Leonid Krasny, Ali Khayrallah; Ericsson Research, USA
This paper presents a novel array processing algorithm for noise reduction in a hands-free car environment. The algorithm incorporates the spatial properties of the sound field in a car cabin and a constraint on the allowable speech signal distortion. Our results indicate that the proposed algorithm gives a substantial performance improvement of 15-20 dB in comparison with conventional array processing, which is based on a coherent model of the signal field.

Multi-Speaker DOA Tracking Using Interactive Multiple Models and Probabilistic Data Association
Ilyas Potamitis, George Tremoulis, Nikos Fakotakis; University of Patras, Greece
The general problem addressed in this work is that of tracking the Direction of Arrival (DOA) of active moving speakers in the presence of background noise and a moderate reverberation level in the acoustic field. In order to efficiently beamform each moving speaker on an extended basis, we adapt the theory developed in the context of multi-target tracking for military and civilian applications to the context of microphone arrays. Our approach employs Wideband MUSIC and Interacting Multiple Model (IMM) estimators to estimate the DOAs of the speakers under different kinds of motion and sudden changes in their course. Probabilistic Data Association (PDA) is used to disambiguate and resolve DOA measurements. The efficiency of the approach is illustrated on simulated and real room experiments dealing with the crossing trajectories of two speakers.

Speech Enhancement Using Weighting Function Based on the Variance of Wavelet Coefficients
Ching-Ta Lu, Hsiao-Chuan Wang; National Tsing Hua University, Taiwan
There are few works on the problem of heavy noise corruption in wavelet-based speech enhancement. In this paper, a new method is introduced to adapt the weighting function for the wavelet coefficients (WCs) in each subband. The idea is based on the observation that the change of WC variance in speech-dominated frames is larger than that in noise-dominated frames. We can define a weighting function for the WCs in each subband so that WCs are preserved in speech-dominated frames and reduced in noise-dominated frames. A weighting function in terms of the WC variance is then derived. The experimental results show that the proposed method is more robust than an SNR-adjusted speech enhancement system.

Microphone Array Voice Activity Detection and Noise Suppression Using Wideband Generalized Likelihood Ratio
Ilyas Potamitis 1, Eran Fishler 2; 1 University of Patras, Greece; 2 Princeton University, USA
The subject of this work is the use of microphone arrays for speech activity detection and noise suppression in the case of a moving speaker. The approach is based on the generalized likelihood ratio test applied in the framework of far-field, wideband moving sources (W-GLRT). It is shown that under certain distributional assumptions the W-GLRT provides a unifying framework for the evaluation of Direction of Arrival (DOA) measurements against spurious DOAs, probabilistic speech activity detection, and noise suppression. As regards speech enhancement, we demonstrate the direct connection of the W-GLRT with enhancement based on subspace methods. In addition, through the concept of directive a-priori SNR, we demonstrate its indirect connection with Minimum Mean Square Error spectral (MMSE_SA) and log-spectral (MMSE_LSA) gain modification. The efficiency of the approach is illustrated on a moving speaker with additive white Gaussian noise (AWGN) present in the acoustical field at very low SNRs.
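A scalar caricature of likelihood-ratio-based speech activity detection (single Gaussian samples with known variances; the W-GLRT above operates on wideband array data, so this only illustrates the decision rule):

```python
import math

def glr_vad(frame, noise_var, signal_var, threshold=0.0):
    """Per-frame Gaussian log-likelihood-ratio VAD: score the samples
    under a noise-only variance v0 against a signal-plus-noise variance
    v1, and declare speech when the ratio exceeds the threshold."""
    v0, v1 = noise_var, noise_var + signal_var
    llr = 0.0
    for x in frame:
        llr += 0.5 * (math.log(v0 / v1) + x * x * (1.0 / v0 - 1.0 / v1))
    return llr > threshold
```

Low-energy frames favor the noise-only hypothesis; high-energy frames push the log-ratio positive and are flagged as speech.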
Adaptive Beamforming in Room with
Reverberation
Zoran Šarić 1 , Slobodan Jovičić 2 ; 1 Institute of Security,
Serbia and Montenegro; 2 University of Belgrade,
Serbia and Montenegro
Microphone arrays are powerful tools for noise suppression in a reverberant room. The Generalized Sidelobe Canceller (GSC), which exploits the Minimum Variance (MV) criterion, is efficient in interference suppression when there is no correlation between the desired signal and the interferences. Correlation between the desired signal and any of the interferences produces desired-signal cancellation and degradation of the signal-to-noise ratio. This paper analyses the unwanted cancellation of the desired source. It shows that the cancellation level of the desired signal is proportional to the correlation between the direct wave and the reflected waves. To prevent desired-signal cancellation, we suggest estimating the GSC parameters during pauses in the desired signal. For this case it is analytically shown that there is no cancellation of the desired signal. The
proposed algorithm was experimentally tested and compared with
the Conventional Beamformer (CBF) and GSC. Experimental tests
have shown the advantage of the proposed method.
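The fixed beamformer on which a GSC is built can be sketched as a delay-and-sum with integer-sample steering delays (illustrative only):

```python
def delay_and_sum(channels, delays):
    """Minimal delay-and-sum beamformer: delay each channel so the
    desired source aligns across microphones, then average."""
    n = len(channels[0])
    out = []
    for t in range(n):
        acc = 0.0
        for ch, d in zip(channels, delays):
            idx = t - d
            if 0 <= idx < n:          # samples outside the buffer are dropped
                acc += ch[idx]
        out.append(acc / len(channels))
    return out
```

Signals aligned by the delays add coherently, while uncorrelated noise averages down; the GSC's adaptive branch then removes what leaks through the sidelobes.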
Perceptually-Constrained Generalized Singular
Value Decomposition-Based Approach for
Enhancing Speech Corrupted by Colored Noise
Gwo-hwa Ju, Lin-shan Lee; National Taiwan
University, Taiwan
In a previous work, we have successfully integrated the
transformation-based signal subspace technique with the generalized singular value decomposition (GSVD) algorithm to develop
an improved speech enhancement framework [1]. In this paper,
we further incorporate the perceptual masking effect of the psychoacoustic model as extra constraints in the previously proposed GSVD-based algorithm to obtain improved sound features, and furthermore to ensure that the undesired residual noise is nearly imperceptible. Both subjective listening tests and spectrogram-plot
comparison showed that the closed-form solution developed here
can offer significantly better speech quality than either the conventional spectral subtraction algorithm or the previously proposed
GSVD-based technique, regardless of whether the additive noise is
white or not.
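The conventional spectral subtraction baseline mentioned above, in its textbook power-domain form with a spectral floor (parameter values are illustrative):

```python
def spectral_subtract(power_spec, noise_est, floor=0.01):
    """Power spectral subtraction per frequency bin: subtract the noise
    estimate, but never go below a small fraction of the noisy power
    (the floor limits musical-noise artifacts)."""
    return [max(p - n, floor * p) for p, n in zip(power_spec, noise_est)]
```

Subspace and GSVD-based methods avoid the floor heuristic by projecting the noisy signal away from the noise subspace instead.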
19
Eurospeech 2003
Monday
Blind Separation and Deconvolution for
Convolutive Mixture of Speech Using
SIMO-Model-Based ICA and Multichannel Inverse
Filtering
September 1-4, 2003 – Geneva, Switzerland
Speech Segregation Based on Fundamental Event
Information Using an Auditory Vocoder
Toshio Irino 1 , Roy D. Patterson 2 , Hideki Kawahara 1 ;
1
Wakayama University, Japan; 2 Cambridge
University, U.K.
Hiroaki Yamajo, Hiroshi Saruwatari, Tomoya
Takatani, Tsuyoki Nishikawa, Kiyohiro Shikano; Nara
Institute of Science and Technology, Japan
We propose a new two-stage blind separation and deconvolution
(BSD) algorithm for a convolutive mixture of speech, in which a
new Single-Input Multiple-Output (SIMO)-model-based ICA (SIMOICA) and blind multichannel inverse filtering are combined. SIMOICA can separate the mixed signals, not into monaural source signals but into SIMO-model-based signals from independent sources
as they are at the microphones. After SIMO-ICA, a simple blind deconvolution technique for the SIMO model can be applied even when
each source signal is temporally correlated. The simulation results
reveal that the proposed method can successfully achieve the separation and deconvolution for a convolutive mixture of speech.
Quality Enhancement of CELP Coded Speech by
Using an MFCC Based Gaussian Mixture Model
We present a new auditory method to segregate concurrent speech
sounds. The system is based on an auditory vocoder developed to
resynthesize speech from an auditory Mellin representation using
the vocoder STRAIGHT. The auditory representation preserves fine
temporal information, unlike conventional window-based processing, and this makes it possible to segregate speech sources with an
event synchronous procedure. We developed a method to convert
fundamental frequency information to estimate glottal pulse times
so as to facilitate robust extraction of the target speech. The results
show that the segregation is good even when the SNR is 0 dB; the extracted target speech was a little distorted but entirely intelligible,
whereas the distracter speech was reduced to a non-speech sound
that was not perceptually disturbing. So, this auditory vocoder has
potential for speech enhancement in applications such as hearing
aids.
Time Delay Estimation Based on Hearing
Characteristic
D.G. Raza, C.F. Chan; City University of Hong Kong,
China
At low bit rates CELP coders present certain artifacts generally
known as hoarse and muffing characteristics. An enhancement system is developed to lessen the effects of these artifacts in CELP
coded speech. In enhancement system, the high frequency components (4kHz-8kHz) are reinserted to reduce the muffing characteristics. This is achieved by using an MFCC based Gaussian Mixture
Model. The hoarse characteristics are reduced by re-synthesizing
the CELP reproduced speech with harmonic plus noise model. The
pair-wise listening experiment results show that the re-synthesized
wideband speech is preferred over the CELP coded speech. The enhanced speech is affirmed to be pleasant to listen and exhibits the
naturalness of the original wideband speech.
Enhancement of Noisy Speech for Noise Robust
Front-End and Speech Reconstruction at Back-End
of DSR System

Hyoung-Gook Kim, Markus Schwab, Nicolas Moreau,
Thomas Sikora; Technische Universität Berlin, Germany

This paper presents a speech enhancement method for a noise robust front-end and speech reconstruction at the back-end of Distributed Speech Recognition (DSR). The speech noise removal algorithm is based on a two-stage noise filtering scheme, LSAHT, combining a log spectral amplitude speech estimator (LSA) and harmonic tunneling (HT) prior to feature extraction. The noise-reduced features are transmitted with some parameters, viz. the pitch period and the number of harmonic peaks, from the mobile terminal to the server along with noise-robust mel-frequency cepstral coefficients. Speech reconstruction at the back-end is achieved by a sinusoidal speech representation. Finally, the performance of the system is measured by the segmental signal-to-noise ratio, MOS tests, and the recognition accuracy of an Automatic Speech Recognition (ASR) system in comparison to other noise reduction methods.

Zhaoli Yan, Limin Du, Jianqiang Wei, Hui Zeng;
Chinese Academy of Sciences, China

This paper proposes a new time delay estimation model, the Summary Cross-correlation Function (SCCF). It is based on a hearing model of the human ear, which comes from a pitch perception model. The inherent relation between some time delay estimation (TDE) methods and pitch perception methods is discussed, and an idea is proposed: the pre-processing of some pitch perception models can serve as a reference for TDE models, and vice versa. The new TDE model is proposed based on this viewpoint. SCCF is then analyzed further, and its performance is compared with the Phase Transform (PHAT) and the Modified Cross-power Spectrum (M-CPSP). Simulated experiments show that the new model is more robust to noise than PHAT and M-CPSP.

Parametric Multi-Band Automatic Gain Control for
Noisy Speech Enhancement

M. Stolbov, S. Koval, M. Khitrov; Speech Technology
Center, Russia

This report is devoted to a new approach to wide-band non-stationary noise reduction and corrupted speech signal enhancement. The objective is to provide processed speech intelligibility and quality while maintaining computational simplicity. We present a new (non-subtractive) noise suppression method called multi-band Automatic Gain Control (AGC). The proposed method is based on the introduction of a non-subtractive noise suppression model and multi-band filter gain control. This model provides less residual noise and better speech quality than the Spectral Subtraction Method (SSM). Modification of the multi-band AGC gain function allows the easy introduction of a new useful feature called Spectral Contrasting. The report contains a discussion of AGC control parameter values. Experiments show that the proposed algorithms are effective in non-stationary noisy backgrounds for Signal-to-Noise Ratios (SNR) down to -6dB.
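The general idea of per-band gain control can be illustrated with a minimal sketch. This is not the authors' parametric AGC algorithm; it is a generic Wiener-style band gain with a floor, and all function names and values below are invented for illustration:

```python
def multiband_agc(band_powers, noise_powers, floor=0.1):
    """Per-band gain control: attenuate bands dominated by noise.

    A generic illustration (not the paper's algorithm): the gain for
    each band is driven by the estimated band SNR, with a floor that
    limits residual-noise modulation ("musical noise").
    """
    gains = []
    for p, n in zip(band_powers, noise_powers):
        snr = max(p - n, 0.0) / n if n > 0 else float("inf")
        # Wiener-like gain snr/(1+snr), limited from below by the floor
        g = max(snr / (1.0 + snr), floor)
        gains.append(g)
    return gains

# Example: band 0 is mostly noise, band 2 is mostly speech
print(multiband_agc([1.0, 4.0, 20.0], [1.0, 1.0, 1.0]))  # [0.1, 0.75, 0.95]
```

A multiplicative gain like this never subtracts energy directly, which is one way a non-subtractive scheme avoids the negative-power artifacts of spectral subtraction.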
Improved Kalman Filter-Based Speech
Enhancement

Jianqiang Wei, Limin Du, Zhaoli Yan, Hui Zeng;
Chinese Academy of Sciences, China

In this paper, a Kalman filter-based speech enhancement algorithm with some improvements over previous work is presented. A new technique based on spectral subtraction is used to separate the speech and noise characteristics from noisy speech and to compute the speech and noise autoregressive (AR) parameters. In order to obtain a Kalman filter output with high audible quality, a perceptual post-filter is placed at the output of the Kalman filter to smooth the enhanced speech spectra. Experiments indicate that this newly proposed method works well.

Neural Networks versus Codebooks in an
Application for Bandwidth Extension of Speech
Signals

Bernd Iser, Gerhard Schmidt; Temic Speech Dialog
Systems, Germany

This paper presents two versions of an algorithm for bandwidth extension of speech signals. We focus on the generation of the spectral envelope and compare the performance of two different approaches – neural networks versus codebooks – in terms of objective and subjective distortion measures.
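The codebook variant of envelope generation can be sketched in miniature: a narrowband feature selects its nearest trained codeword, which carries a paired wideband envelope. All vectors and dimensions below are made up for illustration; the actual codebooks are trained offline on parallel narrowband/wideband data:

```python
def extend_envelope(nb_feature, codebook):
    """Codebook mapping for bandwidth extension (illustrative sketch).

    `codebook` pairs narrowband feature vectors with wideband envelope
    vectors; the nearest narrowband entry selects the wideband envelope.
    """
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    _, wb = min(codebook, key=lambda entry: dist(entry[0], nb_feature))
    return wb

codebook = [
    ([0.0, 0.0], [0.9, 0.5, 0.1]),   # hypothetical voiced-frame entry
    ([1.0, 1.0], [0.4, 0.6, 0.8]),   # hypothetical fricative-frame entry
]
print(extend_envelope([0.9, 1.1], codebook))  # [0.4, 0.6, 0.8]
```

A neural network replaces the hard nearest-codeword lookup with a smooth learned regression from narrowband features to wideband envelopes, which is exactly the trade-off the paper measures.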
Wavelet-Based Perceptual Speech Enhancement
Using Adaptive Threshold Estimation
Essa Jafer, Abdulhussain E. Mahdi; University of
Limerick, Ireland
Eurospeech 2003
Monday
A new speech enhancement system, which is based on time-frequency adaptive wavelet soft thresholding, is presented in this paper. The system utilises a Bark-scaled wavelet packet decomposition integrated into a modified Wiener filtering technique using a novel threshold estimation method based on a magnitude decision-directed approach. First, a Bark-scaled wavelet packet transform is used to decompose the speech signal into critical bands. Threshold estimation is then performed for each wavelet band according to an adaptive noise level-tracking algorithm. Finally, the speech is estimated by incorporating the computed threshold into a Wiener filtering process, using the magnitude decision-directed approach. The proposed speech enhancement technique has been tested with various stationary and non-stationary noise cases. Reported results show that the system is capable of a high level of noise suppression
while preserving the intelligibility and naturalness of the speech.
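The soft-thresholding operation at the core of such wavelet denoising can be sketched as follows (a generic soft-threshold shrinkage, not the paper's adaptive, per-band estimator; the threshold value is illustrative):

```python
def soft_threshold(coeffs, thr):
    """Soft thresholding: shrink each wavelet coefficient toward zero.

    Coefficients with magnitude below `thr` (presumed noise) are zeroed;
    larger ones are shrunk by `thr`, preserving their sign.
    """
    out = []
    for c in coeffs:
        m = abs(c) - thr
        out.append(0.0 if m <= 0 else (m if c > 0 else -m))
    return out

# Small coefficients vanish; large ones are attenuated by the threshold
print(soft_threshold([0.05, -0.3, 1.2], 0.1))
```

In an adaptive scheme like the one described above, `thr` would be estimated per band and per frame from the tracked noise level rather than fixed.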
Noise Reduction Using Paired-Microphones on
Non-Equally-Spaced Microphone Arrangement

Mitsunori Mizumachi, Satoshi Nakamura; ATR-SLT,
Japan

A wide variety of microphone arrays have been developed, and the authors have also proposed a type of equally-spaced small-scale microphone array. In this approach, a paired-microphone is selected at each frequency to design a subtractive beamformer that can estimate a noise spectrum. This paper introduces a non-equally-spaced microphone arrangement, which might give more spatial information than equally-spaced microphones, with two criteria for selecting the most suitable paired-microphone. These criteria are based on the noise reduction rate and spectral smoothness, assuming that the target signals are speech. The feasibility of both the non-equally-spaced array and the spectral smoothness criterion is confirmed by computer simulation.

A Trainable Speech Enhancement Technique Based
on Mixture Models for Speech and Noise

Ilyas Potamitis, Nikos Fakotakis, George Kokkinakis;
University of Patras, Greece

Our work introduces a trainable speech enhancement technique that can directly incorporate information about the long-term, time-frequency characteristics of speech signals prior to the enhancement process. We approximate the noise spectral magnitude from available recordings of the operational environment, as well as clean speech from a clean database, with mixtures of Gaussian pdfs using the Expectation-Maximization (EM) algorithm. Subsequently, we apply the Bayesian inference framework to the degraded spectral coefficients and, by employing Minimum Mean Square Error (MMSE) estimation, we derive a closed-form solution for the spectral magnitude estimation task. We evaluate our technique with a focus on real, highly non-stationary noise types (e.g. passing-by aircraft noise) and demonstrate its efficiency at low SNRs.
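The shape of a Gaussian-mixture MMSE estimate can be sketched in one dimension. This is a simplified stand-in for the paper's multivariate formulation: a scalar observation y = x + noise, a GMM prior on the clean value x, and a posterior-weighted combination of per-component Wiener estimates. All parameters below are hypothetical:

```python
import math

def gmm_mmse(y, gmm, noise_var):
    """MMSE estimate of a clean value under a GMM prior (1-D sketch).

    gmm: list of (weight, mean, var) components for the clean prior;
    observation model: y = x + n with n ~ N(0, noise_var).
    """
    posts, ests = [], []
    for w, mu, v in gmm:
        s = v + noise_var                       # marginal variance of y under this component
        like = w * math.exp(-(y - mu) ** 2 / (2 * s)) / math.sqrt(2 * math.pi * s)
        posts.append(like)
        ests.append(mu + v / s * (y - mu))      # per-component Wiener estimate
    z = sum(posts)
    return sum(p / z * e for p, e in zip(posts, ests))

# Single zero-mean, unit-variance component, unit noise variance:
# the estimate shrinks the observation halfway toward the prior mean.
print(gmm_mmse(2.0, [(1.0, 0.0, 1.0)], 1.0))  # 1.0
```

With several components, the posterior weights let speech-like regions of the prior dominate, which is what lets the mixture model capture long-term speech structure.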
Perceptual Wavelet Adaptive Denoising of Speech
Qiang Fu, Eric A. Wan; Oregon Health & Science
University, USA
This paper introduces a novel speech enhancement system based on
a wavelet denoising framework. In this system, the noisy speech is
first preprocessed using a generalized spectral subtraction method
to initially lower the noise level with negligible speech distortion.
A perceptual wavelet transform is then used to decompose the resulting speech signal into critical bands. Threshold estimation is then implemented in a manner that is both time- and frequency-dependent, providing
robustness to non-stationary and correlated noisy environments.
Finally, to eliminate the “musical noise” artifact, we apply a modified Ephraim/Malah suppression rule to the thresholding operation
– adaptive denoising. Both objective and subjective experiments
prove that the new speech enhancement system is capable of significant noise reduction with little speech distortion.
Enhancement of Speech in Multispeaker
Environment
B. Yegnanarayana 1 , S.R. Mahadeva Prasanna 1 ,
Mathew Magimai Doss 2 ; 1 Indian Institute of
Technology, India; 2 IDIAP, Switzerland
In this paper, a method based on excitation source information is proposed for the enhancement of speech degraded by speech from other speakers. Speech from multiple speakers is simultaneously collected over two spatially distributed microphones. The time-delay of each speaker with respect to the two microphones is estimated using the excitation source information. A weight function is derived for each speaker using knowledge of the time-delay and the excitation source information. Linear prediction (LP) residuals of the microphone signals are processed separately using the weight functions. Speech signals are synthesized from the modified residuals. One speech signal per speaker is derived from each microphone signal. The synthesized speech signals of each speaker are combined to produce enhanced speech. Significant enhancement of the speech of one speaker relative to the others was observed in the combined signal.
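The time-delay estimation step above (and the correlation-based TDE methods discussed earlier in this session) rests on finding the lag that maximises cross-correlation between the two microphone signals. A plain cross-correlation sketch follows; in the paper the inputs would be LP residuals rather than the toy impulse sequences used here:

```python
def estimate_delay(x, y, max_lag):
    """Estimate the delay of y relative to x by maximising cross-correlation.

    A generic illustration: scan lags in [-max_lag, max_lag] and return
    the one with the largest correlation sum.
    """
    best_lag, best_val = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        val = sum(x[i] * y[i + lag]
                  for i in range(len(x))
                  if 0 <= i + lag < len(y))
        if val > best_val:
            best_lag, best_val = lag, val
    return best_lag

x = [0, 0, 1, 0, 0, 0, 0]
y = [0, 0, 0, 0, 1, 0, 0]   # x delayed by 2 samples
print(estimate_delay(x, y, 3))  # 2
```

Using excitation (residual) signals instead of raw waveforms sharpens the correlation peaks, since glottal excitation is impulse-like.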
September 1-4, 2003 – Geneva, Switzerland
Session: PMoDg – Poster
Spoken Dialog Systems I
Time: Monday 16.00, Venue: Main Hall, Level -1
Chair: Antje Schweitzer, Universität Stuttgart, Germany
Two Studies of Open vs. Directed Dialog Strategies
in Spoken Dialog Systems
Silke M. Witt, Jason D. Williams; Edify Corporation,
USA
This paper analyzes the behavior of callers responding to a speech
recognition system when prompted either with an open or a directed dialog strategy. The results of two usability studies with
different caller populations are presented. Differences between the
results from the two studies are analyzed and are shown to arise
from the differences in the domains. It is shown that whether an open or a directed dialog strategy is preferred depends on the caller population. In addition, we examine the effect of additional informational prompts on the routability of caller utterances.
The Queen’s Communicator: An Object-Oriented
Dialogue Manager
Ian O’Neill 1 , Philip Hanna 1 , Xingkun Liu 1 , Michael
McTear 2 ; 1 Queen’s University Belfast, U.K.; 2 University of Ulster, U.K.
This paper presents some of the main features of a prototype spoken dialogue manager (DM) that has been incorporated into the
DARPA Communicator architecture. Developed in Java, the object
components that constitute the DM separate generic from domain-specific dialogue behaviour in the interests of maintainability and
extensibility. Confirmation strategies encapsulated in a high-level
DiscourseManager determine the system’s behaviour across transactional domains, while rules of thumb encapsulated in a suite of
domain experts enable the system to guide the user towards completion of particular transactions. We describe the nature of the
generic confirmation strategy and the domain experts’ specialised
dialogue behaviour. We describe how rules of thumb fire given certain combinations of user-supplied values – or in the light of the
system’s own interaction with its database.
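The generic-versus-domain-specific split described above can be sketched in miniature. The actual system is written in Java; the Python classes, slot names, and action strings below are all invented for illustration:

```python
class DomainExpert:
    """Base class: generic dialogue behaviour shared across domains."""
    slots = ()  # subclasses declare the slots their transaction needs

    def next_action(self, filled):
        # Generic strategy: request the first missing slot, else confirm.
        for slot in self.slots:
            if slot not in filled:
                return f"request:{slot}"
        return "confirm-transaction"

class FlightExpert(DomainExpert):
    """Domain-specific knowledge: which slots a flight booking needs."""
    slots = ("origin", "destination", "date")

class DiscourseManager:
    """Routes each turn to the expert for the current domain."""
    def __init__(self, experts):
        self.experts = experts

    def handle(self, domain, filled):
        return self.experts[domain].next_action(filled)

dm = DiscourseManager({"flight": FlightExpert()})
print(dm.handle("flight", {"origin": "Geneva"}))  # request:destination
```

The design point is that the confirmation loop lives once in the base class, while each domain expert only contributes its "rules of thumb" and slot inventory.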
RavenClaw: Dialog Management Using Hierarchical
Task Decomposition and an Expectation Agenda
Dan Bohus, Alexander I. Rudnicky; Carnegie Mellon
University, USA
We describe RavenClaw, a new dialog management framework developed as a successor to the Agenda [1] architecture used in the
CMU Communicator. RavenClaw introduces a clear separation between task and discourse behavior specification, and allows rapid
development of dialog management components for spoken dialog
systems operating in complex, goal-oriented domains. The system
development effort is focused entirely on the specification of the
dialog task, while a rich set of domain-independent conversational behaviors is transparently generated by the dialog engine. To date,
RavenClaw has been applied to five different domains allowing us
to draw some preliminary conclusions as to the generality of the
approach. We briefly describe our experience in developing these
systems.
Features for Tree Based Dialogue Course
Management
Klaus Macherey, Hermann Ney; RWTH Aachen,
Germany
In this paper, we introduce different features for dialogue course management and investigate their effect on the system's behaviour in choosing the subsequent dialogue action during a dialogue session. In particular, we investigate whether the system is able to detect and resolve ambiguities, and whether it always chooses the state which leads as quickly as possible to a final state that presumably meets the user's request. The criteria and the data structures used are independent of the underlying domain and can therefore be used in different applications of spoken dialogue systems.
Conceptual Decoding for Spoken Dialog Systems
Yannick Estève, Christian Raymond, Frédéric Béchet,
Renato De Mori; LIA-CNRS, France
A search methodology is proposed for performing the conceptual decoding process. Such a process provides the best sequence of word hypotheses according to a set of conceptual interpretations. The resulting models are combined in a network of Stochastic Finite State Transducers. This approach is a framework that tries to bridge the gap between the speech recognition and speech understanding processes. Indeed, conceptual interpretations are generated according to both a semantic representation of the task and a system belief which evolves according to the dialogue states. Preliminary experiments on the detection of semantic entities (mainly named entities) in a dialog application have shown that interesting results can be obtained even if the Word Error Rate is quite high.
Sentence Verification in Spoken Dialogue System

Huei-Ming Wang, Yi-Chung Lin; Industrial Technology
Research Institute, Taiwan

In spoken dialogue systems, a sentence verification technique is very useful for avoiding misunderstanding of the user’s intention by rejecting out-of-domain or bad quality utterances. However, compared with word verification and concept verification, sentence verification has seldom been addressed in the past. In this paper, we propose a sentence verification approach which uses discriminative features extracted from the edit operation sequence. Since the edit operation sequence indicates what kinds of errors (i.e., insertion, deletion and substitution errors) may occur in the hypothetical concept sequence, it conveys sentence-level information for evaluating the quality of the system’s interpretation of the user’s utterance. In addition, a sentence verification criterion concerning the precision and recall rates of hypothetical concepts is also proposed to pursue efficient and correct spoken dialogue interactions. Compared with a verification method using an acoustic confidence measure, the proposed approach reduces errors by 17.3%.

Development of a Stochastic Dialog Manager
Driven by Semantics

Francisco Torres, Emilio Sanchis, Encarna Segarra;
Universitat Politècnica de València, Spain

We present an approach for the development of a dialog manager based on stochastic models for the representation of the dialogue structure and strategy. This dialog manager processes semantic representations and, when it is integrated with our understanding and answer generation modules, it performs natural language dialogs. It has been applied to a Spanish dialogue system which answers telephone queries about train timetables.

Generation of Natural Response Timing Using
Decision Tree Based on Prosodic and Linguistic
Information

Masashi Takeuchi, Norihide Kitaoka, Seiichi
Nakagawa; Toyohashi University of Technology,
Japan

If a dialog system can respond to the user as reasonably as a human, the interaction will be smoother. The timing of responses such as backchannels and turn-taking plays an important role in making a dialog as smooth as human-human interaction. We are now developing a dialog system which can generate response timing in real time. In this paper, we introduce a response timing generator for such a dialog system. First, we analyzed conversations between two persons and extracted prosodic and linguistic information which had effects on the timing. Then we constructed a decision tree based on features derived from this information and developed a timing generator using rules derived from the decision tree. The timing generator decides the action of the system every 100 ms during the user’s pauses. We evaluated the timing generator by subjective and objective evaluation.

Child and Adult Speaker Adaptation During Error
Resolution in a Publicly Available Spoken Dialogue
System

Linda Bell, Joakim Gustafson; Telia Research, Sweden

This paper describes how speakers adapt their language during error resolution when interacting with the animated agent Pixie. A corpus of spontaneous human-computer interaction was collected at the Telecommunication museum in Stockholm, Sweden. Adult and child speakers were compared with respect to user behavior and strategies during error resolution. In this study, 16 adults and 16 children were randomly selected from a corpus of almost 3,000 speakers. This sub-corpus was then analyzed in greater detail. Results indicate that adults and children use partly different strategies when their interactions with Pixie become problematic. Children tend to repeat the same utterance verbatim, altering certain phonetic features. Adults, on the other hand, often modify other aspects of their utterances such as lexicon and syntax. Results from the present study will be useful for constructing future spoken dialogue systems with improved error handling for adults as well as children.

Detection and Recognition of Correction Utterance
in Spontaneously Spoken Dialog

Norihide Kitaoka, Naoko Kakutani, Seiichi Nakagawa;
Toyohashi University of Technology, Japan

Recently, the performance of speech recognition has drastically improved, and products with interfaces based on speech recognition have been realized. However, when we communicate with computers through a speech interface, misrecognition is inevitable, and it is difficult to recover from it because of the immaturity of the interface. Users try to recover from misrecognition by repeating the same content, so detecting a user’s repetition helps a system detect its misunderstanding and recover from the misrecognition.

In this paper, we regard an utterance which includes repetitions as a correction and propose a method to detect correction utterances in spontaneously spoken dialog using word spotting based on DTW (dynamic time warping) and an N-best hypotheses overlapping measure. As a result, we achieved a recall rate of 92.7% and a precision of 89.1%. Moreover, we tried to improve recognition accuracy using the detection. Using the choice of vocabulary and grammar setup based on the detection, we improved recognition performance from 42.7% to 50.0% for correction utterances and from 70.5% to 77.9% for non-correction utterances.
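The DTW-based word spotting underlying such repetition detection rests on the standard DTW distance, which can be sketched as follows (illustrative only: real spotting aligns acoustic feature vectors, not the toy scalar sequences used here):

```python
def dtw_distance(a, b):
    """Dynamic time warping distance between two scalar sequences.

    Fills the classic cumulative-cost matrix: each cell adds the local
    cost to the cheapest of the three allowed predecessor cells.
    """
    inf = float("inf")
    n, m = len(a), len(b)
    d = [[inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            d[i][j] = cost + min(d[i - 1][j],      # insertion
                                 d[i][j - 1],      # deletion
                                 d[i - 1][j - 1])  # match/substitution
    return d[n][m]

# A repeated element is absorbed by the warping at zero cost
print(dtw_distance([1, 2, 3], [1, 2, 2, 3]))  # 0.0
```

A low DTW distance between a new utterance and a stored one is exactly the kind of evidence a correction detector can combine with an N-best overlap measure.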
Topic-Specific Parser Design in an Air Travel
Natural Language Understanding Application

Chaitanya J.K. Ekanadham, Juan M. Huerta; IBM T.J.
Watson Research Center, USA

In this paper we contrast a traditional approach to semantic parsing for Natural Language Understanding applications, in which a single parser captures a whole application domain, with an alternative approach consisting of a collection of smaller parsers, each able to handle only a portion of the domain. We implement this topic-specific parsing strategy by fragmenting the training corpus into subject-specific subsets and developing from each subset a corresponding subject parser. We demonstrate this procedure on the DARPA Communicator task, and we observe that, given an appropriate smoothing mechanism to overcome data sparseness, the set of subject-specific parsers performs as effectively (in accuracy terms) as the original parser. We present experiments under both supervised and unsupervised subject selection modes.

The Use of Confidence Measures in Vector Based
Call-Routing

Stephen J. Cox, Gavin Cawley; University of East
Anglia, U.K.

In previous work, we experimented with different techniques of vector-based call routing, using the transcriptions of the queries to compare algorithms. In this paper, we base the routing decisions on the recogniser output rather than transcriptions and examine the use of confidence measures (CMs) to combat the problems caused by the “noise” in the recogniser output. CMs are derived both for the words output from the recogniser and for the routings themselves, and are used to investigate improving both routing accuracy and routing confidence. Results are given for a 35-route retail store enquiry-point task. They suggest that although routing error is controlled by the recogniser error rate, confidence in routing decisions can be improved using these techniques.

Robust Parsing of Utterances in Negotiative
Dialogue

Johan Boye, Mats Wirén; Telia Research, Sweden

This paper presents an algorithm for domain-dependent parsing of utterances in negotiative dialogue. To represent such utterances, the algorithm outputs semantic expressions that are more expressive than propositional slot-filler structures. It is very fast and robust, yet precise and capable of correctly combining information from different utterance fragments.

Multi-Channel Sentence Classification for Spoken
Dialogue Language Modeling

Frédéric Béchet 1 , Giuseppe Riccardi 2 , Dilek Z.
Hakkani-Tür 2 ; 1 LIA-CNRS, France; 2 AT&T
Labs-Research, USA

In traditional language modeling, word prediction is based on the local context (e.g. n-grams). In spoken dialog, language statistics are affected by the multidimensional structure of the human-machine interaction. In this paper we investigate the statistical dependencies of users’ responses with respect to the system’s and user’s channels. The system channel components are the prompts’ text, the dialogue history, and the dialogue state. The user channel components are the Automatic Speech Recognition (ASR) transcriptions, the semantic classifier output, and the sentence length. We describe an algorithm for language model rescoring using users’ response classification. The user’s response is first mapped into a multidimensional state, and the state-specific language model is applied for ASR rescoring. We present perplexity and ASR results on the How May I Help You?(SM) 100K spoken dialogs.

Automatic Induction of N-Gram Language Models
from a Natural Language Grammar

Stephanie Seneff, Chao Wang, Timothy J. Hazen;
Massachusetts Institute of Technology, USA

This paper details our work in developing a technique which can automatically generate class n-gram language models from natural language (NL) grammars in dialogue systems. The procedure eliminates the need for double maintenance of the recognizer language model and the NL grammar. The resulting language model adopts the standard class n-gram framework for computational efficiency. Moreover, both the n-gram classes and training sentences are enhanced with semantic/syntactic tags defined in the NL grammar, such that the trained language model preserves the distinctive statistics associated with different word senses. We have applied this approach in several different domains and languages, and have evaluated it on our most mature dialogue systems to assess its competitiveness with pre-existing n-gram language models. The speech recognition performances with the new language model are in fact the best we have achieved in both the JUPITER weather domain and the MERCURY flight reservation domain.

Flexible Speech Act Identification of Spontaneous
Speech with Disfluency

Chung-Hsien Wu, Gwo-Lang Yan; National Cheng
Kung University, Taiwan

This paper describes an approach for flexible speech act identification of spontaneous speech with disfluency. In this approach, semantic information, syntactic structure, and fragment features of an input utterance are statistically encapsulated into a proposed speech act hidden Markov model (SAHMM) to characterize the speech act. To deal with the disfluency problem in a sparse training corpus, an interpolation mechanism is exploited to re-estimate the state transition probability in the SAHMM. Finally, the dialog system accepts the speech act with the best score and returns the corresponding response. Experiments were conducted to evaluate the proposed approach using a spoken dialogue system for the air travel information service. A testing database from 25 speakers containing 480 dialogues including 3038 sentences was collected and used for evaluation. The experimental results show that the proposed approach can achieve 90.3% in speech act correct rate (SACR) and 85.5% in fragment correct rate (FCR) for fluent speech, and gains a significant improvement of 5.7% in SACR and 6.9% in FCR for disfluent speech compared to the baseline system, which does not consider filled pauses.

Efficient Spoken Dialogue Control Depending on
the Speech Recognition Rate and System’s
Database

Kohji Dohsaka, Norihito Yasuda, Kiyoaki Aikawa;
NTT Corporation, Japan

We present dialogue control methods (the dual-cost method and the trial dual-cost method) that enable a spoken dialogue system to convey information to the user in as short a dialogue as possible, depending on the speech recognition rate and the content of its database. Both methods control a dialogue so as to minimize the sum of two costs: the confirmation cost (C-cost) and the information transfer cost (I-cost). The C-cost is the length of a subdialogue for confirming a user query, and the I-cost is the length of a system response generated after the confirmations. The dual-cost method can avoid the unnecessary confirmations that are inevitable in conventional methods. The trial dual-cost method is an improved version of the dual-cost method. Whereas the dual-cost method has the limitation that it generates a system response based only on the content of a query that the user has acknowledged in the confirmation subdialogue, the trial dual-cost method does not. Dialogue experiments prove that the trial dual-cost method outperforms the dual-cost method and that both methods outperform conventional ones.

Connectionist Classification and Specific Stochastic
Models in the Understanding Process of a Dialogue
System

David Vilar, María José Castro, Emilio Sanchis;
Universitat Politècnica de València, Spain

In this paper we present an approach to the application of specific models to the understanding process of a dialogue system. The preliminary classification is performed by means of Multilayer Perceptrons, and Hidden Markov Models are used for the semantic modeling. The task consists of answering telephone queries about train timetables, prices and services for long distance trains in Spanish. A comparison between a global understanding model and the specific models is presented.

Robust Speech Understanding Based on Expected
Discourse Plan

Shin-ya Takahashi, Tsuyoshi Morimoto, Sakashi
Maeda, Naoyuki Tsuruta; Fukuoka University, Japan

This paper reports spoken dialogue experiments for elderly people in the home health care system we have developed. In spoken dialogue systems, it is important to decrease recognition errors. The recognition errors, however, cannot be completely avoided with current speech recognition techniques. In this paper, we propose a robust recognition and understanding technique based on expected discourse plans in order to improve recognition accuracy. First, we collect dialogue examples of elderly users through a Wizard-of-Oz (WOZ) experiment. Next, we conduct a recognition experiment on the collected elderly speech using the proposed technique. The experimental results demonstrate that this technique improved the sentence recognition rate from 69.1% to 74.3%, the word recognition rate from 80.3% to 81.7%, and the plan matching rate from 88.3% to 92.0%.
Session: OTuBa – Oral
Robust Speech Recognition - Noise
Compensation

Time: Tuesday 10.00, Venue: Room 1
Chair: Iain McCowan, IDIAP, Martigny, Switzerland

Normalization of Time-Derivative Parameters
Using Histogram Equalization

Yasunari Obuchi 1 , Richard M. Stern 2 ; 1 Hitachi Ltd.,
Japan; 2 Carnegie Mellon University, USA

In this paper we describe a new framework of feature compensation for robust speech recognition. We introduce Delta-Cepstrum Normalization (DCN), which normalizes not only cepstral coefficients but also their time-derivatives. In previous work, the mean and the variance of cepstral coefficients were normalized to reduce irrelevant information, but such normalization was not applied to time-derivative parameters because the reduction of irrelevant information was insufficient. However, Histogram Equalization provides better compensation and can be applied even to delta and delta-delta cepstra. We investigate various implementations of DCN, and show that we can achieve the best performance when the normalizations of the cepstra and delta cepstra are made mutually interdependent. We evaluate the performance of DCN using speech data recorded on a PDA. DCN provides significant improvements compared to HEQ. We also examine the possibility of combining Vector Taylor Series (VTS) and DCN. Even though some combinations do not improve the performance of VTS, it is shown that the best combination gives better performance than VTS alone. Finally, the advantages of DCN in terms of computation speed are also discussed.

Tree-Structured Noise-Adapted HMM Modeling for
Piecewise Linear-Transformation-Based Adaptation

Zhipeng Zhang 1 , Kiyotaka Otsuji 1 , Sadaoki Furui 2 ;
1 NTT DoCoMo Inc., Japan; 2 Tokyo Institute of
Technology, Japan

This paper proposes the application of tree-structured clustering to various noise samples or noisy speech in the framework of piecewise-linear transformation (PLT)-based noise adaptation. According to the clustering results, a noisy speech HMM is made for each node of the tree structure. Based on the likelihood maximization criterion, the HMM that best matches the input speech is selected by tracing the tree from top to bottom, and the selected HMM is further adapted by linear transformation. The proposed method is evaluated by applying it to a Japanese dialogue recognition system. The results confirm that the proposed method is effective in recognizing noise-added speech under various noise conditions.

Feature Compensation Scheme Based on Parallel
Combined Mixture Model

Wooil Kim, Sungjoo Ahn, Hanseok Ko; Korea
University, Korea

This paper proposes an effective feature compensation scheme based on a speech model for achieving robust speech recognition. Conventional model-based methods require off-line training with a noisy speech database and are not suitable for online adaptation. In the proposed scheme, we can relax the off-line training with a noisy speech database by employing the parallel model combination technique for the estimation of correction factors. Applying the model combination process to the mixture model alone, as opposed to the entire HMM, makes online model combination possible. Exploiting the availability of a noise model from off-line sources, we accomplish the online adaptation via MAP (Maximum A Posteriori) estimation. In addition, an online channel estimation procedure is incorporated within the proposed framework. The representative experimental results indicate that the suggested algorithm is effective in realizing robust speech recognition under the combined adverse conditions of additive background noise and channel distortion.

A Comparison of Three Non-Linear Observation
Models for Noisy Speech Features

Jasha Droppo, Li Deng, Alex Acero; Microsoft
Research, USA

This paper reports our recent efforts to develop a unified, non-linear, stochastic model for estimating and removing the effects of additive noise on speech cepstra. The complete system consists of prior models for speech and noise, an observation model, and an inference algorithm. The observation model quantifies the relationship between clean speech, noise, and the noisy observation. Since it is expressed in terms of the log Mel-frequency filter-bank features, it is non-linear. The inference algorithm is the procedure by which the clean speech and noise are estimated from the noisy observation.

The most critical component of the system is the observation model. This paper derives a new approximation strategy and compares it with two existing approximations. It is shown that the new approximation uses half the calculation, and produces equivalent or improved word accuracy scores, when compared to previous techniques. We present noise-robust recognition results on the standard Aurora 2 task.

Maximum Likelihood Sub-Band Weighting for
Robust Speech Recognition

Donglai Zhu 1 , Satoshi Nakamura 2 , Kuldip K.
Paliwal 3 , Renhua Wang 1 ; 1 University of Science and
Technology of China, China; 2 ATR-SLT, Japan;
3 Griffith University, Australia

Sub-band speech recognition approaches have been proposed for robust speech recognition, in which full-band power spectra are divided into several sub-bands and then the likelihoods or cepstral vectors of the sub-bands are merged depending on their reliability. In conventional sub-band approaches, correlations across the sub-bands are not modeled, and the merging weights can only be set empirically or estimated during training procedures, which may not match the observed data. These methods further degrade performance for clean speech. We propose a novel sub-band approach, in which frequency sub-bands are multiplied with weighting factors and merged; it considers sub-band dependence and proves to be more robust than both full-band and conventional sub-band approaches. Furthermore, the weighting factors can be obtained using maximum-likelihood estimation approaches in order to minimize the mismatch between the trained models and the observed features. Finally, we evaluated our methods on both the Aurora 2 task and the Resource Management task and showed consistent performance improvements on both tasks.

A New Supervised-Predictive Compensation
Scheme for Noisy Speech Recognition

Khalid Daoudi, Murat Deviren; LORIA, France

We present a new predictive compensation scheme which makes no assumptions about how the noise sources alter the speech data and which does not rely on clean speech models. Rather, this new scheme makes the (realistic) assumption that speech databases recorded under different background noise conditions are available. The philosophy of this scheme is to process these databases in order to build a “tool” which will allow it to handle new noise conditions in a robust way. We evaluate the performance of this new compensation scheme on a connected digits recognition task and show that it can perform significantly better than multi-condition training, which is the most widely used technique in this kind of scenario.
24
Eurospeech 2003
Tuesday
September 1-4, 2003 – Geneva, Switzerland
Session: STuBb– Oral
Forensic Speaker Recognition
Time: Tuesday 10.00, Venue: Room 2
Chair: Andrzej Drygajlo, EPFL, Switzerland

Statistical Methods and Bayesian Interpretation of
Evidence in Forensic Automatic Speaker
Recognition

Andrzej Drygajlo 1, Didier Meuwly 2, Anil Alexander 1; 1 EPFL, Switzerland; 2 Forensic Science Service, U.K.

The goal of this paper is to establish a robust methodology for forensic automatic speaker recognition (FASR) based on sound statistical and probabilistic methods, and validated using databases recorded in real-life conditions. The interpretation of recorded speech as evidence in the forensic context presents particular challenges. The means proposed for dealing with them is through Bayesian inference and a corpus-based methodology. A probabilistic model – the odds form of Bayes’ theorem and likelihood ratio – seems to be an adequate tool for assisting forensic experts in the speaker recognition domain to interpret this evidence. In forensic speaker recognition, statistical modelling techniques are based on the distribution of various features pertaining to the suspect’s speech and its comparison to the distribution of the same features in a reference population with respect to the questioned recording. In this paper, a state-of-the-art automatic, text-independent speaker recognition system, using Gaussian mixture models (GMM), is adapted to the Bayesian interpretation (BI) framework to estimate the within-source variability of the suspected speaker and the between-sources variability, given the questioned recording. This double-statistical approach (BI-GMM) gives an adequate solution for the interpretation of recorded speech as evidence in the judicial process.

Robust Likelihood Ratio Estimation in Bayesian
Forensic Speaker Recognition

J. Gonzalez-Rodriguez, D. Garcia-Romero, M. Garcia-Gomar, D. Ramos-Castro, J. Ortega-Garcia; Universidad Politécnica de Madrid, Spain

In this paper we summarize the Bayesian methodology for forensic analysis of the evidence in the speaker recognition area. We also describe the procedure to convert any speaker recognition system into a valuable forensic tool according to the Bayesian methodology. Furthermore, we study the difference between assessment of speaker recognition technology using DET curves and assessment of forensic systems by means of Tippett plots. Finally, we show several complete examples of our speaker recognition system in a forensic environment. Some experiments are presented where, using Ahumada-Gaudí speech data, we optimize the likelihood ratio computation procedure in order to be robust to inconsistencies in the estimation of within- and between-sources statistical distributions. Results in the different tested situations, summarized in Tippett plots, show the adequacy of this approach to daily forensic work.

Automated Speaker Recognition in Real World
Conditions: Controlling the Uncontrollable

Hirotaka Nakasone; Federal Bureau of Investigation, USA

The current development of automatic speaker recognition technology may provide a new method to augment or replace the traditional method offered by qualified experts using aural and spectrographic analysis. The most promising of these automated technologies are based on statistical hypothesis testing methods involving likelihood ratios. The null hypothesis is generated using a universal background model composed of a large population of speakers. However, techniques with excellent performance in standardized evaluations (NIST trials) may not work perfectly in the real world. By defining and controlling the input speech samples carefully, we show quantitative differences in performance for different factors affecting a speaker population, and discuss on-going efforts to improve the accuracy rate for use in real world conditions. In this paper we address two issues related to the factors that affect the system performance, namely the speech signal duration and the signal-to-noise ratio.

Estimating the Weight of Evidence in Forensic
Speaker Verification

Beat Pfister, René Beutler; ETH Zürich, Switzerland

In forensic casework, the application of automatic speaker verification (SV) aims to determine the likelihood ratio of a suspect being, versus not being, the speaker of an incriminating speech recording. For that purpose, the likelihood of the anti-speaker has to be estimated from the speech of an adequate number of other speakers. In many cases, speech signals of such an anti-speaker population are not available and it is generally too expensive to make an appropriate collection. This paper presents a practical procedure for forensic SV which is based on a text-dependent SV system and in which, instead of an anti-speaker population, a special speech database is used to calibrate the valuation scale for an individual case.
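The likelihood-ratio framework recurring in the forensic abstracts above reduces to comparing the evidence under two hypotheses and multiplying the prior odds. The sketch below is a toy univariate-Gaussian version; all score models, parameter values, and function names are illustrative assumptions, not taken from any of the papers.

```python
import math

# Toy sketch of the odds form of Bayes' theorem used in forensic
# speaker recognition: LR = p(E | same speaker) / p(E | different speaker).
# Univariate Gaussian score models stand in for the within-source
# (suspect) and between-sources (reference population) models.

def gaussian_pdf(x: float, mean: float, std: float) -> float:
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def likelihood_ratio(score: float,
                     suspect=(0.8, 0.1),       # within-source model
                     population=(0.2, 0.2)):   # between-sources model
    return gaussian_pdf(score, *suspect) / gaussian_pdf(score, *population)

def posterior_odds(score: float, prior_odds: float) -> float:
    # Bayes in odds form: posterior odds = likelihood ratio * prior odds
    return likelihood_ratio(score) * prior_odds
```

A score near the suspect model yields LR > 1 (evidence supports the same-speaker hypothesis); a score near the population model yields LR < 1. The court, not the expert, supplies the prior odds.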
Auditory-Instrumental Forensic Speaker
Recognition
Stefan Gfroerer; Bundeskriminalamt, Germany
The most prominent part of forensic speech and audio processing is speaker recognition. Worldwide, a number of approaches to forensic speaker recognition (FSR) have been developed that differ in terms of technical procedures, methodology, and instrumentation, and also in terms of the probability scale on which the final
conclusion is based. The BKA’s approach to speaker recognition is
a combination of classical phonetic analysis techniques including
analytical listening by an expert and the use of signal processing
techniques within an acoustic-phonetic framework. This combined
auditory-instrumental method includes acoustic measurements of
parameters which may be interpreted using statistical information
on their distributions, e.g. probability distributions of average fundamental frequency for adult males and females, average syllable
rates as indicators of speech rate, etc. In a voice comparison report the final conclusion is determined by a synopsis of the results
from auditory and acoustic parameters, amounting to about eight
to twelve on average, depending on the nature of the speech material. Results are given in the form of probability statements. The
paper gives an overview of current procedures and specific problems of FSR.
Earwitness Line-Ups: Effects of Speech Duration,
Retention Interval and Acoustic Environment on
Identification Accuracy
J.H. Kerstholt 1 , E.J.M. Jansen 1 , A.G. van Amelsvoort 2 ,
A.P.A. Broeders 3 ; 1 TNO Human Factors, The
Netherlands; 2 LSOP Police Knowledge and Expertise
Centre, The Netherlands; 3 Netherlands Forensic
Institute, The Netherlands
An experiment was conducted to investigate the effects of retention
interval, exposure duration and acoustic environment on speaker
identification accuracy in voice line-ups. In addition, the relation between confidence assessments, by both participants and the test assistant, and identification accuracy was explored. A total of 361 participants heard a single target voice in one of four exposure conditions
(short or long speech sample, recorded only indoors or indoors and
outdoors). Half the participants were tested immediately after exposure to the target voice and half one week later. The results
show that the target was correctly identified in 42% of cases. In
the target-absent condition there were 51% false alarms. Acoustic
environment did not affect identification accuracy. There was an
interaction between speech duration and retention interval in the
target-absent condition: after a one-week interval, listeners made
fewer false identifications if the speech sample was long. No effects were found when participants were tested immediately. Only
the confidence scores of the test assistant had predictive value. Taking the confidence score of the test assistant into account therefore
increases the diagnostic value of the line-up.
Session: OTuBc– Oral
Emotion in Speech
Time: Tuesday 10.00, Venue: Room 3
Chair: Elizabeth Shriberg, SRI, Menlo Park, USA

Characteristics of Authentic Anger in Hebrew
Speech

Noam Amir, Shirley Ziv, Rachel Cohen; Tel Aviv University, Israel

In this study we examine a number of characteristics of angry Hebrew speech. Whereas such studies are frequently carried out on acted speech, in this study we used recordings of participants in broadcast, politically oriented talk shows. The recordings were audited and rated for anger content by 11 listeners. 12 utterances judged to contain angry speech were then analyzed along with 12 utterances from the same speakers that were judged to contain neutral speech. Various statistics of the F0 curve and spectral tilt were calculated and correlated with the degree of anger, giving a number of interesting results: for example, though pitch range was significantly correlated with anger in general, it was negatively correlated with the degree of anger. A separate test was conducted, judging only the textual content of the utterances, to examine the degree to which it influenced the listening tests. After neutralizing for the textual content, some of the acoustic measures became weaker predictors of anger, whereas mean F0 remained the strongest indicator of anger. Spectral tilt also showed a significant decrease in angry speech.

Prosody-Based Classification of Emotions in
Spoken Finnish

Tapio Seppänen, Eero Väyrynen, Juhani Toivanen; University of Oulu, Finland

An emotional speech corpus of Finnish was collected that includes utterances of four emotional states of speakers. More than 40 prosodic features were derived and automatically computed for the speech samples. Statistical classification experiments with a kNN classifier and human listening tests indicate that emotion recognition performance comparable to that of human listeners can be achieved.

Recognition of Emotions in Interactive Voice
Response Systems

Sherif Yacoub, Steve Simske, Xiaofan Lin, John Burns; Hewlett-Packard Laboratories, USA

This paper reports emotion recognition results from speech signals, with particular focus on extracting emotion features from the short utterances typical of Interactive Voice Response (IVR) applications. We focus on distinguishing anger versus neutral speech, which is salient to call center applications. We also report on classification of other types of emotion such as sadness, boredom, happiness, and cold anger. We compare results from using neural networks, Support Vector Machines (SVM), k-nearest neighbors, and decision trees. We use a database from the Linguistic Data Consortium at the University of Pennsylvania, recorded by 8 actors expressing 15 emotions. Results indicate that hot anger and neutral utterances can be distinguished with over 90% accuracy. We show results from recognizing other emotions, and also illustrate which emotions can be clustered together using the selected prosodic features.

We are not Amused – But How do You Know? User
States in a Multi-Modal Dialogue System

Anton Batliner, Viktor Zeißler, Carmen Frank, Johann Adelhardt, Rui P. Shi, Elmar Nöth; Universität Erlangen-Nürnberg, Germany

For the multi-modal dialogue system SmartKom, emotional user states in a Wizard-of-Oz experiment, e.g. joyful, angry, or helpless, are annotated holistically and based purely on facial expressions; other phenomena (prosodic peculiarities, offtalk, i.e. speaking aside, etc.) are labelled as well. We present the correlations between these different annotations and report classification results using a large prosodic feature vector. The performance of the user state classification is not yet satisfactory; possible reasons and remedies are discussed.

Frequency Distribution Based Weighted Sub-Band
Approach for Classification of Emotional/Stressful
Content in Speech

Mandar A. Rahurkar, John H.L. Hansen; University of Colorado at Boulder, USA

In this paper we explore the use of nonlinear Teager Energy Operator based features derived from multi-resolution sub-band analysis for the classification of emotional/stressful speech. We propose a novel scheme for automatic sub-band weighting in an effort towards developing a generic algorithm for understanding emotion or stress in speech. We evaluate the proposed algorithm using a corpus of audio material from a military stressful Soldier of the Quarter Board evaluation panel. We establish classification performance of emotional/stressful speech using an open speaker set with open test tokens. With the new frequency distribution based scheme, we obtain an 81.3% relative reduction in the stressed speech detection error rate and a 75.4% relative reduction in the neutral speech detection error rate. The results suggest an important step forward in establishing an effective processing scheme for developing generic models of neutral and emotional speech.

Classifying Subject Ratings of Emotional Speech
Using Acoustic Features

Jackson Liscombe, Jennifer Venditti, Julia Hirschberg; Columbia University, USA

This paper presents results from a study examining emotional speech using acoustic features and their use in automatic machine learning classification. In addition, we propose a classification scheme for the labeling of emotions on continuous scales. Our findings support those of previous research and also indicate possible future directions utilizing spectral tilt and pitch contour to distinguish emotions in the valence dimension.

Session: OTuBd– Oral
Dialog System User & Domain Modeling
Time: Tuesday 10.00, Venue: Room 4
Chair: Paul Dalsgaard, Center for PersonKommunikation (CPK)

On-Line User Modelling in a Mobile Spoken
Dialogue System

Niels Ole Bernsen; University of Southern Denmark, Denmark

The paper presents research on user modelling for an in-car spoken dialogue system, including the implementation of a generic user modelling module applied to the modelling of drivers’ task objectives.

Towards Dynamic Multi-Domain Dialogue
Processing

Botond Pakucs; KTH, Sweden

This paper introduces SesaME, a generic dialogue management framework especially designed to support dynamic multi-domain dialogue processing. SesaME supports a multitude of highly distributed applications and facilitates simultaneous adaptation to individual users and their environment. Dynamic multi-domain dialogue processing is supported through the use of standardised and highly distributed domain descriptions. For fast, runtime handling of these domain descriptions, a specially developed dynamic plug-and-play solution is employed. A description of how SesaME’s functionality is evaluated within the framework of the PER demonstrator is also presented.

User Modeling in Spoken Dialogue Systems for
Flexible Guidance Generation

Kazunori Komatani, Shinichi Ueno, Tatsuya Kawahara, Hiroshi G. Okuno; Kyoto University, Japan
We address appropriate user modeling in order to generate cooperative responses to each user in spoken dialogue systems. Unlike previous studies that focus on users’ knowledge or typical kinds of users, the proposed user model is more comprehensive. Specifically, we set up three dimensions of user models: skill level with the system, knowledge level of the target domain, and degree of hastiness. Moreover, the models are automatically derived by decision tree learning using real dialogue data. We obtained reasonable classification accuracy for all dimensions. Dialogue strategies based on the user modeling are implemented in the Kyoto city bus information system that has been developed at our laboratory. Experimental evaluation shows that the cooperative responses adaptive to individual users serve as good guidance for novice users without increasing the dialogue duration for skilled users.
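A decision tree learned for one user-model dimension, as described above, amounts at run time to a cascade of threshold tests on dialogue-behaviour features. The sketch below hand-codes such a cascade for the skill-level dimension; the feature names, thresholds, and labels are invented for illustration and are not the tree learned in the paper.

```python
# Schematic sketch of a learned decision tree for one user-model
# dimension (skill level). Features and thresholds are illustrative;
# the paper derives such trees automatically from real dialogue data.

def classify_skill(n_help_requests: int, n_barge_ins: int,
                   avg_utterance_words: float) -> str:
    """Return 'novice' or 'skilled' from simple dialogue-behaviour cues."""
    if n_help_requests > 2:          # frequent help requests -> novice
        return "novice"
    if n_barge_ins >= 1 and avg_utterance_words >= 5:
        return "skilled"             # barges in with long, fluent turns
    return "novice"
```

The dialogue manager would consult such classifications each turn to decide, for example, whether to give verbose guidance or terse confirmations.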
Empowering End Users to Personalize Dialogue
Systems Through Spoken Interaction

Stephanie Seneff 1, Grace Chung 2, Chao Wang 1; 1 Massachusetts Institute of Technology, USA; 2 Corporation for National Research Initiatives, USA

This paper describes recent advances we have made towards the goal of empowering end users to automatically expand the knowledge base of a dialogue system through spoken interaction, in order to personalize it to their individual needs. We describe techniques used to incrementally reconfigure a preloaded trained natural language grammar, as well as the lexicon and language models for the speech recognition system. We also report on advances in the technology to integrate a spoken pronunciation with a spoken spelling, in order to improve spelling accuracy. While the original algorithm was designed for a “speak and spell” input mode, we show here that the same methods can be applied to separately uttered spoken and spelled forms of the word. By concatenating the two waveforms, we can take advantage of the mutual constraints realized in an integrated composite FST. Using an OGI corpus of separately spoken and spelled names, we have demonstrated letter error rates of under 6% for in-vocabulary words and under 11% for words not contained in the training lexicon, a 44% reduction in error rate over that achieved without use of the spoken form. We anticipate applying this technique to unknown words embedded in a larger context, followed by solicited spellings.

LET’S GO: Improving Spoken Dialog Systems for
the Elderly and Non-Natives

Antoine Raux, Brian Langner, Alan W. Black, Maxine Eskenazi; Carnegie Mellon University, USA

With the recent improvements in speech technology, it is now possible to build spoken dialog systems that basically work. However, such systems are designed and tailored for the general population. When users come from less general sections of the population, such as the elderly and non-native speakers of English, the accuracy of dialog systems degrades. This paper describes Let’s Go, a dialog system specifically designed to allow dialog experiments to be carried out with the elderly and non-native speakers in order to better tune such systems for these important populations. Let’s Go is designed to provide Pittsburgh area bus information. The basic system is described and our initial experiments are outlined.

Agents for Integrated Tutoring in Spoken Dialogue
Systems

Jaakko Hakulinen, Markku Turunen, Esa-Pekka Salonen; University of Tampere, Finland

In this paper, we introduce the concept of integrated tutoring in speech applications. An integrated tutoring system teaches the use of a system to a user while he/she is using the system in a typical manner. Furthermore, we introduce the general principles of how to implement applications with integrated tutoring agents and present an example implementation for an existing e-mail system. The main innovation of the approach is that the tutoring agents are part of the application, but implemented in a way which makes it possible to plug them into the system without modifying it. This is possible due to a set of small, stateless agents and a shared Information Storage provided by our system architecture. Integrated tutoring agents are easily expandable and configurable, and general agents can be shared between applications. We have also received positive feedback about integrated tutoring in initial user tests conducted with the implementation.

Session: PTuBf– Poster
Phonology & Phonetics II
Time: Tuesday 10.00, Venue: Main Hall, Level -1
Chair: Yoshinori Sagisaka, Waseda Univ., Japan

Corpus-Based Syntax-Prosody Tree Matching

Dafydd Gibbon; Universität Bielefeld, Germany

Empirical study of the syntax-prosody relation is hampered by the fact that current prosodic models are essentially linear, while syntactic structure is hierarchical. The present contribution describes a syntax-prosody comparison heuristic based on two new algorithms: Time Tree Induction (TTI), for building a prosodic treebank from time-annotated speech data, and Tree Similarity Indexing (TSI), for comparing syntactic trees with the prosodic trees. Two parametrisations of the TTI algorithm, for different tree branching conditions, are applied to sentences taken from a read-aloud narrative and compared with parses of the same sentences, using the TSI. In addition, null hypotheses in the form of flat bracketings of the sentences are compared. A preference for iambic (heavy rightmost branch) grouping is found. The resulting quantitative evidence for syntax-prosody relations has applications in speech genre characterisation and in duration models for speech synthesis.
A New Approach to Segment and Detect Syllables
from High-Speed Speech
D.W. Ying, W. Gao, W.Q. Wang; Chinese Academy of
Sciences, China
In this paper, we present a novel method to detect sound onsets and offsets, and apply it to detect and segment syllables in high-speed speech according to the characteristics of Mandarin. Our system detects onsets and offsets in 8 frequency bands with a two-layer integrate-and-fire neural network. The continuous speech is segmented based on the timing of onsets and offsets, and energy is used as an additional cue to locate the segmentation point. In order to improve segmentation accuracy, we introduce three time constraints by defining three refractory periods of neurons, which keep syllable length above a minimum. Although the boundaries between syllables in high-speed speech are not salient, our system can still segment individual syllables from speech robustly and accurately.
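The onset-timing idea above can be caricatured without the neural network: flag a frame whenever band energy rises sharply, then suppress further detections for a refractory period, which is what enforces a minimum syllable length. The threshold and refractory length below are invented for the sketch, not taken from the paper.

```python
# Sketch of per-band onset detection: an onset fires when the energy
# envelope rises by more than `threshold` between consecutive frames,
# and further onsets are suppressed for `refractory` frames (mimicking
# the refractory period of integrate-and-fire neurons in the paper).
# Threshold and refractory values are illustrative.

def detect_onsets(energy: list[float], threshold: float = 1.0,
                  refractory: int = 3) -> list[int]:
    onsets = []
    last = -refractory - 1  # allow an onset at the very start
    for t in range(1, len(energy)):
        if energy[t] - energy[t - 1] > threshold and t - last > refractory:
            onsets.append(t)
            last = t
    return onsets
```

The refractory window is the time-constraint mechanism: even if energy keeps rising frame after frame, only one onset per window is emitted, so two detected syllable starts can never be closer than the minimum spacing.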
Information Structure and Efficiency in Speech
Production
R.J.J.H. van Son, Louis C.W. Pols; University of
Amsterdam, The Netherlands
Speech is considered an efficient communication channel. This implies that the organization of utterances is such that more speaking effort is directed towards important parts than towards redundant parts. Based on a model of incremental word recognition, the
importance of a segment is defined as its contribution to word disambiguation. This importance is measured as the segmental information content, in bits. On a labeled Dutch speech corpus it is
then shown that crucial aspects of the information structure of utterances partition the segmental information content and explain
90% of the variance. Two measures of acoustical reduction, duration and spectral center of gravity, are correlated with the segmental
information content in such a way that more important phonemes
are less reduced. It is concluded that the organization of conventional information structure does indeed increase efficiency.
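The segmental information content used above is the surprisal of a segment given the cohort of words still compatible with the prefix heard so far. The sketch below computes it over a miniature lexicon; the words, frequencies, and function name are invented for illustration.

```python
import math

# Toy sketch of segmental information content: the surprisal, in bits,
# of the next segment given the cohort of lexicon entries matching the
# prefix so far. The miniature lexicon and its frequencies are invented.

LEXICON = {"cat": 10, "cap": 5, "can": 5, "dog": 20}

def segment_information(prefix: str, segment: str,
                        lexicon=LEXICON) -> float:
    """Bits contributed by `segment` after `prefix` toward word
    disambiguation: -log2 P(segment | cohort of prefix)."""
    cohort = {w: f for w, f in lexicon.items() if w.startswith(prefix)}
    total = sum(cohort.values())
    match = sum(f for w, f in cohort.items()
                if w.startswith(prefix + segment))
    if match == 0:
        raise ValueError("segment not in cohort")
    return -math.log2(match / total)
```

A segment that halves the cohort's probability mass carries 1 bit; one that narrows it to a quarter carries 2 bits, and on the paper's account should be correspondingly less reduced acoustically.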
Learning Rule Ranking by Dynamic Construction
of Context-Free Grammars Using AND/OR Graphs
Anna Corazza 1 , Louis ten Bosch 2 ; 1 University of
Milan, Italy; 2 University of Nijmegen, The Netherlands
This paper1 discusses a novel approach for the construction of
a context-free grammar based on a sequential processing of sen-
27
Eurospeech 2003
Tuesday
tences. The construction of the grammar is based on a search algorithm for the minimum weight subgraph in an AND/OR graph.
Aspects of optimality and robustness are discussed. The algorithm
plays an essential role in a model for adaptive learning of probabilistic ordering. The novelty in the proposed model is the combination
of well-established methods from two different disciplines: graph
theory and statistics.
September 1-4, 2003 – Geneva, Switzerland
able” theme [p. 656]. We shall show that intonational marking of
themes in German seems rather gradual. Themes in contrastive contexts have a significantly longer stressed vowel, a higher and longer
rise which results in a higher and more delayed peak than noncontrastive themes. Moreover, speakers can use different strategies
to signal the contrast.
The set-up of this paper is mainly theoretical, and we follow a quite
formal approach. There is a close link with Optimality Theory, one
of the mainstream approaches in phonology, and with graph theory. The resulting techniques, however, can be applied in a more
general domain.
Data were elicited by reading short paragraphs with a contrastive
and non-contrastive pre-context. The use of many filler texts distracted subjects’ attention from the contrast so that the data may be
regarded as highly natural. Implementing these prosodic features
in speech synthesis systems might help to avoid unnatural exaggerated prosodic realisations.
The Effect of Surrounding Phrase Lengths on Pause
Duration
Accentual Lengthening in Standard Chinese:
Evidence from Four-Syllable Constituents
Elena Zvonik, Fred Cummins; University College
Dublin, Ireland
Yiya Chen; University of Edinburgh, U.K.
Little is known about the determining influences on the length of
silent intervals at IP boundaries and no current models accurately
predict their duration. The contribution of independent factors
with different characteristic properties to pause duration needs to
be explored. The present study seeks to investigate if pause duration is correlated with the length of sentences or phrases preceding and following a pause. We find that two independent factors
– the length of an IP (intonational phrase) preceding a pause and
the length of an IP following a pause combine superadditively. The
probability of a pause being short (<300 ms) rises greatly if both
the preceding and the following phrases are short(<=10 syllables).
Statistical Estimation of Phoneme’s Most Stable
Point Based on Universal Constraint
This study examines the pattern of accentual lengthening (AL) over
four-syllable mono-morphemic words in Standard Chinese (SC). I
show that 1) the domain of AL in SC is best characterized as the
constituent that is under focus; 2) the distribution of AL over a
focused domain is non-uniform and there is a strong tendency of
edge effect with the last syllable lengthened the most; and 3) different prosodic boundaries do not block but attenuate the spread
of AL with different magnitudes. These results are also compared
to the results of studies on AL in languages such as English and
Dutch. While there are similarities of AL in these two typologically
different languages, which open the possibility that some effects of
AL are universal, there are clearly important differences in the way
that AL is distributed over the focused constituent in different languages, due to the specific phonology of the language.
Syllable Structure Based Phonetic Units for
Context-Dependent Continuous Thai Speech
Recognition
Shigeki Okawa 1 , Katsuhiko Shirai 2 ; 1 Chiba Institute
of Technology, Japan; 2 Waseda University, Japan
In this paper, we present a statistical approach for phoneme extraction based on universal constraint. Inspired by former phonological
studies, we assume a fictitious point in each phoneme that exhibits
the most stable information to explain the phoneme’s existence.
With the universal constraint of phoneme definitions, the point is
statistically estimated by an iterative procedure to maximize the
local likelihood using a large amount of speech data. We also mention a context dependent modeling of the proposed approach and
its integration strategy to obtain more stability. The experimental
results show favorable convergencies of both the fictitious points
and their likelihoods, which give usefulness for the stable phoneme
modeling.
Independent Automatic Segmentation by
Self-Learning Categorial Pronunciation Rules
N. Beringer; Ludwig-Maximilians-Universität München
, Germany
The goal of this paper is to present a new method to automatically
generate pronunciation rules for automatic segmentation of speech
– the German MAUSER system.
MAUSER is an algorithm which generates pronunciation rules independently of any domain dependent training data either by clustering and statistically weighting self-learned rules according to a
small set of phonological rules clustered by categories or by reweighting “seen”’ phonological rules. By this method we are able to
automatically segment cost-effectively large corpora of mainly unprompted speech.
Prosodic Correlates of Contrastive and
Non-Contrastive Themes in German
Bettina Braun 1 , D. Robert Ladd 2 ; 1 Saarland
University, Germany; 2 University of Edinburgh, U.K.
Semantic theories on focus and information structure assume that
there are different accent types for thematic (backward-looking,
known) and rhematic (forward-looking, new) information in languages as English and German. According to Steedman [1], thematic material may only be intonationally marked (= bear a pitch
accent), if it “contrasts with a different established or accommodat-
Supphanat Kanokphara; NECTEC, Thailand
Choice of the phonetic units for speech recognizer is a factor greatly
affecting the system performance. Phonetic units are normally defined according to the acoustic properties in parts of speech. Nevertheless, with the limit of training data, too delicate acoustic properties are ignored. Syllable structure is one of the properties usually
ignored in English phonetic units due to the structure complexity.
Some language like Chinese successfully gets the benefit from incorporating this property in the phonetic units, as the language itself
is naturally syllabic and has only small amount of subsegments (onsets, nuclei, and codas). Thai, as some point between English and
Chinese, has more subsegments than Chinese but not as much as
English. There are two main steps in this paper. First, prove that
Thai phonetic units can be defined as a set of subsegments without
any data sparseness problem. Second, demonstrate that subsegmental phonetic units give better accuracy rate from integrating
the syllable structure information and reduce a lot of number of
triphone units because of left and right context constraints in the
syllable structure.
An Acoustic Phonetic Analysis of Diphthongs in
Ningbo Chinese
Fang Hu; Chinese Academy of Social Sciences, China
This paper describes the acoustic phonetic properties of diphthongs in Ningbo Chinese. Data from 20 speakers indicate that (1)
falling diphthongs have both onset and offset steady states while
rising diphthongs only have steady states on the offset element; (2)
both falling and rising diphthongs begin from an onset frequency
area close to their target vowels, but only the normal-length rising diphthongs reach the offset target, and falling and short rising diphthongs stop at somewhere before reaching the target; (3)
diphthongs can be well characterized by the F2 rate of change as
far as the falling diphthongs are concerned, whereas the data lack consistency when rising diphthongs are also taken into account. Results
show that the temporal organization within diphthongs, formant
patterns, and formant rate of change all contribute to the characterization of Ningbo diphthongs.
Eurospeech 2003
Tuesday
September 1-4, 2003 – Geneva, Switzerland
Latent Ability to Manipulate Phonemes by
Japanese Preliterates in Roman Alphabet
Takashi Otake, Yoko Sakamoto; Dokkyo University, Japan
Recent studies in spoken word recognition show that Japanese listeners with or without alphabetic knowledge can access phonemes during word activation. This suggests that even users of a mora-based language can recognize a submoraic unit. The present study investigates the possibility of a latent ability to manipulate phonemes to search for and construct new words among Japanese preliterates in the Roman alphabet. Three experiments were conducted. In Experiment 1 it was tested whether they could find embedded words by deleting word-initial consonants. In Experiments 2 and 3 it was tested whether they could construct new words by manipulating consonants and vowels at word-initial and medial positions. The results show that they managed these tasks with high accuracy, suggesting that they indeed have a latent ability to manipulate phonemes to search for and construct new words.
The /i/-/a/-/u/-ness of Spoken Vowels
Hartmut R. Pfitzinger; University of Munich, Germany
This paper investigates acoustic, phonetic, and phonological representations of spoken vowels. For this purpose four experiments
have been conducted. First, by drawing the analogy between the
spectral energy distribution of vowels and the vowel space concept
of Dependency Phonology, we achieve a new phonologically motivated vowel quality representation of spoken vowels which we name
the /i/-/a/-/u/-ness. As a second step, it is shown that the extension of this approach is connected with the work of Pols, van der
Kamp & Plomp 1969 [1] who, among other things, predicted formant frequencies from the spectral energy distribution of vowels.
Third, the vowel quality relating to the IPA vowel diagram is derived
directly from the spectral energy distribution. Finally, we compare
this method with a formant and fundamental frequency based approach introduced by Pfitzinger 2003 [2]. While both the /i/-/a/-/u/-ness of vowels and the perceived vowel quality prediction
are quite robust and therefore useful for both signal pre-processing
and vowel quality research, the formant prediction achieved the
lowest accuracy for the mapping to the IPA vowel diagram.
Session: PTuBg – Poster
Speech Modeling & Features II
Time: Tuesday 10.00, Venue: Main Hall, Level -1
Chair: Bojan Petek, University of Ljubljana, Slovenia
Time-Domain Based Temporal Processing with
Application of Orthogonal Transformations
Petr Motlíček, Jan Černocký; Brno University of
Technology, Czech Republic
In this paper, a novel approach that efficiently extracts the temporal information of speech is proposed. The algorithm operates entirely in the time domain, and its preprocessing blocks are well justified by psychoacoustic studies. The results show the different properties of the proposed algorithm compared to the traditional approach; the algorithm is advantageous in terms of possible modifications and computational inexpensiveness. In our experiments, we then focused on different representations of the time trajectories. Classical methods that are efficient in conventional feature extraction approaches proved unsuitable for approximating temporal trajectories of speech. However, applying an orthogonal transformation, such as the discrete Fourier transform or the discrete cosine transform, on top of the previously derived temporal trajectories outperforms classification in the original domain. In addition, these transformed features are very efficient at reducing the dimensionality of the data.
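The projection of a temporal trajectory onto orthogonal bases that the abstract describes can be sketched as follows. This is an illustrative reconstruction, not the authors' code; the function name, trajectory length, and coefficient count are our assumptions:

```python
import numpy as np

def dct_trajectory_features(trajectory, n_coeffs=15):
    """Project a temporal trajectory (e.g. ~100 frames of log critical-band
    energy) onto the first few DCT-II basis vectors, reducing dimensionality
    while keeping the slow temporal modulations."""
    n = len(trajectory)
    k = np.arange(n_coeffs)[:, None]                   # coefficient index
    t = np.arange(n)[None, :]                          # frame index
    basis = np.cos(np.pi * k * (2 * t + 1) / (2 * n))  # unnormalized DCT-II bases
    return basis @ np.asarray(trajectory, dtype=float)
```

A constant trajectory (a fixed channel gain in the log domain) projects entirely onto coefficient 0, which is one reason such transforms pair well with mean normalization.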
Recognition of Phoneme Strings Using TRAP
Technique
Petr Schwarz, Pavel Matějka, Jan Černocký; Brno
University of Technology, Czech Republic
We investigate and compare several techniques for automatic recognition of unconstrained context-independent phoneme strings from the TIMIT and NTIMIT databases. Among the compared techniques, the technique based on TempoRAl Patterns (TRAP) achieves the best results on clean speech, with about 10% relative improvement over the baseline system. Its advantage is also observed in the presence of a mismatch between training and testing conditions. Issues such as the optimal length of temporal patterns in the TRAP technique, and the effectiveness of mean and variance normalization of the patterns and of the multi-band input to the TRAP estimators, are also explored.
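The mean and variance normalization of temporal patterns that the abstract examines amounts to a simple per-pattern operation; a minimal sketch (the function name is ours):

```python
import numpy as np

def normalize_trap_pattern(pattern):
    """Mean- and variance-normalize one temporal pattern of log
    critical-band energies before it is fed to a band classifier.
    In the log domain a fixed channel gain is an additive offset,
    so mean subtraction removes it."""
    p = np.asarray(pattern, dtype=float)
    std = p.std()
    return (p - p.mean()) / (std if std > 0 else 1.0)
```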
Comparative Study on Hungarian Acoustic Model
Sets and Training Methods
Tibor Fegyó, Péter Mihajlik, Péter Tatai; Budapest University of Technology and Economics, Hungary
In recent speech recognition systems the base unit of recognition is generally the speech sound. To each speech sound an acoustic model is associated, whose parameters are estimated by statistical methods. The proper training data fundamentally determine the efficiency of the recognizer. Present-day technology and computational capacity allow speech recognition systems to operate with large dictionaries and complex language models, but the quality of the basic pattern-matching units has a large influence on the reliability of the system. In the experiments presented here we investigated the effects of different training methods on recognition accuracy; namely, the effect of increasing the number of speakers and the number of mixtures was examined in the case of pronunciation modeling and context-independent models.
A Computational Model of Arm Gestures in
Conversation
Dafydd Gibbon, Ulrike Gut, Benjamin Hell, Karin Looks, Alexandra Thies, Thorsten Trippel; Universität Bielefeld, Germany
Currently no standardised gesture annotation systems are available. As a contribution towards solving this problem, we present CoGesT, a machine-processable and human-usable computational model for the annotation of a subset of conversational gestures; its empirical and formal properties are detailed, and application areas are discussed.
Nonlinear Analysis of Speech Signals: Generalized
Dimensions and Lyapunov Exponents
Vassilis Pitsikalis, Iasonas Kokkinos, Petros Maragos; National Technical University of Athens, Greece
In this paper, we explore modern methods and algorithms from fractal/chaotic systems theory for modeling speech signals in a multidimensional phase space and extracting characteristic invariant measures like generalized fractal dimensions and Lyapunov exponents. Such measures can capture valuable information for the characterisation of the multidimensional phase space – which is closer to the true dynamics – since they are sensitive to the frequency with which the attractor visits different regions and the rate of exponential divergence of nearby orbits, respectively. Further, we examine the classification capability of related nonlinear features over broad phoneme classes. The results of these preliminary experiments indicate that the information carried by these novel nonlinear feature sets is important and useful.
F0 Estimation of One or Several Voices
Alain de Cheveigné, Alexis Baskind; IRCAM-CNRS, France
A methodology is presented for fundamental frequency estimation of one or more voices. The signal is modeled as the sum of one or more periodic signals, and the parameters are estimated by search with interpolation. Accurate, reliable estimates are obtained for each frame without tracking or continuity constraints, and without the use of specific instrument models (although their use might further boost performance). In formal evaluation over a large database of speech, the single-voice algorithm outperformed the best competing methods by a factor of three.
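The fit of a periodic-signal model by search can be illustrated, for the single-voice case, with a crude period search over a normalized difference function. This is a stand-in sketch, not the authors' algorithm, and it omits the interpolation step their method uses:

```python
import numpy as np

def estimate_f0(x, fs, f_min=50.0, f_max=500.0):
    """Pick the candidate period tau (in samples) that minimizes the
    normalized mean squared difference x[t] - x[t+tau]; prefer the
    shortest period among near-equal minima to avoid octave errors."""
    best_tau, best_cost = None, np.inf
    for tau in range(int(fs / f_max), int(fs / f_min) + 1):
        d = x[:-tau] - x[tau:]
        cost = np.mean(d * d) / (np.mean(x * x) + 1e-12)
        if cost < best_cost - 1e-6:   # require a strict improvement
            best_tau, best_cost = tau, cost
    return fs / best_tau
```

A real system would refine the integer period by parabolic interpolation of the cost around the minimum, which the abstract's "search with interpolation" suggests.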
In Search Of Target Class Definition In Tandem
Feature Extraction
Sunil Sivadas, Hynek Hermansky; Oregon Health &
Science University, USA
In the tandem feature extraction scheme a Multi-Layer Perceptron
(MLP) with softmax output layer is discriminatively trained to estimate context independent phoneme posterior probabilities on a labeled database. The outputs of the MLP after nonlinear transformation and Principal Component Analysis (PCA) are used as features
in a Gaussian Mixture Model (GMM) based recognizer. The baseline
tandem system is trained on 56 Context Independent (CI) phoneme
targets. In this paper we examine alternatives to CI phoneme targets
by grouping phonemes using a priori and data-derived knowledge. On a connected digit recognition task we achieve performance comparable to the baseline system using fewer data-derived
classes.
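The tandem pipeline described above – log compression of MLP posteriors followed by PCA decorrelation – can be sketched as follows; an illustrative reconstruction in which the dimensions and names are our assumptions:

```python
import numpy as np

def tandem_features(posteriors, n_components=24):
    """Turn MLP phoneme posteriors into tandem features: log compression
    (the nonlinear transformation) followed by PCA.
    `posteriors`: (n_frames, n_phones) array, rows summing to 1."""
    logp = np.log(posteriors + 1e-10)           # gaussianize the skewed posteriors
    centered = logp - logp.mean(axis=0)
    cov = np.cov(centered, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)      # ascending eigenvalues
    proj = eigvecs[:, ::-1][:, :n_components]   # keep top components
    return centered @ proj

```

The resulting decorrelated features are then modeled by the diagonal-covariance Gaussians of a conventional GMM-HMM recognizer.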
Segmentation of Speech for Speaker and Language
Recognition
André G. Adami, Hynek Hermansky; Oregon Health &
Science University, USA
Current Automatic Speech Recognition systems convert the speech
signal into a sequence of discrete units, such as phonemes, and
then apply statistical methods on the units to produce the linguistic
message. Similar methodology has also been applied to recognize
speaker and language, except that the output of the system can be
the speaker or language information. Therefore, we propose the
use of temporal trajectories of fundamental frequency and short-term energy to segment and label the speech signal into a small set
of discrete units that can be used to characterize speaker and/or
language. The proposed approach is evaluated using the NIST Extended Data Speaker Detection task and the NIST Language Identification task.
Feature Generation Based on Maximum
Classification Probability for Improved Speech
Recognition
Xiang Li, Richard M. Stern; Carnegie Mellon
University, USA
Feature representation is a very important factor that has a great effect on the performance of speech recognition systems. In this paper we focus on a feature generation process that is based on linear transformation of the original log-spectral representation. We first discuss three popular linear transformation methods: Mel-Frequency Cepstral Coefficients (MFCC), Principal Component Analysis (PCA), and Linear Discriminant Analysis (LDA). We then propose a new method of linear transformation that maximizes the normalized acoustic likelihood of the most likely state sequences of training data, a measure that is directly related to our ultimate objective
of reducing Bayesian classification error rate in speech recognition.
Experimental results show that the proposed method decreases the
relative word error rate by more than 8.8% compared to the best implementation of LDA, and by more than 25.9% compared to MFCC
features.
Speech Recognition with a Generative Factor
Analyzed Hidden Markov Model
Learning Discriminative Temporal Patterns in
Speech: Development of Novel TRAPS-Like
Classifiers
Barry Chen 1, Shuangyu Chang 2, Sunil Sivadas 3; 1 International Computer Science Institute, USA; 2 University of California at Berkeley, USA; 3 Oregon Health & Science University, USA
Motivated by the temporal processing properties of human hearing,
researchers have explored various methods to incorporate temporal and contextual information in ASR systems. One such approach,
TempoRAl PatternS (TRAPS), takes temporal processing to the extreme and analyzes the energy pattern over long periods of time
(500 ms to 1000 ms) within separate critical bands of speech. In
this paper we extend the work on TRAPS by experimenting with two
novel variants of TRAPS developed to address some shortcomings
of the TRAPS classifiers. Both the Hidden Activation TRAPS (HATS)
and Tonotopic Multi-Layer Perceptrons (TMLP) require 84% fewer parameters than TRAPS but can achieve significant phone recognition
error reduction when tested on the TIMIT corpus under clean, reverberant, and several noise conditions. In addition, the TMLP performs training in a single stage and does not require critical-band
level training targets. Using these variants, we find that approximately 20 discriminative temporal patterns per critical band are sufficient for good recognition performance. In combination with a
conventional PLP system, these TRAPS variants achieve significant
additional performance improvements.
Using Mutual Information to Design Class-Specific
Phone Recognizers
Patricia Scanlon 1, Daniel P.W. Ellis 1, Richard Reilly 2; 1 Columbia University, USA; 2 University College Dublin, Ireland
Information concerning the identity of subword units such as
phones cannot easily be pinpointed because it is broadly distributed in time and frequency. Continuing earlier work, we use
Mutual Information as a measure of the usefulness of individual time-frequency cells for various speech classification tasks, using the
hand-annotations of the TIMIT database as our ground truth. Since
different broad phonetic classes such as vowels and stops have such
different temporal characteristics, we examine mutual information
separately for each class, revealing structure that was not uncovered in earlier work; further structure is revealed by aligning the
time-frequency displays of each phone at the center of their hand-marked segments, rather than averaging across all possible alignments within each segment. Based on these results, we evaluate a
range of vowel classifiers over the TIMIT test set and show that selecting input features according to the mutual information criteria
can provide a significant increase in classification accuracy.
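Ranking time-frequency cells by mutual information with the class label, as described above, can be sketched with a simple histogram estimator. This is our own illustrative implementation, not the authors' code:

```python
import numpy as np

def mutual_information(cell_values, labels, n_bins=8):
    """MI (in bits) between one quantized time-frequency cell and the
    class label, estimated from joint histogram counts."""
    edges = np.histogram_bin_edges(cell_values, n_bins)
    q = np.digitize(cell_values, edges[1:-1])      # quantize to 0..n_bins-1
    joint = np.zeros((n_bins, int(max(labels)) + 1))
    for v, c in zip(q, labels):
        joint[v, c] += 1
    p = joint / joint.sum()
    px = p.sum(axis=1, keepdims=True)              # marginal over cells
    py = p.sum(axis=0, keepdims=True)              # marginal over classes
    nz = p > 0
    return float((p[nz] * np.log2(p[nz] / (px @ py)[nz])).sum())
```

Selecting the top-k cells by this score is then a straightforward sort, which is how an MI criterion can drive input-feature selection for a classifier.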
Estimation of GMM in Voice Conversion Including
Unaligned Data
Helenca Duxans, Antonio Bonafonte; Universitat
Politècnica de Catalunya, Spain
Kaisheng Yao 1, Kuldip K. Paliwal 2, Te-Won Lee 1; 1 University of California at San Diego, USA; 2 Griffith University, Australia
We present a generative factor analyzed hidden Markov model (GFA-HMM) for automatic speech recognition. In a traditional HMM, the observation vectors are represented by a mixture of Gaussians (MoG) that is dependent on a discrete-valued hidden state sequence. The GFA-HMM introduces a hierarchy of continuous-valued latent representations of the observation vectors, where latent vectors at one level are acoustic-unit dependent and latent vectors at a higher level are acoustic-unit independent. An expectation maximization (EM) algorithm is derived for maximum likelihood parameter estimation of the model. The GFA-HMM can achieve a much more compact representation of the intra-frame statistics of observation vectors than a traditional HMM. We conducted an experiment to show that the GFA-HMM can achieve better performance than the traditional HMM with the same amount of training data but a much smaller number of model parameters.
Voice conversion consists in transforming a source speaker voice
into a target speaker voice. There are many applications of voice
conversion systems where the amount of training data from the
source speaker and the target speaker is different. Usually, the
amount of source data available is large, but it is desired to estimate the transformation with a small amount of target data.
Systems based on joint Gaussian Mixture Models (GMM) are well
suited to voice conversion [1], but they can’t deal with source data
without its corresponding aligned target data.
In this paper, two alternatives are studied to incorporate unaligned
source data in the estimation of a GMM for a voice conversion task.
It is shown that when only a limited amount of aligned data is available in the training step, additionally including unaligned data from the source speaker increases the performance of the voice transformation.
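The joint-GMM regression that such conversion systems use to map a source frame to a target frame can be sketched as follows. This is a minimal sketch of the standard conversion function, not the authors' estimation procedure; the shapes and names are our assumptions:

```python
import numpy as np

def convert_frame(x, weights, mu_x, mu_y, cov_xx, cov_yx):
    """Expected target vector under a joint GMM:
    E[y|x] = sum_k p(k|x) * (mu_y[k] + cov_yx[k] cov_xx[k]^-1 (x - mu_x[k])).
    Shapes: weights (K,), mu_x/mu_y (K, D), cov_xx/cov_yx (K, D, D)."""
    K, D = mu_x.shape
    log_resp = np.empty(K)
    for k in range(K):                     # Gaussian responsibilities p(k|x)
        diff = x - mu_x[k]
        inv = np.linalg.inv(cov_xx[k])
        _, logdet = np.linalg.slogdet(cov_xx[k])
        log_resp[k] = np.log(weights[k]) - 0.5 * (diff @ inv @ diff + logdet)
    resp = np.exp(log_resp - log_resp.max())
    resp /= resp.sum()
    y = np.zeros(D)
    for k in range(K):                     # mixture of per-component regressions
        y += resp[k] * (mu_y[k] + cov_yx[k] @ np.linalg.inv(cov_xx[k]) @ (x - mu_x[k]))
    return y
```

The aligned/unaligned distinction in the abstract concerns how the joint means and covariances are estimated; the conversion step itself is unchanged.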
Trajectory Modeling Based on HMMs with the
Explicit Relationship Between Static and Dynamic
Features
Keiichi Tokuda, Heiga Zen, Tadashi Kitamura;
Nagoya Institute of Technology, Japan
This paper shows that the HMM whose state output vector includes
static and dynamic feature parameters can be reformulated as a
trajectory model by imposing the explicit relationship between the
static and dynamic features. The derived model, named trajectory
HMM, can alleviate the limitations of HMMs: i) constant statistics
within an HMM state and ii) independence assumption of state output probabilities. We also derive a Viterbi-type training algorithm
for the trajectory HMM. A preliminary speech recognition experiment based on N-best rescoring demonstrates that the training algorithm can improve the recognition performance significantly even
though the trajectory HMM has the same parameterization as the
standard HMM.
On the Advantage of Frequency-Filtering Features
for Speech Recognition with Variable Sampling
Frequencies. Experiments with SpeechDatCar
Databases
Hermann Bauerecker, Climent Nadeu, Jaume Padrell;
Universitat Politècnica de Catalunya, Spain
When a speech recognition system has to work with signals corresponding to different sampling frequencies, multiple acoustic models may have to be maintained. To avoid this drawback, the system
can be trained at the highest expected sampling frequency and the
acoustic models are subsequently converted to the new sampling frequency. However, the usual mel-frequency cepstral coefficients are not well suited to this approach since they are not localized in the frequency domain. For this reason, we propose in this paper to address that problem with the features resulting from frequency-filtering
the logarithmic band energies. Experimental results are reported
with SpeechDatCar databases, at 16 kHz, 11 kHz, and 8 kHz sampling rates, which show no degradation in terms of recognition performance for 11/8 kHz testing signals when the system, trained at
16 kHz, is converted, in an inexpensive way, to 11/8 kHz, instead
of directly training the system at 11/8 kHz.
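Frequency filtering of the log band energies is a short FIR filter applied across the band index rather than across time; a minimal sketch of the commonly used filter H(z) = z - z^-1 (our illustration, not code from the paper):

```python
import numpy as np

def frequency_filtered(log_energies):
    """Frequency-filtered (FF) features for one frame: apply z - z^-1
    across the band index, i.e. FF[i] = S[i+1] - S[i-1], with zero
    padding at both band edges."""
    s = np.concatenate(([0.0], np.asarray(log_energies, dtype=float), [0.0]))
    return s[2:] - s[:-2]
```

Because each FF coefficient depends only on neighboring bands, models trained at a high sampling rate can be mapped to a lower rate by discarding the top bands, which is the frequency-domain locality the abstract exploits.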
Towards the Automatic Extraction of Fujisaki
Model Parameters for Mandarin
Hansjörg Mixdorff 1, Hiroya Fujisaki 2, Gao Peng Chen 3, Yu Hu 3; 1 Berlin University of Applied Sciences, Germany; 2 University of Tokyo, Japan; 3 University of Science and Technology of China, China
The generation of naturally-sounding F0 contours in TTS enhances the intelligibility and perceived naturalness of synthetic speech. In earlier works the first author developed a linguistically motivated model of German intonation based on the quantitative Fujisaki model of the production process of F0, and an automatic procedure for extracting the parameters from the F0 contour which, however, was specific to German. As has been shown by Fujisaki and his co-workers, parametrization of F0 contours of Mandarin requires negative tone commands, as well as a more precise control of F0 associated with the syllabic tones. This paper presents an approach to the automatic parameter estimation for Mandarin, as well as first results concerning the accuracy of estimation. The paper also introduces a recently developed tool for editing Fujisaki parameters featuring resynthesis, which will soon be publicly available.
Harmonic Weighting for All-Pole Modeling of the
Voiced Speech
Davor Petrinovic; University of Zagreb, Croatia
A new distance measure for all-pole modeling of voiced speech is introduced in this paper. It can easily be integrated within the concept of discrete Weighted Mean Square Error (WMSE) all-pole modeling by a suitable choice of the modeling weights. The proposed weighting addresses problems such as harmonic estimation reliability, the perceptual significance of the harmonics, and model mismatch errors. A robust estimator is proposed to reduce the effect of outliers caused by spectral nulls or additive non-speech contributions (e.g. background noise or music). It is demonstrated that the proposed all-pole estimation can significantly improve the performance of speech coders based on a sinusoidal model, since the harmonic magnitudes are modeled better by the WMSE all-pole model.
Estimation of Resonant Characteristics Based on
AR-HMM Modeling and Spectral Envelope
Conversion of Vowel Sounds
Nobuyuki Nishizawa, Keikichi Hirose, Nobuaki Minematsu; University of Tokyo, Japan
A new method was developed for accurately separating the source and articulation filter characteristics of speech. The method is based on AR-HMM modeling, where the residual waveform is expressed as the output sequence of an HMM. To realize an accurate analysis, a scheme of dividing HMM states was newly introduced. Using the AR-filter parameter values obtained through the analysis, we can construct a vocoder-type formant synthesizer, where the residual waveform is used as the excitation source. A listening test on vowel sounds synthesized using the AR-filter from one vowel and the excitation waveform from another showed that "flexible" synthesis with high controllability over the acoustic parameters was possible with our formant synthesis configuration.
Session: PTuBh – Poster
Topics in Speech Recognition &
Segmentation
Time: Tuesday 10.00, Venue: Main Hall, Level -1
Chair: John Makhoul, BBN Technologies, USA
Utterance Verification Under Distributed Detection
and Fusion Framework
Taeyoon Kim, Hanseok Ko; Korea University, Korea
In this paper, we consider an application of the distributed detection and fusion framework to utterance verification (UV) and confidence measure (CM) objectives. We formulate UV as a distributed detection and Bayesian fusion problem by combining various individual UV methods, and design an optimal fusion rule that achieves the minimum error rate. In the relevant isolated-word OOV rejection experiments, the proposed method consistently outperforms the individual UV methods.
Joint Estimation of Thresholds in a Bi-Threshold
Verification Problem
Simon Ho, Brian Mak; Hong Kong University of Science & Technology, China
Verification problems are usually posed as a 2-class problem in which the objective is to verify whether an observation belongs to a class, say A, or its complement A'. However, we find that in a computer-assisted language learning application, because of the relatively low reliability of phoneme verification – with an equal error rate of more than 30% – a system built on a conventional phoneme verification algorithm needs to be improved. In this paper, we propose to cast the problem as a 3-class verification problem with the addition of an "in-between" class besides A and A'. As a result, there are two thresholds to be designed in such a system. Although one may determine the two thresholds independently, better performance can be obtained by a joint estimation of these thresholds, allowing small deviations from the specified false acceptance and false rejection rates. This paper describes a cost-based approach to do that. Furthermore, issues such as per-phoneme thresholds vs. phoneme-class thresholds, and the use of the bagging technique to improve the stability of the thresholds, are investigated. Experimental results on a kids' corpus show that cost-based thresholds and bagging improve verification performance.
Product of Gaussians as a Distributed
Representation for Speech Recognition
S.S. Airey, M.J.F. Gales; Cambridge University, U.K.
Distributed representations allow the effective number of Gaussian components in a mixture model, or state of an HMM, to be increased without dramatically increasing the number of model parameters. Various forms of distributed representation have previously been investigated. In this work it is shown that the product of experts (PoE) framework may be viewed as a distributed representation when the individual experts are mixtures of Gaussians. However, in contrast to the standard PoE model, the individual experts are not required to be valid distributions, thus allowing additional flexibility in the component priors and variances. The performance of PoE models when used as a distributed representation on a large vocabulary speech recognition task, SwitchBoard, is evaluated.
Confidence Measures for Phonetic Segmentation of
Continuous Speech
Samir Nefti 1, Olivier Boëffard 1, Thierry Moudenc 2; 1 IRISA, France; 2 France Télécom R&D, France
In the context of text-to-speech synthesis, this contribution deals
with the segmentation of speech into phone units. Using an HMM
based segmentation system, we proceed to compare several phone-level confidence measures to detect potential local mismatches between the phone labels and the acoustics. As well as serving this
purpose, these confidence measures will help the system suggest
a new local graph of hypotheses for the Markovian segmentation
system. We propose a new formulation of a frame-based posterior
probability confidence measure which gives the best results for all
of our experiments over a bench of six confidence measures. Adopting a hypothesis-testing formulation, this posterior frame-based
measure gives an EER of 12% for a randomly blurred test database.
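A frame-based posterior confidence measure of the kind compared above can be sketched as the segment-averaged log posterior of the hypothesized phone. This is an illustrative form, not necessarily the authors' exact formulation:

```python
import numpy as np

def segment_confidence(frame_posteriors, phone_idx, start, end):
    """Frame-based posterior confidence for one segment: average the
    per-frame log posterior of the hypothesized phone over its frames,
    so scores are comparable across segments of different lengths.
    `frame_posteriors`: (n_frames, n_phones) array."""
    frames = frame_posteriors[start:end, phone_idx]
    return float(np.mean(np.log(frames + 1e-10)))
```

Thresholding such a score per segment is what lets the system flag label/acoustics mismatches and propose alternative local hypotheses.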
Using Confidence Measures and Domain
Knowledge to Improve Speech Recognition
Pascal Wiggers, Leon J.M. Rothkrantz; Delft University
of Technology, The Netherlands
In speech recognition domain knowledge is usually implemented by
training specialized acoustic and language models. This requires
large amounts of training data for the domain. When such data is
not available there often still exists external knowledge, obtainable
through other means, that might be used to constrain the search
for likely utterances. This paper presents a number of methods to
exploit such knowledge; an adaptive language model and a lattice
rescoring approach based on Bayesian updating. To decide whether
external knowledge is applicable a word level confidence measure
is implemented.
As a special case of the general problem, station-to-station travel frequencies are considered to improve recognition accuracy in a train timetable dialog system. Experiments are described that test and compare the different techniques.
Isolated Word Verification Using Cohort
Word-Level Verification
K. Thambiratnam, Sridha Sridharan; Queensland
University of Technology, Australia
Isolated Word Verification (IWV) is the task of verifying the occurrence of a keyword at a specified location within a speech stream.
Typical applications of IWV are to reduce the number of incorrect
results output by a speech recognizer or keyword spotter. Such
algorithms are also vital in reducing the false alarm rate in many
commercial applications of speech recognition, such as automated
telephone transaction systems and audio database search engines.
In this paper, we propose a new method of isolated word verification that we call Cohort Word-level Verification (CWV). The CWV
method attempts to increase IWV performance by incorporating
higher level linguistic and word level information into the selection
of non-keyword models for verification. When used in conjunction
with speech background model based IWV, we are able to achieve
significant performance improvements for IWV of short words.
A New Approach to Minimize Utterance
Verification Error Rate for a Specific Operating
Point
Wing-Hei Au, Man-Hung Siu; Hong Kong University of
Science & Technology, China
In many telephony applications that use speech recognition, it is
important to identify and reject out-of-vocabulary words or utterances without keywords by means of utterance verification (UV).
Typically, UV is performed based on the likelihood ratio of the target model versus an alternative model. The “goodness” of the models and the particular criteria used for estimating these models can
have significant impact on its performance. Because the UV problem
can be considered as a two-class classification problem, minimum
classification error (MCE) training is a natural choice. Earlier work
has focused on MCE training to reduce total classification errors.
In this paper, we extend the MCE approach to minimize the error
rates. In particular, we focus on the error rates at certain operating
points and show how this can result in a significant EER reduction
for phone verification on TIMIT and a non-native kids' corpus.
While the particular technique is developed on utterance verification, it can also be generalized for other verification tasks such as
speaker verification.
Continuous Speech Recognition and Verification
Based on a Combination Score
Binfeng Yan, Rui Guo, Xiaoyan Zhu; Tsinghua
University, China
In this paper we present a speech recognition and verification method based on the integration of likelihood and likelihood ratio. Speech recognition and verification are unified in a one-phase framework. A modified agglomerative hierarchical clustering algorithm is adopted to train the alternative model used in speech verification. In the decoding process, the likelihood ratio is combined with the likelihood to obtain a combination score for searching for the final results. Our experimental results show that the false-alarm rate decreases considerably with only a slight loss in accuracy.
Impact of Word Graph Density on the Quality of
Posterior Probability Based Confidence Measures
Tibor Fabian, Robert Lieb, Günther Ruske, Matthias
Thomae; Technical University of Munich, Germany
Our new experimental results, presented in this paper, clearly prove
the dependence between word graph density and the quality of two
different confidence measures. Both confidence measures are based
on the computation of the posterior probabilities of the hypothesized words and apply the time alignment information of the word
graph for confidence score accumulation. We show that the quality
of the confidence scores of both confidence measures significantly
increases for higher word graph densities. The analyses were carried out on two different German spontaneous speech corpora: on
the Verbmobil evaluation corpus [1] and on the NaDia corpus. We
achieved a relative reduction of the confidence error rate by up to
41.4%, compared to the baseline confidence error rate. The results
lead us to propose to perform the confidence score calculation –
based on posterior probability accumulation – on higher word graph
densities in order to get the best results.
An Efficient Keyword Spotting Technique Using a
Complementary Language for Filler Models
Training
Panikos Heracleous 1 , Tohru Shimizu 2 ; 1 Nara
Institute of Science and Technology, Japan; 2 KDDI
R&D Laboratories Inc., Japan
The task of keyword spotting is to detect a set of keywords in the
input continuous speech. In a keyword spotter, not only the keywords, but also the non-keyword intervals must be modeled. For
this purpose, filler (or garbage) models are used. To date, most of
the keyword spotters have been based on hidden Markov models
(HMMs). More specifically, a set of HMMs is used as garbage models. In this paper, a two-pass keyword spotting technique based
on bilingual hidden Markov models is presented. In the first pass,
our technique uses phonemic garbage models to represent the non-keyword intervals, and in the second stage the putative hits are verified using normalized scores. The main difference from similar approaches lies in the way the non-keyword intervals are modeled. In
this work, the target language is Japanese, and English was chosen
as the ‘garbage’ language for training the phonemic garbage models.
Experimental results on both clean and noisy telephone speech data
showed higher performance compared with using a common set of
acoustic models. Moreover, parameter tuning (e.g. word insertion
penalty tuning) does not have a serious effect on the performance.
For a vocabulary of 100 keywords and using clean telephone speech
test data we achieved a 92.04% recognition rate with only a 7.96%
false alarm rate, and without word insertion penalty tuning. Using
noisy telephone speech test data we achieved a 87.29% recognition
rate with only a 12.71% false alarm rate.
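The second-pass verification step can be illustrated with a duration-normalized log-likelihood ratio, a common scoring scheme for putative hits; the function below and its zero threshold are illustrative assumptions, not the authors' exact score.

```python
def verify_hit(keyword_loglik, filler_loglik, n_frames, threshold=0.0):
    """Accept a putative keyword hit if the per-frame log-likelihood ratio
    between the keyword model and the filler (garbage) model clears a threshold."""
    score = (keyword_loglik - filler_loglik) / n_frames
    return score >= threshold
```

In practice the threshold is tuned on held-out data to trade detections against false alarms.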
Context-Sensitive Evaluation and Correction of
Phone Recognition Output
Michael Levit 1, Hiyan Alshawi 1, Allen Gorin 1, Elmar Nöth 2; 1 AT&T Labs-Research, USA; 2 Universität Erlangen-Nürnberg, Germany
In speech and language processing, information about the errors made by a learning system is commonly used to assess and improve its performance. Because of high computational complexity, the context of the errors is usually either ignored or exploited in a simplistic form. The complexity becomes tractable, however, for phone recognition because of the small lexicon. For phone-based systems, an exhaustive modeling of local context is possible. Furthermore, recent research studies have shown phone recognition to be useful for several spoken language processing tasks. In this paper, we present a mechanism which learns patterns of context-sensitive errors from ASR output aligned with the “true” phone transcriptions. We also show how this information, encoded as a context-sensitive weighted transducer, can provide a modest improvement in phone recognition accuracy even when no transcriptions are available for the domain of interest.
Integrating Statistical and Rule-Based Knowledge
for Continuous German Speech Recognition
René Beutler, Beat Pfister; ETH Zürich, Switzerland
A new approach to continuous speech recognition (CSR) for German is presented, which integrates both statistical knowledge (at the acoustic-phonetic level) and rule-based knowledge (at the word and sentence levels). We introduce a flexible framework allowing bidirectional processing and virtually any search strategy given an acoustic model and a context-free grammar. An implementation of this class of recognizers by means of a word spotter and an island chart parser is presented. A word recognition accuracy of 93.5% is reported on a speaker-dependent recognition task with a 4k-word dictionary.
Estimating Speech Recognition Error Rate Without
Acoustic Test Data
Yonggang Deng 1, Milind Mahajan 2, Alex Acero 2; 1 Johns Hopkins University, USA; 2 Microsoft Research, USA
We address the problem of estimating the word error rate (WER) of an automatic speech recognition (ASR) system without using acoustic test data. This is an important problem faced by the designers of new applications which use ASR. A quick estimate of WER early in the design cycle can be used to guide decisions involving dialog strategy and grammar design. Our approach involves estimating the probability distribution of the word hypotheses produced by the underlying ASR system given the text test corpus. A critical component of this system is a phonemic confusion model which seeks to capture the errors made by ASR on the acoustic data at a phonemic level. We use a confusion model composed of probabilistic phoneme sequence conversion rules which are learned from phonemic transcription pairs obtained by leave-one-out decoding of the training set. We show reasonably close estimation of WER when applying the system to test sets from different domains.
Multigram-Based Grapheme-to-Phoneme
Conversion for LVCSR
M. Bisani, Hermann Ney; RWTH Aachen, Germany
Many important speech recognition tasks feature an open, constantly changing vocabulary (e.g. broadcast news transcription, spoken document retrieval). Recognition of (new) words requires their acoustic baseforms to be known. Commonly, words are transcribed manually, which poses a major burden on vocabulary adaptation and inter-domain portability. In this work we investigate the possibility of applying a data-driven grapheme-to-phoneme converter to obtain the necessary phonetic transcriptions. Experiments were carried out on English and German speech recognition tasks. We study the relation between transcription quality and word error rate, and show that manual transcription effort can be reduced significantly by this method with acceptable loss in performance.
A Fast, Accurate and Stream-Based Speaker
Segmentation and Clustering Algorithm
An Vandecatseye, Jean-Pierre Martens; Ghent University, Belgium
In this paper a new pre-processor for a free speech transcription system is described. It performs a speech/non-speech partition, a segmentation of the speech parts into speaker turns, and a clustering of the speaker turns. It works in a stream-based mode and aims at high accuracy with low delay and processing time. Experiments on the Hub4 Broadcast News corpus show that the newly proposed pre-processor is competitive with, and in some respects better than, the best systems published so far. The paper also describes attempts to raise the system performance by supplementing the standard MFCC features with prosodic features such as pitch and voicing evidence.
A Sequential Metric-Based Audio Segmentation
Method via the Bayesian Information Criterion
Shi-sian Cheng, Hsin-Min Wang; Academia Sinica,
Taiwan
In this paper, we propose a sequential metric-based audio segmentation method that has the advantage of the low computation cost of metric-based methods and the advantage of the high accuracy of model-selection-based methods. There are two major differences between our method and the conventional metric-based methods: (1) each changing point has multiple chances to be detected by different pairs of windows, rather than only once by its neighboring acoustic information; (2) by introducing the Bayesian Information Criterion (BIC) into the distance computation of two windows, we can deal with the thresholding issue more easily. We used five one-hour broadcast news shows for experiments, and the experimental results show that our method performs as well as the model-selection-based methods, but with a lower computation cost.
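A generic BIC distance between two windows can be sketched as follows for one-dimensional features (a ΔBIC in the style of Chen and Gopalakrishnan, with single Gaussians and penalty weight `lam`; this is illustrative, not the authors' exact formulation):

```python
import math

def delta_bic(x, y, lam=1.0):
    """Positive when modeling x and y with two Gaussians beats one Gaussian,
    i.e. when a change point between the two windows is plausible."""
    def n_log_var(z):
        m = sum(z) / len(z)
        v = sum((t - m) ** 2 for t in z) / len(z)  # maximum-likelihood variance
        return len(z) * math.log(v)
    n = len(x) + len(y)
    # penalty: one extra mean and one extra variance for the second model (d = 1)
    penalty = lam * 0.5 * 2.0 * math.log(n)
    return 0.5 * (n_log_var(list(x) + list(y)) - n_log_var(x) - n_log_var(y)) - penalty
```

A pair of windows drawn from clearly shifted distributions yields a positive score, while matched windows yield a negative one, which is what makes thresholding at zero natural.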
Sentence Boundary Detection in Arabic Speech
Amit Srivastava, Francis Kubala; BBN Technologies,
USA
This paper presents an automatic system to detect sentence boundaries in speech recognition transcripts. Two systems were developed that use independent sources of information. One is a linguistic system that uses linguistic features in a statistical language
model while the other is an acoustic system that uses prosodic
features in a feed-forward neural network model. A third system
was developed that combines the scores from the acoustic and the
linguistic systems in a Maximum-Likelihood framework. All systems outlined in this paper are essentially language-independent
but all our experiments were conducted on the Arabic Broadcast
News speech recognition transcripts. Our experiments show that
while the acoustic system outperforms the linguistic system, the
combined system achieves the best performance at detecting sentence boundaries.
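The score combination can be illustrated by a generic log-linear fusion of the two systems' boundary posteriors (the weight `w` and the renormalization over the two classes are illustrative assumptions; the paper itself learns its combination in a Maximum-Likelihood framework):

```python
def fused_boundary_posterior(p_linguistic, p_acoustic, w=0.5):
    """Log-linear fusion of two posteriors for the 'boundary' class,
    renormalized over the two classes {boundary, no boundary}."""
    yes = (p_linguistic ** w) * (p_acoustic ** (1.0 - w))
    no = ((1.0 - p_linguistic) ** w) * ((1.0 - p_acoustic) ** (1.0 - w))
    return yes / (yes + no)
```

When both systems agree, the fused posterior moves decisively toward their shared decision; when they disagree, the weight `w` arbitrates.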
Automated Transcription and Topic Segmentation
of Large Spoken Archives
Martin Franz, Bhuvana Ramabhadran, Todd Ward,
Michael Picheny; IBM T.J. Watson Research Center,
USA
Digital archives have emerged as the pre-eminent method for capturing the human experience. Before such archives can be used
efficiently, their contents must be described. The scale of such archives, along with the associated content mark-up cost, makes it impractical to provide access via purely manual means, but automatic technologies for search in spoken materials still have relatively limited capabilities. The NSF-funded MALACH project will use the world's largest digital archive of video oral histories, collected by the Survivors of the Shoah Visual History Foundation (VHF), to
make a quantum leap in the ability to access such archives by advancing the state-of-the-art in Automated Speech Recognition (ASR),
Natural Language Processing (NLP) and related technologies [1, 2].
This corpus consists of over 115,000 hours of unconstrained, natural speech from 52,000 speakers in 32 different languages, filled
with disfluencies, heavy accents, age-related coarticulations, and
un-cued speaker and language switching. This paper discusses
some of the ASR and NLP tools and technologies that we have been
building for the English speech in the MALACH corpus. We also discuss this new test bed while emphasizing the unique characteristics
of this corpus.
Automatic Disfluency Identification in
Conversational Speech Using Multiple Knowledge
Sources
Yang Liu 1 , Elizabeth Shriberg 2 , Andreas Stolcke 2 ;
1
International Computer Science Institute, USA; 2 SRI
International, USA
Disfluencies occur frequently in spontaneous speech. Detection and
correction of disfluencies can make automatic speech recognition
transcripts more readable for human readers, and can aid downstream processing by machine. This work investigates a number
of knowledge sources for disfluency detection, including acoustic-prosodic features, a language model (LM) to account for repetition
patterns, a part-of-speech (POS) based LM, and rule-based knowledge. Different components are designed for different purposes in
the system. Results show that detection of disfluency interruption
points is best achieved by a combination of prosodic cues, word-based cues, and POS-based cues. The onset of a disfluency to be
removed, in contrast, is best found using knowledge-based rules.
Finally, the detection of specific disfluency types can be aided by the modeling of word patterns.
Topic Segmentation and Retrieval System for
Lecture Videos Based on Spontaneous Speech
Recognition
Natsuo Yamamoto 1 , Jun Ogata 2 , Yasuo Ariki 1 ;
1
Ryukoku University, Japan; 2 AIST, Japan
In this paper, we propose a method for segmenting continuous lecture speech into topics. A lecture includes several topics, but it is difficult to judge their boundaries. To solve this problem, the transcription obtained by spontaneous speech recognition of the lecture speech is associated with the textbook used in the lecture. This method achieved high topic-segmentation performance, with an average of 93.7%. Incorporating this method, we constructed a system with which users can view interesting parts of lecture videos by specifying chapters or sections as well as keywords.
Session: OTuCa – Oral
Robust Speech Recognition - Acoustic
Modeling
Time: Tuesday 13.30, Venue: Room 1
Chair: Richard Stern, CMU, USA
Hybrid HMM/BN ASR System Integrating Spectrum
and Articulatory Features
Konstantin Markov 1 , Jianwu Dang 2 , Yosuke Iizuka 2 ,
Satoshi Nakamura 1 ; 1 ATR-SLT, Japan; 2 JAIST, Japan
In this paper, we describe an automatic speech recognition system in which features extracted from the human speech production system, in the form of articulatory movement data, are effectively integrated into the acoustic model for improved recognition performance. The
system is based on the hybrid HMM/BN model, which allows for
easy integration of different speech features by modeling probabilistic dependencies between them. In addition, features like articulatory movements, which are difficult or impossible to obtain
during recognition, can be left hidden, in effect eliminating the need for their extraction. The system was evaluated on a phoneme recognition task on a small database consisting of three speakers' data, in speaker-dependent and multi-speaker modes. In both cases, we obtained higher recognition rates compared to a conventional, spectrum-based HMM system with the same number of parameters.
Context-Dependent Output Densities for Hidden
Markov Models in Speech Recognition
Georg Stemmer, Viktor Zeißler, Christian Hacker,
Elmar Nöth, Heinrich Niemann; Universität
Erlangen-Nürnberg, Germany
In this paper we propose an efficient method to utilize context in the
output densities of HMMs. State scores of a phone recognizer are
integrated into the HMMs of a word recognizer which makes their
output densities context-dependent. A significant reduction of the
word error rate has been achieved when the approach is evaluated
on a set of spontaneous speech utterances. As we can expect that
context is more important for some phone models than for others,
we further extend the approach by state-dependent weighting factors which are used to control the influence of the different information sources. A small additional improvement has been achieved.
Time Adjustable Mixture Weights for Speaking
Rate Fluctuation
Takahiro Shinozaki, Sadaoki Furui; Tokyo Institute of
Technology, Japan
One of the most serious problems in spontaneous speech recognition is the degradation of recognition accuracy due to the speaking
rate fluctuation in an utterance. This paper proposes a method for
adjusting mixture weights of an HMM frame by frame depending on
the local speaking rate. The proposed method is implemented using the Bayesian network framework. A hidden variable representing the variation of the “mode” of the speaking rate is introduced
and its value controls the mixture weights of Gaussian mixtures.
Model training and maximum probability assignment of the variables are conducted using the EM/GEM and inference algorithms
for Bayesian networks. The Bayesian network is used to rescore the
acoustic likelihood of the hypotheses in N-best lists. Experimental
results show that the proposed method improves word accuracy by 1.6% absolute on meeting speech given the speaking-rate information, whereas the improvement by a regression HMM is less
significant.
A Switching Linear Gaussian Hidden Markov Model
and Its Application to Nonstationary Noise
Compensation for Robust Speech Recognition
Jian Wu, Qiang Huo; University of Hong Kong, China
The Switching Linear Gaussian (SLG) model was proposed recently for time series data with nonlinear dynamics. In this paper, we present a new modelling approach, called SLGHMM, that uses a hybrid Dynamic Bayesian Network of SLG models and Continuous Density HMMs (CDHMMs) to compensate for the nonstationary distortion that may exist in the speech utterance to be recognized. With this representation, the CDHMMs (each modelling mainly the linguistic information of a speech unit) and a set of linear Gaussian models (each modelling a kind of stationary distortion) can be jointly learnt from multi-condition training data. Such an SLGHMM is able to model approximately the distribution of speech corrupted by switching-condition distortions. The effectiveness of the proposed approach is confirmed in noisy speech recognition experiments on the Aurora2 task.
On Factorizing Spectral Dynamics for Robust
Speech Recognition
Vivek Tyagi, Iain A. McCowan, Hervé Bourlard,
Hemant Misra; IDIAP, Switzerland
In this paper, we introduce new dynamic speech features based on
the modulation spectrum. These features, termed Mel-cepstrum
Modulation Spectrum (MCMS), map the time trajectories of the spectral dynamics into a series of slow and fast moving orthogonal components, providing a more general and discriminative range of dynamic features than traditional delta and acceleration features. The
features can be seen as the outputs of an array of band-pass filters
spread over the cepstral modulation frequency range of interest.
In experiments, it is shown that, as well as providing a slight improvement in clean conditions, these new dynamic features yield a
significant increase in speech recognition performance in various
noise conditions when compared directly to the standard temporal
derivative features and RASTA-PLP features.
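For context, conventional delta features are themselves the output of a single fixed FIR regression filter applied to each cepstral trajectory; MCMS generalizes this to a bank of band-pass filters over the modulation frequency range. A sketch of the standard delta computation (the window half-width `w=2` is the usual choice, assumed here):

```python
def delta_features(traj, w=2):
    """Standard regression (delta) coefficients over a 1-D cepstral trajectory,
    with edge frames padded by repetition."""
    denom = 2.0 * sum(k * k for k in range(1, w + 1))
    pad = [traj[0]] * w + list(traj) + [traj[-1]] * w
    return [sum(k * (pad[t + w + k] - pad[t + w - k]) for k in range(1, w + 1)) / denom
            for t in range(len(traj))]
```

On a linearly rising trajectory the interior deltas equal the slope, which is exactly the low-pass differentiating behavior that the MCMS filter bank decomposes into multiple modulation bands.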
Joint Model and Feature Based Compensation for
Robust Speech Recognition Under Non-Stationary
Noise Environments
Chuan Jia, Peng Ding, Bo Xu; Chinese Academy of
Sciences, China
This paper presents a novel compensation approach, implemented in both the model and feature spaces, for non-stationary noise. Because non-stationary noise can be decomposed into a constant part and a residual part, our proposed scheme is performed in two steps: before recognition, an extended Jacobian adaptation (JA) is applied to adapt the speech models for the constant part of the noise; during recognition, the power spectra of noisy speech are compensated to eliminate the effect of the residual part of the noise. As verified by the experiments performed
under different stationary and non-stationary noise environments,
the proposed JA is superior to the basic JA and the joint approach
is better than compensation in a single space.
Session: STuCb – Oral
Advanced Machine Learning Algorithms for
Speech & Language Processing
Time: Tuesday 13.30, Venue: Room 2
Chair: Mazin Rahim, AT&T Res., USA
Robust Multi-Class Boosting
Gunnar Rätsch; Fraunhofer FIRST, Germany
Boosting approaches are based on the idea that high-quality learning algorithms can be formed by repeated use of a “weak-learner”,
which is required to perform only slightly better than random guessing. It is known that Boosting can lead to drastic improvements
compared to the individual weak-learner. For two-class problems
it has been shown that the original Boosting algorithm, called AdaBoost, is quite unaffected by overfitting. However, for the case of
noisy data, it is also understood that AdaBoost can be improved
considerably by introducing some regularization technique.
In speech-related problems one often considers multi-class problems and Boosting formulations have been used successfully to
solve them. I review existing multi-class boosting algorithms, which have been much less analyzed and explored than their two-class counterparts. In this work I extend these methods to derive new boosting
algorithms which are more robust against outliers and noise in the
data and are able to exploit prior knowledge about relationships
between the classes.
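For readers unfamiliar with the two-class case that this work builds on, a minimal AdaBoost sketch with one-dimensional threshold stumps (the weak learner, round count, and data are illustrative assumptions, not the paper's algorithms):

```python
import math

def adaboost_train(xs, ys, rounds=10):
    """AdaBoost with threshold stumps on 1-D data; ys are +/-1 labels."""
    n = len(xs)
    w = [1.0 / n] * n  # example weights, uniform at the start
    ensemble = []
    for _ in range(rounds):
        # weak learner: pick the stump (threshold, sign) with lowest weighted error
        err, thr, sgn = min(
            ((sum(wi for xi, yi, wi in zip(xs, ys, w)
                  if sgn * (1 if xi > thr else -1) != yi), thr, sgn)
             for thr in sorted(set(xs)) for sgn in (1, -1)),
            key=lambda t: t[0])
        err = max(err, 1e-12)  # avoid log(0) when a stump is perfect
        if err >= 0.5:
            break              # weak learner no better than chance
        alpha = 0.5 * math.log((1.0 - err) / err)
        ensemble.append((alpha, thr, sgn))
        # reweight: emphasize the examples this stump got wrong
        w = [wi * math.exp(-alpha * yi * sgn * (1 if xi > thr else -1))
             for xi, yi, wi in zip(xs, ys, w)]
        z = sum(w)
        w = [wi / z for wi in w]
    return ensemble

def adaboost_predict(ensemble, x):
    score = sum(a * sgn * (1 if x > thr else -1) for a, thr, sgn in ensemble)
    return 1 if score >= 0 else -1
```

The exponential reweighting step is the point where outliers accumulate weight, which is what the regularized and robust variants discussed in the abstract are designed to counter.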
Statistical Signal Processing with Nonnegativity
Constraints
Lawrence K. Saul, Fei Sha, Daniel D. Lee; University of
Pennsylvania, USA
Nonnegativity constraints arise frequently in statistical learning and
pattern recognition. Multiplicative updates provide natural solutions to optimizations involving these constraints. One well-known set of multiplicative updates is given by the Expectation-Maximization algorithm for hidden Markov models, as used in automatic
speech recognition. Recently, we have derived similar algorithms
for nonnegative deconvolution and nonnegative quadratic programming. These algorithms have applications to low-level problems in
voice processing, such as fundamental frequency estimation, as well
as high-level problems, such as the training of large margin classifiers. In this paper, we describe these algorithms and the ideas that
connect them.
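As a concrete illustration of the multiplicative-update pattern, here is a sketch of a classic Lee-Seung style update for nonnegative least squares (a generic example of nonnegative quadratic programming, not the paper's own algorithms; it assumes A and b are entrywise nonnegative so the update ratio stays nonnegative):

```python
def nnls_multiplicative(A, b, iters=200):
    """Minimize ||Ax - b||^2 subject to x >= 0 via multiplicative updates.

    Assumes A (a list of rows) and b have nonnegative entries, so every
    update factor (A^T b)_j / (A^T A x)_j is nonnegative and x stays >= 0.
    """
    m, n = len(A), len(A[0])
    x = [1.0] * n  # strictly positive start; a zero entry would stay zero
    atb = [sum(A[i][j] * b[i] for i in range(m)) for j in range(n)]
    for _ in range(iters):
        ax = [sum(A[i][j] * x[j] for j in range(n)) for i in range(m)]
        atax = [sum(A[i][j] * ax[i] for i in range(m)) for j in range(n)]
        x = [x[j] * atb[j] / max(atax[j], 1e-12) for j in range(n)]
    return x
```

Because each iterate is a product of nonnegative factors, the nonnegativity constraint is enforced automatically, with no projection or step-size tuning, which is the appeal of this family of updates.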
Inline Updates for HMMs
Ashutosh Garg 1 , Manfred K. Warmuth 2 ; 1 IBM
Corporation, USA; 2 University of California at Santa
Cruz, USA
Most training algorithms for HMMs assume that the whole batch of observation sequences is given ahead of time. This is particularly the case for the standard EM algorithm. However, in many applications such as speech, the data is generated by a temporal process. Singer and Warmuth developed online updates for HMMs that process a single observation sequence in each update. In this paper we take this approach one step further and develop an inline update for training HMMs. Now the parameters are updated after processing a single symbol of the current observation sequence. The methodology for deriving the online and the new inline update is quite different from the standard EM motivation.
We show experimentally on speech data that even when all observation sequences are available (batch mode), the online update converges faster than the batch update, and the inline update converges even faster. The standard batch EM update exhibits the slowest convergence.
Weighted Automata Kernels – General Framework
and Algorithms
Corinna Cortes, Patrick Haffner, Mehryar Mohri; AT&T Labs-Research, USA
Kernel methods have found wide use in statistical learning techniques in recent years due to their good performance and their computational efficiency in high-dimensional feature spaces. However, text or speech data cannot always be represented by the fixed-length vectors that traditional kernels handle. We recently introduced a general kernel framework based on weighted transducers, rational kernels, to extend kernel methods to the analysis of variable-length sequences and weighted automata [5], and described their application to spoken-dialog applications. We presented a constructive algorithm for ensuring that rational kernels are positive definite symmetric, a property which guarantees the convergence of discriminant classification algorithms such as Support Vector Machines, and showed that many string kernels previously introduced in the computational biology literature are special instances of such positive definite symmetric rational kernels [4]. This paper reviews the essential results given in [5, 3, 4] and presents them in the form of a short tutorial.
Factorial Models and Refiltering for Speech
Separation and Denoising
Sam T. Roweis; University of Toronto, Canada
This paper proposes the combination of several ideas, some old and some new, from machine learning and speech processing. We review the max approximation to log spectrograms of mixtures, show why this motivates a “refiltering” approach to separation and denoising, and then describe how the process of inference in factorial probabilistic models performs a computation useful for deriving the masking signals needed in refiltering. A particularly simple model, factorial-max vector quantization (MAXVQ), is introduced along with a branch-and-bound technique for efficient exact inference, and applied to both denoising and monaural separation. Our approach represents a return to the ideas of Ephraim, Varga and Moore, but applied to auditory scene analysis rather than to speech recognition.
Large Margin Methods for Label Sequence Learning
Yasemin Altun, Thomas Hofmann; Brown University, USA
Label sequence learning is the problem of inferring a state sequence from an observation sequence, where the state sequence may encode a labeling, annotation or segmentation of the sequence. In this paper we give an overview of discriminative methods developed for this problem. Special emphasis is put on large margin methods by generalizing multiclass Support Vector Machines and AdaBoost to the case of label sequences. An experimental evaluation demonstrates the advantages over classical approaches like Hidden Markov Models and the competitiveness with methods like Conditional Random Fields.
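The baseline these discriminative methods decode against is classical Viterbi state-sequence inference. A minimal sketch over a toy two-state HMM (the model parameters in the test below are invented illustrative numbers):

```python
import math

def viterbi(obs, states, log_start, log_trans, log_emit):
    """Most likely state sequence for an observation sequence (log domain)."""
    trellis = [{s: log_start[s] + log_emit[s][obs[0]] for s in states}]
    back = []
    for o in obs[1:]:
        col, ptr = {}, {}
        for s in states:
            prev = max(states, key=lambda p: trellis[-1][p] + log_trans[p][s])
            col[s] = trellis[-1][prev] + log_trans[prev][s] + log_emit[s][o]
            ptr[s] = prev
        trellis.append(col)
        back.append(ptr)
    state = max(states, key=lambda s: trellis[-1][s])
    path = [state]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1]
```

Large-margin and CRF approaches keep exactly this decoding step but replace the generatively trained log scores with discriminatively trained ones.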
Session: OTuCc – Oral
Speech Modeling & Features III
Time: Tuesday 13.30, Venue: Room 3
Chair: Daniel Ellis, Columbia Univ., USA
Distributed Speech Recognition on the WSJ Task
Jan Stadermann, Gerhard Rigoll; Technische Universitaet Muenchen, Germany
A comparison of traditional continuous speech recognizers with hybrid tied-posterior systems in distributed environments is presented for the first time on a challenging medium-vocabulary task. We show how monophone and triphone systems are affected if speech features are sent over a wireless channel with limited bandwidth. The algorithms are evaluated on the Wall Street Journal database (WSJ0), and the results show that our monophone tied-posterior recognizer outperforms the traditional methods on this task, reducing the performance loss relative to non-distributed recognizers by a factor of 4.
Band-Independent Speech-Event Categories for
TRAP Based ASR
Hynek Hermansky, Pratibha Jain; Oregon Health & Science University, USA
Band-independent categories are investigated for feature estimation in ASR. These categories represent distinct speech events manifested in frequency-localized temporal patterns of the speech signal. A universal, single estimator is proposed for estimating speech-event posterior probabilities using temporal patterns of critical-band energies for all the bands. The estimated posteriors are used as the input features (referred to as speech-event features) to a back-end recognizer. These features are evaluated on the continuous OGI Digits task. The features are also evaluated on the Aurora-2 and Aurora-3 tasks in a Distributed Speech Recognition (DSR) framework. These features are compared with the earlier proposed broad-phonetic TRAPs features estimated from temporal patterns using independent estimators in each critical band.
Local Averaging and Differentiating of Spectral
Plane for TRAP-Based ASR
František Grézl, Hynek Hermansky; Oregon Health & Science University, USA
Local frequency and time averaging and differentiating operators, using three neighboring points of the critical-band time-frequency plane, are used to process the plane prior to its use in TRAP-based ASR. In that way, five alternative TRAP-based ASR systems (the original one and the time/frequency integrated/differentiated ones) are created. We show that the frequency differentiating operator improves the performance of the TRAP-based ASR.
Integrating Multilingual Articulatory Features into
Speech Recognition
Sebastian Stüker 1, Florian Metze 1, Tanja Schultz 2, Alex Waibel 2; 1 Universität Karlsruhe, Germany; 2 Carnegie Mellon University, USA
The use of articulatory features, such as place and manner of articulation, has been shown to reduce the word error rate of speech recognition systems under different conditions and in different settings. For example, recognition systems based on features are more robust to noise and reverberation. In earlier work we showed that articulatory features can compensate for inter-language variability and can be recognized across languages. In this paper we show that using cross- and multilingual detectors to support an HMM-based speech recognition system significantly reduces the word error rate. By selecting and weighting the features in a discriminative way, we achieve an error rate reduction that lies in the same range as that seen when using language-specific feature detectors. By combining feature detectors from many languages and training the weights discriminatively, we even outperform the case where only monolingual detectors are being used.
Minimum Variance Distortionless Response on a
Warped Frequency Scale
Matthias Wölfel 1, John McDonough 1, Alex Waibel 2; 1 Universität Karlsruhe, Germany; 2 Carnegie Mellon University, USA
In this work we propose a time-domain technique to estimate an all-pole model based on the minimum variance distortionless response (MVDR) using a warped short-time frequency axis such as the Mel scale. The use of the MVDR eliminates the overemphasis of harmonic peaks typically seen in medium- and high-pitched voiced speech when spectral estimation is based on linear prediction (LP). Moreover, warping the frequency axis prior to MVDR spectral estimation ensures more parameters in the spectral model are allocated to the low, as opposed to high, frequency regions of the spectrum, thereby mimicking the human auditory system. In a series of speech recognition experiments on the Switchboard Corpus (spontaneous English telephone speech), the proposed approach achieved a word error rate (WER) of 32.1% for female speakers, which is clearly superior to the 33.2% WER obtained by the usual combination of Mel warping and linear prediction.
Improving the Efficiency of Automatic Speech
Recognition by Feature Transformation and
Dimensionality Reduction
Xuechuan Wang, Douglas O’Shaughnessy; Université du Québec, Canada
In speech recognition systems, feature extraction can be achieved in two steps: parameter extraction and feature transformation. Feature transformation is an important step: it can concentrate the energy distributions of a speech signal onto fewer dimensions than those of parameter extraction and thus reduce the dimensionality of the system. Linear Discriminant Analysis (LDA) and Principal Component Analysis (PCA) are the two popular feature transformation methods. This paper investigates their performances in dimensionality reduction tasks in continuous speech recognition systems. A new type of feature transformation, the LP transformation, is proposed and its performance is compared to those of the LDA and PCA transformations.
Session: OTuCd – Oral
Multi-Modal Spoken Language Processing
Time: Tuesday 13.30, Venue: Room 4
Chair: Roger Moore, 20/20 Speech, United Kingdom
Using Corpus-Based Methods for Spoken Access to
News Texts on the Web
Alexandra Klein 1 , Harald Trost 2 ; 1 Austrian Research
Institute for Artificial Intelligence, Austria; 2 University
of Vienna, Austria
The system described in this paper relies both on a multimodal
corpus and a written newspaper corpus for processing spoken and
written user requests to Austrian news texts. Requests may be
spontaneous spoken and written utterances as well as mouse clicks;
user actions may concern actual search, but also control of the
browser. Because of spontaneous utterances, a large vocabulary
and multimodal interaction, interpreting the user request and generating an appropriate system response is often difficult. Apart
from a controller module, the system uses data from two corpora to compensate for the difficulties associated with the scenario. Multimodal user actions, which were collected in Wizard-of-Oz experiments, serve as a basis for the identification of patterns in users'
spontaneous utterances. Furthermore, news documents are used
for obtaining background knowledge which can contribute to query
expansion whenever the interpretation of users’ utterances encounters ambiguity or underspecification concerning the search terms.
Cross-Modal Informational Masking Due to
Mismatched Audio Cues in a Speechreading Task
Douglas S. Brungart 1 , Brian D. Simpson 1 , Alex
Kordik 2 ; 1 Air Force Research Laboratory, USA;
2
Sytronics Inc., USA
Although most known examples of cross-modal interactions in
audio-visual speech perception involve a dominant visual signal that
modifies the apparent audio signal heard by the observer, there
may also be cases where an audio signal can alter the visual image seen by the observer. In this experiment, we examined the effects that different distracting audio signals had on an observer’s
ability to speechread a color and number combination from a visual speech stimulus. When the distracting signal was noise, time-reversed speech, or irrelevant continuous speech, speechreading
performance was unaffected. However, when the distracting audio signal was speech that followed the same general syntax as the
target speech but contained a different color and number combination, speechreading performance was dramatically reduced. This
suggests that the amount of interference an audio signal causes in
a speechreading task strongly depends on the semantic similarity
of the target and masking phrases. The amount of interference did
not, however, depend on the apparent similarity between the audio
speech signal and the visible talker: masking phrases spoken by
a talker who differed in sex from the visible talker interfered
nearly as much with the speechreading task as masking phrases
spoken by the same talker used in the visual stimulus. A second
experiment that examined the effects of desynchronizing the audio
and visual signals found that the amount of interference caused by
the audio phrase decreased when it was time advanced or time delayed relative to the visual target, but that time shifts as large as 1
s were required before performance approached the level achieved
with no audio signal. The results of these experiments are consistent with the existence of a kind of cross-modal “informational
masking” that occurs when listeners who see one word and hear
another are unable to correctly determine which word was present
in the visual stimulus.
Robust Speech Interaction in a Mobile Environment
Through the Use of Multiple and Different Media
Input Types
Rainer Wasinger, Christoph Stahl, Antonio Krueger;
DFKI GmbH, Germany
Mobile and outdoor environments have long been out of reach for speech engines, due to the performance limitations associated with portable devices and the difficulty of processing speech in high-noise areas. This paper outlines an architecture for attaining robust speech recognition rates in a mobile pedestrian indoor/outdoor navigation environment through the use of a media fusion knowledge component.
Speech-Based, Manual-Visual, and Multi-Modal
Interaction with an In-Car Computer – Evaluation
of a Pilot Study
Rogier Woltjer, Wah Jin Tan, Fang Chen; Linköpings
Universitet, Sweden
This paper presents a pilot study comparing various modalities for control and presentation of a multi-modal in-car e-mail system. A simple interface for reading e-mail was constructed, which could be controlled manually by pressing keyboard buttons, by speech through a Wizard of Oz setup, or both. The e-mail program was presented visually on a VDU, read to the driver through speech synthesis, or both. Results indicate that in this context subjective task load was highest when manual/visual interaction was used. A solution may be interaction through user-determined modality selection, as results indicate that subjects judge their load lowest, and performance and preference highest among the tested conditions, when they are able to select the modality. Some evaluation issues for multi-modal interfaces are discussed.
Bayesian Networks for Spoken Dialogue
Management in Multimodal Systems of Tour-Guide
Robots
Plamen Prodanov, Andrzej Drygajlo; EPFL,
Switzerland
In this paper, we propose a method based on Bayesian networks for interpretation of the multimodal signals used in the spoken dialogue between a tour-guide robot and visitors in mass exhibition conditions. We report on experiments interpreting speech and laser scanner signals in the dialogue management system of the autonomous tour-guide robot RoboX, successfully deployed at the Swiss National Exhibition (Expo.02). A correct interpretation of a user's (visitor's) goal or intention at each dialogue state is a key issue for successful voice-enabled communication between tour-guide robots and visitors. To infer the visitors' goals under the uncertainty intrinsic to these two modalities, we introduce Bayesian networks for combining noisy speech recognition with data from a laser scanner, which is independent of acoustic noise. Experiments with real data, collected during the operation of RoboX at Expo.02, demonstrate the effectiveness of the approach.
Session: PTuCe– Poster
Speech Coding & Transmission
Time: Tuesday 13.30, Venue: Main Hall, Level -1
Chair: Isabel Trancoso, INESC ID / IST, Lisboa, Portugal
Audiovisual Speech Enhancement Based on the
Association Between Speech Envelope and Video
Features
Frédéric Berthommier; ICP-CNRS, France
The low-level acoustico-visual association reported by Yehia et al. (Speech Comm., 26(1):23-43, 1998) is exploited for audio-visual speech enhancement with natural video sequences. The aim of this study is to demonstrate that the redundant components of AV speech can be extracted with a suitable representation that does not involve any categorization process. A comparative study is carried out between different types of audio features, including the initial Line Spectral Pairs (LSP) and 4-subband envelope energies. A gain measure of the enhancement is applied for the comparison. The results clearly show that the coarse envelope features allow a better gain than the LSP.
Optimization of Window and LSF Interpolation
Factor for the ITU-T G.729 Speech Coding Standard
Wai C. Chu, Toshio Miki; DoCoMo USA Labs, USA
A gradient-descent-based optimization procedure is applied to the window sequence used for linear prediction (LP) analysis in the ITU-T G.729 CS-ACELP coder. By replacing the original window of the standard with the optimized versions, similar subjective quality is obtainable at reduced computational cost and/or lowered coding delay. In addition, an optimization strategy is described to find the line spectral frequency (LSF) interpolation factor.
Likelihood Ratio Test with Complex Laplacian
Model for Voice Activity Detection
Joon-Hyuk Chang, Jong-Won Shin, Nam Soo Kim;
Seoul National University, Korea
This paper proposes a voice activity detector (VAD) based on the complex Laplacian model. With the use of a goodness-of-fit (GOF) test, it is discovered that the Laplacian model is more suitable for describing the noisy speech distribution than the conventional Gaussian model. The likelihood ratio (LR) based on the Laplacian model is computed and then applied to the VAD operation. The experimental results show that the Laplacian statistical model is also more suitable than the Gaussian model for the VAD algorithm.
Multi-Mode Quantization of Adjacent Speech
Parameters Using a Low-Complexity Prediction
Scheme
Jani Nurminen; Nokia Research Center, Finland
This work addresses joint quantization of adjacent speech parameter values or vectors. The basic joint quantization scheme is improved by using a low-complexity predictor and by allowing the quantizer to operate in several modes. In addition, this paper introduces an efficient algorithm for training quantizers having the proposed structure. The algorithm is used for training a practical quantizer that is evaluated in the context of the quantization of the linear prediction coefficients. The simulation results indicate that the proposed quantizer clearly outperforms conventional quantizers both in an error-free environment and in erroneous conditions at all bit error rates included in the evaluation.
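As an illustration of the general idea only (the abstract does not specify the paper's predictor or codebooks, so both are hypothetical here), a two-mode scheme can code each frame either directly or as a predicted residual, keeping whichever reconstruction is closer:

```python
def quantize(value, codebook):
    """Return the codebook entry closest to value."""
    return min(codebook, key=lambda c: abs(c - value))

def predictive_multimode_quantize(params, direct_cb, residual_cb, a=0.9):
    """Quantize a sequence of speech parameters frame by frame.

    Each frame is coded either directly or as a predicted residual,
    whichever gives the smaller error (a simple two-mode scheme).
    The predictor is a deliberately low-complexity first-order one,
    and it predicts from the quantized history so that encoder and
    decoder stay in sync."""
    out = []
    prev = 0.0
    for x in params:
        pred = a * prev
        direct = quantize(x, direct_cb)
        residual = pred + quantize(x - pred, residual_cb)
        rec = direct if abs(direct - x) <= abs(residual - x) else residual
        out.append(rec)
        prev = rec
    return out
```

For slowly varying parameter tracks the residual mode usually wins, which is why a small residual codebook can match a much larger direct one.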
Multi-Mode Matrix Quantizer for Low Bit Rate LSF
Quantization
Ulpu Sinervo 1, Jani Nurminen 2, Ari Heikkinen 2, Jukka Saarinen 2; 1 Tampere University of Technology, Finland; 2 Nokia Research Center, Finland
In this paper, we introduce a novel method for quantization of line spectral frequencies (LSF) converted from m-th order linear prediction coefficients. In the proposed method, the interframe correlation of LSFs is exploited using matrix quantization, where N consecutive frames are quantized as one m-by-N matrix. Voicing-based multi-mode operation reduces the bit rate by taking advantage of the properties of the speech signal: certain parts of a signal, such as unvoiced segments, can be quantized with smaller codebooks. With this method, very low variable bit rate LSF quantization is obtained. The proposed method is especially suitable for very low bit rate speech coders in which a short time delay is tolerable and high, but not necessarily transparent, quality is sufficient.
Entropy-Optimized Channel Error Mitigation with
Application to Speech Recognition Over Wireless
Victoria Sánchez, Antonio M. Peinado, Angel M. Gómez, José L. Pérez-Córdoba; Universidad de Granada, Spain
In this paper we propose an entropy-optimized channel error mitigation technique with low computational complexity and moderate memory requirements, suitable for transmission over wireless channels. We apply it to Distributed Speech Recognition (DSR), obtaining an improvement of around 3% in word accuracy over the recognition performance of the mitigation technique proposed in the ETSI standard for DSR (ETSI ES 201 108 v1.1.2) under bad channel conditions (GSM EP3 error pattern).
Voicing Controlled Frame Loss Concealment for
Adaptive Multi-Rate (AMR) Speech Frames in
Voice-Over-IP
Frank Mertz 1, Hervé Taddei 2, Imre Varga 2, Peter Vary 1; 1 RWTH Aachen, Germany; 2 Siemens AG, Germany
In this paper we present a voicing-controlled, speech-parameter-based frame loss concealment for frames that have been encoded with the Adaptive Multi-Rate (AMR) speech codec. The missing parameters are estimated by interpolation and extrapolation techniques that are chosen depending on the voicing state of the speech frames preceding and following the lost frames. The voicing-controlled concealment outperforms the conventional extrapolation/muting-based approach, and it shows a consistent improvement over interpolation techniques that do not distinguish between voiced and unvoiced speech. The quality can be further improved if additional information about the predictor states of predictively encoded parameters is available from a redundant transmission in future packets.
Perceptual Irrelevancy Removal in Narrowband
Speech Coding
Marja Lähdekorpi 1, Jani Nurminen 2, Ari Heikkinen 2, Jukka Saarinen 2; 1 Tampere University of Technology, Finland; 2 Nokia Research Center, Finland
A masking model originally designed for audio signals is applied to narrowband speech. The model is used to detect and remove the perceptually irrelevant, simultaneously masked frequency components of a speech signal. Objective measurements have shown that the modified speech signal can be coded more efficiently than the original signal. Furthermore, it has been confirmed through perceptual evaluation that the removal of these frequency components does not cause significant degradation of the speech quality; rather, it has consistently improved the output quality of two standardized speech codecs. Thus, the proposed irrelevancy removal technique can be used at the front end of a speech coder to achieve enhanced coding efficiency.
Very-Low-Rate Speech Compression by Indexation
of Polyphones
Charles du Jeu, Maurice Charbit, Gérard Chollet;
ENST-CNRS, France
Speech coding by indexation has proven to lower the rate of speech compression drastically. Building on the Automatic Language Independent Speech Processing (ALISP) approach, which automatically segments the speech signal [1], we studied the possibility of optimising this rate, as well as the quality of the re-synthesised signal, by using the text information corresponding to the speech signal and by implementing a new segmentation method. This led to the alignment of speech with its phonetic transcription and the use of polyphones, increasing output speech quality while keeping the bit rate between 400 bits/s and 600 bits/s. Typical applications include storing recorded alphanumeric books for blind people and compressing recorded courses for e-learning; cell phone applications could also be considered.
Robust Jointly Optimized Multistage Vector
Quantization for Speech Coding
Venkatesh Krishnan, David V. Anderson; Georgia
Institute of Technology, USA
In this paper, a novel channel-optimized multistage vector quantization (COMSVQ) codec is presented in which the stage codebooks are jointly designed. The proposed codec uses a signal-source- and channel-dependent distortion measure to encode line spectral frequencies derived from segments of a speech signal. Simulation results are provided to demonstrate the consistent reduction in spectral distortion obtained using the proposed codec as compared to the conventional sequentially designed channel-matched multistage vector quantizer.
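For reference, the sequentially designed baseline that such jointly optimized codecs are compared against works as follows: each stage quantizes the residual left by the previous stage. A minimal sketch (toy two-dimensional codebooks, not the paper's design):

```python
def _sqdist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def msvq_encode(x, stage_codebooks):
    """Sequential multistage VQ: each stage picks the codeword closest
    to the residual left by the previous stage and passes the new
    residual on. Returns one index per stage."""
    residual = list(x)
    indices = []
    for cb in stage_codebooks:
        i = min(range(len(cb)), key=lambda k: _sqdist(cb[k], residual))
        indices.append(i)
        residual = [r - c for r, c in zip(residual, cb[i])]
    return indices

def msvq_decode(indices, stage_codebooks):
    """Reconstruction is simply the sum of the selected stage codewords."""
    dim = len(stage_codebooks[0][0])
    out = [0.0] * dim
    for i, cb in zip(indices, stage_codebooks):
        out = [o + c for o, c in zip(out, cb[i])]
    return out
```

Joint design, as in the paper, trains all stage codebooks together (and here with a channel-dependent distortion measure) instead of fixing each stage before training the next.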
Polar Quantization of Sinusoids from Speech Signal
Blocks
Harald Pobloth, Renat Vafin, W. Bastiaan Kleijn; KTH,
Sweden
We introduce a block polar quantization (BPQ) procedure that minimizes a weighted distortion for a set of sinusoids representing
one block of a signal. The minimization is done under a resolution constraint for the entire signal block. BPQ outperforms rectangular quantization, strictly polar quantization, and unrestricted
polar quantization (UPQ), both when the Cartesian coordinates of the sinusoidal components are assumed to be Gaussian and for sinusoids extracted from speech data. For speech data we found a significant performance gain (about 4 dB) over the best-performing polar quantization (UPQ).
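To make the polar setting concrete, here is the simplest member of the family the paper compares against: a strictly polar quantizer that codes a complex sinusoid coefficient with independent uniform quantizers for amplitude and phase (toy resolutions; BPQ instead allocates resolution jointly across the whole block):

```python
import cmath
import math

def polar_quantize(c, n_amp=16, n_phase=16, max_amp=1.0):
    """Quantize a complex coefficient in polar form: uniform amplitude
    cells in [0, max_amp) and uniform phase cells in [0, 2*pi)."""
    amp = min(abs(c), max_amp)
    phase = cmath.phase(c) % (2 * math.pi)
    amp_idx = min(int(amp / (max_amp / n_amp)), n_amp - 1)
    ph_idx = int(phase / (2 * math.pi / n_phase)) % n_phase
    return amp_idx, ph_idx

def polar_dequantize(amp_idx, ph_idx, n_amp=16, n_phase=16, max_amp=1.0):
    """Reconstruct at the centre of the selected amplitude/phase cell."""
    amp = (amp_idx + 0.5) * max_amp / n_amp
    phase = (ph_idx + 0.5) * 2 * math.pi / n_phase
    return cmath.rect(amp, phase)
```

The weakness of this scheme, which UPQ and BPQ address, is that phase resolution is wasted on low-amplitude components where phase errors matter little.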
Transcoding Algorithm for G.723.1 and AMR
Speech Coders: For Interoperability Between VoIP
and Mobile Networks
Sung-Wan Yoon, Jin-Kyu Choi, Hong-Goo Kang,
Dae-Hee Youn; Yonsei University, Korea
In this paper, an efficient transcoding algorithm between G.723.1
and AMR speech coders is proposed for providing interoperability
between IP and mobile networks. Transcoding is completed through
three processing steps: line spectral pair (LSP) conversion, pitch
interval conversion, and fast adaptive-codebook search. To maintain minimum distortion, parameters sensitive to quality, such as the adaptive and fixed codebooks, are re-estimated from synthesized
target signals. To reduce overall complexity, other parameters are directly converted at the parameter level without running through the complete decoding process. Objective and subjective preference tests verify that the proposed transcoding algorithm has quality equivalent to the conventional tandem approach. In addition, the proposed algorithm achieves a 20∼40% reduction in overall complexity over the tandem approach, with a shorter processing delay.
Quality-Complexity Trade-Off in Predictive LSF
Quantization
Davorka Petrinovic, Davor Petrinovic; University of
Zagreb, Croatia
In this paper, several techniques are investigated for reducing the complexity and/or improving the quality of line spectrum frequency (LSF) quantization based on switched prediction (SP) and vector quantization (VQ). For switched prediction, a higher number of prediction matrices is proposed. The quality of the quantized speech is improved by a multi-candidate prediction and delayed-decision algorithm. It is shown that quantizers with delayed decision can save up to one bit while still having similar or even lower complexity than the baseline quantizers with 2 switched matrices. By efficient implementation of the prediction, lower complexity can be achieved through the use of prediction matrices with a reduced number of non-zero elements. By combining such sparse matrices and multiple prediction candidates, quantizers with the best quality-complexity compromise can be obtained, as demonstrated by experimental results.
Variable Bit Rate Control with Trellis Diagram
Approximation
Kei Kikuiri, Nobuhiko Naka, Tomoyuki Ohya; NTT
DoCoMo Inc., Japan
In this paper, we present a variable bit rate control method for speech/audio coding under the constraint that the total bit rate of a super-frame is constant. The proposed method uses a trellis diagram to optimize the overall quality of the super-frame. In order to reduce the computational complexity, the trellis diagram is approximated by ignoring the encoder memory state between different paths. Simulations on the AMR Wideband codec show that the proposed variable bit rate control achieves up to 4.3 dB improvement over constant-rate coding in perceptually weighted SNR.
Multi-Rate Extension of the Scalable to Lossless
PSPIHT Audio Coder
Mohammed Raad 1, Ian Burnett 1, Alfred Mertins 2; 1 University of Wollongong, Australia; 2 University of Oldenburg, Germany
This paper extends a scalable-to-lossless compression scheme to allow scalability in terms of sampling rate as well as quantization resolution. The scheme presented is an extension of a perceptually scalable scheme that scales to lossless compression, producing smooth objective scalability, in terms of SNR, until lossless compression is achieved. The scheme is built around the Perceptual SPIHT algorithm, which is a modification of the SPIHT algorithm. An analysis of the expected limitations of scaling across sampling rates is given, as well as lossless compression results showing the competitive performance of the presented technique.
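SPIHT-style coders get their SNR scalability from bit-plane transmission: decoding more planes monotonically refines the signal. The toy below strips away SPIHT's set partitioning of wavelet coefficients and keeps only that bit-plane idea (a sketch, not the PSPIHT algorithm):

```python
def encode_bitplanes(samples, n_planes=8, peak=1.0):
    """Represent non-negative samples < peak as bit-planes, MSB first."""
    scale = 1 << n_planes
    q = [min(int(s / peak * scale), scale - 1) for s in samples]
    return [[(v >> p) & 1 for v in q] for p in range(n_planes - 1, -1, -1)]

def decode_bitplanes(planes, n_planes=8, peak=1.0):
    """Reconstruct from however many planes were received; a truncated
    bitstream still yields a coarser but valid reconstruction."""
    q = [0] * len(planes[0])
    for k, plane in enumerate(planes):
        p = n_planes - 1 - k
        for i, bit in enumerate(plane):
            q[i] |= bit << p
    # centre the reconstruction within the finest decoded cell
    step = 1 << (n_planes - len(planes))
    return [(v + step / 2) / (1 << n_planes) * peak for v in q]
```

Truncating the plane list halves the worst-case error with each extra plane kept, which is the "smooth objective scalability" the abstract refers to; lossless operation corresponds to keeping every plane of an integer representation.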
Entropy Constrained Quantization of LSP
Parameters
Turaj Zakizadeh Shabestary, Per Hedelin, Fredrik
Nordén; Chalmers University of Technology, Sweden
Conventional procedures for spectrum coding for speech address
fixed rate coding. For the variable rate case, we develop spectrum
coding based on constrained entropy quantization. Our approach
integrates high rate theory for Gaussian mixture modeling with lattices based on line spectrum pairs. The overall procedure utilizes
a union of several lattices in order to enhance performance and to
comply with source statistics. We provide experimental results in
terms of SD for different conditions and compare these with high
rate lower bounds. One major advantage of our coding system concerns adaptivity: one design can operate at a variety of rates without re-training.
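The core mechanics of constrained-entropy quantization can be shown in one function: the encoder picks the codeword minimizing distortion plus a Lagrangian rate term, where the rate of a codeword is its code length, -log2 of its probability (a toy scalar sketch; the paper works with GMM-driven lattices over line spectrum pairs):

```python
import math

def ec_quantize(x, codewords, probs, lam=0.05):
    """Entropy-constrained quantization: choose the index minimizing
    squared error + lam * code length, so that frequently used
    codewords (short codes) are cheaper to select."""
    def cost(i):
        rate = -math.log2(probs[i])      # ideal code length in bits
        return (x - codewords[i]) ** 2 + lam * rate
    return min(range(len(codewords)), key=cost)
```

Sweeping `lam` trades distortion against average rate, which is how one design can serve a range of operating rates without re-training the codebook.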
Session: PTuCf– Poster
Speech Recognition - Search & Lexicon
Modeling
Time: Tuesday 13.30, Venue: Main Hall, Level -1
Chair: Hermann Ney, Aachen University of Technology, Germany
Named Entity Extraction from Japanese Broadcast
News
Akio Kobayashi 1, Franz J. Och 2, Hermann Ney 3; 1 NHK Science & Technical Research Laboratories, Japan; 2 University of Southern California, USA; 3 RWTH Aachen, Germany
This paper describes a method for named entity extraction from Japanese broadcast news. Our proposed named entity tagger assigns entity categories to every character in order to deal correctly with unknown words and entities. This character-based tagger uses models built by maximum entropy modeling. We discuss the efficiency of the proposed tagger by comparison with a conventional word-based tagger. The results indicate that the capability of the taggers depends on the entity categories. Therefore, features derived from both character and word contexts are required to obtain high named entity extraction performance.
Towards Optimal Encoding for Classification with
Applications to Distributed Speech Recognition
Naveen Srinivasamurthy, Antonio Ortega, Shrikanth
Narayanan; University of Southern California, USA
In distributed classification applications, due to computational constraints, data acquired by low-complexity clients is compressed and transmitted to a remote server for classification. In this paper, the design of optimal quantization for distributed classification applications is considered and evaluated in the context of a speech recognition task. The proposed encoder minimizes the detrimental effect compression has on classification performance. Specifically, the proposed methods concentrate on designing low-dimension encoders, where individual encoders independently quantize sub-dimensions of a high-dimension vector used for classification. The main novelty of the work is the introduction of mutual information as a metric for designing compression algorithms in classification applications. Given a rate constraint, the proposed algorithm minimizes the mutual information loss due to compression; alternatively, it ensures that the compressed data used for classification retains maximal information about the class labels. An iterative empirical algorithm (similar to the Lloyd algorithm) is provided to design quantizers for this new distortion measure. Additionally, mutual information is also used to propose a rate-allocation scheme where rates are allocated to the sub-dimensions of a vector (which are independently encoded) to satisfy a given rate constraint. The results obtained indicate that mutual information is a better metric than mean square error for optimizing encoders used in distributed classification applications. In a distributed spoken names recognition task, the proposed mutual-information-based rate allocation reduces the compression-induced increase in WER by a factor of six compared to a heuristic rate allocation.
Morpheme-Based Lexical Modeling for Korean
Broadcast News Transcription
Young-Hee Park, Dong-Hoon Ahn, Minhwa Chung;
Sogang University, Korea
In this paper, we describe our LVCSR system for Korean broadcast
news transcription. The main focus here is to find the most proper
morpheme-based lexical model for Korean broadcast news recognition to deal with the inflectional flexibility of Korean. Since there are trade-offs between lexicon size and lexical coverage, and between the length of the lexical unit and WER, in our system we analyzed the training corpus to obtain a compact 24k-morpheme-based lexicon with 98.8% coverage. Then, the lexicon is optimized by combining morphemes using statistics of the training corpus under a monosyllable constraint or a maximum-length constraint. In experiments, our
system reduced the number of monosyllable morphemes, which are the most error-prone, from 52% to 29% of the lexicon, and obtained 13.24% WER for anchor speech and 24.97% for reporter speech.
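The morpheme-combining step can be illustrated with a greedy, BPE-like procedure: repeatedly merge the most frequent adjacent morpheme pair into one lexicon unit, subject to a maximum-length constraint (a rough stand-in for the paper's statistics-driven optimization; the toy morphemes below are invented):

```python
from collections import Counter

def recompound(corpus, n_merges=2, max_len=6):
    """Greedily merge the most frequent adjacent morpheme pair in a
    morpheme-segmented corpus, skipping merges whose result would
    exceed max_len characters."""
    corpus = [list(sent) for sent in corpus]
    for _ in range(n_merges):
        pairs = Counter()
        for sent in corpus:
            for a, b in zip(sent, sent[1:]):
                if len(a) + len(b) <= max_len:
                    pairs[(a, b)] += 1
        if not pairs:
            break
        (a, b), _count = pairs.most_common(1)[0]
        merged, new_corpus = a + b, []
        for sent in corpus:
            out, i = [], 0
            while i < len(sent):
                if i + 1 < len(sent) and sent[i] == a and sent[i + 1] == b:
                    out.append(merged)
                    i += 2
                else:
                    out.append(sent[i])
                    i += 1
            new_corpus.append(out)
        corpus = new_corpus
    return corpus
```

Merging frequent short units lengthens the average lexical unit (helping WER) while the length cap keeps the lexicon from ballooning.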
Data Driven Example Based Continuous Speech
Recognition
September 1-4, 2003 – Geneva, Switzerland
shown to give good coverage on all four languages and represent
a large set of shared sub-word models. For all experiments, the
acoustic models are trained from scratch in order not to use any
prior phonetic knowledge.
Finally, we show that for the Dutch and German tasks, the presented
approach works well and may also help do decrease the word error
rate below that obtained by monolingual acoustic models. For all
four languages, adding language questions to the multilingual decision tree helps to improve the word error rate.
Mathias De Wachter, Kris Demuynck, Dirk Van
Compernolle, Patrick Wambacq; Katholieke
Universiteit Leuven, Belgium
The dominant acoustic modeling methodology based on Hidden
Markov Models is known to have certain weaknesses. Partial solutions to these flaws have been presented, but the fundamental
problem remains: compression of the data to a compact HMM discards useful information such as time dependencies and speaker
information. In this paper, we look at pure example based recognition as a solution to this problem. By replacing the HMM with
the underlying examples, all information in the training data is retained. We show how information about speaker and environment
can be used, introducing a new interpretation of adaptation. The basis for the recognizer is the well-known DTW algorithm, which has
often been used for small tasks. However, large vocabulary speech
recognition introduces new demands, resulting in an explosion of
the search space. We show how this problem can be tackled using
a data driven approach which selects appropriate speech examples
as candidates for DTW-alignment.
A Cross-Media Retrieval System for Lecture Videos
Atsushi Fujii 1 , Katunobu Itou 2 , Tomoyosi Akiba 2 ,
Tetsuya Ishikawa 1 ; 1 University of Tsukuba, Japan;
2
AIST, Japan
We propose a cross-media lecture-on-demand system, in which
users can selectively view specific segments of lecture videos by
submitting text queries. Users can easily formulate queries by using the textbook associated with a target lecture, even if they cannot
come up with effective keywords. Our system extracts the audio
track from a target lecture video, generates a transcription by large
vocabulary continuous speech recognition, and produces a text index. Experimental results showed that by adapting speech recognition to the topic of the lecture, the recognition accuracy increased
and the retrieval accuracy was comparable with that obtained by
human transcription.
Large Vocabulary Speaker Independent Isolated
Word Recognition for Embedded Systems
Building a Test Collection for Speech-Driven Web
Retrieval
Sergey Astrov, Bernt Andrassy; Siemens AG, Germany
Atsushi Fujii 1 , Katunobu Itou 2 ; 1 University of
Tsukuba, Japan; 2 AIST, Japan
In this paper the implementation of a word-stem based tree search
for large vocabulary speaker independent isolated word recognition
for embedded systems is presented. Two fast search algorithms
combine the effectiveness of the tree structure for large vocabularies and the fast Viterbi search within the regular structures of
word-stems. The algorithms are proved to be very effective for
workstation and embedded platform realizations. In order to decrease the processing power the word-stem based tree search with
frame dropping approach is used. The recognition speed was increased by a factor of 5 without frame dropping and by a factor of
10 with frame dropping in comparison to linear Viterbi search for
isolated word recognition task with a vocabulary of 20102 words.
Thus, the large vocabulary isolated word recognition becomes possible for embedded systems.
Low-Latency Incremental Speech Transcription in
the Synface Project
Alexander Seward; KTH, Sweden
In this paper, a real-time decoder for low-latency online speech transcription is presented. The system was developed within the Synface project, which aims to improve the possibilities for hard of
hearing people to use conventional telephony by providing speechsynchronized multimodal feedback. This paper addresses the specific issues related to HMM-based incremental phone classification
with real-time constraints. The decoding algorithm described in
this work enables a trade-off to be made between improved recognition accuracy and reduced latency. By accepting a longer latency
per output increment, more time can be ascribed to hypothesis
look-ahead and by that improve classification accuracy. Experiments performed on the Swedish SpeechDat database show that
it is possible to generate the same classification as is produced by
non-incremental decoding using HTK, by adopting a latency of approx. 150 ms or more.
Multilingual Acoustic Modeling Using Graphemes
This paper describes a test collection (benchmark data) for retrieval
systems driven by spoken queries. This collection was produced in
the subtask of the NTCIR-3 Web retrieval task, which was performed
in a TREC-style evaluation workshop. The search topics and document collection for the Web retrieval task were used to produce
spoken queries and language models for speech recognition, respectively. We used this collection to evaluate the performance of our
retrieval system. Experimental results showed that (a) the use of
target documents for language modeling and (b) enhancement of
the vocabulary size in speech recognition were effective in improving the system performance.
Confidence Measure Driven Scalable Two-Pass
Recognition Strategy for Large List Grammars
Miroslav Novak 1 , Diego Ruiz 2 ; 1 IBM T.J. Watson
Reseach Center, USA; 2 Université Catholique de
Louvain, Belgium
In this article we will discuss recognition performance on large list
grammars, a class of tasks often encountered in telephony applications. In these tasks, the user makes a selection from a large list
of choices (e.g. stock quotes, yellow pages, etc). Though the redundancy of the complete utterance is often high enough to achieve
high recognition accuracy, large search space presents a challenge
for the recognizer, in particular, when real-time, low latency performance is required. We propose a confidence measure driven
two-pass search strategy, exploiting the high mutual information
between grammar states to improve pruning efficiency while minimizing the need for memory.
An Efficient, Fast Matching Approach Using
Posterior Probability Estimates in Speech
Recognition
Sherif Abdou, Michael S. Scordilis; University of
Miami, USA
S. Kanthak, Hermann Ney; RWTH Aachen, Germany
In this paper we combine grapheme-based sub-word units with multilingual acoustic modeling. We show that a global decision tree together with automatically generated grapheme questions eliminate
manual effort completely. We also investigate the effects of additional language questions.
We present experimental results on four corpora with different languages, namely the Dutch and French ARISE corpus, the Italian EUTRANS corpus and the German VERBMOBIL corpus. Graphemes are
Acoustic fast matching is an effective technique to accelerate the
search process in large vocabulary continuous speech recognition.
This paper introduces a novel fast matching method. This method
is based on the evaluation of future posterior probabilities for
a look-ahead number of timeframes in order to exclude unlikely
phone models as early as possible during the search. In contrast to
the likelihood scores used by more traditional fast matching methods these posterior probabilities are more discriminative by nature
40
Eurospeech 2003
Tuesday
as they sum up to unity over all the possible models. By applying the
proposed method we managed to reduce by 66% the decoding time
consumed in our time-synchronous Viterbi decoder for a recognition task based on the Wall Street Journal database with virtually
no additional decoding errors.
On Lexicon Creation for Turkish LVCSR
Kadri Hacioglu 1 , Bryan Pellom 1 , Tolga Ciloglu 2 ,
Ozlem Ozturk 2 , Mikko Kurimo 3 , Mathias Creutz 3 ;
1
University of Colorado at Boulder, USA; 2 Middle East
Technical University, Turkey; 3 Helsinki University of
Technology, Finland
Although multiple cues, such as different signal processing techniques and feature representations, have been used in speech recognition in adverse acoustic environment, how to maximally utilize
the benefit of these cues is largely unsolved. In this paper, a novel
search strategy is proposed. During parallel decoding of different
feature streams, the intermediate outputs are cross-referenced to
reduce pruning errors. Experiment results show this method significantly improved recognition performance on a noisy large vocabulary continuous speech task.
Design of the CMU Sphinx-4 Decoder
In this paper, we address the lexicon design problem in Turkish
large vocabulary speech recognition. Although we focus only on
Turkish, the methods described here are general enough that they
can be considered for other agglutinative languages like Finnish,
Korean etc. In an agglutinative language, several words can be created from a single root word using a rich collection of morphological
rules. So, a virtually infinite size lexicon is required to cover the language if words are used as the basic units. The standard approach
to this problem is to discover a number of primitive units so that a
large set of words can be created by compounding those units. Two
broad classes of methods are available for splitting words into their
sub-units; morphology-based and data-driven methods. Although
the word splitting significantly reduces the out of vocabulary rate,
it shrinks the context and increases acoustic confusibility. We have
used two methods to address the latter. In one method, we use word
counts to avoid splitting of high frequency lexical units, and in the
other method, we recompound splits according to a probabilistic
measure. We present experimental results that show the methods
are very effective to lower the word error rate at the expense of lexicon size.
Paul Lamere 1, Philip Kwok 1, William Walker 1, Evandro Gouvêa 2, Rita Singh 2, Bhiksha Raj 3, Peter Wolf 3; 1 Sun Microsystems Laboratories, USA; 2 Carnegie Mellon University, USA; 3 Mitsubishi Electric Research Laboratories, USA
Sphinx-4 is an open source HMM-based speech recognition system written in the Java™ programming language. The design of the Sphinx-4 decoder incorporates several new features in response to current demands on HMM-based large vocabulary systems. Some new design aspects include graph construction for multilevel parallel decoding with multiple feature streams without the use of compound HMMs; the incorporation of a generalized search algorithm that subsumes Viterbi decoding as a special case; token-stack decoding for efficient maintenance of multiple paths during search; and the design of a generalized language HMM graph from grammars and language models in multiple standard formats that can potentially toggle between a flat search structure, a tree search structure, etc. This paper describes a few of these design aspects and reports some preliminary performance measures for speed and accuracy.
Compiling Large-Context Phonetic Decision Trees into Finite-State Transducers
Stanley F. Chen; IBM T.J. Watson Research Center, USA
Recent work has shown that the use of finite-state transducers (FSTs) has many advantages in large vocabulary speech recognition. Most past work has focused on the use of triphone phonetic decision trees. However, numerous applications use decision trees that condition on wider contexts; for example, many systems at IBM use 11-phone phonetic decision trees. Alas, large-context phonetic decision trees cannot be compiled straightforwardly into FSTs due to memory constraints. In this work, we discuss memory-efficient techniques for manipulating large-context phonetic decision trees in the FST framework. First, we describe a lazy expansion technique that is applicable when expanding small word graphs. For general applications, we discuss how to construct large-context transducers via a sequence of simple, efficient finite-state operations; we also introduce a memory-efficient implementation of determinization.
A New Decoder Design for Large Vocabulary Turkish Speech Recognition
Onur Çilingir 1, Mübeccel Demirekler 2; 1 TÜBİTAK BİLTEN, Turkey; 2 Middle East Technical University, Turkey
An important problem in large vocabulary speech recognition for agglutinative languages like Turkish is the high out-of-vocabulary (OOV) rate caused by the extensive number of distinct words. Recognition systems using words as the basic lexical elements have difficulty dealing with such a virtually unlimited vocabulary. We propose a new time-synchronous lexical tree decoder design using morphemes as the lexical elements. A key feature of the proposed decoder is the dynamic generation of the lexical tree according to the morphological rules. The architecture emulates word generation in the language and therefore allows very large vocabularies through the defined set of morphemes and morphotactical rules.
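As a toy illustration of generating a large vocabulary from a small morpheme inventory plus morphotactic rules (the inventory, class names, and rule table below are invented; real Turkish morphotactics are far richer):

```python
# Toy sketch: enumerate word forms from morphemes plus morphotactic
# successor rules. All morphemes, classes, and rules are hypothetical.

MORPHOTACTICS = {              # which morpheme classes may follow which
    "STEM": {"PLURAL", "CASE", "END"},
    "PLURAL": {"CASE", "END"},
    "CASE": {"END"},
}

LEXICON = {                    # morpheme -> morpheme class
    "ev": "STEM", "göz": "STEM", "ler": "PLURAL", "de": "CASE",
}

def expand(prefix, cls):
    """Recursively generate the word forms reachable from a morpheme class."""
    words = []
    if "END" in MORPHOTACTICS.get(cls, set()):
        words.append(prefix)                      # a legal stopping point
    for morph, mcls in LEXICON.items():
        if mcls in MORPHOTACTICS.get(cls, set()):
            words.extend(expand(prefix + morph, mcls))
    return words

def vocabulary():
    """All word forms generated from every stem."""
    return sorted(w for m, c in LEXICON.items() if c == "STEM"
                  for w in expand(m, c))
```

With four morphemes and three rules this already yields eight distinct word forms, which is the effect the decoder exploits at much larger scale.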
Session: PTuCg– Poster
Speech Technology Applications
Time: Tuesday 13.30, Venue: Main Hall, Level -1
Chair: Jerome Bellegarda, Spoken Language Group, Apple Computer, Inc., USA
Automatic Summarization of Broadcast News Using Structural Features
Sameer Raj Maskey, Julia Hirschberg; Columbia University, USA
We present a method for summarizing broadcast news that is not affected by word errors in an automatic speech recognition transcription, using information about the structure of the news program. We construct a directed graphical model to represent the probability distribution and dependencies among the structural features, which we train by estimating the parameters of the conditional probability tables. We then rank segments of the test set and extract the highest-ranked ones as a summary. We present the procedure and preliminary test results.
A Dynamic Cross-Reference Pruning Strategy for
Multiple Feature Fusion at Decoder Run Time
Yonghong Yan 1 , Chengyi Zheng 1 , Jianping Zhang 2 ,
Jielin Pan 2 , Jiang Han 2 , Jian Liu 2 ; 1 Oregon Health &
Science University, USA; 2 Chinese Academy of
Sciences, China
September 1-4, 2003 – Geneva, Switzerland
Automatic Speech Recognition with Sparse
Training Data for Dysarthric Speakers
Phil Green, James Carmichael, Athanassios Hatzis,
Pam Enderby, Mark Hawley, Mark Parker; University
of Sheffield, U.K.
We describe an unusual ASR application: recognition of command words from severely dysarthric speakers, who have poor control of their articulators. The goal is to allow these clients to control assistive technology by voice. While this is a small-vocabulary, speaker-dependent, isolated-word application, the speech material is more variable than normal, and only a small amount of data is available for training. After training a CDHMM recogniser, it is necessary to predict its likely performance without using an independent test set, so that confusable words can be replaced by alternatives. We present a battery of measures of consistency and confusability, based on forced alignment, which can be used to predict recogniser
performance. We show how these measures perform, and how they
are presented to the clinicians who are the users of the system.
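One plausible shape for such a confusability measure, not the authors' actual metric, is to compare matched-model and cross-model forced-alignment scores and flag word pairs whose margin is small (all scores below are invented):

```python
def confusable_pairs(scores, margin=1.0):
    """scores[w][v]: mean forced-alignment log-likelihood of word v's
    recordings scored under word w's model (hypothetical numbers).
    A pair is flagged as confusable when a cross-model score comes
    within `margin` of the corresponding matched-model score."""
    pairs = set()
    for w in scores:
        for v in scores:
            if w < v and (scores[w][v] >= scores[v][v] - margin
                          or scores[v][w] >= scores[w][w] - margin):
                pairs.add((w, v))
    return pairs
```

Flagged pairs would then be candidates for replacement by less confusable alternatives, as the abstract suggests.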
Evaluating Multiple LVCSR Model Combination in NTCIR-3 Speech-Driven Web Retrieval Task
Masahiko Matsushita 1, Hiromitsu Nishizaki 1, Takehito Utsuro 2, Yasuhiro Kodama 1, Seiichi Nakagawa 1; 1 Toyohashi University of Technology, Japan; 2 Kyoto University, Japan
This paper studies speech-driven Web retrieval models which accept spoken search topics (queries) in the NTCIR-3 Web retrieval task. The major focus of this paper is on improving the speech recognition accuracy of spoken queries and thereby improving retrieval accuracy in speech-driven Web retrieval. We experimentally evaluate techniques for combining the outputs of multiple LVCSR models in recognition of spoken queries. As model combination techniques, we compare the SVM learning technique and conventional voting schemes such as ROVER. We show that multiple LVCSR model combination can achieve improvement in both speech recognition and retrieval accuracies in speech-driven text retrieval. We also show that model combination by SVM learning outperforms conventional voting schemes in both speech recognition and retrieval accuracies.
Prediction of Sentence Importance for Speech Summarization Using Prosodic Parameters
Akira Inoue, Takayoshi Mikami, Yoichi Yamashita; Ritsumeikan University, Japan
Recent improvements in computer systems are increasing the amount of accessible speech data. Since the speech medium is not appropriate for quick scanning, the development of automatic summarization of lecture or meeting speech is expected. Spoken messages contain non-linguistic information, which is mainly expressed by prosody, while written text conveys only linguistic information. There are possibilities that the prosodic information can improve the quality of speech summarization. This paper describes a technique that uses prosodic parameters as well as linguistic information to identify important sentences for speech summarization. Several prosodic parameters relating to F0, power and duration are extracted for each sentence in lecture speech. The importance of each sentence is predicted from the prosodic parameters and the linguistic information. We also tried combining the prosodic parameters and the linguistic information by multiple regression analysis. The proposed methods are evaluated both on the correlation between the predicted scores of sentence importance and the preference scores given by subjects, and on the accuracy of extraction of important sentences. The combination of the prosodic parameters improves the quality of speech summarization.
An Automatic Singing Transcription System with Multilingual Singing Lyric Recognizer and Robust Melody Tracker
Chong-kai Wang 1, Ren-Yuan Lyu 1, Yuang-Chin Chiang 2; 1 Chang Gung University, Taiwan; 2 National Tsing Hua University, Taiwan
A singing transcription system which transcribes the human singing voice into musical notes is described in this paper. The fact that human singing rarely follows the standard musical scale makes it a challenge to implement such a system. This system utilizes several new methods to deal with the imprecise musical scale of a human singer's input voice: spectral standard deviation is used for note segmentation, Adaptive Round Semitone is used for melody tracking, and a Tune Map acts as a musical grammar constraint in melody tracking. Furthermore, a large vocabulary speech recognizer performing the lyric recognition task is also added, which is a new trial in a singing transcription system.
Speech Shift: Direct Speech-Input-Mode Switching Through Intentional Control of Voice Pitch
Masataka Goto 1, Yukihiro Omoto 2, Katunobu Itou 1, Tetsunori Kobayashi 2; 1 AIST, Japan; 2 Waseda University, Japan
This paper describes a speech-input interface function, called speech shift, that enables a user to specify a speech-input mode by simply changing (shifting) voice pitch. While current speech-input interfaces have used only verbal information, we aimed at building a more user-friendly speech interface by making use of nonverbal information, the voice pitch. By intentionally controlling the pitch, a user can enter the same word with different meanings (functions) without explicitly changing the speech-input mode. Our speech-shift function implemented on a voice-enabled word processor, for example, can distinguish an utterance with a high pitch from one with a normal (low) pitch, and regard the former as voice-command-mode input (such as file-menu and edit-menu commands) and the latter as regular dictation-mode text input. Our experimental results from twenty subjects showed that the speech-shift function is effective, easy to use, and a labor-saving input method.
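The ROVER-style voting evaluated in the Web-retrieval abstract above can be sketched minimally as word-level majority voting over aligned hypotheses (real ROVER also performs the alignment itself and can weight words by confidence; this sketch assumes alignment is already done and uses '@' for deletions):

```python
from collections import Counter

def rover_vote(aligned_hyps):
    """Word-level majority voting over recogniser hypotheses that are
    already aligned position-by-position ('@' marks a deletion): a
    simplified stand-in for the voting module of ROVER."""
    result = []
    for slot in zip(*aligned_hyps):
        word, _ = Counter(slot).most_common(1)[0]
        if word != "@":          # the majority voted for a deletion
            result.append(word)
    return result
```

Each output word is simply the plurality choice in its alignment slot, which is why combining systems with complementary errors helps.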
Semantic Object Synchronous Understanding in
SALT for Highly Interactive User Interface
Kuansan Wang; Microsoft Research, USA
SALT is an industrial standard that enables speech input/output for Web applications. Although the core design goal is to make simple tasks easy, SALT gives designers ample fine-grained controls to create advanced user interfaces. This paper exploits a speech input mode in which SALT dynamically reports partial semantic parses while audio capture is still ongoing. The semantic parses can be evaluated and the outcome reported back to the user immediately. The potential impact for dialog systems is that tasks conventionally performed in a system turn can now be carried out in the midst of a user turn, thereby presenting a significant departure from conventional turn-taking. To assess the efficacy of such a highly interactive interface, more user studies are undoubtedly needed. This paper demonstrates how SALT can be employed to facilitate such studies.
Information Retrieval Based Call Classification
Jan Kneissler, Anne K. Kienappel, Dietrich Klakow;
Philips Research Laboratories, Germany
In this paper we describe a fully automatic call classification system for customer service selection. Call classification is based on one customer utterance following a “How may I help you” prompt. In particular, we introduce two new elements to our information retrieval based call classifier which significantly improve the classification accuracy: the use of a-priori term relevance based on class information, and classification confidence estimation. We describe the spontaneous speech recognizer as well as the classifier, and investigate correlations between speech recognition and call classification accuracy.
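A minimal sketch of an information-retrieval-style call router with class-conditional term weighting, in the spirit of, but not identical to, the a-priori term relevance described above (the training data and weighting formula below are invented):

```python
import math
from collections import Counter, defaultdict

def train(routed_calls):
    """routed_calls: list of (utterance, class) pairs.
    Returns per-class term weights from a simple relevance score:
    how much more often a term occurs inside a class than outside it
    (a hypothetical stand-in for a-priori term relevance)."""
    class_tf = defaultdict(Counter)
    total_tf = Counter()
    for text, cls in routed_calls:
        for term in text.lower().split():
            class_tf[cls][term] += 1
            total_tf[term] += 1
    weights = {}
    for cls, tf in class_tf.items():
        weights[cls] = {t: math.log((c + 0.5) / (total_tf[t] - c + 0.5))
                        for t, c in tf.items()}
    return weights

def classify(weights, utterance):
    """Route an utterance to the class with the highest summed term weight."""
    terms = utterance.lower().split()
    return max(weights, key=lambda c: sum(weights[c].get(t, 0.0) for t in terms))
```

A confidence estimate, as in the abstract, could then be derived from the score margin between the best and second-best class.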
Using Syllable-Based Indexing Features and
Language Models to Improve German Spoken
Document Retrieval
Martha Larson, Stefan Eickeler; Fraunhofer Institute
for Media Communication, Germany
Spoken document collections with high word-type/word-token ratios and heterogeneous audio continue to constitute a challenge for information retrieval. The experimental results reported in this paper demonstrate that syllable-based indexing features can outperform word-based indexing features on such a domain, and that syllable-based speech recognition language models can successfully be used to generate syllable-based indexing features. Recognition is carried out with a 5k syllable language model and a 10k mixed-unit language model whose vocabulary consists of a mixture of words and syllables. Both language models make retrieval performance possible that is comparable to that attained when a large vocabulary word-based language model is used. Experiments are performed on a spoken document collection consisting of short German-language radio documentaries. First, the vector space model is applied to a known-item retrieval task and a similar-document search. Then, the known-item retrieval task is further explored with a Levenshtein-distance-based fuzzy word match.
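The Levenshtein-distance-based fuzzy match mentioned above rests on the classic dynamic-programming edit distance, which can be sketched as:

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance (insert, delete,
    substitute, all with unit cost), keeping only one previous row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,            # deletion from a
                           cur[j - 1] + 1,         # insertion into a
                           prev[j - 1] + (ca != cb)))  # (mis)match
        prev = cur
    return prev[-1]

def fuzzy_match(query, index_terms, max_dist=2):
    """Return index terms within max_dist edits of the query term."""
    return [t for t in index_terms if levenshtein(query, t) <= max_dist]
```

This tolerates small recognition or inflection differences between query terms and index terms, which matters for the highly inflected and compounding German vocabulary the abstract describes.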
An Empirical Text Transformation Method for
Spontaneous Speech Synthesizers
Shiva Sundaram, Shrikanth Narayanan; University of
Southern California, USA
Spontaneously spoken utterances are characterized by a number of lexical and non-lexical features. These features can also reflect speaker-specific characteristics. A major factor that discriminates spontaneous speech from written text is the presence of paralinguistic features such as filled pauses (fillers), false starts, laughter, disfluencies and discourse markers that are beyond the framework of formal grammars. The speech recognition community has dealt with these variabilities by making provisions for them in language models, to improve recognition accuracy for spoken language. In another scenario, the analysis of these features could also be used for language processing/generation for the overall improvement of synthesized speech or machine response. Such synthesized spontaneous speech could be used for computer avatars and Speech User Interfaces (SUIs) where lengthy interactions with machines occur, and it is generally desired to mimic a particular speaker or speaking style. This problem of language generation involves capturing general characteristics of spontaneous speech and also speaker-specific traits. The usefulness of conventional language processing tools is limited by the availability of a training corpus. Hence an empirical text processing technique with ideas motivated by psycholinguistics is proposed. Such an empirical technique could be included in the text analysis stage of a TTS system. The proposed technique is adaptable: it can be extended to mimic different speakers based on an individual's speaking style and filler preferences.
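A toy version of such a text transformation, with an invented filler inventory and insertion probability, could look like:

```python
import random

def spontaneize(text, fillers=("um,", "you know,"), p=0.5, seed=7):
    """Toy transformation: probabilistically prepend a filled pause to
    each comma-separated clause, imitating one speaker's filler habits.
    The filler inventory and insertion probability are invented here;
    in the abstract's setting they would be estimated per speaker."""
    rng = random.Random(seed)          # seeded for reproducibility
    out = []
    for clause in text.split(", "):
        if rng.random() < p:
            out.append(rng.choice(fillers))
        out.append(clause)
    return " ".join(out)
```

The output string keeps every original word while interleaving speaker-style fillers, which is the kind of pre-synthesis text rewriting the abstract proposes for the TTS front end.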
A New Approach to Reducing Alarm Noise in Speech
Yilmaz Gül 1, Aladdin M. Ariyaeeinia 2, Oliver Dewhirst 1; 1 Fulcrum Voice Technologies, U.K.; 2 University of Hertfordshire, U.K.
This paper presents a new single-channel noise reduction method for suppressing periodic alarm noise in telephony speech. The presence of background alarm noise can significantly detract from the intelligibility of telephony speech received by emergency services, and in particular by fire brigade control rooms. The attraction of the proposed approach is that it targets the alarm noise without affecting the speech signal. This is achieved by discriminating the alarm noise through appropriate modelling of the contaminated speech. The effectiveness of this method is confirmed experimentally using a set of real speech data collected by the Kent Fire Brigade HQ (UK).
Speech Starter: Noise-Robust Endpoint Detection by Using Filled Pauses
Koji Kitayama 1, Masataka Goto 2, Katunobu Itou 2, Tetsunori Kobayashi 1; 1 Waseda University, Japan; 2 AIST, Japan
In this paper we propose a speech interface function, called speech starter, that enables noise-robust endpoint (utterance) detection for speech recognition. When current speech recognizers are used in a noisy environment, a typical recognition error is caused by incorrect endpoints because their automatic detection is likely to be disturbed by non-stationary noises. The speech starter function enables a user to specify the beginning of each utterance by uttering a filler with a filled pause, which is used as a trigger to start speech-recognition processes. Since filled pauses can be detected robustly in a noisy environment, practical endpoint detection is achieved. Speech starter also offers the advantage of providing a hands-free speech interface, and it is user-friendly because a speaker tends to utter filled pauses (e.g., “er. . .”) at the beginning of utterances when hesitating in human-human communication. Experimental results from a 10-dB-SNR noisy environment show that the recognition error rate with speech starter was lower than with conventional endpoint-detection methods.
Improved Name Recognition with User Modeling
Dong Yu, Kuansan Wang, Milind Mahajan, Peter Mau, Alex Acero; Microsoft Research, USA
Speech recognition of names in Personal Information Management (PIM) systems is an important yet difficult task. The difficulty arises from various sources: the large number of possible names that users may speak, the different ways a person may be referred to, ambiguity when only first names are used, and mismatched pronunciations. In this paper we present our recent work on name recognition with User Modeling (UM), i.e., automatic modeling of a user's behavior patterns. We show that UM and our learning algorithm lead to significant improvements in perplexity, out-of-vocabulary rate, recognition speed, and accuracy of the top recognized candidate. The use of an exponential window reduces the perplexity by more than 30%.
Speech Recognition Over Bluetooth Wireless Channels
Ziad Al Bawab, Ivo Locher, Jianxia Xue, Abeer Alwan; University of California at Los Angeles, USA
This paper studies the effect of Bluetooth wireless channels on distributed speech recognition. An approach for implementing speech recognition over Bluetooth is described. We simulate a Bluetooth environment and then incorporate its performance, in the form of packet loss ratio, into the speech recognition system. We show how intelligent framing of speech feature vectors, extracted by a fixed-point arithmetic front-end, together with an interpolation technique for lost vectors, can lead to a 50.48% relative improvement in recognition accuracy. This is achieved at a distance of 10 meters, around the maximum operating distance between a Bluetooth transmitter and a Bluetooth receiver.
Automatic Segmentation of Film Dialogues into Phonemes and Graphemes
Gilles Boulianne, Jean-François Beaumont, Patrick Cardinal, Michel Comeau, Pierre Ouellet, Pierre Dumouchel; CRIM, Canada
In film post-production, efficient methods for re-recording a dialogue or dubbing in a new language require a precisely time-aligned text, with individual letters time-coded to video frame resolution. Currently, this time alignment is performed by experts in a painstaking and slow process.
To automate this process, we used CRIM's large vocabulary HMM speech recognizer as a phoneme segmenter and measured its accuracy on typical film extracts in French and English. Our results reveal several characteristics of film dialogues, in addition to noise, that affect segmentation accuracy, such as speaking style or reverberant recordings. Despite these difficulties, an HMM-based segmenter trained on clean speech can still provide more than 89% acceptable phoneme boundaries on typical film extracts.
We also propose a method which provides the correspondence between the aligned phonemes and the graphemes of the text. The method does not use explicit rules, but rather computes an optimal string alignment according to an edit-distance metric.
Together, HMM phoneme segmentation and phoneme-grapheme correspondence meet the needs of film post-production for a time-aligned text, and make it possible to automate a large part of the current post-synch process.
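The kind of edit-distance string alignment the last abstract describes can be sketched with a standard dynamic program (the authors' actual cost metric is not specified, so unit costs are assumed here, and '-' marks a gap):

```python
def align(phones, graphemes):
    """Global (Needleman-Wunsch style) alignment of a phoneme string
    to a grapheme string under unit edit costs. Returns a list of
    (phoneme, grapheme) pairs, with '-' for insertions/deletions."""
    n, m = len(phones), len(graphemes)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = i
    for j in range(1, m + 1):
        D[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = min(D[i - 1][j] + 1,            # phoneme unmatched
                          D[i][j - 1] + 1,            # grapheme unmatched
                          D[i - 1][j - 1] + (phones[i - 1] != graphemes[j - 1]))
    # backtrace to recover the optimal pairing
    pairs, i, j = [], n, m
    while i or j:
        if i and j and D[i][j] == D[i - 1][j - 1] + (phones[i - 1] != graphemes[j - 1]):
            pairs.append((phones[i - 1], graphemes[j - 1])); i -= 1; j -= 1
        elif i and D[i][j] == D[i - 1][j] + 1:
            pairs.append((phones[i - 1], "-")); i -= 1
        else:
            pairs.append(("-", graphemes[j - 1])); j -= 1
    return pairs[::-1]
```

Once each phoneme carries a time code from the segmenter, the pairing transfers those time codes to the letters of the script, which is the post-production requirement the abstract states.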
Automated Closed-Captioning of Live TV Broadcast
News in French
Julie Brousseau, Jean-François Beaumont, Gilles
Boulianne, Patrick Cardinal, Claude Chapdelaine,
Michel Comeau, Frédéric Osterrath, Pierre Ouellet;
CRIM, Canada
This paper describes the system currently under development at CRIM whose aim is to provide real-time closed captioning of live TV broadcast news in Canadian French. This project is done in collaboration with TVA Network, a national TV broadcaster, and the RQST (a Québec association which promotes the use of subtitling). The automated closed-captioning system will use CRIM's transducer-based large vocabulary French recognizer. The system will be totally integrated into the existing broadcaster's equipment and working methods. First “on-air” use will take place in February 2004.
Automatic Construction of Unique Signatures and Confusable Sets for Natural Language Directory Assistance Applications
E.E. Jan, Benoît Maison, Lidia Mangu, Geoffrey Zweig; IBM T.J. Watson Research Center, USA
This paper addresses the problem of building natural-language-based grammars and language models for directory assistance applications that use automatic speech recognition. As input, one is given an electronic version of a standard phone book, and the output is a grammar or language model that will accept all the ways in which one might ask for a particular listing. We focus primarily on the problem of processing listings for businesses and government offices, but our techniques can be used to speech-enable other kinds of large listings (like book titles, catalog entries, etc.). We have applied these techniques to the business listings of a state in the Midwestern United States, and we present highly encouraging recognition results.
Named Entity Extraction from Word Lattices
James Horlock, Simon King; University of Edinburgh, U.K.
We present a method for named entity extraction from word lattices produced by a speech recogniser. Previous work by others on named entity extraction from speech has used either a manual transcript or 1-best recogniser output. We describe how a single Viterbi search can recover both the named entity sequence and the corresponding word sequence from a word lattice, and further that it is possible to trade off an increase in word error rate for improved named entity extraction.
Recent Enhancements in CU VOCAL for Chinese TTS-Enabled Applications
Helen M. Meng, Yuk-Chi Li, Tien-Ying Fung, Man-Cheuk Ho, Chi-Kin Keung, Tin-Hang Lo, Wai-Kit Lo, P.C. Ching; Chinese University of Hong Kong, China
CU VOCAL is a Cantonese text-to-speech (TTS) engine. We use a syllable-based concatenative synthesis approach to generate intelligible and natural synthesized speech [1]. This paper describes several recent enhancements in CU VOCAL. First, we have augmented the syllable unit selection strategy with a positional feature. This feature specifies the relative location of a syllable in a sentence and serves to improve the quality of Cantonese tone realization. Second, we have developed the CU VOCAL SAPI engine, a version of the synthesizer that eases integration with applications using SAPI (Speech Application Programming Interface). We demonstrate the use of CU VOCAL SAPI in an electronic book (e-book) reader. Third, we have made an initial attempt to use the CU VOCAL SAPI engine in Web content authored with Speech Application Language Tags (SALT). The use of SALT tags can ease the task of invoking a Cantonese TTS service on webpages.
A Topic Classification System Based on Parametric Trajectory Mixture Models
William Belfield, Herbert Gish; BBN Technologies, USA
In this paper we address the problem of topic classification of speech data. Our concern is the situation in which no speech or phoneme recognizer is available for the domain of the speech data. In this situation the only inputs for training the system are audio speech files labeled according to the topics of interest. The process that we follow in developing the topic classifier is data segmentation followed by the representation of the segments by polynomial trajectory models. The clustering of acoustically similar segments enables us to train a trajectory Gaussian mixture model that is used to label segments of both on-topic and off-topic data, and the labeled data enable us to create topic classifiers. The advantage of the approach that we are pursuing is that it is language and domain independent. We evaluated the performance of our approach with several classifiers and demonstrated positive results.
Evaluation of an Alert System for Selective Dissemination of Broadcast News
Isabel Trancoso 1, João P. Neto 1, Hugo Meinedo 1, Rui Amaral 2; 1 INESC-ID/IST, Portugal; 2 INESC-ID/IPS, Portugal
This paper describes the evaluation of the system for selective dissemination of Broadcast News that we developed in the context of the European project ALERT. Each component of the main processing block of our system was evaluated separately, using the ALERT corpus. Likewise, the user interface was also evaluated separately. Besides this modular evaluation, which will be briefly mentioned here as a reference, the system can also be evaluated as a whole, in a field trial from the point of view of a potential user. This is the main topic of this paper. The analysis of the main sources of problems hinted at a large number of issues that must be dealt with in order to improve the performance. In spite of these pending problems, we believe that having a fully operational system is a must for being able to address user needs in the future in this type of service.
Session: OTuDa– Oral
Robust Speech Recognition - Front-end Processing
Time: Tuesday 16.00, Venue: Room 1
Chair: Sadaoki Furui, Tokyo Inst. of Technology, Japan
Model Based Noisy Speech Recognition with Environment Parameters Estimated by Noise Adaptive Speech Recognition with Prior
Kaisheng Yao 1, Kuldip K. Paliwal 2, Satoshi Nakamura 3; 1 University of California at San Diego, USA; 2 Griffith University, Australia; 3 ATR-SLT, Japan
We have earlier proposed a noise adaptive speech recognition approach for recognizing speech corrupted by nonstationary noise and channel distortion. In this paper, we extend this approach. Instead of maximum likelihood estimation of environment parameters (as done in our previous work), the present method estimates environment parameters within a Bayesian framework that is capable of incorporating prior knowledge of the environment. Experiments are conducted on a database that contains digit utterances contaminated by channel distortion and nonstationary noise. Results show that this method performs better than the previous methods.
Low Complexity Joint Optimization of Excitation Parameters in Analysis-by-Synthesis Speech Coding
U. Mittal, J.P. Ashley, E.M. Cruz-Zeno; Motorola Labs, USA
Codebook searches in analysis-by-synthesis speech coders typically involve minimization of a perceptually weighted squared error signal. Minimization of the error over multiple codebooks is often done in a sequential manner, resulting in a sub-optimal choice of overall excitation parameters. In this paper, we propose a joint excitation parameter optimization framework whose associated complexity is only slightly greater than that of traditional sequential optimization, but with significant quality improvement. Moreover, the framework allows joint optimization to be easily incorporated into existing pulse codebook systems with little or no impact on the codebook search algorithms.
A Harmonic-Model-Based Front End for Robust Speech Recognition
Michael L. Seltzer 1, Jasha Droppo 2, Alex Acero 2; 1 Carnegie Mellon University, USA; 2 Microsoft Research, USA
Speech recognition accuracy degrades significantly when the speech has been corrupted by noise, especially when the system has been trained on clean speech. Many compensation algorithms have been developed which require reliable online noise estimates or a priori knowledge of the noise. In situations where such estimates or knowledge is difficult to obtain, these methods fail. We present a new robustness algorithm which avoids these problems by making no assumptions about the corrupting noise. Instead, we exploit properties inherent to the speech signal itself to denoise the recognition features. In this method, speech is decomposed into harmonic and noise-like components, which are then processed independently and recombined. By processing noise-corrupted speech in this manner, we achieve significant improvements in recognition accuracy on the Aurora 2 task.
working scheme of CFABF consists of two steps: source location
calibration and target signal enhancement. The first step is to prerecord the transfer functions between speaker and microphone array from different potential source positions using adaptive beamforming under quiet environments; and the second step is to use
this pre-recorded information to enhance the desired speech when
the car is running on the road. An evaluation using extensive actual
car speech data from the CU-Move Corpus shows that the method
can decrease WER for speech recognition by up to 30% over a single
channel scenario.
A New Perspective on Feature Extraction for
Robust In-Vehicle Speech Recognition
Gerasimos Potamianos, Chalapathy Neti; IBM T.J.
Watson Research Center, USA
Umit H. Yapanel, John H.L. Hansen; University of
Colorado at Boulder, USA
Visual speech information is known to improve accuracy and noise
robustness of automatic speech recognizers. However, to-date, all
audio-visual ASR work has concentrated on “visually clean” data
with limited variation in the speaker’s frontal pose, lighting, and
background. In this paper, we investigate audiovisual ASR in two
practical environments that present significant challenges to robust
visual processing: (a) Typical offices, where data are recorded by
means of a portable PC equipped with an inexpensive web camera, and (b) automobiles, with data collected at three approximate
speeds. The performance of all components of a state-of-the-art
audio-visual ASR system is reported on these two sets and benchmarked against “visually clean” data recorded in a studio-like environment. Not surprisingly, both audio- and visual-only ASR degrade, more than doubling their respective word error rates. Nevertheless, visual speech remains beneficial to ASR.
The problem of reliable speech recognition for in-vehicle applications has recently emerged as a challenging research domain. This
study focuses on the feature extraction stage of this problem. The
approach is based on MinimumVariance Distortionless Response
(MVDR) spectrum estimation. MVDR is used for robustly estimating the envelope of the speech signal and shown to be very accurate
and relatively less sensitive to additive noise. The proposed feature
estimation process removes the traditional Mel-scaled filterbank as
a perceptually motivated frequency partitioning. Instead, we directly warp the FFT power spectrum of speech. The word error rate
(WER) is shown to decrease by 27.3% with respect to the MFCCs and
18.8% with respect to recently proposed PMCCs on an extended digit
recognition task in real car environments. The proposed feature estimation approach is called PMVDR and conclusively shown to be
a better speech representation in real environments with emphasis
on time-varying car noise.
Audio-Visual Speech Recognition in Challenging
Environments
Session: STuDb– Oral
Spoken Language Processing for e-Inclusion
Speech Recognition of Double Talk Using
SAFIA-Based Audio Segregation
Time: Tuesday 16.00, Venue: Room 2
Chair: Paul Dalsgaard, Center for PersonKommunikation (CPK)
Toshiyuki Sekiya, Tetsuji Ogawa, Tetsunori
Kobayashi; Waseda University, Japan
SYNFACE – A Talking Face Telephone
Double-talk recognition under a distant microphone condition, a
serious problem in speech applications in a real environment, is
realized through use of modified SAFIA and acoustic model adaptation or training.
The original SAFIA is a high-performance audio segregation method
based on band selection using two directivity microphones. We
have modified SAFIA by adopting array signal processing and have
realized optimal directivity for SAFIA. We also used generalized
harmonic analysis (GHA) instead of FFT for the spectral analysis
in SAFIA to remove the effect of windowing which causes soundquality degradation in SAFIA.
These modifications of SAFIA enable good segregation in a human
auditory sense, but the quality is still insufficient for recognition.
Because SAFIA causes some particular distortion, we used MLLRbased acoustic model adaptation and immunity training to be robust to the distortion of SAFIA. These efforts enabled 76.2% word
accuracy under the condition that the SN ratio is 0 dB, this represents a 45% reduction in the error obtained in the case where only
array signal processing was used, and a 30% error reduction compared with when only SAFIA-based audio segregation was used.
Inger Karlsson 1 , Andrew Faulkner 2 , Giampiero
Salvi 1 ; 1 KTH, Sweden; 2 University College London,
U.K.
The SYNFACE project has as its primary goal to facilitate for
hearing-impaired people to use an ordinary telephone. This will
be achieved by using a talking face connected to the telephone. The
incoming speech signal will govern the speech movements of the
talking face, hence the talking face will provide lip-reading support
for the user.
The project will define the visual speech information that supports
lip-reading, and develop techniques to derive this information from
the acoustic speech signal in near real time for three different languages: Dutch, English and Swedish. This requires the development
of automatic speech recognition methods that detect information
in the acoustic signal that correlates with the speech movements.
This information will govern the speech movements in a synthetic
face and synchronise them with the acoustic speech signal.
A prototype system is being constructed. The prototype contains
results achieved so far in SYNFACE. This system will be tested and
evaluated for the three languages by hearing-impaired users.
SYNFACE is an IST project (IST-2001-33327) with partners from
the Netherlands, UK and Sweden. SYNFACE builds on experiences
gained in the Swedish Teleface project.
CFA-BF: A Novel Combined Fixed/Adaptive Beamforming for Robust Speech Recognition in Real Car Environments
Xianxian Zhang, John H.L. Hansen; University of Colorado at Boulder, USA
Among the studies that have investigated various speech enhancement and processing schemes for in-vehicle speech systems, delay-and-sum beamforming (DASB) and adaptive beamforming are two typical methods, each with its own advantages and disadvantages. In this paper, we propose a novel combined fixed/adaptive beamforming solution (CFA-BF) based on previous work for speech enhancement and recognition in real moving-car environments, which seeks to take advantage of both methods.
A Voice-Driven Web Browser for Blind People
Boštjan Vesnicer, Janez Žibert, Simon Dobrišek, Nikola Pavešić, France Mihelič; University of Ljubljana, Slovenia
A small self-voicing Web browser designed for blind users is presented. The Web browser was built from the GTK Web browser Dillo, a free software project released under the GNU General Public License. Additional functionality has been introduced to the original browser in the form of separate modules. The browser operates in two modes: browsing mode and dialogue mode.
45
Eurospeech 2003
Tuesday
In browsing mode, the user navigates through the structure of Web pages using the mouse and/or keyboard. In dialogue mode, the dialogue module offers different actions, and the user chooses between them using either the keyboard or spoken commands, which are recognized by the speech-recognition module. The content of the page is presented to the user by the screen-reader module, which uses the text-to-speech module for its output.
The browser is capable of displaying all common Web pages that do not contain frames, Java, or Flash animations. However, the best performance is achieved when pages comply with the recommendations set by the WAI.
The browser was developed on the Linux operating system and later ported to the Windows 9x/ME/NT/2000/XP platform. Currently it is being tested by members of the Slovenian blind people's society. Any
suggestions or wishes from them will be considered for inclusion
in future versions of the browser.
Exploiting Speech for Recognizing Elderly Users to
Respond to Their Special Needs
Christian Müller, Frank Wittig, Jörg Baus; Saarland
University, Germany
In this paper we show how to exploit raw speech data to gain higher
level information about the user in a mobile context. In particular
we introduce an approach for the estimation of age and gender using well-known machine learning techniques. On the basis of this information, systems such as a mobile pedestrian navigation system can be adapted to the special needs of a specific user group (here, the elderly). First we motivate why we consider such an adaptation necessary, then we outline some
adaptation strategies that are adequate for mobile assistants. The
major part of the paper is about (a) identifying and extracting features of speech that are relevant for age and gender estimation and
(b) classifying a particular speaker, treating uncertainty, and updating the user model over time. Finally we provide a short outlook on
current work.
Spoken Language and E-Inclusion
Alan F. Newell; University of Dundee, U.K.
Speech technology can help people with disabilities. Blind and non-speaking people were amongst the first to be provided with commercially available speech synthesis systems, and, to this day, represent a much higher percentage of users of this technology than their numbers would predict.
Speech synthesis technology has, for example, transformed the lives of many blind people, but the success of speech output in allowing blind people to word-process, browse the web, and use domestic appliances should not lull us into a false sense of security.
In the main, these users were young, aware of their limitations, and
of the substantial potential impact of such technology on their life
styles, and were generally highly motivated to make a success of
their use of the technology.
The speech community needs to be aware of the major differences
between the young disabled people who have found speech technology so useful, and the other groups of people who are excluded from
“e-society”. An example is older people, who have a much greater range of characteristics than younger people, and whose characteristics change more rapidly with time. Very importantly for speech technologists, most older people possess multiple minor disabilities, which can interact seriously, particularly in the context of human-machine communication. In addition, a relatively high proportion of older people also have a major disability.
Acoustic Normalization of Children’s Speech
Georg Stemmer, Christian Hacker, Stefan Steidl,
Elmar Nöth; Universität Erlangen-Nürnberg, Germany
Young speakers are not represented adequately in current speech recognizers. In this paper we focus on the problem of adapting the acoustic frontend of a speech recognizer that has been trained on adults’ speech to achieve better performance on children’s speech. We introduce and evaluate a method to perform non-linear
VTLN by an unconstrained data-driven optimization of the filterbank. A second approach normalizes the speaking rate of the young
speakers with the PSOLA algorithm. Significant reductions in word
error rate have been achieved.
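Editorial note: the abstract above generalizes VTLN beyond the usual parametric warping. For readers unfamiliar with the baseline, a minimal sketch of conventional piecewise-linear VTLN frequency warping follows; the breakpoint constant, 8 kHz Nyquist, and warp factors are illustrative assumptions, not details from the paper.

```python
import numpy as np

def plin_warp(f, alpha, f_nyq=8000.0, f0=0.875):
    """Piecewise-linear VTLN: scale frequencies by alpha up to a breakpoint,
    then interpolate linearly so the Nyquist frequency maps to itself."""
    fb = f0 * f_nyq * min(1.0, 1.0 / alpha)  # breakpoint frequency
    f = np.asarray(f, dtype=float)
    low = alpha * f
    high = alpha * fb + (f_nyq - alpha * fb) * (f - fb) / (f_nyq - fb)
    return np.where(f <= fb, low, high)

# Warping a child's spectrum toward an adult-trained model might use alpha > 1.
warped_centers = plin_warp(np.linspace(0.0, 8000.0, 9), alpha=1.2)
```

The warp is applied to the filterbank center frequencies of the acoustic frontend; the data-driven method in the abstract replaces this fixed functional form with an unconstrained optimization.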
September 1-4, 2003 – Geneva, Switzerland
Session: OTuDc– Oral
Speech Synthesis: Unit Selection II
Time: Tuesday 16.00, Venue: Room 3
Chair: Alan Black, CMU, USA
Unit Size in Unit Selection Speech Synthesis
S.P. Kishore 1 , Alan W. Black 2 ; 1 International Institute
of Information Technology, India; 2 Carnegie Mellon
University, USA
In this paper, we address the issue of choice of unit size in unit
selection speech synthesis. We discuss the development of a Hindi
speech synthesizer and our experiments with different choices of
units: syllable, diphone, phone and half phone. Perceptual tests
conducted to evaluate the quality of the synthesizers with different unit sizes indicate that the syllable synthesizer performs better than
the phone, diphone and half phone synthesizers, and the half phone
synthesizer performs better than diphone and phone synthesizers.
Restricted Unlimited Domain Synthesis
Antje Schweitzer, Norbert Braunschweiler, Tanja
Klankert, Bernd Möbius, Bettina Säuberlich; University
of Stuttgart, Germany
This paper describes the hybrid unit selection strategy for restricted
domain synthesis in the SmartKom dialog system. Restricted domains are characterized as being biased toward domain specific utterances while being unlimited in terms of vocabulary size. This entails that unit selection in restricted domains must deal with both
domain specific and open-domain material. The strategy presented
here combines the advantages of two existing unit selection approaches, motivated by the claim that the phonological structure
matching approach is advantageous for domain specific parts of
utterances, while the acoustic clustering algorithm is more appropriate for open-domain material. This dichotomy is also reflected
in the speech database, which consists of a domain specific and an
open-domain part. The text material for the open-domain part was
constructed to optimize coverage of diphones and phonemes in different contexts.
Evaluation of Units Selection Criteria in
Corpus-Based Speech Synthesis
Hélène François, Olivier Boëffard; IRISA, France
This work falls within the scope of concatenative speech synthesis. We propose a method to evaluate the criteria used in unit selection. Usually, criteria are evaluated in a comparative, black-box way: the performance of a criterion is measured relative to that of other criteria, so the evaluation is not always discriminant or formative. We present a glass-box method to measure the performance of a criterion in an absolute way. The principle is to explore the possible sequences of units able to synthesize a given target utterance, and to assign to each sequence a value X of objective quality and a value Y related to the tested criterion; the mutual information I(X;Y) is then calculated to measure the explicative power of the criterion Y in relation to the quality variable X. Results are encouraging for criteria associated with unit types, but combinatorial problems weigh heavily for criteria related to unit instances.
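Editorial note: the glass-box measure above is the ordinary mutual information between a quality variable X and a criterion variable Y. A minimal histogram-based estimate is sketched below; the toy data and bin count are invented, whereas the paper's X and Y come from exploring unit sequences.

```python
import numpy as np

def mutual_information(x, y, bins=8):
    """Histogram estimate of I(X;Y) in bits from paired samples."""
    pxy, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = pxy / pxy.sum()
    px = pxy.sum(axis=1, keepdims=True)   # marginal of X
    py = pxy.sum(axis=0, keepdims=True)   # marginal of Y
    nz = pxy > 0                          # avoid log(0) terms
    return float((pxy[nz] * np.log2(pxy[nz] / (px @ py)[nz])).sum())

rng = np.random.default_rng(0)
quality = rng.normal(size=5000)                       # stand-in for X
good_crit = quality + 0.1 * rng.normal(size=5000)     # informative criterion
bad_crit = rng.normal(size=5000)                      # uninformative criterion
```

An informative criterion yields a high I(X;Y); an uninformative one yields a value near zero, which is what makes the evaluation absolute rather than comparative.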
Combining Non-Uniform Unit Selection with
Diphone Based Synthesis
Michael Pucher 1 , Friedrich Neubarth 1 , Erhard
Rank 1 , Georg Niklfeld 1 , Qi Guan 2 ; 1 ftw., Austria; 2 Siemens Österreich AG, Austria
This paper describes the unit selection algorithm of a speech synthesis system, which selects the k-best paths over units from a relational unit database. The algorithm uses words and diphones as
basic unit types. It is part of a customisable text-to-speech system
designed for generating new prompts using a recorded speech corpus, with the option that the user can interactively optimise the
results from the unit selection algorithm. This algorithm combines
advantages of non-uniform unit selection algorithms and diphone
inventory based speech synthesis.
Evolutionary Weight Tuning Based on Diphone
Pairs for Unit Selection Speech Synthesis
Francesc Alías 1 , Xavier Llorà 2 ; 1 Ramon Llull
University, Spain; 2 University of Illinois at
Urbana-Champaign, USA
Unit selection text-to-speech (TTS) conversion is an ongoing research topic for the speech synthesis community. This paper focuses
on tuning the weights involved in the target and concatenation cost
metrics. We propose a method for automatically adjusting these
weights simultaneously by means of diphone and triphone pairs.
This method is based on techniques provided by the evolutionary
computation community, taking advantage of their robustness in
noisy domains. The experiments and their analysis demonstrate the method's good performance on this problem, overcoming some constraints assumed by previous work and leading to an interesting new framework for further investigation.
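Editorial note: as a sketch of the evolutionary weight-tuning idea (not the authors' algorithm), a (1+1)-evolution strategy can mutate a cost-weight vector and keep a mutation only if it lowers a fitness measuring how well the weighted cost predicts listener scores. All data, weights, and constants below are invented.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy setup: each row holds four sub-cost features of a candidate unit pair;
# 'scores' is a synthetic perceptual rating the weighted cost should predict
# (the hidden weight vector is used only to generate the data).
feats = rng.random((200, 4))
scores = feats @ np.array([0.5, 1.5, 0.2, 0.8]) + 0.05 * rng.normal(size=200)

def fitness(w):
    # Lower is better: mean squared mismatch between weighted cost and scores.
    return float(np.mean((feats @ w - scores) ** 2))

w = rng.random(4)
initial = fitness(w)
best, sigma = initial, 0.3
for _ in range(2000):
    child = w + sigma * rng.normal(size=4)   # Gaussian mutation
    f = fitness(child)
    if f < best:                              # (1+1)-ES selection
        w, best, sigma = child, f, sigma * 1.1
    else:
        sigma *= 0.98                         # shrink step on failure
```

Evolutionary search needs only fitness evaluations, which is why it tolerates the noisy, non-differentiable objectives that arise from perceptual ratings.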
Keeping Rare Events Rare
Ove Andersen, Charles Hoequist; Aalborg University, Denmark
It has been claimed that corpus-based TTS is unworkable because it is not practical to include representative units to cover all or most of the combinations of segments and prosodic characteristics found in general texts, a problem characterized as Large Numbers of Rare Events (LNRE). We argue that part of this problem is in its formulation, and that a closer look, including investigations into corpus-based TTS for Danish, shows that LNRE need not be a fatal problem for inventory design in corpus-based TTS.
Session: OTuDd– Oral
Language & Accent Identification
Time: Tuesday 16.00, Venue: Room 4
Chair: Stephen Cox, Univ. of East Anglia
Acoustic, Phonetic, and Discriminative Approaches to Automatic Language Identification
E. Singer, P.A. Torres-Carrasquillo, T.P. Gleason, W.M. Campbell, Douglas A. Reynolds; Massachusetts Institute of Technology, USA
Formal evaluations conducted by NIST in 1996 demonstrated that systems that used parallel banks of tokenizer-dependent language models produced the best language identification performance. Since that time, other approaches to language identification have been developed that match or surpass the performance of phone-based systems. This paper describes and evaluates three techniques that have been applied to the language identification problem: phone recognition, Gaussian mixture modeling, and support vector machine classification. A recognizer that fuses the scores of three systems that employ these techniques produces a 2.7% equal error rate (EER) on the 1996 NIST evaluation set and a 2.8% EER on the NIST 2003 primary condition evaluation set. An approach to dealing with the problem of out-of-set data is also discussed.
Using Place Name Data to Train Language Identification Models
Stanley F. Chen, Benoît Maison; IBM T.J. Watson Research Center, USA
The language of origin of a name affects its pronunciation, so language identification is an important technology for speech synthesis and recognition. Previous work on this task has typically used training sets that are proprietary or limited in coverage. In this work, we investigate the use of a publicly available geographic database for training language ID models. We automatically cluster place names by language, and show that models trained from place name data are effective for language ID on person names. In addition, we compare several source-channel and direct models for language ID, and achieve a 24% reduction in error rate over a source-channel letter trigram model on a 26-way language ID task.
Use of Trajectory Models for Automatic Accent Classification
Pongtep Angkititrakul, John H.L. Hansen; University of Colorado at Boulder, USA
This paper describes a proposed automatic language accent identification system based on phoneme class trajectory models. Our focus is to preserve discriminant information of the spectral evolution belonging to each accent. Here, we describe two classification schemes based on stochastic trajectory models: supervised and unsupervised classification. For supervised classification, we assume the text of the spoken words is known and integrate this into the classification scheme. Unsupervised classification uses a Multi-Trajectory Template, which represents the global temporal evolution of each accent. No prior text knowledge of the input speech is required for the unsupervised scheme. We also conduct human-perceptual accent classification experiments for comparison with automatic system performance. The experiments are conducted on 3 foreign accents (Chinese, Thai, and Turkish) together with native American English. Our experimental evaluation shows that supervised classification outperforms unsupervised classification by 11.5%. In general, supervised classification performance increases to 80% correct accent discrimination as we increase the phoneme sequence to 11 accent-sensitive phonemes.
NIST 2003 Language Recognition Evaluation
Alvin F. Martin, Mark A. Przybocki; National Institute of Standards and Technology, USA
The 2003 NIST Language Recognition Evaluation was very similar to the last such NIST evaluation in 1996. It was intended to establish a new baseline of current performance capability for language recognition of conversational telephone speech and to lay the groundwork for further research efforts in the field. The primary evaluation data consisted of excerpts from conversations in twelve languages from the CallFriend Corpus. These test segments had durations of approximately three, ten, or thirty seconds. Six sites from three continents participated in the evaluation. The best performance results were significantly improved from those of the previous evaluation.
Language Identification Using Parallel Sub-Word Recognition – An Ergodic HMM Equivalence
V. Ramasubramanian, A.K.V. Sai Jayram, T.V. Sreenivas; Indian Institute of Science, India
Recently, we have proposed a parallel sub-word recognition (PSWR) system for language identification (LID) in a framework similar to the parallel phone recognition (PPR) approach in the literature, but without requiring phonetic labeling of the speech data in any of the languages in the LID task. In this paper, we show the theoretical equivalence of PSWR and ergodic-HMM (E-HMM) based LID. Here, the front-end sub-word recognizer (SWR) and back-end language model (LM) of each language in PSWR correspond to the states and state transitions of the E-HMM in that language. This equivalence unifies the parallel phone (sub-word) recognition and ergodic-HMM approaches, which have been treated as two distinct frameworks in the LID literature so far, thus providing further insights into both these frameworks. On a 6-language LID task using the OGI-TS database, the E-HMM system achieves performance comparable to the PSWR system, offering clear experimental validation of their equivalence.
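Editorial note: several of the systems above score utterances with per-language Gaussian models. A one-component, diagonal-covariance simplification of the GMM approach is sketched below; the features and language means are synthetic.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy 'cepstral' frames: one diagonal Gaussian per language, trained on
# synthetic data drawn around made-up language means.
means = {"en": np.array([0.0, 1.0]), "fr": np.array([2.0, -1.0]),
         "de": np.array([-2.0, -2.0])}
train = {lang: mu + 0.5 * rng.normal(size=(300, 2))
         for lang, mu in means.items()}
models = {lang: (X.mean(axis=0), X.var(axis=0) + 1e-6)
          for lang, X in train.items()}

def log_lik(frame, model):
    mu, var = model
    return float(np.sum(-0.5 * (np.log(2 * np.pi * var)
                                + (frame - mu) ** 2 / var)))

def identify(utterance):
    """Sum frame log-likelihoods under each language model; pick the best."""
    scores = {lang: sum(log_lik(f, m) for f in utterance)
              for lang, m in models.items()}
    return max(scores, key=scores.get)

test_utt = means["fr"] + 0.5 * rng.normal(size=(50, 2))
```

Real systems use mixtures of many Gaussians and shifted-delta features, and fuse these scores with phonotactic and discriminative classifiers as described above.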
On the Combination of Speech and Speaker Recognition
Mohamed Faouzi BenZeghiba, Hervé Bourlard; IDIAP, Switzerland
This paper investigates an approach that maximizes the joint posterior probability of the pronounced word and the speaker identity given the observed data. This probability can be expressed as the product of the posterior probability of the pronounced word, estimated through an artificial neural network (ANN), and the likelihood of the data, estimated through a Gaussian mixture model (GMM). We show that the posterior probabilities estimated through a speaker-dependent ANN, as usually done in hybrid HMM/ANN systems, are reliable for speech recognition but less reliable for speaker recognition. To alleviate this problem, we study how this posterior probability can be combined with the likelihood derived from a speaker-dependent GMM model to improve speaker recognition performance. We thus end up with a joint model that can be used for text-dependent speaker identification and for speech recognition (each mutually benefiting from the other).
Speech Enhancement for a Car Environment Using LP Residual Signal and Spectral Subtraction
A. Álvarez, V. Nieto, P. Gómez, R. Martínez; Universidad Politécnica de Madrid, Spain
Hands-free speaker input is mandatory to enable safe operation in cars. In those scenarios, robust speech recognition emerges as one of the key technologies for producing voice-controlled car devices. In this paper, we propose a method for processing speech degraded by reverberation and noise in an automobile environment. This approach involves analyzing the linear prediction error signal to produce a weight function suitable for being combined with spectral subtraction techniques. The paper also includes an evaluation of the performance of the algorithm in speech recognition experiments. The results show a reduction of more than 30% in word error rate when the new speech enhancement frontend is applied.
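Editorial note: the joint maximization in the first abstract above combines an ANN word posterior with a speaker-dependent GMM likelihood in the log domain. A schematic sketch with invented scores (not values from the paper):

```python
import numpy as np

words = ["yes", "no", "stop"]
speakers = ["spk_a", "spk_b"]

# Hypothetical per-utterance scores: ANN posteriors P(word|X) and
# per-speaker GMM log-likelihoods log p(X|speaker).
log_p_word = np.log(np.array([0.7, 0.2, 0.1]))
log_p_x_spk = np.array([-120.0, -135.0])

# Joint posterior over (word, speaker) pairs, up to an additive constant:
# log P(word|X) + log p(X|speaker).
joint = log_p_word[:, None] + log_p_x_spk[None, :]
w_idx, s_idx = np.unravel_index(np.argmax(joint), joint.shape)
best_word, best_speaker = words[w_idx], speakers[s_idx]
```

Maximizing the sum picks the word-speaker pair jointly, which is how the two recognizers can benefit from each other.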
Session: PTuDe– Poster
Speech Enhancement II
Time: Tuesday 16.00, Venue: Main Hall, Level -1
Chair: Maurizio Omologo, ITC-irst
Improving Speech Intelligibility by Steady-State Suppression as Pre-Processing in Small to Medium Sized Halls
Nao Hodoshima 1 , Takayuki Arai 1 , Tsuyoshi Inoue 1 , Keisuke Kinoshita 1 , Akiko Kusumoto 2 ; 1 Sophia University, Japan; 2 Portland VA Medical Center, USA
One of the reasons that reverberation degrades speech intelligibility is the effect of overlap-masking, in which segments of an acoustic signal are affected by reverberation components of previous segments [Bolt et al., 1949]. To reduce overlap-masking, Arai et al. suppressed steady-state portions having more energy, but which are less crucial for speech perception, and confirmed promising results for improving speech intelligibility [Arai et al., 2002]. Our goal is to provide a pre-processing filter for each auditorium. To explore the relationship between the effect of a pre-processing filter and reverberation conditions, we conducted a perceptual test with steady-state suppression under various reverberation conditions. The results showed that processed stimuli performed better than unprocessed ones, and clear improvements were observed for reverberation conditions of 0.8-1.0 s. We confirmed that steady-state suppression is an effective pre-processing method for improving speech intelligibility under reverberant conditions, and demonstrated the effect of overlap-masking.
Speech Enhancement and Improved Recognition Accuracy by Integrating Wavelet Transform and Spectral Subtraction Algorithm
Gwo-hwa Ju, Lin-shan Lee; National Taiwan University, Taiwan
The spectral subtraction (SS) approach has been widely used for speech enhancement and for improving recognition accuracy, but it becomes less effective when the additive noise is not white. In this paper, we propose to integrate the wavelet transform with the SS algorithm. The spectrum of the additive noise in each frequency band obtained in this way can then be better approximated as white if the number of bands is large enough, and therefore the SS approach can be more effective. Experimental results based on three objective performance measures and spectrogram-plot comparisons show that this new approach provides better performance, especially when the noise is non-white. Listening test results also indicate that the new algorithm gives more preferable sound quality and intelligibility than the conventional spectral subtraction algorithm. Moreover, the new approach offers some reduction in computational complexity compared with the conventional SS algorithm.
Enhancement of Hearing-Impaired Mandarin Speech
Chen-Long Lee 1 , Ya-Ru Yang 1 , Wen-Whei Chang 1 , Yuan-Chuan Chiang 2 ; 1 National Chiao Tung University, Taiwan; 2 National Hsinchu Teachers College, Taiwan
This paper presents a new voice conversion system that modifies misarticulations and prosodic deviations of hearing-impaired Mandarin speech. The basic strategy is the detection and exploitation of characteristic features that distinguish the impaired speech from normal speech at the segmental and prosodic levels. For spectral conversion, cepstral coefficients were characterized in the form of a Gaussian mixture model, with parameters converted using a mapping function that minimizes the spectral distortion between the impaired and normal speech. We also propose a VQ-based approach to prosodic conversion that involves modifying features extracted from the pitch contour by an orthogonal polynomial transform. Experimental results indicate that the proposed system is useful in enhancing hearing-impaired Mandarin speech.
Multi-Referenced Correction of the Voice Timbre Distortions in Telephone Networks
Gaël Mahé 1 , André Gilloire 2 ; 1 Université René Descartes – Paris V, France; 2 France Télécom R&D, France
In a telephone link, the voice timbre is impaired by spectral distortions generated by the analog parts of the link. We first evaluate, from a perceptual point of view, an equalization method consisting in matching the long-term spectrum of the processed signal to a reference spectrum. This evaluation shows a satisfactory restoration of the timbre for most speakers. For some speakers, however, a noticeable spectral distortion remains. That is why we propose a multi-referenced equalizer, based on a classification of speakers and using a different reference spectrum for each class. This leads to a decrease of the spectral distortion and, as a consequence, to a significant improvement of the timbre correction.
Efficient Speech Enhancement Based on Left-Right HMM with State Sequence Detection Using LRT
J.J. Lee 1 , J.H. Lee 2 , K.Y. Lee 1 ; 1 SoongSil University, Korea; 2 Dong-Ah Broadcasting College, Korea
Conventional HMM (hidden Markov model)-based speech enhancement methods try to improve speech quality by considering all states for the state transition, and hence introduce huge computational loads inappropriate for real-time implementation. In the Left-Right HMM (LR-HMM), only the current and the next states are considered for a possible state transition, so as to reduce the computational complexity. We propose a new speech enhancement algorithm based on the LR-HMM with state sequence detection using the LRT (likelihood ratio test). Experimental results show that the proposed method improves speed with little degradation of speech quality compared to the conventional method.
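Editorial note: several abstracts in this session build on spectral subtraction. A minimal single-frame, magnitude-domain sketch follows; the frame length, flooring factor, and the use of the true noise frame as the noise estimate are illustrative simplifications.

```python
import numpy as np

def spectral_subtract(noisy, noise_est, floor=0.05):
    """Subtract the noise magnitude spectrum from the noisy magnitude,
    floor the result, and resynthesize with the noisy phase."""
    spec = np.fft.rfft(noisy)
    mag, phase = np.abs(spec), np.angle(spec)
    noise_mag = np.abs(np.fft.rfft(noise_est))
    clean_mag = np.maximum(mag - noise_mag, floor * mag)  # spectral floor
    return np.fft.irfft(clean_mag * np.exp(1j * phase), n=len(noisy))

rng = np.random.default_rng(3)
t = np.arange(256) / 8000.0
tone = np.sin(2 * np.pi * 440.0 * t)       # stand-in for one speech frame
noise = 0.3 * rng.normal(size=256)
enhanced = spectral_subtract(tone + noise, noise)
```

In practice the noise estimate is averaged over speech pauses, and the wavelet-domain variant above applies the same subtraction per subband so the residual noise is closer to white.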
Introduction of the CELP Structure of the GSM Coder in the Acoustic Echo Canceller for the GSM Network
H. Gnaba 1 , M. Turki-Hadj Alouane 1 , M. Jaidane-Saidane 1 , P. Scalart 2 ; 1 Ecole Nationale d’Ingénieurs de Tunis, Tunisia; 2 France Télécom R&D, France
This paper presents a new structure for an Acoustic Echo Canceller (AEC) designed to operate in the Mobile Switching Center (MSC) of a GSM network. The purpose of such a system is to cancel the echo for all subscribers. Contrary to the conventional AEC, the proposed combined AEC/CELP predictor is able to take into account the nonlinearities introduced by the GSM speech coders/decoders. A short-term predictor is used to model the behavior of the codecs. This new combined system shows higher performance compared to the conventional AEC.
Extracting an AV Speech Source from a Mixture of Signals
David Sodoyer 1 , Laurent Girin 1 , Christian Jutten 2 , Jean-Luc Schwartz 1 ; 1 ICP-CNRS, France; 2 LIS-CNRS, France
We present a new approach to the source separation problem for multiple speech signals. Using the extra visual information of the speaker's face, the method aims to extract an acoustic speech signal from other acoustic signals by exploiting its coherence with the speaker's lip movements. We define a statistical model of the joint probability of visual and spectral audio input for quantifying the audio-visual coherence. Separation can then be achieved by maximizing this joint probability. Experiments on additive mixtures of 2, 3 and 5 sources show that the algorithm performs well, and systematically better than the classical BSS algorithm JADE.
Speech Enhancement for Hands-Free Car Phones by Adaptive Compensation of Harmonic Engine Noise Components
Henning Puder; Darmstadt University of Technology, Germany
This paper presents a method for enhancing speech disturbed by car noise. The proposed method cancels the powerful harmonic components of engine noise by adaptive filtering which utilizes the known rpm signal available on the CAN bus in modern cars. The procedure can be used as a preprocessing method for classical broad-band noise reduction, as it is able to cancel the engine noise – and thus a large amount of low-frequency car noise – without provoking speech distortion. The main part of the paper is dedicated to the step-size control of the utilized LMS algorithm, necessary for a complete cancellation of the harmonics without speech distortion. First, a theoretically optimal step-size is determined, and then a procedure is described which allows its determination in real applications. The paper concludes with a presentation of results obtained with this approach.
Enhance Low-Frequency Suppression of GSC Beamforming
Zhaorong Hou, Ying Jia; Intel China Research Center, China
Usually the generalized sidelobe canceller (GSC) beamformer requires additional highpass pre-filtering due to insufficient suppression of low-frequency directional interference, and this deteriorates the bandwidth quality of speech enhancement, especially for small microphone arrays. This paper proposes a new GSC beamformer with multiple frequency-dependent norm-constrained adaptive filters (FD-NCAF), which combine a bin-wise constraint in the low-frequency band and a norm constraint in the high-frequency band, to improve the performance of the adaptive interference canceller (AIC) for low-frequency interference. Simulations on five test signals show that the directional response of the proposed beamformer is less sensitive to the spectrum of the interference than the full-band GSC. In experiments based on real recordings, when the cut-off frequency of the highpass pre-filtering is extended to lower frequencies, the residual directional interference of the full-band GSC increases by 3.37 dB, while that of the proposed beamformer increases by only 0.95 dB. Therefore, the passband of the proposed GSC beamformer can be extended without loss of performance.
Speech Enhancement Using A-Priori Information
Sriram Srinivasan, Jonas Samuelsson, W. Bastiaan Kleijn; KTH, Sweden
In this paper, we present a speech enhancement technique that uses a-priori information about both speech and noise. The a-priori information consists of speech and noise spectral shapes stored in trained codebooks. The excitation variances of speech and noise are determined through the optimization of a criterion that finds the best fit between the noisy observation and the model represented by the two codebooks. The optimal spectral shapes and variances are used in a Wiener filter to obtain an estimate of clean speech. The method uses both a-priori and estimated noise information to perform well in stationary as well as non-stationary noise environments. The high computational complexity resulting from a full search of the joint speech and noise codebooks is avoided through an iterative optimization procedure. Experiments indicate that the method significantly outperforms conventional enhancement techniques, especially for non-stationary noise.
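Editorial note: the codebook-driven Wiener filter in the last abstract can be sketched as a search over speech/noise shape pairs with least-squares variance fitting. The codebook sizes and random "trained" shapes below are placeholders, and the paper uses an iterative search rather than the exhaustive one shown here.

```python
import numpy as np

rng = np.random.default_rng(4)
K = 16  # frequency bins

# Stand-ins for trained spectral-shape codebooks (random positive shapes).
speech_cb = np.abs(rng.normal(size=(8, K))) + 0.1
noise_cb = np.abs(rng.normal(size=(4, K))) + 0.1

def wiener_from_codebooks(noisy_psd):
    """Fit excitation variances for every speech/noise shape pair by least
    squares, keep the best-fitting pair, and return its Wiener gain."""
    best_fit, best = np.inf, None
    for s in speech_cb:
        for n in noise_cb:
            A = np.stack([s, n], axis=1)                # K x 2 design matrix
            coef, *_ = np.linalg.lstsq(A, noisy_psd, rcond=None)
            a, b = np.maximum(coef, 1e-8)               # variances >= 0
            fit = float(np.sum((a * s + b * n - noisy_psd) ** 2))
            if fit < best_fit:
                best_fit, best = fit, (a * s, b * n)
    s_psd, n_psd = best
    return s_psd / (s_psd + n_psd)                      # per-bin Wiener gain

noisy_psd = 2.0 * speech_cb[0] + 0.5 * noise_cb[1]      # synthetic observation
gain = wiener_from_codebooks(noisy_psd)
```

Because the observation here is an exact combination of one speech and one noise shape, the search recovers that pair and the resulting gain.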
Blind Inversion of Multidimensional Functions for
Speech Enhancement
John Hogden 1 , Patrick Valdez 1 , Shigeru Katagiri 2 ,
Erik McDermott 2 ; 1 Los Alamos National Laboratory,
USA; 2 NTT Corporation, Japan
We discuss speech production in terms of a mapping from a low-dimensional articulator space to a low-dimensional manifold embedded in a high-dimensional acoustic space. Our discussion highlights the advantages of using an articulatory representation of speech. We then summarize mathematical results showing that, because articulator motions are bandlimited, a large class of mappings from articulation to acoustics can be blindly inverted. Simulation results showing the power of the inversion technique are also presented. One of the most interesting simulation results is that some many-to-one mappings can also be inverted. These results explain earlier experimental findings that the studied technique can recover articulator positions. We conclude that our technique has many advantages for speech processing, including invariance with respect to
various nonlinearities and the ability to exploit context more easily.
Convergence Improvement for Oversampled
Subband Adaptive Noise and Echo Cancellation
H.R. Abutalebi 1 , H. Sheikhzadeh 2 , R.L. Brennan 2 ,
G.H. Freeman 3 ; 1 Amirkabir University of Technology,
Iran; 2 Dspfactory Ltd., Canada; 3 University of
Waterloo, Canada
The convergence rate of the Least Mean Square (LMS) algorithm is
dependent on the eigenvalue distribution of the reference input correlation matrix. When adaptive filters are employed in low-delay over-sampled subband structures, colored subband signals considerably slow convergence. Here, we propose and implement two promising techniques for improving the convergence
rate based on: 1) Spectral emphasis and 2) Decimation of the subband signals. We analyze the effects of the proposed methods based
on theoretical relationships between eigenvalue distribution and
convergence characteristics. We also propose a combined decimation and spectral emphasis whitening technique that exploits the
advantages of both methods to dramatically improve the convergence rate. Moreover, through decimation the combined whitening
approach reduces the overall computation cost compared to subband LMS with no pre-processing. Presented theoretical and simulation results confirm the effectiveness of the proposed convergence
improvement methods.
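As an illustrative aside (not taken from the paper), the dependence of LMS convergence on the eigenvalue distribution of the input correlation matrix can be sketched numerically; the filter order and the colouring filter below are arbitrary choices:

```python
import numpy as np

def eigenvalue_spread(x, order=8):
    """Ratio of largest to smallest eigenvalue of the input
    autocorrelation matrix, which governs LMS convergence speed."""
    n = len(x)
    # Biased autocorrelation estimates r[0..order-1]
    r = np.array([np.dot(x[:n - k], x[k:]) / n for k in range(order)])
    # Symmetric Toeplitz autocorrelation matrix
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    w = np.linalg.eigvalsh(R)          # eigenvalues in ascending order
    return w[-1] / w[0]

rng = np.random.default_rng(0)
white = rng.standard_normal(50000)
# A first-order colouring filter mimics a colored subband signal
colored = np.convolve(white, [1.0, 0.9], mode="same")

print(eigenvalue_spread(white))    # close to 1: fast LMS convergence
print(eigenvalue_spread(colored))  # much larger: slow convergence
```

Whitening operations such as the spectral emphasis and decimation proposed in the abstract aim precisely at shrinking this ratio.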
A Speech Dereverberation Method Based on the
MTF Concept
Masashi Unoki, Keigo Sakata, Masato Akagi; JAIST,
Japan
This paper proposes a speech dereverberation method based on
the MTF concept. This method can be used without measuring the
impulse response of room acoustics. In the model, the power envelopes and carriers are decomposed from a reverberant speech
signal using an N-channel filterbank and then are dereverberated in
each respective channel. In the envelope dereverberation process,
a power envelope inverse filtering method is used to dereverberate
the envelopes. In the carrier regeneration process, a carrier generation method based on voiced/unvoiced speech from the estimated
fundamental frequency (F0) is used. In this paper, we assume that
F0 has been estimated accurately. We have carried out 15,000 simulations of dereverberation for reverberant speech signals to evaluate the proposed model. We found that the proposed model can
accurately dereverberate not only the power envelopes but also the
speech signal from the reverberant speech using regenerated carriers.
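As a minimal sketch of the power-envelope inverse-filtering idea, assuming an idealized single-exponential reverberation envelope with a known decay rate r (the paper works without measuring the room impulse response; this toy model is only illustrative):

```python
import numpy as np

def reverberate_envelope(env, r):
    """Smear a power envelope with an exponential decay h[n] = r**n,
    an idealized MTF-style model of room reverberation."""
    h = r ** np.arange(len(env))
    return np.convolve(env, h)[:len(env)]

def inverse_filter_envelope(rev_env, r):
    """Invert the exponential smearing: since H(z) = 1/(1 - r z^-1),
    the inverse filter is the 2-tap FIR (1 - r z^-1)."""
    out = rev_env.copy()
    out[1:] -= r * rev_env[:-1]
    return out

env = np.array([0.0, 1.0, 4.0, 2.0, 0.0, 3.0, 1.0, 0.0])
rev = reverberate_envelope(env, r=0.6)
rec = inverse_filter_envelope(rev, r=0.6)
print(np.allclose(rec, env))  # True: the envelope is exactly recovered
```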
Accuracy Improved Double-Talk Detector Based on
State Transition Diagram
SangGyun Kim, Jong Uk Kim, Chang D. Yoo; KAIST,
Korea
A double-talk detector (DTD) is generally used with an acoustic echo
canceller (AEC) to pinpoint regions where the far-end and near-end signals coexist. Such a region is called double-talk, and during it the AEC usually freezes adaptation. The decision variable used in a DTD has a relatively longer transient time when going from double-talk to single-talk than in the opposite direction. Therefore, using a single threshold to pinpoint the location of a double-talk region can
be difficult. In this paper, a DTD based on a novel state transition
diagram and a decision variable which requires minimal computational overhead is proposed to improve the accuracy of pinpointing
the location. The use of different thresholds according to the state
helps the DTD locate the double-talk region more accurately. The proposed DTD algorithm is evaluated by obtaining a receiver operating characteristic (ROC) curve, which is compared to that of Cho’s DTD.
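The benefit of state-dependent thresholds can be illustrated with a toy two-state detector using hysteresis; the thresholds, state diagram, and decision-variable trace below are hypothetical, not the paper's actual diagram or statistic:

```python
# Toy double-talk detector with state-dependent thresholds (hysteresis).
# A single threshold misclassifies the slow decay of the decision
# variable after double-talk; a lower exit threshold absorbs it.

def detect_double_talk(decision_vars, enter=0.8, exit=0.4):
    """Return per-frame double-talk flags; 'enter' and 'exit' are
    illustrative thresholds for entering/leaving the double-talk state."""
    state = "single"
    flags = []
    for v in decision_vars:
        if state == "single" and v > enter:
            state = "double"    # strong evidence required to enter
        elif state == "double" and v < exit:
            state = "single"    # the variable decays slowly after
                                # double-talk, so leave only when clearly low
        flags.append(state == "double")
    return flags

# Rising edge, sustained double-talk, then a slowly decaying transient
vs = [0.1, 0.9, 0.95, 0.7, 0.6, 0.5, 0.3, 0.1]
print(detect_double_talk(vs))
# → [False, True, True, True, True, True, False, False]
```

With a single mid-level threshold the decaying frames 0.7-0.5 would flip the decision prematurely.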
Perceptual Based Speech Enhancement for
Normal-Hearing & Hearing-Impaired Individuals
Ajay Natarajan, John H.L. Hansen, Kathryn Arehart,
Jessica A. Rossi-Katz; University of Colorado at
Boulder, USA
This paper describes a new noise suppression scheme with the goal
of improving speech-in-noise perception for hearing-impaired listeners. Following the work of Tsoukalas et al. (1997) [4], Arehart et al. (2003) [3] implemented and evaluated a noise suppression algorithm based on an approach that used the auditory masked threshold in conjunction with a version of spectral subtraction to adjust
the enhancement parameters based on the masked threshold of the
noise across the frequency spectrum. That original formulation was
based on masking properties of the normal auditory system, with
its theoretical underpinnings based on MPEG-4 audio coding [6]. We
describe here a revised formulation, which is more suitable for hearing aid applications and which addresses changes in masking that
occur with cochlear hearing loss. First, in contrast to previous formulations, the algorithm described here is implemented with generalized minimum mean square error estimators, which provide improvements over spectral subtraction estimators [1]. Second, the
frequency resolution of the cochlea is described with auditory filter
equivalent rectangular bandwidths (ERBs) [2] rather than the critical band scale. Third, estimation of the auditory masked thresholds
and masking spreading functions are adjusted to address elevated
thresholds and broader auditory filters characteristic of cochlear
hearing loss. Fourth, the current algorithm does not include the
tonality offset developed for use in MPEG-4 audio coding applications. The scheme also shows an overall improvement of 11% in the
Itakura-Saito distortion measure.
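For orientation, the basic (non-perceptual) spectral subtraction step that such schemes build on might be sketched as follows; the oversubtraction factor and spectral floor are generic textbook values, not the masked-threshold-driven parameters of this paper:

```python
import numpy as np

def spectral_subtraction(noisy, noise_est, alpha=2.0, beta=0.01):
    """Basic power spectral subtraction on one frame.
    alpha: oversubtraction factor, beta: spectral floor (generic values;
    the paper instead adapts such parameters from masked thresholds)."""
    Y = np.fft.rfft(noisy)
    N2 = np.abs(np.fft.rfft(noise_est)) ** 2
    S2 = np.abs(Y) ** 2 - alpha * N2
    S2 = np.maximum(S2, beta * np.abs(Y) ** 2)  # floor limits musical noise
    # Keep the noisy phase, rescale the magnitude
    S = np.sqrt(S2) * np.exp(1j * np.angle(Y))
    return np.fft.irfft(S, n=len(noisy))

rng = np.random.default_rng(1)
t = np.arange(256)
clean = np.sin(2 * np.pi * 0.05 * t)
noise = 0.3 * rng.standard_normal(256)
enhanced = spectral_subtraction(clean + noise, noise)
err_noisy = np.mean((clean + noise - clean) ** 2)
err_enh = np.mean((enhanced - clean) ** 2)
print(err_enh < err_noisy)  # enhancement reduces the error energy
```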
Residual Echo Power Estimation for Speech
Reinforcement Systems in Vehicles
Alfonso Ortega, Eduardo Lleida, Enrique Masgrau;
University of Zaragoza, Spain
In acoustic echo cancellation systems, some residual echo exists
after the acoustic echo canceller (AEC) due to the fact that the
adaptive filter does not model exactly the impulse response of the
Loudspeaker-Enclosure-Microphone (LEM) path. This is especially
important in feedback acoustic environments like speech reinforcement systems for cars where this residual echo can make the system
become unstable. In order to suppress the residual echo remaining after the AEC, postfiltering is the most widely used technique. The
optimal filter that ensures stability without attenuating the speech
signal depends on the power spectral density (psd) of the residual
echo that must be estimated. This paper presents a residual echo
psd estimation method needed to obtain the optimal echo suppression filter in speech reinforcement systems for cars.
Dual-Mode Wideband Speech Recovery from
Narrowband Speech
Yasheng Qian, Peter Kabal; McGill University, Canada
The present public telephone networks trim off the lowband (50-300 Hz) and the highband (3400-7000 Hz) components of sounds.
As a result, telephone speech is characterized by thin and muffled
sounds, and degraded speaker identification. The lowband components are deterministically recoverable, while the missing highband
can be recovered statistically. We develop an equalizer to restore
the lowband parts. The highband parts are filled in using a linear
prediction approach. The highband excitation is generated using
a bandpass envelope modulated Gaussian signal and the spectral
envelope is generated using a Gaussian Mixture Model. The mean
log-spectrum distortion decreases by 0.96 dB compared to a previous method using wideband reconstruction with a VQ codebook
mapping algorithm. Informal subjective tests show that the reconstructed wideband speech enhances lowband sounds and regenerates realistic highband components.
A Robust Noise and Echo Canceller
Khaldoon Al-Naimi, Christian Sturt, Ahmet Kondoz;
University of Surrey, U.K.
The performance of an echo canceller system deployed in a practical communication environment (i.e., in the presence of background noise and possible double-talk) depends on an accurate Voice Activity Detector (VAD) and effective filter coefficient adaptation strategies. Accuracy of the VAD, which affects the coefficient
adaptation strategy, is itself affected by the presence of background
noise. In this paper, a novel soft weighting approach is proposed to
replace the VAD and filter coefficient adaptation strategy. The robustness of the echo canceller system is further improved through
integrating it with a noise suppression algorithm. The integrated echo canceller and noise suppressor system has shown excellent performance under double-talk scenarios with SNRs as low as 5 dB.
Computational Auditory Scene Analysis by Using
Statistics of High-Dimensional Speech Dynamics
and Sound Source Direction
Johannes Nix, Michael Kleinschmidt, Volker
Hohmann; Universität Oldenburg, Germany
A main task for computational auditory scene analysis (CASA) is to
separate several concurrent speech sources. From psychoacoustics
it is known that common onsets, common amplitude modulation
and sound source direction are among the important cues that enable this separation in the human auditory system.
A new algorithm for binaural signals is presented here that performs statistical estimation of two speech sources by a state-space approach integrating temporal and frequency-specific features of speech. It is based on a Sequential Monte Carlo (SMC) scheme and tracks magnitude spectra and direction on a frame-by-frame basis. First results for estimating sound source direction
and separating the envelopes of two voices are shown. The results
indicate that the algorithm is able to localize two superimposed
sound sources on a time scale of 50 ms. This is achieved by integrating measured high-dimensional statistics of speech. The algorithm is also able to track the short-time envelopes and short-time magnitude spectra of both voices on a time scale of 10-40 ms.
The algorithm presented in this paper was developed for, but is not restricted to, binaural hearing aid applications, as it is based on two head-mounted microphone signals as input. It is conceptually able to separate more than two voices and to integrate additional
cues.
Session: PTuDf – Poster
Speech Recognition - Adaptation I
Time: Tuesday 16.00, Venue: Main Hall, Level -1
Chair: Richard Stern, CMU, USA
Vocal Tract Normalization as Linear Transformation of MFCC
Michael Pitz, Hermann Ney; RWTH Aachen, Germany
We have shown previously that vocal tract normalization (VTN) results in a linear transformation in the cepstral domain. In this paper we show that Mel-frequency warping can equally well be integrated into the framework of VTN as a linear transformation on the cepstrum. We show examples of transformation matrices that obtain VTN-warped Mel-frequency cepstral coefficients (VTN-MFCC) as a linear transformation of the original MFCC, and discuss the effect of Mel-frequency warping on the Jacobian determinant of the transformation matrix. Finally, we show that there is a strong interdependence between VTN and Maximum Likelihood Linear Regression (MLLR) for the case of Gaussian emission probabilities.
Reduction of Dimension of HMM Parameters Using ICA and PCA in MLLR Framework for Speaker Adaptation
Jiun Kim, Jaeho Chung; Inha University, Korea
We discuss how to reduce the number and dimensions of the inverse matrices required in the MLLR framework for speaker adaptation. To find a smaller set of variables with less redundancy, we employ PCA (principal component analysis) and ICA (independent component analysis), which give as good a representation as possible. The additional computation introduced by applying PCA or ICA is small enough to be disregarded. The dimension of the HMM parameters is reduced to about 1/3∼2/7 of that of the SI (speaker-independent) model parameters, while the speech recognition system achieves a word recognition rate comparable to the ordinary MLLR framework. If the dimension of the SI model parameters is n, the computation of the inverse matrix in MLLR is proportional to O(n^4). Compared with ordinary MLLR, the total computation required for speaker adaptation is therefore reduced to about 1/80∼1/150.
Non-Native Spontaneous Speech Recognition Through Polyphone Decision Tree Specialization
Zhirong Wang, Tanja Schultz; Carnegie Mellon University, USA
With more and more non-native speakers speaking English, fast and efficient adaptation to non-native English speech has become a practical concern. The performance of speech recognition systems is consistently poor on non-native speech. The challenge for non-native speech recognition is to maximize recognition performance with the small amount of non-native data available. In this paper we report on the effectiveness of the polyphone decision tree specialization method for non-native speech adaptation and recognition. Several recognition results are presented using non-native speech from German speakers. The results obtained from the experiments demonstrate the feasibility of this method.
Live Speech Recognition in Sports Games by Adaptation of Acoustic Model and Language Model
Yasuo Ariki 1 , Takeru Shigemori 1 , Tsuyoshi Kaneko 1 , Jun Ogata 2 , Masakiyo Fujimoto 1 ; 1 Ryukoku University, Japan; 2 AIST, Japan
This paper proposes a method to automatically extract keywords from baseball radio speech through LVCSR for highlight scene retrieval. For robust recognition, we employed acoustic and language model adaptation. In acoustic model adaptation, supervised and unsupervised adaptation were carried out using MLLR+MAP. This two-level adaptation improved word accuracy by 28%. In language model adaptation, language model fusion and pronunciation modification were carried out, yielding a 13% improvement in word accuracy. Finally, by integrating both adaptations, a 38% improvement was achieved at the word accuracy level and a 28% improvement at the keyword accuracy level.
Speaker Adaptation Using Regression Classes Generated by Phonetic Decision Tree-Based Successive State Splitting
Se-Jin Oh 1 , Kwang-Dong Kim 1 , Duk-Gyoo Roh 1 , Woo-Chang Sung 2 , Hyun-Yeol Chung 2 ; 1 Korea Astronomy Observatory, Korea; 2 Yeungnam University, Korea
In this paper, we propose a new method of generating regression classes for MLLR speaker adaptation using the PDTSSS algorithm, so as to represent speaker characteristics effectively. The method extends state splitting by clustering the context components of the adaptation data into a tree structure. It can autonomously control the number of adaptation parameters (means, variances) depending on the context information and the amount of adaptation utterances from a new speaker. In our experiments, the phone and word recognition rates with adaptation were on average 34∼37% and 9% higher, respectively, than those of the speaker-independent acoustic models. The experimental results on Korean phone and word recognition confirmed a significant performance increase with few adaptation utterances, compared with no speaker adaptation.
Geometric Constrained Maximum Likelihood
Linear Regression On Mandarin Dialect Adaptation
Huayun Zhang, Bo Xu; Chinese Academy of Sciences,
China
This paper presents a geometric constrained transformation approach for fast acoustic adaptation, which improves the modeling
resolution of conventional Maximum Likelihood Linear Regression (MLLR). In this approach, the underlying geometric difference between the seed and target spaces is exposed, quantified, and used as prior knowledge to reconstruct refined transforms. By ignoring dimensions that contribute little to this difference, the transform can be constrained to a lower-rank subspace, and only distortions within this subspace are refined in a cascaded process. Compared to a previous cascade method, we employ a different parameterization and obtain higher resolution. At the same time, since the geometric span of the refined transforms is tightly controlled, they can be adapted quickly, achieving a better
tradeoff between resolution and robustness. In Mandarin dialect
adaptation experiments, this approach provides a 4∼9% relative word-error-rate reduction over MLLR and a 3∼5% reduction over the previous cascade method, for varying amounts of adaptation data.
Adapting Language Models for Frequent Fixed
Phrases by Emphasizing N-Gram Subsets
Tomoyosi Akiba 1 , Katunobu Itou 1 , Atsushi Fujii 2 ; 1 AIST, Japan; 2 University of Tsukuba, Japan
In support of speech-driven question answering, we propose a
method to construct N-gram language models for recognizing spoken questions with high accuracy. Question-answering systems receive queries that often consist of two parts: one conveys the query
topic and the other is a fixed phrase used in query sentences. A
language model constructed by using a target collection of QA, for
example, newspaper articles, can model the former part, but cannot model the latter part appropriately. We tackle this problem as
task adaptation from language models obtained from background
corpora (e.g., newspaper articles) to the fixed phrases, and propose
a method that does not use the task-specific corpus, which is often difficult to obtain, but instead uses only manually listed fixed
phrases. The method emphasizes a subset of N-grams obtained from a background corpus that corresponds to the fixed phrases specified by the list. Theoretically, this method can be regarded as maximum a posteriori probability (MAP) estimation using the subset of the N-grams as the a posteriori distribution. Experiments show the effectiveness of our method.
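The emphasis step can be caricatured with bigram counts: bigrams inside listed fixed phrases get their counts boosted before probabilities are re-estimated (the boost factor and toy corpus here are purely illustrative, not the paper's estimation scheme):

```python
from collections import Counter

def emphasized_bigram_probs(corpus, fixed_phrases, boost=10.0):
    """Estimate bigram probabilities from a background corpus,
    multiplying the counts of bigrams that occur inside listed fixed
    phrases; 'boost' is an illustrative emphasis weight."""
    counts = Counter()
    context = Counter()
    emphasized = set()
    for phrase in fixed_phrases:
        w = phrase.split()
        emphasized.update(zip(w, w[1:]))
    for sent in corpus:
        w = sent.split()
        for bg in zip(w, w[1:]):
            weight = boost if bg in emphasized else 1.0
            counts[bg] += weight
            context[bg[0]] += weight
    return {bg: c / context[bg[0]] for bg, c in counts.items()}

corpus = ["what is the capital of france",
          "the capital is busy",
          "what is a language model",
          "what a day"]
probs = emphasized_bigram_probs(corpus, ["what is"])
# ('what', 'is') lies inside a fixed phrase, so its probability is boosted
print(probs[("what", "is")])  # → 0.9523809523809523 (vs 2/3 without emphasis)
```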
Learning Intra-Speaker Model Parameter
Correlations from Many Short Speaker Segments
Anne K. Kienappel; Philips Research Laboratories,
Germany
Very rapid speaker adaptation algorithms, such as eigenvoices or
speaker clustering, typically rely on learning intra-speaker correlations of model parameters from the training data. On the basis of this a priori knowledge, many model parameters can be successfully adapted from only a few observations. However, eigenvoice
training or speaker clustering is non-trivial with training databases
containing many short speaker segments, where for each speaker
the available data to detect intra-speaker correlations is sparse. We
have trained eigenvoices that yield a small but significant word error rate reduction in on-line adaptation (i.e. self adaptation) for a
telephony database with on average only 5 seconds of speech per
speaker in training and test data.
Modeling Cantonese Pronunciation Variation by
Acoustic Model Refinement
Patgi Kam 1 , Tan Lee 1 , Frank K. Soong 2 ; 1 Chinese
University of Hong Kong, China; 2 ATR-SLT, Japan
Pronunciation variations can be roughly classified into two types: a
phone change or a sound change [1][2]. A phone change happens
when a canonical phone is produced as a different phone. Such
a change can be modeled by converting the baseform (standard)
phone to a surfaceform (actual) phone. A sound change happens at
a lower, phonetic or subphonetic level within a phone and it cannot
be modeled well by either the baseform or the surfaceform phone
alone. We propose here to refine the acoustic models to cope with
sound changes by (1) sharing the Gaussian mixture components
of HMM states in the baseform and the surfaceform models; (2)
adapting the mixture components of the baseform models towards
those of the surfaceform models; (3) selectively reconstructing new
acoustic models through sharing or adapting. The proposed pronunciation modeling algorithms are generic and can, in principle,
be applied to different languages. Specifically, they were tested on a Cantonese speech recognition database. Relative word error rate
reductions of 5.45%, 2.53%, and 3.04% have been achieved using the
three approaches, respectively.
Performance Improvement of Rapid Speaker
Adaptation Based on Eigenvoice and Bias
Compensation
Jong Se Park, Hwa Jeon Song, Hyung Soon Kim; Pusan
National University, Korea
In this paper, we propose bias compensation methods, and an eigenvoice method using the mean of dimensional eigenvoices, to improve the performance of rapid speaker adaptation based on
eigenvoice. Experimental results on a vocabulary-independent word recognition task show that the proposed methods yield improvements with small amounts of adaptation data. We obtained a 22∼30% relative improvement with the bias compensation methods, and a 41% relative improvement with the eigenvoice method using the mean of dimensional eigenvoices, with only a single adaptation word.
Training Data Optimization for Language Model Adaptation
Xiaoshan Fang 1 , Jianfeng Gao 2 , Jianfeng Li 3 , Huanye Sheng 1 ; 1 Shanghai Jiao Tong University, China; 2 Microsoft Research Asia, China; 3 University of Science and Technology of China, China
Language model (LM) adaptation is a necessary step when an LM is applied to speech recognition. The task of LM adaptation is to use out-of-domain data to improve the in-domain model's performance, since the available in-domain (task-specific) data set is usually not large enough for LM training. LM adaptation faces two problems: the poor quality of the out-of-domain training data, and the mismatch between the n-gram distributions of the out-of-domain and in-domain data sets. This paper presents two methods, filtering and distribution adaptation, to solve these problems respectively. First, a bootstrapping method is presented to filter the portions of two large, variable-quality out-of-domain data sets that are suitable for our task. A new algorithm is then proposed to adjust the n-gram distribution of the two data sets to that of a task-specific but small data set, while preventing over-fitting during adaptation. All resulting models are evaluated on the realistic application of email dictation. Experiments show that each method improves performance, and the combined method achieves a perplexity reduction of 24% to 80%.
Approaches to Foreign-Accented Speaker-Independent Speech Recognition
Stefanie Aalburg, Harald Hoege; Siemens AG, Germany
Current research in the area of foreign-accented speech recognition focuses either on acoustic model adaptation or on speaker-dependent pronunciation variation modeling. In this paper both approaches are applied in parallel and in a speaker-independent fashion: the acoustic modeling part is based on a derived Hidden Markov Model (HMM) clustering algorithm, and the lexicon adaptation is based on speaker-independent multiple pronunciation rules. The pronunciation rules are derived using phoneme-level pronunciation scores. Foreign-accented speech was simulated with Colombian Spanish and Spanish of Spain, and the experiments showed improved recognition performance for the acoustic modeling part and identical recognition results when adding pronunciation variants to the lexica. Both results are taken as indicators of improved recognition performance when applied to real foreign-accented speech. The presently limited availability of foreign-accented speech databases, however, clearly merits further investigation.
Unsupervised Speaker Adaptation Based on HMM Sufficient Statistics in Various Noisy Environments
Shingo Yamade, Akinobu Lee, Hiroshi Saruwatari, Kiyohiro Shikano; Nara Institute of Science and Technology, Japan
Noise and speaker adaptation techniques are essential for robust speech recognition in noisy environments. In this paper, a noise-robust speech recognition algorithm is first implemented by superimposing a small quantity of noise data on spectrally subtracted input speech. According to the recognition experiments, superimposing noise at 30 dB SNR on input speech after spectral subtraction significantly increases robustness against different noises. Next, we apply this noise-robust speech recognition to an unsupervised speaker adaptation algorithm based on HMM sufficient statistics in different noise environments. The HMM sufficient statistics for each speaker are calculated beforehand from a speech database with 25 dB SNR office noise added. We successfully evaluate our proposed unsupervised speaker adaptation algorithm in noisy environments on a 20k dictation task using 11 kinds of noises, including office, car, exhibition, and crowd noise.
Using Genetic Algorithms for Rapid Speaker Adaptation
Fabrice Lauri, Irina Illina, Dominique Fohr, Filipp Korkmazsky; LORIA, France
This paper proposes two new approaches to rapid speaker adaptation of acoustic models using genetic algorithms. Whereas conventional speaker adaptation techniques yield adapted models that represent locally optimal solutions, genetic algorithms can provide multiple optimal solutions, thereby delivering potentially more robust adapted models. We have investigated two different strategies for applying the genetic algorithm in the framework of speaker adaptation of acoustic models. The first approach (GA) uses a genetic algorithm to adapt the set of Gaussian means to a new speaker. The second approach (GA + EV) uses the genetic algorithm to enrich the set of speaker-dependent systems employed by EigenVoices. Experiments with the Resource Management corpus show that, with one adaptation utterance, GA can improve the performance of a speaker-independent
system as efficiently as EigenVoices. The method GA + EV outperforms EigenVoices.
Structural State-Based Frame Synchronous
Compensation
Vincent Barreaud, Irina Illina, Dominique Fohr, Filipp
Korkmazsky; LORIA, France
In this paper we present improvements to a frame-synchronous noise compensation algorithm that uses a Stochastic Matching approach to cope with time-varying, unknown noise. We propose to
estimate a hierarchical mapping function in parallel with Viterbi
alignment. The structure of the transformation tree is built from
the states of acoustical models. The objective of this hierarchical
transformation is to better compensate non-linear distortions of
the feature space. The technique is entirely general since no assumption is made about the nature, level, or variation of the noise. Our algorithm is evaluated on the VODIS database recorded in a moving car. For various tasks, the proposed technique significantly outperforms classical compensation/adaptation methods.
Effect of Foreign Accent on Speech Recognition in
the NATO N-4 Corpus
Aaron D. Lawson 1 , David M. Harris 2 , John J. Grieco 3 ; 1 Research Associates for Defense Conversion, USA; 2 ACS Defense Inc., USA; 3 Air Force Research Laboratory, USA
We present results from a series of 151 speech recognition experiments based on the N4 corpus of accented English speech, using a
small vocabulary recognition system. These experiments looked at
the impact of foreign accent on speech recognition, both within nonnative accented English and across different accents, with particular
interest in using context free grammar technology to improve callsign identification. Results show that phonetic models built from
foreign-accented English are no less accurate than native ones at decoding novel data with the same accent. Cross-accent recognition experiments show that phonetic models from a given accent
group were 1.8 times less accurate in recognizing speech from a
different accent. In contrast to other attempts to perform accurate
recognition across accents, our approach of training very compact,
accent-specific models (less than 3 hours of speech) provided very
accurate results without the arduous task of adapting a phonetic
dictionary to every accent.
Duration Normalization and Hypothesis
Combination for Improved Spontaneous Speech
Recognition
Jon P. Nedel, Richard M. Stern; Carnegie Mellon
University, USA
When phone segmentations are known a priori, normalizing the duration of each phone has been shown to be effective in overcoming weaknesses in duration modeling of Hidden Markov Models
(HMMs). While we have observed potential relative reductions in
word error rate (WER) of up to 34.6% with oracle segmentation information, it has been difficult to achieve significant improvement
in WER with segmentation boundaries that are estimated blindly. In
this paper, we present simple variants of our duration normalization algorithm, which make use of blindly-estimated segmentation
boundaries to produce different recognition hypotheses for a given
utterance. These hypotheses can then be combined for significant
improvements in WER. With oracle segmentations, WER reductions
of up to 38.5% are possible. With automatically derived segmentations, this approach has achieved a reduction of WER of 3.9% for
the Broadcast News corpus, 6.2% for the spontaneous register of the
MULT_REG corpus, and 7.7% for a spontaneous corpus of connected
Spanish digits collected by Telefónica Investigación y Desarrollo.
On Divergence Based Clustering of Normal Distributions and Its Application to HMM Adaptation
Tor André Myrvoll 1 , Frank K. Soong 2 ; 1 NTNU, Norway; 2 ATR-SLT, Japan
We present an algorithm for clustering multivariate normal distributions based upon the symmetric Kullback-Leibler divergence. The optimal mean vector and covariance matrix of the centroid normal distribution are derived, and a set of Riccati matrix equations is used to find the optimal covariance matrix. The solutions are found iteratively by alternating the intermediate mean and covariance solutions. The clustering performance of the new algorithm is shown to be superior to that of non-optimal sample mean and covariance solutions: it achieves a lower overall distortion and flatter distributions of pdf samples across clusters. The resultant optimal clusters were further tested on the Wall Street Journal database for adapting HMM parameters in a Structured Maximum A Posteriori Linear Regression (SMAPLR) framework. The recognition performance was significantly improved, and the word error rate was reduced from 32.6% for a non-optimal centroid (sample mean and covariance) to 27.6% and 27.5% for the diagonal and full covariance matrix cases, respectively.
Fast Incremental Adaptation Using Maximum Likelihood Regression and Stochastic Gradient Descent
Sreeram V. Balakrishnan; IBM T.J. Watson Research Center, USA
Adaptation to a new speaker or environment is becoming very important as speech recognition systems are deployed in unpredictable real-world situations. Constrained or Feature-space Maximum Likelihood Linear Regression (fMLLR) [1] has proved to be especially effective for this purpose, particularly when used for incremental unsupervised adaptation [2]. Unfortunately, the standard implementation described in [1], and used by most authors since, requires statistics that take O(n^3) operations per frame to collect. In addition, the statistics require O(n^3) space for storage, and the estimation of the feature transform matrix requires O(n^4) operations. This is an unacceptable cost for most embedded speech recognition systems. In this paper we show that the fMLLR objective function can be optimized using stochastic gradient descent in a way that achieves almost the same results as the standard implementation. All this is accomplished with an algorithm that requires only O(n^2) operations per frame and O(n^2) storage. This order-of-magnitude saving allows continuous adaptation to be implemented in most resource-constrained embedded speech recognition applications.
Maximum A Posteriori Linear Regression (MAPLR) Variance Adaptation for Continuous Density HMMs
Wu Chou 1 , Xiaodong He 2 ; 1 Avaya Labs Research, USA; 2 University of Missouri, USA
In this paper, the theoretical framework of maximum a posteriori linear regression (MAPLR) based variance adaptation for continuous density HMMs is described. In our approach, a class of informative prior distributions for MAPLR based variance adaptation is identified, from which the closed-form solution of MAPLR based variance adaptation is obtained under its EM formulation. The effects of the proposed prior distribution in MAPLR based variance adaptation are characterized and compared with conventional maximum likelihood linear regression (MLLR) based variance adaptation. These findings provide a consistent Bayesian theoretical framework for incorporating prior knowledge in linear regression based variance adaptation. Experiments on large vocabulary speech recognition tasks were performed. The experimental results indicate that a significant performance gain over MLLR based variance adaptation can be obtained with the proposed approach.
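The symmetric Kullback-Leibler divergence between two multivariate normal distributions, the distortion measure underlying the divergence-based clustering abstract above, can be written down directly (taken here as the sum of the two directed divergences; the centroid Riccati iteration itself is not reproduced):

```python
import numpy as np

def kl_gauss(mu_p, S_p, mu_q, S_q):
    """KL(p || q) for multivariate normals p = N(mu_p, S_p), q = N(mu_q, S_q)."""
    d = len(mu_p)
    Sq_inv = np.linalg.inv(S_q)
    dm = mu_q - mu_p
    return 0.5 * (np.trace(Sq_inv @ S_p) + dm @ Sq_inv @ dm - d
                  + np.log(np.linalg.det(S_q) / np.linalg.det(S_p)))

def symmetric_kl(mu_p, S_p, mu_q, S_q):
    """Symmetric divergence usable as a clustering distortion measure."""
    return kl_gauss(mu_p, S_p, mu_q, S_q) + kl_gauss(mu_q, S_q, mu_p, S_p)

mu1, S1 = np.zeros(2), np.eye(2)
mu2, S2 = np.array([1.0, 0.0]), np.diag([2.0, 0.5])
print(symmetric_kl(mu1, S1, mu1, S1))       # → 0.0 for identical distributions
print(symmetric_kl(mu1, S1, mu2, S2) > 0)   # → True
```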
This paper describes an effective method for automatic speech unit segmentation. Based on hidden Markov models (HMMs), an initial segmentation estimated from the explicit phonetic transcription is processed by our local HMM training algorithm. With reliable silence boundaries obtained by a silence detector, the algorithm tries different training methods to overcome the problem of insufficient training data. The performance is tested on a Mandarin TTS speech corpus. The results show that this method achieves a 14.98% improvement in the boundary detection error rate (deviations larger than 20 ms).
Session: PTuDg– Poster
Speech Resources & Standards
Time: Tuesday 16.00, Venue: Main Hall, Level -1
Chair: Bruce Millar, Australian National University, Australia
Tfarsdat – The Telephone Farsi Speech Database
Mahmood Bijankhan 1 , Javad Sheykhzadegan 2 ,
Mahmood R. Roohani 2 , Rahman Zarrintare 2 , Seyyed
Z. Ghasemi 1 , Mohammad E. Ghasedi 2 ; 1 University of
Tehran, Iran; 2 Research Center of Intelligent Signal
Processing, Iran
Quality Control of Language Resources at ELRA
This paper describes an ongoing research to create an acoustic phonetic based telephone Farsi speech database, called “Tfarsdat”. It is
compared with two LDC Farsi corpora, OGI and Call friend in terms
of corpus dialectology. Up to now, we have recorded about 8 hours
of monologue calls containing spontaneous and read speech for 64
speakers belonging to one of ten dialect regions. A hierarchical annotation system is used to transcribe phoneme, word and sentence
levels of speech data. User software is written to access speech and
label files efficiently using a menu driven query system. We conducted two experiments to validate Tfarsdat statistically. Results
showed the necessity of increasing speaker size and also quality
enhancement of annotation system.
Henk van den Heuvel 1 , Khalid Choukri 2 , Harald
Höge 3 , Bente Maegaard 4 , Jan Odijk 5 , Valerie
Mapelli 2 ; 1 SPEX, The Netherlands; 2 ELRA/ELDA,
France; 3 Siemens AG, Germany; 4 CST, Denmark;
5
ScanSoft Belgium, Belgium
To promote quality control of its language resources the European
Language Resources Association (ELRA) installed a Validation Committee. This paper presents an overview of current activities of the
Committee: validation of language resources, standardisation, bug
reporting, patches of updates of language resources, and dissemination of results.
Validation of Phonetic Transcriptions Based on
Recognition Performance
Christophe Van Bael 1 , Diana Binnenpoorte 1 , Helmer
Strik 1 , Henk van den Heuvel 2 ; 1 University of
Nijmegen, The Netherlands; 2 SPEX, The Netherlands
Large Lexica for Speech-to-Speech Translation:
From Specification to Creation
Elviira Hartikainen 1 , Giulio Maltese 2 , Asunción
Moreno 3 , Shaunie Shammass 4 , Ute Ziegenhain 5 ;
1
Nokia Research Center, Finland; 2 IBM Italy, Italy;
3
Universitat Politècnica de Catalunya, Spain; 4 Natural
Speech Communication, Israel; 5 Siemens AG,
Germany
This paper presents the corpora collection and lexica creation for
the purposes of Automatic Speech Recognition (ASR) and Text-tospeech (TTS) that are needed in speech-to-speech translation (SST).
These lexica will be specified, built and validated within the scope of
the EU-project LC-STAR (Lexica and Corpora for Speech-to-Speech
Translation Components) during the years 2002-2005. Large lexica consisting of phonetic, prosodic and morpho-syntactic content
will be provided with well-documented specifications for at least 12
languages [1]. This paper provides a short overview of the speechto-speech translation lexica in general as well as a summary of the
LC-STAR project itself. More detailed information about the specification for the corpora collection and word extraction as well as the
specification and format of the lexica are presented in later chapters.
A Pronunciation Lexicon for Turkish Based on
Two-Level Morphology
In fundamental linguistic as well as in speech technology research
there is an increasing need for procedures to automatically generate and validate phonetic transcriptions. Whereas much research
has already focussed on the automatic generation of phonetic transcriptions, far less attention has been paid to the validation of such
transcriptions. In the little research performed in this area, the estimation of the quality of (automatically generated) phonetic transcriptions is typically based on the comparison between these transcriptions and a human-made reference transcription. We believe,
however, that the quality of phonetic transcriptions should ideally
be estimated with the application in which the transcriptions will
be used in mind, provided that the application is known at validation time. The application focussed on in this paper is automatic
speech recognition, the validation criterion is the word error rate.
We achieved a higher accuracy with a recogniser trained on an automatically generated transcription than with a similar recogniser
trained on a human-made transcription resembling a human-made
reference transcription more. This indicates that the traditional validation approach may not always be the most optimal one.
The Basque Speech_Dat (II) Database: A
Description and First Test Recognition Results
I. Hernaez, I. Luengo, E. Navas, M. Zubizarreta, I.
Gaminde, J. Sanchez; University of the Basque
Country, Spain
Kemal Oflazer 1 , Sharon Inkelas 2 ; 1 Sabancı
University, Turkey; 2 University of California at
Berkeley, USA
This paper describes the implementation of a full-scale pronunciation lexicon for Turkish based on a two-level morphological analyzer. The system produces at its output, a parallel representation
of the pronunciation and the morphological analysis of the word
form so that morphological disambiguation can be used to disambiguate pronunciation when necessary. The pronunciation representation is based on the SAMPA standard and also encodes the
position of the primary stress. The computation of the position
of the primary stress depends on an interplay of any exceptional
stress in root words and stress properties of certain morphemes,
and requires that a full morphological analysis be done. The system has been implemented using XRCE Finite State Toolkit.
In this work we present a telephone speech database for Basque,
compliant with the guidelines of the Speechdat project. The
database contains 1060 calls from the fixed telephone network. We
first describe the main aspects of the database design. We also
present the recognition results using the database and a set of procedures following the language independent reference recogniser
commonly named Refrec.
Towards an Evaluation Standard for Speech
Control Concepts in Real-World Scenarios
Using Both Global and Local Hidden Markov
Models for Automatic Speech Unit Segmentation
Jens Maase 1 , Diane Hirschfeld 2 , Uwe Koloska 2 , Timo
Westfeld 3 , Jörg Helbig 3 ; 1 Bosch und Siemens
Hausgeräte GmbH, Germany; 2 voice INTER connect
GmbH, Germany; 3 MediaInterface Dresden GmbH,
Germany
Hong Zheng 1 , Yiqing Lu 2 ; 1 CASCO (ALSTOM) Signal
Ltd., China; 2 Motorola China Research Center, China
Speech control is still mainly evaluated through statistical performance measures (recognition rate, insertion rate, etc.) considering
the performance of a speech recognizer under laboratory or artificial noise conditions. All these measures give no idea about the practical usability of a speech interface, since practical aspects concern more than the operational aspects of the speech recognizer inside a product.

Since it was felt that no evaluation standard so far fulfills the practical requirements for speech-controlled products, this paper aims at the establishment of an open design and evaluation standard for speech control concepts in real-world scenarios.

First, the behaviour of the users and the normal environmental conditions (typical noises) were evaluated in usability experiments. Data recordings were conducted, trying to capture these typical usage requirements in a special corpus (the Apollo corpus). Finally, a set of standard desktop as well as embedded speech recognizers was tested for performance under these real-world conditions.

OrienTel: Recording Telephone Speech of Turkish Speakers in Germany
Chr. Draxler; Ludwig-Maximilians-Universität München, Germany

OrienTel is a project to create telephone speech databases for both the local and the business languages of the Mediterranean and the Arab Emirates. In Germany, 300 Turkish speakers speaking German were to be recorded. The database is an extension of the SpeechDat databases. This paper outlines the recording setup, the recruitment strategy and the annotation procedure. Recruiting the speakers was a particular challenge because none of the recruitment strategies used in previous SpeechDat projects in Germany worked, and a new approach had to be found.

Spanish Broadcast News Transcription
Gerhard Backfried, Roser Jaquemot Caldés; SAIL LABS Technology AG, Austria

We describe the Sail Labs Media Mining System (MMS) aimed at the transcription of Castilian Spanish broadcast news. In contrast to previous systems, the focus of this system is on Spanish as spoken on the Iberian Peninsula as opposed to the Americas. We discuss the development of a Castilian Spanish broadcast-news corpus suitable for training the various system components of the MMS and report on the development of the speech-recognition component using the newly established corpora.

Implementation and Evaluation of a Text-to-Speech Synthesis System for Turkish
Özgül Salor 1, Bryan Pellom 2, Mübeccel Demirekler 1; 1 Middle East Technical University, Turkey; 2 University of Colorado at Boulder, USA

In this paper, a diphone-based Text-to-Speech (TTS) system for the Turkish language is presented. Turkish is the official language of Turkey, where it is the native language of 70 million people, and it is also widely spoken in Asia (Azerbaijan, Uzbekistan, Kazakhstan, Kyrgyzstan and Iran), Cyprus and the Balkans. The research has been done through a visiting internship at CSLR (the Center for Spoken Language Research, University of Colorado at Boulder) as part of an ongoing collaboration between CSLR and the Department of Electrical and Electronics Engineering at METU (Middle East Technical University). The system is based on the Festival Speech Synthesis System. A diphone database has been designed for Turkish. Tools developed for quick diphone collection and segmentation are illustrated. The text analysis module and the methods used for determination of segment durations and pitch contours are discussed in detail. A Diagnostic Rhyme Test (DRT) has been designed for Turkish to test the intelligibility of the output speech. The resulting TTS system is found to be 86.5% intelligible on average by 20 listeners. This is the first diphone-based Turkish TTS system whose intelligibility is reported. We also believe that this paper will help researchers working on building TTS voices, especially those who work on agglutinative languages, since every step needed along the way is explained in detail.

The Czech Speech and Prosody Database Both for ASR and TTS Purposes
Jáchym Kolář, Jan Romportl, Josef Psutka; University of West Bohemia in Pilsen, Czech Republic

This paper describes the preparation of the first large Czech prosodic database, which should be useful both in automatic speech recognition (ASR) and in text-to-speech (TTS) synthesis. In the area of ASR we intend to use it for automatic punctuation annotation; in the area of TTS, for building a prosodic module for high-quality Czech synthesis. The database is based on the Czech Radio&TV Broadcast News Corpus (UWB B02) recorded at the University of West Bohemia. The configuration of the database includes recorded speech, raw and stylized F0 values, frame-level energy values, word- and phoneme-level time alignments, and a linguistically motivated description of the prosodic data. A technique for prosodic data acquisition and stylization is described. A new tagset for linguistic annotation of Czech prosody is proposed and used.

Large Vocabulary Continuous Speech Recognition in Greek: Corpus and an Automatic Dictation System
Vassilios Digalakis, Dimitrios Oikonomidis, D. Pratsolis, N. Tsourakis, C. Vosnidis, N. Chatzichrisafis, V. Diakoloukas; Technical University of Crete, Greece

In this work, we present the creation of the first Greek speech corpus and the implementation of a dictation system for workflow improvement in the field of journalism. The current work was implemented under the project Logotypografia (logos = speech, typografia = typography), sponsored by the General Secretariat of Research and Development of Greece. This paper presents the process of data collection (texts and recordings), waveform processing (transcriptions), the creation of the acoustic and language models, and the final integration into a fully functional dictation system. The evaluation of this system is also presented. The Logotypografia database, described here, is available through ELRA.

The LIUM-AVS Database: A Corpus to Test Lip Segmentation and Speechreading Systems in Natural Conditions
Philippe Daubias, Paul Deléglise; Université du Maine, France

We present here a new freely available audio-visual speech database. Contrary to other existing corpora, the LIUM-AVS corpus was recorded in conditions we qualify as natural, which are, in our view, much closer to real application conditions than those of other databases. This database was recorded without artificial lighting using an analog camcorder in camera mode. Images were stored digitally with no compression to keep the highest possible image quality. The LIUM-AVS database comprises two parts:

• PBS Phonetically balanced sentences in French
• LET Spelled letters (also in French)

These two parts contain sequences with both natural and blue lips. The whole database is released mainly to test and compare lip segmentation approaches on natural images, but speech recognition experiments may also be carried out using this corpus. For information on obtaining the LIUM-AVS database, please contact us through our webpage (http://www-lium.univ-lemans.fr/lium/avs-database).
Construction of an Advanced In-Car Spoken
Dialogue Corpus and its Characteristic Analysis
Itsuki Kishida, Yuki Irie, Yukiko Yamaguchi, Shigeki
Matsubara, Nobuo Kawaguchi, Yasuyoshi Inagaki;
Nagoya University, Japan
This paper describes an advanced spoken language corpus which
has been constructed by enhancing an in-car speech database. The
corpus has the following characteristic features: (1) Advanced tags: not only linguistic phenomena tags but also advanced discourse tags, such as sentential structures and utterance intentions, have been provided for the transcribed texts. (2) Large scale: the sentential structures and the intentions are currently provided for 45,053 phrases and 35,421 utterance units, respectively. (3) Multi-layer: the corpus consists of different levels of spoken language data, such
as speech signals, transcribed texts, sentential structures, intentional markers and dialogue structures; moreover, these levels are related to each other, which allows a very wide variety of analyses of spontaneous spoken dialogue using the multi-layered corpus. This paper also reports the results of an investigation of the corpus, focusing especially on the relations between the syntactic and intentional styles of spoken utterances.
Measuring the Readability of Automatic
Speech-to-Text Transcripts
Douglas A. Jones, Florian Wolf, Edward Gibson, Elliott
Williams, Evelina Fedorenko, Douglas A. Reynolds,
Marc Zissman; Massachusetts Institute of Technology,
USA
This paper reports initial results from a novel psycholinguistic
study that measures the readability of several types of speech transcripts. We define a four-part figure of merit to measure readability:
accuracy of answers to comprehension questions, reaction-time for
passage reading, reaction-time for question answering and a subjective rating of passage difficulty. We present results from an experiment with 28 test subjects reading transcripts in four experimental
conditions.
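The paper does not say here how its four components are combined into a single figure; one simple illustrative possibility is to z-normalize each component across conditions and average them, negating the three components where lower is better. The helper below is our own sketch under that assumption, not the authors' metric.

```python
import statistics

def readability_scores(conditions):
    """Combine a four-part readability measurement into one score per
    transcript condition.  `conditions` maps a condition name to a dict
    with keys 'accuracy' (higher = better), 'read_rt', 'answer_rt' and
    'difficulty' (lower = better).  Each component is z-normalized across
    conditions; the 'lower is better' components are negated before
    averaging.  Purely illustrative -- not the metric used in the paper.
    """
    parts = {'accuracy': +1, 'read_rt': -1, 'answer_rt': -1, 'difficulty': -1}
    z = {}
    for key, sign in parts.items():
        vals = [c[key] for c in conditions.values()]
        mu, sd = statistics.mean(vals), statistics.pstdev(vals) or 1.0
        z[key] = {name: sign * (c[key] - mu) / sd for name, c in conditions.items()}
    # average the four signed z-scores per condition
    return {name: statistics.mean(z[k][name] for k in parts) for name in conditions}
```

A condition that is better on all four components then receives a strictly higher composite score than one that is worse on all four.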
The NESPOLE! VoIP Multilingual Corpora in
Tourism and Medical Domains
Nadia Mana 1, Susanne Burger 2, Roldano Cattoni 1, Laurent Besacier 3, Victoria MacLaren 2, John McDonough 4, Florian Metze 4; 1 ITC-irst, Italy; 2 Carnegie Mellon University, USA; 3 CLIPS-IMAG Laboratory, France; 4 Universität Karlsruhe, Germany
From Switchboard to Fisher: Telephone Collection
Protocols, Their Uses and Yields
Christopher Cieri, David Miller, Kevin Walker;
University of Pennsylvania, USA
This paper describes several methodologies for collecting conversational telephone speech (CTS) comparing their design, goals and
yields. We trace the evolution of the Switchboard protocol including
recent adaptations that have allowed for very cost-efficient data collection. We compare Switchboard to the CallHome and CallFriend
protocols that have similarly produced CTS data for speech technologies research. Finally, we introduce the new “Fisher” protocol
comparing its design and yield to the other protocols. We conclude
with a summary of data resources that result from each of the protocols described herein and that are generally available.
Development of the Estonian SpeechDat-Like
Database
Einar Meister, Jürgen Lasn, Lya Meister; Tallinn
Technical University, Estonia
A new database project was launched in Estonia last year. It aims at the collection of telephone speech from a large number of speakers for speech and speaker recognition purposes. Up to 2000 speakers are expected to participate in the recordings. SpeechDat databases, especially the Finnish SpeechDat, have been chosen as a prototype for the Estonian database. This means that the principles of corpus design, file formats, and recording and labelling methods implemented by the SpeechDat consortium will be followed as closely as possible. The paper is a progress report on the project.
In this paper we present the multilingual VoIP (Voice over Internet Protocol) corpora collected for the second showcase of the NESPOLE! project in the tourism and medical domains. The corpora comprise over 20 hours of human-to-human monolingual dialogues in English, French, German and Italian: 66 dialogues in the tourism domain and 49 in the medical domain. We describe the data collection in detail (technical set-up, scenarios for each domain, recording procedure and data transcription), and present corpus statistics and a preliminary data analysis.

Lexica and Corpora for Speech-to-Speech Translation: A Trilingual Approach
David Conejero, Jesús Giménez, Victoria Arranz, Antonio Bonafonte, Neus Pascual, Núria Castell, Asunción Moreno; Universitat Politècnica de Catalunya, Spain

The creation of lexica and corpora for Catalan, Spanish and US-English is described. A lexicon including information relevant for speech recognition and synthesis is being created. It contains 50K common words selected to achieve wide coverage of the chosen domains, and 50K additional entries including special application words and proper nouns. Furthermore, spontaneous speech corpora have been created for Catalan and Spanish. These corpora, together with other available US-English data, have been translated into the counterpart languages to produce a large trilingual corpus, which is being used to investigate the language-resource requirements for statistical machine translation.

Towards a Repository of Digital Talking Books
António Serralheiro 1, Isabel Trancoso 2, Diamantino Caseiro 2, Teresa Chambel 3, Luís Carriço 3, Nuno Guimarães 3; 1 INESC-ID/Academia Militar, Portugal; 2 INESC-ID/IST, Portugal; 3 LASIGE/FC, Portugal
Considerable effort has been devoted at L2F to increasing and broadening our speech and text data resources. Digital Talking Books (DTBs), comprising both speech and text data, are an invaluable asset as multimedia resources. Furthermore, these DTBs have undergone a speech-to-text alignment procedure, either word- or phone-based, to increase their potential for research activities. This paper describes the motivation for and the method used to align DTBs. The alignment enables specific access interfaces for persons with special needs, as well as tools for easily detecting and indexing units (words, sentences, topics) in the spoken books. The alignment tool was implemented in a Weighted Finite-State Transducer framework, which provides an efficient way to combine different types of knowledge sources, such as alternative pronunciation rules. With this tool, a 2-hour-long spoken book was aligned in a single step in much less than real time. Last but not least, new browsing interfaces, allowing improved access to and data retrieval from the DTBs, are also described in this paper.
Shared Resources for Robust Speech-to-Text
Technology
Stephanie Strassel, David Miller, Kevin Walker,
Christopher Cieri; University of Pennsylvania, USA
This paper describes ongoing efforts at Linguistic Data Consortium
to create shared resources for improved speech-to-text technology.
Under the DARPA EARS program, technology providers are charged
with creating STT systems whose outputs are substantially richer
and much more accurate than is currently possible. These aggressive program goals motivate new approaches to corpus creation
and distribution. EARS participants require multilingual broadcast
and telephone speech data, transcripts and annotations at a much
higher volume than for any previous program. While standard approaches to resource collection and creation are prohibitively expensive for this volume of material, within EARS new methods have
been established to allow for the development of vast quantities of
audio, transcripts and annotations. New distribution methods also
provide for efficient deployment of needed resources to participating research sites as well as enabling eventual publication to a wider
community of language researchers.
Eurospeech 2003
Wednesday
September 1-4, 2003 – Geneva, Switzerland
Session: OWeBa– Oral
Speech Recognition - Adaptation II
Time: Wednesday 10.00, Venue: Room 1
Chair: John Hansen, Colorado Univ., USA

Large Vocabulary Conversational Speech Recognition with a Subspace Constraint on Inverse Covariance Matrices
Scott Axelrod, Vaibhava Goel, Brian Kingsbury, Karthik Visweswariah, Ramesh Gopinath; IBM T.J. Watson Research Center, USA

This paper applies the recently proposed SPAM models for acoustic modeling in a Speaker Adaptive Training (SAT) context on large vocabulary conversational speech databases, including the Switchboard database. SPAM models are Gaussian mixture models in which a subspace constraint is placed on the precision and mean matrices (although this paper focuses on the case of unconstrained means). They include diagonal covariance, full covariance, MLLT, and EMLLT models as special cases. Adaptation is carried out with maximum likelihood estimation of the means and feature space under the SPAM model. This paper shows the first experimental evidence that SPAM models can achieve significant word-error-rate improvements over state-of-the-art diagonal covariance models, even when those diagonal models are given the benefit of choosing the optimal number of Gaussians (according to the Bayesian Information Criterion). This paper is also the first to apply SPAM models in a SAT context. All experiments are performed on the IBM “Superhuman” speech corpus, a challenging and diverse conversational speech test set that includes the Switchboard portion of the 1998 Hub5e evaluation data set.

Speaker Adaptation Based on Confidence-Weighted Training
Gyucheol Jang, Minho Jin, Chang D. Yoo; KAIST, Korea

This paper presents a novel method to enhance the performance of traditional speaker adaptation algorithms using a discriminative adaptation procedure based on a novel confidence measure and nonlinear weighting. Regardless of the distribution of the adaptation data, traditional model adaptation methods incorporate the adaptation data indiscriminately. When the data size is small and the parameter tying is extensive, adaptation based on outliers can be detrimental. A way to discriminate the contribution of each data point in the adaptation is to incorporate a likelihood-based confidence measure. We evaluate and compare the performance of the proposed weighted SMAP (WSMAP), which controls the contribution of each data point by sigmoid weighting using a novel confidence measure. The effectiveness of the proposed algorithm is experimentally verified by adapting native speaker models to a nonnative speaker environment using TIDIGIT.

Jacobian Adaptation Based on the Frequency-Filtered Spectral Energies
Alberto Abad, Climent Nadeu, Javier Hernando, Jaume Padrell; Universitat Politècnica de Catalunya, Spain

Jacobian Adaptation (JA) of the acoustic models is an efficient adaptation technique for robust speech recognition. Several improvements to JA have been proposed in recent years, either to generalize the Jacobian linear transformation for the case of large noise mismatch between training and testing, or to extend the adaptation to other degrading factors, such as channel distortion and vocal tract length. However, the JA technique has so far only been used with the conventional mel-frequency cepstral coefficients (MFCC). In this paper, the JA technique is applied to an alternative type of features, the Frequency-Filtered (FF) spectral energies, resulting in a more computationally efficient approach. Furthermore, in experimental tests with the Aurora1 database, this new approach has shown improved recognition performance with respect to Jacobian adaptation with MFCCs.

Structural Linear Model-Space Transformations for Speaker Adaptation
Driss Matrouf, Olivier Bellot, Pascal Nocera, Georges Linares, Jean-François Bonastre; LIA-CNRS, France

Within the framework of speaker adaptation, a technique based on a tree structure and the maximum a posteriori criterion was previously proposed (SMAP). In SMAP, the parameter estimation at each node in the tree is based on the assumption that the mismatch between the training and adaptation data is a Gaussian PDF whose parameters are estimated using the Maximum Likelihood criterion. To avoid poor estimation accuracy of the transformation parameters due to insufficient adaptation data at a node, we propose a new technique based on the maximum a posteriori approach and the merging of Gaussian PDFs. The basic idea behind this new technique is to estimate affine transformations which bring the training acoustic models as close as possible to the test acoustic models, rather than transformations maximizing the likelihood of the adaptation data. In this manner, even with a very small amount of adaptation data, the transformation parameters are accurately estimated for means and variances. This adaptation strategy has shown a significant performance improvement in a large vocabulary speech recognition task, both alone and combined with MLLR adaptation.
Minimum Classification Error (MCE) Model
Adaptation of Continuous Density HMMS
Xiaodong He 1 , Wu Chou 2 ; 1 University of Missouri,
USA; 2 Avaya Labs Research, USA
In this paper, a framework of minimum classification error (MCE) model adaptation for continuous density HMMs is proposed, based on the approach of a “super” string model. We show that the error rate minimization in the proposed approach can be formulated as maximizing a special ratio of two positive functions, and from that a general growth transform algorithm is derived for MCE-based model adaptation. This algorithm departs from the generalized probability descent (GPD) algorithm, and it is well suited for model adaptation with a small amount of training data. The proposed approach is applied to linear regression based variance adaptation, and the closed-form solution for variance adaptation using MCE linear regression (MCELR) is derived. The MCELR approach is evaluated on large vocabulary speech recognition tasks. The relative performance gain is more than doubled on the standard (WSJ Spoke 3) database, compared to maximum likelihood linear regression (MLLR) based variance adaptation with the same amount of adaptation data.
Adapting Acoustic Models to New Domains and
Conditions Using Untranscribed Data
Asela Gunawardana, Alex Acero; Microsoft Research,
USA
This paper investigates the unsupervised adaptation of an acoustic
model to a domain with mismatched acoustic conditions. We use
techniques borrowed from the unsupervised training literature to
adapt an acoustic model trained on the Wall Street Journal corpus to
the Aurora-2 domain, which is composed of read digit strings over
a simulated noisy telephone channel. We show that it is possible
to use untranscribed in-domain data to get significant performance
improvements, even when it is severely mismatched to the acoustic
model training data.
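Unsupervised adaptation of this kind is commonly realized as a confidence-filtered self-training loop: decode the untranscribed data, keep only the hypotheses the model is confident about, and re-estimate parameters from those. The toy sketch below illustrates that loop with nearest-mean "decoding" and a decision-margin confidence; the function and all its details are our own illustration, not the procedure of the paper.

```python
import numpy as np

def unsupervised_adapt(means, frames, conf_threshold=0.5, rounds=3, lr=0.2):
    """Confidence-filtered self-training on untranscribed data (toy sketch).

    Each class is modelled by a single mean vector; 'decoding' a frame is
    nearest-mean classification, and the confidence is the margin between
    the two closest classes.  Frames whose confidence exceeds the threshold
    shift the winning mean toward the in-domain data.
    """
    means = means.copy()
    for _ in range(rounds):
        for x in frames:
            d = np.linalg.norm(means - x, axis=1)      # distance to each class mean
            best, runner_up = np.argsort(d)[:2]
            confidence = d[runner_up] - d[best]        # decision margin
            if confidence > conf_threshold:            # keep only confident frames
                means[best] += lr * (x - means[best])  # move winning mean toward x
    return means
```

The confidence filter is what keeps severely mismatched or ambiguous frames from dragging the model in the wrong direction, which is the practical concern the paper addresses at full acoustic-model scale.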
Session: SWeBb– Oral
Towards Synthesizing Expressive Speech
Time: Wednesday 10.00, Venue: Room 2
Chair: Wael Hamza, IBM, USA
Towards Synthesising Expressive Speech;
Designing and Collecting Expressive Speech Data
Nick Campbell; ATR-HIS, Japan
Corpus-based speech synthesis needs representative corpora of human speech if it is to meet the needs of everyday spoken interaction. This paper describes methods for recording such corpora, and
details some difficulties (with their solutions) found in the use of
spontaneous speech data for synthesis.
Is There an Emotion Signature in Intonational
Patterns? And Can It be Used in Synthesis?
Applications of Computer Generated Expressive
Speech for Communication Disorders
Tanja Bänziger 1, Michel Morel 2, Klaus R. Scherer 1; 1 University of Geneva, Switzerland; 2 University of Caen, France
Jan P.H. van Santen, Lois Black, Gilead Cohen,
Alexander B. Kain, Esther Klabbers, Taniya Mishra,
Jacques de Villiers, Xiaochuan Niu; Oregon Health &
Science University, USA
Intonation is often considered to play an important role in the vocal
communication of emotion. Early studies using pitch manipulation
have supported this view. However, the properties of pitch contours involved in marking emotional state remain largely unidentified. In this contribution, a corpus of actor-generated utterances
for 8 emotions was used to measure intonation (pitch contour) by
identifying key features of the F0 contour. The data show that the
profiles obtained vary reliably with respect to F0 level as a function
of the degree of activation of the emotion concerned. However,
there is little evidence for qualitatively different forms of profiles
for different emotions. Results of recent collaborative studies on
the use of the F0 patterns identified in this research with synthesized utterances are presented. The nature of the contribution of
F0/pitch contours to emotional speech is discussed; it is argued
that pitch contours have to be considered as configurations that acquire emotional meaning only through interaction with a linguistic
and paralinguistic context.
This paper focuses on generation of expressive speech, specifically
speech displaying vocal affect. Generating speech with vocal affect
is important for diagnosis, research, and remediation for children
with autism and developmental language disorders. However, because vocal affect involves many acoustic factors working together
in complex ways, it is unlikely that we will be able to generate
compelling vocal affect with traditional diphone synthesis. Instead,
methods are needed that preserve as much of the original signals
as possible. We describe an approach to concatenative synthesis
that attempts to combine the naturalness of unit selection based
synthesis with the ability of diphone based synthesis to handle unrestricted input domains.
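Several abstracts in this session build on unit-selection synthesis, where a unit sequence is chosen by minimising summed target and join costs. As an editorial illustration only — the toy features, costs, and inventory below are invented, and this is not any of the systems described above — the search can be sketched as a Viterbi pass over candidate units:

```python
def select_units(targets, inventory, w_target=1.0, w_join=1.0):
    """Pick one unit per target position by minimising target + join costs.

    targets:   desired feature value per position (e.g. a target pitch)
    inventory: per position, a list of candidate units (feature, edge_spectrum)
    """
    tcost = lambda cand, spec: abs(cand[0] - spec)   # target mismatch
    jcost = lambda a, b: abs(a[1] - b[1])            # join discontinuity

    # trellis[i][k] = (best path cost ending in candidate k, backpointer)
    trellis = [[(w_target * tcost(c, targets[0]), None) for c in inventory[0]]]
    for i in range(1, len(targets)):
        column = []
        for c in inventory[i]:
            costs = [trellis[i - 1][k][0] + w_join * jcost(p, c)
                     for k, p in enumerate(inventory[i - 1])]
            k = min(range(len(costs)), key=costs.__getitem__)
            column.append((costs[k] + w_target * tcost(c, targets[i]), k))
        trellis.append(column)

    # trace back the cheapest unit sequence
    k = min(range(len(trellis[-1])), key=lambda j: trellis[-1][j][0])
    path = [k]
    for i in range(len(targets) - 1, 0, -1):
        k = trellis[i][k][1]
        path.append(k)
    return path[::-1]
```

Real systems use multi-dimensional target specifications and spectral join costs, but the trellis structure is the same.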
Multilayered Extensions to the Speech Synthesis Markup Language for Describing Expressiveness

E. Eide, R. Bakis, W. Hamza, J. Pitrelli; IBM T.J. Watson Research Center, USA

In this paper we discuss possible extensions to the Speech Synthesis Markup Language (SSML) to facilitate the generation of synthetic expressive speech. The proposed extensions are hierarchical in nature, allowing specification in terms of physical parameters such as instantaneous pitch, higher-level parameters such as ToBI labels, or abstract concepts such as emotions. Low-level tags tend to change their values frequently, even within a word, while the more abstract tags generally apply to whole words, sentences or paragraphs. We envision interfaces at different levels to serve different types of users; speech experts may want to use low-level interfaces while artists may prefer to interface with the TTS system at more abstract levels.

Unit Selection and Emotional Speech

Alan W. Black; Carnegie Mellon University, USA

Unit Selection Synthesis, where appropriate units are selected from large databases of natural speech, has greatly improved the quality of speech synthesis. But the quality improvement has come at a cost. The quality of the synthesis relies on the fact that little or no signal processing is done on the selected units, thus the style of the recording is maintained in the quality of the synthesis. The synthesis style is implicitly the style of the database. If we want more general flexibility we have to record more data of the desired style, which means that our already large unit databases must be made even larger. This paper gives examples of how to produce varied style and emotion using existing unit selection synthesis techniques and also highlights the limitations of generating truly flexible synthetic voices.

Voice Quality Modification for Emotional Speech Synthesis

Christophe d'Alessandro, Boris Doval; LIMSI-CNRS, France

Synthesis of expressive speech has demonstrated that convincing natural-sounding results are impossible to obtain without dealing with voice quality parameters. Time-domain and spectral-domain models of the voice source signal are presented. Then algorithms for analysis and synthesis of voice quality are discussed, including modification of the periodic and aperiodic components. These algorithms may be useful for applications such as pre-processing of speech corpora, modification of voice quality parameters together with intonation in synthesis, and voice transformation.

Session: OWeBc – Oral
Speaker Verification
Time: Wednesday 10.00, Venue: Room 3
Chair: Douglas Reynolds, MIT Lincoln Laboratory, USA

Speaker Verification Systems and Security Considerations

David A. van Leeuwen; TNO Human Factors, The Netherlands

In speaker verification technology, the security considerations are quite different from the performance measures that are usually studied. The security level of a system is generally expressed as the amount of effort it takes to make a successful break-in attempt. This paper discusses potential weaknesses of speaker verification systems and methods of exploiting these weaknesses, and suggests proper experiments for determining the security level of a speaker verification system.

Phonetic Class-Based Speaker Verification

Matthieu Hébert, Larry P. Heck; Nuance Communications, USA

Phonetic Class-Based Speaker Verification (PCBV) is a natural refinement of the traditional single Gaussian Mixture Model (Single GMM) scheme. The aim is to accurately model the voice characteristics of a user on a per-phonetic-class basis. The paper briefly describes the implementation of a representation of the voice characteristics in a hierarchy of phonetic classes. We present a framework to easily study the effect of the modeling on the PCBV. A thorough study of the effect of modeling complexity, the amount of enrollment data and noise conditions is presented. It is shown that Phoneme-based Verification (PBV), a special case of PCBV, is the optimal modeling scheme and consistently outperforms state-of-the-art Single GMM modeling even in noisy environments. PBV achieves 9% to 14% relative error rate reduction while cutting the speaker model size by 50% and CPU usage by two thirds.

An Evaluation of VTS and IMM for Speaker Verification in Noise

Suhadi, Sorel Stan, Tim Fingscheidt, Christophe Beaugeant; Siemens AG, Germany

The performance of speaker verification (SV) systems degrades rapidly in noise, rendering them unsuitable for security-critical applications in mobile phones, where false acceptance rates (FAR) of ∼10⁻⁴ are required. However, less demanding applications for which equal error rates (EER) comparable to the word error rates (WER) of speech recognizers are acceptable could benefit from SV technology. In this paper we evaluate two feature-based noise compensation algorithms in the context of SV: vector Taylor series (VTS) combined with statistical linear approximation (SLA), and Kalman filter-based interacting multiple models (IMM). Tests with the YOHO database and the NTT-AT ambient noises show that EERs as low as 5%-10% in medium to high noise conditions can be achieved for a text-independent SV system.
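The EER figures quoted in this and several neighbouring abstracts are read off at the operating point where the false acceptance and false rejection rates coincide. A minimal illustrative computation (the score values below are invented, not from any of these papers):

```python
def equal_error_rate(genuine, impostor):
    """Sweep thresholds; return (EER, threshold) where FAR and FRR meet.

    genuine:  verification scores from true-speaker trials (higher = accept)
    impostor: verification scores from impostor trials
    """
    best_gap, eer, thr = float("inf"), None, None
    for t in sorted(set(genuine) | set(impostor)):
        frr = sum(s < t for s in genuine) / len(genuine)     # false rejections
        far = sum(s >= t for s in impostor) / len(impostor)  # false acceptances
        if abs(far - frr) < best_gap:
            best_gap, eer, thr = abs(far - frr), (far + frr) / 2, t
    return eer, thr
```

On finite trial sets the two rates rarely cross exactly, so the sketch reports the midpoint at the threshold where they come closest.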
Locally Recurrent Probabilistic Neural Network for Text-Independent Speaker Verification

Todor Ganchev, Dimitris K. Tasoulis, Michael N. Vrahatis, Nikos Fakotakis; University of Patras, Greece

This paper introduces Locally Recurrent Probabilistic Neural Networks (LRPNN) as an extension of the well-known Probabilistic Neural Networks (PNN). An LRPNN, in contrast to a PNN, is sensitive to the context in which events occur, and therefore identification of time or spatial correlations is attainable. Besides the definition of the LRPNN architecture, a fast three-step training method is proposed. The first two steps are identical to the training of traditional PNNs, while the third step is based on the Differential Evolution optimization method. Finally, the superiority of LRPNNs over PNNs on the task of text-independent speaker verification is demonstrated.

Learning to Boost GMM Based Speaker Verification

Stan Z. Li, Dong Zhang, Chengyuan Ma, Heung-Yeung Shum, Eric Chang; Microsoft Research Asia, China

The Gaussian mixture model (GMM) has proved to be an effective probabilistic model for speaker verification, and has been widely used in most state-of-the-art systems. In this paper, we introduce a new method for the task: AdaBoost learning based on the GMM. The motivation is the following: while a GMM linearly combines a number of Gaussian models according to a set of mixing weights, we believe that there exists a better means of combining the individual Gaussian models. The proposed AdaBoost-GMM method is non-parametric, in which a selected set of weak classifiers, each constructed based on a single Gaussian model, is optimally combined to form a strong classifier, the optimality being in the sense of maximum margin. Experiments show that the boosted GMM classifier yields a 10.81% relative reduction in equal error rate for the same handsets and 11.24% for different handsets, a significant improvement over the baseline adapted GMM system.

Speaker Verification Based on G.729 and G.723.1 Coder Parameters and Handset Mismatch Compensation

Eric W.M. Yu 1, Man-Wai Mak 1, Chin-Hung Sit 1, Sun-Yuan Kung 2; 1 Hong Kong Polytechnic University, China; 2 Princeton University, USA

A novel technique for speaker verification over a communication network is proposed. The technique employs cepstral coefficients (LPCCs) derived from G.729 and G.723.1 coder parameters as feature vectors. Based on the LP coefficients derived from the coder parameters, LP residuals are reconstructed, and the verification performance is improved by taking account of the additional speaker-dependent information contained in the reconstructed residuals. This is achieved by adding the LPCCs of the LP residuals to the LPCCs derived from the coder parameters. To reduce the acoustic mismatch between different handsets, a technique combining a handset selector with stochastic feature transformation is employed. Experimental results based on 150 speakers show that the proposed technique outperforms approaches that only utilize the coder-derived LPCCs.

Session: OWeBd – Oral
Dialog System Generation
Time: Wednesday 10.00, Venue: Room 4
Chair: Rolf Carlson, KTH, Stockholm, Sweden

Should I Tell All?: An Experiment on Conciseness in Spoken Dialogue

Stephen Whittaker 1, Marilyn Walker 1, Preetam Maloor 2; 1 University of Sheffield, U.K.; 2 University of Toronto, Canada

Spoken dialogue systems have a strong requirement to produce concise and informative utterances. While interacting over a phone, users must both understand the system's utterances and remember important facts that the system is providing. Thus most dialogue systems implement some combination of different techniques for (1) option selection: pruning the set of options; (2) information selection: selecting a subset of information to present about each option; and (3) aggregation: combining multiple items of information succinctly. We first describe how user models based on multi-attribute decision theory support domain-independent algorithms for both option selection and information selection. We then describe experiments to determine an optimal level of conciseness in information selection, i.e. how much information to include for an option. Our results show that (a) users are highly oriented to utterance conciseness; (b) the information selection algorithm is highly consistent with users' judgments of conciseness; and (c) the appropriate level of conciseness is both user and dialogue strategy dependent.

Natural Language Response Generation in Mixed-Initiative Dialogs Using Task Goals and Dialog Acts

Helen M. Meng, Wing Lin Yip, Oi Yan Mok, Shuk Fong Chan; Chinese University of Hong Kong, China

This paper presents our approach towards natural language response generation for mixed-initiative dialogs in the CUHK Restaurants domain. Our experimental corpus consists of about 4000 customer requests and waiter responses. Every request/response utterance is annotated with its task goal (TG) and dialog act (DA). The variable pair {TG, DA} is used to represent the dialog state. Our approach involves a set of corpus-derived dialog state transition rules of the form {TG, DA}_request → {TG, DA}_response. These rules encode the communication goal(s) and initiatives of the request/response. Another set of hand-designed rules associates each response dialog state with one or more text generation templates. Upon testing, our system parses the input customer request for concept categories and from these infers the TG and DA using trained Belief Networks. Application of the dialog state transition rules and text generation templates automatically generates a (virtual) waiter response. Ten subjects were invited to interact with the system. Performance evaluation based on Grice's maxims gave a mean score of 4 on a five-point Likert scale and a task completion rate of at least 90%.

Speech Generation from Concept for Realizing Conversation with an Agent in a Virtual Room

Keikichi Hirose, Junji Tago, Nobuaki Minematsu; University of Tokyo, Japan

Concept-to-speech generation was realized in an agent dialogue system, in which an agent (a stuffed animal) walks around in a small room rendered on a computer display to complete jobs given as instructions from a user. The communication between the user and the agent is done through speech. If the agent cannot complete a job because of some difficulty, it tries to solve the problem through conversation with the user. Differently from other spoken dialogue systems, the speech output from the agent is generated directly from the concept and synthesized using higher-level linguistic information. This scheme can largely improve the prosodic quality of the speech output. In order to realize the concept-to-speech conversion, the linguistic information is handled as a tree structure throughout the dialogue process.

A Trainable Generator for Recommendations in Multimodal Dialog

Marilyn Walker 1, Rashmi Prasad 2, Amanda Stent 3; 1 University of Sheffield, U.K.; 2 University of Pennsylvania, USA; 3 Stony Brook University, USA

As the complexity of spoken dialogue systems has increased, there has been increasing interest in spoken language generation (SLG). SLG promises portability across application domains and dialogue situations through the development of application-independent linguistic modules. In practice, however, rule-based SLGs often have to be tuned to the application. Recently, a number of research groups have been developing hybrid methods for spoken language generation, combining general linguistic modules with methods for training parameters for particular applications. This paper describes the use of boosting to train a sentence planner to generate recommendations for restaurants in MATCH, a multimodal dialogue system providing entertainment information for New York.
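The dialog-state formulation described in the Meng et al. abstract — {TG, DA} pairs, corpus-derived transition rules, and hand-designed templates — can be made concrete with a toy sketch. Every rule, state name, and template below is an invented example, not drawn from their corpus:

```python
# Corpus-derived transition rules: request state -> response state
# (hypothetical states for illustration)
TRANSITIONS = {
    ("order_food", "request"): ("order_food", "confirm"),
    ("ask_price", "request"):  ("ask_price", "inform"),
}

# Hand-designed templates keyed by response dialog state
TEMPLATES = {
    ("order_food", "confirm"): "Certainly, one {item} coming up.",
    ("ask_price", "inform"):   "The {item} costs {price}.",
}

def respond(task_goal, dialog_act, **slots):
    """Apply a transition rule, then realise the response state as text."""
    state = TRANSITIONS[(task_goal, dialog_act)]
    return TEMPLATES[state].format(**slots)
```

In the paper the request-side {TG, DA} pair is not given but inferred from concept categories with trained Belief Networks; this sketch starts after that inference step.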
Spoken Dialogue System for Queries on Appliance Manuals Using Hierarchical Confirmation Strategy

Tatsuya Kawahara, Ryosuke Ito, Kazunori Komatani; Kyoto University, Japan

We address a dialogue framework for queries on manuals of electric appliances with a speech interface. Users can make queries by unconstrained speech, from which keywords are extracted and matched to the items in the manual. As a result, many items are usually obtained. We therefore introduce an effective dialogue strategy which narrows down the items using a tree structure extracted from the manual. Three cost functions are presented and compared to minimize the number of dialogue turns. We have evaluated the system performance on a VTR manual query task. The average number of dialogue turns is reduced to 71% using our strategy compared with a conventional method that makes confirmations in turn according to the matching likelihood. Thus, the proposed system helps users find their intended items more efficiently.

SAG: A Procedural Tactical Generator for Dialog Systems

Dalina Kallulli; SAIL LABS Technology, Austria

Widely used declarative approaches to generation, in which generation speed is a function of grammar size, are not optimal for real-time dialog systems. We argue that a procedural system like the one we present is potentially more efficient for time-critical real-world generation applications, as it provides fine-grained control of each processing step on the way from input to output representations. In this way the procedural behaviour of the generator can be tailored to the task at hand. During the generation process, the realizer generates flat deep structures from semantic-pragmatic expressions, then syntactic deep structures from the deep semantic-pragmatic structures, and from these syntactic deep structures surface strings. Nine different generation levels can be distinguished and are described in the paper.

Session: PWeBe – Poster
Speech Signal Processing II
Time: Wednesday 10.00, Venue: Main Hall, Level -1
Chair: Matti Karjalainen, HUT, Finland

Optimization of the CELP Model in the LSP Domain

Khosrow Lashkari, Toshio Miki; DoCoMo USA Labs, USA

This paper presents a new Analysis-by-Synthesis (AbS) technique for joint optimization of the excitation and model parameters based on minimizing the closed-loop synthesis error instead of the linear prediction error. By minimizing the synthesis error, the analysis and synthesis stages become more compatible. Using a gradient descent algorithm, LSPs for a given excitation are optimized to minimize the error between the original and the synthesized speech. Since the optimization starts from the LPC solution, the synthesis error is guaranteed to be lower than that obtained using the LPC coefficients. For the ITU G.729 codec, there is about 1 dB of improvement in the segmental SNR for male and female speakers over 4 to 6 second long sentences. By adding an extra optimization step, the technique can be incorporated into LPC, multi-pulse LPC and CELP-type speech coders.

Transforming Voice Quality

Ben Gillett, Simon King; University of Edinburgh, U.K.

Voice transformation is the process of transforming the characteristics of speech uttered by a source speaker, such that a listener would believe the speech was uttered by a target speaker. In this paper we address the problem of transforming voice quality; we do not attempt to transform prosody. Our system has two main parts corresponding to the two components of the source-filter model of speech production. The first component transforms the spectral envelope as represented by a linear prediction model. The transformation is achieved using a Gaussian mixture model, which is trained on aligned speech from source and target speakers. The second part of the system predicts the spectral detail from the transformed linear prediction coefficients. A novel approach is proposed, which is based on a classifier and residual codebooks. On the basis of a number of performance metrics it outperforms existing systems.

DOA Estimation of Speech Signal Using Equilateral-Triangular Microphone Array

Yusuke Hioka, Nozomu Hamada; Keio University, Japan

In this contribution, we propose a DOA (Direction Of Arrival) estimation method for speech signals whose angular resolution is almost uniform with respect to DOA. Our previous DOA estimation method [1] achieves high precision with only two microphones; however, its resolution degrades as the propagating direction moves away from the array broadside. In the proposed method, an equilateral-triangular microphone array is adopted, and subspace analysis is applied. The efficiency of the proposed method is shown by both simulation and experimental results.

Multi-Array Fusion for Beamforming and Localization of Moving Speakers

Ilyas Potamitis, George Tremoulis, Nikos Fakotakis, George Kokkinakis; University of Patras, Greece

In this work we deal with the fusion of the estimates of independent microphone arrays to produce an improved estimate of the Direction of Arrival (DOA) of one moving speaker, as well as localization coordinates of multiple moving speakers based on Time Delays Of Arrival (TDOA). Our approach (a) fuses measurements from independent arrays, (b) incorporates kinematic information of speakers' movement by using parallel Kalman filters, and (c) associates observations to specific speakers by using a Probabilistic Data Association (PDA) technique. We demonstrate that a network of arrays combined with statistical fusion techniques provides a consistent and coherent way to reduce uncertainty and ambiguity of measurements. The efficiency of the approach is illustrated on a simulation dealing with beamforming of one moving speaker on an extended basis and localization of two closely spaced moving speakers with crossing trajectories.

Integrated Pitch and MFCC Extraction for Speech Reconstruction and Speech Recognition Applications

Xu Shao, Ben P. Milner, Stephen J. Cox; University of East Anglia, U.K.

This paper proposes an integrated speech front-end for both speech recognition and speech reconstruction applications. Speech is first decomposed into a set of frequency bands by an auditory model. The output of this is then used to extract both robust pitch estimates and MFCC vectors. Initial tests used a 128-channel auditory model, but results show that this can be reduced significantly, to between 23 and 32 channels. A detailed analysis of the pitch classification accuracy and the RMS pitch error shows the system to be more robust than both comb-function and LPC-based pitch extraction. Speech recognition results show that the auditory-based cepstral coefficients give very similar performance to conventional MFCCs. Spectrograms and informal listening tests also reveal that speech reconstructed from the auditory-based cepstral coefficients and pitch has similar quality to that reconstructed from conventional MFCCs and pitch.
Exploiting Time Warping in AMR-NB and AMR-WB Speech Coders

Lasse Laaksonen 1, Sakari Himanen 2, Ari Heikkinen 2, Jani Nurminen 2; 1 Tampere University of Technology, Finland; 2 Nokia Research Center, Finland

In this paper, a time warping algorithm is implemented and its performance is evaluated in the context of the Adaptive Multi-Rate (AMR) wideband (WB) and narrowband (NB) speech coders. The aim of time warping is to achieve bit savings in the transmission of pitch information with no significant quality degradation. In the case of the AMR-NB and AMR-WB speech coders, these bit savings are 0.65-1.15 kbit/s depending on the mode. The performance of the modified AMR speech coders is verified by subjective and objective measures in error-free conditions. MOS tests show that only a slight, statistically insignificant degradation of speech quality is experienced when time warping is implemented.

A Clustering Approach to On-Line Audio Source Separation

Julien Bourgeois; DaimlerChrysler AG, Germany

We have developed an on-line separation method for audio signals. The adopted approach makes use of the time-frequency transform of the signals as a sparse decomposition. Since the sources for the most part do not overlap in the time-frequency domain, we get raw estimates of their individual mixing parameters with an analysis of the mixture ratios. We then obtain reliable mixing parameters by dynamically clustering these instantaneous estimates. The mixing parameters are used to separate the mixtures, even at time-frequency points where the sources overlap. In addition, even when the mixing parameters change over time, our approach is able to separate the signals with only one pass through the data. We have evaluated this approach first on computer-generated anechoic mixtures and then on real echoic mixtures recorded in a car.

A New Approach to Voice Activity Detection Based on Self-Organizing Maps

Stephan Grashey; Siemens AG, Germany

Accurate discrimination between speech and non-speech is an essential part of many tasks in speech processing systems. In this paper an approach to the classification part of a Voice Activity Detector (VAD) is presented. Some possible shortcomings of present VAD systems are described, and a classification approach which overcomes these weaknesses is derived. This approach is based on a Self-Organizing Map (SOM), a neural network which is able to detect clusters within the feature space of its training data. Training of the classifier takes place in two steps: first, the SOM has to be trained. When finished, it is used in the second training step to learn the mapping between its classes and the desired outputs "speech" and "non-speech". Experiments on a database containing audio samples obtained under different noisy conditions show the potential of the proposed algorithm.

Estimation of Voice Source and Vocal Tract Characteristics Based on Multi-Frame Analysis

Yoshinori Shiga, Simon King; University of Edinburgh, U.K.

This paper presents a new approach for estimating the voice source and vocal tract filter characteristics of voiced speech. When the transfer function of a system is required in signal processing, the input and output of the system are experimentally observed and used to calculate the function. However, in the case of the source-filter separation we deal with in this paper, only the output (speech) is observed, and the characteristics of the system (vocal tract) and the input (voice source) must be estimated simultaneously. Hence the estimation becomes extremely difficult, and it is usually solved approximately using oversimplified models. We demonstrate that these characteristics are separable under the assumption that they are independently controlled by different factors. The separation is realised using an iterative approximation along with the Multi-frame Analysis method, which we have proposed for finding the spectral envelopes of voiced speech with minimum interference from the harmonic structure.

Estimating the Spectral Envelope of Voiced Speech Using Multi-Frame Analysis

Yoshinori Shiga, Simon King; University of Edinburgh, U.K.

This paper proposes a novel approach for estimating the spectral envelope of voiced speech independently of its harmonic structure. Because of the quasi-periodicity of voiced speech, its spectrum exhibits harmonic structure and only has energy at frequencies corresponding to integer multiples of F0. It is hence impossible to identify the transfer characteristics between adjacent harmonics. In order to resolve this problem, Multi-frame Analysis (MFA) is introduced. The MFA estimates a spectral envelope using many portions of speech which are vocalised using the same vocal-tract shape. Since each of the portions usually has a different F0, and hence a different harmonic structure, a number of harmonics can be obtained at various frequencies to form a spectral envelope. The method thereby gives a closer approximation to the vocal-tract transfer function.

A New Method for Pitch Prediction from Spectral Envelope and its Application in Voice Conversion

Taoufik En-Najjary 1, Olivier Rosec 1, Thierry Chonavel 2; 1 France Télécom R&D, France; 2 ENST Bretagne, France

This paper deals with the estimation of pitch from spectral envelope information alone. The proposed method uses a Gaussian Mixture Model (GMM) to characterize the joint distribution of the spectral envelope parameters and pitch-normalized values. During the learning stage, the model parameters are estimated by means of the EM algorithm. Then, a regression is made which enables the determination of a pitch prediction function from the spectral envelope coefficients. Results are presented which show the accuracy of the proposed method in terms of pitch prediction. Finally, the application of this method in a voice conversion system is described.

Adaptive Noise Estimation Using Second Generation and Perceptual Wavelet Transforms

Essa Jafer, Abdulhussain E. Mahdi; University of Limerick, Ireland

This paper describes the implementation and performance evaluation of three noise estimation algorithms using two different signal decomposition methods: a second-generation wavelet transform and a perceptual wavelet packet transform. These algorithms, which do not require the use of a speech activity detector or signal statistics learning histograms, are: a smoothing-based adaptive technique, a minimum variance tracking-based technique and a quantile-based technique. The paper also proposes a new and robust noise estimation technique, which utilises a combination of the quantile-based and smoothing-based algorithms. The performance of the latter technique is then evaluated and compared to those of the above three noise estimation methods under various noise conditions. Reported results demonstrate that all four algorithms are capable of tracking both stationary and non-stationary noise adequately, but with varying degrees of accuracy.

Maximum Likelihood Endpoint Detection with Time-Domain Features

Marco Orlandi, Alfiero Santarelli, Daniele Falavigna; ITC-irst, Italy

In this paper we propose an effective, robust and computationally low-cost HMM-based start-endpoint detector for speech recognisers. Our first attempts follow the classical scheme of a feature extractor and Viterbi classifier (used for voice activity detection), followed by a post-processing stage, but the ultimate goal we pursue is a pure HMM-based architecture capable of performing the endpointing task. The features used for voice activity detection are energy and zero crossing rate, together with the AMDF (Average Magnitude Difference Function), which proves to be a valid alternative to energy; further, we study the impact of grammar structures and training conditions on performance. In the end, we set the basis for the investigation of pure HMM-based architectures.
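The three time-domain features named in the Orlandi et al. abstract have simple per-frame definitions; a sketch follows (the alternating-sign frame in the test is synthetic, chosen so the AMDF dips at even lags and reveals the period):

```python
def frame_features(x, max_lag=4):
    """Per-frame energy, zero-crossing count, and AMDF over a range of lags."""
    energy = sum(s * s for s in x)
    zcr = sum(1 for a, b in zip(x, x[1:]) if (a >= 0) != (b >= 0))
    # AMDF: mean absolute difference between the frame and a lagged copy of
    # itself; minima across lags indicate periodicity (voiced speech)
    amdf = [sum(abs(x[n] - x[n - lag]) for n in range(lag, len(x))) / (len(x) - lag)
            for lag in range(1, max_lag + 1)]
    return energy, zcr, amdf
```

In the paper these features feed an HMM-based voice-activity classifier; how they are combined and decoded is the paper's contribution, not shown here.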
61
Eurospeech 2003
Wednesday
September 1-4, 2003 – Geneva, Switzerland
Integration of Noise Reduction Algorithms for
Aurora2 Task
Unified Analysis of Glottal Source Spectrum
Ixone Arroabarren, Alfonso Carlosena; Universidad
Publica de Navarra, Spain
The spectral study of the glottal excitation has traditionally been
based on a single time-domain mathematical model of the signal,
and the spectral dependence on its time domain parameters. Opposite to this approach, in this work the two most widely used time
domain models have been studied jointly, namely the KLGLOTT88
and the LF models. Their spectra are analyzed in terms of their
dependence on the general glottal source parameters: Open quotient, asymmetry coefficient and spectral tilt. As a result, it has
been proved that even though the mathematical expressions for
both models are quite different, they can be made to converge. The
main difference found is that in the KLGLOTT88 model the asymmetry coefficient is not independent of the open quotient and the
spectral tilt. Once this relationship has been identified and translated to LF model, both models are shown to be equivalent in both
time and frequency domains.
En este trabajo se ha analizado el espectro de la derivada de la
fuente glotal. Este tipo de estudios tradicionalmente se han enfocado hacia el estudio de un determinado modelo de la fuente glotal,
y cómo afectan los parámetros temporal de dicho modelo a su espectro. Por el contrario, en este caso se pretende dar una visión
más general, y para ello se han estudiado conjuntamente dos de
los modelos temporales de fuente glotal más relevantes: el modelo KLGLOTT88 y el modelo LF. El espectro de ambos modelos ha
sido estudiado en términos de las tres características de la fuente
glotal que tiene modelar cualquier modelo matemático: el cociente
de apertura, el coeficiente de asimetría y la tendencia o inclinación
espectral. Como consecuencia de este estudio se ha podido comprobar que a pesar de que las expresiones matemáticas de ambos
modelos son muy diferentes, la principal diferencia entre ambos
reside en que en el caso del modelo KLGLOTT88 el coeficiente de
asimetría viene determinado por el cociente de apertura y la tendencia espectral. Dada la relación matemática entre los parámetros
se puede demostrar que en estas condiciones ambos modelos de
fuente son equivalentes en el domino temporal y el dominio espectral.
Session: PWeBf– Poster
Robust Speech Recognition I
Time: Wednesday 10.00, Venue: Main Hall, Level -1
Chair: Christian Wellekens, Eurecom, France
A Hidden Markov Model-Based Missing Data
Imputation Approach
Yu Luo, Limin Du; Chinese Academy of Sciences,
China
The accuracy of an automatic speech recognizer degrades rapidly when speech is distorted by noise, and robustness against noise has become one of the key challenges. In this paper, a hidden Markov model (HMM) based missing data imputation approach is presented to improve speech recognition robustness against noise at the front end of the recognizer. Considering the correlation between different filter banks, the approach performs missing data imputation with an HMM of L states, each of which has a Gaussian output distribution with a full covariance matrix. "Missing" data in speech filter-bank vector sequences are recovered by a MAP procedure from a locally optimal state path or a marginal Viterbi-decoded HMM state sequence. The potential of the approach was tested using a speaker-independent continuous Mandarin speech recognizer with a syllable loop of perplexity 402, for both Gaussian and babble noise, each at six SNR levels ranging from 0 dB to 25 dB, and showed a significant improvement in robustness against additive noise.
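The core of missing-data imputation under a full-covariance Gaussian is the conditional mean of the missing channels given the observed ones. The sketch below is a minimal single-Gaussian illustration only (the paper uses an L-state HMM with Viterbi state selection, which is omitted here; the function name and shapes are hypothetical):

```python
import numpy as np

def map_impute(x, missing, mean, cov):
    """MAP-impute the masked entries of a feature vector under a
    full-covariance Gaussian: the conditional mean of the missing
    part given the observed part."""
    m = np.asarray(missing)          # boolean mask, True = missing
    o = ~m
    # Partition the covariance into missing/observed blocks.
    cov_mo = cov[np.ix_(m, o)]
    cov_oo = cov[np.ix_(o, o)]
    # Conditional mean: mu_m + Sigma_mo Sigma_oo^{-1} (x_o - mu_o).
    x_imp = x.copy()
    x_imp[m] = mean[m] + cov_mo @ np.linalg.solve(cov_oo, x[o] - mean[o])
    return x_imp
```

With a diagonal covariance the imputed value falls back to the prior mean; off-diagonal terms let observed filter banks pull the estimate, which is why the full covariance matters in the paper.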
Takeshi Yamada 1, Jiro Okada 1, Kazuya Takeda 2, Norihide Kitaoka 3, Masakiyo Fujimoto 4, Shingo Kuroiwa 5, Kazumasa Yamamoto 6, Takanobu Nishiura 7, Mitsunori Mizumachi 8, Satoshi Nakamura 8; 1 University of Tsukuba, Japan; 2 Nagoya University, Japan; 3 Toyohashi University of Technology, Japan; 4 Ryukoku University, Japan; 5 University of Tokushima, Japan; 6 Shinshu University, Japan; 7 Wakayama University, Japan; 8 ATR-SLT, Japan
To achieve high recognition performance for a wide variety of noises and a wide range of signal-to-noise ratios, this paper presents the integration of four noise reduction algorithms: spectral subtraction with smoothing in the time direction, temporal-domain SVD-based speech enhancement, GMM-based speech estimation, and KLT-based comb filtering. Recognition results on the Aurora2 task show that the effectiveness of these algorithms and their combinations strongly depends on the noise conditions, and that excessive noise reduction tends to degrade recognition performance in multi-condition training.
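The first of the four algorithms, spectral subtraction, can be sketched in its basic textbook form (a generic version with hypothetical parameter values; the paper's variant adds smoothing in the time direction, which is not shown):

```python
import numpy as np

def spectral_subtract(power, noise_power, alpha=2.0, beta=0.01):
    """Power-spectral subtraction: remove an over-estimated noise
    spectrum (over-subtraction factor alpha) and floor the result
    at a fraction beta of the noisy power to avoid negative bins."""
    clean = power - alpha * noise_power
    return np.maximum(clean, beta * power)
```

The over-subtraction factor trades residual noise against speech distortion; as the abstract notes, pushing it too far ("excessive noise reduction") hurts recognition.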
Classification with Free Energy at Raised
Temperatures
Rita Singh 1, Manfred K. Warmuth 2, Bhiksha Raj 3, Paul Lamere 4; 1 Carnegie Mellon University, USA; 2 University of California at Santa Cruz, USA; 3 Mitsubishi Electric Research Laboratories, USA; 4 Sun Microsystems Laboratories, USA
In this paper we describe a generalized classification method for
HMM-based speech recognition systems that uses free energy as a
discriminant function rather than conventional probabilities. The
discriminant function incorporates a single adjustable temperature
parameter T . The computation of free energy can be motivated
using an entropy regularization, where the entropy grows monotonically with the temperature. In the resulting generalized classification scheme, the values of T = 0 and T = 1 give the conventional
Viterbi and forward algorithms, respectively, as special cases. We
show experimentally that if the test data are mismatched with the
classifier, classification at temperatures higher than one can lead to
significant improvements in recognition performance. The temperature parameter is far more effective at improving performance on mismatched data than a variance scaling factor, another single adjustable parameter with a very similar analytical form.
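The role of the temperature can be illustrated on a toy set of path log-probabilities (a sketch of the temperature-generalized scoring rule only, not of the authors' HMM implementation; the function name is hypothetical):

```python
import numpy as np

def free_energy_score(log_path_probs, T):
    """Temperature-generalized path score (negative free energy).
    T -> 0 recovers the best-path (Viterbi) score; T = 1 gives the
    total (forward) log-probability over all paths."""
    lp = np.asarray(log_path_probs, dtype=float)
    if T == 0.0:
        return lp.max()                      # Viterbi limit
    # T * log sum_i exp(lp_i / T), computed stably via max-shift.
    scaled = lp / T
    m = scaled.max()
    return T * (m + np.log(np.exp(scaled - m).sum()))
```

Intermediate temperatures interpolate between the two classical decoders, and raising T above one flattens the contribution of the dominant path, which is the regime the paper exploits under mismatch.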
Flooring the Observation Probability for Robust
ASR in Impulsive Noise
Pei Ding 1 , Bertram E. Shi 2 , Pascale Fung 2 , Zhigang
Cao 1 ; 1 Tsinghua University, China; 2 Hong Kong
University of Science & Technology, China
Impulsive noise usually introduces sudden mismatches between the
observation features and the acoustic models trained with clean
speech, which drastically degrades the performance of automatic
speech recognition (ASR) systems. This paper presents a novel
method to directly suppress the adverse effect of impulsive noise
on recognition. In this method, the observation vector is divided into several sub-vectors according to the noise sensitivity of each feature dimension, and each sub-vector is assigned a suitable flooring threshold. In the recognition stage, the observation probability of each feature sub-vector is floored at the Gaussian mixture level. The unreliable relative probability differences caused by impulsive noise are thus eliminated, and the correct state sequence regains priority in decoding. Experimental evaluations on the Aurora2 database show that the proposed method achieves average error rate reductions (ERR) of 61.62% and 84.32% in simulated impulsive noise and machine-gun noise environments, respectively, while maintaining high performance on clean speech.
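The flooring idea can be sketched as follows. This is a minimal illustration with hypothetical shapes and parameter names: the paper floors per sub-vector inside the state observation densities of an HMM, here reduced to a single diagonal-covariance GMM:

```python
import numpy as np

def floored_loglike(x, weights, means, variances, subvectors, floors):
    """Observation log-likelihood of a diagonal-covariance GMM with
    per-subvector flooring: an impulse-corrupted sub-vector cannot
    drive a component's score below its floor."""
    comp_scores = []
    for w, mu, var in zip(weights, means, variances):
        total = np.log(w)
        for idx, floor in zip(subvectors, floors):
            # Gaussian log-density of this sub-vector.
            ll = -0.5 * np.sum((x[idx] - mu[idx])**2 / var[idx]
                               + np.log(2 * np.pi * var[idx]))
            total += max(ll, floor)      # flooring per sub-vector
        comp_scores.append(total)
    # Log-sum-exp over mixture components.
    m = max(comp_scores)
    return m + np.log(sum(np.exp(s - m) for s in comp_scores))
```

Without the floor, a single outlier dimension dominates the score; with it, the relative ordering of states is preserved, which is what lets the correct state sequence win in decoding.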
62
Eurospeech 2003
Wednesday
September 1-4, 2003 – Geneva, Switzerland

Combination of Temporal Domain SVD Based
Speech Enhancement and GMM Based Speech
Estimation for ASR in Noise – Evaluation on the
AURORA2 Task –

Masakiyo Fujimoto, Yasuo Ariki; Ryukoku University, Japan

In this paper, we propose a noise robust speech recognition method combining temporal-domain singular value decomposition (SVD) based speech enhancement and Gaussian mixture model (GMM) based speech estimation. The bottleneck of the GMM-based approach is noise estimation, so we incorporate adaptive noise estimation into it. Furthermore, to obtain higher recognition accuracy, we employ temporal-domain SVD-based speech enhancement as a pre-processing module for the GMM-based approach, and introduce an adaptive over-subtraction factor into the SVD-based enhancement to reduce the influence of the noise contained in the noisy speech. A noise reduction method usually degrades the recognition rate because of spectral distortion caused by residual noise and over-estimation; to address this, acoustic model adaptation is performed by applying unsupervised MLLR to the distorted speech signal. In evaluations on the AURORA2 tasks, our method showed a relative improvement on the clean-condition training task.

Additive Noise and Channel Distortion-Robust
Parametrization Tool – Performance Evaluation on
Aurora 2 & 3

Petr Fousek, Petr Pollák; Czech Technical University in Prague, Czech Republic

In this paper an HTK-compatible robust speech parametrization tool, CtuCopy, is presented. The tool supports several additive noise suppression preprocessing techniques, nonlinear spectrum transformation, RASTA-like filtration, and direct final feature computation. It is general, easily extensible, and may also be used for speech enhancement. In the second part, parametrizations combining extended spectral subtraction for additive noise suppression and LDA RASTA-like filtration for channel distortion elimination, with final computation of PLP cepstral coefficients, are examined and evaluated on Aurora 2 & 3 and the Czech SpeechDat corpora. This comparison shows specific algorithm features and the differences in their behavior on the above-mentioned databases. PLP cepstral coefficients with both extended spectral subtraction and LDA RASTA-like filtration seem to be a good choice for noise-robust parametrization.

Noise Robust Speech Parameterization Based on
Joint Wavelet Packet Decomposition and
Autoregressive Modeling

Bojan Kotnik, Zdravko Kačič, Bogomir Horvat; University of Maribor, Slovenia

In this paper a noise robust feature extraction algorithm using joint wavelet packet decomposition (WPD) and autoregressive (AR) modeling of the speech signal is presented. In contrast to the short-time Fourier transform (STFT) based time-frequency signal representation, a computationally efficient WPD can better represent the non-stationary parts of the speech signal (consonants), while vowels are well described by an AR model, as in LPC analysis. The separately extracted WPD- and AR-based features are combined using a modified principal component analysis (PCA) and a voiced/unvoiced decision to produce the final output feature vector. Noise robustness is improved by the proposed wavelet-based denoising algorithm with a modified soft thresholding procedure and voice activity detection. Speech recognition results on the Aurora 3 databases show a performance improvement of 47.6% relative to the standard MFCC front-end.

Database Adaptation for ASR in
Cross-Environmental Conditions in the SPEECON
Project

Christophe Couvreur 1, Oren Gedge 2, Klaus Linhard 3, Shaunie Shammass 2, Johan Vantieghem 1; 1 ScanSoft Belgium, Belgium; 2 Natural Speech Communication, Israel; 3 DaimlerChrysler AG, Germany

As part of the SPEECON corpora collection project, a software toolbox has been developed for transforming speech recordings made in a quiet environment with a close-talk microphone into far-talk noisy recordings. The toolbox allows speech recognizers to be trained for new acoustic environments without requiring an extensive data collection effort. This communication complements a previous article in which the adaptation toolbox was described in detail and preliminary experimental results were presented. Detailed experimental results on a database specifically collected for testing purposes show the performance improvements that can be obtained with the database adaptation toolbox in various far-talk and noisy conditions. The Hebrew corpus collected for SPEECON is also used to assess how close a recognizer trained on simulated data can get to one trained on real far-talk noisy data.

Robust Feature Extraction and Acoustic Modeling
at Multitel: Experiments on the Aurora Databases

Stéphane Dupont, Christophe Ris; Multitel, Belgium

This paper summarizes some of the robust feature extraction and acoustic modeling technologies used at Multitel, together with their assessment on some of the ETSI Aurora reference tasks. Ongoing work and directions for further research are also presented.

For feature extraction (FE), we use PLP coefficients. Additive and convolutional noise are addressed using a cascade of spectral subtraction and temporal trajectory filtering. For acoustic modeling (AM), artificial neural networks (ANNs) are used to estimate the HMM state probabilities. At the junction of FE and AM, the multi-band structure provides a way to address the needs of robustness at both processing levels. Robust features within sub-bands can be extracted using a form of discriminant analysis; in this work, this is obtained using sub-band ANN acoustic models. The robust sub-band features are then used for the estimation of state probabilities.

These systems are evaluated on the Aurora tasks in comparison to the existing ETSI features. Our baseline system has performance similar to the ETSI advanced features coupled with the HTK back-end. On the Aurora 3 tasks, the multi-band system outperforms the best ETSI results with an average reduction of the word error rate of about 62% with respect to the baseline ETSI system and of about 18% with respect to the advanced ETSI system. This confirms previous positive experience with the multi-band architecture on other databases.

Autoregressive Modeling Based Feature Extraction
for Aurora3 DSR Task

Petr Motlíček, Jan Černocký; Brno University of Technology, Czech Republic

Techniques for speech analysis that use autoregressive (all-pole) modeling are presented here and compared to the widely used Mel-frequency cepstrum based feature extraction. We first focus on several applications of speech power spectrum modeling that increase the performance of an ASR system, mainly in the case of a large mismatch between training and testing data. Attention is then paid to the different types of features that can be extracted from an all-pole model to reduce the overall word error rate. The results show that the commonly used cepstral features, which can easily be extracted from an all-pole model, are not the most suitable parameters for ASR when the input speech is corrupted by different types of real noise. Very good recognition performance was achieved, e.g., with discrete or selective all-pole modeling based approaches, or with decorrelated line spectral frequencies. The feature extraction techniques were tested on the SpeechDat-Car databases used for front-end evaluation of advanced distributed speech recognition (DSR) systems.
Evaluation on the Aurora 2 Database of Acoustic
Models That Are Less Noise-Sensitive
Edmondo Trentin 1, Marco Matassoni 2, Marco Gori 1; 1 Università degli Studi di Siena, Italy; 2 ITC-irst, Italy
The Aurora 2 database may be used as a benchmark for evaluation
of algorithms under noisy conditions. In particular, the clean training/noisy test mode is aimed at evaluating models that are trained
on clean data only without further adjustments on the noisy data,
i.e. under severe mismatch between the training and test conditions. While several researchers have proposed techniques at the front-end level to improve recognition performance over the reference hidden Markov model (HMM) baseline, investigations at the back-end level are sought. In this respect, the goal is to develop acoustic models that are intrinsically less noise-sensitive. This paper presents the word accuracy yielded by a non-parametric HMM with connectionist estimates of the emission probabilities, i.e. a neural network is applied instead of the usual parametric (Gaussian mixture) probability densities. A regularization technique, relying on a maximum-likelihood parameter grouping algorithm, is explicitly introduced to increase the generalization capability of the model and, in turn, its noise robustness. Results show that a 15.43% relative word error rate reduction w.r.t. the Gaussian mixture HMM is obtained by averaging over the different noises and SNRs of Aurora 2 test set A.
Revisiting Scenarios and Methods for Variable
Frame Rate Analysis in Automatic Speech
Recognition
J. Macías-Guarasa, J. Ordóñez, J.M. Montero, J.
Ferreiros, R. Córdoba, L.F. D’Haro; Universidad
Politécnica de Madrid, Spain
In this paper we present a revision and evaluation of some of the
main methods used in variable frame rate (VFR) analysis, applied to
speech recognition systems. The work found in the literature in this
area usually deals with restricted conditions and scenarios and we
have revisited the main algorithmic alternatives and evaluated them
under the same experimental framework, so that we have been able
to establish objective considerations for each of them, selecting the
most adequate strategy.
We also show to what extent VFR analysis is useful in its three main application scenarios, namely reducing computational load, improving acoustic modelling, and handling additive noise conditions in the time domain. From our evaluation on a difficult large vocabulary telephone task, we establish that VFR analysis does not significantly improve on the results obtained with traditional fixed frame rate (FFR) analysis, except when additive noise is present in the database, and especially at low SNRs.
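A common realization of variable frame rate analysis keeps a frame only when it differs enough from the last retained frame. The sketch below is a generic version of that idea under an assumed Euclidean distance criterion, not a specific algorithm from the paper:

```python
import numpy as np

def vfr_select(features, threshold):
    """Variable frame rate selection: retain frame t only when its
    distance from the last retained frame exceeds a threshold, so
    steady regions are decimated and transients are kept."""
    kept = [0]                       # always keep the first frame
    for t in range(1, len(features)):
        if np.linalg.norm(features[t] - features[kept[-1]]) > threshold:
            kept.append(t)
    return kept
```

The threshold directly controls the computational-load/accuracy trade-off discussed in the abstract: a higher threshold discards more frames in stationary segments.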
Multitask Learning in Connectionist Robust ASR
Using Recurrent Neural Networks

Shahla Parveen, Phil Green; University of Sheffield, U.K.

The use of prior knowledge in machine learning techniques has been shown to give better generalisation performance on unseen data. However, this idea has not so far been investigated for robust ASR. Training several related tasks simultaneously is known as multitask learning (MTL): the extra tasks effectively incorporate prior knowledge. In this work we present an application of MTL to robust ASR. We have used an RNN architecture to integrate classification and enhancement of noisy speech in an MTL framework, with enhancement as an extra task used to obtain higher recognition performance on unseen data. We report our results on an isolated word recognition task. The reduction in error rate relative to multi-condition training with HMMs for subway, babble, car and exhibition noises was 53.37%, 21.99%, 37.01% and 44.13%, respectively.

Confusion Matrix Based Entropy Correction in
Multi-Stream Combination

Hemant Misra, Andrew Morris; IDIAP, Switzerland

An MLP classifier outputs a posterior probability for each class. With noisy data, classification becomes less certain, and the entropy of the posterior distribution tends to increase, providing a measure of classification confidence. However, at high noise levels, entropy can give a misleading indication of classification certainty: very noisy data vectors may be classified systematically into classes which happen to be most noise-like, and the resulting confusion matrix shows a dense column for each noise-like class. In this article we show how this pattern of misclassification in the confusion matrix can be used to derive a linear correction to the MLP posterior estimates. We test the ability of this correction to reduce the problem of misleading confidence estimates and to enhance the performance of the entropy-based full-combination multi-stream approach. Better word error rates are achieved for the Numbers95 database at different levels of added noise. The correction performs significantly better at high SNRs.

Session: PWeBg– Poster
Speech Recognition - Large Vocabulary I
Time: Wednesday 10.00, Venue: Main Hall, Level -1
Chair: Alex Acero, Microsoft Research, USA

Large Vocabulary ASR for Spontaneous Czech in
the MALACH Project

Josef Psutka 1, Pavel Ircing 1, J.V. Psutka 1, Vlasta Radová 1, William J. Byrne 2, Jan Hajič 3, Jirí Mírovsky 3, Samuel Gustman 4; 1 University of West Bohemia in Pilsen, Czech Republic; 2 Johns Hopkins University, USA; 3 Charles University, Czech Republic; 4 Survivors of the Shoah Visual History Foundation, USA

This paper describes LVCSR research into the automatic transcription of spontaneous Czech speech in the MALACH (Multilingual Access to Large Spoken Archives) project. This project attempts to provide improved access to the large multilingual spoken archives collected by the Survivors of the Shoah Visual History Foundation (VHF) (www.vhf.org) by advancing the state of the art in automated speech recognition. We describe a baseline ASR system and discuss the problems in language modeling that arise from the nature of Czech as a highly inflectional language that also exhibits diglossia between its written and spontaneous forms. The difficulties of this task are compounded by heavily accented, emotional and disfluent speech, along with frequent switching between languages. To overcome the limited amount of relevant language model data, we use statistical techniques for selecting an appropriate training corpus from a large unstructured text collection, resulting in significant reductions in word error rate.

Active and Unsupervised Learning for Automatic
Speech Recognition

Giuseppe Riccardi, Dilek Z. Hakkani-Tür; AT&T Labs-Research, USA

State-of-the-art speech recognition systems are trained using human transcriptions of speech utterances. In this paper, we describe a method to combine active and unsupervised learning for automatic speech recognition (ASR). The goal is to minimize the human supervision needed for training acoustic and language models and to maximize the performance given the transcribed and untranscribed data. Active learning aims at reducing the number of training examples to be labeled by automatically processing the unlabeled examples and then selecting the most informative ones with respect to a given cost function. For unsupervised learning, we utilize the remaining untranscribed data by using their ASR output and word confidence scores. Our experiments show that the amount of labeled data needed for a given word accuracy can be reduced by 75% by combining active and unsupervised learning.
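The entropy-based confidence measure used in the Misra and Morris multi-stream abstract above can be sketched as follows (an illustrative computation of the raw measure only, not of their confusion-matrix correction):

```python
import numpy as np

def posterior_entropy(p):
    """Entropy (in bits) of an MLP posterior vector, usable as an
    inverse confidence measure for stream weighting: low entropy
    means a peaked, confident posterior."""
    p = np.asarray(p, dtype=float)
    nz = p[p > 0]                    # 0 * log 0 is taken as 0
    return -np.sum(nz * np.log2(nz))
```

A uniform posterior over K classes gives the maximum log2(K) bits, a one-hot posterior gives zero; the paper's point is that at high noise levels a low entropy can still be misleading, motivating the linear correction.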
Perceptual MVDR-Based Cepstral Coefficients
(PMCCs) for High Accuracy Speech Recognition
Umit H. Yapanel 1, Satya Dharanipragada 2, John H.L. Hansen 1; 1 University of Colorado at Boulder, USA; 2 IBM T.J. Watson Research Center, USA
This paper describes an accurate feature representation for continuous clean speech recognition. The main components of the technique involve performing a moderate order Linear Predictive (LP) analysis and computing the Minimum Variance Distortionless Response (MVDR) spectrum from these LP coefficients. This feature
representation, PMCCs, was earlier shown to yield superior performance over MFCCs for different noise conditions with emphasis on
car noise [1]. The performance improvement was then attributed
to better spectrum and envelope modeling properties of the MVDR
methodology. This study shows that the representation is also quite
efficient for clean speech recognition. In fact, PMCCs are shown
to be a more accurate envelope representation and reduce speaker
variability. This, in turn, yields a 12.8% relative word error rate
(WER) reduction on the combination of Wall Street Journal (WSJ)
Nov’92 dev/eval sets with respect to the MFCCs. Accurate envelope modeling and reduction in the speaker variability also lead to
faster decoding, based on efficient pruning in the search stage. The
total gain in the decoding speed is 22.4%, relative to the standard
MFCC features. It is also shown that PMCCs are not very demanding
in terms of computation when compared to MFCCs. We therefore conclude that the PMCC feature extraction scheme is a better representation of both clean and noisy speech than the MFCC scheme.
A Discriminative Decision Tree Learning Approach
to Acoustic Modeling
Sheng Gao 1 , Chin-Hui Lee 2 ; 1 Institute for Infocomm
Research, Singapore; 2 Georgia Institute of
Technology, USA
The decision tree is a popular method for tying the states of a set of context-dependent phone HMMs for efficient and effective training of large acoustic models, and a likelihood-based impurity function is commonly adopted. It is well known, however, that maximizing likelihood does not result in maximal separation between the distributions in the leaves of the tree. To improve robustness, a discriminative decision tree learning approach is proposed. It embeds the MCE-GPD formulation in the definition of the impurity function so that discriminative information is taken into account while optimizing the tree. We compare the proposed approach with conventional tree building using a Mandarin syllable recognition task. Our preliminary results show that the separation between the divided subspaces in the tree nodes is clearly enhanced, although there is a slight performance reduction.
Large Corpus Experiments for Broadcast News
Recognition
Patrick Nguyen, Luca Rigazio, Jean-Claude Junqua;
Panasonic Speech Technology Laboratory, USA
This paper investigates the use of a large corpus for the training of a
Broadcast News speech recognizer. A vast body of speech recognition algorithms and mathematical machinery is aimed at smoothing
estimates toward accurate modeling with scant amounts of data. In
most cases, this research is motivated by a real need for more data.
In Broadcast News, however, a large corpus is already available to
all LDC members. Until recently, it has not been considered for
acoustic training.
We would like to pioneer the use of the largest speech corpus
(1200h) available for the purpose of acoustic training of speech
recognition systems. To the best of our knowledge it is the largest
scale acoustic training ever considered in speech recognition.
We obtain a performance improvement of 1.5% absolute WER over
our best standard (200h) training.
Performance Evaluation of Phonotactic and
Contextual Onset-Rhyme Models for Speech
Recognition of Thai Language
Somchai Jitapunkul, Ekkarit Maneenoi, Visarut
Ahkuputra, Sudaporn Luksaneeyanawin;
Chulalongkorn University, Thailand
This paper proposes two onset-rhyme acoustic models for speech recognition: the Phonotactic Onset-Rhyme Model (PORM) and the Contextual Onset-Rhyme Model (CORM). The models comprise a pair of onset and rhyme units that together make up a syllable. An onset comprises an initial consonant and its transition towards the following vowel; the rhyme consists of the steady vowel portion and the final consonant. Experiments have been carried out to find the acoustic model that accurately captures Thai sounds and gives higher accuracy. Experimental results show that both PORM and CORM outperform the triphone, and that PORM achieves 2.74% higher syllable accuracy than CORM. Moreover, the onset-rhyme models are also more efficient than triphone models in terms of system complexity.
Overlapped Di-Tone Modeling for Tone
Recognition in Continuous Cantonese Speech
Yao Qian, Tan Lee, Yujia Li; Chinese University of
Hong Kong, China
This paper presents a novel approach to tone recognition in continuous Cantonese speech based on overlapped di-tone Gaussian mixture models (ODGMM). The ODGMM is designed with special consideration on the fact that Cantonese tone identification relies more
on the relative pitch level than on the pitch contour. A di-tone unit
covers a group of two consecutive tone occurrences. The tone sequence carried by a Cantonese utterance can be considered as the
connection of such di-tone units. Adjacent di-tone units overlap
with each other by exactly one tone. For each di-tone unit, a GMM is
trained with a 10-dimensional feature vector that characterizes the
F0 movement within the unit. In particular, the di-tone models capture the relative deviation between the F0 levels of the two tones.
The Viterbi decoding algorithm is adopted to search for the optimal tone sequence under the phonological constraints on syllable-tone combination. Experimental results show that the ODGMM approach significantly outperforms previously proposed methods for tone recognition in continuous Cantonese speech.
Speaker Model Selection Using Bayesian
Information Criterion for Speaker Indexing and
Speaker Adaptation
Masafumi Nishida 1 , Tatsuya Kawahara 2 ; 1 Japan
Science and Technology Corporation, Japan; 2 Kyoto
University, Japan
This paper addresses unsupervised speaker indexing for discussion
audio archives. We propose a flexible framework that selects an optimal speaker model (GMM or VQ) based on the Bayesian Information Criterion (BIC) according to input utterances. The framework
makes it possible to use a discrete model when the data is sparse,
and to seamlessly switch to a continuous model after a large cluster is obtained. The speaker indexing is also applied to automatic speech recognition of discussions, and evaluated by adapting a speaker-independent acoustic model to each participant. It is demonstrated that indexing with our method is sufficiently accurate for speaker adaptation.
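The BIC criterion underlying the model selection can be sketched in its generic form (the exact penalty weight used by the authors is not given in the abstract, so the `penalty` parameter here is a hypothetical knob):

```python
import numpy as np

def bic(log_likelihood, n_params, n_frames, penalty=1.0):
    """Bayesian Information Criterion for choosing between speaker
    models of different complexity: data log-likelihood minus a
    penalty growing with the number of free parameters and the
    log of the amount of data.  Higher BIC = preferred model."""
    return log_likelihood - 0.5 * penalty * n_params * np.log(n_frames)
```

With equal fit, the smaller model wins; as a cluster accumulates data, the log-likelihood gain of a richer continuous model eventually outweighs its penalty, which is the switching behavior described in the abstract.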
Automatic Transcription of Football Commentaries
in the MUMIS Project
Janienke Sturm 1 , Judith M. Kessens 1 , Mirjam
Wester 2 , Febe de Wet 1 , Eric Sanders 1 , Helmer Strik 1 ;
1
University of Nijmegen, The Netherlands; 2 University
of Edinburgh, U.K.
This paper describes experiments carried out to automatically transcribe football commentaries in Dutch, English and German for multimedia indexing. Our results show that the high levels of stadium
noise in the material create a task that is extremely difficult for
conventional ASR. The baseline WERs vary from 83% to 94% for
the three languages investigated. Employing state-of-the-art noise
robustness techniques leads to relative reductions of 9-10% WER.
Application specific words such as players’ names are recognized
correctly in about 50% of cases. Although this result is substantially better than the overall result, it is inadequate. Much better
results can be obtained if the football commentaries are recorded
separately from the stadium noise. This would make the automatic
transcriptions more useful for multimedia indexing.
On the Limits of Cluster-Based Acoustic Modeling
S. Douglas Peters; Nuance Communications, Canada
This article reports a two-part study of structured acoustic modeling of speech. First, speaker-independent clustering of speech material was used as the basis for practical cluster-based acoustic modeling. Each cluster's training material is applied to the adaptation of baseline hidden Markov model (HMM) parameters for recognition purposes, and is also used to train phone-level Gaussian mixture models (GMMs) for cluster identification. Test utterances are evaluated on all such models to identify an appropriate cluster or cluster combination. Experiments demonstrate that such cluster-based adaptation can yield accuracy gains over computationally similar baseline models. At the same time, these gains and those of similar methods in the literature are modest. Hence, the second part of our study examined the limitations of the approach by considering utterance consistency: that is, the ability of acoustically derived cluster models to uniquely identify a single utterance. These second experiments show that arbitrary pieces of a given utterance are likely to be identified with different clusters, contradicting an implicit assumption of cluster-based acoustic modeling.
Large Vocabulary Taiwanese (Min-Nan) Speech
Recognition Using Tone Features and Statistical
Pronunciation Modeling

Dau-Cheng Lyu 1, Min-Siong Liang 1, Yuang-Chin Chiang 2, Chun-Nan Hsu 3, Ren-Yuan Lyu 1; 1 Chang Gung University, Taiwan; 2 National Tsing Hua University, Taiwan; 3 Academia Sinica, Taiwan

A large vocabulary Taiwanese (Min-nan) speech recognition system is described in this paper. Because of the severe multiple-pronunciation phenomenon in Taiwanese, partly caused by tone sandhi, a statistical pronunciation modeling technique based on tonal features is used. The system is speaker independent and was trained on a bilingual Mandarin/Taiwanese speech corpus to alleviate the lack of a pure Taiwanese speech corpus. The search network is constructed from nodes of Chinese characters and directly outputs the Chinese character string. Experiments show that with the approaches proposed in this paper, the character error rate decreases significantly from 21.50% to 11.97%.

Fitting Class-Based Language Models into Weighted
Finite-State Transducer Framework

Pavel Ircing, Josef Psutka; University of West Bohemia in Pilsen, Czech Republic

In our paper we propose a general way of incorporating class-based language models with many-to-many word-to-class mapping into the weighted finite-state transducer (FST) framework. Since class-based models alone usually do not improve recognition accuracy, we also present a method for efficient language model combination. An example of a word-to-class mapping based on morphological tags is also given. Several word-based and tag-based language models are tested on the task of transcribing Czech broadcast news. Results show that class-based models help to achieve a moderate improvement in recognition accuracy.
A New Spectral Transformation for Speaker
Normalization
Pierre L. Dognin, Amro El-Jaroudi; University of
Pittsburgh, USA
This paper proposes a new spectral transformation for speaker normalization. We use the Bilinear Transformation (BLT) to introduce a new frequency warping resulting from mapping a prototype band-pass (BP) filter to a general BP filter. This new transformation, called the "Band-Pass Transform" (BPT), offers two degrees of freedom, enabling complex warpings of the frequency axis that differ from previous work with the BLT. A procedure based on the Nelder-Mead algorithm is proposed to estimate the BPT parameters. Our experimental results include a detailed study of the performance of the BPT compared to other VTLN methods for a subset of speakers, as well as results on large test sets. The BPT performs better than other VTLN methods and offers a gain of 1.13% absolute on the Hub-5 English Eval01 set.
Enhanced Tree Clustering with Single
Pronunciation Dictionary for Conversational
Speech Recognition
Hua Yu, Tanja Schultz; Carnegie Mellon University,
USA
Modeling pronunciation variation is key for recognizing conversational speech. Rather than being limited to dictionary modeling, we
argue that triphone clustering is an integral part of pronunciation
modeling. We propose a new approach called enhanced tree clustering. This approach, in contrast to traditional decision tree based
state tying, allows parameter sharing across phonemes. We show
that accurate pronunciation modeling can be achieved through efficient parameter sharing in the acoustic model. Combined with
In recent years there has been a considerable amount of work devoted to porting speech recognizers to new tasks. Recognition systems are usually tuned to a particular task and porting the system to a new task (or language) is both time-consuming and expensive. In this paper, issues in speech recognition portability
are addressed and in particular the development of generic models for speech recognition. Multi-source training techniques aimed
at enhancing the genericity of some wide domain models are investigated. We show that multi-source training and adaptation can
reduce the performance gap between task-independent and taskdependent acoustic models, and for some tasks even out-perform
task-dependent acoustic models.
Ces dernières années, des efforts considérables ont été faits pour
faciliter le transfert des systèmes de reconnaissance de la parole
vers de nouvelles tâches. Les systèmes sont généralement optimisés sur une tâche particulière et leur transfert vers une nouvelle
tâche est fastidieux et très coûteux en temps. Dans ce papier, nous
nous intéresserons au problème du transfert des systèmes de reconnaissance, en particuliers au travers du developpement de modèles
génériques pour la reconnaissance de la parole.
Des techniques d’apprentissage multi-source visant à augmenter
le niveau de généricite de modèles à large domaine sont étudiées.
Nous montrons que l’apprentissage et l’adaptation multi-sources
peuvent permettre de réduire l’écart de performance entre des modèles indépendants et dépendants de la tâche, et même pour certaines tâches de dépasser les performances des modèles dépendants de la tâche.
Toward Domain-Independent Conversational
Speech Recognition
Brian Kingsbury, Lidia Mangu, George Saon, Geoffrey
Zweig, Scott Axelrod, Vaibhava Goel, Karthik
Visweswariah, Michael Picheny; IBM T.J. Watson
Research Center, USA
We describe a multi-domain, conversational test set developed for
IBM’s Superhuman speech recognition project and our 2002 benchmark system for this task. Through the use of multi-pass decoding,
unsupervised adaptation and combination of hypotheses from systems using diverse feature sets and acoustic models, we achieve a
word error rate of 32.0% on data drawn from voicemail messages,
two-person conversations and multiple-person meetings.
Comparative Study of Boosting and Non-Boosting
Training for Constructing Ensembles of Acoustic
Models
Rong Zhang, Alexander I. Rudnicky; Carnegie Mellon
University, USA
This paper compares the performance of Boosting and non-Boosting training algorithms in large vocabulary continuous speech
recognition (LVCSR) using ensembles of acoustic models. Both algorithms demonstrated significant word error rate reduction on the
CMU Communicator corpus. However, both algorithms produced
comparable improvements, even though one would expect that the
Boosting algorithm, which has a solid theoretic foundation, should
work much better than the non-Boosting algorithm. Several voting schemes for hypothesis combining were evaluated, including
weighted voting, un-weighted voting and ROVER.
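As a toy illustration of the voting schemes mentioned above (not the paper's implementation; the function and its inputs are hypothetical), utterance-level weighted voting over an ensemble's hypotheses can be sketched as:

```python
from collections import defaultdict

def weighted_vote(hypotheses, weights=None):
    """Combine one hypothesis string per ensemble member by weighted voting.

    hypotheses: list of hypothesis strings, one per acoustic model.
    weights: optional per-model weights; un-weighted voting uses 1.0 each.
    """
    if weights is None:
        weights = [1.0] * len(hypotheses)
    tally = defaultdict(float)
    for hyp, w in zip(hypotheses, weights):
        tally[hyp] += w
    # The hypothesis with the highest accumulated weight wins.
    return max(tally, key=tally.get)
```

Un-weighted voting is the special case where all weights are equal; ROVER differs in that it first aligns the hypotheses word by word and votes per word rather than per utterance.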
Session: PWeBh– Poster
Spoken Dialog Systems II
Time: Wednesday 10.00, Venue: Main Hall, Level -1
Chair: Paul Heisterkamp, DaimlerChrysler, Germany
A Study on Domain Recognition of Spoken
Dialogue Systems
T. Isobe, S. Hayakawa, H. Murao, T. Mizutani, Kazuya
Takeda, Fumitada Itakura; Nagoya University, Japan
In this paper, we present a multi-domain spoken dialogue system
equipped with the capability of parallel computation of speech-recognition engines, one assigned to each domain. The experimental system is set up to handle three different domains (restaurant information, weather report, and news query) in an in-car setting. All of these tasks are of an information-retrieval nature. The
domain of a particular utterance is determined based on the likelihood of each speech recognizer. In addition to the human-machine
interaction, synthesized voice of the route sub-system interrupts
the dialogue frequently. Experimental evaluation has yielded 95
percent recognition accuracy in selecting the task domain based on
a specially designed scoring method.
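The likelihood-based domain selection described above can be sketched as follows (illustrative only; the recognizer stand-ins, score scale, and use of threads are assumptions, not details of the authors' system):

```python
from concurrent.futures import ThreadPoolExecutor

def recognize_all(utterance, recognizers):
    """Run one speech recognizer per domain in parallel.

    recognizers: dict mapping domain name -> callable that returns
    (hypothesis, log_likelihood) for the utterance.
    """
    with ThreadPoolExecutor() as pool:
        futures = {d: pool.submit(rec, utterance)
                   for d, rec in recognizers.items()}
        results = {d: f.result() for d, f in futures.items()}
    # The utterance is assigned to the domain whose dedicated
    # recognizer produced the highest likelihood.
    best = max(results, key=lambda d: results[d][1])
    return best, results[best][0]
```

In practice the likelihoods would need to be comparable across recognizers (e.g. normalized per frame), which is where a specially designed scoring method comes in.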
Domain Adaptation Augmented by
State-Dependence in Spoken Dialog Systems
Wei He, Honglian Li, Baozong Yuan; Northern
Jiaotong University, China
In the development of spoken dialog systems, domain adaptation
and dialog state-dependent language model are usually researched
separately. This paper proposes a new approach for domain adaptation augmented by dialog state-dependence: a dialog-turn-based cache model that decays synchronously with changes in the dialog state. With this approach it is simpler and faster to adapt a Chinese spoken dialog system to a new task. Two different tasks, train ticket reservation and park guidance, are selected as target tasks in the experiments. Consistent reductions in perplexity and character error rate are observed during the adaptation.
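A dialog-turn-based cache model that decays with the dialog state, as described, might be sketched like this (a minimal unigram stand-in; the class name and decay factor are assumptions, not the authors' model):

```python
from collections import defaultdict

class DecayingCacheLM:
    """Cache language model whose per-turn counts decay as the
    dialog state advances (decay factor gamma per turn)."""

    def __init__(self, gamma=0.7):
        self.gamma = gamma
        self.counts = defaultdict(float)

    def new_turn(self, words):
        # Decay everything seen so far, then add this turn's words.
        for w in self.counts:
            self.counts[w] *= self.gamma
        for w in words:
            self.counts[w] += 1.0

    def prob(self, word):
        total = sum(self.counts.values())
        return self.counts[word] / total if total else 0.0
```

Interpolating `prob()` with a static n-gram model would give the adapted language model; words from recent turns dominate because older counts shrink geometrically.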
SmartKom-Home – An Advanced Multi-Modal
Interface to Home Entertainment
Thomas Portele 1 , Silke Goronzy 2 , Martin Emele 2 ,
Andreas Kellner 1 , Sunna Torge 2 , Jürgen te Vrugt 1 ;
1 Philips Research Aachen, Germany; 2 Sony
International (Europe) GmbH, Germany
This paper describes the SmartKom-Home system realized within
the SmartKom project. It assists the user by means of a multi-modal
dialogue system in the home environment. This involves the control of various devices and the access to services. SmartKom-Home
is supposed to serve as a uniform interface to all these devices and
services so the user is freed from the necessity to understand which
of the devices to consult how and when to fulfill complex wishes.
We describe the setting of this scenario together with the hardware
used. We furthermore discuss the specific requirements that evolve
in a home environment, and how they are handled in the project.
September 1-4, 2003 – Geneva, Switzerland
Methods to Improve Its Portability of A Spoken
Dialog System Both on Task Domains and
Languages
Yunbiao Xu 1 , Fengying Di 1 , Masahiro Araki 2 ,
Yasuhisa Niimi 2 ; 1 Hangzhou University of Commerce,
China; 2 Kyoto Institute of Technology, Japan
This paper presents methods to improve the portability of a spoken dialog system across both task domains and languages; they have been implemented in Chinese and Japanese for sightseeing and accommodation-seeking guidance tasks. These methods include case
frame conversion, template-based text generation and topic frame
driven dialog control scheme. The former two methods are for improving the portability across languages, and the last one is for improving the portability across domains. The case frame conversion
is used for translating a source language case frame into a pivot language one. The template-based text generation is used for generating text responses in a particular language from abstract responses.
The topic frame driven dialog control scheme makes it possible to
manage mixed-initiative dialog based on a set of task-dependent
topic frames. The experiments showed that the proposed methods
could be used to improve the portability of a dialog system across
domains and languages.
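Template-based generation of the kind described, rendering an abstract response in a particular language, can be illustrated as follows (the templates, slot names, and language codes are hypothetical):

```python
# Hypothetical templates keyed by (language, abstract response type).
TEMPLATES = {
    ("en", "hotel_found"): "There is a hotel called {name} in {area}.",
    ("zh", "hotel_found"): "{area}有一家叫{name}的旅馆。",
}

def generate(language, response_type, slots):
    """Render an abstract response (type plus slot fillers) as text
    in the requested language."""
    return TEMPLATES[(language, response_type)].format(**slots)
```

Porting to a new language then only requires a new set of templates, not changes to the dialog logic.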
Voxenter™ – Intelligent Voice Enabled Call Center
for Hungarian
Tibor Fegyó 1 , Péter Mihajlik 1 , Máté Szarvas 2 , Péter
Tatai 1 , Gábor Tatai 3 ; 1 Budapest University of
Technology and Economics, Hungary; 2 Tokyo Institute
of Technology, Japan; 3 AITIA Inc., Hungary
In this article we present a voice enabled call center which integrates
our basic and applied research results on Hungarian speech recognition. Telephone interfaces, data storage and retrieval modules,
and an intelligent dialog descriptor and manager module are also
parts of the system. To evaluate the efficiency of the recognition
and the dialog, a voice enabled call center was implemented and
tested under real life conditions. This article describes the main
modules of the system and compares the result of the field tests
with that of the laboratory testing.
Automatic Call-Routing Without Transcriptions
Qiang Huang, Stephen J. Cox; University of East
Anglia, U.K.
Call-routing is now an established technology to automate customers’ telephone queries. However, transcribing calls for training
purposes for a particular application requires considerable human
effort, and it would be preferable for the system to learn routes
without transcriptions being provided. This paper introduces a
technique for fully automatic routing. It is based on firstly identifying salient acoustic morphemes in a phonetic decoding of the
input speech, followed by Linear Discriminant Analysis (LDA) to
improve classification. Experimental results on an 18 route retail
store enquiry point task using this technique are compared with results obtained using word-level transcriptions.
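One simple stand-in for detecting "salient" phone sequences is a frequency-and-purity heuristic (a toy substitute for the paper's salience measure; the LDA step is omitted and all names here are illustrative):

```python
from collections import Counter, defaultdict

def salient_morphemes(calls, min_count=3, min_purity=0.8):
    """Find phone n-grams ('acoustic morphemes') strongly associated
    with a single route.

    calls: list of (phone_ngrams, route) pairs, one per call.
    A morpheme is salient if it occurs often enough and most of its
    occurrences come from one route (purity threshold).
    """
    total = Counter()
    per_route = defaultdict(Counter)
    for ngrams, route in calls:
        for g in set(ngrams):
            total[g] += 1
            per_route[route][g] += 1
    salient = {}
    for g, n in total.items():
        if n < min_count:
            continue
        route, top = max(((r, c[g]) for r, c in per_route.items()),
                         key=lambda rc: rc[1])
        if top / n >= min_purity:
            salient[g] = route
    return salient
```

Morphemes that survive this filter would then feed a classifier (LDA in the paper) to decide the route of a new call.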
Jaspis2 – An Architecture for Supporting
Distributed Spoken Dialogues
Markku Turunen, Jaakko Hakulinen; University of
Tampere, Finland
In this paper, we introduce an architecture for a new generation of
speech applications. The presented architecture is based on our
previous work with multilingual speech applications and extends
it by introducing support for synchronized distributed dialogues,
which is needed in emerging application areas, such as mobile and
ubiquitous computing. The architecture supports coordinated distribution of dialogues, concurrent dialogues, system level adaptation and shared system context. The overall idea is to use interaction agents to distribute dialogues, use an evaluation mechanism
to make them dynamically adaptive and synchronize them by using a coordination mechanism with triggers and transactions. We
present experiences from several applications written on top of the
freely available architecture.
67
Eurospeech 2003
Wednesday
Development of a Bilingual Spoken Dialog System
for Weather Information Retrieval
Janez Žibert 1 , Sanda Martinčić-Ipšić 2 , Melita
Hajdinjak 1 , Ivo Ipšić 2 , France Mihelič 1 ; 1 University of
Ljubljana, Slovenia; 2 University of Rijeka, Croatia
In this paper we present a strategy, current activities and results of a
joint project in designing a spoken dialog system for Slovenian and
Croatian weather information retrieval. We give a brief description
of the system design, of the procedures we have performed in order
to obtain domain specific speech databases, monolingual and bilingual speech recognition experiments and WOZ simulation experiments. Recognition results for Croatian and Slovenian speech are
presented, as well as bilingual speech recognition results when using common acoustic models. We propose two different approaches
to the language identification problem and show recognition results
for both of these acoustically similar languages. Results of dialog simulations, performed in order to study user behavior when accessing a spoken dialog system, are also presented.
Improving “How May I Help You?” Systems Using
the Output of Recognition Lattices
James Allen, David Attwater, Peter Durston, Mark
Farrell; BTexact Technologies, U.K.

In “How may I help you?” systems, routing a caller to a call centre to one of a set of destinations using machine recognition of spontaneous natural language is a difficult task. Previous BT “How May I Help You” work [1,2] used top-1 recognition results for classification, with much better results when tested on human transcriptions. Classifying using a recognition lattice was found to reduce the gap between results on transcriptions and recognition output. Using features generated from the lattice in addition to the top-1 recognition result gave an improvement in classification of 4% absolute over a baseline system using only the top-1 recognition result. This reduced the gap between classification performance on recognition and transcription by over 25%.

Incremental Learning of New User Formulations in
Automatic Directory Assistance
M. Andorno 1, L. Fissore 2, P. Laface 1, M. Nigra 2, C. Popovici 2, F. Ravera 2, C. Vair 2; 1 Politecnico di Torino, Italy; 2 Loquendo, Italy

Directory Assistance for business listings is a challenging task: one of its main problems is that customers formulate their requests for the same listing with great variability. Since it is difficult to reliably predict a priori the user formulations, we have proposed a procedure for detecting, from field data, user formulations that were not foreseen by the designers. These formulations can be added, as variants, to the denominations already included in the system to reduce its failures.

In this work, we propose an incremental procedure that is able to filter a huge amount of calls routed to the operators, collected every month, and to detect a limited number of phonetic strings that can be included as new formulation variants in the system vocabulary. The results of our experiments, tested on 9 months of calls that the system was unable to serve automatically, show that the incremental procedure, using only the additional data collected every month, is able to stay close to the (upper bound) performance of the non-incremental one, and offers the possibility of periodically updating the system's formulation variants for every city.

Dialog Systems for Automotive Environments
Julie A. Baca, Feng Zheng, Hualin Gao, Joseph Picone;
Mississippi State University, USA

The Center for Advanced Vehicular Systems (CAVS), located at Mississippi State University (MSU), is collaborating with regional automotive manufacturers such as Nissan to advance telematics research. This paper describes work resulting from a research initiative to investigate the use of dialog systems in automotive environments, which includes in-vehicle driver as well as automotive manufacturing environments. We present recent results of an effort to develop an in-vehicle dialog prototype, preliminary to building a dialog system to assist in workforce training in automotive manufacturing. The overall system design is presented with focus on development of the semantic information needed by the natural language and dialog management modules. We describe data collection and analysis through which the information was derived. Through this process we reduced the parsing error rate by over 20% and system understanding errors to 3%.

The Development of a Multi-Purpose Spoken
Dialogue System
João P. Neto, Nuno J. Mamede, Renato Cassaca, Luís
C. Oliveira; INESC-ID/IST, Portugal

In this paper we describe a multi-purpose Spoken Dialogue System platform associated with two distinct applications: an intelligent home environment and remote access to information databases. These applications differ substantially in content and possible uses, but give us the chance to develop a platform in which we were able to represent diverse services accessible through a spoken interface. The implemented voice input/output possibilities and the level of service independence open a wide range of possibilities for the development of new applications using the current components of our Spoken Dialogue System.

The Dynamic, Multi-lingual Lexicon in SmartKom
Silke Goronzy, Zica Valsan, Martin Emele, Juergen
Schimanowski; Sony International (Europe) GmbH,
Germany

This paper describes the dynamic, multi-lingual lexicon that was developed in the SmartKom project. SmartKom is a multimodal dialogue system that is supposed to assist the user in many applications which are characterised by their highly dynamic contents. Because of this dynamic nature, the various modules of the dialogue system, ranging from speech recognition through analysis to synthesis, need one common knowledge source that takes care of the dynamic vocabularies that need to be processed. This central knowledge source is the lexicon. It is able to dynamically add and remove new words and generate the pronunciations for these words. We also describe the class-based language model (LM) that is used in SmartKom and that is closely coupled with the lexicon. Evaluation results for this LM are also given. Furthermore, we describe our approach to dynamically generating pronunciations and give experimental results for the different classifiers we trained for this task.
Evaluating Discourse Understanding in Spoken
Dialogue Systems
Ryuichiro Higashinaka, Noboru Miyazaki, Mikio
Nakano, Kiyoaki Aikawa; NTT Corporation, Japan
This paper describes a method for creating an evaluation measure
for discourse understanding in spoken dialogue systems. Discourse
understanding means utterance understanding taking the context
into account. Since the measure needs to be determined based on its
correlation with the system’s performance, conventional measures,
such as the concept error rate, cannot be easily applied. Using the
multiple linear regression analysis, we have previously shown that
the weighted sum of various metrics concerning dialogue states can
be used for the evaluation of discourse understanding in a single
domain. This paper reports the progress of our work: verification
of our approach by additional experiments in another domain. The
support vector regression method performs better than the multiple linear regression method in creating the measure, indicating
non-linearity in mapping the metrics to the system’s performance.
The results give strong support for our approach and hint at its
suitability as a universal evaluation measure for discourse understanding.
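The weighted sum of metrics fitted by regression, as described above, can be reproduced with a plain least-squares solver (a generic sketch using made-up data, not the authors' metrics or toolchain):

```python
def fit_linear(X, y):
    """Least-squares fit of y ≈ w·x + b via the normal equations,
    solved with Gaussian elimination on the small system."""
    # Augment each row with a 1 for the intercept.
    A = [row + [1.0] for row in X]
    n, d = len(A), len(A[0])
    # Normal equations: (AᵀA) w = Aᵀ y
    ata = [[sum(A[k][i] * A[k][j] for k in range(n)) for j in range(d)]
           for i in range(d)]
    aty = [sum(A[k][i] * y[k] for k in range(n)) for i in range(d)]
    # Forward elimination with partial pivoting.
    for col in range(d):
        piv = max(range(col, d), key=lambda r: abs(ata[r][col]))
        ata[col], ata[piv] = ata[piv], ata[col]
        aty[col], aty[piv] = aty[piv], aty[col]
        for r in range(col + 1, d):
            f = ata[r][col] / ata[col][col]
            for c in range(col, d):
                ata[r][c] -= f * ata[col][c]
            aty[r] -= f * aty[col]
    # Back substitution.
    w = [0.0] * d
    for i in reversed(range(d)):
        w[i] = (aty[i] - sum(ata[i][j] * w[j]
                             for j in range(i + 1, d))) / ata[i][i]
    return w  # last element is the intercept
```

Each dialogue-state metric would be a column of `X` and the measured system performance the target `y`; the fitted weights define the evaluation measure. The paper's point is that a support vector regressor, which can model non-linear mappings, fits this relation better than the linear form above.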
Assessment of Spoken Dialogue System Usability –
What are We really Measuring?
Lars Bo Larsen; Aalborg University, Denmark
Speech based interfaces have not experienced the breakthrough
many have predicted during the last decade. This paper attempts
to clarify some of the reasons why by investigating the currently
applied methods of usability evaluation. Usability attributes especially important for speech-based interfaces are identified and discussed. It is shown that subjective measures (even for widespread
evaluation schemes, such as PARADISE) are mostly done in an ad
hoc manner and are rarely validated. A comparison is made between some well-known scales, and through an example application
of the CCIR usability questionnaire it is shown how validation of the
subjective measures can be performed.
Evaluation of a Speech-Driven Telephone
Information Service Using the PARADISE
Framework: A Closer Look at Subjective Measures
Paula M.T. Smeele, Juliette A.J.S. Waals; TNO Human
Factors, The Netherlands

For the evaluation of a speech-driven telephone flight information service we applied the PARADISE model developed by Walker and colleagues [1] in order to gain insight into the factors affecting the user satisfaction of this service. We conducted an experiment in which participants were asked to call the service and book a flight. During the telephone conversations, quantitative measures (e.g. total elapsed time, the number of system errors) were logged. After completion of the telephone calls, the participants judged some quality-related aspects such as dialogue presentation and accessibility of the system. These subjective measures together represent a value for user satisfaction. Using multivariate linear regression, it was possible to derive a performance function with user satisfaction as the dependent variable and a combination of objective measures as independent variables. The results of the regression analysis also indicated that an extended definition of user satisfaction including a subjective measure ‘Grade’ provides a better prediction than the analysis based on the narrow definition used by Walker et al. Further, we investigated the correlation between the subjective measures by conducting a principal components analysis. The results showed that these measures fell into two groups. Implications are discussed.

Quantifying the Impact of System Characteristics
on Perceived Quality Dimensions of a Spoken
Dialogue Service
Sebastian Möller, Janto Skowronek; Ruhr-University
Bochum, Germany

Developers of telephone services which rely on spoken dialogue systems would like to identify system characteristics influencing the quality perceived by the user, and to quantify the respective impact before the system is put into service. A laboratory experiment is described in which speech input, speech output, and confirmation characteristics of a restaurant information system were manipulated in a controlled way. Users’ quality perceptions were collected by means of a specifically designed questionnaire. It is based on a recently developed taxonomy of quality aspects, and aims at capturing a multitude of perceptually relevant quality dimensions. Experimental results indicate that ASR performance affects a number of interaction parameters, and is a relatively well identifiable quality impact for the user. In contrast, speech output affects perceived quality on a number of different levels, up to global user satisfaction judgments. Potential reasons for these findings are discussed.

A Programmable Policy Manager for
Conversational Biometrics
Ganesh N. Ramaswamy, Ran D. Zilca, Oleg
Alecksandrovich; IBM T.J. Watson Research Center,
USA

Conversational Biometrics combines acoustic speaker verification with conversational knowledge verification to make a more accurate identity decision. To manage the added level of complexity that the multi-modal user recognition approach introduces, this paper proposes the use of verification policies, in the form of Finite State Machines, which can be used to program a policy manager. Once a verification policy is written, the policy manager interprets the policy on the fly, and at every turn in the session decides dynamically whether to accept the user, reject the user, or continue to interact and collect more data. The policy manager allows any number of verification engines to be plugged in, thereby adding to the flexibility of the framework. As a result, system developers have significant freedom to design user verification solutions, and a wide variety of application-specific, transaction-specific and user-specific constraints can be addressed using a generic system. The paper also describes a prototype implementation of a Conversational Biometrics solution with the proposed programmable policy manager.

Integration of Speaker Recognition into
Conversational Spoken Dialogue Systems
Timothy J. Hazen, Douglas A. Jones, Alex Park, Linda
C. Kukolich, Douglas A. Reynolds; Massachusetts
Institute of Technology, USA

In this paper we examine the integration of speaker identification/verification technology into two dialogue systems developed at MIT: the Mercury air travel reservation system and the Orion task delegation system. These systems both utilize information collected from registered users that is useful in personalizing the system to specific users and that must be securely protected from imposters. Two speaker recognition systems, the MIT Lincoln Laboratory text-independent GMM-based system and the MIT Laboratory for Computer Science text-constrained speaker-adaptive ASR-based system, are evaluated and compared within the context of these conversational systems.

Session: OWeCa– Oral
Speech Recognition - Large Vocabulary II
Time: Wednesday 13.30, Venue: Room 1
Chair: John Bridle, Novauris Laboratories UK Ltd

Discriminative Optimization of Large Vocabulary
Mandarin Conversational Speech Recognition
System
Peng Ding, Zhenbiao Chen, Sheng Hu, Shuwu Zhang,
Bo Xu; Chinese Academy of Sciences, China

This paper examines techniques of discriminative optimization for acoustic models, including both HMM parameters and linear transforms, in the context of the HUB5 Mandarin large vocabulary speech recognition task, with the aim of partly solving the problems brought by the sparseness and the highly ambiguous nature of telephony conversational speech data. Three techniques are studied: MMI training of the HMM acoustic parameters, MMI training of the Semi-Tied Covariance Model, and MMI Speaker Adaptive Training. Descriptions of our recognition system and the algorithms used in our experiments are detailed, followed by the corresponding results.
Speech Recognition with Dynamic Grammars Using
Finite-State Transducers
Johan Schalkwyk 1 , Lee Hetherington 2 , Ezra Story 1 ;
1 SpeechWorks International, USA; 2 Massachusetts
Institute of Technology, USA
Spoken language systems, ranging from interactive voice response
(IVR) to mixed-initiative conversational systems, make use of a wide
range of recognition grammars and vocabularies. The recognition
grammars are either static (created at design time) or dynamic (dependent on database lookup at run time). This paper examines the
compilation of recognition grammars with an emphasis on the dynamic (changing) properties of the grammar and how these relate
to context-dependent speech recognizers. By casting the problem
in the algebra of finite-state transducers (FSTs) we can use the composition operator for fast-and-efficient compilation and splicing of
dynamic recognition grammars within the context of a larger precompiled static grammar.
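The composition operator at the heart of this approach can be sketched for epsilon-free transducers (a minimal illustration of the algebra, not any toolkit's implementation; real FST libraries also handle epsilons, weights, and lazy evaluation):

```python
def compose(f, g):
    """Compose two epsilon-free finite-state transducers.

    Each FST is a tuple (start, finals, arcs), where arcs is a dict:
    state -> list of (in_label, out_label, next_state).
    The result maps x -> z whenever f maps x -> y and g maps y -> z.
    """
    start = (f[0], g[0])
    finals, arcs = set(), {}
    stack, seen = [start], {start}
    while stack:
        p, q = stack.pop()
        if p in f[1] and q in g[1]:
            finals.add((p, q))
        out = arcs.setdefault((p, q), [])
        for (i1, o1, p2) in f[2].get(p, []):
            for (i2, o2, q2) in g[2].get(q, []):
                if o1 == i2:  # middle labels must match
                    out.append((i1, o2, (p2, q2)))
                    if (p2, q2) not in seen:
                        seen.add((p2, q2))
                        stack.append((p2, q2))
    return start, finals, arcs
```

Splicing a dynamic grammar into a precompiled static one amounts to composing the dynamic piece with the relevant context-dependency transducers and attaching the result at the reserved nonterminal's states.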
FLaVoR: A Flexible Architecture for LVCSR
Kris Demuynck, Tom Laureys, Dirk Van Compernolle,
Hugo Van hamme; Katholieke Universiteit Leuven,
Belgium
This paper describes a new architecture for large vocabulary continuous speech recognition (LVCSR), which will be developed within the
project FLaVoR (Flexible Large Vocabulary Recognition). The proposed architecture abandons the standard all-in-one search strategy
with integrated acoustic, lexical and language model information.
Instead, a modular framework is proposed which allows for the integration of more complex linguistic components. The search process
consists of two layers. First, a pure acoustic-phonemic search generates a dense phoneme network enriched with meta-data. Then,
the output of the first layer is used by sophisticated language technology components for word decoding in the second layer. Preliminary experiments prove the feasibility of the approach.
An Architecture for Rapid Decoding of Large
Vocabulary Conversational Speech
George Saon, Geoffrey Zweig, Brian Kingsbury, Lidia
Mangu, Upendra Chaudhari; IBM T.J. Watson
Research Center, USA

This paper addresses the question of how to design a large vocabulary recognition system so that it can simultaneously handle a sophisticated language model, perform state-of-the-art speaker adaptation, and run in one times real-time (1×RT). The architecture we propose is based on classical HMM Viterbi decoding, but uses an extremely fast initial speaker-independent decoding to estimate VTL warp factors and the feature-space and model-space MLLR transformations that are used in a final speaker-adapted decoding. We present results on past Switchboard evaluation data that indicate that this strategy compares favorably to published unlimited-time systems (running in several hundred times real-time). Coincidentally, this is the system that IBM fielded in the 2003 EARS Rich Transcription evaluation.

MMI-MAP and MPE-MAP for Acoustic Model
Adaptation
D. Povey, M.J.F. Gales, D.Y. Kim, P.C. Woodland;
Cambridge University, U.K.

This paper investigates the use of discriminative schemes based on the maximum mutual information (MMI) and minimum phone error (MPE) objective functions for both task and gender adaptation. A method for incorporating prior information into the discriminative training framework is described. If an appropriate form of prior distribution is used, then this may be implemented by simply altering the values of the counts used for parameter estimation. The prior distribution can be based around maximum likelihood parameter estimates, giving a technique known as I-smoothing, or for adaptation it can be based around a MAP estimate of the ML parameters, leading to MMI-MAP or MPE-MAP. MMI-MAP is shown to be effective for task adaptation, where data from one task (Voicemail) is used to adapt an HMM set trained on another task (Switchboard). MPE-MAP is shown to be effective for generating gender-dependent models for Broadcast News transcription.

Lattice Segmentation and Minimum Bayes Risk
Discriminative Training
Vlasios Doumpiotis, Stavros Tsakalidis, William J.
Byrne; Johns Hopkins University, USA

Modeling approaches are presented that incorporate discriminative training procedures in segmental Minimum Bayes-Risk decoding (SMBR). SMBR is used to segment lattices produced by a general automatic speech recognition (ASR) system into sequences of separate decision problems involving small sets of confusable words. We discuss two approaches to incorporating these segmented lattices in discriminative training. We investigate the use of acoustic models specialized to discriminate between the competing words in these classes, which are then applied in subsequent SMBR rescoring passes. Refinement of the search space that allows the use of specialized discriminative models is shown to be an improvement over rescoring with conventionally trained discriminative models.

Session: SWeCb– Oral
Robust Methods in Processing of Natural
Language Dialogues
Time: Wednesday 13.30, Venue: Room 2
Chair: Vincenzo Pallotta, EPFL, Switzerland

Spoken Language Condensation in the 21st Century
Klaus Zechner; Educational Testing Service, USA

While the field of Information Retrieval originally had the search for the most relevant documents in mind, it has become increasingly clear that in many instances what the user wants is a piece of coherent information, derived from a set of relevant documents and possibly other sources. Reducing relevant documents, passages, and sentences to their core is the task of text summarization or information condensation. Applying text-based technologies to speech is not always workable, and often not enough to capture speech-specific phenomena. In this paper, we will contrast speech summarization with text summarization, give an overview of the history of speech summarization and its current state, and, finally, sketch possible avenues as well as remaining challenges in future research.
Robust Methods in Automatic Speech Recognition
and Understanding
Sadaoki Furui; Tokyo Institute of Technology, Japan
This paper overviews robust architecture and modeling techniques
for automatic speech recognition and understanding. The topics
include robust acoustic and language modeling for spontaneous
speech recognition, unsupervised adaptation of acoustic and language models, robust architecture for spoken dialogue systems,
multi-modal speech recognition, and speech understanding. This
paper also discusses the most important research problems to be
solved in order to achieve ultimate robust speech recognition and
understanding systems.
Parsing Spontaneous Speech
Rodolfo Delmonte; Università Ca’ Foscari, Italy
In this paper we will present recent work on the 50,000-word Italian Spontaneous Speech Corpus called AVIP, developed under the national project API and made available for free download from the website of the coordinator, the University of Naples. We will concentrate on the tuning of the parser for Italian, which had previously been used to parse a 100,000-word corpus of written Italian within the National Treebank initiative coordinated by ILC in Pisa.
The parser receives as input the suitably transformed orthographic transcription of the dialogues making up the corpus, in which pauses, hesitations and other disfluencies have been turned into the most likely corresponding punctuation marks, interjections or truncations of the word underlying the uttered segment.

The most interesting phenomenon we will discuss is without doubt "overlapping", i.e. a speech event in which two people speak at the same time, uttering actual words or in some cases non-words, when one of the speakers, usually the one who is not the current turn-taker, interrupts the current speaker.

This phenomenon takes place at a certain point in time, where it has to be anchored to the speech signal; but in order to be fully parsed and subsequently semantically interpreted, it needs to be referred semantically to a following turn.
Eurospeech 2003
Wednesday
September 1-4, 2003 – Geneva, Switzerland
Session: OWeCc – Oral
Speaker Identification
Time: Wednesday 13.30, Venue: Room 3
Chair: Samy Bengio, IDIAP, Switzerland

Model Compression for GMM Based Speaker Recognition Systems
Douglas A. Reynolds; Massachusetts Institute of Technology, USA

For large-scale deployments of speaker verification systems, model size can be an important issue, not only for minimizing storage requirements but also for reducing the transfer time of models over networks. Model size is also critical for deployments on small, portable devices. In this paper we present a new model compression technique for Gaussian Mixture Model (GMM) based speaker recognition systems. For GMM systems using adaptation from a background model, the compression technique exploits the fact that speaker models are adapted from a single speaker-independent model, so not all parameters need to be stored. We present results on the 2002 NIST speaker recognition evaluation cellular telephone corpus and show that the compression technique provides a good tradeoff of compression ratio against performance loss. We are able to achieve a 56:1 compression (624KB → 11KB) with only a 3.2% relative increase in EER (9.1% → 9.4%).

Improved Speaker Verification Through Probabilistic Subspace Adaptation
Simon Lucey, Tsuhan Chen; Carnegie Mellon University, USA

In this paper we propose a new adaptation technique for improved text-independent speaker verification with limited amounts of training data using Gaussian mixture models (GMMs). The technique, referred to as probabilistic subspace adaptation (PSA), employs a probabilistic subspace description of how a client's parametric representation (i.e. GMM) is allowed to vary. Our technique is compared to traditional maximum a posteriori (MAP) adaptation, also known as relevance adaptation (RA), and to maximum likelihood eigen-decomposition (MLED), also known as subspace adaptation (SA). Results are given on a subset of the XM2VTS database for the task of text-independent speaker verification.

The Awe and Mystery of T-Norm
Jiří Navrátil, Ganesh N. Ramaswamy; IBM T.J. Watson Research Center, USA

A popular score normalization technique termed T-norm is the central focus of this paper. Based on the widely confirmed experimental observation that T-norm tilts the DET curves of speaker detection systems, we set out to identify the components that play a role in this phenomenon. We claim that under certain local assumptions T-norm performs a Gaussianization of the individual true and impostor score populations, and we further derive conditions for the clockwise and counter-clockwise DET rotations caused by this transform.

An Improved Model-Based Speaker Segmentation System
Peng Yu, Frank Seide, Chengyuan Ma, Eric Chang; Microsoft Research Asia, China

In this paper, we report our recent work on speaker segmentation. Without a priori information about the number of speakers or their identities, the audio stream is segmented and segments of the same speaker are grouped together. Speakers are represented by Gaussian Mixture Models (GMMs), and an HMM network is used for segmentation. Unlike other model-based segmentation methods, however, the speaker GMMs are initialized using a simpler distance-based segmentation algorithm. To group segments of identical speakers, a two-level clustering mechanism is introduced, which we found to achieve higher accuracy than direct distance-based clustering methods. Our method significantly outperforms the best result reported at the 2002 Speaker Recognition Workshop. When tested on a professionally produced TV program set, our system reports only 3.5% frame errors.

Gaussian Dynamic Warping (GDW) Method Applied to Text-Dependent Speaker Detection and Verification
Jean-François Bonastre 1, Philippe Morin 2, Jean-Claude Junqua 2; 1 LIA-CNRS, France; 2 Panasonic Speech Technology Laboratory, USA

This paper introduces a new acoustic modeling method called Gaussian Dynamic Warping (GDW). It targets real-world applications such as voice-based entrance-door security systems, the example presented in this paper. The proposed approach uses a hierarchical statistical framework with three levels of specialization for the acoustic modeling. The highest level of specialization is additionally responsible for modeling the temporal constraints via a specific Temporal Structure Information (TSI) component.

Preliminary results show the ability of the GDW method to elegantly take into account the acoustic variability of speech while capturing important temporal constraints.

Modeling Duration Patterns for Speaker Recognition
Luciana Ferrer, Harry Bratt, Venkata R.R. Gadde, Sachin S. Kajarekar, Elizabeth Shriberg, Kemal Sönmez, Andreas Stolcke, Anand Venkataraman; SRI International, USA

We present a method for speaker recognition that uses the duration patterns of speech units to aid speaker classification. The approach represents each word and/or phone by a feature vector comprising either the durations of the individual phones making up the word, or the durations of the HMM states making up the phone. We model the vectors using mixtures of Gaussians. The speaker-specific models are obtained through adaptation of a "background" model that is trained on a large pool of speakers. Speaker models are then used to score the test data; the scores are normalized by subtracting the scores obtained with the background model. We find that this approach yields a significant performance improvement when combined with a state-of-the-art speaker recognition system based on standard cepstral features. Furthermore, the improvement persists even after combination with lexical features. Finally, the improvement continues to increase with longer test sample durations, beyond the test duration at which standard system accuracy levels off.
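Several of the abstracts in this session (Reynolds; Ferrer et al.) score test data against a speaker model adapted from a speaker-independent background model, then normalize by subtracting the background-model score. As an illustration only (not any author's actual system; the function names and the toy diagonal-covariance GMMs are hypothetical), this log-likelihood-ratio scoring might be sketched as:

```python
import numpy as np

def gmm_loglik(frames, weights, means, variances):
    """Average per-frame log-likelihood under a diagonal-covariance GMM.

    frames: (T, D) feature vectors; weights: (M,); means, variances: (M, D).
    """
    diff = frames[:, None, :] - means[None, :, :]                       # (T, M, D)
    log_norm = -0.5 * np.sum(np.log(2 * np.pi * variances), axis=1)    # (M,)
    log_exp = -0.5 * np.sum(diff ** 2 / variances[None, :, :], axis=2) # (T, M)
    log_comp = np.log(weights)[None, :] + log_norm[None, :] + log_exp
    # log-sum-exp over mixture components, then average over frames
    m = log_comp.max(axis=1, keepdims=True)
    return float(np.mean(m[:, 0] + np.log(np.sum(np.exp(log_comp - m), axis=1))))

def normalized_score(frames, speaker_gmm, background_gmm):
    """Speaker score minus background (UBM) score: positive values favor
    the hypothesis that the speaker model produced the test frames."""
    return gmm_loglik(frames, *speaker_gmm) - gmm_loglik(frames, *background_gmm)
```

Each GMM is passed as a `(weights, means, variances)` triple; a real system would of course train these with EM and MAP adaptation rather than hand-specify them.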
Session: OWeCd – Oral
Speech Synthesis: Miscellaneous I
Time: Wednesday 13.30, Venue: Room 4
Chair: Wolfgang Hess, IKP Universität Bonn, Germany
A Latent Analogy Framework for
Grapheme-to-Phoneme Conversion
Jerome R. Bellegarda; Apple Computer Inc., USA
Data-driven grapheme-to-phoneme conversion involves either (top-down) inductive learning or (bottom-up) pronunciation by analogy.
As both approaches rely on local context information, they typically require some external linguistic knowledge, e.g., individual
grapheme/phoneme correspondences. To avoid such supervision,
this paper proposes an alternative solution, dubbed pronunciation
by latent analogy, which adopts a more global definition of analogous events. For each out-of-vocabulary word, a neighborhood of
globally relevant pronunciations is constructed through an appropriate data-driven mapping of its graphemic form. Phoneme transcription then proceeds via locally optimal sequence alignment and
maximum likelihood position scoring. This method was successfully applied to the synthesis of proper names with a large diversity
of origin.
Conditional and Joint Models for
Grapheme-to-Phoneme Conversion
Stanley F. Chen; IBM T.J. Watson Research Center,
USA
In this work, we introduce several models for grapheme-to-phoneme conversion: a conditional maximum entropy model, a
joint maximum entropy n-gram model, and a joint maximum entropy n-gram model with syllabification. We examine the relative
merits of conditional and joint models for this task, and find that
joint models have many advantages. We show that the performance
of our best model, the joint n-gram model, compares favorably with
the best results for English grapheme-to-phoneme conversion reported in the literature, sometimes by a wide margin. In the latter
part of this paper, we consider the task of merging pronunciation
lexicons expressed in different phone sets. We show that models
for grapheme-to-phoneme conversion can be adapted effectively to
this task.
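Chen's joint n-gram model operates over joint grapheme-phoneme units (often called "graphones"). The following is a heavily simplified sketch, not the paper's maximum entropy model: it assumes a one-to-one letter-phone alignment, a tiny invented training lexicon, and greedy decoding from a bigram count table.

```python
from collections import defaultdict

# Toy aligned training data: each word is a sequence of (grapheme, phoneme)
# units. The 1:1 alignment and the lexicon are assumptions for illustration.
TRAIN = [
    [("c", "k"), ("a", "ae"), ("t", "t")],
    [("c", "k"), ("a", "ae"), ("b", "b")],
    [("a", "ae"), ("t", "t")],
]

def train_bigram(data):
    """Count joint-unit bigrams: counts[previous_unit][unit] = frequency."""
    counts = defaultdict(lambda: defaultdict(int))
    for word in data:
        prev = ("<s>", "<s>")
        for unit in word:
            counts[prev][unit] += 1
            prev = unit
    return counts

def g2p(word, counts):
    """Greedy decode: for each letter, pick the phoneme whose graphone is
    most frequent after the previous unit, backing off to counts pooled
    over all histories when the exact history is unseen."""
    out, prev = [], ("<s>", "<s>")
    for ch in word:
        cands = [(n, u) for u, n in counts[prev].items() if u[0] == ch]
        if not cands:  # back off: pool counts over every history
            pooled = defaultdict(int)
            for hist in counts.values():
                for u, n in hist.items():
                    if u[0] == ch:
                        pooled[u] += n
            cands = [(n, u) for u, n in pooled.items()]
        _, unit = max(cands)
        out.append(unit[1])
        prev = unit
    return out
```

A real implementation would use higher-order n-grams with smoothing and a many-to-many grapheme-phoneme alignment learned by EM, decoded with dynamic programming rather than greedily.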
Mixed-Lingual Text Analysis for Polyglot TTS
Synthesis
Beat Pfister, Harald Romsdorfer; ETH Zürich,
Switzerland
Text-to-speech (TTS) synthesis is more and more confronted with
the language mixing phenomenon. An important step towards the
solution of this problem and thus towards a so-called polyglot TTS
system is an analysis component for mixed-lingual texts. In this
paper it is shown how such an analyzer can be realized for a set of
languages, starting from a corresponding set of monolingual analyzers which are based on DCGs and chart parsing.
Identifying Speakers in Children's Stories for Speech Synthesis
Jason Y. Zhang 1, Alan W. Black 1, Richard Sproat 2; 1 Carnegie Mellon University, USA; 2 AT&T Labs Research, USA

Choosing appropriate voices for synthesizing children's stories requires text analysis techniques that can identify which portions of the text should be read by which speakers. Our work presents techniques to take raw text stories and automatically identify the quoted speech, identify the characters within the stories, and assign a character to each quote. The resulting marked-up story may then be rendered with a standard speech synthesizer using appropriate voices for the characters.

This paper presents each of the basic stages of identification and the algorithms, both rule-driven and data-driven, used to achieve them. A variety of story texts are used to test our system. Results are presented with a discussion of the limitations and recommendations on how to improve speaker assignment in further texts.

Experimental Tools to Evaluate Intelligibility of Text-to-Speech (TTS) Synthesis: Effects of Voice Gender and Signal Quality
Catherine Stevens 1, Nicole Lees 1, Julie Vonwiller 2; 1 University of Western Sydney, Australia; 2 APPEN Speech Technology, Australia

Two experiments are reported that constitute new methods for evaluating text-to-speech (TTS) synthesis from the user's perspective. Experiment 1, using sentence stimuli, and Experiment 2, using discrete word stimuli, investigate the effect of voice gender and signal quality on the intelligibility of three TTS synthesis systems from the user's point of view. Accuracy scores and reaction times were recorded as on-line, implicit indices of intelligibility during phoneme detection tasks. It was hypothesized that male-voice TTS would be more intelligible than female-voice TTS, and that low-quality signals would reduce intelligibility. Results indicate an interaction between voice gender and signal quality that depends on the TTS system. We suggest that intelligibility from the user's perspective is modulated by several factors and that there is a need to tailor systems to particular commercial applications. Methods to achieve commercially relevant evaluation of TTS synthesis are discussed.

Arabic in My Hand: Small-Footprint Synthesis of Egyptian Arabic
Laura Mayfield Tomokiyo 1, Alan W. Black 2, Kevin A. Lenzo 1; 1 Cepstral LLC, USA; 2 Carnegie Mellon University, USA

The research described in this paper addresses the dual concerns of synthesis of Arabic, a language that has shot to prominence in the past few years, and synthesis on a handheld device, the realization of which presents difficult software engineering problems. Our system was developed in conjunction with the DARPA BABYLON project, and has been integrated with English synthesis, English and Arabic ASR, and machine translation on a single off-the-shelf PDA.

We present a concatenative, general-domain Arabic synthesizer that runs 7 times faster than real time with a 9MB footprint. The voice itself was developed over only a few months, without access to costly prepared databases. It has been evaluated using standard test protocols, with results comparable to those achieved by English voices of the same size and the same level of development.

Session: PWeCe – Poster
Speech Perception
Time: Wednesday 13.30, Venue: Main Hall, Level -1
Chair: Anders Eriksson, Umeå University, Sweden

Schema-Based Modeling of Phonemic Restoration
Soundararajan Srinivasan, DeLiang Wang; Ohio State University, USA

Phonemic restoration refers to the synthesis of masked phonemes in speech when sufficient lexical context is present. Current models of phonemic restoration, however, make no use of lexical knowledge. Such models are inherently inadequate for restoring unvoiced phonemes and may be limited in their ability to restore voiced phonemes too. We present a predominantly top-down model of phonemic restoration. The model uses a missing-data speech recognition system to recognize speech utterances as words and activates word templates corresponding to the words containing the masked phonemes. An activated template is dynamically time-warped to the noisy word and is then used to restore the speech frames corresponding to the masked phoneme, thereby synthesizing it. The model is able to restore both voiced and unvoiced phonemes. Systematic testing shows that it performs significantly better than a Kalman-filter based model.

Perception of Voice-Individuality for Distortions of Resonance/Source Characteristics and Waveforms
Hisao Kuwabara; Teikyo University of Science & Technology, Japan

A perceptual study has been performed to investigate the relationship between acoustic parameters and voice individuality, making use of a pitch-synchronous analysis-synthesis system. Voice individuality is carried by many acoustic parameters, and the aim of this experiment is to examine how individual parameters affect voice individuality by applying distortions to them separately. Formant-frequency shifts and bandwidth manipulations serve as spectral distortions, and F0 shift as the source manipulation. As waveform distortions, zero-crossing and center-clipping techniques are used. It has been found that voice individuality is very sensitive to formant shift, whereas it is rather tolerant of F0 shift and bandwidth manipulations. The results of waveform manipulation reveal that voice individuality is preserved better than the phonetic information under zero-crossing distortion, while the reverse holds for center-clipping distortion.
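The two waveform distortions Kuwabara uses can be illustrated directly. A minimal numpy sketch; reading "zero-crossing distortion" as infinite clipping (keeping only the waveform's sign, which preserves zero-crossing positions) is an assumption of this sketch:

```python
import numpy as np

def zero_crossing_distortion(x):
    """Keep only the sign of each sample (infinite clipping): zero-crossing
    positions survive, all amplitude detail is destroyed."""
    return np.sign(x)

def center_clipping(x, threshold_ratio=0.3):
    """Remove the central (low-amplitude) portion of the waveform: samples
    within +/- threshold become zero, the rest are shifted toward zero."""
    thr = threshold_ratio * np.max(np.abs(x))
    y = np.zeros_like(x)
    y[x > thr] = x[x > thr] - thr
    y[x < -thr] = x[x < -thr] + thr
    return y
```

The two operations discard complementary information, which is what makes them useful probes: infinite clipping keeps timing but not amplitude structure, while center clipping keeps the high-amplitude peaks but deletes the low-level detail.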
The Perceptual Cues of a High Level Pitch-Accent
Pattern in Japanese: Pitch-accent Patterns and
Duration
Tsutomu Sato; Meiji Gakuin University, Japan
It has been pointed out that the head-high pitch-accent pattern (hereafter HLL) of a loanword in Japanese tends to be flattened and pronounced as a high level pitch-accent pattern (phonetically represented here as HHH) by the younger generation.

This paper attempts to clarify, by means of a perception experiment using speech synthesis techniques, how Japanese speakers in their twenties distinguish a high level pitch-accent pattern from a level pitch-accent pattern (hereafter LHH).

In the first part of this paper, it is shown how often HHH patterns appeared in a production experiment with 10 Japanese college students, and their F0 configurations are investigated. The result of the perception experiment then indicates that HHH patterns are taken for LHH patterns in faster speech. Finally, it is suggested that durational differences caused by pitch-accent patterns can serve as perceptual cues for telling the pitch-accent patterns apart.
Illusory Continuity of Intermittent Pure Tone in
Binaural Listening and Its Dependency on
Interaural Time Difference
Mamoru Iwaki 1 , Norio Nakamura 2 ; 1 Niigata
University, Japan; 2 AIST, Japan
Illusory continuity is a well-known psychoacoustical phenomenon: an intermittent pure tone may be perceived as continuous when the gaps are filled with sufficiently loud white noise. Much research has addressed this issue in monaural listening, where illusory continuity is said to be observed when there is no evidence of discontinuity and the level of the white noise is high enough to mask the pure tone. In this paper, we investigate illusory continuity in binaural listening and measure its threshold levels for several interaural time differences (ITDs). The ITDs simulate a sense of direction of the sound sources and thus provide new information as evidence of discontinuity, which would be expected to work against the illusory continuity. As a result, the threshold level of illusory continuity in binaural listening depended on the ITDs of the tone target and the noise masker: the increase of threshold level was smallest when the target and masker had the same ITD.
CART-Based Factor Analysis of Intelligibility
Reduction in Japanese English
Nobuaki Minematsu 1 , Changchen Guo 2 , Keikichi
Hirose 1 ; 1 University of Tokyo, Japan; 2 KTH, Sweden
This study aims at automatically estimating the probability of individual words of Japanese English (JE) being perceived correctly by American listeners, and at clarifying what kinds of (combinations of) segmental, prosodic, and linguistic errors in the words are most fatal to their correct perception. From a JE speech database, a balanced set of 360 utterances by 90 male speakers is first selected. Then, a listening experiment is carried out in which 6 Americans are asked to transcribe all the utterances. Next, using speech and language technology, the values of many segmental, prosodic, and linguistic attributes of the words are extracted. Finally, the relation between the transcription rate of each word and its attribute values is analyzed with the Classification And Regression Tree (CART) method to predict the probability of each of the JE words being transcribed correctly. The machine prediction is compared with the human predictions of seven teachers, and this method is shown to be comparable to the best American teacher. This paper also describes differences between American and Japanese teachers in perceiving the intelligibility of the pronunciation.
Harmonic Alternatives to Sine-Wave Speech
László Tóth, András Kocsor; Hungarian Academy of
Sciences, Hungary
Sine-wave speech (SWS) is a three-tone replica of speech, conventionally created by matching each constituent sinusoid in amplitude and frequency with the corresponding vocal tract resonance
(formant). We propose an alternative technique where we take a
high-quality multicomponent sinusoidal representation and decimate this model so that there are only three components per frame.
In contrast to SWS, the resulting signal contains only components
that were present in the original signal. Consequently it preserves
the harmonic fine structure of voiced speech. Perceptual studies
indicate that this signal is judged more natural and intelligible than
SWS. Furthermore, its tonal artifacts can mostly be eliminated by
the introduction of only a few additional components, which leads
to an intriguing speculation about grouping issues.
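Tóth and Kocsor's decimation idea (keep only the three strongest components actually present in the signal, instead of matching formants as in sine-wave speech) can be sketched for a single frame. This toy version, with hypothetical function names, picks the three largest FFT peak magnitudes and resynthesizes only those components; the paper starts from a proper multicomponent sinusoidal model rather than raw FFT peaks.

```python
import numpy as np

def top3_resynthesis(frame, sr):
    """Keep the three largest spectral peaks of one frame and resynthesize
    only those sinusoids. Returns (signal, kept_frequencies_in_Hz)."""
    n = len(frame)
    spec = np.fft.rfft(frame * np.hanning(n))
    mags = np.abs(spec)
    # restrict to local maxima so only genuine component peaks are eligible
    peaks = [k for k in range(1, len(mags) - 1)
             if mags[k] >= mags[k - 1] and mags[k] >= mags[k + 1]]
    keep = sorted(peaks, key=lambda k: mags[k], reverse=True)[:3]
    t = np.arange(n) / sr
    out = np.zeros(n)
    for k in keep:
        freq = k * sr / n
        amp = 2 * mags[k] / n          # crude amplitude estimate
        out += amp * np.cos(2 * np.pi * freq * t + np.angle(spec[k]))
    return out, [k * sr / n for k in keep]
```

Because the kept components are peaks of the actual spectrum, a harmonic of voiced speech survives as a harmonic, which is the property the abstract emphasizes over conventional SWS.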
Non-Intrusive Assessment of Perceptual Speech
Quality Using a Self-Organising Map
Dorel Picovici, Abdulhussain E. Mahdi; University of
Limerick, Ireland
A new output-based method for non-intrusive assessment of speech quality for voice communication systems is proposed and its performance evaluated. The method compares the output speech to an appropriate reference representing the closest match from a pre-formulated codebook of optimally clustered speech parameter vectors extracted from a large number of undistorted clean speech records. The objective auditory distances between vectors of the distorted speech and their corresponding matching references are then measured and appropriately converted into an equivalent subjective score. The optimal clustering of the reference codebook is achieved by a dynamic k-means method, and a self-organising map algorithm is used to match the distorted speech vectors to the references. Speech parameters derived from Bark spectrum analysis, Perceptual Linear Prediction (PLP), and Mel-Frequency Cepstral Coefficients (MFCC) provide the speaker-independent parametric representation of the speech signals required by an output-based quality measure.
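The core operation of such an output-based measure is matching each distorted-speech feature vector to its nearest entry in a clean-speech codebook and converting the average distance into a score. A minimal sketch; the Euclidean distance and the monotone distance-to-score mapping below are placeholder assumptions, not the paper's auditory distance or its learned conversion to subjective scores.

```python
import numpy as np

def nearest_codeword_distance(frames, codebook):
    """Average distance from each feature vector (rows of `frames`) to its
    closest codebook entry; small values mean the speech resembles the
    clean speech the codebook was trained on."""
    d = np.linalg.norm(frames[:, None, :] - codebook[None, :, :], axis=2)  # (T, K)
    return float(d.min(axis=1).mean())

def quality_score(frames, codebook, scale=1.0):
    """Placeholder mapping to a 0-5 style score, decreasing with distance."""
    return 5.0 / (1.0 + scale * nearest_codeword_distance(frames, codebook))
```

No reference copy of the test utterance is needed, only the pre-trained codebook, which is what makes the measure non-intrusive.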
Inhibitory Priming Effect in Auditory Word
Recognition: The Role of the Phonological
Mismatch Length Between Primes and Targets
Sophie Dufour, Ronald Peereman; LEAD-CNRS, France
Three experiments examined lexical competition effects using the phonological priming paradigm in a shadowing task. Experiment 1 replicated Hamburger and Slowiaczek's [1] finding of initial-overlap inhibition when primes and targets share three phonemes (/böiz/-/böik/) but not when they share two (/böEz/-/böik/). This observation suggests that lexical competition depends on the number of phonemes shared between primes and targets. However, Experiment 2 showed that an overlap of two phonemes was sufficient to cause inhibition when the primes mismatched the targets only on the last phoneme (/bol/-/bot/). Conversely, with a three-phoneme overlap, no inhibition was observed in Experiment 3 when the primes mismatched the targets on the last two phonemes (/bagEt/-/bagaj/). The data indicate that what essentially determines prime-target competition effects in word-form priming is the number of mismatching phonemes.
Recognising ‘Real-Life’ Speech with SpeM: A
Speech-Based Computational Model of Human
Speech Recognition
Odette Scharenborg, Louis ten Bosch, Lou Boves;
University of Nijmegen, The Netherlands
In this paper, we present a novel computational model of human
speech recognition – called SpeM – based on the theory underlying
Shortlist. We will show that SpeM, in combination with an automatic
phone recogniser (APR), is able to simulate the human speech recognition process from the acoustic signal to the ultimate recognition
of words. This joint model takes an acoustic speech file as input
and calculates the activation flows of candidate words on the basis
of the degree of fit of the candidate words with the input.
Experiments showed that SpeM outperforms Shortlist on the recognition of ‘real-life’ input. Furthermore, SpeM performs only slightly
worse than an off-the-shelf full-blown automatic speech recogniser
in which all words are equally probable, while it provides a transparent, computationally elegant paradigm for modelling word activations in human word recognition.
The Effect of Speech Rate and Noise on Bilinguals’
Speech Perception: The Case of Native Speakers of
Arabic in Israel
Judith Rosenhouse 1 , Liat Kishon-Rabin 2 ; 1 Technion Israel Institute of Technology, Israel; 2 Tel-Aviv
University, Israel
Listening conditions affect bilinguals’ speech perception, but relatively little is known about the effect of the combination of several
degrading listening conditions. We studied the combined effect of
speech rate and background noise on bilinguals’ speech perception
in their L1 and L2. Speech perception of twenty Israeli university
students, native speakers of Arabic (L1), with Hebrew as L2, was
tested. The tests consisted of CHABA sentences adapted to Hebrew and Arabic. In each language, speech perception was evaluated under four conditions: quiet + regular speaking rate, quiet +
fast speaking rate, noise + regular speaking rate, and noise + fast
speaking rate. Results show that under optimal conditions bilingual
speakers of Arabic and Hebrew have similar achievements in Arabic
(L1) and Hebrew (L2). Under difficult conditions, performance was
poorer in L2 than in L1. The lowest scores were in the combined
condition. This reflects bilinguals’ disadvantages when listening to
L2.
Subjective Evaluations for Perception of Speaker
Identity Through Acoustic Feature
Transplantations
Oytun Turk 1, Levent M. Arslan 2; 1 Sestek Inc., Turkey; 2 Bogazici University, Turkey
Perception of speaker identity is an important characteristic of the human auditory system. This paper describes a subjective test investigating the relevance of four acoustic features in this process: vocal tract, pitch, duration, and energy. PSOLA-based methods provide the framework for the transplantation of these acoustic features between two speakers. The test database consists of different combinations of transplantation outputs obtained from a database of 8 speakers. Subjective decisions on speaker similarity indicate that the vocal tract is the most relevant feature for single-feature transplantations. Pitch and duration possess similar significance, whereas energy is the least important acoustic feature. Vocal tract + pitch + duration transplantation results in the highest similarity to the target speaker. Vocal tract + pitch, vocal tract + duration + energy, and vocal tract + duration transplantations also yield convincing results in transformation of the perceived speaker identity.
Modelling Human Speech Recognition Using Automatic Speech Recognition Paradigms in SpeM
Odette Scharenborg 1, James M. McQueen 2, Louis ten Bosch 1, Dennis Norris 3; 1 University of Nijmegen, The Netherlands; 2 Max Planck Institute for Psycholinguistics, The Netherlands; 3 Medical Research Council Cognition and Brain Sciences Unit, U.K.

We have recently developed a new model of human speech recognition, based on automatic speech recognition techniques [1]. The present paper has two goals. First, we show that the new model performs well in the recognition of lexically ambiguous input; these demonstrations suggest that the model is able to operate in the same optimal way as human listeners. Second, we discuss how to relate the behaviour of a recogniser designed to discover the optimum path through a word lattice to data from human listening experiments. We argue that this requires a metric that combines both path-based and word-based measures of recognition performance. The combined metric varies continuously as the input speech signal unfolds over time.

The Effect of Amplitude Compression on Wide Band Telephone Speech for Hearing-Impaired Elderly People
Mutsumi Saito 1, Kimio Shiraishi 2, Kimitoshi Fukudome 2; 1 Fujitsu Kyushu Digital Technology Ltd., Japan; 2 Kyushu Institute of Design, Japan

Recently, high-speed multimedia communication systems have become widespread. Not only the conventional narrow-band speech signal (up to 3.4 kHz) but also a wide-band speech signal (up to 7 kHz) can be transmitted over high-speed communication lines. Generally, the quality of wide-band speech is high and its articulation score is good for normal-hearing people, but for elderly people with hearing losses at higher frequencies the benefit of wide-band speech is doubtful. We therefore investigated the effect of wide-band telephone speech on elderly people's speech perception in terms of articulation, and also considered the effect of the amplitude compression method used in hearing aids.

Sixty-two Japanese CV syllables were used as test speech samples. The original speech samples were re-sampled to narrow-band speech (8 kHz sampling) and wide-band speech (16 kHz sampling). All speech samples were processed with the AMR codec (Adaptive Multi-Rate COder-DECoder), a voice coding system applicable to both narrow-band and wide-band speech signals. The coded speech signals were then processed with a multi-band amplitude compression method, with the compression ratio in each frequency band determined according to the average of the subjects' hearing levels. All subjects were native Japanese speakers, aged 68 to 72 years, with hearing losses of more than 40 dB HL.

From the results of the test, we found that the combination of wide-band speech and amplitude compression yields a significant improvement in articulation.

Word Activation Model by Japanese School Children without Knowledge of Roman Alphabet
Takashi Otake, Miki Komatsu; Dokkyo University, Japan

Recent models of word recognition have assumed that a phoneme-based word activation device is universal. The present study investigates this proposal with Japanese school children who have no knowledge of the Roman alphabet. The main question addressed is whether such children can activate word candidates on the basis of phonemes. An experiment employing a word reconstruction task was conducted with 21 Japanese elementary school children who were preliterate in Roman letters. The results show that, despite the absence of alphabetic knowledge, they could reconstruct Japanese words just like Japanese adults. This suggests that the current word activation model may be equally applicable to Japanese children as well as adults, who are users of a mora-based language.

Multi-Resolution Auditory Scene Analysis: Robust Speech Recognition Using Pattern-Matching from a Noisy Signal
Sue Harding 1, Georg Meyer 2; 1 University of Sheffield, U.K.; 2 University of Liverpool, U.K.

Unlike automatic speech recognition systems, humans can understand speech when other competing sounds are present. Although the theory of auditory scene analysis (ASA) may help to explain this ability, some perceptual experiments show fusion of the speech signal under circumstances in which ASA principles might be expected to cause segregation. We propose a model of multi-resolution ASA that uses both high- and low-resolution representations of the auditory signal in parallel in order to resolve this conflict. The use of parallel representations reduces variability for pattern-matching while retaining the ability to identify and segregate low-level features of the signal. An important feature of the model is the assumption that features of the auditory signal are fused together unless there is good reason to segregate them. Speech is recognised by matching the low-resolution representation to previously learned speech templates without prior segregation of the signal into separate perceptual streams; this contrasts with the approach
generally used by computational models of ASA. We describe an implementation of the multi-resolution model, using hidden Markov
models, that illustrates the feasibility of this approach and achieves
much higher identification performance than standard techniques
used for computer recognition of speech mixed with other sounds.
Investigation of Emotionally Morphed Speech
Perception and its Structure Using a High Quality
Speech Manipulation System
Hisami Matsui, Hideki Kawahara; Wakayama
University, Japan
A series of perceptual experiments using morphed emotional speech sounds was conducted. A high-quality speech modification procedure, STRAIGHT [1], extended to enable auditory morphing [2], was used to provide CD-quality test stimuli. The test results indicated that the naturalness of the morphed speech samples was comparable to that of natural speech samples and of resynthesized samples without any modifications, when interpolated. They also indicated that the proposed morphing procedure can provide a stimulus continuum between different emotional expressions. Partial morphing tests were also conducted to evaluate the relative contributions of, and the interdependence between, spectral, temporal and source parameters.
Usefulness of Phase Spectrum in Human Speech
Perception
Kuldip K. Paliwal, Leigh Alsteris; Griffith University,
Australia
The short-time Fourier transform of a speech signal has two components: the magnitude spectrum and the phase spectrum. In this paper, the relative importance of the short-time magnitude and phase spectra for speech perception is investigated. Human perception experiments are conducted to measure the intelligibility of speech tokens synthesized from either the magnitude spectrum or the phase spectrum. It is traditionally believed that the magnitude spectrum plays the dominant role for shorter windows (20-30 ms), while the phase spectrum is more important for longer windows (128-3500 ms). It is shown in this paper that even for shorter windows, the phase spectrum can contribute as much to speech intelligibility as the magnitude spectrum if the shape of the window function is properly selected.
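The magnitude-only and phase-only stimuli can be approximated by short-time analysis and overlap-add resynthesis; the window length, hop size, and Hann window below are assumptions, not the paper's settings:

```python
import numpy as np

def analysis_frames(x, win, hop):
    """Slice the signal into Hann-windowed frames."""
    w = np.hanning(win)
    return np.array([x[i:i + win] * w
                     for i in range(0, len(x) - win + 1, hop)])

def overlap_add(frames, hop, n):
    """Resynthesize a length-n signal by overlap-adding the frames."""
    y = np.zeros(n)
    for k, f in enumerate(frames):
        y[k * hop:k * hop + len(f)] += f
    return y

def phase_only(x, win=1024, hop=256):
    """Keep the short-time phase spectrum; force unit magnitude."""
    F = np.fft.rfft(analysis_frames(x, win, hop), axis=1)
    F = np.exp(1j * np.angle(F))          # discard magnitude information
    return overlap_add(np.fft.irfft(F, axis=1), hop, len(x))

def magnitude_only(x, win=1024, hop=256):
    """Keep the short-time magnitude spectrum; force zero phase."""
    F = np.abs(np.fft.rfft(analysis_frames(x, win, hop), axis=1))
    return overlap_add(np.fft.irfft(F, axis=1), hop, len(x))
```

A proper stimulus generator would also normalize for the overlapped window gain; this sketch omits that for brevity.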
Perception of English Lexical Stress by English and Japanese Speakers: Effect of Duration and “Realistic” Intensity Change
Shinichi Tokuma; Chuo University, Japan

This study investigated the effect of duration and intensity on the perception of English lexical stress by native and non-native speakers of English. The spectral balance of intensity was manipulated in a “realistic” way suggested by Sluijter et al. [1], which is to increase the intensity level in the higher frequency bands (above 500 Hz), as occurs in the realisation of vocal effort. A nonsense English word /[email protected]:[email protected]:/ embedded in a frame sentence was used as the stimulus in the perceptual experiment, where English speakers and two levels of Japanese learners of English (advanced and pre-intermediate) were asked to determine lexical stress locations. The results showed: (1) “realistically” manipulated intensity serves as a strong cue for lexical stress perception of English for all subject groups; (2) advanced Japanese learners of English are, like English speakers, sensitive to duration in lexical stress perception, whereas pre-intermediate Japanese learners are duration-sensitive only to a very limited extent; and (3) intensity, if altered in a proper way, can be as significant a cue as duration in perceiving English lexical stress.

Physical and Perceptual Configurations of Japanese Fricatives from Multidimensional Scaling Analyses
Won Tokuma; Seijo University, Japan

This study investigates the correlations between the physical and perceptual spaces of the voiceless Japanese fricatives /f s S ç h/, using the Multidimensional Scaling technique. The spatial configurations were constructed from spectral distance measures and perceptual similarity judgements. The results show that 2-dimensional solutions adequately account for the data and that the correlations between the two spaces are high. The dimensions also corresponded to ‘sibilance’ and ‘place’ properties. The spectral analyses of fricative sections, excluding the transitions, seem to contain sufficient information for correct perceptual judgements. These results are highly comparable to those of an English fricative study (Choo, 1999), and support a universal prototype theory, according to which the correct identification of speech segments depends on the perceived distance between speech stimuli and a prototype in perceptual space.

An Acquisition Model of Speech Perception with Considerations of Temporal Information
Ching-Pong Au; City University of Hong Kong, China

Human speech perception begins to develop as early as six months of age or even earlier. The development of perception has been suggested to be a self-organizing process driven by the infant's linguistic environment [1]. Self-organizing maps have been widely used for modeling the perceptual development of infants [2]. However, in these models, temporal information within speech is ignored: only single vowels or phones with little variation over time can be represented. In the present model, temporal information of speech is captured by self-feeding input preprocessors, so that sequences of speech components can be learnt by the self-organizing map. The acquisition of both single vowels and diphthongs is demonstrated in this paper.

French Intonational Rises and Their Role in Speech Segmentation
Pauline Welby; Ohio State University, USA

The results of two perception experiments provide evidence that French listeners use the presence of an early intonational rise and its alignment with the beginning of a content word as a cue to speech segmentation.

Session: PWeCf – Poster
Robust Speech Recognition II
Time: Wednesday 13.30, Venue: Main Hall, Level -1
Chair: Nelson Morgan, ICSI and UC Berkeley, USA

Dynamic Channel Compensation Based on Maximum A Posteriori Estimation
Huayun Zhang, Zhaobing Han, Bo Xu; Chinese Academy of Sciences, China

The degradation of speech recognition performance in real-life environments and over transmission channels remains a major obstacle for many speech-based applications, especially when non-stationary noise and changing channels are present. In this paper, we extend our previous work on Maximum-Likelihood (ML) dynamic channel compensation by introducing a phone-conditioned prior statistical model for the channel bias and applying a Maximum A Posteriori (MAP) estimation technique. Compared to the ML-based method, the new MAP-based algorithm tracks variations within channels more effectively. The average structural delay of the algorithm is decreased from 400 ms to 200 ms, which means it works better for short-utterance compensation (as in many real applications). An additional 7∼8% relative reduction in character error rate is observed in telephone-based Mandarin large vocabulary continuous speech recognition (LVCSR). In a short-utterance test, the word error rate was relatively reduced by 30%.

Far-Field ASR on Inexpensive Microphones
Laura Docio-Fernandez 1 , David Gelbart 2 , Nelson Morgan 2 ; 1 Universidad de Vigo, Spain; 2 International Computer Science Institute, USA

For a connected digits speech recognition task, we have compared the performance of two inexpensive electret microphones with that of a single high-quality PZM microphone. Recognition error rates were measured both with and without compensation techniques, where both single-channel and two-channel approaches were used. In all cases the task was recognition at a significant distance (2-6 feet) from the talker’s mouth. The results suggest that the wide variability in characteristics among inexpensive electret microphones can be compensated for without explicit quality control, and that this is particularly effective when both single-channel and two-channel techniques are used. In particular, the resulting performance for the inexpensive microphones used together is essentially equivalent to that of the expensive microphone, and better than for either inexpensive microphone used alone.
Evaluation of ETSI Advanced DSR Front-End and Bias Removal Method on the Japanese Newspaper Article Sentences Speech Corpus
Satoru Tsuge, Shingo Kuroiwa, Kenji Kita; University of Tokushima, Japan

In October 2002, the European Telecommunications Standards Institute (ETSI) recommended a standard Distributed Speech Recognition (DSR) advanced front-end, ETSI ES202 050 version 1.1.1 (ES202). Many studies use this front-end in noise environments on connected digit recognition tasks in several languages. However, we have not seen reports of large vocabulary continuous speech recognition using this front-end on a Japanese speech corpus. Since the DSR system is used for several languages and tasks, we conducted large vocabulary continuous speech recognition experiments using ES202 on a Japanese speech corpus in noise environments. Experimental results show that ES202 has better recognition performance than the previous DSR front-end, ETSI ES201 050 version 1.1.2, under all conditions. In addition, we focus on the influence on DSR recognition performance of acoustic mismatches caused by input devices. DSR employs a vector quantization (VQ) algorithm for feature compression, so the VQ distortion is increased by these mismatches, and large VQ distortion increases the speech recognition error rate. To overcome increases in VQ distortion, we proposed the Bias Removal method (BRM) in previous work. However, that method cannot be applied in real time. Hence, we propose the Real-time Bias Removal method (RBRM) in this paper. Continuous speech recognition experiments on a Japanese speech corpus show that RBRM achieves an 8.7% improvement in the error rate compared to ES202 under noise conditions (SNR=20dB with convolutional noise).

Environmental Sound Source Identification Based on Hidden Markov Model for Robust Speech Recognition
Takanobu Nishiura 1 , Satoshi Nakamura 2 , Kazuhiro Miki 3 , Kiyohiro Shikano 3 ; 1 Wakayama University, Japan; 2 ATR-SLT, Japan; 3 Nara Institute of Science and Technology, Japan

In real acoustic environments, humans communicate with each other through speech by focusing on the target speech among environmental sounds, and we can easily identify the target sound among other environmental sounds. For hands-free speech recognition, the identification of the target speech among environmental sounds is imperative. This mechanism may also be important for a self-moving robot to sense its acoustic environment and communicate with humans. This paper therefore first proposes Hidden Markov Model (HMM)-based environmental sound source identification. Environmental sounds are modeled by three-state HMMs and evaluated using 92 kinds of environmental sounds; the identification accuracy was 95.4%. This paper also proposes a new HMM composition method that composes speech HMMs and an HMM of categorized environmental sounds for robust environmental-sound-added speech recognition. As a result of the evaluation experiments, we confirmed that the proposed HMM composition outperforms the conventional HMM composition with speech HMMs and a noise (environmental sound) HMM trained using noise periods prior to the target speech in a captured signal.

High-Likelihood Model based on Reliability Statistics for Robust Combination of Features: Application to Noisy Speech Recognition
Peter Jančovič, Münevver Köküer, Fionn Murtagh; Queen’s University Belfast, U.K.

This paper introduces a novel statistical approach for the combination of multiple features, assuming no knowledge about the identity of the noisy features. In a given set of features, some features may be dominated by noise. The proposed model deals with the uncertainty about the noisy features by deriving the joint probability of the subset of features with the highest probabilities. The core of the model lies in determining the number of features to be included in the feature subset; this is estimated by calculating the reliability of each feature, defined as its normalized probability, and evaluating the joint maximal reliability. For the evaluation, we used the TIDIGITS database for connected digit recognition. The utterances were corrupted by various types of additive noise, so that the number and identity of the noisy features varied over time (or changed suddenly). The experimental results show that the high-likelihood model achieves recognition performance similar to that obtained with full a-priori knowledge of the identity of the noisy features.

Environment Adaptive Control of Noise Reduction Parameters for Improved Robustness of ASR
Chng Chin Soon 1 , Bernt Andrassy 2 , Josef Bauer 2 , Günther Ruske 1 ; 1 Technical University of Munich, Germany; 2 Siemens AG, Germany

This paper describes an extension to an automatic speech recognition system that improves robustness across varying environments. A dedicated control unit derives an optimal set of parameters for a Wiener-filter-based noise reduction unit, aiming at maximum recognition performance in different environments. The input measure for the control unit is derived from the speech signal; apart from the SNR level, several other measures are investigated. The controlled parameters are closely related to the strength of the noise reduction. Several non-linear methods, such as Tabulated References and Neural Networks, serve as the core of the control unit. Experiments on realistic hands-free as well as non-hands-free speech data show that the word error rate can be reduced by as much as 31% through the proposed methods. An already optimized static configuration of the applied noise reduction serves as the baseline.

Speech Enhancement with Microphone Array and Fourier / Wavelet Spectral Subtraction in Real Noisy Environments
Yuki Denda, Takanobu Nishiura, Hideki Kawahara; Wakayama University, Japan

It is very important to capture distant-talking speech with high quality for teleconferencing systems or voice-controlled systems. For this purpose, microphone array steering and Fourier spectral subtraction, for example, are ideal candidates, and a combination technique using both has been proposed to improve performance. However, it is difficult for the conventional approach to reduce non-stationary noise, although it can robustly reduce stationary noise. To cope with this problem, we propose a new combination technique with microphone array steering and Fourier / wavelet spectral subtraction. Wavelet spectral subtraction promises to reduce non-stationary noise effectively, because the wavelet transform admits a variable time-frequency resolution in each frequency band. As a result of an evaluation experiment in a real room, we confirmed that the proposed combination technique provides better ASR (automatic speech recognition) performance and NRR (noise reduction rate) than the conventional combination technique.

Noise Robust Digit Recognition with Missing Frames
Cenk Demiroglu, David V. Anderson; Georgia Institute of Technology, USA

Noise robustness is one of the most challenging problems in speech recognition research. In this work, we propose a noise-robust and computationally simple system for small vocabulary speech recognition. We approach the noise-robust digit recognition problem with the missing frames idea. The key point behind the missing frames
idea is that frames with energies below a certain threshold are considered unreliable frames. We set these frames to a silence floor
and treat them as silence frames. Performing this operation only
in the decoding stage creates a high mismatch between trained speech
and decoded speech. To solve the mismatch problem, we apply
the same thresholding algorithm on the training data before training. The algorithm adds a negligible computational complexity at
the front end, and decreases the overall computational complexity. Moreover, it outperforms other computationally comparable,
well known methods. This makes the proposed system particularly
suitable for real-time systems.
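The frame-flooring step described in this abstract can be sketched as follows; the threshold and floor values, and the use of a peak-relative threshold, are illustrative assumptions rather than figures from the paper:

```python
import numpy as np

def floor_low_energy_frames(frames, threshold_db=-40.0, floor_db=-60.0):
    """Replace frames whose log energy falls below a threshold with a
    constant silence floor, so they are decoded as silence frames.

    `frames` is a (num_frames, frame_len) array of waveform frames.
    The threshold is taken relative to the loudest frame (an assumed
    policy); `threshold_db` and `floor_db` are illustrative values.
    """
    energy_db = 10.0 * np.log10(np.sum(frames ** 2, axis=1) + 1e-12)
    unreliable = energy_db < energy_db.max() + threshold_db
    out = frames.copy()
    out[unreliable] = 10.0 ** (floor_db / 20.0)   # constant silence floor
    return out, unreliable
```

As the abstract stresses, the same thresholding must also be applied to the training data; flooring frames only at decoding time creates a mismatch between training and decoding.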
A Noise-Robust ASR Back-End Technique Based on
Weighted Viterbi Recognition
Xiaodong Cui 1 , Alexis Bernard 2 , Abeer Alwan 1 ; 1 University of California at Los Angeles, USA; 2 Texas Instruments Inc., USA
The performance of speech recognition systems trained in quiet degrades significantly under noisy conditions. To address this problem, a Weighted Viterbi Recognition (WVR) algorithm that is a function of the SNR of each speech frame is proposed. Acoustic models
trained on clean data, and the acoustic front-end features are kept
unchanged in this approach. Instead, a confidence/robustness factor is assigned to the output observation probability of each speech
frame according to its SNR estimate during the Viterbi decoding stage. Comparative experiments are conducted with Weighted
Viterbi Recognition with different front-end features such as MFCC,
LPCC and PLP. Results show consistent improvements with all three
feature vectors. For a reasonable size of adaptation data, WVR outperforms environment adaptation using MLLR.
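A minimal sketch of SNR-weighted Viterbi decoding follows, using an exponential weighting of each frame's observation log-likelihood; the linear SNR-to-confidence ramp is an assumption, not the authors' weighting function:

```python
import numpy as np

def snr_confidence(snr_db, lo=0.0, hi=20.0):
    """Map a frame SNR estimate to a confidence in [0, 1] (assumed linear ramp)."""
    return float(np.clip((snr_db - lo) / (hi - lo), 0.0, 1.0))

def weighted_viterbi(log_b, log_a, log_pi, snr_db):
    """Viterbi decoding where each frame's observation log-likelihood is
    scaled by a per-frame confidence factor gamma_t.

    log_b: (T, N) observation log-likelihoods per state,
    log_a: (N, N) transition log-probabilities, log_pi: (N,) initial
    log-probabilities, snr_db: per-frame SNR estimates.
    """
    T, N = log_b.shape
    gamma = np.array([snr_confidence(s) for s in snr_db])
    delta = log_pi + gamma[0] * log_b[0]
    psi = np.zeros((T, N), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + log_a          # scores[i, j]: i -> j
        psi[t] = scores.argmax(axis=0)           # best predecessor of j
        delta = scores.max(axis=0) + gamma[t] * log_b[t]
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):                # backtrack
        path.append(int(psi[t][path[-1]]))
    return path[::-1]
```

With gamma fixed at 1 this reduces to standard Viterbi decoding; low-SNR frames have their acoustic evidence down-weighted toward zero.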
Energy Contour Extraction for In-Car Speech Recognition
Tai-Hwei Hwang; Industrial Technology Research Institute, Taiwan

The time derivatives of speech energy, such as the delta and delta-delta log energy, are known to be critical features for automatic speech recognition (ASR). However, at lower signal-to-noise ratios (SNRs) their discriminative ability can be limited, or they can even become harmful, because of the corruption of the energy contour. By taking advantage of the spectral characteristics of in-car noise, the speech energy contour is extracted from a high-pass filtered signal so as to reduce the distortion in the delta energy. Such filtering can be implemented using a pre-emphasis-like filter or a summation of higher frequency band energies. A Chinese name recognition task is conducted to evaluate the proposed method, using real in-car speech and artificially generated speech as the test data. As the experimental results show, the method is capable of improving the recognition accuracy of in-car speech at lower SNRs as well as of clean speech.

Noise-Robust ASR by Using Distinctive Phonetic Features Approximated with Logarithmic Normal Distribution of HMM
Takashi Fukuda, Tsuneo Nitta; Toyohashi University of Technology, Japan

Various approaches focused on noise robustness have been investigated with the aim of using an automatic speech recognition (ASR) system in practical environments. We have previously proposed a distinctive phonetic feature (DPF) parameter set for a noise-robust ASR system, which reduced the effect of high-level additive noise [1]. This paper describes an attempt to replace the normal distributions (NDs) of DPFs with logarithmic normal distributions (LNDs) in HMMs, because DPFs show skew symmetry, or positive and negative skewness. The HMM with LNDs was first evaluated against a standard HMM with NDs in an isolated spoken-word recognition task with clean speech. Noise robustness was then tested with four types of additive noise. With DPFs as the input feature vector set, the proposed HMM with LNDs outperforms the standard HMM with NDs in the isolated spoken-word recognition task, both with clean speech and with speech contaminated by additive noise. Furthermore, we achieved significant improvements over a baseline system with MFCC and a dynamic feature set when combining the DPFs with static MFCCs and ∆P.

Voice Quality Normalization in an Utterance for Robust ASR
Muhammad Ghulam, Takashi Fukuda, Tsuneo Nitta; Toyohashi University of Technology, Japan

In this paper, we propose a novel method of normalizing the voice quality in an utterance for both clean speech and speech contaminated by noise. The normalization method is applied to the N-best hypotheses from an HMM-based classifier; an SM (Subspace Method)-based verifier then tests the hypotheses after normalizing the monophone scores together with the HMM-based likelihood score. The HMM-SM-based speech recognition system was proposed previously [1, 2] and successfully implemented on a speaker-independent word recognition task and an OOV word rejection task. We extend the proposed system to a connected digit string recognition task by exploring the effect of voice quality normalization in an utterance for robust ASR, and compare it with HMM-based recognition systems using utterance-level, word-level, monophone-level, and state-level normalization. Experimental results on connected 4-digit strings showed that the word accuracy was significantly improved: from 95.7% obtained by the typical HMM-based system with utterance-level normalization to 98.2% obtained by the HMM-SM-based system for clean speech, from 88.1% to 91.5% for noise-added speech with SNR=10dB, and from 72.4% to 76.4% for noise-added speech with SNR=5dB, while the other HMM-based systems showed lower performance.

Environmental Sniffing: Robust Digit Recognition for an In-Vehicle Environment
Murat Akbacak, John H.L. Hansen; University of Colorado at Boulder, USA

In this paper, we propose to integrate an Environmental Sniffing [1] framework into an in-vehicle hands-free digit recognition task. The Environmental Sniffing framework is focused on detecting, classifying, and tracking changing acoustic environments. Here, we extend the framework to detect and track acoustic environmental conditions in a noisy-speech audio stream. The knowledge extracted about the acoustic environmental conditions is used to determine which environment-dependent acoustic model to use. The Critical Performance Rate (CPR), previously considered in [1], is formulated and calculated for this task. The sniffing framework is compared to a ROVER solution for automatic speech recognition (ASR) using different noise-conditioned recognizers, in terms of Word Error Rate (WER) and CPU usage. Results show that the model matching scheme using the knowledge extracted from the audio stream by Environmental Sniffing does a better job than the ROVER solution in both accuracy and computation. A relative 11.1% WER improvement is achieved with a relative 75% reduction in CPU resources.

Noise-Robust Automatic Speech Recognition Using Orthogonalized Distinctive Phonetic Feature Vectors
Takashi Fukuda, Tsuneo Nitta; Toyohashi University of Technology, Japan

With the aim of using an automatic speech recognition (ASR) system in practical environments, various approaches focused on noise robustness, such as noise adaptation and reduction techniques, have been investigated. We have previously proposed a distinctive phonetic feature (DPF) parameter set for a noise-robust ASR system, which reduced the effect of high-level additive noise [1]. This paper describes an attempt to apply an orthogonalized DPF parameter set as the input of HMMs. In our proposed method, orthogonal bases are calculated from conventional DPF vectors that represent 38 Japanese phonemes; the Karhunen-Loeve transform (KLT) is then used to orthogonalize the DPFs output from a multilayer neural network (MLN), using these orthogonal bases. In experiments, the orthogonalized DPF parameters were first compared with the original DPF parameters on an isolated spoken-word recognition task with clean speech. Noise robustness was then tested with four types of additive noise. The proposed orthogonalized DPFs can reduce the error
rate in an isolated spoken-word recognition task both with clean
speech and with speech contaminated by additive noise. Furthermore, we achieved significant improvements over a baseline system
with MFCC and dynamic feature-set when combining the orthogonalized DPFs with conventional static MFCCs and ∆P.
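The KLT orthogonalization step can be sketched as a projection onto the eigenvectors of the covariance matrix of the training DPF vectors; the dimensionality and data below are placeholders, not the paper's 38-phoneme DPF set:

```python
import numpy as np

def klt_basis(dpf_vectors):
    """Orthogonal bases via the Karhunen-Loeve transform: eigenvectors of
    the covariance matrix of the training DPF vectors, sorted by
    decreasing eigenvalue.
    """
    centered = dpf_vectors - dpf_vectors.mean(axis=0)
    cov = centered.T @ centered / len(dpf_vectors)
    eigvals, eigvecs = np.linalg.eigh(cov)     # ascending order for symmetric cov
    order = np.argsort(eigvals)[::-1]
    return eigvecs[:, order]

def orthogonalize(dpf, basis):
    """Project a DPF vector (e.g. an MLN output) onto the KLT bases,
    decorrelating its components."""
    return dpf @ basis
```

The projected components are mutually decorrelated, which suits the diagonal-covariance Gaussians typically used in HMM output distributions.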
Language Model Accuracy and Uncertainty in Noise Cancelling in the Stochastic Weighted Viterbi Algorithm
Nestor Becerra Yoma, Iván Brito, Jorge Silva; University of Chile, Chile

In this paper, Stochastic Weighted Viterbi (SWV) decoding is combined with language modeling, which in turn guides the Viterbi decoding in those intervals where the information provided by noisy frames is not reliable. In other words, knowledge from higher layers (e.g. the language model) compensates for the low accuracy of the information provided by the acoustic-phonetic modeling where the original clean speech signal is not reliably estimated. Bigram and trigram language models are tested, and in combination with spectral subtraction, the SWV algorithm can lead to reductions as high as 20% or 45% in word error rate (WER) using a rough estimate of the additive noise made in a short non-speech interval. The results presented here also suggest that the higher the language model accuracy, the higher the improvement due to SWV. This paper proposes that the problem of noise robustness in speech recognition should be addressed in two different contexts: first, at the acoustic-phonetic level only, as in small vocabulary tasks with a flat language model; and second, by integrating noise cancelling with information from higher layers.

Session: PWeCg – Poster
Multi-Modal Processing & Speech Interface Design
Time: Wednesday 13.30, Venue: Main Hall, Level -1
Chair: Florian Schiel, Bavarian Archive for Speech Signals (BAS), München, Germany

An Integrated System for Smart-Home Control of Appliances Based on Remote Speech Interaction
Ilyas Potamitis, K. Georgila, Nikos Fakotakis, George Kokkinakis; University of Patras, Greece

We present an integrated system that uses speech as a natural input modality to provide user-friendly access to information and entertainment devices installed in a real home environment. The system is based on a combination of beamforming techniques and speech recognition. The general problem addressed in this work is that of hands-free speech recognition in a reverberant room where users walk while engaged in conversation in the presence of different types of house-specific noise conditions (e.g. TV/radio broadcast, interfering speakers, ventilator/air-conditioning noise, etc.). The paper focuses on implementation details and practical considerations concerning the integration of diverse technologies into a working system.

A Spoken Language Interface to an Electronic Programme Guide
Jianhong Jin 1 , Martin J. Russell 1 , Michael J. Carey 2 , James Chapman 3 , Harvey Lloyd-Thomas 4 , Graham Tattersall 5 ; 1 University of Birmingham, U.K.; 2 University of Bristol, U.K.; 3 BT Exact Technologies, U.K.; 4 Ensigma Technologies, U.K.; 5 Snape Signals Research, U.K.

This paper describes research into the development of personalised spoken language interfaces to an electronic programme guide. A substantial data collection exercise has been conducted, resulting in a corpus of nearly 10,000 spoken queries to an electronic programme guide by a total of 64 subjects. A substantial part of the corpus comprises recordings of many queries from a small number of ‘core’ subjects, to facilitate research into personalisation and the construction of user profiles. This spoken query data is supported by a second corpus which contains a record of subjects’ viewing habits over a two-year period. Finally, the two corpora have been combined to create two information retrieval test sets. Two probabilistic information retrieval systems are described, and the results obtained on the PUMA IR test sets using these systems are presented.

Towards a Personal Robot with Language Interface
L. Seabra Lopes, António Teixeira, M. Rodrigues, D. Gomes, C. Teixeira, L. Ferreira, P. Soares, J. Girão, N. Sénica; Universidade de Aveiro, Portugal

The development of robots capable of accepting instructions in terms of concepts familiar to the user is still a challenge. For such robots to emerge, the development of natural language interfaces is essential, since this is regarded as the only acceptable interface for a machine expected to have a high level of interactivity with humans. Our group has been involved for several years in the development of a mobile intelligent robot, named Carl, designed with tasks in mind such as serving food at a reception or acting as a host in an organization. The approach followed in the design of Carl is based on an explicit concern with the integration of the major dimensions of intelligence, namely Communication, Action, Reasoning and Learning. This paper focuses on the multimodal human-robot language communication capabilities of Carl, which have been significantly improved during the last year.

Preference, Perception, and Task Completion of Open, Menu-Based, and Directed Prompts for Call Routing: A Case Study
Jason D. Williams, Andrew T. Shaw, Lawrence Piano, Michael Abt; Edify Corporation, USA

Usability subjects’ success with, and preference among, Open, Menu-based and Directed Strategy dialogs for a call routing application in the consumer retail industry are assessed. Each subject experienced two strategies and was asked for a preference. Task completion was assessed, and subjective feedback was taken through Likert-scale questions. Preference and task completion scores were highest for one of the Directed strategies; another Directed strategy was least preferred, and the Open strategy had the lowest task completion score.

An Integrated Toolkit Deploying Speech Technology for Computer Based Speech Training with Application to Dysarthric Speakers
Athanassios Hatzis 1 , Phil Green 1 , James Carmichael 1 , Stuart Cunningham 2 , Rebecca Palmer 1 , Mark Parker 1 , Peter O’Neill 2 ; 1 University of Sheffield, U.K.; 2 Barnsley District General Hospital NHS Trust, U.K.

Computer based speech training systems aim to provide the client with customised tools for improving articulation based on audiovisual stimuli and feedback. They require the integration of various components of speech technology, such as speech recognition and transcription tools, and a database management system which supports multiple on-the-fly configurations of the speech training application. This paper describes the requirements and development of STRAPTk, the Speech Training Application Toolkit (www.dcs.shef.ac.uk/spandh/projects/straptk), from the point of view of developers, clinicians, and clients in the domain of speech training for severely dysarthric speakers. Preliminary results from an extended field trial are presented.

Towards Best Practices for Speech User Interface Design
Bernhard Suhm; BBN Technologies, USA

Designing speech interfaces is difficult. Research on spoken language systems and commercial application development has created a body of speech interface design knowledge. However, this knowledge is not easily accessible to practitioners, and few experts understand both speech recognition and human factors well enough to avoid the pitfalls of speech interface design. To facilitate the design of better speech interfaces, this paper presents a methodology for compiling design guidelines for various classes of speech interfaces. Such guidelines enable practitioners to apply discount usability engineering methods to speech interfaces, including obtaining guidance during early stages of design, and heuristic evaluation. To illustrate our methodology, we apply it to generate a short list of ten guidelines for telephone spoken dialog systems. We demonstrate the usefulness of the guidelines with examples from our consulting practice, applying each guideline to improve poorly designed prompts. We believe this methodology can facilitate compiling the growing body of design knowledge into best practices for important classes of speech interfaces.
Design and Evaluation of a Limited Two-Way
Speech Translator
David Stallard, John Makhoul, Frederick Choi, Ehry
Macrostie, Premkumar Natarajan, Richard Schwartz,
Bushra Zawaydeh; BBN Technologies, USA
We present a limited speech translation system for English and
colloquial Levantine Arabic, which we are currently developing as
part of the DARPA Babylon program. The system is intended for
question/answer communication between an English-speaking operator and an Arabic-speaking subject. It uses speech recognition
to convert a spoken English question into text, and plays out a prerecorded speech file corresponding to the Arabic translation of this
text. It then uses speech recognition to convert the Arabic reply into
Arabic text, and does information extraction on this text to find the
answer content, which is rendered into English. A novel aspect of
our work is its use of a statistical classifier to extract information
content from the Arabic text. We present evaluation results for both
individual components and the end-to-end system.
Multimodal Interaction on PDAs Integrating
Speech and Pen Inputs
Sorin Dusan 1 , Gregory J. Gadbois 2 , James
Flanagan 1 ; 1 Rutgers University, USA; 2 HandHeld
Speech LLC, USA
Recent efforts in the field of mobile computing are directed toward speech-enabling portable computers. This paper presents
a method of multimodal interaction and an application which integrates speech and pen on mobile computers. The application
is designed for documenting traffic accident diagrams by police.
The novelty of this application is due to a) its method of fusing
the speech and pen inputs, and b) its fully embedded speech engine. Preliminary experiments showed flexibility, versatility and increased naturalness and user satisfaction during multimodal interaction.
Towards Multimodal Interaction with an Intelligent
Room
September 1-4, 2003 – Geneva, Switzerland
shown at the 2003 North American International Auto Show in Detroit. The system, including a touch screen and a speech recognizer,
is used for controlling several non-critical automobile operations,
such as climate, entertainment, navigation, and telephone. The prototype implements a natural language spoken dialog interface integrated with an intuitive graphical user interface, as opposed to the
traditional, speech only, command-and-control interfaces deployed
in some of the vehicles currently on the market.
Context Awareness Using Environmental Noise
Classification
L. Ma, D.J. Smith, Ben P. Milner; University of East
Anglia, U.K.
Context-awareness is essential to the development of adaptive information systems. Environmental noise can provide a rich source
of information about the current context. We describe our approach
for automatically sensing and recognising noise from typical environments of daily life, such as office, car and city street. In this
paper we present our hidden Markov model based noise classifier.
We describe the architecture of the system, compare classification
results from the system with human listening tests, and discuss
open issues in environmental noise classification for mobile computing.
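The environment classification step described above (score each environment's trained model against the observed noise features and pick the best) can be sketched as follows. This is a minimal single-Gaussian stand-in for the authors' hidden Markov model classifier, with invented feature values and class models:

```python
import math

def gauss_loglik(frames, means, variances):
    """Total log-likelihood of feature frames under a diagonal Gaussian."""
    total = 0.0
    for frame in frames:
        for x, m, v in zip(frame, means, variances):
            total += -0.5 * (math.log(2 * math.pi * v) + (x - m) ** 2 / v)
    return total

def classify(frames, class_models):
    """Pick the environment class whose model scores the frames highest."""
    return max(class_models, key=lambda c: gauss_loglik(frames, *class_models[c]))

# Toy 1-D "noise feature" models for two environments (illustrative numbers).
models = {
    "office": ([0.0], [1.0]),
    "street": ([5.0], [1.0]),
}
print(classify([[4.8], [5.3]], models))  # a street-like observation
```

A real system would score sequences of cepstral features under per-environment HMMs, but the decision rule is the same argmax over class likelihoods.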
Simple Designing Methods of Corpus-Based Visual
Speech Synthesis
Tatsuya Shiraishi 1 , Tomoki Toda 2 , Hiromichi
Kawanami 1 , Hiroshi Saruwatari 1 , Kiyohiro Shikano 1 ;
1 Nara Institute of Science and Technology, Japan; 2 ATR-SLT, Japan
This paper describes simple designing methods of corpus-based visual speech synthesis. Our approach needs only a synchronous real
image and speech database. Visual speech is synthesized by concatenating real image segments and speech segments selected from
the database. In order to automatically perform all processes, e.g.
feature extraction, segment selection and segment concatenation,
we simply design two types of visual speech synthesis. One is synthesizing visual speech using synchronous real image and speech
segments selected using only speech information. The other uses speech segment selection and image segment selection with features extracted from the database, without any manual processing. We
performed objective and subjective experiments to evaluate these
designing methods. As a result, the latter method can synthesize
visual speech more naturally than the former method.
Comparing the Usability of a User Driven and a
Mixed Initiative Multimodal Dialogue System for
Train Timetable Information
Janienke Sturm 1 , Ilse Bakx 2 , Bert Cranen 1 , Jacques
Terken 2 ; 1 University of Nijmegen, The Netherlands;
2 University of Eindhoven, The Netherlands
Petra Gieselmann 1 , Matthias Denecke 2 ; 1 Universität
Karlsruhe, Germany; 2 Carnegie Mellon University,
USA
There is a great potential for combining speech and gestures to
improve human computer interaction because this kind of communication resembles more and more the natural communication humans use every day with each other. Therefore, this paper is about
the multimodal interaction consisting of speech and gestures in an
intelligent room. The advantages of using multimodal systems are
explained and we present the gesture recognizer and the dialogue
system we use. We explain how the information from the different
modalities is parsed and integrated in one semantic representation.
The aim of the study presented in this paper was to compare the
usability of a user driven and a mixed initiative user interface of a
multimodal system for train timetable information. The evaluation
shows that the effectiveness of the two interfaces does not differ significantly. However, as a result of the absence of spoken prompts
and the obligatory use of buttons to provide values, the efficiency
of the user driven interface is much higher than the efficiency of the
mixed initiative interface. Although the user satisfaction was not
significantly higher for the user driven interface, by far most people
preferred the user driven interface to the mixed initiative interface.
Read My Tongue Movements: Bimodal Learning to
Perceive and Produce Non-Native Speech /r/ and
/l/
A Multimodal Conversational Interface for a
Concept Vehicle
Roberto Pieraccini 1 , Krishna Dayanidhi 1 , Jonathan
Bloom 1 , Jean-Gui Dahan 1 , Michael Phillips 1 , Bryan R.
Goodman 2 , K. Venkatesh Prasad 2 ; 1 SpeechWorks
International, USA; 2 Ford Motor Co., USA
Dominic W. Massaro, Joanna Light; University of
California at Santa Cruz, USA
This paper describes a prototype of a conversational system that
was implemented on the Ford Model U Concept Vehicle and first
This study investigated the effectiveness of Baldi for teaching nonnative phonetic contrasts, by comparing instruction illustrating the
internal articulatory processes of the oral cavity versus instruction
providing just the normal view of the tutor’s face. Eleven Japanese
speakers of English as a second language were bimodally trained
under both instruction methods to identify and produce American
English /r/ and /l/ in a within-subject design. Speech identification
and production improved under both training methods although
training with a view of the internal articulators did not show an
additional benefit. A generalization test showed that this learning
transferred to the production of new words.
Low Resource Lip Finding and Tracking Algorithm
for Embedded Devices
Jesús F. Guitarte Pérez 1 , Klaus Lukas 1 , Alejandro F.
Frangi 2 ; 1 Siemens AG, Germany; 2 University of
Zaragoza, Spain
One of the greatest challenges in Lip Reading is to apply this technology to an embedded device. In current solutions, the heavy use of resources, especially for visual processing, makes implementation on a small device very difficult.
In this article a new, efficient and straightforward algorithm for detection and tracking of lips is presented. Lip Finding and Tracking
is the first step in visual processing for Lip Reading. In our approach, Lip Finding is performed among a small number of blobs, which must fulfill a geometric restriction. In terms of computational power and memory, the proposed algorithm meets the requirements of an embedded device; on average, less than 4 MHz
of CPU is required. This algorithm shows promising results in a realistic environment accomplishing successful lip finding and tracking
in 94.2% of more than 4900 image frames.
Detection and Separation of Speech Segment Using
Audio and Video Information Fusion
Futoshi Asano 1 , Yoichi Motomura 1 , Hideki Asoh 1 ,
Takashi Yoshimura 1 , Naoyuki Ichimura 1 , Kiyoshi
Yamamoto 2 , Nobuhiko Kitawaki 2 , Satoshi
Nakamura 3 ; 1 AIST, Japan; 2 Tsukuba University,
Japan; 3 ATR-SLT, Japan
In this paper, a method of detecting and separating speech events in
a multiple-sound-source condition using audio and video information is proposed. For detecting speech events, sound localization
using a microphone array and human tracking by stereo vision are
combined by a Bayesian network. From the inference results of
the Bayesian network, the information on the time and location of
speech events can be obtained in a multiple-sound-source condition.
Based on the detected speech event information, a maximum likelihood adaptive beamformer is constructed and the speech signal is
separated from the background noise and interferences.
Resynthesis of 3D Tongue Movements from Facial
Data
Olov Engwall, Jonas Beskow; KTH, Sweden
Simultaneous measurements of tongue and facial motion, using a
combination of electromagnetic articulography (EMA) and optical
motion tracking, are analysed to investigate the possibility of resynthesizing the subject’s tongue movements with a parametrically controlled 3D model using the facial data only. The recorded material consists of 63 VCV words spoken by one Swedish subject. The
tongue movements are resynthesized using a combination of a linear estimation to predict the tongue data from the face and an inversion procedure to determine the articulatory parameters of the
model.
Acquiring Lexical Information from Multilevel
Temporal Annotations
and a method is presented for automatically accomplishing this
task, and evaluated using German, Japanese and Anyi (W. Africa)
corpora.
LUCIA: A New Italian Talking-Head Based on a
Modified Cohen-Massaro’s Labial Coarticulation
Model
Piero Cosi, Andrea Fusaro, Graziano Tisato;
ISTC-CNR, Italy
LUCIA, a new Italian talking head based on a modified version of the
Cohen-Massaro’s labial coarticulation model is described. A semi-automatic minimization technique, working on real kinematic data,
acquired by the ELITE optoelectronic system, was used to train the
dynamic characteristics of the model. LUCIA is an MPEG-4 standard
facial animation system working on standard FAP visual parameters
and speaking with the Italian version of FESTIVAL TTS.
A Visual Context-Aware Multimodal System for
Spoken Language Processing
Niloy Mukherjee, Deb Roy; Massachusetts Institute of
Technology, USA
Recent psycholinguistic experiments show that acoustic and syntactic aspects of online speech processing are influenced by visual
context through cross-modal influences. During interpretation of
speech, visual context seems to steer speech processing and vice
versa. We present a real-time multimodal system motivated by
these findings that performs early integration of visual contextual
information to recognize the most likely word sequences in spoken
language utterances. The system first acquires a grammar and a visually grounded lexicon from a “show-and-tell” procedure where the
training input consists of camera images containing sets of objects, paired with verbal object descriptions. Given a new scene, the
system generates a dynamic visually-grounded language model and
drives a dynamic model of visual attention to steer speech recognition search paths towards more likely word sequences.
Session: OWeDb– Oral
Speech Recognition - Language Modeling
Time: Wednesday 16.00, Venue: Room 2
Chair: Jean-Luc Gauvain, LIMSI, France
Maximum Entropy Good-Turing Estimator for
Language Modeling
Juan P. Piantanida, Claudio F. Estienne; University of
Buenos Aires, Argentina
In this paper, we propose a new formulation of the classical Good-Turing estimator for n-gram language models. The new approach
is based on defining a dynamic model for language production. Instead of assuming a fixed probability distribution of occurrence of
an n-gram on the whole text, we propose a maximum entropy approximation of a time varying distribution. This approximation led
us to a new distribution, which in turn is used to calculate expectations of the Good-Turing estimator. This defines a new estimator
that we call Maximum Entropy Good-Turing estimator. Contrary to
the classical Good-Turing estimator, it needs neither expectation approximations nor windowing or other smoothing techniques. It also contains the well-known discounting estimators as special cases.
Performance is evaluated both in terms of perplexity and word error
rate in an N-best re-scoring task. Also comparison to other classical
estimators is performed. In all cases our approach performs significantly better than classical estimators.
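For reference, the classical Good-Turing estimator that the paper reformulates replaces each observed count r by r* = (r + 1) N(r+1)/N(r), where N(r) is the number of n-gram types seen exactly r times. A minimal sketch with toy counts (this is the classical estimator, not the proposed maximum-entropy variant):

```python
from collections import Counter

def good_turing_counts(counts):
    """Classical Good-Turing adjusted counts: r* = (r + 1) * N[r+1] / N[r],
    where N[r] is the number of n-gram types seen exactly r times."""
    freq_of_freq = Counter(counts.values())
    adjusted = {}
    for ngram, r in counts.items():
        n_r, n_r1 = freq_of_freq[r], freq_of_freq.get(r + 1, 0)
        # Fall back to the raw count when N[r+1] is zero (sparse high counts).
        adjusted[ngram] = (r + 1) * n_r1 / n_r if n_r1 else r
    return adjusted

counts = {"a b": 1, "b c": 1, "c d": 1, "d e": 2, "e f": 3}
print(good_turing_counts(counts)["a b"])  # r=1 is adjusted to 2 * N2/N1 = 2/3
```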
Exploiting Order-Preserving Perfect Hashing to
Speedup N-Gram Language Model Lookahead
Thorsten Trippel, Felix Sasaki, Benjamin Hell, Dafydd
Gibbon; Universität Bielefeld, Germany
The extraction of lexical information for machine readable lexica
from multilevel annotations is addressed in this paper. Relations
between these levels of annotation are used for sub-classification of
lexical entries. A method for relating annotation units is presented,
based on a temporal calculus. Relating the annotation units manually is error-prone, time consuming and tends to be inconsistent,
Xiaolong Li, Yunxin Zhao; University of
Missouri-Columbia, USA
Minimum Perfect Hashing (MPH) has recently been shown successful in reducing Language Model (LM) lookahead time in LVCSR decoding. In this paper we propose to exploit the order-preserving
(OP) property of a string-key based MPH function to further reduce
hashing operations and speed up LM lookahead. A subtree structure
is proposed for LM lookahead and an order-preserving MPH is integrated into the structure design. Subtrees are generated on demand
and stored in caches. Experiments were performed on Switchboard
data. By using the proposed method of OP MPH and subtree cache
structure for both trigrams and backoff bigrams, the LM lookahead
time was reduced by a factor of 2.9 in comparison with the baseline
case of using MPH alone.
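Language model lookahead itself can be illustrated independently of the hashing: during lexical-tree search, each tree node is assigned the best LM probability of any word reachable below it. A sketch with an invented three-word lexicon (the authors' order-preserving MPH and subtree caches address how such tables are stored and computed quickly, which is not shown here):

```python
def lookahead_scores(lm_prob):
    """LM lookahead table: for every spelling prefix, the best (maximum)
    LM probability over all words sharing that prefix."""
    best = {}
    for word, prob in lm_prob.items():
        for i in range(1, len(word) + 1):
            prefix = word[:i]
            best[prefix] = max(best.get(prefix, 0.0), prob)
    return best

# Invented unigram probabilities for a three-word lexicon.
lm = {"cat": 0.5, "car": 0.3, "dog": 0.2}
scores = lookahead_scores(lm)
print(scores["ca"])  # best word below prefix "ca" is "cat": 0.5
```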
Stem-Based Maximum Entropy Language Models
for Inflectional Languages
Dimitrios Oikonomidis, Vassilios Digalakis; Technical
University of Crete, Greece
In this work we build language models using three different training
methods: n-gram, class-based and maximum entropy models. The
main issue is the use of stem information to cope with the very large
number of distinct words of an inflectional language, like Greek. We
compare the three models with both perplexity and word error rate.
We also examine thoroughly the perplexity differences of the three
models on specific subsets of words.
Combination of a Hidden Tag Model and a
Traditional N-Gram Model: A Case Study in Czech
Speech Recognition
morphosyntactic language model. The architecture of the recognition system is based on the weighted finite-state transducer (WFST)
paradigm. Thanks to the flexible transducer-based architecture, the
morphosyntactic component is integrated seamlessly with the basic modules with no need to modify the decoder itself. We compare
the phoneme, morpheme, and word error-rates as well as the sizes
of the recognition networks in two configurations. In one configuration we use only the N-gram model while in the other we use the
combined model. The proposed stochastic morphosyntactic language model decreases the morpheme error rate by between 1.7
and 7.2% relatively when compared to the baseline trigram system.
The morpheme error-rate of the best configuration is 18% and the
best word error-rate is 22.3%.
Session: OWeDc– Oral
Speech Modeling & Features IV
Time: Wednesday 16.00, Venue: Room 3
Chair: Katrin Kirchhoff, University of Washington, USA
Locus Equations Determination Using the
SpeechDat(II)
Bojan Petek; University of Ljubljana, Slovenia
Pavel Krbec, Petr Podveský, Jan Hajič; Charles
University, Czech Republic
A speech recognition system targeting highly inflective languages is
described that combines the traditional trigram language model and
an HMM tagger, obtaining results superior to the trigram language
model itself. An experiment in speech recognition of Czech has
been performed with promising results.
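The abstract does not specify how the two models are combined; a standard option is linear interpolation of their probability estimates, sketched below with purely illustrative numbers:

```python
def interpolate(p_trigram, p_tagger, lam=0.7):
    """Linear interpolation of two language-model estimates:
    P(w | h) = lam * P_trigram(w | h) + (1 - lam) * P_tagger(w | h)."""
    return lam * p_trigram + (1.0 - lam) * p_tagger

# Illustrative numbers only: the paper does not publish its weights.
p = interpolate(0.02, 0.10, lam=0.7)  # 0.7*0.02 + 0.3*0.10, i.e. 0.044
print(round(p, 3))  # → 0.044
```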
Unlimited Vocabulary Speech Recognition Based
on Morphs Discovered in an Unsupervised Manner
This paper presents a corpus-based approach to the determination of locus equations for the Slovenian language. The SpeechDat(II) spoken
language database is analyzed first for all available target VCV contexts in order to yield candidate subsets for the acoustic-phonetic
measurements. Only the VCVs embedded within judiciously chosen carrier utterances are then selected for the (F2vowel , F2onset )
measurements. The paper discusses challenges, methodology,
and results obtained on the 1000-speaker Slovenian SpeechDat(II)
database in the framework of /VbV/, /VdV/, and /VgV/-based determination of locus equations.
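A locus equation is a linear regression of F2 at the consonant-vowel onset against F2 at the vowel nucleus, fitted separately per consonant context. A minimal least-squares sketch with invented formant values in Hz:

```python
def locus_equation(f2_vowel, f2_onset):
    """Least-squares fit of the locus equation F2onset = k * F2vowel + c."""
    n = len(f2_vowel)
    mx = sum(f2_vowel) / n
    my = sum(f2_onset) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(f2_vowel, f2_onset))
    sxx = sum((x - mx) ** 2 for x in f2_vowel)
    k = sxy / sxx
    return k, my - k * mx

# Illustrative (F2vowel, F2onset) pairs for one consonant context.
xs = [800.0, 1200.0, 1600.0, 2000.0]
ys = [1000.0, 1200.0, 1400.0, 1600.0]
k, c = locus_equation(xs, ys)
print(k, c)  # slope 0.5, intercept 600.0
```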
A Memory-Based Approach to Cantonese Tone
Recognition
Vesa Siivola, Teemu Hirsimäki, Mathias Creutz, Mikko
Kurimo; Helsinki University of Technology, Finland
We study continuous speech recognition based on sub-word units
found in an unsupervised fashion. For agglutinative languages like
Finnish, traditional word-based n-gram language modeling does not
work well due to the huge number of different word forms. We
use a method based on the Minimum Description Length principle to split words statistically into subword units allowing efficient
language modeling and unlimited vocabulary. The perplexity and
speech recognition experiments on Finnish speech data show that
the resulting model outperforms both word- and syllable-based trigram models. Compared to the word trigram model, the out-of-vocabulary rate is reduced from 20% to 0% and the word error rate
from 56% to 32%.
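The effect of splitting words into subword units can be illustrated with a small dynamic-programming segmentation over a fixed unit inventory. The authors' method also learns the inventory itself with a Minimum Description Length criterion; here the Finnish-like units and their costs are invented:

```python
import math

def segment(word, unit_cost):
    """Split a word into subword units minimising total unit cost
    (a stand-in for a Minimum-Description-Length-style criterion)."""
    n = len(word)
    best = [0.0] + [math.inf] * n   # best[i]: cheapest segmentation of word[:i]
    back = [0] * (n + 1)
    for i in range(1, n + 1):
        for j in range(i):
            unit = word[j:i]
            if unit in unit_cost and best[j] + unit_cost[unit] < best[i]:
                best[i] = best[j] + unit_cost[unit]
                back[i] = j
    units, i = [], n
    while i > 0:
        units.append(word[back[i]:i])
        i = back[i]
    return list(reversed(units))

# Invented unit inventory with per-unit costs.
costs = {"talo": 1.0, "talossa": 2.5, "ssa": 1.0, "ni": 1.0}
print(segment("talossani", costs))  # → ['talo', 'ssa', 'ni'] (cost 3.0)
```

Because any word can be covered by the unit inventory, the recogniser's vocabulary becomes effectively unlimited, which is how the out-of-vocabulary rate drops to zero.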
Evaluation of the Stochastic Morphosyntactic
Language Model on a One Million Word Hungarian
Dictation Task
Máté Szarvas, Sadaoki Furui; Tokyo Institute of
Technology, Japan
Michael Emonts 1 , Deryle Lonsdale 2 ; 1 Sony
Electronics, USA; 2 Brigham Young University, USA
This paper introduces memory-based learning as a viable approach
for Cantonese tone recognition. The memory-based learning algorithm employed here outperforms other documented approaches to this problem, which are based on neural networks. Various numbers of tones and features are modeled to find the best
method for feature selection and extraction. To further optimize
this approach, experiments are performed to isolate the best feature-weighting method, the best class-voting weight method, and the best value of k. Results and possible
future work are discussed.
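The memory-based classification scheme the abstract describes (store training examples, classify by a weighted vote of the k nearest neighbours) can be sketched as follows; the two-dimensional feature space, weights and tone labels are invented for illustration:

```python
from collections import defaultdict

def knn_classify(query, examples, weights, k=3):
    """Memory-based (k-nearest-neighbour) classification with per-feature
    weights: distance-weighted vote over the k closest stored examples."""
    def dist(a, b):
        return sum(w * (x - y) ** 2 for w, x, y in zip(weights, a, b)) ** 0.5
    nearest = sorted(examples, key=lambda e: dist(query, e[0]))[:k]
    votes = defaultdict(float)
    for feats, label in nearest:
        votes[label] += 1.0 / (dist(query, feats) + 1e-9)  # closer = stronger vote
    return max(votes, key=votes.get)

# Toy 2-D "pitch features" labelled with tone classes (illustrative only).
train = [((0.9, 0.1), "tone1"), ((0.8, 0.2), "tone1"),
         ((0.1, 0.9), "tone4"), ((0.2, 0.8), "tone4")]
print(knn_classify((0.85, 0.15), train, weights=(1.0, 1.0), k=3))  # tone1
```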
Experimental Evaluation of the Relevance of
Prosodic Features in Spanish Using Machine
Learning Techniques
David Escudero 1 , Valentín Cardeñoso 1 , Antonio
Bonafonte 2 ; 1 Universidad de Valladolid, Spain;
2 Universitat Politècnica de Catalunya, Spain
In this work, machine learning techniques have been applied for the
assessment of the relevance of several prosodic features in TTS for
Spanish. Using a two step correspondence between sets of prosodic
features and intonation parameters, the influence of the number of
different intonation patterns and the number and order of prosodic
features is evaluated. The output of the trained classifiers is proposed as a labelling mechanism of intonation units which can be
used to synthesize high-quality pitch contours. The input-output
correspondence of the classifier also provides a bundle of relevant
prosodic knowledge.
Dominance Spectrum Based V/UV Classification
and F0 Estimation
In this article we evaluate our stochastic morphosyntactic language
model (SMLM) on a Hungarian newspaper dictation task that requires modeling over 1 million different word forms. The proposed
method is based on the use of morphemes as the basic recognition
units and the combination of a morpheme N-gram model and a
Tomohiro Nakatani 1 , Toshio Irino 2 , Parham
Zolfaghari 1 ; 1 NTT Corporation, Japan; 2 Wakayama
University, Japan
This paper presents a new method for robust voiced/unvoiced segment (V/UV) classification and accurate fundamental frequency (F0 )
estimation in a noisy environment. For this purpose, we introduce
the degree of dominance and dominance spectrum that are defined
by instantaneous frequency. The degree of dominance allows us
to evaluate the magnitude of individual harmonic components of
speech signals relative to the background noise. The V/UV segments are robustly classified based on the capability of the dominance spectrum to extract the regularity in the harmonic structure. F0 is accurately determined based on fixed points corresponding to dominant harmonic components easily selected from the
dominance spectrum. Experimental results show that the present
method is better than the existing methods in terms of gross and
fine F0 errors, and V/UV correct rates in the presence of background
white and babble noise.
Analysis and Modeling of F0 Contours of
Portuguese Utterances Based on the
Command-Response Model
Session: SWeDd– Oral
Feature Analysis & Cross-Language
Processing of Chinese Spoken Language
Time: Wednesday 16.00, Venue: Room 4
Chair: Tao Jianhua, Chinese Academy of Sciences, Beijing
Automatic Title Generation for Chinese Spoken
Documents Considering the Special Structure of
the Language
Lin-shan Lee, Shun-Chuan Chen; National Taiwan
University, Taiwan
Hiroya Fujisaki 1, Shuichi Narusawa 1, Sumio Ohno 2, Diamantino Freitas 3; 1 University of Tokyo, Japan; 2 Tokyo University of Technology, Japan; 3 University of Porto, Portugal
This paper describes the results of a joint study on the applicability of the command-response model to F0 contours of European
Portuguese, with the aim of using it in a TTS system. Analysis-by-Synthesis of observed F0 contours of a number of utterances by
five native speakers indicated that the model with provisions for
both positive and negative accent commands applies quite well to
all the utterances tested. The estimated commands are found to be
closely related to the linguistic contents of the utterances. One of
the features of European Portuguese found in utterances by the majority of speakers is the occurrence of a negative accent command
at certain phrase-initial positions, and its perceptual significance is
examined by an informal listening test, using stimuli synthesized
both with and without negative accent commands.
Covariation and Weighting of Harmonically
Decomposed Streams for ASR
Philip J.B. Jackson 1 , David M. Moreno 2 , Martin J.
Russell 3 , Javier Hernando 2 ; 1 University of Surrey,
U.K.; 2 Universitat Politècnica de Catalunya, Spain;
3 University of Birmingham, U.K.
The purpose of automatic title generation is to understand a document and to summarize it with only a few readable words or phrases. This is important for browsing and retrieving spoken documents, which may be automatically transcribed; such documents are much more useful when given titles indicating their subject content. On the other hand, the Chinese language is not only spoken by the largest population in the world, but also has a structure very different from that of Western languages: it is not alphabetic, and each of its many distinct characters is pronounced as a monosyllable, while the total number of syllables is limited. In this paper, considering this special structure of the Chinese language, a set of “feature units” for Chinese spoken language processing is defined, and the effects of the choice of these “feature units” on automatic title generation are analyzed with a new adaptive K-nearest-neighbor approach, proposed as the baseline in a companion paper also submitted to this conference.
Statistical Speech-to-Speech Translation with
Multilingual Speech Recognition and
Bilingual-Chunk Parsing
Bo Xu, Shuwu Zhang, Chengqing Zong; Chinese
Academy of Sciences, China
Decomposition of speech signals into simultaneous streams of periodic and aperiodic information has been successfully applied to
speech analysis, enhancement, modification and recently recognition. This paper examines the effect of different weightings of
the two streams in a conventional HMM system in digit recognition tests on the Aurora 2.0 database. Comparison of the results
from using matched weights during training showed a small improvement of approximately 10% relative to unmatched ones, under
clean test conditions. Principal component analysis of the covariation amongst the periodic and aperiodic features indicated that
only 45 (51) of the 78 coefficients were required to account for 99%
of the variance, for clean (multi-condition) training, which yielded
an 18.4% (10.3%) absolute increase in accuracy with respect to the
baseline. These findings provide further evidence of the potential
for harmonically-decomposed streams to improve performance and
substantially to enhance recognition accuracy in noise.
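The principal component analysis finding amounts to counting how many eigenvalues of the feature covariance matrix are needed to reach 99% of the total variance. A sketch with an invented eigenvalue spectrum:

```python
def components_for_variance(eigenvalues, target=0.99):
    """Number of principal components needed to cover `target` of the
    total variance (eigenvalues of the feature covariance matrix)."""
    total = sum(eigenvalues)
    acc = 0.0
    for k, ev in enumerate(sorted(eigenvalues, reverse=True), start=1):
        acc += ev
        if acc / total >= target:
            return k
    return len(eigenvalues)

# Invented eigenvalue spectrum: variance concentrated in a few components.
eigs = [50.0, 30.0, 15.0, 4.5, 0.3, 0.2]
print(components_for_variance(eigs))  # → 4
```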
Initiated mainly by the speech community, research in speech-to-speech (S2S) translation has made steady progress in the past decade, and many approaches have been proposed. Among them, corpus-dependent statistical strategies have been widely studied in recent years. In corpus-based translation methodology, rather than taking the corpus just as a set of reference templates, more detailed or structural information should be exploited and integrated in statistical modeling. Under the statistical translation framework, which provides a very flexible way of integrating different prior or structural knowledge, we have conducted a series of R&D activities on S2S translation. In the most recent version, we have independently developed a prototype Chinese-English bi-directional S2S translation system, supported by multilingual speech recognition and bilingual-chunk based statistical translation techniques, to meet the demands of Manos, a multilingual information service project for the 2008 Beijing Olympic Games. This paper introduces our work on multilingual S2S translation.
Automatic Extraction of Bilingual Chunk Lexicon
for Spoken Language Translation
Limin Du, Boxing Chen; Chinese Academy of Sciences,
China
In language communication, an utterance may be segmented as a
concatenation of chunks that are reasonable in syntax, meaningful
in semantics, and composed of several words. Usually, the order
of words within chunks is fixed, and the order of chunks within an
utterance is rather flexible. The improvement of spoken language
translation could benefit from using bilingual chunks. This paper
presents a statistical algorithm to build the bilingual chunk-lexicon
automatically from spoken language corpora. Several association
measurements are set up as the criteria for extraction, and a local-best algorithm, length-ratio filtration, and stop-word filtration are also incorporated to improve performance. A bilingual chunk-lexicon was extracted from a corpus with a precision of 86.0% and a recall of 86.7%. The usability of the chunk-lexicon was then tested with an innovative framework for English-to-Chinese spoken language translation, resulting in translation accuracies of 81.83% and 78.69% on the training and test sets respectively, measured with a Levenshtein-distance-based similarity score.
Multi-Scale Document Expansion in
English-Mandarin Cross-Language Spoken
Document Retrieval
Wai-Kit Lo 1 , Yuk-Chi Li 1 , Gina Levow 2 , Hsin-Min
Wang 3 , Helen M. Meng 1 ; 1 Chinese University of Hong
Kong, China; 2 University of Chicago, USA; 3 Academia
Sinica, Taiwan
This paper presents the application of document expansion using a
side collection to a cross-language spoken document retrieval (CL-SDR) task to improve retrieval performance. Document expansion
is applied to a series of English-Mandarin CL-SDR experiments using
selected retrieval models (probabilistic belief network, vector space
model, and HMM-based retrieval model). English textual queries are
used to retrieve relevant documents from an archive of Mandarin
radio broadcast news. We have devised a multi-scale approach for
document expansion - a process that enriches the Mandarin spoken
document collection in order to improve overall retrieval performance. A document is expanded by (i) first retrieving related documents on a character bigram scale, (ii) then extracting word units
from such related documents as expansion terms to augment the
original document and (iii) finally indexing all documents in the collection by means of character bigrams and those expanded terms
by within-word character bigrams to prepare for future retrieval.
Hence the document expansion approach is multi-scale as it involves both word and subword scales. Experimental results show
that this approach achieves performance improvements up to 14%
across several retrieval models.
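Indexing by overlapping character bigrams, the subword scale used in step (i), can be sketched as follows. The Dice-coefficient score is an illustrative stand-in for the retrieval models named in the abstract:

```python
def char_bigrams(text):
    """Index a (Chinese) string by its overlapping character bigrams."""
    return {text[i:i + 2] for i in range(len(text) - 1)}

def bigram_overlap(query, document):
    """Simple retrieval score: Dice coefficient over character bigrams."""
    q, d = char_bigrams(query), char_bigrams(document)
    return 2 * len(q & d) / (len(q) + len(d))

# Toy example: the query's 3 bigrams all occur among the document's 5.
doc = "北京广播新闻"
print(round(bigram_overlap("广播新闻", doc), 2))  # → 0.75
```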
Mandarin Speech Prosody: Issues, Pitfalls and
Directions
Chiu-yu Tseng; Academia Sinica, Taiwan
From the perspective of speech technology development for unlimited Mandarin Chinese TTS, two issues appear most impedimental:
(1.) how to predict prosody from text, and (2.) how to achieve better naturalness for speech output. These impediments have brought out the major pitfalls in related research, i.e., the characteristics of Chinese connected speech and the overall rhythmic structure of speech flow. This paper discusses where the problems stem
from and how some solutions could be found. We propose that
for Mandarin, prosody research needs to include the following: (1.)
characteristics of Mandarin connected speech that constitute the
prosodic properties in speech flow, i.e., units and boundaries, (2.)
scope and type of speech data collected, i.e., text other than isolated sentences, (3.) prosody in relation to speech planning, i.e.,
information other than lexical, syntactic and semantic, and (4.) an
overall organization of prosody for speech flow, i.e., a framework
that accommodates the above-mentioned features.
A Contrastive Investigation of Standard Mandarin
and Accented Mandarin
Standard Chinese. No significant difference exists on durations of
initials and finals for these 20 speakers. And no phonological difference is found on four lexical tones. It seems that the prosodic
difference is mainly on rhythmic or stress pattern.
Emotion Control of Chinese Speech Synthesis in
Natural Environment
Jianhua Tao; Chinese Academy of Sciences, China
Emotional speech analysis was normally conducted from the viewpoint of prosody and articulation features. But for emotional
speech synthesis system, two issues appear most important: (1)
how to realize the acoustic features among various emotion states?
(2) how to convey the emotion with the combination of text analysis and environment detection. To answer the two questions, both
acoustic features and emotion focus were analyzed in the paper.
Due to the different background and culture, even the same emotion has different meaning for different people in certain contexts.
The paper also tries to explain if there are special characters in Chinese emotion expression. Finally, the emotion controlling model is
described in the paper, some rules are listed in a table. Environment influence was also classified and integrated into the system.
At the end of paper, the emotion synthesis results were evaluated
and compared to other previous works.
Session: PWeDe– Poster
Speech Production & Physiology
Time: Wednesday 16.00, Venue: Main Hall, Level -1
Chair: Hideki Kawahara, Wakayama University, Japan
Optimality Criteria in Inverse Problems for
Tongue-Jaw Interaction
A.S. Leonov 1 , V.N. Sorokin 2 ; 1 Moscow Physical
Engineering Institute, Russia; 2 Russian Academy of
Sciences, Russia
We consider the system of articulators “jaw – tip of the tongue”
in order to investigate instant and integral optimality criteria in
the variational approach to the solution of speech inverse problem
“from total displacement of articulators to their controls”. The required experimental data i.e., coordinates of the tip of the tongue
and lower incisor have been measured by the use of the X-ray microbeam system together with EMGs of masseter, longitudinalis superior and longitudinalis inferior. These data have been registered
for sequences of syllable /ta/ with different articulation rates, as
well as for an elevation and lowering of the tongue tip in non-speech
mode. We analyze instant and integral criteria of the work, kinetic
energy, elastic and inertial forces for the system. In speech mode,
total displacements of the tongue tip and the jaw are simulated perfectly by the use of any instant and integral criterion, mentioned
above. At the same time, the own displacements of the tongue tip
and the jaw are reproduced well by means of integral criteria only.
On the contrary, the own displacements in non-speech mode are
reproduced satisfactory only by the use of any instant optimality
criterion.
FEM Analysis Based on 3-D Time-Varying Vocal
Tract Shape
Koji Sasaki 1 , Nobuhiro Miki 2 , Yoshikazu Miyanaga 1 ;
1
Hokkaido University, Japan; 2 Future
University-Hakodate, Japan
2 1
Aijun Li , Xia Wang ; Chinese Academy of Social
Sciences, China; 2 Nokia Research Center, China
Segmental and supra-segmental acoustic features between standard
and Shanghai-accented Mandarin were analyzed in the paper. The
Shanghai Accented Mandarin was first classified into three categories as light, middle and heavy, by statistical method and dialectologist with subjective criteria. Investigation to initials, finals and
tones were then carried out. The results show that Shanghainese
always mispronounce or modify some sorts of phonemes of initials and finials. The heavier the accent is, the more frequently the
mispronunciation occurs. Initials present more modifications than
finals. Nine vowels are also compared phonetically for 10 Standard
Chinese speakers and 10 Shanghai speakers with middle-class accent. Additionally, retroflexed finals occur more than 10 times in
We propose a computational method for time-varying spectra based
on 3-D vocal tract shape using Finite Element Method (FEM). In order
to obtain the time-varying spectra, we introduce auto-mesh algorithm and interpolation. We show the vocal tract transfer function
(VTTF) with variable shape continuously.
Consideration of Muscle Co-Contraction in a
Physiological Articulatory Model
Jianwu Dang 1, Kiyoshi Honda 2; 1 JAIST, Japan; 2 ATR-HIS, Japan
Eurospeech 2003
Wednesday
Physiological models of the speech organs must consider cocontraction of the muscles, a common phenomenon taking place during
articulation. This study investigated cocontraction of the tongue
muscles using the physiological articulatory model that replicates
midsagittal regions of the speech organs to simulate articulatory
movements during speech [1,2]. The relation between the muscle force and tongue movement obtained by the model simulation
indicated that each muscle drives the tongue towards an equilibrium position (EP) corresponding to the magnitude of the activation forces. Contributions of the muscles to the tongue movement
were evaluated by the distance between the equilibrium positions.
Based on the EPs and the muscle contributions, an invariant mapping (the EP map) was established to link a spatial location to a muscle force. Cocontractions between agonist
and antagonist muscles were simulated using the EP maps. The
simulations demonstrated that coarticulation with multiple targets
could be compatibly realized using the co-contraction mechanism.
The implementation of the co-contraction mechanism enables relatively independent control over the tongue tip and body.
Robust Techniques for Pre- and Post-Surgical Voice
Analysis
Claudia Manfredi 1 , Giorgio Peretti 2 ; 1 University of
Florence, Italy; 2 Civil Brescia Hospital, Italy
Objective measurement and tracking of the most relevant voice parameters is obtained for voice signals from patients who underwent a thyroplasty implant. Due to the strong noise component and high non-stationarity of the pre-surgical signal, robust methods are proposed, capable of recovering the fundamental frequency, tracking formants, and quantifying the degree of hoarseness as well as the patient’s functional recovery in an objective way. Thanks to its high-resolution properties, autoregressive parametric modelling is considered, with modifications required for the present application. The method is applied to sustained /a/ vowels recorded from patients suffering from unilateral vocal cord paralysis. Pre- and post-surgical parameters are evaluated, allowing the physician to quantify the effectiveness of the Montgomery thyroplasty implant.
Analysis of Lossy Vocal Tract Models for Speech
Production
K. Schnell, A. Lacroix; Goethe-University Frankfurt am
Main, Germany
Discrete time tube models describe the propagation of plane sound
waves through the vocal tract. Therefore they are important for
speech analysis and production. In most cases discrete time models without losses have been used. In this contribution loss effects
are introduced by extended uniform tube elements modeling frequency dependent losses. The parameters of these extended tube
elements can be fitted to experimental and theoretical data of the
loss effects of wall vibrations, viscosity and heat conduction. For
the analysis of speech sounds the parameters of a lossy vocal tract
model are estimated from speech signals by an optimization algorithm. The spectrum of the analyzed speech can be approximated
well by the estimated magnitude response of the lossy vocal tract
model. Furthermore the estimated vocal tract areas show reasonable shapes.
Estimation of Vocal Noise in Running Speech by
Means of Bi-Directional Double Linear Prediction
F. Bettens 1, F. Grenez 1, J. Schoentgen 2; 1 Université Libre de Bruxelles, Belgium; 2 National Fund for Scientific Research, Belgium
The presentation concerns forward and backward double linear prediction of speech, with a view to characterizing vocal noise due to voice disorders. Bi-directional double linear prediction consists of a conventional short-term prediction followed by a distal inter-cycle prediction that removes the inter-cycle correlations owing to voicing. The long-term prediction is performed forward and backward. The minimum of the forward and backward prediction errors is a cue to vocal noise. This minimum has been calculated for corpora involving connected speech and sustained vowels. Comparisons have been performed between the estimated vocal noise and the perceived hoarseness in steady vowel fragments, as well as between the estimated vocal noise in connected speech and in sustained vowels produced by the same speakers.
Visualisation of the Vocal Tract Based on
Estimation of Vocal Area Functions and Formant
Frequencies
Abdulhussain E. Mahdi; University of Limerick, Ireland
A system for visualisation of vocal-tract shapes during vowel articulation has been designed and developed. The system generates the vocal tract configuration using a new approach based on extracting both the area functions and the formant frequencies from the acoustic speech signal. Using linear prediction analysis, the vocal tract area functions and the first three formants are first estimated. The estimated area functions are then mapped to corresponding mid-sagittal distances and displayed as 2D vocal tract lateral graphics. The mapping process is based on a simple numerical algorithm and an accurate reference grid derived from x-rays of a number of English vowels uttered by different speakers. To compensate for possible errors in the estimated area functions due to variations in vocal tract length, the first two section distances are determined by the three formants. The formants are also used to adjust the rounding of the lips and the height of the jawbone. Results show high correlation with x-ray data and PARAFAC analysis. The system could be useful as a visual sensory aid for speech training of the hearing-impaired.
Reproducing Laryngeal Mechanisms with a
Two-Mass Model
Denisse Sciamarella, Christophe d’Alessandro; LIMSI-CNRS, France
Evidence is produced for the correspondence between the oscillation regimes of an up-to-date two-mass model and laryngeal mechanisms. Features presented by experimental electroglottographic signals during transitions between laryngeal mechanisms are shown to be reproduced by the model.
Temporal Properties of the Nasals and Nasalization
in Cantonese
Beatrice Fung-Wah Khioe; City University of Hong Kong, China
This paper is an investigation of the temporal properties of the nasals and vowel nasalization in Cantonese by analyzing synchronized nasal and oral airflows. The nasal airflow volumes for the vowels in both oral and nasal contexts and for the syllable-final nasals [-m, -n, -N] were also obtained. Results show that (i) the vowel duration in (C)VN syllables is negatively correlated with the duration of the following nasals [-m, -n, -N]; (ii) the vowel duration in (C)VN syllables is positively correlated with the duration of nasalization; (iii) the vowel duration in (C)VN syllables is positively correlated with the nasal airflow volume for the vowel and for the nasalized portion; (iv) the degree of nasalization is inversely correlated with the tongue height of the vowel; and (v) the nasal duration is positively correlated with the total nasal airflow volume for the nasals.
Methods for Estimation of Glottal Pulses
Waveforms Exciting Voiced Speech
Milan Boštík, Milan Sigmund; Brno University of Technology, Czech Republic
Nowadays, the most popular speech processing techniques are recognition of all kinds (speech, speaker, and speaker-state recognition) and text-to-speech synthesis. Glottal pulse waveforms can be used in both of these domains. In recognition, they can describe the vocal cords and so help classify a speaker’s (physiological or mental) state or the speaker’s identity. In text-to-speech, they can be used to change the timbre of the speech. This paper describes some methods for obtaining glottal pulse waveforms from recorded speech, together with several results obtained by applying the described methods.
Acoustic Modeling of American English Lateral
Approximants
Zhaoyan Zhang 1, Carol Espy-Wilson 1, Mark Tiede 2; 1 University of Maryland, USA; 2 Haskins Laboratories, USA
A vocal tract model for an American English /l/ production with
lateral channels and a supralingual side branch has been developed. Acoustic modeling of an /l/ production using MRI-derived
vocal tract dimensions shows that both the lateral channels and
the supralingual side branch contribute to the production of zeros
in the F3 to F5 frequency range, thereby resulting in pole-zero clusters around 2-5 kHz in the spectrum of the /l/ sound.
Translation and Rotation of the Cricothyroid Joint
Revealed by Phonation-Synchronized
High-Resolution MRI
Sayoko Takano 1, Kiyoshi Honda 1, Shinobu Masaki 2, Yasuhiro Shimada 2, Ichiro Fujimoto 2; 1 ATR-HIS, Japan; 2 ATR-BAIC, Japan
The action of the cricothyroid joint in regulating voice fundamental frequency is thought to have two components: rotation and translation. Its empirical verification, however, has faced methodological problems. This study examines the joint action by means of a phonation-synchronized high-resolution Magnetic Resonance Imaging (MRI) technique, which employs two technical improvements: a custom laryngeal coil to enhance image resolution, and an external triggering method to synchronize the subject’s phonation and the MRI scan. The obtained images were clear enough to demonstrate two actions of the joint: the cricoid cartilage rotates 5 degrees and the thyroid cartilage translates 1.25 mm over a range of half an octave.
Session: PWeDf– Poster
Speech Synthesis: Voice Conversion &
Miscellaneous Topics
Time: Wednesday 16.00, Venue: Main Hall, Level -1
Chair: Christophe D’Alessandro, LIMSI, France
GMM-Based Voice Conversion Applied to Emotional
Speech Synthesis
Hiromichi Kawanami 1, Yohei Iwami 1, Tomoki Toda 2, Hiroshi Saruwatari 1, Kiyohiro Shikano 1; 1 Nara Institute of Science and Technology, Japan; 2 ATR-SLT, Japan
A voice conversion method is applied to synthesizing emotional speech from standard reading (neutral) speech. Pairs of neutral speech and emotional speech are used to train the conversion rules. The conversion adopts a GMM (Gaussian Mixture Model) with DFW (Dynamic Frequency Warping). We also adopt STRAIGHT, the high-quality speech analysis-synthesis algorithm. As conversion target emotions, (hot) anger, (cold) sadness and (hot) happiness are used. The converted speech is first evaluated objectively, using mel-cepstrum distortion as a criterion. The result confirms that GMM-based voice conversion can reduce the distortion between target speech and neutral speech. A subjective test is also carried out to investigate the perceptual effect. To assess the influence of prosody, two kinds of prosody are used for synthesis: natural prosody extracted from neutral speech, and prosody from emotional speech. The results show that prosody contributes most to perceived emotion, and that spectrum conversion can reinforce it.
Probability Models of Formant Parameters for
Voice Conversion
Dimitrios Rentzos 1, Saeed Vaseghi 1, Qin Yan 1, Ching-Hsiang Ho 2, Emir Turajlic 1; 1 Brunel University, U.K.; 2 Fortune Institute of Technology, Taiwan
This paper explores the estimation and mapping of probability models of formant parameter vectors for voice conversion. The formant parameter vectors consist of the frequency, bandwidth and intensity of resonance at each formant. Formant parameters are derived from the coefficients of a linear prediction (LP) model of speech. The formant distributions are modelled with phoneme-dependent two-dimensional hidden Markov models with state Gaussian mixture densities. The HMMs are subsequently used for re-estimation of the formant trajectories of speech. Two alternative methods are explored for voice morphing. The first is a non-uniform frequency warping method, and the second is based on spectral mapping via rotation of the formant vectors of the source towards those of the target. Both methods transform all formant parameters (frequency, bandwidth and intensity). In addition, the factors that affect the selection of the warping ratios for the mapping function are presented. Experimental evaluation of voice morphing examples is presented.
Perceptually Weighted Linear Transformations for
Voice Conversion
Hui Ye, Steve Young; Cambridge University, U.K.
Voice conversion is a technique for modifying a source speaker’s speech to sound as if it was spoken by a target speaker. A popular approach to voice conversion is to apply a linear transformation to the spectral envelope. However, conventional parameter estimation based on least-square-error optimization does not necessarily lead to the best perceptual result. In this paper, a perceptually weighted linear transformation is presented which is based on the minimization of the perceptual spectral distance between the voices of the source and target speakers. The paper describes the new conversion algorithm and presents a preliminary evaluation of the performance of the method based on objective and subjective tests.
Voice Conversion with Smoothed GMM and MAP
Adaptation
Yining Chen 1, Min Chu 2, Eric Chang 2, Jia Liu 1, Runsheng Liu 1; 1 Tsinghua University, China; 2 Microsoft Research Asia, China
In most state-of-the-art voice conversion systems, the speech quality of converted utterances is still unsatisfactory. In this paper, the STRAIGHT analysis-synthesis framework is used to improve the quality. A smoothed GMM with MAP adaptation is proposed for spectrum conversion, to avoid the over-smoothing found in the traditional GMM method. Since frames are processed independently, the GMM-based transformation function may generate discontinuous features. Therefore, a time-domain low-pass filter is applied to the transformation function during the conversion phase. The results of listening evaluations show that the quality of the speech converted by the proposed method is significantly better than that of the traditional GMM method. Meanwhile, speaker identifiability of the converted voice reaches 75%, even when the difference between the source speaker and the target speaker is not very large.
A System for Voice Conversion Based on Adaptive
Filtering and Line Spectral Frequency Distance
Optimization for Text-to-Speech Synthesis
Özgül Salor 1, Mübeccel Demirekler 1, Bryan Pellom 2; 1 Middle East Technical University, Turkey; 2 University of Colorado at Boulder, USA
This paper proposes a new voice conversion algorithm that modifies the source speaker’s speech to sound as if produced by a target speaker. To date, most approaches to speaker transformation have been based on mapping functions or codebooks. We propose a linear-filtering-based approach to the problem of mapping the spectral parameters of one speaker to those of another. In the proposed method, the transformation is performed by filtering the source speaker’s Line Spectral Pair (LSP) frequencies to obtain LSP frequency estimates of the target speaker. The speech signal is time-aligned to a sequence of HMM states, and the filters are designed for each HMM state using the aligned data. We consider two methods for spectral conversion. A linear transformation for the LSPs was obtained using the adaptive steepest gradient descent
approach. The mean values of the LSPs are adjusted to match those of the target speaker. To prevent the LSP vectors from producing unstable vocal tract filters, weighted least-squares estimation is used. This approach optimizes the differences between source and target LSPs, with weights the inverses of the source LSP variances. The approach is integrated into a Time Domain Pitch Synchronous Overlap and Add (TD-PSOLA) analysis-synthesis framework. The algorithm is objectively evaluated using a distance measure based on the log-likelihood ratio of observing the input speech given Gaussian mixture speaker models for both the source and the target voice. Results using the Gaussian-mixture-model criteria demonstrate consistent transformation on a 5-speaker database. The algorithm offers promise for rapidly adapting text-to-speech systems to new voices.
Speaker Conversion in ARX-Based Source-Formant
Type Speech Synthesis
Hiroki Mori, Hideki Kasuya; Utsunomiya University,
Japan
A speaker conversion framework for formant synthesis is proposed.
With this framework, given a small set of a target speaker’s utterances, segmental features of original speech can be converted
to those of the given speaker. Unlike other speaker conversion
frameworks, further voice quality modification can also be applied
to the converted speech with conventional formant modification
techniques. The parameter conversion is based on MLLR in the cepstral domain. The effect of parameter conversion can be seen from
the graphical representation of formant placement. The results of
an auditory experiment showed that most of the converted speech
was perceived as being similar to that of target speakers.
Implementing an SSML Compliant Concatenative
TTS System
Andrew P. Breen, Steve Minnis, Barry Eggleton;
Nuance Communications, U.K.
The W3C Speech Synthesis Markup Language (SSML) unifies a number of recent related markup languages that have emerged to fill
the perceived need for increased, and standardized, user control
over Text to Speech (TTS) engines. One of the main drivers for
markup has been the increasing use of TTS engines as embedded
components of specific applications – which means they are in a
position to take advantage of additional knowledge about the text.
Although SSML allows improved control over the text normalization
process, most of the attention has focused on the level of prosody
markup, especially since the prediction of the prosody is generally
acknowledged as one of the most significant problems in TTS synthesis. Prosody control is by no means simple, due to the large cross-dependency among related aspects of prosody. Prosody
control is also of particular complexity for concatenative TTS systems. SSML is about much more than prosody control though –
allowing high level engine control such as language switching and
voice switching, and low level control such as phonetic input for
words. Our experiences in implementing these diverse requirements of the SSML standard are discussed.
Acoustic Variations of Focused Disyllabic Words in
Mandarin Chinese: Analysis, Synthesis and
Perception
Zhenglai Gu, Hiroki Mori, Hideki Kasuya; Utsunomiya
University, Japan
The focus effects on acoustic correlates include both prosodic and segmental modifications. Analysis of 35 focused words in carrier sentences uttered by 2 male and 3 female speakers has shown that: (1) there is a significant asymmetry in vowel duration as well as F0 range between pre-stressed and post-stressed syllables, implying that different strategies are employed in focusing disyllabic words, i.e., emphasizing the first syllable as well as weakening the second syllable for the former, but emphasizing the second syllable only for the latter; (2) the tonal combinations significantly affect the variations of both vowel duration and F0 range; (3) the formant frequencies (F1, F2) are changed systematically, such that the formants of the vowels plotted in the (F1, F2) plane are stretched outwards. Perceptual validation of the relative importance of these acoustic cues for signaling a focal word has been carried out. Results of the perception experiment indicate that F0 is the dominant cue for the judgment of the focused word, while the other two cues, duration and formant frequencies, contribute less.
An Approach to Common Acoustical Pole and Zero
Modeling of Consecutive Periods of Voiced Speech
Pedro Quintana-Morales, Juan L. Navarro-Mesa;
Universidad de Las Palmas de Gran Canaria, Spain
In this paper the open and closed phases within a speech period are separately modeled as acoustical pole-zero filters. We approach the estimation of the coefficients associated with the poles and zeros by minimizing a cost function based on the reconstruction error. The cost function leads to a matrix formulation of the error for the two time intervals where the error must be defined. This defines a framework that facilitates modeling the phases associated with consecutive periods. We give a matrix formulation of the estimation process that lets us attain two main objectives: first, estimating the common-pole structure of several consecutive periods and their particular zero structures; and second, estimating their common pole-zero structure. The experiments are carried out on a speech database of five men and five women, in terms of the reconstruction error and its dependence on the period length and the order of the analysis.
Estimating the Vocal-Tract Area Function and the
Derivative of the Glottal Wave from a Speech Signal
Huiqun Deng, Michael Beddoes, Rabab Ward, Murray
Hodgson; University of British Columbia, Canada
We present a new method for estimating the vocal-tract area functions from speech signals. First, we point out and correct a longstanding sign error in some literature related to the derivation of
the acoustic reflection coefficients of the vocal tract from a speech
signal. Next, to eliminate the influence of the glottal wave on the
estimation of the vocal-tract filter, we estimate the vocal-tract filter and the derivative of the glottal wave simultaneously from a
speech signal. From the vocal-tract filter obtained, we derive the
vocal-tract area function. Our improvements to existing methods
can be seen from the vocal-tract area functions obtained for vowel
sounds /A/ and /i/, each produced by a female and a male subject.
They are comparable with those obtained using the magnetic resonance imaging method. The derivatives of the glottal waves for
these sounds are also presented, and they show very detailed structures.
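The step from reflection coefficients to an area function rests on a textbook relation that can be sketched in a few lines. This is a generic illustration, not the authors' corrected derivation; the walking direction and the unit normalization below are assumptions.

```python
# Hedged sketch: given reflection coefficients k_i at the tube junctions
# of a concatenated-tube vocal-tract model, successive cross-sectional
# areas satisfy A_{i+1} = A_i * (1 - k_i) / (1 + k_i). The sign convention
# of k_i is exactly the point the abstract warns about: negating every
# k_i inverts each area ratio.

def areas_from_reflections(ks, first_area=1.0):
    """Walk along the tube, one section per reflection coefficient."""
    areas = [first_area]
    for k in ks:
        areas.append(areas[-1] * (1.0 - k) / (1.0 + k))
    return areas
```

For example, k = 0.5 shrinks the next section to one third of the current area, and k = -0.5 expands it by the reciprocal factor.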
Glottal Closure Instant Synchronous Sinusoidal
Model for High Quality Speech Analysis/Synthesis
Parham Zolfaghari 1, Tomohiro Nakatani 1, Toshio Irino 2, Hideki Kawahara 2, Fumitada Itakura 3; 1 NTT Corporation, Japan; 2 Wakayama University, Japan; 3 Nagoya University, Japan
In this paper, a glottal event synchronous sinusoidal model is proposed. A glottal event corresponds to the glottal closure instant
(GCI), which is accurately estimated using group delay and fixed
point analysis in the time domain using energy centroids. The
GCI synchronous sinusoidal model allows adequate processing according to the inherent local properties of speech, resulting in
phase matching between adjacent and corresponding harmonics
that is essential for precise speech analysis. Frequency domain
fixed points from mapping filter center frequencies to the instantaneous frequencies of the filter outputs result in highly accurate
estimates of the constituent sinusoidal components. Adequate window selection and placement at the GCI is found to be important in
obtaining stable sinusoidal components. We demonstrate that the
GCI synchronous instantaneous frequency method allows a large
reduction in spurious peaks in the spectrum and enables high quality synthesised speech. In speech quality evaluations, glottal synchronous analysis-synthesis results in a 0.4 improvement in MOS
over conventional fixed frame rate analysis-synthesis.
Mixed Physical Modeling Techniques Applied to
Speech Production
Matti Karjalainen; Helsinki University of Technology,
Finland
The Kelly-Lochbaum transmission-line model of the vocal tract
started the discrete-time modeling of speech production. More recently similar techniques have been developed in computer music
towards a more generalized methodology. In this paper we will
study the application of mixed physical modeling to speech production and speech synthesis. These approaches are Digital Waveguides (DWG), Finite Difference Time-Domain schemes (FDTD), and
Wave Digital Filters (WDF). The equivalence and interconnectivity
of these schemes are shown, and flexible real-time synthesizers for
articulatory type of speech production are demonstrated.
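The common building block of the DWG, FDTD and WDF schemes mentioned above is the Kelly-Lochbaum scattering junction, which can be sketched in a few lines. One common sign convention is assumed here; real implementations differ in where they place losses and half-sample delays.

```python
# Minimal sketch of one Kelly-Lochbaum scattering junction: forward and
# backward travelling waves scatter at a tube-section boundary with
# reflection coefficient k.

def scatter(f_in, b_in, k):
    """f_in: wave arriving from the left, b_in: wave arriving from the
    right. Returns (wave sent right, wave sent left)."""
    f_out = (1.0 + k) * f_in - k * b_in
    b_out = k * f_in + (1.0 - k) * b_in
    return f_out, b_out
```

With k = 0 the junction is transparent (both waves pass through unchanged), which is a quick sanity check on the convention.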
An Expandable Web-Based Audiovisual
Text-to-Speech Synthesis System
Sascha Fagel, Walter F. Sendlmeier; Technische
Universität Berlin, Germany
The authors propose a framework for audiovisual speech synthesis
systems [1] and present a first implementation of the framework
[2], which is called MASSY - Modular Audiovisual Speech SYnthesizer. This paper describes how the audiovisual speech synthesis
system, the ‘talking head’, works, how it can be integrated into web applications, and why it is worthwhile using it.
The presented applications use the wrapped audio synthesis, the
phonetic and visual articulation modules, and a face module. One
of the two already implemented visual articulation models, based
on a dominance model for coarticulation, is used. The face is a 3D
model described in VRML 97. The facial animation is described in
a motion parameter model which is capable of realizing the most
important visible articulation gestures [3][4]. MASSY is developed
in the client-server paradigm. The server is easy to set up and does
not need special or high performance hardware. The required bandwidth is low, and the client is an ordinary web browser with a freely
available standard plug-in.
The system is used for the evaluation of measured and predicted
articulation models and is also suitable for the enhancement of
human-computer interfaces in applications such as virtual tutors in
e-learning environments, speech training, video conferencing, computer games, audiovisual information systems, virtual agents, and
many more.
A Reconstruction of Farkas Kempelen’s Speaking
Machine
P. Nikléczy, G. Olaszy; Hungarian Academy of
Sciences, Hungary
The first “speaking machine” in the world was created by the Hungarian polymath Farkas Kempelen, who can also be regarded as the first phonetician in the world. He kept improving his speaking machine for twenty-two years, and described the final version in a book published in 1791 in Vienna. The reconstruction was made based on this book. What we wanted to make was not just an exhibition piece but a machine that actually worked. Thus we can go back more than 200 years and study the working of one of the most precious instruments of the Baroque period. We can try out the ways of producing sounds that Kempelen wrote so many pages about in his book. The acoustic patterns of the machine’s speech can be studied with today’s sophisticated signal processing methods, and Kempelen’s claims can be proved or disproved by measurement data. Besides this, we took it to be an important task in terms of the history of science to contribute to our knowledge of the beginnings of phonetic research.
Acoustic Model Selection and Voice Quality
Assessment for HMM-Based Mandarin Speech
Synthesis
Wentao Gu, Keikichi Hirose; University of Tokyo,
Japan
This paper presents a preliminary study in implementing an HMM-based Mandarin speech synthesis system, whose main advantage lies in generating various voices. A variety of acoustic unit representations for Mandarin are compared to select an optimal acoustic model set. Syllabic vs. sub-syllabic, context-independent vs. context-dependent, toneless vs. tonal, initial-final vs. preme-toneme models, and models with various numbers of states, are investigated respectively. To take full advantage of HMM-based speech synthesis, some aspects affecting speaker adaptation quality, especially the selection of the adaptation data size, are also studied.
Modeling of Various Speaking Styles and Emotions
for HMM-Based Speech Synthesis
Junichi Yamagishi, Koji Onishi, Takashi Masuko,
Takao Kobayashi; Tokyo Institute of Technology,
Japan
This paper presents an approach to realizing various emotional expressions and speaking styles in synthetic speech using HMM-based speech synthesis. We show two methods for modeling speaking styles and emotions. In the first method, called “style dependent modeling,” each speaking style and emotion is modeled individually. In the second method, called “style mixed modeling,” speaking style or emotion is treated as a contextual factor, alongside phonetic, prosodic, and linguistic factors, and all speaking styles and emotions are modeled simultaneously by a single acoustic model. We chose four styles, namely “reading,” “rough,” “joyful,” and “sad,” and compared the two modeling methods on them. Subjective tests show that the two modeling methods perform almost equally well, and that it is possible to synthesize speech whose speaking styles and emotions resemble those of the recorded speech. In addition, style mixed modeling reduces the number of output distributions compared with style dependent modeling.
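The difference between the two schemes can be sketched in terms of the model labels involved. This is a minimal illustration only; the label format, function names, and style tags are invented, not the authors' notation:

```python
# Illustrative sketch only: contrasting "style dependent" and "style mixed"
# modeling. Label format and names are invented for this example.

def style_dependent_key(style, phone):
    # Style dependent modeling: the style selects a separate model inventory,
    # so each style/emotion is trained and stored on its own.
    return (style, phone)

def style_mixed_label(phone, left, right, style):
    # Style mixed modeling: style is one more contextual factor in the label,
    # so a single clustered model set covers all styles and can share
    # output distributions across them.
    return f"{left}-{phone}+{right}/style:{style}"

label = style_mixed_label("a", "k", "t", "joyful")  # "k-a+t/style:joyful"
```

Because style appears as one more clustering factor rather than a model-set switch, distributions can be tied across styles, which is consistent with the reported reduction in output distributions.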
Towards the Development of a Brazilian
Portuguese Text-to-Speech System Based on HMM
R. da S. Maia 1, Heiga Zen 1, Keiichi Tokuda 1, Tadashi Kitamura 1, F.G.V. Resende Jr. 2; 1 Nagoya Institute of Technology, Japan; 2 Federal University of Rio de Janeiro, Brazil
This paper describes the development of a Brazilian Portuguese
text-to-speech system which applies a technique wherein speech is
directly synthesized from hidden Markov models. In order to build
the synthesizer a speech database was recorded and phonetically
segmented. Furthermore, contextual information about syllables, words, phrases, and utterances was determined, as well as questions for decision tree-based context clustering algorithms. The
resulting system presents a fair reproduction of the prosody even
when a small database is used for training.
Grapheme to Phoneme Conversion and Dictionary
Verification Using Graphonemes
Paul Vozila, Jeff Adams, Yuliya Lobacheva, Ryan
Thomas; ScanSoft Inc., USA
We present a novel data-driven, language-independent approach to grapheme-to-phoneme conversion, which achieves a phoneme error rate of 3.68% and a pronunciation error rate of 17.13% for English.
We apply our stochastic model to the task of dictionary verification
and conclude that it is able to detect spurious entries, which can
then be examined and corrected by a human expert.
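The graphoneme idea can be illustrated with a toy joint-unit n-gram. This is only a hedged sketch: the `g:p` unit notation, the single pre-aligned training pair, and the unsmoothed bigram estimate are simplifying assumptions, not the actual model described above.

```python
from collections import defaultdict

def train_bigram(sequences):
    # Count bigrams over joint grapheme:phoneme units ("graphonemes").
    counts = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        prev = "<s>"
        for unit in seq + ["</s>"]:
            counts[prev][unit] += 1
            prev = unit
    return counts

def bigram_prob(counts, prev, unit):
    # Maximum-likelihood estimate; a real system would smooth these counts.
    total = sum(counts[prev].values())
    return counts[prev][unit] / total if total else 0.0

# Toy aligned entry: "phone" -> /f oU n/, with final "e" mapped to nothing.
model = train_bigram([["ph:f", "o:oU", "n:n", "e:-"]])
```

Dictionary verification then amounts to scoring each lexicon entry's graphoneme sequence and flagging entries whose probability falls below a threshold as likely spurious.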
Improving the Accuracy of Pronunciation
Prediction for Unit Selection TTS
Justin Fackrell, Wojciech Skut, Kathrine Hammervold;
Rhetorical Systems Ltd., U.K.
This paper describes a technique which improves the accuracy of
pronunciation prediction for unit selection TTS. It does this by performing an orthography-based context-dependent lookup on the
unit database. During synthesis, the pronunciations of words which
have matching contexts in the unit database are determined. Pronunciations not found using this method are determined using traditional lexicon lookup and/or letter-to-sound rules. In its simplest
form, the model involves a lookup based on left and right word context. A modified form, which backs off to a lookup based on right
context, is shown to have a much higher firing rate, and to produce
more pronunciation variation.
The technique is good at occasionally inhibiting vowel reduction; at
choosing appropriate pronunciations in case of free variation; and
at choosing the correct pronunciation for names. Its effectiveness
is assessed by experiments on unseen data; by resynthesis; and by
a listening test on sentences rich in reducible words.
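A minimal sketch of this lookup cascade, with invented table contents (the word contexts, pronunciations, and the `None` back-off convention are illustrative only):

```python
# Hypothetical unit-database entries keyed by (left, word, right) contexts;
# None marks the backed-off key that ignores the left context.
unit_db = {
    ("the", "record", "shows"): "r eh k er d",
    (None, "record", "shows"): "r eh k er d",
}
lexicon = {"record": "r ih k ao r d"}  # default lexicon pronunciation

def pronounce(left, word, right):
    # 1) full left+right word context, 2) back off to right context only,
    # 3) fall back to plain lexicon lookup (or letter-to-sound rules).
    for key in ((left, word, right), (None, word, right)):
        if key in unit_db:
            return unit_db[key]
    return lexicon[word]
```

The backed-off form fires more often simply because a right-context match is far more likely than an exact two-sided match, consistent with the higher firing rate reported above.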
Detection of List-Type Sentences
Taniya Mishra, Esther Klabbers, Jan P.H. van Santen;
Oregon Health & Science University, USA
In this paper, we explore a text-type-based scheme of text analysis through the specific problem of detecting the list text type. This is
important because TTS systems that can generate the very distinct
F0 contour of lists sound more natural. The presented list detection algorithm uses part-of-speech tags as input, and detects lists
by computing the alignment costs of clauses in a sentence. The algorithm detects lists with 80% accuracy.
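One way to realize this is sketched below, under stated assumptions: edit distance as the alignment cost, an invented threshold, and hand-assigned POS tags, none of which come from the paper itself.

```python
def align_cost(a, b):
    # Edit distance between two POS-tag sequences, via dynamic programming.
    d = [[i + j if i * j == 0 else 0 for j in range(len(b) + 1)]
         for i in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + (a[i - 1] != b[j - 1]))
    return d[-1][-1]

def looks_like_list(clauses, threshold=2):
    # Adjacent clauses whose tag sequences align cheaply suggest a list.
    return all(align_cost(clauses[k], clauses[k + 1]) <= threshold
               for k in range(len(clauses) - 1))

# "apples, ripe pears, and plums": one POS-tag sequence per clause.
clauses = [["NNS"], ["JJ", "NNS"], ["NNS"]]
```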
Session: PWeDg– Poster
Acoustic Modelling I
Time: Wednesday 16.00, Venue: Main Hall, Level -1
Chair: Melvyn Hunt, Phonetic Systems UK Ltd, United Kingdom
A New Pitch Synchronous Time Domain Phoneme Recognizer Using Component Analysis and Pitch Clustering
Ramon Prieto, Jing Jiang, Chi-Ho Choi; Stanford University, USA
A new framework for time-domain voiced phoneme recognition is presented. Each speech frame used for training and recognition is bounded by consecutive glottal closures. A pre-processing stage is designed and implemented to model pitch-synchronous frames with Gaussian mixture models. Component analysis carried out on the data shows optimal performance with a very small number of components, requiring low computational power. We designed a new clustering technique that, using the pitch period, gives better results than other well-known clustering algorithms such as k-means.
Mixed-Lingual Spoken Word Recognition by Using VQ Codebook Sequences of Variable Length Segments
Hiroaki Kojima 1, Kazuyo Tanaka 2; 1 AIST, Japan; 2 University of Tsukuba, Japan
We are investigating unsupervised phone modeling. This paper describes a method for deriving VQ codebook sequences of variable-length segments from spoken word samples, and reports evaluation results obtained by applying the method to mixed-lingual speech recognition tasks that include non-native speakers. The VQ codebook is generated with a piecewise linear segmentation method comprising segmentation, alignment, reduction, and clustering steps. The derived codebook sequences are evaluated by speaker-independent recognition of a word set that mixes English and Japanese words. Speech samples were uttered by both English and Japanese native speakers. Using a codebook consisting of 128 codes, the average recognition rates on the 618 mixed-lingual words are 89.7% for English native speakers and 79.4% for Japanese native speakers.
Low Memory Acoustic Models for HMM Based Speech Recognition
Tommi Lahti, Olli Viikki, Marcel Vasilache; Nokia Research Center, Finland
In this paper, we propose a new approach to reducing the memory footprint of HMM-based ASR systems. The proposed method involves three steps. Starting from the continuous density HMMs, mixture variances are tied using k-means-based vector quantization. Next, the resulting models are re-estimated with tied variances. Finally, scalar quantization is performed on the mean components of the models. With the proposed method, a memory saving of 77.6% was achieved compared with the original continuous density HMMs, and of 23.0% compared with quantized-parameter HMMs. The recognition performance of the resulting models was similar to that of the original continuous density HMMs in all tested environments.
Nearest-Neighbor Search Algorithms Based on Subcodebook Selection and its Application to Speech Recognition
José A.R. Fonollosa; Universitat Politècnica de Catalunya, Spain
Vector quantization (VQ) is an efficient technique for data compression with minimum distortion. VQ is widely used in applications such as speech and image coding, speech recognition, and image retrieval. This paper presents a novel fast nearest-neighbor algorithm and shows its application to speech recognition. The proposed algorithm is based on a fast preselection that reduces the search to a limited number of code vectors. The presented results show that the computational cost of the VQ stage can be significantly reduced without affecting the performance of the speech recognizer.
Non-Linear Maximum Likelihood Feature Transformation for Speech Recognition
Mohamed Kamal Omar, Mark Hasegawa-Johnson; University of Illinois at Urbana-Champaign, USA
Most automatic speech recognition (ASR) systems use hidden Markov models (HMMs) with a diagonal-covariance Gaussian mixture model for the state-conditional probability density function. The diagonal-covariance Gaussian mixture can model discrete sources of variability such as speaker variations, gender variations, or local dialect, but cannot model continuous types of variability that account for correlation between the elements of the feature vector. In this paper, we present a transformation of the acoustic feature vector that minimizes an empirical estimate of the relative entropy between the likelihood based on the diagonal-covariance Gaussian mixture HMM and the true likelihood. Based on this formulation, we provide a solution to the problem using volume-preserving maps; existing linear feature transform designs are shown to be special cases of the proposed solution. Since most acoustic features used in ASR are not linear functions of the sources of correlation in the speech signal, we use a non-linear transformation of the features to minimize this objective function. We describe an iterative algorithm that estimates the parameters of both the volume-preserving feature transformation and the HMM so as to jointly optimize the objective function for an HMM-based speech recognizer. Using this algorithm, we achieved a 2% improvement in phoneme recognition accuracy over the baseline system. Our approach also improves recognition accuracy compared with previous linear approaches such as linear discriminant analysis (LDA), the maximum likelihood linear transform (MLLT), and independent component analysis (ICA).
Automatic Generation of Context-Independent
Variable Parameter Models Using Successive State
and Mixture Splitting
Soo-Young Suk, Ho-Youl Jung, Hyun-Yeol Chung;
Yeungnam University, Korea
A Speech and Character Combined Recognition System (SCCRS) is developed for use on PDAs (Personal Digital Assistants) and other mobile devices. In SCCRS, feature extraction is carried out separately for speech and for characters, but recognition is performed in a single engine. The recognition engine is essentially based on a CHMM (Continuous Hidden Markov Model) structure, with a variable-parameter topology that minimizes the number of model parameters and reduces recognition time. The model also adopts our proposed SSMS (Successive State and Mixture Splitting) for generating context-independent models. SSMS optimizes the number of mixtures through splitting in the mixture domain and the number of states through splitting in the time domain. The recognition results show that the proposed SSMS method can reduce the total number of Gaussians by up to 40.0% compared with the fixed-parameter
models at the same recognition performance in speech recognition
system.
Data Driven Generation of Broad Classes for
Decision Tree Construction in Acoustic Modeling
Andrej Žgank, Zdravko Kačič, Bogomir Horvat;
University of Maribor, Slovenia
A new data-driven approach to phonetic broad class generation is proposed. The phonetic broad classes are used by the tree-based clustering procedure as node questions during the generation of context-dependent acoustic models for speech recognition. The data-driven approach is based on a phoneme confusion matrix produced with a phoneme recogniser. This makes the method independent of the particular language or phoneme set found in a database. Broad classes generated with this method were compared to expert-defined and randomly generated broad classes. The experiments were carried out with the Slovenian SpeechDat(II) database, with six different test configurations included in the evaluation. Analysis of speech recognition results for the different acoustic models shows that the proposed data-driven method gives results comparable to or better than the standard method.
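A hedged sketch of the idea: phonemes that the recogniser confuses often end up in the same broad class. The confusion counts, threshold, and greedy grouping rule below are invented for illustration; the paper's actual clustering procedure may differ.

```python
def broad_classes(phones, confusion, threshold):
    # confusion[(a, b)]: how often phoneme a was recognised as phoneme b.
    classes = []
    for p in phones:
        for cls in classes:
            # Join an existing class if p is confusable with any member.
            if any(confusion.get((p, q), 0) + confusion.get((q, p), 0)
                   >= threshold for q in cls):
                cls.append(p)
                break
        else:
            classes.append([p])  # otherwise start a new broad class
    return classes

# Toy confusion counts: voiced/voiceless stop pairs confuse often.
conf = {("p", "b"): 30, ("t", "d"): 25, ("a", "e"): 4}
```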
An Efficient Integrated Gender Detection Scheme
and Time Mediated Averaging of Gender
Dependent Acoustic Models
Peder A. Olsen, Satya Dharanipragada; IBM T.J.
Watson Research Center, USA
This paper discusses building gender-dependent Gaussian mixture models (GMMs) and how to integrate them with an efficient gender-detection scheme. Gender-specific acoustic models of half the size of a corresponding gender-independent acoustic model substantially outperform the larger gender-independent models. With perfect gender detection, gender-dependent modeling should therefore yield higher recognition accuracy without consuming more memory. Furthermore, as certain phonemes (e.g. silence) are inherently gender-independent, much of the male- and female-specific acoustic models can be shared. This paper proposes how to discover which phonemes are inherently similar for male and female speakers and how to share this information efficiently between gender-dependent GMMs. A highly accurate and computationally efficient gender-detection scheme is suggested that takes advantage of computations already done in the speech recognizer. By making the gender assignment probabilistic, the increase in word error rate (WER) seen for erroneously gender-labeled speakers is avoided. The method of gender detection and the probabilistic use of gender are novel and should be of interest beyond mere gender detection. The only requirement for the method to work is that the training data be appropriately labeled.
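The probabilistic gender assignment can be sketched as a soft weighting of the two model sets. The log-likelihood values and the equal-prior assumption below are illustrative only, not the paper's actual detector:

```python
import math

def gender_posteriors(ll_male, ll_female):
    # Convert per-gender log-likelihoods into posteriors (equal priors
    # assumed); subtracting the max avoids overflow in exp().
    m = max(ll_male, ll_female)
    pm = math.exp(ll_male - m)
    pf = math.exp(ll_female - m)
    return pm / (pm + pf), pf / (pm + pf)

def combined_likelihood(p_male, lik_male_model, lik_female_model):
    # Soft combination: a wrong hard gender decision would commit fully to
    # the mismatched model; weighting by the posterior limits the damage.
    return p_male * lik_male_model + (1.0 - p_male) * lik_female_model
```

When the detector is uncertain (posteriors near 0.5), both gender-dependent model sets contribute, which is how the WER increase for mislabeled speakers is avoided.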
Syllable-Based Acoustic Modeling for Japanese Spontaneous Speech Recognition
Jun Ogata 1, Yasuo Ariki 2; 1 AIST, Japan; 2 Ryukoku University, Japan
We study a syllable-based acoustic modeling method for Japanese spontaneous speech recognition. Traditionally, mora-based acoustic models have been adopted in Japanese read-speech recognition systems. In this paper, syllable-based and mora-based units are clearly distinguished in their definitions, and syllables are shown to be more suitable as acoustic modeling units for Japanese spontaneous speech recognition. In spontaneous speech, vowel lengthening occurs frequently, and recognition accuracy is greatly affected by this phenomenon. From this viewpoint, we propose an acoustic modeling technique that explicitly incorporates vowel lengthening in syllable-based HMMs. Experimental results showed that the proposed model outperforms the conventionally used cross-word triphone model and the mora-based model on a Japanese spontaneous speech recognition task.
Pruning Transitions in a Hidden Markov Model with Optimal Brain Surgeon
Brian Mak, Kin-Wah Chan; Hong Kong University of Science & Technology, China
This paper concerns reducing the topology of a hidden Markov model (HMM) for a given task. The purpose is two-fold: (1) to select a good model topology with improved generalization capability; and/or (2) to reduce model complexity so as to save memory and computation. The first goal falls into the active research area of model selection. In the model-theoretic research community, measures such as the Bayesian information criterion, minimum description length, and minimum message length have been proposed and used with some success. In this paper, we consider another approach, in which a well-performing HMM, though perhaps oversized, is optimally pruned so that the loss in the model training cost function is minimal. The method, known as Optimal Brain Surgeon (OBS), has been used in the neural network (NN) community. The application of OBS to NNs is a constrained optimization problem; its application to HMMs is more involved and becomes a quadratic programming problem with both equality and inequality constraints. The detailed formulation is presented, and the algorithm is shown to be effective in an example in which HMM state transitions are pruned. The reduced model also achieves better generalization performance on unseen test data.
Using Pitch Frequency Information in Speech Recognition
Mathew Magimai-Doss, Todd A. Stephenson, Hervé Bourlard; IDIAP, Switzerland
Automatic speech recognition (ASR) systems typically use smoothed spectral features as acoustic observations. Recent studies have shown that complementing these standard features with pitch frequency can improve system performance [1, 2]. While previously proposed systems have been studied in the HMM/GMM framework, in this paper we study and compare different ways to include pitch frequency in a state-of-the-art hybrid HMM/ANN system. We have evaluated the proposed system on two different ASR tasks, namely isolated word recognition and connected word recognition. Our results show that pitch frequency can indeed be used in ASR systems to improve recognition performance.
Hidden Feature Models for Speech Recognition Using Dynamic Bayesian Networks
Karen Livescu 1, James Glass 1, Jeff Bilmes 2; 1 Massachusetts Institute of Technology, USA; 2 University of Washington, USA
In this paper, we investigate the use of dynamic Bayesian networks (DBNs) to explicitly represent models of hidden features, such as articulatory or other phonological features, for automatic speech recognition. In previous work using the idea of hidden features, the representation has typically been implicit, relying on a single hidden state to represent a combination of features. We present a class of DBN-based hidden feature models and show that such a representation can be not only more expressive but also more parsimonious. We also describe a way of representing the acoustic observation model with fewer distributions using a product of models, each corresponding to a subset of the features. Finally, we describe our recent experiments using hidden feature models on the Aurora 2.0 corpus.
Cross-stream Observation Dependencies for Multi-Stream Speech Recognition
Özgür Çetin, Mari Ostendorf; University of Washington, USA
This paper extends prior work in multi-stream modeling by introducing cross-stream observation dependencies and a new discriminative criterion for selecting such dependencies. Experimental results combining short-term PLP features with long-term TRAP features show gains for a multi-stream model with partial state asynchrony over a baseline HMM. Frame-based analyses show significant discriminant information in the added cross-stream dependencies, but so far there are only small gains in recognition accuracy.
Eurospeech 2003
Thursday
September 1-4, 2003 – Geneva, Switzerland
An Efficient Viterbi Algorithm on DBNs
Wei Hu, Yimin Zhang, Qian Diao, Shan Huang; Intel China Research Center, China
DBNs (dynamic Bayesian networks) [1] are a powerful tool for modeling time-series data and have recently been used in speech recognition [2,3,4]. The “decoding” task in speech recognition is to find the Viterbi path [5] (in the graphical model community, “Viterbi path” has the same meaning as MPE, “most probable explanation”) for a given sequence of acoustic observations. In this paper we describe a new algorithm that uses a new data structure, the “backpointer”, which is produced in the “marginalization” step of probabilistic inference. With these backpointers, the Viterbi path can be found by simple backtracking. We first introduce the concepts of backpointer and backtracking, and then give the algorithm for computing the Viterbi path of a DBN based on them. We show that the new algorithm is correct, faster, and more memory-efficient than the old algorithm. Several experiments demonstrate the effectiveness of the algorithm on several well-known DBNs. We also test the algorithm on a real-world DBN model that recognizes continuous digit strings.
Speech Recognition Based on Syllable Recovery
Li Zhang, William Edmondson; University of Birmingham, U.K.
This paper reports the results of syllable recovery from speech using an articulatory model of the syllable. The contribution of syllable recovery to the overall process of speech recognition is discussed, and speech recognition results are presented.
HARTFEX: A Multi-Dimensional System of HMM Based Recognisers for Articulatory Features Extraction
Tarek Abu-Amer, Julie Carson-Berndsen; University College Dublin, Ireland
HARTFEX is a novel system that employs several tiers of HMM recognisers working in parallel to extract multiple dimensions of articulatory features. The feature segments on the different tiers overlap to account for co-articulation phenomena. The overlap and precedence relations among features are passed to a phonological parser for further processing. The HARTFEX system is built on a modified version of the HTK toolkit that allows it to perform multi-thread, multi-feature recognition. The system's test results are highly promising: recognition accuracy is 98% for the vowel feature and 93% for the rhotic feature. Current work investigates the inherent interdependencies of extracting different feature sets.
Automatic Baseform Generation from Acoustic Data
Benoît Maison; IBM T.J. Watson Research Center, USA
We describe two algorithms for generating pronunciation networks from acoustic data. One is based on raw phonetic recognition; the other uses the spelling of the words and the identification of their language of origin as guides. In both cases, a pruning and voting procedure distills the noisy phonetic sequences into pronunciation networks. Recognition experiments on two large grammar-based test sets show a reduction in sentence error rate of between 2% and 14%, and in word error rate of between 3% and 23%, when the learned baseforms are added to our baseline lexicons.
Data-Driven Pronunciation Modeling for ASR Using Acoustic Subword Units
Thurid Spiess 1, Britta Wrede 2, Gernot A. Fink 1, Franz Kummert 1; 1 Universität Bielefeld, Germany; 2 International Computer Science Institute, USA
We describe a method for modeling pronunciation variation for ASR in a data-driven way, namely by using automatically derived acoustic subword units. The inventory of units is designed to produce maximally separable pronunciation variants of words while training only the most important variants for the particular application. In doing so, the optimal number of variants per word is determined iteratively. All this is accomplished (almost) fully automatically by means of a state-splitting algorithm and a variant distance measure. Compared to a baseline system using triphones as subword units and minimal pronunciation variants, this method achieved a relative word error rate improvement of 10%.
Session: SThBb– Oral
Time is of the Essence - Dynamic Approaches to Spoken Language
Time: Thursday 10.00, Venue: Room 2
Chair: Steve Greenberg, ICSI, USA
Time is of the Essence – Dynamic Approaches to Spoken Language
Steven Greenberg; The Speech Institute, USA
Temporal dynamics provide a fruitful framework with which to examine the relation between information and spoken language. This paper serves as an introduction to the special Eurospeech session on “Time is of the Essence – Dynamic Approaches to Spoken Language,” providing historical and conceptual background germane to timing, as well as a discussion of its scientific and technological prospects. Dynamics is examined from the perspectives of perception, production, neurology, synthesis, recognition, and coding, in an effort to define a prospective course for speech technology and research.
Spectro-Temporal Interactions in Auditory and Auditory-Visual Speech Processing
Ken W. Grant 1, Steven Greenberg 2; 1 Walter Reed Army Medical Center, USA; 2 The Speech Institute, USA
Speech communication often involves face-to-face interaction between two or more individuals. The combined influence of auditory and visual speech information leads to a remarkably robust signal that is highly resistant to noise, reverberation, hearing loss, and other forms of signal distortion. Studies of auditory-visual speech processing have revealed that speechreading interacts with audition in both the spectral and temporal domains. For example, not all speech frequencies are equal in their ability to supplement speechreading: low-frequency speech cues provide more benefit than high-frequency cues. Additionally, in contrast to auditory speech processing, which integrates information across frequency over relatively short time windows (20-40 ms), auditory-visual speech processing appears to use relatively long integration windows (roughly 250 ms). In this paper, some of the basic spectral and temporal interactions between the auditory and visual speech channels are enumerated and discussed.
Brain Imaging Correlates of Temporal Quantization in Spoken Language
David Poeppel; University of Maryland, USA
Psychophysical research has established that temporal-integration windows of several different sizes are critical for the analysis of any acoustic speech signal. Recent work from our laboratory has examined speech processing in the human auditory cortex using both hemodynamic (fMRI, PET) and electromagnetic (MEG, EEG) recording techniques. These studies provide evidence for at least two distinct temporal scales relevant to the integration and processing of speech at the cortical level: a relatively short window of 25-50 ms and a longer window of 150-300 ms. In addition to support for processing on these time scales, there is also evidence for hemispheric asymmetry in temporal quantization. Left auditory cortex shows enhanced sensitivity to rapid temporal changes (possibly associated with segmental and subsegmental perceptual analysis), while right auditory cortex is more sensitive to slower changes (possibly associated with syllabic-rate processing and pitch dynamics).
Temporal Aspects of Articulatory Control
Elliot Saltzman; Boston University, USA
This contribution focuses on temporal aspects of articulatory control during the production of speech. We review a set of computational and experimental results concerning intragestural,
transgestural, and intergestural timing properties. The computational results are based on recent developments of the task-dynamic
model of gestural patterning. These developments are focused on
the shaping and relative timing of gestural activations, and on the
manner in which relative timing among gestures can be interpreted
and modeled in the context of systems of coupled nonlinear oscillators. Emphasis is placed on dynamical accounts of prosodic
boundary influences on gestural activation patterns, and the manner in which intergestural coupling structures shape the timing patterns and stability properties of onset and coda clusters.
The Temporal Organisation of Speech as Gauged by Speech Synthesis
Brigitte Zellner Keller; Université de Lausanne, Switzerland
The simulation of speech by means of speech synthesis involves, among other things, the ability to mimic typical delivery for different speech styles. This requires a realistic imitation of the manner in which speakers organize their information flow in time (i.e., word grouping boundaries), as well as of their speech rate and its variations. The originality of our model lies at two levels. First, it is assumed that the temporal component plays a dominant role in the simulation of speech rhythm, whereas in traditional language models temporal issues are mostly set aside. Second, the outcome of our temporal modeling, based on statistical analysis and qualitative parameters, results from the harmonization of various layers (segmental, syllabic, phrasal). The benefit of a multidimensional model is the possibility of imposing subtle quantitative and qualitative effects at various levels, which is key to respecting a specific language system as well as speech coherence and fluency for different speech styles.
Localized Spectro-Temporal Features for Automatic Speech Recognition
Michael Kleinschmidt; Universität Oldenburg, Germany
Recent results from physiological and psychoacoustic studies indicate that spectrally and temporally localized time-frequency envelope patterns form a relevant basis of auditory perception. This motivates new approaches to feature extraction for automatic speech recognition (ASR) which utilize two-dimensional spectro-temporal modulation filters. The paper provides a motivation and a brief overview of work related to Localized Spectro-Temporal Features (LSTF). It further focuses on the Gabor feature approach, where a feature selection scheme is applied to automatically obtain a suitable set of Gabor-type features for a given task. The optimized feature sets are examined in ASR experiments with respect to robustness, and their statistical properties are analyzed.
Modulation Spectral Filtering of Speech
Les Atlas; University of Washington, USA
Recent auditory physiological evidence points to a modulation frequency dimension in the auditory cortex. This dimension exists jointly with the tonotopic acoustic frequency dimension. Thus, audition can be considered as a relatively slowly varying two-dimensional representation, the “modulation spectrum,” where the first dimension is the well-known acoustic frequency and the second is modulation frequency. We have recently developed a fully invertible analysis/synthesis approach for this modulation spectral transform. A general application of this approach is the removal or modification of different modulation frequencies in audio or speech signals, which, for example, causes major changes in perceived dynamic character. A specific application of this modification is single-channel multiple-talker separation.
Session: OThBc– Oral
Topics in Speech Recognition
Time: Thursday 10.00, Venue: Room 3
Chair: Sadaoki Furui, Tokyo Inst. of Technology, Japan
A Comparison of the Data Requirements of Automatic Speech Recognition Systems and Human Listeners
Roger K. Moore; 20/20 Speech Ltd., U.K.
Since the introduction of hidden Markov modelling there has been an increasing emphasis on data-driven approaches to automatic speech recognition. This derives from the fact that systems trained on substantial corpora readily outperform those that rely more on phonetic or linguistic priors. Similarly, extra training data almost always results in a reduction in word error rate: “there's no data like more data”. However, despite this progress, contemporary systems are not able to fulfil the requirements demanded by many potential applications, and performance still falls significantly short of the capabilities exhibited by human listeners. For these reasons, the R&D community continues to call for ever greater quantities of data with which to train its systems. This paper addresses the issue of just how much data might be required to bring the performance of an automatic speech recognition system up to that of a human listener.
Modeling Linguistic Features in Speech Recognition
Min Tang, Stephanie Seneff, Victor W. Zue;
Massachusetts Institute of Technology, USA
This paper explores a new approach to speech recognition in which
sub-word units are modeled in terms of linguistic features. Specifically, we have adopted a scheme of modeling separately the manner and place of articulation for these units. A novelty of our work
is the use of a generalized definition of place of articulation that
enables us to map both vowels and consonants into a common linguistic space. Modeling manner and place separately also allows
us to explore a multi-stage recognition architecture, in which the
search space is successively reduced as more detailed models are
brought in. In the 8,000 word PhoneBook isolated word telephone
speech recognition task, we show that such an approach can achieve
a word error rate 10% better than the best results reported in the literature. This performance gain comes with
reductions in search space and computation time as well.
Impact of Audio Segmentation and Segment
Clustering on Automated Transcription Accuracy
of Large Spoken Archives
Bhuvana Ramabhadran, Jing Huang, Upendra
Chaudhari, Giridharan Iyengar, Harriet J. Nock; IBM
T.J. Watson Research Center, USA
This paper addresses the influence of audio segmentation and segment clustering on automatic transcription accuracy for large spoken archives. The work forms part of the ongoing MALACH project,
which is developing advanced techniques for supporting access to
the world’s largest digital archive of video oral histories collected
in many languages from over 52000 survivors and witnesses of the
Holocaust. We present several audio-only and audio-visual segmentation schemes, including two novel schemes: the first is iterative
and audio-only, the second uses audio-visual synchrony. Unlike
most previous work, we evaluate these schemes in terms of their
impact upon recognition accuracy. Results on English interviews
show the automatic segmentation schemes give performance comparable to (exorbitantly expensive and impractically lengthy) manual segmentation when using a single pass decoding strategy based
on speaker-independent models. However, when using a multiple
pass decoding strategy with adaptation, results are sensitive to both
initial audio segmentation and the scheme for clustering segments
prior to adaptation: the combination of our best automatic segmentation and clustering scheme has an error rate 8% worse (relative)
than manual audio segmentation and clustering, due to the occurrence
of “speaker-impure” segments.
Eurospeech 2003
Thursday
September 1-4, 2003 – Geneva, Switzerland

Learning Linguistically Valid Pronunciations from Acoustic Data

Françoise Beaufays, Ananth Sankar, Shaun Williams, Mitch Weintraub; Nuance Communications, USA

We describe an algorithm to learn word pronunciations from acoustic data. The algorithm jointly optimizes the pronunciation of a word using (a) the acoustic match of this pronunciation to the observed data, and (b) how “linguistically reasonable” the pronunciation is. Variations of word pronunciations in the recognition dictionary (which was created by linguists) are used to train a model of whether new hypothesized pronunciations are reasonable or not. The algorithm is well suited to proper name pronunciation learning. Experiments on a corporate name dialing database show a 40% error rate reduction with respect to a letter-to-phone pronunciation engine.

Improvement of Non-Native Speech Recognition by Effectively Modeling Frequently Observed Pronunciation Habits

Nobuaki Minematsu, Koichi Osaki, Keikichi Hirose; University of Tokyo, Japan

In this paper, two techniques are proposed to enhance non-native (Japanese English) speech recognition performance. The first technique effectively integrates the orthographic representation of a phoneme as an additional context in state clustering when training tied-state triphones. Non-native speakers often learned the target language not through their ears but through their eyes, and it is reasonable to assume that their pronunciation of a phoneme may depend upon its grapheme. Here, the correspondence between a vowel and its grapheme is automatically extracted and used as an additional context in the state clustering. The second technique elaborately couples a Japanese English acoustic model and a Japanese Japanese model to make a parallel model. When using triphones, the mapping between the two models should be carefully trained because the phoneme sets of the two models differ. Here, several phoneme recognition experiments are done to induce the mapping, and based upon the mapping, a tentative method of coupling is examined. Results of LVCSR experiments show the high validity of both proposed methods.

Non-Audible Murmur Recognition

Yoshitaka Nakajima 1, Hideki Kashioka 1, Kiyohiro Shikano 1, Nick Campbell 2; 1 Nara Institute of Science and Technology, Japan; 2 ATR-HIS, Japan

We propose a new style of practical input interface for the recognition of non-audible murmur (NAM), i.e., for the recognition of inaudible speech produced without vibration of the vocal folds. We developed a microphone attachment which adheres to the skin, applying the principle of a medical stethoscope; found the ideal position for sampling flesh-conducted NAM sound vibration; and retrained an acoustic model with NAM samples. Then, using the Julius Japanese Dictation Toolkit, we tested the possibilities for practical use of this method in place of an external microphone for analyzing air-conducted voice sound. Additionally, we propose the laryngeal elevation index (LEI), a new index of prosody which can show the prosody of NAM without F0, using simple processing of images from medical ultrasonography. We characterize NAM, which has never before been used for input or communication, and propose making use of it as an interface for human-human and human-machine communication.

Session: OThBd – Oral
Acoustic Modelling II
Time: Thursday 10.00, Venue: Room 4
Chair: John Hansen, Colorado Univ., USA

Variable Length Mixtures of Inverse Covariances

Vincent Vanhoucke 1, Ananth Sankar 2; 1 Stanford University, USA; 2 Nuance Communications, USA

The mixture of inverse covariances model is a low-complexity, approximate decomposition of the inverse covariance matrices in a Gaussian mixture model which achieves high modeling accuracy with very good computational efficiency. In this model, the inverse covariances are decomposed into a linear combination of K shared prototype matrices. In this paper, we introduce an extension of this model which uses a variable number of prototypes per Gaussian for improved efficiency. The number of prototypes per Gaussian is optimized using a maximum likelihood criterion. This variable length model is shown to achieve significantly better accuracy at a given complexity level on several speech recognition tasks.

Semi-Tied Full Deviation Matrices for Laplacian Density Models

Christoph Neukirchen; Philips Research Laboratories, Germany

The Philips speech recognition system uses mixtures of Laplacian densities with diagonal deviations to model acoustic feature vectors. Such an approach neglects the correlations between different feature components that typically exist in the acoustic vectors. This paper extends the conventional Laplacian approach to model the between-feature interdependencies explicitly. These extensions lead either to a full deviation matrix model or to an integrated feature space transformation similar to the semi-tied covariances for Gaussian densities. Both methods can be efficiently implemented by exploiting a strong tying of the feature transformations and the deviation matrices, respectively. The novel approach is evaluated on two different digit string recognition tasks.

Acoustic Modeling with Mixtures of Subspace Constrained Exponential Models

Karthik Visweswariah, Scott Axelrod, Ramesh Gopinath; IBM T.J. Watson Research Center, USA

Gaussian distributions are usually parameterized with their natural parameters: the mean µ and the covariance Σ. They can also be re-parameterized as exponential models with canonical parameters P = Σ⁻¹ and ψ = Pµ. In this paper we consider modeling acoustics with mixtures of Gaussians parameterized with canonical parameters, where the parameters are constrained to lie in a shared affine subspace. This class of models includes Gaussian models with various constraints on their parameters: diagonal covariances, MLLT models, and the recently proposed EMLLT and SPAM models. We describe how to perform maximum likelihood estimation of the subspace, and of the parameters within a fixed subspace. In speech recognition experiments, we show that this model improves upon all of the above classes of models with roughly the same number of parameters and with little computational overhead. In particular, we get a 30-40% relative improvement over LDA+MLLT models when using roughly the same number of parameters.
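As an illustrative aside (not the authors' code), the reparameterization the abstract refers to, P = Σ⁻¹ and ψ = Pµ, is a simple round trip between the natural and canonical parameterizations of a Gaussian; the function names below are our own:

```python
import numpy as np

def to_canonical(mu, sigma):
    """Natural (mu, Sigma) -> canonical (P, psi), with P = Sigma^-1, psi = P mu."""
    P = np.linalg.inv(sigma)
    return P, P @ mu

def to_natural(P, psi):
    """Canonical (P, psi) -> natural (mu, Sigma)."""
    sigma = np.linalg.inv(P)
    return sigma @ psi, sigma

# Round trip on a toy 2-D Gaussian.
mu = np.array([1.0, -2.0])
sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
P, psi = to_canonical(mu, sigma)
mu2, sigma2 = to_natural(P, psi)
assert np.allclose(mu2, mu) and np.allclose(sigma2, sigma)
```

The subspace constraint in the paper then restricts (P, ψ) across all mixture components to a shared affine subspace, which this sketch does not attempt to reproduce.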
Discriminative Estimation of Subspace Precision
and Mean (SPAM) Models
Vaibhava Goel, Scott Axelrod, Ramesh Gopinath,
Peder A. Olsen, Karthik Visweswariah; IBM T.J.
Watson Research Center, USA
The SPAM model was recently proposed as a very general method
for modeling Gaussians with constrained means and covariances.
It has been shown to yield significant error rate improvements over
other methods of constraining covariances such as diagonal, semitied covariances, and extended maximum likelihood linear transformations. In this paper we address the problem of discriminative
estimation of SPAM model parameters, in an attempt to further improve its performance. We present discriminative estimation under
two criteria: maximum mutual information (MMI) and an “errorweighted” training. We show that both these methods individually
result in over 20% relative reduction in word error rate on a digit
task over maximum likelihood (ML) estimated SPAM model parameters. We also show that a gain of as much as 28% relative can be
achieved by combining these two discriminative estimation techniques. The techniques developed in this paper also apply directly
to an extension of SPAM called subspace constrained exponential
models.
Model-Integration Rapid Training Based on
Maximum Likelihood for Speech Recognition
Shinichi Yoshizawa 1 , Kiyohiro Shikano 2 ; 1 Matsushita
Electric Industrial Co. Ltd., Japan; 2 Nara Institute of
Science and Technology, Japan
Speech recognition technology is now widely used. Considering the cost of training an acoustic model, it is beneficial to reuse pre-existing acoustic models when building a suitable one for a given apparatus and application. However, a complex acoustic model designed for a high-power CPU does not work on a low-power CPU, and a simple model intended for fast processing does not work well for high-precision applications. Therefore, it is important to adjust model complexity, such as the number of Gaussian mixture components, according to the apparatus or application. This paper describes a new model-integration type of training for obtaining a required number of Gaussian mixture components. This training can convert the number of mixture components to the required one according to the specification of the apparatus or application. We propose model-integration rapid training based on maximum likelihood and successfully evaluate its recognition performance.
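The abstract does not give its training formulas; as a hedged illustration of one standard building block for reducing a mixture to a required size (not necessarily the paper's procedure), two weighted Gaussians can be merged by moment matching:

```python
def merge_gaussians(w1, mu1, var1, w2, mu2, var2):
    """Moment-matched merge of two weighted 1-D Gaussians into one.

    The merged component preserves the total weight, mean, and variance
    of the two-component mixture; repeating such merges shrinks a
    mixture toward a target number of components.
    """
    w = w1 + w2
    a1, a2 = w1 / w, w2 / w
    mu = a1 * mu1 + a2 * mu2
    # E[x^2] of the mixture minus the squared merged mean.
    var = a1 * (var1 + mu1 ** 2) + a2 * (var2 + mu2 ** 2) - mu ** 2
    return w, mu, var

# Equal-weight merge of N(0, 1) and N(2, 1): mean 1, variance 1 + 1 = 2.
w, mu, var = merge_gaussians(0.5, 0.0, 1.0, 0.5, 2.0, 1.0)
assert abs(mu - 1.0) < 1e-12 and abs(var - 2.0) < 1e-12
```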
On the Use of Kernel PCA for Feature Extraction in
Speech Recognition
Amaro Lima, Heiga Zen, Yoshihiko Nankaku, Chiyomi
Miyajima, Keiichi Tokuda, Tadashi Kitamura; Nagoya
Institute of Technology, Japan
This paper describes an approach for feature extraction in speech
recognition systems using kernel principal component analysis
(KPCA). The approach represents speech features as the projection, onto the principal components, of the extracted features after a nonlinear mapping into a feature space. The nonlinear mapping is performed implicitly using the kernel trick, a useful way of avoiding an explicit mapping of the input space into the feature space, which makes the computation feasible. Better results were obtained with this approach than with the standard technique.
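Kernel PCA itself is a standard algorithm; a minimal sketch (with an RBF kernel chosen here for illustration, not taken from the paper) that forms only the Gram matrix, centers it in feature space, and projects onto the top components:

```python
import numpy as np

def kernel_pca(X, n_components, gamma=1.0):
    """Project rows of X onto the top principal components in an RBF
    kernel feature space. Only the Gram matrix K is ever formed; the
    nonlinear mapping itself is never computed (the kernel trick)."""
    sq = np.sum(X ** 2, axis=1)
    K = np.exp(-gamma * (sq[:, None] + sq[None, :] - 2 * X @ X.T))
    n = K.shape[0]
    one = np.full((n, n), 1.0 / n)
    Kc = K - one @ K - K @ one + one @ K @ one   # center in feature space
    vals, vecs = np.linalg.eigh(Kc)              # ascending eigenvalues
    idx = np.argsort(vals)[::-1][:n_components]  # pick the largest ones
    alphas = vecs[:, idx] / np.sqrt(vals[idx])   # normalized expansion coefficients
    return Kc @ alphas

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))
Z = kernel_pca(X, n_components=3)
assert Z.shape == (20, 3)
```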
Who Knows Carl Bildt? – And What if You don’t?
Elisabeth Zetterholm 1 , Kirk P.H. Sullivan 2 , James
Green 3 , Erik Eriksson 2 , Jan van Doorn 2 , Peter E.
Czigler 4 ; 1 Lund University, Sweden; 2 Umeå
University, Sweden; 3 University of Otago, New
Zealand; 4 Örebro University, Sweden
One problem with using speaker identification by witnesses in legal
settings is that high quality imitations can result in speaker misidentification. A recent series of experiments has looked at listener acceptance of an imitation of a well known Swedish politician. Results
showed that listener expectation of the topic of an imitated passage impacts on the acceptance or rejection of the imitation. The
strength of that impact varied according to various listener characteristics, including age of listener. It is likely that age reflected
the degree of familiarity with the voice that was being imitated. The
present study has reanalyzed the data from Swedish listeners in the
previous studies to look at performance according to self reports
of whether the listeners were familiar with the politician. Results
showed that the acceptance of the imitation by those listeners who reported knowing the politician was more influenced by the topic of the imitated passage than was that of listeners who reported not knowing him. Implications of this finding with regard to listeners’ choice of
alternate voices in the line up are discussed.
Improving the Competitiveness of Discriminant
Neural Networks in Speaker Verification
C. Vivaracho-Pascual 1 , J. Ortega-Garcia 2 , L.
Alonso-Romero 3 , Q. Moro-Sancho 1 ; 1 Universidad de
Valladolid, Spain; 2 Universidad Politécnica de Madrid,
Spain; 3 Universidad de Salamanca, Spain
The Artificial Neural Network (ANN) Multilayer Perceptron (MLP) has shown good performance levels as a discriminant system in text-independent Speaker Verification (SV) tasks, as shown in our work presented at Eurospeech 2001. In this paper, substantial improvements with regard to that reference architecture are described.
Firstly, a new heuristic method for selecting the impostors in the
ANN training process is presented, eliminating the random nature
of the system behaviour introduced by the traditional random selection. The use of the proposed selection method, together with
an improvement in the classification stage based on a selective use
of the network outputs to calculate the final sample score, and an
optimisation of the MLP learning coefficient, allow an improvement
of over 35% with regard to our reference system, reaching a final
EER of 13% over the NIST-AHUMADA database. These promising results show that the MLP as a discriminant system can be competitive with GMM-based SV systems.
Session: PThBe – Poster
Speaker & Language Recognition
Time: Thursday 10.00, Venue: Main Hall, Level -1
Chair: Larry Heck, Nuance Communication, USA

Speaker Modeling from Selected Neighbors Applied to Speaker Recognition

Yassine Mami, Delphine Charlet; France Télécom R&D, France

This paper addresses the estimation of a speaker GMM through the selection and merging of a set of neighbor models for that speaker. The selection of the neighbor models is based on the likelihood score of the training data on a set of potential neighbor GMMs. Once the neighbor models are selected, they are merged to give a model of the speaker, which can also be used as an a priori model for an adaptation phase. Experiments show that merging neighborhood models captures significant information about the speaker but does not improve significantly on the classical UBM-adapted GMM.

On the Fusion of Dissimilarity-Based Classifiers for Speaker Identification

Tomi Kinnunen, Ville Hautamäki, Pasi Fränti; University of Joensuu, Finland

In this work, we describe a speaker identification system that uses multiple supplementary information sources to compute a combined match score for the unknown speaker. Each speaker profile in the database consists of multiple feature vector sets that can vary in their scale, dimensionality, and number of vectors. The evidence from a given feature set is weighted by its reliability, which is set in an a priori fashion. The confidence of the identification result is also estimated. The system is evaluated with a corpus of 110 Finnish speakers. The evaluated feature sets include mel-cepstrum, LPC-cepstrum, dynamic cepstrum, the long-term averaged spectrum of the /A/ vowel, and F0.
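A minimal sketch of the reliability-weighted score fusion described above (the weights, scores, and speaker labels here are invented for illustration; the paper's distance measures are not reproduced):

```python
def fuse_scores(score_sets, weights):
    """Combine per-feature-set match scores for each enrolled speaker
    into a single weighted score. Weights are fixed a priori, as in the
    abstract; smaller scores mean a better match here."""
    speakers = score_sets[0].keys()
    return {
        spk: sum(w * s[spk] for w, s in zip(weights, score_sets))
        for spk in speakers
    }

# Two feature sets scoring three enrolled speakers.
mfcc = {"A": 0.2, "B": 0.9, "C": 0.5}
f0   = {"A": 0.6, "B": 0.1, "C": 0.4}
fused = fuse_scores([mfcc, f0], weights=[0.7, 0.3])
best = min(fused, key=fused.get)
assert best == "A"   # 0.7*0.2 + 0.3*0.6 = 0.32 beats B (0.66) and C (0.47)
```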
Robust Speaker Identification Using Posterior
Union Models
Ji Ming 1 , Darryl Stewart 1 , Philip Hanna 1 , Pat Corr 1 ,
Jack Smith 1 , Saeed Vaseghi 2 ; 1 Queen’s University
Belfast, U.K.; 2 Brunel University, U.K.
This paper investigates the problem of speaker identification in
noisy conditions, assuming that there is no prior knowledge about
the noise. To confine the effect of the noise on recognition, we use a
multi-stream approach to characterize the speech signal, assuming
that while all of the feature streams may be affected by the noise,
there may be some streams that are less severely affected and thus
still provide useful information about the speaker. Recognition decisions are based on the feature streams that are uncontaminated
or least contaminated, thereby reducing the effect of the noise on
recognition. We introduce a novel statistical method, the posterior
union model, for selecting reliable feature streams. An advantage
of the union model is that knowledge of the structure of the noise
is not needed, thereby providing robustness to time-varying unpredictable noise corruption. We have tested the new method on the
TIMIT database with additive corruption from real-world nonstationary noise; the results obtained are encouraging.
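One illustrative reading of the union idea (our simplified sketch, not the paper's posterior formulation): score a speaker by summing, over all subsets of the feature streams of a given size, the product of per-stream likelihoods, so that a single badly corrupted stream cannot veto the decision.

```python
from itertools import combinations
from math import prod

def union_score(stream_likelihoods, order):
    """Sum over all size-'order' subsets of the streams of the product
    of per-stream likelihoods. A corrupted stream only hurts the terms
    it participates in; the clean-stream terms still dominate."""
    return sum(prod(subset)
               for subset in combinations(stream_likelihoods, order))

# Three streams; stream 2 is noise-corrupted (tiny likelihood for everyone).
true_speaker  = [0.8, 0.001, 0.7]
wrong_speaker = [0.3, 0.002, 0.2]
# An order-2 union lets the two clean streams carry the decision.
assert union_score(true_speaker, 2) > union_score(wrong_speaker, 2)
```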
“Syncpitch”: A Pseudo Pitch Synchronous
Algorithm for Speaker Recognition
Ran D. Zilca, Jiří Navrátil, Ganesh N. Ramaswamy;
IBM T.J. Watson Research Center, USA
Pitch mismatch between enrollment and testing is a common problem in speaker recognition systems. It is well known that the fine
spectral structure related to fundamental frequency manifests itself in Mel cepstral features used for speaker recognition. Therefore pitch variations result in variation of the acoustic features, and
potentially an increase in error rate. A previous study introduced
a signal processing procedure termed depitch that attempts to remove pitch information from the speech signal by forcing every
speech frame to be pitch synchronous and include a single pitch
cycle. This paper presents a modification of the depitch algorithm,
termed syncpitch, that performs pseudo pitch synchronous processing while still preserving the pitch information. The new algorithm
has a relatively moderate effect on the speech signal. System combination of syncpitch with a baseline system is shown to improve
speaker verification accuracy in experiments conducted on the 2002
NIST Speaker Recognition Evaluation data.
A Method for On-Line Speaker Indexing Using
Generic Reference Models
Soonil Kwon, Shrikanth Narayanan; University of
Southern California, USA
On-line speaker indexing is useful for multimedia applications such as meeting or teleconference archiving and browsing. It sequentially detects the points where the speaker identity changes in a multi-speaker audio stream and classifies each speaker segment. The main problem of on-line processing is that only current and previous information in the data stream can be used for any decision. To address this difficulty, we apply a predetermined reference set of speaker-independent models. This set is useful for more accurate speaker modeling and clustering without actual training of target speaker models. Once a speaker-independent model is selected from the reference set, it is progressively adapted into a speaker-dependent model. Experiments were performed with the HUB-4 Broadcast News Evaluation English Test Material (1999) and the Speaker Recognition Benchmark NIST Speech (1999). Results showed that our new technique gave 96.5% indexing accuracy on a telephone conversation data source and 84.3% accuracy on a broadcast news source.
Discriminative Training and Maximum Likelihood
Detector for Speaker Identification
M. Mihoubi, Gilles Boulianne, Pierre Dumouchel;
CRIM, Canada
This article describes a new approach to discriminating cues between speakers, applied to a speaker identification task. To this end, we make use of elements of decision theory. We propose to decompose the conventional feature space (MFCCs) into two subspaces which carry information about discriminative and confusable sections of the speech signal. The method is based on the idea that, instead of adapting the speaker models to a new test environment, we require the test utterance to fit the speaker models' environment. Discriminative sections of training speech are used to estimate the probability density function (pdf) of a discriminative world model (DM), and confusable sections to estimate the pdf of a confusion world model (CM). The two models are then used as a maximum likelihood detector (filter) at the input of the recogniser. The method was evaluated on highly mismatched telephone speech and achieves a considerable improvement (averaging a 16% gain in performance) over the baseline GMM system.
Novel Approaches for One- and Two-Speaker
Detection
Sachin S. Kajarekar 1 , André G. Adami 2 , Hynek
Hermansky 2 ; 1 SRI International, USA; 2 Oregon
Health & Science University, USA
This paper reviews the OGI submission to the NIST 2002 speaker recognition evaluation. It describes the systems submitted for the one- and two-speaker detection tasks and the post-evaluation improvements. For the one-speaker detection system, we present a new design of a data-driven temporal filter and show that using a few broad phonetic categories improves the performance of the speaker recognition system. In post-evaluation experiments, we show that combinations with complementary features and modeling techniques significantly improve the performance of the GMM-based system. For the two-speaker detection system, we present a structured approach to detecting the speakers in a conversation.
Fusing High- and Low-Level Features for Speaker
Recognition
Joseph P. Campbell, Douglas A. Reynolds, Robert B.
Dunn; Massachusetts Institute of Technology, USA
The area of automatic speaker recognition has been dominated by
systems using only short-term, low-level acoustic information, such
as cepstral features. While these systems have produced low error rates, they ignore higher levels of information beyond low-level
acoustics that convey speaker information. Recently published
works have demonstrated that such high-level information can be
used successfully in automatic speaker recognition systems by improving accuracy and potentially increasing robustness. Wide ranging high-level-feature-based approaches using pronunciation models, prosodic dynamics, pitch gestures, phone streams, and conversational interactions were explored and developed under the SuperSID project at the 2002 JHU CLSP Summer Workshop (WS2002):
http://www.clsp.jhu.edu/ws2002/groups/supersid/. In this paper,
we show how these novel features and classifiers provide complementary information and can be fused together to drive down the
equal error rate on the 2001 NIST Extended Data Task to 0.2% – a
71% relative reduction in error over the previous state of the art.
Score Normalisation Applied to Open-Set,
Text-Independent Speaker Identification
P. Sivakumaran 1 , J. Fortuna 2 , Aladdin M.
Ariyaeeinia 2 ; 1 20/20 Speech Ltd., U.K.; 2 University of
Hertfordshire, U.K.
This paper presents an investigation into the relative effectiveness of various score normalisation methods for open-set, textindependent speaker identification. The paper describes the need
for score normalisation in this case, and provides a detailed theoretical and experimental analysis of the methods that can be used
for this purpose. The experimental investigations are based on the
use of speech material drawn from 9 hours of recordings of different Broadcast News. The results clearly demonstrate the significant improvement offered by score normalisation. It is shown
that, amongst various normalisation methods considered, the unconstrained cohort normalisation method achieves the best performance in terms of reducing the errors associated with the open-set
nature of the process. Furthermore, it is demonstrated that both
the cohort and world model methods can offer very similar effectiveness, and also outperform the T-norm method in this particular
case of speaker recognition.
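For reference, the T-norm mentioned above standardizes a raw verification score against the score distribution that a cohort of impostor models produces on the same test utterance; a minimal sketch (cohort scores invented for illustration):

```python
import statistics

def t_norm(raw_score, cohort_scores):
    """T-norm: subtract the cohort mean and divide by the cohort
    standard deviation, so scores from different utterances become
    comparable against a common threshold."""
    mu = statistics.mean(cohort_scores)
    sd = statistics.pstdev(cohort_scores)
    return (raw_score - mu) / sd

cohort = [1.0, 1.2, 0.8, 1.1, 0.9]   # cohort mean is 1.0
assert abs(t_norm(1.0, cohort)) < 1e-12   # a score at the cohort mean maps to 0
assert t_norm(2.0, cohort) > 0            # scores above the cohort map positive
```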
On the Number of Gaussian Components in a
Mixture: An Application to Speaker Verification
Tasks
Mijail Arcienega, Andrzej Drygajlo; EPFL, Switzerland
Despite all advances in the speaker recognition domain, Gaussian
Mixture Models (GMM) remain the state-of-the-art modeling technique in speaker recognition systems. The key idea is to approximate the probability density function (pdf) of the feature vectors
associated to a speaker with a weighted sum of Gaussian densities.
Although the extremely efficient Expectation-Maximization (EM) algorithm can be used for estimating the parameters associated with
this Gaussian mixture, there is no explicit method for predicting the
best number of Gaussian components in the mixture (also called the order of the model). This paper presents an attempt to determine the “optimal” number of components for a given feature database.
Session: PThBf – Poster
Robust Speech Recognition III
Time: Thursday 10.00, Venue: Main Hall, Level -1
Chair: Nelson Morgan, ICSI and UC Berkeley, USA

Assessment of Dereverberation Algorithms for Large Vocabulary Speech Recognition Systems

Koen Eneman, Jacques Duchateau, Marc Moonen, Dirk Van Compernolle, Hugo Van hamme; Katholieke Universiteit Leuven, Belgium

The performance of large vocabulary recognition systems, for instance in a dictation application, typically deteriorates severely when used in a reverberant environment. This can be partially avoided by adding a dereverberation algorithm as a speech signal preprocessing step. The purpose of this paper is to compare the effect of different speech dereverberation algorithms on the performance of a recognition system. Experiments were conducted on the Wall Street Journal dictation benchmark. Reverberation was added to the clean acoustic data in the benchmark both by simulation and by re-recording the data in a reverberant room. Moreover, additive noise was added to investigate its effect on the dereverberation algorithms. We found that dereverberation based on a delay-and-sum beamforming algorithm has the best performance of the investigated algorithms.

Using Accent Information in ASR Models for Swedish

Giampiero Salvi; KTH, Sweden

In this study accent information is used in an attempt to improve acoustic models for automatic speech recognition (ASR). First, accent-dependent Gaussian models were trained independently. The Bhattacharyya distance was then used in conjunction with agglomerative hierarchical clustering to define optimal strategies for merging those models. The resulting allophonic classes were analyzed and compared with the phonetic literature. Finally, accent “aware” models were built, in which the parametric complexity for each phoneme corresponds to the degree of variability across accent areas and to the amount of training data available for it. These models were compared to models with the same, but evenly spread, overall complexity, showing in some cases a slight improvement in recognition accuracy.
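The Bhattacharyya distance between Gaussians, used in the Salvi abstract above to drive the agglomerative merging of accent-dependent models, has a closed form; this sketch (our illustration, not the paper's code) computes it for two multivariate Gaussians:

```python
import numpy as np

def bhattacharyya(mu1, S1, mu2, S2):
    """Closed-form Bhattacharyya distance between N(mu1, S1) and
    N(mu2, S2): a mean-separation term plus a covariance-mismatch term."""
    S = 0.5 * (S1 + S2)
    d = mu1 - mu2
    term1 = 0.125 * d @ np.linalg.solve(S, d)
    term2 = 0.5 * np.log(np.linalg.det(S) /
                         np.sqrt(np.linalg.det(S1) * np.linalg.det(S2)))
    return term1 + term2

I = np.eye(2)
zero = np.zeros(2)
# Identical Gaussians are at distance 0; separating the means increases it.
assert abs(bhattacharyya(zero, I, zero, I)) < 1e-12
assert bhattacharyya(zero, I, np.array([1.0, 0.0]), I) > 0
```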
Estimating Japanese Word Accent from Syllable Sequence Using Support Vector Machine

Hideharu Nakajima, Masaaki Nagata, Hisako Asano, Masanobu Abe; NTT Corporation, Japan

This paper proposes two methods that estimate, from the word reading (syllable sequence), the place in the word where the accent should fall (hereafter, the “accent type”). Both methods use a statistical classifier: one directly estimates the accent type, and the other first estimates tone high and low labels and then decides the accent type from the obtained tone label sequence. Experiments show that both offer high accuracy in estimating the accent type of Japanese proper names without the use of linguistic knowledge.

Analysis and Compensation of Packet Loss in Distributed Speech Recognition Using Interleaving

Ben P. Milner, A.B. James; University of East Anglia, U.K.

The aim of this work is to improve the robustness of speech recognition systems operating under burst-like packet loss. First, a set of highly artificial packet loss profiles is used to analyse their effect on both recognition performance and the underlying feature vector stream. This indicates that the simple technique of vector repetition can make the recogniser robust to high percentages of packet loss, provided burst lengths are reasonably short. This leads to the proposal of interleaving the feature vector sequence, prior to packetisation, to disperse bursts of packet loss throughout the feature vector stream. Recognition results on the Aurora connected digits database show considerable accuracy gains across a range of packet losses and burst lengths. For example, at a packet loss rate of 50% with an average burst length of 4 packets (corresponding to 8 static vectors), performance is increased from 49.4% to 88.5%, with an increase in delay of 90 ms.

PPRLM Optimization for Language Identification in Air Traffic Control Tasks

R. Córdoba, G. Prime, J. Macías-Guarasa, J.M. Montero, J. Ferreiros, J.M. Pardo; Universidad Politécnica de Madrid, Spain

In this paper, we present the work done in language identification for two air traffic control speech recognizers, one for continuous speech and the other for a command interface. The system is able to distinguish between Spanish and English. We confirm the advantage of using PPRLM over PRLM. All previous studies show that PPRLM is the technique with the best performance despite its drawbacks: more processing time, and labeled data is needed. No work has been published regarding the optimum weights that should be given to the language models to optimize the performance of the language recognizer. This paper addresses this topic, providing three different approaches for weight selection in the language model score. We also show that a trigram language model improves performance. The final results are very good even with very short segments of speech.
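As a side note on the Milner & James interleaving idea above (our illustration; their interleaver parameters are not specified here), a block interleaver writes frames row-wise into a matrix and transmits them column-wise, so a burst of consecutive packet losses lands on frames that are far apart after de-interleaving:

```python
def interleave(seq, depth):
    """Block interleaver: treat seq as 'depth' rows of equal width and
    read it out column by column. Assumes len(seq) divides evenly."""
    assert len(seq) % depth == 0
    width = len(seq) // depth
    return [seq[r * width + c] for c in range(width) for r in range(depth)]

def deinterleave(seq, depth):
    """Inverse of interleave: write columns back into row order."""
    width = len(seq) // depth
    out = [None] * len(seq)
    i = 0
    for c in range(width):
        for r in range(depth):
            out[r * width + c] = seq[i]
            i += 1
    return out

frames = list(range(8))
tx = interleave(frames, depth=2)
assert tx == [0, 4, 1, 5, 2, 6, 3, 7]   # a burst hitting (4, 1) is dispersed
assert deinterleave(tx, depth=2) == frames
```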
Non-Linear Compression of Feature Vectors Using
Transform Coding and Non-Uniform Bit Allocation
Ben P. Milner; University of East Anglia, U.K.
This paper uses transform coding for compressing feature vectors
in distributed speech recognition applications. Feature vectors are
first grouped together into non-overlapping blocks and a transformation applied. A non-uniform allocation of bits to the elements of
the resultant matrix is based on their relative information content.
Analysis of the amplitude distribution of these elements indicates
that non-linear quantisation is more appropriate than linear quantisation. Comparative results, based on speech recognition accuracy,
confirm this. RASTA filtering is also considered and is shown to reduce the temporal variation of the feature vector stream. Recognition tests demonstrate that compression to bit rates of 2400 bps, 1200 bps and 800 bps has very little effect on recognition accuracy for both clean and noisy speech. For example, at a bit rate of 1200 bps, recognition accuracy is 98.0% compared to 98.6% with no compression.
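The non-uniform bit allocation the abstract describes can be illustrated with a simple greedy rule (a sketch under our own assumptions; the paper bases its allocation on relative information content, which is not reproduced here):

```python
def allocate_bits(variances, total_bits):
    """Greedy non-uniform bit allocation: repeatedly give one bit to the
    transform coefficient whose remaining (quantization) variance is
    largest; each extra bit cuts that variance by roughly a factor of 4
    (about 6 dB per bit for a scalar quantizer)."""
    var = list(variances)
    bits = [0] * len(var)
    for _ in range(total_bits):
        i = max(range(len(var)), key=lambda k: var[k])
        bits[i] += 1
        var[i] /= 4.0
    return bits

bits = allocate_bits([16.0, 4.0, 1.0], total_bits=4)
assert sum(bits) == 4
assert bits[0] >= bits[1] >= bits[2]   # high-variance coefficients get more bits
```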
Predictive Hidden Markov Model Selection for
Decision Tree State Tying
Jen-Tzung Chien 1 , Sadaoki Furui 2 ; 1 National Cheng
Kung University, Taiwan; 2 Tokyo Institute of
Technology, Japan
This paper presents a novel predictive information criterion (PIC)
for hidden Markov model (HMM) selection. The PIC criterion is
exploited to select the best HMMs, which provide the largest prediction information for generalization of future data. When the
randomness of HMM parameters is expressed by a product of conjugate prior densities, the prediction information is derived without integral approximation. In particular, a multivariate t distribution is attained to characterize the prediction information corresponding to HMM mean vector and precision matrix. When performing HMM selection in tree structure HMMs, we develop a top-down
prior/posterior propagation algorithm for estimation of structural
hyperparameters. The prediction information is accordingly determined so as to choose the best HMM tree model. The parameters of
chosen HMMs can be rapidly computed via maximum a posteriori
(MAP) estimation. In the evaluation of continuous speech recognition using decision tree HMMs, the PIC model selection criterion
performs better than conventional maximum likelihood and minimum description length criteria in building a compact tree structure
with a moderate tree size and a higher recognition rate.
Using Word Confidence Measure for OOV Words
Detection in a Spontaneous Spoken Dialog System
Hui Sun 1, Guoliang Zhang 1, Fang Zheng 2, Mingxing Xu 1; 1 Tsinghua University, China; 2 Beijing d-Ear Technologies Co. Ltd., China
Developing a real-life spoken dialogue system involves many practical issues, among which the out-of-vocabulary (OOV) word problem is one of the key difficulties. This paper presents the OOV detection mechanism, based on word confidence scoring, developed for the d-Ear Attendant system, a spontaneous spoken dialogue system. In the d-Ear Attendant system, an explicit filler model was originally used to detect the presence of OOV words [1]. Although this approach has a satisfactory OOV detection rate, it badly degrades the accuracy of in-vocabulary (IV) detection by 4.4% absolute (from 97% to 92.6%). Such degradation is not acceptable in a practical system. By using a few commonly used acoustic confidence features and some new context confidence features, our confidence measure method is not only able to detect word-level speech recognition errors, but also has a good ability to detect OOV words with an acceptable false alarm rate. For example, with a false rejection rate of 2.5%, a false acceptance rate of 26% is achieved.
Three Simultaneous Speech Recognition by
Integration of Active Audition and Face
Recognition for Humanoid
Kazuhiro Nakadai 1, Daisuke Matsuura 2, Hiroshi G. Okuno 3, Hiroshi Tsujino 4; 1 Japan Science and Technology Corporation, Japan; 2 Tokyo Institute of Technology, Japan; 3 Kyoto University, Japan; 4 Honda Research Institute Japan Co. Ltd., Japan
This paper addresses listening to three simultaneous talkers by a humanoid with two microphones. In such situations, sound separation and automatic speech recognition (ASR) of the separated speech are difficult, because the number of simultaneous talkers exceeds that of the microphones, the signal-to-noise ratio is quite low (around -3 dB), and the noise is not stable due to interfering voices. The humanoid audition system consists of sound separation, face recognition and ASR. Sound sources are separated by an active direction-pass filter (ADPF), which extracts sounds from a specified direction in real time. Since the features of sounds separated by the ADPF vary according to the sound direction, ASR uses multiple direction- and speaker-dependent acoustic models. The system integrates ASR results by using the sound direction and speaker information from face recognition, as well as confidence measures of the ASR results, to select the best one. The resulting system improves word recognition rates for three simultaneous utterances.
Mis-Recognized Utterance Detection Using Multiple
Language Models Generated by Clustered
Sentences
Katsuhisa Fujinaga 1, Hiroaki Kokubo 2, Hirofumi Yamamoto 2, Genichiro Kikui 2, Hiroshi Shimodaira 1; 1 JAIST, Japan; 2 ATR-SLT, Japan
This paper proposes a new method of detecting mis-recognized utterances based on a ROVER-like voting scheme. Although the ROVER approach is effective in improving recognition accuracy, it has two serious problems from a practical point of view: 1) it is difficult to construct multiple automatic speech recognition (ASR) systems, and 2) the computational cost increases with the number of ASR systems. To overcome these problems, a new method is proposed in which only a single acoustic engine is employed but multiple language models (LMs), consisting of a baseline (main) LM and sub LMs, are used. The sub LMs are generated from clustered sentences and used to rescore the word lattice given by the main LM. As a result, the computational cost is greatly reduced. In experiments, the proposed method achieved 18-point higher precision with a 10% loss of recall when compared with the baseline, and 22-point higher precision with a 20% loss of recall.
Speech Recognition Using EMG; Mime Speech
Recognition
Hiroyuki Manabe, Akira Hiraiwa, Toshiaki Sugimura; NTT DoCoMo Inc., Japan
The cellular phone offers significant benefits but causes several social problems. One such problem is phone use in places where people should not speak, such as trains and libraries. A communication style that does not require voiced speech has the potential to solve this problem. Speech recognition based on electromyography (EMG), which we call “Mime Speech Recognition”, is proposed. It not only eases communication in socially sensitive environments, but also improves speech recognition accuracy in noisy environments. In this paper, we report that EMG yields stable and accurate recognition of the 5 Japanese vowels uttered statically without generating voice. Moreover, the ability of EMG to handle consonants is described, and the feasibility of basing comprehensive speech recognition systems on EMG is shown.
Automatic Generation of Non-Uniform
Context-Dependent HMM Topologies Based on the
MDL Criterion
Takatoshi Jitsuhiro 1 , Tomoko Matsui 2 , Satoshi
Nakamura 1 ; 1 ATR-SLT, Japan; 2 Institute of Statistical
Mathematics, Japan
We propose a new method of automatically creating non-uniform
context-dependent HMM topologies by using the Minimum Description Length (MDL) criterion. Phonetic decision tree clustering is
widely used, based on the Maximum Likelihood (ML) criterion, and
creates only contextual variations. However, it also needs to empirically predetermine control parameters for use as stop criteria,
for example, the total number of states. Furthermore, it cannot create topologies with various state lengths automatically. Therefore,
we introduce the MDL criterion as split and stop criteria, and use
the Successive State Splitting (SSS) algorithm as a method of generating contextual and temporal variations. This proposed method,
the MDL-SSS, can automatically create proper topologies without
such predetermined parameters. Experimental results show that
the MDL-SSS can automatically stop splitting and obtain more appropriate HMM topologies than the original one. Furthermore, we investigated the MDL-SSS combined with phonetic decision tree clustering; this method can automatically obtain the best performance without any heuristics.
Comparison of Effects of Acoustic and Language
Knowledge on Spontaneous Speech
Perception/Recognition Between Human and
Automatic Speech Recognizer
Norihide Kitaoka, Masahisa Shingu, Seiichi
Nakagawa; Toyohashi University of Technology,
Japan
An automatic speech recognizer uses acoustic knowledge and linguistic knowledge. In large vocabulary speech recognition, acoustic
knowledge is modeled by hidden Markov models (HMMs), linguistic knowledge is modeled by N-grams (typically bigrams or trigrams),
and these models are stochastically integrated. It is thought that
humans also integrate acoustic and linguistic knowledge of speech
when perceiving continuous speech. Automatic speech recognition
with HMM and N-gram is thought to roughly model the process of
human perception.
Although these models have drastically improved the performance
of automatic speech recognition of well-formed read speech so far,
they cannot deliver sufficient performance on spontaneous speech
recognition tasks because of various particular phenomena of spontaneous speech.
In this paper, we conducted simulation experiments on N-gram language models by combining human acoustic knowledge with instructions about local context, and confirmed that using the two words neighboring the target word was enough to improve recognition performance when only local information was available as linguistic knowledge. We also confirmed that coarticulation affected the perception of short words.
We then compared several language models on a speech recognizer. We calculated acoustic scores with HMMs and then added linguistic scores calculated from a language model. We obtained a 37.5% recognition rate with the acoustic model alone, whereas we obtained 51.0% with both acoustic and language models; thus the relative performance improvement was 36%. On the other hand, we obtained a 16.5% recognition rate with the language model alone, so the acoustic model improved performance by 209% relative. The performance of the language model on spontaneous speech is almost equal to that on read speech; thus, improving the acoustic models is more effective than improving the language model.
Using Statistical Language Modelling to Identify
New Vocabulary in a Grammar-Based Speech
Recognition System
Genevieve Gorrell; Linköping University, Sweden
Spoken language recognition meets with difficulties when an unknown word is encountered. In addition to the new word being unrecognisable, its presence degrades recognition performance on the surrounding words. The possibility is explored here of using a back-off statistical recogniser to allow recognition of out-of-vocabulary words in a grammar-based speech recognition system. This study shows that a statistical language model, created from a corpus obtained using a grammar-based system and augmented with minimally-constrained domain-appropriate material, allows extraction of words that are outside the grammar's vocabulary in an unseen corpus with fairly high precision.
A Source Model Mitigation Technique for
Distributed Speech Recognition Over Lossy Packet
Channels
Ángel M. Gómez, Antonio M. Peinado, Victoria Sánchez, Antonio J. Rubio; Universidad de Granada, Spain
In this paper, we develop a new mitigation technique for a distributed speech recognition system over IP. We have designed and tested several methods to improve the interpolation used in the Aurora DSR ETSI standard without any significant increase in computational cost at the decoder. These methods make use of the information contained in the data source because, in IP networks, unlike in cellular networks, no information is received during packet losses.
When a packet loss occurs, the lost information can be reconstructed through estimations from the N nearest received packets. Due to the enormous number of combinations of previous and next received speech vector sequences, we have developed a methodology that drastically reduces the number of required estimations.
The Effect of an Intermediate Articulatory Layer on
the Performance of a Segmental HMM
Martin J. Russell 1, Philip J.B. Jackson 2; 1 University of Birmingham, U.K.; 2 University of Surrey, U.K.
We present a novel multi-level HMM in which an intermediate ‘articulatory’ representation is included between the state and surface-acoustic levels. A potential difficulty with such a model is that advantages gained by the introduction of an articulatory layer might be compromised by limitations due to an insufficiently rich articulatory representation, or by compromises made for mathematical or computational expediency. This paper describes a simple model in which speech dynamics are modelled as linear trajectories in a formant-based ‘articulatory’ layer, and the articulatory-to-acoustic mappings are linear. Phone classification results for TIMIT are presented for monophone and triphone systems with a phone-level syntax. The results demonstrate that, provided the intermediate representation is sufficiently rich, or a sufficiently large number of phone-class-dependent articulatory-to-acoustic mappings are employed, classification performance is not compromised.
Automatic Phone Set Extension with Confidence
Measure for Spontaneous Speech
Yi Liu, Pascale Fung; Hong Kong University of Science & Technology, China
Extending the phone set is one common approach for dealing with phonetic confusions in spontaneous speech. We propose using a likelihood ratio test as a confidence measure for automatic phone set extension to model phonetic confusions. We first extend the standard phone set using dynamic programming (DP) alignment to cover all possible phonetic confusions in the training data. The likelihood ratio test is then used as a confidence measure to optimize the extended phonetic units that represent the acoustic samples between two highly confusable standard phonetic units. The optimum set of extended phonetic units is combined with the standard phone set to form a multiple pronunciation dictionary. The effectiveness of this approach is evaluated on spontaneous Mandarin telephony speech. It gives an encouraging 1.09% absolute syllable error rate reduction. Using the extended phone set provides a good balance between the demands of a high-resolution acoustic model and the available training data.
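A likelihood ratio of the kind used as a confidence measure above can be sketched by comparing a "two separate units" hypothesis against a "one merged unit" hypothesis over the acoustic samples. The single diagonal-Gaussian models below stand in for the real acoustic models and are an assumption of this example:

```python
import numpy as np

def gaussian_loglik(x, mean, var):
    """Log likelihood of the rows of x under a diagonal Gaussian."""
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mean) ** 2 / var)

def fit_loglik(x):
    """Maximum-likelihood diagonal Gaussian fit; returns the data log likelihood."""
    mean, var = x.mean(axis=0), x.var(axis=0) + 1e-6
    return gaussian_loglik(x, mean, var)

def extension_confidence(samples_a, samples_b):
    """Log likelihood ratio: separate models for the two phonetic labels
    versus a single merged model. Large positive values indicate the
    samples are better explained by distinct (extended) units."""
    ll_separate = fit_loglik(samples_a) + fit_loglik(samples_b)
    ll_merged = fit_loglik(np.concatenate([samples_a, samples_b]))
    return ll_separate - ll_merged
```

Well-separated sample sets yield a large ratio (keep the extended unit); near-identical sets yield a small one (the standard unit suffices).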
Utterance Verification Using an Optimized
k-Nearest Neighbour Classifier
R. Paredes, A. Sanchis, E. Vidal, A. Juan; Universitat
Politècnica de València, Spain
Utterance verification can be seen as a conventional pattern classification problem in which a feature vector is obtained for each hypothesized word in order to classify it as either correct or incorrect.
In this paper, we study the application to this problem of an optimized version of the k-Nearest Neighbour decision rule which also
incorporates an adequate feature selection technique. Experiments
are reported showing that it gives comparatively good results.
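A toy version of such a verifier is easy to state: a weighted k-nearest-neighbour vote over a confidence-feature vector per hypothesized word. The feature weights here merely stand in for the paper's optimization and feature-selection steps, which are not reproduced:

```python
import numpy as np

def knn_verify(train_X, train_y, x, k=3, weights=None):
    """Label one hypothesised word as correct (1) or incorrect (0) by a
    k-nearest-neighbour vote over its confidence-feature vector.
    `weights` scales each feature; a zero weight discards a feature,
    mimicking feature selection."""
    w = np.ones(train_X.shape[1]) if weights is None else np.asarray(weights)
    dists = np.sqrt((((train_X - x) * w) ** 2).sum(axis=1))
    nearest = np.argsort(dists)[:k]
    return int(np.asarray(train_y)[nearest].mean() >= 0.5)
```

With training vectors for correctly and incorrectly recognized words forming separate clusters, a query vector is simply assigned the majority label of its k nearest neighbours.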
Session: PThBg – Poster
Spoken Language Understanding &
Translation
Time: Thursday 10.00, Venue: Main Hall, Level -1
Chair: Hélène Bonneau-Maynard, LIMSI-CNRS, France
Discriminative Methods for Improving Named
Entity Extraction on Speech Data
James Horlock, Simon King; University of Edinburgh, U.K.
In this paper we present a method of discriminatively training language models for spoken language understanding; we show improvements in named entity F-scores on speech data using these improved language models. A comparison between the theoretical probabilities associated with manual markup and the actual probabilities of output markup is used to identify probabilities requiring adjustment. We present results which support our hypothesis that improvements in F-scores are possible by using either previously used training data or held-out development data to improve discrimination amongst a set of N-gram language models.
Spoken Cross-Language Access to Image Collection
via Captions
Hsin-Hsi Chen; National Taiwan University, Taiwan
This paper presents a framework for using Chinese speech to access images via English captions. The formulation and the structure mapping rules of Chinese and English named entities are extracted from an NICT foreign location name corpus. For a named location, the name part and the keyword part are usually transliterated and translated, respectively. Keyword spotting identifies the keyword from speech queries and narrows down the search space of image collections. A scoring function is proposed to compute the similarity between a speech query and annotated captions in terms of the International Phonetic Alphabet. The experimental results show that the average rank and the mean reciprocal rank are 2.04 and 0.8322, respectively, which is very close to the best performance, i.e., 1, for both average rank and mean reciprocal rank.
Understanding Process for Speech Recognition
Salma Jamoussi, Kamel Smaïli, Jean-Paul Haton; LORIA, France
The automatic speech understanding problem can be considered as an association problem between two different languages. At the input, the request is expressed in natural language; at the end, just before the interpretation stage, the same request is expressed in terms of concepts. A concept represents a given meaning; it is defined by a set of words sharing the same semantic properties. In this paper, we propose a new Bayesian-network-based method to automatically extract the underlying concepts. We also propose a new approach for the vector representation of words. We conclude with a description of the post-processing step, during which we label our sentences and generate the corresponding SQL queries. This step allows us to validate our speech understanding approach, which obtains good results: a rate of 92.5% well-formed SQL requests has been achieved on the test corpus.
Collecting Machine-Translation-Aided Bilingual
Dialogues for Corpus-Based Speech Translation
Toshiyuki Takezawa, Genichiro Kikui; ATR-SLT, Japan
A huge bilingual corpus of English and Japanese is being built at ATR Spoken Language Translation Research Laboratories in order to enhance speech translation technology, so that people can use a portable translation system for traveling abroad, dining and shopping, as well as in hotel situations. As part of these corpus construction activities, we have been collecting dialogue data using an experimental translation system between English and Japanese. The purpose of this data collection is to study the communication behaviors and linguistic expressions preferred in front of such systems. We use human typists to transcribe the users’ utterances and input them into a machine translation system between English and Japanese instead of using speech recognition systems. In this paper, we present an overview of our activities and discussions based on the basic characteristics.
Combination of Finite State Automata and Neural
Network for Spoken Language Understanding
Chai Wutiwiwatchai, Sadaoki Furui; Tokyo Institute of Technology, Japan
This paper proposes a novel approach for spoken language understanding based on a combination of weighted finite state automata and an artificial neural network. The former machine acts as a robust parser, which extracts semantic information called subframes from an input sentence; the latter machine then interprets the concept of the sentence by considering the existence of subframes and their scores obtained from the automata. With the large number of concepts handled in our mixed-initiative dialogue system, the proposed system achieves a considerable concept interpretation result on both a typed-in test set and a spoken test set. A high subframe recall rate also verifies the applicability of the proposed system.
Improving Statistical Natural Concept Generation
in Interlingua-Based Speech-to-Speech Translation
Liang Gu, Yuqing Gao, Michael Picheny; IBM T.J.
Watson Research Center, USA
Natural concept generation is critical to statistical interlingua-based
speech translation performance. To improve maximum-entropy-based concept generation, a set of novel features and algorithms is proposed, including features enabling model training on parallel corpora, the employment of confidence thresholds, and multiple sets of features. The concept generation error rate is reduced by
43%-50% in our speech translation corpus within limited domains.
Improvements are also achieved in our experiments on speech-to-speech translation.
How NLP Techniques can Improve Speech
Understanding: ROMUS – A Robust Chunk Based
Message Understanding System Using Link
Grammars
Jérôme Goulian, Jean-Yves Antoine, Franck Poirier;
University of South-Brittany, France
This paper discusses the issue of how a speech understanding
system can be made robust against spontaneous speech phenomena (hesitations and repairs) as well as achieving a detailed analysis of spoken French. The Romus system is presented. It implements speech understanding in a two-stage process. The first stage
achieves a finite-state shallow parsing that consists in segmenting
the recognized sentence into basic units (spoken-adapted chunks).
The second one, a Link Grammar parser, looks for inter-chunks dependencies in order to build a rich representation of the semantic
structure of the utterance. These dependencies are mainly investigated at a pragmatic level through the consideration of a task concept hierarchy. Discussion about the approach adopted, its benefits
and limitations, is based on the results of the system’s assessment
carried out on different linguistic phenomena during an evaluation campaign held by the French CNRS.
Discriminative Training of N-Gram Classifiers for
Speech and Text Routing
Ciprian Chelba, Alex Acero; Microsoft Research, USA
We present a method for conditional maximum likelihood estimation of N-gram models used for text or speech utterance classification. The method employs a well known technique relying on
a generalization of the Baum-Eagon inequality from polynomials to
rational functions. The best performance is achieved for the 1-gram
classifier where conditional maximum likelihood training reduces
the class error rate over a maximum likelihood classifier by 45% relative.
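For concreteness, a maximum likelihood 1-gram classifier of the kind used as the baseline can be sketched as follows. The Laplace smoothing and toy routing data are assumptions of this example; the paper's contribution, conditional ML re-estimation of these parameters, is not reproduced here:

```python
import math
from collections import Counter, defaultdict

class UnigramRouter:
    """Maximum-likelihood 1-gram classifier for routing utterances to
    destinations; Laplace smoothing keeps unseen words finite."""
    def __init__(self):
        self.counts = defaultdict(Counter)  # per-class word counts
        self.priors = Counter()             # per-class utterance counts
        self.vocab = set()

    def train(self, labelled):
        for words, cls in labelled:
            self.priors[cls] += 1
            self.counts[cls].update(words)
            self.vocab.update(words)

    def route(self, words):
        def score(cls):
            n, V = sum(self.counts[cls].values()), len(self.vocab)
            s = math.log(self.priors[cls])
            for w in words:
                s += math.log((self.counts[cls][w] + 1) / (n + V))
            return s
        return max(self.priors, key=score)
```

Routing then amounts to picking the class whose smoothed unigram model assigns the utterance the highest log probability.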
Correction of Disfluencies in Spontaneous Speech
Using a Noisy-Channel Approach
Matthias Honal 1 , Tanja Schultz 2 ; 1 Universität
Karlsruhe, Germany; 2 Carnegie Mellon University,
USA
In this paper we present a system which automatically corrects disfluencies such as repairs and restarts typically occurring in spontaneously spoken speech. The system is based on a noisy-channel
model and its development requires no linguistic knowledge, but
only annotated texts. Therefore, it has large potential for rapid
deployment and the adaptation to new target languages. The experiments were conducted on spontaneously spoken dialogs from
the English VERBMOBIL corpus, where a recall of 77.2% and a precision of 90.2% were obtained. To demonstrate the feasibility of rapid
adaptation, additional experiments on the spontaneous Mandarin
Chinese CallHome corpus were performed, achieving 49.4% recall
and 76.8% precision.
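The noisy-channel idea can be illustrated with a tiny search over deletions, scored by a language model plus a per-word channel penalty. The bigram table, the penalty value, and the single-span deletion search are assumptions of this sketch, not the authors' trained model:

```python
def correct_disfluencies(words, lm_score, deletion_logprob=-1.0, max_span=3):
    """Noisy-channel sketch: the fluent string passes through a channel
    that may insert repairs/restarts. Search candidate clean strings
    obtained by deleting one short span, maximising
    log P(clean) + log P(observed | clean)."""
    best, best_score = list(words), lm_score(words)
    for i in range(len(words)):
        for j in range(i + 1, min(i + max_span, len(words)) + 1):
            cand = words[:i] + words[j:]
            if not cand:
                continue
            score = lm_score(cand) + (j - i) * deletion_logprob
            if score > best_score:
                best, best_score = cand, score
    return best

# A toy bigram language model; unseen bigrams get a flat penalty.
BIGRAMS = {("<s>", "i"): -0.5, ("i", "want"): -0.5, ("want", "to"): -0.5,
           ("to", "go"): -0.5, ("go", "</s>"): -0.5}

def toy_lm(ws):
    toks = ["<s>"] + list(ws) + ["</s>"]
    return sum(BIGRAMS.get(b, -10.0) for b in zip(toks, toks[1:]))
```

Removing a repair span costs the channel a small penalty but buys a much higher language model score, so the corrected string wins; a fully fluent input is left unchanged.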
Multi-class Extractive Voicemail Summarization
Konstantinos Koumpis, Steve Renals; University of
Sheffield, U.K.
This paper is about a system that extracts principal content words
from speech-recognized transcripts of voicemail messages and classifies them into proper names, telephone numbers, dates/times and
‘other’. The short text summaries generated are suitable for mobile messaging applications. The system uses a set of classifiers
to identify the summary words, with each word being identified by
a vector of lexical and prosodic features. The features are selected
using Parcel, an ROC-based algorithm. We visually compare the role
of a large number of individual features and discuss effective ways
to combine them. We finally evaluate their performance on manual and automatic transcriptions derived from two different speech
recognition systems.
Active Labeling for Spoken Language
Understanding
Gokhan Tur, Mazin Rahim, Dilek Z. Hakkani-Tür; AT&T Labs-Research, USA
State-of-the-art spoken language understanding (SLU) systems are trained using human-labeled utterances, whose preparation is labor intensive and time consuming. Labeling is an error-prone process, due to various reasons such as labeler errors or imperfect descriptions of classes. Thus, usually a second (or further) pass of labeling is required in order to check and fix the labeling errors and inconsistencies of the earlier passes. In this paper, we examine the effect of labeling errors on statistical call classification and evaluate methods of finding and correcting these errors while checking a minimum amount of data. We describe two alternative methods to speed up the labeling effort: one is based on the confidences obtained from a prior model, and the other is completely unsupervised. We call a labeling process employing one of these methods active labeling. Active labeling aims to minimize the number of utterances to be checked again by automatically selecting the ones that are likely to be erroneous or inconsistent with the previously labeled examples. Although the very same methods can be used as a post-processing step to correct labeling errors, we only consider them as part of the labeling process. We have evaluated these active labeling methods using a call classification system for the AT&T natural dialog customer care system. Our results indicate that it is possible to find about 90% of the labeling errors or inconsistencies by checking just half the data.
Exploiting Unlabeled Utterances for Spoken
Language Understanding
Gokhan Tur, Dilek Z. Hakkani-Tür; AT&T Labs-Research, USA
State-of-the-art spoken language understanding systems are trained using labeled utterances, which are labor intensive and time consuming to prepare. In this paper, we propose methods for exploiting the unlabeled data in a statistical call classification system within a natural language dialog system. The basic assumption is that some amount of labeled data and relatively larger chunks of unlabeled data are available. The first method augments the training data by using the machine-labeled call-types for the unlabeled utterances. The second method, instead, augments the classification model trained using the human-labeled utterances with the machine-labeled ones in a weighted manner. We have evaluated these methods using a call classification system for the AT&T natural dialog customer care system. For call classification, we have used a boosting algorithm. Our results indicate that it is possible to obtain the same classification performance using 30% less labeled data when the unlabeled data is utilized. This corresponds to a 1-1.5% absolute classification error rate reduction using the same amount of labeled data.
Noise Robustness in Speech to Speech Translation
Fu-Hua Liu, Yuqing Gao, Liang Gu, Michael Picheny; IBM T.J. Watson Research Center, USA
This paper describes various noise robustness issues in a speech-to-speech translation system. We present quantitative measures of noise robustness in the context of speech recognition accuracy and speech-to-speech translation performance. To enhance noise immunity, we explore two approaches to improve overall speech-to-speech translation performance. First, a multi-style training technique is used to tackle the issue of environmental degradation at the acoustic model level. Second, a pre-processing technique, CDCN, is exploited to compensate for the acoustic distortion at the signal level. Further improvement can be obtained by combining both schemes. In addition to recognition accuracy for speech recognition, this paper examines how closely speech recognition accuracy is related to overall speech-to-speech translation performance. When we apply the proposed schemes to an English-to-Chinese translation task, the word error rate of our speech recognition subsystem is substantially reduced by 28% relative, to 13.2% from 18.9%, for test data at 15 dB SNR. The corresponding BLEU score improves to 0.478 from 0.43 for the overall speech-to-speech translation. Similar improvements are also observed for a lower SNR condition.
Example-Based Bi-Directional Chinese-English
Machine Translation with Semi-Automatically
Induced Grammars
K.C. Siu, Helen M. Meng, C.C. Wong; Chinese University of Hong Kong, China
We have previously developed a framework for bi-directional English-to-Chinese/Chinese-to-English machine translation using semi-automatically induced grammars from unannotated corpora. The framework adopts an example-based machine translation (EBMT) approach. This work reports on three extensions to the framework. First, we investigate the comparative merits of three distance metrics (Kullback-Leibler, Manhattan-Norm and Gini Index) for agglomerative clustering in grammar induction. Second, we seek an automatic evaluation method, based on the BLEU metric, that can also consider multiple translation outputs generated for a single input sentence. Third, our previous investigation showed that Chinese-to-English translation has lower performance due to incorrect use of English inflectional forms – a consequence of random selection among translation alternatives. We present an improved selection strategy that leverages information from the example parse trees in our EBMT paradigm.
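The first augmentation strategy from the Tur & Hakkani-Tür abstract "Exploiting Unlabeled Utterances" above can be sketched with a stand-in classifier. The nearest-centroid model over utterance feature vectors is an assumption of this example; the paper itself uses a boosting algorithm:

```python
import numpy as np

def fit_centroids(X, y):
    """Nearest-centroid stand-in classifier: one mean vector per call-type."""
    return {c: X[np.asarray(y) == c].mean(axis=0) for c in sorted(set(y))}

def predict(model, X):
    """Assign each row of X to the nearest class centroid."""
    names = list(model)
    M = np.stack([model[c] for c in names])
    d = ((X[:, None, :] - M[None, :, :]) ** 2).sum(axis=-1)
    return [names[i] for i in d.argmin(axis=1)]

def augment_with_unlabeled(X_lab, y_lab, X_unlab):
    """Strategy 1: machine-label the unlabeled utterances with a model
    trained on the human labels, then retrain on the union. (Strategy 2
    would instead combine the two trained models with a weight.)"""
    first = fit_centroids(X_lab, list(y_lab))
    y_machine = predict(first, X_unlab)
    return fit_centroids(np.concatenate([X_lab, X_unlab]), list(y_lab) + y_machine)
```

Even with very few human labels, the retrained model's centroids are pulled toward the true class means by the machine-labeled pool, which is the effect behind the reported 30% labeled-data saving.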
Spotting “Hot Spots” in Meetings: Human
Judgments and Prosodic Cues
Britta Wrede 1 , Elizabeth Shriberg 2 ; 1 International
Computer Science Institute, USA; 2 SRI International,
USA
Recent interest in the automatic processing of meetings is motivated by a desire to summarize, browse, and retrieve important
information from lengthy archives of spoken data. One of the most
useful capabilities such a technology could provide is a way for
users to locate “hot spots” or regions in which participants are
highly involved in the discussion (e.g. heated arguments, points of
excitement, etc.). We ask two questions about hot spots in meetings
in the ICSI Meeting Recorder corpus. First, we ask whether involvement can be judged reliably by human listeners. Results show that
despite the subjective nature of the task, raters show significant
agreement in distinguishing involved from non-involved utterances.
Second, we ask whether there is a relationship between human judgments of involvement and automatically extracted prosodic features of the associated regions. Results show that there are significant differences in both F0 and energy between involved and noninvolved utterances. These findings suggest that humans do agree
to some extent on the judgment of hot spots, and that acoustic-only
cues could be used for automatic detection of hot spots in natural
meetings.
Combination of CFG and N-Gram Modeling in
Semantic Grammar Learning
Ye-Yi Wang, Alex Acero; Microsoft Research, USA
SGStudio is a grammar authoring tool that eases semantic grammar development. It is capable of integrating different information
sources and learning from annotated examples to induce CFG rules.
In this paper, we investigate a modification to its underlying model
by replacing CFG rules with n-gram statistical models. The new
model is a composite of HMM and CFG. The advantages of the new
model include its built-in robustness and its scalability to an n-gram classifier when the understanding does not involve slot filling.
We devised a decoder for the model. Preliminary results show that
the new model achieved 32% error reduction in high resolution understanding.
Automatic Title Generation for Chinese Spoken
Documents Using an Adaptive K Nearest-Neighbor
Approach
Shun-Chuan Chen, Lin-shan Lee; National Taiwan
University, Taiwan
The purpose of automatic title generation is to understand a document and to summarize it in only a few readable words or phrases. Titles are important for browsing and retrieving spoken documents, which may be automatically transcribed; such documents are much more useful when given titles indicating their subject content. For the Chinese language, additional
problems such as word segmentation and key phrase extraction
also have to be solved. In this paper, we developed a new approach
of title generation for Chinese spoken documents. It includes key
phrase extraction, topic classification, and a new title generation
model based on an adaptive K nearest-neighbor concept. The tests
were performed with a training corpus including 151,537 news stories in text form with human-generated titles and a testing corpus
of 210 broadcast news stories. The evaluation included both objective F1 measures and 5-level subjective human evaluation. Very
positive results were obtained.
Cross Domain Chinese Speech Understanding and Answering Based on Named-Entity Extraction

Yun-Tien Lee, Shun-Chuan Chen, Lin-shan Lee; National Taiwan University, Taiwan

The Chinese language is not alphabetic; it has a flexible wording structure, and a large number of domain-specific terms are generated every day in each domain. In this paper, a new approach for cross-domain Chinese speech understanding and answering is proposed based on named-entity extraction. This approach includes two parts: a speech query recognition (SQR) part and a speech understanding and answering (SUA) part. The huge quantities of news documents retrieved from the Web are used to construct domain-specific lexicons and language models for SQR. The named-entity extraction is used to construct a domain-specific named-entity database for SUA. It is found that by combining domain classifiers and named-entity extraction, we can not only understand cross-domain queries, but also find answers in a specific domain.

Evaluation Method for Automatic Speech Summarization

Chiori Hori 1, Takaaki Hori 1, Sadaoki Furui 2; 1 NTT Corporation, Japan; 2 Tokyo Institute of Technology, Japan

We have proposed an automatic speech summarization approach that extracts words from transcription results obtained by automatic speech recognition (ASR) systems. To numerically evaluate this approach, the automatic summarization results are compared with manual summarization generated by humans through word extraction. We have proposed three metrics, weighted word precision, word strings precision and summarization accuracy (SumACCY), based on a word network created by merging manual summarization results. In this paper, we propose a new metric for automatic summarization results, weighted summarization accuracy (WSumACCY). This accuracy is weighted by the posterior probability of the manual summaries in the network to give the reliability of each answer extracted from the network. We clarify the goal of each metric and use these metrics to provide automatic evaluation results of the summarized speech. To compare the performance of each evaluation metric, correlations between the evaluation results using these metrics and subjective evaluation by hand are measured. It is confirmed that WSumACCY is an effective and robust measure for automatic summarization.

Speech Summarization Using Weighted Finite-State Transducers

Takaaki Hori, Chiori Hori, Yasuhiro Minami; NTT Corporation, Japan

This paper proposes an integrated framework to summarize spontaneous speech into written-style compact sentences. Most current speech recognition systems attempt to transcribe whole spoken words correctly. However, recognition results of spontaneous speech are usually difficult to understand, even if the recognition is perfect, because spontaneous speech includes redundant information, and its style is different to that of written sentences. In particular, the style of spoken Japanese is very different to that of the written language. Therefore, techniques to summarize recognition results into readable and compact sentences are indispensable for generating captions or minutes from speech. Our speech summarization includes speech recognition, paraphrasing, and sentence compaction, which are integrated in a single Weighted Finite-State Transducer (WFST). This approach enables the decoder to employ all the knowledge sources in a one-pass search strategy and reduces the search errors, since all the constraints of the models are used from the beginning of the search. We conducted experiments on a 20k-word Japanese lecture speech recognition and summarization task. Our approach yielded improvements in both recognition accuracy and summarization accuracy compared with other approaches that perform speech recognition and summarization separately.

An Information Theoretic Approach for Using Word Cluster Information in Natural Language Call Routing

Li Li, Feng Liu, Wu Chou; Avaya Labs Research, USA
In this paper, an information theoretic approach for using word
clusters in natural language call routing (NLCR) is proposed. This
approach utilizes an automatic word class clustering algorithm to
generate word classes from the word based training corpus. In our
approach, the information gain (IG) based term selection is used
to combine both word term and word class information in NLCR.
A joint latent semantic indexing natural language understanding
algorithm is derived and studied in NLCR tasks. Compared with the word-term-based approach, an average performance gain of 10.7% to 14.5% is observed over various training and testing conditions.
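The information gain (IG) criterion used above for term selection can be sketched generically; this toy computation over a labeled corpus is an illustration, not the authors' implementation (the function name and data layout are assumptions):

```python
import math
from collections import defaultdict

def information_gain(docs, labels):
    """IG(t) = H(C) - P(t) H(C|t) - P(~t) H(C|~t) for every term t.

    docs: list of token lists; labels: parallel list of routing classes.
    """
    n = len(docs)

    def entropy(counts):
        total = sum(counts.values())
        return -sum((c / total) * math.log2(c / total)
                    for c in counts.values() if c > 0)

    class_counts = defaultdict(int)
    for lab in labels:
        class_counts[lab] += 1
    h_c = entropy(class_counts)          # prior class entropy H(C)

    ig = {}
    for term in {t for d in docs for t in d}:
        with_t, without_t = defaultdict(int), defaultdict(int)
        n_t = 0
        for doc, lab in zip(docs, labels):
            if term in doc:
                with_t[lab] += 1
                n_t += 1
            else:
                without_t[lab] += 1
        h_with = entropy(with_t) if n_t else 0.0
        h_without = entropy(without_t) if n_t < n else 0.0
        ig[term] = h_c - (n_t / n) * h_with - ((n - n_t) / n) * h_without
    return ig
```

A perfectly class-predictive term attains IG equal to the class entropy, while a term spread evenly across classes scores near zero.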
Unsupervised Topic Discovery Applied to
Segmentation of News Transcriptions
Sreenivasa Sista, Amit Srivastava, Francis Kubala,
Richard Schwartz; BBN Technologies, USA
Audio transcriptions from Automatic Speech Recognition systems
are a continuous stream of words that are difficult to read. Segmenting these transcriptions into thematically distinct stories and categorizing the stories by topics increases readability and comprehensibility. However, manually defined topic categories are rarely available, and the cost of annotating a large corpus with thousands of
distinct topics is high. We describe a procedure for applying the Unsupervised Topic Discovery (UTD) algorithm to the Thematic Story
Segmentation procedure for segmenting broadcast news episodes
into stories and to assign these stories with automatic topic labels.
We report our results of applying automatic topics for the task of
story segmentation on a collection of news episodes in English and
Arabic. Our results indicate that story segmentation performance
with automatic topic annotations from UTD is on par with the performance with manual topic annotations.
Session: PThBh– Poster
Speech Signal Processing III
Time: Thursday 10.00, Venue: Main Hall, Level -1
Chair: Javier Hernando, Universitat Politecnica de Catalunya, Spain
Local Regularity Analysis at Glottal Opening and Closure Instants in Electroglottogram Signal Using Wavelet Transform Modulus Maxima

Aïcha Bouzid 1, Noureddine Ellouze 2; 1 Superior Institute of Technological Studies of Sfax, Tunisia; 2 National School of Engineers of Tunis, Tunisia
This paper deals with singularity characterisation and detection in the electroglottogram (EGG) signal using wavelet transform modulus maxima. These singularities correspond to glottal opening and closure instants (GOIs and GCIs). Wavelets with one and two vanishing moments are applied to the EGG signal. We show that a wavelet with one vanishing moment is sufficient to detect the singularities of the EGG signal and to measure their regularities.

The Lipschitz regularity at a point is given by the slope of log2 of the wavelet transform modulus maxima as a function of log2 s along the maxima lines converging to that point. Local regularity measures allow us to conclude that the EGG signal is more regular at the glottal opening instant than at the glottal closure instant.
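The slope-based regularity measurement described above can be sketched as follows, assuming a first-derivative-of-Gaussian wavelet (one vanishing moment) and an L1-style normalisation so that the modulus maxima scale as s^alpha near a Lipschitz-alpha point; the scale set and search window are illustrative choices, not the paper's:

```python
import numpy as np

def dog_wavelet(scale, width=6):
    # first derivative of a Gaussian: a wavelet with one vanishing moment,
    # L1-normalised so that |Wx(s, .)| ~ s^alpha near a Lipschitz-alpha point
    t = np.arange(-width * scale, width * scale + 1, dtype=float)
    return -t / scale**2 * np.exp(-t**2 / (2.0 * scale**2))

def local_regularity(x, pos, scales=(4, 8, 16, 32)):
    # slope of log2 |Wx(s, .)| versus log2 s along the modulus maxima near pos
    logs, logw = [], []
    for s in scales:
        w = np.convolve(x, dog_wavelet(s), mode="same")
        lo, hi = max(0, pos - 2 * s), min(len(x), pos + 2 * s)
        m = np.max(np.abs(w[lo:hi]))      # modulus maximum near pos
        logs.append(np.log2(float(s)))
        logw.append(np.log2(m + 1e-300))
    return float(np.polyfit(logs, logw, 1)[0])
```

On a step discontinuity (alpha = 0) the fitted slope comes out near zero, while at a ramp corner (alpha = 1) it comes out near one.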
Improved Robustness of Automatic Speech
Recognition Using a New Class Definition in Linear
Discriminant Analysis
M. Schafföner, M. Katz, S.E. Krüger, A. Wendemuth;
Otto-von-Guericke-University Magdeburg, Germany
This work discusses the improvements which can be expected
when applying linear feature-space transformations based on Linear
Discriminant Analysis (LDA) within automatic speech-recognition
(ASR). It is shown that different factors influence the effectiveness
of LDA-transformations. Most importantly, increasing the number of LDA-classes by using time-aligned states of Hidden Markov Models instead of phonemes is necessary to obtain improvements
predictably. An extension of LDA is presented, which utilises the elementary Gaussian components of the mixture probability-density
functions of the Hidden-Markov-Models’ states to define actual
Gaussian LDA-classes. Experimental results on the TIMIT and WSJCAM0 recognition task are given, where relative improvements of
the error-rate of 3.2% and 3.9%, respectively, were obtained.
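The LDA feature-space transformation itself can be sketched with standard scatter matrices; per the abstract, the class labels would be time-aligned HMM states rather than phonemes. This is generic LDA, not the paper's extension to Gaussian mixture-component classes:

```python
import numpy as np

def lda_transform(features, classes, out_dim):
    """Estimate an LDA projection from labeled feature vectors.

    features: (n, d) array; classes: length-n labels (e.g. HMM-state ids).
    Returns a (d, out_dim) projection matrix.
    """
    X = np.asarray(features, float)
    y = np.asarray(classes)
    mu = X.mean(axis=0)
    d = X.shape[1]
    Sw = np.zeros((d, d))    # within-class scatter
    Sb = np.zeros((d, d))    # between-class scatter
    for c in np.unique(y):
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        Sw += (Xc - mc).T @ (Xc - mc)
        diff = (mc - mu)[:, None]
        Sb += len(Xc) * (diff @ diff.T)
    # symmetric whitening of Sw, then eigendecomposition of the projected Sb
    evals, evecs = np.linalg.eigh(Sw)
    W = evecs @ np.diag(1.0 / np.sqrt(np.maximum(evals, 1e-10))) @ evecs.T
    evals_b, evecs_b = np.linalg.eigh(W @ Sb @ W)
    order = np.argsort(evals_b)[::-1]
    return W @ evecs_b[:, order[:out_dim]]
```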
Voice Conversion Methods for Vocal Tract and
Pitch Contour Modification
Oytun Turk 1, Levent M. Arslan 2; 1 Sestek Inc., Turkey; 2 Bogazici University, Turkey
This study proposes two new methods for detailed modeling and
transformation of the vocal tract spectrum and the pitch contour.
The first method (selective pre-emphasis) relies on band-pass filtering to perform vocal tract transformation. The second method
(segmental pitch contour model) focuses on a more detailed modeling of pitch contours. Both methods are utilized in the design
of a voice conversion algorithm based on codebook mapping. We
compare them with existing vocal tract and pitch contour transformation methods and acoustic feature transplantations in subjective
tests. The performance of the selective pre-emphasis based method
is similar to the methods used in our previous work at higher sampling rates with a lower prediction order. The results also indicate
that the segmental pitch contour model improves voice conversion
performance.
Modulation Spectrum for Pitch and Speech Pause Detection

Olaf Schreiner; DaimlerChrysler AG, Germany

This paper describes a new approach to the speech pause detection problem. The goal is to decide safely, for a given signal frame, whether or not speech is present, in order to switch an automatic speech recognizer on or off. The modulation spectrum is introduced as a method to determine the amount of voicing in a signal frame. This method is tested against two standard methods in pitch detection.
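One way to realise a modulation-spectrum voicing measure in this spirit is to Fourier-transform the short-term energy envelope and measure how much modulation energy falls in the syllabic-rate band; the 2-8 Hz band, 10 ms frames, and the energy-ratio decision below are assumptions, not the paper's exact method:

```python
import numpy as np

def modulation_energy_ratio(x, sr, frame_ms=10, band=(2.0, 8.0)):
    # fraction of the modulation spectrum (FFT of the short-term energy
    # envelope) falling in the syllabic-rate band; active speech
    # concentrates modulation energy around ~4 Hz
    hop = int(sr * frame_ms / 1000)
    n = len(x) // hop
    env = np.array([np.sum(x[i * hop:(i + 1) * hop] ** 2) for i in range(n)])
    env = env - env.mean()
    spec = np.abs(np.fft.rfft(env)) ** 2
    freqs = np.fft.rfftfreq(n, d=frame_ms / 1000.0)
    in_band = spec[(freqs >= band[0]) & (freqs <= band[1])].sum()
    total = spec[freqs > 0].sum()
    return float(in_band / total) if total > 0 else 0.0
```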
Robust Energy Demodulation Based on Continuous
Models with Application to Speech Recognition
Dimitrios Dimitriadis, Petros Maragos; National
Technical University of Athens, Greece
In this paper, we develop improved schemes for simultaneous
speech interpolation and demodulation based on continuous-time
models. This leads to robust algorithms to estimate the instantaneous amplitudes and frequencies of the speech resonances and extract novel acoustic features for ASR. The continuous-time models
retain the excellent time resolution of the ESAs based on discrete energy operators and perform better in the presence of noise. We also
introduce a robust algorithm based on the ESAs for amplitude compensation of the filtered signals. Furthermore, we use robust nonlinear modulation features to enhance the classic cepstrum-based
features and use the augmented feature set for ASR applications.
ASR experiments show promising evidence that the robust modulation features improve recognition.
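The discrete energy operators and ESAs that the continuous-time models are compared against are based on the Teager-Kaiser operator; a sketch of the discrete DESA-1 energy separation (the paper's continuous-time formulation itself is not reproduced here):

```python
import numpy as np

def teager(x):
    """Discrete Teager-Kaiser energy operator Psi[x](n) = x(n)^2 - x(n-1) x(n+1)."""
    return x[1:-1] ** 2 - x[:-2] * x[2:]

def desa1(x):
    """DESA-1 energy separation: instantaneous frequency (rad/sample)
    and amplitude envelope estimates from the Teager operator."""
    y = x[1:] - x[:-1]                 # backward difference
    px = teager(x)                     # Psi[x]
    py = teager(y)                     # Psi[y]
    # cos(omega) = 1 - (Psi[y](n) + Psi[y](n+1)) / (4 Psi[x](n))
    q = 1.0 - (py[:-1] + py[1:]) / (4.0 * px[1:-1] + 1e-12)
    q = np.clip(q, -1.0, 1.0)
    omega = np.arccos(q)
    # |a(n)| = sqrt(Psi[x](n) / sin^2(omega))
    amp = np.sqrt(np.abs(px[1:-1]) / (1.0 - q ** 2 + 1e-12))
    return omega, amp
```

For a pure sinusoid A cos(omega n), the operator identity Psi[x] = A^2 sin^2(omega) makes both estimates exact up to floating-point error.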
A Robust and Sensitive Word Boundary Decision
Algorithm
Jong Uk Kim, SangGyun Kim, Chang D. Yoo; KAIST,
Korea
A robust and sensitive word boundary decision algorithm for automatic speech recognition (ASR) system is proposed. The algorithm uses a time-frequency feature to improve both robustness
and sensitivity. The time-frequency features are passed through
a bank of moving average filters for temporary decision of word
boundary in each band. The decision results of each band are then
passed through a median filter for the final decision. The adoption
of the time-frequency feature improves the sensitivity, while the median filtering improves the robustness. The proposed algorithm uses an adaptive threshold based on the signal-to-noise ratio (SNR) in each band, which further improves the decision performance. Experimental results show that the proposed algorithm outperforms the robust algorithm of Q. Li et al.
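A toy sketch of the decision chain described above, i.e. per-band moving-average smoothing, an SNR-adaptive threshold, and a median filter for the final decision; the threshold rule (a scaled noise floor) and the vote combination are simplifying assumptions, not the paper's exact design:

```python
import numpy as np

def word_boundary_decision(band_energies, noise_floor, k=2.0, ma_len=5, med_len=7):
    # band_energies: (bands, frames); noise_floor: per-band noise estimate.
    # 1) moving-average smoothing plus a noise-relative threshold gives a
    #    temporary decision per band;
    # 2) the band votes are median-filtered over time for the final decision.
    B, T = band_energies.shape
    votes = np.zeros((B, T))
    for b in range(B):
        ma = np.convolve(band_energies[b], np.ones(ma_len) / ma_len, mode="same")
        votes[b] = ma > k * noise_floor[b]
    mean_vote = votes.mean(axis=0)
    pad = med_len // 2
    padded = np.pad(mean_vote, pad, mode="edge")
    final = np.array([np.median(padded[i:i + med_len]) for i in range(T)])
    return final > 0.5
```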
A Novel Transcoding Algorithm for SMV and
G.723.1 Speech Coders via Direct Parameter
Transformation
Seongho Seo, Dalwon Jang, Sunil Lee, Chang D. Yoo;
KAIST, Korea
In this paper, a novel transcoding algorithm for the Selectable Mode
Vocoder (SMV) and the G.723.1 speech coder is proposed. In contrast to the conventional tandem transcoding algorithm, the proposed algorithm converts the parameters of one coder to the other
without going through the decoding and encoding process. The
proposed algorithm is composed of four parts: the parameter decoding, Line Spectral Pair (LSP) conversion, pitch period conversion
and rate selection. The evaluation results show that the proposed
algorithm achieves equivalent speech quality to that of tandem
transcoding with reduced computational complexity and delay.
A Novel Rate Selection Algorithm for Transcoding CELP-Type Codec and SMV

Dalwon Jang, Seongho Seo, Sunil Lee, Chang D. Yoo; KAIST, Korea

In this paper, we propose an efficient rate selection algorithm that can be used to transcode speech encoded by any code excited linear prediction (CELP)-type codec into a format compatible with the selectable mode vocoder (SMV) via direct parameter transformation. The proposed algorithm performs rate selection using the CELP parameters. Simulation results show that, while maintaining a similar overall bit-rate compared to the rate selection algorithm of SMV, the proposed algorithm requires less computational load than that of SMV and does not degrade the quality of the transcoded speech.

Estimation of the Parameters of the Quantitative Intonation Model with Continuous Wavelet Analysis

Hans Kruschke, Michael Lenz; Dresden University of Technology, Germany

Intonation generation in state-of-the-art speech synthesis requires the analysis of a large amount of data. Therefore, reliable algorithms for the extraction of the parameters of an intonation model from a given F0 contour are required. This contribution proposes improvements concerning the extraction of the parameters of the quantitative intonation model developed by Fujisaki. The improvements are mainly based on the application of the continuous wavelet transform for the detection of accents and phrases in an F0 contour. A detailed explanation of the underlying idea of this approach is given and the implemented algorithm is described. Results prove that with the proposed method a significant improvement in the accuracy of the extracted parameters is achieved. Thereby the structure and the rules of the algorithm are kept relatively simple.
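The quantitative (Fujisaki) model whose parameters are being extracted generates an F0 contour from phrase and accent commands; a sketch of the generation side (parameter extraction would invert it), with typical time constants (alpha = 2/s, beta = 20/s, ceiling gamma = 0.9) assumed:

```python
import numpy as np

def fujisaki_f0(t, fb, phrases, accents, alpha=2.0, beta=20.0, gamma=0.9):
    # ln F0(t) = ln Fb + sum_i Ap_i Gp(t - T0_i)
    #                  + sum_j Aa_j [Ga(t - T1_j) - Ga(t - T2_j)]
    # Gp(x) = alpha^2 x exp(-alpha x) for x >= 0        (phrase control)
    # Ga(x) = min(1 - (1 + beta x) exp(-beta x), gamma) (accent control)
    t = np.asarray(t, float)

    def Gp(x):
        xp = np.maximum(x, 0.0)
        return np.where(x >= 0, alpha ** 2 * xp * np.exp(-alpha * xp), 0.0)

    def Ga(x):
        xp = np.maximum(x, 0.0)
        g = np.where(x >= 0, 1.0 - (1.0 + beta * xp) * np.exp(-beta * xp), 0.0)
        return np.minimum(g, gamma)

    lnf0 = np.full_like(t, np.log(fb))
    for t0, ap in phrases:              # (onset time, magnitude)
        lnf0 += ap * Gp(t - t0)
    for t1, t2, aa in accents:          # (onset, offset, amplitude)
        lnf0 += aa * (Ga(t - t1) - Ga(t - t2))
    return np.exp(lnf0)
```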
Subband-Based Acoustic Shock Limiting Algorithm on a Low-Resource DSP System

G. Choy, D. Hermann, R.L. Brennan, T. Schneider, H. Sheikhzadeh, E. Cornu; Dspfactory Ltd., Canada
Acoustic shock describes a condition where sudden loud acoustic signals in communication equipment cause hearing damage and discomfort to the users. To combat this problem, a subband-based acoustic shock limiting (ASL) algorithm is proposed and implemented on an ultra low-power DSP system with an input-output
latency of 6.5 msec. This algorithm processes the input signal in
both the time and frequency domains. This approach allows the
algorithm to detect sudden increases in sound level (time domain), as well as to frequency-selectively suppress shock disturbances in the frequency domain. The unaffected portion of the sound spectrum
is thus preserved as much as possible. A simple ASL algorithm calibration procedure is proposed to satisfy different sound pressure
level (SPL) limit requirements for various communication equipment. Acoustic test results show that the ASL algorithm limits
acoustic shock signals to below specified SPL limits while preserving speech quality.
Pitch Estimation Using Phase Locked Loops
Patricia A. Pelle, Matias L. Capeletto; University of
Buenos Aires, Argentina
In this paper we present a new method for pitch estimation using
a system based on phase-locked-loop devices. Three main blocks
define our system. The aim of the first one is to make a harmonic
decomposition of the speech signal. This stage is implemented using a band-pass filter bank and phase-locked-loops cascaded to the
output of each filter. A second block enhances the harmonic corresponding to the fundamental frequency and attenuates all other
harmonics. Finally a third stage re-synthesizes a new signal with
high energy at the fundamental frequency and extracts pitch contour from that signal using another phase locked-loop. Performance
is evaluated over two databases of laryngograph-labeled speech and
compared to various well known pitch estimation algorithms.
Performance Evaluation of IFAS-Based
Fundamental Frequency Estimator in Noisy
Environment
Dhany Arifianto, Takao Kobayashi; Tokyo Institute of
Technology, Japan
In this paper, an instantaneous frequency amplitude spectrum (IFAS)-based fundamental frequency estimator is evaluated with speech signals corrupted by additive white Gaussian noise. A key idea of the IFAS-based estimator is the use of the degree of regularity of periodicity in the spectrum of the speech signal, defined by a quantity called the harmonicity measure, for band selection in the fundamental frequency estimation. Several frequency band and window length selection methods based on the harmonicity measure are assessed to find the best-performing configuration. It is shown that the performance of the IFAS-based estimator is maintained at a constant error rate of about 1% from clean speech down to 15 dB SNR, and of about 11% at 0 dB SNR. For both
female and male speakers, the IFAS-based estimator outperforms
several well-known methods particularly at 0 dB SNR.
Morphological Filtering of Speech Spectrograms in the Context of Additive Noise

Francisco Romero Rodriguez 1, Wei M. Liu 2, Nicholas W.D. Evans 2, John S.D. Mason 2; 1 Escuela Superior de Ingenieros, Spain; 2 University of Wales Swansea, U.K.
A recent approach to signal segmentation in additive noise [1, 2]
uses features of small spectrogram sub-units accrued over the full
spectrogram. The original work considered chirp signals in additive
white Gaussian noise. This paper extends this work first by considering similar signals at different signal-to-noise ratios and then in
the context of speech recognition. For the chirp case, a cost function based on spectrogram area is introduced and this indicates that
the segmentation process is robust down to and below 0 dB SNR.
For the speech experiments the objectives are again to assess the
segmentation capabilities of the process. White Gaussian noise is
added to clean speech and the segmentation process applied. The
cost function now is automatic speech recognition (ASR) accuracy.
After segmentation speech areas are set to one constant level and
non-speech areas are set to a lower constant level, thereby assessing the segmentation process and the importance of spectral shape
in ASR. For the ASR experiments the TIDigits database is used in
a standard AURORA 2 configuration, under mis-matched test and
training conditions. With 5 dB SNR for the test set only (clean training) a word accuracy of 56% is achieved. This compares with 16%
when the same noisy test data is applied directly to the ASR system without segmentation. Thus the segmentation approach shows
that spectral shapes alone (without normal spectral amplitude variations) lead to perhaps surprisingly good ASR results in noisy conditions. The next stage is to include amplitude information along
with appropriate noise compensation.
Segmenting Multiple Concurrent Speakers Using
Microphone Arrays
Guillaume Lathoud, Iain A. McCowan, Darren C.
Moore; IDIAP, Switzerland
Speaker turn detection is an important task for many speech processing applications. However, accurate segmentation can be hard
to achieve if there are multiple concurrent speakers (overlap), as
is typically the case in multi-party conversations. In such cases,
the location of the speaker, as measured using a microphone array,
may provide greater discrimination than traditional spectral features. This was verified in previous work which obtained a global
segmentation in terms of single speaker classes, as well as possible
overlap combinations. However, such a global strategy suffers from
an explosion of the number of overlap classes, as each possible combination of concurrent speakers must be modeled explicitly. In this
paper, we propose two alternative schemes that produce an individual segmentation decision for each speaker, implicitly handling
all overlapping speaker combinations. The proposed approaches
also allow straightforward online implementations. Experiments
are presented comparing the segmentation with that obtained using the previous system.
Session: OThCc– Oral
Speech Signal Processing IV
Time: Thursday 13.30, Venue: Room 3
Chair: Ben Milner, School of Information Systems

Segmentation of Speech into Syllable-Like Units

T. Nagarajan, Hema A. Murthy, Rajesh M. Hegde; Indian Institute of Technology, India
In the development of a syllable-centric ASR system, segmentation
of the acoustic signal into syllabic units is an important stage. This
paper presents a minimum phase group delay based approach to
segment spontaneous speech into syllable-like units. Here, three
different minimum phase signals are derived from the short term
energy functions of three sub-bands of speech signals, as if it were
a magnitude spectrum. The experiments are carried out on the Switchboard and OGI-MLTS corpora, and the error in segmentation is found to be at most 40 msec for 85% of the syllable segments.
A Syllable Segmentation Algorithm for English and Italian

Massimo Petrillo, Francesco Cutugno; Università degli Studi di Napoli “Federico II”, Italy

In this paper we present a simple algorithm for speech syllabification. It is based on the detection of the most relevant energy maxima, using two different energy calculations: the former from the original signal, the latter from a low-pass filtered version. The system requires setting appropriate values for a number of parameters. The procedure to assign a proper value to each one is reduced to the minimization of an n-variable function, for which we use either a genetic algorithm or simulated annealing. Parameters were estimated separately for Italian and English. We found the English setting was also suitable for Italian, but not the reverse.

Session: SThCb– Oral
Towards a Roadmap for Speech Technology
Time: Thursday 13.30, Venue: Room 2
Chair: Steven Krauwer, Utrecht University / ELSNET

“Do not attempt to light with match!”: Some Thoughts on Progress and Research Goals in Spoken Dialog Systems

Paul Heisterkamp; DaimlerChrysler AG, Germany

In view of the current market consolidation in the speech recognition industry, we ask some questions as to what constitutes the ideas underlying the ‘roadmap’ metaphor. These questions challenge the traditional faith in ever more complex and ‘natural’ systems as the ultimate goals and keys to full commercial success of Spoken Dialog Systems. As we strictly obey that faith, we consider those questions ‘jesuitic’ rather than ‘heretical’. Mainly, we ask: Have we (i.e. the scientific and industrial communities) been promising the right things to the right people? We leave the question open for discussion, and only cast glimpses at potential alternatives.
Multimodality and Speech Technology: Verbal and
Non-Verbal Communication in Talking Agents
Björn Granström, David House; KTH, Sweden
This paper presents methods for the acquisition and modelling of
verbal and non-verbal communicative signals for use in animated talking agents. This work diverges from the traditional focus
on the acoustics of speech in speech technology and will be of importance for the realization of future multimodal interfaces, some
experimental examples of which are presented at the end of the paper.
Roadmaps, Journeys and Destinations: Speculations on the Future of Speech Technology Research
Ronald A. Cole; University of Colorado at Boulder,
USA
This article presents thoughts on the future of speech technology
research, and a vision of the near future in which computer interaction is characterized by natural face-to-face conversations with
lifelike characters that speak, emote and gesture. A first generation
of these perceptive animated interfaces is now under development
in a project called the Colorado Literacy Tutor, which uses perceptive animated agents in a computer-based literacy program.
Spoken Language Output: Realising the Vision
Roger K. Moore; 20/20 Speech Ltd., U.K.
Significant progress has taken place in ‘Spoken Language Output’
(SLO) R&D, yet there is still some way to go before it becomes a
ubiquitous and widely deployed technology. This paper reviews the
challenges facing SLO, using ‘Technology Roadmapping’ (TRM) to
identify market drivers and future product concepts. It concludes
with a summary of the behaviours that will be required in future
SLO systems.
Modeling Speaking Rate for Voice Fonts

Ashish Verma, Arun Kumar; Indian Institute of Technology, India
Voice fonts are created and stored for a speaker, to be used to
synthesize speech in the speaker’s voice. The most important descriptors of voice fonts are spectral envelope for acoustic units
and prosodic features such as fundamental frequency and average
speaking rate. In this paper, we present a new approach to model
the speaking rate so that it can be easily incorporated in voice fonts
and used for personality transformation. We model speaking rate
in the form of average duration for various acoustic units and categories for the speaker. The speaking rate can be automatically
extracted from a speech corpus in the speaker’s voice using the
proposed approach. We show how the proposed approach can be
implemented, and present its performance evaluation through various subjective tests.
A New HMM-Based Approach to Broad Phonetic
Classification of Speech
Jouni Pohjalainen; Helsinki University of Technology,
Finland
A novel automatic method is introduced for classifying speech segments into broad phonetic categories using one or more hidden
Markov models (HMMs) on long speech utterances. The general
method is based on prior analysis of the acoustic features of speech
and the properties of HMMs. Three example algorithms are implemented and applied to voiced-unvoiced-silence classification. The
main advantages of the approach are that it does not require a separate training phase or training data, is adaptive, and that the classification results are automatically smoothed because of the Markov
assumption of successive phonetic events. The method is especially
applicable to speech recognition.
Acoustic Change Detection and Segment Clustering
of Two-Way Telephone Conversations
Xin Zhong 1 , Mark A. Clements 1 , Sung Lim 2 ; 1 Georgia
Institute of Technology, USA; 2 Fast-Talk
Communications, USA
We apply the Bayesian information criterion (BIC) to unsupervised
segmentation of two-way telephone conversations according to
speaker turns, and then proceed to produce homogeneous clusters consisting of the resulting segments. Such clustering allows more accurate feature normalization and model adaptation for ASR-related
tasks. In contrast to similar processing of broadcast data reported
in previous work, we can safely assume there are two distinguishable acoustic environments in a call, but new challenges include a
much faster changing rate, variation of speaking style by a talker,
and presence of crosstalk and non-meaningful sounds. The algorithm is tested on two-speaker telephone conversations with different genders and via different telephony networks (land-line and
cellular). Using the purities of segments and final clusters as the
performance measure, the BIC-based algorithm approaches the optimal result without requiring an iterative procedure.
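The BIC change test underlying such segmentation can be sketched as a Delta-BIC between one merged Gaussian and two per-segment Gaussians; the penalty weight lambda and the diagonal loading for numerical stability are conventional assumptions, not the paper's exact settings:

```python
import numpy as np

def delta_bic(X1, X2, lam=1.0):
    """Delta-BIC speaker-change test between feature segments X1, X2 (n_i, d).

    Positive values favour modelling the two segments separately
    (i.e. a change point between them)."""
    X = np.vstack([X1, X2])
    n1, n2, n = len(X1), len(X2), len(X1) + len(X2)
    d = X.shape[1]

    def logdet_cov(Z):
        # log-determinant of the ML covariance, with tiny diagonal loading
        return np.linalg.slogdet(np.cov(Z, rowvar=False, bias=True)
                                 + 1e-8 * np.eye(Z.shape[1]))[1]

    # model-complexity penalty: d mean terms + d(d+1)/2 covariance terms
    penalty = 0.5 * lam * (d + 0.5 * d * (d + 1)) * np.log(n)
    return (0.5 * n * logdet_cov(X)
            - 0.5 * n1 * logdet_cov(X1)
            - 0.5 * n2 * logdet_cov(X2)
            - penalty)
```

In a full segmenter this statistic is evaluated over candidate change points within a sliding window, with a change hypothesised wherever it exceeds zero.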
Blind Normalization of Speech from Different
Channels
David N. Levin; University of Chicago, USA
We show how to construct a channel-independent representation of
speech that has propagated through a noisy reverberant channel.
The method achieved greater channel-independence than cepstral
mean normalization (CMN), and it was comparable to the combination of CMN and spectral subtraction (SS), despite the fact that
no measurements of channel noise or reverberations were required
(unlike SS).
September 1-4, 2003 – Geneva, Switzerland
of vowels and diphthongs. Comparative analysis of the formant
values, the formant trajectories and the formant target points of
British and broad Australian accents are presented. A method for
ranking the contribution of formants to accent identity is proposed
whereby formants are ranked according to the normalised distances
between formants across accents. The first two formants are considered more sensitive to accents than other formants. Finally a set
of experiments on accent conversion is presented to transform the
broad Australian accent of a speaker to British Received Pronunciation (RP) accent by formant mapping and prosody modification.
Perceptual evaluations of accent conversion results illustrate that
besides prosodic correlates such as pitch and duration, formants
also play an important role in conveying accents.
Cycle Extraction for Perfect Reconstruction and
Rate Scalability
Miguel Arjona Ramírez; University of São Paulo, Brazil
A cycle extractor is presented to be used in a speech coder independently from the coding stage. It samples cycle waveforms (CyWs) of
the original prediction residual signal at their natural nonuniform
rate. It is shown that perfect reconstruction is possible due to the
interplay of these properties for two cycle length normalization and
denormalization techniques. The coding stage is coupled to the cycle extractor in the analysis stage by an evolving waveform interpolator that may handle several interpolation methods and sampling
rates for a variety of fixed and variable rate coders. The description
of extraction, evolution interpolation and synthesis stages is cast in
discrete time. The upper performance bound is perfect reconstruction while the lower bound is equivalent to conventional waveform
interpolation (WI) speech coding.
Parameter-embedded watermarking of speech signals is effected
through slight perturbations of parametric models of some deeplyintegrated dynamics of the signal. One of the objectives of the
present research is to develop, within the parameter-embedding
framework, quantifiable measures of fidelity of the stegosignal and
of robustness of the watermark to attack. This paper advances
previous developments on parameter-embedded watermarking by
introducing a specific technique for watermark selection subject
to a fidelity constraint. New results in set-theoretic filtering are
used to obtain sets of allowable parameter perturbations (i.e., watermarks) subject to an ∞ constraint on the error between the watermarked and original material. With respect to previous trial-anderror perturbation methods, the set-based parameter perturbation
is not only quantified and systematic, it is found to be more robust,
and to have a higher threshold of perceptibility with perturbation
energy. After a brief review of the general parameter-embedding
strategy, the new algorithm for set-theoretic watermark selection is
presented. Experiments with real speech data are used to assess
robustness and other performance properties. This work is being
undertaken in support of the development of the National Gallery
of the Spoken Word, a project of the Digital Libraries II Initiative.
Session: OThCd– Oral
Speech Synthesis: Miscellaneous II
Time: Thursday 13.30, Venue: Room 4
Chair: Jan van Santen, OGI, USA
On présente un extracteur de cycles pour des codeurs de la parole qui est indépendent de l’étage de codage. Il échantillonne
des cycles (CyW) du signal résiduel de prédiction à leur débit
d’échantillonnage naturel qui n’est pas uniforme. On montre qu’il
est possible d’obtenir la reconstruction parfaite à cause des liens entre ces deux propriétés par deux techniques de normalisation et de
denormalisation du longueur des cycles. L’étage de codage est couplé à l’extracteur de cycles dans l’étage d’analyse par un interpolateur de formes d’onde d’évolution que peut ménager plusiers méthodes d’interpolation et débits d’échantillonnage pour une grande
variété de codeurs à débits fixes ou variable. La description des
étages d’extraction, d’interpolation des formes d’onde d’évolution
et de synthèse est en temps discret. La limite supérieur de performance est la reconstruction parfaite tandis que l’inférieur est équivalente à celle du codage conventionnel par interpolation de formes
d’onde (WI).
Adding Fricatives to the Portuguese Articulatory
Synthesiser
Using Acoustic Models to Choose Pronunciation
Variations for Synthetic Voices
António Teixeira, Luis M.T. Jesus, Roberto Martinez;
Universidade de Aveiro, Portugal
Christina L. Bennett, Alan W. Black; Carnegie Mellon
University, USA
Within-speaker pronunciation variation is a well-known phenomenon; however, attempting to capture and predict a speaker’s
choice of pronunciations has been mostly overlooked in the field
of speech synthesis. We propose a method to utilize acoustic
modeling techniques from speech recognition in order to detect a
speaker’s choice between full and reduced pronunciations.
Comparative Analysis and Synthesis of Formant
Trajectories of British and Broad Australian
Accents
First attempts at incorporating models of frication into an articulatory synthesizer, with a modular and flexible design, are presented.
Although the synthesizer allows the user to choose different combinations of source types, noise volume velocity sources have been
used to generate turbulence. Preliminary results indicate that the
model is capturing essential characteristics of the transfer functions and spectral characteristics of fricatives. Results also show
the potential of performing synthesis based on broad articulatory
configurations of fricatives.
A Hybrid Method Oriented to Concatenative
Text-to-Speech Synthesis
Qin Yan 1 , Saeed Vaseghi 1 , Ching-Hsiang Ho 2 ,
Dimitrios Rentzos 1 , Emir Turajlic 1 ; 1 Brunel
University, U.K.; 2 Fortune Institute of Technology,
Taiwan
Ignasi Iriondo, Francesc Alías, Javier Sanchis, Javier
Melenchón; Ramon Llull University, Spain
The differences between the formant trajectories of British and
broad Australian English accents are analysed and used for accent
conversion. An improved formant model based on linear prediction (LP) feature analysis and a 2-D hidden Markov model (HMM)
of formants is employed for estimation of the formant trajectories
In this paper we present a speech synthesis method for diphonebased text-to-speech systems. Its main goal is to achieve prosodic
modifications that result in more natural-sounding synthetic
speech. This improvement is especially useful for emotional speech
synthesis, which requires high-quality prosodic modification. We
present a hybrid method based on TD-PSOLA and the harmonic plus
noise model, which incorporates a novel method to jointly mod-
104
Eurospeech 2003
Thursday
ify pitch and time-scale. Preliminary results show an improvement
in the synthetic speech quality when high pitch modification is required.
Custom-Tailoring TTS Voice Font – Keeping the
Naturalness When Reducing Database Size
Yong Zhao, Min Chu, Hu Peng, Eric Chang; Microsoft
Research Asia, China
This paper presents a framework for custom-tailoring voice fonts in data-driven TTS systems. Three criteria for unit pruning are proposed: a prosodic outlier criterion, an importance criterion, and the combination of the two. The performance of voice fonts of different sizes pruned with the three criteria is evaluated by simulating speech synthesis over a large amount of text while estimating naturalness with an objective measure. The results show that the combined criterion performs best among the three. The pre-estimated curve of naturalness vs. database size can be used as a reference for custom-tailoring a voice font: naturalness remains almost unchanged when 50% of the instances are pruned with the combined criterion.
Session: PThCe– Poster
Speaker Recognition & Verification
Time: Thursday 13.30, Venue: Main Hall, Level -1
Chair: Samy Bengio, IDIAP, Martigny, Switzerland
New MAP Estimators for Speaker Recognition
P. Kenny, M. Mihoubi, Pierre Dumouchel; CRIM,
Canada
We report the results of some experiments which demonstrate that
eigenvoice MAP and eigenphone MAP are at least as effective as classical MAP for discriminative speaker modeling on SWITCHBOARD
data. We show how eigenvoice MAP can be modified to yield a new
model-based channel compensation technique which we call eigenchannel MAP. When compared with multi-channel training, eigenchannel MAP was found to reduce speaker identification errors by
50%.
A New SVM Approach to Speaker Identification and
Verification Using Probabilistic Distance Kernels
Pedro J. Moreno, Purdy P. Ho; Hewlett-Packard, USA
One major SVM weakness has been the use of generic kernel functions to compute distances among data points. Polynomial, linear,
and Gaussian are typical examples. They do not take full advantage
of the inherent probability distributions of the data. Focusing on
audio speaker identification and verification, we propose to explore
the use of novel kernel functions that take full advantage of good
probabilistic and descriptive models of audio data. We explore the
use of generative speaker identification models such as Gaussian
Mixture Models and derive a kernel distance based on the Kullback-Leibler (KL) divergence between generative models. In effect, our
approach combines the best of both generative and discriminative
methods. Our results show that these new kernels perform as well
as baseline GMM classifiers and outperform generic-kernel-based SVMs in both speaker identification and verification on two different audio databases.
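As an illustrative aside, a KL-induced kernel can be sketched for single diagonal-covariance Gaussians, where the KL divergence has a closed form (the paper's kernels operate on GMMs, where the divergence must be approximated; everything below, including the scale parameter a, is illustrative):

```python
import numpy as np

def kl_diag_gaussian(m0, v0, m1, v1):
    """KL(N0 || N1) for diagonal-covariance Gaussians (closed form)."""
    return 0.5 * np.sum(np.log(v1 / v0) + (v0 + (m0 - m1) ** 2) / v1 - 1.0)

def kl_kernel(m0, v0, m1, v1, a=1.0):
    """Kernel from the symmetrized KL divergence, K = exp(-a * D_sym)."""
    d = kl_diag_gaussian(m0, v0, m1, v1) + kl_diag_gaussian(m1, v1, m0, v0)
    return np.exp(-a * d)

# Identical models give divergence 0 and hence kernel value 1;
# the kernel value decays as the models move apart.
m, v = np.zeros(3), np.ones(3)
print(kl_kernel(m, v, m, v))        # 1.0
print(kl_kernel(m, v, m + 2.0, v))  # < 1.0
```

Such a kernel compares whole generative models of two utterances rather than individual feature vectors, which is the sense in which it exploits the probability distribution of the data.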
Adaptive Decision Fusion for Multi-Sample Speaker
Verification Over GSM Networks
Ming-Cheung Cheung 1, Man-Wai Mak 1, Sun-Yuan Kung 2; 1 Hong Kong Polytechnic University, China; 2 Princeton University, USA
In speaker verification, a claimant may produce two or more utterances. In our previous study [1], we proposed to compute the optimal weights for fusing the scores of these utterances based on their
score distribution and our prior knowledge about the score statistics estimated from the mean scores of the corresponding client
speaker and some pseudo-impostors during enrollment. As the fusion weights depend on the prior scores, in this paper, we propose
to adapt the prior scores during verification based on the likelihood
of the claimant being an impostor. To this end, a pseudo-impostor GMM score model is created for each speaker. During verification,
the claimant’s scores are fed to the score model to obtain a likelihood for adapting the prior score. Experimental results based on
the GSM-transcoded speech of 150 speakers from the HTIMIT corpus demonstrate that the proposed prior score adaptation approach
provides a relative error reduction of 15% when compared with our
previous approach where the prior scores are non-adaptive.
Environment Adaptation for Robust Speaker
Verification
Kwok-Kwong Yiu 1, Man-Wai Mak 1, Sun-Yuan Kung 2; 1 Hong Kong Polytechnic University, China; 2 Princeton University, USA
In speaker verification over public telephone networks, utterances
can be obtained from different types of handsets. Different handsets may introduce different degrees of distortion to the speech
signals. This paper attempts to combine a handset selector with
(1) handset-specific transformations and (2) handset-dependent
speaker models to reduce the effect caused by the acoustic distortion. Specifically, a number of Gaussian mixture models are
independently trained to identify the most likely handset given
a test utterance; then during recognition, the speaker model and
background model are either transformed by an MLLR-based handset-specific transformation or replaced by a handset-dependent speaker model and a handset-dependent background
model whose parameters were adapted by reinforced learning to fit
the new environment. Experimental results based on 150 speakers
of the HTIMIT corpus show that environment adaptation based on
both MLLR and reinforced learning outperforms the classical CMS,
Hnorm and Tnorm approaches, with MLLR adaptation achieving the best performance.
On Cohort Selection for Speaker Verification
Yaniv Zigel, Arnon Cohen; Ben-Gurion University,
Israel
Speaker verification systems require some kind of background
model to reliably perform the verification task. Several algorithms
have been proposed for the selection of cohort models to form
a background model. This paper proposes a new cohort selection method called the Close Impostor Clustering (CIC). The new
method is shown to outperform several other methods in a text-dependent verification task. Several normalization methods are also compared. With three cohort models and the best score-normalization method, the CIC yielded an average Equal Error Rate (EER) of 0.8%, while the second-best method (Maximally-Spread Close, MSC) yielded an average EER of 1.1%.
Speaker Characterization Using Principal
Component Analysis and Wavelet Transform for
Speaker Verification
C. Tadj, A. Benlahouar; École de Technologie
Supérieure, Canada
In this paper, we investigate the use of the Wavelet Transform
for text-dependent and text-independent Speaker Verification tasks.
We introduce a Principal Component Analysis based wavelet transform to perform frequency-band segmentation through multi-level decomposition. A speaker-dependent library tree is built, corresponding to the best structure for a given speaker. The constructed tree is abstract and specific to each speaker; the extracted parameters are therefore more discriminative and appropriate for speaker verification applications. The method has been compared to MFCCs and other wavelet-based parameters. Experiments were conducted on corpora extracted from the YOHO and SPIDRE databases. The technique has shown robustness and 100% efficiency in both cases.
Unsupervised Speaker Indexing Using Anchor
Models and Automatic Transcription of
Discussions
Yuya Akita, Tatsuya Kawahara; Kyoto University,
Japan
We present unsupervised speaker indexing combined with automatic speech recognition (ASR) for speech archives such as discussions. Our proposed indexing method is based on anchor models,
by which we define a feature vector based on the similarity with
speakers of a large scale speech database. Several techniques are
introduced to improve discriminative ability. ASR is performed using the results of this indexing. Since no discussion corpus was available to train acoustic and language models, we applied speaker adaptation to the baseline acoustic model based on the indexing. We also constructed a language model by merging two models that cover different linguistic features. We achieved a speaker indexing accuracy of 93% and a significant improvement in ASR performance on real discussion data.
A Statistical Approach to Assessing Speech and
Voice Variability in Speaker Verification
Klaus R. Scherer, D. Grandjean, T. Johnstone, G.
Klasmeyer, Tanja Bänziger; University of Geneva,
Switzerland
Voice and speech parameters for a single speaker vary widely over
different contexts, in particular in situations in which speakers are
affected by stress or emotion or in which speech styles are used
strategically. This high degree of intra-speaker variability presents
a major challenge for speaker verification systems. Based on a large-scale study in which different kinds of affective states were induced
in over 100 speakers from three language groups, we use a statistical approach to identify speech and voice parameters that are likely
to strongly vary as a function of the respective situation and affective state as well as those that tend to remain relatively stable. In
addition, we evaluate the latter with respect to their potential to
differentiate individual speakers.
Automatic Singer Identification of Popular Music
Recordings via Estimation and Modeling of Solo
Vocal Signal
Wei-Ho Tsai, Hsin-Min Wang, Dwight Rodgers;
Academia Sinica, Taiwan
This study presents an effective technique for automatically identifying the singer of a music recording. Since the vast majority of
popular music contains background accompaniment during most
or all vocal passages, directly acquiring isolated solo voice data for
extracting the singer’s vocal characteristics is usually infeasible. To
eliminate the interference of background music for singer identification, we leverage statistical estimation of a piece’s musical background to build a reliable model for the solo voice. Validity of the
proposed singer identification system is confirmed via the experimental evaluations conducted on a 23-singer pop music database.
A DP Algorithm for Speaker Change Detection
Michele Vescovi 1, Mauro Cettolo 2, Romeo Rizzi 1; 1 Università degli Studi di Trento, Italy; 2 ITC-irst, Italy
The Bayesian Information Criterion (BIC) is a widely adopted
method for audio segmentation; typically, it is applied within a sliding variable-size analysis window where single changes in the nature
of the audio are locally searched.
In this work, a dynamic programming algorithm which uses the BIC
method for globally segmenting the input audio stream is described,
analyzed, and experimentally evaluated.
On the 2000 NIST Speaker Recognition Evaluation test set, the DP
algorithm outperforms the local one by 2.4% (relative) F-score in the
detection of changes, at the cost of being 38 times slower.
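As an illustrative aside, the ΔBIC test underlying both the sliding-window and DP variants can be sketched for a single candidate change point with full-covariance Gaussians (the penalty weight and all constants below are generic, not the paper's settings):

```python
import numpy as np

def delta_bic(window: np.ndarray, t: int, lam: float = 1.0) -> float:
    """BIC evidence for a change at frame t within the window.

    window: (n_frames, dim) feature matrix. A positive value favours
    modelling the two halves with separate full-covariance Gaussians,
    i.e. a change point at t.
    """
    n, d = window.shape
    x1, x2 = window[:t], window[t:]
    logdet = lambda x: np.linalg.slogdet(np.cov(x, rowvar=False, bias=True))[1]
    penalty = 0.5 * (d + 0.5 * d * (d + 1)) * np.log(n)  # Gaussian params
    return (n * logdet(window) - len(x1) * logdet(x1)
            - len(x2) * logdet(x2) - lam * penalty)

# Two segments with different statistics: ΔBIC at the true boundary
# should be positive, favouring a change.
rng = np.random.default_rng(0)
a = rng.normal(0.0, 1.0, size=(200, 5))
b = rng.normal(3.0, 2.0, size=(200, 5))
print(delta_bic(np.vstack([a, b]), 200) > 0)  # True
```

The local method evaluates this quantity inside a sliding window, whereas the DP algorithm searches over segmentations of the whole stream.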
Automatic Estimation of Perceptual Age Using
Speaker Modeling Techniques
Nobuaki Minematsu, Keita Yamauchi, Keikichi Hirose;
University of Tokyo, Japan
This paper proposes a technique to estimate a speaker's perceptual age automatically, using only acoustic information from the speaker's utterances. First, we experimentally collected data on how old individual speakers in the databases sound to listeners. Speech samples of approximately 500 male speakers spanning a very wide range of real ages were presented to listeners, who were asked to estimate each speaker's age by listening alone. Using the results, the perceptual age of each speaker was defined in two ways: as a label (the age averaged over listeners) and as a distribution. Each speaker was then modeled acoustically by GMMs. Finally, the perceptual age of an input speaker was estimated as a weighted sum of the perceptual ages of all the other speakers in the databases, where the weight for speaker i was calculated as a function of the likelihood of the input speaker under speaker i's model. Experiments showed a correlation of about 0.9 between the perceptual age estimated by the listening test and that estimated by the proposed method. The paper also introduces some techniques for robust estimation of the perceptual age.
Speaker Recognition Using Local Models
Ryan Rifkin; Honda Research Institute, USA
Many of the problems arising in speech processing are characterized by extremely large training and testing sets, constraining the
kinds of models and algorithms that lead to tractable implementations. In particular, we would like the amount of processing associated with each test frame to be sublinear (i.e., logarithmic) in
the number of training points. In this paper, we consider smoothed
kernel regression models at each test frame, using only those training frames that are close to the desired test frame. The problem is
made tractable via the use of approximate nearest neighbors techniques. The resulting system is conceptually simple, easy to implement, and fast, with performance comparable to more sophisticated
methods. Preliminary results on a NIST speaker recognition task are
presented, demonstrating the feasibility of the method.
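As an illustrative aside, the per-frame estimator can be sketched as Nadaraya-Watson style kernel regression restricted to the nearest training frames (the paper uses approximate nearest neighbours for sublinear cost; brute force and all parameter values below are illustrative):

```python
import numpy as np

def local_kernel_regression(x, train_x, train_y, k=5, h=1.0):
    """Smoothed kernel regression at query x using only the k nearest
    training frames (brute-force neighbour search for clarity)."""
    d2 = np.sum((train_x - x) ** 2, axis=1)
    idx = np.argsort(d2)[:k]               # k nearest frames
    w = np.exp(-d2[idx] / (2.0 * h ** 2))  # Gaussian kernel weights
    return np.dot(w, train_y[idx]) / np.sum(w)

# Toy data: labels +1 around one centre, -1 around another.
rng = np.random.default_rng(1)
xa = rng.normal(0.0, 0.3, size=(50, 2))
xb = rng.normal(3.0, 0.3, size=(50, 2))
X = np.vstack([xa, xb])
y = np.array([1.0] * 50 + [-1.0] * 50)
print(local_kernel_regression(np.zeros(2), X, y) > 0)  # True
```

Replacing the argsort with an approximate nearest-neighbour index is what makes the per-frame cost sublinear in the number of training points.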
Dependence of GMM Adaptation on Feature
Post-Processing for Speaker Recognition
Robbie Vogt, Jason Pelecanos, Sridha Sridharan;
Queensland University of Technology, Australia
This paper presents a study on the relationship between feature
post-processing and speaker modelling techniques for robust text-independent speaker recognition. A fully coupled target and background Gaussian mixture speaker model structure is used for hypothesis testing in this speaker-model-based recognition system.
Two formulations of the Maximum a Posteriori (MAP) adaptation
algorithm for Gaussian mixture models are considered. We contrast the standard single iteration adaptation algorithm to adaptation using multiple iterations. Three post-processing techniques
for cepstral features are considered: feature warping, cepstral mean
subtraction (CMS) and RelAtive SpecTrA (RASTA) processing. It is
shown that the advantage gained through iterative MAP adaptation
is dependent on the parameterisation technique used. Reasons for
this dependency are discussed.
Text-Independent Speaker Recognition by
Speaker-Specific GMM and Speaker Adapted
Syllable-Based HMM
Seiichi Nakagawa, Wei Zhang; Toyohashi University of Technology, Japan
We present a new text-independent speaker recognition method that combines a speaker-specific Gaussian Mixture Model (GMM) with a syllable-based HMM adapted by MLLR or MAP. The robustness of this speaker recognition method to changes in speaking style was evaluated. A speaker identification experiment was conducted using the NTT database, which consists of sentence data uttered at three speed modes (normal, fast and slow) by 35 Japanese speakers (22 males and 13 females) over five sessions spanning ten months. Each speaker uttered only 5 training utterances. We obtained an accuracy of 100% for text-independent speaker identification. This result was superior to some conventional methods on the same database.
SOM as Likelihood Estimator for Speaker Clustering
Itshak Lapidot; IDIAP, Switzerland
A new approach is presented for clustering speakers from unlabeled and unsegmented conversations when the number of speakers is unknown. In this approach, a Self-Organizing Map (SOM) is used as a likelihood estimator for each speaker model. To estimate the number of clusters, the Bayesian Information Criterion (BIC) is applied. The approach was tested on the NIST 1996 HUB-4 evaluation test in terms of speaker and cluster purities. Results indicate that the combined SOM-BIC approach can lead to better clustering results than the baseline system.
On the Amount of Speech Data Necessary for
Successful Speaker Identification
Aleš Padrta, Vlasta Radová; University of West
Bohemia in Pilsen, Czech Republic
The paper deals with the dependence of speaker identification performance on the amount of test data. Three speaker
identification procedures based on hidden Markov models (HMMs)
of phonemes are presented here. One, which is quite commonly
used in the speaker recognition systems based on HMMs, uses the
likelihood of the whole utterance for speaker identification. The
other two that are proposed in this paper are based on the majority
voting rule. The experiments were performed for two different situations: either both training and test data were obtained from the
same channel, or they were obtained from different channels. All
experiments show that the proposed speaker identification procedure based on the majority voting rule for sequences of phonemes
allows us to reduce the amount of test data necessary for successful
speaker identification.
Speaker Verification Based on the German VeriDat
Database
Ulrich Türk, Florian Schiel; Ludwig-Maximilians-Universität München, Germany
This paper introduces the new German speaker verification (SV) database VeriDat, as well as the system design, the baseline performance, and the results of several experiments with our experimental SV framework. The main focus is on how typical problems with real-world telephone speech can be avoided by automatically rejecting inputs to the enrollment or test material. Possible splittings of the data sets according to network type and acoustic environment are tested in cheating experiments.
Session: PThCf– Poster
Robust Speech Recognition IV
Time: Thursday 13.30, Venue: Main Hall, Level -1
Chair: Jean-Claude Junqua, Panasonic, USA
A Segment-Based Algorithm of Speech
Enhancement for Robust Speech Recognition
Guokang Fu 1, Ta-Hsin Li 2; 1 IBM China Research Lab, China; 2 IBM T.J. Watson Research Center, USA
Accurate recognition of speech in noisy environments is still an obstacle to wider application of speech recognition technology. Noise reduction, which aims to clean the corrupted test signal to match the ideal training conditions, remains an effective approach to improving the accuracy of speech recognition in noisy environments. This paper introduces a new noise reduction algorithm that combines a tree-based segmentation method with maximum likelihood estimation to accommodate the nonstationarity of speech while efficiently suppressing possibly nonstationary noise. Numerical results obtained from experiments on a speech recognition system show the effectiveness of the proposed algorithm in improving the accuracy of Chinese speech recognition.
Robust Multiple Resolution Analysis for Automatic
Speech Recognition
Roberto Gemello 1, Franco Mana 1, Dario Albesano 1, Renato De Mori 2; 1 Loquendo, Italy; 2 LIA-CNRS, France
This paper investigates the potential of exploiting the redundancy implicit in Multi Resolution Analysis (MRA) for Automatic Speech Recognition (ASR) systems. Experiments, carried out with data collected from home telephones and in cars, confirm the proposed approach for exploiting this redundancy.
Comparisons with Mel Frequency-scaled Cepstral Coefficients (MFCCs) and JRASTA Perceptual Linear Prediction coefficients (JRASTA-PLP) indicate that applying Principal Component Analysis (PCA) to MRA features results in performance superior to MFCCs and competitive with JRASTA-PLP features.
Experiments in noisy conditions, using the Italian component of the AURORA3 corpus, show a WER reduction of 15.7% when SNR-dependent Spectral Subtraction (SS) is performed on MRA-PCA features compared to when it is performed on JRASTA-PLP features. Furthermore, SS appears to be better than Soft Thresholding (ST).
An Accurate Noise Compensation Algorithm in the
Log-Spectral Domain for Robust Speech
Recognition
Mohamed Afify; Cairo University, Egypt
This paper presents an algorithm for noise compensation in the log-spectral domain. The idea is to use accurate approximations that allow theoretical derivation of the noisy speech statistics, and then to use these statistics to define a compensation algorithm under a Gaussian mixture model assumption. The algorithm is tested on a digit database recorded in a car; the word recognition accuracies for the baseline (uncompensated) system, first-order VTS, the proposed method, and the matched test are 85.8%, 90.6%, 93.1%, and 93.9%, respectively. This clearly indicates the performance gain due to the proposed technique.
A New Adaptive Long-Term Spectral Estimation
Voice Activity Detector
Javier Ramírez, José C. Segura, Carmen Benítez, Ángel de la Torre, Antonio J. Rubio; Universidad de Granada, Spain
This paper presents an efficient voice activity detector (VAD) based on the estimation of the long-term spectral divergence (LTSD) between noise and speech periods. The proposed method decomposes the input signal into overlapped speech frames, uses a sliding window to compute the long-term spectral envelope, and measures the speech/non-speech LTSD, thus yielding a highly discriminating decision rule and minimizing the average number of decision errors. To increase non-speech detection accuracy, the decision threshold is adapted to the measured noise energy, while a controlled hang-over is activated only when the observed signal-to-noise ratio (SNR) is low. An exhaustive analysis of the proposed VAD is carried out using the AURORA TIdigits and SpeechDat-Car (SDC) databases. The proposed VAD is compared to the most commonly used detectors in the field in terms of speech/non-speech detection and recognition performance. Experimental results demonstrate a sustained advantage over the G.729, AMR and AFE VADs.
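As an illustrative aside, the long-term spectral divergence at the heart of such a VAD can be sketched as follows (after Ramírez et al.; the window order, dB threshold and all constants below are illustrative, not the paper's settings):

```python
import numpy as np

def ltsd(spectra: np.ndarray, noise_psd: np.ndarray, order: int = 3) -> np.ndarray:
    """Long-term spectral divergence per frame, in dB.

    spectra: (n_frames, n_bins) magnitude spectra; noise_psd: (n_bins,)
    average noise power estimate. The long-term spectral envelope is
    the maximum magnitude over a +/-order frame window.
    """
    n = len(spectra)
    out = np.empty(n)
    for t in range(n):
        lo, hi = max(0, t - order), min(n, t + order + 1)
        ltse = spectra[lo:hi].max(axis=0)  # long-term spectral envelope
        out[t] = 10.0 * np.log10(np.mean(ltse ** 2 / noise_psd))
    return out

# Frames whose envelope sits well above the noise floor score high;
# noise-only frames score near 0 dB, so a threshold separates them.
noise = np.ones(64)
frames = np.vstack([np.ones((10, 64)), 8.0 * np.ones((10, 64))])
d = ltsd(frames, noise)
print(d[0] < 6.0 < d[-1])  # True
```

Comparing the divergence against a noise-adaptive threshold, with hang-over at low SNR, then yields the speech/non-speech decision.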
Robust Speech Recognition Using Non-Linear
Spectral Smoothing
Michael J. Carey; University of Bristol, U.K.
A new simple but robust method of front-end analysis, nonlinear
spectral smoothing (NLSS), is proposed. NLSS uses rank-order filtering to replace noisy low-level speech spectrum coefficients with
values computed from adjacent spectral peaks. The resulting transformation bears significant similarities to masking in the auditory system. It can be used as an intermediate processing stage between the FFT and the filter-bank analyzer. It also produces features
which can be cosine transformed and used by a pattern matcher.
NLSS gives significant improvements in the performance of speech
recognition systems in the presence of stationary noise, a reduction
in error rate of typically 50% or an increased tolerance to noise of
3dB for the same error rate in an isolated digit test on the Noisex
database. Results on female speech were superior to those on male
speech: female speech gave a recognition error rate of 1.1% at a 0 dB signal-to-noise ratio.
A Novel Use of Residual Noise Model for Modified
PMC
Cailian Miao, Yangsheng Wang; Chinese Academy of
Sciences, China
In this paper, a new model-adaptation approach to the acoustic mismatch problem is proposed. A specific bias model – a residual noise model – is presented, which jointly compensates for additive and convolutive bias. The noise model is estimated in a maximum likelihood manner. In conjunction with Parallel Model Combination (PMC), it is effective in noisy environments. Experiments on continuous Mandarin digit recognition in noisy environments were carried out using the HTK toolkit.
Robust Speech Recognition to Non-Stationary
Noise Based on Model-Driven Approaches
Christophe Cerisara, Irina Illina; LORIA, France
Automatic speech recognition works quite well in clean conditions, and several algorithms have already been proposed to deal with stationary noise. The next challenge is to cope with non-stationary noise, and this paper studies that problem. We propose three algorithms for non-stationary noise adaptation: static and dynamic Optional Parallel Model Combination (OPMC), and an algorithm derived from the Missing Data framework. The combination of speech and noise is expressed in the spectral domain, and different ways to estimate the non-stationary noise model are studied. The proposed algorithms are tested on a telephone database with added background music at different SNRs. The best result is obtained using dynamic OPMC.
Towards Missing Data Recognition with Cepstral
Features
Christophe Cerisara; LORIA, France
We study in this work the Missing Data Recognition (MDR) framework applied to a large vocabulary continuous speech recognition
(LVCSR) task with cepstral models when the speech signal is corrupted by musical noise. We do not propose a full system that
solves this difficult problem, but rather present some of the issues involved and study possible solutions to them. We focus
in this work on the issues concerning the application of masks to
cepstral models. We further identify possible errors and study how
some of them affect the performance of the system.
On-Line Parametric Histogram Equalization
Techniques for Noise Robust Embedded Speech
Recognition
Hemmo Haverinen, Imre Kiss; Nokia Research Center, Finland
In this paper, two low-complexity histogram equalization algorithms are presented that significantly reduce the mismatch between training and testing conditions in HMM-based automatic speech recognizers. The proposed algorithms use Gaussian approximations for the initial and target distributions and perform a linear mapping between them. We show that even this simplified mapping can improve the noise robustness of ASR systems, while the associated computational load, memory requirements, and algorithmic delay are minimal. The proposed algorithms were evaluated on a multi-lingual speaker-independent isolated word recognition task, both without and in combination with on-line MAP acoustic model adaptation. The best results showed an approximate 25%/20% relative error-rate reduction without/with acoustic model adaptation.
Compensation of Channel Distortion in Line
Spectrum Frequency Domain
The performance of the proposed methods is comparable to that of CMN in using cepstral coefficients.
Voicing Parameter and Energy Based
Speech/Non-Speech Detection for Speech
Recognition in Adverse Conditions
Arnaud Martin 1, Laurent Mauuary 2; 1 Université de Bretagne Sud, France; 2 France Télécom R&D, France
In adverse conditions, speech recognition performance decreases partly because of imperfect speech/non-speech detection. In this paper, a new combination of a voicing parameter and energy for speech/non-speech detection is described. This combination notably avoids noise detections in very noisy real-life environments and provides better performance for continuous speech recognition. The new speech/non-speech detection approach outperforms both noise-statistics-based [1] and Linear Discriminant Analysis (LDA) based [2] criteria in noisy environments and for continuous speech recognition applications.
Two Correction Models for Likelihoods in Robust
Speech Recognition Using Missing Feature Theory
Hugo Van hamme; Katholieke Universiteit Leuven, Belgium
In Missing Feature Theory (MFT), it is assumed that some of the features extracted from an observation are missing or unreliable. Applied to spectral features for noisy speech recognition, the clean feature values are known to be less than the observed noisy features. Based on this inequality constraint, an HMM-state-dependent clean speech value of the missing features can be inferred through maximum likelihood estimation. This paper describes two observed biases of the likelihood evaluated at the estimate. Theoretical and experimental evidence is provided that an upper bound on the accuracy is improved by applying computationally simple corrections for the number of free variables in the likelihood maximization and for the global acoustic space density function.
Spectral Maxima Representation for Robust
Automatic Speech Recognition
J. Sujatha, K.R. Prasanna Kumar, K.R. Ramakrishnan, N. Balakrishnan; Indian Institute of Science, India
In the context of automatic speech recognition, the popular Mel
Frequency Cepstral Coefficients(MFCC) as features, though perform
very well under clean and matched environments, are observed to
fail in mismatched conditions. The spectral maxima are often observed to preserve their locations and energies under noisy environments, but are not presented explicitly by the MFCC features.
This paper presents a framework for representing the maxima information for robust recognition in the presence of additive White
Gaussian Noise(WGN). For the task of phoneme based Isolated Word
Recognition (IWR) under different Signal to Noise Ratio (SNR) environments, the results show an improved recognition performance.
The cepstral features are computed from a reconstructed spectrogram by fitting gaussians around the spectral maxima. In view of
the inherent robustness and easy trackability of the maxima, this
opens up interesting avenues towards a robust feature representation as well as preprocessing techniques.
Missing Feature Theory Applied to Robust Speech
Recognition Over IP Network
An-Tze Yu, Hsiao-Chuan Wang; National Tsing Hua
University, Taiwan
This paper addresses the problem of channel effect in the line spectrum frequency (LSF) domain. The channel effect can be expressed
in terms of the channel phase. The speech signal is represented by
its inverse filter derived from LP analysis. Then the mean normalization on the inverse filters is introduced for removing the channel
distortion. Further study indicates that the mean normalization on
the inverse filters becomes the mean subtraction in phase domain.
Based on this finding, two methods are proposed to compensate the
channel effect. Experiments on simulated channel distorted speech
are conducted to evaluate the effectiveness of the proposed methods. The experimental results show that the proposed methods can
give significant improvements in speech recognition performance.
Toshiki Endo 1 , Shingo Kuroiwa 2 , Satoshi
Nakamura 1 ; 1 ATR-SLT, Japan; 2 University of
Tokushima, Japan
This paper addresses the problems involved in performing speech
recognition over mobile and IP networks. The main problem is
speech data loss caused by packet loss in the network. We present
two missing-feature-based approaches that recover lost regions of
speech data. These approaches are based on reconstruction of missing frames or on marginal distributions. For comparison, we also
use a tacking method, which recognizes only received data. We
evaluate these approaches with packet loss models, i.e., random
loss and Gilbert loss models. The results show that the marginal-
108
Eurospeech 2003
Thursday
distributions-based approach is most effective for a packet loss environment; the degradation of word accuracy is only 5% when the
packet loss rate is 30% and only 3% when mean burst loss length is
24 frames.
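The Gilbert loss model used in the evaluation above is a two-state (good/bad) Markov chain that produces bursty packet loss. A small simulator; the transition probabilities below are illustrative values chosen to give roughly a 30% loss rate with 24-packet mean bursts, matching the operating point quoted in the abstract:

```python
import random

def gilbert_loss(num_packets, p_enter=0.018, p_stay=23.0 / 24.0, seed=0):
    """Simulate bursty packet loss with a two-state (good/bad) Markov chain.

    p_enter: probability of moving from the good state to the bad (loss) state.
    p_stay:  probability of staying in the bad state; the mean burst length
             is 1 / (1 - p_stay) packets (24 here).
    Returns a list of booleans, True = packet lost.
    """
    rng = random.Random(seed)
    lost = False
    trace = []
    for _ in range(num_packets):
        lost = rng.random() < (p_stay if lost else p_enter)
        trace.append(lost)
    return trace

trace = gilbert_loss(10000)
# Stationary loss rate is p_enter / (p_enter + 1 - p_stay), about 0.30 here.
loss_rate = sum(trace) / len(trace)
```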
Comparative Experiments to Evaluate the Use of
Auditory-Based Acoustic Distinctive Features and
Formant Cues for Robust Automatic Speech
Recognition in Low-SNR Car Environments
Hesham Tolba, Sid-Ahmed Selouani, Douglas
O’Shaughnessy; Université du Québec, Canada
This paper presents an evaluation of the use of some auditory-based
distinctive features and formant cues for robust automatic speech
recognition (ASR) in the presence of highly interfering car noise.
Comparative experiments have indicated that combining the classical MFCCs with some auditory-based acoustic distinctive cues and
either the main formant magnitudes or the formant frequencies of
a speech signal using a multi-stream paradigm leads to an improvement in the recognition performance in noisy car environments. To
test the use of the new multi-stream feature vector, a series of experiments on speaker-independent continuous-speech recognition
have been carried out using a noisy version of the TIMIT database.
We found that the proposed multi-stream paradigm outperforms the conventional recognition process
based on the use of the MFCCs in interfering noisy car environments
for a wide range of SNRs.
Robust Speech Recognition Using Missing Feature
Theory in the Cepstral or LDA Domain
Hugo Van hamme; Katholieke Universiteit Leuven,
Belgium
When applying Missing Feature Theory to noise robust speech
recognition, spectral features are labeled as either reliable or unreliable in the time-frequency plane. The acoustic model evaluation of
the unreliable features is modified to express that their clean values
are unknown or confined within bounds. Classically, MFT requires
an assumption of statistical independence in the spectral domain,
which deteriorates the accuracy on clean speech. In this paper,
MFT is expressed in any domain that is a linear transform of (log-)spectra, for example cepstra and their time-derivatives. The
acoustic model evaluation is recast as a nonnegative least squares
problem. Approximate solutions are proposed and the success
of the method is shown through experiments on the AURORA-2
database.
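Recasting the acoustic model evaluation as a nonnegative least squares problem, as described above, amounts to solving min_x ||Ax - b||^2 subject to x >= 0. The paper proposes its own approximate solvers; a generic projected-gradient stand-in looks like this:

```python
import numpy as np

def nnls_projected_gradient(A, b, iters=500):
    """Approximately solve min_x ||Ax - b||^2 subject to x >= 0 by
    gradient descent with projection onto the nonnegative orthant."""
    x = np.zeros(A.shape[1])
    # Step size from the Lipschitz constant of the gradient, 2*||A^T A||.
    L = 2.0 * np.linalg.norm(A.T @ A, 2)
    for _ in range(iters):
        grad = 2.0 * A.T @ (A @ x - b)
        x = np.maximum(x - grad / L, 0.0)
    return x

A = np.array([[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]])
b = np.array([1.0, 0.1, -1.0])
x = nnls_projected_gradient(A, b)  # optimum is approximately [0.55, 0]
```

Note how the nonnegativity constraint pins the second coordinate to zero, where the unconstrained least squares solution would make it negative.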
Bandwidth Mismatch Compensation for Robust
Speech Recognition
Yuan-Fu Liao 1 , Jeng-Shien Lin 1 , Wei-Ho Tsai 2 ;
1 National Taipei University of Technology, Taiwan;
2 Academia Sinica, Taiwan
In this paper, an iterative bandwidth mismatch compensation (BMC) algorithm is proposed to alleviate the need for multiple pre-trained models for recognizing speech of different bandwidths. BMC uses the concept of bandwidth extension, similar to that used in speech enhancement approaches. However, it aims at directly improving recognition accuracy rather than speech intelligibility or quality, and utilizes only the recognizer’s hidden Markov models (HMMs) for both bandwidth mismatch compensation and recognition. BMC first detects the bandwidth of the input speech signal based on a divergence measurement. An HMM/Gaussian mixture model (GMM)-based method is then used to iteratively segment the input speech utterance and compensate the speech features. Experiments under serious bandwidth-mismatched conditions, i.e., training on an 8 kHz and testing on a 4 kHz or 5.5 kHz bandwidth database, have verified the effectiveness of the proposed approach.
Markov Chain Monte Carlo Methods for Noise
Robust Feature Extraction Using the
Autoregressive Model
Robert W. Morris, Jon A. Arrowood, Mark A. Clements;
Georgia Institute of Technology, USA
In this paper, Markov Chain Monte Carlo techniques are applied to feature estimation for automatic speech recognition. These methods open up new possibilities for leveraging the autoregressive assumption for noise-robust feature extraction. Two minimum mean square error estimators are compared that directly estimate the mean of the feature vectors. The first estimator uses the assumption that the speech is an autoregressive signal, while the second makes no assumptions about the speech spectrum. By creating samples from the posterior distribution, these methods also provide an elegant solution to finding feature variances. These variances can be used to create optimal temporal smoothers of the features as well as input for uncertain observation decoding. Testing on the Aurora2 database shows that autoregressive modeling provides additional information that improves speech recognition performance. In addition, both smoothing and uncertain observation decoding improve performance in this method.
A Comparative Study of Some Discriminative
Feature Reduction Algorithms on the AURORA
2000 and the DaimlerChrysler In-Car ASR Tasks
Joan Marí Hilario, Fritz Class; DaimlerChrysler AG,
Germany
A common practice in ASR for adding contextual information is to append consecutive feature frames into a single large feature vector. However, this increases the processing time in the acoustic modelling and may lead to poorly trained parameters. A possible solution is to use a Linear Discriminant Analysis (LDA) mapping to reduce the dimensionality of the features, but this is not optimal, at least in the case where the LDA classes are HMM states. It is shown in this paper that the feature reduction problem is essentially a problem of approximating class posterior probabilities. These can be approximated using Neural Nets (NN). Some approaches using different choices for the classes and NN topology are presented and tested on the AURORA 2000 digit task and on our in-car task. Results on AURORA show a significant performance increase compared to LDA, but none of the NN-based approaches outperforms LDA on our in-car task.
Session: PThCg – Poster
Multi-Lingual Spoken Language Processing
Time: Thursday 13.30, Venue: Main Hall, Level -1
Chair: Torbjorn Svendsen, NTNU, Trondheim, Norway
Recent Progress in the Decoding of Non-Native
Speech with Multilingual Acoustic Models
V. Fischer, E. Janke, S. Kunzmann; IBM Pervasive
Computing, Germany
In this paper we report on recent progress in the use of multilingual Hidden Markov Models for the recognition of non-native speech. While we have previously discussed the use of bilingual acoustic models and recognizer combination methods, we now seek to avoid the increased computational load imposed by methods such as ROVER by focusing on acoustic models that share training data from 5 languages. Our investigations concentrate on the determination of a proper model complexity and show the multilingual models’ capability to handle cases where a non-native speaker is borrowing phones from his or her native language. Finally, using a limited amount of non-native speech for MLLR adaptation, we demonstrate the superiority of multilingual models even after adaptation.
An NN-Based Approach to Prosodic Information
Generation for Synthesizing English Words
Embedded in Chinese Text
Wei-Chih Kuo, Li-Feng Lin, Yih-Ru Wang, Sin-Horng
Chen; National Chiao Tung University, Taiwan
In this paper, a neural network-based approach to generating proper
prosodic information for spelling/reading English words embedded
in background Chinese texts is discussed. It expands an existing
RNN-based prosodic information generator for Mandarin TTS to an
RNN-MLP scheme for Mandarin-English mixed-lingual TTS. It first
treats each English word as a Chinese word and uses the RNN,
trained for Mandarin TTS, to generate a set of initial prosodic information for each syllable of the English word. It then refines the
initial prosodic information by using additional MLPs. The resulting prosodic information is expected to be appropriate for English-word synthesis as well as to match well with that of the background Mandarin speech. Experimental results showed that the proposed RNN-MLP scheme performed very well. For English word spelling/reading, RMSEs of 41.8/78.2 ms, 30.8/26 ms, 0.65/0.45 ms/frame, and 3.06/4.9 dB were achieved in the open tests for the synthesized syllable duration, inter-syllable pause duration, pitch contour, and energy level, respectively. These results suggest that this is a promising approach.
Speaker Adaptation for Non-Native Speakers Using
Bilingual English Lexicon and Acoustic Models
S. Matsunaga, A. Ogawa, Yoshikazu Yamaguchi, A.
Imamura; NTT Corporation, Japan
This paper proposes a supervised speaker adaptation method that
is effective for both non-native (i.e. Japanese) and native English
speakers’ pronunciation of English speech. This method uses English and Japanese phoneme acoustic models and a pronunciation lexicon in which each word has both English and Japanese
phoneme transcriptions. The same utterances are used for adaptation of both acoustic models. A recognition system uses these
two adapted acoustic models and the lexicon, and the highest-likelihood word sequence obtained by combining English- and Japanese-pronounced words is the recognition result. Continuous
speech recognition experiments show that the proposed adaptation
method greatly improves both Japanese-English and native English
recognition performance, and the system using bilingual adapted
models achieves the highest accuracy for Japanese speakers among
those using monolingual models, while maintaining the same performance level for native speakers as that of an English recognition
system using an English adapted model.
Using the Web for Fast Language Model
Construction in Minority Languages
Viet Bac Le 1 , Brigitte Bigi 1 , Laurent Besacier 1 , Eric
Castelli 2 ; 1 CLIPS-IMAG Laboratory, France; 2 MICA
Center, Vietnam
The design and construction of a language model for minority languages is a hard task. By minority language, we mean a language
with small available resources, especially for the statistical learning problem. In this paper, a new methodology for fast language
model construction in minority languages is proposed. It is based
on the use of Web resources to collect and build efficient textual corpora. By using some filtering techniques, this methodology allows quick and efficient construction of a language model at a small cost in terms of computational and human resources. Our preliminary experiments have shown excellent performance of the Web language models versus newspaper language models using the proposed filtering methods on a majority language (French). Applying the same approach to a minority language (Vietnamese), a valuable language model was constructed in 3 months with only 15% new development to modify some filtering tools.
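The overall recipe (collect web text, filter it, estimate an n-gram model) can be sketched as follows. The filtering rule, the Vietnamese toy sentences, and the add-one smoothing are illustrative stand-ins for the paper's actual techniques:

```python
from collections import Counter

def filter_lines(lines, min_words=3):
    """Toy filter: keep plausibly clean sentences, drop short or markup lines."""
    return [ln for ln in lines if len(ln.split()) >= min_words and "<" not in ln]

def train_bigram_lm(lines):
    """Maximum-likelihood bigram model with add-one smoothing."""
    unigrams, bigrams = Counter(), Counter()
    for ln in lines:
        words = ["<s>"] + ln.split() + ["</s>"]
        unigrams.update(words[:-1])
        bigrams.update(zip(words[:-1], words[1:]))
    vocab = len(set(w for ln in lines for w in ln.split())) + 2

    def prob(w1, w2):
        return (bigrams[(w1, w2)] + 1) / (unigrams[w1] + vocab)

    return prob

raw = ["xin chao", "<html>", "toi la sinh vien o Ha Noi", "toi la giao vien"]
prob = train_bigram_lm(filter_lines(raw))
```

Only the two full sentences survive the filter; markup and fragments are dropped before estimation, which is the point of the filtering step.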
An Approach to Multilingual Acoustic Modeling for
Portable Devices
Yan Ming Cheng, Chen Liu, Yuan-Jun Wei, Lynette
Melnar, Changxue Ma; Motorola Labs, USA
There is an increasing need to deploy speech recognition systems
supporting multiple languages/dialects on portable devices worldwide. A common approach uses a collection of individual monolingual speech recognition systems as a solution. However, such an approach is not practical for handheld devices such as cell phones due
to stringent restrictions on memory and computational resources.
In this paper, we present a simple and effective method to develop
multilingual acoustic models that achieve comparable performance
relative to monolingual acoustic models but with only a fraction of
the storage space of the combined monolingual acoustic model set.
Cross-Lingual Pronunciation Modelling for
Indonesian Speech Recognition
Terrence Martin 1 , Torbjørn Svendsen 2 , Sridha
Sridharan 1 ; 1 Queensland University of Technology,
Australia; 2 Norwegian University of Science and
Technology, Norway
The resources necessary to produce Automatic Speech Recognition
systems for a new language are considerable, and for many languages these resources are not available. This emphasizes the need
for the development of generic techniques which overcome this data
shortage. Indonesian is one language which suffers from this problem and whose population and importance suggest it could benefit
from speech enabled technology. Accordingly, we investigate using
English acoustic models to recognize Indonesian speech. The mapping process, where the symbolic representation of the Source language acoustic models is equated to the Target language phonetic
units, has typically been achieved using one-to-one mapping techniques. This mapping method does not allow for the incorporation
of predictable allophonic variation in the lexicon. Accordingly, in
this paper we present the use of cross-lingual pronunciation modelling to extract context-dependent mapping rules, which are subsequently used to produce a more accurate cross-lingual lexicon.
Language Model Adaptation Using Cross-Lingual
Information
Woosung Kim, Sanjeev Khudanpur; Johns Hopkins
University, USA
The success of statistical language modeling techniques is crucially
dependent on the availability of a large amount of training text. For a
language in which such large text collections are not available, methods have recently been proposed to take advantage of a resourcerich language, together with cross-lingual information retrieval and
machine translation, to sharpen language models for the resourcedeficient language. In this paper, we describe investigations into
such language models for an automatic speech recognition system
for Mandarin Broadcast News. By exploiting a large side-corpus of
contemporaneous English news articles to adapt a static Chinese
language model to the news story being transcribed, we demonstrate significant improvements in recognition accuracy. The improvement from using English text is greater when less Chinese text
is available to estimate the static language model. We also compare
our cross-lingual adaptation to monolingual topic-dependent language model adaptation, and achieve further gains by combining
the two adaptation techniques.
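A common way to combine a static model with a story-adapted model, as in the adaptation above, is linear interpolation of their probabilities. A sketch; the interpolation weight is illustrative and would normally be tuned on held-out data:

```python
def interpolate(p_static, p_adapted, lam=0.3):
    """Linearly interpolate two language-model probabilities.

    lam weights the (e.g. cross-lingually) adapted model against the
    static background model; the result is still a valid probability
    as long as both inputs are.
    """
    return lam * p_adapted + (1.0 - lam) * p_static

# A word strongly suggested by the contemporaneous English side-corpus
# gets its probability boosted relative to the static model alone.
p = interpolate(p_static=0.001, p_adapted=0.01)  # -> 0.0037
```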
Multilingual Phone Clustering for Recognition of
Spontaneous Indonesian Speech Utilising
Pronunciation Modelling Techniques
Eddie Wong 1 , Terrence Martin 1 , Torbjørn Svendsen 2 ,
Sridha Sridharan 1 ; 1 Queensland University of
Technology, Australia; 2 Norwegian University of
Science and Technology, Norway
In this paper, a multilingual acoustic model set derived from English, Hindi, and Spanish is utilised to recognise speech in Indonesian. In order to achieve this task we incorporate a two-tiered
approach to perform the cross-lingual porting of the multilingual
models to a new language. In the first stage, we use an entropy
based decision tree to merge similar phones from different languages into clusters to form a new multilingual model set. In the
second stage, we propose the use of a cross-lingual pronunciation
modelling technique to perform the mapping from the multilingual
models to the Indonesian phone set. A set of mapping rules are
derived from this process and are employed to convert the original Indonesian lexicon into a pronunciation lexicon in terms of
the multilingual model set. Preliminary experimental results show
that, compared to the common knowledge based approach, both
of these techniques reduce the word error rate in a spontaneous
speech recognition task.
Language-Adaptive Persian Speech Recognition
Naveen Srinivasamurthy, Shrikanth Narayanan;
University of Southern California, USA
Development of robust spoken language technology ideally relies
on the availability of large amounts of data preferably in the target
domain and language. However, more often than not, speech developers need to cope with very little or no data, typically obtained
from a different target domain. This paper focuses on developing techniques towards addressing this challenge. Specifically we
consider the case of developing a Persian language speech recognizer with sparse amounts of data. For language modeling, there
are several potential sources of text data, e.g., available on the Internet, to help bootstrap initial models; however, acoustic data can be
obtained only by tedious data collection efforts. The drawback of
limited Persian acoustic data can be partially overcome by making
use of acoustic data from languages that have vast resources such
as English (and other languages, if available). The phoneme sets
especially for diverse languages such as English and Persian differ
considerably. However by incorporating knowledge-based as well
as data-driven phoneme mappings, reliable Persian acoustic models
can be trained using well-trained English models and small amounts
of Persian re-training data. In our experiments Persian models retrained from seed models created by data-driven phoneme mappings of English models resulted in a phoneme error rate of 19.80%
as compared to a phoneme error rate of 20.35% when the Persian
models were re-trained from seed models created by sparse Persian
data.
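A data-driven phoneme mapping of the kind described above can be sketched as picking, for each target (Persian) phone, the source (English) phone it is most often aligned with; the alignment counts below are invented for illustration, not taken from the paper:

```python
from collections import Counter, defaultdict

def learn_mapping(aligned_pairs):
    """aligned_pairs: (target_phone, recognized_source_phone) tuples,
    e.g. from aligning English-recognizer output with Persian transcripts.
    Maps each target phone to its most frequent source counterpart."""
    counts = defaultdict(Counter)
    for target, source in aligned_pairs:
        counts[target][source] += 1
    return {t: c.most_common(1)[0][0] for t, c in counts.items()}

# Hypothetical alignments for two Persian phones against English phones.
pairs = [("q", "k"), ("q", "k"), ("q", "g"), ("x", "h"), ("x", "h")]
mapping = learn_mapping(pairs)  # -> {"q": "k", "x": "h"}
```

The resulting table seeds the target-language models from the source-language ones before re-training on the small target corpus.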
Grapheme Based Speech Recognition
Mirjam Killer 1 , Sebastian Stüker 2 , Tanja Schultz 3 ; 1 ETH Zürich, Switzerland; 2 Universität Karlsruhe, Germany; 3 Carnegie Mellon University, USA
Large vocabulary speech recognition systems traditionally represent words in terms of subword units, usually phonemes. This
paper investigates the potential of graphemes acting as subunits.
In order to develop context-dependent grapheme-based speech recognizers, several decision-tree-based clustering procedures are performed and compared to each other. Grapheme-based speech recognizers in three languages – English, German, and Spanish – are trained and compared to their phoneme-based counterparts. The results show that for languages with a close grapheme-to-phoneme relation, grapheme-based modeling is as good as the phoneme-based one. Furthermore, multilingual grapheme-based recognizers are designed to investigate whether grapheme-based information can be
successfully shared among languages. Finally, some bootstrapping
experiments for Swedish were performed to test the potential for
rapid language deployment.
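Using graphemes as subword units makes the pronunciation lexicon trivial to build, which is what enables the rapid bootstrapping mentioned above; a minimal sketch:

```python
def grapheme_lexicon(words):
    """Map each word to its 'pronunciation' as a sequence of graphemes.

    Unlike a phoneme lexicon, this needs no hand-crafted
    grapheme-to-phoneme rules, so it transfers directly to a new
    language with an alphabetic script.
    """
    return {w: list(w.lower()) for w in words}

lex = grapheme_lexicon(["Haus", "casa"])  # {"Haus": ["h","a","u","s"], ...}
```

The trade-off, as the abstract notes, is that this works best for languages with a close grapheme-to-phoneme relation.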
Session: PThCh – Poster
Interdisciplinary
Time: Thursday 13.30, Venue: Main Hall, Level -1
Chair: Mike McTear, University of Ulster at Jordanstown
Learning Chinese Tones
Valery A. Petrushin; Accenture, USA
This paper is devoted to developing techniques for improving learning of foreign spoken languages. It presents a general framework
for evaluating a student’s spoken responses, which is based on collecting experimental data about experts’ and novices’ performance and
applying machine learning and knowledge management techniques
for deriving evaluation rules. The related speech analysis, visualization, and student response evaluation techniques are described. An
experimental course for learning tones of Standard Chinese (Mandarin) is discussed.
A Pronunciation Training System for Japanese
Lexical Accents with Corrective Feedback in
Learner’s Voice
Keikichi Hirose, Frédéric Gendrin, Nobuaki
Minematsu; University of Tokyo, Japan
A system was developed for teaching non-Japanese learners the pronunciation of Japanese lexical accents. The system first identifies word accent types in a learner’s utterance using the F0 change between two adjacent morae as the feature parameter. As the representative F0 value for a mora, we defined one with a good match to the perceived pitch. The system notifies the user whether his/her pronunciation is good or not and then generates audio and visual corrective feedback. Using the TD-PSOLA technique, the learner’s utterance is modified in its prosodic features by referring to the teacher’s features, and offered to the learner as the audio corrective feedback. The visual feedback is also offered to highlight the modifications made. Accent-type pronunciation training experiments were conducted with 8 non-Japanese speakers, and the results showed that the training process could be facilitated by the feedback, especially when the learners were asked to pronounce sentences.
Considerations on Vowel Durations for Japanese
CALL System
Taro Mouri, Keikichi Hirose, Nobuaki Minematsu;
University of Tokyo, Japan
Due to various difficulties in pronunciation, utterances by non-native speakers may be lacking in fluency. Japanese pronunciation is said to have mora-synchronism, and, therefore, we assume
that the disfluency may cause larger variations in vowel durations.
Analyses of vowel (and CV) durations were conducted for Japanese
sentence utterances by 2 non-Japanese speakers and one Japanese
speaker (all female speakers). Larger variations were clearly observed in non-Japanese utterances. Then, 10 Japanese speakers
were asked to rate the non-Japanese utterances. Strong negative
correlations were observed between durational variations and pronunciation ratings. Based on the result, a method was developed
for automatic evaluation of non-Japanese utterances. The ratings
by the method were shown to be close to those by native speakers. Also, in order to offer a corrective feedback in learner’s voice,
non-Japanese utterances were modified in their vowel durations by
referring to native Japanese utterances. The modification was done using the TD-PSOLA scheme. The results of a listening test indicated some improvement in nativeness.
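The durational-variation cue described above can be turned into a simple score; here the variation is measured as the coefficient of variation of vowel durations, an illustrative stand-in for the paper's exact measure, with made-up duration values:

```python
import statistics

def duration_variation(vowel_durations_ms):
    """Coefficient of variation of vowel durations.

    Under the mora-synchronism assumption, higher values suggest less
    even mora timing and hence less native-like pronunciation.
    """
    mean = statistics.fmean(vowel_durations_ms)
    return statistics.pstdev(vowel_durations_ms) / mean

native = [80, 85, 78, 82, 81]        # fairly even mora timing
learner = [60, 140, 70, 180, 90]     # large durational variation
```

A score like this could be mapped to a rating via the observed (negative) correlation between durational variation and native judgments.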
Influence of Recording Equipment on the
Identification of Second Language Phoneme
Contrasts
Hiroaki Kato 1 , Masumi Nukina 2 , Hideki Kawahara 2 , Reiko Akahane-Yamada 1 ; 1 ATR-HIS, Japan; 2 Wakayama University, Japan
This paper investigates the perceptual quality of English words
recorded with different types of microphones to assess their suitability for Computer Assisted Language Learning (CALL) systems.
English words minimally contrasting in /r/ and /l/, /b/ and /v/, or
/s/ and /th/ were recorded from native female and male speakers
of American English using six different microphones. The phonemic contrasts in these recordings were then evaluated by 14 native
listeners of American English. The results showed that the identification of the /r/-/l/ contrast was unaltered by the difference
in microphones, whereas that of the /s/-/th/ contrast significantly
dropped with several headset microphones, and that of the /b/-/v/
contrast dropped with a tie-pin microphone. These findings suggest
that some microphones are not appropriate for speech perception
training. Finally, a post hoc equalization procedure was applied
to compensate for the acoustic characteristics of the microphones
tested, and this procedure was confirmed to be effective in recovering phonemic contrasts under several conditions.
Training a Confidence Measure for a Reading Tutor
That Listens
Yik-Cheung Tam, Jack Mostow, Joseph E. Beck,
Satanjeev Banerjee; Carnegie Mellon University, USA
One issue in a Reading Tutor that listens is to determine which
words the student read correctly. We describe a confidence measure that uses a variety of features to estimate the probability that
a word was read correctly. We trained two decision tree classifiers.
The first classifier tries to fix insertion and substitution errors made
by the speech decoder, while the second classifier tries to fix deletion errors. By applying the two classifiers together, we achieved a relative reduction in false alarm rate of 25.89% while holding the
miscue detection rate constant.
Evaluating the Effect of Predicting Oral Reading
Miscues
Satanjeev Banerjee, Joseph E. Beck, Jack Mostow;
Carnegie Mellon University, USA
This paper extends and evaluates previously published methods
for predicting likely miscues in children’s oral reading in a Reading
Tutor that listens. The goal is to improve the speech recognizer’s
ability to detect miscues but limit the number of “false alarms” (correctly read words misclassified as incorrect). The “rote” method
listens for specific miscues from a training corpus. The “extrapolative” method generalizes to predict other miscues on other words.
We construct and evaluate a scheme that combines our rote and extrapolative models. This combined approach reduced false alarms
by 0.52% absolute (12% relative) while simultaneously improving
miscue detection by 1.04% absolute (4.2% relative) over our existing miscue prediction scheme.
VISPER II – Enhanced Version of the Educational
Software for Speech Processing Courses
Miroslav Holada, Jan Nouza; Technical University of
Liberec, Czech Republic
In this paper we describe a new version of the software tool developed for education and experimental work in the speech processing domain. Since 1997, when the original VISPER was released, we
have added several new modules and options that give a student a
deeper look at the basic principles, methods, and algorithms used mainly in speech recognition. Newly included modules allow for visualization of the Viterbi search algorithm, implemented in either a sequential or a parallel way; they introduce the idea of beam search with pruning and guide a student towards understanding
the principle of word string recognition. The VISPER concept of a
single graphic environment with mutually linked modules remains
untouched. The VISPER II is compatible with all recent versions of
the MS Windows OS and it is freely available.
The Use of Multiple Pause Information in
Dependency Structure Analysis of Spoken
Japanese Sentences
Meirong Lu, Kazuyuki Takagi, Kazuhiko Ozeki;
University of Electro-Communications, Japan
There is a close relationship between prosody and syntax. In the
field of speech synthesis, many investigations have been made to
control prosody so that it conforms to the syntactic structure of
the sentence. This paper is concerned with the inverse problem:
recovery of syntactic structure with the help of prosodic information. In our past investigations, it was observed that the duration of the inter-phrase pause is the most effective among various prosodic features in dependency structure analysis of Japanese sentences. In
those studies, only one kind of pause, i.e. the pause that immediately follows a phrase in question was used. In this paper, another
kind of pause is employed as a prosodic feature: the pause that immediately follows the succeeding phrase of a phrase in question. It
is shown that simultaneous use of the first and second pauses improves the parsing accuracy compared to the case where only the
first pause is used.
A Neural Network Approach to Dependency
Analysis of Japanese Sentences Using Prosodic
Information
Kazuyuki Takagi, Mamiko Okimoto, Yoshio Ogawa,
Kazuhiko Ozeki; University of
Electro-Communications, Japan
Prosody and syntax are significantly related to each other, as has often been observed. In the field of speech synthesis, many efforts
have been made to control prosody so that it reflects the syntactic
structure of the sentence. However, the inverse problem, recovery of syntactic structure using prosodic information, has not been investigated as much. This paper focuses on syntactic information contained in prosodic features extracted from read Japanese
sentences, and describes a method of exploiting it in dependency
structure analysis. In this paper, a multilayer perceptron is employed to estimate the conditional probability of the dependency distance of a phrase given its prosodic features, i.e., pause duration and F0 contour. Parsing accuracy was improved by combining two different kinds of prosodic information in the perceptron.
Say-As Classification for Alphabetic Words in
Japanese Texts
Hisako Asano, Masaaki Nagata, Masanobu Abe; NTT
Corporation, Japan
Modern Japanese texts often include Western-sourced words written in the Roman alphabet. For example, a shopping directory in a web
portal, which lists more than 8,000 shops, includes a total of 6,400
alphabetic words. As most of them are very new and idiosyncratic
proper nouns, it is impractical to assume all those alphabetic words
can be registered in the word dictionary of a text-to-speech synthesis system; their pronunciations must be derived automatically. Our
solution consists of two steps. Step 1 classifies each unknown alphabetic word into a say-as class (English, Japanese, French, Italian
or English spell-out), which indicates how it is to be read, and Step
2 derives the pronunciation using the grapheme-to-phoneme conversion rules for the classified say-as class. This paper proposes a
method of say-as classification (i.e. Step 1) that uses the Support
Vector Machine. After some trial and error, we achieved 89.2% accuracy for web shop data, which we think sufficient for practical use.
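The classification step can be illustrated with character n-gram features over the word. To keep the sketch dependency-free, a nearest-centroid classifier stands in for the paper's Support Vector Machine, only three of the five say-as classes are shown, and the tiny training lists are invented examples.

```python
from collections import Counter

# Invented toy training data; the real system is trained on thousands
# of labeled alphabetic words from web shop directories.
TRAIN = {
    "English": ["station", "flower", "network", "computer", "garden"],
    "French":  ["boutique", "croissant", "chateau", "beaujolais"],
    "Italian": ["espresso", "gelato", "risotto", "azzurro"],
}

def ngrams(word, n=3):
    # Character trigrams with word-boundary markers.
    w = f"^{word.lower()}$"
    return Counter(w[i:i + n] for i in range(len(w) - n + 1))

def centroid(words):
    total = Counter()
    for w in words:
        total.update(ngrams(w))
    return total

CENTROIDS = {cls: centroid(ws) for cls, ws in TRAIN.items()}

def cosine(a, b):
    dot = sum(a[k] * b[k] for k in a)
    na = sum(v * v for v in a.values()) ** 0.5
    nb = sum(v * v for v in b.values()) ** 0.5
    return dot / (na * nb) if na and nb else 0.0

def classify(word):
    # Assign the say-as class whose centroid is most similar.
    feats = ngrams(word)
    return max(CENTROIDS, key=lambda c: cosine(feats, CENTROIDS[c]))
```

An SVM replaces the centroid comparison with a large-margin decision over the same kind of feature vectors, which is what gives the reported 89.2% accuracy its robustness on unseen proper nouns.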
Automatic Transformation of Environmental
Sounds into Sound-Imitation Words Based on
Japanese Syllable Structure
Kazushi Ishihara, Yasushi Tsubota, Hiroshi G. Okuno;
Kyoto University, Japan
Sound-imitation words, a sound-related subset of onomatopoeia, are important for computer-human interaction and for the automatic tagging of sound archives. The main problem in the automatic recognition of sound-imitation words is that the literal representation of such words depends on the listener and is influenced by a particular cultural history. Based on our preliminary experiments on this dependency and on sonority theory, we discovered that the process of transforming environmental sounds into syllable-structure expressions is mostly listener-independent, while that of transforming syllable-structure expressions into sound-imitation words is mostly listener-dependent and influenced by culture. This paper focuses on the former, listener-independent process and presents a three-stage architecture for the automatic transformation of environmental sounds into sound-imitation words: segmenting sound signals into syllables, identifying syllable structure as morae, and recognizing morae as phonemes.
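The mora-identification stage can be sketched as a grouping of a recognized phoneme sequence according to Japanese syllable structure (consonant-vowel pairs, lone vowels, the moraic nasal N, and the geminate marker Q). The phoneme inventory and inputs here are simplified illustrations, not the paper's actual recognizer output.

```python
VOWELS = set("aiueo")
SPECIAL = {"N", "Q"}  # moraic nasal and first half of a geminate

def to_morae(phonemes):
    # Group a phoneme list into morae following Japanese syllable structure.
    morae, i = [], 0
    while i < len(phonemes):
        p = phonemes[i]
        if p in SPECIAL:
            # N and Q each constitute a mora on their own.
            morae.append(p)
            i += 1
        elif p in VOWELS:
            # A lone vowel is a mora.
            morae.append(p)
            i += 1
        else:
            # A consonant attaches to the following vowel (CV mora).
            if i + 1 < len(phonemes) and phonemes[i + 1] in VOWELS:
                morae.append(p + phonemes[i + 1])
                i += 2
            else:
                # Stray consonant: keep it as-is rather than drop it.
                morae.append(p)
                i += 1
    return morae
```

For example, the sequence k-a-Q-t-a-N groups into the four morae ka, Q, ta, N.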
Decision Tree-Based Simultaneous Clustering of
Phonetic Contexts, Dimensions, and State Positions
for Acoustic Modeling
Heiga Zen, Keiichi Tokuda, Tadashi Kitamura;
Nagoya Institute of Technology, Japan
In this paper, a new decision tree-based clustering technique called the Phonetic, Dimensional and State Positional Decision Tree (PDS-DT) is proposed. In PDS-DT, phonetic contexts, dimensions, and state positions are grouped simultaneously during decision tree construction. PDS-DT provides a complex distribution-sharing structure without any external control parameters. In speaker-independent continuous speech recognition experiments, PDS-DT achieved about 13%–15% error reduction over the phonetic decision tree-based state-tying technique.
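The core operation in decision tree-based state tying is choosing, at each node, the yes/no question whose split maximizes the gain in Gaussian log likelihood. The sketch below shows one such split over toy context-dependent states; the state names, questions, and statistics are invented, and PDS-DT would additionally ask questions about feature dimensions and state positions.

```python
import numpy as np

def loglike(groups):
    # Log likelihood of pooled frames under one diagonal Gaussian (ML fit).
    frames = np.vstack(groups)
    var = frames.var(axis=0) + 1e-3
    n, d = frames.shape
    return -0.5 * n * (d * np.log(2 * np.pi) + np.log(var).sum() + d)

rng = np.random.default_rng(1)
# Toy frames for triphone states "left-center+right"; nasal-left
# contexts are given a shifted mean so one question clearly wins.
states = {
    "n-a+t": rng.normal(3.0, 1.0, size=(40, 3)),
    "m-a+t": rng.normal(3.2, 1.0, size=(40, 3)),
    "k-a+t": rng.normal(0.0, 1.0, size=(40, 3)),
    "g-a+s": rng.normal(0.2, 1.0, size=(40, 3)),
}
questions = {
    "L=Nasal":     lambda name: name.split("-")[0] in {"n", "m"},
    "R=Fricative": lambda name: name.split("+")[1] in {"s", "z"},
}

def best_question(states, questions):
    # Pick the question with the largest log-likelihood gain.
    base = loglike(list(states.values()))
    best, best_gain = None, float("-inf")
    for qname, q in questions.items():
        yes = [v for s, v in states.items() if q(s)]
        no = [v for s, v in states.items() if not q(s)]
        if not yes or not no:
            continue
        gain = loglike(yes) + loglike(no) - base
        if gain > best_gain:
            best, best_gain = qname, gain
    return best, best_gain

best, gain = best_question(states, questions)
```

Repeating this greedily until the gain falls below a threshold yields the tied-state tree; PDS-DT grows one such tree over contexts, dimensions, and state positions jointly.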
A Statistical Method of Evaluating Pronunciation
Proficiency for English Words Spoken by Japanese
Seiichi Nakagawa, Kazumasa Mori, Naoki Nakamura;
Toyohashi University of Technology, Japan
In this paper, we propose a statistical method of evaluating the pronunciation proficiency of English words spoken by Japanese speakers. We statistically analyze the utterances to find a combination of acoustic features that correlates highly with an English teacher's score. We found that the likelihood ratio of native English phoneme acoustic models to phoneme acoustic models adapted to Japanese speakers was the best measure of pronunciation proficiency. The combination of the likelihood for American native models, the likelihood for English models adapted to Japanese speakers, the best likelihood for arbitrary sequences of acoustic models, the phoneme recognition rate, and the rate of speech is highly related to the English teacher's score. We obtained correlation coefficients of 0.81 with vocabulary-open data and 0.69 with speaker-open data at the five-word-set level. These coefficients are higher than the correlation between human raters' scores, 0.65.
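The best-performing measure above, a likelihood ratio of native models to adapted models, can be sketched as a simple length-normalized difference of log likelihoods. The log-likelihood values below are invented placeholders for real acoustic-model scores.

```python
def proficiency_score(ll_native, ll_adapted, n_frames):
    # Per-frame log likelihood ratio of native English phoneme models to
    # English models adapted to Japanese speakers. Higher scores mean the
    # native models fit the utterance better, i.e. more native-like speech.
    return (ll_native - ll_adapted) / n_frames

# Illustrative utterances of 300 frames each (invented numbers).
score_native_like = proficiency_score(-4200.0, -4350.0, 300)
score_accented = proficiency_score(-4600.0, -4380.0, 300)
```

In the paper, scores of this kind are combined with recognition rate and speaking rate before being correlated with the teacher's ratings.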
Author Index
A
Aalburg, Stefanie . . . . . . . . . . . . . . . . . . . . . . . . . . 52
Abad, Alberto. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
Abdou, Sherif . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
Abe, Masanobu . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
Abe, Masanobu . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
Abrash, Victor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
Abt, Michael . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
Abu-Amer, Tarek . . . . . . . . . . . . . . . . . . . . . . . . . . 90
Abutalebi, H.R. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
Acero, Alex . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
Acero, Alex . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
Acero, Alex . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
Acero, Alex . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
Acero, Alex . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
Acero, Alex . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
Acero, Alex . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
Acero, Alex . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
Adami, André G. . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
Adami, André G. . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
Adams, Jeff . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
Adda-Decker, Martine . . . . . . . . . . . . . . . . . . . . . . 8
Adda-Decker, Martine . . . . . . . . . . . . . . . . . . . . . 10
Adelhardt, Johann . . . . . . . . . . . . . . . . . . . . . . . . . 26
Afify, Mohamed . . . . . . . . . . . . . . . . . . . . . . . . . . 107
Ahadi, S.M. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
Ahkuputra, Visarut . . . . . . . . . . . . . . . . . . . . . . . . 65
Ahn, Dong-Hoon . . . . . . . . . . . . . . . . . . . . . . . . . . 39
Ahn, Sungjoo . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
Aikawa, Kiyoaki . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
Aikawa, Kiyoaki . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
Airey, S.S. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
Akagi, Masato . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
Akagi, Masato . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
Akahane-Yamada, Reiko . . . . . . . . . . . . . . . . . 111
Akbacak, Murat . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
Akiba, Tomoyosi . . . . . . . . . . . . . . . . . . . . . . . . . . 40
Akiba, Tomoyosi . . . . . . . . . . . . . . . . . . . . . . . . . . 51
Akita, Yuya . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
Al Bawab, Ziad . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
Albesano, Dario . . . . . . . . . . . . . . . . . . . . . . . . . . 107
Alecksandrovich, Oleg . . . . . . . . . . . . . . . . . . . . 69
Alexander, Anil . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
Alías, Francesc. . . . . . . . . . . . . . . . . . . . . . . . . . . . .47
Alías, Francesc . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
Alku, Paavo . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
Allen, James . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
Allu, Gopi Krishna . . . . . . . . . . . . . . . . . . . . . . . . . . 1
Al-Naimi, Khaldoon . . . . . . . . . . . . . . . . . . . . . . . 50
Alonso-Romero, L. . . . . . . . . . . . . . . . . . . . . . . . . . 93
Alouane, M. Turki-Hadj . . . . . . . . . . . . . . . . . . . 49
Alshawi, Hiyan . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
Alsteris, Leigh . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
Altun, Yasemin . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
Álvarez, A. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
Alwan, Abeer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
Alwan, Abeer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
Amaral, Rui . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
Amir, Noam . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
Andersen, Ove . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
Anderson, A.H. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
Anderson, David V. . . . . . . . . . . . . . . . . . . . . . . . . 38
Anderson, David V. . . . . . . . . . . . . . . . . . . . . . . . . 76
Andorno, M. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
Andrassy, Bernt . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
Andrassy, Bernt . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
Angkititrakul, Pongtep . . . . . . . . . . . . . . . . . . . . 47
Antoine, Jean-Yves . . . . . . . . . . . . . . . . . . . . . . . . 98
Arai, Takayuki . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
Araki, Masahiro. . . . . . . . . . . . . . . . . . . . . . . . . . . .67
Arcienega, Mijail . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
Arehart, Kathryn . . . . . . . . . . . . . . . . . . . . . . . . . . 50
Arifianto, Dhany . . . . . . . . . . . . . . . . . . . . . . . . . 102
Ariki, Yasuo . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
Ariki, Yasuo . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
Ariki, Yasuo . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
Ariki, Yasuo . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
Ariyaeeinia, Aladdin M. . . . . . . . . . . . . . . . . . . . 43
Ariyaeeinia, Aladdin M. . . . . . . . . . . . . . . . . . . . 94
Armani, Luca . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
Arranz, Victoria . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
Arroabarren, Ixone . . . . . . . . . . . . . . . . . . . . . . . . . 3
Arroabarren, Ixone . . . . . . . . . . . . . . . . . . . . . . . . 62
Arrowood, Jon A. . . . . . . . . . . . . . . . . . . . . . . . . 109
Arslan, Levent M. . . . . . . . . . . . . . . . . . . . . . . . . . . 74
Arslan, Levent M. . . . . . . . . . . . . . . . . . . . . . . . . 101
Asano, Futoshi . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
Asano, Futoshi . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
Asano, Hisako . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
Asano, Hisako . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
Ashley, J.P. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
Asoh, Hideki . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
Astrov, Sergey . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
Atal, Bishnu S. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
Atlas, Les . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
Attwater, David . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
Au, Ching-Pong . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
Au, Wing-Hei. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .32
Aubergé, Véronique . . . . . . . . . . . . . . . . . . . . . . . . 7
Audibert, Nicolas . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
Axelrod, Scott . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
Axelrod, Scott . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
Axelrod, Scott . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
Axelrod, Scott . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
Aylett, Matthew. . . . . . . . . . . . . . . . . . . . . . . . . . . .12
B
Baca, Julie A. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
Bach, Nguyen Hung . . . . . . . . . . . . . . . . . . . . . . . . . 7
Bachenko, Joan . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
Backfried, Gerhard . . . . . . . . . . . . . . . . . . . . . . . . 55
Bäckström, Tom . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
Badran, Ahmed . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
Bailly, Gerard . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
Baker, Kirk . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
Bakis, R. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
Bakx, Ilse. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .79
Balakrishnan, N. . . . . . . . . . . . . . . . . . . . . . . . . . . 108
Balakrishnan, Sreeram V. . . . . . . . . . . . . . . . . . 53
Baltazani, Mary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
Banerjee, Satanjeev . . . . . . . . . . . . . . . . . . . . . . 111
Banerjee, Satanjeev . . . . . . . . . . . . . . . . . . . . . . 112
Banga, Eduardo R. . . . . . . . . . . . . . . . . . . . . . . . . . 11
Bänziger, Tanja . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
Bänziger, Tanja . . . . . . . . . . . . . . . . . . . . . . . . . . 106
Bard, E.G. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
Barrachina, Sergio . . . . . . . . . . . . . . . . . . . . . . . . . . 9
Barreaud, Vincent . . . . . . . . . . . . . . . . . . . . . . . . . 53
Baskind, Alexis . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
Batliner, Anton . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
Bauer, Josef . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
Bauerecker, Hermann . . . . . . . . . . . . . . . . . . . . . 31
Baus, Jörg . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
Bazzi, Issam. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3
Beaufays, Françoise . . . . . . . . . . . . . . . . . . . . . . . 92
Beaugeant, Christophe . . . . . . . . . . . . . . . . . . . . 58
Beaumont, Jean-François . . . . . . . . . . . . . . . . . . 43
Beaumont, Jean-François . . . . . . . . . . . . . . . . . . 43
Béchet, Frédéric . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
Béchet, Frédéric . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
Beck, Joseph E. . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
Beck, Joseph E. . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
Beddoes, Michael . . . . . . . . . . . . . . . . . . . . . . . . . . 86
Belfield, William . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
Bell, Linda . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
Bellegarda, Jerome R. . . . . . . . . . . . . . . . . . . . . . 71
Bellot, Olivier . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
Benítez, Carmen . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
Benítez, Carmen . . . . . . . . . . . . . . . . . . . . . . . . . 107
Benlahouar, A. . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
Bennett, Christina L. . . . . . . . . . . . . . . . . . . . . . . 12
Bennett, Christina L. . . . . . . . . . . . . . . . . . . . . . 104
BenZeghiba, Mohamed Faouzi . . . . . . . . . . . . 48
Berdahl, Edgar . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
Beringer, N. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
Bernard, Alexis . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
Bernsen, Niels Ole . . . . . . . . . . . . . . . . . . . . . . . . . 26
Berthommier, Frédéric . . . . . . . . . . . . . . . . . . . . 37
Besacier, Laurent . . . . . . . . . . . . . . . . . . . . . . . . . . 56
Besacier, Laurent . . . . . . . . . . . . . . . . . . . . . . . . . 110
Beskow, Jonas . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
Bettens, F. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
Beutler, René . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
Beutler, René . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
Bigi, Brigitte . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
Bijankhan, Mahmood . . . . . . . . . . . . . . . . . . . . . . 54
Bilmes, Jeff . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
Bimbot, Frédéric . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
Binnenpoorte, Diana . . . . . . . . . . . . . . . . . . . . . . 54
Bisani, M. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
Black, Alan W. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
Black, Alan W. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
Black, Alan W. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
Black, Alan W. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
Black, Alan W. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
Black, Alan W. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
Black, Alan W. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
Black, Alan W. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
Black, Lois . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
Bloom, Jonathan . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
Boë, Louis-Jean. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .2
Boëffard, Olivier . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
Boëffard, Olivier . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
Bohus, Dan . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
Bonafonte, Antonio . . . . . . . . . . . . . . . . . . . . . . . 30
Bonafonte, Antonio . . . . . . . . . . . . . . . . . . . . . . . 56
Bonafonte, Antonio . . . . . . . . . . . . . . . . . . . . . . . 81
Bonastre, Jean-François . . . . . . . . . . . . . . . . . . . . 2
Bonastre, Jean-François . . . . . . . . . . . . . . . . . . . 57
Bonastre, Jean-François . . . . . . . . . . . . . . . . . . . 71
Bonneau-Maynard, Hélène . . . . . . . . . . . . . . . . . . 8
Bonneau-Maynard, Hélène . . . . . . . . . . . . . . . . 10
Borys, S. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
Boštík, Milan . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
Boulianne, Gilles. . . . . . . . . . . . . . . . . . . . . . . . . . .43
Boulianne, Gilles. . . . . . . . . . . . . . . . . . . . . . . . . . .43
Boulianne, Gilles. . . . . . . . . . . . . . . . . . . . . . . . . . .94
Bourgeois, Julien . . . . . . . . . . . . . . . . . . . . . . . . . . 61
Bourlard, Hervé . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
Bourlard, Hervé . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
Bourlard, Hervé . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
Bouzid, Aïcha . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
Boves, Lou . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
Boye, Johan . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
Bozkurt, Baris . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
Bratt, Harry . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
Bratt, Harry . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
Braun, Bettina . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
Braunschweiler, Norbert . . . . . . . . . . . . . . . . . . 46
Breen, Andrew P. . . . . . . . . . . . . . . . . . . . . . . . . . . 10
Breen, Andrew P. . . . . . . . . . . . . . . . . . . . . . . . . . . 86
Brennan, R.L. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
Brennan, R.L. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
Brito, Iván . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
Broeders, A.P.A. . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
Brousseau, Julie . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
Brown, Guy J. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
Brungart, Douglas S. . . . . . . . . . . . . . . . . . . . . . . 37
Burger, Susanne . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
Burnett, Ian . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
Burns, John . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
Byrne, William J. . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
Byrne, William J. . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
C
Caldés, Roser Jaquemot . . . . . . . . . . . . . . . . . . . 55
Campbell, Joseph P. . . . . . . . . . . . . . . . . . . . . . . . . 2
Campbell, Joseph P. . . . . . . . . . . . . . . . . . . . . . . . 94
Campbell, Nick . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
Campbell, Nick . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
Campbell, Nick . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
Campbell, Nick . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
Campbell, W.M. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
Campillo Díaz, Francisco. . . . . . . . . . . . . . . . . .11
Cao, Zhigang . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
Capeletto, Matias L. . . . . . . . . . . . . . . . . . . . . . . 102
Cardeñoso, Valentín . . . . . . . . . . . . . . . . . . . . . . . 81
Cardinal, Patrick . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
Cardinal, Patrick . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
Carey, Michael J. . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
Carey, Michael J. . . . . . . . . . . . . . . . . . . . . . . . . . 107
Carlosena, Alfonso . . . . . . . . . . . . . . . . . . . . . . . . . 3
Carlosena, Alfonso . . . . . . . . . . . . . . . . . . . . . . . . 62
Carmichael, James . . . . . . . . . . . . . . . . . . . . . . . . 41
Carmichael, James . . . . . . . . . . . . . . . . . . . . . . . . 78
Carriço, Luís . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
Carson-Berndsen, Julie . . . . . . . . . . . . . . . . . . . . 90
Caseiro, Diamantino . . . . . . . . . . . . . . . . . . . . . . 56
Cassaca, Renato . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
Castell, Núria . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
Castelli, Eric . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
Castro, María José . . . . . . . . . . . . . . . . . . . . . . . . . 23
Cathiard, Marie-Agnès . . . . . . . . . . . . . . . . . . . . . . 6
Cattoni, Roldano . . . . . . . . . . . . . . . . . . . . . . . . . . 56
Cawley, Gavin. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .23
Cerisara, Christophe . . . . . . . . . . . . . . . . . . . . . 108
Cerisara, Christophe . . . . . . . . . . . . . . . . . . . . . 108
Černocký, Jan . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
Černocký, Jan . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
Černocký, Jan . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
Cesari, Federico . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
Çetin, Özgür . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
Cettolo, Mauro . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
Chambel, Teresa . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
Chan, C.F. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
Chan, Kin-Wah . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
Chan, Kwokleung . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
Chan, Shuk Fong . . . . . . . . . . . . . . . . . . . . . . . . . . 59
Chang, Eric . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
Chang, Eric . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
Chang, Eric . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
Chang, Eric . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
Chang, Joon-Hyuk . . . . . . . . . . . . . . . . . . . . . . . . . 37
Chang, Pi-Chuan . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
Chang, Sen-Chia . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
Chang, Shuangyu . . . . . . . . . . . . . . . . . . . . . . . . . . 30
Chang, Wen-Whei . . . . . . . . . . . . . . . . . . . . . . . . . . 48
Chapdelaine, Claude . . . . . . . . . . . . . . . . . . . . . . 43
Chapman, James . . . . . . . . . . . . . . . . . . . . . . . . . . 78
Charbit, Maurice . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
Charlet, Delphine . . . . . . . . . . . . . . . . . . . . . . . . . . 93
Charnvivit, Patavee . . . . . . . . . . . . . . . . . . . . . . . . . 6
Charoenpornsawat, Paisarn . . . . . . . . . . . . . . . 12
Chateau, N. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
Chatzichrisafis, N. . . . . . . . . . . . . . . . . . . . . . . . . . 55
Chaudhari, Upendra . . . . . . . . . . . . . . . . . . . . . . . 70
Chaudhari, Upendra . . . . . . . . . . . . . . . . . . . . . . . 91
Chelba, Ciprian . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
Chen, Aoju . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
Chen, Barry . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
Chen, Barry . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
Chen, Boxing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
Chen, Fang . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
Chen, Gao Peng . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
Chen, Hsin-Hsi. . . . . . . . . . . . . . . . . . . . . . . . . . . . .98
Chen, Jau-Hung . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
Chen, Jia-fu . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
Chen, K. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
Chen, Shun-Chuan . . . . . . . . . . . . . . . . . . . . . . . . . . 2
Chen, Shun-Chuan. . . . . . . . . . . . . . . . . . . . . . . . .82
Chen, Shun-Chuan . . . . . . . . . . . . . . . . . . . . . . . 100
Chen, Shun-Chuan . . . . . . . . . . . . . . . . . . . . . . . 100
Chen, Sin-Horng . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
Chen, Sin-Horng . . . . . . . . . . . . . . . . . . . . . . . . . . 109
Chen, Stanley F. . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
Chen, Stanley F. . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
Chen, Stanley F. . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
Chen, Tsuhan . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
Chen, Y. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
Chen, Yining . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
Chen, Yiya . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
Chen, Zhenbiao . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
Cheng, Shi-sian . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
Cheng, Yan Ming . . . . . . . . . . . . . . . . . . . . . . . . . 110
Cheung, Ming-Cheung . . . . . . . . . . . . . . . . . . . 105
Chiang, Yuan-Chuan . . . . . . . . . . . . . . . . . . . . . . 48
Chiang, Yuang-Chin . . . . . . . . . . . . . . . . . . . . . . . 42
Chiang, Yuang-Chin . . . . . . . . . . . . . . . . . . . . . . . 66
Chien, Jen-Tzung . . . . . . . . . . . . . . . . . . . . . . . . . . 95
Ching, P.C. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
Choi, Chi-Ho . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
Choi, Frederick . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
Choi, Jin-Kyu . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
Chollet, Gérard . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
Chonavel, Thierry . . . . . . . . . . . . . . . . . . . . . . . . . 61
Chou, Wu . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
Chou, Wu . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
Chou, Wu . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
Choukri, Khalid. . . . . . . . . . . . . . . . . . . . . . . . . . . .54
Choy, G. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
Choy, Thomas . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
Chu, Min . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
Chu, Min . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
Chu, Wai C. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
Chung, Grace. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .8
Chung, Grace . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
Chung, Hyun-Yeol . . . . . . . . . . . . . . . . . . . . . . . . . 51
Chung, Hyun-Yeol . . . . . . . . . . . . . . . . . . . . . . . . . 88
Chung, Jaeho . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
Chung, Minhwa . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
Chung, Minhwa . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
Church, Kenneth Ward . . . . . . . . . . . . . . . . . . . . . 1
Cieri, Christopher . . . . . . . . . . . . . . . . . . . . . . . . . 56
Cieri, Christopher . . . . . . . . . . . . . . . . . . . . . . . . . 56
Çilingir, Onur . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
Ciloglu, Tolga . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
Class, Fritz . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
Clements, Mark A. . . . . . . . . . . . . . . . . . . . . . . . 103
Clements, Mark A. . . . . . . . . . . . . . . . . . . . . . . . 109
Cohen, Arnon . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
Cohen, Gilead . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
Cohen, Rachel . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
Cole, J. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
Cole, Ronald A. . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
Comeau, Michel . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
Comeau, Michel . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
Conejero, David . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
Cook, Norman D. . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
Corazza, Anna . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
Córdoba, R. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
Córdoba, R. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
Cornu, E. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
Corr, Pat . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
Cortes, Corinna . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
Cosi, Piero . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
Couvreur, Christophe . . . . . . . . . . . . . . . . . . . . . 63
Cox, Stephen J. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
Cox, Stephen J. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
Cox, Stephen J. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
Cox, Stephen J. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
Cranen, Bert . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
Creutz, Mathias. . . . . . . . . . . . . . . . . . . . . . . . . . . .41
Creutz, Mathias. . . . . . . . . . . . . . . . . . . . . . . . . . . .81
Cruz-Zeno, E.M. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
Cui, Xiaodong. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .77
Cummins, Fred . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
Cunningham, Stuart . . . . . . . . . . . . . . . . . . . . . . . 78
Cutugno, Francesco . . . . . . . . . . . . . . . . . . . . . . 103
Czigler, Peter E. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
D
Dahan, Jean-Gui . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
d’Alessandro, Christophe . . . . . . . . . . . . . . . . . . 5
d’Alessandro, Christophe . . . . . . . . . . . . . . . . . 58
d’Alessandro, Christophe . . . . . . . . . . . . . . . . . 84
Dang, Jianwu . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
Dang, Jianwu . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
Daoudi, Khalid . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
Daubias, Philippe . . . . . . . . . . . . . . . . . . . . . . . . . . 55
Dayanidhi, Krishna . . . . . . . . . . . . . . . . . . . . . . . . 79
de Cheveigné, Alain . . . . . . . . . . . . . . . . . . . . . . . 29
de Gelder, Beatrice . . . . . . . . . . . . . . . . . . . . . . . . . . 2
de Jong, Franciska . . . . . . . . . . . . . . . . . . . . . . . . . . 9
de la Torre, Ángel . . . . . . . . . . . . . . . . . . . . . . . . . 13
de la Torre, Ángel . . . . . . . . . . . . . . . . . . . . . . . . 107
Deléglise, Paul . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
Deller Jr., J.R. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
Delmonte, Rodolfo . . . . . . . . . . . . . . . . . . . . . . . . 70
Demirekler, Mübeccel . . . . . . . . . . . . . . . . . . . . . 41
Demirekler, Mübeccel . . . . . . . . . . . . . . . . . . . . . 55
Demirekler, Mübeccel . . . . . . . . . . . . . . . . . . . . . 85
Demiroglu, Cenk . . . . . . . . . . . . . . . . . . . . . . . . . . 76
De Mori, Renato . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
De Mori, Renato . . . . . . . . . . . . . . . . . . . . . . . . . . 107
Demuynck, Kris . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
Demuynck, Kris . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
Demuynck, Kris . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
Denda, Yuki . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
Denecke, Matthias . . . . . . . . . . . . . . . . . . . . . . . . . 79
Deng, Huiqun . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
Deng, Li . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
Deng, Li . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
Deng, Yonggang . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
Devillers, Laurence . . . . . . . . . . . . . . . . . . . . . . . . . 7
Devillers, Laurence . . . . . . . . . . . . . . . . . . . . . . . . . 8
de Villiers, Jacques . . . . . . . . . . . . . . . . . . . . . . . . 58
Deviren, Murat . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
De Wachter, Mathias . . . . . . . . . . . . . . . . . . . . . . 40
de Wet, Febe . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
Dewhirst, Oliver . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
Dharanipragada, Satya . . . . . . . . . . . . . . . . . . . . 64
Dharanipragada, Satya . . . . . . . . . . . . . . . . . . . . 89
D’Haro, L.F. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
Di, Fengying . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
Diakoloukas, V. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
Diao, Qian . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
Digalakis, Vassilios . . . . . . . . . . . . . . . . . . . . . . . . 55
Digalakis, Vassilios . . . . . . . . . . . . . . . . . . . . . . . . 81
Dimitriadis, Dimitrios . . . . . . . . . . . . . . . . . . . 101
Ding, Pei . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
Ding, Peng . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
Ding, Peng . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
Dobrišek, Simon . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
Docio-Fernandez, Laura . . . . . . . . . . . . . . . . . . . 75
Dognin, Pierre L. . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
Dohen, Marion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
Dohsaka, Kohji . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
Dong, Minghui . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
Doss, Mathew Magimai . . . . . . . . . . . . . . . . . . . . 21
Doumpiotis, Vlasios . . . . . . . . . . . . . . . . . . . . . . . 70
Doval, Boris . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
Draxler, Chr. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
Droppo, Jasha . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
Droppo, Jasha . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
Drygajlo, Andrzej . . . . . . . . . . . . . . . . . . . . . . . . . 25
Drygajlo, Andrzej . . . . . . . . . . . . . . . . . . . . . . . . . 37
Drygajlo, Andrzej . . . . . . . . . . . . . . . . . . . . . . . . . 94
Du, Limin . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
Du, Limin . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
Du, Limin . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
Du, Limin . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
Duchateau, Jacques . . . . . . . . . . . . . . . . . . . . . . . 13
Duchateau, Jacques . . . . . . . . . . . . . . . . . . . . . . . 95
Dufour, Sophie . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
du Jeu, Charles . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
Dumouchel, Pierre . . . . . . . . . . . . . . . . . . . . . . . 43
Dumouchel, Pierre . . . . . . . . . . . . . . . . . . . . . . . 94
Dumouchel, Pierre . . . . . . . . . . . . . . . . . . . . . . . 105
Dunn, Robert B. . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
Dupont, Stéphane . . . . . . . . . . . . . . . . . . . . . . . . . 63
Duraiswami, Ramani . . . . . . . . . . . . . . . . . . . . . . 3
Durston, Peter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
Dusan, Sorin . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
Dutoit, Thierry . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
Duxans, Helenca . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
E
Edmondson, William . . . . . . . . . . . . . . . . . . . . . . 90
Eggleton, Barry . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
Eggleton, Barry . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
Ehrette, T. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
Eickeler, Stefan . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
Eide, E. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
Ekanadham, Chaitanya J.K. . . . . . . . . . . . . . . . 22
El-Jaroudi, Amro . . . . . . . . . . . . . . . . . . . . . . . . . . 66
Ellis, Daniel P.W. . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
Ellouze, Noureddine . . . . . . . . . . . . . . . . . . . . . 101
Emami, Ahmad . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
Emele, Martin . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
Emele, Martin . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
Emonts, Michael . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
Enderby, Pam . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
Endo, Toshiki . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
Eneman, Koen . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
Engwall, Olov . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
En-Najjary, Taoufik . . . . . . . . . . . . . . . . . . . . . . . . 61
Eriksson, Erik . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
Escudero, David . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
Eskenazi, Maxine . . . . . . . . . . . . . . . . . . . . . . . . . . 27
Espy-Wilson, Carol . . . . . . . . . . . . . . . . . . . . . . . . 85
Estève, Yannick . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
Estienne, Claudio F. . . . . . . . . . . . . . . . . . . . . . . . 80
Evans, Nicholas W.D. . . . . . . . . . . . . . . . . . . . . . 102
F
Fabian, Tibor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
Fackrell, Justin . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
Fackrell, Justin . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
Fackrell, Justin . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
Fagel, Sascha . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
Fakotakis, Nikos . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
Fakotakis, Nikos . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
Fakotakis, Nikos . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
Fakotakis, Nikos . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
Fakotakis, Nikos . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
Fakotakis, Nikos . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
Falavigna, Daniele . . . . . . . . . . . . . . . . . . . . . . . . . 61
Fang, Xiaoshan . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
Farrell, Mark . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
Faulkner, Andrew . . . . . . . . . . . . . . . . . . . . . . . . . 45
Federico, Marcello . . . . . . . . . . . . . . . . . . . . . . . . . 14
Fedorenko, Evelina . . . . . . . . . . . . . . . . . . . . . . . . 56
Fegyó, Tibor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
Fegyó, Tibor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
Ferreira, L. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
Ferreiros, J. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
Ferreiros, J. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
Ferrer, Luciana . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
Filisko, Edward . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
Fingscheidt, Tim. . . . . . . . . . . . . . . . . . . . . . . . . . . 58
Fink, Gernot A. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
Fischer, V. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
Fishler, Eran . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
Fissore, L. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
Flanagan, James . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
Flecha-Garcia, M.L. . . . . . . . . . . . . . . . . . . . . . . . . . . 8
Fohr, Dominique . . . . . . . . . . . . . . . . . . . . . . . . . . 52
Fohr, Dominique . . . . . . . . . . . . . . . . . . . . . . . . . . 53
Fonollosa, José A.R. . . . . . . . . . . . . . . . . . . . . . . . 88
Fortuna, J. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
Fousek, Petr . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
Franco, Horacio . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
François, Hélène . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
Frangi, Alejandro F. . . . . . . . . . . . . . . . . . . . . . . . 80
Frank, Carmen . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
Fränti, Pasi . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
Franz, Martin . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
Frederking, Robert . . . . . . . . . . . . . . . . . . . . . . . . 14
Freeman, G.H. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
Freitas, Diamantino . . . . . . . . . . . . . . . . . . . . . . . . . 7
Freitas, Diamantino . . . . . . . . . . . . . . . . . . . . . . . 15
Freitas, Diamantino . . . . . . . . . . . . . . . . . . . . . . . 82
Fu, Guokang . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
Fu, Qiang . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
Fujii, Atsushi . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
Fujii, Atsushi . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
Fujii, Atsushi . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
Fujimoto, Ichiro . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
Fujimoto, Masakiyo . . . . . . . . . . . . . . . . . . . . . . . 51
Fujimoto, Masakiyo . . . . . . . . . . . . . . . . . . . . . . . 62
Fujimoto, Masakiyo . . . . . . . . . . . . . . . . . . . . . . . 63
Fujinaga, Katsuhisa . . . . . . . . . . . . . . . . . . . . . . . 96
Fujisaki, Hiroya . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
Fujisaki, Hiroya . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
Fujisaki, Hiroya . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
Fujisaki, Hiroya . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
Fujisaki, Hiroya . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
Fujisawa, Takeshi . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
Fukuda, Takashi . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
Fukuda, Takashi . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
Fukuda, Takashi . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
Fukudome, Kimitoshi . . . . . . . . . . . . . . . . . . . . . 74
Fung, Pascale . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
Fung, Pascale . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
Fung, Tien-Ying . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
Furui, Sadaoki . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
Furui, Sadaoki . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
Furui, Sadaoki . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
Furui, Sadaoki . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
Furui, Sadaoki . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
Furui, Sadaoki . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
Furui, Sadaoki . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
Furuyama, Yusuke . . . . . . . . . . . . . . . . . . . . . . . . . . 6
Fusaro, Andrea . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
G
Gadbois, Gregory J. . . . . . . . . . . . . . . . . . . . . . . . . 79
Gadde, Venkata R.R. . . . . . . . . . . . . . . . . . . . . . . . 71
Gales, M.J.F. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
Gales, M.J.F. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
Galescu, Lucian . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
Gaminde, I. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
Ganchev, Todor . . . . . . . . . . . . . . . . . . . . . . . . . . 59
Gao, Hualin . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
Gao, Jianfeng . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
Gao, Sheng . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
Gao, W. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
Gao, Yuqing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
Gao, Yuqing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
Gao, Yuqing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
Garcia-Gomar, M. . . . . . . . . . . . . . . . . . . . . . . . . . . 25
Garcia-Romero, D. . . . . . . . . . . . . . . . . . . . . . . . . . 25
Garg, Ashutosh . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
Gates, Donna . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
Gauvain, Jean-Luc . . . . . . . . . . . . . . . . . . . . . . . . . 66
Gedge, Oren . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
Gelbart, David . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
Gemello, Roberto . . . . . . . . . . . . . . . . . . . . . . . . 107
Gendrin, Frédéric . . . . . . . . . . . . . . . . . . . . . . . . 111
Georgila, K. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
Gfroerer, Stefan . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
Gharavian, D. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
Ghasedi, Mohammad E. . . . . . . . . . . . . . . . . . . . 54
Ghasemi, Seyyed Z. . . . . . . . . . . . . . . . . . . . . . . . . 54
Ghulam, Muhammad . . . . . . . . . . . . . . . . . . . . . . 77
Gibbon, Dafydd . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
Gibbon, Dafydd . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
Gibbon, Dafydd . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
Gibson, Edward . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
Gieselmann, Petra . . . . . . . . . . . . . . . . . . . . . . . . . 79
Gillett, Ben . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
Gillett, Ben . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
Gilloire, André . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
Giménez, Jesús . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
Girão, J. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
Girin, Laurent . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
Gish, Herbert . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
Glass, James . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
Gleason, T.P. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
Gnaba, H. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
Goel, Vaibhava . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
Goel, Vaibhava . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
Goel, Vaibhava . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
Gomes, D. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
Gómez, Angel M. . . . . . . . . . . . . . . . . . . . . . . . . . . 38
Gómez, Angel M. . . . . . . . . . . . . . . . . . . . . . . . . . . 97
Gómez, P. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
Gonzalez-Rodriguez, J. . . . . . . . . . . . . . . . . . . . . 25
Goodman, Bryan R. . . . . . . . . . . . . . . . . . . . . . . . . 79
Gopinath, Ramesh . . . . . . . . . . . . . . . . . . . . . . . . . 57
Gopinath, Ramesh . . . . . . . . . . . . . . . . . . . . . . . . . 92
Gopinath, Ramesh . . . . . . . . . . . . . . . . . . . . . . . . . 92
Gori, Marco . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
Gorin, Allen . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
Goronzy, Silke . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
Goronzy, Silke . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
Gorrell, Genevieve . . . . . . . . . . . . . . . . . . . . . . . . . 97
Goto, Masataka . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
Goto, Masataka . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
Goulian, Jérôme . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
Gouvêa, Evandro . . . . . . . . . . . . . . . . . . . . . . . . . . 41
Grandjean, D. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
Granström, Björn . . . . . . . . . . . . . . . . . . . . . . . . 103
Grant, Ken W. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
Grashey, Stephan . . . . . . . . . . . . . . . . . . . . . . . . . . 61
Green, James . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
Green, Phil . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
Green, Phil . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
Green, Phil . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
Greenberg, Steven . . . . . . . . . . . . . . . . . . . . . . . . . . 2
Greenberg, Steven . . . . . . . . . . . . . . . . . . . . . . . . . 90
Greenberg, Steven . . . . . . . . . . . . . . . . . . . . . . . . . 90
Grenez, F. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
Grézl, František . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
Grieco, John J. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
Gu, Liang . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
Gu, Liang . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
Gu, Wentao . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
Gu, Zhenglai . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
Guan, Cuntai . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
Guan, Qi . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
Guimarães, Nuno . . . . . . . . . . . . . . . . . . . . . . . . . . 56
Guitarte Pérez, Jesús F. . . . . . . . . . . . . . . . . . . . 80
Gül, Yilmaz . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
Gunawardana, Asela . . . . . . . . . . . . . . . . . . . . . . 57
Guo, Changchen . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
Guo, Rui . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
Gurijala, A.R. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
Gustafson, Joakim . . . . . . . . . . . . . . . . . . . . . . . . . 22
Gustman, Samuel . . . . . . . . . . . . . . . . . . . . . . . . . . 64
Gut, Ulrike . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
H
Hacioglu, Kadri . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
Hacker, Christian . . . . . . . . . . . . . . . . . . . . . . . . . . 34
Hacker, Christian . . . . . . . . . . . . . . . . . . . . . . . . . . 46
Haffner, Patrick . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
Hajdinjak, Melita . . . . . . . . . . . . . . . . . . . . . . . . . . 68
Hajič, Jan . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
Hajič, Jan . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
Hakkani-Tür, Dilek Z. . . . . . . . . . . . . . . . . . . . . . 23
Hakkani-Tür, Dilek Z. . . . . . . . . . . . . . . . . . . . . . 64
Hakkani-Tür, Dilek Z. . . . . . . . . . . . . . . . . . . . . . 99
Hakkani-Tür, Dilek Z. . . . . . . . . . . . . . . . . . . . . . 99
Häkkinen, Juha . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
Hakulinen, Jaakko . . . . . . . . . . . . . . . . . . . . . . . . . 27
Hakulinen, Jaakko . . . . . . . . . . . . . . . . . . . . . . . . . 67
Hamada, Nozomu . . . . . . . . . . . . . . . . . . . . . . . . . 60
Hammervold, Kathrine . . . . . . . . . . . . . . . . . . . . 87
Hamza, W. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
Han, Jiang. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .41
Han, Zhaobing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
Hanna, Philip . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
Hanna, Philip . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
Hansakunbuntheung, Chatchawarn . . . . . . . 4
Hansen, Jesse . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
Hansen, John H.L. . . . . . . . . . . . . . . . . . . . . . . . . . 26
Hansen, John H.L. . . . . . . . . . . . . . . . . . . . . . . . . . 45
Hansen, John H.L. . . . . . . . . . . . . . . . . . . . . . . . . . 45
Hansen, John H.L. . . . . . . . . . . . . . . . . . . . . . . . . . 47
Hansen, John H.L. . . . . . . . . . . . . . . . . . . . . . . . . . 50
Hansen, John H.L. . . . . . . . . . . . . . . . . . . . . . . . . . 64
Hansen, John H.L. . . . . . . . . . . . . . . . . . . . . . . . . . 77
Hao, Jiucang . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
Harding, Sue . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
Hardy, Hilda . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
Harris, David M. . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
Hartikainen, Elviira . . . . . . . . . . . . . . . . . . . . . . . . 54
Hasegawa-Johnson, Mark . . . . . . . . . . . . . . . . . 15
Hasegawa-Johnson, Mark . . . . . . . . . . . . . . . . . 18
Hasegawa-Johnson, Mark . . . . . . . . . . . . . . . . . 88
Hashimoto, Yoshikazu . . . . . . . . . . . . . . . . . . . . . 7
Hatano, Toshie . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
Haton, Jean-Paul . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
Haton, Jean-Paul . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
Hatzis, Athanassios . . . . . . . . . . . . . . . . . . . . . . . 41
Hatzis, Athanassios . . . . . . . . . . . . . . . . . . . . . . . 78
Hautamäki, Ville . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
Haverinen, Hemmo . . . . . . . . . . . . . . . . . . . . . . 108
Hawley, Mark . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
Hayakawa, S. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
Hazen, Timothy J. . . . . . . . . . . . . . . . . . . . . . . . . . 15
Hazen, Timothy J. . . . . . . . . . . . . . . . . . . . . . . . . . 23
Hazen, Timothy J. . . . . . . . . . . . . . . . . . . . . . . . . . 69
He, Wei . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
He, Xiaodong . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
He, Xiaodong . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
Hébert, Matthieu . . . . . . . . . . . . . . . . . . . . . . . . . . 58
Heck, Larry P. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
Hedelin, Per . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
Heeman, Peter A. . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
Hegde, Rajesh M. . . . . . . . . . . . . . . . . . . . . . . . . . 103
Heikkinen, Ari . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
Heikkinen, Ari . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
Heikkinen, Ari . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
Heisterkamp, Paul . . . . . . . . . . . . . . . . . . . . . . . 103
Helbig, Jörg . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
Hell, Benjamin . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
Hell, Benjamin . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
Heracleous, Panikos . . . . . . . . . . . . . . . . . . . . . . . 19
Heracleous, Panikos . . . . . . . . . . . . . . . . . . . . . . . 32
Hermann, D. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
Hermansky, Hynek . . . . . . . . . . . . . . . . . . . . . . . . 16
Hermansky, Hynek . . . . . . . . . . . . . . . . . . . . . . . . 30
Hermansky, Hynek . . . . . . . . . . . . . . . . . . . . . . . . 30
Hermansky, Hynek . . . . . . . . . . . . . . . . . . . . . . . . 36
Hermansky, Hynek . . . . . . . . . . . . . . . . . . . . . . . . 36
Hermansky, Hynek . . . . . . . . . . . . . . . . . . . . . . . . 94
Hernaez, I. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
Hernando, Javier . . . . . . . . . . . . . . . . . . . . . . . . . . 57
Hernando, Javier . . . . . . . . . . . . . . . . . . . . . . . . . . 82
Hetherington, Lee . . . . . . . . . . . . . . . . . . . . . . . . . 69
Higashinaka, Ryuichiro . . . . . . . . . . . . . . . . . . . 68
Hilario, Joan Marí . . . . . . . . . . . . . . . . . . . . . . . . 109
Hilger, Florian . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
Himanen, Sakari . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
Hioka, Yusuke . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
Hiraiwa, Akira . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
Hirose, Keikichi . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
Hirose, Keikichi . . . . . . . . . . . . . . . . . . . . . . . . . . 12
Hirose, Keikichi . . . . . . . . . . . . . . . . . . . . . . . . . . 14
Hirose, Keikichi . . . . . . . . . . . . . . . . . . . . . . . . . . 31
Hirose, Keikichi . . . . . . . . . . . . . . . . . . . . . . . . . . 59
Hirose, Keikichi . . . . . . . . . . . . . . . . . . . . . . . . . . 73
Hirose, Keikichi . . . . . . . . . . . . . . . . . . . . . . . . . . 87
Hirose, Keikichi . . . . . . . . . . . . . . . . . . . . . . . . . . 92
Hirose, Keikichi . . . . . . . . . . . . . . . . . . . . . . . . . . 106
Hirose, Keikichi . . . . . . . . . . . . . . . . . . . . . . . . . . 111
Hirose, Keikichi . . . . . . . . . . . . . . . . . . . . . . . . . . 111
Hirsbrunner, Béat . . . . . . . . . . . . . . . . . . . . . . . . . 16
Hirschberg, Julia . . . . . . . . . . . . . . . . . . . . . . . . . . 26
Hirschberg, Julia . . . . . . . . . . . . . . . . . . . . . . . . . . 41
Hirschfeld, Diane . . . . . . . . . . . . . . . . . . . . . . . . . . 54
Hirsimäki, Teemu . . . . . . . . . . . . . . . . . . . . . . . . . 81
Ho, Ching-Hsiang . . . . . . . . . . . . . . . . . . . . . . . . . . 85
Ho, Ching-Hsiang . . . . . . . . . . . . . . . . . . . . . . . . 104
Ho, Man-Cheuk . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
Ho, Purdy P. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
Ho, Simon . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
Ho, Yuan . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
Hodgson, Murray . . . . . . . . . . . . . . . . . . . . . . . . . . 86
Hodoshima, Nao . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
Hoege, Harald . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
Hoequist, Charles . . . . . . . . . . . . . . . . . . . . . . . . . 47
Hofmann, Thomas . . . . . . . . . . . . . . . . . . . . . . . . 35
Hogden, John . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
Höge, Harald . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
Hohmann, Volker . . . . . . . . . . . . . . . . . . . . . . . . . 50
Holada, Miroslav . . . . . . . . . . . . . . . . . . . . . . . . . 112
Honal, Matthias . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
Honda, Kiyoshi . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
Honda, Kiyoshi . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
Hori, Chiori . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
Hori, Chiori . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
Hori, Takaaki . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
Hori, Takaaki . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
Horiuchi, Yasuo . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
Horlock, James . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
Horlock, James . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
Horvat, Bogomir . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
Horvat, Bogomir . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
Hosokawa, Yuta . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
Hou, Zhaorong . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
House, David . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
Hozjan, Vladimir . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
Hsu, Chun-Nan . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
Hu, Fang . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
Hu, Sheng . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
Hu, Wei . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
Hu, Yu . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
Huang, Chao-Shih . . . . . . . . . . . . . . . . . . . . . . . . . 17
Huang, Jing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
Huang, Qiang . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
Huang, Shan . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
Huerta, Juan M. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
Huo, Qiang . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
Huo, Qiang . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
Hwang, Tai-Hwei . . . . . . . . . . . . . . . . . . . . . . . . . . 77
I
Ichikawa, Akira . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
Ichimura, Naoyuki . . . . . . . . . . . . . . . . . . . . . . . . . 80
Iizuka, Yosuke . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
Illina, Irina . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
Illina, Irina . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
Illina, Irina . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
Imamura, A. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
Inagaki, Yasuyoshi . . . . . . . . . . . . . . . . . . . . . . . . 55
Inkelas, Sharon . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
Inoue, Akira . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
Inoue, Tsuyoshi . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
Ipšić, Ivo . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
Ircing, Pavel . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
Ircing, Pavel . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
Irie, Yuki . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
Irino, Toshio . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
Irino, Toshio . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
Irino, Toshio . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
Iriondo, Ignasi . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
Isei-Jaakkola, Toshiko . . . . . . . . . . . . . . . . . . . . . . 4
Iser, Bernd . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
Ishi, Carlos Toshinori . . . . . . . . . . . . . . . . . . . . . 15
Ishihara, Kazushi . . . . . . . . . . . . . . . . . . . . . . . . 112
Ishikawa, Tetsuya . . . . . . . . . . . . . . . . . . . . . . . . . 40
Isobe, T. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
Itakura, Fumitada . . . . . . . . . . . . . . . . . . . . . . . . . 67
Itakura, Fumitada . . . . . . . . . . . . . . . . . . . . . . . . . 86
Ito, Akinori . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
Ito, Ryosuke . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
Ito, Toshihiko . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
Itoh, Nobuyasu . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
Itou, Katunobu . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
Itou, Katunobu . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
Itou, Katunobu . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
Itou, Katunobu . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
Itou, Katunobu . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
Iwaki, Mamoru . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
Iwami, Yohei . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
Iyengar, Giridharan . . . . . . . . . . . . . . . 91
J
Jackson, Philip J.B. . . . . . . . . . . . . . . . . . . . . . . . . 82
Jackson, Philip J.B. . . . . . . . . . . . . . . . . . . . . . . . . 97
Jafer, Essa . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
Jafer, Essa . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
Jaidane-Saidane, M. . . . . . . . . . . . . . . . . . . . . . . . . 49
Jain, Pratibha . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
Jain, Pratibha . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
James, A.B. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
Jamoussi, Salma . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
Jan, E.E. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
Jančovič, Peter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
Jang, Dalwon . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
Jang, Dalwon . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
Jang, Gyucheol . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
Janke, E. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
Jansen, E.J.M. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
Jesus, Luis M.T. . . . . . . . . . . . . . . . . . . . . . . . . . . 104
Jia, Chuan . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
Jia, Ying . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
Jiang, Jing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
Jin, Jianhong . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
Jin, Minho . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
Jitapunkul, Somchai . . . . . . . . . . . . . . . . . . . . . . . . 6
Jitapunkul, Somchai . . . . . . . . . . . . . . . . . . . . . . . 65
Jitsuhiro, Takatoshi . . . . . . . . . . . . . . . . . . . . . . . 96
Johnstone, T. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
Jokisch, Oliver . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
Jones, Douglas A. . . . . . . . . . . . . . . . . 56
Jones, Douglas A. . . . . . . . . . . . . . . . . 69
Jovičić, Slobodan . . . . . . . . . . . . . . . . . . . . . . . . . . 19
Ju, Gwo-hwa . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
Ju, Gwo-hwa . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
Juan, A. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
Jung, Ho-Youl . . . . . . . . . . . . . . . . . . . . 88
Junqua, Jean-Claude . . . . . . . . . . . . . . . . . . . . . . 13
Junqua, Jean-Claude . . . . . . . . . . . . . . . . . . . . . . 65
Junqua, Jean-Claude . . . . . . . . . . . . . . . . . . . . . . 71
Jutten, Christian . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
K
Kabal, Peter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
Kaburagi, Tokihiko . . . . . . . . . . . . . . . . . . . . . . . . 17
Kačič, Zdravko . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
Kačič, Zdravko . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
Kačič, Zdravko . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
Kain, Alexander B. . . . . . . . . . . . . . . . . . . . . . . . . . 12
Kain, Alexander B. . . . . . . . . . . . . . . . . . . . . . . . . . 58
Kajarekar, Sachin S. . . . . . . . . . . . . . . . . . . . . . . . 71
Kajarekar, Sachin S. . . . . . . . . . . . . . . . . . . . . . . . 94
Kakutani, Naoko . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
Kallulli, Dalina . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
Kam, Patgi . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
Kaneko, Tsuyoshi . . . . . . . . . . . . . . . . . . . . . . . . . 51
Kang, Hong-Goo . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
Kanokphara, Supphanat . . . . . . . . . . . . . . . . . . 28
Kanthak, S. . . . . . . . . . . . . . . . . . . . . . . . 40
Karjalainen, Matti . . . . . . . . . . . . . . . . . . . . . . . . . 87
Karlsson, Inger . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
Kashioka, Hideki . . . . . . . . . . . . . . . . . . . . . . . . . . 92
Kasuya, Hideki . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
Kasuya, Hideki . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
Katagiri, Shigeru . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
Katagiri, Yasuhiro . . . . . . . . . . . . . . . . . . . . . . . . . . 6
Kato, Hiroaki . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
Katz, M. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
Kawaguchi, Nobuo . . . . . . . . . . . . . . . . . . . . . . . . 55
Kawahara, Hideki . . . . . . . . . . . . . . . . . . . . . . . . . . 20
Kawahara, Hideki . . . . . . . . . . . . . . . . . . . . . . . . . . 75
Kawahara, Hideki . . . . . . . . . . . . . . . . . . . . . . . . . . 76
Kawahara, Hideki . . . . . . . . . . . . . . . . . . . . . . . . . . 86
Kawahara, Tatsuya . . . . . . . . . . . . . . . . . . . . . . . . 16
Kawahara, Tatsuya . . . . . . . . . . . . . . . . . . . . . . . . 26
Kawahara, Tatsuya . . . . . . . . . . . . . . . . . . . . . . . . 60
Kawahara, Tatsuya . . . . . . . . . . . . . . . . . . . . . . . . 65
Kawahara, Tatsuya . . . . . . . . . . . . . . . . . . . . . . . 105
Kawahara, Hideki . . . . . . . . . . . . . . . . 111
Kawai, Hisashi . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
Kawai, Hisashi . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
Kawai, Koji . . . . . . . . . . . . . . . . . . . . . . . 17
Kawanami, Hiromichi . . . . . . . . . . . . . . . . . . . . . 79
Kawanami, Hiromichi . . . . . . . . . . . . . . . . . . . . . 85
Kellner, Andreas . . . . . . . . . . . . . . . . . . 67
Kenicer, D. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
Kenny, P. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
Képesi, Marián . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
Kerstholt, J.H. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
Kessens, Judith M. . . . . . . . . . . . . . . . . . . . . . . . . . 65
Keung, Chi-Kin . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
Khayrallah, Ali . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
Khioe, Beatrice Fung-Wah . . . . . . . . . . . . . . . . . 84
Khitrov, M. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
Khudanpur, Sanjeev . . . . . . . . . . . . . . . . . . . . . 110
Kienappel, Anne K. . . . . . . . . . . . . . . . . . . . . . . . . 42
Kienappel, Anne K. . . . . . . . . . . . . . . . . . . . . . . . . 52
Kikui, Genichiro . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
Kikui, Genichiro . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
Kikui, Genichiro . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
Kikui, Genichiro . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
Kikuiri, Kei . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
Killer, Mirjam . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
Kim, Chong Kyu . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
Kim, D.Y. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
Kim, Hyoung-Gook . . . . . . . . . . . . . . . . . . . . . . . . 18
Kim, Hyoung-Gook . . . . . . . . . . . . . . . . . . . . . . . . 20
Kim, Hyung Soon . . . . . . . . . . . . . . . . . . . . . . . . . . 52
Kim, Hyun Woo . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
Kim, Jiun . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
Kim, Jong Uk . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
Kim, Jong Uk . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
Kim, Kwang-Dong . . . . . . . . . . . . . . . . . . . . . . . . . 51
Kim, Nam Soo . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
Kim, Nam Soo . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
Kim, Nam Soo . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
Kim, SangGyun . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
Kim, SangGyun . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
Kim, Taeyoon . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
Kim, Wooil . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
Kim, Woosung . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
Kim, Young Joon . . . . . . . . . . . . . . . . . . . . . . . . . . 13
King, Simon . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
King, Simon . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
King, Simon . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
King, Simon . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
King, Simon . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
King, Simon . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
King, Simon . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
Kingsbury, Brian . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
Kingsbury, Brian . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
Kingsbury, Brian . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
Kinnunen, Tomi . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
Kinoshita, Keisuke . . . . . . . . . . . . . . . . . . . . . . . . 48
Kiran, G.V. . . . . . . . . . . . . . . . . . . . . . . . . 3
Kiriyama, Shinya . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
Kishida, Itsuki . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
Kishon-Rabin, Liat . . . . . . . . . . . . . . . . . . . . . . . . . 73
Kishore, S.P. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
Kiss, Imre . . . . . . . . . . . . . . . . . . . . . . . 108
Kita, Kenji . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
Kitamura, Tadashi . . . . . . . . . . . . . . . . . . . . . . . . . 31
Kitamura, Tadashi . . . . . . . . . . . . . . . . . . . . . . . . . 87
Kitamura, Tadashi . . . . . . . . . . . . . . . . . . . . . . . . . 93
Kitamura, Tadashi . . . . . . . . . . . . . . . . . . . . . . . 112
Kitaoka, Norihide . . . . . . . . . . . . . . . . . . . . . . . . . . 22
Kitaoka, Norihide . . . . . . . . . . . . . . . . . . . . . . . . . . 22
Kitaoka, Norihide . . . . . . . . . . . . . . . . . . . . . . . . . . 62
Kitaoka, Norihide . . . . . . . . . . . . . . . . . . . . . . . . . . 96
Kitawaki, Nobuhiko . . . . . . . . . . . . . . . . . . . . . . . 80
Kitayama, Koji . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
Kitazawa, Shigeyoshi . . . . . . . . . . . . . . . . . . . . . . . 7
Klabbers, Esther . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
Klabbers, Esther . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
Klabbers, Esther . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
Klakow, Dietrich . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
Klankert, Tanja . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
Klasmeyer, G. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
Kleijn, W. Bastiaan . . . . . . . . . . . . . . . . . . . . . . . . 38
Kleijn, W. Bastiaan . . . . . . . . . . . . . . . . . . . . . . . . 49
Klein, Alexandra . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
Kleinschmidt, Michael . . . . . . . . . . . . . . . . . . . . . 50
Kleinschmidt, Michael . . . . . . . . . . . . . . . . . . . . . 91
Kneissler, Jan . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
Ko, Hanseok . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
Ko, Hanseok . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
Kobayashi, Akio . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
Kobayashi, Takao . . . . . . . . . . . . . . . . . 87
Kobayashi, Takao . . . . . . . . . . . . . . . . . . . . . . . . 102
Kobayashi, Tetsunori . . . . . . . . . . . . . 42
Kobayashi, Tetsunori . . . . . . . . . . . . . 43
Kobayashi, Tetsunori . . . . . . . . . . . . . 45
Kocsor, András . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
Kodama, Yasuhiro . . . . . . . . . . . . . . . . . . . . . . . . . 42
Kojima, Hiroaki . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
Kokkinakis, George . . . . . . . . . . . . . . . . . . . . . . . . . 5
Kokkinakis, George . . . . . . . . . . . . . . . . . . . . . . . . 21
Kokkinakis, George . . . . . . . . . . . . . . . . . . . . . . . . 60
Kokkinakis, George . . . . . . . . . . . . . . . . . . . . . . . . 78
Kokkinos, Iasonas . . . . . . . . . . . . . . . . . . . . . . . . . 29
Kokubo, Hiroaki . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
Köküer, Münevver . . . . . . . . . . . . . . . . . . . . . . . . . 76
Kolář, Jáchym . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
Kollmeier, Birger . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
Koloska, Uwe . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
Komatani, Kazunori . . . . . . . . . . . . . . . . . . . . . . . 26
Komatani, Kazunori . . . . . . . . . . . . . . . . . . . . . . . 60
Komatsu, Miki . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
Kominek, John . . . . . . . . . . . . . . . . . . . . 12
Kondo, Aki . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
Kondoz, Ahmet . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
Kordik, Alex . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
Korkmazsky, Filipp . . . . . . . . . . . . . . . . . . . . . . . 52
Korkmazsky, Filipp . . . . . . . . . . . . . . . . . . . . . . . 53
Kotnik, Bojan . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
Koumpis, Konstantinos . . . . . . . . . . . . . . . . . . . 99
Koval, S. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
Krasny, Leonid . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
Krbec, Pavel . . . . . . . . . . . . . . . . . . . . . . 81
Krishnan, Venkatesh . . . . . . . . . . . . . . . . . . . . . . 38
Krueger, Antonio . . . . . . . . . . . . . . . . . . . . . . . . . . 37
Krüger, S.E. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
Kruschke, Hans . . . . . . . . . . . . . . . . . . . . . . . . . . 102
Kryze, David . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
Kubala, Francis . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
Kubala, Francis . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
Kühne, Marco . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
Kukolich, Linda C. . . . . . . . . . . . . . . . . . . . . . . . . . 69
Kumar, Arun . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
Kumaresan, Ramdas . . . . . . . . . . . . . . . . . . . . . . . . 1
Kummert, Franz . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
Kung, Sun-Yuan . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
Kung, Sun-Yuan . . . . . . . . . . . . . . . . . . . . . . . . . . 105
Kung, Sun-Yuan . . . . . . . . . . . . . . . . . . . . . . . . . . 105
Kunzmann, S. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
Kuo, Chih-Chung . . . . . . . . . . . . . . . . . . . . . . . . . . 11
Kuo, Chi-Shiang . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
Kuo, Wei-Chih . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
Kurimo, Mikko . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
Kurimo, Mikko . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
Kuroiwa, Shingo . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
Kuroiwa, Shingo . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
Kuroiwa, Shingo . . . . . . . . . . . . . . . . . . . . . . . . . . 108
Kusumoto, Akiko . . . . . . . . . . . . . . . . . . . . . . . . . . 48
Kuwabara, Hisao . . . . . . . . . . . . . . . . . . . . . . . . . . 72
Kwok, Philip . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
Kwon, Oh-Wook . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
Kwon, Oh-Wook . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
Kwon, Soonil. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .94
L
Laaksonen, Lasse . . . . . . . . . . . . . . . . . . . . . . . . . . 61
Lackey, B.C. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
Lacroix, A. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
Ladd, D. Robert . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
Laface, P. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
Lähdekorpi, Marja . . . . . . . . . . . . . . . . . . . . . . . . . 38
Lahti, Tommi . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
Lai, Wen-Hsing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
Lai, Yiu-Pong . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
Lambert, T. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
Lamel, Lori . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
Lamere, Paul . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
Lamere, Paul . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
Lane, Ian R. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
Langlois, David . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
Langner, Brian . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
Lapidot, Itshak . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
Larsen, Lars Bo . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
Larson, Martha . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
Lashkari, Khosrow . . . . . . . . . . . . . . . . . . . . . . . . 60
Lasn, Jürgen . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
Lathoud, Guillaume . . . . . . . . . . . . . . . . . . . . . . 102
Laureys, Tom . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
Lauri, Fabrice . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
Lavie, Alon . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
Lawson, Aaron D. . . . . . . . . . . . . . . . . . 53
Le, Viet Bac . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
Lee, Akinobu . . . . . . . . . . . . . . . . . . . . . 52
Lee, Chen-Long . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
Lee, Chin-Hui . . . . . . . . . . . . . . . . . . . . . 9
Lee, Chin-Hui . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
Lee, Chin-Hui . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
Lee, Chul Min . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
Lee, Daniel D. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
Lee, J.H. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
Lee, J.J. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
Lee, K.Y. . . . . . . . . . . . . . . . . . . . . . . . . . 48
Lee, Kyong-Nim . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
Lee, Lin-shan . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
Lee, Lin-shan . . . . . . . . . . . . . . . . . . . . . 16
Lee, Lin-shan . . . . . . . . . . . . . . . . . . . . . 19
Lee, Lin-shan . . . . . . . . . . . . . . . . . . . . . 48
Lee, Lin-shan . . . . . . . . . . . . . . . . . . . . . 82
Lee, Lin-shan . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
Lee, Lin-shan . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
Lee, Sunil . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
Lee, Sunil . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
Lee, Tan . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
Lee, Tan . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
Lee, Te-Won . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
Lee, Te-Won . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
Lee, Te-Won . . . . . . . . . . . . . . . . . . . . . . 30
Lee, Yun-Tien . . . . . . . . . . . . . . . . . . . 100
Lees, Nicole . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
Lefevre, Fabrice . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
Lenz, Michael . . . . . . . . . . . . . . . . . . . 102
Lenzo, Kevin A. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
Lenzo, Kevin A. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
Leonov, A.S. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
Levin, David N. . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
Levin, Lori . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
Levit, Michael . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
Levow, Gina . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
Li, Aijun . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
Li, Haizhou . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
Li, Honglian . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
Li, Jianfeng . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
Li, Li . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
Li, Stan Z. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
Li, Ta-Hsin . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
Li, Xiang . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
Li, Xiaolong . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
Li, Yujia . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
Li, Yuk-Chi . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
Li, Yuk-Chi . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
Liang, Min-Siong . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
Liao, Shuo-Peng . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
Liao, Yuan-Fu . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
Lickley, R.J. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
Lieb, Robert . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
Light, Joanna . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
Lim, Sung . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
Lim, Woohyung . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
Lima, Amaro . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
Lin, Jeng-Shien . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
Lin, Li-Feng . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
Lin, Xiaofan . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
Lin, Yi-Chung . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
Linares, Georges . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
Linhard, Klaus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
Liscombe, Jackson . . . . . . . . . . . . . . . . 26
Liu, Chen . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
Liu, Feng . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
Liu, Fu-Hua . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
Liu, Jia . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
Liu, Jian . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
Liu, Jingwei . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
Liu, Runsheng . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
Liu, Wei M. . . . . . . . . . . . . . . . . . . . . . . 102
Liu, Xingkun . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
Liu, Yang . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
Liu, Yi . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
Livescu, Karen . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
Lleida, Eduardo . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
Llorà, Xavier . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
Lloyd-Thomas, Harvey . . . . . . . . . . . . . . . . . . . . 78
Lo, Tin-Hang . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
Lo, Wai-Kit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
Lo, Wai-Kit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
Lobacheva, Yuliya . . . . . . . . . . . . . . . . . . . . . . . . . 87
Locher, Ivo . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
Lœvenbruck, Hélène . . . . . . . . . . . . . . . . . . . . . . . . 6
Lonsdale, Deryle . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
Looks, Karin . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
Lu, Ching-Ta . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
Lu, Meirong . . . . . . . . . . . . . . . . . . . . . 112
Lu, Yiqing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
Lucey, Simon . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
Luengo, I. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
Lukas, Klaus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
Luksaneeyanawin, Sudaporn . . . . . . . . . . . . . . . 6
Luksaneeyanawin, Sudaporn . . . . . . . 65
Luo, Yu . . . . . . . . . . . . . . . . . . . . . . . . . . 62
Luong, Mai Chi . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
Lyu, Dau-Cheng . . . . . . . . . . . . . . . . . . 66
Lyu, Ren-Yuan . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
Lyu, Ren-Yuan . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
M
Ma, Changxue . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
Ma, Chengyuan . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
Ma, Chengyuan . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
Ma, L. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
Maase, Jens . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
Macherey, Klaus . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
Macherey, Wolfgang . . . . . . . . . . . . . . . . . . . . . . . 18
Macías-Guarasa, J. . . . . . . . . . . . . . . . . . . . . . . . . . 64
Macías-Guarasa, J. . . . . . . . . . . . . . . . . . . . . . . . . . 95
MacLaren, Victoria . . . . . . . . . . . . . . . . . . . . . . . . 56
Macrostie, Ehry . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
Maeda, Sakashi . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
Maegaard, Bente . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
Maeki, Daiju . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
Maffiolo, V. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
Magimai-Doss, Mathew . . . . . . . . . . . . . . . . . . . . 89
Magrin-Chagnolleau, Ivan . . . . . . . . . . . . . . . . . . 2
Mahajan, Milind . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
Mahajan, Milind . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
Mahdi, Abdulhussain E. . . . . . . . . . . . . . . . . . . . 20
Mahdi, Abdulhussain E. . . . . . . . . . . . . . . . . . . . 61
Mahdi, Abdulhussain E. . . . . . . . . . . . . . . . . . . . 73
Mahdi, Abdulhussain E. . . . . . . . . . . . . . . . . . . . 84
Mahé, Gaël . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
Maia, R. da S. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
Maison, Benoît . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
Maison, Benoît . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
Maison, Benoît . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
Mak, Brian . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
Mak, Brian . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
Mak, Man-Wai . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
Mak, Man-Wai . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
Mak, Man-Wai . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
Makhoul, John . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
Makino, Shozo . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
Maloor, Preetam . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
Maltese, Giulio . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
Mamede, Nuno J. . . . . . . . . . . . . . . . . . . . . . . . . . . 68
Mami, Yassine . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
Mana, Franco . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
Mana, Nadia . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
Manabe, Hiroyuki . . . . . . . . . . . . . . . . . . . . . . . . . . 96
Maneenoi, Ekkarit . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
Maneenoi, Ekkarit . . . . . . . . . . . . . . . . . . . . . . . . . 65
Manfredi, Claudia . . . . . . . . . . . . . . . . . . . . . . . . . 84
Mangu, Lidia . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
Mangu, Lidia . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
Mangu, Lidia . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
Mapelli, Valerie . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
Maragos, Petros . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
Maragos, Petros . . . . . . . . . . . . . . . . . . . . . . . . . . 101
Maragoudakis, M. . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
Markov, Konstantin . . . . . . . . . . . . . . . . . . . . . . . 34
Martens, Jean-Pierre . . . . . . . . . . . . . . . . . . . . . . . 33
Martin, Alvin F. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
Martin, Arnaud . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
Martin, Terrence . . . . . . . . . . . . . . . . . . . . . . . . . 110
Martin, Terrence . . . . . . . . . . . . . . . . . . . . . . . . . 110
Martinčić-Ipšić, Sanda . . . . . . . . . . . . . . . . . . . . . 68
Martínez, R. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
Martinez, Roberto . . . . . . . . . . . . . . . . . . . . . . . . 104
Masaki, Shinobu . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
Masgrau, Enrique . . . . . . . . . . . . . . . . . . . . . . . . . . 50
Maskey, Sameer Raj . . . . . . . . . . . . . . . . . . . . . . . 41
Mason, John S.D. . . . . . . . . . . . . . . . . . . . . . . . . . 102
Massaro, Dominic W. . . . . . . . . . . . . . . . . . . . . . . 79
Masuko, Takashi . . . . . . . . . . . . . . . . . . . . . . . . . . 87
Matassoni, Marco . . . . . . . . . . . . . . . . . . . . . . . . . . 18
Matassoni, Marco . . . . . . . . . . . . . . . . . . . . . . . . . . 64
Matějka, Pavel . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
Matoušek, Jindřich . . . . . . . . . . . . . . . . . . . . . . . . 11
Matrouf, Driss . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
Matsubara, Shigeki . . . . . . . . . . . . . . . . . . . . . . . . 55
Matsui, Hisami . . . . . . . . . . . . . . . . . . . . . 75
Matsui, Tomoko . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
Matsui, Tomoko . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
Matsunaga, S. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
Matsuoka, Bungo . . . . . . . . . . . . . . . . . . . . . . . . . . 14
Matsushita, Masahiko . . . . . . . . . . . . . . . . . . . . . 42
Matsuura, Daisuke . . . . . . . . . . . . . . . . . . . . . . . . 96
Mattys, Sven L. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
Mau, Peter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
Mauuary, Laurent . . . . . . . . . . . . . . . . . . . . . . . . 108
Mayfield Tomokiyo, Laura . . . . . . . . . . . . . . . . 14
Mayfield Tomokiyo, Laura . . . . . . . . . . . . . . . . 72
McCowan, Iain A. . . . . . . . . . . . . . . . . . . . . . . . . . . 34
Eurospeech 2003
McCowan, Iain A. . . . . . . . . . . . . . . . . . . . . . . . . . 102
McDermott, Erik . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
McDonough, John . . . . . . . . . . . . . . . . . . . . . . . . . 36
McDonough, John . . . . . . . . . . . . . . . . . . . . . . . . . 56
McQueen, James M. . . . . . . . . . . . . . . . . . . . . . . . 74
McTait, Kevin . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
McTear, Michael . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
Meinedo, Hugo . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
Meister, Einar . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
Meister, Lya . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
Melenchón, Javier . . . . . . . . . . . . . . . . . . . . . . . . 104
Melnar, Lynette . . . . . . . . . . . . . . . . . . . . 110
Meng, Helen M. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
Meng, Helen M. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
Meng, Helen M. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
Meng, Helen M. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
Mertins, Alfred . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
Mertz, Frank . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
Metze, Florian . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
Metze, Florian . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
Meuwly, Didier . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
Meyer, Georg . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
Miao, Cailian . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
Mihajlik, Péter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
Mihajlik, Péter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
Mihelič, France . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
Mihelič, France . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
Mihoubi, M. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
Mihoubi, M. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
Mikami, Takayoshi . . . . . . . . . . . . . . . . . . . . . . . . 42
Miki, Kazuhiro . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
Miki, Nobuhiro . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
Miki, Toshio . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
Miki, Toshio . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
Miller, David . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
Miller, David . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
Milner, Ben P. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
Milner, Ben P. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
Milner, Ben P. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
Milner, Ben P. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
Milner, Ben P. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
Minami, Yasuhiro . . . . . . . . . . . . . . . . . . . . . . . . 100
Minematsu, Nobuaki . . . . . . . . . . . . . . . . . . . . . . . . 6
Minematsu, Nobuaki . . . . . . . . . . . . . . . . . . . . . . 12
Minematsu, Nobuaki . . . . . . . . . . . . . . . . . . . . . . 14
Minematsu, Nobuaki . . . . . . . . . . . . . . . . . . . . . . 31
Minematsu, Nobuaki . . . . . . . . . . . . . . . . . . . . . . 59
Minematsu, Nobuaki . . . . . . . . . . . . . . . . . . . . . . 73
Minematsu, Nobuaki . . . . . . . . . . . . . . . . . . . . . . 92
Minematsu, Nobuaki . . . . . . . . . . . . . . . . . . . . . 106
Minematsu, Nobuaki . . . . . . . . . . . . . . . . . . . . . 111
Minematsu, Nobuaki . . . . . . . . . . . . . . . . . . . . . 111
Ming, Ji . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
Minnis, Steve . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
Mírovský, Jiří . . . . . . . . . . . . . . . . . . . . . . 64
Mishra, Taniya . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
Mishra, Taniya . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
Misra, Hemant . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
Misra, Hemant . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
Mitsuta, Yoshifumi . . . . . . . . . . . . . . . . . . . . . . . . . 7
Mittal, U. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
Mixdorff, Hansjörg . . . . . . . . . . . . . . . . . . . 7
Mixdorff, Hansjörg . . . . . . . . . . . . . . . . . . . . . . . . 31
Miyajima, Chiyomi . . . . . . . . . . . . . . . . . . . . . . . . 93
Miyanaga, Yoshikazu . . . . . . . . . . . . . . . . . . . . . . 83
Miyazaki, Noboru . . . . . . . . . . . . . . . . . . . . 68
Mizumachi, Mitsunori . . . . . . . . . . . . . . . . . . . . . 21
Mizumachi, Mitsunori . . . . . . . . . . . . . . . . . . . . . 62
Mizutani, T. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
Möbius, Bernd . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
Möbius, Bernd . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
Mohri, Mehryar . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
Mok, Oi Yan . . . . . . . . . . . . . . . . . . . . . . . . 59
Mokhtari, Parham . . . . . . . . . . . . . . . . . . . . . . . . . 15
Möller, Sebastian . . . . . . . . . . . . . . . . . . . . . . . . . . 69
Montero, J.M. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
Montero, J.M. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
Moonen, Marc . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
Moore, Darren C. . . . . . . . . . . . . . . . . . . . . . . . . . 102
Moore, Roger K. . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
Moore, Roger K. . . . . . . . . . . . . . . . . . . . . . . . . . . 103
Moreau, Nicolas . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
Moreau, Nicolas . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
Morel, Michel . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
Moreno, Asunción . . . . . . . . . . . . . . . . . . . . . . . . . 54
Moreno, Asunción . . . . . . . . . . . . . . . . . . . . . . . . . 56
Moreno, David M. . . . . . . . . . . . . . . . . . . . . . . . . . . 82
Moreno, Pedro J. . . . . . . . . . . . . . . . . . . . . . . . . . 105
Morgan, Nelson . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
September 1-4, 2003 – Geneva, Switzerland
Mori, Hiroki . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
Mori, Hiroki . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
Mori, Kazumasa . . . . . . . . . . . . . . . . . . . . . . . . . . 112
Mori, Shinsuke . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
Morimoto, Tsuyoshi . . . . . . . . . . . . . . . . . . . . . . . 23
Morin, Philippe . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
Moro-Sancho, Q. . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
Morris, Andrew . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
Morris, Robert W. . . . . . . . . . . . . . . . . . . . . . . . . 109
Mostow, Jack . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
Mostow, Jack . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
Motlíček, Petr . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
Motlíček, Petr . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
Motomura, Yoichi . . . . . . . . . . . . . . . . . . . . . . . . . 80
Moudenc, Thierry . . . . . . . . . . . . . . . . . . . . . . . . . 32
Mouri, Taro . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
Mukherjee, Niloy . . . . . . . . . . . . . . . . . . . . . . . . . . 80
Müller, Christian . . . . . . . . . . . . . . . . . . . . . . . . . . 46
Muller, J.S. . . . . . . . . . . . . . . . . . . . . . . . . . . 6
Mullin, J. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
Murao, H. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
Murtagh, Fionn . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
Murthy, Hema A. . . . . . . . . . . . . . . . . . . . . . . . . . 103
Muto, Makiko . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
Myrvoll, Tor André . . . . . . . . . . . . . . . . . . . . . . . . 53
N
Nadeu, Climent . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
Nadeu, Climent . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
Nagarajan, T. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
Nagata, Masaaki . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
Nagata, Masaaki . . . . . . . . . . . . . . . . . . . . . . . . . . 112
Naito, Takuro . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
Naka, Nobuhiko . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
Nakadai, Kazuhiro . . . . . . . . . . . . . . . . . . . 96
Nakagawa, Seiichi . . . . . . . . . . . . . . . . . . . . . . . . . 22
Nakagawa, Seiichi . . . . . . . . . . . . . . . . . . . . . . . . . 22
Nakagawa, Seiichi . . . . . . . . . . . . . . . . . . . . . . . . . 42
Nakagawa, Seiichi . . . . . . . . . . . . . . . . . . . . . . . . . 96
Nakagawa, Seiichi . . . . . . . . . . . . . . . . . . . . . . . . 106
Nakagawa, Seiichi . . . . . . . . . . . . . . . . . . . . . . . . 112
Nakajima, Hideharu . . . . . . . . . . . . . . . . . . . . . . . 95
Nakajima, Yoshitaka . . . . . . . . . . . . . . . . . . . . . . 92
Nakamura, Naoki . . . . . . . . . . . . . . . . . . . . . . . . 112
Nakamura, Norio . . . . . . . . . . . . . . . . . . . . . . . . . . 73
Nakamura, Satoshi . . . . . . . . . . . . . . . . . . . . . . . . . . 2
Nakamura, Satoshi . . . . . . . . . . . . . . . . . . . . . . . . 16
Nakamura, Satoshi . . . . . . . . . . . . . . . . . . . . . . . . 19
Nakamura, Satoshi . . . . . . . . . . . . . . . . . . . . . . . . 21
Nakamura, Satoshi . . . . . . . . . . . . . . . . . . . . . . . . 24
Nakamura, Satoshi . . . . . . . . . . . . . . . . . . . . . . . . 34
Nakamura, Satoshi . . . . . . . . . . . . . . . . . . . . . . . . 44
Nakamura, Satoshi . . . . . . . . . . . . . . . . . . . . . . . . 62
Nakamura, Satoshi . . . . . . . . . . . . . . . . . . . . . . . . 76
Nakamura, Satoshi . . . . . . . . . . . . . . . . . . . . . . . . 80
Nakamura, Satoshi . . . . . . . . . . . . . . . . . . . . . . . . 96
Nakamura, Satoshi . . . . . . . . . . . . . . . . . . . . . . . 108
Nakano, Mikio . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
Nakano, Mikio . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
Nakasone, Hirotaka . . . . . . . . . . . . . . . . . . . . . . . 25
Nakatani, Tomohiro . . . . . . . . . . . . . . . . . . . . . . . 81
Nakatani, Tomohiro . . . . . . . . . . . . . . . . . . . . . . . 86
Nankaku, Yoshihiko . . . . . . . . . . . . . . . . . . . . . . . 93
Narayanan, Shrikanth . . . . . . . . . . . . . . . . . . . . . . 6
Narayanan, Shrikanth . . . . . . . . . . . . . . . . . . . . . 39
Narayanan, Shrikanth . . . . . . . . . . . . . . . . . . . . . 43
Narayanan, Shrikanth . . . . . . . . . . . . . . . . . . . . . 94
Narayanan, Shrikanth . . . . . . . . . . . . . . . . . . . . 111
Narusawa, Shuichi . . . . . . . . . . . . . . . . . . . . . . . . . . 6
Narusawa, Shuichi . . . . . . . . . . . . . . . . . . . . . . . . . 82
Natarajan, Ajay . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
Natarajan, Premkumar . . . . . . . . . . . . . . . . . . . . 79
Navarro-Mesa, Juan L. . . . . . . . . . . . . . . . . . . . . . 86
Navas, E. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
Navrátil, Jiří . . . . . . . . . . . . . . . . . . . . . . . 71
Navrátil, Jiří . . . . . . . . . . . . . . . . . . . . . . . 94
Nedel, Jon P. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
Nefti, Samir . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
Neti, Chalapathy . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
Neto, João P. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
Neto, João P. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
Neubarth, Friedrich . . . . . . . . . . . . . . . . . . . . . . . 46
Neukirchen, Christoph . . . . . . . . . . . . . . . . . . . . 92
Newell, Alan F. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
Ney, Hermann . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
Ney, Hermann . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
Ney, Hermann . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
Ney, Hermann . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
Ney, Hermann . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
Ney, Hermann . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
Ney, Hermann . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
Ney, Hermann . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
Ney, Hermann . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
Nguyen, Patrick . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
Nguyen, Patrick . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
Nguyen, Phu Chien . . . . . . . . . . . . . . . . . . . . . . . . 16
Ni, Jinfu . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
Nicholson, H.B.M. . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
Niemann, Heinrich . . . . . . . . . . . . . . . . . . . . . . . . 34
Nieto, V. . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
Nigra, M. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
Niimi, Yasuhisa . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
Nikléczy, P. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
Niklfeld, Georg . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
Nishida, Masafumi . . . . . . . . . . . . . . . . . . . . . . . . 65
Nishikawa, Tsuyoki . . . . . . . . . . . . . . . . . . . . . . . 20
Nishimura, Masafumi . . . . . . . . . . . . . . . . . . . . . 16
Nishiura, Takanobu . . . . . . . . . . . . . . . . . . . . . . . 62
Nishiura, Takanobu . . . . . . . . . . . . . . . . . . . . . . . 76
Nishiura, Takanobu . . . . . . . . . . . . . . . . . . . . . . . 76
Nishizaki, Hiromitsu . . . . . . . . . . . . . . . . . . . . . . 42
Nishizawa, Nobuyuki . . . . . . . . . . . . . . . . . . . . . . 31
Nitta, Tsuneo . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
Nitta, Tsuneo . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
Nitta, Tsuneo . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
Niu, Xiaochuan . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
Nix, Johannes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
Nocera, Pascal . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
Nock, Harriet J. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
Nordén, Fredrik . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
Norris, Dennis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
Nöth, Elmar . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
Nöth, Elmar . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
Nöth, Elmar . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
Nöth, Elmar . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
Nouza, Jan . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
Novak, Miroslav . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
Nukinay, Masumi . . . . . . . . . . . . . . . . . . . . . . . . 111
Nurminen, Jani . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
Nurminen, Jani . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
Nurminen, Jani . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
Nurminen, Jani . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
O
Obuchi, Yasunari . . . . . . . . . . . . . . . . . . . . . . . . . . 24
Och, Franz J. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
Odijk, Jan . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
Oflazer, Kemal . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
Ogata, Jun . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
Ogata, Jun . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
Ogata, Jun . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
Ogawa, A. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
Ogawa, Tetsuji . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
Ogawa, Yoshihiko . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
Ogawa, Yoshio . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
Oh, Se-Jin . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
Ohkawa, Yuichi . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
Ohno, Sumio . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
Ohya, Tomoyuki . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
Oikonomidis, Dimitrios . . . . . . . . . . . . . . . . . . . 55
Oikonomidis, Dimitrios . . . . . . . . . . . . . . . . . . . 81
Okada, Jiro . . . . . . . . . . . . . . . . . . . . . . . . . 62
Okawa, Shigeki . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
Okimoto, Mamiko . . . . . . . . . . . . . . . . . . . . . . . . 112
Okuno, Hiroshi G. . . . . . . . . . . . . . . . . . . . . . . . . . 26
Okuno, Hiroshi G. . . . . . . . . . . . . . . . . . . . . . . . . . 96
Okuno, Hiroshi G. . . . . . . . . . . . . . . . . . . . . . . . . 112
Olaszy, G. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
Oliveira, Luís C. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
Oliveira, Luís C. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
Olsen, Peder A. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
Olsen, Peder A. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
Omar, Mohamed Kamal . . . . . . . . . . . . . . . . . . . 18
Omar, Mohamed Kamal . . . . . . . . . . . . . . . . . . . 88
Omologo, Maurizio . . . . . . . . . . . . . . . . . . . . . . . . 18
Omoto, Yukihiro . . . . . . . . . . . . . . . . . . . . . . . . . . 42
O’Neill, Ian . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
O’Neill, Peter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
Onishi, Koji . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
Ono, Takayuki . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
Ordelman, Roeland . . . . . . . . . . . . . . . . . . . . . . . . . 9
Ordóñez, J. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
Orlandi, Marco . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
Ortega, Alfonso . . . . . . . . . . . . . . . . . . . . . 50
Ortega, Antonio . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
Ortega-Garcia, J. . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
Ortega-Garcia, J. . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
Osaki, Koichi . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
O’Shaughnessy, Douglas . . . . . . . . . . . . . . . . . . 36
O’Shaughnessy, Douglas . . . . . . . . . . . . . . . . . 109
Ostendorf, Mari . . . . . . . . . . . . . . . . . . . . . 89
Osterrath, Frédéric . . . . . . . . . . . . . . . . . . . . . . . . 43
Otake, Takashi . . . . . . . . . . . . . . . . . . . . . . 29
Otake, Takashi . . . . . . . . . . . . . . . . . . . . . . 74
Otsuji, Kiyotaka . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
Ouellet, Pierre . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
Ouellet, Pierre . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
Ozeki, Kazuhiko . . . . . . . . . . . . . . . . . . . . . . . . . 112
Ozeki, Kazuhiko . . . . . . . . . . . . . . . . . . . . . . . . . 112
Ozturk, Ozlem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
Ozturk, Ozlem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
P
Padrell, Jaume . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
Padrell, Jaume . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
Padrta, Aleš . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
Pakucs, Botond . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
Paliwal, Kuldip K. . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
Paliwal, Kuldip K. . . . . . . . . . . . . . . . . . . . . . . . . . . 24
Paliwal, Kuldip K. . . . . . . . . . . . . . . . . . . . . . . . . . . 30
Paliwal, Kuldip K. . . . . . . . . . . . . . . . . . . . . . . . . . . 44
Paliwal, Kuldip K. . . . . . . . . . . . . . . . . . . . . . . . . . . 75
Palmer, Rebecca . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
Pan, Jielin . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
Pardo, J.M. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
Paredes, R. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
Parihar, N. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
Park, Alex . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
Park, Jong Se . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
Park, Seung Seop . . . . . . . . . . . . . . . . . . . . . . . . . . 11
Park, Young-Hee . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
Parker, Mark . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
Parker, Mark . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
Parveen, Shahla . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
Pascual, Neus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
Patterson, Roy D. . . . . . . . . . . . . . . . . . . . . . . . . . . 20
Paulo, Sérgio . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
Pavešić, Nikola . . . . . . . . . . . . . . . . . . . . . 45
Peereman, Ronald . . . . . . . . . . . . . . . . . . . . . . . . . 73
Peinado, Antonio M. . . . . . . . . . . . . . . . . . . . . . . . 38
Peinado, Antonio M. . . . . . . . . . . . . . . . . . . . . . . . 97
Pelecanos, Jason . . . . . . . . . . . . . . . . . . . . . . . . . 106
Pelle, Patricia A. . . . . . . . . . . . . . . . . . . . . . . . . . . 102
Pellom, Bryan . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
Pellom, Bryan . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
Pellom, Bryan . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
Peng, Hu . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
Peretti, Giorgio . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
Pérez-Córdoba, José L. . . . . . . . . . . . . . . . . . . . . 38
Petek, Bojan . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
Peters, S. Douglas . . . . . . . . . . . . . . . . . . . . . . . . . 66
Petrillo, Massimo . . . . . . . . . . . . . . . . . . . . . . . . . 103
Petrinovic, Davor . . . . . . . . . . . . . . . . . . . . . . . . . . 31
Petrinovic, Davor . . . . . . . . . . . . . . . . . . . . . . . . . . 39
Petrinovic, Davorka . . . . . . . . . . . . . . . . . . . . . . . 39
Petrushin, Valery A. . . . . . . . . . . . . . . . . . . . . . . 111
Pfister, Beat . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
Pfister, Beat . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
Pfister, Beat . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
Pfitzinger, Hartmut R. . . . . . . . . . . . . . . . . . . . . . 29
Phillips, Michael . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
Piano, Lawrence . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
Piantanida, Juan P. . . . . . . . . . . . . . . . . . . . . . . . . 80
Picheny, Michael . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
Picheny, Michael . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
Picheny, Michael . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
Picheny, Michael . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
Picone, Joseph . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
Picone, Joseph . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
Picovici, Dorel . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
Pieraccini, Roberto . . . . . . . . . . . . . . . . . . . . . . . . 79
Pitrelli, J. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
Pitsikalis, Vassilis . . . . . . . . . . . . . . . . . . . . . . . . . 29
Pitz, Michael . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
Pobloth, Harald . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
Podveský, Petr . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
Poeppel, David . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
Pohjalainen, Jouni . . . . . . . . . . . . . . . . . . . . . . . 103
Poirier, Franck . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
Polifroni, Joseph . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
Pollák, Petr . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
Pols, Louis C.W. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
Popovici, C. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
Portele, Thomas . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
Potamianos, Gerasimos . . . . . . . . . . . . . . . . . . . 45
Potamitis, Ilyas . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
Potamitis, Ilyas . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
Potamitis, Ilyas . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
Potamitis, Ilyas . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
Potamitis, Ilyas . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
Povey, D. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
Prasad, K. Venkatesh . . . . . . . . . . . . . . . . . . . . . . 79
Prasad, Rashmi . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
Prasanna, S.R. Mahadeva . . . . . . . . . . . . . . . . . . . 3
Prasanna, S.R. Mahadeva . . . . . . . . . . . . . . . . . . 21
Prasanna Kumar, K.R. . . . . . . . . . . . . . . . . . . . . 108
Pratsolis, D. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
Precoda, Kristin . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
Prieto, Ramon . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
Prime, G. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
Prodanov, Plamen . . . . . . . . . . . . . . . . . . . . . . . . . 37
Przybocki, Mark A. . . . . . . . . . . . . . . . . . . . . . . . . 47
Psutka, Josef . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
Psutka, Josef . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
Psutka, Josef . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
Psutka, Josef . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
Psutka, J.V. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
Pucher, Michael . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
Puder, Henning . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
R
Rodriguez, Francisco Romero . . . . . . . . . 102
Roh, Duk-Gyoo . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
Romportl, Jan . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
Romsdorfer, Harald . . . . . . . . . . . . . . . . . . . . . . . 72
Roohani, Mahmood R. . . . . . . . . . . . . . . . . . . . . . 54
Rosec, Olivier . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
Rosenhouse, Judith . . . . . . . . . . . . . . . . . . . . . . . 73
Rosset, Sophie . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
Rosset, Sophie . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
Rossi-Katz, Jessica A. . . . . . . . . . . . . . . . . . . . . . 50
Rothkrantz, Leon J.M. . . . . . . . . . . . . . . . . . . . . . 32
Roweis, Sam T. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
Roy, Deb . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
Rubio, Antonio J. . . . . . . . . . . . . . . . . . . . . . . . . . . 13
Rubio, Antonio J. . . . . . . . . . . . . . . . . . . . . . . . . . . 97
Rubio, Antonio J. . . . . . . . . . . . . . . . . . . . . . . . . . 107
Rudnicky, Alexander I. . . . . . . . . . . . . . . . . . . . . 21
Rudnicky, Alexander I. . . . . . . . . . . . . . . . . . . . . 66
Ruiz, Diego . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
Ruske, Günther . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
Ruske, Günther . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
Russell, Martin J. . . . . . . . . . . . . . . . . . . . . . . . . . . 78
Russell, Martin J. . . . . . . . . . . . . . . . . . . . . . . . . . . 82
Russell, Martin J. . . . . . . . . . . . . . . . . . . . . . . . . . . 97
Rutten, Peter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
Rutten, Peter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
Q
Saarinen, Jukka . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
Saarinen, Jukka . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
Sagisaka, Yoshinori . . . . . . . . . . . . . . . . . . . . . . . . . 4
Sagisaka, Yoshinori . . . . . . . . . . . . . . . . . . . . . . . . . 7
Sagisaka, Yoshinori . . . . . . . . . . . . . . . . . . . . . . . . . 9
Sagisaka, Yoshinori . . . . . . . . . . . . . . . . . . . . . . . 15
Sai Jayram, A.K.V. . . . . . . . . . . . . . . . . . . . . . . . . . 47
Saito, Mutsumi . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
Sakamoto, Yoko . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
Sakata, Keigo . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
Salonen, Esa-Pekka . . . . . . . . . . . . . . . . . . . . . . . . 27
Salor, Özgül . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
Salor, Özgül . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
Saltzman, Elliot . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
Salvi, Giampiero . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
Salvi, Giampiero . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
Samudravijaya, K. . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
Samuelsson, Jonas . . . . . . . . . . . . . . . . . . . . . . . . 49
Sanchez, J. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
Sánchez, Victoria . . . . . . . . . . . . . . . . . . . . . . . . . . 38
Sánchez, Victoria . . . . . . . . . . . . . . . . . . . . . . . . . . 97
Sanchis, A. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
Sanchis, Emilio . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
Sanchis, Emilio . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
Sanchis, Javier. . . . . . . . . . . . . . . . . . . . . . . . . . . .104
Sanders, Eric . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
Sankar, Ananth . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
Sankar, Ananth . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
Santarelli, Alfiero . . . . . . . . . . . . . . . . . . . . . . . . . . 61
Saon, George . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
Saon, George . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
Šarić, Zoran . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
Sarich, Ace . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
Saruwatari, Hiroshi . . . . . . . . . . . . . . . . . . . . . . . . 20
Saruwatari, Hiroshi . . . . . . . . . . . . . . . . . . . . . . . . 52
Saruwatari, Hiroshi . . . . . . . . . . . . . . . . . . . . . . . . 79
Saruwatari, Hiroshi . . . . . . . . . . . . . . . . . . . . . . . . 85
Sasaki, Felix. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .80
Sasaki, Koji . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
Sasou, Akira . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
Sato, Tsutomu . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
Säuberlich, Bettina . . . . . . . . . . . . . . . . . . . . . . . . 46
Saul, Lawrence K. . . . . . . . . . . . . . . . . . . . . . . . . . . 35
Savova, Guergana . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
Scalart, P. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
Scanlon, Patricia . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
Schafföner, M. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
Schalkwyk, Johan . . . . . . . . . . . . . . . . . . . . . . . . . . 69
Scharenborg, Odette . . . . . . . . . . . . . . . . . . . . . . 73
Scharenborg, Odette . . . . . . . . . . . . . . . . . . . . . . 74
Scherer, Klaus R. . . . . . . . . . . . . . . . . . . . . . . . . . . 58
Scherer, Klaus R. . . . . . . . . . . . . . . . . . . . . . . . . . 106
Schiel, Florian . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
Schimanowski, Juergen . . . . . . . . . . . . . . . . . . . 68
Schlüter, Ralf . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
Schmidt, Gerhard . . . . . . . . . . . . . . . . . . . . . . . . . . 20
Schneider, T. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
Schnell, K. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
Qian, Yao . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
Qian, Yasheng . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
Quintana-Morales, Pedro . . . . . . . . . . . . . . . . . . 86
R
Raad, Mohammed . . . . . . . . . . . . . . . . . . . . . . . . . 39
Radová, Vlasta . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
Radová, Vlasta . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
Rahim, Mazin . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
Rahurkar, Mandar A. . . . . . . . . . . . . . . . . . . . . . . 26
Raj, Bhiksha . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
Raj, Bhiksha . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
Ramabhadran, Bhuvana . . . . . . . . . . . . . . . . . . . 33
Ramabhadran, Bhuvana . . . . . . . . . . . . . . . . . . . 91
Ramakrishnan, K.R. . . . . . . . . . . . . . . . . . . . . . . 108
Ramasubramanian, V. . . . . . . . . . . . . . . . . . . . . . 47
Ramaswamy, Ganesh N. . . . . . . . . . . . . . . . . . . . 69
Ramaswamy, Ganesh N. . . . . . . . . . . . . . . . . . . . 71
Ramaswamy, Ganesh N. . . . . . . . . . . . . . . . . . . . 94
Ramírez, Javier . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
Ramírez, Javier . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
Ramírez, Miguel Arjona. . . . . . . . . . . . . . . . . .104
Ramos-Castro, D. . . . . . . . . . . . . . . . . . . . . . . . . . . 25
Rank, Erhard . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
Rätsch, Gunnar . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
Raux, Antoine . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
Ravera, F. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
Raykar, Vikas C. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
Raymond, Christian . . . . . . . . . . . . . . . . . . . . . . . 22
Raza, D.G. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
Reichert, Jürgen . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
Reilly, Richard . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
Renals, Steve. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .17
Renals, Steve. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .99
Rentzos, Dimitrios . . . . . . . . . . . . . . . . . . . . . . . . 85
Rentzos, Dimitrios . . . . . . . . . . . . . . . . . . . . . . . 104
Resende Jr., F.G.V. . . . . . . . . . . . . . . . . . . . . . . . . . 87
Reynolds, Douglas A. . . . . . . . . . . . . . . . . . . . . . . . 2
Reynolds, Douglas A.. . . . . . . . . . . . . . . . . . . . . .47
Reynolds, Douglas A.. . . . . . . . . . . . . . . . . . . . . .56
Reynolds, Douglas A.. . . . . . . . . . . . . . . . . . . . . .69
Reynolds, Douglas A.. . . . . . . . . . . . . . . . . . . . . .71
Reynolds, Douglas A.. . . . . . . . . . . . . . . . . . . . . .94
Riccardi, Giuseppe . . . . . . . . . . . . . . . . . . . . . . . . 23
Riccardi, Giuseppe . . . . . . . . . . . . . . . . . . . . . . . . 64
Richey, Colleen . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
Rifkin, Ryan . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
Rigazio, Luca . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
Rigazio, Luca . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
Rigoll, Gerhard . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
Rilliard, Albert . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
Ris, Christophe . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
Rizzi, Romeo . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
Rodgers, Dwight . . . . . . . . . . . . . . . . . . . . . . . . . 106
Rodrigues, M. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
121
S
Eurospeech 2003
Schoentgen, J. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
Schone, P.J. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
Schreiner, Olaf . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
Schultz, Tanja . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
Schultz, Tanja . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
Schultz, Tanja . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
Schultz, Tanja . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
Schultz, Tanja . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
Schultz, Tanja . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
Schwab, Markus . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
Schwartz, Jean-Luc . . . . . . . . . . . . . . . . . . . . . . . . . 6
Schwartz, Jean-Luc . . . . . . . . . . . . . . . . . . . . . . . . 49
Schwartz, Richard . . . . . . . . . . . . . . . . . . . . . . . . . 79
Schwartz, Richard . . . . . . . . . . . . . . . . . . . . . . . . 100
Schwarz, Petr . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
Schweitzer, Antje . . . . . . . . . . . . . . . . . . . . . . . . . . 46
Sciamarella, Denisse . . . . . . . . . . . . . . . . . . . . . . 84
Scordilis, Michael S. . . . . . . . . . . . . . . . . . . . . . . . 40
Seabra Lopes, L. . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
Segarra, Encarna . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
Segura, José C. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
Segura, José C. . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
Seide, Frank. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .71
Sekiya, Toshiyuki . . . . . . . . . . . . . . . . . . . . . . . . . . 45
Selouani, Sid-Ahmed . . . . . . . . . . . . . . . . . . . . . 109
Seltzer, Michael L. . . . . . . . . . . . . . . . . . . . . . . . . . 44
Sendlmeier, Walter F. . . . . . . . . . . . . . . . . . . . . . . 87
Seneff, Stephanie . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
Seneff, Stephanie . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
Seneff, Stephanie . . . . . . . . . . . . . . . . . . . . . . . . . . 23
Seneff, Stephanie . . . . . . . . . . . . . . . . . . . . . . . . . . 27
Seneff, Stephanie . . . . . . . . . . . . . . . . . . . . . . . . . . 91
Sénica, N. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
Seo, Seongho . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
Seo, Seongho . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
Seppänen, Tapio . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
Serralheiro, António . . . . . . . . . . . . . . . . . . . . . . . 56
Seward, Alexander . . . . . . . . . . . . . . . . . . . . . . . . 40
Sha, Fei. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .35
Shabestary, Turaj Zakizadeh . . . . . . . . . . . . . 39
Shammass, Shaunie . . . . . . . . . . . . . . . . . . . . . . . 54
Shammass, Shaunie . . . . . . . . . . . . . . . . . . . . . . . 63
Shao, Xu . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
Shaw, Andrew T. . . . . . . . . . . . . . . . . . . . . . . . . . . 78
Sheikhzadeh, H. . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
Sheikhzadeh, H. . . . . . . . . . . . . . . . . . . . . . . . . . . 102
Sheng, Huanye . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
Sheykhzadegan, Javad . . . . . . . . . . . . . . . . . . . . 54
Shi, Bertram E. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
Shi, Rui P. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
Shiga, Yoshinori . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
Shiga, Yoshinori . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
Shigemori, Takeru . . . . . . . . . . . . . . . . . . . . . . . . . 51
Shikano, Kiyohiro . . . . . . . . . . . . . . . . . . . . . . . . . 19
Shikano, Kiyohiro . . . . . . . . . . . . . . . . . . . . . . . . . 20
Shikano, Kiyohiro . . . . . . . . . . . . . . . . . . . . . . . . . 52
Shikano, Kiyohiro . . . . . . . . . . . . . . . . . . . . . . . . . 76
Shikano, Kiyohiro . . . . . . . . . . . . . . . . . . . . . . . . . 79
Shikano, Kiyohiro . . . . . . . . . . . . . . . . . . . . . . . . . 85
Shikano, Kiyohiro . . . . . . . . . . . . . . . . . . . . . . . . . 92
Shikano, Kiyohiro . . . . . . . . . . . . . . . . . . . . . . . . . 93
Shimada, Yasuhiro . . . . . . . . . . . . . . . . . . . . . . . . 85
Shimizu, Tohru . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
Shimodaira, Hiroshi . . . . . . . . . . . . . . . . . . . . . . . 96
Shin, Jong-Won . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
Shingu, Masahisa . . . . . . . . . . . . . . . . . . . . . . . . . . 96
Shinozaki, Takahiro . . . . . . . . . . . . . . . . . . . . . . . 34
Shirai, Katsuhiko . . . . . . . . . . . . . . . . . . . . . . . . . . 15
Shirai, Katsuhiko . . . . . . . . . . . . . . . . . . . . . . . . . . 28
Shiraishi, Kimio . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
Shiraishi, Tatsuya . . . . . . . . . . . . . . . . . . . . . . . . . 79
Shriberg, Elizabeth . . . . . . . . . . . . . . . . . . . . . . . . 34
Shriberg, Elizabeth . . . . . . . . . . . . . . . . . . . . . . . . 71
Shriberg, Elizabeth . . . . . . . . . . . . . . . . . . . . . . . . 99
Shum, Heung-Yeung. . . . . . . . . . . . . . . . . . . . . . .59
Sigmund, Milan . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
Siivola, Vesa . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
Sikora, Thomas . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
Sikora, Thomas . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
Silva, Jorge . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
Simpson, Brian D. . . . . . . . . . . . . . . . . . . . . . . . . . 37
Simske, Steve . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
Sinervo, Ulpu . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
Singer, E. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
Singh, Rita . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
Singh, Rita . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
Siohan, Olivier . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
Siricharoenchai, Rungkarn . . . . . . . . . . . . . . . . . 4
Sista, Sreenivasa . . . . . . . . . . . . . . . . . . . . . . . . . 100
Sit, Chin-Hung . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
Siu, K.C. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
Siu, Man-Hung . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
Siu, Man-Hung . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
Sivadas, Sunil . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
Sivadas, Sunil . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
Sivakumaran, P. . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
Skowronek, Janto. . . . . . . . . . . . . . . . . . . . . . . . . .69
Skut, Wojciech . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
Smaïli, Kamel . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
Smaïli, Kamel . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
Smallwood, L. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
Smeele, Paula M.T. . . . . . . . . . . . . . . . . . . . . . . . . . 69
Smith, D.J. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
Smith, Jack . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
Soares, P. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
Sodoyer, David . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
Somervuo, Panu . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
Song, Hwa Jeon . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
Sönmez, Kemal . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
Soon, Chng Chin . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
Soong, Frank K. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
Soong, Frank K. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
Sornlertlamvanich, Virach . . . . . . . . . . . . . . . . 12
Sorokin, V.N. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
Spiess, Thurid . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
Sproat, Richard . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
Sreenivas, T.V. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
Sreenivas, T.V. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
Sridharan, Sridha . . . . . . . . . . . . . . . . . . . . . . . . . . 32
Sridharan, Sridha . . . . . . . . . . . . . . . . . . . . . . . . 106
Sridharan, Sridha . . . . . . . . . . . . . . . . . . . . . . . . 110
Sridharan, Sridha . . . . . . . . . . . . . . . . . . . . . . . . 110
Srinivasamurthy, Naveen . . . . . . . . . . . . . . . . . 39
Srinivasamurthy, Naveen . . . . . . . . . . . . . . . . 111
Srinivasan, Soundararajan . . . . . . . . . . . . . . . . 72
Srinivasan, Sriram . . . . . . . . . . . . . . . . . . . . . . . . . 49
Srivastava, Amit . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
Srivastava, Amit . . . . . . . . . . . . . . . . . . . . . . . . . . 100
Stadermann, Jan . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
Stahl, Christoph . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
Stallard, David . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
Stan, Sorel . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
Steidl, Stefan . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
Stemmer, Georg . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
Stemmer, Georg . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
Stent, Amanda . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
Stephenson, Todd A. . . . . . . . . . . . . . . . . . . . . . . 89
Stern, Richard M. . . . . . . . . . . . . . . . . . . . . . . . . . . 24
Stern, Richard M. . . . . . . . . . . . . . . . . . . . . . . . . . . 30
Stern, Richard M. . . . . . . . . . . . . . . . . . . . . . . . . . . 53
Stevens, Catherine. . . . . . . . . . . . . . . . . . . . . . . . .72
Stewart, Darryl . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
Stolbov, M. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
Stolcke, Andreas . . . . . . . . . . . . . . . . . . . . . . . . . . 34
Stolcke, Andreas . . . . . . . . . . . . . . . . . . . . . . . . . . 71
Story, Ezra . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
Stouten, Veronique . . . . . . . . . . . . . . . . . . . . . . . . . 1
Stouten, Veronique . . . . . . . . . . . . . . . . . . . . . . . . 13
Strassel, Stephanie . . . . . . . . . . . . . . . . . . . . . . . . 56
Strayer, Susan E. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
Strik, Helmer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
Strik, Helmer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
Strzalkowski, Tomek . . . . . . . . . . . . . . . . . . . . . . . 8
Stüker, Sebastian . . . . . . . . . . . . . . . . . . . . . . . . . . 36
Stüker, Sebastian . . . . . . . . . . . . . . . . . . . . . . . . . 111
Sturm, Janienke . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
Sturm, Janienke . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
Sturt, Christian . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
Sugimura, Toshiaki . . . . . . . . . . . . . . . . . . . . . . . . 96
Sugiyama, Masahide . . . . . . . . . . . . . . . . . . . . . . . 16
Suhadi . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
Suhm, Bernhard . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
Sujatha, J. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
Suk, Soo-Young . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
Sullivan, Kirk P.H. . . . . . . . . . . . . . . . . . . . . . . . . . 93
Sumita, Eiichiro . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
Sun, Hui . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
Sundaram, Shiva. . . . . . . . . . . . . . . . . . . . . . . . . . .43
Sung, Woo-Chang . . . . . . . . . . . . . . . . . . . . . . . . . . 51
Suontausta, Janne . . . . . . . . . . . . . . . . . . . . . . . . . . 8
Suzuki, Motoyuki . . . . . . . . . . . . . . . . . . . . . . . . . . 18
Suzuki, Noriko . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
Svaizer, Piergiorgio . . . . . . . . . . . . . . . . . . . . . . . . 18
Svendsen, Torbjørn . . . . . . . . . . . . . . . . . . . . . . 110
Svendsen, Torbjørn . . . . . . . . . . . . . . . . . . . . . . 110
Szarvas, Máté . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
Szarvas, Máté . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
T
Taddei, Hervé . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
Tadj, C. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
Tago, Junji . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
Takagi, Kazuyuki . . . . . . . . . . . . . . . . . . . . . . . . 112
Takagi, Kazuyuki . . . . . . . . . . . . . . . . . . . . . . . . 112
Takahashi, Shin-ya . . . . . . . . . . . . . . . . . . . . . . . . 23
Takami, Kazuaki . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
Takano, Sayoko . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
Takatani, Tomoya . . . . . . . . . . . . . . . . . . . . . . . . . 20
Takeda, Kazuya . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
Takeda, Kazuya . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
Takeuchi, Masashi . . . . . . . . . . . . . . . . . . . . . . . . . 22
Takeuchi, Yugo . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
Takezawa, Toshiyuki . . . . . . . . . . . . . . . . . . . . . . 14
Takezawa, Toshiyuki . . . . . . . . . . . . . . . . . . . . . . 98
Tam, Yik-Cheung . . . . . . . . . . . . . . . . . . . . . . . . . 111
Tamburini, Fabio . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
Tan, Wah Jin . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
Tanaka, Kazuyo . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
Tanaka, Kazuyo . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
Tang, Min . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
Tao, Jianhua . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
Tasoulis, Dimitris K. . . . . . . . . . . . . . . . . . . . . . . 59
Tatai, Gábor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
Tatai, Péter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
Tatai, Péter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
Tattersall, Graham . . . . . . . . . . . . . . . . . . . . . . . . 78
Teixeira, António . . . . . . . . . . . . . . . . . . . . . . . . . . 78
Teixeira, António . . . . . . . . . . . . . . . . . . . . . . . . 104
Teixeira, C. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
Teixeira, João Paulo . . . . . . . . . . . . . . . . . . . . . . . . 7
Teixeira, João Paulo . . . . . . . . . . . . . . . . . . . . . . . 15
ten Bosch, Louis . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
ten Bosch, Louis . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
ten Bosch, Louis . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
Terken, Jacques . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
Tesprasit, Virongrong . . . . . . . . . . . . . . . . . . . . . . 4
Tesprasit, Virongrong . . . . . . . . . . . . . . . . . . . . . 12
te Vrugt, Jürgen . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
Thambiratnam, K. . . . . . . . . . . . . . . . . . . . . . . . . . 32
Thies, Alexandra . . . . . . . . . . . . . . . . . . . . . . . . . . 29
Thomae, Matthias . . . . . . . . . . . . . . . . . . . . . . . . . 32
Thomas, Ryan . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
Thubthong, Nuttakorn . . . . . . . . . . . . . . . . . . . . . 6
Tian, Jilei . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
Tiede, Mark . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
Tihelka, Daniel . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
Tisato, Graziano . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
Toda, Tomoki . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
Toda, Tomoki . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
Toda, Tomoki . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
Toivanen, Juhani . . . . . . . . . . . . . . . . . . . . . . . . . . 26
Tokuda, Keiichi . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
Tokuda, Keiichi . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
Tokuda, Keiichi . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
Tokuda, Keiichi . . . . . . . . . . . . . . . . . . . . . . . . . . 112
Tokuma, Shinichi . . . . . . . . . . . . . . . . . . . . . . . . . . 75
Tokuma, Won . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
Tolba, Hesham . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
Torge, Sunna . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
Torres, Francisco . . . . . . . . . . . . . . . . . . . . . . . . . . 22
Torres-Carrasquillo, P.A. . . . . . . . . . . . . . . . . . . 47
Tóth, László . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
Trancoso, Isabel . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
Trancoso, Isabel . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
Tremoulis, George . . . . . . . . . . . . . . . . . . . . . . . . . 19
Tremoulis, George . . . . . . . . . . . . . . . . . . . . . . . . . 60
Trentin, Edmondo . . . . . . . . . . . . . . . . . . . . . . . . . 64
Trippel, Thorsten. . . . . . . . . . . . . . . . . . . . . . . . . .29
Trippel, Thorsten. . . . . . . . . . . . . . . . . . . . . . . . . .80
Trost, Harald . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
Tsai, Wei-Ho . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
Tsai, Wei-Ho . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
Tsakalidis, Stavros . . . . . . . . . . . . . . . . . . . . . . . . 70
Tseng, Chiu-yu . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
Tseng, Chiu-yu . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
Tseng, Shu-Chuan . . . . . . . . . . . . . . . . . . . . . . . . . . 4
Tsourakis, N. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
Tsubota, Yasushi . . . . . . . . . . . . . . . . . . . . . . . . . 112
Tsuge, Satoru . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
Tsujino, Hiroshi . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
Tsuruta, Naoyuki . . . . . . . . . . . . . . . . . . . . . . . . . . 23
Tsuzaki, Minoru . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
Tur, Gokhan . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
Tur, Gokhan . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
Turajlic, Emir . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
Turajlic, Emir. . . . . . . . . . . . . . . . . . . . . . . . . . . . .104
Turk, Oytun . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
Turk, Oytun . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
Türk, Ulrich . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
Turunen, Markku . . . . . . . . . . . . . . . . . . . . . . . . . . 27
Turunen, Markku . . . . . . . . . . . . . . . . . . . . . . . . . . 67
Tyagi, Vivek. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .34
U
Ueno, Shinichi . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
Unoki, Masashi . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
Utsuro, Takehito . . . . . . . . . . . . . . . . . . . . . . . . . . 42
V
Vafin, Renat. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .38
Vair, C. . . . . . . . . . . . . . . . . . . . . . . . . . . . .