6-10 September 2009
CONFERENCE PROGRAMME & ABSTRACT BOOK
Speech and Intelligence
Interspeech 2009
Brighton UK
10th Annual Conference of the
International Speech
Communication Association
www.interspeech2009.org
Map of Venue & Central Brighton
Conference venue highlighted with red arrow.
Map courtesy of Visit Brighton.
6-10 September 2009
The Brighton Centre
Brighton, United Kingdom
Copyright © 2009 International Speech Communication Association
http://www.isca-speech.org
[email protected]
All rights reserved.
ISSN 1990-9772 Proceedings CD-ROM
Editors: Maria Uther, Roger Moore, Stephen Cox
Acknowledgements:
Photos of Brighton and map of Brighton courtesy of Visit Brighton (copyright holder), Brighton & Hove City
Council. Thanks also to our colleagues on the organising committee, Meeting Makers and Brighton &
Hove City Council for contributing relevant information for this booklet. Thanks to Dr. James Uther for
assistance on various elements of the booklet.
Abstracts and Proceedings CD-ROM Production by
Causal Productions Pty Ltd
http://www.causalproductions.com
[email protected]
Welcome to Interspeech 2009 .................................................................................................... 4
Interspeech 2009 Information ..................................................................................................... 8
General Information and the City of Brighton............................................................................ 10
Interspeech 2009 Social Programme........................................................................................ 13
Interspeech 2009 Organisers ................................................................................................... 14
Sponsors and Exhibitors........................................................................................................... 20
Supporting Committee Institutions............................................................................................ 22
Contest for the Loebner Prize 2009 .......................................................................................... 24
Satellite Events ........................................................................................................................ 26
Interspeech 2009 Keynote Sessions ........................................................................................ 28
Interspeech 2009 Special Sessions.......................................................................................... 32
Interspeech 2009 Tutorials - Sunday 6 September 2009 .......................................................... 35
Public Engagement Events ...................................................................................................... 39
Session Index........................................................................................................................... 40
Interspeech 2009 Programme by Day ...................................................................................... 42
Abstracts .................................................................................................................................. 47
Author Index............................................................................................................................177
Venue Floorplan .....................................................................................................................187
Welcome to Interspeech 2009
Message from the ISCA President
On behalf of the International Speech Communication Association (ISCA), welcome to
INTERSPEECH 2009 in Brighton. Ten years ago, in Budapest, we took the first steps towards
creating what later became a unified Association and a unified conference, joining the best of the
former EUROSPEECH and ICSLP conferences. This conference is the 10th in the long cycle that
bears the INTERSPEECH label and that already includes conferences in such wonderful venues as
Beijing, Aalborg, Denver, Geneva, Jeju, Lisbon, Pittsburgh, Antwerp and Brisbane.
I would like to start this long message of thanks and commendation by honouring the ISCA Medalist
for 2009, Prof. Sadaoki Furui, for his outstanding contributions to speech processing, and leadership
as one of the first Presidents of the International Speech Communication Association.
At this meeting we shall also recognize several other ISCA members who, for their technical merit
and service to the Association, were recently elected as ISCA Fellows: Anne Cutler, Wolfgang Hess,
Joseph Mariani, Hermann Ney, Roberto Pieraccini, and Elizabeth Shriberg.
Next on my list of ISCA volunteers to thank are the members who promote research in speech
science and technology by giving lectures in different parts of the world as ISCA
Distinguished Lecturers: Rich Stern, Abeer Alwan, and James Glass. ISCA's growing efforts to
promote speech science and technology research are reflected in the work of Special Interest
Groups and International Sub-Committees, in the many workshops spanning
multidisciplinary areas that continuously enlarge our electronic archive, in the increasing number of
grants to students, in our very long newsletter ISCApad, and in many other activities. Of particular
relevance are those undertaken by our very active Student Advisory Committee, which recently
launched a resume posting service. Thank you all for your continuing help and support!
The coordination of all these activities is the responsibility of the ISCA Board, and I would like to
particularly thank the two members who have completed their terms in 2009 for all their efforts for
the community: Eva Hajičová and Lin-shan Lee. The past year has been a year of expansion, but
also of consolidation of many new activities. This was the prime motivation for enlarging the
Board to 14 members. I take this opportunity to welcome the new members: Nick Campbell, Keikichi
Hirose, Haizhou Li, Douglas O'Shaughnessy, and Yannis Stylianou.
For INTERSPEECH 2009, over thirteen hundred papers were submitted, and approximately 57% of
the regular papers were selected for presentation after the peer review process, with at least 3
reviews per paper. Many members of the international community participated in this review process,
and I hope to have the chance to thank you all personally in Brighton.
The large number of submissions marks INTERSPEECH conferences as the major annual
conferences in speech science and technology, a role that we would like to further enhance by
getting these conferences included in major citation indexes. We hope that this number keeps
increasing at the next events in Makuhari (2010), Florence (2011), and Portland (2012).
This conference is chaired by the first ISCA President, Prof. Roger Moore. We are particularly
appreciative of his organizing team for testing a new model for INTERSPEECH organization,
independent of any university sponsorship. For this bold step, and for all the work and devotion that
you have put into organizing this conference, thank you all so much!
Having heard you discuss your plans for almost 4 years, we are looking forward to a superbly
organized conference, with excellent keynotes and tutorials, interesting sessions, a very lively social
programme, and innovations such as the Loebner award.
I invite you all to join us in celebrating the 10th anniversary of ISCA and wish you a very successful
conference.
Isabel Trancoso, ISCA President
Message from the General Chair
Dear Colleague,
On behalf of the organising team, it is with great personal pleasure that
I welcome you to Brighton and to INTERSPEECH-2009: the 10th
Annual Conference of the International Speech Communication
Association (ISCA).
It has been almost four years since we put together the bid to host
INTERSPEECH in the UK, and it has been a truly momentous
experience for everyone involved. A large number of people have
worked very hard to bring this event to fruition, so I sincerely hope that
everything will run as smoothly as possible, and that you have an
enjoyable and productive time with us here in Brighton.
The theme of this year’s conference is Speech and Intelligence, and we
have arranged a number of special events in line with this. For
example, on Sunday we are hosting the 19th annual Loebner Prize for
Artificial Intelligence (a text-based instantiation of a Turing test) run by Hugh Loebner himself. Also
on the Sunday, and inspired by the Loebner Prize, for the first time we will be attempting to run a
real-time speech-based version of the Turing test. Although we only have a couple of contestants,
we hope that it will be an informative and entertaining aspect of the day’s activities. Another
conference event related to Speech and Intelligence is a special semi-plenary discussion that is
scheduled to take place on Tuesday at 16:00 (Tue-Ses3-S1). This will involve a number of
distinguished panellists who have agreed to engage in a lively Q&A interaction with the audience.
Please come along and join in with the debate.
As well as the two competitions, other events taking place on Sunday include eight excellent and
varied Tutorials presented by fourteen top quality lecturers. Tutorials have become a popular
feature of INTERSPEECH conferences, and several hundred attendees take advantage of the
opportunity to learn at first-hand some of the core scientific principles underpinning different aspects
of our developing field. Another first for INTERSPEECH-2009 is Sunday’s public exhibition of
speech-related activities. Public engagement with science is an important issue in modern society,
and we are very grateful for the help and support that we have received to mount this event from the
UK Engineering and Physical Sciences Research Council funded science outreach project Walking
with Robots (http://www.walkingwithrobots.org/). It will be interesting to see what the people of
Brighton make of our particular brand of science and technology – let’s hope for some positive
feedback!
Monday sees the beginning of the main conference programme and, after the opening ceremony
(during which we will pay tribute to our recently departed senior colleague and ESCA Medallist,
Gunnar Fant), we are honoured to welcome this year’s ISCA Medallist, Prof. Sadaoki Furui (Tokyo
Institute of Technology), who will be presenting the first Keynote talk of the conference. Sadaoki’s
subject is “Selected Topics from 40 Years of Research on Speech and Speaker Recognition”, and
I’m sure that he will provide us with a wealth of interesting insights into the progress that he has
seen at the forefront of these areas of research.
The main technical programme of oral and poster sessions starts after lunch on Monday and runs
through until Thursday afternoon. This year we received an almost record number of submissions:
1303 by the published April deadline. These were assessed by 645 reviewers, and the resulting
~4000 reviews were organised by the 24 Area Coordinators so that the final accept/reject decisions
could be made at the Technical Programme Committee meeting held in London at the beginning of
June. This careful selection process resulted in the acceptance of 762 papers (707 in the main
programme and 55 in special sessions), all of which means that we have a total of 38 oral sessions,
39 poster sessions and 10 special sessions at this year’s conference.
In addition to the main programme, each day starts with a prestigious Keynote talk from a
distinguished scientist of international standing. On Tuesday, Tom Griffiths (UC Berkeley) will present
his talk entitled “Connecting Human and Machine Learning via Probabilistic Models of Cognition”; on
Wednesday, Deb Roy (MIT Media Lab) promises to lead us towards “New Horizons in the Study of
Language Development”; and on Thursday, Mari Ostendorf (University of Washington) will address
her topic of “Transcribing Speech for Spoken Language Processing”. Keynote presentations are
often the scientific highlight of any conference, so I hope that, like me, you are looking forward to
some stimulating early morning talks.
Alongside the regular sessions, we also have a number of special sessions, each of which is
devoted to a ‘hot’ topic in spoken language processing. Daniel Hirst has organised a session on
‘Measuring the rhythm of speech’; Oliver Lemon and Olivier Pietquin have put together a session on
‘Machine learning for adaptivity in spoken dialogue systems’; Carol Espy-Wilson, Jennifer Cole,
Abeer Alwan, Louis Goldstein, Mary Harper, Elliot Saltzman and Mark Hasegawa-Johnson have
arranged a session on ‘New approaches to modeling variability for automatic speech recognition’;
Bruce Denby and Tanja Schultz have put together a session on ‘Silent speech interfaces’; Bjoern
Schuller, Stefan Steidl and Anton Batliner have organised the ‘INTERSPEECH 2009 emotion
challenge’; Anna Barney and Mette Pedersen have gathered together people interested in ‘Advanced
voice function assessment’; Nick Campbell, Anton Nijholt, Joakim Gustafson and Carl Vogel are
responsible for a session on ‘Active listening and synchrony’; and Mike Cohen, Johan Schalkwyk
and Mike Phillips have organised a session on ‘Lessons and challenges deploying voice search’.
As well as the scientific programme, we have also arranged a series of social events and activities.
Unfortunately, due to the sheer number of attendees, we had to abandon the idea of holding a Party
on the Pier. Instead, we are very pleased to have found an excellent alternative, Revelry at the
Racecourse, which is taking place high above the town with stunning seaward views for an evening
of food and fun. Other events include the Welcome Reception at the Brighton Museum and Art
Gallery, the Students’ Reception at the stylish Italian Al Duomo restaurant, and the Reviewers’
Reception at the amazing Royal Pavilion.
These organised social events are just a few of the opportunities that you will have to discover the
delights of the local area. Brighton is a vibrant British seaside town, so I hope that you will enjoy
exploring its many attractions and perhaps (weather permitting) the beach.
As I mentioned above, an event the size of INTERSPEECH simply cannot take place without the
help and support of a large number of individuals, many of whom give their services freely despite
the many other calls on their time. I would particularly like to thank Stephen Cox (University of East
Anglia) for taking on the extremely time-consuming role of Technical Programme Chair and Valerie
Hazan (University College London) for diligently looking after the most crucial aspect of the whole
operation – the financial budget. In fact, due to the lack of underwriting by any particular institution,
we were obliged to adopt a very different financial model for running INTERSPEECH this year. So I
would also like to thank Valerie and Stephen for agreeing to join me in taking on these additional
responsibilities, and ISCA for providing extra help with managing our cash flow.
I would also like to thank the rest of the organising committee: Anna Barney (University of
Southampton) for putting together a fun social programme; Andy Breen (Nuance UK) for doing a
tremendous job raising sponsorship in a very difficult financial climate; Shona D'Arcy (Trinity College,
Dublin) for liaising with the students and organising the student helpers; Thomas Hain (University of
Sheffield) for organising an impressive array of tutorials; Mark Huckvale (University College London)
for doing a superb job as web master and for significantly upgrading the submission system; Philip
Jackson (University of Surrey) for liaising with Hugh Loebner and organising the Loebner Prize
competition; Peter Jancovic (University of Birmingham) for coordinating the different meeting room
requirements; Denis Johnston (Haughgate Innovations) for helping with the sponsorship drive;
Simon King (University of Edinburgh) for providing a contact point for the satellite workshops; Mike
McTear (University of Ulster) for helping to smooth the registration process; Ben Milner (University of
East Anglia) for organising the exhibition; Ji Ming (Queen's University Belfast) for coordinating the
special sessions; Steve Renals (University of Edinburgh) for arranging the plenary sessions; Martin
Russell (University of Birmingham) for looking after publicity and for producing a terrific poster; Maria
Uther (Brunel University) for preparing the abstract book and conference proceedings; Simon
Worgan (University of Sheffield) for organising the public outreach event and the speech-based
competition; and Steve Young (University of Cambridge) for assisting with obtaining an opening
speaker.
I would also like to thank the team at Meeting Makers – our Professional Conference Organisers –
who have brought our vision into reality by providing valuable help and advice along the way.
I would particularly like to thank our generous sponsors, without whose support it would have been
very difficult to mount the event. When we started this process in 2005, who could have envisaged
the dire financial situation faced by the world’s economies and banking systems in 2009? It is a
great relief to us that our sponsors have found the means to provide us with support in these difficult
times. We are especially grateful to Brighton & Hove City Council for their subvention towards the
costs of the conference centre and to Visit Brighton for their encouragement in bringing a large
conference such as INTERSPEECH to the UK.
Finally, I would like to thank everyone who submitted a paper to this year’s conference (whether they
were successful or not), the Reviewers for diligently evaluating them, the Area Chairs for putting
together a varied and high-quality programme, and the Session Chairs and Student Helpers for
ensuring a smooth running of the event itself.
I do hope that you will have an enjoyable and productive time here in Brighton and that you will leave
with fond memories of INTERSPEECH-2009 – the place, the people, and the scientific exchanges
you engaged in while you were here.
With best wishes for a successful conference.
Roger Moore, Conference Chair, Interspeech 2009
Interspeech 2009 Information
Venue
Interspeech 2009 will take place in the Brighton Centre. The Brighton Centre is located on the
King's Road, about a 10 minute walk from Brighton station.
All keynote talks are in the Main Hall. See conference programme for venues of other sessions.
Registration
The Conference Registration Desks are located in the main foyer. For registration or any
administrative issues please enquire at the Conference Registration Desks. These desks will be
open at the following times:
Tutorial registration will be open on Sunday 6 September from 0830 hours to 1430 hours.
Conference registration will be open at the following times:
Sunday 6 September: 1400 – 1800 hours
Monday 7 September: 0900 – 1700 hours
Tuesday 8 September: 0800 – 1700 hours
Wednesday 9 September: 0800 – 1700 hours
Thursday 10 September: 0800 – 1700 hours
The full registration package includes:
• Entry to all conference sessions (excluding satellite workshops and tutorial sessions)
• Conference bag containing:
o Abstract book and conference programme
o CD-ROM of Conference Proceedings
o Promotional material
• Welcome Reception at Brighton Museum (Monday 7th September)
• Revelry at the Racecourse (Wednesday 9th September)
• Coffee breaks as per programme
• Badge
Badge
Your name badge, issued to you when you register, must be worn to all conference sessions
and social events for identification and security purposes.
Non-Smoking event
Smoking is not permitted anywhere inside the Conference Centre.
Language
The official language of Interspeech 2009 is English.
Internet access
Wi-fi access is available throughout the Brighton Centre. Internet access is also provided on
allocated PCs in the Rainbow Room on the Ground Floor, which will be open from Monday to
Thursday during conference opening hours. The following website provides details of wireless
access points in Brighton: http://www.brighton.co.uk/Wireless_Hotspots/.
Speaker Preparation Room
The Speaker Preparation Room is located in the Sunrise Room. If you are presenting an oral
paper, you must load your presentation onto the central fileserver and check that it displays
correctly well before your talk.
We recommend you do this well in advance of your presentation and certainly no later
than two hours beforehand.
The room will be open during the following times:
Sunday 6th September: 1400 – 1800 hours
Monday 7th September: 0900 – 1700 hours
Tuesday 8th September: 0830 – 1700 hours
Wednesday 9th September: 0830 – 1700 hours
Thursday 10th September: 0830 – 1700 hours
Coffee Breaks
These are scheduled according to the programme and will be served in the Hewison Hall and
foyers. All those with special dietary requirements should make themselves known to Centre
Staff who will provide alternative catering.
Lunch Breaks
Lunch is not included as part of your registration. There are several cafes and restaurants in
Brighton from which you can purchase lunch.
Insurance
Registration fees do not include personal, travel or medical insurance of any kind. Delegates
are advised when registering for the conference and booking travel that a travel insurance
policy should be taken out to cover risks including (but not limited to) loss, cancellation, medical
costs and injury. Interspeech 2009 and/or the conference organisers will not accept any
responsibility for any delegate failing to insure.
General Information and the City of Brighton
Getting There
Rail
Brighton is under an hour by rail from London Victoria station, with 2 services every hour. There
are also regular services from many other points, including a direct service from St Pancras
station, connecting with Eurostar, and a link to Gatwick and Luton airports. More information
can be found at www.nationalrail.co.uk or www.firstcapitalconnect.co.uk
Road
Brighton is about 45 minutes from the M25 London orbital down the M23 motorway, and 30
minutes from Gatwick airport.
Coach
Regular services to Brighton depart from London, Heathrow and Gatwick airports, and many
other locations in the U.K. See www.nationalexpress.com for further information.
Air
Brighton is just 30 minutes by road or rail from London Gatwick International Airport and 90
minutes by road from London Heathrow. There are fast coach links between Heathrow, Gatwick
and Brighton.
Sea
Brighton is 20 minutes by road or 25 minutes by rail from the port of Newhaven where a ferry
service operates to Dieppe. See www.aferry.com/visitbrighton/ for more details.
Car Rental
Major car rental companies have offices located at the major airports. To drive in the U.K. you
must have a current driver’s license. Note that cars travel on the left side of the road in the U.K.
Brighton
Brighton is one of the most colourful, vibrant and creative cities in Europe. It has a very
cosmopolitan flavour and is compact, energetic, unique, fun, lively, historic and free-spirited.
Nestling between the South Downs & the sea on the south coast, Brighton offers everything
from Regency heritage to beachfront cafes and a lively nightlife. It is a fantastic mix of iconic
attractions, award winning restaurants, funky arts, cultural festivals and events.
Time Zone
Brighton and the U.K. in general are in British Summer Time (BST) at the time of the conference,
BST = UTC (Greenwich) + 1.
Money and Credit Cards
Currency is the British Pound (GBP). Major international credit and charge cards such as Visa,
American Express and MasterCard are widely accepted at retail outlets. Travellers’ cheques are
also widely accepted and can be cashed at banks, airports and major hotels.
Transportation in Brighton
Brighton and Hove is so compact that, once you’re here, you might find it easiest to explore the
city on foot. A frequent bus service also runs, with a standard ticket across the city costing £1.80;
various day tickets are available for £3.60 or £4.50 depending on the zone covered. Brighton is also
one of five nationally selected ‘cycling demonstration towns’, and bicycles may be hired in a
number of shops.
Accommodation
The Brighton area has an abundance of different kinds of accommodation from budget to luxury
hotels. We recommend the Visit Brighton website to find accommodation to suit your needs:
http://www.visitbrighton.com/site/accommodation
Electrical Voltage
The electrical supply in the U.K. is 230-240 volts, AC 50Hz. The U.K. three-pin power outlet is
different from that in many other countries, so you will need an adaptor socket. If your
appliances are 110-130 volts you will need a voltage converter. Universal outlet adaptors for
both 240V and 110 V appliances
are sometimes available in leading hotels.
Tipping
There are no hard and fast rules for tipping in the U.K. If you are happy with the service, a 10-15% tip is customary, particularly in a restaurant or café with table service. Tipping in bars is not
expected. For taxi fares, it is usual to round up to the nearest pound (£).
Non-Smoking Policy
The UK smoking laws prevent smoking in enclosed public spaces. It is therefore not acceptable
to smoke in restaurants, bars and other public venues. There may be designated smoking areas.
Shop Opening Hours
Shopping hours tend to be from 1000 – 1800 hours with late opening until 2000 hours on a
Thursday.
Emergencies
Please dial 999 for Fire, Ambulance or Police emergency services. The European emergency
number 112 may also be used.
Medical Assistance
In the case of medical emergencies, there is a 24 hour Accident & Emergency department at
the Royal Sussex County Hospital, Eastern Road, BN2 5BF - 01273 696955 (For ambulances,
telephone 999).
The Doctor's Surgery for temporary residents and visitors is:
Chilvers McCrea Medical Centre, 1st Floor Boots the Chemist, 129 North Street, Brighton,
01273 328080. Open: Mon - Friday 08.00 - 18.00, Saturday 09.00 - 13.00; closed on Sundays.
For emergency dental treatment (out of hours) call the Brighton & Hove health authority on
01273 486444 (lines open weekdays 06.30 - 21.30; weekends and bank holidays 09.00 - 12.30).
Pharmacies
The following pharmacies open later than normal working hours:
Ashtons: 98 Dyke Road, Seven Dials, Brighton, BN1 3JD, 01273 325020
Open: Mon - Sunday 09.00 - 22.00 except for 25 December
Asda: Brighton Marina, BN2 5UT, 01273 688019
Open: Mon - Thursday and Saturday: 09.00 - 20.00 Friday: 09.00 - 21.00 Sunday: 10.00 - 16.00
(times vary on Bank Holidays)
Asda: Crowhurst Road, Brighton, BN1 8AS, 01273 542314
Open: Mon - Thursday & Saturday: 09.00 - 20.00 Friday: 09.00 - 21.00 Sunday: 11.00 - 17.00
(times vary on Bank Holidays)
Westons: 6-7 Coombe Terrace, Lewes Road, Brighton, BN2 4AD, 01273 605354
Open: Mon - Sunday 09.00 - 22.00
Places of Interest
• The Royal Pavilion – the seaside palace of the Prince Regent (George IV) – www.royalpavilion.org.uk.
• Brighton Walking Tours – Hear about Brighton’s history and discover interesting facts by downloading www.coolcitywalks.com/brighton/index.php onto your mp3 player.
• Brighton Pier – enjoy a typical day at the seaside on Brighton Pier. Enjoy the funfair rides, enjoy traditional seaside treats like candyfloss or a stick of rock, or hire a deckchair and relax.
• Enjoy a traditional afternoon tea at the Grand hotel – www.grandbrighton.co.uk.
• Take a 45 minute pleasure cruise along the coast and past the 2 piers or join a 90 minute mackerel fishing trip – www.watertours.co.uk.
Dining and Entertainment
There are numerous cafes and restaurants in the Brighton area, particularly around ‘The
Lanes’ and ‘North Laine’.
Interspeech 2009 Social Programme
Monday 7th September, Welcome Reception
Everyone is invited to the Welcome Reception to be held at the Brighton Museum and Art
Gallery, part of the historic Brighton Pavilion Estate. Event starts at 19:30.
Tuesday 8th September, Reviewers Reception
A reception for our hardworking reviewers will take place at the Royal Pavilion, the spectacular
seaside palace of the Prince Regent (King George IV) transformed by John Nash between 1815
and 1822 into one of the most dazzling and exotic buildings in the British Isles. Event starts at
19:30. Admission by ticket enclosed with conference pack.
Tuesday 8th September, Student delegates drinks reception
Also on Tuesday evening there will be a drinks reception for student delegates at the stylish
Italian Al Duomo Restaurant in the heart of the town. Event starts at 19:30. Admission by ticket
enclosed with conference pack.
Wednesday 9th September, Revelry at the Racecourse
Look out for Revelry at the Racecourse! The Brighton Racecourse, set high on the Sussex
Downs with stunning views of Brighton and Hove, will host the Conference Dinner and Party. No
live horses, but expect a conference event with a difference and plenty of fun. Everyone is
welcome and the event will start at 19:30. Transport to and from the event will be provided.
Interspeech 2009 Organisers
Conference Committee
General Chair: Prof. Roger Moore, University of Sheffield
Technical Programme: Prof. Stephen J. Cox, University of East Anglia
Finance: Prof. Valerie Hazan, University College London
Publications: Dr. Maria Uther, Brunel University
Web Master: Dr. Mark Huckvale, University College London
Plenary Sessions: Prof. Steve Renals, University of Edinburgh
Tutorials: Dr. Thomas Hain, University of Sheffield
Special Sessions: Dr. Ji Ming, Queen's University Belfast
Satellite Workshops: Dr. Simon King, University of Edinburgh
Sponsorship: Dr. Andrew Breen, Nuance UK
Exhibits: Dr. Ben Milner, University of East Anglia
Publicity: Prof. Martin Russell, University of Birmingham
Social Programme: Dr. Anna Barney, University of Southampton
Industrial Liaison: Mr. Denis Johnston, Haughgate Innovations
Advisor: Prof. Steve Young, University of Cambridge
Student Liaison: Dr. Shona D'Arcy, Trinity College, Dublin
Public Outreach: Dr. Simon Worgan, University of Sheffield
Registration: Prof. Michael McTear, University of Ulster
Loebner Contest: Dr. Philip Jackson, University of Surrey
Meeting Rooms: Dr. Peter Jancovic, University of Birmingham
Conference Organisers
Meeting Makers Ltd
76 Southbrae Drive
Glasgow G13 1PP
Tel: +44 (0) 141 434 1500
Fax: +44 (0) 141 434 1519
[email protected]
Scientific Committee
The conference organisers are indebted to the following people who made such a large
contribution to the creation of the technical programme.
Technical Chair
• Stephen Cox, University of East Anglia

Area Coordinators
1. Aladdin Ariyaeeinia, University of Hertfordshire
2. Nick Campbell, Trinity College Dublin
3. Mark J. F. Gales, Cambridge University
4. Yoshi Gotoh, Sheffield University
5. Phil Green, Sheffield University
6. Valerie Hazan, University College London
7. Wendy Holmes, Aurix Ltd
8. David House, KTH Stockholm
9. Kate Knill, Toshiba Research Europe Ltd
10. Bernd Möbius, Stuttgart
11. Sebastian Möller, Deutsche Telekom Laboratories
12. Satoshi Nakamura, NICT/ATR
13. Kuldip Paliwal, Griffith University
14. Gerasimos Potamianos, Athens
15. Ralf Schlüter, RWTH Aachen University
16. Tanja Schultz, Carnegie Mellon University
17. Yannis Stylianou, University of Crete
18. Marc Swerts, Tilburg University
19. Isabel Trancoso, INESC-ID Lisboa / IST
20. Saeed Vaseghi, Brunel University
21. Yi Xu, University College London
22. Bayya Yegnanarayana, International Institute of Information Technology Hyderabad
23. Kai Yu, Cambridge University
24. Heiga Zen, Toshiba Research Ltd.

Special Session Organisers
• Abeer Alwan, UCLA
• Anna Barney, ISVR, University of Southampton
• Anton Batliner, FAU Erlangen-Nuremberg
• Nick Campbell, Trinity College Dublin
• Mike Cohen, Google
• Jennifer Cole, Illinois
• Bruce Denby, Université Pierre et Marie Curie
• Carol Espy-Wilson, University of Maryland
• Louis Goldstein, University of Southern California
• Joakim Gustafson, KTH Stockholm
• Mary Harper, University of Maryland
• Mark Hasegawa-Johnson, Illinois
• Daniel Hirst, Universite de Provence
• Oliver Lemon, Edinburgh University
• Anton Nijholt, Twente
• Mette Pedersen, Medical Centre, Voice Unit, Denmark
• Mike Phillips, Vlingo
• Olivier Pietquin, IMS Research Group
• Elliot Saltzman, Haskins Laboratories
• Johan Schalkwyk, Google
• Björn Schuller, Technische Universität München
• Tanja Schultz, Carnegie Mellon University
• Stefan Steidl, FAU Erlangen-Nuremberg
• Carl Vogel, Trinity College Dublin
Scientific Reviewers
Alberto Abad
Sherif Abdou
Alex Acero
Andre Gustavo Adami
Gilles Adda
Martine Adda-Decker
Masato Akagi
Masami Akamine
Murat Akbacak
Jan Alexandersson
Paavo Alku
Abeer Alwan
Eliathamby Ambikairajah
Noam Amir
Ove Andersen
Tim Anderson
Elisabeth Andre
Walter Andrews
Takayuki Arai
Masahiro Araki
Aladdin Ariyaeeinia
Victoria Arranz
Bishnu Atal
Roland Auckenthaler
Cinzia Avesani
Matthew Aylett
Harald Baayen
Michiel Bacchiani
Pierre Badin
Janet Baker
Srinivas Bangalore
Plinio Barbosa
Etienne Barnard
Anna Barney
Dante Barone
William J. Barry
Anton Batliner
Frederic Bechet
Steve Beet
Homayoon Beigi
Nuno Beires
Jerome Bellegarda
Jose Miguel Benedi Ruiz
M. Carmen Benitez Ortuzar
Nicole Beringer
Kay Berkling
Laurent Besacier
Frédéric Bettens
Frederic Bimbot
Maximilian Bisani
Judith Bishop
Alan Black
Mats Blomberg
Gerrit Bloothooft
Antonio Bonafonte
Jean-Francois Bonastre
Zinny Bond
Helene Bonneau-Maynard
Herve Bourlard
Lou Boves
Daniela Braga
Bettina Braun
Catherine Breslin
John Bridle
Mirjam Broersma
Niko Brummer
Udo Bub
Luis Buera
Tim Bunnell
Anthony Burkitt
Denis Burnham
Bill Byrne
Jose R. Calvo de Lara
Joseph P. Campbell
Nick Campbell
William Campbell
Antonio Cardenal Lopez
Valentin Cardenoso-Payo
Michael Carey
Rolf Carlson
Maria Jose Castro Bleda
Mauro Cettolo
Joyce Chai
Chandra Sekhar Chellu
Fang Chen
Yan-Ming Cheng
Rathi Chengalvarayan
Jo Cheolwoo
Jen-Tzung Chien
KK Chin
Gerard Chollet
Khalid Choukri
Heidi Christensen
Hyunsong Chung
Robert Clark
Mike Cohen
Luisa Coheur
Jennifer Cole
Alistair Conkie
Martin Cooke
Ricardo de Cordoba
Piero Cosi
Bob Cowan
Felicity Cox
Catia Cucchiarini
Fred Cummins
Francesco Cutugno
Joachim Dale
Paul Dalsgaard
Geraldine Damnati
Morena Danieli
Marelie Davel
Chris Davis
Amedeo De Dominicis
Angel de la Torre Vega
Renato De Mori
Carme de-la-Mota Gorriz
David Dean
Michael Deisher
Grazyna Demenko
Kris Demuynck
Bruce Denby
Matthias Denecke
Li Deng
Giuseppe Di Fabbrizio
Vassilis Digalakis
Christine Doran
Ellen Douglas-Cowie
Christoph Draxler
Jasha Droppo
Andrzej Drygajlo
Jacques Duchateau
Christophe D'Alessandro
Mariapaola D'Imperio
Kjell Elenius
Daniel Ellis
Ahmad Emami
Julien Epps
Anders Eriksson
Mirjam Ernestus
David Escudero Mancebo
Carol Espy-Wilson
Sascha Fagel
Daniele Falavigna
Mauro Falcone
Isabel Fale
Kevin Farrell
Marcos Faundez-Zanuy
Marcello Federico
Junlan Feng
Javier Ferreiros
Carlos A. Ferrer Riesgo
Tim Fingscheidt
Janet Fletcher
José A. R. Fonollosa
Kate Forbes-Riley
Eric Fosler-Lussier
Diamantino Freitas
Juergen Fritsch
Sonia Frota
Christian Fuegen
Olac Fuentes
Hiroya Fujisaki
Toshiaki Fukada
Sadaoki Furui
Mark J. F. Gales
Ascensiòn Gallardo Antolìn
Yuqing Gao
Fernando Garcia Granada
Carmen Garcia Mateo
Mária Gósy
Panayiotis Georgiou
Dafydd Gibbon
Mazin Gilbert
Juan Ignacio Godino Llorente
Simon Godsill
Roland Goecke
Pedro Gomez Vilda
Joaquin Gonzalez-Rodriguez
Allen Gorin
Martijn Goudbeek
Philippe Gournay
Bjorn Granstrom
Agustin Gravano
David Grayden
Phil Green
Rodrigo C Guido
Susan Guion
Ellen Haas
Kadri Hacioglu
Reinhold Haeb-Umbach
Jari Hagqvist
Thomas Hain
John Hajek
Eva Hajicova
Dilek Hakkani-Tur
Gerhard Hanrieder
John H.L. Hansen
Mary Harper
Jonathan Harrington
John Harris
Naomi Harte
Mark Hasegawa-Johnson
Jean-Paul Haton
Valerie Hazan
Timothy J. Hazen
Harald Höge
Matthieu Hebert
Martin Heckmann
Per Hedelin
Peter Heeman
Paul Heisterkamp
John Henderson
Inmaculada Hernaez Rioja
Luis Alfonso Hernandez Gomez
Javier Hernando
Wolfgang Hess
Lee Hetherington
Ulrich Heute
Keikichi Hirose
Hans Guenter Hirsch
Julia Hirschberg
Daniel Hirst
Wendy Holmes
W. Harvey Holmes
Chiori Hori
John-Paul Hosom
David House
Fei Huang
Qiang Huang
Isabel Hub Faria
Juan Huerta
Marijn Huijbregts
Melvyn Hunt
Lluis F. Hurtado Oliver
John Ingram
Shunichi Ishihara
Masato Ishizaki
Yoshiaki Itoh
Philip Jackson
Tae-Yeoub Jang
Esther Janse
Arne Jönsson
Alexandra Jesse
Luis Jesus
Qin Jin
Michael Johnston
Kristiina Jokinen
Caroline Jones
Szu-Chen Stan Jou
Denis Jouvet
Ho-Young Jung
Peter Kabal
Takehiko Kagoshima
Alexander Kain
Hong-Goo Kang
Stephan Kanthak
Hiroaki Kato
Tatsuya Kawahara
Hisashi Kawai
Mano Kazunori
Thomas Kemp
Patrick Kenny
Jong-mi Kim
Sunhee Kim
Byeongchang Kim
Hoirin Kim
Hong Kook Kim
Hyung Soon Kim
Jeesun Kim
Nam Soo Kim
Sanghun Kim
Simon King
Brian Kingsbury
Yuko Kinoshita
Christine Kitamura
Esther Klabbers
Dietrich Klakow
Bastiaan Kleijn
Kate Knill
Hanseok Ko
Takao Kobayashi
M. A. Kohler
George Kokkinakis
John Kominek
Myoung-Wan Koo
Sunil Kumar Kopparapu
Christian Kroos
Chul Hong Kwon
Hyuk-Chul Kwon
Oh-Wook Kwon
Francisco Lacerda
Pietro Laface
Unto K. Laine
Claude Lamblin
Lori Lamel
Kornel Laskowski
Javier Latorre
Gary Geunbae Lee
Chin Hui Lee
Lin-shan Lee
Oliver Lemon
Qi (Peter) Li
Haizhou Li
Carlos Lima
Georges Linares
Mike Lincoln
Bjorn Lindblom
Anders Lindström
Zhen-Hua Ling
Yang Liu
Yi Liu
Karen Livescu
Eduardo Lleida Solano
Joaquim Llisterri
Deborah Loakes
Maria Teresa Lopez Soto
Ramon Lopez-Cozar Delgado
Patrick Lucey
Changxue Ma
Ning Ma
Dusan Macho Cierna
Javier Macias-Guarasa
Abdulhussain E. Mahdi
Brian Mak
Robert Malkin
Nuno Mamede
Claudia Manfredi
Lidia Mangu
José B. Mariño Acebal
Ewin Marsi
Carlos David Martínez Hinarejos
Jean-Pierre Martens
Rainer Martin
Alvin Martin
Enrique Masgrau Gomez
Sameer Maskey
John Mason
Takashi Masuko
Ana Isabel Mata
Tomoko Matsui
Karen Mattock
Bernd Möbius
Sebastian Möller
Christian Müller
Alan McCree
Erik McDermott
Gordon McIntyre
John McKenna
Michael McTear
Hugo Meinedo
Alexsandro Meireles
Carlos Eduardo Meneses Ribeiro
Helen Meng
Florian Metze
Bruce Millar
Ben Milner
Nobuaki Minematsu
Wolfgang Minker
Holger Mitterer
Hansjoerg Mixdorff
Parham Mokhtari
Garry Molholt
Juan Manuel Montero Martinez
Asuncion Moreno
Pedro J. Moreno
Nelson Morgan
Ronald Mueller
Climent Nadeu
Seiichi Nakagawa
Satoshi Nakamura
Shrikanth Narayanan
Eva Navas
Jiri Navratil
Ani Nenkova
João Neto
Ron Netsell
Sergio Netto
Hermann Ney
Patrick Nguyen
Anton Nijholt
Elmar Noeth
Albino Nogueiras
Michael Norris
Mohaddeseh Nosratighods
Regine Obrecht
John Ohala
Sumio Ohno
Luis Oliveira
Peder Olsen
Mohamed Omar
Roeland Ordelman
Rosemary Orr
Alfonso Ortega Gimenez
Javier Ortega-Garcia
Beatrice Oshika
Mari Ostendorf
Douglas O'Shaughnessy
Fernando S. Pacheco
Tim Paek
Vincent Pagel
Sira Elena Palazuelos Cagigas
Kuldip Paliwal
Yue Pan
P.C. Pandey
Jose Manuel Pardo
Jun Park
Jong Cheol Park
Patrick Paroubek
Steffen Pauws
Mette Pedersen
Antonio M. Peinado
Carmen Pelaez-Moreno
Jason Pelecanos
Bryan Pellom
Christina Pentland
Fernando Perdigao
Jose Luis Perez Cordoba
Pascal Perrier
Hartmut R. Pfitzinger
Mike Phillips
Michael Picheny
Joe Picone
Roberto Pieraccini
Olivier Pietquin
Ferran Pla Santamaría
Aristodemos Pnevmatikakis
Louis C.W. Pols
Alexandros Potamianos
Gerasimos Potamianos
David Powers
Rohit Prasad
Mahadeva Prassanna
Kristin Precoda
Patti Price
Tarun Pruthi
Mark Przybocki
Yao Qian
Zhu Qifeng
Thomas F. Quatieri
Raja Rajasekaran
Nitendra Rajput
Bhuvana Ramabhadran
V. Ramasubramanian
Preeti Rao
Andreia S. Rauber
Mosur Ravishankar
Mario Refice
Norbert Reithinger
Steve Renals
Fernando Gil Vianna Resende Jr.
Douglas Reynolds
Luca Rigazio
Michael Riley
Christian Ritz
Tony Robinson
Eduardo Rodriguez Banga
Luis Javier Rodriguez-Fuentes
Richard Rose
Olivier Rosec
Antti-Veikko Rosti
Jean-Luc Rouas
Antonio Rubio
Martin Russell
Yoshinori Sagisaka
Josep M. Salavedra
K Samudravijaya
Rubén San-Segundo
Victoria Eugenia Sanchez Calle
Emilio Sanchis
Eric Sanders
George Saon
Shimon Sapir
Murat Saraclar
Ruhi Sarikaya
Hiroshi Saruwatari
Antonio Satue Villar
Michelina Savino
Joan Andreu Sánchez Peiró
Thomas Schaaf
Johan Schalkwyk
Odette Scharenborg
Ralf Schlüter
Jean Schoentgen
Marc Schroeder
Bjoern Schuller
Michael Schuster
Reva Schwartz
Antje Schweitzer
Encarnacion Segarra Soriano
Jose Carlos Segura
Frank Seide
Mike Seltzer
D. Sen
Stephanie Seneff
Cheol Jae Seong
Antonio Serralheiro
Kiyohiro Shikano
Jiyoung Shin
Koichi Shinoda
Carlos Silva
Olivier Siohan
Malcolm Slaney
Raymond Slyh
Connie So
M. Mohan Sondhi
Victor Sorokin
Dave Stallard
Mark Steedman
Stefan Steidl
Andreas Stergiou
Richard Stern
Mary Stevens
Helmer Strik
Volker Strom
Sebastian Stueker
Matt Stuttle
Yannis Stylianou
Rafid Sukkar
Torbjorn Svendsen
Marc Swerts
Ann Syrdal
David Talkin
Zheng-Hua Tan
Yun Tang
Jianhua Tao
Carlos Teixeira
Antonio Teixeira
Joao Paulo Teixeira
Louis ten Bosch
Jacques Terken
Barry-John Theobald
William Thorpe
Jilei Tian
Michael Tjalve
Tomoki Toda
Roberto Togneri
Keiichi Tokuda
Doroteo T Toledano
Laura Tomokiyo
María Inés Torres Barañano
Arthur Toth
Dat Tran
Isabel Trancoso
David Traum
Kimiko Tsukada
Roger Tucker
Gokhan Tur
L. Alfonso Urena Lopez
Jacqueline Vaissiere
Dirk Van Compernolle
Henk van den Heuvel
Jan van Doorn
Hugo Van hamme
Arjan van Hessen
Roeland van Hout
David van Leeuwen
Jan van Santen
Rob Van Son
Peter Vary
Saeed Vaseghi
Mario Vayra
Werner Verhelst
Jo Verhoeven
Ceu Viana
Marina Vigario
Fábio Violaro
Carlos Enrique Vivaracho Pascual
Robbie Vogt
Julie Vonwiller
Michael Wagner
Marilyn Walker
Patrick Wambacq
Hsiao-Chuan Wang
Kuansan Wang
Ye-Yi Wang
Chao Wang
Hsin-min Wang
Nigel Ward
Catherine Watson
Christian Wellekens
Stanley Wenndt
Stefan Werner
Yorick Wilks
Daniel Willett
Briony Williams
Monika Woszczyna
Johan Wouters
Stuart Wrigley
John Zhiyong Wu
Yi-Jian Wu
Bing Xiang
Yi Xu
Haitian Xu
Junichi Yamagishi
Bayya Yegnanarayana
Nestor Becerra Yoma
Chang D. Yoo
Dongsuk Yook
Steve Young
Roger (Peng) Yu
Dong Yu
Kai Yu
Young-Sun Yun
Milos Zelezny
Heiga Zen
Andrej Zgank
Tong Zhang
Yunxin Zhao
Jing Zheng
Bowen Zhou
Imed Zitouni
Udo Zoelzer
Geoffrey Zweig
Sponsors and Exhibitors
Sponsors
We gratefully acknowledge the support of:
• Google, Inc. for silver-level sponsorship of the conference.
• Carstens Medizinelektronik GmbH for bronze-level sponsorship of the conference.
• Northern Digital, Inc. for bronze-level sponsorship of the conference.
• Nuance, Inc. for bronze-level sponsorship of the conference.
• Toshiba Research Europe for bronze-level sponsorship of the conference.
• Appen Pty Ltd for bronze-level sponsorship of the conference.
• Crown Industries, Inc. for supporting the Loebner Prize competition.
• IBM Research Inc. for supporting the Loebner Prize competition.
• Brighton & Hove City Council for sponsoring the conference venue.
Exhibitors
Interspeech 2009 welcomes the following exhibitors:
INTERSPEECH 2010
Supporting Committee Institutions
The organising committee would like to acknowledge the support of their respective
institutions (institutions presented in alphabetical order).
Trinity College Dublin
In Memoriam
Gunnar Fant
8th October 1919 – 6th June 2009
Speech research pioneer and recipient of the 1989
ESCA Medal for scientific achievement
Contest for the Loebner Prize 2009
Time: 10:45am Sunday 6 September
Venue: Rainbow room, Brighton Centre
The Loebner Prize for artificial intelligence is the first formal instantiation of a Turing Test. The
test is named after Alan Turing, the brilliant British mathematician with many accomplishments in
computing science. In 1950, in the article Computing Machinery and Intelligence which
appeared in the philosophy journal Mind, Alan Turing asked the question "Can a Machine
Think?" He answered in the affirmative, but a central question was: "If a computer could think,
how could we tell?" Turing's suggestion was that if the responses from a computer in an
imitation game were indistinguishable from those of a human, the computer could be said to be
thinking.
The Loebner prize competition seeks to find out how close we are to building a computer to
pass the Turing test. In 1950 Alan Turing wrote:
"I believe that in about fifty years' time it will be possible, to programme computers, with a
storage capacity of about 109, to make them play the imitation game so well that an average
interrogator will not have more than 70 per cent chance of making the right identification after
five minutes of questioning..."
The 2009 Loebner Prize will operate in the following manner.
• Panels of judges communicate with two entities over a typewritten link. One entity is a
human, one is a computer program, allocated at random.
• Each judge will begin the round by making an initial comment to the first entity and
continue interacting for 5 minutes. At the conclusion of the five minutes, the judge will
begin the interaction with the second entity and continue for 5 minutes.
• Entities will be expected to respond to the judges' initial comments or questions. There
will be no restrictions on what names etc the entries, humans, or judges can use, nor any
other restrictions on the content of the conversations.
• At the conclusion of the 10 minutes of questioning, judges will be allowed 10 minutes to
review the conversations. They will then score one of the two entities as the human.
Following this, there will be a 5 minute period for judges and confederates to take their
places for the next round.
• The system that is most often considered to be human by the judges will win a Bronze
Loebner medal and $3000.
More details at the Loebner Prize web site: http://www.loebner.net/Prizef/loebner-prize.html.
The Loebner Prize is made possible by funding from Crown Industries, Inc., of East Orange NJ
and contributions from IBM Research.
Organiser: Philip Jackson, [email protected]
“Spoken Language Processing for All”
Spoken Language Processing for All Ages, Health Conditions,
Native Languages, and Environments
INTERSPEECH 2010, the 11th conference in the annual series of Interspeech events,
will be held at the International Convention Hall at the Makuhari Messe exhibition
complex in Chiba, Japan. The conference venue allows easy access for international
travelers: 30 minutes from Narita International Airport by bus, 30 minutes from Tokyo
station by train, and within walking distance of a number of hotels with a wide variety
of room rates.
INTERSPEECH 2010 returns to Japan for the first time in 16 years. Japan hosted the
first and third International Conferences on Spoken Language Processing (ICSLP) in
1990 and 1994. In 2010, we seek to emphasize the interdisciplinary nature of speech
research, and facilitate cross-fertilization among various branches of spoken language
science and technology.
Mark 26-30 September 2010 on your calendar now!
For further details,
visit http://www.interspeech2010.org/ or
write to [email protected]
General Chair: Keikichi Hirose (The University of Tokyo)
General Co-Chair: Yoshinori Sagisaka (Waseda University)
Technical Program Committee Chair: Satoshi Nakamura (NICT/ATR)
Satellite Events
This is the list of satellite workshops linked to Interspeech 2009.
ACORNS Workshop on Computational Models of Language Evolution,
Acquisition and Processing
11 September 2009, Brighton, UK.
The workshop brings together up to 50 scientists to discuss future research in language
acquisition, processing and evolution. Deb Roy, Friedemann Pulvermüller, Rochelle Newman
and Lou Boves will provide an overview of the state of the art, a number of discussants from
different disciplines will widen the perspective, and all participants can contribute to a roadmap.
AVSP 2009 - Audio-Visual Speech Processing
10-13 Sept 2009, University of East Anglia, Norwich, U.K.
The International Conference on Auditory-Visual Speech Processing (AVSP) attracts an
interdisciplinary audience of psychologists, engineers, scientists and linguists, and considers a
range of topics related to speech perception, production, recognition and synthesis. Recently
the scope of AVSP has broadened to also include discussion on more general issues related to
audiovisual communication. For example, the interplay between speech and the expressions of
emotion, and the relationship between speech and manual gestures.
Blizzard Challenge Workshop
4 September, University of Edinburgh, U.K.
In order to better understand and compare research techniques in building corpus-based
speech synthesizers on the same data, the Blizzard Challenge was devised. The basic
challenge is to take the released speech database, build a synthetic voice from the data and
synthesize a prescribed set of test sentences which are evaluated through listening tests. The
results are presented at this workshop. Attendance at the 2009 workshop for the 4th Blizzard
Challenge is open to all, not just participants in the challenge. Registration closes on 14th
August 2009.
SIGDIAL - Special Interest Group on Dialogue
11-12 Sept 2009, Queen Mary, University of London, U.K.
The SIGDIAL venue provides a regular forum for the presentation of cutting edge research in
discourse and dialogue to both academic and industry researchers. The conference is
sponsored by the SIGDIAL organization, which serves as the Special Interest Group in
discourse and dialogue for both the Association for Computational Linguistics and the
International Speech Communication Association.
SLaTE Workshop on Speech and Language Technology in Education (SLaTE)
3-5 September 2009, Wroxall, Warwickshire, U.K.
SLaTE 2009 follows SLaTE 2007, held in Farmington, Pennsylvania, USA, and the STiLL
meeting organized by KTH in Marholmen, Sweden, in 1998. The workshop will address all
topics which concern speech and language technology for education. Papers will discuss
theories, applications, evaluation, limitations, persistent difficulties, general research tools and
techniques. Papers that critically evaluate approaches or processing strategies will be
especially welcome, as will prototype demonstrations of real-world applications.
Young Researchers' Roundtable on Spoken Dialogue Systems
13-14 September 2009, Queen Mary, University of London, U.K.
The Young Researchers' Roundtable on Spoken Dialog Systems is an annual workshop
designed for students, post docs, and junior researchers working in research related to spoken
dialogue systems in both academia and industry. The roundtable provides an open forum where
participants can discuss their research interests, current work and future plans. The workshop is
meant to provide an interdisciplinary forum for creative thinking about current issues in spoken
dialogue systems research, and help create a stronger international network of young
researchers working in the field.
Interspeech 2009 Keynote Sessions
Keynote 1
ISCA Scientific Achievement Medallist for 2009
Sadaoki Furui,
Tokyo Institute of Technology
Selected topics from 40 years of research on speech and speaker recognition
Mon-Ses1-K: Monday 11:00, Main Hall
Chair: Isabel Trancoso
Abstract
This talk summarizes my 40 years of research on speech and speaker recognition, focusing on
selected topics that I have investigated at NTT Laboratories, Bell Laboratories and Tokyo
Institute of Technology with my colleagues and students. These topics include: the importance
of spectral dynamics in speech perception; speaker recognition methods using statistical
features, cepstral features, and HMM/GMM; text-prompted speaker recognition; speech
recognition by dynamic features; Japanese LVCSR; spontaneous speech corpus construction
and analysis; spontaneous speech recognition; automatic speech summarization; WFST-based
decoder development and its applications; and unsupervised model adaptation methods.
Presenter
Sadaoki Furui is currently a Professor at Tokyo Institute of Technology, Department of
Computer Science. He is engaged in a wide range of research on speech analysis, speech
recognition, speaker recognition, speech synthesis, and multimodal human-computer interaction
and has authored or coauthored over 800 published articles. He is a Fellow of the IEEE, the
International Speech Communication Association (ISCA), the Institute of Electronics,
Information and Communication Engineers of Japan (IEICE), and the Acoustical Society of
America. He has served as President of the Acoustical Society of Japan (ASJ) and the ISCA.
He has served as a member of the Board of Governors of the IEEE Signal Processing (SP)
Society and Editor-in-Chief of both the Transactions of the IEICE and the Journal of Speech
Communication. He has received the Yonezawa Prize, the Paper Award and the Achievement
Award from the IEICE (1975, 88, 93, 2003, 2003, 2008), and the Sato Paper Award from the
ASJ (1985, 87). He has received the Senior Award and Society Award from the IEEE SP
Society (1989, 2006), the Achievement Award from the Minister of Science and Technology and
the Minister of Education, Japan (1989, 2006), and the Purple Ribbon Medal from the Japanese
Emperor (2006). In 1993 he served as an IEEE SPS Distinguished Lecturer.
Keynote 2
Tom Griffiths,
UC Berkeley
Connecting human and machine learning via probabilistic models of cognition
Tue-Ses0-K: Tuesday 08:30, Main Hall
Chair: Steve Renals
Abstract
Human performance defines the standard that machine learning systems aspire to in many
areas, including learning language. This suggests that studying human cognition may be a good
way to develop better learning algorithms, as well as providing basic insights into how the
human mind works. However, in order for ideas to flow easily from cognitive science to
computer science and vice versa, we need a common framework for describing human and
machine learning. I will summarize recent work exploring the hypothesis that probabilistic
models of cognition, which view learning as a form of statistical inference, provide such a
framework, including results that illustrate how novel ideas from statistics can inform cognitive
science. Specifically, I will talk about how probabilistic models can be used to identify the
assumptions of learners, learn at different levels of abstraction, and link the inductive biases of
individuals to cultural universals.
Presenter
Tom Griffiths is an Assistant Professor of Psychology and Cognitive Science at UC Berkeley,
with courtesy appointments in Computer Science and Neuroscience. His research explores
connections between human and machine learning, using ideas from statistics and artificial
intelligence to try to understand how people solve the challenging computational problems they
encounter in everyday life. He received his PhD in Psychology from Stanford University in 2005,
and taught in the Department of Cognitive and Linguistic Sciences at Brown University before
moving to Berkeley. His work and that of his students has received awards from the Neural
Information Processing Systems conference and the Annual Conference of the Cognitive
Science Society, and in 2006 IEEE Intelligent Systems magazine named him one of "Ten to
watch in AI."
Keynote 3
Deb Roy,
MIT Media Lab
New Horizons in the Study of Language Development
Wed-Ses0-K: Wednesday 08:30, Main Hall
Chair: Roger Moore
Abstract
Emerging forms of ecologically-valid longitudinal recordings of human behavior and social
interaction promise fresh perspectives on age-old questions of child development. In a pilot
effort, 240,000 hours of audio and video recordings of one child’s life at home are being
analyzed with a focus on language development. To study a corpus of this scale and richness,
current methods of developmental sciences are insufficient. New data analysis algorithms and
methods for interpretation and computational modeling are under development. Preliminary
speech analysis reveals surprising levels of linguistic “finetuning” by caregivers that may provide
crucial support for word learning. Ongoing analysis of various other aspects of the corpus aim to
model detailed aspects of the child’s language development as a function of learning
mechanisms combined with everyday experience. Plans to collect similar corpora from more
children based on a streamlined recording system are underway.
Presenter
Deb Roy directs the Media Lab's Cognitive Machines group, is founding director of MIT’s Center
for Future Banking, and chairs the academic program in Media Arts and Sciences. A native of
Canada, he received his bachelor of computer engineering from the University of Waterloo in
1992 and his PhD in cognitive science from MIT in 1999. He joined the MIT faculty in 2000 and
was named AT&T Associate Professorship of Media Arts and Sciences in 2003.
Roy studies how children learn language, and designs machines that learn to communicate in
human-like ways. To enable this work, he has developed new data-driven methods for
analyzing and modeling human linguistic and social behavior. He has begun exploring
applications of these methods to a range of new domains from financial behavior to autism. Roy
has authored numerous scientific papers in the areas of artificial intelligence, cognitive modeling,
human-machine interaction, data mining and information visualization.
30
Keynote 4
Mari Ostendorf,
University of Washington
Transcribing Speech for Spoken Language Processing
Thu-Ses0-K: Thursday 08:30, Main Hall
Chair: Martin Russell
Abstract
As storage costs drop and bandwidth increases, there has been a rapid growth of spoken
information available via the web or in online archives -- including radio and TV broadcasts, oral
histories, legislative proceedings, call center recordings, etc. -- raising problems of document
retrieval, information extraction, summarization and translation for spoken language. While
there is a long tradition of research in these technologies for text, new challenges arise when
moving from written to spoken language. In this talk, we look at differences between speech
and text, and how we can leverage the information in the speech signal beyond the words to
provide structural information in a rich, automatically generated transcript that better serves
language processing applications. In particular, we look at three interrelated types of structure
(segmentation, prominence and syntax), methods for automatic detection, the benefit of
optimizing rich transcription for the target language processing task, and the impact of this
structural information in tasks such as parsing, topic detection, information extraction and
translation.
Presenter
Mari Ostendorf received the Ph.D. in electrical engineering from Stanford University. After
working at BBN Laboratories and Boston University, she joined the University of Washington
(UW) in 1999. She has also been a visiting researcher at the ATR Interpreting
Telecommunications Laboratory and at the University of Karlsruhe. At UW, she is currently an
Endowed Professor of System Design Methodologies in Electrical Engineering and an Adjunct
Professor in Computer Science and Engineering and in Linguistics. Currently, she is the
Associate Dean for Research and Graduate Studies in the UW College of Engineering. She
teaches undergraduate and graduate courses in signal processing and statistical learning,
including a design-oriented freshman course that introduces students to signal processing and
communications.
Prof. Ostendorf's research interests are in dynamic and linguistically-motivated statistical
models for speech and language processing. Her work has resulted in over 200 publications
and 2 paper awards. Prof. Ostendorf has served as co-Editor of Computer Speech and
Language, as the Editor-in-Chief of the IEEE Transactions on Audio, Speech and Language
Processing, and is currently on the IEEE Signal Processing Society Board of Governors and the
ISCA Advisory Council. She is a Fellow of IEEE and ISCA.
31
Interspeech 2009 Special Sessions
The Interspeech 2009 Organisation Committee is pleased to announce the acceptance of the following Special Sessions.
INTERSPEECH 2009 Emotion Challenge
Mon-Ses2-S1: Monday 13:30
Place: Ainsworth (East Wing 4)
The INTERSPEECH 2009 Emotion Challenge aims to help bridge the gap between the
excellent research on human emotion recognition from speech and the low compatibility of
results. The FAU Aibo Emotion Corpus of spontaneous, emotionally coloured speech, and
benchmark results of the two most popular approaches will be provided by the organisers. The corpus consists of nine hours of speech from 51 children, recorded at two different schools, and allows for a clear definition of training and test partitions that incorporates speaker independence, as needed in most real-life settings. It further provides a uniquely detailed
transcription of spoken content with word boundaries, non-linguistic vocalisations, emotion
labels, units of analysis, etc. The results of the Challenge will be presented at the Special
Session, and prizes will be awarded to the sub-challenge winners and for the best paper.
Organisers: Bjoern Schuller ([email protected]), Technische Universitaet Muenchen, Germany,
Stefan Steidl ([email protected]), FAU Erlangen-Nuremberg, Germany, Anton
Batliner ([email protected]), FAU Erlangen-Nuremberg, Germany.
Silent Speech Interfaces
Mon-Ses3-S1: Monday 16:00
Place: Ainsworth (East Wing 4)
A Silent Speech Interface (SSI) is an electronic system enabling speech communication to take
place without the necessity of emitting an audible acoustic signal. By acquiring sensor data from
elements of the human speech production process – from the articulators, their neural pathways,
or the brain itself – an SSI produces a digital representation of speech which can be synthesized directly, interpreted as data, or routed into a communications network. Thanks to this novel approach, Silent Speech Interfaces have the potential to overcome the major limitations of
traditional speech interfaces today, i.e. (a) limited robustness in the presence of ambient noise;
(b) lack of secure transmission of private and confidential information; and (c) disturbance of
bystanders created by audibly spoken speech in quiet environments; while at the same time
retaining speech as the most natural human communication modality. The special session
intends to bring together researchers in the areas of human articulation, speech and language
technologies, data acquisition and signal processing, as well as in human interface design,
software engineering and systems integration. Its goal is to promote the exchange of ideas on
current SSI challenges and to discuss solutions, by highlighting, for each of the technological
approaches presented, its range of applications, key advantages, potential drawbacks, and
current state of development.
Organisers: Bruce Denby ([email protected]), Université Pierre et Marie Curie, France, Tanja
Schultz ([email protected]), Cognitive Systems Lab, University of Karlsruhe, Germany.
Advanced Voice Function Assessment
Tue-Ses1-S1: Tuesday 10:00
Place: Ainsworth (East Wing 4)
In order to advance the field of voice function assessment in a clinical setting, cooperation
between clinicians and technologists is essential. The aim of this special session is to showcase
32
work that crosses the borders between basic, applied and clinical research and highlights the
development of partnership between technologists and healthcare professionals in advancing
the protocols and technologies for the assessment of voice function.
Organisers: Anna Barney ([email protected]), Institute of Sound and Vibration Research, UK,
Mette Pedersen ([email protected]), Medical Centre, Voice Unit, Denmark.
Measuring the Rhythm of Speech
Tue-Ses3-S2: Tuesday 16:00
Place: Ainsworth (East Wing 4)
There has been considerable interest in the last decade in the modelling of rhythm both from a
typological perspective (e.g. establishing objective criteria for classifying languages or dialects as stress-timed, syllable-timed or mora-timed) and from the perspective of establishing evaluation metrics for non-standard or deviant varieties of speech, such as that produced by non-native speakers, by speakers with speech pathologies, or by automatic speech synthesis. The
aim of this special session will be to bring together a number of researchers who have
contributed to this debate and to assess and discuss the current status of our understanding of
the relative value of different metrics for different tasks.
Organiser: Daniel Hirst ([email protected]), Laboratoire Parole et Langage, Université de
Provence, France.
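By way of a concrete example of the duration-based metrics this session will compare, the short sketch below computes the normalised Pairwise Variability Index (nPVI) over a sequence of interval durations. The durations are invented for illustration; the session itself covers a much wider range of metrics and their relative merits.

# Minimal sketch: normalised Pairwise Variability Index (nPVI), one of the
# duration-based rhythm metrics discussed in this session. Durations (in ms)
# are invented for illustration.

def npvi(durations):
    """nPVI = 100/(m-1) * sum |d_k - d_{k+1}| / ((d_k + d_{k+1}) / 2)."""
    pairs = zip(durations[:-1], durations[1:])
    terms = [abs(a - b) / ((a + b) / 2.0) for a, b in pairs]
    return 100.0 * sum(terms) / len(terms)

# Alternating long/short intervals (stress-timed-like) score higher
# than nearly equal intervals (syllable-timed-like).
print(npvi([180, 60, 170, 55, 190, 65]))   # high variability
print(npvi([100, 95, 105, 98, 102, 97]))   # low variability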
Lessons and Challenges Deploying Voice Search
Wed-Ses1-S1: Wednesday 10:00
Place: Ainsworth (East Wing 4)
In the past year, a number of companies have deployed multimodal search applications for
mobile phones. These applications enable spoken input for search, as an alternative to typing.
There are many technical challenges associated with deploying such applications, including:
1. High perplexity: a language model for general search must accommodate a very large vocabulary and a tremendous range of possible inputs.
2. Challenging acoustic environments: mobile phones are often used "on the go", which can often mean noisy environments.
3. Challenging usage scenarios: mobile search may be used in demanding situations such as information access while driving a car.
This session will focus on early lessons learned from usage data, the challenges posed, and technical and design solutions to these challenges, as well as a look towards the future.
Organisers: Mike Cohen ([email protected]), Google, Johan Schalkwyk ([email protected]), Google, Mike Phillips ([email protected]), Vlingo.
Active Listening & Synchrony
Wed-Ses2-S1: Wednesday 13:30
Place: Ainsworth (East Wing 4)
Traditional approaches to Multimodal Interface design have tended to assume a "ping-pong" or
"push-to-talk" approach to speech interaction wherein either the system or the interlocuting
human is active at any one time. This is contrary to many recent findings in conversation and
discourse analysis, where the definition of a "turn", or even an "utterance" is found to be very
complex; people don’t "take turns" to talk in a typical conversational interaction, but they each
contribute actively and interactively to the joint emergence of a "common understanding". . The
aim of this special session, marking the 70th anniversary of synchrony research, is to bring
together researchers from the varous different fields, who have special interest in novel
techniques that are aimed at overcoming weaknesses of the "push-to-talk" approach in interface
33
technology, or who have knowledge of the history of this field from which the research
community could benefit.
Organisers: Nick Campbell ([email protected]), Trinity College Dublin, Ireland, Anton Nijholt
([email protected]), University of Twente, The Netherlands, Joakim Gustafson
([email protected]), KTH, Sweden, Carl Vogel ([email protected]), Trinity College Dublin,
Ireland.
Machine Learning for Adaptivity in Spoken Dialogue Systems
Wed-Ses3-S1: Wednesday 16:00
Place: Ainsworth (East Wing 4)
In the past decade, research in the field of Spoken Dialogue Systems (SDS) has experienced
increasing growth, and new applications include interactive mobile search, tutoring, and
troubleshooting systems. The design and optimization of robust SDS for such tasks requires the
development of dialogue strategies which can automatically adapt to different types of users
and noise conditions. New statistical learning techniques are emerging for training and
optimizing adaptive speech recognition, spoken language understanding, dialogue management,
natural language generation, and speech synthesis in spoken dialogue systems. Among
machine learning techniques for spoken dialogue strategy optimization, reinforcement learning
using Markov Decision Processes (MDPs) and Partially Observable MDP (POMDPs) has
become a particular focus. The purpose of this special session is to provide an opportunity for
the international research community to share ideas on these topics and to have constructive
discussions in a single, focussed, special conference session.
Organisers: Oliver Lemon ([email protected]), Edinburgh University, UK, Olivier Pietquin
([email protected]), Supélec - IMS Research Group, France.
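To make the reinforcement-learning formulation mentioned above concrete, the toy sketch below runs value iteration on a deliberately tiny hand-made MDP for a two-slot form-filling dialogue. The states, actions, transition probabilities and rewards are invented for illustration; they do not come from any system discussed in this session, and real systems typically use much richer state spaces or POMDP belief states.

# Toy MDP for a two-slot form-filling dialogue, solved with value iteration.
# States, actions, transition probabilities and rewards are invented for
# illustration; they are not taken from any system described in this session.

STATES = ["0_filled", "1_filled", "2_filled", "done"]
ACTIONS = ["ask_slot", "confirm_and_close"]
GAMMA = 0.95

# T[s][a] = list of (probability, next_state, reward)
T = {
    "0_filled": {"ask_slot": [(0.8, "1_filled", -1), (0.2, "0_filled", -1)],
                 "confirm_and_close": [(1.0, "done", -20)]},   # closing too early fails
    "1_filled": {"ask_slot": [(0.8, "2_filled", -1), (0.2, "1_filled", -1)],
                 "confirm_and_close": [(1.0, "done", -20)]},
    "2_filled": {"ask_slot": [(1.0, "2_filled", -1)],
                 "confirm_and_close": [(1.0, "done", 20)]},    # successful completion
    "done": {a: [(1.0, "done", 0)] for a in ACTIONS},
}

def value_iteration(tol=1e-6):
    V = {s: 0.0 for s in STATES}
    while True:
        delta = 0.0
        for s in STATES:
            q = [sum(p * (r + GAMMA * V[s2]) for p, s2, r in T[s][a]) for a in ACTIONS]
            best = max(q)
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            break
    policy = {s: max(ACTIONS, key=lambda a: sum(p * (r + GAMMA * V[s2])
                                                for p, s2, r in T[s][a])) for s in STATES}
    return V, policy

values, policy = value_iteration()
print(policy)   # expected: ask until both slots are filled, then confirm and close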
New Approaches to Modeling Variability for Automatic Speech Recognition
Thu-Ses1-S1: Thursday 10:00
Place: Ainsworth (East Wing 4)
Despite great strides in the development of automatic speech recognition (ASR) technology, our
community is still far from achieving its holy grail: an ASR system with performance comparable
to humans in automatically transcribing unrestricted conversational speech, spoken by many
speakers and in adverse acoustic environments. Many of the difficulties faced by ASR models
are due to the high degree of variation in the acoustic waveforms associated with a given
phonetic unit measured across different segmental and prosodic contexts. Such variation has
both deterministic origins (intersegmental coarticulation; prosodic juncture and accent) and
stochastic origins (token-to-token variability for utterances with the same segmental and
prosodic structure). Current ASR systems successfully model acoustic variation that is due to
adjacent phone context, but variation due to other sources, including prosodic context, speech
rate, and speaker, is not adequately treated. The goal of this special session is to bring together
researchers who are exploring alternative approaches to state-of-the-art ASR methodologies.
Of special interest are new approaches that model variation in the speech signal at multiple
levels, from both linguistic and extra-linguistic sources. In particular, we encourage the
participation of those who are attempting to incorporate the insights that the field has gained
over the past several decades from acoustic phonetics, speech production, speech perception,
prosody, lexical access, natural language processing and pattern recognition to the problem of
developing models of speech recognition that are robust to the full variability of speech.
Organisers: Carol Espy-Wilson ([email protected]), University of Maryland, Jennifer Cole
([email protected]), Illinois, Abeer Alwan, UCLA, Louis Goldstein, University of Southern
California, Mary Harper, University of Maryland, Elliot Saltzman, Haskins Laboratories, Mark
Hasegawa-Johnson, Illinois.
34
Interspeech 2009 Tutorials - Sunday 6 September 2009
T-1: Analysis by synthesis of speech prosody, from data to models
Sunday 9:15
Place: Jones (East Wing 1)
The study of speech prosody today has become a research area which has attracted interest
from researchers in a great number of different related fields including academic linguistics and
phonetics, conversation analysis, semantics and pragmatics, sociolinguistics, acoustics, speech
synthesis and recognition, cognitive psychology, neuroscience, speech therapy, language
teaching... and no doubt many more. So much so that it is particularly difficult for any one
person to keep up to date on research in all relevant areas. This is particularly true for new
researchers coming into the field. This tutorial will propose an overview of a variety of current
ideas on the methodology and tools for the automatic and semi-automatic analysis and
synthesis of speech prosody, consisting in particular of lexical prosody, rhythm, accentuation
and intonation. The tools presented will include but not be restricted to those developed by the
presenter himself. The emphasis will be on the importance of data analysis for the testing of
linguistic models and the relevance of these models to the analysis itself. The target audience
will be researchers who are aware of the importance of the analysis and synthesis of prosody
for their own research interests and who wish to update their knowledge of background and
current work in the field.
Presenter: Daniel Hirst ([email protected]), Laboratoire Parole et Langage, Université de
Provence, France.
T-2: Dealing with High Dimensional Data with Dimensionality Reduction
Sunday 9:15
Place: Fallside (East Wing 2)
Dimensionality reduction is a standard component of the toolkit in any area of data modelling.
Over the last decade algorithmic development in the area of dimensionality reduction has been
rapid. Approaches such as Isomap, LLE, and maximum variance unfolding have extended the
methodologies available to the practitioner. More recently, probabilistic dimensionality reduction
techniques have been used with great success in modelling of human motion. How are all these
approaches related? What are they useful for? In this tutorial our aim is to develop an
understanding of high-dimensional data and of the problems involved in dealing with it. We will
motivate the use of nonlinear dimensionality reduction as a solution for these problems. The
keystone to unify the various approaches to non-linear dimensionality reduction is principal
component analysis. We will show how it underpins spectral methods and attempt to cast
spectral approaches within the same unifying framework. We will further build on principal
component analysis to introduce probabilistic approaches to non-linear dimensionality reduction.
These approaches have become increasingly popular in graphics and vision through the
Gaussian Process Latent Variable Model. We will review the GP-LVM and also consider earlier
approaches such as the Generative Topographic Mapping and Latent Density Networks.
Presenter: Neil Lawrence ([email protected]), School of Computer Science,
Univ. of Manchester, UK ; Jon Barker ([email protected]), Department of Computer
Science, Univ. of Sheffield, UK
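For readers new to the area, the sketch below shows the linear baseline on which the tutorial builds: principal component analysis via an eigendecomposition of the sample covariance matrix. It is a generic illustration on random data, not material from the tutorial itself.

# Minimal PCA sketch: project high-dimensional data onto its top-k principal
# components via an eigendecomposition of the sample covariance matrix.
# Generic illustration only; not code from the tutorial itself.
import numpy as np

def pca(X, k):
    """X: (n_samples, n_features). Returns (projections, components)."""
    Xc = X - X.mean(axis=0)                      # centre the data
    cov = np.cov(Xc, rowvar=False)               # sample covariance
    eigvals, eigvecs = np.linalg.eigh(cov)       # ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:k]        # top-k directions
    components = eigvecs[:, order]
    return Xc @ components, components

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3)) @ rng.normal(size=(3, 20))   # 20-D data on a 3-D subspace
Z, W = pca(X, k=2)
print(Z.shape, W.shape)    # (200, 2) (20, 2)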
T-3: Language and Dialect Recognition
Sunday 9:15
Place: Holmes (East Wing 3)
Spoken language recognition (a.k.a. Language ID or LID) is the task of recognizing the language of a sample spoken by an unknown speaker. Language ID finds applications in multi-lingual
dialog systems, distillation, diarization and indexing systems, speaker detection and speech
recognition. Often, LID represents one of the first and necessary processing steps in many
35
speech processing systems. Furthermore, language, dialect, and accent are of interest in
diarization, indexing/search, and may play an important auxiliary role in identifying speakers.
LID has seen almost four decades of active research. Benefitting from the development of public multi-lingual corpora in the 1990s, progress in LID technology accelerated tremendously in the 2000s. While the availability of large corpora served as an enabling medium,
establishing a series of NIST-administered Language Recognition Evaluations (LRE) provided
the research community with a common ground of comparison and proved to be a strong
catalyst. The LRE series also gave rise to a "cross-pollination effect" by effectively fusing the speaker and language recognition communities, which shared and spread their respective methods and techniques. In the past five years or so, considerable success has been achieved by developing techniques to deal with channel and session variability, to improve acoustic language modeling by means of discriminative methods, and to further refine the basic phonotactic approaches.
The goal of this tutorial is to survey the LID area from a historical perspective as well as in its
most modern state. Several important milestones contributing to the growth of the LID area will
be identified. In a second, larger part, the most successful state-of-the-art probabilistic approaches and modeling techniques will be described in more detail; these include various phonotactic architectures, UBM-GMMs, discriminative techniques, and subspace modeling tricks. The closely related problem of detecting dialects will be discussed in the final part.
Presenter: Jiri Navratil ([email protected]), Multilingual Analytics and User Technologies, IBM T.J.
Watson Research Center, USA.
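As a concrete, much-simplified instance of the acoustic modelling approach surveyed in this tutorial, the sketch below trains one Gaussian mixture model per language and classifies a test utterance by average frame log-likelihood. The random vectors stand in for real MFCC features, and the setup deliberately omits the UBM, channel compensation and phonotactic components discussed above.

# Simplified GMM-based language ID: one GMM per language over acoustic
# feature vectors, decision by average frame log-likelihood. The random
# data below stands in for real MFCC features; this omits UBMs, channel
# compensation and phonotactic models covered in the tutorial.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
train = {                       # fake 13-D "MFCC" frames per language
    "english": rng.normal(loc=0.0, scale=1.0, size=(2000, 13)),
    "french":  rng.normal(loc=0.7, scale=1.2, size=(2000, 13)),
}

models = {lang: GaussianMixture(n_components=8, covariance_type="diag",
                                random_state=0).fit(frames)
          for lang, frames in train.items()}

def identify(utterance_frames):
    """Return the language whose GMM gives the highest mean log-likelihood."""
    scores = {lang: gmm.score(utterance_frames) for lang, gmm in models.items()}
    return max(scores, key=scores.get), scores

test = rng.normal(loc=0.7, scale=1.2, size=(300, 13))   # a "french-like" utterance
print(identify(test)[0])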
T-4: Emerging Technologies for Silent Speech Interfaces
Sunday 9:15
Place: Ainsworth (East Wing 4)
In the past decade, the performance of automatic speech processing systems, including speech
recognition, text and speech translation, and speech synthesis, has improved dramatically. This
has resulted in an increasingly widespread use of speech and language technologies in a wide
variety of applications, such as commercial information retrieval systems, call centre services,
voice-operated cell phones or car navigation systems, personal dictation and translation
assistance, as well as applications in military and security domains. However, speech-driven
interfaces based on conventional acoustic speech signals still suffer from several limitations.
Firstly, the acoustic signals are transmitted through the air and are thus prone to ambient noise.
Despite tremendous efforts, robust speech processing systems, which perform reliably in
crowded restaurants, airports, or other public places, are still not in sight. Secondly,
conventional interfaces rely on audibly uttered speech, which has two major drawbacks: it
jeopardizes confidential communications in public and it disturbs any bystanders. Services,
which require the access, retrieval, and transmission of private or confidential information, such
as PINs, passwords, and security or safety information, are particularly vulnerable.
Recently, Silent Speech Interfaces have been proposed which allow their users to communicate by speaking silently, i.e. without producing any sound. This is realized by capturing the speech signal at an early stage of human articulation, namely before the signal becomes airborne, and then transferring these articulation-related signals for further processing and interpretation. Thanks to this novel approach, Silent Speech Interfaces have the potential to overcome the major limitations of traditional speech interfaces today, i.e. (a) limited robustness in the presence of
ambient noise; (b) lack of secure transmission of private and confidential information; and (c)
disturbance of bystanders created by audibly spoken speech in quiet environments; while at the
same time retaining speech as the most natural human communication modality. An SSI could furthermore provide an alternative for persons with speech disabilities, such as those who have undergone a laryngectomy, as well as for elderly or frail users who may not be strong enough to speak aloud effectively.
Presenter: Tanja Schultz ([email protected]), Computer Science Department, Karlsruhe
University, Germany; Bruce Denby ([email protected]), Université Pierre et Marie Curie
(Paris-VI).
36
T-5: In-Vehicle Speech Processing & Analysis
Sunday 14:15
Place: Jones (East Wing 1)
In this tutorial, we will focus on speech technology for in-vehicle use by discussing cutting-edge developments in two applications:
1. Speech as interface: Robust speech recognition system development under vehicle-noise conditions (e.g. engine noise, open windows, A/C operation). This field of study includes the application of in-vehicle microphone arrays, with beamforming algorithms, to reduce the effect of noise on speech recognition (a toy delay-and-sum example follows this tutorial description). The resultant system can be employed as a driver-vehicle interface for entering prompts and commands for music search and for control of in-vehicle systems such as the cell phone, A/C and windows, instead of manual operation, which also engages the driver visually.
2. Speech as monitoring system: Speech can be used to design a sub-module for driver-monitoring systems. For the last two decades, studies of speech under stress have contributed to improving the performance of ASR systems. Detecting stress in speech can also help improve the performance of driver-monitoring systems, which conventionally rely on computer-vision applications such as driver head and eye tracking. In addition, the effects of introducing speech technologies as an interface can be assessed via driver behaviour modeling studies.
Presenter: John H.L. Hansen ([email protected]), Pinar Boyraz ([email protected]), Erik Jonsson School of Engineering and Computer Science, University of Texas at Dallas, USA
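The following toy sketch illustrates the delay-and-sum beamforming idea referred to under point 1 above. The array geometry, steering angle and signals are invented for illustration; practical in-vehicle systems use adaptive beamformers and fractional-sample delays.

# Toy delay-and-sum beamformer for a uniform linear microphone array.
# Array geometry, source angle and signals are invented for illustration;
# real in-vehicle systems use adaptive beamformers and fractional delays.
import numpy as np

FS = 16000          # sample rate (Hz)
C = 343.0           # speed of sound (m/s)
SPACING = 0.05      # microphone spacing (m)
N_MICS = 4

def steering_delays(angle_deg):
    """Per-microphone delays (in samples) for a plane wave arriving from
    angle_deg (0 deg = broadside)."""
    tau = np.arange(N_MICS) * SPACING * np.sin(np.radians(angle_deg)) / C
    return np.round(tau * FS).astype(int)

def delay_and_sum(mic_signals, angle_deg):
    """mic_signals: (N_MICS, n_samples). Align channels toward angle_deg and average."""
    d = steering_delays(angle_deg)
    d -= d.min()                                  # keep shifts non-negative
    aligned = [np.roll(sig, -k) for sig, k in zip(mic_signals, d)]
    return np.mean(aligned, axis=0)

# Simulated example: a sinusoid arriving from 30 degrees plus sensor noise.
t = np.arange(FS) / FS
clean = np.sin(2 * np.pi * 440 * t)
delays = steering_delays(30)
mics = np.stack([np.roll(clean, k) for k in delays])
mics += 0.5 * np.random.default_rng(0).normal(size=mics.shape)
enhanced = delay_and_sum(mics, angle_deg=30)
print(np.std(mics[0] - clean), np.std(enhanced - clean))   # noise is reduced after summing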
T-6: Emotion Recognition in the Next Generation: an Overview and Recent
Development
Sunday 14:15
Place: Fallside (East Wing 2)
Emotional aspects have recently attracted considerable attention as being the "next big thing" for the market success of dialog systems, robotic products, and practically any intelligent Human-Machine Interface. Having matured over the last decade of research, recognition technology is now becoming ready for use in such systems and in many further applications such as Multimedia Retrieval and Surveillance. At the same time, systems have become considerably more complex: in addition to a variety of definitions and theoretical approaches, today's engines demand subject independence, the ability to cope with spontaneous and non-prototypical emotions, robustness against noise and transmission effects, and optimal system integration.
In this respect, the tutorial will present an introduction to the recognition of emotion, with a particular focus on recent developments in audio-based analysis. A general introduction for researchers working in related fields will be followed by current issues and directions for acoustic, linguistic, and multi-stream and multimodal analyses. A summary of the main recognition techniques will be presented, as well as an overview of current challenges, datasets, studies and performance figures with a view to optimal future application design. In addition, the first open-source Emotion Recognition Engine “openSMILE”, developed in the European Union’s Seventh Framework Programme project SEMAINE, will be introduced so that participants can directly experiment with emotion recognition from speech or test the latest technology on their own datasets.
Presenter: Björn Schuller ([email protected]), Munich University of Technology, Germany.
T-7: Fundamentals and recent advances in HMM-based speech synthesis
Sunday 14:15
Place: Holmes (East Wing 3)
Over the last ten years, the quality of speech synthesis has dramatically improved with the rise
of general corpus-based speech synthesis. In particular, state-of-the-art unit selection speech synthesis can generate natural-sounding, high-quality speech. However, to construct human-like talking machines, speech synthesis systems need the ability to generate speech with an arbitrary speaker's voice characteristics, various speaking styles including native
37
and non-native speaking styles in different languages, varying emphasis and focus, and/or
emotional expressions; it is still difficult to have such flexibility with unit-selection synthesizers,
since they need a large-scale speech corpus for each voice.
In recent years, a kind of statistical parametric speech synthesis based on hidden Markov
models (HMMs) has been developed. The system has the following features:
1. The original speaker's voice characteristics can easily be reproduced because all speech features, including spectral, excitation, and duration parameters, are modeled in a unified HMM framework and then generated from the trained HMMs themselves.
2. Using a very small amount of adaptation speech data, voice characteristics can easily be modified by transforming the HMM parameters with a speaker adaptation technique of the kind used in speech recognition systems (a toy illustration follows the presenter listing below).
Given these features, the HMM-based speech synthesis approach is expected to be useful for constructing speech synthesizers that offer the flexibility we have in human voices.
In this tutorial, the system architecture is outlined, and then basic techniques used in the system,
including algorithms for speech parameter generation from HMM, are described with simple
examples. The relation to the unit selection approach, trajectory modeling, recent improvements, and evaluation methodologies are also summarized. Techniques developed for increasing the
flexibility and improving the speech quality are also reviewed.
Presenters: Keiichi Tokuda ([email protected]), Department of Computer Science and
Engineering, Nagoya Institute of Technology; Heiga Zen ([email protected]), Toshiba
Research Europe Ltd.
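The toy sketch below illustrates, in heavily simplified form, the adaptation idea in point 2 above: a single global linear transform of the Gaussian mean vectors is estimated from a small amount of adaptation data by least squares. Real systems use maximum-likelihood linear regression and related techniques that also account for covariances and alignment uncertainty; all numbers here are invented.

# Heavily simplified illustration of adapting HMM output-distribution means
# with one global linear transform estimated from a little adaptation data.
# Least squares over aligned (state, frame) pairs; ignores covariances and
# alignment uncertainty, unlike real MLLR-style adaptation. Data invented.
import numpy as np

rng = np.random.default_rng(0)
DIM, N_STATES = 4, 10
means = rng.normal(size=(N_STATES, DIM))            # source-speaker state means

# Pretend the new speaker's frames are a shifted/scaled version of the means.
true_A = np.eye(DIM) * 1.1
true_b = np.array([0.5, -0.2, 0.1, 0.0])
states = rng.integers(0, N_STATES, size=200)        # state alignment of adaptation frames
frames = means[states] @ true_A.T + true_b + 0.05 * rng.normal(size=(200, DIM))

# Solve frames ~ [means 1] @ W for W stacking the transform and bias.
X = np.hstack([means[states], np.ones((len(states), 1))])
W, *_ = np.linalg.lstsq(X, frames, rcond=None)      # shape (DIM+1, DIM)
A_hat, b_hat = W[:-1].T, W[-1]

adapted_means = means @ A_hat.T + b_hat             # transformed means for synthesis
print(np.round(A_hat, 2), np.round(b_hat, 2))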
T-8: Statistical approaches to dialogue systems
Sunday 14:15
Place: Ainsworth (East Wing 4)
The objective of this tutorial is to provide a comprehensive, cohesive overview of statistical
techniques in dialog management for the newcomer. Specifically, we will start by motivating the research area, showing how traditional techniques fail and, intuitively, why statistical techniques would be expected to do better. Then, in a classroom-style presentation, we will
explain the core algorithms and how they have been applied to spoken dialogue systems. Our
intention is to provide a cohesive treatment of the techniques using a unified, common notation
in order to give the audience a clear understanding of how the techniques interrelate. Finally we
will report results from the literature to provide an indication of the impact in practice. Throughout the tutorial we will draw on both our own work and the literature (with citations throughout), and wherever possible we will use audio/video recordings of interactions to illustrate operation. We will provide lecture notes and a comprehensive bibliography. Our aim is that attendees of this course should be able to readily read papers in this area, comment on them meaningfully, and (we hope!) suggest avenues for future work in this area, which is rich in open challenges, and begin research enquiries of their own.
Presenters: Jason Williams ([email protected]), AT&T Labs – Research, USA; Steve
Young ([email protected]), Blaise Thomson ([email protected]), Information Engineering
Division, Cambridge University, UK.
38
Public Engagement Events
The first Interspeech conversational systems challenge
Sunday 14:15
Place: Rainbow Room
The first Interspeech conversational systems challenge is based around the original Loebner competition, but due to the unique challenges of speech we have changed things slightly. We have devised a scenario that presents an urgent and direct task full of 'full-blown' emotion. As a result, competitors' systems will have to convey urgency and emotion through speech, while any
speech recognition system will have to function successfully in a conversational context with
little time for training.
Each judge will be given the following briefing:
"You're a captain of the one of the fleets finest starships, suddenly your sensors detect a badly
damaged ship heading straight for you, the intercom crackles into life: there's lots of interference
but they're requesting to dock. The ship is about to crash into you, do you push the button
blowing the artificial infiltrator out of the sky or do you open the landing bay and guide the
human refugee to safety, you have 3 minutes to decide."
The artificial system that fools our judges for the longest period of time will be declared the
winner.
Competitors: Marc Schroeder and Jens Edlund; Organiser: Simon Worgan ([email protected])
Interspeech 2009 public exhibition
Sunday 10:00
Place: Public Foyer
On Sunday 6th September a number of exhibitors will be demonstrating aspects of speech and
language technology to the general public. Hosted in the public foyer, exhibits will include
emotive talking heads, agents that attempt to elicit rapport from human speakers and
customized text-to-speech systems.
39
Session Index
Mon-Ses1-K
Keynote 1 — Sadaoki Furui . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .47
Tue-Ses0-K
Keynote 2 — Tom Griffiths . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .47
Wed-Ses0-K
Keynote 3 — Deb Roy. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .47
Thu-Ses0-K
Keynote 4 — Mari Ostendorf . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .47
Mon-Ses2-O1
ASR: Features for Noise Robustness . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .47
Mon-Ses2-O2
Production: Articulatory Modelling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .48
Mon-Ses2-O3
Systems for LVCSR and Rich Transcription . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .49
Mon-Ses2-O4
Speech Analysis and Processing I . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .50
Mon-Ses2-P1
Speech Perception I . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .51
Mon-Ses2-P2
Accent and Language Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .53
Mon-Ses2-P3
ASR: Acoustic Model Training and Combination . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .55
Mon-Ses2-P4
Spoken Dialogue Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .57
Mon-Ses2-S1
Special Session: INTERSPEECH 2009 Emotion Challenge . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .60
Mon-Ses3-O1
Automatic Speech Recognition: Language Models I . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .61
Mon-Ses3-O2
Phoneme-Level Perception. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .62
Mon-Ses3-O3
Statistical Parametric Synthesis I . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .63
Mon-Ses3-O4
Systems for Spoken Language Translation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .64
Mon-Ses3-P1
Human Speech Production I . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .65
Mon-Ses3-P2
Prosody, Text Analysis, and Multilingual Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .67
Mon-Ses3-P3
Automatic Speech Recognition: Adaptation I . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .69
Mon-Ses3-P4
Applications in Learning and Other Areas . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .71
Mon-Ses3-S1
Special Session: Silent Speech Interfaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .73
Tue-Ses1-O1
ASR: Discriminative Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .75
Tue-Ses1-O2
Language Acquisition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .76
Tue-Ses1-O3
ASR: Lexical and Prosodic Models. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .77
Tue-Ses1-O4
Unit-Selection Synthesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .78
Tue-Ses1-P1
Human Speech Production II . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .79
Tue-Ses1-P2
Speech Perception II . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .80
Tue-Ses1-P3
Speech and Audio Segmentation and Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .81
Tue-Ses1-P4
Speaker Recognition and Diarisation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .84
Tue-Ses1-S1
Special Session: Advanced Voice Function Assessment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .86
Tue-Ses2-O1
Automotive and Mobile Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .87
Tue-Ses2-O2
Prosody: Production I . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .88
Tue-Ses2-O3
ASR: Spoken Language Understanding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .89
Tue-Ses2-O4
Speaker Diarisation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .90
Tue-Ses2-P1
Speech Analysis and Processing II . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .91
Tue-Ses2-P2
Speech Processing with Audio or Audiovisual Input . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .93
Tue-Ses2-P3
ASR: Decoding and Confidence Measures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .96
Tue-Ses2-P4
Robust Automatic Speech Recognition I . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .98
Tue-Ses3-S1
Panel: Speech & Intelligence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .99
Tue-Ses3-O3
Speaker Verification & Identification I . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .99
Tue-Ses3-O4
Text Processing for Spoken Language Generation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
Tue-Ses3-P1
Single- and Multichannel Speech Enhancement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
Tue-Ses3-P2
ASR: Acoustic Modelling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
Tue-Ses3-P3
Assistive Speech Technology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
Tue-Ses3-P4
Topics in Spoken Language Processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
40
Tue-Ses3-S2
Special Session: Measuring the Rhythm of Speech . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
Wed-Ses1-O1
Speaker Verification & Identification II . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
Wed-Ses1-O2
Emotion and Expression I . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
Wed-Ses1-O3
Automatic Speech Recognition: Adaptation II . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
Wed-Ses1-O4
Voice Transformation I . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
Wed-Ses1-P1
Phonetics, Phonology, Cross-Language Comparisons, Pathology . . . . . . . . . . . . . . . . . . . . . . . . 116
Wed-Ses1-P2
Prosody Perception and Language Acquisition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
Wed-Ses1-P3
Statistical Parametric Synthesis II . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
Wed-Ses1-P4
Resources, Annotation and Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
Wed-Ses1-S1
Special Session: Lessons and Challenges Deploying Voice Search. . . . . . . . . . . . . . . . . . . . . . . 124
Wed-Ses2-O1
Word-Level Perception . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
Wed-Ses2-O2
Applications in Education and Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
Wed-Ses2-O3
ASR: New Paradigms I . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
Wed-Ses2-O4
Single-Channel Speech Enhancement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128
Wed-Ses2-P1
Emotion and Expression II . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
Wed-Ses2-P2
Expression, Emotion and Personality Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131
Wed-Ses2-P3
Speech Synthesis Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133
Wed-Ses2-P4
LVCSR Systems and Spoken Term Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135
Wed-Ses2-S1
Special Session: Active Listening & Synchrony . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137
Wed-Ses3-O1
Language Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138
Wed-Ses3-O2
Phonetics & Phonology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139
Wed-Ses3-O3
Speech Activity Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140
Wed-Ses3-O4
Multimodal Speech (e.g. Audiovisual Speech, Gesture) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
Wed-Ses3-P1
Phonetics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142
Wed-Ses3-P2
Speaker Verification & Identification III . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144
Wed-Ses3-P3
Robust Automatic Speech Recognition II . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146
Wed-Ses3-P4
Prosody: Production II . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
Wed-Ses3-S1
Special Session: Machine Learning for Adaptivity in Spoken Dialogue Systems . . . . . . . . 149
Thu-Ses1-O1
Robust Automatic Speech Recognition III . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150
Thu-Ses1-O2
Prosody: Perception . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152
Thu-Ses1-O3
Segmentation and Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153
Thu-Ses1-O4
Evaluation & Standardisation of SL Technology and Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . 154
Thu-Ses1-P1
Speech Coding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155
Thu-Ses1-P2
Voice Transformation II . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156
Thu-Ses1-P3
Automatic Speech Recognition: Language Models II . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 158
Thu-Ses1-P4
Systems for Spoken Language Understanding. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159
Thu-Ses1-S1
Special Session: New Approaches to Modeling Variability for Automatic Speech
Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162
Thu-Ses2-O1
User Interactions in Spoken Dialog Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163
Thu-Ses2-O2
Production: Articulation and Acoustics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 164
Thu-Ses2-O3
Features for Speech and Speaker Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 165
Thu-Ses2-O4
Speech and Multimodal Resources & Annotation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 166
Thu-Ses2-O5
Speech Analysis and Processing III. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 167
Thu-Ses2-P1
Speaker and Speech Variability, Paralinguistic and Nonlinguistic Cues . . . . . . . . . . . . . . . . . 168
Thu-Ses2-P2
ASR: Acoustic Model Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 170
Thu-Ses2-P3
ASR: Tonal Language, Cross-Lingual and Multilingual ASR . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 172
Thu-Ses2-P4
ASR: New Paradigms II. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 174
41
DAY 0: Sunday September 6th
Rooms: Jones (East Wing 1), Fallside (East Wing 2), Holmes (East Wing 3), Ainsworth (East Wing 4), Rainbow Room
08:30 Registration for tutorials opens (closes at 14:30)
09:00 ISCA Board Meeting 1 (finish at 17:00) - BCS Room 3
09:15 Tutorials, first morning session:
  T-1: Analysis by Synthesis of Speech Prosody, from Data to Models - Jones
  T-2: Dealing with High Dimensional Data with Dimensionality Reduction - Fallside
  T-3: Language and Dialect Recognition - Holmes
  T-4: Emerging Technologies for Silent Speech Interfaces - Ainsworth
10:45 Coffee break
11:15 Tutorials continue (T-1, T-2, T-3, T-4, as above); Loebner Competition - Rainbow Room
12:45 Lunch
14:00 General registration opens (closes at 18:00)
14:15 Tutorials, first afternoon session:
  T-5: In-Vehicle Speech Processing & Analysis - Jones
  T-6: Emotion Recognition in the Next Generation: an Overview and Recent Development - Fallside
  T-7: Fundamentals and Recent Advances in HMM-based Speech Synthesis - Holmes
  T-8: Statistical Approaches to Dialogue Systems - Ainsworth
  The first Interspeech conversational systems challenge - Rainbow Room
15:45 Tea break
16:15 Tutorials continue (T-5, T-6, T-7, T-8, as above; finish at 17:45)
18:00 Elsevier Thank You Reception for Former Computer Speech and Language Editors (finish at 19:30) - BCS Room 1
DAY 1: Monday September 7th
Rooms: oral sessions in Main Hall, Jones (East Wing 1), Fallside (East Wing 2) and Holmes (East Wing 3); poster sessions in Hewison Hall; special sessions in Ainsworth (East Wing 4)
09:00 Arrival and Registration (closes at 17:00)
10:00 Opening Ceremony in Main Hall
11:00 Mon-Ses1-K, Plenary Session in Main Hall. Keynote Speaker: Sadaoki Furui, ISCA Medallist, Department of Computer Science, Tokyo Institute of Technology: Selected topics from 40 years of research on speech and speaker recognition
12:00 Lunch; IAC (Advisory Council) Meeting - BCS Room 3
13:30
  Mon-Ses2-O1 ASR: Features for Noise Robustness - Main Hall
  Mon-Ses2-O2 Production: Articulatory Modelling - Jones
  Mon-Ses2-O3 Systems for LVCSR and Rich Transcription - Fallside
  Mon-Ses2-O4 Speech Analysis and Processing I - Holmes
  Mon-Ses2-P1 Speech Perception I - Hewison Hall
  Mon-Ses2-P2 Accent and Language Recognition - Hewison Hall
  Mon-Ses2-P3 ASR: Acoustic Model Training and Combination - Hewison Hall
  Mon-Ses2-P4 Spoken Dialogue Systems - Hewison Hall
  Mon-Ses2-S1 Special Session: INTERSPEECH 2009 Emotion Challenge - Ainsworth
15:30 Tea Break
16:00
  Mon-Ses3-O1 ASR: Language Models I - Main Hall
  Mon-Ses3-O2 Phoneme-Level Perception - Jones
  Mon-Ses3-O3 Statistical Parametric Synthesis I - Fallside
  Mon-Ses3-O4 Systems for Spoken Language Translation - Holmes
  Mon-Ses3-P1 Human Speech Production I - Hewison Hall
  Mon-Ses3-P2 Prosody, Text Analysis, and Multilingual Models - Hewison Hall
  Mon-Ses3-P3 ASR: Adaptation I - Hewison Hall
  Mon-Ses3-P4 Applications in Learning and Other Areas - Hewison Hall
  Mon-Ses3-S1 Special Session: Silent Speech Interfaces - Ainsworth
18:00 End of sessions
19:30 Welcome Reception - Brighton Dome
Interspeech 2009 - Main Conference Session Codes
DAYS: Mon = Monday, Tue = Tuesday, Wed = Wednesday, Thu = Thursday
TIMES: Ses1 = 10:00 - 12:00, Ses2 = 13:30 - 15:30, Ses3 = 16:00 - 18:00
TYPE: O = Oral, P = Poster, S = Special, K = Keynote
Day 2: Tuesday September 8th
Rooms: oral sessions in Main Hall, Jones (East Wing 1), Fallside (East Wing 2) and Holmes (East Wing 3); poster sessions in Hewison Hall; special sessions in Ainsworth (East Wing 4)
08:00 Registration (closes at 17:00)
08:30 Tue-Ses0-K, Plenary Session in Main Hall. Keynote Speaker: Tom Griffiths, University of California, Berkeley, USA: Connecting human and machine learning via probabilistic models of cognition
09:30 Coffee Break
10:00
  Tue-Ses1-O1 ASR: Discriminative Training - Main Hall
  Tue-Ses1-O2 Language Acquisition - Jones
  Tue-Ses1-O3 ASR: Lexical and Prosodic Models - Fallside
  Tue-Ses1-O4 Unit-Selection Synthesis - Holmes
  Tue-Ses1-P1 Human Speech Production II - Hewison Hall
  Tue-Ses1-P2 Speech Perception II - Hewison Hall
  Tue-Ses1-P3 Speech and Audio Segmentation and Classification - Hewison Hall
  Tue-Ses1-P4 Speaker Recognition and Diarisation - Hewison Hall
  Tue-Ses1-S1 Special Session: Advanced Voice Function Assessment - Ainsworth
12:00 Lunch; Elsevier Editorial Board Meeting for Computer Speech and Language - BCS Room 1; Special Interest Group Meeting - BCS Room 3
13:30 Standardising assessment for voice and speech pathology (finish at 14:30) - BCS Room 3
13:30
  Tue-Ses2-O1 Automotive and Mobile Applications - Main Hall
  Tue-Ses2-O2 Prosody: Production I - Jones
  Tue-Ses2-O3 ASR: Spoken Language Understanding - Fallside
  Tue-Ses2-O4 Speaker Diarisation - Holmes
  Tue-Ses2-P1 Speech Analysis and Processing II - Hewison Hall
  Tue-Ses2-P2 Speech Processing with Audio or Audiovisual Input - Hewison Hall
  Tue-Ses2-P3 ASR: Decoding and Confidence Measures - Hewison Hall
  Tue-Ses2-P4 Robust Automatic Speech Recognition I - Hewison Hall
  ISCA Student Advisory Committee
15:30 Tea Break
16:00
  Tue-Ses3-S1 Panel: Speech & Intelligence - Main Hall
  Tue-Ses3-O3 Speaker Verification & Identification I - Fallside
  Tue-Ses3-O4 Text Processing for Spoken Language Generation - Holmes
  Tue-Ses3-P1 Single- and Multi-Channel Speech Enhancement - Hewison Hall
  Tue-Ses3-P2 ASR: Acoustic Modelling - Hewison Hall
  Tue-Ses3-P3 Assistive Speech Technology - Hewison Hall
  Tue-Ses3-P4 Topics in Spoken Language Processing - Hewison Hall
  Tue-Ses3-S2 Special Session: Measuring the Rhythm of Speech - Ainsworth
18:00 End of sessions
18:15 ISCA General Assembly - Main Hall
19:30 Reviewers' Reception - Brighton Pavilion; Student Reception - Al Duomo Restaurant
DAY 3: Wednesday September 9th
Rooms: oral sessions in Main Hall, Jones (East Wing 1), Fallside (East Wing 2) and Holmes (East Wing 3); poster sessions in Hewison Hall; special sessions in Ainsworth (East Wing 4)
08:00 Registration (closes at 17:00)
08:30 Wed-Ses0-K, Plenary Session in Main Hall. Keynote Speaker: Deb Roy, MIT Media Laboratory: New Horizons in the Study of Language Development
09:30 Coffee Break
10:00
  Wed-Ses1-O1 Speaker Verification & Identification II - Main Hall
  Wed-Ses1-O2 Emotion and Expression I - Jones
  Wed-Ses1-O3 ASR: Adaptation II - Fallside
  Wed-Ses1-O4 Voice Transformation I - Holmes
  Wed-Ses1-P1 Phonetics, Phonology, Cross-Language Comparisons, Pathology - Hewison Hall
  Wed-Ses1-P2 Prosody Perception and Language Acquisition - Hewison Hall
  Wed-Ses1-P3 Statistical Parametric Synthesis II - Hewison Hall
  Wed-Ses1-P4 Resources, Annotation and Evaluation - Hewison Hall
  Wed-Ses1-S1 Special Session: Lessons and Challenges Deploying Voice Search - Ainsworth
12:00 Lunch; Interspeech Steering Committee - BCS Room 1; Elsevier Editorial Board Meeting for Speech Communication - BCS Room 3
13:30
  Wed-Ses2-O1 Word-Level Perception - Main Hall
  Wed-Ses2-O2 Applications in Education and Learning - Jones
  Wed-Ses2-O3 ASR: New Paradigms I - Fallside
  Wed-Ses2-O4 Single-Channel Speech Enhancement - Holmes
  Wed-Ses2-P1 Emotion and Expression II - Hewison Hall
  Wed-Ses2-P2 Expression, Emotion and Personality Recognition - Hewison Hall
  Wed-Ses2-P3 Speech Synthesis Methods - Hewison Hall
  Wed-Ses2-P4 LVCSR Systems and Spoken Term Detection - Hewison Hall
  Wed-Ses2-S1 Special Session: Active Listening & Synchrony - Ainsworth
15:30 Tea Break
16:00
  Wed-Ses3-O1 Language Recognition - Main Hall
  Wed-Ses3-O2 Phonetics & Phonology - Jones
  Wed-Ses3-O3 Speech Activity Detection - Fallside
  Wed-Ses3-O4 Multimodal Speech (e.g. Audiovisual Speech, Gesture) - Holmes
  Wed-Ses3-P1 Phonetics - Hewison Hall
  Wed-Ses3-P2 Speaker Verification & Identification III - Hewison Hall
  Wed-Ses3-P3 Robust Automatic Speech Recognition II - Hewison Hall
  Wed-Ses3-P4 Prosody: Production II - Hewison Hall
  Wed-Ses3-S1 Special Session: Machine Learning for Adaptivity in Spoken Dialogue Systems - Ainsworth
18:00 End of sessions
19:30 Revelry at the Racecourse
DAY 4: Thursday September 10th
Rooms: oral sessions in Main Hall, Jones (East Wing 1), Fallside (East Wing 2) and Holmes (East Wing 3); poster sessions in Hewison Hall; special sessions in Ainsworth (East Wing 4)
08:00 Registration (closes at 17:00)
08:30 Thu-Ses0-K, Plenary Session in Main Hall. Keynote Speaker: Mari Ostendorf, University of Washington: Transcribing Speech for Spoken Language Processing
09:30 Coffee Break
10:00
  Thu-Ses1-O1 Robust Automatic Speech Recognition III - Main Hall
  Thu-Ses1-O2 Prosody: Perception - Jones
  Thu-Ses1-O3 Segmentation and Classification - Fallside
  Thu-Ses1-O4 Evaluation & Standardisation of SL Technology and Systems - Holmes
  Thu-Ses1-P1 Speech Coding - Hewison Hall
  Thu-Ses1-P2 Voice Transformation II - Hewison Hall
  Thu-Ses1-P3 ASR: Language Models II - Hewison Hall
  Thu-Ses1-P4 Systems for Spoken Language Understanding - Hewison Hall
  Thu-Ses1-S1 Special Session: New Approaches to Modeling Variability for ASR - Ainsworth
12:00 Lunch; Industrial Lunch - BCS Room 1
13:30
  Thu-Ses2-O1 User Interactions in Spoken Dialog Systems - Main Hall
  Thu-Ses2-O2 Production: Articulation and Acoustics - Jones
  Thu-Ses2-O3 Features for Speech and Speaker Recognition - Fallside
  Thu-Ses2-O4 Speech and Multimodal Resources & Annotation - Holmes
  Thu-Ses2-O5 Speech Analysis and Processing III
  Thu-Ses2-P1 Speaker and Speech Variability, Paralinguistic and Nonlinguistic Cues - Hewison Hall
  Thu-Ses2-P2 ASR: Acoustic Model Features - Hewison Hall
  Thu-Ses2-P3 ASR: Tonal Language, Cross-Lingual and Multilingual ASR - Hewison Hall
  Thu-Ses2-P4 ASR: New Paradigms II - Hewison Hall
15:30 Tea Break
16:00 Closing Ceremony - Main Hall
Abstracts
Mon-Ses1-K : Keynote 1 — Sadaoki Furui
Main Hall, 11:00, Monday 7 Sept 2009
Chair: Isabel Trancoso, INESC-ID Lisboa/IST, Portugal
Selected Topics from 40 Years of Research on
Speech and Speaker Recognition
Sadaoki Furui; Tokyo Institute of Technology, Japan
Mon-Ses1-K-1, Time: 11:00
This paper summarizes my 40 years of research on speech and
speaker recognition, focusing on selected topics that I have investigated at NTT Laboratories, Bell Laboratories and Tokyo Institute
of Technology with my colleagues and students. These topics
include: the importance of spectral dynamics in speech perception;
speaker recognition methods using statistical features, cepstral
features, and HMM/GMM; text-prompted speaker recognition;
speech recognition using dynamic features; Japanese LVCSR; robust speech recognition; spontaneous speech corpus construction
and analysis; spontaneous speech recognition; automatic speech
summarization; and WFST-based decoder development and its
applications.
Tue-Ses0-K : Keynote 2 — Tom Griffiths
Main Hall, 08:30, Tuesday 8 Sept 2009
Chair: Steve Renals, University of Edinburgh, UK
Connecting Human and Machine Learning via
Probabilistic Models of Cognition
Thomas L. Griffiths; University of California at
Berkeley, USA
Tue-Ses0-K-1, Time: 08:30
Human performance defines the standard that machine learning
systems aspire to in many areas, including learning language.
This suggests that studying human cognition may be a good way
to develop better learning algorithms, as well as providing basic
insights into how the human mind works. However, in order for
ideas to flow easily from cognitive science to computer science and
vice versa, we need a common framework for describing human
and machine learning. I will summarize recent work exploring
the hypothesis that probabilistic models of cognition, which
view learning as a form of statistical inference, provide such a
framework, including results that illustrate how novel ideas from
statistics can inform cognitive science. Specifically, I will talk about
how probabilistic models can be used to identify the assumptions
of learners, learn at different levels of abstraction, and link the
inductive biases of individuals to cultural universals.
Thu-Ses0-K : Keynote 4 — Mari Ostendorf
Main Hall, 08:30, Thursday 10 Sept 2009
Chair: Martin Russell, University of Birmingham, UK
Transcribing Human-Directed Speech for Spoken
Language Processing
Mari Ostendorf; University of Washington, USA
Thu-Ses0-K-1, Time: 08:30
As storage costs drop and bandwidth increases, there has been a
rapid growth of spoken information available via the web or in online archives, raising problems of document retrieval, information
extraction, summarization and translation for spoken language.
While there is a long tradition of research in these technologies for
text, new challenges arise when moving from written to spoken
language. In this talk, we look at differences between speech and
text, and how we can leverage the information in the speech signal
beyond the words to provide structural information in a rich,
automatically generated transcript that better serves language
processing applications. In particular, we look at three interrelated
types of structure (orthographic, prosodic, and syntactic), methods
for automatic detection, the benefit of optimizing rich transcription for the target language processing task, and the impact of this
structural information in tasks such as information extraction,
translation, and summarization.
Mon-Ses2-O1 : ASR: Features for Noise
Robustness
Main Hall, 13:30, Monday 7 Sept 2009
Chair: Hynek Hermansky, Johns Hopkins University, USA
Feature Extraction for Robust Speech Recognition
Using a Power-Law Nonlinearity and Power-Bias
Subtraction
Chanwoo Kim, Richard M. Stern; Carnegie Mellon
University, USA
Mon-Ses2-O1-1, Time: 13:30
This paper presents a new feature extraction algorithm called
Power-Normalized Cepstral Coefficients (PNCC) that is based on
auditory processing. Major new features of PNCC processing
include the use of a power-law nonlinearity that replaces the
traditional log nonlinearity used for MFCC coefficients, and a novel
algorithm that suppresses background excitation by estimating
SNR based on the ratio of the arithmetic to geometric mean power,
and subtracts the inferred background power.
Experimental
results demonstrate that the PNCC processing provides substantial
improvements in recognition accuracy compared to MFCC and PLP
processing for various types of additive noise. The computational
cost of PNCC is only slightly greater than that of conventional
MFCC processing.
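To make the two ingredients named above concrete, the sketch below contrasts a log with a power-law compression of channel powers and computes the arithmetic-to-geometric-mean power ratio used as a rough noise indicator. The 1/15 exponent and all surrounding details are assumptions for illustration, not the authors' exact pipeline.

    import numpy as np

    def compress(channel_power, use_power_law=True, exponent=1.0 / 15.0):
        """Log vs. power-law compression of filterbank channel powers."""
        channel_power = np.maximum(channel_power, 1e-10)
        if use_power_law:
            return channel_power ** exponent        # power-law nonlinearity
        return np.log(channel_power)                # MFCC-style log

    def am_gm_ratio(channel_power):
        """Arithmetic-to-geometric mean power ratio: close to 1 for flat,
        noise-like power and larger for peaky, speech-like power."""
        channel_power = np.maximum(channel_power, 1e-10)
        am = channel_power.mean()
        gm = np.exp(np.log(channel_power).mean())
        return am / gm

    frame = np.random.rand(40) ** 2                 # fake 40-channel powers
    print(compress(frame)[:5], am_gm_ratio(frame))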
Towards Fusion of Feature Extraction and Acoustic Model Training: A Top Down Process for Robust Speech Recognition
Yu-Hsiang Bosco Chiu, Bhiksha Raj, Richard M. Stern; Carnegie Mellon University, USA
Mon-Ses2-O1-2, Time: 13:50
This paper presents a strategy to learn physiologically-motivated components in a feature computation module discriminatively, directly from data, in a manner that is inspired by the presence of efferent processes in the human auditory system. In our model a set of logistic functions which represent the rate-level nonlinearities found in most mammalian hearing systems are included as part of the feature extraction process. The parameters of these rate-level functions are estimated to maximize the a posteriori probability of the correct class in the training data. The estimated feature computation is observed to be robust against environmental noise. Experiments conducted with the CMU Sphinx-III system on the DARPA Resource Management task show that the discriminatively estimated rate-nonlinearity results in better performance in the presence of background noise than traditional procedures which separate the feature extraction and model training into two distinct parts without feedback from the latter to the former.
Temporal Modulation Processing of Speech Signals for Noise Robust ASR
Hong You, Abeer Alwan; University of California at Los Angeles, USA
Mon-Ses2-O1-3, Time: 14:10
In this paper, we analyze the temporal modulation characteristics of speech and noise from a speech/non-speech discrimination point of view. Although previous psychoacoustic studies [3][10] have shown that low temporal modulation components are important for speech intelligibility, there is no reported analysis of modulation components from the point of view of speech/noise discrimination. Our data-driven analysis of the modulation components of speech and noise reveals that speech and noise are more accurately classified by low-passed modulation frequencies than band-passed ones. Effects of additive noise on the modulation characteristics of speech signals are also analyzed. Based on the analysis, we propose a frequency-adaptive modulation processing algorithm for a noise robust ASR task. The algorithm is based on speech channel classification and modulation pattern denoising. Speech recognition experiments are performed to compare the proposed algorithm with other noise robust frontends, including RASTA and ETSI AFE. Recognition results show that the frequency-adaptive modulation processing is promising.
Progressive Memory-Based Parametric Non-Linear Feature Equalization
Luz Garcia 1, Roberto Gemello 2, Franco Mana 2, Jose Carlos Segura 1; 1 Universidad de Granada, Spain; 2 Loquendo, Italy
Mon-Ses2-O1-4, Time: 14:30
This paper analyzes the benefits and drawbacks of PEQ (Parametric Non-linear Equalization), a feature normalization technique based on the parametric equalization of the MFCC parameters to match a reference probability distribution. Two limitations have been outlined: the distortion intrinsic to the normalization process and the lack of accuracy in estimating normalization statistics on short sentences. Two evolutions of PEQ are presented as solutions to the limitations encountered. The effects of the proposed evolutions are evaluated on three speech corpora, namely the WSJ0, AURORA-3 and HIWIRE cockpit databases, with different mismatch conditions given by convolutional and/or additive noise and non-native speakers. The obtained results show that the encountered limitations can be overcome by the newly introduced techniques.
Dynamic Features in the Linear Domain for Robust Automatic Speech Recognition in a Reverberant Environment
Osamu Ichikawa, Takashi Fukuda, Ryuki Tachibana, Masafumi Nishimura; IBM Tokyo Research Lab, Japan
Mon-Ses2-O1-5, Time: 14:50
Since the MFCC are calculated from logarithmic spectra, the delta and delta-delta are considered as difference operations in a logarithmic domain. In a reverberant environment, speech signals have trailing reverberations, whose power is plotted as a long-term exponential decay. This means the logarithmic delta value tends to remain large for a long time. This paper proposes a delta feature calculated in the linear domain, due to the rapid decay in reverberant environments. In an experiment using an evaluation framework (CENSREC-4), significant improvements were found in reverberant situations by simply replacing the MFCC dynamic features with the proposed dynamic features.
Local Projections and Support Vector Based Feature Selection in Speech Recognition
Antonio Miguel, Alfonso Ortega, L. Buera, Eduardo Lleida; Universidad de Zaragoza, Spain
Mon-Ses2-O1-6, Time: 15:10
In this paper we study a method to provide noise robustness in mismatch conditions for speech recognition using local frequency projections and feature selection. Local time-frequency filtering patterns have been used previously to provide noise robust features and a simpler feature set to apply reliability weighting techniques. The proposed method combines two techniques to select the feature set: first, a reliability metric based on information theory and, second, a support vector set to reduce the errors. The support vector set provides the most representative examples which have influence on the error rate in mismatch conditions, so that only the features which incorporate implicit robustness to mismatch are selected. Some experimental results are obtained with this method compared to baseline systems using the Aurora 2 database.
Mon-Ses2-O2 : Production: Articulatory Modelling
Jones (East Wing 1), 13:30, Monday 7 Sept 2009
Chair: Rob van Son, University of Amsterdam, The Netherlands
Feedforward Control of a 3D Physiological Articulatory Model for Vowel Production
Qiang Fang 1, Akikazu Nishikido 2, Jianwu Dang 2, Aijun Li 1; 1 Chinese Academy of Social Sciences, China; 2 JAIST, Japan
Mon-Ses2-O2-1, Time: 13:30
A 3D physiological articulatory model has been developed to account for the biomechanical properties of speech organs in speech production. To control the model for investigating the
mechanism of speech production, a feedforward control strategy
is necessary to generate proper muscle activations according
to desired articulatory targets. In this paper, we elaborated a
feedforward control module for the 3D physiological articulatory
model. In the feedforward control process, an input articulatory target, specified by articulatory parameters, is first transformed to an intrinsic representation of articulation, and then to a muscle activation pattern by a proposed mapping function. The results show that
the proposed feedforward control strategy is able to control the
proposed 3D physiological articulatory model with high accuracy
both acoustically and articulatorily.
Articulatory Modeling Based on Semi-Polar
Coordinates and Guided PCA Technique
Jun Cai 1 , Yves Laprie 1 , Julie Busset 1 , Fabrice Hirsch 2 ;
1 LORIA, France; 2 Institut de Phonétique de Strasbourg, France
Mon-Ses2-O2-2, Time: 13:50
Research on 2-dimensional static articulatory modeling has been
performed by using the semi-polar system and the guided PCA
analysis of lateral X-ray images of vocal tract. The density of the
grid lines in the semi-polar system has been increased to have a
better descriptive precision. New parameters have been introduced
to describe the movements of the tongue apex. An extra feature, the tongue root, has been extracted as one of the elementary factors in order to improve the precision of the tongue model. New methods still remain to be developed for describing the movements of the tongue apex.
Sequencing of Articulatory Gestures Using Cost
Optimization
Juraj Simko, Fred Cummins; University College Dublin,
Ireland
Mon-Ses2-O2-3, Time: 14:10
Within the framework of articulatory phonology (AP), gestures
function as primitives, and their ordering in time is provided by
a gestural score. Determining how they should be sequenced
in time has been something of a challenge. We modify the task
dynamic implementation of AP, by defining tasks to be the desired
positions of physically embodied end effectors. This allows us
to investigate the optimal sequencing of gestures based on a
parametric cost function. Costs evaluated include precision of
articulation, articulatory effort, and gesture duration. We find that
a simple optimization using these costs results in stable gestural
sequences that reproduce several known coarticulatory effects.
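As a toy version of the cost-based sequencing idea above, the sketch below scores a single gesture-onset time with a weighted sum of precision, effort and duration terms and picks the minimum by grid search. The cost terms, weights and numbers are invented for illustration, not the parametric cost function of the paper.

    import numpy as np

    def gesture_cost(onset, target_time=0.30, duration=0.12,
                     w_precision=1.0, w_effort=0.5, w_duration=0.2):
        """Toy parametric cost for one gesture (all terms are assumptions)."""
        precision = (onset + duration - target_time) ** 2   # missing the target
        effort = 1.0 / duration                             # faster = more effort
        return w_precision * precision + w_effort * effort + w_duration * duration

    onsets = np.linspace(0.0, 0.4, 81)
    best = min(onsets, key=gesture_cost)
    print("best onset (s):", round(float(best), 3))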
From Experiments to Articulatory Motion — A Three
Dimensional Talking Head Model
Xiao Bo Lu 1 , William Thorpe 1 , Kylie Foster 2 , Peter
Hunter 1 ; 1 University of Auckland, New Zealand;
2 University of Massey, New Zealand
Mon-Ses2-O2-4, Time: 14:30
The goal of this study is to develop a customised computer model that can accurately represent the motion of vocal articulators during vowels and consonants. Models of the articulators were constructed as Finite Element (FE) meshes based on digitised high-resolution MRI (Magnetic Resonance Imaging) scans obtained during quiet breathing. Articulatory kinematics during speaking were obtained by EMA (Electromagnetic Articulography) and video of the face. The movement information thus acquired was applied to the FE model to provide jaw motion, modeled as a rigid body, and tongue, cheek and lip movements modeled with a free-form deformation technique. The motion of the epiglottis has also been considered in the model.
Towards Robust Glottal Source Modeling
Javier Pérez, Antonio Bonafonte; Universitat Politècnica de Catalunya, Spain
Mon-Ses2-O2-5, Time: 14:50
We present here a new method for the simultaneous estimation of the derivative glottal waveform and the vocal tract filter. The algorithm is pitch-synchronous and uses overlapping frames of several glottal cycles to increase the robustness and quality of the estimation. Two parametric models for the glottal waveform are used: the KLGLOTT88 during the convex optimization iteration, and the LF model for the final parametrization. We use a synthetic corpus using real data published in several studies to evaluate the performance. A second corpus has been specially recorded for this work, consisting of isolated vowels uttered with different voice qualities. The algorithm has been found to perform well with most of the voice qualities present in the synthetic data-set in terms of glottal waveform matching. The performance is also good with the real vowel data-set in terms of resynthesis quality.
Sliding Vocal-Tract Model and its Application for Vowel Production
Takayuki Arai; Sophia University, Japan
Mon-Ses2-O2-6, Time: 15:10
In a previous study, Arai implemented a sliding vocal-tract model based on Fant's three-tube model and demonstrated its usefulness for education in acoustics and speech science. The sliding vocal-tract model consists of a long outer cylinder and a short inner cylinder, which simulates tongue constriction in the vocal tract. This model can produce different vowels by sliding the inner cylinder and changing the degree of constriction. In this study, we investigated the model's coverage of vowels on the vowel space and explored its application for vowel production in the speech and hearing sciences.
Mon-Ses2-O3 : Systems for LVCSR and Rich Transcription
Fallside (East Wing 2), 13:30, Monday 7 Sept 2009
Chair: Thomas Schaaf, Multimodal Technologies Inc., USA
Minimum Hypothesis Phone Error as a Decoding Method for Speech Recognition
Haihua Xu 1, Daniel Povey 2, Jie Zhu 1, Guanyong Wu 1; 1 Shanghai Jiao Tong University, China; 2 Microsoft Research, USA
Mon-Ses2-O3-1, Time: 13:30
In this paper we show how methods for approximating phone error, as normally used for Minimum Phone Error (MPE) discriminative training, can be used instead as a decoding criterion for lattice rescoring. This is an alternative to Confusion Networks (CN), which are commonly used in speech recognition. The standard (Maximum A Posteriori) decoding approach is a Minimum Bayes Risk estimate with respect to the Sentence Error Rate (SER); however, we are typically more interested in the Word Error Rate (WER). Methods such as CN and our proposed Minimum Hypothesis Phone Error (MHPE) aim to get closer to minimizing the expected WER. Based on preliminary experiments we find that our approach gives more improvement than CN, and is conceptually simpler.
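The sketch below illustrates the general Minimum Bayes Risk idea that the abstract above starts from: pick, from an N-best list with posterior weights, the hypothesis with the lowest expected word-level edit distance to the others. It is a generic N-best approximation for illustration, not the lattice-based MHPE criterion itself.

    import numpy as np

    def edit_distance(a, b):
        """Word-level Levenshtein distance between two word lists."""
        d = np.zeros((len(a) + 1, len(b) + 1), dtype=int)
        d[:, 0] = np.arange(len(a) + 1)
        d[0, :] = np.arange(len(b) + 1)
        for i in range(1, len(a) + 1):
            for j in range(1, len(b) + 1):
                d[i, j] = min(d[i - 1, j] + 1, d[i, j - 1] + 1,
                              d[i - 1, j - 1] + (a[i - 1] != b[j - 1]))
        return d[len(a), len(b)]

    def min_expected_wer(nbest, posteriors):
        """Return the hypothesis minimising posterior-weighted edit distance."""
        risks = [sum(p * edit_distance(h, r) for r, p in zip(nbest, posteriors))
                 for h in nbest]
        return nbest[int(np.argmin(risks))]

    nbest = [["the", "cat", "sat"], ["the", "cat", "sag"], ["a", "cat", "sat"]]
    posteriors = [0.5, 0.3, 0.2]
    print(min_expected_wer(nbest, posteriors))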
Posterior-Based Out of Vocabulary Word Detection in Telephone Speech
Stefan Kombrink 1, Lukáš Burget 1, Pavel Matějka 1, Martin Karafiát 1, Hynek Hermansky 2; 1 Brno University of Technology, Czech Republic; 2 Johns Hopkins University, USA
Mon-Ses2-O3-2, Time: 13:50
In this paper we present an out-of-vocabulary word detector suitable for English conversational and read speech. We use an approach based on phone posteriors created by a Large Vocabulary Continuous Speech Recognition system and an additional phone recognizer, which allows detection of OOV and misrecognized words. In addition, the recognized word output can be transcribed in more detail using several classes. Reported results are on CallHome English and Wall Street Journal data.
Automatic Transcription System for Meetings of the Japanese National Congress
Yuya Akita, Masato Mimura, Tatsuya Kawahara; Kyoto University, Japan
Mon-Ses2-O3-3, Time: 14:10
This paper presents an automatic speech recognition (ASR) system for assisting meeting record creation of the National Congress of Japan. The system is designed to cope with the spontaneous characteristics of meeting speech, as well as a variety of topics and speakers. For the acoustic model, minimum phone error (MPE) training is applied with several normalization techniques. For the language model, we have proposed statistical style transformation to generate spoken-style N-grams and their statistics. We also introduce statistical modeling of pronunciation variation in spontaneous speech. The ASR system was evaluated on real congressional meetings, and achieved a word accuracy of 84%. It is also suggested that ASR-based transcripts at this accuracy level are usable for editing meeting records.
Cross-Language Bootstrapping for Unsupervised Acoustic Model Training: Rapid Development of a Polish Speech Recognition System
Jonas Lööf, Christian Gollan, Hermann Ney; RWTH Aachen University, Germany
Mon-Ses2-O3-4, Time: 14:30
This paper describes the rapid development of a Polish language speech recognition system. The system development was performed without access to any transcribed acoustic training data. This was achieved through the combined use of cross-language bootstrapping and confidence-based unsupervised acoustic model training. A Spanish acoustic model was ported to Polish through the use of a manually constructed phoneme mapping. This initial model was refined through iterative recognition and retraining of the untranscribed audio data. The system was trained and evaluated on recordings from the European Parliament, and included several state-of-the-art speech recognition techniques in addition to the use of unsupervised model training. Confidence-based speaker adaptive training using feature space transform adaptation, as well as vocal tract length normalization and maximum likelihood linear regression, was used to refine the acoustic model. Through the combination of the different techniques, good performance was achieved on the domain of parliamentary speeches.
Porting an European Portuguese Broadcast News Recognition System to Brazilian Portuguese
Alberto Abad 1, Isabel Trancoso 2, Nelson Neto 3, M. Céu Viana 4; 1 INESC-ID Lisboa, Portugal; 2 INESC-ID Lisboa/IST, Portugal; 3 Federal University of Pará, Brazil; 4 CLUL, Portugal
Mon-Ses2-O3-5, Time: 14:50
This paper reports on recent work in the context of the activities of the PoSTPort project, which aims at porting a Broadcast News recognition system originally developed for European Portuguese to other varieties. Concretely, in this paper we have focused on porting to Brazilian Portuguese. The impact of some of the main sources of variability has been assessed, and solutions have been proposed at the lexical, acoustic and syntactic levels. The ported Brazilian Portuguese Broadcast News system allowed a drastic performance improvement from 56.6% WER (obtained with the European Portuguese system) to 25.5%.
Modeling Northern and Southern Varieties of Dutch for STT
Julien Despres 1, Petr Fousek 2, Jean-Luc Gauvain 2, Sandrine Gay 1, Yvan Josse 1, Lori Lamel 2, Abdel Messaoudi 2; 1 Vecsys Research, France; 2 LIMSI, France
Mon-Ses2-O3-6, Time: 15:10
This paper describes how the Northern (NL) and Southern (VL) varieties of Dutch are modeled in the joint Limsi-Vecsys Research speech-to-text transcription systems for broadcast news (BN) and conversational telephone speech (CTS). Using the Spoken Dutch Corpus resources (CGN), systems were developed and evaluated in the 2008 N-Best benchmark. Modeling techniques that are used in our systems for other languages were found to be effective for the Dutch language; however, it was also found to be important to have acoustic and language models, and statistical pronunciation generation rules, adapted to each variety. This was in particular true for the MLP features, which were only effective when trained separately for Dutch and Flemish. The joint submissions obtained the lowest WERs in the benchmark by a significant margin.
Mon-Ses2-O4 : Speech Analysis and Processing I
Holmes (East Wing 3), 13:30, Monday 7 Sept 2009
Chair: Bernd Möbius, Universität Stuttgart, Germany
Nearly Perfect Detection of Continuous F0 Contour and Frame Classification for TTS Synthesis
Thomas Ewender, Sarah Hoffmann, Beat Pfister; ETH Zürich, Switzerland
Mon-Ses2-O4-1, Time: 13:30
We present a new method for the estimation of a continuous fundamental frequency (F0) contour. The algorithm implements a global optimization and yields virtually error-free F0 contours for high quality speech signals. Such F0 contours are subsequently used to extract a continuous fundamental wave. Some local properties of this wave, together with a number of other speech features, allow the frames of a speech signal to be classified into five classes: voiced, unvoiced, mixed, irregularly glottalized and silence. The presented F0 detection and frame classification can be applied to F0 modeling and prosodic modification of speech segments in high-quality concatenative speech synthesis.
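A far simpler stand-in for the F0 detection and frame classification described above is sketched below: a per-frame autocorrelation pitch estimate plus an energy/periodicity based three-way frame label. The thresholds and the method are assumptions; the paper's global optimization and five-class scheme are not reproduced here.

    import numpy as np

    def classify_frame(frame, sample_rate=16000,
                       energy_floor=1e-4, periodicity_threshold=0.4):
        """Label one frame as 'silence', 'unvoiced' or 'voiced' (toy heuristic)."""
        energy = float(np.mean(frame ** 2))
        if energy < energy_floor:
            return "silence", 0.0
        frame = frame - frame.mean()
        ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
        ac /= ac[0] + 1e-12
        lag_min = sample_rate // 400          # 400 Hz upper bound on F0
        lag_max = sample_rate // 60           # 60 Hz lower bound on F0
        lag = lag_min + int(np.argmax(ac[lag_min:lag_max]))
        if ac[lag] < periodicity_threshold:
            return "unvoiced", 0.0
        return "voiced", sample_rate / lag

    t = np.arange(400) / 16000.0              # one 25 ms frame
    print(classify_frame(0.1 * np.sin(2 * np.pi * 120 * t)))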
AM-FM Estimation for Speech Based on a Time-Varying Sinusoidal Model
Yannis Pantazis 1, Olivier Rosec 2, Yannis Stylianou 1; 1 FORTH, Greece; 2 Orange Labs, France
Mon-Ses2-O4-2, Time: 13:50
In this paper we present a method based on a time-varying sinusoidal model for a robust and accurate estimation of amplitude and frequency modulations (AM-FM) in speech. The suggested approach has two main steps. First, speech is modeled as a sinusoidal model with time-varying amplitudes. Specifically, the model makes use of a first order time polynomial with complex coefficients for capturing instantaneous amplitude and frequency (phase) components. Next, the model parameters are updated by using the previously estimated instantaneous phase information. Thus, an iterative scheme for AM-FM decomposition of speech is suggested, which was validated on synthetic AM-FM signals and tested on reconstruction of voiced speech signals, where the signal-to-error reconstruction ratio (SERR) was used as the measure. Compared to the standard sinusoidal representation, the suggested approach was found to improve the corresponding SERR by 47%, resulting in over 30 dB of SERR.
Voice Source Waveform Analysis and Synthesis Using Principal Component Analysis and Gaussian Mixture Modelling
Jon Gudnason 1, Mark R.P. Thomas 1, Patrick A. Naylor 1, Dan P.W. Ellis 2; 1 Imperial College London, UK; 2 Columbia University, USA
Mon-Ses2-O4-3, Time: 14:10
The paper presents a voice source waveform modeling technique based on principal component analysis (PCA) and Gaussian mixture modeling (GMM). The voice source is obtained by inverse-filtering speech with the estimated vocal tract filter. This decomposition is useful in speech analysis, synthesis, recognition and coding. Existing models of the voice source signal are based on function-fitting or physically motivated assumptions and, although they are well defined, estimation of their parameters is not well understood and few are capable of reproducing the large variety of voice source waveforms. Here, a data-driven approach is presented for signal decomposition and classification based on the principal components of the voice source. The principal components are analyzed and the 'prototype' voice source signals corresponding to the Gaussian mixture means are examined. We show how an unknown signal can be decomposed into its components and/or prototypes and resynthesized. We show how the techniques are suited for both low bitrate and high quality analysis/synthesis schemes.
Model-Based Estimation of Instantaneous Pitch in Noisy Speech
Jung Ook Hong, Patrick J. Wolfe; Harvard University, USA
Mon-Ses2-O4-4, Time: 14:30
In this paper we propose a model-based approach to instantaneous pitch estimation in noisy speech, by way of incorporating pitch smoothness assumptions into the well-known harmonic model. In this approach, the latent pitch contour is modeled using a basis of smooth polynomials, and is fit to waveform data by way of a harmonic model whose partials have time-varying amplitudes. The resultant nonlinear least squares estimation task is accomplished through the Gauss-Newton method with a novel initialization step that serves to greatly increase algorithm efficiency. We demonstrate the accuracy and robustness of our method through comparisons to state-of-the-art pitch estimation algorithms using both simulated and real waveform data.
Complex Cepstrum-Based Decomposition of Speech for Glottal Source Estimation
Thomas Drugman 1, Baris Bozkurt 2, Thierry Dutoit 1; 1 Faculté Polytechnique de Mons, Belgium; 2 Izmir Institute of Technology, Turkey
Mon-Ses2-O4-5, Time: 14:50
Homomorphic analysis is a well-known method for the separation of non-linearly combined signals. More particularly, the use of the complex cepstrum for source-tract deconvolution has been discussed in various articles. However, there exists no study which proposes a glottal flow estimation methodology based on the cepstrum and reports effective results. In this paper, we show that the complex cepstrum can be effectively used for glottal flow estimation by separating the causal and anticausal components of a windowed speech signal, as done by the Zeros of the Z-Transform (ZZT) decomposition. Based on exactly the same principles presented for ZZT decomposition, windowing should be applied such that the windowed speech signals exhibit mixed-phase characteristics which conform to the speech production model in which the anticausal component is mainly due to the glottal flow open phase. The advantage of the complex cepstrum-based approach compared to the ZZT decomposition is its much higher speed.
Approximate Intrinsic Fourier Analysis of Speech
Frank Tompkins, Patrick J. Wolfe; Harvard University, USA
Mon-Ses2-O4-6, Time: 15:10
Popular parametric models of speech sounds such as the source-filter model provide a fixed means of describing the variability inherent in speech waveform data. However, nonlinear dimensionality reduction techniques such as the intrinsic Fourier analysis method of Jansen and Niyogi provide a more flexible means of adaptively estimating such structure directly from data. Here we employ this approach to learn a low-dimensional manifold whose geometry is meant to reflect the structure implied by the human speech production system. We derive a novel algorithm to efficiently learn this manifold for the case of many training examples, the setting of both greatest practical interest and computational difficulty. We then demonstrate the utility of our method by way of a proof-of-concept phoneme identification system that operates effectively in the intrinsic Fourier domain.
Mon-Ses2-P1 : Speech Perception I
Hewison Hall, 13:30, Monday 7 Sept 2009
Chair: Paul Boersma, University of Amsterdam, The Netherlands
Relative Importance of Formant and Whole-Spectral Cues for Vowel Perception
Masashi Ito, Keiji Ohara, Akinori Ito, Masafumi Yano; Tohoku University, Japan
Mon-Ses2-P1-1, Time: 13:30
Three psycho-acoustical experiments were carried out to investigate the relative importance of formant frequency and whole spectral shape as cues for vowel perception. Four types of vowel-like signals were presented to eight listeners. The mean responses for stimuli including both formant and amplitude-ratio feature were quite similar to those for the stimuli including only formant peak
feature. Nonetheless reasonable vowel changes were observed
in responses for stimuli including only amplitude-ratio feature.
The perceived vowel changes were also observed even for stimuli
including neither of these features. The results suggested that
perceptual cues were involved in various parts of vowel spectrum.
Influences of Vowel Duration on Speaker-Size Estimation and Discrimination
Chihiro Takeshima 1, Minoru Tsuzaki 1, Toshio Irino 2; 1 Kyoto City University of Arts, Japan; 2 Wakayama University, Japan
Mon-Ses2-P1-2, Time: 13:30
Several experimental studies have shown that the human auditory system has a mechanism for extracting speaker-size information, using sufficiently long sounds. This paper investigated the influence of vowel duration on the processing for size extraction using short vowels. In a size estimation experiment, listeners subjectively estimated the size (height) of the speaker for isolated vowels. The results showed that listeners' perception of speaker size was highly correlated with the factor of vocal-tract length in all the tested durations (from 16 ms to 256 ms). In a size discrimination experiment, listeners were presented with two vowels scaled in vocal-tract length and were asked which vowel was perceived to be spoken by a smaller speaker. The results showed that the just-noticeable differences (JNDs) in speaker size were almost the same for durations longer than 32 ms. However, the JNDs rose considerably for the 16-ms duration. These observations suggest that the auditory system can extract speaker-size information even for 16-ms vowels, although the precision of size extraction would deteriorate when the duration becomes less than 32 ms.
High Front Vowels in Czech: A Contrast in Quantity or Quality?
Václav Jonáš Podlipský 1, Radek Skarnitzl 2, Jan Volín 2; 1 Palacký University Olomouc, Czech Republic; 2 Charles University in Prague, Czech Republic
Mon-Ses2-P1-3, Time: 13:30
We investigate the perception and production of Czech /I/ and /i:/, a contrast traditionally described as quantitative. First, we show that the spectral difference between the vowels is for many Czechs as strong a cue as (or even stronger than) duration. Second, we test the hypothesis that this shift towards vowel quality as a perceptual cue for this contrast has resulted in a weakening of the durational differentiation in production. Our measurements confirm this: members of the /I/-/i:/ pair differed in duration much less than those of other short-long pairs. We interpret these findings in terms of Lindblom's H&H theory.
Effect of Contralateral Noise on Energetic and Informational Masking on Speech-in-Speech Intelligibility
Marjorie Dole 1, Michel Hoen 2, Fanny Meunier 1; 1 DDL, France; 2 SBRI, France
Mon-Ses2-P1-4, Time: 13:30
This experiment tested the advantage of binaural presentation of an interfering noise in a task involving identification of monaurally-presented words. These words were embedded in three types of noise: a stationary noise, a speech-modulated noise and a speech-babble noise, in order to assess the energetic and informational masking contributions to binaural unmasking. Our results showed important informational masking in the monaural condition, principally due to lexical and phonetic competition. We also found a binaural unmasking effect, which was more important when speech was used as the interferer, suggesting that this suppressive effect is more efficient in the case of high-level informational (lexical and phonetic) competition.
Using Location Cues to Track Speaker Changes from Mobile, Binaural Microphones
Heidi Christensen, Jon Barker; University of Sheffield, UK
Mon-Ses2-P1-5, Time: 13:30
This paper presents initial developments towards computational hearing models that move beyond stationary microphone assumptions. We present a particle filtering based system for using localisation cues to track speaker changes in meeting recordings. Recordings are made using in-ear binaural microphones worn by a listener whose head is constantly moving. Tracking speaker changes requires simultaneously inferring the perceiver's head orientation, as any change in relative spatial angle to a source can be caused by either the source moving or the microphones moving. In real applications, such as robotics, there may be access to external estimates of the perceiver's position. We investigate the effect of simulating varying degrees of measurement noise in an external perceiver position estimate. We show that only limited self-position knowledge is needed to greatly improve the reliability with which we can decode the acoustic localisation cues in the meeting scenario.
A Perceptual Investigation of Speech Transcription Errors Involving Frequent Near-Homophones in French and American English
Ioana Vasilescu 1, Martine Adda-Decker 1, Lori Lamel 1, Pierre Hallé 2; 1 LIMSI, France; 2 LPP, France
Mon-Ses2-P1-6, Time: 13:30
This article compares the errors made by automatic speech recognizers to those made by humans for near-homophones in American English and French. This exploratory study focuses on the impact of limited word context and the potential resulting ambiguities for automatic speech recognition (ASR) systems and human listeners. Perceptual experiments using 7-gram chunks centered on incorrect or correct words output by an ASR system show that humans make significantly more transcription errors on the first type of stimuli, thus highlighting the local ambiguity. The long-term aim of this study is to improve the modeling of such ambiguous items in order to reduce ASR errors.
The Role of Glottal Pulse Rate and Vocal Tract Length in the Perception of Speaker Identity
Etienne Gaudrain, Su Li, Vin Shen Ban, Roy D. Patterson; University of Cambridge, UK
Mon-Ses2-P1-7, Time: 13:30
In natural speech, for a given speaker, vocal tract length (VTL) is effectively fixed, whereas glottal pulse rate (GPR) is varied to indicate prosodic distinctions. This suggests that VTL will be a more reliable cue for identifying a speaker than GPR. It also suggests that listeners will accept larger changes in GPR before perceiving a speaker change. We measured the effect of GPR and VTL on the perception of a speaker difference, and found that listeners hear different speakers given a VTL difference of 25%, but they require a GPR difference of 45%.
Development of Voicing Categorization in Deaf
Children with Cochlear Implant
Victoria Medina, Willy Serniclaes; LPP, France
Mon-Ses2-P1-8, Time: 13:30
Cochlear implant (CI) improves hearing but communication abilities still depend on several factors. The present study assesses
the development of voicing categorization in deaf children with
cochlear implant, examining both categorical perception (CP) and
boundary precision (BP) performances. We compared 22 implanted
children to 55 normal-hearing children using different age factors.
The results showed that the development of voicing perception in
CI children is fairly similar to that in normal-hearing controls with
the same auditory experience and irrespective of differences in the
age of implantation (two vs. three years of age).
Processing Liaison-Initial Words in Native and
Non-Native French: Evidence from Eye Movements
Annie Tremblay; University of Illinois at
Urbana-Champaign, USA
Mon-Ses2-P1-9, Time: 13:30
French listeners have no difficulty recognizing liaison-initial words.
This is in part because acoustic/phonetic information distinguishes liaison consonants from (non-resyllabified) word onsets
in the speech signal. Using eye tracking, this study investigates
whether native speakers of English, a language that does not have
a phonological resyllabification process like liaison, can develop
target-like segmentation procedures for recognizing liaison-initial
words in French, and if so, how such procedures develop with
increasing proficiency.
Estimating the Potential of Signal and
Interlocutor-Track Information for Language
Modeling
Nigel G. Ward, Benjamin H. Walker; University of Texas at El Paso, USA
Mon-Ses2-P1-10, Time: 13:30
Although today most language models treat language purely as word sequences, there is recurring interest in tapping new sources of information, such as disfluencies, prosody, the interlocutor's dialog act, and the interlocutor's recent words. In order to estimate the potential value of such sources of information, we extend Shannon's guessing-game method for estimating entropy to work for spoken dialog. Four teams of two subjects each predicted the next word in a dialog using various amounts of context: one word, two words, all the words spoken so far, or the full dialog audio so far. The entropy benefit in the full-audio condition over the full-text condition was substantial, .64 bits per word, greater than the .54 bit benefit of full text context over trigrams. This suggests that language models may be improved by use of the prosody of the speaker and context from the interlocutor.
Mon-Ses2-P2 : Accent and Language Recognition
Hewison Hall, 13:30, Monday 7 Sept 2009
Chair: William Campbell, MIT, USA
Factor Analysis and SVM for Language Recognition
Florian Verdet 1, Driss Matrouf 1, Jean-François Bonastre 1, Jean Hennebert 2; 1 LIA, France; 2 Université de Fribourg, Switzerland
Mon-Ses2-P2-1, Time: 13:30
Statistical classifiers operate on features that generally include both useful and useless information. These two types of information are difficult to separate in the feature domain. Recently, a new paradigm based on Factor Analysis (FA) proposed a model decomposition into useful and useless components. This method has successfully been applied to speaker recognition tasks. In this paper, we study the use of FA for language recognition. We propose a classification method based on SDC features and Gaussian Mixture Models (GMM). We present well-performing systems using Factor Analysis and FA-based Support Vector Machine (SVM) classifiers. Experiments are conducted using the NIST LRE 2005 primary condition. The relative equal error rate reduction obtained by the best factor analysis configuration with respect to the baseline GMM-UBM system is over 60%, corresponding to an EER of 6.59%.
Exploring Universal Attribute Characterization of Spoken Languages for Spoken Language Recognition
Sabato Marco Siniscalchi 1, Jeremy Reed 2, Torbjørn Svendsen 1, Chin-Hui Lee 2; 1 NTNU, Norway; 2 Georgia Institute of Technology, USA
Mon-Ses2-P2-2, Time: 13:30
We propose a novel universal acoustic characterization approach to spoken language identification (LID), in which any spoken language is described with a common set of fundamental units defined "universally." Specifically, manner and place of articulation form this unit inventory and are used to build a set of universal attribute models with data-driven techniques. Using the vector space modeling approach to LID, a spoken utterance is first decoded into a sequence of attributes. Then, a feature vector consisting of co-occurrence statistics of attribute units is created, and the final LID decision is implemented with a set of vector space language classifiers. Although the present study is just in its preliminary stage, promising results comparable to acoustically rich phone-based LID systems have already been obtained on the NIST 2003 LID task. The results provide clear insight for further performance improvements and encourage a continuing exploration of the proposed framework.
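The sketch below shows the vector-space idea above in miniature: turn a decoded attribute sequence into a bigram co-occurrence count vector and score it against per-language centroid vectors by cosine similarity. The attribute inventory, the bigram choice and the centroid classifier are illustrative assumptions, not the paper's exact setup.

    import numpy as np

    ATTRS = ["stop", "fricative", "nasal", "vowel", "glide"]   # toy inventory
    INDEX = {(a, b): i for i, (a, b) in
             enumerate((a, b) for a in ATTRS for b in ATTRS)}

    def bigram_vector(sequence):
        """Co-occurrence (bigram) count vector for a decoded attribute string."""
        v = np.zeros(len(INDEX))
        for a, b in zip(sequence, sequence[1:]):
            v[INDEX[(a, b)]] += 1.0
        norm = np.linalg.norm(v)
        return v / norm if norm > 0 else v

    def classify(sequence, centroids):
        """Pick the language whose centroid has the highest cosine similarity."""
        v = bigram_vector(sequence)
        return max(centroids, key=lambda lang: float(v @ centroids[lang]))

    # Hypothetical per-language centroids built from training utterances.
    centroids = {"lang1": bigram_vector(["vowel", "stop", "vowel", "nasal"]),
                 "lang2": bigram_vector(["fricative", "vowel", "glide", "vowel"])}
    print(classify(["vowel", "stop", "vowel"], centroids))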
On the Use of Phonological Features for Automatic
Accent Analysis
Abhijeet Sangwan, John H.L. Hansen; University of
Texas at Dallas, USA
Mon-Ses2-P2-3, Time: 13:30
In this paper, we present an automatic accent analysis system
that is based on phonological features (PFs). The proposed system
exploits the knowledge of articulation embedded in phonology by
rapidly building Markov models (MMs) of PFs extracted from accented
speech. The Markov models capture information in the PF space
along two dimensions of articulation: PF state-transitions and
state-durations. Furthermore, by utilizing MMs of native and
non-native accents a new statistical measure of “accentedness”
is developed which rates the articulation of a word on a scale of
native-like (-1) to non-native like (+1). The proposed methodology
is then used to perform an automatic cross-sectional study of
accented English spoken by native speakers of Mandarin Chinese
(N-MC). The experimental results demonstrate the capability of
the proposed system to rapidly perform quantitative as well as
qualitative analysis of foreign accents. The work developed in this
paper is easily assimilated into language learning systems, and has
impact in the areas of speaker and speech recognition.
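The "accentedness" scale described above can be pictured as a normalised comparison of how well a native and a non-native model explain a word. The sketch below is one hedged guess at such a scoring step: the tanh squashing and the inputs are assumptions, and the Markov models of phonological features themselves are not implemented here.

    import math

    def accentedness(loglik_native, loglik_nonnative, num_frames):
        """Map two per-word model log-likelihoods to a score in [-1, +1]:
        -1 ~ native-like, +1 ~ non-native-like (illustrative formula only)."""
        if num_frames <= 0:
            raise ValueError("num_frames must be positive")
        per_frame_diff = (loglik_nonnative - loglik_native) / num_frames
        return math.tanh(per_frame_diff)

    # Hypothetical scores for one word, 42 frames long.
    print(round(accentedness(-310.0, -295.0, 42), 3))   # leans non-native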
Language Recognition Using Language Factors
Fabio Castaldo 1 , Sandro Cumani 1 , Pietro Laface 1 ,
Daniele Colibro 2 ; 1 Politecnico di Torino, Italy;
2 Loquendo, Italy
Mon-Ses2-P2-4, Time: 13:30
Language recognition systems based on acoustic models reach
state of the art performance using discriminative training techniques.
In speaker recognition, eigenvoice modeling of the speaker, and
the use of speaker factors as input features to SVMs has recently
been demonstrated to give good results compared to the standard
GMM-SVM approach, which combines GMMs supervectors and
SVMs.
In this paper we propose, in analogy to the eigenvoice modeling
approach, to estimate an eigen-language space, and to use the language factors as input features to SVM classifiers. Since language
factors are low-dimension vectors, training and evaluating SVMs
with different kernels and with large training examples becomes
an easy task.
This approach is demonstrated on the 14 languages of the NIST
2007 language recognition task, and shows performance improvements with respect to the standard GMM-SVM technique.
Automatic Accent Detection: Effect of Base Units
and Boundary Information
Je Hun Jeon, Yang Liu; University of Texas at Dallas,
USA
Mon-Ses2-P2-5, Time: 13:30
Automatic prominence or pitch accent detection is important as it
can perform automatic prosodic annotation of speech corpora, as
well as provide additional features in other tasks such as keyword
detection. In this paper, we evaluate how accent detection performance changes according to different base units and what kind
of boundary information is available. We compare word, syllable,
and vowel-based units when their boundaries are provided. We
also automatically estimate syllable boundaries using energy
contours when phone-level alignment is available. In addition, we
utilize a sliding window with fixed length under the condition of
unknown boundaries. Our experiments show that when boundary
information is available, using longer base unit achieves better
performance. In the case of no boundary information, using a
moving window with a fixed size achieves similar performance to
using syllable information on word-level evaluation, suggesting
that accent detection can be performed without relying on a speech
recognizer to generate boundaries.
Age Verification Using a Hybrid Speech Processing
Approach
Ron M. Hecht 1 , Omer Hezroni 1 , Amit Manna 1 , Ruth
Aloni-Lavi 1 , Gil Dobry 1 , Amir Alfandary 1 , Yaniv
Zigel 2 ; 1 PuddingMedia, Israel; 2 Ben-Gurion University
of the Negev, Israel
Mon-Ses2-P2-6, Time: 13:30
The human speech production system is a multi-level system.
On the upper level, it starts with information that one wants to
transmit. It ends on the lower level with the materialization of
the information into a speech signal. Most of the recent work
conducted in age estimation is focused on the lower-acoustic level.
In this research the upper lexical level information is utilized for
age-group verification and it is shown that one’s vocabulary reflects
one’s age. Several age-group verification systems that are based on
automatic transcripts are proposed. In addition, a hybrid approach
is introduced, an approach that combines the word-based system
and an acoustic-based system. Experiments were conducted on a
four age-groups verification task using the Fisher corpora, where
an average equal error rate (EER) of 28.7% was achieved using the
lexical-based approach and 28.0% using an acoustic approach. By
merging these two approaches the verification error was reduced
to 24.1%.
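The hybrid combination mentioned above amounts to fusing a lexical score and an acoustic score before thresholding. The sketch below shows a plain weighted-sum fusion; the weight and threshold are placeholder values rather than anything tuned on the Fisher data.

    def fuse_scores(lexical_score, acoustic_score, weight=0.5):
        """Weighted-sum fusion of two verification scores (higher = accept)."""
        return weight * lexical_score + (1.0 - weight) * acoustic_score

    def verify(lexical_score, acoustic_score, threshold=0.0, weight=0.5):
        """Accept the claimed age group if the fused score clears a threshold."""
        return fuse_scores(lexical_score, acoustic_score, weight) >= threshold

    # Hypothetical normalised scores for one test segment.
    print(verify(lexical_score=0.8, acoustic_score=-0.2))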
Information Bottleneck Based Age Verification
Ron M. Hecht 1 , Omer Hezroni 1 , Amit Manna 1 , Gil
Dobry 2 , Yaniv Zigel 2 , Naftali Tishby 3 ; 1 PuddingMedia,
Israel; 2 Ben-Gurion University of the Negev, Israel;
3 Hebrew University, Israel
Mon-Ses2-P2-7, Time: 13:30
Word N-gram models can be used for word-based age-group verification. In this paper the agglomerative information bottleneck
(AIB) approach is used to tackle one of the most fundamental
drawbacks of word N-gram models: its abundant amount of
irrelevant information. It is demonstrated that irrelevant information can be omitted by joining words to form word-clusters;
this provides a mechanism to transform any sequence of words
to a sequence of word-cluster labels. Consequently, word N-gram
models are converted to word-cluster N-gram models which are
more compact. Age verification experiments were conducted on
the Fisher corpora. Their goal was to verify the age-group of the
speaker of an unknown speech segment. In these experiments an
N-gram model was compressed to a fifth of its original size without
reducing the verification performance. In addition, a verification
accuracy improvement is demonstrated by disposing irrelevant
information.
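Once a word-to-cluster mapping exists, the compression step described above is just relabelling the text and counting N-grams over cluster labels. The sketch below shows that relabelling with a hand-made mapping; the AIB clustering itself is not reproduced, and the word lists are invented.

    from collections import Counter

    # Hypothetical word -> cluster mapping (in the paper this comes from AIB).
    word_to_cluster = {"football": "C_sport", "soccer": "C_sport",
                       "pension": "C_retire", "grandkids": "C_retire"}

    def cluster_bigrams(words, mapping, unk="C_other"):
        """Replace words by cluster labels and count bigrams over the labels."""
        labels = [mapping.get(w, unk) for w in words]
        return Counter(zip(labels, labels[1:]))

    text = "we watched football and soccer with the grandkids".split()
    print(cluster_bigrams(text, word_to_cluster))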
Discriminative N-Gram Selection for Dialect
Recognition
F.S. Richardson, W.M. Campbell, P.A.
Torres-Carrasquillo; MIT, USA
Mon-Ses2-P2-8, Time: 13:30
Dialect recognition is a challenging and multifaceted problem.
Distinguishing between dialects can rely upon many tiers of interpretation of speech data — e.g., prosodic, phonetic, spectral, and
word. High-accuracy automatic methods for dialect recognition
typically use either phonetic or spectral characteristics of the
input. A challenge with spectral systems, such as those based
on shifted-delta cepstral coefficients, is that they achieve good
performance but do not provide insight into distinctive dialect
features. In this work, a novel method based upon discriminative
training and phone N-grams is proposed. This approach achieves
excellent classification performance, fuses well with other systems,
and has interpretable dialect characteristics in the phonetic tier.
The method is demonstrated on data from the LDC and prior NIST
language recognition evaluations. The method is also combined
with spectral methods to demonstrate state-of-the-art performance
in dialect recognition.
Data-Driven Phonetic Comparison and Conversion
Between South African, British and American
English Pronunciations
Linsen Loots, Thomas Niesler; Stellenbosch University,
South Africa
Mon-Ses2-P2-9, Time: 13:30
We analyse pronunciations in American, British and South African English pronunciation dictionaries. Three analyses are performed. First, the accuracy is determined with which decision tree based grapheme-to-phoneme (G2P) conversion can be applied to each accent. It is found that there is little difference between the accents in this regard. Secondly, pronunciations are compared by performing pairwise alignments between the accents. Here we find that South African English pronunciation most closely matches British English. Finally, we apply decision trees to the conversion of pronunciations from one accent to another. We find that pronunciations of unknown words can be more accurately determined from a known pronunciation in a different accent than by means of G2P methods. This has important implications for the development of pronunciation dictionaries in less-resourced varieties of English, and hence also for the development of ASR systems.
Target-Aware Language Models for Spoken Language Recognition
Rong Tong 1, Bin Ma 1, Haizhou Li 1, Eng Siong Chng 2, Kong-Aik Lee 1; 1 Institute for Infocomm Research, Singapore; 2 Nanyang Technological University, Singapore
Mon-Ses2-P2-10, Time: 13:30
This paper studies a new way of constructing multiple phone tokenizers for language recognition. In this approach, each phone tokenizer for a target language will share a common set of acoustic models, while each tokenizer will have a unique phone-based language model (LM) trained for a specific target language. The target-aware language models (TALM) are constructed to capture the discriminative ability of individual phones for the desired target languages. The parallel phone tokenizers thus formed are shown to achieve better performance than the original phone recognizer. The proposed TALM is very different from the LM in the traditional PPRLM technique. First of all, the TALM applies the LM information in the front-end, as opposed to the PPRLM approach which uses a LM in the system back-end; furthermore, the TALM exploits the discriminative phone occurrence statistics, which are different from the traditional n-gram statistics in the PPRLM approach. A novel way of training the TALM is also studied in this paper. Our experimental results show that the proposed method consistently improves the language recognition performance on the NIST 1996, 2003 and 2007 LRE 30-second closed test sets.
Language Identification for Speech-to-Speech Translation
Daniel Chung Yong Lim, Ian Lane; Carnegie Mellon University, USA
Mon-Ses2-P2-11, Time: 13:30
This paper investigates the use of language identification (LID) in real-time speech-to-speech translation systems. We propose a framework that incorporates LID capability into a speech-to-speech translation system while minimizing the impact on the system's real-time performance. We compared two phone-based LID approaches, namely PRLM and PPRLM, to a proposed extended approach based on Conditional Random Field classifiers. The performances of these three approaches were evaluated to identify the input language in the CMU English-Iraqi TransTAC system, and the proposed approach obtained significantly higher classification accuracies on two of the three test sets evaluated.
Using Prosody and Phonotactics in Arabic Dialect Identification
Fadi Biadsy, Julia Hirschberg; Columbia University, USA
Mon-Ses2-P2-12, Time: 13:30
While Modern Standard Arabic is the formal spoken and written language of the Arab world, dialects are the major communication mode for everyday life; identifying a speaker's dialect is thus critical to speech processing tasks such as automatic speech recognition, as well as speaker identification. We examine the role of prosodic features (intonation and rhythm) across four Arabic dialects: Gulf, Iraqi, Levantine, and Egyptian, for the purpose of automatic dialect identification. We show that prosodic features can significantly improve identification over a purely phonotactic-based approach, with an identification accuracy of 86.33% for 2m utterances.
Mon-Ses2-P3 : ASR: Acoustic Model Training and Combination
Hewison Hall, 13:30, Monday 7 Sept 2009
Chair: Jeff Bilmes, University of Washington, USA
Refactoring Acoustic Models Using Variational Expectation-Maximization
Pierre L. Dognin, John R. Hershey, Vaibhava Goel, Peder A. Olsen; IBM T.J. Watson Research Center, USA
Mon-Ses2-P3-1, Time: 13:30
In probabilistic modeling, it is often useful to change the structure of, or refactor, a model, so that it has a different number of components, different parameter sharing, or other constraints. For example, we may wish to find a Gaussian mixture model (GMM) with fewer components that best approximates a reference model. Maximizing the likelihood of the refactored model under the reference model is equivalent to minimizing their KL divergence. For GMMs, this optimization is not analytically tractable. However, a lower bound to the likelihood can be maximized using a variational expectation-maximization algorithm. Automatic speech recognition provides a good framework to test the validity of such methods, because we can train reference models of any given size for comparison with refactored models. We show that we can efficiently reduce model size by 50%, with the same recognition performance as the corresponding model trained from data.
Investigations on Convex Optimization Using Log-Linear HMMs for Digit String Recognition
Georg Heigold, David Rybach, Ralf Schlüter, Hermann Ney; RWTH Aachen University, Germany
Mon-Ses2-P3-2, Time: 13:30
Discriminative methods are an important technique to refine the acoustic model in speech recognition. Conventional discriminative
training is initialized with some baseline model and the parameters
are re-estimated in a separate step. This approach has proven to be
successful, but it includes many heuristics, approximations, and
parameters to be tuned. This tuning involves much engineering
and makes it difficult to reproduce and compare experiments. In
contrast to the conventional training, convex optimization techniques provide a sound approach to estimate all model parameters
from scratch. Such a straight approach hopefully dispense with
additional heuristics, e.g. scaling of posteriors. This paper addresses the question how well this concept using log-linear models
carries over to practice. Experimental results are reported for a
digit string recognition task, which allows for the investigation of
this issue without approximations.
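A minimal picture of the log-linear modelling referred to above: class posteriors as a softmax over linear feature scores, whose negative log-likelihood is convex in the parameters. The features and classes below are invented, and this frame-level sketch ignores the HMM structure used in the paper.

    import numpy as np

    def posteriors(weights, features):
        """Log-linear (softmax) class posteriors for one feature vector.

        weights: (num_classes, num_features); features: (num_features,).
        """
        scores = weights @ features
        scores -= scores.max()                 # numerical stability
        exp_scores = np.exp(scores)
        return exp_scores / exp_scores.sum()

    def negative_log_likelihood(weights, features, target_class):
        """Convex training criterion for a single observation."""
        return -float(np.log(posteriors(weights, features)[target_class]))

    w = np.zeros((3, 4))                       # 3 classes, 4 features
    x = np.array([0.2, -1.0, 0.5, 1.0])
    print(posteriors(w, x), negative_log_likelihood(w, x, target_class=1))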
Investigations on Discriminative Training in Large
Scale Acoustic Model Estimation
Janne Pylkkönen; Helsinki University of Technology,
Finland
Mon-Ses2-P3-3, Time: 13:30
In this paper two common discriminative training criteria, maximum mutual information (MMI) and minimum phone error (MPE),
are investigated. Two main issues are addressed: sensitivity
to different lattice segmentations and the contribution of the
parameter estimation method. It is noted that MMI and MPE
may benefit from different lattice segmentation strategies. The
use of discriminative criterion values as the measure of model
goodness is shown to be problematic as the recognition results do
not correlate well with these measures. Moreover, the parameter
estimation method clearly affects the recognition performance of
the model irrespective of the value of the discriminative criterion.
Also the dependence on the recognition task is demonstrated by
example with two Finnish large vocabulary dictation tasks used in
the experiments.
Margin-Space Integration of MPE Loss via
Differencing of MMI Functionals for Generalized
Error-Weighted Discriminative Training
Erik McDermott, Shinji Watanabe, Atsushi Nakamura;
NTT Corporation, Japan
Mon-Ses2-P3-4, Time: 13:30
Using the central observation that margin-based weighted classification error (modeled using Minimum Phone Error (MPE))
corresponds to the derivative with respect to the margin term
of margin-based hinge loss (modeled using Maximum Mutual
Information (MMI)), this article subsumes and extends margin-based MPE and MMI within a broader framework in which the
objective function is an integral of MPE loss over a range of
margin values. Applying the Fundamental Theorem of Calculus,
this integral is easily evaluated using finite differences of MMI
functionals; lattice-based training using the new criterion can then
be carried out using differences of MMI gradients. Experimental
results comparing the new framework with margin-based MMI,
MCE and MPE on the Corpus of Spontaneous Japanese and the MIT
OpenCourseWare/MIT-World corpus are presented.
Compacting Discriminative Feature Space
Transforms for Embedded Devices
Etienne Marcheret 1 , Jia-Yu Chen 2 , Petr Fousek 3 ,
Peder A. Olsen 1 , Vaibhava Goel 1 ; 1 IBM T.J. Watson
Research Center, USA; 2 University of Illinois at
Urbana-Champaign, USA; 3 IBM Research, Czech
Republic
Mon-Ses2-P3-5, Time: 13:30
Discriminative training of the feature space using the minimum
phone error objective function has been shown to yield remarkable
accuracy improvements. These gains, however, come at a high
cost of memory. In this paper we present techniques that maintain fMPE performance while reducing the required memory by
approximately 94%. This is achieved by designing a quantization
methodology which minimizes the error between the true fMPE
computation and that produced with the quantized parameters.
Also illustrated is a Viterbi search over the allocation of quantization levels, providing a framework for optimal non-uniform
allocation of quantization levels over the dimensions of the fMPE
feature vector. This provides an additional 8% relative reduction in
required memory with no loss in recognition accuracy.
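The memory saving described above comes from replacing each transform parameter by one of a small number of quantisation levels. The sketch below shows only a plain uniform quantiser and the reconstruction error it introduces; the paper's error-minimising design and Viterbi allocation of levels are not reproduced.

    import numpy as np

    def uniform_quantize(values, num_levels=16):
        """Quantise values to `num_levels` evenly spaced levels; return
        (level_indices, codebook) so only small indices need to be stored."""
        lo, hi = float(values.min()), float(values.max())
        codebook = np.linspace(lo, hi, num_levels)
        indices = np.argmin(np.abs(values[:, None] - codebook[None, :]), axis=1)
        return indices.astype(np.uint8), codebook

    params = np.random.randn(10000).astype(np.float32)   # stand-in parameters
    idx, codebook = uniform_quantize(params)
    error = float(np.mean((codebook[idx] - params) ** 2))
    print("levels:", len(codebook), "mean squared error:", round(error, 6))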
A Back-Off Discriminative Acoustic Model for
Automatic Speech Recognition
Hung-An Chang, James R. Glass; MIT, USA
Mon-Ses2-P3-6, Time: 13:30
In this paper we propose a back-off discriminative acoustic model
for Automatic Speech Recognition (ASR). We use a set of broad
phonetic classes to divide the classification problem originating
from context-dependent modeling into a set of sub-problems. By
appropriately combining the scores from classifiers designed for
the sub-problems, we can guarantee that the back-off acoustic
score for different context-dependent units will be different. The
back-off model can be combined with discriminative training
algorithms to further improve the performance. Experimental
results on a large vocabulary lecture transcription task show that
the proposed back-off discriminative acoustic model has more
than a 2.0% absolute word error rate reduction compared to
clustering-based acoustic model.
Efficient Generation and Use of MLP Features for
Arabic Speech Recognition
J. Park, F. Diehl, M.J.F. Gales, M. Tomalin, P.C.
Woodland; University of Cambridge, UK
Mon-Ses2-P3-7, Time: 13:30
Front-end features computed using Multi-Layer Perceptrons (MLPs)
have recently attracted much interest, but are a challenge to scale
to large networks and very large training data sets. This paper
discusses methods to reduce the training time for the generation
of MLP features and their use in an ASR system using a variety
of techniques: parallel training of a set of MLPs on different data
sub-sets; methods for computing features from a combination
of these networks; and rapid discriminative training of HMMs using
MLP-based features. The impact on MLP frame-based accuracy
using different training strategies is discussed along with the effect
on word error rates from incorporating the MLP features in various configurations into an Arabic broadcast audio transcription system.
A Study of Bootstrapping with Multiple Acoustic Features for Improved Automatic Speech Recognition
Xiaodong Cui, Jian Xue, Bing Xiang, Bowen Zhou; IBM T.J. Watson Research Center, USA
Mon-Ses2-P3-8, Time: 13:30
This paper investigates a scheme of bootstrapping with multiple
acoustic features (MFCC, PLP and LPCC) to improve the overall
performance of automatic speech recognition. In this scheme,
a Gaussian mixture distribution is estimated for each type of
feature resampled in each HMM state by single-pass retraining on
a shared decision tree. Thus obtained acoustic models based on
the multiple features are combined by likelihood averaging during
decoding. Experiments on large vocabulary spontaneous speech
recognition show that its overall performance is superior to the best of the acoustic models from the individual features. It also achieves comparable performance to Recognizer Output Voting Error Reduction
(ROVER) with computational advantages.
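A minimal sketch of the likelihood-averaging step, assuming per-stream state log-likelihoods are already available (an illustration of ours, not the authors' implementation):

import math

def averaged_log_likelihood(stream_loglikes):
    # average the per-stream state likelihoods in the linear domain,
    # returning the result in the log domain (log-sum-exp for stability)
    m = max(stream_loglikes)
    return m + math.log(sum(math.exp(x - m) for x in stream_loglikes) / len(stream_loglikes))

# e.g. MFCC-, PLP- and LPCC-based models scoring the same HMM state for one frame
print(averaged_log_likelihood([-52.1, -50.7, -53.4]))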
Analysis of Low-Resource Acoustic Model
Self-Training
Scott Novotney, Richard Schwartz; BBN Technologies,
USA
Mon-Ses2-P3-9, Time: 13:30
Previous work on self-training of acoustic models using unlabeled
data reported significant reductions in WER assuming a large phonetic dictionary was available. We now assume only those words
from ten hours of speech are initially available. Subsequently, we
are given a large vocabulary and quantify the value of repeating
self-training with this larger dictionary. This experiment
is used to analyze the effects of self-training on categories of
words. We report the following findings: (i) Although the small
5k vocabulary raises WER by 2% absolute, self-training is equally
effective as using a large 75k vocabulary. (ii) Adding all 75k words
to the decoding vocabulary after self-training reduces the WER
degradation to only 0.8% absolute. (iii) Self-training benefits, by a wide margin, those words that occur in the unlabeled audio but were not transcribed.
Log-Linear Model Combination with
Word-Dependent Scaling Factors
Björn Hoffmeister, Ruoying Liang, Ralf Schlüter,
Hermann Ney; RWTH Aachen University, Germany
Mon-Ses2-P3-10, Time: 13:30
Log-linear model combination is the standard approach in LVCSR
to combine several knowledge sources, usually an acoustic
and a language model. Instead of using a single scaling factor
per knowledge source, we make the scaling factor word- and
pronunciation-dependent. In this work, we combine three acoustic
models, a pronunciation model, and a language model for a
Mandarin BN/BC task. The achieved error rate reduction of 2%
relative is small but consistent for two test sets. An analysis of
the results shows that the major contribution comes from the
improved interdependency of language and acoustic model.
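In generic notation of our own (not the paper's), the contrast is roughly the following: standard log-linear combination scores a hypothesis $W$ for observations $X$ as
$$p(W \mid X) \propto \prod_m p_m(X, W)^{\lambda_m},$$
with one scaling factor $\lambda_m$ per knowledge source $m$, whereas the word- and pronunciation-dependent variant replaces $\lambda_m$ by $\lambda_m(w,v)$ for each word/pronunciation pair $(w,v)$ in $W$,
$$p(W \mid X) \propto \prod_m \prod_{(w,v) \in W} p_m(X_w, w, v)^{\lambda_m(w,v)}.$$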
Mon-Ses2-P4 : Spoken Dialogue Systems
Hewison Hall, 13:30, Monday 7 Sept 2009
Chair: Dilek Hakkani-Tür, ICSI, USA
Enabling a User to Specify an Item at Any Time During System Enumeration — Item Identification for Barge-In-Able Conversational Dialogue Systems
Kyoko Matsuyama, Kazunori Komatani, Tetsuya
Ogata, Hiroshi G. Okuno; Kyoto University, Japan
Mon-Ses2-P4-1, Time: 13:30
In conversational dialogue systems, users prefer to speak at any
time and to use natural expressions. We have developed an
Independent Component Analysis (ICA) based semi-blind source
separation method, which allows users to barge-in over system
utterances at any time. We created a novel method that uses timing information derived from barge-in utterances to identify the item that a user indicates during system enumeration. First, we
determine the timing distribution of user utterances containing
referential expressions and then approximate it using a gamma
distribution. Second, we represent both the utterance timing and
automatic speech recognition (ASR) results as probabilities of
the desired selection from the system’s enumeration. We then
integrate these two probabilities to identify the item having the
maximum likelihood of selection. Experimental results using 400
utterances indicated that our method outperformed two methods
used as a baseline (one of ASR results only and one of utterance
timing only) in identification accuracy.
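A small sketch of how the two probabilities might be integrated (our own illustration with made-up parameters; the paper's exact formulation is not reproduced here):

from scipy.stats import gamma

TIMING = gamma(a=2.0, scale=0.4)   # hypothetical fit to referential-utterance delays

def identify_item(barge_in_time, item_onsets, asr_item_probs):
    # item_onsets[i]: when the system started reading item i;
    # asr_item_probs[i]: probability of item i given the ASR result
    best_item, best_score = None, float("-inf")
    for i, onset in enumerate(item_onsets):
        delay = barge_in_time - onset
        if delay <= 0:
            continue                                     # item not yet enumerated
        score = TIMING.pdf(delay) * asr_item_probs[i]    # combine timing and ASR evidence
        if score > best_score:
            best_item, best_score = i, score
    return best_item

print(identify_item(3.2, [0.0, 1.5, 3.0], [0.2, 0.5, 0.3]))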
System Request Detection in Human Conversation
Based on Multi-Resolution Gabor Wavelet Features
Tomoyuki Yamagata, Tetsuya Takiguchi, Yasuo Ariki;
Kobe University, Japan
Mon-Ses2-P4-2, Time: 13:30
For a hands-free speech interface, it is important to detect commands in spontaneous utterances. Usual voice activity detection
systems can only distinguish speech frames from non-speech
frames, but they cannot discriminate whether the detected speech
section is a command for a system or not. In this paper, in order to
analyze the difference between system requests and spontaneous
utterances, we focus on fluctuations in a long period, such as
prosodic articulation, and fluctuations in a short period, such as
phoneme articulation. The use of multi-resolution analysis using
Gabor wavelet on a Log-scale Mel-frequency Filter-bank clarifies the
different characteristics of system commands and spontaneous
utterances. Experiments using our robot dialog corpus show that
the accuracy of the proposed method is 92.6% in F-measure, while
the conventional power and prosody-based method is just 66.7%.
Using Graphical Models for Mixed-Initiative Dialog
Management Systems with Realtime Policies
Stefan Schwärzler, Stefan Maier, Joachim Schenk,
Frank Wallhoff, Gerhard Rigoll; Technische Universität
München, Germany
Mon-Ses2-P4-3, Time: 13:30
In this paper, we present a novel approach for dialog modeling,
which extends the idea underlying the partially observable Markov
Decision Processes (POMDPs), i.e. it allows for calculating the dialog policy in real-time and thereby increases the system flexibility.
The use of statistical dialog models is particularly advantageous
to react adequately to common errors of speech recognition sys-
tems. Comparing our results to the reference system (POMDP), we
achieve a relative reduction of 31.6% in the average dialog length.
Furthermore, the proposed system shows a relative enhancement
of 64.4% in the sensitivity rate of the error recognition capabilities
at the same specificity rate in both systems. The achieved results
are based on the Air Travelling Information System with 21650
user utterances in 1585 natural spoken dialogs.
Conversation Robot Participating in and Activating a Group Communication
Shinya Fujie, Yoichi Matsuyama, Hikaru Taniyama, Tetsunori Kobayashi; Waseda University, Japan
Mon-Ses2-P4-4, Time: 13:30
As a new type of application of the conversation system, a robot
activating other parties’ communications has been developed. The
robot participates in a quiz game with other participants and tries
to activate the game. The functions installed in the robot are as
follows: (1) The robot can participate in a group communication
using its basic group conversation function. (2) The robot can
perform the game according to the rules of the game. (3) The robot
can activate communication using its proper actions depending on
the game situations and the participants’ situations. We conducted
a real field experiment: the prototype system performed a quiz
game with elderly people in an adult day-care center. The robot
successfully entertained the people with its one hour demonstration.
Recent Advances in WFST-Based Dialog System
Chiori Hori, Kiyonori Ohtake, Teruhisa Misu, Hideki
Kashioka, Satoshi Nakamura; NICT, Japan
Mon-Ses2-P4-5, Time: 13:30
To construct an expandable and adaptable dialog system which
handles multiple tasks, we proposed a dialog system using a
weighted finite-state transducer (WFST) in which user concept tags and system action tags are the input and output of the transducer, respectively. To test the potential of the WFST-based dialog management
(DM) platform using statistical DM models, we constructed a dialog
system using a human-to-human spoken dialog corpus for hotel
reservation, which is annotated with Interchange Format (IF). A
scenario WFST and a spoken language understanding (SLU) WFST
were obtained from the corpus and then composed together and
optimized. We evaluated the detection accuracy of the system's next
actions. In this paper, we focus on how WFST optimization operations contribute to the performance of the system. In addition, we
have constructed a full WFST-based dialog system by composing
SLU, scenario and sentence generation (SG) WFSTs. We show an
example of a hotel reservation dialog with the fully composed
system and discuss future work.
A Statistical Dialog Manager for the LUNA Project
David Griol 1 , Giuseppe Riccardi 2 , Emilio Sanchis 3 ; 1 Universidad Carlos III de Madrid, Spain; 2 Università di Trento, Italy; 3 Universidad Politécnica de Valencia, Spain
Mon-Ses2-P4-6, Time: 13:30
In this paper, we present an approach for the development of a statistical dialog manager, in which the next system response is selected by means of a classification process that considers all the previous history of the dialog. In particular, we use decision trees for its implementation. The statistical model is automatically learned from training data which are labeled in terms of different SLU features. This methodology has been applied to develop a dialog manager within the framework of the European LUNA project, whose main goal is the creation of a robust natural spoken language understanding system. We present an evaluation of this approach for both human-machine and human-human conversations acquired in this project. We demonstrate that a statistical dialog manager developed with the proposed technique and learned from a corpus of human-machine dialogs can successfully infer the task-related topics present in spontaneous human-human dialogs.
A Policy-Switching Learning Approach for Adaptive Spoken Dialogue Agents
Heriberto Cuayáhuitl, Juventino Montiel-Hernández; Autonomous University of Tlaxcala, Mexico
Mon-Ses2-P4-7, Time: 13:30
The reinforcement learning paradigm has been adopted for inferring optimized and adaptive spoken dialogue agents. Such
agents are typically learnt and tested without combining competing agents that may yield better performance at some points in
the conversation. This paper presents an approach that learns
dialogue behaviour from competing agents — switching from one
policy to another competing one — on a previously proposed
hierarchical learning framework. This policy-switching approach
was investigated using a simulated flight booking dialogue system
based on different types of information request. Experimental
results reported that the induced agent using the proposed policy-switching approach yielded 8.2% fewer system actions than three
baselines with a fixed type of information request. This result
suggests that the proposed approach is useful for learning adaptive
and scalable spoken dialogue agents.
Strategies for Accelerating the Design of Dialogue
Applications using Heuristic Information from the
Backend Database
L.F. D’Haro 1 , R. Cordoba 1 , R. San-Segundo 1 , J.
Macias-Guarasa 2 , J.M. Pardo 1 ; 1 Universidad
Politécnica de Madrid, Spain; 2 Universidad de Alcalá,
Spain
Mon-Ses2-P4-8, Time: 13:30
Nowadays, current commercial and academic platforms for developing spoken dialogue applications lack acceleration strategies
based on using heuristic information from the contents or structure of the backend database in order to speed up the definition
of the dialogue flow. In this paper we describe our attempts to
take advantage of these information sources using the following
strategies: the quick creation of classes and attributes to define
the data model structure, the semi-automatic generation and
debugging of database access functions, the automatic proposal of
the slots that should be preferably requested using mixed-initiative
forms or the slots that are better to request one by one using
directed forms, and the generation of automatic state proposals
to specify the transition network that defines the dialogue flow.
Subjective and objective evaluations confirm the advantages of
using the proposed strategies to simplify the design, and the high
acceptance of the platform and its acceleration strategies.
Feature-Based Summary Space for Stochastic
Dialogue Modeling with Hierarchical Semantic
Frames
Florian Pinault, Fabrice Lefèvre, Renato De Mori; LIA,
France
Mon-Ses2-P4-9, Time: 13:30
In a spoken dialogue system, the dialogue manager needs to make
decisions in a highly noisy environment, mainly due to speech
recognition and understanding errors. This work addresses this
issue by proposing a framework to interface efficient probabilistic
modeling for both the spoken language understanding module
and the dialogue management module. First hierarchical semantic
frames are inferred and composed so as to build a thorough
representation of the user’s utterance semantics. Then this representation is mapped into a feature-based summary space in which
is defined the set of dialogue states used by the stochastic dialogue manager, based on the partially observable Markov decision
process (POMDP) paradigm. This allows a planning of the dialogue
course taking into account the uncertainty on the current dialogue
state and tractability is ensured by the use of an intermediate
summary space.
A preliminary implementation of such a system is presented on
the Media domain. The task is touristic information and hotel
booking, and the availability of WoZ data allows us to consider a
model-based approach to the POMDP dialogue manager.
The MonAMI Reminder: A Spoken Dialogue System
for Face-to-Face Interaction
Jonas Beskow, Jens Edlund, Björn Granström, Joakim
Gustafson, Gabriel Skantze, Helena Tobiasson; KTH,
Sweden
Mon-Ses2-P4-12, Time: 13:30
We describe the MonAMI Reminder, a multimodal spoken dialogue
system which can assist elderly and disabled people in organising
and initiating their daily activities. Based on deep interviews with
potential users, we have designed a calendar and reminder application which uses an innovative mix of an embodied conversational
agent, digital pen and paper, and the web to meet the needs of
those users as well as the current constraints of speech technology.
We also explore the use of head pose tracking for interaction and
attention control in human-computer face-to-face interaction.
Language Modeling and Dialog Management for Address Recognition
Rajesh Balchandran, Leonid Rachevsky, Larry Sansone; IBM T.J. Watson Research Center, USA
Mon-Ses2-P4-10, Time: 13:30
This paper describes a language modeling and dialog management
system for efficient and robust recognition of several arbitrarily
ordered and inter-related components from very large datasets
— such as a complete address specified in a single sentence with address components in their natural sequence. A new
two-pass speech recognition technique based on using multiple
language models with embedded grammars is presented. Tests
with this technique on a complete address recognition task yielded good results, and memory and CPU requirements are sufficiently
low to make this technique viable for embedded environments. Additionally, a goal oriented algorithm for dialog based error recovery
and disambiguation, that does not require manual identification
of all possible dialog situations, is also presented. The combined
system yields very high task completion accuracy, for only a few
additional turns of interaction.
Influence of Training on Direct and Indirect Measures for the Evaluation of Multimodal Systems
Julia Seebode 1 , Stefan Schaffer 1 , Ina Wechsung 2 , Florian Metze 3 ; 1 Technische Universität Berlin, Germany; 2 Deutsche Telekom Laboratories, Germany; 3 Carnegie Mellon University, USA
Mon-Ses2-P4-13, Time: 13:30
Finding suitable evaluation methods is an indispensable task
during the development of new user interfaces, as no standardized
approach has so far been established, especially for multimodal
interfaces. In the current study, we used several data sources
(direct and indirect measurements) to evaluate a multimodal
version of an information system, tested on trained and untrained
users. We investigated the extent to which the different types of
data showed concordance concerning the perceived quality of the
system, in order to derive clues as to the suitability of the respective evaluation methods. The aim was to examine whether widely used
methods not originally developed for multimodal interfaces are
appropriate under these conditions, and to derive new evaluation
paradigms.
A Framework for Rapid Development of
Conversational Natural Language Call Routing
Systems for Call Centers
Ea-Ee Jan, Hong-Kwang Kuo, Osamuyimen Stewart, David Lubensky; IBM T.J. Watson Research Center, USA
Mon-Ses2-P4-11, Time: 13:30
A framework for rapid development of conversational natural
language call routing systems is proposed. The framework cuts
costs by using only scantily prepared business requirements to
automatically create an initial prototype. Besides clear targets (terminal routing classes), vague targets which are variations of users’
incomplete (semantically overlapping) sentences are enumerated.
The vague targets can be derived from the confusion set of the
semantic tokens of the clear targets. Also automatically generated
for a vague target is a disambiguation dialogue module, which
consists of a prompt and grammar to guide the user from a vague
target to one of its associated clear targets. In the final analysis,
our framework is able to reduce the human labor associated with
developing an initial natural language call routing system from
a few weeks to just a few hours. The experimental results from
a deployed pilot system support the feasibility of our proposed
approach.
Talking Heads for Interacting with Spoken Dialog Smart-Home Systems
Christine Kühnel, Benjamin Weiss, Sebastian Möller; Deutsche Telekom Laboratories, Germany
Mon-Ses2-P4-14, Time: 13:30
In this paper the relation between the quality of a talking head as
an output component of a spoken dialog system and the quality of
the system itself is investigated. Results show that the quality of
the talking head indeed has an important impact on system quality.
The quality of the talking head itself is found to be influenced by
visual and speech quality and the synchronization of voice and lip
movement.
Speech Generation from Hand Gestures Based on
Space Mapping
Aki Kunikoshi, Yu Qiao, Nobuaki Minematsu, Keikichi
Hirose; University of Tokyo, Japan
Mon-Ses2-P4-15, Time: 13:30
Individuals with speaking disabilities, particularly people suffering
from dysarthria, often use a TTS synthesizer for speech communication. Since users always have to type sound symbols and the
synthesizer reads them out in a monotonous style, the use of the
current synthesizers usually renders real-time operation and lively
communication difficult. This is why dysarthric users often fail to
control the flow of conversation. In this paper, we propose a novel
speech generation framework which makes use of hand gestures
as input. People usually use tongue gesture transitions for speech generation, but we have developed a special glove with which speech sounds are generated from hand gesture transitions. For
development, GMM-based voice conversion techniques (mapping
techniques) are applied to estimate a mapping function between
a space of hand gestures and another space of speech sounds. In
this paper, as an initial trial, a mapping between hand gestures and
Japanese vowel sounds is estimated so that topological features
of the selected gestures in a feature space and those of the five
Japanese vowels in a cepstrum space are equalized. Experiments
show that the special glove can generate good Japanese vowel
transitions with voluntary control of duration and articulation.
Mon-Ses2-S1 : Special Session:
INTERSPEECH 2009 Emotion Challenge
Ainsworth (East Wing 4), 13:30, Monday 7 Sept 2009
Chair: Björn Schuller, Technische Universität München, Germany
Emotion Recognition Using a Hierarchical Binary
Decision Tree Approach
Chi-Chun Lee, Emily Mower, Carlos Busso, Sungbok
Lee, Shrikanth S. Narayanan; University of Southern
California, USA
Mon-Ses2-S1-3, Time: 13:50
Emotion state tracking is an important aspect of human-computer
and human-robot interaction. It is important to design task
specific emotion recognition systems for real-world applications.
In this work, we propose a hierarchical structure loosely motivated
by Appraisal Theory for emotion recognition. The levels in the
hierarchical structure are carefully designed to place the easier
classification task at the top level and delay the decision between
highly ambiguous classes to the end. The proposed structure
maps an input utterance into one of the five-emotion classes
through subsequent layers of binary classifications. We obtain a
balanced recall on each of the individual emotion classes using this
hierarchical structure. The performance measure of the average
unweighted recall percentage on the evaluation data set improves
by 3.3% absolute (8.8% relative) over the baseline model.
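A toy sketch of the cascade idea; the particular groupings and stand-in classifiers below are invented for illustration and are not the levels used in the paper:

def classify_emotion(x, is_negative, is_neutral, is_anger_like, is_angry):
    # Level 1: the easiest binary decision is made first
    if not is_negative(x):
        # Level 2: split the non-negative group
        return "Neutral" if is_neutral(x) else "Positive"
    # Lower levels: the most ambiguous decisions are delayed to the bottom
    if is_anger_like(x):
        return "Angry" if is_angry(x) else "Emphatic"
    return "Rest"

# toy usage with trivial stand-in classifiers
print(classify_emotion({"energy": 0.9},
                       is_negative=lambda x: x["energy"] > 0.5,
                       is_neutral=lambda x: True,
                       is_anger_like=lambda x: x["energy"] > 0.8,
                       is_angry=lambda x: x["energy"] > 0.95))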
The INTERSPEECH 2009 Emotion Challenge
Björn Schuller 1 , Stefan Steidl 2 , Anton Batliner 2 ; 1 Technische Universität München, Germany; 2 FAU Erlangen-Nürnberg, Germany
Mon-Ses2-S1-1, Time: 13:30
The last decade has seen a substantial body of literature on the
recognition of emotion from speech. However, in comparison to
related speech processing tasks such as Automatic Speech and
Speaker Recognition, practically no standardised corpora and
test-conditions exist to compare performances under exactly the
same conditions. Instead, the multiplicity of evaluation strategies
employed — such as cross-validation or percentage splits without
proper instance definition — prevents exact reproducibility. Further, in order to face more realistic scenarios, the community is in
desperate need of more spontaneous and less prototypical data.
This INTERSPEECH 2009 Emotion Challenge aims at bridging such
gaps between excellent research on human emotion recognition
from speech and low compatibility of results. The FAU Aibo
Emotion Corpus [1] serves as basis with clearly defined test and
training partitions incorporating speaker independence and different room acoustics as needed in most real-life settings. This paper
introduces the challenge, the corpus, the features, and benchmark
results of two popular approaches towards emotion recognition
from speech.
Improving Automatic Emotion Recognition from Speech Signals
Elif Bozkurt 1 , Engin Erzin 1 , Çiǧdem Eroǧlu Erdem 2 , A. Tanju Erdem 3 ; 1 Koç University, Turkey; 2 Bahçeşehir University, Turkey; 3 Özyeğin University, Turkey
Mon-Ses2-S1-4, Time: 14:00
We present a speech signal driven emotion recognition system. Our
system is trained and tested with the INTERSPEECH 2009 Emotion
Challenge corpus, which includes spontaneous and emotionally
rich recordings. The challenge includes classifier and feature
sub-challenges with five-class and two-class classification problems. We investigate prosody related, spectral and HMM-based
features for the evaluation of emotion recognition with Gaussian
mixture model (GMM) based classifiers. Spectral features consist of
mel-scale cepstral coefficients (MFCC), line spectral frequency (LSF)
features and their derivatives, whereas prosody-related features
consist of mean normalized values of pitch, first derivative of
pitch and intensity. Unsupervised training of HMM structures
are employed to define prosody related temporal features for the
emotion recognition problem. We also investigate data fusion
of different features and decision fusion of different classifiers,
which are not well studied in the emotion recognition framework.
Experimental results of automatic emotion recognition with the
INTERSPEECH 2009 Emotion Challenge corpus are presented.
Exploring the Benefits of Discretization of Acoustic Features for Speech Emotion Recognition
Thurid Vogt, Elisabeth André; Universität Augsburg, Germany
Mon-Ses2-S1-2, Time: 13:40
We present a contribution to the Open Performance Sub-Challenge
of the INTERSPEECH 2009 Emotion Challenge. We evaluate the
feature extraction and classifier of EmoVoice, our framework
for real-time emotion recognition from voice on the challenge
database and achieve competitive results. Furthermore, we explore
the benefits of discretizing numeric acoustic features and find it
beneficial in a multi-class task.
GTM-URL Contribution to the INTERSPEECH 2009 Emotion Challenge
Santiago Planet, Ignasi Iriondo, Joan Claudi Socoró, Carlos Monzo, Jordi Adell; Universitat Ramon Llull, Spain
Mon-Ses2-S1-5, Time: 14:10
This paper describes our participation in the INTERSPEECH 2009
Emotion Challenge [1]. Starting from our previous experience in the
use of automatic classification for the validation of an expressive
corpus, we have tackled the difficult task of emotion recognition
from speech with real-life data. Our main contribution to this
work is related to the Classifier Sub-Challenge, for which we tested
several classification strategies. On the whole, the results were
slightly worse than or similar to the baseline, but we found some
configurations that could be considered in future implementations.
Combining Spectral and Prosodic Information for
Emotion Recognition in the Interspeech 2009
Emotion Challenge
Iker Luengo, Eva Navas, Inmaculada Hernáez; University of the Basque Country, Spain
Mon-Ses2-S1-6, Time: 14:20
This paper describes the system presented at the Interspeech 2009
Emotion Challenge. It relies on both spectral and prosodic features
in order to automatically detect the emotional state of the speaker.
As both kinds of features have very different characteristics, they
are treated separately, creating two sub-classifiers, one using the
spectral features and the other one using the prosodic ones. The
results of these two classifiers are then combined with a fusion
system based on Support Vector Machines.
Cepstral and Long-Term Features for Emotion Recognition
Pierre Dumouchel 1 , Najim Dehak 1 , Yazid Attabi 1 , Réda Dehak 2 , Narjès Boufaden 1 ; 1 CRIM, Canada; 2 LRDE, France
Mon-Ses2-S1-9, Time: 14:50
In this paper, we describe systems that were developed for the Open
Performance Sub-Challenge of the INTERSPEECH 2009 Emotion
Challenge. We participate in both two-class and five-class emotion
detection. For the two-class problem, the best performance is
obtained by logistic regression fusion of three systems. These
systems use short- and long-term speech features. Fusion led to an absolute improvement of 2.6% in the unweighted recall value
compared with [1]. For the five-class problem, we submitted two
individual systems: cepstral GMM vs. long-term GMM-UBM. The
best result comes from a cepstral GMM and produces an absolute
improvement of 3.5% compared to [6].
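A minimal sketch of score-level fusion with logistic regression, assuming per-utterance scores from the individual systems are available (the feature values and scikit-learn usage are our own illustration, not the paper's setup):

from sklearn.linear_model import LogisticRegression

# toy scores from three sub-systems for a two-class task
train_scores = [[0.2, -1.1, 0.4], [1.3, 0.7, 0.9], [-0.5, -0.9, -1.2], [0.8, 1.4, 0.3]]
train_labels = [0, 1, 0, 1]

fusion = LogisticRegression().fit(train_scores, train_labels)
print(fusion.predict_proba([[0.6, 0.1, 0.5]]))   # fused class posterior for a test utterance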
Acoustic Emotion Recognition Using Dynamic
Bayesian Networks and Multi-Space Distributions
R. Barra-Chicote 1 , Fernando Fernández 1 , S. Lutfi 1 ,
Juan Manuel Lucas-Cuesta 1 , J. Macias-Guarasa 2 , J.M.
Montero 1 , R. San-Segundo 1 , J.M. Pardo 1 ; 1 Universidad
Politécnica de Madrid, Spain; 2 Universidad de Alcalá,
Spain
Mon-Ses2-S1-7, Time: 14:30
In this paper we describe the acoustic emotion recognition system built at the Speech Technology Group of the Universidad
Politecnica de Madrid (Spain) to participate in the INTERSPEECH
2009 Emotion Challenge. Our proposal is based on the use of
a Dynamic Bayesian Network (DBN) to deal with the temporal
modelling of the emotional speech information. The selected
features (MFCC, F0, Energy and their variants) are modelled as
different streams, and the F0 related ones are integrated under
a Multi Space Distribution (MSD) framework, to properly model
its dual nature (voiced/unvoiced). Experimental evaluation on
the challenge test set shows 67.06% and 38.24% unweighted recall for the 2- and 5-class tasks, respectively. In the 2-class
case, we achieve similar results compared with the baseline, with
a considerably smaller number of features. In the 5-class case, we
achieve a statistically significant 6.5% relative improvement.
Brno University of Technology System for Interspeech 2009 Emotion Challenge
Marcel Kockmann, Lukáš Burget, Jan Černocký; Brno University of Technology, Czech Republic
Mon-Ses2-S1-10, Time: 15:00
This paper describes Brno University of Technology (BUT) system
for the Interspeech 2009 Emotion Challenge. Our submitted
system for the Open Performance Sub-Challenge uses acoustic
frame based features as a front-end and Gaussian Mixture Models
as a back-end. Different feature types and modeling approaches
successfully applied in speaker and language recognition are investigated, and we achieve a 16% and 9% relative improvement
over the best dynamic and static baseline system on the 5-class
task, respectively.
Emotion Classification in Children’s Speech Using Fusion of Acoustic and Linguistic Features
Tim Polzehl 1 , Shiva Sundaram 1 , Hamed Ketabdar 1 , Michael Wagner 2 , Florian Metze 3 ; 1 Technische Universität Berlin, Germany; 2 University of Canberra, Australia; 3 Carnegie Mellon University, USA
Mon-Ses2-S1-8, Time: 14:40
This paper describes a system to detect angry vs. non-angry utterances of children who are engaged in dialog with an Aibo robot
dog. The system was submitted to the Interspeech2009 Emotion
Challenge evaluation. The speech data consist of short utterances
of the children’s speech, and the proposed system is designed to
detect anger in each given chunk. Frame-based cepstral features,
prosodic and acoustic features as well as glottal excitation features
are extracted automatically, reduced in dimensionality and classified by means of an artificial neural network and a support vector
machine. An automatic speech recognizer transcribes the words
in an utterance and yields a separate classification based on the
degree of emotional salience of the words. Late fusion is applied
to make a final decision on anger vs. non-anger of the utterance.
Preliminary results show 75.9% unweighted average recall on the
training data and 67.6% on the test set.
Summary of the INTERSPEECH 2009 Emotion Challenge
Time: 15:10
Awards Ceremony
Time: 15:20
Mon-Ses3-O1 : Automatic Speech Recognition: Language Models I
Main Hall, 16:00, Monday 7 Sept 2009
Chair: Keiichi Tokuda, Nagoya Institute of Technology, Japan
Back-Off Language Model Compression
Boulos Harb, Ciprian Chelba, Jeffrey Dean, Sanjay
Ghemawat; Google Inc., USA
Mon-Ses3-O1-1, Time: 16:00
With the availability of large amounts of training data relevant to
speech recognition scenarios, scalability becomes a very productive way to improve language model performance. We present
a technique that represents a back-off n-gram language model
using arrays of integer values and thus renders it amenable to
effective block compression. We propose a few such compression algorithms and evaluate the resulting language model along
two dimensions: memory footprint, and speed reduction relative to the uncompressed one. We experimented with a model
that uses a 32-bit word vocabulary (at most 4B words) and log-probabilities/back-off-weights quantized to 1 byte, respectively.
The best compression algorithm achieves 2.6 bytes/n-gram at
≈18X slower than uncompressed. For faster LM operation we
found it feasible to represent the LM at ≈4.0 bytes/n-gram, and
≈3X slower than the uncompressed LM. The memory footprint of
a LM containing one billion n-grams can thus be reduced to 3–4
Gbytes without impacting its speed too much.
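A minimal sketch of the representation idea only: word ids stored as fixed-width integers and log-probabilities/back-off weights quantized to one byte via a small codebook. The codebook construction and layout below are our own simplification, not the compression algorithms evaluated in the paper:

import array, bisect

def build_codebook(values, levels=256):
    vs = sorted(values)
    # crude quantile codebook with up to 256 entries, i.e. one byte per value
    return sorted(set(vs[int(i * (len(vs) - 1) / (levels - 1))] for i in range(levels)))

def quantize(value, codebook):
    i = bisect.bisect_left(codebook, value)
    if i == len(codebook) or (i > 0 and value - codebook[i - 1] <= codebook[i] - value):
        i -= 1                      # pick the nearer neighbouring code word
    return max(i, 0)

logprobs = [-0.12, -1.7, -3.2, -0.9, -2.4, -0.5]
cb = build_codebook(logprobs)
word_ids = array.array("I", [17, 5234, 901])                       # one trigram, 32-bit ids
quantized = array.array("B", [quantize(p, cb) for p in logprobs])  # one byte each
print(list(word_ids), list(quantized))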
Improving Broadcast News Transcription with a Precision Grammar and Discriminative Reranking
Tobias Kaufmann, Thomas Ewender, Beat Pfister; ETH Zürich, Switzerland
Mon-Ses3-O1-2, Time: 16:20
We propose a new approach to integrating a precision grammar
into speech recognition. The approach is based on a novel robust
parsing technique and discriminative reranking. By reranking
100-best output of the LIMSI German broadcast news transcription
system we achieved a significant reduction of the word error rate
by 9.6% relative. To our knowledge, this is the first significant
improvement for a real-world broad-domain speech recognition
task due to a precision grammar.
Constraint Selection for Topic-Based MDI Adaptation of Language Models
Gwénolé Lecorvé, Guillaume Gravier, Pascale Sébillot; IRISA, France
Mon-Ses3-O1-5, Time: 17:20
This paper presents an unsupervised topic-based language model
adaptation method which specializes the standard minimum
information discrimination approach by identifying and combining
topic-specific features. By acquiring a topic terminology from
a thematically coherent corpus, language model adaptation is
restricted to re-estimating only the probabilities of n-grams ending with topic-specific words, keeping other probabilities
untouched. Experiments are carried out on a large set of spoken
documents about various topics. Results show significant perplexity and recognition improvements which outperform results of
classical adaptation techniques.
Use of Contexts in Language Model Interpolation and Adaptation
X. Liu, M.J.F. Gales, P.C. Woodland; University of Cambridge, UK
Mon-Ses3-O1-3, Time: 16:40
Language models (LMs) are often constructed by building component models on multiple text sources to be combined using global,
context free interpolation weights. By re-adjusting these weights,
LMs may be adapted to a target domain representing a particular
genre, epoch or other higher level attributes. A major limitation
with this approach is that other factors that determine the “usefulness”
of sources on a context dependent basis, such as modeling resolution, generalization, topics and styles, are poorly modeled. To
overcome this problem, this paper investigates a context dependent form of LM interpolation and test-time adaptation. Depending
on the context, a discrete history weighting function is used to
dynamically adjust the contribution from component models. In
previous research, it was used primarily for LM adaptation. In this
paper, a range of schemes to combine context dependent weights
obtained from training and test data to improve LM adaptation are
proposed. Consistent perplexity and error rate gains of 6% relative
were obtained on a state-of-the-art broadcast recognition task.
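Schematically, in generic notation of our own: with global interpolation, $P(w \mid h) = \sum_m \lambda_m\, P_m(w \mid h)$ with one weight $\lambda_m$ per component LM, whereas the context-dependent form uses a discrete history weighting function, $P(w \mid h) = \sum_m \lambda_m(h)\, P_m(w \mid h)$, with $\sum_m \lambda_m(h) = 1$ for every history equivalence class $h$, so the contribution of each source can vary with the context.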
Nonstationary Latent Dirichlet Allocation for Speech Recognition
Chuang-Hua Chueh, Jen-Tzung Chien; National Cheng Kung University, Taiwan
Mon-Ses3-O1-6, Time: 17:40
Latent Dirichlet allocation (LDA) has been successful for document
modeling. LDA extracts the latent topics across documents. Words
in a document are generated by the same topic distribution.
However, in real-world documents, the usage of words in different
paragraphs is varied and accompanied with different writing styles.
This study extends the LDA and copes with the variations of topic
information within a document. We build the nonstationary LDA
(NLDA) by incorporating a Markov chain which is used to detect
the stylistic segments in a document. Each segment corresponds
to a particular style in composition of a document. This NLDA
can exploit the topic information between documents as well as
the word variations within a document. We accordingly establish
a Viterbi-based variational Bayesian procedure. A language model
adaptation scheme using NLDA is developed for speech recognition. Experimental results show improvement of NLDA over LDA
in terms of perplexity and word error rate.
Exploiting Chinese Character Models to Improve Speech Recognition Performance
J.L. Hieronymus 1 , X. Liu 2 , M.J.F. Gales 2 , P.C. Woodland 2 ; 1 NASA Ames Research Center, USA; 2 University of Cambridge, UK
Mon-Ses3-O1-4, Time: 17:00
The Chinese language is based on characters which are syllabic
in nature. Since languages have syllabotactic rules which govern
the construction of syllables and their allowed sequences, Chinese
character sequence models can be used as a first level approximation of allowed syllable sequences. N-gram character sequence
models were trained on 4.3 billion characters. Characters are
used as a first level recognition unit with multiple pronunciations
per character. For comparison the CU-HTK Mandarin word based
system was used to recognize words which were then converted to
character sequences. The character only system error rates for one-best recognition were slightly worse than word based character recognition. However combining the two systems using log-linear combination gives better results than either system separately. An equally weighted combination gave consistent CER gains of 0.1–0.2% absolute over the word based standard system.
Mon-Ses3-O2 : Phoneme-Level Perception
Jones (East Wing 1), 16:00, Monday 7 Sept 2009
Chair: Rolf Carlson, KTH, Sweden
Categorical Perception of Speech Without Stimulus Repetition
Jack C. Rogers, Matthew H. Davis; University of
Cambridge, UK
Mon-Ses3-O2-1, Time: 16:00
We explored the perception of phonetic continua generated with
an automated auditory morphing technique in three perceptual experiments. The use of large sets of stimuli allowed an assessment
of the impact of single vs. paired presentation without the massed
stimulus repetition typical of categorical perception experiments.
A third experiment shows that such massed repetition alters the
degree of categorical and sub-categorical discrimination possible in
speech perception. Implications for accounts of speech perception
are discussed.
Perceptual Grouping of Alternating Word Pairs:
Effect of Pitch Difference and Presentation Rate
Nandini Iyer, Douglas S. Brungart, Brian D. Simpson;
Air Force Research Laboratory, USA
Non-Automaticity of Use of Orthographic
Knowledge in Phoneme Evaluation
Anne Cutler 1 , Chris Davis 2 , Jeesun Kim 2 ; 1 Max Planck
Institute for Psycholinguistics, The Netherlands;
2 University of Western Sydney, Australia
Mon-Ses3-O2-2, Time: 16:20
Two phoneme goodness rating experiments addressed the role
of orthographic knowledge in the evaluation of speech sounds.
Ratings for the best tokens of /s/ were higher in words spelled
with S (e.g., bless) than in words where /s/ was spelled with C (e.g.,
voice). This difference did not appear for analogous nonwords
for which every lexical neighbour had either S or C spelling (pless,
floice). Models of phonemic processing incorporating obligatory
influence of lexical information in phonemic processing cannot
explain this dissociation; the data are consistent with models in
which phonemic decisions are not subject to necessary top-down
lexical influence.
Learning and Generalization of Novel Contrastive
Cues
Meghan Sumner; Stanford University, USA
Mon-Ses3-O2-3, Time: 16:40
This paper examines the learning of a novel phonetic contrast.
Specifically, we examine how a contrast is learned — do speakers
learn a specific property about a particular word, or do they internalize a pattern that can be applied to words of a particular type
in subsequent processing? In two experiments, participants were
trained to treat stop release as contrastive. Following training,
participants took either a minimal pair decision or a cross-modal
form priming task, both of which include trained words, untrained
words with a trained rime, and novel, untrained words. The results
of both experiments suggest that both strategies are used in
learning — listeners generalize to words with similar rimes, but
are unable to extend this knowledge to novel words.
Mon-Ses3-O2-5, Time: 17:20
When listeners hear sequences of tones that slowly alternate between a low frequency and a slightly higher frequency, they tend to
report hearing a single stream of alternating tones. However, when
the alternation rate and/or the frequency difference increases,
they often report hearing two distinct streams: a slowly pulsing
high and low frequency stream. This experiment used repeating
sequences of spondees to investigate whether a similar streaming
phenomenon might occur for speech stimuli. The F0 difference
between every other word was varied from 0–18 semitones. Each
word was either 100 or 125 ms in duration. The inter-onset
intervals (IOIs) of the individual words were varied from 100–300
ms. The spondees were selected in such a way that listeners who
perceived a single stream of sequential words would report hearing
a different set of spondees than ones who perceived two distinct
streams grouped by frequency. As expected, the F0 difference was a
strong cue for sequential segregation. Moreover, the number of
‘two’ stream judgments was greater at smaller IOIs, suggesting
that factors that influence the obligatory streaming of tonal signals
are also important in the segregation of speech signals.
Comparing Methods to Find a Best Exemplar in a
Multidimensional Space
Titia Benders, Paul Boersma; University of Amsterdam,
The Netherlands
Mon-Ses3-O2-6, Time: 17:40
We present a simple algorithm for running a listening experiment
aimed at finding the best exemplar in a multidimensional space.
For simulated humanlike listeners, who have perception thresholds and some decision noise on their responses, the algorithm
on average ends up twelve times closer than Iverson and Evans’
algorithm [1].
Vowel Category Perception Affected by
Microdurational Variations
Mon-Ses3-O3 : Statistical Parametric
Synthesis I
Einar Meister 1 , Stefan Werner 2 ; 1 Tallinn University of
Technology, Estonia; 2 University of Joensuu, Finland
Fallside (East Wing 2), 16:00, Monday 7 Sept 2009
Chair: Jean-François Bonastre, LIA, France
Mon-Ses3-O2-4, Time: 17:00
Vowel quality perception in quantity languages is considered to
be unrelated to vowel duration since duration is used to realize
quantity oppositions. To test the role of microdurational variations
in vowel category perception in Estonian, listening experiments
with synthetic stimuli were carried out, involving five vowel pairs
along the close-open axis.
The results show that in the case of high-mid vowel pairs vowel
openness correlates positively with stimulus duration; in mid-low
vowel pairs no such correlation was found. The discrepancy in
the results is explained by the hypothesis that in case of shorter
perceptual distances (high-mid area of vowel space) intrinsic duration plays the role of a secondary feature to enhance perceptual
contrast between vowels, whereas in case of mid-low oppositions
perceptual distance is large enough to guarantee the necessary
perceptual contrast by spectral features alone and vowel intrinsic
duration as an additional cue is not needed.
Autoregressive HMMs for Speech Synthesis
Matt Shannon, William Byrne; University of Cambridge,
UK
Mon-Ses3-O3-1, Time: 16:00
We propose the autoregressive HMM for speech synthesis. We
show that the autoregressive HMM supports efficient EM parameter
estimation and that we can use established effective synthesis
techniques such as synthesis considering global variance with
minimal modification. The autoregressive HMM uses the same
model for parameter estimation and synthesis in a consistent
way, in contrast to the standard HMM synthesis framework, and
supports easy and efficient parameter estimation, in contrast to
the trajectory HMM. We find that the autoregressive HMM gives
performance comparable to the standard HMM synthesis framework on a Blizzard Challenge-style naturalness evaluation.
Asynchronous F0 and Spectrum Modeling for
HMM-Based Speech Synthesis
Cheng-Cheng Wang, Zhen-Hua Ling, Li-Rong Dai; University of Science & Technology of China, China
Mon-Ses3-O3-2, Time: 16:20
This paper proposes an asynchronous model structure for fundamental frequency (F0) and spectrum modeling in HMM-based
parametric speech synthesis to improve the performance of
F0 prediction. F0 and spectrum features are considered to be
synchronous in the conventional system. Considering that the
production of these two features is decided by the movement
of different speech organs, an explicitly asynchronous model
structure is introduced. At the training stage, F0 models are trained asynchronously with the spectrum models. At the synthesis stage, the two features are generated separately. The objective and subjective
evaluation results show the proposed method can effectively
improve the accuracy of F0 prediction.
Local Minimum Generation Error Criterion for Hybrid HMM Speech Synthesis
Xavi Gonzalvo 1 , Alexander Gutkin 2 , Joan Claudi Socoró 3 , Ignasi Iriondo 3 , Paul Taylor 1 ; 1 Phonetic Arts Ltd., UK; 2 Yahoo! Europe, UK; 3 Universitat Ramon Llull, Spain
Mon-Ses3-O3-5, Time: 17:20
This paper presents an HMM-driven hybrid speech synthesis
approach in which unit selection concatenative synthesis is used
to improve the quality of the statistical system using a Local
Minimum Generation Error (LMGE) during the synthesis stage.
The idea behind this approach is to combine the robustness due
to HMMs with the naturalness of concatenated units. Unlike the
conventional hybrid approaches to speech synthesis that use
concatenative synthesis as a backbone, the proposed system
employs stable regions of natural units to improve the statistically
generated parameters. We show that this approach improves the
generation of vocal tract parameters, smoothes the bad joints and
increases the overall quality.
A Minimum V/U Error Approach to F0 Generation in HMM-Based TTS
Yao Qian, Frank K. Soong, Miaomiao Wang, Zhizheng Wu; Microsoft Research Asia, China
Mon-Ses3-O3-3, Time: 16:40
The HMM-based TTS can produce a highly intelligible and decent
quality voice. However, the HMM models degrade when the feature vectors
used in training are noisy. Among all noisy features, pitch tracking
errors and corresponding flawed voiced/unvoiced (v/u) decisions
are identified as two key factors in voice quality problems. In this
paper, we propose a minimum v/u error approach to F0 generation.
A prior knowledge of v/u is imposed in each Mandarin phone and
accumulated v/u posterior probabilities are used to search for
the optimal v/u switching point in each VU or UV segment in
generation. Objectively the new approach is shown to improve
v/u prediction performance, specifically on voiced to unvoiced
swapping errors. They are reduced from 3.7% (baseline) down
to 2.0% (new approach). The improvement is also subjectively
confirmed by an AB preference test score, 72% (new approach)
versus 22% (baseline).
Voiced/Unvoiced Decision Algorithm for
HMM-Based Speech Synthesis
Shiyin Kang 1 , Zhiwei Shuang 2 , Quansheng Duan 1 ,
Yong Qin 2 , Lianhong Cai 1 ; 1 Tsinghua University,
China; 2 IBM China Research Lab, China
Mon-Ses3-O3-4, Time: 17:00
This paper introduces a novel method to improve the U/V decision
method in HMM-based speech synthesis. In the conventional
method, the U/V decision of each state is independently made, and
a state in the middle of a vowel may be decided as unvoiced. In this
paper, we propose to utilize the constraints of natural speech to
improve the U/V decision inside a unit, such as syllable or phone.
We use a GMM-based U/V change time model to select the best U/V
change time in one unit, and refine the U/V decision of all states
in that unit based on the selected change time. The result of a
perceptual evaluation demonstrates that the proposed method can
significantly improve the naturalness of the synthetic speech.
Thousands of Voices for HMM-Based Speech Synthesis
Junichi Yamagishi 1 , Bela Usabaev 2 , Simon King 1 , Oliver Watts 1 , John Dines 3 , Jilei Tian 4 , Rile Hu 4 , Yong Guan 4 , Keiichiro Oura 5 , Keiichi Tokuda 5 , Reima Karhila 6 , Mikko Kurimo 6 ; 1 University of Edinburgh, UK; 2 Universität Tübingen, Germany; 3 IDIAP Research Institute, Switzerland; 4 Nokia Research Center, China; 5 Nagoya Institute of Technology, Japan; 6 Helsinki University of Technology, Finland
Mon-Ses3-O3-6, Time: 17:40
Our recent experiments with HMM-based speech synthesis systems
have demonstrated that speaker-adaptive HMM-based speech
synthesis (which uses an ‘average voice model’ plus model adaptation) is robust to non-ideal speech data that are recorded under
various conditions and with varying microphones, that are not
perfectly clean, and/or that lack phonetic balance. This enables us to consider building high-quality voices on ‘non-TTS’ corpora
such as ASR corpora. Since ASR corpora generally include a large
number of speakers, this leads to the possibility of producing an
enormous number of voices automatically. In this paper we show
thousands of voices for HMM-based speech synthesis that we have
made from several popular ASR corpora such as the Wall Street
Journal databases (WSJ0/WSJ1/WSJCAM0), Resource Management,
Globalphone and Speecon. We report some perceptual evaluation
results and outline the outstanding issues.
Mon-Ses3-O4 : Systems for Spoken
Language Translation
Holmes (East Wing 3), 16:00, Monday 7 Sept 2009
Chair: Hermann Ney, RWTH Aachen University, Germany
Efficient Combination of Confidence Measures for
Machine Translation
Sylvain Raybaud, David Langlois, Kamel Smaïli;
LORIA, France
Mon-Ses3-O4-1, Time: 16:00
We present in this paper a twofold contribution to Confidence
Measures for Machine Translation. First, in order to train and
test confidence measures, we present a method to automatically
build corpora containing realistic errors. Errors introduced into
reference translation simulate classical machine translation errors
(word deletion and word substitution), and are supervised by
Wordnet. Second, we use SVM to combine original and classical
confidence measures both at word- and sentence-level. We show
that the obtained combination outperforms by 14% (absolute) our
best single word-level confidence measure, and that combination of
sentence-level confidence measures produces meaningful scores.
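A small sketch of the word-level combination step, assuming each target word is represented by a vector of individual confidence scores and labelled as correct or erroneous (the feature names and values are invented and the scikit-learn usage is ours, not the paper's setup):

from sklearn.svm import SVC

# toy per-word features: [word posterior, LM score, lexical-translation score]
X_train = [[0.91, -1.2, 0.8], [0.35, -4.0, 0.1], [0.77, -2.1, 0.6],
           [0.20, -5.3, 0.2], [0.85, -1.8, 0.7], [0.15, -4.8, 0.1]]
y_train = [1, 0, 1, 0, 1, 0]                       # 1 = correct word, 0 = introduced error

combiner = SVC().fit(X_train, y_train)
# the signed distance to the margin serves as the combined word-level confidence score
print(combiner.decision_function([[0.55, -3.0, 0.4]]))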
Incremental Dialog Clustering for Speech-to-Speech
Translation
Using Syntax in Large-Scale Audio Document
Translation
David Stallard, Stavros Tsakalidis, Shirin Saleem; BBN
Technologies, USA
Mon-Ses3-O4-2, Time: 16:20
Application domains for speech-to-speech translation and dialog
systems often contain sub-domains and/or task-types for which
different outputs are appropriate for a given input. It would be
useful to be able to automatically find such sub-domain structure
in training corpora, and to classify new interactions with the
system into one of these sub-domains. To this end, we present a
document-clustering approach to such sub-domain classification,
which uses a recently-developed algorithm based on von Mises
Fisher distributions. We give preliminary perplexity reduction and
MT performance results for a speech-to-speech translation system
using this model.
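A rough sketch of the clustering step under simplifying assumptions: the von Mises-Fisher mixture is replaced here by its hard-assignment special case (spherical k-means) over unit-normalised bag-of-words vectors; this is an illustration, not BBN's algorithm:

import numpy as np

def spherical_kmeans(X, k, iters=20, seed=0):
    rng = np.random.default_rng(seed)
    X = X / np.linalg.norm(X, axis=1, keepdims=True)         # project documents to the unit sphere
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(iters):
        labels = (X @ centers.T).argmax(axis=1)               # assign by cosine similarity
        for c in range(k):
            members = X[labels == c]
            if len(members):
                m = members.sum(axis=0)
                centers[c] = m / np.linalg.norm(m)            # mean direction of the cluster
    return labels

docs = np.abs(np.random.default_rng(1).normal(size=(12, 6)))  # toy term-count vectors
print(spherical_kmeans(docs, k=3))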
Iterative Sentence-Pair Extraction from
Quasi-Parallel Corpora for Machine Translation
R. Sarikaya, Sameer Maskey, R. Zhang, Ea-Ee Jan, D.
Wang, Bhuvana Ramabhadran, S. Roukos; IBM T.J.
Watson Research Center, USA
Mon-Ses3-O4-3, Time: 16:40
This paper addresses parallel data extraction from the quasi-parallel corpora generated in a crowd-sourcing project where ordinary people watch TV shows and movies and transcribe/translate
what they hear, creating document pools in different languages.
Since they do not have guidelines for naming and performing translations, it is often not clear which documents are the translations
of the same show/movie and which sentences are the translations
of each other in a given document pair. We introduce a method
for automatically pairing documents in two languages and extracting parallel sentences from the paired documents. The method
consists of three steps: i) document pairing, ii) sentence pair
alignment of the paired documents, and iii) context extrapolation
to boost the sentence pair coverage. Human evaluation of the
extracted data shows that 95% of the extracted sentences carry
useful information for translation. Experimental results also show
that using the extracted data provides significant gains over the
baseline statistical machine translation system built with manually
annotated data.
RTTS: Towards Enterprise-Level Real-Time Speech
Transcription and Translation Services
Juan M. Huerta 1 , Cheng Wu 1 , Andrej Sakrajda 1 ,
Sasha Caskey 1 , Ea-Ee Jan 1 , Alexander Faisman 1 , Shai
Ben-David 2 , Wen Liu 3 , Antonio Lee 1 , Osamuyimen
Stewart 1 , Michael Frissora 1 , David Lubensky 1 ; 1 IBM
T.J. Watson Research Center, USA; 2 IBM Haifa
Research Lab, Israel; 3 IBM China Research Lab, China
Mon-Ses3-O4-4, Time: 17:00
In this paper we describe the RTTS system for enterprise-level
real time speech recognition and translation. RTTS follows a Web
Service-based approach which allows the encapsulation of ASR
and MT Technology components thus hiding the configuration and
tuning complexities and details from the client applications while
exposing a uniform interface. In this way, RTTS is capable of easily
supporting a wide variety of client applications. The clients we
have implemented include a VoIP-based real time speech-to-speech
translation system, a chat and Instant Messaging translation
System, a Transcription Server, among others.
Jing Zheng 1 , Necip Fazil Ayan 1 , Wen Wang 1 , David
Burkett 2 ; 1 SRI International, USA; 2 University of
California at Berkeley, USA
Mon-Ses3-O4-5, Time: 17:20
Recently, the use of syntax has very effectively improved machine
translation (MT) quality in many text translation tasks. However,
using syntax in speech translation poses additional challenges
because of disfluencies and other spoken language phenomena,
and of errors introduced by automatic speech recognition (ASR). In
this paper, we investigate the effect of using syntax in a large-scale
audio document translation task targeting broadcast news and
broadcast conversations. We do so by comparing the performance
of three synchronous context-free grammar based translation
approaches: 1) hierarchical phrase-based translation, 2) syntax-augmented MT, and 3) string-to-dependency MT. The results show a
positive effect of explicitly using syntax when translating broadcast
news, but no benefit when translating broadcast conversations.
The results indicate that improving the robustness of syntactic
systems against conversational language style is important to their
success and requires future effort.
Context-Driven Automatic Bilingual Movie Subtitle
Alignment
Andreas Tsiartas, Prasanta Kumar Ghosh,
Panayiotis G. Georgiou, Shrikanth S. Narayanan;
University of Southern California, USA
Mon-Ses3-O4-6, Time: 17:40
Movie subtitle alignment is a potentially useful approach for automatically deriving parallel bilingual/multilingual spoken language
data for automatic speech translation. In this paper, we consider
the movie subtitle alignment task. We propose a distance metric
between utterances of different languages based on lexical features
derived from bilingual dictionaries. We use the dynamic time
warping algorithm to obtain the best alignment. The best F-score
of ∼0.713 is obtained using the proposed approach.
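A compact DTW sketch under simplifying assumptions: the lexical cost below just counts unmatched words, standing in for the bilingual-dictionary-based distance metric of the paper:

def dtw_align(src, tgt, cost):
    n, m = len(src), len(tgt)
    INF = float("inf")
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = cost(src[i - 1], tgt[j - 1]) + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    path, i, j = [], n, m                      # backtrack to recover aligned subtitle pairs
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, (i, j) = min((D[i - 1][j - 1], (i - 1, j - 1)),
                        (D[i - 1][j], (i - 1, j)),
                        (D[i][j - 1], (i, j - 1)))
    return list(reversed(path))

def cost(s, t):                                # placeholder for a dictionary-based distance
    s_words, t_words = set(s.split()), set(t.split())
    return 1.0 - len(s_words & t_words) / max(len(s_words), 1)

print(dtw_align(["hello there", "how are you"], ["hello there", "how are you today"], cost))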
Mon-Ses3-P1 : Human Speech Production I
Hewison Hall, 16:00, Monday 7 Sept 2009
Chair: Shrikanth Narayanan, University of Southern California,
USA
Probabilistic Effects on French [t] Duration
Francisco Torreira, Mirjam Ernestus; Radboud
Universiteit Nijmegen, The Netherlands
Mon-Ses3-P1-1, Time: 16:00
The present study shows that [t] consonants are affected by
probabilistic factors in a syllable-timed language such as French, and
in spontaneous as well as in journalistic speech. Study 1 showed
a word bigram frequency effect in spontaneous French, but its
exact nature depended on the corpus on which the probabilistic
measures were based. Study 2 investigated journalistic speech and
showed an effect of the joint frequency of the test word and its
following word. We discuss the possibility that these probabilistic
effects are due to the speaker’s planning of upcoming words, and
to the speaker’s adaptation to the listener’s needs.
On the Production of Sandhi Phenomena in French:
Psycholinguistic and Acoustic Data
Odile Bagou, Violaine Michel, Marina Laganaro;
University of Neuchâtel, Switzerland
Mon-Ses3-P1-2, Time: 16:00
This preliminary study addresses two complementary questions
about the production of sandhi phenomena in French. First, we
investigated whether the encoding of enchaînement and liaison
enchaînée involves a processing cost compared to non-resyllabified
sequences. This question was analyzed with a psycholinguistic
production time paradigm. The elicited sequences were then
used to address our second question, namely how critical V1 CV2
sequences are phonetically realized across different boundary
conditions. We compared the durational properties of critical
sequences containing a word-final coda consonant (enchaînement:
V1 .C#V2 ), an additional consonant (liaison enchaînée: V1 +C#V2 )
and a similar onset consonant (V1 #CV2 ). Results on production
latencies suggested that the encoding of liaison enchaînée involves
an additional processing cost compared to the two other boundary
conditions. In addition, the acoustic analyses indicated durational
differences across the three boundary conditions on V1 , C and V2 .
Implications for both psycholinguistic and phonological models are discussed.
Extreme Reductions: Contraction of Disyllables into
Monosyllables in Taiwan Mandarin
Chierh Cheng, Yi Xu; University College London, UK
Mon-Ses3-P1-3, Time: 16:00
This study investigates a severe form of segmental reduction
known as contraction. In Taiwan Mandarin, a disyllabic word or
phrase is often contracted into a monosyllabic unit in conversational speech, just as “do not” is often contracted into “don’t” in
English. A systematic experiment was conducted to explore the
underlying mechanism of such contraction. Preliminary results
show evidence that contraction is not a categorical shift but a
gradient undershoot of the articulatory target as a result of time
pressure. Moreover, contraction seems to occur only beyond
a certain duration threshold. These findings may further our
understanding of the relation between duration and segmental
reduction.
Annotation and Features of Non-Native Mandarin
Tone Quality
Mitchell Peabody, Stephanie Seneff; MIT, USA
Mon-Ses3-P1-4, Time: 16:00
Native speakers of non-tonal languages, such as American English, frequently have difficulty accurately producing the tones
of Mandarin Chinese. This paper describes a corpus of Mandarin
Chinese spoken by non-native speakers and annotated for tone
quality using a simple good/bad system. We examine inter-rater
correlation of the annotations and highlight the differences in
feature distribution between native, good non-native, and bad
non-native tone productions. We find that the features of tones judged by a simple majority to be bad are significantly different from the features of tones judged to be good and of tones produced by native speakers.
On-Line Formant Shifting as a Function of F0
Kateřina Chládková 1 , Paul Boersma 1 , Václav Jonáš
Podlipský 2 ; 1 University of Amsterdam, The
Netherlands; 2 Palacký University Olomouc, Czech
Republic
Mon-Ses3-P1-5, Time: 16:00
We investigate whether there is a within-speaker effect of a higher
F0 on the values of the first and the second formant. When asked
to speak at a high F0, speakers turn out to raise their formants as
well. In the F1 dimension this effect is greater for women than for
men. We conclude that while a general formant raising effect might
be due to the physiology of a high F0 (i.e. raised larynx and shorter
vocal tract), a plausible explanation for the gender-dependent size
of the effect can only be found in the undersampling hypothesis.
Production Boundary Between Fricative and
Affricate in Japanese and Korean Speakers
Kimiko Yamakawa 1 , Shigeaki Amano 2 , Shuichi
Itahashi 1 ; 1 National Institute of Informatics, Japan;
2 NTT Corporation, Japan
Mon-Ses3-P1-6, Time: 16:00
A fricative [s] and an affricate [ts] pronounced by both native
Japanese and Korean speakers were analyzed to clarify the effect
of the mother language on speech production. It was revealed that
Japanese speakers have a clear individual production boundary
between [s] and [ts], and that this boundary corresponds to the
production boundary of all Japanese speakers. In contrast, although Korean speakers tend to have a clear individual production boundary, this boundary does not correspond to that of Japanese speakers. These facts suggest that Korean speakers tend to have a stable [s]-[ts] production boundary, but one that differs from that of Japanese speakers.
Aerodynamics of Fricative Production in European
Portuguese
Cátia M.R. Pinho 1 , Luis M.T. Jesus 1 , Anna Barney 2 ;
1 Universidade de Aveiro, Portugal; 2 University of
Southampton, UK
Mon-Ses3-P1-7, Time: 16:00
The characteristics of steady state fricative production, and those
of the phone preceding and following the fricative, were investigated. Aerodynamic and electroglottographic (EGG) recordings of
four normal adult speakers (two females and two males), producing
a speech corpus of 9 isolated words with the European Portuguese
(EP) voiced fricatives /v, z, Z/ in initial, medial and final word
position, and the same 9 words embedded in 42 different real EP
carrier sentences, were analysed. Multimodal data allowed the
characterisation of fricatives in terms of their voicing mechanisms,
based on the amplitude of oral flow, F1 excitation and fundamental
frequency (F0).
Contextual Effects on Protrusion and Lip Opening
for /i,y/
Anne Bonneau, Julie Buquet, Brigitte Wrobel-Dautcourt;
LORIA, France
Mon-Ses3-P1-8, Time: 16:00
This study investigates the effect of “adverse” contexts, especially
that of the consonant /S/, on labial parameters for French /i,y/.
Five parameters were analysed: the height, width and area of lip
opening, the distance between the corners of the mouth, as well as
lip protrusion. Ten speakers uttered a corpus made up of isolated
vowels, syllables and logatoms. A special procedure has been
designed to evaluate lip opening contours. Results showed that the
carry-over effect of the consonant /S/ can impede the opposition
between /i/ and /y/ in the protrusion dimension, depending upon
speakers.
Speech Rate Effects on European Portuguese Nasal
Vowels
Catarina Oliveira, Paula Martins, António Teixeira;
Universidade de Aveiro, Portugal
Mon-Ses3-P1-9, Time: 16:00
This paper presents new temporal information regarding the
production of European Portuguese (EP) nasal vowels, based on
new EMMA data. The influence of speech rate on duration of
velum gestures and their coordination with consonantic and glottal
gestures were analyzed. As information on relative speed of
articulators is scarce, the parameter stiffness for the nasal gestures
was also calculated and analyzed. Results show clear effects of
speech rate on temporal characteristics of EP nasal vowels. Speech
rate reduces the duration of velum gestures, increases the stiffness
and inter-gestural overlap.
Relation of Formants and Subglottal Resonances in
Hungarian Vowels
Tamás Gábor Csapó 1 , Zsuzsanna Bárkányi 2 ,
Tekla Etelka Gráczi 2 , Tamás Bőhm 1 , Steven M.
Lulich 3 ; 1 BME, Hungary; 2 Hungarian Academy of
Sciences, Hungary; 3 MIT, USA
Mon-Ses3-P1-10, Time: 16:00
The relation between vowel formants and subglottal resonances
(SGRs) has previously been explored in English, German, and
Korean. Results from these studies indicate that vowel classes
are categorically separated by SGRs. We extended this work to
Hungarian vowels, which have not been related to SGRs before. The
Hungarian vowel system contains paired long and short vowels
as well as a series of front rounded vowels, similar to German
but more complex than English and Korean. Results indicate that
SGRs separate vowel classes in Hungarian as in English, German,
and Korean, and uncover additional patterns of vowel formants
relative to the third subglottal resonance (Sg3). These results have
implications for understanding phonological distinctive features,
and applications in automatic speech technologies.
Mon-Ses3-P2 : Prosody, Text Analysis, and
Multilingual Models
Hewison Hall, 16:00, Monday 7 Sept 2009
Chair: Andrew Breen, Nuance Communications, Belgium
Polyglot Speech Prosody Control
Harald Romsdorfer; ETH Zürich, Switzerland
Mon-Ses3-P2-1, Time: 16:00
Within a polyglot text-to-speech synthesis system, the generation of an adequate prosody for mixed-lingual texts, sentences, or even words requires a polyglot prosody model that is able to seamlessly switch between languages and that applies the same voice for all languages. This paper presents the first polyglot prosody model that fulfills these requirements and that is constructed from independent monolingual prosody models. A perceptual evaluation showed that the synthetic polyglot prosody of about 82% of German and French mixed-lingual test sentences cannot be distinguished from natural polyglot prosody.
Weighted Neural Network Ensemble Models for
Speech Prosody Control
Harald Romsdorfer; ETH Zürich, Switzerland
Mon-Ses3-P2-2, Time: 16:00
In text-to-speech synthesis systems, the quality of the predicted prosody contours influences the quality and naturalness of synthetic speech. This paper presents a new statistical model for prosody control that combines an ensemble learning technique using neural networks as base learners with feature relevance determination. This weighted neural network ensemble model was applied to both phone duration modeling and fundamental frequency modeling. A comparison with state-of-the-art prosody models based on classification and regression trees (CART), multivariate adaptive regression splines (MARS), or artificial neural networks (ANN) shows a 12% improvement compared to the best duration model and a 24% improvement compared to the best F0 model. The neural network ensemble model also outperforms another, recently presented ensemble model based on gradient tree boosting.
Cross-Language F0 Modeling for Under-Resourced
Tonal Languages: A Case Study on Thai-Mandarin
Vataya Boonpiam, Anocha Rugchatjaroen, Chai
Wutiwiwatchai; NECTEC, Thailand
Mon-Ses3-P2-3, Time: 16:00
This paper proposes a novel method for F0 modeling in under-resourced tonal languages. Conventional statistical models require large training data, which are deficient in many languages. In tonal languages, different syllabic tones are represented by different F0 shapes, some of which are similar across languages. With cross-language F0 contour mapping, we can augment the F0 model of one under-resourced language with corpora from another rich-resourced language. A case study on Thai HMM-based F0 modeling with a Mandarin corpus is explored. Compared to baseline systems without cross-language resources, a relative RMSE reduction of over 7% and a significant improvement in MOS are obtained.
Prosodic Issues in Synthesising Thadou, a
Tibeto-Burman Tone Language
Dafydd Gibbon 1 , Pramod Pandey 2 , D. Mary Kim
Haokip 3 , Jolanta Bachan 4 ; 1 Universität Bielefeld,
Germany; 2 Jawaharlal Nehru University, India;
3 Assam University, India; 4 Adam Mickiewicz
University, Poland
Mon-Ses3-P2-4, Time: 16:00
The objective of the present analysis is to present linguistic
constraints on the phonetic realisation of lexical tone which are
relevant for the choice of a speech synthesis development strategy
for a specific type of tone language. The selected case is Thadou
(Tibeto-Burman), which has lexical and morphosyntactic tone as
well as phonetic tone displacement. The last two constraint types
differ from those in more well-known tone languages such as
Mandarin, and present problems for mainstream corpus-based
speech synthesis techniques. Linguistic and phonetic models and
a ‘microvoice’ for rule-based tone generation are developed.
Advanced Unsupervised Joint Prosody Labeling and
Modeling for Mandarin Speech and its Application to
Prosody Generation for TTS
Chen-Yu Chiang, Sin-Horng Chen, Yih-Ru Wang;
National Chiao Tung University, Taiwan
Mon-Ses3-P2-5, Time: 16:00
Motivated by the success of the unsupervised joint prosody
labeling and modeling (UJPLM) method for Mandarin speech on
modeling of syllable pitch contour in our previous study, in this
paper, the advanced UJPLM (A-UJPLM) method is proposed based
on UJPLM to jointly label prosodic tags and model syllable pitch
contour, duration and energy level. Experimental results on the
Sinica Treebank corpus showed that most prosodic tags labeled
were linguistically meaningful and the model parameters estimated
were interpretable and generally agreed with previous studies. By virtue of the functions given by the model parameters, an application of A-UJPLM to prosody generation for Mandarin TTS is proposed. Experimental results showed that the proposed method
performed well. Most predicted prosodic features matched well to
their original counterparts. This also reconfirmed the effectiveness
of the A-UJPLM method.
Optimization of T-Tilt F0 Modeling
Ausdang Thangthai, Anocha Rugchatjaroen, Nattanun
Thatphithakkul, Ananlada Chotimongkol, Chai
Wutiwiwatchai; NECTEC, Thailand
Mon-Ses3-P2-6, Time: 16:00
This paper investigates the improvement of T-Tilt modeling, a modified Tilt model specifically designed for F0 modeling in tonal languages. The model has proved to work well for F0 analysis but suffers in text-to-F0 prediction. To optimize it, the T-Tilt event
is restricted to span over the whole syllable unit which helps
reduce the number of parameters significantly. F0 interpolation
and smoothing processes often performed in preprocessing are
avoided to prevent modeling errors. F0 shape preclassification
and parameter clustering are introduced for better modeling.
Evaluation results using the optimized model show the significant
improvement for both F0 analysis and prediction.
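For readers unfamiliar with the underlying Tilt parameterisation that T-Tilt modifies, the sketch below computes the standard amplitude, duration and tilt parameters from a rise/fall description of a single F0 event; the input values are invented and the T-Tilt refinements described above are not reproduced here.

    # Minimal sketch of Tilt-style event parameters (assumes the rise/fall
    # amplitudes in Hz and durations in seconds of one F0 event are already
    # measured; values below are made up).
    def tilt_parameters(rise_amp, rise_dur, fall_amp, fall_dur):
        amp_sum = abs(rise_amp) + abs(fall_amp)
        dur_sum = rise_dur + fall_dur
        amp_tilt = (abs(rise_amp) - abs(fall_amp)) / amp_sum if amp_sum else 0.0
        dur_tilt = (rise_dur - fall_dur) / dur_sum if dur_sum else 0.0
        return {
            "amplitude": amp_sum,                 # overall excursion size
            "duration": dur_sum,                  # overall event length
            "tilt": 0.5 * (amp_tilt + dur_tilt),  # +1 pure rise, -1 pure fall
        }

    print(tilt_parameters(rise_amp=30.0, rise_dur=0.12, fall_amp=10.0, fall_dur=0.08))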
A Multi-Level Context-Dependent Prosodic Model
Applied to Durational Modeling
Nicolas Obin 1 , Xavier Rodet 1 , Anne Lacheret-Dujour 2 ;
1 IRCAM, France; 2 MoDyCo, France
Mon-Ses3-P2-7, Time: 16:00
We present in this article a multi-level prosodic model based on the estimation of prosodic parameters on a set of well-defined linguistic units. Different linguistic units are used to represent different
scales of prosodic variations (local and global forms) and thus to
estimate the linguistic factors that can explain the variations of
prosodic parameters independently on each level. This model is
applied to the modeling of syllable-based durational parameters
on two read speech corpora — laboratory and acted speech. Compared to a syllable-based baseline model, the proposed approach
improves performance in terms of the temporal organization of the predicted durations (correlation score) and reduces the model's complexity, while showing comparable performance in terms of
relative prediction error.
Sentiment Classification in English from
Sentence-Level Annotations of Emotions Regarding
Models of Affect
Alexandre Trilla, Francesc Alías; Universitat Ramon
Llull, Spain
Mon-Ses3-P2-8, Time: 16:00
This paper presents a text classifier for automatically tagging the sentiment of input text according to the emotion that is being conveyed. This system has a pipelined framework composed of Natural Language Processing modules for feature extraction and a hard binary classifier for decision making between positive and negative categories. To do so, the Semeval 2007 dataset, composed of sentences emotionally annotated, is used for training purposes after being mapped into a model of affect. The resulting scheme stands as a first step towards a complete emotion classifier for a future automatic expressive text-to-speech synthesizer.
Identification of Contrast and its Emphatic
Realization in HMM Based Speech Synthesis
Leonardo Badino, J. Sebastian Andersson, Junichi
Yamagishi, Robert A.J. Clark; University of Edinburgh,
UK
Mon-Ses3-P2-9, Time: 16:00
The work presented in this paper proposes to identify contrast in the form of contrastive word pairs and prosodically signal it with emphatic accents in a Text-to-Speech (TTS) application using a Hidden-Markov-Model (HMM) based speech synthesis system.
We first describe a novel method to automatically detect contrastive word pairs using textual features only and report its performance on a corpus of spontaneous conversations in English. Subsequently we describe the set of features selected to train an HMM-based speech synthesis system while attempting to properly control prosodic prominence (including emphasis).
Results from a large-scale perceptual test show that in the majority of cases listeners judge emphatic contrastive word pairs to be as acceptable as their non-emphatic counterparts, while emphasis on non-contrastive pairs is almost never acceptable.
How to Improve TTS Systems for Emotional
Expressivity
Antonio Rui Ferreira Rebordao, Mostafa Al Masum
Shaikh, Keikichi Hirose, Nobuaki Minematsu;
University of Tokyo, Japan
Mon-Ses3-P2-10, Time: 16:00
Several experiments have been carried out that revealed weaknesses
of the current Text-To-Speech (TTS) systems in their emotional
expressivity. Although some TTS systems allow XML-based representations of prosodic and/or phonetic variables, few publications
considered, as a pre-processing stage, the use of intelligent text
processing to detect affective information that can be used to tailor
the parameters needed for emotional expressivity. This paper
describes a technique for an automatic prosodic parameterization
based on affective clues. This technique recognizes the affective information conveyed in a text and, according to its emotional connotation, assigns appropriate pitch accents and other prosodic
parameters by XML-tagging. This pre-processing assists the TTS
system to generate synthesized speech that contains emotional
clues. The experimental results are encouraging and suggest the
possibility of suitable emotional expressivity in speech synthesis.
State Mapping Based Method for Cross-Lingual
Speaker Adaptation in HMM-Based Speech Synthesis
Yi-Jian Wu, Yoshihiko Nankaku, Keiichi Tokuda;
Nagoya Institute of Technology, Japan
Mon-Ses3-P2-11, Time: 16:00
A phone mapping-based method has previously been introduced for cross-lingual speaker adaptation in HMM-based speech synthesis. In this paper, we propose a state mapping based method for cross-lingual speaker adaptation. In this method, we first
establish the state mapping between two voice models in source
and target languages using Kullback-Leibler divergence (KLD).
Based on the established mapping information, we introduce two
approaches to conduct cross-lingual speaker adaptation, including
data mapping and transform mapping approaches. From the experimental results, the state mapping based method outperformed
the phone mapping based method. In addition, the data mapping
approach achieved better speaker similarity, and the transform
mapping approach achieved better speech quality after adaptation.
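A minimal sketch of the state-mapping step, assuming each state is summarised by a single diagonal-covariance Gaussian (real HMM voice models are richer): each target-language state is mapped to the source-language state with the smallest symmetrised Kullback-Leibler divergence. The dimensions and random parameters are placeholders.

    # Illustrative sketch: map each target-language state to the source-language
    # state with minimum (symmetrised) KLD between diagonal Gaussians.
    import numpy as np

    def kld_diag_gauss(mu0, var0, mu1, var1):
        """KL( N(mu0, diag(var0)) || N(mu1, diag(var1)) )."""
        return 0.5 * np.sum(np.log(var1 / var0) + (var0 + (mu0 - mu1) ** 2) / var1 - 1.0)

    def build_state_mapping(target_states, source_states):
        """Return {target_index: source_index} by nearest symmetrised KLD."""
        mapping = {}
        for t, (mu_t, var_t) in enumerate(target_states):
            divs = [kld_diag_gauss(mu_t, var_t, mu_s, var_s) +
                    kld_diag_gauss(mu_s, var_s, mu_t, var_t)
                    for (mu_s, var_s) in source_states]
            mapping[t] = int(np.argmin(divs))
        return mapping

    rng = np.random.default_rng(0)
    src = [(rng.normal(size=13), np.ones(13)) for _ in range(5)]   # toy source states
    tgt = [(rng.normal(size=13), np.ones(13)) for _ in range(4)]   # toy target states
    print(build_state_mapping(tgt, src))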
Real Voice and TTS Accent Effects on Intelligibility
and Comprehension for Indian Speakers of English
as a Second Language
Frederick Weber 1 , Kalika Bali 2 ; 1 Columbia University,
USA; 2 Microsoft Research India, India
Mon-Ses3-P2-12, Time: 16:00
We investigate the effect of accent on comprehension of English
for speakers of English as a second language in southern India.
Subjects were exposed to real and TTS voices with US and several
Indian accents, and were tested for intelligibility and comprehension. Performance trends indicate a measurable advantage for
familiar accents, and are broken down by various demographic
factors.
Improving Consistence of Phonetic Transcription
for Text-to-Speech
Pablo Daniel Agüero 1 , Antonio Bonafonte 2 ,
Juan Carlos Tulli 1 ; 1 Universidad Nacional de Mar del
Plata, Argentina; 2 Universitat Politècnica de
Catalunya, Spain
Mon-Ses3-P2-13, Time: 16:00
Grapheme-to-phoneme conversion is an important step in speech segmentation and synthesis. Many approaches have been proposed in the literature to perform appropriate transcriptions: CART, FST, HMM, etc. In this paper we propose the use of an automatic algorithm that uses transformation-based error-driven learning to match the phonetic transcription with the speaker’s dialect and style. Different transcriptions based on words, part-of-speech tags, weak forms and phonotactic rules are validated. The experimental results show an improvement in the transcription using an objective measure. The articulation MOS score is also improved, as most of the changes in phonetic transcription affect coarticulation effects.
Mon-Ses3-P3 : Automatic Speech
Recognition: Adaptation I
Hewison Hall, 16:00, Monday 7 Sept 2009
Chair: Stephen J. Cox, University of East Anglia, UK
On the Development of Matched and Mismatched
Italian Children’s Speech Recognition Systems
Piero Cosi; CNR-ISTC, Italy
Mon-Ses3-P3-1, Time: 16:00
While at least read speech corpora are available for Italian children’s
speech research, there exist many languages which completely lack
children’s speech corpora. We propose that learning statistical
mappings between the adult and child acoustic space using existing
adult/children corpora may provide a future direction for generating children’s models for such data deficient languages. In this
work the recent advances in the development of the SONIC Italian
children’s speech recognition system will be described. This work,
completing a previous one developed in the past, was conducted
with the specific goals of integrating the newly trained children’s
speech recognition models into the Italian version of the Colorado
Literacy Tutor platform. Specifically, children’s speech recognition
research for Italian was conducted using the complete training and
test set of the FBK (ex ITC-irst) Italian Children’s Speech Corpus
(ChildIt). Using the University of Colorado SONIC LVCSR system, we demonstrate a phonetic recognition error rate of 12.0% for
a system which incorporates Vocal Tract Length Normalization
(VTLN), Speaker-Adaptive Trained phonetic models, as well as
unsupervised Structural MAP Linear Regression (SMAPLR).
Combination of Acoustic and Lexical Speaker
Adaptation for Disordered Speech Recognition
Oscar Saz, Eduardo Lleida, Antonio Miguel;
Universidad de Zaragoza, Spain
Mon-Ses3-P3-2, Time: 16:00
This paper presents an approach to provide lexical adaptation in Automatic Speech Recognition (ASR) of disordered speech from a group of young impaired speakers. The outcome of an Acoustic
Phonetic Decoder (APD) is used to learn new lexical variants of the
57-word vocabulary and add them to a lexicon personalized to each
user. The possibilities of combination of this lexical adaptation
with acoustic adaptation achieved through traditional Maximum A
Posteriori (MAP) approaches are further explored, and the results
show the importance of matching the lexicon in the ASR decoding
phase to the lexicon used for the acoustic adaptation.
Bilinear Transformation Space-Based Maximum
Likelihood Linear Regression Frameworks
Hwa Jeon Song, Yongwon Jeong, Hyung Soon Kim;
Pusan National University, Korea
Mon-Ses3-P3-3, Time: 16:00
This paper proposes two types of bilinear transformation space-based speaker adaptation frameworks. In the training session,
transformation matrices for speakers are decomposed into the
style factor for speakers’ characteristics and orthonormal basis of
eigenvectors to control dimensionality of the canonical model by
the singular value decomposition-based algorithm. In adaptation
session, the style factor of a new speaker is estimated, depending
on what kind of proposed framework is used. At the same time,
the dimensionality of the canonical model can be reduced by
the orthonormal basis from training. Moreover, both maximum
likelihood linear regression (MLLR) and eigenspace-based MLLR are
identified as special cases of our proposed methods. Experimental results show that the proposed methods are much more effective and versatile than other methods.
Speaking Style Adaptation for Spontaneous Speech
Recognition Using Multiple-Regression HMM
Yusuke Ijima, Takeshi Matsubara, Takashi Nose,
Takao Kobayashi; Tokyo Institute of Technology, Japan
Mon-Ses3-P3-4, Time: 16:00
This paper describes a rapid model adaptation technique for
spontaneous speech recognition. The proposed technique utilizes
a multiple-regression hidden Markov model (MRHMM) and is based
on a style estimation technique of speech. In the MRHMM, the mean
vector of probability density function (pdf) is given by a function
of a low-dimensional vector, called style vector, which corresponds
to the intensity of expressivity of speaking style variation. The
value of the style vector is estimated for every utterance of the
input speech and the model adaptation is conducted by calculating
new mean vectors of the pdf using the estimated style vector.
The performance evaluation results using “Corpus of spontaneous
Japanese (CSJ)” are shown under a condition in which the amount
of model training and adaptation data is very small.
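A small sketch of the multiple-regression idea, under the simplifying assumption that each mean vector is a linear function of an augmented style vector; the dimensions, matrix and style values below are invented, and estimating the style vector itself is not shown.

    # Sketch of multiple-regression mean adaptation: mu = H [1, s]^T, so adapting
    # to a new utterance only requires re-evaluating this function with the
    # estimated style vector. Dimensions and values are illustrative.
    import numpy as np

    feat_dim, style_dim = 13, 2
    rng = np.random.default_rng(1)
    H = rng.normal(size=(feat_dim, style_dim + 1))   # regression matrix, +1 bias column

    def adapted_mean(H, style_vector):
        xi = np.concatenate(([1.0], style_vector))   # augmented style vector [1, s]
        return H @ xi

    neutral = adapted_mean(H, np.zeros(style_dim))        # style vector at the origin
    expressive = adapted_mean(H, np.array([0.8, -0.3]))   # an estimated style vector
    print(np.round(neutral - expressive, 3))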
Acoustic Class Specific VTLN-Warping Using
Regression Class Trees
S.P. Rath, S. Umesh; IIT Kanpur, India
Mon-Ses3-P3-5, Time: 16:00
In this paper, we study the use of different frequency warp-factors for different acoustic classes in a computationally efficient framework of Vocal Tract Length Normalization (VTLN). This
is motivated by the fact that all acoustic classes do not exhibit
similar spectral variations as a result of physiological differences
in vocal tract, and therefore, the use of a single frequency-warp
for the entire utterance may not be appropriate. We have recently
proposed a VTLN method that implements VTLN-warping through
a linear-transformation (LT) of the conventional MFCC features
and efficiently estimates the warp-factor using the same sufficient
statistics as that are used in CMLLR adaptation. In this paper we
have shown that, in this framework of VTLN, and using the idea
of regression class tree, we can obtain separate VTLN-warping for
different acoustic classes. The use of regression class tree ensures
that warp-factor is estimated for each class even when there is
very little data available for that class. The acoustic classes, in
general, can be any collection of the Gaussian components in the
acoustic model. We have built acoustic classes by using data-driven
approach and by using phonetic knowledge. Using WSJ database
we have shown the recognition performance of the proposed
acoustic class specific warp-factor both for the data driven and the
phonetic knowledge based regression class tree definitions and
compare it with the case of the single warp-factor.
Speaker Normalization for Template Based Speech
Recognition
Sébastien Demange, Dirk Van Compernolle; Katholieke
Universiteit Leuven, Belgium
Mon-Ses3-P3-6, Time: 16:00
Vocal Tract Length Normalization (VTLN) has been shown to be an efficient speaker normalization tool for HMM based systems. In this paper we show that it is equally efficient for a template based recognition system. Template based systems, while promising, have as a potential drawback that templates maintain all non-phonetic details apart from the essential phonemic properties; i.e. they retain information on speaker and acoustic recording circumstances. This may lead to a very inefficient usage of the database. We show that after VTLN significantly more speakers — also from the opposite gender — contribute templates to the matching sequence compared to the non-normalized case. In experiments on the Wall Street Journal database this leads to a relative word error rate reduction of 10%.
Improving the Robustness with Multiple Sets of
HMMs
Hans-Günter Hirsch, Andreas Kitzig; HS Niederrhein,
Germany
Mon-Ses3-P3-7, Time: 16:00
The highest recognition performance is still achieved when training a recognition system with speech data that have been recorded in the acoustic scenario where the system will be applied. We investigated the approach of using several sets of HMMs. These sets have been trained on data that were recorded in different typical noise situations. One HMM set is individually selected at each speech input by comparing the pause segment at the beginning of the utterance with the pause models of all sets. We observed a considerable reduction of the error rates when applying this approach in comparison to two well known techniques for improving the robustness. Furthermore, we developed a technique to additionally adapt certain parameters of the selected HMMs to the specific noise condition. This leads to a further improvement of the recognition rates.
On the Use of Pitch Normalization for Improving
Children’s Speech Recognition
Rohit Sinha, Shweta Ghai; IIT Guwahati, India
Mon-Ses3-P3-8, Time: 16:00
In this work, we have studied the effect of pitch variations across
the speech signals in context of automatic speech recognition.
Our initial study done on vowel data indicates that on account
of insufficient smoothing of pitch harmonics by the filterbank,
particularly for high pitch signals, the variances of the mel frequency cepstral coefficient (MFCC) features significantly increase with increasing pitch of the speech signals. Further, to reduce the
variance of MFCC feature due to varying pitch among speakers, a
maximum likelihood based explicit pitch normalization method
has been explored. On a connected digit recognition task, with pitch normalization a relative improvement of 15% is obtained over the baseline for children’s speech (higher pitch) on models trained on adults’ speech (lower pitch).
Using VTLN Matrices for Rapid and
Computationally-Efficient Speaker Adaptation with
Robustness to First-Pass Transcription Errors
S.P. Rath, S. Umesh, A.K. Sarkar; IIT Kanpur, India
Mon-Ses3-P3-9, Time: 16:00
In this paper, we propose to combine the rapid adaptation capability of conventional Vocal Tract Length Normalization (VTLN)
with the computational efficiency of transform-based adaptation
such as MLLR or CMLLR. VTLN requires the estimation of only
one parameter and is, therefore, most suited for the cases where
there is little adaptation data (i.e. rapid adaptation). In contrast,
transform-based adaptation methods require the estimation
of matrices. However, the drawback of conventional VTLN is
that it is computationally expensive since it requires multiple
spectral-warping to generate VTLN-warped features. We have
recently shown that VTLN-warping can be implemented by a linear transformation (LT) of the conventional MFCC features. These LTs are analytically pre-computed and stored. In this framework of LT
VTLN, the computational complexity of VTLN is similar to transform-based adaptation since warp-factor estimation can be done using
the same sufficient statistics as that are used in CMLLR. We show
that VTLN provides significant improvement in performance when
there is small adaptation data as compared to transform-based
adaptation methods. We also show that the use of an additional
decorrelating transform, MLLT, along with the VTLN-matrices, gives
performance that is better than MLLR and comparable to SAT with
MLLT even for large adaptation data. Further we show that in the
mismatched train and test case (i.e. poor first-pass transcription),
VTLN provides significant improvement over the transform-based
adaptation methods. We compare the performances of different
methods on the WSJ, the RM and the TIDIGITS databases.
Speaker Adaptation Based on Two-Step Active
Learning
Koichi Shinoda, Hiroko Murakami, Sadaoki Furui;
Tokyo Institute of Technology, Japan
Mon-Ses3-P3-10, Time: 16:00
We propose a two-step active learning method for supervised
speaker adaptation. In the first step, the initial adaptation data
is collected to obtain a phone error distribution. In the second
step, those sentences whose phone distributions are close to the
error distribution are selected, and their utterances are collected
as the additional adaptation data. We evaluated the method using
a Japanese speech database and maximum likelihood linear regression (MLLR) as the speaker adaptation algorithm. We confirmed
that our method had a significant improvement over a method
using randomly chosen sentences for adaptation.
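A toy sketch of the second selection step, assuming the phone error distribution from the first step is already available; KL divergence is used here as one plausible closeness measure, and the phone set, counts and sentences are invented.

    # Sketch: rank candidate sentences by how closely their phone distributions
    # match the phone-error distribution observed after the first adaptation pass.
    from collections import Counter
    import math

    def distribution(phones, phone_set, eps=1e-6):
        counts = Counter(phones)
        total = sum(counts.values()) + eps * len(phone_set)
        return {p: (counts[p] + eps) / total for p in phone_set}

    def kl(p, q):
        return sum(p[k] * math.log(p[k] / q[k]) for k in p)

    phone_set = ["a", "i", "u", "e", "o", "k", "s", "t"]
    error_dist = distribution(["s", "s", "t", "i", "u"], phone_set)   # from step one

    candidates = {
        "sentence_1": ["k", "a", "s", "i", "t", "a"],
        "sentence_2": ["o", "o", "k", "e", "a"],
        "sentence_3": ["s", "i", "t", "u", "s"],
    }
    ranked = sorted(candidates,
                    key=lambda s: kl(error_dist, distribution(candidates[s], phone_set)))
    print(ranked)   # sentences most similar to the error distribution come first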
Tree-Based Estimation of Speaker Characteristics for
Speech Recognition
Mats Blomberg, Daniel Elenius; KTH, Sweden
Mon-Ses3-P3-11, Time: 16:00
Speaker adaptation by means of adjustment of speaker characteristic properties, such as vocal tract length, has the important advantage over conventional adaptation techniques that the adapted models are guaranteed to be realistic if the descriptions of the properties are. One problem with this approach is that the
search procedure to estimate them is computationally heavy. We
address the problem by using a multi-dimensional, hierarchical
tree of acoustic model sets. The leaf sets are created by transforming a conventionally trained model set using leaf-specific speaker
profile vectors. The model sets of non-leaf nodes are formed by
merging the models of their child nodes, using a computationally
efficient algorithm. During recognition, a maximum likelihood
criterion is followed to traverse the tree. Studies of one- (VTLN)
and four-dimensional speaker profile vectors (VTLN, two spectral
slope parameters and model variance scaling) exhibit a reduction
of the computational load to a fraction compared to that of an
exhaustive grid search. In recognition experiments on children’s
connected digits using adult and male models, the one-dimensional
tree search performed as well as the exhaustive search. Further
reduction was achieved with four dimensions. The best recognition results are 0.93% and 10.2% WER in TIDIGITS and PF-Star-Sw,
respectively, using adult models.
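A schematic sketch of the maximum-likelihood traversal, with a one-dimensional (VTLN-only) toy tree; the score function stands in for the acoustic log-likelihood of the utterance under a node's merged or transformed model set, and all values are invented.

    # Sketch of greedy maximum-likelihood descent through a tree of model sets.
    class Node:
        def __init__(self, profile, children=()):
            self.profile = profile           # e.g. (VTLN warp factor,)
            self.children = list(children)

    def traverse(root, loglik):
        """Descend from the root, always entering the highest-likelihood child."""
        node = root
        while node.children:
            node = max(node.children, key=lambda c: loglik(c.profile))
        return node.profile                  # leaf-specific speaker profile

    # Toy tree over warp factors; internal nodes carry the mean profile of their
    # children, mimicking merged model sets.
    leaves = [Node((w,)) for w in (0.80, 0.88, 0.96, 1.04, 1.12)]
    left, right = Node((0.88,), leaves[:3]), Node((1.08,), leaves[3:])
    root = Node((1.0,), [left, right])

    def fake_loglik(profile, true_warp=0.90):
        return -(profile[0] - true_warp) ** 2    # stand-in for the utterance log-likelihood

    print(traverse(root, fake_loglik))   # -> (0.88,), the leaf closest to the "true" warp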
A Study on the Influence of Covariance Adaptation
on Jacobian Compensation in Vocal Tract Length
Normalization
D.R. Sanand, S.P. Rath, S. Umesh; IIT Kanpur, India
Mon-Ses3-P3-12, Time: 16:00
In this paper, we first show that accounting for Jacobian in Vocal Tract Length Normalization (VTLN) will degrade the performance when there is a mismatch between the train and test speaker
conditions. VTLN is implemented using our recently proposed
approach of linear transformation of conventional MFCC, i.e.
a feature transformation. In this case, Jacobian is simply the
determinant of the linear transformation. Feature transformation
is equivalent to the means and covariances of the model being
transformed by the inverse transformation while leaving the data
unchanged. Using a set of adaptation experiments, we analyze
the reasons for the degradation during Jacobian compensation
and conclude that applying the same VTLN transformation on
both means and variances does not fully match the data when
there is a mismatch in the speaker conditions. This may have
similar implications for constrained-MLLR in mismatched speaker
conditions. We then propose to use covariance adaptation on top
of VTLN to account for the covariance mismatch between the train
and the test speakers and show that accounting for Jacobian after
covariance adaptation improves the performance.
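A worked numeric check of the Jacobian relation the analysis above relies on: for a feature transform y = Ax, adding log|det A| to the log-likelihood of the transformed feature equals evaluating the original feature under the inverse-transformed model; the matrices below are arbitrary.

    # Numeric check: log N(Ax; mu, Sigma) + log|det A| == log N(x; A^{-1} mu, A^{-1} Sigma A^{-T})
    import numpy as np

    rng = np.random.default_rng(2)
    d = 3
    A = rng.normal(size=(d, d)) + 3 * np.eye(d)   # an invertible "VTLN-like" transform
    x = rng.normal(size=d)
    mu, Sigma = rng.normal(size=d), np.eye(d)

    def log_gauss(x, mu, Sigma):
        k = x.shape[0]
        diff = x - mu
        _, logdet = np.linalg.slogdet(Sigma)
        return -0.5 * (k * np.log(2 * np.pi) + logdet + diff @ np.linalg.solve(Sigma, diff))

    lhs = log_gauss(A @ x, mu, Sigma) + np.linalg.slogdet(A)[1]

    # Equivalent model-space view: transform the mean and covariance by A^{-1}.
    A_inv = np.linalg.inv(A)
    rhs = log_gauss(x, A_inv @ mu, A_inv @ Sigma @ A_inv.T)
    print(np.allclose(lhs, rhs))   # True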
Mon-Ses3-P4 : Applications in Learning and
Other Areas
Hewison Hall, 16:00, Monday 7 Sept 2009
Chair: Nestor Becerra Yoma, Universidad de Chile, Chile
Designing Spoken Tutorial Dialogue with Children
to Elicit Predictable but Educationally Valuable
Responses
Gregory Aist, Jack Mostow; Carnegie Mellon University,
USA
Mon-Ses3-P4-1, Time: 16:00
How can we construct spoken dialogue interactions with children that are educationally effective and technically feasible? To address this challenge, we propose a design principle that constructs short dialogues in which (a) the user’s utterances are the external evidence
of task performance or learning in the domain, and (b) the target
utterances can be expressed as a well-defined set, in some cases even as a finite language (up to a small set of variables which may change from exercise to exercise). The key approach is to teach
the human learner a parameterized process that maps input to
response. We describe how the discovery of this design principle
came out of analyzing the processes of automated tutoring for
reading and pronunciation and designing dialogues to address
vocabulary and comprehension, show how it also accurately describes the design of several other language tutoring interactions,
and discuss how it could extend to non-language tutoring tasks.
Optimizing Non-Native Speech Recognition for CALL
Applications
Joost van Doremalen, Helmer Strik, Catia Cucchiarini;
Radboud Universiteit Nijmegen, The Netherlands
Mon-Ses3-P4-2, Time: 16:00
We are developing a Computer Assisted Language Learning (CALL)
system for practicing oral proficiency that makes use of Automatic
Speech Recognition (ASR) to provide feedback on grammar and
pronunciation. Since good quality unconstrained non-native ASR
is not yet feasible, we use an approach in which we try to elicit
constrained responses. The task in the current experiments is
to select utterances from a list of responses. The results of our
experiments show that significant improvements can be obtained
by optimizing the language model and the acoustic models, thus
reducing the utterance error rate from 29–26% to 10–8%.
Evaluation of English Intonation Based on
Combination of Multiple Evaluation Scores
Akinori Ito, Tomoaki Konno, Masashi Ito, Shozo
Makino; Tohoku University, Japan
Mon-Ses3-P4-3, Time: 16:00
In this paper, we propose a novel method for evaluating the intonation of an English utterance spoken by a learner, for intonation learning with a CALL system. The proposed method is based on an intonation evaluation method proposed by Suzuki et al., which uses
“word importance factors,” which are calculated based on word
clusters given by a decision tree. We extended Suzuki’s method so
that multiple decision trees are used and the resulting intonation
scores are combined using multiple regression. As a result of an experiment, we obtained a correlation coefficient comparable to the correlation between human raters.
A Language-Independent Feature Set for the
Automatic Evaluation of Prosody
Andreas Maier, F. Hönig, V. Zeissler, Anton Batliner, E.
Körner, N. Yamanaka, P. Ackermann, Elmar Nöth; FAU
Erlangen-Nürnberg, Germany
Mon-Ses3-P4-4, Time: 16:00
In second language learning, the correct use of prosody plays a vital
role. Therefore, an automatic method to evaluate the naturalness
of the prosody of a speaker is desirable. We present a novel
method to model prosody independently of the text and thus independently of the language as well. For this purpose, the voiced and
unvoiced speech segments are extracted and a 187-dimensional
feature vector is computed for each voiced segment. This approach
is compared to word based prosodic features on a German text
passage. Both are confronted with the perceptive evaluation of two
native speakers of German. The word-based feature set yielded
correlations of up to 0.92, while the text-independent feature set yielded 0.88. This is in the same range as the inter-rater correlation of 0.88. Furthermore, the text-independent features were
computed for a Japanese translation of the passage which was also
rated by two native speakers of Japanese. Again, the correlation
between the automatic system and the human perception of the
naturalness was high with 0.83 and not significantly lower than the
inter-rater correlation of 0.92.
Adapting the Acoustic Model of a Speech
Recognizer for Varied Proficiency Non-Native
Spontaneous Speech Using Read Speech with
Language-Specific Pronunciation Difficulty
Klaus Zechner, Derrick Higgins, René Lawless, Yoko
Futagi, Sarah Ohls, George Ivanov; Educational Testing
Service, USA
Mon-Ses3-P4-5, Time: 16:00
This paper presents a novel approach to acoustic model adaptation of a recognizer for non-native spontaneous speech in the context of recognizing candidates’ responses in a test of spoken English. Instead of collecting and then transcribing spontaneous speech data, a read speech corpus is created where non-native speakers of English read English sentences of different degrees of pronunciation difficulty with respect to their native language. The motivation for this approach is (1) to save time and cost associated with transcribing spontaneous speech, and (2) to allow for a targeted training of the recognizer, focusing particularly on those phoneme environments which are difficult to pronounce correctly by non-native speakers and hence have a higher likelihood of being misrecognized. As a criterion for selecting the sentences to be read, we develop a novel score, the “phonetic challenge score”, consisting of a measure for native language-specific difficulties described in the second-language acquisition literature and also of a statistical measure based on the cross-entropy between phoneme sequences of the native language and English.
We collected about 23,000 read sentences from 200 speakers in four language groups: Chinese, Japanese, Korean, and Spanish. We used this data for acoustic model adaptation of a spontaneous speech recognizer and compared recognition performance between the unadapted baseline and the system after adaptation on a held-out set from the English test responses data set.
The results show that using this targeted read speech material for acoustic model adaptation does reduce the word error rate significantly for two of the four language groups of the spontaneous speech test set, while changes for the two other language groups are not significant.
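A toy sketch of the statistical half of such a score, assuming phoneme streams are available for the native language and the candidate English sentences; the phone inventories, smoothing and unigram model are simplifications of the published measure.

    # Sketch: cross-entropy of an English phoneme sequence under a unigram
    # phoneme model estimated from native-language data (toy values only).
    from collections import Counter
    import math

    def unigram_model(training_phones, vocab, alpha=0.5):
        """Add-alpha smoothed unigram distribution over the joint phone vocabulary."""
        counts = Counter(training_phones)
        total = sum(counts.values()) + alpha * len(vocab)
        return {p: (counts[p] + alpha) / total for p in vocab}

    def cross_entropy(phones, model):
        """Average negative log2-probability of the sequence (bits per phone)."""
        return -sum(math.log2(model[p]) for p in phones) / len(phones)

    native_l1 = list("aiueokstn" * 20)                # pretend L1 phone stream
    english_sentences = {
        "easy": list("skatenoteast"),                 # phones mostly shared with L1
        "hard": list("thrvwzjqxrl"),                  # phones rare or absent in L1
    }
    vocab = set(native_l1) | {p for s in english_sentences.values() for p in s}
    model = unigram_model(native_l1, vocab)
    for name, phones in english_sentences.items():
        print(name, round(cross_entropy(phones, model), 2))   # "hard" scores higher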
Analysis and Utilization of MLLR Speaker
Adaptation Technique for Learners’ Pronunciation
Evaluation
Dean Luo 1 , Yu Qiao 1 , Nobuaki Minematsu 1 , Yutaka
Yamauchi 2 , Keikichi Hirose 1 ; 1 University of Tokyo,
Japan; 2 Tokyo International University, Japan
Mon-Ses3-P4-6, Time: 16:00
In this paper, we investigate the effects and problems of MLLR
speaker adaptation when applied to pronunciation evaluation.
Automatic scoring and error detection experiments are conducted
on two publicly available databases of Japanese learners’ English
pronunciation. As we expected, over-adaptation causes misjudgment of pronunciation accuracy. Following these experiments, two novel methods, Forced-aligned GOP scoring and Regularized-MLLR adaptation, are proposed to counter the adverse effects of MLLR adaptation. Experimental results show that the proposed methods can better utilize MLLR adaptation and avoid over-adaptation.
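For context, a minimal sketch of a conventional Goodness of Pronunciation (GOP) style score of the kind these methods build on: the forced-alignment log-likelihood of the intended phone is compared with the best competing phone over the same segment and normalised by duration. The per-phone scores below are fabricated stand-ins for real acoustic model likelihoods.

    # Sketch of a GOP-style score for one phone segment.
    def gop(target_phone, segment_logliks, duration_frames):
        """segment_logliks: dict phone -> total log-likelihood over the segment."""
        best_competitor = max(segment_logliks.values())
        return (segment_logliks[target_phone] - best_competitor) / duration_frames

    segment = {"ae": -120.0, "eh": -118.5, "aa": -131.0}      # summed log-likelihoods
    print(round(gop("ae", segment, duration_frames=12), 3))   # near 0 = good, very negative = poor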
Control of Human Generating Force by Use of
Acoustic Information — Study on Onomatopoeic
Utterances for Controlling Small Lifting-Force
Miki Iimura 1 , Taichi Sato 1 , Kihachiro Tanaka 2 ;
1 Tokyo Denki University, Japan; 2 Saitama University,
Japan
Mon-Ses3-P4-7, Time: 16:00
We have conducted basic experiments for applying acoustic information to engineering problems. We asked the subjects to
execute lifting actions while listening to sounds and measured the
resultant lifting-force.
We used human onomatopoeic utterances as the sounds presented to the subjects, aiming to make their lifting-force small. In particular, we focused on the “emotion” or “nuance” contained in humans’ utterances, which is a unique characteristic evoked by the utterances’ acoustical features. We found that the emotion or nuance can control the lifting-force effectively. We also clarified the acoustical features that are responsible for effective control of the lifting-force exerted by humans.
Mi-DJ: A Multi-Source Intelligent DJ Service
Ching-Hsien Lee, Hsu-Chih Wu; ITRI, Taiwan
Mon-Ses3-P4-8, Time: 16:00
In this paper, a Multi-Source Intelligent DJ (Mi-DJ) service is introduced. It is an audio program platform that integrates different
media types, including audio and text format content. It acts like a DJ who plays a personalized audio program to users whenever and wherever they need it. The audio program is automatically generated, comprising several audio clips; all of them come from either existing audio files or text information, such as e-mail, calendar, news or user-preferred articles. Our unique program generation technology makes users feel as if they are listening to a well-organized program instead of several separate audio files. The program can
be organized dynamically, which realizes context-aware service
based on location, user’s schedule, or other user preference.
With appropriate data management, text processing and speech
synthesis technologies, Mi-DJ can be applied to many application
scenarios. For example, it can be applied in language learning and
tour guide.
Human Voice or Prompt Generation? Can They
Co-Exist in an Application?
Géza Németh, Csaba Zainkó, Mátyás Bartalis, Gábor
Olaszy, Géza Kiss; BME, Hungary
Mon-Ses3-P4-9, Time: 16:00
This paper describes an R&D project regarding procedures for
the automatic maintenance of the interactive voice response (IVR)
system of a mobile telecom operator. The original plan was to
create a generic voice prompt generation system for the customer
service department. The challenge was to create a solution that is
hard to distinguish from the human speaker (i.e. passing a sort of
Turing-test) so its output can be freely mixed with original human
recordings. The domain of the solution at the first step had to be
narrowed down to the price lists of available mobile phones and
services. This is updated weekly, so the final operational system
generates about 3 hours of speech at each weekend. It operates
under human supervision but without intervention in the speech
generation process. It was tested both by academic procedures
and company customers and was accepted as fulfilling the original
requirements.
Automatic vs. Human Question Answering Over
Multimedia Meeting Recordings
Quoc Anh Le 1 , Andrei Popescu-Belis 2 ; 1 University of
Namur, Belgium; 2 IDIAP Research Institute,
Switzerland
Mon-Ses3-P4-10, Time: 16:00
Information access in meeting recordings can be assisted by
meeting browsers, or can be fully automated following a question-answering (QA) approach. An information access task is defined,
aiming at discriminating true vs. false parallel statements about
facts in meetings. An automatic QA algorithm is applied to
this task, using passage retrieval over a meeting transcript. The
algorithm scores 59% accuracy for passage retrieval, while random
guessing is below 1%, but only scores 60% on combined retrieval
and question discrimination, for which humans reach 70%–80% and
the baseline is 50%. The algorithm clearly outperforms humans
for speed, at less than 1 second per question, vs. 1.5–2 minutes
per question for humans. The degradation on ASR compared
to manual transcripts still yields lower but acceptable scores,
especially for passage identification. Automatic QA thus appears
to be a promising enhancement to meeting browsers used by
humans, as an assistant for relevant passage identification.
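A toy sketch of a passage-retrieval step of the kind described above, scoring sliding windows of transcript utterances by IDF-weighted term overlap with the question; the transcript, weighting scheme and window size are illustrative assumptions, not the system's actual configuration.

    # Sketch: retrieve the best window of consecutive utterances for a question.
    import math
    from collections import Counter

    transcript = [
        "we agreed to move the release date to june",
        "marketing will prepare the brochure next week",
        "the budget for the prototype was cut by ten percent",
        "john will contact the supplier about the new chips",
    ]

    def idf_weights(passages):
        df = Counter(w for p in passages for w in set(p.split()))
        n = len(passages)
        return {w: math.log((n + 1) / (df[w] + 0.5)) for w in df}

    def best_passage(question, passages, window=2):
        idf = idf_weights(passages)
        q = set(question.lower().split())
        best, best_score = None, -1.0
        for i in range(len(passages) - window + 1):
            text = " ".join(passages[i:i + window])
            score = sum(idf.get(w, 0.0) for w in q if w in text.split())
            if score > best_score:
                best, best_score = text, score
        return best, best_score

    print(best_passage("what happened to the budget for the prototype", transcript))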
Mon-Ses3-S1 : Special Session: Silent Speech
Interfaces
Ainsworth (East Wing 4), 16:00, Monday 7 Sept 2009
Chair: Bruce Denby, Université Pierre et Marie Curie, France and
Tanja Schultz, Carnegie Mellon University, USA
Characterizing Silent and Pseudo-Silent Speech
Using Radar-Like Sensors
John F. Holzrichter; Fannie and John Hertz Foundation,
USA
Mon-Ses3-S1-1, Time: 16:00
Radar-like sensors enable the measuring of speech articulator conditions, especially their shape changes and contact events both during silent and normal speech. Such information can be used to associate articulator conditions with digital “codes” for use in communications, machine control, speech masking or canceling, and other applications.
Technologies for Processing Body-Conducted
Speech Detected with Non-Audible Murmur
Microphone
Tomoki Toda, Keigo Nakamura, Takayuki Nagai,
Tomomi Kaino, Yoshitaka Nakajima, Kiyohiro Shikano;
NAIST, Japan
Mon-Ses3-S1-2, Time: 16:20
In this paper, we review our recent research on technologies for
processing body-conducted speech detected with Non-Audible
Murmur (NAM) microphone. NAM microphone enables us to
detect various types of body-conducted speech such as extremely
soft whisper, normal speech, and so on. Moreover, it is robust
against external noise due to its noise-proof structure. To make
speech communication more universal by effectively using these
properties of NAM microphone, we have so far developed two main
technologies: one is body-conducted speech conversion for human-to-human speech communication; and the other is body-conducted
speech recognition for man-machine speech communication. This
paper gives an overview of these technologies and presents our
new attempts to investigate the effectiveness of body-conducted
speech recognition.
Artificial Speech Synthesizer Control by
Brain-Computer Interface
Jonathan S. Brumberg 1 , Philip R. Kennedy 2 , Frank H.
Guenther 1 ; 1 Boston University, USA; 2 Neural Signals
Inc., USA
Mon-Ses3-S1-3, Time: 16:40
We developed and tested a brain-computer interface for control of
an artificial speech synthesizer by an individual with near complete
paralysis. This neural prosthesis for speech restoration is currently
capable of predicting vowel formant frequencies based on neural
activity recorded from an intracortical microelectrode implanted
in the left hemisphere speech motor cortex. Using instantaneous
auditory feedback (< 50 ms) of predicted formant frequencies,
the study participant has been able to correctly perform a vowel
production task at a maximum rate of 80–90% correct.
Visuo-Phonetic Decoding Using Multi-Stream and
Context-Dependent Models for an Ultrasound-Based
Silent Speech Interface
Thomas Hueber 1 , Elie-Laurent Benaroya 1 , Gérard
Chollet 2 , Bruce Denby 3 , Gérard Dreyfus 1 , Maureen
Stone 4 ; 1 LE-ESPCI, France; 2 LTCI, France; 3 Université
Pierre et Marie Curie, France; 4 University of Maryland
at Baltimore, USA
Mon-Ses3-S1-4, Time: 17:00
Recent improvements are presented for phonetic decoding of
continuous-speech from ultrasound and optical observations of
the tongue and lips in a silent speech interface application. In a
new approach to this critical step, the visual streams are modeled by context-dependent multi-stream Hidden Markov Models
(CD-MSHMM). Results are compared to a baseline system using
context-independent modeling and a visual feature fusion strategy,
with both systems evaluated on a one-hour, phonetically balanced
English speech database. Tongue and lip images are coded using
PCA-based feature extraction techniques. The uttered speech
signal, also recorded, is used to initialize the training of the
visual HMMs. Visual phonetic decoding performance is evaluated
successively with and without the help of linguistic constraints
introduced via a 2.5k-word decoding dictionary.
Disordered Speech Recognition Using Acoustic and
sEMG Signals
Yunbin Deng 1 , Rupal Patel 2 , James T. Heaton 3 , Glen
Colby 1 , L. Donald Gilmore 4 , Joao Cabrera 1 , Serge H.
Roy 4 , Carlo J. De Luca 4 , Geoffrey S. Meltzner 1 ; 1 BAE
Systems Inc., USA; 2 Northeastern University, USA;
3 Massachusetts General Hospital, USA; 4 Delsys Inc.,
USA
Mon-Ses3-S1-5, Time: 17:20
Parallel isolated word corpora were collected from healthy speakers
and individuals with speech impairment due to stroke or cerebral
palsy. Surface electromyographic (sEMG) signals were collected for
both vocalized and mouthed speech production modes. Pioneering
work on disordered speech recognition using the acoustic signal,
the sEMG signals, and their fusion are reported. Results indicate
that speaker-dependent isolated-word recognition from the sEMG signals of articulator muscle groups during vocalized disordered-speech production was highly effective. However, word recognition accuracy for mouthed speech was much lower, likely related to
the fact that some disordered speakers had considerable difficulty
producing consistent mouthed speech. Further development of
the sEMG-based speech recognition systems is needed to increase
usability and robustness.
Impact of Different Speaking Modes on EMG-Based
Speech Recognition
Michael Wand 1 , Szu-Chen Stan Jou 2 , Arthur R. Toth 1 ,
Tanja Schultz 1 ; 1 Universität Karlsruhe (TH), Germany;
2 Industrial Technology Research Institute, Taiwan
Mon-Ses3-S1-6, Time: 17:20
We present our recent results on speech recognition by surface
electromyography (EMG), which captures the electric potentials
that are generated by the human articulatory muscles. This technique can be used to enable Silent Speech Interfaces, since EMG
signals are generated even when people only articulate speech
without producing any sound. Preliminary experiments have
shown that the EMG signals created by audible and silent speech are quite distinct. In this paper we first compare various methods of initializing a silent speech EMG recognizer, showing that the performance of the recognizer substantially varies across different speakers. Based on this, we analyze EMG signals from audible and silent speech, present first results on how discrepancies between these speaking modes affect EMG recognizers, and suggest areas for future work.
Synthesizing Speech from Electromyography Using
Voice Transformation Techniques
Arthur R. Toth, Michael Wand, Tanja Schultz;
Universität Karlsruhe (TH), Germany
Mon-Ses3-S1-7, Time: 17:20
Surface electromyography (EMG) can be used to record the activation potentials of articulatory muscles while a person speaks. This technique could enable silent speech interfaces, as EMG signals are generated even when people pantomime speech without producing sound. Having effective silent speech interfaces would enable a number of compelling applications, allowing people to communicate in areas where they would not want to be overheard or where the background noise is so prevalent that they could not be heard. In order to use EMG signals in speech interfaces, however, there must be a relatively accurate method to map the signals to speech. Up to this point, it appears that most attempts to use EMG signals for speech interfaces have focused on Automatic Speech Recognition (ASR) based on features derived from EMG signals. Following the lead of other researchers who worked with Electro-Magnetic Articulograph (EMA) data and Non-Audible Murmur (NAM) speech, we explore the alternative idea of using Voice Transformation (VT) techniques to synthesize speech from EMG signals. With speech output, both ASR systems and human listeners can directly use EMG-based systems. We report the results of our preliminary studies, noting the difficulties we encountered and suggesting areas for future work.
Multimodal HMM-Based NAM-to-Speech Conversion
Viet-Anh Tran 1 , Gérard Bailly 1 , Hélène Lœvenbruck 1 ,
Tomoki Toda 2 ; 1 GIPSA, France; 2 NAIST, Japan
Mon-Ses3-S1-8, Time: 17:20
Although the segmental intelligibility of converted speech from silent speech using the direct signal-to-signal mapping proposed by Toda et al. [1] is quite acceptable, listeners sometimes have difficulty in chunking the speech continuum into meaningful words due to incomplete phonetic cues provided by the output signals. This paper studies another approach, consisting in combining HMM-based statistical speech recognition and synthesis techniques, as well as training on aligned corpora, to convert silent speech to audible voice. By introducing phonological constraints, such systems are expected to improve the phonetic consistency of the output signals. Facial movements are used in order to improve the performance of both the recognition and synthesis procedures. The results show that including these movements improves the recognition rate by 6.2%, and a final improvement of the spectral distortion by 2.7% is observed. The comparison between direct signal-to-signal and phonetic-based mappings is also discussed in this paper.
Tue-Ses1-O1 : ASR: Discriminative Training
Main Hall, 10:00, Tuesday 8 Sept 2009
Chair: Erik McDermott, NTT Corporation, Japan
On the Semi-Supervised Learning of Multi-Layered
Perceptrons
Jonathan Malkin, Amarnag Subramanya, Jeff Bilmes;
University of Washington, USA
Tue-Ses1-O1-1, Time: 10:00
We present a novel approach for training a multi-layered perceptron (MLP) in a semi-supervised fashion. Our objective function,
when optimized, balances training set accuracy with fidelity to a
graph-based manifold over all points. Additionally, the objective
favors smoothness via an entropy regularizer over classifier outputs as well as straightforward ℓ2 regularization. Our approach
also scales well enough to enable large-scale training. The results
demonstrate significant improvement on several phone classification tasks over baseline MLPs.
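A toy numpy sketch of an objective with the three ingredients described above (labelled cross-entropy, a graph-based smoothness term over all points, and an entropy term on the outputs); here the class posteriors are free variables rather than MLP outputs, the graph is random, and the weights and sign convention of the entropy term are modelling choices rather than the authors' exact formulation.

    # Toy objective: supervised cross-entropy + graph smoothness + entropy term.
    import numpy as np

    rng = np.random.default_rng(3)
    n, n_lab, k = 20, 5, 3
    P = rng.dirichlet(np.ones(k), size=n)            # predicted class posteriors
    labels = rng.integers(0, k, size=n_lab)          # labels for the first n_lab points
    W = (rng.random((n, n)) < 0.15).astype(float)    # random similarity graph
    W = np.triu(W, 1); W = W + W.T                   # symmetrise

    def objective(P, labels, W, lam=1.0, gamma=0.1):
        ce = -np.mean(np.log(P[np.arange(len(labels)), labels] + 1e-12))
        rows, cols = np.nonzero(W)
        smooth = np.mean(np.sum((P[rows] - P[cols]) ** 2, axis=1)) if rows.size else 0.0
        # Entropy term: some formulations reward high output entropy on unlabelled
        # points, others penalise it; the sign and weight are design choices.
        entropy = -np.mean(np.sum(P * np.log(P + 1e-12), axis=1))
        return ce + lam * smooth + gamma * entropy

    print(round(objective(P, labels, W), 3))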
Generalized Discriminative Feature Transformation for Speech Recognition
Roger Hsiao, Tanja Schultz; Carnegie Mellon University, USA
Tue-Ses1-O1-2, Time: 10:20
We propose a new algorithm called Generalized Discriminative Feature Transformation (GDFT) for acoustic models in speech recognition. GDFT is based on Lagrange relaxation on a transformed optimization problem. We show that the existing discriminative feature transformation methods like feature space MMI/MPE (fMMI/MPE), region dependent linear transformation (RDLT), and a non-discriminative feature transformation, constrained maximum likelihood linear regression (CMLLR), are special cases of GDFT. We evaluate the performance of GDFT for Iraqi large vocabulary continuous speech recognition.
Deterministic Annealing Based Training Algorithm for Bayesian Speech Recognition
Sayaka Shiota, Kei Hashimoto, Yoshihiko Nankaku, Keiichi Tokuda; Nagoya Institute of Technology, Japan
Tue-Ses1-O1-3, Time: 10:40
This paper proposes a deterministic annealing based training algorithm for Bayesian speech recognition. The Bayesian method is a statistical technique for estimating reliable predictive distributions by marginalizing model parameters. However, the local maxima problem in the Bayesian method is more serious than in the ML-based approach, because the Bayesian method treats not only state sequences but also model parameters as latent variables. The deterministic annealing EM (DAEM) algorithm has been proposed to improve the local maxima problem in the EM algorithm, and its effectiveness has been reported in HMM-based speech recognition using the ML criterion. In this paper, the DAEM algorithm is applied to Bayesian speech recognition to relax the local maxima problem. Speech recognition experiments show that the proposed method achieved a higher performance than the conventional methods.
Maximum Mutual Information Estimation via Second Order Cone Programming for Large Vocabulary Continuous Speech Recognition
Dalei Wu, Baojie Li, Hui Jiang; York University, Canada
Tue-Ses1-O1-4, Time: 11:00
In this paper, we have successfully extended our previous work on convex optimization methods to MMIE-based discriminative training for large vocabulary continuous speech recognition. Specifically, we have re-formulated MMIE training as a second order cone programming (SOCP) problem using some convex relaxation techniques that we have previously proposed. Moreover, the entire SOCP formulation has been developed for word graphs instead of N-best lists to handle large vocabulary tasks. The proposed method has been evaluated on the standard WSJ-5k task, and experimental results show that the proposed SOCP method significantly outperforms the conventional EBW method in terms of recognition accuracy as well as convergence behavior. Our experiments also show that the proposed SOCP method is efficient enough to handle some relatively large HMM sets normally used in large vocabulary tasks.
Hidden Conditional Random Field with Distribution Constraints for Phone Classification
Dong Yu, Li Deng, Alex Acero; Microsoft Research, USA
Tue-Ses1-O1-5, Time: 11:20
We advance the recently proposed hidden conditional random field (HCRF) model by replacing the moment constraints (MCs) with distribution constraints (DCs). We point out that the distribution constraints are the same as the traditional moment constraints for binary features, but are able to better regularize the probability distribution of continuous-valued features than the moment constraints. We show that under the distribution constraints the HCRF model is no longer log-linear but embeds the model parameters in non-linear functions. We provide an effective solution to the resulting, more difficult optimization problem by converting it to the traditional log-linear form in a higher-dimensional feature space using cubic splines. We demonstrate that a 20.8% classification error rate (CER) can be achieved on the TIMIT phone classification task using the HCRF-DC model. This result is superior to any published single-system result on this heavily evaluated task, including the HCRF-MC model, discriminatively trained HMMs, and large-margin HMMs using the same features.
A Fast Online Algorithm for Large Margin Training of Continuous Density Hidden Markov Models
Chih-Chieh Cheng 1, Fei Sha 2, Lawrence K. Saul 1; 1 University of California at San Diego, USA; 2 University of Southern California, USA
Tue-Ses1-O1-6, Time: 11:40
We propose an online learning algorithm for large margin training of continuous density hidden Markov models. The online algorithm updates the model parameters incrementally after the decoding of each training utterance. For large margin training, the algorithm attempts to separate the log-likelihoods of correct and incorrect transcriptions by an amount proportional to their Hamming distance. We evaluate this approach to hidden Markov modeling on the TIMIT speech database. We find that the algorithm yields significantly lower phone error rates than other approaches, both online and batch, that do not attempt to enforce a large margin. We also find that the algorithm converges much more quickly than analogous batch optimizations for large margin training.
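The online large-margin update in the last abstract can be illustrated with a simplified sketch in which model scores are linear in a parameter vector; the actual method operates on the log-likelihoods of continuous density HMMs, so the feature vectors, the margin scale rho and the learning rate below are assumptions made for illustration only:

    import numpy as np

    def hamming(ref, hyp):
        # Label-level Hamming distance between two equal-length sequences
        # (a simplification; real systems align reference and hypothesis first).
        return sum(r != h for r, h in zip(ref, hyp))

    def online_large_margin_step(theta, phi_ref, phi_hyp, ref, hyp,
                                 rho=1.0, lr=0.1):
        # theta: parameter vector; phi_ref / phi_hyp: feature vectors of the
        # reference and the decoded competitor, so that a score is theta @ phi.
        margin = rho * hamming(ref, hyp)                # required separation
        violation = margin - (theta @ phi_ref - theta @ phi_hyp)
        if violation > 0:
            # Move toward the reference and away from the competitor.
            theta = theta + lr * (phi_ref - phi_hyp)
        return theta

After decoding each training utterance, one such update would be applied before moving on to the next utterance, which is what makes the procedure online rather than batch.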
Tue-Ses1-O2 : Language Acquisition
Jones (East Wing 1), 10:00, Tuesday 8 Sept 2009
Chair: Maria Uther, Brunel University, UK
KLAIR: A Virtual Infant for Spoken Language
Acquisition Research
Connecting Rhythm and Prominence in Automatic
ESL Pronunciation Scoring
Mark Huckvale 1 , Ian S. Howard 2 , Sascha Fagel 3 ;
1 University College London, UK; 2 University of
Cambridge, UK; 3 Technische Universität Berlin,
Germany
Emily Nava, Joseph Tepperman, Louis Goldstein,
Maria Luisa Zubizarreta, Shrikanth S. Narayanan;
University of Southern California, USA
Tue-Ses1-O2-1, Time: 10:00
Tue-Ses1-O2-4, Time: 11:00
Past studies have shown that a native Spanish speaker’s use of
phrasal prominence is a good indicator of her level of English
prosody acquisition. Because of the cross-linguistic differences in
the organization of phrasal prominence and durational contrasts,
we hypothesize that those speakers with English-like prominence
in their L2 speech are also expected to have acquired English-like
rhythm. Statistics from a corpus of native and nonnative English
confirm that speakers with an English-like phrasal prominence
are also the ones who use English-like rhythm. Additionally, two
methods of automatic score generation based on vowel duration
times demonstrate a correlation of at least 0.6 between these
automatic scores and subjective scores for phrasal prominence.
These findings suggest that simple vowel duration measures
obtained from standard automatic speech recognition methods
can be salient cues for estimating subjective scores of prosodic
acquisition, and of pronunciation in general.
Recent research into the acquisition of spoken language has
stressed the importance of learning through embodied linguistic
interaction with caregivers rather than through passive observation. However, the necessity of interaction makes experimental
work into the simulation of infant speech acquisition difficult
because of the technical complexity of building real-time embodied
systems. In this paper we present KLAIR: a software toolkit for
building simulations of spoken language acquisition through
interactions with a virtual infant. The main part of KLAIR is
a sensori-motor server that supplies a client machine learning
application with a virtual infant on screen that can see, hear and
speak. By encapsulating the real-time complexities of audio and
video processing within a server that will run on a modern PC, we
hope that KLAIR will encourage and facilitate more experimental
research into spoken language acquisition through interaction.
Evaluating Parameters for Mapping Adult Vowels to
Imitative Babbling
An Articulatory Analysis of Phonological Transfer
Using Real-Time MRI
Joseph Tepperman, Erik Bresch, Yoon-Chul Kim,
Sungbok Lee, Louis Goldstein, Shrikanth S. Narayanan;
University of Southern California, USA
Ilana Heintz 1 , Mary Beckman 1 , Eric Fosler-Lussier 1 ,
Lucie Ménard 2 ; 1 Ohio State University, USA;
2 Université du Québec à Montréal, Canada
Tue-Ses1-O2-5, Time: 11:20
Tue-Ses1-O2-2, Time: 10:20
We design a neural network model of first language acquisition to
explore the relationship between child and adult speech sounds.
The model learns simple vowel categories using a produce-and-perceive babbling algorithm in addition to listening to ambient
speech. The model is similar to that of Westermann & Miranda
(2004), but adds a dynamic aspect in that it adapts in both the
articulatory and acoustic domains to changes in the child’s speech
patterns. The training data is designed to replicate infant speech
sounds and articulatory configurations. By exploring a range of
articulatory and acoustic dimensions, we see how the child might
learn to draw correspondences between his or her own speech and
that of a caretaker, whose productions are quite different from the
child’s. We also design an imitation evaluation paradigm that gives
insight into the strengths and weaknesses of the model.
Intonation of Japanese Sentences Spoken by English
Speakers
Chiharu Tsurutani; Griffith University, Australia
Tue-Ses1-O2-3, Time: 10:40
This study investigated intonation of Japanese sentences spoken
by Australian English speakers and the influence of their first
language (L1) prosody on their intonation of Japanese sentences.
The second language (L2) intonation is a complicated product of
the L1 transfer at two levels of prosodic hierarchy: at word level
and at phrase levels. L2 speech is hypothesized to retain the
characteristics of L1, and to gain marked features of the target
language only during the late stage of acquisition. Investigation of this hypothesis involved acoustic measurement of L2 speakers’ intonation contours, and comparison of these contours with those of native speakers.
Phonological transfer is the influence of a first language on phonological variations made when speaking a second language. With
automatic pronunciation assessment applications in mind, this
study intends to uncover evidence of phonological transfer in
terms of articulation. Real-time MRI videos from three German
speakers of English and three native English speakers are compared to uncover the influence of German consonants on close
English consonants not found in German. Results show that
nonnative speakers demonstrate the effects of L1 transfer through
the absence of articulatory contrasts seen in native speakers, while
still maintaining minimal articulatory contrasts that are necessary
for automatic detection of pronunciation errors, encouraging the
further use of articulatory models for speech error characterization
and detection.
Do Multiple Caregivers Speed up Language
Acquisition?
L. ten Bosch 1 , Okko Johannes Räsänen 2 , Joris
Driesen 3 , Guillaume Aimetti 4 , Toomas Altosaar 2 , Lou
Boves 1 , A. Corns 1 ; 1 Radboud Universiteit Nijmegen,
The Netherlands; 2 Helsinki University of Technology,
Finland; 3 Katholieke Universiteit Leuven, Belgium;
4 University of Sheffield, UK
Tue-Ses1-O2-6, Time: 11:40
In this paper we compare three different implementations of
language learning to investigate the issue of speaker-dependent
initial representations and subsequent generalization.
These
implementations are used in a comprehensive model of language
acquisition under development in the FP6 FET project ACORNS.
All algorithms are embedded in a cognitively and ecologically
plausible framework, and perform the task of detecting word-like
units without any lexical, phonetic, or phonological information.
The results show that the computational approaches differ with
respect to the extent they deal with unseen speakers, and how
generalization depends on the variation observed during training.
Tue-Ses1-O3 : ASR: Lexical and Prosodic Models
Fallside (East Wing 2), 10:00, Tuesday 8 Sept 2009
Chair: Eric Fosler-Lussier, Ohio State University, USA
A Sequential Minimization Algorithm for Finite-State Pronunciation Lexicon Models
Simon Dobrišek, Boštjan Vesnicer, France Mihelič; University of Ljubljana, Slovenia
Tue-Ses1-O3-4, Time: 11:00
Grapheme to Phoneme Conversion Using an SMT
System
Antoine Laurent, Paul Deléglise, Sylvain Meignier;
LIUM, France
Tue-Ses1-O3-1, Time: 10:00
This paper presents an automatic grapheme to phoneme conversion system that uses statistical machine translation techniques
provided by the Moses Toolkit. The generated word pronunciations
are employed in the dictionary of an automatic speech recognition
system and evaluated using the ESTER 2 French broadcast news
corpus. Grapheme to phoneme conversion based on Moses is compared to two other methods: G2P, and a dictionary look-up method
supplemented by a rule-based tool for phonetic transcriptions of
words unavailable in the dictionary. Moses gives better results
than G2P, and has performance comparable to the dictionary
look-up strategy.
Lexical and Phonetic Modeling for Arabic Automatic
Speech Recognition
The paper first presents a large-vocabulary automatic speech-recognition system that is being developed for the Slovenian
language. The concept of a single-pass token-passing algorithm
for the fast speech decoding that can be used with the designed
multi-level system structure is discussed. From the algorithmic
point of view, the main component of the system is a finite-state
pronunciation lexicon model. This component has crucial impact
on the overall performance of the system and we developed a
sequential minimization algorithm that very efficiently reduces the
size and algorithmic complexity of the lexicon model. Our finite-state lexicon model is represented as a state-emitting finite-state
transducer. The presented experiments show that the sequential minimization algorithm easily outperforms (up to 60%) the
conventional algorithms that were developed for the static global
optimization of the transition-emitting finite-state transducers.
These algorithms are delivered as part of the AT&T FSM library and
the OpenFST library.
A General-Purpose 32 ms Prosodic Vector for
Hidden Markov Modeling
Kornel Laskowski 1 , Mattias Heldner 2 , Jens Edlund 2 ;
1 Carnegie Mellon University, USA; 2 KTH, Sweden
Tue-Ses1-O3-5, Time: 11:20
Long Nguyen 1 , Tim Ng 1 , Kham Nguyen 2 , Rabih
Zbib 3 , John Makhoul 1 ; 1 BBN Technologies, USA;
2 Northeastern University, USA; 3 MIT, USA
Tue-Ses1-O3-2, Time: 10:20
In this paper, we describe the use of either words or morphemes
as lexical modeling units and the use of either graphemes or
phonemes as phonetic modeling units for Arabic automatic speech
recognition (ASR). We designed four Arabic ASR systems: two word-based systems and two morpheme-based systems. Experimental
results using these four systems show that they have comparable
state-of-the-art performance individually, but the more sophisticated morpheme-based system tends to be the best. However,
they seem to complement each other quite well within the ROVER
system combination framework to produce substantially-improved
combined results.
Assessing Context and Learning for isiZulu Tone
Recognition
Prosody plays a central role in conversation, making it important
for speech technologies to model. Unfortunately, the application
of standard modeling techniques to the acoustics of prosody has
been hindered by difficulties in modeling intonation. In this work,
we explore the suitability of the recently introduced fundamental
frequency variation (FFV) spectrum as a candidate general representation of tone. Experiments on 4 tasks demonstrate that
FFV features are complementary to other acoustic measures of
prosody and that hidden Markov models offer a suitable modeling
paradigm. Proposed improvements yield a 35% relative decrease in
error on unseen data and simultaneously reduce time complexity
by a factor of five. The resulting representation is sufficiently
mature for general deployment in a broad range of automatic
speech processing applications.
Vocabulary Expansion Through Automatic
Abbreviation Generation for Chinese Voice Search
Dong Yang, Yi-cheng Pan, Sadaoki Furui; Tokyo
Institute of Technology, Japan
Gina-Anne Levow; University of Chicago, USA
Tue-Ses1-O3-6, Time: 11:40
Tue-Ses1-O3-3, Time: 10:40
Prosody plays an integral role in spoken language understanding.
In isiZulu, a Nguni family language with lexical tone, prosodic
information determines word meaning. We assess the impact
of models of tone and coarticulation for tone recognition. We
demonstrate the importance of modeling prosodic context to
improve tone recognition. We employ this less commonly studied
language to assess models of tone developed for English and
Mandarin, finding common threads in coarticulatory modeling. We also demonstrate the effectiveness of semi-supervised and unsupervised tone recognition techniques for this less-resourced language, with weakly supervised approaches rivaling supervised techniques.
Long named entities are often abbreviated in spoken Chinese, and this usually leads to out-of-vocabulary (OOV) problems
in speech recognition applications. The generation of Chinese
abbreviations is much more complex than English abbreviations,
most of which are acronyms and truncations. In this paper, we
propose a new method for automatically generating abbreviations
for Chinese named entities and we perform vocabulary expansion
using output of the abbreviation model for voice search. In our
abbreviation modeling, we convert the abbreviation generation
problem into a tagging problem and use the conditional random
field (CRF) as the tagging tool. In the vocabulary expansion, considering the multiple abbreviation problem and limited coverage
of top-1 abbreviation candidate, we add top-10 candidates into the
vocabulary. In our experiments, for the abbreviation modeling, we
achieved the top-10 coverage of 88.3% by the proposed method;
for the voice search, we improved the voice search accuracy from
16.9% to 79.2% by incorporating the top-10 abbreviation candidates
into the vocabulary.
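To make the tagging formulation concrete, the hedged sketch below enumerates keep/drop labellings of a named entity and adds the top-ranked abbreviations to a vocabulary. The scoring function is only a stand-in for the trained CRF described above, and the example entry is purely illustrative:

    from itertools import product

    def candidate_abbreviations(name, score_fn, top_n=10):
        # Each character carries a keep (1) / drop (0) tag; candidates are
        # ranked by a labelling score (a trained CRF would supply this).
        cands = {}
        for tags in product([0, 1], repeat=len(name)):
            if not any(tags):
                continue
            abbr = "".join(c for c, t in zip(name, tags) if t)
            s = score_fn(name, tags)
            if abbr not in cands or s > cands[abbr]:
                cands[abbr] = s
        return sorted(cands, key=cands.get, reverse=True)[:top_n]

    def toy_score(name, tags):
        # Stand-in: prefer short abbreviations that keep the first character
        # (a real model learns such preferences from data).
        return tags[0] * 2.0 - 0.5 * sum(tags)

    # Vocabulary expansion with the top-10 candidates of each entry.
    vocab = {"北京大学"}
    for entry in list(vocab):
        vocab.update(candidate_abbreviations(entry, toy_score))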
Tue-Ses1-O4 : Unit-Selection Synthesis
Holmes (East Wing 3), 10:00, Tuesday 8 Sept 2009
Chair: Alan Black, Carnegie Mellon University, USA
Perceptual Cost Function for Cross-Fading Based
Concatenation
Qi Miao, Alexander Kain, Jan P.H. van Santen; Oregon
Health & Science University, USA
A Novel Approach to Cost Weighting in Unit
Selection TTS
Jerome R. Bellegarda; Apple Inc., USA
Tue-Ses1-O4-4, Time: 11:00
Unit selection text-to-speech synthesis relies on multiple cost
criteria, each encapsulating a different aspect of acoustic and
prosodic context at any given concatenation point. For a particular
set of criteria, the relative weighting of the resulting costs crucially
affects final candidate ranking. Their influence is typically determined in an empirical manner (e.g., based on a limited amount of
synthesized data), yielding global weights that are thus applied
to all concatenations indiscriminately. This paper proposes an
alternative approach, based on a data-driven framework separately
optimized for each concatenation. The cost distribution in every
information stream is dynamically leveraged to locally shift weight
towards those characteristics that prove most discriminative at
this point. An illustrative case study underscores the potential
benefits of this solution.
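For readers unfamiliar with unit selection, the sketch below shows the conventional setup such work starts from: a dynamic-programming search over candidate units using a weighted sum of target and concatenation costs with global weights w_t and w_c. The cost functions and the candidate structure are assumptions; the point of the abstract above is precisely that such global weights can be replaced by weights optimized locally at each concatenation:

    import numpy as np

    def select_units(candidates, target_cost, concat_cost, w_t=1.0, w_c=1.0):
        # candidates: list over target positions, each a list of candidate units.
        n = len(candidates)
        best = [[w_t * target_cost(0, u) for u in candidates[0]]]
        back = [[None] * len(candidates[0])]
        for i in range(1, n):
            row, ptr = [], []
            for u in candidates[i]:
                costs = [best[i - 1][k] + w_c * concat_cost(v, u)
                         for k, v in enumerate(candidates[i - 1])]
                k_best = int(np.argmin(costs))
                row.append(costs[k_best] + w_t * target_cost(i, u))
                ptr.append(k_best)
            best.append(row)
            back.append(ptr)
        # Trace back the lowest-cost unit sequence.
        j = int(np.argmin(best[-1]))
        path = [j]
        for i in range(n - 1, 0, -1):
            j = back[i][j]
            path.append(j)
        path.reverse()
        return [candidates[i][j] for i, j in enumerate(path)]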
Maximum Likelihood Unit Selection for
Corpus-Based Speech Synthesis
Tue-Ses1-O4-1, Time: 10:00
In earlier research, we applied a linear weighted cross-fading
function to ensure smooth concatenation. However, this can
cause unnaturally shaped spectral trajectories.
We propose
context-sensitive cross-fading. To train this system, a perceptually
validated cost function is needed, which is the focus of this
paper. A corpus was designed to generate a variety of formant
trajectory shapes. A perceptual experiment was performed and a
multiple linear regression model was applied to predict perceptual
quality ratings from various distances between cross-faded and
natural trajectories. Results show that perceptual quality could be
predicted well from the proposed distance measures.
Exploring Automatic Similarity Measures for Unit
Selection Tuning
Daniel Tihelka 1 , Jan Romportl 2 ; 1 University of West
Bohemia in Pilsen, Czech Republic; 2 SpeechTech s.r.o.,
Czech Republic
Tue-Ses1-O4-2, Time: 10:20
The present paper focuses on the current handling of target
features in the unit selection approach basically requiring huge
corpora. In the paper there are outlined possible solutions based
on measuring (dis)similarity among prosodic patterns. As the start
of research, the feasibility of (dis)similarity estimation is examined
on several intuitively chosen measures of acoustic signal which
are correlated to perceived similarity obtained from a large-scale
listening test.
Towards Intonation Control in Unit Selection Speech
Synthesis
Cédric Boidin 1 , Olivier Boeffard 2 , Thierry Moudenc 1 ,
Géraldine Damnati 1 ; 1 Orange Labs, France; 2 IRISA,
France
Abubeker Gamboa Rosales 1 , Hamurabi
Gamboa Rosales 2 , Ruediger Hoffmann 2 ; 1 University of
Guanajuato, Mexico; 2 Technische Universität Dresden,
Germany
Tue-Ses1-O4-5, Time: 11:20
Corpus-based speech synthesis systems deliver a considerable
synthesis quality since the unit selection approaches have been
optimized in the last decade. Unit selection attempts to find the
best combination of speech unit sequences in an inventory so
that the perceptual differences between expected (natural) and
synthesized signals are as low as possible. However, mismatches
and distortions are still possible in concatenative speech synthesis
and they are normally perceptible in the synthesized waveform.
Therefore, unit selection strategies and parameter tuning are still
important issues in the improvement of speech synthesis. We
present a novel concept to increase the efficiency of the exhaustive
speech unit search within the inventory via a unit selection model.
This model bases its operation on a mapping analysis of the
concatenation sub-costs, a Bayes optimal classification (BOC), and
a maximum likelihood selection (MLS). The principal advantage
of the proposed unit selection method is that it does not require
exhaustive training to set up weighted coefficients for target
and concatenation sub-costs. It provides an alternative for unit
selection but requires further optimization, e.g. by integrating
target cost mapping.
A Close Look into the Probabilistic Concatenation
Model for Corpus-Based Speech Synthesis
Shinsuke Sakai, Ranniery Maia, Hisashi Kawai, Satoshi
Nakamura; NICT, Japan
Tue-Ses1-O4-6, Time: 11:40
Tue-Ses1-O4-3, Time: 10:40
We propose to control intonation in unit selection speech synthesis
with a mixed CART-HMM intonation model. The Finite State
Machine (FSM) formulation is suited to incorporate the intonation
model in the unit selection framework because it allows for combination of models with different unit types and handling competing
intonative variants. Subjective experiments have been carried
out to compare segmental and joint-prosodic-and-segmental unit
selection.
We have proposed a novel probabilistic approach to concatenation
modeling for corpus-based speech synthesis, where the goodness
of concatenation for a unit is modeled using a conditional Gaussian
probability density whose mean is defined as a linear transform
of the feature vector from the previous unit. This approach has
shown its effectiveness through a subjective listening test. In this
paper, we further investigate the characteristics of the proposed
method by an objective evaluation and by observing the sequence
of concatenation scores across an utterance. We also present the
mathematical relationships of the proposed method with other
approaches and show that it has a flexible modeling power, having
other approaches to concatenation scoring methods as special
cases.
Characteristics of Two-Dimensional Finite
Difference Techniques for Vocal Tract Analysis and
Voice Synthesis
Matt Speed, Damian Murphy, David M. Howard;
University of York, UK
Tue-Ses1-P1 : Human Speech Production II
Hewison Hall, 10:00, Tuesday 8 Sept 2009
Chair: Martin Cooke, Ikerbasque, Spain
Tue-Ses1-P1-4, Time: 10:00
Simple Physical Models of the Vocal Tract for
Education in Speech Science
Takayuki Arai; Sophia University, Japan
Tue-Ses1-P1-1, Time: 10:00
In the speech-related field, physical models of the vocal tract are
effective tools for education in acoustics. Arai’s cylinder-type models are based on Chiba and Kajiyama’s measurement of vocal-tract
shapes. The models quickly and effectively demonstrate vowel
production. In this study, we developed physical models with
simplified shapes as educational tools to illustrate how vocal-tract
shape accounts for differences among vowels. As a result, the
five Japanese vowels were produced by tube-connected models,
where several uniform tubes with different cross-sectional areas
and lengths are connected as Fant’s and Arai’s three-tube models.
Both digital waveguide and finite difference techniques are numerical methods that have been demonstrated as appropriate for
acoustic modelling applications. Whilst the application of the digital waveguide mesh to vocal tract modelling has been the subject
of previous work, the application of comparable finite difference
techniques is as yet untested. This study explores the characteristics of such a finite-difference approach to two-dimensional
vocal tract modelling. Initial results suggest that finite difference
techniques alone are not ideal, due to the limitation of non-dynamic
behaviour and poor representation of admittance discontinuities
in the approximation of three-dimensional geometries. They do
however introduce robust boundary formulations, and have a
valid and useful application in modelling non-vital static volumes,
particularly the nasal tract.
Adaptation of a Predictive Model of Tongue Shapes
Chao Qin, Miguel Á. Carreira-Perpiñán; University of
California at Merced, USA
Auto-Meshing Algorithm for Acoustic Analysis of
Vocal Tract
Tue-Ses1-P1-5, Time: 10:00
Kyohei Hayashi, Nobuhiro Miki; Future University
Hakodate, Japan
Tue-Ses1-P1-2, Time: 10:00
We propose a new method for an auto-meshing algorithm for an
acoustic analysis of the vocal tract using the Finite Element Method
(FEM). In our algorithm, the domain of the 3-dimensional figure of
the vocal tract is decomposed into two domains: one is a surface
domain and the other is an inner domain, in order to employ
the overlapping domain decomposition method. The meshing of
surface blocks can be realized with smooth surfaces using a NURBS
interpolation. We show the example of the meshes for the vocal
tract figure of Japanese vowel /a/, and the trial result of the FEM
simulation.
It is possible to recover the full midsagittal contour of the tongue
with submillimetric accuracy from the location of just 3–4 landmarks on it. This involves fitting a predictive mapping from the
landmarks to the contour using a training set consisting of contours extracted from ultrasound recordings. However, extracting
sufficient contours is a slow and costly process. Here, we consider
adapting a predictive mapping obtained for one condition (such as
a given recording session, recording modality, speaker or speaking
style) to a new condition, given only a few new contours and no
correspondences. We propose an extremely fast method based
on estimating a 2D-wise linear alignment mapping, and show
it recovers very accurate predictive models from about 10 new
contours.
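A minimal sketch of the kind of predictive mapping referred to above, fitting a ridge-regularized linear map from a few landmark coordinates to the full contour; the data shapes, the regularization constant and the random stand-in data are assumptions:

    import numpy as np

    def fit_landmark_to_contour(L, C, lam=1e-3):
        # L: (n_train, 2k) landmark coordinates; C: (n_train, 2m) contour points.
        # Ridge-regularized least squares for C ~ [L, 1] @ A.
        X = np.hstack([L, np.ones((L.shape[0], 1))])
        return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ C)

    def predict_contour(A, landmarks):
        return np.hstack([landmarks, [1.0]]) @ A

    # Toy usage with random data standing in for extracted contours.
    rng = np.random.default_rng(0)
    L = rng.normal(size=(50, 8))                      # 4 landmarks per contour
    C = L @ rng.normal(size=(8, 60)) + 0.01 * rng.normal(size=(50, 60))
    A = fit_landmark_to_contour(L, C)
    contour = predict_contour(A, L[0])                # predicted contour points

Adapting such a mapping to a new recording condition, as the abstract describes, would then amount to estimating a low-dimensional alignment between the new landmarks and the old ones rather than refitting A from scratch.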
Using Sensor Orientation Information for
Computational Head Stabilisation in 3D
Electromagnetic Articulography (EMA)
Voice Production Model Employing an Interactive
Boundary-Layer Analysis of Glottal Flow
Tokihiko Kaburagi, Katsunori Daimo, Shogo
Nakamura; Kyushu University, Japan
Christian Kroos; University of Western Sydney,
Australia
Tue-Ses1-P1-3, Time: 10:00
Tue-Ses1-P1-6, Time: 10:00
A voice production model has been studied by considering essential aerodynamic and acoustic phenomena in human phonation.
Acoustic voice sources are produced by the temporal change of
volume flow passing through the glottis. A precise flow analysis is
therefore performed based on the boundary-layer approximation
and the viscous-inviscid interaction between the boundary layer
and core flow. This flow analysis can supply information on
the separation point of the glottal flow and the thickness of the
boundary layer, which strongly depend on the glottal configuration,
and yield an effective prediction of the flow behavior. When the
flow analysis is combined with a mechanical model of the vocal
fold, the resulting acoustic wave travels through the vocal tract
and a pressure change develops in the vicinity of the glottis. This
change can affect the glottal flow and the motion of the vocal folds,
causing source-filter interaction. Preliminary simulations were
conducted by changing the relationship between the fundamental
and formant frequencies and their results were reported.
We propose a new simple algorithm to make use of the sensor
orientation information in 3D Electromagnetic Articulography
(EMA) for computational head stabilisation. The algorithm also
provides a well-defined procedure in the case where only two
sensors are available for head motion tracking and allows for the
combining of position coordinates and orientation angles for head
stabilisation with an equal weighting of each kind of information.
An evaluation showed that the method using the orientation angles
produced the most reliable results.
Collision Threshold Pressure Before and After Vocal
Loading
Laura Enflo 1 , Johan Sundberg 1 , Friedemann Pabst 2 ;
1 KTH, Sweden; 2 Hospital Dresden Friedrichstadt,
Germany
Tue-Ses1-P1-7, Time: 10:00
The phonation threshold pressure (PTP) has been found to increase
during vocal fatigue. In the present study we compare PTP and
collision threshold pressure (CTP) before and after vocal loading
in singer and non-singer voices. Seven subjects repeated the vowel
sequence /a,e,i,o,u/ at an SPL of at least 80 dB @ 0.3 m for 20 min.
Before and after this loading the subjects’ voices were recorded
while they produced a diminuendo repeating the syllable /pa/.
Oral pressure during the /p/ occlusion was used as a measure of
subglottal pressure. Both CTP and PTP increased significantly after
the vocal loading.
Variability and Stability in Collaborative Dialogues:
Turn-Taking and Filled Pauses
Štefan Beňuš; Constantine the Philosopher University in
Nitra, Slovak Republic
Tue-Ses1-P1-11, Time: 10:00
Gender Differences in the Realization of
Vowel-Initial Glottalization
Elke Philburn; University of Manchester, UK
Tue-Ses1-P1-8, Time: 10:00
The aim of the study was to investigate gender-dependent differences in the realization of German glottalized vowel onsets.
Laryngographic data of semi-spontaneous speech were collected
from four male and four female speakers of Standard German.
Measurements of relative vocal fold contact duration were carried
out including glottalized vowel onsets as well as non-glottalized
controls. The results show that female subjects realized the
glottalized vowel onsets with greater maximum vocal fold contact
duration than male subjects and that the glottalized vowel onsets
produced by females were more clearly distinguished from the
non-glottalized controls.
Stability and Composition of Functional Synergies
for Speech Movements in Children and Adults
Filled pauses have important and varied functions in turn-taking
behavior, and better understanding of this relationship opens
new ways for improving the quality and naturalness of dialogue
systems. We use a corpus of collaborative task oriented dialogues
to provide new insights into the relationship between filled pauses
and turn-taking based on temporal and acoustic features. We
then explore which of these patterns are stable and robust across
speakers, which are prone to entrainment based on conversational
partners, and which are variable and noisy. Our findings suggest
that intensity is the least stable feature followed by pitch-related
features, and temporal features relating filled pauses to chunking
and turn-taking are the most stable.
Speaking in the Presence of a Competing Talker
Youyi Lu 1 , Martin Cooke 2 ; 1 University of Sheffield, UK;
2 Ikerbasque, Spain
Tue-Ses1-P1-12, Time: 10:00
Hayo Terband 1 , Frits van Brenk 2 , Pascal
van Lieshout 3 , Lian Nijland 1 , Ben Maassen 1 ;
1 Radboud University Nijmegen Medical Centre, The
Netherlands; 2 University of Strathclyde, UK;
3 University of Toronto, Canada
Tue-Ses1-P1-9, Time: 10:00
The consistency and composition of functional synergies for
speech movements were investigated in 7 year-old children and
adults in a reiterated speech task using electromagnetic articulography (EMA). Results showed higher variability in children for
tongue tip and jaw, but not for lower lip movement trajectories.
Furthermore, the relative contribution to the oral closure of lower
lip was smaller in children compared to adults, whereas in this
respect no difference was found for tongue tip. These results
support and extend findings of non-linearity in speech motor
development and illustrate the importance of a multi-measures
approach in studying speech motor development.
An Analysis of Speech Rate Strategies in Aging
Frits van Brenk 1 , Hayo Terband 2 , Pascal
van Lieshout 3 , Anja Lowit 1 , Ben Maassen 2 ; 1 University
of Strathclyde, UK; 2 Radboud University Nijmegen
Medical Centre, The Netherlands; 3 University of
Toronto, Canada
Tue-Ses1-P1-10, Time: 10:00
Effects of age and speech rate on movement cycle duration were
assessed using electromagnetic articulography. In a repetitive task
syllables were articulated at eight rates, obtained by metronome
and self-pacing. Results indicate that increased speech rate is
associated with increasing movement cycle duration stability,
while decreased rate leads to a decrease in uniformity of cycle
duration, supporting the view that alterations in speech rate are associated with different motor control strategies involving durational manipulations. The relative contribution of closing movement durations increases with decreasing speech rate, and is a more dominant strategy for elderly speakers.
How do speakers cope with a competing talker? This study
investigated the possibility that speakers are able to retime their
contributions to take advantages of temporal fluctuations in the
background, reducing any adverse effects for an interlocutor.
Speech was produced in quiet, competing talker, modulated noise
and stationary backgrounds, with and without a communicative
task. An analysis of the timing of contributions relative to the
background indicated a significantly reduced chance of overlapping for the modulated noise backgrounds relative to quiet, with
competing speech resulting in the least overlap. Strong evidence
for an active overlap avoidance strategy is presented.
Tue-Ses1-P2 : Speech Perception II
Hewison Hall, 10:00, Tuesday 8 Sept 2009
Chair: Odette Scharenborg, Radboud Universiteit Nijmegen, The
Netherlands
Effect of R-Resonance Information on Intelligibility
Antje Heinrich, Sarah Hawkins; University of
Cambridge, UK
Tue-Ses1-P2-1, Time: 10:00
We investigated the importance of phonetic information in preceding syllables for the intelligibility of minimal-pair words containing
/r/ or /l/. Target words were cross-spliced into a different token
of the same sentence (match) or into a sentence that was identical
but originally contained the paired word (mismatch). Young and
old adults heard the sentences, casually or carefully spoken, in
cafeteria or 12-talker babble. Matched phonetic information in
the syllable immediately before the target segment, and in earlier
syllables, facilitated intelligibility of r- but not l-words. Despite
hearing loss, older adults also used this phonetic information.
Perception of Temporal Cues at Discourse
Boundaries
Are Real Tongue Movements Easier to Speech Read
than Synthesized?
Hsin-Yi Lin, Janice Fon; National Taiwan University,
Taiwan
Olov Engwall, Preben Wik; KTH, Sweden
Tue-Ses1-P2-2, Time: 10:00
Speech perception studies with augmented reality displays in
talking heads have shown that tongue reading abilities are weak
initially, but that subjects become able to extract some information
from intra-oral visualizations after a short training session. In this
study, we investigate how the nature of the tongue movements
influences the results, by comparing synthetic rule-based and
actual, measured movements. The subjects were significantly
better at perceiving sentences accompanied by real movements,
indicating that the current coarticulation model developed for
facial movements is not optimal for the tongue.
Tue-Ses1-P2-6, Time: 10:00
This study investigates the role of temporal cues in the perception
at discourse boundaries. Target cues were penult lengthening,
final lengthening, and pause duration. Results showed that different cues are weighted differently for different purposes. Final
lengthening is more important for subjects to detect boundaries,
while pause duration is more responsible for cuing the boundary
sizes.
Human Audio-Visual Consonant Recognition
Analyzed with Three Bimodal Integration Models
Eliciting a Hierarchical Structure of Human
Consonant Perception Task Errors Using Formal
Concept Analysis
Zhanyu Ma, Arne Leijon; KTH, Sweden
Tue-Ses1-P2-3, Time: 10:00
With A-V recordings, ten normal hearing people took recognition
tests at different signal-to-noise ratios (SNR). The A-V recognition
results are predicted by the fuzzy logical model of perception
(FLMP) and the post-labelling integration model (POSTL). We also
applied hidden Markov models (HMMs) and multi-stream HMMs
(MSHMMs) for the recognition. As expected, all the models agree
qualitatively with the results that the benefit gained from the visual
signal is larger at lower acoustic SNRs. However, the FLMP severely
overestimates the A-V integration result, while the POSTL model
underestimates it. Our automatic speech recognizers integrated
the audio and visual stream efficiently. The visual automatic
speech recognizer could be adjusted to correspond to human
visual performance. The MSHMMs combine the audio and visual
streams efficiently, but the audio automatic speech recognizer
must be further improved to allow precise quantitative comparisons with human audio-visual performance.
Carmen Peláez-Moreno, Ana I. García-Moral,
Francisco J. Valverde-Albacete; Universidad Carlos III
de Madrid, Spain
Tue-Ses1-P2-7, Time: 10:00
In this paper we have used Formal Concept Analysis to elicit a
hierarchical structure of human consonant perception task errors.
We have used the Native Listeners experiments provided for the
Consonant Challenge session of Interspeech 2008 to analyze
perception errors committed in relation to the place of articulation
of the consonants being evaluated for one quiet and six noisy
acoustic conditions.
Acoustic and Perceptual Effects of Vocal Training in
Amateur Male Singing
Takeshi Saitou, Masataka Goto; AIST, Japan
Effects of Tempo in Radio Commercials on Young
and Elderly Listeners
Tue-Ses1-P2-8, Time: 10:00
Hanny den Ouden, Hugo Quené; Utrecht University,
The Netherlands
Tue-Ses1-P2-4, Time: 10:00
The aim of the present study is to investigate the effects of tempo
manipulations in radio commercials, on listeners’ evaluation,
cognition and persuasion. Questionnaire scores from 131 young
and 130 elderly listeners show effects of tempo manipulation on
listeners’ subjective evaluation, but not on their cognitive scores.
Tempo effects on persuasion scores are modulated by the listeners’
general disposition towards radio and radio commercials. In sum,
it seems that not age but listeners’ general disposition is of importance in evaluating tempo manipulation of radio commercials.
This paper reports our investigation of the acoustic effects of
vocal training for amateur singers and of the contribution of those
effects to perceived vocal quality. Recording singing voices before
and after vocal training and then analyzing changes in acoustic
parameters with a focus on features unique to singing voices, we
found that two different F0 fluctuations (vibrato and overshoot)
and singing formant were improved by the training. The results of
psychoacoustic experiments showed that perceived voice quality
was influenced more by the changes of F0 characteristics than by
the changes of spectral characteristics and that acoustic features
unique to singing voices contribute to perceived voice quality
in the following order: vibrato, singing formant, overshoot, and
preparation.
Self-Voice Recognition in 4 to 5-Year-Old Children
Sofia Strömbergsson; KTH, Sweden
Tue-Ses1-P2-5, Time: 10:00
Children’s ability to recognize their own recorded voice as their own was explored in a group of 4 to 5-year-old children. The task for the children was to identify which one of four voice samples represented their own voice. The results reveal that children perform well above chance level, and that a time span of 1–2 weeks between the recording and the identification does not affect the children’s performance. F0 similarity between the participant’s recordings and the reference recordings correlated with a higher error-rate. Implications for the use of recordings in speech and language therapy are discussed.
Tue-Ses1-P3 : Speech and Audio Segmentation and Classification
Hewison Hall, 10:00, Tuesday 8 Sept 2009
Chair: S. Umesh, IIT Kanpur, India
Wavelet-Based Speaker Change Detection in Single
Channel Speech Data
Michael Wiesenegger, Franz Pernkopf; Graz University
of Technology, Austria
Tue-Ses1-P3-1, Time: 10:00
Speaker segmentation is the task of finding speaker turns in an
audio stream. We propose a metric-based algorithm based on
Discrete Wavelet Transform (DWT) features. Principal component
analysis (PCA) or linear discriminant analysis (LDA) [1] are further
used to reduce the dimensionality of the feature space and remove
redundant information. In the experiments our methods referred
to as DWT-PCA and DWT-LDA are compared to the DISTBIC algorithm [2] using clean and noisy data of the TIMIT database.
In particular, under conditions with strong noise, i.e. -10 dB SNR,
our DWT-PCA approach is very robust: the false alarm rate (FAR)
increases by ∼2% and the missed detection rate (MDR) stays about
the same compared to clean speech, whereas the DISTBIC method
fails (the FAR and MDR are almost ∼0% and ∼100%, respectively).
For clean speech DWT-PCA shows an improvement of ∼30% (relative) for both the FAR and MDR in comparison to the DISTBIC
algorithm. DWT-LDA is performing slightly worse than DWT-PCA.
An Adaptive Threshold Computation for
Unsupervised Speaker Segmentation
Laura Docio-Fernandez, Paula Lopez-Otero, Carmen
Garcia-Mateo; Universidade de Vigo, Spain
Tue-Ses1-P3-2, Time: 10:00
Reliable speaker segmentation is critical in many applications in
the speech processing domain. In this paper, we compare the
performance of two speaker segmentation systems: the first one is
inspired on a typical state-of-art speaker segmentation system, and
the other is an improved version of the former system. We show
that the proposed system has a better performance as it does not
“over-segment” the data. This system includes an algorithm that
randomly discards some of the change points with a probability
depending on its performance at any moment. Thus, the system
merges adjacent segments when they are spoken by the same
speaker with a high probability; anytime a change is discarded
the discard probability will rise, as the system made a mistake;
the opposite will occur when the two adjacent segments belong to
different speakers, as there will not be a mistake in this case. We
show the improvements of the new system through comparative
experiments on data from the Spanish Parliament Sessions defined
for the 2006 TC-STAR Automatic Speech Recognition evaluation
campaign.
A Semi-Supervised Version of Heteroscedastic
Linear Discriminant Analysis
Haolang Zhou, Damianos Karakos, Andreas G.
Andreou; Johns Hopkins University, USA
Tue-Ses1-P3-4, Time: 10:00
Heteroscedastic Linear Discriminant Analysis (HLDA) was introduced in [1] as an extension of Linear Discriminant Analysis to
the case where the class-conditional distributions have unequal
covariances. The HLDA transform is computed such that the
likelihood of the training (labeled) data is maximized, under the
constraint that the projected distributions are orthogonal to a
nuisance space that does not offer any discrimination. In this
paper we consider the case of semi-supervised learning, where a
large amount of unlabeled data is also available. We derive update
equations for the parameters of the projected distributions, which
are estimated jointly with the HLDA transform, and we empirically
compare it with the case where no unlabeled data are available.
Experimental results with synthetic data and real data from a vowel
recognition task show that, in most cases, semi-supervised HLDA
results in improved performance over HLDA.
Self-Learning Vector Quantization for Pattern
Discovery from Speech
Okko Johannes Räsänen, Unto Kalervo Laine, Toomas
Altosaar; Helsinki University of Technology, Finland
Tue-Ses1-P3-5, Time: 10:00
A novel and computationally straightforward clustering algorithm
was developed for vector quantization (VQ) of speech signals for
a task of unsupervised pattern discovery (PD) from speech. The
algorithm works in purely incremental mode, is computationally
extremely feasible, and achieves comparable classification quality
with the well-known k-means algorithm in the PD task. In addition
to presenting the algorithm, general findings regarding the relationship between the amounts of training material, convergence of
the clustering algorithm, and the ultimate quality of VQ codebooks
are discussed.
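The incremental flavour of such clustering can be illustrated with a one-pass sketch in which each incoming frame either updates its nearest centroid or, if it is too far away, opens a new one; the distance threshold and the running-mean update are assumptions rather than the authors' exact rules:

    import numpy as np

    def incremental_vq(frames, threshold=4.0):
        centroids, counts = [], []
        for x in frames:
            if centroids:
                d = [np.linalg.norm(x - c) for c in centroids]
                j = int(np.argmin(d))
            if not centroids or d[j] > threshold:
                centroids.append(np.array(x, dtype=float))  # open a new code vector
                counts.append(1)
            else:
                counts[j] += 1
                centroids[j] += (x - centroids[j]) / counts[j]  # running mean
        return np.array(centroids)

    # Example: quantize random 12-dimensional "MFCC-like" frames.
    rng = np.random.default_rng(1)
    codebook = incremental_vq(rng.normal(size=(1000, 12)))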
Monaural Segregation of Voiced Speech Using
Discriminative Random Fields
A Data-Driven Approach for Estimating the
Time-Frequency Binary Mask
Rohit Prabhavalkar, Zhaozhang Jin, Eric
Fosler-Lussier; Ohio State University, USA
Gibak Kim, Philipos C. Loizou; University of Texas at
Dallas, USA
Tue-Ses1-P3-6, Time: 10:00
Tue-Ses1-P3-3, Time: 10:00
The ideal binary mask, often used in robust speech recognition
applications, requires an estimate of the local SNR in each time-frequency (T-F) unit. A data-driven approach is proposed for
estimating the instantaneous SNR of each T-F unit. By assuming
that the a priori SNR and a posteriori SNR are uniformly distributed
within a small region, the instantaneous SNR is estimated by
minimizing the localized Bayes risk. The binary mask estimator
derived by the proposed approach is evaluated in terms of hit
and false alarm rates. Compared to the binary mask estimator
that uses the decision-directed approach to compute the SNR, the
proposed data-driven approach yielded substantial improvements
(up to 40%) in classification performance, when assessed in terms
of a sensitivity metric which is based on the difference between the
hit and false alarm rates.
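A hedged sketch of the evaluation described above: a binary mask obtained by thresholding local SNR is scored by its hit and false-alarm rates against the ideal mask. The 0 dB criterion, the noise levels and the random data are assumptions used only to make the example run:

    import numpy as np

    def binary_mask(snr_db, threshold_db=0.0):
        # Retain a time-frequency unit when its local SNR exceeds the threshold.
        return (snr_db > threshold_db).astype(int)

    def hit_false_alarm(est_mask, ideal_mask):
        target = ideal_mask == 1          # target-dominated T-F units
        masker = ideal_mask == 0          # masker-dominated T-F units
        hit = (est_mask[target] == 1).mean() if target.any() else 0.0
        fa = (est_mask[masker] == 1).mean() if masker.any() else 0.0
        return hit, fa, hit - fa          # last value: sensitivity-style score

    rng = np.random.default_rng(2)
    true_snr = rng.normal(0.0, 5.0, size=(64, 100))           # T-F grid in dB
    est_snr = true_snr + rng.normal(0.0, 3.0, size=true_snr.shape)
    print(hit_false_alarm(binary_mask(est_snr), binary_mask(true_snr)))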
Techniques for separating speech from background noise and other
sources of interference have important applications for robust
speech recognition and speech enhancement. Many traditional
computational auditory scene analysis (CASA) based approaches
decompose the input mixture into a time-frequency (T-F) representation, and attempt to identify the T-F units where the target
energy dominates that of the interference. This is accomplished
using a two stage process of segmentation and grouping. In this
pilot study, we explore the use of Discriminative Random Fields
(DRFs) for the task of monaural speech segregation. We find that
the use of DRFs allows us to effectively combine multiple auditory
features into the system, while simultaneously integrating the
two CASA stages into one. Our preliminary results suggest that
CASA based approaches may benefit from the DRF framework.
Advancements in Whisper-Island Detection within
Normally Phonated Audio Streams
Chi Zhang, John H.L. Hansen; University of Texas at
Dallas, USA
Tue-Ses1-P3-7, Time: 10:00
In this study, several improvements are proposed for
whisper-island detection within normally phonated audio streams.
Based on our previous study, an improved feature, which is
more sensitive to vocal effort change points between whisper and
neutral speech, is developed and utilized in vocal effort change
point (VECP) detection and vocal effort classification. Evaluation
is based on the proposed multi-error score, where the improved
feature showed better performance in VECP detection with the
lowest MES of 19.08. Furthermore, a more accurate whisper-island
detection was obtained using the improved algorithm. Finally,
an experimental detection rate of 95.33% reflects better
whisper-island detection performance for the improved algorithm
versus that of the original baseline algorithm.
Joint Segmentation and Classification of Dialog Acts
Using Conditional Random Fields
Matthias Zimmermann; xbrain.ch, Switzerland
Tue-Ses1-P3-8, Time: 10:00
This paper investigates the use of conditional random fields for
joint segmentation and classification of dialog acts exploiting
both word and prosodic features that are directly available from
a speech recognizer. To validate the approach experiments are
conducted with two different sets of dialog act types under both
reference and speech to text conditions. Although the proposed
framework is conceptually simpler than previous attempts at
segmentation and classification of DAs it outperforms all previous
systems for a task based on the ICSI (MRDA) meeting corpus.
Exploring Complex Vowels as Phrase Break
Correlates in a Corpus of English Speech with
ProPOSEL, a Prosody and POS English Lexicon
Claire Brierley 1 , Eric Atwell 2 ; 1 University of Bolton,
UK; 2 University of Leeds, UK
Tue-Ses1-P3-9, Time: 10:00
Real-world knowledge of syntax is seen as integral to the machine
learning task of phrase break prediction but there is a deficiency of
a priori knowledge of prosody in both rule-based and data-driven
classifiers. Speech recognition has established that pauses affect
vowel duration in preceding words. Based on the observation
that complex vowels occur at rhythmic junctures in poetry, we
run significance tests on a sample of transcribed, contemporary
British English speech and find a statistically significant correlation
between complex vowels and phrase breaks. The experiment
depends on automatic text annotation via ProPOSEL, a prosody and
part-of-speech English lexicon.
Automatic Topic Detection of Recorded Voice
Messages
Caroline Clemens 1 , Stefan Feldes 2 , Karlheinz
Schuhmacher 1 , Joachim Stegmann 1 ; 1 Deutsche
Telekom Laboratories, Germany; 2 T-Systems, Germany
Tue-Ses1-P3-10, Time: 10:00
We present an approach to automatic classification of spontaneously spoken voice messages. During overload periods at call-centers, customers are offered a call-back at a later time. A speech dialog asks them to describe their concern on a voice box. The identified topics correspond to the supported service categories, which in turn determine the agent group the customer message is routed to. Our multistage classification process includes speech-to-text, stemming, keyword spotting, and categorization. Classifier training and evaluation have been performed with real-life data. Results show promising performance. The pilot will be launched in a field test.
Identification and Automatic Detection of Parasitic
Speech Sounds
Jindřich Matoušek 1 , Radek Skarnitzl 2 , Pavel Machač 2 ,
Jan Trmal 1 ; 1 University of West Bohemia in Pilsen,
Czech Republic; 2 Charles University in Prague, Czech
Republic
Tue-Ses1-P3-11, Time: 10:00
This paper presents initial experiments with the identification and
automatic detection of parasitic sounds in speech signals. The
main goal of this study is to identify such sounds in the source
recordings for unit-selection-based speech synthesis systems
and thus to avoid their unintended usage in synthesised speech.
The first part of the paper describes the phonetic analysis and
identification of parasitic phenomena in recordings of two Czech
speakers. In the second part, experiments with the automatic
detection of parasitic sounds using HMM-based and BVM classifiers
are presented. The results are encouraging, especially those for
glottalization phenomena.
Phonetic Alignment for Speech Synthesis in
Under-Resourced Languages
D.R. van Niekerk, Etienne Barnard; CSIR, South Africa
Tue-Ses1-P3-12, Time: 10:00
The rapid development of concatenative speech synthesis systems
in resource scarce languages requires an efficient and accurate
solution with regard to automated phonetic alignment. However,
in this context corpora are often minimally designed due to a lack
of resources and expertise necessary for large scale development.
Under these circumstances many techniques toward accurate
segmentation are not feasible and it is unclear which approaches
should be followed. In this paper we investigate this problem by
evaluating alignment approaches and demonstrating how these
approaches can be applied to limit manual interaction while achieving acceptable alignment accuracy with minimal ideal resources.
Improving Initial Boundary Estimation for
HMM-Based Automatic Phonetic Segmentation
Kalu U. Ogbureke, Julie Carson-Berndsen; University
College Dublin, Ireland
Tue-Ses1-P3-13, Time: 10:00
This paper presents an approach to boundary estimation for automatic segmentation of speech given a phone (sound) sequence.
The technique presented represents an extension to existing approaches to Hidden Markov Model based automatic segmentation
which modifies the topology of the model to control for duration.
An HMM system trained with this modified topology places 77.10%,
86.72% and 91.15% of the boundaries on the TIMIT speech test
corpus within 10, 15 and 20 ms, respectively, of the manual annotations. This represents an improvement
over the baseline result of 70.99%, 83.50% and 89.18% for initial
boundary estimation.
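The tolerance-based scoring reported in this and several neighbouring abstracts can be written down directly; the sketch below (with made-up boundary times, and assuming the automatic and manual boundaries are already paired one-to-one) computes the percentage of boundaries falling within each tolerance:

    def boundary_accuracy(auto_ms, manual_ms, tolerances=(10, 15, 20)):
        results = {}
        for tol in tolerances:
            ok = sum(abs(a - m) <= tol for a, m in zip(auto_ms, manual_ms))
            results[tol] = 100.0 * ok / len(manual_ms)
        return results

    # Example: three boundaries, two of them within 10 ms of the reference.
    print(boundary_accuracy([100, 215, 333], [105, 240, 330]))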
Tue-Ses1-P4 : Speaker Recognition and
Diarisation
Hewison Hall, 10:00, Tuesday 8 Sept 2009
Chair: Sadaoki Furui, Tokyo Institute of Technology, Japan
Importance of Nasality Measures for Speaker
Recognition Data Selection and Performance
Prediction
Howard Lei, Eduardo Lopez-Gonzalo; ICSI, USA
Tue-Ses1-P4-1, Time: 10:00
We improve upon measures relating feature vector distributions
to speaker recognition (SR) performances for SR performance
prediction and arbitrary data selection. In particular, we examine
the means and variances of 11 features pertaining to nasality
(resulting in 22 measures), computing them on feature vectors of
phones to determine which measures give good SR performance
prediction of phones. We’ve found that the combination of nasality
measures gives a 0.917 correlation with the Equal Error Rates (EERs)
of phones on SRE08, exceeding the correlation of our previous
best measure (mutual information) by 12.7%. When implemented
in our data-selection scheme (which does not require a SR system
to be run), the nasality measures allow us to select data with
combined EER better than data selected via running a SR system in
certain cases, at a fortieth of the computational costs. The nasality
measures require a tenth of the computational costs compared to
our previous best measure.
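As an illustration of how per-phone measures can be ranked by how well they track speaker recognition performance, the sketch below correlates each measure with per-phone equal error rates; the measure names and numbers are invented toy data, not values from the paper:

    import numpy as np

    def rank_measures_by_eer_correlation(measures, eers):
        # measures: dict of name -> per-phone values (same phone order as eers).
        eers = np.asarray(eers, dtype=float)
        scored = []
        for name, vals in measures.items():
            r = np.corrcoef(np.asarray(vals, dtype=float), eers)[0, 1]
            scored.append((abs(r), r, name))
        scored.sort(reverse=True)
        return [(name, r) for _, r, name in scored]

    print(rank_measures_by_eer_correlation(
        {"nasal_mean": [0.1, 0.4, 0.2, 0.8, 0.6],
         "nasal_var": [0.3, 0.1, 0.2, 0.2, 0.4]},
        [12.0, 9.5, 11.0, 6.0, 7.5]))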
Exploration of Vocal Excitation Modulation Features
for Speaker Recognition
Ning Wang, P.C. Ching, Tan Lee; Chinese University of
Hong Kong, China
Tue-Ses1-P4-2, Time: 10:00
To derive spectro-temporal vocal source features complementary
to the conventional spectral-based vocal tract features in improving
the performance and reliability of a speaker recognition system,
the excitation related modulation properties are studied. Through
multi-band demodulation method, source-related amplitude and
phase quantities are parameterized into feature vectors. Evaluation of the proposed features is carried out first through a set of
designed experiments on artificially generated inputs, and then by
simulations on speech database. It is observed via the designed
experiments that the proposed features are capable of capturing
the vocal differences in terms of F0 variation, pitch epoch shape,
and relevant excitation details between epochs. In the real task
simulations, by combination with the standard spectral features,
both the amplitude and the phase-related features are shown
to evidently reduce the identification error rate and equal error
rate in the context of the Gaussian mixture model-based speaker
recognition system.
Speaker Identification for Whispered Speech Using
Modified Temporal Patterns and MFCCs
Xing Fan, John H.L. Hansen; University of Texas at
Dallas, USA
Tue-Ses1-P4-3, Time: 10:00
Speech production variability due to whisper represents a major challenge for effective speech systems. Whisper is used by talkers
intentionally in certain circumstances to protect personal privacy.
Due to the absence of periodic excitation in the production of
whisper, there are considerable differences between neutral and
whispered speech in the spectral structure. Therefore, performance of speaker ID systems trained with high energy voiced
phonemes degrades significantly when tested with whisper. This
study considers a combination of modified temporal patterns
(m-TRAPs) and MFCCs to improve the performance of a neutral
trained system for whispered speech. The m-TRAPs are introduced
based on an explanation for the whisper/neutral mismatch degradation of an MFCC based system. A phoneme-by-phoneme score
weighting method is used to fuse the score from each subband.
Text independent closed set speaker ID was conducted and experimental results show that m-TRAPs are especially efficient for
whisper with low SNR. When combining scores from both MFCC and
TRAPs based GMMs, an absolute 26.3% improvement in accuracy
is obtained compared with a traditional MFCC baseline system.
This result confirms a viable approach to improving speaker ID
performance between neutral/whisper mismatched conditions.
Speaker Diarization for Meeting Room Audio
Hanwu Sun, Tin Lay Nwe, Bin Ma, Haizhou Li; Institute
for Infocomm Research, Singapore
Tue-Ses1-P4-4, Time: 10:00
This paper describes a speaker diarization system in 2007 NIST
Rich Transcription (RT07) Meeting Recognition Evaluation for
the task of Multiple Distant Microphone (MDM) in meeting room
scenarios. The system includes three major modules: data preparation, initial speaker clustering and cluster purification/merging.
The data preparation consists of Wiener filtering and beamforming of the raw data, Time Difference of Arrival estimation, and speech
activity detection. Based on the initial processed data, two-stage
histogram quantization has been used to perform the initial
speaker clustering. A modified purification strategy via a high-order GMM clustering method is proposed. The BIC criterion is applied for
cluster merging. The system achieves a competitive overall DER of
8.31% for RT07 MDM speaker diarization task.
Improving Speaker Segmentation via Speaker
Identification and Text Segmentation
Runxin Li, Tanja Schultz, Qin Jin; Carnegie Mellon
University, USA
Tue-Ses1-P4-5, Time: 10:00
Speaker segmentation is an essential part of a speaker diarization
system. Common segmentation systems usually miss speaker
change points when speakers switch fast. These errors seriously
confuse the following speaker clustering step and result in high
overall speaker diarization error rates. In this paper two methods
are proposed to deal with this problem: The first approach uses
speaker identification techniques to boost speaker segmentation. The second approach applies text segmentation methods to
improve the performance of speaker segmentation. Experiments
on Quaero speaker diarization evaluation data show that our
methods achieve up to 45% relative reduction in the speaker
diarization error and 64% relative increase in the speaker change
detection recall rate over the baseline system. Moreover, both approaches can be considered post-processing steps over the baseline segmentation and can therefore be applied in any speaker diarization system.
Overall Performance Metrics for Multi-Condition
Speaker Recognition Evaluations
David A. van Leeuwen; TNO Human Factors, The
Netherlands
Tue-Ses1-P4-6, Time: 10:00
In this paper we propose a framework for measuring the overall
performance of an automatic speaker recognition system using a
set of trials of a heterogeneous evaluation such as NIST SRE-2008,
which combines several acoustic conditions in one evaluation. We
do this by weighting trials of different conditions according to
their relative proportion, and we derive expressions for the basic
speaker recognition performance measures Cdet and Cllr, as well as the DET curve, from which the EER and the minimum Cdet can be computed. Examples
of pooling of conditions are shown on SRE-2008 data, including
speaker sex and microphone type and speaking style.
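A minimal sketch of per-condition trial weighting when pooling scores is given below; the scores are toy values and the weighting choice (equal total weight per condition) is an illustrative assumption, not the paper's exact estimator.

    # Sketch: pool trials from several conditions with per-condition weights (toy data).
    import numpy as np

    scores  = np.array([2.1, 0.3, -0.5, 1.7, -1.2, 0.9, -0.1, -2.0])
    targets = np.array([1,   0,    0,   1,    0,   1,    0,    0], bool)
    cond    = np.array([0,   0,    0,   0,    1,   1,    1,    1])

    counts  = np.bincount(cond).astype(float)
    weights = (1.0 / counts)[cond]            # give each condition the same total weight

    def weighted_rates(threshold):
        accept = scores >= threshold
        miss = np.sum(weights[targets & ~accept]) / np.sum(weights[targets])
        fa   = np.sum(weights[~targets & accept]) / np.sum(weights[~targets])
        return miss, fa

    # crude EER estimate: threshold where weighted miss and false-alarm rates cross
    ts = np.linspace(scores.min(), scores.max(), 200)
    miss, fa = np.array([weighted_rates(t) for t in ts]).T
    eer = 0.5 * (miss + fa)[np.argmin(np.abs(miss - fa))]
    print(f"weighted EER ~ {eer:.2%}")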
Speaker Identification Using Warped MVDR Cepstral
Features
Matthias Wölfel 1 , Qian Yang 2 , Qin Jin 3 , Tanja
Schultz 2 ; 1 ZKM, Germany; 2 Universität Karlsruhe (TH),
Germany; 3 Carnegie Mellon University, USA
Tue-Ses1-P4-7, Time: 10:00
It is common practice to use similar or even the same feature
extraction methods for automatic speech recognition and speaker
identification. While the front-end for the former requires to
preserve phoneme discrimination and to compensate for speaker
differences to some extend, the front-end for the latter has to
preserve the unique characteristics of individual speakers. It
seems, therefore, contradictory to use the same feature extraction
methods for both tasks. Starting out from the common practice we
propose to use warped minimum variance distortionless response
(MVDR) cepstral coefficients, which have already been demonstrated to perform better for automatic speech recognition, in
particular under adverse conditions. Replacing the widely used
mel-frequency cepstral coefficients by WMVDR cepstral coefficients
improves the speaker identification accuracy by up to 24% relative.
We found that the optimal choice of the model order within
the WMVDR framework differs between speech recognition and
speaker recognition, confirming our intuition that the two different
tasks indeed require different feature extraction strategies.
Entropy Based Overlapped Speech Detection as a
Pre-Processing Stage for Speaker Diarization
Oshry Ben-Harush 1 , Itshak Lapidot 2 , Hugo
Guterman 1 ; 1 Ben-Gurion University of the Negev,
Israel; 2 Sami Shamoon College of Engineering, Israel
Speech Style and Speaker Recognition: A Case Study
Marco Grimaldi, Fred Cummins; University College
Dublin, Ireland
Tue-Ses1-P4-9, Time: 10:00
This work presents an experimental evaluation of the effect of
different speech styles on the task of speaker recognition. We make
use of willfully altered voice extracted from the CHAINS corpus and
methodically assess the effect of its use in both testing and training
a reference speaker identification system and a reference speaker
verification system. In this work we contrast normal readings of
text with two varieties of imitative styles and with the familiar,
non-imitative, variant of fast speech. Furthermore, we test the
applicability of a novel speech parameterization that has been
suggested as a promising technique in the task of speaker identification: the pyknogram frequency estimate coefficients — pykfec.
The experimental evaluation indicates that both the reference
verification and identification systems are affected by variations in
style of the speech material used, especially in the case that speech
is also mismatched in channel. Our case studies also indicate that the adoption of pykfec as the speech encoding methodology has an overall favorable effect on the systems' accuracy scores.
The Majority Wins: A Method for Combining
Speaker Diarization Systems
Marijn Huijbregts 1 , David A. van Leeuwen 2 ,
Franciska M.G. de Jong 1 ; 1 University of Twente, The
Netherlands; 2 TNO Human Factors, The Netherlands
Tue-Ses1-P4-10, Time: 10:00
In this paper we present a method for combining multiple diarization systems into one single system by applying a majority voting
scheme. The voting scheme selects the best segmentation purely
on basis of the output of each system. On our development set
of NIST Rich Transcription evaluation meetings the voting method
improves our system on all evaluation conditions. For the single
distant microphone condition, DER performance improved by 7.8%
(relative) compared to the best input system. For the multiple
distant microphone condition the improvement is 3.6%.
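The frame-level voting idea can be illustrated with a toy sketch; the labels are assumed to be already mapped onto a common speaker inventory, a step the real system has to handle.

    # Sketch: frame-level majority vote over three diarization label streams (toy labels).
    from collections import Counter

    sys_a = ["s1", "s1", "s2", "s2", "s2", "s1"]
    sys_b = ["s1", "s2", "s2", "s2", "s1", "s1"]
    sys_c = ["s1", "s1", "s2", "s1", "s2", "s1"]

    def majority(labels_per_frame):
        votes = Counter(labels_per_frame)
        return votes.most_common(1)[0][0]      # ties resolved arbitrarily here

    combined = [majority(frame) for frame in zip(sys_a, sys_b, sys_c)]
    print(combined)   # ['s1', 's1', 's2', 's2', 's2', 's1']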
Two-Wire Nuisance Attribute Projection
Yosef A. Solewicz 1 , Hagai Aronowitz 2 ; 1 Bar-Ilan
University, Israel; 2 IBM Haifa Research Lab, Israel
Tue-Ses1-P4-8, Time: 10:00
Tue-Ses1-P4-11, Time: 10:00
One inherent deficiency of most diarization systems is their inability to handle co-channel or overlapped speech. Most of the suggested algorithms perform only under specific conditions and require high computational complexity in both the time and frequency domains.
In this study, frame based entropy analysis of the audio data in the
time domain serves as a single feature for an overlapped speech
detection algorithm. Identification of overlapped speech segments
is performed using Gaussian Mixture Modeling (GMM) along with
well known classification algorithms applied on two speaker
conversations. By employing this methodology, the proposed
method eliminates the need for setting a hard threshold for each
conversation or database.
LDC CALLHOME American English corpus is used for evaluation of the suggested algorithm. The proposed method successfully detects 63.2% of the frames labeled as overlapped speech by the manual segmentation, while keeping a 5.4% false-alarm rate.
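As an illustration of the entropy feature, one possible frame-wise definition (Shannon entropy of the amplitude histogram of each time-domain frame) is sketched below; this is an assumed realization, not necessarily the authors' exact formulation.

    # Sketch: frame-wise entropy of time-domain samples as an overlap-detection feature.
    import numpy as np

    def frame_entropy(frame, n_bins=50):
        """Shannon entropy of the amplitude histogram of one frame (one possible definition)."""
        hist, _ = np.histogram(frame, bins=n_bins)
        p = hist / max(hist.sum(), 1)
        p = p[p > 0]
        return -np.sum(p * np.log2(p))

    def entropy_track(signal, frame_len=400, hop=160):   # e.g. 25 ms / 10 ms at 16 kHz
        return np.array([frame_entropy(signal[i:i + frame_len])
                         for i in range(0, len(signal) - frame_len, hop)])

    # toy signal: a single "speaker" (one sinusoid) vs. "overlap" (sum of two)
    t = np.arange(16000) / 16000.0
    single  = np.sin(2 * np.pi * 150 * t)
    overlap = single + np.sin(2 * np.pi * 210 * t + 1.0)
    print(entropy_track(single).mean(), entropy_track(overlap).mean())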
This paper addresses the task of nuisance reduction in two-wire speaker recognition applications. Besides channel mismatch, two-wire conversations are contaminated by extraneous speakers which represent an additional source of noise in the supervector domain. It is shown that two-wire nuisance manifests itself as undesirable directions in the inter-speaker subspace. For this purpose, we derive two alternative Nuisance Attribute Projection (NAP) formulations tailored for two-wire sessions. The first formulation generalizes the NAP framework based on a model of two-wire conversations. The second formulation explicitly models the four- vs. two-wire supervector variability. Preliminary experiments show that two-wire NAP significantly outperforms regular NAP in varied two-wire tasks.
Tue-Ses1-S1 : Special Session: Advanced
Voice Function Assessment
Ainsworth (East Wing 4), 10:00, Tuesday 8 Sept 2009
Chair: Anna Barney, University of Southampton, UK and Mette
Pedersen, Medical Centre of Copenhagen, Denmark
The Use of Telephone Speech Recordings for
Assessment and Monitoring of Cognitive Function
in Elderly People
Acoustic and High-Speed Digital Imaging Based
Analysis of Pathological Voice Contributes to Better
Understanding and Differential Diagnosis of
Neurological Dysphonias and of Mimicking
Phonatory Disorders
Viliam Rapcan, Shona D’Arcy, Nils Penard, Ian H.
Robertson, Richard B. Reilly; Trinity College Dublin,
Ireland
Tue-Ses1-S1-4, Time: 11:00
Krzysztof Izdebski 1, Yuling Yan 2, Melda Kunduk 3; 1 Pacific Voice and Speech Foundation, USA; 2 Stanford University, USA; 3 Louisiana State University, USA
Tue-Ses1-S1-1, Time: 10:00
Using Nyquist-plots definitions and HSDI-based analyses of the
acoustic and visual data base of similarly sounding disordered
neurologically driven pathological phonations, we categorized
these signals and provided an in-depth explanation of how these
sounds differ, and how these sounds are generated at the glottic level. Combined evaluations based on modern technology
strengthened our knowledge and improved objective guidelines
on how to approach clinical diagnosis by ear, significantly aiding
the process of differential diagnosis of complex pathological voice
qualities in nonlaboratory settings.
Cognitive assessment in the clinic is a time-consuming and expensive task. Speech may be employed as a means of monitoring cognitive function in elderly people. Extraction of speech characteristics from speech recorded remotely over a telephone was investigated and compared to speech characteristics extracted from recordings made in a controlled environment. Results demonstrate that speech characteristics can be, with small changes to the feature extraction algorithm, reliably (with an overall accuracy of 93.2%) extracted from telephone-quality speech. With further
development of a fully automated IVR system, an early screening
system for cognitive decline may be easily realized.
Optimized Feature Set to Assess Acoustic
Perturbations in Dysarthric Speech
Sunil Nagaraja, Eduardo Castillo-Guerra; University of
New Brunswick, Canada
Normalized Modulation Spectral Features for
Cross-Database Voice Pathology Detection
Tue-Ses1-S2-5, Time: 11:20
Maria Markaki 1 , Yannis Stylianou 2 ; 1 University of
Crete, Greece; 2 FORTH, Greece
Tue-Ses1-S1-2, Time: 10:20
In this paper, we employ normalized modulation spectral analysis
for voice pathology detection. Such normalization is important
when there is a mismatch between training and testing conditions or, in other words, when employing the detection system in real (testing) conditions. Modulation spectra usually produce a high-dimensionality space. For classification purposes, the size of the original space is reduced using Higher Order Singular Value Decomposition (SVD). Further, we select the most relevant features based on the mutual information between subjective voice quality and computed features, which leads to a modulation spectra representation that is adaptive to the classification task. For voice pathology detection, the adaptive modulation spectra are combined with an SVM classifier. To simulate real testing conditions, two different databases are used: one for training and the other for testing. We address the difference of signal
characteristics between training and testing data through subband
normalization of modulation spectral features. Simulations show
that feature normalization enables the cross-database detection of
pathological voices even when training and test data are different.
Speech Sample Salience Analysis for Speech Cycle
Detection
C. Mertens, Francis Grenez, Jean Schoentgen;
Université Libre de Bruxelles, Belgium
Tue-Ses1-S1-3, Time: 10:40
The presentation proposes a method for the measurement of cycle lengths in voiced speech. The background is the study of acoustic cues of slow (vocal tremor) and fast (vocal jitter) perturbations of the vocal frequency. Here, these acoustic cues are obtained by means of a temporal method that detects speech cycles via the so-called salience of the speech signal samples. The method does not require that the signal is locally periodic or that the average period length is known a priori. Several implementations are considered and discussed. Salience analysis is compared with the auto-correlation method for cycle detection implemented in Praat.
This paper is focused on the optimization of features derived to
characterize the acoustic perturbations encountered in a group
of neurological disorders known as Dysarthria. The work derives
a set of orthogonal features that enable acoustic analyses of
dysarthric speech from eight different Dysarthria types. The feature set is composed of combinations of objective measurements
obtained with digital signal processing algorithms and perceptual
judgments of the most reliably perceived acoustic perturbations.
The effectiveness of the features to provide relevant information
of the disorders is evaluated with different classifiers enabling a
classification rate up to 93.7%.
A Microphone-Independent Visualization Technique
for Speech Disorders
Andreas Maier 1 , Stefan Wenhardt 1 , Tino Haderlein 1 ,
Maria Schuster 2 , Elmar Nöth 1 ; 1 FAU
Erlangen-Nürnberg, Germany; 2 Universitätsklinikum
Erlangen, Germany
Tue-Ses1-S2-6, Time: 10:20
In this paper we introduce a novel method for the visualization of
speech disorders. We demonstrate the method with disordered
speech and a control group. However, both groups were recorded
using two different microphones. The projection of the patient
data using a single microphone yields significant correlations
between the coordinates on the map and certain criteria of the
disorder which were perceptually rated. However, projection of
data from multiple microphones reduces this correlation. Usually,
the acoustical mismatch between the microphones is greater than
the mismatch between the speakers, i.e., not the disorders but
the microphones form clusters in the visualization. Based on an
extension of the Sammon mapping, we are able to create a map
which projects the same speakers onto the same position even if
multiple microphones are used. Furthermore, our method also
restores the correlation between the map coordinates and the
perceptual assessment.
Evaluation of the Effect of the GSM Full Rate Codec
on the Automatic Detection of Laryngeal
Pathologies Based on Cepstral Analysis
Rubén Fraile, Carmelo Sánchez, Juan I.
Godino-Llorente, Nicolás Sáenz-Lechón, Víctor
Osma-Ruiz, Juana M. Gutiérrez; Universidad
Politécnica de Madrid, Spain
Intelligibility Assessment in Children with Cleft Lip
and Palate in Italian and German
Tue-Ses1-S2-7, Time: 11:20
Advances in speech signal analysis during the last decade have allowed the development of automatic algorithms for a non-invasive
detection of laryngeal pathologies. Bearing in mind the extension
of these automatic methods to remote diagnosis scenarios, this
paper analyzes the performance of a pathology detector based on
Mel Frequency Cepstral Coefficients when the speech signal has
undergone the distortion of a speech codec such as the GSM FR
codec, which is used in one of today's most widespread communications networks. It is shown that the overall performance of
the automatic detection of pathologies is degraded less than 5%,
and that such degradation is not due to the codec itself, but to the
bandwidth limitation needed at its input. These results indicate
that the GSM system can be more adequate to implement remote
voice assessment than the analogue telephone channel.
Marcello Scipioni 1 , Matteo Gerosa 2 , Diego Giuliani 2 ,
Elmar Nöth 3 , Andreas Maier 3 ; 1 Politecnico di Milano,
Italy; 2 FBK, Italy; 3 FAU Erlangen-Nürnberg, Germany
Tue-Ses1-S2-10, Time: 11:20
Current research has shown that the speech intelligibility in children with cleft lip and palate (CLP) can be estimated automatically
using speech recognition methods. On German CLP data high and
significant correlations between human ratings and the recognition
accuracy of a speech recognition system were already reported. In
this paper we investigate whether the approach is also suitable for
other languages. Therefore, we compare the correlations obtained
on German data with the correlations on Italian data. A high and
significant correlation (r=0.76; p < 0.01) was identified on the
Italian data. These results do not differ significantly from the
results on German data (p > 0.05).
Universidade de Aveiro’s Voice Evaluation Protocol
Cepstral Analysis of Vocal Dysperiodicities in
Disordered Connected Speech
Luis M.T. Jesus 1, Anna Barney 2, Ricardo Santos 3, Janine Caetano 4, Juliana Jorge 5, Pedro Sá Couto 1; 1 Universidade de Aveiro, Portugal; 2 University of Southampton, UK; 3 Hospital Privado da Trofa, Portugal; 4 Agrupamento de Escolas Serra da Gardunha, Portugal; 5 RAIZ, Portugal
A. Alpan 1, Jean Schoentgen 1, Y. Maryn 2, Francis Grenez 1, P. Murphy 3; 1 Université Libre de Bruxelles, Belgium; 2 Sint-Jan General Hospital, Belgium; 3 University of Limerick, Ireland
Tue-Ses1-S2-8, Time: 11:20
Tue-Ses1-S2-11, Time: 11:20
Several studies have shown that the amplitude of the first rahmonic
peak (R1) in the cepstrum is an indicator of hoarse voice quality.
The cepstrum is obtained by taking the inverse Fourier Transform
of the log-magnitude spectrum. In the present study, a number
of spectral analysis processing steps are implemented, including
period-synchronous and period-asynchronous analysis, as well
as harmonic-synchronous and harmonic-asynchronous spectral
band-limitation prior to computing the cepstrum. The analysis
is applied to connected speech signals. The correlation between
amplitude R1 and perceptual ratings is examined for a corpus
comprising 28 normophonic and 223 dysphonic speakers. One
observes that the correlation between R1 and perceptual ratings
increases when the spectrum is band-limited prior to computing
the cepstrum. In addition, comparisons are made with a popular
cepstral cue which is the cepstral peak prominence (CPP).
This paper presents Universidade de Aveiro’s Voice Evaluation
Protocol for European Portuguese (EP), and a preliminary inter-rater
reliability study. Ten patients with vocal pathology were assessed,
by two Speech and Language Therapists (SLTs). Protocol parameters such as overall severity, roughness, breathiness, change of
loudness (CAPE-V), grade, breathiness and strain (GRBAS), glottal
attack, respiratory support, respiratory-phonatory-articulatory
coordination, digital laryngeal manipulation, voice quality after
manipulation, muscular tension and diagnosis, presented high
reliability and were highly correlated (good inter-rater agreement
and high value of correlation). Values for the overall severity and
grade were similar to those reported in the literature.
Standard Information from Patients: The Usefulness of Self-Evaluation (Measured with the French Version of the VHI)
Lise Crevier-Buchman 1, Stephanie Borel 1, Stéphane Hans 1, Madeleine Menard 1, Jacqueline Vaissiere 2; 1 Université Paris Descartes, France; 2 LPP, France
Tue-Ses1-S2-9, Time: 11:20
The Voice Handicap Index (VHI) is a scale designed to measure voice disability in daily life. Two groups of patients were evaluated. One group was represented by glottic carcinoma treated by cordectomy: type I & II (13 patients), type III (5 patients), type V (5 patients). Evaluation was done pre- and postoperatively for 12 months. The other group was represented by patients with unilateral vocal fold paralysis treated by thyroplasty (17 patients). Evaluation was done before and 3 months postoperatively. Total VHI, emotional and physical subscales improved significantly for type I & II cordectomy and for thyroplasty. The VHI can provide an insight into the patient's handicap.
Tue-Ses2-O1 : Automotive and Mobile Applications
Main Hall, 13:30, Tuesday 8 Sept 2009
Chair: Kate Knill, Toshiba Research Europe Ltd., UK
Fast Speech Recognition for Voice Destination Entry in a Car Navigation System
Hoon Chung, JeonGue Park, HyeonBae Jeon, YunKeun Lee; ETRI, Korea
Tue-Ses2-O1-1, Time: 13:30
In this paper, we introduce a multi-stage decoding algorithm optimized to recognize a very large number of entry names on
a resource-limited embedded device. The multi-stage decoding
algorithm is composed of a two-stage HMM-based coarse search
and a detailed search. The two-stage HMM-based coarse search
generates a small set of candidates that are assumed to contain a
correct hypothesis with high probability, and the detailed search
re-ranks the candidates by rescoring them with sophisticated acoustic models. In this paper, we conduct experiments with one million point-of-interest (POI) names on an in-car navigation device with
a fixed-point processor running at 620MHz. The experimental
result shows that the multi-stage decoding algorithm runs about
2.23 times real-time on the device without serious degradation of
recognition performance.
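The general coarse-search-then-rescore pattern can be sketched as follows; the entry list and both scoring functions are invented stand-ins for the HMM-based coarse search and the detailed acoustic rescoring, not the system described above.

    # Sketch of a generic coarse-search-then-rescore pipeline (scoring stubs are invented).
    import difflib

    ENTRIES = ["central station", "city hall", "central park", "city museum", "stadium"]

    def coarse_score(query, entry):
        # cheap stand-in for the first-pass score
        return difflib.SequenceMatcher(None, query, entry).ratio()

    def detailed_score(query, entry):
        # stand-in for rescoring the shortlist with more sophisticated models
        return coarse_score(query, entry) + 0.1 * (query.split()[0] == entry.split()[0])

    def recognize(query, n_candidates=3):
        shortlist = sorted(ENTRIES, key=lambda e: coarse_score(query, e), reverse=True)[:n_candidates]
        return max(shortlist, key=lambda e: detailed_score(query, e))

    print(recognize("centrel park"))   # -> 'central park'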
Improving Perceived Accuracy for In-Car Media Search
Yun-Cheng Ju, Michael Seltzer, Ivan Tashev; Microsoft
Research, USA
Tue-Ses2-O1-2, Time: 13:50
Speech recognition technology is prone to mistakes, but this is not
the only source of errors that cause speech recognition systems
to fail; sometimes the user simply does not utter the command
correctly. Usually, user mistakes are not considered when a system
is designed and evaluated. This creates a gap between the claimed
accuracy of the system and the actual accuracy perceived by the
users. We address this issue quantitatively in our in-car infotainment media search task and propose expanding the capability of
voice command to accommodate user mistakes while retaining
a high percentage of the performance for queries with correct
syntax. As a result, failures caused by user mistakes were reduced
by an absolute 70% at the cost of a drop in accuracy of only 0.28%.
Laying the Foundation for In-Car Alcohol Detection
by Speech
Language Modeling for What-with-Where on
GOOG-411
Charl van Heerden 1 , Johan Schalkwyk 2 , Brian
Strope 2 ; 1 CSIR, South Africa; 2 Google Inc., USA
Tue-Ses2-O1-5, Time: 14:50
This paper describes the language modeling architectures and
recognition experiments that enabled support of ‘what-with-where’
queries on GOOG-411. First we compare accuracy trade-offs between a single national business LM for business queries and using
many small models adapted for particular cities. Experimental
evaluations show that both approaches lead to comparable overall
accuracy. Differences in the distributions of errors also lead to
improvements from a simple combination. We then optimize
variants of the national business LM in the context of combined
business and location queries from the web, and finally evaluate
these models on a recognition test from the recently fielded
‘what-with-where’ system.
Very Large Vocabulary Voice Dictation for Mobile
Devices
Jan Nouza, Petr Cerva, Jindrich Zdansky; Technical
University of Liberec, Czech Republic
Tue-Ses2-O1-6, Time: 15:10
Florian Schiel, Christian Heinrich; LMU München,
Germany
Tue-Ses2-O1-3, Time: 14:10
The fact that an increasing number of functions in the automobile are and will be controlled by the driver's speech raises the question of whether this speech input may be used to detect possible alcoholic intoxication of the driver. To this end, a large part
of the new Alcohol Language Corpus (ALC) edited by the Bavarian
Archive of Speech Signals (BAS) will be used for a broad statistical
investigation of possible feature candidates for classification. In
this contribution we present the motivation and the design of the
ALC corpus as well as first results from fundamental frequency
and rhythm analysis. Our analysis by comparing sober and alcoholized speech of the same individuals suggests that there are in
fact promising features that can automatically be derived from
the speech signal during the speech recognition process and will
indicate intoxication for most speakers.
This paper deals with optimization techniques that can make very
large vocabulary voice dictation applications deployable on recent
mobile devices. We focus in particular on the optimization of signal parameterization (frame rate, FFT calculation, fixed-point representation)
and on efficient pruning techniques employed on the state and
Gaussian mixture level. We demonstrate the applicability of the
proposed techniques on the practical design of an embedded
255K-word discrete dictation program developed for Czech. Its
real performance is comparable to a client-server version of the
fluent dictation program implemented on the same mobile device.
A Voice Search Approach to Replying to SMS Messages in Automobiles
Yun-Cheng Ju, Tim Paek; Microsoft Research, USA
Tue-Ses2-O1-4, Time: 14:30
Automotive infotainment systems now provide drivers the ability to hear incoming Short Message Service (SMS) text messages using text-to-speech. However, the question of how best to allow users to respond to these messages using speech recognition remains unsettled. In this paper, we propose a robust voice search approach to replying to SMS messages based on template matching. The templates are empirically derived from a large SMS corpus, and matches are accurately retrieved using a vector space model. In evaluating SMS replies within the acoustically challenging environment of automobiles, the voice search approach consistently outperformed using just the recognition results of a statistical language model or a probabilistic context-free grammar. For SMS replies covered by our templates, the approach achieved as high as 89.7% task completion when evaluating the top five reply candidates.
Tue-Ses2-O2 : Prosody: Production I
Jones (East Wing 1), 13:30, Tuesday 8 Sept 2009
Chair: Fred Cummins, University College Dublin, Ireland
Did You Say a BLUE Banana? The Prosody of Contrast and Abnormality in Bulgarian and Dutch
Diana V. Dimitrova, Gisela Redeker, John C.J. Hoeks; Rijksuniversiteit Groningen, The Netherlands
Tue-Ses2-O2-1, Time: 13:30
In a production experiment on Bulgarian that was based on a previous study on Dutch [1], we investigated the role of prosody when linguistic and extra-linguistic information coincide or contradict each other. Speakers described abnormally colored fruits in conditions where contrastive focus and discourse relations were varied. We found that the coincidence of contrast and abnormality enhances accentuation in Bulgarian as it did in Dutch. Surprisingly, when both factors are in conflict, the prosodic prominence of abnormality often overruled focus accentuation in both Bulgarian and Dutch, though the languages also show marked differences.
A Quantitative Study of F0 Peak Alignment and
Sentence Modality
Pitch Adaptation in Different Age Groups: Boundary
Tones versus Global Pitch
Hansjörg Mixdorff 1 , Hartmut R. Pfitzinger 2 ; 1 BHT
Berlin, Germany; 2 Christian-Albrechts-Universität zu
Kiel, Germany
Marie Nilsenová, Marc Swerts, Véronique Houtepen,
Heleen Dittrich; Tilburg University, The Netherlands
Tue-Ses2-O2-2, Time: 13:50
Linguistic adaptation is a process by which interlocutors adjust
their production to their environment. In the context of human-computer interaction, past research showed that adult speakers
adapt to computer speech in various manners but less is known
about younger age groups. We report the results of three priming
experiments in which children in different age groups interacted
with a prerecorded computer voice. The goal of the experiments
was to determine to what extent children copy the pitch properties
of the interlocutor. Based on the dialogue model of Pickering &
Garrod, we predicted that children would be more likely to adapt
to pitch primes that were meaningful in the context (high or low
boundary tone) compared to primes with no apparent functionality
(global pitch manipulation). This prediction was confirmed by our
data. Moreover, we observed a decreasing trend in adaptation in
the older age groups compared to the younger ones.
Tue-Ses2-O2-5, Time: 14:50
The current study examines the relationship between prosodic
accent labels assigned in the Kiel Corpus of Spontaneous Speech IV,
Isačenko’s intoneme classes of the underlying accents and the associated parameters of the Fujisaki model. Among other findings,
there is a close connection between early peaks and information
intonemes, as well as late peaks and non-terminal intonemes. The
majority of tokens within both intoneme classes, however, are
associated with medial peaks. Precise analysis of alignment shows
that accent command offset times for information intonemes
are significantly earlier than for non-terminal intonemes. This
suggests that the anchoring of the relevant tonal transition could
be more important for separating different intonational categories
than that of the F0 peak.
Closely Related Languages, Different Ways of
Realizing Focus
Backchannel-Inviting Cues in Task-Oriented
Dialogue
Szu-wei Chen 1 , Bei Wang 2 , Yi Xu 3 ; 1 National Chung
Cheng University, Taiwan; 2 Minzu University of China,
China; 3 University College London, UK
Agustín Gravano, Julia Hirschberg; Columbia
University, USA
Tue-Ses2-O2-6, Time: 15:10
Tue-Ses2-O2-3, Time: 14:10
We investigated how focus was prosodically realized in Taiwanese,
Taiwan Mandarin and Beijing Mandarin by monolingual and bilingual speakers. Acoustic analyses showed that all speakers raised
pitch and intensity of focused words, but only Beijing Mandarin
speakers lowered pitch and intensity of post-focus words. Cross-group differences in duration were mixed. When listening to
stimuli from their own language groups, subjects from Beijing
had over 80% focus recognition rate, while those from Taiwan
had less than 70% recognition rate. This difference is mainly due
to presence/absence of post-focus compression. These findings
have implications for prosodic typology, language contact and
bilingualism.
Cross-Variety Rhythm Typology in Portuguese
Plínio A. Barbosa 1, M. Céu Viana 2, Isabel Trancoso 3; 1 State University of Campinas, Brazil; 2 CLUL, Portugal; 3 INESC-ID Lisboa/IST, Portugal
Tue-Ses2-O2-4, Time: 14:30
This paper aims at proposing a measure of speech rhythm based
on the inference of the coupling strength between the syllable
oscillator and the stress group oscillator of an underlying coupled oscillators model. This coupling is inferred from the linear
regression between the stress group duration and the number
of syllables within the group, as well as from the multiple linear
regression between the same parameters and an estimate of phrase
stress prominence. This technique is applied to compare the
rhythmic differences between European and Brazilian Portuguese
in two speaking styles and three speakers per variety. Compared
with a syllable-sized normalised PVI, the findings suggest that
the coupling strength captures better the perceptual effects of
the speakers’ renditions. Furthermore, it shows that stress group
duration is much better predicted by adding phrase stress prominence to the regression.
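For illustration, the two regressions described above can be sketched with toy measurements; treating the fitted slope and weights as a proxy for coupling strength follows the abstract only loosely.

    # Sketch: regress stress-group duration on syllable count (and a prominence estimate).
    import numpy as np

    n_syll     = np.array([2, 3, 3, 4, 5, 6, 4, 2])                       # syllables per stress group (toy)
    prominence = np.array([0.4, 0.9, 0.3, 1.2, 0.8, 1.5, 0.5, 0.2])       # toy phrase-stress estimate
    duration   = np.array([0.38, 0.55, 0.50, 0.70, 0.82, 1.05, 0.66, 0.35])  # seconds (toy)

    # simple regression: duration ~ a * n_syll + b
    a, b = np.polyfit(n_syll, duration, 1)

    # multiple regression: duration ~ w1 * n_syll + w2 * prominence + c
    X = np.column_stack([n_syll, prominence, np.ones_like(prominence)])
    coef, res, *_ = np.linalg.lstsq(X, duration, rcond=None)

    print(f"slope per syllable: {a:.3f} s", "multiple-regression weights:", np.round(coef, 3))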
We examine backchannel-inviting cues — distinct prosodic,
acoustic and lexical events in the speaker’s speech that tend to
precede a short response produced by the interlocutor to convey
continued attention — in the Columbia Games Corpus, a large
corpus of task-oriented dialogues. We show that the likelihood
of occurrence of a backchannel increases quadratically with the
number of cues conjointly displayed by the speaker. Our results
are important for improving the coordination of conversational
turns in interactive voice-response systems, so that systems can
produce backchannels in appropriate places, and so that they can
elicit backchannels from users in expected places.
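A toy sketch of fitting a quadratic to (number of cues, backchannel frequency) pairs is shown below; the counts are invented and serve only to illustrate the reported shape of the relationship, not the corpus statistics themselves.

    # Sketch: fit a quadratic to (number of cues, backchannel frequency) pairs (toy counts).
    import numpy as np

    cues = np.arange(0, 6)                                    # cues displayed conjointly
    freq = np.array([0.02, 0.05, 0.11, 0.21, 0.34, 0.52])     # hypothetical backchannel rates

    a, b, c = np.polyfit(cues, freq, 2)                       # freq ~ a*cues^2 + b*cues + c
    fitted = np.polyval([a, b, c], cues)
    print(np.round(fitted, 3))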
Tue-Ses2-O3 : ASR: Spoken Language
Understanding
Fallside (East Wing 2), 13:30, Tuesday 8 Sept 2009
Chair: Lin-shan Lee, National Taiwan University, Taiwan
What’s in an Ontology for Spoken Language
Understanding
Silvia Quarteroni, Giuseppe Riccardi, Marco Dinarelli;
Università di Trento, Italy
Tue-Ses2-O3-1, Time: 13:30
Current Spoken Language Understanding systems rely either on
hand-written semantic grammars or on flat attribute-value sequence labeling. In both approaches, concepts and their relations
(when modeled at all) are domain-specific, thus making it difficult
to expand, port or share the domain model.
To address this issue, we introduce: 1) a domain model based on
an ontology where concepts are classified as either predicates or arguments; 2) the modeling of relations between such concept
classes in terms of classical relations as defined in lexical semantics. We study and analyze our approach on the spoken dialog
corpus collected within a problem-solving task in the LUNA project.
We evaluate the coverage and relevance of the ontology for the
interpretation of spoken utterances.
A Fundamental Study of Shouted Speech for
Acoustic-Based Security System
Semantic Role Labeling with Discriminative Feature
Selection for Spoken Language Understanding
Hiroaki Nanjo 1, Hiroki Mikami 1, Hiroshi Kawano 2, Takanobu Nishiura 2; 1 Ryukoku University, Japan; 2 Ritsumeikan University, Japan
Chao-Hong Liu, Chung-Hsien Wu; National Cheng
Kung University, Taiwan
Tue-Ses2-O3-2, Time: 13:30
In the task of Spoken Language Understanding (SLU), Intent Classification techniques have been applied to different domains of
Spoken Dialog Systems (SDS). Recently it was shown that intent
classification performance can be improved with Semantic Role
(SR) information. However, using SR information for SDS encounters two difficulties: 1) the state-of-the-art Automatic Speech
Recognition (ASR) systems provide less than 80% recognition rate,
2) speech always exhibits ungrammatical expressions. This study
presents an approach to Semantic Role Labeling (SRL) with discriminative feature selection to improve the performance of SDS.
Bernoulli event features on word and part-of-speech sequences
are introduced for better representation of the ASR recognized
text. SRL and SLU experiments conducted using CoNLL-2005 SRL
corpus and ATIS spoken corpus show that the proposed feature
selection method with Bernoulli event features can improve intent
classification by 3.4% and the performance of SRL.
Tue-Ses2-O3-6, Time: 13:30
A speech processing system for ensuring safety and security, namely an acoustic-based security system, is addressed. Focusing on indoor security such as school security, we study an advanced acoustic-based system which can discriminate emergency shouts from other speech events based on the understanding of speech events. In this paper, we describe fundamental results on shouted speech.
Evaluating the Potential Utility of ASR N-Best Lists
for Incremental Spoken Dialogue Systems
Timo Baumann, Okko Buß, Michaela Atterer, David
Schlangen; Universität Potsdam, Germany
Tue-Ses2-O3-3, Time: 13:30
The potential of using ASR n-best lists for dialogue systems has
often been recognised (if less often realised): it is often the case
that even when the top-ranked hypothesis is erroneous, a better
one can be found at a lower rank. In this paper, we describe
metrics for evaluating whether the same potential carries over to
incremental dialogue systems, where ASR output is consumed and
reacted upon while speech is still ongoing. We show that even
small N can provide an advantage for semantic processing, at a
cost of a computational overhead.
Improving the Recognition of Names by
Document-Level Clustering
Tue-Ses2-O4 : Speaker Diarisation
Holmes (East Wing 3), 13:30, Tuesday 8 Sept 2009
Chair: Yannis Stylianou, FORTH, Greece
A Study of New Approaches to Speaker Diarization
Douglas Reynolds 1, Patrick Kenny 2, Fabio Castaldo 3; 1 MIT, USA; 2 CRIM, Canada; 3 Politecnico di Torino, Italy
Tue-Ses2-O4-1, Time: 13:30
Named entities are of great importance in spoken document processing, but speech recognizers often get them wrong because they
are infrequent. A name correction method based on document-level name clustering is proposed in this paper, consisting of
three components: named entity detection, name clustering, and
name hypothesis selection. We compare the performance of this
method to oracle conditions and show that the oracle gain is a 23%
reduction in name character error for Mandarin and the automatic
approach achieves about 20% of that.
This paper reports on work carried out at the 2008 JHU Summer
Workshop examining new approaches to speaker diarization. Four
different systems were developed and experiments were conducted
using summed-channel telephone data from the 2008 NIST SRE.
The systems are a baseline agglomerative clustering system, a
new Variational Bayes system using eigenvoice speaker models, a
streaming system using a mix of low dimensional speaker factors
and classic segmentation and clustering, and a new hybrid system
combining the baseline system with a new cosine-distance speaker
factor clustering. Results are presented using the Diarization Error
Rate as well as by the EER when using diarization outputs for a
speaker detection task. The best configurations of the diarization
system produced DERs of 3.5—4.6% and we demonstrate a weak
correlation of EER and DER.
Robust Dependency Parsing for Spoken Language
Understanding of Spontaneous Speech
Redefining the Bayesian Information Criterion for
Speaker Diarisation
Frederic Bechet 1 , Alexis Nasr 2 ; 1 LIA, France; 2 LIF,
France
Themos Stafylakis, Vassilis Katsouros, George
Carayannis; Athena Research Center, Greece
Bin Zhang, Wei Wu, Jeremy G. Kahn, Mari Ostendorf;
University of Washington, USA
Tue-Ses2-O3-4, Time: 13:30
Tue-Ses2-O3-5, Time: 13:30
Tue-Ses2-O4-2, Time: 13:50
We describe in this paper a syntactic parser for spontaneous
speech geared towards the identification of verbal subcategorization frames. The parser proceeds in two stages. The first stage is
based on generic syntactic resources for French. The second stage
is a reranker which is specially trained for a given application. The
parser is evaluated on the French media spoken dialogue corpus.
A novel approach to the Bayesian Information Criterion (BIC) is
introduced. The new criterion redefines the penalty terms of
the BIC, such that each parameter is penalized with the effective sample size it is trained with. Contrary to Local-BIC, the proposed
criterion scores overall clustering hypotheses and therefore is
not restricted to hierarchical clustering algorithms. Contrary to
Global-BIC, it provides a local dissimilarity measure that depends only on the statistics of the examined clusters and not on the overall
sample size. We tested our criterion with two benchmark tests
and found significant improvement in performance in the speaker
diarisation task.
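The contrast between a global BIC penalty and a per-parameter penalty based on each cluster's own sample size can be sketched as follows; the numbers are toy values and the exact penalty used in the paper may differ.

    # Sketch: global-BIC penalty vs. a per-parameter "effective sample size" penalty.
    import math

    def bic_global(loglik, n_params, n_total, lam=1.0):
        return loglik - 0.5 * lam * n_params * math.log(n_total)

    def bic_effective(loglik, params_per_cluster, frames_per_cluster, lam=1.0):
        # each cluster's parameters are penalized with that cluster's own frame count
        penalty = sum(0.5 * lam * d * math.log(n)
                      for d, n in zip(params_per_cluster, frames_per_cluster))
        return loglik - penalty

    loglik = -12500.0                       # toy total log-likelihood of a clustering
    params = [60, 60, 60]                   # parameters per speaker cluster (toy)
    frames = [900, 2500, 600]               # frames assigned to each cluster (toy)
    print(bic_global(loglik, sum(params), sum(frames)),
          bic_effective(loglik, params, frames))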
Improved Speaker Diarization of Meeting Speech
with Recurrent Selection of Representative Speech
Segments and Participant Interaction Pattern
Modeling
Speaker Diarization Using Divide-and-Conquer
Shih-Sian Cheng 1, Chun-Han Tseng 2, Chia-Ping Chen 2, Hsin-Min Wang 1; 1 Academia Sinica, Taiwan; 2 National Sun Yat-Sen University, Taiwan
Tue-Ses2-O4-3, Time: 14:10
Speaker diarization systems usually consist of two core components: speaker segmentation and speaker clustering. The current
state-of-the-art speaker diarization systems usually apply hierarchical agglomerative clustering (HAC) for speaker clustering after
segmentation. However, HAC’s quadratic computational complexity with respect to the number of data samples inevitably limits
its application in large-scale data sets. In this paper, we propose a
divide-and-conquer (DAC) framework for speaker diarization. It recursively partitions the input speech stream into two sub-streams,
performs diarization on them separately, and then combines the
diarization results obtained from them using HAC. The results of
experiments conducted on RT-02 and RT-03 broadcast news data
show that the proposed framework is faster than the conventional
segmentation and clustering-based approach while achieving comparable diarization accuracy. Moreover, the proposed framework
obtains a higher speedup over the conventional approach on a
larger test data set.
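The recursion itself can be sketched as below; the two helper functions are stubs standing in for the base-case diarizer and the HAC merging step rather than the authors' implementation.

    # Skeleton of the divide-and-conquer recursion (the two helpers are stand-in stubs).

    def diarize_directly(segment):
        # base case: run a conventional segmentation + clustering diarizer (stub)
        return [("spk?", segment[0], segment[1])]

    def merge_with_hac(left_labels, right_labels):
        # stub for agglomerative clustering that reconciles the two label sets
        return left_labels + right_labels

    def dac_diarize(segment, min_len=60.0):
        start, end = segment
        if end - start <= min_len:                   # small enough: diarize directly
            return diarize_directly(segment)
        mid = 0.5 * (start + end)                    # split the stream in two
        left  = dac_diarize((start, mid), min_len)
        right = dac_diarize((mid, end), min_len)
        return merge_with_hac(left, right)

    print(len(dac_diarize((0.0, 300.0))))            # 300 s -> 8 leaf calls merged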
KL Realignment for Speaker Diarization with
Multiple Feature Streams
Kyu J. Han, Shrikanth S. Narayanan; University of
Southern California, USA
Tue-Ses2-O4-6, Time: 15:10
In this work we describe two distinct novel improvements to our
speaker diarization system, previously proposed for analysis of
meeting speech. The first approach focuses on recurrent selection
of representative speech segments for speaker clustering while the
other is based on participant interaction pattern modeling. The
former selects speech segments with high relevance to speaker
clustering, especially from a robust cluster modeling perspective,
and keeps updating them throughout clustering procedures. The
latter statistically models conversation patterns between meeting
participants and applies it as a priori information when refining
diarization results. Experimental results reveal that the two proposed approaches provide performance enhancement by 29.82%
(relative) in terms of diarization error rate in tests on 13 meeting
excerpts from various meeting speech corpora.
Tue-Ses2-P1 : Speech Analysis and
Processing II
Deepu Vijayasenan, Fabio Valente, Hervé Bourlard;
IDIAP Research Institute, Switzerland
Hewison Hall, 13:30, Tuesday 8 Sept 2009
Chair: A. Ariyaeeinia, University of Hertfordshire, UK
Tue-Ses2-O4-4, Time: 14:30
This paper aims at investigating the use of Kullback-Leibler (KL)
divergence based realignment with application to speaker diarization. The use of KL divergence based realignment operates directly
on the speaker posterior distribution estimates and is compared
with traditional realignment performed using an HMM/GMM system. We hypothesize that using posterior estimates to re-align speaker boundaries is more robust than Gaussian mixture models in the case of multiple feature streams with different statistical properties. Experiments are run on the NIST RT06 data. These experiments reveal that in the case of conventional MFCC features the two approaches yield the same performance, while the KL based system outperforms the HMM/GMM re-alignment in the case of a combination
of multiple feature streams (MFCC and TDOA).
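For illustration, a symmetrised KL divergence between two discrete speaker-posterior vectors, the kind of quantity such a realignment step compares, can be computed as follows; the values are toy numbers, not the exact IDIAP formulation.

    # Sketch: (symmetrised) KL divergence between discrete speaker-posterior vectors.
    import numpy as np

    def kl(p, q, eps=1e-10):
        p = np.asarray(p, float) + eps
        q = np.asarray(q, float) + eps
        p, q = p / p.sum(), q / q.sum()
        return float(np.sum(p * np.log(p / q)))

    def symmetric_kl(p, q):
        return 0.5 * (kl(p, q) + kl(q, p))

    frame_posterior   = [0.70, 0.20, 0.10]     # P(speaker | frame), toy values
    cluster_posterior = [0.55, 0.30, 0.15]     # average posterior of a candidate cluster
    print(f"{symmetric_kl(frame_posterior, cluster_posterior):.4f}")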
Speech Overlap Detection in a Two-Pass Speaker
Diarization System
Marijn Huijbregts 1 , David A. van Leeuwen 2 ,
Franciska M.G. de Jong 1 ; 1 University of Twente, The
Netherlands; 2 TNO Human Factors, The Netherlands
Tue-Ses2-O4-5, Time: 14:50
In this paper we present the two-pass speaker diarization system
that we developed for the NIST RT09s evaluation. In the first pass
of our system a model for speech overlap detection is generated
automatically. This model is used in two ways to reduce the
diarization errors due to overlapping speech. First, it is used in a
second diarization pass to remove overlapping speech from the
data while training the speaker models. Second, it is used to find
speech overlap for the final segmentation so that overlapping
speech segments can be generated. The experiments show that our
overlap detection method improves the performance of all three
of our system configurations.
Spectral and Temporal Modulation Features for
Phonetic Recognition
Stephen A. Zahorian, Hongbing Hu, Zhengqing Chen,
Jiang Wu; Binghamton University, USA
Tue-Ses2-P1-1, Time: 13:30
Recently, the modulation spectrum has been proposed and found
to be a useful source of speech information. The modulation
spectrum represents longer term variations in the spectrum and
thus implicitly requires features extracted from much longer
speech segments compared to MFCCs and their delta terms. In
this paper, a Discrete Cosine Transform (DCT) analysis of the log
magnitude spectrum combined with a Discrete Cosine Series (DCS)
expansion of DCT coefficients over time is proposed as a method
for capturing both the spectral and modulation information. These
DCT/DCS features can be computed so as to emphasize frequency
resolution or time resolution or a combination of the two factors.
Several variations of the DCT/DCS features were evaluated with
phonetic recognition experiments using TIMIT and its telephone
version (NTIMIT). Best results obtained with a combined feature
set are 73.85% for TIMIT and 62.5% for NTIMIT. The modulation
features are shown to be far more important than the spectral
features for automatic speech recognition and far more noise
robust.
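A rough sketch of the DCT over frequency followed by a cosine expansion over time is given below; the truncation sizes and block length are arbitrary choices, not the values used in the paper.

    # Sketch: DCT over frequency per frame, then a cosine expansion over time per coefficient.
    import numpy as np
    from scipy.fft import dct

    def dct_dcs_features(frames, n_spec=13, n_time=6):
        """frames: (n_frames, n_fft_bins) magnitude spectra of one block of speech."""
        log_mag = np.log(frames + 1e-8)
        spec_dct = dct(log_mag, type=2, axis=1, norm="ortho")[:, :n_spec]   # per-frame DCT
        time_dcs = dct(spec_dct, type=2, axis=0, norm="ortho")[:n_time, :]  # trajectory DCT
        return time_dcs.flatten()               # n_spec * n_time features for the block

    block = np.abs(np.random.randn(50, 129))    # toy: 50 frames of 129-bin spectra
    print(dct_dcs_features(block).shape)        # (78,)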
Use of Harmonic Phase Information for Polarity
Detection in Speech Signals
Ibon Saratxaga, Daniel Erro, Inmaculada Hernáez,
Iñaki Sainz, Eva Navas; University of the Basque
Country, Spain
Tue-Ses2-P1-2, Time: 13:30
Phase information resulting from the harmonic analysis of speech can be used very successfully to determine the polarity of
a voiced speech segment. In this paper we present two algorithms
which calculate the signal polarity from this information. One
is based on the effect of the glottal signal on the phase of the
first harmonics and the other on the relative phase shifts between
the harmonics. The detection rates of these two algorithms are compared against other established algorithms.
Finite Mixture Spectrogram Modeling for Multipitch
Tracking Using A Factorial Hidden Markov Model
Analysis of Lombard Speech Using Excitation Source
Information
Michael Wohlmayr, Franz Pernkopf; Graz University of
Technology, Austria
G. Bapineedu, B. Avinash, Suryakanth V. Gangashetty,
B. Yegnanarayana; IIIT Hyderabad, India
Tue-Ses2-P1-3, Time: 13:30
In this paper, we present a simple and efficient feature modeling
approach for tracking the pitch of two speakers speaking simultaneously. We model the spectrogram features using Gaussian
Mixture Models (GMMs) in combination with the Minimum Description Length (MDL) model selection criterion. This makes it possible to automatically determine the number of Gaussian components
depending on the available data for a specific pitch pair. A factorial
hidden Markov model (FHMM) is applied for tracking. We compare
our approach to two methods based on correlogram features [1].
Those methods either use a HMM [1] or a FHMM [7] for tracking.
Experimental results on the Mocha-TIMIT database [2] show that
our proposed approach significantly outperforms the correlogram-based methods for speech utterances mixed at 0 dB. The superior
performance even holds when adding white Gaussian noise to the
mixed speech utterances during pitch tracking.
Group-Delay-Deviation Based Spectral Analysis of
Speech
Tue-Ses2-P1-6, Time: 13:30
This paper examines the Lombard effect on the excitation features
in speech production. These features correspond mostly to the
acoustic features at subsegmental (< pitch period) level. The
instantaneous fundamental frequency F0 (i.e., pitch), the strength
of excitation at the instants of significant excitation and a loudness
measure reflecting the sharpness of the impulse-like excitation
around epochs are used to represent the excitation features at the
subsegmental level. The Lombard effect influences the pitch and
the loudness. The extent of Lombard effect on speech depends
on the nature and level (or intensity) of the external feedback that
causes the Lombard effect.
A Comparison of Linear and Nonlinear
Dimensionality Reduction Methods Applied to
Synthetic Speech
Andrew Errity, John McKenna; Dublin City University,
Ireland
Anthony Stark, Kuldip Paliwal; Griffith University,
Australia
Tue-Ses2-P1-7, Time: 13:30
Tue-Ses2-P1-4, Time: 13:30
In this paper, we investigate a new method for extracting useful
information from the group delay spectrum of speech. The group
delay spectrum is often poorly behaved and noisy. In the literature,
various methods have been proposed to address this problem.
However, to make the group delay a more tractable function,
these methods have typically relied upon some modification of the
underlying speech signal. The method proposed in this paper does
not require such modifications. To accomplish this, we investigate
a new function derived from the group delay spectrum, namely the
group delay deviation. We use it for both narrowband analysis and
wideband analysis of speech and show that this function exhibits
meaningful formant and pitch information.
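For illustration, the standard group delay of a frame and one possible "deviation" from a smoothed trend can be computed as follows; the deviation definition here is an assumption for the sketch, not necessarily the authors' exact function.

    # Sketch: group delay of a frame, and a "deviation" from its smoothed trend
    # (the deviation definition below is an assumption, not the paper's formula).
    import numpy as np

    def group_delay(frame, n_fft=512):
        n = np.arange(len(frame))
        X = np.fft.rfft(frame, n_fft)
        Y = np.fft.rfft(n * frame, n_fft)
        denom = np.abs(X) ** 2 + 1e-12
        return (X.real * Y.real + X.imag * Y.imag) / denom   # tau(w), in samples

    def group_delay_deviation(frame, n_fft=512, smooth=15):
        gd = group_delay(frame, n_fft)
        trend = np.convolve(gd, np.ones(smooth) / smooth, mode="same")  # local average
        return gd - trend

    frame = np.hanning(400) * np.random.randn(400)   # toy windowed frame
    print(group_delay_deviation(frame).shape)        # (257,)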
Speaker Dependent Mapping for Low Bit Rate
Coding of Throat Microphone Speech
In this study a number of linear and nonlinear dimensionality reduction methods are applied to high dimensional representations
of synthetic speech to produce corresponding low dimensional
embeddings. Several important characteristics of the synthetic
speech, such as formant frequencies and f0, are known and controllable prior to dimensionality reduction. The degree to which
these characteristics are retained after dimensionality reduction is
examined in visualisation and classification experiments. Results
of these experiments indicate that each method is capable of discovering meaningful low dimensional representations of synthetic
speech and that the nonlinear methods may outperform linear
methods in some cases.
ZZT-Domain Immiscibility of the Opening and
Closing Phases of the LF GFM Under Frame Length
Variations
C.F. Pedersen, O. Andersen, P. Dalsgaard; Aalborg
University, Denmark
Anand Joseph M. 1, B. Yegnanarayana 1, Sanjeev Gupta 2, M.R. Kesheorey 2; 1 IIIT Hyderabad, India; 2 Center for Artificial Intelligence & Robotics, India
Tue-Ses2-P1-8, Time: 13:30
Tue-Ses2-P1-5, Time: 13:30
Throat microphones (TM) which are robust to background noise
can be used in environments with high levels of background noise.
Speech collected using TM is perceptually less natural. The objective of this paper is to map the spectral features (represented in the
form of cepstral features) of TM and close speaking microphone
(CSM) speech to improve the former’s perceptual quality, and to
represent it in an efficient manner for coding. The spectral mapping of TM and CSM speech is done using a multilayer feed-forward
neural network, which is trained from features derived from TM
and CSM speech. The sequence of estimated CSM spectral features
is quantized and coded as a sequence of codebook indices using vector quantization. The sequence of codebook indices, the pitch contour and the energy contour derived from the TM signal are used to store/transmit the TM speech information efficiently. At the receiver, the all-pole system corresponding to the estimated CSM spectral vectors is excited by a synthetic residual to generate the speech signal.
Current research has proposed a non-parametric speech waveform
representation (rep) based on zeros of the z-transform (ZZT) [1]
[2]. Empirically, the ZZT rep has successfully been applied in
discriminating the glottal and vocal tract components in pitch-synchronously windowed speech by using the unit circle (UC) as
discriminant [1] [2]. Further, similarity between ZZT reps of windowed speech, glottal flow waveforms, and waveforms of glottal
flow opening and closing phases has been demonstrated [1] [3].
Therefore, the underlying cause of the separation on either side of
the UC can be analyzed via the individual ZZT reps of the opening
and closing phase waveforms; the waveforms are generated by the
LF glottal flow model (GFM) [1]. The present paper demonstrates
this cause and effect analytically and thereby supplement the
previous empirical works. Moreover, this paper demonstrates that
immiscibility is variant under changes in frame lengths; lengths
that maximize or minimize immiscibility are presented.
Dimension Reducing of LSF parameters Based on
Radial Basis Function Neural Network
Artificial Nasalization of Speech Sounds Based on
Pole-Zero Models of Spectral Relations Between
Mouth and Nose Signals
Hongjun Sun, Jianhua Tao, Huibin Jia; Chinese
Academy of Sciences, China
Karl Schnell, Arild Lacroix; Goethe-Universität
Frankfurt, Germany
Tue-Ses2-P1-9, Time: 13:30
Tue-Ses2-P1-12, Time: 13:30
In this paper, we investigate a novel method for transforming
line spectral frequency (LSF) parameters to lower dimensional
coefficients. A radial basis function neural network (RBF NN) based transformation model is used to fit LSF vectors. In the training process, two criteria, mean squared error and weighted mean squared error, are used to measure the distance between
original vector and approximate vector. Besides, features of
LSF parameters are taken into account to supervise the training
process. As a result, LSF vectors are represented by the coefficient
vectors of transforming model. The experimental results reveal
that 24-order LSF vector can be transformed to 15-dimension coefficient vector with an average spectral distortion of approximately
1 dB. Subjective evaluation shows that the proposed transformation method does not lead to a significant decrease in voice quality.
In this contribution, a method for nasalization of speech sounds is
proposed based on model-based spectral relations between mouth
and nose signals. For that purpose, the mouth and nose signals
of speech utterances are recorded simultaneously. The spectral
relations of the mouth and nose signals are modeled by pole-zero
models. A filtering of non-nasalized speech signals by these
pole-zero models yields approximately nasal signals, which can be
utilized to nasalize the speech signals. The artificial nasalization
can be exploited to modify speech units of a non-nasalized or
weakly nasalized representation which should be nasalized due to
coarticulation or for the production of foreign words.
Error Metrics for Impaired Auditory Nerve
Responses of Different Phoneme Groups
Andrew Hines, Naomi Harte; Trinity College Dublin,
Ireland
Tue-Ses2-P1-10, Time: 13:30
An auditory nerve model allows faster investigation of new signal
processing algorithms for hearing aids. This paper presents a
study of the degradation of auditory nerve (AN) responses at a
phonetic level for a range of sensorineural hearing losses and
flat audiograms. The AN model of Zilany & Bruce was used to
compute responses to a diverse set of phoneme rich sentences
from the TIMIT database. The characteristics of both the average
discharge rate and spike timing of the responses are discussed.
The experiments demonstrate that a mean absolute error metric
provides a useful measure of average discharge rates but a more
complex measure is required to capture spike timing response
errors.
Characterizing Speaker Variability Using Spectral
Envelopes of Vowel Sounds
A.N. Harish, D.R. Sanand, S. Umesh; IIT Kanpur, India
Tue-Ses2-P1-13, Time: 13:30
In this paper, we present a study to understand the relation among
spectra of speakers enunciating the same sound and investigate
the issue of uniform versus non-uniform scaling. There is a lot
of interest in understanding this relation as speaker variability
is a major source of concern in many applications including
Automatic Speech Recognition (ASR). Using dynamic programming,
we find mapping relations between smoothed spectral envelopes
of speakers enunciating the same sound and show that these relations are not linear but have a consistent non-uniform behavior.
This non-uniform behavior is also shown to vary across vowels.
Through a series of experiments, we show that using the observed
non-uniform relation provides better vowel normalization than
just a simple linear scaling relation. All results in this paper are
based on vowel data from TIMIT, Hillenbrand et al. and North
Texas databases.
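The dynamic-programming mapping between two smoothed spectral envelopes can be pictured with the small sketch below. The symmetric step pattern, the absolute-difference local cost and the final bin-averaging are illustrative assumptions, not the authors' exact procedure.

# --- illustrative sketch: DP alignment of two spectral envelopes (assumed setup) ---
import numpy as np

def envelope_mapping(env_a, env_b):
    """Align two log-magnitude envelopes with DTW and return, for each
    frequency bin of speaker A, the average matched bin of speaker B."""
    n, m = len(env_a), len(env_b)
    cost = np.abs(env_a[:, None] - env_b[None, :])      # local distance
    acc = np.full((n, m), np.inf)
    acc[0, 0] = cost[0, 0]
    for i in range(n):
        for j in range(m):
            if i == 0 and j == 0:
                continue
            best_prev = min(acc[i - 1, j] if i else np.inf,
                            acc[i, j - 1] if j else np.inf,
                            acc[i - 1, j - 1] if i and j else np.inf)
            acc[i, j] = cost[i, j] + best_prev
    # backtrack the monotonic warping path
    i, j, path = n - 1, m - 1, [(n - 1, m - 1)]
    while i or j:
        moves = [(i - 1, j - 1), (i - 1, j), (i, j - 1)]
        i, j = min((p for p in moves if p[0] >= 0 and p[1] >= 0),
                   key=lambda p: acc[p])
        path.append((i, j))
    path.reverse()
    # average the matched B-bins for every A-bin -> a (possibly non-linear) scaling curve
    mapping = np.zeros(n)
    counts = np.zeros(n)
    for a_bin, b_bin in path:
        mapping[a_bin] += b_bin
        counts[a_bin] += 1
    return mapping / counts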
Analysis of Band Structures for Speaker-Specific
Information in FM Feature Extraction
Tharmarajah Thiruvaran, Eliathamby Ambikairajah,
Julien Epps; University of New South Wales, Australia
Tue-Ses2-P1-11, Time: 13:30
Frequency modulation (FM) features are typically extracted using a
filterbank, usually based on an auditory frequency scale; however,
there is psychophysical evidence to suggest that this scale may
not be optimal for extracting speaker-specific information. In this
paper, speaker-specific information in FM features is analyzed as
a function of the filterbank structure at the feature, model and
classification stages. Scatter matrix based separation measures
at the feature level and Kullback-Leibler distance based measures
at the model level are used to analyze the discriminative
contributions of the different bands. Then a series of speaker
recognition experiments are performed to study how each band
of the FM feature contributes to speaker recognition. A new
filter bank structure is proposed that attempts to maximize the
speaker-specific information in the FM feature for telephone data.
Tue-Ses2-P2 : Speech Processing with Audio
or Audiovisual Input
Hewison Hall, 13:30, Tuesday 8 Sept 2009
Chair: Robert I. Damper, University of Southampton, UK
Application of Differential Microphone Array for
IS-127 EVRC Rate Determination Algorithm
Henry Widjaja, Suryoadhi Wibowo; Institut Teknologi
Telkom, Indonesia
Tue-Ses2-P2-1, Time: 13:30
A differential microphone array is known to have low sensitivity to
distant sound sources. Such characteristics may be advantageous
in voice activity detection, where it can be assumed that the target
speaker is close and background noise sources are distant. In
this paper we develop a simple modification to the EVRC rate
determination algorithm (EVRC RDA) that exploits the noise-canceling
property of the differential microphone array to improve its performance in highly dynamic noise environments. Comprehensive
computer simulations show that the modified algorithm outperforms the original EVRC RDA in all tested noise conditions.
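The noise-cancelling behaviour exploited here can be illustrated with a first-order differential pair: subtracting a delayed copy of the rear microphone from the front one attenuates distant, roughly planar sound far more than a nearby talker. The sampling rate, spacing and frame length below are arbitrary illustration values, not the IS-127 configuration.

# --- illustrative sketch: first-order differential microphone pair (assumed geometry) ---
import numpy as np

def differential_output(front, rear, fs=16000.0, spacing_m=0.05, c=343.0):
    """Return front[n] - rear[n - d], where d is the acoustic travel time
    across the pair, rounded to the nearest sample."""
    delay = int(round(spacing_m / c * fs))
    delayed_rear = np.concatenate([np.zeros(delay), rear[:len(rear) - delay]])
    return front - delayed_rear

def frame_energy_ratio(front, rear, frame=160):
    """Compare differential-output energy with single-microphone energy per frame;
    a low ratio suggests distant (background) sound, a high ratio a nearby talker."""
    diff = differential_output(front, rear)
    n_frames = len(diff) // frame
    ratios = []
    for k in range(n_frames):
        seg = slice(k * frame, (k + 1) * frame)
        e_diff = np.sum(diff[seg] ** 2) + 1e-12
        e_front = np.sum(front[seg] ** 2) + 1e-12
        ratios.append(e_diff / e_front)
    return np.array(ratios)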
Estimating the Position and Orientation of an
Acoustic Source with a Microphone Array Network
Alberto Yoshihiro Nakano, Seiichi Nakagawa,
Kazumasa Yamamoto; Toyohashi University of
Technology, Japan
Tue-Ses2-P2-2, Time: 13:30
We propose a method that finds the position and orientation of
an acoustic source in an enclosed environment. For each of eight
T-shaped arrays forming a microphone array network, the time
delay of arrival (TDOA) of signals from microphone pairs, a source
position candidate, and energy related features are estimated.
These form the input for artificial neural networks (ANNs), the
purpose of which is to provide indirectly a more precise position
of the source and, additionally, to estimate the source’s orientation
using various combinations of the estimated parameters. The best
combination of parameters (TDOAs and microphone positions)
yields a 21.8% reduction in the mean average position error compared to baselines, and a correct orientation ratio higher than
99.0%. The position estimation baselines include two estimation
methods: a TDOA-based method that finds the source position
geometrically, and the SRP-PHAT that finds the most likely source
position by spatial exploration.
A Non-Intrusive Signal-Based Model for Speech
Quality Evaluation Using Automatic Classification of
Background Noises
Adrien Leman 1, Julien Faure 1, Etienne Parizet 2;
1 Orange Labs, France; 2 LVA, France
Tue-Ses2-P2-5, Time: 13:30
This paper describes an original method for speech quality evaluation in the presence of different types of background noises
for a range of communications (mobile, VoIP, RTC). The model
is obtained from subjective experiments described in [1]. These
experiments show that background noise can be more or less
tolerated by listeners, depending on the sources of noise that can
be identified. Using a classification method, the background noises
can be classified into four groups. For each one of the four groups,
a relation between loudness of the noise and speech quality is
proposed.
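A minimal sketch of the two-stage idea: assign the background noise to one of four groups, then predict quality from its loudness with a group-specific linear relation. The group names, prototype features and coefficients below are placeholders, not the model fitted in the paper.

# --- illustrative sketch: noise-class-dependent quality prediction (placeholder values) ---
import numpy as np

# hypothetical per-group linear relations: predicted quality = a * loudness_sone + b
GROUP_COEFFS = {"stationary": (-0.08, 4.5), "environmental": (-0.12, 4.3),
                "intelligible": (-0.20, 4.2), "musical": (-0.05, 4.4)}

def classify_noise(features):
    """Toy stand-in for the classifier: pick the group whose prototype
    feature vector is closest to the observed one."""
    prototypes = {"stationary": np.array([0.1, 0.9]),
                  "environmental": np.array([0.5, 0.5]),
                  "intelligible": np.array([0.9, 0.2]),
                  "musical": np.array([0.7, 0.8])}
    return min(prototypes, key=lambda g: np.linalg.norm(features - prototypes[g]))

def predict_quality(features, loudness_sone):
    group = classify_noise(features)
    a, b = GROUP_COEFFS[group]
    return group, max(1.0, min(5.0, a * loudness_sone + b))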
Singing Voice Detection in Polyphonic Music Using
Predominant Pitch
Vishweshwara Rao, S. Ramakrishnan, Preeti Rao; IIT
Bombay, India
Tue-Ses2-P2-3, Time: 13:30
This paper demonstrates the superiority of energy-based features
derived from the knowledge of predominant pitch, for singing
voice detection in polyphonic music, over commonly used spectral
features. However, such energy-based features tend to misclassify
loud, pitched instruments. To provide robustness to such accompaniment we exploit the relative instability of the pitch contour
of the singing voice by attenuating harmonic spectral content
belonging to stable-pitch instruments, using sinusoidal modeling.
The obtained feature shows high classification accuracy when
applied to north Indian classical music data and is also found
suitable for automatic detection of vocal-instrumental boundaries
required for smoothing the frame-level classifier decisions.
Word Stress Assessment for Computer Aided
Language Learning
Juan Pablo Arias, Nestor Becerra Yoma, Hiram
Vivanco; Universidad de Chile, Chile
Tue-Ses2-P2-4, Time: 13:30
In this paper an automatic word stress assessment system is proposed based on a top-to-bottom scheme. The method presented
is text and language independent. The utterance pronounced by
the student is directly compared with a reference one. The trend
similarity of the F0 and energy contours is compared frame-by-frame
by using DTW alignment. The stress assessment evaluation system
gives an EER equal to 21.5%, which in turn is similar to the error
observed in phonetic quality evaluation schemes. These results
suggest that the proposed system can be employed in real applications and is applicable to any language.
Acoustic Event Detection for Spotting “Hot Spots” in
Podcasts
Kouhei Sumi 1, Tatsuya Kawahara 1, Jun Ogata 2,
Masataka Goto 2; 1 Kyoto University, Japan; 2 AIST,
Japan
Tue-Ses2-P2-6, Time: 13:30
This paper presents a method to detect acoustic events that can
be used to find “hot spots” in podcast programs. We focus on
meaningful non-verbal audible reactions which suggest hot spots,
such as laughter and reactive tokens. In order to detect this
kind of short events and segment the counterpart utterances, we
need accurate audio segmentation and classification, dealing with
various recording environments and background music. Thus,
we propose a method for automatically estimating and switching
penalty weights for the BIC-based segmentation depending on
background environments. Experimental results show significant
improvement in detection accuracy by the proposed method compared
with using a constant penalty weight.
Improving Detection of Acoustic Events Using
Audiovisual Data and Feature Level Fusion
T. Butko, C. Canton-Ferrer, C. Segura, X. Giró, C.
Nadeu, J. Hernando, J.R. Casas; Universitat Politècnica
de Catalunya, Spain
Tue-Ses2-P2-7, Time: 13:30
The detection of the acoustic events (AEs) that are naturally
produced in a meeting room may help to describe the human and
social activity that takes place in it. When applied to spontaneous
recordings, the detection of AEs from only audio information
shows a large amount of errors, which are mostly due to temporal
overlapping of sounds. In this paper, a system to detect and recognize AEs using both audio and video information is presented.
A feature-level fusion strategy is used, and the structure of the
HMM-GMM based system considers each class separately and uses
a one-against-all strategy for training. Experimental AED results
with a new and rather spontaneous dataset are presented which
show the advantage of the proposed approach.
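Feature-level fusion of this kind simply concatenates synchronous audio and video feature vectors before classification; the one-against-all step can be pictured with a diagonal-covariance Gaussian per class scored against a pooled "rest" model. This is a toy stand-in, not the paper's HMM-GMM system.

# --- illustrative sketch: feature-level fusion + one-against-all Gaussian scoring (toy stand-in) ---
import numpy as np

def fuse(audio_feats, video_feats):
    """Concatenate frame-synchronous audio and video feature vectors."""
    return np.concatenate([audio_feats, video_feats], axis=1)

class OneAgainstAllGaussian:
    def fit(self, X, labels):
        labels = np.array(labels)
        self.classes = sorted(set(labels.tolist()))
        self.models = {}
        for c in self.classes:
            own = X[labels == c]
            rest = X[labels != c]
            self.models[c] = ((own.mean(0), own.var(0) + 1e-6),
                              (rest.mean(0), rest.var(0) + 1e-6))
        return self

    @staticmethod
    def _loglik(x, mean, var):
        return -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mean) ** 2 / var)

    def predict(self, x):
        # pick the class whose own-vs-rest log-likelihood ratio is largest
        def ratio(c):
            (m1, v1), (m0, v0) = self.models[c]
            return self._loglik(x, m1, v1) - self._loglik(x, m0, v0)
        return max(self.classes, key=ratio)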
Detecting Audio Events for Semantic Video Search
M. Bugalho 1 , J. Portêlo 2 , Isabel Trancoso 1 , T.
Pellegrini 2 , Alberto Abad 2 ; 1 INESC-ID Lisboa/IST,
Portugal; 2 INESC-ID Lisboa, Portugal
Tue-Ses2-P2-8, Time: 13:30
This paper describes our work on audio event detection, one of our
tasks in the European project VIDIVIDEO. Preliminary experiments
with a small corpus of sound effects have shown the potential of
this type of corpus for training purposes. This paper describes
our experiments with SVM classifiers, and different features, using
a 290-hour corpus of sound effects, which allowed us to build
detectors for almost 50 semantic concepts. Although the performance of these detectors on the development set is quite good
(achieving an average F-measure of 0.87), preliminary experiments
on documentaries and films showed that the task is much harder
in real-life videos, which so often include overlapping audio events.
Factor Analysis for Audio-Based Video Genre
Classification
Multimodal Speaker Verification Using Ancillary
Known Speaker Characteristics Such as Gender or
Age
Mickael Rouvier, Driss Matrouf, Georges Linarès; LIA,
France
Girija Chetty, Michael Wagner; University of Canberra,
Australia
Tue-Ses2-P2-9, Time: 13:30
Tue-Ses2-P2-12, Time: 13:30
Statistical classifiers operate on features that generally include
both useful and useless information. These two types of information are difficult to separate in the feature domain. Recently, a
new paradigm based on a Latent Factor Analysis (LFA) proposed
a model decomposition into useful and useless components.
This method was successfully applied to speaker and language
recognition tasks. In this paper, we study the use of LFA for video
genre classification by using only the audio channel. We propose
a classification method based on short-term cepstral features
and Gaussian Mixture Models (GMM) or Support Vector Machine
(SVM) classifiers, that are combined with Factor Analysis (FA).
Experiments are conducted on a corpus composed of 5 types of
video (musics, commercials, cartoons, movies and news). The
relative classification error reduction obtained by using the best
factor analysis configuration with respect to the baseline system,
Gaussian Mixture Model Universal Background Model (GMM-UBM),
is about 56%, corresponding to a correct identification rate of about
90%.
Multimodal speaker verification based on easy-to-obtain biometric
traits such as face and voice is rapidly gaining acceptance as the
preferred technology for many applications. In many such practical
applications, other characteristics of the speaker such as gender
or age are known and may be exploited for enhanced verification
accuracy. In this paper we present a parallel approach determining
gender as an ancillary speaker characteristic, which is incorporated in the decision of a face-voice speaker verification system.
Preliminary experiments with the DaFEx multimodal audio-video
database show that fusing the results of gender recognition and
identity verification improves the performance of multimodal
speaker verification.
Robust Audio-Based Classification of Video Genre
Mickael Rouvier, Georges Linarès, Driss Matrouf; LIA,
France
Tue-Ses2-P2-10, Time: 13:30
Discovering Keywords from Cross-Modal Input:
Ecological vs. Engineering Methods for Enhancing
Acoustic Repetitions
Guillaume Aimetti 1, Roger K. Moore 1, L. ten Bosch 2,
Okko Johannes Räsänen 3, Unto Kalervo Laine 3;
1 University of Sheffield, UK; 2 Radboud Universiteit
Nijmegen, The Netherlands; 3 Helsinki University of
Technology, Finland
Tue-Ses2-P2-13, Time: 13:30
Video genre classification is a challenging task in a global context
of fast growing video collections available on the Internet. This
paper presents a new method for video genre identification by
audio analysis. Our approach relies on the combination of low
and high level audio features. We investigate the discriminative
capacity of features related to acoustic instability, speaker interactivity, speech quality and acoustic space characterization. The
genre identification is performed on these features by using a SVM
classifier. Experiments are conducted on a corpus composed from
cartoons, movies, news, commercials and musics on which we
obtain an identification rate of 91%.
Fusing Audio and Video Information for Online
Speaker Diarization
This paper introduces a computational model that automatically
segments acoustic speech data and builds internal representations
of keyword classes from cross-modal (acoustic and pseudo-visual)
input. Acoustic segmentation is achieved using a novel dynamic
time warping technique and the focus of this paper is on recent
investigations conducted to enhance the identification of repeating
portions of speech. This ongoing research is inspired by current
cognitive views of early language acquisition and therefore strives
for ecological plausibility in an attempt to build more robust
speech recognition systems. Results show that an ad-hoc computationally engineered solution can aid the discovery of repeating
acoustic patterns. However, we show that this improvement can
be simulated in a more ecologically valid way.
Joerg Schmalenstroeer, Martin Kelling, Volker
Leutnant, Reinhold Haeb-Umbach; Universität
Paderborn, Germany
Tue-Ses2-P2-11, Time: 13:30
In this paper we present a system for identifying and localizing speakers using distant microphone arrays and a steerable
pan-tilt-zoom camera. Audio and video streams are processed
in real-time to obtain the diarization information “who speaks
when and where” with low latency to be used in advanced video
conferencing systems or user-adaptive interfaces. A key feature
of the proposed system is to first glean information about the
speaker’s location and identity from the audio and visual data
streams separately and then to fuse these data in a probabilistic
framework employing the Viterbi algorithm. Here, visual evidence of a person is utilized through a priori state probabilities,
while location and speaker change information are employed via
time-variant transition probabilities. Experiments show that video
information yields a substantial improvement compared to pure
audio-based diarization.
Tue-Ses2-P3 : ASR: Decoding and
Confidence Measures
Hewison Hall, 13:30, Tuesday 8 Sept 2009
Chair: Kai Yu, University of Cambridge, UK
Combined Low Level and High Level Features for
Out-of-Vocabulary Word Detection
Incremental Composition of Static Decoding Graphs
Benjamin Lecouteux 1 , Georges Linarès 1 , Benoit
Favre 2 ; 1 LIA, France; 2 ICSI, USA
Miroslav Novák; IBM T.J. Watson Research Center, USA
Tue-Ses2-P3-4, Time: 13:30
Tue-Ses2-P3-1, Time: 13:30
This paper addresses the issue of Out-Of-Vocabulary (OOV) word
detection in Large Vocabulary Continuous Speech Recognition
(LVCSR) systems. We propose a method inspired by confidence
measures, that consists in analyzing the recognition system outputs in order to automatically detect errors due to OOV words.
This method combines various features based on acoustic, linguistic, decoding graph and semantics. We evaluate separately
each feature and we estimate their complementarity. Experiments
are conducted on a large French broadcast news corpus from the
ESTER evaluation campaign. Results show good performance in
real conditions: the method obtains an OOV word detection rate
of 43%–90% with 2.5%–17.5% of false detection.
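The combination of heterogeneous word-level cues into a single detector can be pictured with a small logistic regression trained by gradient descent. The feature list and training loop below are illustrative assumptions, not the paper's feature set or classifier.

# --- illustrative sketch: combining word-level features for OOV detection (assumed features) ---
import numpy as np

def train_oov_detector(X, y, lr=0.1, epochs=200):
    """X: (n_words, n_features) matrix, e.g. columns = [acoustic confidence,
    LM score, decoding-graph density, semantic-similarity score];
    y: numpy array, 1 if the word hypothesis covers an OOV region, else 0."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))     # sigmoid
        grad_w = X.T @ (p - y) / len(y)
        grad_b = np.mean(p - y)
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

def flag_oov(x, w, b, threshold=0.5):
    return 1.0 / (1.0 + np.exp(-(x @ w + b))) > threshold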
A fast, scalable and memory-efficient method for static decoding
graph construction is presented. As an alternative to the traditional transducer-based approach, it is based on incremental
composition. Memory efficiency is achieved by combining composition, determinization and minimization into a single step,
thus eliminating large intermediate graphs. We have previously
reported the use of incremental composition limited to grammars
and left cross-word context [1]. Here, this approach is extended to
n-gram models with explicit ε arcs and right cross-word context.
Evaluation of Phone Lattice Based Speech Decoding
Jacques Duchateau, Kris Demuynck, Hugo
Van hamme; Katholieke Universiteit Leuven, Belgium
Bayes Risk Approximations Using Time Overlap
with an Application to System Combination
Tue-Ses2-P3-2, Time: 13:30
Björn Hoffmeister, Ralf Schlüter, Hermann Ney; RWTH
Aachen University, Germany
Previously, we proposed a flexible two-layered speech recogniser
architecture, called FLaVoR. In the first layer an unconstrained,
task independent phone recogniser generates a phone lattice. Only
in the second layer the task specific lexicon and language model
are applied to decode the phone lattice and produce a word level
recognition result. In this paper, we present a further evaluation
of the FLaVoR architecture. The performance of a classical single-layered architecture and the FLaVoR architecture are compared
on two recognition tasks, using the same acoustic, lexical and
language models. On the large vocabulary Wall Street Journal 5k
and 20k benchmark tasks, the two-layered architecture resulted in
slightly but not significantly better word error rates. On a reading
error detection task for a reading tutor for children, the FLaVoR
architecture clearly outperformed the single-layered architecture.
A Fully Data Parallel WFST-Based Large Vocabulary
Continuous Speech Recognition on a Graphics
Processing Unit
Tue-Ses2-P3-5, Time: 13:30
The computation of the Minimum Bayes Risk (MBR) decoding rule
for word lattices needs approximations. We investigate a class of
approximations where the Levenshtein alignment is approximated
under the condition that competing lattice arcs overlap in time.
The approximations have their origins in MBR decoding and in discriminative training. We develop modified versions and propose a
new, conceptually extremely simple confusion network algorithm.
The MBR decoding rule is extended to cope with several lattices,
which enables us to apply all the investigated approximations to
system combination. All approximations are tested on a Mandarin
and on an English LVCSR task for a single system and for system
combination. The new methods are competitive in error rate and
show some advantages over the standard approaches to MBR
decoding.
Unsupervised Estimation of the Language Model
Scaling Factor
Jike Chong, Ekaterina Gonina, Youngmin Yi, Kurt
Keutzer; University of California at Berkeley, USA
Tue-Ses2-P3-3, Time: 13:30
Tremendous compute throughput is becoming available in personal desktop and laptop systems through the use of graphics
processing units (GPUs).
However, exploiting this resource
requires re-architecting an application to fit a data parallel programming model. The complex graph traversal routines in the
inference process for large vocabulary continuous speech recognition (LVCSR) have been considered by many as unsuitable for
extensive parallelization. We explore and demonstrate a fully
data parallel implementation of a speech inference engine on
NVIDIA’s GTX280 GPU. Our implementation consists of two phases
- compute-intensive observation probability computation phase
and communication-intensive graph traversal phase. We take
advantage of dynamic elimination of redundant computation in
the compute-intensive phase while maintaining close-to-peak execution efficiency. We also demonstrate the importance of exploring
application-level trade-offs in the communication-intensive graph
traversal phase to adapt the algorithm to data parallel execution
on GPUs. On a 3.1-hour speech data set, we achieve more than
an 11× speedup compared to a highly optimized sequential
implementation on an Intel Core i7 without sacrificing accuracy.
Christopher M. White, Ariya Rastrow, Sanjeev
Khudanpur, Frederick Jelinek; Johns Hopkins
University, USA
Tue-Ses2-P3-6, Time: 13:30
This paper addresses the adjustment of the language model (LM)
scaling factor of an automatic speech recognition (ASR) system
for a new domain using only un-transcribed speech. The main
idea is to replace the (unavailable) reference transcript with an
automatic transcript generated by an independent ASR system,
and adjust parameters using this sloppy reference. It is shown that
despite its fairly high error rate (ca. 35%), choosing the scaling
factor to minimize disagreement with the erroneous transcripts is
still an effective recipe for model selection. This effectiveness is
demonstrated by adjusting an ASR system trained on Broadcast
News to transcribe the MIT Lectures corpus. An ASR system for
telephone speech produces the sloppy reference, and optimizing
towards it yields a nearly optimal LM scaling factor for the MIT
Lectures corpus.
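The selection recipe can be pictured as a grid search in which each N-best list is rescored at every candidate scaling factor and the factor minimising word edit distance to the automatically produced "sloppy" reference is kept. The N-best format and scale range below are assumptions for illustration.

# --- illustrative sketch: picking an LM scale against an automatic reference (assumed N-best format) ---
import numpy as np

def edit_distance(a, b):
    """Word-level Levenshtein distance between two word lists."""
    d = np.zeros((len(a) + 1, len(b) + 1), dtype=int)
    d[:, 0] = np.arange(len(a) + 1)
    d[0, :] = np.arange(len(b) + 1)
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            d[i, j] = min(d[i - 1, j] + 1, d[i, j - 1] + 1,
                          d[i - 1, j - 1] + (a[i - 1] != b[j - 1]))
    return d[len(a), len(b)]

def pick_lm_scale(nbest_lists, sloppy_refs, scales=np.arange(5.0, 20.1, 0.5)):
    """nbest_lists: per utterance, a list of (words, acoustic_logprob, lm_logprob);
    sloppy_refs: per utterance, the word list from an independent recogniser."""
    def total_errors(scale):
        errs = 0
        for nbest, ref in zip(nbest_lists, sloppy_refs):
            best = max(nbest, key=lambda h: h[1] + scale * h[2])
            errs += edit_distance(best[0], ref)
        return errs
    return min(scales, key=total_errors)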
Simultaneous Estimation of Confidence and Error
Cause in Speech Recognition Using Discriminative
Model
A Comparison of Audio-Free Speech Recognition
Error Prediction Methods
Atsunori Ogawa, Atsushi Nakamura; NTT Corporation,
Japan
Tue-Ses2-P3-7, Time: 13:30
Since recognition errors are unavoidable in speech recognition,
confidence scoring, which accurately estimates the reliability of
recognition results, is a critical function for speech recognition
engines. In addition to achieving accurate confidence estimation,
if we are to develop speech recognition systems that will be
widely used by the public, speech recognition engines must be
able to report the causes of errors properly, namely they must
offer a reason for any failure to recognize input utterances. This
paper proposes a method that simultaneously estimates both
confidences and causes of errors in speech recognition results by
using discriminative models. We evaluated the proposed method
in an initial speech recognition experiment, and confirmed its
promising performance with respect to confidence and error cause
estimation.
A Generalized Composition Algorithm for Weighted
Finite-State Transducers
Preethi Jyothi, Eric Fosler-Lussier; Ohio State
University, USA
Tue-Ses2-P3-10, Time: 13:30
Predicting possible speech recognition errors can be invaluable for
a number of Automatic Speech Recognition (ASR) applications. In
this study, we extend a Weighted Finite State Transducer (WFST)
framework for error prediction to facilitate a comparison between two approaches of predicting confusable words: examining
recognition errors on the training set to learn phone confusions
and utilizing distances between the phonetic acoustic models for
the prediction task. We also expand the framework to deal with
continuous word recognition and we can accurately predict 60% of
the misrecognized sentences (with an average words-per-sentence
count of 15) and a little over 70% of the total number of errors
from the unseen test data where no acoustic information related
to the test data is utilized.
Automatic Out-of-Language Detection Based on
Confidence Measures Derived from LVCSR Word
and Phone Lattices
Petr Motlicek; IDIAP Research Institute, Switzerland
Cyril Allauzen, Michael Riley, Johan Schalkwyk; Google
Inc., USA
Tue-Ses2-P3-8, Time: 13:30
This paper describes a weighted finite-state transducer composition algorithm that generalizes the concept of the composition
filter and presents filters that remove useless epsilon paths and
push forward labels and weights along epsilon paths. This filtering
permits the composition of large speech recognition context-dependent lexicons and language models much more efficiently in
time and space than previously possible. We present experiments
on Broadcast News and a spoken query task that demonstrate an
∼5% to 10% overhead for dynamic, runtime composition compared
to a static, offline composition of the recognition transducer. To
our knowledge, this is the first such system with so little overhead.
Word Confidence Using Duration Models
Tue-Ses2-P3-11, Time: 13:30
Confidence Measures (CMs) estimated from Large Vocabulary Continuous Speech Recognition (LVCSR) outputs are commonly used
metrics to detect incorrectly recognized words. In this paper, we
propose to exploit CMs derived from frame-based word and phone
posteriors to detect speech segments containing pronunciations
from non-target (alien) languages. The LVCSR system used is built
for English, which is the target language, with medium-size recognition vocabulary (5k words). The efficiency of detection is tested
on a set comprising speech from three different languages (English,
German, Czech). Results achieved indicate that employment of
specific temporal context (integrated in the word or phone level)
significantly increases the detection accuracies. Furthermore,
we show that combination of several CMs can also improve the
efficiency of detection.
Automatic Estimation of Decoding Parameters Using
Large-Margin Iterative Linear Programming
Stefano Scanzio 1, Pietro Laface 1, Daniele Colibro 2,
Roberto Gemello 2; 1 Politecnico di Torino, Italy;
2 Loquendo, Italy
Brian Mak, Tom Ko; Hong Kong University of Science &
Technology, China
Tue-Ses2-P3-9, Time: 13:30
In this paper, we propose a word confidence measure based on
phone durations depending on large contexts. The measure is
based on the expected duration of each recognized phone in a
word. In the approach proposed here, the duration of each phone
is in principle context-dependent, and the measure is a function of
the distance between the observed and expected phone duration
distributions within a word. Our experiments show that, since
the “duration confidence” does not make use of any acoustic
information, its Equal Error Rate (EER) in terms of False Accept
and False Rejection rates is not as good as the one obtained by
using the more informed acoustic confidence measure. However,
combining the two measures by a simple linear interpolation, the
system EER improves by 6% to 10% relative on an isolated word
recognition task in several languages.
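One way to picture the "duration confidence" and its fusion with an acoustic confidence: score each recognised phone by how far its observed duration lies from an expected, possibly context-dependent, mean, and interpolate the result with the acoustic measure. The Gaussian duration model, fallback values and interpolation weight are illustrative assumptions.

# --- illustrative sketch: duration-based word confidence + linear interpolation (assumed model) ---
import numpy as np

def duration_confidence(phones, durations_ms, duration_stats):
    """duration_stats maps a (possibly context-dependent) phone label to
    (mean_ms, std_ms); the word score is the average per-phone closeness."""
    scores = []
    for ph, dur in zip(phones, durations_ms):
        mean, std = duration_stats.get(ph, (80.0, 40.0))   # fallback values are assumptions
        z = abs(dur - mean) / max(std, 1e-3)
        scores.append(np.exp(-0.5 * z * z))                # 1.0 when the duration matches the model
    return float(np.mean(scores))

def combined_confidence(acoustic_conf, duration_conf, alpha=0.8):
    """Simple linear interpolation of the two measures."""
    return alpha * acoustic_conf + (1.0 - alpha) * duration_conf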
Tue-Ses2-P3-12, Time: 13:30
The decoding parameters in automatic speech recognition — grammar factor and word insertion penalty — are usually determined
by performing a grid search on a development set. Recently, we
cast their estimation as a convex optimization problem, and proposed a solution using an iterative linear programming algorithm.
However, the solution depends on how well the development data
set matches the test set. In this paper, we further investigate
an improvement on the generalization property of the solution by
using large margin training within the iterative linear programming
framework. Empirical evaluation on the WSJ0 5K speech recognition tasks shows that the recognition performance of the decoding
parameters found by the improved algorithm using only a subset
of the acoustic model training data is even better than that of the
decoding parameters found by grid search on the development
data, and is close to the performance of those found by grid search
on the test set.
Tue-Ses2-P4 : Robust Automatic Speech
Recognition I
A Study of Mutual Front-End Processing Method
Based on Statistical Model for Noise Robust Speech
Recognition
Hewison Hall, 13:30, Tuesday 8 Sept 2009
Chair: Alex Acero, Microsoft Research, USA
Masakiyo Fujimoto, Kentaro Ishizuka, Tomohiro
Nakatani; NTT Corporation, Japan
Tue-Ses2-P4-4, Time: 13:30
Optimization of Dereverberation Parameters Based
on Likelihood of Speech Recognizer
Randy Gomez, Tatsuya Kawahara; Kyoto University,
Japan
Tue-Ses2-P4-1, Time: 13:30
Speech recognition under reverberant condition is a difficult task.
Most dereverberation techniques used to address this problem
enhance the reverberant waveform independent from that of the
speech recognizer. In this paper, we improve the conventional
Spectral Subtraction-based (SS) dereverberation technique. In our
proposed approach, the dereverberation parameters are optimized
to improve the likelihood of the acoustic model. The system is
capable of adaptively fine-tuning these parameters jointly with
acoustic model training. Additional optimization is also implemented during decoding of the test utterances. We have evaluated
using real reverberant data and experimental results show that the
proposed method significantly improves the recognition performance over the conventional approach.
This paper addresses robust front-end processing for automatic
speech recognition (ASR) in noise. Accurate recognition of corrupted speech requires noise robust front-end processing, e.g.,
voice activity detection (VAD) and noise suppression (NS). Typically, VAD and NS are combined as one-way processing, and are
developed independently. However, VAD and NS should not be
assumed to be independent techniques, because sharing each
others’ information is important for the improvement of front-end
processing. Thus, we investigate the mutual front-end processing
by integrating VAD and NS, which can beneficially share each
others’ information. In an evaluation of a concatenated speech
corpus, CENSREC-1-C database, the proposed method improves the
performance of both VAD and ASR compared with the conventional
method.
Integrating Codebook and Utterance Information in
Cepstral Statistics Normalization Techniques for
Robust Speech Recognition
Guan-min He, Jeih-weih Hung; National Chi Nan
University, Taiwan
Application of Noise Robust MDT Speech
Recognition on the SPEECON and SpeechDat-Car
Databases
Tue-Ses2-P4-5, Time: 13:30
J.F. Gemmeke 1 , Y. Wang 2 , Maarten Van Segbroeck 2 , B.
Cranen 1 , Hugo Van hamme 2 ; 1 Radboud Universiteit
Nijmegen, The Netherlands; 2 Katholieke Universiteit
Leuven, Belgium
Tue-Ses2-P4-2, Time: 13:30
We show that the recognition accuracy of an MDT recognizer
which performs well on artificially noisified data, deteriorates
rapidly under realistic noisy conditions (using multiple microphone recordings from the SPEECON/SpeechDat-Car databases)
and is outperformed by a commercially available recognizer which
was trained using a multi-condition paradigm. Analysis of the
recognition results indicates that the recording channels with the
lowest SNRs where the MDT recognizer fails most, are also the
channels which suffer most from room reverberation. Despite the
channel compensation measures we took, it appears difficult to
maintain the restorative power of MDT in such non-additive noise
conditions.
Model Based Feature Enhancement for Automatic
Speech Recognition in Reverberant Environments
Cepstral statistics normalization techniques have been shown to
be very successful at improving the noise robustness of speech
features. This paper proposes a hybrid-based scheme to achieve
a more accurate estimate of the statistical information of features in these techniques. By properly integrating codebook
and utterance knowledge, the resulting hybrid-based approach
significantly outperforms conventional utterance-based, segment-based and codebook-based approaches in noisy environments.
For the Aurora-2 clean-condition training task, the proposed
hybrid codebook/segment-based histogram equalization (CS-HEQ)
achieves an average recognition accuracy of 90.66%, which is
better than utterance-based HEQ (87.62%), segment-based HEQ
(85.92%) and codebook-based HEQ (85.29%). Furthermore, the
high-performance CS-HEQ can be implemented with a short delay
and can thus be applied in real-time online systems.
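Histogram equalization of this kind maps each cepstral coefficient through the test-side empirical CDF and then through the inverse of a reference CDF (clean-training, codebook- or segment-derived). The sketch below shows only the per-dimension mapping, not the codebook/segment integration proposed in the paper.

# --- illustrative sketch: per-dimension histogram equalization of cepstral features ---
import numpy as np

def histogram_equalize(test_feats, reference_feats):
    """Map every coefficient of test_feats (frames x dims) so that its empirical
    distribution matches that of reference_feats in the same dimension."""
    out = np.empty_like(test_feats)
    n_frames, n_dims = test_feats.shape
    for d in range(n_dims):
        order = np.argsort(test_feats[:, d])
        ranks = np.empty(n_frames)
        ranks[order] = (np.arange(n_frames) + 0.5) / n_frames   # empirical CDF values
        ref_sorted = np.sort(reference_feats[:, d])
        # inverse reference CDF by linear interpolation
        ref_cdf = (np.arange(len(ref_sorted)) + 0.5) / len(ref_sorted)
        out[:, d] = np.interp(ranks, ref_cdf, ref_sorted)
    return out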
Reduced Complexity Equalization of Lombard Effect
for Speech Recognition in Noisy Adverse
Environments
Hynek Bořil, John H.L. Hansen; University of Texas at
Dallas, USA
Alexander Krueger, Reinhold Haeb-Umbach;
Universität Paderborn, Germany
Tue-Ses2-P4-6, Time: 13:30
Tue-Ses2-P4-3, Time: 13:30
In this paper we present a new feature space dereverberation
technique for automatic speech recognition. We derive an expression for the dependence of the reverberant speech features in the
log-mel spectral domain on the non-reverberant speech features
and the room impulse response. The obtained observation model
is used for a model based speech enhancement based on Kalman
filtering. The performance of the proposed enhancement technique is studied on the AURORA5 database. In our current best
configuration, which includes uncertainty decoding, the number
of recognition errors is approximately halved compared to the
recognition of unprocessed speech.
In real-world adverse environments, speech signal corruption by
background noise, microphone channel variations, and speech
production adjustments introduced by speakers in an effort to
communicate efficiently over noise (Lombard effect) severely
impact automatic speech recognition (ASR) performance. Recently,
a set of unsupervised techniques reducing ASR sensitivity to these
sources of distortion have been presented, with the main focus on
equalization of Lombard effect (LE). The algorithms performing
maximum-likelihood spectral transformation, cepstral dynamics
normalization, and decoding with a codebook of noisy speech
models have been shown to outperform conventional methods,
however, at a cost of considerable increase in computational complexity due to required numerous decoding passes through the ASR
models. In this study, a scheme utilizing a set of speech-in-noise
Gaussian mixture models and a neutral/LE classifier is shown to
substantially decrease the computational load (from 14 to 2–4 ASR
decoding passes) while preserving overall system performance. In
addition, an extended codebook capturing multiple environmental
noises is introduced and shown to improve ASR in changing environments (8.2–49.2% absolute WER improvement). The evaluation
is performed on the Czech Lombard Speech Database (CLSD’05).
The task is to recognize neutral/LE connected digit strings presented in different levels of background car noise and Aurora 2
noises.
Unsupervised Training Scheme with Non-Stereo
Data for Empirical Feature Vector Compensation
Noise-Robust Feature Extraction Based on Forward
Masking
Sheng-Chiuan Chiou, Chia-Ping Chen; National Sun
Yat-Sen University, Taiwan
L. Buera 1 , Antonio Miguel 1 , Alfonso Ortega 1 , Eduardo
Lleida 1 , Richard M. Stern 2 ; 1 Universidad de Zaragoza,
Spain; 2 Carnegie Mellon University, USA
Tue-Ses2-P4-10, Time: 13:30
Tue-Ses2-P4-7, Time: 13:30
In this paper, a novel training scheme based on unsupervised and
non-stereo data is presented for Multi-Environment Model-based
LInear Normalization (MEMLIN) and MEMLIN with cross-probability
model based on GMMs (MEMLIN-CPM). Both are data-driven feature
vector normalization techniques which have been proved very effective in dynamic noisy acoustic environments. However, this kind
of techniques usually requires stereo data in a previous training
phase, which could be an important limitation in real situations.
To compensate for this drawback, we present an approach based on
ML criterion and Vector Taylor Series (VTS). Experiments have
been carried out with Spanish SpeechDat Car, reaching consistent
improvements: 48.7% and 61.9% when the novel training process
is applied over MEMLIN and MEMLIN-CPM, respectively.
Incremental Adaptation with VTS and Joint
Adaptively Trained Systems
Forward masking is a phenomenon of human auditory perception,
that a weaker sound is masked by a preceding stronger masker.
The actual cause of forward masking is not clear, but synaptic
adaptation and temporal integration are heuristic explanations.
In this paper, we postulate the mechanism of forward masking to
be synaptic adaptation and temporal integration, and incorporate
them in the feature extraction process of an automatic speech
recognition system to improve noise-robustness. The synaptic
adaptation is implemented by a highpass filter, and the temporal
integration is implemented by a bandpass filter. We apply both
filters in the domain of log mel-spectrum. On the Aurora 3 tasks,
we evaluate three modified mel-frequency cepstral coefficients:
synaptic adaptation only, temporal integration only, and both
synaptic adaptation and temporal integration. Experiments show
that the overall improvement is 16.1%, 21.8%, and 26.2% respectively in the three cases over the baseline.
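The two mechanisms modelled here can be sketched as simple temporal filters applied to each log mel-spectral trajectory: a first-order highpass for synaptic adaptation and a short-minus-long moving-average bandpass for temporal integration. The coefficients and window lengths below are arbitrary illustration values, not the ones tuned in the paper.

# --- illustrative sketch: highpass + bandpass temporal filtering of log mel-spectra (assumed coefficients) ---
import numpy as np

def highpass(traj, alpha=0.95):
    """First-order highpass along time: y[t] = x[t] - x[t-1] + alpha * y[t-1]."""
    y = np.zeros_like(traj)
    for t in range(1, len(traj)):
        y[t] = traj[t] - traj[t - 1] + alpha * y[t - 1]
    return y

def bandpass(traj, short=3, long_win=15):
    """Difference of a short and a long moving average along time."""
    def moving_average(x, k):
        kernel = np.ones(k) / k
        return np.convolve(x, kernel, mode="same")
    return moving_average(traj, short) - moving_average(traj, long_win)

def forward_masking_features(log_mel):
    """log_mel: (frames, bands). Apply both filters band by band and stack them."""
    hp = np.apply_along_axis(highpass, 0, log_mel)
    bp = np.apply_along_axis(bandpass, 0, log_mel)
    return np.concatenate([hp, bp], axis=1)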
Tue-Ses3-S1 : Panel: Speech & Intelligence
Main Hall, 16:00, Tuesday 8 Sept 2009
Chair: Roger K. Moore, University of Sheffield, UK
F. Flego, M.J.F. Gales; University of Cambridge, UK
Tue-Ses2-P4-8, Time: 13:30
Recently adaptive training schemes using model based compensation approaches such as VTS and JUD have been proposed.
Adaptive training allows the use of multi-environment training
data whilst a neutral, “clean”, acoustic model is
trained. This paper describes and assesses the advantages of
using incremental, rather than batch, mode adaptation with these
adaptively trained systems. Incremental adaptation reduces the
latency during recognition, and has the possibility of reducing the
error rate for slowly varying noise. The work is evaluated on a
large scale multi-environment training configuration targeted at
in-car speech recognition. Results on in-car collected test data
indicate that incremental adaptation is an attractive option when
using these adaptively trained systems.
Target Speech GMM-Based Spectral Compensation
for Noise Robust Speech Recognition
Panel: Speech & Intelligence
Tue-Ses3-S1-1, Time: 16:00
In line with the theme of this year’s INTERSPEECH conference,
this special semi-plenary Panel Session will be run as a guided
discussion, drawing on issues raised by the panel members and
solicited in advance from the attendees. An international panel
of distinguished experts will engage with the topic of ‘speech and
intelligence’ and address open questions such as the importance
of a link between spoken language and other aspects of human
cognition. It is expected that this special event will be both
informative and entertaining, and will involve opportunities for
audience participation.
Tue-Ses3-O3 : Speaker Verification &
Identification I
Takahiro Shinozaki, Sadaoki Furui; Tokyo Institute of
Technology, Japan
Fallside (East Wing 2), 16:00, Tuesday 8 Sept 2009
Chair: Patrick Kenny, CRIM, Canada
Tue-Ses2-P4-9, Time: 13:30
To improve speech recognition performance in adverse conditions, a noise compensation method is proposed that applies
a transformation in the spectral domain whose parameters are
optimized based on likelihood of speech GMM modeled on the
feature domain. The idea is that additive and convolutional noises
have mathematically simple expression in the spectral domain
while speech characteristics are better modeled in the feature
domain such as MFCC. The proposed method works as a feature
extraction front-end that is independent from the decoding engine,
and has the ability to compensate for non-stationary additive and
convolutional noises with a short time delay. It includes spectral
subtraction as a special case when no parameter optimization is
performed. Experiments were performed using the AURORA-2J
database. It has been shown that significantly higher recognition
performance is obtained by the proposed method than by spectral
subtraction.
Investigation into Variants of Joint Factor Analysis
for Speaker Recognition
Lukáš Burget, Pavel Matějka, Valiantsina Hubeika, Jan
Černocký; Brno University of Technology, Czech
Republic
Tue-Ses3-O3-1, Time: 16:00
In this paper, we investigate joint factor analysis (JFA) for speaker
recognition. First, we performed a systematic comparison of full JFA
with its simplified variants and confirmed the superior performance
of the full JFA with both eigenchannels and eigenvoices. We
investigated the sensitivity of JFA to the number of eigenvoices
for both the full and simplified variants. We studied the
importance of normalization and found that gender-dependent
zt-norm was crucial. The results are reported on NIST 2006 and
2008 SRE evaluation data.
Improved GMM-Based Speaker Verification Using
SVM-Driven Impostor Dataset Selection
UBM-Based Sequence Kernel for Speaker Recognition
Zhenchun Lei; Jiangxi Normal University, China
Mitchell McLaren, Robbie Vogt, Brendan Baker, Sridha
Sridharan; Queensland University of Technology,
Australia
Tue-Ses3-O3-5, Time: 17:20
Tue-Ses3-O3-2, Time: 16:20
The problem of impostor dataset selection for GMM-based speaker
verification is addressed through the recently proposed data-driven
background dataset refinement technique. The SVM-based refinement technique selects from a candidate impostor dataset those
examples that are most frequently selected as support vectors
when training a set of SVMs on a development corpus. This study
demonstrates the versatility of dataset refinement in the task of
selecting suitable impostor datasets for use in GMM-based speaker
verification. The use of refined Z- and T-norm datasets provided
performance gains of 15% in EER in the NIST 2006 SRE over the use
of heuristically selected datasets. The refined datasets were shown
to generalise well to the unseen data of the NIST 2008 SRE.
Adaptive Individual Background Model for Speaker
Verification
This paper proposes a probabilistic sequence kernel based on
the universal background model, which is widely used in speaker
recognition. The Gaussian components are used to construct the
speaker reference space, and utterances of different lengths
are mapped into fixed-size vectors after normalization with a
correlation matrix. Finally, a linear support vector machine is
used for speaker recognition. A transition probabilistic sequence
kernel is also proposed, which adds transition information
between neighboring frames. Experiments on NIST 2001 show
that the performance is comparable with the traditional UBM-MAP
model. If we fuse the models, the performance improves by
16.8% and 19.1%, respectively, compared with the UBM-MAP model.
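The fixed-size mapping behind such a kernel can be sketched as follows: compute, for every frame, the posterior over the UBM's Gaussian components, average the posteriors over the utterance, and compare utterances with a simple linear kernel. The diagonal-covariance UBM and the plain averaging are simplifying assumptions relative to the paper.

# --- illustrative sketch: UBM-posterior sequence vector with a linear kernel (simplified) ---
import numpy as np

def frame_posteriors(frames, weights, means, variances):
    """frames: (T, D); weights: (M,); means, variances: (M, D) diagonal-covariance UBM."""
    T, D = frames.shape
    log_post = np.zeros((T, len(weights)))
    for m in range(len(weights)):
        diff = frames - means[m]
        log_post[:, m] = (np.log(weights[m])
                          - 0.5 * np.sum(np.log(2 * np.pi * variances[m]))
                          - 0.5 * np.sum(diff ** 2 / variances[m], axis=1))
    log_post -= log_post.max(axis=1, keepdims=True)
    post = np.exp(log_post)
    return post / post.sum(axis=1, keepdims=True)

def utterance_vector(frames, ubm):
    """Average the component posteriors over the utterance -> fixed-size vector."""
    return frame_posteriors(frames, *ubm).mean(axis=0)

def linear_kernel_score(utt_a, utt_b, ubm):
    va, vb = utterance_vector(utt_a, ubm), utterance_vector(utt_b, ubm)
    return float(np.dot(va, vb) / (np.linalg.norm(va) * np.linalg.norm(vb) + 1e-12))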
GMM Kernel by Taylor Series for Speaker
Verification
Minqiang Xu 1 , Xi Zhou 2 , Beiqian Dai 1 , Thomas S.
Huang 2 ; 1 University of Science & Technology of China,
China; 2 University of Illinois at Urbana-Champaign,
USA
Yossi Bar-Yosef, Yuval Bistritz; Tel-Aviv University,
Israel
Tue-Ses3-O3-3, Time: 16:40
Tue-Ses3-O3-6, Time: 17:40
Most techniques for speaker verification today use Gaussian
Mixture Models (GMMs) and make the decision by comparing the
likelihood of the speaker model to the likelihood of a universal
background model (UBM). The paper proposes to replace the UBM
by an individual background model (IBM) that is generated for each
speaker. The IBM is created using the K-nearest cohort models
and the UBM by a simple new adaptation algorithm. The new
GMM-IBM speaker verification system can also be combined with
various score normalization techniques that have been proposed
to increase the robustness of the GMM-UBM system. Comparative
experiments were held on the NIST-2004-SRE database with a plain
system setting (without score normalization) and also with the
combination of adaptive test normalization (ATnorm). Results
indicated that the proposed GMM-IBM system outperforms a
comparable GMM-UBM system.
Currently, the approach of combining Gaussian Mixture Models with
Support Vector Machines for the text-independent speaker verification
task has produced state-of-the-art performance. Many kernels
have been reported for combining GMM and SVM.
In this paper, we propose a novel kernel that represents the GMM
distribution by the Taylor expansion theorem and is used as
the input to the SVM. The utterance-specific GMM is represented as
a combination of orders of the Taylor series expanded at the means
of the Gaussian components. Here we extract the distribution
information around the means of the Gaussian components in
the GMM, as we can naturally assume that each mean position
indicates a feature cluster in the feature space. The kernel then
computes the ensemble distance between orders of the Taylor series.
Results of our new kernel on the NIST 2006 speaker recognition
evaluation (SRE) core task show relative improvements of
up to 7.1% and 11.7% in EER for male and female speakers,
respectively, compared to a K-L divergence based SVM system.
Optimization of Discriminative Kernels in SVM
Speaker Verification
Shi-Xiong Zhang, Man-Wai Mak; Hong Kong
Polytechnic University, China
Tue-Ses3-O3-4, Time: 17:00
An important aspect of SVM-based speaker verification systems is
the design of sequence kernels. These kernels should be able to
map variable-length observation sequences to fixed-size supervectors that capture the dynamic characteristics of speech utterances
and allow speakers to be easily distinguished. Most existing kernels
in SVM speaker verification are obtained by assuming a specific
form for the similarity function of supervectors. This paper relaxes
this assumption to derive a new general kernel. The kernel function
is general in that it is a linear combination of any kernels belonging
to the reproducing kernel Hilbert space. The combination weights
are obtained by optimizing the ability of a discriminant function to
separate a target speaker from impostors using either regression
analysis or SVM training. The idea was applied to both low- and
high-level speaker verification. In both cases, results show that
the proposed kernels outperform the state-of-the-art sequence
kernels. Further performance enhancement was also observed
when the high-level scores were combined with acoustic scores.
Tue-Ses3-O4 : Text Processing for Spoken
Language Generation
Holmes (East Wing 3), 16:00, Tuesday 8 Sept 2009
Chair: Douglas Reynolds, MIT, USA
Automatic Syllabification for Danish Text-to-Speech
Systems
Jeppe Beck 1, Daniela Braga 1, João Nogueira 2, Miguel
Sales Dias 1, Luis Coelho 3; 1 Microsoft Language
Development Center, Portugal; 2 University of Lisbon,
Portugal; 3 Polytechnic Institute of Oporto, Portugal
Tue-Ses3-O4-1, Time: 16:00
In this paper, a rule-based automatic syllabifier for Danish is
described using the Maximal Onset Principle. Prior success rates
of rule-based methods applied to Portuguese and Catalan syllabification modules formed the basis of this work. The system was
implemented and tested using a very small set of rules. The results
gave word accuracy rates of 96.9% and 98.7%, contrary to
our initial expectations, Danish being a language with a complex
syllabic structure and thus difficult to handle with rules. A comparison
with a data-driven syllabification system using artificial neural
networks showed a higher accuracy rate for the former (rule-based) system.
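The Maximal Onset Principle applied here can be illustrated with a toy syllabifier: between two vowels, assign to the following syllable the longest consonant cluster that is a legal onset. The vowel set and onset inventory below are tiny English-flavoured placeholders, not the Danish rule set of the paper.

# --- illustrative sketch: Maximal Onset Principle syllabification (toy inventories) ---
VOWELS = set("aeiouy")
LEGAL_ONSETS = {"", "b", "d", "f", "g", "k", "l", "m", "n", "p", "r", "s", "t", "v",
                "bl", "br", "dr", "fl", "fr", "gl", "gr", "kl", "kr", "pl", "pr",
                "sk", "sl", "sm", "sn", "sp", "st", "str", "tr"}

def syllabify(word):
    """Split a lower-case word into syllables, maximising each onset."""
    nuclei = [i for i, ch in enumerate(word) if ch in VOWELS]   # syllable nuclei
    if len(nuclei) < 2:
        return [word]
    syllables, start = [], 0
    for prev, nxt in zip(nuclei, nuclei[1:]):
        cluster = word[prev + 1:nxt]                 # consonants between two nuclei
        cut = 0                                      # default: whole cluster is a coda
        for k in range(len(cluster) + 1):
            if cluster[k:] in LEGAL_ONSETS:          # longest suffix that is a legal onset
                cut = k
                break
        boundary = prev + 1 + cut
        syllables.append(word[start:boundary])
        start = boundary
    syllables.append(word[start:])
    return syllables

# e.g. syllabify("abstrakt") -> ['ab', 'strakt'] under these toy inventories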
Letter-to-Phoneme Conversion by Inference of
Rewriting Rules
Vincent Claveau; IRISA, France
Tue-Ses3-O4-4, Time: 17:00
Phonetization is a crucial step for oral document processing.
In this paper, a new letter-to-phoneme conversion approach is
proposed; it is automatic, simple, portable and efficient. It relies
on a machine learning technique initially developed for transliteration and translation; the system infers rewriting rules from
examples of words with their phonetic representations. This
approach is evaluated in the framework of the Pronalsyl Pascal
challenge, which includes several datasets on different languages.
The obtained results equal or outperform those of the best known
systems. Moreover, thanks to the simplicity of our technique, the
inference time of our approach is much lower than those of the
best performing state-of-the-art systems.
Online Discriminative Training for
Grapheme-to-Phoneme Conversion
Sittichai Jiampojamarn, Grzegorz Kondrak; University
of Alberta, Canada
Tue-Ses3-O4-5, Time: 17:20
Hybrid Approach to Grapheme to Phoneme
Conversion for Korean
Jinsik Lee 1, Byeongchang Kim 2, Gary Geunbae Lee 1;
1 POSTECH, Korea; 2 Catholic University of Daegu,
Korea
Tue-Ses3-O4-2, Time: 16:20
In the grapheme to phoneme conversion problem for Korean,
two main approaches have been discussed: knowledge-based and
data-driven methods. However, both camps have limitations: the
knowledge-based hand-written rules cannot handle some of the
pronunciation changes due to the lack of capability of linguistic
analyzers and many exceptions; data-driven methods always suffer
from data sparseness. To overcome the shortcomings of both camps,
this paper presents a novel combining method which effectively
integrates two components: (1) a rule-based converting system
based on linguistically motivated hand-written rules and (2) a
statistical converting system using a Maximum Entropy model.
The experimental results clearly show the effectiveness of our
proposed method.
We present an online discriminative training approach to graphemeto-phoneme (g2p) conversion. We employ a many-to-many alignment between graphemes and phonemes, which overcomes the
limitations of widely used one-to-one alignments. The discriminative structure-prediction model incorporates input segmentation,
phoneme prediction, and sequence modeling in a unified dynamic
programming framework. The learning model is able to capture
both local context features in inputs, as well as non-local dependency features in sequence outputs. Experimental results show
that our system surpasses the state-of-the-art on several data sets.
Using Same-Language Machine Translation to Create
Alternative Target Sequences for Text-to-Speech
Synthesis
Peter Cahill 1, Jinhua Du 2, Andy Way 2, Julie
Carson-Berndsen 1; 1 University College Dublin, Ireland;
2 Dublin City University, Ireland
Tue-Ses3-O4-6, Time: 17:40
Robust LTS Rules with the Combilex Speech
Technology Lexicon
Korin Richmond, Robert A.J. Clark, Sue Fitt; University
of Edinburgh, UK
Tue-Ses3-O4-3, Time: 16:40
Combilex is a high quality pronunciation lexicon, aimed at speech
technology applications, that has recently been released by CSTR.
Combilex benefits from several advanced features. This paper
evaluates one of these: the explicit alignment of phones to
graphemes in a word. This alignment can help to rapidly develop
robust and accurate letter-to-sound (LTS) rules, without needing
to rely on automatic alignment methods. To evaluate this, we
used Festival’s LTS module, comparing its standard automatic
alignment with Combilex’s explicit alignment. Our results show
using Combilex’s alignment improves LTS accuracy: 86.50% words
correct as opposed to 84.49%, with our most general form of
lexicon. In addition, building LTS models is greatly accelerated,
as the need to list allowed alignments is removed. Finally, a loose
comparison with other studies indicates Combilex is a superior
quality lexicon in terms of consistency and size.
Modern speech synthesis systems attempt to produce speech
utterances from an open domain of words. In some situations, the
synthesiser will not have the appropriate units to pronounce some
words or phrases accurately but it still must attempt to pronounce
them. This paper presents a hybrid machine translation and unit
selection speech synthesis system. The machine translation system
was trained with English as the source and target language. Rather
than the synthesiser only saying the input text as would happen
in conventional synthesis systems, the synthesiser may say an
alternative utterance with the same meaning. This method allows
the synthesiser to overcome the problem of insufficient units in
runtime.
Tue-Ses3-P1 : Single- and Multichannel
Speech Enhancement
Hewison Hall, 16:00, Tuesday 8 Sept 2009
Chair: Richard C. Hendriks, Technische Universiteit Delft, The
Netherlands
Speech Enhancement in a 2-Dimensional Area Based
on Power Spectrum Estimation of Multiple Areas
with Investigation of Existence of Active Sources
Watermark Recovery from Speech Using Inverse
Filtering and Sign Correlation
Yusuke Hioka 1, Ken’ichi Furuya 1, Yoichi Haneda 1,
Akitoshi Kataoka 2; 1 NTT Corporation, Japan;
2 Ryukoku University, Japan
Robert Morris 1 , Ralph Johnson 1 , Vladimir
Goncharoff 2 , Joseph DiVita 1 ; 1 SPAWAR Systems
Center Pacific, USA; 2 University of Illinois at Chicago,
USA
Tue-Ses3-P1-4, Time: 16:00
Tue-Ses3-P1-1, Time: 16:00
This paper presents an improved method for asynchronous embedding and recovery of sub-audible watermarks in speech signals.
The watermark, a sequence of DTMF tones, was added to speech
without knowledge of its time-varying characteristics. Watermark
recovery began by implementing a synchronized zero-phase
inverse filtering operation to decorrelate the speech during its
voiced segments. The final step was to apply the sign correlation
technique, which resulted in performance advantages over linear
correlation detection. Our simulations include the effects of finite
word length in the correlator.
A microphone array that emphasizes sound sources located in a
particular 2-dimensional area is described. We previously developed a method that estimates the power spectra of target and
noise sounds using multiple fixed beamformings. However, that
method requires the areas where the noise sources are located to
be restricted. We describe the principle behind this limitation and then propose a procedure that checks beforehand for the existence
of a sound source in the target area and in other areas, in order to
reduce the number of unknown power spectra to be estimated.
Modulation Domain Spectral Subtraction for Speech
Enhancement
Kuldip Paliwal, Belinda Schwerin, Kamil Wójcicki;
Griffith University, Australia
Tue-Ses3-P1-5, Time: 16:00
Weighted Linear Prediction for Speech Analysis in
Noisy Conditions
Jouni Pohjalainen, Heikki Kallasjoki, Kalle J. Palomäki,
Mikko Kurimo, Paavo Alku; Helsinki University of
Technology, Finland
Tue-Ses3-P1-2, Time: 16:00
Following earlier work, we modify linear predictive (LP) speech
analysis by including temporal weighting of the squared prediction
error in the model optimization. In order to focus this so-called
weighted LP model on the least noisy signal regions in the presence
of stationary additive noise, we use short-time signal energy as
the weighting function. We compare the noisy spectrum analysis
performance of weighted LP and its recently proposed variant,
the latter guaranteed to produce stable synthesis models. As a
practical test case, we use automatic speech recognition to verify
that the weighted LP methods improve upon the conventional
FFT and LP methods by making spectrum estimates less prone to
corruption by additive noise.
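Weighted LP of this kind replaces the ordinary least-squares criterion with one in which each sample's squared prediction error is scaled by a weight, here short-time energy. The closed-form solution below uses weighted normal equations and is a simplified sketch under assumed settings (no lattice stabilisation as in the stabilised variant mentioned in the abstract).

# --- illustrative sketch: weighted linear prediction with short-time-energy weights ---
import numpy as np

def short_time_energy(x, order, win=20):
    """Energy of the preceding `win` samples, used as the error weight at each time."""
    e = np.zeros(len(x))
    for n in range(order, len(x)):
        seg = x[max(0, n - win):n]
        e[n] = np.sum(seg ** 2) + 1e-9
    return e

def weighted_lp(x, order=10):
    """Minimise sum_n w[n] * (x[n] - sum_k a[k] x[n-k])^2 for the coefficients a."""
    w = short_time_energy(x, order)
    R = np.zeros((order, order))
    r = np.zeros(order)
    for n in range(order, len(x)):
        past = x[n - order:n][::-1]            # x[n-1], ..., x[n-order]
        R += w[n] * np.outer(past, past)
        r += w[n] * x[n] * past
    return np.linalg.solve(R + 1e-9 * np.eye(order), r)

def lp_spectrum(a, n_fft=512):
    """All-pole magnitude spectrum 1 / |A(e^jw)| from the prediction coefficients."""
    A = np.fft.rfft(np.concatenate([[1.0], -a]), n_fft)
    return 1.0 / (np.abs(A) + 1e-12)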
Log-Spectral Magnitude MMSE Estimators Under
Super-Gaussian Densities
In this paper we investigate the modulation domain as an alternative to the acoustic domain for speech enhancement. More
specifically, we wish to determine how competitive the modulation
domain is for spectral subtraction as compared to the acoustic
domain. For this purpose, we extend the traditional analysis-modification-synthesis framework to include modulation domain
processing. We then compensate the noisy modulation spectrum
for additive noise distortion by applying the spectral subtraction
algorithm in the modulation domain. Using subjective listening
tests and objective speech quality evaluation we show that the
proposed method results in improved speech quality. Furthermore,
applying spectral subtraction in the modulation domain does not
introduce the musical noise artifacts that are typically present
after acoustic domain spectral subtraction. The proposed method
also achieves better background noise reduction than the MMSE
method.
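The processing chain can be sketched per acoustic-frequency bin: take the trajectory of spectral magnitudes across frames, apply a second (modulation-domain) FFT over short segments of that trajectory, subtract an estimate of the noise modulation magnitude, and resynthesise with the noisy modulation phase. Segment length, overlap and the subtraction floor below are illustrative choices, not the paper's settings.

# --- illustrative sketch: spectral subtraction in the modulation domain (assumed settings) ---
import numpy as np

def modulation_subtract(mag_traj, noise_mag_traj, seg=32, hop=16, floor=0.1):
    """mag_traj: magnitude trajectory of one acoustic-frequency bin across frames;
    noise_mag_traj: trajectory taken from a noise-only stretch of the same bin."""
    assert len(mag_traj) >= seg and len(noise_mag_traj) >= seg
    win = np.hanning(seg)
    # average noise modulation magnitude spectrum over noise-only segments
    n_noise = (len(noise_mag_traj) - seg) // hop + 1
    noise_mod = np.zeros(seg // 2 + 1)
    for k in range(n_noise):
        noise_mod += np.abs(np.fft.rfft(noise_mag_traj[k * hop:k * hop + seg] * win))
    noise_mod /= n_noise
    # subtract in the modulation domain, keep noisy modulation phase, overlap-add
    out = np.zeros(len(mag_traj))
    norm = np.zeros(len(mag_traj))
    for start in range(0, len(mag_traj) - seg + 1, hop):
        spec = np.fft.rfft(mag_traj[start:start + seg] * win)
        clean_mag = np.maximum(np.abs(spec) - noise_mod, floor * np.abs(spec))
        seg_out = np.fft.irfft(clean_mag * np.exp(1j * np.angle(spec)), seg)
        out[start:start + seg] += seg_out * win
        norm[start:start + seg] += win ** 2
    return out / np.maximum(norm, 1e-8)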
Variational Loopy Belief Propagation for
Multi-Talker Speech Recognition
Steven J. Rennie, John R. Hershey, Peder A. Olsen; IBM
T.J. Watson Research Center, USA
Tue-Ses3-P1-6, Time: 16:00
Richard C. Hendriks 1, Richard Heusdens 1, Jesper
Jensen 2; 1 Technische Universiteit Delft, The
Netherlands; 2 Oticon A/S, Denmark
Tue-Ses3-P1-3, Time: 16:00
Despite the fact that histograms of speech DFT coefficients are
super-Gaussian, not much attention has been paid to develop estimators under these super-Gaussian distributions in combination
with perceptual meaningful distortion measures. In this paper
we present log-spectral magnitude MMSE estimators under superGaussian densities, resulting in an estimator that is perceptually
more meaningful and in line with measured histograms of speech
DFT coefficients. Compared to state-of-the-art reference methods,
the presented estimator leads to an improvement of the segmental
SNR in the order of 0.5 dB up to 1 dB. Moreover, listening tests
show that the estimator yields a significant improvement
over state-of-the-art methods.
Variational Loopy Belief Propagation for
Multi-Talker Speech Recognition
Steven J. Rennie, John R. Hershey, Peder A. Olsen; IBM
T.J. Watson Research Center, USA
Tue-Ses3-P1-6, Time: 16:00
We address single-channel speech separation and recognition
by combining loopy belief propagation and variational inference
methods. Inference is done in a graphical model consisting of an
HMM for each speaker combined with the max interaction model
of source combination. We present a new variational inference
algorithm that exploits the structure of the max model to compute
an arbitrarily tight bound on the probability of the mixed data.
The variational parameters are chosen so that the algorithm scales
linearly in the size of the language and acoustic models, and
quadratically in the number of sources. The algorithm scores
30.7% on the SSC task [1], which is the best published result by a
method that scales linearly with speaker model complexity to date.
The algorithm achieves average recognition error rates of 27%,
35%, and 51% on small datasets of SSC-derived speech mixtures
containing two, three, and four sources, respectively, using a single
audio channel.
Enhancement of Binaural Speech Using Codebook
Constrained Iterative Binaural Wiener Filter
Nadir Cazi, T.V. Sreenivas; Indian Institute of Science,
India
Tue-Ses3-P1-7, Time: 16:00
A clean speech VQ codebook has been shown to be effective in
providing intraframe constraints and hence better convergence
of the iterative Wiener filtering scheme for single channel speech
enhancement. Here we present an extension of the single channel
CCIWF scheme to binaural speech input by incorporating a speech
distortion weighted multi-channel Wiener filter. The new algorithm
shows considerable improvement over single channel CCIWF in
each channel, in a diffuse noise field environment, in terms of a
posteriori SNR and a speech intelligibility measure. Next, considering
a moving speech source, good tracking performance is seen, up
to a certain resolution.
Enhanced Minimum Statistics Technique
Incorporating Soft Decision for Noise Suppression
Yun-Sik Park, Ji-Hyun Song, Jae-Hun Choi, Joon-Hyuk
Chang; Inha University, Korea
Tue-Ses3-P1-10, Time: 16:00
In this paper, we propose a novel approach to noise power
estimation for robust noise suppression in noisy environments. From
an investigation of the state-of-the-art techniques for noise power
estimation, it is discovered that the previously known methods are
accurate mostly either during speech absence or speech presence,
but none of them works well in both situations. Our approach combines
minimum statistics (MS) and soft decision (SD) techniques
based on the probability of speech absence. The performance of the
proposed approach is evaluated by a quantitative comparison
method and subjective tests under various noise environments and
found to yield better results compared with conventional MS and
SD-based schemes.
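A caricature of the combination described above is sketched below; the smoothing constants, the sliding-window length and the ad-hoc speech-absence probability are illustrative assumptions only, not the authors' estimator.

    import numpy as np

    def ms_sd_noise(power, alpha=0.9, win=60, a_sd=0.95, snr_th=4.0):
        # power: (frames, bins) noisy periodogram |Y|^2.
        T, K = power.shape
        smoothed = np.zeros_like(power)
        noise = np.zeros_like(power)
        smoothed[0] = noise[0] = power[0]
        for t in range(1, T):
            smoothed[t] = alpha * smoothed[t - 1] + (1 - alpha) * power[t]
            n_ms = smoothed[max(0, t - win + 1):t + 1].min(axis=0)   # minimum statistics
            post_snr = power[t] / np.maximum(noise[t - 1], 1e-10)
            p_absent = np.clip(1.0 - post_snr / snr_th, 0.0, 1.0)    # crude soft decision
            n_sd = a_sd * noise[t - 1] + (1 - a_sd) * power[t]       # recursive SD update
            noise[t] = p_absent * n_sd + (1 - p_absent) * n_ms
        return noise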
A Semi-Blind Source Separation Method with a Less
Amount of Computation Suitable for Tiny DSP
Modules
Kazunobu Kondo, Makoto Yamada, Hideki Kenmochi;
Yamaha Corporation, Japan
Tue-Ses3-P1-8, Time: 16:00
In this paper, we propose a method of implementing FDICA on
tiny DSP modules. Firstly, we show a semi-blind separation matrix
initialization step that consists of an estimation method using
covariance fitting for a known source and an unknown source. It
contributes to faster convergence and a smaller amount of
computation. Secondly, a learning band selection step is shown that
uses the determinant of the covariance matrix as a criterion
for selection; this achieves a significant reduction in the amount of
computation with practical separation performance. Finally, the
effectiveness of the proposed method is evaluated via source
separation simulations in anechoic and reverberant rooms, and
a procedure and a resource estimate for the integrated
method, which we call tinyICA, are also shown.
Effect of Noise Reduction on Reaction Time to
Speech in Noise
Mark Huckvale, Jayne Leak; University College London,
UK
Tue-Ses3-P1-11, Time: 16:00
In moderate levels of noise, listeners report that noise reduction
(NR) processing can improve the perceived quality of a speech
signal as measured on a typical MOS rating scale. Most quantitative
experiments of intelligibility, however, show that NR reduces the
intelligibility of noisy speech signals, and so should be expected
to increase the cognitive effort required to process utterances. To
study cognitive effort we look at how NR affects reaction times
to speech in noise, using material that is still highly intelligible.
We show that adding noise increases reaction times and that NR
does not restore reaction times back to the quiet condition. The
implication is that NR does not make speech “easier” to process, at
least as far as this task is concerned.
Joint Noise Reduction and Dereverberation of
Speech Using Hybrid TF-GSC and Adaptive MMSE
Estimator
Behdad Dashtbozorg, Hamid Reza Abutalebi; Yazd
University, Iran
Tue-Ses3-P1-9, Time: 16:00
This paper proposes a new multichannel hybrid method for
dereverberation of speech signals in noisy environments. This
method extends the use of a hybrid noise reduction method for
dereverberation which is based on the combination of a Generalized
Sidelobe Canceller (GSC) and a single-channel noise reduction
stage. In this research, we employ the Transfer Function GSC (TF-GSC),
which is more suitable for dereverberation. The single-channel stage
is an Adaptive Minimum Mean-Square Error (AMMSE) spectral
amplitude estimator. We also modify the AMMSE estimator for
the dereverberation application. Experimental results demonstrate
the superiority of the proposed method in dereverberation of speech
signals in noisy environments.
Model-Based Speech Separation: Identifying
Transcription Using Orthogonality
S.W. Lee 1, Frank K. Soong 2, Tan Lee 1; 1 Chinese
University of Hong Kong, China; 2 Microsoft Research
Asia, China
Tue-Ses3-P1-12, Time: 16:00
Spectral envelopes and harmonics are the building elements of a
speech signal. By estimating these elements, individual speech
sources in a mixture observation can be reconstructed and hence
separated. Transcription gives the spoken content. More importantly,
it describes the expected sequence of spectral envelopes, once
models of the different speech sounds have been acquired. Our recently
proposed single-microphone speech separation algorithm exploits
this to derive the spectral envelope trajectories of individual
sources and remove interference accordingly. The correctness of
such transcription becomes critical to the separation performance.
This paper investigates the relationship between the correctness
of transcription hypotheses and the orthogonality of associated
source estimates. An orthogonality measure is introduced to
quantify the correlation between spectrograms. Experiments verify
that underlying true transcriptions lead to a salient orthogonality
distribution, which is distinguishable from that of counterfeit
transcriptions. Accordingly, a transcription identification technique
is developed, which succeeds in identifying true transcriptions in
99.74% of the experimental trials.
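The abstract describes its orthogonality measure only as quantifying the correlation between spectrograms; a plausible minimal reading is the normalised inner product below (the authors' exact definition may differ).

    import numpy as np

    def spectrogram_orthogonality(spec_a, spec_b):
        # Normalised correlation between two equally sized magnitude spectrograms;
        # values near zero indicate nearly orthogonal (non-overlapping) source estimates.
        a, b = spec_a.ravel(), spec_b.ravel()
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))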
A Study on Multiple Sound Source Localization with
a Distributed Microphone System
Kook Cho, Takanobu Nishiura, Yoichi Yamashita;
Ritsumeikan University, Japan
Tue-Ses3-P1-13, Time: 16:00
This paper describes a novel method for multiple sound source
localization and its performance evaluation in actual room environments. The proposed method localizes a sound source by
finding the position that maximizes the accumulated correlation
coefficient between multiple channel pairs. After the estimation of
the first sound source, a typical pattern of the accumulated correlation for a single sound source is subtracted from the observed
distribution of the accumulated correlation. Subsequently, the second sound source is searched again. To evaluate the effectiveness
of the proposed method, experiments of multiple sound source
localization were carried out in an actual office room. The result
shows that multiple sound source localization accuracy is about
99.7%. The proposed method realizes multiple sound source
localization robustly and stably.
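A rough sketch of the accumulated-correlation search is given below for a distributed microphone set; the normalisation and lag sign convention are simplifying assumptions, and the template-subtraction step used to find the second source is omitted.

    import numpy as np

    def localise_by_accumulated_correlation(signals, mic_pos, candidates, fs, c=343.0):
        # signals: (num_mics, num_samples); mic_pos, candidates: (n, 3) positions in metres.
        M, L = signals.shape
        pairs = [(i, j) for i in range(M) for j in range(i + 1, M)]
        xcorr = {}
        for i, j in pairs:                        # normalised cross-correlation per pair
            cc = np.correlate(signals[i], signals[j], mode="full")
            xcorr[(i, j)] = cc / (np.linalg.norm(signals[i]) *
                                  np.linalg.norm(signals[j]) + 1e-12)
        centre = L - 1                            # index of lag 0 in 'full' correlation
        scores = np.zeros(len(candidates))
        for n, p in enumerate(candidates):
            acc = 0.0
            for i, j in pairs:
                tdoa = (np.linalg.norm(p - mic_pos[i]) - np.linalg.norm(p - mic_pos[j])) / c
                lag = int(round(tdoa * fs))       # sign convention may need flipping
                if abs(lag) < L:
                    acc += xcorr[(i, j)][centre + lag]
            scores[n] = acc                       # accumulated correlation for this position
        return candidates[int(np.argmax(scores))], scores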
Robust Minimal Variance Distortionless Speech
Power Spectra Enhancement Using Order Statistic
Filter for Microphone Array
Tao Yu, John H.L. Hansen; University of Texas at
Dallas, USA
Tue-Ses3-P1-14, Time: 16:00
In this study, we propose a novel minimal variance distortionless
speech power spectral enhancement algorithm, which is robust
to some of the real-world implementation issues. Our proposed
method is implemented in the power spectral domain, where
stochastic noise can be modeled by an exponential distribution
whose non-Gaussianity is exploited by an order statistics filter. Both
theoretical and experimental results show the effectiveness of our
proposed method over traditional ones.
Speech Enhancement Minimizing Generalized
Euclidean Distortion Using Supergaussian Priors
Amit Das, John H.L. Hansen; University of Texas at
Dallas, USA
Tue-Ses3-P1-15, Time: 16:00
We introduce short time spectral estimators which minimize the
weighted Euclidean distortion (WED) between the clean and estimated
speech spectral components when clean speech is degraded
by additive noise. The traditional minimum mean square error
(MMSE) estimator does not take sufficient account of perceptual
measures during enhancement of noisy speech. However, the
new estimators discussed in this paper provide greater flexibility
to improve speech quality. We explore the cases when clean
speech spectral magnitude and discrete Fourier transform (DFT)
coefficients are modeled by super-Gaussian priors, namely Chi and
bilateral Gamma distributions respectively. We also present the
joint maximum a posteriori (MAP) estimators of the Chi-distributed
spectral magnitude and uniform phase. Performance evaluations
over two noise types and three SNR levels demonstrate improved
results of the proposed estimators.
STFT-Based Speech Enhancement by Reconstructing
the Harmonics
Iman Haji Abolhassani 1, Sid-Ahmed Selouani 2,
Douglas O’Shaughnessy 1; 1 INRS-EMT, Canada;
2 Université de Moncton, Canada
Tue-Ses3-P1-16, Time: 16:00
A novel Short Time Fourier Transform (STFT) based speech
enhancement method is introduced. This method enhances the
magnitude spectrum of a noisy speech segment. The new idea
used in this method is to reconstruct the harmonics at
multiples of the fundamental frequency (F0) rather than trying
to improve them. The harmonics are produced in the magnitude
spectrum using knowledge of the window function used
for the STFT. These harmonics are then scaled and placed at
multiples of F0.
Experimental results prove the effectiveness of this enhancement
method in various noisy conditions and at various SNRs.
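The following sketch illustrates the reconstruction step: copies of the analysis window's magnitude response are scaled and placed at multiples of F0. Taking each harmonic's amplitude from the noisy spectrum at the nearest bin is an assumption of this sketch rather than the authors' estimator.

    import numpy as np

    def reconstruct_harmonics(noisy_mag, f0, fs, window, n_fft):
        # noisy_mag: one-sided magnitude spectrum with n_fft // 2 + 1 bins.
        W = np.abs(np.fft.fft(window, n_fft))      # window magnitude response, peak at bin 0
        half = n_fft // 2 + 1
        rebuilt = np.zeros(half)
        for h in range(1, int((fs / 2) // f0) + 1):
            k = int(round(h * f0 * n_fft / fs))    # nearest bin of the h-th harmonic
            amp = noisy_mag[k] / (W[0] + 1e-12)    # scale the lobe to match the observation
            rebuilt += amp * np.roll(W, k)[:half]  # window lobe re-centred on bin k
        return rebuilt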
Joint Speech Enhancement and Speaker
Identification Using Monte Carlo Methods
Ciira wa Maina, John MacLaren Walsh; Drexel
University, USA
Tue-Ses3-P1-17, Time: 16:00
We present an approach to speaker identification using noisy
speech observations where the speech enhancement and speaker
identification tasks are performed jointly. This is motivated by
the belief that human beings perform these tasks jointly and
that optimality may be sacrificed if sequential processing is used.
We employ a Bayesian approach where the speech features are
modeled using a mixture of Gaussians prior. A Gibbs sampler is
used to estimate the speech source and the identity of the speaker.
Preliminary experimental results are presented comparing our
approach to a maximum likelihood approach and demonstrating
the ability of our method to both enhance speech and identify
speakers.
Tue-Ses3-P2 : ASR: Acoustic Modelling
Hewison Hall, 16:00, Tuesday 8 Sept 2009
Chair: Simon King, University of Edinburgh, UK
Combined Discriminative Training for Multi-Stream
HMM-Based Audio-Visual Speech Recognition
Jing Huang 1, Karthik Visweswariah 2; 1 IBM T.J. Watson
Research Center, USA; 2 IBM India Research Lab, India
Tue-Ses3-P2-1, Time: 16:00
In this paper we investigate discriminative training of models
and feature space for a multi-stream hidden Markov model (HMM)
based audio-visual speech recognizer (AVSR). Since the two streams
are used together in decoding, we propose to train the parameters
of the two streams jointly. This is in contrast to prior work which
has considered discriminative training of parameters in each
stream independent of the other. In experiments on a 20-speaker
one-hour speaker independent test set, we obtain 22% relative
gain on AVSR performance over A/V models whose parameters
are trained separately, and 50% relative gain on AVSR over the
baseline maximum-likelihood models. On a noisy (mismatched
to training) test set, we obtain 21% relative gain over A/V models
whose parameters are trained separately. This represents 30%
relative improvement over the maximum-likelihood baseline.
Cued Speech Recognition for Augmentative
Communication in Normal-Hearing and
Hearing-Impaired Subjects
Panikos Heracleous, Denis Beautemps, Noureddine
Abboutabit; GIPSA, France
Tue-Ses3-P2-2, Time: 16:00
Speech is the most natural means of communication for humans.
However, in situations where audio speech is not available or
cannot be perceived because of disabilities or adverse environmental conditions, people may resort to alternative methods
such as augmented speech. Augmented speech is audio speech
supplemented or replaced by other modalities, such as audiovisual
speech, or Cued Speech. Cued Speech is a visual communication
mode, which uses lipreading and handshapes placed in different
positions to make spoken language wholly understandable to deaf
individuals. The current study reports the authors’ activities
and progress in Cued Speech recognition for French. Previously,
the authors have reported experimental results for vowel- and
consonant recognition in Cued Speech for French in the case of
a normal-hearing subject. The study has been extended by also
employing a deaf cuer, and both cuer-dependent and multi-cuer
experiments based on hidden Markov models (HMM) have been
conducted.
On Acquiring Speech Production Knowledge from
Articulatory Measurements for Phoneme
Recognition
D. Neiberg, G. Ananthakrishnan, Mats Blomberg; KTH,
Sweden
Tue-Ses3-P2-3, Time: 16:00
The paper proposes a general version of a coupled Hidden
Markov/Bayesian Network model for performing phoneme recognition on acoustic-articulatory data. The model uses knowledge
learned from the articulatory measurements, available for training,
for phoneme recognition on the acoustic input. After training on
the articulatory data, the model is able to predict 71.5% of the articulatory state sequences using the acoustic input. Using optimized
parameters, the proposed method shows a slight improvement for
two speakers over the baseline phoneme recognition system which
does not use articulatory knowledge. However, the improvement
is only statistically significant for one of the speakers. While
there is an improvement in recognition accuracy for the vowels,
diphthongs and to some extent the semi-vowels, there is a decrease
in accuracy for the remaining phonemes.
Measuring the Gap Between HMM-Based ASR and
TTS
John Dines 1, Junichi Yamagishi 2, Simon King 2; 1 IDIAP
Research Institute, Switzerland; 2 University of
Edinburgh, UK
Tue-Ses3-P2-4, Time: 16:00
The EMIME European project is conducting research in the development
of technologies for mobile, personalised speech-to-speech
translation systems. The hidden Markov model is being used as the
underlying technology in both automatic speech recognition (ASR)
and text-to-speech synthesis (TTS) components; thus, the investigation
of unified statistical modelling approaches has become an
implicit goal of our research. As one of the first steps towards this
goal, we have been investigating commonalities and differences
between HMM-based ASR and TTS. In this paper we present results
and analysis of a series of experiments that have been conducted
on English ASR and TTS systems measuring their performance with
respect to phone set and lexicon, acoustic feature type and
dimensionality, and HMM topology. Our results show that, although the
fundamental statistical model may be essentially the same, optimal
ASR and TTS performance often demands diametrically opposed
system designs. This represents a major challenge to be addressed
in the investigation of such unified modelling approaches.
Speech Recognition with Speech Synthesis Models
by Marginalising over Decision Tree Leaves
John Dines, Lakshmi Saheer, Hui Liang; IDIAP
Research Institute, Switzerland
Tue-Ses3-P2-5, Time: 16:00
There has been increasing interest in the use of unsupervised
adaptation for the personalisation of text-to-speech (TTS) voices,
particularly in the context of speech-to-speech translation. This
requires that we are able to generate adaptation transforms from
the output of an automatic speech recognition (ASR) system. An
approach that utilises unified ASR and TTS models would seem
to offer an ideal mechanism for the application of unsupervised
adaptation to TTS since transforms could be shared between
ASR and TTS. Such unified models should use a common set of
parameters. A major barrier to such parameter sharing is the use
of differing contexts in ASR and TTS. In this paper we propose a
simple approach that generates ASR models from a trained set
of TTS models by marginalising over the TTS contexts that are
not used by ASR. We present preliminary results of our proposed
method on a large vocabulary speech recognition task and provide
insights into future directions of this work.
Detailed Description of Triphone Model Using
SSS-Free Algorithm
Motoyuki Suzuki 1, Daisuke Honma 2, Akinori Ito 2,
Shozo Makino 2; 1 University of Tokushima, Japan;
2 Tohoku University, Japan
Tue-Ses3-P2-6, Time: 16:00
The triphone model is frequently used as an acoustic model. It is
effective for modeling phonetic variations caused by coarticulation.
However, it is known that acoustic features of phonemes are also
affected by other factors such as speaking style and speaking
speed. In this paper, a new acoustic model is proposed. All training
data which have the same phoneme context are automatically
clustered into several clusters based on acoustic similarity, and a
“sub-triphone” model is trained using the training data corresponding
to each cluster.
In experiments, the sub-triphone model achieved about 5% higher
phoneme accuracy than the triphone model.
Decision Tree Acoustic Models for ASR
Jitendra Ajmera, Masami Akamine; Toshiba Corporate
R&D Center, Japan
Tue-Ses3-P2-7, Time: 16:00
This paper presents a summary of our research progress using
decision-tree acoustic models (DTAM) for large vocabulary speech
recognition. Various configurations of training DTAMs are proposed
and evaluated on the Wall Street Journal (WSJ) task. A number of
different acoustic and categorical features have been used for this
purpose. Various ways of realizing a forest instead of a single tree
have been presented and shown to improve recognition accuracy.
Although the performance is not shown to be better than Gaussian
mixture models (GMMs), several advantages of DTAMs have been
highlighted and exploited. These include compactness, computational
simplicity and the ability to handle unordered information.
Compression Techniques Applied to Multiple
Speech Recognition Systems
Catherine Breslin, Matt Stuttle, Kate Knill; Toshiba
Research Europe Ltd., UK
Tue-Ses3-P2-8, Time: 16:00
Speech recognition systems typically contain many Gaussian distributions, and hence a large number of parameters. This makes
them both slow to decode speech, and large to store. Techniques
have been proposed to decrease the number of parameters. One
approach is to share parameters between multiple Gaussians, thus
reducing the total number of parameters and allowing for shared
likelihood calculation. Gaussian tying and subspace clustering
are two related techniques which take this approach to system
compression. These techniques can decrease the number of
parameters with no noticeable drop in performance for single
systems. However, multiple acoustic models are often used in real
speech recognition systems. This paper considers the application
of Gaussian tying and subspace compression to multiple systems.
Results show that two speech recognition systems can be modelled
using the same number of Gaussians as just one system, with little
effect on individual system performance.
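As a toy illustration of sharing Gaussians across systems, the sketch below pools the mean vectors of several acoustic models and clusters them with k-means so that each original Gaussian maps to one of a smaller set of shared means. The paper's Gaussian tying and subspace clustering are more sophisticated, so treat this only as the general idea.

    import numpy as np

    def tie_gaussians(means, n_tied, iters=20, seed=0):
        # means: (n_gaussians, dim) mean vectors pooled from multiple systems.
        rng = np.random.default_rng(seed)
        centres = means[rng.choice(len(means), size=n_tied, replace=False)].copy()
        for _ in range(iters):
            dist = ((means[:, None, :] - centres[None, :, :]) ** 2).sum(axis=-1)
            assign = dist.argmin(axis=1)           # each Gaussian -> shared (tied) mean
            for k in range(n_tied):
                members = means[assign == k]
                if len(members):
                    centres[k] = members.mean(axis=0)
        return assign, centres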
Graphical Models for Discrete Hidden Markov
Models in Speech Recognition
Antonio Miguel, Alfonso Ortega, L. Buera, Eduardo
Lleida; Universidad de Zaragoza, Spain
Tue-Ses3-P2-9, Time: 16:00
Emission probability distributions in speech recognition have
traditionally been associated with continuous random variables. The most
successful models have been mixtures of Gaussians in the
states of hidden Markov models to generate/capture observations.
In this work we show how graphical models can be used to
extract the joint information of more than two features. This is
possible if we previously quantize the speech features to a small
number of levels and model them as discrete random variables.
In this paper a method is shown to estimate a graphical model
with a bounded number of dependencies, which is a subset of the
directed acyclic graph based model framework, Bayesian networks.
Some experimental results have been obtained with mixtures of
graphical models compared to baseline systems using mixtures of
Gaussians with full and diagonal covariance matrices.
Factor Analyzed HMM Topology for Speech
Recognition
Chuan-Wei Ting, Jen-Tzung Chien; National Cheng
Kung University, Taiwan
Tue-Ses3-P2-10, Time: 16:00
This paper presents a new factor analyzed (FA) similarity measure
between two Gaussian mixture models (GMMs). An adaptive hidden
Markov model (HMM) topology is built to compensate for pronunciation
variations in speech recognition. Our idea aims to evaluate
whether the variation of an HMM state from new speech data is
significant or not and judge if a new state should be generated
in the models. Due to the effectiveness of FA data analysis, we
measure the GMM similarity by estimating the common factors
and specific factors embedded in the HMM means and variances.
Similar Gaussian densities are represented by the common factors.
Specific factors express the residual of the similarity measure. We
perform a composite hypothesis test due to common factors as
well as specific factors. An adaptive HMM topology is accordingly
established from a continuous collection of training utterances.
Experiments show that the proposed FA measure outperforms
other measures with a comparable number of parameters.
Tied-State Multi-Path HMnet Model Using
Three-Domain Successive State Splitting
Soo-Young Suk, Hiroaki Kojima; AIST, Japan
Tue-Ses3-P2-11, Time: 16:00
In this paper, we address the improvement of an acoustic model
using the multi-path Hidden Markov network (HMnet) model for
automatically creating non-uniform tied-state, context-dependent
hidden Markov model topologies. Recent research has achieved
multi-path model topologies in order to improve the recognition
performance in gender-independent, spontaneous-speaking applications.
However, the multi-path acoustic model size may increase
and require more training samples depending on the increased
number of paths. To solve this problem, we used a tied-state multi-path
topology by which we can create a three-domain successive
state splitting method to which environmental splitting is added.
This method can obtain a suitable model topology with small mixture
components. Experiments demonstrated that the proposed
multi-path HMnet model performs better than single-path models
for the same number of states.
Acoustic Modeling Using Exponential Families
Vaibhava Goel, Peder A. Olsen; IBM T.J. Watson
Research Center, USA
Tue-Ses3-P2-12, Time: 16:00
We present a framework to utilize general exponential families
for acoustic modeling. Maximum Likelihood (ML) parameter
estimation is carried out using sampling based estimates of the
partition function and expected feature vector. Markov Chain
Monte Carlo procedures are used to draw samples from general
exponential densities. We apply our ML estimation framework
to two new exponential families to demonstrate the modeling
flexibility afforded by this framework.
Tue-Ses3-P3 : Assistive Speech Technology
Hewison Hall, 16:00, Tuesday 8 Sept 2009
Chair: Elmar Nöth, FAU Erlangen-Nürnberg, Germany
Personalizing Synthetic Voices for People with
Progressive Speech Disorders: Judging Voice
Similarity
S.M. Creer 1, S.P. Cunningham 1, P.D. Green 1, K.
Fatema 2; 1 University of Sheffield, UK; 2 University of
Kent, UK
Tue-Ses3-P3-1, Time: 16:00
In building personalized synthetic voices for people with speech
disorders, the output should capture the individual’s vocal identity.
This paper reports a listener judgment experiment on the
similarity of Hidden Markov Model based synthetic voices using
varying amounts of adaptation data to two non-impaired speakers.
We conclude that around 100 sentences of data is needed to build
a voice that retains the characteristics of the target speaker, but
using more data improves the voice. Experiments using Multi-Layer
Perceptrons (MLPs) are conducted to find which acoustic features
contribute to the similarity judgments. Results show that mel-cepstral
distortion and fraction of voicing agreement contribute
most to replicating the similarity judgment, but the combination
of all features is required for accurate prediction. Ongoing work
applies the findings to voice building for people with impaired
speech.
Electrolaryngeal Speech Enhancement Based on
Statistical Voice Conversion
Keigo Nakamura, Tomoki Toda, Hiroshi Saruwatari,
Kiyohiro Shikano; NAIST, Japan
Tue-Ses3-P3-2, Time: 16:00
This paper proposes a speaking-aid system for laryngectomees
using GMM-based voice conversion that converts electrolaryngeal
speech (EL speech) to normal speech. Because valid F0 information
cannot be obtained from the EL speech, we have so far converted
the EL speech to whispering. This paper conducts the EL speech
conversion to normal speech using F0 contours estimated from
the spectral information of the EL speech. In this paper, we
experimentally evaluate these two types of output speech of our
speaking-aid system from several points of view. The experimental
results demonstrate that the converted normal speech is preferred
to the converted whisper.
Age Recognition for Spoken Dialogue Systems: Do
We Need It?
Maria Wolters, Ravichander Vipperla, Steve Renals;
University of Edinburgh, UK
Tue-Ses3-P3-3, Time: 16:00
When deciding whether to adapt relevant aspects of the system to
the particular needs of older users, spoken dialogue systems often
rely on automatic detection of chronological age. In this paper,
we show that vocal ageing as measured by acoustic features is
an unreliable indicator of the need for adaptation. Simple lexical
features greatly improve the prediction of both relevant aspects
of cognition and interaction style. Lexical features also boost age
group prediction. We suggest that adaptation should be based
on observed behaviour, not on chronological age, unless it is not
feasible to build classifiers for relevant adaptation decisions.
Speech-Based and Multimodal Media Center for
Different User Groups
Markku Turunen 1, Jaakko Hakulinen 1, Aleksi Melto 1,
Juho Hella 1, Juha-Pekka Rajaniemi 1, Erno Mäkinen 1,
Jussi Rantala 1, Tomi Heimonen 1, Tuuli Laivo 1, Hannu
Soronen 2, Mervi Hansen 2, Pellervo Valkama 1, Toni
Miettinen 1, Roope Raisamo 1; 1 University of Tampere,
Finland; 2 Tampere University of Technology, Finland
Tue-Ses3-P3-4, Time: 16:00
We present a multimodal media center interface based on speech
input, gestures, and haptic feedback. For special user groups,
including visually and physically impaired users, the application
features a zoomable context + focus GUI in tight combination with
speech output and full speech-based control. These features have
been developed in cooperation with representatives of the user
groups. Evaluations of the system with regular users have been
conducted and results from a study where subjective evaluations
were collected show that the performance and user experience of
speech input were very good, similar to results from a ten month
public pilot use.
Virtual Speech Reading Support for Hard of Hearing
in a Domestic Multi-Media Setting
Samer Al Moubayed 1, Jonas Beskow 1, Ann-Marie
Öster 1, Giampiero Salvi 1, Björn Granström 1, Nic
van Son 2, Ellen Ormel 2; 1 KTH, Sweden; 2 Viataal, The
Netherlands
Tue-Ses3-P3-5, Time: 16:00
In this paper we present recent results on the development of the
SynFace lip synchronized talking head towards multilinguality,
varying signal conditions and noise robustness in the Hearing at
Home project. We then describe the large scale hearing impaired
user studies carried out for three languages. The user tests focus
on measuring the gain in Speech Reception Threshold in Noise
when using SynFace, and on measuring the effort scaling when
using SynFace by hearing impaired people. Preliminary analysis
of the results does not show significant gain in SRT or in effort
scaling. But looking at inter-subject variability, it is clear that many
subjects benefit from SynFace especially with speech with stereo
babble noise.
Real-Time Correction of Closed-Captions
Patrick Cardinal, Gilles Boulianne; CRIM, Canada
Tue-Ses3-P3-6, Time: 16:00
Live closed-captions for deaf and hard of hearing audiences are
currently produced by stenographers, or by voice writers using
speech recognition. Both techniques can produce captions with errors.
We are currently developing a correction module that allows a
user to intercept the real-time caption stream and correct it before
it is broadcast. We report results of preliminary experiments on
correction rate and actual user performance using a prototype
correction module connected to the output of a speech recognition
captioning system.
Universal Access: Speech Recognition for Talkers
with Spastic Dysarthria
Harsh Vardhan Sharma 1, Mark Hasegawa-Johnson 2;
1 Beckman Institute for Advanced Science &
Technology, USA; 2 University of Illinois at
Urbana-Champaign, USA
Tue-Ses3-P3-7, Time: 16:00
This paper describes the results of our experiments in small
and medium vocabulary dysarthric speech recognition, using the
database being recorded by our group under the Universal Access
initiative. We develop and test speaker-dependent, word- and
phone-level speech recognizers utilizing the hidden Markov Model
architecture; the models are trained exclusively on dysarthric
speech produced by individuals diagnosed with cerebral palsy.
The experiments indicate that (a) different system configurations
(being word vs. phone based, number of states per HMM, number
of Gaussian components per state specific observation probability
density etc.) give useful performance (in terms of recognition
accuracy) for different speakers and different task-vocabularies,
and (b) for very low intelligibility subjects, speech recognition
outperforms human listeners on recognizing dysarthric speech.
Exploring Speech Therapy Games with Children on
the Autism Spectrum
Mohammed E. Hoque, Joseph K. Lane, Rana
el Kaliouby, Matthew Goodwin, Rosalind W. Picard;
MIT, USA
Tue-Ses3-P3-8, Time: 16:00
Individuals on the autism spectrum often have difficulties producing intelligible speech with either high or low speech rate, and
atypical pitch and/or amplitude affect. In this study, we present a
novel intervention towards customizing speech enabled games to
help them produce intelligible speech. In this approach, we clinically and computationally identify the areas of speech production
difficulties of our participants. We provide an interactive and customized interface for the participants to meaningfully manipulate
the prosodic aspects of their speech. Over the course of 12 months,
we have conducted several pilots to set up the experimental design,
developed a suite of games and audio processing algorithms for
prosodic analysis of speech. Preliminary results demonstrate our
intervention being engaging and effective for our participants.
Analyzing GMMs to Characterize Resonance
Anomalies in Speakers Suffering from Apnoea
José Luis Blanco 1, Rubén Fernández 1, David Pardo 1,
Álvaro Sigüenza 1, Luis A. Hernández 1, José Alcázar 2;
1 Universidad Politécnica de Madrid, Spain; 2 Hospital
Torrecardenas, Spain
Tue-Ses3-P3-9, Time: 16:00
Past research on the speech of apnoea patients has revealed that
resonance anomalies are among the most distinguishing traits for
these speakers. This paper presents an approach to characterize
these peculiarities using GMMs and distance measures between
distributions. We report the findings obtained with two analytical
procedures, working with a purpose-designed speech database
of both healthy and apnoea-suffering patients. First, we validate
the database to guarantee that the models trained are able to
describe the acoustic space in a way that may reveal differences
between groups. Then we study abnormal nasalization in apnoea
patients by considering vowels in nasal and non-nasal phonetic
contexts. Our results confirm that there are differences between
the groups, and that statistical modelling techniques can be used
to describe this factor. Results further suggest that it would be
possible to design an automatic classifier using such discriminative
information.
On the Mutual Information Between Source and
Filter Contributions for Voice Pathology Detection
Thomas Drugman, Thomas Dubuisson, Thierry Dutoit;
Faculté Polytechnique de Mons, Belgium
Tue-Ses3-P3-10, Time: 16:00
This paper addresses the problem of automatic detection of voice
pathologies directly from the speech signal. For this, we investigate
the use of the glottal source estimation as a means to detect voice
disorders. Three sets of features are proposed, depending on
whether they are related to the speech or the glottal signal, or
to prosody. The relevancy of these features is assessed through
mutual information-based measures. This allows an intuitive
interpretation in terms of discrimination power and redundancy
between the features, independently of any subsequent classifier.
It is discussed which characteristics are interestingly informative
or complementary for detecting voice pathologies.
A System for Detecting Miscues in Dyslexic Read
Speech
Morten Højfeldt Rasmussen, Zheng-Hua Tan, Børge
Lindberg, Søren Holdt Jensen; Aalborg University,
Denmark
Tue-Ses3-P3-11, Time: 16:00
While miscue detection in general is a well-explored research field,
little attention has so far been paid to miscue detection in dyslexic
read speech. This domain differs substantially from the domains
that are commonly researched, as for example dyslexic read speech
includes frequent regressions and long pauses between words. A
system detecting miscues in dyslexic read speech is presented.
It includes an ASR component employing a forced-alignment like
grammar adjusted for dyslexic input and uses the GOP score and
phone duration to accept or reject the read words. Experimental
results show that the system detects miscues at a false alarm
rate of 5.3% and a miscue detection rate of 40.1%. These results
are worse than current state of the art reading tutors perhaps
indicating that dyslexic read speech is a challenge to handle.
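For context, one common goodness-of-pronunciation (GOP) formulation is sketched below: the duration-normalised difference between the forced-aligned phone's log-likelihood and that of the best competing phone. The paper's exact scoring, grammar and acceptance thresholds are not given in the abstract.

    def goodness_of_pronunciation(frame_loglikes, segment, phone):
        # frame_loglikes: dict phone -> list of per-frame log-likelihoods (equal lengths);
        # segment: (start, end) frame indices of the forced-aligned phone.
        start, end = segment
        duration = max(end - start, 1)
        target = sum(frame_loglikes[phone][start:end])
        best = max(sum(frame_loglikes[p][start:end]) for p in frame_loglikes)
        return (target - best) / duration          # <= 0; low values suggest a miscue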
Tue-Ses3-P4 : Topics in Spoken Language
Processing
Hewison Hall, 16:00, Tuesday 8 Sept 2009
Chair: Chiori Hori, NICT, Japan
Techniques for Rapid and Robust Topic
Identification of Conversational Telephone Speech
Jonathan Wintrode 1 , Scott Kulp 2 ; 1 United States
Department of Defense, USA; 2 Rutgers University, USA
Tue-Ses3-P4-1, Time: 16:00
In this paper, we investigate the impact of automatic speech
recognition (ASR) errors on the accuracy of topic identification in
conversational telephone speech. We present a modified TF-IDF
feature weighting calculation that provides significant robustness
under various recognition error conditions. For our experiments
we take conversations from the Fisher corpus to produce 1-best
and lattice outputs using a single recognizer tuned to run at various
speeds. We use an SVM classifier to perform topic identification
on the output. We observe classifiers incorporating confidence
information to be significantly more robust to errors than those
treating output as unweighted text.
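One plausible reading of "incorporating confidence information" is sketched below: term frequency is accumulated as the sum of ASR word confidences instead of raw counts before the usual IDF weighting. The paper's modified TF-IDF calculation may differ in detail.

    import math
    from collections import defaultdict

    def confidence_weighted_tfidf(docs):
        # docs: list of documents, each a list of (word, confidence) pairs from ASR output.
        tf = [defaultdict(float) for _ in docs]
        df = defaultdict(int)
        for d, doc in enumerate(docs):
            for word, conf in doc:
                tf[d][word] += conf               # soft counts weighted by confidence
            for word in {w for w, _ in doc}:
                df[word] += 1
        n_docs = len(docs)
        return [{w: weight * math.log(n_docs / df[w]) for w, weight in tf[d].items()}
                for d in range(n_docs)]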
Localization of Speech Recognition in Spoken Dialog
Systems: How Machine Translation Can Make Our
Lives Easier
David Suendermann, Jackson Liscombe, Krishna
Dayanidhi, Roberto Pieraccini; SpeechCycle Labs, USA
Tue-Ses3-P4-2, Time: 16:00
The localization of speech recognition for large-scale spoken
dialog systems can be a tremendous exercise. Usually, all involved
grammars have to be translated by a language expert, and new
data has to be collected, transcribed, and annotated for statistical
utterance classifiers resulting in a time-consuming and expensive
undertaking. Often though, a vast number of transcribed and
annotated utterances exists for the source language. In this
paper, we propose to use such data and translate it into the target
language using machine translation. The translated utterances
and their associated (original) annotations are then used to train
statistical grammars for all contexts of the target system. As
an example, we localize an English spoken dialog system for
Internet troubleshooting to Spanish by translating more than 4
million source utterances without any human intervention. In an
application of the localized system to more than 10,000 utterances
collected on a similar Spanish Internet troubleshooting system, we
show that the overall accuracy was only 5.7% worse than that of
the English source system.
Algorithms for Speech Indexing in Microsoft Recite
Kunal Mukerjee, Shankar Regunathan, Jeffrey Cole;
Microsoft Corporation, USA
Tue-Ses3-P4-3, Time: 16:00
Microsoft Recite is a mobile application to store and retrieve
spoken notes. Recite stores and matches n-grams of pattern class
identifiers that are designed to be language neutral and handle a
large number of out of vocabulary phrases. The query algorithm
expects noise and fragmented matches and compensates for them
with a heuristic ranking scheme. This contribution describes a
class of indexing algorithms for Recite that allows for high retrieval
accuracy while meeting the constraints of low computational
complexity and memory footprint of embedded platforms. The
results demonstrate that a particular indexing scheme within this
class can be selected to optimize the trade-off between retrieval
accuracy and insertion/query complexity.
Parallelized Viterbi Processor for 5,000-Word
Large-Vocabulary Real-Time Continuous Speech
Recognition FPGA System
Tsuyoshi Fujinaga, Kazuo Miura, Hiroki Noguchi,
Hiroshi Kawaguchi, Masahiko Yoshimoto; Kobe
University, Japan
Tue-Ses3-P4-4, Time: 16:00
We propose a novel Viterbi processor for large-vocabulary
real-time continuous speech recognition. This processor is built
with multiple Viterbi cores. Since each core can compute
independently, these cores reduce the cycle time very efficiently. To
verify the effect of utilizing multiple cores, we implement a dual-core
Viterbi processor in an FPGA and achieve a 49% cycle-time reduction
compared to a single-core processor. Our proposed dual-core
Viterbi processor achieves 5,000-word real-time continuous
speech recognition at 65.175 MHz. In addition, it is easy to
implement scalable increases in the number of cores, which leads
to support for larger vocabularies.
SpLaSH (Spoken Language Search Hawk): Integrating
Time-Aligned with Text-Aligned Annotations
Sara Romano, Elvio Cecere, Francesco Cutugno;
Università di Napoli Federico II, Italy
Tue-Ses3-P4-5, Time: 16:00
In this work we present SpLaSH (Spoken Language Search Hawk),
a toolkit used to perform complex queries on spoken language
corpora. In SpLaSH, tools for the integration of time aligned
annotations (TMA), by means of annotation graphs, with text aligned
ones (TXA), by means of generic XML files, are provided. SpLaSH
imposes a very limited number of constraints on the data model design,
allowing the integration of annotations developed separately
within the same dataset and without any relative dependency. It
also provides a GUI allowing three types of queries: simple query
on TXA or TMA structures, sequence query on the TMA structure and
cross query on both TXA and TMA integrated structures.
PodCastle: Collaborative Training of Acoustic
Models on the Basis of Wisdom of Crowds for
Podcast Transcription
Jun Ogata, Masataka Goto; AIST, Japan
Tue-Ses3-P4-6, Time: 16:00
This paper presents acoustic-model-training techniques for improving
automatic transcription of podcasts. A typical approach
for acoustic modeling is to create a task-specific corpus including
hundreds (or even thousands) of hours of speech data and their
accurate transcriptions. This approach, however, is impractical
for the podcast-transcription task because manual generation of the
transcriptions of the large amounts of speech covering all the
various types of podcast content would be too costly and time
consuming. To solve this problem, we introduce collaborative
training of acoustic models on the basis of the wisdom of crowds,
i.e., the transcriptions of podcast-speech data are generated by
anonymous users on our web service PodCastle. We then describe
a podcast-dependent acoustic modeling system that uses RSS
metadata to deal with the differences in acoustic conditions in
podcast speech data. From our experimental results on actual
podcast speech data, the effectiveness of the proposed acoustic
model training was confirmed.
A WFST-Based Log-Linear Framework for
Speaking-Style Transformation
Graham Neubig, Shinsuke Mori, Tatsuya Kawahara;
Kyoto University, Japan
Tue-Ses3-P4-7, Time: 16:00
When attempting to make transcripts from automatic speech recognition
results, disfluency deletion, transformation of colloquial
expressions, and insertion of dropped words must be performed
to ensure that the final product is clean transcript-style text. This
paper introduces a system for the automatic transformation of
the spoken word to transcript-style language that enables not
only deletion of disfluencies, but also substitution of colloquial
expressions and insertion of dropped words. A number of potentially
useful features are combined in a log-linear probabilistic
framework, and the utility of each is examined. The system is
implemented using weighted finite state transducers (WFSTs) to
allow for easy combination of features and integration with other
WFST-based systems. On evaluation, the best system achieved
a 5.37% word error rate, a 5.49% absolute gain over a rule-based
baseline and a 1.54% absolute gain over a simple noisy-channel
model.
ClusterRank: A Graph Based Method for Meeting
Summarization
Nikhil Garg 1, Benoit Favre 2, Korbinian Reidhammer 2,
Dilek Hakkani-Tür 2; 1 EPFL, Switzerland; 2 ICSI, USA
Tue-Ses3-P4-8, Time: 16:00
This paper presents an unsupervised, graph based approach for
extractive summarization of meetings. Graph based methods such
as TextRank have been used for sentence extraction from news
articles. These methods model text as a graph with sentences
as nodes and edges based on word overlap. A sentence node is
then ranked according to its similarity with other nodes. The
spontaneous speech in meetings leads to incomplete, ill-formed
sentences with high redundancy and calls for additional measures
to extract relevant sentences. We propose an extension of the
TextRank algorithm that clusters the meeting utterances and uses
these clusters to construct the graph. We evaluate this method on
the AMI meeting corpus and show a significant improvement over
TextRank and other baseline methods.
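A simplified stand-in for the clustering-based extension is sketched below: given utterance clusters, it builds a cosine-similarity graph over cluster bag-of-words vectors and ranks clusters with a PageRank-style iteration. How the paper forms the clusters and maps their ranks back to extracted sentences is not reproduced here.

    import numpy as np

    def cluster_rank(clusters, damping=0.85, iters=50):
        # clusters: list of clusters, each a list of utterances (token lists).
        vocab = {t: i for i, t in enumerate(sorted({tok for cl in clusters
                                                    for utt in cl for tok in utt}))}
        vecs = np.zeros((len(clusters), len(vocab)))
        for c, cl in enumerate(clusters):
            for utt in cl:
                for tok in utt:
                    vecs[c, vocab[tok]] += 1.0
        unit = vecs / (np.linalg.norm(vecs, axis=1, keepdims=True) + 1e-12)
        sim = unit @ unit.T                        # cosine similarity between clusters
        np.fill_diagonal(sim, 0.0)
        M = sim / (sim.sum(axis=1, keepdims=True) + 1e-12)   # row-stochastic transitions
        scores = np.full(len(clusters), 1.0 / len(clusters))
        for _ in range(iters):
            scores = (1 - damping) / len(clusters) + damping * (M.T @ scores)
        return scores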
Leveraging Sentence Weights in a Concept-Based
Optimization Framework for Extractive Meeting
Summarization
Shasha Xie 1, Benoit Favre 2, Dilek Hakkani-Tür 2, Yang
Liu 1; 1 University of Texas at Dallas, USA; 2 ICSI, USA
Tue-Ses3-P4-9, Time: 16:00
We adopt an unsupervised concept-based global optimization
framework for extractive meeting summarization, where a subset
of sentences is selected to cover as many important concepts as
possible. We propose to leverage sentence importance weights in
this model. Three ways are introduced to combine the sentence
weights within the concept-based optimization framework: selecting sentences for concept extraction, pruning unlikely candidate
summary sentences, and using joint optimization of sentence and
concept weights. Our experimental results on the ICSI meeting
corpus show that our proposed methods can significantly improve
the performance for both human transcripts and ASR output
compared to the concept-based baseline approach, and this unsupervised approach achieves results comparable with those from
supervised learning approaches presented in previous work.
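The flavour of combining sentence weights with concept coverage can be shown with a greedy sketch; the paper uses a global optimization, and the way the sentence bonus enters the objective below is an assumption made only for illustration.

    def greedy_weighted_summary(sentences, concept_weight, sentence_weight, budget):
        # sentences: dict id -> (length, set of concept ids); budget: max total length.
        chosen, covered, used = [], set(), 0
        while True:
            best, best_gain = None, 0.0
            for sid, (length, concepts) in sentences.items():
                if sid in chosen or used + length > budget:
                    continue
                gain = (sum(concept_weight[c] for c in concepts - covered)
                        + sentence_weight[sid]) / length
                if gain > best_gain:
                    best, best_gain = sid, gain
            if best is None:
                return chosen
            chosen.append(best)
            covered |= sentences[best][1]
            used += sentences[best][0]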
Hybrids of Supervised and Unsupervised Models for
Extractive Speech Summarization
Shih-Hsiang Lin, Yueng-Tien Lo, Yao-Ming Yeh, Berlin
Chen; National Taiwan Normal University, Taiwan
Tue-Ses3-P4-10, Time: 16:00
Speech summarization, distilling important information and
removing redundant and incorrect information from spoken
documents, has become an active area of intensive research in
the recent past. In this paper, we consider hybrids of supervised
and unsupervised models for extractive speech summarization.
Moreover, we investigate the use of the unsupervised summarizer
to improve the performance of the supervised summarizer when
manual labels are not available for training the latter. A novel training data selection and relabeling approach designed to leverage the
inter-document or/and the inter-sentence similarity information is
explored as well. Encouraging results were initially demonstrated.
Automatic Detection of Audio Advertisements
I. Dan Melamed, Yeon-Jun Kim; AT&T Labs Research,
USA
Tue-Ses3-P4-11, Time: 16:00
Quality control analysts in customer service call centers often
search for keywords in call transcripts. Their searches can return
an overwhelming number of false positives when the search terms
also appear in advertisements that customers hear while they
are on hold. This paper presents new methods for detecting
advertisements in audio data, so that they can be filtered out. In
order to be usable in real-world applications, our methods are
designed to minimize human intervention after deployment. Even
so, they are much more accurate than a baseline HMM method.
Named Entity Network Based on Wikipedia
Sameer Maskey 1 , Wisam Dakka 2 ; 1 IBM T.J. Watson
Research Center, USA; 2 Google Inc., USA
Tue-Ses3-P4-12, Time: 16:00
Named Entities (NEs) play an important role in many natural
language and speech processing tasks. A resource that identifies
relations between NEs could potentially be very useful. We present
such automatically generated knowledge resource from Wikipedia,
Named Entity Network (NE-NET), that provides a list of related
Named Entities (NEs) and the degree of relation for any given NE.
Unlike some manually built knowledge resource, NE-NET has a wide
coverage consisting of 1.5 million NEs represented as nodes of a
graph with 6.5 million arcs relating them. NE-NET also provides
the ranks of the related NEs using a simple ranking function that
we propose. In this paper, we present NE-NET and our experiments
showing how NE-NET can be used to improve the retrieval of
spoken (Broadcast News) and text documents.
Tue-Ses3-S2 : Special Session: Measuring
the Rhythm of Speech
Ainsworth (East Wing 4), 16:00, Tuesday 8 Sept 2009
Chair: Daniel Hirst, LPL, France and Greg Kochanski, University of
Oxford, UK
The Rhythm of Text and the Rhythm of Utterances:
From Metrics to Models
Daniel Hirst; LPL, France
Tue-Ses3-S2-1, Time: 16:00
The typological classification of languages as stress-timed, syllable-timed
and mora-timed did not stand up to empirical investigation,
which found little or no evidence for the different types of
isochrony which had been assumed to be the basis for the classification.
In recent years, there has been a renewal of interest with
the development of empirical metrics for measuring rhythm. In
this paper it is shown that some of these metrics are more sensitive
to the rhythm of the text than to the rhythm of the utterance
itself. While a number of recent proposals have been made for
improving these metrics, it is proposed that what is needed is
more detailed studies of large corpora in order to develop more
sophisticated models of the way in which prosodic structure is
realised in different languages. New data on British English is
presented using the Aix-Marsec corpus.
Oral Presentation of Poster Papers
Time: 16:20
No Time to Lose? Time Shrinking Effects Enhance
the Impression of Rhythmic “Isochrony” and Fast
Speech Rate
Petra Wagner, Andreas Windmann; Universität
Bielefeld, Germany
Tue-Ses3-S2-2, Time: 16:40
Time Shrinking denotes the psycho-acoustic shrinking effect of a
short interval on one or several subsequent longer intervals. Its
effectiveness in the domain of speech perception has so far not
been examined. Two perception experiments clearly suggest the
influence of relative duration patterns triggering time shrinking on
the perception of tempo and rhythmical isochrony or rather “evenness”.
A comparison between the experimental data and duration
patterns across various languages suggests a strong influence
of time shrinking on the impression of isochrony in speech and
perceptual speech rate. Our results thus emphasize the necessity
of taking into account relative timing within rhythmical domains
such as feet, phrases or narrow rhythm units as a complementary
perspective to popular global rhythm variability metrics.
Measuring Speech Rhythm Variation in a
Model-Based Framework
Plínio A. Barbosa; State University of Campinas, Brazil
Tue-Ses3-S2-3, Time: 17:00
A coupled-oscillators-model-based method for measuring speech
rhythm is presented. This model explains cross-linguistic differences
in rhythm as deriving from varying degrees of coupling
strength between a syllable oscillator and a phrase stress oscillator.
The method was applied to three texts read aloud in French,
in Brazilian and European Portuguese by seven speakers. The
results reproduce the early findings on rhythm typology for these
languages/varieties with the following advantages: it successfully
accounts for speech rate variation, related to the syllabic oscillator
frequency in the model; it takes only syllable-sized units into
account, not splitting syllables into vowels and consonants; the
consequences of phrase stress magnitude on stress group duration
are directly considered; both universal and language-specific
aspects of speech rhythm are captured by the model.
Rhythm Measures with Language-Independent
Segmentation
Anastassia Loukina 1, Greg Kochanski 1, Chilin Shih 2,
Elinor Keane 1, Ian Watson 1; 1 University of Oxford, UK;
2 University of Illinois at Urbana-Champaign, USA
Tue-Ses3-S2-4, Time: 17:20
We compare 15 measures of speech rhythm based on an automatic
segmentation of speech into vowel-like and consonant-like regions.
This allows us to apply identical segmentation criteria to all
languages and to compute rhythm measures over a large corpus.
It may also approximate more closely the segmentation available
to pre-lexical infants, who apparently can discriminate between
languages. We find that within-language variation is large and
comparable to the between-languages differences we observed. We
evaluate the success of different measures in separating languages
and show that the efficiency of measures depends on the languages
included in the corpus. Rhythm appears to be described by two
dimensions and different published rhythm measures capture
different aspects of it.
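For readers unfamiliar with the metrics compared in this session, three of the most widely used ones (%V, deltaC and the normalised pairwise variability index) can be computed as follows; this is generic background, not the specific 15-measure set evaluated in the paper above.

    def rhythm_metrics(vowel_intervals, consonant_intervals):
        # Interval durations in seconds for one utterance.
        total = sum(vowel_intervals) + sum(consonant_intervals)
        percent_v = 100.0 * sum(vowel_intervals) / total
        mean_c = sum(consonant_intervals) / len(consonant_intervals)
        delta_c = (sum((d - mean_c) ** 2 for d in consonant_intervals)
                   / len(consonant_intervals)) ** 0.5
        npvi = 100.0 * sum(abs(a - b) / ((a + b) / 2.0)
                           for a, b in zip(vowel_intervals[:-1], vowel_intervals[1:])) \
               / (len(vowel_intervals) - 1)
        return percent_v, delta_c, npvi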
Panel Discussion
Time: 17:40
No abstract was available at the time of publication.
Investigating Changes in the Rhythm of Maori Over
Time
Margaret Maclagan 1, Catherine I. Watson 2, Jeanette
King 1, Ray Harlow 3, Laura Thompson 2, Peter
Keegan 2; 1 University of Canterbury, New Zealand;
2 University of Auckland, New Zealand; 3 University of
Waikato, New Zealand
Tue-Ses3-S2-5, Time: 18:00
Present-day Maori elders comment that the mita (which includes
rhythm) of the Maori language has changed over time. This paper
presents the first results in a study of the change of Maori rhythm.
PVI analyses did not capture this change. Perceptual experiments,
using extracts of speech low-pass filtered to 400 Hz, demonstrated
that Maori and English speech could be distinguished. Listeners
who spoke Maori were more accurate than those who spoke only
English. The English and Maori speech of groups of different
speakers born at different times was perceived differently, indicating
that the rhythm of Maori has indeed changed over time.
Effects of Mora-Timing in English Rhythm Control
by Japanese Learners
Shizuka Nakamura 1, Hiroaki Kato 2, Yoshinori
Sagisaka 1; 1 Waseda University, Japan; 2 NICT, Japan
Tue-Ses3-S2-6, Time: 18:00
In this paper, we analyzed the durational differences between
learners and native speakers in various speech units from the
perspective that the contrast between the stressed and the
unstressed is one of the most important features to characterize
stress-timing of English by comparison with mora-timing of
Japanese. The results showed that the lengthening and shortening
of learner speech were not enough to convey the difference
between the stressed and the unstressed. Finally, it was confirmed
that these durational differences strongly affected the subjective
evaluation scores given by English language teachers.
The Dynamic Dimension of the Global
Speech-Rhythm Attributes
Jan Volín 1, Petr Pollák 2; 1 Charles University in Prague,
Czech Republic; 2 Czech Technical University in Prague,
Czech Republic
Tue-Ses3-S2-7, Time: 18:00
Recent years have revealed that certain global attributes of speech
rhythm can be quite successfully captured with respect to consonantal
and vocalic intervals in spoken texts. One of the problems
of this approach lies in complex syllabic structures. Unless we
make an a-priori phonological decision, sonorous consonants may
contribute to either the vocalic or the consonantal part of the speech signal
in post-initial and pre-final positions of syllabic onsets and codas.
A procedure is offered to avoid phonological dilemmas together
with tedious manual work. The method is tested on continuous
Czech and English texts read out by several professionals.
Vowel Duration in Pre-Geminate Contexts in Polish
Zofia Malisz; Adam Mickiewicz University, Poland
Tue-Ses3-S2-8, Time: 18:00
The study presents Polish experimental data on the variability of
vowel duration in the context of following singleton and geminate
consonants. The aim of the study is to explain the low vocalic
variability values obtained from “rhythm metrics” based analyses
of speech rhythm. It also aims at contributing to the discussion
about current dynamical models of speech rhythm that contain
assumptions of the relative temporal stability of the vowel-to-vowel
sequence. The results suggest that vowels in Polish co-vary with
following consonant length in a roughly proportionate manner.
An interpretation of the effect is offered where a fortition process
overrides the possibility of temporal compensation.
Wed-Ses1-O1 : Speaker Verification &
Identification II
Main Hall, 10:00, Wednesday 9 Sept 2009
Chair: Steve Renals, University of Edinburgh, UK
Does Session Variability Compensation in Speaker
Recognition Model Intrinsic Variation Under
Mismatched Conditions?
Elizabeth Shriberg, Sachin Kajarekar, Nicolas Scheffer;
SRI International, USA
Wed-Ses1-O1-1, Time: 10:00
Intersession variability (ISV) compensation in speaker recognition
is well studied with respect to extrinsic variation, but little is
known about its ability to model intrinsic variation. We find
that ISV compensation is remarkably successful on a corpus of
intrinsic variation that is highly controlled for channel (a dominant
component of ISV). The results are particularly surprising because
the ISV training data come from a different corpus than do speaker
train and test data. We further find that relative improvements are
(1) inversely related to uncompensated performance, (2) reduced
more by vocal effort train/test mismatch than by speaking style
mismatch, and (3) reduced additionally for mismatches in both
style and level. Results demonstrate that intersession variability
compensation does model intrinsic variation, and suggest that
mismatched data may be more useful than previously expected for
modeling certain types of within-speaker variability in speech.
Variability Compensated Support Vector Machines
Applied to Speaker Verification
Zahi N. Karam, W.M. Campbell; MIT, USA
Wed-Ses1-O1-2, Time: 10:20
Speaker verification using SVMs has proven successful, specifically
using the GSV Kernel [1] with nuisance attribute projection (NAP)
[2]. Also, the recent popularity and success of joint factor analysis
[3] has led to promising attempts to use speaker factors directly
as SVM features [4]. NAP projection and the use of speaker factors
with SVMs are methods of handling variability in SVM speaker
Notes
111
verification: NAP by removing undesirable nuisance variability,
and using the speaker factors by forcing the discrimination to be
performed based on inter-speaker variability. These successes
have led us to propose a new method we call variability compensated SVM (VCSVM) to handle both inter and intra-speaker
variability directly in the SVM optimization. This is done by adding
a regularized penalty to the optimization that biases the normal
to the hyperplane to be orthogonal to the nuisance subspace or
alternatively to the complement of the subspace containing the
inter-speaker variability. This bias will attempt to ensure that interspeaker variability is used in the recognition while intra-speaker
variability is ignored. In this paper, we present the VCSVM theory
and promising results on nuisance compensation.
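For orientation, the nuisance attribute projection cited above ([2]) removes an estimated nuisance subspace from the SVM features before training. The sketch below illustrates only that standard projection, not the proposed VCSVM penalty itself; the random matrices stand in for GMM supervectors and the subspace rank is illustrative.

```python
# Sketch of standard nuisance attribute projection (NAP), the baseline the
# abstract builds on -- not the proposed VCSVM. Rows of X are toy supervectors;
# the nuisance subspace is taken from the principal directions of nuisance
# (e.g. within-speaker) variation. Dimensions are illustrative.
import numpy as np

def nap_projection(X_nuisance, k):
    """Estimate a rank-k nuisance subspace U and return P = I - U U^T."""
    _, _, Vt = np.linalg.svd(X_nuisance - X_nuisance.mean(axis=0),
                             full_matrices=False)
    U = Vt[:k].T                      # (dim, k) basis of the nuisance subspace
    return np.eye(U.shape[0]) - U @ U.T

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 64))        # toy GMM supervectors
P = nap_projection(X, k=10)
X_compensated = X @ P.T               # remove nuisance directions before SVM training
```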
Support Vector Machines versus Fast Scoring in the
Low-Dimensional Total Variability Space for Speaker
Verification
Najim Dehak 1, Réda Dehak 2, Patrick Kenny 1, Niko Brümmer 3, Pierre Ouellet 1, Pierre Dumouchel 1; 1 CRIM, Canada; 2 LRDE, France; 3 AGNITIO, South Africa
Wed-Ses1-O1-3, Time: 10:40
This paper presents a new speaker verification system architecture based on Joint Factor Analysis (JFA) as a feature extractor. In this modeling, JFA is used to define a new low-dimensional space, named the total variability factor space, instead of the separate channel and speaker variability spaces of classical JFA. The main contribution of this approach is the use of the cosine kernel in the new total factor space to design two different systems: the first is based on Support Vector Machines, and the second uses this kernel directly as a decision score. The latter scoring method makes the process faster and less computationally complex than other classical methods. We tested several intersession compensation methods in the total factor space and found that the combination of Linear Discriminant Analysis and Within-Class Covariance Normalization achieved the best performance. The fast scoring method based only on the cosine kernel achieved remarkable results, especially for male trials, yielding an EER of 1.12% and a MinDCF of 0.0094 on the English trials of the NIST 2008 SRE dataset.
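The fast scoring idea can be stated very compactly: the trial score is simply the cosine between the two total-factor vectors, optionally after a session-compensation projection. A minimal sketch follows, with random vectors and an illustrative threshold standing in for real factors and a tuned operating point; the projection matrix A (e.g. LDA followed by WCCN) is assumed to be estimated elsewhere.

```python
# Minimal sketch of cosine-kernel scoring in a total-variability factor space.
# No SVM training is needed at test time: the score is the cosine similarity
# of the (optionally projected) enrolment and test factor vectors.
import numpy as np

def cosine_score(w_enroll, w_test, A=None):
    if A is not None:                       # optional session-compensation projection
        w_enroll, w_test = A @ w_enroll, A @ w_test
    return float(w_enroll @ w_test /
                 (np.linalg.norm(w_enroll) * np.linalg.norm(w_test)))

score = cosine_score(np.random.randn(400), np.random.randn(400))
accept = score > 0.2                        # illustrative threshold tuned on dev data
```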
Within-Session Variability Modelling for Factor Analysis Speaker Verification
Robbie Vogt 1, Jason Pelecanos 2, Nicolas Scheffer 3, Sachin Kajarekar 3, Sridha Sridharan 1; 1 Queensland University of Technology, Australia; 2 IBM T.J. Watson Research Center, USA; 3 SRI International, USA
Wed-Ses1-O1-4, Time: 11:00
This work presents an extended Joint Factor Analysis model including explicit modelling of unwanted within-session variability. The goals of the proposed extended JFA model are to improve verification performance with short utterances by compensating for the effects of limited or imbalanced phonetic coverage, and to produce a flexible JFA model that is effective over a wide range of utterance lengths without adjusting model parameters such as retraining session subspaces. Experimental results on the 2006 NIST SRE corpus demonstrate the flexibility of the proposed model by providing competitive results over a wide range of utterance lengths without retraining and also yielding modest improvements in a number of conditions over current state-of-the-art.

Speaker Recognition by Gaussian Information Bottleneck
Ron M. Hecht 1, Elad Noor 2, Naftali Tishby 3; 1 Tel-Aviv University, Israel; 2 Weizmann Institute of Science, Israel; 3 Hebrew University, Israel
Wed-Ses1-O1-5, Time: 11:20
This paper explores a novel approach for the extraction of relevant information in speaker recognition tasks. This approach uses a principled information theoretic framework — the Information Bottleneck method (IB). In our application, the method compresses the acoustic data while preserving mostly the relevant information for speaker identification. This paper focuses on a continuous version of the IB method known as the Gaussian Information Bottleneck (GIB). This version assumes that both the source and target variables are high dimensional multivariate Gaussian variables. The GIB was applied in our work to the Super Vector (SV) dimension reduction conundrum. Experiments were conducted on the male part of the NIST SRE 2005 corpora. The GIB representation was compared to other dimension reduction techniques and to a baseline system. In our experiments, the GIB outperformed the baseline system, achieving a 6.1% Equal Error Rate (EER) compared to the 15.1% EER of a baseline system.

Variational Dynamic Kernels for Speaker Verification
C. Longworth, R.C. van Dalen, M.J.F. Gales; University of Cambridge, UK
Wed-Ses1-O1-6, Time: 11:40
An important aspect of SVM-based speaker verification is the choice of dynamic kernel. Recently there has been interest in the use of kernels based on the Kullback-Leibler divergence between GMMs. Since this has no closed-form solution, typically a matched-pair upper bound is used instead. This places significant restrictions on the forms of model structure that may be used. All GMMs must contain the same number of components and must be adapted from a single background model. For many tasks this will not be optimal. In this paper, dynamic kernels are proposed based on alternative, variational approximations to the KL divergence. Unlike the matched-pair bound, these do not restrict the forms of GMM that may be used. Additionally, using a more accurate approximation of the divergence may lead to performance gains. Preliminary results using these kernels are presented on the NIST 2002 SRE dataset.

Wed-Ses1-O2 : Emotion and Expression I
Jones (East Wing 1), 10:00, Wednesday 9 Sept 2009
Chair: Ailbhe Ní Chasaide, Trinity College Dublin, Ireland
Emotion Dimensions and Formant Position
Martijn Goudbeek 1, Jean Philippe Goldman 2, Klaus R. Scherer 3; 1 Tilburg University, The Netherlands; 2 Université de Genève, Switzerland; 3 Swiss Center for Affective Sciences, Switzerland
Wed-Ses1-O2-1, Time: 10:00
The influence of emotion on articulatory precision was investigated
in a newly established corpus of acted emotional speech. The frequencies of the first and second formant of the vowels /i/, /u/,
and /a/ were measured and shown to be significantly affected by
emotion dimension. High arousal resulted in a higher mean F1 in
all vowels, whereas positive valence resulted in higher mean values
for F2. The dimension potency/control showed a pattern of effects
that was consistent with a larger vocalic triangle for emotions high
in potency/control. The results are interpreted in the context of
Scherer’s component process model.
Identifying Uncertain Words Within an Utterance via Prosodic Features
Heather Pon-Barry, Stuart Shieber; Harvard University, USA
Wed-Ses1-O2-2, Time: 10:20
We describe an experiment that investigates whether sub-utterance prosodic features can be used to detect uncertainty at the word level. That is, given an utterance that is classified as uncertain, we want to determine which word or phrase the speaker is uncertain about. We have a corpus of utterances spoken under varying degrees of certainty. Using combinations of sub-utterance prosodic features we train models to predict the level of certainty of an utterance. On a set of utterances that were perceived to be uncertain, we compare the predictions of our models for two candidate ‘target word’ segmentations: (a) one with the actual word causing uncertainty as the proposed target word, and (b) one with a control word as the proposed target word. Our best model correctly identifies the word causing the uncertainty rather than the control word 91% of the time.

Evaluating Evaluators: A Case Study in Understanding the Benefits and Pitfalls of Multi-Evaluator Modeling
Emily Mower, Maja J. Matarić, Shrikanth S. Narayanan; University of Southern California, USA
Wed-Ses1-O2-3, Time: 10:40
Emotion perception is a complex process, often measured using stimuli presentation experiments that query evaluators for their perceptual ratings of emotional cues. These evaluations contain large amounts of variability both related and unrelated to the evaluated utterances. One approach to handling this variability is to model emotion perception at the individual level. However, the perceptions of specific users may not adequately capture the emotional acoustic properties of an utterance. This problem can be mitigated by the common technique of averaging evaluations from multiple users. We demonstrate that this averaging procedure improves classification performance when compared to classification results from models created using individual-specific evaluations. We also demonstrate that the performance increases are related to the consistency with which evaluators label data. These results suggest that the acoustic properties of emotional speech are better captured using models formed from averaged evaluations rather than from individual-specific evaluations.

Responding to User Emotional State by Adding Emotional Coloring to Utterances
Jaime C. Acosta, Nigel G. Ward; University of Texas at El Paso, USA
Wed-Ses1-O2-4, Time: 11:00
When people speak to each other, they share a rich set of nonverbal behaviors such as varying prosody in voice. These behaviors, sometimes interpreted as demonstrations of emotions, call for appropriate responses, but today’s spoken dialog systems lack the ability to do so. We collected a corpus of persuasive dialogs, specifically conversations about graduate school between a staff member and students, and had judges label all utterances with triples indicating the perceived emotions, using the three dimensions: activation, evaluation, and power. We found immediate response patterns, in which the staff member colored her utterances in response to the emotion shown by the student in the immediately previous utterance, and built a predictive model suitable for use in a dialog system to persuasively discuss graduate school with students.

Analysis of Laugh Signals for Detecting in Continuous Speech
Sudheer Kumar K. 1, Sri Harish Reddy M. 1, Sri Rama Murty K. 2, B. Yegnanarayana 1; 1 IIIT Hyderabad, India; 2 IIT Madras, India
Wed-Ses1-O2-5, Time: 11:20
Laughter is a nonverbal vocalization that occurs often in speech communication. Since laughter is produced by the speech production mechanism, spectral analysis methods are used mostly for the study of laughter acoustics. In this paper the significance of excitation features for discriminating laughter and speech is discussed. New features describing the excitation characteristics are used to analyze the laugh signals. The features are based on instantaneous pitch and strength of excitation at epochs. An algorithm is developed based on these features to detect laughter regions in continuous speech. The results are illustrated by detecting laughter regions in a TV broadcast program.

Data-Driven Clustering in Emotional Space for Affect Recognition Using Discriminatively Trained LSTM Networks
Martin Wöllmer 1, Florian Eyben 1, Björn Schuller 1, Ellen Douglas-Cowie 2, Roddy Cowie 2; 1 Technische Universität München, Germany; 2 Queen’s University Belfast, UK
Wed-Ses1-O2-6, Time: 11:40
In today’s affective databases speech turns are often labelled on a continuous scale for emotional dimensions such as valence or arousal to better express the diversity of human affect. However, applications like virtual agents usually map the detected emotional user state to rough classes in order to reduce the multiplicity of emotion dependent system responses. Since these classes often do not optimally reflect emotions that typically occur in a given application, this paper investigates data-driven clustering of emotional space to find class divisions that better match the training data and the area of application. Thereby we consider the Belfast Sensitive Artificial Listener database and TV talkshow data from the VAM corpus. We show that a discriminatively trained Long Short-Term Memory (LSTM) recurrent neural net that explicitly learns clusters in emotional space and additionally models context information outperforms both Support Vector Machines and a Regression-LSTM net.

Wed-Ses1-O3 : Automatic Speech Recognition: Adaptation II
Fallside (East Wing 2), 10:00, Wednesday 9 Sept 2009
Chair: Satoshi Nakamura, NICT, Japan

On the Estimation and the Use of Confusion-Matrices for Improving ASR Accuracy
Omar Caballero Morales, Stephen J. Cox; University of East Anglia, UK
Wed-Ses1-O3-1, Time: 10:00
In previous work, we described how learning the pattern of recognition errors made by an individual using a certain ASR system leads
to increased recognition accuracy compared with a standard MLLR
adaptation approach. This was the case for low-intelligibility speakers with dysarthric speech, but no improvement was observed for
normal speakers. In this paper, we describe an alternative method
for obtaining the training data for confusion-matrix estimation
for normal speakers which is more effective than our previous
technique. We also address the issue of data sparsity in estimation
of confusion-matrices by using non-negative matrix factorization
(NMF) to discover structure within them. The confusion-matrix
estimates made using these techniques are integrated into the
ASR process using a technique termed “metamodels”, and the
results presented here show statistically significant gains in word
recognition accuracy when applied to normal speech.
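As an illustration of the smoothing role NMF can play here, the sketch below factorises a toy confusion-count matrix into a low-rank product so that sparse cells are filled from the discovered structure; the paper's actual estimation procedure and metamodel integration are not reproduced, and all data are synthetic.

```python
# Hedged sketch: smooth a sparse phone confusion-count matrix with NMF.
# The low-rank reconstruction W @ H shares statistics across phones, which
# mitigates data sparsity when only a few adaptation utterances are available.
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
counts = rng.poisson(0.5, size=(40, 40)).astype(float)   # toy confusion counts
counts[np.arange(40), np.arange(40)] += 20.0              # strong diagonal (correct phones)

model = NMF(n_components=8, init="nndsvda", max_iter=500)
W = model.fit_transform(counts)                            # 40 x 8
H = model.components_                                      # 8 x 40
smoothed = W @ H                                           # low-rank reconstruction
probs = smoothed / smoothed.sum(axis=1, keepdims=True)     # row-normalise: P(recognised | spoken)
```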
A Study on Soft Margin Estimation of Linear Regression Parameters for Speaker Adaptation
Shigeki Matsuda 1, Yu Tsao 1, Jinyu Li 2, Satoshi Nakamura 1, Chin-Hui Lee 3; 1 NICT, Japan; 2 Microsoft Corporation, USA; 3 Georgia Institute of Technology, USA
Wed-Ses1-O3-2, Time: 10:00
We formulate a framework for soft margin estimation-based linear regression (SMELR) and apply it to supervised speaker adaptation. Enhanced separation capability and increased discriminative ability are two key properties in margin-based discriminative training. For the adaptation process to be able to flexibly utilize any amount of data, we also propose a novel interpolation scheme to linearly combine the speaker independent (SI) and speaker adaptive SMELR (SMELR/SA) models. The two proposed SMELR algorithms were evaluated on a Japanese large vocabulary continuous speech recognition task. Both the SMELR and interpolated SI+SMELR/SA techniques showed improved speech adaptation performance in comparison with the well-known maximum likelihood linear regression (MLLR) method. We also found that the interpolation framework works even more effectively than SMELR when the amount of adaptation data is relatively small.

Exploring the Role of Spectral Smoothing in Context of Children’s Speech Recognition
Shweta Ghai, Rohit Sinha; IIT Guwahati, India
Wed-Ses1-O3-3, Time: 10:00
This work is motivated by our earlier study which shows that on explicit pitch normalization the children’s speech recognition performance on the adults’ speech trained models improves as a result of reduction in the pitch-dependent distortions in the spectral envelope. In this paper, we study the role of spectral smoothing in context of children’s speech recognition. The spectral smoothing has been effected in the feature domain by two approaches viz., modification of bandwidth of the filters in the filterbank and cepstral truncation. In conjunction, both approaches give significant improvement in the children’s speech recognition performance with 57% relative improvement over the baseline. Also, when combined with the widely used vocal tract length normalization (VTLN), these spectral smoothing approaches result in an additional 25% relative improvement over the VTLN performance for children’s speech recognition on the adults’ speech trained models.

Unsupervised Lattice-Based Acoustic Model Adaptation for Speaker-Dependent Conversational Telephone Speech Transcription
K. Thambiratnam, F. Seide; Microsoft Research Asia, China
Wed-Ses1-O3-4, Time: 10:00
This paper examines the application of lattice adaptation techniques to speaker-dependent models for the purpose of conversational telephone speech transcription. Given sufficient training data per speaker, it is feasible to build adapted speaker-dependent models using lattice MLLR and lattice MAP. Experiments on iterative and cascaded adaptation are presented. Additionally various strategies for thresholding frame posteriors are investigated, and it is shown that accumulating statistics from the local best-confidence path is sufficient to achieve optimal adaptation. Overall, an iterative cascaded lattice system was able to reduce WER by 7.0% abs., which was a 0.8% abs. gain over transcript-based adaptation. Lattice adaptation reduced the unsupervised/supervised adaptation gap from 2.5% to 1.7%.

Rapid Unsupervised Adaptation Using Frame Independent Output Probabilities of Gender and Context Independent Phoneme Models
Satoshi Kobashikawa, Atsunori Ogawa, Yoshikazu Yamaguchi, Satoshi Takahashi; NTT Corporation, Japan
Wed-Ses1-O3-5, Time: 10:20
Business is demanding higher recognition accuracy with no increase in computation time compared to previously adopted baseline speech recognition systems. Accuracy can be improved by adding a gender dependent acoustic model and unsupervised adaptation based on CMLLR (Constrained Maximum Likelihood Linear Regression). CMLLR-based batch-type unsupervised adaptation estimates a single global transformation matrix by utilizing prior unsupervised labeling, which unfortunately increases the computation time. Our proposed technique reduces prior gender selection and labeling time by using frame independent output probabilities of only gender dependent speech GMM (Gaussian Mixture Model) and context independent phoneme (monophone) HMM (Hidden Markov Model) in dual-gender acoustic models. The proposed technique further raises accuracy by employing a power term after adaptation. Simulations using spontaneous speech show that the proposed technique reduces computation time by 17.9% and the relative error in correct rate by 13.7% compared to the baseline without prior gender selection and unsupervised adaptation.

Bark-Shift Based Nonlinear Speaker Normalization Using the Second Subglottal Resonance
Shizhen Wang, Yi-Hui Lee, Abeer Alwan; University of California at Los Angeles, USA
Wed-Ses1-O3-6, Time: 11:00
In this paper, we propose a Bark-scale shift based piecewise nonlinear warping function for speaker normalization, and a joint frequency discontinuity and energy attenuation detection algorithm to estimate the second subglottal resonance (Sg2). We then apply Sg2 for rapid speaker normalization. Experimental results on children’s speech recognition show that the proposed nonlinear warping function is more effective for speaker normalization than linear frequency warping. Compared to maximum likelihood based grid search methods, Sg2 normalization is more efficient and achieves comparable or better performance, especially for limited normalization data.
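For contrast with the proposed nonlinear function, the conventional piecewise-linear (VTLN-style) warping that such work typically compares against can be written in a few lines; the break-point ratio and warp factor below are illustrative, and this is not the paper's Bark-shift function.

```python
# Generic piecewise-linear VTLN-style frequency warping (the linear baseline
# the paper compares against, not the proposed Bark-shift warp). alpha > 1
# compresses, e.g., child speech spectra toward adult-trained models.
import numpy as np

def piecewise_linear_warp(f, alpha, f_nyquist=8000.0, f_break_ratio=0.8):
    """Warp frequency f by factor alpha below the break point, then map the
    remaining band linearly so that f_nyquist is kept fixed."""
    f = np.asarray(f, dtype=float)
    f_break = f_break_ratio * f_nyquist
    low = alpha * f
    high = alpha * f_break + (f - f_break) * (f_nyquist - alpha * f_break) / (f_nyquist - f_break)
    return np.where(f <= f_break, low, high)

print(piecewise_linear_warp(np.linspace(0, 8000, 5), alpha=1.1))
```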
Wed-Ses1-O4 : Voice Transformation I
Holmes (East Wing 3), 10:00, Wednesday 9 Sept 2009
Chair: Yannis Stylianou, FORTH, Greece

Many-to-Many Eigenvoice Conversion with Reference Voice
Yamato Ohtani, Tomoki Toda, Hiroshi Saruwatari, Kiyohiro Shikano; NAIST, Japan
Wed-Ses1-O4-1, Time: 10:00
In this paper, we propose many-to-many voice conversion (VC) techniques to convert an arbitrary source speaker’s voice into an arbitrary target speaker’s voice. We have proposed one-to-many eigenvoice conversion (EVC) and many-to-one EVC. In the EVC, an eigenvoice Gaussian mixture model (EV-GMM) is trained in advance using multiple parallel data sets of a reference speaker and many pre-stored speakers. The EV-GMM is flexibly adapted to an arbitrary speaker using a small amount of adaptation data without any linguistic constraints. In this paper, we achieve many-to-many VC by sequentially performing many-to-one EVC and one-to-many EVC through the reference speaker using the same EV-GMM. Experimental results demonstrate the effectiveness of the proposed many-to-many VC.

Alleviating the One-to-Many Mapping Problem in Voice Conversion with Context-Dependent Modeling
Elizabeth Godoy 1, Olivier Rosec 1, Thierry Chonavel 2; 1 Orange Labs, France; 2 Telecom Bretagne, France
Wed-Ses1-O4-2, Time: 10:20
This paper addresses the “one-to-many” mapping problem in Voice Conversion (VC) by exploring source-to-target mappings in GMM-based spectral transformation. Specifically, we examine differences using source-only versus joint source/target information in the classification stage of transformation, effectively illustrating a “one-to-many effect” in the traditional acoustically-based GMM. We propose combating this effect by using phonetic information in the GMM learning and classification. We then show the success of our proposed context-dependent modeling with transformation results using an objective error criterion. Finally, we discuss implications of our work in adapting current approaches to VC.

Efficient Modeling of Temporal Structure of Speech for Applications in Voice Transformation
Binh Phu Nguyen, Masato Akagi; JAIST, Japan
Wed-Ses1-O4-3, Time: 10:40
Aims of voice transformation are to change styles of given utterances. Most voice transformation methods process speech signals in a time-frequency domain. In the time domain, when processing spectral information, conventional methods do not consider relations between neighboring frames. If unexpected modifications happen, there are discontinuities between frames, which lead to the degradation of the transformed speech quality. This paper proposes a new modeling of temporal structure of speech to ensure the smoothness of the transformed speech for improving the quality of transformed speech in the voice transformation. In our work, we propose an improvement of the temporal decomposition (TD) technique, which decomposes a speech signal into event targets and event functions, to model the temporal structure of speech. The TD is used to control the spectral dynamics and to ensure the smoothness of transformed speech. We investigate the TD in two applications, concatenative speech synthesis and spectral voice conversion. Experimental results confirm the effectiveness of TD in terms of improving the quality of the transformed speech.

Cross-Language Voice Conversion Based on Eigenvoices
Malorie Charlier 1, Yamato Ohtani 1, Tomoki Toda 1, Alexis Moinet 2, Thierry Dutoit 2; 1 NAIST, Japan; 2 Faculté Polytechnique de Mons, Belgium
Wed-Ses1-O4-4, Time: 11:00
This paper presents a novel cross-language voice conversion (VC) method based on eigenvoice conversion (EVC). Cross-language VC is a technique for converting voice quality between two speakers uttering different languages each other. In general, parallel data consisting of utterance pairs of those two speakers are not available. To deal with this problem, we apply EVC to cross-language VC. First, we train an eigenvoice GMM (EV-GMM) using many parallel data sets by a source speaker and many pre-stored other speakers who can utter the same language as the source speaker. And then, the conversion model between the source speaker and a target speaker who cannot utter the source speaker’s language is developed by adapting the EV-GMM using a few arbitrary sentences uttered by the target speaker in a different language. The experimental results demonstrate that the proposed method yields significant performance improvements in both speech quality and conversion accuracy for speaker individuality compared with a conventional cross-language VC method based on frame selection.

Voice Conversion Using K-Histograms and Frame Selection
Alejandro José Uriz 1, Pablo Daniel Agüero 1, Antonio Bonafonte 2, Juan Carlos Tulli 1; 1 Universidad Nacional de Mar del Plata, Argentina; 2 Universitat Politècnica de Catalunya, Spain
Wed-Ses1-O4-5, Time: 11:20
The goal of voice conversion systems is to modify the voice of a source speaker to be perceived as if it had been uttered by another specific speaker. Many approaches found in the literature work based on statistical models and introduce an oversmoothing in the target features. Our proposal is a new model that combines several techniques used in unit selection for text-to-speech and a non-gaussian transformation mathematical model. Subjective results support the proposed approach.

Online Model Adaptation for Voice Conversion Using Model-Based Speech Synthesis Techniques
Dalei Wu 1, Baojie Li 1, Hui Jiang 1, Qian-Jie Fu 2; 1 York University, Canada; 2 House Ear Institute, USA
Wed-Ses1-O4-6, Time: 11:40
In this paper, we present a novel voice conversion method using model-based speech synthesis that can be used for some applications where prior knowledge or training data is not available from the source speaker. In the proposed method, training data from a target speaker is used to build a GMM-based speech model and voice conversion is then performed for each utterance from the source speaker according to the pre-trained target speaker model. To reduce the mismatch between source and target speakers, online model adaptation is proposed to improve model selection accuracy, based on maximum likelihood linear regression (MLLR). Objective and subjective evaluations suggest that the proposed methods are quite effective in generating acceptable voice quality for voice conversion even without training data from source speakers.
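Several of the systems above build on the textbook joint-GMM regression for spectral conversion, in which a source frame is mapped through posterior-weighted conditional means. The sketch below shows that generic mapping with random placeholder parameters; it is not any single paper's implementation, and the dimensions and component count are illustrative.

```python
# Textbook joint-GMM regression for voice conversion: convert a source frame x
# via E[y|x] = sum_m P(m|x) * (mu_y_m + Sigma_yx_m Sigma_xx_m^-1 (x - mu_x_m)).
# All model parameters below are random placeholders, not trained values.
import numpy as np
from scipy.stats import multivariate_normal

D, M = 24, 4                                   # feature dim, mixture components
rng = np.random.default_rng(1)
weights = np.full(M, 1.0 / M)
mu_x, mu_y = rng.normal(size=(M, D)), rng.normal(size=(M, D))
Sigma_xx = np.stack([np.eye(D) for _ in range(M)])
Sigma_yx = np.stack([0.3 * np.eye(D) for _ in range(M)])

def convert(x):
    post = np.array([w * multivariate_normal.pdf(x, m, S)
                     for w, m, S in zip(weights, mu_x, Sigma_xx)])
    post /= post.sum()                         # component posteriors P(m|x)
    return sum(p * (my + Syx @ np.linalg.solve(Sxx, x - mx))
               for p, mx, my, Sxx, Syx in zip(post, mu_x, mu_y, Sigma_xx, Sigma_yx))

y_hat = convert(rng.normal(size=D))            # converted (target-like) frame
```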
Wed-Ses1-P1 : Phonetics, Phonology, Cross-Language Comparisons, Pathology
Hewison Hall, 10:00, Wednesday 9 Sept 2009
Chair: Valerie Hazan, University College London, UK
Fast Transcription of Unstructured Audio
Recordings
Brandon C. Roy, Deb Roy; MIT, USA
Wed-Ses1-P1-1, Time: 10:00
We introduce a new method for human-machine collaborative
speech transcription that is significantly faster than existing transcription methods. In this approach, automatic audio processing
algorithms are used to robustly detect speech in audio recordings
and split speech into short, easy to transcribe segments. Sequences
of speech segments are loaded into a transcription interface that
enables a human transcriber to simply listen and type, obviating
the need for manually finding and segmenting speech or explicitly
controlling audio playback. As a result, playback stays synchronized to the transcriber’s speed of transcription. In evaluations
using naturalistic audio recordings made in everyday home situations, the new method is up to 6 times faster than other popular
transcription tools while preserving transcription quality.
Finding Allophones: An Evaluation on Consonants
in the TIMIT Corpus
Timothy Kempton, Roger K. Moore; University of
Sheffield, UK
Wed-Ses1-P1-2, Time: 10:00
Phonemic analysis, the process of identifying the contrastive
sounds in a language, involves finding allophones; phonetic
variants of those contrastive sounds. An algorithm for finding
allophones (developed by Peperkamp et al.) is evaluated on
consonants in the TIMIT acoustic phonetic transcripts. A novel
phonetic filter based on the active articulator is introduced and
has a higher recall than previous filters. The combined retrieval
performance, measured by area under the ROC curve, is 83%. The
system implemented can process any language transcribed in IPA
and is currently being used to assist the phonemic analysis of
unwritten languages.
Investigating Phonetic Information Reduction and Lexical Confusability
William Hartmann, Eric Fosler-Lussier; Ohio State University, USA
Wed-Ses1-P1-4, Time: 10:00
In the presence of pronunciation variation and the masking effects of additive noise, we investigate the role of phonetic information reduction and lexical confusability on ASR performance. Contrary to previous work [1], we show that place of articulation as a representation for unstressed segments performs at least as well as manner of articulation in the presence of additive noise. Methods of phonetic reduction introduce lexical confusability, which negatively impacts performance. By limiting this confusability, recognizers that employ high levels of phonetic reduction (40.1%) can perform as well as a baseline system in the presence of nonstationary noise.
Improving Phone Recognition Performance via
Phonetically-Motivated Units
Hyejin Hong, Minhwa Chung; Seoul National
University, Korea
Wed-Ses1-P1-5, Time: 10:00
This paper examines how phonetically-motivated units affect the
performance of phone recognition systems. Focusing on the realization of /h/, which is one of the most frequently error-making
phones in Korean phone recognition, three different phone sets are
designed by considering optional phonetic constraints which show
complementary distributions. Experimental results show that one
of the proposed sets, the h-deletion set improves phone recognition
performance compared to the baseline phone recognizer. It is
noteworthy that this set needs no additional phonetic unit, which
means that no more HMM is necessary to be modeled, accordingly
it has the advantage in terms of model size. Besides, it obtains
competent performance compared to the baseline system in terms
of word recognition as well. Thus, this phonetically-motivated
approach dealing with improvement of phone recognition performance is expected to be used in embedded solutions which require
fast and light recognition process.
An Evaluation of Formant Tracking Methods on an Arabic Database
Imen Jemaa 1, Oussama Rekhis 1, Kaïs Ouni 1, Yves Laprie 2; 1 Ecole Nationale d’Ingénieurs de Tunis, Tunisia; 2 LORIA, France
Wed-Ses1-P1-3, Time: 10:00
In this paper we present a formant database of Arabic used to evaluate our new automatic formant tracking algorithm based on Fourier ridges detection. In this method we have introduced a continuity constraint based on the computation of centres of gravity for a set of formant candidates. This leads to connect a frame of speech to its neighbours and thus improves the robustness of tracking. The formant trajectories obtained by the algorithm proposed are compared to those of the hand edited formant database and those given by Praat with LPC data.

Automatic Formant Extraction for Sociolinguistic Analysis of Large Corpora
Keelan Evanini, Stephen Isard, Mark Liberman; University of Pennsylvania, USA
Wed-Ses1-P1-6, Time: 10:00
In this paper, we propose a method of formant prediction from pole and bandwidth data, and apply this method to automatically extract F1 and F2 values from a corpus of regional dialect variation in North America that contains 134,000 manual formant measurements. These predicted formants are shown to increase performance over the default formant values from a popular speech analysis package. Finally, we demonstrate that sociolinguistic analysis based on vowel formant data can be conducted reliably using the automatically predicted values, and we argue that sociolinguists should begin to use this methodology in order to be able to analyze larger amounts of data efficiently.
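Both formant studies start from the same basic machinery: formant candidates are the resonances (complex pole pairs) of an all-pole fit to a windowed frame, with frequency and bandwidth derived from each pole's angle and radius. A minimal sketch follows, assuming librosa is available for the LPC fit; the candidate selection and continuity constraints described above are not reproduced.

```python
# Minimal LPC-root formant candidate extractor (generic sketch, not the
# algorithms evaluated above). Pole angle gives the resonance frequency,
# pole radius gives the bandwidth: B = -(fs/pi) * ln|r|.
import numpy as np
import librosa                       # assumed available for the LPC fit

def formant_candidates(frame, sr, order=12):
    a = librosa.lpc(frame * np.hamming(len(frame)), order=order)
    roots = [r for r in np.roots(a) if np.imag(r) > 0]
    freqs = np.angle(roots) * sr / (2 * np.pi)
    bws = -sr / np.pi * np.log(np.abs(roots))
    # keep plausible candidates: positive frequency, reasonably narrow bandwidth
    return sorted(f for f, b in zip(freqs, bws) if 90 < f < sr / 2 - 50 and b < 400)

sr = 10000
t = np.arange(0, 0.03, 1 / sr)
frame = (np.sin(2 * np.pi * 500 * t) + 0.5 * np.sin(2 * np.pi * 1500 * t)
         + 0.01 * np.random.randn(len(t)))
print(formant_candidates(frame, sr))   # expect candidates near 500 and 1500 Hz
```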
Comparison of Manual and Automated Estimates of
Subglottal Resonances
Wolfgang Wokurek, Andreas Madsack; Universität
Stuttgart, Germany
Wed-Ses1-P1-7, Time: 10:00
This study compares manual measurements of the first two subglottal resonances to the results of an automated measurement
procedure for the same quantities. We also briefly sketch the sensor prototype that is used for the measurements. The subglottal
resonances are presented in the space spanned by the vowels’
first two formants. A three axis acceleration sensor is gently
pressed at the neck of the speaker. In front of the ligamentum
conicum, located near the lower end of the larynx, pressure signals
may be recorded that follow the subglottal pressure changes at
least up to 2 kHz bandwidth. The recordings of the subglottal
pressure signals are made simultaneously with recordings of the
electroglottogram and the acoustic speech sound with 12 male and
12 female speakers.
Using Durational Cues in a Computational Model of
Spoken-Word Recognition
Odette Scharenborg; Radboud Universiteit Nijmegen,
The Netherlands
Wed-Ses1-P1-8, Time: 10:00
Evidence that listeners use durational cues to help resolve temporarily ambiguous speech input has accumulated over the past
few years. In this paper, we investigate whether durational cues are
also beneficial for word recognition in a computational model of
spoken-word recognition. Two sets of simulations were carried out
using the acoustic signal as input. The simulations showed that the
computational model, like humans, takes benefit from durational
cues during word recognition, and uses these to disambiguate the
speech signal. These results thus provide support for the theory
that durational cues play a role in spoken-word recognition.
Second Language Discrimination Vowel Contrasts
by Adults Speakers with a Five Vowel System
Bianca Sisinni, Mirko Grimaldi; Università del Salento,
Italy
Wed-Ses1-P1-9, Time: 10:00
This study tests the ability of a group of Salento Italian undergraduate students that have been exposed to L2 in a scholastic context
to perceive British English second language (L2) vowel phonemes.
The aim is to verify if the Perceptual Assimilation Model could
be applied to them. In order to test their ability to perceive L2
phonemes, subjects have executed an identification and an oddity
discrimination test. The results indicated that the L2 discrimination
processes are in line with those predicted by the PAM, supporting
the idea that students with a formal L2 background are still naïve
listeners to the L2.
Three-Way Laryngeal Categorization of Japanese, French, English and Chinese Plosives by Korean Speakers
Tomohiko Ooigawa, Shigeko Shinohara; Sophia University, Japan
Wed-Ses1-P1-10, Time: 10:00
Korean has a three-way laryngeal contrast in oral stops. This paper reports perception patterns of plosives of Japanese, French, English and Chinese by Korean speakers. In Korean loanwords, laryngeal contrasts of Japanese, French, and English plosives show distinct patterns. To test whether perception explains the loanword patterns, we selected languages with different acoustic properties and carried out perception tests. Our results reveal discrepancies between the phonological adaptation and the acoustic perception patterns.

The Effect of F0 Peak-Delay on the L1 / L2 Perception of English Lexical Stress
Shinichi Tokuma 1, Yi Xu 2; 1 Chuo University, Japan; 2 University College London, UK
Wed-Ses1-P1-11, Time: 10:00
This study investigated the perceptual effect of F0 peak-delay on L1 / L2 perception of English lexical stress. A bisyllabic English non-word ‘nini’ /nInI/ whose F0 was set to reach its peak in the second syllable was embedded in a frame sentence and used as the stimulus of the perceptual experiment. Native English and Japanese speakers were asked to determine lexical stress locations in the experiment. The results showed that in the perception of English lexical stress, delayed F0 peaks which were aligned with the second syllable of the stimulus words perceptually affected Japanese and English groups in the same manner: both groups perceived the delayed F0 peaks as a cue to lexical stress in the first syllable when the peaks were aligned with, or before, the end of /n/ in the second syllable. A supplementary experiment conducted on Japanese speakers confirmed the location of the categorical boundary. These findings are supported by the data provided by previous studies on L1 acoustic analysis and on L1 / L2 perception of intonation.

Lexical Tone Production by Cantonese Speakers with Parkinson’s Disease
Joan Ka-Yin Ma; Technische Universität Dresden, Germany
Wed-Ses1-P1-12, Time: 10:00
The aim of this study was to investigate lexical tone production in Cantonese speakers associated with Parkinson’s disease (PD speakers). The effect of intonation on the production of lexical tone was also examined. Speech data was collected from five Cantonese PD speakers. Speech materials consisted of targets contrasting in tones, embedded in different sentence contexts (initial, medial and final) and intonations (statements and questions). Analysis of the normalized F0 patterns showed that PD speakers contrasted the six lexical tones in similar manner as compared with control speakers across positions and intonations, except at the final position of questions. Significantly lower F0 values were found at the 75% and 100% time points of the final syllable of questions for the PD speakers than for the control speakers, indicating that intonation has a smaller influence on the F0 patterns of lexical tones for PD speakers than control speakers. The results of this study supported the previous claim of differential control for intonation and tone.

Acoustic Cues of Palatalisation in Plosive + Lateral Onset Clusters
Daniela Müller 1, Sidney Martin Mota 2; 1 CLLE-ERSS, France; 2 Escola Oficial d’Idiomes de Tarragona, Spain
Wed-Ses1-P1-13, Time: 10:00
Palatalisation of /l/ in obstruent + lateral onset clusters in the
absence of a following palatal sound has received a considerable
amount of attention from historical linguistics. The phonetics of
its development, however, remains less well-investigated. This
paper aims at studying the acoustic cues that could have led
plosive + lateral onset clusters to develop palatalisation. It is found
that onset clusters with velar plosives favour palatalisation more
than labial + lateral clusters, and that a high degree of darkness
diminishes the likelihood of palatalisation to take place.
Wed-Ses1-P2 : Prosody Perception and Language Acquisition
Hewison Hall, 10:00, Wednesday 9 Sept 2009
Chair: David House, KTH, Sweden
Perception of English Compound vs. Phrasal Stress:
Natural vs. Synthetic Speech
Irene Vogel, Arild Hestvik, H. Timothy Bunnell, Laura
Spinu; University of Delaware, USA
Wed-Ses1-P2-1, Time: 10:00
The ability of listeners to distinguish between compound and
phrasal stress in English was examined on the basis of a picture
selection task. The responses to naturally and synthetically produced stimuli were compared. While greater overall accuracy was
observed with the natural stimuli, the same pattern of greater
accuracy with compound stress than with phrasal stress was
observed with both types of stimuli.
New Method for Delexicalization and its Application
to Prosodic Tagging for Text-to-Speech Synthesis
Martti Vainio 1, Antti Suni 1, Tuomo Raitio 2, Jani Nurminen 3, Juhani Järvikivi 4, Paavo Alku 2; 1 University of Helsinki, Finland; 2 Helsinki University of Technology, Finland; 3 Nokia Devices R&D, Finland; 4 Max Planck Institute for Psycholinguistics, The Netherlands
Wed-Ses1-P2-2, Time: 10:00
This paper describes a new flexible delexicalization method based
on glottal excited parametric speech synthesis scheme. The system
utilizes inverse filtered glottal flow and all-pole modelling of
the vocal tract. The method provides a possibility to retain and
manipulate all relevant prosodic features of any kind of speech.
Most importantly, the features include voice quality, which has
not been properly modeled in earlier delexicalization methods.
The functionality of the new method was tested in a prosodic
tagging experiment aimed at providing word prominence data
for a text-to-speech synthesis system. The experiment confirmed
the usefulness of the method and further corroborated earlier
evidence that linguistic factors influence the perception of prosodic
prominence.
Speech Rate and Pauses in Non-Native Finnish
Minnaleena Toivola, Mietta Lennes, Eija Aho; University of Helsinki, Finland
Wed-Ses1-P2-3, Time: 10:00
In this study, the temporal aspects of speech are compared in
read-aloud Finnish produced by six native and 16 non-native
speakers. It is shown that the speech and articulation rates as well
as pause durations are different for native and non-native speakers. Moreover, differences exist between the groups of speakers
representing four different non-native languages. Surprisingly,
the native Finnish speakers tend to make longer pauses than the
non-natives. The results are relevant when developing methods for
assessing fluency or the strength of foreign accent.
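The two rate measures compared in this kind of study are conventionally defined as follows (the paper's exact segmentation criteria are not shown): speech rate counts syllables over total duration including pauses, while articulation rate excludes pause time. A minimal sketch with invented numbers:

```python
# Conventional definitions of speech rate vs. articulation rate (syllables/s).
def speech_and_articulation_rate(n_syllables, total_dur_s, pause_durs_s):
    speech_rate = n_syllables / total_dur_s                     # pauses included
    articulation_rate = n_syllables / (total_dur_s - sum(pause_durs_s))  # pauses excluded
    return speech_rate, articulation_rate

# e.g. 42 syllables in a 14 s read passage containing 3.5 s of pauses
print(speech_and_articulation_rate(42, 14.0, [1.2, 0.8, 1.5]))
```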
Modelling Similarity Perception of Intonation
Uwe D. Reichel, Felicitas Kleber, Raphael Winkelmann; Technische Universität München, Germany
Wed-Ses1-P2-4, Time: 10:00
In this study a perception experiment was carried out to examine the perceived similarity of intonation contours. Amongst other results, we found that the subjects are capable of producing consistent similarity judgements. On the basis of this data we studied the influence of several physical distance measures on the human similarity judgements by grouping these measures into principal components and by comparing the weights of these components in a linear regression model predicting human perception. Non-correlation based distance measures for f0 contours received the highest relative weight. Finally, we developed applicable linear regression and neural feed-forward network models predicting similarity perception of intonation on the basis of physical contour distances. The performance of the neural networks, measured in terms of mean absolute error, did not differ significantly from the human performance derived from judgement consistency.
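The final modelling step (predicting perceived similarity from physical contour distances with linear regression, evaluated by mean absolute error) can be sketched as follows; the feature columns and data are illustrative placeholders, and the paper's neural network and PCA grouping are not reproduced.

```python
# Hedged sketch: ordinary least-squares regression from contour-distance
# features (toy columns: RMSE, correlation-based distance, slope difference)
# to mean perceived similarity, scored by mean absolute error.
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 3))                                   # toy distance measures
y = 0.8 * X[:, 0] - 0.3 * X[:, 1] + 0.1 * rng.normal(size=50)  # toy judgements

X1 = np.hstack([X, np.ones((50, 1))])            # add intercept column
coef, *_ = np.linalg.lstsq(X1, y, rcond=None)    # least-squares fit
pred = X1 @ coef
mae = np.mean(np.abs(pred - y))                  # mean absolute error
print(coef, mae)
```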
Studying L2 Suprasegmental Features in Asian
Englishes: A Position Paper
Helen Meng 1, Chiu-yu Tseng 2, Mariko Kondo 3, Alissa Harrison 1, Tanya Viscelgia 4; 1 Chinese University of Hong Kong, China; 2 Academia Sinica, Taiwan; 3 Waseda University, Japan; 4 Ming Chuan University, Taiwan
Wed-Ses1-P2-5, Time: 10:00
This position paper highlights the importance of suprasegmental
training in secondary language (L2) acquisition. Suprasegmental
features are manifested in terms of acoustic cues and convey
important information about linguistic and information structures.
Hence, L2 learners must harness appropriate suprasegmental
productions for effective communication. However, this learning
process is influenced by well-established perceptions of sounds
and articulatory motions in the primary language (L1). We propose
to design and collect a corpus to support systematic analysis of
L2 suprasegmental features. We lay out a set of carefully selected
textual environments that illustrate how suprasegmental features
convey information including part-of-speech, syntax, focus, speech
acts and semantics. We intend to use these textual environments
for collecting speech data in a variety of Asian Englishes from
non-native English speakers. Analyses of such corpora should
lead to research findings that have important implications for
language education, as well as speech technology development for
computer-aided language learning (CALL) applications.
Classification of Disfluent Phenomena as Fluent
Communicative Devices in Specific Prosodic
Contexts
Helena Moniz 1, Isabel Trancoso 2, Ana Isabel Mata 1; 1 FLUL/CLUL, Portugal; 2 INESC-ID Lisboa/IST, Portugal
Wed-Ses1-P2-6, Time: 10:00
This work explores prosodic cues of disfluent phenomena. In our
previous work, we conducted a perceptual experiment regarding
(dis)fluency ratings. Results suggested that some disfluencies
may be considered felicitous by listeners, namely filled pauses
and prolongations. In an attempt to discriminate which linguistic
features are more salient in the classification of disfluencies as
either fluent or disfluent phenomena, we used CART techniques on
a corpus of 3.5 hours of spontaneous and prepared non-scripted
speech. CART results pointed out 2 splits: break indices and
contour shape. The first split indicates that events uttered at
breaks 3 and 4 are considered felicitous. The second shows that
these events must have flat or ascending contours to be considered
as such; otherwise they are strongly penalized. Our preliminary
results suggest that there are regular trends in the production of
these events, namely, prosodic phrasing and contour shape.
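A toy CART in the spirit of this analysis, classifying an event as fluent or disfluent from its break index and contour shape, is sketched below; the data and coding are invented for illustration and are far simpler than the paper's corpus and feature set.

```python
# Illustrative CART: break index (0-4) and contour shape (0=descending,
# 1=flat, 2=ascending) as predictors of a fluent/disfluent label.
from sklearn.tree import DecisionTreeClassifier

X = [[3, 1], [4, 2], [3, 2], [1, 0], [2, 0], [1, 1], [4, 1], [2, 2]]
y = ["fluent", "fluent", "fluent", "disfluent", "disfluent", "disfluent",
     "fluent", "disfluent"]

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
print(tree.predict([[4, 2], [1, 0]]))   # high break + ascending vs. low break + descending
```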
Cross-Cultural Perception of Discourse Phenomena
Rolf Carlson 1, Julia Hirschberg 2; 1 KTH, Sweden; 2 Columbia University, USA
Wed-Ses1-P2-7, Time: 10:00
We discuss perception studies of two low level indicators of
discourse phenomena by Swedish, Japanese, and Chinese native
speakers. Subjects were asked to identify upcoming prosodic
boundaries and disfluencies in Swedish spontaneous speech. We
hypothesize that speakers of prosodically unrelated languages
should be less able to predict upcoming phrase boundaries but
potentially better able to identify disfluencies, since indicators
of disfluency are more likely to depend upon lexical, as well as
acoustic information. However, surprisingly, we found that both
phenomena were fairly well recognized by native and non-native
speakers, with, however, some possible interference from word
tones for the Chinese subjects.
Classifying Clear and Conversational Speech Based on Acoustic Features
Akiko Amano-Kusumoto, John-Paul Hosom, Izhak Shafran; Oregon Health & Science University, USA
Wed-Ses1-P2-10, Time: 10:00
This paper reports an investigation of features relevant for classifying two speaking styles, namely, conversational speaking style
and clear (e.g. hyper-articulated) speaking style. Spectral and
prosodic features were automatically extracted from speech and
classified using decision tree classifiers and multilayer perceptrons
to achieve accuracies of about 71% and 77% respectively. More
interestingly, we found that out of the 56 features only about 9
features are needed to capture the most predictive power. While
perceptual studies have shown that spectral cues are more useful
than prosodic features for intelligibility [1], here we find prosodic
features are more important for classification.
Modelling Vocabulary Growth from Birth to Young Adulthood
Roger K. Moore 1, L. ten Bosch 2; 1 University of Sheffield, UK; 2 Radboud Universiteit Nijmegen, The Netherlands
Wed-Ses1-P2-8, Time: 10:00
There has been considerable debate over the existence of the ‘vocabulary spurt’ phenomenon — an apparent acceleration in word learning that is commonly said to occur in children around the age of 18 months. This paper presents an investigation into modelling the phenomenon using data from almost 1800 children. The results indicate that the acquisition of a receptive/productive lexicon can be quite adequately modelled as a single growth function with an ecologically well founded and cognitively plausible interpretation. Hence it is concluded that there is little evidence for the vocabulary spurt phenomenon as a separable aspect of language acquisition.

The Acoustic Characteristics of Russian Vowels in Children of 6 and 7 Years of Age
Elena E. Lyakso, Olga V. Frolova, Aleks S. Grigoriev; St. Petersburg State University, Russia
Wed-Ses1-P2-11, Time: 10:00
The purpose of this investigation is to examine the process of acoustic features of vowels from child speech approaching corresponding values in the normal Russian adult speech. The vowels formants structure, pitch and vowels duration were examined. Word stress and palatal context influence on the formants structure of the vowels were taken into account. It was shown that the word stress is formed by 6–7 years of age on the basis of the feature typical for Russian language. Formant structure of Russian vowels /u/ and /i/ is not formed by the age of 7 years. Native speakers recognize the meaning of 57–93% words in speech of 6 and 7-years-old children.
Adaptive Non-Negative Matrix Factorization in a Computational Model of Language Acquisition
Joris Driesen 1, L. ten Bosch 2, Hugo Van hamme 1; 1 Katholieke Universiteit Leuven, Belgium; 2 Radboud Universiteit Nijmegen, The Netherlands
Wed-Ses1-P2-9, Time: 10:00
During the early stages of language acquisition, young infants
face the task of learning a basic vocabulary without the aid of
prior linguistic knowledge. It is believed the long term episodic
memory plays an important role in this process. Experiments have
shown that infants retain large amounts of very detailed episodic
information about the speech they perceive (e.g. [1]). This weakly
justifies the fact that some algorithms attempting to model the
process of vocabulary acquisition computationally process large
amounts of speech data in batch. Non-negative Matrix Factorization (NMF), a technique that is particularly successful in data
mining but can also be applied to vocabulary acquisition (e.g. [2]),
is such an algorithm. In this paper, we will integrate an adaptive
variant of NMF into a computational framework for vocabulary
acquisition, foregoing the need for long term storage of speech
inputs, and experimentally show its accuracy matches that of the
original batch algorithm.
Japanese Children’s Acquisition of Prosodic Politeness Expressions
Takaaki Shochi 1, Donna Erickson 2, Kaoru Sekiyama 1, Albert Rilliard 3, Véronique Aubergé 4; 1 Kumamoto University, Japan; 2 Showa Music University, Japan; 3 LIMSI, France; 4 GIPSA, France
Wed-Ses1-P2-12, Time: 10:00
This paper presents a perception experiment to measure the ability
of Japanese children in fourth and fifth grade elementary school
to recognize culturally encoded expressions of politeness and
impoliteness in their native language. Audio-visual stimuli were
presented to listeners, who rate the politeness degree and a possible situation where such an expression could be used. Analysis
of results focuses on the differences and the similarities between
adult listeners and children, for each attitude and modality. Facial
information seems to be retrieved earlier than audio ones, and
expressions of different degrees of Japanese politeness, including
expressions of kyoshuku, are still not understood around 10 years
of age.
Perceptual Training of Singleton and Geminate
Stops in Japanese Language by Korean Learners
Mee Sonu 1 , Keiichi Tajima 2 , Hiroaki Kato 3 , Yoshinori
Sagisaka 1 ; 1 Waseda University, Japan; 2 Hosei
University, Japan; 3 NICT, Japan
Wed-Ses1-P2-13, Time: 10:00
We aim to build up an effective perceptual training paradigm
toward a computer-assisted language learning (CALL) system for
second language. This study investigated the effectiveness of the
perceptual training on Korean-speaking learners of Japanese in
the distinction between geminate and singleton stops of Japanese.
The training consisted of identification of geminate and singleton
stops with feedback. We investigated whether training improves
the learners’ identification of the geminate and singleton stops
in Japanese. Moreover, we examined how perceptual training is
affected by factors that influence speaking rate. Results were as
follows. Participants who underwent perceptual training improved
overall performance to a greater extent than untrained control
participants. However, there was no significant difference between
the group that was trained with three speaking rates and the group
that was trained with normal rate only.
Wed-Ses1-P3 : Statistical Parametric
Synthesis II
Hewison Hall, 10:00, Wednesday 9 Sept 2009
Chair: Simon King, University of Edinburgh, UK
A Bayesian Approach to Hidden Semi-Markov Model
Based Speech Synthesis
Kei Hashimoto, Yoshihiko Nankaku, Keiichi Tokuda;
Nagoya Institute of Technology, Japan
Wed-Ses1-P3-1, Time: 10:00
This paper proposes a Bayesian approach to hidden semi-Markov
model (HSMM) based speech synthesis. Recently, hidden Markov
model (HMM) based speech synthesis based on the Bayesian
approach was proposed. The Bayesian approach is a statistical technique for estimating reliable predictive distributions by
treating model parameters as random variables. In the Bayesian
approach, all processes for constructing the system are derived
from one single predictive distribution which exactly represents
the problem of speech synthesis. However, there is an inconsistency between training and synthesis: although the speech is
synthesized from HMMs with explicit state duration probability
distributions, HMMs are trained without them. In this paper, we
introduce an HSMM, which is an HMM with explicit state duration
probability distributions, into the HMM-based Bayesian speech
synthesis system. Experimental results show that the use of HSMM
improves the naturalness of the synthesized speech.
Rich Context Modeling for High Quality HMM-Based
TTS
Zhi-Jie Yan, Yao Qian, Frank K. Soong; Microsoft
Research Asia, China
Wed-Ses1-P3-2, Time: 10:00
This paper presents a rich context modeling approach to high
quality HMM-based speech synthesis. We first analyze the oversmoothing problem in conventional decision tree tying-based HMM,
and then propose to model the training speech tokens with rich
context models. Special training procedure is adopted for reliable
estimation of the rich context model parameters. In synthesis,
a search algorithm following a context-based pre-selection is
performed to determine the optimal rich context model sequence
which generates natural and crisp output speech. Experimental
results show that spectral envelopes synthesized by the rich
context models are with crisper formant structures and evolve
with richer details than those obtained by the conventional models.
The speech quality improvement is also perceived by listeners
in a subjective preference test, in which 76% of the sentences
synthesized using rich context modeling are preferred.
Tying Covariance Matrices to Reduce the Footprint
of HMM-Based Speech Synthesis Systems
Keiichiro Oura, Heiga Zen, Yoshihiko Nankaku,
Akinobu Lee, Keiichi Tokuda; Nagoya Institute of
Technology, Japan
Wed-Ses1-P3-3, Time: 10:00
This paper proposes a technique of reducing the footprint of HMM-based speech synthesis systems by tying all covariance matrices.
HMM-based speech synthesis systems usually consume smaller
footprint than unit-selection synthesis systems because statistics
rather than speech waveforms are stored. However, further reduction is essential to put them on embedded devices which have very
small memory. Based on the empirical knowledge that covariance matrices have a smaller impact on the quality of synthesized speech than mean vectors, here we propose a clustering technique
of mean vectors while tying all covariance matrices. Subjective
listening test results show that the proposed technique can shrink
the footprint of an HMM-based speech synthesis system while
retaining the quality of synthesized speech.
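A back-of-the-envelope calculation shows why tying covariances roughly halves the footprint of a diagonal-covariance model that stores only means and variances; the counts below are illustrative, not the paper's system sizes.

```python
# Illustrative footprint comparison: per-Gaussian diagonal covariances vs.
# a single shared (tied) variance vector. Only mean/variance storage is counted.
def footprint_mb(n_gaussians, dim, tied_covariance, bytes_per_value=4):
    means = n_gaussians * dim
    variances = dim if tied_covariance else n_gaussians * dim
    return (means + variances) * bytes_per_value / 1e6

print(footprint_mb(100_000, 75, tied_covariance=False))  # ~60 MB
print(footprint_mb(100_000, 75, tied_covariance=True))   # ~30 MB
```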
The HMM Synthesis Algorithm of an Embedded
Unified Speech Recognizer and Synthesizer
Guntram Strecha 1 , Matthias Wolff 1 , Frank Duckhorn 1 ,
Sören Wittenberg 1 , Constanze Tschöpe 2 ; 1 Technische
Universität Dresden, Germany; 2 Fraunhofer IZFP,
Germany
Wed-Ses1-P3-4, Time: 10:00
In this paper we present an embedded unified speech recognizer
and synthesizer using identical, speaker independent HiddenMarkov-Models. The system was prototypically realized on a signal
processor extended by a field programmable gate array. In a first
section we will give a brief overview of the system. The main
part of the paper deals with a specially designed unit based HMM
synthesis algorithm. In a last section we state the results of an
informal listening evaluation of the speech synthesizer.
Syllable HMM Based Mandarin TTS and Comparison
with Concatenative TTS
Zhiwei Shuang 1, Shiyin Kang 2, Qin Shi 1, Yong Qin 1, Lianhong Cai 2; 1 IBM China Research Lab, China; 2 Tsinghua University, China
Wed-Ses1-P3-5, Time: 10:00
This paper introduces a Syllable HMM based Mandarin TTS system.
10-state left-to-right HMMs are used to model each syllable. We
leverage the corpus and the front end of a concatenative TTS
system to build the Syllable HMM based TTS system. Furthermore,
we utilize the unique consonant/vowel structure of Mandarin
syllable to improve the voiced/unvoiced decision of HMM states.
Evaluation results show that the Syllable HMM based Mandarin TTS system with a 5.3 MB model size can achieve an overall quality close to that of a concatenative TTS system with a 1 GB data size.
Pulse Density Representation of Spectrum for
Statistical Speech Processing
Yoshinori Shiga; NICT, Japan
Wed-Ses1-P3-6, Time: 10:00
This study investigates a new spectral representation that is suitable for statistical parametric speech synthesis. Statistical speech
processing involves spectral averaging in the training process;
however, averaging spectra in the domain of conventional speech
parameters over-smooths the resulting means, which degrades the
quality of the speech synthesised. In the proposed representation,
high-energy parts of the spectrum, such as sections of dominant
formants, are represented by a group of high-density pulses in the
frequency domain. These pulses’ locations (i.e., frequencies) are
then parameterised. The representation is theoretically capable of
averaging spectra with less over-smoothing effect. The experimental results provide the optimal values of factors necessary for the
encoding and decoding of the proposed representation for future
speech synthesis applications.
Parameterization of Vocal Fry in HMM-Based Speech
Synthesis
Hanna Silén 1 , Elina Helander 1 , Jani Nurminen 2 ,
Moncef Gabbouj 1 ; 1 Tampere University of Technology,
Finland; 2 Nokia Devices R&D, Finland
Wed-Ses1-P3-7, Time: 10:00
HMM-based speech synthesis offers a way to generate speech with
different voice qualities. However, sometimes databases contain
certain inherent voice qualities that need to be parametrized
properly. One example of this is vocal fry typically occurring at the
end of utterances. A popular mixed excitation vocoder for HMM-based speech synthesis is STRAIGHT. The standard STRAIGHT is
optimized for modal voices and may not produce high quality with
other voice types. Fortunately, due to the flexibility of STRAIGHT,
different F0 and aperiodicity measures can be used in the synthesis
without any inherent degradations in speech quality. We have
replaced the STRAIGHT excitation with a representation based
on a robust F0 measure and a carefully determined two-band
voicing. According to our analysis-synthesis experiments, the new
parameterization can improve the speech quality. In HMM-based
speech synthesis, the quality is significantly improved especially
due to the better modeling of vocal fry.
A Deterministic Plus Stochastic Model of the
Residual Signal for Improved Parametric Speech
Synthesis
Thomas Drugman 1 , Geoffrey Wilfart 2 , Thierry
Dutoit 1 ; 1 Faculté Polytechnique de Mons, Belgium;
2 Acapela Group, Belgium
Wed-Ses1-P3-8, Time: 10:00
Speech generated by parametric synthesizers generally suffers
from a typical buzziness, similar to what was encountered in old
LPC-like vocoders. In order to alleviate this problem, a more suited
modeling of the excitation should be adopted. For this, we hereby
propose an adaptation of the Deterministic plus Stochastic Model
(DSM) for the residual. In this model, the excitation is divided
into two distinct spectral bands delimited by the maximum voiced
frequency. The deterministic part concerns the low-frequency
contents and consists of a decomposition of pitch-synchronous
residual frames on an orthonormal basis obtained by Principal
Component Analysis. The stochastic component is a high-pass
filtered noise whose time structure is modulated by an energy envelope, similar to what is done in the Harmonic plus Noise
Model (HNM). The proposed residual model is integrated within a
HMM-based speech synthesizer and is compared to the traditional
excitation through a subjective test. Results show a significant
improvement for both male and female voices. In addition, the
proposed model requires little computation and memory,
which is essential for its integration in commercial applications.
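A rough sketch of the two-band excitation idea, assuming pitch-synchronous residual frames have already been extracted; the PCA basis size, filter order and envelope estimate below are illustrative choices, not the authors' exact settings:

    import numpy as np
    from scipy.signal import butter, lfilter

    def dsm_excitation(residual_frames, fs, fmax_voiced=4000.0, n_components=10):
        """residual_frames: (N, L) pitch-synchronous residual frames."""
        # Deterministic part: project frames onto an orthonormal PCA basis.
        mean = residual_frames.mean(axis=0)
        centred = residual_frames - mean
        _, _, vt = np.linalg.svd(centred, full_matrices=False)
        basis = vt[:n_components]                 # eigen-residuals
        coeffs = centred @ basis.T                # per-frame weights
        deterministic = coeffs @ basis + mean
        # Stochastic part: high-pass noise above the maximum voiced frequency,
        # amplitude-modulated by a crude per-frame energy envelope.
        b, a = butter(4, fmax_voiced / (fs / 2), btype="high")
        noise = lfilter(b, a, np.random.randn(*residual_frames.shape), axis=1)
        env = np.abs(residual_frames)
        env = env / (env.max(axis=1, keepdims=True) + 1e-9)
        return deterministic + noise * env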
A Decision Tree-Based Clustering Approach to State
Definition in an Excitation Modeling Framework for
HMM-Based Speech Synthesis
Ranniery Maia 1 , Tomoki Toda 2 , Keiichi Tokuda 3 ,
Shinsuke Sakai 1 , Satoshi Nakamura 1 ; 1 NICT, Japan;
2 NAIST, Japan; 3 Nagoya Institute of Technology, Japan
Wed-Ses1-P3-9, Time: 10:00
This paper presents a decision tree-based algorithm to cluster
residual segments, assuming an excitation model based on state-dependent filtering of pulse trains and white noise. The decision
tree construction principle is the same as the one applied to
speech recognition. Here parent nodes are split using the residual
maximum likelihood criterion. Once these excitation decision trees
are constructed for residual signals segmented by full context
models, using questions related to the full context of the training
sentences, they can be utilized for excitation modeling in speech
synthesis based on hidden Markov models (HMM). Experimental
results have shown that the algorithm in question is very effective
in terms of clustering residual signals given segmentation, pitch
marks and full context questions, resulting in filters with good
residual modeling properties.
An Improved Minimum Generation Error Based
Model Adaptation for HMM-Based Speech Synthesis
Yi-Jian Wu 1 , Long Qin 2 , Keiichi Tokuda 1 ; 1 Nagoya
Institute of Technology, Japan; 2 Carnegie Mellon
University, USA
Wed-Ses1-P3-10, Time: 10:00
A minimum generation error (MGE) criterion was previously proposed for
model training in HMM-based speech synthesis. In this paper, we
apply the MGE criterion to model adaptation for HMM-based speech
synthesis, and introduce an MGE linear regression (MGELR) based
model adaptation algorithm, where the regression matrices used
to transform source models are optimized so as to minimize the
generation errors of adaptation data. In addition, we incorporate
the recent improvements of MGE criterion into MGELR-based model
adaptation, including state alignment under the MGE criterion and
the use of log spectral distortion (LSD) instead of Euclidean distance
as the spectral distortion measure. From the experimental results,
the adaptation performance was improved after incorporating
these two techniques, and the formal listening tests showed that
the quality and speaker similarity of synthesized speech after
MGELR-based adaptation were significantly improved over the
original MLLR-based adaptation.
Two-Pass Decision Tree Construction for
Unsupervised Adaptation of HMM-Based Synthesis
Models
Matthew Gibson; University of Cambridge, UK
Wed-Ses1-P3-11, Time: 10:00
Hidden Markov model (HMM) -based speech synthesis systems
possess several advantages over concatenative synthesis systems.
One such advantage is the relative ease with which HMM-based systems are adapted to speakers not present in the training dataset.
Speaker adaptation methods used in the field of HMM-based
automatic speech recognition (ASR) are adopted for this task. In
the case of unsupervised speaker adaptation, previous work has
used a supplementary set of acoustic models to firstly estimate
the transcription of the adaptation data. By defining a mapping
between HMM-based synthesis models and ASR-style models,
this paper introduces an approach to the unsupervised speaker
adaptation task for HMM-based speech synthesis models which
avoids the need for supplementary acoustic models. Further, this
enables unsupervised adaptation of HMM-based speech synthesis
models without the need to perform linguistic analysis of the
estimated transcription of the adaptation data.
Speaker Adaptation Using a Parallel Phone Set
Pronunciation Dictionary for Thai-English Bilingual
TTS
Anocha Rugchatjaroen, Nattanun Thatphithakkul,
Ananlada Chotimongkol, Ausdang Thangthai, Chai
Wutiwiwatchai; NECTEC, Thailand
Wed-Ses1-P3-12, Time: 10:00
This paper develops a bilingual Thai-English TTS system from
two monolingual HMM-based TTS systems. An English Nagoya
HMM-based TTS system (HTS) provides correct pronunciations
of English words but the voice is different from the voice in a
Thai HTS system. We apply a CSMAPLR adaptation technique to
make the English voice sound more similar to the Thai voice.
To overcome the phone mapping problem that normally occurs with a
pair of languages that have dissimilar phone sets, we utilize a
cross-language pronunciation mapping through a parallel phone
set pronunciation dictionary. The results from the subjective
listening test show that English words synthesized by our proposed system are more intelligible (with 0.61 higher MOS) than the
existing bilingual Thai-English TTS. Moreover, with the proposed
adaptation method, the synthesized English words sound more
similar to synthesized Thai words.
HMM-Based Automatic Eye-Blink Synthesis from
Speech
Michal Dziemianko, Gregor Hofer, Hiroshi Shimodaira;
University of Edinburgh, UK
Wed-Ses1-P3-13, Time: 10:00
In this paper we present a novel technique to automatically
synthesise eye blinking from a speech signal. Animating the
eyes of a talking head is important as they are a major focus of
attention during interaction. The developed system predicts eye
blinks from the speech signal and generates animation trajectories
automatically employing a “Trajectory Hidden Markov Model”.
The evaluation of the recognition performance showed that the
timing of blinking can be predicted from speech with an F-score
value upwards of 52%, which is well above chance. Additionally, a
preliminary perceptual evaluation was conducted, which confirmed
that adding eye blinking significantly improves the perception of
the character. Finally, it showed that speech-synchronised
synthesised blinks outperform random blinking in naturalness
ratings.
Wed-Ses1-P4 : Resources, Annotation and
Evaluation
Hewison Hall, 10:00, Wednesday 9 Sept 2009
Chair: Michael Wagner, University of Canberra, Australia
Resources for Speech Research: Present and Future
Infrastructure Needs
Lou Boves 1 , Rolf Carlson 2 , Erhard Hinrichs 3 , David
House 2 , Steven Krauwer 4 , Lothar Lemnitzer 3 , Martti
Vainio 5 , Peter Wittenburg 6 ; 1 Radboud Universiteit
Nijmegen, The Netherlands; 2 KTH, Sweden;
3 Universität Tübingen, Germany; 4 Utrecht University,
The Netherlands; 5 University of Helsinki, Finland; 6 Max
Planck Institute for Psycholinguistics, The Netherlands
Wed-Ses1-P4-1, Time: 10:00
This paper introduces the EU-FP7 project CLARIN, a joint effort
of over 150 institutions in Europe, aimed at the creation of a
sustainable language resources and technology infrastructure for
the humanities and social sciences research community. The paper
briefly introduces the vision behind the project and how it relates
to speech research with a focus on the contributions that CLARIN
can and will make to research in spoken language processing.
Speech Recordings via the Internet: An Overview of
the VOYS Project in Scotland
Catherine Dickie 1 , Felix Schaeffler 1 , Christoph
Draxler 2 , Klaus Jänsch 2 ; 1 Queen Margaret University,
UK; 2 LMU München, Germany
Wed-Ses1-P4-2, Time: 10:00
The VOYS (Voices of Young Scots) project aims to establish a
speech database of adolescent Scottish speakers. This database
will serve for speech recognition technology and sociophonetic
research. 300 pupils will ultimately be recorded at secondary
schools in 10 locations in Scotland. Recordings are performed via
the Internet using two microphones (close-talk and desktop) in
22.05 kHz, 16-bit linear stereo signal quality.
VOYS is the first large-scale and cross-boundary speech data collection based on the WikiSpeech content management system for
speech resources. In VOYS, schools receive a kit containing the
microphones and A/D interface and they organise the recordings
themselves. The recorded data is immediately uploaded to the
server in Munich, relieving the schools of all data-handling
tasks. This paper outlines the corpus specification, describes the
technical issues, summarises the signal quality and gives a status
report.
The Multi-Session Audio Research Project (MARP)
Corpus: Goals, Design and Initial Findings
A.D. Lawson 1 , A.R. Stauffer 1 , E.J. Cupples 1 , S.J.
Wenndt 2 , W.P. Bray 3 , J.J. Grieco 2 ; 1 RADC Inc., USA;
2 Air Force Research Laboratory, USA; 3 Oasis Systems,
USA
Wed-Ses1-P4-3, Time: 10:00
This paper describes the composition and goals of the Multi-session Audio Research Project (MARP) corpus and some initial experimental findings. The MARP corpus is a three-year longitudinal
collection of 21 sessions and more than 60 participants. This study was
undertaken to test the impact of various factors on speaker recog-
nition, such as inter-session variability, intonation, aging, whispering and text dependency. Initial results demonstrate the impact of
sentence intonation, whispering, text dependency and cross session
tests. These results highlight the sensitivity of speaker recognition
to vocal, environmental and phonetic conditions that are commonly
encountered but rarely explored or tested.
Structure and Annotation of Polish LVCSR Speech
Database
Katarzyna Klessa, Grażyna Demenko; Adam
Mickiewicz University, Poland
Wed-Ses1-P4-4, Time: 10:00
This paper reports on the problems occurring in the process of
building LVCSR (Large Vocabulary Continuous Speech Recognition)
corpora based on the internal evaluation of the Polish database JURISDIC. The initial assumptions are discussed together with technical matters concerning the database realization and annotation
results. Providing rich database statistics was considered crucial
especially regarding linguistic description both for database evaluation and for the implementation of linguistic factors in acoustic models for speech recognition. The assumed principles for
database construction are: low redundancy, acoustic-phonetic variability adequate to the dictation task, representativeness, and a balanced,
heterogeneous structure enabling separate or combined modeling
of phonetic-acoustic structures.
Balanced Corpus of Informal Spoken Czech:
Compilation, Design and Findings
Martina Waclawičová, Michal Křen, Lucie Válková;
Charles University in Prague, Czech Republic
Wed-Ses1-P4-5, Time: 10:00
The paper presents ORAL2008, a new 1-million-word corpus of spoken
Czech compiled within the framework of the Czech National Corpus project. ORAL2008 is designed as a representation of authentic
spoken language used in informal situations and it is balanced in
the main sociolinguistic categories of speakers. The paper concentrates also on the data collection, its broad coverage and the transcription system that registers variability of spoken Czech. Possible
findings based on the provided data are finally outlined.
JTrans: An Open-Source Software for
Semi-Automatic Text-to-Speech Alignment
C. Cerisara, O. Mella, D. Fohr; LORIA, France
Wed-Ses1-P4-6, Time: 10:00
Aligning speech corpora with text transcriptions is an important
requirement of many speech processing and data mining applications
as well as of linguistic research. Despite recent progress in the field of
speech recognition, many linguists still manually align spontaneous
and noisy speech recordings to guarantee a good alignment quality.
This work proposes open-source Java software with an easy-to-use GUI that integrates dedicated semi-automatic speech alignment
algorithms that can be dynamically controlled and guided by the
user. The objective of this software is to facilitate and speed up the
process of creating and aligning speech corpora.
Predicting the Quality of Multimodal Systems Based
on Judgments of Single Modalities
Ina Wechsung 1 , Klaus-Peter Engelbrecht 1 , Anja B.
Naumann 1 , Stefan Schaffer 2 , Julia Seebode 2 , Florian
Metze 3 , Sebastian Möller 1 ; 1 Deutsche Telekom
Laboratories, Germany; 2 Technische Universität Berlin,
Germany; 3 Carnegie Mellon University, USA
Wed-Ses1-P4-7, Time: 10:00
This paper investigates the relationship between user ratings of
multimodal systems and user ratings of their single modalities. Based
on previous research showing precise predictions of ratings of multimodal systems from single-modality ratings, it was hypothesized that the accuracy might have been caused by the participants' efforts to rate consistently. We address this issue with
two new studies. In the first study, the multimodal system was
presented before the single modality versions were known by the
users. In the second study, the type of system was changed, and
age effects were investigated. We apply linear regression and show
that models get worse when the order is changed. In addition, models for younger users perform better than those for older users. We
conclude that ratings can be impacted by the effort of users to judge
consistently, as well as their ability to do so.
Auto-Checking Speech Transcriptions by Multiple
Template Constrained Posterior
Lijuan Wang 1 , Shenghao Qin 2 , Frank K. Soong 1 ;
1 Microsoft Research Asia, China; 2 Microsoft Business
Division, China
Wed-Ses1-P4-8, Time: 10:00
Checking transcription errors in a speech database is an important
but tedious task that traditionally requires intensive manual labor.
In [9], Template Constrained Posterior (TCP) was proposed to automate the checking process by screening potentially erroneous sentences with a single context template. However, the single-template-based method is not robust and requires parameter optimization
that still involves some manual work. In this work, we propose
to use multiple templates, which is more robust and requires no
development data for parameter optimization. By using its multiple hypothesis sifting capabilities — from well-defined, full context
to loosely defined context like wild card, the confidence for a focus
unit can be measured at different expected accuracy. The joint verification by multiple TCP improves measured confidence of each unit
in the transcription and is robust across different speech databases.
Experimental results show that the checking process automatically
separates erroneous sentences from correct ones: the sentence error hit rate decreases rapidly with the sorted TCP values, from 59% to
7% for the Mexican Spanish database and from 63% to 11% for the
American English database, among the top 10% sentences in the
rank lists.
Subjective Experiments on Influence of Response
Timing in Spoken Dialogues
Toshihiko Itoh 1 , Norihide Kitaoka 2 , Ryota Nishimura 3 ;
1 Hokkaido University, Japan; 2 Nagoya University,
Japan; 3 Toyohashi University of Technology, Japan
Wed-Ses1-P4-9, Time: 10:00
To verify the validity of analysis results relating to dialogue rhythm
from earlier studies, we produced spoken dialogues based on analysis results relating to response timing and the other spoken dialogues, and performed subjective experiments to investigate parameters such as the naturalness of the dialogue, the incongruity
of the synthesized speech, and the ease of comprehension of the
utterances. We used very short task-oriented four-turn dialogues
using synthesized speech in Experiment 1, and approx. one-minute
free-conversation dialogues in Experiment 2 using natural human
speech and synthesized speech. As a result, we were able to show
that a natural response timing exists for utterances, and that response timings that conform to the utterance contents are felt to
be more natural, thus demonstrating the validity of the analysis results relating to dialogue rhythm.
Usability Study of VUI consistent with GUI Focusing
on Age-Groups
Jun Okamoto, Tomoyuki Kato, Makoto Shozakai; Asahi
Kasei Corporation, Japan
Wed-Ses1-P4-10, Time: 10:00
We studied the usability of a Voice User Interface (VUI) that is consistent with a Graphical User Interface (GUI), and focused on its
dependency with user age-groups. Usability tests were iteratively
conducted on 245 Japanese subjects with age-groups from 20s to
60s using a prototype of an in-vehicle information application. Next
we calculated and analyzed statistics of the usability tests. We discuss the differences in usability with respect to age-groups and how
to handle them. We propose that it is necessary to make voice
guidance straightforward and to devise a VUI consistent with a GUI
(VGUI) in order to let users understand the system structure. Also
we found that the default design of a VGUI should be as simple as
possible so that elderly users, who may be slow to learn the new
system structure, are able to easily learn it.
Annotating Communicative Function and Semantic
Content in Dialogue Act for Construction of
Consulting Dialogue Systems
Teruhisa Misu, Kiyonori Ohtake, Chiori Hori, Hideki
Kashioka, Satoshi Nakamura; NICT, Japan
Wed-Ses1-P4-11, Time: 10:00
Our goal in this study is to train a dialogue manager that can handle consulting dialogues through spontaneous interactions from a
tagged dialogue corpus. We have collected 130 hours of consulting
dialogues in the sightseeing guidance domain. This paper provides our
taxonomy of dialogue act (DA) annotation that can describe two aspects of utterances. One is a communicative function (speech act),
and the other is a semantic content of an utterance. We provide an
overview of the Kyoto tour guide dialogue corpus and a preliminary
analysis using the dialogue act tags.
An Improved Speech Segmentation Quality Measure:
The R-Value
Okko Johannes Räsänen, Unto Kalervo Laine, Toomas
Altosaar; Helsinki University of Technology, Finland
Wed-Ses1-P4-13, Time: 10:00
Phone segmentation in ASR is usually performed indirectly by
Viterbi decoding of HMM output. Direct approaches also exist, e.g.,
blind speech segmentation algorithms. In either case, performance
of automatic speech segmentation algorithms is often measured
using automated evaluation algorithms and used to optimize a segmentation system’s performance. However, evaluation approaches
reported in the literature were found to be lacking. Also, we have determined that increases in phone boundary location detection rates
are often due to increased over-segmentation levels and not to algorithmic improvements, i.e., by simply adding random boundaries
a better hit-rate can be achieved when using current quality measures. Since established measures were found to be insensitive to
this type of random boundary insertion, a new R-value quality measure is introduced that indicates how close a segmentation algorithm’s performance is to an ideal point of operation.
No Sooner Said Than Done? Testing Incrementality
of Semantic Interpretations of Spontaneous Speech
Michaela Atterer, Timo Baumann, David Schlangen;
Universität Potsdam, Germany
Wed-Ses1-P4-14, Time: 10:00
Ideally, a spoken dialogue system should react without much delay
to a user’s utterance. Such a system would already select an object,
for instance, before the user has finished her utterance about moving this particular object to a particular place. A prerequisite for
such a prompt reaction is that semantic representations are built up
on the fly and passed on to other modules. Few approaches to incremental semantics construction exist, and, to our knowledge, none
of those has been systematically tested on a spontaneous speech
corpus. In this paper, we develop measures to test empirically on
transcribed spontaneous speech to what extent we can create semantic interpretation on the fly with an incremental semantic chunker that builds a frame semantics.
Improved Speech Summarization with
Multiple-Hypothesis Representations and
Kullback-Leibler Divergence Measures
Shih-Hsiang Lin, Berlin Chen; National Taiwan Normal
University, Taiwan
Wed-Ses1-P4-12, Time: 10:00
Imperfect speech recognition often leads to degraded performance
when leveraging existing text-based methods for speech summarization. To alleviate this problem, this paper investigates various
ways to robustly represent the recognition hypotheses of spoken
documents beyond the top scoring ones. Moreover, a new summarization method stemming from the Kullback-Leibler (KL) divergence measure and exploring both the sentence and document relevance information is proposed to work with such robust representations. Experiments on broadcast news speech summarization
seem to demonstrate the utility of the presented approaches.
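As a rough illustration of KL-divergence-based sentence selection (not the authors' system; the unigram models and add-alpha smoothing are simplifying assumptions):

    import numpy as np
    from collections import Counter

    def unigram(counts, vocab, alpha=0.1):
        probs = np.array([counts.get(w, 0.0) + alpha for w in vocab])
        return probs / probs.sum()

    def kl_rank_sentences(sentences, doc_words, vocab):
        """sentences: list of word lists (e.g. from recognition hypotheses);
        rank by KL(sentence model || document model), lower = more relevant."""
        p_doc = unigram(Counter(doc_words), vocab)
        scores = []
        for sent in sentences:
            p_sent = unigram(Counter(sent), vocab)
            scores.append(float(np.sum(p_sent * np.log(p_sent / p_doc))))
        return np.argsort(scores)   # candidate summary sentences, best first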
Wed-Ses1-S1 : Special Session: Lessons and
Challenges Deploying Voice Search
Ainsworth (East Wing 4), 10:00, Wednesday 9 Sept 2009
Chair: Michael Cohen, Google, USA and Mike Phillips, Vlingo, USA
Role of Natural Language Understanding in Voice
Local Search
Junlan Feng, Srinivas Bangalore, Mazin Gilbert; AT&T
Labs Research, USA
Wed-Ses1-S1-1, Time: 10:00
Speak4it is a voice-enabled local search system currently available
for iPhone devices. The natural language understanding (NLU) component is one of the key technology modules in this system. The
role of NLU in voice-enabled local search is twofold: (a) parse the automatic speech recognition (ASR) output (1-best and word lattices)
into meaningful segments that contribute to high-precision local
search, and (b) understand the user's intent. This paper is concerned
with the first task of NLU. In previous work, we presented a
scalable approach to parsing, which is built upon text indexing and
search framework, and can also parse ASR lattices. In this paper,
we propose an algorithm to improve the baseline by extracting the
“subjects” of the query. Experimental results indicate that lattice-
based query parsing outperforms ASR 1-best based parsing by 2.1%
absolute and extracting subjects in the query improves the robustness of search.
Recognition and Correction of Voice Web Search
Queries
Wed-Ses2-O1 : Word-Level Perception
Main Hall, 13:30, Wednesday 9 Sept 2009
Chair: Jeesun Kim, University of Western Sydney, Australia
Semantic Context Effects in the Recognition of
Acoustically Unreduced and Reduced Words
Keith Vertanen, Per Ola Kristensson; University of
Cambridge, UK
Wed-Ses1-S1-2, Time: 10:15
In this work we investigate how to recognize and correct voice web
search queries. We describe our corpus of web search queries and
show how it was used to improve recognition accuracy. We show
that using a search-specific vocabulary with automatically generated pronunciations is superior to using a vocabulary limited to
a fixed pronunciation dictionary. We conducted a formative user
study to investigate recognition and correction aspects of voice
search in a mobile context. In the user study, we found that despite a word error rate of 48%, users were able to speak and correct
search queries in about 18 seconds. Users did this while walking
around using a mobile touch-screen device.
Voice Search and Everything Else — What Users Are
Saying to the Vlingo Top Level Voice UI
Chao Wang; Vlingo, USA
Wed-Ses1-S1-3, Time: 10:30
Marco van de Ven 1 , Benjamin V. Tucker 2 , Mirjam
Ernestus 3 ; 1 Max Planck Institute for Psycholinguistics,
The Netherlands; 2 University of Alberta, Canada;
3 Radboud Universiteit Nijmegen, The Netherlands
Wed-Ses2-O1-1, Time: 13:30
Listeners require context to understand the casual pronunciation
variants of words that are typical of spontaneous speech [1]. The
present study reports two auditory lexical decision experiments,
investigating listeners’ use of semantic contextual information in
the comprehension of unreduced and reduced words. We found
a strong semantic priming effect for low frequency unreduced
words, whereas there was no such effect for reduced words. Word
frequency was facilitatory for all words. These results show that
semantic context is relevant especially for the comprehension of
unreduced words, which is unexpected given the listener driven
explanation of reduction in spontaneous speech.
Context Effects and the Processing of Ambiguous
Words: Further Evidence from Semantic
Incongruence
Searching Google by Voice
Johan Schalkwyk; Google Inc., USA
Michael C.W. Yip; Hong Kong Institute of Education,
China
Wed-Ses1-S1-4, Time: 10:45
Wed-Ses2-O1-2, Time: 13:50
Multiple-hypotheses searches from deeply parsed
requests to multiple-evidences scoring: the DeepQA
challenge
Roberto Sicconi; IBM T.J. Watson Research Center, USA
Wed-Ses1-S1-5, Time: 11:00
Research Areas in Voice Search: Lessons from
Microsoft Deployments
Geoffrey Zweig; Microsoft Research, USA
Wed-Ses1-S1-6, Time: 11:15
A cross-modal naming experiment was conducted to further
verify the effects of context and other lexical information in
the processing of Chinese homophones during spoken language
comprehension. In this experiment, listeners named aloud a visual
probe as fast as they could, at a pre-designated point upon hearing
the sentence, which ended with a spoken Chinese homophone.
The results further support the view that context exerts an effect on
the disambiguation of various homophonic meanings at an early
stage, within the acoustic boundary of the word. This contextual
effect was even stronger than the tonal information. Finally, the
present results are in line with the context-dependency hypothesis
that selection of the appropriate meaning of an ambiguous word
depends on the simultaneous interaction among sentential, tonal
and other lexical information during lexical access.
Panel Discussion
Wed-Ses1-S1-7, Time: 11:30
Panel Members:
• Johan Schalkwyk, Senior Staff Software Engineer, Google Inc.,
USA
• Chao Wang, Principal Speech Scientist, Vlingo, USA
• Roberto Sicconi, Program Director, DeepQA New Opportunities, IBM T.J. Watson Research Center, USA
• Mazin Gilbert, AT&T Labs Research, USA
• Geoffrey Zweig, Senior Researcher, Microsoft Research, USA
• Keith Vertanen, University of Cambridge, UK
The Roles of Reconstruction and Lexical Storage in
the Comprehension of Regular Pronunciation
Variants
Mirjam Ernestus; Radboud Universiteit Nijmegen, The
Netherlands
Wed-Ses2-O1-3, Time: 14:10
This paper investigates how listeners process regular pronunciation variants, resulting from simple general reduction processes.
Study 1 shows that when listeners are presented with new words,
they store the pronunciation variants presented to them, whether
these are unreduced or reduced. Listeners thus store information
on word-specific pronunciation variation. Study 2 suggests that if
participants are presented with regularly reduced pronunciations,
they also reconstruct and store the corresponding unreduced
pronunciations. These unreduced pronunciations apparently have
special status. Together the results support hybrid models of
speech processing, assuming roles for both exemplars and abstract
representations.
Lexical Embedding in Spoken Dutch
Odette Scharenborg 1 , Stefanie Okolowski 2 ; 1 Radboud
Universiteit Nijmegen, The Netherlands; 2 Universität
Trier, Germany
Wed-Ses2-O1-4, Time: 14:30
A stretch of speech is often consistent with multiple words, e.g.,
the sequence /hæm/ is consistent with ‘ham’ but also with the first
syllable of ‘hamster’, resulting in temporary ambiguity. However,
to what degree does this lexical embedding occur? Analyses on two
corpora of spoken Dutch showed that 11.9%–19.5% of polysyllabic
word tokens have word-initial embedding, while 4.1%–7.5% of
monosyllabic word tokens can appear word-initially embedded.
This is much lower than suggested by an analysis of a large
dictionary of Dutch. Speech processing thus appears to be simpler
than one might expect on the basis of statistics on a dictionary.
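Counting word-initial embedding in a pronunciation lexicon can be sketched as follows (a brute-force illustration; the phone-tuple lexicon format is an assumption):

    def word_initial_embeddings(lexicon):
        """lexicon: dict word -> tuple of phones; returns, for each word, the
        longer words whose pronunciations start with that word's phone sequence."""
        embedded = {}
        for word, pron in lexicon.items():
            hosts = [w for w, p in lexicon.items()
                     if w != word and len(p) > len(pron) and p[:len(pron)] == pron]
            if hosts:
                embedded[word] = hosts
        return embedded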
Real-Time Lexical Competitions During
Speech-in-Speech Comprehension
Véronique Boulenger 1 , Michel Hoen 2 , François
Pellegrino 1 , Fanny Meunier 1 ; 1 DDL, France; 2 SBRI,
France
Wed-Ses2-O1-5, Time: 14:50
This study investigates speech comprehension in competing multitalker babble. We examined the effects of number of simultaneous
talkers and of frequency of words in the babble on lexical decision
to target words. Results revealed better performance at a low talker
number (n = 2). Importantly, frequency of words in the babble
significantly affected performance: high frequency word babble
interfered more strongly with word recognition than low frequency
babble. This informational masking was particularly salient for
the 2-talker babble. These findings suggest that investigating
speech-in-speech comprehension may provide crucial information
on lexical competition processes that occur in real-time during
word recognition.
Discovering Consistent Word Confusions in Noise
Martin Cooke; Ikerbasque, Spain
Wed-Ses2-O1-6, Time: 15:10
Listeners make mistakes when communicating under adverse
conditions, with overall error rates reasonably well-predicted by
existing speech intelligibility metrics. However, a detailed examination of confusions made by a majority of listeners is more likely
to provide insights into processes of normal word recognition. The
current study measured the rate at which robust misperceptions
occurred for highly-confusable words embedded in noise. In a
second experiment, confusions discovered in the first listening
test were subjected to a range of manipulations designed to help
identify their cause. These experiments reveal that while majority
confusions are quite rare, they occur sufficiently often to make
large-scale discovery worthwhile. Surprisingly few misperceptions
were due solely to energetic masking by the noise, suggesting
that speech and noise “react” in complex ways which are not
well-described by traditional masking concepts.
Wed-Ses2-O2 : Applications in Education
and Learning
Jones (East Wing 1), 13:30, Wednesday 9 Sept 2009
Chair: Maxine Eskenazi, Carnegie Mellon University, USA
A Large Greek-English Dictionary with Incorporated
Speech and Language Processing Tools
Dimitrios P. Lyras, George Kokkinakis, Alexandros
Lazaridis, Kyriakos Sgarbas, Nikos Fakotakis;
University of Patras, Greece
Wed-Ses2-O2-1, Time: 13:30
A large Greek-English Dictionary with 81,515 entries, 192,592
translations into English and 50,106 usage examples with their
translation has been developed in combined printed and electronic
(DVD) form. The electronic dictionary features unique facilities
for searching the entire or any part of the Greek and English
section, and has incorporated a series of speech and language
processing tools which may efficiently assist learners of Greek and
English. This paper presents the human-machine interface of the
dictionary and the most important tools, i.e. the TTS-synthesizers
for Greek and English, the lemmatizers for Greek and English, the
Grapheme-to-Phoneme converter for Greek and the syllabification
system for Greek.
Predicting Children’s Reading Ability Using
Evaluator-Informed Features
Matthew Black, Joseph Tepperman, Sungbok Lee,
Shrikanth S. Narayanan; University of Southern
California, USA
Wed-Ses2-O2-2, Time: 13:50
Automatic reading assessment software has the difficult task
of trying to model human-based observations, which have both
objective and subjective components. In this paper, we mimic the
grading patterns of a “ground-truth” (average) evaluator in order
to produce models that agree with many people’s judgments. We
examine one particular reading task, where children read a list of
words aloud, and evaluators rate the children’s overall reading
ability on a scale from one to seven. We first extract various
features correlated with the specific cues that evaluators said they
used. We then compare various supervised learning methods that
mapped the most relevant features to the ground-truth evaluator
scores. Our final system predicted these scores with 0.91 correlation, higher than the average inter-evaluator agreement.
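A hedged sketch of the evaluation recipe implied here, mapping evaluator-informed features to the average evaluator score and reporting the correlation (the model choice and cross-validation setup are assumptions):

    import numpy as np
    from scipy.stats import pearsonr
    from sklearn.linear_model import Ridge
    from sklearn.model_selection import cross_val_predict

    def evaluate_feature_mapping(X, evaluator_scores):
        """X: (n_children, n_features); evaluator_scores: (n_children, n_evaluators)."""
        y = evaluator_scores.mean(axis=1)            # "ground-truth" (average) evaluator
        pred = cross_val_predict(Ridge(alpha=1.0), X, y, cv=5)
        r, _ = pearsonr(y, pred)
        return r                                     # compare with inter-evaluator agreement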
Automatic Intonation Classification for Speech
Training Systems
György Szaszák, Dávid Sztahó, Klára Vicsi; BME,
Hungary
Wed-Ses2-O2-3, Time: 14:10
A prosodic Hidden Markov model (HMM) based modality recognizer has been developed, which, after supra-segmental acoustic
pre-processing, can perform clause and sentence boundary detection and modality (sentence type) recognition. This modality
recognizer is adapted to carry out automatic evaluation of the
intonation of the produced utterances in a speech training system
for hearing-impaired persons or foreign language learners. The
system is evaluated on utterances from normally-speaking persons and tested with speech-impaired (due to hearing problems)
persons. To allow a deeper analysis, the automatic classification of
the intonation is compared to subjective listening tests.
Automated Pronunciation Scoring Using Confidence
Scoring and Landmark-Based SVM
Su-Youn Yoon 1 , Mark Hasegawa-Johnson 1 , Richard
Sproat 2 ; 1 University of Illinois at Urbana-Champaign,
USA; 2 Oregon Health & Science University, USA
Wed-Ses2-O2-4, Time: 14:30
In this study, we present a pronunciation scoring method for
second language learners of English (hereafter, L2 learners). This
study presents a method using both confidence scoring and
classifiers. Classifiers have an advantage over confidence scoring
for specialization in the specific phonemes where L2 learners
make frequent errors. Classifiers (Landmark-based Support Vector
Machines) were trained in order to distinguish L2 phonemes from
their frequent substitution patterns.
In this study, the method was evaluated on the specific English
phonemes where L2 English learners make frequent errors. The
results suggest that the automated pronunciation scoring method
can be improved consistently by combining the two methods.
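A minimal sketch of combining a confidence score with a classifier decision (the GOP-style normalisation and the fusion weight are assumptions, not the authors' exact formulation):

    def gop_score(log_lik_target, log_lik_best, n_frames):
        # Goodness-of-pronunciation style confidence: normalised log-likelihood
        # ratio between the canonical phone and the best competing phone.
        return (log_lik_target - log_lik_best) / max(n_frames, 1)

    def combined_score(gop, svm_margin, w=0.5):
        # Late fusion of the confidence score and the (landmark-based) SVM margin;
        # the weight w would be tuned on held-out learner data.
        return w * gop + (1.0 - w) * svm_margin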
ASR Based Pronunciation Evaluation with
Automatically Generated Competing Vocabulary
Carlos Molina, Nestor Becerra Yoma, Jorge Wuth,
Hiram Vivanco; Universidad de Chile, Chile
Wed-Ses2-O2-5, Time: 14:50
In this paper the application of automatic speech recognition (ASR)
technology in CAPT (Computer Aided Pronunciation Training) is
addressed. A method to automatically generate the competitive
lexicon, required by an ASR engine to compare the pronunciation
of a target word with its correct and wrong phonetic realization,
is presented. In order to enable the efficient deployment of CAPT
applications, the generation of this competitive lexicon does not
require any human assistance or a priori information of mother
language-dependent errors. The method presented here leads to
average subjective-objective score correlations of 0.82 and
0.75, depending on the task.
High Performance Automatic Mispronunciation
Detection Method Based on Neural Network and
TRAP Features
Hongyan Li, Shijin Wang, Jiaen Liang, Shen Huang, Bo
Xu; Chinese Academy of Sciences, China
Wed-Ses2-O2-6, Time: 15:10
In this paper, we propose a new approach to utilize temporal
information and neural network (NN) to improve the performance
of automatic mispronunciation detection (AMD). Firstly, the alignment results between speech signals and corresponding phoneme
sequences are obtained within the classic GMM-HMM framework.
Then, the long-time TempoRAl Patterns (TRAPs) [5] features are
introduced to describe the pronunciation quality instead of the
conventional spectral features (e.g. MFCC). Based on the phoneme
boundaries and TRAPs features, we use Multi-layer Perceptron
(MLP) to calculate the final posterior probability of each testing
phoneme, and determine whether it is a mispronunciation or not
by comparing with a phone dependent threshold. Moreover, we
combine the TRAPs-MLP method with our existing methods to
further improve the performance. Experiments show that the
TRAPs-MLP method can give a significant relative improvement
of 39.04% in EER (Equal Error Rate) reduction, and the fusion
of TRAPs-MLP, GMM-UBM and GLDS-SVM [4] methods can yield
48.32% in EER reduction relatively, both compared with the baseline
GMM-UBM method.
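The following sketch shows the kind of long-temporal TRAP feature and thresholding step described above (the context length, edge padding and threshold handling are illustrative assumptions):

    import numpy as np

    def trap_features(logE, centre, context=50):
        """logE: (T, n_bands) log critical-band energies; returns one long
        temporal vector (2*context+1 frames per band) around the centre frame."""
        lo, hi = centre - context, centre + context + 1
        window = logE[max(lo, 0):hi]
        if lo < 0:                                    # pad at utterance start
            window = np.vstack([np.repeat(window[:1], -lo, axis=0), window])
        if window.shape[0] < 2 * context + 1:         # pad at utterance end
            pad = 2 * context + 1 - window.shape[0]
            window = np.vstack([window, np.repeat(window[-1:], pad, axis=0)])
        return window.T.reshape(-1)                   # band-wise trajectories concatenated

    def is_mispronounced(mlp_posterior_correct, phone, thresholds):
        # Flag the phone if the MLP posterior of the canonical phone falls
        # below its phone-dependent threshold (tuned on development data).
        return mlp_posterior_correct < thresholds[phone]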
Wed-Ses2-O3 : ASR: New Paradigms I
Fallside (East Wing 2), 13:30, Wednesday 9 Sept 2009
Chair: Geoffrey Zweig, Microsoft Research, USA
The Semi-Supervised Switchboard Transcription
Project
Amarnag Subramanya, Jeff Bilmes; University of
Washington, USA
Wed-Ses2-O3-1, Time: 13:30
In previous work, we proposed a new graph-based semi-supervised
learning (SSL) algorithm and showed that it outperforms other
state-of-the-art SSL approaches for classifying documents and
web-pages. Here we use a multi-threaded implementation in order
to scale the algorithm to very large data sets. We treat the phonetically annotated portion of the Switchboard transcription project
(STP) as labeled data and automatically annotate (at the phonetic
level) the Switchboard I (SWB) training set and show that our
proposed approach outperforms state-of-the-art SSL algorithms as
well as a state-of-the-art strictly supervised classifier. As a result,
we have STP-style annotations of the entire SWB-I training set
which we refer to as semi-supervised STP (S3TP).
Maximum Mutual Information Multi-Phone Units in
Direct Modeling
Geoffrey Zweig, Patrick Nguyen; Microsoft Research,
USA
Wed-Ses2-O3-2, Time: 13:30
This paper introduces a class of discriminative features for use in
maximum entropy speech recognition models. The features we
propose are acoustic detectors for discriminatively determined
multi-phone units. The multi-phone units are found by computing
the mutual information between the phonetic sub-sequences that
occur in the training lexicon, and the word labels. This quantity is
a function of an error model governing our ability to detect phone
sequences accurately (an otherwise informative sequence which
cannot be reliably detected is not so useful). We show how to
compute this mutual information quantity under a class of error
models efficiently, in one pass over the data, for all phonetic subsequences in the training data. After this computation, detectors
are created for a subset of highly informative units. We then define
two novel classes of features based on these units: associative and
transductive. Incorporating these features in a maximum entropy
based direct model for Voice-Search outperforms the baseline by
24% in sentence error rate.
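A simplified sketch of scoring a candidate multi-phone unit by the mutual information between its (error-prone) detection and the word identity; the uniform word prior and the two-parameter error model are assumptions made for brevity:

    import numpy as np

    def bernoulli_entropy(p):
        p = np.clip(p, 1e-12, 1 - 1e-12)
        return -(p * np.log(p) + (1 - p) * np.log(1 - p))

    def unit_word_mi(lexicon, unit, p_miss=0.1, p_false=0.01):
        """lexicon: dict word -> tuple of phones; unit: candidate phone sub-sequence.
        Returns I(U; W) for a binary detector U of the unit under a simple error model."""
        def occurs(pron):
            n = len(unit)
            return any(tuple(pron[i:i + n]) == tuple(unit) for i in range(len(pron) - n + 1))
        words = list(lexicon)
        p_w = 1.0 / len(words)
        # P(detector fires | word): hits minus misses, plus false alarms.
        p_u_given_w = [(1 - p_miss) if occurs(lexicon[w]) else p_false for w in words]
        p_u = sum(p_w * p for p in p_u_given_w)
        h_u = bernoulli_entropy(p_u)
        h_u_given_w = sum(p_w * bernoulli_entropy(p) for p in p_u_given_w)
        return h_u - h_u_given_w    # high MI => informative and reliably detectable unit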
Profiling Large-Vocabulary Continuous Speech
Recognition on Embedded Devices: A Hardware
Resource Sensitivity Analysis
Kai Yu, Rob A. Rutenbar; Carnegie Mellon University,
USA
Wed-Ses2-O3-3, Time: 13:30
When deployed in embedded systems, speech recognizers are
necessarily reduced from large-vocabulary continuous speech
recognizers (LVCSR) found on desktops or servers to fit the limited
hardware. However, embedded hardware continues to evolve in
capability; today’s smartphones are vastly more powerful than
their recent ancestors. This begets a new question: which hardware
features not currently found on today’s embedded platforms, but
potentially add-ons to tomorrow’s devices, are most likely to
improve recognition performance? Said differently — what is the
sensitivity of the recognizer to fine-grain details of the embedded
hardware resources? To answer this question rigorously and
quantitatively, we offer results from a detailed study of LVCSR
performance as a function of micro-architecture options on an
embedded ARM11 and an enterprise-class Intel Core2Duo. We
estimate speed and energy consumption, and show, feature by
feature, how hardware resources impact recognizer performance.
Continuous Speech Recognition Using Attention
Shift Decoding with Soft Decision
Ozlem Kalinli, Shrikanth S. Narayanan; University of
Southern California, USA
Wed-Ses2-O3-4, Time: 13:30
We present an attention shift decoding (ASD) method inspired by
human speech recognition. In contrast to the traditional automatic
speech recognition (ASR) systems, ASD decodes speech inconsecutively using reliability criteria; the gaps (unreliable speech
regions) are decoded with the evidence of islands (reliable speech
regions). On the BU Radio News Corpus, ASD provides significant
improvement (2.9% absolute) over the baseline ASR results when
it is used with oracle island-gap information. At the core of the
ASD method is the automatic island-gap detection. Here, we
propose a new feature set for automatic island-gap detection which
achieves 83.7% accuracy. To cope with the imperfect nature of the
island-gap classification, we also propose a new ASD algorithm
using soft decision. The ASD with soft decision provides 0.4%
absolute (2.2% relative) improvement over the baseline ASR results
when it is used with automatically detected islands and gaps.
Towards Using Hybrid Word and Fragment Units for
Vocabulary Independent LVCSR Systems
Ariya Rastrow 1 , Abhinav Sethy 2 , Bhuvana
Ramabhadran 2 , Frederick Jelinek 1 ; 1 Johns Hopkins
University, USA; 2 IBM T.J. Watson Research Center,
USA
Wed-Ses2-O4 : Single-Channel Speech
Enhancement
Holmes (East Wing 3), 13:30, Wednesday 9 Sept 2009
Chair: B. Yegnanarayana, IIIT Hyderabad, India
Constrained Probabilistic Subspace Maps Applied to
Speech Enhancement
Kaustubh Kalgaonkar, Mark A. Clements; Georgia
Institute of Technology, USA
Wed-Ses2-O4-1, Time: 13:30
This paper presents a probabilistic algorithm that extracts a
mapping between two subspaces by representing each subspace as
a collection of states. In many cases, the data is a time series with
temporal constraints. This paper suggests a method to impose
these temporal constraints on the transitions between the states
of the subspace.
This probabilistic model has been successfully applied to the problem of speech enhancement and improves the performance of a
Wiener filter by providing robust estimates of a priori SNR.
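For reference, the Wiener gain driven by an a priori SNR estimate can be sketched as follows (illustrative only; the SNR floor value is an assumption):

    import numpy as np

    def wiener_gain(xi):
        """xi: a priori SNR per frequency bin; Wiener gain H = xi / (1 + xi)."""
        return xi / (1.0 + xi)

    def enhance_frame(noisy_spectrum, xi_estimate):
        # A robust a priori SNR estimate (here supplied by the subspace mapping,
        # or by decision-directed smoothing in a conventional system) drives the gain.
        gain = wiener_gain(np.maximum(xi_estimate, 1e-3))   # floor limits musical noise
        return gain * noisy_spectrum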
Wed-Ses2-O3-5, Time: 13:30
This paper presents the advantages of augmenting a word-based
system with sub-word units as a step towards building open
vocabulary speech recognition systems. We show that a hybrid
system which combines words and data-driven, variable-length sub-word
units has better phone accuracy than word-only systems. In
addition the hybrid system is better in detecting Out-Of-Vocabulary
(OOV) terms and representing them phonetically. Results are presented on the RT-04 broadcast news and MIT Lecture data sets.
An FSM-based approach to recover OOV words from the hybrid
lattices is also presented. At an OOV rate of 2.5% on RT-04 we
observed an 8% relative improvement in phone error rate (PER), 7.3%
relative improvement in oracle PER and 7% relative improvement
in WER after recovering the OOV terms. A significant reduction of
33% relative in PER is seen in the OOV regions.
Unsupervised Training of an HMM-Based Speech
Recognizer for Topic Classification
Herbert Gish, Man-hung Siu, Arthur Chan, Bill Belfield;
BBN Technologies, USA
Wed-Ses2-O3-6, Time: 13:30
HMM-based Speech-To-Text (STT) systems are widely deployed
not only for dictation tasks but also as the first processing stage
of many automatic speech applications such as spoken topic
classification. However, the necessity of transcribed data for
training the HMMs precludes its use in domains where transcribed
speech is difficult to come by because of the specific domain,
channel or language. In this work, we propose building HMM-based
speech recognizers without transcribed data by formulating the
HMM training as an optimization over both the parameter and
transcription sequence space. We describe how this can be easily
implemented using existing STT tools. We tested the effectiveness of our unsupervised training approach on the task of topic
classification on the Switchboard corpus. The unsupervised HMM
recognizer, initialized with a segmental tokenizer, outperformed
both an HMM phoneme recognizer trained with 1 hour of
transcribed data and the Brno University of Technology (BUT)
Hungarian phoneme recognizer. This approach can also be applied
to other speech applications, including spoken term detection,
language and speaker verification.
Reconstructing Clean Speech from Noisy MFCC
Vectors
Ben Milner, Jonathan Darch, Ibrahim Almajai;
University of East Anglia, UK
Wed-Ses2-O4-2, Time: 13:50
The aim of this work is to reconstruct clean speech solely from a
stream of noise-contaminated MFCC vectors, as may be encountered in distributed speech recognition systems. Speech reconstruction is performed using the ETSI Aurora back-end speech reconstruction standard which requires MFCC vectors, fundamental
frequency and voicing information. In this work, fundamental frequency and voicing are obtained using maximum a posteriori prediction from input MFCC vectors, thereby allowing speech reconstruction solely from a stream of MFCC vectors. Two different methods to improve prediction accuracy in noisy conditions are then
developed. Experimental results first establish that improved fundamental frequency and voicing prediction is obtained when noise
compensation is applied. A series of human listening tests are then
used to analyse the reconstructed speech quality, which determine
the effectiveness of noise compensation in terms of mean opinion
scores.
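The envelope-recovery step that underlies such reconstruction can be sketched as follows (the ETSI Aurora back end defines its own inversion and synthesis; this is only the generic inverse-DCT idea, with an assumed 23-band mel analysis):

    import numpy as np
    from scipy.fftpack import idct

    def mfcc_to_mel_envelope(mfcc, n_mels=23):
        """Invert truncated MFCCs (DCT-II of log mel energies) back to a smooth
        log mel envelope; discarded higher cepstral coefficients are zero-padded."""
        padded = np.zeros(n_mels)
        padded[:len(mfcc)] = mfcc
        log_mel = idct(padded, type=2, norm="ortho")
        return np.exp(log_mel)      # smooth mel-band magnitude envelope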
An Evaluation of Objective Quality Measures for
Speech Intelligibility Prediction
Cees H. Taal 1 , Richard C. Hendriks 1 , Richard
Heusdens 1 , Jesper Jensen 2 , Ulrik Kjems 2 ; 1 Technische
Universiteit Delft, The Netherlands; 2 Oticon A/S,
Denmark
Wed-Ses2-O4-3, Time: 14:10
In this research various objective quality measures are evaluated in
order to predict the intelligibility for a wide range of non-linearly
processed speech signals and speech degraded by additive noise.
The obtained results are compared with the prediction results of a
more advanced perceptual-based model proposed by Dau et al. and
an objective intelligibility measure, namely the coherence speech intelligibility index (cSII). These tests are performed in order to gain
more knowledge about the link between speech quality and speech intelligibility and may help us to exploit the extensive research done
in the field of speech quality for speech intelligibility. It is shown
that cSII does not necessarily show better performance compared
to conventional objective (speech)-quality measures. In general, the
DAU-model is the only method with reasonable results for all processing conditions.
Performance Comparison of HMM and VQ Based
Single Channel Speech Separation
M.H. Radfar 1 , W.-Y. Chan 2 , R.M. Dansereau 3 , W.
Wong 1 ; 1 University of Toronto, Canada; 2 Queen’s
University, Canada; 3 Carleton University, Canada
Enhancing Audio Speech Using Visual Speech
Features
Ibrahim Almajai, Ben Milner; University of East Anglia,
UK
Wed-Ses2-O4-6, Time: 15:10
This work presents a novel approach to speech enhancement by
exploiting the bimodality of speech and the correlation that exists
between audio and visual speech features. For speech enhancement, a visually-derived Wiener filter is developed. This obtains
clean speech statistics from visual features by modelling their joint
density and making a maximum a posteriori estimate of clean audio from visual speech features. Noise statistics for the Wiener
filter utilise an audio-visual voice activity detector which classifies
input audio as speech or non-speech, enabling a noise model to be
updated. Analysis shows estimation of speech and noise statistics
to be effective with human listening tests measuring the effectiveness of the resulting Wiener filter.
Wed-Ses2-P1 : Emotion and Expression II
Hewison Hall, 13:30, Wednesday 9 Sept 2009
Chair: L. ten Bosch, Radboud Universiteit Nijmegen, The
Netherlands
Wed-Ses2-O4-4, Time: 14:30
In this paper, single channel speech separation (SCSS) techniques
based on hidden Markov models (HMM) and vector quantization
(VQ) are described and compared in terms of (a) signal-to-noise ratio (SNR) between separated and original speech signals, (b) preference of listeners, and (c) computational complexity. The SNR results show that the HMM-based technique marginally outperforms
the VQ-based technique by 0.85 dB in experiments conducted on
mixtures of female-female, male-male, and male-female speakers.
Subjective tests show that listeners prefer HMM over VQ for 86.70%
of test speech files. This improvement, however, is at the expense
of a drastic increase in computational complexity when compared
with the VQ-based technique.
Stereo-Input Speech Recognition Using
Sparseness-Based Time-Frequency Masking in a
Reverberant Environment
Perceiving Surprise on Cue Words: Prosody and
Semantics Interact on Right and Really
Catherine Lai; University of Pennsylvania, USA
Wed-Ses2-P1-1, Time: 13:30
Cue words in dialogue have different interpretations depending on
context and prosody. This paper presents a corpus study and perception experiment investigating when prosody causes right and
really to be perceived as questioning or expressing surprise. Pitch
range is found to be the best cue for surprise. This extends to the
question rating for really but not for right. In fact, prosody appears
to interact with semantics so ratings differ for these two types of
cue word even when prosodic features are similar. So, different
semantics appears to result in different surprise/question rating
thresholds.
Yosuke Izumi 1 , Kenta Nishiki 1 , Shinji Watanabe 2 ,
Takuya Nishimoto 1 , Nobutaka Ono 1 , Shigeki
Sagayama 1 ; 1 University of Tokyo, Japan; 2 NTT
Corporation, Japan
Emotion Recognition Using Linear Transformations
in Combination with Video
Wed-Ses2-O4-5, Time: 14:50
Wed-Ses2-P1-2, Time: 13:30
We present noise robust automatic speech recognition (ASR) using
a sparseness-based underdetermined blind source separation (BSS)
technique. As a representative underdetermined BSS method, we
utilized time-frequency masking in this paper. Although time-frequency masking is able to separate target speech from interferences effectively, one should consider two problems. One is that
masking does not work well in noisy or reverberant environments.
The other is that masking itself might cause some distortion of the
target speech. For the former, we apply our time-frequency masking method [7], which can separate the target signal robustly even in
noisy and reverberant environments. Next, investigating the distortion caused by time-frequency masking, we reveal the following facts
through experiments: 1) a soft mask is better than a binary mask in
terms of recognition performance, and 2) cepstral mean normalization (CMN) reduces the distortion, especially that caused by the soft
mask. Finally, we evaluate the recognition performance of our
method in a noisy and reverberant real environment.
The paper discusses the use of linear transformations of Hidden
Markov Models, normally employed for speaker and environment
adaptation, as a way of extracting the emotional components from
the speech. A constrained version of Maximum Likelihood Linear
Regression (CMLLR) transformation is used as a feature for classification of normal or aroused emotional state. We present a procedure of incrementally building a set of speaker independent acoustic models, that are used to estimate the CMLLR transformations for
emotion classification. An audio-video database of spontaneous
emotions (AvID) is briefly presented since it forms the basis for
the evaluation of the proposed method. Emotion classification using the video part of the database is also described and the added
value of combining the visual information with the audio features
is shown.
Rok Gajšek, Vitomir Štruc, Simon Dobrišek, France
Mihelič; University of Ljubljana, Slovenia
Speaker Dependent Emotion Recognition Using
Prosodic Supervectors
Ignacio Lopez-Moreno, Carlos Ortego-Resa, Joaquin
Gonzalez-Rodriguez, Daniel Ramos; Universidad
Autónoma de Madrid, Spain
Modeling Mutual Influence of Interlocutor Emotion
States in Dyadic Spoken Interactions
Wed-Ses2-P1-3, Time: 13:30
This work presents a novel approach for detection of emotions
embedded in the speech signal. The proposed approach works
at the prosodic level, and models the statistical distribution of
the prosodic features with Gaussian Mixture Models (GMM) mean-adapted from a Universal Background Model (UBM). This allows the
use of GMM-mean supervectors, which are classified by a Support
Vector Machine (SVM). Our proposal is compared to a popular baseline, which classifies with an SVM a set of selected prosodic features
from the whole speech signal. In order to measure the speaker inter-variability, which is a factor of degradation in this task, speaker
dependent and speaker independent frameworks have been considered. Experiments have been carried out on the SUSAS subcorpus, including real and simulated emotions. Results show that
in a speaker dependent framework our proposed approach achieves
a relative improvement greater than 14% in Equal Error Rate (EER)
with respect to the baseline approach. The relative improvement is
greater than 17% when both approaches are combined together by
fusion with respect to the baseline.
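A compact sketch of the GMM-mean supervector construction (relevance-MAP adaptation of a UBM) that such a system uses before SVM classification; the relevance factor and library choices are assumptions:

    import numpy as np
    from sklearn.mixture import GaussianMixture
    from sklearn.svm import SVC

    def map_mean_supervector(ubm, frames, relevance=16.0):
        """ubm: fitted GaussianMixture (the UBM); frames: (T, D) prosodic features.
        Returns the stacked MAP-adapted means (the GMM-mean supervector)."""
        post = ubm.predict_proba(frames)                 # (T, K) responsibilities
        n_k = post.sum(axis=0)                           # soft counts per mixture
        f_k = post.T @ frames                            # first-order statistics
        alpha = (n_k / (n_k + relevance))[:, None]
        adapted = alpha * (f_k / np.maximum(n_k, 1e-6)[:, None]) + (1 - alpha) * ubm.means_
        return adapted.reshape(-1)

    # Supervectors from training utterances are then fed to an SVM, e.g.
    # SVC(kernel="linear").fit(supervectors, emotion_labels)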
Physiologically-Inspired Feature Extraction for
Emotion Recognition
Chi-Chun Lee, Carlos Busso, Sungbok Lee, Shrikanth S.
Narayanan; University of Southern California, USA
Wed-Ses2-P1-6, Time: 13:30
In dyadic human interactions, mutual influence — a person’s influence on the interacting partner’s behaviors — is shown to be
important and could be incorporated into the modeling framework
in characterizing, and automatically recognizing the participants’
states. We propose a Dynamic Bayesian Network (DBN) to explicitly
model the conditional dependency between two interacting partners’ emotion states in a dialog using data from the IEMOCAP corpus of expressive dyadic spoken interactions. Also, we focus on
automatically computing the Valence-Activation emotion attributes
to obtain a continuous characterization of the participants’ emotion flow. Our proposed DBN models the temporal dynamics of the
emotion states as well as the mutual influence between speakers
in a dialog. With speech-based features, the proposed network improves classification accuracy by 3.67% absolute and 7.12% relative over the Gaussian Mixture Model (GMM) baseline on isolated turn-by-turn emotion classification.
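The paper's DBN is richer than this, but the core idea of mutual influence can be illustrated with a toy coupled model in which each speaker's next emotion state depends on both speakers' previous states and inference is a forward pass over the joint state (all probabilities below are invented):

# Toy illustration (not the authors' DBN): two interlocutors' discrete emotion
# states are coupled by letting each speaker's next state depend on both
# speakers' previous states; inference is a forward pass over the joint state.
import numpy as np

S = 2                                   # toy states: 0 = neutral, 1 = negative

# P(own_next | own_prev, partner_prev): tendency to stay, plus a pull
# toward the partner's previous state (numbers are invented).
trans = np.zeros((S, S, S))
for own in range(S):
    for partner in range(S):
        p = np.full(S, 0.15)
        p[own] += 0.55
        p[partner] += 0.15
        trans[own, partner] = p / p.sum()

# Per-turn emission likelihoods p(features | state) for each speaker (toy).
obs_a = np.array([[0.9, 0.1], [0.4, 0.6], [0.2, 0.8]])
obs_b = np.array([[0.8, 0.2], [0.3, 0.7], [0.3, 0.7]])

alpha = np.full((S, S), 1.0 / (S * S))  # joint belief over (state_A, state_B)
for t in range(len(obs_a)):
    new = np.zeros((S, S))
    for ap in range(S):
        for bp in range(S):
            for a in range(S):
                for b in range(S):
                    new[a, b] += alpha[ap, bp] * trans[ap, bp, a] * trans[bp, ap, b]
    alpha = new * np.outer(obs_a[t], obs_b[t])
    alpha /= alpha.sum()
    print(f"turn {t}: P(A negative)={alpha[1].sum():.2f}, P(B negative)={alpha[:, 1].sum():.2f}")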
A Detailed Study of Word-Position Effects on
Emotion Expression in Speech
Yu Zhou 1 , Yanqing Sun 1 , Junfeng Li 2 , Jianping
Zhang 1 , Yonghong Yan 1 ; 1 Chinese Academy of
Sciences, China; 2 JAIST, Japan
Jangwon Kim, Sungbok Lee, Shrikanth S. Narayanan;
University of Southern California, USA
Wed-Ses2-P1-4, Time: 13:30
In this paper, we propose a new feature extraction method for emotion recognition based on knowledge of the physiological mechanism of emotion production. It has been reported in physio-acoustic studies that emotional speech is encoded differently from normal speech in terms of the articulatory organs, and that emotion information in speech is concentrated in different frequency regions caused by the different movements of these organs [4]. To apply these findings, we first quantified the distribution of speech emotion information across frequency bands using Fisher’s F-Ratio and mutual information techniques, and then proposed a non-uniform sub-band processing method which is able to extract and emphasize the emotion features in speech. These extracted features are finally applied to emotion recognition. Experimental results in speech emotion recognition showed that the features extracted with our proposed non-uniform sub-band processing outperform traditional MFCC features, and the average
error reduction rate amounts to 16.8% for speech emotion recognition.
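As a hedged sketch of how per-band emotion information might be quantified with Fisher's F-ratio (band layout, data and class labels are invented, and the paper's exact recipe may differ):

# Minimal sketch: band energies are computed from the spectrogram and the
# ratio of between-class to within-class variance is reported per band.
import numpy as np
from scipy.signal import stft

rng = np.random.default_rng(2)
sr, n_bands = 16000, 8
signals = [rng.normal(size=sr) for _ in range(40)]        # 1-second toy clips
labels = rng.integers(0, 2, size=40)                      # toy emotion classes

def band_energies(x):
    _, _, Z = stft(x, fs=sr, nperseg=512)
    power = np.abs(Z) ** 2                                 # (freq_bins, frames)
    bands = np.array_split(power, n_bands, axis=0)         # equal-width bands
    return np.array([b.mean() for b in bands])             # mean energy per band

E = np.log(np.array([band_energies(s) for s in signals]))  # (n_clips, n_bands)

f_ratio = []
for band in range(n_bands):
    cls = [E[labels == c, band] for c in (0, 1)]
    between = np.var([c.mean() for c in cls])
    within = np.mean([c.var() for c in cls])
    f_ratio.append(between / within)
print("per-band F-ratio:", np.round(f_ratio, 3))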
Perceived Loudness and Voice Quality in Affect
Cueing
Wed-Ses2-P1-7, Time: 13:30
We investigate emotional effects on articulatory-acoustic speech
characteristics with respect to word location within a sentence. We
examined the hypothesis that emotional effects vary with word position by first examining articulatory features manually extracted from electromagnetic articulography data. Initial articulatory data analyses indicated that the emotional effects on sentence-medial words are significantly stronger than on sentence-initial words. To verify that observation further, we expanded our hypothesis testing to include both acoustic and articulatory data and an expanded set of words from different locations. Results suggest that emotional effects are generally more significant on sentence-medial words than on sentence-initial and sentence-final words. This finding suggests that word location needs to be considered as a factor
in emotional speech processing.
CMAC for Speech Emotion Profiling
Norhaslinda Kamaruddin, Abdul Wahab; Nanyang
Technological University, Singapore
Wed-Ses2-P1-8, Time: 13:30
Irena Yanushevskaya, Christer Gobl, Ailbhe
Ní Chasaide; Trinity College Dublin, Ireland
Wed-Ses2-P1-5, Time: 13:30
The paper describes an auditory experiment aimed at testing
whether the intrinsic loudness of a stimulus with a given voice
quality influences the way in which it signals affect. Synthesised
voice quality stimuli in which intrinsic loudness was systematically manipulated were presented to listeners to test the effect of this manipulation on the affective colouring of the stimuli. The results showed that even when devoid of intrinsic loudness variation, non-modal voice quality stimuli were capable of communicating affect. However, changing the loudness of a non-modal voice quality stimulus towards its intrinsic loudness resulted in an increase in affective ratings.
Cultural differences have been one of the many factors that can
cause failures in speech emotion analysis. If this cultural parameter could be regarded as a noise artifact in detecting emotion in speech, we could then extract a pure emotional speech signal from the
raw emotional speech. In this paper we use the amplitude spectral
subtraction (ASS) method to profile the emotion from raw emotional
speech based on the affection space model. In addition, the robustness of the cerebellar model arithmetic computer (CMAC) is used
to ensure that all other noise artifacts can be suppressed. Results from the speech emotion profiling show the potential of this technique to visualize hidden features for detecting intra-cultural and inter-cultural variation that is missed by current approaches to
speech emotion recognition.
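For illustration only, generic amplitude spectral subtraction looks as follows; the paper's affection-space-based profiling and CMAC post-processing are not reproduced here, and all signals are random stand-ins:

# Generic amplitude spectral subtraction sketch: the average magnitude
# spectrum of a "reference" signal is subtracted from that of the input and
# the result is floored at zero before resynthesis.
import numpy as np
from scipy.signal import stft, istft

rng = np.random.default_rng(3)
sr = 16000
speech = rng.normal(size=sr)          # stand-in for raw emotional speech
reference = rng.normal(size=sr)       # stand-in for the component to remove

f, t, S = stft(speech, fs=sr, nperseg=512)
_, _, R = stft(reference, fs=sr, nperseg=512)

mag = np.maximum(np.abs(S) - np.abs(R).mean(axis=1, keepdims=True), 0.0)
phase = np.angle(S)
_, cleaned = istft(mag * np.exp(1j * phase), fs=sr, nperseg=512)
print("output length:", len(cleaned))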
On the Relevance of High-Level Features for
Speaker Independent Emotion Recognition of
Spontaneous Speech
Marko Lugger, Bin Yang; Universität Stuttgart,
Germany
Spoken dialogue researchers often use supervised machine learning to classify turn-level user affect from a set of turn-level features. The utility of sub-turn features has been less explored, due
to the complications introduced by associating a variable number
of sub-turn units with a single turn-level classification. We present
and evaluate several voting methods for using word-level pitch and
energy features to classify turn-level user uncertainty in spoken dialogue data. Our results show that when linguistic knowledge regarding prosody and word position is introduced into a word-level
voting model, classification accuracy is significantly improved compared to the use of both turn-level and uninformed word-level models.
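A minimal sketch of turning word-level predictions into a turn-level decision by voting; the weighted variant shows one way position-dependent knowledge could be injected (the labels and weights below are invented):

# Majority and weighted voting over word-level labels to obtain a turn label.
from collections import Counter

def majority_vote(word_labels):
    return Counter(word_labels).most_common(1)[0][0]

def weighted_vote(word_labels, weights):
    score = Counter()
    for label, w in zip(word_labels, weights):
        score[label] += w
    return score.most_common(1)[0][0]

turn = ["certain", "uncertain", "uncertain", "certain", "certain"]
position_weights = [0.5, 2.0, 2.0, 0.5, 1.0]    # invented: emphasise mid-turn words

print(majority_vote(turn))                      # -> certain
print(weighted_vote(turn, position_weights))    # -> uncertain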
Wed-Ses2-P1-9, Time: 13:30
Detecting Subjectivity in Multiparty Speech
In this paper we study the relevance of so-called high-level speech features for speaker-independent emotion recognition. After giving a brief definition of high-level features, we
discuss for which standard feature groups high-level features are
conceivable. Two groups of high-level features are proposed within
this paper: a feature set for the parametrization of phonation called
voice quality parameters and a second feature set deduced from
music theory called harmony features. Harmony features give information about the frequency interval and chord content of the
pitch data of a spoken utterance. Finally, we study the gain in classification rate by combining the proposed high-level features with the
standard low-level features. We show that both high-level feature
sets improve the speaker independent classification performance
for spontaneous emotional speech.
Gabriel Murray, Giuseppe Carenini; University of
British Columbia, Canada
Pitch Contour Parameterisation Based on Linear
Stylisation for Emotion Recognition
Recognising Interest in Conversational Speech —
Comparing Bag of Frames and Supra-Segmental
Features
Vidhyasaharan Sethu, Eliathamby Ambikairajah,
Julien Epps; University of New South Wales, Australia
Björn Schuller, Gerhard Rigoll; Technische Universität
München, Germany
Wed-Ses2-P2-3, Time: 13:30
Wed-Ses2-P1-10, Time: 13:30
It is common knowledge that affective and emotion-related states
are acoustically well modelled on a supra-segmental level. Nonetheless successes are reported for frame-level processing either by
means of dynamic classification or multi-instance learning techniques. In this work a quantitative feature-type-wise comparison
between frame-level and supra-segmental analysis is carried out
for the recognition of interest in human conversational speech. To
shed light on the respective differences, the same classifier, namely Support Vector Machines, is used in both cases: once by clustering a ‘bag of frames’ of unknown sequence length employing Multi-Instance Learning techniques, and once by applying statistical functionals to project the time series onto a static feature vector. The Audiovisual Interest Corpus of naturalistic interest serves as the database.
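The supra-segmental route can be sketched as projecting a variable-length sequence of frame-level features onto a fixed-length vector of statistical functionals (the functional set below is a common choice, not necessarily the one used in the paper):

# A variable-length (T, d) sequence of frame features is mapped to a fixed
# vector of per-dimension statistics that a static classifier can consume.
import numpy as np

def functionals(frames):
    """frames: (T, d) array -> fixed-length vector of per-dimension statistics."""
    return np.concatenate([
        frames.mean(axis=0),
        frames.std(axis=0),
        frames.min(axis=0),
        frames.max(axis=0),
        np.percentile(frames, 25, axis=0),
        np.percentile(frames, 75, axis=0),
    ])

rng = np.random.default_rng(4)
utterance = rng.normal(size=(rng.integers(50, 300), 12))   # variable length
vector = functionals(utterance)
print(vector.shape)    # (72,) regardless of the utterance length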
Wed-Ses2-P2 : Expression, Emotion and
Personality Recognition
Hewison Hall, 13:30, Wednesday 9 Sept 2009
Chair: John H.L. Hansen, University of Texas at Dallas, USA
Classifying Turn-Level Uncertainty Using
Word-Level Prosody
Diane Litman 1 , Mihai Rotaru 2 , Greg Nicholas 3 ;
1 University of Pittsburgh, USA; 2 Textkernel B.V., The
Netherlands; 3 Brown University, USA
Wed-Ses2-P2-1, Time: 13:30
Wed-Ses2-P2-2, Time: 13:30
In this research we aim to detect subjective sentences in spontaneous speech and label them for polarity. We introduce a novel
technique wherein subjective patterns are learned from both labeled and unlabeled data, using n-grams with varying levels of lexical instantiation. Applying this technique to meeting speech, we
gain significant improvement over state-of-the-art approaches and
demonstrate the method’s robustness to ASR errors. We also show
that coupling the pattern-based approach with structural and lexical features of meetings yields additional improvement.
The pitch contour contains information that characterises the emotion being expressed by speech, and consequently features extracted from pitch form an integral part of many automatic emotion
recognition systems. While pitch contours may have many small
variations and hence are difficult to represent compactly, it may be
possible to parameterise them by approximating the contour for
each voiced segment by a straight line. This paper looks at such
a parameterisation method in the context of emotion recognition.
Listening tests were performed to subjectively determine if the linearly stylised contours were able to sufficiently capture information
pertaining to emotions expressed in speech. Furthermore these parameters were used as features for an automatic 5-class emotion
classification system. The use of the proposed parameters rather
than pitch statistics resulted in a relative increase in accuracy of
about 20%.
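A minimal sketch of linear stylisation, assuming an F0 track with unvoiced frames marked as zero; each voiced segment is replaced by a straight-line fit whose slope, intercept and duration could serve as features (the frame shift and contour values are invented):

# Each voiced segment of the F0 contour is approximated by a straight line.
import numpy as np

def stylise(f0, frame_shift=0.01):
    params, start = [], None
    for i, v in enumerate(np.append(f0, 0.0)):       # sentinel ends last segment
        if v > 0 and start is None:
            start = i
        elif v <= 0 and start is not None:
            seg = f0[start:i]
            t = np.arange(len(seg)) * frame_shift
            slope, intercept = np.polyfit(t, seg, 1) if len(seg) > 1 else (0.0, seg[0])
            params.append((slope, intercept, len(seg) * frame_shift))
            start = None
    return params

f0 = np.array([0, 0, 120, 125, 131, 138, 0, 0, 180, 176, 170, 0])
for slope, intercept, dur in stylise(f0):
    print(f"slope={slope:.1f} Hz/s  start={intercept:.1f} Hz  dur={dur*1000:.0f} ms")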
Feature-Based and Channel-Based Analyses of
Intrinsic Variability in Speaker Verification
Martin Graciarena 1 , Tobias Bocklet 2 , Elizabeth
Shriberg 1 , Andreas Stolcke 1 , Sachin Kajarekar 1 ; 1 SRI
International, USA; 2 FAU Erlangen-Nürnberg,
Germany
Wed-Ses2-P2-4, Time: 13:30
We explore how intrinsic variations (those associated with the
speaker rather than the recording environment) affect text-independent speaker verification performance. In a previous paper we introduced the SRI-FRTIV corpus and provided speaker verification results using a Gaussian mixture model (GMM) system on
telephone-channel speech. In this paper we explore the use of other
speaker verification systems on the telephone channel data and
compare against the GMM baseline. We found the GMM system
to be one of the more robust across all conditions. Systems relying on recognition hypotheses had a significant degradation in low
vocal effort conditions. We also explore the use of the GMM system on several other channels. We found improved performance
on table-top microphones compared to the telephone channel in
furtive conditions and gradual degradations as a function of the
distance from the microphone to the speaker. Therefore distant
microphones further degrade the speaker verification performance
due to intrinsic variability.
Robust Angry Speech Detection Employing a
TEO-Based Discriminative Classifier Combination
Wooil Kim, John H.L. Hansen; University of Texas at
Dallas, USA
Wed-Ses2-P2-5, Time: 13:30
This study proposes an effective angry speech detection approach employing TEO-based feature extraction. Decorrelation processing is applied to the TEO-based features to increase model training ability by decreasing the correlation between feature elements
and vector size. Minimum classification error training is employed
to increase the discrimination between the angry speech model and
other stressed speech models. Combination with the conventional
Mel frequency cepstral coefficients (MFCC) is also employed to leverage the effectiveness of MFCC in characterizing the spectral envelope of speech signals. Experimental results on the SUSAS corpus demonstrate that the proposed angry speech detection scheme is
effective at increasing detection accuracy on an open-speaker and
open-vocabulary task. An improvement of up to 7.78% in classification accuracy is obtained by combination of the proposed methods
including decorrelation of TEO-based feature vector, discriminative
training, and classifier combination.
Improving Emotion Recognition Using Class-Level
Spectral Features
Dmitri Bitouk, Ani Nenkova, Ragini Verma; University
of Pennsylvania, USA
Wed-Ses2-P2-6, Time: 13:30
Traditional approaches to automatic emotion recognition from
speech typically make use of utterance-level prosodic features. Still,
a great deal of useful information about expressivity and emotion can be gained from segmental spectral features, which provide a more detailed description of the speech signal, or from
measurements from specific regions of the utterance, such as the
stressed vowels. Here we introduce a novel set of spectral features
for emotion recognition: statistics of Mel-Frequency Spectral Coefficients computed over three phoneme type classes of interest:
stressed vowels, unstressed vowels and consonants in the utterance. We investigate performance of our features in the task of
speaker-independent emotion recognition using two publicly available datasets. Our experimental results clearly indicate that indeed
both the richer set of spectral features and the differentiation between phoneme type classes are beneficial for the task. Classification accuracies are consistently higher for our features compared
to prosodic features or utterance-level spectral features. Combination of our phoneme class features with prosodic features leads to
even further improvement.
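As an illustrative sketch (not the authors' exact feature definition), class-level statistics can be computed by pooling frame-level cepstra over phoneme classes given a time-aligned segmentation; the alignment and data below are invented:

# Frame-level cepstral coefficients are pooled separately over stressed
# vowels, unstressed vowels and consonants, and per-class means and standard
# deviations are concatenated into one utterance-level vector.
import numpy as np

rng = np.random.default_rng(5)
n_frames, n_ceps = 300, 13
cepstra = rng.normal(size=(n_frames, n_ceps))

# Toy frame-level class labels from a forced alignment:
# 0 = stressed vowel, 1 = unstressed vowel, 2 = consonant.
frame_class = rng.integers(0, 3, size=n_frames)

feature_parts = []
for cls in (0, 1, 2):
    frames = cepstra[frame_class == cls]
    feature_parts += [frames.mean(axis=0), frames.std(axis=0)]
utterance_vector = np.concatenate(feature_parts)
print(utterance_vector.shape)          # (78,) = 3 classes x 2 stats x 13 coeffs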
Arousal and Valence Prediction in Spontaneous
Emotional Speech: Felt versus Perceived Emotion
Khiet P. Truong 1 , David A. van Leeuwen 2 , Mark A.
Neerincx 2 , Franciska M.G. de Jong 1 ; 1 University of
Twente, The Netherlands; 2 TNO Defence, The
Netherlands
Wed-Ses2-P2-7, Time: 11:20
In this paper, we describe emotion recognition experiments carried
out for spontaneous affective speech with the aim to compare the
added value of annotation of felt emotion versus annotation of perceived emotion. Using speech material available in the tno-gaming
corpus (a corpus containing audiovisual recordings of people playing videogames), speech-based affect recognizers were developed
that can predict Arousal and Valence scalar values. Two types of
recognizers were developed in parallel: one trained with felt emotion annotations (generated by the gamers themselves) and one
trained with perceived/observed emotion annotations (generated
by a group of observers). The experiments showed that, in speech,
with the methods and features currently used, observed emotions
are easier to predict than felt emotions. The results suggest that
recognition performance strongly depends on how and by whom
the emotion annotations are carried out.
Dimension Reduction Approaches for SVM Based
Speaker Age Estimation
Gil Dobry 1 , Ron M. Hecht 2 , Mireille Avigal 1 , Yaniv
Zigel 3 ; 1 Open University of Israel, Israel;
2 PuddingMedia, Israel; 3 Ben-Gurion University of the
Negev, Israel
Wed-Ses2-P2-8, Time: 13:30
This paper presents two novel dimension reduction approaches applied to Gaussian mixture model (GMM) supervectors to improve age estimation speed and accuracy. The GMM supervector embodies many speech characteristics irrelevant to age estimation and, like noise, these are harmful to the system’s generalization ability. In addition, support vector machine (SVM) testing computation grows with the vector’s dimension, especially when using complex kernels. The first approach presented is weighted-pairwise principal component analysis (WPPCA), which reduces the vector dimension by minimizing the redundant variability. The second approach is based on anchor models, using a novel anchor selection method. Experiments showed that dimension reduction makes the testing process 5 times faster and, using the WPPCA approach, also 5% more accurate.
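For illustration, the general recipe of reducing supervector dimensionality before the kernel machine can be sketched with plain PCA (the paper's weighted-pairwise variant and anchor-model approach are not reproduced; data and sizes are invented):

# GMM supervectors are projected to a small number of components before
# support vector regression, shrinking test-time cost.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVR

rng = np.random.default_rng(6)
n_speakers, supervector_dim = 300, 2048
X = rng.normal(size=(n_speakers, supervector_dim))    # stand-in supervectors
age = rng.uniform(18, 80, size=n_speakers)            # stand-in speaker ages

model = make_pipeline(PCA(n_components=50), SVR(kernel="rbf"))
model.fit(X[:250], age[:250])
print("mean absolute error:",
      np.mean(np.abs(model.predict(X[250:]) - age[250:])))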
ANN Based Decision Fusion for Speech Emotion
Recognition
Lu Xu 1 , Mingxing Xu 1 , Dali Yang 2 ; 1 Tsinghua
University, China; 2 Beijing Information Science &
Technology University, China
Wed-Ses2-P2-9, Time: 13:30
Speech emotion recognition is an active research field that has attracted increasing attention from both academia and industry. In this paper, we propose a method to recognize speech emotions using ANNs and to fuse two kinds of recognizers, based on different features, at the decision level. Each emotional utterance is first recognized by several individual recognizers. Then the outputs of these recognizers are fused using a voting strategy. Furthermore, the dimensionality of supervectors constructed from spectral features is reduced through PCA. Experimental results demonstrated
that the proposed decision fusion is effective and the dimensionality reduction is feasible.
Processing Affected Speech Within Human Machine
Interaction
Bogdan Vlasenko, Andreas Wendemuth;
Otto-von-Guericke-Universität Magdeburg, Germany
Wed-Ses2-P2-10, Time: 13:30
Spoken dialog systems (SDS) integrated into human-machine interaction interfaces are becoming a standard technology. Current state-of-the-art SDS are usually not able to provide the user with a natural way of communication. Existing automated dialog systems do not dedicate enough attention to problems in the interaction related to affected user behavior. As a result, Automatic Speech Recognition (ASR) engines are not able to recognize affected speech, and the dialog strategy does not make use of the user’s emotional state. This paper addresses some aspects of processing affected speech within natural human-machine interaction. First, we propose an ASR engine adapted to affected speech. Second, we describe our methods of emotion recognition within speech and present our results of emotion classification within the Interspeech 2009 Emotion Challenge. Third, we test the affect-adapted speech recognition models and introduce an approach to achieve emotion-adaptive dialog management in human-machine interaction.
Emotion Recognition from Speech Using Extended
Feature Selection and a Simple Classifier
Ali Hassan, Robert I. Damper; University of
Southampton, UK
Wed-Ses2-P2-11, Time: 13:30
We describe extensive experiments on the recognition of emotion
from speech using acoustic features only. Two databases of acted
emotional speech (Berlin and DES) have been used in this work. The
principal focus is on methods for selection of good features from
a relatively large set of hand-crafted features, perhaps formed by
fusing different feature sets used by different researchers. We show
that the monotonic assumption underlying popular sequential selection algorithms does not hold, and use this finding to improve
recognition accuracy. We show further that a very simple classifier (k-nearest neighbour) produces better results than any so far
reported by other researchers on these databases, suggesting that
previous work has failed to match the complexity of the classifier
used to the complexity of the data. Finally, several potentially fruitful avenues for future work are outlined.
Wed-Ses2-P3 : Speech Synthesis Methods
Hewison Hall, 13:30, Wednesday 9 Sept 2009
Chair: Nobuaki Minematsu, University of Tokyo, Japan
Deriving Vocal Tract Shapes from Electromagnetic
Articulograph Data via Geometric Adaptation and
Matching
Ziad Al Bawab 1 , Lorenzo Turicchia 2 , Richard M.
Stern 1 , Bhiksha Raj 1 ; 1 Carnegie Mellon University,
USA; 2 MIT, USA
Wed-Ses2-P3-2, Time: 13:30
In this paper, we present our efforts towards deriving vocal tract
shapes from ElectroMagnetic Articulograph data (EMA) via geometric adaptation and matching. We describe a novel approach for
adapting Maeda’s geometric model of the vocal tract to one speaker
in the MOCHA database. We show how we can rely solely on the EMA
data for adaptation. We present our search technique for the vocal
tract shapes that best fit the given EMA data. We then describe our
approach of synthesizing speech from these shapes. Results on
Mel-cepstral distortion reflect improvement in synthesis over the
approach we used before without adaptation.
Towards Unsupervised Articulatory Resynthesis of
German Utterances Using EMA Data
Ingmar Steiner 1 , Korin Richmond 2 ; 1 Universität des
Saarlandes, Germany; 2 University of Edinburgh, UK
Wed-Ses2-P3-3, Time: 13:30
As part of ongoing research towards integrating an articulatory synthesizer into a text-to-speech (TTS) framework, a corpus of German
utterances recorded with electromagnetic articulography (EMA) is
resynthesized to provide training data for statistical models. The
resynthesis is based on a measure of similarity between the original
and resynthesized EMA trajectories, weighted by articulatory relevance. Preliminary results are discussed and future work outlined.
Optimal Event Search Using a Structural Cost Function — Improvement of Structure to Speech Conversion
Daisuke Saito, Yu Qiao, Nobuaki Minematsu, Keikichi Hirose; University of Tokyo, Japan
Wed-Ses2-P3-1, Time: 13:30
This paper describes a new and improved method for the framework of structure to speech conversion we previously proposed. Most speech synthesizers take a phoneme sequence as input and generate speech by converting each of the phonemes into its corresponding sound. In other words, they simulate the human process of reading text aloud. However, infants usually acquire speech communication ability without text or phoneme sequences. Since their phonemic awareness is very immature, they can hardly decompose an utterance into a sequence of phones or phonemes. As developmental psychology claims, infants acquire the holistic sound patterns of words from the utterances of their parents, called the word Gestalt, and they reproduce them with their vocal tubes. This behavior is called vocal imitation. In our previous studies, the word Gestalt was defined physically and a method of extracting it from a word utterance was proposed. We have already applied the word Gestalt to ASR, CALL, and also speech generation, which we call structure to speech conversion. Unlike reading machines, our framework simulates infants’ vocal imitation. In this paper, a method for improving our speech generation framework based on a structural cost function is proposed and evaluated.
The KlattGrid Speech Synthesizer
David Weenink; University of Amsterdam, The Netherlands
Wed-Ses2-P3-4, Time: 13:30
We present a new speech synthesizer class, named KlattGrid, for the Praat program [3]. This synthesizer is based on the original description of Klatt [1, 2]. New aspects of a KlattGrid in comparison with other Klatt-type synthesizers are that a KlattGrid
• is not frame-based but time-based. You specify parameters as a function of time with any precision you like.
• has no limitations on the number of oral formants, nasal formants, nasal antiformants, tracheal formants or tracheal antiformants that can be defined.
• has separate formants for the frication part.
• allows varying the form of the glottal flow function as a function of time.
• allows for any number of formants and bandwidths to be modified during the open phase of the glottis.
• uses no beforehand quantization of amplitude parameters.
• is fully integrated into the freely available speech analysis program Praat [3].
Development of a Kenyan English Text to Speech
System: A Method of Developing a TTS for a
Previously Undefined English Dialect
Unit Selection Based Speech Synthesis for Poor
Channel Condition
Ling Cen, Minghui Dong, Paul Chan, Haizhou Li;
Institute for Infocomm Research, Singapore
Mucemi Gakuru; Teknobyte Ltd., Kenya
Wed-Ses2-P3-5, Time: 13:30
This work provides a method that can be used to build an English
TTS for a population who speak a dialect which is not defined and
for which no resources exist, by showing how a Text to Speech
System (TTS) was developed for the English dialect spoken in
Kenya. To begin with, the existence of a unique English dialect
which had not previously been defined was confirmed from the
need of the English-speaking Kenyan population to have a TTS in an accent different from the British accent. This dialect is referred to here as Kenyan English, the name under which it has also been branded. Given that building a TTS requires language features to be adequately defined, it was necessary to develop the essential features of the dialect, such as the phoneset and the lexicon, and then to verify their
correctness. The paper shows how it was possible to come up with
a systematic approach for defining these features through tracing
the evolution of the dialect. It also discusses how the TTS was built
and tested.
Feedback Loop for Prosody Prediction in
Concatenative Speech Synthesis
Wed-Ses2-P3-8, Time: 13:30
Synthesized speech can be severely degraded in noise, resulting in
compromised speech quality. In this paper, we propose a unit
selection based speech synthesis system for better speech quality
under poor channel conditions. First, the measurement of speech
intelligibility is incorporated in the cost function as a searching
criterion for unit selection. Next, the prosody of the selected units
is modified according to the Lombard effect. Prosody modification
includes increasing the amplitude of unvoiced phonemes and lengthening the speech duration. Finally, FIR equalization via
convex optimization is applied to reduce signal distortion due
to the channel effect. Listening tests in our experiments show
that the quality level of synthetic speech can be improved under
poor channel conditions with the help of our proposed synthesis
system.
Vocalic Sandwich, a Unit Designed for Unit Selection
TTS
Didier Cadic 1 , Cédric Boidin 1 , Christophe
d’Alessandro 2 ; 1 Orange Labs, France; 2 LIMSI, France
Javier Latorre, Sergio Gracia, Masami Akamine; Toshiba Corporate R&D Center, Japan; Universitat
Politècnica de Catalunya, Spain
Wed-Ses2-P3-9, Time: 13:30
Wed-Ses2-P3-6, Time: 13:30
We propose a method for concatenative speech synthesis that obtains a better match between the logF0 and duration
predicted by the prosody module and the waveform generation
back-end. The proposed method is based upon our previous
multilevel parametric F0 model and Toshiba’s plural unit selection
and fusion synthesizer. The method adds a feedback loop from
the back-end into the prosody module so that the prosodic information of the selected units is used to re-estimate new prosody
values. The feedback loop defines a frame-level prosody model
which consists of the average value and variance of the duration
and logF0 of the selected units. The log-likelihood defined by
this model is added to the log-likelihood of the prosody model.
From the maximization of this total log-likelihood, we obtain the
prosody values that produce the optimum compromise between
the distortion introduced by F0 discontinuities and the one created
by the prosody adjusting signal processing.
Assessing a Speaker for Fast Speech in Unit
Selection Speech Synthesis
Unit selection text-to-speech systems currently produce very natural synthetic sentences by concatenating speech segments from
a large database. Recently, increasing demand for designing high
quality voices with less data creates a need for further optimization
of the textual corpus recorded by the speaker. The optimization
process of this corpus is traditionally guided by the coverage rate
of well-known units: triphones, words…. Such units are however
not dedicated to concatenative speech synthesis; they are of
general use in speech technologies and linguistics. In this paper,
we describe a new unit which takes into account concatenative TTS’s own features: the “vocalic sandwich.” Both an objective and a
perceptual evaluation tend to show that vocalic sandwiches are
appropriate units for corpus design.
Speech Synthesis Based on the Plural Unit Selection
and Fusion Method Using FWF Model
Ryo Morinaka, Masatsune Tamura, Masahiro Morita,
Takehiko Kagoshima; Toshiba Corporate R&D Center,
Japan
Wed-Ses2-P3-10, Time: 13:30
Donata Moers, Petra Wagner; Rheinische Friedrich-Wilhelms-Universität Bonn, Germany; Universität Bielefeld, Germany
Wed-Ses2-P3-7, Time: 13:30
This paper describes work in progress concerning the adequate
modeling of fast speech in unit selection speech synthesis systems,
mostly having in mind blind and visually impaired users. Initially,
a survey of the main characteristics of fast speech will be given.
Subsequently, strategies for fast speech production will be discussed. Certain requirements concerning the ability of a speaker
For speech synthesizers, enhanced diversity and improved quality
of synthesized speech are required. Speaker interpolation and
voice conversion are the techniques that enhance diversity. The
PUSF (plural unit selection and fusion) method, which we have
proposed, generates synthesized waveforms using pitch-cycle
waveforms. However, it is difficult to modify its spectral features
while keeping naturalness of synthesized speech. In the present
work, we investigated how best to represent speech waveforms.
Firstly, we introduce a method that decomposes a pitch waveform
in a voiced portion into a periodic component, which is excited by the vocal sound source, and an aperiodic component, which is excited by the noise source. Moreover, we introduce the FWF (formant
waveform) model to represent the periodic component. Because
the FWF model represents the pitch waveform in accordance
with formant parameters, it can control the formant parameters
independently. We realized a method that can easily be applied
to the diversity-enhancing techniques in the PUSF-based method
because this model is based on vocal tract features.
Speech Synthesis Without a Phone Inventory
Development of the 2008 SRI Mandarin
Speech-to-Text System for Broadcast News and
Conversation
Matthew P. Aylett, Simon King, Junichi Yamagishi;
University of Edinburgh, UK
Wed-Ses2-P3-11, Time: 13:30
In speech synthesis the unit inventory is decided using phonological and phonetic expertise. This process is resource intensive and
potentially sub-optimal. In this paper we investigate how acoustic
clustering, together with lexicon constraints, can be used to build
a self-organised inventory. Six English speech synthesis systems
were built using two frameworks, unit selection and parametric
HTS for three inventory conditions: 1) a traditional phone set, 2) a
system using orthographic units, and 3) a self-organised inventory.
A listening test showed a strong preference for the classic system,
and for the orthographic system over the self-organised system.
Results also varied by letter to sound complexity and database
coverage. This suggests that the self-organised approach failed to generalise pronunciation, as well as introducing noise above and beyond that caused by orthographic sound mismatch.
Context-Dependent Additive log F0 Model for
HMM-Based Speech Synthesis
Xin Lei 1 , Wei Wu 2 , Wen Wang 1 , Arindam Mandal 1 ,
Andreas Stolcke 1 ; 1 SRI International, USA; 2 University
of Washington, USA
Wed-Ses2-P4-2, Time: 13:30
We describe the recent progress in SRI’s Mandarin speech-to-text
system developed for 2008 evaluation in the DARPA GALE program.
A data-driven lexicon expansion technique and language model
adaptation methods contribute to the improvement in recognition
performance. Our system yields 8.3% character error rate on
the GALE dev08 test set, and 7.5% after combining with RWTH
systems. Compared to our 2007 evaluation system, a significant
improvement of 13% relative has been achieved.
Multifactor Adaptation for Mandarin Broadcast
News and Conversation Speech Recognition
Wen Wang, Arindam Mandal, Xin Lei, Andreas Stolcke,
Jing Zheng; SRI International, USA
Heiga Zen, Norbert Braunschweiler; Toshiba Research
Europe Ltd., UK
Wed-Ses2-P4-3, Time: 13:30
Wed-Ses2-P3-12, Time: 13:30
This paper proposes a context-dependent additive acoustic modelling technique and its application to logarithmic fundamental
frequency (log F0 ) modelling for HMM-based speech synthesis. In
the proposed technique, mean vectors of state-output distributions are composed as the weighted sum of decision tree-clustered
context-dependent bias terms. Its model parameters and decision
trees are estimated and built based on the maximum likelihood
(ML) criterion. The proposed technique has the potential to capture
the additive structure of log F0 contours. A preliminary experiment
using a small database showed that the proposed technique yielded
encouraging results.
Wed-Ses2-P4 : LVCSR Systems and Spoken
Term Detection
We explore the integration of multiple factors such as genre and
speaker gender for acoustic model adaptation tasks to improve
Mandarin ASR system performance on broadcast news and broadcast conversation audio. We investigate the use of multifactor
clustering of acoustic model training data and the application of
MPE-MAP and fMPE-MAP acoustic model adaptations. We found
that by effectively combining these adaptation approaches, we
achieve 6% relative reduction in recognition error rate compared
to a Mandarin recognition system that does not use genre-specific
acoustic models, and 5% relative improvement if the genre-adaptive
system is combined with another, genre-independent state-of-the-art system.
Development of the GALE 2008 Mandarin LVCSR
System
C. Plahl, Björn Hoffmeister, Georg Heigold, Jonas Lööf,
Ralf Schlüter, Hermann Ney; RWTH Aachen University,
Germany
Hewison Hall, 13:30, Wednesday 9 Sept 2009
Chair: Simon King, University of Edinburgh, UK
Wed-Ses2-P4-4, Time: 13:30
Real-Time Live Broadcast News Subtitling System
for Spanish
Alfonso Ortega, Jose Enrique Garcia, Antonio Miguel,
Eduardo Lleida; Universidad de Zaragoza, Spain
Wed-Ses2-P4-1, Time: 13:30
Subtitling of live broadcast news is a very important application
to meet the needs of deaf and hard of hearing people. However,
live subtitling is a high cost operation in terms of qualification
human resources and therefore, money if high precision is desired.
Automatic Speech Recognition researchers can help to perform
this task saving both time and money developing systems that
deliver subtitles fully synchronized with speech without human assistance. In this paper we present a real-time system for automatic subtitling of live broadcast news in Spanish based on the News Redaction Computer texts and an Automatic Speech Recognition engine to provide precise temporal alignment of speech to text scripts with negligible latency. The presented system has been working satisfactorily on Aragonese Public Television since June 2008 without human assistance.
This paper describes the current improvements of the RWTH
Mandarin LVCSR system. We introduce vocal tract length normalization for the Gammatone features and present comparable
results for Gammatone based feature extraction and classical
feature extraction. In order to benefit from the huge amount of
data of 1600h available in the GALE project, we have trained the
acoustic models up to 8M Gaussians. We present detailed character
error rates for the different number of Gaussians.
Different kinds of systems are developed and a two-stage decoding framework is applied, which uses cross-adaptation and
a subsequent lattice-based system combination. In addition to
various acoustic front-ends, these systems use different kinds of
neural network toneme posterior features. We present detailed
recognition results of the development cycle and the different
acoustic front-ends of the systems. Finally, we compare the
ultimate evaluation system to our system from last year and report
a 10% relative improvement.
The RWTH Aachen University Open Source Speech
Recognition System
Improvements to the LIUM French ASR System
Based on CMU Sphinx: What Helps to Significantly
Reduce the Word Error Rate?
Paul Deléglise, Yannick Estève, Sylvain Meignier, Teva
Merlin; LIUM, France
Wed-Ses2-P4-8, Time: 13:30
David Rybach, Christian Gollan, Georg Heigold, Björn
Hoffmeister, Jonas Lööf, Ralf Schlüter, Hermann Ney;
RWTH Aachen University, Germany
Wed-Ses2-P4-5, Time: 13:30
We announce the public availability of the RWTH Aachen University
speech recognition toolkit. The toolkit includes state of the art
speech recognition technology for acoustic model training and
decoding. Speaker adaptation, speaker adaptive training, unsupervised training, a finite state automata library, and an efficient
tree search decoder are notable components. Comprehensive
documentation, example setups for training and recognition, and
a tutorial are provided to support newcomers.
This paper describes the new ASR system developed by the LIUM
and analyzes the various origins of the significant drop of the
word error rate observed in comparison to the previous LIUM
ASR system. This study was made on the test data of the latest
evaluation campaign of ASR systems on French broadcast news,
called ESTER 2 and organized in December 2008.
For the same computation time, the new system yields a word error rate about 38% lower than that of the previous system (which reached second position in the ESTER 1 evaluation campaign).
This paper evaluates the gain provided by various changes to the
system: implementation of new search and training algorithms,
new training data, vocabulary size, etc. The LIUM ASR system was
the best open-source ASR system of the ESTER 2 campaign.
Merging Search Spaces for Subword Spoken Term
Detection
Online Detecting End Times of Spoken Utterances
for Synchronization of Live Speech and its
Transcripts
Timo Mertens 1 , Daniel Schneider 2 , Joachim Köhler 2 ;
1 NTNU, Norway; 2 Fraunhofer IAIS, Germany
Jie Gao, Qingwei Zhao, Yonghong Yan; Chinese
Academy of Sciences, China
Wed-Ses2-P4-9, Time: 13:30
Wed-Ses2-P4-6, Time: 13:30
In this paper, we present our initial efforts in the task of Automatically Synchronizing live spoken Utterances with their Transcripts
(textual contents) (ASUT). We address the problem of detecting online the end time of a spoken utterance given its textual content, which is one of the key problems of the ASUT task. A frame-synchronous likelihood ratio test (FS-LRT) procedure is proposed and explored under the hidden Markov model (HMM) framework. The properties of FS-LRT are studied empirically. Experiments indicate that our proposed approach shows satisfactory performance. In
addition, the proposed procedure has been successfully applied in
a subtitling system for live broadcast news.
Real-Time ASR from Meetings
Philip N. Garner 1 , John Dines 1 , Thomas Hain 2 , Asmaa
El Hannani 2 , Martin Karafiát 3 , Danil Korchagin 1 ,
Mike Lincoln 4 , Vincent Wan 2 , Le Zhang 4 ; 1 IDIAP
Research Institute, Switzerland; 2 University of
Sheffield, UK; 3 Brno University of Technology, Czech
Republic; 4 University of Edinburgh, UK
Wed-Ses2-P4-7, Time: 13:30
The AMI(DA) system is a meeting room speech recognition system
that has been developed and evaluated in the context of the
NIST Rich Transcription (RT) evaluations. Recently, the “Distant Access”
requirements of the AMIDA project have necessitated that the
system operate in real-time. Another more difficult requirement is
that the system fit into a live meeting transcription scenario. We
describe an infrastructure that has allowed the AMI(DA) system to
evolve into one that fulfils these extra requirements. We emphasise
the components that address the live and real-time aspects.
We describe how complementary search spaces, addressed by two
different methods used in Spoken Term Detection (STD), can be
merged for German subword STD. We propose fuzzy-search techniques on lattices to narrow the gap between subword and word
retrieval. The first technique is based on an edit-distance, where no
a priori knowledge about confusions is employed. Additionally, we
propose a weighting method which explicitly models pronunciation
variation on a subword level and thus improves robustness against
false positives. Recall is improved by 6% absolute when retrieving
on the merged search space rather than using an exact lattice
search. By modeling subword pronunciation variation, we increase
recall in a high-precision setting by 2% absolute compared to the
edit-distance method.
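A minimal sketch of the edit-distance component, assuming plain subword sequences rather than full lattices and no confusion weights (the units and acceptance threshold below are invented):

# A query subword sequence is matched against a recognised sequence and
# accepted if the normalised Levenshtein distance is below a threshold.
def edit_distance(a, b):
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,             # deletion
                           cur[j - 1] + 1,          # insertion
                           prev[j - 1] + (x != y))) # substitution
        prev = cur
    return prev[-1]

query = ["b", "a", "w", "m"]                   # subword units of the search term
hypothesis = ["b", "aw", "m"]                  # units recognised in the audio
dist = edit_distance(query, hypothesis)
score = dist / max(len(query), len(hypothesis))
print(f"distance={dist}, normalised={score:.2f}, hit={score <= 0.5}")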
A Posterior Probability-Based System Hybridisation
and Combination for Spoken Term Detection
Javier Tejedor 1 , Dong Wang 2 , Simon King 2 , Joe
Frankel 2 , José Colás 1 ; 1 Universidad Autónoma de
Madrid, Spain; 2 University of Edinburgh, UK
Wed-Ses2-P4-10, Time: 13:30
Spoken term detection (STD) is a fundamental task for multimedia
information retrieval. To improve the detection performance,
we have presented a direct posterior-based confidence measure
generated from a neural network. In this paper, we propose a
detection-independent confidence estimation based on the direct
posterior confidence measure, in which the decision making is
totally separated from the term detection. Based on this idea, we
first present a hybrid system which conducts the term detection
and confidence estimation based on different sub-word units and
then propose a combination method which merges detections from
heterogeneous term detectors based on the direct posterior-based
confidence. Experimental results demonstrated that the proposed
methods improved system performance considerably for both
English and Spanish.
Stochastic Pronunciation Modelling for Spoken
Term Detection
Dong Wang, Simon King, Joe Frankel; University of
Edinburgh, UK
Wed-Ses2-P4-11, Time: 13:30
A major challenge faced by a spoken term detection (STD) system
is the detection of out-of-vocabulary (OOV) terms. Although a
subword-based STD system is able to detect OOV terms, performance reduction is always observed compared to in-vocabulary
terms. Current approaches to STD do not acknowledge the particular properties of OOV terms, such as pronunciation uncertainty.
In this paper, we use a stochastic pronunciation model to deal
with the uncertain pronunciations of OOV terms. By considering
all possible term pronunciations, predicted by a joint-multigram
model, we observe a significant performance improvement.
Wed-Ses2-S1 : Special Session: Active
Listening & Synchrony
Ainsworth (East Wing 4), 13:30, Wednesday 9 Sept 2009
Chair: Nick Campbell, Trinity College Dublin, Ireland
Term-Dependent Confidence for Out-of-Vocabulary
Term Detection
Understanding Speaker-Listener Interactions
Dong Wang, Simon King, Joe Frankel, Peter Bell;
University of Edinburgh, UK
Wed-Ses2-S1-1, Time: 13:30
Dirk Heylen; University of Twente, The Netherlands
Wed-Ses2-P4-12, Time: 13:30
Within a spoken term detection (STD) system, the decision maker
plays an important role in retrieving reliable detections. Most
of the state-of-the-art STD systems make decisions based on
a confidence measure that is term-independent, which poses
a serious problem for out-of-vocabulary (OOV) term detection.
In this paper, we study a term-dependent confidence measure
based on confidence normalisation and discriminative modelling,
particularly focusing on its remarkable effectiveness for detecting
OOV terms. Experimental results indicate that the term-dependent
confidence provides much more significant improvement for OOV
terms than for in-vocabulary terms.
A Comparison of Query-by-Example Methods for
Spoken Term Detection
Wade Shen, Christopher M. White, Timothy J. Hazen;
MIT, USA
Wed-Ses2-P4-13, Time: 13:30
In this paper we examine an alternative interface for phonetic
search, namely query-by-example, that avoids OOV issues associated with both standard word-based and phonetic search methods.
We develop three methods that compare query lattices derived
from example audio against a standard ngram-based phonetic
index and we analyze factors affecting the performance of these
systems. We show that the best systems under this paradigm are
able to achieve 77% precision when retrieving utterances from
conversational telephone speech and returning 10 results from a
single query (performance that is better than a similar dictionary-based approach), suggesting significant utility for applications
requiring high precision. We also show that these systems can
be further improved using relevance feedback: By incorporating
four additional queries the precision of the best system can be
improved by 13.7% relative. Our systems perform well despite
high phone recognition error rates (> 40%) and make use of no
pronunciation or letter-to-sound resources.
We provide an eclectic generic framework to understand the back
and forth interactions between participants in a conversation, highlighting the complexity of the actions that listeners are engaged
in. Communicative actions of one participant implicate the “other”
in many ways. In this paper, we try to enumerate some essential
relevant dimensions of this reciprocal dependence.
Detecting Changes in Speech Expressiveness in
Participants of a Radio Program
Plínio A. Barbosa; State University of Campinas, Brazil
Wed-Ses2-S1-2, Time: 13:50
A method for speech expressiveness change detection is presented
which combines a dimensional analysis of speech expression,
a Principal Component Analysis technique, as well as multiple
regression analysis. From the three inferred rates of activation,
valence, and involvement, two PCA-factors explain 97% of the
variance of the judges’ evaluations of a corpus of radio show interaction. The multiple regression analysis predicted the values of
the two listener-oriented, PCA-derived dimensions of promptness
and empathy from the acoustic parameters automatically obtained
from a set of 206 utterances produced by the radio show’s participants.
Analysed chronologically, the utterances reveal expression change
from automatic acoustic analysis.
An Audio-Visual Approach to Measuring Discourse
Synchrony in Multimodal Conversation Data
Nick Campbell; Trinity College Dublin, Ireland
Wed-Ses2-S1-3, Time: 14:10
This paper describes recent work on the automatic extraction of
visual and audio parameters relating to the detection of synchrony
in discourse, and to the modelling of active listening for advanced
speech technology. It reports findings based on image processing
that reliably identify the strong entrainment between members of
a group conversation, and describes techniques for the extraction
and analysis of such information.
Towards Flexible Representations for Analysis of
Accommodation of Temporal Features in
Spontaneous Dialogue Speech
Fast Keyword Detection Using Suffix Array
Kouichi Katsurada, Shigeki Teshima, Tsuneo Nitta;
Toyohashi University of Technology, Japan
Wed-Ses2-P4-14, Time: 13:30
In this paper, we propose a technique for detecting keywords quickly from a very large speech database without using a large memory space. To accelerate searches and save memory, we used a suffix array as the data structure and applied phoneme-based DP-matching. To avoid an exponential increase in the process time with the length of the keyword, a long keyword is divided into short sub-keywords. Moreover, an iterative lengthening search algorithm is used to rapidly output accurate search results. The experimental results show that it takes less than 100ms to detect the first set of search results from a 10,000-h virtual speech database.
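A minimal sketch of suffix-array lookup on a phoneme string (real systems index far larger data and add DP matching and sub-keyword handling on top; the transcript and keyword below are invented):

# The suffix array is built once; each keyword is then located by binary
# search over the sorted suffixes.
def build_suffix_array(text):
    return sorted(range(len(text)), key=lambda i: text[i:])

def find(text, sa, keyword):
    """Binary-search the suffix array for all exact occurrences of keyword."""
    lo, hi = 0, len(sa)
    while lo < hi:                                   # left boundary
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + len(keyword)] < keyword:
            lo = mid + 1
        else:
            hi = mid
    start = lo
    hi = len(sa)
    while lo < hi:                                   # right boundary
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + len(keyword)] <= keyword:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[start:lo])

phones = "kakikukekakiko"          # stand-in for a phoneme transcript
sa = build_suffix_array(phones)
print(find(phones, sa, "kaki"))    # -> [0, 8]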
Spyros Kousidis, David Dorran, Ciaran McDonnell,
Eugene Coyle; Dublin Institute of Technology, Ireland
Wed-Ses2-S1-4, Time: 14:30
Current advances in spoken interface design point to a shift towards more “human-like” interaction, as opposed to the
traditional “push-to-talk” approach. However, human dialogue is
characterized by synchrony and multi-modality, and these properties are not captured by traditional representation approaches,
such as turn succession. This paper proposes an alternative representation schema for recorded (human) dialogues, which employs
per frame averages of speaker turn distribution, in order to inform
further analyses of temporal features (pauses and overlaps) in
terms of inter-speaker accommodation. Preliminary results of such
analyses are provided.
Are We ‘in Sync’: Turn-Taking in Collaborative
Dialogues
Štefan Beňuš; Constantine the Philosopher University in
Nitra, Slovak Republic
Wed-Ses2-S1-5, Time: 14:50
We used a corpus of collaborative task oriented dialogues in
American English to compare two units of rhythmic structure —
pitch accents and syllables — within the coupled oscillator model
of rhythmical entrainment in turn-taking proposed in [1]. We found
that pitch accents are a slightly better fit than syllables as the unit
of rhythmical structure for the model, but we also observed weak
support for the model in general. Some turn-taking types were
rhythmically more salient than others.
An Audio-Visual Attention System for Online
Association Learning
Martin Heckmann, Holger Brandl, Xavier Domont,
Bram Bolder, Frank Joublin, Christian Goerick; Honda
Research Institute GmbH, Germany
Wed-Ses2-S1-6, Time: 15:10
We present an audio-visual attention system for speech based
interaction with a humanoid robot where a tutor can teach visual
properties/locations (e.g. “left”) and corresponding, arbitrary
speech labels. The acoustic signal is segmented via the attention
system and speech labels are learned from a few repetitions of
the label by the tutor. The attention system integrates bottom-up
stimulus driven saliency calculation (delay-and-sum beamforming,
adaptive noise level estimation) and top-down modulation (spectral
properties, segment length, movement and interaction status of
the robot). We evaluate the performance of different aspects of the
system based on a small dataset.
Large Margin Estimation of Gaussian Mixture Model
Parameters with Extended Baum-Welch for Spoken
Language Recognition
Donglai Zhu, Bin Ma, Haizhou Li; Institute for
Infocomm Research, Singapore
Wed-Ses3-O1-2, Time: 16:20
Discriminative training (DT) methods of acoustic models, such as SVM and MMI-trained GMM, have proved effective in spoken
language recognition. In this paper we propose a DT method for
GMM using the large margin (LM) estimation. Unlike traditional
MMI or MCE methods, the LM estimation attempts to enhance the
generalization ability of GMM to deal with new data that exhibits
mismatch with training data. We define the multi-class separation
margin as a function of GMM likelihoods, and derive update formulae of GMM parameters with the extended Baum-Welch algorithm.
Results on the NIST language recognition evaluation (LRE) 2007
task show that the LM estimation achieves better performance and
faster convergent speed than the MMI estimation.
Linguistically-Motivated Automatic Classification of
Regional French Varieties
Cécile Woehrling, Philippe Boula de Mareüil, Martine
Adda-Decker; LIMSI, France
Wed-Ses3-O1-3, Time: 16:40
The goal of this study is to automatically differentiate French
varieties (standard French and French varieties spoken in the
South of France, Alsace, Belgium and Switzerland) by applying a
linguistically-motivated approach. We took advantage of automatic phoneme alignment to measure vowel formants, consonant
(de)voicing, pronunciation variants as well as prosodic cues. These
features were then used to identify French varieties by applying
classification techniques. On large corpora of hundreds of speakers, over 80% correct identification scores were obtained. The
confusions between varieties and the features used (by decision
trees) are linguistically grounded.
Discriminative Acoustic Language Recognition via
Channel-Compensated GMM Statistics
Niko Brümmer 1 , Albert Strasheim 1 , Valiantsina
Hubeika 2 , Pavel Matějka 2 , Lukáš Burget 2 , Ondřej
Glembek 2 ; 1 AGNITIO, South Africa; 2 Brno University
of Technology, Czech Republic
Wed-Ses3-O1-4, Time: 17:00
Wed-Ses3-O1 : Language Recognition
Main Hall, 16:00, Wednesday 9 Sept 2009
Chair: Jan Černocký, Brno University of Technology, Czech
Republic
A Human Benchmark for Language Recognition
Rosemary Orr, David A. van Leeuwen; ICSI, USA
Wed-Ses3-O1-1, Time: 16:00
In this study, we explore a human benchmark in language recognition, for the purpose of comparing human performance to machine
performance in the context of the NIST LRE 2007. Humans are
categorised in terms of language proficiency, and performance
is presented per proficiency. The main challenge in this work
is the design of a test and application of a performance metric
which allows a meaningful comparison of humans and machines.
The main result of this work is that where subjects have lexical
knowledge of a language, even at a low level, they perform as well
as the state of the art in language recognition systems in 2007.
We propose a novel design for acoustic feature-based automatic
spoken language recognizers. Our design is inspired by recent
advances in text-independent speaker recognition, where intra-class variability is modeled by factor analysis in Gaussian mixture
model (GMM) space. We use approximations to GMM-likelihoods
which allow variable-length data sequences to be represented as
statistics of fixed size. Our experiments on NIST LRE’07 show that
variability-compensation of these statistics can reduce error-rates
by a factor of three. Finally, we show that further improvements
are possible with discriminative logistic regression training.
Language Score Calibration Using Adapted Gaussian
Back-End
Mohamed Faouzi BenZeghiba, Jean-Luc Gauvain, Lori
Lamel; LIMSI, France
Wed-Ses3-O1-5, Time: 17:20
Generative Gaussian back-end and discriminative logistic regression are the most widely used approaches for language score fusion
and calibration. Combination of these two approaches can significantly improve the performance. This paper proposes the
use of an adapted Gaussian back-end, where the mean of the
language-dependent Gaussian is adapted from the mean of a
language-specific background Gaussian via a maximum a posteriori estimation algorithm. Experiments are conducted using the
LRE-07 evaluation data. Compared to the conventional Gaussian
back-end approach for a closed set task, relative improvements in
the Cavg of 50%, 17% and 4.2% are obtained on the 30s, 10s and
3s conditions, respectively. Besides this, the estimated scores are
better calibrated. A combination with logistic regression results in
a system with the best calibrated scores.
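A minimal sketch of MAP adaptation of a language-dependent mean from a background mean, assuming a single relevance factor tau (the value of tau, the score dimensionality and the data are invented):

# The language-dependent mean is pulled from the background mean toward the
# sample mean of that language's scores, weighted by a relevance factor.
import numpy as np

def map_adapt_mean(scores, background_mean, tau=10.0):
    """scores: (n, d) score vectors for one language."""
    n = len(scores)
    return (scores.sum(axis=0) + tau * background_mean) / (n + tau)

rng = np.random.default_rng(7)
background_mean = np.zeros(3)                       # background Gaussian mean
lang_scores = rng.normal(loc=0.8, size=(25, 3))     # toy per-language scores
print(map_adapt_mean(lang_scores, background_mean, tau=10.0))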
A Framework for Discriminative SVM/GMM Systems
for Language Recognition
W.M. Campbell, Zahi N. Karam; MIT, USA
Wed-Ses3-O1-6, Time: 17:40
Language recognition with support vector machines and shifted-delta cepstral features has been an excellent performer in
NIST-sponsored language evaluation for many years. A novel
improvement of this method has been the introduction of hybrid
SVM/GMM systems. These systems use GMM supervectors as
an SVM expansion for classification. In prior work, methods for
scoring SVM/GMM systems have been introduced based upon
either standard SVM scoring or GMM scoring with a pushed model.
Although prior work showed experimentally that GMM scoring
yielded better results, no framework was available to explain
the connection between SVM scoring and GMM scoring. In this
paper, we show that there are interesting connections between
SVM scoring and GMM scoring. We provide a framework both
theoretically and experimentally that connects the two scoring
techniques. This connection should provide the basis for further
research in SVM discriminative training for GMM models.
Large-Scale Analysis of Formant Frequency
Estimation Variability in Conversational Telephone
Speech
Nancy F. Chen 1 , Wade Shen 1 , Joseph Campbell 1 , Reva
Schwartz 2 ; 1 MIT, USA; 2 United States Secret Service,
USA
Wed-Ses3-O2-2, Time: 16:20
We quantify how the telephone channel and regional dialect
influence formant estimates extracted from Wavesurfer [1, 2]
in spontaneous conversational speech from over 3,600 native
American English speakers. To the best of our knowledge, this is
the largest scale study on this topic. We found that F1 estimates
are higher in cellular channels than those in landline, while F2 in
general shows an opposite trend. We also characterized vowel
shift trends in northern states of the U.S.A. and compared them with
the Northern city chain shift (NCCS) [3]. Our analysis is useful in
forensic applications where it is important to distinguish between
speaker, dialect, and channel characteristics.
Developing an Automatic Functional Annotation
System for British English Intonation
Saandia Ali, Daniel Hirst; LPL, France
Wed-Ses3-O2-3, Time: 16:40
One of the fundamental aims of prosodic analysis is to provide a
reliable means of extracting functional information (what prosody
contributes to meaning) directly from prosodic form (i.e. what
prosody is — in this case intonation). This paper addresses the
development of an automatic functional annotation system for
British English. It is based on the study of a large corpus of British
English and a procedure of analysis by synthesis, enabling us to test and enrich different models of English intonation on the one hand, and to work towards an automatic version of the annotation process
on the other.
Functional Data Analysis as a Tool for Analyzing
Speech Dynamics — A Case Study on the French
Word c’était
Michele Gubian, Francisco Torreira, Helmer Strik, Lou
Boves; Radboud Universiteit Nijmegen, The
Netherlands
Wed-Ses3-O2-1, Time: 16:00
In this paper we introduce Functional Data Analysis (FDA) as a tool
for analyzing dynamic transitions in speech signals. FDA makes
it possible to perform statistical analyses of sets of mathematical
functions in the same way as classical multivariate analysis treats
scalar measurement data. We illustrate the use of FDA with a
reduction phenomenon affecting the French word c’était /setE/ ‘it
was’, which can be reduced to [stE] in conversational speech. FDA
reveals that the dynamics of the transition from [s] to [t] in fully
reduced cases may still be different from the dynamics of [s]-[t]
transitions in underlying /st/ clusters such as in the word stage.
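A rough sketch of the FDA workflow described above, smoothing each contour onto a common grid and applying PCA to the resulting curves, is given below; it is illustrative only and assumes linear time normalisation and a smoothing-spline representation rather than the authors' exact tooling:

```python
import numpy as np
from scipy.interpolate import UnivariateSpline

def functional_pca(contours, times, grid_size=50, n_components=2):
    """Smooth variable-length contours onto a common grid, then run PCA.

    contours: list of 1-D arrays (e.g. F0 or formant tracks)
    times:    list of matching, strictly increasing time axes
    """
    grid = np.linspace(0.0, 1.0, grid_size)
    curves = []
    for y, t in zip(contours, times):
        t_norm = (t - t[0]) / (t[-1] - t[0])            # linear time normalisation
        spline = UnivariateSpline(t_norm, y, s=len(y))  # smoothing spline fit
        curves.append(spline(grid))
    X = np.vstack(curves)
    X = X - X.mean(axis=0)
    # Principal components of the curve space (grid-based "functional" PCA)
    _, _, vt = np.linalg.svd(X, full_matrices=False)
    return X @ vt[:n_components].T, vt[:n_components], grid
```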
Intrinsic Vowel Duration and the Post-Vocalic
Voicing Effect: Some Evidence from Dialects of
North American English
Joshua Tauberer, Keelan Evanini; University of
Pennsylvania, USA
Wed-Ses3-O2-4, Time: 17:00
We report the results of a comprehensive dialectal survey of three
vowel duration phenomena in North American English: gross
duration differences between dialects, the effect of post-vocalic
consonant voicing, and intrinsic vowel duration. Duration data,
from HMM-based forced alignment of phones in the Atlas of North
American English corpus [1], showed that 1) the post-vocalic voicing effect appears in every dialect region and all but one dialect,
and 2) dialectal variation in first formant frequency appears to be
independent of intrinsic vowel duration. This second result adds
evidence that intrinsic vowel durations are targets stored in the
grammar and do not result from physiological constraints.
Investigating /l/ Variation in English Through
Forced Alignment
Jiahong Yuan, Mark Liberman; University of
Pennsylvania, USA
Wed-Ses3-O2-5, Time: 17:20
We present a new method for measuring the “darkness” of /l/, and
use it to investigate the variation of English /l/ in a large speech
corpus that is automatically aligned with phones predicted from an
orthographic transcript. We found a correlation between the rime
duration and /l/-darkness for syllable-final /l/, but no correlation
between /l/ duration and darkness for syllable-initial /l/. The data
showed a clear difference between clear and dark /l/ in English,
and also showed that syllable-final /l/ was less dark preceding an
unstressed vowel than preceding a consonant or a word boundary.
Structural Analysis of Dialects, Sub-Dialects and
Sub-Sub-Dialects of Chinese
Xuebin Ma 1 , Akira Nemoto 2 , Nobuaki Minematsu 1 , Yu
Qiao 1 , Keikichi Hirose 1 ; 1 University of Tokyo, Japan;
2 Nankai University, China
Wed-Ses3-O2-6, Time: 17:40
In China, there are hundreds of dialects. By traditional
dialectology, they are classified into seven big dialect regions and
most of them also have many sub-dialects and sub-sub-dialects.
As they are different in various linguistic aspects, people from
different dialect regions often cannot communicate orally. But for
the sub-dialects of one dialect region, although they are sometimes
still mutually unintelligible, more common features are shared.
In this paper, a dialect pronunciation structure, which has been
used successfully in dialect-based speaker classification in our
previous work [1], is examined for the task of speaker classification
and distance measurement among cities based on sub-dialects of
Mandarin. Using the finals of the dialectal utterances of a specific
list of written characters, a dialect pronunciation structure is built
for every speaker in a data set and these speakers are classified
based on the distances among their structures. Then, the results
of classifying 16 Mandarin speakers based on their sub-dialects
show that they are linguistically classified with little influence of
their age and gender. Finally, distances among sub-sub-dialects
are similarly calculated and evaluated. All the results show high
validity and accord well with linguistic studies.
Wed-Ses3-O3 : Speech Activity Detection
Fallside (East Wing 2), 16:00, Wednesday 9 Sept 2009
Chair: Isabel Trancoso, INESC-ID Lisboa/IST, Portugal
High-Accuracy, Low-Complexity Voice Activity
Detection Based on a posteriori SNR Weighted
Energy
Zheng-Hua Tan, Børge Lindberg; Aalborg University,
Denmark
Wed-Ses3-O3-3, Time: 16:40
This paper presents a voice activity detection (VAD) method
using the measurement of a posteriori signal-to-noise ratio (SNR)
weighted energy. The motivations are manifold: 1) frame-to-frame
energy differences discriminate speech well, 2) speech segments are
weighted not only by their characteristics but also by their
reliability, e.g. as measured by SNR, 3) the a posteriori SNR of
noise-only segments is theoretically 0 dB, which is ideal for VAD,
and 4) both energy and a posteriori SNR
are easy to estimate, resulting in a low complexity. The method
is experimentally shown to be superior to a number of referenced
methods and standards.
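A toy rendering of the central quantity, frame energy weighted by a posteriori SNR, is sketched below; the noise-tracking rule (leading frames assumed noise-only) and the threshold are assumptions, not the authors' algorithm:

```python
import numpy as np

def snr_weighted_energy_vad(frames, noise_frames=10, threshold=3.0):
    """Toy VAD decision per frame from a posteriori SNR weighted energy.

    frames: (N, L) array of windowed signal frames
    """
    energy = np.sum(frames ** 2, axis=1)
    # Assumption: the first few frames contain noise only.
    noise_energy = np.mean(energy[:noise_frames])
    post_snr_db = 10.0 * np.log10(np.maximum(energy / noise_energy, 1e-10))
    # Noise-only frames have a posteriori SNR near 0 dB, so their weighted
    # energy stays near zero, while speech frames are boosted.
    weighted = (energy / noise_energy) * np.maximum(post_snr_db, 0.0)
    return weighted > threshold
```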
Fusing Fast Algorithms to Achieve Efficient Speech
Detection in FM Broadcasts
Stéphane Pigeon, Patrick Verlinde; Royal Military
Academy, Belgium
Wed-Ses3-O3-4, Time: 17:00
This paper describes a system aimed at detecting speech segments
in FM broadcasts. To achieve high processing speeds, simple but
fast algorithms are used. To output robust decisions, a combination of many different algorithms has been considered. The system
has been fully operational in the context of Open Source Intelligence
since 2007.
Voice Activity Detection Using Singular Value
Decomposition-Based Filter
Hwa Jeon Song, Sung Min Ban, Hyung Soon Kim; Pusan
National University, Korea
Wed-Ses3-O3-1, Time: 16:00
This paper proposes a novel voice activity detector (VAD) based
on singular value decomposition (SVD). The spectro-temporal
characteristics of background noise region can be easily analyzed
by SVD. The proposed method naturally removes the need for a
hangover algorithm in VAD. Moreover, it adaptively changes the decision threshold
by employing the most dominant singular value of the observation
matrix in the noise region. According to simulation results, the
proposed VAD shows significantly better performance than the
conventional statistical model-based method and is less sensitive
to the environmental changes. In addition, the proposed algorithm
requires very low computational cost compared with other algorithms.
Voice Activity Detection Using Partially Observable
Markov Decision Process
Chiyoun Park, Namhoon Kim, Jeongmi Cho; Samsung
Electronics Co. Ltd., Korea
Wed-Ses3-O3-2, Time: 16:20
Partially observable Markov decision process (POMDP) has been
generally used to model agent decision processes such as dialogue
management. In this paper, the possibility of applying POMDP to a
voice activity detector (VAD) has been explored. The proposed
system first formulates hypotheses about the current noise environment and speech activity. Then, it decides and observes the
features that are expected to be the most salient in the estimated
situation. VAD decision is made based on the accumulated information. A comparative evaluation is presented to show that
the proposed method outperforms other model-based algorithms
regardless of noise types or signal-to-noise ratio.
Robust Speech Recognition Using
VAD-Measure-Embedded Decoder
Tasuku Oonishi 1 , Paul R. Dixon 1 , Koji Iwano 2 , Sadaoki
Furui 1 ; 1 Tokyo Institute of Technology, Japan; 2 Tokyo
City University, Japan
Wed-Ses3-O3-5, Time: 17:20
In a speech recognition system a Voice Activity Detector (VAD) is
a crucial component for not only maintaining accuracy but also
for reducing computational consumption. Front-end approaches
which drop non-speech frames typically attempt to detect speech
frames by utilizing speech/non-speech classification information such as the zero crossing rate or statistical models. These
approaches discard the speech/non-speech classification information after voice detection. This paper proposes an approach
that uses the speech/non-speech information to adjust the score
of the recognition hypotheses. Experimental results show that
our approach can improve the accuracy significantly and reduce
computational consumption when combined with the front-end method.
Investigating Privacy-Sensitive Features for Speech
Detection in Multiparty Conversations
Sree Hari Krishnan Parthasarathi, Mathew
Magimai-Doss, Hervé Bourlard, Daniel Gatica-Perez;
IDIAP Research Institute, Switzerland
Wed-Ses3-O3-6, Time: 17:40
We investigate four different privacy-sensitive features, namely
energy, zero crossing rate, spectral flatness, and kurtosis, for
speech detection in multiparty conversations. We liken this scenario to a meeting room and define our datasets and annotations
accordingly. The temporal context of these features is modeled.
With no temporal context, energy is the best performing single
feature. But by modeling temporal context, kurtosis emerges as
the most effective feature. Also, we combine the features. Besides
yielding a gain in performance, certain combinations of features
also reveal that a shorter temporal context is sufficient. We then
benchmark other privacy-sensitive features utilized in previous
studies. Our experiments show that the performance of all the
privacy-sensitive features modeled with context is close to that
of state-of-the-art spectral-based features, without extracting and
using any features that can be used to reconstruct the speech
signal.
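The four privacy-sensitive features are all cheap to compute from a single frame; the sketch below is one plausible per-frame implementation (frame handling and normalisation are assumptions):

```python
import numpy as np
from scipy.stats import kurtosis

def privacy_sensitive_features(frame):
    """Per-frame energy, zero-crossing rate, spectral flatness and kurtosis.

    frame: 1-D array of samples for one analysis window
    """
    energy = float(np.sum(frame ** 2))
    zcr = float(np.mean(np.abs(np.diff(np.sign(frame)))) / 2.0)
    power = np.abs(np.fft.rfft(frame)) ** 2 + 1e-12
    flatness = float(np.exp(np.mean(np.log(power))) / np.mean(power))
    kurt = float(kurtosis(frame))
    return np.array([energy, zcr, flatness, kurt])
```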
Wed-Ses3-O4 : Multimodal Speech (e.g.
Audiovisual Speech, Gesture)
Holmes (East Wing 3), 16:00, Wednesday 9 Sept 2009
Chair: Ji Ming, Queen’s University Belfast, UK
Robust Audio-Visual Speech Synchrony Detection
by Generalized Bimodal Linear Prediction
Kshitiz Kumar 1 , Jiri Navratil 2 , Etienne Marcheret 2 , Vit
Libal 2 , Gerasimos Potamianos 3 ; 1 Carnegie Mellon
University, USA; 2 IBM T.J. Watson Research Center,
USA; 3 NCSR “Demokritos”, Greece
Wed-Ses3-O4-2, Time: 16:20
We study the problem of detecting audio-visual synchrony in video
segments containing a speaker in frontal head pose. The problem
holds a number of important applications, for example speech
source localization, speech activity detection, speaker diarization,
speech source separation, and biometric spoofing detection. In
particular, we build on earlier work, extending our previously
proposed time-evolution model of audio-visual features to include
non-causal (future) feature information. This significantly improves robustness of the method to small time-alignment errors
between the audio and visual streams, as demonstrated by our
experiments. In addition, we compare the proposed model to two
known literature approaches for audio-visual synchrony detection,
namely mutual information and hypothesis testing, and we show
that our method is superior to both.
Acoustic-to-Articulatory Inversion Using Speech
Recognition and Trajectory Formation Based on
Phoneme Hidden Markov Models
Atef Ben Youssef, Pierre Badin, Gérard Bailly, Panikos
Heracleous; GIPSA, France
Wed-Ses3-O4-3, Time: 16:40
In order to recover the movements of usually hidden articulators
such as tongue or velum, we have developed a data-based speech
inversion method. HMMs are trained, in a multistream framework,
from two synchronous streams: articulatory movements measured
by EMA, and MFCC + energy from the speech signal. A speech
recognition procedure based on the acoustic part of the HMMs
delivers the chain of phonemes and together with their durations,
information that is subsequently used by a trajectory formation
procedure based on the articulatory part of the HMMs to synthesise
the articulatory movements. The RMS reconstruction error ranged
between 1.1 and 2 mm.
Evaluation of External and Internal Articulator
Dynamics for Pronunciation Learning
Lan Wang, Hui Chen, JianJun Ouyang; Chinese
Academy of Sciences, China
Wed-Ses3-O4-1, Time: 16:00
In this paper we present a data-driven 3D talking head system using
facial video and a X-ray film database for speech research. In order
to construct a database recording the three dimensional positions
of articulators at phoneme-level, the feature points of articulators
were defined and labeled in facial and X-ray images for each English
phoneme. Dynamic displacement based deformations were used in
three modes to simulate the motions of both external and internal
articulators. For continuous speech, the articulatory movements of
each phoneme within an utterance were concatenated. A blending
function was also employed to smooth the concatenation. In
audio-visual test, a set of minimal pairs were used as the stimuli to
assess the realistic degree of articulatory motions of the 3D talking
head. In the experiments where the subjects are native speakers
and professional English teachers, a word identification accuracy
of 91.1% among 156 tests was obtained.
Speaker Discriminability for Visual Speech Modes
Jeesun Kim 1 , Chris Davis 1 , Christian Kroos 1 , Harold
Hill 2 ; 1 University of Western Sydney, Australia;
2 University of Wollongong, Australia
Wed-Ses3-O4-4, Time: 17:00
Does speech mode affect recognizing people from their visual
speech? We examined 3D motion data from 4 talkers saying 10
sentences (twice). Speech was in noise, in quiet or whispered.
Principal Component Analyses (PCAs) were conducted and speaker
classification was determined by Linear Discriminant Analysis
(LDA). The first five PCs for the rigid motion and the first 10 PCs
each for the non-rigid motion and the combined motion were input
to a series of LDAs for all possible combinations of PCs that could
be constructed using the retained PCs. The discriminant functions
and classification coefficients were determined on the training data
to predict the talker of the test data. Classification performance
for both the in-noise and whispered speech modes were superior
to the in-quiet one. Superiority of classification was found even
if only the first PC (jaw motion) was used, i.e., measures of jaw
motion when speaking in noise or whispering hold promise for
bimodal person recognition or verification.
Audio-Visual Prosody of Social Attitudes in
Vietnamese: Building and Evaluating a Tones
Balanced Corpus
Dang-Khoa Mac 1 , Véronique Aubergé 1 , Albert
Rilliard 2 , Eric Castelli 3 ; 1 LIG, France; 2 LIMSI, France;
3 MICA, Vietnam
Wed-Ses3-O4-5, Time: 17:20
This paper presents the building and a first evaluation of a tones
balanced Audio-Visual corpus of social affect in Vietnamese
language. This under-resourced tonal language has specific glottalization and co-articulation phenomena, for which interactions with
attitudes prosody are a very interesting issue. A well-controlled
recording methodology was designed to build a large representative audio-visual corpus for 16 attitudes, and one speaker. A
perception experiment was carried out to evaluate a speaker’s
perceived performances and to study the role and integration of
the audio, visual, and audio-visual information in the listener’s
perception of the speaker’s attitudes. The results reveal characteristics of Vietnamese prosodic attitudes and allow us to investigate
such social affect in Vietnamese language.
Direct, Modular and Hybrid Audio to Visual Speech
Conversion Methods — A Comparative Study
Gyorgy Takacs; Peter Pazmany University, Hungary
Wed-Ses3-O4-6, Time: 17:40
A systematic comparative study of audio to visual speech conversion methods is described in this paper. A direct conversion
system is compared to conceptually different ASR based solutions.
Hybrid versions of the different solutions will also be presented.
The methods are tested using the same speech material, audio
preprocessing and facial motion visualization units. Only the conversion blocks are changed. Subjective opinion score evaluation
tests show that the naturalness of the direct conversion is the best.
Wed-Ses3-P1 : Phonetics
Hewison Hall, 16:00, Wednesday 9 Sept 2009
Chair: Helmer Strik, Radboud Universiteit Nijmegen, The
Netherlands
Rarefaction Gestures and Coarticulation in Mangetti
Dune !Xung Clicks
Amanda Miller 1 , Abigail Scott 1 , Bonny Sands 2 , Sheena
Shah 3 ; 1 University of British Columbia, Canada;
2 Northern Arizona University, USA; 3 Georgetown
University, USA
Wed-Ses3-P1-3, Time: 16:00
We provide high-speed ultrasound data on the four Mangetti Dune
!Xung clicks. The posterior constriction is uvular for all four clicks
— front uvular for [g |] and [}] and back uvular for [g !] and [g {].
[g !] and [g {] both involve tongue center lowering and tongue root
retraction as part of the rarefaction gestures. The rarefaction
gestures in [g |] and [}] involve tongue center lowering. Lingual
cavity volume is largest for [g !], followed by [g {], [}] and [g |]. A
tongue tip recoil effect is found following [g !], but the effect is
smaller than that seen in IsiXhosa in earlier studies.
The Acoustics of Mangetti Dune !Xung Clicks
Amanda Miller 1 , Sheena Shah 2 ; 1 University of British
Columbia, Canada; 2 Georgetown University, USA
Wed-Ses3-P1-4, Time: 16:00
We document the acoustics of the four Mangetti Dune !Xung
coronal clicks. We report the temporal measures of burst duration,
relative burst amplitude and rise time, as well as the spectral value
of center of gravity in the click bursts. COG correlates with lingual
cavity volume. We show that there is inter-speaker variation in the
acoustics of the palatal click, which we expect to correlate with a
difference in the anterior constriction release dynamics. We show
that burst duration, amplitude and rise time are correlated, similar
to the correlation found between rise time and frication duration
in affricates.
How Similar Are Clusters Resulting from schwa
Deletion in French to Identical Underlying Clusters?
Audrey Bürki 1 , Cécile Fougeron 2 , Christophe Veaux 3 ,
Ulrich H. Frauenfelder 1 ; 1 Université de Genève,
Switzerland; 2 LPP, France; 3 IRCAM, France
Wed-Ses3-P1-1, Time: 16:00
Clusters resulting from the deletion of schwa in French are
compared with identical underlying clusters in words and pseudowords. Both manual and automatic acoustical comparisons
suggest that clusters resulting from schwa deletion in French are
highly similar to identical underlying clusters. Furthermore, cluster
duration is not longer for clusters resulting from schwa deletion
than for identical underlying clusters. Clusters in pseudowords
show a different acoustical and durational pattern from the two
other clusters in words.
Acoustic Characteristics of Ejectives in Amharic
Hussien Seid, S. Rajendran, B. Yegnanarayana; IIIT
Hyderabad, India
Wed-Ses3-P1-5, Time: 16:00
In this paper, a preliminary investigation of the acoustic characteristics of Amharic ejectives in comparison with their unvoiced
conjugates is presented. The normalized error from linear prediction residual and a zero frequency resonator output are used to
locate the instant of release of the oral closure and the instant of
the start of voicing, respectively. Amharic ejectives are found to
have longer closure duration and smaller VOT than their unvoiced
conjugates. Cross-linguistic comparisons reveal that the ejectives
of no two languages behave acoustically in a similar manner, despite
similarity in their articulation.
Word-Final [t]-Deletion: An Analysis on the
Segmental and Sub-Segmental Level
Barbara Schuppler 1 , Wim van Dommelen 2 , Jacques
Koreman 2 , Mirjam Ernestus 1 ; 1 Radboud Universiteit
Nijmegen, The Netherlands; 2 NTNU, Norway
Wed-Ses3-P1-2, Time: 16:00
This paper presents a study on the reduction of word-final [t]s
in conversational standard Dutch. Based on a large amount of
tokens annotated on the segmental level, we show that the bigram
frequency and the segmental context are the main predictors for
the absence of [t]s. In a second study, we present an analysis of the
detailed acoustic properties of word-final [t]s and we show that bigram frequency and context also play a role on the sub-segmental
level. This paper extends research on the realization of /t/ in
spontaneous speech and shows the importance of incorporating
sub-segmental properties in models of speech.
Sentence-Final Particles in Hong Kong Cantonese:
Are They Tonal or Intonational?
Wing Li Wu; University College London, UK
Wed-Ses3-P1-6, Time: 16:00
Cantonese is rich in sentence-final particles (SFPs), morphemes
serving to show various linguistic or attitudinal meanings. The
acoustic manifestations of these SFPs are not yet clear. This paper
presents detailed analyses of the fundamental frequency tracings,
final F0 , final velocity and duration of ten SFPs in Hong Kong
Cantonese. The results show that most of these SFPs are very
similar to the lexical tones in terms of the F0 measurements, but
the durations are significantly different in half the cases. The
notable differences may give some insight into the nature of this
special class of words.
Same Tone, Different Category: Linguistic-Tonetic
Variation in the Areal Tone Acoustics of Chuqu Wu
William Steed, Phil Rose; Australian National
University, Australia
Wed-Ses3-P1-7, Time: 16:00
Acoustic and auditory data are presented for the citation tones
of single speakers from nine sites (eight hitherto undescribed in
English) from the little-studied Chuqu subgroup of Wu in East
Central China: Lìshuı̆, Lóngquán, Qìngyuán, Lóngyóu, Jìnyún,
Qı̄ngtián, Yúnhé, Jı̆ngníng, and Táishùn. The data demonstrate
a high degree of complexity, having no fewer than 22 linguistic-tonetically different tones. The nature of the complexity of these
forms is discussed, especially with respect to whether the variation
is continuous or categorical, and inferences are drawn on their
historical development.
Exploring Vocalization of /l/ in English: An EPG and
EMA Study
Mitsuhiro Nakamura; Nihon University, Japan
Wed-Ses3-P1-11, Time: 16:00
This study explores the spatiotemporal characteristics of lingual
gestures for the clear, dark, and vocalized allophones of /l/ in
English by examining the EPG and EMA data from the multichannel
articulatory (MOCHA) database. The results show the evidence that
the spatiotemporal controls of the tip lowering and the dorsum
backing gestures are organized systematically for the three variants. An exploratory description of the articulatory correlates for
the /l/ gestures is made.
Why Would Aspiration Lower the Pitch of the
Following Vowel? Observations from
Leng-Shui-Jiang Chinese
Caicai Zhang; Hong Kong University of Science &
Technology, China
Wed-Ses3-P1-8, Time: 16:00
This paper is a preliminary report of the aspiration-conditioned
tonal split in Leng-shui-jiang (LSJ hereafter) Chinese. So far no consensus has been reached concerning the intrinsic perturbation of
aspiration on the F0 of the following vowel. Conflicting data come
from both the same language and different languages. In order to
shed light on this issue, F0 and Closing quotient (Qx hereafter) are
calculated in syllables after aspirated and unaspirated obstruents
from six speakers (three male, three female) in LSJ dialect. The
results show that F0 is significantly lower after the aspirated
obstruents in two out of the three tone groups. The relatively
lower Qx found in the syllables with aspirated initials is a possible
explanation for the lower pitch.
The Monophthongs and Diphthongs of
North-Eastern Welsh: An Acoustic Study
Robert Mayr, Hannah Davies; University of Wales
Institute Cardiff, UK
Wed-Ses3-P1-12, Time: 16:00
Descriptive accounts of Welsh vowels indicate systematic differences between Northern and Southern varieties. Few studies have,
however, attempted to verify these claims instrumentally, and little
is known about regional variation in Welsh vowel systems. The
present study aims to provide a first preliminary analysis of the
acoustic properties of Welsh monophthongs and diphthongs, as
produced by a male speaker from North-eastern Wales. The results
indicate distinctive production of all the monophthong categories
of Northern Welsh. Interesting patterns of spectral change were
found for the diphthongs. Implications for theories of contrastivity
in vowel systems are discussed.
Dialectal Characteristics of Osaka and Tokyo
Japanese: Analyses of Phonologically Identical
Words
Kanae Amino, Takayuki Arai; Sophia University, Japan
Wed-Ses3-P1-9, Time: 16:00
This study investigates the characteristics of the two major dialects
of Japanese: Osaka and Tokyo dialects. We recorded the utterances
of the speakers of both dialects, and analysed the differences that
appear in the accentuation of the words at the phonetic-acoustic
level. The Japanese words that are phonologically identical in both
dialects were used as the analysis target. The results showed that
the pitch patterns contained the dialect-dependent features of
Osaka Japanese. Furthermore, these patterns could not be fully
mimicked by speakers of Tokyo Japanese. These results show that
there is a phonetics-phonology gap in the dialectal differences, and
that we may exploit this gap for forensic purposes.
Voicing Profile of Polish Sonorants: [r] in Obstruent
Clusters
J. Sieczkowska, Bernd Möbius, Antje Schweitzer,
Michael Walsh, Grzegorz Dogil; Universität Stuttgart,
Germany
Wed-Ses3-P1-13, Time: 16:00
This study aims at defining and analyzing voicing profile of Polish
sonorant [r] showing the variability of its realizations depending
on segmental and prosodic position. Voicing profile is defined as
the frame-by-frame voicing status of a speech sound in continuous
speech. Word-final devoicing of sonorants is shortly reviewed and
analyzed in terms of the conducted corpus-based investigation. We
used automatic tools to extract consonants’ features, F0 values and
obtain voicing profiles. The results show that the liquid [r] devoices word-
and syllable-finally, particularly in a left voiceless stop context.
Categories and Gradience in Intonation: Evidence
from Linguistics and Neurobiology
Brechtje Post, Francis Nolan, Emmanuel Stamatakis,
Toby Hudson; University of Cambridge, UK
Wed-Ses3-P1-10, Time: 16:00
Multiple cues interact to signal multiple functions in intonation
simultaneously, which makes intonation notoriously complex to
analyze. The Autosegmental-Metrical model for intonation analysis
has proved to be an excellent vehicle for separating the components,
but evidence for the phonetics/phonology dichotomy on
which it hinges has proved elusive. Advocating a multidisciplinary
approach, this paper outlines a new research project which combines traditional behavioural experiments with neuro-linguistic
data to advance our understanding of the linguistic representation
and neural correlates of intonation.
Wed-Ses3-P2 : Speaker Verification &
Identification III
Hewison Hall, 16:00, Wednesday 9 Sept 2009
Chair: A. Ariyaeeinia, University of Hertfordshire, UK
Mel, Linear, and Antimel Frequency Cepstral
Coefficients in Broad Phonetic Regions for
Telephone Speaker Recognition
Howard Lei, Eduardo Lopez; ICSI, USA
Wed-Ses3-P2-1, Time: 16:00
We’ve examined the speaker discriminative power of mel-, antimel- and linear-frequency cepstral coefficients (MFCCs, a-MFCCs and
LFCCs) in the nasal, vowel, and non-nasal consonant speech regions. Our inspiration came from the work of Lu and Dang in 2007,
who showed that filterbank energies at some frequencies mainly
outside the telephone bandwidth possess more speaker discriminative power due to physiological characteristics of speakers, and
derived a set of cepstral coefficients that outperformed MFCCs in
non-telephone speech. Using telephone speech, we’ve discovered
that LFCCs gave 21.5% and 15.0% relative EER improvements over
MFCCs in nasal and non-nasal consonant regions, agreeing with our
filterbank energy f-ratio analysis. We’ve also found that using only
the vowel region with MFCCs gives a 9.1% relative improvement
over using all speech. Last, we’ve shown that a-MFCCs are valuable
in combination, contributing to a system with 17.3% relative
improvement over our baseline.
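The three front-ends differ only in the filterbank frequency warping. The sketch below illustrates mel and linear centre-frequency placement; the "antimel" variant shown, obtained by mirroring the mel spacing across the band, is an assumption rather than the paper's exact definition:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)

def filterbank_centres(n_filters, f_low, f_high, warp="mel"):
    """Centre frequencies for mel, linear or (assumed) antimel filterbanks."""
    if warp == "linear":
        return np.linspace(f_low, f_high, n_filters)
    centres = mel_to_hz(np.linspace(hz_to_mel(f_low), hz_to_mel(f_high), n_filters))
    if warp == "antimel":
        # Assumption: mirror the mel spacing so resolution is finest at
        # high frequencies instead of low ones.
        centres = f_low + f_high - centres[::-1]
    return centres
```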
BUT System for NIST 2008 Speaker Recognition
Evaluation
Lukáš Burget, Michal Fapšo, Valiantsina Hubeika,
Ondřej Glembek, Martin Karafiát, Marcel Kockmann,
Pavel Matějka, Petr Schwarz, Jan Černocký; Brno
University of Technology, Czech Republic
Wed-Ses3-P2-4, Time: 16:00
This paper presents the BUT system submitted to NIST 2008 SRE.
It includes two subsystems based on Joint Factor Analysis (JFA)
GMM/UBM and one based on SVM-GMM. The systems were developed on NIST SRE 2006 data, and the results are presented on NIST
SRE 2008 evaluation data. We concentrate on the influence of side
information in the calibration.
Fast GMM Computation for Speaker Verification
Using Scalar Quantization and Discrete Densities
Guoli Ye 1 , Brian Mak 1 , Man-Wai Mak 2 ; 1 Hong Kong
University of Science & Technology, China; 2 Hong
Kong Polytechnic University, China
Wed-Ses3-P2-2, Time: 16:00
Most of current state-of-the-art speaker verification (SV) systems
use Gaussian mixture model (GMM) to represent the universal
background model (UBM) and the speaker models (SM). For an
SV system that employs log-likelihood ratio between SM and
UBM to make the decision, its computational efficiency is largely
determined by the GMM computation. This paper attempts to
speedup GMM computation by converting a continuous-density
GMM to a single or a mixture of discrete densities using scalar
quantization. We investigated a spectrum of such discrete models:
from high-density discrete models to discrete mixture models, and
their combination called high-density discrete-mixture models. For
the NIST 2002 SV task, we obtained an overall speedup by a factor
of 2–100 with little loss in EER performance.
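The speed-up idea, replacing per-frame Gaussian evaluations with table look-ups after scalar quantisation of each feature dimension, can be sketched as follows (bin layout and a single look-up table per mixture component are assumptions, not the paper's high-density discrete-mixture models):

```python
import numpy as np

def build_tables(means, variances, edges):
    """Per-dimension log-likelihood tables evaluated at quantiser bin centres.

    means, variances: (C, D) diagonal GMM parameters
    edges:            (D, B+1) scalar-quantiser bin edges per dimension
    Returns a (C, D, B) table of log N(centre; mean, var).
    """
    centres = 0.5 * (edges[:, :-1] + edges[:, 1:])            # (D, B)
    diff = centres[None, :, :] - means[:, :, None]            # (C, D, B)
    return -0.5 * (np.log(2.0 * np.pi * variances)[:, :, None]
                   + diff ** 2 / variances[:, :, None])

def fast_loglike(frame, edges, tables, log_weights):
    """Score one frame by quantising each dimension and summing look-ups."""
    d = len(frame)
    bins = np.array([np.clip(np.searchsorted(edges[i], frame[i]) - 1,
                             0, tables.shape[2] - 1) for i in range(d)])
    per_component = tables[:, np.arange(d), bins].sum(axis=1)   # (C,)
    return np.logaddexp.reduce(log_weights + per_component)
```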
Text-Independent Speaker Identification Using Vocal
Tract Length Normalization for Building Universal
Background Model
A.K. Sarkar, S. Umesh, S.P. Rath; IIT Kanpur, India
Wed-Ses3-P2-3, Time: 16:00
In this paper, we propose to use Vocal Tract Length Normalization
(VTLN) to build the Universal Background Model (UBM) for a
closed set speaker identification system. Vocal Tract Length (VTL)
differences among speakers is a major source of variability in the
speech signal. Since the UBM model is trained using data from
many speakers, it statistically captures this inherent variation in
the speech signal, which results in a “coarse” model in the acoustic
space. This may cause the adapted speaker models obtained from
the UBM model to have significantly high overlap in the acoustic
space. We hypothesize that the use of VTLN will help in compacting
the UBM model and thus the speaker adapted models obtained
from this compact model will have better speaker-separability in
the acoustic space. We perform experiments on MIT, TIMIT and
NIST 2004 SRE databases and show that using VTLN we can achieve
lower identification error rates as compared to the conventional
GMM-UBM based method.
Selection of the Best Set of Shifted Delta Cepstral
Features in Speaker Verification Using Mutual
Information
José R. Calvo, Rafael Fernández, Gabriel Hernández;
CENATAV, Cuba
Wed-Ses3-P2-5, Time: 16:00
Shifted delta cepstral (SDC) features, obtained by concatenating
delta cepstral features across multiples speech frames, were recently reported to produce superior performance to delta cepstral
features in language and speaker recognition systems. In this
paper, the use of SDC features in a speaker verification experiment
is reported. Mutual information between SDC features and identity
of a speaker is used to select the best set of SDC parameters. The
experiment evaluates robustness of the best SDC features due to
channel and handset mismatch in speaker verification. The result
shows a relative EER reduction of up to 19% in a speaker verification
experiment.
Forensic Speaker Recognition Using Traditional
Features Comparing Automatic and
Human-in-the-Loop Formant Tracking
Alberto de Castro, Daniel Ramos, Joaquin
Gonzalez-Rodriguez; Universidad Autónoma de
Madrid, Spain
Wed-Ses3-P2-6, Time: 16:00
In this paper we compare forensic speaker recognition with traditional features using two different formant tracking strategies:
one performed automatically and one semi-automatic performed
by human experts. The main contribution of the work is the use
of an automatic method for formant tracking, which allows a
much faster recognition process and the use of a much higher
amount of data for modelling background population, calibration,
etc. This is especially important in likelihood-ratio-based forensic
speaker recognition, where the variation of features among a
population of speakers must be modelled in a statistically robust
way. Experiments show that, although recognition using the
human-in-the-loop approach is better than using the automatic
scheme, the performance of the latter is also acceptable. Moreover,
we present a novel feature selection method which allows the analysis of which feature of each formant has a greater contribution to
the discriminating power of the whole recognition process, which
can be used by the expert in order to decide which features in the
available speech material are important.
Open-Set Speaker Identification Under Mismatch
Conditions
S.G. Pillay 1 , A. Ariyaeeinia 1 , P. Sivakumaran 1 , M.
Pawlewski 2 ; 1 University of Hertfordshire, UK; 2 BT
Labs, UK
Wed-Ses3-P2-7, Time: 16:00
This paper presents investigations into the performance of
open-set, text-independent speaker identification (OSTI-SI) under
mismatched data conditions. The scope of the study includes
attempts to reduce the adverse effects of such conditions through
the introduction of a modified parallel model combination (PMC)
method together with condition-adjusted T-Norm (CT-Norm) into
the OSTI-SI framework. The experiments are conducted using
examples of real world noise. Based on the outcomes, it is
demonstrated that the above approach can lead to considerable
improvements in the accuracy of open-set speaker identification
operating under severely mismatched data conditions. The paper
details the realisation of the modified PMC method and CT-Norm
in the context of OSTI-SI, presents the experimental investigations
and provides an analysis of the results.
MiniVectors: An Improved GMM-SVM Approach for
Speaker Verification
Xavier Anguera; Telefonica Research, Spain
Wed-Ses3-P2-8, Time: 16:00
The accuracy levels achieved by state-of-the-art Speaker Verification systems are high enough for the technology to be used in
real-life applications. Unfortunately, the transfer from the lab to
the field is not as straightforward as it could be: the best performing
systems can be computationally expensive to run and need large
speaker model footprints. In this paper, we compare two speaker
verification algorithms (GMM-SVM Supervectors and Kharroubi’s
GMM-SVM vectors) and propose an improvement of Kharroubi’s
system that: (a) achieves up to 17% relative performance improvement when compared to the Supervectors algorithm; (b) is 24%
faster in run time and (c) makes use of speaker models that are
94% smaller than those needed by the Supervectors algorithm.
Robustness of Phase Based Features for Speaker
Recognition
R. Padmanabhan 1 , Sree Hari Krishnan
Parthasarathi 2 , Hema A. Murthy 1 ; 1 IIT Madras, India;
2 IDIAP Research Institute, Switzerland
Wed-Ses3-P2-9, Time: 16:00
This paper demonstrates the robustness of group-delay based
features for speech processing. An analysis of group delay functions is presented which shows that these features retain formant
structure even in noise. Furthermore, a speaker verification task
performed on the NIST 2003 database shows lower error rates when
compared with traditional MFCC features. We also discuss
using feature diversity to dynamically choose the feature for
every claimed speaker.
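For reference, the group-delay function underlying such features can be computed from two FFTs; the sketch below shows plain group delay, without the modified-group-delay smoothing that practical systems usually add:

```python
import numpy as np

def group_delay(frame, n_fft=512, eps=1e-8):
    """Group delay of one windowed frame:
    tau(w) = (Xr*Yr + Xi*Yi) / |X|^2, where y[n] = n * x[n]."""
    x = np.asarray(frame, dtype=float)
    y = np.arange(len(x)) * x
    X = np.fft.rfft(x, n_fft)
    Y = np.fft.rfft(y, n_fft)
    return (X.real * Y.real + X.imag * Y.imag) / (np.abs(X) ** 2 + eps)
```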
The MIT Lincoln Laboratory 2008 Speaker
Recognition System
D.E. Sturim, W.M. Campbell, Zahi N. Karam, Douglas
Reynolds, F.S. Richardson; MIT, USA
Wed-Ses3-P2-10, Time: 16:00
In recent years methods for modeling and mitigating variational
nuisances have been introduced and refined. A primary emphasis
in last year's NIST 2008 Speaker Recognition Evaluation (SRE) was to
greatly expand the use of auxiliary microphones. This introduced
additional channel variations, which have been a historical challenge
to speaker verification systems. In this paper we present the MIT
Lincoln Laboratory Speaker Recognition system applied to the
task in the NIST 2008 SRE. Our approach during the evaluation
was two-fold: 1) Utilize recent advances in variational nuisance
modeling (latent factor analysis and nuisance attribute projection)
to allow our spectral speaker verification systems to better compensate for the channel variation introduced, and 2) fuse systems
targeting the different linguistic tiers of information, high and low.
The performance of the system is presented when applied on a
NIST 2008 SRE task. Post evaluation analysis is conducted on the
sub-task when interview microphones are present.
Speaker Recognition on Lossy Compressed Speech
Using the Speex Codec
A.R. Stauffer, A.D. Lawson; RADC Inc., USA
Wed-Ses3-P2-11, Time: 16:00
This paper examines the impact of lossy speech coding with Speex
on GMM-UBM speaker recognition (SR). Audio from 120 speakers
was compressed with Speex into twelve data sets, each with a
different level of compression quality from 0 (most compressed) to
10 (least), plus uncompressed. Experiments looked at performance
under matched and mismatched compression conditions, using
models conditioned for the coded environment, and Speex coding
applied to improving SR performance on other coders. Results
show that Speex is effective for compression of data used in SR and
that Speex coding can improve performance on data compressed
by the GSM codec.
Text-Independent Speaker Verification Using Rank
Threshold in Large Number of Speaker Models
Haruka Okamoto 1 , Satoru Tsuge 2 , Amira
Abdelwahab 1 , Masafumi Nishida 3 , Yasuo Horiuchi 1 ,
Shingo Kuroiwa 1 ; 1 Chiba University, Japan;
2 University of Tokushima, Japan; 3 Doshisha University,
Japan
Wed-Ses3-P2-12, Time: 16:00
In this paper, we propose a novel speaker verification method
which determines whether a claimer is accepted or rejected by
the rank of the claimer in a large number of speaker models
instead of score normalization, such as T-norm and Z-norm. The
method has advantages over the standard T-norm in speaker
verification accuracy. However, like T-norm, it requires substantial
computation time, since likelihoods must be calculated for many
cohort models. Hence, we also discuss a speed-up using a
method that selects a cohort subset for each target speaker in the
training stage. This data driven approach can significantly reduce
computation resulting in faster speaker verification decision.
We conducted text-independent speaker verification experiments
using large-scale Japanese speaker recognition evaluation corpus
constructed by National Research Institute of Police Science. As a
result, the proposed method achieved an equal error rate of 2.2%,
while T-norm obtained 2.7%.
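The decision rule replaces score normalisation with a rank test against a pool of speaker models; a toy version (the scoring inputs and rank threshold are assumptions) looks like this:

```python
import numpy as np

def rank_based_verify(claimed_score, cohort_scores, rank_threshold=5):
    """Accept a claim if the claimed model ranks within the top
    `rank_threshold` scores of the whole model pool.

    claimed_score: likelihood score of the test utterance on the claimed model
    cohort_scores: 1-D array of scores on the other speaker models
    """
    rank = 1 + int(np.sum(np.asarray(cohort_scores) > claimed_score))
    return rank <= rank_threshold
```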
The Role of Age in Factor Analysis for Speaker
Identification
Yun Lei, John H.L. Hansen; University of Texas at
Dallas, USA
Wed-Ses3-P2-13, Time: 16:00
The speaker acoustic space described by a factor analysis model
is assumed to reflect a majority of the speaker variations using a
reduced number of latent factors. In this study, the age factor, as
an observable important factor of a speaker’s voice, is analyzed and
employed in the description of the speaker acoustic space, using
a factor analysis approach. An age dependent acoustic space is
developed for speakers, and the effect of the age dependent space
in eigenvoice is evaluated using the NIST SRE08 corpus. In addition,
the data pool with different age distributions are evaluated based
on joint factor analysis model to assess age influence from the
data pool.
Adaptive Training with Noisy Constrained
Maximum Likelihood Linear Regression for Noise
Robust Speech Recognition
D.K. Kim, M.J.F. Gales; University of Cambridge, UK
Wed-Ses3-P3-2, Time: 16:00
Adaptive training is a widely used technique for building speech
recognition systems on non-homogeneous training data. Recently
there has been interest in applying these approaches for situations
where there is significant levels of background noise. This work
extends the most popular form of linear transform for adaptive
training, constrained MLLR, to reflect additional uncertainty from
noise corrupted observations. This new form of transform, Noisy
CMLLR, uses a modified version of generative model between clean
speech and noisy observation, similar to factor analysis. Adaptive
training using NCMLLR with both maximum likelihood and discriminative criteria are described. Experiments are conducted on
noise-corrupted Resource Management and in-car recorded data.
In preliminary experiments this new form achieves improvements
in recognition performance over the standard approach in low
signal-to-noise ratio conditions.
Do Humans and Speaker Verification System Use the
Same Information to Differentiate Voices?
Juliette Kahn 1 , Solange Rossato 2 ; 1 LIA, France; 2 LIG,
France
Wed-Ses3-P2-14, Time: 16:00
The aim of this paper is to analyze the pairwise comparisons of
voices by a speaker verification system (ALIZE/Spk) and by humans.
A database of familial groups of 24 speakers was created. A single
sentence was chosen for the perception test. The same sentence
was used as the test signal for ALIZE/Spk, trained on another part
of the corpus. Results show that the voice proximities within a
familial group were well recovered in the speaker representation
by ALIZE and much less so in the representation from the
perception test.
Wed-Ses3-P3 : Robust Automatic Speech
Recognition II
Hewison Hall, 16:00, Wednesday 9 Sept 2009
Chair: Peter Jancovic, University of Birmingham, UK
Performance Comparisons of the Integrated Parallel
Model Combination Approaches with Front-End
Noise Reduction
Guanghu Shen 1 , Soo-Young Suk 2 , Hyun-Yeol Chung 1 ;
1 Yeungnam University, Korea; 2 AIST, Japan
Wed-Ses3-P3-3, Time: 16:00
In this paper, to find the best noise robustness approach, we
study on approaches implemented at both-end (i.e. front-end and
back-end) of speech recognition system. To reduce the noise with
lower speech distortion at front-end, we investigate the Two-stage
Mel-warped Wiener Filtering (TMWF) in the integrated Parallel
Model Combination (PMC) approach. Furthermore, the first-stage
of TMWF (i.e. One-stage Mel-warped Wiener Filtering (OMWF)), as
well as the well-known Wiener Filtering (WF), is effective to reduce
the noise, so we integrate PMC with those front-end noise reduction
approaches. From the recognition performance, TMWF-PMC shows
improved performance comparing with the well-known WF-PMC,
and OMWF-PMC also shows a comparable performance in all noises.
Noisy Speech Recognition by Using Output
Combination of Discrete-Mixture HMMs and
Continuous-Mixture HMMs
Tetsuo Kosaka, You Saito, Masaharu Kato; Yamagata
University, Japan
Wed-Ses3-P3-1, Time: 16:00
This paper presents an output combination approach for noise-robust speech recognition. The aim of this work is to improve
recognition performance for adverse conditions which contain
both stationary and non-stationary noise. In the proposed method,
both discrete-mixture HMMs (DMHMMs) and continuous-mixture
HMMs (CMHMMs) are used as acoustic models. In the DMHMM,
subvector quantization is used instead of vector quantization and
each state has multiple mixture components. Our previous work
showed that DMHMM system indicated better performance in low
SNR and/or non-stationary noise conditions. In contrast, CMHMM
system was better in the opposite conditions. Thus, we take a
system combination approach of the two models to improve the
performance in various kinds of noise conditions. The proposed
method was evaluated on a LVCSR task with 5K word vocabulary.
The results showed that the proposed method was effective in
various kinds of noise conditions.
Tuning Support Vector Machines for Robust
Phoneme Classification with Acoustic Waveforms
Jibran Yousafzai, Zoran Cvetković, Peter Sollich; King’s
College London, UK
Wed-Ses3-P3-4, Time: 16:00
This work focuses on the robustness of phoneme classification
to additive noise in the acoustic waveform domain using support
vector machines (SVMs). We address the issue of designing kernels
for acoustic waveforms which imitate the state-of-the-art representations such as PLP and MFCC and are tuned to the physical
properties of speech. For comparison, classification results in
the PLP representation domain with cepstral mean-and-variance
normalization (CMVN) using standard kernels are also reported. It
is shown that our custom-designed kernels achieve better classification performance at high noise. Finally, we combine the PLP and
acoustic waveform representations to attain better classification
than either of the individual representations over the entire range
of noise levels tested, from quiet condition up to -18dB SNR.
An Analytic Derivation of a Phase-Sensitive
Observation Model for Noise Robust Speech
Recognition
Volker Leutnant, Reinhold Haeb-Umbach; Universität
Paderborn, Germany
Wed-Ses3-P3-5, Time: 16:00
In this paper we present an analytic derivation of the moments
of the phase factor between clean speech and noise cepstral or
log-mel-spectral feature vectors. The development shows, among
others, that the probability density of the phase factor is of
sub-Gaussian nature and that it is independent of the noise type
and the signal-to-noise ratio, however dependent on the mel filter
bank index. Further we show how to compute the contribution
of the phase factor to both the mean and the variance of the
noisy speech observation likelihood, which relates the speech
and noise feature vectors to those of noisy speech. The resulting
phase-sensitive observation model is then used in model-based
speech feature enhancement, leading to significant improvements
in word accuracy on the AURORA2 database.
Variational Model Composition for Robust Speech
Recognition with Time-Varying Background Noise
Wooil Kim, John H.L. Hansen; University of Texas at
Dallas, USA
Wed-Ses3-P3-6, Time: 16:00
This paper proposes a novel model composition method to improve speech recognition performance in time-varying background
noise conditions. It is suggested that each order of the cepstral
coefficients represents the frequency degree of changing components in the envelope of the log-spectrum. With this motivation,
in the proposed method, variational noise models are generated
by selectively applying perturbation factors to a basis model,
resulting in a collection of various types of spectral patterns in
the log-spectral domain. The basis noise model is obtained from
the silent duration segments of the input speech. The proposed
Variational Model Composition (VMC) method is employed to generate multiple environmental models for our previously proposed
feature compensation method. Experimental results prove that
the proposed method is considerably more effective at increasing
speech recognition performance in time-varying background noise
conditions with 30.34% and 9.02% average relative improvements
in word error rate for speech babble and background music conditions respectively, compared to an existing single model-based
method.
Comparison of Estimation Techniques in Joint
Uncertainty Decoding for Noise Robust Speech
Recognition
Haitian Xu, K.K. Chin; Toshiba Research Europe Ltd.,
UK
Wed-Ses3-P3-7, Time: 16:00
Model-based joint uncertainty decoding (JUD) has recently achieved
promising results by integrating the front-end uncertainty into
the back-end decoding by estimating JUD transforms in a mathematically consistent framework. There are different ways of
estimating the JUD transforms resulting in different JUD methods.
This paper gives an overview of the estimation techniques existing
in the literature including data-driven parallel model combination,
Taylor series based approximation and the recently proposed
second order approximation. Application of a new technique
based on the unscented transformation is also proposed for the
JUD framework. The different techniques have been compared in
terms of both recognition accuracy and computational cost on a
database recorded in a real car environment. Experimental results
indicate the unscented transformation is one of the best options
for estimating JUD transforms as it maintains a good balance
between accuracy and efficiency.
Replacing Uncertainty Decoding with Subband
Re-Estimation for Large Vocabulary Speech
Recognition in Noise
Jianhua Lu, Ji Ming, Roger Woods; Queen’s University
Belfast, UK
Wed-Ses3-P3-8, Time: 16:00
In this paper, we propose a novel approach for parameterized
model compensation for large-vocabulary speech recognition in
noisy environments. The new compensation algorithm, termed
CMLLR-SUBREST, combines the model-based uncertainty decoding
(UD) with subspace distribution clustering hidden Markov modeling (SDCHMM), so that the UD-type compensation can be realized
by re-estimating the models based on small amount of adaptation
data. This avoids the estimation of the covariance biases, which
is required in model-based UD and usually needs a numerical
approach. The Aurora 4 corpus is used in the experiments. We
have achieved 16.9% relative WER (word error rate) reduction
over our previous missing-feature (MF) based decoding and 16.1%
over the combination of Constrained MLLR compensation and MF
decoding. The number of model parameters is reduced by two
orders of magnitude.
Wed-Ses3-P4 : Prosody: Production II
Hewison Hall, 16:00, Wednesday 9 Sept 2009
Chair: Shinichi Tokuma, Chuo University, Japan
Perception and Production of Boundary Tones in
Whispered Dutch
W. Heeren, V.J. Van Heuven; Universiteit Leiden, The
Netherlands
Wed-Ses4-P4-1, Time: 13:30
The main cue to interrogativity in Dutch declarative questions is
found in the final boundary tone. When whispering, a speaker does
not produce the most important acoustic information conveying
this: the fundamental frequency. In this paper listeners are shown
to perceive the difference between whispered declarative questions
and statements, though less clearly than in phonated speech. Moreover, possible acoustic correlates conveying whispered question
intonation were investigated. The results show that the second
formant may convey pitch in whispered speech, and also that
first formant and intensity differences exist between high and low
boundary tones in both phonated and whispered speech.
Pitch Accents and Information Status in a German
Radio News Corpus
Katrin Schweitzer, Arndt Riester, Michael Walsh,
Grzegorz Dogil; Universität Stuttgart, Germany
Wed-Ses4-P4-2, Time: 13:30
This paper presents a corpus analysis of prosodic realisations of
information status categories in terms of pitch accent types. The
annotations are based on a recent annotation scheme for information
status [1] that relies on semantic criteria applied to written
text. For each information status category, typical pitch accent
realisations are identified. Moreover, the relevance of the strict
semantic information status labelling scheme on the prosodic
realisation is examined. It can be shown that the semantic criteria
are reflected in prosody, i.e. the prosodic findings corroborate the
theoretical assumptions made in the framework.
Analysis of Voice Fundamental Frequency Contours
of Continuing and Terminating Prosodic Phrases in
Four Swiss German Dialects
Adrian Leemann, Keikichi Hirose, Hiroya Fujisaki;
University of Tokyo, Japan
Wed-Ses4-P4-3, Time: 13:30
In the present study, the F0 contours of continuing and terminating
prosodic phrases of 4 Swiss German dialects are analyzed by means
of the command-response model. In every model parameter, the
two prosodic phrase types show significant differences: continuing
prosodic phrases indicate higher phrase command magnitude and
shorter durations. Locally, they demonstrate more distinct accent
command amplitudes as well as durations. In addition, continuing
prosodic phrases have later rises relative to segment onset than
terminating prosodic phrases. In the same context, fine phonetic
differences between the dialects are highlighted.
Using Responsive Prosodic Variation to
Acknowledge the User’s Current State
Nigel G. Ward, Rafael Escalante-Ruiz; University of
Texas at El Paso, USA
Wed-Ses4-P4-6, Time: 13:30
Spoken dialog systems today do not vary the prosody of their utterances, although prosody is known to have many useful expressive
functions. In a corpus of memory quizzes, we identify eleven
dimensions of prosodic variation, each with its own expressive
function. We identified the situations in which each was used,
and developed rules for detecting these situations from the dialog
context and the prosody of the interlocutor’s previous utterance.
We implemented the resulting rules and had 21 users interact with
two versions of the system. Overall they preferred the version in
which the prosodic forms of the acknowledgments were chosen
to be suitable for each specific context. This suggests that simple
adjustments to system prosody based on local context can have
value to users.
Intonational Features for Identifying Regional
Accents of Italian
Michelina Savino; Università di Bari, Italy
Wed-Ses4-P4-4, Time: 13:30
Aim of this paper is providing a preliminary account of some
intonational features useful for identifying a large number of
Italian accents, estimated as representative of Italian regional
variation, by analysing a corpus of comparable speech materials
consisting of Map Task dialogues. Analysis concentrates on the
intonational characteristics of yes-no questions, which can be
realised very differently across varieties, whereas statements are
generally characterised by a (low) falling final movement. Results
of this preliminary investigation indicate that intonational features
useful for identifying Italian regional accents are the tune type
(rising-falling vs falling-rising vs rising), and the nuclear peak
alignment in rising-falling contours (mid vs late).
Analysis and Recognition of Accentual Patterns
Agnieszka Wagner; Adam Mickiewicz University,
Poland
Wed-Ses4-P4-5, Time: 13:30
This study proposes a framework of automatic analysis and
recognition of accentual patterns. In the first place we present
the results of analyses which aimed at identification of acoustic
cues signaling prominent syllables and different pitch accent types
distinguished at the surface-phonological level. The resulting representation provides a framework of analysis of accentual patterns
at the acoustic-phonetic level. The representation is compact (it
consists of 13 acoustic features), has low redundancy (the features
cannot be derived from one another) and wide coverage (it encodes distinctions between perceptually different utterances). Next,
we train statistical models to automatically determine accentual
patterns of utterances using the acoustic-phonetic representation
which involves two steps: detection of accentual prominence and
assigning pitch accent types to prominent syllables. The best
models achieve high accuracy (above 80% on average) using small
acoustic feature vectors.
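The two-step scheme, prominence detection followed by pitch-accent-type assignment, can be mimicked with generic classifiers; the sketch below is an illustrative stand-in (logistic-regression models and 0/1 prominence labels are assumptions, not the statistical models used in the paper):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def recognise_accentual_pattern(X, prominence_clf, accent_type_clf):
    """Two-step labelling: prominence detection, then accent-type assignment.

    X: (n_syllables, 13) matrix of acoustic-phonetic features
    prominence_clf:  fitted binary classifier (assumed 0/1 labels)
    accent_type_clf: fitted multi-class pitch-accent classifier
    """
    prominent = np.asarray(prominence_clf.predict(X)).astype(bool)
    labels = [None] * len(X)
    if prominent.any():
        for i, accent in zip(np.flatnonzero(prominent),
                             accent_type_clf.predict(X[prominent])):
            labels[i] = accent
    return labels

# One way to build the two (hypothetical) classifiers before fitting them:
prominence_clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
accent_type_clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
```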
Intonation Segments and Segmental Intonation
Oliver Niebuhr; LPL, France
Wed-Ses4-P4-7, Time: 13:30
An acoustic analysis of a German dialogue corpus showed that
the sound qualities and durations of fricatives, vocoids, and
diphthongs at the ends of question and statement utterances
varied systematically with the utterance-final intonation segments,
which were high-rising in the questions and terminal-falling in the
statements. The ways in which the variations relate to phenomena
like sibilant/spectral pitch and intrinsic F0 suggest that they are
meant to support the pitch course. Thus, they may be called
segmental intonations.
The Phrase-Final Accent in Kammu: Effects of Tone,
Focus and Engagement
David House 1 , Anastasia Karlsson 2 , Jan-Olof
Svantesson 2 , Damrong Tayanin 2 ; 1 KTH, Sweden;
2 Lund University, Sweden
Wed-Ses4-P4-8, Time: 13:30
The phrase-final accent can typically contain a multitude of simultaneous prosodic signals. In this study, aimed at separating the
effects of lexical tone from phrase-final intonation, phrase-final
accents of two dialects of Kammu were analyzed. Kammu, a MonKhmer language spoken primarily in northern Laos, has dialects
with lexical tones and dialects with no lexical tones. Both dialects
seem to engage the phrase-final accent to simultaneously convey
focus, phrase finality, utterance finality, and speaker engagement.
Both dialects also show clear evidence of truncation phenomena.
These results have implications for our understanding of the
interaction between tone, intonation and phrase-finality.
Tonal Alignment in Three Varieties of
Hiberno-English
Raya Kalaldeh, Amelie Dorn, Ailbhe Ní Chasaide;
Trinity College Dublin, Ireland
Wed-Ses4-P4-9, Time: 13:30
This pilot study investigates the tonal alignment of pre-nuclear
(PN) and nuclear (N) accents in three Hiberno-English (HE) regional varieties: Dublin, Drogheda, and Donegal English. The
peak alignment is investigated as a function of the number of
unstressed syllables before PN and after N. Dublin and Drogheda
English appear to have a fixed peak alignment in both nuclear and
pre-nuclear conditions. Donegal English, however, shows a drift
in peak alignment in nuclear and pre-nuclear conditions. Findings
also show that the peak is located earlier in nuclear and later in
pre-nuclear conditions across the three dialects.
Is Tonal Alignment Interpretation Independent of
Methodology?
Caterina Petrone 1 , Mariapaola D’Imperio 2 ; 1 ZAS,
Germany; 2 LPL, France
Wed-Ses4-P4-13, Time: 13:30
Determining Intonational Boundaries from the
Acoustic Signal
Lourdes Aguilar 1 , Antonio Bonafonte 2 , Francisco
Campillo 3 , David Escudero 4 ; 1 Universitat Autònoma
de Barcelona, Spain; 2 Universitat Politècnica de
Catalunya, Spain; 3 Universidade de Vigo, Spain; 4 Universidad de Valladolid, Spain
Wed-Ses4-P4-10, Time: 13:30
This article has two aims: first, it reports the enrichment of a
Catalan speech database for speech synthesis (Festcat) with
information about prosodic boundaries, using the break-index labels
proposed in the ToBI system; second, it presents the experiments
carried out to determine the acoustic markers that can differentiate
among the break indexes. Several experiments using different
classification techniques were performed in order to compare the
relative merit of different attributes for characterising breaks.
Results show that prosodic phrase breaks are correlated with the
presence of a pause, lengthening of the pre-break syllable, and the
F0 contour of the span between the stressed syllable and any
following post-stressed syllables immediately preceding the break.
Compression and Truncation Revisited
Claudia K. Ohl, Hartmut R. Pfitzinger;
Christian-Albrechts-Universität zu Kiel, Germany
Tonal target detection is a very difficult task, especially in the
presence of consonantal perturbations. Although different detection
methods have been adopted in tonal alignment research, we still do
not know which is the most reliable. In this paper, we find that
such methodological choices have serious theoretical implications.
Interpretation of the data strongly depends on whether tonal
targets have been detected by a manual, a semi-automatic or an
automatic procedure. Moreover, different segmental classes can
affect target placement, especially in automatic detection. This
underlines the importance of keeping segmental classes separate for
the purpose of statistical analysis.
Modeling the Intonation of Topic Structure: Two
Approaches
Margaret Zellers 1 , Brechtje Post 1 , Mariapaola
D’Imperio 2 ; 1 University of Cambridge, UK; 2 LPL,
France
Wed-Ses4-P4-14, Time: 13:30
Intonational variation is widely regarded as a source of information
about the topic structure of spoken discourse. However, many
factors other than topic can influence this variation. We compared
two models of intonation in terms of their ability to account for
these other sources of variation. In dealing with this variation, the
models paint different pictures of the intonational correlates of
topic.
Wed-Ses4-P4-11, Time: 13:30
This paper investigates the influence of varying segmental structures
on the realization of utterance-final rising and falling intonation
contours. Following Grabe's study on adjustment strategies in German,
i.e. truncation and compression, a similar experiment was carried out
using materials with decreasing stretches of voicing in questions,
lists, and statements. However, the results reported here could not
confirm the idea of such common adjustment strategies. Instead,
considerable variation was found in how the phrase-final intonation
contours were adjusted to the respective amounts of voicing: the
strategies varied strongly across different word groups.
Comparison of Fujisaki-Model Extractors and F0
Stylizers
Wed-Ses3-S1 : Special Session: Machine
Learning for Adaptivity in Spoken Dialogue
Systems
Ainsworth (East Wing 4), 16:00, Wednesday 9 Sept 2009
Chair: Oliver Lemon, University of Edinburgh, UK and Olivier
Pietquin, Supélec, France
A User Modeling-Based Performance Analysis of a
Wizarded Uncertainty-Adaptive Dialogue System
Corpus
Kate Forbes-Riley, Diane Litman; University of
Pittsburgh, USA
Hartmut R. Pfitzinger 1 , Hansjörg Mixdorff 2 , Jan
Schwarz 1 ; 1 Christian-Albrechts-Universität zu Kiel,
Germany; 2 BHT Berlin, Germany
Wed-Ses3-S1-1, Time: 16:00
Wed-Ses4-P4-12, Time: 13:30
This study compares four automatic methods for estimating
Fujisaki-model parameters. Since interpolation and smoothing are
necessary prerequisites for all approaches, their fitting accuracies
are also compared with that of a novel stylisation method. A
hand-corrected set of results from one of the methods, created on
linguistic grounds, served as a second benchmark. Although the four
methods yield comparable results with respect to their total errors,
they show different error distributions. The manually corrected
version provided a poorer approximation of the F0 contours than the
automatic one.
Motivated by prior spoken dialogue system research in user
modeling, we analyze interactions between performance and user
class in a dataset previously collected with two wizarded spoken
dialogue tutoring systems that adapt to user uncertainty. We focus
on user classes defined by expertise level and gender, and on both
objective (learning) and subjective (user satisfaction) performance
metrics. We find that lower-expertise users learn best from one
adaptive system but prefer the other, while higher-expertise users
learn more from one adaptive system but do not prefer either.
Female users both learn best from and prefer the same adaptive
system, while male users prefer one adaptive system but do not
learn more from either. Our results yield an empirical basis for
future investigations into whether adaptive system performance
can improve by adapting to user uncertainty differently based on
user class.
Using Dialogue-Based Dynamic Language Models for
Improving Speech Recognition
Juan Manuel Lucas-Cuesta, Fernando Fernández,
Javier Ferreiros; Universidad Politécnica de Madrid,
Spain
Bayesian Learning of Confidence Measure Function
for Generation of Utterances and Motions in Object
Manipulation Dialogue Task
Komei Sugiura, Naoto Iwahashi, Hideki Kashioka,
Satoshi Nakamura; NICT, Japan
Wed-Ses3-S1-2, Time: 16:20
Wed-Ses3-S1-5, Time: 17:20
We present a new approach to dynamically create and manage
different language models to be used on a spoken dialogue system.
We apply an interpolation based approach, using several measures
obtained by the Dialogue Manager to decide what LM the system
will interpolate and also to estimate the interpolation weights.
We propose to use not only semantic information (the concepts
extracted from each recognized utterance), but also information
obtained by the dialogue manager module (DM), that is, the objectives or goals the user wants to fulfill, and the proper classification
of those concepts according to the inferred goals. The experiments
we have carried out show improvements over word error rate when
using the parsed concepts and the inferred goals from a speech
utterance for rescoring the same utterance.
This paper proposes a method that generates motions and utterances in an object manipulation dialogue task. The proposed
method integrates belief modules for speech, vision, and motions
into a probabilistic framework so that a user’s utterances can be
understood based on multimodal information. Responses to the
utterances are optimized based on an integrated confidence measure function for the integrated belief modules. Bayesian logistic
regression is used for the learning of the confidence measure
function. The experimental results revealed that the proposed
method reduced the failure rate from 12% down to 2.6% while the
rejection rate was less than 24%.
Reinforcement Learning for Dialog Management
Using Least-Squares Policy Iteration and Fast
Feature Selection
Lihong Li 1 , Jason D. Williams 2 , Suhrid Balakrishnan 2 ; 1 Rutgers University, USA; 2 AT&T Labs Research, USA
Wed-Ses3-S1-3, Time: 16:40
Predicting How it Sounds: Re-Ranking Dialogue
Prompts Based on TTS Quality for Adaptive Spoken
Dialogue Systems
Cédric Boidin 1 , Verena Rieser 2 , Lonneke van der Plas 3 , Oliver Lemon 2 , Jonathan Chevelu 1 ; 1 Orange Labs, France; 2 University of Edinburgh, UK; 3 Université de Genève, Switzerland
Wed-Ses3-S1-6, Time: 17:40
Reinforcement learning (RL) is a promising technique for creating
a dialog manager. RL accepts features of the current dialog state
and seeks to find the best action given those features. Although
it is often easy to posit a large set of potentially useful features,
in practice, it is difficult to find the subset which is large enough
to contain useful information yet compact enough to reliably
learn a good policy. In this paper, we propose a method for RL
optimization which automatically performs feature selection. The
algorithm is based on least-squares policy iteration, a state-of-the-art RL algorithm which is highly sample-efficient and can learn
from a static corpus or on-line. Experiments in dialog simulation
show it is more stable than a baseline RL algorithm taken from a
working dialog system.
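To make the policy-evaluation step concrete, here is a minimal numpy sketch of the least-squares (LSTD-Q) weight solve at the core of such an algorithm; the feature dimensions, sample counts and random data below are hypothetical stand-ins, and no feature selection is included.

    import numpy as np

    def lstdq(phi_s, phi_next, rewards, gamma=0.95, reg=1e-6):
        """One LSTD-Q solve: fit linear Q-value weights from a batch of
        (state-action features, next state-action features, reward) samples."""
        A = phi_s.T @ (phi_s - gamma * phi_next)
        b = phi_s.T @ rewards
        return np.linalg.solve(A + reg * np.eye(A.shape[1]), b)

    # Toy corpus of dialogue turns: 200 samples, 10-dimensional features.
    rng = np.random.default_rng(0)
    phi_s = rng.normal(size=(200, 10))      # features of current (state, action)
    phi_next = rng.normal(size=(200, 10))   # features of greedy next (state, action)
    rewards = rng.normal(size=200)          # per-turn reward signal

    w = lstdq(phi_s, phi_next, rewards)
    q_values = phi_s @ w                    # linear Q estimates for each sample
    print(w.shape, q_values[:3])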
Hybridisation of Expertise and Reinforcement
Learning in Dialogue Systems
This paper presents a method for adaptively re-ranking paraphrases in a Spoken Dialogue System (SDS) according to their
predicted Text To Speech (TTS) quality. We collect data under
4 different conditions and extract a rich feature set of 55 TTS
runtime features. We build predictive models of user ratings using
linear regression with latent variables. We then show that these
models transfer to a more specific target domain on a separate test
set. All our models significantly outperform a random baseline.
Our best performing model reaches the same performance as
reported by previous work, but it requires 75% less annotated
training data. The TTS re-ranking model is part of an end-to-end
statistical architecture for Spoken Dialogue Systems developed by
the EC FP7 CLASSiC project.
Thu-Ses1-O1 : Robust Automatic Speech
Recognition III
Romain Laroche 1 , Ghislain Putois 1 , Philippe Bretier 1 , Bernadette Bouchon-Meunier 2 ; 1 Orange Labs, France; 2 LIP6, France
Main Hall, 10:00, Thursday 10 Sept 2009
Chair: P.D. Green, University of Sheffield, UK
Wed-Ses3-S1-4, Time: 17:00
This paper addresses the problem of introducing learning capabilities
into industrial handcrafted automata-based Spoken Dialogue Systems,
in order to help the developer cope with dialogue strategy design
tasks. While classical reinforcement learning algorithms position
their learning at the dialogue-move level, the fundamental idea
behind our approach is to learn at a finer internal decision level
(which question, which words, which prosody, . . .). These internal
decisions are made on the basis of different (distinct or
overlapping) knowledge. This paper proposes a novel reinforcement
learning algorithm that can be used to perform data-driven
optimisation of such handcrafted systems. An experiment shows that
convergence can be up to 20 times faster than with Q-Learning.
Accounting for the Uncertainty of Speech Estimates
in the Complex Domain for Minimum Mean Square
Error Speech Enhancement
Ramón Fernandez Astudillo, Dorothea Kolossa,
Reinhold Orglmeister; Technische Universität Berlin,
Germany
Thu-Ses1-O1-1, Time: 10:00
Uncertainty decoding and uncertainty propagation, or error propagation, techniques have emerged as powerful tools for increasing the
accuracy of automatic speech recognition systems by employing
an uncertain, or probabilistic, description of the speech features
rather than the usual point estimate. In this paper we analyze
the uncertainty generated in the complex Fourier domain when
performing speech enhancement with the Wiener or Ephraim-Malah
filters. We derive closed form solutions for the computation of the
error of estimation and show that it provides a better insight into
the origin of estimation uncertainty. We also show how the combination of such an error estimate with uncertainty propagation
and uncertainty decoding or modified imputation yields superior
recognition robustness when compared to conventional MMSE
estimators with little increase in the computational cost.
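As a rough illustration of working with an uncertain rather than point estimate, the sketch below applies a textbook Wiener gain per frequency bin and also returns the posterior variance of the complex-domain estimate under a Gaussian model; the SNR and noise figures are synthetic, and this formulation is not necessarily the authors' exact closed-form solution.

    import numpy as np

    def wiener_with_uncertainty(noisy_spec, noise_psd, xi):
        """Per-bin Wiener estimate of the clean spectrum plus the variance of the
        estimation error in the complex domain (posterior variance under a
        Gaussian model), which can be propagated to the recogniser features."""
        gain = xi / (1.0 + xi)                 # Wiener gain from a priori SNR
        clean_est = gain * noisy_spec          # point estimate of clean speech
        error_var = gain * noise_psd           # variance of the complex estimation error
        return clean_est, error_var

    rng = np.random.default_rng(1)
    bins = 257
    noisy = rng.normal(size=bins) + 1j * rng.normal(size=bins)   # one STFT frame
    noise_psd = np.full(bins, 0.5)                               # estimated noise PSD
    xi = np.abs(noisy) ** 2 / noise_psd                          # crude a priori SNR

    s_hat, var = wiener_with_uncertainty(noisy, noise_psd, xi)
    print(s_hat[:3], var[:3])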
Signal Separation for Robust Speech Recognition
Based on Phase Difference Information Obtained in
the Frequency Domain
Chanwoo Kim, Kshitiz Kumar, Bhiksha Raj, Richard M.
Stern; Carnegie Mellon University, USA
Thu-Ses1-O1-2, Time: 10:00
In this paper, we present a new two-microphone approach that
improves speech recognition accuracy when speech is masked by
other speech. The algorithm improves on previous systems that
have been successful in separating signals based on differences
in arrival time of signal components from two microphones. The
present algorithm differs from these efforts in that the signal
selection takes place in the frequency domain. We observe that additional smoothing of the phase estimates over time and frequency
is needed to support adequate speech recognition performance. We
demonstrate that the algorithm described in this paper provides
better recognition accuracy than time-domain-based signal separation algorithms, and at less than 10 percent of the computation
cost.
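A minimal sketch of frequency-domain signal selection from two microphones, assuming STFT frames are already available: the inter-channel phase difference is converted to an implied time delay per time-frequency bin, and bins close to the target delay are kept. The tolerance, smoothing and synthetic data are illustrative choices, not the authors' settings.

    import numpy as np

    def phase_difference_mask(spec_left, spec_right, freqs_hz,
                              target_delay_s=0.0, tol_s=1.5e-4, smooth=3):
        """Binary time-frequency mask that keeps bins whose inter-microphone
        phase difference corresponds to a time delay close to the target source."""
        phase_diff = np.angle(spec_left * np.conj(spec_right))      # (freq, time)
        # Convert phase difference to an implied arrival-time difference per bin.
        with np.errstate(divide="ignore", invalid="ignore"):
            delay = phase_diff / (2 * np.pi * freqs_hz[:, None])
        delay[0, :] = target_delay_s                                # DC bin has no phase info
        mask = (np.abs(delay - target_delay_s) < tol_s).astype(float)
        # Simple smoothing over time to stabilise the decisions.
        kernel = np.ones(smooth) / smooth
        mask = np.apply_along_axis(lambda m: np.convolve(m, kernel, mode="same"), 1, mask)
        return mask

    fs = 16000
    freqs = np.linspace(0, fs / 2, 257)
    rng = np.random.default_rng(2)
    L = rng.normal(size=(257, 100)) + 1j * rng.normal(size=(257, 100))
    R = rng.normal(size=(257, 100)) + 1j * rng.normal(size=(257, 100))
    masked = phase_difference_mask(L, R, freqs) * L     # apply mask to one channel
    print(masked.shape)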
Transforming Features to Compensate Speech
Recogniser Models for Noise
R.C. van Dalen, F. Flego, M.J.F. Gales; University of
Cambridge, UK
Thu-Ses1-O1-3, Time: 10:00
To make speech recognisers robust to noise, either the features
or the models can be compensated. Feature enhancement is
often fast; model compensation is often more accurate, because
it predicts the corrupted speech distribution. It is therefore able,
for example, to take uncertainty about the clean speech into
account. This paper re-analyses the recently proposed predictive
linear transformations for noise compensation as minimising the
KL divergence between the predicted corrupted speech and the
adapted models. New schemes are then introduced which apply
observation-dependent transformations in the front-end to adapt
the back-end distributions. One applies transforms in exactly the
same manner as the popular minimum mean square error (MMSE)
feature enhancement scheme, and is as fast. The new method
performs better on Aurora 2.
Subband Temporal Modulation Spectrum
Normalization for Automatic Speech Recognition in
Reverberant Environments
Xugang Lu 1 , Masashi Unoki 2 , Satoshi Nakamura 1 ; 1 NICT, Japan; 2 JAIST, Japan
Thu-Ses1-O1-4, Time: 10:00
Speech recognition in reverberant environments is still a challenging
problem. In this paper, we first investigated the effect of
reverberation on subband temporal envelopes by using the modulation
transfer function (MTF). Based on this investigation, we propose an
algorithm which normalizes the subband temporal modulation spectrum
(TMS) to reduce the diffusion effect of the reverberation. During the
normalization, the subband TMS of both the clean and the reverberated
speech is normalized to a reference TMS calculated from a clean
speech data set for each frequency subband. Based on the normalized
subband TMS, an inverse Fourier transform is applied to restore the
subband temporal envelopes while keeping their original phase
information. We tested our algorithm on reverberated speech
recognition tasks (in a reverberant room). For comparison,
traditional Mel-frequency cepstral coefficients (MFCC) and relative
spectral filtering (RASTA) were used. Experimental results showed
that recognition using features extracted with the proposed
normalization method achieves an overall relative improvement of
80.64%.
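The normalization idea can be sketched for a single subband envelope as follows, assuming a clean-speech reference modulation magnitude is available; the sketch keeps the original phase and replaces the magnitude, which is the general scheme described above rather than the exact implementation.

    import numpy as np

    def normalize_modulation_spectrum(envelope, reference_magnitude):
        """Normalise the temporal modulation spectrum of one subband envelope:
        replace its magnitude with a clean-speech reference while keeping the
        original phase, then return the restored temporal envelope."""
        spectrum = np.fft.rfft(envelope)
        phase = np.angle(spectrum)
        normalized = reference_magnitude * np.exp(1j * phase)
        return np.fft.irfft(normalized, n=len(envelope))

    rng = np.random.default_rng(3)
    reverberant_env = np.abs(rng.normal(size=200)) + 1.0     # one subband envelope
    clean_reference = np.abs(np.fft.rfft(np.abs(rng.normal(size=200)) + 1.0))
    restored = normalize_modulation_spectrum(reverberant_env, clean_reference)
    print(restored.shape)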
Robust In-Car Spelling Recognition — A Tandem
BLSTM-HMM Approach
Martin Wöllmer 1 , Florian Eyben 1 , Björn Schuller 1 ,
Yang Sun 1 , Tobias Moosmayr 2 , Nhu Nguyen-Thien 3 ; 1 Technische Universität München, Germany; 2 BMW
Group, Germany; 3 Continental Automotive GmbH,
Germany
Thu-Ses1-O1-5, Time: 10:00
As an intuitive hands-free input modality, automatic spelling recognition is especially useful for in-car human-machine interfaces.
However, for today’s speech recognition engines it is extremely
challenging to cope with similar sounding spelling speech sequences in the presence of noises such as the driving noise inside
a car. Thus, we propose a novel Tandem spelling recogniser,
combining a Hidden Markov Model (HMM) with a discriminatively
trained bidirectional Long Short-Term Memory (BLSTM) recurrent
neural net. The BLSTM network captures long-range temporal
dependencies to learn the properties of in-car noise, which makes
the Tandem BLSTM-HMM robust with respect to speech signal
disturbances at extremely low signal-to-noise ratios and mismatches between training and test noise conditions. Experiments
considering various driving conditions reveal that our Tandem
recogniser outperforms a conventional HMM by up to 33%.
Applying Non-Negative Matrix Factorization on
Time-Frequency Reassignment Spectra for Missing
Data Mask Estimation
Maarten Van Segbroeck, Hugo Van hamme; Katholieke
Universiteit Leuven, Belgium
Thu-Ses1-O1-6, Time: 10:00
The application of Missing Data Theory (MDT) has been shown to
improve the robustness of automatic speech recognition (ASR)
systems. A crucial part in a MDT-based recognizer is the computation of the reliability masks from noisy data. To estimate
accurate masks in environments with unknown, non-stationary
noise statistics, we need to rely on a strong model for the speech.
In this paper, an unsupervised technique using non-negative
matrix factorization (NMF) discovers phone-sized time-frequency
patches into which speech can be decomposed. The input matrix
for the NMF is constructed using a high resolution and reassigned
time-frequency representation. This representation facilitates an
accurate detection of the patches that are active in unseen noisy
speech. After further denoising of the patch activations, speech
and noise can be reconstructed from which missing feature masks
are estimated. Recognition experiments on the Aurora2 database
demonstrate the effectiveness of this technique.
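For illustration, a plain multiplicative-update NMF with a Euclidean cost on a non-negative time-frequency matrix is sketched below; the rank, iteration count and random input are placeholders, and the paper's reassigned representation and mask-estimation steps are not reproduced.

    import numpy as np

    def nmf(V, rank, iters=200, eps=1e-9):
        """Plain multiplicative-update NMF (Euclidean cost): factor a non-negative
        time-frequency matrix V into spectral patches W and activations H."""
        rng = np.random.default_rng(0)
        F, T = V.shape
        W = rng.random((F, rank)) + eps
        H = rng.random((rank, T)) + eps
        for _ in range(iters):
            H *= (W.T @ V) / (W.T @ W @ H + eps)
            W *= (V @ H.T) / (W @ H @ H.T + eps)
        return W, H

    rng = np.random.default_rng(4)
    V = np.abs(rng.normal(size=(64, 120)))     # e.g. reassigned time-frequency magnitudes
    W, H = nmf(V, rank=10)
    reconstruction = W @ H
    print(np.linalg.norm(V - reconstruction) / np.linalg.norm(V))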
Prosodic Analysis of Foreign-Accented English
Thu-Ses1-O2 : Prosody: Perception
Hansjörg Mixdorff 1 , John Ingram 2 ; 1 BHT Berlin,
Germany; 2 University of Queensland, Australia
Jones (East Wing 1), 10:00, Thursday 10 Sept 2009
Chair: Yi Xu, University College London, UK
Thu-Ses1-O2-4, Time: 11:00
Experiments on Automatic Prosodic Labeling
Antje Schweitzer, Bernd Möbius; Universität Stuttgart,
Germany
Thu-Ses1-O2-1, Time: 10:00
This paper presents results from experiments on automatic
prosodic labeling. Using the WEKA machine learning software [1],
classifiers were trained to determine, for each syllable in a speech
database of a male speaker, its pitch accent and its boundary tone.
Pitch accents and boundaries are labeled according to the GToBI(S)
dialect, with slight modifications. Classification was based on 35
attributes involving PaIntE F0 parametrization [2] and normalized
phone durations, but also some phonological information as well
as higher-linguistic information. Several classification algorithms
yield results of approx. 78% accuracy on the word level for pitch
accents, and approx. 88% accuracy on the word level for phrase
boundaries, which compare very well with results of other studies.
The classifiers generalize to similar data of a female speaker in
that they perform as well as classifiers trained directly on the
female data.
German Boundary Tones Show Categorical
Perception and a Perceptual Magnet Effect When
Presented in Different Contexts
Katrin Schneider, Grzegorz Dogil, Bernd Möbius;
Universität Stuttgart, Germany
Thu-Ses1-O2-2, Time: 10:20
The experiment presented in this paper examines categorical
perception as well as the perceptual magnet effect in German
boundary tones, taking also context information into account.
The test phrase is preceded by different context sentences that
are assumed to affect the location of the category boundary in
the stimulus continuum between the low and the high boundary
tone. Results provide evidence for the existence of a low and a
high boundary tone in German, corresponding to statement versus
question interpretation, respectively. Furthermore, in contrast to
previous findings, a prototype was found not only in the category
of the low but also in the category of the high boundary tone,
supporting the hypothesis that context might have been taken into
account to solve a possible ambiguity between H% and a previously
hypothesized non-low and non-terminal boundary tone.
Eye Tracking for the Online Evaluation of Prosody
in Speech Synthesis: Not So Fast!
Michael White, Rajakrishnan Rajkumar, Kiwako Ito,
Shari R. Speer; Ohio State University, USA
Thu-Ses1-O2-3, Time: 10:40
This paper presents an eye-tracking experiment comparing the processing of different accent patterns in unit selection synthesis and
human speech. The synthetic speech results failed to replicate the
facilitative effect of contextually appropriate accent patterns found
with human speech, while producing a more robust intonational
garden-path effect with contextually inappropriate patterns, both
of which could be due to processing delays seen with the synthetic
speech. As the synthetic speech was of high quality, the results
indicate that eye tracking holds promise as a highly sensitive and
objective method for the online evaluation of prosody in speech
synthesis.
This study compares utterances by Vietnamese learners of Australian
English with those of native subjects. In a previous study the
utterances had been rated for foreign accent and intelligibility.
We aim to find measurable prosodic differences that account for
the perceptual results. Our outcomes indicate, inter alia, that
unaccented syllables are relatively longer, compared with accented
ones, in the Vietnamese corpus than in the Australian English
corpus. Furthermore, the correlations of syllabic durations in
utterances of one and the same sentence are much higher for
Australian English subjects than for Vietnamese learners of English.
Vietnamese speakers use a larger range of F0 and produce more
pitch accents than Australian speakers.
Perception of the Evolution of Prosody in the French
Broadcast News Style
Philippe Boula de Mareüil, Albert Rilliard, Alexandre
Allauzen; LIMSI, France
Thu-Ses1-O2-5, Time: 11:20
This study makes use of advances in automatic speech processing
to analyse French audiovisual archives and the perception of the
journalistic style evolution regarding prosody. Three perceptual
experiments were run, using prosody transplantation, delexicalisation and imitation. Results show that the fundamental frequency
and duration correlates of prosody enable old-fashioned recordings
to be distinguished from more recent ones. The higher the pitch and
the more pitch movements there are on syllables which may be
interpreted as word-initially stressed, the more the speech samples
are perceived as dating back to the 1940s or the 1950s.
Prosodic Effects on Vowel Production: Evidence
from Formant Structure
Yoonsook Mo, Jennifer Cole, Mark Hasegawa-Johnson;
University of Illinois at Urbana-Champaign, USA
Thu-Ses1-O2-6, Time: 11:40
Speakers communicate pragmatic and discourse meaning through
the prosodic form assigned to an utterance, and listeners must
attend to the acoustic cues to prosodic form to fully recover
the speaker’s intended meaning. While much of the research on
prosody examines supra-segmental cues such as F0 and temporal
patterns, prosody is also known to affect the phonetic properties
of segments. This paper reports on the effect of prosodic
prominence on the formant patterns of vowels using speech data
from the Buckeye corpus of spontaneous American English. A
prosody annotation was obtained for a subset of this corpus based
on the auditory perception of 97 ordinary, untrained listeners.
To understand the relationship between prominence perception
and formant structure, as a measure of the ‘strength’ of the vowel
articulation, we measure the steady-state first and second formants of stressed vowels at vowel mid-points for monophthongs
and at both 10% (nucleus) and 90% (glide) positions for diphthongs.
Two hypotheses about the articulatory mechanism that implements
prominence (Hyperarticulation vs. Sonority Expansion Hypothesis)
were evaluated using Pearson’s bivariate correlation analyses with
formant values and prominence ‘scores’ — a novel perceptual
measure of prominence. The findings demonstrate that higher F1
values correlate with higher prominence scores regardless of vowel
height, confirming that vowels perceived as prominent tend to
have enhanced sonority. In the frontness dimension, on the other
hand, the results show that vowels perceived as prominent tend
to be hyperarticulated. These results support the model of the
supra-laryngeal implementation of prominence proposed in [5, 6]
based on controlled “laboratory” speech, and demonstrate that the
model can be extended to cover prosody in spontaneous speech
using a continuous-valued measure of prosodic prominence. The
evidence reported here from spontaneous speech shows that
prominent vowels have expanded sonority regardless of vowel
height, and are hyperarticulated only when hyperarticulation does
not interfere with sonority expansion.
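The correlation analysis itself reduces to Pearson's r between formant values and prominence scores, e.g. (with entirely hypothetical numbers):

    import numpy as np

    # Hypothetical measurements: first-formant values (Hz) and perceptual
    # prominence scores (proportion of listeners marking the word as prominent).
    f1_hz = np.array([420., 510., 630., 580., 700., 460., 540., 610.])
    prominence = np.array([0.10, 0.25, 0.60, 0.45, 0.80, 0.15, 0.35, 0.55])

    r = np.corrcoef(f1_hz, prominence)[0, 1]   # Pearson's bivariate correlation
    print(f"Pearson r between F1 and prominence score: {r:.2f}")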
Thu-Ses1-O3 : Segmentation and
Classification
Fallside (East Wing 2), 10:00, Thursday 10 Sept 2009
Chair: Stephen J. Cox, University of East Anglia, UK
An Adaptive BIC Approach for Robust Audio Stream
Segmentation
Janez Žibert 1 , Andrej Brodnik 1 , France Mihelič 2 ; 1 University of Primorska, Slovenia; 2 University of
Ljubljana, Slovenia
Speaker Segmentation and Clustering for
Simultaneously Presented Speech
Lingyun Gu, Richard M. Stern; Carnegie Mellon
University, USA
Thu-Ses1-O3-4, Time: 11:00
Thu-Ses1-O3-1, Time: 10:00
In this paper we focus on audio segmentation. We present a novel
method for the robust estimation of decision thresholds for accurate
detection of acoustic change points in continuous audio streams. In
standard segmentation procedures the decision thresholds are usually
set in advance and need to be tuned on development data. In the
presented approach we remove the need for pre-determined decision
thresholds and propose a method for estimating thresholds directly
from the currently processed audio data. It employs change-detection
methods from two well-established audio segmentation approaches based
on the Bayesian Information Criterion. Following from that, we
develop two audio segmentation procedures which enable us to
adaptively tune boundary-detection thresholds and to combine
different audio representations in the segmentation process. The
proposed segmentation procedures are tested on broadcast news audio
data.
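A minimal sketch of the underlying BIC change-point test on which such procedures build, here with a fixed penalty weight rather than the adaptive thresholds proposed in the paper; the toy 12-dimensional "MFCC" frames are synthetic.

    import numpy as np

    def delta_bic(window, split, lam=1.0):
        """BIC difference between modelling the window with one full-covariance
        Gaussian versus two Gaussians split at `split`. Positive values favour a
        change point at `split`."""
        def logdet(x):
            cov = np.cov(x, rowvar=False) + 1e-6 * np.eye(x.shape[1])
            return np.linalg.slogdet(cov)[1]
        n, d = window.shape
        n1, n2 = split, n - split
        penalty = 0.5 * lam * (d + 0.5 * d * (d + 1)) * np.log(n)
        return (0.5 * n * logdet(window)
                - 0.5 * n1 * logdet(window[:split])
                - 0.5 * n2 * logdet(window[split:])
                - penalty)

    rng = np.random.default_rng(5)
    a = rng.normal(0.0, 1.0, size=(200, 12))       # e.g. MFCC frames, speaker A
    b = rng.normal(1.5, 1.0, size=(200, 12))       # speaker B
    window = np.vstack([a, b])
    splits = range(50, 350, 10)
    scores = [delta_bic(window, s) for s in splits]
    print("most likely change frame:", splits[int(np.argmax(scores))])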
Improving the Robustness of Phonetic Segmentation
to Accent and Style Variation with a Two-Staged
Approach
Vaishali Patil, Shrikant Joshi, Preeti Rao; IIT Bombay,
India
Thu-Ses1-O3-2, Time: 10:20
Correct and temporally accurate phonetic segmentation of speech
utterances is important in applications ranging from transcription
alignment to pronunciation error detection. Automatic speech
recognizers used in these tasks provide insufficient temporal
alignment accuracy, and their recognition performance is sensitive
to accent and style variations relative to the training data. A
two-staged approach combining HMM broad-class recognition with
acoustic-phonetic knowledge based refinement is evaluated for
phonetic segmentation accuracy in the context of accent and style
mismatches with training data.
Signature Cluster Model Selection for Incremental
Gaussian Mixture Cluster Modeling in
Agglomerative Hierarchical Speaker Clustering
Agglomerative hierarchical speaker clustering (AHSC) has been
widely used for classifying speech data by speaker characteristics. Its bottom-up, one-way structure of merging the closest
cluster pair at every recursion step, however, makes it difficult to
recover from incorrect merging. Hence, making AHSC robust to
incorrect merging is an important issue. In this paper we address
this problem in the framework of AHSC based on incremental
Gaussian mixture models, which we previously introduced for
better representing variable cluster size. Specifically, to minimize
contamination in cluster models by heterogeneous data, we select
and keep updating a representative (or signature) model for each
cluster during AHSC. Experiments on meeting speech excerpts (4
hours total) verify that the proposed approach improves average
speaker clustering performance by approximately 20% (relative).
This paper proposes a new scheme for segmenting and clustering speech
segments on an unsupervised basis in cases where multiple speakers
are presented simultaneously at different SNRs. The new elements in
our work are the development of a new feature for segmenting and
clustering simultaneously-presented speech, the procedure for
identifying a candidate set of possible speaker-change points, and
the use of pair-wise cross-segment distance distributions to cluster
segments by speaker. The proposed system is evaluated in terms of the
F measure that is obtained. The system is compared to a baseline
system that uses MFCC for acoustic features, the Bayesian Information
Criterion (BIC) for detecting speaker-change points, and the
Kullback-Leibler distance for clustering the segments. Experimental
results indicate that the new system consistently provides better
performance than the baseline system at very small computational
cost.
Trimmed KL Divergence Between Gaussian Mixtures
for Robust Unsupervised Acoustic Anomaly
Detection
Nash Borges, Gerard G.L. Meyer; Johns Hopkins
University, USA
Thu-Ses1-O3-5, Time: 11:20
In previous work [1], we presented several implementations of
acoustic anomaly detection by training a model on purely normal
data and estimating the divergence between it and other input.
Here, we reformulate the problem in an unsupervised framework
and allow for anomalous contamination of the training data. We
focus exclusively on methods employing Gaussian mixture models
(GMMs) since they are often used in speech processing systems.
After analyzing what caused the Kullback-Leibler (KL) divergence
between GMMs to break down in the face of training contamination, we came up with a promising solution. By trimming one
quarter of the most divergent Gaussians from the mixture model,
we significantly outperformed the untrimmed approximation for
contamination levels of 10% and above, reducing the equal error
rate from 33.8% to 6.4% at 33% contamination. The performance of
the trimmed KL divergence showed no significant dependence on
the investigated contamination levels.
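One possible reading of the trimming idea is sketched below for diagonal-covariance GMMs: the KL divergence is approximated by Monte Carlo per component of the first mixture, and the most divergent quarter of components is discarded before averaging. The models, sample counts and the per-component approximation are illustrative assumptions, not the authors' exact formulation.

    import numpy as np

    def diag_gmm_logpdf(x, weights, means, variances):
        """Log-density of a diagonal-covariance GMM evaluated at rows of x."""
        d = x.shape[1]
        log_comp = []
        for w, m, v in zip(weights, means, variances):
            ll = -0.5 * (np.sum((x - m) ** 2 / v + np.log(v), axis=1) + d * np.log(2 * np.pi))
            log_comp.append(np.log(w) + ll)
        return np.logaddexp.reduce(np.array(log_comp), axis=0)

    def trimmed_kl(p, q, trim=0.25, n_samples=2000, seed=0):
        """Monte Carlo KL(p||q) that discards the fraction `trim` of p's components
        contributing the largest divergence before averaging."""
        rng = np.random.default_rng(seed)
        w_p, mu_p, var_p = p
        per_component = []
        for m, v in zip(mu_p, var_p):
            x = rng.normal(m, np.sqrt(v), size=(n_samples, len(m)))
            per_component.append(np.mean(diag_gmm_logpdf(x, *p) - diag_gmm_logpdf(x, *q)))
        keep = np.argsort(per_component)[: int(np.ceil(len(per_component) * (1 - trim)))]
        w_kept = np.asarray(w_p)[keep]
        return float(np.average(np.asarray(per_component)[keep], weights=w_kept))

    # Two toy 4-component GMMs over 2-D features.
    p = ([0.25] * 4, [np.zeros(2), np.ones(2), -np.ones(2), np.array([3.0, 3.0])],
         [np.ones(2)] * 4)
    q = ([0.25] * 4, [np.zeros(2), np.ones(2), -np.ones(2), np.array([0.5, 0.5])],
         [np.ones(2)] * 4)
    print("trimmed KL:", trimmed_kl(p, q))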
Kyu J. Han, Shrikanth S. Narayanan; University of
Southern California, USA
Thu-Ses1-O3-3, Time: 10:40
How to Loose Confidence: Probabilistic Linear
Machines for Multiclass Classification
Results of the N-Best 2008 Dutch Speech
Recognition Evaluation
Hui Lin 1 , Jeff Bilmes 1 , Koby Crammer 2 ; 1 University of
Washington, USA; 2 University of Pennsylvania, USA
David A. van Leeuwen 1 , Judith Kessens 1 , Eric
Sanders 2 , Henk van den Heuvel 2 ; 1 TNO Human
Factors, The Netherlands; 2 SPEX, The Netherlands
Thu-Ses1-O3-6, Time: 11:40
In this paper we propose a novel multiclass classifier called the
probabilistic linear machine (PLM) which overcomes the low-entropy problem of exponential-based classifiers. Although PLMs
are linear classifiers, we use a careful design of the parameters
matched with weak requirements over the features to output a
true probability distribution over labels given an input instance. We
cast the discriminative learning problem as linear programming,
which can scale up to large problems on the order of millions of
training samples. Our experiments on phonetic classification show
that PLM achieves high entropy while maintaining a comparable
accuracy to other state-of-the-art classifiers.
Thu-Ses1-O4-3, Time: 10:40
In this paper we report the results of a Dutch speech recognition
system evaluation held in 2008. The evaluation contained material in two domains: Broadcast News (BN) and Conversational
Telephone Speech (CTS) and in two main accent regions (Flemish
and Dutch). In total 7 sites submitted recognition results to
the evaluation, totalling 58 different submissions in the various
conditions. Best performances ranged from 15.9% word error rate
for BN, Flemish to 46.1% for CTS, Flemish. This evaluation is the
first of its kind for the Dutch language.
SHoUT, the University of Twente Submission to the
N-Best 2008 Speech Recognition Evaluation for
Dutch
Thu-Ses1-O4 : Evaluation & Standardisation
of SL Technology and Systems
Holmes (East Wing 3), 10:00, Thursday 10 Sept 2009
Chair: Sebastian Möller, Deutsche Telekom Laboratories, Germany
Quantifying Wideband Speech Codec Degradations
via Impairment Factors: The New ITU-T P.834.1
Methodology and its Application to the G.711.1
Codec
Marijn Huijbregts, Roeland Ordelman, Laurens
van der Werff, Franciska M.G. de Jong; University of
Twente, The Netherlands
Thu-Ses1-O4-4, Time: 11:00
Sebastian Möller 1 , Nicolas Côté 1 , Atsuko Kurashima 2 ,
Noritsugu Egi 2 , Akira Takahashi 2 ; 1 Deutsche Telekom
Laboratories, Germany; 2 NTT Corporation, Japan
Thu-Ses1-O4-1, Time: 10:00
Wideband speech codecs usually provide better perceptual speech
quality than their narrowband counterparts, but they still degrade
quality compared to an uncoded transmission path. In order to
quantify these degradations, a new methodology is presented
which derives a one-dimensional quality index on the basis of
instrumental measurements. This index can be used to rank different wideband speech codecs according to their degradations and
to calculate overall quality in conjunction with other degradations,
like packet loss. We apply this methodology to derive respective
indices for the new G.711.1 codec.
SUXES — User Experience Evaluation Method for
Spoken and Multimodal Interaction
Markku Turunen, Jaakko Hakulinen, Aleksi Melto,
Tomi Heimonen, Tuuli Laivo, Juho Hella; University of
Tampere, Finland
Thu-Ses1-O4-2, Time: 10:20
Much work remains to be done with subjective evaluations of
speech-based and multimodal systems. In particular, user experience is still hard to evaluate. SUXES is an evaluation method for
collecting subjective metrics with user experiments. It captures
both user expectations and user experiences, making it possible to
analyze the state of the application and its interaction methods,
and compare results. We present the SUXES method with examples
of user experiments with different applications and modalities.
In this paper we present our primary submission to the first
Dutch and Flemish large vocabulary continuous speech recognition
benchmark, N-Best. We describe our system workflow, the models
we created for the four evaluation tasks and how we approached
the problem of compounding that is typical for a language such as
Dutch. We present the evaluation results and our post-evaluation
analysis.
NIST 2008 Speaker Recognition Evaluation:
Performance Across Telephone and Room
Microphone Channels
Alvin F. Martin, Craig S. Greenberg; NIST, USA
Thu-Ses1-O4-5, Time: 11:20
We describe the 2008 NIST Speaker Recognition Evaluation, including the speech data used, the test conditions included, the
participants, and some of the performance results obtained. This
evaluation was distinguished by including as part of the required
test condition interview-type speech as well as conversational
telephone speech, and speech recorded over microphone channels
as well as speech recorded over telephone lines. Notable was the
relative consistency of best system performance obtained over the
different speech types, including those involving different types
in training and test. Some comparison with performance in prior
evaluations is also discussed.
The Ester 2 Evaluation Campaign for the Rich
Transcription of French Radio Broadcasts
Sylvain Galliano 1 , Guillaume Gravier 2 , Laura
Chaubard 1 ; 1 DGA, France; 2 AFCP, France
Thu-Ses1-O4-6, Time: 11:40
This paper reports on the final results of the Ester 2 evaluation
campaign held from 2007 to April 2009. The aim of this campaign
was to evaluate automatic rich transcription systems for French
radio broadcasts. The evaluation tasks were divided into three main
categories: audio event detection and tracking (e.g., speech vs.
music, speaker tracking), orthographic transcription, and
information extraction. The paper describes the data provided
for the campaign, the task definitions and evaluation protocols as
well as the results.
Soft Decision-Based Acoustic Echo Suppression in a
Frequency Domain
Thu-Ses1-P1 : Speech Coding
Yun-Sik Park, Ji-Hyun Song, Jae-Hun Choi, Joon-Hyuk
Chang; Inha University, Korea
Thu-Ses1-P1-4, Time: 10:00
Hewison Hall, 10:00, Thursday 10 Sept 2009
Chair: Børge Lindberg, Aalborg University, Denmark
Differential Vector Quantization of Feature Vectors
for Distributed Speech Recognition
Jose Enrique Garcia, Alfonso Ortega, Antonio Miguel,
Eduardo Lleida; Universidad de Zaragoza, Spain
Thu-Ses1-P1-1, Time: 10:00
Distributed speech recognition addresses the computational
limitations of mobile devices such as PDAs or mobile phones. Due to
bandwidth restrictions, it is necessary to develop efficient
techniques for transmitting acoustic features in Automatic Speech
Recognition applications. This paper presents a technique for
compressing acoustic feature vectors based on Differential Vector
Quantization, a combination of Vector Quantization and differential
encoding schemes. Recognition experiments have been carried out,
showing that the proposed method outperforms the ETSI standard VQ
system and classical VQ schemes for different codebook lengths and
situations. With the proposed scheme, bit rates as low as 2.1 kbps
can be used without decreasing the performance of the ASR system in
terms of WER compared with a system without quantization.
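The encoding idea can be sketched as follows, assuming a pre-trained codebook of difference vectors; the 13-dimensional toy features and the 64-entry codebook are placeholders, not the codec described above.

    import numpy as np

    def dvq_encode_decode(features, codebook):
        """Differential vector quantisation of a feature-vector sequence: each
        frame's difference from the previously reconstructed frame is replaced by
        the nearest codebook entry; only the codeword indices would be sent."""
        indices, reconstructed = [], []
        previous = np.zeros(features.shape[1])
        for frame in features:
            diff = frame - previous
            idx = int(np.argmin(np.sum((codebook - diff) ** 2, axis=1)))
            previous = previous + codebook[idx]        # decoder-side reconstruction
            indices.append(idx)
            reconstructed.append(previous)
        return np.array(indices), np.array(reconstructed)

    rng = np.random.default_rng(6)
    mfcc = np.cumsum(rng.normal(scale=0.1, size=(50, 13)), axis=0)   # smooth toy MFCC track
    codebook = rng.normal(scale=0.1, size=(64, 13))                  # 6-bit difference codebook
    idx, rec = dvq_encode_decode(mfcc, codebook)
    print("bits per frame:", int(np.log2(len(codebook))),
          "mean squared error:", float(np.mean((mfcc - rec) ** 2)))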
Arithmetic Coding of Sub-Band Residuals in FDLP
Speech/Audio Codec
Petr Motlicek 1 , Sriram Ganapathy 2 , Hynek
Hermansky 2 ; 1 IDIAP Research Institute, Switzerland; 2 Johns Hopkins University, USA
Thu-Ses1-P1-2, Time: 10:00
A speech/audio codec based on Frequency Domain Linear Prediction (FDLP) exploits auto-regressive modeling to approximate
instantaneous energy in critical frequency sub-bands of relatively
long input segments. The current version of the FDLP codec
operating at 66 kbps has been shown to provide comparable
subjective listening quality results to state-of-the-art codecs on
similar bit-rates even without employing standard blocks such as
entropy coding or simultaneous masking. This paper describes
experimental work to increase the compression efficiency of the FDLP
codec by employing entropy coding. Unlike conventional Huffman
coding employed in current speech/audio coding systems, we
describe an efficient way to exploit arithmetic coding to entropy
compress quantized spectral magnitudes of the sub-band FDLP
residuals. Such an approach provides 11% (∼ 3 kbps) bit-rate
reduction compared to the Huffman coding algorithm (∼ 1 kbps).
Pitch Variation Estimation
Tom Bäckström, Stefan Bayer, Sascha Disch;
Fraunhofer IIS, Germany
In this paper, we propose a novel acoustic echo suppression (AES)
technique based on soft decision in the frequency domain. The
proposed approach provides an efficient and unified framework for
procedures such as AES gain computation, AES gain modification using
soft decision, and estimation of relevant parameters, based on the
same statistical model assumption for the near-end and far-end
signals, instead of conventional strategies requiring an additional
residual echo suppression (RES) step. The performance of the proposed
AES algorithm is evaluated by objective tests under various
environments, and better results are obtained compared with the
conventional AES method.
Fine-Granular Scalable MELP Coder Based on
Embedded Vector Quantization
Mouloud Djamah, Douglas O’Shaughnessy; INRS-EMT,
Canada
Thu-Ses1-P1-5, Time: 10:00
This paper presents an efficient codebook design for tree-structured vector quantization (TSVQ), which is embedded in
nature. The federal standard MELP (mixed excitation linear prediction) speech coder is modified by replacing the original single
stage vector quantizer for Fourier magnitudes with a TSVQ and
the original multistage vector quantizer (MSVQ) for line spectral
frequencies (LSF’s) with a multistage TSVQ (MTVQ). The modified
coder is fine-granular bit-rate scalable with gradual change in
quality for the synthetic speech when the number of bits available
for LSF and Fourier magnitudes decoding is decremented bit-by-bit.
Joint Quantization Strategies for Low Bit-Rate
Sinusoidal Coding
Emre Unver, Stephane Villette, Ahmet Kondoz;
University of Surrey, UK
Thu-Ses1-P1-6, Time: 10:00
Transparent speech quality has not been achieved at low bit rates,
especially at 2.4 kbps and below, which is an area of interest for
military and security applications. In this paper, strategies for
low bit rate sinusoidal coding are discussed. Previous work in
the literature on using metaframes and performing variable bit
allocation according to the metaframe type is extended. An optimum metaframe size compromise between delay and quantization
gains is found. A new method for voicing determination from the
LPC shape is also presented. The proposed techniques have been
applied to the SB-LPC vocoder to produce speech at 1.2/0.8 kbps,
and compared to the original SB-LPC vocoder at 2.4/1.2 kbps as
well as an established standard (MELP) at 2.4/1.2/0.6 kbps in a
listening test. It has been found that the proposed techniques have
been effective in reducing the bit-rate while not compromising the
speech quality.
Thu-Ses1-P1-3, Time: 10:00
A method for estimating the normalised pitch variation is described.
While pitch tracking is a classical problem, in applications where
only the change in pitch is required rather than its magnitude, the
main problems of pitch tracking, such as octave jumps and intricate
peak-finding heuristics, can be avoided. The presented approach is
efficient, accurate and unbiased. It was developed for use in speech
and audio coding for pitch variation compensation, but can also be
used as additional information for pitch tracking.
Steganographic Band Width Extension for the AMR
Codec of Low-Bit-Rate Modes
Akira Nishimura; Tokyo University of Information
Sciences, Japan
Thu-Ses1-P1-7, Time: 10:00
This paper proposes a bandwidth extension (BWE) method for
the AMR narrow-band speech codec using steganography, which
is called steganographic BWE herein. The high-band information
is embedded into the pitch delay data of the AMR codec using
an extended quantization-based method that achieves increased
embedding capacity and higher perceived sound quality than the
previous steganographic method. The target bit-rate mode is
below 7 kbps, the level below which the previous steganographic
BWE method did not maintain adequate sound quality. The sound
quality of the steganographic BWE speech signals decoded from
the embedded bitstream is comparable to that of the wide-band
speech signals of the AMR-WB codec at a bit rate of less than
6.7 kbps, with only a slight degradation in the quality relative to
speech signals decoded from the same bitstream by the legacy
AMR decoder.
subvector information for the mel-frequency cepstral coefficients
(MFCCs) is then added as an error protection code. At the same time,
Huffman coding methods are applied to the compressed MFCCs to prevent
the bit-rate increase caused by using such protection codes.
Different Huffman trees for the MFCCs are designed according to the
voicing class, subvector-wise, and their combinations. It is shown
from recognition experiments on the Aurora 4 large vocabulary
database under several noisy channel conditions that the proposed FEC
method is able to achieve a relative average word error rate (WER)
reduction of 9.03∼17.81% compared with the standard DSR system using
no FEC methods.
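As a small illustration of the Huffman component, the sketch below builds a code from the frequencies of (hypothetical) quantiser indices for one subvector and compares the resulting bitstream length with a fixed-length code; the FEC and voicing-class specifics of the method above are not modelled.

    import heapq
    from collections import Counter

    def huffman_code(symbol_counts):
        """Build a Huffman code (symbol -> bit string) from symbol frequencies."""
        heap = [[count, i, {symbol: ""}]
                for i, (symbol, count) in enumerate(symbol_counts.items())]
        heapq.heapify(heap)
        next_id = len(heap)
        while len(heap) > 1:
            lo = heapq.heappop(heap)
            hi = heapq.heappop(heap)
            merged = {s: "0" + c for s, c in lo[2].items()}
            merged.update({s: "1" + c for s, c in hi[2].items()})
            heapq.heappush(heap, [lo[0] + hi[0], next_id, merged])
            next_id += 1
        return heap[0][2]

    # Hypothetical stream of quantiser indices for one MFCC subvector.
    indices = [3, 3, 1, 0, 3, 2, 3, 1, 3, 3, 0, 3, 2, 3, 3, 1]
    code = huffman_code(Counter(indices))
    bitstream = "".join(code[i] for i in indices)
    print(code, len(bitstream), "bits vs", 2 * len(indices), "bits fixed-length")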
Thu-Ses1-P2 : Voice Transformation II
Ultra Low Bit-Rate Speech Coding Based on
Unit-Selection with Joint Spectral-Residual
Quantization: No Transmission of Any Residual
Information
Hewison Hall, 10:00, Thursday 10 Sept 2009
Chair: Tomoki Toda, NAIST, Japan
HMM Adaptation and Voice Conversion for the
Synthesis of Child Speech: A Comparison
V. Ramasubramanian, D. Harish; Siemens Corporate
Technology India, India
Thu-Ses1-P1-8, Time: 10:00
A recent trend in ultra low bit-rate speech coding is based on
segment quantization by unit-selection principle using large continuous codebooks as a unit database. We show that use of such
large unit databases allows speech to be reconstructed at the decoder by using the best unit’s residual itself (in the unit database),
thereby obviating the need to transmit any side information about
the residual of the input speech. For this, it becomes necessary
to jointly quantize the spectral and residual information at the
encoder during unit selection, and we propose various composite
measures for such a joint spectral-residual quantization within
a unit-selection algorithm proposed earlier. We realize ultra low
bit-rate speaker-dependent speech coding at an overall rate of 250
bits/sec using unit database sizes of 19 bits/unit (524288 phone-like units or about 6 hours of speech) with spectral distortions less
than 2.5 dB that retains intelligibility, naturalness, prosody and
speaker-identity.
Oliver Watts 1 , Junichi Yamagishi 1 , Simon King 1 , Kay
Berkling 2 ; 1 University of Edinburgh, UK; 2 Inline
Internet Online Dienste GmbH, Germany
Thu-Ses1-P2-1, Time: 10:00
This study compares two different methodologies for producing
data-driven synthesis of child speech from existing systems that
have been trained on the speech of adults. On one hand, an
existing statistical parametric synthesiser is transformed using
model adaptation techniques, informed by linguistic and prosodic
knowledge, to the speaker characteristics of a child speaker. This
is compared with the application of voice conversion techniques
to convert the output of an existing waveform concatenation
synthesiser with no explicit linguistic or prosodic knowledge. In
a subjective evaluation of the similarity of synthetic speech to
natural speech from the target speaker, the HMM-based systems
evaluated are generally preferred, although this is at least in part
due to the higher dimensional acoustic features supported by
these techniques.
On the Cost of Backward Compatibility for
Communication Codecs
HMM-Based Speaker Characteristics Emphasis Using
Average Voice Model
Konstantin Schmidt, Markus Schnell, Nikolaus
Rettelbach, Manfred Lutzky, Jochen Issing; Fraunhofer
IIS, Germany
Takashi Nose, Junichi Adada, Takao Kobayashi; Tokyo
Institute of Technology, Japan
Thu-Ses1-P1-9, Time: 10:00
Super wideband (SWB) communication is attracting more and more
attention, as can be seen from the standardization activities on SWB
extensions for well-established wideband codecs, e.g. G.722 or
G.711.1. This paper presents a technical solution for extending the
G.722 codec and compares the new technology to other standardized SWB
codecs. A closer look is taken at the concept of extending existing
technologies with more capabilities, in contrast to
non-backward-compatible solutions.
A Media-Specific FEC Based on Huffman Coding for
Distributed Speech Recognition
Young Han Lee, Hong Kook Kim; GIST, Korea
Thu-Ses1-P1-10, Time: 10:00
Thu-Ses1-P2-2, Time: 10:00
This paper presents a technique for controlling and emphasizing
speaker characteristics of synthetic speech. The key idea comes from
the way professional impersonators imitate voices: in voice
imitation, impersonators effectively exaggerate a target speaker’s
voice characteristics. To model and control
the degree of speaker characteristics, we use a speech synthesis
framework based on multiple-regression hidden semi-Markov
model (MRHSMM). In MRHSMM, mean parameters are given by multiple regression of a low-dimensional control vector. The control
vector represents how much the target speaker’s model parameters
are different from those of the average voice model. By changing
the control vector in speech synthesis, we can control the degree
of voice characteristics of the target speaker. Results of subjective
experiments show that the speaker reproducibility of synthetic
speech is improved by emphasizing speaker characteristics.
In this paper, we propose a media-specific forward error correction
(FEC) method based on Huffman coding for distributed speech
recognition (DSR). In order to mitigate the performance degradation of DSR in noisy channel environments, the importance of each
subvector for the DSR system is first explored. As a result, the first
Observation of Empirical Cumulative Distribution of
Vowel Spectral Distances and Its Application to
Vowel Based Voice Conversion
An Evaluation Methodology for Prosody
Transformation Systems Based on Chirp Signals
Damien Lolive, Nelly Barbot, Olivier Boeffard; IRISA,
France
Thu-Ses1-P2-3, Time: 10:00
Evaluation of prosody transformation systems is an important
issue. First, the existing evaluation methodologies focus on parallel
evaluation of systems and are not applicable to compare parallel
and non-parallel systems. Secondly, these methodologies do not
guarantee the independence from other features such as the segmental component. In particular, its influence cannot be neglected
during evaluation and introduces a bias in the listening test. To
answer these problems, we propose an evaluation methodology
that depends only on the melody of the voice and that is applicable
in a non-parallel context. Given a melodic contour, we propose to
build an audio whistle from a chirp signal model. Experimental
results show the efficiency of the proposed method concerning the
discrimination of voices using only their melody information. An
example of transformation function is also given and the results
confirm the applicability of this methodology.
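A minimal sketch of turning a melodic contour into an audio "whistle" via a chirp-like sinusoid, assuming a 10 ms frame hop and a 16 kHz sample rate (both illustrative values, not the paper's settings):

    import numpy as np

    def melody_to_whistle(f0_hz, fs=16000, hop_s=0.010):
        """Render an F0 contour as an audio whistle: interpolate the contour to
        sample rate and integrate the instantaneous frequency to get the phase of
        a sinusoid (a piecewise chirp carrying only the melody)."""
        n_samples = int(len(f0_hz) * hop_s * fs)
        frame_times = np.arange(len(f0_hz)) * hop_s
        sample_times = np.arange(n_samples) / fs
        inst_f0 = np.interp(sample_times, frame_times, f0_hz)
        phase = 2 * np.pi * np.cumsum(inst_f0) / fs
        return 0.5 * np.sin(phase)

    # Hypothetical rising-falling contour sampled every 10 ms.
    contour = np.concatenate([np.linspace(120, 220, 60), np.linspace(220, 110, 60)])
    whistle = melody_to_whistle(contour)
    print(whistle.shape)        # samples ready to be written to a wav file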
Voice Morphing Based on Interpolation of Vocal
Tract Area Functions Using AR-HMM Analysis of
Speech
Yoshiki Nambu, Masahiko Mikawa, Kazuyo Tanaka;
University of Tsukuba, Japan
Thu-Ses1-P2-4, Time: 10:00
This paper presents a new voice morphing method which focuses on the
continuity of phonological identity over all inter- and extrapolated
regions. The main features of the method are 1) separating the
characteristics of vocal tract area resonances from those of vocal
cord waves by using AR-HMM analysis of speech, 2) interpolating in a
log vocal tract area function domain, and 3) morphing the vocal tract
resonance and vocal cord wave characteristics independently. With a
morphing system built on a statistical conversion method, the
continuity of formants and the perceptual differences between a
conventional method and the proposed method are confirmed.
A Novel Model-Based Pitch Conversion Method for
Mandarin Speech
Hsin-Te Hwang, Chen-Yu Chiang, Po-Yi Sung,
Sin-Horng Chen; National Chiao Tung University,
Taiwan
Thu-Ses1-P2-5, Time: 10:00
In this paper, a novel model-based pitch conversion method for
Mandarin is presented and compared with two conventional conversion
methods, i.e. the mean/variance transformation approach and the
GMM-based mapping approach. The syllable pitch contour is first
quantized by 3rd-order orthogonal expansion coefficients; then the
source and target speakers’ prosodic models are constructed. Two
mapping methods based on the prosodic model are presented. Objective
tests confirmed that one of the proposed methods is superior to the
conventional methods. Some findings from informal listening tests and
objective tests are worth further investigation.
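The quantisation of a syllable pitch contour by low-order orthogonal expansion coefficients can be sketched with a discrete least-squares Legendre fit, which may differ in detail from the expansion actually used in the paper; the contour below is synthetic.

    import numpy as np
    from numpy.polynomial import legendre

    def expand_pitch_contour(log_f0, order=3):
        """Represent a syllable log-F0 contour by the coefficients of a
        least-squares fit to Legendre polynomials up to `order`, and return the
        reconstructed (smoothed) contour as well."""
        x = np.linspace(-1.0, 1.0, len(log_f0))     # normalised syllable time axis
        coeffs = legendre.legfit(x, log_f0, order)
        return coeffs, legendre.legval(x, coeffs)

    rng = np.random.default_rng(7)
    contour = np.log(180 + 40 * np.sin(np.linspace(0, np.pi, 20))) + 0.01 * rng.normal(size=20)
    coeffs, smoothed = expand_pitch_contour(contour)
    print("4 coefficients per syllable:", np.round(coeffs, 3))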
Hideki Kawahara 1 , Masanori Morise 2 , Toru
Takahashi 3 , Hideki Banno 4 , Ryuichi Nisimura 1 ,
Toshio Irino 1 ; 1 Wakayama University, Japan; 2 Ritsumeikan University, Japan; 3 Kyoto University,
Japan; 4 Meijo University, Japan
Thu-Ses1-P2-6, Time: 10:00
A simple and fast voice conversion method based only on vowel
information is proposed. The proposed method relies on the empirical
distribution of perceptual spectral distances between representative
examples of each vowel segment, extracted using the TANDEM-STRAIGHT
spectral envelope estimation procedure [1]. Mapping functions of
vowel spectra are designed to preserve the vowel space structure
defined by the observed empirical distribution while transforming the
position and orientation of the structure in an abstract vowel
spectral space. By introducing physiological constraints on vocal
tract shapes and vocal tract length normalization, difficulties in
careful frequency alignment between the vowel template spectra of the
source and target speakers can be alleviated without significant
degradation in the converted speech. The proposed method is a
frame-based instantaneous method and is suitable for real-time
processing. Applications of the proposed method to cross-language
voice conversion are also discussed.
Japanese Pitch Conversion for Voice Morphing
Based on Differential Modeling
Ryuki Tachibana 1 , Zhiwei Shuang 2 , Masafumi
Nishimura 1 ; 1 IBM Tokyo Research Lab, Japan; 2 IBM
China Research Lab, China
Thu-Ses1-P2-7, Time: 10:00
In this paper, we convert the pitch contours predicted by a TTS
system that models a source speaker to resemble the pitch contours of a target speaker. When the speaking styles of the speakers
are very different, complex conversions such as adding or deleting
pitch peaks may be required. Our method performs these conversions by
modeling direct pitch features and differential pitch features
simultaneously, based on linguistic features. The differential
pitch features are calculated from matched pairs of source and
target pitch values. We show experimental results in which the
target speaker’s characteristics are successfully modeled based
on a very limited training corpus. The proposed pitch conversion
method stretches the possibilities of TTS customization for various
speaking styles.
A Novel Technique for Voice Conversion Based on
Style and Content Decomposition with Bilinear
Models
Victor Popa 1 , Jani Nurminen 2 , Moncef Gabbouj 1 ; 1 Tampere University of Technology, Finland; 2 Nokia
Devices R&D, Finland
Thu-Ses1-P2-8, Time: 10:00
This paper presents a novel technique for voice conversion by
solving a two-factor task using bilinear models. The spectral
content of the speech represented as line spectral frequencies
is separated into so-called style and content parameterizations
using a framework proposed in [1]. This formulation of the voice
conversion problem in terms of style and content offers a flexible
representation of factor interactions and facilitates the use of efficient training algorithms based on singular value decomposition
and expectation maximization. Promising results in a comparison
with the traditional Gaussian mixture model based method indicate
increased robustness with small training sets.
Rule-Based Voice Quality Variation with Formant
Synthesis
as other features, such as morphology and prosody. We evaluate
the accuracy of our model at predicting syntactic information on
the POS tagging task against state-of-the-art POS taggers and on
perplexity against the n-gram model.
Improved Language Modelling Using Bag of Word
Pairs
Felix Burkhardt; Deutsche Telekom Laboratories,
Germany
Langzhou Chen, K.K. Chin, Kate Knill; Toshiba
Research Europe Ltd., UK
Thu-Ses1-P2-9, Time: 10:00
We describe an approach to simulating different phonation types,
following John Laver's terminology, by means of a hybrid (rule-based
and unit-concatenating) formant synthesizer. Different voice
qualities were generated by following hints from the literature and
applying the revised KLGLOTT88 model. In a listener perception
experiment, we show that the phonation types are distinguished by
listeners and lead to emotional impressions as predicted by the
literature. The synthesis system and its source code, as well as
audio samples, can be downloaded at
http://emoSyn.syntheticspeech.de/.
Thu-Ses1-P3 : Automatic Speech
Recognition: Language Models II
Thu-Ses1-P3-3, Time: 10:00
The bag-of-words (BoW) method has been used widely in language
modelling and information retrieval. A document is expressed as
a group of words, disregarding grammar and word order. A typical
BoW method is latent semantic analysis
(LSA), which maps the words and documents onto the vectors
in LSA space. In this paper, the concept of BoW is extended to
Bag-of-Word Pairs (BoWP), which expresses the document as a
group of word pairs. Using word pairs as a unit, the system can
capture more complex semantic information than BoW. Under
the LSA framework, the BoWP system is shown to improve both
perplexity and word error rate (WER) compared to a BoW system.
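To make the bag-of-word-pairs representation concrete, here is a small illustrative sketch (not the authors' system): each document is mapped to counts of unordered word pairs and the count matrix is projected with a truncated SVD, exactly as in ordinary LSA but with word pairs as the units.

```python
import numpy as np
from itertools import combinations

def bowp_lsa(docs, n_dims=2):
    """Toy bag-of-word-pairs LSA: represent each document by counts of unordered
    word pairs, then project documents into a latent space with a truncated SVD."""
    pair_vocab = {}
    rows = []
    for doc in docs:
        words = sorted(set(doc.lower().split()))
        counts = {}
        for pair in combinations(words, 2):          # unordered word pairs
            idx = pair_vocab.setdefault(pair, len(pair_vocab))
            counts[idx] = counts.get(idx, 0) + 1
        rows.append(counts)

    X = np.zeros((len(docs), len(pair_vocab)))       # document-by-word-pair matrix
    for i, counts in enumerate(rows):
        for j, c in counts.items():
            X[i, j] = c

    U, s, Vt = np.linalg.svd(X, full_matrices=False) # truncated SVD = LSA projection
    k = min(n_dims, len(s))
    return U[:, :k] * s[:k], pair_vocab

if __name__ == "__main__":
    docs = ["the cat sat on the mat",
            "the dog sat on the log",
            "stock markets fell sharply today"]
    doc_vectors, vocab = bowp_lsa(docs)
    print("latent document vectors:\n", np.round(doc_vectors, 3))
```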
Multiple Text Segmentation for Statistical Language
Modeling
Sopheap Seng 1, Laurent Besacier 1, Brigitte Bigi 1, Eric
Castelli 2; 1 LIG, France; 2 MICA, Vietnam
Thu-Ses1-P3-1, Time: 10:00
In this article we deal with the text segmentation problem in
statistical language modeling for under-resourced languages with
a writing system without word boundary delimiters. While the
lack of text resources has a negative impact on the performance
of language models, the errors introduced by automatic word
segmentation make those data even less usable. To better exploit
the text resources, we propose a method based on weighted finite
state transducers to estimate the N-gram language model from the
training corpus on which each sentence is segmented in multiple
ways instead of a unique segmentation. The multiple segmentation
generates more N-grams from the training corpus and yields
N-grams not found in the unique segmentation. We use this
approach to train language models for automatic speech
recognition systems for Khmer and Vietnamese, and the multiple
segmentation approach leads to better performance than the
unique segmentation approach.
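A minimal sketch of the multiple-segmentation idea, assuming a small known lexicon; the paper builds this with weighted finite state transducers, whereas plain recursion is used here purely for illustration. Every possible segmentation of an unsegmented sentence contributes its N-grams to the counts, so N-grams missed by any single segmentation are still observed.

```python
from collections import Counter

def segmentations(text, lexicon):
    """Enumerate every segmentation of `text` into words from `lexicon`."""
    if not text:
        yield []
        return
    for i in range(1, len(text) + 1):
        word = text[:i]
        if word in lexicon:
            for rest in segmentations(text[i:], lexicon):
                yield [word] + rest

def count_ngrams_all_segmentations(text, lexicon, n=2):
    """Accumulate n-gram counts over all segmentations instead of a single one."""
    counts = Counter()
    for seg in segmentations(text, lexicon):
        padded = ["<s>"] + seg + ["</s>"]
        for j in range(len(padded) - n + 1):
            counts[tuple(padded[j:j + n])] += 1
    return counts

if __name__ == "__main__":
    lexicon = {"a", "ab", "b", "c", "bc"}
    print(count_ngrams_all_segmentations("abc", lexicon, n=2))
```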
Measuring Tagging Performance of a Joint Language
Model
Denis Filimonov, Mary Harper; University of Maryland
at College Park, USA
Thu-Ses1-P3-2, Time: 10:00
Predicting syntactic information in a joint language model (LM)
has been shown not only to improve the model at its main task of
predicting words, but it also allows this information to be passed
to other applications, such as spoken language processing. This
raises the question of just how accurate the syntactic information
predicted by the LM is. In this paper, we present a joint LM
designed not only to scale to large quantities of training data, but
also to be able to utilize fine-grain syntactic information, as well
as other features, such as morphology and prosody. We evaluate
the accuracy of our model at predicting syntactic information on
the POS tagging task against state-of-the-art POS taggers and on
perplexity against the n-gram model.
Morphological Analysis and Decomposition for
Arabic Speech-to-Text Systems
F. Diehl, M.J.F. Gales, M. Tomalin, P.C. Woodland;
University of Cambridge, UK
Thu-Ses1-P3-4, Time: 10:00
Language modelling for a morphologically complex language such
as Arabic is a challenging task. Its agglutinative structure results
in data sparsity problems and high out-of-vocabulary rates. In
this work these problems are tackled by applying the MADA tools
to the Arabic text. In addition to morphological decomposition,
MADA performs context-dependent stem-normalisation. Thus, if
word-level system combination, or scoring, is required this normalisation must be reversed. To address this, a novel context-sensitive
method for morpheme-to-word conversion is introduced. The
performance of the MADA decomposed system was evaluated on
an Arabic broadcast transcription task. The MADA-based system
out-performed the word-based system, with both the morphological decomposition and stem normalisation being found to be
important.
Investigating the Use of Morphological
Decomposition and Diacritization for Improving
Arabic LVCSR
Amr El-Desoky, Christian Gollan, David Rybach, Ralf
Schlüter, Hermann Ney; RWTH Aachen University,
Germany
Thu-Ses1-P3-5, Time: 10:00
One of the challenges in large vocabulary Arabic speech
recognition is the rich morphological nature of the Arabic language,
which leads to both high out-of-vocabulary (OOV) rates and high language
model (LM) perplexities. Another challenge is the absence of the
short vowels (diacritics) from the Arabic written transcripts which
causes a large difference between spoken and written language
and thus a weaker connection between the acoustic and language
models. In this work, we try to address these two important
challenges by introducing both morphological decomposition and
diacritization in Arabic language modeling. Finally, we are able to
obtain about 3.7% relative reduction in word error rate (WER) with
respect to a comparable non-diacritized full-words system running
on our test set.
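The OOV effect of decomposition can be illustrated with a toy computation (this is not the paper's pipeline, and the suffix-stripping "decomposer" below is a hypothetical stand-in for a real morphological analyser): splitting words into morph-like units shrinks the effective vocabulary and lowers the OOV rate on unseen text.

```python
def decompose(word, suffixes=("ing", "ed", "s")):
    """Hypothetical stand-in for a real morphological analyser: split off one
    known suffix if present.  Real Arabic analysers are far richer."""
    for suf in suffixes:
        if word.endswith(suf) and len(word) > len(suf) + 2:
            return [word[: -len(suf)], "+" + suf]
    return [word]

def oov_rate(train_tokens, test_tokens):
    vocab = set(train_tokens)
    unseen = sum(1 for tok in test_tokens if tok not in vocab)
    return unseen / max(1, len(test_tokens))

if __name__ == "__main__":
    train = "the walker walked and the runner runs".split()
    test = "the walkers run and walk".split()

    # Word-level OOV rate versus morph-level OOV rate on the same data.
    word_oov = oov_rate(train, test)
    train_m = [m for w in train for m in decompose(w)]
    test_m = [m for w in test for m in decompose(w)]
    morph_oov = oov_rate(train_m, test_m)
    print(f"word OOV rate:  {word_oov:.2f}")
    print(f"morph OOV rate: {morph_oov:.2f}")
```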
Topic Dependent Language Model Based on Topic
Voting on Noun History
Welly Naptali, Masatoshi Tsuchiya, Seiichi Nakagawa;
Toyohashi University of Technology, Japan
Thu-Ses1-P3-6, Time: 10:00
Language models (LMs) are important in automatic speech recognition systems. In this paper, we propose a new approach to a
topic dependent LM, where the topic is decided in an unsupervised
manner. Latent Semantic Analysis (LSA) is employed to reveal
hidden (latent) relations among nouns in the context words. To
decide the topic of an event, a fixed size word history sequence
(window) is observed, and voting is then carried out based on
noun class occurrences weighted by a confidence measure. Experiments on the Wall Street Journal corpus and Mainichi Shimbun
(Japanese newspaper) corpus show that our proposed method
gives better perplexity than the comparative baselines, including
a word-based/class-based n-gram LM, their interpolated LM, a
cache-based LM, and the Latent Dirichlet Allocation (LDA)-based
topic dependent LM.
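A stripped-down sketch of the voting step is shown below; the noun-to-class mapping and the per-noun confidence weights, which the paper derives from LSA, are simply assumed inputs here.

```python
from collections import defaultdict

def vote_topic(history, noun_class, confidence, window=10):
    """Pick a topic for the next event by confidence-weighted voting over the
    noun classes seen in the last `window` words.
    noun_class: dict noun -> class id (assumed to come from LSA clustering).
    confidence: dict noun -> weight in [0, 1]."""
    scores = defaultdict(float)
    for word in history[-window:]:
        if word in noun_class:
            scores[noun_class[word]] += confidence.get(word, 1.0)
    if not scores:
        return None                      # no noun seen: fall back to a general LM
    return max(scores, key=scores.get)

if __name__ == "__main__":
    noun_class = {"stocks": "finance", "bank": "finance", "goal": "sports"}
    confidence = {"stocks": 0.9, "bank": 0.4, "goal": 0.8}
    history = "the bank said stocks will rise before the goal".split()
    print(vote_topic(history, noun_class, confidence))
```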
A Parallel Training Algorithm for Hierarchical
Pitman-Yor Process Language Models
Songfang Huang, Steve Renals; University of
Edinburgh, UK
Thu-Ses1-P3-9, Time: 10:00
The Hierarchical Pitman-Yor Process Language Model (HPYLM) is
a Bayesian language model based on a non-parametric prior, the
Pitman-Yor Process. It has been demonstrated, both theoretically
and practically, that the HPYLM can provide better smoothing for
language modeling, compared with state-of-the-art approaches
such as interpolated Kneser-Ney and modified Kneser-Ney smoothing. However, estimation of Bayesian language models is expensive
in terms of both computation time and memory; the inference
is approximate and requires a number of iterations to converge.
In this paper, we present a parallel training algorithm for the
HPYLM, which enables the approach to be applied in the context
of automatic speech recognition, using large training corpora
with large vocabularies. We demonstrate the effectiveness of the
proposed algorithm by estimating language models from corpora
for meeting transcription containing over 200 million words, and
observe significant reductions in perplexity and word error rate.
Investigation of Morph-Based Speech Recognition
Improvements Across Speech Genres
Péter Mihajlik, Balázs Tarján, Zoltán Tüske, Tibor
Fegyó; BME, Hungary
Thu-Ses1-P3-7, Time: 10:00
The improvement achieved by changing the basis of speech
recognition from words to morphs (various sub-word units) varies
greatly across tasks and languages. We make an attempt to explore
the source of this variability by the investigation of three LVCSR
tasks corresponding to three speech genres of a highly agglutinative
language. Novel press-conference and broadcast-news
transcription results are presented and compared to spontaneous
speech recognition results in several experimental setups. A
noticeable correlation is observed between an easily computable
characteristic of various language speech recognition tasks and
the relative improvements due to (statistical) morph-based
approaches.
Effective Use of Pause Information in Language
Modelling for Speech Recognition
Kengo Ohta, Masatoshi Tsuchiya, Seiichi Nakagawa;
Toyohashi University of Technology, Japan
Thu-Ses1-P3-8, Time: 10:00
This paper addresses the mismatch between the speech processing
units used by a speech recognizer and the sentences of training
corpora. A standard speech recognizer divides input speech into
processing units based on its power information, whereas the
training corpora of language models are divided into sentences
based on punctuation. There is an inevitable mismatch between
speech processing units and sentences, and neither is optimal
for a spontaneous speech recognition task. This paper addresses
two sub-issues of this problem. First, the words of the preceding
units are used to predict the words of the succeeding units, in
order to address the mismatch between speech processing units
and optimal units. Secondly, we propose a method to build a
language model that includes short pauses from a corpus with no
short pauses, to address the mismatch between speech processing
units and sentences. Their combination achieved a 4.5% relative
improvement over the conventional method in the meeting speech
recognition task.
Probabilistic and Possibilistic Language Models
Based on the World Wide Web
Stanislas Oger, Vladimir Popescu, Georges Linarès; LIA,
France
Thu-Ses1-P3-10, Time: 10:00
Usually, language models are built either from a closed corpus,
or by using World Wide Web retrieved documents, which are considered as a closed corpus themselves. In this paper we propose
several other ways, more adapted to the nature of the Web, of using
this resource for language modeling. We first improve an existing
approach that estimates n-gram probabilities from Web search
engine statistics. We then propose a new way of treating the
information extracted from the Web in a probabilistic framework,
and we also propose relying on Possibility Theory to use this kind
of information effectively. We compare these two
approaches on two automatic speech recognition tasks: (i) transcribing broadcast news data, and (ii) transcribing domain-specific
data, concerning surgical operation film comments. We show that
the two approaches are effective in different situations.
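The Web-statistics idea can be sketched in a few lines. The hit-count function below is a canned placeholder rather than a real search-engine API; the point is only the relative-frequency estimate built from page counts, together with one crude way of reading a count as a possibility degree.

```python
def web_hits(phrase):
    """Placeholder for a search-engine page-count lookup; returns canned numbers
    here so the example runs offline."""
    fake_counts = {"speech recognition": 120000, "speech recognition system": 35000}
    return fake_counts.get(phrase, 0)

def web_ngram_probability(history, word):
    """Probabilistic view: relative frequency of the extended n-gram."""
    denom = web_hits(history)
    return web_hits(f"{history} {word}") / denom if denom else 0.0

def web_ngram_possibility(history, word, threshold=1):
    """One crude possibilistic reading: the n-gram is fully possible if it is
    attested on the Web at least `threshold` times, otherwise not at all."""
    return 1.0 if web_hits(f"{history} {word}") >= threshold else 0.0

if __name__ == "__main__":
    print(web_ngram_probability("speech recognition", "system"))
    print(web_ngram_possibility("speech recognition", "system"))
```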
Thu-Ses1-P4 : Systems for Spoken Language
Understanding
Hewison Hall, 10:00, Thursday 10 Sept 2009
Chair: Renato de Mori, LIA, France
Classification-Based Strategies for Combining
Multiple 5-W Question Answering Systems
Sibel Yaman 1 , Dilek Hakkani-Tür 1 , Gokhan Tur 2 ,
Ralph Grishman 3 , Mary Harper 4 , Kathleen R.
McKeown 5 , Adam Meyers 3 , Kartavya Sharma 5 ; 1 ICSI,
USA; 2 SRI International, USA; 3 New York University,
USA; 4 University of Maryland at College Park, USA;
5 Columbia University, USA
Thu-Ses1-P4-1, Time: 10:00
We describe and analyze inference strategies for combining outputs
from multiple question answering systems each of which was developed independently. Specifically, we address the DARPA-funded
GALE information distillation Year 3 task of finding answers to the
5-Wh questions (who, what, when, where, and why) for each given
sentence. The approach we take revolves around determining
the best system using discriminative learning. In particular, we
train support vector machines with a set of novel features that
encode systems’ capabilities of returning as many correct answers
as possible. We analyze two combination strategies: one combines
multiple systems at the granularity of sentences, and the other at
the granularity of individual fields. Our experimental results indicate that the proposed features and combination strategies were
able to improve the overall performance by 22% to 36% relative to a
random selection, 16% to 35% relative to a majority voting scheme,
and 15% to 23% relative to the best individual system.
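A hedged sketch of combination at the sentence granularity: an SVM is trained to predict which system's answers to trust, from features describing the systems' outputs. The features and data are invented placeholders; the paper's actual features encode each system's ability to return correct answers.

```python
import numpy as np
from sklearn.svm import SVC

# Each row: features describing the candidate answers that the QA systems
# returned for one sentence (placeholder values); the label says which
# system gave the better answer set for that sentence.
X_train = np.array([[0.9, 0.2, 3], [0.1, 0.8, 1], [0.4, 0.7, 2],
                    [0.8, 0.3, 4], [0.2, 0.9, 1], [0.3, 0.6, 2]])
y_train = np.array([0, 1, 1, 0, 1, 1])      # 0 = system A, 1 = system B

selector = SVC(kernel="rbf", C=1.0)
selector.fit(X_train, y_train)

X_new = np.array([[0.85, 0.25, 3]])         # features for a new sentence
best_system = selector.predict(X_new)[0]
print("use answers from system", "A" if best_system == 0 else "B")
```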
Combining Semantic and Syntactic Information
Sources for 5-W Question Answering
Sibel Yaman 1, Dilek Hakkani-Tür 1, Gokhan Tur 2;
1 ICSI, USA; 2 SRI International, USA
Thu-Ses1-P4-2, Time: 10:00
This paper focuses on combining answers generated by a semantic
parser that produces semantic role labels (SRLs) and those
generated by a syntactic parser that produces function tags for
answering 5-W questions, i.e., who, what, when, where, and why.
We take a probabilistic approach in which a system’s ability to
correctly answer 5-W questions is measured with the likelihood
that its answers are produced for the given word sequence. This is
achieved by training statistical language models (LMs) that are used
to predict whether the answers returned by semantic parse or those
returned by the syntactic parser are more likely. We evaluated our
approach using the OntoNotes dataset. Our experimental results
indicate that the proposed LM-based combination strategy was
able to improve the performance of the best individual system in
terms of both F1 measure and accuracy. Furthermore, the error
rates for each question type were also significantly reduced with
the help of the proposed approach.
Phrase and Word Level Strategies for Detecting
Appositions in Speech
Benoit Favre, Dilek Hakkani-Tür; ICSI, USA
Thu-Ses1-P4-3, Time: 10:00
Appositions are grammatical constructs in which two noun phrases
are placed side-by-side, one modifying the other. Detecting them in
speech can help extract semantic information useful, for instance,
for co-reference resolution and question answering. We compare
and combine three approaches: word-level and phrase-level classifiers, and a syntactic parser trained to generate appositions. On
reference parses, the phrase-level classifier outperforms the other
approaches while on automatic parses and ASR output, the combination of the apposition-generating parser and the word-level
classifier works best. An analysis of the system errors reveals that
parsing accuracy and world knowledge are very important for this
task.
Transformation-Based Learning for Semantic Parsing
F. Jurčíček, M. Gašić, S. Keizer, F. Mairesse, B. Thomson,
K. Yu, S. Young; University of Cambridge, UK
Thu-Ses1-P4-5, Time: 10:00
This paper presents a semantic parser that transforms an initial
semantic hypothesis into the correct semantics by applying an
ordered list of transformation rules. These rules are learnt automatically from a training corpus with no prior linguistic knowledge
and no alignment between words and semantic concepts. The
learning algorithm produces a compact set of rules which enables
the parser to be very efficient while retaining high accuracy. We
show that this parser is competitive with state-of-the-art semantic
parsers on the ATIS and TownInfo tasks.
Large-Scale Polish SLU
Patrick Lehnen 1 , Stefan Hahn 1 , Hermann Ney 1 ,
Agnieszka Mykowiecka 2 ; 1 RWTH Aachen University,
Germany; 2 Polish Academy of Sciences, Poland
Thu-Ses1-P4-6, Time: 10:00
In this paper, we present state-of-the-art concept tagging results on a new corpus for Polish SLU. For this language, it is
the first large-scale corpus (∼200 different concepts) which has
been semantically annotated and will be made publicly available.
Conditional Random Fields have proven to lead to best results
for string-to-string translation problems. Using this approach, we
achieve a concept error rate of 22.6% on an evaluation corpus. To
additionally extract attribute values, a combination of a statistical
and a rule-based approach is used leading to a CER of 30.2%.
Optimizing CRFs for SLU Tasks in Various
Languages Using Modified Training Criteria
Stefan Hahn, Patrick Lehnen, Georg Heigold, Hermann
Ney; RWTH Aachen University, Germany
Thu-Ses1-P4-7, Time: 10:00
In this paper, we present improvements of our state-of-the-art
concept tagger based on conditional random fields. Statistical
models have been optimized for three tasks of varying complexity
in three languages (French, Italian, and Polish). Modified training
criteria have been investigated leading to small improvements. The
respective corpora as well as parameter optimization results for
all models are presented in detail. A comparison of the selected
features between languages as well as a close look at the tuning
of the regularization parameter is given. The experimental results
show to what extent the optimizations of the single systems are
portable across languages.
Error Correction of Proportions in Spoken Opinion
Surveys
Nathalie Camelin 1, Renato De Mori 1, Frederic Bechet 1,
Géraldine Damnati 2; 1 LIA, France; 2 Orange Labs,
France
Thu-Ses1-P4-4, Time: 10:00
The paper analyzes the types of errors encountered in automatic
spoken surveys. These errors are different from the ones that
appear when surveys are taken by humans because they are caused
by the imprecision of an automatic system. Previous studies
presented a strategy that consists in the robust detection of
subjective opinions about a particular topic in a spoken message.
If the same automatic system is used for estimating opinion
proportions in different spoken surveys, then the error rate of the
entire automatic process should not vary too much across surveys
for each type of opinion. Based on this conjecture, a linear error
model is derived and used for error correction. Experimental
results obtained with data from a real-world deployed system show
significant error reductions in the automatic estimation of
proportions in spoken surveys.
Learning Lexicons from Spoken Utterances Based on
Statistical Model Selection
Ryo Taguchi 1, Naoto Iwahashi 2, Takashi Nose 3,
Kotaro Funakoshi 4, Mikio Nakano 4; 1 ATR, Japan;
2 NICT, Japan; 3 Tokyo Institute of Technology, Japan;
4 Honda Research Institute Japan Co. Ltd., Japan
Thu-Ses1-P4-8, Time: 10:00
This paper proposes a method for the unsupervised learning of
lexicons from pairs of a spoken utterance and an object as its
meaning without any a priori linguistic knowledge other than a
phoneme acoustic model. In order to obtain a lexicon, a statistical
model of the joint probability of a spoken utterance and an object
is learned based on the minimum description length principle.
This model consists of a list of word phoneme sequences and three
statistical models: the phoneme acoustic model, a word-bigram
model, and a word meaning model. Experimental results show that
the method can acquire acoustically, grammatically and semantically appropriate words with about 85% phoneme accuracy.
Improving Speech Understanding Accuracy with
Limited Training Data Using Multiple Language
Models and Multiple Understanding Models
Masaki Katsumaru 1 , Mikio Nakano 2 , Kazunori
Komatani 1 , Kotaro Funakoshi 2 , Tetsuya Ogata 1 ,
Hiroshi G. Okuno 1 ; 1 Kyoto University, Japan; 2 Honda
Research Institute Japan Co. Ltd., Japan
Thu-Ses1-P4-9, Time: 10:00
We aim to improve a speech understanding module with a small
amount of training data. A speech understanding module uses a
language model (LM) and a language understanding model (LUM).
A lot of training data are needed to improve the models. Such data
collection is, however, difficult in an actual process of development. We therefore design and develop a new framework that uses
multiple LMs and LUMs to improve speech understanding accuracy
under various amounts of training data. Even if the amount of
available training data is small, each LM and each LUM can deal
well with different types of utterances and more utterances are
understood by using multiple LMs and LUMs. As one implementation
of the framework, we develop a method for selecting the most
appropriate speech understanding result from several candidates.
The selection is based on probabilities of correctness calculated
by logistic regressions. We evaluate our framework with various
amounts of training data.
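The selection step can be sketched as follows, with invented features standing in for the real ones: a logistic regression estimates the probability that each candidate understanding result is correct, and the candidate with the highest probability is kept.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Training data: one feature vector per candidate understanding result
# (e.g. ASR confidence, LM score, parser coverage; values invented here),
# labelled 1 if the candidate was correct and 0 otherwise.
X_train = np.array([[0.9, -12.0, 1.0], [0.4, -30.0, 0.5], [0.7, -18.0, 0.9],
                    [0.2, -40.0, 0.3], [0.8, -15.0, 0.8], [0.3, -35.0, 0.4]])
y_train = np.array([1, 0, 1, 0, 1, 0])

model = LogisticRegression().fit(X_train, y_train)

def select_best(candidates):
    """Return the candidate whose predicted probability of correctness is highest."""
    probs = model.predict_proba(np.array([c["features"] for c in candidates]))[:, 1]
    return candidates[int(np.argmax(probs))], float(probs.max())

if __name__ == "__main__":
    candidates = [
        {"result": "frame A from LM1+LUM1", "features": [0.85, -14.0, 0.9]},
        {"result": "frame B from LM2+LUM2", "features": [0.35, -33.0, 0.4]},
    ]
    best, p = select_best(candidates)
    print(best["result"], f"(p(correct) ~ {p:.2f})")
```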
Low-Cost Call Type Classification for Contact Center
Calls Using Partial Transcripts
Youngja Park, Wilfried Teiken, Stephen C. Gates; IBM
T.J. Watson Research Center, USA
Thu-Ses1-P4-10, Time: 10:00
Call type classification and topic classification for contact center
calls using automatically generated transcripts is not yet widely
available mainly due to the high cost and low accuracy of call-center
grade automatic speech transcription. To address these challenges,
we examine if using only partial conversations yields accuracy
comparable to using the entire customer-agent conversations. We
exploit two interesting characteristics of call center calls. First,
contact center calls are highly scripted following prescribed steps,
and the customer's problem or request (i.e., the determinant of the
call type) is typically stated in the beginning of a call. Thus, using
only the beginning of calls may be sufficient to determine the call
type. Second, agents often more clearly repeat or rephrase what
customers said, so it may be sufficient to process only the agents'
speech. Our experiments with 1,677 customer calls show that two
partial transcripts, comprising only the agents' utterances and the
first 40 speaker turns respectively, actually produce slightly higher
classification accuracy than a transcript set comprising the entire
conversations. In addition, using partial conversations can
significantly reduce the cost of speech transcription.
A New Quality Measure for Topic Segmentation of
Text and Speech
Mehryar Mohri, Pedro Moreno, Eugene Weinstein;
Google Inc., USA
Thu-Ses1-P4-11, Time: 10:00
The recent proliferation of large multimedia collections has attracted immense attention from the speech research community,
because speech recognition enables the transcription and indexing
of such collections. Topicality information can be used to improve
transcription quality and enable content navigation. In this paper,
we give a novel quality measure for topic segmentation algorithms
that improves over previously used measures.
Our measure
takes into account not only the presence or absence of topic
boundaries but also the content of the text or speech segments
labeled as topic-coherent. Additionally, we demonstrate that topic
segmentation quality of spoken language can be improved using
speech recognition lattices. Using lattices, improvements over the
baseline one-best topic model are observed when measured with
the previously existing topic segmentation quality measure, as well
as the new measure proposed in this paper (9.4% and 7.0% relative
error reduction, respectively).
Concept Segmentation and Labeling for
Conversational Speech
Marco Dinarelli, Alessandro Moschitti, Giuseppe
Riccardi; Università di Trento, Italy
Thu-Ses1-P4-12, Time: 10:00
Spoken Language Understanding performs automatic concept
labeling and segmentation of speech utterances. For this task,
many approaches have been proposed based on both generative and discriminative models. While all these methods have
shown remarkable accuracy on manual transcription of spoken
utterances, robustness to noisy automatic transcription is still
an open issue. In this paper we study algorithms for Spoken
Language Understanding combining complementary learning
models: Stochastic Finite State Transducers produce a list of
hypotheses, which are re-ranked using a discriminative algorithm
based on kernel methods. Our experiments on two different
spoken dialog corpora, MEDIA and LUNA, show that the combined
generative-discriminative model matches state-of-the-art approaches
such as Conditional Random Fields (CRF) on manual transcriptions,
and it is robust to noisy automatic transcriptions, outperforming
the state of the art in some cases.
Thu-Ses1-S1 : Special Session: New
Approaches to Modeling Variability for
Automatic Speech Recognition
Ainsworth (East Wing 4), 10:00, Thursday 10 Sept 2009
Chair: Carol Y. Espy-Wilson, University of Maryland at College
Park, USA and Jennifer Cole, University of Illinois at
Urbana-Champaign, USA
Introductory Remarks
Carol Y. Espy-Wilson 1 , Jennifer Cole 2 ; 1 University of
Maryland at College Park, USA; 2 University of Illinois
at Urbana-Champaign, USA
Thu-Ses1-S1-0, Time: 10:00
A Noise-Type and Level-Dependent MPO-Based
Speech Enhancement Architecture with Variable
Frame Analysis for Noise-Robust Speech
Recognition
Vikramjit Mitra 1 , Bengt J. Borgstrom 2 , Carol Y.
Espy-Wilson 1 , Abeer Alwan 2 ; 1 University of Maryland
at College Park, USA; 2 University of California at Los
Angeles, USA
Thu-Ses1-S1-1, Time: 10:10
In previous work, a speech enhancement algorithm based on phase
opponency and a periodicity measure (MPO-APP) was developed
for speech recognition. Axiomatic thresholds were used in the
MPO-APP regardless of the signal-to-noise ratio (SNR) of the corrupted speech or any characterization of the noise. The current
work developed an algorithm for adjusting the threshold in the
MPO-APP based on the SNR and whether the speech signal is clean,
corrupted by aperiodic noise or corrupted with noise with periodic
components. In addition, variable frame rate (VFR) analysis has
been incorporated so that dynamic regions in the speech signal are
more heavily sampled than steady-state regions. The result is a
2-stage algorithm that gives superior performance to the previous
MPO-APP, and to several other state-of-the-art speech enhancement
algorithms.
Complementarity of MFCC, PLP and Gabor Features
in the Presence of Speech-Intrinsic Variabilities
Bernd T. Meyer, Birger Kollmeier; Carl von Ossietzky
Universität Oldenburg, Germany
Thu-Ses1-S1-2, Time: 10:25
In this study, the effect of speech-intrinsic variabilities such as
speaking rate, effort and speaking style on automatic speech
recognition (ASR) is investigated. We analyze the influence of such
variabilities as well as extrinsic factors (i.e., additive noise) on the
most common features in ASR (mel-frequency cepstral coefficients
and perceptual linear prediction features) and spectro-temporal
Gabor features. MFCCs performed best for clean speech, whereas
Gabor features were found to be the most robust under extrinsic
variabilities. Intrinsic variations were found to have a strong impact
on error rates. While performance with MFCCs and PLPs was
degraded in much the same way, Gabor features exhibit a different
sensitivity towards these variabilities and are, e.g., well-suited to
recognize speech with varying pitch. The results suggest that
spectro-temporal and classic features carry complementary
information, which could be exploited in feature-stream experiments.
Noise Robustness of Tract Variables and their
Application to Speech Recognition
Vikramjit Mitra 1, Hosung Nam 2, Carol Y. Espy-Wilson 1,
Elliot Saltzman 2, Louis Goldstein 3; 1 University of
Maryland at College Park, USA; 2 Haskins Laboratories,
USA; 3 University of Southern California, USA
Thu-Ses1-S1-3, Time: 10:40
This paper analyzes the noise robustness of vocal tract constriction
variable estimation and investigates their role for noise robust
speech recognition. We implemented a simple direct inverse model
using a feed-forward artificial neural network to estimate vocal
tract variables (TVs) from the speech signal. Initially, we trained the
model on clean synthetic speech and then tested the noise robustness
of the model on noise-corrupted speech. The training corpus was
obtained from the TAsk Dynamics Application model (TADA [1]),
which generated the synthetic speech as well as their corresponding TVs. Eight different vocal tract constriction variables consisting
of five constriction degree variables (lip aperture [LA], tongue body
[TBCD], tongue tip [TTCD], velum [VEL], and glottis [GLO]) and three
constriction location variables (lip protrusion [LP], tongue tip
[TTCL], tongue body [TBCL]) were considered in this study. We
also explored using a modified phase opponency (MPO) [2] speech
enhancement technique as the preprocessor for TV estimation to
observe its effect upon noise robustness. Kalman smoothing was
applied to the estimated TVs to reduce the estimation noise. Finally
the TV estimation module was tested using a naturally-produced
speech that is contaminated with noise at different signal-to-noise
ratios. The estimated TVs from the natural speech corpus are
then used in conjunction with the baseline features to perform
automatic speech recognition (ASR) experiments. Results show
an average 22% and 21% improvement, relative to the baseline, on
ASR performance using the Aurora-2 dataset with car and subway
noise, respectively. The TVs in these experiments are estimated
from the MPO-enhanced speech.
Articulatory Phonological Code for Word
Classification
Xiaodan Zhuang 1 , Hosung Nam 2 , Mark
Hasegawa-Johnson 1 , Louis Goldstein 2 , Elliot
Saltzman 2 ; 1 University of Illinois at
Urbana-Champaign, USA; 2 Haskins Laboratories, USA
Thu-Ses1-S1-4, Time: 10:55
We propose a framework that leverages articulatory phonology
for speech recognition. “Gestural pattern vectors” (GPV) encode
the instantaneous gestural activations that exist across all tract
variables at each time. Given a speech observation, recognizing
the sequence of GPV recovers the ensemble of gestural activations,
i.e., the gestural score. For each word in the vocabulary, we use
a task dynamic model of inter-articulator speech coordination
to generate the “canonical” gestural score. Speech recognition
is achieved by matching the ensemble of gestural activations.
In particular, we estimate the likelihood of the recognized GPV
sequence on word-dependent GPV sequence models trained using
the “canonical” gestural scores. These likelihoods, weighted by
confidence score of the recognized GPVs, are used in a Bayesian
speech recognizer.
Pilot gestural score recovery and word classification experiments
are carried out using synthesized data from one speaker. The
observation distribution of each GPV is modeled by an artificial
neural network and Gaussian mixture tandem model. Bigram
GPV sequence models are used to distinguish gestural scores of
different words. Given the tract variable time functions, about
80% of the instantaneous gestural activation is correctly recovered.
Word recognition accuracy is over 85% for a vocabulary of 139
words with no training observations. These results suggest that
the proposed framework might be a viable alternative to the classic
sequence-of-phones model.
Robust Keyword Spotting with Rapidly Adapting
Point Process Models
Aren Jansen, Partha Niyogi; University of Chicago, USA
Thu-Ses1-S1-5, Time: 11:10
In this paper, we investigate the noise robustness properties of
frame-based and sparse point process-based models for spotting
keywords in continuous speech. We introduce a new strategy
to improve point process model (PPM) robustness by adapting
low-level feature detector thresholds to preserve background firing
rates in the presence of noise. We find that this unsupervised
approach can significantly outperform fully supervised maximum
likelihood linear regression (MLLR) adaptation of an equivalent
keyword-filler HMM system in the presence of additive white and
pink noise. Moreover, we find that the sparsity of PPMs introduces
an inherent resilience to non-stationary babble noise not exhibited
by the frame-based HMM system. Finally, we demonstrate that
our approach requires less adaptation data than MLLR, permitting
rapid online adaptation.
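The threshold-adaptation idea can be illustrated schematically: if a low-level feature detector fired on some fraction of background frames in clean conditions, its threshold in noise can be reset to the quantile of the noisy detector outputs that reproduces that firing rate. This is only one reading of the strategy, not the authors' implementation.

```python
import numpy as np

def adapt_threshold(noisy_background_scores, target_firing_rate):
    """Choose a detector threshold so that the detector fires on roughly
    `target_firing_rate` of the noisy background frames, matching the rate
    observed in clean conditions."""
    scores = np.asarray(noisy_background_scores, dtype=float)
    return float(np.quantile(scores, 1.0 - target_firing_rate))

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    clean_rate = 0.05                      # detector fired on 5% of clean background frames
    noisy_scores = rng.normal(loc=0.3, scale=0.2, size=10_000)  # detector outputs in noise
    thr = adapt_threshold(noisy_scores, clean_rate)
    print(f"adapted threshold: {thr:.3f}")
    print("achieved firing rate:", float((noisy_scores > thr).mean()))
```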
Automatically Rating Pronunciation Through
Articulatory Phonology
Joseph Tepperman, Louis Goldstein, Sungbok Lee,
Shrikanth S. Narayanan; University of Southern
California, USA
Thu-Ses1-S1-6, Time: 11:25
Articulatory Phonology’s link between cognitive speech planning
and the physical realizations of vocal tract constrictions has implications for speech acoustic and duration modeling that should be
useful in assigning subjective ratings of pronunciation quality to
nonnative speech. In this work, we compare traditional phoneme
models used in automatic speech recognition to similar models for
articulatory gestural pattern vectors, each with associated duration
models. What we find is that, on the CDT corpus, gestural models
outperform the phoneme-level baseline in terms of correlation
with listener ratings, and in combination phoneme and gestural
models outperform either one alone. This also validates previous
findings with a similar (but not gesture-based) pseudo-articulatory
representation.
General Discussion
Time: 11:40
Thu-Ses2-O1 : User Interactions in Spoken
Dialog Systems
Main Hall, 13:30, Thursday 10 Sept 2009
Chair: Roberto Pieraccini, SpeechCycle Labs, USA
Learning the Structure of Human-Computer and
Human-Human Dialogs
David Griol 1, Giuseppe Riccardi 2, Emilio Sanchis 3;
1 Universidad Carlos III de Madrid, Spain; 2 Università di
Trento, Italy; 3 Universidad Politécnica de Valencia, Spain
Thu-Ses2-O1-1, Time: 13:30
We are interested in the problem of understanding human
conversation structure in the context of human-machine and
human-human interaction. We present a statistical methodology
for detecting the structure of spoken dialogs based on a generative
model learned using decision trees. To evaluate our approach we
have used the LUNA corpora, collected from real users engaged
in problem solving tasks. The results of the evaluation show that
automatic segmentation of spoken dialogs is very effective not
only with models built separately using human-machine dialogs
or human-human dialogs, but it is also possible to infer the
task-related structure of human-human dialogs with a model
learned using only human-machine dialogs.
Pause and Gap Length in Face-to-Face Interaction
Jens Edlund 1, Mattias Heldner 1, Julia Hirschberg 2;
1 KTH, Sweden; 2 Columbia University, USA
Thu-Ses2-O1-2, Time: 13:50
It has long been noted that conversational partners tend to exhibit
increasingly similar pitch, intensity, and timing behavior over
the course of a conversation. However, the metrics developed to
measure this similarity to date have generally failed to capture
the dynamic temporal aspects of this process. In this paper, we
propose new approaches to measuring interlocutor similarity in
spoken dialogue. We define similarity in terms of convergence and
synchrony and propose approaches to capture these, illustrating
our techniques on gap and pause production in Swedish spontaneous dialogues.
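As a rough illustration of the convergence/synchrony distinction (with invented data and deliberately simple statistics, not the measures proposed in the paper): convergence can be read off the trend of the absolute difference between the two speakers' pause durations, while synchrony can be read off the correlation of their variations.

```python
import numpy as np

def convergence_and_synchrony(pauses_a, pauses_b):
    """pauses_a, pauses_b: per-exchange pause durations (seconds) for speakers A and B.
    Returns (convergence, synchrony):
      convergence > 0 -> the speakers' pause durations drift closer over time;
      synchrony       -> correlation of the two speakers' pause variations."""
    a = np.asarray(pauses_a, float)
    b = np.asarray(pauses_b, float)
    diff = np.abs(a - b)
    t = np.arange(len(diff))
    slope = np.polyfit(t, diff, 1)[0]       # trend of the absolute difference
    convergence = -slope                    # shrinking difference = positive convergence
    synchrony = float(np.corrcoef(a, b)[0, 1])
    return float(convergence), synchrony

if __name__ == "__main__":
    a = [0.80, 0.74, 0.70, 0.66, 0.60, 0.58]
    b = [0.40, 0.45, 0.50, 0.52, 0.55, 0.56]
    print(convergence_and_synchrony(a, b))
```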
Modeling Other Talkers for Improved Dialog Act
Recognition in Meetings
Kornel Laskowski 1 , Elizabeth Shriberg 2 ; 1 Carnegie
Mellon University, USA; 2 SRI International, USA
Thu-Ses2-O1-3, Time: 14:10
Automatic dialog act (DA) modeling has been shown to benefit
meeting understanding, but current approaches to DA recognition
tend to suffer from a common problem: they under-represent
behaviors found at turn edges, during which the “floor” is negotiated among meeting participants. We propose a new approach
that takes into account speech from other talkers, relying only
on speech/non-speech information from all participants. We
find (1) that modeling other participants improves DA detection,
even in the absence of other information, (2) that only the single
locally most talkative other participant matters, and (3) that 10
seconds provides a sufficiently large local context. Results further
show significant performance improvements over a lexical-only
system — particularly for the DAs of interest. We conclude that
interaction-based modeling at turn edges can be achieved by
relatively simple features and should be incorporated for improved
meeting understanding.
A Closer Look at Quality Judgments of Spoken
Dialog Systems
Klaus-Peter Engelbrecht, Felix Hartard, Florian Gödde,
Sebastian Möller; Deutsche Telekom Laboratories,
Germany
Thu-Ses2-O1-4, Time: 14:30
User judgments of Spoken Dialog Systems provide evaluators of
such systems with a valid measure of their overall quality. Models
for the automatic prediction of user judgments have been built,
following the introduction of PARADISE [1]. Main applications
are the comparison of systems, the analysis of parameters affecting quality, and the adoption of dialog management strategies.
However, a common model which applies to different systems
and users has not been found so far. With the aim of getting a
closer insight into the quality-relevant characteristics of spoken
interactions, an experiment was conducted where 25 users judged
the same 5 dialogs. User judgments were collected after each
dialog turn. The paper presents an analysis of the obtained results
and some conclusions for future work.
New Methods for the Analysis of Repeated
Utterances
Geoffrey Zweig; Microsoft Research, USA
Thu-Ses2-O1-5, Time: 14:50
This paper proposes three novel and effective procedures for
jointly analyzing repeated utterances. First, we propose repetitiondriven system switching, where repetition triggers the use of an
independent backup system for decoding. Second, we propose a
cache language model for use with the second utterance. Finally,
we propose a method with which the acoustics from multiple
utterances — not necessarily exact repetitions of each other —
can be combined into a composite that increases accuracy.
The combination of all methods produces a relative increase in
sentence accuracy of 65.7% for repeated voice-search queries.
The Effects of Different Voices for Speech-Based
In-Vehicle Interfaces: Impact of Young and Old
Voices on Driving Performance and Attitude
Ing-Marie Jonsson, Nils Dahlbäck; Linköping
University, Sweden
Thu-Ses2-O1-6, Time: 15:10
This paper investigates how matching age of driver with age of
voice in a conversational in-vehicle information system affects
attitudes and performance. 36 participants from two age groups,
55–75 and 18–25, interacted with a conversational system with a
young or an old voice in a driving simulator. Results show that all
drivers communicated more readily with the young voice than with
the old voice. This willingness to communicate had a detrimental effect
on driving performance. It is hence important to carefully select
voices, since voice properties can have enormous effects on driving
safety. Clearly, one voice doesn’t fit all.
Thu-Ses2-O2 : Production: Articulation and
Acoustics
Jones (East Wing 1), 13:30, Thursday 10 Sept 2009
Chair: Denis Burnham, University of Western Sydney, Australia
In Search of Non-Uniqueness in the
Acoustic-to-Articulatory Mapping
G. Ananthakrishnan, D. Neiberg, Olov Engwall; KTH,
Sweden
Thu-Ses2-O2-1, Time: 13:30
This paper explores the possibility and extent of non-uniqueness
in the acoustic-to-articulatory inversion of speech, from a
statistical point of view. It proposes a technique to estimate the
non-uniqueness, based on finding peaks in the conditional
probability function of the articulatory space. The paper
corroborates the existence of non-uniqueness in a statistical sense,
especially in stop consonants, nasals and fricatives. The
relationship between the importance of the articulator position
and non-uniqueness at each instance is also explored.
Estimation of Articulatory Gesture Patterns from
Speech Acoustics
Prasanta Kumar Ghosh 1, Shrikanth S. Narayanan 1,
Pierre Divenyi 2, Louis Goldstein 1, Elliot Saltzman 3;
1 University of Southern California, USA; 2 EBIRE, USA;
3 Haskins Laboratories, USA
Thu-Ses2-O2-2, Time: 13:50
We investigated dynamic programming (DP) and state-model (SM)
approaches for estimating gestural scores from speech acoustics.
We performed a word-identification task using the gestural pattern
vector sequences estimated by each approach. For a set of 75
randomly chosen words, we obtained the best word-identification
accuracy (66.67%) using the DP approach. This result implies that
considerable support for lexical access during speech perception
might be provided by such a method of recovering gestural
information from acoustics.
Formant Trajectories for Acoustic-to-Articulatory
Inversion
İ. Yücel Özbek 1, Mark Hasegawa-Johnson 2, Mübeccel
Demirekler 1; 1 Middle East Technical University,
Turkey; 2 University of Illinois at Urbana-Champaign,
USA
Thu-Ses2-O2-3, Time: 14:10
This work examines the utility of formant frequencies and their
energies in acoustic-to-articulatory inversion. For this purpose,
formant frequencies and formant spectral amplitudes are automatically estimated from audio, and are treated as observations
for the purpose of estimating electromagnetic articulography
(EMA) coil positions. A mixture Gaussian regression model with
mel-frequency cepstral (MFCC) observations is modified by using
formants and energies to either replace or augment the MFCC
observation vector. The augmented observation results in 3.4%
lower RMS error, and 2% higher correlation coefficient, than the
baseline MFCC observation. Improvement is especially good for
stop consonants, possibly because formant tracking provides information about the acoustic resonances that would be otherwise
unavailable during stop closure and release.
A Robust Variational Method for the
Acoustic-to-Articulatory Problem
Blaise Potard, Yves Laprie; LORIA, France
Thu-Ses2-O2-4, Time: 14:30
This paper presents a novel acoustic-to-articulatory inversion
method based on an articulatory synthesizer and variational calculus, without the need for an initial trajectory. Validation in ideal
conditions is performed to show the potential of the method, and
the performances are compared to codebook based methods. We
also investigate the precision of the articulatory trajectories found
for various acoustic vectors dimensions. Possible extensions are
discussed.
Comparison of Vowel Structures of Japanese and
English in Articulatory and Auditory Spaces
Jianwu Dang 1 , Mark Tiede 2 , Jiahong Yuan 3 ; 1 JAIST,
Japan; 2 Haskins Laboratories, USA; 3 University of
Pennsylvania, USA
Thu-Ses2-O2-5, Time: 14:50
In previous work [1] we investigated the vowel structures of
Japanese in both articulatory space and auditory perceptual space
using Laplacian eigenmaps, and examined relations between
speech production and perception. The results showed that the
inherent structures of Japanese vowels were consistent in the
two spaces. To verify whether such a property generalizes to
other languages, we use the same approach to investigate the
more crowded English vowel space. Results show that the vowel
structure reflects the articulatory features for both languages.
The degree of tongue-palate approximation is the most important
feature for vowels, followed by the open ratio of the mouth to
oral cavity. The topological relations of the vowel structures are
consistent with both the articulatory and auditory perceptual
spaces; in particular the lip-protruded vowel /UW/ of English was
distinct from the unrounded Japanese /W/. The rhotic vowel
/ER/ was located apart from the surface constructed by the other
vowels, where the same phenomena appeared in both spaces.
The Articulatory and Acoustic Impact of Scottish
English /r/ on the Preceding Vowel-Onset
Janine Lilienthal; Queen Margaret University, UK
Thu-Ses2-O2-6, Time: 15:10
This paper demonstrates the use of smoothing spline ANOVA
and T tests to analyze whether the influence of syllable final
consonants on the preceding vowel differs for articulation and
acoustics. The onset of vowels either followed by phrase-final
/r/ or by phrase-initial /r/ is compared for two Scottish English
speakers. To measure articulatory differences of opposing vowel
pairs, smoothing splines of midsagittal tongue shape recorded
via ultrasound imaging are compared. For the acoustic data,
differences of the first two formant frequencies at the onset are
tested. The results confirm that there is no 1:1 mapping between
articulation and acoustics.
Thu-Ses2-O3 : Features for Speech and
Speaker Recognition
Fallside (East Wing 2), 13:30, Thursday 10 Sept 2009
Chair: Thomas Hain, University of Sheffield, UK
Static and Dynamic Modulation Spectrum for Speech
Recognition
Sriram Ganapathy, Samuel Thomas, Hynek
Hermansky; Johns Hopkins University, USA
Thu-Ses2-O3-1, Time: 13:30
We present a feature extraction technique based on static and
dynamic modulation spectrum derived from long-term envelopes
in sub-bands. Estimation of the sub-band temporal envelopes
is done using Frequency Domain Linear Prediction (FDLP). These
sub-band envelopes are compressed with a static (logarithmic) and
dynamic (adaptive loops) compression. The compressed sub-band
envelopes are transformed into modulation spectral components
which are used as features for speech recognition. Experiments
are performed on a phoneme recognition task using a hybrid
HMM-ANN phoneme recognition system and an ASR task using
the TANDEM speech recognition system. The proposed features
provide relative improvements of 3.8% and 11.5% in phoneme
recognition accuracy for TIMIT and conversational telephone
speech (CTS), respectively. Further, these improvements are found
to be consistent for ASR tasks on OGI-Digits database (relative
improvement of 13.5%).
2-D Processing of Speech for Multi-Pitch Analysis
Tianyu T. Wang, Thomas F. Quatieri; MIT, USA
Thu-Ses2-O3-2, Time: 13:50
This paper introduces a two-dimensional (2-D) processing approach
for the analysis of multi-pitch speech sounds. Our framework
invokes the short-space 2-D Fourier transform magnitude of a
narrowband spectrogram, mapping harmonically-related signal
components to multiple concentrated entities in a new 2-D space.
First, localized time-frequency regions of the spectrogram are
analyzed to extract pitch candidates. These candidates are then
combined across multiple regions for obtaining separate pitch
estimates of each speech-signal component at a single point in
time. We refer to this as multi-region analysis (MRA). By explicitly
accounting for pitch dynamics within localized time segments,
this separability is distinct from that which can be obtained using
short-time autocorrelation methods typically employed in state-of-the-art multi-pitch tracking algorithms. We illustrate the feasibility
of MRA for multi-pitch estimation on mixtures of synthetic and
real speech.
A Correlation-Maximization Denoising Filter Used as
an Enhancement Frontend for Noise Robust Bird
Call Classification
Wei Chu, Abeer Alwan; University of California at Los
Angeles, USA
Thu-Ses2-O3-3, Time: 14:30
In this paper, we propose a Correlation-Maximization denoising
filter which utilizes periodicity information to remove additive
noise in bird calls. We also developed a statistically-based noise
robust bird-call classification system which uses the denoising
filter as a frontend. Enhanced bird calls which are the output of the
denoising filter are used for feature extraction. Gaussian Mixture
Models (GMM) and Hidden Markov Models (HMM) are used for
classification. Experiments on a large noisy corpus containing bird
calls from 5 species have shown that the Correlation-Maximization
filter is more effective than the Wiener filter in reducing the
classification error rate of bird calls which have a quasi-periodic
structure. This improvement results in a 4.1% classification error
rate which is better than the system without a denoising frontend
and a system with a Wiener filter denoising frontend.
Preliminary Inversion Mapping Results with a New
EMA Corpus
Korin Richmond; University of Edinburgh, UK
Thu-Ses2-O3-4, Time: 14:10
In this paper, we apply our inversion mapping method, the
trajectory mixture density network (TMDN), to a new corpus of
articulatory data, recorded with a Carstens AG500 electromagnetic
articulograph. This new data set, mngu0, is relatively large and
phonetically rich, among other beneficial characteristics. We obtain
good results, with a root mean square (RMS) error of only 0.99mm.
This compares very well with our previous lowest result of 1.54mm
RMS error for equivalent coils of the MOCHA fsew0 EMA data. We
interpret this as showing the mngu0 data set is potentially more
consistent than the fsew0 data set, and is very useful for research
which calls for articulatory trajectory data. It also supports our
view that the TMDN is very much suited to the inversion mapping
problem.
Time-Varying Autoregressive Tests for Multiscale
Speech Analysis
Daniel Rudoy 1, Thomas F. Quatieri 2, Patrick J. Wolfe 1;
1 Harvard University, USA; 2 MIT, USA
Thu-Ses2-O3-5, Time: 14:50
In this paper we develop hypothesis tests for speech waveform
nonstationarity based on time-varying autoregressive models,
and demonstrate their efficacy in speech analysis tasks at both
segmental and sub-segmental scales. Key to the successful synthesis of these ideas is our employment of a generalized likelihood
ratio testing framework tailored to autoregressive coefficient
evolutions suitable for speech. After evaluating our framework on
speech-like synthetic signals, we present preliminary results for
two distinct analysis tasks using speech waveform data. At the
segmental level, we develop an adaptive short-time segmentation
scheme and evaluate it on whispered speech recordings, while
at the sub-segmental level, we address the problem of detecting
the glottal flow closed phase. Results show that our hypothesis
testing framework can reliably detect changes in the vocal tract
parameters across multiple scales, thereby underscoring its broad
applicability to speech analysis.
Audio Keyword Extraction by Unsupervised Word
Discovery
Armando Muscariello, Guillaume Gravier, Frédéric
Bimbot; IRISA, France
Thu-Ses2-O3-6, Time: 15:10
In real audio data, frequently occurring patterns often convey relevant information on the overall content of the data. The possibility
to extract meaningful portions of the main content by identifying
such key patterns, can be exploited for providing audio summaries
and speeding up the access to relevant parts of the data. We refer
to these patterns as audio motifs in analogy with the nomenclature
in its counterpart task in biology. We describe a framework for
the discovery of audio motifs in streams in an unsupervised
fashion, as no acoustic or linguistic models are used. We define
the fundamental problem by decomposing the overall task into
elementary subtasks; then we propose a solution that combines a
one-pass strategy that exploits the local repetitiveness of motifs
and a dynamic programming technique to detect repetitions in
audio streams.
Results of an experiment on a radio broadcast show are presented
to illustrate the effectiveness of the technique in providing audio
summaries of real data.
Thu-Ses2-O4 : Speech and Multimodal
Resources & Annotation
Holmes (East Wing 3), 13:30, Thursday 10 Sept 2009
Chair: Kristiina Jokinen, University of Tampere, Finland
ASR Corpus Design for Resource-Scarce Languages
Etienne Barnard, Marelie Davel, Charl van Heerden;
CSIR, South Africa
Thu-Ses2-O4-1, Time: 13:30
We investigate the number of speakers and the amount of data that
is required for the development of usable speaker-independent
speech-recognition systems in resource-scarce languages. Our
experiments employ the Lwazi corpus, which contains speech in
the eleven official languages of South Africa. We find that a
surprisingly small number of speakers (fewer than 50) and around
10 to 20 hours of speech per language are sufficient for the
purposes of acceptable phone-based recognition.
Pronunciation Dictionary Development in
Resource-Scarce Environments
Marelie Davel, Olga Martirosian; CSIR, South Africa
Thu-Ses2-O4-2, Time: 13:50
The deployment of speech technology systems in the developing
world is often hampered by the lack of appropriate linguistic
resources. A suitable pronunciation dictionary is one such resource
that can be difficult to obtain for lesser-resourced languages. We
design a process for the development of pronunciation dictionaries
in resource-scarce environments, and apply this to the development
of pronunciation dictionaries for ten of the official languages
of South Africa. We define the semi-automated development and
verification process in detail and discuss practicalities, outcomes
and lessons learnt. We analyse the accuracy of the developed
dictionaries and demonstrate how the distribution of rules
generated from the dictionaries provides insight into the inherent
predictability of the languages studied.
XTrans: A Speech Annotation and Transcription
Tool
Meghan Lammie Glenn, Stephanie M. Strassel,
Haejoong Lee; University of Pennsylvania, USA
Thu-Ses2-O4-3, Time: 14:10
We present XTrans, a multi-platform, multilingual, multi-channel
transcription application designed and developed by Linguistic
Data Consortium. XTrans provides new and efficient solutions
to many common challenges encountered during the manual
transcription process of a wide variety of audio genres, such
as supporting multiple audio channels in a meeting recording
or right-to-left text directionality for languages like Arabic. To
facilitate accurate transcription, XTrans incorporates a number of
quality control functions, and provides a user-friendly mechanism
for transcribing overlapping speech. This paper will describe the
motivation to develop a new transcription tool, and will give an
overview of XTrans functionality.
How to Select a Good Training-Data Subset for
Transcription: Submodular Active Selection for
Sequences
Hui Lin, Jeff Bilmes; University of Washington, USA
Thu-Ses2-O4-4, Time: 14:30
Given a large un-transcribed corpus of speech utterances, we
address the problem of how to select a good subset for word-level
transcription under a given fixed transcription budget. We employ
submodular active selection on a Fisher-kernel based graph over
un-transcribed utterances. The selection is theoretically guaranteed to be near-optimal. Moreover, our approach is able to
bootstrap without requiring any initial transcribed data, whereas
traditional approaches rely heavily on the quality of an initial
model trained on some labeled data. Our experiments on phone
recognition show that our approach outperforms both average-case
random selection and uncertainty sampling significantly.
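A compact sketch of greedy selection under a facility-location objective over a similarity graph; the paper uses a Fisher-kernel based graph and proves near-optimality, while the loop below is just the generic greedy heuristic with a random similarity matrix standing in for the real kernel.

```python
import numpy as np

def greedy_facility_location(similarity, budget):
    """Greedily pick `budget` utterances maximising
    f(S) = sum_i max_{j in S} similarity[i, j],
    a monotone submodular coverage-style objective."""
    n = similarity.shape[0]
    selected = []
    best_cover = np.zeros(n)                  # current max similarity of each item to S
    for _ in range(budget):
        gains = np.maximum(similarity, best_cover[:, None]).sum(axis=0) - best_cover.sum()
        gains[selected] = -np.inf             # do not pick the same item twice
        j = int(np.argmax(gains))
        selected.append(j)
        best_cover = np.maximum(best_cover, similarity[:, j])
    return selected

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = rng.random((50, 50))
    sim = (A + A.T) / 2                       # symmetric similarity stand-in
    print("selected utterance indices:", greedy_facility_location(sim, budget=5))
```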
Improving Acceptability Assessment for the
Labelling of Affective Speech Corpora
Zoraida Callejas, Ramón López-Cózar; Universidad de
Granada, Spain
Thu-Ses2-O4-5, Time: 14:50
In this paper we study how to address the assessment of affective
speech corpora. We propose the use of several coefficients and
provide guidelines to obtain a more complete background about
the quality of their annotation. This proposal has been evaluated employing a corpus of non-acted emotions gathered from
spontaneous interactions of users with a spoken dialogue system.
The results show that, due to the nature of non-acted emotional
corpora, traditional interpretations would in most cases consider
the annotation of these corpora unacceptable even with very high
inter-annotator agreement. Our proposal provides a basis to argue
their acceptability by supplying a more fine-grained vision of their
quality.
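For readers who want a concrete starting point, a minimal agreement computation is shown below (plain Cohen's kappa via scikit-learn on invented labels); the paper argues for several complementary coefficients and a more fine-grained interpretation, which this snippet does not reproduce.

```python
from sklearn.metrics import cohen_kappa_score

# Emotion labels assigned by two annotators to the same ten utterances (invented).
annotator_1 = ["neutral", "angry", "neutral", "bored", "angry",
               "neutral", "neutral", "bored", "angry", "neutral"]
annotator_2 = ["neutral", "angry", "bored", "bored", "neutral",
               "neutral", "neutral", "bored", "angry", "neutral"]

kappa = cohen_kappa_score(annotator_1, annotator_2)
print(f"Cohen's kappa: {kappa:.2f}")   # raw percent agreement alone would look higher
```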
The Broadcast Narrow Band Speech Corpus: A New
Resource Type for Large Scale Language Recognition
Christopher Cieri 1, Linda Brandschain 1, Abby Neely 1,
David Graff 1, Kevin Walker 1, Chris Caruso 1, Alvin F.
Martin 2, Craig S. Greenberg 2; 1 University of
Pennsylvania, USA; 2 NIST, USA
Thu-Ses2-O4-6, Time: 15:10
This paper describes a new resource type, broadcast narrow band
speech for use in large scale language recognition research and
technology development. After providing the rationale for this new
resource type, the paper describes the collection, segmentation,
auditing procedures and data formats used. Along the way, it
addresses issues of defining language and dialect in found data
and how ground truth is established for this corpus.
Thu-Ses2-O5 : Speech Analysis and
Processing III
Ainsworth (East Wing 4), 13:30, Thursday 10 Sept 2009
Chair: Ben Milner, University of East Anglia, UK
Model-Based Automatic Evaluation of L2 Learner’s
English Timing
Chatchawarn Hansakunbuntheung 1, Hiroaki Kato 2,
Yoshinori Sagisaka 1; 1 Waseda University, Japan;
2 NICT, Japan
Thu-Ses2-O5-1, Time: 15:30
This paper proposes a method to automatically measure the timing
characteristics of a second-language learner’s speech as a means
to evaluate language proficiency in speech production. We used
the durational differences from native speakers’ speech as an
objective measure to evaluate the learner’s timing characteristics. To provide flexible evaluation without the need to collect
any additional English reference speech, we employed predicted
segmental durations using a statistical duration model instead
of measured raw durations of natives’ speech. The proposed
evaluation method was tested using English speech data uttered
by Thai-native learners with different English-study experiences.
An evaluation experiment shows that the proposed measure based
on duration differences correlates closely with the subjects’ English-study experiences.
Moreover, segmental duration differences
revealed Thai learners’ speech-control characteristics in word-final
stress assignment. These results support the effectiveness of the
proposed model-based objective evaluation.
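A hedged sketch of the scoring idea: given segment durations predicted by a statistical duration model and the learner's measured durations, a simple timing score is the mean absolute difference of log durations. The numbers below are invented.

```python
import numpy as np

def timing_score(learner_durations, predicted_native_durations):
    """Mean absolute log-duration difference between the learner's segments and
    the durations predicted by a (statistical) native duration model.
    Lower means closer to native-like timing."""
    learner = np.asarray(learner_durations, float)
    native = np.asarray(predicted_native_durations, float)
    return float(np.mean(np.abs(np.log(learner) - np.log(native))))

if __name__ == "__main__":
    predicted = [0.120, 0.065, 0.180, 0.090]     # seconds per segment (model output)
    beginner = [0.200, 0.060, 0.300, 0.150]
    advanced = [0.130, 0.066, 0.190, 0.095]
    print("beginner score:", round(timing_score(beginner, predicted), 3))
    print("advanced score:", round(timing_score(advanced, predicted), 3))
```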
A Bayesian Approach to Non-Intrusive Quality
Assessment of Speech
Petko N. Petkov 1 , Iman S. Mossavat 2 , W. Bastiaan
Kleijn 1 ; 1 KTH, Sweden; 2 Technische Universiteit
Eindhoven, The Netherlands
Thu-Ses2-O5-2, Time: 15:50
A Bayesian approach to non-intrusive quality assessment of
narrow-band speech is presented. The speech features used to
assess quality are the sample mean and variance of band-powers
evaluated from the temporal envelope in the channels of an auditory filter-bank. Bayesian multivariate adaptive regression splines
(BMARS) is used to map features into quality ratings. The proposed
combination of features and regression method leads to a high
performance quality assessment algorithm that learns efficiently
from a small amount of training data and avoids overfitting. Use
of the Bayesian approach also allows the derivation of credible
intervals on the model predictions, which provide a quantitative
measure of model confidence and can be used to identify the need
for complementing the training databases.
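A rough sketch of the feature extraction described above, assuming a precomputed auditory filter-bank envelope (the BMARS regression itself is not shown, and the function name is our own):

    import numpy as np

    def band_power_features(envelopes):
        """Per-channel sample mean and variance of band powers.

        envelopes: array of shape (channels, frames) holding the temporal
        envelope of each auditory filter-bank channel.
        """
        power = envelopes ** 2                      # band power per frame
        means = power.mean(axis=1)                  # sample mean per channel
        variances = power.var(axis=1)               # sample variance per channel
        return np.concatenate([means, variances])   # feature vector fed to the regressor

    # Example with a random 20-channel, 300-frame envelope matrix:
    print(band_power_features(np.random.rand(20, 300)).shape)  # (40,)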
Precision of Phoneme Boundaries Derived Using
Hidden Markov Models
Ladan Baghai-Ravary, Greg Kochanski, John Coleman;
University of Oxford, UK
Thu-Ses2-O5-3, Time: 16:10
Some phoneme boundaries correspond to abrupt changes in the
acoustic signal. Others are less clear-cut because the transition
from one phoneme to the next is gradual.
This paper compares the phoneme boundaries identified by a
large number of different alignment systems, using different
signal representations and Hidden Markov Model structures. The
variability of the different boundaries is analysed statistically, with
the boundaries grouped in terms of the broad phonetic classes of
the respective phonemes.
The mutual consistency between the boundaries from the various
systems is analysed to identify which classes of phoneme boundary
can be identified reliably by an automatic labelling system, and
which are ill-defined and ambiguous.
The results presented here provide a starting point for future
development of techniques for objective comparisons between systems without giving undue weight to variations in those phoneme
boundaries which are inherently ambiguous. Such techniques
should improve the efficiency with which new alignment and HMM
training algorithms can be developed.
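As an illustration of the kind of analysis described above (our own sketch, not the authors' procedure), the spread of boundary placements across alignment systems can be summarised per broad phonetic class:

    import statistics
    from collections import defaultdict

    def boundary_spread_by_class(boundaries):
        """boundaries: list of (phonetic_class, [boundary times in ms from each system]).

        Returns the mean within-boundary standard deviation for each class,
        a simple proxy for how reliably that class of boundary is located."""
        per_class = defaultdict(list)
        for phon_class, times in boundaries:
            per_class[phon_class].append(statistics.pstdev(times))
        return {c: statistics.mean(v) for c, v in per_class.items()}

    # Example: stop boundaries located consistently, vowel-glide boundaries less so.
    data = [("stop", [100, 102, 101]), ("vowel-glide", [230, 260, 245])]
    print(boundary_spread_by_class(data))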
A Novel Method for Epoch Extraction from Speech Signals
Lakshmish Kaushik, Douglas O’Shaughnessy; INRS-EMT, Canada
Thu-Ses2-O5-4, Time: 16:30
This paper introduces a novel method of speech epoch extraction using a modified Wigner-Ville distribution. The Wigner-Ville distribution is an efficient speech representation tool with which minute speech variations can be tracked precisely. In this paper, epoch detection/extraction using the accurate energy tracking, noise robustness, and efficient speech representation properties of a modified discrete Wigner-Ville distribution is explored. The developed technique is tested using the ARCTIC database, with its epoch information from an electro-glottograph as reference epochs. The developed algorithm is compared with the available state-of-the-art methods in various noise conditions (babble, white, and vehicle) and at different levels of degradation. The proposed method outperforms the existing methods in the literature.

LS Regularization of Group Delay Features for Speaker Recognition
Jia Min Karen Kua 1 , Julien Epps 1 , Eliathamby Ambikairajah 1 , Eric Choi 2 ; 1 University of New South Wales, Australia; 2 National ICT Australia, Australia
Thu-Ses2-O5-5, Time: 16:50
Due to the increasing use of fusion in speaker recognition systems, features that are complementary to MFCCs offer opportunities to advance the state of the art. One promising feature is based on group delay; however, this can suffer from large variability due to its numerical formulation. In this paper, we investigate reducing this variability in group delay features with least squares regularization. Evaluations on the NIST 2001 and 2008 SRE databases show a relative improvement of at least 6% and 18% EER respectively when the group delay-based system is fused with the MFCC-based system.

Glottal Closure and Opening Instant Detection from Speech Signals
Thomas Drugman, Thierry Dutoit; Faculté Polytechnique de Mons, Belgium
Thu-Ses2-O5-6, Time: 17:10
This paper proposes a new procedure to detect Glottal Closure and Opening Instants (GCIs and GOIs) directly from speech waveforms. The procedure is divided into two successive steps. First, a mean-based signal is computed, and intervals where speech events are expected to occur are extracted from it. Secondly, at each interval a precise position of the speech event is assigned by locating a discontinuity in the Linear Prediction residual. The proposed method is compared to the DYPSA algorithm on the CMU ARCTIC database. A significant improvement as well as better noise robustness are reported. In addition, results on GOI identification accuracy are promising for glottal source characterization.

Thu-Ses2-P1 : Speaker and Speech Variability, Paralinguistic and Nonlinguistic Cues
Hewison Hall, 13:30, Thursday 10 Sept 2009
Chair: Christer Gobl, Trinity College Dublin, Ireland

A Novel Codebook Search Technique for Estimating the Open Quotient
Yen-Liang Shue, Jody Kreiman, Abeer Alwan; University of California at Los Angeles, USA
Thu-Ses2-P1-1, Time: 13:30
The open quotient (OQ), loosely defined as the proportion of time the glottis is open during phonation, is an important parameter in many source models. Accurate estimation of OQ from acoustic signals is a non-trivial process as it involves the separation of the source signal from the vocal-tract transfer function. Often this process is hampered by the lack of direct physiological data with which to calibrate algorithms. In this paper, an analysis-by-synthesis method using a codebook of harmonically-based Liljencrants-Fant (LF) source models in conjunction with a constrained optimizer was used to obtain estimates of OQ from four subjects. The estimates were compared with physiological measurements from high-speed imaging. Results showed relatively high correlations between the estimated and measured values for only two of the speakers, suggesting that existing source models may be unable to accurately represent some source signals.

Long Term Examination of Intra-Session and Inter-Session Speaker Variability
A.D. Lawson 1 , A.R. Stauffer 1 , B.Y. Smolenski 1 , B.B. Pokines 2 , M. Leonard 3 , E.J. Cupples 1 ; 1 RADC Inc., USA; 2 Oasis Systems, USA; 3 University of Texas at Dallas, USA
Thu-Ses2-P1-2, Time: 13:30
Session variability in speaker recognition is a well recognized phenomenon, but it is poorly understood, largely due to a dearth of robust longitudinal data. The current study uses a large, long-term speaker database to quantify both speaker variability changes within a conversation and the impact of speaker variability changes over the long term (3 years). Results demonstrate that 1) the change in accuracy over the course of a conversation is statistically very robust and 2) the aging effect over three years is statistically negligible. Finally, we demonstrate that voice change during the course of a conversation is, in large part, comparable across sessions.

Distorted Visual Information Influences Audiovisual Perception of Voicing
Ragnhild Eg, Dawn Behne; NTNU, Norway
Thu-Ses2-P1-3, Time: 13:30
Research has shown that visual information becomes less reliable when images are severely distorted. Furthermore, while voicing is generally identified from acoustic cues, it may also provide visual cues to perception. The current study investigated the impact of video distortion on the audiovisual perception of voicing. Audiovisual stimuli were presented to 30 participants with the original video quality, or with reduced video resolution (75×60 pixels, 45×36 pixels). Results revealed that, in addition to increased auditory reliance with video distortion, particularly for voiceless stimuli, perception of voiceless stimuli was more influenced by the visual modality than that of voiced stimuli.
Automatic Detection and Prediction of Topic Changes Through Automatic Detection of Register Variations and Pause Duration
Céline De Looze, Stéphane Rauzy; LPL, France
Thu-Ses2-P1-4, Time: 13:30
In this article a clustering algorithm allowing the automatic detection of speakers’ register changes is presented. Together with automatic detection of pause duration, it has been shown to be efficient for the automatic detection and prediction of topic changes. Taking into account other parameters, such as tempo and intensity, within the framework of Linear Discriminant Analysis is proposed in order to improve the identification of the topic structure of discourse.

Audio-Visual Speech Asynchrony Modeling in a Talking Head
Alexey Karpov 1 , Liliya Tsirulnik 2 , Zdeněk Krňoul 3 , Andrey Ronzhin 1 , Boris Lobanov 2 , Miloš Železný 3 ; 1 Russian Academy of Sciences, Russia; 2 National Academy of Sciences, Belarus; 3 University of West Bohemia in Pilsen, Czech Republic
Thu-Ses2-P1-5, Time: 13:30
An audio-visual speech synthesis system with modeling of the asynchrony between the auditory and visual speech modalities is proposed in the paper. A corpus-based study of real recordings gave us the data required for understanding the problem of modality asynchrony, which is partially caused by co-articulation phenomena. A set of context-dependent timing rules and recommendations was elaborated in order to synchronize the auditory and visual speech cues of the animated talking head in a natural, humanlike way. The cognitive evaluation of the model-based talking head for Russian with an implementation of the original asynchrony model has shown high intelligibility and naturalness of the audio-visual synthesized speech.

The Effects of Fundamental Frequency and Formant Space on Speaker Discrimination Through Bone-Conducted Ultrasonic Hearing
Takayuki Kagomiya, Seiji Nakagawa; AIST, Japan
Thu-Ses2-P1-6, Time: 13:30
Human listeners can perceive speech signals from a voice-modulated ultrasonic carrier which is presented through a bone-conduction stimulator, even if they are sensorineural hearing loss patients. As an application of this phenomenon, we have been developing a bone-conducted ultrasonic hearing aid (BCUHA). This research examined whether formant space and F0 can be cues for speaker discrimination in BCU hearing as well as via air-conduction (AC) hearing. A series of speaker discrimination experiments revealed that both formant space and F0 can act as cues for speaker discrimination even via the BCUHA. However, sensitivity to formant space in BCU hearing is less than in AC hearing.

Perceived Naturalness of a Synthesizer of Disordered Voices
Samia Fraj, Francis Grenez, Jean Schoentgen; Université Libre de Bruxelles, Belgium
Thu-Ses2-P1-7, Time: 13:30
The presentation describes a synthesizer of normal and disordered voice timbres and their perceptual evaluation with respect to naturalness. The simulator uses a shaping function model, which enables controlling the perturbations of the frequency and harmonic richness of the glottal area signal via the control of the instantaneous frequency and amplitude of two harmonic driving functions. Several types of perturbations are simulated. Perceptual experiments, which involve stimuli of synthetic and human vowels with normal values of perturbations, have been carried out. The first was based on a binary synthetic/natural classification. The second involved a discrimination task. Both experiments suggest that human judges are unable to distinguish between human vowels and synthetic vowels prepared with the synthesizer described here.

Analyzing Features for Automatic Age Estimation on Cross-Sectional Data
Werner Spiegl 1 , Georg Stemmer 2 , Eva Lasarcyk 3 , Varada Kolhatkar 4 , Andrew Cassidy 5 , Blaise Potard 6 , Stephen Shum 7 , Young Chol Song 8 , Puyang Xu 5 , Peter Beyerlein 9 , James Harnsberger 10 , Elmar Nöth 1 ; 1 FAU Erlangen-Nürnberg, Germany; 2 SVOX Deutschland GmbH, Germany; 3 Universität des Saarlandes, Germany; 4 University of Minnesota Duluth, USA; 5 Johns Hopkins University, USA; 6 CRIN, France; 7 University of California at Berkeley, USA; 8 Stony Brook University, USA; 9 TFH Wildau, Germany; 10 University of Florida, USA
Thu-Ses2-P1-8, Time: 13:30
We develop an acoustic feature set for the estimation of a person’s age from a recorded speech signal. The baseline features are Mel-frequency cepstral coefficients (MFCCs), which are extended by various prosodic features, pitch and formant frequencies. From experiments on the University of Florida Vocal Aging Database we can draw several conclusions. On the one hand, adding prosodic, pitch and formant features to the MFCC baseline leads to relative reductions of the mean absolute error between 4–20%. Improvements are even larger when perceptual age labels are taken as a reference. On the other hand, reasonable results, with a mean absolute error in age estimation of about 12 years, are already achieved using a simple gender-independent setup and MFCCs only. Future experiments will evaluate the robustness of the prosodic features against channel variability on other databases and investigate the differences between perceptual and chronological age labels.
Intercultural Differences in Evaluation of
Pathological Voice Quality: Perceptual and
Acoustical Comparisons Between RASATI and
GRBASI Scales
Emi Juliana Yamauchi 1 , Satoshi Imaizumi 1 , Hagino
Maruyama 2 , Tomoyuki Haji 2 ; 1 Prefectural University
of Hiroshima, Japan; 2 Kurashiki Central Hospital,
Japan
Thu-Ses2-P1-9, Time: 13:30
This paper analyzes differences and commonality in pathological
voice quality evaluation between two different scaling systems, GRBASI and RASATI. The results identified significant interrelations
between the scales. Harshness, included in RASATI, is described
as noisiness and strain in the GRBASI scale. Roughness is found to
be the most consistent factor and easiest to identify by listeners
of different linguistic backgrounds. Intercultural agreement in
pathological voice quality evaluation seems to be possible.
F0 Cues for the Discourse Functions of “hã” in Hindi
Kalika Bali; Microsoft Research India, India
Thu-Ses2-P1-10, Time: 13:30
Affirmative particles are often employed in conversational speech
to convey more than their literal semantic meaning. The discourse
information conveyed by such particles can have consequences in
both Speech Understanding and Speech Production for a Spoken
Dialogue System. This paper analyses the different discourse functions of the affirmative particle hã (“yes”) in Hindi and explores
the role of fundamental frequency (f0) as a cue to disambiguating
these functions.
Audio Spatialisation Strategies for Multitasking
During Teleconferences
Stuart N. Wrigley, Simon Tucker, Guy J. Brown, Steve
Whittaker; University of Sheffield, UK
Thu-Ses2-P1-11, Time: 13:30
Multitasking during teleconferences is becoming increasingly common: participants continue their work whilst monitoring the audio
for topics of interest. Our previous work has established the benefit
of spatialised audio presentation on improving multitasking performance. In this study, we investigate the different spatialisation
strategies employed by subjects in order to aid their multitasking
performance and improve their user experience. Subjects were
given the freedom to place each participant at a different location
in the acoustic space both in terms of azimuth and distance. Their
strategies were based upon cues regarding keywords and which
participant will utter them. Our findings suggest that subjects
employ consistent strategies with regard to the location of target
and distracter talkers. Furthermore, manipulation of the acoustic
space plays an important role in multitasking performance and the
user experience.
Speech Rate Effects on Linguistic Change
Alexsandro R. Meireles 1 , Plínio A. Barbosa 2 ; 1 Federal
University of Espírito Santo, Brazil; 2 State University of
Campinas, Brazil
Thu-Ses2-P1-12, Time: 13:30
This work is couched in the Articulatory Phonology theoretical
framework, and it discusses the possible role of speech rate on
diachronic change from antepenultimate stress words to penultimate stress words. In this kind of change, there is deletion of
the medial (or final) post-stressed vowel of the antepenultimate
stress words. Our results suggest that speech rate can explain
this historical process of linguistic change, since the medial post-stressed vowel reduces more, although without deletion, than the
final post-stressed vowel from normal to fast rate. These results
were confirmed by Friedman’s ANOVA. A one-way ANOVA also
indicated that the duration of the medial post-stressed vowel is
significantly smaller than the duration of the final post-stressed
vowel. On the other hand, words such as “fôlego” (breath) and
“sábado” (Saturday) reduce their post-stressed segments less than words such as “abóbora” (pumpkin). This finding, associated with Brazilian Portuguese phonotactic restrictions, can
explain why forms such as “folgo” and “sabdo” are not frequently
found in this language. Besides, linguistic changes influenced by
speech rate act according to dialect and gender. In this paper,
speakers from the Mineiro dialect (from Minas Gerais state) (rate:
7.5 syllables/sec.) reduced the medial post-stressed vowel more
than speakers from the Paulista dialect (from São Paulo state) (rate:
6.4 syllables/second), and male speakers (rate: 5.8 syllables/sec.)
reduced the medial post-stressed vowel more than female speakers
(rate: 5.2 syllables/second). These results were also confirmed by
one-way ANOVA.
Mandarin Spontaneous Narrative Planning —
Prosodic Evidence from National Taiwan University
Lecture Corpus
Chiu-yu Tseng 1 , Zhao-yu Su 1 , Lin-shan Lee 2 ; 1 Academia Sinica, Taiwan; 2 National Taiwan University, Taiwan
Thu-Ses2-P1-13, Time: 13:30
This paper discusses discourse planning of pre-organized spontaneous narratives (SpnNS) in comparison with read speech (RS).
F0 and tempo modulations are compared by speech paragraph
size and discourse boundaries. The speaking rate of SpnNS from
university classroom lectures is 2 to 3 times that of RS by
professionals; paragraph phrasing of SpnNS is 6 times that of RS.
Patterns of paragraph association are distinct for SpnNS and RS.
Sub-paragraph and paragraph units in RS are marked by distinct
relative F0 resets and boundary pause duration, but by patterns
of intensity contrasts in SpnNS instead. Consistent across both data
sets is the finding that combined relative supra-segmental cues
reflecting global prosodic properties are more discriminative to
distinguish discourse boundaries than any fragments of singular
cue, supporting higher-level discourse planning in the acoustic
signals. We believe these findings can be directly applied to speech
technology development.
Thu-Ses2-P2 : ASR: Acoustic Model Features
Hewison Hall, 13:30, Thursday 10 Sept 2009
Chair: Richard M. Stern, Carnegie Mellon University, USA
Investigation into Bottle-Neck Features for Meeting
Speech Recognition
František Grézl, Martin Karafiát, Lukáš Burget; Brno
University of Technology, Czech Republic
Thu-Ses2-P2-1, Time: 13:30
This work investigates recently proposed Bottle-Neck features for ASR. The bottle-neck ANN structure is imported into the Split Context architecture, gaining a significant WER reduction. Further,
Universal Context architecture was developed which simplifies the
system by using only one universal ANN for all temporal splits.
Significant WER reduction can be obtained by applying fMPE on
top of our BN features as a technique for discriminative feature
extraction and further gain is also obtained by retraining model
parameters using MPE criterion. The results are reported on
meeting data from RT07 evaluation.
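A minimal sketch of the general bottle-neck idea referred to above (a generic illustration with random, untrained weights; the Split Context and Universal Context architectures from the paper are not reproduced):

    import numpy as np

    def bottleneck_features(frames, w_in, w_bn):
        """Forward a block of feature frames through a narrow hidden layer and
        return the bottleneck activations, which replace or augment standard features.

        frames: (n_frames, input_dim); w_in: (input_dim, hidden_dim);
        w_bn: (hidden_dim, bottleneck_dim) with bottleneck_dim << hidden_dim."""
        hidden = np.tanh(frames @ w_in)        # wide hidden layer
        return np.tanh(hidden @ w_bn)          # narrow bottleneck layer = the features

    # Example: 39-dim input frames, 500 hidden units, 30-dim bottleneck.
    rng = np.random.default_rng(0)
    feats = bottleneck_features(rng.standard_normal((100, 39)),
                                rng.standard_normal((39, 500)) * 0.1,
                                rng.standard_normal((500, 30)) * 0.1)
    print(feats.shape)  # (100, 30)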
Multi-Stream to Many-Stream: Using
Spectro-Temporal Features for ASR
Sherry Y. Zhao, Suman Ravuri, Nelson Morgan; ICSI,
USA
Thu-Ses2-P2-2, Time: 13:30
We report progress in the use of multi-stream spectro-temporal
features for both small and large vocabulary automatic speech
recognition tasks. Features are divided into multiple streams for
parallel processing and dynamic utilization in this approach. For
small vocabulary speech recognition experiments, the incorporation of up to 28 dynamically-weighted spectro-temporal feature
streams along with MFCCs yields roughly 21% improvement on
the baseline in low noise conditions and 47% improvement in
noise-added conditions, a greater improvement on the baseline
than in our previous work. A four stream framework yields a 14%
improvement over the baseline in the large vocabulary low noise
recognition experiment. These results suggest that the division
of spectro-temporal features into multiple streams may be an
effective way to flexibly utilize an inherently large number of
features for automatic speech recognition.
Tandem Representations of Spectral Envelope and Modulation Frequency Features for ASR
Samuel Thomas, Sriram Ganapathy, Hynek Hermansky; Johns Hopkins University, USA
Thu-Ses2-P2-3, Time: 13:30
We present a feature extraction technique for automatic speech recognition that uses a Tandem representation of short-term spectral envelope and modulation frequency features. These features, derived from sub-band temporal envelopes of speech estimated using frequency domain linear prediction, are combined at the phoneme posterior level. Tandem representations derived from these phoneme posteriors are used along with HMM-based ASR systems for both small and large vocabulary continuous speech recognition (LVCSR) tasks. For a small vocabulary continuous digit task on the OGI Digits database, the proposed features reduce the word error rate (WER) by 13% relative to other feature extraction techniques. We obtain a relative reduction of about 14% in WER for an LVCSR task using the NIST RT05 evaluation data. For phoneme recognition tasks on the TIMIT database these features provide a relative improvement of 13% compared to other techniques.

Entropy-Based Feature Analysis for Speech Recognition
Panji Setiawan 1 , Harald Höge 2 , Tim Fingscheidt 3 ; 1 Siemens Enterprise Communications GmbH & Co. KG, Germany; 2 SVOX Deutschland GmbH, Germany; 3 Technische Universität Braunschweig, Germany
Thu-Ses2-P2-4, Time: 13:30
Based on the concept of entropy, a new approach to analyse the quality of features as used in speech recognition is proposed. We regard the relation between the hidden Markov model (HMM) states and the corresponding frame-based feature vectors as a coding problem, where the states are sent through a noisy recognition channel and received as feature vectors. Using the relation between Shannon’s conditional entropy and the error rate on the state level, we estimate how much information is contained in the feature vectors to recognize the states. Thus, the conditional entropy is a measure of the quality of the features. Finally, we show how noise reduces the information contained in the features.

Hierarchical Processing of the Modulation Spectrum for GALE Mandarin LVCSR System
Fabio Valente 1 , Mathew Magimai-Doss 1 , C. Plahl 2 , Suman Ravuri 3 ; 1 IDIAP Research Institute, Switzerland; 2 RWTH Aachen University, Germany; 3 ICSI, USA
Thu-Ses2-P2-5, Time: 13:30
This paper investigates the use of TANDEM features based on hierarchical processing of the modulation spectrum. The study is done in the framework of the GALE project for recognition of Mandarin Broadcast data. We describe the improvements obtained using the hierarchical processing and the addition of features like pitch and short-term critical band energy. Results are consistent with previous findings on a different LVCSR task, suggesting that the proposed technique is effective and robust across several conditions. Furthermore, we describe integration into the RWTH GALE LVCSR system trained on 1600 hours of Mandarin data and present progress across the GALE 2007 and GALE 2008 RWTH systems, resulting in approximately 20% CER reduction on several data sets.

Hill-Climbing Feature Selection for Multi-Stream ASR
David Gelbart 1 , Nelson Morgan 1 , Alexey Tsymbal 2 ; 1 ICSI, USA; 2 Siemens AG, Germany
Thu-Ses2-P2-6, Time: 13:30
We performed automated feature selection for multi-stream (i.e., ensemble) automatic speech recognition, using a hill-climbing (HC) algorithm that changes one feature at a time if the change improves a performance score. For both clean and noisy data sets (using the OGI Numbers corpus), HC usually improved performance on held-out data compared to the initial system it started with, even for noise types that were not seen during the HC process. Overall, we found that using Opitz’s scoring formula, which blends single-classifier word recognition accuracy and ensemble diversity, worked better than ensemble accuracy as a performance score for guiding HC in cases of extreme mismatch between the SNR of the training and test sets. Our noisy version of the Numbers corpus, our multi-layer-perceptron-based Numbers ASR system, and our HC scripts are available online.
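A minimal sketch of the hill-climbing loop described above, assuming a user-supplied score function (for example, held-out accuracy or an Opitz-style blend of accuracy and diversity); this is an illustration, not the released scripts:

    def hill_climb(n_features, score, max_passes=5):
        """Greedy feature selection: toggle one feature at a time, keep the change
        if it improves `score(mask)`, where mask is a list of booleans."""
        mask = [True] * n_features          # start from the full feature set
        best = score(mask)
        for _ in range(max_passes):
            improved = False
            for i in range(n_features):
                mask[i] = not mask[i]       # tentatively flip feature i
                candidate = score(mask)
                if candidate > best:
                    best, improved = candidate, True
                else:
                    mask[i] = not mask[i]   # revert the flip
            if not improved:
                break
        return mask, best

    def toy_score(mask):
        """Toy objective: reward even-indexed features, penalise odd-indexed ones."""
        return sum(+1 if i % 2 == 0 else -1 for i, keep in enumerate(mask) if keep)

    print(hill_climb(6, toy_score))  # keeps only the even-indexed features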
Robust F0 Estimation Based on Log-Time Scale
Autocorrelation and its Application to Mandarin
Tone Recognition
Yusuke Kida, Masaru Sakai, Takashi Masuko, Akinori
Kawamura; Toshiba Corporate R&D Center, Japan
Thu-Ses2-P2-7, Time: 13:30
This paper proposes a novel F0 estimation method in which delta-logF0 is directly estimated based on the autocorrelation function (ACF)
on a logarithmic time scale. Since peaks of ACFs of periodic signals
have a specific pattern on the log-time scale and the period only
affects the position of the pattern, delta-logF0 can be estimated
directly from the shift of the peaks of the log-time scale ACF
(LTACF) without F0 estimation. Then logF0 is estimated from the
sum of LTACFs shifted based on delta-logF0. Experimental results
show that the proposed method is more robust against noise than
the baseline ACF-based method. It is also shown that the proposed
method significantly improves the Mandarin tone recognition
accuracy.
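A rough sketch of an autocorrelation function resampled onto a logarithmic lag axis, the representation on which the delta-logF0 estimation above operates (our simplified illustration, not the authors' implementation):

    import numpy as np

    def log_time_acf(frame, n_points=64, min_lag=20, max_lag=400):
        """Autocorrelation of one speech frame, sampled at logarithmically spaced lags.

        For a periodic signal the ACF peak pattern keeps its shape on this axis and
        merely shifts when the period (and hence F0) changes."""
        acf = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
        acf /= acf[0] + 1e-12                                   # normalise
        log_lags = np.geomspace(min_lag, max_lag, n_points)     # logarithmic lag grid
        return np.interp(log_lags, np.arange(len(acf)), acf)

    # Example: 100 Hz periodic frame sampled at 16 kHz (period = 160 samples).
    t = np.arange(640) / 16000.0
    print(log_time_acf(np.sin(2 * np.pi * 100 * t)).shape)  # (64,)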
Invariant-Integration Method for Robust Feature
Extraction in Speaker-Independent Speech
Recognition
Florian Müller, Alfred Mertins; Universität zu Lübeck,
Germany
Thu-Ses2-P2-8, Time: 13:30
The vocal tract length (VTL) is one of the variabilities that
speaker-independent automatic speech recognition (ASR) systems
encounter. Standard methods to compensate for the effects of different VTLs within the processing stages of the ASR systems often
have a high computational effort. By using an appropriate warping
scheme for the frequency centers of the time-frequency analysis, a
change in VTL can be approximately described by a translation in
the subband-index space. We present a new type of features that
is based on the principle of invariant integration, and a corresponding
feature selection method is described. ASR experiments show the
increased robustness of the proposed features in comparison to
standard MFCCs.
Discriminative Feature Transformation Using Output Coding for Speech Recognition
Omid Dehzangi 1 , Bin Ma 2 , Eng Siong Chng 1 , Haizhou Li 2 ; 1 Nanyang Technological University, Singapore; 2 Institute for Infocomm Research, Singapore
Thu-Ses2-P2-9, Time: 13:30
In this paper, we present a new mechanism to extract discriminative acoustic features for speech recognition using continuous output coding (COC) based feature transformation. Our proposed method first expands the short-time spectral features into a higher-dimensional feature space to improve their discriminative capability. The expansion is performed by employing polynomial expansion. The high-dimensional features are then projected into a lower-dimensional space using the continuous output coding technique, implemented by a set of linear SVMs. The resulting feature vectors are designed to encode the differences between phones. The generated features are shown to be more discriminative than MFCCs, and experimental results on both the TIMIT and NTIMIT corpora showed better phone recognition accuracy with the proposed features.

Discriminant Spectrotemporal Features for Phoneme Recognition
Nima Mesgarani, G.S.V.S. Sivaram, Sridhar Krishna Nemala, Mounya Elhilali, Hynek Hermansky; Johns Hopkins University, USA
Thu-Ses2-P2-10, Time: 13:30
We propose discriminant methods for deriving two-dimensional spectrotemporal features for phoneme recognition that are estimated to maximize the separation between the representations of phoneme classes. The linearity of the filters results in their intuitive interpretation, enabling us to investigate the working principles of the system and to improve its performance by locating the sources of error. Two methods for the estimation of the filters are proposed: Regularized Least Squares (RLS) and Modified Linear Discriminant Analysis (MLDA). Both methods reach a comparable improvement over the baseline condition, demonstrating the advantage of the discriminant spectrotemporal filters.

Auditory Model Based Optimization of MFCCs Improves Automatic Speech Recognition Performance
Saikat Chatterjee, Christos Koniaris, W. Bastiaan Kleijn; KTH, Sweden
Thu-Ses2-P2-11, Time: 13:30
Using a spectral auditory model along with perturbation-based analysis, we develop a new framework to optimize a set of features such that it emulates the behavior of the human auditory system. The optimization is carried out in an off-line manner, based on the conjecture that the local geometries of the feature domain and the perceptual auditory domain should be similar. Using this principle, we modify and optimize the static mel frequency cepstral coefficients (MFCCs) without considering any feedback from the speech recognition system. We show that improved recognition performance is obtained for any environmental condition, clean as well as noisy.

Thu-Ses2-P3 : ASR: Tonal Language, Cross-Lingual and Multilingual ASR
Hewison Hall, 13:30, Thursday 10 Sept 2009
Chair: Lori Lamel, LIMSI, France

Pronunciation-Based ASR for Names
Henk van den Heuvel 1 , Bert Réveil 2 , Jean-Pierre Martens 2 ; 1 Radboud Universiteit Nijmegen, The Netherlands; 2 Ghent University, Belgium
Thu-Ses2-P3-1, Time: 13:30
To improve the ASR of proper names, a novel method based on the generation of pronunciation variants by means of phoneme-to-phoneme converters (P2Ps) is proposed. The aim is to convert baseline transcriptions into variants that maximally resemble actual name pronunciations that were found in a training corpus. The method has to operate in a cross-lingual setting with native Dutch persons speaking Dutch and foreign names, and foreign persons speaking Dutch names. The P2Ps are trained to act either on conventional G2P transcriptions or on canonical transcriptions that were provided by a human expert. Including the variants produced by the P2Ps in the lexicon of the recognizer substantially improves the recognition accuracy for natives pronouncing foreign names, but not for the other investigated combinations.

How Speaker Tongue and Name Source Language Affect the Automatic Recognition of Spoken Names
Bert Réveil 1 , Jean-Pierre Martens 1 , Bart D’hoore 2 ; 1 Ghent University, Belgium; 2 Nuance, Belgium
Thu-Ses2-P3-2, Time: 13:30
In this paper the automatic recognition of person names and geographical names uttered by native and non-native speakers is examined in an experimental set-up. The major aim was to raise our understanding of how well, and under which circumstances, previously proposed methods of multilingual pronunciation modeling and multilingual acoustic modeling contribute to better name recognition in a cross-lingual context. To come to a meaningful interpretation of the results, we have categorized each language according to the amount of exposure a native speaker is expected to have had to this language. After having interpreted our results we have also tried to find an answer to the question of how much further improvement one might be able to attain with a more advanced pronunciation modeling technique which we plan to develop.

Online Generation of Acoustic Models for Multilingual Speech Recognition
Martin Raab 1 , Guillermo Aradilla 1 , Rainer Gruhn 1 , Elmar Nöth 2 ; 1 Harman Becker Automotive Systems, Germany; 2 FAU Erlangen-Nürnberg, Germany
Thu-Ses2-P3-3, Time: 13:30
Our goal is to provide a multilingual speech-based Human Machine Interface for in-car infotainment and navigation systems. The multilinguality is needed, for example, for music player control via speech, as artist and song names in the globalized music market come from many languages. Another frequent use case is the input of foreign navigation destinations via speech. In this paper we propose approximated projections between mixtures of Gaussians that allow the generation of the multilingual system from monolingual systems. This makes the creation of the multilingual systems on an embedded system possible, with the benefit that training and maintenance effort remain unchanged compared to the provision of monolingual systems. We also sketch how this algorithm can help, together with our previous work, to obtain an efficient architecture for multilingual speech recognition on embedded devices.
Basic Speech Recognition for Spoken Dialogues
Charl van Heerden, Etienne Barnard, Marelie Davel;
CSIR, South Africa
Thu-Ses2-P3-4, Time: 13:30
Spoken dialogue systems (SDSs) have great potential for information access in the developing world. However, the realisation of
that potential requires the solution of several challenging problems, including the development of sufficiently accurate speech
recognisers for a diverse multitude of languages. We investigate
the feasibility of developing small-vocabulary speaker-independent
ASR systems designed for use in a telephone-based information
system, using ten resource-scarce languages spoken in South Africa
as a case study.
We contrast a cross-language transfer approach (using a well-trained system from a different language) with the development of
new language-specific corpora and systems, and evaluate the effectiveness of both approaches. We find that limited speech corpora (3
to 8 hours of data from around 200 speakers) are sufficient for the
development of reasonably accurate recognisers: Error rates are in
the range 2% to 12% for a ten-word task, where vocabulary words are
excluded from training to simulate vocabulary-independent performance. This approach is substantially more accurate than cross-language transfer, and sufficient for the development of basic spoken dialogue systems.
Tonal Articulatory Feature for Mandarin and its Application to Conversational LVCSR
Qingqing Zhang, Jielin Pan, Yonghong Yan; Chinese Academy of Sciences, China
Thu-Ses2-P3-5, Time: 13:30
This paper presents our recent work on the development of a tonal Articulatory Feature (AF) set for Mandarin and its application to conversational LVCSR. Motivated by the theory of Mandarin phonology, eight features for classifying the acoustic units and one feature for classifying the tone are investigated and constructed in the paper, and the AF-based tandem approach is used to improve speech recognition performance. With this Mandarin AF set, a significant relative reduction in Character Error Rate is obtained over the baseline system using the standard acoustic feature, and the comparison between the ASR systems based on AF classifiers with and without the tonal feature demonstrates that the system with the tonal feature achieves further gains.

Effects of Language Mixing for Automatic Recognition of Cantonese-English Code-Mixing Utterances
Houwei Cao, P.C. Ching, Tan Lee; Chinese University of Hong Kong, China
Thu-Ses2-P3-6, Time: 13:30
While automatic speech recognition of either Cantonese or English alone has achieved a great degree of success, recognition of Cantonese-English code-mixing speech is not as trivial. This paper attempts to analyze the effect of language mixing on the recognition performance of code-mixing utterances. By examining the recognition results of Cantonese-English code-mixing speech, where Cantonese is the matrix language and English is the embedded language, we noticed that recognition accuracy of the embedded language plays a significant role in the overall performance. In particular, significant performance degradation is found in the matrix language if the embedded words cannot be recognized correctly. We also studied the error propagation effect of the embedded English. The results show that the error in embedded English words may propagate to two neighboring Cantonese syllables. Finally, analysis is carried out to determine the influencing factors for recognition performance in embedded English.

A One-Step Tone Recognition Approach Using MSD-HMM for Continuous Speech
Changliang Liu, Fengpei Ge, Fuping Pan, Bin Dong, Yonghong Yan; Chinese Academy of Sciences, China
Thu-Ses2-P3-7, Time: 13:30
There are two types of methods for tone recognition of continuous speech: one-step and two-step approaches. Two-step approaches need to identify the syllable boundaries first, while one-step approaches do not. Previous studies mostly focus on two-step approaches. In this paper, a one-step approach using Multi-space distribution HMM (MSD-HMM) is investigated. The F0, which only exists in voiced speech, is modeled by MSD-HMM. Then, a tonal syllable network is built based on the reference, and a Viterbi search is carried out on it to find the best tone sequence. Two modifications to the conventional tri-phone HMM models are investigated: tone-based context expansion and syllable-based model units. The experimental results show that tone-based context information is more important for tone recognition and that syllable-based HMM models are much better than phone-based ones. The final tone correct rate is 88.8%, which is much higher than that of the state-of-the-art two-step approaches.

Stream-Based Context-Sensitive Phone Mapping for Cross-Lingual Speech Recognition
Khe Chai Sim, Haizhou Li; Institute for Infocomm Research, Singapore
Thu-Ses2-P3-8, Time: 13:30
Recently, a Probabilistic Phone Mapping (PPM) model was proposed to facilitate cross-lingual automatic speech recognition using a foreign phonetic system. Under this framework, discrete hidden Markov models (HMMs) are used to map a foreign phone sequence to a target phone sequence. Context-sensitive mapping is made possible by expanding the discrete observation symbols to include the contexts of the foreign phones in which they appear in the sequence. Unfortunately, modelling the context dependencies jointly results in a dramatic increase in model parameters as wider contexts are used. In this paper, the probability of observing a context-dependent symbol is decomposed into the product of the probabilities of observing the symbol and its contexts. This allows wider contexts to be modelled without greatly compromising the model complexity. This can be modelled conveniently using a multiple-stream discrete HMM system where the contexts are treated as independent streams. Experimental results are reported on the TIMIT English phone recognition task using Czech, Hungarian and Russian foreign phone recognisers.
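As a compact illustration of the factorisation described in the preceding abstract (notation ours, not the authors'), the emission probability of a context-dependent symbol s with left and right contexts c_l and c_r in target state q is approximated by a product over independent streams, which in LaTeX reads:

    P(s, c_l, c_r \mid q) \;\approx\; P(s \mid q)\, P(c_l \mid q)\, P(c_r \mid q)

Each factor is then handled by its own discrete observation stream, so widening the context adds streams rather than multiplying the symbol inventory.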
Human Translations Guided Language Discovery for ASR Systems
Sebastian Stüker 1 , Laurent Besacier 2 , Alex Waibel 1 ; 1 Universität Karlsruhe (TH), Germany; 2 LIG, France
Thu-Ses2-P3-9, Time: 13:30
The traditional approach of collecting and annotating the necessary training data is, due to economic constraints, not feasible for most of the 7,000 languages in the world. At the same time, it is of vital interest to have natural language processing systems address practically all of them. Therefore, new, efficient ways of gathering the needed training material have to be found. In this paper we continue our experiments on exploiting the knowledge gained from human simultaneous translations, which happen frequently in the real world, in order to discover word units in a new language. We evaluate our approach by measuring the performance of statistical machine translation systems trained on the word units discovered from an oracle phoneme sequence. We then improve it by combining it with a word discovery technique that works without supervision, solely on the unsegmented phoneme sequences.

Thu-Ses2-P4 : ASR: New Paradigms II
Hewison Hall, 13:30, Thursday 10 Sept 2009
Chair: Michael Schuster, Google, USA

The Case for Case-Based Automatic Speech Recognition
Viktoria Maier, Roger K. Moore; University of Sheffield, UK
Thu-Ses2-P4-1, Time: 13:30
In order to avoid global parameter settings which are locally suboptimal, this paper argues for the inclusion of more knowledge (in particular, procedural knowledge) in automatic speech recognition (ASR) systems. Two related fields provide inspiration for this new perspective: (a) ‘cognitive architectures’ indicate how experience with related problems can give rise to more (expert) knowledge, and (b) ‘case-based reasoning’ provides an extended framework which is relevant to any similarity-based recognition system. The outcome of this analysis is a proposal for a new approach termed ‘Case-Based ASR’.

A Self-Labeling Speech Corpus: Collecting Spoken Words with an Online Educational Game
Ian McGraw, Alexander Gruenstein, Andrew Sutherland; MIT, USA
Thu-Ses2-P4-2, Time: 13:30
We explore a new approach to collecting and transcribing speech data by using online educational games. One such game, Voice Race, elicited over 55,000 utterances over a 22-day period, representing 18.7 hours of speech. Voice Race was designed such that the transcripts for a significant subset of utterances can be automatically inferred using the contextual constraints of the game. Game context can also be used to simplify transcription to a multiple-choice task, which can be performed by non-experts. We found that one third of the speech collected with Voice Race could be automatically transcribed with over 98% accuracy, and that an additional 49% could be labeled cheaply by Amazon Mechanical Turk workers. We demonstrate the utility of the self-labeled speech in an acoustic model adaptation task, which resulted in a reduction in the Voice Race utterance error rate. The collected utterances cover a wide variety of vocabulary and should be useful for a range of research.

A Noise Robust Method for Pattern Discovery in Quantized Time Series: The Concept Matrix Approach
Okko Johannes Räsänen, Unto Kalervo Laine, Toomas Altosaar; Helsinki University of Technology, Finland
Thu-Ses2-P4-3, Time: 13:30
An efficient method for pattern discovery from discrete time series is introduced in this paper. The method utilizes two parallel streams of data: a discrete unit time-series and a set of labeled events. From these inputs it builds associative models between systematically co-occurring structures existing in both streams. The models are based on transitional probabilities of events at several different time scales. Learning and recognition processes are incremental, making the approach suitable for online learning tasks. The capabilities of the algorithm are demonstrated in a continuous speech recognition task operating at varying noise levels.

Using Parallel Architectures in Speech Recognition
Patrick Cardinal, Pierre Dumouchel, Gilles Boulianne; CRIM, Canada
Thu-Ses2-P4-4, Time: 13:30
The speed of modern processors has remained constant over the last few years and thus, to be scalable, applications must be parallelized. In addition to the main CPU, almost every computer is equipped with a Graphics Processing Unit (GPU), which is in essence a specialized parallel processor. This paper explores how the performance of speech recognition systems can be enhanced by using the GPU for the acoustic computations and multi-core CPUs for the Viterbi search in a large vocabulary application. The multi-core implementation of our speech recognition system runs 1.3 times faster than the single-threaded CPU implementation. Addition of the GPU for dedicated acoustic computations increases the speed by a factor of 2.8, leading to a word accuracy improvement of 16.6% absolute at real-time, compared to the single-threaded CPU implementation.

Example-Based Speech Recognition Using Formulaic Phrases
Christopher J. Watkins, Stephen J. Cox; University of East Anglia, UK
Thu-Ses2-P4-5, Time: 13:30
In this paper, we describe the design of an ASR system that is based on identifying and extracting formulaic phrases from a corpus and then, rather than building statistical models of them, performing example-based recognition of these phrases. We describe a method for combining formulaic phrases into a bigram language model that results in a 13% decrease in WER on a monophone HMM recogniser over the baseline. We show that using this model with phrase templates in the example-based recogniser gives a significant improvement in WER compared to word templates, but performance still falls short of the HMM recogniser. We also describe an LDA decision tree classifier that reduces the search space of the DTW decoder by 40% while at the same time decreasing WER.

Parallel Fast Likelihood Computation for LVCSR Using Mixture Decomposition
Naveen Parihar 1 , Ralf Schlüter 2 , David Rybach 2 , Eric A. Hansen 1 ; 1 Mississippi State University, USA; 2 RWTH Aachen University, Germany
Thu-Ses2-P4-6, Time: 13:30
This paper describes a simple and robust method for improving the runtime of likelihood computation on multi-core processors without degrading system accuracy. The method improves runtime by parallelizing likelihood computations on a multi-core processor. Mixtures are decomposed among the cores and each core computes the likelihood of the mixtures allocated to it. We study two approaches to mixture decomposition: chunk-based and decision-tree-based. When applied to the RWTH TC-STAR EPPS English LVCSR system on an Intel Core2 Quad processor with varying pruning-beam width settings, the method resulted in a 54% to 70% improvement in the likelihood computation runtime, and an 18% to 59% improvement in the overall runtime.

An Indexing Weight for Voice-to-Text Search
Chen Liu; Motorola, USA
Thu-Ses2-P4-7, Time: 13:30
The TF-IDF (term frequency-inverse document frequency) weight is a well-known indexing weight in information retrieval and text mining. However, it is not suitable for the increasingly popular voice-to-text search, as it does not take into account the impact of voice in the search process. We propose a method for calculating a new indexing weight, which is used as guidance for the selection of suitable queries for voice-to-text search. In designing the new weight, we combine prominence factors from both the text and acoustic domains. Experimental results show a significant improvement in the average search success rate with the new indexing weight.
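For reference, the text-only TF-IDF weight that the proposed indexing weight builds on can be computed as in the following generic sketch (the acoustic prominence factors described in the abstract are not part of it):

    import math
    from collections import Counter

    def tf_idf(term, doc, corpus):
        """Standard TF-IDF weight of `term` in `doc` with respect to `corpus`.

        doc: list of tokens; corpus: list of such documents.
        """
        tf = Counter(doc)[term] / len(doc)            # term frequency
        df = sum(1 for d in corpus if term in d)      # document frequency
        idf = math.log(len(corpus) / (1 + df))        # inverse document frequency (smoothed)
        return tf * idf

    # Example: weight of "brighton" in the first document of a toy corpus.
    corpus = [["speech", "search", "brighton"],
              ["speech", "recognition"],
              ["speech", "search"]]
    print(tf_idf("brighton", corpus[0], corpus))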
On Invariant Structural Representation for Speech Recognition: Theoretical Validation and Experimental Improvement
Yu Qiao, Nobuaki Minematsu, Keikichi Hirose; University of Tokyo, Japan
Thu-Ses2-P4-8, Time: 13:30
One of the most challenging problems in speech recognition is to deal with the inevitable acoustic variations caused by non-linguistic factors. Recently, an invariant structural representation of speech was proposed [1], where the non-linguistic variations are effectively removed through modeling the dynamic and contrastive aspects of speech signals. This paper describes our recent progress on this problem. Theoretically, we prove that the maximum likelihood based decomposition can lead to the same structural representations for a sequence and its transformed version. Practically, we introduce a method of discriminant analysis of eigen-structure to deal with two limitations of structural representations, namely, high dimensionality and too strong invariance. In the first experiment, we evaluate the proposed method through recognizing connected Japanese vowels. The proposed method achieves a recognition rate of 99.0%, which is higher than those of the previous structure-based recognition methods [2, 3, 4] and word HMM. In the second experiment, we examine the recognition performance of structural representations under vocal tract length (VTL) differences. The experimental results indicate that structural representations have much more robustness to VTL changes than HMM. Moreover, the proposed method is about 60 times faster than the previous ones.

Articulatory Feature Asynchrony Analysis and Compensation in Detection-Based ASR
I-Fan Chen, Hsin-Min Wang; Academia Sinica, Taiwan
Thu-Ses2-P4-9, Time: 13:30
This paper investigates the effects of two types of imperfection, namely detection errors and articulatory feature asynchrony, of the front-end articulatory feature detector on the performance of a detection-based ASR system. Based on a set of variable-controlled experiments, we find that articulatory feature asynchrony is the major issue that should be addressed in detection-based ASR. To this end, we propose several methods to reduce the asynchrony or the effects of asynchrony. The results are quite promising; for example, we can currently achieve 67.67% phone accuracy in the TIMIT free phone recognition task with only 11 binary-valued articulatory features.

CRANDEM: Conditional Random Fields for Word Recognition
Jeremy Morris, Eric Fosler-Lussier; Ohio State University, USA
Thu-Ses2-P4-10, Time: 13:30
To date, the use of Conditional Random Fields (CRFs) in automatic speech recognition has been limited to the tasks of phone classification and phone recognition. In this paper, we present a framework for using CRF models in a word recognition task that extends the well-known Tandem HMM framework to CRFs. We show results that compare favorably to a set of standard baselines, and discuss some of the benefits and potential pitfalls of this method.

HEAR: An Hybrid Episodic-Abstract Speech Recognizer
Sébastien Demange, Dirk Van Compernolle; Katholieke Universiteit Leuven, Belgium
Thu-Ses2-P4-11, Time: 13:30
This paper presents a new architecture for automatic continuous speech recognition called HEAR (Hybrid Episodic-Abstract speech Recognizer). HEAR relies on both parametric speech models (HMMs) and episodic memory. We propose an evaluation on the Wall Street Journal corpus, a standard continuous speech recognition task, and compare the results with a state-of-the-art HMM baseline. HEAR is shown to be a viable and competitive architecture. While HMMs have been studied and optimized for decades, their performance seems to converge to a limit which is lower than human performance. In contrast, episodic memory modeling for speech recognition as applied in HEAR offers the flexibility to enrich the recognizer with information the HMMs lack. This opportunity, as well as future work, is discussed.
Author Index
A
Abad, Alberto . . . . . . . .
Abboutabit, N. . . . . . . . .
Abdelwahab, Amira . .
Abutalebi, H.R. . . . . . . .
Acero, Alex . . . . . . . . . . .
Ackermann, P. . . . . . . . .
Acosta, Jaime C. . . . . . .
Adada, Junichi . . . . . . . .
Adda-Decker, Martine
Adell, Jordi . . . . . . . . . . .
Agüero, Pablo Daniel .
Aguilar, Lourdes . . . . . .
Aho, Eija . . . . . . . . . . . . . .
Aimetti, Guillaume . . .
Aist, Gregory . . . . . . . . .
Ajmera, Jitendra . . . . . .
Akagi, Masato . . . . . . . .
Akamine, Masami . . . .
Akita, Yuya . . . . . . . . . . .
Al Bawab, Ziad . . . . . . .
Alcázar, José . . . . . . . . .
Alfandary, Amir . . . . . .
Ali, Saandia . . . . . . . . . . .
Alías, Francesc . . . . . . .
Alku, Paavo . . . . . . . . . . .
Allauzen, Alexandre .
Allauzen, Cyril . . . . . . .
Almajai, Ibrahim . . . . .
Al Moubayed, Samer .
Aloni-Lavi, Ruth . . . . . .
Alpan, A. . . . . . . . . . . . . . .
Altosaar, Toomas . . . .
Alwan, Abeer . . . . . . . . .
Amano, Shigeaki . . . . .
Amano-Kusumoto, A.
Ambikairajah, E. . . . . . .
Amino, Kanae . . . . . . . .
Ananthakrishnan, G. .
Andersen, O. . . . . . . . . . .
Andersson, J.S. . . . . . . .
André, Elisabeth . . . . . .
Andreou, Andreas G. .
Anguera, Xavier . . . . . .
Aradilla, Guillermo . . .
Arai, Takayuki . . . . . . . .
Arias, Juan Pablo . . . . .
Ariki, Yasuo . . . . . . . . . .
Ariyaeeinia, A. . . . . . . . .
Aronowitz, Hagai . . . . .
Attabi, Yazid . . . . . . . . .
Atterer, Michaela . . . . .
Atwell, Eric . . . . . . . . . . .
Aubergé, Véronique . .
Avigal, Mireille . . . . . . .
Avinash, B. . . . . . . . . . . . .
Ayan, Necip Fazil . . . . .
Aylett, Matthew P. . . . .
Mon-Ses2-O3-5
Tue-Ses2-P2-8
Tue-Ses3-P2-2
Wed-Ses3-P2-12
Tue-Ses3-P1-12
Tue-Ses1-O1-5
Mon-Ses3-P4-4
Wed-Ses1-O2-4
Thu-Ses1-P2-2
Mon-Ses2-P1-6
Wed-Ses3-O1-3
Mon-Ses2-S1-2
Mon-Ses3-P2-13
Wed-Ses1-O4-5
Wed-Ses4-P4-10
Wed-Ses1-P2-3
Tue-Ses1-O2-6
Tue-Ses2-P2-13
Mon-Ses3-P4-1
Tue-Ses3-P2-7
Wed-Ses1-O4-3
Tue-Ses3-P2-7
Wed-Ses2-P3-6
Mon-Ses2-O3-3
Wed-Ses2-P3-2
Tue-Ses3-P3-9
Mon-Ses2-P2-6
Wed-Ses3-O2-3
Mon-Ses3-P2-8
Tue-Ses3-P1-2
Wed-Ses1-P2-2
Thu-Ses1-O2-5
Tue-Ses2-P3-8
Wed-Ses2-O4-2
Wed-Ses2-O4-6
Tue-Ses3-P3-5
Mon-Ses2-P2-6
Tue-Ses1-S2-8
Tue-Ses1-O2-6
Tue-Ses1-P3-5
Wed-Ses1-P4-13
Thu-Ses2-P4-3
Mon-Ses2-O1-3
Wed-Ses1-O3-6
Thu-Ses1-S1-1
Thu-Ses2-O3-3
Thu-Ses2-P1-1
Mon-Ses3-P1-6
Wed-Ses1-P2-10
Tue-Ses2-P1-11
Wed-Ses2-P2-3
Thu-Ses2-O5-5
Wed-Ses3-P1-9
Tue-Ses3-P2-3
Thu-Ses2-O2-1
Tue-Ses2-P1-8
Mon-Ses3-P2-9
Mon-Ses2-S1-5
Tue-Ses1-P3-4
Wed-Ses3-P2-8
Thu-Ses2-P3-3
Mon-Ses2-O2-6
Tue-Ses1-P1-1
Wed-Ses3-P1-9
Tue-Ses2-P2-4
Mon-Ses2-P4-2
Wed-Ses3-P2-7
Tue-Ses1-P4-11
Mon-Ses2-S1-9
Tue-Ses2-O3-3
Wed-Ses1-P4-14
Tue-Ses1-P3-9
Wed-Ses1-P2-12
Wed-Ses3-O4-5
Wed-Ses2-P2-8
Tue-Ses2-P1-6
Mon-Ses3-O4-5
Wed-Ses2-P3-11
50
94
104
145
103
75
72
113
156
52
138
60
69
115
149
118
76
95
71
105
115
105
134
50
133
108
54
139
68
102
118
152
97
128
129
107
54
87
76
82
124
174
48
114
162
165
168
66
119
93
131
168
143
105
164
92
68
60
82
145
172
49
79
143
94
57
145
85
61
90
124
83
119
141
132
92
65
135
B
Bachan, Jolanta . . . . . . . Mon-Ses3-P2-4 67
Bäckström, Tom . . . . . . Thu-Ses1-P1-3 155
Badin, Pierre . . . . . . . . . . Wed-Ses3-O4-3 141
Badino, Leonardo . . . . . Mon-Ses3-P2-9 68
Baghai-Ravary, Ladan . . Thu-Ses2-O5-3 167
Bagou, Odile . . . . . . . . . . Mon-Ses3-P1-2 66
Bailly, Gérard . . . . . . . . . Mon-Ses3-S1-8 74
Wed-Ses3-O4-3 141
Baker, Brendan . . . . . . . Tue-Ses3-O3-2 100
Balakrishnan, Suhrid . Wed-Ses3-S1-3 150
Balchandran, Rajesh . Mon-Ses2-P4-10 59
Bali, Kalika . . . . . . . . . . . . Mon-Ses3-P2-12 69
Thu-Ses2-P1-10 170
Ban, Sung Min . . . . . . . . Wed-Ses3-O3-1 140
Ban, Vin Shen . . . . . . . . . Mon-Ses2-P1-7 52
Banglore, Srinivas . . . . Wed-Ses1-S1-1 124
Banno, Hideki . . . . . . . . Thu-Ses1-P2-6 157
Bapineedu, G. . . . . . . . . . Tue-Ses2-P1-6 92
Barbosa, Plínio A. . . . . . Tue-Ses2-O2-4 89
Tue-Ses3-S2-3 110
Wed-Ses2-S1-2 137
Thu-Ses2-P1-12 170
Barbot, Nelly . . . . . . . . . . Thu-Ses1-P2-3 157
Bárkányi, Zsuzsanna . Mon-Ses3-P1-10 67
Barker, Jon . . . . . . . . . . . . Mon-Ses2-P1-5 52
Barnard, Etienne . . . . . . Tue-Ses1-P3-12 83
Thu-Ses2-O4-1 166
Thu-Ses2-P3-4 173
Barney, Anna . . . . . . . . . Mon-Ses3-P1-7 66
Tue-Ses1-S2-11 87
Barra-Chicote, R. . . . . . Mon-Ses2-S1-7 61
Bartalis, Mátyás . . . . . . Mon-Ses3-P4-9 73
Bar-Yosef, Yossi . . . . . . Tue-Ses3-O3-3 100
Batliner, Anton . . . . . . . Mon-Ses2-S1-1 60
Mon-Ses3-P4-4 72
Baumann, Timo . . . . . . Tue-Ses2-O3-3 90
Wed-Ses1-P4-14 124
Bayer, Stefan . . . . . . . . . Thu-Ses1-P1-3 155
Beautemps, Denis . . . . Tue-Ses3-P2-2 104
Bechet, Frederic . . . . . . Tue-Ses2-O3-5 90
Thu-Ses1-P4-4 160
Beck, Jeppe . . . . . . . . . . . Tue-Ses3-O4-1 101
Beckman, Mary . . . . . . . Tue-Ses1-O2-2 76
Behne, Dawn . . . . . . . . . . Thu-Ses2-P1-3 168
Belfield, Bill . . . . . . . . . . . Wed-Ses2-O3-6 128
Bell, Peter . . . . . . . . . . . . . Wed-Ses2-P4-12 137
Bellegarda, Jerome R. . . Tue-Ses1-O4-4 78
Benaroya, Elie-Laurent Mon-Ses3-S1-4 74
Ben-David, Shai . . . . . . . Mon-Ses3-O4-4 65
Benders, Titia . . . . . . . . . Mon-Ses3-O2-6 63
Ben-Harush, Oshry . . . Tue-Ses1-P4-8 85
Beňuš, Štefan . . . . . . . . . Tue-Ses1-P1-11 80
Wed-Ses2-S1-5 138
Ben Youssef, Atef . . . . Wed-Ses3-O4-3 141
BenZeghiba, M.F. . . . . . Wed-Ses3-O1-5 138
Berkling, Kay . . . . . . . . . Thu-Ses1-P2-1 156
Besacier, Laurent . . . . . Thu-Ses1-P3-1 158
Thu-Ses2-P3-9 174
Beskow, Jonas . . . . . . . . Mon-Ses2-P4-12 59
Tue-Ses3-P3-5 107
Beyerlein, Peter . . . . . . . Thu-Ses2-P1-8 169
Biadsy, Fadi . . . . . . . . . . . Mon-Ses2-P2-12 55
Bigi, Brigitte . . . . . . . . . . Thu-Ses1-P3-1 158
Bilmes, Jeff . . . . . . . . . . . Tue-Ses1-O1-1 75
Wed-Ses2-O3-1 127
Thu-Ses1-O3-6 154
Thu-Ses2-O4-4 167
Bimbot, Frédéric . . . . . . Thu-Ses2-O3-6 166
Bistritz, Yuval . . . . . . . . Tue-Ses3-O3-3 100
Bitouk, Dmitri . . . . . . . . Wed-Ses2-P2-6 132
Black, Matthew . . . . . . . Wed-Ses2-O2-2 126
Blanco, José Luis . . . . . Tue-Ses3-P3-9 108
Blomberg, Mats . . . . . . . Mon-Ses3-P3-11 71
Tue-Ses3-P2-3 105
Bocklet, Tobias . . . . . . . Wed-Ses2-P2-4 131
Boeffard, Olivier . . . . . . Tue-Ses1-O4-3 78
Thu-Ses1-P2-3 157
Boersma, Paul . . . . . . . . Mon-Ses3-O2-6 63
Mon-Ses3-P1-5 66
Bőhm, Tamás . . . . . . . . . Mon-Ses3-P1-10 67
Boidin, Cédric . . . . . . . . Tue-Ses1-O4-3 78
Wed-Ses2-P3-9 134
Wed-Ses3-S1-6 150
Bolder, Bram . . . . . . . . . . Wed-Ses2-S1-6 138
Bonafonte, Antonio . . Mon-Ses2-O2-5 49
Mon-Ses3-P2-13 69
Wed-Ses1-O4-5 115
Wed-Ses4-P4-10 149
Bonastre, J-F. . . . . . . . . . Mon-Ses2-P2-1 53
Bonneau, Anne . . . . . . . Mon-Ses3-P1-8 66
Boonpiam, Vataya . . . . Mon-Ses3-P2-3 67
Borel, Stephanie . . . . . . Tue-Ses1-S2-9 87
Borges, Nash . . . . . . . . . . Thu-Ses1-O3-5 153
Borgstrom, Bengt J. . . Thu-Ses1-S1-1 162
Bořil, Hynek . . . . . . . . . . Tue-Ses2-P4-6 98
Bouchon-Meunier, B. . Wed-Ses3-S1-4 150
Boufaden, Narjès . . . . . Mon-Ses2-S1-9 61
Boula de Mareüil, P. . . Wed-Ses3-O1-3 138
Thu-Ses1-O2-5 152
Boulenger, Véronique Wed-Ses2-O1-5 126
Boulianne, Gilles . . . . . Tue-Ses3-P3-6 107
Thu-Ses2-P4-4 174
Bourlard, Hervé . . . . . . Tue-Ses2-O4-4 91
Wed-Ses3-O3-6 141
Boves, Lou . . . . . . . . . . . . Tue-Ses1-O2-6 76
Wed-Ses1-P4-1 122
Wed-Ses3-O2-1 139
Bozkurt, Baris . . . . . . . . Mon-Ses2-O4-5 51
Bozkurt, Elif . . . . . . . . . . Mon-Ses2-S1-4 60
Braga, Daniela . . . . . . . . Tue-Ses3-O4-1 101
Brandl, Holger . . . . . . . . Wed-Ses2-S1-6 138
Brandschain, Linda . . . Thu-Ses2-O4-6 167
Braunschweiler, N. . . . Wed-Ses2-P3-12 135
Bray, W.P. . . . . . . . . . . . . . Wed-Ses1-P4-3 122
Bresch, Erik . . . . . . . . . . . Tue-Ses1-O2-5 76
Breslin, Catherine . . . . Tue-Ses3-P2-8 105
Bretier, Philippe . . . . . . Wed-Ses3-S1-4 150
Brierley, Claire . . . . . . . . Tue-Ses1-P3-9 83
Brodnik, Andrej . . . . . . Thu-Ses1-O3-1 153
Brown, Guy J. . . . . . . . . . Thu-Ses2-P1-11 170
Brumberg, Jonathan S. Mon-Ses3-S1-3 73
Brümmer, Niko . . . . . . . Wed-Ses1-O1-3 112
Wed-Ses3-O1-4 138
Brungart, Douglas S. . Mon-Ses3-O2-5 63
Buera, L. . . . . . . . . . . . . . . Mon-Ses2-O1-6 48
Tue-Ses2-P4-7 99
Tue-Ses3-P2-9 106
Bugalho, M. . . . . . . . . . . . Tue-Ses2-P2-8 94
Bunnell, H. Timothy . . Wed-Ses1-P2-1 118
Buquet, Julie . . . . . . . . . . Mon-Ses3-P1-8 66
Burget, Lukáš . . . . . . . . . Mon-Ses2-O3-2 50
Mon-Ses2-S1-10 61
Tue-Ses3-O3-1 99
Wed-Ses3-O1-4 138
Wed-Ses3-P2-4 144
Thu-Ses2-P2-1 170
Burkett, David . . . . . . . . Mon-Ses3-O4-5 65
Burkhardt, Felix . . . . . . Thu-Ses1-P2-9 158
Bürki, Audrey . . . . . . . . . Wed-Ses3-P1-1 142
Buß, Okko . . . . . . . . . . . . Tue-Ses2-O3-3 90
Busset, Julie . . . . . . . . . . Mon-Ses2-O2-2 49
Busso, Carlos . . . . . . . . . Mon-Ses2-S1-3 60
Wed-Ses2-P1-6 130
Butko, T. . . . . . . . . . . . . . . Tue-Ses2-P2-7 94
Byrne, William . . . . . . . . Mon-Ses3-O3-1 63
C
Caballero Morales, O.
Cabrera, Joao . . . . . . . . .
Cadic, Didier . . . . . . . . . .
Caetano, Janine . . . . . . .
Cahill, Peter . . . . . . . . . .
Cai, Jun . . . . . . . . . . . . . . .
Cai, Lianhong . . . . . . . . .
Callejas, Zoraida . . . . .
Calvo, José R. . . . . . . . . .
Camelin, Nathalie . . . .
Campbell, Joseph . . . .
Campbell, Nick . . . . . . .
Campbell, W.M. . . . . . . .
Campillo, Francisco . .
Canton-Ferrer, C. . . . . .
Cao, Houwei . . . . . . . . . .
Carayannis, George . .
Cardinal, Patrick . . . . .
Carenini, Giuseppe . . .
Carlson, Rolf . . . . . . . . .
Carreira-Perpiñán, MA
Carson-Berndsen, J. . .
Caruso, Chris . . . . . . . . .
Wed-Ses1-O3-1 113
Mon-Ses3-S1-5 74
Wed-Ses2-P3-9 134
Tue-Ses1-S2-11 87
Tue-Ses3-O4-6 101
Mon-Ses2-O2-2 49
Mon-Ses3-O3-4 64
Wed-Ses1-P3-5 120
Thu-Ses2-O4-5 167
Wed-Ses3-P2-5 144
Thu-Ses1-P4-4 160
Wed-Ses3-O2-2 139
Wed-Ses2-S1-3 137
Mon-Ses2-P2-8 54
Wed-Ses1-O1-2 111
Wed-Ses3-O1-6 139
Wed-Ses3-P2-10 145
Wed-Ses4-P4-10 149
Tue-Ses2-P2-7 94
Thu-Ses2-P3-6 173
Tue-Ses2-O4-2 90
Tue-Ses3-P3-6 107
Thu-Ses2-P4-4 174
Wed-Ses2-P2-2 131
Wed-Ses1-P2-7 119
Wed-Ses1-P4-1 122
Tue-Ses1-P1-5 79
Tue-Ses1-P3-13 83
Tue-Ses3-O4-6 101
Thu-Ses2-O4-6 167
Casas, J.R. . . . . . . . . . . . . Tue-Ses2-P2-7 94
Caskey, Sasha . . . . . . . . Mon-Ses3-O4-4 65
Cassidy, Andrew . . . . . Thu-Ses2-P1-8 169
Castaldo, Fabio . . . . . . . Mon-Ses2-P2-4 54
Tue-Ses2-O4-1 90
Castelli, Eric . . . . . . . . . . Wed-Ses3-O4-5 141
Thu-Ses1-P3-1 158
Castillo-Guerra, E. . . . . Tue-Ses1-S2-5 86
Cazi, Nadir . . . . . . . . . . . . Tue-Ses3-P1-7 103
Cecere, Elvio . . . . . . . . . . Tue-Ses3-P4-5 109
Cen, Ling . . . . . . . . . . . . . . Wed-Ses2-P3-8 134
Cerisara, C. . . . . . . . . . . . Wed-Ses1-P4-6 123
Černocký, Jan . . . . . . . . Mon-Ses2-S1-10 61
Tue-Ses3-O3-1 99
Wed-Ses3-P2-4 144
Cerva, Petr . . . . . . . . . . . . Tue-Ses2-O1-6 88
Chan, Arthur . . . . . . . . . Wed-Ses2-O3-6 128
Chan, Paul . . . . . . . . . . . . Wed-Ses2-P3-8 134
Chan, W.-Y. . . . . . . . . . . . Wed-Ses2-O4-4 129
Chang, Hung-An . . . . . . Mon-Ses2-P3-6 56
Chang, Joon-Hyuk . . . . Tue-Ses3-P1-10 103
Thu-Ses1-P1-4 155
Charlier, Malorie . . . . . Wed-Ses1-O4-4 115
Chatterjee, Saikat . . . . Thu-Ses2-P2-11 172
Chaubard, Laura . . . . . . Thu-Ses1-O4-6 154
Chelba, Ciprian . . . . . . . Mon-Ses3-O1-1 61
Chen, Berlin . . . . . . . . . . Tue-Ses3-P4-10 110
Wed-Ses1-P4-12 124
Chen, Chia-Ping . . . . . . Tue-Ses2-O4-3 91
Tue-Ses2-P4-10 99
Chen, Hui . . . . . . . . . . . . . Wed-Ses3-O4-1 141
Chen, I-Fan . . . . . . . . . . . Thu-Ses2-P4-9 175
Chen, Jia-Yu . . . . . . . . . . Mon-Ses2-P3-5 56
Chen, Langzhou . . . . . . Thu-Ses1-P3-3 158
Chen, Nancy F. . . . . . . . Wed-Ses3-O2-2 139
Chen, Sin-Horng . . . . . . Mon-Ses3-P2-5 68
Thu-Ses1-P2-5 157
Chen, Szu-wei . . . . . . . . Tue-Ses2-O2-3 89
Chen, Zhengqing . . . . . Tue-Ses2-P1-1 91
Cheng, Chierh . . . . . . . . Mon-Ses3-P1-3 66
Cheng, Chih-Chieh . . . Tue-Ses1-O1-3 75
Cheng, Shih-Sian . . . . . Tue-Ses2-O4-3 91
Chetty, Girija . . . . . . . . . Tue-Ses2-P2-12 95
Chevelu, Jonathan . . . . Wed-Ses3-S1-6 150
Chiang, Chen-Yu . . . . . Mon-Ses3-P2-5 68
Thu-Ses1-P2-5 157
Chien, Jen-Tzung . . . . . Mon-Ses3-O1-6 62
Tue-Ses3-P2-10 106
Chin, K.K. . . . . . . . . . . . . . Wed-Ses3-P3-7 147
Thu-Ses1-P3-3 158
Ching, P.C. . . . . . . . . . . . . Tue-Ses1-P4-2 84
Thu-Ses2-P3-6 173
Chiou, Sheng-Chiuan . Tue-Ses2-P4-10 99
Chiu, Yu-Hsiang Bosco Mon-Ses2-O1-2 48
Chládková, Kateřina . . Mon-Ses3-P1-5 66
Chng, Eng Siong . . . . . . Mon-Ses2-P2-10 55
Thu-Ses2-P2-9 172
Cho, Jeongmi . . . . . . . . . Wed-Ses3-O3-2 140
Cho, Kook . . . . . . . . . . . . Tue-Ses3-P1-13 103
Choi, Eric . . . . . . . . . . . . . Thu-Ses2-O5-5 168
Choi, Jae-Hun . . . . . . . . . Tue-Ses3-P1-10 103
Thu-Ses1-P1-4 155
Chollet, Gérard . . . . . . . Mon-Ses3-S1-4 74
Chonavel, Thierry . . . . Wed-Ses1-O4-2 115
Chong, Jike . . . . . . . . . . . Tue-Ses2-P3-3 96
Chotimongkol, A. . . . . . Mon-Ses3-P2-6 68
Wed-Ses1-P3-12 122
Christensen, Heidi . . . Mon-Ses2-P1-5 52
Chu, Wei . . . . . . . . . . . . . . Thu-Ses2-O3-3 165
Chueh, Chuang-Hua . . Mon-Ses3-O1-6 62
Chung, Hoon . . . . . . . . . Tue-Ses2-O1-1 87
Chung, Hyun-Yeol . . . . Wed-Ses3-P3-3 146
Chung, Minhwa . . . . . . . Wed-Ses1-P1-5 116
Cieri, Christopher . . . . Thu-Ses2-O4-6 167
Clark, Robert A.J. . . . . . Mon-Ses3-P2-9 68
Tue-Ses3-O4-3 101
Claveau, Vincent . . . . . Tue-Ses3-O4-4 101
Clemens, Caroline . . . . Tue-Ses1-P3-10 83
Clements, Mark A. . . . . Wed-Ses2-O4-1 128
Coelho, Luis . . . . . . . . . . Tue-Ses3-O4-1 101
Colás, José . . . . . . . . . . . . Wed-Ses2-P4-10 136
Colby, Glen . . . . . . . . . . . Mon-Ses3-S1-5 74
Cole, Jeffrey . . . . . . . . . . Tue-Ses3-P4-3 108
Cole, Jennifer . . . . . . . . . Thu-Ses1-O2-6 152
Thu-Ses1-S1-0 162
Coleman, John . . . . . . . . Thu-Ses2-O5-3 167
Colibro, Daniele . . . . . .
Cooke, Martin . . . . . . . .
Cordoba, R. . . . . . . . . . . .
Corns, A. . . . . . . . . . . . . . .
Cosi, Piero . . . . . . . . . . . .
Côté, Nicolas . . . . . . . . .
Cowie, Roddy . . . . . . . . .
Cox, Stephen J. . . . . . . .
Coyle, Eugene . . . . . . . .
Crammer, Koby . . . . . . .
Cranen, B. . . . . . . . . . . . . .
Creer, S.M. . . . . . . . . . . . .
Crevier-Buchman, Lise
Csapó, Tamás Gábor .
Cuayáhuitl, Heriberto
Cucchiarini, Catia . . . .
Cui, Xiaodong . . . . . . . .
Cumani, Sandro . . . . . .
Cummins, Fred . . . . . . .
Cunningham, S.P. . . . . .
Cupples, E.J. . . . . . . . . . .
Cutler, Anne . . . . . . . . . .
Cutugno, Francesco . .
Cvetković, Zoran . . . . .
Mon-Ses2-P2-4 54
Tue-Ses2-P3-9 97
Tue-Ses1-P1-12 80
Wed-Ses2-O1-6 126
Mon-Ses2-P4-8 58
Tue-Ses1-O2-6 76
Mon-Ses3-P3-1 69
Thu-Ses1-O4-1 154
Wed-Ses1-O2-6 113
Wed-Ses1-O3-1 113
Thu-Ses2-P4-5 174
Wed-Ses2-S1-4 137
Thu-Ses1-O3-6 154
Tue-Ses2-P4-2 98
Tue-Ses3-P3-1 106
Tue-Ses1-S2-9 87
Mon-Ses3-P1-10 67
Mon-Ses2-P4-7 58
Mon-Ses3-P4-2 71
Mon-Ses2-P3-8 57
Mon-Ses2-P2-4 54
Mon-Ses2-O2-3 49
Tue-Ses1-P4-9 85
Tue-Ses3-P3-1 106
Wed-Ses1-P4-3 122
Thu-Ses2-P1-2 168
Mon-Ses3-O2-2 63
Tue-Ses3-P4-5 109
Wed-Ses3-P3-4 146
D
Dahlbäck, Nils . . . . . . . . Thu-Ses2-O1-6 164
Dai, Beiqian . . . . . . . . . . . Tue-Ses3-O3-6 100
Dai, Li-Rong . . . . . . . . . . Mon-Ses3-O3-2 64
Daimo, Katsunori . . . . Tue-Ses1-P1-3 79
Dakka, Wisam . . . . . . . . Tue-Ses3-P4-12 110
d’Alessandro, C. . . . . . . Wed-Ses2-P3-9 134
Dalsgaard, P. . . . . . . . . . . Tue-Ses2-P1-8 92
Damnati, Géraldine . . Tue-Ses1-O4-3 78
Thu-Ses1-P4-4 160
Damper, Robert I. . . . . Wed-Ses2-P2-11 133
Dang, Jianwu . . . . . . . . . Mon-Ses2-O2-1 48
Thu-Ses2-O2-5 165
Dansereau, R.M. . . . . . . Wed-Ses2-O4-4 129
Darch, Jonathan . . . . . . Wed-Ses2-O4-2 128
D’Arcy, Shona . . . . . . . . Tue-Ses1-S1-4 86
Das, Amit . . . . . . . . . . . . . Tue-Ses3-P1-15 104
Dashtbozorg, Behdad Tue-Ses3-P1-12 103
Davel, Marelie . . . . . . . . Thu-Ses2-O4-1 166
Thu-Ses2-O4-2 166
Thu-Ses2-P3-4 173
Davies, Hannah . . . . . . . Wed-Ses3-P1-12 143
Davis, Chris . . . . . . . . . . . Mon-Ses3-O2-2 63
Wed-Ses3-O4-4 141
Davis, Matthew H. . . . . Mon-Ses3-O2-1 62
Dayanidhi, Krishna . . . Tue-Ses3-P4-2 108
Dean, Jeffrey . . . . . . . . . . Mon-Ses3-O1-1 61
de Castro, Alberto . . . . Wed-Ses3-P2-6 144
Dehak, Najim . . . . . . . . . Mon-Ses2-S1-9 61
Wed-Ses1-O1-3 112
Dehak, Réda . . . . . . . . . . Mon-Ses2-S1-9 61
Wed-Ses1-O1-3 112
Dehzangi, Omid . . . . . . Thu-Ses2-P2-9 172
de Jong, F.M.G. . . . . . . . Tue-Ses1-P4-10 85
Tue-Ses2-O4-5 91
Wed-Ses2-P2-7 132
Thu-Ses1-O4-4 154
Deléglise, Paul . . . . . . . . Tue-Ses1-O3-1 77
Wed-Ses2-P4-8 136
De Looze, Céline . . . . . Thu-Ses2-P1-7 169
De Luca, Carlo J. . . . . . . Mon-Ses3-S1-5 74
Demange, Sébastien . . Mon-Ses3-P3-6 70
Thu-Ses2-P4-11 175
Demenko, Grażyna . . . Wed-Ses1-P4-4 123
Demirekler, Mübeccel Thu-Ses2-O2-3 164
De Mori, Renato . . . . . . Mon-Ses2-P4-9 58
Thu-Ses1-P4-4 160
Demuynck, Kris . . . . . . Tue-Ses2-P3-2 96
Denby, Bruce . . . . . . . . . Mon-Ses3-S1-4 74
Deng, Li . . . . . . . . . . . . . . . Tue-Ses1-O1-5 75
Deng, Yunbin . . . . . . . . . Mon-Ses3-S1-5 74
den Ouden, Hanny . . . Tue-Ses1-P2-4 81
Despres, Julien . . . . . . . Mon-Ses2-O3-6 50
D’Haro, L.F. . . . . . . . . . . . Mon-Ses2-P4-8 58
D’hoore, Bart . . . . . . . . . Thu-Ses2-P3-2 172
Dickie, Catherine . . . . . Wed-Ses1-P4-2 122
Mon-Ses2-P3-7 56
Thu-Ses1-P3-4 158
Tue-Ses2-O2-1 88
Wed-Ses4-P4-13 149
Wed-Ses4-P4-14 149
Tue-Ses2-O3-1 89
Thu-Ses1-P4-12 161
Mon-Ses3-O3-6 64
Tue-Ses3-P2-4 105
Tue-Ses3-P2-5 105
Wed-Ses2-P4-7 136
Thu-Ses1-P1-3 155
Tue-Ses2-O2-5 89
Thu-Ses2-O2-2 164
Tue-Ses3-P1-1 102
Wed-Ses3-O3-5 140
Thu-Ses1-P1-5 155
Tue-Ses1-O3-4 77
Wed-Ses2-P1-2 129
Mon-Ses2-P2-6 54
Mon-Ses2-P2-7 54
Wed-Ses2-P2-8 132
Tue-Ses1-P3-2 82
Wed-Ses3-P1-13 143
Wed-Ses4-P4-2 147
Thu-Ses1-O2-2 152
Mon-Ses2-P3-1 55
Mon-Ses2-P1-4 52
Wed-Ses2-S1-6 138
Thu-Ses2-P3-7 173
Wed-Ses2-P3-8 134
Wed-Ses4-P4-9 148
Wed-Ses2-S1-4 137
Wed-Ses1-O2-6 113
Wed-Ses1-P4-2 122
Mon-Ses3-S1-4 74
Tue-Ses1-O2-6 76
Wed-Ses1-P2-9 119
Mon-Ses2-O4-5 51
Tue-Ses3-P3-10 108
Wed-Ses1-P3-8 121
Thu-Ses2-O5-6 168
Tue-Ses3-O4-6 101
Mon-Ses3-O3-4 64
Tue-Ses3-P3-10 108
Tue-Ses2-P3-2 96
Wed-Ses1-P3-4 120
Mon-Ses2-S1-9 61
Wed-Ses1-O1-3 112
Thu-Ses2-P4-4 174
Mon-Ses2-O4-5 51
Tue-Ses3-P3-10 108
Wed-Ses1-O4-4 115
Wed-Ses1-P3-8 121
Thu-Ses2-O5-6 168
Wed-Ses1-P3-13 122
Edlund, Jens . . . . . . . . . . Mon-Ses2-P4-12 59
77
163
168
154
158
71
136
172
107
51
79
123
164
81
164
93
131
168
60
60
119
65
125
125
142
92
91
60
Diehl, F. . . . . . . . . . . . . . . .
Dimitrova, Diana V. . .
D’Imperio, Mariapaola
Dinarelli, Marco . . . . . .
Dines, John . . . . . . . . . . .
Disch, Sascha . . . . . . . . .
Dittrich, Heleen . . . . . .
Divenyi, Pierre . . . . . . . .
DiVita, Joseph . . . . . . . .
Dixon, Paul R. . . . . . . . .
Djamah, Mouloud . . . .
Dobrišek, Simon . . . . . .
Dobry, Gil . . . . . . . . . . . . .
Docio-Fernandez, L. . .
Dogil, Grzegorz . . . . . .
Dognin, Pierre L. . . . . . .
Dole, Marjorie . . . . . . . .
Domont, Xavier . . . . . . .
Dong, Bin . . . . . . . . . . . . .
Dong, Minghui . . . . . . . .
Dorn, Amelie . . . . . . . . .
Dorran, David . . . . . . . .
Douglas-Cowie, Ellen .
Draxler, Christoph . . .
Dreyfus, Gérard . . . . . .
Driesen, Joris . . . . . . . . .
Drugman, Thomas . . .
Du, Jinhua . . . . . . . . . . . .
Duan, Quansheng . . . .
Dubuisson, Thomas . .
Duchateau, Jacques . .
Duckhorn, Frank . . . . .
Dumouchel, Pierre . . .
Dutoit, Thierry . . . . . . .
Dziemianko, Michal . .
E
Eg, Ragnhild . . . . . . . . . .
Egi, Noritsugu . . . . . . . .
El-Desoky, Amr . . . . . . .
Elenius, Daniel . . . . . . . .
El Hannani, Asmaa . . .
Elhilali, Mounya . . . . . .
el Kaliouby, Rana . . . . .
Ellis, Dan P.W. . . . . . . . .
Enflo, Laura . . . . . . . . . . .
Engelbrecht, K-P. . . . . .
Engwall, Olov . . . . . . . . .
Epps, Julien . . . . . . . . . . .
Erdem, A. Tanju . . . . . .
Erdem, Çiğdem Eroğlu
Erickson, Donna . . . . . .
Ernestus, Mirjam . . . . .
Errity, Andrew . . . . . . . .
Erro, Daniel . . . . . . . . . . .
Erzin, Engin . . . . . . . . . . .
Tue-Ses1-O3-5
Thu-Ses2-O1-2
Thu-Ses2-P1-3
Thu-Ses1-O4-1
Thu-Ses1-P3-5
Mon-Ses3-P3-11
Wed-Ses2-P4-7
Thu-Ses2-P2-10
Tue-Ses3-P3-8
Mon-Ses2-O4-3
Tue-Ses1-P1-7
Wed-Ses1-P4-7
Thu-Ses2-O1-4
Tue-Ses1-P2-6
Thu-Ses2-O2-1
Tue-Ses2-P1-11
Wed-Ses2-P2-3
Thu-Ses2-O5-5
Mon-Ses2-S1-4
Mon-Ses2-S1-4
Wed-Ses1-P2-12
Mon-Ses3-P1-1
Wed-Ses2-O1-1
Wed-Ses2-O1-3
Wed-Ses3-P1-2
Tue-Ses2-P1-7
Tue-Ses2-P1-2
Mon-Ses2-S1-4
Escalante-Ruiz, Rafael Wed-Ses4-P4-6 148
Escudero, David . . . . . . Wed-Ses4-P4-10 149
Espy-Wilson, Carol Y. . Thu-Ses1-S1-0 162
Thu-Ses1-S1-1 162
Thu-Ses1-S1-3 162
Estève, Yannick . . . . . . . Wed-Ses2-P4-8 136
Evanini, Keelan . . . . . . . Wed-Ses1-P1-3 116
Wed-Ses3-O2-4 139
Ewender, Thomas . . . . Mon-Ses2-O4-1 50
Mon-Ses3-O1-2 62
Eyben, Florian . . . . . . . . Wed-Ses1-O2-6 113
Thu-Ses1-O1-5 151
Gales, M.J.F. . . . . . . . . . . . Mon-Ses2-P3-7
Galliano, Sylvain . . . . . .
Gamboa Rosales, A. . .
Gamboa Rosales, H. . .
Ganapathy, Sriram . . .
F
Fagel, Sascha . . . . . . . . . Tue-Ses1-O2-4 76
Faisman, Alexander . . Mon-Ses3-O4-4 65
Fakotakis, Nikos . . . . . . Wed-Ses2-O2-1 126
Fan, Xing . . . . . . . . . . . . . . Tue-Ses1-P4-3 84
Fang, Qiang . . . . . . . . . . . Mon-Ses2-O2-1 48
Fapšo, Michal . . . . . . . . . Wed-Ses3-P2-4 144
Fatema, K. . . . . . . . . . . . . Tue-Ses3-P3-1 106
Faure, Julien . . . . . . . . . . Tue-Ses2-P2-5 94
Favre, Benoit . . . . . . . . . . Tue-Ses2-P3-4 96
Tue-Ses3-P4-8 109
Tue-Ses3-P4-9 109
Thu-Ses1-P4-3 160
Fegyó, Tibor . . . . . . . . . . Thu-Ses1-P3-7 159
Feldes, Stefan . . . . . . . . . Tue-Ses1-P3-10 83
Feng, Junlan . . . . . . . . . . Wed-Ses1-S1-1 124
Fernández, Fernando . Mon-Ses2-S1-7 61
Wed-Ses3-S1-2 150
Fernández, Rafael . . . . Wed-Ses3-P2-5 144
Fernández, Rubén . . . . Tue-Ses3-P3-9 108
Fernandez Astudillo, R Thu-Ses1-O1-1 150
Ferreiros, Javier . . . . . . Wed-Ses3-S1-2 150
Filimonov, Denis . . . . . Thu-Ses1-P3-2 158
Fingscheidt, Tim . . . . . Thu-Ses2-P2-4 171
Fitt, Sue . . . . . . . . . . . . . . . Tue-Ses3-O4-3 101
Flego, F. . . . . . . . . . . . . . . . Tue-Ses2-P4-8 99
Thu-Ses1-O1-3 151
Fohr, D. . . . . . . . . . . . . . . . Wed-Ses1-P4-6 123
Fon, Janice . . . . . . . . . . . . Tue-Ses1-P2-2 81
Forbes-Riley, Kate . . . . Wed-Ses3-S1-1 149
Fosler-Lussier, Eric . . . Tue-Ses1-O2-2 76
Tue-Ses1-P3-6 82
Tue-Ses2-P3-10 97
Wed-Ses1-P1-4 116
Thu-Ses2-P4-10 175
Foster, Kylie . . . . . . . . . . Mon-Ses2-O2-4 49
Fougeron, Cécile . . . . . . Wed-Ses3-P1-1 142
Fousek, Petr . . . . . . . . . . Mon-Ses2-O3-6 50
Mon-Ses2-P3-5 56
Fraile, Rubén . . . . . . . . . Tue-Ses1-S2-7 87
Fraj, Samia . . . . . . . . . . . . Thu-Ses2-P1-4 169
Frankel, Joe . . . . . . . . . . . Wed-Ses2-P4-10 136
Wed-Ses2-P4-11 137
Wed-Ses2-P4-12 137
Frauenfelder, Ulrich H. Wed-Ses3-P1-1 142
Frissora, Michael . . . . . Mon-Ses3-O4-4 65
Frolova, Olga V. . . . . . . Wed-Ses1-P2-11 119
Fu, Qian-Jie . . . . . . . . . . . Wed-Ses1-O4-6 115
Fujie, Shinya . . . . . . . . . . Mon-Ses2-P4-4 58
Fujimoto, Masakiyo . . Tue-Ses2-P4-4 98
Fujinaga, Tsuyoshi . . . Tue-Ses3-P4-4 109
Fujisaki, Hiroya . . . . . . . Wed-Ses4-P4-3 148
Fukuda, Takashi . . . . . . Mon-Ses2-O1-5 48
Funakoshi, Kotaro . . . . Thu-Ses1-P4-8 161
Thu-Ses1-P4-9 161
Furui, Sadaoki . . . . . . . . Mon-Ses1-K-1 47
Mon-Ses3-P3-10 71
Tue-Ses1-O3-6 77
Tue-Ses2-P4-9 99
Wed-Ses3-O3-5 140
Furuya, Ken’ichi . . . . . . Tue-Ses3-P1-4 102
Futagi, Yoko . . . . . . . . . . Mon-Ses3-P4-5 72
G
Gabbouj, Moncef . . . . .
Gajšek, Rok . . . . . . . . . . .
Gakuru, Mucemi . . . . . .
Wed-Ses1-P3-7 121
Thu-Ses1-P2-8 157
Wed-Ses2-P1-2 129
Wed-Ses2-P3-5 134
Gangashetty, S.V. . . . . .
Gao, Jie . . . . . . . . . . . . . . .
Garcia, Jose Enrique . .
Garcia, Luz . . . . . . . . . . .
Garcia-Mateo, Carmen
García-Moral, Ana I. . .
Garg, Nikhil . . . . . . . . . . .
Garner, Philip N. . . . . . .
Gašić, M. . . . . . . . . . . . . . .
Gates, Stephen C. . . . . .
Gatica-Perez, Daniel . .
Gaudrain, Etienne . . . .
Gauvain, Jean-Luc . . . .
Gay, Sandrine . . . . . . . . .
Ge, Fengpei . . . . . . . . . . .
Gelbart, David . . . . . . . .
Gemello, Roberto . . . . .
Gemmeke, J.F. . . . . . . . .
Georgiou, P.G. . . . . . . . .
Gerosa, Matteo . . . . . . .
Ghai, Shweta . . . . . . . . . .
Ghemawat, Sanjay . . . .
Ghosh, P.K. . . . . . . . . . . .
Gibbon, Dafydd . . . . . .
Gibson, Matthew . . . . .
Gilbert, Mazin . . . . . . . .
Gilmore, L. Donald . . .
Giró, X. . . . . . . . . . . . . . . . .
Gish, Herbert . . . . . . . . .
Giuliani, Diego . . . . . . . .
Glass, James R. . . . . . . .
Glembek, Ondřej . . . . .
Glenn, Meghan L. . . . . .
Gobl, Christer . . . . . . . .
Gödde, Florian . . . . . . . .
Godino-Llorente, J.I. . .
Godoy, Elizabeth . . . . .
Goel, Vaibhava . . . . . . .
Goerick, Christian . . . .
Goldman, Jean P. . . . . .
Goldstein, Louis . . . . . .
Gollan, Christian . . . . .
Gomez, Randy . . . . . . . .
Goncharoff, Vladimir .
Gonina, Ekaterina . . . .
Gonzalez-Rodriguez, J
Gonzalvo, Xavi . . . . . . .
Goodwin, Matthew . . .
Goto, Masataka . . . . . . .
Goudbeek, Martijn . . .
Gracia, Sergio . . . . . . . . .
Graciarena, Martin . . .
Gráczi, Tekla Etelka . .
Graff, David . . . . . . . . . . .
Mon-Ses3-O1-3
Mon-Ses3-O1-4
Tue-Ses2-P4-8
Wed-Ses1-O1-6
Wed-Ses3-P3-2
Thu-Ses1-O1-3
Thu-Ses1-P3-4
Thu-Ses1-O4-6
Tue-Ses1-O4-5
Tue-Ses1-O4-5
Thu-Ses1-P1-2
Thu-Ses2-O3-1
Thu-Ses2-P2-3
Tue-Ses2-P1-6
Wed-Ses2-P4-6
Wed-Ses2-P4-1
Thu-Ses1-P1-1
Mon-Ses2-O1-4
Tue-Ses1-P3-2
Tue-Ses1-P2-7
Tue-Ses3-P4-8
Wed-Ses2-P4-7
Thu-Ses1-P4-5
Thu-Ses1-P4-10
Wed-Ses3-O3-6
Mon-Ses2-P1-7
Mon-Ses2-O3-6
Wed-Ses3-O1-5
Mon-Ses2-O3-6
Thu-Ses2-P3-7
Thu-Ses2-P2-6
Mon-Ses2-O1-4
Tue-Ses2-P3-9
Tue-Ses2-P4-2
Mon-Ses3-O4-6
Tue-Ses1-S2-10
Mon-Ses3-P3-8
Wed-Ses1-O3-3
Mon-Ses3-O1-1
Mon-Ses3-O4-6
Thu-Ses2-O2-2
Mon-Ses3-P2-4
Wed-Ses1-P3-11
Wed-Ses1-S1-1
Mon-Ses3-S1-5
Tue-Ses2-P2-7
Wed-Ses2-O3-6
Tue-Ses1-S2-10
Mon-Ses2-P3-6
Wed-Ses3-O1-4
Wed-Ses3-P2-4
Thu-Ses2-O4-3
Wed-Ses2-P1-5
Thu-Ses2-O1-4
Tue-Ses1-S2-7
Wed-Ses1-O4-2
Mon-Ses2-P3-1
Mon-Ses2-P3-5
Tue-Ses3-P2-12
Wed-Ses2-S1-6
Wed-Ses1-O2-1
Tue-Ses1-O2-1
Tue-Ses1-O2-5
Thu-Ses1-S1-3
Thu-Ses1-S1-4
Thu-Ses1-S1-6
Thu-Ses2-O2-2
Mon-Ses2-O3-4
Wed-Ses2-P4-5
Thu-Ses1-P3-5
Tue-Ses2-P4-1
Tue-Ses3-P1-1
Tue-Ses2-P3-3
Wed-Ses2-P1-3
Wed-Ses3-P2-6
Mon-Ses3-O3-5
Tue-Ses3-P3-8
Tue-Ses1-P2-8
Tue-Ses2-P2-6
Tue-Ses3-P4-6
Wed-Ses1-O2-1
Wed-Ses2-P3-6
Wed-Ses2-P2-4
Mon-Ses3-P1-10
Thu-Ses2-O4-6
56
62
62
99
112
146
151
158
154
78
78
155
165
171
92
136
135
155
48
82
81
109
136
160
161
141
52
50
138
50
173
171
48
97
98
65
87
70
114
61
65
164
67
121
124
74
94
128
87
56
138
144
166
130
164
87
115
55
56
106
138
112
76
76
162
162
163
164
50
136
158
98
102
96
130
144
64
107
81
94
109
112
134
131
67
167
Granström, Björn . . . . . Mon-Ses2-P4-12
Gravano, Agustín . . . . .
Gravier, Guillaume . . .
Green, P.D. . . . . . . . . . . . .
Greenberg, Craig S. . . .
Grenez, Francis . . . . . . .
Grézl, František . . . . . .
Grieco, J.J. . . . . . . . . . . . .
Griffiths, Thomas L. . .
Grigoriev, Aleks S. . . . .
Grimaldi, Marco . . . . . .
Grimaldi, Mirko . . . . . .
Griol, David . . . . . . . . . . .
Grishman, Ralph . . . . .
Gruenstein, Alexander
Gruhn, Rainer . . . . . . . .
Gu, Lingyun . . . . . . . . . . .
Guan, Yong . . . . . . . . . . .
Gubian, Michele . . . . . .
Gudnason, Jon . . . . . . . .
Guenther, Frank H. . . .
Gupta, Sanjeev . . . . . . .
Gustafson, Joakim . . .
Guterman, Hugo . . . . . .
Gutiérrez, Juana M. . .
Gutkin, Alexander . . . .
Tue-Ses3-P3-5
Tue-Ses2-O2-6
Mon-Ses3-O1-5
Thu-Ses1-O4-6
Thu-Ses2-O3-6
Tue-Ses3-P3-1
Thu-Ses1-O4-5
Thu-Ses2-O4-6
Tue-Ses1-S1-3
Tue-Ses1-S2-8
Thu-Ses2-P1-4
Thu-Ses2-P2-1
Wed-Ses1-P4-3
Tue-Ses0-K-1
Wed-Ses1-P2-11
Tue-Ses1-P4-9
Wed-Ses1-P1-9
Mon-Ses2-P4-6
Thu-Ses2-O1-1
Thu-Ses1-P4-1
Thu-Ses2-P4-2
Thu-Ses2-P3-3
Thu-Ses1-O3-4
Mon-Ses3-O3-6
Wed-Ses3-O2-1
Mon-Ses2-O4-3
Mon-Ses3-S1-3
Tue-Ses2-P1-5
Mon-Ses2-P4-12
Tue-Ses1-P4-8
Tue-Ses1-S2-7
Mon-Ses3-O3-5
59
107
89
62
154
166
106
154
167
86
87
169
170
122
47
119
85
117
58
163
159
174
172
153
64
139
51
73
92
59
85
87
64
H
Haderlein, Tino . . . . . . . Tue-Ses1-S2-6 86
Haeb-Umbach, R. . . . . . Tue-Ses2-P2-11 95
Tue-Ses2-P4-3 98
Wed-Ses3-P3-5 147
Hahn, Stefan . . . . . . . . . . Thu-Ses1-P4-6 160
Thu-Ses1-P4-7 160
Hain, Thomas . . . . . . . . . Wed-Ses2-P4-7 136
Haji, Tomoyuki . . . . . . . Thu-Ses2-P1-9 169
Haji Abolhassani, I. . . Tue-Ses3-P1-16 104
Hakkani-Tür, Dilek . . . Tue-Ses3-P4-8 109
Tue-Ses3-P4-9 109
Thu-Ses1-P4-1 159
Thu-Ses1-P4-2 160
Thu-Ses1-P4-3 160
Hakulinen, Jaakko . . . . Tue-Ses3-P3-4 107
Thu-Ses1-O4-2 154
Hallé, Pierre . . . . . . . . . . Mon-Ses2-P1-6 52
Han, Kyu J. . . . . . . . . . . . Tue-Ses2-O4-6 91
Thu-Ses1-O3-3 153
Haneda, Yoichi . . . . . . . Tue-Ses3-P1-4 102
Hans, Stéphane . . . . . . . Tue-Ses1-S2-9 87
Hansakunbuntheung, C Thu-Ses2-O5-1 167
Hansen, Eric A. . . . . . . . Thu-Ses2-P4-6 174
Hansen, John H.L. . . . . Mon-Ses2-P2-3 53
Tue-Ses1-P3-7 82
Tue-Ses1-P4-3 84
Tue-Ses2-P4-6 98
Tue-Ses3-P1-14 104
Tue-Ses3-P1-15 104
Wed-Ses2-P2-5 132
Wed-Ses3-P2-13 146
Wed-Ses3-P3-6 147
Hansen, Mervi . . . . . . . . Tue-Ses3-P3-4 107
Haokip, D. Mary Kim . Mon-Ses3-P2-4 67
Harb, Boulos . . . . . . . . . . Mon-Ses3-O1-1 61
Harish, A.N. . . . . . . . . . . . Tue-Ses2-P1-10 93
Harish, D. . . . . . . . . . . . . . Thu-Ses1-P1-8 156
Harlow, Ray . . . . . . . . . . . Tue-Ses3-S2-5 111
Harnsberger, James . . Thu-Ses2-P1-8 169
Harper, Mary . . . . . . . . . Thu-Ses1-P3-2 158
Thu-Ses1-P4-1 159
Harrison, Alissa . . . . . . Wed-Ses1-P2-5 118
Hartard, Felix . . . . . . . . . Thu-Ses2-O1-4 164
Harte, Naomi . . . . . . . . . Tue-Ses2-P1-13 93
Hartmann, William . . . Wed-Ses1-P1-4 116
Hasegawa-Johnson, M. Tue-Ses3-P3-7 107
Wed-Ses2-O2-4 127
Thu-Ses1-O2-6 152
Thu-Ses1-S1-4 162
Thu-Ses2-O2-3 164
Hashimoto, Kei . . . . . . .
Hassan, Ali . . . . . . . . . . .
Hawkins, Sarah . . . . . . .
Hayashi, Kyohei . . . . . .
Hazen, Timothy J. . . . .
He, Guan-min . . . . . . . . .
Heaton, James T. . . . . .
Hecht, Ron M. . . . . . . . .
Heckmann, Martin . . . .
Heeren, W. . . . . . . . . . . . .
Heigold, Georg . . . . . . .
Heimonen, Tomi . . . . .
Heinrich, Antje . . . . . . .
Heinrich, Christian . . .
Heintz, Ilana . . . . . . . . . .
Helander, Elina . . . . . . .
Heldner, Mattias . . . . . .
Hella, Juho . . . . . . . . . . . .
Hendriks, Richard C. .
Hennebert, Jean . . . . . .
Heracleous, Panikos . .
Hermansky, Hynek . . .
Hernáez, Inmaculada
Hernández, Gabriel . .
Hernández, Luis A. . . .
Hernando, J. . . . . . . . . . .
Hershey, John R. . . . . .
Hestvik, Arild . . . . . . . . .
Heusdens, Richard . . .
Heylen, Dirk . . . . . . . . . .
Hezroni, Omer . . . . . . . .
Hieronymus, J.L. . . . . . .
Higgins, Derrick . . . . . .
Hill, Harold . . . . . . . . . . .
Hines, Andrew . . . . . . . .
Hinrichs, Erhard . . . . . .
Hioka, Yusuke . . . . . . . .
Hirose, Keikichi . . . . . .
Hirsch, Fabrice . . . . . . .
Hirsch, Hans-Günter .
Hirschberg, Julia . . . . .
Hirst, Daniel . . . . . . . . . .
Hoeks, John C.J. . . . . . .
Hoen, Michel . . . . . . . . . .
Hofer, Gregor . . . . . . . . .
Hoffmann, Ruediger . .
Hoffmann, Sarah . . . . .
Hoffmeister, Björn . . .
Höge, Harald . . . . . . . . .
Holzrichter, John F. . .
Hong, Hyejin . . . . . . . . .
Hong, Jung Ook . . . . . .
Hönig, F. . . . . . . . . . . . . . .
Tue-Ses1-O1-6
Wed-Ses1-P3-1
Wed-Ses2-P2-11
Tue-Ses1-P2-1
Tue-Ses1-P1-2
Wed-Ses2-P4-13
Tue-Ses2-P4-5
Mon-Ses3-S1-5
Mon-Ses2-P2-6
Mon-Ses2-P2-7
Wed-Ses1-O1-5
Wed-Ses2-P2-8
Wed-Ses2-S1-6
Wed-Ses4-P4-1
Mon-Ses2-P3-2
Wed-Ses2-P4-4
Wed-Ses2-P4-5
Thu-Ses1-P4-7
Tue-Ses3-P3-4
Thu-Ses1-O4-2
Tue-Ses1-P2-1
Tue-Ses2-O1-3
Tue-Ses1-O2-2
Wed-Ses1-P3-7
Tue-Ses1-O3-5
Thu-Ses2-O1-2
Tue-Ses3-P3-4
Thu-Ses1-O4-2
Tue-Ses3-P1-3
Wed-Ses2-O4-3
Mon-Ses2-P2-1
Tue-Ses3-P2-2
Wed-Ses3-O4-3
Mon-Ses2-O3-2
Thu-Ses1-P1-2
Thu-Ses2-O3-1
Thu-Ses2-P2-3
Thu-Ses2-P2-10
Mon-Ses2-S1-6
Tue-Ses2-P1-2
Wed-Ses3-P2-5
Tue-Ses3-P3-9
Tue-Ses2-P2-7
Mon-Ses2-P3-1
Tue-Ses3-P1-6
Wed-Ses1-P2-1
Tue-Ses3-P1-3
Wed-Ses2-O4-3
Wed-Ses2-S1-1
Mon-Ses2-P2-6
Mon-Ses2-P2-7
Mon-Ses3-O1-4
Mon-Ses3-P4-5
Wed-Ses3-O4-4
Tue-Ses2-P1-13
Wed-Ses1-P4-1
Tue-Ses3-P1-4
Mon-Ses2-P4-15
Mon-Ses3-P2-10
Mon-Ses3-P4-6
Wed-Ses2-P3-1
Wed-Ses3-O2-6
Wed-Ses4-P4-3
Thu-Ses2-P4-8
Mon-Ses2-O2-2
Mon-Ses3-P3-7
Mon-Ses2-P2-12
Tue-Ses2-O2-6
Wed-Ses1-P2-7
Thu-Ses2-O1-2
Tue-Ses3-S2-1
Wed-Ses3-O2-3
Tue-Ses2-O2-1
Mon-Ses2-P1-4
Wed-Ses2-O1-5
Wed-Ses1-P3-13
Tue-Ses1-O4-5
Mon-Ses2-O4-1
Mon-Ses2-P3-10
Tue-Ses2-P3-5
Wed-Ses2-P4-4
Wed-Ses2-P4-5
Thu-Ses2-P2-4
Mon-Ses3-S1-1
Wed-Ses1-P1-5
Mon-Ses2-O4-4
Mon-Ses3-P4-4
75
120
133
80
79
137
98
74
54
54
112
132
138
147
55
135
136
160
107
154
80
88
76
121
77
163
107
154
102
129
53
104
141
50
155
165
171
172
61
91
144
108
94
55
102
118
102
129
137
54
54
62
72
141
93
122
102
59
68
72
133
140
148
175
49
70
55
89
119
163
110
139
88
52
126
122
78
50
57
96
135
136
171
73
116
51
72
Honma, Daisuke . . . . . .
Hoque, Mohammed E.
Hori, Chiori . . . . . . . . . . .
Horiuchi, Yasuo . . . . . .
Hosom, John-Paul . . . .
House, David . . . . . . . . .
Houtepen, Véronique
Howard, David M. . . . .
Howard, Ian S. . . . . . . . .
Hsiao, Roger . . . . . . . . . .
Hu, Hongbing . . . . . . . . .
Hu, Rile . . . . . . . . . . . . . . .
Huang, Jing . . . . . . . . . . .
Huang, Shen . . . . . . . . . .
Huang, Songfang . . . . .
Huang, Thomas S. . . . .
Hubeika, Valiantsina .
Huckvale, Mark . . . . . . .
Hudson, Toby . . . . . . . .
Hueber, Thomas . . . . . .
Huerta, Juan M. . . . . . . .
Huijbregts, Marijn . . . .
Hung, Jeih-weih . . . . . .
Hunter, Peter . . . . . . . . .
Hwang, Hsin-Te . . . . . .
Tue-Ses3-P2-6
Tue-Ses3-P3-8
Mon-Ses2-P4-5
Wed-Ses1-P4-11
Wed-Ses3-P2-12
Wed-Ses1-P2-10
Wed-Ses1-P4-1
Wed-Ses4-P4-8
Tue-Ses2-O2-5
Tue-Ses1-P1-4
Tue-Ses1-O2-4
Tue-Ses1-O1-2
Tue-Ses2-P1-1
Mon-Ses3-O3-6
Tue-Ses3-P2-1
Wed-Ses2-O2-6
Thu-Ses1-P3-9
Tue-Ses3-O3-6
Tue-Ses3-O3-1
Wed-Ses3-O1-4
Wed-Ses3-P2-4
Tue-Ses1-O2-4
Tue-Ses3-P1-11
Wed-Ses3-P1-10
Mon-Ses3-S1-4
Mon-Ses3-O4-4
Tue-Ses1-P4-10
Tue-Ses2-O4-5
Thu-Ses1-O4-4
Tue-Ses2-P4-5
Mon-Ses2-O2-4
Thu-Ses1-P2-5
105
107
58
124
145
119
122
148
89
79
76
75
91
64
104
127
159
100
99
138
144
76
103
143
74
65
85
91
154
98
49
157
I
Ichikawa, Osamu . . . . . Mon-Ses2-O1-5 48
Iimura, Miki . . . . . . . . . . Mon-Ses3-P4-7 72
Ijima, Yusuke . . . . . . . . . Mon-Ses3-P3-4 70
Imaizumi, Satoshi . . . . Thu-Ses2-P1-9 169
Ingram, John . . . . . . . . . Thu-Ses1-O2-4 152
Irino, Toshio . . . . . . . . . . Mon-Ses2-P1-2 52
Thu-Ses1-P2-6 157
Iriondo, Ignasi . . . . . . . . Mon-Ses2-S1-2 60
Mon-Ses3-O3-5 64
Isard, Stephen . . . . . . . . Wed-Ses1-P1-3 116
Ishizuka, Kentaro . . . . Tue-Ses2-P4-4 98
Issing, Jochen . . . . . . . . Thu-Ses1-P1-9 156
Itahashi, Shuichi . . . . . . Mon-Ses3-P1-6 66
Ito, Akinori . . . . . . . . . . . Mon-Ses2-P1-1 51
Mon-Ses3-P4-3 72
Tue-Ses3-P2-6 105
Ito, Kiwako . . . . . . . . . . . Thu-Ses1-O2-3 152
Ito, Masashi . . . . . . . . . . . Mon-Ses2-P1-1 51
Mon-Ses3-P4-3 72
Itoh, Toshihiko . . . . . . . Wed-Ses1-P4-9 123
Ivanov, George . . . . . . . Mon-Ses3-P4-5 72
Iwahashi, Naoto . . . . . . Wed-Ses3-S1-5 150
Thu-Ses1-P4-8 161
Iwano, Koji . . . . . . . . . . . Wed-Ses3-O3-5 140
Iyer, Nandini . . . . . . . . . . Mon-Ses3-O2-5 63
Izdebski, Krzysztof . . Tue-Ses1-S1-1 86
Izumi, Yosuke . . . . . . . . Wed-Ses2-O4-5 129
Jin, Zhaozhang . . . . . . .
Johnson, Ralph . . . . . . .
Jonsson, Ing-Marie . . .
Jorge, Juliana . . . . . . . . .
Joshi, Shrikant . . . . . . . .
Josse, Yvan . . . . . . . . . . .
Jou, Szu-Chen Stan . . .
Joublin, Frank . . . . . . . .
Ju, Yun-Cheng . . . . . . . .
Tue-Ses1-P3-6
Tue-Ses3-P1-1
Thu-Ses2-O1-6
Tue-Ses1-S2-11
Thu-Ses1-O3-2
Mon-Ses2-O3-6
Mon-Ses3-S1-6
Wed-Ses2-S1-6
Tue-Ses2-O1-2
Tue-Ses2-O1-4
Jurčíček, F. . . . . . . . . . . . . Thu-Ses1-P4-5
Jyothi, Preethi . . . . . . . . Tue-Ses2-P3-10
K
K., Sri Rama Murty . . .
K., Sudheer Kumar . . .
Kaburagi, Tokihiko . . .
Kagomiya, Takayuki . .
Kagoshima, Takehiko
Kahn, Jeremy G. . . . . . .
Kahn, Juliette . . . . . . . . .
Kain, Alexander . . . . . .
Kaino, Tomomi . . . . . . .
Kajarekar, Sachin . . . . .
Kalaldeh, Raya . . . . . . . .
Kalgaonkar, Kaustubh
Kalinli, Ozlem . . . . . . . .
Kallasjoki, Heikki . . . . .
Kamaruddin, N. . . . . . .
Kang, Shiyin . . . . . . . . . .
Karafiát, Martin . . . . . . .
Karakos, Damianos . . .
Karam, Zahi N. . . . . . . .
Karhila, Reima . . . . . . . .
Karlsson, Anastasia . .
Karpov, Alexey . . . . . . .
Kashioka, Hideki . . . . .
Kataoka, Akitoshi . . . .
Kato, Hiroaki . . . . . . . . .
Kato, Masaharu . . . . . . .
Kato, Tomoyuki . . . . . .
Katsouros, Vassilis . . .
Katsumaru, Masaki . . .
Katsurada, Kouichi . . .
Kaufmann, Tobias . . . .
Kaushik, Lakshmish . .
Kawaguchi, Hiroshi . .
Kawahara, Hideki . . . . .
Kawahara, Tatsuya . . .
J
Jan, Ea-Ee . . . . . . . . . . . . . Mon-Ses2-P4-11
Jänsch, Klaus . . . . . . . . .
Jansen, Aren . . . . . . . . . .
Järvikivi, Juhani . . . . . .
Jelinek, Frederick . . . . .
Jemaa, Imen . . . . . . . . . .
Jensen, Jesper . . . . . . . .
Jensen, Søren Holdt . .
Jeon, HyeonBae . . . . . . .
Jeon, Je Hun . . . . . . . . . .
Jeong, Yongwon . . . . . .
Jesus, Luis M.T. . . . . . . .
Jia, Huibin . . . . . . . . . . . .
Jiampojamarn, S. . . . . .
Jiang, Hui . . . . . . . . . . . . .
Jin, Qin . . . . . . . . . . . . . . .
Mon-Ses3-O4-3
Mon-Ses3-O4-4
Wed-Ses1-P4-2
Thu-Ses1-S1-5
Wed-Ses1-P2-2
Tue-Ses2-P3-6
Wed-Ses2-O3-5
Wed-Ses1-P1-6
Tue-Ses3-P1-3
Wed-Ses2-O4-3
Tue-Ses3-P3-11
Tue-Ses2-O1-1
Mon-Ses2-P2-5
Mon-Ses3-P3-3
Mon-Ses3-P1-7
Tue-Ses1-S2-11
Tue-Ses2-P1-9
Tue-Ses3-O4-5
Tue-Ses1-O1-4
Wed-Ses1-O4-6
Tue-Ses1-P4-5
Tue-Ses1-P4-7
59
65
65
122
163
118
96
128
116
102
129
108
87
54
69
66
87
93
101
75
115
84
85
82
102
164
87
153
50
74
138
88
88
160
97
Kawai, Hisashi . . . . . . . .
Kawamura, Akinori . . .
Kawano, Hiroshi . . . . . .
Keane, Elinor . . . . . . . . .
Keegan, Peter . . . . . . . . .
Keizer, S. . . . . . . . . . . . . . .
Kelling, Martin . . . . . . . .
Kempton, Timothy . . .
Kenmochi, Hideki . . . .
Kennedy, Philip R. . . . .
Kenny, Patrick . . . . . . . .
Kesheorey, M.R. . . . . . .
Kessens, Judith . . . . . . .
Ketabdar, Hamed . . . . .
Keutzer, Kurt . . . . . . . . .
Khudanpur, Sanjeev . .
Kida, Yusuke . . . . . . . . .
Kim, Byeongchang . . . .
Kim, Chanwoo . . . . . . . .
Kim, D.K. . . . . . . . . . . . . .
Wed-Ses1-O2-5
Wed-Ses1-O2-5
Tue-Ses1-P1-3
Thu-Ses2-P1-6
Wed-Ses2-P3-10
Tue-Ses2-O3-4
Wed-Ses3-P2-14
Tue-Ses1-O4-1
Mon-Ses3-S1-2
Wed-Ses1-O1-1
Wed-Ses1-O1-4
Wed-Ses2-P2-4
Wed-Ses4-P4-9
Wed-Ses2-O4-1
Wed-Ses2-O3-4
Tue-Ses3-P1-2
Wed-Ses2-P1-8
Mon-Ses3-O3-4
Wed-Ses1-P3-5
Mon-Ses2-O3-2
Wed-Ses2-P4-7
Wed-Ses3-P2-4
Thu-Ses2-P2-1
Tue-Ses1-P3-4
Wed-Ses1-O1-2
Wed-Ses3-O1-6
Wed-Ses3-P2-10
Mon-Ses3-O3-6
Wed-Ses4-P4-8
Thu-Ses2-P1-5
Mon-Ses2-P4-5
Wed-Ses1-P4-11
Wed-Ses3-S1-5
Tue-Ses3-P1-4
Tue-Ses3-S2-6
Wed-Ses1-P2-13
Thu-Ses2-O5-1
Wed-Ses3-P3-1
Wed-Ses1-P4-10
Tue-Ses2-O4-2
Thu-Ses1-P4-9
Wed-Ses2-P4-14
Mon-Ses3-O1-2
Thu-Ses2-O5-4
Tue-Ses3-P4-4
Thu-Ses1-P2-6
Mon-Ses2-O3-3
Tue-Ses2-P2-6
Tue-Ses2-P4-1
Tue-Ses3-P4-7
Tue-Ses1-O4-6
Thu-Ses2-P2-7
Tue-Ses2-O3-2
Tue-Ses3-S2-4
Tue-Ses3-S2-5
Thu-Ses1-P4-5
Tue-Ses2-P2-11
Wed-Ses1-P1-2
Tue-Ses3-P1-8
Mon-Ses3-S1-3
Tue-Ses2-O4-1
Wed-Ses1-O1-3
Tue-Ses2-P1-5
Thu-Ses1-O4-3
Mon-Ses2-S1-8
Tue-Ses2-P3-3
Tue-Ses2-P3-6
Thu-Ses2-P2-7
Tue-Ses3-O4-2
Mon-Ses2-O1-1
Thu-Ses1-O1-2
Wed-Ses3-P3-2
113
113
79
169
134
90
146
78
73
111
112
131
148
128
128
102
130
64
120
50
136
144
170
82
111
139
145
64
148
169
58
124
150
102
111
119
167
146
124
90
161
137
62
168
109
157
50
94
98
109
78
171
90
110
111
160
95
116
103
73
90
112
92
154
61
96
96
171
101
47
151
146
Kim, Gibak . . . . . . . . . . . . Tue-Ses1-P3-3 82
Kim, Hong Kook . . . . . . Thu-Ses1-P1-10 156
Kim, Hyung Soon . . . . . Mon-Ses3-P3-3 69
Wed-Ses3-O3-1 140
Kim, Jangwon . . . . . . . . Wed-Ses2-P1-7 130
Kim, Jeesun . . . . . . . . . . . Mon-Ses3-O2-2 63
Wed-Ses3-O4-4 141
Kim, Namhoon . . . . . . . Wed-Ses3-O3-2 140
Kim, Wooil . . . . . . . . . . . . Wed-Ses2-P2-5 132
Wed-Ses3-P3-6 147
Kim, Yeon-Jun . . . . . . . . Tue-Ses3-P4-11 110
Kim, Yoon-Chul . . . . . . Tue-Ses1-O2-5 76
King, Jeanette . . . . . . . . Tue-Ses3-S2-5 111
King, Simon . . . . . . . . . . . Mon-Ses3-O3-6 64
Tue-Ses3-P2-4 105
Wed-Ses2-P3-11 135
Wed-Ses2-P4-10 136
Wed-Ses2-P4-11 137
Wed-Ses2-P4-12 137
Thu-Ses1-P2-1 156
Kiss, Géza . . . . . . . . . . . . Mon-Ses3-P4-9 73
Kitaoka, Norihide . . . . . Wed-Ses1-P4-9 123
Kitzig, Andreas . . . . . . . Mon-Ses3-P3-7 70
Kjems, Ulrik . . . . . . . . . . Wed-Ses2-O4-3 129
Kleber, Felicitas . . . . . . Wed-Ses1-P2-4 118
Kleijn, W. Bastiaan . . . Thu-Ses2-O5-2 167
Thu-Ses2-P2-11 172
Klessa, Katarzyna . . . . Wed-Ses1-P4-4 123
Knill, Kate . . . . . . . . . . . . Tue-Ses3-P2-8 105
Thu-Ses1-P3-3 158
Ko, Tom . . . . . . . . . . . . . . Tue-Ses2-P3-12 97
Kobashikawa, Satoshi Wed-Ses1-O3-5 114
Kobayashi, Takao . . . . Mon-Ses3-P3-4 70
Thu-Ses1-P2-2 156
Kobayashi, Tetsunori . Mon-Ses2-P4-4 58
Kochanski, Greg . . . . . . Tue-Ses3-S2-4 110
Thu-Ses2-O5-3 167
Kockmann, Marcel . . . Mon-Ses2-S1-10 61
Wed-Ses3-P2-4 144
Köhler, Joachim . . . . . . Wed-Ses2-P4-9 136
Kojima, Hiroaki . . . . . . . Tue-Ses3-P2-11 106
Kokkinakis, George . . Wed-Ses2-O2-1 126
Kolhatkar, Varada . . . . Thu-Ses2-P1-8 169
Kollmeier, Birger . . . . . Thu-Ses1-S1-2 162
Kolossa, Dorothea . . . . Thu-Ses1-O1-1 150
Komatani, Kazunori . . Mon-Ses2-P4-1 57
Thu-Ses1-P4-9 161
Kombrink, Stefan . . . . . Mon-Ses2-O3-2 50
Kondo, Kazunobu . . . . Tue-Ses3-P1-8 103
Kondo, Mariko . . . . . . . . Wed-Ses1-P2-5 118
Kondoz, Ahmet . . . . . . Thu-Ses1-P1-6 155
Kondrak, Grzegorz . . . Tue-Ses3-O4-5 101
Koniaris, Christos . . . . Thu-Ses2-P2-11 172
Konno, Tomoaki . . . . . . Mon-Ses3-P4-3 72
Korchagin, Danil . . . . . Wed-Ses2-P4-7 136
Koreman, Jacques . . . . Wed-Ses3-P1-2 142
Körner, E. . . . . . . . . . . . . . Mon-Ses3-P4-4 72
Kosaka, Tetsuo . . . . . . . Wed-Ses3-P3-1 146
Kousidis, Spyros . . . . . Wed-Ses2-S1-4 137
Krauwer, Steven . . . . . . Wed-Ses1-P4-1 122
Kreiman, Jody . . . . . . . . Thu-Ses2-P1-1 168
Křen, Michal . . . . . . . . . . Wed-Ses1-P4-5 123
Kristensson, Per Ola . Wed-Ses1-S1-2 125
Krňoul, Zdeněk . . . . . . . Thu-Ses2-P1-5 169
Kroos, Christian . . . . . . Tue-Ses1-P1-6 79
Wed-Ses3-O4-4 141
Krueger, Alexander . . . Tue-Ses2-P4-3 98
Kua, Jia Min Karen . . . Thu-Ses2-O5-5 168
Kühnel, Christine . . . . . Mon-Ses2-P4-14 59
Kulp, Scott . . . . . . . . . . . . Tue-Ses3-P4-1 108
Kumar, Kshitiz . . . . . . . Wed-Ses3-O4-2 141
Thu-Ses1-O1-2 151
Kunduk, Melda . . . . . . . Tue-Ses1-S1-1 86
Kunikoshi, Aki . . . . . . . . Mon-Ses2-P4-15 59
Kuo, Hong-Kwang . . . . Mon-Ses2-P4-11 59
Kurashima, Atsuko . . . Thu-Ses1-O4-1 154
Kurimo, Mikko . . . . . . . Mon-Ses3-O3-6 64
Tue-Ses3-P1-2 102
Kuroiwa, Shingo . . . . . . Wed-Ses3-P2-12 145
Laine, Unto Kalervo . .
Laivo, Tuuli . . . . . . . . . . .
Lamel, Lori . . . . . . . . . . . .
Lane, Ian . . . . . . . . . . . . . .
Lane, Joseph K. . . . . . . .
Langlois, David . . . . . . .
Lapidot, Itshak . . . . . . .
Laprie, Yves . . . . . . . . . .
Laroche, Romain . . . . .
Lasarcyk, Eva . . . . . . . . .
Laskowski, Kornel . . . .
Latorre, Javier . . . . . . . .
Laurent, Antoine . . . . .
Lawless, René . . . . . . . . .
Lawson, A.D. . . . . . . . . . .
Lazaridis, Alexandros
Le, Quoc Anh . . . . . . . . .
Leak, Jayne . . . . . . . . . . .
Lecorvé, Gwénolé . . . . .
Lecouteux, Benjamin .
Lee, Akinobu . . . . . . . . .
Lee, Antonio . . . . . . . . . .
Lee, Chi-Chun . . . . . . . .
Lee, Ching-Hsien . . . . .
Lee, Chin-Hui . . . . . . . . .
Lee, Gary Geunbae . . . .
Lee, Haejoong . . . . . . . .
Lee, Jinsik . . . . . . . . . . . .
Lee, Kong-Aik . . . . . . . . .
Lee, Lin-shan . . . . . . . . .
Lee, Sungbok . . . . . . . . .
Lee, S.W. . . . . . . . . . . . . . .
Lee, Tan . . . . . . . . . . . . . . .
Lee, Yi-Hui . . . . . . . . . . . .
Lee, Young Han . . . . . . .
Lee, YunKeun . . . . . . . . .
Leemann, Adrian . . . . .
Lefèvre, Fabrice . . . . . . .
Lehnen, Patrick . . . . . . .
Lei, Howard . . . . . . . . . . .
Lei, Xin . . . . . . . . . . . . . . . .
Lei, Yun . . . . . . . . . . . . . . .
Lei, Zhenchun . . . . . . . .
Leijon, Arne . . . . . . . . . .
Leman, Adrien . . . . . . . .
Lemnitzer, Lothar . . . .
Lemon, Oliver . . . . . . . .
Lennes, Mietta . . . . . . . .
Leonard, M. . . . . . . . . . . .
Leutnant, Volker . . . . .
Levow, Gina-Anne . . . .
Li, Aijun . . . . . . . . . . . . . .
Li, Baojie . . . . . . . . . . . . . .
Li, Haizhou . . . . . . . . . . .
L
Lacheret-Dujour, A. . . Mon-Ses3-P2-7
Lacroix, Arild . . . . . . . . . Tue-Ses2-P1-12
Laface, Pietro . . . . . . . . . Mon-Ses2-P2-4
68
93
54
Tue-Ses2-P3-9 97
Laganaro, Marina . . . . . Mon-Ses3-P1-2 66
Lai, Catherine . . . . . . . . . Wed-Ses2-P1-1 129
Li, Hongyan . . . . . . . . . . .
Li, Jinyu . . . . . . . . . . . . . . .
Li, Junfeng . . . . . . . . . . . .
Tue-Ses1-P3-5
Tue-Ses2-P2-13
Wed-Ses1-P4-13
Thu-Ses2-P4-3
Tue-Ses3-P3-4
Thu-Ses1-O4-2
Mon-Ses2-O3-6
Mon-Ses2-P1-6
Wed-Ses3-O1-5
Mon-Ses2-P2-11
Tue-Ses3-P3-8
Mon-Ses3-O4-1
Tue-Ses1-P4-8
Mon-Ses2-O2-2
Wed-Ses1-P1-6
Thu-Ses2-O2-4
Wed-Ses3-S1-4
Thu-Ses2-P1-8
Tue-Ses1-O3-5
Thu-Ses2-O1-3
Wed-Ses2-P3-6
Tue-Ses1-O3-1
Mon-Ses3-P4-5
Wed-Ses1-P4-3
Wed-Ses3-P2-11
Thu-Ses2-P1-2
Wed-Ses2-O2-1
Mon-Ses3-P4-10
Tue-Ses3-P1-11
Mon-Ses3-O1-5
Tue-Ses2-P3-4
Wed-Ses1-P3-3
Mon-Ses3-O4-4
Mon-Ses2-S1-3
Wed-Ses2-P1-6
Mon-Ses3-P4-8
Mon-Ses2-P2-2
Wed-Ses1-O3-2
Tue-Ses3-O4-2
Thu-Ses2-O4-3
Tue-Ses3-O4-2
Mon-Ses2-P2-10
Thu-Ses2-P1-13
Mon-Ses2-S1-3
Tue-Ses1-O2-5
Wed-Ses2-O2-2
Wed-Ses2-P1-6
Wed-Ses2-P1-7
Thu-Ses1-S1-6
Tue-Ses3-P1-9
Tue-Ses1-P4-2
Tue-Ses3-P1-9
Thu-Ses2-P3-6
Wed-Ses1-O3-6
Thu-Ses1-P1-10
Tue-Ses2-O1-1
Wed-Ses4-P4-3
Mon-Ses2-P4-9
Thu-Ses1-P4-6
Thu-Ses1-P4-7
Tue-Ses1-P4-1
Wed-Ses3-P2-1
Wed-Ses2-P4-2
Wed-Ses2-P4-3
Wed-Ses3-P2-13
Tue-Ses3-O3-5
Tue-Ses1-P2-3
Tue-Ses2-P2-5
Wed-Ses1-P4-1
Wed-Ses3-S1-6
Wed-Ses1-P2-3
Thu-Ses2-P1-2
Tue-Ses2-P2-11
Wed-Ses3-P3-5
Tue-Ses1-O3-3
Mon-Ses2-O2-1
Tue-Ses1-O1-4
Wed-Ses1-O4-6
Mon-Ses2-P2-10
Tue-Ses1-P4-4
Wed-Ses2-P3-8
Wed-Ses3-O1-2
Thu-Ses2-P2-9
Thu-Ses2-P3-8
Wed-Ses2-O2-6
Wed-Ses1-O3-2
Wed-Ses2-P1-4
82
95
124
174
107
154
50
52
138
55
107
64
85
49
116
165
150
169
77
163
134
77
72
122
145
168
126
73
103
62
96
120
65
60
130
72
53
114
101
166
101
55
170
60
76
126
130
130
163
103
84
103
173
114
156
87
148
58
160
160
84
144
135
135
146
100
81
94
122
150
118
168
95
147
77
48
75
115
55
84
134
138
172
173
127
114
130
Li, Lihong . . . . . . . . . . . . .
Li, Runxin . . . . . . . . . . . . .
Li, Su . . . . . . . . . . . . . . . . . .
Liang, Hui . . . . . . . . . . . . .
Liang, Jiaen . . . . . . . . . . .
Liang, Ruoying . . . . . . .
Libal, Vit . . . . . . . . . . . . . .
Liberman, Mark . . . . . . .
Lilienthal, Janine . . . . .
Lim, Daniel C.Y. . . . . . .
Lin, Hsin-Yi . . . . . . . . . . .
Lin, Hui . . . . . . . . . . . . . . .
Lin, Shih-Hsiang . . . . . .
Linarès, Georges . . . . . .
Lincoln, Mike . . . . . . . . .
Lindberg, Børge . . . . . .
Ling, Zhen-Hua . . . . . . .
Liscombe, Jackson . . .
Litman, Diane . . . . . . . .
Liu, Changliang . . . . . . .
Liu, Chao-Hong . . . . . . .
Liu, Chen . . . . . . . . . . . . .
Liu, Wen . . . . . . . . . . . . . .
Liu, X. . . . . . . . . . . . . . . . . .
Liu, Yang . . . . . . . . . . . . . .
Lleida, Eduardo . . . . . . .
Lo, Yueng-Tien . . . . . . .
Lobanov, Boris . . . . . . . .
Lœvenbruck, Hélène .
Loizou, Philipos C. . . .
Lolive, Damien . . . . . . . .
Longworth, C. . . . . . . . .
Lööf, Jonas . . . . . . . . . . .
Loots, Linsen . . . . . . . . .
Lopez, Eduardo . . . . . . .
López-Cózar, Ramón .
Lopez-Gonzalo, E. . . . .
Lopez-Moreno, Ignacio
Lopez-Otero, Paula . . .
Loukina, Anastassia . .
Lowit, Anja . . . . . . . . . . .
Lu, Jianhua . . . . . . . . . . .
Lu, Xiao Bo . . . . . . . . . . . .
Lu, Xugang . . . . . . . . . . . .
Lu, Youyi . . . . . . . . . . . . .
Lubensky, David . . . . . .
Lucas-Cuesta, J.M. . . . .
Luengo, Iker . . . . . . . . . .
Lugger, Marko . . . . . . . .
Lulich, Steven M. . . . . .
Luo, Dean . . . . . . . . . . . . .
Lutfi, S. . . . . . . . . . . . . . . . .
Lutzky, Manfred . . . . . .
Lyakso, Elena E. . . . . . .
Lyras, Dimitrios P. . . . .
Wed-Ses3-S1-3
Tue-Ses1-P4-5
Mon-Ses2-P1-7
Tue-Ses3-P2-5
Wed-Ses2-O2-6
Mon-Ses2-P3-10
Wed-Ses3-O4-2
Wed-Ses1-P1-3
Wed-Ses3-O2-5
Thu-Ses2-O2-6
Mon-Ses2-P2-11
Tue-Ses1-P2-2
Thu-Ses1-O3-6
Thu-Ses2-O4-4
Tue-Ses3-P4-10
Wed-Ses1-P4-12
Tue-Ses2-P2-9
Tue-Ses2-P2-10
Tue-Ses2-P3-4
Thu-Ses1-P3-10
Wed-Ses2-P4-7
Tue-Ses3-P3-11
Wed-Ses3-O3-3
Mon-Ses3-O3-2
Tue-Ses3-P4-2
Wed-Ses2-P2-1
Wed-Ses3-S1-1
Thu-Ses2-P3-7
Tue-Ses2-O3-6
Thu-Ses2-P4-7
Mon-Ses3-O4-4
Mon-Ses3-O1-3
Mon-Ses3-O1-4
Mon-Ses2-P2-5
Tue-Ses3-P4-9
Mon-Ses2-O1-6
Mon-Ses3-P3-2
Tue-Ses2-P4-7
Tue-Ses3-P2-9
Wed-Ses2-P4-1
Thu-Ses1-P1-1
Tue-Ses3-P4-10
Thu-Ses2-P1-5
Mon-Ses3-S1-8
Tue-Ses1-P3-3
Thu-Ses1-P2-3
Wed-Ses1-O1-6
Mon-Ses2-O3-4
Wed-Ses2-P4-4
Wed-Ses2-P4-5
Mon-Ses2-P2-9
Wed-Ses3-P2-1
Thu-Ses2-O4-5
Tue-Ses1-P4-1
Wed-Ses2-P1-3
Tue-Ses1-P3-2
Tue-Ses3-S2-4
Tue-Ses1-P1-10
Wed-Ses3-P3-8
Mon-Ses2-O2-4
Thu-Ses1-O1-4
Tue-Ses1-P1-12
Mon-Ses2-P4-11
Mon-Ses3-O4-4
Mon-Ses2-S1-7
Wed-Ses3-S1-2
Mon-Ses2-S1-6
Wed-Ses2-P1-9
Mon-Ses3-P1-10
Mon-Ses3-P4-6
Mon-Ses2-S1-7
Thu-Ses1-P1-9
Wed-Ses1-P2-11
Wed-Ses2-O2-1
150
84
52
105
127
57
141
116
139
165
55
81
154
167
110
124
95
95
96
159
136
108
140
64
108
131
149
173
90
175
65
62
62
54
109
48
69
99
106
135
155
110
169
74
82
157
112
50
135
136
55
144
167
84
130
82
110
80
147
49
151
80
59
65
61
150
61
131
67
72
61
156
119
126
M
M., Anand Joseph . . . . Tue-Ses2-P1-5 92
M., Sri Harish Reddy . Wed-Ses1-O2-5 113
Ma, Bin . . . . . . . . . . . . . . . . Mon-Ses2-P2-10 55
Tue-Ses1-P4-4 84
Wed-Ses3-O1-2 138
Thu-Ses2-P2-9 172
Ma, Joan Ka-Yin . . . . . . Wed-Ses1-P1-12 117
Ma, Xuebin . . . . . . . . . . . . Wed-Ses3-O2-6 140
Ma, Zhanyu . . . . . . . . . . . Tue-Ses1-P2-3 81
Maassen, Ben . . . . . . . . .
Mac, Dang-Khoa . . . . . .
Machač, Pavel . . . . . . . . .
Macias-Guarasa, J. . . . .
Maclagan, Margaret . .
Madsack, Andreas . . . .
Magimai-Doss, M. . . . . .
Maia, Ranniery . . . . . . . .
Maier, Andreas . . . . . . .
Maier, Stefan . . . . . . . . .
Maier, Viktoria . . . . . . . .
Maina, Ciira wa . . . . . . .
Mairesse, F. . . . . . . . . . . .
Mak, Brian . . . . . . . . . . . .
Mak, Man-Wai . . . . . . . . .
Makhoul, John . . . . . . . .
Mäkinen, Erno . . . . . . . .
Makino, Shozo . . . . . . . .
Malisz, Zofia . . . . . . . . . .
Malkin, Jonathan . . . . .
Mana, Franco . . . . . . . . .
Mandal, Arindam . . . . .
Manna, Amit . . . . . . . . . .
Marcheret, Etienne . . .
Markaki, Maria . . . . . . . .
Martens, Jean-Pierre . .
Martin, Alvin F. . . . . . . .
Martin Mota, Sidney . .
Martins, Paula . . . . . . . .
Martirosian, Olga . . . . .
Maruyama, Hagino . . .
Maryn, Y. . . . . . . . . . . . . .
Maskey, Sameer . . . . . .
Masuko, Takashi . . . . .
Mata, Ana Isabel . . . . . .
Matarić, Maja J. . . . . . . .
Matějka, Pavel . . . . . . . .
Matoušek, Jindřich . . .
Matrouf, Driss . . . . . . . .
Matsubara, Takeshi . .
Matsuda, Shigeki . . . . .
Matsuyama, Kyoko . . .
Matsuyama, Yoichi . . .
Mayr, Robert . . . . . . . . . .
McDermott, Erik . . . . . .
McDonnell, Ciaran . . . .
McGraw, Ian . . . . . . . . . .
McKenna, John . . . . . . .
McKeown, Kathleen R.
McLaren, Mitchell . . . .
Medina, Victoria . . . . . .
Meignier, Sylvain . . . . .
Meireles, A.R. . . . . . . . . .
Meister, Einar . . . . . . . . .
Melamed, I. Dan . . . . . .
Mella, O. . . . . . . . . . . . . . .
Melto, Aleksi . . . . . . . . . .
Meltzner, Geoffrey S. .
Ménard, Lucie . . . . . . . .
Menard, Madeleine . . .
Meng, Helen . . . . . . . . . .
Merlin, Teva . . . . . . . . . .
Mertens, C. . . . . . . . . . . .
Mertens, Timo . . . . . . . .
Mertins, Alfred . . . . . . .
Tue-Ses1-P1-9
Tue-Ses1-P1-10
Wed-Ses3-O4-5
Tue-Ses1-P3-11
Mon-Ses2-P4-8
Mon-Ses2-S1-7
Tue-Ses3-S2-5
Wed-Ses1-P1-7
Wed-Ses3-O3-6
Thu-Ses2-P2-5
Tue-Ses1-O4-6
Wed-Ses1-P3-9
Mon-Ses3-P4-4
Tue-Ses1-S2-6
Tue-Ses1-S2-10
Mon-Ses2-P4-3
Thu-Ses2-P4-1
Tue-Ses3-P1-17
Thu-Ses1-P4-5
Tue-Ses2-P3-12
Wed-Ses3-P2-2
Tue-Ses3-O3-4
Wed-Ses3-P2-2
Tue-Ses1-O3-2
Tue-Ses3-P3-4
Mon-Ses3-P4-3
Tue-Ses3-P2-6
Tue-Ses3-S2-8
Tue-Ses1-O1-1
Mon-Ses2-O1-4
Wed-Ses2-P4-2
Wed-Ses2-P4-3
Mon-Ses2-P2-6
Mon-Ses2-P2-7
Mon-Ses2-P3-5
Wed-Ses3-O4-2
Tue-Ses1-S1-2
Thu-Ses2-P3-1
Thu-Ses2-P3-2
Thu-Ses1-O4-5
Thu-Ses2-O4-6
Wed-Ses1-P1-13
Mon-Ses3-P1-9
Thu-Ses2-O4-2
Thu-Ses2-P1-9
Tue-Ses1-S2-8
Mon-Ses3-O4-3
Tue-Ses3-P4-12
Thu-Ses2-P2-7
Wed-Ses1-P2-6
Wed-Ses1-O2-3
Mon-Ses2-O3-2
Tue-Ses3-O3-1
Wed-Ses3-O1-4
Wed-Ses3-P2-4
Tue-Ses1-P3-11
Mon-Ses2-P2-1
Tue-Ses2-P2-9
Tue-Ses2-P2-10
Mon-Ses3-P3-4
Wed-Ses1-O3-2
Mon-Ses2-P4-1
Mon-Ses2-P4-4
Wed-Ses3-P1-12
Mon-Ses2-P3-4
Wed-Ses2-S1-4
Thu-Ses2-P4-2
Tue-Ses2-P1-7
Thu-Ses1-P4-1
Tue-Ses3-O3-2
Mon-Ses2-P1-8
Tue-Ses1-O3-1
Wed-Ses2-P4-8
Thu-Ses2-P1-12
Mon-Ses3-O2-4
Tue-Ses3-P4-11
Wed-Ses1-P4-6
Tue-Ses3-P3-4
Thu-Ses1-O4-2
Mon-Ses3-S1-5
Tue-Ses1-O2-2
Tue-Ses1-S2-9
Wed-Ses1-P2-5
Wed-Ses2-P4-8
Tue-Ses1-S1-3
Wed-Ses2-P4-9
Thu-Ses2-P2-8
80
80
141
83
58
61
111
116
141
171
78
121
72
86
87
57
174
104
160
97
144
100
144
77
107
72
105
111
75
48
135
135
54
54
56
141
86
172
172
154
167
117
67
166
169
87
65
110
171
118
113
50
99
138
144
83
53
95
95
70
114
57
58
143
56
137
174
92
159
100
53
77
136
170
63
110
123
107
154
74
76
87
118
136
86
136
171
Mesgarani, Nima . . . . . Thu-Ses2-P2-10 172
Messaoudi, Abdel . . . . Mon-Ses2-O3-6 50
Metze, Florian . . . . . . . . Mon-Ses2-P4-13 59
Mon-Ses2-S1-8 61
Wed-Ses1-P4-7 123
Meunier, Fanny . . . . . . . Mon-Ses2-P1-4 52
Wed-Ses2-O1-5 126
Meyer, Bernd T. . . . . . . . Thu-Ses1-S1-2 162
Meyer, Gerard G.L. . . . Thu-Ses1-O3-5 153
Meyers, Adam . . . . . . . . Thu-Ses1-P4-1 159
Miao, Qi . . . . . . . . . . . . . . . Tue-Ses1-O4-1 78
Michel, Violaine . . . . . . Mon-Ses3-P1-2 66
Miettinen, Toni . . . . . . . Tue-Ses3-P3-4 107
Miguel, Antonio . . . . . . Mon-Ses2-O1-6 48
Mon-Ses3-P3-2 69
Tue-Ses2-P4-7 99
Tue-Ses3-P2-9 106
Wed-Ses2-P4-1 135
Thu-Ses1-P1-1 155
Mihajlik, Péter . . . . . . . . Thu-Ses1-P3-7 159
Mihelič, France . . . . . . . Tue-Ses1-O3-4 77
Wed-Ses2-P1-2 129
Thu-Ses1-O3-1 153
Mikami, Hiroki . . . . . . . Tue-Ses2-O3-2 90
Mikawa, Masahiko . . . . Thu-Ses1-P2-4 157
Miki, Nobuhiro . . . . . . . Tue-Ses1-P1-2 79
Miller, Amanda . . . . . . . Wed-Ses3-P1-3 142
Wed-Ses3-P1-4 142
Milner, Ben . . . . . . . . . . . Wed-Ses2-O4-2 128
Wed-Ses2-O4-6 129
Mimura, Masato . . . . . . Mon-Ses2-O3-3 50
Minematsu, Nobuaki . Mon-Ses2-P4-15 59
Mon-Ses3-P2-10 68
Mon-Ses3-P4-6 72
Wed-Ses2-P3-1 133
Wed-Ses3-O2-6 140
Thu-Ses2-P4-8 175
Ming, Ji . . . . . . . . . . . . . . . Wed-Ses3-P3-8 147
Misu, Teruhisa . . . . . . . . Mon-Ses2-P4-5 58
Wed-Ses1-P4-11 124
Mitra, Vikramjit . . . . . . Thu-Ses1-S1-1 162
Thu-Ses1-S1-3 162
Miura, Kazuo . . . . . . . . . Tue-Ses3-P4-4 109
Mixdorff, Hansjörg . . . Tue-Ses2-O2-2 89
Wed-Ses4-P4-12 149
Thu-Ses1-O2-4 152
Mo, Yoonsook . . . . . . . . Thu-Ses1-O2-6 152
Möbius, Bernd . . . . . . . . Wed-Ses3-P1-13 143
Thu-Ses1-O2-1 152
Thu-Ses1-O2-2 152
Moers, Donata . . . . . . . . Wed-Ses2-P3-7 134
Mohri, Mehryar . . . . . . . Thu-Ses1-P4-11 161
Moinet, Alexis . . . . . . . . Wed-Ses1-O4-4 115
Molina, Carlos . . . . . . . . Wed-Ses2-O2-5 127
Möller, Sebastian . . . . . Mon-Ses2-P4-14 59
Wed-Ses1-P4-7 123
Thu-Ses1-O4-1 154
Thu-Ses2-O1-4 164
Moniz, Helena . . . . . . . . Wed-Ses1-P2-6 118
Montero, J.M. . . . . . . . . . Mon-Ses2-S1-7 61
Montiel-Hernández, J. Mon-Ses2-P4-7 58
Monzo, Carlos . . . . . . . . Mon-Ses2-S1-2 60
Moore, Roger K. . . . . . . Tue-Ses2-P2-13 95
Wed-Ses1-P1-2 116
Wed-Ses1-P2-8 119
Thu-Ses2-P4-1 174
Moosmayr, Tobias . . . . Thu-Ses1-O1-5 151
Moreno, Pedro . . . . . . . . Thu-Ses1-P4-11 161
Morgan, Nelson . . . . . . . Thu-Ses2-P2-2 170
Thu-Ses2-P2-6 171
Mori, Shinsuke . . . . . . . Tue-Ses3-P4-7 109
Morinaka, Ryo . . . . . . . . Wed-Ses2-P3-10 134
Morise, Masanori . . . . . Thu-Ses1-P2-6 157
Morita, Masahiro . . . . . Wed-Ses2-P3-10 134
Morris, Jeremy . . . . . . . . Thu-Ses2-P4-10 175
Morris, Robert . . . . . . . . Tue-Ses3-P1-1 102
Moschitti, Alessandro Thu-Ses1-P4-12 161
Mossavat, Iman S. . . . . Thu-Ses2-O5-2 167
Mostow, Jack . . . . . . . . . Mon-Ses3-P4-1 71
Motlicek, Petr . . . . . . . . . Tue-Ses2-P3-11 97
Thu-Ses1-P1-2 155
Moudenc, Thierry . . . . Tue-Ses1-O4-3 78
Mower, Emily . . . . . . . . . Mon-Ses2-S1-3 60
Wed-Ses1-O2-3 113
Mukerjee, Kunal . . . . . . Tue-Ses3-P4-3 108
Müller, Daniela . . . . . . . Wed-Ses1-P1-13 117
Müller, Florian . . . . . . . . Thu-Ses2-P2-8 171
Murakami, Hiroko . . . . Mon-Ses3-P3-10 71
Murphy, Damian . . . . . Tue-Ses1-P1-4 79
Murphy, P. . . . . . . . . . . . . Tue-Ses1-S2-8 87
Murray, Gabriel . . . . . . . Wed-Ses2-P2-2 131
Murthy, Hema A. . . . . . Wed-Ses3-P2-9 145
Muscariello, Armando Thu-Ses2-O3-6 166
Mykowiecka, A. . . . . . . . Thu-Ses1-P4-6 160
N
Nadeu, C. . . . . . . . . . . . . .
Nagai, Takayuki . . . . . .
Nagaraja, Sunil . . . . . . .
Nakagawa, Seiichi . . . .
Nakagawa, Seiji . . . . . . .
Nakajima, Yoshitaka .
Nakamura, Atsushi . . .
Nakamura, Keigo . . . . .
Nakamura, Mitsuhiro
Nakamura, Satoshi . . .
Nakamura, Shizuka . .
Nakamura, Shogo . . . .
Nakano, Alberto Y. . . .
Nakano, Mikio . . . . . . . .
Nakatani, Tomohiro . .
Nam, Hosung . . . . . . . . .
Nambu, Yoshiki . . . . . .
Nanjo, Hiroaki . . . . . . . .
Nankaku, Yoshihiko . .
Naptali, Welly . . . . . . . . .
Narayanan, S.S. . . . . . . .
Nasr, Alexis . . . . . . . . . . .
Naumann, Anja B. . . . .
Nava, Emily . . . . . . . . . . .
Navas, Eva . . . . . . . . . . . .
Navratil, Jiri . . . . . . . . . .
Naylor, Patrick A. . . . . .
Neely, Abby . . . . . . . . . . .
Neerincx, Mark A. . . . .
Neiberg, D. . . . . . . . . . . . .
Nemala, S.K. . . . . . . . . . .
Németh, Géza . . . . . . . .
Nemoto, Akira . . . . . . . .
Nenkova, Ani . . . . . . . . .
Neto, Nelson . . . . . . . . . .
Neubig, Graham . . . . . .
Ney, Hermann . . . . . . . .
Ng, Tim . . . . . . . . . . . . . . .
Nguyen, Binh Phu . . . .
Nguyen, Kham . . . . . . . .
Tue-Ses2-P2-7
Mon-Ses3-S1-2
Tue-Ses1-S2-5
Tue-Ses2-P2-2
Thu-Ses1-P3-6
Thu-Ses1-P3-8
Thu-Ses2-P1-6
Mon-Ses3-S1-2
Mon-Ses2-P3-4
Tue-Ses2-P3-7
Mon-Ses3-S1-2
Tue-Ses3-P3-2
Wed-Ses3-P1-11
Mon-Ses2-P4-5
Tue-Ses1-O4-6
Wed-Ses1-O3-2
Wed-Ses1-P3-9
Wed-Ses1-P4-11
Wed-Ses3-S1-5
Thu-Ses1-O1-4
Tue-Ses3-S2-6
Tue-Ses1-P1-3
Tue-Ses2-P2-2
Thu-Ses1-P4-8
Thu-Ses1-P4-9
Tue-Ses2-P4-4
Thu-Ses1-S1-3
Thu-Ses1-S1-4
Thu-Ses1-P2-4
Tue-Ses2-O3-2
Mon-Ses3-P2-11
Tue-Ses1-O1-6
Wed-Ses1-P3-1
Wed-Ses1-P3-3
Thu-Ses1-P3-6
Mon-Ses2-S1-3
Mon-Ses3-O4-6
Tue-Ses1-O2-1
Tue-Ses1-O2-5
Tue-Ses2-O4-6
Wed-Ses1-O2-3
Wed-Ses2-O2-2
Wed-Ses2-O3-4
Wed-Ses2-P1-6
Wed-Ses2-P1-7
Thu-Ses1-O3-3
Thu-Ses1-S1-6
Thu-Ses2-O2-2
Tue-Ses2-O3-5
Wed-Ses1-P4-7
Tue-Ses1-O2-1
Mon-Ses2-S1-6
Tue-Ses2-P1-2
Wed-Ses3-O4-2
Mon-Ses2-O4-3
Thu-Ses2-O4-6
Wed-Ses2-P2-7
Tue-Ses3-P2-3
Thu-Ses2-O2-1
Thu-Ses2-P2-10
Mon-Ses3-P4-9
Wed-Ses3-O2-6
Wed-Ses2-P2-6
Mon-Ses2-O3-5
Tue-Ses3-P4-7
Mon-Ses2-O3-4
Mon-Ses2-P3-2
Mon-Ses2-P3-10
Tue-Ses2-P3-5
Wed-Ses2-P4-4
Wed-Ses2-P4-5
Thu-Ses1-P3-5
Thu-Ses1-P4-6
Thu-Ses1-P4-7
Tue-Ses1-O3-2
Wed-Ses1-O4-3
Tue-Ses1-O3-2
94
73
86
94
159
159
169
73
56
97
73
106
143
58
78
114
121
124
150
151
111
79
94
161
161
98
162
162
157
90
69
75
120
120
159
60
65
76
76
91
113
126
128
130
130
153
163
164
90
123
76
61
91
141
51
167
132
105
164
172
73
140
132
50
109
50
55
57
96
135
136
158
160
160
77
115
77
Nguyen, Long . . . . . . . . . Tue-Ses1-O3-2 77
Nguyen, Patrick . . . . . . Wed-Ses2-O3-2 127
Nguyen-Thien, Nhu . . . Thu-Ses1-O1-5 151
Ní Chasaide, Ailbhe . . Wed-Ses2-P1-5 130
Wed-Ses4-P4-9 148
Nicholas, Greg . . . . . . . . Wed-Ses2-P2-1 131
Niebuhr, Oliver . . . . . . . Wed-Ses4-P4-7 148
Niesler, Thomas . . . . . . Mon-Ses2-P2-9 55
Nijland, Lian . . . . . . . . . . Tue-Ses1-P1-9 80
Nilsenová, Marie . . . . . . Tue-Ses2-O2-5 89
Nishida, Masafumi . . . Wed-Ses3-P2-12 145
Nishiki, Kenta . . . . . . . . Wed-Ses2-O4-5 129
Nishikido, Akikazu . . . Mon-Ses2-O2-1 48
Nishimoto, Takuya . . . Wed-Ses2-O4-5 129
Nishimura, Akira . . . . . Thu-Ses1-P1-7 155
Nishimura, Masafumi Mon-Ses2-O1-5 48
Thu-Ses1-P2-7 157
Nishimura, Ryota . . . . . Wed-Ses1-P4-9 123
Nishiura, Takanobu . . Tue-Ses2-O3-2 90
Tue-Ses3-P1-13 103
Nisimura, Ryuichi . . . . Thu-Ses1-P2-6 157
Nitta, Tsuneo . . . . . . . . . Wed-Ses2-P4-14 137
Niyogi, Partha . . . . . . . . Thu-Ses1-S1-5 163
Noguchi, Hiroki . . . . . . Tue-Ses3-P4-4 109
Nogueira, João . . . . . . . . Tue-Ses3-O4-1 101
Nolan, Francis . . . . . . . . Wed-Ses3-P1-10 143
Noor, Elad . . . . . . . . . . . . Wed-Ses1-O1-5 112
Nose, Takashi . . . . . . . . Mon-Ses3-P3-4 70
Thu-Ses1-P2-2 156
Thu-Ses1-P4-8 161
Nöth, Elmar . . . . . . . . . . . Mon-Ses3-P4-4 72
Tue-Ses1-S2-6 86
Tue-Ses1-S2-10 87
Thu-Ses2-P1-8 169
Thu-Ses2-P3-3 172
Nouza, Jan . . . . . . . . . . . . Tue-Ses2-O1-6 88
Novák, Miroslav . . . . . . Tue-Ses2-P3-1 96
Novotney, Scott . . . . . . Mon-Ses2-P3-9 57
Nurminen, Jani . . . . . . . Wed-Ses1-P2-2 118
Wed-Ses1-P3-7 121
Thu-Ses1-P2-8 157
Nwe, Tin Lay . . . . . . . . . . Tue-Ses1-P4-4 84
O
Obin, Nicolas . . . . . . . . .
Ogata, Jun . . . . . . . . . . . .
Ogata, Tetsuya . . . . . . .
Ogawa, Atsunori . . . . .
Ogbureke, Kalu U. . . . .
Oger, Stanislas . . . . . . .
Ohara, Keiji . . . . . . . . . . .
Ohl, Claudia K. . . . . . . .
Ohls, Sarah . . . . . . . . . . .
Ohta, Kengo . . . . . . . . . .
Ohtake, Kiyonori . . . . .
Ohtani, Yamato . . . . . . .
Okamoto, Haruka . . . .
Okamoto, Jun . . . . . . . .
Okolowski, Stefanie . .
Okuno, Hiroshi G. . . . .
Olaszy, Gábor . . . . . . . .
Oliveira, Catarina . . . . .
Olsen, Peder A. . . . . . . .
Ono, Nobutaka . . . . . . .
Ooigawa, Tomohiko . .
Oonishi, Tasuku . . . . . .
Ordelman, Roeland . . .
Orglmeister, Reinhold
Ormel, Ellen . . . . . . . . . .
Orr, Rosemary . . . . . . . .
Ortega, Alfonso . . . . . .
Ortego-Resa, Carlos . .
Mon-Ses3-P2-7
Tue-Ses2-P2-6
Tue-Ses3-P4-6
Mon-Ses2-P4-1
Thu-Ses1-P4-9
Tue-Ses2-P3-7
Wed-Ses1-O3-5
Tue-Ses1-P3-13
Thu-Ses1-P3-10
Mon-Ses2-P1-1
Wed-Ses4-P4-11
Mon-Ses3-P4-5
Thu-Ses1-P3-8
Mon-Ses2-P4-5
Wed-Ses1-P4-11
Wed-Ses1-O4-1
Wed-Ses1-O4-4
Wed-Ses3-P2-12
Wed-Ses1-P4-10
Wed-Ses2-O1-4
Mon-Ses2-P4-1
Thu-Ses1-P4-9
Mon-Ses3-P4-9
Mon-Ses3-P1-9
Mon-Ses2-P3-1
Mon-Ses2-P3-5
Tue-Ses3-P1-6
Tue-Ses3-P2-12
Wed-Ses2-O4-5
Wed-Ses1-P1-10
Wed-Ses3-O3-5
Thu-Ses1-O4-4
Thu-Ses1-O1-1
Tue-Ses3-P3-5
Wed-Ses3-O1-1
Mon-Ses2-O1-6
Tue-Ses2-P4-7
Tue-Ses3-P2-9
Wed-Ses2-P4-1
Thu-Ses1-P1-1
Wed-Ses2-P1-3
68
94
109
57
161
97
114
83
159
51
149
72
159
58
124
115
115
145
124
126
57
161
73
67
55
56
102
106
129
117
140
154
150
107
138
48
99
106
135
155
130
O’Shaughnessy, D. . . . Tue-Ses3-P1-16 104
Thu-Ses1-P1-5 155
Thu-Ses2-O5-4 168
Osma-Ruiz, Víctor . . . . Tue-Ses1-S2-7 87
Ostendorf, Mari . . . . . . Thu-Ses0-K-1 47
Tue-Ses2-O3-4 90
Öster, Ann-Marie . . . . . Tue-Ses3-P3-5 107
Ouellet, Pierre . . . . . . . . Wed-Ses1-O1-3 112
Ouni, Kaïs . . . . . . . . . . . . Wed-Ses1-P1-6 116
Oura, Keiichiro . . . . . . . Mon-Ses3-O3-6 64
Wed-Ses1-P3-3 120
Ouyang, JianJun . . . . . . Wed-Ses3-O4-1 141
Özbek, İ. Yücel . . . . . . . Thu-Ses2-O2-3 164
P
Pabst, Friedemann . . .
Padmanabhan, R. . . . . .
Paek, Tim . . . . . . . . . . . . .
Paliwal, Kuldip . . . . . . .
Palomäki, Kalle J. . . . . .
Pan, Fuping . . . . . . . . . . .
Pan, Jielin . . . . . . . . . . . . .
Pan, Yi-cheng . . . . . . . . .
Pandey, Pramod . . . . . .
Pantazis, Yannis . . . . . .
Pardo, David . . . . . . . . . .
Pardo, J.M. . . . . . . . . . . . .
Parihar, Naveen . . . . . . .
Parizet, Etienne . . . . . . .
Park, Chiyoun . . . . . . . .
Park, J. . . . . . . . . . . . . . . . .
Park, JeonGue . . . . . . . .
Park, Youngja . . . . . . . .
Park, Yun-Sik . . . . . . . . .
Parthasarathi, S.H.K. .
Patel, Rupal . . . . . . . . . . .
Patil, Vaishali . . . . . . . . .
Patterson, Roy D. . . . . .
Pawlewski, M. . . . . . . . . .
Peabody, Mitchell . . . .
Pedersen, C.F. . . . . . . . .
Peláez-Moreno, C. . . . .
Pelecanos, Jason . . . . . .
Pellegrini, T. . . . . . . . . . .
Pellegrino, François . .
Penard, Nils . . . . . . . . . . .
Pérez, Javier . . . . . . . . . .
Pernkopf, Franz . . . . . .
Petkov, Petko N. . . . . . .
Petrone, Caterina . . . . .
Pfister, Beat . . . . . . . . . . .
Pfitzinger, Hartmut R.
Philburn, Elke . . . . . . . .
Picard, Rosalind W. . . .
Pieraccini, Roberto . . .
Pigeon, Stéphane . . . . .
Pillay, S.G. . . . . . . . . . . . .
Pinault, Florian . . . . . . .
Pinho, Cátia M.R. . . . . .
Plahl, C. . . . . . . . . . . . . . . .
Planet, Santiago . . . . . .
Podlipský, Václav J. . .
Pohjalainen, Jouni . . . .
Pokines, B.B. . . . . . . . . . .
Pollák, Petr . . . . . . . . . . .
Polzehl, Tim . . . . . . . . . .
Pon-Barry, Heather . . .
Popa, Victor . . . . . . . . . .
Popescu, Vladimir . . . .
Popescu-Belis, Andrei
Portêlo, J. . . . . . . . . . . . . .
Post, Brechtje . . . . . . . . .
Potamianos, G. . . . . . . .
Potard, Blaise . . . . . . . . .
Povey, Daniel . . . . . . . . .
Tue-Ses1-P1-7
Wed-Ses3-P2-9
Tue-Ses2-O1-4
Tue-Ses2-P1-4
Tue-Ses3-P1-5
Tue-Ses3-P1-2
Thu-Ses2-P3-7
Thu-Ses2-P3-5
Tue-Ses1-O3-6
Mon-Ses3-P2-4
Mon-Ses2-O4-2
Tue-Ses3-P3-9
Mon-Ses2-P4-8
Mon-Ses2-S1-7
Thu-Ses2-P4-6
Tue-Ses2-P2-5
Wed-Ses3-O3-2
Mon-Ses2-P3-7
Tue-Ses2-O1-1
Thu-Ses1-P4-10
Tue-Ses3-P1-10
Thu-Ses1-P1-4
Wed-Ses3-O3-6
Wed-Ses3-P2-9
Mon-Ses3-S1-5
Thu-Ses1-O3-2
Mon-Ses2-P1-7
Wed-Ses3-P2-7
Mon-Ses3-P1-4
Tue-Ses2-P1-8
Tue-Ses1-P2-7
Wed-Ses1-O1-4
Tue-Ses2-P2-8
Wed-Ses2-O1-5
Tue-Ses1-S1-4
Mon-Ses2-O2-5
Tue-Ses1-P3-1
Tue-Ses2-P1-3
Thu-Ses2-O5-2
Wed-Ses4-P4-13
Mon-Ses2-O4-1
Mon-Ses3-O1-2
Tue-Ses2-O2-2
Wed-Ses4-P4-11
Wed-Ses4-P4-12
Tue-Ses1-P1-8
Tue-Ses3-P3-8
Tue-Ses3-P4-2
Wed-Ses3-O3-4
Wed-Ses3-P2-7
Mon-Ses2-P4-9
Mon-Ses3-P1-7
Wed-Ses2-P4-4
Thu-Ses2-P2-5
Mon-Ses2-S1-2
Mon-Ses2-P1-3
Mon-Ses3-P1-5
Tue-Ses3-P1-2
Thu-Ses2-P1-2
Tue-Ses3-S2-7
Mon-Ses2-S1-8
Wed-Ses1-O2-2
Thu-Ses1-P2-8
Thu-Ses1-P3-10
Mon-Ses3-P4-10
Tue-Ses2-P2-8
Wed-Ses3-P1-10
Wed-Ses4-P4-14
Wed-Ses3-O4-2
Thu-Ses2-O2-4
Thu-Ses2-P1-8
Mon-Ses2-O3-1
79
145
88
92
102
102
173
173
77
67
51
108
58
61
174
94
140
56
87
161
103
155
141
145
74
153
52
145
66
92
81
112
94
126
86
49
81
92
167
149
50
62
89
149
149
80
107
108
140
145
58
66
135
171
60
52
66
102
168
111
61
113
157
159
73
94
143
149
141
165
169
49
Prabhavalkar, Rohit . . Tue-Ses1-P3-6 82
Putois, Ghislain . . . . . . . Wed-Ses3-S1-4 150
Pylkkönen, Janne . . . . . Mon-Ses2-P3-3 56
Q
Qian, Yao . . . . . . . . . . . . . Mon-Ses3-O3-3
Qiao, Yu . . . . . . . . . . . . . .
Qin, Chao . . . . . . . . . . . .
Qin, Long . . . . . . . . . . . .
Qin, Shenghao . . . . . . . .
Qin, Yong . . . . . . . . . . . .
Quarteroni, Silvia . . . . .
Quatieri, Thomas F. . .
Quené, Hugo . . . . . . . . . .
Wed-Ses1-P3-2
Mon-Ses2-P4-15
Mon-Ses3-P4-6
Wed-Ses2-P3-1
Wed-Ses3-O2-6
Thu-Ses2-P4-8
Tue-Ses1-P1-5
Wed-Ses1-P3-10
Wed-Ses1-P4-8
Mon-Ses3-O3-4
Wed-Ses1-P3-5
Tue-Ses2-O3-1
Thu-Ses2-O3-2
Thu-Ses2-O3-5
Tue-Ses1-P2-4
64
120
59
72
133
140
175
79
121
123
64
120
89
165
166
81
R
Raab, Martin . . . . . . . . . . Thu-Ses2-P3-3 172
Rachevsky, Leonid . . . Mon-Ses2-P4-10 59
Radfar, M.H. . . . . . . . . . . Wed-Ses2-O4-4 129
Raisamo, Roope . . . . . . Tue-Ses3-P3-4 107
Raitio, Tuomo . . . . . . . . Wed-Ses1-P2-2 118
Raj, Bhiksha . . . . . . . . . . Mon-Ses2-O1-2 48
Wed-Ses2-P3-2 133
Thu-Ses1-O1-2 151
Rajaniemi, Juha-Pekka Tue-Ses3-P3-4 107
Rajendran, S. . . . . . . . . . Wed-Ses3-P1-5 142
Rajkumar, R. . . . . . . . . . . Thu-Ses1-O2-3 152
Ramabhadran, B. . . . . . Mon-Ses3-O4-3 65
Wed-Ses2-O3-5 128
Ramakrishnan, S. . . . . . Tue-Ses2-P2-3 94
Ramasubramanian, V. Thu-Ses1-P1-8 156
Ramos, Daniel . . . . . . . . Wed-Ses2-P1-3 130
Wed-Ses3-P2-6 144
Rantala, Jussi . . . . . . . . . Tue-Ses3-P3-4 107
Rao, Preeti . . . . . . . . . . . . Tue-Ses2-P2-3 94
Thu-Ses1-O3-2 153
Rao, Vishweshwara . . . Tue-Ses2-P2-3 94
Rapcan, Viliam . . . . . . . Tue-Ses1-S1-4 86
Räsänen, Okko J. . . . . . Tue-Ses1-O2-6 76
Tue-Ses1-P3-5 82
Tue-Ses2-P2-13 95
Wed-Ses1-P4-13 124
Thu-Ses2-P4-3 174
Rasmussen, Morten H. Tue-Ses3-P3-11 108
Rastrow, Ariya . . . . . . . . Tue-Ses2-P3-6 96
Wed-Ses2-O3-5 128
Rath, S.P. . . . . . . . . . . . . . . Mon-Ses3-P3-5 70
Mon-Ses3-P3-9 70
Mon-Ses3-P3-12 71
Wed-Ses3-P2-3 144
Rauzy, Stéphane . . . . . . Thu-Ses2-P1-7 169
Ravuri, Suman . . . . . . . . Thu-Ses2-P2-2 170
Thu-Ses2-P2-5 171
Raybaud, Sylvain . . . . . Mon-Ses3-O4-1 64
Rebordao, A.R.F. . . . . . . Mon-Ses3-P2-10 68
Redeker, Gisela . . . . . . . Tue-Ses2-O2-1 88
Reed, Jeremy . . . . . . . . . Mon-Ses2-P2-2 53
Regunathan, Shankar . Tue-Ses3-P4-3 108
Reichel, Uwe D. . . . . . . . Wed-Ses1-P2-4 118
Reidhammer, K. . . . . . . Tue-Ses3-P4-8 109
Reilly, Richard B. . . . . . Tue-Ses1-S1-4 86
Rekhis, Oussama . . . . . Wed-Ses1-P1-6 116
Renals, Steve . . . . . . . . . Tue-Ses3-P3-3 107
Thu-Ses1-P3-9 159
Rennie, Steven J. . . . . . Tue-Ses3-P1-6 102
Rettelbach, Nikolaus . Thu-Ses1-P1-9 156
Réveil, Bert . . . . . . . . . . . Thu-Ses2-P3-1 172
Thu-Ses2-P3-2 172
Reynolds, Douglas . . . Tue-Ses2-O4-1 90
Wed-Ses3-P2-10 145
Riccardi, Giuseppe . . . Mon-Ses2-P4-6 58
Tue-Ses2-O3-1 89
Thu-Ses1-P4-12 161
Thu-Ses2-O1-1 163
Richardson, F.S. . . . . . . Mon-Ses2-P2-8 54
Wed-Ses3-P2-10 145
Richmond, Korin . . . . . Tue-Ses3-O4-3 101
Wed-Ses2-P3-3 133
Thu-Ses2-O3-4 166
Rieser, Verena . . . . . . . .
Riester, Arndt . . . . . . . .
Rigoll, Gerhard . . . . . . .
Riley, Michael . . . . . . . . .
Rilliard, Albert . . . . . . . .
Robertson, Ian H. . . . . .
Rodet, Xavier . . . . . . . . .
Rogers, Jack C. . . . . . . .
Romano, Sara . . . . . . . . .
Romportl, Jan . . . . . . . .
Romsdorfer, Harald . .
Ronzhin, Andrey . . . . .
Rose, Phil . . . . . . . . . . . . .
Rosec, Olivier . . . . . . . . .
Rossato, Solange . . . . .
Rotaru, Mihai . . . . . . . . .
Roukos, S. . . . . . . . . . . . .
Rouvier, Mickael . . . . . .
Roy, Brandon C. . . . . . .
Roy, Deb . . . . . . . . . . . . . .
Roy, Serge H. . . . . . . . . .
Rudoy, Daniel . . . . . . . .
Rugchatjaroen, A. . . . .
Rutenbar, Rob A. . . . . .
Rybach, David . . . . . . . .
Wed-Ses3-S1-6
Wed-Ses4-P4-2
Mon-Ses2-P4-3
Wed-Ses2-P1-10
Tue-Ses2-P3-8
Wed-Ses1-P2-12
Wed-Ses3-O4-5
Thu-Ses1-O2-5
Tue-Ses1-S1-4
Mon-Ses3-P2-7
Mon-Ses3-O2-1
Tue-Ses3-P4-5
Tue-Ses1-O4-2
Mon-Ses3-P2-1
Mon-Ses3-P2-2
Thu-Ses2-P1-5
Wed-Ses3-P1-7
Mon-Ses2-O4-2
Wed-Ses1-O4-2
Wed-Ses3-P2-14
Wed-Ses2-P2-1
Mon-Ses3-O4-3
Tue-Ses2-P2-9
Tue-Ses2-P2-10
Wed-Ses1-P1-1
Wed-Ses0-K-1
Wed-Ses1-P1-1
Mon-Ses3-S1-5
Thu-Ses2-O3-5
Mon-Ses3-P2-3
Mon-Ses3-P2-6
Wed-Ses1-P3-12
Wed-Ses2-O3-3
Mon-Ses2-P3-2
Wed-Ses2-P4-5
Thu-Ses1-P3-5
Thu-Ses2-P4-6
150
147
57
131
97
119
141
152
86
68
62
109
78
67
67
169
143
51
115
146
131
65
95
95
116
47
116
74
166
67
68
122
127
55
136
158
174
S
Sá Couto, Pedro . . . . . . Tue-Ses1-S2-11 87
Sáenz-Lechón, Nicolás Tue-Ses1-S2-7 87
Sagayama, Shigeki . . . . Wed-Ses2-O4-5 129
Sagisaka, Yoshinori . . Tue-Ses3-S2-6 111
Wed-Ses1-P2-13 119
Thu-Ses2-O5-1 167
Saheer, Lakshmi . . . . . . Tue-Ses3-P2-5 105
Sainz, Iñaki . . . . . . . . . . . Tue-Ses2-P1-2 91
Saito, Daisuke . . . . . . . . Wed-Ses2-P3-1 133
Saito, You . . . . . . . . . . . . . Wed-Ses3-P3-1 146
Saitou, Takeshi . . . . . . . Tue-Ses1-P2-8 81
Sakai, Masaru . . . . . . . . . Thu-Ses2-P2-7 171
Sakai, Shinsuke . . . . . . . Tue-Ses1-O4-6 78
Wed-Ses1-P3-9 121
Sakrajda, Andrej . . . . . Mon-Ses3-O4-4 65
Saleem, Shirin . . . . . . . . Mon-Ses3-O4-2 65
Sales Dias, Miguel . . . . Tue-Ses3-O4-1 101
Saltzman, Elliot . . . . . . . Thu-Ses1-S1-3 162
Thu-Ses1-S1-4 162
Thu-Ses2-O2-2 164
Salvi, Giampiero . . . . . . Tue-Ses3-P3-5 107
Sanand, D.R. . . . . . . . . . . Mon-Ses3-P3-12 71
Tue-Ses2-P1-10 93
Sánchez, Carmelo . . . . Tue-Ses1-S2-7 87
Sanchis, Emilio . . . . . . . Mon-Ses2-P4-6 58
Thu-Ses2-O1-1 163
Sanders, Eric . . . . . . . . . . Thu-Ses1-O4-3 154
Sands, Bonny . . . . . . . . . Wed-Ses3-P1-3 142
Sangwan, Abhijeet . . . Mon-Ses2-P2-3 53
San-Segundo, R. . . . . . . Mon-Ses2-P4-8 58
Mon-Ses2-S1-7 61
Sansone, Larry . . . . . . . . Mon-Ses2-P4-10 59
Santos, Ricardo . . . . . . . Tue-Ses1-S2-11 87
Saratxaga, Ibon . . . . . . . Tue-Ses2-P1-2 91
Sarikaya, R. . . . . . . . . . . . Mon-Ses3-O4-3 65
Sarkar, A.K. . . . . . . . . . . . Mon-Ses3-P3-9 70
Wed-Ses3-P2-3 144
Saruwatari, Hiroshi . . . Tue-Ses3-P3-2 106
Wed-Ses1-O4-1 115
Sato, Taichi . . . . . . . . . . . Mon-Ses3-P4-7 72
Saul, Lawrence K. . . . . . Tue-Ses1-O1-3 75
Savino, Michelina . . . . . Wed-Ses4-P4-4 148
Saz, Oscar . . . . . . . . . . . . Mon-Ses3-P3-2 69
Scanzio, Stefano . . . . . . Tue-Ses2-P3-9 97
Schaeffler, Felix . . . . . . . Wed-Ses1-P4-2 122
Schaffer, Stefan . . . . . . . Mon-Ses2-P4-13 59
Wed-Ses1-P4-7 123
Schalkwyk, Johan . . . .
Scharenborg, Odette .
Scheffer, Nicolas . . . . . .
Schenk, Joachim . . . . . .
Scherer, Klaus R. . . . . .
Schiel, Florian . . . . . . . .
Schlangen, David . . . . .
Schlüter, Ralf . . . . . . . . .
Schmalenstroeer, J. . . .
Schmidt, Konstantin .
Schneider, Daniel . . . . .
Schneider, Katrin . . . . .
Schnell, Karl . . . . . . . . . .
Schnell, Markus . . . . . .
Schoentgen, Jean . . . . .
Schuhmacher, K. . . . . .
Schuller, Björn . . . . . . . .
Schultz, Tanja . . . . . . . .
Schuppler, Barbara . . .
Schuster, Maria . . . . . . .
Schwartz, Reva . . . . . . .
Schwartz, Richard . . . .
Schwarz, Jan . . . . . . . . . .
Schwarz, Petr . . . . . . . . .
Schwärzler, Stefan . . .
Schweitzer, Antje . . . .
Schweitzer, Katrin . . . .
Schwerin, Belinda . . . .
Scipioni, Marcello . . . .
Scott, Abigail . . . . . . . . .
Sébillot, Pascale . . . . . .
Seebode, Julia . . . . . . . .
Segura, C. . . . . . . . . . . . . .
Segura, Jose Carlos . . .
Seid, Hussien . . . . . . . . .
Seide, F. . . . . . . . . . . . . . . .
Sekiyama, Kaoru . . . . .
Selouani, Sid-Ahmed .
Seltzer, Michael . . . . . .
Seneff, Stephanie . . . . .
Seng, Sopheap . . . . . . . .
Serniclaes, Willy . . . . . .
Sethu, Vidhyasaharan
Sethy, Abhinav . . . . . . .
Setiawan, Panji . . . . . . .
Sgarbas, Kyriakos . . . .
Sha, Fei . . . . . . . . . . . . . . .
Shafran, Izhak . . . . . . . .
Shah, Sheena . . . . . . . . .
Shaikh, Mostafa A.M. .
Shannon, Matt . . . . . . . .
Sharma, Harsh V. . . . . .
Sharma, Kartavya . . . .
Shen, Guanghu . . . . . . .
Shen, Wade . . . . . . . . . . .
Shi, Qin . . . . . . . . . . . . . . .
Shieber, Stuart . . . . . . . .
Shiga, Yoshinori . . . . . .
Shih, Chilin . . . . . . . . . . .
Shikano, Kiyohiro . . . .
Shimodaira, Hiroshi . .
Tue-Ses2-O1-5
Tue-Ses2-P3-8
Wed-Ses1-S1-4
Wed-Ses1-P1-8
Wed-Ses2-O1-4
Wed-Ses1-O1-1
Wed-Ses1-O1-4
Mon-Ses2-P4-3
Wed-Ses1-O2-1
Tue-Ses2-O1-3
Tue-Ses2-O3-3
Wed-Ses1-P4-14
Mon-Ses2-P3-2
Mon-Ses2-P3-10
Tue-Ses2-P3-5
Wed-Ses2-P4-4
Wed-Ses2-P4-5
Thu-Ses1-P3-5
Thu-Ses2-P4-6
Tue-Ses2-P2-11
Thu-Ses1-P1-9
Wed-Ses2-P4-9
Thu-Ses1-O2-2
Tue-Ses2-P1-12
Thu-Ses1-P1-9
Tue-Ses1-S1-3
Tue-Ses1-S2-8
Thu-Ses2-P1-4
Tue-Ses1-P3-10
Mon-Ses2-S1-1
Wed-Ses1-O2-6
Wed-Ses2-P1-10
Thu-Ses1-O1-5
Mon-Ses3-S1-6
Mon-Ses3-S1-7
Tue-Ses1-O1-2
Tue-Ses1-P4-5
Tue-Ses1-P4-7
Wed-Ses3-P1-2
Tue-Ses1-S2-6
Wed-Ses3-O2-2
Mon-Ses2-P3-9
Wed-Ses4-P4-12
Wed-Ses3-P2-4
Mon-Ses2-P4-3
Wed-Ses3-P1-13
Thu-Ses1-O2-1
Wed-Ses4-P4-2
Tue-Ses3-P1-5
Tue-Ses1-S2-10
Wed-Ses3-P1-3
Mon-Ses3-O1-5
Mon-Ses2-P4-13
Wed-Ses1-P4-7
Tue-Ses2-P2-7
Mon-Ses2-O1-4
Wed-Ses3-P1-5
Wed-Ses1-O3-4
Wed-Ses1-P2-12
Tue-Ses3-P1-16
Tue-Ses2-O1-2
Mon-Ses3-P1-4
Thu-Ses1-P3-1
Mon-Ses2-P1-8
Wed-Ses2-P2-3
Wed-Ses2-O3-5
Thu-Ses2-P2-4
Wed-Ses2-O2-1
Tue-Ses1-O1-3
Wed-Ses1-P2-10
Wed-Ses3-P1-3
Wed-Ses3-P1-4
Mon-Ses3-P2-10
Mon-Ses3-O3-1
Tue-Ses3-P3-7
Thu-Ses1-P4-1
Wed-Ses3-P3-3
Wed-Ses2-P4-13
Wed-Ses3-O2-2
Wed-Ses1-P3-5
Wed-Ses1-O2-2
Wed-Ses1-P3-6
Tue-Ses3-S2-4
Mon-Ses3-S1-2
Tue-Ses3-P3-2
Wed-Ses1-O4-1
Wed-Ses1-P3-13
88
97
125
117
126
111
112
57
112
88
90
124
55
57
96
135
136
158
174
95
156
136
152
93
156
86
87
169
83
60
113
131
151
74
74
75
84
85
142
86
139
57
149
144
57
143
152
147
102
87
142
62
59
123
94
48
142
114
119
104
88
66
158
53
131
128
171
126
75
119
142
142
68
63
107
159
146
137
139
120
113
120
110
73
106
115
122
Shinoda, Koichi . . . . . . .
Shinohara, Shigeko . . .
Shinozaki, Takahiro . .
Shiota, Sayaka . . . . . . . .
Shochi, Takaaki . . . . . . .
Shozakai, Makoto . . . .
Shriberg, Elizabeth . . .
Shuang, Zhiwei . . . . . . .
Shue, Yen-Liang . . . . . .
Shum, Stephen . . . . . . .
Sicconi, Roberto . . . . . .
Sieczkowska, J. . . . . . . .
Sigüenza, Álvaro . . . . .
Silén, Hanna . . . . . . . . . .
Sim, Khe Chai . . . . . . . .
Simko, Juraj . . . . . . . . . .
Simpson, Brian D. . . . .
Sinha, Rohit . . . . . . . . . .
Siniscalchi, Sabato M.
Sisinni, Bianca . . . . . . . .
Siu, Man-hung . . . . . . . .
Sivakumaran, P. . . . . . .
Sivaram, G.S.V.S. . . . . . .
Skantze, Gabriel . . . . . .
Skarnitzl, Radek . . . . . .
Smaïli, Kamel . . . . . . . . .
Smolenski, B.Y. . . . . . . .
Socoró, Joan Claudi . .
Solewicz, Yosef A. . . . .
Sollich, Peter . . . . . . . . . .
Song, Hwa Jeon . . . . . . .
Song, Ji-Hyun . . . . . . . . .
Song, Young Chol . . . .
Sonu, Mee . . . . . . . . . . . . .
Soong, Frank K. . . . . . . .
Soronen, Hannu . . . . . .
Speed, Matt . . . . . . . . . . .
Speer, Shari R. . . . . . . . .
Spiegl, Werner . . . . . . . .
Spinu, Laura . . . . . . . . . .
Sproat, Richard . . . . . . .
Sreenivas, T.V. . . . . . . . .
Sridharan, Sridha . . . . .
Stafylakis, Themos . . .
Stallard, David . . . . . . . .
Stamatakis, E. . . . . . . . . .
Stark, Anthony . . . . . . .
Stauffer, A.R. . . . . . . . . .
Steed, William . . . . . . . .
Stegmann, Joachim . . .
Steidl, Stefan . . . . . . . . .
Steiner, Ingmar . . . . . . .
Stemmer, Georg . . . . . .
Stern, Richard M. . . . . .
Stewart, Osamuyimen
Stolcke, Andreas . . . . .
Stone, Maureen . . . . . . .
Strasheim, Albert . . . . .
Strassel, Stephanie M.
Strecha, Guntram . . . .
Strik, Helmer . . . . . . . . .
Strömbergsson, Sofia .
Strope, Brian . . . . . . . . . .
Mon-Ses3-P3-10
Wed-Ses1-P1-10
Tue-Ses2-P4-9
Tue-Ses1-O1-6
Wed-Ses1-P2-12
Wed-Ses1-P4-10
Wed-Ses1-O1-1
Wed-Ses2-P2-4
Thu-Ses2-O1-3
Mon-Ses3-O3-4
Wed-Ses1-P3-5
Thu-Ses1-P2-7
Thu-Ses2-P1-1
Thu-Ses2-P1-8
Wed-Ses1-S1-5
Wed-Ses3-P1-13
Tue-Ses3-P3-9
Wed-Ses1-P3-7
Thu-Ses2-P3-8
Mon-Ses2-O2-3
Mon-Ses3-O2-5
Mon-Ses3-P3-8
Wed-Ses1-O3-3
Mon-Ses2-P2-2
Wed-Ses1-P1-9
Wed-Ses2-O3-6
Wed-Ses3-P2-7
Thu-Ses2-P2-10
Mon-Ses2-P4-12
Mon-Ses2-P1-3
Tue-Ses1-P3-11
Mon-Ses3-O4-1
Thu-Ses2-P1-2
Mon-Ses2-S1-2
Mon-Ses3-O3-5
Tue-Ses1-P4-11
Wed-Ses3-P3-4
Mon-Ses3-P3-3
Wed-Ses3-O3-1
Tue-Ses3-P1-10
Thu-Ses1-P1-4
Thu-Ses2-P1-8
Wed-Ses1-P2-13
Mon-Ses3-O3-3
Tue-Ses3-P1-9
Wed-Ses1-P3-2
Wed-Ses1-P4-8
Tue-Ses3-P3-4
Tue-Ses1-P1-4
Thu-Ses1-O2-3
Thu-Ses2-P1-8
Wed-Ses1-P2-1
Wed-Ses2-O2-4
Tue-Ses3-P1-7
Tue-Ses3-O3-2
Wed-Ses1-O1-4
Tue-Ses2-O4-2
Mon-Ses3-O4-2
Wed-Ses3-P1-10
Tue-Ses2-P1-4
Wed-Ses1-P4-3
Wed-Ses3-P2-11
Thu-Ses2-P1-2
Wed-Ses3-P1-7
Tue-Ses1-P3-10
Mon-Ses2-S1-1
Wed-Ses2-P3-3
Thu-Ses2-P1-8
Mon-Ses2-O1-1
Mon-Ses2-O1-2
Tue-Ses2-P4-7
Wed-Ses2-P3-2
Thu-Ses1-O1-2
Thu-Ses1-O3-4
Mon-Ses2-P4-11
Mon-Ses3-O4-4
Wed-Ses2-P2-4
Wed-Ses2-P4-2
Wed-Ses2-P4-3
Mon-Ses3-S1-4
Wed-Ses3-O1-4
Thu-Ses2-O4-3
Wed-Ses1-P3-4
Mon-Ses3-P4-2
Wed-Ses3-O2-1
Tue-Ses1-P2-5
Tue-Ses2-O1-5
71
117
99
75
119
124
111
131
163
64
120
157
168
169
125
143
108
121
173
49
63
70
114
53
117
128
145
172
59
52
83
64
168
60
64
85
146
69
140
103
155
169
119
64
103
120
123
107
79
152
169
118
127
103
100
112
90
65
143
92
122
145
168
143
83
60
133
169
47
48
99
133
151
153
59
65
131
135
135
74
138
166
120
71
139
81
88
Štruc, Vitomir . . . . . . . . Wed-Ses2-P1-2 129
Stüker, Sebastian . . . . . Thu-Ses2-P3-9 174
Sturim, D.E. . . . . . . . . . . . Wed-Ses3-P2-10 145
Stuttle, Matt . . . . . . . . . . Tue-Ses3-P2-8 105
Stylianou, Yannis . . . . . Mon-Ses2-O4-2 51
Tue-Ses1-S1-2 86
Su, Zhao-yu . . . . . . . . . . . Thu-Ses2-P1-13 170
Subramanya, A. . . . . . . . Tue-Ses1-O1-1 75
Wed-Ses2-O3-1 127
Suendermann, David . Tue-Ses3-P4-2 108
Sugiura, Komei . . . . . . . Wed-Ses3-S1-5 150
Suk, Soo-Young . . . . . . . Tue-Ses3-P2-11 106
Wed-Ses3-P3-3 146
Sumi, Kouhei . . . . . . . . . Tue-Ses2-P2-6 94
Sumner, Meghan . . . . . Mon-Ses3-O2-3 63
Sun, Hanwu . . . . . . . . . . . Tue-Ses1-P4-4 84
Sun, Hongjun . . . . . . . . . Tue-Ses2-P1-9 93
Sun, Yang . . . . . . . . . . . . . Thu-Ses1-O1-5 151
Sun, Yanqing . . . . . . . . . Wed-Ses2-P1-4 130
Sundaram, Shiva . . . . . Mon-Ses2-S1-8 61
Sundberg, Johan . . . . . . Tue-Ses1-P1-7 79
Sung, Po-Yi . . . . . . . . . . . Thu-Ses1-P2-5 157
Suni, Antti . . . . . . . . . . . . Wed-Ses1-P2-2 118
Sutherland, Andrew . . Thu-Ses2-P4-2 174
Suzuki, Motoyuki . . . . . Tue-Ses3-P2-6 105
Svantesson, Jan-Olof . Wed-Ses4-P4-8 148
Svendsen, Torbjørn . . Mon-Ses2-P2-2 53
Swerts, Marc . . . . . . . . . . Tue-Ses2-O2-5 89
Szaszák, György . . . . . . Wed-Ses2-O2-3 126
Sztahó, Dávid . . . . . . . . . Wed-Ses2-O2-3 126
T
Taal, Cees H. . . . . . . . . . . Wed-Ses2-O4-3 129
Tachibana, Ryuki . . . . . Mon-Ses2-O1-5 48
Thu-Ses1-P2-7 157
Taguchi, Ryo . . . . . . . . . . Thu-Ses1-P4-8 161
Tajima, Keiichi . . . . . . . Wed-Ses1-P2-13 119
Takacs, Gyorgy . . . . . . . Wed-Ses3-O4-6 142
Takahashi, Akira . . . . . Thu-Ses1-O4-1 154
Takahashi, Satoshi . . . Wed-Ses1-O3-5 114
Takahashi, Toru . . . . . . Thu-Ses1-P2-6 157
Takeshima, Chihiro . . Mon-Ses2-P1-2 52
Takiguchi, Tetsuya . . . Mon-Ses2-P4-2 57
Tamura, Masatsune . . Wed-Ses2-P3-10 134
Tan, Zheng-Hua . . . . . . Tue-Ses3-P3-11 108
Wed-Ses3-O3-3 140
Tanaka, Kazuyo . . . . . . Thu-Ses1-P2-4 157
Tanaka, Kihachiro . . . . Mon-Ses3-P4-7 72
Taniyama, Hikaru . . . . Mon-Ses2-P4-4 58
Tao, Jianhua . . . . . . . . . . Tue-Ses2-P1-9 93
Tarján, Balázs . . . . . . . . Thu-Ses1-P3-7 159
Tashev, Ivan . . . . . . . . . . Tue-Ses2-O1-2 88
Tauberer, Joshua . . . . . Wed-Ses3-O2-4 139
Tayanin, Damrong . . . Wed-Ses4-P4-8 148
Taylor, Paul . . . . . . . . . . . Mon-Ses3-O3-5 64
Teiken, Wilfried . . . . . . Thu-Ses1-P4-10 161
Teixeira, António . . . . . Mon-Ses3-P1-9 67
Tejedor, Javier . . . . . . . . Wed-Ses2-P4-10 136
ten Bosch, L. . . . . . . . . . . Tue-Ses1-O2-6 76
Tue-Ses2-P2-13 95
Wed-Ses1-P2-8 119
Wed-Ses1-P2-9 119
Tepperman, Joseph . . Tue-Ses1-O2-1 76
Tue-Ses1-O2-5 76
Wed-Ses2-O2-2 126
Thu-Ses1-S1-6 163
Terband, Hayo . . . . . . . . Tue-Ses1-P1-9 80
Tue-Ses1-P1-10 80
Teshima, Shigeki . . . . . Wed-Ses2-P4-14 137
Thambiratnam, K. . . . . Wed-Ses1-O3-4 114
Thangthai, Ausdang . . Mon-Ses3-P2-6 68
Wed-Ses1-P3-12 122
Thatphithakkul, N. . . . Mon-Ses3-P2-6 68
Wed-Ses1-P3-12 122
Thiruvaran, T. . . . . . . . . Tue-Ses2-P1-11 93
Thomas, Mark R.P. . . . Mon-Ses2-O4-3 51
Thomas, Samuel . . . . . . Thu-Ses2-O3-1 165
Thu-Ses2-P2-3 171
Thompson, Laura . . . . Tue-Ses3-S2-5 111
Thomson, B. . . . . . . . . . . Thu-Ses1-P4-5 160
Thorpe, William . . . . . . Mon-Ses2-O2-4 49
Tian, Jilei . . . . . . . . . . . . . Mon-Ses3-O3-6 64
Tiede, Mark . . . . . . . . . . . Thu-Ses2-O2-5 165
Tihelka, Daniel . . . . . . . Tue-Ses1-O4-2 78
Ting, Chuan-Wei . . . . . . Tue-Ses3-P2-10 106
Tishby, Naftali . . . . . . . .
Tobiasson, Helena . . . .
Toda, Tomoki . . . . . . . .
Toivola, Minnaleena . .
Tokuda, Keiichi . . . . . . .
Tokuma, Shinichi . . . . .
Tomalin, M. . . . . . . . . . . .
Tompkins, Frank . . . . .
Tong, Rong . . . . . . . . . . .
Torreira, Francisco . . .
Torres-Carrasquillo, P.
Toth, Arthur R. . . . . . . .
Tran, Viet-Anh . . . . . . . .
Trancoso, Isabel . . . . . .
Tremblay, Annie . . . . . .
Trilla, Alexandre . . . . .
Trmal, Jan . . . . . . . . . . . .
Truong, Khiet P. . . . . . .
Tsakalidis, Stavros . . .
Tsao, Yu . . . . . . . . . . . . . .
Tschöpe, Constanze . .
Tseng, Chiu-yu . . . . . . .
Tseng, Chun-Han . . . . .
Tsiartas, Andreas . . . .
Tsirulnik, Liliya . . . . . . .
Tsuchiya, Masatoshi .
Tsuge, Satoru . . . . . . . . .
Tsurutani, Chiharu . . .
Tsuzaki, Minoru . . . . . .
Tsymbal, Alexey . . . . . .
Tucker, Benjamin V. . .
Tucker, Simon . . . . . . . .
Tulli, Juan Carlos . . . . .
Tur, Gokhan . . . . . . . . . .
Turicchia, Lorenzo . . .
Turunen, Markku . . . . .
Tüske, Zoltán . . . . . . . . .
Mon-Ses2-P2-7
Wed-Ses1-O1-5
Mon-Ses2-P4-12
Mon-Ses3-S1-2
Mon-Ses3-S1-8
Tue-Ses3-P3-2
Wed-Ses1-O4-1
Wed-Ses1-O4-4
Wed-Ses1-P3-9
Wed-Ses1-P2-3
Mon-Ses3-O3-6
Mon-Ses3-P2-11
Tue-Ses1-O1-6
Wed-Ses1-P3-1
Wed-Ses1-P3-3
Wed-Ses1-P3-9
Wed-Ses1-P3-10
Wed-Ses1-P1-11
Mon-Ses2-P3-7
Thu-Ses1-P3-4
Mon-Ses2-O4-6
Mon-Ses2-P2-10
Mon-Ses3-P1-1
Wed-Ses3-O2-1
Mon-Ses2-P2-8
Mon-Ses3-S1-6
Mon-Ses3-S1-7
Mon-Ses3-S1-8
Mon-Ses2-O3-5
Tue-Ses2-O2-4
Tue-Ses2-P2-8
Wed-Ses1-P2-6
Mon-Ses2-P1-9
Mon-Ses3-P2-8
Tue-Ses1-P3-11
Wed-Ses2-P2-7
Mon-Ses3-O4-2
Wed-Ses1-O3-2
Wed-Ses1-P3-4
Wed-Ses1-P2-5
Thu-Ses2-P1-13
Tue-Ses2-O4-3
Mon-Ses3-O4-6
Thu-Ses2-P1-5
Thu-Ses1-P3-6
Thu-Ses1-P3-8
Wed-Ses3-P2-12
Tue-Ses1-O2-3
Mon-Ses2-P1-2
Thu-Ses2-P2-6
Wed-Ses2-O1-1
Thu-Ses2-P1-11
Mon-Ses3-P2-13
Wed-Ses1-O4-5
Thu-Ses1-P4-1
Thu-Ses1-P4-2
Wed-Ses2-P3-2
Tue-Ses3-P3-4
Thu-Ses1-O4-2
Thu-Ses1-P3-7
54
112
59
73
74
106
115
115
121
118
64
69
75
120
120
121
121
117
56
158
51
55
65
139
54
74
74
74
50
89
94
118
53
68
83
132
65
114
120
118
170
91
65
169
159
159
145
76
52
171
125
170
69
115
159
160
133
107
154
159
U
Umesh, S. . . . . . . . . . . . . .
Unoki, Masashi . . . . . . .
Unver, Emre . . . . . . . . . .
Uriz, Alejandro José . .
Usabaev, Bela . . . . . . . . .
Mon-Ses3-P3-5
Mon-Ses3-P3-9
Mon-Ses3-P3-12
Tue-Ses2-P1-10
Wed-Ses3-P2-3
Thu-Ses1-O1-4
Thu-Ses1-P1-6
Wed-Ses1-O4-5
Mon-Ses3-O3-6
70
70
71
93
144
151
155
115
64
V
Vainio, Martti . . . . . . . . .
Vaissiere, Jacqueline .
Valente, Fabio . . . . . . . .
Valkama, Pellervo . . . .
Válková, Lucie . . . . . . . .
Valverde-Albacete, F.J.
van Brenk, Frits . . . . . .
Van Compernolle, D. .
van Dalen, R.C. . . . . . . .
Wed-Ses1-P2-2
Wed-Ses1-P4-1
Tue-Ses1-S2-9
Tue-Ses2-O4-4
Thu-Ses2-P2-5
Tue-Ses3-P3-4
Wed-Ses1-P4-5
Tue-Ses1-P2-7
Tue-Ses1-P1-9
Tue-Ses1-P1-10
Mon-Ses3-P3-6
Thu-Ses2-P4-11
Wed-Ses1-O1-6
Thu-Ses1-O1-3
118
122
87
91
171
107
123
81
80
80
70
175
112
151
van den Heuvel, Henk
van der Plas, Lonneke
van der Werff, Laurens
van de Ven, Marco . . . .
van Dommelen, Wim .
van Doremalen, Joost
Van hamme, Hugo . . .
van Heerden, Charl . . .
Van Heuven, V.J. . . . . .
van Leeuwen, David A.
van Lieshout, Pascal . .
van Niekerk, D.R. . . . . .
van Santen, Jan P.H. . .
Van Segbroeck, M. . . . .
van Son, Nic . . . . . . . . . .
Vasilescu, Ioana . . . . . .
Veaux, Christophe . . . .
Verdet, Florian . . . . . . .
Verlinde, Patrick . . . . . .
Verma, Ragini . . . . . . . .
Vertanen, Keith . . . . . . .
Vesnicer, Boštjan . . . . .
Viana, M. Céu . . . . . . . . .
Vicsi, Klára . . . . . . . . . . .
Vijayasenan, Deepu . .
Villette, Stephane . . . .
Vipperla, Ravichander
Viscelgia, Tanya . . . . . .
Visweswariah, Karthik
Vivanco, Hiram . . . . . . .
Vlasenko, Bogdan . . . .
Vogel, Irene . . . . . . . . . . .
Vogt, Robbie . . . . . . . . . .
Vogt, Thurid . . . . . . . . . .
Volín, Jan . . . . . . . . . . . . .
Thu-Ses1-O4-3
Thu-Ses2-P3-1
Wed-Ses3-S1-6
Thu-Ses1-O4-4
Wed-Ses2-O1-1
Wed-Ses3-P1-2
Mon-Ses3-P4-2
Tue-Ses2-P3-2
Tue-Ses2-P4-2
Wed-Ses1-P2-9
Thu-Ses1-O1-6
Tue-Ses2-O1-5
Thu-Ses2-O4-1
Thu-Ses2-P3-4
Wed-Ses4-P4-1
Tue-Ses1-P4-6
Tue-Ses1-P4-10
Tue-Ses2-O4-5
Wed-Ses2-P2-7
Wed-Ses3-O1-1
Thu-Ses1-O4-3
Tue-Ses1-P1-9
Tue-Ses1-P1-10
Tue-Ses1-P3-12
Tue-Ses1-O4-1
Tue-Ses2-P4-2
Thu-Ses1-O1-6
Tue-Ses3-P3-5
Mon-Ses2-P1-6
Wed-Ses3-P1-1
Mon-Ses2-P2-1
Wed-Ses3-O3-4
Wed-Ses2-P2-6
Wed-Ses1-S1-2
Tue-Ses1-O3-4
Mon-Ses2-O3-5
Tue-Ses2-O2-4
Wed-Ses2-O2-3
Tue-Ses2-O4-4
Thu-Ses1-P1-6
Tue-Ses3-P3-3
Wed-Ses1-P2-5
Tue-Ses3-P2-1
Tue-Ses2-P2-4
Wed-Ses2-O2-5
Wed-Ses2-P2-10
Wed-Ses1-P2-1
Tue-Ses3-O3-2
Wed-Ses1-O1-4
Mon-Ses2-S1-5
Mon-Ses2-P1-3
Tue-Ses3-S2-7
154
172
150
154
125
142
71
96
98
119
151
88
166
173
147
84
85
91
132
138
154
80
80
83
78
98
151
107
52
142
53
140
132
125
77
50
89
126
91
155
107
118
104
94
127
133
118
100
112
60
52
111
Wed-Ses1-P4-5
Wed-Ses4-P4-5
Mon-Ses2-S1-8
Tue-Ses2-P2-12
Tue-Ses3-S2-2
Wed-Ses2-P3-7
Wed-Ses2-P1-8
Thu-Ses2-P3-9
Mon-Ses2-P1-10
Thu-Ses2-O4-6
Mon-Ses2-P4-3
Tue-Ses3-P1-17
Wed-Ses3-P1-13
Wed-Ses4-P4-2
Wed-Ses2-P4-7
Mon-Ses3-S1-6
Mon-Ses3-S1-7
Tue-Ses2-O2-3
Wed-Ses1-S1-3
Mon-Ses3-O3-2
Mon-Ses3-O4-3
Wed-Ses2-P4-10
Wed-Ses2-P4-11
Wed-Ses2-P4-12
Tue-Ses2-O4-3
Thu-Ses2-P4-9
Wed-Ses3-O4-1
Wed-Ses1-P4-8
Mon-Ses3-O3-3
Tue-Ses1-P4-2
Wed-Ses2-O2-6
Wed-Ses1-O3-6
123
148
61
95
110
134
130
174
53
167
57
104
143
147
136
74
74
89
125
64
65
136
137
137
91
175
141
123
64
84
127
114
W
Waclawičová, Martina
Wagner, Agnieszka . . .
Wagner, Michael . . . . . .
Wagner, Petra . . . . . . . . .
Wahab, Abdul . . . . . . . .
Waibel, Alex . . . . . . . . . .
Walker, Benjamin H. .
Walker, Kevin . . . . . . . . .
Wallhoff, Frank . . . . . . .
Walsh, John MacLaren
Walsh, Michael . . . . . . .
Wan, Vincent . . . . . . . . .
Wand, Michael . . . . . . . .
Wang, Bei . . . . . . . . . . . .
Wang, Chao . . . . . . . . . .
Wang, Cheng-Cheng . .
Wang, D. . . . . . . . . . . . . .
Wang, Dong . . . . . . . . . .
Wang, Hsin-Min . . . . . .
Wang, Lan . . . . . . . . . . . .
Wang, Lijuan . . . . . . . . .
Wang, Miaomiao . . . . . .
Wang, Ning . . . . . . . . . . .
Wang, Shijin . . . . . . . . . .
Wang, Shizhen . . . . . . . .
Wang, Tianyu T. . . . . . . Thu-Ses2-O3-2 165
Wang, Wen . . . . . . . . . . . . Mon-Ses3-O4-5 65
Wed-Ses2-P4-2 135
Wed-Ses2-P4-3 135
Wang, Y. . . . . . . . . . . . . . . Tue-Ses2-P4-2 98
Wang, Yih-Ru . . . . . . . . . Mon-Ses3-P2-5 68
Ward, Nigel G. . . . . . . . . Mon-Ses2-P1-10 53
Wed-Ses1-O2-4 113
Wed-Ses4-P4-6 148
Watanabe, Shinji . . . . . Mon-Ses2-P3-4 56
Wed-Ses2-O4-5 129
Watkins, C.J. . . . . . . . . . . Thu-Ses2-P4-5 174
Watson, Catherine I. . . Tue-Ses3-S2-5 111
Watson, Ian . . . . . . . . . . . Tue-Ses3-S2-4 110
Watts, Oliver . . . . . . . . . . Mon-Ses3-O3-6 64
Thu-Ses1-P2-1 156
Way, Andy . . . . . . . . . . . . Tue-Ses3-O4-6 101
Weber, Frederick . . . . . Mon-Ses3-P2-12 69
Wechsung, Ina . . . . . . . . Mon-Ses2-P4-13 59
Wed-Ses1-P4-7 123
Weenink, David . . . . . . . Wed-Ses2-P3-4 133
Weinstein, Eugene . . . . Thu-Ses1-P4-11 161
Weiss, Benjamin . . . . . . Mon-Ses2-P4-14 59
Wendemuth, Andreas Wed-Ses2-P2-10 133
Wenhardt, Stefan . . . . . Tue-Ses1-S2-6 86
Wenndt, S.J. . . . . . . . . . . . Wed-Ses1-P4-3 122
Werner, Stefan . . . . . . . . Mon-Ses3-O2-4 63
White, Christopher M. . Tue-Ses2-P3-6 96
Wed-Ses2-P4-13 137
White, Michael . . . . . . . . Thu-Ses1-O2-3 152
Whittaker, Steve . . . . . . Thu-Ses2-P1-11 170
Wibowo, Suryoadhi . . . Tue-Ses2-P2-1 93
Widjaja, Henry . . . . . . . . Tue-Ses2-P2-1 93
Wiesenegger, Michael . Tue-Ses1-P3-1 81
Wik, Preben . . . . . . . . . . . Tue-Ses1-P2-6 81
Wilfart, Geoffrey . . . . . . Wed-Ses1-P3-8 121
Williams, Jason D. . . . . Wed-Ses3-S1-3 150
Windmann, Andreas . Tue-Ses3-S2-2 110
Winkelmann, Raphael Wed-Ses1-P2-4 118
Wintrode, Jonathan . . Tue-Ses3-P4-1 108
Wittenberg, Sören . . . . Wed-Ses1-P3-4 120
Wittenburg, Peter . . . . . Wed-Ses1-P4-1 122
Woehrling, Cécile . . . . . Wed-Ses3-O1-3 138
Wohlmayr, Michael . . . Tue-Ses2-P1-3 92
Wójcicki, Kamil . . . . . . . Tue-Ses3-P1-5 102
Wokurek, Wolfgang . . Wed-Ses1-P1-7 116
Wolfe, Patrick J. . . . . . . Mon-Ses2-O4-4 51
Mon-Ses2-O4-6 51
Thu-Ses2-O3-5 166
Wölfel, Matthias . . . . . . Tue-Ses1-P4-7 85
Wolff, Matthias . . . . . . . Wed-Ses1-P3-4 120
Wöllmer, Martin . . . . . . Wed-Ses1-O2-6 113
Thu-Ses1-O1-5 151
Wolters, Maria . . . . . . . . Tue-Ses3-P3-3 107
Wong, W. . . . . . . . . . . . . . . Wed-Ses2-O4-4 129
Woodland, P.C. . . . . . . . Mon-Ses2-P3-7 56
Mon-Ses3-O1-3 62
Mon-Ses3-O1-4 62
Thu-Ses1-P3-4 158
Woods, Roger . . . . . . . . . Wed-Ses3-P3-8 147
Wrigley, Stuart N. . . . . . Thu-Ses2-P1-11 170
Wrobel-Dautcourt, B. . Mon-Ses3-P1-8 66
Wu, Cheng . . . . . . . . . . . . Mon-Ses3-O4-4 65
Wu, Chung-Hsien . . . . . Tue-Ses2-O3-6 90
Wu, Dalei . . . . . . . . . . . . . Tue-Ses1-O1-4 75
Wed-Ses1-O4-6 115
Wu, Guanyong . . . . . . . . Mon-Ses2-O3-1 49
Wu, Hsu-Chih . . . . . . . . . Mon-Ses3-P4-8 72
Wu, Jiang . . . . . . . . . . . . . Tue-Ses2-P1-1 91
Wu, Wei . . . . . . . . . . . . . . . Tue-Ses2-O3-4 90
Wed-Ses2-P4-2 135
Wu, Wing Li . . . . . . . . . . . Wed-Ses3-P1-6 142
Wu, Yi-Jian . . . . . . . . . . . . Mon-Ses3-P2-11 69
Wed-Ses1-P3-10 121
Wu, Zhizheng . . . . . . . . . Mon-Ses3-O3-3 64
Wuth, Jorge . . . . . . . . . . . Wed-Ses2-O2-5 127
Wutiwiwatchai, Chai . Mon-Ses3-P2-3 67
Mon-Ses3-P2-6 68
Wed-Ses1-P3-12 122
X
Xiang, Bing . . . . . . . . . . . Mon-Ses2-P3-8 57
Xie, Shasha . . . . . . . . . . . Tue-Ses3-P4-9 109
Xu, Bo . . . . . . . . . . . . . . . . . Wed-Ses2-O2-6 127
Xu, Haihua . . . . . . . . . . . . Mon-Ses2-O3-1 49
Xu, Haitian . . . . . . . . . . . . Wed-Ses3-P3-7 147
Tue-Ses2-O2-3 89
Wed-Ses1-P1-11 117
Xu, Lu . . . . . . . . . . . . . . . Wed-Ses2-P2-9 132
Xu, Mingxing . . . . . . . . . Wed-Ses2-P2-9 132
Xu, Minqiang . . . . . . . . . Tue-Ses3-O3-6 100
Xu, Puyang . . . . . . . . . . . Thu-Ses2-P1-8 169
Xu, Yi . . . . . . . . . . . . . . . . Mon-Ses3-P1-3 66
Xue, Jian . . . . . . . . . . . . . Mon-Ses2-P3-8 57
Y
Yamada, Makoto . . . . . Tue-Ses3-P1-8 103
Yamagata, Tomoyuki . Mon-Ses2-P4-2 57
Yamagishi, Junichi . . . Mon-Ses3-O3-6 64
Mon-Ses3-P2-9 68
Tue-Ses3-P2-4 105
Wed-Ses2-P3-11 135
Thu-Ses1-P2-1 156
Yamaguchi, Y. . . . . . . . . Wed-Ses1-O3-5 114
Yamakawa, Kimiko . . . Mon-Ses3-P1-6 66
Yamamoto, Kazumasa Tue-Ses2-P2-2 94
Yaman, Sibel . . . . . . . . . . Thu-Ses1-P4-1 159
Thu-Ses1-P4-2 160
Yamanaka, N. . . . . . . . . . Mon-Ses3-P4-4 72
Yamashita, Yoichi . . . . Tue-Ses3-P1-13 103
Yamauchi, Emi Juliana Thu-Ses2-P1-9 169
Yamauchi, Yutaka . . . . Mon-Ses3-P4-6 72
Yan, Yonghong . . . . . . . Wed-Ses2-P1-4 130
Wed-Ses2-P4-6 136
Thu-Ses2-P3-5 173
Thu-Ses2-P3-7 173
Yan, Yuling . . . . . . . . . . . Tue-Ses1-S1-1 86
Yan, Zhi-Jie . . . . . . . . . . . Wed-Ses1-P3-2 120
Yang, Bin . . . . . . . . . . . . . Wed-Ses2-P1-9 131
Yang, Dali . . . . . . . . . . . . . Wed-Ses2-P2-9 132
Yang, Dong . . . . . . . . . . . Tue-Ses1-O3-6 77
Yang, Qian . . . . . . . . . . . . Tue-Ses1-P4-7 85
Yano, Masafumi . . . . . . Mon-Ses2-P1-1 51
Yanushevskaya, Irena Wed-Ses2-P1-5 130
Ye, Guoli . . . . . . . . . . . . . . Wed-Ses3-P2-2 144
Yegnanarayana, B. . . . . Tue-Ses2-P1-5 92
Tue-Ses2-P1-6 92
Wed-Ses1-O2-5 113
Wed-Ses3-P1-5 142
Yeh, Yao-Ming . . . . . . . . Tue-Ses3-P4-10 110
Yi, Youngmin . . . . . . . . . Tue-Ses2-P3-3 96
Yip, Michael C.W. . . . . . Wed-Ses2-O1-2 125
Yoma, Nestor Becerra . Tue-Ses2-P2-4 94
Wed-Ses2-O2-5 127
Yoon, Su-Youn . . . . . . . . Wed-Ses2-O2-4 127
Yoshimoto, Masahiko . Tue-Ses3-P4-4 109
You, Hong . . . . . . . . . . . . Mon-Ses2-O1-3 48
Young, S. . . . . . . . . . . . . . . Thu-Ses1-P4-5 160
Yousafzai, Jibran . . . . . Wed-Ses3-P3-4 146
Yu, Dong . . . . . . . . . . . . . . Tue-Ses1-O1-5 75
Yu, K. . . . . . . . . . . . . . . . . . Thu-Ses1-P4-5 160
Yu, Kai . . . . . . . . . . . . . . . . Wed-Ses2-O3-3 127
Yu, Tao . . . . . . . . . . . . . . . Tue-Ses3-P1-14 104
Yuan, Jiahong . . . . . . . . Wed-Ses3-O2-5 139
Thu-Ses2-O2-5 165
Z
Zahorian, Stephen A. . Tue-Ses2-P1-1 91
Zainkó, Csaba . . . . . . . . Mon-Ses3-P4-9 73
Zbib, Rabih . . . . . . . . . . . Tue-Ses1-O3-2 77
Zdansky, Jindrich . . . . Tue-Ses2-O1-6 88
Zechner, Klaus . . . . . . . Mon-Ses3-P4-5 72
Zeissler, V. . . . . . . . . . . . . Mon-Ses3-P4-4 72
Železný, Miloš . . . . . . . . Thu-Ses2-P1-5 169
Zellers, Margaret . . . . . Wed-Ses4-P4-14 149
Zen, Heiga . . . . . . . . . . . . Wed-Ses1-P3-3 120
Wed-Ses2-P3-12 135
Zhang, Bin . . . . . . . . . . . . Tue-Ses2-O3-4 90
Zhang, Caicai . . . . . . . . . Wed-Ses3-P1-8 143
Zhang, Chi . . . . . . . . . . . . Tue-Ses1-P3-7 82
Zhang, Jianping . . . . . . Wed-Ses2-P1-4 130
Zhang, Le . . . . . . . . . . . . . Wed-Ses2-P4-7 136
Zhang, Qingqing . . . . . . Thu-Ses2-P3-5 173
Zhang, R. . . . . . . . . . . . . . Mon-Ses3-O4-3 65
Zhang, Shi-Xiong . . . . . Tue-Ses3-O3-4 100
Zhao, Qingwei . . . . . . . . Wed-Ses2-P4-6 136
Zhao, Sherry Y. . . . . . . . Thu-Ses2-P2-2 170
Zheng, Jing . . . . . . . . . . . Mon-Ses3-O4-5 65
Wed-Ses2-P4-3 135
Zhou, Bowen . . . . . . . . . . Mon-Ses2-P3-8 57
Zhou, Haolang . . . . . . . . Tue-Ses1-P3-4 82
Zhou, Xi . . . . . . . . . . . . . . . Tue-Ses3-O3-6 100
Zhou, Yu . . . . . . . . . . . . . . Wed-Ses2-P1-4 130
Zhu, Donglai . . . . . . . . . . Wed-Ses3-O1-2 138
Zhu, Jie . . . . . . . . . . . . . . . Mon-Ses2-O3-1 49
Zhuang, Xiaodan . . . . . Thu-Ses1-S1-4 162
Žibert, Janez . . . . . . . . . . Thu-Ses1-O3-1 153
Zigel, Yaniv . . . . . . . . . . . Mon-Ses2-P2-6 54
Mon-Ses2-P2-7 54
Wed-Ses2-P2-8 132
Zimmermann, M. . . . . . Tue-Ses1-P3-8 83
Zubizarreta, M.L. . . . . . Tue-Ses1-O2-1 76
Zweig, Geoffrey . . . . . . . Wed-Ses1-S1-6 125
Wed-Ses2-O3-2 127
Thu-Ses2-O1-5 164
Venue Floorplan
[Floorplan diagrams of the Brighton Centre (Ground, First and Third Floors), showing the Main Hall, Foyer, West Bar, East Bar, Hewison Hall, Sunrise Room, Rainbow Room, Jones (East Wing 1) & Fallside (East Wing 2), Holmes (East Wing 3) & Ainsworth (East Wing 4), and the Brighton Centre Suites (BCS).]
Interspeech 2009 Programme-at-a-Glance
Sunday, 06 September (Tutorials, Loebner Competition)
Rooms: Jones (East Wing 1), Fallside (East Wing 2), Holmes (East Wing 3), Ainsworth (East Wing 4), Rainbow Room
TUTORIALS
08:30 Registration for tutorials opens (closes at 14:30)
09:00 ISCA Board Meeting 1 (finish at 17:00) - BCS Room 3
09:15 Morning tutorials (first half):
T-1: Analysis by Synthesis of Speech Prosody, from Data to Models
T-2: Dealing with High Dimensional Data with Dimensionality Reduction
T-3: Language and Dialect Recognition
T-4: Emerging Technologies for Silent Speech Interfaces
10:45 Coffee break
11:15 Morning tutorials (second half): T-1 to T-4 continued
12:45 Lunch
14:00 General registration opens (closes at 18:00)
14:15 Afternoon tutorials (first half):
T-5: In-Vehicle Speech Processing & Analysis
T-6: Emotion Recognition in the Next Generation: an Overview and Recent Development
T-7: Fundamentals and Recent Advances in HMM-based Speech Synthesis
T-8: Statistical Approaches to Dialogue Systems
15:45 Tea break
16:15 Afternoon tutorials (second half): T-5 to T-8 continued
18:00 Elsevier Thank You Reception for Former Computer Speech and Language Editors (finish at 19:30) - BCS Room 1
Loebner Competition: the first Interspeech conversational systems challenge
Monday, 07 September
Rooms: Main Hall, Jones (East Wing 1), Fallside (East Wing 2), Holmes (East Wing 3), Ainsworth (East Wing 4), Hewison Hall (oral, poster and special sessions)
09:00 Arrival and Registration
10:00 Opening Ceremony - Main Hall
11:00 Keynote: Sadaoki Furui, ISCA Medallist, “Selected Topics from 40 years of Research on Speech and Speaker Recognition” - Main Hall
12:00 Lunch; IAC (Advisory Council) Meeting - BCS Room 3
13:30 ASR: Features for Noise Robustness
15:30 Tea Break
16:00 ASR: Language Models I
Further parallel oral, poster and special sessions on Monday:
Production: Articulatory Modelling
Systems for LVCSR and Rich Transcription
Accent and Language Recognition
Speech Analysis and Processing I
Speech Perception I
ASR: Acoustic Model Training and Combination
Spoken Dialogue Systems
Phoneme-level Perception
Statistical Parametric Synthesis I
Systems for Spoken Language Translation
Human Speech Production I
ASR: Adaptation I
Prosody, Text Analysis, and Multilingual Models
Applications in Learning and Other Areas
Emotion Challenge
Silent Speech Interfaces
19:30 Welcome Reception - Brighton Dome
Tuesday, 08 September
08:30 Keynote: Tom Griffiths, “Connecting Human and Machine Learning via Probabilistic Models of Cognition” - Main Hall
09:30 Coffee Break
10:00 ASR: Discriminative Training
12:00 Lunch; Elsevier Editorial Board Meeting for Computer Speech and Language - BCS Room 1; Special Interest Group Meeting - BCS Room 3
13:30 Standardising assessments for voice and speech pathology (finish at 14:30) - BCS Room 3
13:30 Automotive and Mobile Applications
15:30 Tea Break
16:00 Panel: Speech & Intelligence
Further parallel oral, poster and special sessions on Tuesday:
Language Acquisition
ASR: Lexical and Prosodic Models
Unit-Selection Synthesis
Human Speech Production II
Speech and Audio Segmentation and Classification
Speech Perception II
Advanced Voice Function Assessment
Speaker Recognition and Diarisation
ASR: Spoken Language Understanding
Prosody: Production I
Speaker Diarisation
Speech Processing with Audio or Audiovisual Input
Speech Analysis and Processing II
ASR: Decoding and Confidence Measures
Robust ASR I
ISCA Student Advisory Committee
Speaker Verification & Identification I
Text Processing for Spoken Language Generation
Single- and Multi-Channel Speech Enhancement
ASR: Acoustic Modelling
Assistive Speech Technology
Topics in Spoken Language Processing
Measuring the Rhythm of Speech
Prosody Perception and Language Acquisition
Statistical Parametric Synthesis II
Resources, Annotation and Evaluation
Lessons and Challenges Deploying Voice Search
Speech Synthesis Methods
LVCSR Systems and Spoken Term Detection
Active Listening & Synchrony
18:15 ISCA General Assembly - Main Hall
19:30 Reviewers' Reception - Brighton Pavilion; Student Reception - Al Duomo Restaurant
Wednesday, 09 September
08:30 Keynote: Deb Roy, “New Horizons in the Study of Language Development” - Main Hall
09:30 Coffee Break
10:00 Speaker Verification & Identification II; Emotion and Expression I
12:00 Lunch; Interspeech Steering Committee - BCS Room 1; Elsevier Editorial Board Meeting for Speech Communication - BCS Room 3
13:30 Word-level Perception
15:30 Tea Break
16:00 Language Recognition
19:30 Revelry at the Racecourse
Further parallel oral, poster and special sessions on Wednesday:
ASR: Adaptation II
Voice Transformation I
Phonetics, Phonology, Cross-Language Comparisons, Pathology
Applications in Education and Learning
ASR: New Paradigms I
Single-Channel Speech Enhancement
Emotion and Expression II
Expression, Emotion and Personality Recognition
Phonetics & Phonology
Speech Activity Detection
Multimodal Speech (e.g. Audiovisual Speech, Gesture)
Phonetics
Speaker Verification & Identification III
Robust ASR II
Machine Learning for Adaptivity in Dialogue Systems
Prosody: Production II
Voice Transformation II
Systems for Spoken Language Understanding
New Approaches to Modelling Variability
for ASR
Thursday, 10 September
08:30 Keynote: Mari Ostendorf, “Transcribing Speech for Spoken Language Processing” - Main Hall
09:30 Coffee Break
10:00 Robust ASR III
12:00 Lunch; Industrial Lunch - BCS Room 1
13:30 User Interactions in Spoken Dialog Systems
Further parallel oral, poster and special sessions on Thursday:
Prosody: Perception
Production: Articulation and Acoustics
Segmentation and Classification
Evaluation & Standardisation of SL Technology & Syst.
Features for Speech and Speaker Recognition
Speech and Multimodal Resources & Annotation
Speech Coding
ASR: Language Models II
Speaker & Speech Variability, Paraling. & Nonling. Cues
ASR: Tonal Language, Cross-Ling. and Multiling. ASR
ASR: Acoustic Model Features
ASR: New Paradigms II
Speech Analysis and Processing III
15:30 Tea Break
16:00 Closing Ceremony - Main Hall