Message from the ISCA President
6-10 September 2009
CONFERENCE PROGRAMME & ABSTRACT BOOK
Speech and Intelligence
Interspeech 2009, Brighton UK
10th Annual Conference of the International Speech Communication Association
www.interspeech2009.org

Map of Venue & Central Brighton: conference venue highlighted with red arrow. Map courtesy of Visit Brighton.

6-10 September 2009
The Brighton Centre, Brighton, United Kingdom

Copyright © 2009 International Speech Communication Association
http://www.isca-speech.org
[email protected]
All rights reserved. ISSN 1990-9772

Proceedings CD-ROM Editors: Maria Uther, Roger Moore, Stephen Cox

Acknowledgements: Photos of Brighton and map of Brighton courtesy of Visit Brighton (copyright holder), Brighton & Hove City Council. Thanks also to our colleagues on the organising committee, Meeting Makers and Brighton & Hove City Council for contributing relevant information for this booklet. Thanks to Dr. James Uther for assistance on various elements of the booklet.

Abstracts and Proceedings CD-ROM Production by Causal Productions Pty Ltd
http://www.causalproductions.com
[email protected]

Contents
Welcome to Interspeech 2009
Interspeech 2009 Information
General Information and the City of Brighton
Interspeech 2009 Social Programme
Interspeech 2009 Organisers
Sponsors and Exhibitors
Supporting Committee Institutions
Contest for the Loebner Prize 2009
Satellite Events
Interspeech 2009 Keynote Sessions
Interspeech 2009 Special Sessions
Interspeech 2009 Tutorials - Sunday 6 September 2009
Public Engagement Events
Session Index
Interspeech 2009 Programme by Day
Abstracts
Author Index
Venue Floorplan

Welcome to Interspeech 2009

Message from the ISCA President

On behalf of the International Speech Communication Association (ISCA), welcome to INTERSPEECH 2009 in Brighton. Ten years ago, in Budapest, we took the first steps towards creating what later became a unified Association and a unified conference, joining the best of the former EUROSPEECH and ICSLP conferences. This conference is the 10th in the long cycle that bears the INTERSPEECH label and that already includes conferences in such wonderful venues as Beijing, Aalborg, Denver, Geneva, Jeju, Lisbon, Pittsburgh, Antwerp and Brisbane.

I would like to start this long message of thanks and commendation by honouring the ISCA Medalist for 2009, Prof. Sadaoki Furui, for his outstanding contributions to speech processing and his leadership as one of the first Presidents of the International Speech Communication Association. At this meeting we shall also recognize several other ISCA members who, for their technical merit and service to the Association, were recently elected as ISCA Fellows: Anne Cutler, Wolfgang Hess, Joseph Mariani, Hermann Ney, Roberto Pieraccini, and Elizabeth Shriberg. Next in my list of ISCA volunteers to thank are the members who are promoting research activities in speech science and technology by giving lectures in different parts of the world as ISCA Distinguished Lecturers: Rich Stern, Abeer Alwan, and James Glass.

ISCA's growing efforts to promote speech science and technology research are reflected in the work of the Special Interest Groups and International Sub-Committees, in the many workshops spanning multidisciplinary areas that continuously enlarge our electronic archive, in the increasing number of grants to students, in our very long newsletter ISCApad, and in many other activities. Of particular relevance are the activities undertaken by our very active Student Advisory Committee, which recently launched a resume posting service. Thank you all for your continuing help and support!

The coordination of all these activities is the responsibility of the ISCA Board, and I would like to particularly thank the two members who have completed their terms in 2009 for all their efforts for the community: Eva Hajičova and Lin-shan Lee. The past year has been a year of expansion, but also of consolidation of many new activities. This was the prime motivation for enlarging the Board to 14 members. I take this opportunity to welcome the new members: Nick Campbell, Keikichi Hirose, Haizhou Li, Douglas O'Shaughnessy, and Yannis Stylianou.

For INTERSPEECH 2009, over thirteen hundred papers were submitted, and approximately 57% of the regular papers were selected for presentation after the peer review process, with at least 3 reviews per paper. Many members of the international community participated in this review process, and I hope to have the chance to thank you all personally in Brighton. The large number of submissions marks the INTERSPEECH conferences as the major annual conferences in speech science and technology, a role that we would like to further enhance by getting these conferences included in major citation indexes.
We hope that this number keeps increasing at the next events in Makuhari (2010), Florence (2011), and Portland (2012). This conference is chaired by the first ISCA President, Prof. Roger Moore. We are particularly appreciative of his organizing team for testing a new model for INTERSPEECH organization, independent of any university sponsorship. For this bold step, and for all the work and devotion that you have put into organizing this conference, thank you all so much! Having heard you discuss your plans for almost 4 years, we are looking forward to a superbly organized conference, with excellent keynotes and tutorials, interesting sessions, a very lively social programme, and innovations such as the Loebner award. I invite you all to join us in celebrating the 10th anniversary of ISCA and wish you a very successful conference.

Isabel Trancoso, ISCA President

Message from the General Chair

Dear Colleague,

On behalf of the organising team, it is with great personal pleasure that I welcome you to Brighton and to INTERSPEECH-2009: the 10th Annual Conference of the International Speech Communication Association (ISCA). It has been almost four years since we put together the bid to host INTERSPEECH in the UK, and it has been a truly momentous experience for everyone involved. A large number of people have worked very hard to bring this event to fruition, so I sincerely hope that everything will run as smoothly as possible, and that you have an enjoyable and productive time with us here in Brighton.

The theme of this year's conference is Speech and Intelligence, and we have arranged a number of special events in line with this. For example, on Sunday we are hosting the 19th annual Loebner Prize for Artificial Intelligence (a text-based instantiation of a Turing test), run by Hugh Loebner himself. Also on the Sunday, and inspired by the Loebner Prize, for the first time we will be attempting to run a real-time speech-based version of the Turing test. Although we only have a couple of contestants, we hope that it will be an informative and entertaining aspect of the day's activities. Another conference event related to Speech and Intelligence is a special semi-plenary discussion that is scheduled to take place on Tuesday at 16:00 (Tue-Ses3-S1). This will involve a number of distinguished panellists who have agreed to engage in a lively Q&A interaction with the audience. Please come along and join in with the debate.

As well as the two competitions, other events taking place on Sunday include eight excellent and varied Tutorials presented by fourteen top quality lecturers. Tutorials have become a popular feature of INTERSPEECH conferences, and several hundred attendees take advantage of the opportunity to learn at first hand some of the core scientific principles underpinning different aspects of our developing field. Another first for INTERSPEECH-2009 is Sunday's public exhibition of speech-related activities. Public engagement with science is an important issue in modern society, and we are very grateful for the help and support that we have received to mount this event from Walking with Robots (http://www.walkingwithrobots.org/), a science outreach project funded by the UK Engineering and Physical Sciences Research Council. It will be interesting to see what the people of Brighton make of our particular brand of science and technology – let's hope for some positive feedback!
Monday sees the beginning of the main conference programme and, after the opening ceremony (during which we will pay tribute to our recently departed senior colleague and ESCA Medallist, Gunnar Fant), we are honoured to welcome this year's ISCA Medallist, Prof. Sadaoki Furui (Tokyo Institute of Technology), who will be presenting the first Keynote talk of the conference. Sadaoki's subject is "Selected Topics from 40 Years of Research on Speech and Speaker Recognition", and I'm sure that he will provide us with a wealth of interesting insights into the progress that he has seen at the forefront of these areas of research.

The main technical programme of oral and poster sessions starts after lunch on Monday and runs through until Thursday afternoon. This year we received an almost record number of submissions: 1303 by the published April deadline. These were assessed by 645 reviewers, and the resulting ~4000 reviews were organised by the 24 Area Coordinators so that the final accept/reject decisions could be made at the Technical Programme Committee meeting held in London at the beginning of June. This careful selection process resulted in the acceptance of 762 papers (707 in the main programme and 55 in special sessions), all of which means that we have a total of 38 oral sessions, 39 poster sessions and 10 special sessions at this year's conference.

In addition to the main programme, each day starts with a prestigious Keynote talk from a distinguished scientist of international standing. On Tuesday, Tom Griffiths (UC Berkeley) will present his talk entitled "Connecting Human and Machine Learning via Probabilistic Models of Cognition"; on Wednesday, Deb Roy (MIT Media Lab) promises to lead us towards "New Horizons in the Study of Language Development"; and on Thursday, Mari Ostendorf (University of Washington) will address her topic of "Transcribing Speech for Spoken Language Processing". Keynote presentations are often the scientific highlight of any conference, so I hope that, like me, you are looking forward to some stimulating early morning talks.

Alongside the regular sessions, we also have a number of special sessions, each of which is devoted to a 'hot' topic in spoken language processing. Daniel Hirst has organised a session on 'Measuring the rhythm of speech'; Oliver Lemon and Olivier Pietquin have put together a session on 'Machine learning for adaptivity in spoken dialogue systems'; Carol Espy-Wilson, Jennifer Cole, Abeer Alwan, Louis Goldstein, Mary Harper, Elliot Saltzman and Mark Hasegawa-Johnson have arranged a session on 'New approaches to modeling variability for automatic speech recognition'; Bruce Denby and Tanja Schultz have put together a session on 'Silent speech interfaces'; Bjoern Schuller, Stefan Steidl and Anton Batliner have organised the 'INTERSPEECH 2009 emotion challenge'; Anna Barney and Mette Pedersen have gathered together people interested in 'Advanced voice function assessment'; Nick Campbell, Anton Nijholt, Joakim Gustafson and Carl Vogel are responsible for a session on 'Active listening and synchrony'; and Mike Cohen, Johan Schalkwyk and Mike Phillips have organised a session on 'Lessons and challenges deploying voice search'.

As well as the scientific programme, we have also arranged a series of social events and activities. Unfortunately, due to the sheer number of attendees, we had to abandon the idea of holding a Party on the Pier.
Instead, we are very pleased to have found an excellent alternative, Revelry at the Racecourse, which is taking place high above the town with stunning seaward views for an evening of food and fun. Other events include the Welcome Reception at the Brighton Museum and Art Gallery, the Students' Reception at the stylish Italian Al Duomo restaurant, and the Reviewers' Reception at the amazing Royal Pavilion. These organised social events are just a few of the opportunities that you will have to discover the delights of the local area. Brighton is a vibrant British seaside town, so I hope that you will enjoy exploring its many attractions and perhaps (weather permitting) the beach.

As I mentioned above, an event the size of INTERSPEECH simply cannot take place without the help and support of a large number of individuals, many of whom give their services freely despite the many other calls on their time. I would particularly like to thank Stephen Cox (University of East Anglia) for taking on the extremely time-consuming role of Technical Programme Chair and Valerie Hazan (University College London) for diligently looking after the most crucial aspect of the whole operation – the financial budget. In fact, due to the lack of underwriting by any particular institution, we were obliged to adopt a very different financial model for running INTERSPEECH this year. So I would also like to thank Valerie and Stephen for agreeing to join me in taking on these additional responsibilities, and ISCA for providing extra help with managing our cash flow.

I would also like to thank the rest of the organising committee: Anna Barney (University of Southampton) for putting together a fun social programme; Andy Breen (Nuance UK) for doing a tremendous job raising sponsorship in a very difficult financial climate; Shona D'Arcy (Trinity College, Dublin) for liaising with the students and organising the student helpers; Thomas Hain (University of Sheffield) for organising an impressive array of tutorials; Mark Huckvale (University College London) for doing a superb job as web master and for significantly upgrading the submission system; Philip Jackson (University of Surrey) for liaising with Hugh Loebner and organising the Loebner Prize competition; Peter Jancovic (University of Birmingham) for coordinating the different meeting room requirements; Denis Johnston (Haughgate Innovations) for helping with the sponsorship drive; Simon King (University of Edinburgh) for providing a contact point for the satellite workshops; Mike McTear (University of Ulster) for helping to smooth the registration process; Ben Milner (University of East Anglia) for organising the exhibition; Ji Ming (Queen's University Belfast) for coordinating the special sessions; Steve Renals (University of Edinburgh) for arranging the plenary sessions; Martin Russell (University of Birmingham) for looking after publicity and for producing a terrific poster; Maria Uther (Brunel University) for preparing the abstract book and conference proceedings; Simon Worgan (University of Sheffield) for organising the public outreach event and the speech-based competition; and Steve Young (University of Cambridge) for assisting with obtaining an opening speaker.

I would also like to thank the team at Meeting Makers – our Professional Conference Organisers – who have brought our vision into reality by providing valuable help and advice along the way.
I would particularly like to thank our generous sponsors, without whose support it would have been very difficult to mount the event. When we started this process in 2005, who could have envisaged the dire financial situation faced by the world's economies and banking systems in 2009? It is a great relief to us that our sponsors have found the means to provide us with support in these difficult times. We are especially grateful to Brighton & Hove City Council for their subvention towards the costs of the conference centre and to Visit Brighton for their encouragement in bringing a large conference such as INTERSPEECH to the UK.

Finally, I would like to thank everyone who submitted a paper to this year's conference (whether they were successful or not), the Reviewers for diligently evaluating them, the Area Chairs for putting together a varied and high-quality programme, and the Session Chairs and Student Helpers for ensuring the smooth running of the event itself. I do hope that you will have an enjoyable and productive time here in Brighton and that you will leave with fond memories of INTERSPEECH-2009 – the place, the people, and the scientific exchanges you engaged in while you were here.

With best wishes for a successful conference.

Roger Moore, Conference Chair, Interspeech 2009

Interspeech 2009 Information

Venue
Interspeech 2009 will take place in the Brighton Centre. The Brighton Centre is located on the King's Road, about a 10 minute walk from Brighton station. All keynote talks are in the Main Hall. See the conference programme for the venues of other sessions.

Registration
The Conference Registration Desks are located in the main foyer. For registration or any administrative issues please enquire at the Conference Registration Desks. These desks will be open at the following times:

Tutorial registration will be open on Sunday 6 September from 0830 hours to 1430 hours.

Conference registration will be open at the following times:
Sunday 6 September: 1400 – 1800 hours
Monday 7 September: 0900 – 1700 hours
Tuesday 8 September: 0800 – 1700 hours
Wednesday 9 September: 0800 – 1700 hours
Thursday 10 September: 0800 – 1700 hours

The full registration package includes:
• Entry to all conference sessions (excluding satellite workshops and tutorial sessions)
• Conference bag containing:
  o Abstract book and conference programme
  o CD-ROM of Conference Proceedings
  o Promotional material
• Welcome Reception at Brighton Museum (Monday 7th September)
• Revelry at the Racecourse (Wednesday 9th September)
• Coffee breaks as per programme
• Badge

Badge
Your name badge, issued to you when you register, must be worn at all conference sessions and social events for identification and security purposes.

Non-Smoking event
Smoking is not permitted anywhere inside the Conference Centre.

Language
The official language of Interspeech 2009 is English.

Internet access
Wi-fi access is available throughout the Brighton Centre. Internet access is also provided on allocated PCs in the Rainbow Room on the Ground Floor, which will be open from Monday to Thursday during conference opening hours. The following website provides details of wireless access points in Brighton: http://www.brighton.co.uk/Wireless_Hotspots/.

Speaker Preparation Room
The Speaker Preparation Room is located in the Sunrise Room. If you are presenting an oral paper, you must load your presentation onto the central fileserver and check that it displays correctly well before your talk.
We recommend you do this well in advance of your presentation and certainly no later than two hours beforehand. The room will be open during the following times:
Sunday 6th September: 1400-1800 hours
Monday 7th September: 0900-1700 hours
Tuesday 8th September: 0830-1700 hours
Wednesday 9th September: 0830-1700 hours
Thursday 10th September: 0830-1700 hours

Coffee Breaks
These are scheduled according to the programme and will be served in the Hewison Hall and foyers. All those with special dietary requirements should make themselves known to Centre staff, who will provide alternative catering.

Lunch Breaks
Lunch is not included as part of your registration. There are several cafes and restaurants in Brighton from which you can purchase lunch.

Insurance
Registration fees do not include personal, travel or medical insurance of any kind. Delegates are advised when registering for the conference and booking travel that a travel insurance policy should be taken out to cover risks including (but not limited to) loss, cancellation, medical costs and injury. Interspeech 2009 and/or the conference organisers will not accept any responsibility for any delegate failing to insure.

General Information and the City of Brighton

Getting There

Rail
Brighton is under an hour by rail from London Victoria station, with 2 services every hour. There are also regular services from many other points, including a direct service from St Pancras station, connecting with Eurostar, and a link to Gatwick and Luton airports. More information can be found at www.nationalrail.co.uk or www.firstcapitalconnect.co.uk.

Road
Brighton is about 45 minutes from the M25 London orbital down the M23 motorway, and 30 minutes from Gatwick airport.

Coach
Regular services to Brighton depart from London, Heathrow and Gatwick airports, and many other locations in the U.K. See www.nationalexpress.com for further information.

Air
Brighton is just 30 minutes by road or rail from London Gatwick International Airport and 90 minutes by road from London Heathrow. There are fast coach links between Heathrow, Gatwick and Brighton.

Sea
Brighton is 20 minutes by road or 25 minutes by rail from the port of Newhaven, where a ferry service operates to Dieppe. See www.aferry.com/visitbrighton/ for more details.

Car Rental
Major car rental companies have offices located at the major airports. To drive in the U.K. you must have a current driver's licence. Note that cars travel on the left side of the road in the U.K.

Brighton
Brighton is one of the most colourful, vibrant and creative cities in Europe. It has a very cosmopolitan flavour and is compact, energetic, unique, fun, lively, historic and free-spirited. Nestling between the South Downs and the sea on the south coast, Brighton offers everything from Regency heritage to beachfront cafes and a lively nightlife. It is a fantastic mix of iconic attractions, award-winning restaurants, funky arts, and cultural festivals and events.

Time Zone
Brighton and the U.K. in general are on British Summer Time (BST) at the time of the conference; BST = UTC (Greenwich) + 1.

Money and Credit Cards
The currency is the British Pound (GBP). Major international credit and charge cards such as Visa, American Express and MasterCard are widely accepted at retail outlets. Travellers' cheques are also widely accepted and can be cashed at banks, airports and major hotels.

Transportation in Brighton
Brighton and Hove is so compact that, once you're here, you might find it easiest to explore the city on foot.
A frequent bus service also runs; a standard ticket across the city costs £1.80, and various day tickets are available for £3.60 or £4.50 depending on the zone covered. Brighton is also one of five nationally selected 'cycling demonstration towns', and bicycles may be hired in a number of shops.

Accommodation
The Brighton area has an abundance of different kinds of accommodation, from budget to luxury hotels. We recommend the Visit Brighton website to find accommodation to suit your needs: http://www.visitbrighton.com/site/accommodation

Electrical Voltage
The electrical supply in the U.K. is 230-240 volts, AC 50Hz. The U.K. three-pin power outlet is different from that in many other countries, so you will need an adaptor socket. If your appliances are 110-130 volts you will need a voltage converter. Universal outlet adaptors for both 240V and 110V appliances are sometimes available in leading hotels.

Tipping
There are no hard and fast rules for tipping in the U.K. If you are happy with the service, a 10-15% tip is customary, particularly in a restaurant or café with table service. Tipping in bars is not expected. For taxi fares, it is usual to round up to the nearest pound (£).

Non-Smoking Policy
The UK smoking laws prevent smoking in enclosed public spaces. It is therefore not acceptable to smoke in restaurants, bars and other public venues. There may be designated smoking areas.

Shop Opening Hours
Shopping hours tend to be from 1000 – 1800 hours, with late opening until 2000 hours on a Thursday.

Emergencies
Please dial 999 for Fire, Ambulance or Police emergency services. The European emergency number 112 may also be used.

Medical Assistance
In the case of medical emergencies, there is a 24 hour Accident & Emergency department at the Royal Sussex County Hospital, Eastern Road, BN2 5BF - 01273 696955 (for ambulances, telephone 999). The Doctor's Surgery for temporary residents and visitors is: Chilvers McCrea Medical Centre, 1st Floor Boots the Chemist, 129 North Street, Brighton, 01273 328080. Open: Mon - Friday 08.00 - 18.00, Saturday 09.00 - 13.00, closed on Sundays. For emergency dental treatment (out of hours) call the Brighton & Hove health authority on 01273 486444 (lines open weekdays 06.30 - 21.30, weekends and bank holidays 09.00 - 12.30).

Pharmacies
The following pharmacies open later than normal working hours:
Ashtons: 98 Dyke Road, Seven Dials, Brighton, BN1 3JD, 01273 325020. Open: Mon - Sunday 09.00 - 22.00 except for 25 December.
Asda: Brighton Marina, BN2 5UT, 01273 688019. Open: Mon - Thursday and Saturday 09.00 - 20.00, Friday 09.00 - 21.00, Sunday 10.00 - 16.00 (times vary on Bank Holidays).
Asda: Crowhurst Road, Brighton, BN1 8AS, 01273 542314. Open: Mon - Thursday & Saturday 09.00 - 20.00, Friday 09.00 - 21.00, Sunday 11.00 - 17.00 (times vary on Bank Holidays).
Westons: 6-7 Coombe Terrace, Lewes Road, Brighton, BN2 4AD, 01273 605354. Open: Mon - Sunday 09.00 - 22.00.

Places of Interest
• The Royal Pavilion – the seaside palace of the Prince Regent (George IV) – www.royalpavilion.org.uk.
• Brighton Walking Tours – hear about Brighton's history and discover interesting facts by downloading www.coolcitywalks.com/brighton/index.php onto your mp3 player.
• Brighton Pier – enjoy a typical day at the seaside on Brighton Pier: enjoy the funfair rides, try traditional seaside treats like candyfloss or a stick of rock, or hire a deckchair and relax.
• Enjoy a traditional afternoon tea at the Grand Hotel – www.grandbrighton.co.uk.
• Take a 45 minute pleasure cruise along the coast and past the 2 piers, or join a 90 minute mackerel fishing trip – www.watertours.co.uk.

Dining and Entertainment
There are numerous cafes and restaurants in the Brighton area, particularly around the 'The Lanes' and 'North Laine' areas.

Interspeech 2009 Social Programme

Monday 7th September, Welcome Reception
Everyone is invited to the Welcome Reception to be held at the Brighton Museum and Art Gallery, part of the historic Brighton Pavilion Estate. Event starts at 19:30.

Tuesday 8th September, Reviewers' Reception
A reception for our hardworking reviewers will take place at the Royal Pavilion, the spectacular seaside palace of the Prince Regent (King George IV), transformed by John Nash between 1815 and 1822 into one of the most dazzling and exotic buildings in the British Isles. Event starts at 19:30. Admission by ticket enclosed with conference pack.

Tuesday 8th September, Student Delegates' Drinks Reception
Also on Tuesday evening there will be a drinks reception for student delegates at the stylish Italian Al Duomo Restaurant in the heart of the town. Event starts at 19:30. Admission by ticket enclosed with conference pack.

Wednesday 9th September, Revelry at the Racecourse
Look out for Revelry at the Racecourse! The Brighton Racecourse, set high on the Sussex Downs with stunning views of Brighton and Hove, will host the Conference Dinner and Party. No live horses, but expect a conference event with a difference and plenty of fun. Everyone is welcome and the event will start at 19:30. Transport to and from the event will be provided.

Interspeech 2009 Organisers

Conference Committee
General Chair: Prof. Roger Moore, University of Sheffield
Technical Programme: Prof. Stephen J. Cox, University of East Anglia
Finance: Prof. Valerie Hazan, University College London
Publications: Dr. Maria Uther, Brunel University
Web Master: Dr. Mark Huckvale, University College London
Plenary Sessions: Prof. Steve Renals, University of Edinburgh
Tutorials: Dr. Thomas Hain, University of Sheffield
Special Sessions: Dr. Ji Ming, Queen's University Belfast
Satellite Workshops: Dr. Simon King, University of Edinburgh
Sponsorship: Dr. Andrew Breen, Nuance UK
Exhibits: Dr. Ben Milner, University of East Anglia
Publicity: Prof. Martin Russell, University of Birmingham
Social Programme: Dr. Anna Barney, University of Southampton
Industrial Liaison: Mr. Denis Johnston, Haughgate Innovations
Advisor: Prof. Steve Young, University of Cambridge
Student Liaison: Dr. Shona D'Arcy, Trinity College, Dublin
Public Outreach: Dr. Simon Worgan, University of Sheffield
Registration: Prof. Michael McTear, University of Ulster
Loebner Contest: Dr. Philip Jackson, University of Surrey
Meeting Rooms: Dr. Peter Jancovic, University of Birmingham

Conference Organisers
Meeting Makers Ltd
76 Southbrae Drive
Glasgow G13 1PP
Tel: +44 (0) 141 434 1500
Fax: +44 (0) 141 434 1519
[email protected]

Scientific Committee
The conference organisers are indebted to the following people, who made such a large contribution to the creation of the technical programme.

Technical Chair
• Stephen Cox, University of East Anglia

Area Coordinators
1. Aladdin Ariyaeeinia, University of Hertfordshire
2. Nick Campbell, Trinity College Dublin
3. Mark J. F. Gales, Cambridge University
4. Yoshi Gotoh, Sheffield University
5. Phil Green, Sheffield University
6. Valerie Hazan, University College London
7. Wendy Holmes, Aurix Ltd
8. David House, KTH Stockholm
9. Kate Knill, Toshiba Research Europe Ltd
10. Bernd Möbius, Stuttgart
11. Sebastian Möller, Deutsche Telekom Laboratories
12. Satoshi Nakamura, NICT/ATR
13. Kuldip Paliwal, Griffith University
14. Gerasimos Potamianos, Athens
15. Ralf Schlüter, RWTH Aachen University
16. Tanja Schultz, Carnegie Mellon University
17. Yannis Stylianou, University of Crete
18. Marc Swerts, Tilburg University
19. Isabel Trancoso, INESC-ID Lisboa / IST
20. Saeed Vaseghi, Brunel University
21. Yi Xu, University College London
22. Bayya Yegnanarayana, International Institute of Information Technology Hyderabad
23. Kai Yu, Cambridge University
24. Heiga Zen, Toshiba Research Ltd.

Special Session Organisers
• Abeer Alwan, UCLA
• Anna Barney, ISVR, University of Southampton
• Anton Batliner, FAU Erlangen-Nuremberg
• Nick Campbell, Trinity College Dublin
• Mike Cohen, Google
• Jennifer Cole, Illinois
• Bruce Denby, Université Pierre et Marie Curie
• Carol Espy-Wilson, University of Maryland
• Louis Goldstein, University of Southern California
• Joakim Gustafson, KTH Stockholm
• Mary Harper, University of Maryland
• Mark Hasegawa-Johnson, Illinois
• Daniel Hirst, Universite de Provence
• Oliver Lemon, Edinburgh University
• Anton Nijholt, Twente
• Mette Pedersen, Medical Centre, Voice Unit, Denmark
• Mike Phillips, Vlingo
• Olivier Pietquin, IMS Research Group
• Elliot Saltzman, Haskins Laboratories
• Johan Schalkwyk, Google
• Björn Schuller, Technische Universität München
• Tanja Schultz, Carnegie Mellon University
• Stefan Steidl, FAU Erlangen-Nuremberg
• Carl Vogel, Trinity College Dublin

Scientific Reviewers
Alberto Abad Sherif Abdou Alex Acero Andre Gustavo Adami Gilles Adda Martine Adda-Decker Masato Akagi Masami Akamine Murat Akbacak Jan Alexandersson Paavo Alku Abeer Alwan Eliathamby Ambikairajah Noam Amir Ove Andersen Tim Anderson Elisabeth Andre Walter Andrews Takayuki Arai Masahiro Araki Aladdin Ariyaeeinia Victoria Arranz Bishnu Atal Roland Auckenthaler Cinzia Avesani Matthew Aylett Harald Baayen Michiel Bacchiani Pierre Badin Janet Baker Srinivas Bangalore Plinio Barbosa Etienne Barnard Anna Barney Dante Barone William J. Barry Anton Batliner Frederic Bechet Steve Beet Homayoon Beigi Nuno Beires Jerome Bellegarda Jose Miguel Benedi Ruiz M. Carmen Benitez Ortuzar Nicole Beringer Kay Berkling Laurent Besacier Frédéric Bettens Frederic Bimbot Maximilian Bisani Judith Bishop Alan Black Mats Blomberg Gerrit Bloothooft Antonio Bonafonte Jean-Francois Bonastre Zinny Bond Helene Bonneau-Maynard Herve Bourlard Lou Boves Daniela Braga Bettina Braun Catherine Breslin John Bridle Mirjam Broersma Niko Brummer Udo Bub Luis Buera Tim Bunnell Anthony Burkitt Denis Burnham Bill Byrne Jose R. Calvo de Lara Joseph P.
Campbell Nick Campbell William Campbell Antonio Cardenal Lopez Valentin Cardenoso-Payo Michael Carey Rolf Carlson Maria Jose Castro Bleda Mauro Cettolo Joyce Chai Chandra Sekhar Chellu Fang Chen Yan-Ming Cheng Rathi Chengalvarayan Jo Cheolwoo Jen-Tzung Chien KK Chin Gerard Chollet Khalid Choukri Heidi Christensen Hyunsong Chung Robert Clark Mike Cohen Luisa Coheur Jennifer Cole Alistair Conkie Martin Cooke Ricardo de Cordoba Piero Cosi Bob Cowan Felicity Cox Catia Cucchiarini Fred Cummins Francesco Cutugno Joachim Dale Paul Dalsgaard Geraldine Damnati Morena Danieli Marelie Davel Chris Davis Amedeo De Dominicis Angel de la Torre Vega Renato De Mori 16 Carme de-la-Mota Gorriz David Dean Michael Deisher Grazyna Demenko Kris Demuynck Bruce Denby Matthias Denecke Li Deng Giuseppe Di Fabbrizio Vassilis Digalakis Christine Doran Ellen Douglas-Cowie Christoph Draxler Jasha Droppo Andrzej Drygajlo Jacques Duchateau Christophe D`Alessandro Mariapaola D`Imperio Kjell Elenius Daniel Ellis Ahmad Emami Julien Epps Anders Eriksson Mirjam Ernestus David Escudero Mancebo Carol Espy-Wilson Sascha Fagel Daniele Falavigna Mauro Falcone Isabel Fale Kevin Farrell Marcos Faundez-Zanuy Marcello Federico Junlan Feng Javier Ferreiros Carlos A. Ferrer Riesgo Tim Fingscheidt Janet Fletcher José A. R. Fonollosa Kate Forbes-Riley Eric Fosler-Lussier Diamantino Freitas Juergen Fritsch Sonia Frota Christian Fuegen Olac Fuentes Hiroya Fujisaki Toshiaki Fukada Sadaoki Furui Mark J. F. Gales Ascensiòn Gallardo Antolìn Yuqing Gao Fernando Garcia Granada Carmen Garcia Mateo Mária Gósy Panayiotis Georgiou Dafydd Gibbon Mazin Gilbert Juan Ignacio Godino Llorente Simon Godsill Godsill Roland Goecke Pedro Gomez Vilda Joaquin Gonzalez-Rodriguez Allen Gorin Martijn Goudbeek Philippe Gournay Bjorn Granstrom Agustin Gravano David Grayden Phil Green Rodrigo C Guido Susan Guion Ellen Haas Kadri Hacioglu Reinhold Haeb-Umbach Jari Hagqvist Thomas Hain John Hajek Eva Hajicova Dilek Hakkani-Tur Gerhard Hanrieder John H.L. Hansen Mary Harper Jonathan Harrington John Harris Naomi Harte Mark Hasegawa-Johnson Jean-Paul Haton Valerie Hazan Timothy J. Hazen Harald Höge Matthieu Hebert Martin Heckmann Per Hedelin Peter Heeman Paul Heisterkamp John Henderson Inmaculada Hernaez Rioja Luis Alfonso Hernandez Gomez Javier Hernando Wolfgang Hess Lee Hetherington Ulrich Heute Keikichi Hirose Hans Guenter Hirsch Julia Hirschberg Daniel Hirst Wendy Holmes W. Harvey Holmes Chiori Hori John-Paul Hosom David House Fei Huang Qiang Huang Isabel Hub Faria Juan Huerta Marijn Huijbregts Melvyn Hunt Lluis F. Hurtado Oliver John Ingram Shunichi Ishihara Masato Ishizaki Yoshiaki Itoh Philip Jackson Tae-Yeoub Jang Esther Janse Arne Jönsson Alexandra Jesse Luis Jesus Qin Jin Michael Johnston Kristiina Jokinen Caroline Jones Szu-Chen Stan Jou Denis Jouvet Ho-Young Jung Peter Kabal Takehiko Kagoshima Alexander Kain Hong-Goo Kang Stephan Kanthak Hiroaki Kato Tatsuya Kawahara Hisashi Kawai Mano Kazunori Thomas Kemp Patrick Kenny Jong-mi Kim Sunhee Kim Byeongchang Kim Hoirin Kim Hong Kook Kim Hyung Soon Kim Jeesun Kim Nam Soo Kim Sanghun Kim Simon King Brian Kingsbury Yuko Kinoshita Christine Kitamura Esther Klabbers Dietrich Klakow Bastiaan Kleijn Kate Knill Hanseok Ko Takao Kobayashi M. A. Kohler George Kokkinakis John Kominek Myoung-Wan Koo Sunil Kumar Kopparapu Christian Kroos Chul Hong Kwon Hyuk-Chul Kwon Oh-Wook Kwon Francisco Lacerda Pietro Laface Unto K. 
Laine Claude Lamblin Lori Lamel Kornel Laskowski 17 Javier Latorre Gary Geunbae Lee Chin Hui Lee Lin-shan Lee Oliver Lemon Qi (Peter) Li Haizhou Li Carlos Lima Georges Linares Mike Lincoln Bjorn Lindblom Anders Lindström Zhen-Hua Ling Yang Liu Yi Liu Karen Livescu Eduardo Lleida Solano Joaquim Llisterri Deborah Loakes Maria Teresa Lopez Soto Ramon Lopez-Cozar Delgado Patrick Lucey Changxue Ma Ning Ma Dusan Macho Cierna Javier Macias-Guarasa Abdulhussain E. Mahdi Brian Mak Robert Malkin Nuno Mamede Claudia Manfredi Lidia Mangu José B. Mariño Acebal Ewin Marsi Carlos David Martínez Hinarejos Jean-Pierre Martens Rainer Martin Alvin Martin Enrique Masgrau Gomez Sameer Maskey John Mason Takashi Masuko Ana Isabel Mata Tomoko Matsui Karen Mattock Bernd Möbius Sebastian Möller Christian Müller Alan McCree Erik McDermott Gordon McIntyre John McKenna Michael McTear Hugo Meinedo Alexsandro Meireles Carlos Eduardo Meneses Ribeiro Helen Meng Florian Metze Bruce Millar Ben Milner Nobuaki Minematsu Wolfgang Minker Holger Mitterer Hansjoerg Mixdorff Parham Mokhtari Garry Molholt Juan Manuel Montero Martinez Asuncion Moreno Pedro J. Moreno Nelson Morgan Ronald Mueller Climent Nadeu Seiichi Nakagawa Satoshi Nakamura Shrikanth Narayanan Eva Navas Jiri Navratil Ani Nenkova João Neto Ron Netsell Sergio Netto Hermann Ney Patrick Nguyen Anton Nijholt Elmar Noeth Albino Nogueiras Michael Norris Mohaddeseh Nosratighods Regine Obrecht John Ohala Sumio Ohno Luis Oliveira Peder Olsen Mohamed Omar Roeland Ordelman Rosemary Orr Alfonso Ortega Gimenez Javier Ortega-Garcia Beatrice Oshika Mari Ostendorf Douglas O`Shaughnessy Fernando S. Pacheco Tim Paek Vincent Pagel Sira Elena Palazuelos Cagigas Kuldip Paliwal Yue Pan P.C. Pandey Jose Manuel Pardo Jun Park Jong Cheol Park Patrick Paroubek Steffen Pauws Mette Pedersen Antonio M. Peinado Carmen Pelaez-Moreno Jason Pelecanos Bryan Pellom Christina Pentland Fernando Perdigao Jose Luis Perez Cordoba Pascal Perrier Hartmut R. Pfitzinger Mike Phillips Michael Picheny Joe Picone Roberto Pieraccini Olivier Pietquin Ferran Pla Santamaría Aristodemos Pnevmatikakis Louis C.W. Pols Alexandros Potamianos Gerasimos Potamianos David Powers Rohit Prasad Mahadeva Prassanna Kristin Precoda Patti Price Tarun Pruthi Mark Przybocki Yao Qian Zhu Qifeng Thomas F. Quatieri Raja Rajasekaran Nitendra Rajput Bhuvana Ramabhadran V. Ramasubramanian Preeti Rao Andreia S. Rauber Mosur Ravishankar Mario Refice Norbert Reithinger Steve Renals Fernando Gil Vianna Resende Jr. Douglas Reynolds Luca Rigazio Michael Riley Christian Ritz Tony Robinson Eduardo Rodriguez Banga Luis Javier Rodriguez-Fuentes Richard Rose Olivier Rosec Antti-Veikko Rosti Jean-Luc Rouas Antonio Rubio Martin Russell Yoshinori Sagisaka Josep M. Salavedra K Samudravijaya Rubén San-Segundo Victoria Eugenia Sanchez Calle Emilio Sanchis Eric Sanders George Saon Shimon Sapir Murat Saraclar Ruhi Sarikaya Hiroshi Saruwatari Antonio Satue Villar 18 Michelina Savino Joan Andreu Sánchez Peiró Thomas Schaaf Johan Schalkwyk Odette Scharenborg Ralf Schlüter Jean Schoentgen Marc Schroeder Bjoern Schuller Michael Schuster Reva Schwartz Antje Schweitzer Encarnacion Segarra Soriano Jose Carlos Segura Frank Seide Mike Seltzer D. Sen Stephanie Seneff Cheol Jae Seong Antonio Serralheiro Kiyohiro Shikano Jiyoung Shin Koichi Shinoda Carlos Silva Olivier Siohan Malcolm Slaney Raymond Slyh Connie So M. 
Mohan Sondhi Victor Sorokin Dave Stallard Mark Steedman Stefan Steidl Andreas Stergiou Richard Stern Mary Stevens Helmer Strik Volker Strom Sebastian Stueker Matt Stuttle Yannis Stylianou Rafid Sukkar Torbjorn Svendsen Marc Swerts Ann Syrdal David Talkin Zheng-Hua Tan Yun Tang Jianhua Tao Carlos Teixeira Antonio Teixeira Joao Paulo Teixeira Louis ten Bosch Jacques Terken Barry-John Theobald William Thorpe Jilei Tian Michael Tjalve Tomoki Toda Roberto Togneri Keiichi Tokuda Doroteo T Toledano Laura Tomokiyo María Inés Torres Barañano Arthur Toth Dat Tran Isabel Trancoso David Traum Kimiko Tsukada Roger Tucker Gokhan Tur L. Alfonso Urena Lopez Jacqueline Vaissiere Dirk Van Compernolle Henk van den Heuvel Jan van Doorn Hugo Van hamme Arjan van Hessen Roeland van Hout David van Leeuwen Jan van Santen Rob Van Son Peter Vary Saeed Vaseghi Mario Vayra Werner Verhelst Jo Verhoeven Ceu Viana Marina Vigario Fábio Violaro Carlos Enrique Vivaracho Pascual Robbie Vogt Julie Vonwiller Michael Wagner Marilyn Walker Patrick Wambacq Hsiao-Chuan Wang Kuansan Wang Ye-Yi Wang Chao Wang Hsin-min Wang Nigel Ward Catherine Watson Christian Wellekens Stanley Wenndt Stefan Werner Yorick Wilks Daniel Willett Briony Williams Monika Woszczyna Johan Wouters Stuart Wrigley John Zhiyong Wu Yi-Jian Wu Bing Xiang Yi Xu Haitian Xu Junichi Yamagashi Bayya Yegnanarayana Nestor Becerra Yoma Chang D. Yoo Dongsuk Yook Steve Young Roger (Peng) Yu Dong Yu Kai Yu Young-Sun Yun Milos Zelezny Heiga Zen Andrej Zgank Tong Zhang Yunxin Zhao Jing Zheng Bowen Zhou Imed Zitouni Udo Zoelzer Geoffrey Zweig

Sponsors and Exhibitors

Sponsors
We gratefully acknowledge the support of:
Google, Inc. for silver-level sponsorship of the conference.
Carstens Medizinelektronik GmbH for bronze-level sponsorship of the conference.
Northern Digital, Inc. for bronze-level sponsorship of the conference.
Nuance, Inc. for bronze-level sponsorship of the conference.
Toshiba Research Europe for bronze-level sponsorship of the conference.
Appen Pty Ltd for bronze-level sponsorship of the conference.
Crown Industries, Inc. for supporting the Loebner Prize competition.
IBM Research Inc. for supporting the Loebner Prize competition.
Brighton & Hove City Council for sponsoring the conference venue.

Exhibitors
Interspeech 2009 welcomes the following exhibitors:

Supporting Committee Institutions
The organising committee would like to acknowledge the support of their respective institutions (institutions presented in alphabetical order).
Trinity College Dublin

In Memoriam
Gunnar Fant
8th October 1919 – 6th June 2009
Speech research pioneer and recipient of the 1989 ESCA Medal for scientific achievement

Contest for the Loebner Prize 2009
Time: 10:45am Sunday 6 September
Venue: Rainbow Room, Brighton Centre

The Loebner Prize for artificial intelligence is the first formal instantiation of a Turing Test. The test is named after Alan Turing, the brilliant British mathematician with many accomplishments in computing science. In 1950, in the article Computing Machinery and Intelligence, which appeared in the philosophy journal Mind, Alan Turing asked the question "Can a Machine Think?" He answered in the affirmative, but a central question was: "If a computer could think, how could we tell?" Turing's suggestion was that if the responses from a computer in an imitation game were indistinguishable from those of a human, the computer could be said to be thinking.
The Loebner Prize competition seeks to find out how close we are to building a computer to pass the Turing test. In 1950 Alan Turing wrote: "I believe that in about fifty years' time it will be possible to programme computers, with a storage capacity of about 10⁹, to make them play the imitation game so well that an average interrogator will not have more than 70 per cent chance of making the right identification after five minutes of questioning..."

The 2009 Loebner Prize will operate in the following manner.
• Panels of judges communicate with two entities over a typewritten link. One entity is a human, one is a computer program, allocated at random.
• Each judge will begin the round by making an initial comment to the first entity and continue interacting for 5 minutes. At the conclusion of the five minutes, the judge will begin the interaction with the second entity and continue for 5 minutes. Entities will be expected to respond to the judges' initial comments or questions.
• There will be no restrictions on what names, etc., the entries, humans, or judges can use, nor any other restrictions on the content of the conversations.
• At the conclusion of the 10 minutes of questioning, judges will be allowed 10 minutes to review the conversations. They will then score one of the two entities as the human.
• Following this, there will be a 5 minute period for judges and confederates to take their places for the next round.

The system that is most often considered to be human by the judges will win a Bronze Loebner medal and $3000. More details at the Loebner Prize web site: http://www.loebner.net/Prizef/loebner-prize.html. The Loebner Prize is made possible by funding from Crown Industries, Inc., of East Orange NJ and contributions from IBM Research.

Organiser: Philip Jackson, [email protected]

"Spoken Language Processing for All"
Spoken Language Processing for All Ages, Health Conditions, Native Languages, and Environments

INTERSPEECH 2010, the 11th conference in the annual series of Interspeech events, will be held at the International Convention Hall at the Makuhari Messe exhibition complex in Chiba, Japan. The conference venue allows easy access for international travelers: 30 minutes from Narita International Airport by bus, 30 minutes from Tokyo station by train, and within walking distance of a number of hotels with a wide variety of room rates. INTERSPEECH 2010 returns to Japan for the first time in 16 years. Japan hosted the first and third International Conferences on Spoken Language Processing (ICSLP) in 1990 and 1994. In 2010, we seek to emphasize the interdisciplinary nature of speech research, and facilitate cross-fertilization among various branches of spoken language science and technology. Mark 26-30 September 2010 on your calendar now! For further details, visit http://www.interspeech2010.org/ or write to [email protected]

General Chair: Keikichi Hirose (The University of Tokyo)
General Co-Chair: Yoshinori Sagisaka (Waseda University)
Technical Program Committee Chair: Satoshi Nakamura (NICT/ATR)

Satellite Events
This is the list of satellite workshops linked to Interspeech 2009.

ACORNS Workshop on Computational Models of Language Evolution, Acquisition and Processing
11 September 2009, Brighton, UK.
The workshop brings together up to 50 scientists to discuss future research in language acquisition, processing and evolution.
Deb Roy, Friedemann Pulvermüller, Rochelle Newman and Lou Boves will provide an overview of the state of the art, a number of discussants from different disciplines will widen the perspective, and all participants can contribute to a roadmap.

AVSP 2009 - Audio-Visual Speech Processing
10-13 Sept 2009, University of East Anglia, Norwich, U.K.
The International Conference on Auditory-Visual Speech Processing (AVSP) attracts an interdisciplinary audience of psychologists, engineers, scientists and linguists, and considers a range of topics related to speech perception, production, recognition and synthesis. Recently the scope of AVSP has broadened to also include discussion of more general issues related to audiovisual communication, for example the interplay between speech and the expression of emotion, and the relationship between speech and manual gestures.

Blizzard Challenge Workshop
4 September, University of Edinburgh, U.K.
In order to better understand and compare research techniques in building corpus-based speech synthesizers on the same data, the Blizzard Challenge was devised. The basic challenge is to take the released speech database, build a synthetic voice from the data and synthesize a prescribed set of test sentences, which are evaluated through listening tests. The results are presented at this workshop. Attendance at the 2009 workshop for the 4th Blizzard Challenge is open to all, not just participants in the challenge. Registration closes on 14th August 2009.

SIGDIAL - Special Interest Group on Dialogue
11-12 Sept 2009, Queen Mary University, London, U.K.
The SIGDIAL venue provides a regular forum for the presentation of cutting-edge research in discourse and dialogue to both academic and industry researchers. The conference is sponsored by the SIGDIAL organization, which serves as the Special Interest Group in discourse and dialogue for both the Association for Computational Linguistics and the International Speech Communication Association.

SLaTE Workshop on Speech and Language Technology in Education (SLaTE)
3-5 September 2009, Wroxall, Warwickshire, U.K.
SLaTE 2009 follows SLaTE 2007, held in Farmington, Pennsylvania, USA, and the STiLL meeting organized by KTH in Marholmen, Sweden, in 1998. The workshop will address all topics which concern speech and language technology for education. Papers will discuss theories, applications, evaluation, limitations, persistent difficulties, general research tools and techniques. Papers that critically evaluate approaches or processing strategies will be especially welcome, as will prototype demonstrations of real-world applications.

Young Researchers' Roundtable on Spoken Dialogue Systems
13-14 September 2009, Queen Mary, University of London, U.K.
The Young Researchers' Roundtable on Spoken Dialog Systems is an annual workshop designed for students, post docs, and junior researchers working in research related to spoken dialogue systems in both academia and industry. The roundtable provides an open forum where participants can discuss their research interests, current work and future plans. The workshop is meant to provide an interdisciplinary forum for creative thinking about current issues in spoken dialogue systems research, and help create a stronger international network of young researchers working in the field.
Interspeech 2009 Keynote Sessions

Keynote 1
ISCA Scientific Achievement Medallist for 2009
Sadaoki Furui, Tokyo Institute of Technology
Selected topics from 40 years of research on speech and speaker recognition
Mon-Ses1-K: Monday 11:00, Main Hall
Chair: Isabel Trancoso

Abstract
This talk summarizes my 40 years of research on speech and speaker recognition, focusing on selected topics that I have investigated at NTT Laboratories, Bell Laboratories and Tokyo Institute of Technology with my colleagues and students. These topics include: the importance of spectral dynamics in speech perception; speaker recognition methods using statistical features, cepstral features, and HMM/GMM; text-prompted speaker recognition; speech recognition by dynamic features; Japanese LVCSR; spontaneous speech corpus construction and analysis; spontaneous speech recognition; automatic speech summarization; WFST-based decoder development and its applications; and unsupervised model adaptation methods.

Presenter
Sadaoki Furui is currently a Professor at Tokyo Institute of Technology, Department of Computer Science. He is engaged in a wide range of research on speech analysis, speech recognition, speaker recognition, speech synthesis, and multimodal human-computer interaction, and has authored or coauthored over 800 published articles. He is a Fellow of the IEEE, the International Speech Communication Association (ISCA), the Institute of Electronics, Information and Communication Engineers of Japan (IEICE), and the Acoustical Society of America. He has served as President of the Acoustical Society of Japan (ASJ) and the ISCA. He has served as a member of the Board of Governors of the IEEE Signal Processing (SP) Society and Editor-in-Chief of both the Transactions of the IEICE and the Journal of Speech Communication. He has received the Yonezawa Prize, the Paper Award and the Achievement Award from the IEICE (1975, 88, 93, 2003, 2003, 2008), and the Sato Paper Award from the ASJ (1985, 87). He has received the Senior Award and Society Award from the IEEE SP Society (1989, 2006), the Achievement Award from the Minister of Science and Technology and the Minister of Education, Japan (1989, 2006), and the Purple Ribbon Medal from the Japanese Emperor (2006). In 1993 he served as an IEEE SPS Distinguished Lecturer.

Keynote 2
Tom Griffiths, UC Berkeley
Connecting human and machine learning via probabilistic models of cognition
Tue-Ses0-K: Tuesday 08:30, Main Hall
Chair: Steve Renals

Abstract
Human performance defines the standard that machine learning systems aspire to in many areas, including learning language. This suggests that studying human cognition may be a good way to develop better learning algorithms, as well as providing basic insights into how the human mind works. However, in order for ideas to flow easily from cognitive science to computer science and vice versa, we need a common framework for describing human and machine learning. I will summarize recent work exploring the hypothesis that probabilistic models of cognition, which view learning as a form of statistical inference, provide such a framework, including results that illustrate how novel ideas from statistics can inform cognitive science. Specifically, I will talk about how probabilistic models can be used to identify the assumptions of learners, learn at different levels of abstraction, and link the inductive biases of individuals to cultural universals.
Presenter
Tom Griffiths is an Assistant Professor of Psychology and Cognitive Science at UC Berkeley, with courtesy appointments in Computer Science and Neuroscience. His research explores connections between human and machine learning, using ideas from statistics and artificial intelligence to try to understand how people solve the challenging computational problems they encounter in everyday life. He received his PhD in Psychology from Stanford University in 2005, and taught in the Department of Cognitive and Linguistic Sciences at Brown University before moving to Berkeley. His work and that of his students has received awards from the Neural Information Processing Systems conference and the Annual Conference of the Cognitive Science Society, and in 2006 IEEE Intelligent Systems magazine named him one of "Ten to watch in AI."

Keynote 3
Deb Roy, MIT Media Lab
New Horizons in the Study of Language Development
Wed-Ses0-K: Wednesday 08:30, Main Hall
Chair: Roger Moore

Abstract
Emerging forms of ecologically-valid longitudinal recordings of human behavior and social interaction promise fresh perspectives on age-old questions of child development. In a pilot effort, 240,000 hours of audio and video recordings of one child's life at home are being analyzed with a focus on language development. To study a corpus of this scale and richness, current methods of the developmental sciences are insufficient. New data analysis algorithms and methods for interpretation and computational modeling are under development. Preliminary speech analysis reveals surprising levels of linguistic "fine-tuning" by caregivers that may provide crucial support for word learning. Ongoing analysis of various other aspects of the corpus aims to model detailed aspects of the child's language development as a function of learning mechanisms combined with everyday experience. Plans to collect similar corpora from more children based on a streamlined recording system are underway.

Presenter
Deb Roy directs the Media Lab's Cognitive Machines group, is founding director of MIT's Center for Future Banking, and chairs the academic program in Media Arts and Sciences. A native of Canada, he received his bachelor of computer engineering from the University of Waterloo in 1992 and his PhD in cognitive science from MIT in 1999. He joined the MIT faculty in 2000 and was awarded the AT&T Associate Professorship of Media Arts and Sciences in 2003. Roy studies how children learn language, and designs machines that learn to communicate in human-like ways. To enable this work, he has developed new data-driven methods for analyzing and modeling human linguistic and social behavior. He has begun exploring applications of these methods to a range of new domains from financial behavior to autism. Roy has authored numerous scientific papers in the areas of artificial intelligence, cognitive modeling, human-machine interaction, data mining and information visualization.

Keynote 4
Mari Ostendorf, University of Washington
Transcribing Speech for Spoken Language Processing
Thu-Ses0-K: Thursday 08:30, Main Hall
Chair: Martin Russell

Abstract
As storage costs drop and bandwidth increases, there has been a rapid growth of spoken information available via the web or in online archives -- including radio and TV broadcasts, oral histories, legislative proceedings, call center recordings, etc. -- raising problems of document retrieval, information extraction, summarization and translation for spoken language.
While there is a long tradition of research in these technologies for text, new challenges arise when moving from written to spoken language. In this talk, we look at differences between speech and text, and how we can leverage the information in the speech signal beyond the words to provide structural information in a rich, automatically generated transcript that better serves language processing applications. In particular, we look at three interrelated types of structure (segmentation, prominence and syntax), methods for automatic detection, the benefit of optimizing rich transcription for the target language processing task, and the impact of this structural information in tasks such as parsing, topic detection, information extraction and translation. Presenter Mari Ostendorf received the Ph.D. in electrical engineering from Stanford University. After working at BBN Laboratories and Boston University, she joined the University of Washington (UW) in 1999. She has also been a visiting researcher at the ATR Interpreting Telecommunications Laboratory and at the University of Karlsruhe. At UW, she is currently an Endowed Professor of System Design Methodologies in Electrical Engineering and an Adjunct Professor in Computer Science and Engineering and in Linguistics. Currently, she is the Associate Dean for Research and Graduate Studies in the UW College of Engineering. She teaches undergraduate and graduate courses in signal processing and statistical learning, including a design-oriented freshman course that introduces students to signal processing and communications. Prof. Ostendorf's research interests are in dynamic and linguistically-motivated statistical models for speech and language processing. Her work has resulted in over 200 publications and 2 paper awards. Prof. Ostendorf has served as co-Editor of Computer Speech and Language, as the Editor-in-Chief of the IEEE Transactions on Audio, Speech and Language Processing, and is currently on the IEEE Signal Processing Society Board of Governors and the ISCA Advisory Council. She is a Fellow of IEEE and ISCA. 31 Interspeech 2009 Special Sessions The Interspeech 2009 Organisation Committee is pleased to announce acceptance of the following Special Sessions at Interspeech 2009. INTERSPEECH 2009 Emotion Challenge Mon-Ses2-S1: Monday 13:30 Place: Ainsworth (East Wing 4) The INTERSPEECH 2009 Emotion Challenge aims to help bridge the gap between the excellent research on human emotion recognition from speech and the low compatibility of results. The FAU Aibo Emotion Corpus of spontaneous, emotionally coloured speech, and benchmark results of the two most popular approaches will be provided by the organisers. This consists of nine hours of speech from 51 children, recorded at two different schools. This corpus allows for distinct definition of test and training partitions incorporating speaker independence as needed in most real-life settings. The corpus further provides a uniquely detailed transcription of spoken content with word boundaries, non-linguistic vocalisations, emotion labels, units of analysis, etc. The results of the Challenge will be presented at the Special Session and prizes will be awarded to the sub-challenge winners and a best paper. Organisers: Bjoern Schuller ([email protected]), Technische Universitaet Muenchen, Germany, Stefan Steidl ([email protected]), FAU Erlangen-Nuremberg, Germany, Anton Batliner ([email protected]), FAU Erlangen-Nuremberg, Germany. 
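As a small illustration of the speaker-independent partitioning mentioned above, the sketch below groups utterances by speaker and holds out one recording site for testing, so that no speaker contributes to both partitions. The identifiers, labels and two-site grouping are invented placeholders, not the official Challenge protocol.

# Minimal speaker-independent train/test split: no speaker appears in both sets.
# The records below are invented placeholders, not Challenge data.
from collections import defaultdict

utterances = [
    # (utterance_id, speaker_id, emotion_label)
    ("u001", "schoolA_spk01", "anger"),
    ("u002", "schoolA_spk01", "neutral"),
    ("u003", "schoolA_spk02", "emphatic"),
    ("u004", "schoolB_spk01", "neutral"),
    ("u005", "schoolB_spk02", "anger"),
]

by_speaker = defaultdict(list)
for utt_id, speaker, label in utterances:
    by_speaker[speaker].append((utt_id, label))

# Hold out every speaker from one recording site for testing.
test_speakers = {s for s in by_speaker if s.startswith("schoolB")}
train_speakers = set(by_speaker) - test_speakers
assert train_speakers.isdisjoint(test_speakers)  # speaker independence holds

train = [(u, l) for s in sorted(train_speakers) for (u, l) in by_speaker[s]]
test = [(u, l) for s in sorted(test_speakers) for (u, l) in by_speaker[s]]
print("train:", train)
print("test:", test)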
Silent Speech Interfaces Mon-Ses3-S1: Monday 16:00 Place: Ainsworth (East Wing 4) A Silent Speech Interface (SSI) is an electronic system enabling speech communication to take place without the necessity of emitting an audible acoustic signal. By acquiring sensor data from elements of the human speech production process – from the articulators, their neural pathways, or the brain itself – an SSI produces a digital representation of speech which can be synthesized directly, interpreted as data, or routed into a communications network. Due to this novel approach, Silent Speech Interfaces have the potential to overcome the major limitations of traditional speech interfaces today, i.e. (a) limited robustness in the presence of ambient noise; (b) lack of secure transmission of private and confidential information; and (c) disturbance of bystanders created by audibly spoken speech in quiet environments; while at the same time retaining speech as the most natural human communication modality. The special session intends to bring together researchers in the areas of human articulation, speech and language technologies, data acquisition and signal processing, as well as in human interface design, software engineering and systems integration. Its goal is to promote the exchange of ideas on current SSI challenges and to discuss solutions, by highlighting, for each of the technological approaches presented, its range of applications, key advantages, potential drawbacks, and current state of development. Organisers: Bruce Denby ([email protected]), Université Pierre et Marie Curie, France, Tanja Schultz ([email protected]), Cognitive Systems Lab, University of Karlsruhe, Germany. Advanced Voice Function Assessment Tue-Ses1-S1: Tuesday 10:00 Place: Ainsworth (East Wing 4) In order to advance the field of voice function assessment in a clinical setting, cooperation between clinicians and technologists is essential. The aim of this special session is to showcase work that crosses the borders between basic, applied and clinical research and highlights the development of partnership between technologists and healthcare professionals in advancing the protocols and technologies for the assessment of voice function. Organisers: Anna Barney ([email protected]), Institute of Sound and Vibration Research, UK, Mette Pedersen ([email protected]), Medical Centre, Voice Unit, Denmark. Measuring the Rhythm of Speech Tue-Ses3-S2: Tuesday 16:00 Place: Ainsworth (East Wing 4) There has been considerable interest in the last decade in the modelling of rhythm, both from a typological perspective (e.g. establishing objective criteria for classifying languages or dialects as stress-timed, syllable-timed or mora-timed) and from the perspective of establishing evaluation metrics for non-standard or deviant varieties of speech, such as that obtained from non-native speakers, from speakers with speech pathologies or from automatic speech synthesis. The aim of this special session will be to bring together a number of researchers who have contributed to this debate and to assess and discuss the current status of our understanding of the relative value of different metrics for different tasks. Organiser: Daniel Hirst ([email protected]), Laboratoire Parole et Langage, Université de Provence, France. Lessons and Challenges Deploying Voice Search Wed-Ses1-S1: Wednesday 10:00 Place: Ainsworth (East Wing 4) In the past year, a number of companies have deployed multimodal search applications for mobile phones.
These applications enable spoken input for search, as an alternative to typing. There are many technical challenges associated with deploying such applications, including: high perplexity (a language model for general search must accommodate a very large vocabulary and a tremendous range of possible inputs); challenging acoustic environments (mobile phones are often used when "on the go", which can often mean noisy environments); and challenging usage scenarios (mobile search may be used in situations such as information access while driving a car). This session will focus on early lessons learned from usage data, the challenges posed, and technical and design solutions to these challenges, as well as a look towards the future. Organisers: Mike Cohen ([email protected]), Google, Johan Schalkwyk ([email protected]), Google, Mike Phillips ([email protected]), Vlingo. Active Listening & Synchrony Wed-Ses2-S1: Wednesday 13:30 Place: Ainsworth (East Wing 4) Traditional approaches to Multimodal Interface design have tended to assume a "ping-pong" or "push-to-talk" approach to speech interaction wherein either the system or the interlocuting human is active at any one time. This is contrary to many recent findings in conversation and discourse analysis, where the definition of a "turn", or even an "utterance", is found to be very complex; people don't "take turns" to talk in a typical conversational interaction, but each contributes actively and interactively to the joint emergence of a "common understanding". The aim of this special session, marking the 70th anniversary of synchrony research, is to bring together researchers from the various fields who have a special interest in novel techniques aimed at overcoming weaknesses of the "push-to-talk" approach in interface technology, or who have knowledge of the history of this field from which the research community could benefit. Organisers: Nick Campbell ([email protected]), Trinity College Dublin, Ireland, Anton Nijholt ([email protected]), University of Twente, The Netherlands, Joakim Gustafson ([email protected]), KTH, Sweden, Carl Vogel ([email protected]), Trinity College Dublin, Ireland. Machine Learning for Adaptivity in Spoken Dialogue Systems Wed-Ses3-S1: Wednesday 16:00 Place: Ainsworth (East Wing 4) In the past decade, research in the field of Spoken Dialogue Systems (SDS) has experienced increasing growth, and new applications include interactive mobile search, tutoring, and troubleshooting systems. The design and optimization of robust SDS for such tasks requires the development of dialogue strategies which can automatically adapt to different types of users and noise conditions. New statistical learning techniques are emerging for training and optimizing adaptive speech recognition, spoken language understanding, dialogue management, natural language generation, and speech synthesis in spoken dialogue systems. Among machine learning techniques for spoken dialogue strategy optimization, reinforcement learning using Markov Decision Processes (MDPs) and Partially Observable MDPs (POMDPs) has become a particular focus. The purpose of this special session is to provide an opportunity for the international research community to share ideas on these topics and to have constructive discussions in a single, focussed, special conference session. Organisers: Oliver Lemon ([email protected]), Edinburgh University, UK, Olivier Pietquin ([email protected]), Supélec - IMS Research Group, France.
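As a minimal illustration of the reinforcement-learning formulation mentioned above, the sketch below casts a toy slot-filling dialogue as an MDP whose state is the number of filled slots and learns a policy with tabular Q-learning. The action set, noise level and rewards are invented for illustration and do not correspond to any particular system.

# Toy dialogue MDP trained with tabular Q-learning. All numbers are placeholders.
import random

ACTIONS = ["ask_slot", "repeat_greeting"]
GOAL = 2          # the dialogue succeeds once two slots are filled
ASR_ERROR = 0.3   # probability that an "ask" turn is misrecognised and fails

def step(state, action):
    """Simulate one user turn; return (next_state, reward, done)."""
    if action == "ask_slot" and random.random() > ASR_ERROR:
        state += 1                      # slot successfully filled
    if state == GOAL:
        return state, 20.0, True        # task success
    return state, -1.0, False           # per-turn cost encourages short dialogues

q = {(s, a): 0.0 for s in range(GOAL + 1) for a in ACTIONS}
alpha, gamma, epsilon = 0.1, 0.95, 0.1

for episode in range(5000):
    state = 0
    for _ in range(20):                 # cap on dialogue length
        if random.random() < epsilon:   # epsilon-greedy exploration
            action = random.choice(ACTIONS)
        else:
            action = max(ACTIONS, key=lambda a: q[(state, a)])
        nxt, reward, done = step(state, action)
        best_next = max(q[(nxt, a)] for a in ACTIONS)
        q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])
        state = nxt
        if done:
            break

for s in range(GOAL):
    print(f"{s} slot(s) filled -> learned action: {max(ACTIONS, key=lambda a: q[(s, a)])}")

In a realistic system the state would be a far richer representation of the dialogue context, and under uncertainty about what the user actually said it becomes a belief state, which is where the POMDP formulation discussed in this session comes in.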
New Approaches to Modeling Variability for Automatic Speech Recognition Thu-Ses1-S1: Thursday 10:00 Place: Ainsworth (East Wing 4) Despite great strides in the development of automatic speech recognition (ASR) technology, our community is still far from achieving its holy grail: an ASR system with performance comparable to humans in automatically transcribing unrestricted conversational speech, spoken by many speakers and in adverse acoustic environments. Many of the difficulties faced by ASR models are due to the high degree of variation in the acoustic waveforms associated with a given phonetic unit measured across different segmental and prosodic contexts. Such variation has both deterministic origins (intersegmental coarticulation; prosodic juncture and accent) and stochastic origins (token-to-token variability for utterances with the same segmental and prosodic structure). Current ASR systems successfully model acoustic variation that is due to adjacent phone context, but variation due to other sources, including prosodic context, speech rate, and speaker, is not adequately treated. The goal of this special session is to bring together researchers who are exploring alternative approaches to state-of-the-art ASR methodologies. Of special interest are new approaches that model variation in the speech signal at multiple levels, from both linguistic and extra-linguistic sources. In particular, we encourage the participation of those who are attempting to incorporate the insights that the field has gained over the past several decades from acoustic phonetics, speech production, speech perception, prosody, lexical access, natural language processing and pattern recognition to the problem of developing models of speech recognition that are robust to the full variability of speech. Organisers: Carol Espy-Wilson ([email protected]), University of Maryland , Jennifer Cole ([email protected]), Illinois, Abeer Alwan, UCLA, Louis Goldstein, University of Southern California, Mary Harper, University of Maryland, Elliot Saltzman, Haskins Laboratories, Mark Hasegawa-Johnson, Illinois. 34 Interspeech 2009 Tutorials - Sunday 6 September 2009 T-1: Analysis by synthesis of speech prosody, from data to models Sunday 9:15 Place: Jones (East Wing 1) The study of speech prosody today has become a research area which has attracted interest from researchers in a great number of different related fields including academic linguistics and phonetics, conversation analysis, semantics and pragmatics, sociolinguistics, acoustics, speech synthesis and recognition, cognitive psychology, neuroscience, speech therapy, language teaching... and no doubt many more. So much so, that it is particularly difficult for any one person to keep up to date on research in all relevant areas. This is particularly true for new researchers coming into the field. This tutorial will propose an overview of a variety of current ideas on the methodology and tools for the automatic and semi-automatic analysis and synthesis of speech prosody, consisting in particular of lexical prosody, rhythm, accentuation and intonation. The tools presented will include but not be restricted to those developed by the presenter himself. The emphasis will be on the importance of data analysis for the testing of linguistic models and the relevance of these models to the analysis itself. 
The target audience will be researchers who are aware of the importance of the analysis and synthesis of prosody for their own research interests and who wish to update their knowledge of background and current work in the field. Presenter: Daniel Hirst ([email protected]), Laboratoire Parole et Langage, Université de Provence, France. T-2: Dealing with High Dimensional Data with Dimensionality Reduction Sunday 9:15 Place: Fallside (East Wing 2) Dimensionality reduction is a standard component of the toolkit in any area of data modelling. Over the last decade algorithmic development in the area of dimensionality reduction has been rapid. Approaches such as Isomap, LLE, and maximum variance unfolding have extended the methodologies available to the practitioner. More recently, probabilistic dimensionality reduction techniques have been used with great success in modelling of human motion. How are all these approaches related? What are they useful for? In this tutorial our aim is to develop an understanding of high dimensional data and what the problems are with dealing with it. We will motivate the use of nonlinear dimensionality reduction as a solution for these problems. The keystone to unify the various approaches to non-linear dimensionality reduction is principal component analysis. We will show how it underpins spectral methods and attempt to cast spectral approaches within the same unifying framework. We will further build on principal component analysis to introduce probabilistic approaches to non-linear dimensionality reduction. These approaches have become increasingly popular in graphics and vision through the Gaussian Process Latent Variable Model. We will review the GP-LVM and also consider earlier approaches such as the Generative Topographic Mapping and Latent Density Networks. Presenter: Neil Lawrence ([email protected]), School of Computer Science, Univ. of Manchester, UK ; Jon Barker ([email protected]), Department of Computer Science, Univ. of Sheffield, UK T-3: Language and Dialect Recognition Sunday 9:15 Place: Holmes (East Wing 3) Spoken language recognition (a.k.a Language ID or LID) is a task of recognizing the language from a sample spoken by an unknown speaker. Language ID finds applications in multi-lingual dialog systems, distillation, diarization and indexing systems, speaker detection and speech recognition. Often, LID represents one of the first and necessary processing steps in many 35 speech processing systems. Furthermore, language, dialect, and accent are of interest in diarization, indexing/search, and may play an important auxiliary role in identifying speakers. LID has seen almost four decades of active research. Benefitting from the development of public multi-lingual corpora in the 90’s, the progress in LID technology has accelerated in the 00’s tremendously. While the availability of large corpora served as an enabling medium, establishing a series of NIST-administered Language Recognition Evaluations (LRE) provided the research community with a common ground of comparison and proved to be a strong catalyst. In another positive way, the LRE series gave rise to a "cross-pollination effect" by effectively fusing the speaker and language recognition communities thus sharing and spreading their respective methods and techniques. 
In the past five years or so, considerable success was achieved by focusing on and developing techniques to deal with channel and session variability, to improve acoustic language modeling by means of discriminative methods, and to further refine the basic phonotactic approaches. The goal of this tutorial is to survey the LID area from a historical perspective as well as in its most modern state. Several important milestones contributing to the growth of the LID area will be identified. In a second, larger part, the most successful state-of-the-art probabilistic approaches and modeling techniques will be described in more detail. These include various phonotactic architectures, UBM-GMMs, discriminative techniques, and subspace modeling tricks. The closely related problem of detecting dialects will be discussed in the final part. Presenter: Jiri Navratil ([email protected]), Multilingual Analytics and User Technologies, IBM T.J. Watson Research Center, USA. T-4: Emerging Technologies for Silent Speech Interfaces Sunday 9:15 Place: Ainsworth (East Wing 4) In the past decade, the performance of automatic speech processing systems, including speech recognition, text and speech translation, and speech synthesis, has improved dramatically. This has resulted in an increasingly widespread use of speech and language technologies in a wide variety of applications, such as commercial information retrieval systems, call centre services, voice-operated cell phones or car navigation systems, personal dictation and translation assistance, as well as applications in military and security domains. However, speech-driven interfaces based on conventional acoustic speech signals still suffer from several limitations. Firstly, the acoustic signals are transmitted through the air and are thus prone to ambient noise. Despite tremendous efforts, robust speech processing systems which perform reliably in crowded restaurants, airports, or other public places are still not in sight. Secondly, conventional interfaces rely on audibly uttered speech, which has two major drawbacks: it jeopardizes confidential communications in public and it disturbs any bystanders. Services which require the access, retrieval, and transmission of private or confidential information, such as PINs, passwords, and security or safety information, are particularly vulnerable. Recently, Silent Speech Interfaces have been proposed which allow their users to communicate by speaking silently, i.e. without producing any sound. This is realized by capturing the speech signal at an early stage of human articulation, namely before the signal becomes airborne, and then transferring these articulation-related signals for further processing and interpretation. Due to this novel approach, Silent Speech Interfaces have the potential to overcome the major limitations of traditional speech interfaces today, i.e. (a) limited robustness in the presence of ambient noise; (b) lack of secure transmission of private and confidential information; and (c) disturbance of bystanders created by audibly spoken speech in quiet environments; while at the same time retaining speech as the most natural human communication modality. SSIs could furthermore provide an alternative for persons with speech disabilities, such as laryngectomy patients, as well as the elderly or weak who may not be healthy or strong enough to speak aloud effectively.
Presenters: Tanja Schultz ([email protected]), Computer Science Department, Karlsruhe University, Germany; Bruce Denby ([email protected]), Université Pierre et Marie Curie (Paris-VI). 36 T-5: In-Vehicle Speech Processing & Analysis Sunday 14:15 Place: Jones (East Wing 1) In this tutorial, we will focus on speech technology for in-vehicle use by discussing the cutting-edge developments in these two applications: 1. Speech as interface: Robust speech recognition system development under vehicle-noise conditions (i.e. engine, open windows, A/C operation). This field of study includes the application of microphone arrays for in-vehicle use to reduce the effect of the noise on speech recognition, employing beam-forming algorithms. The resultant system can be employed as a driver-vehicle interface for entering prompts and commands for music search and control of in-vehicle systems such as the cell-phone, A/C, windows etc., instead of manual operation, which also engages the driver visually. 2. Speech as monitoring system: Speech can be used to design a sub-module for driver-monitoring systems. For the last two decades, studies of speech under stress have contributed to improving the performance of ASR systems. Detecting stress in speech can also help improve the performance of driver-monitoring systems, which conventionally rely on computer vision techniques for driver head and eye tracking. On the other hand, the effects of introducing speech technologies as an interface can be assessed via driver behaviour modeling studies. Presenters: John H.L. Hansen ([email protected]), Pinar Boyraz ([email protected]), Erik Jonsson School of Engineering and Computer Science, University of Texas at Dallas, USA. T-6: Emotion Recognition in the Next Generation: an Overview and Recent Development Sunday 14:15 Place: Fallside (East Wing 2) Emotional aspects have recently attracted considerable attention as being the "next big thing" for dialog systems and robotic products' market success, and for practically any intelligent Human-Machine Interface. Having matured over the last decade of research, recognition technology is now becoming ready for use in such systems, and in many further applications such as Multimedia Retrieval and Surveillance. At the same time, systems have become considerably more complex: in addition to a variety of definitions and theoretical approaches, today's engines demand subject independence, coping with spontaneous and non-prototypical emotions, robustness against noise and transmission effects, and optimal system integration. In this respect, this tutorial will present an introduction to the recognition of emotion with a particular focus on recent developments in audio-based analysis. A general introduction for researchers working in related fields will be followed by current issues and impulses for acoustic, linguistic, and multi-stream and multi-modal analyses. A summary of the main recognition techniques will be presented, as well as an overview of current challenges, datasets, studies and performances in view of optimal future application design. Also, the first open-source Emotion Recognition Engine "openSMILE", developed in the European Union's Seventh Framework Programme Project SEMAINE, will be introduced to the participants so that they can directly experiment with emotion recognition from speech or test the latest technology on their own datasets. Presenter: Björn Schuller ([email protected]), Munich University of Technology, Germany.
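As a rough illustration of the audio-based analysis discussed above, the sketch below turns a variable-length signal into a fixed-length vector of statistical functionals computed over frame-level descriptors (short-time log-energy and zero-crossing rate), which could then feed any standard classifier. It is a minimal stand-in for illustration, not openSMILE or any SEMAINE component, and the frame sizes and chosen functionals are arbitrary.

# Utterance-level functionals over frame-level descriptors. Illustrative only.
import numpy as np

def frame_signal(x, frame_len=400, hop=160):
    """Split a 1-D signal into overlapping frames (e.g. 25 ms / 10 ms at 16 kHz)."""
    n = 1 + max(0, (len(x) - frame_len) // hop)
    return np.stack([x[i * hop: i * hop + frame_len] for i in range(n)])

def functionals(x):
    frames = frame_signal(x)
    log_energy = np.log(np.sum(frames ** 2, axis=1) + 1e-10)                 # frame energy
    zcr = np.mean(np.abs(np.diff(np.sign(frames), axis=1)) > 0, axis=1)      # zero-crossing rate
    feats = []
    for contour in (log_energy, zcr):
        feats += [contour.mean(), contour.std(), contour.min(), contour.max(),
                  contour.max() - contour.min()]
    return np.array(feats)   # fixed-length vector regardless of utterance duration

if __name__ == "__main__":
    sr = 16000
    t = np.arange(sr) / sr
    utterance = 0.1 * np.sin(2 * np.pi * 220 * t) * (1 + 0.5 * np.sin(2 * np.pi * 3 * t))
    print(functionals(utterance))   # 10-dimensional feature vector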
T-7: Fundamentals and recent advances in HMM-based speech synthesis Sunday 14:15 Place: Holmes (East Wing 3) Over the last ten years, the quality of speech synthesis has dramatically improved with the rise of general corpus-based speech synthesis. In particular, state-of-the-art unit-selection speech synthesis can generate natural-sounding, high-quality speech. However, for constructing human-like talking machines, speech synthesis systems are required to have the ability to generate speech with an arbitrary speaker's voice characteristics, various speaking styles including native and non-native speaking styles in different languages, varying emphasis and focus, and/or emotional expressions; it is still difficult to achieve such flexibility with unit-selection synthesizers, since they need a large-scale speech corpus for each voice. In recent years, a kind of statistical parametric speech synthesis based on hidden Markov models (HMMs) has been developed. The system has the following features: 1. The original speaker's voice characteristics can easily be reproduced, because all speech features, including spectral, excitation, and duration parameters, are modeled in a unified HMM framework and then generated from the trained HMMs themselves. 2. Using a very small amount of adaptation speech data, voice characteristics can easily be modified by transforming the HMM parameters with a speaker adaptation technique used in speech recognition systems. Given these features, the HMM-based speech synthesis approach is expected to be useful for constructing speech synthesizers which can give us the flexibility we have in human voices. In this tutorial, the system architecture is outlined, and then the basic techniques used in the system, including algorithms for speech parameter generation from HMMs, are described with simple examples. The relation to the unit-selection approach, trajectory modeling, recent improvements, and evaluation methodologies are summarized. Techniques developed for increasing the flexibility and improving the speech quality are also reviewed. Presenters: Keiichi Tokuda ([email protected]), Department of Computer Science and Engineering, Nagoya Institute of Technology; Heiga Zen ([email protected]), Toshiba Research Europe Ltd. T-8: Statistical approaches to dialogue systems Sunday 14:15 Place: Ainsworth (East Wing 4) The objective of this tutorial is to provide a comprehensive, cohesive overview of statistical techniques in dialog management for the newcomer. Specifically, we will start by motivating the research area, showing how traditional techniques fail and, intuitively, why statistical techniques would be expected to do better. Then, in a classroom-style presentation, we will explain the core algorithms and how they have been applied to spoken dialogue systems. Our intention is to provide a cohesive treatment of the techniques using a unified, common notation in order to give the audience a clear understanding of how the techniques interrelate. Finally, we will report results from the literature to provide an indication of the impact in practice. Throughout the tutorial we will draw on both our own work and the literature (with citations throughout), and wherever possible we will use audio/video recordings of interactions to illustrate operation. We will provide lecture notes and a comprehensive bibliography. Our aim is that attendees of this course should be able to readily read papers in this area, comment on them meaningfully, and (we hope!)
suggest avenues for future work in this area rich in open challenges and begin research enquiries of their own. Presenters: Jason Williams ([email protected]), AT&T Labs – Research, USA; Steve Young ([email protected]), Blaise Thomson ([email protected]), Information Engineering Division, Cambridge University, UK. 38 Public Engagement Events The first Interspeech conversational systems challenge Sunday 14:15 Place: Rainbow Room The first Interspeech conversational systems challenge is based around the original Loebner competition, but due to the unique challenges of speech we have changed things slightly. We have devised a scenario that presents an urgent and direct task full of 'full-blown' emotion. As a result, competitors' systems will have to convey urgency and emotion through speech, while any speech recognition system will have to function successfully in a conversational context with little time for training. Each judge will be given the following briefing: "You're the captain of one of the fleet's finest starships. Suddenly your sensors detect a badly damaged ship heading straight for you, and the intercom crackles into life: there's lots of interference, but they're requesting to dock. The ship is about to crash into you. Do you push the button, blowing the artificial infiltrator out of the sky, or do you open the landing bay and guide the human refugee to safety? You have 3 minutes to decide." The artificial system that fools our judges for the longest period of time will be declared the winner. Competitors: Marc Schroeder and Jens Edlund; Organiser: Simon Worgan ([email protected]). Interspeech 2009 public exhibition Sunday 10:00 Place: Public Foyer On Sunday 6th September, a number of exhibitors will be demonstrating aspects of speech and language technology to the general public. Hosted in the public foyer, exhibits will include emotive talking heads, agents that attempt to elicit rapport from human speakers, and customized text-to-speech systems. 39 Session Index
Mon-Ses1-K Keynote 1 — Sadaoki Furui ..... 47
Tue-Ses0-K Keynote 2 — Tom Griffiths ..... 47
Wed-Ses0-K Keynote 3 — Deb Roy ..... 47
Thu-Ses0-K Keynote 4 — Mari Ostendorf ..... 47
Mon-Ses2-O1 ASR: Features for Noise Robustness ..... 47
Mon-Ses2-O2 Production: Articulatory Modelling ..... 48
Mon-Ses2-O3 Systems for LVCSR and Rich Transcription ..... 49
Mon-Ses2-O4 Speech Analysis and Processing I ..... 50
Mon-Ses2-P1 Speech Perception I ..... 51
Mon-Ses2-P2 Accent and Language Recognition ..... 53
Mon-Ses2-P3 ASR: Acoustic Model Training and Combination ..... 55
Mon-Ses2-P4 Spoken Dialogue Systems ..... 57
Mon-Ses2-S1 Special Session: INTERSPEECH 2009 Emotion Challenge ..... 60
Mon-Ses3-O1 Automatic Speech Recognition: Language Models I ..... 61
Mon-Ses3-O2 Phoneme-Level Perception ..... 62
Mon-Ses3-O3 Statistical Parametric Synthesis I ..... 63
Mon-Ses3-O4 Systems for Spoken Language Translation ..... 64
Mon-Ses3-P1 Human Speech Production I ..... 65
Mon-Ses3-P2 Prosody, Text Analysis, and Multilingual Models ..... 67
Mon-Ses3-P3 Automatic Speech Recognition: Adaptation I ..... 69
Mon-Ses3-P4 Applications in Learning and Other Areas ..... 71
Mon-Ses3-S1 Special Session: Silent Speech Interfaces ..... 73
Tue-Ses1-O1 ASR: Discriminative Training ..... 75
Tue-Ses1-O2 Language Acquisition ..... 76
Tue-Ses1-O3 ASR: Lexical and Prosodic Models ..... 77
Tue-Ses1-O4 Unit-Selection Synthesis ..... 78
Tue-Ses1-P1 Human Speech Production II ..... 79
Tue-Ses1-P2 Speech Perception II ..... 80
Tue-Ses1-P3 Speech and Audio Segmentation and Classification ..... 81
Tue-Ses1-P4 Speaker Recognition and Diarisation ..... 84
Tue-Ses1-S1 Special Session: Advanced Voice Function Assessment ..... 86
Tue-Ses2-O1 Automotive and Mobile Applications ..... 87
Tue-Ses2-O2 Prosody: Production I ..... 88
Tue-Ses2-O3 ASR: Spoken Language Understanding ..... 89
Tue-Ses2-O4 Speaker Diarisation ..... 90
Tue-Ses2-P1 Speech Analysis and Processing II ..... 91
Tue-Ses2-P2 Speech Processing with Audio or Audiovisual Input ..... 93
Tue-Ses2-P3 ASR: Decoding and Confidence Measures ..... 96
Tue-Ses2-P4 Robust Automatic Speech Recognition I ..... 98
Tue-Ses3-S1 Panel: Speech & Intelligence ..... 99
Tue-Ses3-O3 Speaker Verification & Identification I ..... 99
Tue-Ses3-O4 Text Processing for Spoken Language Generation ..... 101
Tue-Ses3-P1 Single- and Multichannel Speech Enhancement ..... 102
Tue-Ses3-P2 ASR: Acoustic Modelling ..... 104
Tue-Ses3-P3 Assistive Speech Technology ..... 106
Tue-Ses3-P4 Topics in Spoken Language Processing ..... 108
Tue-Ses3-S2 Special Session: Measuring the Rhythm of Speech ..... 110
Wed-Ses1-O1 Speaker Verification & Identification II ..... 111
Wed-Ses1-O2 Emotion and Expression I ..... 112
Wed-Ses1-O3 Automatic Speech Recognition: Adaptation II ..... 113
Wed-Ses1-O4 Voice Transformation I ..... 115
Wed-Ses1-P1 Phonetics, Phonology, Cross-Language Comparisons, Pathology ..... 116
Wed-Ses1-P2 Prosody Perception and Language Acquisition ..... 118
Wed-Ses1-P3 Statistical Parametric Synthesis II ..... 120
Wed-Ses1-P4 Resources, Annotation and Evaluation ..... 122
Wed-Ses1-S1 Special Session: Lessons and Challenges Deploying Voice Search ..... 124
Wed-Ses2-O1 Word-Level Perception ..... 125
Wed-Ses2-O2 Applications in Education and Learning ..... 126
Wed-Ses2-O3 ASR: New Paradigms I ..... 127
Wed-Ses2-O4 Single-Channel Speech Enhancement ..... 128
Wed-Ses2-P1 Emotion and Expression II ..... 129
Wed-Ses2-P2 Expression, Emotion and Personality Recognition ..... 131
Wed-Ses2-P3 Speech Synthesis Methods ..... 133
Wed-Ses2-P4 LVCSR Systems and Spoken Term Detection ..... 135
Wed-Ses2-S1 Special Session: Active Listening & Synchrony ..... 137
Wed-Ses3-O1 Language Recognition ..... 138
Wed-Ses3-O2 Phonetics & Phonology ..... 139
Wed-Ses3-O3 Speech Activity Detection ..... 140
Wed-Ses3-O4 Multimodal Speech (e.g. Audiovisual Speech, Gesture) ..... 141
Wed-Ses3-P1 Phonetics ..... 142
Wed-Ses3-P2 Speaker Verification & Identification III ..... 144
Wed-Ses3-P3 Robust Automatic Speech Recognition II ..... 146
Wed-Ses3-P4 Prosody: Production II ..... 147
Wed-Ses3-S1 Special Session: Machine Learning for Adaptivity in Spoken Dialogue Systems ..... 149
Thu-Ses1-O1 Robust Automatic Speech Recognition III ..... 150
Thu-Ses1-O2 Prosody: Perception ..... 152
Thu-Ses1-O3 Segmentation and Classification ..... 153
Thu-Ses1-O4 Evaluation & Standardisation of SL Technology and Systems ..... 154
Thu-Ses1-P1 Speech Coding ..... 155
Thu-Ses1-P2 Voice Transformation II ..... 156
Thu-Ses1-P3 Automatic Speech Recognition: Language Models II ..... 158
Thu-Ses1-P4 Systems for Spoken Language Understanding ..... 159
Thu-Ses1-S1 Special Session: New Approaches to Modeling Variability for Automatic Speech Recognition ..... 162
Thu-Ses2-O1 User Interactions in Spoken Dialog Systems ..... 163
Thu-Ses2-O2 Production: Articulation and Acoustics ..... 164
Thu-Ses2-O3 Features for Speech and Speaker Recognition ..... 165
Thu-Ses2-O4 Speech and Multimodal Resources & Annotation ..... 166
Thu-Ses2-O5 Speech Analysis and Processing III ..... 167
Thu-Ses2-P1 Speaker and Speech Variability, Paralinguistic and Nonlinguistic Cues ..... 168
Thu-Ses2-P2 ASR: Acoustic Model Features ..... 170
Thu-Ses2-P3 ASR: Tonal Language, Cross-Lingual and Multilingual ASR ..... 172
Thu-Ses2-P4 ASR: New Paradigms II ..... 174
Interspeech 2009 Programme by Day

DAY 0: Sunday September 6th (Tutorials)
08:30 Registration for tutorials opens (closes at 14:30)
09:00 ISCA Board Meeting 1 (finish at 17:00) - BCS Room 3
09:15 Morning tutorials: T-1: Analysis by Synthesis of Speech Prosody, from Data to Models (Jones, East Wing 1); T-2: Dealing with High Dimensional Data with Dimensionality Reduction (Fallside, East Wing 2); T-3: Language and Dialect Recognition (Holmes, East Wing 3); T-4: Emerging Technologies for Silent Speech Interfaces (Ainsworth, East Wing 4)
10:45 Coffee break
11:15 Morning tutorials resume; Loebner Competition (Rainbow Room)
12:45 Lunch
14:00 General registration opens (closes at 18:00)
14:15 Afternoon tutorials: T-5: In-Vehicle Speech Processing & Analysis (Jones, East Wing 1); T-6: Emotion Recognition in the Next Generation: an Overview and Recent Development (Fallside, East Wing 2); T-7: Fundamentals and Recent Advances in HMM-based Speech Synthesis (Holmes, East Wing 3); T-8: Statistical Approaches to Dialogue Systems (Ainsworth, East Wing 4); The first Interspeech conversational systems challenge (Rainbow Room)
15:45 Tea break
16:15 Afternoon tutorials resume
18:00 Elsevier Thank You Reception for Former Computer Speech and Language Editors (finish at 19:30) - BCS Room 1

DAY 1: Monday September 7th
Rooms: Main Hall, Jones (East Wing 1), Fallside (East Wing 2), Holmes (East Wing 3) - oral sessions; Ainsworth (East Wing 4) - special sessions; Hewison Hall - poster sessions
09:00 Arrival and Registration (closes at 17:00)
10:00 Opening Ceremony in Main Hall
11:00 Mon-Ses1-K, Plenary Session in Main Hall. Keynote Speaker: Sadaoki Furui, ISCA Medallist, Department of Computer Science, Tokyo Institute of Technology: Selected topics from 40 years of research on speech and speaker recognition
12:00 Lunch; IAC (Advisory Council) Meeting - BCS Room 3
13:30 Oral: Mon-Ses2-O1 ASR: Features for Noise Robustness; Mon-Ses2-O2 Production: Articulatory Modelling; Mon-Ses2-O3 Systems for LVCSR and Rich Transcription; Mon-Ses2-O4 Speech Analysis and Processing I. Posters: Mon-Ses2-P1 Speech Perception I; Mon-Ses2-P2 Accent and Language Recognition; Mon-Ses2-P3 ASR: Acoustic Model Training and Combination; Mon-Ses2-P4 Spoken Dialogue Systems. Special: Mon-Ses2-S1 Special Session: INTERSPEECH 2009 Emotion Challenge
15:30 Tea Break
16:00 Oral: Mon-Ses3-O1 ASR: Language Models I; Mon-Ses3-O2 Phoneme-Level Perception; Mon-Ses3-O3 Statistical Parametric Synthesis I; Mon-Ses3-O4 Systems for Spoken Language Translation. Posters: Mon-Ses3-P1 Human Speech Production I; Mon-Ses3-P2 Prosody, Text Analysis, and Multilingual Models; Mon-Ses3-P3 ASR: Adaptation I; Mon-Ses3-P4 Applications in Learning and Other Areas. Special: Mon-Ses3-S1 Special Session: Silent Speech Interfaces
19:30 Welcome Reception - Brighton Dome

Interspeech 2009 - Main Conference Session Codes
DAYS: Mon = Monday, Tue = Tuesday, Wed = Wednesday, Thu = Thursday
TIMES: Ses1 = 10:00 - 12:00, Ses2 = 13:30 - 15:30, Ses3 = 16:00 - 18:00
TYPE: O = Oral, P = Poster, S = Special, K = Keynote

DAY 2: Tuesday September 8th
Rooms: Main Hall, Jones (East Wing 1), Fallside (East Wing 2), Holmes (East Wing 3) - oral sessions; Ainsworth (East Wing 4) - special sessions; Hewison Hall - poster sessions
08:00 Registration (closes at 17:00)
08:30 Tue-Ses0-K, Plenary Session in Main Hall.
Keynote Speaker: Tom Griffiths, University of California, Berkeley, USA: Connecting human and machine learning via probabilistic models of cognition
09:30 Coffee Break
10:00 Oral: Tue-Ses1-O1 ASR: Discriminative Training; Tue-Ses1-O2 Language Acquisition; Tue-Ses1-O3 ASR: Lexical and Prosodic Models; Tue-Ses1-O4 Unit-Selection Synthesis. Posters: Tue-Ses1-P1 Human Speech Production II; Tue-Ses1-P2 Speech Perception II; Tue-Ses1-P3 Speech and Audio Segmentation and Classification; Tue-Ses1-P4 Speaker Recognition and Diarisation. Special: Tue-Ses1-S1 Special Session: Advanced Voice Function Assessment
12:00 Lunch; Elsevier Editorial Board Meeting for Computer Speech and Language - BCS Room 1; Special Interest Group Meeting - BCS Room 3
13:30 Standardising assessment for voice and speech pathology (finish at 14:30) - BCS Room 3
13:30 Oral: Tue-Ses2-O1 Automotive and Mobile Applications; Tue-Ses2-O2 Prosody: Production I; Tue-Ses2-O3 ASR: Spoken Language Understanding; Tue-Ses2-O4 Speaker Diarisation. Posters: Tue-Ses2-P1 Speech Analysis and Processing II; Tue-Ses2-P2 Speech Processing with Audio or Audiovisual Input; Tue-Ses2-P3 ASR: Decoding and Confidence Measures; Tue-Ses2-P4 Robust Automatic Speech Recognition I. ISCA Student Advisory Committee
15:30 Tea Break
16:00 Tue-Ses3-S1 Panel: Speech & Intelligence (Main Hall). Oral: Tue-Ses3-O3 Speaker Verification & Identification I; Tue-Ses3-O4 Text Processing for Spoken Language Generation. Posters: Tue-Ses3-P1 Single- and Multi-Channel Speech Enhancement; Tue-Ses3-P2 ASR: Acoustic Modelling; Tue-Ses3-P3 Assistive Speech Technology; Tue-Ses3-P4 Topics in Spoken Language Processing. Special: Tue-Ses3-S2 Special Session: Measuring the Rhythm of Speech
18:15 ISCA General Assembly - Main Hall
19:30 Reviewers' Reception - Brighton Pavilion; Student Reception - Al Duomo Restaurant

DAY 3: Wednesday September 9th
Rooms: Main Hall, Jones (East Wing 1), Fallside (East Wing 2), Holmes (East Wing 3) - oral sessions; Ainsworth (East Wing 4) - special sessions; Hewison Hall - poster sessions
08:00 Registration (closes at 17:00)
08:30 Wed-Ses0-K, Plenary Session in Main Hall. Keynote Speaker: Deb Roy, MIT Media Laboratory: New Horizons in the Study of Language Development
09:30 Coffee Break
10:00 Oral: Wed-Ses1-O1 Speaker Verification & Identification II; Wed-Ses1-O2 Emotion and Expression I; Wed-Ses1-O3 ASR: Adaptation II; Wed-Ses1-O4 Voice Transformation I. Posters: Wed-Ses1-P1 Phonetics, Phonology, Cross-Language Comparisons, Pathology; Wed-Ses1-P2 Prosody Perception and Language Acquisition; Wed-Ses1-P3 Statistical Parametric Synthesis II; Wed-Ses1-P4 Resources, Annotation and Evaluation. Special: Wed-Ses1-S1 Special Session: Lessons and Challenges Deploying Voice Search
12:00 Lunch; Interspeech Steering Committee - BCS Room 1; Elsevier Editorial Board Meeting for Speech Communication - BCS Room 3
13:30 Oral: Wed-Ses2-O1 Word-Level Perception; Wed-Ses2-O2 Applications in Education and Learning; Wed-Ses2-O3 ASR: New Paradigms I; Wed-Ses2-O4 Single-Channel Speech Enhancement. Posters: Wed-Ses2-P1 Emotion and Expression II; Wed-Ses2-P2 Expression, Emotion and Personality Recognition; Wed-Ses2-P3 Speech Synthesis Methods; Wed-Ses2-P4 LVCSR Systems and Spoken Term Detection. Special: Wed-Ses2-S1 Special Session: Active Listening & Synchrony
15:30 Tea Break
16:00 Oral: Wed-Ses3-O1 Language Recognition; Wed-Ses3-O2 Phonetics & Phonology; Wed-Ses3-O3 Speech Activity Detection; Wed-Ses3-O4 Multimodal Speech (e.g. Audiovisual Speech, Gesture). Posters: Wed-Ses3-P1 Phonetics; Wed-Ses3-P2 Speaker Verification & Identification III; Wed-Ses3-P3 Robust Automatic Speech Recognition II; Wed-Ses3-P4 Prosody: Production II. Special: Wed-Ses3-S1 Special Session: Machine Learning for Adaptivity in Spoken Dialogue Systems
19:30 Revelry at the Racecourse

DAY 4: Thursday September 10th
Rooms: Main Hall, Jones (East Wing 1), Fallside (East Wing 2), Holmes (East Wing 3) - oral sessions; Ainsworth (East Wing 4) - special sessions; Hewison Hall - poster sessions
08:00 Registration (closes at 17:00)
08:30 Thu-Ses0-K, Plenary Session in Main Hall. Keynote Speaker: Mari Ostendorf, University of Washington: Transcribing Speech for Spoken Language Processing
09:30 Coffee Break
10:00 Oral: Thu-Ses1-O1 Robust Automatic Speech Recognition III; Thu-Ses1-O2 Prosody: Perception; Thu-Ses1-O3 Segmentation and Classification; Thu-Ses1-O4 Evaluation & Standardisation of SL Technology and Systems. Posters: Thu-Ses1-P1 Speech Coding; Thu-Ses1-P2 Voice Transformation II; Thu-Ses1-P3 ASR: Language Models II; Thu-Ses1-P4 Systems for Spoken Language Understanding. Special: Thu-Ses1-S1 Special Session: New Approaches to Modeling Variability for ASR
12:00 Lunch; Industrial Lunch - BCS Room 1
13:30 Oral: Thu-Ses2-O1 User Interactions in Spoken Dialog Systems; Thu-Ses2-O2 Production: Articulation and Acoustics; Thu-Ses2-O3 Features for Speech and Speaker Recognition; Thu-Ses2-O4 Speech and Multimodal Resources & Annotation; Thu-Ses2-O5 Speech Analysis and Processing III. Posters: Thu-Ses2-P1 Speaker and Speech Variability, Paralinguistic and Nonlinguistic Cues; Thu-Ses2-P2 ASR: Acoustic Model Features; Thu-Ses2-P3 ASR: Tonal Language, Cross-Lingual and Multilingual ASR; Thu-Ses2-P4 ASR: New Paradigms II
15:30 Tea Break
16:00 Closing Ceremony - Main Hall

Abstracts

Mon-Ses1-K : Keynote 1 — Sadaoki Furui
Main Hall, 11:00, Monday 7 Sept 2009
Chair: Isabel Trancoso, INESC-ID Lisboa/IST, Portugal

Selected Topics from 40 Years of Research on Speech and Speaker Recognition
Sadaoki Furui; Tokyo Institute of Technology, Japan
Mon-Ses1-K-1, Time: 11:00
This paper summarizes my 40 years of research on speech and speaker recognition, focusing on selected topics that I have investigated at NTT Laboratories, Bell Laboratories and Tokyo Institute of Technology with my colleagues and students. These topics include: the importance of spectral dynamics in speech perception; speaker recognition methods using statistical features, cepstral features, and HMM/GMM; text-prompted speaker recognition; speech recognition using dynamic features; Japanese LVCSR; robust speech recognition; spontaneous speech corpus construction and analysis; spontaneous speech recognition; automatic speech summarization; and WFST-based decoder development and its applications.
Tue-Ses0-K : Keynote 2 — Tom Griffiths
Main Hall, 08:30, Tuesday 8 Sept 2009
Chair: Steve Renals, University of Edinburgh, UK

Connecting Human and Machine Learning via Probabilistic Models of Cognition
Thomas L. Griffiths; University of California at Berkeley, USA
Tue-Ses0-K-1, Time: 08:30
Human performance defines the standard that machine learning systems aspire to in many areas, including learning language. This suggests that studying human cognition may be a good way to develop better learning algorithms, as well as providing basic insights into how the human mind works. However, in order for ideas to flow easily from cognitive science to computer science and vice versa, we need a common framework for describing human and machine learning. I will summarize recent work exploring the hypothesis that probabilistic models of cognition, which view learning as a form of statistical inference, provide such a framework, including results that illustrate how novel ideas from statistics can inform cognitive science. Specifically, I will talk about how probabilistic models can be used to identify the assumptions of learners, learn at different levels of abstraction, and link the inductive biases of individuals to cultural universals.

Wed-Ses0-K : Keynote 3 — Deb Roy
Main Hall, 08:30, Wednesday 9 Sept 2009
Chair: Roger K. Moore, University of Sheffield, UK

New Horizons in the Study of Child Language Acquisition
Deb Roy; MIT, USA
Wed-Ses0-K-1, Time: 08:30
Naturalistic longitudinal recordings of child development promise to reveal fresh perspectives on fundamental questions of language acquisition. In a pilot effort, we have recorded 230,000 hours of audio-video recordings spanning the first three years of one child’s life at home. To study a corpus of this scale and richness, current methods of developmental cognitive science are inadequate. We are developing new methods for data analysis and interpretation that combine pattern recognition algorithms with interactive user interfaces and data visualization. Preliminary speech analysis reveals surprising levels of linguistic fine-tuning by caregivers that may provide crucial support for word learning. Ongoing analyses of the corpus aim to model detailed aspects of the child’s language development as a function of learning mechanisms combined with lifetime experience. Plans to collect similar corpora from more children based on a transportable recording system are underway.

Thu-Ses0-K : Keynote 4 — Mari Ostendorf
Main Hall, 08:30, Thursday 10 Sept 2009
Chair: Martin Russell, University of Birmingham, UK

Transcribing Human-Directed Speech for Spoken Language Processing
Mari Ostendorf; University of Washington, USA
Thu-Ses0-K-1, Time: 08:30
As storage costs drop and bandwidth increases, there has been a rapid growth of spoken information available via the web or in online archives, raising problems of document retrieval, information extraction, summarization and translation for spoken language. While there is a long tradition of research in these technologies for text, new challenges arise when moving from written to spoken language. In this talk, we look at differences between speech and text, and how we can leverage the information in the speech signal beyond the words to provide structural information in a rich, automatically generated transcript that better serves language processing applications. In particular, we look at three interrelated types of structure (orthographic, prosodic, and syntactic), methods for automatic detection, the benefit of optimizing rich transcription for the target language processing task, and the impact of this structural information in tasks such as information extraction, translation, and summarization.

Mon-Ses2-O1 : ASR: Features for Noise Robustness
Main Hall, 13:30, Monday 7 Sept 2009
Chair: Hynek Hermansky, Johns Hopkins University, USA

Feature Extraction for Robust Speech Recognition Using a Power-Law Nonlinearity and Power-Bias Subtraction
Chanwoo Kim, Richard M. Stern; Carnegie Mellon University, USA
Mon-Ses2-O1-1, Time: 13:30
This paper presents a new feature extraction algorithm called Power-Normalized Cepstral Coefficients (PNCC) that is based on auditory processing.
Mon-Ses2-O1 : ASR: Features for Noise Robustness
Main Hall, 13:30, Monday 7 Sept 2009
Chair: Hynek Hermansky, Johns Hopkins University, USA

Feature Extraction for Robust Speech Recognition Using a Power-Law Nonlinearity and Power-Bias Subtraction
Chanwoo Kim, Richard M. Stern; Carnegie Mellon University, USA
Mon-Ses2-O1-1, Time: 13:30
This paper presents a new feature extraction algorithm called Power-Normalized Cepstral Coefficients (PNCC) that is based on auditory processing. Major new features of PNCC processing include the use of a power-law nonlinearity that replaces the traditional log nonlinearity used for MFCC coefficients, and a novel algorithm that suppresses background excitation by estimating SNR based on the ratio of the arithmetic to geometric mean power, and subtracts the inferred background power. Experimental results demonstrate that the PNCC processing provides substantial improvements in recognition accuracy compared to MFCC and PLP processing for various types of additive noise. The computational cost of PNCC is only slightly greater than that of conventional MFCC processing.
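The two ingredients singled out in this abstract, a power-law nonlinearity in place of the log and a background bias driven by the arithmetic-to-geometric mean power ratio, can be sketched in a few lines of NumPy. The exponent, the bias rule and the stand-in filterbank data below are illustrative assumptions for the example, not the authors' exact recipe.

```python
import numpy as np

def am_gm_ratio(power):
    """Arithmetic-to-geometric mean ratio of per-frame power (>= 1).

    Large values suggest a high-variance, speech-like channel; values
    near 1 suggest stationary background excitation.
    """
    eps = 1e-12
    am = np.mean(power)
    gm = np.exp(np.mean(np.log(power + eps)))
    return am / (gm + eps)

def toy_power_law_features(filterbank_power, exponent=1.0 / 15.0, bias_fraction=0.1):
    """Toy PNCC-style processing of mel filterbank power (frames x channels).

    1) subtract a crude per-channel background-power bias (here simply a
       fraction of the channel's geometric mean; the paper infers the bias
       from the AM/GM statistic instead), then
    2) apply a power-law nonlinearity in place of the usual log.
    """
    eps = 1e-12
    gm = np.exp(np.mean(np.log(filterbank_power + eps), axis=0))
    debiased = np.maximum(filterbank_power - bias_fraction * gm, eps)
    return debiased ** exponent

# Usage with random stand-in "filterbank" data (200 frames, 40 channels):
P = np.abs(np.random.randn(200, 40)) ** 2
print(am_gm_ratio(P[:, 0]), toy_power_law_features(P).shape)
```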
Towards Fusion of Feature Extraction and Acoustic Model Training: A Top Down Process for Robust Speech Recognition
Yu-Hsiang Bosco Chiu, Bhiksha Raj, Richard M. Stern; Carnegie Mellon University, USA
Mon-Ses2-O1-2, Time: 13:50
This paper presents a strategy to learn physiologically-motivated components in a feature computation module discriminatively, directly from data, in a manner that is inspired by the presence of efferent processes in the human auditory system. In our model a set of logistic functions, which represent the rate-level nonlinearities found in most mammalian hearing systems, are put in as part of the feature extraction process. The parameters of these rate-level functions are estimated to maximize the a posteriori probability of the correct class in the training data. The estimated feature computation is observed to be robust against environmental noise. Experiments conducted with the CMU Sphinx-III on the DARPA Resource Management task show that the discriminatively estimated rate-nonlinearity results in better performance in the presence of background noise than traditional procedures which separate feature extraction and model training into two distinct parts without feedback from the latter to the former.

Temporal Modulation Processing of Speech Signals for Noise Robust ASR
Hong You, Abeer Alwan; University of California at Los Angeles, USA
Mon-Ses2-O1-3, Time: 14:10
In this paper, we analyze the temporal modulation characteristics of speech and noise from a speech/non-speech discrimination point of view. Although previous psychoacoustic studies [3][10] have shown that low temporal modulation components are important for speech intelligibility, there is no reported analysis on modulation components from the point of view of speech/noise discrimination. Our data-driven analysis of modulation components of speech and noise reveals that speech and noise are more accurately classified by low-passed modulation frequencies than band-passed ones. Effects of additive noise on the modulation characteristics of speech signals are also analyzed. Based on the analysis, we propose a frequency adaptive modulation processing algorithm for a noise robust ASR task. The algorithm is based on speech channel classification and modulation pattern denoising. Speech recognition experiments are performed to compare the proposed algorithm with other noise robust front-ends, including RASTA and ETSI AFE. Recognition results show that the frequency adaptive modulation processing is promising.

Progressive Memory-Based Parametric Non-Linear Feature Equalization
Luz Garcia 1 , Roberto Gemello 2 , Franco Mana 2 , Jose Carlos Segura 1 ; 1 Universidad de Granada, Spain; 2 Loquendo, Italy
Mon-Ses2-O1-4, Time: 14:30
This paper analyzes the benefits and drawbacks of PEQ (Parametric Non-linear Equalization), a feature normalization technique based on the parametric equalization of the MFCC parameters to match a reference probability distribution. Two limitations have been outlined: the distortion intrinsic to the normalization process and the lack of accuracy in estimating normalization statistics on short sentences. Two evolutions of PEQ are presented as solutions to the limitations encountered. The effects of the proposed evolutions are evaluated on three speech corpora, namely the WSJ0, AURORA-3 and HIWIRE cockpit databases, with different mismatch conditions given by convolutional and/or additive noise and non-native speakers. The obtained results show that the encountered limitations can be overcome by the newly introduced techniques.

Dynamic Features in the Linear Domain for Robust Automatic Speech Recognition in a Reverberant Environment
Osamu Ichikawa, Takashi Fukuda, Ryuki Tachibana, Masafumi Nishimura; IBM Tokyo Research Lab, Japan
Mon-Ses2-O1-5, Time: 14:50
Since the MFCC are calculated from logarithmic spectra, the delta and delta-delta are considered as difference operations in a logarithmic domain. In a reverberant environment, speech signals have trailing reverberations, whose power follows a long-term exponential decay. This means the logarithmic delta value tends to remain large for a long time. This paper proposes a delta feature calculated in the linear domain, due to the rapid decay in reverberant environments. In an experiment using an evaluation framework (CENSREC-4), significant improvements were found in reverberant situations by simply replacing the MFCC dynamic features with the proposed dynamic features.
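The contrast between log-domain and linear-domain dynamic features can be illustrated with the standard regression-style delta; the window width, the normalisation and the stand-in filterbank data below are generic textbook choices, not necessarily the paper's exact definition.

```python
import numpy as np

def delta(feats, width=2):
    """Standard regression-based delta over a (frames x dims) feature matrix."""
    pad = np.pad(feats, ((width, width), (0, 0)), mode="edge")
    num = sum(k * (pad[width + k:len(feats) + width + k] -
                   pad[width - k:len(feats) + width - k])
              for k in range(1, width + 1))
    return num / (2.0 * sum(k * k for k in range(1, width + 1)))

# Log-domain delta (what MFCC deltas effectively compute) versus a delta
# taken on the linear-domain filterbank power, as the abstract suggests.
fbank_power = np.abs(np.random.randn(300, 24)) ** 2   # stand-in mel power
log_delta = delta(np.log(fbank_power + 1e-10))
linear_delta = delta(fbank_power)                     # decays quickly after offsets
print(log_delta.shape, linear_delta.shape)
```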
Local Projections and Support Vector Based Feature Selection in Speech Recognition
Antonio Miguel, Alfonso Ortega, L. Buera, Eduardo Lleida; Universidad de Zaragoza, Spain
Mon-Ses2-O1-6, Time: 15:10
In this paper we study a method to provide noise robustness in mismatch conditions for speech recognition using local frequency projections and feature selection. Local time-frequency filtering patterns have been used previously to provide noise robust features and a simpler feature set to apply reliability weighting techniques. The proposed method combines two techniques to select the feature set: first, a reliability metric based on information theory and, second, a support vector set to reduce the errors. The support vector set provides the most representative examples which have influence on the error rate in mismatch conditions, so that only the features which incorporate implicit robustness to mismatch are selected. Some experimental results are obtained with this method compared to baseline systems using the Aurora 2 database.

Mon-Ses2-O2 : Production: Articulatory Modelling
Jones (East Wing 1), 13:30, Monday 7 Sept 2009
Chair: Rob van Son, University of Amsterdam, The Netherlands

Feedforward Control of a 3D Physiological Articulatory Model for Vowel Production
Qiang Fang 1 , Akikazu Nishikido 2 , Jianwu Dang 2 , Aijun Li 1 ; 1 Chinese Academy of Social Sciences, China; 2 JAIST, Japan
Mon-Ses2-O2-1, Time: 13:30
A 3D physiological articulatory model has been developed to account for the biomechanical properties of speech organs in speech production. To control the model for investigating the mechanism of speech production, a feedforward control strategy is necessary to generate proper muscle activations according to desired articulatory targets. In this paper, we elaborated a feedforward control module for the 3D physiological articulatory model. In the feedforward control process, an input articulatory target, specified by articulatory parameters, is transformed to an intrinsic representation of articulation; then, a muscle activation pattern is generated by a proposed mapping function. The results show that the proposed feedforward control strategy is able to control the proposed 3D physiological articulatory model with high accuracy both acoustically and articulatorily.

Articulatory Modeling Based on Semi-Polar Coordinates and Guided PCA Technique
Jun Cai 1 , Yves Laprie 1 , Julie Busset 1 , Fabrice Hirsch 2 ; 1 LORIA, France; 2 Institut de Phonétique de Strasbourg, France
Mon-Ses2-O2-2, Time: 13:50
Research on 2-dimensional static articulatory modeling has been performed by using the semi-polar system and the guided PCA analysis of lateral X-ray images of the vocal tract. The density of the grid lines in the semi-polar system has been increased to obtain better descriptive precision. New parameters have been introduced to describe the movements of the tongue apex. An extra feature, the tongue root, has been extracted as one of the elementary factors in order to improve the precision of the tongue model. New methods still remain to be developed for describing the movements of the tongue apex.

Sequencing of Articulatory Gestures Using Cost Optimization
Juraj Simko, Fred Cummins; University College Dublin, Ireland
Mon-Ses2-O2-3, Time: 14:10
Within the framework of articulatory phonology (AP), gestures function as primitives, and their ordering in time is provided by a gestural score. Determining how they should be sequenced in time has been something of a challenge. We modify the task dynamic implementation of AP, by defining tasks to be the desired positions of physically embodied end effectors. This allows us to investigate the optimal sequencing of gestures based on a parametric cost function. Costs evaluated include precision of articulation, articulatory effort, and gesture duration. We find that a simple optimization using these costs results in stable gestural sequences that reproduce several known coarticulatory effects.

From Experiments to Articulatory Motion — A Three Dimensional Talking Head Model
Xiao Bo Lu 1 , William Thorpe 1 , Kylie Foster 2 , Peter Hunter 1 ; 1 University of Auckland, New Zealand; 2 University of Massey, New Zealand
Mon-Ses2-O2-4, Time: 14:30
The goal of this study is to develop a customised computer model that can accurately represent the motion of vocal articulators during vowels and consonants. Models of the articulators were constructed as Finite Element (FE) meshes based on digitised high-resolution MRI (Magnetic Resonance Imaging) scans obtained during quiet breathing. Articulatory kinematics during speaking were obtained by EMA (Electromagnetic Articulography) and video of the face. The movement information thus acquired was applied to the FE model to provide jaw motion, modeled as a rigid body, and tongue, cheek and lip movements modeled with a free-form deformation technique. The motion of the epiglottis has also been considered in the model.

Towards Robust Glottal Source Modeling
Javier Pérez, Antonio Bonafonte; Universitat Politècnica de Catalunya, Spain
Mon-Ses2-O2-5, Time: 14:50
We present here a new method for the simultaneous estimation of the derivative glottal waveform and the vocal tract filter. The algorithm is pitch-synchronous and uses overlapping frames of several glottal cycles to increase the robustness and quality of the estimation. Two parametric models for the glottal waveform are used: the KLGLOTT88 during the convex optimization iteration, and the LF model for the final parametrization. We use a synthetic corpus based on real data published in several studies to evaluate the performance. A second corpus has been specially recorded for this work, consisting of isolated vowels uttered with different voice qualities. The algorithm has been found to perform well with most of the voice qualities present in the synthetic data-set in terms of glottal waveform matching. The performance is also good with the real vowel data-set in terms of resynthesis quality.

Sliding Vocal-Tract Model and its Application for Vowel Production
Takayuki Arai; Sophia University, Japan
Mon-Ses2-O2-6, Time: 15:10
In a previous study, Arai implemented a sliding vocal-tract model based on Fant's three-tube model and demonstrated its usefulness for education in acoustics and speech science. The sliding vocal-tract model consists of a long outer cylinder and a short inner cylinder, which simulates tongue constriction in the vocal tract. This model can produce different vowels by sliding the inner cylinder and changing the degree of constriction. In this study, we investigated the model's coverage of vowels in the vowel space and explored its application for vowel production in the speech and hearing sciences.

Mon-Ses2-O3 : Systems for LVCSR and Rich Transcription
Fallside (East Wing 2), 13:30, Monday 7 Sept 2009
Chair: Thomas Schaaf, Multimodal Technologies Inc., USA

Minimum Hypothesis Phone Error as a Decoding Method for Speech Recognition
Haihua Xu 1 , Daniel Povey 2 , Jie Zhu 1 , Guanyong Wu 1 ; 1 Shanghai Jiao Tong University, China; 2 Microsoft Research, USA
Mon-Ses2-O3-1, Time: 13:30
In this paper we show how methods for approximating phone error, as normally used for Minimum Phone Error (MPE) discriminative training, can be used instead as a decoding criterion for lattice rescoring. This is an alternative to Confusion Networks (CN), which are commonly used in speech recognition. The standard (Maximum A Posteriori) decoding approach is a Minimum Bayes Risk estimate with respect to the Sentence Error Rate (SER); however, we are typically more interested in the Word Error Rate (WER). Methods such as CN and our proposed Minimum Hypothesis Phone Error (MHPE) aim to get closer to minimizing the expected WER. Based on preliminary experiments we find that our approach gives more improvement than CN, and is conceptually simpler.

Posterior-Based Out of Vocabulary Word Detection in Telephone Speech
Stefan Kombrink 1 , Lukáš Burget 1 , Pavel Matějka 1 , Martin Karafiát 1 , Hynek Hermansky 2 ; 1 Brno University of Technology, Czech Republic; 2 Johns Hopkins University, USA
Mon-Ses2-O3-2, Time: 13:50
In this paper we present an out-of-vocabulary word detector suitable for English conversational and read speech.
We use an approach based on phone posteriors created by a Large Vocabulary Continuous Speech Recognition system and an additional phone recognizer that allows detection of OOV and misrecognized words. In addition, the recognized word output can be transcribed in more detail using several classes. Reported results are on CallHome English and Wall Street Journal data.

Automatic Transcription System for Meetings of the Japanese National Congress
Yuya Akita, Masato Mimura, Tatsuya Kawahara; Kyoto University, Japan
Mon-Ses2-O3-3, Time: 14:10
This paper presents an automatic speech recognition (ASR) system for assisting meeting record creation of the National Congress of Japan. The system is designed to cope with the spontaneous characteristics of meeting speech, as well as a variety of topics and speakers. For the acoustic model, minimum phone error (MPE) training is applied with several normalization techniques. For the language model, we have proposed statistical style transformation to generate spoken-style N-grams and their statistics. We also introduce statistical modeling of pronunciation variation in spontaneous speech. The ASR system was evaluated on real congressional meetings, and achieved a word accuracy of 84%. It is also suggested that ASR-based transcripts at this accuracy level are usable for editing meeting records.

Cross-Language Bootstrapping for Unsupervised Acoustic Model Training: Rapid Development of a Polish Speech Recognition System
Jonas Lööf, Christian Gollan, Hermann Ney; RWTH Aachen University, Germany
Mon-Ses2-O3-4, Time: 14:30
This paper describes the rapid development of a Polish language speech recognition system.
The system development was performed without access to any transcribed acoustic training data. This was achieved through the combined use of cross-language bootstrapping and confidence based unsupervised acoustic model training. A Spanish acoustic model was ported to Polish, through the use of a manually constructed phoneme mapping. This initial model was refined through iterative recognition and retraining of the untranscribed audio data. The system was trained and evaluated on recordings from the European Parliament, and included several state-of-the-art speech recognition techniques in addition to the use of unsupervised model training. Confidence based speaker adaptive training using feature space transform adaptation, as well as vocal tract length normalization and maximum likelihood linear regression, was used to refine the acoustic model. Through the combination of the different techniques, good performance was achieved on the domain of parliamentary speeches.

Porting an European Portuguese Broadcast News Recognition System to Brazilian Portuguese
Alberto Abad 1 , Isabel Trancoso 2 , Nelson Neto 3 , M. Céu Viana 4 ; 1 INESC-ID Lisboa, Portugal; 2 INESC-ID Lisboa/IST, Portugal; 3 Federal University of Pará, Brazil; 4 CLUL, Portugal
Mon-Ses2-O3-5, Time: 14:50
This paper reports on recent work in the context of the activities of the PoSTPort project, aimed at porting a Broadcast News recognition system originally developed for European Portuguese to other varieties. Concretely, in this paper we have focused on porting to Brazilian Portuguese. The impact of some of the main sources of variability has been assessed, besides proposing solutions at the lexical, acoustic and syntactic levels. The ported Brazilian Portuguese Broadcast News system allowed a drastic performance improvement from 56.6% WER (obtained with the European Portuguese system) to 25.5%.

Modeling Northern and Southern Varieties of Dutch for STT
Julien Despres 1 , Petr Fousek 2 , Jean-Luc Gauvain 2 , Sandrine Gay 1 , Yvan Josse 1 , Lori Lamel 2 , Abdel Messaoudi 2 ; 1 Vecsys Research, France; 2 LIMSI, France
Mon-Ses2-O3-6, Time: 15:10
This paper describes how the Northern (NL) and Southern (VL) varieties of Dutch are modeled in the joint Limsi-Vecsys Research speech-to-text transcription systems for broadcast news (BN) and conversational telephone speech (CTS). Using the Spoken Dutch Corpus resources (CGN), systems were developed and evaluated in the 2008 N-Best benchmark. Modeling techniques that are used in our systems for other languages were found to be effective for the Dutch language; however, it was also found to be important to have acoustic and language models, and statistical pronunciation generation rules, adapted to each variety. This was in particular true for the MLP features, which were only effective when trained separately for Dutch and Flemish. The joint submissions obtained the lowest WERs in the benchmark by a significant margin.

Mon-Ses2-O4 : Speech Analysis and Processing I
Holmes (East Wing 3), 13:30, Monday 7 Sept 2009
Chair: Bernd Möbius, Universität Stuttgart, Germany

Nearly Perfect Detection of Continuous F0 Contour and Frame Classification for TTS Synthesis
Thomas Ewender, Sarah Hoffmann, Beat Pfister; ETH Zürich, Switzerland
Mon-Ses2-O4-1, Time: 13:30
We present a new method for the estimation of a continuous fundamental frequency (F0) contour. The algorithm implements a global optimization and yields virtually error-free F0 contours for high quality speech signals. Such F0 contours are subsequently used to extract a continuous fundamental wave. Some local properties of this wave, together with a number of other speech features, allow us to classify the frames of a speech signal into five classes: voiced, unvoiced, mixed, irregularly glottalized and silence. The presented F0 detection and frame classification can be applied to F0 modeling and prosodic modification of speech segments in high-quality concatenative speech synthesis.

AM-FM Estimation for Speech Based on a Time-Varying Sinusoidal Model
Yannis Pantazis 1 , Olivier Rosec 2 , Yannis Stylianou 1 ; 1 FORTH, Greece; 2 Orange Labs, France
Mon-Ses2-O4-2, Time: 13:50
In this paper we present a method based on a time-varying sinusoidal model for a robust and accurate estimation of amplitude and frequency modulations (AM-FM) in speech. The suggested approach has two main steps. First, speech is modeled as a sinusoidal model with time-varying amplitudes. Specifically, the model makes use of a first order time polynomial with complex coefficients for capturing instantaneous amplitude and frequency (phase) components. Next, the model parameters are updated by using the previously estimated instantaneous phase information. Thus, an iterative scheme for AM-FM decomposition of speech is suggested, which was validated on synthetic AM-FM signals and tested on reconstruction of voiced speech signals, where the signal-to-error reconstruction ratio (SERR) was used as the measure. Compared to the standard sinusoidal representation, the suggested approach was found to improve the corresponding SERR by 47%, resulting in over 30 dB of SERR.

Voice Source Waveform Analysis and Synthesis Using Principal Component Analysis and Gaussian Mixture Modelling
Jon Gudnason 1 , Mark R.P. Thomas 1 , Patrick A. Naylor 1 , Dan P.W. Ellis 2 ; 1 Imperial College London, UK; 2 Columbia University, USA
Mon-Ses2-O4-3, Time: 14:10
The paper presents voice source waveform modeling techniques based on principal component analysis (PCA) and Gaussian mixture modeling (GMM). The voice source is obtained by inverse-filtering speech with the estimated vocal tract filter. This decomposition is useful in speech analysis, synthesis, recognition and coding. Existing models of the voice source signal are based on function-fitting or physically motivated assumptions and although they are well defined, estimation of their parameters is not well understood and few are capable of reproducing the large variety of voice source waveforms. Here, a data-driven approach is presented for signal decomposition and classification based on the principal components of the voice source. The principal components are analyzed and the 'prototype' voice source signals corresponding to the Gaussian mixture means are examined. We show how an unknown signal can be decomposed into its components and/or prototypes and resynthesized. We show how the techniques are suited for both low-bitrate and high-quality analysis/synthesis schemes.
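As a rough illustration of the data-driven decomposition described above, the following sketch fits a PCA and a GMM to pitch-synchronous voice-source cycles using scikit-learn; the cycle length, the component counts and the random stand-in data are assumptions for the example, not values from the paper.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture

# Stand-in data: 500 inverse-filtered glottal cycles, resampled to 64 samples each.
rng = np.random.default_rng(0)
cycles = rng.standard_normal((500, 64))

# 1) Principal components of the voice-source cycles.
pca = PCA(n_components=8)
scores = pca.fit_transform(cycles)               # low-dimensional description per cycle

# 2) Gaussian mixture in PCA space; the mixture means act as 'prototype' sources.
gmm = GaussianMixture(n_components=4, random_state=0).fit(scores)
prototypes = pca.inverse_transform(gmm.means_)   # back to the waveform domain

# 3) Decompose an unseen cycle into its PCA components and resynthesize it.
new_cycle = rng.standard_normal((1, 64))
resynth = pca.inverse_transform(pca.transform(new_cycle))
print(prototypes.shape, resynth.shape)
```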
Model-Based Estimation of Instantaneous Pitch in Noisy Speech
Jung Ook Hong, Patrick J. Wolfe; Harvard University, USA
Mon-Ses2-O4-4, Time: 14:30
In this paper we propose a model-based approach to instantaneous pitch estimation in noisy speech, by way of incorporating pitch smoothness assumptions into the well-known harmonic model. In this approach, the latent pitch contour is modeled using a basis of smooth polynomials, and is fit to waveform data by way of a harmonic model whose partials have time-varying amplitudes. The resultant nonlinear least squares estimation task is accomplished through the Gauss-Newton method with a novel initialization step that serves to greatly increase algorithm efficiency. We demonstrate the accuracy and robustness of our method through comparisons to state-of-the-art pitch estimation algorithms using both simulated and real waveform data.

Complex Cepstrum-Based Decomposition of Speech for Glottal Source Estimation
Thomas Drugman 1 , Baris Bozkurt 2 , Thierry Dutoit 1 ; 1 Faculté Polytechnique de Mons, Belgium; 2 Izmir Institute of Technology, Turkey
Mon-Ses2-O4-5, Time: 14:50
Homomorphic analysis is a well-known method for the separation of non-linearly combined signals. More particularly, the use of the complex cepstrum for source-tract deconvolution has been discussed in various articles. However, there exists no study which proposes a glottal flow estimation methodology based on the cepstrum and reports effective results. In this paper, we show that the complex cepstrum can be effectively used for glottal flow estimation by separating the causal and anticausal components of a windowed speech signal, as done by the Zeros of the Z-Transform (ZZT) decomposition. Based on exactly the same principles presented for the ZZT decomposition, windowing should be applied such that the windowed speech signals exhibit mixed-phase characteristics which conform to the speech production model in which the anticausal component is mainly due to the glottal flow open phase. The advantage of the complex cepstrum-based approach compared to the ZZT decomposition is its much higher speed.

Approximate Intrinsic Fourier Analysis of Speech
Frank Tompkins, Patrick J. Wolfe; Harvard University, USA
Mon-Ses2-O4-6, Time: 15:10
Popular parametric models of speech sounds such as the source-filter model provide a fixed means of describing the variability inherent in speech waveform data. However, nonlinear dimensionality reduction techniques such as the intrinsic Fourier analysis method of Jansen and Niyogi provide a more flexible means of adaptively estimating such structure directly from data. Here we employ this approach to learn a low-dimensional manifold whose geometry is meant to reflect the structure implied by the human speech production system. We derive a novel algorithm to efficiently learn this manifold for the case of many training examples — the setting of both greatest practical interest and computational difficulty. We then demonstrate the utility of our method by way of a proof-of-concept phoneme identification system that operates effectively in the intrinsic Fourier domain.

Mon-Ses2-P1 : Speech Perception I
Hewison Hall, 13:30, Monday 7 Sept 2009
Chair: Paul Boersma, University of Amsterdam, The Netherlands

Relative Importance of Formant and Whole-Spectral Cues for Vowel Perception
Masashi Ito, Keiji Ohara, Akinori Ito, Masafumi Yano; Tohoku University, Japan
Mon-Ses2-P1-1, Time: 13:30
Three psycho-acoustical experiments were carried out to investigate the relative importance of formant frequency and whole spectral shape as cues for vowel perception. Four types of vowel-like signals were presented to eight listeners. The mean responses for stimuli including both the formant and amplitude-ratio features were quite similar to those for the stimuli including only the formant peak feature. Nonetheless, reasonable vowel changes were observed in responses for stimuli including only the amplitude-ratio feature. The perceived vowel changes were also observed even for stimuli including neither of these features. The results suggested that perceptual cues were involved in various parts of the vowel spectrum.

Influences of Vowel Duration on Speaker-Size Estimation and Discrimination
Chihiro Takeshima 1 , Minoru Tsuzaki 1 , Toshio Irino 2 ; 1 Kyoto City University of Arts, Japan; 2 Wakayama University, Japan
Mon-Ses2-P1-2, Time: 13:30
Several experimental studies have shown that the human auditory system has a mechanism for extracting speaker-size information, using sufficiently long sounds. This paper investigated the influence of vowel duration on the processing for size extraction using short vowels. In a size estimation experiment, listeners subjectively estimated the size (height) of the speaker for isolated vowels. The results showed that listeners' perception of speaker size was highly correlated with the factor of vocal-tract length in all the tested durations (from 16 ms to 256 ms). In a size discrimination experiment, listeners were presented with two vowels scaled in vocal-tract length and were asked which vowel was perceived to be spoken by a smaller speaker. The results showed that the just-noticeable differences (JNDs) in speaker size were almost the same for durations longer than 32 ms. However, the JNDs rose considerably for the 16-ms duration. These observations suggest that the auditory system can extract speaker-size information even for 16-ms vowels, although the precision of size extraction would deteriorate when the duration becomes less than 32 ms.

High Front Vowels in Czech: A Contrast in Quantity or Quality?
Václav Jonáš Podlipský 1 , Radek Skarnitzl 2 , Jan Volín 2 ; 1 Palacký University Olomouc, Czech Republic; 2 Charles University in Prague, Czech Republic
Mon-Ses2-P1-3, Time: 13:30
We investigate the perception and production of Czech /I/ and /i:/, a contrast traditionally described as quantitative. First, we show that the spectral difference between the vowels is for many Czechs as strong a cue as (or even stronger than) duration. Second, we test the hypothesis that this shift towards vowel quality as a perceptual cue for this contrast resulted in weakening of the durational differentiation in production. Our measurements confirm this: members of the /I/-/i:/ pair differed in duration much less than those of other short-long pairs. We interpret these findings in terms of Lindblom's H&H theory.

Effect of Contralateral Noise on Energetic and Informational Masking on Speech-in-Speech Intelligibility
Marjorie Dole 1 , Michel Hoen 2 , Fanny Meunier 1 ; 1 DDL, France; 2 SBRI, France
Mon-Ses2-P1-4, Time: 13:30
This experiment tested the advantage of binaural presentation of an interfering noise in a task involving identification of monaurally-presented words. These words were embedded in three types of noise: a stationary noise, a speech-modulated noise and a speech-babble noise, in order to assess energetic and informational masking contributions to binaural unmasking. Our results showed important informational masking in the monaural condition, principally due to lexical and phonetic competition. We also found a binaural unmasking effect, which was more important when speech was used as the interferer, suggesting that this suppressive effect was more efficient in the case of high-level informational (lexical and phonetic) competition.

Using Location Cues to Track Speaker Changes from Mobile, Binaural Microphones
Heidi Christensen, Jon Barker; University of Sheffield, UK
Mon-Ses2-P1-5, Time: 13:30
This paper presents initial developments towards computational hearing models that move beyond stationary microphone assumptions. We present a particle filtering based system for using localisation cues to track speaker changes in meeting recordings. Recordings are made using in-ear binaural microphones worn by a listener whose head is constantly moving. Tracking speaker changes requires simultaneously inferring the perceiver's head orientation, as any change in relative spatial angle to a source can be caused by either the source moving or the microphones moving. In real applications, such as robotics, there may be access to external estimates of the perceiver's position. We investigate the effect of simulating varying degrees of measurement noise in an external perceiver position estimate. We show that only limited self-position knowledge is needed to greatly improve the reliability with which we can decode the acoustic localisation cues in the meeting scenario.
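Particle filtering, the inference machinery mentioned in the preceding abstract, can be illustrated with a minimal bootstrap filter that tracks a single source bearing from noisy localisation cues; the one-dimensional state, the random-walk dynamics and all parameter values below are assumptions for the example rather than the authors' model.

```python
import numpy as np

def particle_filter_bearing(observations, n_particles=500, proc_std=2.0, obs_std=8.0):
    """Minimal bootstrap particle filter tracking a source bearing (degrees).

    State: bearing with random-walk dynamics; observation: a noisy bearing cue.
    Returns the posterior-mean bearing for each frame.
    """
    rng = np.random.default_rng(1)
    particles = rng.uniform(-180.0, 180.0, n_particles)
    weights = np.full(n_particles, 1.0 / n_particles)
    estimates = []
    for z in observations:
        particles = particles + rng.normal(0.0, proc_std, n_particles)  # predict
        weights *= np.exp(-0.5 * ((z - particles) / obs_std) ** 2)      # weight
        weights /= weights.sum()
        estimates.append(np.sum(weights * particles))
        # Systematic resampling keeps the particle set from degenerating.
        idx = np.searchsorted(np.cumsum(weights),
                              (rng.random() + np.arange(n_particles)) / n_particles)
        particles, weights = particles[idx], np.full(n_particles, 1.0 / n_particles)
    return np.array(estimates)

# Noisy bearing cues from a source drifting from 0 to 40 degrees.
true_bearing = np.linspace(0, 40, 100)
cues = true_bearing + np.random.default_rng(2).normal(0, 8.0, 100)
print(particle_filter_bearing(cues)[-1])
```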
A Perceptual Investigation of Speech Transcription Errors Involving Frequent Near-Homophones in French and American English
Ioana Vasilescu 1 , Martine Adda-Decker 1 , Lori Lamel 1 , Pierre Hallé 2 ; 1 LIMSI, France; 2 LPP, France
Mon-Ses2-P1-6, Time: 13:30
This article compares the errors made by automatic speech recognizers to those made by humans for near-homophones in American English and French. This exploratory study focuses on the impact of limited word context and the potential resulting ambiguities for automatic speech recognition (ASR) systems and human listeners. Perceptual experiments using 7-gram chunks centered on incorrect or correct words output by an ASR system show that humans make significantly more transcription errors on the first type of stimuli, thus highlighting the local ambiguity. The long-term aim of this study is to improve the modeling of such ambiguous items in order to reduce ASR errors.

The Role of Glottal Pulse Rate and Vocal Tract Length in the Perception of Speaker Identity
Etienne Gaudrain, Su Li, Vin Shen Ban, Roy D. Patterson; University of Cambridge, UK
Mon-Ses2-P1-7, Time: 13:30
In natural speech, for a given speaker, vocal tract length (VTL) is effectively fixed whereas glottal pulse rate (GPR) is varied to indicate prosodic distinctions. This suggests that VTL will be a more reliable cue for identifying a speaker than GPR. It also suggests that listeners will accept larger changes in GPR before perceiving a speaker change. We measured the effect of GPR and VTL on the perception of a speaker difference, and found that listeners hear different speakers given a VTL difference of 25%, but they require a GPR difference of 45%.

Development of Voicing Categorization in Deaf Children with Cochlear Implant
Victoria Medina, Willy Serniclaes; LPP, France
Mon-Ses2-P1-8, Time: 13:30
Cochlear implant (CI) improves hearing, but communication abilities still depend on several factors. The present study assesses the development of voicing categorization in deaf children with cochlear implant, examining both categorical perception (CP) and boundary precision (BP) performances. We compared 22 implanted children to 55 normal-hearing children using different age factors. The results showed that the development of voicing perception in CI children is fairly similar to that in normal-hearing controls with the same auditory experience, irrespective of differences in the age of implantation (two vs. three years of age).

Processing Liaison-Initial Words in Native and Non-Native French: Evidence from Eye Movements
Annie Tremblay; University of Illinois at Urbana-Champaign, USA
Mon-Ses2-P1-9, Time: 13:30
French listeners have no difficulty recognizing liaison-initial words. This is in part because acoustic/phonetic information distinguishes liaison consonants from (non-resyllabified) word onsets in the speech signal. Using eye tracking, this study investigates whether native speakers of English, a language that does not have a phonological resyllabification process like liaison, can develop target-like segmentation procedures for recognizing liaison-initial words in French, and if so, how such procedures develop with increasing proficiency.

Estimating the Potential of Signal and Interlocutor-Track Information for Language Modeling
Nigel G. Ward, Benjamin H. Walker; University of Texas at El Paso, USA
Mon-Ses2-P1-10, Time: 13:30
Although today most language models treat language purely as word sequences, there is recurring interest in tapping new sources of information, such as disfluencies, prosody, the interlocutor's dialog act, and the interlocutor's recent words. In order to estimate the potential value of such sources of information, we extend Shannon's guessing-game method for estimating entropy to work for spoken dialog. Four teams of two subjects each predicted the next word in a dialog using various amounts of context: one word, two words, all the words spoken so far, or the full dialog audio so far. The entropy benefit in the full-audio condition over the full-text condition was substantial, .64 bits per word, greater than the .54 bit benefit of full-text context over trigrams. This suggests that language models may be improved by use of the prosody of the speaker and context from the interlocutor.

Mon-Ses2-P2 : Accent and Language Recognition
Hewison Hall, 13:30, Monday 7 Sept 2009
Chair: William Campbell, MIT, USA

Factor Analysis and SVM for Language Recognition
Florian Verdet 1 , Driss Matrouf 1 , Jean-François Bonastre 1 , Jean Hennebert 2 ; 1 LIA, France; 2 Université de Fribourg, Switzerland
Mon-Ses2-P2-1, Time: 13:30
Statistical classifiers operate on features that generally include both useful and useless information. These two types of information are difficult to separate in the feature domain. Recently, a new paradigm based on Factor Analysis (FA) proposed a model decomposition into useful and useless components. This method has successfully been applied to speaker recognition tasks. In this paper, we study the use of FA for language recognition. We propose a classification method based on SDC features and Gaussian Mixture Models (GMM). We present well performing systems using Factor Analysis and FA-based Support Vector Machine (SVM) classifiers. Experiments are conducted using NIST LRE 2005's primary condition. The relative equal error rate reduction obtained by the best factor analysis configuration with respect to the baseline GMM-UBM system is over 60%, corresponding to an EER of 6.59%.
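The back-end of an FA-based SVM system of the kind mentioned above can be illustrated generically: one low-dimensional factor vector per utterance is fed to an SVM classifier. The factor-analysis front-end itself is not implemented here, and the synthetic data, dimensionality and kernel choice are assumptions for the example.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

# Stand-in "factor" vectors: one low-dimensional vector per utterance, as a
# factor-analysis front-end would produce (not implemented in this sketch).
rng = np.random.default_rng(0)
n_per_lang, dim = 200, 40
lang_means = rng.standard_normal((3, dim))                    # 3 toy languages
X = np.vstack([m + 0.8 * rng.standard_normal((n_per_lang, dim)) for m in lang_means])
y = np.repeat(np.arange(3), n_per_lang)

# A linear-kernel SVM on the factor vectors; the low dimensionality keeps
# training and evaluation cheap even with many examples.
clf = SVC(kernel="linear", C=1.0)
print(cross_val_score(clf, X, y, cv=5).mean())
```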
Exploring Universal Attribute Characterization of Spoken Languages for Spoken Language Recognition
Sabato Marco Siniscalchi 1 , Jeremy Reed 2 , Torbjørn Svendsen 1 , Chin-Hui Lee 2 ; 1 NTNU, Norway; 2 Georgia Institute of Technology, USA
Mon-Ses2-P2-2, Time: 13:30
We propose a novel universal acoustic characterization approach to spoken language identification (LID), in which any spoken language is described with a common set of fundamental units defined "universally." Specifically, manner and place of articulation form this unit inventory and are used to build a set of universal attribute models with data-driven techniques. Using the vector space modeling approach to LID, a spoken utterance is first decoded into a sequence of attributes. Then, a feature vector consisting of co-occurrence statistics of attribute units is created, and the final LID decision is implemented with a set of vector space language classifiers. Although the present study is just in its preliminary stage, promising results comparable to acoustically rich phone-based LID systems have already been obtained on the NIST 2003 LID task. The results provide clear insight for further performance improvements and encourage a continuing exploration of the proposed framework.

On the Use of Phonological Features for Automatic Accent Analysis
Abhijeet Sangwan, John H.L. Hansen; University of Texas at Dallas, USA
Mon-Ses2-P2-3, Time: 13:30
In this paper, we present an automatic accent analysis system that is based on phonological features (PFs). The proposed system exploits the knowledge of articulation embedded in phonology by rapidly building Markov models (MMs) of PFs extracted from accented speech. The Markov models capture information in the PF space along two dimensions of articulation: PF state-transitions and state-durations. Furthermore, by utilizing MMs of native and non-native accents, a new statistical measure of "accentedness" is developed which rates the articulation of a word on a scale of native-like (-1) to non-native like (+1). The proposed methodology is then used to perform an automatic cross-sectional study of accented English spoken by native speakers of Mandarin Chinese (N-MC). The experimental results demonstrate the capability of the proposed system to rapidly perform quantitative as well as qualitative analysis of foreign accents. The work developed in this paper is easily assimilated into language learning systems, and has impact in the areas of speaker and speech recognition.
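A toy version of such an accentedness measure can be built from two first-order Markov models over phonological-feature states and a log-likelihood ratio squashed into [-1, +1]. The state inventory, the smoothing, the squashing function and the tiny data below are illustrative assumptions, not the authors' exact formulation.

```python
import numpy as np

def transition_model(sequences, n_states, alpha=1.0):
    """First-order Markov transition matrix with add-alpha smoothing."""
    counts = np.full((n_states, n_states), alpha)
    for seq in sequences:
        for a, b in zip(seq[:-1], seq[1:]):
            counts[a, b] += 1
    return counts / counts.sum(axis=1, keepdims=True)

def accentedness(seq, native_T, nonnative_T):
    """Average per-transition log-likelihood ratio, squashed to roughly [-1, +1]:
    negative means native-like, positive means non-native-like."""
    llr = sum(np.log(nonnative_T[a, b]) - np.log(native_T[a, b])
              for a, b in zip(seq[:-1], seq[1:]))
    return float(np.tanh(llr / max(len(seq) - 1, 1)))

# Toy integer-coded phonological-feature state sequences (e.g. place of articulation).
native = [[0, 1, 2, 1, 0, 2], [2, 1, 0, 0, 1, 2]]
nonnative = [[0, 2, 2, 2, 1, 0], [2, 2, 0, 2, 2, 1]]
T_nat, T_non = transition_model(native, 3), transition_model(nonnative, 3)
print(accentedness([0, 2, 2, 1, 0], T_nat, T_non))
```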
Language Recognition Using Language Factors Fabio Castaldo 1 , Sandro Cumani 1 , Pietro Laface 1 , Daniele Colibro 2 ; 1 Politecnico di Torino, Italy; 2 Loquendo, Italy Mon-Ses2-P2-4, Time: 13:30 Language recognition systems based on acoustic models reach state of the art performance using discriminative training techniques. In speaker recognition, eigenvoice modeling of the speaker, and the use of speaker factors as input features to SVMs has recently been demonstrated to give good results compared to the standard GMM-SVM approach, which combines GMMs supervectors and SVMs. In this paper we propose, in analogy to the eigenvoice modeling approach, to estimate an eigen-language space, and to use the language factors as input features to SVM classifiers. Since language factors are low-dimension vectors, training and evaluating SVMs with different kernels and with large training examples becomes an easy task. This approach is demonstrated on the 14 languages of the NIST 2007 language recognition task, and shows performance improvements with respect to the standard GMM-SVM technique. Automatic Accent Detection: Effect of Base Units and Boundary Information Je Hun Jeon, Yang Liu; University of Texas at Dallas, USA Mon-Ses2-P2-5, Time: 13:30 Automatic prominence or pitch accent detection is important as it can perform automatic prosodic annotation of speech corpora, as well as provide additional features in other tasks such as keyword detection. In this paper, we evaluate how accent detection performance changes according to different base units and what kind of boundary information is available. We compare word, syllable, and vowel-based units when their boundaries are provided. We also automatically estimate syllable boundaries using energy contours when phone-level alignment is available. In addition, we utilize a sliding window with fixed length under the condition of unknown boundaries. Our experiments show that when boundary information is available, using longer base unit achieves better performance. In the case of no boundary information, using a moving window with a fixed size achieves similar performance to using syllable information on word-level evaluation, suggesting that accent detection can be performed without relying on a speech recognizer to generate boundaries. Age Verification Using a Hybrid Speech Processing Approach Ron M. Hecht 1 , Omer Hezroni 1 , Amit Manna 1 , Ruth Aloni-Lavi 1 , Gil Dobry 1 , Amir Alfandary 1 , Yaniv Zigel 2 ; 1 PuddingMedia, Israel; 2 Ben-Gurion University of the Negev, Israel Mon-Ses2-P2-6, Time: 13:30 The human speech production system is a multi-level system. On the upper level, it starts with information that one wants to transmit. It ends on the lower level with the materialization of the information into a speech signal. Most of the recent work conducted in age estimation is focused on the lower-acoustic level. In this research the upper lexical level information is utilized for age-group verification and it is shown that one’s vocabulary reflects one’s age. Several age-group verification systems that are based on automatic transcripts are proposed. In addition, a hybrid approach is introduced, an approach that combines the word-based system and an acoustic-based system. Experiments were conducted on a four age-groups verification task using the Fisher corpora, where an average equal error rate (EER) of 28.7% was achieved using the lexical-based approach and 28.0% using an acoustic approach. 
By merging these two approaches the verification error was reduced to 24.1%.

Information Bottleneck Based Age Verification
Ron M. Hecht 1 , Omer Hezroni 1 , Amit Manna 1 , Gil Dobry 2 , Yaniv Zigel 2 , Naftali Tishby 3 ; 1 PuddingMedia, Israel; 2 Ben-Gurion University of the Negev, Israel; 3 Hebrew University, Israel
Mon-Ses2-P2-7, Time: 13:30
Word N-gram models can be used for word-based age-group verification. In this paper the agglomerative information bottleneck (AIB) approach is used to tackle one of the most fundamental drawbacks of word N-gram models: their abundant amount of irrelevant information. It is demonstrated that irrelevant information can be omitted by joining words to form word-clusters; this provides a mechanism to transform any sequence of words into a sequence of word-cluster labels. Consequently, word N-gram models are converted to word-cluster N-gram models, which are more compact. Age verification experiments were conducted on the Fisher corpora. Their goal was to verify the age-group of the speaker of an unknown speech segment. In these experiments an N-gram model was compressed to a fifth of its original size without reducing the verification performance. In addition, a verification accuracy improvement is demonstrated by disposing of irrelevant information.

Discriminative N-Gram Selection for Dialect Recognition
F.S. Richardson, W.M. Campbell, P.A. Torres-Carrasquillo; MIT, USA
Mon-Ses2-P2-8, Time: 13:30
Dialect recognition is a challenging and multifaceted problem. Distinguishing between dialects can rely upon many tiers of interpretation of speech data — e.g., prosodic, phonetic, spectral, and word. High-accuracy automatic methods for dialect recognition typically use either phonetic or spectral characteristics of the input. A challenge with spectral systems, such as those based on shifted-delta cepstral coefficients, is that they achieve good performance but do not provide insight into distinctive dialect features. In this work, a novel method based upon discriminative training and phone N-grams is proposed. This approach achieves excellent classification performance, fuses well with other systems, and has interpretable dialect characteristics in the phonetic tier. The method is demonstrated on data from the LDC and prior NIST language recognition evaluations. The method is also combined with spectral methods to demonstrate state-of-the-art performance in dialect recognition.

Data-Driven Phonetic Comparison and Conversion Between South African, British and American English Pronunciations
Linsen Loots, Thomas Niesler; Stellenbosch University, South Africa
Mon-Ses2-P2-9, Time: 13:30
We analyse pronunciations in American, British and South African English pronunciation dictionaries. Three analyses are performed. First, the accuracy is determined with which decision tree based grapheme-to-phoneme (G2P) conversion can be applied to each accent. It is found that there is little difference between the accents in this regard. Secondly, pronunciations are compared by performing pairwise alignments between the accents. Here we find that South African English pronunciation most closely matches British English. Finally, we apply decision trees to the conversion of pronunciations from one accent to another. We find that pronunciations of unknown words can be more accurately determined from a known pronunciation in a different accent than by means of G2P methods. This has important implications for the development of pronunciation dictionaries in less-resourced varieties of English, and hence also for the development of ASR systems.
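Decision-tree G2P of the kind compared above can be sketched with scikit-learn on a toy lexicon; the letter-window features, the tiny aligned entries and the integer encoding are assumptions for the example, and a realistic system would first need a proper grapheme-phoneme alignment and far more data.

```python
from sklearn.tree import DecisionTreeClassifier

# Toy aligned lexicon: one phoneme label per letter (the alignment is assumed given).
lexicon = {"cat": ["k", "ae", "t"], "cab": ["k", "ae", "b"], "bat": ["b", "ae", "t"]}

def letter_windows(word, width=1, pad="#"):
    """One fixed-width letter window per position, used as G2P features."""
    padded = pad * width + word + pad * width
    return [list(padded[i:i + 2 * width + 1]) for i in range(len(word))]

X = [w for word in lexicon for w in letter_windows(word)]
y = [p for word in lexicon for p in lexicon[word]]

# Encode letters as integers so the decision tree can split on them.
alphabet = sorted({c for window in X for c in window})
X_enc = [[alphabet.index(c) for c in window] for window in X]

tree = DecisionTreeClassifier().fit(X_enc, y)
test = [[alphabet.index(c) for c in w] for w in letter_windows("bab")]
print(tree.predict(test))   # predicted phoneme sequence for an unseen word
```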
Target-Aware Language Models for Spoken Language Recognition
Rong Tong, Bin Ma, Haizhou Li, Eng Siong Chng, Kong-Aik Lee; Institute for Infocomm Research, Singapore; Nanyang Technological University, Singapore
Mon-Ses2-P2-10, Time: 13:30
This paper studies a new way of constructing multiple phone tokenizers for language recognition. In this approach, each phone tokenizer for a target language will share a common set of acoustic models, while each tokenizer will have a unique phone-based language model (LM) trained for a specific target language. The target-aware language models (TALM) are constructed to capture the discriminative ability of individual phones for the desired target languages. The parallel phone tokenizers thus formed are shown to achieve better performance than the original phone recognizer. The proposed TALM is very different from the LM in the traditional PPRLM technique. First of all, the TALM applies the LM information in the front-end, as opposed to the PPRLM approach which uses a LM in the system back-end; furthermore, the TALM exploits discriminative phone occurrence statistics, which are different from the traditional n-gram statistics in the PPRLM approach. A novel way of training the TALM is also studied in this paper. Our experimental results show that the proposed method consistently improves the language recognition performance on the NIST 1996, 2003 and 2007 LRE 30-second closed test sets.
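For orientation, the baseline that target-aware language models refine can be sketched as one phone n-gram model per target language scoring a decoded phone string; this is ordinary PPRLM-style scoring, not the paper's target-aware construction, and the inventories, smoothing and toy data are assumptions.

```python
import math
from collections import defaultdict

def train_bigram(phone_strings, alpha=0.5):
    """Add-alpha smoothed phone bigram model from whitespace-separated phone strings."""
    counts, vocab = defaultdict(lambda: defaultdict(float)), set()
    for s in phone_strings:
        phones = ["<s>"] + s.split() + ["</s>"]
        vocab.update(phones)
        for a, b in zip(phones[:-1], phones[1:]):
            counts[a][b] += 1
    return {"counts": counts, "V": len(vocab), "alpha": alpha}

def logprob(model, phone_string):
    phones = ["<s>"] + phone_string.split() + ["</s>"]
    c, V, a = model["counts"], model["V"], model["alpha"]
    return sum(math.log((c[x][y] + a) / (sum(c[x].values()) + a * V))
               for x, y in zip(phones[:-1], phones[1:]))

# One toy model per target language, then pick the best-scoring one.
models = {"lang_A": train_bigram(["p a t a", "t a p a"]),
          "lang_B": train_bigram(["k i k o", "o k i k"])}
decoded = "p a t a p"
print(max(models, key=lambda lang: logprob(models[lang], decoded)))
```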
Language Identification for Speech-to-Speech Translation
Daniel Chung Yong Lim, Ian Lane; Carnegie Mellon University, USA
Mon-Ses2-P2-11, Time: 13:30
This paper investigates the use of language identification (LID) in real-time speech-to-speech translation systems. We propose a framework that incorporates LID capability into a speech-to-speech translation system while minimizing the impact on the system's real-time performance. We compared two phone-based LID approaches, namely PRLM and PPRLM, to a proposed extended approach based on Conditional Random Field classifiers. The performances of these three approaches were evaluated to identify the input language in the CMU English-Iraqi TransTAC system, and the proposed approach obtained significantly higher classification accuracies on two of the three test sets evaluated.

Using Prosody and Phonotactics in Arabic Dialect Identification
Fadi Biadsy, Julia Hirschberg; Columbia University, USA
Mon-Ses2-P2-12, Time: 13:30
While Modern Standard Arabic is the formal spoken and written language of the Arab world, dialects are the major communication mode for everyday life; identifying a speaker's dialect is thus critical to speech processing tasks such as automatic speech recognition, as well as speaker identification. We examine the role of prosodic features (intonation and rhythm) across four Arabic dialects: Gulf, Iraqi, Levantine, and Egyptian, for the purpose of automatic dialect identification. We show that prosodic features can significantly improve identification, over a purely phonotactic-based approach, with an identification accuracy of 86.33% for 2m utterances.

Mon-Ses2-P3 : ASR: Acoustic Model Training and Combination
Hewison Hall, 13:30, Monday 7 Sept 2009
Chair: Jeff Bilmes, University of Washington, USA

Refactoring Acoustic Models Using Variational Expectation-Maximization
Pierre L. Dognin, John R. Hershey, Vaibhava Goel, Peder A. Olsen; IBM T.J. Watson Research Center, USA
Mon-Ses2-P3-1, Time: 13:30
In probabilistic modeling, it is often useful to change the structure of, or refactor, a model, so that it has a different number of components, different parameter sharing, or other constraints. For example, we may wish to find a Gaussian mixture model (GMM) with fewer components that best approximates a reference model. Maximizing the likelihood of the refactored model under the reference model is equivalent to minimizing their KL divergence. For GMMs, this optimization is not analytically tractable. However, a lower bound to the likelihood can be maximized using a variational expectation-maximization algorithm. Automatic speech recognition provides a good framework to test the validity of such methods, because we can train reference models of any given size for comparison with refactored models. We show that we can efficiently reduce model size by 50%, with the same recognition performance as the corresponding model trained from data.

Investigations on Convex Optimization Using Log-Linear HMMs for Digit String Recognition
Georg Heigold, David Rybach, Ralf Schlüter, Hermann Ney; RWTH Aachen University, Germany
Mon-Ses2-P3-2, Time: 13:30
Discriminative methods are an important technique to refine the acoustic model in speech recognition. Conventional discriminative training is initialized with some baseline model and the parameters are re-estimated in a separate step. This approach has proven to be successful, but it includes many heuristics, approximations, and parameters to be tuned. This tuning involves much engineering and makes it difficult to reproduce and compare experiments. In contrast to conventional training, convex optimization techniques provide a sound approach to estimate all model parameters from scratch. Such a straightforward approach hopefully dispenses with additional heuristics, e.g. scaling of posteriors. This paper addresses the question of how well this concept using log-linear models carries over to practice. Experimental results are reported for a digit string recognition task, which allows for the investigation of this issue without approximations.

Investigations on Discriminative Training in Large Scale Acoustic Model Estimation
Janne Pylkkönen; Helsinki University of Technology, Finland
Mon-Ses2-P3-3, Time: 13:30
In this paper two common discriminative training criteria, maximum mutual information (MMI) and minimum phone error (MPE), are investigated. Two main issues are addressed: sensitivity to different lattice segmentations and the contribution of the parameter estimation method. It is noted that MMI and MPE may benefit from different lattice segmentation strategies. The use of discriminative criterion values as the measure of model goodness is shown to be problematic, as the recognition results do not correlate well with these measures. Moreover, the parameter estimation method clearly affects the recognition performance of the model irrespective of the value of the discriminative criterion. The dependence on the recognition task is also demonstrated by example with two Finnish large vocabulary dictation tasks used in the experiments.
Margin-Space Integration of MPE Loss via Differencing of MMI Functionals for Generalized Error-Weighted Discriminative Training Erik McDermott, Shinji Watanabe, Atsushi Nakamura; NTT Corporation, Japan Mon-Ses2-P3-4, Time: 13:30 Using the central observation that margin-based weighted classification error (modeled using Minimum Phone Error (MPE)) corresponds to the derivative with respect to the margin term of margin-based hinge loss (modeled using Maximum Mutual Information (MMI)), this article subsumes and extends marginbased MPE and MMI within a broader framework in which the objective function is an integral of MPE loss over a range of margin values. Applying the Fundamental Theorem of Calculus, this integral is easily evaluated using finite differences of MMI functionals; lattice-based training using the new criterion can then be carried out using differences of MMI gradients. Experimental results comparing the new framework with margin-based MMI, MCE and MPE on the Corpus of Spontaneous Japanese and the MIT OpenCourseWare/MIT-World corpus are presented. Compacting Discriminative Feature Space Transforms for Embedded Devices Etienne Marcheret 1 , Jia-Yu Chen 2 , Petr Fousek 3 , Peder A. Olsen 1 , Vaibhava Goel 1 ; 1 IBM T.J. Watson Research Center, USA; 2 University of Illinois at Urbana-Champaign, USA; 3 IBM Research, Czech Republic Mon-Ses2-P3-5, Time: 13:30 Discriminative training of the feature space using the minimum phone error objective function has been shown to yield remarkable accuracy improvements. These gains, however, come at a high cost of memory. In this paper we present techniques that maintain fMPE performance while reducing the required memory by approximately 94%. This is achieved by designing a quantization methodology which minimizes the error between the true fMPE computation and that produced with the quantized parameters. Also illustrated is a Viterbi search over the allocation of quantization levels, providing a framework for optimal non-uniform allocation of quantization levels over the dimensions of the fMPE feature vector. This provides an additional 8% relative reduction in required memory with no loss in recognition accuracy. A Back-Off Discriminative Acoustic Model for Automatic Speech Recognition Hung-An Chang, James R. Glass; MIT, USA Mon-Ses2-P3-6, Time: 13:30 In this paper we propose a back-off discriminative acoustic model for Automatic Speech Recognition (ASR). We use a set of broad phonetic classes to divide the classification problem originating from context-dependent modeling into a set of sub-problems. By appropriately combining the scores from classifiers designed for the sub-problems, we can guarantee that the back-off acoustic score for different context-dependent units will be different. The back-off model can be combined with discriminative training algorithms to further improve the performance. Experimental results on a large vocabulary lecture transcription task show that the proposed back-off discriminative acoustic model has more than a 2.0% absolute word error rate reduction compared to clustering-based acoustic model. Efficient Generation and Use of MLP Features for Arabic Speech Recognition J. Park, F. Diehl, M.J.F. Gales, M. Tomalin, P.C. Woodland; University of Cambridge, UK Mon-Ses2-P3-7, Time: 13:30 Front-end features computed using Multi-Layer Perceptrons (MLPs) have recently attracted much interest, but are a challenge to scale to large networks and very large training data sets. 
This paper discusses methods to reduce the training time for the generation of MLP features and their use in an ASR system, using a variety of techniques: parallel training of a set of MLPs on different data sub-sets; methods for computing features from a combination of these networks; and rapid discriminative training of HMMs using MLP-based features. The impact on MLP frame-based accuracy using different training strategies is discussed, along with the effect on word error rates from incorporating the MLP features in various configurations into an Arabic broadcast audio transcription system.

A Study of Bootstrapping with Multiple Acoustic Features for Improved Automatic Speech Recognition
Xiaodong Cui, Jian Xue, Bing Xiang, Bowen Zhou; IBM T.J. Watson Research Center, USA
Mon-Ses2-P3-8, Time: 13:30
This paper investigates a scheme of bootstrapping with multiple acoustic features (MFCC, PLP and LPCC) to improve the overall performance of automatic speech recognition. In this scheme, a Gaussian mixture distribution is estimated for each type of feature resampled in each HMM state by single-pass retraining on a shared decision tree. The acoustic models thus obtained from the multiple features are combined by likelihood averaging during decoding. Experiments on large vocabulary spontaneous speech recognition show overall performance superior to the best of the acoustic models built from individual features. It also achieves comparable performance to Recognizer Output Voting Error Reduction (ROVER) with computational advantages.

Analysis of Low-Resource Acoustic Model Self-Training
Scott Novotney, Richard Schwartz; BBN Technologies, USA
Mon-Ses2-P3-9, Time: 13:30
Previous work on self-training of acoustic models using unlabeled data reported significant reductions in WER assuming a large phonetic dictionary was available. We now assume only those words from ten hours of speech are initially available. Subsequently, we are given a large vocabulary and then quantify the value of repeating self-training with this larger dictionary. This experiment is used to analyze the effects of self-training on categories of words. We report the following findings: (i) Although the small 5k vocabulary raises WER by 2% absolute, self-training is equally effective as using a large 75k vocabulary. (ii) Adding all 75k words to the decoding vocabulary after self-training reduces the WER degradation to only 0.8% absolute. (iii) Self-training most benefits those words in the unlabeled audio but not transcribed by a wide margin.

Log-Linear Model Combination with Word-Dependent Scaling Factors
Björn Hoffmeister, Ruoying Liang, Ralf Schlüter, Hermann Ney; RWTH Aachen University, Germany
Mon-Ses2-P3-10, Time: 13:30
Log-linear model combination is the standard approach in LVCSR to combine several knowledge sources, usually an acoustic and a language model. Instead of using a single scaling factor per knowledge source, we make the scaling factor word- and pronunciation-dependent. In this work, we combine three acoustic models, a pronunciation model, and a language model for a Mandarin BN/BC task. The achieved error rate reduction of 2% relative is small but consistent for two test sets. An analysis of the results shows that the major contribution comes from the improved interdependency of language and acoustic model.
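The basic mechanics of log-linear combination with a word-dependent scale can be shown in a few lines; the scale values and scores below are made-up numbers for illustration, and a conventional system would use one global language-model scale instead of the per-word table.

```python
import math

# Word-dependent scaling factors for the language-model score (assumed values).
word_lm_scale = {"hello": 0.9, "world": 1.2}
default_lm_scale = 1.0
acoustic_scale = 1.0

def combined_score(hypothesis):
    """hypothesis: list of (word, acoustic_logprob, lm_logprob) triples.

    Returns the log-linear combination, with the LM contribution scaled
    per word instead of by a single global factor.
    """
    total = 0.0
    for word, am_lp, lm_lp in hypothesis:
        scale = word_lm_scale.get(word, default_lm_scale)
        total += acoustic_scale * am_lp + scale * lm_lp
    return total

hyp = [("hello", math.log(0.2), math.log(0.05)),
       ("world", math.log(0.1), math.log(0.02))]
print(combined_score(hyp))
```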
Mon-Ses2-P4 : Spoken Dialogue Systems
Hewison Hall, 13:30, Monday 7 Sept 2009
Chair: Dilek Hakkani-Tür, ICSI, USA

Enabling a User to Specify an Item at Any Time During System Enumeration — Item Identification for Barge-In-Able Conversational Dialogue Systems
Kyoko Matsuyama, Kazunori Komatani, Tetsuya Ogata, Hiroshi G. Okuno; Kyoto University, Japan
Mon-Ses2-P4-1, Time: 13:30
In conversational dialogue systems, users prefer to speak at any time and to use natural expressions. We have developed an Independent Component Analysis (ICA) based semi-blind source separation method, which allows users to barge in over system utterances at any time. We have also created a novel method that uses timing information derived from barge-in utterances to identify the item a user indicates during system enumeration. First, we determine the timing distribution of user utterances containing referential expressions and approximate it using a gamma distribution. Second, we represent both the utterance timing and the automatic speech recognition (ASR) results as probabilities of the desired selection from the system's enumeration. We then integrate these two probabilities to identify the item having the maximum likelihood of selection. Experimental results on 400 utterances indicated that our method outperformed two baseline methods (one using ASR results only and one using utterance timing only) in identification accuracy.

System Request Detection in Human Conversation Based on Multi-Resolution Gabor Wavelet Features
Tomoyuki Yamagata, Tetsuya Takiguchi, Yasuo Ariki; Kobe University, Japan
Mon-Ses2-P4-2, Time: 13:30
For a hands-free speech interface, it is important to detect commands in spontaneous utterances. Usual voice activity detection systems can only distinguish speech frames from non-speech frames, but they cannot discriminate whether a detected speech section is a command for the system or not. In this paper, in order to analyze the difference between system requests and spontaneous utterances, we focus on long-period fluctuations, such as prosodic articulation, and short-period fluctuations, such as phoneme articulation. Multi-resolution analysis using Gabor wavelets on a log-scale mel-frequency filter-bank clarifies the different characteristics of system commands and spontaneous utterances. Experiments using our robot dialog corpus show that the accuracy of the proposed method is 92.6% in F-measure, while that of the conventional power- and prosody-based method is just 66.7%.

Using Graphical Models for Mixed-Initiative Dialog Management Systems with Realtime Policies
Stefan Schwärzler, Stefan Maier, Joachim Schenk, Frank Wallhoff, Gerhard Rigoll; Technische Universität München, Germany
Mon-Ses2-P4-3, Time: 13:30
In this paper, we present a novel approach for dialog modeling which extends the idea underlying partially observable Markov Decision Processes (POMDPs), i.e. it allows the dialog policy to be calculated in real time and thereby increases the system's flexibility. The use of statistical dialog models is particularly advantageous for reacting adequately to common errors of speech recognition systems. Comparing our results to the reference system (POMDP), we achieve a relative reduction of 31.6% in the average dialog length. Furthermore, the proposed system shows a relative enhancement of 64.4% in the sensitivity rate of its error recognition capabilities at the same specificity rate in both systems. The achieved results are based on the Air Travelling Information System with 21650 user utterances in 1585 natural spoken dialogs.
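The item-identification scheme of Matsuyama et al. (Mon-Ses2-P4-1) can be pictured as multiplying a gamma-distributed barge-in-delay probability with per-item ASR probabilities and taking the maximum. The following sketch assumes illustrative gamma parameters, onset times and probabilities; it is not the estimation procedure used in the paper.

```python
import numpy as np
from scipy.stats import gamma

def identify_item(barge_in_time, item_onsets, asr_probs, shape=2.0, scale=0.5):
    """Combine utterance-timing and ASR evidence to pick the enumerated item
    the user most likely referred to. `item_onsets` are the times (s) at which
    the system started reading each item; `asr_probs` are per-item probabilities
    from the recognizer. The gamma parameters here are placeholders."""
    delays = barge_in_time - np.asarray(item_onsets, dtype=float)
    timing_probs = np.where(delays > 0, gamma.pdf(delays, a=shape, scale=scale), 0.0)
    posterior = timing_probs * np.asarray(asr_probs, dtype=float)
    if posterior.sum() == 0:
        return int(np.argmax(asr_probs))   # fall back to ASR evidence alone
    return int(np.argmax(posterior))

if __name__ == "__main__":
    onsets = [0.0, 1.5, 3.0, 4.5]      # system starts reading items 1-4
    asr = [0.10, 0.20, 0.55, 0.15]     # recognizer's per-item probabilities
    print(identify_item(barge_in_time=3.4, item_onsets=onsets, asr_probs=asr))
```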
Conversation Robot Participating in and Activating a Group Communication
Shinya Fujie, Yoichi Matsuyama, Hikaru Taniyama, Tetsunori Kobayashi; Waseda University, Japan
Mon-Ses2-P4-4, Time: 13:30
As a new type of application of the conversation system, a robot that activates the communication of the other parties has been developed. The robot participates in a quiz game with other participants and tries to activate the game. The functions installed in the robot are as follows: (1) the robot can participate in a group communication using its basic group conversation function; (2) the robot can perform the game according to the rules of the game; (3) the robot can activate communication using appropriate actions depending on the game situation and the participants' situations. We conducted a real field experiment: the prototype system performed a quiz game with elderly people in an adult day-care center. The robot successfully entertained the people during its one-hour demonstration.

Recent Advances in WFST-Based Dialog System
Chiori Hori, Kiyonori Ohtake, Teruhisa Misu, Hideki Kashioka, Satoshi Nakamura; NICT, Japan
Mon-Ses2-P4-5, Time: 13:30
To construct an expandable and adaptable dialog system which handles multiple tasks, we proposed a dialog system using a weighted finite-state transducer (WFST), in which user concept tags and system action tags are the input and output of the transducer, respectively. To test the potential of the WFST-based dialog management (DM) platform using statistical DM models, we constructed a dialog system using a human-to-human spoken dialog corpus for hotel reservation, which is annotated with the Interchange Format (IF). A scenario WFST and a spoken language understanding (SLU) WFST were obtained from the corpus and then composed together and optimized. We evaluated the detection accuracy of the system's next actions. In this paper, we focus on how WFST optimization operations contribute to the performance of the system. In addition, we have constructed a full WFST-based dialog system by composing SLU, scenario and sentence generation (SG) WFSTs. We show an example of a hotel reservation dialog with the fully composed system and discuss future work.

A Statistical Dialog Manager for the LUNA Project
David Griol 1 , Giuseppe Riccardi 2 , Emilio Sanchis 3 ; 1 Universidad Carlos III de Madrid, Spain; 2 Università di Trento, Italy; 3 Universidad Politécnica de Valencia, Spain
Mon-Ses2-P4-6, Time: 13:30
In this paper, we present an approach for the development of a statistical dialog manager in which the system response is selected by means of a classification process that considers the complete previous history of the dialog. In particular, we use decision trees for its implementation. The statistical model is automatically learned from training data which are labeled in terms of different SLU features. This methodology has been applied to develop a dialog manager within the framework of the European LUNA project, whose main goal is the creation of a robust natural spoken language understanding system. We present an evaluation of this approach for both human-machine and human-human conversations acquired in this project. We demonstrate that a statistical dialog manager developed with the proposed technique and learned from a corpus of human-machine dialogs can successfully infer the task-related topics present in spontaneous human-human dialogs.

A Policy-Switching Learning Approach for Adaptive Spoken Dialogue Agents
Heriberto Cuayáhuitl, Juventino Montiel-Hernández; Autonomous University of Tlaxcala, Mexico
Mon-Ses2-P4-7, Time: 13:30
The reinforcement learning paradigm has been adopted for inferring optimized and adaptive spoken dialogue agents. Such agents are typically learnt and tested without combining competing agents that may yield better performance at some points in the conversation. This paper presents an approach that learns dialogue behaviour from competing agents — switching from one policy to another competing one — on a previously proposed hierarchical learning framework. This policy-switching approach was investigated using a simulated flight booking dialogue system based on different types of information request. Experimental results showed that the agent induced using the proposed policy-switching approach yielded 8.2% fewer system actions than three baselines with a fixed type of information request. This result suggests that the proposed approach is useful for learning adaptive and scalable spoken dialogue agents.

Strategies for Accelerating the Design of Dialogue Applications using Heuristic Information from the Backend Database
L.F. D’Haro 1 , R. Cordoba 1 , R. San-Segundo 1 , J. Macias-Guarasa 2 , J.M. Pardo 1 ; 1 Universidad Politécnica de Madrid, Spain; 2 Universidad de Alcalá, Spain
Mon-Ses2-P4-8, Time: 13:30
Current commercial and academic platforms for developing spoken dialogue applications lack acceleration strategies that use heuristic information from the contents or structure of the backend database to speed up the definition of the dialogue flow. In this paper we describe our attempts to take advantage of these information sources using the following strategies: the quick creation of classes and attributes to define the data model structure; the semi-automatic generation and debugging of database access functions; the automatic proposal of the slots that should preferably be requested using mixed-initiative forms, or of the slots that are better requested one by one using directed forms; and the generation of automatic state proposals to specify the transition network that defines the dialogue flow. Subjective and objective evaluations confirm the advantages of using the proposed strategies to simplify the design, and the high acceptance of the platform and its acceleration strategies.

Feature-Based Summary Space for Stochastic Dialogue Modeling with Hierarchical Semantic Frames
Florian Pinault, Fabrice Lefèvre, Renato De Mori; LIA, France
Mon-Ses2-P4-9, Time: 13:30
In a spoken dialogue system, the dialogue manager needs to make decisions in a highly noisy environment, mainly due to speech recognition and understanding errors. This work addresses this issue by proposing a framework to interface efficient probabilistic modeling for both the spoken language understanding module and the dialogue management module. First, hierarchical semantic frames are inferred and composed so as to build a thorough representation of the semantics of the user's utterance. Then this representation is mapped into a feature-based summary space in which the set of dialogue states used by the stochastic dialogue manager, based on the partially observable Markov decision process (POMDP) paradigm, is defined. This allows the dialogue course to be planned while taking into account the uncertainty about the current dialogue state, and tractability is ensured by the use of an intermediate summary space.
A preliminary implementation of such a system is presented for the Media domain. The task is tourist information and hotel booking, and the availability of WoZ data allows us to consider a model-based approach to the POMDP dialogue manager.

Language Modeling and Dialog Management for Address Recognition
Rajesh Balchandran, Leonid Rachevsky, Larry Sansone; IBM T.J. Watson Research Center, USA
Mon-Ses2-P4-10, Time: 13:30
This paper describes a language modeling and dialog management system for efficient and robust recognition of several arbitrarily ordered and inter-related components from very large datasets — such as a complete address specified in a single sentence with the address components in their natural sequence. A new two-pass speech recognition technique based on using multiple language models with embedded grammars is presented. Tests with this technique on a complete address recognition task yielded good results, and memory and CPU requirements are sufficiently low to make the technique viable for embedded environments. Additionally, a goal-oriented algorithm for dialog-based error recovery and disambiguation, which does not require manual identification of all possible dialog situations, is also presented. The combined system yields very high task completion accuracy for only a few additional turns of interaction.

The MonAMI Reminder: A Spoken Dialogue System for Face-to-Face Interaction
Jonas Beskow, Jens Edlund, Björn Granström, Joakim Gustafson, Gabriel Skantze, Helena Tobiasson; KTH, Sweden
Mon-Ses2-P4-12, Time: 13:30
We describe the MonAMI Reminder, a multimodal spoken dialogue system which can assist elderly and disabled people in organising and initiating their daily activities. Based on deep interviews with potential users, we have designed a calendar and reminder application which uses an innovative mix of an embodied conversational agent, digital pen and paper, and the web to meet the needs of those users as well as the current constraints of speech technology. We also explore the use of head pose tracking for interaction and attention control in human-computer face-to-face interaction.

Influence of Training on Direct and Indirect Measures for the Evaluation of Multimodal Systems
Julia Seebode 1 , Stefan Schaffer 1 , Ina Wechsung 2 , Florian Metze 3 ; 1 Technische Universität Berlin, Germany; 2 Deutsche Telekom Laboratories, Germany; 3 Carnegie Mellon University, USA
Mon-Ses2-P4-13, Time: 13:30
Finding suitable evaluation methods is an indispensable task during the development of new user interfaces, as no standardized approach has so far been established, especially for multimodal interfaces. In the current study, we used several data sources (direct and indirect measurements) to evaluate a multimodal version of an information system, tested on trained and untrained users. We investigated the extent to which the different types of data showed concordance concerning the perceived quality of the system, in order to derive clues as to the suitability of the respective evaluation methods. The aim was to examine whether widely used methods not originally developed for multimodal interfaces are appropriate under these conditions, and to derive new evaluation paradigms.
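To make the feature-based summary space of Pinault et al. (Mon-Ses2-P4-9, above) more concrete, the sketch below maps a list of semantic-frame hypotheses and the last system action onto a small fixed-length vector. The frame container and the four features chosen here are hypothetical; the paper's actual summary features are not reproduced.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

@dataclass
class SemanticFrame:
    # Hypothetical container for one SLU hypothesis: slot name -> (value, confidence).
    slots: Dict[str, Tuple[str, float]]

def summary_features(frames: List[SemanticFrame], last_system_act: str) -> List[float]:
    """Map a rich dialogue state (ranked semantic-frame hypotheses plus context)
    to a small, fixed-length summary vector that a POMDP policy can tractably
    work with. The chosen features are illustrative only."""
    if not frames:
        return [0.0, 0.0, 0.0, 0.0]
    top = frames[0]
    top_conf = max((conf for _, conf in top.slots.values()), default=0.0)
    n_filled = float(len(top.slots))
    n_hyps = float(len(frames))
    is_confirm = 1.0 if last_system_act == "confirm" else 0.0
    return [top_conf, n_filled, n_hyps, is_confirm]

if __name__ == "__main__":
    hyp = SemanticFrame(slots={"city": ("Brighton", 0.82), "stars": ("3", 0.40)})
    print(summary_features([hyp], last_system_act="ask_city"))
```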
A Framework for Rapid Development of Conversational Natural Language Call Routing Systems for Call Centers
Ea-Ee Jan, Hong-Kwang Kuo, Osamuyimen Stewart, David Lubensky; IBM T.J. Watson Research Center, USA
Mon-Ses2-P4-11, Time: 13:30
A framework for rapid development of conversational natural language call routing systems is proposed. The framework cuts costs by using only scantily prepared business requirements to automatically create an initial prototype. Besides clear targets (terminal routing classes), vague targets, which are variations of users' incomplete (semantically overlapping) sentences, are enumerated. The vague targets can be derived from the confusion set of the semantic tokens of the clear targets. Also automatically generated for each vague target is a disambiguation dialogue module, which consists of a prompt and grammar to guide the user from a vague target to one of its associated clear targets. In the final analysis, our framework is able to reduce the human labor associated with developing an initial natural language call routing system from a few weeks to just a few hours. Experimental results from a deployed pilot system support the feasibility of the proposed approach.

Talking Heads for Interacting with Spoken Dialog Smart-Home Systems
Christine Kühnel, Benjamin Weiss, Sebastian Möller; Deutsche Telekom Laboratories, Germany
Mon-Ses2-P4-14, Time: 13:30
In this paper the relation between the quality of a talking head as an output component of a spoken dialog system and the quality of the system itself is investigated. Results show that the quality of the talking head indeed has an important impact on system quality. The quality of the talking head itself is found to be influenced by visual quality, speech quality and the synchronization of voice and lip movement.

Speech Generation from Hand Gestures Based on Space Mapping
Aki Kunikoshi, Yu Qiao, Nobuaki Minematsu, Keikichi Hirose; University of Tokyo, Japan
Mon-Ses2-P4-15, Time: 13:30
Individuals with speaking disabilities, particularly people suffering from dysarthria, often use a TTS synthesizer for speech communication. Since users always have to type sound symbols and the synthesizer reads them out in a monotonous style, current synthesizers usually make real-time operation and lively communication difficult. This is why dysarthric users often fail to control the flow of conversation. In this paper, we propose a novel speech generation framework which uses hand gestures as input. Whereas speech is usually generated from tongue gesture transitions, we have developed a special glove with which speech sounds are generated from hand gesture transitions. For development, GMM-based voice conversion techniques (mapping techniques) are applied to estimate a mapping function between a space of hand gestures and a space of speech sounds. As an initial trial, a mapping between hand gestures and Japanese vowel sounds is estimated so that the topological features of the selected gestures in a feature space and those of the five Japanese vowels in a cepstrum space are equalized. Experiments show that the special glove can generate good Japanese vowel transitions with voluntary control of duration and articulation.
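The gesture-to-speech mapping used by Kunikoshi et al. (Mon-Ses2-P4-15) follows the usual GMM-based conversion recipe: fit a joint GMM over paired source and target features, then predict the conditional expectation of the target given the source. The sketch below shows that general recipe on synthetic data; the dimensions, component count and training data are placeholders rather than the paper's setup.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_joint_gmm(x, y, n_components=4, seed=0):
    """Fit a joint GMM on stacked [source, target] feature vectors."""
    z = np.hstack([x, y])
    return GaussianMixture(n_components=n_components, covariance_type="full",
                           random_state=seed).fit(z)

def gmm_map(gmm, x, dim_x):
    """Minimum mean-square-error mapping E[y|x] under the joint GMM,
    i.e. the standard GMM-based conversion formula."""
    x = np.atleast_2d(x)
    means_x, means_y = gmm.means_[:, :dim_x], gmm.means_[:, dim_x:]
    cov_xx = gmm.covariances_[:, :dim_x, :dim_x]
    cov_yx = gmm.covariances_[:, dim_x:, :dim_x]
    # Responsibilities p(k|x) from the marginal GMM over x.
    log_resp = []
    for k in range(gmm.n_components):
        diff = x - means_x[k]
        inv = np.linalg.inv(cov_xx[k])
        _, logdet = np.linalg.slogdet(cov_xx[k])
        ll = -0.5 * (np.sum(diff @ inv * diff, axis=1) + logdet + dim_x * np.log(2 * np.pi))
        log_resp.append(np.log(gmm.weights_[k]) + ll)
    log_resp = np.stack(log_resp, axis=1)
    resp = np.exp(log_resp - log_resp.max(axis=1, keepdims=True))
    resp /= resp.sum(axis=1, keepdims=True)
    # Responsibility-weighted component-wise conditional means.
    y_hat = np.zeros((x.shape[0], means_y.shape[1]))
    for k in range(gmm.n_components):
        cond = means_y[k] + (x - means_x[k]) @ np.linalg.inv(cov_xx[k]) @ cov_yx[k].T
        y_hat += resp[:, [k]] * cond
    return y_hat

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    gestures = rng.normal(size=(500, 3))                       # toy 3-D glove features
    speech = gestures @ rng.normal(size=(3, 2)) + 0.1 * rng.normal(size=(500, 2))
    gmm = fit_joint_gmm(gestures, speech)
    print(gmm_map(gmm, gestures[:2], dim_x=3))
```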
Mon-Ses2-S1 : Special Session: INTERSPEECH 2009 Emotion Challenge
Ainsworth (East Wing 4), 13:30, Monday 7 Sept 2009
Chair: Björn Schuller, Technische Universität München, Germany

The INTERSPEECH 2009 Emotion Challenge
Björn Schuller 1 , Stefan Steidl 2 , Anton Batliner 2 ; 1 Technische Universität München, Germany; 2 FAU Erlangen-Nürnberg, Germany
Mon-Ses2-S1-1, Time: 13:30
The last decade has seen a substantial body of literature on the recognition of emotion from speech. However, in comparison to related speech processing tasks such as Automatic Speech and Speaker Recognition, practically no standardised corpora and test-conditions exist to compare performances under exactly the same conditions. Instead a multiplicity of evaluation strategies employed — such as cross-validation or percentage splits without proper instance definition — prevents exact reproducibility. Further, in order to face more realistic scenarios, the community is in desperate need of more spontaneous and less prototypical data. This INTERSPEECH 2009 Emotion Challenge aims at bridging such gaps between excellent research on human emotion recognition from speech and low compatibility of results. The FAU Aibo Emotion Corpus [1] serves as basis with clearly defined test and training partitions incorporating speaker independence and different room acoustics as needed in most real-life settings. This paper introduces the challenge, the corpus, the features, and benchmark results of two popular approaches towards emotion recognition from speech.

Exploring the Benefits of Discretization of Acoustic Features for Speech Emotion Recognition
Thurid Vogt, Elisabeth André; Universität Augsburg, Germany
Mon-Ses2-S1-2, Time: 13:40
We present a contribution to the Open Performance sub-challenge of the INTERSPEECH 2009 Emotion Challenge. We evaluate the feature extraction and classifier of EmoVoice, our framework for real-time emotion recognition from voice, on the challenge database and achieve competitive results. Furthermore, we explore the benefits of discretizing numeric acoustic features and find it beneficial in a multi-class task.

Emotion Recognition Using a Hierarchical Binary Decision Tree Approach
Chi-Chun Lee, Emily Mower, Carlos Busso, Sungbok Lee, Shrikanth S. Narayanan; University of Southern California, USA
Mon-Ses2-S1-3, Time: 13:50
Emotion state tracking is an important aspect of human-computer and human-robot interaction. It is important to design task specific emotion recognition systems for real-world applications. In this work, we propose a hierarchical structure loosely motivated by Appraisal Theory for emotion recognition. The levels in the hierarchical structure are carefully designed to place the easier classification task at the top level and delay the decision between highly ambiguous classes to the end. The proposed structure maps an input utterance into one of the five emotion classes through subsequent layers of binary classifications. We obtain a balanced recall on each of the individual emotion classes using this hierarchical structure. The performance measure of the average unweighted recall percentage on the evaluation data set improves by 3.3% absolute (8.8% relative) over the baseline model.

Improving Automatic Emotion Recognition from Speech Signals
Elif Bozkurt 1 , Engin Erzin 1 , Çiǧdem Eroǧlu Erdem 2 , A. Tanju Erdem 3 ; 1 Koç University, Turkey; 2 Bahçeşehir University, Turkey; 3 Özyeğin University, Turkey
Mon-Ses2-S1-4, Time: 14:00
We present a speech signal driven emotion recognition system. Our system is trained and tested with the INTERSPEECH 2009 Emotion Challenge corpus, which includes spontaneous and emotionally rich recordings. The challenge includes classifier and feature sub-challenges with five-class and two-class classification problems. We investigate prosody related, spectral and HMM-based features for the evaluation of emotion recognition with Gaussian mixture model (GMM) based classifiers.
Spectral features consist of mel-scale cepstral coefficients (MFCC), line spectral frequency (LSF) features and their derivatives, whereas prosody-related features consist of mean normalized values of pitch, the first derivative of pitch, and intensity. Unsupervised training of HMM structures is employed to define prosody-related temporal features for the emotion recognition problem. We also investigate data fusion of different features and decision fusion of different classifiers, which have not been well studied in the emotion recognition framework. Experimental results of automatic emotion recognition with the INTERSPEECH 2009 Emotion Challenge corpus are presented.

GTM-URL Contribution to the INTERSPEECH 2009 Emotion Challenge
Santiago Planet, Ignasi Iriondo, Joan Claudi Socoró, Carlos Monzo, Jordi Adell; Universitat Ramon Llull, Spain
Mon-Ses2-S1-5, Time: 14:10
This paper describes our participation in the INTERSPEECH 2009 Emotion Challenge [1]. Starting from our previous experience in the use of automatic classification for the validation of an expressive corpus, we have tackled the difficult task of emotion recognition from speech with real-life data. Our main contribution to this work is related to the Classifier Sub-Challenge, for which we tested several classification strategies. On the whole, the results were slightly worse than or similar to the baseline, but we found some configurations that could be considered in future implementations.

Combining Spectral and Prosodic Information for Emotion Recognition in the Interspeech 2009 Emotion Challenge
Iker Luengo, Eva Navas, Inmaculada Hernáez; University of the Basque Country, Spain
Mon-Ses2-S1-6, Time: 14:20
This paper describes the system presented at the Interspeech 2009 Emotion Challenge. It relies on both spectral and prosodic features in order to automatically detect the emotional state of the speaker. As the two kinds of features have very different characteristics, they are treated separately, creating two sub-classifiers, one using the spectral features and the other one using the prosodic ones. The results of these two classifiers are then combined with a fusion system based on Support Vector Machines.

Acoustic Emotion Recognition Using Dynamic Bayesian Networks and Multi-Space Distributions
R. Barra-Chicote 1 , Fernando Fernández 1 , S. Lutfi 1 , Juan Manuel Lucas-Cuesta 1 , J. Macias-Guarasa 2 , J.M. Montero 1 , R. San-Segundo 1 , J.M. Pardo 1 ; 1 Universidad Politécnica de Madrid, Spain; 2 Universidad de Alcalá, Spain
Mon-Ses2-S1-7, Time: 14:30
In this paper we describe the acoustic emotion recognition system built at the Speech Technology Group of the Universidad Politécnica de Madrid (Spain) to participate in the INTERSPEECH 2009 Emotion Challenge. Our proposal is based on the use of a Dynamic Bayesian Network (DBN) to deal with the temporal modelling of the emotional speech information. The selected features (MFCC, F0, energy and their variants) are modelled as different streams, and the F0-related ones are integrated under a Multi-Space Distribution (MSD) framework to properly model their dual (voiced/unvoiced) nature. Experimental evaluation on the challenge test set shows 67.06% and 38.24% unweighted recall for the 2-class and 5-class tasks, respectively. In the 2-class case, we achieve results similar to the baseline with a considerably smaller number of features. In the 5-class case, we achieve a statistically significant 6.5% relative improvement.
Emotion Classification in Children's Speech Using Fusion of Acoustic and Linguistic Features
Tim Polzehl 1 , Shiva Sundaram 1 , Hamed Ketabdar 1 , Michael Wagner 2 , Florian Metze 3 ; 1 Technische Universität Berlin, Germany; 2 University of Canberra, Australia; 3 Carnegie Mellon University, USA
Mon-Ses2-S1-8, Time: 14:40
This paper describes a system to detect angry vs. non-angry utterances of children who are engaged in dialog with an Aibo robot dog. The system was submitted to the Interspeech 2009 Emotion Challenge evaluation. The speech data consist of short utterances of the children's speech, and the proposed system is designed to detect anger in each given chunk. Frame-based cepstral features, prosodic and acoustic features as well as glottal excitation features are extracted automatically, reduced in dimensionality and classified by means of an artificial neural network and a support vector machine. An automatic speech recognizer transcribes the words in an utterance and yields a separate classification based on the degree of emotional salience of the words. Late fusion is applied to make a final decision on anger vs. non-anger of the utterance. Preliminary results show 75.9% unweighted average recall on the training data and 67.6% on the test set.

Cepstral and Long-Term Features for Emotion Recognition
Pierre Dumouchel 1 , Najim Dehak 1 , Yazid Attabi 1 , Réda Dehak 2 , Narjès Boufaden 1 ; 1 CRIM, Canada; 2 LRDE, France
Mon-Ses2-S1-9, Time: 14:50
In this paper, we describe systems that were developed for the Open Performance Sub-Challenge of the INTERSPEECH 2009 Emotion Challenge. We participate in both two-class and five-class emotion detection. For the two-class problem, the best performance is obtained by logistic regression fusion of three systems. These systems use short- and long-term speech features. Fusion led to an absolute improvement of 2.6% in the unweighted recall value compared with [1]. For the five-class problem, we submitted two individual systems: a cepstral GMM vs. a long-term GMM-UBM. The best result comes from the cepstral GMM and produces an absolute improvement of 3.5% compared to [6].

Brno University of Technology System for Interspeech 2009 Emotion Challenge
Marcel Kockmann, Lukáš Burget, Jan Černocký; Brno University of Technology, Czech Republic
Mon-Ses2-S1-10, Time: 15:00
This paper describes the Brno University of Technology (BUT) system for the Interspeech 2009 Emotion Challenge. Our submitted system for the Open Performance Sub-Challenge uses acoustic frame-based features as a front-end and Gaussian Mixture Models as a back-end. Different feature types and modeling approaches successfully applied in speaker and language recognition are investigated, and we achieve a 16% and 9% relative improvement over the best dynamic and static baseline systems on the 5-class task, respectively.

Summary of the INTERSPEECH 2009 Emotion Challenge
Time: 15:10

Awards Ceremony
Time: 15:20
Mon-Ses3-O1 : Automatic Speech Recognition: Language Models I
Main Hall, 16:00, Monday 7 Sept 2009
Chair: Keiichi Tokuda, Nagoya Institute of Technology, Japan

Back-Off Language Model Compression
Boulos Harb, Ciprian Chelba, Jeffrey Dean, Sanjay Ghemawat; Google Inc., USA
Mon-Ses3-O1-1, Time: 16:00
With the availability of large amounts of training data relevant to speech recognition scenarios, scalability becomes a very productive way to improve language model performance. We present a technique that represents a back-off n-gram language model using arrays of integer values and thus renders it amenable to effective block compression. We propose a few such compression algorithms and evaluate the resulting language model along two dimensions: memory footprint, and speed reduction relative to the uncompressed one. We experimented with a model that uses a 32-bit word vocabulary (at most 4B words) and log-probabilities/back-off-weights quantized to 1 byte, respectively. The best compression algorithm achieves 2.6 bytes/n-gram at ≈18X slower than uncompressed. For faster LM operation we found it feasible to represent the LM at ≈4.0 bytes/n-gram, and ≈3X slower than the uncompressed LM. The memory footprint of an LM containing one billion n-grams can thus be reduced to 3–4 Gbytes without impacting its speed too much.

Use of Contexts in Language Model Interpolation and Adaptation
X. Liu, M.J.F. Gales, P.C. Woodland; University of Cambridge, UK
Mon-Ses3-O1-2, Time: 16:20
Language models (LMs) are often constructed by building component models on multiple text sources to be combined using global, context-free interpolation weights. By re-adjusting these weights, LMs may be adapted to a target domain representing a particular genre, epoch or other higher-level attributes. A major limitation of this approach is that other factors that determine the “usefulness” of sources on a context-dependent basis, such as modeling resolution, generalization, topics and styles, are poorly modeled. To overcome this problem, this paper investigates a context-dependent form of LM interpolation and test-time adaptation. Depending on the context, a discrete history weighting function is used to dynamically adjust the contribution from component models. In previous research, it was used primarily for LM adaptation. In this paper, a range of schemes to combine context-dependent weights obtained from training and test data to improve LM adaptation are proposed. Consistent perplexity and error rate gains of 6% relative were obtained on a state-of-the-art broadcast recognition task.

Nonstationary Latent Dirichlet Allocation for Speech Recognition
Chuang-Hua Chueh, Jen-Tzung Chien; National Cheng Kung University, Taiwan
Mon-Ses3-O1-3, Time: 16:40
Latent Dirichlet allocation (LDA) has been successful for document modeling. LDA extracts the latent topics across documents. Words in a document are generated by the same topic distribution. However, in real-world documents, the usage of words in different paragraphs is varied and accompanied by different writing styles. This study extends LDA to cope with the variations of topic information within a document. We build the nonstationary LDA (NLDA) by incorporating a Markov chain which is used to detect the stylistic segments in a document. Each segment corresponds to a particular style in the composition of a document. NLDA can exploit the topic information between documents as well as the word variations within a document. We accordingly establish a Viterbi-based variational Bayesian procedure. A language model adaptation scheme using NLDA is developed for speech recognition. Experimental results show improvement of NLDA over LDA in terms of perplexity and word error rate.

Exploiting Chinese Character Models to Improve Speech Recognition Performance
J.L. Hieronymus 1 , X. Liu 2 , M.J.F. Gales 2 , P.C. Woodland 2 ; 1 NASA Ames Research Center, USA; 2 University of Cambridge, UK
Mon-Ses3-O1-4, Time: 17:00
The Chinese language is based on characters which are syllabic in nature. Since languages have syllabotactic rules which govern the construction of syllables and their allowed sequences, Chinese character sequence models can be used as a first-level approximation of allowed syllable sequences. N-gram character sequence models were trained on 4.3 billion characters. Characters are used as a first-level recognition unit with multiple pronunciations per character. For comparison, the CU-HTK Mandarin word-based system was used to recognize words which were then converted to character sequences. The character-only system error rates for one-best recognition were slightly worse than those of word-based character recognition. However, combining the two systems using log-linear combination gives better results than either system separately. An equally weighted combination gave consistent CER gains of 0.1–0.2% absolute over the word-based standard system.

Constraint Selection for Topic-Based MDI Adaptation of Language Models
Gwénolé Lecorvé, Guillaume Gravier, Pascale Sébillot; IRISA, France
Mon-Ses3-O1-5, Time: 17:20
This paper presents an unsupervised topic-based language model adaptation method which specializes the standard minimum information discrimination approach by identifying and combining topic-specific features. By acquiring a topic terminology from a thematically coherent corpus, language model adaptation is restrained to the sole probability re-estimation of n-grams ending with some topic-specific words, keeping other probabilities untouched. Experiments are carried out on a large set of spoken documents about various topics. Results show significant perplexity and recognition improvements which outperform results of classical adaptation techniques.

Improving Broadcast News Transcription with a Precision Grammar and Discriminative Reranking
Tobias Kaufmann, Thomas Ewender, Beat Pfister; ETH Zürich, Switzerland
Mon-Ses3-O1-6, Time: 17:40
We propose a new approach of integrating a precision grammar into speech recognition. The approach is based on a novel robust parsing technique and discriminative reranking. By reranking 100-best output of the LIMSI German broadcast news transcription system we achieved a significant reduction of the word error rate by 9.6% relative. To our knowledge, this is the first significant improvement for a real-world broad-domain speech recognition task due to a precision grammar.

Mon-Ses3-O2 : Phoneme-Level Perception
Jones (East Wing 1), 16:00, Monday 7 Sept 2009
Chair: Rolf Carlson, KTH, Sweden

Categorical Perception of Speech Without Stimulus Repetition
Jack C. Rogers, Matthew H.
Davis; University of Cambridge, UK Mon-Ses3-O2-1, Time: 16:00 We explored the perception of phonetic continua generated with an automated auditory morphing technique in three perceptual experiments. The use of large sets of stimuli allowed an assessment of the impact of single vs. paired presentation without the massed stimulus repetition typical of categorical perception experiments. A third experiment shows that such massed repetition alters the degree of categorical and sub-categorical discrimination possible in Notes 62 speech perception. Implications for accounts of speech perception are discussed. Perceptual Grouping of Alternating Word Pairs: Effect of Pitch Difference and Presentation Rate Nandini Iyer, Douglas S. Brungart, Brian D. Simpson; Air Force Research Laboratory, USA Non-Automaticity of Use of Orthographic Knowledge in Phoneme Evaluation Anne Cutler 1 , Chris Davis 2 , Jeesun Kim 2 ; 1 Max Planck Institute for Psycholinguistics, The Netherlands; 2 University of Western Sydney, Australia Mon-Ses3-O2-2, Time: 16:20 Two phoneme goodness rating experiments addressed the role of orthographic knowledge in the evaluation of speech sounds. Ratings for the best tokens of /s/ were higher in words spelled with S (e.g., bless) than in words where /s/ was spelled with C (e.g., voice). This difference did not appear for analogous nonwords for which every lexical neighbour had either S or C spelling (pless, floice). Models of phonemic processing incorporating obligatory influence of lexical information in phonemic processing cannot explain this dissociation; the data are consistent with models in which phonemic decisions are not subject to necessary top-down lexical influence. Learning and Generalization of Novel Contrastive Cues Meghan Sumner; Stanford University, USA Mon-Ses3-O2-3, Time: 16:40 This paper examines the learning of a novel phonetic contrast. Specifically, we examine how a contrast is learned — do speakers learn a specific property about a particular word, or do they internalize a pattern that can be applied to words of a particular type in subsequent processing? In two experiments, participants were trained to treat stop release as contrastive. Following training, participants took either a minimal pair decision or a cross-modal form priming task, both of which include trained words, untrained words with a trained rime, and novel, untrained words. The results of both experiments suggest that both strategies are used in learning — listeners generalize to words with similar rimes, but are unable to extend this knowledge to novel words. Mon-Ses3-O2-5, Time: 17:20 When listeners hear sequences of tones that slowly alternate between a low frequency and a slightly higher frequency, they tend to report hearing a single stream of alternating tones. However, when the alternation rate and/or the frequency difference increases, they often report hearing two distinct streams: a slowly pulsing high and low frequency stream. This experiment used repeating sequences of spondees to investigate whether a similar streaming phenomenon might occur for speech stimuli. The F0 difference between every other word was varied from 0–18 semitones. Each word was either 100 or 125 ms in duration. The inter-onset intervals (IOIs) of the individual words were varied from 100–300 ms. The spondees were selected in such a way that listeners who perceived a single stream of sequential words would report hearing a different set of spondees than ones who perceived two distinct streams grouped by frequency. 
As expected, F0 differences was a strong cue for sequential segregation. Moreover, the number of ‘two’ stream judgments were greater at smaller IOIs, suggesting that factors that influence the obligatory streaming of tonal signals are also important in the segregation of speech signals. Comparing Methods to Find a Best Exemplar in a Multidimensional Space Titia Benders, Paul Boersma; University of Amsterdam, The Netherlands Mon-Ses3-O2-6, Time: 17:40 We present a simple algorithm for running a listening experiment aimed at finding the best exemplar in a multidimensional space. For simulated humanlike listeners, who have perception thresholds and some decision noise on their responses, the algorithm on average ends up twelve times closer than Iverson and Evans’ algorithm [1]. Vowel Category Perception Affected by Microdurational Variations Mon-Ses3-O3 : Statistical Parametric Synthesis I Einar Meister 1 , Stefan Werner 2 ; 1 Tallinn University of Technology, Estonia; 2 University of Joensuu, Finland Fallside (East Wing 2), 16:00, Monday 7 Sept 2009 Chair: Jean-François Bonastre, LIA, France Mon-Ses3-O2-4, Time: 17:00 Vowel quality perception in quantity languages is considered to be unrelated to vowel duration since duration is used to realize quantity oppositions. To test the role of microdurational variations in vowel category perception in Estonian listening experiments with synthetic stimuli were carried out, involving five vowel pairs along the close-open axis. The results show that in the case of high-mid vowel pairs vowel openness correlates positively with stimulus duration; in mid-low vowel pairs no such correlation was found. The discrepancy in the results is explained by the hypothesis that in case of shorter perceptual distances (high-mid area of vowel space) intrinsic duration plays the role of a secondary feature to enhance perceptual contrast between vowels, whereas in case of mid-low oppositions perceptual distance is large enough to guarantee the necessary perceptual contrast by spectral features alone and vowel intrinsic duration as an additional cue is not needed. Autoregressive HMMs for Speech Synthesis Matt Shannon, William Byrne; University of Cambridge, UK Mon-Ses3-O3-1, Time: 16:00 We propose the autoregressive HMM for speech synthesis. We show that the autoregressive HMM supports efficient EM parameter estimation and that we can use established effective synthesis techniques such as synthesis considering global variance with minimal modification. The autoregressive HMM uses the same model for parameter estimation and synthesis in a consistent way, in contrast to the standard HMM synthesis framework, and supports easy and efficient parameter estimation, in contrast to the trajectory HMM. We find that the autoregressive HMM gives performance comparable to the standard HMM synthesis framework on a Blizzard Challenge-style naturalness evaluation. Notes 63 Asynchronous F0 and Spectrum Modeling for HMM-Based Speech Synthesis Local Minimum Generation Error Criterion for Hybrid HMM Speech Synthesis Cheng-Cheng Wang, Zhen-Hua Ling, Li-Rong Dai; University of Science & Technology of China, China Xavi Gonzalvo 1 , Alexander Gutkin 2 , Joan Claudi Socoró 3 , Ignasi Iriondo 3 , Paul Taylor 1 ; 1 Phonetic Arts Ltd., UK; 2 Yahoo! 
Europe, UK; 3 Universitat Ramon Llull, Spain Mon-Ses3-O3-2, Time: 16:20 This paper proposes an asynchronous model structure for fundamental frequency(F0) and spectrum modeling in HMM-based parametric speech synthesis to improve the performance of F0 prediction. F0 and spectrum features are considered to be synchronous in the conventional system. Considering that the production of these two features is decided by the movement of different speech organs, an explicitly asynchronous model structure is introduced. At training stage, F0 models are training asynchronously with spectrum models. At synthesis stage, the two features are generated respectively. The objective and subjective evaluation results show the proposed method can effectively improve the accuracy of F0 prediction. A Minimum V/U Error Approach to F0 Generation in HMM-Based TTS Mon-Ses3-O3-5, Time: 17:20 This paper presents an HMM-driven hybrid speech synthesis approach in which unit selection concatenative synthesis is used to improve the quality of the statistical system using a Local Minimum Generation Error (LMGE) during the synthesis stage. The idea behind this approach is to combine the robustness due to HMMs with the naturalness of concatenated units. Unlike the conventional hybrid approaches to speech synthesis that use concatenative synthesis as a backbone, the proposed system employs stable regions of natural units to improve the statistically generated parameters. We show that this approach improves the generation of vocal tract parameters, smoothes the bad joints and increases the overall quality. Yao Qian, Frank K. Soong, Miaomiao Wang, Zhizheng Wu; Microsoft Research Asia, China Thousands of Voices for HMM-Based Speech Synthesis Mon-Ses3-O3-3, Time: 16:40 Junichi Yamagishi 1 , Bela Usabaev 2 , Simon King 1 , Oliver Watts 1 , John Dines 3 , Jilei Tian 4 , Rile Hu 4 , Yong Guan 4 , Keiichiro Oura 5 , Keiichi Tokuda 5 , Reima Karhila 6 , Mikko Kurimo 6 ; 1 University of Edinburgh, UK; 2 Universität Tübingen, Germany; 3 IDIAP Research Institute, Switzerland; 4 Nokia Research Center, China; 5 Nagoya Institute of Technology, Japan; 6 Helsinki University of Technology, Finland The HMM-based TTS can produce a highly intelligible and decent quality voice. However, HMM model degrades when feature vectors used in training are noisy. Among all noisy features, pitch tracking errors and corresponding flawed voiced/unvoiced (v/u) decisions are identified as two key factors in voice quality problems. In this paper, we propose a minimum v/u error approach to F0 generation. A prior knowledge of v/u is imposed in each Mandarin phone and accumulated v/u posterior probabilities are used to search for the optimal v/u switching point in each VU or UV segment in generation. Objectively the new approach is shown to improve v/u prediction performance, specifically on voiced to unvoiced swapping errors. They are reduced from 3.7% (baseline) down to 2.0% (new approach). The improvement is also subjectively confirmed by an AB preference test score, 72% (new approach) versus 22% (baseline). Voiced/Unvoiced Decision Algorithm for HMM-Based Speech Synthesis Shiyin Kang 1 , Zhiwei Shuang 2 , Quansheng Duan 1 , Yong Qin 2 , Lianhong Cai 1 ; 1 Tsinghua University, China; 2 IBM China Research Lab, China Mon-Ses3-O3-4, Time: 17:00 This paper introduces a novel method to improve the U/V decision method in HMM-based speech synthesis. 
In the conventional method, the U/V decision of each state is independently made, and a state in the middle of a vowel may be decided as unvoiced. In this paper, we propose to utilize the constraints of natural speech to improve the U/V decision inside a unit, such as syllable or phone. We use a GMM-based U/V change time model to select the best U/V change time in one unit, and refine the U/V decision of all states in that unit based on the selected change time. The result of a perceptual evaluation demonstrates that the proposed method can significantly improve the naturalness of the synthetic speech. Mon-Ses3-O3-6, Time: 17:40 Our recent experiments with HMM-based speech synthesis systems have demonstrated that speaker-adaptive HMM-based speech synthesis (which uses an ‘average voice model’ plus model adaptation) is robust to non-ideal speech data that are recorded under various conditions and with varying microphones, that are not perfectly clean, and/or that lack of phonetic balance. This enables us consider building high-quality voices on ‘non-TTS’ corpora such as ASR corpora. Since ASR corpora generally include a large number of speakers, this leads to the possibility of producing an enormous number of voices automatically. In this paper we show thousands of voices for HMM-based speech synthesis that we have made from several popular ASR corpora such as the Wall Street Journal databases (WSJ0/WSJ1/WSJCAM0), Resource Management, Globalphone and Speecon. We report some perceptual evaluation results and outline the outstanding issues. Mon-Ses3-O4 : Systems for Spoken Language Translation Holmes (East Wing 3), 16:00, Monday 7 Sept 2009 Chair: Hermann Ney, RWTH Aachen University, Germany Efficient Combination of Confidence Measures for Machine Translation Sylvain Raybaud, David Langlois, Kamel Smaïli; LORIA, France Mon-Ses3-O4-1, Time: 16:00 We present in this paper a twofold contribution to Confidence Measures for Machine Translation. First, in order to train and Notes 64 test confidence measures, we present a method to automatically build corpora containing realistic errors. Errors introduced into reference translation simulate classical machine translation errors (word deletion and word substitution), and are supervised by Wordnet. Second, we use SVM to combine original and classical confidence measures both at word- and sentence-level. We show that the obtained combination outperforms by 14% (absolute) our best single word-level confidence measure, and that combination of sentence-level confidence measures produces meaningful scores. Incremental Dialog Clustering for Speech-to-Speech Translation Using Syntax in Large-Scale Audio Document Translation David Stallard, Stavros Tsakalidis, Shirin Saleem; BBN Technologies, USA Mon-Ses3-O4-2, Time: 16:20 Application domains for speech-to-speech translation and dialog systems often contain sub-domains and/or task-types for which different outputs are appropriate for a given input. It would be useful to be able to automatically find such sub-domain structure in training corpora, and to classify new interactions with the system into one of these sub-domains. To this end, We present a document-clustering approach to such sub-domain classification, which uses a recently-developed algorithm based on von Mises Fisher distributions. We give preliminary perplexity reduction and MT performance results for a speech-to-speech translation system using this model. Iterative Sentence-Pair Extraction from Quasi-Parallel Corpora for Machine Translation R. 
Sarikaya, Sameer Maskey, R. Zhang, Ea-Ee Jan, D. Wang, Bhuvana Ramabhadran, S. Roukos; IBM T.J. Watson Research Center, USA Mon-Ses3-O4-3, Time: 16:40 This paper addresses parallel data extraction from the quasiparallel corpora generated in a crowd-sourcing project where ordinary people watch tv shows and movies and transcribe/translate what they hear, creating document pools in different languages. Since they do not have guidelines for naming and performing translations, it is often not clear which documents are the translations of the same show/movie and which sentences are the translations of the each other in a given document pair. We introduce a method for automatically pairing documents in two languages and extracting parallel sentences from the paired documents. The method consists of three steps: i) document pairing, ii) sentence pair alignment of the paired documents, and iii) context extrapolation to boost the sentence pair coverage. Human evaluation of the extracted data shows that 95% of the extracted sentences carry useful information for translation. Experimental results also show that using the extracted data provides significant gains over the baseline statistical machine translation system built with manually annotated data. RTTS: Towards Enterprise-Level Real-Time Speech Transcription and Translation Services Juan M. Huerta 1 , Cheng Wu 1 , Andrej Sakrajda 1 , Sasha Caskey 1 , Ea-Ee Jan 1 , Alexander Faisman 1 , Shai Ben-David 2 , Wen Liu 3 , Antonio Lee 1 , Osamuyimen Stewart 1 , Michael Frissora 1 , David Lubensky 1 ; 1 IBM T.J. Watson Research Center, USA; 2 IBM Haifa Research Lab, Israel; 3 IBM China Research Lab, China Mon-Ses3-O4-4, Time: 17:00 In this paper we describe the RTTS system for enterprise-level real time speech recognition and translation. RTTS follows a Web Service-based approach which allows the encapsulation of ASR and MT Technology components thus hiding the configuration and tuning complexities and details from the client applications while exposing a uniform interface. In this way, RTTS is capable of easily supporting a wide variety of client applications. The clients we have implemented include a VoIP-based real time speech-to-speech translation system, a chat and Instant Messaging translation System, a Transcription Server, among others. Jing Zheng 1 , Necip Fazil Ayan 1 , Wen Wang 1 , David Burkett 2 ; 1 SRI International, USA; 2 University of California at Berkeley, USA Mon-Ses3-O4-5, Time: 17:20 Recently, the use of syntax has very effectively improved machine translation (MT) quality in many text translation tasks. However, using syntax in speech translation poses additional challenges because of disfluencies and other spoken language phenomena, and of errors introduced by automatic speech recognition (ASR). In this paper, we investigate the effect of using syntax in a large-scale audio document translation task targeting broadcast news and broadcast conversations. We do so by comparing the performance of three synchronous context-free grammar based translation approaches: 1) hierarchical phrase-based translation, 2) syntaxaugmented MT, and 3) string-to-dependency MT. The results show a positive effect of explicitly using syntax when translating broadcast news, but no benefit when translating broadcast conversations. The results indicate that improving the robustness of syntactic systems against conversational language style is important to their success and requires future effort. 
Context-Driven Automatic Bilingual Movie Subtitle Alignment Andreas Tsiartas, Prasanta Kumar Ghosh, Panayiotis G. Georgiou, Shrikanth S. Narayanan; University of Southern California, USA Mon-Ses3-O4-6, Time: 17:40 Movie subtitle alignment is a potentially useful approach for deriving automatically parallel bilingual/multilingual spoken language data for automatic speech translation. In this paper, we consider the movie subtitle alignment task. We propose a distance metric between utterances of different languages based on lexical features derived from bilingual dictionaries. We use the dynamic time warping algorithm to obtain the best alignment. The best F-score of ∼0.713 is obtained using the proposed approach. Mon-Ses3-P1 : Human Speech Production I Hewison Hall, 16:00, Monday 7 Sept 2009 Chair: Shrikanth Narayanan, University of Southern California, USA Probabilistic Effects on French [t] Duration Francisco Torreira, Mirjam Ernestus; Radboud Universiteit Nijmegen, The Netherlands Mon-Ses3-P1-1, Time: 16:00 The present study shows that [t] consonants are affected by probabilistic factors in a syllable-timed language as French, and in spontaneous as well as in journalistic speech. Study 1 showed Notes 65 a word bigram frequency effect in spontaneous French, but its exact nature depended on the corpus on which the probabilistic measures were based. Study 2 investigated journalistic speech and showed an effect of the joint frequency of the test word and its following word. We discuss the possibility that these probabilistic effects are due to the speaker’s planning of upcoming words, and to the speaker’s adaptation to the listener’s needs. On the Production of Sandhi Phenomena in French: Psycholinguistic and Acoustic Data Odile Bagou, Violaine Michel, Marina Laganaro; University of Neuchâtel, Switzerland Mon-Ses3-P1-2, Time: 16:00 This preliminary study addresses two complementary questions about the production of sandhi phenomena in French. First, we investigated whether the encoding of enchaînement and liaison enchaînée involves a processing cost compared to non-resyllabified sequences. This question was analyzed with a psycholinguistic production time paradigm. The elicited sequences were then used to address our second question, namely how critical V1 CV2 sequences are phonetically realized across different boundary conditions. We compared the durational properties of critical sequences containing a word-final coda consonant (enchaînement: V1 .C#V2 ), an additional consonant (liaison enchaînée: V1 +C#V2 ) and a similar onset consonant (V1 #CV2 ). Results on production latencies suggested that the encoding of liaison enchaînée involves an additional processing cost compared to the two other boundary conditions. In addition, the acoustic analyses indicated durational differences across the three boundary conditions on V1 , C and V2 . Implications for both, psycholinguistic and phonological models are discussed. Extreme Reductions: Contraction of Disyllables into Monosyllables in Taiwan Mandarin Chierh Cheng, Yi Xu; University College London, UK Mon-Ses3-P1-3, Time: 16:00 This study investigates a severe form of segmental reduction known as contraction. In Taiwan Mandarin, a disyllabic word or phrase is often contracted into a monosyllabic unit in conversational speech, just as “do not” is often contracted into “don’t” in English. A systematic experiment was conducted to explore the underlying mechanism of such contraction. 
Preliminary results show evidence that contraction is not a categorical shift but a gradient undershoot of the articulatory target as a result of time pressure. Moreover, contraction seems to occur only beyond a certain duration threshold. These findings may further our understanding of the relation between duration and segmental reduction.

Annotation and Features of Non-Native Mandarin Tone Quality
Mitchell Peabody, Stephanie Seneff; MIT, USA
Mon-Ses3-P1-4, Time: 16:00
Native speakers of non-tonal languages, such as American English, frequently have difficulty accurately producing the tones of Mandarin Chinese. This paper describes a corpus of Mandarin Chinese spoken by non-native speakers and annotated for tone quality using a simple good/bad system. We examine the inter-rater correlation of the annotations and highlight the differences in feature distribution between native, good non-native, and bad non-native tone productions. We find that the features of tones judged by a simple majority to be bad are significantly different from the features of tones judged to be good and of tones produced by native speakers.

On-Line Formant Shifting as a Function of F0
Kateřina Chládková 1 , Paul Boersma 1 , Václav Jonáš Podlipský 2 ; 1 University of Amsterdam, The Netherlands; 2 Palacký University Olomouc, Czech Republic
Mon-Ses3-P1-5, Time: 16:00
We investigate whether there is a within-speaker effect of a higher F0 on the values of the first and the second formant. When asked to speak at a high F0, speakers turn out to raise their formants as well. In the F1 dimension this effect is greater for women than for men. We conclude that while a general formant-raising effect might be due to the physiology of a high F0 (i.e. raised larynx and shorter vocal tract), a plausible explanation for the gender-dependent size of the effect can only be found in the undersampling hypothesis.

Production Boundary Between Fricative and Affricate in Japanese and Korean Speakers
Kimiko Yamakawa 1 , Shigeaki Amano 2 , Shuichi Itahashi 1 ; 1 National Institute of Informatics, Japan; 2 NTT Corporation, Japan
Mon-Ses3-P1-6, Time: 16:00
A fricative [s] and an affricate [ts] pronounced by both native Japanese and Korean speakers were analyzed to clarify the effect of the mother language on speech production. It was revealed that Japanese speakers have a clear individual production boundary between [s] and [ts], and that this boundary corresponds to the production boundary of all Japanese speakers. In contrast, although Korean speakers tend to have a clear individual production boundary, the boundary does not correspond to that of the Japanese speakers. These facts suggest that Korean speakers tend to have a stable [s]-[ts] production boundary, but one that differs from that of Japanese speakers.

Aerodynamics of Fricative Production in European Portuguese
Cátia M.R. Pinho 1 , Luis M.T. Jesus 1 , Anna Barney 2 ; 1 Universidade de Aveiro, Portugal; 2 University of Southampton, UK
Mon-Ses3-P1-7, Time: 16:00
The characteristics of steady-state fricative production, and those of the phone preceding and following the fricative, were investigated. Aerodynamic and electroglottographic (EGG) recordings of four normal adult speakers (two females and two males), producing a speech corpus of 9 isolated words with the European Portuguese (EP) voiced fricatives /v, z, Z/ in initial, medial and final word position, and the same 9 words embedded in 42 different real EP carrier sentences, were analysed.
Multimodal data allowed the characterisation of fricatives in terms of their voicing mechanisms, based on the amplitude of oral flow, F1 excitation and fundamental frequency (F0). Contextual Effects on Protrusion and Lip Opening for /i,y/ Anne Bonneau, Julie Buquet, Brigitte Wrobel-Dautcourt; LORIA, France Mon-Ses3-P1-8, Time: 16:00 This study investigates the effect of “adverse” contexts, especially that of the consonant /S/, on labial parameters for French /i,y/. Five parameters were analysed: the height, width and area of lip opening, the distance between the corners of the mouth, as well as Notes 66 lip protrusion. Ten speakers uttered a corpus made up of isolated vowels, syllables and logatoms. A special procedure has been designed to evaluate lip opening contours. Results showed that the carry-over effect of the consonant /S/ can impede the opposition between /i/ and /y/ in the protrusion dimension, depending upon speakers. 82% of German and French mixed-lingual test sentences cannot be distinguished from natural polyglot prosody. Weighted Neural Network Ensemble Models for Speech Prosody Control Harald Romsdorfer; ETH Zürich, Switzerland Speech Rate Effects on European Portuguese Nasal Vowels Mon-Ses3-P2-2, Time: 16:00 Catarina Oliveira, Paula Martins, António Teixeira; Universidade de Aveiro, Portugal Mon-Ses3-P1-9, Time: 16:00 This paper presents new temporal information regarding the production of European Portuguese (EP) nasal vowels, based on new EMMA data. The influence of speech rate on duration of velum gestures and their coordination with consonantic and glottal gestures were analyzed. As information on relative speed of articulators is scarce, the parameter stiffness for the nasal gestures was also calculated and analyzed. Results show clear effects of speech rate on temporal characteristics of EP nasal vowels. Speech rate reduces the duration of velum gestures, increases the stiffness and inter-gestural overlap. Relation of Formants and Subglottal Resonances in Hungarian Vowels In text-to-speech synthesis systems, the quality of the predicted prosody contours influences quality and naturalness of synthetic speech. This paper presents a new statistical model for prosody control that combines an ensemble learning technique using neural networks as base learners with feature relevance determination. This weighted neural network ensemble model was applied for both, phone duration modeling and fundamental frequency modeling. A comparison with state-of-the-art prosody models based on classification and regression trees (CART), multivariate adaptive regression splines (MARS), or artificial neural networks (ANN), shows a 12% improvement compared to the best duration model and a 24% improvement compared to the best F0 model. The neural network ensemble model also outperforms another, recently presented ensemble model based on gradient tree boosting. Cross-Language F0 Modeling for Under-Resourced Tonal Languages: A Case Study on Thai-Mandarin Vataya Boonpiam, Anocha Rugchatjaroen, Chai Wutiwiwatchai; NECTEC, Thailand Tamás Gábor Csapó 1 , Zsuzsanna Bárkányi 2 , Tekla Etelka Gráczi 2 , Tamás Bőhm 1 , Steven M. Lulich 3 ; 1 BME, Hungary; 2 Hungarian Academy of Sciences, Hungary; 3 MIT, USA Mon-Ses3-P2-3, Time: 16:00 Mon-Ses3-P1-10, Time: 16:00 The relation between vowel formants and subglottal resonances (SGRs) has previously been explored in English, German, and Korean. Results from these studies indicate that vowel classes are categorically separated by SGRs. 
We extended this work to Hungarian vowels, which have not been related to SGRs before. The Hungarian vowel system contains paired long and short vowels as well as a series of front rounded vowels, similar to German but more complex than English and Korean. Results indicate that SGRs separate vowel classes in Hungarian as in English, German, and Korean, and uncover additional patterns of vowel formants relative to the third subglottal resonance (Sg3). These results have implications for understanding phonological distinctive features, and applications in automatic speech technologies. Mon-Ses3-P2 : Prosody, Text Analysis, and Multilingual Models Hewison Hall, 16:00, Monday 7 Sept 2009 Chair: Andrew Breen, Nuance Communications, Belgium This paper proposed a novel method for F0 modeling in underresourced tonal languages. Conventional statistical models require large training data which are deficient in many languages. In tonal languages, different syllabic tones are represented by different F0 shapes, some of them are similar across languages. With cross-language F0 contour mapping, we can augment the F0 model of one under-resourced language with corpora from another rich-resourced language. A case study on Thai HMM-based F0 modeling with a Mandarin corpus is explored. Comparing to baseline systems without cross-language resources, over 7% relative reduction of RMSE and significant improvement of MOS are obtained. Prosodic Issues in Synthesising Thadou, a Tibeto-Burman Tone Language Dafydd Gibbon 1 , Pramod Pandey 2 , D. Mary Kim Haokip 3 , Jolanta Bachan 4 ; 1 Universität Bielefeld, Germany; 2 Jawaharlal Nehru University, India; 3 Assam University, India; 4 Adam Mickiewicz University, Poland Mon-Ses3-P2-4, Time: 16:00 Polyglot Speech Prosody Control Harald Romsdorfer; ETH Zürich, Switzerland Mon-Ses3-P2-1, Time: 16:00 Within a polyglot text-to-speech synthesis system, the generation of an adequate prosody for mixed-lingual texts, sentences, or even words, requires a polyglot prosody model that is able to seamlessly switch between languages and that applies the same voice for all languages. This paper presents the first polyglot prosody model that fulfills these requirements and that is constructed from independent monolingual prosody models. A perceptual evaluation showed that the synthetic polyglot prosody of about The objective of the present analysis is to present linguistic constraints on the phonetic realisation of lexical tone which are relevant for the choice of a speech synthesis development strategy for a specific type of tone language. The selected case is Thadou (Tibeto-Burman), which has lexical and morphosyntactic tone as well as phonetic tone displacement. The last two constraint types differ from those in more well-known tone languages such as Mandarin, and present problems for mainstream corpus-based speech synthesis techniques. Linguistic and phonetic models and a ‘microvoice’ for rule-based tone generation are developed. 
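For readers who want a concrete picture of rule-based lexical tone generation of the kind sketched in the Gibbon et al. abstract above, the following toy Python sketch maps a tone sequence onto a coarse F0 contour. It is only an illustration: the tone inventory, target values and the simple declination rule are invented here and are not the authors' Thadou 'microvoice' rules, which also handle morphosyntactic tone and tone displacement.

```python
# Minimal sketch of rule-based tone-to-F0 generation (hypothetical tone inventory
# and targets; not the rules from the Gibbon et al. paper).
from typing import List

# Hypothetical per-tone F0 targets as (start, end) fractions of a speaker's range.
TONE_TARGETS = {
    "H": (0.8, 0.8),   # high level
    "L": (0.3, 0.3),   # low level
    "F": (0.8, 0.3),   # falling
    "R": (0.3, 0.7),   # rising
}

def tones_to_f0(tones: List[str], f0_min: float = 90.0, f0_max: float = 220.0,
                declination: float = 0.03, points_per_syllable: int = 5) -> List[float]:
    """Map a tone sequence to a coarse F0 contour (Hz), one short ramp per syllable."""
    contour = []
    span = f0_max - f0_min
    for i, tone in enumerate(tones):
        start, end = TONE_TARGETS[tone]
        # Simple declination rule: later syllables are scaled slightly downwards.
        scale = max(0.0, 1.0 - declination * i)
        for k in range(points_per_syllable):
            frac = k / (points_per_syllable - 1)
            target = start + (end - start) * frac
            contour.append(f0_min + span * target * scale)
    return contour

if __name__ == "__main__":
    # e.g. a hypothetical H L R tone sequence
    print([round(v, 1) for v in tones_to_f0(["H", "L", "R"])])
```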
Advanced Unsupervised Joint Prosody Labeling and Modeling for Mandarin Speech and its Application to Prosody Generation for TTS Chen-Yu Chiang, Sin-Horng Chen, Yih-Ru Wang; National Chiao Tung University, Taiwan Mon-Ses3-P2-5, Time: 16:00 Motivated by the success of the unsupervised joint prosody labeling and modeling (UJPLM) method for modeling Mandarin syllable pitch contours in our previous study, this paper proposes the advanced UJPLM (A-UJPLM) method, which extends UJPLM to jointly label prosodic tags and model syllable pitch contour, duration and energy level. Experimental results on the Sinica Treebank corpus showed that most prosodic tags labeled were linguistically meaningful and the model parameters estimated were interpretable and generally agreed with previous studies. By virtue of the functions given by the model parameters, an application of A-UJPLM to prosody generation for Mandarin TTS is proposed. Experimental results showed that the proposed method performed well. Most predicted prosodic features matched their original counterparts well. This also reconfirmed the effectiveness of the A-UJPLM method. Sentiment Classification in English from Sentence-Level Annotations of Emotions Regarding Models of Affect Alexandre Trilla, Francesc Alías; Universitat Ramon Llull, Spain Mon-Ses3-P2-8, Time: 16:00 This paper presents a text classifier for automatically tagging the sentiment of input text according to the emotion that is being conveyed. This system has a pipelined framework composed of Natural Language Processing modules for feature extraction and a hard binary classifier for decision making between positive and negative categories. To do so, the Semeval 2007 dataset, composed of emotionally annotated sentences, is used for training purposes after being mapped into a model of affect. The resulting scheme is a first step towards a complete emotion classifier for a future automatic expressive text-to-speech synthesizer. Optimization of T-Tilt F0 Modeling Ausdang Thangthai, Anocha Rugchatjaroen, Nattanun Thatphithakkul, Ananlada Chotimongkol, Chai Wutiwiwatchai; NECTEC, Thailand Mon-Ses3-P2-6, Time: 16:00 This paper investigates the improvement of T-Tilt modeling, a modified Tilt model specifically designed for F0 modeling in tonal languages. The model has proved to work well for F0 analysis but suffers in text-to-F0 prediction. To optimize, the T-Tilt event is restricted to span the whole syllable unit, which helps reduce the number of parameters significantly. F0 interpolation and smoothing processes often performed in preprocessing are avoided to prevent modeling errors. F0 shape pre-classification and parameter clustering are introduced for better modeling. Evaluation results using the optimized model show significant improvement for both F0 analysis and prediction. Identification of Contrast and its Emphatic Realization in HMM Based Speech Synthesis Leonardo Badino, J. Sebastian Andersson, Junichi Yamagishi, Robert A.J. Clark; University of Edinburgh, UK Mon-Ses3-P2-9, Time: 16:00 The work presented in this paper proposes to identify contrast in the form of contrastive word pairs and prosodically signal it with emphatic accents in a Text-to-Speech (TTS) application using a Hidden-Markov-Model (HMM) based speech synthesis system. We first describe a novel method to automatically detect contrastive word pairs using textual features only and report its performance on a corpus of spontaneous conversations in English.
Subsequently we describe the set of features selected to train a HMM-based speech synthesis system and attempting to properly control prosodic prominence (including emphasis). Results from a large scale perceptual test show that in the majority of cases listeners judge emphatic contrastive word pairs as acceptable as their non-emphatic counterpart, while emphasis on non-contrastive pairs is almost never acceptable. A Multi-Level Context-Dependent Prosodic Model Applied to Durational Modeling How to Improve TTS Systems for Emotional Expressivity Nicolas Obin 1 , Xavier Rodet 1 , Anne Lacheret-Dujour 2 ; 1 IRCAM, France; 2 MoDyCo, France Antonio Rui Ferreira Rebordao, Mostafa Al Masum Shaikh, Keikichi Hirose, Nobuaki Minematsu; University of Tokyo, Japan Mon-Ses3-P2-7, Time: 16:00 We present in this article a multi-level prosodic model based on the estimation of prosodic parameters on a set of well defined linguistic units. Different linguistic units are used to represent different scales of prosodic variations (local and global forms) and thus to estimate the linguistic factors that can explain the variations of prosodic parameters independently on each level. This model is applied to the modeling of syllable-based durational parameters on two read speech corpora — laboratory and acted speech. Compared to a syllable-based baseline model, the proposed approach improves performance in terms of the temporal organization of the predicted durations (correlation score) and reduces model’s complexity, when showing comparable performance in terms of relative prediction error. Mon-Ses3-P2-10, Time: 16:00 Several experiments have been carried out that revealed weaknesses of the current Text-To-Speech (TTS) systems in their emotional expressivity. Although some TTS systems allow XML-based representations of prosodic and/or phonetic variables, few publications considered, as a pre-processing stage, the use of intelligent text processing to detect affective information that can be used to tailor the parameters needed for emotional expressivity. This paper describes a technique for an automatic prosodic parameterization based on affective clues. This technique recognizes the affective information conveyed in a text and, accordingly to its emotional connotation, assigns appropriate pitch accents and other prosodic parameters by XML-tagging. This pre-processing assists the TTS system to generate synthesized speech that contains emotional clues. The experimental results are encouraging and suggest the possibility of suitable emotional expressivity in speech synthesis. Notes 68 State Mapping Based Method for Cross-Lingual Speaker Adaptation in HMM-Based Speech Synthesis Yi-Jian Wu, Yoshihiko Nankaku, Keiichi Tokuda; Nagoya Institute of Technology, Japan Mon-Ses3-P3 : Automatic Speech Recognition: Adaptation I Hewison Hall, 16:00, Monday 7 Sept 2009 Chair: Stephen J. Cox, University of East Anglia, UK Mon-Ses3-P2-11, Time: 16:00 A phone mapping-based method had been introduced for crosslingual speaker adaptation in HMM-based speech synthesis. In this paper, we continue to propose a state mapping based method for cross-lingual speaker adaptation. In this method, we firstly establish the state mapping between two voice models in source and target languages using Kullback-Leibler divergence (KLD). Based on the established mapping information, we introduce two approaches to conduct cross-lingual speaker adaptation, including data mapping and transform mapping approaches. 
From the experimental results, the state mapping based method outperformed the phone mapping based method. In addition, the data mapping approach achieved better speaker similarity, and the transform mapping approach achieved better speech quality after adaptation. Real Voice and TTS Accent Effects on Intelligibility and Comprehension for Indian Speakers of English as a Second Language Frederick Weber 1 , Kalika Bali 2 ; 1 Columbia University, USA; 2 Microsoft Research India, India Mon-Ses3-P2-12, Time: 16:00 We investigate the effect of accent on comprehension of English for speakers of English as a second language in southern India. Subjects were exposed to real and TTS voices with US and several Indian accents, and were tested for intelligibility and comprehension. Performance trends indicate a measurable advantage for familiar accents, and are broken down by various demographic factors. Improving Consistence of Phonetic Transcription for Text-to-Speech On the Development of Matched and Mismatched Italian Children’s Speech Recognition Systems Piero Cosi; CNR-ISTC, Italy Mon-Ses3-P3-1, Time: 16:00 While at least read speech corpora are available for Italian children’s speech research, there exist many languages which completely lack children’s speech corpora. We propose that learning statistical mappings between the adult and child acoustic space using existing adult/children corpora may provide a future direction for generating children’s models for such data deficient languages. In this work the recent advances in the development of the SONIC Italian children’s speech recognition system will be described. This work, completing a previous one developed in the past, was conducted with the specific goals of integrating the newly trained children’s speech recognition models into the Italian version of the Colorado Literacy Tutor platform. Specifically, children’s speech recognition research for Italian was conducted using the complete training and test set of the FBK (ex ITC-irst) Italian Children’s Speech Corpus (ChildIt). Using the University of Colorado SONIC LVSR system, we demonstrate a phonetic recognition error rate of 12,0% for a system which incorporates Vocal Tract Length Normalization (VTLN), Speaker-Adaptive Trained phonetic models, as well as unsupervised Structural MAP Linear Regression (SMAPLR). Combination of Acoustic and Lexical Speaker Adaptation for Disordered Speech Recognition Oscar Saz, Eduardo Lleida, Antonio Miguel; Universidad de Zaragoza, Spain Mon-Ses3-P3-2, Time: 16:00 Pablo Daniel Agüero 1 , Antonio Bonafonte 2 , Juan Carlos Tulli 1 ; 1 Universidad Nacional de Mar del Plata, Argentina; 2 Universitat Politècnica de Catalunya, Spain Mon-Ses3-P2-13, Time: 16:00 Grapheme-to-phoneme conversion is an important step in speech segmentation and synthesis. Many approaches are proposed in the literature to perform appropriate transcriptions: CART, FST, HMM, etc. In this paper we propose the use of an automatic algorithm that uses the transformation-based error-driven learning to match the phonetic transcription with the speaker’s dialect and style. Different transcriptions based on word, part-of-speech tags, weak forms and phonotactic rules are validated. The experimental results show an improvement in the transcription using an objective measure. The articulation MOS score is also improved, as most of the changes in phonetic transcription affect coarticulation effects. 
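The transformation-based, error-driven approach described in the Agüero et al. abstract above can be pictured as an ordered list of context-dependent rewrite rules applied to a canonical phonetic transcription. The sketch below shows only the rule-application step, with invented rules; the actual rules in the paper are learned from data and may also condition on part-of-speech tags, weak forms and phonotactics.

```python
# Minimal sketch of applying ordered transformation rules to a phonetic transcription
# (hypothetical rules; '#' marks a word edge). Not the learned rules from the paper.
from typing import List, Optional, Tuple

# Each rule: (target phone, left context or None, right context or None, replacement or None).
# A None replacement deletes the phone.
Rule = Tuple[str, Optional[str], Optional[str], Optional[str]]

HYPOTHETICAL_RULES: List[Rule] = [
    ("d", None, "j", "jj"),    # e.g. a dialectal substitution before /j/
    ("e", "#", None, None),    # e.g. drop a word-initial vowel in fast speech
]

def apply_rules(phones: List[str], rules: List[Rule]) -> List[str]:
    """Apply each rule in order over the whole transcription."""
    for target, left, right, repl in rules:
        out = []
        for i, p in enumerate(phones):
            l = phones[i - 1] if i > 0 else "#"
            r = phones[i + 1] if i + 1 < len(phones) else "#"
            if p == target and (left is None or l == left) and (right is None or r == right):
                if repl is not None:
                    out.append(repl)
                # a None replacement deletes the phone
            else:
                out.append(p)
        phones = out
    return phones

if __name__ == "__main__":
    print(apply_rules(["e", "s", "t", "a", "d", "j", "o"], HYPOTHETICAL_RULES))
```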
This paper presents an approach to provide lexical adaptation in Automatic Speech Recognition (ASR) of the disordered speech from a group of young impaired speakers. The outcome of an Acoustic Phonetic Decoder (APD) is used to learn new lexical variants of the 57-word vocabulary and add them to a lexicon personalized to each user. The possibilities of combining this lexical adaptation with acoustic adaptation achieved through traditional Maximum A Posteriori (MAP) approaches are further explored, and the results show the importance of matching the lexicon in the ASR decoding phase to the lexicon used for the acoustic adaptation. Bilinear Transformation Space-Based Maximum Likelihood Linear Regression Frameworks Hwa Jeon Song, Yongwon Jeong, Hyung Soon Kim; Pusan National University, Korea Mon-Ses3-P3-3, Time: 16:00 This paper proposes two types of bilinear transformation space-based speaker adaptation frameworks. In the training session, transformation matrices for speakers are decomposed into the style factor for speakers' characteristics and an orthonormal basis of eigenvectors to control the dimensionality of the canonical model by a singular value decomposition-based algorithm. In the adaptation session, the style factor of a new speaker is estimated, depending on which of the proposed frameworks is used. At the same time, the dimensionality of the canonical model can be reduced by the orthonormal basis from training. Moreover, both maximum likelihood linear regression (MLLR) and eigenspace-based MLLR are identified as special cases of our proposed methods. Experimental results show that the proposed methods are much more effective and versatile than other methods. Speaking Style Adaptation for Spontaneous Speech Recognition Using Multiple-Regression HMM Yusuke Ijima, Takeshi Matsubara, Takashi Nose, Takao Kobayashi; Tokyo Institute of Technology, Japan Mon-Ses3-P3-4, Time: 16:00 This paper describes a rapid model adaptation technique for spontaneous speech recognition. The proposed technique utilizes a multiple-regression hidden Markov model (MRHMM) and is based on a style estimation technique of speech. In the MRHMM, the mean vector of the probability density function (pdf) is given by a function of a low-dimensional vector, called the style vector, which corresponds to the intensity of expressivity of speaking style variation. The value of the style vector is estimated for every utterance of the input speech and the model adaptation is conducted by calculating new mean vectors of the pdf using the estimated style vector. The performance evaluation results using the "Corpus of Spontaneous Japanese (CSJ)" are shown under a condition in which the amount of model training and adaptation data is very small. circumstances. This may lead to a very inefficient usage of the database. We show that after VTLN significantly more speakers — also from the opposite gender — contribute templates to the matching sequence compared to the non-normalized case. In experiments on the Wall Street Journal database this leads to a relative word error rate reduction of 10%. Acoustic Class Specific VTLN-Warping Using Regression Class Trees Improving the Robustness with Multiple Sets of HMMs Hans-Günter Hirsch, Andreas Kitzig; HS Niederrhein, Germany Mon-Ses3-P3-7, Time: 16:00 The highest recognition performance is still achieved when training a recognition system with speech data that have been recorded in the acoustic scenario where the system will be applied. We investigated the approach of using several sets of HMMs.
These sets have been trained on data that were recorded in different typical noise situations. One HMM set is individually selected at each speech input by comparing the pause segment at the beginning of the utterance with the pause models of all sets. We observed a considerable reduction of the error rates when applying this approach in comparison to two well known techniques for improving the robustness. Furthermore, we developed a technique to additionally adapt certain parameters of the selected HMMs to the specific noise condition. This leads to a further improvement of the recognition rates. S.P. Rath, S. Umesh; IIT Kanpur, India Mon-Ses3-P3-5, Time: 16:00 In this paper, we study the use of different frequency warpfactors for different acoustic classes in a computationally efficient frame-work of Vocal Tract Length Normalization (VTLN). This is motivated by the fact that all acoustic classes do not exhibit similar spectral variations as a result of physiological differences in vocal tract, and therefore, the use of a single frequency-warp for the entire utterance may not be appropriate. We have recently proposed a VTLN method that implements VTLN-warping through a linear-transformation (LT) of the conventional MFCC features and efficiently estimates the warp-factor using the same sufficient statistics as that are used in CMLLR adaptation. In this paper we have shown that, in this framework of VTLN, and using the idea of regression class tree, we can obtain separate VTLN-warping for different acoustic classes. The use of regression class tree ensures that warp-factor is estimated for each class even when there is very little data available for that class. The acoustic classes, in general, can be any collection of the Gaussian components in the acoustic model. We have built acoustic classes by using data-driven approach and by using phonetic knowledge. Using WSJ database we have shown the recognition performance of the proposed acoustic class specific warp-factor both for the data driven and the phonetic knowledge based regression class tree definitions and compare it with the case of the single warp-factor. On the Use of Pitch Normalization for Improving Children’s Speech Recognition Rohit Sinha, Shweta Ghai; IIT Guwahati, India Mon-Ses3-P3-8, Time: 16:00 In this work, we have studied the effect of pitch variations across the speech signals in context of automatic speech recognition. Our initial study done on vowel data indicates that on account of insufficient smoothing of pitch harmonics by the filterbank, particularly for high pitch signals, the variances of mel frequency cepstral coefficients (MFCC) feature significantly increase with increase in the pitch of the speech signals. Further to reduce the variance of MFCC feature due to varying pitch among speakers, a maximum likelihood based explicit pitch normalization method has been explored. On connected digit recognition task, with pitch normalization a relative improvement of 15% is obtained over baseline for children’s speech (higher pitch) on adults’ speech (lower pitch) trained models. Using VTLN Matrices for Rapid and Computationally-Efficient Speaker Adaptation with Robustness to First-Pass Transcription Errors S.P. Rath, S. Umesh, A.K. 
Sarkar; IIT Kanpur, India Mon-Ses3-P3-9, Time: 16:00 Speaker Normalization for Template Based Speech Recognition Sébastien Demange, Dirk Van Compernolle; Katholieke Universiteit Leuven, Belgium Mon-Ses3-P3-6, Time: 16:00 Vocal Tract Length Normalization (VTLN) has been shown to be an efficient speaker normalization tool for HMM based systems. In this paper we show that it is equally efficient for a template based recognition system. Template based systems, while promising, have as potential drawback that templates maintain all non phonetic details apart from the essential phonemic properties; i.e. they retain information on speaker and acoustic recording In this paper, we propose to combine the rapid adaptation capability of conventional Vocal Tract Length Normalization (VTLN) with the computational efficiency of transform-based adaptation such as MLLR or CMLLR. VTLN requires the estimation of only one parameter and is, therefore, most suited for the cases where there is little adaptation data (i.e. rapid adaptation). In contrast, transform-based adaptation methods require the estimation of matrices. However, the drawback of conventional VTLN is that it is computationally expensive since it requires multiple spectral-warping to generate VTLN-warped features. We have recently shown that VTLN-warping can be implemented by a lineartransformation (LT) of the conventional MFCC features. These LTs are analytically pre-computed and stored. In this frame-work of LT Notes 70 VTLN, computational complexity of VTLN is similar to transformbased adaptation since warp-factor estimation can be done using the same sufficient statistics as that are used in CMLLR. We show that VTLN provides significant improvement in performance when there is small adaptation data as compared to transform-based adaptation methods. We also show that the use of an additional decorrelating transform, MLLT, along with the VTLN-matrices, gives performance that is better than MLLR and comparable to SAT with MLLT even for large adaptation data. Further we show that in the mismatched train and test case (i.e. poor first-pass transcription), VTLN provides significant improvement over the transform-based adaptation methods. We compare the performances of different methods on the WSJ, the RM and the TIDIGITS databases. Speaker Adaptation Based on Two-Step Active Learning Koichi Shinoda, Hiroko Murakami, Sadaoki Furui; Tokyo Institute of Technology, Japan Mon-Ses3-P3-10, Time: 16:00 We propose a two-step active learning method for supervised speaker adaptation. In the first step, the initial adaptation data is collected to obtain a phone error distribution. In the second step, those sentences whose phone distributions are close to the error distribution are selected, and their utterances are collected as the additional adaptation data. We evaluated the method using a Japanese speech database and maximum likelihood linear regression (MLLR) as the speaker adaptation algorithm. We confirmed that our method had a significant improvement over a method using randomly chosen sentences for adaptation. Tree-Based Estimation of Speaker Characteristics for Speech Recognition Mats Blomberg, Daniel Elenius; KTH, Sweden Mon-Ses3-P3-11, Time: 16:00 Speaker adaptation by means of adjustment of speaker characteristic properties, such as vocal tract length, has the important advantage compared to conventional adaptation techniques that the adapted models are guaranteed to be realistic if the description of the properties are. 
One problem with this approach is that the search procedure to estimate them is computationally heavy. We address the problem by using a multi-dimensional, hierarchical tree of acoustic model sets. The leaf sets are created by transforming a conventionally trained model set using leaf-specific speaker profile vectors. The model sets of non-leaf nodes are formed by merging the models of their child nodes, using a computationally efficient algorithm. During recognition, a maximum likelihood criterion is followed to traverse the tree. Studies of one- (VTLN) and four-dimensional speaker profile vectors (VTLN, two spectral slope parameters and model variance scaling) exhibit a reduction of the computational load to a fraction compared to that of an exhaustive grid search. In recognition experiments on children’s connected digits using adult and male models, the one-dimensional tree search performed as well as the exhaustive search. Further reduction was achieved with four dimensions. The best recognition results are 0.93% and 10.2% WER in TIDIGITS and PF-Star-Sw, respectively, using adult models. A Study on the Influence of Covariance Adaptation on Jacobian Compensation in Vocal Tract Length Normalization D.R. Sanand, S.P. Rath, S. Umesh; IIT Kanpur, India Mon-Ses3-P3-12, Time: 16:00 when there is a mismatch between the train and test speaker conditions. VTLN is implemented using our recently proposed approach of linear transformation of conventional MFCC, i.e. a feature transformation. In this case, Jacobian is simply the determinant of the linear transformation. Feature transformation is equivalent to the means and covariances of the model being transformed by the inverse transformation while leaving the data unchanged. Using a set of adaptation experiments, we analyze the reasons for the degradation during Jacobian compensation and conclude that applying the same VTLN transformation on both means and variances does not fully match the data when there is a mismatch in the speaker conditions. This may have similar implications for constrained-MLLR in mismatched speaker conditions. We then propose to use covariance adaptation on top of VTLN to account for the covariance mismatch between the train and the test speakers and show that accounting for Jacobian after covariance adaptation improves the performance. Mon-Ses3-P4 : Applications in Learning and Other Areas Hewison Hall, 16:00, Monday 7 Sept 2009 Chair: Nestor Becerra Yoma, Universidad de Chile, Chile Designing Spoken Tutorial Dialogue with Children to Elicit Predictable but Educationally Valuable Responses Gregory Aist, Jack Mostow; Carnegie Mellon University, USA Mon-Ses3-P4-1, Time: 16:00 How to construct spoken dialogue interactions with children that are educationally effective and technically feasible? To address this challenge, we propose a design principle that constructs short dialogues in which (a) the user’s utterance are the external evidence of task performance or learning in the domain, and (b) the target utterances can be expressed as a well-defined set, in some cases even as a finite language (up to a small set of variables which may change from exercise to exercise.) The key approach is to teach the human learner a parameterized process that maps input to response. 
We describe how the discovery of this design principle came out of analyzing the processes of automated tutoring for reading and pronunciation and designing dialogues to address vocabulary and comprehension, show how it also accurately describes the design of several other language tutoring interactions, and discuss how it could extend to non-language tutoring tasks. Optimizing Non-Native Speech Recognition for CALL Applications Joost van Doremalen, Helmer Strik, Catia Cucchiarini; Radboud Universiteit Nijmegen, The Netherlands Mon-Ses3-P4-2, Time: 16:00 We are developing a Computer Assisted Language Learning (CALL) system for practicing oral proficiency that makes use of Automatic Speech Recognition (ASR) to provide feedback on grammar and pronunciation. Since good quality unconstrained non-native ASR is not yet feasible, we use an approach in which we try to elicit constrained responses. The task in the current experiments is to select utterances from a list of responses. The results of our experiments show that significant improvements can be obtained by optimizing the language model and the acoustic models, thus reducing the utterance error rate from 29–26% to 10–8%. In this paper, we first show that accounting for Jacobian in VocalTract Length Normalization (VTLN) will degrade the performance Notes 71 Evaluation of English Intonation Based on Combination of Multiple Evaluation Scores read, we develop a novel score, the “phonetic challenge score”, consisting of a measure for native language-specific difficulties described in the second-language acquisition literature and also of a statistical measure based on the cross-entropy between phoneme sequences of the native language and English. Akinori Ito, Tomoaki Konno, Masashi Ito, Shozo Makino; Tohoku University, Japan Mon-Ses3-P4-3, Time: 16:00 In this paper, we proposed a novel method for evaluating intonation of an English utterance spoken by a learner for intonation learning by a CALL system. The proposed method is based on an intonation evaluation method proposed by Suzuki et al., which uses “word importance factors,” which are calculated based on word clusters given by a decision tree. We extended Suzuki’s method so that multiple decision trees are used and the resulting intonation scores are combined using multiple regression. As a result of an experiment, we obtained correlation coefficient comparable to the correlation between human raters. A Language-Independent Feature Set for the Automatic Evaluation of Prosody Andreas Maier, F. Hönig, V. Zeissler, Anton Batliner, E. Körner, N. Yamanaka, P. Ackermann, Elmar Nöth; FAU Erlangen-Nürnberg, Germany Mon-Ses3-P4-4, Time: 16:00 In second language learning, the correct use of prosody plays a vital role. Therefore, an automatic method to evaluate the naturalness of the prosody of a speaker is desirable. We present a novel method to model prosody independently of the text and thus independently of the language as well. For this purpose, the voiced and unvoiced speech segments are extracted and a 187-dimensional feature vector is computed for each voiced segment. This approach is compared to word based prosodic features on a German text passage. Both are confronted with the perceptive evaluation of two native speakers of German. The word-based feature set yielded correlations of up to 0.92 while the text-independent feature set yielded 0.88. This is in the same range as the inter-rater correlation with 0.88. 
Furthermore, the text-independent features were computed for a Japanese translation of the passage which was also rated by two native speakers of Japanese. Again, the correlation between the automatic system and the human perception of the naturalness was high with 0.83 and not significantly lower than the inter-rater correlation of 0.92. Adapting the Acoustic Model of a Speech Recognizer for Varied Proficiency Non-Native Spontaneous Speech Using Read Speech with Language-Specific Pronunciation Difficulty We collected about 23,000 read sentences from 200 speakers in four language groups: Chinese, Japanese, Korean, and Spanish. We used this data for acoustic model adaptation of a spontaneous speech recognizer and compared recognition performance between the unadapted baseline and the system after adaptation on a held-out set from the English test responses data set. The results show that using this targeted read speech material for acoustic model adaptation does reduce the word error rate significantly for two of four language groups of the spontaneous speech test set, while changes of the two other language groups are not significant. Analysis and Utilization of MLLR Speaker Adaptation Technique for Learners’ Pronunciation Evaluation Dean Luo 1 , Yu Qiao 1 , Nobuaki Minematsu 1 , Yutaka Yamauchi 2 , Keikichi Hirose 1 ; 1 University of Tokyo, Japan; 2 Tokyo International University, Japan Mon-Ses3-P4-6, Time: 16:00 In this paper, we investigate the effects and problems of MLLR speaker adaptation when applied to pronunciation evaluation. Automatic scoring and error detection experiments are conducted on two publicly available databases of Japanese learners’ English pronunciation. As we expected, over-adaptation causes misjudge of pronunciation accuracy. Following these experiments, two novel methods, Forced-aligned GOP scoring and Regularized-MLLR adaptation, are proposed to solve the adverse effects of MLLR adaption. Experimental results show that the proposed methods can better utilize MLLR adaptation and avoid over-adaptation. Control of Human Generating Force by Use of Acoustic Information — Study on Onomatopoeic Utterances for Controlling Small Lifting-Force Miki Iimura 1 , Taichi Sato 1 , Kihachiro Tanaka 2 ; 1 Tokyo Denki University, Japan; 2 Saitama University, Japan Mon-Ses3-P4-7, Time: 16:00 Klaus Zechner, Derrick Higgins, René Lawless, Yoko Futagi, Sarah Ohls, George Ivanov; Educational Testing Service, USA Mon-Ses3-P4-5, Time: 16:00 This paper presents a novel approach to acoustic model adaptation of a recognizer for non-native spontaneous speech in the context of recognizing candidates’ responses in a test of spoken English. Instead of collecting and then transcribing spontaneous speech data, a read speech corpus is created where non-native speakers of English read English sentences of different degrees of pronunciation difficulty with respect to their native language. The motivation for this approach is (1) to save time and cost associated with transcribing spontaneous speech, and (2) to allow for a targeted training of the recognizer, focusing particularly on those phoneme environments which are difficult to pronounce correctly by non-native speakers and hence have a higher likelihood of being misrecognized. As a criterion for selecting the sentences to be We have conducted basic experiments for applying acoustic information to engineering problems. We asked the subjects to execute lifting actions while listening to sounds and measured the resultant lifting-force. 
We used human onomatopoeic utterances as the sounds that are presented to the subjects aiming to make their lifting-force small. Especially, we focused on the “emotion” or “nuance” contained in humans’ utterances, which is a unique characteristic evoked by the utterance’ acoustical features. We found that the emotion or nuance can control the lifting-force effectively. We also clarified the acoustical features that are responsible for effective control of lifting-force exerted by human. Mi-DJ: A Multi-Source Intelligent DJ Service Ching-Hsien Lee, Hsu-Chih Wu; ITRI, Taiwan Mon-Ses3-P4-8, Time: 16:00 In this paper, A Multi-source intelligent DJ (Mi-DJ) service is introduced. It is an audio program platform that integrates different Notes 72 media types, including audio and text format content. It acts like a DJ who plays personalized audio program to user whenever and wherever users need. The audio program is automatically generated, comprising several audio clips; all of them are from either existing audio files or text information, such as e-mail, calendar, news or user-preferred article. Our unique program generation technology makes user feel like listening to a well-organized program, instead of several separated audio files. The program can be organized dynamically, which realizes context-aware service based on location, user’s schedule, or other user preference. With appropriate data management, text processing and speech synthesis technologies, Mi-DJ can be applied to many application scenarios. For example, it can be applied in language learning and tour guide. Mon-Ses3-S1 : Special Session: Silent Speech Interfaces Ainsworth (East Wing 4), 16:00, Monday 7 Sept 2009 Chair: Bruce Denby, Université Pierre et Marie Curie, France and Tanja Schultz, Carnegie Mellon University, USA Characterizing Silent and Pseudo-Silent Speech Using Radar-Like Sensors John F. Holzrichter; Fannie and John Hertz Foundation, USA Mon-Ses3-S1-1, Time: 16:00 Radar-like sensors enable the measuring of speech articulator conditions, especially their shape changes and contact events both during silent and normal speech. Such information can be used to associate articulator conditions with digital “codes” for use in communications, machine control, speech masking or canceling, and other applications. Human Voice or Prompt Generation? Can They Co-Exist in an Application? Géza Németh, Csaba Zainkó, Mátyás Bartalis, Gábor Olaszy, Géza Kiss; BME, Hungary Mon-Ses3-P4-9, Time: 16:00 This paper describes an R&D project regarding procedures for the automatic maintenance of the interactive voice response (IVR) system of a mobile telecom operator. The original plan was to create a generic voice prompt generation system for the customer service department. The challenge was to create a solution that is hard to distinguish from the human speaker (i.e. passing a sort of Turing-test) so its output can be freely mixed with original human recordings. The domain of the solution at the first step had to be narrowed down to the price lists of available mobile phones and services. This is updated weekly, so the final operational system generates about 3 hours of speech at each weekend. It operates under human supervision but without intervention in the speech generation process. It was tested both by academic procedures and company customers and was accepted as fulfilling the original requirements. Automatic vs. 
Human Question Answering Over Multimedia Meeting Recordings Quoc Anh Le 1 , Andrei Popescu-Belis 2 ; 1 University of Namur, Belgium; 2 IDIAP Research Institute, Switzerland Mon-Ses3-P4-10, Time: 16:00 Information access in meeting recordings can be assisted by meeting browsers, or can be fully automated following a questionanswering (QA) approach. An information access task is defined, aiming at discriminating true vs. false parallel statements about facts in meetings. An automatic QA algorithm is applied to this task, using passage retrieval over a meeting transcript. The algorithm scores 59% accuracy for passage retrieval, while random guessing is below 1%, but only scores 60% on combined retrieval and question discrimination, for which humans reach 70%–80% and the baseline is 50%. The algorithm clearly outperforms humans for speed, at less than 1 second per question, vs. 1.5–2 minutes per question for humans. The degradation on ASR compared to manual transcripts still yields lower but acceptable scores, especially for passage identification. Automatic QA thus appears to be a promising enhancement to meeting browsers used by humans, as an assistant for relevant passage identification. Technologies for Processing Body-Conducted Speech Detected with Non-Audible Murmur Microphone Tomoki Toda, Keigo Nakamura, Takayuki Nagai, Tomomi Kaino, Yoshitaka Nakajima, Kiyohiro Shikano; NAIST, Japan Mon-Ses3-S1-2, Time: 16:20 In this paper, we review our recent research on technologies for processing body-conducted speech detected with Non-Audible Murmur (NAM) microphone. NAM microphone enables us to detect various types of body-conducted speech such as extremely soft whisper, normal speech, and so on. Moreover, it is robust against external noise due to its noise-proof structure. To make speech communication more universal by effectively using these properties of NAM microphone, we have so far developed two main technologies: one is body-conducted speech conversion for humanto-human speech communication; and the other is body-conducted speech recognition for man-machine speech communication. This paper gives an overview of these technologies and presents our new attempts to investigate the effectiveness of body-conducted speech recognition. Artificial Speech Synthesizer Control by Brain-Computer Interface Jonathan S. Brumberg 1 , Philip R. Kennedy 2 , Frank H. Guenther 1 ; 1 Boston University, USA; 2 Neural Signals Inc., USA Mon-Ses3-S1-3, Time: 16:40 We developed and tested a brain-computer interface for control of an artificial speech synthesizer by an individual with near complete paralysis. This neural prosthesis for speech restoration is currently capable of predicting vowel formant frequencies based on neural activity recorded from an intracortical microelectrode implanted in the left hemisphere speech motor cortex. Using instantaneous auditory feedback (< 50 ms) of predicted formant frequencies, the study participant has been able to correctly perform a vowel production task at a maximum rate of 80–90% correct. Notes 73 Visuo-Phonetic Decoding Using Multi-Stream and Context-Dependent Models for an Ultrasound-Based Silent Speech Interface shown that the EMG signals created by audible and silent speech are quite distinct. In this paper we first compare various methods of initializing a silent speech EMG recognizer, showing that the performance of the recognizer substantially varies across different speakers. 
Based on this, we analyze EMG signals from audible and silent speech, present first results on how discrepancies between these speaking modes affect EMG recognizers, and suggest areas for future work. Thomas Hueber 1 , Elie-Laurent Benaroya 1 , Gérard Chollet 2 , Bruce Denby 3 , Gérard Dreyfus 1 , Maureen Stone 4 ; 1 LE-ESPCI, France; 2 LTCI, France; 3 Université Pierre et Marie Curie, France; 4 University of Maryland at Baltimore, USA Synthesizing Speech from Electromyography Using Voice Transformation Techniques Mon-Ses3-S1-4, Time: 17:00 Recent improvements are presented for phonetic decoding of continuous-speech from ultrasound and optical observations of the tongue and lips in a silent speech interface application. In a new approach to this critical step, the visual streams are modeled by context-dependent multi-stream Hidden Markov Models (CD-MSHMM). Results are compared to a baseline system using context-independent modeling and a visual feature fusion strategy, with both systems evaluated on a one-hour, phonetically balanced English speech database. Tongue and lip images are coded using PCA-based feature extraction techniques. The uttered speech signal, also recorded, is used to initialize the training of the visual HMMs. Visual phonetic decoding performance is evaluated successively with and without the help of linguistic constraints introduced via a 2.5k-word decoding dictionary. Arthur R. Toth, Michael Wand, Tanja Schultz; Universität Karlsruhe (TH), Germany Mon-Ses3-S1-7, Time: 17:20 Surface electromyography (EMG) can be used to record the activation potentials of articulatory muscles while a person speaks. This technique could enable silent speech interfaces, as EMG signals are generated even when people pantomime speech without producing sound. Having effective silent speech interfaces would enable a number of compelling applications, allowing people to communicate in areas where they would not want to be overheard or where the background noise is so prevalent that they could not be heard. In order to use EMG signals in speech interfaces, however, there must be a relatively accurate method to map the signals to speech. Up to this point, it appears that most attempts to use EMG signals for speech interfaces have focused on Automatic Speech Recognition (ASR) based on features derived from EMG signals. Following the lead of other researchers who worked with Electro-Magnetic Articulograph (EMA) data and Non-Audible Murmur (NAM) speech, we explore the alternative idea of using Voice Transformation (VT) techniques to synthesize speech from EMG signals. With speech output, both ASR systems and human listeners can directly use EMG-based systems. We report the results of our preliminary studies, noting the difficulties we encountered and suggesting areas for future work. Disordered Speech Recognition Using Acoustic and sEMG Signals Yunbin Deng 1 , Rupal Patel 2 , James T. Heaton 3 , Glen Colby 1 , L. Donald Gilmore 4 , Joao Cabrera 1 , Serge H. Roy 4 , Carlo J. De Luca 4 , Geoffrey S. Meltzner 1 ; 1 BAE Systems Inc., USA; 2 Northeastern University, USA; 3 Massachusetts General Hospital, USA; 4 Delsys Inc., USA Mon-Ses3-S1-5, Time: 17:20 Parallel isolated word corpora were collected from healthy speakers and individuals with speech impairment due to stroke or cerebral palsy. Surface electromyographic (sEMG) signals were collected for both vocalized and mouthed speech production modes. 
Pioneering work on disordered speech recognition using the acoustic signal, the sEMG signals, and their fusion is reported. Results indicate that speaker-dependent isolated-word recognition from the sEMG signals of articulator muscle groups during vocalized disordered speech production was highly effective. However, word recognition accuracy for mouthed speech was much lower, likely related to the fact that some disordered speakers had considerable difficulty producing consistent mouthed speech. Further development of the sEMG-based speech recognition systems is needed to increase usability and robustness. Multimodal HMM-Based NAM-to-Speech Conversion Viet-Anh Tran 1 , Gérard Bailly 1 , Hélène Lœvenbruck 1 , Tomoki Toda 2 ; 1 GIPSA, France; 2 NAIST, Japan Mon-Ses3-S1-8, Time: 17:20 Although the segmental intelligibility of converted speech from silent speech using the direct signal-to-signal mapping proposed by Toda et al. [1] is quite acceptable, listeners sometimes have difficulty chunking the speech continuum into meaningful words due to incomplete phonetic cues provided by the output signals. This paper studies another approach, which combines HMM-based statistical speech recognition and synthesis techniques, as well as training on aligned corpora, to convert silent speech to audible voice. By introducing phonological constraints, such systems are expected to improve the phonetic consistency of the output signals. Facial movements are used in order to improve the performance of both the recognition and synthesis procedures. The results show that including these movements improves the recognition rate by 6.2%, and a final improvement of the spectral distortion by 2.7% is observed. The comparison between direct signal-to-signal and phonetic-based mappings is also discussed in this paper. Impact of Different Speaking Modes on EMG-Based Speech Recognition Michael Wand 1 , Szu-Chen Stan Jou 2 , Arthur R. Toth 1 , Tanja Schultz 1 ; 1 Universität Karlsruhe (TH), Germany; 2 Industrial Technology Research Institute, Taiwan Mon-Ses3-S1-6, Time: 17:20 We present our recent results on speech recognition by surface electromyography (EMG), which captures the electric potentials that are generated by the human articulatory muscles. This technique can be used to enable Silent Speech Interfaces, since EMG signals are generated even when people only articulate speech without producing any sound. Preliminary experiments have Tue-Ses1-O1 : ASR: Discriminative Training Main Hall, 10:00, Tuesday 8 Sept 2009 Chair: Erik McDermott, NTT Corporation, Japan On the Semi-Supervised Learning of Multi-Layered Perceptrons Jonathan Malkin, Amarnag Subramanya, Jeff Bilmes; University of Washington, USA Tue-Ses1-O1-1, Time: 10:00 We present a novel approach for training a multi-layered perceptron (MLP) in a semi-supervised fashion. Our objective function, when optimized, balances training set accuracy with fidelity to a graph-based manifold over all points. Additionally, the objective favors smoothness via an entropy regularizer over classifier outputs as well as straightforward ℓ2 regularization. Our approach also scales well enough to enable large-scale training. The results demonstrate significant improvement on several phone classification tasks over baseline MLPs. In this paper, we have successfully extended our previous work on convex optimization methods to MMIE-based discriminative training for large vocabulary continuous speech recognition.
Specifically, we have re-formulated the MMIE training into a second order cone programming (SOCP) program using some convex relaxation techniques that we have previously proposed. Moreover, the entire SOCP formulation has been developed for word graphs instead of N-best lists to handle large vocabulary tasks. The proposed method has been evaluated in the standard WSJ-5k task and experimental results show that the proposed SOCP method significantly outperforms the conventional EBW method in terms of recognition accuracy as well as convergence behavior. Our experiments also show that the proposed SOCP method is efficient enough to handle some relatively large HMM sets normally used in large vocabulary tasks. Hidden Conditional Random Field with Distribution Constraints for Phone Classification Dong Yu, Li Deng, Alex Acero; Microsoft Research, USA Tue-Ses1-O1-5, Time: 11:20 We propose a new algorithm called Generalized Discriminative Feature Transformation (GDFT) for acoustic models in speech recognition. GDFT is based on Lagrange relaxation on a transformed optimization problem. We show that the existing discriminative feature transformation methods like feature space MMI/MPE (fMMI/MPE), region dependent linear transformation (RDLT), and a non-discriminative feature transformation, constrained maximum likelihood linear regression (CMLLR) are special cases of GDFT. We evaluate the performance of GDFT for Iraqi large vocabulary continuous speech recognition. We advance the recently proposed hidden conditional random field (HCRF) model by replacing the moment constraints (MCs) with the distribution constraints (DCs). We point out that the distribution constraints are the same as the traditional moment constraints for the binary features but are able to better regularize the probability distribution of the continuous-valued features than the moment constraints. We show that under the distribution constraints the HCRF model is no longer log-linear but embeds the model parameters in non-linear functions. We provide an effective solution to the resulting more difficult optimization problem by converting it to the traditional log-linear form at a higher-dimensional space of features exploiting cubic spline. We demonstrate that a 20.8% classification error rate (CER) can be achieved on the TIMIT phone classification task using the HCRF-DC model. This result is superior to any published single-system result on this heavily evaluated task including the HCRF-MC model, the discriminatively trained HMMs, and the large-margin HMMs using the same features. A Fast Online Algorithm for Large Margin Training of Continuous Density Hidden Markov Models Deterministic Annealing Based Training Algorithm for Bayesian Speech Recognition Chih-Chieh Cheng 1 , Fei Sha 2 , Lawrence K. Saul 1 ; 1 University of California at San Diego, USA; 2 University of Southern California, USA Sayaka Shiota, Kei Hashimoto, Yoshihiko Nankaku, Keiichi Tokuda; Nagoya Institute of Technology, Japan Tue-Ses1-O1-3, Time: 10:40 This paper proposes a deterministic annealing based training algorithm for Bayesian speech recognition. The Bayesian method is a statistical technique for estimating reliable predictive distributions by marginalizing model parameters. However, the local maxima problem in the Bayesian method is more serious than in the ML-based approach, because the Bayesian method treats not only state sequences but also model parameters as latent variables. 
The deterministic annealing EM (DAEM) algorithm has been proposed to improve the local maxima problem in the EM algorithm, and its effectiveness has been reported in HMM-based speech recognition using ML criterion. In this paper, the DAEM algorithm is applied to Bayesian speech recognition to relax the local maxima problem. Speech recognition experiments show that the proposed method achieved a higher performance than the conventional methods. Generalized Discriminative Feature Transformation for Speech Recognition Roger Hsiao, Tanja Schultz; Carnegie Mellon University, USA Tue-Ses1-O1-2, Time: 10:20 Tue-Ses1-O1-6, Time: 11:40 We propose an online learning algorithm for large margin training of continuous density hidden Markov models. The online algorithm updates the model parameters incrementally after the decoding of each training utterance. For large margin training, the algorithm attempts to separate the log-likelihoods of correct and incorrect transcriptions by an amount proportional to their Hamming distance. We evaluate this approach to hidden Markov modeling on the TIMIT speech database. We find that the algorithm yields significantly lower phone error rates than other approaches — both online and batch — that do not attempt to enforce a large margin. We also find that the algorithm converges much more quickly than analogous batch optimizations for large margin training. Maximum Mutual Information Estimation via Second Order Cone Programming for Large Vocabulary Continuous Speech Recognition Dalei Wu, Baojie Li, Hui Jiang; York University, Canada Tue-Ses1-O1-4, Time: 11:00 Notes 75 of this hypothesis involved acoustic measurement of L2 speakers’ intonation contours, and comparison of these contours with those of native speakers. Tue-Ses1-O2 : Language Acquisition Jones (East Wing 1), 10:00, Tuesday 8 Sept 2009 Chair: Maria Uther, Brunel University, UK KLAIR: A Virtual Infant for Spoken Language Acquisition Research Connecting Rhythm and Prominence in Automatic ESL Pronunciation Scoring Mark Huckvale 1 , Ian S. Howard 2 , Sascha Fagel 3 ; 1 University College London, UK; 2 University of Cambridge, UK; 3 Technische Universität Berlin, Germany Emily Nava, Joseph Tepperman, Louis Goldstein, Maria Luisa Zubizarreta, Shrikanth S. Narayanan; University of Southern California, USA Tue-Ses1-O2-1, Time: 10:00 Tue-Ses1-O2-4, Time: 11:00 Past studies have shown that a native Spanish speaker’s use of phrasal prominence is a good indicator of her level of English prosody acquisition. Because of the cross-linguistic differences in the organization of phrasal prominence and durational contrasts, we hypothesize that those speakers with English-like prominence in their L2 speech are also expected to have acquired English-like rhythm. Statistics from a corpus of native and nonnative English confirm that speakers with an English-like phrasal prominence are also the ones who use English-like rhythm. Additionally, two methods of automatic score generation based on vowel duration times demonstrate a correlation of at least 0.6 between these automatic scores and subjective scores for phrasal prominence. These findings suggest that simple vowel duration measures obtained from standard automatic speech recognition methods can be salient cues for estimating subjective scores of prosodic acquisition, and of pronunciation in general. 
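As one concrete reading of the duration-based automatic score generation mentioned in the Nava et al. abstract above, the sketch below computes a standard rhythm measure, the normalized Pairwise Variability Index (nPVI), over vowel durations and correlates it with subjective ratings. Whether this is the exact metric the authors used is an assumption, and the durations and ratings below are invented for illustration.

```python
# Minimal sketch of a vowel-duration-based rhythm score and its correlation with
# subjective ratings (illustrative data only).
from typing import List

def npvi(durations: List[float]) -> float:
    """Normalized Pairwise Variability Index over successive vowel durations (seconds)."""
    if len(durations) < 2:
        raise ValueError("need at least two durations")
    diffs = [abs(a - b) / ((a + b) / 2.0) for a, b in zip(durations, durations[1:])]
    return 100.0 * sum(diffs) / len(diffs)

def pearson(x: List[float], y: List[float]) -> float:
    """Pearson correlation, e.g. between automatic rhythm scores and subjective ratings."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x) ** 0.5
    vy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (vx * vy)

if __name__ == "__main__":
    # Invented per-speaker vowel duration sequences and subjective prominence ratings.
    speakers = [[0.08, 0.15, 0.06, 0.18], [0.10, 0.11, 0.09, 0.12], [0.07, 0.20, 0.05, 0.16]]
    ratings = [4.0, 2.0, 4.5]
    scores = [npvi(d) for d in speakers]
    print(scores, pearson(scores, ratings))
```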
Recent research into the acquisition of spoken language has stressed the importance of learning through embodied linguistic interaction with caregivers rather than through passive observation. However the necessity of interaction makes experimental work into the simulation of infant speech acquisition difficult because of the technical complexity of building real-time embodied systems. In this paper we present KLAIR: a software toolkit for building simulations of spoken language acquisition through interactions with a virtual infant. The main part of KLAIR is a sensori-motor server that supplies a client machine learning application with a virtual infant on screen that can see, hear and speak. By encapsulating the real-time complexities of audio and video processing within a server that will run on a modern PC, we hope that KLAIR will encourage and facilitate more experimental research into spoken language acquisition through interaction. Evaluating Parameters for Mapping Adult Vowels to Imitative Babbling An Articulatory Analysis of Phonological Transfer Using Real-Time MRI Joseph Tepperman, Erik Bresch, Yoon-Chul Kim, Sungbok Lee, Louis Goldstein, Shrikanth S. Narayanan; University of Southern California, USA Ilana Heintz 1 , Mary Beckman 1 , Eric Fosler-Lussier 1 , Lucie Ménard 2 ; 1 Ohio State University, USA; 2 Université du Québec à Montréal, Canada Tue-Ses1-O2-5, Time: 11:20 Tue-Ses1-O2-2, Time: 10:20 We design a neural network model of first language acquisition to explore the relationship between child and adult speech sounds. The model learns simple vowel categories using a produce-andperceive babbling algorithm in addition to listening to ambient speech. The model is similar to that of Westermann & Miranda (2004), but adds a dynamic aspect in that it adapts in both the articulatory and acoustic domains to changes in the child’s speech patterns. The training data is designed to replicate infant speech sounds and articulatory configurations. By exploring a range of articulatory and acoustic dimensions, we see how the child might learn to draw correspondences between his or her own speech and that of a caretaker, whose productions are quite different from the child’s. We also design an imitation evaluation paradigm that gives insight into the strengths and weaknesses of the model. Intonation of Japanese Sentences Spoken by English Speakers Chiharu Tsurutani; Griffith University, Australia Tue-Ses1-O2-3, Time: 10:40 This study investigated intonation of Japanese sentences spoken by Australian English speakers and the influence of their first language (L1) prosody on their intonation of Japanese sentences. The second language (L2) intonation is a complicated product of the L1 transfer at two levels of prosodic hierarchy: at word level and at phrase levels. L2 speech is hypothesized to retain the characteristics of L1, and to gain marked features of the target language only during the late stage of acquisition. Investigation Phonological transfer is the influence of a first language on phonological variations made when speaking a second language. With automatic pronunciation assessment applications in mind, this study intends to uncover evidence of phonological transfer in terms of articulation. Real-time MRI videos from three German speakers of English and three native English speakers are compared to uncover the influence of German consonants on close English consonants not found in German. 
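The correlation check reported in the preceding abstract (an automatic, duration-based score set against subjective prominence ratings) can be sketched in a few lines of Python. The snippet below is a minimal illustration rather than the authors' code: the per-speaker score definition (coefficient of variation of vowel durations) and all values are assumptions made up for the example.

    # Minimal sketch: correlate an automatic duration-based score with
    # subjective prominence ratings (one value per speaker).
    # All durations and ratings below are invented example values.
    import numpy as np
    from scipy.stats import pearsonr

    vowel_durations = {
        "spk1": [0.09, 0.14, 0.21, 0.08, 0.17],
        "spk2": [0.12, 0.13, 0.11, 0.12, 0.14],
        "spk3": [0.07, 0.22, 0.10, 0.19, 0.09],
        "spk4": [0.10, 0.16, 0.12, 0.18, 0.11],
    }
    subjective_scores = {"spk1": 2.9, "spk2": 4.4, "spk3": 2.5, "spk4": 3.6}

    speakers = sorted(vowel_durations)
    # One simple rhythm proxy: the coefficient of variation of vowel durations,
    # similar in spirit to duration-based rhythm metrics such as VarcoV.
    auto_scores = [np.std(vowel_durations[s]) / np.mean(vowel_durations[s])
                   for s in speakers]
    ratings = [subjective_scores[s] for s in speakers]

    r, p = pearsonr(auto_scores, ratings)
    print("Pearson r = %.2f (p = %.3f)" % (r, p))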
An Articulatory Analysis of Phonological Transfer Using Real-Time MRI
Joseph Tepperman, Erik Bresch, Yoon-Chul Kim, Sungbok Lee, Louis Goldstein, Shrikanth S. Narayanan; University of Southern California, USA
Tue-Ses1-O2-5, Time: 11:20

Phonological transfer is the influence of a first language on phonological variations made when speaking a second language. With automatic pronunciation assessment applications in mind, this study intends to uncover evidence of phonological transfer in terms of articulation. Real-time MRI videos from three German speakers of English and three native English speakers are compared to uncover the influence of German consonants on close English consonants not found in German. Results show that nonnative speakers demonstrate the effects of L1 transfer through the absence of articulatory contrasts seen in native speakers, while still maintaining minimal articulatory contrasts that are necessary for automatic detection of pronunciation errors, encouraging the further use of articulatory models for speech error characterization and detection.

Do Multiple Caregivers Speed up Language Acquisition?
L. ten Bosch 1, Okko Johannes Räsänen 2, Joris Driesen 3, Guillaume Aimetti 4, Toomas Altosaar 2, Lou Boves 1, A. Corns 1; 1 Radboud Universiteit Nijmegen, The Netherlands; 2 Helsinki University of Technology, Finland; 3 Katholieke Universiteit Leuven, Belgium; 4 University of Sheffield, UK
Tue-Ses1-O2-6, Time: 11:40

In this paper we compare three different implementations of language learning to investigate the issue of speaker-dependent initial representations and subsequent generalization. These implementations are used in a comprehensive model of language acquisition under development in the FP6 FET project ACORNS. All algorithms are embedded in a cognitively and ecologically plausible framework, and perform the task of detecting word-like units without any lexical, phonetic, or phonological information. The results show that the computational approaches differ with respect to the extent they deal with unseen speakers, and how generalization depends on the variation observed during training.

Tue-Ses1-O3 : ASR: Lexical and Prosodic Models
Fallside (East Wing 2), 10:00, Tuesday 8 Sept 2009
Chair: Eric Fosler-Lussier, Ohio State University, USA

Grapheme to Phoneme Conversion Using an SMT System
Antoine Laurent, Paul Deléglise, Sylvain Meignier; LIUM, France
Tue-Ses1-O3-1, Time: 10:00

This paper presents an automatic grapheme to phoneme conversion system that uses statistical machine translation techniques provided by the Moses Toolkit. The generated word pronunciations are employed in the dictionary of an automatic speech recognition system and evaluated using the ESTER 2 French broadcast news corpus. Grapheme to phoneme conversion based on Moses is compared to two other methods: G2P, and a dictionary look-up method supplemented by a rule-based tool for phonetic transcriptions of words unavailable in the dictionary. Moses gives better results than G2P, and has performance comparable to the dictionary look-up strategy.

Lexical and Phonetic Modeling for Arabic Automatic Speech Recognition
Long Nguyen 1, Tim Ng 1, Kham Nguyen 2, Rabih Zbib 3, John Makhoul 1; 1 BBN Technologies, USA; 2 Northeastern University, USA; 3 MIT, USA
Tue-Ses1-O3-2, Time: 10:20

In this paper, we describe the use of either words or morphemes as lexical modeling units and the use of either graphemes or phonemes as phonetic modeling units for Arabic automatic speech recognition (ASR). We designed four Arabic ASR systems: two word-based systems and two morpheme-based systems. Experimental results using these four systems show that they have comparable state-of-the-art performance individually, but the more sophisticated morpheme-based system tends to be the best. However, they seem to complement each other quite well within the ROVER system combination framework to produce substantially-improved combined results.

Assessing Context and Learning for isiZulu Tone Recognition
Gina-Anne Levow; University of Chicago, USA
Tue-Ses1-O3-3, Time: 10:40

Prosody plays an integral role in spoken language understanding. In isiZulu, a Nguni family language with lexical tone, prosodic information determines word meaning. We assess the impact of models of tone and coarticulation for tone recognition. We demonstrate the importance of modeling prosodic context to improve tone recognition. We employ this less commonly studied language to assess models of tone developed for English and Mandarin, finding common threads in coarticulatory modeling. We also demonstrate the effectiveness of semi-supervised and unsupervised tone recognition techniques for this less-resourced language, with weakly supervised approaches rivaling supervised techniques.

A Sequential Minimization Algorithm for Finite-State Pronunciation Lexicon Models
Simon Dobrišek, Boštjan Vesnicer, France Mihelič; University of Ljubljana, Slovenia
Tue-Ses1-O3-4, Time: 11:00

The paper first presents a large-vocabulary automatic speech recognition system that is being developed for the Slovenian language. The concept of a single-pass token-passing algorithm for the fast speech decoding that can be used with the designed multi-level system structure is discussed. From the algorithmic point of view, the main component of the system is a finite-state pronunciation lexicon model. This component has a crucial impact on the overall performance of the system and we developed a sequential minimization algorithm that very efficiently reduces the size and algorithmic complexity of the lexicon model. Our finite-state lexicon model is represented as a state-emitting finite-state transducer. The presented experiments show that the sequential minimization algorithm easily outperforms (up to 60%) the conventional algorithms that were developed for the static global optimization of the transition-emitting finite-state transducers. These algorithms are delivered as part of the AT&T FSM library and the OpenFST library.

A General-Purpose 32 ms Prosodic Vector for Hidden Markov Modeling
Kornel Laskowski 1, Mattias Heldner 2, Jens Edlund 2; 1 Carnegie Mellon University, USA; 2 KTH, Sweden
Tue-Ses1-O3-5, Time: 11:20

Prosody plays a central role in conversation, making it important for speech technologies to model. Unfortunately, the application of standard modeling techniques to the acoustics of prosody has been hindered by difficulties in modeling intonation. In this work, we explore the suitability of the recently introduced fundamental frequency variation (FFV) spectrum as a candidate general representation of tone. Experiments on 4 tasks demonstrate that FFV features are complementary to other acoustic measures of prosody and that hidden Markov models offer a suitable modeling paradigm. Proposed improvements yield a 35% relative decrease in error on unseen data and simultaneously reduce time complexity by a factor of five. The resulting representation is sufficiently mature for general deployment in a broad range of automatic speech processing applications.

Vocabulary Expansion Through Automatic Abbreviation Generation for Chinese Voice Search
Dong Yang, Yi-cheng Pan, Sadaoki Furui; Tokyo Institute of Technology, Japan
Tue-Ses1-O3-6, Time: 11:40

Long named entities are often abbreviated in oral Chinese language, and this usually leads to out-of-vocabulary (OOV) problems in speech recognition applications. The generation of Chinese abbreviations is much more complex than English abbreviations, most of which are acronyms and truncations. In this paper, we propose a new method for automatically generating abbreviations for Chinese named entities and we perform vocabulary expansion using the output of the abbreviation model for voice search. In our abbreviation modeling, we convert the abbreviation generation problem into a tagging problem and use the conditional random field (CRF) as the tagging tool. In the vocabulary expansion, considering the multiple abbreviation problem and the limited coverage of the top-1 abbreviation candidate, we add the top-10 candidates into the vocabulary. In our experiments, for the abbreviation modeling, we achieved a top-10 coverage of 88.3% by the proposed method; for the voice search, we improved the voice search accuracy from 16.9% to 79.2% by incorporating the top-10 abbreviation candidates into the vocabulary.
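To make the tagging formulation in the abstract above concrete, the short sketch below shows one way to cast abbreviation generation as per-character keep/drop labelling, the kind of representation a CRF tagger is trained on. It is only an illustration under assumed inputs: the entity, its abbreviation and the feature template are invented, and the actual CRF training step (for example with a toolkit such as CRF++) is omitted.

    # Sketch: frame abbreviation generation as per-character tagging.
    # Each character of the full name is labelled K (keep) or D (drop);
    # reading off the K characters yields an abbreviation candidate.
    # The entity/abbreviation pair below is a made-up placeholder.

    def make_tags(full_name, abbreviation):
        """Greedy left-to-right alignment of an abbreviation to its full form."""
        tags, i = [], 0
        for ch in full_name:
            if i < len(abbreviation) and ch == abbreviation[i]:
                tags.append("K")
                i += 1
            else:
                tags.append("D")
        return tags

    def char_features(name, pos):
        """Simple per-character feature dict of the kind used by CRF taggers."""
        return {
            "char": name[pos],
            "prev": name[pos - 1] if pos > 0 else "<s>",
            "next": name[pos + 1] if pos + 1 < len(name) else "</s>",
            "position": pos,
            "is_first": pos == 0,
            "is_last": pos == len(name) - 1,
        }

    full = "ABCDEF"          # placeholder for a long named entity
    abbrev = "ACE"           # placeholder abbreviation
    tags = make_tags(full, abbrev)
    feats = [char_features(full, i) for i in range(len(full))]
    print(list(zip(full, tags)))   # [('A', 'K'), ('B', 'D'), ('C', 'K'), ...]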
Tue-Ses1-O4 : Unit-Selection Synthesis
Holmes (East Wing 3), 10:00, Tuesday 8 Sept 2009
Chair: Alan Black, Carnegie Mellon University, USA

Perceptual Cost Function for Cross-Fading Based Concatenation
Qi Miao, Alexander Kain, Jan P.H. van Santen; Oregon Health & Science University, USA
Tue-Ses1-O4-1, Time: 10:00

In earlier research, we applied a linear weighted cross-fading function to ensure smooth concatenation. However, this can cause unnaturally shaped spectral trajectories. We propose context-sensitive cross-fading. To train this system, a perceptually validated cost function is needed, which is the focus of this paper. A corpus was designed to generate a variety of formant trajectory shapes. A perceptual experiment was performed and a multiple linear regression model was applied to predict perceptual quality ratings from various distances between cross-faded and natural trajectories. Results show that perceptual quality could be predicted well from the proposed distance measures.

Exploring Automatic Similarity Measures for Unit Selection Tuning
Daniel Tihelka 1, Jan Romportl 2; 1 University of West Bohemia in Pilsen, Czech Republic; 2 SpeechTech s.r.o., Czech Republic
Tue-Ses1-O4-2, Time: 10:20

The present paper focuses on the current handling of target features in the unit selection approach basically requiring huge corpora. In the paper there are outlined possible solutions based on measuring (dis)similarity among prosodic patterns. As the start of research, the feasibility of (dis)similarity estimation is examined on several intuitively chosen measures of the acoustic signal which are correlated to perceived similarity obtained from a large-scale listening test.

Towards Intonation Control in Unit Selection Speech Synthesis
Cédric Boidin 1, Olivier Boeffard 2, Thierry Moudenc 1, Géraldine Damnati 1; 1 Orange Labs, France; 2 IRISA, France
Tue-Ses1-O4-3, Time: 10:40

We propose to control intonation in unit selection speech synthesis with a mixed CART-HMM intonation model. The Finite State Machine (FSM) formulation is suited to incorporate the intonation model in the unit selection framework because it allows for combination of models with different unit types and handling competing intonative variants. Subjective experiments have been carried out to compare segmental and joint-prosodic-and-segmental unit selection.

A Novel Approach to Cost Weighting in Unit Selection TTS
Jerome R. Bellegarda; Apple Inc., USA
Tue-Ses1-O4-4, Time: 11:00

Unit selection text-to-speech synthesis relies on multiple cost criteria, each encapsulating a different aspect of acoustic and prosodic context at any given concatenation point. For a particular set of criteria, the relative weighting of the resulting costs crucially affects final candidate ranking. Their influence is typically determined in an empirical manner (e.g., based on a limited amount of synthesized data), yielding global weights that are thus applied to all concatenations indiscriminately. This paper proposes an alternative approach, based on a data-driven framework separately optimized for each concatenation. The cost distribution in every information stream is dynamically leveraged to locally shift weight towards those characteristics that prove most discriminative at this point. An illustrative case study underscores the potential benefits of this solution.

Maximum Likelihood Unit Selection for Corpus-Based Speech Synthesis
Abubeker Gamboa Rosales 1, Hamurabi Gamboa Rosales 2, Ruediger Hoffmann 2; 1 University of Guanajuato, Mexico; 2 Technische Universität Dresden, Germany
Tue-Ses1-O4-5, Time: 11:20

Corpus-based speech synthesis systems deliver a considerable synthesis quality since the unit selection approaches have been optimized in the last decade. Unit selection attempts to find the best combination of speech unit sequences in an inventory so that the perceptual differences between expected (natural) and synthesized signals are as low as possible. However, mismatches and distortions are still possible in concatenative speech synthesis and they are normally perceptible in the synthesized waveform. Therefore, unit selection strategies and parameter tuning are still important issues in the improvement of speech synthesis. We present a novel concept to increase the efficiency of the exhaustive speech unit search within the inventory via a unit selection model. This model bases its operation on a mapping analysis of the concatenation sub-costs, a Bayes optimal classification (BOC), and a maximum likelihood selection (MLS). The principal advantage of the proposed unit selection method is that it does not require an exhaustive training to set up weighted coefficients for target and concatenation sub-costs. It provides an alternative for unit selection but requires further optimization, e.g. by integrating target cost mapping.

A Close Look into the Probabilistic Concatenation Model for Corpus-Based Speech Synthesis
Shinsuke Sakai, Ranniery Maia, Hisashi Kawai, Satoshi Nakamura; NICT, Japan
Tue-Ses1-O4-6, Time: 11:40

We have proposed a novel probabilistic approach to concatenation modeling for corpus-based speech synthesis, where the goodness of concatenation for a unit is modeled using a conditional Gaussian probability density whose mean is defined as a linear transform of the feature vector from the previous unit. This approach has shown its effectiveness through a subjective listening test. In this paper, we further investigate the characteristics of the proposed method by an objective evaluation and by observing the sequence of concatenation scores across an utterance. We also present the mathematical relationships of the proposed method with other approaches and show that it has a flexible modeling power, having other approaches to concatenation scoring as special cases.

Tue-Ses1-P1 : Human Speech Production II
Hewison Hall, 10:00, Tuesday 8 Sept 2009
Chair: Martin Cooke, Ikerbasque, Spain

Simple Physical Models of the Vocal Tract for Education in Speech Science
Takayuki Arai; Sophia University, Japan
Tue-Ses1-P1-1, Time: 10:00

In the speech-related field, physical models of the vocal tract are effective tools for education in acoustics. Arai's cylinder-type models are based on Chiba and Kajiyama's measurement of vocal-tract shapes. The models quickly and effectively demonstrate vowel production. In this study, we developed physical models with simplified shapes as educational tools to illustrate how vocal-tract shape accounts for differences among vowels. As a result, the five Japanese vowels were produced by tube-connected models, where several uniform tubes with different cross-sectional areas and lengths are connected as in Fant's and Arai's three-tube models.

Auto-Meshing Algorithm for Acoustic Analysis of Vocal Tract
Kyohei Hayashi, Nobuhiro Miki; Future University Hakodate, Japan
Tue-Ses1-P1-2, Time: 10:00

We propose a new method for an auto-meshing algorithm for an acoustic analysis of the vocal tract using the Finite Element Method (FEM). In our algorithm, the domain of the 3-dimensional figure of the vocal tract is decomposed into two domains; one is a surface domain and the other is an inner domain, in order to employ the overlapping domain decomposition method. The meshing of surface blocks can be realized with smooth surfaces using a NURBS interpolation. We show an example of the meshes for the vocal tract figure of the Japanese vowel /a/, and the trial result of the FEM simulation.

Characteristics of Two-Dimensional Finite Difference Techniques for Vocal Tract Analysis and Voice Synthesis
Matt Speed, Damian Murphy, David M. Howard; University of York, UK
Tue-Ses1-P1-4, Time: 10:00

Both digital waveguide and finite difference techniques are numerical methods that have been demonstrated as appropriate for acoustic modelling applications. Whilst the application of the digital waveguide mesh to vocal tract modelling has been the subject of previous work, the application of comparable finite difference techniques is as yet untested. This study explores the characteristics of such a finite-difference approach to two-dimensional vocal tract modelling. Initial results suggest that finite difference techniques alone are not ideal, due to the limitation of non-dynamic behaviour and poor representation of admittance discontinuities in the approximation of three-dimensional geometries. They do however introduce robust boundary formulations, and have a valid and useful application in modelling non-vital static volumes, particularly the nasal tract.

Adaptation of a Predictive Model of Tongue Shapes
Chao Qin, Miguel Á. Carreira-Perpiñán; University of California at Merced, USA
Tue-Ses1-P1-5, Time: 10:00

It is possible to recover the full midsagittal contour of the tongue with submillimetric accuracy from the location of just 3–4 landmarks on it. This involves fitting a predictive mapping from the landmarks to the contour using a training set consisting of contours extracted from ultrasound recordings. However, extracting sufficient contours is a slow and costly process. Here, we consider adapting a predictive mapping obtained for one condition (such as a given recording session, recording modality, speaker or speaking style) to a new condition, given only a few new contours and no correspondences. We propose an extremely fast method based on estimating a 2D-wise linear alignment mapping, and show it recovers very accurate predictive models from about 10 new contours.
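The adaptation step in the preceding abstract rests on estimating a simple 2D linear alignment between an old and a new condition from a handful of contour points. The least-squares fit below illustrates that building block only; the data are synthetic, and the sketch assumes point correspondences are given, which the paper's method notably does not require.

    # Sketch: fit a 2D affine alignment y ~ A x + b by least squares
    # from a few corresponding contour points (synthetic data here).
    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.uniform(0, 10, size=(30, 2))           # points in the old condition
    A_true = np.array([[0.9, 0.1], [-0.05, 1.1]])  # transform used to fake data
    b_true = np.array([2.0, -1.0])
    Y = X @ A_true.T + b_true + 0.05 * rng.standard_normal((30, 2))

    # Solve [A | b] jointly by appending a column of ones to X.
    X1 = np.hstack([X, np.ones((len(X), 1))])
    W, *_ = np.linalg.lstsq(X1, Y, rcond=None)     # W is 3x2: rows are A^T and b
    A_est, b_est = W[:2].T, W[2]
    print(np.round(A_est, 2), np.round(b_est, 2))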
Voice Production Model Employing an Interactive Boundary-Layer Analysis of Glottal Flow
Tokihiko Kaburagi, Katsunori Daimo, Shogo Nakamura; Kyushu University, Japan
Tue-Ses1-P1-3, Time: 10:00

A voice production model has been studied by considering essential aerodynamic and acoustic phenomena in human phonation. Acoustic voice sources are produced by the temporal change of volume flow passing through the glottis. A precise flow analysis is therefore performed based on the boundary-layer approximation and the viscous-inviscid interaction between the boundary layer and core flow. This flow analysis can supply information on the separation point of the glottal flow and the thickness of the boundary layer, which strongly depend on the glottal configuration, and yield an effective prediction of the flow behavior. When the flow analysis is combined with a mechanical model of the vocal fold, the resulting acoustic wave travels through the vocal tract and a pressure change develops in the vicinity of the glottis. This change can affect the glottal flow and the motion of the vocal folds, causing source-filter interaction. Preliminary simulations were conducted by changing the relationship between the fundamental and formant frequencies and their results were reported.

Using Sensor Orientation Information for Computational Head Stabilisation in 3D Electromagnetic Articulography (EMA)
Christian Kroos; University of Western Sydney, Australia
Tue-Ses1-P1-6, Time: 10:00

We propose a new simple algorithm to make use of the sensor orientation information in 3D Electromagnetic Articulography (EMA) for computational head stabilisation. The algorithm also provides a well-defined procedure in the case where only two sensors are available for head motion tracking and allows for the combining of position coordinates and orientation angles for head stabilisation with an equal weighting of each kind of information. An evaluation showed that the method using the orientation angles produced the most reliable results.

Collision Threshold Pressure Before and After Vocal Loading
Laura Enflo 1, Johan Sundberg 1, Friedemann Pabst 2; 1 KTH, Sweden; 2 Hospital Dresden Friedrichstadt, Germany
Tue-Ses1-P1-7, Time: 10:00

The phonation threshold pressure (PTP) has been found to increase during vocal fatigue. In the present study we compare PTP and collision threshold pressure (CTP) before and after vocal loading in singer and non-singer voices. Seven subjects repeated the vowel sequence /a,e,i,o,u/ at an SPL of at least 80 dB @ 0.3 m for 20 min. Before and after this loading the subjects' voices were recorded while they produced a diminuendo repeating the syllable /pa/. Oral pressure during the /p/ occlusion was used as a measure of subglottal pressure. Both CTP and PTP increased significantly after the vocal loading.

Gender Differences in the Realization of Vowel-Initial Glottalization
Elke Philburn; University of Manchester, UK
Tue-Ses1-P1-8, Time: 10:00

The aim of the study was to investigate gender-dependent differences in the realization of German glottalized vowel onsets. Laryngographic data of semi-spontaneous speech were collected from four male and four female speakers of Standard German. Measurements of relative vocal fold contact duration were carried out including glottalized vowel onsets as well as non-glottalized controls. The results show that female subjects realized the glottalized vowel onsets with greater maximum vocal fold contact duration than male subjects and that the glottalized vowel onsets produced by females were more clearly distinguished from the non-glottalized controls.

Stability and Composition of Functional Synergies for Speech Movements in Children and Adults
Hayo Terband 1, Frits van Brenk 2, Pascal van Lieshout 3, Lian Nijland 1, Ben Maassen 1; 1 Radboud University Nijmegen Medical Centre, The Netherlands; 2 University of Strathclyde, UK; 3 University of Toronto, Canada
Tue-Ses1-P1-9, Time: 10:00

The consistency and composition of functional synergies for speech movements were investigated in 7-year-old children and adults in a reiterated speech task using electromagnetic articulography (EMA). Results showed higher variability in children for tongue tip and jaw, but not for lower lip movement trajectories. Furthermore, the relative contribution to the oral closure of the lower lip was smaller in children compared to adults, whereas in this respect no difference was found for the tongue tip. These results support and extend findings of non-linearity in speech motor development and illustrate the importance of a multi-measures approach in studying speech motor development.

An Analysis of Speech Rate Strategies in Aging
Frits van Brenk 1, Hayo Terband 2, Pascal van Lieshout 3, Anja Lowit 1, Ben Maassen 2; 1 University of Strathclyde, UK; 2 Radboud University Nijmegen Medical Centre, The Netherlands; 3 University of Toronto, Canada
Tue-Ses1-P1-10, Time: 10:00

Effects of age and speech rate on movement cycle duration were assessed using electromagnetic articulography. In a repetitive task, syllables were articulated at eight rates, obtained by metronome and self-pacing. Results indicate that increased speech rate is associated with increasing movement cycle duration stability, while decreased rate leads to a decrease in uniformity of cycle duration, supporting the view that alterations in speech rate are associated with different motor control strategies involving durational manipulations. The relative contribution of closing movement durations increases with decreasing speech rate, and is a more dominant strategy for elderly speakers.

Variability and Stability in Collaborative Dialogues: Turn-Taking and Filled Pauses
Štefan Beňuš; Constantine the Philosopher University in Nitra, Slovak Republic
Tue-Ses1-P1-11, Time: 10:00

Filled pauses have important and varied functions in turn-taking behavior, and better understanding of this relationship opens new ways for improving the quality and naturalness of dialogue systems. We use a corpus of collaborative task oriented dialogues to provide new insights into the relationship between filled pauses and turn-taking based on temporal and acoustic features. We then explore which of these patterns are stable and robust across speakers, which are prone to entrainment based on conversational partners, and which are variable and noisy. Our findings suggest that intensity is the least stable feature followed by pitch-related features, and temporal features relating filled pauses to chunking and turn-taking are the most stable.

Speaking in the Presence of a Competing Talker
Youyi Lu 1, Martin Cooke 2; 1 University of Sheffield, UK; 2 Ikerbasque, Spain
Tue-Ses1-P1-12, Time: 10:00

How do speakers cope with a competing talker? This study investigated the possibility that speakers are able to retime their contributions to take advantage of temporal fluctuations in the background, reducing any adverse effects for an interlocutor. Speech was produced in quiet, competing talker, modulated noise and stationary backgrounds, with and without a communicative task. An analysis of the timing of contributions relative to the background indicated a significantly reduced chance of overlapping for the modulated noise backgrounds relative to quiet, with competing speech resulting in the least overlap. Strong evidence for an active overlap avoidance strategy is presented.
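The timing analysis described above comes down to measuring how much of the talker's speech activity coincides with activity in the background. A minimal frame-level version of that measurement is sketched below; the activity masks are invented for the example, whereas a real analysis would derive them from the recordings.

    # Sketch: fraction of the talker's active frames that overlap
    # with an active background (frame-level boolean masks).
    import numpy as np

    talker_active = np.array([0, 1, 1, 1, 0, 1, 1, 0, 0, 1], dtype=bool)
    background_active = np.array([1, 1, 0, 0, 0, 1, 0, 0, 1, 1], dtype=bool)

    overlap = talker_active & background_active
    overlap_fraction = overlap.sum() / talker_active.sum()
    print("Proportion of speech overlapping the background: %.2f" % overlap_fraction)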
Tue-Ses1-P2 : Speech Perception II
Hewison Hall, 10:00, Tuesday 8 Sept 2009
Chair: Odette Scharenborg, Radboud Universiteit Nijmegen, The Netherlands

Effect of R-Resonance Information on Intelligibility
Antje Heinrich, Sarah Hawkins; University of Cambridge, UK
Tue-Ses1-P2-1, Time: 10:00

We investigated the importance of phonetic information in preceding syllables for the intelligibility of minimal-pair words containing /r/ or /l/. Target words were cross-spliced into a different token of the same sentence (match) or into a sentence that was identical but originally contained the paired word (mismatch). Young and old adults heard the sentences, casually or carefully spoken, in cafeteria or 12-talker babble. Matched phonetic information in the syllable immediately before the target segment, and in earlier syllables, facilitated intelligibility of r- but not l-words. Despite hearing loss, older adults also used this phonetic information.

Perception of Temporal Cues at Discourse Boundaries
Hsin-Yi Lin, Janice Fon; National Taiwan University, Taiwan
Tue-Ses1-P2-2, Time: 10:00

This study investigates the role of temporal cues in the perception at discourse boundaries. Target cues were penult lengthening, final lengthening, and pause duration. Results showed that different cues are weighted differently for different purposes. Final lengthening is more important for subjects to detect boundaries, while pause duration is more responsible in cuing the boundary sizes.

Human Audio-Visual Consonant Recognition Analyzed with Three Bimodal Integration Models
Zhanyu Ma, Arne Leijon; KTH, Sweden
Tue-Ses1-P2-3, Time: 10:00

With A-V recordings, ten normal-hearing people took recognition tests at different signal-to-noise ratios (SNR). The A-V recognition results are predicted by the fuzzy logical model of perception (FLMP) and the post-labelling integration model (POSTL). We also applied hidden Markov models (HMMs) and multi-stream HMMs (MSHMMs) for the recognition. As expected, all the models agree qualitatively with the results that the benefit gained from the visual signal is larger at lower acoustic SNRs. However, the FLMP severely overestimates the A-V integration result, while the POSTL model underestimates it. Our automatic speech recognizers integrated the audio and visual stream efficiently. The visual automatic speech recognizer could be adjusted to correspond to human visual performance. The MSHMMs combine the audio and visual streams efficiently, but the audio automatic speech recognizer must be further improved to allow precise quantitative comparisons with human audio-visual performance.

Effects of Tempo in Radio Commercials on Young and Elderly Listeners
Hanny den Ouden, Hugo Quené; Utrecht University, The Netherlands
Tue-Ses1-P2-4, Time: 10:00

The aim of the present study is to investigate the effects of tempo manipulations in radio commercials on listeners' evaluation, cognition and persuasion. Questionnaire scores from 131 young and 130 elderly listeners show effects of tempo manipulation on listeners' subjective evaluation, but not on their cognitive scores. Tempo effects on persuasion scores are modulated by the listeners' general disposition towards radio and radio commercials. In sum, it seems that not age but listeners' general disposition is of importance in evaluating tempo manipulation of radio commercials.

Self-Voice Recognition in 4 to 5-Year-Old Children
Sofia Strömbergsson; KTH, Sweden
Tue-Ses1-P2-5, Time: 10:00

Children's ability to recognize their own recorded voice as their own was explored in a group of 4 to 5-year-old children. The task for the children was to identify which one of four voice samples represented their own voice. The results reveal that children perform well above chance level, and that a time span of 1–2 weeks between the recording and the identification does not affect the children's performance. F0 similarity between the participant's recordings and the reference recordings correlated with a higher error rate. Implications for the use of recordings in speech and language therapy are discussed.

Are Real Tongue Movements Easier to Speech Read than Synthesized?
Olov Engwall, Preben Wik; KTH, Sweden
Tue-Ses1-P2-6, Time: 10:00

Speech perception studies with augmented reality displays in talking heads have shown that tongue reading abilities are weak initially, but that subjects become able to extract some information from intra-oral visualizations after a short training session. In this study, we investigate how the nature of the tongue movements influences the results, by comparing synthetic rule-based and actual, measured movements. The subjects were significantly better at perceiving sentences accompanied by real movements, indicating that the current coarticulation model developed for facial movements is not optimal for the tongue.

Eliciting a Hierarchical Structure of Human Consonant Perception Task Errors Using Formal Concept Analysis
Carmen Peláez-Moreno, Ana I. García-Moral, Francisco J. Valverde-Albacete; Universidad Carlos III de Madrid, Spain
Tue-Ses1-P2-7, Time: 10:00

In this paper we have used Formal Concept Analysis to elicit a hierarchical structure of human consonant perception task errors. We have used the Native Listeners experiments provided for the Consonant Challenge session of Interspeech 2008 to analyze perception errors committed in relation to the place of articulation of the consonants being evaluated for one quiet and six noisy acoustic conditions.

Acoustic and Perceptual Effects of Vocal Training in Amateur Male Singing
Takeshi Saitou, Masataka Goto; AIST, Japan
Tue-Ses1-P2-8, Time: 10:00

This paper reports our investigation of the acoustic effects of vocal training for amateur singers and of the contribution of those effects to perceived vocal quality. Recording singing voices before and after vocal training and then analyzing changes in acoustic parameters with a focus on features unique to singing voices, we found that two different F0 fluctuations (vibrato and overshoot) and the singing formant were improved by the training. The results of psychoacoustic experiments showed that perceived voice quality was influenced more by the changes of F0 characteristics than by the changes of spectral characteristics and that acoustic features unique to singing voices contribute to perceived voice quality in the following order: vibrato, singing formant, overshoot, and preparation.

Tue-Ses1-P3 : Speech and Audio Segmentation and Classification
Hewison Hall, 10:00, Tuesday 8 Sept 2009
Chair: S. Umesh, IIT Kanpur, India
Wavelet-Based Speaker Change Detection in Single Channel Speech Data
Michael Wiesenegger, Franz Pernkopf; Graz University of Technology, Austria
Tue-Ses1-P3-1, Time: 10:00

Speaker segmentation is the task of finding speaker turns in an audio stream. We propose a metric-based algorithm based on Discrete Wavelet Transform (DWT) features. Principal component analysis (PCA) or linear discriminant analysis (LDA) [1] are further used to reduce the dimensionality of the feature space and remove redundant information. In the experiments, our methods, referred to as DWT-PCA and DWT-LDA, are compared to the DISTBIC algorithm [2] using clean and noisy data of the TIMIT database. Especially under conditions with strong noise, i.e. -10 dB SNR, our DWT-PCA approach is very robust: the false alarm rate (FAR) increases by ∼2% and the missed detection rate (MDR) stays about the same compared to clean speech, whereas the DISTBIC method fails — the FAR and MDR are almost ∼0% and ∼100%, respectively. For clean speech, DWT-PCA shows an improvement of ∼30% (relative) for both the FAR and MDR in comparison to the DISTBIC algorithm. DWT-LDA performs slightly worse than DWT-PCA.

An Adaptive Threshold Computation for Unsupervised Speaker Segmentation
Laura Docio-Fernandez, Paula Lopez-Otero, Carmen Garcia-Mateo; Universidade de Vigo, Spain
Tue-Ses1-P3-2, Time: 10:00

Reliable speaker segmentation is critical in many applications in the speech processing domain. In this paper, we compare the performance of two speaker segmentation systems: the first one is inspired by a typical state-of-the-art speaker segmentation system, and the other is an improved version of the former system. We show that the proposed system has a better performance as it does not "over-segment" the data. This system includes an algorithm that randomly discards some of the change points with a probability depending on its performance at any moment. Thus, the system merges adjacent segments when they are spoken by the same speaker with a high probability; anytime a change is discarded the discard probability will rise, as the system made a mistake; the opposite will occur when the two adjacent segments belong to different speakers, as there will not be a mistake in this case. We show the improvements of the new system through comparative experiments on data from the Spanish Parliament Sessions defined for the 2006 TC-STAR Automatic Speech Recognition evaluation campaign.

A Data-Driven Approach for Estimating the Time-Frequency Binary Mask
Gibak Kim, Philipos C. Loizou; University of Texas at Dallas, USA
Tue-Ses1-P3-3, Time: 10:00

The ideal binary mask, often used in robust speech recognition applications, requires an estimate of the local SNR in each time-frequency (T-F) unit. A data-driven approach is proposed for estimating the instantaneous SNR of each T-F unit. By assuming that the a priori SNR and a posteriori SNR are uniformly distributed within a small region, the instantaneous SNR is estimated by minimizing the localized Bayes risk. The binary mask estimator derived by the proposed approach is evaluated in terms of hit and false alarm rates. Compared to the binary mask estimator that uses the decision-directed approach to compute the SNR, the proposed data-driven approach yielded substantial improvements (up to 40%) in classification performance, when assessed in terms of a sensitivity metric which is based on the difference between the hit and false alarm rates.

A Semi-Supervised Version of Heteroscedastic Linear Discriminant Analysis
Haolang Zhou, Damianos Karakos, Andreas G. Andreou; Johns Hopkins University, USA
Tue-Ses1-P3-4, Time: 10:00

Heteroscedastic Linear Discriminant Analysis (HLDA) was introduced in [1] as an extension of Linear Discriminant Analysis to the case where the class-conditional distributions have unequal covariances. The HLDA transform is computed such that the likelihood of the training (labeled) data is maximized, under the constraint that the projected distributions are orthogonal to a nuisance space that does not offer any discrimination. In this paper we consider the case of semi-supervised learning, where a large amount of unlabeled data is also available. We derive update equations for the parameters of the projected distributions, which are estimated jointly with the HLDA transform, and we empirically compare it with the case where no unlabeled data are available. Experimental results with synthetic data and real data from a vowel recognition task show that, in most cases, semi-supervised HLDA results in improved performance over HLDA.

Self-Learning Vector Quantization for Pattern Discovery from Speech
Okko Johannes Räsänen, Unto Kalervo Laine, Toomas Altosaar; Helsinki University of Technology, Finland
Tue-Ses1-P3-5, Time: 10:00

A novel and computationally straightforward clustering algorithm was developed for vector quantization (VQ) of speech signals for a task of unsupervised pattern discovery (PD) from speech. The algorithm works in purely incremental mode, is computationally extremely feasible, and achieves comparable classification quality with the well-known k-means algorithm in the PD task. In addition to presenting the algorithm, general findings regarding the relationship between the amounts of training material, convergence of the clustering algorithm, and the ultimate quality of VQ codebooks are discussed.

Monaural Segregation of Voiced Speech Using Discriminative Random Fields
Rohit Prabhavalkar, Zhaozhang Jin, Eric Fosler-Lussier; Ohio State University, USA
Tue-Ses1-P3-6, Time: 10:00

Techniques for separating speech from background noise and other sources of interference have important applications for robust speech recognition and speech enhancement. Many traditional computational auditory scene analysis (CASA) based approaches decompose the input mixture into a time-frequency (T-F) representation, and attempt to identify the T-F units where the target energy dominates that of the interference. This is accomplished using a two-stage process of segmentation and grouping. In this pilot study, we explore the use of Discriminative Random Fields (DRFs) for the task of monaural speech segregation. We find that the use of DRFs allows us to effectively combine multiple auditory features into the system, while simultaneously integrating the two CASA stages into one. Our preliminary results suggest that CASA based approaches may benefit from the DRF framework.

Advancements in Whisper-Island Detection within Normally Phonated Audio Streams
Chi Zhang, John H.L. Hansen; University of Texas at Dallas, USA
Tue-Ses1-P3-7, Time: 10:00

In this study, several improvements are proposed for improved whisper-island detection within normally phonated audio streams. Based on our previous study, an improved feature, which is more sensitive to vocal effort change points between whisper and neutral speech, is developed and utilized in vocal effort change point (VECP) detection and vocal effort classification. Evaluation is based on the proposed multi-error score, where the improved feature showed better performance in VECP detection with the lowest MES of 19.08. Furthermore, a more accurate whisper-island detection was obtained using the improved algorithm. Finally, the experimental detection rate of 95.33% reflects better whisper-island detection performance for the improved algorithm versus that of the original baseline algorithm.

Joint Segmentation and Classification of Dialog Acts Using Conditional Random Fields
Matthias Zimmermann; xbrain.ch, Switzerland
Tue-Ses1-P3-8, Time: 10:00

This paper investigates the use of conditional random fields for joint segmentation and classification of dialog acts exploiting both word and prosodic features that are directly available from a speech recognizer. To validate the approach, experiments are conducted with two different sets of dialog act types under both reference and speech-to-text conditions. Although the proposed framework is conceptually simpler than previous attempts at segmentation and classification of DAs, it outperforms all previous systems for a task based on the ICSI (MRDA) meeting corpus.

Exploring Complex Vowels as Phrase Break Correlates in a Corpus of English Speech with ProPOSEL, a Prosody and POS English Lexicon
Claire Brierley 1, Eric Atwell 2; 1 University of Bolton, UK; 2 University of Leeds, UK
Tue-Ses1-P3-9, Time: 10:00

Real-world knowledge of syntax is seen as integral to the machine learning task of phrase break prediction but there is a deficiency of a priori knowledge of prosody in both rule-based and data-driven classifiers. Speech recognition has established that pauses affect vowel duration in preceding words. Based on the observation that complex vowels occur at rhythmic junctures in poetry, we run significance tests on a sample of transcribed, contemporary British English speech and find a statistically significant correlation between complex vowels and phrase breaks. The experiment depends on automatic text annotation via ProPOSEL, a prosody and part-of-speech English lexicon.
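The significance testing mentioned in the abstract above can be illustrated as a simple test of association between two binary variables: whether a word carries a complex vowel and whether a phrase break follows it. The contingency counts below are invented placeholders; only the procedure is shown.

    # Sketch: chi-square test of association between complex vowels
    # and following phrase breaks (counts are placeholders, not real data).
    from scipy.stats import chi2_contingency

    #                 break   no break
    table = [[120,  380],    # word ends in a complex vowel
             [ 90, 1410]]    # word does not
    chi2, p, dof, expected = chi2_contingency(table)
    print("chi2 = %.1f, p = %.4g" % (chi2, p))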
Automatic Topic Detection of Recorded Voice Messages
Caroline Clemens 1, Stefan Feldes 2, Karlheinz Schuhmacher 1, Joachim Stegmann 1; 1 Deutsche Telekom Laboratories, Germany; 2 T-Systems, Germany
Tue-Ses1-P3-10, Time: 10:00

We present an approach to automatic classification of spontaneously spoken voice messages. During overload periods at call-centers, customers are offered a call-back at a later time. A speech dialog asks them to describe their concern on a voice box. The identified topics correspond to the supported service categories, which in turn determine the agent group the customer message is routed to. Our multistage classification process includes speech-to-text, stemming, keyword spotting, and categorization. Classifier training and evaluation have been performed with real-life data. Results show promising performance. The pilot will be launched in a field test.

Identification and Automatic Detection of Parasitic Speech Sounds
Jindřich Matoušek 1, Radek Skarnitzl 2, Pavel Machač 2, Jan Trmal 1; 1 University of West Bohemia in Pilsen, Czech Republic; 2 Charles University in Prague, Czech Republic
Tue-Ses1-P3-11, Time: 10:00

This paper presents initial experiments with the identification and automatic detection of parasitic sounds in speech signals. The main goal of this study is to identify such sounds in the source recordings for unit-selection-based speech synthesis systems and thus to avoid their unintended usage in synthesised speech. The first part of the paper describes the phonetic analysis and identification of parasitic phenomena in recordings of two Czech speakers. In the second part, experiments with the automatic detection of parasitic sounds using HMM-based and BVM classifiers are presented. The results are encouraging, especially those for glottalization phenomena.

Phonetic Alignment for Speech Synthesis in Under-Resourced Languages
D.R. van Niekerk, Etienne Barnard; CSIR, South Africa
Tue-Ses1-P3-12, Time: 10:00

The rapid development of concatenative speech synthesis systems in resource-scarce languages requires an efficient and accurate solution with regard to automated phonetic alignment. However, in this context corpora are often minimally designed due to a lack of resources and expertise necessary for large-scale development. Under these circumstances many techniques toward accurate segmentation are not feasible and it is unclear which approaches should be followed. In this paper we investigate this problem by evaluating alignment approaches and demonstrating how these approaches can be applied to limit manual interaction while achieving acceptable alignment accuracy with minimal ideal resources.

Improving Initial Boundary Estimation for HMM-Based Automatic Phonetic Segmentation
Kalu U. Ogbureke, Julie Carson-Berndsen; University College Dublin, Ireland
Tue-Ses1-P3-13, Time: 10:00

This paper presents an approach to boundary estimation for automatic segmentation of speech given a phone (sound) sequence. The technique presented represents an extension to existing approaches to Hidden Markov Model based automatic segmentation which modifies the topology of the model to control for duration. An HMM system trained with this modified topology places 77.10%, 86.72% and 91.15% of the boundaries, on the TIMIT speech test corpus annotations, within 10, 15 and 20 ms respectively as compared with manual annotations. This represents an improvement over the baseline result of 70.99%, 83.50% and 89.18% for initial boundary estimation.

Tue-Ses1-P4 : Speaker Recognition and Diarisation
Hewison Hall, 10:00, Tuesday 8 Sept 2009
Chair: Sadaoki Furui, Tokyo Institute of Technology, Japan

Importance of Nasality Measures for Speaker Recognition Data Selection and Performance Prediction
Howard Lei, Eduardo Lopez-Gonzalo; ICSI, USA
Tue-Ses1-P4-1, Time: 10:00

We improve upon measures relating feature vector distributions to speaker recognition (SR) performances for SR performance prediction and arbitrary data selection. In particular, we examine the means and variances of 11 features pertaining to nasality (resulting in 22 measures), computing them on feature vectors of phones to determine which measures give good SR performance prediction of phones. We have found that the combination of nasality measures gives a 0.917 correlation with the Equal Error Rates (EERs) of phones on SRE08, exceeding the correlation of our previous best measure (mutual information) by 12.7%.
When implemented in our data-selection scheme (which does not require an SR system to be run), the nasality measures allow us to select data with combined EER better than data selected via running an SR system in certain cases, at a fortieth of the computational costs. The nasality measures require a tenth of the computational costs compared to our previous best measure.

Exploration of Vocal Excitation Modulation Features for Speaker Recognition
Ning Wang, P.C. Ching, Tan Lee; Chinese University of Hong Kong, China
Tue-Ses1-P4-2, Time: 10:00

To derive spectro-temporal vocal source features complementary to the conventional spectral-based vocal tract features in improving the performance and reliability of a speaker recognition system, the excitation related modulation properties are studied. Through a multi-band demodulation method, source-related amplitude and phase quantities are parameterized into feature vectors. Evaluation of the proposed features is carried out first through a set of designed experiments on artificially generated inputs, and then by simulations on a speech database. It is observed via the designed experiments that the proposed features are capable of capturing the vocal differences in terms of F0 variation, pitch epoch shape, and relevant excitation details between epochs. In the real task simulations, by combination with the standard spectral features, both the amplitude and the phase-related features are shown to evidently reduce the identification error rate and equal error rate in the context of the Gaussian mixture model-based speaker recognition system.

Speaker Identification for Whispered Speech Using Modified Temporal Patterns and MFCCs
Xing Fan, John H.L. Hansen; University of Texas at Dallas, USA
Tue-Ses1-P4-3, Time: 10:00

Speech production variability due to whisper represents a major challenge for effective speech systems. Whisper is used by talkers intentionally in certain circumstances to protect personal privacy. Due to the absence of periodic excitation in the production of whisper, there are considerable differences between neutral and whispered speech in the spectral structure. Therefore, performance of speaker ID systems trained with high-energy voiced phonemes degrades significantly when tested with whisper. This study considers a combination of modified temporal patterns (m-TRAPs) and MFCCs to improve the performance of a neutral-trained system for whispered speech. The m-TRAPs are introduced based on an explanation for the whisper/neutral mismatch degradation of an MFCC based system. A phoneme-by-phoneme score weighting method is used to fuse the score from each subband. Text-independent closed-set speaker ID was conducted and experimental results show that m-TRAPs are especially efficient for whisper with low SNR. When combining scores from both MFCC and TRAPs based GMMs, an absolute 26.3% improvement in accuracy is obtained compared with a traditional MFCC baseline system. This result confirms a viable approach to improving speaker ID performance between neutral/whisper mismatched conditions.

Speaker Diarization for Meeting Room Audio
Hanwu Sun, Tin Lay Nwe, Bin Ma, Haizhou Li; Institute for Infocomm Research, Singapore
Tue-Ses1-P4-4, Time: 10:00

This paper describes a speaker diarization system in the 2007 NIST Rich Transcription (RT07) Meeting Recognition Evaluation for the task of Multiple Distant Microphone (MDM) in meeting room scenarios. The system includes three major modules: data preparation, initial speaker clustering and cluster purification/merging. The data preparation consists of raw-data Wiener filtering and beamforming, Time Difference of Arrival estimation and speech activity detection. Based on the initial processed data, two-stage histogram quantization has been used to perform the initial speaker clustering. A modified purification strategy via a high-order GMM clustering method is proposed. The BIC criterion is applied for cluster merging. The system achieves a competitive overall DER of 8.31% for the RT07 MDM speaker diarization task.

Improving Speaker Segmentation via Speaker Identification and Text Segmentation
Runxin Li, Tanja Schultz, Qin Jin; Carnegie Mellon University, USA
Tue-Ses1-P4-5, Time: 10:00

Speaker segmentation is an essential part of a speaker diarization system. Common segmentation systems usually miss speaker change points when speakers switch fast. These errors seriously confuse the following speaker clustering step and result in high overall speaker diarization error rates. In this paper two methods are proposed to deal with this problem: the first approach uses speaker identification techniques to boost speaker segmentation, and the second approach applies text segmentation methods to improve the performance of speaker segmentation. Experiments on Quaero speaker diarization evaluation data show that our methods achieve up to 45% relative reduction in the speaker diarization error and 64% relative increase in the speaker change detection recall rate over the baseline system. Moreover, both approaches can be considered as post-processing steps over the baseline segmentation; therefore, they can be applied in any speaker diarization system.

Overall Performance Metrics for Multi-Condition Speaker Recognition Evaluations
David A. van Leeuwen; TNO Human Factors, The Netherlands
Tue-Ses1-P4-6, Time: 10:00

In this paper we propose a framework for measuring the overall performance of an automatic speaker recognition system using a set of trials of a heterogeneous evaluation such as NIST SRE-2008, which combines several acoustic conditions in one evaluation. We do this by weighting trials of different conditions according to their relative proportion, and we derive expressions for the basic speaker recognition performance measures Cdet, Cllr, as well as the min DET curve, from which EER and Cdet can be computed. Examples of pooling of conditions are shown on SRE-2008 data, including speaker sex and microphone type and speaking style.

Speaker Identification Using Warped MVDR Cepstral Features
Matthias Wölfel 1, Qian Yang 2, Qin Jin 3, Tanja Schultz 2; 1 ZKM, Germany; 2 Universität Karlsruhe (TH), Germany; 3 Carnegie Mellon University, USA
Tue-Ses1-P4-7, Time: 10:00

It is common practice to use similar or even the same feature extraction methods for automatic speech recognition and speaker identification. While the front-end for the former needs to preserve phoneme discrimination and to compensate for speaker differences to some extent, the front-end for the latter has to preserve the unique characteristics of individual speakers. It seems, therefore, contradictory to use the same feature extraction methods for both tasks. Starting out from the common practice, we propose to use warped minimum variance distortionless response (MVDR) cepstral coefficients, which have already been demonstrated to be superior for automatic speech recognition, in particular under adverse conditions. Replacing the widely used mel-frequency cepstral coefficients by WMVDR cepstral coefficients improves the speaker identification accuracy by up to 24% relative. We found that the optimal choice of the model order within the WMVDR framework differs between speech recognition and speaker recognition, confirming our intuition that the two different tasks indeed require different feature extraction strategies.

Entropy Based Overlapped Speech Detection as a Pre-Processing Stage for Speaker Diarization
Oshry Ben-Harush 1, Itshak Lapidot 2, Hugo Guterman 1; 1 Ben-Gurion University of the Negev, Israel; 2 Sami Shamoon College of Engineering, Israel
Tue-Ses1-P4-8, Time: 10:00

One inherent deficiency of most diarization systems is their inability to handle co-channel or overlapped speech. Most of the suggested algorithms perform under singular conditions and require high computational complexity in both time and frequency domains. In this study, frame-based entropy analysis of the audio data in the time domain serves as a single feature for an overlapped speech detection algorithm. Identification of overlapped speech segments is performed using Gaussian Mixture Modeling (GMM) along with well-known classification algorithms applied to two-speaker conversations. By employing this methodology, the proposed method eliminates the need for setting a hard threshold for each conversation or database. The LDC CALLHOME American English corpus is used for evaluation of the suggested algorithm. The proposed method successfully detects 63.2% of the frames labeled as overlapped speech by the manual segmentation, while keeping a 5.4% false-alarm rate.

Speech Style and Speaker Recognition: A Case Study
Marco Grimaldi, Fred Cummins; University College Dublin, Ireland
Tue-Ses1-P4-9, Time: 10:00

This work presents an experimental evaluation of the effect of different speech styles on the task of speaker recognition. We make use of willfully altered voice extracted from the CHAINS corpus and methodically assess the effect of its use in both testing and training a reference speaker identification system and a reference speaker verification system. In this work we contrast normal readings of text with two varieties of imitative styles and with the familiar, non-imitative, variant of fast speech. Furthermore, we test the applicability of a novel speech parameterization that has been suggested as a promising technique in the task of speaker identification: the pyknogram frequency estimate coefficients — pykfec. The experimental evaluation indicates that both the reference verification and identification systems are affected by variations in style of the speech material used, especially in the case that speech is also mismatched in channel. Our case studies also indicate that the adoption of pykfec as speech encoding methodology has an overall favorable effect on the systems' accuracy scores.

The Majority Wins: A Method for Combining Speaker Diarization Systems
Marijn Huijbregts 1, David A. van Leeuwen 2, Franciska M.G. de Jong 1; 1 University of Twente, The Netherlands; 2 TNO Human Factors, The Netherlands
Tue-Ses1-P4-10, Time: 10:00

In this paper we present a method for combining multiple diarization systems into one single system by applying a majority voting scheme. The voting scheme selects the best segmentation purely on the basis of the output of each system. On our development set of NIST Rich Transcription evaluation meetings, the voting method improves our system on all evaluation conditions. For the single distant microphone condition, DER performance improved by 7.8% (relative) compared to the best input system. For the multiple distant microphone condition the improvement is 3.6%.
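The combination scheme in the preceding abstract can be illustrated at the frame level: each input diarization system proposes a speaker label for every frame, and the combined output keeps the most frequent label. The sketch below assumes the label sets of the different systems have already been mapped onto one another, which in practice is a non-trivial alignment step.

    # Sketch: frame-level majority voting over several diarization outputs.
    # Assumes the speaker labels of the different systems are already aligned.
    from collections import Counter

    system_outputs = [
        ["A", "A", "B", "B", "B", "A"],   # hypothetical system 1
        ["A", "B", "B", "B", "A", "A"],   # hypothetical system 2
        ["A", "A", "B", "A", "B", "A"],   # hypothetical system 3
    ]

    combined = []
    for frame_labels in zip(*system_outputs):
        label, _count = Counter(frame_labels).most_common(1)[0]
        combined.append(label)
    print(combined)   # ['A', 'A', 'B', 'B', 'B', 'A']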
In this study, frame-based entropy analysis of the audio data in the time domain serves as a single feature for an overlapped speech detection algorithm. Identification of overlapped speech segments is performed using Gaussian Mixture Modeling (GMM) along with well-known classification algorithms applied to two-speaker conversations. By employing this methodology, the proposed method eliminates the need for setting a hard threshold for each conversation or database. This paper addresses the task of nuisance reduction in two-wire speaker recognition applications. Besides channel mismatch, two-wire conversations are contaminated by extraneous speakers, which represent an additional source of noise in the supervector domain. It is shown that two-wire nuisance manifests itself as undesirable directions in the inter-speaker subspace. For this purpose, we derive two alternative Nuisance Attribute Projection (NAP) formulations tailored for two-wire sessions. The first formulation generalizes the NAP framework based on a model of two-wire conversations. The second formulation explicitly models the four- vs. two-wire supervector variability. Preliminary experiments show that two-wire NAP significantly outperforms regular NAP in varied two-wire tasks. The LDC CALLHOME American English corpus is used for evaluation of the suggested algorithm. The proposed method successfully detects 63.2% of the frames labeled as overlapped speech by the manual segmentation, while keeping a 5.4% false-alarm rate. the so-called salience of the speech signal samples. The method does not require that the signal is locally periodic or that the average period length is known a priori. Several implementations are considered and discussed. Salience analysis is compared with the auto-correlation method for cycle detection implemented in Praat. Tue-Ses1-S1 : Special Session: Advanced Voice Function Assessment Ainsworth (East Wing 4), 10:00, Tuesday 8 Sept 2009 Chair: Anna Barney, University of Southampton, UK and Mette Pedersen, Medical Centre of Copenhagen, Denmark The Use of Telephone Speech Recordings for Assessment and Monitoring of Cognitive Function in Elderly People Acoustic and High-Speed Digital Imaging Based Analysis of Pathological Voice Contributes to Better Understanding and Differential Diagnosis of Neurological Dysphonias and of Mimicking Phonatory Disorders Viliam Rapcan, Shona D’Arcy, Nils Penard, Ian H. Robertson, Richard B. Reilly; Trinity College Dublin, Ireland Tue-Ses1-S1-4, Time: 11:00 Krzysztof Izdebski 1 , Yuling Yan 2 , Melda Kunduk 3 ; 1 Pacific Voice and Speech Foundation, USA; 2 Stanford University, USA; 3 Louisiana State University, USA Tue-Ses1-S1-1, Time: 10:00 Using Nyquist-plot definitions and HSDI-based analyses of the acoustic and visual database of similarly sounding disordered neurologically driven pathological phonations, we categorized these signals and provided an in-depth explanation of how these sounds differ, and how these sounds are generated at the glottic level. Combined evaluations based on modern technology strengthened our knowledge and improved objective guidelines on how to approach clinical diagnosis by ear, significantly aiding the process of differential diagnosis of complex pathological voice qualities in non-laboratory settings. Cognitive assessment in the clinic is a time-consuming and expensive task. Speech may be employed as a means of monitoring cognitive function in elderly people.
Extraction of speech characteristics from speech recorded remotely over a telephone was investigated and compared to speech characteristics extracted from recordings made in a controlled environment. Results demonstrate that speech characteristics can, with minor changes to the feature extraction algorithm, be reliably extracted from telephone-quality speech (with an overall accuracy of 93.2%). With further development of a fully automated IVR system, an early screening system for cognitive decline may be easily realized. Optimized Feature Set to Assess Acoustic Perturbations in Dysarthric Speech Sunil Nagaraja, Eduardo Castillo-Guerra; University of New Brunswick, Canada Normalized Modulation Spectral Features for Cross-Database Voice Pathology Detection Tue-Ses1-S2-5, Time: 11:20 Maria Markaki 1 , Yannis Stylianou 2 ; 1 University of Crete, Greece; 2 FORTH, Greece Tue-Ses1-S1-2, Time: 10:20 In this paper, we employ normalized modulation spectral analysis for voice pathology detection. Such normalization is important when there is a mismatch between training and testing conditions, or in other words, when employing the detection system in real (testing) conditions. Modulation spectra usually produce a high-dimensionality space. For classification purposes, the size of the original space is reduced using Higher Order Singular Value Decomposition (SVD). Further, we select the most relevant features based on the mutual information between subjective voice quality and the computed features, which leads to a modulation spectra representation adapted to the classification task. For voice pathology detection, the adaptive modulation spectra are combined with an SVM classifier. To simulate real testing conditions, two different databases are used: one for training and the other for testing. We address the difference in signal characteristics between training and testing data through subband normalization of the modulation spectral features. Simulations show that feature normalization enables the cross-database detection of pathological voices even when training and test data are different. Speech Sample Salience Analysis for Speech Cycle Detection C. Mertens, Francis Grenez, Jean Schoentgen; Université Libre de Bruxelles, Belgium Tue-Ses1-S1-3, Time: 10:40 The presentation proposes a method for the measurement of cycle lengths in voiced speech. The background is the study of acoustic cues of slow (vocal tremor) and fast (vocal jitter) perturbations of the vocal frequency. Here, these acoustic cues are obtained by means of a temporal method that detects speech cycles via This paper is focused on the optimization of features derived to characterize the acoustic perturbations encountered in a group of neurological disorders known as Dysarthria. The work derives a set of orthogonal features that enable acoustic analyses of dysarthric speech from eight different Dysarthria types. The feature set is composed of combinations of objective measurements obtained with digital signal processing algorithms and perceptual judgments of the most reliably perceived acoustic perturbations. The effectiveness of the features in providing relevant information about the disorders is evaluated with different classifiers, enabling a classification rate of up to 93.7%.
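To make the subband normalization idea in the cross-database voice pathology abstract above more concrete, the following minimal Python sketch computes a crude modulation spectrum from subband log-energy trajectories and applies per-band mean/variance normalization; the frame sizes, band count and normalization choice are illustrative assumptions, not the authors' configuration.

import numpy as np

def modulation_features(signal, sr, n_bands=8, frame_len=0.025, hop=0.010):
    # Crude modulation-spectrum features: FFT of subband log-energy trajectories.
    fl, hp = int(frame_len * sr), int(hop * sr)
    frames = np.stack([signal[i:i + fl] * np.hanning(fl)
                       for i in range(0, len(signal) - fl, hp)])
    spec = np.abs(np.fft.rfft(frames, axis=1))            # (n_frames, n_freq)
    edges = np.linspace(0, spec.shape[1], n_bands + 1, dtype=int)
    band_energy = np.stack([spec[:, a:b].sum(axis=1) + 1e-10
                            for a, b in zip(edges[:-1], edges[1:])], axis=1)
    traj = np.log(band_energy)                            # (n_frames, n_bands)
    return np.abs(np.fft.rfft(traj, axis=0))              # modulation spectrum per band

def normalize_per_band(mod):
    # Per-subband mean/variance normalization to reduce train/test mismatch.
    mu = mod.mean(axis=0, keepdims=True)
    sd = mod.std(axis=0, keepdims=True) + 1e-10
    return (mod - mu) / sd

if __name__ == "__main__":
    sr = 16000
    x = np.random.randn(sr)          # stand-in for a voice recording
    feats = normalize_per_band(modulation_features(x, sr))
    print(feats.shape)

In a cross-database setting the normalization statistics would be computed separately per corpus, which is the mismatch-reduction effect the abstract describes.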
A Microphone-Independent Visualization Technique for Speech Disorders Andreas Maier 1 , Stefan Wenhardt 1 , Tino Haderlein 1 , Maria Schuster 2 , Elmar Nöth 1 ; 1 FAU Erlangen-Nürnberg, Germany; 2 Universitätsklinikum Erlangen, Germany Tue-Ses1-S2-6, Time: 10:20 In this paper we introduce a novel method for the visualization of speech disorders. We demonstrate the method with disordered speech and a control group. However, both groups were recorded using two different microphones. The projection of the patient data using a single microphone yields significant correlations between the coordinates on the map and certain criteria of the disorder which were perceptually rated. However, projection of data from multiple microphones reduces this correlation. Usually, the acoustical mismatch between the microphones is greater than the mismatch between the speakers, i.e., not the disorders but the microphones form clusters in the visualization. Based on an extension of the Sammon mapping, we are able to create a map which projects the same speakers onto the same position even if Notes 86 multiple microphones are used. Furthermore, our method also restores the correlation between the map coordinates and the perceptual assessment. Evaluation of the Effect of the GSM Full Rate Codec on the Automatic Detection of Laryngeal Pathologies Based on Cepstral Analysis Rubén Fraile, Carmelo Sánchez, Juan I. Godino-Llorente, Nicolás Sáenz-Lechón, Víctor Osma-Ruiz, Juana M. Gutiérrez; Universidad Politécnica de Madrid, Spain Type I & II (13 patients), type III (5 patients), type V (5 patients). Evaluation was done pre and postoperatively for 12 months. The other group was represented by patients with unilateral vocal fold paralysis treated by thyroplasty (17 patients). Evaluation was done before and 3 months postoperatively. Total VHI, emotional and physical subscales improved significantly for type I&II cordectomy and for thyroplasty. VHI can provide an insight into patient’s handicap. Intelligibility Assessment in Children with Cleft Lip and Palate in Italian and German Tue-Ses1-S2-7, Time: 11:20 Advances in speech signal analysis during the last decade have allowed the development of automatic algorithms for a non-invasive detection of laryngeal pathologies. Bearing in mind the extension of these automatic methods to remote diagnosis scenarios, this paper analyzes the performance of a pathology detector based on Mel Frequency Cepstral Coefficients when the speech signal has undergone the distortion of a speech codec such as the GSM FR codec, which is used in one of the nowadays most widespread communications networks. It is shown that the overall performance of the automatic detection of pathologies is degraded less than 5%, and that such degradation is not due to the codec itself, but to the bandwidth limitation needed at its input. These results indicate that the GSM system can be more adequate to implement remote voice assessment than the analogue telephone channel. Marcello Scipioni 1 , Matteo Gerosa 2 , Diego Giuliani 2 , Elmar Nöth 3 , Andreas Maier 3 ; 1 Politecnico di Milano, Italy; 2 FBK, Italy; 3 FAU Erlangen-Nürnberg, Germany Tue-Ses1-S2-10, Time: 11:20 Current research has shown that the speech intelligibility in children with cleft lip and palate (CLP) can be estimated automatically using speech recognition methods. On German CLP data high and significant correlations between human ratings and the recognition accuracy of a speech recognition system were already reported. 
In this paper we investigate whether the approach is also suitable for other languages. Therefore, we compare the correlations obtained on German data with the correlations on Italian data. A high and significant correlation (r=0.76; p < 0.01) was identified on the Italian data. These results do not differ significantly from the results on German data (p > 0.05). Universidade de Aveiro’s Voice Evaluation Protocol Cepstral Analysis of Vocal Dysperiodicities in Disordered Connected Speech Luis M.T. Jesus 1 , Anna Barney 2 , Ricardo Santos 3 , Janine Caetano 4 , Juliana Jorge 5 , Pedro Sá Couto 1 ; 1 Universidade de Aveiro, Portugal; 2 University of Southampton, UK; 3 Hospital Privado da Trofa, Portugal; 4 Agrupamento de Escolas Serra da Gardunha, Portugal; 5 RAIZ, Portugal A. Alpan 1 , Jean Schoentgen 1 , Y. Maryn 2 , Francis Grenez 1 , P. Murphy 3 ; 1 Université Libre de Bruxelles, Belgium; 2 Sint-Jan General Hospital, Belgium; 3 University of Limerick, Ireland Tue-Ses1-S2-8, Time: 11:20 Tue-Ses1-S2-11, Time: 11:20 Several studies have shown that the amplitude of the first rahmonic peak (R1) in the cepstrum is an indicator of hoarse voice quality. The cepstrum is obtained by taking the inverse Fourier Transform of the log-magnitude spectrum. In the present study, a number of spectral analysis processing steps are implemented, including period-synchronous and period-asynchronous analysis, as well as harmonic-synchronous and harmonic-asynchronous spectral band-limitation prior to computing the cepstrum. The analysis is applied to connected speech signals. The correlation between amplitude R1 and perceptual ratings is examined for a corpus comprising 28 normophonic and 223 dysphonic speakers. One observes that the correlation between R1 and perceptual ratings increases when the spectrum is band-limited prior to computing the cepstrum. In addition, comparisons are made with a popular cepstral cue which is the cepstral peak prominence (CPP). This paper presents Universidade de Aveiro’s Voice Evaluation Protocol for European Portuguese (EP), and a preliminary inter-rater reliability study. Ten patients with vocal pathology were assessed, by two Speech and Language Therapists (SLTs). Protocol parameters such as overall severity, roughness, breathiness, change of loudness (CAPE-V), grade, breathiness and strain (GRBAS), glottal attack, respiratory support, respiratory-phonotary-articulatory coordination, digital laryngeal manipulation, voice quality after manipulation, muscular tension and diagnosis, presented high reliability and were highly correlated (good inter-rater agreement and high value of correlation). Values for the overall severity and grade were similar to those reported in the literature. Tue-Ses2-O1 : Automotive and Mobile Applications Standard Information from Patients: The Usefulness Main Hall, 13:30, Tuesday 8 Sept 2009 Chair: Kate Knill, Toshiba Research Europe Ltd., UK of Self-Evaluation (Measured with the French Version of the VHI) Lise Crevier-Buchman 1 , Stephanie Borel 1 , Stéphane Hans 1 , Madeleine Menard 1 , Jacqueline Vaissiere 2 ; 1 Université Paris Descartes, France; 2 LPP, France Fast Speech Recognition for Voice Destination Entry in a Car Navigation System Hoon Chung, JeonGue Park, HyeonBae Jeon, YunKeun Lee; ETRI, Korea Tue-Ses1-S2-9, Time: 11:20 Voice Handicap Index is a scale designed to measure the voice disability in daily life. Two groups of patients were evaluated. 
One group was represented by glottic carcinoma treated by cordectomy Tue-Ses2-O1-1, Time: 13:30 In this paper, we introduce a multi-stage decoding algorithm optimized to recognize a very large number of entry names on a resource-limited embedded device. The multi-stage decoding algorithm is composed of a two-stage HMM-based coarse search and a detailed search. The two-stage HMM-based coarse search generates a small set of candidates that are assumed to contain a correct hypothesis with high probability, and the detailed search re-ranks the candidates by rescoring them with sophisticated acoustic models. In this paper, we conduct experiments with 1 million point-of-interest (POI) names on an in-car navigation device with a fixed-point processor running at 620MHz. The experimental result shows that the multi-stage decoding algorithm runs at about 2.23 times real-time on the device without serious degradation of recognition performance. are accurately retrieved using a vector space model. In evaluating SMS replies within the acoustically challenging environment of automobiles, the voice search approach consistently outperformed using just the recognition results of a statistical language model or a probabilistic context-free grammar. For SMS replies covered by our templates, the approach achieved as high as 89.7% task completion when evaluating the top five reply candidates. Improving Perceived Accuracy for In-Car Media Search Tue-Ses2-O1-5, Time: 14:50 Yun-Cheng Ju, Michael Seltzer, Ivan Tashev; Microsoft Research, USA Tue-Ses2-O1-2, Time: 13:50 Speech recognition technology is prone to mistakes, but this is not the only source of errors that cause speech recognition systems to fail; sometimes the user simply does not utter the command correctly. Usually, user mistakes are not considered when a system is designed and evaluated. This creates a gap between the claimed accuracy of the system and the actual accuracy perceived by the users. We address this issue quantitatively in our in-car infotainment media search task and propose expanding the capability of voice commands to accommodate user mistakes while retaining a high percentage of the performance for queries with correct syntax. As a result, failures caused by user mistakes were reduced by an absolute 70% at the cost of a drop in accuracy of only 0.28%. Laying the Foundation for In-Car Alcohol Detection by Speech Language Modeling for What-with-Where on GOOG-411 Charl van Heerden 1 , Johan Schalkwyk 2 , Brian Strope 2 ; 1 CSIR, South Africa; 2 Google Inc., USA This paper describes the language modeling architectures and recognition experiments that enabled support of ‘what-with-where’ queries on GOOG-411. First we compare accuracy trade-offs between using a single national business LM for business queries and using many small models adapted for particular cities. Experimental evaluations show that both approaches lead to comparable overall accuracy. Differences in the distributions of errors also lead to improvements from a simple combination. We then optimize variants of the national business LM in the context of combined business and location queries from the web, and finally evaluate these models on a recognition test from the recently fielded ‘what-with-where’ system.
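As an illustration of the vector space retrieval of SMS reply templates described above, the following Python sketch ranks a few hypothetical templates against an errorful ASR hypothesis using TF-IDF and cosine similarity; the templates and query are invented stand-ins, not material from the authors' SMS corpus.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical reply templates; the real system derives them from a large SMS corpus.
templates = [
    "i am running late",
    "i will call you back",
    "see you at home",
    "stuck in traffic call you later",
    "yes that works for me",
]

vectorizer = TfidfVectorizer()
template_vecs = vectorizer.fit_transform(templates)

def top_replies(asr_hypothesis, n=3):
    # Rank reply templates against a (possibly errorful) ASR hypothesis.
    query_vec = vectorizer.transform([asr_hypothesis])
    scores = cosine_similarity(query_vec, template_vecs).ravel()
    ranked = scores.argsort()[::-1][:n]
    return [(templates[i], float(scores[i])) for i in ranked]

print(top_replies("running late stuck in the traffic"))

Because retrieval scores whole templates rather than trusting the word-level transcript, a noisy hypothesis can still land on a sensible reply candidate, which is the robustness argument made in the abstract.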
Very Large Vocabulary Voice Dictation for Mobile Devices Jan Nouza, Petr Cerva, Jindrich Zdansky; Technical University of Liberec, Czech Republic Tue-Ses2-O1-6, Time: 15:10 Florian Schiel, Christian Heinrich; LMU München, Germany Tue-Ses2-O1-3, Time: 14:10 The fact that an increasing number of functions in the automobile are and will be controlled by speech of the driver rises the question whether this speech input may be used to detect a possible alcoholic intoxication of the driver. For that matter a large part of the new Alcohol Language Corpus (ALC) edited by the Bavarian Archive of Speech Signals (BAS) will be used for a broad statistical investigation of possible feature candidates for classification. In this contribution we present the motivation and the design of the ALC corpus as well as first results from fundamental frequency and rhythm analysis. Our analysis by comparing sober and alcoholized speech of the same individuals suggests that there are in fact promising features that can automatically be derived from the speech signal during the speech recognition process and will indicate intoxication for most speakers. This paper deals with optimization techniques that can make very large vocabulary voice dictation applications deployable on recent mobile devices. We focus namely on optimization of signal parameterization (frame rate, FFT calculation, fixed-point representation) and on efficient pruning techniques employed on the state and Gaussian mixture level. We demonstrate the applicability of the proposed techniques on the practical design of an embedded 255K-word discrete dictation program developed for Czech. Its real performance is comparable to a client-server version of the fluent dictation program implemented on the same mobile device. Tue-Ses2-O2 : Prosody: Production I Jones (East Wing 1), 13:30, Tuesday 8 Sept 2009 Chair: Fred Cummins, University College Dublin, Ireland Did You Say a BLUE Banana? The Prosody of Contrast and Abnormality in Bulgarian and Dutch Diana V. Dimitrova, Gisela Redeker, John C.J. Hoeks; Rijksuniversiteit Groningen, The Netherlands A Voice Search Approach to Replying to SMS Messages in Automobiles Tue-Ses2-O2-1, Time: 13:30 Yun-Cheng Ju, Tim Paek; Microsoft Research, USA Tue-Ses2-O1-4, Time: 14:30 Automotive infotainment systems now provide drivers the ability to hear incoming Short Message Service (SMS) text messages using text-to-speech. However, the question of how best to allow users to respond to these messages using speech recognition remains unsettled. In this paper, we propose a robust voice search approach to replying to SMS messages based on template matching. The templates are empirically derived from a large SMS corpus and matches In a production experiment on Bulgarian that was based on a previous study on Dutch [1], we investigated the role of prosody when linguistic and extra-linguistic information coincide or contradict. Speakers described abnormally colored fruits in conditions where contrastive focus and discourse relations were varied. We found that the coincidence of contrast and abnormality enhances accentuation in Bulgarian as it did in Dutch. Surprisingly, when both factors are in conflict, the prosodic prominence of abnormality often overruled focus accentuation in both Bulgarian and Dutch, though the languages also show marked differences. Notes 88 A Quantitative Study of F0 Peak Alignment and Sentence Modality Pitch Adaptation in Different Age Groups: Boundary Tones versus Global Pitch Hansjörg Mixdorff 1 , Hartmut R. 
Pfitzinger 2 ; 1 BHT Berlin, Germany; 2 Christian-Albrechts-Universität zu Kiel, Germany Marie Nilsenová, Marc Swerts, Véronique Houtepen, Heleen Dittrich; Tilburg University, The Netherlands Tue-Ses2-O2-2, Time: 13:50 Linguistic adaptation is a process by which interlocutors adjust their production to their environment. In the context of humancomputer interaction, past research showed that adult speakers adapt to computer speech in various manners but less is known about younger age groups. We report the results of three priming experiments in which children in different age groups interacted with a prerecorded computer voice. The goal of the experiments was to determine to what extent children copy the pitch properties of the interlocutor. Based on the dialogue model of Pickering & Garrod, we predicted that children would be more likely to adapt to pitch primes that were meaningful in the context (high or low boundary tone) compared to primes with no apparent functionality (global pitch manipulation). This prediction was confirmed by our data. Moreover, we observed a decreasing trend in adaptation in the older age groups compared to the younger ones. Tue-Ses2-O2-5, Time: 14:50 The current study examines the relationship between prosodic accent labels assigned in the Kiel Corpus of Spontaneous Speech IV, Isačenko’s intoneme classes of the underlying accents and the associated parameters of the Fujisaki model. Among other findings, there is a close connection between early peaks and information intonemes, as well as late peaks and non-terminal intonemes. The majority of tokens within both intoneme classes, however, are associated with medial peaks. Precise analysis of alignment shows that accent command offset times for information intonemes are significantly earlier than for non-terminal intonemes. This suggests that the anchoring of the relevant tonal transition could be more important for separating different intonational categories than that of the F0 peak. Closely Related Languages, Different Ways of Realizing Focus Backchannel-Inviting Cues in Task-Oriented Dialogue Szu-wei Chen 1 , Bei Wang 2 , Yi Xu 3 ; 1 National Chung Cheng University, Taiwan; 2 Minzu University of China, China; 3 University College London, UK Agustín Gravano, Julia Hirschberg; Columbia University, USA Tue-Ses2-O2-6, Time: 15:10 Tue-Ses2-O2-3, Time: 14:10 We investigated how focus was prosodically realized in Taiwanese, Taiwan Mandarin and Beijing Mandarin by monolingual and bilingual speakers. Acoustic analyses showed that all speakers raised pitch and intensity of focused words, but only Beijing Mandarin speakers lowered pitch and intensity of post-focus words. Crossgroup differences in duration were mixed. When listening to stimuli from their own language groups, subjects from Beijing had over 80% focus recognition rate, while those from Taiwan had less than 70% recognition rate. This difference is mainly due to presence/absence of post-focus compression. These findings have implications for prosodic typology, language contact and bilingualism. Cross-Variety Rhythm Typology in Portuguese Plínio A. Barbosa 1 , M. Céu Viana 2 , Isabel Trancoso 3 ; 1 State University of Campinas, Brazil; 2 CLUL, Portugal; 3 INESC-ID Lisboa/IST, Portugal Tue-Ses2-O2-4, Time: 14:30 This paper aims at proposing a measure of speech rhythm based on the inference of the coupling strength between the syllable oscillator and the stress group oscillator of an underlying coupled oscillators model. 
This coupling is inferred from the linear regression between the stress group duration and the number of syllables within the group, as well as from the multiple linear regression between the same parameters and an estimate of phrase stress prominence. This technique is applied to compare the rhythmic differences between European and Brazilian Portuguese in two speaking styles and three speakers per variety. Compared with a syllable-sized normalised PVI, the findings suggest that the coupling strength captures better the perceptual effects of the speakers’ renditions. Furthermore, it shows that stress group duration is much better predicted by adding phrase stress prominence to the regression. We examine backchannel-inviting cues — distinct prosodic, acoustic and lexical events in the speaker’s speech that tend to precede a short response produced by the interlocutor to convey continued attention — in the Columbia Games Corpus, a large corpus of task-oriented dialogues. We show that the likelihood of occurrence of a backchannel increases quadratically with the number of cues conjointly displayed by the speaker. Our results are important for improving the coordination of conversational turns in interactive voice-response systems, so that systems can produce backchannels in appropriate places, and so that they can elicit backchannels from users in expected places. Tue-Ses2-O3 : ASR: Spoken Language Understanding Fallside (East Wing 2), 13:30, Tuesday 8 Sept 2009 Chair: Lin-shan Lee, National Taiwan University, Taiwan What’s in an Ontology for Spoken Language Understanding Silvia Quarteroni, Giuseppe Riccardi, Marco Dinarelli; Università di Trento, Italy Tue-Ses2-O3-1, Time: 13:30 Current Spoken Language Understanding systems rely either on hand-written semantic grammars or on flat attribute-value sequence labeling. In both approaches, concepts and their relations (when modeled at all) are domain-specific, thus making it difficult to expand, port or share the domain model. To address this issue, we introduce: 1) a domain model based on an ontology where concepts are classified into either as predicate or argument; 2) the modeling of relations between such concept classes in terms of classical relations as defined in lexical semantics. We study and analyze our approach on the spoken dialog corpus collected within a problem-solving task in the LUNA project. We evaluate the coverage and relevance of the ontology for the interpretation of spoken utterances. Notes 89 A Fundamental Study of Shouted Speech for Acoustic-Based Security System Semantic Role Labeling with Discriminative Feature Selection for Spoken Language Understanding Hiroaki Nanjo 1 , Hiroki Mikami 1 , Hiroshi Kawano 2 , Takanobu Nishiura 2 ; 1 Ryukoku University, Japan; 2 Ritsumeikan University, Japan Chao-Hong Liu, Chung-Hsien Wu; National Cheng Kung University, Taiwan Tue-Ses2-O3-2, Time: 13:30 In the task of Spoken Language Understanding (SLU), Intent Classification techniques have been applied to different domains of Spoken Dialog Systems (SDS). Recently it was shown that intent classification performance can be improved with Semantic Role (SR) information. However, using SR information for SDS encounters two difficulties: 1) the state-of-the-art Automatic Speech Recognition (ASR) systems provide less than 80% recognition rate, 2) speech always exhibits ungrammatical expressions. This study presents an approach to Semantic Role Labeling (SRL) with discriminative feature selection to improve the performance of SDS. 
Bernoulli event features on word and part-of-speech sequences are introduced for better representation of the ASR-recognized text. SRL and SLU experiments conducted using the CoNLL-2005 SRL corpus and the ATIS spoken corpus show that the proposed feature selection method with Bernoulli event features can improve intent classification by 3.4% and the performance of SRL. Tue-Ses2-O3-6, Time: 13:30 A speech processing system for ensuring safety and security, namely an acoustic-based security system, is addressed. Focusing on indoor security such as school security, we study an advanced acoustic-based system which can discriminate emergency shouts from other speech events based on the understanding of speech events. In this paper, we describe fundamental results on shouted speech. Evaluating the Potential Utility of ASR N-Best Lists for Incremental Spoken Dialogue Systems Timo Baumann, Okko Buß, Michaela Atterer, David Schlangen; Universität Potsdam, Germany Tue-Ses2-O3-3, Time: 13:30 The potential of using ASR n-best lists for dialogue systems has often been recognised (if less often realised): it is often the case that even when the top-ranked hypothesis is erroneous, a better one can be found at a lower rank. In this paper, we describe metrics for evaluating whether the same potential carries over to incremental dialogue systems, where ASR output is consumed and reacted upon while speech is still ongoing. We show that even small N can provide an advantage for semantic processing, at a cost of a computational overhead. Improving the Recognition of Names by Document-Level Clustering Tue-Ses2-O4 : Speaker Diarisation Holmes (East Wing 3), 13:30, Tuesday 8 Sept 2009 Chair: Yannis Stylianou, FORTH, Greece A Study of New Approaches to Speaker Diarization Douglas Reynolds 1 , Patrick Kenny 2 , Fabio Castaldo 3 ; 1 MIT, USA; 2 CRIM, Canada; 3 Politecnico di Torino, Italy Tue-Ses2-O4-1, Time: 13:30 Named entities are of great importance in spoken document processing, but speech recognizers often get them wrong because they are infrequent. A name correction method based on document-level name clustering is proposed in this paper, consisting of three components: named entity detection, name clustering, and name hypothesis selection. We compare the performance of this method to oracle conditions and show that the oracle gain is a 23% reduction in name character error for Mandarin and the automatic approach achieves about 20% of that. This paper reports on work carried out at the 2008 JHU Summer Workshop examining new approaches to speaker diarization. Four different systems were developed and experiments were conducted using summed-channel telephone data from the 2008 NIST SRE. The systems are a baseline agglomerative clustering system, a new Variational Bayes system using eigenvoice speaker models, a streaming system using a mix of low dimensional speaker factors and classic segmentation and clustering, and a new hybrid system combining the baseline system with a new cosine-distance speaker factor clustering. Results are presented using the Diarization Error Rate as well as the EER when using diarization outputs for a speaker detection task. The best configurations of the diarization system produced DERs of 3.5–4.6% and we demonstrate a weak correlation of EER and DER.
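The cosine-distance speaker factor clustering mentioned in the diarization study above can be sketched with off-the-shelf agglomerative clustering; in this minimal Python example the "speaker factor" vectors are random stand-ins and the distance threshold is an arbitrary assumption, not a value from the paper.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy "speaker factor" vectors for speech segments; in the paper these come from
# a factor-analysis front end, here they are synthetic stand-ins for 3 speakers.
rng = np.random.default_rng(0)
segments = np.vstack([rng.normal(loc=m, scale=0.1, size=(20, 50))
                      for m in (0.0, 1.0, -1.0)])

# Average-linkage agglomerative clustering with a cosine distance,
# cut at a hand-picked distance threshold to obtain speaker labels.
Z = linkage(segments, method="average", metric="cosine")
labels = fcluster(Z, t=0.5, criterion="distance")
print(labels)

A real system would additionally resegment the audio and pick the stopping threshold on development data; this only shows the distance measure and clustering step.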
Robust Dependency Parsing for Spoken Language Understanding of Spontaneous Speech Redefining the Bayesian Information Criterion for Speaker Diarisation Frederic Bechet 1 , Alexis Nasr 2 ; 1 LIA, France; 2 LIF, France Themos Stafylakis, Vassilis Katsouros, George Carayannis; Athena Research Center, Greece Bin Zhang, Wei Wu, Jeremy G. Kahn, Mari Ostendorf; University of Washington, USA Tue-Ses2-O3-4, Time: 13:30 Tue-Ses2-O3-5, Time: 13:30 Tue-Ses2-O4-2, Time: 13:50 We describe in this paper a syntactic parser for spontaneous speech geared towards the identification of verbal subcategorization frames. The parser proceeds in two stages. The first stage is based on generic syntactic resources for French. The second stage is a reranker which is specially trained for a given application. The parser is evaluated on the French MEDIA spoken dialogue corpus. A novel approach to the Bayesian Information Criterion (BIC) is introduced. The new criterion redefines the penalty terms of the BIC, such that each parameter is penalized according to the effective sample size it is trained with. Contrary to Local-BIC, the proposed criterion scores overall clustering hypotheses and therefore is not restricted to hierarchical clustering algorithms. Contrary to Global-BIC, it provides a local dissimilarity measure that depends only on the statistics of the examined clusters and not on the overall sample size. We tested our criterion with two benchmark tests and found significant improvement in performance in the speaker diarisation task. Improved Speaker Diarization of Meeting Speech with Recurrent Selection of Representative Speech Segments and Participant Interaction Pattern Modeling Speaker Diarization Using Divide-and-Conquer Shih-Sian Cheng 1,2 , Chun-Han Tseng , Chia-Ping Chen 2 , Hsin-Min Wang 1 ; 1 Academia Sinica, Taiwan; 2 National Sun Yat-Sen University, Taiwan Tue-Ses2-O4-3, Time: 14:10 Speaker diarization systems usually consist of two core components: speaker segmentation and speaker clustering. The current state-of-the-art speaker diarization systems usually apply hierarchical agglomerative clustering (HAC) for speaker clustering after segmentation. However, HAC’s quadratic computational complexity with respect to the number of data samples inevitably limits its application in large-scale data sets. In this paper, we propose a divide-and-conquer (DAC) framework for speaker diarization. It recursively partitions the input speech stream into two sub-streams, performs diarization on them separately, and then combines the diarization results obtained from them using HAC. The results of experiments conducted on RT-02 and RT-03 broadcast news data show that the proposed framework is faster than the conventional segmentation and clustering-based approach while achieving comparable diarization accuracy. Moreover, the proposed framework obtains a higher speedup over the conventional approach on a larger test data set. KL Realignment for Speaker Diarization with Multiple Feature Streams Kyu J. Han, Shrikanth S. Narayanan; University of Southern California, USA Tue-Ses2-O4-6, Time: 15:10 In this work we describe two distinct novel improvements to our speaker diarization system, previously proposed for analysis of meeting speech. The first approach focuses on recurrent selection of representative speech segments for speaker clustering while the other is based on participant interaction pattern modeling.
The former selects speech segments with high relevance to speaker clustering, especially from a robust cluster modeling perspective, and keeps updating them throughout clustering procedures. The latter statistically models conversation patterns between meeting participants and applies it as a priori information when refining diarization results. Experimental results reveal that the two proposed approaches provide performance enhancement by 29.82% (relative) in terms of diarization error rate in tests on 13 meeting excerpts from various meeting speech corpora. Tue-Ses2-P1 : Speech Analysis and Processing II Deepu Vijayasenan, Fabio Valente, Hervé Bourlard; IDIAP Research Institute, Switzerland Hewison Hall, 13:30, Tuesday 8 Sept 2009 Chair: A. Ariyaeeinia, University of Hertfordshire, UK Tue-Ses2-O4-4, Time: 14:30 This paper aims at investigating the use of Kullback-Leibler (KL) divergence based realignment with application to speaker diarization. The use of KL divergence based realignment operates directly on the speaker posterior distribution estimates and is compared with traditional realignment performed using HMM/GMM system. We hypothesize that using posterior estimates to re-align speaker boundaries is more robust than gaussian mixture models in case of multiple feature streams with different statistical properties. Experiments are run on the NIST RT06 data. These experiments reveal that in case of conventional MFCC features the two approaches yields the same performance while the KL based system outperforms the HMM/GMM re-alignment in case of combination of multiple feature streams (MFCC and TDOA). Speech Overlap Detection in a Two-Pass Speaker Diarization System Marijn Huijbregts 1 , David A. van Leeuwen 2 , Franciska M.G. de Jong 1 ; 1 University of Twente, The Netherlands; 2 TNO Human Factors, The Netherlands Tue-Ses2-O4-5, Time: 14:50 In this paper we present the two-pass speaker diarization system that we developed for the NIST RT09s evaluation. In the first pass of our system a model for speech overlap detection is generated automatically. This model is used in two ways to reduce the diarization errors due to overlapping speech. First, it is used in a second diarization pass to remove overlapping speech from the data while training the speaker models. Second, it is used to find speech overlap for the final segmentation so that overlapping speech segments can be generated. The experiments show that our overlap detection method improves the performance of all three of our system configurations. Spectral and Temporal Modulation Features for Phonetic Recognition Stephen A. Zahorian, Hongbing Hu, Zhengqing Chen, Jiang Wu; Binghamton University, USA Tue-Ses2-P1-1, Time: 13:30 Recently, the modulation spectrum has been proposed and found to be a useful source of speech information. The modulation spectrum represents longer term variations in the spectrum and thus implicitly requires features extracted from much longer speech segments compared to MFCCs and their delta terms. In this paper, a Discrete Cosine Transform (DCT) analysis of the log magnitude spectrum combined with a Discrete Cosine Series (DCS) expansion of DCT coefficients over time is proposed as a method for capturing both the spectral and modulation information. These DCT/DCS features can be computed so as to emphasize frequency resolution or time resolution or a combination of the two factors. 
Several variations of the DCT/DCS features were evaluated with phonetic recognition experiments using TIMIT and its telephone version (NTIMIT). The best results obtained with a combined feature set are 73.85% for TIMIT and 62.5% for NTIMIT. The modulation features are shown to be far more important than the spectral features for automatic speech recognition and far more noise robust. Use of Harmonic Phase Information for Polarity Detection in Speech Signals Ibon Saratxaga, Daniel Erro, Inmaculada Hernáez, Iñaki Sainz, Eva Navas; University of the Basque Country, Spain Tue-Ses2-P1-2, Time: 13:30 Phase information resulting from the harmonic analysis of speech can be used very successfully to determine the polarity of a voiced speech segment. In this paper we present two algorithms which calculate the signal polarity from this information. One is based on the effect of the glottal signal on the phase of the first harmonics and the other on the relative phase shifts between the harmonics. The detection rates of these two algorithms are compared against other established algorithms. Finite Mixture Spectrogram Modeling for Multipitch Tracking Using A Factorial Hidden Markov Model vector quantization. The sequence of codebook indices, the pitch contour and the energy contour derived from the TM signal are used to store/transmit the TM speech information efficiently. At the receiver, the all-pole system corresponding to the estimated CSM spectral vectors is excited by a synthetic residual to generate the speech signal. Analysis of Lombard Speech Using Excitation Source Information Michael Wohlmayr, Franz Pernkopf; Graz University of Technology, Austria G. Bapineedu, B. Avinash, Suryakanth V. Gangashetty, B. Yegnanarayana; IIIT Hyderabad, India Tue-Ses2-P1-3, Time: 13:30 In this paper, we present a simple and efficient feature modeling approach for tracking the pitch of two speakers speaking simultaneously. We model the spectrogram features using Gaussian Mixture Models (GMMs) in combination with the Minimum Description Length (MDL) model selection criterion. This enables the number of Gaussian components to be determined automatically depending on the available data for a specific pitch pair. A factorial hidden Markov model (FHMM) is applied for tracking. We compare our approach to two methods based on correlogram features [1]. Those methods use either an HMM [1] or an FHMM [7] for tracking. Experimental results on the Mocha-TIMIT database [2] show that our proposed approach significantly outperforms the correlogram-based methods for speech utterances mixed at 0dB. The superior performance even holds when adding white Gaussian noise to the mixed speech utterances during pitch tracking. Group-Delay-Deviation Based Spectral Analysis of Speech Tue-Ses2-P1-6, Time: 13:30 This paper examines the Lombard effect on the excitation features in speech production. These features correspond mostly to the acoustic features at the subsegmental (< pitch period) level. The instantaneous fundamental frequency F0 (i.e., pitch), the strength of excitation at the instants of significant excitation and a loudness measure reflecting the sharpness of the impulse-like excitation around epochs are used to represent the excitation features at the subsegmental level. The Lombard effect influences the pitch and the loudness. The extent of the Lombard effect on speech depends on the nature and level (or intensity) of the external feedback that causes the Lombard effect.
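A minimal Python sketch of the DCT/DCS idea from the spectral and temporal modulation features abstract above: a DCT over each frame's log-magnitude spectrum followed by a cosine-series expansion of the coefficient trajectories over a block of frames. The frame, block and coefficient counts are illustrative assumptions rather than the configuration used in the paper.

import numpy as np
from scipy.fft import dct

def dct_dcs_features(signal, sr, n_dct=13, n_dcs=6,
                     frame_len=0.025, hop=0.010, block_frames=25):
    # DCT over each frame's log-magnitude spectrum, then a cosine-series (DCT)
    # expansion of every coefficient trajectory across a block of frames.
    fl, hp = int(frame_len * sr), int(hop * sr)
    frames = np.stack([signal[i:i + fl] * np.hamming(fl)
                       for i in range(0, len(signal) - fl, hp)])
    logmag = np.log(np.abs(np.fft.rfft(frames, axis=1)) + 1e-10)
    cep = dct(logmag, type=2, norm="ortho", axis=1)[:, :n_dct]   # spectral DCT
    blocks = []
    for start in range(0, cep.shape[0] - block_frames + 1, block_frames):
        block = cep[start:start + block_frames]                   # (time, n_dct)
        traj = dct(block, type=2, norm="ortho", axis=0)[:n_dcs]   # temporal DCS
        blocks.append(traj.ravel())                               # n_dct * n_dcs values
    return np.array(blocks)

if __name__ == "__main__":
    sr = 16000
    feats = dct_dcs_features(np.random.randn(2 * sr), sr)
    print(feats.shape)   # (n_blocks, 13 * 6)

Shifting resolution toward frequency or time, as the abstract discusses, would correspond here to trading off n_dct against n_dcs and the block length.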
A Comparison of Linear and Nonlinear Dimensionality Reduction Methods Applied to Synthetic Speech Andrew Errity, John McKenna; Dublin City University, Ireland Anthony Stark, Kuldip Paliwal; Griffith University, Australia Tue-Ses2-P1-7, Time: 13:30 Tue-Ses2-P1-4, Time: 13:30 In this paper, we investigate a new method for extracting useful information from the group delay spectrum of speech. The group delay spectrum is often poorly behaved and noisy. In the literature, various methods have been proposed to address this problem. However, to make the group delay a more tractable function, these methods have typically relied upon some modification of the underlying speech signal. The method proposed in this paper does not require such modifications. To accomplish this, we investigate a new function derived from the group delay spectrum, namely the group delay deviation. We use it for both narrowband analysis and wideband analysis of speech and show that this function exhibits meaningful formant and pitch information. Speaker Dependent Mapping for Low Bit Rate Coding of Throat Microphone Speech In this study a number of linear and nonlinear dimensionality reduction methods are applied to high dimensional representations of synthetic speech to produce corresponding low dimensional embeddings. Several important characteristics of the synthetic speech, such as formant frequencies and f0, are known and controllable prior to dimensionality reduction. The degree to which these characteristics are retained after dimensionality reduction is examined in visualisation and classification experiments. Results of these experiments indicate that each method is capable of discovering meaningful low dimensional representations of synthetic speech and that the nonlinear methods may outperform linear methods in some cases. ZZT-Domain Immiscibility of the Opening and Closing Phases of the LF GFM Under Frame Length Variations C.F. Pedersen, O. Andersen, P. Dalsgaard; Aalborg University, Denmark Anand Joseph M. 1 , B. Yegnanarayana 1 , Sanjeev Gupta 2 , M.R. Kesheorey 2 ; 1 IIIT Hyderabad, India; 2 Center for Artificial Intelligence & Robotics, India Tue-Ses2-P1-8, Time: 13:30 Tue-Ses2-P1-5, Time: 13:30 Throat microphones (TM) which are robust to background noise can be used in environments with high levels of background noise. Speech collected using TM is perceptually less natural. The objective of this paper is to map the spectral features (represented in the form of cepstral features) of TM and close speaking microphone (CSM) speech to improve the former’s perceptual quality, and to represent it in an efficient manner for coding. The spectral mapping of TM and CSM speech is done using a multilayer feed-forward neural network, which is trained from features derived from TM and CSM speech. The sequence of estimated CSM spectral features is quantized and coded as a sequence of codebook indices using Current research has proposed a non-parametric speech waveform representation (rep) based on zeros of the z-transform (ZZT) [1] [2]. Empirically, the ZZT rep has successfully been applied in discriminating the glottal and vocal tract components in pitchsynchronously windowed speech by using the unit circle (UC) as discriminant [1] [2]. Further, similarity between ZZT reps of windowed speech, glottal flow waveforms, and waveforms of glottal flow opening and closing phases has been demonstrated [1] [3]. 
Therefore, the underlying cause of the separation on either side of the UC can be analyzed via the individual ZZT reps of the opening and closing phase waveforms; the waveforms are generated by the LF glottal flow model (GFM) [1]. The present paper demonstrates this cause and effect analytically and thereby supplements the previous empirical works. Moreover, this paper demonstrates that immiscibility is variant under changes in frame lengths; lengths that maximize or minimize immiscibility are presented. Finally, the distribution of speaker-specific information is analyzed for wideband speech. Dimension Reducing of LSF Parameters Based on Radial Basis Function Neural Network Artificial Nasalization of Speech Sounds Based on Pole-Zero Models of Spectral Relations Between Mouth and Nose Signals Hongjun Sun, Jianhua Tao, Huibin Jia; Chinese Academy of Sciences, China Karl Schnell, Arild Lacroix; Goethe-Universität Frankfurt, Germany Tue-Ses2-P1-9, Time: 13:30 Tue-Ses2-P1-12, Time: 13:30 In this paper, we investigate a novel method for transforming line spectral frequency (LSF) parameters into lower-dimensional coefficients. A radial basis function neural network (RBF NN) based transforming model is used to fit the LSF vectors. In the training process, two criteria, mean squared error and weighted mean squared error, are used to measure the distance between the original vector and the approximate vector. Besides, features of the LSF parameters are taken into account to supervise the training process. As a result, the LSF vectors are represented by the coefficient vectors of the transforming model. The experimental results reveal that a 24-order LSF vector can be transformed into a 15-dimension coefficient vector with an average spectral distortion of approximately 1dB. Subjective evaluation shows that the transforming method in this paper does not lead to a significant decrease in voice quality. In this contribution, a method for the nasalization of speech sounds is proposed based on model-based spectral relations between mouth and nose signals. For that purpose, the mouth and nose signals of speech utterances are recorded simultaneously. The spectral relations of the mouth and nose signals are modeled by pole-zero models. Filtering non-nasalized speech signals by these pole-zero models yields approximately nasal signals, which can be utilized to nasalize the speech signals. The artificial nasalization can be exploited to modify speech units of a non-nasalized or weakly nasalized representation which should be nasalized due to coarticulation or for the production of foreign words. Characterizing Speaker Variability Using Spectral Envelopes of Vowel Sounds Andrew Hines, Naomi Harte; Trinity College Dublin, Ireland Error Metrics for Impaired Auditory Nerve Responses of Different Phoneme Groups A.N. Harish, D.R. Sanand, S. Umesh; IIT Kanpur, India Tue-Ses2-P1-13, Time: 13:30 Tue-Ses2-P1-10, Time: 13:30 An auditory nerve model allows faster investigation of new signal processing algorithms for hearing aids. This paper presents a study of the degradation of auditory nerve (AN) responses at a phonetic level for a range of sensorineural hearing losses and flat audiograms. The AN model of Zilany & Bruce was used to compute responses to a diverse set of phoneme-rich sentences from the TIMIT database. The characteristics of both the average discharge rate and spike timing of the responses are discussed.
The experiments demonstrate that a mean absolute error metric provides a useful measure of average discharge rates but a more complex measure is required to capture spike timing response errors. In this paper, we present a study to understand the relation among spectra of speakers enunciating the same sound and investigate the issue of uniform versus non-uniform scaling. There is a lot of interest in understanding this relation as speaker variability is a major source of concern in many applications including Automatic Speech Recognition (ASR). Using dynamic programming, we find mapping relations between smoothed spectral envelopes of speakers enunciating the same sound and show that these relations are not linear but have a consistent non-uniform behavior. This non-uniform behavior is also shown to vary across vowels. Through a series of experiments, we show that using the observed non-uniform relation provides better vowel normalization than just a simple linear scaling relation. All results in this paper are based on vowel data from TIMIT, Hillenbrand et al. and North Texas databases. Tue-Ses2-P2 : Speech Processing with Audio or Audiovisual Input Hewison Hall, 13:30, Tuesday 8 Sept 2009 Chair: Robert I. Damper, University of Southampton, UK Analysis of Band Structures for Speaker-Specific Information in FM Feature Extraction Tharmarajah Thiruvaran, Eliathamby Ambikairajah, Julien Epps; University of New South Wales, Australia Application of Differential Microphone Array for IS-127 EVRC Rate Determination Algorithm Tue-Ses2-P1-11, Time: 13:30 Henry Widjaja, Suryoadhi Wibowo; Institut Teknologi Telkom, Indonesia Frequency modulation (FM) features are typically extracted using a filterbank, usually based on an auditory frequency scale, however there is psychophysical evidence to suggest that this scale may not be optimal for extracting speaker-specific information. In this paper, speaker-specific information in FM features is analyzed as a function of the filterbank structure at the feature, model and classification stages. Scatter matrix based separation measures at the feature level and Kullback-Leibler distance based measures at the model level are used to analyze the discriminative contributions of the different bands. Then a series of speaker recognition experiments are performed to study how each band of the FM feature contributes to speaker recognition. A new filter bank structure is proposed that attempts to maximize the speaker-specific information in the FM feature for telephone data. Tue-Ses2-P2-1, Time: 13:30 Differential microphone array is known to have low sensitivity to distant sound sources. Such characteristics may be advantageous in voice activity detection where it can be assumed that the target speaker is close and background noise sources are distant. In this paper we develop a simple modification to the EVRC rate determination algorithm (EVRC RDA) to exploit the noise-canceling property of differential microphone array to improve its performance in highly dynamic noise environment. Comprehensive computer simulations show that the modified algorithm outperforms the original EVRC RDA in all tested noise conditions. 
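The dynamic-programming mapping between speakers' spectral envelopes described in the speaker variability abstract above can be illustrated with a small DTW routine; the Gaussian-shaped envelopes below are synthetic stand-ins for smoothed vowel spectra, and the recovered path plays the role of the (non-uniform) frequency-warping relation discussed there.

import numpy as np

def dtw_path(x, y):
    # Dynamic-programming alignment between two 1-D envelopes (e.g. smoothed
    # log-spectra of two speakers saying the same vowel).
    n, m = len(x), len(y)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(x[i - 1] - y[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    # Backtrack to recover the frequency-warping relation between the two spectra.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return cost[n, m], path[::-1]

# Two toy envelopes: the second is a broadened, shifted version of the first,
# mimicking non-uniform spectral scaling between speakers.
f = np.linspace(0, 1, 64)
env_a = np.exp(-((f - 0.3) ** 2) / 0.01)
env_b = np.exp(-((f - 0.4) ** 2) / 0.02)
total_cost, warp = dtw_path(env_a, env_b)
print(total_cost, warp[:5])

If the two speakers were related by a purely uniform scaling, the warp path would be close to a straight line; deviations from that line are what the study interprets as non-uniform behavior.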
Notes 93 Estimating the Position and Orientation of an Acoustic Source with a Microphone Array Network A Non-Intrusive Signal-Based Model for Speech Quality Evaluation Using Automatic Classification of Background Noises Alberto Yoshihiro Nakano, Seiichi Nakagawa, Kazumasa Yamamoto; Toyohashi University of Technology, Japan Adrien Leman 1 , Julien Faure 1 , Etienne Parizet 2 ; 1 Orange Labs, France; 2 LVA, France Tue-Ses2-P2-2, Time: 13:30 Tue-Ses2-P2-5, Time: 13:30 We propose a method that finds the position and orientation of an acoustic source in an enclosed environment. For each of eight T-shaped arrays forming a microphone array network, the time delay of arrival (TDOA) of signals from microphone pairs, a source position candidate, and energy related features are estimated. These form the input for artificial neural networks (ANNs), the purpose of which is to provide indirectly a more precise position of the source and, additionally, to estimate the source’s orientation using various combinations of the estimated parameters. The best combination of parameters (TDOAs and microphone positions) yields a 21.8% reduction in the mean average position error compared to baselines, and a correct orientation ratio higher than 99.0%. The position estimation baselines include two estimation methods: a TDOA-based method that finds the source position geometrically, and the SRP-PHAT that finds the most likely source position by spatial exploration. This paper describes an original method for speech quality evaluation in the presence of different types of background noises for a range of communications (mobile, VoIP, RTC). The model is obtained from subjective experiments described in [1]. These experiments show that background noise can be more or less tolerated by listeners, depending on the sources of noise that can be identified. Using a classification method, the background noises can be classified into four groups. For each one of the four groups, a relation between loudness of the noise and speech quality is proposed. Singing Voice Detection in Polyphonic Music using Predominant Pitch Acoustic Event Detection for Spotting “Hot Spots” in Podcasts Kouhei Sumi 1 , Tatsuya Kawahara 1 , Jun Ogata 2 , Masataka Goto 2 ; 1 Kyoto University, Japan; 2 AIST, Japan Tue-Ses2-P2-6, Time: 13:30 Vishweshwara Rao, S. Ramakrishnan, Preeti Rao; IIT Bombay, India Tue-Ses2-P2-3, Time: 13:30 This paper demonstrates the superiority of energy-based features derived from the knowledge of predominant-pitch, for singing voice detection in polyphonic music over commonly used spectral features. However, such energy-based features tend to misclassify loud, pitched instruments. To provide robustness to such accompaniment we exploit the relative instability of the pitch contour of the singing voice by attenuating harmonic spectral content belonging to stable-pitch instruments, using sinusoidal modeling. The obtained feature shows high classification accuracy when applied to north Indian classical music data and is also found suitable for automatic detection of vocal-instrumental boundaries required for smoothing the frame-level classifier decisions. Word Stress Assessment for Computer Aided Language Learning This paper presents a method to detect acoustic events that can be used to find “hot spots” in podcast programs. We focus on meaningful non-verbal audible reactions which suggest hot spots such as laughter and reactive tokens. 
In order to detect this kind of short event and segment the corresponding utterances, we need accurate audio segmentation and classification, dealing with various recording environments and background music. Thus, we propose a method for automatically estimating and switching penalty weights for the BIC-based segmentation depending on the background environment. Experimental results show a significant improvement in detection accuracy by the proposed method compared to using a constant penalty weight. Improving Detection of Acoustic Events Using Audiovisual Data and Feature Level Fusion T. Butko, C. Canton-Ferrer, C. Segura, X. Giró, C. Nadeu, J. Hernando, J.R. Casas; Universitat Politècnica de Catalunya, Spain Tue-Ses2-P2-7, Time: 13:30 Juan Pablo Arias, Nestor Becerra Yoma, Hiram Vivanco; Universidad de Chile, Chile Tue-Ses2-P2-4, Time: 13:30 In this paper an automatic word stress assessment system is proposed based on a top-to-bottom scheme. The method presented is text and language independent. The utterance pronounced by the student is directly compared with a reference one. The trend similarity of the F0 and energy contours is compared frame-by-frame by using DTW alignment. The stress assessment evaluation system gives an EER equal to 21.5%, which in turn is similar to the error observed in phonetic quality evaluation schemes. These results suggest that the proposed system can be employed in real applications and is applicable to any language. The detection of the acoustic events (AEs) that are naturally produced in a meeting room may help to describe the human and social activity that takes place in it. When applied to spontaneous recordings, the detection of AEs from only audio information shows a large amount of errors, which are mostly due to temporal overlapping of sounds. In this paper, a system to detect and recognize AEs using both audio and video information is presented. A feature-level fusion strategy is used, and the structure of the HMM-GMM based system considers each class separately and uses a one-against-all strategy for training. Experimental AED results with a new and rather spontaneous dataset are presented which show the advantage of the proposed approach. Detecting Audio Events for Semantic Video Search M. Bugalho 1 , J. Portêlo 2 , Isabel Trancoso 1 , T. Pellegrini 2 , Alberto Abad 2 ; 1 INESC-ID Lisboa/IST, Portugal; 2 INESC-ID Lisboa, Portugal Tue-Ses2-P2-8, Time: 13:30 This paper describes our work on audio event detection, one of our tasks in the European project VIDIVIDEO. Preliminary experiments with a small corpus of sound effects have shown the potential of this type of corpus for training purposes. This paper describes our experiments with SVM classifiers, and different features, using a 290-hour corpus of sound effects, which allowed us to build detectors for almost 50 semantic concepts. Although the performance of these detectors on the development set is quite good (achieving an average F-measure of 0.87), preliminary experiments on documentaries and films showed that the task is much harder in real-life videos, which so often include overlapping audio events.
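Several of the segmentation systems above rely on a BIC criterion with a tunable penalty weight. The following Python sketch computes a standard delta-BIC score for a candidate change point under full-covariance Gaussian models; the penalty weight lam plays the role of the weight that the podcast hot-spot system switches according to the background environment, and the data are synthetic.

import numpy as np

def delta_bic(X, change, lam=1.0):
    # BIC gain for splitting feature window X (frames x dims) at 'change'.
    # Positive values favour an acoustic/speaker change point; 'lam' is the
    # tunable penalty weight.
    def logdet_cov(Z):
        cov = np.cov(Z, rowvar=False) + 1e-6 * np.eye(Z.shape[1])
        return np.linalg.slogdet(cov)[1]
    n, d = X.shape
    n1, n2 = change, n - change
    penalty = 0.5 * (d + 0.5 * d * (d + 1)) * np.log(n)
    return (0.5 * n * logdet_cov(X)
            - 0.5 * n1 * logdet_cov(X[:change])
            - 0.5 * n2 * logdet_cov(X[change:])
            - lam * penalty)

rng = np.random.default_rng(1)
seg = np.vstack([rng.normal(0, 1, (200, 12)), rng.normal(3, 1, (200, 12))])
print(delta_bic(seg, 200))                           # clearly positive: a change
print(delta_bic(rng.normal(0, 1, (400, 12)), 200))   # near or below zero: no change

Raising lam suppresses spurious change points in noisy or music-heavy backgrounds at the cost of missing genuine ones, which is the trade-off the adaptive penalty scheme above is designed to manage.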
Here, visual evidence of a person is utilized through a priori state probabilities, while location and speaker change information are employed via time-variant transition probabilities. Experiments show that video information yields a substantial improvement compared to pure audio-based diarization. Factor Analysis for Audio-Based Video Genre Classification Multimodal Speaker Verification Using Ancillary Known Speaker Characteristics Such as Gender or Age Mickael Rouvier, Driss Matrouf, Georges Linarès; LIA, France Girija Chetty, Michael Wagner; University of Canberra, Australia Tue-Ses2-P2-9, Time: 13:30 Tue-Ses2-P2-12, Time: 13:30 Statistical classifiers operate on features that generally include both useful and useless information. These two types of information are difficult to separate in the feature domain. Recently, a new paradigm based on a Latent Factor Analysis (LFA) proposed a model decomposition into useful and useless components. This method was successfully applied to speaker and language recognition tasks. In this paper, we study the use of LFA for video genre classification by using only the audio channel. We propose a classification method based on short-term cepstral features and Gaussian Mixture Models (GMM) or Support Vector Machine (SVM) classifiers, that are combined with Factor Analysis (FA). Experiments are conducted on a corpus composed of 5 types of video (musics, commercials, cartoons, movies and news). The relative classification error reduction obtained by using the best factor analysis configuration with respect to the baseline system, Gaussian Mixture Model Universal Background Model (GMM-UBM), is about 56%, corresponding to a correct identification rate of about 90%. Multimodal speaker verification based on easy-to-obtain biometric traits such as face and voice is rapidly gaining acceptance as the preferred technology for many applications. In many such practical applications, other characteristics of the speaker such as gender or age are known and may be exploited for enhanced verification accuracy. In this paper we present a parallel approach determining gender as an ancillary speaker characteristic, which is incorporated in the decision of a face-voice speaker verification system. Preliminary experiments with the DaFEx multimodal audio-video database show that fusing the results of gender recognition and identity verification improves the performance of multimodal speaker verification. Robust Audio-Based Classification of Video Genre Mickael Rouvier, Georges Linarès, Driss Matrouf; LIA, France Tue-Ses2-P2-10, Time: 13:30 Discovering Keywords from Cross-Modal Input: Ecological vs. Engineering Methods for Enhancing Acoustic Repetitions Guillaume Aimetti 1 , Roger K. Moore 1 , L. ten Bosch 2 , Okko Johannes Räsänen 3 , Unto Kalervo Laine 3 ; 1 University of Sheffield, UK; 2 Radboud Universiteit Nijmegen, The Netherlands; 3 Helsinki University of Technology, Finland Tue-Ses2-P2-13, Time: 13:30 Video genre classification is a challenging task in a global context of fast growing video collections available on the Internet. This paper presents a new method for video genre identification by audio analysis. Our approach relies on the combination of low and high level audio features. We investigate the discriminative capacity of features related to acoustic instability, speaker interactivity, speech quality and acoustic space characterization. The genre identification is performed on these features by using a SVM classifier. 
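As a rough illustration of the generic pipeline behind such audio-based genre classification (and not the LIA system described above), the sketch below pools frame-level cepstral features into one fixed-length vector per clip and trains an SVM; the feature extraction itself, the genre labels and all hyperparameters are placeholders.

```python
# Minimal sketch of an audio-based genre classifier: pool frame-level
# cepstral features per clip, then train an SVM. Feature extraction is
# assumed done elsewhere; class names and shapes are illustrative only.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

GENRES = ["music", "commercial", "cartoon", "movie", "news"]

def clip_vector(mfcc_frames: np.ndarray) -> np.ndarray:
    """Pool a (n_frames, n_ceps) MFCC matrix into one fixed-length vector."""
    return np.concatenate([mfcc_frames.mean(axis=0), mfcc_frames.std(axis=0)])

def train_genre_svm(clips, labels):
    """clips: list of (n_frames, n_ceps) arrays; labels: genre indices."""
    X = np.vstack([clip_vector(c) for c in clips])
    model = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=10.0))
    model.fit(X, labels)
    return model

# Usage with synthetic data and two pseudo-genres:
rng = np.random.default_rng(0)
clips = [rng.normal(loc=g, size=(200, 13)) for g in (0, 1) for _ in range(20)]
labels = [g for g in (0, 1) for _ in range(20)]
model = train_genre_svm(clips, labels)
print(model.predict(clip_vector(clips[0])[None, :]))  # expected: class 0
```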
Experiments are conducted on a corpus composed from cartoons, movies, news, commercials and musics on which we obtain an identification rate of 91%. Fusing Audio and Video Information for Online Speaker Diarization This paper introduces a computational model that automatically segments acoustic speech data and builds internal representations of keyword classes from cross-modal (acoustic and pseudo-visual) input. Acoustic segmentation is achieved using a novel dynamic time warping technique and the focus of this paper is on recent investigations conducted to enhance the identification of repeating portions of speech. This ongoing research is inspired by current cognitive views of early language acquisition and therefore strives for ecological plausibility in an attempt to build more robust speech recognition systems. Results show that an ad-hoc computationally engineered solution can aid the discovery of repeating acoustic patterns. However, we show that this improvement can be simulated in a more ecologically valid way. Joerg Schmalenstroeer, Martin Kelling, Volker Leutnant, Reinhold Haeb-Umbach; Universität Paderborn, Germany Tue-Ses2-P2-11, Time: 13:30 In this paper we present a system for identifying and localizing speakers using distant microphone arrays and a steerable pan-tilt-zoom camera. Audio and video streams are processed in real-time to obtain the diarization information “who speaks when and where” with low latency to be used in advanced video Notes 95 than 11 × speedup compared to a highly optimized sequential implementation on Intel Core i7 without sacrificing accuracy. Tue-Ses2-P3 : ASR: Decoding and Confidence Measures Hewison Hall, 13:30, Tuesday 8 Sept 2009 Chair: Kai Yu, University of Cambridge, UK Combined Low Level and High Level Features for Out-of-Vocabulary Word Detection Incremental Composition of Static Decoding Graphs Benjamin Lecouteux 1 , Georges Linarès 1 , Benoit Favre 2 ; 1 LIA, France; 2 ICSI, USA Miroslav Novák; IBM T.J. Watson Research Center, USA Tue-Ses2-P3-4, Time: 13:30 Tue-Ses2-P3-1, Time: 13:30 This paper addresses the issue of Out-Of-Vocabulary (OOV) word detection in Large Vocabulary Continuous Speech Recognition (LVCSR) systems. We propose a method inspired by confidence measures, that consists in analyzing the recognition system outputs in order to automatically detect errors due to OOV words. This method combines various features based on acoustic, linguistic, decoding graph and semantics. We evaluate separately each feature and we estimate their complementarity. Experiments are conducted on a large French broadcast news corpus from the ESTER evaluation campaign. Results show good performance in real conditions: the method obtains an OOV word detection rate of 43%–90% with 2.5%–17.5% of false detection. A fast, scalable and memory-efficient method for static decoding graph construction is presented. As an alternative to the traditional transducer-based approach, it is based on incremental composition. Memory efficiency is achieved by combining composition, determinization and minimization into a single step, thus eliminating large intermediate graphs. We have previously reported the use of incremental composition limited to grammars and left cross-word context [1]. Here, this approach is extended to n-gram models with explicit ε arcs and right cross-word context. 
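The graph-construction ideas in the abstract above rest on transducer composition; as a hedged, much-simplified sketch of that operation alone, the code below composes two epsilon-free transducers lazily, expanding state pairs on demand so that no large intermediate graph is materialised. Determinization, minimization and epsilon handling, which the paper combines with composition, are deliberately omitted.

```python
# Toy, epsilon-free WFST composition with lazy (on-demand) state expansion.
# Each transducer maps state -> list of (in_label, out_label, weight, next_state).
# Weights are tropical (summed along a path). Illustrative only.
from collections import deque

def compose_lazy(t1, t2, start1, start2):
    """Yield arcs of the composed machine, expanding state pairs on demand."""
    start = (start1, start2)
    queue, seen = deque([start]), {start}
    while queue:
        s1, s2 = queue.popleft()
        for (i1, o1, w1, n1) in t1.get(s1, []):
            for (i2, o2, w2, n2) in t2.get(s2, []):
                if o1 == i2:                      # matching label: keep the arc
                    nxt = (n1, n2)
                    yield (s1, s2), i1, o2, w1 + w2, nxt
                    if nxt not in seen:
                        seen.add(nxt)
                        queue.append(nxt)

# Usage: a 2-state "lexicon" composed with a 2-state "grammar".
lexicon = {0: [("a", "A", 0.5, 1)], 1: [("b", "B", 0.5, 0)]}
grammar = {0: [("A", "x", 1.0, 1)], 1: [("B", "y", 1.0, 0)]}
for arc in compose_lazy(lexicon, grammar, 0, 0):
    print(arc)
```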
Evaluation of Phone Lattice Based Speech Decoding Jacques Duchateau, Kris Demuynck, Hugo Van hamme; Katholieke Universiteit Leuven, Belgium Bayes Risk Approximations Using Time Overlap with an Application to System Combination Tue-Ses2-P3-2, Time: 13:30 Björn Hoffmeister, Ralf Schlüter, Hermann Ney; RWTH Aachen University, Germany Previously, we proposed a flexible two-layered speech recogniser architecture, called FLaVoR. In the first layer an unconstrained, task independent phone recogniser generates a phone lattice. Only in the second layer the task specific lexicon and language model are applied to decode the phone lattice and produce a word level recognition result. In this paper, we present a further evaluation of the FLaVoR architecture. The performance of a classical singlelayered architecture and the FLaVoR architecture are compared on two recognition tasks, using the same acoustic, lexical and language models. On the large vocabulary Wall Street Journal 5k and 20k benchmark tasks, the two-layered architecture resulted in slightly but not significantly better word error rates. On a reading error detection task for a reading tutor for children, the FLaVoR architecture clearly outperformed the single-layered architecture. A Fully Data Parallel WFST-Based Large Vocabulary Continuous Speech Recognition on a Graphics Processing Unit Tue-Ses2-P3-5, Time: 13:30 The computation of the Minimum Bayes Risk (MBR) decoding rule for word lattices needs approximations. We investigate a class of approximations where the Levenshtein alignment is approximated under the condition that competing lattice arcs overlap in time. The approximations have their origins in MBR decoding and in discriminative training. We develop modified versions and propose a new, conceptually extremely simple confusion network algorithm. The MBR decoding rule is extended to scope with several lattices, which enables us to apply all the investigated approximations to system combination. All approximations are tested on a Mandarin and on an English LVCSR task for a single system and for system combination. The new methods are competitive in error rate and show some advantages over the standard approaches to MBR decoding. Unsupervised Estimation of the Language Model Scaling Factor Jike Chong, Ekaterina Gonina, Youngmin Yi, Kurt Keutzer; University of California at Berkeley, USA Tue-Ses2-P3-3, Time: 13:30 Tremendous compute throughput is becoming available in personal desktop and laptop systems through the use of graphics processing units (GPUs). However, exploiting this resource requires re-architecting an application to fit a data parallel programming model. The complex graph traversal routines in the inference process for large vocabulary continuous speech recognition (LVCSR) have been considered by many as unsuitable for extensive parallelization. We explore and demonstrate a fully data parallel implementation of a speech inference engine on NVIDIA’s GTX280 GPU. Our implementation consists of two phases - compute-intensive observation probability computation phase and communication-intensive graph traversal phase. We take advantage of dynamic elimination of redundant computation in the compute-intensive phase while maintaining close-to-peak execution efficiency. We also demonstrate the importance of exploring application-level trade-offs in the communication-intensive graph traversal phase to adapt the algorithm to data parallel execution on GPUs. On 3.1 hours of speech data set, we achieve more Christopher M. 
White, Ariya Rastrow, Sanjeev Khudanpur, Frederick Jelinek; Johns Hopkins University, USA Tue-Ses2-P3-6, Time: 13:30 This paper addresses the adjustment of the language model (LM) scaling factor of an automatic speech recognition (ASR) system for a new domain using only un-transcribed speech. The main idea is to replace the (unavailable) reference transcript with an automatic transcript generated by an independent ASR system, and adjust parameters using this sloppy reference. It is shown that despite its fairly high error rate (ca. 35%), choosing the scaling factor to minimize disagreement with the erroneous transcripts is still an effective recipe for model selection. This effectiveness is demonstrated by adjusting an ASR system trained on Broadcast News to transcribe the MIT Lectures corpus. An ASR system for telephone speech produces the sloppy reference, and optimizing towards it yields a nearly optimal LM scaling factor for the MIT Lectures corpus. Notes 96 Simultaneous Estimation of Confidence and Error Cause in Speech Recognition Using Discriminative Model A Comparison of Audio-Free Speech Recognition Error Prediction Methods Atsunori Ogawa, Atsushi Nakamura; NTT Corporation, Japan Tue-Ses2-P3-7, Time: 13:30 Since recognition errors are unavoidable in speech recognition, confidence scoring, which accurately estimates the reliability of recognition results, is a critical function for speech recognition engines. In addition to achieving accurate confidence estimation, if we are to develop speech recognition systems that will be widely used by the public, speech recognition engines must be able to report the causes of errors properly, namely they must offer a reason for any failure to recognize input utterances. This paper proposes a method that simultaneously estimates both confidences and causes of errors in speech recognition results by using discriminative models. We evaluated the proposed method in an initial speech recognition experiment, and confirmed its promising performance with respect to confidence and error cause estimation. A Generalized Composition Algorithm for Weighted Finite-State Transducers Preethi Jyothi, Eric Fosler-Lussier; Ohio State University, USA Tue-Ses2-P3-10, Time: 13:30 Predicting possible speech recognition errors can be invaluable for a number of Automatic Speech Recognition (ASR) applications. In this study, we extend a Weighted Finite State Transducer (WFST) framework for error prediction to facilitate a comparison between two approaches of predicting confusable words: examining recognition errors on the training set to learn phone confusions and utilizing distances between the phonetic acoustic models for the prediction task. We also expand the framework to deal with continuous word recognition and we can accurately predict 60% of the misrecognized sentences (with an average words-per-sentence count of 15) and a little over 70% of the total number of errors from the unseen test data where no acoustic information related to the test data is utilized. Automatic Out-of-Language Detection Based on Confidence Measures Derived from LVCSR Word and Phone Lattices Petr Motlicek; IDIAP Research Institute, Switzerland Cyril Allauzen, Michael Riley, Johan Schalkwyk; Google Inc., USA Tue-Ses2-P3-8, Time: 13:30 This paper describes a weighted finite-state transducer composition algorithm that generalizes the concept of the composition filter and presents filters that remove useless epsilon paths and push forward labels and weights along epsilon paths. 
This filtering permits the composition of large speech recognition context-dependent lexicons and language models much more efficiently in time and space than previously possible. We present experiments on Broadcast News and a spoken query task that demonstrate an ∼5% to 10% overhead for dynamic, runtime composition compared to a static, offline composition of the recognition transducer. To our knowledge, this is the first such system with so little overhead. Word Confidence Using Duration Models Tue-Ses2-P3-11, Time: 13:30 Confidence Measures (CMs) estimated from Large Vocabulary Continuous Speech Recognition (LVCSR) outputs are commonly used metrics to detect incorrectly recognized words. In this paper, we propose to exploit CMs derived from frame-based word and phone posteriors to detect speech segments containing pronunciations from non-target (alien) languages. The LVCSR system used is built for English, which is the target language, with a medium-size recognition vocabulary (5k words). The efficiency of detection is tested on a set comprising speech from three different languages (English, German, Czech). The results achieved indicate that employing specific temporal context (integrated at the word or phone level) significantly increases the detection accuracies. Furthermore, we show that the combination of several CMs can also improve the efficiency of detection. Automatic Estimation of Decoding Parameters Using Large-Margin Iterative Linear Programming Stefano Scanzio 1 , Pietro Laface 1 , Daniele Colibro 2 , Roberto Gemello 2 ; 1 Politecnico di Torino, Italy; 2 Loquendo, Italy Brian Mak, Tom Ko; Hong Kong University of Science & Technology, China Tue-Ses2-P3-9, Time: 13:30 In this paper, we propose a word confidence measure based on phone durations depending on large contexts. The measure is based on the expected duration of each recognized phone in a word. In the approach proposed here, the duration of each phone is in principle context-dependent, and the measure is a function of the distance between the observed and expected phone duration distributions within a word. Our experiments show that, since the “duration confidence” does not make use of any acoustic information, its Equal Error Rate (EER) in terms of False Accept and False Rejection rates is not as good as the one obtained by using the more informed acoustic confidence measure. However, combining the two measures by a simple linear interpolation, the system EER improves by 6% to 10% relative on an isolated word recognition task in several languages. Tue-Ses2-P3-12, Time: 13:30 The decoding parameters in automatic speech recognition (grammar factor and word insertion penalty) are usually determined by performing a grid search on a development set. Recently, we cast their estimation as a convex optimization problem, and proposed a solution using an iterative linear programming algorithm. However, the solution depends on how well the development data set matches the test set. In this paper, we further investigate an improvement in the generalization property of the solution by using large margin training within the iterative linear programming framework.
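For contrast with the optimisation-based estimation described above, the conventional grid-search baseline mentioned in the abstract could be sketched roughly as follows; the rescoring callback, the parameter ranges and the development set are hypothetical placeholders, not part of any system described here.

```python
# Sketch of the conventional grid search over the decoding parameters
# (grammar/LM scale and word insertion penalty) on a development set.
# `rescore_wer(lm_scale, wip, dev_set)` is a hypothetical callback that
# rescores development lattices and returns a word error rate.
import itertools

def grid_search_decoding_params(rescore_wer, dev_set,
                                lm_scales=range(8, 21),
                                wips=(-10, -5, 0, 5, 10)):
    best = (float("inf"), None, None)
    for lm_scale, wip in itertools.product(lm_scales, wips):
        wer = rescore_wer(lm_scale, wip, dev_set)
        if wer < best[0]:
            best = (wer, lm_scale, wip)
    return best  # (best WER, best LM scale, best word insertion penalty)

# Usage with a dummy, convex stand-in for the real rescoring function:
dummy = lambda s, p, _: (s - 14) ** 2 + 0.1 * (p - 0) ** 2
print(grid_search_decoding_params(dummy, dev_set=None))
```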
Empirical evaluation on the WSJ0 5K speech recognition tasks shows that the recognition performance of the decoding parameters found by the improved algorithm using only a subset of the acoustic model training data is even better than that of the decoding parameters found by grid search on the development data, and is close to the performance of those found by grid search on the test set. Notes 97 Tue-Ses2-P4 : Robust Automatic Speech Recognition I A Study of Mutual Front-End Processing Method Based on Statistical Model for Noise Robust Speech Recognition Hewison Hall, 13:30, Tuesday 8 Sept 2009 Chair: Alex Acero, Microsoft Research, USA Masakiyo Fujimoto, Kentaro Ishizuka, Tomohiro Nakatani; NTT Corporation, Japan Tue-Ses2-P4-4, Time: 13:30 Optimization of Dereverberation Parameters Based on Likelihood of Speech Recognizer Randy Gomez, Tatsuya Kawahara; Kyoto University, Japan Tue-Ses2-P4-1, Time: 13:30 Speech recognition under reverberant condition is a difficult task. Most dereverberation techniques used to address this problem enhance the reverberant waveform independent from that of the speech recognizer. In this paper, we improve the conventional Spectral Subtraction-based (SS) dereverberation technique. In our proposed approach, the dereverberation parameters are optimized to improve the likelihood of the acoustic model. The system is capable of adaptively fine-tuning these parameters jointly with acoustic model training. Additional optimization is also implemented during decoding of the test utterances. We have evaluated using real reverberant data and experimental results show that the proposed method significantly improves the recognition performance over the conventional approach. This paper addresses robust front-end processing for automatic speech recognition (ASR) in noise. Accurate recognition of corrupted speech requires noise robust front-end processing, e.g., voice activity detection (VAD) and noise suppression (NS). Typically, VAD and NS are combined as one-way processing, and are developed independently. However, VAD and NS should not be assumed to be independent techniques, because sharing each others’ information is important for the improvement of front-end processing. Thus, we investigate the mutual front-end processing by integrating VAD and NS, which can beneficially share each others’ information. In an evaluation of a concatenated speech corpus, CENSREC-1-C database, the proposed method improves the performance of both VAD and ASR compared with the conventional method. Integrating Codebook and Utterance Information in Cepstral Statistics Normalization Techniques for Robust Speech Recognition Guan-min He, Jeih-weih Hung; National Chi Nan University, Taiwan Application of Noise Robust MDT Speech Recognition on the SPEECON and SpeechDat-Car Databases Tue-Ses2-P4-5, Time: 13:30 J.F. Gemmeke 1 , Y. Wang 2 , Maarten Van Segbroeck 2 , B. Cranen 1 , Hugo Van hamme 2 ; 1 Radboud Universiteit Nijmegen, The Netherlands; 2 Katholieke Universiteit Leuven, Belgium Tue-Ses2-P4-2, Time: 13:30 We show that the recognition accuracy of an MDT recognizer which performs well on artificially noisified data, deteriorates rapidly under realistic noisy conditions (using multiple microphone recordings from the SPEECON/SpeechDat-Car databases) and is outperformed by a commercially available recognizer which was trained using a multi-condition paradigm. 
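As background to the spectral-subtraction (SS) front end referred to in the dereverberation abstract earlier in this session, a bare-bones magnitude-domain spectral subtraction with a simple spectral floor might look like the sketch below; the over-subtraction factor, the floor and the noise-estimation window are illustrative values only, not those used by any of the systems described here.

```python
# Bare-bones magnitude spectral subtraction sketch. The noise magnitude
# spectrum is estimated from the first few (assumed speech-free) frames;
# alpha (over-subtraction) and beta (spectral floor) are illustrative.
import numpy as np
from scipy.signal import stft, istft

def spectral_subtraction(x, fs, noise_frames=10, alpha=2.0, beta=0.02):
    f, t, X = stft(x, fs=fs, nperseg=512)
    mag, phase = np.abs(X), np.angle(X)
    noise_mag = mag[:, :noise_frames].mean(axis=1, keepdims=True)
    clean_mag = np.maximum(mag - alpha * noise_mag, beta * mag)
    _, x_hat = istft(clean_mag * np.exp(1j * phase), fs=fs, nperseg=512)
    return x_hat

# Usage on a synthetic noisy tone:
fs = 16000
n = np.arange(fs)
noisy = np.sin(2 * np.pi * 440 * n / fs) + 0.3 * np.random.randn(fs)
print(spectral_subtraction(noisy, fs).shape)
```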
Analysis of the recognition results indicates that the recording channels with the lowest SNRs where the MDT recognizer fails most, are also the channels which suffer most from room reverberation. Despite the channel compensation measures we took, it appears difficult to maintain the restorative power of MDT in such non-additive noise conditions. Model Based Feature Enhancement for Automatic Speech Recognition in Reverberant Environments Cepstral statistics normalization techniques have been shown to be very successful at improving the noise robustness of speech features. This paper proposes a hybrid-based scheme to achieve a more accurate estimate of the statistical information of features in these techniques. By properly integrating codebook and utterance knowledge, the resulting hybrid-based approach significantly outperforms conventional utterance-based, segmentbased and codebook-based approaches in noisy environments. For the Aurora-2 clean-condition training task, the proposed hybrid codebook/segment-based histogram equalization (CS-HEQ) achieves an average recognition accuracy of 90.66%, which is better than utterance-based HEQ (87.62%), segment-based HEQ (85.92%) and codebook-based HEQ (85.29%). Furthermore, the high-performance CS-HEQ can be implemented with a short delay and can thus be applied in real-time online systems. Reduced Complexity Equalization of Lombard Effect for Speech Recognition in Noisy Adverse Environments Hynek Bořil, John H.L. Hansen; University of Texas at Dallas, USA Alexander Krueger, Reinhold Haeb-Umbach; Universität Paderborn, Germany Tue-Ses2-P4-6, Time: 13:30 Tue-Ses2-P4-3, Time: 13:30 In this paper we present a new feature space dereverberation technique for automatic speech recognition. We derive an expression for the dependence of the reverberant speech features in the log-mel spectral domain on the non-reverberant speech features and the room impulse response. The obtained observation model is used for a model based speech enhancement based on Kalman filtering. The performance of the proposed enhancement technique is studied on the AURORA5 database. In our currently best configuration, which includes uncertainty decoding, the number of recognition errors is approximately halved compared to the recognition of unprocessed speech. In real-world adverse environments, speech signal corruption by background noise, microphone channel variations, and speech production adjustments introduced by speakers in an effort to communicate efficiently over noise (Lombard effect) severely impact automatic speech recognition (ASR) performance. Recently, a set of unsupervised techniques reducing ASR sensitivity to these sources of distortion have been presented, with the main focus on equalization of Lombard effect (LE). The algorithms performing maximum-likelihood spectral transformation, cepstral dynamics normalization, and decoding with a codebook of noisy speech models have been shown to outperform conventional methods, however, at a cost of considerable increase in computational complexity due to required numerous decoding passes through the ASR Notes 98 models. In this study, a scheme utilizing a set of speech-in-noise Gaussian mixture models and a neutral/LE classifier is shown to substantially decrease the computational load (from 14 to 2–4 ASR decoding passes) while preserving overall system performance. 
In addition, an extended codebook capturing multiple environmental noises is introduced and shown to improve ASR in changing environments (8.2–49.2% absolute WER improvement). The evaluation is performed on the Czech Lombard Speech Database (CLSD’05). The task is to recognize neutral/LE connected digit strings presented in different levels of background car noise and Aurora 2 noises. Unsupervised Training Scheme with Non-Stereo Data for Empirical Feature Vector Compensation domain such as MFCC. The proposed method works as a feature extraction front-end that is independent from decoding engine, and has ability to compensate for non-stationary additive and convolutional noises with a short time delay. It includes spectral subtraction as a special case when no parameter optimization is performed. Experiments were performed using the AURORA-2J database. It has been shown that significantly higher recognition performance is obtained by the proposed method than spectral subtraction. Noise-Robust Feature Extraction Based on Forward Masking Sheng-Chiuan Chiou, Chia-Ping Chen; National Sun Yat-Sen University, Taiwan L. Buera 1 , Antonio Miguel 1 , Alfonso Ortega 1 , Eduardo Lleida 1 , Richard M. Stern 2 ; 1 Universidad de Zaragoza, Spain; 2 Carnegie Mellon University, USA Tue-Ses2-P4-10, Time: 13:30 Tue-Ses2-P4-7, Time: 13:30 In this paper, a novel training scheme based on unsupervised and non-stereo data is presented for Multi-Environment Model-based LInear Normalization (MEMLIN) and MEMLIN with cross-probability model based on GMMs (MEMLIN-CPM). Both are data-driven feature vector normalization techniques which have been proved very effective in dynamic noisy acoustic environments. However, this kind of techniques usually requires stereo data in a previous training phase, which could be an important limitation in real situations. To compensate this drawback, we present an approach based on ML criterion and Vector Taylor Series (VTS). Experiments have been carried out with Spanish SpeechDat Car, reaching consistent improvements: 48.7% and 61.9% when the novel training process is applied over MEMLIN and MEMLIN-CPM, respectively. Incremental Adaptation with VTS and Joint Adaptively Trained Systems Forward masking is a phenomenon of human auditory perception, that a weaker sound is masked by a preceding stronger masker. The actual cause of forward masking is not clear, but synaptic adaptation and temporal integration are heuristic explanations. In this paper, we postulate the mechanism of forward masking to be synaptic adaptation and temporal integration, and incorporate them in the feature extraction process of an automatic speech recognition system to improve noise-robustness. The synaptic adaptation is implemented by a highpass filter, and the temporal integration is implemented by a bandpass filter. We apply both filters in the domain of log mel-spectrum. On the Aurora 3 tasks, we evaluate three modified mel-frequency cepstral coefficients: synaptic adaptation only, temporal integration only, and both synaptic adaptation and temporal integration. Experiments show that the overall improvement is 16.1%, 21.8%, and 26.2% respectively in the three cases over the baseline. Tue-Ses3-S1 : Panel: Speech & Intelligence Main Hall, 16:00, Tuesday 8 Sept 2009 Chair: Roger K. Moore, University of Sheffield, UK F. Flego, M.J.F. Gales; University of Cambridge, UK Tue-Ses2-P4-8, Time: 13:30 Recently adaptive training schemes using model based compensation approaches such as VTS and JUD have been proposed. 
Adaptive training allows multi-environment training data to be used whilst a neutral, “clean”, acoustic model is trained. This paper describes and assesses the advantages of using incremental, rather than batch, mode adaptation with these adaptively trained systems. Incremental adaptation reduces the latency during recognition, and has the possibility of reducing the error rate for slowly varying noise. The work is evaluated on a large scale multi-environment training configuration targeted at in-car speech recognition. Results on in-car collected test data indicate that incremental adaptation is an attractive option when using these adaptively trained systems. Target Speech GMM-Based Spectral Compensation for Noise Robust Speech Recognition Panel: Speech & Intelligence Tue-Ses3-S1-1, Time: 16:00 In line with the theme of this year’s INTERSPEECH conference, this special semi-plenary Panel Session will be run as a guided discussion, drawing on issues raised by the panel members and solicited in advance from the attendees. An international panel of distinguished experts will engage with the topic of ‘speech and intelligence’ and address open questions such as the importance of a link between spoken language and other aspects of human cognition. It is expected that this special event will be both informative and entertaining, and will involve opportunities for audience participation. Tue-Ses3-O3 : Speaker Verification & Identification I Takahiro Shinozaki, Sadaoki Furui; Tokyo Institute of Technology, Japan Fallside (East Wing 2), 16:00, Tuesday 8 Sept 2009 Chair: Patrick Kenny, CRIM, Canada Tue-Ses2-P4-9, Time: 13:30 To improve speech recognition performance in adverse conditions, a noise compensation method is proposed that applies a transformation in the spectral domain whose parameters are optimized based on the likelihood of a speech GMM modeled in the feature domain. The idea is that additive and convolutional noises have mathematically simple expressions in the spectral domain while speech characteristics are better modeled in the feature Investigation into Variants of Joint Factor Analysis for Speaker Recognition Lukáš Burget, Pavel Matějka, Valiantsina Hubeika, Jan Černocký; Brno University of Technology, Czech Republic Tue-Ses3-O3-1, Time: 16:00 In this paper, we have investigated JFA as used for speaker recognition. First, we performed a systematic comparison of full JFA with its simplified variants and confirmed the superior performance of the full JFA with both eigenchannels and eigenvoices. We investigated the sensitivity of JFA to the number of eigenvoices, both for the full model and for the simplified variants. We studied the importance of normalization and found that gender-dependent zt-norm was crucial. The results are reported on NIST 2006 and 2008 SRE evaluation data. Improved GMM-Based Speaker Verification Using SVM-Driven Impostor Dataset Selection this assumption to derive a new general kernel. The kernel function is general in that it is a linear combination of any kernels belonging to the reproducing kernel Hilbert space. The combination weights are obtained by optimizing the ability of a discriminant function to separate a target speaker from impostors using either regression analysis or SVM training. The idea was applied to both low- and high-level speaker verification. In both cases, results show that the proposed kernels outperform the state-of-the-art sequence kernels.
Further performance enhancement was also observed when the high-level scores were combined with acoustic scores. UBM-Based Sequence Kernel for Speaker Recognition Zhenchun Lei; Jiangxi Normal University, China Mitchell McLaren, Robbie Vogt, Brendan Baker, Sridha Sridharan; Queensland University of Technology, Australia Tue-Ses3-O3-5, Time: 17:20 Tue-Ses3-O3-2, Time: 16:20 The problem of impostor dataset selection for GMM-based speaker verification is addressed through the recently proposed data-driven background dataset refinement technique. The SVM-based refinement technique selects from a candidate impostor dataset those examples that are most frequently selected as support vectors when training a set of SVMs on a development corpus. This study demonstrates the versatility of dataset refinement in the task of selecting suitable impostor datasets for use in GMM-based speaker verification. The use of refined Z- and T-norm datasets provided performance gains of 15% in EER in the NIST 2006 SRE over the use of heuristically selected datasets. The refined datasets were shown to generalise well to the unseen data of the NIST 2008 SRE. Adaptive Individual Background Model for Speaker Verification This paper proposes a probabilistic sequence kernel based on the universal background model, which is widely used in speaker recognition. The Gaussian components are used to construct the speaker reference space, and utterances of different lengths are mapped into fixed-size vectors after normalization with the correlation matrix. Finally, a linear support vector machine is used for speaker recognition. A transition probabilistic sequence kernel is also proposed by adapting the transition information between neighboring frames. Experiments on NIST 2001 show that the performance is comparable to that of the traditional UBM-MAP model. If we fuse the models, the performance improves by 16.8% and 19.1%, respectively, compared with the UBM-MAP model. GMM Kernel by Taylor Series for Speaker Verification Minqiang Xu 1 , Xi Zhou 2 , Beiqian Dai 1 , Thomas S. Huang 2 ; 1 University of Science & Technology of China, China; 2 University of Illinois at Urbana-Champaign, USA Yossi Bar-Yosef, Yuval Bistritz; Tel-Aviv University, Israel Tue-Ses3-O3-3, Time: 16:40 Tue-Ses3-O3-6, Time: 17:40 Most techniques for speaker verification today use Gaussian Mixture Models (GMMs) and make the decision by comparing the likelihood of the speaker model to the likelihood of a universal background model (UBM). The paper proposes to replace the UBM by an individual background model (IBM) that is generated for each speaker. The IBM is created using the K-nearest cohort models and the UBM by a simple new adaptation algorithm. The new GMM-IBM speaker verification system can also be combined with various score normalization techniques that have been proposed to increase the robustness of the GMM-UBM system. Comparative experiments were held on the NIST-2004-SRE database with a plain system setting (without score normalization) and also with the combination of adaptive test normalization (ATnorm). Results indicated that the proposed GMM-IBM system outperforms a comparable GMM-UBM system. Currently, the approach of combining Gaussian Mixture Models with Support Vector Machines for the text-independent speaker verification task has produced state-of-the-art performance. Many kernels have been reported for combining GMM and SVM.
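A common way of obtaining the fixed-size vectors that such GMM/SVM kernels operate on is the mean supervector: the UBM means are MAP-adapted towards each utterance and stacked. The sketch below shows only that generic recipe (relevance MAP of the means followed by a linear SVM) under invented hyperparameters; it is not the kernel proposed in any of the papers above.

```python
# Generic GMM mean-supervector recipe: adapt the UBM means to an utterance
# with relevance MAP, stack them, and feed the vectors to a linear SVM.
# Hyperparameters (components, relevance factor) are illustrative.
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.svm import SVC

def map_supervector(ubm: GaussianMixture, frames: np.ndarray, r: float = 16.0):
    post = ubm.predict_proba(frames)            # (n_frames, n_components)
    n_c = post.sum(axis=0)                      # soft counts per component
    f_c = post.T @ frames                       # first-order statistics
    alpha = (n_c / (n_c + r))[:, None]
    adapted = alpha * (f_c / np.maximum(n_c[:, None], 1e-8)) \
              + (1.0 - alpha) * ubm.means_
    return adapted.ravel()                      # stacked adapted means

# Usage with synthetic 2-D "cepstral" frames for two pseudo-speakers:
rng = np.random.default_rng(1)
ubm = GaussianMixture(n_components=8, random_state=0).fit(rng.normal(size=(2000, 2)))
utts = [rng.normal(loc=spk, size=(300, 2)) for spk in (0.0, 1.0) for _ in range(10)]
labels = [spk for spk in (0, 1) for _ in range(10)]
X = np.vstack([map_supervector(ubm, u) for u in utts])
print(SVC(kernel="linear").fit(X, labels).score(X, labels))
```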
Optimization of Discriminative Kernels in SVM Speaker Verification In this paper, we propose a novel kernel that represents the GMM distribution by the Taylor expansion theorem and is used as the input to an SVM. The utterance-specific GMM is represented as a combination of orders of the Taylor series expanded at the means of the Gaussian components. Here we extract the distribution information around the means of the Gaussian components in the GMM, as we can naturally assume that each mean position indicates a feature cluster in the feature space. The kernel then computes the ensemble distance between orders of the Taylor series. Results of our new kernel on the NIST speaker recognition evaluation (SRE) 2006 core task show relative improvements of up to 7.1% and 11.7% in EER for male and female speakers, respectively, compared to a K-L divergence based SVM system. Shi-Xiong Zhang, Man-Wai Mak; Hong Kong Polytechnic University, China Tue-Ses3-O3-4, Time: 17:00 An important aspect of SVM-based speaker verification systems is the design of sequence kernels. These kernels should be able to map variable-length observation sequences to fixed-size supervectors that capture the dynamic characteristics of speech utterances and allow speakers to be easily distinguished. Most existing kernels in SVM speaker verification are obtained by assuming a specific form for the similarity function of supervectors. This paper relaxes as the need to list allowed alignments is removed. Finally, loose comparison with other studies indicates Combilex is a superior quality lexicon in terms of consistency and size. Tue-Ses3-O4 : Text Processing for Spoken Language Generation Holmes (East Wing 3), 16:00, Tuesday 8 Sept 2009 Chair: Douglas Reynolds, MIT, USA Letter-to-Phoneme Conversion by Inference of Rewriting Rules Automatic Syllabification for Danish Text-to-Speech Systems Vincent Claveau; IRISA, France Jeppe Beck 1 , Daniela Braga 1 , João Nogueira 2 , Miguel Sales Dias 1 , Luis Coelho 3 ; 1 Microsoft Language Development Center, Portugal; 2 University of Lisbon, Portugal; 3 Polytechnic Institute of Oporto, Portugal Tue-Ses3-O4-1, Time: 16:00 In this paper, a rule-based automatic syllabifier for Danish is described using the Maximal Onset Principle. The prior success of rule-based methods applied to Portuguese and Catalan syllabification modules was the basis of this work. The system was implemented and tested using a very small set of rules. The results gave word accuracy rates of 96.9% and 98.7%, contrary to our initial expectations, Danish being a language with a complex syllabic structure and thus difficult to be rule-driven. Comparison with a data-driven syllabification system using artificial neural networks showed a higher accuracy rate for the former system. Hybrid Approach to Grapheme to Phoneme Conversion for Korean Tue-Ses3-O4-4, Time: 17:00 Phonetization is a crucial step for oral document processing. In this paper, a new letter-to-phoneme conversion approach is proposed; it is automatic, simple, portable and efficient. It relies on a machine learning technique initially developed for transliteration and translation; the system infers rewriting rules from examples of words with their phonetic representations. This approach is evaluated in the framework of the Pronalsyl Pascal challenge, which includes several datasets in different languages. The obtained results equal or outperform those of the best known systems.
Moreover, thanks to the simplicity of our technique, the inference time of our approach is much lower than those of the best performing state-of-the-art systems. Online Discriminative Training for Grapheme-to-Phoneme Conversion Sittichai Jiampojamarn, Grzegorz Kondrak; University of Alberta, Canada Tue-Ses3-O4-5, Time: 17:20 Jinsik Lee 1 , Byeongchang Kim 2 , Gary Geunbae Lee 1 ; 1 POSTECH, Korea; 2 Catholic University of Daegu, Korea Tue-Ses3-O4-2, Time: 16:20 In the grapheme to phoneme conversion problem for Korean, two main approaches have been discussed: knowledge-based and data-driven methods. However, both camps have limitations: the knowledge-based hand-written rules cannot handle some of the pronunciation changes due to the lack of capability of linguistic analyzers and many exceptions; data-driven methods always suffer from data sparseness. To overcome the shortages of both camps, this paper presents a novel combining method which effectively integrates two components: (1) a rule-based converting system based on linguistically motivated hand-written rules and (2) a statistical converting system using a Maximum Entropy model. The experimental results clearly show the effectiveness of our proposed method. We present an online discriminative training approach to graphemeto-phoneme (g2p) conversion. We employ a many-to-many alignment between graphemes and phonemes, which overcomes the limitations of widely used one-to-one alignments. The discriminative structure-prediction model incorporates input segmentation, phoneme prediction, and sequence modeling in a unified dynamic programming framework. The learning model is able to capture both local context features in inputs, as well as non-local dependency features in sequence outputs. Experimental results show that our system surpasses the state-of-the-art on several data sets. Using Same-Language Machine Translation to Create Alternative Target Sequences for Text-to-Speech Synthesis Peter Cahill 1 , Jinhua Du 2 , Andy Way 2 , Julie Carson-Berndsen 1 ; 1 University College Dublin, Ireland; 2 Dublin City University, Ireland Tue-Ses3-O4-6, Time: 17:40 Robust LTS Rules with the Combilex Speech Technology Lexicon Korin Richmond, Robert A.J. Clark, Sue Fitt; University of Edinburgh, UK Tue-Ses3-O4-3, Time: 16:40 Combilex is a high quality pronunciation lexicon, aimed at speech technology applications, that has recently been released by CSTR. Combilex benefits from several advanced features. This paper evaluates one of these: the explicit alignment of phones to graphemes in a word. This alignment can help to rapidly develop robust and accurate letter-to-sound (LTS) rules, without needing to rely on automatic alignment methods. To evaluate this, we used Festival’s LTS module, comparing its standard automatic alignment with Combilex’s explicit alignment. Our results show using Combilex’s alignment improves LTS accuracy: 86.50% words correct as opposed to 84.49%, with our most general form of lexicon. In addition, building LTS models is greatly accelerated, Modern speech synthesis systems attempt to produce speech utterances from an open domain of words. In some situations, the synthesiser will not have the appropriate units to pronounce some words or phrases accurately but it still must attempt to pronounce them. This paper presents a hybrid machine translation and unit selection speech synthesis system. The machine translation system was trained with English as the source and target language. 
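The rewrite-rule formalism that several of the letter-to-phoneme abstracts above refer to can be illustrated with a toy, hand-written rule set applied longest-match-first; the rules and phone symbols below are invented English examples, not rules learned or used by any of these systems.

```python
# Toy letter-to-sound conversion by ordered rewrite rules, applied
# longest-match-first at each position. The rules are invented examples.
RULES = [          # (grapheme substring, phoneme string)
    ("tion", "SH AH N"),
    ("ph", "F"), ("th", "TH"), ("ch", "CH"), ("ee", "IY"),
    ("a", "AE"), ("e", "EH"), ("i", "IH"), ("o", "AA"), ("u", "AH"),
    ("b", "B"), ("c", "K"), ("d", "D"), ("f", "F"), ("g", "G"),
    ("h", "HH"), ("k", "K"), ("l", "L"), ("m", "M"), ("n", "N"),
    ("p", "P"), ("r", "R"), ("s", "S"), ("t", "T"), ("v", "V"),
]

def letter_to_sound(word: str) -> list:
    word, phones, i = word.lower(), [], 0
    while i < len(word):
        for graph, phone in sorted(RULES, key=lambda r: -len(r[0])):
            if word.startswith(graph, i):
                phones.extend(phone.split())
                i += len(graph)
                break
        else:
            i += 1                    # silently skip unknown letters
    return phones

print(letter_to_sound("speech"))      # ['S', 'P', 'IY', 'CH']
print(letter_to_sound("nation"))      # ['N', 'AE', 'SH', 'AH', 'N']
```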
Rather than the synthesiser only saying the input text as would happen in conventional synthesis systems, the synthesiser may say an alternative utterance with the same meaning. This method allows the synthesiser to overcome the problem of insufficient units in runtime. Notes 101 show that the proposed estimator leads to significant improvement for the presented estimator over state-of-the-art methods. Tue-Ses3-P1 : Single- and Multichannel Speech Enhancement Hewison Hall, 16:00, Tuesday 8 Sept 2009 Chair: Richard C. Hendriks, Technische Universiteit Delft, The Netherlands Speech Enhancement in a 2-Dimensional Area Based on Power Spectrum Estimation of Multiple Areas with Investigation of Existence of Active Sources Watermark Recovery from Speech Using Inverse Filtering and Sign Correlation Yusuke Hioka 1 , Ken’ichi Furuya 1 , Yoichi Haneda 1 , Akitoshi Kataoka 2 ; 1 NTT Corporation, Japan; 2 Ryukoku University, Japan Robert Morris 1 , Ralph Johnson 1 , Vladimir Goncharoff 2 , Joseph DiVita 1 ; 1 SPAWAR Systems Center Pacific, USA; 2 University of Illinois at Chicago, USA Tue-Ses3-P1-4, Time: 16:00 Tue-Ses3-P1-1, Time: 16:00 This paper presents an improved method for asynchronous embedding and recovery of sub-audible watermarks in speech signals. The watermark, a sequence of DTMF tones, was added to speech without knowledge of its time-varying characteristics. Watermark recovery began by implementing a synchronized zero-phase inverse filtering operation to decorrelate the speech during its voiced segments. The final step was to apply the sign correlation technique, which resulted in performance advantages over linear correlation detection. Our simulations include the effects of finite word length in the correlator. A microphone array that emphasizes sound sources located in a particular 2-dimensional area is described. We previously developed a method that estimates the power spectra of target and noise sounds using multiple fixed beamformings. However, that method requires the areas where the noise sources are located to be restricted. We describe the principle of this limitation then propose a procedure that investigates the possibility of the existence of a sound source in a target area and other areas beforehand to reduce the number of unknown power spectra to be estimated. Modulation Domain Spectral Subtraction for Speech Enhancement Kuldip Paliwal, Belinda Schwerin, Kamil Wójcicki; Griffith University, Australia Tue-Ses3-P1-5, Time: 16:00 Weighted Linear Prediction for Speech Analysis in Noisy Conditions Jouni Pohjalainen, Heikki Kallasjoki, Kalle J. Palomäki, Mikko Kurimo, Paavo Alku; Helsinki University of Technology, Finland Tue-Ses3-P1-2, Time: 16:00 Following earlier work, we modify linear predictive (LP) speech analysis by including temporal weighting of the squared prediction error in the model optimization. In order to focus this so called weighted LP model on the least noisy signal regions in the presence of stationary additive noise, we use short-time signal energy as the weighting function. We compare the noisy spectrum analysis performance of weighted LP and its recently proposed variant, the latter guaranteed to produce stable synthesis models. As a practical test case, we use automatic speech recognition to verify that the weighted LP methods improve upon the conventional FFT and LP methods by making spectrum estimates less prone to corruption by additive noise. 
Log-Spectral Magnitude MMSE Estimators Under Super-Gaussian Densities 1 In this paper we investigate the modulation domain as an alternative to the acoustic domain for speech enhancement. More specifically, we wish to determine how competitive the modulation domain is for spectral subtraction as compared to the acoustic domain. For this purpose, we extend the traditional analysismodification-synthesis framework to include modulation domain processing. We then compensate the noisy modulation spectrum for additive noise distortion by applying the spectral subtraction algorithm in the modulation domain. Using subjective listening tests and objective speech quality evaluation we show that the proposed method results in improved speech quality. Furthermore, applying spectral subtraction in the modulation domain does not introduce the musical noise artifacts that are typically present after acoustic domain spectral subtraction. The proposed method also achieves better background noise reduction than the MMSE method. Variational Loopy Belief Propagation for Multi-Talker Speech Recognition Steven J. Rennie, John R. Hershey, Peder A. Olsen; IBM T.J. Watson Research Center, USA Tue-Ses3-P1-6, Time: 16:00 1 Richard C. Hendriks , Richard Heusdens , Jesper Jensen 2 ; 1 Technische Universiteit Delft, The Netherlands; 2 Oticon A/S, Denmark Tue-Ses3-P1-3, Time: 16:00 Despite the fact that histograms of speech DFT coefficients are super-Gaussian, not much attention has been paid to develop estimators under these super-Gaussian distributions in combination with perceptual meaningful distortion measures. In this paper we present log-spectral magnitude MMSE estimators under superGaussian densities, resulting in an estimator that is perceptually more meaningful and in line with measured histograms of speech DFT coefficients. Compared to state-of-the-art reference methods, the presented estimator leads to an improvement of the segmental SNR in the order of 0.5 dB up to 1 dB. Moreover, listening tests We address single-channel speech separation and recognition by combining loopy belief propagation and variational inference methods. Inference is done in a graphical model consisting of an HMM for each speaker combined with the max interaction model of source combination. We present a new variational inference algorithm that exploits the structure of the max model to compute an arbitrarily tight bound on the probability of the mixed data. The variational parameters are chosen so that the algorithm scales linearly in the size of the language and acoustic models, and quadratically in the number of sources. The algorithm scores 30.7% on the SSC task [1], which is the best published result by a method that scales linearly with speaker model complexity to date. The algorithm achieves average recognition error rates of 27%, 35%, and 51% on small datasets of SSC-derived speech mixtures containing two, three, and four sources, respectively, using a single audio channel. Notes 102 Enhancement of Binaural Speech Using Codebook Constrained Iterative Binaural Wiener Filter Enhanced Minimum Statistics Technique Incorporating Soft Decision for Noise Suppression Nadir Cazi, T.V. 
Sreenivas; Indian Institute of Science, India Yun-Sik Park, Ji-Hyun Song, Jae-Hun Choi, Joon-Hyuk Chang; Inha University, Korea Tue-Ses3-P1-7, Time: 16:00 Tue-Ses3-P1-10, Time: 16:00 A clean speech VQ codebook has been shown to be effective in providing intraframe constraints and hence better convergence of the iterative Wiener filtering scheme for single channel speech enhancement. Here we present an extension of the single channel CCIWF scheme to binaural speech input by incorporating a speech distortion weighted multi-channel Wiener filter. The new algorithm shows considerable improvement over single channel CCIWF in each channel, in a diffuse noise field environment, in terms of a posteriori SNR and a speech intelligibility measure. Next, considering a moving speech source, a good tracking performance is seen, up to a certain resolution. In this paper, we propose a novel approach to noise power estimation for robust noise suppression in noisy environments. From an investigation of the state-of-the-art techniques for noise power estimation, it is discovered that the previously known methods are accurate mostly either during speech absence or speech presence, but none of them works well in both situations. Our approach combines minimum statistics (MS) and soft decision (SD) techniques based on the probability of speech absence. The performance of the proposed approach is evaluated by a quantitative comparison method and subjective tests under various noise environments and found to yield better results compared with conventional MS and SD-based schemes. A Semi-Blind Source Separation Method with a Less Amount of Computation Suitable for Tiny DSP Modules Effect of Noise Reduction on Reaction Time to Speech in Noise Kazunobu Kondo, Makoto Yamada, Hideki Kenmochi; Yamaha Corporation, Japan Mark Huckvale, Jayne Leak; University College London, UK Tue-Ses3-P1-8, Time: 16:00 Tue-Ses3-P1-11, Time: 16:00 In this paper, we propose a method of implementing FDICA on tiny DSP modules. Firstly, we show a semi-blind separation matrix initialization step that consists of an estimation method using covariance fitting for a known source and an unknown source. This contributes to faster convergence and a smaller amount of computation. Secondly, a learning band selection step is shown that uses the determinant of the covariance matrix as the criterion for selection; this achieves a significant reduction in the amount of computation while maintaining practical separation performance. Finally, the effectiveness of the proposed method is evaluated via source separation simulations in anechoic and reverberant rooms, and a procedure and a resource estimate for the integrated method, which we call tinyICA, are also shown. In moderate levels of noise, listeners report that noise reduction (NR) processing can improve the perceived quality of a speech signal as measured on a typical MOS rating scale. Most quantitative experiments on intelligibility, however, show that NR reduces the intelligibility of noisy speech signals, and so should be expected to increase the cognitive effort required to process utterances. To study cognitive effort we look at how NR affects reaction times to speech in noise, using material that is still highly intelligible. We show that adding noise increases reaction times and that NR does not restore reaction times back to the quiet condition. The implication is that NR does not make speech “easier” to process, at least as far as this task is concerned.
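The minimum-statistics component of the noise-power estimator described above can be sketched very simply: smooth the noisy power spectrum over time and track its running minimum within a sliding window as the noise estimate. The smoothing constant and window length below are illustrative, and the bias compensation of the full method is omitted.

```python
# Minimal minimum-statistics noise tracker: recursively smooth the noisy
# power spectrum, then take the minimum over a sliding window of past
# smoothed frames as the noise power estimate. Bias correction omitted.
import numpy as np

def min_statistics_noise(power_frames: np.ndarray, alpha=0.85, win=60):
    """power_frames: (n_frames, n_bins) noisy power spectra."""
    n_frames, n_bins = power_frames.shape
    noise = np.empty_like(power_frames)
    p = power_frames[0]
    history = []
    for t in range(n_frames):
        p = alpha * p + (1.0 - alpha) * power_frames[t]   # recursive smoothing
        history.append(p)
        if len(history) > win:
            history.pop(0)
        noise[t] = np.min(history, axis=0)                # sliding-window minimum
    return noise

# Usage: stationary noise plus an intermittent "speech" burst.
rng = np.random.default_rng(2)
frames = rng.exponential(scale=1.0, size=(200, 129))
frames[80:120] += 20.0                 # speech-like energy burst
est = min_statistics_noise(frames)
print(est[150].mean())                 # stays near the noise level, roughly 1.0
```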
Model-Based Speech Separation: Identifying Transcription Using Orthogonality Joint Noise Reduction and Dereverberation of Speech Using Hybrid TF-GSC and Adaptive MMSE Estimator S.W. Lee 1 , Frank K. Soong 2 , Tan Lee 1 ; 1 Chinese University of Hong Kong, China; 2 Microsoft Research Asia, China Behdad Dashtbozorg, Hamid Reza Abutalebi; Yazd University, Iran Tue-Ses3-P1-9, Time: 16:00 This paper proposes a new multichannel hybrid method for dereverberation of speech signals in noisy environments. This method extends the use of a hybrid noise reduction method for dereverberation which is based on the combination of Generalized Sidelobe Canceller (GSC) and a single-channel noise reduction stage. In this research, we employ Transfer Function GSC (TF-GSC) that is more suitable for dereverberation. The single-channel stage is an Adaptive Minimum Mean-Square Error (AMMSE) spectral amplitude estimator. We also modify the AMMSE estimator for dereverberation application. Experimental results demonstrate superiority of the proposed method in dereverberation of speech signal in noisy environments. Tue-Ses3-P1-12, Time: 16:00 Spectral envelopes and harmonics are the building elements of a speech signal. By estimating these elements, individual speech sources in a mixture observation can be reconstructed and hence separated. Transcription gives the spoken content. More important, it describes the expected sequence of spectral envelopes, if modeling of different speech sounds is acquired. Our recently proposed single-microphone speech separation algorithm exploits this to derive the spectral envelope trajectories of individual sources and remove interference accordingly. The correctness of such transcription becomes critical to the separation performance. This paper investigates the relationship between the correctness of transcription hypotheses and the orthogonality of associated source estimates. An orthogonality measure is introduced to quantify the correlation between spectrograms. Experiments verify that underlying true transcriptions lead to a salient orthogonality distribution, which is distinguishable from the counterfeit transcription one. Accordingly a transcription identification technique is developed, which succeeds in identifying true transcriptions in 99.74% of the experimental trials. A Study on Multiple Sound Source Localization with a Distributed Microphone System Kook Cho, Takanobu Nishiura, Yoichi Yamashita; Ritsumeikan University, Japan Tue-Ses3-P1-13, Time: 16:00 This paper describes a novel method for multiple sound source Notes 103 localization and its performance evaluation in actual room environments. The proposed method localizes a sound source by finding the position that maximizes the accumulated correlation coefficient between multiple channel pairs. After the estimation of the first sound source, a typical pattern of the accumulated correlation for a single sound source is subtracted from the observed distribution of the accumulated correlation. Subsequently, the second sound source is searched again. To evaluate the effectiveness of the proposed method, experiments of multiple sound source localization were carried out in an actual office room. The result shows that multiple sound source localization accuracy is about 99.7%. The proposed method could realize the multiple sound source localization robustly and stably. Robust Minimal Variance Distortionless Speech Power Spectra Enhancement Using Order Statistic Filter for Microphone Array Tao Yu, John H.L. 
Hansen; University of Texas at Dallas, USA Tue-Ses3-P1-14, Time: 16:00 In this study, we propose a novel minimal variance distortionless speech power spectral enhancement algorithm, which is robust to some of the real-world implementation issues. Our proposed method is implemented in the power spectral domain where stochastic noise can be modeled as the exponential distribution, whose non-Gaussianity is explored by order statistics filter. Both theoretical and experimental results shows the effectiveness of our proposed method over traditional ones. Speech Enhancement Minimizing Generalized Euclidean Distortion Using Supergaussian Priors spectrum, using the knowledge of the window function we are using for the STFT. These harmonics are then scaled and laid on multiples of F0 . Experimental results prove the effectiveness of this enhancement method in various noisy conditions and various SNR ratios. Joint Speech Enhancement and Speaker Identification Using Monte Carlo Methods Ciira wa Maina, John MacLaren Walsh; Drexel University, USA Tue-Ses3-P1-17, Time: 16:00 We present an approach to speaker identification using noisy speech observations where the speech enhancement and speaker identification tasks are performed jointly. This is motivated by the belief that human beings perform these tasks jointly and that optimality may be sacrificed if sequential processing is used. We employ a Bayesian approach where the speech features are modeled using a mixture of Gaussians prior. A Gibbs sampler is used to estimate the speech source and the identity of the speaker. Preliminary experimental results are presented comparing our approach to a maximum likelihood approach and demonstrating the ability of our method to both enhance speech and identify speakers. Tue-Ses3-P2 : ASR: Acoustic Modelling Hewison Hall, 16:00, Tuesday 8 Sept 2009 Chair: Simon King, University of Edinburgh, UK Combined Discriminative Training for Multi-Stream HMM-Based Audio-Visual Speech Recognition Amit Das, John H.L. Hansen; University of Texas at Dallas, USA Tue-Ses3-P1-15, Time: 16:00 Jing Huang 1 , Karthik Visweswariah 2 ; 1 IBM T.J. Watson Research Center, USA; 2 IBM India Research Lab, India We introduce short time spectral estimators which minimize the weighted Euclidean distortion (WED) between the clean and estimated speech spectral components when clean speech is degraded by additive noise. The traditional minimum mean square error (MMSE) estimator does not take into account sufficient perceptual measure during enhancement of noisy speech. However, the new estimators discussed in this paper provide greater flexibility to improve speech quality. We explore the cases when clean speech spectral magnitude and discrete Fourier transform (DFT) coefficients are modeled by super-Gaussian priors like Chi and bilateral Gamma distributions respectively. We also present the joint maximum a posteriori (MAP) estimators of the Chi distributed spectral magnitude and uniform phase. Performance evaluations over two noise types and three SNR levels demonstrate improved results of the proposed estimators. In this paper we investigate discriminative training of models and feature space for a multi-stream hidden Markov model (HMM) based audio-visual speech recognizer (AVSR). Since the two streams are used together in decoding, we propose to train the parameters of the two streams jointly. This is in contrast to prior work which has considered discriminative training of parameters in each stream independent of the other. 
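The multi-stream combination referred to in the audio-visual abstract above is commonly implemented by weighting the per-stream log-likelihoods with stream exponents before decoding; a minimal sketch of that combination step is given below, with Gaussian stream models standing in for the HMM state distributions and made-up stream weights.

```python
# Minimal sketch of multi-stream log-likelihood combination: per-state
# audio and video log-likelihoods are weighted by stream exponents and
# summed before decoding. Stream models and weights are illustrative.
import numpy as np
from scipy.stats import multivariate_normal

def combined_loglik(audio_obs, video_obs, state, w_audio=0.7, w_video=0.3):
    """Log b_j(o_t) for one HMM state j with two observation streams."""
    la = multivariate_normal.logpdf(audio_obs, mean=state["audio_mean"],
                                    cov=state["audio_cov"])
    lv = multivariate_normal.logpdf(video_obs, mean=state["video_mean"],
                                    cov=state["video_cov"])
    return w_audio * la + w_video * lv

# Usage with a single toy state and one frame of each stream:
state = {"audio_mean": np.zeros(13), "audio_cov": np.eye(13),
         "video_mean": np.zeros(4),  "video_cov": np.eye(4)}
print(combined_loglik(np.zeros(13), np.zeros(4), state))
```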
In experiments on a 20-speaker one-hour speaker independent test set, we obtain 22% relative gain on AVSR performance over A/V models whose parameters are trained separately, and 50% relative gain on AVSR over the baseline maximum-likelihood models. On a noisy (mismatched to training) test set, we obtain 21% relative gain over A/V models whose parameters are trained separately. This represents 30% relative improvement over the maximum-likelihood baseline. STFT-Based Speech Enhancement by Reconstructing the Harmonics Cued Speech Recognition for Augmentative Communication in Normal-Hearing and Hearing-Impaired Subjects Iman Haji Abolhassani 1 , Sid-Ahmed Selouani 2 , Douglas O’Shaughnessy 1 ; 1 INRS-EMT, Canada; 2 Université de Moncton, Canada Tue-Ses3-P2-1, Time: 16:00 Panikos Heracleous, Denis Beautemps, Noureddine Abboutabit; GIPSA, France Tue-Ses3-P1-16, Time: 16:00 Tue-Ses3-P2-2, Time: 16:00 A novel Short Time Fourier Transform (STFT) based speech enhancement method is introduced. This method enhances the magnitude spectrum of a noisy speech segment. The new idea that is used in this method is to basically reconstruct the harmonics at the multiples of the fundamental frequency (F0 ) rather than trying to improve them. The harmonics are produced, in the magnitude Speech is the most natural communication means for humans. However, in situations where audio speech is not available or cannot be perceived because of disabilities or adverse environmental conditions, people may resort to alternative methods such as augmented speech. Augmented speech is audio speech supplemented or replaced by other modalities, such as audiovisual speech, or Cued Speech. Cued Speech is a visual communication mode, which uses lipreading and handshapes placed in different positions to make spoken language wholly understandable to deaf individuals. The current study reports the authors’ activities and progress in Cued Speech recognition for French. Previously, the authors have reported experimental results for vowel- and consonant recognition in Cued Speech for French in the case of a normal-hearing subject. The study has been extended by also employing a deaf cuer, and both cuer-dependent and multi-cuer experiments based on hidden Markov models (HMM) have been conducted. On Acquiring Speech Production Knowledge from Articulatory Measurements for Phoneme Recognition D. Neiberg, G. Ananthakrishnan, Mats Blomberg; KTH, Sweden Tue-Ses3-P2-3, Time: 16:00 The paper proposes a general version of a coupled Hidden Markov/Bayesian Network model for performing phoneme recognition on acoustic-articulatory data. The model uses knowledge learned from the articulatory measurements, available for training, for phoneme recognition on the acoustic input. After training on the articulatory data, the model is able to predict 71.5% of the articulatory state sequences using the acoustic input. Using optimized parameters, the proposed method shows a slight improvement for two speakers over the baseline phoneme recognition system which does not use articulatory knowledge. However, the improvement is only statistically significant for one of the speakers. While there is an improvement in recognition accuracy for the vowels, diphthongs and to some extent the semi-vowels, there is a decrease in accuracy for the remaining phonemes. There has been increasing interest in the use of unsupervised adaptation for the personalisation of text-to-speech (TTS) voices, particularly in the context of speech-to-speech translation.
This requires that we are able to generate adaptation transforms from the output of an automatic speech recognition (ASR) system. An approach that utilises unified ASR and TTS models would seem to offer an ideal mechanism for the application of unsupervised adaptation to TTS since transforms could be shared between ASR and TTS. Such unified models should use a common set of parameters. A major barrier to such parameter sharing is the use of differing contexts in ASR and TTS. In this paper we propose a simple approach that generates ASR models from a trained set of TTS models by marginalising over the TTS contexts that are not used by ASR. We present preliminary results of our proposed method on a large vocabulary speech recognition task and provide insights into future directions of this work. Detailed Description of Triphone Model Using SSS-Free Algorithm Motoyuki Suzuki 1 , Daisuke Honma 2 , Akinori Ito 2 , Shozo Makino 2 ; 1 University of Tokushima, Japan; 2 Tohoku University, Japan Tue-Ses3-P2-6, Time: 16:00 The triphone model is frequently used as an acoustic model. It is effective for modeling phonetic variations caused by coarticulation. However, it is known that acoustic features of phonemes are also affected by other factors such as speaking style and speaking speed. In this paper, a new acoustic model is proposed. All training data which have the same phoneme context are automatically clustered into several clusters based on acoustic similarity, and a “sub-triphones” is trained using training data corresponding to a cluster. In experiments, the sub-triphone model achieved about 5% higher phoneme accuracy than the triphone model. Measuring the Gap Between HMM-Based ASR and TTS John Dines 1 , Junichi Yamagishi 2 , Simon King 2 ; 1 IDIAP Research Institute, Switzerland; 2 University of Edinburgh, UK Decision Tree Acoustic Models for ASR Jitendra Ajmera, Masami Akamine; Toshiba Corporate R&D Center, Japan Tue-Ses3-P2-4, Time: 16:00 Tue-Ses3-P2-7, Time: 16:00 The EMIME European project is conducting research in the development of technologies for mobile, personalised speech-to-speech translation systems. The hidden Markov model is being used as the underlying technology in both automatic speech recognition (ASR) and text-to-speech synthesis (TTS) components, thus, the investigation of unified statistical modelling approaches has become an implicit goal of our research. As one of the first steps towards this goal, we have been investigating commonalities and differences between HMM-based ASR and TTS. In this paper we present results and analysis of a series of experiments that have been conducted on English ASR and TTS systems measuring their performance with respect to phone set and lexicon, acoustic feature type and dimensionality and HMM topology. Our results show that, although the fundamental statistical model may be essentially the same, optimal ASR and TTS performance often demands diametrically opposed system designs. This represents a major challenge to be addressed in the investigation of such unified modelling approaches. This paper presents a summary of our research progress using decision-tree acoustic models (DTAM) for large vocabulary speech recognition. Various configurations of training DTAMs are proposed and evaluated on wall-street journal (WSJ) task. A number of different acoustic and categorical features have been used for this purpose. Various ways of realizing a forest instead of a single tree have been presented and shown to improve recognition accuracy. 
Although the performance is not shown to be better than Gaussian mixture models (GMMs), several advantages of DTAMs have been highlighted and exploited. These include compactness, computational simplicity and ability to handle unordered information. Speech Recognition with Speech Synthesis Models by Marginalising over Decision Tree Leaves John Dines, Lakshmi Saheer, Hui Liang; IDIAP Research Institute, Switzerland Tue-Ses3-P2-5, Time: 16:00 Compression Techniques Applied to Multiple Speech Recognition Systems Catherine Breslin, Matt Stuttle, Kate Knill; Toshiba Research Europe Ltd., UK Tue-Ses3-P2-8, Time: 16:00 Speech recognition systems typically contain many Gaussian distributions, and hence a large number of parameters. This makes them both slow to decode speech, and large to store. Techniques have been proposed to decrease the number of parameters. One approach is to share parameters between multiple Gaussians, thus reducing the total number of parameters and allowing for shared likelihood calculation. Gaussian tying and subspace clustering are two related techniques which take this approach to system compression. These techniques can decrease the number of parameters with no noticeable drop in performance for single systems. However, multiple acoustic models are often used in real speech recognition systems. This paper considers the application of Gaussian tying and subspace compression to multiple systems. Results show that two speech recognition systems can be modelled using the same number of Gaussians as just one system, with little effect on individual system performance. performance in gender-independent, spontaneous-speaking applications. However, the multi-path acoustic model size may increase and require more training samples depending on the increased number of paths. To solve this problem, we used a tied-state multipath topology by which we can create a three-domain successive state splitting method to which environmental splitting is added. This method can obtain a suitable model topology with small mixture components. Experiments demonstrated that the proposed multi-path HMnet model performs better than single-path models for the same number of states. Graphical Models for Discrete Hidden Markov Models in Speech Recognition Antonio Miguel, Alfonso Ortega, L. Buera, Eduardo Lleida; Universidad de Zaragoza, Spain Tue-Ses3-P2-9, Time: 16:00 Emission probability distributions in speech recognition have been traditionally associated to continuous random variables. The most successful models have been the mixtures of Gaussians in the states of the hidden Markov models to generate/capture observations. In this work we show how graphical models can be used to extract the joint information of more than two features. This is possible if we previously quantize the speech features to a small number of levels and model them as discrete random variables. In this paper it is shown a method to estimate a graphical model with a bounded number of dependencies, which is a subset of the directed acyclic graph based model framework, Bayesian networks. Some experimental results have been obtained with mixtures of graphical models compared to baseline systems using mixtures of Gaussians with full and diagonal covariance matrices. Acoustic Modeling Using Exponential Families Vaibhava Goel, Peder A. Olsen; IBM T.J. Watson Research Center, USA Tue-Ses3-P2-12, Time: 16:00
Factor Analyzed HMM Topology for Speech Recognition We present a framework to utilize general exponential families for acoustic modeling. Maximum Likelihood (ML) parameter estimation is carried out using sampling based estimates of the partition function and expected feature vector. Markov Chain Monte Carlo procedures are used to draw samples from general exponential densities. We apply our ML estimation framework to two new exponential families to demonstrate the modeling flexibility afforded by this framework. Tue-Ses3-P3 : Assistive Speech Technology Hewison Hall, 16:00, Tuesday 8 Sept 2009 Chair: Elmar Nöth, FAU Erlangen-Nürnberg, Germany Personalizing Synthetic Voices for People with Progressive Speech Disorders: Judging Voice Similarity Tue-Ses3-P2-10, Time: 16:00 S.M. Creer 1 , S.P. Cunningham 1 , P.D. Green 1 , K. Fatema 2 ; 1 University of Sheffield, UK; 2 University of Kent, UK This paper presents a new factor analyzed (FA) similarity measure between two Gaussian mixture models (GMMs). An adaptive hidden Markov model (HMM) topology is built to compensate the pronunciation variations in speech recognition. Our idea aims to evaluate whether the variation of a HMM state from new speech data is significant or not and judge if a new state should be generated in the models. Due to the effectiveness of FA data analysis, we measure the GMM similarity by estimating the common factors and specific factors embedded in the HMM means and variances. Similar Gaussian densities are represented by the common factors. Specific factors express the residual of similarity measure. We perform a composite hypothesis test due to common factors as well as specific factors. An adaptive HMM topology is accordingly established from continuous collection of training utterances. Experiments show that the proposed FA measure outperforms other measures with comparable size of parameters. In building personalized synthetic voices for people with speech disorders, the output should capture the individual’s vocal identity. This paper reports a listener judgment experiment on the similarity of Hidden Markov Model based synthetic voices using varying amounts of adaptation data to two non-impaired speakers. We conclude that around 100 sentences of data is needed to build a voice that retains the characteristics of the target speaker but using more data improves the voice. Experiments using Multi-Layer Perceptrons (MLPs) are conducted to find which acoustic features contribute to the similarity judgments. Results show that melcepstral distortion and fraction of voicing agreement contribute most to replicating the similarity judgment but the combination of all features is required for accurate prediction. Ongoing work applies the findings to voice building for people with impaired speech. Tied-State Multi-Path HMnet Model Using Three-Domain Successive State Splitting Electrolaryngeal Speech Enhancement Based on Statistical Voice Conversion Soo-Young Suk, Hiroaki Kojima; AIST, Japan Keigo Nakamura, Tomoki Toda, Hiroshi Saruwatari, Kiyohiro Shikano; NAIST, Japan Chuan-Wei Ting, Jen-Tzung Chien; National Cheng Kung University, Taiwan Tue-Ses3-P2-11, Time: 16:00 In this paper, we address the improvement of an acoustic model using the multi-path Hidden Markov network (HMnet) model for automatically creating non-uniform tied-state, context-dependent hidden markov model topologies. 
Recent research has achieved multi-path model topologies in order to improve the recognition Tue-Ses3-P3-1, Time: 16:00 Tue-Ses3-P3-2, Time: 16:00 This paper proposes a speaking-aid system for laryngectomees using GMM-based voice conversion that converts electrolaryngeal speech (EL speech) to normal speech. Because valid F0 information cannot be obtained from the EL speech, we have so far converted the EL speech to whispering. This paper conducts the EL speech conversion to normal speech using F0 contours estimated from the spectral information of the EL speech. In this paper, we experimentally evaluate these two types of output speech of our speaking-aid system from several points of view. The experimental results demonstrate that the converted normal speech is preferred to the converted whisper. subjects benefit from SynFace especially with speech with stereo babble noise. Age Recognition for Spoken Dialogue Systems: Do We Need It? Live closed-captions for deaf and hard of hearing audiences are currently produced by stenographers, or by voice writers using speech recognition. Both techniques can produce captions with errors. We are currently developing a correction module that allows a user to intercept the real-time caption stream and correct it before it is broadcast. We report results of preliminary experiments on correction rate and actual user performance using a prototype correction module connected to the output of a speech recognition captioning system. Maria Wolters, Ravichander Vipperla, Steve Renals; University of Edinburgh, UK Tue-Ses3-P3-3, Time: 16:00 When deciding whether to adapt relevant aspects of the system to the particular needs of older users, spoken dialogue systems often rely on automatic detection of chronological age. In this paper, we show that vocal ageing as measured by acoustic features is an unreliable indicator of the need for adaptation. Simple lexical features greatly improve the prediction of both relevant aspects of cognition and interaction style. Lexical features also boost age group prediction. We suggest that adaptation should be based on observed behaviour, not on chronological age, unless it is not feasible to build classifiers for relevant adaptation decisions. Speech-Based and Multimodal Media Center for Different User Groups Real-Time Correction of Closed-Captions Patrick Cardinal, Gilles Boulianne; CRIM, Canada Tue-Ses3-P3-6, Time: 16:00 Universal Access: Speech Recognition for Talkers with Spastic Dysarthria Harsh Vardhan Sharma 1 , Mark Hasegawa-Johnson 2 ; 1 Beckman Institute for Advanced Science & Technology, USA; 2 University of Illinois at Urbana-Champaign, USA Tue-Ses3-P3-7, Time: 16:00 Markku Turunen 1 , Jaakko Hakulinen 1 , Aleksi Melto 1 , Juho Hella 1 , Juha-Pekka Rajaniemi 1 , Erno Mäkinen 1 , Jussi Rantala 1 , Tomi Heimonen 1 , Tuuli Laivo 1 , Hannu Soronen 2 , Mervi Hansen 2 , Pellervo Valkama 1 , Toni Miettinen 1 , Roope Raisamo 1 ; 1 University of Tampere, Finland; 2 Tampere University of Technology, Finland Tue-Ses3-P3-4, Time: 16:00 We present a multimodal media center interface based on speech input, gestures, and haptic feedback. For special user groups, including visually and physically impaired users, the application features a zoomable context + focus GUI in tight combination with speech output and full speech-based control. These features have been developed in cooperation with representatives of the user groups.
Evaluations of the system with regular users have been conducted and results from a study where subjective evaluations were collected show that the performance and user experience of speech input were very good, similar to results from a ten month public pilot use. Virtual Speech Reading Support for Hard of Hearing in a Domestic Multi-Media Setting Samer Al Moubayed 1 , Jonas Beskow 1 , Ann-Marie Öster 1 , Giampiero Salvi 1 , Björn Granström 1 , Nic van Son 2 , Ellen Ormel 2 ; 1 KTH, Sweden; 2 Viataal, The Netherlands Tue-Ses3-P3-5, Time: 16:00 In this paper we present recent results on the development of the SynFace lip synchronized talking head towards multilinguality, varying signal conditions and noise robustness in the Hearing at Home project. We then describe the large scale hearing impaired user studies carried out for three languages. The user tests focus on measuring the gain in Speech Reception Threshold in Noise when using SynFace, and on measuring the effort scaling when using SynFace by hearing impaired people. Preliminary analysis of the results does not show significant gain in SRT or in effort scaling. But looking at inter-subject variability, it is clear that many This paper describes the results of our experiments in small and medium vocabulary dysarthric speech recognition, using the database being recorded by our group under the Universal Access initiative. We develop and test speaker-dependent, word- and phone-level speech recognizers utilizing the hidden Markov Model architecture; the models are trained exclusively on dysarthric speech produced by individuals diagnosed with cerebral palsy. The experiments indicate that (a) different system configurations (being word vs. phone based, number of states per HMM, number of Gaussian components per state specific observation probability density etc.) give useful performance (in terms of recognition accuracy) for different speakers and different task-vocabularies, and (b) for very low intelligibility subjects, speech recognition outperforms human listeners on recognizing dysarthric speech. Exploring Speech Therapy Games with Children on the Autism Spectrum Mohammed E. Hoque, Joseph K. Lane, Rana el Kaliouby, Matthew Goodwin, Rosalind W. Picard; MIT, USA Tue-Ses3-P3-8, Time: 16:00 Individuals on the autism spectrum often have difficulties producing intelligible speech with either high or low speech rate, and atypical pitch and/or amplitude affect. In this study, we present a novel intervention towards customizing speech enabled games to help them produce intelligible speech. In this approach, we clinically and computationally identify the areas of speech production difficulties of our participants. We provide an interactive and customized interface for the participants to meaningfully manipulate the prosodic aspects of their speech. Over the course of 12 months, we have conducted several pilots to set up the experimental design, developed a suite of games and audio processing algorithms for prosodic analysis of speech. Preliminary results demonstrate our intervention being engaging and effective for our participants. Notes 107 Analyzing GMMs to Characterize Resonance Anomalies in Speakers Suffering from Apnoea 1 1 1 José Luis Blanco , Rubén Fernández , David Pardo , Álvaro Sigüenza 1 , Luis A. 
Hernández 1 , José Alcázar 2 ; 1 Universidad Politécnica de Madrid, Spain; 2 Hospital Torrecardenas, Spain Tue-Ses3-P3-9, Time: 16:00 Past research on the speech of apnoea patients has revealed that resonance anomalies are among the most distinguishing traits for these speakers. This paper presents an approach to characterize these peculiarities using GMMs and distance measures between distributions. We report the findings obtained with two analytical procedures, working with a purpose-designed speech database of both healthy and apnoea-suffering patients. First, we validate the database to guarantee that the models trained are able to describe the acoustic space in a way that may reveal differences between groups. Then we study abnormal nasalization in apnoea patients by considering vowels in nasal and non-nasal phonetic contexts. Our results confirm that there are differences between the groups, and that statistical modelling techniques can be used to describe this factor. Results further suggest that it would be possible to design an automatic classifier using such discriminative information. On the Mutual Information Between Source and Filter Contributions for Voice Pathology Detection Thomas Drugman, Thomas Dubuisson, Thierry Dutoit; Faculté Polytechnique de Mons, Belgium Tue-Ses3-P3-10, Time: 16:00 This paper addresses the problem of automatic detection of voice pathologies directly from the speech signal. For this, we investigate the use of the glottal source estimation as a means to detect voice disorders. Three sets of features are proposed, depending on whether they are related to the speech or the glottal signal, or to prosody. The relevancy of these features is assessed through mutual information-based measures. This allows an intuitive interpretation in terms of discrimination power and redundancy between the features, independently of any subsequent classifier. It is discussed which characteristics are interestingly informative or complementary for detecting voice pathologies. A System for Detecting Miscues in Dyslexic Read Speech Morten Højfeldt Rasmussen, Zheng-Hua Tan, Børge Lindberg, Søren Holdt Jensen; Aalborg University, Denmark Tue-Ses3-P3-11, Time: 16:00 While miscue detection in general is a well explored research field little attention has so far been paid to miscue detection in dyslexic read speech. This domain differs substantially from the domains that are commonly researched, as for example dyslexic read speech includes frequent regressions and long pauses between words. A system detecting miscues in dyslexic read speech is presented. It includes an ASR component employing a forced-alignment like grammar adjusted for dyslexic input and uses the GOP score and phone duration to accept or reject the read words. Experimental results show that the system detects miscues at a false alarm rate of 5.3% and a miscue detection rate of 40.1%. These results are worse than current state of the art reading tutors perhaps indicating that dyslexic read speech is a challenge to handle. 
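The miscue-detection abstract above accepts or rejects each read word from a GOP (goodness of pronunciation) score and the phone duration produced by a forced-alignment-style grammar. The Python sketch below shows one common way such a decision can be wired up from per-frame log-likelihoods; the function names, the data layout and the threshold values are assumptions for illustration, not settings taken from the paper.

import numpy as np

def gop_score(segment_loglikes, expected_phone):
    # segment_loglikes: dict mapping each candidate phone label to an array of
    # per-frame log-likelihoods over the segment aligned to expected_phone.
    n_frames = len(segment_loglikes[expected_phone])
    target = np.sum(segment_loglikes[expected_phone])
    best = max(np.sum(ll) for ll in segment_loglikes.values())
    return (target - best) / n_frames  # <= 0; values near 0 indicate a good match

def accept_word(aligned_phones, gop_threshold=-2.0, min_frames=3, max_frames=80):
    # aligned_phones: list of (expected_phone, segment_loglikes) pairs for one word.
    # Thresholds here are placeholders, not values from the paper.
    for expected_phone, segment_loglikes in aligned_phones:
        duration = len(segment_loglikes[expected_phone])
        if not (min_frames <= duration <= max_frames):
            return False  # implausible duration: flag as a possible miscue
        if gop_score(segment_loglikes, expected_phone) < gop_threshold:
            return False  # poor goodness of pronunciation: flag as a possible miscue
    return True

# Toy usage with random scores for two candidate phones over a 10-frame segment.
rng = np.random.default_rng(0)
segment = {"ae": rng.normal(-2.0, 0.2, 10), "eh": rng.normal(-2.5, 0.2, 10)}
print(accept_word([("ae", segment)]))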
Tue-Ses3-P4 : Topics in Spoken Language Processing Hewison Hall, 16:00, Tuesday 8 Sept 2009 Chair: Chiori Hori, NICT, Japan Techniques for Rapid and Robust Topic Identification of Conversational Telephone Speech Jonathan Wintrode 1 , Scott Kulp 2 ; 1 United States Department of Defense, USA; 2 Rutgers University, USA Tue-Ses3-P4-1, Time: 16:00 In this paper, we investigate the impact of automatic speech recognition (ASR) errors on the accuracy of topic identification in conversational telephone speech. We present a modified TF-IDF feature weighting calculation that provides significant robustness under various recognition error conditions. For our experiments we take conversations from the Fisher corpus to produce 1-best and lattice outputs using a single recognizer tuned to run at various speeds. We use an SVM classifier to perform topic identification on the output. We observe classifiers incorporating confidence information to be significantly more robust to errors than those treating output as unweighted text. Localization of Speech Recognition in Spoken Dialog Systems: How Machine Translation Can Make Our Lives Easier David Suendermann, Jackson Liscombe, Krishna Dayanidhi, Roberto Pieraccini; SpeechCycle Labs, USA Tue-Ses3-P4-2, Time: 16:00 The localization of speech recognition for large-scale spoken dialog systems can be a tremendous exercise. Usually, all involved grammars have to be translated by a language expert, and new data has to be collected, transcribed, and annotated for statistical utterance classifiers resulting in a time-consuming and expensive undertaking. Often though, a vast number of transcribed and annotated utterances exists for the source language. In this paper, we propose to use such data and translate it into the target language using machine translation. The translated utterances and their associated (original) annotations are then used to train statistical grammars for all contexts of the target system. As an example, we localize an English spoken dialog system for Internet troubleshooting to Spanish by translating more than 4 million source utterances without any human intervention. In an application of the localized system to more than 10,000 utterances collected on a similar Spanish Internet troubleshooting system, we show that the overall accuracy was only 5.7% worse than that of the English source system. Algorithms for Speech Indexing in Microsoft Recite Kunal Mukerjee, Shankar Regunathan, Jeffrey Cole; Microsoft Corporation, USA Tue-Ses3-P4-3, Time: 16:00 Microsoft Recite is a mobile application to store and retrieve spoken notes. Recite stores and matches n-grams of pattern class identifiers that are designed to be language neutral and handle a large number of out of vocabulary phrases. The query algorithm expects noise and fragmented matches and compensates for them with a heuristic ranking scheme. This contribution describes a class of indexing algorithms for Recite that allows for high retrieval accuracy while meeting the constraints of low computational complexity and memory footprint of embedded platforms. The results demonstrate that a particular indexing scheme within this class can be selected to optimize the trade-off between retrieval accuracy and insertion/query complexity.
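The Recite abstract above matches n-grams of pattern-class identifiers and ranks candidates with a heuristic that tolerates noisy, fragmented queries. The sketch below is a minimal inverted-index illustration of that idea; the NgramIndex class, the n-gram order and the fraction-based ranking are illustrative assumptions rather than the indexing scheme actually shipped in Recite.

from collections import defaultdict

class NgramIndex:
    # Inverted index over n-grams of pattern-class identifiers; the ranking
    # tolerates missing or fragmented matches by scoring each stored note by
    # the fraction of query n-grams it contains.
    def __init__(self, n=3):
        self.n = n
        self.postings = defaultdict(set)

    def _ngrams(self, class_ids):
        return [tuple(class_ids[i:i + self.n])
                for i in range(len(class_ids) - self.n + 1)]

    def add(self, note_id, class_ids):
        for gram in self._ngrams(class_ids):
            self.postings[gram].add(note_id)

    def query(self, class_ids, top_k=5):
        grams = self._ngrams(class_ids)
        hits = defaultdict(int)
        for gram in grams:
            for note_id in self.postings.get(gram, ()):
                hits[note_id] += 1
        scored = [(note_id, count / len(grams)) for note_id, count in hits.items()]
        return sorted(scored, key=lambda item: item[1], reverse=True)[:top_k]

# Toy usage with small integer sequences standing in for pattern-class IDs.
index = NgramIndex(n=3)
index.add("note-1", [4, 7, 7, 2, 9, 1])
index.add("note-2", [3, 3, 8, 4, 7, 7])
print(index.query([4, 7, 7, 2, 5]))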
Parallelized Viterbi Processor for 5,000-Word Large-Vocabulary Real-Time Continuous Speech Recognition FPGA System A WFST-Based Log-Linear Framework for Speaking-Style Transformation Graham Neubig, Shinsuke Mori, Tatsuya Kawahara; Kyoto University, Japan Tue-Ses3-P4-7, Time: 16:00 Tsuyoshi Fujinaga, Kazuo Miura, Hiroki Noguchi, Hiroshi Kawaguchi, Masahiko Yoshimoto; Kobe University, Japan Tue-Ses3-P4-4, Time: 16:00 We propose a novel Viterbi processor for the large vocabulary real-time continuous speech recognition. This processor is built with multi Viterbi cores. Since each core can independently compute, these cores reduce the cycle times very efficiently. To verify the effect of utilizing multi cores, we implement a dual-core Viterbi processor in an FPGA and achieve 49% cycle-time reduction, compared to a single-core processor. Our proposed dual-core Viterbi processor achieves the 5,000-word real-time continuous speech recognition at 65.175 MHz. In addition, it is easy to implement scalable increases in the number of cores, which leads to achievement of the larger vocabulary. SpLaSH (Spoken Language Search Hawk): Integrating Time-Aligned with Text-Aligned Annotations Sara Romano, Elvio Cecere, Francesco Cutugno; Università di Napoli Federico II, Italy When attempting to make transcripts from automatic speech recognition results, disfluency deletion, transformation of colloquial expressions, and insertion of dropped words must be performed to ensure that the final product is clean transcript-style text. This paper introduces a system for the automatic transformation of the spoken word to transcript-style language that enables not only deletion of disfluencies, but also substitutions of colloquial expressions and insertion of dropped words. A number of potentially useful features are combined in a log-linear probabilistic framework, and the utility of each is examined. The system is implemented using weighted finite state transducers (WFSTs) to allow for easy combination of features and integration with other WFST-based systems. On evaluation, the best system achieved a 5.37% word error rate, a 5.49% absolute gain over a rule-based baseline and a 1.54% absolute gain over a simple noisy-channel model. ClusterRank: A Graph Based Method for Meeting Summarization Nikhil Garg 1 , Benoit Favre 2 , Korbinian Reidhammer 2 , Dilek Hakkani-Tür 2 ; 1 EPFL, Switzerland; 2 ICSI, USA Tue-Ses3-P4-8, Time: 16:00 In this work we present SpLaSH (Spoken Language Search Hawk), a toolkit used to perform complex queries on spoken language corpora. In SpLaSH, tools for the integration of time aligned annotations (TMA), by means of annotation graphs, with text aligned ones (TXA), by means of generic XML files, are provided. SpLaSH imposes a very limited number of constraints to the data model design, allowing the integration of annotations developed separately within the same dataset and without any relative dependency. It also provides a GUI allowing three types of queries: simple query on TXA or TMA structures, sequence query on TMA structure and cross query on both TXA and TMA integrated structures. This paper presents an unsupervised, graph based approach for extractive summarization of meetings. Graph based methods such as TextRank have been used for sentence extraction from news articles. These methods model text as a graph with sentences as nodes and edges based on word overlap. A sentence node is then ranked according to its similarity with other nodes. 
The spontaneous speech in meetings leads to incomplete, ill-formed sentences with high redundancy and calls for additional measures to extract relevant sentences. We propose an extension of the TextRank algorithm that clusters the meeting utterances and uses these clusters to construct the graph. We evaluate this method on the AMI meeting corpus and show a significant improvement over TextRank and other baseline methods. PodCastle: Collaborative Training of Acoustic Models on the Basis of Wisdom of Crowds for Podcast Transcription Leveraging Sentence Weights in a Concept-Based Optimization Framework for Extractive Meeting Summarization Jun Ogata, Masataka Goto; AIST, Japan Shasha Xie 1 , Benoit Favre 2 , Dilek Hakkani-Tür 2 , Yang Liu 1 ; 1 University of Texas at Dallas, USA; 2 ICSI, USA Tue-Ses3-P4-5, Time: 16:00 Tue-Ses3-P4-6, Time: 16:00 This paper presents acoustic-model-training techniques for improving automatic transcription of podcasts. A typical approach for acoustic modeling is to create a task-specific corpus including hundreds (or even thousands) of hours of speech data and their accurate transcriptions. This approach, however, is impractical in podcast-transcription task because manual generation of the transcriptions of the large amounts of speech covering all the various types of podcast contents will be too costly and time consuming. To solve this problem, we introduce collaborative training of acoustic models on the basis of wisdom of crowds, i.e., the transcriptions of podcast-speech data are generated by anonymous users on our web service PodCastle. We then describe a podcast-dependent acoustic modeling system by using RSS metadata to deal with the differences of acoustic conditions in podcast speech data. From our experimental results on actual podcast speech data, the effectiveness of the proposed acoustic model training was confirmed. Tue-Ses3-P4-9, Time: 16:00 We adopt an unsupervised concept-based global optimization framework for extractive meeting summarization, where a subset of sentences is selected to cover as many important concepts as possible. We propose to leverage sentence importance weights in this model. Three ways are introduced to combine the sentence weights within the concept-based optimization framework: selecting sentences for concept extraction, pruning unlikely candidate summary sentences, and using joint optimization of sentence and concept weights. Our experimental results on the ICSI meeting corpus show that our proposed methods can significantly improve the performance for both human transcripts and ASR output compared to the concept-based baseline approach, and this unsupervised approach achieves results comparable with those from supervised learning approaches presented in previous work. Notes 109 Hybrids of Supervised and Unsupervised Models for Extractive Speech Summarization Shih-Hsiang Lin, Yueng-Tien Lo, Yao-Ming Yeh, Berlin Chen; National Taiwan Normal University, Taiwan Tue-Ses3-P4-10, Time: 16:00 Speech summarization, distilling important information and removing redundant and incorrect information from spoken documents, has become an active area of intensive research in the recent past. In this paper, we consider hybrids of supervised and unsupervised models for extractive speech summarization. Moreover, we investigate the use of the unsupervised summarizer to improve the performance of the supervised summarizer when manual labels are not available for training the latter. 
A novel training data selection and relabeling approach designed to leverage the inter-document or/and the inter-sentence similarity information is explored as well. Encouraging results were initially demonstrated. Automatic Detection of Audio Advertisements I. Dan Melamed, Yeon-Jun Kim; AT&T Labs Research, USA which found little or no evidence for the different types of isochrony which had been assumed to be the basis for the classification. In recent years, there has been a renewal of interest with the development of empirical metrics for measuring rhythm. In this paper it is shown that some of these metrics are more sensitive to the rhythm of the text than to the rhythm of the utterance itself. While a number of recent proposals have been made for improving these metrics it is proposed that what is needed is more detailed studies of large corpora in order to develop more sophisticated models of the way in which prosodic structure is realised in different languages. New data on British English is presented using the Aix-Marsec corpus. Oral Presentation of Poster Papers Time: 16:20 No Time to Lose? Time Shrinking Effects Enhance the Impression of Rhythmic “Isochrony” and Fast Speech Rate Petra Wagner, Andreas Windmann; Universität Bielefeld, Germany Tue-Ses3-P4-11, Time: 16:00 Quality control analysts in customer service call centers often search for keywords in call transcripts. Their searches can return an overwhelming number of false positives when the search terms also appear in advertisements that customers hear while they are on hold. This paper presents new methods for detecting advertisements in audio data, so that they can be filtered out. In order to be usable in real-world applications, our methods are designed to minimize human intervention after deployment. Even so, they are much more accurate than a baseline HMM method. Named Entity Network Based on Wikipedia Sameer Maskey 1 , Wisam Dakka 2 ; 1 IBM T.J. Watson Research Center, USA; 2 Google Inc., USA Tue-Ses3-P4-12, Time: 16:00 Named Entities (NEs) play an important role in many natural language and speech processing tasks. A resource that identifies relations between NEs could potentially be very useful. We present such automatically generated knowledge resource from Wikipedia, Named Entity Network (NE-NET), that provides a list of related Named Entities (NEs) and the degree of relation for any given NE. Unlike some manually built knowledge resource, NE-NET has a wide coverage consisting of 1.5 million NEs represented as nodes of a graph with 6.5 million arcs relating them. NE-NET also provides the ranks of the related NEs using a simple ranking function that we propose. In this paper, we present NE-NET and our experiments showing how NE-NET can be used to improve the retrieval of spoken (Broadcast News) and text documents. Tue-Ses3-S2 : Special Session: Measuring the Rhythm of Speech Ainsworth (East Wing 4), 16:00, Tuesday 8 Sept 2009 Chair: Daniel Hirst, LPL, France and Greg Kochanski, University of Oxford, UK Tue-Ses3-S2-2, Time: 16:40 Time Shrinking denotes the psycho-acoustic shrinking effect of a short interval on one or several subsequent longer intervals. Its effectiveness in the domain of speech perception has so far not been examined. Two perception experiments clearly suggest the influence of relative duration patterns triggering time shrinking on the perception of tempo and rhythmical isochrony or rather “evenness”. 
A comparison between the experimental data and duration patterns across various languages suggests a strong influence of time shrinking on the impression of isochrony in speech and perceptual speech rate. Our results thus emphasize the necessity of taking into account relative timing within rhythmical domains such as feet, phrases or narrow rhythm units as a complementary perspective to popular global rhythm variability metrics. Measuring Speech Rhythm Variation in a Model-Based Framework Plínio A. Barbosa; State University of Campinas, Brazil Tue-Ses3-S2-3, Time: 17:00 A coupled-oscillators-model-based method for measuring speech rhythm is presented. This model explains cross-linguistic differences in rhythm as deriving from varying degrees of coupling strength between a syllable oscillator and a phrase stress oscillator. The method was applied to three texts read aloud in French, in Brazilian and European Portuguese by seven speakers. The results reproduce the early findings on rhythm typology for these languages/varieties with the following advantages: it successfully accounts for speech rate variation, related to the syllabic oscillator frequency in the model; it takes only syllable-sized units into account, not splitting syllables into vowels and consonants; the consequences of phrase stress magnitude on stress group duration are directly considered; both universal and language-specific aspects of speech rhythm are captured by the model. Rhythm Measures with Language-Independent Segmentation The Rhythm of Text and the Rhythm of Utterances: From Metrics to Models Anastassia Loukina 1 , Greg Kochanski 1 , Chilin Shih 2 , Elinor Keane 1 , Ian Watson 1 ; 1 University of Oxford, UK; 2 University of Illinois at Urbana-Champaign, USA Daniel Hirst; LPL, France Tue-Ses3-S2-1, Time: 16:00 The typological classification of languages as stress-timed, syllabletimed and mora-timed did not stand up to empirical investigation Tue-Ses3-S2-4, Time: 17:20 Notes 110 We compare 15 measures of speech rhythm based on an automatic segmentation of speech into vowel-like and consonant-like regions. This allows us to apply identical segmentation criteria to all languages and to compute rhythm measures over a large corpus. It may also approximate more closely the segmentation available to pre-lexical infants, who apparently can discriminate between languages. We find that within-language variation is large and comparable to the between-languages differences we observed. We evaluate the success of different measures in separating languages and show that the efficiency of measures depends on the languages included in the corpus. Rhythm appears to be described by two dimensions and different published rhythm measures capture different aspects of it. Panel Discussion Time: 17:40 No abstract was available at the time of publication. Investigating Changes in the Rhythm of Maori Over Time Margaret Maclagan 1 , Catherine I. Watson 2 , Jeanette King 1 , Ray Harlow 3 , Laura Thompson 2 , Peter Keegan 2 ; 1 University of Canterbury, New Zealand; 2 University of Auckland, New Zealand; 3 University of Waikato, New Zealand nantal and vocalic intervals in spoken texts. One of the problems of this approach lies in complex syllabic structures. Unless we make an a-priori phonological decision, sonorous consonants may contribute to either vocalic or consonantal part of the speech signal in post-initial and pre-final positions of syllabic onsets and codas. 
A procedure is offered to avoid phonological dilemmas together with tedious manual work. The method is tested on continuous Czech and English texts read out by several professionals. Vowel Duration in Pre-Geminate Contexts in Polish Zofia Malisz; Adam Mickiewicz University, Poland Tue-Ses3-S2-8, Time: 18:00 The study presents Polish experimental data on the variability of vowel duration in the context of following singleton and geminate consonants. The aim of the study is to explain the low vocalic variability values obtained from “rhythm metrics” based analyses of speech rhythm. It also aims at contributing to the discussion about current dynamical models of speech rhythm that contain assumptions of the relative temporal stability of the vowel-to-vowel sequence. The results suggest that vowels in Polish co-vary with following consonant length in a roughly proportionate manner. An interpretation of the effect is offered where a fortition process overrides the possibility of temporal compensation. Wed-Ses1-O1 : Speaker Verification & Identification II Tue-Ses3-S2-5, Time: 18:00 Present-day Maori elders comment that the mita (which includes rhythm) of the Maori language, has changed over time. This paper presents the first results in a study of the change of Maori rhythm. PVI analyses did not capture this change. Perceptual experiments, using extracts of speech low-pass filtered to 400 Hz, demonstrated that Maori and English speech could be distinguished. Listeners who spoke Maori were more accurate than those who spoke only English. The English and Maori speech of groups of different speakers born at different times was perceived differently, indicating that the rhythm of Maori has indeed changed over time. Main Hall, 10:00, Wednesday 9 Sept 2009 Chair: Steve Renals, University of Edinburgh, UK Effects of Mora-Timing in English Rhythm Control by Japanese Learners Intersession variability (ISV) compensation in speaker recognition is well studied with respect to extrinsic variation, but little is known about its ability to model intrinsic variation. We find that ISV compensation is remarkably successful on a corpus of intrinsic variation that is highly controlled for channel (a dominant component of ISV). The results are particularly surprising because the ISV training data come from a different corpus than do speaker train and test data. We further find that relative improvements are (1) inversely related to uncompensated performance, (2) reduced more by vocal effort train/test mismatch than by speaking style mismatch, and (3) reduced additionally for mismatches in both style and level. Results demonstrate that intersession variability compensation does model intrinsic variation, and suggest that mismatched data may be more useful than previously expected for modeling certain types of within-speaker variability in speech. Shizuka Nakamura 1 , Hiroaki Kato 2 , Yoshinori Sagisaka 1 ; 1 Waseda University, Japan; 2 NICT, Japan Tue-Ses3-S2-6, Time: 18:00 In this paper, we analyzed the durational differences between learners and native speakers in various speech units from the perspective of that the contrast between the stressed and the unstressed is one of the most important features to characterize stress-timing of English by comparison with mora-timing of Japanese. The results showed that the lengthening and shortening of learner speech were not enough to convey the difference between the stressed and the unstressed. 
Finally, it was confirmed that these durational differences strongly affected the subjective evaluation scores given by English language teachers. The Dynamic Dimension of the Global Speech-Rhythm Attributes Does Session Variability Compensation in Speaker Recognition Model Intrinsic Variation Under Mismatched Conditions? Elizabeth Shriberg, Sachin Kajarekar, Nicolas Scheffer; SRI International, USA Wed-Ses1-O1-1, Time: 10:00 Variability Compensated Support Vector Machines Applied to Speaker Verification Zahi N. Karam, W.M. Campbell; MIT, USA Jan Volín 1 , Petr Pollák 2 ; 1 Charles University in Prague, Czech Republic; 2 Czech Technical University in Prague, Czech Republic Tue-Ses3-S2-7, Time: 18:00 Recent years have revealed that certain global attributes of speech rhythm can be quite successfully captured with respect to conso- Wed-Ses1-O1-2, Time: 10:20 Speaker verification using SVMs has proven successful, specifically using the GSV Kernel [1] with nuisance attribute projection (NAP) [2]. Also, the recent popularity and success of joint factor analysis [3] has led to promising attempts to use speaker factors directly as SVM features [4]. NAP projection and the use of speaker factors with SVMs are methods of handling variability in SVM speaker verification: NAP by removing undesirable nuisance variability, and using the speaker factors by forcing the discrimination to be performed based on inter-speaker variability. These successes have led us to propose a new method we call variability compensated SVM (VCSVM) to handle both inter and intra-speaker variability directly in the SVM optimization. This is done by adding a regularized penalty to the optimization that biases the normal to the hyperplane to be orthogonal to the nuisance subspace or alternatively to the complement of the subspace containing the inter-speaker variability. This bias will attempt to ensure that inter-speaker variability is used in the recognition while intra-speaker variability is ignored. In this paper, we present the VCSVM theory and promising results on nuisance compensation. Support Vector Machines versus Fast Scoring in the Low-Dimensional Total Variability Space for Speaker Verification Najim Dehak 1 , Réda Dehak 2 , Patrick Kenny 1 , Niko Brümmer 3 , Pierre Ouellet 1 , Pierre Dumouchel 1 ; 1 CRIM, Canada; 2 LRDE, France; 3 AGNITIO, South Africa Wed-Ses1-O1-3, Time: 10:40 This paper presents a new speaker verification system architecture based on Joint Factor Analysis (JFA) as feature extractor. In this modeling, the JFA is used to define a new low-dimensional space named the total variability factor space, instead of both channel and speaker variability spaces for the classical JFA. The main contribution in this approach is the use of the cosine kernel in the new total factor space to design two different systems: the first system is Support Vector Machines based, and the second one uses this kernel directly as a decision score. This last scoring method makes the process faster and less computationally complex compared to other classical methods. We tested several intersession compensation methods in total factors, and we found that the combination of Linear Discriminant Analysis and Within Class Covariance Normalization achieved the best performance. We achieved remarkable results using the fast scoring method based only on the cosine kernel, especially for male trials, where we yield an EER of 1.12% and MinDCF of 0.0094 on the English trials of the NIST 2008 SRE dataset.
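The total-variability abstract above replaces SVM scoring with a direct cosine score between low-dimensional factor vectors after intersession compensation. The sketch below illustrates cosine scoring with a WCCN projection estimated from speaker-labelled training vectors; the LDA stage reported in the paper is omitted, and the dictionary-of-speakers input format is an assumption made for illustration.

import numpy as np

def wccn_matrix(vectors_by_speaker):
    # Within-class covariance normalization: B satisfies B B^T = W^{-1},
    # where W is the average within-speaker covariance of the training vectors.
    dim = len(next(iter(vectors_by_speaker.values()))[0])
    W = np.zeros((dim, dim))
    for vecs in vectors_by_speaker.values():
        vecs = np.asarray(vecs, dtype=float)
        centred = vecs - vecs.mean(axis=0)
        W += centred.T @ centred / len(vecs)
    W /= len(vectors_by_speaker)
    return np.linalg.cholesky(np.linalg.inv(W))

def cosine_score(w_target, w_test, B):
    # Cosine similarity in the WCCN-projected space, used directly as the
    # verification score (no SVM).
    a, b = B.T @ w_target, B.T @ w_test
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy usage with 2 speakers, 3 training vectors each, in a 4-dimensional space.
rng = np.random.default_rng(1)
train = {spk: [rng.normal(size=4) for _ in range(3)] for spk in ("spk1", "spk2")}
B = wccn_matrix(train)
print(cosine_score(rng.normal(size=4), rng.normal(size=4), B))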
Within-Session Variability Modelling for Factor Analysis Speaker Verification Robbie Vogt 1 , Jason Pelecanos 2 , Nicolas Scheffer 3 , Sachin Kajarekar 3 , Sridha Sridharan 1 ; 1 Queensland University of Technology, Australia; 2 IBM T.J. Watson Research Center, USA; 3 SRI International, USA Speaker Recognition by Gaussian Information Bottleneck Ron M. Hecht 1 , Elad Noor 2 , Naftali Tishby 3 ; 1 Tel-Aviv University, Israel; 2 Weizmann Institute of Science, Israel; 3 Hebrew University, Israel Wed-Ses1-O1-5, Time: 11:20 This paper explores a novel approach for the extraction of relevant information in speaker recognition tasks. This approach uses a principled information theoretic framework — the Information Bottleneck method (IB). In our application, the method compresses the acoustic data while preserving mostly the relevant information for speaker identification. This paper focuses on a continuous version of the IB method known as the Gaussian Information Bottleneck (GIB). This version assumes that both the source and target variables are high dimensional multivariate Gaussian variables. The GIB was applied in our work to the Super Vector (SV) dimension reduction conundrum. Experiments were conducted on the male part of the NIST SRE 2005 corpora. The GIB representation was compared to other dimension reduction techniques and to a baseline system. In our experiments, the GIB outperformed the baseline system; achieving a 6.1% Equal Error Rate (EER) compared to the 15.1% EER of a baseline system. Variational Dynamic Kernels for Speaker Verification C. Longworth, R.C. van Dalen, M.J.F. Gales; University of Cambridge, UK Wed-Ses1-O1-6, Time: 11:40 An important aspect of SVM-based speaker verification is the choice of dynamic kernel. Recently there has been interest in the use of kernels based on the Kullback-Leibler divergence between GMMs. Since this has no closed-form solution, typically a matched-pair upper bound is used instead. This places significant restrictions on the forms of model structure that may be used. All GMMs must contain the same number of components and must be adapted from a single background model. For many tasks this will not be optimal. In this paper, dynamic kernels are proposed based on alternative, variational approximations to the KL divergence. Unlike the matched-pair bound, these do not restrict the forms of GMM that may be used. Additionally, using a more accurate approximation of the divergence may lead to performance gains. Preliminary results using these kernels are presented on the NIST 2002 SRE dataset. Wed-Ses1-O2 : Emotion and Expression I Jones (East Wing 1), 10:00, Wednesday 9 Sept 2009 Chair: Ailbhe Ní Chasaide, Trinity College Dublin, Ireland Wed-Ses1-O1-4, Time: 11:00 This work presents an extended Joint Factor Analysis model including explicit modelling of unwanted within-session variability. The goals of the proposed extended JFA model are to improve verification performance with short utterances by compensating for the effects of limited or imbalanced phonetic coverage, and to produce a flexible JFA model that is effective over a wide range of utterance lengths without adjusting model parameters such as retraining session subspaces. Experimental results on the 2006 NIST SRE corpus demonstrate the flexibility of the proposed model by providing competitive results over a wide range of utterance lengths without retraining and also yielding modest improvements in a number of conditions over current state-of-the-art. 
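The dynamic-kernel abstract for Wed-Ses1-O1-6 above contrasts the usual matched-pair bound with variational approximations to the KL divergence between GMMs that place no restriction on the model structure. A minimal sketch of one widely used variational approximation for diagonal-covariance GMMs follows; turning the symmetrised divergence into a kernel with an exponential, as done at the end, is only one plausible choice and is not taken from the paper.

import numpy as np

def kl_diag_gauss(m1, v1, m2, v2):
    # Closed-form KL divergence between two diagonal-covariance Gaussians.
    return 0.5 * np.sum(np.log(v2 / v1) + (v1 + (m1 - m2) ** 2) / v2 - 1.0)

def variational_kl_gmm(gmm_f, gmm_g):
    # gmm = (weights, means, variances); variational approximation to KL(f || g)
    # that sums over all component pairs instead of matching components one-to-one.
    w_f, m_f, v_f = gmm_f
    w_g, m_g, v_g = gmm_g
    total = 0.0
    for a in range(len(w_f)):
        num = sum(w_f[b] * np.exp(-kl_diag_gauss(m_f[a], v_f[a], m_f[b], v_f[b]))
                  for b in range(len(w_f)))
        den = sum(w_g[b] * np.exp(-kl_diag_gauss(m_f[a], v_f[a], m_g[b], v_g[b]))
                  for b in range(len(w_g)))
        total += w_f[a] * np.log(num / den)
    return total

def kl_kernel(gmm_f, gmm_g):
    # Symmetrised divergence mapped to a kernel value (an illustrative choice).
    return float(np.exp(-(variational_kl_gmm(gmm_f, gmm_g) + variational_kl_gmm(gmm_g, gmm_f))))

# Toy usage: two 2-component GMMs in one dimension.
f = (np.array([0.5, 0.5]), np.array([[0.0], [2.0]]), np.array([[1.0], [1.0]]))
g = (np.array([0.6, 0.4]), np.array([[0.1], [2.5]]), np.array([[1.2], [0.8]]))
print(kl_kernel(f, g))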
Emotion Dimensions and Formant Position Martijn Goudbeek 1 , Jean Philippe Goldman 2 , Klaus R. Scherer 3 ; 1 Tilburg University, The Netherlands; 2 Université de Genève, Switzerland; 3 Swiss Center for Affective Sciences, Switzerland Wed-Ses1-O2-1, Time: 10:00 The influence of emotion on articulatory precision was investigated in a newly established corpus of acted emotional speech. The frequencies of the first and second formant of the vowels /i/, /u/, and /a/ were measured and shown to be significantly affected by emotion dimension. High arousal resulted in a higher mean F1 in all vowels, whereas positive valence resulted in higher mean values for F2. The dimension potency/control showed a pattern of effects that was consistent with a larger vocalic triangle for emotions high in potency/control. The results are interpreted in the context of Scherer’s component process model. Identifying Uncertain Words Within an Utterance via Prosodic Features Heather Pon-Barry, Stuart Shieber; Harvard University, USA Wed-Ses1-O2-2, Time: 10:20 activation, evaluation, and power. We found immediate response patterns, in which the staff member colored her utterances in response to the emotion shown by the student in the immediately previous utterance, and built a predictive model suitable for use in a dialog system to persuasively discuss graduate school with students. Analysis of Laugh Signals for Detecting Laughter in Continuous Speech Sudheer Kumar K. 1 , Sri Harish Reddy M. 1 , Sri Rama Murty K. 2 , B. Yegnanarayana 1 ; 1 IIIT Hyderabad, India; 2 IIT Madras, India We describe an experiment that investigates whether sub-utterance prosodic features can be used to detect uncertainty at the word-level. That is, given an utterance that is classified as uncertain, we want to determine which word or phrase the speaker is uncertain about. We have a corpus of utterances spoken under varying degrees of certainty. Using combinations of sub-utterance prosodic features we train models to predict the level of certainty of an utterance. On a set of utterances that were perceived to be uncertain, we compare the predictions of our models for two candidate ‘target word’ segmentations: (a) one with the actual word causing uncertainty as the proposed target word, and (b) one with a control word as the proposed target word. Our best model correctly identifies the word causing the uncertainty rather than the control word 91% of the time. Laughter is a nonverbal vocalization that occurs often in speech communication. Since laughter is produced by the speech production mechanism, spectral analysis methods are used mostly for the study of laughter acoustics. In this paper the significance of excitation features for discriminating laughter and speech is discussed. New features describing the excitation characteristics are used to analyze the laugh signals. The features are based on instantaneous pitch and strength of excitation at epochs. An algorithm is developed based on these features to detect laughter regions in continuous speech. The results are illustrated by detecting laughter regions in a TV broadcast program. Evaluating Evaluators: A Case Study in Understanding the Benefits and Pitfalls of Multi-Evaluator Modeling Data-Driven Clustering in Emotional Space for Affect Recognition Using Discriminatively Trained LSTM Networks Emily Mower, Maja J. Matarić, Shrikanth S.
Narayanan; University of Southern California, USA Martin Wöllmer 1 , Florian Eyben 1 , Björn Schuller 1 , Ellen Douglas-Cowie 2 , Roddy Cowie 2 ; 1 Technische Universität München, Germany; 2 Queen’s University Belfast, UK Wed-Ses1-O2-3, Time: 10:40 Emotion perception is a complex process, often measured using stimuli presentation experiments that query evaluators for their perceptual ratings of emotional cues. These evaluations contain large amounts of variability both related and unrelated to the evaluated utterances. One approach to handling this variability is to model emotion perception at the individual level. However, the perceptions of specific users may not adequately capture the emotional acoustic properties of an utterance. This problem can be mitigated by the common technique of averaging evaluations from multiple users. We demonstrate that this averaging procedure improves classification performance when compared to classification results from models created using individual-specific evaluations. We also demonstrate that the performance increases are related to the consistency with which evaluators label data. These results suggest that the acoustic properties of emotional speech are better captured using models formed from averaged evaluations rather than from individual-specific evaluations. Responding to User Emotional State by Adding Emotional Coloring to Utterances Wed-Ses1-O2-5, Time: 11:20 Wed-Ses1-O2-6, Time: 11:40 In today’s affective databases speech turns are often labelled on a continuous scale for emotional dimensions such as valence or arousal to better express the diversity of human affect. However, applications like virtual agents usually map the detected emotional user state to rough classes in order to reduce the multiplicity of emotion dependent system responses. Since these classes often do not optimally reflect emotions that typically occur in a given application, this paper investigates data-driven clustering of emotional space to find class divisions that better match the training data and the area of application. Thereby we consider the Belfast Sensitive Artificial Listener database and TV talkshow data from the VAM corpus. We show that a discriminatively trained Long Short-Term Memory (LSTM) recurrent neural net that explicitly learns clusters in emotional space and additionally models context information outperforms both, Support Vector Machines and a Regression-LSTM net. Jaime C. Acosta, Nigel G. Ward; University of Texas at El Paso, USA Wed-Ses1-O3 : Automatic Speech Recognition: Adaptation II Wed-Ses1-O2-4, Time: 11:00 Fallside (East Wing 2), 10:00, Wednesday 9 Sept 2009 Chair: Satoshi Nakamura, NICT, Japan When people speak to each other, they share a rich set of nonverbal behaviors such as varying prosody in voice. These behaviors, sometimes interpreted as demonstrations of emotions, call for appropriate responses, but today’s spoken dialog systems lack the ability to do so. We collected a corpus of persuasive dialogs, specifically conversations about graduate school between a staff member and students, and had judges label all utterances with triples indicating the perceived emotions, using the three dimensions: On the Estimation and the Use of Confusion-Matrices for Improving ASR Accuracy Omar Caballero Morales, Stephen J. 
Wed-Ses1-O3 : Automatic Speech Recognition: Adaptation II
Fallside (East Wing 2), 10:00, Wednesday 9 Sept 2009
Chair: Satoshi Nakamura, NICT, Japan

On the Estimation and the Use of Confusion-Matrices for Improving ASR Accuracy
Omar Caballero Morales, Stephen J. Cox; University of East Anglia, UK
Wed-Ses1-O3-1, Time: 10:00
In previous work, we described how learning the pattern of recognition errors made by an individual using a certain ASR system leads to increased recognition accuracy compared with a standard MLLR adaptation approach. This was the case for low-intelligibility speakers with dysarthric speech, but no improvement was observed for normal speakers. In this paper, we describe an alternative method for obtaining the training data for confusion-matrix estimation for normal speakers which is more effective than our previous technique. We also address the issue of data sparsity in the estimation of confusion-matrices by using non-negative matrix factorization (NMF) to discover structure within them. The confusion-matrix estimates made using these techniques are integrated into the ASR process using a technique termed "metamodels", and the results presented here show statistically significant gains in word recognition accuracy when applied to normal speech.

A Study on Soft Margin Estimation of Linear Regression Parameters for Speaker Adaptation
Shigeki Matsuda 1, Yu Tsao 1, Jinyu Li 2, Satoshi Nakamura 1, Chin-Hui Lee 3; 1 NICT, Japan; 2 Microsoft Corporation, USA; 3 Georgia Institute of Technology, USA
Wed-Ses1-O3-2, Time: 10:00
We formulate a framework for soft margin estimation-based linear regression (SMELR) and apply it to supervised speaker adaptation. Enhanced separation capability and increased discriminative ability are two key properties in margin-based discriminative training. For the adaptation process to be able to flexibly utilize any amount of data, we also propose a novel interpolation scheme to linearly combine the speaker independent (SI) and speaker adaptive SMELR (SMELR/SA) models. The two proposed SMELR algorithms were evaluated on a Japanese large vocabulary continuous speech recognition task. Both the SMELR and interpolated SI+SMELR/SA techniques showed improved speech adaptation performance in comparison with the well-known maximum likelihood linear regression (MLLR) method. We also found that the interpolation framework works even more effectively than SMELR when the amount of adaptation data is relatively small.

Exploring the Role of Spectral Smoothing in Context of Children's Speech Recognition
Shweta Ghai, Rohit Sinha; IIT Guwahati, India
Wed-Ses1-O3-3, Time: 10:00
This work is motivated by our earlier study, which shows that with explicit pitch normalization the children's speech recognition performance on models trained on adults' speech improves as a result of the reduction in pitch-dependent distortions in the spectral envelope. In this paper, we study the role of spectral smoothing in the context of children's speech recognition. The spectral smoothing has been effected in the feature domain by two approaches, viz. modification of the bandwidth of the filters in the filterbank and cepstral truncation. In conjunction, both approaches give a significant improvement in children's speech recognition performance, with a 57% relative improvement over the baseline. Also, when combined with the widely used vocal tract length normalization (VTLN), these spectral smoothing approaches result in an additional 25% relative improvement over the VTLN performance for children's speech recognition on the adults' speech trained models.

Unsupervised Lattice-Based Acoustic Model Adaptation for Speaker-Dependent Conversational Telephone Speech Transcription
K. Thambiratnam, F. Seide; Microsoft Research Asia, China
Wed-Ses1-O3-4, Time: 10:00
This paper examines the application of lattice adaptation techniques to speaker-dependent models for the purpose of conversational telephone speech transcription. Given sufficient training data per speaker, it is feasible to build adapted speaker-dependent models using lattice MLLR and lattice MAP. Experiments on iterative and cascaded adaptation are presented. Additionally, various strategies for thresholding frame posteriors are investigated, and it is shown that accumulating statistics from the local best-confidence path is sufficient to achieve optimal adaptation. Overall, an iterative cascaded lattice system was able to reduce WER by 7.0% abs., which was a 0.8% abs. gain over transcript-based adaptation. Lattice adaptation reduced the unsupervised/supervised adaptation gap from 2.5% to 1.7%.

Rapid Unsupervised Adaptation Using Frame Independent Output Probabilities of Gender and Context Independent Phoneme Models
Satoshi Kobashikawa, Atsunori Ogawa, Yoshikazu Yamaguchi, Satoshi Takahashi; NTT Corporation, Japan
Wed-Ses1-O3-5, Time: 10:20
Business is demanding higher recognition accuracy with no increase in computation time compared to previously adopted baseline speech recognition systems. Accuracy can be improved by adding a gender dependent acoustic model and unsupervised adaptation based on CMLLR (Constrained Maximum Likelihood Linear Regression). CMLLR-based batch-type unsupervised adaptation estimates a single global transformation matrix by utilizing prior unsupervised labeling, which unfortunately increases the computation time. Our proposed technique reduces prior gender selection and labeling time by using frame independent output probabilities of only a gender dependent speech GMM (Gaussian Mixture Model) and a context independent phoneme (monophone) HMM (Hidden Markov Model) in dual-gender acoustic models. The proposed technique further raises accuracy by employing a power term after adaptation. Simulations using spontaneous speech show that the proposed technique reduces computation time by 17.9% and the relative error in correct rate by 13.7% compared to the baseline without prior gender selection and unsupervised adaptation.

Bark-Shift Based Nonlinear Speaker Normalization Using the Second Subglottal Resonance
Shizhen Wang, Yi-Hui Lee, Abeer Alwan; University of California at Los Angeles, USA
Wed-Ses1-O3-6, Time: 11:00
In this paper, we propose a Bark-scale shift based piecewise nonlinear warping function for speaker normalization, and a joint frequency discontinuity and energy attenuation detection algorithm to estimate the second subglottal resonance (Sg2). We then apply Sg2 for rapid speaker normalization. Experimental results on children's speech recognition show that the proposed nonlinear warping function is more effective for speaker normalization than linear frequency warping. Compared to maximum likelihood based grid search methods, Sg2 normalization is more efficient and achieves comparable or better performance, especially for limited normalization data.
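As a rough, generic illustration of the Bark-domain warping that the Wang, Lee and Alwan abstract builds on, the following sketch converts frequencies to the Bark scale with Traunmüller's approximation, applies a constant shift there, and maps back to Hz. The shift value is an arbitrary placeholder; the paper derives a speaker-specific, piecewise warp from the second subglottal resonance, which is not reproduced here.

import numpy as np

def hz_to_bark(f_hz):
    # Traunmueller's approximation of the Bark scale.
    return 26.81 * f_hz / (1960.0 + f_hz) - 0.53

def bark_to_hz(z):
    # Inverse of the approximation above.
    return 1960.0 * (z + 0.53) / (26.28 - z)

def bark_shift_warp(freqs_hz, delta_bark):
    # Shift the frequency axis by a constant amount in the Bark domain
    # and map back to Hz.  delta_bark = 0.5 is purely illustrative; the
    # paper derives the speaker-specific shift from Sg2.
    return bark_to_hz(hz_to_bark(np.asarray(freqs_hz, dtype=float)) + delta_bark)

print(np.round(bark_shift_warp([500.0, 1000.0, 2000.0, 4000.0], 0.5), 1))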
Wed-Ses1-O4 : Voice Transformation I
Holmes (East Wing 3), 10:00, Wednesday 9 Sept 2009
Chair: Yannis Stylianou, FORTH, Greece

Many-to-Many Eigenvoice Conversion with Reference Voice
Yamato Ohtani, Tomoki Toda, Hiroshi Saruwatari, Kiyohiro Shikano; NAIST, Japan
Wed-Ses1-O4-1, Time: 10:00
In this paper, we propose many-to-many voice conversion (VC) techniques to convert an arbitrary source speaker's voice into an arbitrary target speaker's voice. We have previously proposed one-to-many eigenvoice conversion (EVC) and many-to-one EVC. In EVC, an eigenvoice Gaussian mixture model (EV-GMM) is trained in advance using multiple parallel data sets of a reference speaker and many pre-stored speakers. The EV-GMM is flexibly adapted to an arbitrary speaker using a small amount of adaptation data without any linguistic constraints. In this paper, we achieve many-to-many VC by sequentially performing many-to-one EVC and one-to-many EVC through the reference speaker using the same EV-GMM. Experimental results demonstrate the effectiveness of the proposed many-to-many VC.

Alleviating the One-to-Many Mapping Problem in Voice Conversion with Context-Dependent Modeling
Elizabeth Godoy 1, Olivier Rosec 1, Thierry Chonavel 2; 1 Orange Labs, France; 2 Telecom Bretagne, France
Wed-Ses1-O4-2, Time: 10:20
This paper addresses the "one-to-many" mapping problem in Voice Conversion (VC) by exploring source-to-target mappings in GMM-based spectral transformation. Specifically, we examine differences using source-only versus joint source/target information in the classification stage of transformation, effectively illustrating a "one-to-many effect" in the traditional acoustically-based GMM. We propose combating this effect by using phonetic information in the GMM learning and classification. We then show the success of our proposed context-dependent modeling with transformation results using an objective error criterion. Finally, we discuss implications of our work in adapting current approaches to VC.

Efficient Modeling of Temporal Structure of Speech for Applications in Voice Transformation
Binh Phu Nguyen, Masato Akagi; JAIST, Japan
Wed-Ses1-O4-3, Time: 10:40
The aim of voice transformation is to change the style of given utterances. Most voice transformation methods process speech signals in a time-frequency domain. In the time domain, when processing spectral information, conventional methods do not consider relations between neighboring frames. If unexpected modifications happen, there are discontinuities between frames, which lead to degradation of the transformed speech quality. This paper proposes a new modeling of the temporal structure of speech to ensure the smoothness of the transformed speech and thereby improve its quality. In our work, we propose an improvement of the temporal decomposition (TD) technique, which decomposes a speech signal into event targets and event functions, to model the temporal structure of speech. The TD is used to control the spectral dynamics and to ensure the smoothness of transformed speech. We investigate TD in two applications, concatenative speech synthesis and spectral voice conversion. Experimental results confirm the effectiveness of TD in terms of improving the quality of the transformed speech.

Cross-Language Voice Conversion Based on Eigenvoices
Malorie Charlier 1, Yamato Ohtani 1, Tomoki Toda 1, Alexis Moinet 2, Thierry Dutoit 2; 1 NAIST, Japan; 2 Faculté Polytechnique de Mons, Belgium
Wed-Ses1-O4-4, Time: 11:00
This paper presents a novel cross-language voice conversion (VC) method based on eigenvoice conversion (EVC). Cross-language VC is a technique for converting voice quality between two speakers who utter different languages. In general, parallel data consisting of utterance pairs of those two speakers are not available. To deal with this problem, we apply EVC to cross-language VC. First, we train an eigenvoice GMM (EV-GMM) using many parallel data sets from a source speaker and many pre-stored other speakers who can utter the same language as the source speaker. Then, the conversion model between the source speaker and a target speaker who cannot utter the source speaker's language is developed by adapting the EV-GMM using a few arbitrary sentences uttered by the target speaker in a different language. The experimental results demonstrate that the proposed method yields significant performance improvements in both speech quality and conversion accuracy for speaker individuality compared with a conventional cross-language VC method based on frame selection.

Voice Conversion Using K-Histograms and Frame Selection
Alejandro José Uriz 1, Pablo Daniel Agüero 1, Antonio Bonafonte 2, Juan Carlos Tulli 1; 1 Universidad Nacional de Mar del Plata, Argentina; 2 Universitat Politècnica de Catalunya, Spain
Wed-Ses1-O4-5, Time: 11:20
The goal of voice conversion systems is to modify the voice of a source speaker so that it is perceived as if it had been uttered by another specific speaker. Many approaches in the literature are based on statistical models and introduce over-smoothing in the target features. Our proposal is a new model that combines several techniques used in unit selection for text-to-speech with a non-Gaussian mathematical transformation model. Subjective results support the proposed approach.

Online Model Adaptation for Voice Conversion Using Model-Based Speech Synthesis Techniques
Dalei Wu 1, Baojie Li 1, Hui Jiang 1, Qian-Jie Fu 2; 1 York University, Canada; 2 House Ear Institute, USA
Wed-Ses1-O4-6, Time: 11:40
In this paper, we present a novel voice conversion method using model-based speech synthesis that can be used in applications where prior knowledge or training data is not available from the source speaker. In the proposed method, training data from a target speaker is used to build a GMM-based speech model, and voice conversion is then performed for each utterance from the source speaker according to the pre-trained target speaker model. To reduce the mismatch between source and target speakers, online model adaptation is proposed to improve model selection accuracy, based on maximum likelihood linear regression (MLLR). Objective and subjective evaluations suggest that the proposed methods are quite effective in generating acceptable voice quality for voice conversion even without training data from source speakers.
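Several of the abstracts in this session refer to the classical joint-GMM spectral mapping as a baseline. The sketch below is a minimal, generic version of that baseline (conditional-mean conversion under a GMM trained on stacked source/target features) with synthetic data and a toy feature dimension; it is not the EVC, context-dependent or k-histogram systems described above.

import numpy as np
from scipy.stats import multivariate_normal
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
D = 4                                               # toy spectral-feature dimension
X = rng.standard_normal((500, D))                   # source-speaker features
Y = 0.8 * X + 0.3 * rng.standard_normal((500, D))   # time-aligned target features

# Train one GMM on the stacked (joint) source/target vectors.
gmm = GaussianMixture(n_components=4, covariance_type="full", random_state=0)
gmm.fit(np.hstack([X, Y]))

def convert(x):
    # Conditional-mean mapping E[y | x] under the joint GMM.
    mu_x, mu_y = gmm.means_[:, :D], gmm.means_[:, D:]
    # Component posteriors computed from the source marginal of each component.
    p = np.array([w * multivariate_normal(mu_x[m], gmm.covariances_[m, :D, :D]).pdf(x)
                  for m, w in enumerate(gmm.weights_)])
    p /= p.sum()
    y_hat = np.zeros(D)
    for m in range(gmm.n_components):
        cov_xx = gmm.covariances_[m, :D, :D]
        cov_yx = gmm.covariances_[m, D:, :D]
        y_hat += p[m] * (mu_y[m] + cov_yx @ np.linalg.solve(cov_xx, x - mu_x[m]))
    return y_hat

print(np.round(convert(X[0]), 3))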
Wed-Ses1-P1 : Phonetics, Phonology, Cross-Language Comparisons, Pathology
Hewison Hall, 10:00, Wednesday 9 Sept 2009
Chair: Valerie Hazan, University College London, UK

Fast Transcription of Unstructured Audio Recordings
Brandon C. Roy, Deb Roy; MIT, USA
Wed-Ses1-P1-1, Time: 10:00
We introduce a new method for human-machine collaborative speech transcription that is significantly faster than existing transcription methods. In this approach, automatic audio processing algorithms are used to robustly detect speech in audio recordings and split speech into short, easy to transcribe segments. Sequences of speech segments are loaded into a transcription interface that enables a human transcriber to simply listen and type, obviating the need for manually finding and segmenting speech or explicitly controlling audio playback. As a result, playback stays synchronized to the transcriber's speed of transcription. In evaluations using naturalistic audio recordings made in everyday home situations, the new method is up to 6 times faster than other popular transcription tools while preserving transcription quality.

Finding Allophones: An Evaluation on Consonants in the TIMIT Corpus
Timothy Kempton, Roger K. Moore; University of Sheffield, UK
Wed-Ses1-P1-2, Time: 10:00
Phonemic analysis, the process of identifying the contrastive sounds in a language, involves finding allophones: phonetic variants of those contrastive sounds. An algorithm for finding allophones (developed by Peperkamp et al.) is evaluated on consonants in the TIMIT acoustic phonetic transcripts. A novel phonetic filter based on the active articulator is introduced and has a higher recall than previous filters. The combined retrieval performance, measured by area under the ROC curve, is 83%. The system implemented can process any language transcribed in IPA and is currently being used to assist the phonemic analysis of unwritten languages.

An Evaluation of Formant Tracking Methods on an Arabic Database
Imen Jemaa 1, Oussama Rekhis 1, Kaïs Ouni 1, Yves Laprie 2; 1 Ecole Nationale d'Ingénieurs de Tunis, Tunisia; 2 LORIA, France
Wed-Ses1-P1-3, Time: 10:00
In this paper we present a formant database of Arabic used to evaluate our new automatic formant tracking algorithm based on Fourier ridge detection. In this method we have introduced a continuity constraint based on the computation of centres of gravity for a set of formant candidates. This connects a frame of speech to its neighbours and thus improves the robustness of tracking. The formant trajectories obtained by the proposed algorithm are compared to those of the hand-edited formant database and those given by Praat with LPC data.

Investigating Phonetic Information Reduction and Lexical Confusability
William Hartmann, Eric Fosler-Lussier; Ohio State University, USA
Wed-Ses1-P1-4, Time: 10:00
In the presence of pronunciation variation and the masking effects of additive noise, we investigate the role of phonetic information reduction and lexical confusability on ASR performance. Contrary to previous work [1], we show that place of articulation as a representation for unstressed segments performs at least as well as manner of articulation in the presence of additive noise. Methods of phonetic reduction introduce lexical confusability, which negatively impacts performance. By limiting this confusability, recognizers that employ high levels of phonetic reduction (40.1%) can perform as well as a baseline system in the presence of nonstationary noise.

Improving Phone Recognition Performance via Phonetically-Motivated Units
Hyejin Hong, Minhwa Chung; Seoul National University, Korea
Wed-Ses1-P1-5, Time: 10:00
This paper examines how phonetically-motivated units affect the performance of phone recognition systems. Focusing on the realization of /h/, which is one of the phones that most frequently causes errors in Korean phone recognition, three different phone sets are designed by considering optional phonetic constraints which show complementary distributions. Experimental results show that one of the proposed sets, the h-deletion set, improves phone recognition performance compared to the baseline phone recognizer. It is noteworthy that this set needs no additional phonetic unit, which means that no additional HMM needs to be modeled; accordingly, it has an advantage in terms of model size. It also obtains competitive performance compared to the baseline system in terms of word recognition. Thus, this phonetically-motivated approach to improving phone recognition performance is expected to be useful in embedded solutions which require a fast and light recognition process.

Automatic Formant Extraction for Sociolinguistic Analysis of Large Corpora
Keelan Evanini, Stephen Isard, Mark Liberman; University of Pennsylvania, USA
Wed-Ses1-P1-6, Time: 10:00
In this paper, we propose a method of formant prediction from pole and bandwidth data, and apply this method to automatically extract F1 and F2 values from a corpus of regional dialect variation in North America that contains 134,000 manual formant measurements. These predicted formants are shown to increase performance over the default formant values from a popular speech analysis package. Finally, we demonstrate that sociolinguistic analysis based on vowel formant data can be conducted reliably using the automatically predicted values, and we argue that sociolinguists should begin to use this methodology in order to be able to analyze larger amounts of data efficiently.

Comparison of Manual and Automated Estimates of Subglottal Resonances
Wolfgang Wokurek, Andreas Madsack; Universität Stuttgart, Germany
Wed-Ses1-P1-7, Time: 10:00
This study compares manual measurements of the first two subglottal resonances to the results of an automated measurement procedure for the same quantities. We also briefly sketch the sensor prototype that is used for the measurements. The subglottal resonances are presented in the space spanned by the vowels' first two formants. A three-axis acceleration sensor is gently pressed at the neck of the speaker. In front of the ligamentum conicum, located near the lower end of the larynx, pressure signals may be recorded that follow the subglottal pressure changes at least up to 2 kHz bandwidth. The recordings of the subglottal pressure signals are made simultaneously with recordings of the electroglottogram and the acoustic speech sound from 12 male and 12 female speakers.

Using Durational Cues in a Computational Model of Spoken-Word Recognition
Odette Scharenborg; Radboud Universiteit Nijmegen, The Netherlands
Wed-Ses1-P1-8, Time: 10:00
Evidence that listeners use durational cues to help resolve temporarily ambiguous speech input has accumulated over the past few years. In this paper, we investigate whether durational cues are also beneficial for word recognition in a computational model of spoken-word recognition. Two sets of simulations were carried out using the acoustic signal as input. The simulations showed that the computational model, like humans, benefits from durational cues during word recognition, and uses these to disambiguate the speech signal. These results thus provide support for the theory that durational cues play a role in spoken-word recognition.
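For readers who want to reproduce simple formant measurements of the kind discussed in the formant-tracking entries above (Wed-Ses1-P1-3 and P1-6), the following is a textbook LPC sketch: estimate LPC coefficients with the autocorrelation method and read formant candidates off the pole angles. It is not the Fourier-ridge tracker or the pole/bandwidth prediction method of those papers; the sampling rate, analysis order and the synthetic test frame are assumptions.

import numpy as np
from scipy.signal import lfilter

def lpc(frame, order):
    # Autocorrelation-method LPC via the Levinson-Durbin recursion.
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k = -acc / err
        a_prev = a.copy()
        a[1:i] = a_prev[1:i] + k * a_prev[i - 1:0:-1]
        a[i] = k
        err *= 1.0 - k * k
    return a

def formant_candidates(frame, sr, order=8):
    a = lpc(frame * np.hamming(len(frame)), order)
    roots = np.roots(a)
    roots = roots[np.imag(roots) > 0]              # one pole per conjugate pair
    freqs = np.angle(roots) * sr / (2 * np.pi)     # pole angle -> frequency (Hz)
    bands = -sr / np.pi * np.log(np.abs(roots))    # pole radius -> bandwidth (Hz)
    keep = (freqs > 90) & (bands < 400)            # crude heuristics, illustrative
    return np.sort(freqs[keep])

# Synthetic test frame with resonances near 700 Hz and 1200 Hz.
sr = 16000
poles = []
for f, bw in [(700.0, 80.0), (1200.0, 100.0)]:
    r = np.exp(-np.pi * bw / sr)
    poles += [r * np.exp(2j * np.pi * f / sr), r * np.exp(-2j * np.pi * f / sr)]
frame = lfilter([1.0], np.real(np.poly(poles)),
                np.random.default_rng(0).standard_normal(1024))
print(np.round(formant_candidates(frame, sr), 1))   # roughly [700, 1200]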
Second Language Discrimination Vowel Contrasts by Adult Speakers with a Five Vowel System
Bianca Sisinni, Mirko Grimaldi; Università del Salento, Italy
Wed-Ses1-P1-9, Time: 10:00
This study tests the ability of a group of Salento Italian undergraduate students who have been exposed to the L2 in a scholastic context to perceive British English second language (L2) vowel phonemes. The aim is to verify whether the Perceptual Assimilation Model (PAM) can be applied to them. In order to test their ability to perceive L2 phonemes, subjects performed an identification test and an oddity discrimination test. The results indicated that the L2 discrimination processes are in line with those predicted by the PAM, supporting the idea that students with a formal L2 background are still naïve listeners to the L2.

Three-Way Laryngeal Categorization of Japanese, French, English and Chinese Plosives by Korean Speakers
Tomohiko Ooigawa, Shigeko Shinohara; Sophia University, Japan
Wed-Ses1-P1-10, Time: 10:00
Korean has a three-way laryngeal contrast in oral stops. This paper reports perception patterns of plosives of Japanese, French, English and Chinese by Korean speakers. In Korean loanwords, laryngeal contrasts of Japanese, French, and English plosives show distinct patterns. To test whether perception explains the loanword patterns, we selected languages with different acoustic properties and carried out perception tests. Our results reveal discrepancies between the phonological adaptation and the acoustic perception patterns.

The Effect of F0 Peak-Delay on the L1 / L2 Perception of English Lexical Stress
Shinichi Tokuma 1, Yi Xu 2; 1 Chuo University, Japan; 2 University College London, UK
Wed-Ses1-P1-11, Time: 10:00
This study investigated the perceptual effect of F0 peak-delay on the L1 / L2 perception of English lexical stress. A bisyllabic English non-word 'nini' /nInI/ whose F0 was set to reach its peak in the second syllable was embedded in a frame sentence and used as the stimulus of the perceptual experiment. Native English and Japanese speakers were asked to determine lexical stress locations in the experiment. The results showed that in the perception of English lexical stress, delayed F0 peaks which were aligned with the second syllable of the stimulus words perceptually affected the Japanese and English groups in the same manner: both groups perceived the delayed F0 peaks as a cue to lexical stress in the first syllable when the peaks were aligned with, or occurred before, the end of /n/ in the second syllable. A supplementary experiment conducted on Japanese speakers confirmed the location of the categorical boundary. These findings are supported by the data provided by previous studies on L1 acoustic analysis and on L1 / L2 perception of intonation.

Lexical Tone Production by Cantonese Speakers with Parkinson's Disease
Joan Ka-Yin Ma; Technische Universität Dresden, Germany
Wed-Ses1-P1-12, Time: 10:00
The aim of this study was to investigate lexical tone production by Cantonese speakers with Parkinson's disease (PD speakers). The effect of intonation on the production of lexical tone was also examined. Speech data were collected from five Cantonese PD speakers. Speech materials consisted of targets contrasting in tone, embedded in different sentence contexts (initial, medial and final) and intonations (statements and questions). Analysis of the normalized F0 patterns showed that the PD speakers contrasted the six lexical tones in a similar manner to control speakers across positions and intonations, except at the final position of questions. Significantly lower F0 values were found at the 75% and 100% time points of the final syllable of questions for the PD speakers than for the control speakers, indicating that intonation has a smaller influence on the F0 patterns of lexical tones for PD speakers than for control speakers. The results of this study support the previous claim of differential control for intonation and tone.

Acoustic Cues of Palatalisation in Plosive + Lateral Onset Clusters
Daniela Müller 1, Sidney Martin Mota 2; 1 CLLE-ERSS, France; 2 Escola Oficial d'Idiomes de Tarragona, Spain
Wed-Ses1-P1-13, Time: 10:00
Palatalisation of /l/ in obstruent + lateral onset clusters in the absence of a following palatal sound has received a considerable amount of attention from historical linguistics. The phonetics of its development, however, remains less well investigated. This paper aims at studying the acoustic cues that could have led plosive + lateral onset clusters to develop palatalisation. It is found that onset clusters with velar plosives favour palatalisation more than labial + lateral clusters, and that a high degree of darkness diminishes the likelihood of palatalisation taking place.
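The tone and F0-alignment studies above (Wed-Ses1-P1-11 and P1-12) compare F0 contours across speakers and positions. A generic preprocessing step for such comparisons, sketched below with synthetic contours, is to convert to semitones, time-normalise to a fixed number of points and z-score per speaker; this is only an illustration, not the measurement procedure of either paper.

import numpy as np

def normalise_contour(f0_hz, n_points=20):
    # Time-normalise an F0 contour to a fixed number of points and z-score it
    # on a semitone scale, so that contours from speakers with different
    # pitch ranges become comparable.
    f0 = np.asarray(f0_hz, dtype=float)
    f0 = f0[f0 > 0]                                  # drop unvoiced frames
    st = 12.0 * np.log2(f0 / 100.0)                  # Hz -> semitones re 100 Hz
    st = np.interp(np.linspace(0, 1, n_points), np.linspace(0, 1, len(st)), st)
    return (st - st.mean()) / st.std()

# Two synthetic rising contours produced in different pitch ranges.
a = normalise_contour(180 + 40 * np.linspace(0, 1, 35))
b = normalise_contour(120 + 25 * np.linspace(0, 1, 50))
print("RMS distance after normalisation:",
      round(float(np.sqrt(np.mean((a - b) ** 2))), 3))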
Wed-Ses1-P2 : Prosody Perception and Language Acquisition
Hewison Hall, 10:00, Wednesday 9 Sept 2009
Chair: David House, KTH, Sweden

Perception of English Compound vs. Phrasal Stress: Natural vs. Synthetic Speech
Irene Vogel, Arild Hestvik, H. Timothy Bunnell, Laura Spinu; University of Delaware, USA
Wed-Ses1-P2-1, Time: 10:00
The ability of listeners to distinguish between compound and phrasal stress in English was examined on the basis of a picture selection task. The responses to naturally and synthetically produced stimuli were compared. While greater overall accuracy was observed with the natural stimuli, the same pattern of greater accuracy with compound stress than with phrasal stress was observed with both types of stimuli.

New Method for Delexicalization and its Application to Prosodic Tagging for Text-to-Speech Synthesis
Martti Vainio 1, Antti Suni 1, Tuomo Raitio 2, Jani Nurminen 3, Juhani Järvikivi 4, Paavo Alku 2; 1 University of Helsinki, Finland; 2 Helsinki University of Technology, Finland; 3 Nokia Devices R&D, Finland; 4 Max Planck Institute for Psycholinguistics, The Netherlands
Wed-Ses1-P2-2, Time: 10:00
This paper describes a new flexible delexicalization method based on a glottal excited parametric speech synthesis scheme. The system utilizes inverse filtered glottal flow and all-pole modelling of the vocal tract. The method provides a possibility to retain and manipulate all relevant prosodic features of any kind of speech. Most importantly, the features include voice quality, which has not been properly modeled in earlier delexicalization methods. The functionality of the new method was tested in a prosodic tagging experiment aimed at providing word prominence data for a text-to-speech synthesis system. The experiment confirmed the usefulness of the method and further corroborated earlier evidence that linguistic factors influence the perception of prosodic prominence.

Speech Rate and Pauses in Non-Native Finnish
Minnaleena Toivola, Mietta Lennes, Eija Aho; University of Helsinki, Finland
Wed-Ses1-P2-3, Time: 10:00
In this study, the temporal aspects of speech are compared in read-aloud Finnish produced by six native and 16 non-native speakers. It is shown that the speech and articulation rates as well as pause durations are different for native and non-native speakers. Moreover, differences exist between the groups of speakers representing four different non-native languages. Surprisingly, the native Finnish speakers tend to make longer pauses than the non-natives. The results are relevant when developing methods for assessing fluency or the strength of foreign accent.

Modelling Similarity Perception of Intonation
Uwe D. Reichel, Felicitas Kleber, Raphael Winkelmann; Technische Universität München, Germany
Wed-Ses1-P2-4, Time: 10:00
In this study a perception experiment was carried out to examine the perceived similarity of intonation contours. Amongst other results we found that the subjects are capable of producing consistent similarity judgements. On the basis of this data we studied the influence of several physical distance measures on the human similarity judgements by grouping these measures into principal components and by comparing the weights of these components in a linear regression model predicting human perception. Non-correlation based distance measures for f0 contours received the highest relative weight. Finally, we developed applicable linear regression and neural feed-forward network models predicting similarity perception of intonation on the basis of physical contour distances. The performance of the neural networks, measured in terms of mean absolute error, did not differ significantly from the human performance derived from judgement consistency.

Studying L2 Suprasegmental Features in Asian Englishes: A Position Paper
Helen Meng 1, Chiu-yu Tseng 2, Mariko Kondo 3, Alissa Harrison 1, Tanya Viscelgia 4; 1 Chinese University of Hong Kong, China; 2 Academia Sinica, Taiwan; 3 Waseda University, Japan; 4 Ming Chuan University, Taiwan
Wed-Ses1-P2-5, Time: 10:00
This position paper highlights the importance of suprasegmental training in second language (L2) acquisition. Suprasegmental features are manifested in terms of acoustic cues and convey important information about linguistic and information structures. Hence, L2 learners must harness appropriate suprasegmental productions for effective communication. However, this learning process is influenced by well-established perceptions of sounds and articulatory motions in the primary language (L1). We propose to design and collect a corpus to support systematic analysis of L2 suprasegmental features. We lay out a set of carefully selected textual environments that illustrate how suprasegmental features convey information including part-of-speech, syntax, focus, speech acts and semantics. We intend to use these textual environments for collecting speech data in a variety of Asian Englishes from non-native English speakers. Analyses of such corpora should lead to research findings that have important implications for language education, as well as speech technology development for computer-aided language learning (CALL) applications.

Classification of Disfluent Phenomena as Fluent Communicative Devices in Specific Prosodic Contexts
Helena Moniz 1, Isabel Trancoso 2, Ana Isabel Mata 1; 1 FLUL/CLUL, Portugal; 2 INESC-ID Lisboa/IST, Portugal
Wed-Ses1-P2-6, Time: 10:00
This work explores prosodic cues of disfluent phenomena. In our previous work, we conducted a perceptual experiment regarding (dis)fluency ratings. Results suggested that some disfluencies may be considered felicitous by listeners, namely filled pauses and prolongations. In an attempt to discriminate which linguistic features are more salient in the classification of disfluencies as either fluent or disfluent phenomena, we used CART techniques on a corpus of 3.5 hours of spontaneous and prepared non-scripted speech. CART results pointed out 2 splits: break indices and contour shape. The first split indicates that events uttered at breaks 3 and 4 are considered felicitous. The second shows that these events must have flat or ascending contours to be considered as such; otherwise they are strongly penalized. Our preliminary results suggest that there are regular trends in the production of these events, namely, prosodic phrasing and contour shape.

Cross-Cultural Perception of Discourse Phenomena
Rolf Carlson 1, Julia Hirschberg 2; 1 KTH, Sweden; 2 Columbia University, USA
Wed-Ses1-P2-7, Time: 10:00
We discuss perception studies of two low-level indicators of discourse phenomena by Swedish, Japanese, and Chinese native speakers. Subjects were asked to identify upcoming prosodic boundaries and disfluencies in Swedish spontaneous speech. We hypothesize that speakers of prosodically unrelated languages should be less able to predict upcoming phrase boundaries but potentially better able to identify disfluencies, since indicators of disfluency are more likely to depend upon lexical, as well as acoustic, information. Surprisingly, however, we found that both phenomena were fairly well recognized by native and non-native speakers, with some possible interference from word tones for the Chinese subjects.

Modelling Vocabulary Growth from Birth to Young Adulthood
Roger K. Moore 1, L. ten Bosch 2; 1 University of Sheffield, UK; 2 Radboud Universiteit Nijmegen, The Netherlands
Wed-Ses1-P2-8, Time: 10:00
There has been considerable debate over the existence of the 'vocabulary spurt' phenomenon — an apparent acceleration in word learning that is commonly said to occur in children around the age of 18 months. This paper presents an investigation into modelling the phenomenon using data from almost 1800 children. The results indicate that the acquisition of a receptive/productive lexicon can be quite adequately modelled as a single growth function with an ecologically well founded and cognitively plausible interpretation. Hence it is concluded that there is little evidence for the vocabulary spurt phenomenon as a separable aspect of language acquisition.

Adaptive Non-Negative Matrix Factorization in a Computational Model of Language Acquisition
Joris Driesen 1, L. ten Bosch 2, Hugo Van hamme 1; 1 Katholieke Universiteit Leuven, Belgium; 2 Radboud Universiteit Nijmegen, The Netherlands
Wed-Ses1-P2-9, Time: 10:00
During the early stages of language acquisition, young infants face the task of learning a basic vocabulary without the aid of prior linguistic knowledge. It is believed that long-term episodic memory plays an important role in this process. Experiments have shown that infants retain large amounts of very detailed episodic information about the speech they perceive (e.g. [1]). This provides some justification for the fact that some algorithms attempting to model the process of vocabulary acquisition computationally process large amounts of speech data in batch. Non-negative Matrix Factorization (NMF), a technique that is particularly successful in data mining but can also be applied to vocabulary acquisition (e.g. [2]), is such an algorithm. In this paper, we integrate an adaptive variant of NMF into a computational framework for vocabulary acquisition, foregoing the need for long-term storage of speech inputs, and experimentally show that its accuracy matches that of the original batch algorithm.

Classifying Clear and Conversational Speech Based on Acoustic Features
Akiko Amano-Kusumoto, John-Paul Hosom, Izhak Shafran; Oregon Health & Science University, USA
Wed-Ses1-P2-10, Time: 10:00
This paper reports an investigation of features relevant for classifying two speaking styles, namely, conversational speaking style and clear (e.g. hyper-articulated) speaking style. Spectral and prosodic features were automatically extracted from speech and classified using decision tree classifiers and multilayer perceptrons to achieve accuracies of about 71% and 77% respectively. More interestingly, we found that out of the 56 features only about 9 features are needed to capture most of the predictive power. While perceptual studies have shown that spectral cues are more useful than prosodic features for intelligibility [1], here we find prosodic features are more important for classification.

The Acoustic Characteristics of Russian Vowels in Children of 6 and 7 Years of Age
Elena E. Lyakso, Olga V. Frolova, Aleks S. Grigoriev; St. Petersburg State University, Russia
Wed-Ses1-P2-11, Time: 10:00
The purpose of this investigation is to examine how the acoustic features of vowels in child speech approach the corresponding values in normal Russian adult speech. The formant structure, pitch and duration of the vowels were examined. The influence of word stress and palatal context on the formant structure of the vowels was taken into account. It was shown that word stress is formed by 6–7 years of age on the basis of the features typical for the Russian language. The formant structure of the Russian vowels /u/ and /i/ is not yet formed by the age of 7 years. Native speakers recognize the meaning of 57–93% of the words in the speech of 6- and 7-year-old children.

Japanese Children's Acquisition of Prosodic Politeness Expressions
Takaaki Shochi 1, Donna Erickson 2, Kaoru Sekiyama 1, Albert Rilliard 3, Véronique Aubergé 4; 1 Kumamoto University, Japan; 2 Showa Music University, Japan; 3 LIMSI, France; 4 GIPSA, France
Wed-Ses1-P2-12, Time: 10:00
This paper presents a perception experiment to measure the ability of Japanese children in the fourth and fifth grades of elementary school to recognize culturally encoded expressions of politeness and impoliteness in their native language. Audio-visual stimuli were presented to listeners, who rated the degree of politeness and a possible situation where such an expression could be used. The analysis of results focuses on the differences and similarities between adult listeners and children, for each attitude and modality. Facial information seems to be exploited earlier than audio information, and expressions of different degrees of Japanese politeness, including expressions of kyoshuku, are still not understood around 10 years of age.
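Non-negative matrix factorization appears twice in this programme, for confusion-matrix smoothing (Wed-Ses1-O3-1) and for vocabulary acquisition (Wed-Ses1-P2-9). The sketch below is plain batch NMF with Lee-Seung multiplicative updates on arbitrary synthetic data; the adaptive variant proposed by Driesen et al. is not reproduced here.

import numpy as np

def nmf(V, rank, n_iter=200, eps=1e-9):
    # Plain batch NMF with Lee-Seung multiplicative updates,
    # minimising squared Euclidean error: V (non-negative) ~ W @ H.
    rng = np.random.default_rng(0)
    W = rng.random((V.shape[0], rank)) + eps
    H = rng.random((rank, V.shape[1])) + eps
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Arbitrary non-negative data, e.g. co-occurrence counts of acoustic events.
V = np.random.default_rng(1).random((30, 20))
W, H = nmf(V, rank=5)
print("reconstruction error:", round(float(np.linalg.norm(V - W @ H)), 3))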
Perceptual Training of Singleton and Geminate Stops in Japanese Language by Korean Learners
Mee Sonu 1, Keiichi Tajima 2, Hiroaki Kato 3, Yoshinori Sagisaka 1; 1 Waseda University, Japan; 2 Hosei University, Japan; 3 NICT, Japan
Wed-Ses1-P2-13, Time: 10:00
We aim to build an effective perceptual training paradigm toward a computer-assisted language learning (CALL) system for second language learning. This study investigated the effectiveness of perceptual training for Korean-speaking learners of Japanese on the distinction between geminate and singleton stops in Japanese. The training consisted of identification of geminate and singleton stops with feedback. We investigated whether training improves the learners' identification of geminate and singleton stops in Japanese. Moreover, we examined how perceptual training is affected by factors that influence speaking rate. Results were as follows. Participants who underwent perceptual training improved overall performance to a greater extent than untrained control participants. However, there was no significant difference between the group that was trained with three speaking rates and the group that was trained with the normal rate only.

Wed-Ses1-P3 : Statistical Parametric Synthesis II
Hewison Hall, 10:00, Wednesday 9 Sept 2009
Chair: Simon King, University of Edinburgh, UK

A Bayesian Approach to Hidden Semi-Markov Model Based Speech Synthesis
Kei Hashimoto, Yoshihiko Nankaku, Keiichi Tokuda; Nagoya Institute of Technology, Japan
Wed-Ses1-P3-1, Time: 10:00
This paper proposes a Bayesian approach to hidden semi-Markov model (HSMM) based speech synthesis. Recently, hidden Markov model (HMM) based speech synthesis based on the Bayesian approach was proposed. The Bayesian approach is a statistical technique for estimating reliable predictive distributions by treating model parameters as random variables. In the Bayesian approach, all processes for constructing the system are derived from one single predictive distribution which exactly represents the problem of speech synthesis. However, there is an inconsistency between training and synthesis: although the speech is synthesized from HMMs with explicit state duration probability distributions, the HMMs are trained without them. In this paper, we introduce an HSMM, which is an HMM with explicit state duration probability distributions, into the HMM-based Bayesian speech synthesis system. Experimental results show that the use of HSMMs improves the naturalness of the synthesized speech.
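The inconsistency that Hashimoto et al. address concerns state durations: a standard HMM implies a geometric duration distribution through its self-transition probability, whereas an HSMM attaches an explicit duration distribution to each state. The sketch below only contrasts samples from the two duration models; the transition probability and the Gaussian duration model are arbitrary choices, not values from the paper.

import numpy as np

rng = np.random.default_rng(0)
a_ii = 0.9          # self-transition probability of one HMM state (arbitrary)
n = 10000

# Standard HMM: staying d frames in a state has probability
# a_ii**(d - 1) * (1 - a_ii), i.e. durations are implicitly geometric.
hmm_dur = rng.geometric(1.0 - a_ii, size=n)

# HSMM: the same state carries an explicit duration distribution; a Gaussian
# rounded to whole frames is used here purely for illustration.
hsmm_dur = np.maximum(1, np.round(rng.normal(10.0, 2.0, size=n)))

for name, d in [("HMM (geometric)", hmm_dur), ("HSMM (explicit)", hsmm_dur)]:
    print(f"{name:16s} mean={d.mean():5.2f}  std={d.std():5.2f}  P(d=1)={np.mean(d == 1):.2f}")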
Rich Context Modeling for High Quality HMM-Based TTS
Zhi-Jie Yan, Yao Qian, Frank K. Soong; Microsoft Research Asia, China
Wed-Ses1-P3-2, Time: 10:00
This paper presents a rich context modeling approach to high quality HMM-based speech synthesis. We first analyze the over-smoothing problem in conventional decision-tree-tied HMMs, and then propose to model the training speech tokens with rich context models. A special training procedure is adopted for reliable estimation of the rich context model parameters. In synthesis, a search algorithm following a context-based pre-selection is performed to determine the optimal rich context model sequence, which generates natural and crisp output speech. Experimental results show that the spectral envelopes synthesized by the rich context models have crisper formant structures and evolve with richer details than those obtained by the conventional models. The speech quality improvement is also perceived by listeners in a subjective preference test, in which 76% of the sentences synthesized using rich context modeling are preferred.

Tying Covariance Matrices to Reduce the Footprint of HMM-Based Speech Synthesis Systems
Keiichiro Oura, Heiga Zen, Yoshihiko Nankaku, Akinobu Lee, Keiichi Tokuda; Nagoya Institute of Technology, Japan
Wed-Ses1-P3-3, Time: 10:00
This paper proposes a technique for reducing the footprint of HMM-based speech synthesis systems by tying all covariance matrices. HMM-based speech synthesis systems usually have a smaller footprint than unit-selection synthesis systems because statistics rather than speech waveforms are stored. However, further reduction is essential to put them on embedded devices, which have very small memory. Based on the empirical knowledge that covariance matrices have a smaller impact on the quality of synthesized speech than mean vectors, we propose a clustering technique for mean vectors while tying all covariance matrices. Subjective listening test results show that the proposed technique can shrink the footprint of an HMM-based speech synthesis system while retaining the quality of the synthesized speech.

The HMM Synthesis Algorithm of an Embedded Unified Speech Recognizer and Synthesizer
Guntram Strecha 1, Matthias Wolff 1, Frank Duckhorn 1, Sören Wittenberg 1, Constanze Tschöpe 2; 1 Technische Universität Dresden, Germany; 2 Fraunhofer IZFP, Germany
Wed-Ses1-P3-4, Time: 10:00
In this paper we present an embedded unified speech recognizer and synthesizer using identical, speaker independent Hidden Markov Models. The system was prototypically realized on a signal processor extended by a field programmable gate array. In the first section we give a brief overview of the system. The main part of the paper deals with a specially designed unit-based HMM synthesis algorithm. In the last section we report the results of an informal listening evaluation of the speech synthesizer.

Syllable HMM Based Mandarin TTS and Comparison with Concatenative TTS
Zhiwei Shuang 1, Shiyin Kang 2, Qin Shi 1, Yong Qin 1, Lianhong Cai 2; 1 IBM China Research Lab, China; 2 Tsinghua University, China
Wed-Ses1-P3-5, Time: 10:00
This paper introduces a syllable HMM based Mandarin TTS system. 10-state left-to-right HMMs are used to model each syllable. We leverage the corpus and the front end of a concatenative TTS system to build the syllable HMM based TTS system. Furthermore, we utilize the unique consonant/vowel structure of Mandarin syllables to improve the voiced/unvoiced decision of HMM states. Evaluation results show that the syllable HMM based Mandarin TTS system with a 5.3 MB model size can achieve an overall quality close to that of a concatenative TTS system with a 1 GB data size.
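The footprint-reduction idea in the Oura et al. abstract, clustering mean vectors while tying all covariance matrices, can be sketched with synthetic statistics as below; the state count, dimensionality and codebook size are arbitrary, and plain k-means stands in for whatever clustering procedure the authors actually use.

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
n_states, dim = 2000, 25                        # arbitrary stand-ins
means = rng.standard_normal((n_states, dim))    # per-state Gaussian means
covs = 1.0 + 0.1 * rng.random((n_states, dim))  # per-state diagonal covariances

# Tie all covariances to one shared diagonal covariance ...
shared_cov = covs.mean(axis=0)

# ... and shrink the mean pool by clustering the mean vectors.
n_codewords = 256
km = KMeans(n_clusters=n_codewords, n_init=4, random_state=0).fit(means)

full = n_states * dim * 2                       # floats stored originally
tied = n_codewords * dim + dim + n_states       # codebook + shared cov + indices
print(f"stored parameters: {full} -> {tied} ({100 * tied / full:.1f}%)")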
Pulse Density Representation of Spectrum for Statistical Speech Processing
Yoshinori Shiga; NICT, Japan
Wed-Ses1-P3-6, Time: 10:00
This study investigates a new spectral representation that is suitable for statistical parametric speech synthesis. Statistical speech processing involves spectral averaging in the training process; however, averaging spectra in the domain of conventional speech parameters over-smooths the resulting means, which degrades the quality of the synthesised speech. In the proposed representation, high-energy parts of the spectrum, such as sections of dominant formants, are represented by a group of high-density pulses in the frequency domain. These pulses' locations (i.e., frequencies) are then parameterised. The representation is theoretically capable of averaging spectra with less over-smoothing effect. The experimental results provide the optimal values of the factors necessary for encoding and decoding the proposed representation, towards future applications in speech synthesis.

Parameterization of Vocal Fry in HMM-Based Speech Synthesis
Hanna Silén 1, Elina Helander 1, Jani Nurminen 2, Moncef Gabbouj 1; 1 Tampere University of Technology, Finland; 2 Nokia Devices R&D, Finland
Wed-Ses1-P3-7, Time: 10:00
HMM-based speech synthesis offers a way to generate speech with different voice qualities. However, sometimes databases contain certain inherent voice qualities that need to be parametrized properly. One example of this is vocal fry, typically occurring at the end of utterances. A popular mixed excitation vocoder for HMM-based speech synthesis is STRAIGHT. The standard STRAIGHT is optimized for modal voices and may not produce high quality with other voice types. Fortunately, due to the flexibility of STRAIGHT, different F0 and aperiodicity measures can be used in the synthesis without any inherent degradations in speech quality. We have replaced the STRAIGHT excitation with a representation based on a robust F0 measure and a carefully determined two-band voicing. According to our analysis-synthesis experiments, the new parameterization can improve the speech quality. In HMM-based speech synthesis, the quality is significantly improved, especially due to the better modeling of vocal fry.

A Deterministic Plus Stochastic Model of the Residual Signal for Improved Parametric Speech Synthesis
Thomas Drugman 1, Geoffrey Wilfart 2, Thierry Dutoit 1; 1 Faculté Polytechnique de Mons, Belgium; 2 Acapela Group, Belgium
Wed-Ses1-P3-8, Time: 10:00
Speech generated by parametric synthesizers generally suffers from a typical buzziness, similar to what was encountered in old LPC-like vocoders. In order to alleviate this problem, a more suitable modeling of the excitation should be adopted. For this, we propose an adaptation of the Deterministic plus Stochastic Model (DSM) for the residual. In this model, the excitation is divided into two distinct spectral bands delimited by the maximum voiced frequency. The deterministic part concerns the low-frequency contents and consists of a decomposition of pitch-synchronous residual frames on an orthonormal basis obtained by Principal Component Analysis. The stochastic component is a high-pass filtered noise whose time structure is modulated by an energy envelope, similarly to what is done in the Harmonic plus Noise Model (HNM). The proposed residual model is integrated within an HMM-based speech synthesizer and is compared to the traditional excitation through a subjective test. Results show a significant improvement for both male and female voices. In addition, the proposed model requires little computational load and memory, which is essential for its integration in commercial applications.
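The deterministic part of the DSM residual model described by Drugman et al. rests on a PCA basis of pitch-synchronous residual frames. The sketch below builds such a basis by SVD on synthetic frames and measures how much energy a few components capture; the frame material, frame length and component count are placeholders, and the stochastic high-pass band is not modelled.

import numpy as np

rng = np.random.default_rng(0)
n_frames, frame_len = 400, 160       # placeholder sizes

# Synthetic stand-ins for length-normalised pitch-synchronous residual frames.
t = np.linspace(0, 1, frame_len)
frames = (np.outer(rng.standard_normal(n_frames), np.exp(-40 * (t - 0.5) ** 2))
          + 0.05 * rng.standard_normal((n_frames, frame_len)))

# Orthonormal basis via SVD of the centred frame matrix (PCA).
mean_frame = frames.mean(axis=0)
U, s, Vt = np.linalg.svd(frames - mean_frame, full_matrices=False)
basis = Vt[:8]                                     # first eight eigen-residuals

# Deterministic part = projection onto the basis; the remainder would be
# handled by the modulated-noise (stochastic) component in the DSM.
coeffs = (frames - mean_frame) @ basis.T
deterministic = mean_frame + coeffs @ basis
leftover = np.mean((frames - deterministic) ** 2) / np.mean(frames ** 2)
print(f"energy outside the 8-component basis: {leftover:.3f}")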
A Decision Tree-Based Clustering Approach to State Definition in an Excitation Modeling Framework for HMM-Based Speech Synthesis
Ranniery Maia 1, Tomoki Toda 2, Keiichi Tokuda 3, Shinsuke Sakai 1, Satoshi Nakamura 1; 1 NICT, Japan; 2 NAIST, Japan; 3 Nagoya Institute of Technology, Japan
Wed-Ses1-P3-9, Time: 10:00
This paper presents a decision tree-based algorithm to cluster residual segments assuming an excitation model based on state-dependent filtering of pulse train and white noise. The decision tree construction principle is the same as the one applied to speech recognition. Here parent nodes are split using the residual maximum likelihood criterion. Once these excitation decision trees are constructed for residual signals segmented by full context models, using questions related to the full context of the training sentences, they can be utilized for excitation modeling in speech synthesis based on hidden Markov models (HMMs). Experimental results have shown that the algorithm in question is very effective in terms of clustering residual signals given segmentation, pitch marks and full context questions, resulting in filters with good residual modeling properties.

An Improved Minimum Generation Error Based Model Adaptation for HMM-Based Speech Synthesis
Yi-Jian Wu 1, Long Qin 2, Keiichi Tokuda 1; 1 Nagoya Institute of Technology, Japan; 2 Carnegie Mellon University, USA
Wed-Ses1-P3-10, Time: 10:00
A minimum generation error (MGE) criterion was previously proposed for model training in HMM-based speech synthesis. In this paper, we apply the MGE criterion to model adaptation for HMM-based speech synthesis, and introduce an MGE linear regression (MGELR) based model adaptation algorithm, where the regression matrices used to transform source models are optimized so as to minimize the generation errors of the adaptation data. In addition, we incorporate the recent improvements of the MGE criterion into MGELR-based model adaptation, including state alignment under the MGE criterion and using a log spectral distortion (LSD) instead of Euclidean distance as the spectral distortion measure. From the experimental results, the adaptation performance was improved after incorporating these two techniques, and formal listening tests showed that the quality and speaker similarity of synthesized speech after MGELR-based adaptation were significantly improved over the original MLLR-based adaptation.

Two-Pass Decision Tree Construction for Unsupervised Adaptation of HMM-Based Synthesis Models
Matthew Gibson; University of Cambridge, UK
Wed-Ses1-P3-11, Time: 10:00
Hidden Markov model (HMM)-based speech synthesis systems possess several advantages over concatenative synthesis systems. One such advantage is the relative ease with which HMM-based systems are adapted to speakers not present in the training dataset. Speaker adaptation methods used in the field of HMM-based automatic speech recognition (ASR) are adopted for this task. In the case of unsupervised speaker adaptation, previous work has used a supplementary set of acoustic models to firstly estimate the transcription of the adaptation data. By defining a mapping between HMM-based synthesis models and ASR-style models, this paper introduces an approach to the unsupervised speaker adaptation task for HMM-based speech synthesis models which avoids the need for supplementary acoustic models. Further, this enables unsupervised adaptation of HMM-based speech synthesis models without the need to perform linguistic analysis of the estimated transcription of the adaptation data.
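Both adaptation papers above estimate linear transforms of model parameters from adaptation data, under MGE and MLLR-style criteria. As a generic stand-in, the sketch below fits a single affine transform by ordinary least squares on synthetic means and targets; it is not the MGE or MLLR estimation procedure itself, and all sizes and data are placeholders.

import numpy as np

rng = np.random.default_rng(0)
dim, n = 13, 300                                   # arbitrary sizes

# Synthetic source-model mean vectors and "adaptation targets" generated by
# an unknown affine mismatch plus noise.
source = rng.standard_normal((n, dim))
A_true = np.eye(dim) + 0.1 * rng.standard_normal((dim, dim))
b_true = 0.5 * rng.standard_normal(dim)
target = source @ A_true.T + b_true + 0.05 * rng.standard_normal((n, dim))

# Closed-form least-squares estimate of W = [A b] from extended vectors [x; 1].
X = np.hstack([source, np.ones((n, 1))])
W, *_ = np.linalg.lstsq(X, target, rcond=None)     # shape (dim + 1, dim)
adapted = X @ W

print(f"mean squared mismatch: {np.mean((target - source) ** 2):.3f} "
      f"-> {np.mean((target - adapted) ** 2):.3f}")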
Speaker Adaptation Using a Parallel Phone Set Pronunciation Dictionary for Thai-English Bilingual TTS
Anocha Rugchatjaroen, Nattanun Thatphithakkul, Ananlada Chotimongkol, Ausdang Thangthai, Chai Wutiwiwatchai; NECTEC, Thailand
Wed-Ses1-P3-12, Time: 10:00
This paper develops a bilingual Thai-English TTS system from two monolingual HMM-based TTS systems. An English Nagoya HMM-based TTS system (HTS) provides correct pronunciations of English words, but the voice is different from the voice in a Thai HTS system. We apply a CSMAPLR adaptation technique to make the English voice sound more similar to the Thai voice. To overcome the phone mapping problem that normally occurs with a pair of languages that have dissimilar phone sets, we utilize a cross-language pronunciation mapping through a parallel phone set pronunciation dictionary. The results from the subjective listening test show that English words synthesized by our proposed system are more intelligible (with 0.61 higher MOS) than the existing bilingual Thai-English TTS. Moreover, with the proposed adaptation method, the synthesized English words sound more similar to the synthesized Thai words.

HMM-Based Automatic Eye-Blink Synthesis from Speech
Michal Dziemianko, Gregor Hofer, Hiroshi Shimodaira; University of Edinburgh, UK
Wed-Ses1-P3-13, Time: 10:00
In this paper we present a novel technique to automatically synthesise eye blinking from a speech signal. Animating the eyes of a talking head is important as they are a major focus of attention during interaction. The developed system predicts eye blinks from the speech signal and generates animation trajectories automatically employing a "Trajectory Hidden Markov Model". The evaluation of the recognition performance showed that the timing of blinking can be predicted from speech with an F-score value upwards of 52%, which is well above chance. Additionally, a preliminary perceptual evaluation was conducted, which confirmed that adding eye blinking significantly improves the perception of the character. Finally, it showed that the speech-synchronised synthesised blinks outperform random blinking in naturalness ratings.

Wed-Ses1-P4 : Resources, Annotation and Evaluation
Hewison Hall, 10:00, Wednesday 9 Sept 2009
Chair: Michael Wagner, University of Canberra, Australia

Resources for Speech Research: Present and Future Infrastructure Needs
Lou Boves 1, Rolf Carlson 2, Erhard Hinrichs 3, David House 2, Steven Krauwer 4, Lothar Lemnitzer 3, Martti Vainio 5, Peter Wittenburg 6; 1 Radboud Universiteit Nijmegen, The Netherlands; 2 KTH, Sweden; 3 Universität Tübingen, Germany; 4 Utrecht University, The Netherlands; 5 University of Helsinki, Finland; 6 Max Planck Institute for Psycholinguistics, The Netherlands
Wed-Ses1-P4-1, Time: 10:00
This paper introduces the EU-FP7 project CLARIN, a joint effort of over 150 institutions in Europe, aimed at the creation of a sustainable language resources and technology infrastructure for the humanities and social sciences research community. The paper briefly introduces the vision behind the project and how it relates to speech research, with a focus on the contributions that CLARIN can and will make to research in spoken language processing.
Speech Recordings via the Internet: An Overview of the VOYS Project in Scotland
Catherine Dickie 1, Felix Schaeffler 1, Christoph Draxler 2, Klaus Jänsch 2; 1 Queen Margaret University, UK; 2 LMU München, Germany
Wed-Ses1-P4-2, Time: 10:00
The VOYS (Voices of Young Scots) project aims to establish a speech database of adolescent Scottish speakers. This database will serve speech recognition technology and sociophonetic research. 300 pupils will ultimately be recorded at secondary schools in 10 locations in Scotland. Recordings are performed via the Internet using two microphones (close-talk and desktop) in 22.05 kHz, 16 bit linear stereo signal quality. VOYS is the first large-scale and cross-boundary speech data collection based on the WikiSpeech content management system for speech resources. In VOYS, schools receive a kit containing the microphones and A/D interface, and they organise the recordings themselves. The recorded data is immediately uploaded to the server in Munich, relieving the schools of all data-handling tasks. This paper outlines the corpus specification, describes the technical issues, summarises the signal quality and gives a status report.

The Multi-Session Audio Research Project (MARP) Corpus: Goals, Design and Initial Findings
A.D. Lawson 1, A.R. Stauffer 1, E.J. Cupples 1, S.J. Wenndt 2, W.P. Bray 3, J.J. Grieco 2; 1 RADC Inc., USA; 2 Air Force Research Laboratory, USA; 3 Oasis Systems, USA
Wed-Ses1-P4-3, Time: 10:00
This paper describes the composition and goals of the Multi-Session Audio Research Project (MARP) corpus and some initial experimental findings. The MARP corpus is a three-year longitudinal collection of 21 sessions and more than 60 participants. The study was undertaken to test the impact of various factors on speaker recognition, such as inter-session variability, intonation, aging, whispering and text dependency. Initial results demonstrate the impact of sentence intonation, whispering, text dependency and cross-session tests. These results highlight the sensitivity of speaker recognition to vocal, environmental and phonetic conditions that are commonly encountered but rarely explored or tested.

Structure and Annotation of Polish LVCSR Speech Database
Katarzyna Klessa, Grażyna Demenko; Adam Mickiewicz University, Poland
Wed-Ses1-P4-4, Time: 10:00
This paper reports on the problems occurring in the process of building LVCSR (Large Vocabulary Continuous Speech Recognition) corpora, based on the internal evaluation of the Polish database JURISDIC. The initial assumptions are discussed together with technical matters concerning the database realization and annotation results. Providing rich database statistics was considered crucial, especially regarding the linguistic description, both for database evaluation and for the implementation of linguistic factors in acoustic models for speech recognition. The assumed principles for database construction are: low redundancy, acoustic-phonetic variability adequate to the dictation task, representativeness, and a balanced, heterogeneous structure enabling separate or combined modeling of phonetic-acoustic structures.
Balanced Corpus of Informal Spoken Czech: Compilation, Design and Findings This paper investigates the relationship between user ratings of multimodal systems and user ratings of its single modalities. Based on previous research showing precise predictions of ratings of multimodal systems based on ratings of single modality, it was hypothesized that the accuracy might have been caused by the participants’ efforts to rate consistently. We address this issue with two new studies. In the first study, the multimodal system was presented before the single modality versions were known by the users. In the second study, the type of system was changed, and age effects were investigated. We apply linear regression and show that models get worse when the order is changed. In addition, models for younger users perform better than those for older users. We conclude that ratings can be impacted by the effort of users to judge consistently, as well as their ability to do so. Auto-Checking Speech Transcriptions by Multiple Template Constrained Posterior Lijuan Wang 1 , Shenghao Qin 2 , Frank K. Soong 1 ; 1 Microsoft Research Asia, China; 2 Microsoft Business Division, China Martina Waclawičová, Michal Křen, Lucie Válková; Charles University in Prague, Czech Republic Wed-Ses1-P4-8, Time: 10:00 Wed-Ses1-P4-5, Time: 10:00 The paper presents ORAL2008, a new 1-million corpus of spoken Czech compiled within the framework of the Czech National Corpus project. ORAL2008 is designed as a representation of authentic spoken language used in informal situations and it is balanced in the main sociolinguistic categories of speakers. The paper concentrates also on the data collection, its broad coverage and the transcription system that registers variability of spoken Czech. Possible findings based on the provided data are finally outlined. JTrans: An Open-Source Software for Semi-Automatic Text-to-Speech Alignment C. Cerisara, O. Mella, D. Fohr; LORIA, France Wed-Ses1-P4-6, Time: 10:00 Aligning speech corpora with text transcriptions is an important requirement of many speech processing, data mining applications and linguistic researches. Despite recent progress in the field of speech recognition, many linguists still manually align spontaneous and noisy speech recordings to guarantee a good alignment quality. This work proposes an open-source java software with an easy-touse GUI that integrates dedicated semi-automatic speech alignment algorithms that can be dynamically controlled and guided by the user. The objective of this software is to facilitate and speed up the process of creating and aligning speech corpora. Checking transcription errors in speech database is an important but tedious task that traditionally requires intensive manual labor. In [9], Template Constrained Posterior (TCP) was proposed to automate the checking process by screening potential erroneous sentences with a single context template. However, single templatebased method is not robust and requires parameter optimization that still involves some manual work. In this work, we propose to use multiple templates which is more robust and requires no development data for parameter optimization. By using its multiple hypothesis sifting capabilities — from well-defined, full context to loosely defined context like wild card, the confidence for a focus unit can be measured at different expected accuracy. The joint verification by multiple TCP improves measured confidence of each unit in the transcription and is robust across different speech databases. 
Experimental results show that the checking process automatically separates erroneous sentences from correct ones: the sentence error hit rate decreases rapidly in the sorted TCP values, from 59% to 7% for the Mexican Spanish database and from 63% to 11% for the American English database, among the top 10% of sentences in the rank lists.

Subjective Experiments on Influence of Response Timing in Spoken Dialogues
Toshihiko Itoh 1 , Norihide Kitaoka 2 , Ryota Nishimura 3 ; 1 Hokkaido University, Japan; 2 Nagoya University, Japan; 3 Toyohashi University of Technology, Japan
Wed-Ses1-P4-9, Time: 10:00
To verify the validity of analysis results relating to dialogue rhythm from earlier studies, we produced spoken dialogues based on the analysis results relating to response timing, as well as other spoken dialogues, and performed subjective experiments to investigate parameters such as the naturalness of the dialogue, the incongruity of the synthesized speech, and the ease of comprehension of the utterances. We used very short task-oriented four-turn dialogues using synthesized speech in Experiment 1, and approximately one-minute free-conversation dialogues using natural human speech and synthesized speech in Experiment 2. As a result, we were able to show that a natural response timing exists for utterances, and that response timings that conform to the utterance contents are felt to be more natural, thus demonstrating the validity of the analysis results relating to dialogue rhythm.

Usability Study of VUI consistent with GUI Focusing on Age-Groups
Jun Okamoto, Tomoyuki Kato, Makoto Shozakai; Asahi Kasei Corporation, Japan
Wed-Ses1-P4-10, Time: 10:00
We studied the usability of a Voice User Interface (VUI) that is consistent with a Graphical User Interface (GUI), and focused on its dependency on user age-groups. Usability tests were iteratively conducted on 245 Japanese subjects in age-groups from the 20s to the 60s, using a prototype of an in-vehicle information application. Next, we calculated and analyzed statistics of the usability tests. We discuss the differences in usability with respect to age-groups and how to handle them. We propose that it is necessary to make voice guidance straightforward and to devise a VUI consistent with a GUI (VGUI) in order to let users understand the system structure. We also found that the default design of a VGUI should be as simple as possible so that elderly users, who may be slow to learn the new system structure, are able to learn it easily.

Annotating Communicative Function and Semantic Content in Dialogue Act for Construction of Consulting Dialogue Systems
Teruhisa Misu, Kiyonori Ohtake, Chiori Hori, Hideki Kashioka, Satoshi Nakamura; NICT, Japan
Wed-Ses1-P4-11, Time: 10:00
Our goal in this study is to train a dialogue manager that can handle consulting dialogues through spontaneous interactions from a tagged dialogue corpus. We have collected 130 hours of consulting dialogues in the sightseeing guidance domain. This paper provides our taxonomy of dialogue act (DA) annotation that can describe two aspects of utterances: one is the communicative function (speech act), and the other is the semantic content of the utterance. We provide an overview of the Kyoto tour guide dialogue corpus and a preliminary analysis using the dialogue act tags.
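The dialogue-act annotation described above tags each utterance along two axes, a communicative function and a semantic content. A minimal sketch of how such a two-part tag might be represented is shown below; the field names and example values are hypothetical and do not reproduce the authors' actual tag set.

```python
from dataclasses import dataclass, field
from typing import Dict

@dataclass
class DialogueActTag:
    """Two-aspect dialogue act annotation: speech act plus semantic content."""
    communicative_function: str                      # e.g. "request", "inform"
    semantic_content: Dict[str, str] = field(default_factory=dict)

# Hypothetical example from a sightseeing-guidance exchange.
tag = DialogueActTag(
    communicative_function="request",
    semantic_content={"topic": "temple", "attribute": "opening_hours"},
)
print(tag)
```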
Improved Speech Summarization with Multiple-Hypothesis Representations and Kullback-Leibler Divergence Measures An Improved Speech Segmentation Quality Measure: The R-Value Okko Johannes Räsänen, Unto Kalervo Laine, Toomas Altosaar; Helsinki University of Technology, Finland Wed-Ses1-P4-13, Time: 10:00 Phone segmentation in ASR is usually performed indirectly by Viterbi decoding of HMM output. Direct approaches also exist, e.g., blind speech segmentation algorithms. In either case, performance of automatic speech segmentation algorithms is often measured using automated evaluation algorithms and used to optimize a segmentation system’s performance. However, evaluation approaches reported in literature were found to be lacking. Also, we have determined that increases in phone boundary location detection rates are often due to increased over-segmentation levels and not to algorithmic improvements, i.e., by simply adding random boundaries a better hit-rate can be achieved when using current quality measures. Since established measures were found to be insensitive to this type of random boundary insertion, a new R-value quality measure is introduced that indicates how close a segmentation algorithm’s performance is to an ideal point of operation. No Sooner Said Than Done? Testing Incrementality of Semantic Interpretations of Spontaneous Speech Michaela Atterer, Timo Baumann, David Schlangen; Universität Potsdam, Germany Wed-Ses1-P4-14, Time: 10:00 Ideally, a spoken dialogue system should react without much delay to a user’s utterance. Such a system would already select an object, for instance, before the user has finished her utterance about moving this particular object to a particular place. A prerequisite for such a prompt reaction is that semantic representations are built up on the fly and passed on to other modules. Few approaches to incremental semantics construction exist, and, to our knowledge, none of those has been systematically tested on a spontaneous speech corpus. In this paper, we develop measures to test empirically on transcribed spontaneous speech to what extent we can create semantic interpretation on the fly with an incremental semantic chunker that builds a frame semantics. Wed-Ses1-S1 : Special Session: Lessons and Challenges Deploying Voice Search Ainsworth (East Wing 4), 10:00, Wednesday 9 Sept 2009 Chair: Michael Cohen, Google, USA and Mike Phillips, Vlingo, USA Role of Natural Language Understanding in Voice Local Search Shih-Hsiang Lin, Berlin Chen; National Taiwan Normal University, Taiwan Wed-Ses1-P4-12, Time: 10:00 Imperfect speech recognition often leads to degraded performance when leveraging existing text-based methods for speech summarization. To alleviate this problem, this paper investigates various ways to robustly represent the recognition hypotheses of spoken documents beyond the top scoring ones. Moreover, a new summarization method stemming from the Kullback-Leibler (KL) divergence measure and exploring both the sentence and document relevance information is proposed to work with such robust representations. Experiments on broadcast news speech summarization seem to demonstrate the utility of the presented approaches. Junlan Feng, Srinivas Banglore, Mazin Gilbert; AT&T Labs Research, USA Wed-Ses1-S1-1, Time: 10:00 Speak4it is a voice-enabled local search system currently available for iPhone devices. The natural language understanding (NLU) component is one of the key technology modules in this system. 
The role of NLU in voice-enabled local search is twofold: (a) parse the automatic speech recognition (ASR) output (1-best and word lattices) into meaningful segments that contribute to high-precision local search, and (b) understand the user’s intent. This paper is concerned with the first task of NLU. In previous work, we had presented a scalable approach to parsing, which is built upon a text indexing and search framework and can also parse ASR lattices. In this paper, we propose an algorithm to improve the baseline by extracting the “subjects” of the query. Experimental results indicate that lattice-based query parsing outperforms ASR 1-best based parsing by 2.1% absolute, and that extracting subjects in the query improves the robustness of search.

Recognition and Correction of Voice Web Search Queries
Keith Vertanen, Per Ola Kristensson; University of Cambridge, UK
Wed-Ses1-S1-2, Time: 10:15
In this work we investigate how to recognize and correct voice web search queries. We describe our corpus of web search queries and show how it was used to improve recognition accuracy. We show that using a search-specific vocabulary with automatically generated pronunciations is superior to using a vocabulary limited to a fixed pronunciation dictionary. We conducted a formative user study to investigate recognition and correction aspects of voice search in a mobile context. In the user study, we found that despite a word error rate of 48%, users were able to speak and correct search queries in about 18 seconds. Users did this while walking around using a mobile touch-screen device.

Voice Search and Everything Else — What Users Are Saying to the Vlingo Top Level Voice UI
Chao Wang; Vlingo, USA
Wed-Ses1-S1-3, Time: 10:30

Searching Google by Voice
Johan Schalkwyk; Google Inc., USA
Wed-Ses1-S1-4, Time: 10:45

Multiple-hypotheses searches from deeply parsed requests to multiple-evidences scoring: the DeepQA challenge
Roberto Sicconi; IBM T.J. Watson Research Center, USA
Wed-Ses1-S1-5, Time: 11:00

Research Areas in Voice Search: Lessons from Microsoft Deployments
Geoffrey Zweig; Microsoft Research, USA
Wed-Ses1-S1-6, Time: 11:15

Panel Discussion
Wed-Ses1-S1-7, Time: 11:30
Panel Members: • Johan Schalkwyk, Senior Staff Software Engineer, Google Inc., USA • Chao Wang, Principal Speech Scientist, Vlingo, USA • Roberto Sicconi, Program Director, DeepQA New Opportunities, IBM T.J. Watson Research Center, USA • Mazin Gilbert, AT&T Labs Research, USA • Geoffrey Zweig, Senior Researcher, Microsoft Research, USA • Keith Vertanen, University of Cambridge, UK

Wed-Ses2-O1 : Word-Level Perception
Main Hall, 13:30, Wednesday 9 Sept 2009
Chair: Jeesun Kim, University of Western Sydney, Australia

Semantic Context Effects in the Recognition of Acoustically Unreduced and Reduced Words
Marco van de Ven 1 , Benjamin V. Tucker 2 , Mirjam Ernestus 3 ; 1 Max Planck Institute for Psycholinguistics, The Netherlands; 2 University of Alberta, Canada; 3 Radboud Universiteit Nijmegen, The Netherlands
Wed-Ses2-O1-1, Time: 13:30
Listeners require context to understand the casual pronunciation variants of words that are typical of spontaneous speech [1]. The present study reports two auditory lexical decision experiments investigating listeners’ use of semantic contextual information in the comprehension of unreduced and reduced words. We found a strong semantic priming effect for low-frequency unreduced words, whereas there was no such effect for reduced words. Word frequency was facilitatory for all words. These results show that semantic context is relevant especially for the comprehension of unreduced words, which is unexpected given the listener-driven explanation of reduction in spontaneous speech.

Context Effects and the Processing of Ambiguous Words: Further Evidence from Semantic Incongruence
Michael C.W. Yip; Hong Kong Institute of Education, China
Wed-Ses2-O1-2, Time: 13:50
A cross-modal naming experiment was conducted to further verify the effects of context and other lexical information in the processing of Chinese homophones during spoken language comprehension. In this experiment, listeners named aloud a visual probe as fast as they could, at a pre-designated point upon hearing the sentence, which ended with a spoken Chinese homophone. Results further support the view that context exerts an effect on the disambiguation of the various homophonic meanings at an early stage, within the acoustic boundary of the word. This contextual effect was even stronger than the tonal information. Finally, the present results are in line with the context-dependency hypothesis that selection of the appropriate meaning of an ambiguous word depends on the simultaneous interaction among sentential, tonal and other lexical information during lexical access.

The Roles of Reconstruction and Lexical Storage in the Comprehension of Regular Pronunciation Variants
Mirjam Ernestus; Radboud Universiteit Nijmegen, The Netherlands
Wed-Ses2-O1-3, Time: 14:10
This paper investigates how listeners process regular pronunciation variants, resulting from simple general reduction processes. Study 1 shows that when listeners are presented with new words, they store the pronunciation variants presented to them, whether these are unreduced or reduced. Listeners thus store information on word-specific pronunciation variation. Study 2 suggests that if participants are presented with regularly reduced pronunciations, they also reconstruct and store the corresponding unreduced pronunciations. These unreduced pronunciations apparently have special status. Together the results support hybrid models of speech processing, assuming roles for both exemplars and abstract representations.
Real-Time Lexical Competitions During Speech-in-Speech Comprehension Véronique Boulenger 1 , Michel Hoen 2 , François Pellegrino 1 , Fanny Meunier 1 ; 1 DDL, France; 2 SBRI, France Wed-Ses2-O1-5, Time: 14:50 This study investigates speech comprehension in competing multitalker babble. We examined the effects of number of simultaneous talkers and of frequency of words in the babble on lexical decision to target words. Results revealed better performance at a low talker number (n = 2). Importantly, frequency of words in the babble significantly affected performance: high frequency word babble interfered more strongly with word recognition than low frequency babble. This informational masking was particularly salient for the 2-talker babble. These findings suggest that investigating speech-in-speech comprehension may provide crucial information on lexical competition processes that occur in real-time during word recognition. Discovering Consistent Word Confusions in Noise Martin Cooke; Ikerbasque, Spain Wed-Ses2-O1-6, Time: 15:10 Listeners make mistakes when communicating under adverse conditions, with overall error rates reasonably well-predicted by existing speech intelligibility metrics. However, a detailed examination of confusions made by a majority of listeners is more likely to provide insights into processes of normal word recognition. The current study measured the rate at which robust misperceptions occurred for highly-confusable words embedded in noise. In a second experiment, confusions discovered in the first listening test were subjected to a range of manipulations designed to help identify their cause. These experiments reveal that while majority confusions are quite rare, they occur sufficiently often to make large-scale discovery worthwhile. Surprisingly few misperceptions were due solely to energetic masking by the noise, suggesting that speech and noise “react” in complex ways which are not well-described by traditional masking concepts. Dimitrios P. Lyras, George Kokkinakis, Alexandros Lazaridis, Kyriakos Sgarbas, Nikos Fakotakis; University of Patras, Greece Wed-Ses2-O2-1, Time: 13:30 A large Greek-English Dictionary with 81,515 entries, 192,592 translations into English and 50,106 usage examples with their translation has been developed in combined printed and electronic (DVD) form. The electronic dictionary features unique facilities for searching the entire or any part of the Greek and English section, and has incorporated a series of speech and language processing tools which may efficiently assist learners of Greek and English. This paper presents the human-machine interface of the dictionary and the most important tools, i.e. the TTS-synthesizers for Greek and English, the lemmatizers for Greek and English, the Grapheme-to-Phoneme converter for Greek and the syllabification system for Greek. Predicting Children’s Reading Ability Using Evaluator-Informed Features Matthew Black, Joseph Tepperman, Sungbok Lee, Shrikanth S. Narayanan; University of Southern California, USA Wed-Ses2-O2-2, Time: 13:50 Automatic reading assessment software has the difficult task of trying to model human-based observations, which have both objective and subjective components. In this paper, we mimic the grading patterns of a “ground-truth” (average) evaluator in order to produce models that agree with many people’s judgments. 
We examine one particular reading task, where children read a list of words aloud, and evaluators rate the children’s overall reading ability on a scale from one to seven. We first extract various features correlated with the specific cues that evaluators said they used. We then compare various supervised learning methods that mapped the most relevant features to the ground-truth evaluator scores. Our final system predicted these scores with 0.91 correlation, higher than the average inter-evaluator agreement.

Automatic Intonation Classification for Speech Training Systems
György Szaszák, Dávid Sztahó, Klára Vicsi; BME, Hungary
Wed-Ses2-O2-3, Time: 14:10
A prosodic Hidden Markov model (HMM) based modality recognizer has been developed which, after supra-segmental acoustic pre-processing, can perform clause and sentence boundary detection and modality (sentence type) recognition. This modality recognizer is adapted to carry out automatic evaluation of the intonation of the produced utterances in a speech training system for hearing-impaired persons or foreign language learners. The system is evaluated on utterances from normally-speaking persons and tested with speech-impaired (due to hearing problems) persons. To allow a deeper analysis, the automatic classification of the intonation is compared to subjective listening tests.

Automated Pronunciation Scoring Using Confidence Scoring and Landmark-Based SVM
Su-Youn Yoon 1 , Mark Hasegawa-Johnson 1 , Richard Sproat 2 ; 1 University of Illinois at Urbana-Champaign, USA; 2 Oregon Health & Science University, USA
Wed-Ses2-O2-4, Time: 14:30
In this study, we present a pronunciation scoring method for second language learners of English (hereafter, L2 learners), using both confidence scoring and classifiers. Classifiers have an advantage over confidence scoring for specialization in the specific phonemes where L2 learners make frequent errors. Classifiers (Landmark-based Support Vector Machines) were trained in order to distinguish L2 phonemes from their frequent substitution patterns. The method was evaluated on the specific English phonemes where L2 English learners make frequent errors. The results suggest that the automated pronunciation scoring method can be improved consistently by combining the two methods.

ASR Based Pronunciation Evaluation with Automatically Generated Competing Vocabulary
Carlos Molina, Nestor Becerra Yoma, Jorge Wuth, Hiram Vivanco; Universidad de Chile, Chile

Wed-Ses2-O3 : ASR: New Paradigms I
Fallside (East Wing 2), 13:30, Wednesday 9 Sept 2009
Chair: Geoffrey Zweig, Microsoft Research, USA

The Semi-Supervised Switchboard Transcription Project
Amarnag Subramanya, Jeff Bilmes; University of Washington, USA
Wed-Ses2-O3-1, Time: 13:30
In previous work, we proposed a new graph-based semi-supervised learning (SSL) algorithm and showed that it outperforms other state-of-the-art SSL approaches for classifying documents and web-pages. Here we use a multi-threaded implementation in order to scale the algorithm to very large data sets. We treat the phonetically annotated portion of the Switchboard transcription project (STP) as labeled data and automatically annotate (at the phonetic level) the Switchboard I (SWB) training set, and show that our proposed approach outperforms state-of-the-art SSL algorithms as well as a state-of-the-art strictly supervised classifier.
As a result, we have STP-style annotations of the entire SWB-I training set which we refer to as semi-supervised STP (S3TP). Maximum Mutual Information Multi-Phone Units in Direct Modeling Wed-Ses2-O2-5, Time: 14:50 In this paper the application of automatic speech recognition (ASR) technology in CAPT (Computer Aided Pronunciation Training) is addressed. A method to automatically generate the competitive lexicon, required by an ASR engine to compare the pronunciation of a target word with its correct and wrong phonetic realization, is presented. In order to enable the efficient deployment of CAPT applications, the generation of this competitive lexicon does not require any human assistance or a priori information of mother language dependent errors. The method presented here leads to averaged subjective-objective score correlation equal to 0.82 and 0.75 depending on the task. High Performance Automatic Mispronunciation Detection Method Based on Neural Network and TRAP Features Hongyan Li, Shijin Wang, Jiaen Liang, Shen Huang, Bo Xu; Chinese Academy of Sciences, China Wed-Ses2-O2-6, Time: 15:10 In this paper, we propose a new approach to utilize temporal information and neural network (NN) to improve the performance of automatic mispronunciation detection (AMD). Firstly, the alignment results between speech signals and corresponding phoneme sequences are obtained within the classic GMM-HMM framework. Then, the long-time TempoRAl Patterns (TRAPs) [5] features are introduced to describe the pronunciation quality instead of the conventional spectral features (e.g. MFCC). Based on the phoneme boundaries and TRAPs features, we use Multi-layer Perceptron (MLP) to calculate the final posterior probability of each testing phoneme, and determine whether it is a mispronunciation or not by comparing with a phone dependent threshold. Moreover, we combine the TRAPs-MLP method with our existing methods to further improve the performance. Experiments show that the TRAPs-MLP method can give a significant relative improvement of 39.04% in EER (Equal Error Rate) reduction, and the fusion of TRAPs-MLP, GMM-UBM and GLDS-SVM [4] methods can yield 48.32% in EER reduction relatively, both compared with the baseline GMM-UBM method. Geoffrey Zweig, Patrick Nguyen; Microsoft Research, USA Wed-Ses2-O3-2, Time: 13:30 This paper introduces a class of discriminative features for use in maximum entropy speech recognition models. The features we propose are acoustic detectors for discriminatively determined multi-phone units. The multi-phone units are found by computing the mutual information between the phonetic sub-sequences that occur in the training lexicon, and the word labels. This quantity is a function of an error model governing our ability to detect phone sequences accurately (an otherwise informative sequence which cannot be reliably detected is not so useful). We show how to compute this mutual information quantity under a class of error models efficiently, in one pass over the data, for all phonetic subsequences in the training data. After this computation, detectors are created for a subset of highly informative units. We then define two novel classes of features based on these units: associative and transductive. Incorporating these features in a maximum entropy based direct model for Voice-Search outperforms the baseline by 24% in sentence error rate. Profiling Large-Vocabulary Continuous Speech Recognition on Embedded Devices: A Hardware Resource Sensitivity Analysis Kai Yu, Rob A. 
Rutenbar; Carnegie Mellon University, USA
Wed-Ses2-O3-3, Time: 13:30
When deployed in embedded systems, speech recognizers are necessarily reduced from large-vocabulary continuous speech recognizers (LVCSR) found on desktops or servers to fit the limited hardware. However, embedded hardware continues to evolve in capability; today’s smartphones are vastly more powerful than their recent ancestors. This begets a new question: which hardware features not currently found on today’s embedded platforms, but potentially add-ons to tomorrow’s devices, are most likely to improve recognition performance? Said differently — what is the sensitivity of the recognizer to fine-grained details of the embedded hardware resources? To answer this question rigorously and quantitatively, we offer results from a detailed study of LVCSR performance as a function of micro-architecture options on an embedded ARM11 and an enterprise-class Intel Core2Duo. We estimate speed and energy consumption, and show, feature by feature, how hardware resources impact recognizer performance.

Continuous Speech Recognition Using Attention Shift Decoding with Soft Decision
Ozlem Kalinli, Shrikanth S. Narayanan; University of Southern California, USA
Wed-Ses2-O3-4, Time: 13:30
We present an attention shift decoding (ASD) method inspired by human speech recognition. In contrast to traditional automatic speech recognition (ASR) systems, ASD decodes speech inconsecutively using reliability criteria; the gaps (unreliable speech regions) are decoded with the evidence of islands (reliable speech regions). On the BU Radio News Corpus, ASD provides significant improvement (2.9% absolute) over the baseline ASR results when it is used with oracle island-gap information. At the core of the ASD method is the automatic island-gap detection. Here, we propose a new feature set for automatic island-gap detection which achieves 83.7% accuracy. To cope with the imperfect nature of the island-gap classification, we also propose a new ASD algorithm using soft decision. The ASD with soft decision provides 0.4% absolute (2.2% relative) improvement over the baseline ASR results when it is used with automatically detected islands and gaps.

Towards Using Hybrid Word and Fragment Units for Vocabulary Independent LVCSR Systems
Ariya Rastrow 1 , Abhinav Sethy 2 , Bhuvana Ramabhadran 2 , Frederick Jelinek 1 ; 1 Johns Hopkins University, USA; 2 IBM T.J. Watson Research Center, USA
Wed-Ses2-O3-5, Time: 13:30
This paper presents the advantages of augmenting a word-based system with sub-word units as a step towards building open-vocabulary speech recognition systems. We show that a hybrid system which combines words and data-driven, variable-length sub-word units has better phone accuracy than word-only systems. In addition, the hybrid system is better at detecting Out-Of-Vocabulary (OOV) terms and representing them phonetically. Results are presented on the RT-04 broadcast news and MIT Lecture data sets. An FSM-based approach to recover OOV words from the hybrid lattices is also presented. At an OOV rate of 2.5% on RT-04, we observed an 8% relative improvement in phone error rate (PER), a 7.3% relative improvement in oracle PER and a 7% relative improvement in WER after recovering the OOV terms. A significant reduction of 33% relative in PER is seen in the OOV regions.

Unsupervised Training of an HMM-Based Speech Recognizer for Topic Classification
Herbert Gish, Man-hung Siu, Arthur Chan, Bill Belfield; BBN Technologies, USA
Wed-Ses2-O3-6, Time: 13:30
HMM-based Speech-To-Text (STT) systems are widely deployed not only for dictation tasks but also as the first processing stage of many automatic speech applications such as spoken topic classification. However, the necessity of transcribed data for training the HMMs precludes their use in domains where transcribed speech is difficult to come by because of the specific domain, channel or language. In this work, we propose building HMM-based speech recognizers without transcribed data by formulating the HMM training as an optimization over both the parameter and transcription sequence space. We describe how this can be easily implemented using existing STT tools. We tested the effectiveness of our unsupervised training approach on the task of topic classification on the Switchboard corpus. The unsupervised HMM recognizer, initialized with a segmental tokenizer, outperformed both an HMM phoneme recognizer trained with 1 hour of transcribed data and the Brno University of Technology (BUT) Hungarian phoneme recognizer. This approach can also be applied to other speech applications, including spoken term detection, language and speaker verification.

Wed-Ses2-O4 : Single-Channel Speech Enhancement
Holmes (East Wing 3), 13:30, Wednesday 9 Sept 2009
Chair: B. Yegnanarayana, IIIT Hyderabad, India

Constrained Probabilistic Subspace Maps Applied to Speech Enhancement
Kaustubh Kalgaonkar, Mark A. Clements; Georgia Institute of Technology, USA
Wed-Ses2-O4-1, Time: 13:30
This paper presents a probabilistic algorithm that extracts a mapping between two subspaces by representing each subspace as a collection of states. In many cases, the data is a time series with temporal constraints. This paper suggests a method to impose these temporal constraints on the transitions between the states of the subspace. This probabilistic model has been successfully applied to the problem of speech enhancement and improves the performance of a Wiener filter by providing robust estimates of a priori SNR.

Reconstructing Clean Speech from Noisy MFCC Vectors
Ben Milner, Jonathan Darch, Ibrahim Almajai; University of East Anglia, UK
Wed-Ses2-O4-2, Time: 13:50
The aim of this work is to reconstruct clean speech solely from a stream of noise-contaminated MFCC vectors, as may be encountered in distributed speech recognition systems. Speech reconstruction is performed using the ETSI Aurora back-end speech reconstruction standard, which requires MFCC vectors, fundamental frequency and voicing information. In this work, fundamental frequency and voicing are obtained using maximum a posteriori prediction from input MFCC vectors, thereby allowing speech reconstruction solely from a stream of MFCC vectors. Two different methods to improve prediction accuracy in noisy conditions are then developed. Experimental results first establish that improved fundamental frequency and voicing prediction is obtained when noise compensation is applied. A series of human listening tests is then used to analyse the reconstructed speech quality, which determines the effectiveness of noise compensation in terms of mean opinion scores.

An Evaluation of Objective Quality Measures for Speech Intelligibility Prediction
Cees H. Taal 1 , Richard C.
Hendriks 1 , Richard Heusdens 1 , Jesper Jensen 2 , Ulrik Kjems 2 ; 1 Technische Universiteit Delft, The Netherlands; 2 Oticon A/S, Denmark Wed-Ses2-O4-3, Time: 14:10 In this research various objective quality measures are evaluated in order to predict the intelligibility for a wide range of non-linearly processed speech signals and speech degraded by additive noise. The obtained results are compared with the prediction results of a more advanced perceptual-based model proposed by Dau et al. and an objective intelligibility measure, namely the coherence speech intelligibility index (cSII). These tests are performed in order to gain more knowledge between the link of speech-quality and speechintelligibility and may help us to exploit the extensive research done into the field of speech-quality for speech-intelligibility. It is shown that cSII does not necessarily show better performance compared to conventional objective (speech)-quality measures. In general, the DAU-model is the only method with reasonable results for all processing conditions. Performance Comparison of HMM and VQ Based Single Channel Speech Separation M.H. Radfar 1 , W.-Y. Chan 2 , R.M. Dansereau 3 , W. Wong 1 ; 1 University of Toronto, Canada; 2 Queen’s University, Canada; 3 Carleton University, Canada through experiments: 1) soft mask is better than binary mask in terms of recognition performance and 2) cepstral mean normalization (CMN) reduces the distortion, especially for that caused by soft mask. At the end, we evaluate the recognition performance of our method in noisy and reverberant real environment. Enhancing Audio Speech Using Visual Speech Features Ibrahim Almajai, Ben Milner; University of East Anglia, UK Wed-Ses2-O4-6, Time: 15:10 This work presents a novel approach to speech enhancement by exploiting the bimodality of speech and the correlation that exists between audio and visual speech features. For speech enhancement, a visually-derived Wiener filter is developed. This obtains clean speech statistics from visual features by modelling their joint density and making a maximum a posteriori estimate of clean audio from visual speech features. Noise statistics for the Wiener filter utilise an audio-visual voice activity detector which classifies input audio as speech or nonspeech, enabling a noisemodel to be updated. Analysis shows estimation of speech and noise statistics to be effective with human listening tests measuring the effectiveness of the resulting Wiener filter. Wed-Ses2-P1 : Emotion and Expression II Hewison Hall, 13:30, Wednesday 9 Sept 2009 Chair: L. ten Bosch, Radboud Universiteit Nijmegen, The Netherlands Wed-Ses2-O4-4, Time: 14:30 In this paper, single channel speech separation (SCSS) techniques based on hidden Markov models (HMM) and vector quantization (VQ) are described and compared in terms of (a) signal-to-noise ratio (SNR) between separated and original speech signals, (b) preference of listeners, and (c) computational complexity. The SNR results show that the HMM-based technique marginally outperforms the VQ-based technique by 0.85 dB in experiments conducted on mixtures of female-female, male-male, and male-female speakers. Subjective tests show that listeners prefer HMM over VQ for 86.70% of test speech files. This improvement, however, is at the expense of a drastic increase in computational complexity when compared with the VQ-based technique. 
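The separation comparison above ranks systems by the signal-to-noise ratio between the separated and original speech signals. For reference, a minimal sketch of that measurement is given below, assuming time-aligned signals of equal length; any gain normalisation used in a particular evaluation is not shown.

```python
import numpy as np

def snr_db(reference: np.ndarray, estimate: np.ndarray) -> float:
    """SNR in dB between a reference signal and a separated estimate.

    Assumes both signals are time-aligned and of equal length.
    """
    noise = reference - estimate
    return 10.0 * np.log10(np.sum(reference ** 2) / np.sum(noise ** 2))

# Toy example: a sine wave recovered with a small amount of residual noise.
t = np.linspace(0, 1, 8000, endpoint=False)
clean = np.sin(2 * np.pi * 220 * t)
separated = clean + 0.05 * np.random.randn(t.size)
print(f"{snr_db(clean, separated):.1f} dB")
```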
Stereo-Input Speech Recognition Using Sparseness-Based Time-Frequency Masking in a Reverberant Environment
Yosuke Izumi 1 , Kenta Nishiki 1 , Shinji Watanabe 2 , Takuya Nishimoto 1 , Nobutaka Ono 1 , Shigeki Sagayama 1 ; 1 University of Tokyo, Japan; 2 NTT Corporation, Japan
Wed-Ses2-O4-5, Time: 14:50
We present noise-robust automatic speech recognition (ASR) using a sparseness-based underdetermined blind source separation (BSS) technique. As a representative underdetermined BSS method, we utilized time-frequency masking in this paper. Although time-frequency masking is able to separate target speech from interferences effectively, one should consider two problems. One is that masking does not work well in noisy or reverberant environments. Another is that masking itself might cause some distortion of the target speech. For the former, we apply our time-frequency masking method [7], which can separate the target signal robustly even in noisy and reverberant environments. Next, investigating the distortion caused by time-frequency masking, we reveal the following facts

Perceiving Surprise on Cue Words: Prosody and Semantics Interact on Right and Really
Catherine Lai; University of Pennsylvania, USA
Wed-Ses2-P1-1, Time: 13:30
Cue words in dialogue have different interpretations depending on context and prosody. This paper presents a corpus study and a perception experiment investigating when prosody causes right and really to be perceived as questioning or expressing surprise. Pitch range is found to be the best cue for surprise. This extends to the question rating for really but not for right. In fact, prosody appears to interact with semantics, so ratings differ for these two types of cue word even when prosodic features are similar. So, different semantics appears to result in different surprise/question rating thresholds.

Emotion Recognition Using Linear Transformations in Combination with Video
Rok Gajšek, Vitomir Štruc, Simon Dobrišek, France Mihelič; University of Ljubljana, Slovenia
Wed-Ses2-P1-2, Time: 13:30
The paper discusses the usage of linear transformations of Hidden Markov Models, normally employed for speaker and environment adaptation, as a way of extracting the emotional components from speech. A constrained version of Maximum Likelihood Linear Regression (CMLLR) transformation is used as a feature for classification of normal or aroused emotional state. We present a procedure for incrementally building a set of speaker-independent acoustic models that are used to estimate the CMLLR transformations for emotion classification. An audio-video database of spontaneous emotions (AvID) is briefly presented, since it forms the basis for the evaluation of the proposed method. Emotion classification using the video part of the database is also described, and the added value of combining the visual information with the audio features is shown.

Speaker Dependent Emotion Recognition Using Prosodic Supervectors
Ignacio Lopez-Moreno, Carlos Ortego-Resa, Joaquin Gonzalez-Rodriguez, Daniel Ramos; Universidad Autónoma de Madrid, Spain
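The stereo-input masking entry above separates a target speaker by keeping only time-frequency cells dominated by that source. The sketch below shows a crude two-channel binary mask in that spirit; the level-difference rule and threshold are assumptions made for illustration and do not reproduce the authors' sparseness-based method [7].

```python
import numpy as np
from scipy.signal import stft, istft

def binary_mask_separation(left, right, fs, threshold_db=3.0):
    """Crude two-channel time-frequency masking (illustration only).

    Keeps STFT cells where the left channel is at least `threshold_db`
    louder than the right, assuming the target is closer to the left mic.
    """
    f, t, L = stft(left, fs, nperseg=512)
    _, _, R = stft(right, fs, nperseg=512)
    level_diff = 20 * np.log10((np.abs(L) + 1e-12) / (np.abs(R) + 1e-12))
    mask = (level_diff > threshold_db).astype(float)
    _, target = istft(L * mask, fs, nperseg=512)
    return target
```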
Modeling Mutual Influence of Interlocutor Emotion States in Dyadic Spoken Interactions Wed-Ses2-P1-3, Time: 13:30 This work presents a novel approach for detection of emotions embedded in the speech signal. The proposed approach works at the prosodic level, and models the statistical distribution of the prosodic features with Gaussian Mixture Models (GMM) meanadapted from a Universal Background Model (UBM). This allows the use of GMM-mean supervectors, which are classified by a Support Vector Machine (SVM). Our proposal is compared to a popular baseline, which classifies with an SVM a set of selected prosodic features from the whole speech signal. In order to measure the speaker intervariability, which is a factor of degradation in this task, speaker dependent and speaker independent frameworks have been considered. Experiments have been carried out under the SUSAS subcorpus, including real and simulated emotions. Results shows that in a speaker dependent framework our proposed approach achieves a relative improvement greater than 14% in Equal Error Rate (EER) with respect to the baseline approach. The relative improvement is greater than 17% when both approaches are combined together by fusion with respect to the baseline. Physiologically-Inspired Feature Extraction for Emotion Recognition Chi-Chun Lee, Carlos Busso, Sungbok Lee, Shrikanth S. Narayanan; University of Southern California, USA Wed-Ses2-P1-6, Time: 13:30 In dyadic human interactions, mutual influence — a person’s influence on the interacting partner’s behaviors — is shown to be important and could be incorporated into the modeling framework in characterizing, and automatically recognizing the participants’ states. We propose a Dynamic Bayesian Network (DBN) to explicitly model the conditional dependency between two interacting partners’ emotion states in a dialog using data from the IEMOCAP corpus of expressive dyadic spoken interactions. Also, we focus on automatically computing the Valence-Activation emotion attributes to obtain a continuous characterization of the participants’ emotion flow. Our proposed DBN models the temporal dynamics of the emotion states as well as the mutual influence between speakers in a dialog. With speech based features, the proposed network improves classification accuracy by 3.67% absolute and 7.12% relative over the Gaussian Mixture Model (GMM) baseline on isolated turnby-turn emotion classification. A Detailed Study of Word-Position Effects on Emotion Expression in Speech Yu Zhou 1 , Yanqing Sun 1 , Junfeng Li 2 , Jianping Zhang 1 , Yonghong Yan 1 ; 1 Chinese Academy of Sciences, China; 2 JAIST, Japan Jangwon Kim, Sungbok Lee, Shrikanth S. Narayanan; University of Southern California, USA Wed-Ses2-P1-4, Time: 13:30 In this paper, we proposed a new feature extraction method for emotion recognition based on the knowledge of the emotion production mechanism in physiology. It was reported by physiacoustist that emotional speech is differently encoded from the normal speech in terms of articulation organs and that emotion information in speech is concentrated in different frequencies caused by the different movements of organs [4]. To apply these findings, in this paper, we first quantified the distribution of speech emotion information along with each frequency band by exploiting the Fisher’s F-Ratio and mutual information techniques, and then proposed a non-uniform sub-band processing method which is able to extract and emphasize the emotion features in speech. 
These extracted features are finally applied to emotion recognition. Experimental results in speech emotion recognition showed that the extracted features using our proposed non-uniform sub-band processing outperform the traditional (MFCC) features, and the average error reduction rate amounts to 16.8% for speech emotion recognition.

Wed-Ses2-P1-7, Time: 13:30
We investigate emotional effects on articulatory-acoustic speech characteristics with respect to word location within a sentence. We examined the hypothesis that emotional effects will vary based on word position, by first examining articulatory features manually extracted from electromagnetic articulography data. Initial articulatory data analyses indicated that the emotional effects on sentence-medial words are significantly stronger than on initial words. To verify that observation further, we expanded our hypothesis testing to include both acoustic and articulatory data, and a consideration of an expanded set of words from different locations. Results suggest that emotional effects are generally more significant on sentence-medial words than on sentence-initial and final words. This finding suggests that word location needs to be considered as a factor in emotional speech processing.

Perceived Loudness and Voice Quality in Affect Cueing
Irena Yanushevskaya, Christer Gobl, Ailbhe Ní Chasaide; Trinity College Dublin, Ireland
Wed-Ses2-P1-5, Time: 13:30
The paper describes an auditory experiment aimed at testing whether the intrinsic loudness of a stimulus with a given voice quality influences the way in which it signals affect. Synthesised voice quality stimuli in which intrinsic loudness was systematically manipulated were presented to listeners to test the effect of this manipulation on the affective colouring of the stimuli. The results showed that even when devoid of intrinsic loudness variation, non-modal voice quality stimuli were capable of communicating affect. However, changing the loudness of a non-modal voice quality stimulus towards its intrinsic loudness resulted in an increase in affective ratings.

CMAC for Speech Emotion Profiling
Norhaslinda Kamaruddin, Abdul Wahab; Nanyang Technological University, Singapore
Wed-Ses2-P1-8, Time: 13:30
Cultural differences have been one of the many factors that can cause failures in speech emotion analysis. If this cultural parameter could be regarded as a noise artifact in detecting emotion in speech, we could then extract the pure emotion speech signal from the raw emotional speech. In this paper we use the amplitude spectral subtraction (ASS) method to profile the emotion from raw emotional speech based on the affection space model. In addition, the robustness of the cerebellar model arithmetic computer (CMAC) is used to ensure that all other noise artifacts can be suppressed. Results from the speech emotion profiling show the potential of such a technique to visualize hidden features for detecting intra-cultural and inter-cultural variation that is missing from current approaches to speech emotion recognition.

On the Relevance of High-Level Features for Speaker Independent Emotion Recognition of Spontaneous Speech
Marko Lugger, Bin Yang; Universität Stuttgart, Germany

Spoken dialogue researchers often use supervised machine learning to classify turn-level user affect from a set of turn-level features. The utility of sub-turn features has been less explored, due to the complications introduced by associating a variable number of sub-turn units with a single turn-level classification. We present and evaluate several voting methods for using word-level pitch and energy features to classify turn-level user uncertainty in spoken dialogue data. Our results show that when linguistic knowledge regarding prosody and word position is introduced into a word-level voting model, classification accuracy is significantly improved compared to the use of both turn-level and uninformed word-level models.
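The turn-level uncertainty study above aggregates word-level predictions into a single turn-level decision by voting. A minimal weighted-voting sketch is given below; the position weighting is an illustrative assumption, not the weighting the authors report.

```python
from collections import defaultdict
from typing import List, Tuple

def classify_turn(word_votes: List[Tuple[str, int]], n_words: int) -> str:
    """Aggregate word-level labels into one turn-level label.

    `word_votes` pairs each word-level label with its word position.
    Words near the end of the turn receive a larger (assumed) weight,
    as a stand-in for position-informed voting.
    """
    scores = defaultdict(float)
    for label, position in word_votes:
        weight = 1.0 + position / max(n_words - 1, 1)   # illustrative weighting
        scores[label] += weight
    return max(scores, key=scores.get)

votes = [("uncertain", 0), ("certain", 1), ("uncertain", 2), ("uncertain", 3)]
print(classify_turn(votes, n_words=4))   # -> "uncertain"
```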
Wed-Ses2-P1-9, Time: 13:30 Detecting Subjectivity in Multiparty Speech In this paper we study the relevance of so called high-level speech features for the application of speaker independent emotion recognition. After we give a brief definition of high-level features, we discuss for which standard feature groups high-level features are conceivable. Two groups of high-level features are proposed within this paper: a feature set for the parametrization of phonation called voice quality parameters and a second feature set deduced from music theory called harmony features. Harmony features give information about the frequency interval and chord content of the pitch data of a spoken utterance. Finally, we study the gain in classification rate by combining the proposed high-level features with the standard low-level features. We show that both high-level feature sets improve the speaker independent classification performance for spontaneous emotional speech. Gabriel Murray, Giuseppe Carenini; University of British Columbia, Canada Pitch Contour Parameterisation Based on Linear Stylisation for Emotion Recognition Recognising Interest in Conversational Speech — Comparing Bag of Frames and Supra-Segmental Features Vidhyasaharan Sethu, Eliathamby Ambikairajah, Julien Epps; University of New South Wales, Australia Björn Schuller, Gerhard Rigoll; Technische Universität München, Germany Wed-Ses2-P2-3, Time: 13:30 Wed-Ses2-P1-10, Time: 13:30 It is common knowledge that affective and emotion-related states are acoustically well modelled on a supra-segmental level. Nonetheless successes are reported for frame-level processing either by means of dynamic classification or multi-instance learning techniques. In this work a quantitative feature-type-wise comparison between frame-level and supra-segmental analysis is carried out for the recognition of interest in human conversational speech. To shed light on the respective differences the same classifier, namely Support-Vector-Machines, is used in both cases: once by clustering a ‘bag of frames’ of unknown sequence length employing MultiInstance Learning techniques, and once by statistical functional application for the projection of the time series onto a static feature vector. As database serves the Audiovisual Interest Corpus of naturalistic interest. Wed-Ses2-P2 : Expression, Emotion and Personality Recognition Hewison Hall, 13:30, Wednesday 9 Sept 2009 Chair: John H.L. Hansen, University of Texas at Dallas, USA Classifying Turn-Level Uncertainty Using Word-Level Prosody Diane Litman 1 , Mihai Rotaru 2 , Greg Nicholas 3 ; 1 University of Pittsburgh, USA; 2 Textkernel B.V., The Netherlands; 3 Brown University, USA Wed-Ses2-P2-1, Time: 13:30 Wed-Ses2-P2-2, Time: 13:30 In this research we aim to detect subjective sentences in spontaneous speech and label them for polarity. We introduce a novel technique wherein subjective patterns are learned from both labeled and unlabeled data, using n-grams with varying levels of lexical instantiation. Applying this technique to meeting speech, we gain significant improvement over state-of-the-art approaches and demonstrate the method’s robustness to ASR errors. We also show that coupling the pattern-based approach with structural and lexical features of meetings yields additional improvement. The pitch contour contains information that characterises the emotion being expressed by speech, and consequently features extracted from pitch form an integral part of many automatic emotion recognition systems. 
While pitch contours may have many small variations and hence are difficult to represent compactly, it may be possible to parameterise them by approximating the contour for each voiced segment by a straight line. This paper looks at such a parameterisation method in the context of emotion recognition. Listening tests were performed to subjectively determine if the linearly stylised contours were able to sufficiently capture information pertaining to emotions expressed in speech. Furthermore, these parameters were used as features for an automatic 5-class emotion classification system. The use of the proposed parameters rather than pitch statistics resulted in a relative increase in accuracy of about 20%.

Feature-Based and Channel-Based Analyses of Intrinsic Variability in Speaker Verification
Martin Graciarena 1 , Tobias Bocklet 2 , Elizabeth Shriberg 1 , Andreas Stolcke 1 , Sachin Kajarekar 1 ; 1 SRI International, USA; 2 FAU Erlangen-Nürnberg, Germany
Wed-Ses2-P2-4, Time: 13:30
We explore how intrinsic variations (those associated with the speaker rather than the recording environment) affect text-independent speaker verification performance. In a previous paper we introduced the SRI-FRTIV corpus and provided speaker verification results using a Gaussian mixture model (GMM) system on telephone-channel speech. In this paper we explore the use of other speaker verification systems on the telephone channel data and
Still, a great deal of useful information about expressivity and emotion can be gained from segmental spectral features, which provide a more detailed description of the speech signal, or from measurements from specific regions of the utterance, such as the stressed vowels. Here we introduce a novel set of spectral features for emotion recognition: statistics of Mel-Frequency Spectral Coefficients computed over three phoneme type classes of interest: stressed vowels, unstressed vowels and consonants in the utterance. We investigate performance of our features in the task of speaker-independent emotion recognition using two publicly available datasets. Our experimental results clearly indicate that indeed both the richer set of spectral features and the differentiation between phoneme type classes are beneficial for the task. Classification accuracies are consistently higher for our features compared to prosodic features or utterance-level spectral features. Combination of our phoneme class features with prosodic features leads to even further improvement. Arousal and Valence Prediction in Spontaneous Emotional Speech: Felt versus Perceived Emotion Khiet P. Truong 1 , David A. van Leeuwen 2 , Mark A. Neerincx 2 , Franciska M.G. de Jong 1 ; 1 University of Twente, The Netherlands; 2 TNO Defence, The Netherlands Wed-Ses2-P2-7, Time: 11:20 In this paper, we describe emotion recognition experiments carried out for spontaneous affective speech with the aim to compare the added value of annotation of felt emotion versus annotation of perceived emotion. Using speech material available in the tno-gaming corpus (a corpus containing audiovisual recordings of people playing videogames), speech-based affect recognizers were developed that can predict Arousal and Valence scalar values. Two types of recognizers were developed in parallel: one trained with felt emotion annotations (generated by the gamers themselves) and one trained with perceived/observed emotion annotations (generated by a group of observers). The experiments showed that, in speech, with the methods and features currently used, observed emotions are easier to predict than felt emotions. The results suggest that recognition performance strongly depends on how and by whom the emotion annotations are carried out. Dimension Reduction Approaches for SVM Based Speaker Age Estimation Gil Dobry 1 , Ron M. Hecht 2 , Mireille Avigal 1 , Yaniv Zigel 3 ; 1 Open University of Israel, Israel; 2 PuddingMedia, Israel; 3 Ben-Gurion University of the Negev, Israel Wed-Ses2-P2-8, Time: 13:30 This paper presents two novel dimension reduction approaches applied on the gaussian mixture model (GMM) supervectors to improve age estimation speed and accuracy. The GMM supervector embodies many speech characteristics irrelevant to age estimation and like noise, they are harmful to the system’s generalization ability. In addition, the support vectors machine (SVM) testing computation grows with the vector’s dimension, especially when using complex kernels. The first approach presented is the weightedpairwise principal components analysis (WPPCA) that reduces the vector dimension by minimizing the redundant variability. The second approach is based on anchor-models, using a novel anchors selection method. Experiments showed that dimension reduction makes the testing process 5 times faster and using the WPPCA approach, it is also 5% more accurate. 
ANN Based Decision Fusion for Speech Emotion Recognition
Lu Xu 1, Mingxing Xu 1, Dali Yang 2; 1 Tsinghua University, China; 2 Beijing Information Science & Technology University, China
Wed-Ses2-P2-9, Time: 13:30
Speech emotion recognition is an active research field that has attracted increasing attention from both academia and industry. In this paper, we propose a method that recognizes speech emotions with ANNs and fuses two kinds of recognizers, built on different features, at the decision level. Each emotional utterance is first recognized by several individual recognizers; the outputs of these recognizers are then fused using a voting strategy. Furthermore, the dimensionality of the supervectors constructed from spectral features is reduced through PCA. Experimental results demonstrate that the proposed decision fusion is effective and that the dimensionality reduction is feasible.

Processing Affected Speech Within Human Machine Interaction
Bogdan Vlasenko, Andreas Wendemuth; Otto-von-Guericke-Universität Magdeburg, Germany
Wed-Ses2-P2-10, Time: 13:30
Spoken dialog systems (SDS) integrated into human-machine interfaces are becoming standard technology. Current state-of-the-art SDS are usually not able to provide the user with a natural way of communicating, and existing automated dialog systems do not pay enough attention to interaction problems related to affected user behavior. As a result, Automatic Speech Recognition (ASR) engines are not able to recognize affected speech, and the dialog strategy does not make use of the user's emotional state. This paper addresses several aspects of processing affected speech within natural human-machine interaction. First, we propose an ASR engine adapted to affected speech. Second, we describe our methods for emotion recognition from speech and present our emotion classification results within the Interspeech 2009 Emotion Challenge. Third, we test the speech recognition models adapted to affected speech and introduce an approach to emotion-adaptive dialog management in human-machine interaction.

Emotion Recognition from Speech Using Extended Feature Selection and a Simple Classifier
Ali Hassan, Robert I. Damper; University of Southampton, UK
Wed-Ses2-P2-11, Time: 13:30
We describe extensive experiments on the recognition of emotion from speech using acoustic features only. Two databases of acted emotional speech (Berlin and DES) have been used in this work. The principal focus is on methods for selecting good features from a relatively large set of hand-crafted features, perhaps formed by fusing different feature sets used by different researchers. We show that the monotonic assumption underlying popular sequential selection algorithms does not hold, and use this finding to improve recognition accuracy. We show further that a very simple classifier (k-nearest neighbour) produces better results than any so far reported by other researchers on these databases, suggesting that previous work has failed to match the complexity of the classifier to the complexity of the data. Finally, several potentially fruitful avenues for future work are outlined.

Wed-Ses2-P3 : Speech Synthesis Methods
Hewison Hall, 13:30, Wednesday 9 Sept 2009
Chair: Nobuaki Minematsu, University of Tokyo, Japan

Optimal Event Search Using a Structural Cost Function — Improvement of Structure to Speech Conversion
Daisuke Saito, Yu Qiao, Nobuaki Minematsu, Keikichi Hirose; University of Tokyo, Japan
Wed-Ses2-P3-1, Time: 13:30
This paper describes a new and improved method for the framework of structure to speech conversion we previously proposed. Most speech synthesizers take a phoneme sequence as input and generate speech by converting each of the phonemes into its corresponding sound; in other words, they simulate the human process of reading text aloud. However, infants usually acquire speech communication ability without text or phoneme sequences. Since their phonemic awareness is very immature, they can hardly decompose an utterance into a sequence of phones or phonemes. As developmental psychology claims, infants acquire the holistic sound patterns of words, called word Gestalt, from the utterances of their parents, and they reproduce them with their vocal tubes. This behavior is called vocal imitation. In our previous studies, the word Gestalt was defined physically and a method of extracting it from a word utterance was proposed. We have already applied the word Gestalt to ASR, CALL, and also speech generation, which we call structure to speech conversion. Unlike reading machines, our framework simulates infants' vocal imitation. In this paper, a method for improving our speech generation framework based on a structural cost function is proposed and evaluated.

Deriving Vocal Tract Shapes from Electromagnetic Articulograph Data via Geometric Adaptation and Matching
Ziad Al Bawab 1, Lorenzo Turicchia 2, Richard M. Stern 1, Bhiksha Raj 1; 1 Carnegie Mellon University, USA; 2 MIT, USA
Wed-Ses2-P3-2, Time: 13:30
In this paper, we present our efforts towards deriving vocal tract shapes from ElectroMagnetic Articulograph (EMA) data via geometric adaptation and matching. We describe a novel approach for adapting Maeda's geometric model of the vocal tract to one speaker in the MOCHA database. We show how we can rely solely on the EMA data for adaptation. We present our search technique for the vocal tract shapes that best fit the given EMA data. We then describe our approach to synthesizing speech from these shapes. Results on Mel-cepstral distortion reflect an improvement in synthesis over the approach we used before without adaptation.

Towards Unsupervised Articulatory Resynthesis of German Utterances Using EMA Data
Ingmar Steiner 1, Korin Richmond 2; 1 Universität des Saarlandes, Germany; 2 University of Edinburgh, UK
Wed-Ses2-P3-3, Time: 13:30
As part of ongoing research towards integrating an articulatory synthesizer into a text-to-speech (TTS) framework, a corpus of German utterances recorded with electromagnetic articulography (EMA) is resynthesized to provide training data for statistical models. The resynthesis is based on a measure of similarity between the original and resynthesized EMA trajectories, weighted by articulatory relevance. Preliminary results are discussed and future work outlined.

The KlattGrid Speech Synthesizer
David Weenink; University of Amsterdam, The Netherlands
Wed-Ses2-P3-4, Time: 13:30
We present a new speech synthesizer class, named KlattGrid, for the Praat program [3]. This synthesizer is based on the original description of Klatt [1, 2]. New aspects of a KlattGrid in comparison with other Klatt-type synthesizers are that a KlattGrid:
• is not frame-based but time-based; you specify parameters as a function of time with any precision you like.
• has no limitations on the number of oral formants, nasal formants, nasal antiformants, tracheal formants or tracheal antiformants that can be defined.
• has separate formants for the frication part.
• allows varying the form of the glottal flow function as a function of time.
• allows any number of formants and bandwidths to be modified during the open phase of the glottis.
• uses no prior quantization of amplitude parameters.
• is fully integrated into the freely available speech analysis program Praat [3].

Development of a Kenyan English Text to Speech System: A Method of Developing a TTS for a Previously Undefined English Dialect
Mucemi Gakuru; Teknobyte Ltd., Kenya
Wed-Ses2-P3-5, Time: 13:30
This work provides a method for building an English TTS for a population who speak a dialect that is not yet defined and for which no resources exist, by showing how a Text to Speech System (TTS) was developed for the English dialect spoken in Kenya. To begin with, the existence of a unique English dialect which had not previously been defined was confirmed from the need of the English-speaking Kenyan population to have a TTS in an accent different from the British one. This dialect is referred to here, and has also been branded, as Kenyan English. Given that building a TTS requires language features to be adequately defined, it was necessary to develop the essential features of the dialect, such as the phoneset and the lexicon, and then verify their correctness. The paper shows how it was possible to come up with a systematic approach for defining these features by tracing the evolution of the dialect. It also discusses how the TTS was built and tested.

Feedback Loop for Prosody Prediction in Concatenative Speech Synthesis
Javier Latorre 1, Sergio Gracia 2, Masami Akamine 1; 1 Toshiba Corporate R&D Center, Japan; 2 Universitat Politècnica de Catalunya, Spain
Wed-Ses2-P3-6, Time: 13:30
We propose a method for concatenative speech synthesis that makes it possible to obtain a better match between the log F0 and duration predicted by the prosody module and the waveform generation back-end. The proposed method is based upon our previous multi-level parametric F0 model and Toshiba's plural unit selection and fusion synthesizer. The method adds a feedback loop from the back-end into the prosody module so that the prosodic information of the selected units is used to re-estimate new prosody values. The feedback loop defines a frame-level prosody model which consists of the average value and variance of the duration and log F0 of the selected units. The log-likelihood defined by this model is added to the log-likelihood of the prosody model. From the maximization of this total log-likelihood, we obtain the prosody values that produce the optimum compromise between the distortion introduced by F0 discontinuities and the distortion created by the prosody-adjusting signal processing.

Assessing a Speaker for Fast Speech in Unit Selection Speech Synthesis
Donata Moers 1, Petra Wagner 2; 1 Rheinische Friedrich-Wilhelms-Universität Bonn, Germany; 2 Universität Bielefeld, Germany
Wed-Ses2-P3-7, Time: 13:30
This paper describes work in progress concerning the adequate modeling of fast speech in unit selection speech synthesis systems, mostly having in mind blind and visually impaired users. Initially, a survey of the main characteristics of fast speech is given. Subsequently, strategies for fast speech production are discussed, and certain requirements concerning the ability of the speaker of a fast speech unit selection inventory are drawn. The following section deals with a perception study in which a selected speaker's ability to speak fast is investigated. To conclude, a preliminary perceptual analysis of the recordings for the speech synthesis corpus is presented.

Unit Selection Based Speech Synthesis for Poor Channel Condition
Ling Cen, Minghui Dong, Paul Chan, Haizhou Li; Institute for Infocomm Research, Singapore
Wed-Ses2-P3-8, Time: 13:30
Synthesized speech can be severely degraded in noise, resulting in compromised speech quality. In this paper, we propose a unit selection based speech synthesis system for better speech quality under poor channel conditions. First, a measure of speech intelligibility is incorporated in the cost function as a search criterion for unit selection. Next, the prosody of the selected units is modified according to the Lombard effect; prosody modification includes increasing the amplitude of unvoiced phonemes and lengthening the speech duration. Finally, FIR equalization via convex optimization is applied to reduce signal distortion due to the channel effect. Listening tests in our experiments show that the quality of synthetic speech under poor channel conditions can be improved with the help of our proposed synthesis system.

Vocalic Sandwich, a Unit Designed for Unit Selection TTS
Didier Cadic 1, Cédric Boidin 1, Christophe d'Alessandro 2; 1 Orange Labs, France; 2 LIMSI, France
Wed-Ses2-P3-9, Time: 13:30
Unit selection text-to-speech systems currently produce very natural synthetic sentences by concatenating speech segments from a large database. Recently, increasing demand for designing high-quality voices with less data has created a need for further optimization of the textual corpus recorded by the speaker. The optimization of this corpus is traditionally guided by the coverage rate of well-known units: triphones, words, etc. Such units are, however, not dedicated to concatenative speech synthesis; they are of general use in speech technologies and linguistics. In this paper, we describe a new unit which takes into account the specific features of concatenative TTS: the "vocalic sandwich." Both an objective and a perceptual evaluation tend to show that vocalic sandwiches are appropriate units for corpus design.

Speech Synthesis Based on the Plural Unit Selection and Fusion Method Using FWF Model
Ryo Morinaka, Masatsune Tamura, Masahiro Morita, Takehiko Kagoshima; Toshiba Corporate R&D Center, Japan
Wed-Ses2-P3-10, Time: 13:30
Speech synthesizers are required to offer greater diversity and improved quality of synthesized speech. Speaker interpolation and voice conversion are techniques that enhance diversity. The PUSF (plural unit selection and fusion) method, which we have proposed, generates synthesized waveforms using pitch-cycle waveforms; however, it is difficult to modify its spectral features while keeping the naturalness of the synthesized speech. In the present work, we investigated how best to represent speech waveforms. First, we introduce a method that decomposes a pitch waveform in a voiced portion into a periodic component, which is excited by the vocal sound source, and an aperiodic component, which is excited by a noise source. Moreover, we introduce the FWF (formant waveform) model to represent the periodic component. Because the FWF model represents the pitch waveform in terms of formant parameters, it can control the formant parameters independently.
We realized a method that can easily be applied to the diversity-enhancing techniques in the PUSF-based method because this model is based on vocal tract features. sistance. In this paper we present a real-time system for automatic subtitling of live broadcast news in Spanish based on the News Redaction Computer texts and an Automatic Speech Recognition engine to provide precise temporal alignment of speech to text scripts with negligible latency. The presented system is working satisfactory on the Aragonese Public Television from June 2008 without human assistance. Speech Synthesis Without a Phone Inventory Development of the 2008 SRI Mandarin Speech-to-Text System for Broadcast News and Conversation Matthew P. Aylett, Simon King, Junichi Yamagishi; University of Edinburgh, UK Wed-Ses2-P3-11, Time: 13:30 In speech synthesis the unit inventory is decided using phonological and phonetic expertise. This process is resource intensive and potentially sub-optimal. In this paper we investigate how acoustic clustering, together with lexicon constraints, can be used to build a self-organised inventory. Six English speech synthesis systems were built using two frameworks, unit selection and parametric HTS for three inventory conditions: 1) a traditional phone set, 2) a system using orthographic units, and 3) a self-organised inventory. A listening test showed a strong preference for the classic system, and for the orthographic system over the self-organised system. Results also varied by letter to sound complexity and database coverage. This suggests the self-organised approach failed to generalise pronunciation as well as introducing noise above and beyond that caused by orthographic sound mismatch. Context-Dependent Additive log F0 Model for HMM-Based Speech Synthesis Xin Lei 1 , Wei Wu 2 , Wen Wang 1 , Arindam Mandal 1 , Andreas Stolcke 1 ; 1 SRI International, USA; 2 University of Washington, USA Wed-Ses2-P4-2, Time: 13:30 We describe the recent progress in SRI’s Mandarin speech-to-text system developed for 2008 evaluation in the DARPA GALE program. A data-driven lexicon expansion technique and language model adaptation methods contribute to the improvement in recognition performance. Our system yields 8.3% character error rate on the GALE dev08 test set, and 7.5% after combining with RWTH systems. Compared to our 2007 evaluation system, a significant improvement of 13% relative has been achieved. Multifactor Adaptation for Mandarin Broadcast News and Conversation Speech Recognition Wen Wang, Arindam Mandal, Xin Lei, Andreas Stolcke, Jing Zheng; SRI International, USA Heiga Zen, Norbert Braunschweiler; Toshiba Research Europe Ltd., UK Wed-Ses2-P4-3, Time: 13:30 Wed-Ses2-P3-12, Time: 13:30 This paper proposes a context-dependent additive acoustic modelling technique and its application to logarithmic fundamental frequency (log F0 ) modelling for HMM-based speech synthesis. In the proposed technique, mean vectors of state-output distributions are composed as the weighted sum of decision tree-clustered context-dependent bias terms. Its model parameters and decision trees are estimated and built based on the maximum likelihood (ML) criterion. The proposed technique has the potential to capture the additive structure of log F0 contours. A preliminary experiment using a small database showed that the proposed technique yielded encouraging results. 
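The additive composition described in the Wed-Ses2-P3-12 abstract above (state-output mean vectors formed as a weighted sum of decision tree-clustered, context-dependent bias terms) can be illustrated with a minimal sketch. The factor names, leaf labels, weights and numbers below are invented purely for illustration; this is not the authors' model or parameterisation.

```python
# Minimal sketch of composing an additive log-F0 state mean from
# decision tree-clustered bias terms, one per context factor.
import numpy as np

STATE_DIM = 3  # e.g. static + delta + delta-delta log F0 (illustrative)

# Clustered bias terms: one lookup table per (hypothetical) context factor.
bias_tables = {
    "phone_identity": {"leaf_a": np.array([5.0, 0.00, 0.00]),
                       "leaf_b": np.array([4.8, 0.01, 0.00])},
    "accent_type": {"leaf_a": np.array([0.2, 0.02, 0.00]),
                    "leaf_b": np.array([-0.1, 0.00, 0.01])},
    "phrase_position": {"leaf_a": np.array([0.1, 0.00, 0.00]),
                        "leaf_b": np.array([-0.3, -0.01, 0.00])},
}
weights = {"phone_identity": 1.0, "accent_type": 0.8, "phrase_position": 0.5}

def compose_mean(leaves):
    """Weighted sum of the context-dependent bias terms selected by the trees."""
    mean = np.zeros(STATE_DIM)
    for factor, leaf in leaves.items():
        mean += weights[factor] * bias_tables[factor][leaf]
    return mean

# Example: the trees for one HMM state map its context to these leaves.
print(compose_mean({"phone_identity": "leaf_a",
                    "accent_type": "leaf_b",
                    "phrase_position": "leaf_a"}))
```

The appeal of such an additive structure is that each factor's contribution to the log F0 contour is modelled and clustered separately, rather than forcing every context combination into a single tree.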
Wed-Ses2-P4 : LVCSR Systems and Spoken Term Detection We explore the integration of multiple factors such as genre and speaker gender for acoustic model adaptation tasks to improve Mandarin ASR system performance on broadcast news and broadcast conversation audio. We investigate the use of multifactor clustering of acoustic model training data and the application of MPE-MAP and fMPE-MAP acoustic model adaptations. We found that by effectively combining these adaptation approaches, we achieve 6% relative reduction in recognition error rate compared to a Mandarin recognition system that does not use genre-specific acoustic models, and 5% relative improvement if the genre-adaptive system is combined with another, genre-independent state-of-theart system. Development of the GALE 2008 Mandarin LVCSR System C. Plahl, Björn Hoffmeister, Georg Heigold, Jonas Lööf, Ralf Schlüter, Hermann Ney; RWTH Aachen University, Germany Hewison Hall, 13:30, Wednesday 9 Sept 2009 Chair: Simon King, University of Edinburgh, UK Wed-Ses2-P4-4, Time: 13:30 Real-Time Live Broadcast News Subtitling System for Spanish Alfonso Ortega, Jose Enrique Garcia, Antonio Miguel, Eduardo Lleida; Universidad de Zaragoza, Spain Wed-Ses2-P4-1, Time: 13:30 Subtitling of live broadcast news is a very important application to meet the needs of deaf and hard of hearing people. However, live subtitling is a high cost operation in terms of qualification human resources and therefore, money if high precision is desired. Automatic Speech Recognition researchers can help to perform this task saving both time and money developing systems that deliver subtitles fully synchronized with speech without human as- This paper describes the current improvements of the RWTH Mandarin LVCSR system. We introduce vocal tract length normalization for the Gammatone features and present comparable results for Gammatone based feature extraction and classical feature extraction. In order to benefit from the huge amount of data of 1600h available in the GALE project we have trained the acoustic models up to 8M Gaussians. We present detailed character error rates for the different number of Gaussians. Different kinds of systems are developed and a two stage decoding framework is applied, which uses cross-adaptation and a subsequent lattice-based system combination. In addition to various acoustic front-ends, these systems use different kinds of neural network toneme posterior features. We present detailed Notes 135 recognition results of the development cycle and the different acoustic front-ends of the systems. Finally, we compare the ultimate evaluation system to our last years system and can report a 10% relative improvement. The RWTH Aachen University Open Source Speech Recognition System Improvements to the LIUM French ASR System Based on CMU Sphinx: What Helps to Significantly Reduce the Word Error Rate? Paul Deléglise, Yannick Estève, Sylvain Meignier, Teva Merlin; LIUM, France Wed-Ses2-P4-8, Time: 13:30 David Rybach, Christian Gollan, Georg Heigold, Björn Hoffmeister, Jonas Lööf, Ralf Schlüter, Hermann Ney; RWTH Aachen University, Germany Wed-Ses2-P4-5, Time: 13:30 We announce the public availability of the RWTH Aachen University speech recognition toolkit. The toolkit includes state of the art speech recognition technology for acoustic model training and decoding. Speaker adaptation, speaker adaptive training, unsupervised training, a finite state automata library, and an efficient tree search decoder are notable components. 
Comprehensive documentation, example setups for training and recognition, and a tutorial are provided to support newcomers. This paper describes the new ASR system developed by the LIUM and analyzes the various origins of the significant drop of the word error rate observed in comparison to the previous LIUM ASR system. This study was made on the test data of the latest evaluation campaign of ASR systems on French broadcast news, called ESTER 2 and organized in December 2008. For the same computation time, the new system yields a word error rate about 38% lower than what the previous system (which reached the second position during the ESTER 1 evaluation campaign) did. This paper evaluates the gain provided by various changes to the system: implementation of new search and training algorithms, new training data, vocabulary size, etc. The LIUM ASR system was the best open-source ASR system of the ESTER 2 campaign. Merging Search Spaces for Subword Spoken Term Detection Online Detecting End Times of Spoken Utterances for Synchronization of Live Speech and its Transcripts Timo Mertens 1 , Daniel Schneider 2 , Joachim Köhler 2 ; 1 NTNU, Norway; 2 Fraunhofer IAIS, Germany Jie Gao, Qingwei Zhao, Yonghong Yan; Chinese Academy of Sciences, China Wed-Ses2-P4-9, Time: 13:30 Wed-Ses2-P4-6, Time: 13:30 In this paper, we present our initial efforts in the task of Automatically Synchronizing live spoken Utterances with their Transcripts (textual contents) (ASUT). We address the problem of online detecting of the end time of a spoken utterance given its textual content, which is one of the key problems of the ASUT task. A framesynchronous likelihood ratio test (FS-LRT) procedure is proposed and explored under the hidden Markov model (HMM) framework. The property of FS-LRT is studies empirically. Experiments indicate that our proposed approach shows satisfying performance. In addition, the proposed procedure has been successfully applied in a subtitling system for live broadcast news. Real-Time ASR from Meetings Philip N. Garner 1 , John Dines 1 , Thomas Hain 2 , Asmaa El Hannani 2 , Martin Karafiát 3 , Danil Korchagin 1 , Mike Lincoln 4 , Vincent Wan 2 , Le Zhang 4 ; 1 IDIAP Research Institute, Switzerland; 2 University of Sheffield, UK; 3 Brno University of Technology, Czech Republic; 4 University of Edinburgh, UK Wed-Ses2-P4-7, Time: 13:30 The AMI(DA) system is a meeting room speech recognition system that has been developed and evaluated in the context of the NIST Rich Text (RT) evaluations. Recently, the “Distant Access” requirements of the AMIDA project have necessitated that the system operate in real-time. Another more difficult requirement is that the system fit into a live meeting transcription scenario. We describe an infrastructure that has allowed the AMI(DA) system to evolve into one that fulfils these extra requirements. We emphasise the components that address the live and real-time aspects. We describe how complementary search spaces, addressed by two different methods used in Spoken Term Detection (STD), can be merged for German subword STD. We propose fuzzy-search techniques on lattices to narrow the gap between subword and word retrieval. The first technique is based on an edit-distance, where no a priori knowledge about confusions is employed. Additionally, we propose a weighting method which explicitly models pronunciation variation on a subword level and thus improves robustness against false positives. 
Recall is improved by 6% absolute when retrieving on the merged search space rather than using an exact lattice search. By modeling subword pronunciation variation, we increase recall in a high-precision setting by 2% absolute compared to the edit-distance method. A Posterior Probability-Based System Hybridisation and Combination for Spoken Term Detection Javier Tejedor 1 , Dong Wang 2 , Simon King 2 , Joe Frankel 2 , José Colás 1 ; 1 Universidad Autónoma de Madrid, Spain; 2 University of Edinburgh, UK Wed-Ses2-P4-10, Time: 13:30 Spoken term detection (STD) is a fundamental task for multimedia information retrieval. To improve the detection performance, we have presented a direct posterior-based confidence measure generated from a neural network. In this paper, we propose a detection-independent confidence estimation based on the direct posterior confidence measure, in which the decision making is totally separated from the term detection. Based on this idea, we first present a hybrid system which conducts the term detection and confidence estimation based on different sub-word units and then propose a combination method which merges detections from heterogeneous term detectors based on the direct posterior-based confidence. Experimental results demonstrated that the proposed methods improved system performance considerably for both English and Spanish. Notes 136 Stochastic Pronunciation Modelling for Spoken Term Detection Dong Wang, Simon King, Joe Frankel; University of Edinburgh, UK Wed-Ses2-P4-11, Time: 13:30 A major challenge faced by a spoken term detection (STD) system is the detection of out-of-vocabulary (OOV) terms. Although a subword-based STD system is able to detect OOV terms, performance reduction is always observed compared to in-vocabulary terms. Current approaches to STD do not acknowledge the particular properties of OOV terms, such as pronunciation uncertainty. In this paper, we use a stochastic pronunciation model to deal with the uncertain pronunciations of OOV terms. By considering all possible term pronunciations, predicted by a joint-multigram model, we observe a significant performance improvement. quickly from a very large speech database without using a large memory space. To accelerate searches and save memory, we used a suffix array as the data structure and applied phoneme-based DP-matching. To avoid an exponential increase in the process time with the length of the keyword, a long keyword is divided into short sub-keywords. Moreover, an iterative lengthening search algorithm is used to rapidly output accurate search results. The experimental results show that it takes less than 100ms to detect the first set of search results from a 10,000-h virtual speech database. Wed-Ses2-S1 : Special Session: Active Listening & Synchrony Ainsworth (East Wing 4), 13:30, Wednesday 9 Sept 2009 Chair: Nick Campbell, Trinity College Dublin, Ireland Term-Dependent Confidence for Out-of-Vocabulary Term Detection Understanding Speaker-Listener Interactions Dong Wang, Simon King, Joe Frankel, Peter Bell; University of Edinburgh, UK Wed-Ses2-S1-1, Time: 13:30 Dirk Heylen; University of Twente, The Netherlands Wed-Ses2-P4-12, Time: 13:30 Within a spoken term detection (STD) system, the decision maker plays an important role in retrieving reliable detections. Most of the state-of-the-art STD systems make decisions based on a confidence measure that is term-independent, which poses a serious problem for out-of-vocabulary (OOV) term detection. 
In this paper, we study a term-dependent confidence measure based on confidence normalisation and discriminative modelling, particularly focusing on its remarkable effectiveness for detecting OOV terms. Experimental results indicate that the term-dependent confidence provides much more significant improvement for OOV terms than terms in-vocabulary. A Comparison of Query-by-Example Methods for Spoken Term Detection Wade Shen, Christopher M. White, Timothy J. Hazen; MIT, USA Wed-Ses2-P4-13, Time: 13:30 In this paper we examine an alternative interface for phonetic search, namely query-by-example, that avoids OOV issues associated with both standard word-based and phonetic search methods. We develop three methods that compare query lattices derived from example audio against a standard ngram-based phonetic index and we analyze factors affecting the performance of these systems. We show that the best systems under this paradigm are able to achieve 77% precision when retrieving utterances from conversational telephone speech and returning 10 results from a single query (performance that is better than a similar dictionarybased approach) suggesting significant utility for applications requiring high precision. We also show that these systems can be further improved using relevance feedback: By incorporating four additional queries the precision of the best system can be improved by 13.7% relative. Our systems perform well despite high phone recognition error rates (> 40%) and make use of no pronunciation or letter-to-sound resources. We provide an eclectic generic framework to understand the back and forth interactions between participants in a conversation highlighting the complexity of the actions that listeners are engaged in. Communicative actions of one participant implicate the “other” in many ways. In this paper, we try to enumerate some essential relevant dimensions of this reciprocal dependence. Detecting Changes in Speech Expressiveness in Participants of a Radio Program Plínio A. Barbosa; State University of Campinas, Brazil Wed-Ses2-S1-2, Time: 13:50 A method for speech expressiveness change detection is presented which combines a dimensional analysis of speech expression, a Principal Component Analysis technique, as well as multiple regression analysis. From the three inferred rates of activation, valence, and involvement, two PCA-factors explain 97% of the variance of the judges’ evaluations of a corpus of radio show interaction. The multiple regression analysis predicted the values of the two listener-oriented, PCA-derived dimensions of promptness and empathy from the acoustic parameters automatically obtained from a set of 206 utterances produced by radio show’s participants. Analysed chronologically, the utterances reveal expression change from automatic acoustic analysis. An Audio-Visual Approach to Measuring Discourse Synchrony in Multimodal Conversation Data Nick Campbell; Trinity College Dublin, Ireland Wed-Ses2-S1-3, Time: 14:10 This paper describes recent work on the automatic extraction of visual and audio parameters relating to the detection of synchrony in discourse, and to the modelling of active listening for advanced speech technology. It reports findings based on image processing that reliably identify the strong entrainment between members of a group conversation, and describes techniques for the extraction and analysis of such information. 
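The indexing idea behind the fast keyword detection abstract nearby (Wed-Ses2-P4-14), namely a suffix array over the archived phone sequences that is searched per sub-keyword, can be illustrated with a short sketch. Exact binary search stands in here for the paper's phoneme-based DP-matching and iterative lengthening search, and the phone string and keyword are made-up examples.

```python
# Illustrative suffix-array lookup over a phone string.
import bisect

def build_suffix_array(text):
    """Return suffix start positions, sorted by the suffixes they index."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def find_occurrences(text, suffix_array, pattern):
    """All start positions of pattern, via binary search on the suffix array."""
    suffixes = [text[i:] for i in suffix_array]  # materialised for clarity only
    lo = bisect.bisect_left(suffixes, pattern)
    hi = bisect.bisect_right(suffixes, pattern + "\uffff")
    return sorted(suffix_array[lo:hi])

# Phone sequence of an archived utterance (space-free symbols for simplicity).
phones = "silkaqkiwakaqsil"
sa = build_suffix_array(phones)
print(find_occurrences(phones, sa, "kaq"))  # -> [3, 10]
```

Because every occurrence of a sub-keyword is located in logarithmic time in the number of suffixes, splitting a long keyword into short sub-keywords and merging their hit lists keeps the search time from growing exponentially with keyword length, which is the behaviour the abstract describes.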
Towards Flexible Representations for Analysis of Accommodation of Temporal Features in Spontaneous Dialogue Speech Fast Keyword Detection Using Suffix Array Kouichi Katsurada, Shigeki Teshima, Tsuneo Nitta; Toyohashi University of Technology, Japan Wed-Ses2-P4-14, Time: 13:30 In this paper, we propose a technique for detecting keywords Spyros Kousidis, David Dorran, Ciaran McDonnell, Eugene Coyle; Dublin Institute of Technology, Ireland Wed-Ses2-S1-4, Time: 14:30 Notes 137 Current advances in spoken interface design point towards a shift towards more “human-like” interaction, as opposed to the traditional “push-to-talk” approach. However, human dialogue is characterized by synchrony and multi-modality, and these properties are not captured by traditional representation approaches, such as turn succession. This paper proposes an alternative representation schema for recorded (human) dialogues, which employs per frame averages of speaker turn distribution, in order to inform further analyses of temporal features (pauses and overlaps) in terms of inter-speaker accommodation. Preliminary results of such analyses are provided. Are We ‘in Sync’: Turn-Taking in Collaborative Dialogues Štefan Beňuš; Constantine the Philosopher University in Nitra, Slovak Republic Wed-Ses2-S1-5, Time: 14:50 We used a corpus of collaborative task oriented dialogues in American English to compare two units of rhythmic structure — pitch accents and syllables — within the coupled oscillator model of rhythmical entrainment in turn-taking proposed in [1]. We found that pitch accents are a slightly better fit than syllables as the unit of rhythmical structure for the model, but we also observed weak support for the model in general. Some turn-taking types were rhythmically more salient than others. An Audio-Visual Attention System for Online Association Learning Martin Heckmann, Holger Brandl, Xavier Domont, Bram Bolder, Frank Joublin, Christian Goerick; Honda Research Institute GmbH, Germany Wed-Ses2-S1-6, Time: 15:10 We present an audio-visual attention system for speech based interaction with a humanoid robot where a tutor can teach visual properties/locations (e.g “left”) and corresponding, arbitrary speech labels. The acoustic signal is segmented via the attention system and speech labels are learned from a few repetitions of the label by the tutor. The attention system integrates bottom-up stimulus driven saliency calculation (delay-and-sum beamforming, adaptive noise level estimation) and top-down modulation (spectral properties, segment length, movement and interaction status of the robot). We evaluate the performance of different aspects of the system based on a small dataset. Large Margin Estimation of Gaussian Mixture Model Parameters with Extended Baum-Welch for Spoken Language Recognition Donglai Zhu, Bin Ma, Haizhou Li; Institute for Infocomm Research, Singapore Wed-Ses3-O1-2, Time: 16:20 Discriminative training (DT) methods of acoustic models, such as SVM and MMI-training GMM, have been proved effective in spoken language recognition. In this paper we propose a DT method for GMM using the large margin (LM) estimation. Unlike traditional MMI or MCE methods, the LM estimation attempts to enhance the generalization ability of GMM to deal with new data that exhibits mismatch with training data. We define the multi-class separation margin as a function of GMM likelihoods, and derive update formulae of GMM parameters with the extended Baum-Welch algorithm. 
Results on the NIST language recognition evaluation (LRE) 2007 task show that the LM estimation achieves better performance and faster convergent speed than the MMI estimation. Linguistically-Motivated Automatic Classification of Regional French Varieties Cécile Woehrling, Philippe Boula de Mareüil, Martine Adda-Decker; LIMSI, France Wed-Ses3-O1-3, Time: 16:40 The goal of this study is to automatically differentiate French varieties (standard French and French varieties spoken in the South of France, Alsace, Belgium and Switzerland) by applying a linguistically-motivated approach. We took advantage of automatic phoneme alignment to measure vowel formants, consonant (de)voicing, pronunciation variants as well as prosodic cues. These features were then used to identify French varieties by applying classification techniques. On large corpora of hundreds of speakers, over 80% correct identification scores were obtained. The confusions between varieties and the features used (by decision trees) are linguistically grounded. Discriminative Acoustic Language Recognition via Channel-Compensated GMM Statistics Niko Brümmer 1 , Albert Strasheim 1 , Valiantsina Hubeika 2 , Pavel Matějka 2 , Lukáš Burget 2 , Ondřej Glembek 2 ; 1 AGNITIO, South Africa; 2 Brno University of Technology, Czech Republic Wed-Ses3-O1-4, Time: 17:00 Wed-Ses3-O1 : Language Recognition Main Hall, 16:00, Wednesday 9 Sept 2009 Chair: Jan Černocký, Brno University of Technology, Czech Republic A Human Benchmark for Language Recognition Rosemary Orr, David A. van Leeuwen; ICSI, USA Wed-Ses3-O1-1, Time: 16:00 In this study, we explore a human benchmark in language recognition, for the purpose of comparing human performance to machine performance in the context of the NIST LRE 2007. Humans are categorised in terms of language proficiency, and performance is presented per proficiency. The main challenge in this work is the design of a test and application of a performance metric which allows a meaningful comparison of humans and machines. The main result of this work is that where subjects have lexical knowledge of a language, even at a low level, they perform as well as the state of the art in language recognition systems in 2007. We propose a novel design for acoustic feature-based automatic spoken language recognizers. Our design is inspired by recent advances in text-independent speaker recognition, where intraclass variability is modeled by factor analysis in Gaussian mixture model (GMM) space. We use approximations to GMM-likelihoods which allow variable-length data sequences to be represented as statistics of fixed size. Our experiments on NIST LRE’07 show that variability-compensation of these statistics can reduce error-rates by a factor of three. Finally, we show that further improvements are possible with discriminative logistic regression training. Language Score Calibration Using Adapted Gaussian Back-End Mohamed Faouzi BenZeghiba, Jean-Luc Gauvain, Lori Lamel; LIMSI, France Wed-Ses3-O1-5, Time: 17:20 Generative Gaussian back-end and discriminative logistic regres- Notes 138 sion are the most used approaches for language score fusion and calibration. Combination of these two approaches can significantly improve the performance. This paper proposes the use of an adapted Gaussian back-end, where the mean of the language-dependent Gaussian is adapted from the mean of a language-specific background Gaussian via maximum a posteriori estimation algorithm. Experiments are conducted using the LRE-07 evaluation data. 
Compared to the conventional Gaussian back-end approach for a closed set task, relative improvements in the Cavg of 50%, 17% and 4.2% are obtained on the 30s, 10s and 3s conditions, respectively. Besides this, the estimated scores are better calibrated. A combination with logistic regression results in a system with the best calibrated scores. A Framework for Discriminative SVM/GMM Systems for Language Recognition W.M. Campbell, Zahi N. Karam; MIT, USA Wed-Ses3-O1-6, Time: 17:40 Language recognition with support vector machines and shifteddelta cepstral features has been an excellent performer in NIST-sponsored language evaluation for many years. A novel improvement of this method has been the introduction of hybrid SVM/GMM systems. These systems use GMM supervectors as an SVM expansion for classification. In prior work, methods for scoring SVM/GMM systems have been introduced based upon either standard SVM scoring or GMM scoring with a pushed model. Although prior work showed experimentally that GMM scoring yielded better results, no framework was available to explain the connection between SVM scoring and GMM scoring. In this paper, we show that there are interesting connections between SVM scoring and GMM scoring. We provide a framework both theoretically and experimentally that connects the two scoring techniques. This connection should provide the basis for further research in SVM discriminative training for GMM models. Large-Scale Analysis of Formant Frequency Estimation Variability in Conversational Telephone Speech Nancy F. Chen 1 , Wade Shen 1 , Joseph Campbell 1 , Reva Schwartz 2 ; 1 MIT, USA; 2 United States Secret Service, USA Wed-Ses3-O2-2, Time: 16:20 We quantify how the telephone channel and regional dialect influence formant estimates extracted from Wavesurfer [1, 2] in spontaneous conversational speech from over 3,600 native American English speakers. To the best of our knowledge, this is the largest scale study on this topic. We found that F1 estimates are higher in cellular channels than those in landline, while F2 in general shows an opposite trend. We also characterized vowel shift trends in northern states in U.S.A. and compared them with the Northern city chain shift (NCCS) [3]. Our analysis is useful in forensic applications where it is important to distinguish between speaker, dialect, and channel characteristics. Developing an Automatic Functional Annotation System for British English Intonation Saandia Ali, Daniel Hirst; LPL, France Wed-Ses3-O2-3, Time: 16:40 One of the fundamental aims of prosodic analysis is to provide a reliable means of extracting functional information (what prosody contributes to meaning) directly from prosodic form (i.e. what prosody is — in this case intonation). This paper addresses the development of an automatic functional annotation system for British English. It is based on the study of a large corpus of British English and a procedure of analysis by synthesis, enabling to test and enrich different models of English intonation on the one hand and work towards an automatic version of the annotation process on the other. 
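The adaptation step described in the language score calibration abstract above (Wed-Ses3-O1-5), where the mean of each language-dependent Gaussian is adapted from a background Gaussian via MAP estimation, is commonly written as a count-weighted interpolation between the sample mean and the prior mean. The sketch below shows that standard relevance-MAP form; the relevance factor and the score vectors are illustrative assumptions, not the paper's exact estimator.

```python
# Relevance-MAP adaptation of a Gaussian back-end mean (illustrative).
import numpy as np

def map_adapt_mean(scores, prior_mean, tau=10.0):
    """Interpolate the sample mean of language-specific score vectors with a
    background (prior) mean, weighted by the amount of adaptation data."""
    scores = np.atleast_2d(scores)       # shape: (n_segments, n_score_dims)
    n = scores.shape[0]
    sample_mean = scores.mean(axis=0)
    return (n * sample_mean + tau * prior_mean) / (n + tau)

# Background mean estimated over all languages (made-up numbers).
background_mean = np.array([0.0, 0.0, 0.0])
# Score vectors observed for one target language.
target_scores = np.array([[1.2, -0.3, 0.4],
                          [0.9, -0.1, 0.6],
                          [1.1, -0.2, 0.5]])

print(map_adapt_mean(target_scores, background_mean))
```

With few adaptation segments the adapted mean stays close to the background mean, which is what makes this kind of back-end robust on the short-duration conditions mentioned in the abstract.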
Intrinsic Vowel Duration and the Post-Vocalic Voicing Effect: Some Evidence from Dialects of North American English Wed-Ses3-O2 : Phonetics & Phonology Jones (East Wing 1), 16:00, Wednesday 9 Sept 2009 Chair: Unto Kalervo Laine, Helsinki University of Technology, Finland Joshua Tauberer, Keelan Evanini; University of Pennsylvania, USA Functional Data Analysis as a Tool for Analyzing Speech Dynamics — A Case Study on the French Word c’était Wed-Ses3-O2-4, Time: 17:00 Michele Gubian, Francisco Torreira, Helmer Strik, Lou Boves; Radboud Universiteit Nijmegen, The Netherlands Wed-Ses3-O2-1, Time: 16:00 In this paper we introduce Functional Data Analysis (FDA) as a tool for analyzing dynamic transitions in speech signals. FDA makes it possible to perform statistical analyses of sets of mathematical functions in the same way as classical multivariate analysis treats scalar measurement data. We illustrate the use of FDA with a reduction phenomenon affecting the French word c’était /setE/ ‘it was’, which can be reduced to [stE] in conversational speech. FDA reveals that the dynamics of the transition from [s] to [t] in fully reduced cases may still be different from the dynamics of [s]-[t] transitions in underlying /st/ clusters such as in the word stage. We report the results of a comprehensive dialectal survey of three vowel duration phenomena in North American English: gross duration differences between dialects, the effect of post-vocalic consonant voicing, and intrinsic vowel duration. Duration data, from HMM-based forced alignment of phones in the Atlas of North American English corpus [1], showed that 1) the post-vocalic voicing effect appears in every dialect region and all but one dialect, and 2) dialectal variation in first formant frequency appears to be independent of intrinsic vowel duration. This second result adds evidence that intrinsic vowel durations are targets stored in the grammar and do not result from physiological constraints. Investigating /l/ Variation in English Through Forced Alignment Jiahong Yuan, Mark Liberman; University of Pennsylvania, USA Wed-Ses3-O2-5, Time: 10:00 We present a new method for measuring the “darkness” of /l/, and use it to investigate the variation of English /l/ in a large speech corpus that is automatically aligned with phones predicted from an Notes 139 orthographic transcript. We found a correlation between the rime duration and /l/-darkness for syllable-final /l/, but no correlation between /l/ duration and darkness for syllable-initial /l/. The data showed a clear difference between clear and dark /l/ in English, and also showed that syllable-final /l/ was less dark preceding an unstressed vowel than preceding a consonant or a word boundary. Structural Analysis of Dialects, Sub-Dialects and Sub-Sub-Dialects of Chinese Xuebin Ma 1 , Akira Nemoto 2 , Nobuaki Minematsu 1 , Yu Qiao 1 , Keikichi Hirose 1 ; 1 University of Tokyo, Japan; 2 Nankai University, China Wed-Ses3-O2-6, Time: 17:40 In China, there are hundred kinds of dialects. By traditional dialectology, they are classified into seven big dialect regions and most of them also have many sub-dialects and sub-sub-dialects. As they are different in various linguistic aspects, people from different dialect regions often cannot communicate orally. But for the sub-dialects of one dialect region, although they are sometimes still mutually unintelligible, more common features are shared. 
In this paper, a dialect pronunciation structure, which has been used successfully in dialect-based speaker classification in our previous work [1], is examined for the task of speaker classification and distance measurement among cities based on sub-dialects of Mandarin. Using the finals of the dialectal utterances of a specific list of written characters, a dialect pronunciation structure is built for every speaker in a data set and these speakers are classified based on the distances among their structures. Then, the results of classifying 16 Mandarin speakers based on their sub-dialects show that they are linguistically classified with little influence of their age and gender. Finally, distances among sub-sub-dialects are similarly calculated and evaluated. All the results show high validity and accordance to linguistic studies. Wed-Ses3-O3 : Speech Activity Detection High-Accuracy, Low-Complexity Voice Activity Detection Based on a posteriori SNR Weighted Energy Zheng-Hua Tan, Børge Lindberg; Aalborg University, Denmark Wed-Ses3-O3-3, Time: 16:40 This paper presents a voice activity detection (VAD) method using the measurement of a posteriori signal-to-noise ratio (SNR) weighted energy. The motivations are manifold: 1) the difference in frame-to-frame energy provides a great discrimination for speech signals, 2) speech segments, besides their characteristics, are accounted also on their reliability e.g. measured by SNR, 3) the a posteriori SNR for noise-only segments will theoretically equal to 0 dB, being ideal for VAD, and 4) both energy and a posteriori SNR are easy to estimate, resulting in a low complexity. The method is experimentally shown to be superior to a number of referenced methods and standards. Fusing Fast Algorithms to Achieve Efficient Speech Detection in FM Broadcasts Stéphane Pigeon, Patrick Verlinde; Royal Military Academy, Belgium Wed-Ses3-O3-4, Time: 17:00 Fallside (East Wing 2), 16:00, Wednesday 9 Sept 2009 Chair: Isabel Trancoso, INESC-ID Lisboa/IST, Portugal This paper describes a system aimed at detecting speech segments in FM broadcasts. To achieve high processing speeds, simple but fast algorithms are used. To output robust decisions, a combination of many different algorithms has been considered. The system is fully operational in the context of Open Source Intelligence, since 2007. Voice Activity Detection Using Singular Value Decomposition-Based Filter Hwa Jeon Song, Sung Min Ban, Hyung Soon Kim; Pusan National University, Korea Wed-Ses3-O3-1, Time: 16:00 This paper proposes a novel voice activity detector (VAD) based on singular value decomposition (SVD). The spectro-temporal characteristics of background noise region can be easily analyzed by SVD. The proposed method naturally drops hangover algorithm from VAD. Moreover, it adaptively changes the decision threshold by employing the most dominant singular value of the observation matrix in the noise region. According to simulation results, the proposed VAD shows significantly better performance than the conventional statistical model-based method and is less sensitive to the environmental changes. In addition, the proposed algorithm requires very low computational cost compared with other algorithms. Voice Activity Detection Using Partially Observable Markov Decision Process Chiyoun Park, Namhoon Kim, Jeongmi Cho; Samsung Electronics Co. Ltd., Korea Partially observable Markov decision process (POMDP) has been generally used to model agent decision processes such as dialogue management. 
In this paper, possibility of applying POMDP to a voice activity detector (VAD) has been explored. The proposed system first formulates hypotheses about the current noise environment and speech activity. Then, it decides and observes the features that are expected to be the most salient in the estimated situation. VAD decision is made based on the accumulated information. A comparative evaluation is presented to show that the proposed method outperforms other model-based algorithms regardless of noise types or signal-to-noise ratio. Robust Speech Recognition Using VAD-Measure-Embedded Decoder Tasuku Oonishi 1 , Paul R. Dixon 1 , Koji Iwano 2 , Sadaoki Furui 1 ; 1 Tokyo Institute of Technology, Japan; 2 Tokyo City University, Japan Wed-Ses3-O3-5, Time: 17:20 In a speech recognition system a Voice Activity Detector (VAD) is a crucial component for not only maintaining accuracy but also for reducing computational consumption. Front-end approaches which drop non-speech frames typically attempt to detect speech frames by utilizing speech/non-speech classification information such as the zero crossing rate or statistical models. These approaches discard the speech/non-speech classification information after voice detection. This paper proposes an approach that uses the speech/non-speech information to adjust the score of the recognition hypotheses. Experimental results show that our approach can improve the accuracy significantly and reduce computational consumption by combining the front-end method. Wed-Ses3-O3-2, Time: 16:20 Notes 140 Investigating Privacy-Sensitive Features for Speech Detection in Multiparty Conversations Sree Hari Krishnan Parthasarathi, Mathew Magimai-Doss, Hervé Bourlard, Daniel Gatica-Perez; IDIAP Research Institute, Switzerland Wed-Ses3-O3-6, Time: 17:40 We investigate four different privacy-sensitive features, namely energy, zero crossing rate, spectral flatness, and kurtosis, for speech detection in multiparty conversations. We liken this scenario to a meeting room and define our datasets and annotations accordingly. The temporal context of these features is modeled. With no temporal context, energy is the best performing single feature. But by modeling temporal context, kurtosis emerges as the most effective feature. Also, we combine the features. Besides yielding a gain in performance, certain combinations of features also reveal that a shorter temporal context is sufficient. We then benchmark other privacy-sensitive features utilized in previous studies. Our experiments show that the performance of all the privacy-sensitive features modeled with context is close to that of state-of-the-art spectral-based features, without extracting and using any features that can be used to reconstruct the speech signal. Wed-Ses3-O4 : Multimodal Speech (e.g. Audiovisual Speech, Gesture) Holmes (East Wing 3), 16:00, Wednesday 9 Sept 2009 Chair: Ji Ming, Queen’s University Belfast, UK speech source separation, and biometric spoofing detection. In particular, we build on earlier work, extending our previously proposed time-evolution model of audio-visual features to include non-causal (future) feature information. This significantly improves robustness of the method to small time-alignment errors between the audio and visual streams, as demonstrated by our experiments. In addition, we compare the proposed model to two known literature approaches for audio-visual synchrony detection, namely mutual information and hypothesis testing, and we show that our method is superior to both. 
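The four privacy-sensitive features studied in the Wed-Ses3-O3-6 abstract above (energy, zero-crossing rate, spectral flatness and kurtosis) are all cheap per-frame measurements that never reconstruct the waveform content. The sketch below computes them for fixed-length frames; the frame size and the white-noise test signal are illustrative choices, not the paper's setup.

```python
# Per-frame privacy-sensitive features for speech/non-speech detection (sketch).
import numpy as np
from scipy.stats import kurtosis

def frame_features(frame, eps=1e-12):
    """Return (energy, ZCR, spectral flatness, kurtosis) for one frame."""
    energy = float(np.mean(frame ** 2))
    zcr = float(np.mean(np.abs(np.diff(np.sign(frame))) > 0))
    power = np.abs(np.fft.rfft(frame)) ** 2 + eps
    flatness = float(np.exp(np.mean(np.log(power))) / np.mean(power))
    return energy, zcr, flatness, float(kurtosis(frame))

# 25 ms frames at 16 kHz over a synthetic signal (white noise stands in for audio).
rng = np.random.default_rng(0)
signal, frame_len = rng.normal(size=16000), 400
feats = [frame_features(signal[s:s + frame_len])
         for s in range(0, len(signal) - frame_len + 1, frame_len)]
print(len(feats), feats[0])
```

A speech detector would then be trained on these per-frame values, possibly stacked over a temporal context window as the abstract discusses.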
Acoustic-to-Articulatory Inversion Using Speech Recognition and Trajectory Formation Based on Phoneme Hidden Markov Models Atef Ben Youssef, Pierre Badin, Gérard Bailly, Panikos Heracleous; GIPSA, France Wed-Ses3-O4-3, Time: 16:40 In order to recover the movements of usually hidden articulators such as tongue or velum, we have developed a data-based speech inversion method. HMMs are trained, in a multistream framework, from two synchronous streams: articulatory movements measured by EMA, and MFCC + energy from the speech signal. A speech recognition procedure based on the acoustic part of the HMMs delivers the chain of phonemes and together with their durations, information that is subsequently used by a trajectory formation procedure based on the articulatory part of the HMMs to synthesise the articulatory movements. The RMS reconstruction error ranged between 1.1 and 2. mm. Speaker Discriminability for Visual Speech Modes Evaluation of External and Internal Articulator Dynamics for Pronunciation Learning Jeesun Kim 1 , Chris Davis 1 , Christian Kroos 1 , Harold Hill 2 ; 1 University of Western Sydney, Australia; 2 University of Wollongong, Australia Lan Wang, Hui Chen, JianJun Ouyang; Chinese Academy of Sciences, China Wed-Ses3-O4-4, Time: 17:00 Wed-Ses3-O4-1, Time: 16:00 In this paper we present a data-driven 3D talking head system using facial video and a X-ray film database for speech research. In order to construct a database recording the three dimensional positions of articulators at phoneme-level, the feature points of articulators were defined and labeled in facial and X-ray images for each English phoneme. Dynamic displacement based deformations were used in three modes to simulate the motions of both external and internal articulators. For continuous speech, the articulatory movements of each phoneme within an utterance were concatenated. A blending function was also employed to smooth the concatenation. In audio-visual test, a set of minimal pairs were used as the stimuli to access the realistic degree of articulatory motions of the 3D talking head. In the experiments where the subjects are native speakers and professional English teachers, a word identification accuracy of 91.1% among 156 tests was obtained. Robust Audio-Visual Speech Synchrony Detection by Generalized Bimodal Linear Prediction Kshitiz Kumar 1 , Jiri Navratil 2 , Etienne Marcheret 2 , Vit Libal 2 , Gerasimos Potamianos 3 ; 1 Carnegie Mellon University, USA; 2 IBM T.J. Watson Research Center, USA; 3 NCSR “Demokritos”, Greece Does speech mode affect recognizing people from their visual speech? We examined 3D motion data from 4 talkers saying 10 sentences (twice). Speech was in noise, in quiet or whispered. Principal Component Analyses (PCAs) were conducted and speaker classification was determined by Linear Discriminant Analysis (LDA). The first five PCs for the rigid motion and the first 10 PCs each for the non-rigid motion and the combined motion were input to a series of LDAs for all possible combinations of PCs that could be constructed using the retained PCs. The discriminant functions and classification coefficients were determined on the training data to predict the talker of the test data. Classification performance for both the in-noise and whispered speech modes were superior to the in-quiet one. 
Superiority of classification was found even if only the first PC (jaw motion) was used, i.e., measures of jaw motion when speaking in noise or whispering hold promise for bimodal person recognition or verification. Audio-Visual Prosody of Social Attitudes in Vietnamese: Building and Evaluating a Tones Balanced Corpus Dang-Khoa Mac 1 , Véronique Aubergé 1 , Albert Rilliard 2 , Eric Castelli 3 ; 1 LIG, France; 2 LIMSI, France; 3 MICA, Vietnam Wed-Ses3-O4-2, Time: 16:20 Wed-Ses3-O4-5, Time: 17:20 We study the problem of detecting audio-visual synchrony in video segments containing a speaker in frontal head pose. The problem holds a number of important applications, for example speech source localization, speech activity detection, speaker diarization, This paper presents the building and a first evaluation of a tones balanced Audio-Visual corpus of social affect in Vietnamese language. This under-resourced tonal language has specific glottalization and co-articulation phenomena, for which interactions with Notes 141 attitudes prosody are a very interesting issue. A well-controlled recording methodology was designed to build a large representative audio-visual corpus for 16 attitudes, and one speaker. A perception experiment was carried out to evaluate a speaker’s perceived performances and to study the role and integration of the audio, visual, and audio-visual information in the listener’s perception of the speaker’s attitudes. The results reveal characteristics of Vietnamese prosodic attitudes and allow us to investigate such social affect in Vietnamese language. Direct, Modular and Hybrid Audio to Visual Speech Conversion Methods — A Comparative Study Gyorgy Takacs; Peter Pazmany University, Hungary Wed-Ses3-O4-6, Time: 17:40 A systematic comparative study of audio to visual speech conversion methods is described in this paper. A direct conversion system is compared to conceptually different ASR based solutions. Hybrid versions of the different solutions will also be presented. The methods are tested using the same speech material, audio preprocessing and facial motion visualization units. Only the conversion blocks are changed. Subjective opinion score evaluation tests prove the naturalness of the direct conversion is the best. Wed-Ses3-P1 : Phonetics Hewison Hall, 16:00, Wednesday 9 Sept 2009 Chair: Helmer Strik, Radboud Universiteit Nijmegen, The Netherlands How Similar Are Clusters Resulting from schwa Deletion in French to Identical Underlying Clusters? Audrey Bürki 1 , Cécile Fougeron 2 , Christophe Veaux 3 , Ulrich H. Frauenfelder 1 ; 1 Université de Genève, Switzerland; 2 LPP, France; 3 IRCAM, France Rarefaction Gestures and Coarticulation in Mangetti Dune !Xung Clicks Amanda Miller 1 , Abigail Scott 1 , Bonny Sands 2 , Sheena Shah 3 ; 1 University of British Columbia, Canada; 2 Northern Arizona University, USA; 3 Georgetown University, USA Wed-Ses3-P1-3, Time: 16:00 We provide high-speed ultrasound data on the four Mangetti Dune !Xung clicks. The posterior constriction is uvular for all four clicks — front uvular for [g |] and [}] and back uvular for [g !] and [g {]. [g !] and [g {] both involve tongue center lowering and tongue root retraction as part of the rarefaction gestures. The rarefaction gestures in [g |] and [}] involve tongue center lowering. Lingual cavity volume is largest for [g !], followed by [g {], [}] and [g |]. A tongue tip recoil effect is found following [g !], but the effect is smaller than that seen in IsiXhosa in earlier studies. 
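The analysis pipeline in the visual speech discriminability abstract above (Wed-Ses3-O4-1), PCA over face-motion measurements followed by LDA talker classification, follows a standard pattern that a brief sketch can make concrete. The snippet below reproduces that pattern on random placeholder data; the dimensions, the train/test split and the injected per-talker offsets are assumptions for illustration, not the study's 3D motion data or protocol.

```python
# Generic PCA + LDA talker-classification sketch on placeholder motion data.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)

# 4 talkers x 20 sentence tokens each, 60 motion measurements per token.
n_talkers, tokens_per_talker, n_measures = 4, 20, 60
X = rng.normal(size=(n_talkers * tokens_per_talker, n_measures))
y = np.repeat(np.arange(n_talkers), tokens_per_talker)
# Give each talker a small systematic offset so classification is learnable.
X = X + y[:, None] * 0.5

# Retain the first 10 principal components, then classify talkers with LDA.
pcs = PCA(n_components=10).fit_transform(X)
lda = LinearDiscriminantAnalysis().fit(pcs[::2], y[::2])   # train on half
print("held-out talker accuracy:", lda.score(pcs[1::2], y[1::2]))
```

Restricting the input to the first principal component alone would mimic the abstract's jaw-motion-only condition within the same pipeline.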
How Similar Are Clusters Resulting from schwa Deletion in French to Identical Underlying Clusters?
Audrey Bürki 1, Cécile Fougeron 2, Christophe Veaux 3, Ulrich H. Frauenfelder 1; 1 Université de Genève, Switzerland; 2 LPP, France; 3 IRCAM, France
Wed-Ses3-P1-1, Time: 16:00

Clusters resulting from the deletion of schwa in French are compared with identical underlying clusters in words and pseudowords. Both manual and automatic acoustical comparisons suggest that clusters resulting from schwa deletion in French are highly similar to identical underlying clusters. Furthermore, cluster duration is not longer for clusters resulting from schwa deletion than for identical underlying clusters. Clusters in pseudowords show a different acoustical and durational pattern from the two other clusters in words.

Word-Final [t]-Deletion: An Analysis on the Segmental and Sub-Segmental Level
Barbara Schuppler 1, Wim van Dommelen 2, Jacques Koreman 2, Mirjam Ernestus 1; 1 Radboud Universiteit Nijmegen, The Netherlands; 2 NTNU, Norway
Wed-Ses3-P1-2, Time: 16:00

This paper presents a study on the reduction of word-final [t]s in conversational standard Dutch. Based on a large number of tokens annotated on the segmental level, we show that the bigram frequency and the segmental context are the main predictors for the absence of [t]s. In a second study, we present an analysis of the detailed acoustic properties of word-final [t]s and we show that bigram frequency and context also play a role on the sub-segmental level. This paper extends research on the realization of /t/ in spontaneous speech and shows the importance of incorporating sub-segmental properties in models of speech.

Rarefaction Gestures and Coarticulation in Mangetti Dune !Xung Clicks
Amanda Miller 1, Abigail Scott 1, Bonny Sands 2, Sheena Shah 3; 1 University of British Columbia, Canada; 2 Northern Arizona University, USA; 3 Georgetown University, USA
Wed-Ses3-P1-3, Time: 16:00

We provide high-speed ultrasound data on the four Mangetti Dune !Xung clicks. The posterior constriction is uvular for all four clicks — front uvular for [g |] and [}] and back uvular for [g !] and [g {]. [g !] and [g {] both involve tongue center lowering and tongue root retraction as part of the rarefaction gestures. The rarefaction gestures in [g |] and [}] involve tongue center lowering. Lingual cavity volume is largest for [g !], followed by [g {], [}] and [g |]. A tongue tip recoil effect is found following [g !], but the effect is smaller than that seen in IsiXhosa in earlier studies.

The Acoustics of Mangetti Dune !Xung Clicks
Amanda Miller 1, Sheena Shah 2; 1 University of British Columbia, Canada; 2 Georgetown University, USA
Wed-Ses3-P1-4, Time: 16:00

We document the acoustics of the four Mangetti Dune !Xung coronal clicks. We report the temporal measures of burst duration, relative burst amplitude and rise time, as well as the spectral value of center of gravity in the click bursts. COG correlates with lingual cavity volume. We show that there is inter-speaker variation in the acoustics of the palatal click, which we expect to correlate with a difference in the anterior constriction release dynamics. We show that burst duration, amplitude and rise time are correlated, similar to the correlation found between rise time and frication duration in affricates.
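A minimal sketch of a spectral centre-of-gravity (COG) measurement of the kind reported in the abstract above; the window, sampling rate and synthetic burst are assumptions for illustration.

```python
# Sketch: amplitude-weighted mean frequency of a short burst segment.
import numpy as np

def spectral_cog(burst, fs):
    """Spectral centre of gravity (Hz) of a windowed burst."""
    spectrum = np.abs(np.fft.rfft(burst * np.hanning(len(burst))))
    freqs = np.fft.rfftfreq(len(burst), d=1.0 / fs)
    return np.sum(freqs * spectrum) / np.sum(spectrum)

fs = 44100
burst = np.random.randn(int(0.01 * fs))   # a 10 ms synthetic "burst"
print(f"COG: {spectral_cog(burst, fs):.1f} Hz")
```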
Acoustic Characteristics of Ejectives in Amharic
Hussien Seid, S. Rajendran, B. Yegnanarayana; IIIT Hyderabad, India
Wed-Ses3-P1-5, Time: 16:00

In this paper, a preliminary investigation of the acoustic characteristics of Amharic ejectives in comparison with their unvoiced conjugates is presented. The normalized error from the linear prediction residual and a zero frequency resonator output are used to locate the instant of release of the oral closure and the instant of the start of voicing, respectively. Amharic ejectives are found to have longer closure duration and smaller VOT than their unvoiced conjugates. Cross-linguistic comparisons reveal that no ejectives of two languages behave acoustically in a similar manner despite similarity in their articulation.

Sentence-Final Particles in Hong Kong Cantonese: Are They Tonal or Intonational?
Wing Li Wu; University College London, UK
Wed-Ses3-P1-6, Time: 16:00

Cantonese is rich in sentence-final particles (SFPs), morphemes serving to show various linguistic or attitudinal meanings. The acoustic manifestations of these SFPs are not yet clear. This paper presents detailed analyses of the fundamental frequency tracings, final F0, final velocity and duration of ten SFPs in Hong Kong Cantonese. The results show that most of these SFPs are very similar to the lexical tones in terms of the F0 measurements, but the durations are significantly different in half the cases. The notable differences may give some insight into the nature of this special class of words.

Same Tone, Different Category: Linguistic-Tonetic Variation in the Areal Tone Acoustics of Chuqu Wu
William Steed, Phil Rose; Australian National University, Australia
Wed-Ses3-P1-7, Time: 16:00

Acoustic and auditory data are presented for the citation tones of single speakers from nine sites (eight hitherto undescribed in English) from the little-studied Chuqu subgroup of Wu in East Central China: Lìshuǐ, Lóngquán, Qìngyuán, Lóngyóu, Jìnyún, Qīngtián, Yúnhé, Jǐngníng, and Táishùn. The data demonstrate a high degree of complexity, having no fewer than 22 linguistic-tonetically different tones. The nature of the complexity of these forms is discussed, especially with respect to whether the variation is continuous or categorical, and inferences are drawn on their historical development.

Why Would Aspiration Lower the Pitch of the Following Vowel? Observations from Leng-Shui-Jiang Chinese
Caicai Zhang; Hong Kong University of Science & Technology, China
Wed-Ses3-P1-8, Time: 16:00

This paper is a preliminary report of the aspiration-conditioned tonal split in Leng-shui-jiang (LSJ hereafter) Chinese. So far no consensus has been reached concerning the intrinsic perturbation of aspiration on the F0 of the following vowel. Conflicting data come from both the same language and different languages. In order to shed light on this issue, F0 and closing quotient (Qx hereafter) are calculated in syllables after aspirated and unaspirated obstruents from six speakers (three male, three female) of the LSJ dialect. The results show that F0 is significantly lower after the aspirated obstruents in two out of the three tone groups. The relatively lower Qx found in the syllables with aspirated initials is a possible explanation for the lower pitch.
Dialectal Characteristics of Osaka and Tokyo Japanese: Analyses of Phonologically Identical Words
Kanae Amino, Takayuki Arai; Sophia University, Japan
Wed-Ses3-P1-9, Time: 16:00

This study investigates the characteristics of the two major dialects of Japanese: the Osaka and Tokyo dialects. We recorded the utterances of speakers of both dialects, and analysed the differences that appear in the accentuation of the words at the phonetic-acoustic level. Japanese words that are phonologically identical in both dialects were used as the analysis target. The results showed that the pitch patterns contained the dialect-dependent features of Osaka Japanese. Furthermore, these patterns could not be fully mimicked by speakers of Tokyo Japanese. These results show that there is a phonetics-phonology gap in the dialectal differences, and that we may exploit this gap for forensic purposes.

Categories and Gradience in Intonation: Evidence from Linguistics and Neurobiology
Brechtje Post, Francis Nolan, Emmanuel Stamatakis, Toby Hudson; University of Cambridge, UK
Wed-Ses3-P1-10, Time: 16:00

Multiple cues interact to signal multiple functions in intonation simultaneously, which makes intonation notoriously complex to analyze. The Autosegmental-Metrical model for intonation analysis has proved to be an excellent vehicle for separating the components, but evidence for the phonetics/phonology dichotomy on which it hinges has proved elusive. Advocating a multidisciplinary approach, this paper outlines a new research project which combines traditional behavioural experiments with neuro-linguistic data to advance our understanding of the linguistic representation and neural correlates of intonation.

Exploring Vocalization of /l/ in English: An EPG and EMA Study
Mitsuhiro Nakamura; Nihon University, Japan
Wed-Ses3-P1-11, Time: 16:00

This study explores the spatiotemporal characteristics of lingual gestures for the clear, dark, and vocalized allophones of /l/ in English by examining the EPG and EMA data from the multichannel articulatory (MOCHA) database. The results show evidence that the spatiotemporal controls of the tip lowering and the dorsum backing gestures are organized systematically for the three variants. An exploratory description of the articulatory correlates for the /l/ gestures is made.

The Monophthongs and Diphthongs of North-Eastern Welsh: An Acoustic Study
Robert Mayr, Hannah Davies; University of Wales Institute Cardiff, UK
Wed-Ses3-P1-12, Time: 16:00

Descriptive accounts of Welsh vowels indicate systematic differences between Northern and Southern varieties. Few studies have, however, attempted to verify these claims instrumentally, and little is known about regional variation in Welsh vowel systems. The present study aims to provide a first preliminary analysis of the acoustic properties of Welsh monophthongs and diphthongs, as produced by a male speaker from North-eastern Wales. The results indicate distinctive production of all the monophthong categories of Northern Welsh. Interesting patterns of spectral change were found for the diphthongs. Implications for theories of contrastivity in vowel systems are discussed.

Voicing Profile of Polish Sonorants: [r] in Obstruent Clusters
J. Sieczkowska, Bernd Möbius, Antje Schweitzer, Michael Walsh, Grzegorz Dogil; Universität Stuttgart, Germany
Wed-Ses3-P1-13, Time: 16:00

This study aims at defining and analyzing the voicing profile of the Polish sonorant [r], showing the variability of its realizations depending on segmental and prosodic position. The voicing profile is defined as the frame-by-frame voicing status of a speech sound in continuous speech. Word-final devoicing of sonorants is briefly reviewed and analyzed in terms of the conducted corpus-based investigation. We used automatic tools to extract consonant features and F0 values and to obtain voicing profiles. The results show that the liquid [r] devoices word- and syllable-finally, particularly in a left voiceless stop context.
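As a small illustration of the "voicing profile" defined in the abstract above (the frame-by-frame voicing status of a sound), the sketch below assumes that frame-level voicing decisions are already available, for example from a pitch tracker over a force-aligned segment.

```python
# Hedged sketch: summarise per-frame voicing decisions for one [r] token.
import numpy as np

def voicing_profile(voiced_flags):
    """Return the frame-by-frame status and the proportion of voiced frames."""
    flags = np.asarray(voiced_flags, dtype=bool)
    return flags, float(flags.mean())

# e.g. a word-final [r] after a voiceless stop, devoiced towards the end
flags, ratio = voicing_profile([1, 1, 1, 1, 0, 0, 0])
print("profile:", flags.astype(int), "voiced ratio: %.2f" % ratio)
```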
Wed-Ses3-P2 : Speaker Verification & Identification III
Hewison Hall, 16:00, Wednesday 9 Sept 2009
Chair: A. Ariyaeeinia, University of Hertfordshire, UK

Mel, Linear, and Antimel Frequency Cepstral Coefficients in Broad Phonetic Regions for Telephone Speaker Recognition
Howard Lei, Eduardo Lopez; ICSI, USA
Wed-Ses3-P2-1, Time: 16:00

We've examined the speaker discriminative power of mel-, antimel- and linear-frequency cepstral coefficients (MFCCs, a-MFCCs and LFCCs) in the nasal, vowel, and non-nasal consonant speech regions. Our inspiration came from the work of Lu and Dang in 2007, who showed that filterbank energies at some frequencies mainly outside the telephone bandwidth possess more speaker discriminative power due to physiological characteristics of speakers, and derived a set of cepstral coefficients that outperformed MFCCs in non-telephone speech. Using telephone speech, we've discovered that LFCCs gave 21.5% and 15.0% relative EER improvements over MFCCs in nasal and non-nasal consonant regions, agreeing with our filterbank energy f-ratio analysis. We've also found that using only the vowel region with MFCCs gives a 9.1% relative improvement over using all speech. Last, we've shown that a-MFCCs are valuable in combination, contributing to a system with 17.3% relative improvement over our baseline.

Fast GMM Computation for Speaker Verification Using Scalar Quantization and Discrete Densities
Guoli Ye 1, Brian Mak 1, Man-Wai Mak 2; 1 Hong Kong University of Science & Technology, China; 2 Hong Kong Polytechnic University, China
Wed-Ses3-P2-2, Time: 16:00

Most current state-of-the-art speaker verification (SV) systems use Gaussian mixture models (GMMs) to represent the universal background model (UBM) and the speaker models (SM). For an SV system that employs the log-likelihood ratio between SM and UBM to make the decision, its computational efficiency is largely determined by the GMM computation. This paper attempts to speed up GMM computation by converting a continuous-density GMM to a single or a mixture of discrete densities using scalar quantization. We investigated a spectrum of such discrete models: from high-density discrete models to discrete mixture models, and their combination called high-density discrete-mixture models. For the NIST 2002 SV task, we obtained an overall speedup by a factor of 2–100 with little loss in EER performance.
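For readers unfamiliar with the GMM-UBM decision rule that this speed-up targets, here is a hedged sketch of log-likelihood-ratio scoring with toy two-component models; it is not the authors' discrete-density implementation, and all parameter values are invented.

```python
# Sketch of GMM-UBM log-likelihood-ratio scoring with toy models.
import numpy as np
from scipy.stats import multivariate_normal

def gmm_loglik(X, weights, means, covs):
    dens = [w * multivariate_normal.pdf(X, m, c)
            for w, m, c in zip(weights, means, covs)]
    return np.log(np.sum(dens, axis=0)).sum()

rng = np.random.default_rng(1)
X = rng.normal(loc=0.5, size=(200, 2))            # features of one test utterance
ubm = dict(weights=[0.5, 0.5], means=[[0, 0], [1, 1]], covs=[np.eye(2)] * 2)
spk = dict(weights=[0.5, 0.5], means=[[0.4, 0.4], [1.2, 1.2]], covs=[np.eye(2)] * 2)

llr = gmm_loglik(X, **spk) - gmm_loglik(X, **ubm)  # speaker model vs. UBM
print("accept" if llr > 0.0 else "reject", f"(LLR = {llr:.1f})")
```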
Text-Independent Speaker Identification Using Vocal Tract Length Normalization for Building Universal Background Model
A.K. Sarkar, S. Umesh, S.P. Rath; IIT Kanpur, India
Wed-Ses3-P2-3, Time: 16:00

In this paper, we propose to use Vocal Tract Length Normalization (VTLN) to build the Universal Background Model (UBM) for a closed-set speaker identification system. Vocal Tract Length (VTL) differences among speakers are a major source of variability in the speech signal. Since the UBM model is trained using data from many speakers, it statistically captures this inherent variation in the speech signal, which results in a “coarse” model in the acoustic space. This may cause the adapted speaker models obtained from the UBM model to have significantly high overlap in the acoustic space. We hypothesize that the use of VTLN will help in compacting the UBM model and thus the speaker adapted models obtained from this compact model will have better speaker-separability in the acoustic space. We perform experiments on the MIT, TIMIT and NIST 2004 SRE databases and show that using VTLN we can achieve lower identification error rates as compared to the conventional GMM-UBM based method.

BUT System for NIST 2008 Speaker Recognition Evaluation
Lukáš Burget, Michal Fapšo, Valiantsina Hubeika, Ondřej Glembek, Martin Karafiát, Marcel Kockmann, Pavel Matějka, Petr Schwarz, Jan Černocký; Brno University of Technology, Czech Republic
Wed-Ses3-P2-4, Time: 16:00

This paper presents the BUT system submitted to NIST 2008 SRE. It includes two subsystems based on Joint Factor Analysis (JFA) GMM/UBM and one based on SVM-GMM. The systems were developed on NIST SRE 2006 data, and the results are presented on NIST SRE 2008 evaluation data. We concentrate on the influence of side information in the calibration.

Selection of the Best Set of Shifted Delta Cepstral Features in Speaker Verification Using Mutual Information
José R. Calvo, Rafael Fernández, Gabriel Hernández; CENATAV, Cuba
Wed-Ses3-P2-5, Time: 16:00

Shifted delta cepstral (SDC) features, obtained by concatenating delta cepstral features across multiple speech frames, were recently reported to produce superior performance to delta cepstral features in language and speaker recognition systems. In this paper, the use of SDC features in a speaker verification experiment is reported. Mutual information between SDC features and the identity of a speaker is used to select the best set of SDC parameters. The experiment evaluates the robustness of the best SDC features under channel and handset mismatch in speaker verification. The results show a relative EER reduction of up to 19% in a speaker verification experiment.
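A sketch of how shifted delta cepstra are typically assembled from frame-level cepstra using the usual (N, d, P, k) parameterisation; the configuration values below are common defaults for illustration, not necessarily those selected by the authors.

```python
# Sketch: stack k blocks of delta cepstra, shifted by P frames, spread d.
import numpy as np

def sdc(cepstra, d=1, P=3, k=7):
    """cepstra: (n_frames, N) array -> (n_frames, N*k) SDC features."""
    n_frames, N = cepstra.shape
    pad_len = d + P * (k - 1)
    pad = np.pad(cepstra, ((pad_len, pad_len), (0, 0)), mode="edge")
    out = np.zeros((n_frames, N * k))
    for t in range(n_frames):
        base = t + pad_len
        blocks = [pad[base + i * P + d] - pad[base + i * P - d] for i in range(k)]
        out[t] = np.concatenate(blocks)
    return out

feats = sdc(np.random.randn(100, 7))   # e.g. 7 cepstra per frame -> 49-dim SDC
print(feats.shape)
```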
Forensic Speaker Recognition Using Traditional Features Comparing Automatic and Human-in-the-Loop Formant Tracking
Alberto de Castro, Daniel Ramos, Joaquin Gonzalez-Rodriguez; Universidad Autónoma de Madrid, Spain
Wed-Ses3-P2-6, Time: 16:00

In this paper we compare forensic speaker recognition with traditional features using two different formant tracking strategies: one performed automatically and one semi-automatic, performed by human experts. The main contribution of the work is the use of an automatic method for formant tracking, which allows a much faster recognition process and the use of a much higher amount of data for modelling the background population, calibration, etc. This is especially important in likelihood-ratio-based forensic speaker recognition, where the variation of features among a population of speakers must be modelled in a statistically robust way. Experiments show that, although recognition using the human-in-the-loop approach is better than using the automatic scheme, the performance of the latter is also acceptable. Moreover, we present a novel feature selection method which allows the analysis of which feature of each formant has a greater contribution to the discriminating power of the whole recognition process, which can be used by the expert in order to decide which features in the available speech material are important.

Open-Set Speaker Identification Under Mismatch Conditions
S.G. Pillay 1, A. Ariyaeeinia 1, P. Sivakumaran 1, M. Pawlewski 2; 1 University of Hertfordshire, UK; 2 BT Labs, UK
Wed-Ses3-P2-7, Time: 16:00

This paper presents investigations into the performance of open-set, text-independent speaker identification (OSTI-SI) under mismatched data conditions. The scope of the study includes attempts to reduce the adverse effects of such conditions through the introduction of a modified parallel model combination (PMC) method together with condition-adjusted T-Norm (CT-Norm) into the OSTI-SI framework. The experiments are conducted using examples of real-world noise. Based on the outcomes, it is demonstrated that the above approach can lead to considerable improvements in the accuracy of open-set speaker identification operating under severely mismatched data conditions. The paper details the realisation of the modified PMC method and CT-Norm in the context of OSTI-SI, presents the experimental investigations and provides an analysis of the results.

MiniVectors: An Improved GMM-SVM Approach for Speaker Verification
Xavier Anguera; Telefonica Research, Spain
Wed-Ses3-P2-8, Time: 16:00

The accuracy levels achieved by state-of-the-art speaker verification systems are high enough for the technology to be used in real-life applications. Unfortunately, the transfer from the lab to the field is not as straightforward as it could be: the best performing systems can be computationally expensive to run and need large speaker model footprints. In this paper, we compare two speaker verification algorithms (GMM-SVM Supervectors and Kharroubi's GMM-SVM vectors) and propose an improvement of Kharroubi's system that: (a) achieves up to 17% relative performance improvement when compared to the Supervectors algorithm; (b) is 24% faster in run time; and (c) makes use of speaker models that are 94% smaller than those needed by the Supervectors algorithm.

Robustness of Phase Based Features for Speaker Recognition
R. Padmanabhan 1, Sree Hari Krishnan Parthasarathi 2, Hema A. Murthy 1; 1 IIT Madras, India; 2 IDIAP Research Institute, Switzerland
Wed-Ses3-P2-9, Time: 16:00

This paper demonstrates the robustness of group-delay based features for speech processing. An analysis of group delay functions is presented which shows that these features retain formant structure even in noise. Furthermore, a speaker verification task performed on the NIST 2003 database shows lower error rates when compared with the traditional MFCC features. We also discuss using feature diversity to dynamically choose the feature for every claimed speaker.

The MIT Lincoln Laboratory 2008 Speaker Recognition System
D.E. Sturim, W.M. Campbell, Zahi N. Karam, Douglas Reynolds, F.S. Richardson; MIT, USA
Wed-Ses3-P2-10, Time: 16:00

In recent years methods for modeling and mitigating variational nuisances have been introduced and refined. A primary emphasis in last year's NIST 2008 Speaker Recognition Evaluation (SRE) was to greatly expand the use of auxiliary microphones. This introduced additional channel variations, which have been a historical challenge to speaker verification systems. In this paper we present the MIT Lincoln Laboratory Speaker Recognition system applied to the task in the NIST 2008 SRE. Our approach during the evaluation was two-fold: 1) utilize recent advances in variational nuisance modeling (latent factor analysis and nuisance attribute projection) to allow our spectral speaker verification systems to better compensate for the channel variation introduced, and 2) fuse systems targeting the different linguistic tiers of information, high and low. The performance of the system is presented when applied on a NIST 2008 SRE task. Post-evaluation analysis is conducted on the sub-task when interview microphones are present.

Speaker Recognition on Lossy Compressed Speech Using the Speex Codec
A.R. Stauffer, A.D. Lawson; RADC Inc., USA
Wed-Ses3-P2-11, Time: 16:00

This paper examines the impact of lossy speech coding with Speex on GMM-UBM speaker recognition (SR). Audio from 120 speakers was compressed with Speex into twelve data sets, each with a different level of compression quality from 0 (most compressed) to 10 (least), plus uncompressed. Experiments looked at performance under matched and mismatched compression conditions, using models conditioned for the coded environment, and Speex coding applied to improving SR performance on other coders. Results show that Speex is effective for compression of data used in SR and that Speex coding can improve performance on data compressed by the GSM codec.
Text-Independent Speaker Verification Using Rank Threshold in Large Number of Speaker Models
Haruka Okamoto 1, Satoru Tsuge 2, Amira Abdelwahab 1, Masafumi Nishida 3, Yasuo Horiuchi 1, Shingo Kuroiwa 1; 1 Chiba University, Japan; 2 University of Tokushima, Japan; 3 Doshisha University, Japan
Wed-Ses3-P2-12, Time: 16:00

In this paper, we propose a novel speaker verification method which determines whether a claimer is accepted or rejected by the rank of the claimer in a large number of speaker models, instead of by score normalization such as T-norm and Z-norm. The method has advantages over the standard T-norm in speaker verification accuracy. However, like T-norm, it requires substantial computation time, since likelihoods must be calculated for many cohort models. Hence, we also discuss a speed-up based on selecting a cohort subset for each target speaker in the training stage. This data-driven approach can significantly reduce computation, resulting in faster speaker verification decisions. We conducted text-independent speaker verification experiments using a large-scale Japanese speaker recognition evaluation corpus constructed by the National Research Institute of Police Science. As a result, the proposed method achieved an equal error rate of 2.2%, while T-norm obtained 2.7%.
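The contrast between T-norm score normalisation and the rank-based decision proposed above can be sketched as follows; the scores are random stand-ins rather than real model likelihoods, and the rank threshold is an arbitrary example value.

```python
# Sketch: T-norm standardisation vs. a rank-threshold accept/reject decision.
import numpy as np

rng = np.random.default_rng(2)
claimed_score = 1.8                         # score of the claimed speaker model
cohort_scores = rng.normal(1.0, 0.5, 500)   # scores of many other speaker models

# T-norm: standardise the claimed score against the cohort distribution.
tnorm = (claimed_score - cohort_scores.mean()) / cohort_scores.std()

# Rank threshold: accept if the claimed model ranks high enough among all models.
rank = 1 + int(np.sum(cohort_scores > claimed_score))
print(f"T-norm score {tnorm:.2f}; rank {rank} -> accept = {rank <= 10}")
```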
The Role of Age in Factor Analysis for Speaker Identification
Yun Lei, John H.L. Hansen; University of Texas at Dallas, USA
Wed-Ses3-P2-13, Time: 16:00

The speaker acoustic space described by a factor analysis model is assumed to reflect a majority of the speaker variations using a reduced number of latent factors. In this study, the age factor, as an important observable factor of a speaker's voice, is analyzed and employed in the description of the speaker acoustic space, using a factor analysis approach. An age-dependent acoustic space is developed for speakers, and the effect of the age-dependent space in eigenvoice modeling is evaluated using the NIST SRE08 corpus. In addition, data pools with different age distributions are evaluated with a joint factor analysis model to assess the influence of age in the data pool.

Do Humans and Speaker Verification System Use the Same Information to Differentiate Voices?
Juliette Kahn 1, Solange Rossato 2; 1 LIA, France; 2 LIG, France
Wed-Ses3-P2-14, Time: 16:00

The aim of this paper is to analyze the pairwise comparisons of voices by a speaker verification system (ALIZE/Spk) and by humans. A database of familial groups of 24 speakers was created. A single sentence was chosen for the perception test. The same sentence was used as the test signal for the ALIZE/Spk system trained on another part of the corpus. Results show that the voice proximities within a familial group were well recovered in the speaker representation by ALIZE, but much less so in the representation derived from the perception test.

Wed-Ses3-P3 : Robust Automatic Speech Recognition II
Hewison Hall, 16:00, Wednesday 9 Sept 2009
Chair: Peter Jancovic, University of Birmingham, UK

Noisy Speech Recognition by Using Output Combination of Discrete-Mixture HMMs and Continuous-Mixture HMMs
Tetsuo Kosaka, You Saito, Masaharu Kato; Yamagata University, Japan
Wed-Ses3-P3-1, Time: 16:00

This paper presents an output combination approach for noise-robust speech recognition. The aim of this work is to improve recognition performance for adverse conditions which contain both stationary and non-stationary noise. In the proposed method, both discrete-mixture HMMs (DMHMMs) and continuous-mixture HMMs (CMHMMs) are used as acoustic models. In the DMHMM, subvector quantization is used instead of vector quantization and each state has multiple mixture components. Our previous work showed that the DMHMM system gave better performance in low SNR and/or non-stationary noise conditions. In contrast, the CMHMM system was better in the opposite conditions. Thus, we take a system combination approach of the two models to improve the performance in various kinds of noise conditions. The proposed method was evaluated on an LVCSR task with a 5K-word vocabulary. The results showed that the proposed method was effective in various kinds of noise conditions.

Adaptive Training with Noisy Constrained Maximum Likelihood Linear Regression for Noise Robust Speech Recognition
D.K. Kim, M.J.F. Gales; University of Cambridge, UK
Wed-Ses3-P3-2, Time: 16:00

Adaptive training is a widely used technique for building speech recognition systems on non-homogeneous training data. Recently there has been interest in applying these approaches to situations where there are significant levels of background noise. This work extends the most popular form of linear transform for adaptive training, constrained MLLR, to reflect additional uncertainty from noise-corrupted observations. This new form of transform, Noisy CMLLR, uses a modified version of the generative model between clean speech and noisy observations, similar to factor analysis. Adaptive training using NCMLLR with both maximum likelihood and discriminative criteria is described. Experiments are conducted on noise-corrupted Resource Management and in-car recorded data. In preliminary experiments this new form achieves improvements in recognition performance over the standard approach in low signal-to-noise ratio conditions.

Performance Comparisons of the Integrated Parallel Model Combination Approaches with Front-End Noise Reduction
Guanghu Shen 1, Soo-Young Suk 2, Hyun-Yeol Chung 1; 1 Yeungnam University, Korea; 2 AIST, Japan
Wed-Ses3-P3-3, Time: 16:00

In this paper, to find the best noise robustness approach, we study approaches implemented at both ends (i.e. front-end and back-end) of the speech recognition system. To reduce the noise with lower speech distortion at the front-end, we investigate Two-stage Mel-warped Wiener Filtering (TMWF) in the integrated Parallel Model Combination (PMC) approach. Furthermore, the first stage of TMWF (i.e. One-stage Mel-warped Wiener Filtering (OMWF)), as well as the well-known Wiener Filtering (WF), is effective in reducing noise, so we integrate PMC with those front-end noise reduction approaches. In terms of recognition performance, TMWF-PMC improves on the well-known WF-PMC, and OMWF-PMC also shows comparable performance across all noise types.
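The core parallel model combination step referred to in the last abstract, combining a clean-speech Gaussian and a noise Gaussian in the log-spectral domain under the usual log-normal approximation, can be sketched as below; the parameter values are illustrative only and this is not the authors' integrated TMWF-PMC system.

```python
# Sketch: log-normal PMC for one log-spectral bin (additive noise assumption).
import numpy as np

def pmc_lognormal(mu_s, var_s, mu_n, var_n):
    """Return the corrupted-speech mean/variance for one log-spectral bin."""
    m_s, m_n = np.exp(mu_s + var_s / 2), np.exp(mu_n + var_n / 2)  # linear means
    v_s = m_s**2 * (np.exp(var_s) - 1)                             # linear variances
    v_n = m_n**2 * (np.exp(var_n) - 1)
    m_y, v_y = m_s + m_n, v_s + v_n                                # add speech + noise
    var_y = np.log(1 + v_y / m_y**2)                               # back to log domain
    mu_y = np.log(m_y) - var_y / 2
    return mu_y, var_y

print(pmc_lognormal(mu_s=2.0, var_s=0.3, mu_n=1.0, var_n=0.2))
```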
Tuning Support Vector Machines for Robust Phoneme Classification with Acoustic Waveforms
Jibran Yousafzai, Zoran Cvetković, Peter Sollich; King's College London, UK
Wed-Ses3-P3-4, Time: 16:00

This work focuses on the robustness of phoneme classification to additive noise in the acoustic waveform domain using support vector machines (SVMs). We address the issue of designing kernels for acoustic waveforms which imitate state-of-the-art representations such as PLP and MFCC and are tuned to the physical properties of speech. For comparison, classification results in the PLP representation domain with cepstral mean-and-variance normalization (CMVN) using standard kernels are also reported. It is shown that our custom-designed kernels achieve better classification performance at high noise levels. Finally, we combine the PLP and acoustic waveform representations to attain better classification than either of the individual representations over the entire range of noise levels tested, from the quiet condition down to -18 dB SNR.

An Analytic Derivation of a Phase-Sensitive Observation Model for Noise Robust Speech Recognition
Volker Leutnant, Reinhold Haeb-Umbach; Universität Paderborn, Germany
Wed-Ses3-P3-5, Time: 16:00

In this paper we present an analytic derivation of the moments of the phase factor between clean speech and noise cepstral or log-mel-spectral feature vectors. The development shows, among others, that the probability density of the phase factor is of sub-Gaussian nature and that it is independent of the noise type and the signal-to-noise ratio, however dependent on the mel filter bank index. Further we show how to compute the contribution of the phase factor to both the mean and the variance of the noisy speech observation likelihood, which relates the speech and noise feature vectors to those of noisy speech. The resulting phase-sensitive observation model is then used in model-based speech feature enhancement, leading to significant improvements in word accuracy on the AURORA2 database.
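A sketch of a commonly used form of the phase-sensitive interaction between clean speech, noise and the phase factor in the log-mel domain, of the kind analysed above; this is a generic textbook formulation with illustrative values, not necessarily the exact model derived in the paper.

```python
# Sketch: y = x + log(1 + exp(n - x) + 2*alpha*exp((n - x)/2)),
# where x, n are clean-speech and noise log-mel energies and alpha is the
# phase factor (cosine of the phase difference, in [-1, 1]).
import numpy as np

def noisy_logmel(x, n, alpha):
    return x + np.log1p(np.exp(n - x) + 2.0 * alpha * np.exp((n - x) / 2.0))

x, n = 3.0, 2.0
for alpha in (-0.5, 0.0, 0.5):
    print(f"alpha={alpha:+.1f} -> y={noisy_logmel(x, n, alpha):.3f}")
```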
Variational Model Composition for Robust Speech Recognition with Time-Varying Background Noise
Wooil Kim, John H.L. Hansen; University of Texas at Dallas, USA
Wed-Ses3-P3-6, Time: 16:00

This paper proposes a novel model composition method to improve speech recognition performance in time-varying background noise conditions. It is suggested that each order of the cepstral coefficients represents the frequency degree of changing components in the envelope of the log-spectrum. With this motivation, in the proposed method, variational noise models are generated by selectively applying perturbation factors to a basis model, resulting in a collection of various types of spectral patterns in the log-spectral domain. The basis noise model is obtained from the silent duration segments of the input speech. The proposed Variational Model Composition (VMC) method is employed to generate multiple environmental models for our previously proposed feature compensation method. Experimental results show that the proposed method is considerably more effective at increasing speech recognition performance in time-varying background noise conditions, with 30.34% and 9.02% average relative improvements in word error rate for speech babble and background music conditions respectively, compared to an existing single-model-based method.

Comparison of Estimation Techniques in Joint Uncertainty Decoding for Noise Robust Speech Recognition
Haitian Xu, K.K. Chin; Toshiba Research Europe Ltd., UK
Wed-Ses3-P3-7, Time: 16:00

Model-based joint uncertainty decoding (JUD) has recently achieved promising results by integrating the front-end uncertainty into the back-end decoding by estimating JUD transforms in a mathematically consistent framework. There are different ways of estimating the JUD transforms, resulting in different JUD methods. This paper gives an overview of the estimation techniques existing in the literature, including data-driven parallel model combination, Taylor series based approximation and the recently proposed second-order approximation. Application of a new technique based on the unscented transformation is also proposed for the JUD framework. The different techniques have been compared in terms of both recognition accuracy and computational cost on a database recorded in a real car environment. Experimental results indicate the unscented transformation is one of the best options for estimating JUD transforms as it maintains a good balance between accuracy and efficiency.

Replacing Uncertainty Decoding with Subband Re-Estimation for Large Vocabulary Speech Recognition in Noise
Jianhua Lu, Ji Ming, Roger Woods; Queen's University Belfast, UK
Wed-Ses3-P3-8, Time: 16:00

In this paper, we propose a novel approach for parameterized model compensation for large-vocabulary speech recognition in noisy environments. The new compensation algorithm, termed CMLLR-SUBREST, combines model-based uncertainty decoding (UD) with subspace distribution clustering hidden Markov modeling (SDCHMM), so that the UD-type compensation can be realized by re-estimating the models based on a small amount of adaptation data. This avoids the estimation of the covariance biases, which is required in model-based UD and usually needs a numerical approach. The Aurora 4 corpus is used in the experiments. We have achieved a 16.9% relative WER (word error rate) reduction over our previous missing-feature (MF) based decoding and 16.1% over the combination of constrained MLLR compensation and MF decoding. The number of model parameters is reduced by two orders of magnitude.

Wed-Ses3-P4 : Prosody: Production II
Hewison Hall, 16:00, Wednesday 9 Sept 2009
Chair: Shinichi Tokuma, Chuo University, Japan

Perception and Production of Boundary Tones in Whispered Dutch
W. Heeren, V.J. Van Heuven; Universiteit Leiden, The Netherlands
Wed-Ses4-P4-1, Time: 13:30

The main cue to interrogativity in Dutch declarative questions is found in the final boundary tone. When whispering, a speaker does not produce the most important acoustic information conveying this: the fundamental frequency. In this paper listeners are shown to perceive the difference between whispered declarative questions and statements, though less clearly than in phonated speech. Moreover, possible acoustic correlates conveying whispered question intonation were investigated. The results show that the second formant may convey pitch in whispered speech, and also that first formant and intensity differences exist between high and low boundary tones in both phonated and whispered speech.

Pitch Accents and Information Status in a German Radio News Corpus
Katrin Schweitzer, Arndt Riester, Michael Walsh, Grzegorz Dogil; Universität Stuttgart, Germany
Wed-Ses4-P4-2, Time: 13:30

This paper presents a corpus analysis of prosodic realisations of information status categories in terms of pitch accent types. The annotations are based on a recent annotation scheme for information status [1] that relies on semantic criteria applied to written text. For each information status category, typical pitch accent realisations are identified. Moreover, the relevance of the strict semantic information status labelling scheme for the prosodic realisation is examined. It can be shown that the semantic criteria are reflected in prosody, i.e. the prosodic findings corroborate the theoretical assumptions made in the framework.
Analysis of Voice Fundamental Frequency Contours of Continuing and Terminating Prosodic Phrases in Four Swiss German Dialects Adrian Leemann, Keikichi Hirose, Hiroya Fujisaki; University of Tokyo, Japan Wed-Ses4-P4-3, Time: 13:30 In the present study, the F0 contours of continuing and terminating prosodic phrases of 4 Swiss German dialects are analyzed by means of the command-response model. In every model parameter, the two prosodic phrase types show significant differences: continuing prosodic phrases indicate higher phrase command magnitude and shorter durations. Locally, they demonstrate more distinct accent command amplitudes as well as durations. In addition, continuing prosodic phrases have later rises relative to segment onset than terminating prosodic phrases. In the same context, fine phonetic differences between the dialects are highlighted. Using Responsive Prosodic Variation to Acknowledge the User’s Current State Nigel G. Ward, Rafael Escalante-Ruiz; University of Texas at El Paso, USA Wed-Ses4-P4-6, Time: 13:30 Spoken dialog systems today do not vary the prosody of their utterances, although prosody is known to have many useful expressive functions. In a corpus of memory quizzes, we identify eleven dimensions of prosodic variation, each with its own expressive function. We identified the situations in which each was used, and developed rules for detecting these situations from the dialog context and the prosody of the interlocutor’s previous utterance. We implemented the resulting rules and had 21 users interact with two versions of the system. Overall they preferred the version in which the prosodic forms of the acknowledgments were chosen to be suitable for each specific context. This suggests that simple adjustments to system prosody based on local context can have value to users. Intonation Segments and Segmental Intonation Oliver Niebuhr; LPL, France Wed-Ses4-P4-7, Time: 13:30 Intonational Features for Identifying Regional Accents of Italian Michelina Savino; Università di Bari, Italy Wed-Ses4-P4-4, Time: 13:30 Aim of this paper is providing a preliminary account of some intonational features useful for identifying a large number of Italian accents, estimated as representative of Italian regional variation, by analysing a corpus of comparable speech materials consisting of Map Task dialogues. Analysis concentrates on the intonational characteristics of yes-no questions, which can be realised very differently across varieties, whereas statements are generally characterised by a (low) falling final movement. Results of this preliminary investigation indicate that intonational features useful for identifying Italian regional accents are the tune type (rising-falling vs falling-rising vs rising), and the nuclear peak alignment in rising-falling contours (mid vs late). Analysis and Recognition of Accentual Patterns Agnieszka Wagner; Adam Mickiewicz University, Poland Wed-Ses4-P4-5, Time: 13:30 This study proposes a framework of automatic analysis and recognition of accentual patterns. In the first place we present the results of analyses which aimed at identification of acoustic cues signaling prominent syllables and different pitch accent types distinguished at the surface-phonological level. The resulting representation provides a framework of analysis of accentual patterns at the acoustic-phonetic level. 
The representation is compact — it consists of 13 acoustic features, has low redundancy — the features can not be derived from one another and wide coverage — it encodes distinctions between perceptually different utterances. Next, we train statistical models to automatically determine accentual patterns of utterances using the acoustic-phonetic representation which involves two steps: detection of accentual prominence and assigning pitch accent types to prominent syllables. The efficiency of the best models consists in achieving high accuracy (above 80% on average) using small acoustic feature vectors. An acoustic analysis of a German dialogue corpus showed that the sound qualities and durations of fricatives, vocoids, and diphthongs at the ends of question and statement utterances varied systematically with the utterance-final intonation segments, which were high-rising in the questions and terminal- falling in the statements. The ways in which the variations relate to phenomena like sibilant/spectral pitch and intrinsic F0 suggest that they are meant to support the pitch course. Thus, they may be called segmental intonations. The Phrase-Final Accent in Kammu: Effects of Tone, Focus and Engagement David House 1 , Anastasia Karlsson 2 , Jan-Olof Svantesson 2 , Damrong Tayanin 2 ; 1 KTH, Sweden; 2 Lund University, Sweden Wed-Ses4-P4-8, Time: 13:30 The phrase-final accent can typically contain a multitude of simultaneous prosodic signals. In this study, aimed at separating the effects of lexical tone from phrase-final intonation, phrase-final accents of two dialects of Kammu were analyzed. Kammu, a MonKhmer language spoken primarily in northern Laos, has dialects with lexical tones and dialects with no lexical tones. Both dialects seem to engage the phrase-final accent to simultaneously convey focus, phrase finality, utterance finality, and speaker engagement. Both dialects also show clear evidence of truncation phenomena. These results have implications for our understanding of the interaction between tone, intonation and phrase-finality. Tonal Alignment in Three Varieties of Hiberno-English Raya Kalaldeh, Amelie Dorn, Ailbhe Ní Chasaide; Trinity College Dublin, Ireland Wed-Ses4-P4-9, Time: 13:30 This pilot study investigates the tonal alignment of pre-nuclear (PN) and nuclear (N) accents in three Hiberno-English (HE) regional varieties: Dublin, Drogheda, and Donegal English. The peak alignment is investigated as a function of the number of unstressed syllables before PN and after N. Dublin and Drogheda English appear to a have fixed peak alignment in both nuclear and Notes 148 pre-nuclear conditions. Donegal English, however, shows a drift in peak alignment in nuclear and pre-nuclear conditions. Findings also show that the peak is located earlier in nuclear and later in pre-nuclear conditions across the three dialects. Is Tonal Alignment Interpretation Independent of Methodology? 
Caterina Petrone 1 , Mariapaola D’Imperio 2 ; 1 ZAS, Germany; 2 LPL, France Wed-Ses4-P4-13, Time: 13:30 Determining Intonational Boundaries from the Acoustic Signal Lourdes Aguilar 1 , Antonio Bonafonte 2 , Francisco Campillo 3 , David Escudero 4 ; 1 Universitat Autònoma de Barcelona, Spain; 2 Universitat Politècnica de Catalunya, Spain; 3 Universidade de Vigo, Spain; 4 Universidad de Valladolid, Spain Wed-Ses4-P4-10, Time: 13:30 This article has two-fold aims: it reports firstly the improvement of a speech database in Catalan for speech synthesis (Festcat) with the information about prosodic boundaries using the break index labels proposed in the ToBI system; and secondly, it presents the experiments undergone to determine the acoustic markers that can differentiate among the break-indexes. Several experiments using different classification techniques were performed in order to compare the relative merit of different attributes to characterize breaks. Results show that the prosodic phrase breaks are correlated with: presence of a pause, lengthening of the pre-break syllable and the F0 contour of the span between the stressed syllable and the following post-stressed, if there are, immediately preceding the break. Compression and Truncation Revisited Claudia K. Ohl, Hartmut R. Pfitzinger; Christian-Albrechts-Universität zu Kiel, Germany Tonal target detection is a very difficult task, especially in presence of consonantal perturbations. Though different detection methods have been adopted in tonal alignment research, we still do not know which is the most reliable. In our paper, we found that such methodological choices have serious theoretical implications. Interpretation of the data strongly depends on whether tonal targets have been detected by a manual, a semi-automatic or an automatic procedure. Moreover, different segmental classes can affect target placement especially in automatic detection. This suggests the importance of keeping segmental classes separate for the purpose of statistical analysis. Modeling the Intonation of Topic Structure: Two Approaches Margaret Zellers 1 , Brechtje Post 1 , Mariapaola D’Imperio 2 ; 1 University of Cambridge, UK; 2 LPL, France Wed-Ses4-P4-14, Time: 13:30 Intonational variation is widely regarded as a source of information about the topic structure of spoken discourse. However, many factors other than topic can influence this variation. We compared two models of intonation in terms of their ability to account for these other sources of variation. In dealing with this variation, the models paint different pictures of the intonational correlates of topic. Wed-Ses4-P4-11, Time: 13:30 This paper investigates the influence of varying segmental structures on the realizations of utterance-final rising and falling intonation contours. Following Grabe’s study on adjustment strategies in German, i.e. truncation and compression, a similar experiment was carried out, using materials with decreasing stretches of voicing in questions, lists, and statements. However, the results presented in the present paper could not confirm the idea of such common adjustment strategies. Instead, considerable variation was found as to how the phrase-final intonation contours were adjusted to the respective amounts of voicing: the strategies varied strongly across different word groups. 
Comparison of Fujisaki-Model Extractors and F0 Stylizers
Hartmut R. Pfitzinger 1, Hansjörg Mixdorff 2, Jan Schwarz 1; 1 Christian-Albrechts-Universität zu Kiel, Germany; 2 BHT Berlin, Germany
Wed-Ses4-P4-12, Time: 13:30

This study compares four automatic methods for estimating Fujisaki-model parameters. Since interpolation and smoothing are necessary prerequisites for all approaches, their fitting accuracies are also compared with that of a novel stylisation method. A hand-corrected set of results from one of the methods, created on linguistic grounds, served as a second benchmark. Although the four methods yield comparable results with respect to their total errors, they show different error distributions. The manually corrected version provided a poorer approximation of the F0 contours than the automatic one.

Wed-Ses3-S1 : Special Session: Machine Learning for Adaptivity in Spoken Dialogue Systems
Ainsworth (East Wing 4), 16:00, Wednesday 9 Sept 2009
Chair: Oliver Lemon, University of Edinburgh, UK and Olivier Pietquin, Supélec, France

A User Modeling-Based Performance Analysis of a Wizarded Uncertainty-Adaptive Dialogue System Corpus
Kate Forbes-Riley, Diane Litman; University of Pittsburgh, USA
Wed-Ses3-S1-1, Time: 16:00

Motivated by prior spoken dialogue system research in user modeling, we analyze interactions between performance and user class in a dataset previously collected with two wizarded spoken dialogue tutoring systems that adapt to user uncertainty. We focus on user classes defined by expertise level and gender, and on both objective (learning) and subjective (user satisfaction) performance metrics. We find that lower expertise users learn best from one adaptive system but prefer the other, while higher expertise users learned more from one adaptive system but didn't prefer either. Female users both learn best from and prefer the same adaptive system, while males preferred one adaptive system but didn't learn more from either. Our results yield an empirical basis for future investigations into whether adaptive system performance can improve by adapting to user uncertainty differently based on user class.

Using Dialogue-Based Dynamic Language Models for Improving Speech Recognition
Juan Manuel Lucas-Cuesta, Fernando Fernández, Javier Ferreiros; Universidad Politécnica de Madrid, Spain
Wed-Ses3-S1-2, Time: 16:20

We present a new approach to dynamically create and manage different language models to be used in a spoken dialogue system. We apply an interpolation-based approach, using several measures obtained by the Dialogue Manager to decide which LMs the system will interpolate and also to estimate the interpolation weights. We propose to use not only semantic information (the concepts extracted from each recognized utterance), but also information obtained by the dialogue manager module (DM), that is, the objectives or goals the user wants to fulfill, and the proper classification of those concepts according to the inferred goals. The experiments we have carried out show improvements in word error rate when using the parsed concepts and the inferred goals from a speech utterance for rescoring the same utterance.
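A minimal sketch of dialogue-driven language model interpolation in the spirit of the abstract above; the unigram tables and the rule mapping goal confidence to interpolation weights are invented for illustration and are not the authors' scheme.

```python
# Sketch: mix two tiny unigram LMs with weights derived from the dialogue state.
def interpolate(lms, weights, word):
    return sum(w * lm.get(word, 1e-6) for lm, w in zip(lms, weights))

lm_flights = {"book": 0.10, "flight": 0.20, "hotel": 0.01}
lm_hotels  = {"book": 0.10, "flight": 0.01, "hotel": 0.25}

# Suppose the dialogue manager has inferred the goal "book_flight" with
# confidence 0.8; turn that into interpolation weights (assumed mapping).
goal_confidence = 0.8
weights = [goal_confidence, 1.0 - goal_confidence]

for w in ("flight", "hotel"):
    print(w, interpolate([lm_flights, lm_hotels], weights, w))
```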
Reinforcement Learning for Dialog Management Using Least-Squares Policy Iteration and Fast Feature Selection
Lihong Li 1, Jason D. Williams 2, Suhrid Balakrishnan 2; 1 Rutgers University, USA; 2 AT&T Labs Research, USA
Wed-Ses3-S1-3, Time: 16:40

Reinforcement learning (RL) is a promising technique for creating a dialog manager. RL accepts features of the current dialog state and seeks to find the best action given those features. Although it is often easy to posit a large set of potentially useful features, in practice, it is difficult to find the subset which is large enough to contain useful information yet compact enough to reliably learn a good policy. In this paper, we propose a method for RL optimization which automatically performs feature selection. The algorithm is based on least-squares policy iteration, a state-of-the-art RL algorithm which is highly sample-efficient and can learn from a static corpus or on-line. Experiments in dialog simulation show it is more stable than a baseline RL algorithm taken from a working dialog system.

Hybridisation of Expertise and Reinforcement Learning in Dialogue Systems
Romain Laroche 1, Ghislain Putois 1, Philippe Bretier 1, Bernadette Bouchon-Meunier 2; 1 Orange Labs, France; 2 LIP6, France
Wed-Ses3-S1-4, Time: 17:00

This paper addresses the problem of introducing learning capabilities in industrial handcrafted automata-based Spoken Dialogue Systems, in order to help the developer to cope with dialogue strategy design tasks. While classical reinforcement learning algorithms position their learning at the dialogue move level, the fundamental idea behind our approach is to learn at a finer internal decision level (which question, which words, which prosody, ...). These internal decisions are made on the basis of different (distinct or overlapping) knowledge. This paper proposes a novel reinforcement learning algorithm that can be used to make a data-driven optimisation of such handcrafted systems. An experiment shows that the convergence can be up to 20 times faster than with Q-Learning.

Bayesian Learning of Confidence Measure Function for Generation of Utterances and Motions in Object Manipulation Dialogue Task
Komei Sugiura, Naoto Iwahashi, Hideki Kashioka, Satoshi Nakamura; NICT, Japan
Wed-Ses3-S1-5, Time: 17:20

This paper proposes a method that generates motions and utterances in an object manipulation dialogue task. The proposed method integrates belief modules for speech, vision, and motions into a probabilistic framework so that a user's utterances can be understood based on multimodal information. Responses to the utterances are optimized based on an integrated confidence measure function for the integrated belief modules. Bayesian logistic regression is used for the learning of the confidence measure function. The experimental results revealed that the proposed method reduced the failure rate from 12% down to 2.6% while the rejection rate was less than 24%.

Predicting How it Sounds: Re-Ranking Dialogue Prompts Based on TTS Quality for Adaptive Spoken Dialogue Systems
Cédric Boidin 1, Verena Rieser 2, Lonneke van der Plas 3, Oliver Lemon 2, Jonathan Chevelu 1; 1 Orange Labs, France; 2 University of Edinburgh, UK; 3 Université de Genève, Switzerland
Wed-Ses3-S1-6, Time: 17:40

This paper presents a method for adaptively re-ranking paraphrases in a Spoken Dialogue System (SDS) according to their predicted Text To Speech (TTS) quality. We collect data under 4 different conditions and extract a rich feature set of 55 TTS runtime features. We build predictive models of user ratings using linear regression with latent variables. We then show that these models transfer to a more specific target domain on a separate test set. All our models significantly outperform a random baseline. Our best performing model reaches the same performance as reported by previous work, but it requires 75% less annotated training data. The TTS re-ranking model is part of an end-to-end statistical architecture for Spoken Dialogue Systems developed by the EC FP7 CLASSiC project.
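A hedged sketch of re-ranking candidate prompts by predicted TTS quality, as the abstract above describes at a high level; the feature names, coefficients and candidate sentences are invented for illustration and do not correspond to the paper's 55-feature model.

```python
# Sketch: score each paraphrase with a linear model over TTS run-time
# features and speak the best-scoring one.
import numpy as np

coef = np.array([-0.02, -0.5, 0.3])      # e.g. #units, join cost, prosody match
intercept = 3.5

candidates = {
    "Your flight departs at 3 pm.":     np.array([28, 1.2, 0.9]),
    "The departure time is 3 pm.":      np.array([30, 0.8, 0.7]),
    "At 3 pm your flight will depart.": np.array([31, 1.5, 0.6]),
}

scores = {text: float(intercept + coef @ x) for text, x in candidates.items()}
best = max(scores, key=scores.get)
print("chosen prompt:", best)
```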
Thu-Ses1-O1 : Robust Automatic Speech Recognition III
Main Hall, 10:00, Thursday 10 Sept 2009
Chair: P.D. Green, University of Sheffield, UK

Accounting for the Uncertainty of Speech Estimates in the Complex Domain for Minimum Mean Square Error Speech Enhancement
Ramón Fernandez Astudillo, Dorothea Kolossa, Reinhold Orglmeister; Technische Universität Berlin, Germany
Thu-Ses1-O1-1, Time: 10:00

Uncertainty decoding and uncertainty propagation, or error propagation, techniques have emerged as a powerful tool to increase the accuracy of automatic speech recognition systems by employing an uncertain, or probabilistic, description of the speech features rather than the usual point estimate. In this paper we analyze the uncertainty generated in the complex Fourier domain when performing speech enhancement with the Wiener or Ephraim-Malah filters. We derive closed form solutions for the computation of the error of estimation and show that it provides a better insight into the origin of estimation uncertainty. We also show how the combination of such an error estimate with uncertainty propagation and uncertainty decoding or modified imputation yields superior recognition robustness when compared to conventional MMSE estimators with little increase in the computational cost.

Signal Separation for Robust Speech Recognition Based on Phase Difference Information Obtained in the Frequency Domain
Chanwoo Kim, Kshitiz Kumar, Bhiksha Raj, Richard M. Stern; Carnegie Mellon University, USA
Thu-Ses1-O1-2, Time: 10:00

In this paper, we present a new two-microphone approach that improves speech recognition accuracy when speech is masked by other speech. The algorithm improves on previous systems that have been successful in separating signals based on differences in arrival time of signal components from two microphones. The present algorithm differs from these efforts in that the signal selection takes place in the frequency domain. We observe that additional smoothing of the phase estimates over time and frequency is needed to support adequate speech recognition performance. We demonstrate that the algorithm described in this paper provides better recognition accuracy than time-domain-based signal separation algorithms, and at less than 10 percent of the computation cost.

Transforming Features to Compensate Speech Recogniser Models for Noise
R.C. van Dalen, F. Flego, M.J.F. Gales; University of Cambridge, UK
Thu-Ses1-O1-3, Time: 10:00

To make speech recognisers robust to noise, either the features or the models can be compensated. Feature enhancement is often fast; model compensation is often more accurate, because it predicts the corrupted speech distribution. It is therefore able, for example, to take uncertainty about the clean speech into account. This paper re-analyses the recently-proposed predictive linear transformations for noise compensation as minimising the KL divergence between the predicted corrupted speech and the adapted models. New schemes are then introduced which apply observation-dependent transformations in the front-end to adapt the back-end distributions. One applies transforms in the exact same manner as the popular minimum mean square error (MMSE) feature enhancement scheme, and is as fast. The new method performs better on Aurora 2.
Subband Temporal Modulation Spectrum Normalization for Automatic Speech Recognition in Reverberant Environments
Xugang Lu 1, Masashi Unoki 2, Satoshi Nakamura 1; 1 NICT, Japan; 2 JAIST, Japan
Thu-Ses1-O1-4, Time: 10:00

Speech recognition in reverberant environments is still a challenging problem. In this paper, we first investigated the reverberation effect on subband temporal envelopes by using the modulation transfer function (MTF). Based on the investigation, we proposed an algorithm which normalizes the subband temporal modulation spectrum (TMS) to reduce the diffusion effect of the reverberation. During the normalization, the subband TMS of both the clean and reverberated speech is normalized to a reference TMS calculated from a clean speech data set for each frequency subband. Based on the normalized subband TMS, the inverse Fourier transform was applied to restore the subband temporal envelopes while keeping their original phase information. We tested our algorithm on reverberated speech recognition tasks (in a reverberant room). For comparison, the traditional Mel-frequency cepstral coefficient (MFCC) and relative spectral filtering (RASTA) were used. Experimental results showed that the features extracted with the proposed normalization method yield an overall relative improvement of 80.64% in recognition rate.

Robust In-Car Spelling Recognition — A Tandem BLSTM-HMM Approach
Martin Wöllmer 1, Florian Eyben 1, Björn Schuller 1, Yang Sun 1, Tobias Moosmayr 2, Nhu Nguyen-Thien 3; 1 Technische Universität München, Germany; 2 BMW Group, Germany; 3 Continental Automotive GmbH, Germany
Thu-Ses1-O1-5, Time: 10:00

As an intuitive hands-free input modality automatic spelling recognition is especially useful for in-car human-machine interfaces. However, for today's speech recognition engines it is extremely challenging to cope with similar sounding spelling speech sequences in the presence of noises such as the driving noise inside a car. Thus, we propose a novel Tandem spelling recogniser, combining a Hidden Markov Model (HMM) with a discriminatively trained bidirectional Long Short-Term Memory (BLSTM) recurrent neural net. The BLSTM network captures long-range temporal dependencies to learn the properties of in-car noise, which makes the Tandem BLSTM-HMM robust with respect to speech signal disturbances at extremely low signal-to-noise ratios and mismatches between training and test noise conditions. Experiments considering various driving conditions reveal that our Tandem recogniser outperforms a conventional HMM by up to 33%.

Applying Non-Negative Matrix Factorization on Time-Frequency Reassignment Spectra for Missing Data Mask Estimation
Maarten Van Segbroeck, Hugo Van hamme; Katholieke Universiteit Leuven, Belgium
Thu-Ses1-O1-6, Time: 10:00

The application of Missing Data Theory (MDT) has been shown to improve the robustness of automatic speech recognition (ASR) systems. A crucial part in an MDT-based recognizer is the computation of the reliability masks from noisy data. To estimate accurate masks in environments with unknown, non-stationary noise statistics, we need to rely on a strong model for the speech.
In this paper, an unsupervised technique using non-negative matrix factorization (NMF) discovers phone-sized time-frequency patches into which speech can be decomposed. The input matrix for the NMF is constructed using a high resolution and reassigned time-frequency representation. This representation facilitates an accurate detection of the patches that are active in unseen noisy speech. After further denoising of the patch activations, speech and noise can be reconstructed from which missing feature masks are estimated. Recognition experiments on the Aurora2 database demonstrate the effectiveness of this technique. Notes 151 Prosodic Analysis of Foreign-Accented English Thu-Ses1-O2 : Prosody: Perception Hansjörg Mixdorff 1 , John Ingram 2 ; 1 BHT Berlin, Germany; 2 University of Queensland, Australia Jones (East Wing 1), 10:00, Thursday 10 Sept 2009 Chair: Yi Xu, University College London, UK Thu-Ses1-O2-4, Time: 11:00 Experiments on Automatic Prosodic Labeling Antje Schweitzer, Bernd Möbius; Universität Stuttgart, Germany Thu-Ses1-O2-1, Time: 10:00 This paper presents results from experiments on automatic prosodic labeling. Using the WEKA machine learning software [1], classifiers were trained to determine for each syllable in a speech database of a male speaker its pitch accent and its boundary tone. Pitch accents and boundaries are according to the GToBI(S) dialect, with slight modifications. Classification was based on 35 attributes involving PaIntE F0 parametrization [2] and normalized phone durations, but also some phonological information as well as higher-linguistic information. Several classification algorithms yield results of approx. 78% accuracy on the word level for pitch accents, and approx. 88% accuracy on the word level for phrase boundaries, which compare very well to results of other studies. The classifiers generalize to similar data of a female speaker in that they perform equally well as classifiers trained directly on the female data. German Boundary Tones Show Categorical Perception and a Perceptual Magnet Effect When Presented in Different Contexts Katrin Schneider, Grzegorz Dogil, Bernd Möbius; Universität Stuttgart, Germany Thu-Ses1-O2-2, Time: 10:20 The experiment presented in this paper examines categorical perception as well as the perceptual magnet effect in German boundary tones, taking also context information into account. The test phrase is preceded by different context sentences that are assumed to affect the location of the category boundary in the stimulus continuum between the low and the high boundary tone. Results provide evidence for the existence of a low and a high boundary tone in German, corresponding to statement versus question interpretation, respectively. Furthermore, in contrast to previous findings, a prototype was found not only in the category of the low but also in the category of the high boundary tone, supporting the hypothesis that context might have been taken into account to solve a possible ambiguity between H% and a previously hypothesized non-low and non-terminal boundary tone. Eye Tracking for the Online Evaluation of Prosody in Speech Synthesis: Not So Fast! Michael White, Rajakrishnan Rajkumar, Kiwako Ito, Shari R. Speer; Ohio State University, USA Thu-Ses1-O2-3, Time: 10:40 This paper presents an eye-tracking experiment comparing the processing of different accent patterns in unit selection synthesis and human speech. 
The synthetic speech results failed to replicate the facilitative effect of contextually appropriate accent patterns found with human speech, while producing a more robust intonational garden-path effect with contextually inappropriate patterns, both of which could be due to processing delays seen with the synthetic speech. As the synthetic speech was of high quality, the results indicate that eye tracking holds promise as a highly sensitive and objective method for the online evaluation of prosody in speech synthesis. This study compares utterances by Vietnamese learners of Australian English with those of native subjects. In a previous study the utterances had been rated for foreign accent and intelligibility. We aim to find measurable prosodic differences accounting for the perceptual results. Our outcomes indicate, inter alia, that unaccented syllables are relatively longer compared with accented ones in the Vietnamese corpus than those in the Australian English corpus. Furthermore, the correlations of syllabic durations in utterances of one and the same sentence are much higher for Australian English subjects than for Vietnamese learners of English. Vietnamese speakers use a larger range of f0 and produce more pitch-accents than Australian speakers. Perception of the Evolution of Prosody in the French Broadcast News Style Philippe Boula de Mareüil, Albert Rilliard, Alexandre Allauzen; LIMSI, France Thu-Ses1-O2-5, Time: 11:20 This study makes use of advances in automatic speech processing to analyse French audiovisual archives and the perception of the journalistic style evolution regarding prosody. Three perceptual experiments were run, using prosody transplantation, delexicalisation and imitation. Results show that the fundamental frequency and duration correlates of prosody enable old-fashioned recordings to be distinguished from more recent ones. The higher the pitch is and the more there are pitch movements on syllables which may be interpreted as word-initially stressed, the more speech samples are perceived as dating back to the 40s or the 50s. Prosodic Effects on Vowel Production: Evidence from Formant Structure Yoonsook Mo, Jennifer Cole, Mark Hasegawa-Johnson; University of Illinois at Urbana-Champaign, USA Thu-Ses1-O2-6, Time: 11:40 Speakers communicate pragmatic and discourse meaning through the prosodic form assigned to an utterance, and listeners must attend to the acoustic cues to prosodic form to fully recover the speaker’s intended meaning. While much of the research on prosody examines supra-segmental cues such as F0 and temporal patterns, prosody is also known to affect the phonetic properties of segments as well. This paper reports on the effect of prosodic prominence on the formant patterns of vowels using speech data from the Buckeye corpus of spontaneous American English. A prosody annotation was obtained for a subset of this corpus based on the auditory perception of 97 ordinary, untrained listeners. To understand the relationship between prominence perception and formant structure, as a measure of the ‘strength’ of the vowel articulation, we measure the steady-state first and second formants of stressed vowels at vowel mid-points for monophthongs and at both 10% (nucleus) and 90% (glide) positions for diphthongs. Two hypotheses about the articulatory mechanism that implements prominence (Hyperarticulation vs. 
Sonority Expansion Hypothesis) were evaluated using Pearson’s bivariate correlation analyses with formant values and prominence ‘scores’ — a novel perceptual measure of prominence. The findings demonstrate that higher F1 values correlate with higher prominence scores regardless of vowel height, confirming that vowels perceived as prominent tend to have enhanced sonority. In the frontness dimension, on the other hand, the results show that vowels perceived as prominent tend to be hyperarticulated. These results support the model of the supra-laryngeal implementation of prominence proposed in [5, 6] based on controlled “laboratory” speech, and demonstrate that the model can be extended to cover prosody in spontaneous speech using a continuous-valued measure of prosodic prominence. The evidence reported here from spontaneous speech shows that prominent vowels have expanded sonority regardless of vowel height, and are hyperarticulated only when hyperarticulation does not interfere with sonority expansion. Thu-Ses1-O3 : Segmentation and Classification Fallside (East Wing 2), 10:00, Thursday 10 Sept 2009 Chair: Stephen J. Cox, University of East Anglia, UK An Adaptive BIC Approach for Robust Audio Stream Segmentation Janez Žibert 1 , Andrej Brodnik 1 , France Mihelič 2 ; 1 University of Primorska, Slovenia; 2 University of Ljubljana, Slovenia Speaker Segmentation and Clustering for Simultaneously Presented Speech Lingyun Gu, Richard M. Stern; Carnegie Mellon University, USA Thu-Ses1-O3-4, Time: 11:00 Thu-Ses1-O3-1, Time: 10:00 In this paper we focus on audio segmentation. We present a novel method for robust estimation of decision-thresholds for accurate detection of acoustic change points in continuous audio streams. In standard segmentation procedures the decision-thresholds are usually set in advance and need to be tuned on development data. In the presented approach we remove the need for pre-determined decision-thresholds and propose a method for estimating the thresholds directly from the currently processed audio data. It employs change-detection methods from two well-established audio segmentation approaches based on the Bayesian Information Criterion. Following from that, we develop two audio segmentation procedures, which enable us to adaptively tune boundary-detection thresholds and to combine different audio representations in the segmentation process. The proposed segmentation procedures are tested on broadcast news audio data. Improving the Robustness of Phonetic Segmentation to Accent and Style Variation with a Two-Staged Approach Vaishali Patil, Shrikant Joshi, Preeti Rao; IIT Bombay, India Thu-Ses1-O3-2, Time: 10:20 Correct and temporally accurate phonetic segmentation of speech utterances is important in applications ranging from transcription alignment to pronunciation error detection. Automatic speech recognizers used in these tasks provide insufficient temporal alignment accuracy, apart from a recognition performance that is sensitive to accent and style variations from the training data. A two-staged approach combining HMM broad-class recognition with acoustic-phonetic knowledge-based refinement is evaluated for phonetic segmentation accuracy in the context of accent and style mismatches with training data.
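The change detection used by the BIC-based segmentation approaches described in the adaptive segmentation abstract above is commonly scored with a delta-BIC criterion. The following Python fragment is a minimal sketch of that standard criterion only, not the authors' adaptive thresholding; the window sizes, the penalty weight lam and the toy MFCC-like features are assumptions made purely for illustration.

    import numpy as np

    def delta_bic(X, Y, lam=1.0):
        """Delta-BIC score for a candidate change point between feature blocks
        X and Y (rows = frames, columns = feature dimensions). Positive values
        favour placing a boundary between the two blocks."""
        Z = np.vstack([X, Y])
        n, d = Z.shape

        def logdet_cov(A):
            # log-determinant of the sample covariance, lightly regularised
            cov = np.cov(A, rowvar=False) + 1e-6 * np.eye(A.shape[1])
            return np.linalg.slogdet(cov)[1]

        penalty = 0.5 * lam * (d + 0.5 * d * (d + 1)) * np.log(n)
        return (0.5 * n * logdet_cov(Z)
                - 0.5 * len(X) * logdet_cov(X)
                - 0.5 * len(Y) * logdet_cov(Y)
                - penalty)

    # Toy usage: score every candidate boundary inside a sliding window of features.
    frames = np.random.randn(300, 13)                  # stand-in for MFCC frames
    scores = [delta_bic(frames[:t], frames[t:]) for t in range(50, 250)]
    boundary = 50 + int(np.argmax(scores))             # accept only if the score clears a threshold

In the standard procedure the acceptance threshold (or, equivalently, the penalty weight) is tuned on development data; the adaptive approach described above instead estimates it from the audio stream currently being processed.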
Signature Cluster Model Selection for Incremental Gaussian Mixture Cluster Modeling in Agglomerative Hierarchical Speaker Clustering Agglomerative hierarchical speaker clustering (AHSC) has been widely used for classifying speech data by speaker characteristics. Its bottom-up, one-way structure of merging the closest cluster pair at every recursion step, however, makes it difficult to recover from incorrect merging. Hence, making AHSC robust to incorrect merging is an important issue. In this paper we address this problem in the framework of AHSC based on incremental Gaussian mixture models, which we previously introduced for better representing variable cluster size. Specifically, to minimize contamination in cluster models by heterogeneous data, we select and keep updating a representative (or signature) model for each cluster during AHSC. Experiments on meeting speech excerpts (4 hours total) verify that the proposed approach improves average speaker clustering performance by approximately 20% (relative). This paper proposes a new scheme used to segment and cluster speech segments on an unsupervised basis in cases where multiple speakers are presented simultaneously at different SNRs. The new elements in our work are the development of a new feature for segmenting and clustering simultaneously-presented speech, the procedure for identifying a candidate set of possible speaker-change points, and the use of pair-wise cross-segment distance distributions to cluster segments by speaker. The proposed system is evaluated in terms of the F measure that is obtained. The system is compared to a baseline system that uses MFCC for acoustic features, the Bayesian Information Criterion (BIC) for detecting speaker-change points, and the Kullback-Leibler distance for clustering the segments. Experimental results indicate that the new system consistently provides better performance than the baseline system with very small computational cost. Trimmed KL Divergence Between Gaussian Mixtures for Robust Unsupervised Acoustic Anomaly Detection Nash Borges, Gerard G.L. Meyer; Johns Hopkins University, USA Thu-Ses1-O3-5, Time: 11:20 In previous work [1], we presented several implementations of acoustic anomaly detection by training a model on purely normal data and estimating the divergence between it and other input. Here, we reformulate the problem in an unsupervised framework and allow for anomalous contamination of the training data. We focus exclusively on methods employing Gaussian mixture models (GMMs) since they are often used in speech processing systems. After analyzing what caused the Kullback-Leibler (KL) divergence between GMMs to break down in the face of training contamination, we came up with a promising solution. By trimming one quarter of the most divergent Gaussians from the mixture model, we significantly outperformed the untrimmed approximation for contamination levels of 10% and above, reducing the equal error rate from 33.8% to 6.4% at 33% contamination. The performance of the trimmed KL divergence showed no significant dependence on the investigated contamination levels. Kyu J. Han, Shrikanth S. Narayanan; University of Southern California, USA Thu-Ses1-O3-3, Time: 10:40 How to Loose Confidence: Probabilistic Linear Machines for Multiclass Classification Results of the N-Best 2008 Dutch Speech Recognition Evaluation Hui Lin 1 , Jeff Bilmes 1 , Koby Crammer 2 ; 1 University of Washington, USA; 2 University of Pennsylvania, USA David A.
van Leeuwen 1 , Judith Kessens 1 , Eric Sanders 2 , Henk van den Heuvel 2 ; 1 TNO Human Factors, The Netherlands; 2 SPEX, The Netherlands Thu-Ses1-O3-6, Time: 11:40 In this paper we propose a novel multiclass classifier called the probabilistic linear machine (PLM) which overcomes the lowentropy problem of exponential-based classifiers. Although PLMs are linear classifiers, we use a careful design of the parameters matched with weak requirements over the features to output a true probability distribution over labels given an input instance. We cast the discriminative learning problem as linear programming, which can scale up to large problems on the order of millions of training samples. Our experiments on phonetic classification show that PLM achieves high entropy while maintaining a comparable accuracy to other state-of-the-art classifiers. Thu-Ses1-O4-3, Time: 10:40 In this paper we report the results of a Dutch speech recognition system evaluation held in 2008. The evaluation contained material in two domains: Broadcast News (BN) and Conversational Telephone Speech (CTS) and in two main accent regions (Flemish and Dutch). In total 7 sites submitted recognition results to the evaluation, totalling 58 different submissions in the various conditions. Best performances ranged from 15.9% word error rate for BN, Flemish to 46.1% for CTS, Flemish. This evaluation is the first of its kind for the Dutch language. SHoUT, the University of Twente Submission to the N-Best 2008 Speech Recognition Evaluation for Dutch Thu-Ses1-O4 : Evaluation & Standardisation of SL Technology and Systems Holmes (East Wing 3), 10:00, Thursday 10 Sept 2009 Chair: Sebastian Möller, Deutsche Telekom Laboratories, Germany Quantifying Wideband Speech Codec Degradations via Impairment Factors: The New ITU-T P.834.1 Methodology and its Application to the G.711.1 Codec Marijn Huijbregts, Roeland Ordelman, Laurens van der Werff, Franciska M.G. de Jong; University of Twente, The Netherlands Thu-Ses1-O4-4, Time: 11:00 Sebastian Möller 1 , Nicolas Côté 1 , Atsuko Kurashima 2 , Noritsugu Egi 2 , Akira Takahashi 2 ; 1 Deutsche Telekom Laboratories, Germany; 2 NTT Corporation, Japan Thu-Ses1-O4-1, Time: 10:00 Wideband speech codecs usually provide better perceptual speech quality than their narrowband counterparts, but they still degrade quality compared to an uncoded transmission path. In order to quantify these degradations, a new methodology is presented which derives a one-dimensional quality index on the basis of instrumental measurements. This index can be used to rank different wideband speech codecs according to their degradations and to calculate overall quality in conjunction with other degradations, like packet loss. We apply this methodology to derive respective indices for the new G.711.1 codec. SUXES — User Experience Evaluation Method for Spoken and Multimodal Interaction Markku Turunen, Jaakko Hakulinen, Aleksi Melto, Tomi Heimonen, Tuuli Laivo, Juho Hella; University of Tampere, Finland Thu-Ses1-O4-2, Time: 10:20 Much work remains to be done with subjective evaluations of speech-based and multimodal systems. In particular, user experience is still hard to evaluate. SUXES is an evaluation method for collecting subjective metrics with user experiments. It captures both user expectations and user experiences, making it possible to analyze the state of the application and its interaction methods, and compare results. 
We present the SUXES method with examples of user experiments with different applications and modalities. In this paper we present our primary submission to the first Dutch and Flemish large vocabulary continuous speech recognition benchmark, N-Best. We describe our system workflow, the models we created for the four evaluation tasks and how we approached the problem of compounding that is typical for a language such as Dutch. We present the evaluation results and our post-evaluation analysis. NIST 2008 Speaker Recognition Evaluation: Performance Across Telephone and Room Microphone Channels Alvin F. Martin, Craig S. Greenberg; NIST, USA Thu-Ses1-O4-5, Time: 11:20 We describe the 2008 NIST Speaker Recognition Evaluation, including the speech data used, the test conditions included, the participants, and some of the performance results obtained. This evaluation was distinguished by including as part of the required test condition interview type speech as well as conversational telephone speech, and speech recorded over microphone channels as well as speech recorded over telephone lines. Notable was the relative consistency of best system performance obtained over the different speech types, including those involving different types in training and test. Some comparison with performance in prior evaluations is also discussed. The Ester 2 Evaluation Campaign for the Rich Transcription of French Radio Broadcasts Sylvain Galliano 1 , Guillaume Gravier 2 , Laura Chaubard 1 ; 1 DGA, France; 2 AFCP, France Thu-Ses1-O4-6, Time: 11:40 This paper reports on the final results of the Ester 2 evaluation campaign held from 2007 to April 2009. The aim of this campaign was to evaluate automatic radio broadcasts rich transcription systems for the French language. The evaluation tasks were divided into three main categories: audio event detection and tracking (e.g., speech vs. music, speaker tracking), orthographic transcription, and information extraction. The paper describes the data provided Notes 154 for the campaign, the task definitions and evaluation protocols as well as the results. Soft Decision-Based Acoustic Echo Suppression in a Frequency Domain Thu-Ses1-P1 : Speech Coding Yun-Sik Park, Ji-Hyun Song, Jae-Hun Choi, Joon-Hyuk Chang; Inha University, Korea Thu-Ses1-P1-4, Time: 10:00 Hewison Hall, 10:00, Thursday 10 Sept 2009 Chair: Børge Lindberg, Aalborg University, Denmark Differential Vector Quantization of Feature Vectors for Distributed Speech Recognition Jose Enrique Garcia, Alfonso Ortega, Antonio Miguel, Eduardo Lleida; Universidad de Zaragoza, Spain Thu-Ses1-P1-1, Time: 10:00 Distributed speech recognition arises for solving computational limitations of mobile devices like PDAs or mobile phones. Due to bandwidth restrictions, it is necessary to develop efficient transmission techniques of acoustic features in Automatic Speech Recognition applications. This paper presents a technique for compressing acoustic feature vectors based on Differential Vector Quantization. It is a combination of Vector Quantization and Differential encoding schemes. Recognition experiments have been carried out, showing that the proposed method outperforms the ETSI standard VQ system, and classical VQ schemes for different codebook lengths and situations. With the proposed scheme, bit rates as low as 2.1 kbps can be used without decreasing the performance of the ASR system in terms of WER compared with a system without quantization. 
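The differential vector quantization idea in the distributed speech recognition abstract above can be sketched in a few lines. This is only an illustrative Python fragment under assumed details (a single codebook applied to frame-to-frame feature differences, nearest-neighbour search with a squared Euclidean distance, closed-loop prediction from the decoder's reconstruction); it is not the authors' encoder.

    import numpy as np

    def dvq_encode(features, codebook):
        """Quantise the difference between each feature vector and the decoder's
        reconstruction of the previous one; return the codebook indices."""
        indices = []
        prev = np.zeros(features.shape[1])        # decoder state, identical at both ends
        for x in features:
            residual = x - prev
            idx = int(np.argmin(np.sum((codebook - residual) ** 2, axis=1)))
            indices.append(idx)
            prev = prev + codebook[idx]           # track what the decoder will reconstruct
        return indices

    def dvq_decode(indices, codebook, dim):
        prev, out = np.zeros(dim), []
        for idx in indices:
            prev = prev + codebook[idx]
            out.append(prev.copy())
        return np.array(out)

    # Toy usage: a 256-entry codebook gives 8 bits per frame of 13-dimensional features.
    codebook = 0.1 * np.random.randn(256, 13)
    mfcc = np.random.randn(100, 13)
    reconstructed = dvq_decode(dvq_encode(mfcc, codebook), codebook, 13)

Encoding the residual rather than the feature vector itself exploits the strong frame-to-frame correlation of ASR features, which is what allows the bit rate to be pushed down without hurting recognition performance.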
Arithmetic Coding of Sub-Band Residuals in FDLP Speech/Audio Codec Petr Motlicek 1 , Sriram Ganapathy 2 , Hynek Hermansky 2 ; 1 IDIAP Research Institute, Switzerland; 2 Johns Hopkins University, USA Thu-Ses1-P1-2, Time: 10:00 A speech/audio codec based on Frequency Domain Linear Prediction (FDLP) exploits auto-regressive modeling to approximate instantaneous energy in critical frequency sub-bands of relatively long input segments. The current version of the FDLP codec operating at 66 kbps has been shown to provide comparable subjective listening quality results to state-of-the-art codecs on similar bit-rates even without employing standard blocks such as entropy coding or simultaneous masking. This paper describes an experimental work to increase compression efficiency of the FDLP codec by employing entropy coding. Unlike conventional Huffman coding employed in current speech/audio coding systems, we describe an efficient way to exploit arithmetic coding to entropy compress quantized spectral magnitudes of the sub-band FDLP residuals. Such an approach provides 11% (∼ 3 kbps) bit-rate reduction compared to the Huffman coding algorithm (∼ 1 kbps). Pitch Variation Estimation Tom Bäckström, Stefan Bayer, Sascha Disch; Fraunhofer IIS, Germany In this paper, we propose a novel acoustic echo suppression (AES) technique based on soft decision in a frequency domain. The proposed approach provides an efficient and unified framework for such procedures as AES gain computation, AES gain modification using soft decision, and estimation of relevant parameters based on the same statistical model assumption of the near-end and far-end signal instead of the conventional strategies requiring the additional residual echo suppression (RES) step. Performances of the proposed AES algorithm are evaluated by objective tests under various environments and better results compared with the conventional AES method are obtained. Fine-Granular Scalable MELP Coder Based on Embedded Vector Quantization Mouloud Djamah, Douglas O’Shaughnessy; INRS-EMT, Canada Thu-Ses1-P1-5, Time: 10:00 This paper presents an efficient codebook design for treestructured vector quantization (TSVQ), which is embedded in nature. The federal standard MELP (mixed excitation linear prediction) speech coder is modified by replacing the original single stage vector quantizer for Fourier magnitudes with a TSVQ and the original multistage vector quantizer (MSVQ) for line spectral frequencies (LSF’s) with a multistage TSVQ (MTVQ). The modified coder is fine-granular bit-rate scalable with gradual change in quality for the synthetic speech when the number of bits available for LSF and Fourier magnitudes decoding is decremented bit-by-bit. Joint Quantization Strategies for Low Bit-Rate Sinusoidal Coding Emre Unver, Stephane Villette, Ahmet Kondoz; University of Surrey, UK Thu-Ses1-P1-6, Time: 10:00 Transparent speech quality has not been achieved at low bit rates, especially at 2.4 kbps and below, which is an area of interest for military and security applications. In this paper, strategies for low bit rate sinusoidal coding are discussed. Previous work in the literature on using metaframes and performing variable bit allocation according to the metaframe type is extended. An optimum metaframe size compromise between delay and quantization gains is found. A new method for voicing determination from the LPC shape is also presented. 
The proposed techniques have been applied to the SB-LPC vocoder to produce speech at 1.2/0.8 kbps, and compared to the original SB-LPC vocoder at 2.4/1.2 kbps as well as an established standard (MELP) at 2.4/1.2/0.6 kbps in a listening test. It has been found that the proposed techniques have been effective in reducing the bit-rate while not compromising the speech quality. Thu-Ses1-P1-3, Time: 10:00 A method for estimating the normalised pitch variation is described. While pitch tracking is a classical problem, in applications where the pitch magnitude is not required but only the change in pitch, all the main problems of pitch tracking can be avoided, such as octave jumps and intricate peak-finding heuristics. The presented approach is efficient, accurate and unbiased. It was developed for use in speech and audio coding for pitch variation compensation, but can also be used as additional information for pitch tracking. Steganographic Band Width Extension for the AMR Codec of Low-Bit-Rate Modes Akira Nishimura; Tokyo University of Information Sciences, Japan Thu-Ses1-P1-7, Time: 10:00 This paper proposes a bandwidth extension (BWE) method for the AMR narrow-band speech codec using steganography, which Notes 155 is called steganographic BWE herein. The high-band information is embedded into the pitch delay data of the AMR codec using an extended quantization-based method that achieves increased embedding capacity and higher perceived sound quality than the previous steganographic method. The target bit-rate mode is below 7 kbps, the level below which the previous steganographic BWE method did not maintain adequate sound quality. The sound quality of the steganographic BWE speech signals decoded from the embedded bitstream is comparable to that of the wide-band speech signals of the AMR-WB codec at a bit rate of less than 6.7 kbps, with only a slight degradation in the quality relative to speech signals decoded from the same bitstream by the legacy AMR decoder. subvector information for the mel-frequency cepstral coefficients (MFCCs) is then added as an error protection code. At the same time, Huffman coding methods are applied to compressed MFCCs to prevent the bit-rate increase by using such protection codes,. Different Huffman trees for MFCCs are designed according to the voicing class, subvector-wise, and their combinations. It is shown from the recognition experiments on the Aurora 4 large vocabulary database under several noisy channel conditions that the proposed FEC method is able to achieve the relative average word error rate (WER) reduction by 9.03∼17.81% compared with the standard DSR system using no FEC methods. Thu-Ses1-P2 : Voice Transformation II Ultra Low Bit-Rate Speech Coding Based on Unit-Selection with Joint Spectral-Residual Quantization: No Transmission of Any Residual Information Hewison Hall, 10:00, Thursday 10 Sept 2009 Chair: Tomoki Toda, NAIST, Japan HMM Adaptation and Voice Conversion for the Synthesis of Child Speech: A Comparison V. Ramasubramanian, D. Harish; Siemens Corporate Technology India, India Thu-Ses1-P1-8, Time: 10:00 A recent trend in ultra low bit-rate speech coding is based on segment quantization by unit-selection principle using large continuous codebooks as a unit database. We show that use of such large unit databases allows speech to be reconstructed at the decoder by using the best unit’s residual itself (in the unit database), thereby obviating the need to transmit any side information about the residual of the input speech. 
For this, it becomes necessary to jointly quantize the spectral and residual information at the encoder during unit selection, and we propose various composite measures for such a joint spectral-residual quantization within a unit-selection algorithm proposed earlier. We realize ultra low bit-rate speaker-dependent speech coding at an overall rate of 250 bits/sec using unit database sizes of 19 bits/unit (524288 phonelike units or about 6 hours of speech) with spectral distortions less than 2.5 dB that retains intelligibility, naturalness, prosody and speaker-identity. Oliver Watts 1 , Junichi Yamagishi 1 , Simon King 1 , Kay Berkling 2 ; 1 University of Edinburgh, UK; 2 Inline Internet Online Dienste GmbH, Germany Thu-Ses1-P2-1, Time: 10:00 This study compares two different methodologies for producing data-driven synthesis of child speech from existing systems that have been trained on the speech of adults. On one hand, an existing statistical parametric synthesiser is transformed using model adaptation techniques, informed by linguistic and prosodic knowledge, to the speaker characteristics of a child speaker. This is compared with the application of voice conversion techniques to convert the output of an existing waveform concatenation synthesiser with no explicit linguistic or prosodic knowledge. In a subjective evaluation of the similarity of synthetic speech to natural speech from the target speaker, the HMM-based systems evaluated are generally preferred, although this is at least in part due to the higher dimensional acoustic features supported by these techniques. On the Cost of Backward Compatibility for Communication Codecs HMM-Based Speaker Characteristics Emphasis Using Average Voice Model Konstantin Schmidt, Markus Schnell, Nikolaus Rettelbach, Manfred Lutzky, Jochen Issing; Fraunhofer IIS, Germany Takashi Nose, Junichi Adada, Takao Kobayashi; Tokyo Institute of Technology, Japan Thu-Ses1-P1-9, Time: 10:00 Super wideband (SWB) communication calls more and more attention as can be seen by the standardization activities of SWB extensions for well-established wideband codecs, e.g. G.722 or G.711.1. This paper presents a technical solution for extending the G.722 codec and compares the new technology to other standardized SWB codecs. Hereby, a closer look is given on the concept of extending technologies to more capabilities in contrast to non-backwards compatible solutions. A Media-Specific FEC Based on Huffman Coding for Distributed Speech Recognition Young Han Lee, Hong Kook Kim; GIST, Korea Thu-Ses1-P1-10, Time: 10:00 Thu-Ses1-P2-2, Time: 10:00 This paper presents a technique for controlling and emphasizing speaker characteristics of synthetic speech. The key idea comes from the way of imitating voice by professional impersonators. In the voice imitation, impersonators effectively utilize exaggeration of a target speaker’s voice characteristics. To model and control the degree of speaker characteristics, we use a speech synthesis framework based on multiple-regression hidden semi-Markov model (MRHSMM). In MRHSMM, mean parameters are given by multiple regression of a low-dimensional control vector. The control vector represents how much the target speaker’s model parameters are different from those of the average voice model. By changing the control vector in speech synthesis, we can control the degree of voice characteristics of the target speaker. 
Results of subjective experiments show that the speaker reproducibility of synthetic speech is improved by emphasizing speaker characteristics. In this paper, we propose a media-specific forward error correction (FEC) method based on Huffman coding for distributed speech recognition (DSR). In order to mitigate the performance degradation of DSR in noisy channel environments, the importance of each subvector for the DSR system is first explored. As a result, the first Observation of Empirical Cumulative Distribution of Vowel Spectral Distances and Its Application to Vowel Based Voice Conversion An Evaluation Methodology for Prosody Transformation Systems Based on Chirp Signals Damien Lolive, Nelly Barbot, Olivier Boeffard; IRISA, France Thu-Ses1-P2-3, Time: 10:00 Evaluation of prosody transformation systems is an important issue. First, the existing evaluation methodologies focus on parallel evaluation of systems and are not applicable for comparing parallel and non-parallel systems. Secondly, these methodologies do not guarantee independence from other features such as the segmental component. In particular, its influence cannot be neglected during evaluation and introduces a bias in the listening test. To address these problems, we propose an evaluation methodology that depends only on the melody of the voice and that is applicable in a non-parallel context. Given a melodic contour, we propose to build an audio whistle from a chirp signal model. Experimental results show the efficiency of the proposed method concerning the discrimination of voices using only their melody information. An example of a transformation function is also given and the results confirm the applicability of this methodology. Voice Morphing Based on Interpolation of Vocal Tract Area Functions Using AR-HMM Analysis of Speech Yoshiki Nambu, Masahiko Mikawa, Kazuyo Tanaka; University of Tsukuba, Japan Thu-Ses1-P2-4, Time: 10:00 This paper presents a new voice morphing method which focuses on the continuity of phonological identity over all inter- and extrapolated regions. The main features of the method are 1) separating the characteristics of vocal tract area resonances from those of vocal cord waves by using AR-HMM analysis of speech, 2) interpolation in a log vocal tract area function domain and 3) independent morphing of the vocal tract resonance and vocal cord wave characteristics. With a morphing system constructed on a statistical conversion method, the continuity of formants and the perceptual difference between a conventional method and the proposed method are confirmed. A Novel Model-Based Pitch Conversion Method for Mandarin Speech Hsin-Te Hwang, Chen-Yu Chiang, Po-Yi Sung, Sin-Horng Chen; National Chiao Tung University, Taiwan Thu-Ses1-P2-5, Time: 10:00 In this paper, a novel model-based pitch conversion method for Mandarin is presented and compared with two other conventional conversion methods, i.e. the mean/variance transformation approach and the GMM-based mapping approach. The syllable pitch contour is first quantized by 3rd-order orthogonal expansion coefficients; then, the source and target speakers’ prosodic models are constructed, respectively. Two mapping methods based on the prosodic model are presented. Objective tests confirmed that one of the proposed methods is superior to the conventional methods. Some findings from informal listening tests and objective tests are worth further investigation.
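For reference, the conventional mean/variance transformation that the Mandarin pitch-conversion abstract above uses as a baseline can be sketched as follows. This is a minimal illustration assuming conversion of voiced-frame F0 in the log domain with speaker statistics estimated beforehand; it is not the proposed model-based method.

    import numpy as np

    def mean_variance_f0_conversion(f0, src_stats, tgt_stats):
        """Classical mean/variance pitch transformation in the log-F0 domain.
        src_stats and tgt_stats are (mean, std) of log-F0 collected from
        training data of the source and target speakers."""
        mu_s, sd_s = src_stats
        mu_t, sd_t = tgt_stats
        f0 = np.asarray(f0, dtype=float)
        voiced = f0 > 0                       # unvoiced frames (F0 = 0) are left untouched
        out = f0.copy()
        out[voiced] = np.exp((np.log(f0[voiced]) - mu_s) / sd_s * sd_t + mu_t)
        return out

    # Example with assumed speaker statistics (log-Hz):
    src_stats = (np.log(120.0), 0.15)         # lower-pitched source speaker
    tgt_stats = (np.log(210.0), 0.25)         # higher-pitched target speaker
    contour = np.array([0.0, 118.0, 125.0, 130.0, 0.0, 122.0])
    converted = mean_variance_f0_conversion(contour, src_stats, tgt_stats)

The model-based methods in the abstract replace this global frame-wise scaling with mappings defined on syllable-level prosodic models, which lets them reshape whole syllable contours rather than adjust individual frame values.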
Hideki Kawahara 1 , Masanori Morise 2 , Toru Takahashi 3 , Hideki Banno 4 , Ryuichi Nisimura 1 , Toshio Irino 1 ; 1 Wakayama University, Japan; 2 Ritsumeikan University, Japan; 3 Kyoto University, Japan; 4 Meijo University, Japan Thu-Ses1-P2-6, Time: 10:00 A simple and fast voice conversion method based only on vowel information is proposed. The proposed method relies on empirical distribution of perceptual spectral distances between representative examples of each vowel segment extracted using TANDEM-STRAIGHT spectral envelope estimation procedure [1]. Mapping functions of vowel spectra are designed to preserve vowel space structure defined by the observed empirical distribution while transforming position and orientation of the structure in an abstract vowel spectral space. By introducing physiological constraints in vocal tract shapes and vocal tract length normalization, difficulties in careful frequency alignment between vowel template spectra of the source and the target speakers can be alleviated without significant degradations in converted speech. The proposed method is a frame-based instantaneous method and is relevant for real-time processing. Applications of the proposed method in-cross language voice conversion are also discussed. Japanese Pitch Conversion for Voice Morphing Based on Differential Modeling Ryuki Tachibana 1 , Zhiwei Shuang 2 , Masafumi Nishimura 1 ; 1 IBM Tokyo Research Lab, Japan; 2 IBM China Research Lab, China Thu-Ses1-P2-7, Time: 10:00 In this paper, we convert the pitch contours predicted by a TTS system that models a source speaker to resemble the pitch contours of a target speaker. When the speaking styles of the speakers are very different, complex conversions such as adding or deleting pitch peaks may be required. Our method does the conversions by modeling the direct pitch features and differential pitch features at the same time based on linguistic features. The differential pitch features are calculated from matched pairs of source and target pitch values. We show experimental results in which the target speaker’s characteristics are successfully modeled based on a very limited training corpus. The proposed pitch conversion method stretches the possibilities of TTS customization for various speaking styles. A Novel Technique for Voice Conversion Based on Style and Content Decomposition with Bilinear Models Victor Popa 1 , Jani Nurminen 2 , Moncef Gabbouj 1 ; 1 Tampere University of Technology, Finland; 2 Nokia Devices R&D, Finland Thu-Ses1-P2-8, Time: 10:00 This paper presents a novel technique for voice conversion by solving a two-factor task using bilinear models. The spectral content of the speech represented as line spectral frequencies is separated into so-called style and content parameterizations using a framework proposed in [1]. This formulation of the voice conversion problem in terms of style and content offers a flexible representation of factor interactions and facilitates the use of efficient training algorithms based on singular value decomposition Notes 157 and expectation maximization. Promising results in a comparison with the traditional Gaussian mixture model based method indicate increased robustness with small training sets. Rule-Based Voice Quality Variation with Formant Synthesis as other features, such as morphology and prosody. We evaluate the accuracy of our model at predicting syntactic information on the POS tagging task against state-of-the-art POS taggers and on perplexity against the ngram model. 
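The style/content decomposition in the bilinear-model voice conversion abstract above is the kind of factorisation that the SVD-based training it mentions produces. The fragment below is a rough sketch of an asymmetric bilinear factorisation of that general form; the matrix layout, the rank J and the pseudo-inverse content estimation are assumptions chosen for illustration, not the authors' training procedure.

    import numpy as np

    def fit_asymmetric_bilinear(Y, num_styles, dim, J):
        """Y has shape (num_styles*dim, num_content_classes): for every style
        (speaker) the mean LSF vector of each content class is stacked row-wise.
        Returns one style matrix per speaker and a content vector per class."""
        U, s, Vt = np.linalg.svd(Y, full_matrices=False)
        A_stacked = U[:, :J] * s[:J]                      # style factors
        B = Vt[:J, :]                                     # content factors
        A = [A_stacked[i * dim:(i + 1) * dim, :] for i in range(num_styles)]
        return A, B

    def convert(y_src, A_src, A_tgt):
        """Estimate the content of a source observation, then re-render it
        with the target speaker's style matrix."""
        b = np.linalg.pinv(A_src) @ y_src
        return A_tgt @ b

Separating the two factors in this way is what gives the approach its reported robustness with small training sets: the content basis is shared across speakers, so only the style factors need to be estimated per speaker.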
Improved Language Modelling Using Bag of Word Pairs Felix Burkhardt; Deutsche Telekom Laboratories, Germany Langzhou Chen, K.K. Chin, Kate Knill; Toshiba Research Europe Ltd., UK Thu-Ses1-P2-9, Time: 10:00 We describe an approach to simulate different phonation types, following John Laver’s terminology, by means of a hybrid (rulebased and unit concatenating) formant synthesizer. Different voice qualities were generated by following hints from the literature and applying the revised KLGLOTT88 model. Within a listener perception experiment, we show that the phonation types get distinguished by the listeners and lead to emotional impression as predicted by literature. The synthesis system and its source code, as well as audio samples can be downloaded at http://emoSyn.syntheticspeech.de/. Thu-Ses1-P3 : Automatic Speech Recognition: Language Models II Thu-Ses1-P3-3, Time: 10:00 The bag-of-words (BoW) method has been used widely in language modelling and information retrieval. A document is expressed as a group of words disregarding the grammar and the order of word information. A typical BoW method is latent semantic analysis (LSA), which maps the words and documents onto the vectors in LSA space. In this paper, the concept of BoW is extended to Bag-of-Word Pairs (BoWP), which expresses the document as a group of word pairs. Using word pairs as a unit, the system can capture more complex semantic information than BoW. Under the LSA framework, the BoWP system is shown to improve both perplexity and word error rate (WER) compared to a BoW system. Morphological Analysis and Decomposition for Arabic Speech-to-Text Systems Hewison Hall, 10:00, Thursday 10 Sept 2009 Chair: Mari Ostendorf, University of Washington, USA F. Diehl, M.J.F. Gales, M. Tomalin, P.C. Woodland; University of Cambridge, UK Multiple Text Segmentation for Statistical Language Modeling Thu-Ses1-P3-4, Time: 10:00 Sopheap Seng 1 , Laurent Besacier 1 , Brigitte Bigi 1 , Eric Castelli 2 ; 1 LIG, France; 2 MICA, Vietnam Thu-Ses1-P3-1, Time: 10:00 In this article we deal with the text segmentation problem in statistical language modeling for under-resourced languages with a writing system without word boundary delimiters. While the lack of text resources has a negative impact on the performance of language models, the errors introduced by the automatic word segmentation makes those data even less usable. To better exploit the text resources, we propose a method based on weighted finite state transducers to estimate the N-gram language model from the training corpus on which each sentence is segmented in multiple ways instead of a unique segmentation. The multiple segmentation generates more N-grams from the training corpus and allows obtaining the N-grams not found in unique segmentation. We use this approach to train the language models for automatic speech recognition systems of Khmer and Vietnamese languages and the multiple segmentations lead to a better performance than the unique segmentation approach. Measuring Tagging Performance of a Joint Language Model Denis Filimonov, Mary Harper; University of Maryland at College Park, USA Thu-Ses1-P3-2, Time: 10:00 Predicting syntactic information in a joint language model (LM) has been shown not only to improve the model at its main task of predicting words, but it also allows this information to be passed to other applications, such as spoken language processing. This raises the question of just how accurate the syntactic information predicted by the LM is. 
In this paper, we present a joint LM designed not only to scale to large quantities of training data, but also to be able to utilize fine-grain syntactic information, as well Language modelling for a morphologically complex language such as Arabic is a challenging task. Its agglutinative structure results in data sparsity problems and high out-of-vocabulary rates. In this work these problems are tackled by applying the MADA tools to the Arabic text. In addition to morphological decomposition, MADA performs context-dependent stem-normalisation. Thus, if word-level system combination, or scoring, is required this normalisation must be reversed. To address this, a novel context-sensitive method for morpheme-to-word conversion is introduced. The performance of the MADA decomposed system was evaluated on an Arabic broadcast transcription task. The MADA-based system out-performed the word-based system, with both the morphological decomposition and stem normalisation being found to be important. Investigating the Use of Morphological Decomposition and Diacritization for Improving Arabic LVCSR Amr El-Desoky, Christian Gollan, David Rybach, Ralf Schlüter, Hermann Ney; RWTH Aachen University, Germany Thu-Ses1-P3-5, Time: 10:00 One of the challenges related to large vocabulary Arabic speech recognition is the rich morphology nature of Arabic language which leads to both high out-of-vocabulary (OOV) rates and high language model (LM) perplexities. Another challenge is the absence of the short vowels (diacritics) from the Arabic written transcripts which causes a large difference between spoken and written language and thus a weaker connection between the acoustic and language models. In this work, we try to address these two important challenges by introducing both morphological decomposition and diacritization in Arabic language modeling. Finally, we are able to obtain about 3.7% relative reduction in word error rate (WER) with respect to a comparable non-diacritized full-words system running on our test set. Notes 158 Topic Dependent Language Model Based on Topic Voting on Noun History A Parallel Training Algorithm for Hierarchical Pitman-Yor Process Language Models Welly Naptali, Masatoshi Tsuchiya, Seiichi Nakagawa; Toyohashi University of Technology, Japan Songfang Huang, Steve Renals; University of Edinburgh, UK Thu-Ses1-P3-6, Time: 10:00 Thu-Ses1-P3-9, Time: 10:00 Language models (LMs) are important in automatic speech recognition systems. In this paper, we propose a new approach to a topic dependent LM, where the topic is decided in an unsupervised manner. Latent Semantic Analysis (LSA) is employed to reveal hidden (latent) relations among nouns in the context words. To decide the topic of an event, a fixed size word history sequence (window) is observed, and voting is then carried out based on noun class occurrences weighted by a confidence measure. Experiments on the Wall Street Journal corpus and Mainichi Shimbun (Japanese newspaper) corpus show that our proposed method gives better perplexity than the comparative baselines, including a word-based/class-based n-gram LM, their interpolated LM, a cache-based LM, and the Latent Dirichlet Allocation (LDA)-based topic dependent LM. The Hierarchical Pitman Yor Process Language Model (HPYLM) is a Bayesian language model based on a non-parametric prior, the Pitman-Yor Process. 
It has been demonstrated, both theoretically and practically, that the HPYLM can provide better smoothing for language modeling, compared with state-of-the-art approaches such as interpolated Kneser-Ney and modified Kneser-Ney smoothing. However, estimation of Bayesian language models is expensive in terms of both computation time and memory; the inference is approximate and requires a number of iterations to converge. In this paper, we present a parallel training algorithm for the HPYLM, which enables the approach to be applied in the context of automatic speech recognition, using large training corpora with large vocabularies. We demonstrate the effectiveness of the proposed algorithm by estimating language models from corpora for meeting transcription containing over 200 million words, and observe significant reductions in perplexity and word error rate. Investigation of Morph-Based Speech Recognition Improvements Across Speech Genres Probabilistic and Possibilistic Language Models Based on the World Wide Web Péter Mihajlik, Balázs Tarján, Zoltán Tüske, Tibor Fegyó; BME, Hungary Thu-Ses1-P3-7, Time: 10:00 The improvement achieved by changing the basis of speech recognition from words to morphs (various sub-word units) varies greatly across tasks and languages. We make an attempt to explore the source of this variability by the investigation of three LVCSR tasks corresponding to three speech genres of a highly agglutinative language. Novel, press conference and broadcast news transcription results are presented and compared to spontaneous speech recognition results in several experimental setups. A noticeable correlation is observed between an easily computable characteristic of various language speech recognition tasks and between the relative improvements due to (statistical) morph-based approaches. Effective Use of Pause Information in Language Modelling for Speech Recognition Kengo Ohta, Masatoshi Tsuchiya, Seiichi Nakagawa; Toyohashi University of Technology, Japan Thu-Ses1-P3-8, Time: 10:00 This paper addresses mismatch between speech processing units used by a speech recognizer and sentences of corpora. A standard speech recognizer divides an input speech into speech processing units based on its power information. On the other hand, training corpora of language models are divided into sentences based on punctuations. There is inevitable mismatch between speech processing units and sentences, and both of them are not optimal for a spontaneous speech recognition task. This paper presents two sub issues to address this problem. At first, the words of the preceding units are utilized to predict the words of the succeeding units, in order to address the mismatch between speech processing units and optimal units. Secondly, we propose a method to build a language model including short pause from a corpus with no short pause to address the mismatch between speech processing units and sentences. Their combination achieved a 4.5% relative improvement over the conventional method in the meeting speech recognition task. Stanislas Oger, Vladimir Popescu, Georges Linarès; LIA, France Thu-Ses1-P3-10, Time: 10:00 Usually, language models are built either from a closed corpus, or by using World Wide Web retrieved documents, which are considered as a closed corpus themselves. In this paper we propose several other ways, more adapted to the nature of the Web, of using this resource for language modeling. 
We first start by improving an approach consisting in estimating n-gram probabilities from Web search engine statistics. Then, we propose a new way of considering the information extracted from the Web in a probabilistic framework. Then, we also propose to rely on Possibility Theory for effectively using this kind of information. We compare these two approaches on two automatic speech recognition tasks: (i) transcribing broadcast news data, and (ii) transcribing domain-specific data, concerning surgical operation film comments. We show that the two approaches are effective in different situations. Thu-Ses1-P4 : Systems for Spoken Language Understanding Hewison Hall, 10:00, Thursday 10 Sept 2009 Chair: Renato de Mori, LIA, France Classification-Based Strategies for Combining Multiple 5-W Question Answering Systems Sibel Yaman 1 , Dilek Hakkani-Tür 1 , Gokhan Tur 2 , Ralph Grishman 3 , Mary Harper 4 , Kathleen R. McKeown 5 , Adam Meyers 3 , Kartavya Sharma 5 ; 1 ICSI, USA; 2 SRI International, USA; 3 New York University, USA; 4 University of Maryland at College Park, USA; 5 Columbia University, USA Thu-Ses1-P4-1, Time: 10:00 We describe and analyze inference strategies for combining outputs from multiple question answering systems each of which was developed independently. Specifically, we address the DARPA-funded GALE information distillation Year 3 task of finding answers to the Notes 159 5-Wh questions (who, what, when, where, and why) for each given sentence. The approach we take revolves around determining the best system using discriminative learning. In particular, we train support vector machines with a set of novel features that encode systems’ capabilities of returning as many correct answers as possible. We analyze two combination strategies: one combines multiple systems at the granularity of sentences, and the other at the granularity of individual fields. Our experimental results indicate that the proposed features and combination strategies were able to improve the overall performance by 22% to 36% relative to a random selection, 16% to 35% relative to a majority voting scheme, and 15% to 23% relative to the best individual system. sented a strategy that consists in the robust detection of subjective opinions about a particular topic in a spoken message. If the same automatic system is used for estimating opinion proportions in different spoken surveys, then the error rate of the entire automatic process should not vary too much in different surveys for each type of opinions. Based on this conjecture, a linear error model is derived and used for error correction. Experimental results obtained with data of a real-world deployed system show significant error reductions obtained in the automatic estimation of proportions in spoken surveys. Transformation-Based Learning for Semantic Parsing F. Jurčíček, M. Gašić, S. Keizer, F. Mairesse, B. Thomson, K. Yu, S. Young; University of Cambridge, UK Combining Semantic and Syntactic Information Sources for 5-W Question Answering Thu-Ses1-P4-5, Time: 10:00 Sibel Yaman 1 , Dilek Hakkani-Tür 1 , Gokhan Tur 2 ; 1 ICSI, USA; 2 SRI International, USA Thu-Ses1-P4-2, Time: 10:00 This paper focuses on combining answers generated by a semantic parser that produces semantic role labels (SRLs) and those generated by syntactic parser that produces function tags for answering 5-W questions, i.e., who, what, when, where, and why. 
We take a probabilistic approach in which a system’s ability to correctly answer 5-W questions is measured by the likelihood that its answers are produced for the given word sequence. This is achieved by training statistical language models (LMs) that are used to predict whether the answers returned by the semantic parser or those returned by the syntactic parser are more likely. We evaluated our approach using the OntoNotes dataset. Our experimental results indicate that the proposed LM-based combination strategy was able to improve the performance of the best individual system in terms of both F1 measure and accuracy. Furthermore, the error rates for each question type were also significantly reduced with the help of the proposed approach. Phrase and Word Level Strategies for Detecting Appositions in Speech Benoit Favre, Dilek Hakkani-Tür; ICSI, USA Thu-Ses1-P4-3, Time: 10:00 Appositions are grammatical constructs in which two noun phrases are placed side-by-side, one modifying the other. Detecting them in speech can help extract semantic information useful, for instance, for co-reference resolution and question answering. We compare and combine three approaches: word-level and phrase-level classifiers, and a syntactic parser trained to generate appositions. On reference parses, the phrase-level classifier outperforms the other approaches while on automatic parses and ASR output, the combination of the apposition-generating parser and the word-level classifier works best. An analysis of the system errors reveals that parsing accuracy and world knowledge are very important for this task. Error Correction of Proportions in Spoken Opinion Surveys Nathalie Camelin 1 , Renato De Mori 1 , Frederic Bechet 1 , Géraldine Damnati 2 ; 1 LIA, France; 2 Orange Labs, France Thu-Ses1-P4-4, Time: 10:00 The paper analyzes the types of errors encountered in automatic spoken surveys. These errors are different from the ones that appear when surveys are taken by humans because they are caused by the imprecision of an automatic system. Previous studies pre- This paper presents a semantic parser that transforms an initial semantic hypothesis into the correct semantics by applying an ordered list of transformation rules. These rules are learnt automatically from a training corpus with no prior linguistic knowledge and no alignment between words and semantic concepts. The learning algorithm produces a compact set of rules which enables the parser to be very efficient while retaining high accuracy. We show that this parser is competitive with respect to the state-of-the-art semantic parsers on the ATIS and TownInfo tasks. Large-Scale Polish SLU Patrick Lehnen 1 , Stefan Hahn 1 , Hermann Ney 1 , Agnieszka Mykowiecka 2 ; 1 RWTH Aachen University, Germany; 2 Polish Academy of Sciences, Poland Thu-Ses1-P4-6, Time: 10:00 In this paper, we present state-of-the-art concept tagging results on a new corpus for Polish SLU. For this language, it is the first large-scale corpus (∼200 different concepts) which has been semantically annotated and will be made publicly available. Conditional Random Fields have proven to lead to the best results for string-to-string translation problems. Using this approach, we achieve a concept error rate of 22.6% on an evaluation corpus. To additionally extract attribute values, a combination of a statistical and a rule-based approach is used, leading to a CER of 30.2%. Optimizing CRFs for SLU Tasks in Various Languages Using Modified Training Criteria Stefan Hahn, Patrick Lehnen, Georg Heigold, Hermann Ney; RWTH Aachen University, Germany Thu-Ses1-P4-7, Time: 10:00 In this paper, we present improvements of our state-of-the-art concept tagger based on conditional random fields. Statistical models have been optimized for three tasks of varying complexity in three languages (French, Italian, and Polish). Modified training criteria have been investigated, leading to small improvements. The respective corpora as well as parameter optimization results for all models are presented in detail. A comparison of the selected features between languages as well as a close look at the tuning of the regularization parameter is given. The experimental results show to what extent the optimizations of the single systems are portable between languages. Learning Lexicons from Spoken Utterances Based on Statistical Model Selection Ryo Taguchi 1 , Naoto Iwahashi 2 , Takashi Nose 3 , Kotaro Funakoshi 4 , Mikio Nakano 4 ; 1 ATR, Japan; 2 NICT, Japan; 3 Tokyo Institute of Technology, Japan; 4 Honda Research Institute Japan Co. Ltd., Japan Thu-Ses1-P4-8, Time: 10:00 This paper proposes a method for the unsupervised learning of lexicons from pairs of a spoken utterance and an object as its meaning, without any a priori linguistic knowledge other than a phoneme acoustic model. In order to obtain a lexicon, a statistical model of the joint probability of a spoken utterance and an object is learned based on the minimum description length principle. This model consists of a list of word phoneme sequences and three statistical models: the phoneme acoustic model, a word-bigram model, and a word meaning model. Experimental results show that the method can acquire acoustically, grammatically and semantically appropriate words with about 85% phoneme accuracy. Improving Speech Understanding Accuracy with Limited Training Data Using Multiple Language Models and Multiple Understanding Models Masaki Katsumaru 1 , Mikio Nakano 2 , Kazunori Komatani 1 , Kotaro Funakoshi 2 , Tetsuya Ogata 1 , Hiroshi G. Okuno 1 ; 1 Kyoto University, Japan; 2 Honda Research Institute Japan Co. Ltd., Japan Thu-Ses1-P4-9, Time: 10:00 We aim to improve a speech understanding module with a small amount of training data. A speech understanding module uses a language model (LM) and a language understanding model (LUM). A lot of training data are needed to improve the models. Such data collection is, however, difficult in an actual process of development. We therefore design and develop a new framework that uses multiple LMs and LUMs to improve speech understanding accuracy under various amounts of training data. Even if the amount of available training data is small, each LM and each LUM can deal well with different types of utterances, and more utterances are understood by using multiple LMs and LUMs. As one implementation of the framework, we develop a method for selecting the most appropriate speech understanding result from several candidates. The selection is based on probabilities of correctness calculated by logistic regressions. We evaluate our framework with various amounts of training data. Low-Cost Call Type Classification for Contact Center Calls Using Partial Transcripts Youngja Park, Wilfried Teiken, Stephen C. Gates; IBM T.J. Watson Research Center, USA Thu-Ses1-P4-10, Time: 10:00 Call type classification and topic classification for contact center calls using automatically generated transcripts is not yet widely available, mainly due to the high cost and low accuracy of call-center grade automatic speech transcription. To address these challenges, we examine if using only partial conversations yields accuracy comparable to using the entire customer-agent conversations. We exploit two interesting characteristics of call center calls. First, contact center calls are highly scripted, following prescribed steps, and the customer’s problem or request (i.e., the determinant of the call type) is typically stated in the beginning of a call. Thus, using only the beginning of calls may be sufficient to determine the call type. Second, agents often more clearly repeat or rephrase what customers said, thus it may be sufficient to process only agents’ speech. Our experiments with 1,677 customer calls show that two partial transcripts comprising only the agents’ utterances and the first 40 speaker turns actually produce slightly higher classification accuracy than a transcript set comprising the entire conversations. In addition, using partial conversations can significantly reduce the cost for speech transcription. A New Quality Measure for Topic Segmentation of Text and Speech Mehryar Mohri, Pedro Moreno, Eugene Weinstein; Google Inc., USA Thu-Ses1-P4-11, Time: 10:00 The recent proliferation of large multimedia collections has gathered immense attention from the speech research community, because speech recognition enables the transcription and indexing of such collections. Topicality information can be used to improve transcription quality and enable content navigation. In this paper, we give a novel quality measure for topic segmentation algorithms that improves over previously used measures. Our measure takes into account not only the presence or absence of topic boundaries but also the content of the text or speech segments labeled as topic-coherent. Additionally, we demonstrate that topic segmentation quality of spoken language can be improved using speech recognition lattices. Using lattices, improvements over the baseline one-best topic model are observed when measured with the previously existing topic segmentation quality measure, as well as the new measure proposed in this paper (9.4% and 7.0% relative error reduction, respectively). Concept Segmentation and Labeling for Conversational Speech Marco Dinarelli, Alessandro Moschitti, Giuseppe Riccardi; Università di Trento, Italy Thu-Ses1-P4-12, Time: 10:00 Spoken Language Understanding performs automatic concept labeling and segmentation of speech utterances. For this task, many approaches have been proposed based on both generative and discriminative models. While all these methods have shown remarkable accuracy on manual transcription of spoken utterances, robustness to noisy automatic transcription is still an open issue. In this paper we study algorithms for Spoken Language Understanding combining complementary learning models: Stochastic Finite State Transducers produce a list of hypotheses, which are re-ranked using a discriminative algorithm based on kernel methods. Our experiments on two different spoken dialog corpora, MEDIA and LUNA, show that the combined generative-discriminative model matches state-of-the-art approaches such as Conditional Random Fields (CRF) on manual transcriptions, and it is robust to noisy automatic transcriptions, outperforming, in some cases, the state-of-the-art.
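To make the candidate-selection step described in the Katsumaru et al. abstract above concrete, the following minimal sketch (not taken from the paper; the feature choices and the two-candidate example are illustrative assumptions) scores the understanding results produced by several LM/LUM pairs with a logistic-regression correctness model and keeps the highest-scoring one.

```python
# Illustrative sketch of selecting among multiple (LM, LUM) understanding
# results using logistic-regression correctness probabilities, in the spirit
# of the framework described above. Feature choices are hypothetical.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Training data: one row of features per candidate understanding result
# (e.g., ASR confidence, LM log-probability, number of filled slots),
# labelled 1 if the candidate was correct and 0 otherwise.
X_train = np.array([
    [0.92, -14.2, 3],
    [0.41, -35.8, 1],
    [0.78, -19.6, 2],
    [0.15, -52.3, 0],
])
y_train = np.array([1, 0, 1, 0])

correctness_model = LogisticRegression().fit(X_train, y_train)

def select_best(candidates):
    """Return the candidate with the highest estimated probability of correctness.

    `candidates` is a list of (interpretation, feature_vector) pairs, one per
    (LM, LUM) combination.
    """
    feats = np.array([f for _, f in candidates])
    p_correct = correctness_model.predict_proba(feats)[:, 1]
    best = int(np.argmax(p_correct))
    return candidates[best][0], p_correct[best]

# Example: two candidate interpretations of the same utterance.
candidates = [
    ({"intent": "set_alarm", "time": "7 am"}, [0.88, -16.0, 2]),
    ({"intent": "unknown"}, [0.33, -41.0, 0]),
]
print(select_best(candidates))
```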
Notes 161 Noise Robustness of Tract Variables and their Application to Speech Recognition Thu-Ses1-S1 : Special Session: New Approaches to Modeling Variability for Automatic Speech Recognition Vikramjit Mitra 1 , Hosung Nam 2 , Carol Y. Espy-Wilson 1 , Elliot Saltzman 2 , Louis Goldstein 3 ; 1 University of Maryland at College Park, USA; 2 Haskins Laboratories, USA; 3 University of Southern California, USA Ainsworth (East Wing 4), 10:00, Thursday 10 Sept 2009 Chair: Carol Y. Espy-Wilson, University of Maryland at College Park, USA and Jennifer Cole, University of Illinois at Urbana-Champaign, USA Thu-Ses1-S1-3, Time: 10:40 Introductory Remarks Carol Y. Espy-Wilson 1 , Jennifer Cole 2 ; 1 University of Maryland at College Park, USA; 2 University of Illinois at Urbana-Champaign, USA Thu-Ses1-S1-0, Time: 10:00 A Noise-Type and Level-Dependent MPO-Based Speech Enhancement Architecture with Variable Frame Analysis for Noise-Robust Speech Recognition Vikramjit Mitra 1 , Bengt J. Borgstrom 2 , Carol Y. Espy-Wilson 1 , Abeer Alwan 2 ; 1 University of Maryland at College Park, USA; 2 University of California at Los Angeles, USA Thu-Ses1-S1-1, Time: 10:10 In previous work, a speech enhancement algorithm based on phase opponency and a periodicity measure (MPO-APP) was developed for speech recognition. Axiomatic thresholds were used in the MPO-APP regardless of the signal-to-noise ratio (SNR) of the corrupted speech or any characterization of the noise. The current work developed an algorithm for adjusting the threshold in the MPO-APP based on the SNR and whether the speech signal is clean, corrupted by aperiodic noise or corrupted with noise with periodic components. In addition, variable frame rate (VFR) analysis has been incorporated so that dynamic regions in the speech signal are more heavily sampled than steady-state regions. The result is a 2-stage algorithm that gives superior performance to the previous MPO-APP, and to several other state-of-the-art speech enhancement algorithms. Complementarity of MFCC, PLP and Gabor Features in the Presence of Speech-Intrinsic Variabilities This paper analyzes the noise robustness of vocal tract constriction variable estimation and investigates their role for noise robust speech recognition. We implemented a simple direct inverse model using a feed-forward artificial neural network to estimate vocal tract variables (TVs) from the speech signal. Initially, we trained the model on clean synthetic speech and then test the noise robustness of the model on noise-corrupted speech. The training corpus was obtained from the TAsk Dynamics Application model (TADA [1]), which generated the synthetic speech as well as their corresponding TVs. Eight different vocal tract constriction variables consisting of five constriction degree variables (lip aperture [LA], tongue body [TBCD], tongue tip [TTCD], velum [VEL], and glottis [GLO]); three constriction location variables (lip protrusion [LP], tongue tip [TTCL], tongue body [TBCL]) were considered in this study. We also explored using a modified phase opponency (MPO) [2] speech enhancement technique as the preprocessor for TV estimation to observe its effect upon noise robustness. Kalman smoothing was applied to the estimated TVs to reduce the estimation noise. Finally the TV estimation module was tested using a naturally-produced speech that is contaminated with noise at different signal-to-noise ratios. 
The estimated TVs from the natural speech corpus are then used in conjunction with the baseline features to perform automatic speech recognition (ASR) experiments. Results show an average 22% and 21% improvement, relative to the baseline, on ASR performance using the Aurora-2 dataset with car and subway noise, respectively. The TVs in these experiments are estimated from the MPO-enhanced speech. Articulatory Phonological Code for Word Classification Xiaodan Zhuang 1 , Hosung Nam 2 , Mark Hasegawa-Johnson 1 , Louis Goldstein 2 , Elliot Saltzman 2 ; 1 University of Illinois at Urbana-Champaign, USA; 2 Haskins Laboratories, USA Thu-Ses1-S1-4, Time: 10:55 Bernd T. Meyer, Birger Kollmeier; Carl von Ossietzky Universität Oldenburg, Germany Thu-Ses1-S1-2, Time: 10:25 In this study, the effect of speech-intrinsic variabilities such as speaking rate, effort and speaking style on automatic speech recognition (ASR) is investigated. We analyze the influence of such variabilities as well as extrinsic factors (i.e., additive noise) on the most common features in ASR (mel-frequency cepstral coefficients and perceptual linear prediction features) and spectro-temporal Gabor features. MFCCs performed best for clean speech, whereas Gabors were found to be the most robust feature under extrinsic variabilities. Intrinsic variations were found to have a strong impact on error rates. While performance with MFCCs and PLPs was degraded in much the same way, Gabor features exhibit a different sensitivity towards these variabilities and are, e.g., well-suited to recognize speech with varying pitch. The results suggest that spectro-temporal and classic features carry complementary information, which could be exploited in feature-stream experiments. We propose a framework that leverages articulatory phonology for speech recognition. “Gestural pattern vectors” (GPV) encode the instantaneous gestural activations that exist across all tract variables at each time. Given a speech observation, recognizing the sequence of GPVs recovers the ensemble of gestural activations, i.e., the gestural score. For each word in the vocabulary, we use a task dynamic model of inter-articulator speech coordination to generate the “canonical” gestural score. Speech recognition is achieved by matching the ensemble of gestural activations. In particular, we estimate the likelihood of the recognized GPV sequence on word-dependent GPV sequence models trained using the “canonical” gestural scores. These likelihoods, weighted by the confidence scores of the recognized GPVs, are used in a Bayesian speech recognizer. Pilot gestural score recovery and word classification experiments are carried out using synthesized data from one speaker. The observation distribution of each GPV is modeled by an artificial neural network and Gaussian mixture tandem model. Bigram GPV sequence models are used to distinguish gestural scores of different words. Given the tract variable time functions, about 80% of the instantaneous gestural activation is correctly recovered. Word recognition accuracy is over 85% for a vocabulary of 139 words with no training observations. These results suggest that the proposed framework might be a viable alternative to the classic sequence-of-phones model.
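As a rough illustration of the word-classification step in the Zhuang et al. framework, the sketch below scores a recognized GPV sequence against word-dependent bigram models and picks the best word under a simple Bayesian combination; the GPV labels, smoothing and confidence weighting are assumptions for illustration, not the authors' implementation.

```python
# Illustrative sketch of word classification from a recognized GPV sequence
# using word-dependent bigram models, loosely following the description above.
# The GPV inventory, training sequences and smoothing are hypothetical.
import math
from collections import defaultdict

def train_bigram_model(gpv_sequences, vocab_size, alpha=1.0):
    """Estimate smoothed bigram probabilities from 'canonical' GPV sequences."""
    counts = defaultdict(lambda: defaultdict(float))
    totals = defaultdict(float)
    for seq in gpv_sequences:
        for prev, cur in zip(seq[:-1], seq[1:]):
            counts[prev][cur] += 1.0
            totals[prev] += 1.0
    def prob(prev, cur):
        return (counts[prev][cur] + alpha) / (totals[prev] + alpha * vocab_size)
    return prob

def sequence_log_likelihood(seq, bigram_prob, confidences=None):
    """Sum of log bigram probabilities, optionally weighted by GPV confidences."""
    if confidences is None:
        confidences = [1.0] * len(seq)
    score = 0.0
    for i in range(1, len(seq)):
        score += confidences[i] * math.log(bigram_prob(seq[i - 1], seq[i]))
    return score

def classify(recognized_seq, word_models, priors, confidences=None):
    """Pick the word maximizing log P(word) + weighted log P(GPV sequence | word)."""
    return max(
        word_models,
        key=lambda w: math.log(priors[w])
        + sequence_log_likelihood(recognized_seq, word_models[w], confidences),
    )

# Toy example with integer GPV labels and two hypothetical words.
models = {
    "pat": train_bigram_model([[0, 1, 2, 3], [0, 1, 3]], vocab_size=4),
    "bat": train_bigram_model([[0, 2, 2, 3], [0, 2, 3]], vocab_size=4),
}
priors = {"pat": 0.5, "bat": 0.5}
print(classify([0, 1, 2, 3], models, priors, confidences=[0.9, 0.8, 0.95, 0.9]))
```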
Robust Keyword Spotting with Rapidly Adapting Point Process Models Aren Jansen, Partha Niyogi; University of Chicago, USA Thu-Ses1-S1-5, Time: 11:10 In this paper, we investigate the noise robustness properties of frame-based and sparse point process-based models for spotting keywords in continuous speech. We introduce a new strategy to improve point process model (PPM) robustness by adapting low-level feature detector thresholds to preserve background firing rates in the presence of noise. We find that this unsupervised approach can significantly outperform fully supervised maximum likelihood linear regression (MLLR) adaptation of an equivalent keyword-filler HMM system in the presence of additive white and pink noise. Moreover, we find that the sparsity of PPMs introduces an inherent resilience to non-stationary babble noise not exhibited by the frame-based HMM system. Finally, we demonstrate that our approach requires less adaptation data than MLLR, permitting rapid online adaptation. Automatically Rating Pronunciation Through Articulatory Phonology Thu-Ses2-O1 : User Interactions in Spoken Dialog Systems Main Hall, 13:30, Thursday 10 Sept 2009 Chair: Roberto Pieraccini, SpeechCycle Labs, USA Learning the Structure of Human-Computer and Human-Human Dialogs David Griol 1 , Giuseppe Riccardi 2 , Emilio Sanchis 3 ; 1 Universidad Carlos III de Madrid, Spain; 2 Università di Trento, Italy; 3 Universidad Politécnica de Valencia, Spain Thu-Ses2-O1-1, Time: 13:30 We are interested in the problem of understanding human conversation structure in the context of human-machine and human-human interaction. We present a statistical methodology for detecting the structure of spoken dialogs based on a generative model learned using decision trees. To evaluate our approach we have used the LUNA corpora, collected from real users engaged in problem solving tasks. The results of the evaluation show that automatic segmentation of spoken dialogs is very effective not only with models built using separately human-machine dialogs or human-human dialogs, but it is also possible to infer the taskrelated structure of human-human dialogs with a model learned using only human-machine dialogs. Pause and Gap Length in Face-to-Face Interaction Joseph Tepperman, Louis Goldstein, Sungbok Lee, Shrikanth S. Narayanan; University of Southern California, USA Jens Edlund 1 , Mattias Heldner 1 , Julia Hirschberg 2 ; 1 KTH, Sweden; 2 Columbia University, USA Thu-Ses1-S1-6, Time: 11:25 Thu-Ses2-O1-2, Time: 13:50 Articulatory Phonology’s link between cognitive speech planning and the physical realizations of vocal tract constrictions has implications for speech acoustic and duration modeling that should be useful in assigning subjective ratings of pronunciation quality to nonnative speech. In this work, we compare traditional phoneme models used in automatic speech recognition to similar models for articulatory gestural pattern vectors, each with associated duration models. What we find is that, on the CDT corpus, gestural models outperform the phoneme-level baseline in terms of correlation with listener ratings, and in combination phoneme and gestural models outperform either one alone. This also validates previous findings with a similar (but not gesture-based) pseudo-articulatory representation. It has long been noted that conversational partners tend to exhibit increasingly similar pitch, intensity, and timing behavior over the course of a conversation. 
However, the metrics developed to measure this similarity to date have generally failed to capture the dynamic temporal aspects of this process. In this paper, we propose new approaches to measuring interlocutor similarity in spoken dialogue. We define similarity in terms of convergence and synchrony and propose approaches to capture these, illustrating our techniques on gap and pause production in Swedish spontaneous dialogues. General Discussion Time: 11:40 Modeling Other Talkers for Improved Dialog Act Recognition in Meetings Kornel Laskowski 1 , Elizabeth Shriberg 2 ; 1 Carnegie Mellon University, USA; 2 SRI International, USA Thu-Ses2-O1-3, Time: 14:10 Automatic dialog act (DA) modeling has been shown to benefit meeting understanding, but current approaches to DA recognition tend to suffer from a common problem: they under-represent behaviors found at turn edges, during which the “floor” is negotiated among meeting participants. We propose a new approach that takes into account speech from other talkers, relying only on speech/non-speech information from all participants. We find (1) that modeling other participants improves DA detection, even in the absence of other information, (2) that only the single locally most talkative other participant matters, and (3) that 10 seconds provides a sufficiently large local context. Results further show significant performance improvements over a lexical-only system — particularly for the DAs of interest. We conclude that interaction-based modeling at turn edges can be achieved by relatively simple features and should be incorporated for improved meeting understanding. Notes 163 A Closer Look at Quality Judgments of Spoken Dialog Systems Thu-Ses2-O2 : Production: Articulation and Acoustics Klaus-Peter Engelbrecht, Felix Hartard, Florian Gödde, Sebastian Möller; Deutsche Telekom Laboratories, Germany Jones (East Wing 1), 13:30, Thursday 10 Sept 2009 Chair: Denis Burnham, University of Western Sydney, Australia Thu-Ses2-O1-4, Time: 14:30 User judgments of Spoken Dialog Systems provide evaluators of such systems with a valid measure of their overall quality. Models for the automatic prediction of user judgments have been built, following the introduction of PARADISE [1]. Main applications are the comparison of systems, the analysis of parameters affecting quality, and the adoption of dialog management strategies. However, a common model which applies to different systems and users has not been found so far. With the aim of getting a closer insight into the quality-relevant characteristics of spoken interactions, an experiment was conducted where 25 users judged the same 5 dialogs. User judgments were collected after each dialog turn. The paper presents an analysis of the obtained results and some conclusions for future work. New Methods for the Analysis of Repeated Utterances In Search of Non-Uniqueness in the Acoustic-to-Articulatory Mapping G. Ananthakrishnan, D. Neiberg, Olov Engwall; KTH, Sweden Thu-Ses2-O2-1, Time: 13:30 This paper explores the possibility and extent of non-uniqueness in the acoustic-to-articulatory inversion of speech, from a statistical point of view. It proposes a technique to estimate the non-uniqueness, based on finding peaks in the conditional probability function of the articulatory space. The paper corroborates the existence of non-uniqueness in a statistical sense, especially in stop consonants, nasals and fricatives. 
The relationship between the importance of the articulator position and non-uniqueness at each instance is also explored. Estimation of Articulatory Gesture Patterns from Speech Acoustics Geoffrey Zweig; Microsoft Research, USA Thu-Ses2-O1-5, Time: 14:50 This paper proposes three novel and effective procedures for jointly analyzing repeated utterances. First, we propose repetition-driven system switching, where repetition triggers the use of an independent backup system for decoding. Second, we propose a cache language model for use with the second utterance. Finally, we propose a method with which the acoustics from multiple utterances — not necessarily exact repetitions of each other — can be combined into a composite that increases accuracy. The combination of all methods produces a relative increase in sentence accuracy of 65.7% for repeated voice-search queries. The Effects of Different Voices for Speech-Based In-Vehicle Interfaces: Impact of Young and Old Voices on Driving Performance and Attitude Ing-Marie Jonsson, Nils Dahlbäck; Linköping University, Sweden Prasanta Kumar Ghosh 1 , Shrikanth S. Narayanan 1 , Pierre Divenyi 2 , Louis Goldstein 1 , Elliot Saltzman 3 ; 1 University of Southern California, USA; 2 EBIRE, USA; 3 Haskins Laboratories, USA Thu-Ses2-O2-2, Time: 13:50 We investigated dynamic programming (DP) and state-model (SM) approaches for estimating gestural scores from speech acoustics. We performed a word-identification task using the gestural pattern vector sequences estimated by each approach. For a set of 75 randomly chosen words, we obtained the best word-identification accuracy (66.67%) using the DP approach. This result implies that considerable support for lexical access during speech perception might be provided by such a method of recovering gestural information from acoustics. Formant Trajectories for Acoustic-to-Articulatory Inversion Thu-Ses2-O1-6, Time: 15:10 This paper investigates how matching the age of the driver with the age of the voice in a conversational in-vehicle information system affects attitudes and performance. 36 participants from two age groups, 55–75 and 18–25, interacted with a conversational system with a young or old voice in a driving simulator. Results show that all drivers were more willing to communicate with a young voice than with an old voice in the car. This willingness to communicate had a detrimental effect on driving performance. It is hence important to carefully select voices, since voice properties can have enormous effects on driving safety. Clearly, one voice doesn’t fit all. İ. Yücel Özbek 1 , Mark Hasegawa-Johnson 2 , Mübeccel Demirekler 1 ; 1 Middle East Technical University, Turkey; 2 University of Illinois at Urbana-Champaign, USA Thu-Ses2-O2-3, Time: 14:10 This work examines the utility of formant frequencies and their energies in acoustic-to-articulatory inversion. For this purpose, formant frequencies and formant spectral amplitudes are automatically estimated from audio, and are treated as observations for the purpose of estimating electromagnetic articulography (EMA) coil positions. A mixture Gaussian regression model with mel-frequency cepstral (MFCC) observations is modified by using formants and energies to either replace or augment the MFCC observation vector. The augmented observation results in 3.4% lower RMS error and 2% higher correlation coefficient than the baseline MFCC observation.
Improvement is especially good for stop consonants, possibly because formant tracking provides information about the acoustic resonances that would be otherwise unavailable during stop closure and release. Notes 164 A Robust Variational Method for the Acoustic-to-Articulatory Problem Thu-Ses2-O3 : Features for Speech and Speaker Recognition Blaise Potard, Yves Laprie; LORIA, France Thu-Ses2-O2-4, Time: 14:30 This paper presents a novel acoustic-to-articulatory inversion method based on an articulatory synthesizer and variational calculus, without the need for an initial trajectory. Validation in ideal conditions is performed to show the potential of the method, and the performances are compared to codebook based methods. We also investigate the precision of the articulatory trajectories found for various acoustic vectors dimensions. Possible extensions are discussed. Comparison of Vowel Structures of Japanese and English in Articulatory and Auditory Spaces Jianwu Dang 1 , Mark Tiede 2 , Jiahong Yuan 3 ; 1 JAIST, Japan; 2 Haskins Laboratories, USA; 3 University of Pennsylvania, USA Thu-Ses2-O2-5, Time: 14:50 In previous work [1] we investigated the vowel structures of Japanese in both articulatory space and auditory perceptual space using Laplacian eigenmaps, and examined relations between speech production and perception. The results showed that the inherent structures of Japanese vowels were consistent in the two spaces. To verify whether such a property generalizes to other languages, we use the same approach to investigate the more crowded English vowel space. Results show that the vowel structure reflects the articulatory features for both languages. The degree of tongue-palate approximation is the most important feature for vowels, followed by the open ratio of the mouth to oral cavity. The topological relations of the vowel structures are consistent with both the articulatory and auditory perceptual spaces; in particular the lip-protruded vowel /UW/ of English was distinct from the unrounded Japanese /W/. The rhotic vowel /ER/ was located apart from the surface constructed by the other vowels, where the same phenomena appeared in both spaces. The Articulatory and Acoustic Impact of Scottish English /r/ on the Preceding Vowel-Onset Janine Lilienthal; Queen Margaret University, UK Thu-Ses2-O2-6, Time: 15:10 This paper demonstrates the use of smoothing spline ANOVA and T tests to analyze whether the influence of syllable final consonants on the preceding vowel differs for articulation and acoustics. The onset of vowels either followed by phrase-final /r/ or by phrase-initial /r/ is compared for two Scottish English speakers. To measure articulatory differences of opposing vowel pairs, smoothing splines of midsagittal tongue shape recorded via ultrasound imaging are compared. For the acoustic data, differences of the first two formant frequencies at the onset are tested. The results confirm that there is no 1:1 mapping between articulation and acoustics. Fallside (East Wing 2), 13:30, Thursday 10 Sept 2009 Chair: Thomas Hain, University of Sheffield, UK Static and Dynamic Modulation Spectrum for Speech Recognition Sriram Ganapathy, Samuel Thomas, Hynek Hermansky; Johns Hopkins University, USA Thu-Ses2-O3-1, Time: 13:30 We present a feature extraction technique based on static and dynamic modulation spectrum derived from long-term envelopes in sub-bands. Estimation of the sub-band temporal envelopes is done using Frequency Domain Linear Prediction (FDLP). 
These sub-band envelopes are compressed with a static (logarithmic) and dynamic (adaptive loops) compression. The compressed sub-band envelopes are transformed into modulation spectral components which are used as features for speech recognition. Experiments are performed on a phoneme recognition task using a hybrid HMM-ANN phoneme recognition system and an ASR task using the TANDEM speech recognition system. The proposed features provide relative improvements of 3.8% and 11.5% in phoneme recognition accuracy for TIMIT and conversational telephone speech (CTS), respectively. Further, these improvements are found to be consistent for ASR tasks on the OGI-Digits database (relative improvement of 13.5%). 2-D Processing of Speech for Multi-Pitch Analysis Tianyu T. Wang, Thomas F. Quatieri; MIT, USA Thu-Ses2-O3-2, Time: 13:50 This paper introduces a two-dimensional (2-D) processing approach for the analysis of multi-pitch speech sounds. Our framework invokes the short-space 2-D Fourier transform magnitude of a narrowband spectrogram, mapping harmonically-related signal components to multiple concentrated entities in a new 2-D space. First, localized time-frequency regions of the spectrogram are analyzed to extract pitch candidates. These candidates are then combined across multiple regions for obtaining separate pitch estimates of each speech-signal component at a single point in time. We refer to this as multi-region analysis (MRA). By explicitly accounting for pitch dynamics within localized time segments, this separability is distinct from that which can be obtained using short-time autocorrelation methods typically employed in state-of-the-art multi-pitch tracking algorithms. We illustrate the feasibility of MRA for multi-pitch estimation on mixtures of synthetic and real speech. A Correlation-Maximization Denoising Filter Used as an Enhancement Frontend for Noise Robust Bird Call Classification Wei Chu, Abeer Alwan; University of California at Los Angeles, USA Thu-Ses2-O3-3, Time: 14:30 In this paper, we propose a Correlation-Maximization denoising filter which utilizes periodicity information to remove additive noise in bird calls. We also developed a statistically-based noise robust bird-call classification system which uses the denoising filter as a frontend. Enhanced bird calls which are the output of the denoising filter are used for feature extraction. Gaussian Mixture Models (GMM) and Hidden Markov Models (HMM) are used for classification. Experiments on a large noisy corpus containing bird calls from 5 species have shown that the Correlation-Maximization filter is more effective than the Wiener filter in reducing the classification error rate of bird calls which have a quasi-periodic structure. This improvement results in a 4.1% classification error rate which is better than the system without a denoising frontend and a system with a Wiener filter denoising frontend. Preliminary Inversion Mapping Results with a New EMA Corpus elementary subtasks; then we propose a solution that combines a one-pass strategy that exploits the local repetitiveness of motifs and a dynamic programming technique to detect repetitions in audio streams. Results of an experiment on a radio broadcast show are shown to illustrate the effectiveness of the technique in providing audio summaries of real data.
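The dynamic-programming component of the audio-motif approach described just above can be pictured with a generic dynamic time warping (DTW) comparison between two feature segments; this is a sketch under the assumption of frame-level features such as MFCCs, not the authors' one-pass system.

```python
# Generic dynamic time warping (DTW) sketch for comparing two segments of
# frame-level features (e.g., MFCC matrices of shape [frames, dims]).
# A low DTW cost between a query segment and a later segment suggests a
# repeated motif; this is an illustration, not the authors' implementation.
import numpy as np

def dtw_cost(a, b):
    """Return the normalized DTW alignment cost between feature matrices a and b."""
    n, m = len(a), len(b)
    # Pairwise Euclidean distances between frames.
    dist = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = dist[i - 1, j - 1] + min(
                acc[i - 1, j],      # insertion
                acc[i, j - 1],      # deletion
                acc[i - 1, j - 1],  # match
            )
    return acc[n, m] / (n + m)

# Toy usage: a segment compared against itself (cost ~0) and against noise.
rng = np.random.default_rng(0)
query = rng.normal(size=(20, 13))
print(dtw_cost(query, query))                      # near 0: likely a repetition
print(dtw_cost(query, rng.normal(size=(25, 13))))  # larger: likely not
```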
Thu-Ses2-O4 : Speech and Multimodal Resources & Annotation Korin Richmond; University of Edinburgh, UK Thu-Ses2-O3-4, Time: 14:10 In this paper, we apply our inversion mapping method, the trajectory mixture density network (TMDN), to a new corpus of articulatory data, recorded with a Carstens AG500 electromagnetic articulograph. This new data set, mngu0, is relatively large and phonetically rich, among other beneficial characteristics. We obtain good results, with a root mean square (RMS) error of only 0.99mm. This compares very well with our previous lowest result of 1.54mm RMS error for equivalent coils of the MOCHA fsew0 EMA data. We interpret this as showing the mngu0 data set is potentially more consistent than the fsew0 data set, and is very useful for research which calls for articulatory trajectory data. It also supports our view that the TMDN is very much suited to the inversion mapping problem. Holmes (East Wing 3), 13:30, Thursday 10 Sept 2009 Chair: Kristiina Jokinen, University of Tampere, Finland ASR Corpus Design for Resource-Scarce Languages Etienne Barnard, Marelie Davel, Charl van Heerden; CSIR, South Africa Thu-Ses2-O4-1, Time: 13:30 Time-Varying Autoregressive Tests for Multiscale Speech Analysis We investigate the number of speakers and the amount of data that is required for the development of useable speaker-independent speech-recognition systems in resource-scarce languages. Our experiments employ the Lwazi corpus, which contains speech in the eleven official languages of South Africa. We find that a surprisingly small number of speakers (fewer than 50) and around 10 to 20 hours of speech per language are sufficient for the purposes of acceptable phone-based recognition. Daniel Rudoy 1 , Thomas F. Quatieri 2 , Patrick J. Wolfe 1 ; 1 Harvard University, USA; 2 MIT, USA Pronunciation Dictionary Development in Resource-Scarce Environments Thu-Ses2-O3-5, Time: 14:50 In this paper we develop hypothesis tests for speech waveform nonstationarity based on time-varying autoregressive models, and demonstrate their efficacy in speech analysis tasks at both segmental and sub-segmental scales. Key to the successful synthesis of these ideas is our employment of a generalized likelihood ratio testing framework tailored to autoregressive coefficient evolutions suitable for speech. After evaluating our framework on speech-like synthetic signals, we present preliminary results for two distinct analysis tasks using speech waveform data. At the segmental level, we develop an adaptive short-time segmentation scheme and evaluate it on whispered speech recordings, while at the sub-segmental level, we address the problem of detecting the glottal flow closed phase. Results show that our hypothesis testing framework can reliably detect changes in the vocal tract parameters across multiple scales, thereby underscoring its broad applicability to speech analysis. Audio Keyword Extraction by Unsupervised Word Discovery Marelie Davel, Olga Martirosian; CSIR, South Africa Thu-Ses2-O4-2, Time: 13:50 The deployment of speech technology systems in the developing world is often hampered by the lack of appropriate linguistic resources. A suitable pronunciation dictionary is one such resource that can be difficult to obtain for lesser-resourced languages. We design a process for the development of pronunciation dictionaries in resource-scarce environments, and apply this to the development of pronunciation dictionaries for ten of the official languages of South Africa. 
We define the semi-automated development and verification process in detail and discuss practicalities, outcomes and lessons learnt. We analyse the accuracy of the developed dictionaries and demonstrate how the distribution of rules generated from the dictionaries provides insight into the inherent predictability of the languages studied. XTrans: A Speech Annotation and Transcription Tool Meghan Lammie Glenn, Stephanie M. Strassel, Haejoong Lee; University of Pennsylvania, USA Armando Muscariello, Guillaume Gravier, Frédéric Bimbot; IRISA, France Thu-Ses2-O4-3, Time: 14:10 Thu-Ses2-O3-6, Time: 15:10 In real audio data, frequently occurring patterns often convey relevant information on the overall content of the data. The possibility to extract meaningful portions of the main content by identifying such key patterns, can be exploited for providing audio summaries and speeding up the access to relevant parts of the data. We refer to these patterns as audio motifs in analogy with the nomenclature in its counterpart task in biology. We describe a framework for the discovery of audio motifs in streams in an unsupervised fashion, as no acoustic or linguistic models are used. We define the fundamental problem by decomposing the overall task into We present XTrans, a multi-platform, multilingual, multi-channel transcription application designed and developed by Linguistic Data Consortium. XTrans provides new and efficient solutions to many common challenges encountered during the manual transcription process of a wide variety of audio genres, such as supporting multiple audio channels in a meeting recording or right-to-left text directionality for languages like Arabic. To facilitate accurate transcription, XTrans incorporates a number of quality control functions, and provides a user-friendly mechanism for transcribing overlapping speech. This paper will describe the motivation to develop a new transcription tool, and will give an overview of XTrans functionality. Notes 166 How to Select a Good Training-Data Subset for Transcription: Submodular Active Selection for Sequences Thu-Ses2-O5 : Speech Analysis and Processing III Ainsworth (East Wing 4), 13:30, Thursday 10 Sept 2009 Chair: Ben Milner, University of East Anglia, UK Hui Lin, Jeff Bilmes; University of Washington, USA Thu-Ses2-O4-4, Time: 14:30 Given a large un-transcribed corpus of speech utterances, we address the problem of how to select a good subset for word-level transcription under a given fixed transcription budget. We employ submodular active selection on a Fisher-kernel based graph over un-transcribed utterances. The selection is theoretically guaranteed to be near-optimal. Moreover, our approach is able to bootstrap without requiring any initial transcribed data, whereas traditional approaches rely heavily on the quality of an initial model trained on some labeled data. Our experiments on phone recognition show that our approach outperforms both average-case random selection and uncertainty sampling significantly. Improving Acceptability Assessment for the Labelling of Affective Speech Corpora Zoraida Callejas, Ramón López-Cózar; Universidad de Granada, Spain Thu-Ses2-O4-5, Time: 14:50 In this paper we study how to address the assessment of affective speech corpora. We propose the use of several coefficients and provide guidelines to obtain a more complete background about the quality of their annotation. 
This proposal has been evaluated employing a corpus of non-acted emotions gathered from spontaneous interactions of users with a spoken dialogue system. The results show that, due to the nature of non-acted emotional corpora, traditional interpretations would in most cases consider the annotation of these corpora unacceptable even with very high inter-annotator agreement. Our proposal provides a basis to argue their acceptability by supplying a more fine-grained vision of their quality. The Broadcast Narrow Band Speech Corpus: A New Resource Type for Large Scale Language Recognition Christopher Cieri 1 , Linda Brandschain 1 , Abby Neely 1 , David Graff 1 , Kevin Walker 1 , Chris Caruso 1 , Alvin F. Martin 2 , Craig S. Greenberg 2 ; 1 University of Pennsylvania, USA; 2 NIST, USA Thu-Ses2-O4-6, Time: 15:10 This paper describes a new resource type, broadcast narrow band speech, for use in large scale language recognition research and technology development. After providing the rationale for this new resource type, the paper describes the collection, segmentation, auditing procedures and data formats used. Along the way, it addresses issues of defining language and dialect in found data and how ground truth is established for this corpus. Model-Based Automatic Evaluation of L2 Learner’s English Timing Chatchawarn Hansakunbuntheung 1 , Hiroaki Kato 2 , Yoshinori Sagisaka 1 ; 1 Waseda University, Japan; 2 NICT, Japan Thu-Ses2-O5-1, Time: 15:30 This paper proposes a method to automatically measure the timing characteristics of a second-language learner’s speech as a means to evaluate language proficiency in speech production. We used the durational differences from native speakers’ speech as an objective measure to evaluate the learner’s timing characteristics. To provide flexible evaluation without the need to collect any additional English reference speech, we employed predicted segmental durations using a statistical duration model instead of measured raw durations of natives’ speech. The proposed evaluation method was tested using English speech data uttered by Thai-native learners with different English-study experiences. An evaluation experiment shows that the proposed measure based on duration differences closely correlates with the subjects’ English-study experiences. Moreover, segmental duration differences revealed Thai learners’ speech-control characteristics in word-final stress assignment. These results support the effectiveness of the proposed model-based objective evaluation. A Bayesian Approach to Non-Intrusive Quality Assessment of Speech Petko N. Petkov 1 , Iman S. Mossavat 2 , W. Bastiaan Kleijn 1 ; 1 KTH, Sweden; 2 Technische Universiteit Eindhoven, The Netherlands Thu-Ses2-O5-2, Time: 15:50 A Bayesian approach to non-intrusive quality assessment of narrow-band speech is presented. The speech features used to assess quality are the sample mean and variance of band-powers evaluated from the temporal envelope in the channels of an auditory filter-bank. Bayesian multivariate adaptive regression splines (BMARS) is used to map features into quality ratings. The proposed combination of features and regression method leads to a high performance quality assessment algorithm that learns efficiently from a small amount of training data and avoids overfitting.
Use of the Bayesian approach also allows the derivation of credible intervals on the model predictions, which provide a quantitative measure of model confidence and can be used to identify the need for complementing the training databases. Precision of Phoneme Boundaries Derived Using Hidden Markov Models Ladan Baghai-Ravary, Greg Kochanski, John Coleman; University of Oxford, UK Thu-Ses2-O5-3, Time: 16:10 Some phoneme boundaries correspond to abrupt changes in the acoustic signal. Others are less clear-cut because the transition from one phoneme to the next is gradual. This paper compares the phoneme boundaries identified by a large number of different alignment systems, using different Notes 167 signal representations and Hidden Markov Model structures. The variability of the different boundaries is analysed statistically, with the boundaries grouped in terms of the broad phonetic classes of the respective phonemes. The mutual consistency between the boundaries from the various systems is analysed to identify which classes of phoneme boundary can be identified reliably by an automatic labelling system, and which are ill-defined and ambiguous. The results presented here provide a starting point for future development of techniques for objective comparisons between systems without giving undue weight to variations in those phoneme boundaries which are inherently ambiguous. Such techniques should improve the efficiency with which new alignment and HMM training algorithms can be developed. A Novel Method for Epoch Extraction from Speech Signals a discontinuity in the Linear Prediction residual. The proposed method is compared to the DYPSA algorithm on the CMU ARCTIC database. A significant improvement as well as a better noise robustness are reported. Besides, results of GOI identification accuracy are promising for the glottal source characterization. Thu-Ses2-P1 : Speaker and Speech Variability, Paralinguistic and Nonlinguistic Cues Hewison Hall, 13:30, Thursday 10 Sept 2009 Chair: Christer Gobl, Trinity College Dublin, Ireland A Novel Codebook Search Technique for Estimating the Open Quotient Yen-Liang Shue, Jody Kreiman, Abeer Alwan; University of California at Los Angeles, USA Lakshmish Kaushik, Douglas O’Shaughnessy; INRS-EMT, Canada Thu-Ses2-P1-1, Time: 13:30 Thu-Ses2-O5-4, Time: 16:30 This paper introduces a novel method of speech epoch extraction using a modified Wigner-Ville distribution. The Wigner-Ville distribution is an efficient speech representation tool with which minute speech variations can be tracked precisely. In this paper, epoch detection/extraction using accurate energy tracking, noise robustness, and the efficient speech representation properties of a modified discrete Wigner-Ville distribution is explored. The developed technique is tested using the Arctic database and its epoch information from an electro-glottograph as reference epochs. The developed algorithm is compared with the available state of the art methods in various noise conditions (babble, white, and vehicle) and different levels of degradation. The proposed method outperforms the existing methods in the literature. The open quotient (OQ), loosely defined as the proportion of time the glottis is open during phonation, is an important parameter in many source models. Accurate estimation of OQ from acoustic signals is a non-trivial process as it involves the separation of the source signal from the vocal-tract transfer function. 
Often this process is hampered by the lack of direct physiological data with which to calibrate algorithms. In this paper, an analysis-by-synthesis method using a codebook of harmonically-based Liljencrants-Fant (LF) source models in conjunction with a constrained optimizer was used to obtain estimates of OQ from four subjects. The estimates were compared with physiological measurements from high-speed imaging. Results showed relatively high correlations between the estimated and measured values for only two of the speakers, suggesting that existing source models may be unable to accurately represent some source signals. LS Regularization of Group Delay Features for Speaker Recognition Long Term Examination of Intra-Session and Inter-Session Speaker Variability Jia Min Karen Kua 1 , Julien Epps 1 , Eliathamby Ambikairajah 1 , Eric Choi 2 ; 1 University of New South Wales, Australia; 2 National ICT Australia, Australia A.D. Lawson 1 , A.R. Stauffer 1 , B.Y. Smolenski 1 , B.B. Pokines 2 , M. Leonard 3 , E.J. Cupples 1 ; 1 RADC Inc., USA; 2 Oasis Systems, USA; 3 University of Texas at Dallas, USA Thu-Ses2-O5-5, Time: 16:50 Due to the increasing use of fusion in speaker recognition systems, features that are complementary to MFCCs offer opportunities to advance the state of the art. One promising feature is based on group delay, however this can suffer large variability due to its numerical formulation. In this paper, we investigate reducing this variability in group delay features with least squares regularization. Evaluations on the NIST 2001 and 2008 SRE databases show a relative improvement of at least 6% and 18% EER respectively when group delay-based system is fused with MFCC-based system. Glottal Closure and Opening Instant Detection from Speech Signals Thomas Drugman, Thierry Dutoit; Faculté Polytechnique de Mons, Belgium Thu-Ses2-P1-2, Time: 13:30 Session variability in speaker recognition is a well recognized phenomena, but poorly understood largely due to a dearth of robust longitudinal data. The current study uses a large, longterm speaker database to quantify both speaker variability changes within a conversation and the impact of speaker variability changes over the long term (3 years). Results demonstrate that 1) change in accuracy over the course of a conversation is statistically very robust and 2) that the aging effect over three years is statistically negligible. Finally we demonstrate that voice change during the course of a conversation is, in large part, comparable across sessions. Distorted Visual Information Influences Audiovisual Perception of Voicing Thu-Ses2-O5-6, Time: 17:10 This paper proposes a new procedure to detect Glottal Closure and Opening Instants (GCIs and GOIs) directly from speech waveforms. The procedure is divided into two successive steps. First a meanbased signal is computed, and intervals where speech events are expected to occur are extracted from it. Secondly, at each interval a precise position of the speech event is assigned by locating Ragnhild Eg, Dawn Behne; NTNU, Norway Thu-Ses2-P1-3, Time: 13:30 Research has shown that visual information becomes less reliable when images are severely distorted. Furthermore, while voicing is generally identified from acoustical cues, it may also provide perception with visual cues. The current study investigated the Notes 168 impact of video distortion on the audiovisual perception of voicing. 
Audiovisual stimuli were presented to 30 participants with the original video quality, or with reduced video resolution (75×60 pixels, 45×36 pixels). Results revealed that in addition to increased auditory reliance with video distortion, particularly for voiceless stimuli, perception of voiceless stimuli was more influenced by the visual modality than voiced stimuli. vealed that both formant space and F0 can act as cues for speaker discrimination even via BCUHA. However, sensitivity to formant space in BCU hearing is less than in AC hearing. Perceived Naturalness of a Synthesizer of Disordered Voices Céline De Looze, Stéphane Rauzy; LPL, France Samia Fraj, Francis Grenez, Jean Schoentgen; Université Libre de Bruxelles, Belgium In this article a clustering algorithm, allowing the automatic detection of speakers’ register changes, is presented. Together with automatic detection of pause duration, it has shown to be efficient for the automatic detection and prediction of topic changes. The need to take into account other parameters such as tempo and intensity, in the framework of Linear Discriminant Analysis, is proposed in order to improve the identification of the topic structure of discourse. Automatic Detection and Prediction of Topic Changes Through Automatic Detection of Register Variations and Pause Duration Thu-Ses2-P1-7, Time: 13:30 Thu-Ses2-P1-4, Time: 13:30 The presentation describes a synthesizer of normal and disordered voice timbres and their perceptual evaluation with respect to naturalness. The simulator uses a shaping function model, which enables controlling the perturbations of the frequency and harmonic richness of the glottal area signal via the control of the instantaneous frequency and amplitude of two harmonic driving functions. Several types of perturbations are simulated. Perceptual experiments, which involve stimuli of synthetic and human vowels with normal values of perturbations, have been carried out. The first has been based on a binary synthetic/natural classification. The second has involved a discrimination task. Both experiments suggest that human judges are unable to distinguish between human and synthetic vowels prepared with the synthesizer described here. Audio-Visual Speech Asynchrony Modeling in a Talking Head Alexey Karpov 1 , Liliya Tsirulnik 2 , Zdeněk Krňoul 3 , Andrey Ronzhin 1 , Boris Lobanov 2 , Miloš Železný 3 ; 1 Russian Academy of Sciences, Russia; 2 National Academy of Sciences, Belarus; 3 University of West Bohemia in Pilsen, Czech Republic Analyzing Features for Automatic Age Estimation on Cross-Sectional Data Werner Spiegl 1 , Georg Stemmer 2 , Eva Lasarcyk 3 , Varada Kolhatkar 4 , Andrew Cassidy 5 , Blaise Potard 6 , Stephen Shum 7 , Young Chol Song 8 , Puyang Xu 5 , Peter Beyerlein 9 , James Harnsberger 10 , Elmar Nöth 1 ; 1 FAU Erlangen-Nürnberg, Germany; 2 SVOX Deutschland GmbH, Germany; 3 Universität des Saarlandes, Germany; 4 University of Minnesota Duluth, USA; 5 Johns Hopkins University, USA; 6 CRIN, France; 7 University of California at Berkeley, USA; 8 Stony Brook University, USA; 9 TFH Wildau, Germany; 10 University of Florida, USA Thu-Ses2-P1-8, Time: 13:30 Thu-Ses2-P1-5, Time: 13:30 An audio-visual speech synthesis system with modeling of asynchrony between auditory and visual speech modalities is proposed in the paper. Corpus-based study of real recordings gave us the required data for understanding the problem of modalities asynchrony that is partially caused by the co-articulation phenomena. 
A set of context-dependent timing rules and recommendations was elaborated in order to synchronize the auditory and visual speech cues of the animated talking head in a natural, humanlike way. The cognitive evaluation of the model-based talking head for Russian with implementation of the original asynchrony model has shown high intelligibility and naturalness of audio-visual synthesized speech. The Effects of Fundamental Frequency and Formant Space on Speaker Discrimination Through Bone-Conducted Ultrasonic Hearing Takayuki Kagomiya, Seiji Nakagawa; AIST, Japan Thu-Ses2-P1-6, Time: 13:30 Human listeners can perceive speech signals from a voice-modulated ultrasonic carrier which is presented through a bone-conduction stimulator, even if they have sensorineural hearing loss. As an application of this phenomenon, we have been developing a bone-conducted ultrasonic hearing aid (BCUHA). This research examined whether formant space and F0 can be cues of speaker discrimination in BCU hearing as well as via air-conduction (AC) hearing. A series of speaker discrimination experiments re- We develop an acoustic feature set for the estimation of a person’s age from a recorded speech signal. The baseline features are Mel-frequency cepstral coefficients (MFCCs) which are extended by various prosodic features, pitch and formant frequencies. From experiments on the University of Florida Vocal Aging Database we can draw different conclusions. On the one hand, adding prosodic, pitch and formant features to the MFCC baseline leads to relative reductions of the mean absolute error between 4–20%. Improvements are even larger when perceptual age labels are taken as a reference. On the other hand, reasonable results with a mean absolute error in age estimation of about 12 years are already achieved using a simple gender-independent setup and MFCCs only. Future experiments will evaluate the robustness of the prosodic features against channel variability on other databases and investigate the differences between perceptual and chronological age labels. Intercultural Differences in Evaluation of Pathological Voice Quality: Perceptual and Acoustical Comparisons Between RASATI and GRBASI Scales Emi Juliana Yamauchi 1 , Satoshi Imaizumi 1 , Hagino Maruyama 2 , Tomoyuki Haji 2 ; 1 Prefectural University of Hiroshima, Japan; 2 Kurashiki Central Hospital, Japan Thu-Ses2-P1-9, Time: 13:30 This paper analyzes differences and commonality in pathological voice quality evaluation between two different scaling systems, GRBASI and RASATI. The results identified significant interrelations between the scales. Harshness, included in RASATI, is described as noisiness and strain in the GRBASI scale. Roughness is found to be the most consistent factor and easiest to identify by listeners of different linguistic backgrounds. Intercultural agreement in pathological voice quality evaluation seems to be possible. F0 Cues for the Discourse Functions of “hã” in Hindi Kalika Bali; Microsoft Research India, India Thu-Ses2-P1-10, Time: 13:30 Affirmative particles are often employed in conversational speech to convey more than their literal semantic meaning. The discourse information conveyed by such particles can have consequences in both Speech Understanding and Speech Production for a Spoken Dialogue System. This paper analyses the different discourse functions of the affirmative particle hã (“yes”) in Hindi and explores the role of fundamental frequency (f0) as a cue to disambiguating these functions.
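For readers who want a concrete picture of the kind of f0 cues such an analysis relies on, the sketch below computes a few coarse contour statistics from a pitch track of the particle; the feature set and the final threshold are hypothetical and only illustrate the general approach, not the study's method.

```python
# Illustrative extraction of simple f0 cues (mean, range, end slope) from a
# pitch contour of the particle, of the kind such a study might examine.
# The features and the rule at the end are hypothetical, not from the paper.
import numpy as np

def f0_cues(f0_hz, frame_shift_s=0.01):
    """Compute coarse f0 statistics from a voiced pitch track in Hz."""
    f0 = np.asarray(f0_hz, dtype=float)
    f0 = f0[f0 > 0]                      # keep voiced frames only
    t = np.arange(len(f0)) * frame_shift_s
    tail = max(2, len(f0) // 3)          # final third of the particle
    slope = np.polyfit(t[-tail:], f0[-tail:], 1)[0]   # Hz per second
    return {
        "mean_hz": float(f0.mean()),
        "range_hz": float(f0.max() - f0.min()),
        "end_slope_hz_per_s": float(slope),
    }

# Toy rule: a rising final slope is treated as a non-affirmative (e.g.,
# question-like) use; purely a placeholder threshold for illustration.
cues = f0_cues([0, 180, 182, 185, 190, 205, 230, 0])
label = "rising (question-like)" if cues["end_slope_hz_per_s"] > 20 else "level/falling"
print(cues, label)
```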
Audio Spatialisation Strategies for Multitasking During Teleconferences Stuart N. Wrigley, Simon Tucker, Guy J. Brown, Steve Whittaker; University of Sheffield, UK Thu-Ses2-P1-11, Time: 13:30 Multitasking during teleconferences is becoming increasingly common: participants continue their work whilst monitoring the audio for topics of interest. Our previous work has established the benefit of spatialised audio presentation on improving multitasking performance. In this study, we investigate the different spatialisation strategies employed by subjects in order to aid their multitasking performance and improve their user experience. Subjects were given the freedom to place each participant at a different location in the acoustic space both in terms of azimuth and distance. Their strategies were based upon cues regarding keywords and which participant will utter them. Our findings suggest that subjects employ consistent strategies with regard to the location of target and distracter talkers. Furthermore, manipulation of the acoustic space plays an important role in multitasking performance and the user experience. Speech Rate Effects on Linguistic Change Alexsandro R. Meireles 1 , Plínio A. Barbosa 2 ; 1 Federal University of Espírito Santo, Brazil; 2 State University of Campinas, Brazil Thu-Ses2-P1-12, Time: 13:30 This work is couched in the Articulatory Phonology theoretical framework, and it discusses the possible role of speech rate on diachronic change from antepenultimate stress words to penultimate stress words. In this kind of change, there is deletion of the medial (or final) post-stressed vowel of the antepenultimate stress words. Our results suggest that speech rate can explain this historical process of linguistic change, since the medial poststressed vowel reduces more, although without deletion, than the final post-stressed vowel from normal to fast rate. These results were confirmed by Friedman’s ANOVA. A one-way ANOVA also indicated that the duration of the medial post-stressed vowel is significantly smaller than the duration of the final post-stressed vowel. On the other hand, words such as “fôlego” (breath) and “sábado” (Saturday) reduce less their post-stressed segments in comparison with words such as “abóbora” (pumpkin). This finding, associated to Brazilian Portuguese phonotactic restrictions, can explain why forms such as “folgo” and “sabdo” are not frequently found in this language. Besides, linguistic changes influenced by speech rate act according to dialect and gender. In this paper, speakers from the Mineiro dialect (from Minas Gerais state) (rate: 7.5 syllables/sec.) reduced the medial post-stressed vowel more than speakers from the Paulista dialect (from São Paulo state) (rate: 6.4 syllables/second), and male speakers (rate: 5.8 syllables/sec.) reduced the medial post-stressed vowel more than female speakers (rate: 5.2 syllables/second). These results were also confirmed by one-way ANOVA. Mandarin Spontaneous Narrative Planning — Prosodic Evidence from National Taiwan University Lecture Corpus Chiu-yu Tseng 1 , Zhao-yu Su 1 , Lin-shan Lee 2 ; 1 Academia Sinica, Taiwan; 2 National Taiwan University, Taiwan Thu-Ses2-P1-13, Time: 13:30 This paper discusses discourse planning of pre-organized spontaneous narratives (SpnNS) in comparison with read speech (RS). F0 and tempo modulations are compared by speech paragraph size and discourse boundaries. 
The speaking rate of SpnNS from university classroom lectures is 2 to 3 times that of RS by professionals; paragraph phrasing of SpnNS is 6 times that of RS. Patterns of paragraph association are distinct for SpnNS and RS. Sub-paragraph and paragraph units in RS are marked by distinct relative F0 resets and boundary pause duration, but in SpnNS by patterns of intensity contrasts instead. Consistent across both data sets is the finding that combined relative supra-segmental cues reflecting global prosodic properties are more discriminative for distinguishing discourse boundaries than any single cue, supporting higher-level discourse planning in the acoustic signals. We believe these findings can be directly applied to speech technology development.

Thu-Ses2-P2: ASR: Acoustic Model Features
Hewison Hall, 13:30, Thursday 10 Sept 2009
Chair: Richard M. Stern, Carnegie Mellon University, USA

Investigation into Bottle-Neck Features for Meeting Speech Recognition
František Grézl, Martin Karafiát, Lukáš Burget; Brno University of Technology, Czech Republic
Thu-Ses2-P2-1, Time: 13:30
This work investigates the recently proposed Bottle-Neck features for ASR. The bottle-neck ANN structure is imported into the Split Context architecture, giving a significant WER reduction. Further, a Universal Context architecture was developed which simplifies the system by using only one universal ANN for all temporal splits. A significant WER reduction can be obtained by applying fMPE on top of our BN features as a technique for discriminative feature extraction, and a further gain is obtained by retraining model parameters using the MPE criterion. The results are reported on meeting data from the RT07 evaluation.

Multi-Stream to Many-Stream: Using Spectro-Temporal Features for ASR
Sherry Y. Zhao, Suman Ravuri, Nelson Morgan; ICSI, USA
Thu-Ses2-P2-2, Time: 13:30
We report progress in the use of multi-stream spectro-temporal features for both small and large vocabulary automatic speech recognition tasks. In this approach, features are divided into multiple streams for parallel processing and dynamic utilization. For small vocabulary speech recognition experiments, the incorporation of up to 28 dynamically-weighted spectro-temporal feature streams along with MFCCs yields roughly 21% improvement on the baseline in low noise conditions and 47% improvement in noise-added conditions, a greater improvement on the baseline than in our previous work. A four-stream framework yields a 14% improvement over the baseline in the large vocabulary low noise recognition experiment. These results suggest that the division of spectro-temporal features into multiple streams may be an effective way to flexibly utilize an inherently large number of features for automatic speech recognition.

Tandem Representations of Spectral Envelope and Modulation Frequency Features for ASR
Samuel Thomas, Sriram Ganapathy, Hynek Hermansky; Johns Hopkins University, USA
Thu-Ses2-P2-3, Time: 13:30
We present a feature extraction technique for automatic speech recognition that uses a Tandem representation of short-term spectral envelope and modulation frequency features. These features, derived from sub-band temporal envelopes of speech estimated using frequency domain linear prediction, are combined at the phoneme posterior level. Tandem representations derived from these phoneme posteriors are used along with HMM-based ASR systems for both small and large vocabulary continuous speech recognition (LVCSR) tasks. For a small vocabulary continuous digit task on the OGI Digits database, the proposed features reduce the word error rate (WER) by 13% relative to other feature extraction techniques. We obtain a relative reduction of about 14% in WER for an LVCSR task using the NIST RT05 evaluation data. For phoneme recognition tasks on the TIMIT database, these features provide a relative improvement of 13% compared to other techniques.

Entropy-Based Feature Analysis for Speech Recognition
Panji Setiawan 1, Harald Höge 2, Tim Fingscheidt 3; 1 Siemens Enterprise Communications GmbH & Co. KG, Germany; 2 SVOX Deutschland GmbH, Germany; 3 Technische Universität Braunschweig, Germany
Thu-Ses2-P2-4, Time: 13:30
Based on the concept of entropy, a new approach to analyse the quality of features as used in speech recognition is proposed. We regard the relation between the hidden Markov model (HMM) states and the corresponding frame-based feature vectors as a coding problem, where the states are sent through a noisy recognition channel and received as feature vectors. Using the relation between Shannon’s conditional entropy and the error rate on the state level, we estimate how much information is contained in the feature vectors to recognize the states. Thus, the conditional entropy is a measure for the quality of the features. Finally, we show how noise reduces the information contained in the features.

Hierarchical Processing of the Modulation Spectrum for GALE Mandarin LVCSR System
Fabio Valente 1, Mathew Magimai-Doss 1, C. Plahl 2, Suman Ravuri 3; 1 IDIAP Research Institute, Switzerland; 2 RWTH Aachen University, Germany; 3 ICSI, USA
Thu-Ses2-P2-5, Time: 13:30
This paper aims at investigating the use of TANDEM features based on hierarchical processing of the modulation spectrum. The study is done in the framework of the GALE project for recognition of Mandarin Broadcast data. We describe the improvements obtained using the hierarchical processing and the addition of features like pitch and short-term critical band energy. Results are consistent with previous findings on a different LVCSR task, suggesting that the proposed technique is effective and robust across several conditions. Furthermore, we describe integration into the RWTH GALE LVCSR system trained on 1600 hours of Mandarin data and present progress across the GALE 2007 and GALE 2008 RWTH systems, resulting in approximately 20% CER reduction on several data sets.

Hill-Climbing Feature Selection for Multi-Stream ASR
David Gelbart 1, Nelson Morgan 1, Alexey Tsymbal 2; 1 ICSI, USA; 2 Siemens AG, Germany
Thu-Ses2-P2-6, Time: 13:30
We performed automated feature selection for multi-stream (i.e., ensemble) automatic speech recognition, using a hill-climbing (HC) algorithm that changes one feature at a time if the change improves a performance score. For both clean and noisy data sets (using the OGI Numbers corpus), HC usually improved performance on held-out data compared to the initial system it started with, even for noise types that were not seen during the HC process. Overall, we found that using Opitz’s scoring formula, which blends single-classifier word recognition accuracy and ensemble diversity, worked better than ensemble accuracy as a performance score for guiding HC in cases of extreme mismatch between the SNR of training and test sets. Our noisy version of the Numbers corpus, our multi-layer-perceptron-based Numbers ASR system, and our HC scripts are available online.
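As a rough illustration of the hill-climbing selection loop summarised in the abstract above (Thu-Ses2-P2-6), the Python sketch below toggles one feature at a time and keeps a change only when a held-out score improves. The scoring function, the stopping rule and the initial feature set are placeholders for this illustration, not details taken from the paper.

    # Greedy hill-climbing feature selection (illustrative sketch only).
    # score_fn is any callable mapping a feature subset to a number to maximise,
    # e.g. held-out word accuracy or an accuracy/diversity blend.
    def hill_climb(features, score_fn, max_passes=3):
        selected = set(features)              # start from the full feature set
        best = score_fn(selected)
        for _ in range(max_passes):
            improved = False
            for f in features:
                candidate = selected ^ {f}    # flip membership of one feature
                if not candidate:
                    continue                  # never score an empty set
                s = score_fn(candidate)
                if s > best:
                    selected, best, improved = candidate, s, True
            if not improved:
                break                         # local optimum reached
        return selected, best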
Robust F0 Estimation Based on Log-Time Scale Autocorrelation and its Application to Mandarin Tone Recognition
Yusuke Kida, Masaru Sakai, Takashi Masuko, Akinori Kawamura; Toshiba Corporate R&D Center, Japan
Thu-Ses2-P2-7, Time: 13:30
This paper proposes a novel F0 estimation method in which delta-logF0 is directly estimated based on the autocorrelation function (ACF) on a logarithmic time scale. Since peaks of ACFs of periodic signals have a specific pattern on the log-time scale, and the period only affects the position of the pattern, delta-logF0 can be estimated directly from the shift of the peaks of the log-time scale ACF (LTACF) without F0 estimation. Then logF0 is estimated from the sum of LTACFs shifted based on delta-logF0. Experimental results show that the proposed method is more robust against noise than the baseline ACF-based method. It is also shown that the proposed method significantly improves Mandarin tone recognition accuracy.

Invariant-Integration Method for Robust Feature Extraction in Speaker-Independent Speech Recognition
Florian Müller, Alfred Mertins; Universität zu Lübeck, Germany
Thu-Ses2-P2-8, Time: 13:30
The vocal tract length (VTL) is one of the variabilities that speaker-independent automatic speech recognition (ASR) systems encounter. Standard methods to compensate for the effects of different VTLs within the processing stages of ASR systems often have a high computational cost. By using an appropriate warping scheme for the frequency centers of the time-frequency analysis, a change in VTL can be approximately described by a translation in the subband-index space. We present a new type of features based on the principle of invariant integration, and a corresponding feature selection method is described. ASR experiments show the increased robustness of the proposed features in comparison to standard MFCCs.

Discriminative Feature Transformation Using Output Coding for Speech Recognition
Omid Dehzangi 1, Bin Ma 2, Eng Siong Chng 1, Haizhou Li 2; 1 Nanyang Technological University, Singapore; 2 Institute for Infocomm Research, Singapore
Thu-Ses2-P2-9, Time: 13:30
In this paper, we present a new mechanism to extract discriminative acoustic features for speech recognition using continuous output coding (COC) based feature transformation. Our proposed method first expands the short-time spectral features into a higher-dimensional feature space to improve their discriminative capability. The expansion is performed by polynomial expansion. The high-dimensional features are then projected into a lower-dimensional space using a continuous output coding technique implemented by a set of linear SVMs. The resulting feature vectors are designed to encode the differences between phones. The generated features are shown to be more discriminative than MFCCs, and experimental results on both the TIMIT and NTIMIT corpora showed better phone recognition accuracy with the proposed features.

Discriminant Spectrotemporal Features for Phoneme Recognition
Nima Mesgarani, G.S.V.S. Sivaram, Sridhar Krishna Nemala, Mounya Elhilali, Hynek Hermansky; Johns Hopkins University, USA
Thu-Ses2-P2-10, Time: 13:30
We propose discriminant methods for deriving two-dimensional spectrotemporal features for phoneme recognition that are estimated to maximize the separation between the representations of phoneme classes. The linearity of the filters results in an intuitive interpretation, enabling us to investigate the working principles of the system and to improve its performance by locating the sources of error. Two methods for the estimation of the filters are proposed: Regularized Least Squares (RLS) and Modified Linear Discriminant Analysis (MLDA). Both methods reach a comparable improvement over the baseline condition, demonstrating the advantage of the discriminant spectrotemporal filters.

Auditory Model Based Optimization of MFCCs Improves Automatic Speech Recognition Performance
Saikat Chatterjee, Christos Koniaris, W. Bastiaan Kleijn; KTH, Sweden
Thu-Ses2-P2-11, Time: 13:30
Using a spectral auditory model along with perturbation-based analysis, we develop a new framework to optimize a set of features such that it emulates the behavior of the human auditory system. The optimization is carried out in an off-line manner, based on the conjecture that the local geometries of the feature domain and the perceptual auditory domain should be similar. Using this principle, we modify and optimize the static mel frequency cepstral coefficients (MFCCs) without considering any feedback from the speech recognition system. We show that improved recognition performance is obtained for any environmental condition, clean as well as noisy.

Thu-Ses2-P3: ASR: Tonal Language, Cross-Lingual and Multilingual ASR
Hewison Hall, 13:30, Thursday 10 Sept 2009
Chair: Lori Lamel, LIMSI, France

Pronunciation-Based ASR for Names
Henk van den Heuvel 1, Bert Réveil 2, Jean-Pierre Martens 2; 1 Radboud Universiteit Nijmegen, The Netherlands; 2 Ghent University, Belgium
Thu-Ses2-P3-1, Time: 13:30
To improve the ASR of proper names, a novel method based on the generation of pronunciation variants by means of phoneme-to-phoneme converters (P2Ps) is proposed. The aim is to convert baseline transcriptions into variants that maximally resemble actual name pronunciations found in a training corpus. The method has to operate in a cross-lingual setting, with native Dutch persons speaking Dutch and foreign names, and foreign persons speaking Dutch names. The P2Ps are trained to act either on conventional G2P transcriptions or on canonical transcriptions provided by a human expert. Including the variants produced by the P2Ps in the lexicon of the recognizer substantially improves the recognition accuracy for natives pronouncing foreign names, but not for the other investigated combinations.

How Speaker Tongue and Name Source Language Affect the Automatic Recognition of Spoken Names
Bert Réveil 1, Jean-Pierre Martens 1, Bart D’hoore 2; 1 Ghent University, Belgium; 2 Nuance, Belgium
Thu-Ses2-P3-2, Time: 13:30
In this paper the automatic recognition of person names and geographical names uttered by native and non-native speakers is examined in an experimental set-up. The major aim was to improve our understanding of how well, and under which circumstances, previously proposed methods of multilingual pronunciation modeling and multilingual acoustic modeling contribute to better name recognition in a cross-lingual context. To come to a meaningful interpretation of the results, we categorized each language according to the amount of exposure a native speaker is expected to have had to this language. After interpreting our results, we also tried to answer the question of how much further improvement one might attain with a more advanced pronunciation modeling technique which we plan to develop.

Online Generation of Acoustic Models for Multilingual Speech Recognition
Martin Raab 1, Guillermo Aradilla 1, Rainer Gruhn 1, Elmar Nöth 2; 1 Harman Becker Automotive Systems, Germany; 2 FAU Erlangen-Nürnberg, Germany
Thu-Ses2-P3-3, Time: 13:30
Our goal is to provide a multilingual speech-based Human Machine Interface for in-car infotainment and navigation systems. Multilinguality is needed, for example, for music player control via speech, as artist and song names in the globalized music market come from many languages. Another frequent use case is the input of foreign navigation destinations via speech. In this paper we propose approximated projections between mixtures of Gaussians that allow the generation of the multilingual system from monolingual systems. This makes the creation of the multilingual system on an embedded platform possible, with the benefit that training and maintenance effort remain unchanged compared to the provision of monolingual systems. We also sketch how this algorithm, together with our previous work, can help to provide an efficient architecture for multilingual speech recognition on embedded devices.
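The abstract above (Thu-Ses2-P3-3) does not spell out how the projection between Gaussian mixtures is computed; purely as an illustration of one plausible approximation, the sketch below maps each Gaussian of a monolingual model to its nearest neighbour in a shared multilingual codebook under a symmetric Kullback-Leibler divergence for diagonal covariances. The function names and the nearest-neighbour criterion are assumptions made for this sketch, not the method of the paper.

    import numpy as np

    def sym_kl_diag(m1, v1, m2, v2):
        # Symmetric KL divergence between two diagonal-covariance Gaussians.
        kl12 = 0.5 * np.sum(np.log(v2 / v1) + (v1 + (m1 - m2) ** 2) / v2 - 1.0)
        kl21 = 0.5 * np.sum(np.log(v1 / v2) + (v2 + (m2 - m1) ** 2) / v1 - 1.0)
        return kl12 + kl21

    def project_to_codebook(mono_means, mono_vars, code_means, code_vars):
        # For each monolingual Gaussian, return the index of the closest
        # Gaussian in the shared multilingual codebook.
        mapping = []
        for m, v in zip(mono_means, mono_vars):
            dists = [sym_kl_diag(m, v, cm, cv)
                     for cm, cv in zip(code_means, code_vars)]
            mapping.append(int(np.argmin(dists)))
        return mapping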
Basic Speech Recognition for Spoken Dialogues
Charl van Heerden, Etienne Barnard, Marelie Davel; CSIR, South Africa
Thu-Ses2-P3-4, Time: 13:30
Spoken dialogue systems (SDSs) have great potential for information access in the developing world. However, the realisation of that potential requires the solution of several challenging problems, including the development of sufficiently accurate speech recognisers for a diverse multitude of languages. We investigate the feasibility of developing small-vocabulary speaker-independent ASR systems designed for use in a telephone-based information system, using ten resource-scarce languages spoken in South Africa as a case study. We contrast a cross-language transfer approach (using a well-trained system from a different language) with the development of new language-specific corpora and systems, and evaluate the effectiveness of both approaches. We find that limited speech corpora (3 to 8 hours of data from around 200 speakers) are sufficient for the development of reasonably accurate recognisers: error rates are in the range 2% to 12% for a ten-word task, where vocabulary words are excluded from training to simulate vocabulary-independent performance. This approach is substantially more accurate than cross-language transfer, and sufficient for the development of basic spoken dialogue systems.

Tonal Articulatory Feature for Mandarin and its Application to Conversational LVCSR
Qingqing Zhang, Jielin Pan, Yonghong Yan; Chinese Academy of Sciences, China
Thu-Ses2-P3-5, Time: 13:30
This paper presents our recent work on the development of a tonal Articulatory Feature (AF) set for Mandarin and its application to conversational LVCSR. Motivated by the theory of Mandarin phonology, eight features for classifying the acoustic units and one feature for classifying the tone are investigated and constructed, and the AF-based tandem approach is used to improve speech recognition performance. With this Mandarin AF set, a significant relative reduction in Character Error Rate is obtained over the baseline system using standard acoustic features, and the comparison between ASR systems based on AF classifiers with and without the tonal feature demonstrates that the system with the tonal feature achieves further gains.

Effects of Language Mixing for Automatic Recognition of Cantonese-English Code-Mixing Utterances
Houwei Cao, P.C. Ching, Tan Lee; Chinese University of Hong Kong, China
Thu-Ses2-P3-6, Time: 13:30
While automatic speech recognition of either Cantonese or English alone has achieved a great degree of success, recognition of Cantonese-English code-mixing speech is not as trivial. This paper attempts to analyze the effect of language mixing on the recognition performance of code-mixing utterances. By examining the recognition results of Cantonese-English code-mixing speech, where Cantonese is the matrix language and English is the embedded language, we noticed that the recognition accuracy of the embedded language plays a significant role in the overall performance. In particular, significant performance degradation is found in the matrix language if the embedded words cannot be recognized correctly. We also studied the error propagation effect of the embedded English. The results show that an error in an embedded English word may propagate to two neighboring Cantonese syllables. Finally, analysis is carried out to determine the influencing factors for recognition performance in embedded English.

A One-Step Tone Recognition Approach Using MSD-HMM for Continuous Speech
Changliang Liu, Fengpei Ge, Fuping Pan, Bin Dong, Yonghong Yan; Chinese Academy of Sciences, China
Thu-Ses2-P3-7, Time: 13:30
There are two types of methods for tone recognition of continuous speech: one-step and two-step approaches. Two-step approaches need to identify the syllable boundaries first, while one-step approaches do not. Previous studies mostly focus on two-step approaches. In this paper, a one-step approach using the multi-space distribution HMM (MSD-HMM) is investigated. F0, which only exists in voiced speech, is modeled by the MSD-HMM. Then, a tonal syllable network is built based on the reference, and a Viterbi search is carried out on it to find the best tone sequence. Two modifications to the conventional tri-phone HMM models are investigated: tone-based context expansion and syllable-based model units. The experimental results show that tone-based context information is more important for tone recognition and that syllable-based HMM models are much better than phone-based ones. The final tone correct rate is 88.8%, which is much higher than that of state-of-the-art two-step approaches.

Stream-Based Context-Sensitive Phone Mapping for Cross-Lingual Speech Recognition
Khe Chai Sim, Haizhou Li; Institute for Infocomm Research, Singapore
Thu-Ses2-P3-8, Time: 13:30
Recently, a Probabilistic Phone Mapping (PPM) model was proposed to facilitate cross-lingual automatic speech recognition using a foreign phonetic system. Under this framework, discrete hidden Markov models (HMMs) are used to map a foreign phone sequence to a target phone sequence. Context-sensitive mapping is made possible by expanding the discrete observation symbols to include the contexts of the foreign phones in which they appear in the sequence. Unfortunately, modelling the context dependencies jointly results in a dramatic increase in model parameters as wider contexts are used. In this paper, the probability of observing a context-dependent symbol is decomposed into the product of the probabilities of observing the symbol and its contexts. This allows wider contexts to be modelled without greatly compromising the model complexity. This can be modelled conveniently using a multiple-stream discrete HMM system where the contexts are treated as independent streams. Experimental results are reported on the TIMIT English phone recognition task using Czech, Hungarian and Russian foreign phone recognisers.
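A minimal sketch of the stream-wise decomposition described in the abstract above (Thu-Ses2-P3-8): the probability of a context-dependent foreign symbol is approximated as the product of independent per-stream probabilities for the centre symbol and its left and right contexts. Estimation here is plain relative-frequency counting with no smoothing, which is an assumption of this sketch rather than a detail from the paper.

    from collections import defaultdict

    class StreamPhoneMapper:
        STREAMS = ("centre", "left", "right")

        def __init__(self):
            self.counts = {s: defaultdict(lambda: defaultdict(int)) for s in self.STREAMS}
            self.totals = {s: defaultdict(int) for s in self.STREAMS}

        def observe(self, target, centre, left, right):
            # Accumulate counts of each stream symbol given the target phone.
            for stream, sym in zip(self.STREAMS, (centre, left, right)):
                self.counts[stream][target][sym] += 1
                self.totals[stream][target] += 1

        def prob(self, target, centre, left, right):
            # P(centre, left, right | target) approximated as a product of
            # independent per-stream relative frequencies.
            p = 1.0
            for stream, sym in zip(self.STREAMS, (centre, left, right)):
                total = self.totals[stream][target]
                p *= (self.counts[stream][target][sym] / total) if total else 0.0
            return p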
Human Translations Guided Language Discovery for ASR Systems
Sebastian Stüker 1, Laurent Besacier 2, Alex Waibel 1; 1 Universität Karlsruhe (TH), Germany; 2 LIG, France
Thu-Ses2-P3-9, Time: 13:30
The traditional approach of collecting and annotating the necessary training data is, due to economic constraints, not feasible for most of the 7,000 languages in the world. At the same time, it is of vital interest to have natural language processing systems address practically all of them. Therefore, new, efficient ways of gathering the needed training material have to be found. In this paper we continue our experiments on exploiting the knowledge gained from human simultaneous translations, which happen frequently in the real world, in order to discover word units in a new language. We evaluate our approach by measuring the performance of statistical machine translation systems trained on the word units discovered from an oracle phoneme sequence. We then improve it by combining it with a word discovery technique that works without supervision, solely on the unsegmented phoneme sequences.

Thu-Ses2-P4: ASR: New Paradigms II
Hewison Hall, 13:30, Thursday 10 Sept 2009
Chair: Michael Schuster, Google, USA

The Case for Case-Based Automatic Speech Recognition
Viktoria Maier, Roger K. Moore; University of Sheffield, UK
Thu-Ses2-P4-1, Time: 13:30
In order to avoid global parameter settings which are locally suboptimal, this paper argues for the inclusion of more knowledge (in particular procedural knowledge) in automatic speech recognition (ASR) systems. Two related fields provide inspiration for this new perspective: (a) ‘cognitive architectures’ indicate how experience with related problems can give rise to more (expert) knowledge, and (b) ‘case-based reasoning’ provides an extended framework which is relevant to any similarity-based recognition system. The outcome of this analysis is a proposal for a new approach termed ‘Case-Based ASR’.

A Self-Labeling Speech Corpus: Collecting Spoken Words with an Online Educational Game
Ian McGraw, Alexander Gruenstein, Andrew Sutherland; MIT, USA
Thu-Ses2-P4-2, Time: 13:30
We explore a new approach to collecting and transcribing speech data by using online educational games. One such game, Voice Race, elicited over 55,000 utterances over a 22-day period, representing 18.7 hours of speech. Voice Race was designed such that the transcripts for a significant subset of utterances can be automatically inferred using the contextual constraints of the game. Game context can also be used to simplify transcription to a multiple-choice task, which can be performed by non-experts. We found that one third of the speech collected with Voice Race could be automatically transcribed with over 98% accuracy, and that an additional 49% could be labeled cheaply by Amazon Mechanical Turk workers. We demonstrate the utility of the self-labeled speech in an acoustic model adaptation task, which resulted in a reduction in the Voice Race utterance error rate. The collected utterances cover a wide variety of vocabulary, and should be useful across a range of research.

A Noise Robust Method for Pattern Discovery in Quantized Time Series: The Concept Matrix Approach
Okko Johannes Räsänen, Unto Kalervo Laine, Toomas Altosaar; Helsinki University of Technology, Finland
Thu-Ses2-P4-3, Time: 13:30
An efficient method for pattern discovery from discrete time series is introduced in this paper. The method utilizes two parallel streams of data: a discrete unit time series and a set of labeled events. From these inputs it builds associative models between systematically co-occurring structures existing in both streams. The models are based on transitional probabilities of events at several different time scales. Learning and recognition processes are incremental, making the approach suitable for online learning tasks. The capabilities of the algorithm are demonstrated in a continuous speech recognition task operating at varying noise levels.

Using Parallel Architectures in Speech Recognition
Patrick Cardinal, Pierre Dumouchel, Gilles Boulianne; CRIM, Canada
Thu-Ses2-P4-4, Time: 13:30
The speed of modern processors has remained constant over the last few years and thus, to be scalable, applications must be parallelized. In addition to the main CPU, almost every computer is equipped with a Graphics Processing Unit (GPU), which is in essence a specialized parallel processor. This paper explores how the performance of speech recognition systems can be enhanced by using the GPU for the acoustic computations and multi-core CPUs for the Viterbi search in a large vocabulary application. The multi-core implementation of our speech recognition system runs 1.3 times faster than the single-threaded CPU implementation. Adding the GPU for dedicated acoustic computations increases the speed by a factor of 2.8, leading to a word accuracy improvement of 16.6% absolute at real time, compared to the single-threaded CPU implementation.

Example-Based Speech Recognition Using Formulaic Phrases
Christopher J. Watkins, Stephen J. Cox; University of East Anglia, UK
Thu-Ses2-P4-5, Time: 13:30
In this paper, we describe the design of an ASR system that is based on identifying and extracting formulaic phrases from a corpus and then, rather than building statistical models of them, performing example-based recognition of these phrases. We describe a method for combining formulaic phrases into a bigram language model that results in a 13% decrease in WER on a monophone HMM recogniser over the baseline. We show that using this model with phrase templates in the example-based recogniser gives a significant improvement in WER compared to word templates, but performance still falls short of the HMM recogniser. We also describe an LDA decision tree classifier that reduces the search space of the DTW decoder by 40% while at the same time decreasing WER.

Parallel Fast Likelihood Computation for LVCSR Using Mixture Decomposition
Naveen Parihar 1, Ralf Schlüter 2, David Rybach 2, Eric A. Hansen 1; 1 Mississippi State University, USA; 2 RWTH Aachen University, Germany
Thu-Ses2-P4-6, Time: 13:30
This paper describes a simple and robust method for improving the runtime of likelihood computation on multi-core processors without degrading system accuracy. The method improves runtime by parallelizing likelihood computations on a multi-core processor. Mixtures are decomposed among the cores, and each core computes the likelihood of the mixture allocated to it. We study two approaches to mixture decomposition: chunk-based and decision-tree-based. When applied to the RWTH TC-STAR EPPS English LVCSR system on an Intel Core2 Quad processor with varying pruning-beam width settings, the method resulted in a 54% to 70% improvement in the likelihood computation runtime, and an 18% to 59% improvement in the overall runtime.
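To make the chunk-based decomposition mentioned in the abstract above (Thu-Ses2-P4-6) concrete, the sketch below splits the Gaussians of a diagonal-covariance mixture into chunks, lets a process pool compute a partial log-likelihood per chunk, and combines the partial results with log-sum-exp. The model sizes, the pool size and the use of Python's standard process pool are placeholders chosen for this illustration; they are not taken from the paper.

    import numpy as np
    from concurrent.futures import ProcessPoolExecutor

    def _logsumexp(a):
        a = np.asarray(a, dtype=float)
        m = a.max()
        return m + np.log(np.exp(a - m).sum())

    def chunk_loglik(args):
        # Partial log-likelihood of one frame over one chunk of mixture components.
        x, weights, means, variances = args
        diff = x - means
        log_gauss = -0.5 * (np.log(2 * np.pi * variances) + diff ** 2 / variances).sum(axis=1)
        return _logsumexp(np.log(weights) + log_gauss)

    def frame_loglik(x, weights, means, variances, n_workers=4):
        # Decompose the mixture into chunks and dispatch one chunk per worker.
        chunks = zip(np.array_split(weights, n_workers),
                     np.array_split(means, n_workers),
                     np.array_split(variances, n_workers))
        jobs = [(x, w, m, v) for w, m, v in chunks]
        with ProcessPoolExecutor(max_workers=n_workers) as pool:
            partial = list(pool.map(chunk_loglik, jobs))
        return _logsumexp(partial)

    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        n_mix, dim = 64, 39
        w = np.full(n_mix, 1.0 / n_mix)
        mu = rng.normal(size=(n_mix, dim))
        var = np.ones((n_mix, dim))
        print(frame_loglik(rng.normal(size=dim), w, mu, var))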
An Indexing Weight for Voice-to-Text Search
Chen Liu; Motorola, USA
Thu-Ses2-P4-7, Time: 13:30
The TF-IDF (term frequency-inverse document frequency) weight is a well-known indexing weight in information retrieval and text mining. However, it is not suitable for the increasingly popular voice-to-text search, as it does not take into account the impact of voice in the search process. We propose a method for calculating a new indexing weight, which is used as guidance for the selection of suitable queries for voice-to-text search. In designing the new weight, we combine prominence factors from both the text and acoustic domains. Experimental results show a significant improvement in the average search success rate with the new indexing weight.

On Invariant Structural Representation for Speech Recognition: Theoretical Validation and Experimental Improvement
Yu Qiao, Nobuaki Minematsu, Keikichi Hirose; University of Tokyo, Japan
Thu-Ses2-P4-8, Time: 13:30
One of the most challenging problems in speech recognition is to deal with the inevitable acoustic variations caused by non-linguistic factors. Recently, an invariant structural representation of speech was proposed [1], where the non-linguistic variations are effectively removed through modeling the dynamic and contrastive aspects of speech signals. This paper describes our recent progress on this problem. Theoretically, we prove that the maximum-likelihood-based decomposition can lead to the same structural representations for a sequence and its transformed version. Practically, we introduce a method of discriminant analysis of eigen-structure to deal with two limitations of structural representations, namely high dimensionality and too strong invariance. In the first experiment, we evaluate the proposed method through recognizing connected Japanese vowels. The proposed method achieves a recognition rate of 99.0%, which is higher than those of the previous structure-based recognition methods [2, 3, 4] and word HMMs. In the second experiment, we examine the recognition performance of structural representations under vocal tract length (VTL) differences. The experimental results indicate that structural representations are much more robust to VTL changes than HMMs. Moreover, the proposed method is about 60 times faster than the previous ones.

Articulatory Feature Asynchrony Analysis and Compensation in Detection-Based ASR
I-Fan Chen, Hsin-Min Wang; Academia Sinica, Taiwan
Thu-Ses2-P4-9, Time: 13:30
This paper investigates the effects of two types of imperfection, namely detection errors and articulatory feature asynchrony, of the front-end articulatory feature detector on the performance of a detection-based ASR system. Based on a set of variable-controlled experiments, we find that articulatory feature asynchrony is the major issue that should be addressed in detection-based ASR. To this end, we propose several methods to reduce the asynchrony or the effects of asynchrony. The results are quite promising; for example, we can currently achieve 67.67% phone accuracy in the TIMIT free phone recognition task with only 11 binary-valued articulatory features.

CRANDEM: Conditional Random Fields for Word Recognition
Jeremy Morris, Eric Fosler-Lussier; Ohio State University, USA
Thu-Ses2-P4-10, Time: 13:30
To date, the use of Conditional Random Fields (CRFs) in automatic speech recognition has been limited to the tasks of phone classification and phone recognition. In this paper, we present a framework for using CRF models in a word recognition task that extends the well-known Tandem HMM framework to CRFs. We show results that compare favorably to a set of standard baselines, and discuss some of the benefits and potential pitfalls of this method.

HEAR: An Hybrid Episodic-Abstract Speech Recognizer
Sébastien Demange, Dirk Van Compernolle; Katholieke Universiteit Leuven, Belgium
Thu-Ses2-P4-11, Time: 13:30
This paper presents a new architecture for automatic continuous speech recognition called HEAR — Hybrid Episodic-Abstract speech Recognizer. HEAR relies on both parametric speech models (HMMs) and episodic memory. We propose an evaluation on the Wall Street Journal corpus, a standard continuous speech recognition task, and compare the results with a state-of-the-art HMM baseline. HEAR is shown to be a viable and competitive architecture. While HMMs have been studied and optimized for decades, their performance seems to be converging to a limit which is lower than human performance. In contrast, episodic memory modeling for speech recognition, as applied in HEAR, offers the flexibility to enrich the recognizer with information that HMMs lack. This opportunity, as well as future work, is discussed.

Author Index

A

Abad, Alberto . . . . . . . . Abboutabit, N. . . . . . . . . Abdelwahab, Amira . . Abutalebi, H.R. . . . . . . . Acero, Alex . . . . . . . . . . . Ackermann, P. . . . . . . . . Acosta, Jaime C. . . . . . . Adada, Junichi . . . . . . . . Adda-Decker, Martine Adell, Jordi . . . . . . . . . . . Agüero, Pablo Daniel . Aguilar, Lourdes . . . . . . Aho, Eija . . . . . . . . . . . . . . Aimetti, Guillaume . . . Aist, Gregory . . . . . . . . . Ajmera, Jitendra . . . . . . Akagi, Masato . . . . . . . . Akamine, Masami . . . . Akita, Yuya . . . . . . . . . . . Al Bawab, Ziad . . . . . . . Alcázar, José . . . . . . . . . Alfandary, Amir . . . . . . Ali, Saandia . . . . . . . . . . . Alías, Francesc . . . . . . . Alku, Paavo . . . . . . . . . . . Allauzen, Alexandre . Allauzen, Cyril . . . . . . . Almajai, Ibrahim . . . . . Al Moubayed, Samer . Aloni-Lavi, Ruth . . . . . . Alpan, A. . . . . . . . . . . . . . . Altosaar, Toomas . . . . Alwan, Abeer . . . . . . . . . Amano, Shigeaki . . . . . Amano-Kusumoto, A. Ambikairajah, E. . . . . . . Amino, Kanae . . . . . . . . Ananthakrishnan, G. . Andersen, O. . . . . . . . . . . Andersson, J.S. . . . . . . . André, Elisabeth . . . . . . Andreou, Andreas G. . Anguera, Xavier . . . . . . Aradilla, Guillermo . . . Arai, Takayuki . . . . . . . . Arias, Juan Pablo . . . . . Ariki, Yasuo . . . . . . . . . . Ariyaeeinia, A. . . . . . . . . Aronowitz, Hagai . . . . . Attabi, Yazid . . . . . . . . . Atterer, Michaela . . . . . Atwell, Eric . . . . . . . . . . . Aubergé, Véronique . . Avigal, Mireille . . . . . . . Avinash, B. . . . . . . . . . . . . Ayan, Necip Fazil . . . . . Aylett, Matthew P. . . . .
Mon-Ses2-O3-5 Tue-Ses2-P2-8 Tue-Ses3-P2-2 Wed-Ses3-P2-12 Tue-Ses3-P1-12 Tue-Ses1-O1-5 Mon-Ses3-P4-4 Wed-Ses1-O2-4 Thu-Ses1-P2-2 Mon-Ses2-P1-6 Wed-Ses3-O1-3 Mon-Ses2-S1-2 Mon-Ses3-P2-13 Wed-Ses1-O4-5 Wed-Ses4-P4-10 Wed-Ses1-P2-3 Tue-Ses1-O2-6 Tue-Ses2-P2-13 Mon-Ses3-P4-1 Tue-Ses3-P2-7 Wed-Ses1-O4-3 Tue-Ses3-P2-7 Wed-Ses2-P3-6 Mon-Ses2-O3-3 Wed-Ses2-P3-2 Tue-Ses3-P3-9 Mon-Ses2-P2-6 Wed-Ses3-O2-3 Mon-Ses3-P2-8 Tue-Ses3-P1-2 Wed-Ses1-P2-2 Thu-Ses1-O2-5 Tue-Ses2-P3-8 Wed-Ses2-O4-2 Wed-Ses2-O4-6 Tue-Ses3-P3-5 Mon-Ses2-P2-6 Tue-Ses1-S2-8 Tue-Ses1-O2-6 Tue-Ses1-P3-5 Wed-Ses1-P4-13 Thu-Ses2-P4-3 Mon-Ses2-O1-3 Wed-Ses1-O3-6 Thu-Ses1-S1-1 Thu-Ses2-O3-3 Thu-Ses2-P1-1 Mon-Ses3-P1-6 Wed-Ses1-P2-10 Tue-Ses2-P1-11 Wed-Ses2-P2-3 Thu-Ses2-O5-5 Wed-Ses3-P1-9 Tue-Ses3-P2-3 Thu-Ses2-O2-1 Tue-Ses2-P1-8 Mon-Ses3-P2-9 Mon-Ses2-S1-5 Tue-Ses1-P3-4 Wed-Ses3-P2-8 Thu-Ses2-P3-3 Mon-Ses2-O2-6 Tue-Ses1-P1-1 Wed-Ses3-P1-9 Tue-Ses2-P2-4 Mon-Ses2-P4-2 Wed-Ses3-P2-7 Tue-Ses1-P4-11 Mon-Ses2-S1-9 Tue-Ses2-O3-3 Wed-Ses1-P4-14 Tue-Ses1-P3-9 Wed-Ses1-P2-12 Wed-Ses3-O4-5 Wed-Ses2-P2-8 Tue-Ses2-P1-6 Mon-Ses3-O4-5 Wed-Ses2-P3-11 50 94 104 145 103 75 72 113 156 52 138 60 69 115 149 118 76 95 71 105 115 105 134 50 133 108 54 139 68 102 118 152 97 128 129 107 54 87 76 82 124 174 48 114 162 165 168 66 119 93 131 168 143 105 164 92 68 60 82 145 172 49 79 143 94 57 145 85 61 90 124 83 119 141 132 92 65 135 B Bachan, Jolanta . . . . . . . Mon-Ses3-P2-4 67 Bäckström, Tom . . . . . . Thu-Ses1-P1-3 155 Badin, Pierre . . . . . . . . . . Wed-Ses3-O4-3 141 Badino, Leonardo . . . . . Mon-Ses3-P2-9 68 Baghai-Ravary, Ladan Thu-Ses2-O5-3 167 Bagou, Odile . . . . . . . . . . Mon-Ses3-P1-2 66 Bailly, Gérard . . . . . . . . . Mon-Ses3-S1-8 74 Wed-Ses3-O4-3 141 Baker, Brendan . . . . . . . Tue-Ses3-O3-2 100 Balakrishnan, Suhrid . Wed-Ses3-S1-3 150 Balchandran, Rajesh . Mon-Ses2-P4-10 59 Bali, Kalika . . . . . . . . . . . . Mon-Ses3-P2-12 69 Thu-Ses2-P1-10 170 Ban, Sung Min . . . . . . . . Wed-Ses3-O3-1 140 Ban, Vin Shen . . . . . . . . . Mon-Ses2-P1-7 52 Banglore, Srinivas . . . . Wed-Ses1-S1-1 124 Banno, Hideki . . . . . . . . Thu-Ses1-P2-6 157 Bapineedu, G. . . . . . . . . . Tue-Ses2-P1-6 92 Barbosa, Plínio A. . . . . . Tue-Ses2-O2-4 89 Tue-Ses3-S2-3 110 Wed-Ses2-S1-2 137 Thu-Ses2-P1-12 170 Barbot, Nelly . . . . . . . . . . Thu-Ses1-P2-3 157 Bárkányi, Zsuzsanna . Mon-Ses3-P1-10 67 Barker, Jon . . . . . . . . . . . . Mon-Ses2-P1-5 52 Barnard, Etienne . . . . . . Tue-Ses1-P3-12 83 Thu-Ses2-O4-1 166 Thu-Ses2-P3-4 173 Barney, Anna . . . . . . . . . Mon-Ses3-P1-7 66 Tue-Ses1-S2-11 87 Barra-Chicote, R. . . . . . Mon-Ses2-S1-7 61 Bartalis, Mátyás . . . . . . Mon-Ses3-P4-9 73 Bar-Yosef, Yossi . . . . . . Tue-Ses3-O3-3 100 Batliner, Anton . . . . . . . Mon-Ses2-S1-1 60 Mon-Ses3-P4-4 72 Baumann, Timo . . . . . . Tue-Ses2-O3-3 90 Wed-Ses1-P4-14 124 Bayer, Stefan . . . . . . . . . Thu-Ses1-P1-3 155 Beautemps, Denis . . . . Tue-Ses3-P2-2 104 Bechet, Frederic . . . . . . Tue-Ses2-O3-5 90 Thu-Ses1-P4-4 160 Beck, Jeppe . . . . . . . . . . . Tue-Ses3-O4-1 101 Beckman, Mary . . . . . . . Tue-Ses1-O2-2 76 Behne, Dawn . . . . . . . . . . Thu-Ses2-P1-3 168 Belfield, Bill . . . . . . . . . . . Wed-Ses2-O3-6 128 Bell, Peter . . . . . . . . . . . . . Wed-Ses2-P4-12 137 Bellegarda, Jerome R. Tue-Ses1-O4-4 78 Benaroya, Elie-Laurent Mon-Ses3-S1-4 74 Ben-David, Shai . . . . . . . Mon-Ses3-O4-4 65 Benders, Titia . . . . . . . . . Mon-Ses3-O2-6 63 Ben-Harush, Oshry . . . Tue-Ses1-P4-8 85 Beňuš, Štefan . . . . . . . . . Tue-Ses1-P1-11 80 Wed-Ses2-S1-5 138 Ben Youssef, Atef . . . . 
Wed-Ses3-O4-3 141 BenZeghiba, M.F. . . . . . Wed-Ses3-O1-5 138 Berkling, Kay . . . . . . . . . Thu-Ses1-P2-1 156 Besacier, Laurent . . . . . Thu-Ses1-P3-1 158 Thu-Ses2-P3-9 174 Beskow, Jonas . . . . . . . . Mon-Ses2-P4-12 59 Tue-Ses3-P3-5 107 Beyerlein, Peter . . . . . . . Thu-Ses2-P1-8 169 Biadsy, Fadi . . . . . . . . . . . Mon-Ses2-P2-12 55 Bigi, Brigitte . . . . . . . . . . Thu-Ses1-P3-1 158 Bilmes, Jeff . . . . . . . . . . . Tue-Ses1-O1-1 75 Wed-Ses2-O3-1 127 Thu-Ses1-O3-6 154 Thu-Ses2-O4-4 167 Bimbot, Frédéric . . . . . . Thu-Ses2-O3-6 166 Bistritz, Yuval . . . . . . . . Tue-Ses3-O3-3 100 Bitouk, Dmitri . . . . . . . . Wed-Ses2-P2-6 132 Black, Matthew . . . . . . . Wed-Ses2-O2-2 126 Blanco, José Luis . . . . . Tue-Ses3-P3-9 108 Blomberg, Mats . . . . . . . Mon-Ses3-P3-11 71 Tue-Ses3-P2-3 105 Bocklet, Tobias . . . . . . . Wed-Ses2-P2-4 131 Boeffard, Olivier . . . . . . Tue-Ses1-O4-3 78 Thu-Ses1-P2-3 157 Boersma, Paul . . . . . . . . Mon-Ses3-O2-6 63 Mon-Ses3-P1-5 66 Bőhm, Tamás . . . . . . . . . Mon-Ses3-P1-10 67 Boidin, Cédric . . . . . . . . Tue-Ses1-O4-3 78 Wed-Ses2-P3-9 134 Wed-Ses3-S1-6 150 Bolder, Bram . . . . . . . . . . Wed-Ses2-S1-6 138 Bonafonte, Antonio . . Mon-Ses2-O2-5 49 Mon-Ses3-P2-13 69 Wed-Ses1-O4-5 115 Wed-Ses4-P4-10 149 Bonastre, J-F. . . . . . . . . . Mon-Ses2-P2-1 53 Bonneau, Anne . . . . . . . Mon-Ses3-P1-8 66 177 Boonpiam, Vataya . . . . Mon-Ses3-P2-3 67 Borel, Stephanie . . . . . . Tue-Ses1-S2-9 87 Borges, Nash . . . . . . . . . . Thu-Ses1-O3-5 153 Borgstrom, Bengt J. . . Thu-Ses1-S1-1 162 Bořil, Hynek . . . . . . . . . . Tue-Ses2-P4-6 98 Bouchon-Meunier, B. . Wed-Ses3-S1-4 150 Boufaden, Narjès . . . . . Mon-Ses2-S1-9 61 Boula de Mareüil, P. . . Wed-Ses3-O1-3 138 Thu-Ses1-O2-5 152 Boulenger, Véronique Wed-Ses2-O1-5 126 Boulianne, Gilles . . . . . Tue-Ses3-P3-6 107 Thu-Ses2-P4-4 174 Bourlard, Hervé . . . . . . Tue-Ses2-O4-4 91 Wed-Ses3-O3-6 141 Boves, Lou . . . . . . . . . . . . Tue-Ses1-O2-6 76 Wed-Ses1-P4-1 122 Wed-Ses3-O2-1 139 Bozkurt, Baris . . . . . . . . Mon-Ses2-O4-5 51 Bozkurt, Elif . . . . . . . . . . Mon-Ses2-S1-4 60 Braga, Daniela . . . . . . . . Tue-Ses3-O4-1 101 Brandl, Holger . . . . . . . . Wed-Ses2-S1-6 138 Brandschain, Linda . . . Thu-Ses2-O4-6 167 Braunschweiler, N. . . . Wed-Ses2-P3-12 135 Bray, W.P. . . . . . . . . . . . . . Wed-Ses1-P4-3 122 Bresch, Erik . . . . . . . . . . . Tue-Ses1-O2-5 76 Breslin, Catherine . . . . Tue-Ses3-P2-8 105 Bretier, Philippe . . . . . . Wed-Ses3-S1-4 150 Brierley, Claire . . . . . . . . Tue-Ses1-P3-9 83 Brodnik, Andrej . . . . . . Thu-Ses1-O3-1 153 Brown, Guy J. . . . . . . . . . Thu-Ses2-P1-11 170 Brumberg, Jonathan S. Mon-Ses3-S1-3 73 Brümmer, Niko . . . . . . . Wed-Ses1-O1-3 112 Wed-Ses3-O1-4 138 Brungart, Douglas S. . Mon-Ses3-O2-5 63 Buera, L. . . . . . . . . . . . . . . Mon-Ses2-O1-6 48 Tue-Ses2-P4-7 99 Tue-Ses3-P2-9 106 Bugalho, M. . . . . . . . . . . . Tue-Ses2-P2-8 94 Bunnell, H. Timothy . . Wed-Ses1-P2-1 118 Buquet, Julie . . . . . . . . . . Mon-Ses3-P1-8 66 Burget, Lukáš . . . . . . . . . Mon-Ses2-O3-2 50 Mon-Ses2-S1-10 61 Tue-Ses3-O3-1 99 Wed-Ses3-O1-4 138 Wed-Ses3-P2-4 144 Thu-Ses2-P2-1 170 Burkett, David . . . . . . . . Mon-Ses3-O4-5 65 Burkhardt, Felix . . . . . . Thu-Ses1-P2-9 158 Bürki, Audrey . . . . . . . . . Wed-Ses3-P1-1 142 Buß, Okko . . . . . . . . . . . . Tue-Ses2-O3-3 90 Busset, Julie . . . . . . . . . . Mon-Ses2-O2-2 49 Busso, Carlos . . . . . . . . . Mon-Ses2-S1-3 60 Wed-Ses2-P1-6 130 Butko, T. . . . . . . . . . . . . . . Tue-Ses2-P2-7 94 Byrne, William . . . . . . . . 
Mon-Ses3-O3-1 63 C Caballero Morales, O. Cabrera, Joao . . . . . . . . . Cadic, Didier . . . . . . . . . . Caetano, Janine . . . . . . . Cahill, Peter . . . . . . . . . . Cai, Jun . . . . . . . . . . . . . . . Cai, Lianhong . . . . . . . . . Callejas, Zoraida . . . . . Calvo, José R. . . . . . . . . . Camelin, Nathalie . . . . Campbell, Joseph . . . . Campbell, Nick . . . . . . . Campbell, W.M. . . . . . . . Campillo, Francisco . . Canton-Ferrer, C. . . . . . Cao, Houwei . . . . . . . . . . Carayannis, George . . Cardinal, Patrick . . . . . Carenini, Giuseppe . . . Carlson, Rolf . . . . . . . . . Carreira-Perpiñán, MA Carson-Berndsen, J. . . Caruso, Chris . . . . . . . . . Wed-Ses1-O3-1 Mon-Ses3-S1-5 Wed-Ses2-P3-9 Tue-Ses1-S2-11 Tue-Ses3-O4-6 Mon-Ses2-O2-2 Mon-Ses3-O3-4 Wed-Ses1-P3-5 Thu-Ses2-O4-5 Wed-Ses3-P2-5 Thu-Ses1-P4-4 Wed-Ses3-O2-2 Wed-Ses2-S1-3 Mon-Ses2-P2-8 Wed-Ses1-O1-2 Wed-Ses3-O1-6 Wed-Ses3-P2-10 Wed-Ses4-P4-10 Tue-Ses2-P2-7 Thu-Ses2-P3-6 Tue-Ses2-O4-2 Tue-Ses3-P3-6 Thu-Ses2-P4-4 Wed-Ses2-P2-2 Wed-Ses1-P2-7 Wed-Ses1-P4-1 Tue-Ses1-P1-5 Tue-Ses1-P3-13 Tue-Ses3-O4-6 Thu-Ses2-O4-6 113 74 134 87 101 49 64 120 167 144 160 139 137 54 111 139 145 149 94 173 90 107 174 131 119 122 79 83 101 167 Casas, J.R. . . . . . . . . . . . . Tue-Ses2-P2-7 94 Caskey, Sasha . . . . . . . . Mon-Ses3-O4-4 65 Cassidy, Andrew . . . . . Thu-Ses2-P1-8 169 Castaldo, Fabio . . . . . . . Mon-Ses2-P2-4 54 Tue-Ses2-O4-1 90 Castelli, Eric . . . . . . . . . . Wed-Ses3-O4-5 141 Thu-Ses1-P3-1 158 Castillo-Guerra, E. . . . . Tue-Ses1-S2-5 86 Cazi, Nadir . . . . . . . . . . . . Tue-Ses3-P1-7 103 Cecere, Elvio . . . . . . . . . . Tue-Ses3-P4-5 109 Cen, Ling . . . . . . . . . . . . . . Wed-Ses2-P3-8 134 Cerisara, C. . . . . . . . . . . . Wed-Ses1-P4-6 123 Černocký, Jan . . . . . . . . Mon-Ses2-S1-10 61 Tue-Ses3-O3-1 99 Wed-Ses3-P2-4 144 Cerva, Petr . . . . . . . . . . . . Tue-Ses2-O1-6 88 Chan, Arthur . . . . . . . . . Wed-Ses2-O3-6 128 Chan, Paul . . . . . . . . . . . . Wed-Ses2-P3-8 134 Chan, W.-Y. . . . . . . . . . . . Wed-Ses2-O4-4 129 Chang, Hung-An . . . . . . Mon-Ses2-P3-6 56 Chang, Joon-Hyuk . . . . Tue-Ses3-P1-10 103 Thu-Ses1-P1-4 155 Charlier, Malorie . . . . . Wed-Ses1-O4-4 115 Chatterjee, Saikat . . . . Thu-Ses2-P2-11 172 Chaubard, Laura . . . . . . Thu-Ses1-O4-6 154 Chelba, Ciprian . . . . . . . Mon-Ses3-O1-1 61 Chen, Berlin . . . . . . . . . . Tue-Ses3-P4-10 110 Wed-Ses1-P4-12 124 Chen, Chia-Ping . . . . . . Tue-Ses2-O4-3 91 Tue-Ses2-P4-10 99 Chen, Hui . . . . . . . . . . . . . Wed-Ses3-O4-1 141 Chen, I-Fan . . . . . . . . . . . Thu-Ses2-P4-9 175 Chen, Jia-Yu . . . . . . . . . . Mon-Ses2-P3-5 56 Chen, Langzhou . . . . . . Thu-Ses1-P3-3 158 Chen, Nancy F. . . . . . . . Wed-Ses3-O2-2 139 Chen, Sin-Horng . . . . . . Mon-Ses3-P2-5 68 Thu-Ses1-P2-5 157 Chen, Szu-wei . . . . . . . . Tue-Ses2-O2-3 89 Chen, Zhengqing . . . . . Tue-Ses2-P1-1 91 Cheng, Chierh . . . . . . . . Mon-Ses3-P1-3 66 Cheng, Chih-Chieh . . . Tue-Ses1-O1-3 75 Cheng, Shih-Sian . . . . . Tue-Ses2-O4-3 91 Chetty, Girija . . . . . . . . . Tue-Ses2-P2-12 95 Chevelu, Jonathan . . . . Wed-Ses3-S1-6 150 Chiang, Chen-Yu . . . . . Mon-Ses3-P2-5 68 Thu-Ses1-P2-5 157 Chien, Jen-Tzung . . . . . Mon-Ses3-O1-6 62 Tue-Ses3-P2-10 106 Chin, K.K. . . . . . . . . . . . . . Wed-Ses3-P3-7 147 Thu-Ses1-P3-3 158 Ching, P.C. . . . . . . . . . . . . Tue-Ses1-P4-2 84 Thu-Ses2-P3-6 173 Chiou, Sheng-Chiuan . Tue-Ses2-P4-10 99 Chiu, Yu-Hsiang Bosco Mon-Ses2-O1-2 48 Chládková, Kateřina . . Mon-Ses3-P1-5 66 Chng, Eng Siong . . . . . . 
Mon-Ses2-P2-10 55 Thu-Ses2-P2-9 172 Cho, Jeongmi . . . . . . . . . Wed-Ses3-O3-2 140 Cho, Kook . . . . . . . . . . . . Tue-Ses3-P1-13 103 Choi, Eric . . . . . . . . . . . . . Thu-Ses2-O5-5 168 Choi, Jae-Hun . . . . . . . . . Tue-Ses3-P1-10 103 Thu-Ses1-P1-4 155 Chollet, Gérard . . . . . . . Mon-Ses3-S1-4 74 Chonavel, Thierry . . . . Wed-Ses1-O4-2 115 Chong, Jike . . . . . . . . . . . Tue-Ses2-P3-3 96 Chotimongkol, A. . . . . . Mon-Ses3-P2-6 68 Wed-Ses1-P3-12 122 Christensen, Heidi . . . Mon-Ses2-P1-5 52 Chu, Wei . . . . . . . . . . . . . . Thu-Ses2-O3-3 165 Chueh, Chuang-Hua . . Mon-Ses3-O1-6 62 Chung, Hoon . . . . . . . . . Tue-Ses2-O1-1 87 Chung, Hyun-Yeol . . . . Wed-Ses3-P3-3 146 Chung, Minhwa . . . . . . . Wed-Ses1-P1-5 116 Cieri, Christopher . . . . Thu-Ses2-O4-6 167 Clark, Robert A.J. . . . . . Mon-Ses3-P2-9 68 Tue-Ses3-O4-3 101 Claveau, Vincent . . . . . Tue-Ses3-O4-4 101 Clemens, Caroline . . . . Tue-Ses1-P3-10 83 Clements, Mark A. . . . . Wed-Ses2-O4-1 128 Coelho, Luis . . . . . . . . . . Tue-Ses3-O4-1 101 Colás, José . . . . . . . . . . . . Wed-Ses2-P4-10 136 Colby, Glen . . . . . . . . . . . Mon-Ses3-S1-5 74 Cole, Jeffrey . . . . . . . . . . Tue-Ses3-P4-3 108 Cole, Jennifer . . . . . . . . . Thu-Ses1-O2-6 152 Thu-Ses1-S1-0 162 Coleman, John . . . . . . . . Thu-Ses2-O5-3 167 Colibro, Daniele . . . . . . Cooke, Martin . . . . . . . . Cordoba, R. . . . . . . . . . . . Corns, A. . . . . . . . . . . . . . . Cosi, Piero . . . . . . . . . . . . Côté, Nicolas . . . . . . . . . Cowie, Roddy . . . . . . . . . Cox, Stephen J. . . . . . . . Coyle, Eugene . . . . . . . . Crammer, Koby . . . . . . . Cranen, B. . . . . . . . . . . . . . Creer, S.M. . . . . . . . . . . . . Crevier-Buchman, Lise Csapó, Tamás Gábor . Cuayáhuitl, Heriberto Cucchiarini, Catia . . . . Cui, Xiaodong . . . . . . . . Cumani, Sandro . . . . . . Cummins, Fred . . . . . . . Cunningham, S.P. . . . . . Cupples, E.J. . . . . . . . . . . Cutler, Anne . . . . . . . . . . Cutugno, Francesco . . Cvetković, Zoran . . . . . Mon-Ses2-P2-4 Tue-Ses2-P3-9 Tue-Ses1-P1-12 Wed-Ses2-O1-6 Mon-Ses2-P4-8 Tue-Ses1-O2-6 Mon-Ses3-P3-1 Thu-Ses1-O4-1 Wed-Ses1-O2-6 Wed-Ses1-O3-1 Thu-Ses2-P4-5 Wed-Ses2-S1-4 Thu-Ses1-O3-6 Tue-Ses2-P4-2 Tue-Ses3-P3-1 Tue-Ses1-S2-9 Mon-Ses3-P1-10 Mon-Ses2-P4-7 Mon-Ses3-P4-2 Mon-Ses2-P3-8 Mon-Ses2-P2-4 Mon-Ses2-O2-3 Tue-Ses1-P4-9 Tue-Ses3-P3-1 Wed-Ses1-P4-3 Thu-Ses2-P1-2 Mon-Ses3-O2-2 Tue-Ses3-P4-5 Wed-Ses3-P3-4 54 97 80 126 58 76 69 154 113 113 174 137 154 98 106 87 67 58 71 57 54 49 85 106 122 168 63 109 146 D Dahlbäck, Nils . . . . . . . . Thu-Ses2-O1-6 164 Dai, Beiqian . . . . . . . . . . . Tue-Ses3-O3-6 100 Dai, Li-Rong . . . . . . . . . . Mon-Ses3-O3-2 64 Daimo, Katsunori . . . . Tue-Ses1-P1-3 79 Dakka, Wisam . . . . . . . . Tue-Ses3-P4-12 110 d’Alessandro, C. . . . . . . Wed-Ses2-P3-9 134 Dalsgaard, P. . . . . . . . . . . Tue-Ses2-P1-8 92 Damnati, Géraldine . . Tue-Ses1-O4-3 78 Thu-Ses1-P4-4 160 Damper, Robert I. . . . . Wed-Ses2-P2-11 133 Dang, Jianwu . . . . . . . . . Mon-Ses2-O2-1 48 Thu-Ses2-O2-5 165 Dansereau, R.M. . . . . . . Wed-Ses2-O4-4 129 Darch, Jonathan . . . . . . Wed-Ses2-O4-2 128 D’Arcy, Shona . . . . . . . . Tue-Ses1-S1-4 86 Das, Amit . . . . . . . . . . . . . Tue-Ses3-P1-15 104 Dashtbozorg, Behdad Tue-Ses3-P1-12 103 Davel, Marelie . . . . . . . . Thu-Ses2-O4-1 166 Thu-Ses2-O4-2 166 Thu-Ses2-P3-4 173 Davies, Hannah . . . . . . . Wed-Ses3-P1-12 143 Davis, Chris . . . . . . . . . . . Mon-Ses3-O2-2 63 Wed-Ses3-O4-4 141 Davis, Matthew H. . . . . Mon-Ses3-O2-1 62 Dayanidhi, Krishna . . . 
Tue-Ses3-P4-2 108 Dean, Jeffrey . . . . . . . . . . Mon-Ses3-O1-1 61 de Castro, Alberto . . . . Wed-Ses3-P2-6 144 Dehak, Najim . . . . . . . . . Mon-Ses2-S1-9 61 Wed-Ses1-O1-3 112 Dehak, Réda . . . . . . . . . . Mon-Ses2-S1-9 61 Wed-Ses1-O1-3 112 Dehzangi, Omid . . . . . . Thu-Ses2-P2-9 172 de Jong, F.M.G. . . . . . . . Tue-Ses1-P4-10 85 Tue-Ses2-O4-5 91 Wed-Ses2-P2-7 132 Thu-Ses1-O4-4 154 Deléglise, Paul . . . . . . . . Tue-Ses1-O3-1 77 Wed-Ses2-P4-8 136 De Looze, Céline . . . . . Thu-Ses2-P1-7 169 De Luca, Carlo J. . . . . . . Mon-Ses3-S1-5 74 Demange, Sébastien . . Mon-Ses3-P3-6 70 Thu-Ses2-P4-11 175 Demenko, Grażyna . . . Wed-Ses1-P4-4 123 Demirekler, Mübeccel Thu-Ses2-O2-3 164 De Mori, Renato . . . . . . Mon-Ses2-P4-9 58 Thu-Ses1-P4-4 160 Demuynck, Kris . . . . . . Tue-Ses2-P3-2 96 Denby, Bruce . . . . . . . . . Mon-Ses3-S1-4 74 Deng, Li . . . . . . . . . . . . . . . Tue-Ses1-O1-5 75 Deng, Yunbin . . . . . . . . . Mon-Ses3-S1-5 74 den Ouden, Hanny . . . Tue-Ses1-P2-4 81 Despres, Julien . . . . . . . Mon-Ses2-O3-6 50 D’Haro, L.F. . . . . . . . . . . . Mon-Ses2-P4-8 58 D’hoore, Bart . . . . . . . . . Thu-Ses2-P3-2 172 Dickie, Catherine . . . . . Wed-Ses1-P4-2 122 178 Mon-Ses2-P3-7 Thu-Ses1-P3-4 Tue-Ses2-O2-1 Wed-Ses4-P4-13 Wed-Ses4-P4-14 Tue-Ses2-O3-1 Thu-Ses1-P4-12 Mon-Ses3-O3-6 Tue-Ses3-P2-4 Tue-Ses3-P2-5 Wed-Ses2-P4-7 Thu-Ses1-P1-3 Tue-Ses2-O2-5 Thu-Ses2-O2-2 Tue-Ses3-P1-1 Wed-Ses3-O3-5 Thu-Ses1-P1-5 Tue-Ses1-O3-4 Wed-Ses2-P1-2 Mon-Ses2-P2-6 Mon-Ses2-P2-7 Wed-Ses2-P2-8 Tue-Ses1-P3-2 Wed-Ses3-P1-13 Wed-Ses4-P4-2 Thu-Ses1-O2-2 Mon-Ses2-P3-1 Mon-Ses2-P1-4 Wed-Ses2-S1-6 Thu-Ses2-P3-7 Wed-Ses2-P3-8 Wed-Ses4-P4-9 Wed-Ses2-S1-4 Wed-Ses1-O2-6 Wed-Ses1-P4-2 Mon-Ses3-S1-4 Tue-Ses1-O2-6 Wed-Ses1-P2-9 Mon-Ses2-O4-5 Tue-Ses3-P3-10 Wed-Ses1-P3-8 Thu-Ses2-O5-6 Tue-Ses3-O4-6 Mon-Ses3-O3-4 Tue-Ses3-P3-10 Tue-Ses2-P3-2 Wed-Ses1-P3-4 Mon-Ses2-S1-9 Wed-Ses1-O1-3 Thu-Ses2-P4-4 Mon-Ses2-O4-5 Tue-Ses3-P3-10 Wed-Ses1-O4-4 Wed-Ses1-P3-8 Thu-Ses2-O5-6 Wed-Ses1-P3-13 56 158 88 149 149 89 161 64 105 105 136 155 89 164 102 140 155 77 129 54 54 132 82 143 147 152 55 52 138 173 134 148 137 113 122 74 76 119 51 108 121 168 101 64 108 96 120 61 112 174 51 108 115 121 168 122 Edlund, Jens . . . . . . . . . . Mon-Ses2-P4-12 59 77 163 168 154 158 71 136 172 107 51 79 123 164 81 164 93 131 168 60 60 119 65 125 125 142 92 91 60 Diehl, F. . . . . . . . . . . . . . . . Dimitrova, Diana V. . . D’Imperio, Mariapaola Dinarelli, Marco . . . . . . Dines, John . . . . . . . . . . . Disch, Sascha . . . . . . . . . Dittrich, Heleen . . . . . . Divenyi, Pierre . . . . . . . . DiVita, Joseph . . . . . . . . Dixon, Paul R. . . . . . . . . Djamah, Mouloud . . . . Dobrišek, Simon . . . . . . Dobry, Gil . . . . . . . . . . . . . Docio-Fernandez, L. . . Dogil, Grzegorz . . . . . . Dognin, Pierre L. . . . . . . Dole, Marjorie . . . . . . . . Domont, Xavier . . . . . . . Dong, Bin . . . . . . . . . . . . . Dong, Minghui . . . . . . . . Dorn, Amelie . . . . . . . . . Dorran, David . . . . . . . . Douglas-Cowie, Ellen . Draxler, Christoph . . . Dreyfus, Gérard . . . . . . Driesen, Joris . . . . . . . . . Drugman, Thomas . . . Du, Jinhua . . . . . . . . . . . . Duan, Quansheng . . . . Dubuisson, Thomas . . Duchateau, Jacques . . Duckhorn, Frank . . . . . Dumouchel, Pierre . . . Dutoit, Thierry . . . . . . . Dziemianko, Michal . . E Eg, Ragnhild . . . . . . . . . . Egi, Noritsugu . . . . . . . . El-Desoky, Amr . . . . . . . Elenius, Daniel . . . . . . . . El Hannani, Asmaa . . . Elhilali, Mounya . . . . . . el Kaliouby, Rana . . . . . 
Ellis, Dan P.W. . . . . . . . . Enflo, Laura . . . . . . . . . . . Engelbrecht, K-P. . . . . . Engwall, Olov . . . . . . . . . Epps, Julien . . . . . . . . . . . Erdem, A. Tanju . . . . . . Erdem, Çiǧdem Eroǧlu Erickson, Donna . . . . . . Ernestus, Mirjam . . . . . Errity, Andrew . . . . . . . . Erro, Daniel . . . . . . . . . . . Erzin, Engin . . . . . . . . . . . Tue-Ses1-O3-5 Thu-Ses2-O1-2 Thu-Ses2-P1-3 Thu-Ses1-O4-1 Thu-Ses1-P3-5 Mon-Ses3-P3-11 Wed-Ses2-P4-7 Thu-Ses2-P2-10 Tue-Ses3-P3-8 Mon-Ses2-O4-3 Tue-Ses1-P1-7 Wed-Ses1-P4-7 Thu-Ses2-O1-4 Tue-Ses1-P2-6 Thu-Ses2-O2-1 Tue-Ses2-P1-11 Wed-Ses2-P2-3 Thu-Ses2-O5-5 Mon-Ses2-S1-4 Mon-Ses2-S1-4 Wed-Ses1-P2-12 Mon-Ses3-P1-1 Wed-Ses2-O1-1 Wed-Ses2-O1-3 Wed-Ses3-P1-2 Tue-Ses2-P1-7 Tue-Ses2-P1-2 Mon-Ses2-S1-4 Escalante-Ruiz, Rafael Wed-Ses4-P4-6 148 Escudero, David . . . . . . Wed-Ses4-P4-10 149 Espy-Wilson, Carol Y. Thu-Ses1-S1-0 162 Thu-Ses1-S1-1 162 Thu-Ses1-S1-3 162 Estève, Yannick . . . . . . . Wed-Ses2-P4-8 136 Evanini, Keelan . . . . . . . Wed-Ses1-P1-3 116 Wed-Ses3-O2-4 139 Ewender, Thomas . . . . Mon-Ses2-O4-1 50 Mon-Ses3-O1-2 62 Eyben, Florian . . . . . . . . Wed-Ses1-O2-6 113 Thu-Ses1-O1-5 151 Gales, M.J.F. . . . . . . . . . . . Mon-Ses2-P3-7 Galliano, Sylvain . . . . . . Gamboa Rosales, A. . . Gamboa Rosales, H. . . Ganapathy, Sriram . . . F Fagel, Sascha . . . . . . . . . Tue-Ses1-O2-4 76 Faisman, Alexander . . Mon-Ses3-O4-4 65 Fakotakis, Nikos . . . . . . Wed-Ses2-O2-1 126 Fan, Xing . . . . . . . . . . . . . . Tue-Ses1-P4-3 84 Fang, Qiang . . . . . . . . . . . Mon-Ses2-O2-1 48 Fapšo, Michal . . . . . . . . . Wed-Ses3-P2-4 144 Fatema, K. . . . . . . . . . . . . Tue-Ses3-P3-1 106 Faure, Julien . . . . . . . . . . Tue-Ses2-P2-5 94 Favre, Benoit . . . . . . . . . . Tue-Ses2-P3-4 96 Tue-Ses3-P4-8 109 Tue-Ses3-P4-9 109 Thu-Ses1-P4-3 160 Fegyó, Tibor . . . . . . . . . . Thu-Ses1-P3-7 159 Feldes, Stefan . . . . . . . . . Tue-Ses1-P3-10 83 Feng, Junlan . . . . . . . . . . Wed-Ses1-S1-1 124 Fernández, Fernando . Mon-Ses2-S1-7 61 Wed-Ses3-S1-2 150 Fernández, Rafael . . . . Wed-Ses3-P2-5 144 Fernández, Rubén . . . . Tue-Ses3-P3-9 108 Fernandez Astudillo, R Thu-Ses1-O1-1 150 Ferreiros, Javier . . . . . . Wed-Ses3-S1-2 150 Filimonov, Denis . . . . . Thu-Ses1-P3-2 158 Fingscheidt, Tim . . . . . Thu-Ses2-P2-4 171 Fitt, Sue . . . . . . . . . . . . . . . Tue-Ses3-O4-3 101 Flego, F. . . . . . . . . . . . . . . . Tue-Ses2-P4-8 99 Thu-Ses1-O1-3 151 Fohr, D. . . . . . . . . . . . . . . . Wed-Ses1-P4-6 123 Fon, Janice . . . . . . . . . . . . Tue-Ses1-P2-2 81 Forbes-Riley, Kate . . . . Wed-Ses3-S1-1 149 Fosler-Lussier, Eric . . . Tue-Ses1-O2-2 76 Tue-Ses1-P3-6 82 Tue-Ses2-P3-10 97 Wed-Ses1-P1-4 116 Thu-Ses2-P4-10 175 Foster, Kylie . . . . . . . . . . Mon-Ses2-O2-4 49 Fougeron, Cécile . . . . . . Wed-Ses3-P1-1 142 Fousek, Petr . . . . . . . . . . Mon-Ses2-O3-6 50 Mon-Ses2-P3-5 56 Fraile, Rubén . . . . . . . . . Tue-Ses1-S2-7 87 Fraj, Samia . . . . . . . . . . . . Thu-Ses2-P1-4 169 Frankel, Joe . . . . . . . . . . . Wed-Ses2-P4-10 136 Wed-Ses2-P4-11 137 Wed-Ses2-P4-12 137 Frauenfelder, Ulrich H. Wed-Ses3-P1-1 142 Frissora, Michael . . . . . Mon-Ses3-O4-4 65 Frolova, Olga V. . . . . . . Wed-Ses1-P2-11 119 Fu, Qian-Jie . . . . . . . . . . . Wed-Ses1-O4-6 115 Fujie, Shinya . . . . . . . . . . Mon-Ses2-P4-4 58 Fujimoto, Masakiyo . . Tue-Ses2-P4-4 98 Fujinaga, Tsuyoshi . . . Tue-Ses3-P4-4 109 Fujisaki, Hiroya . . . . . . . Wed-Ses4-P4-3 148 Fukuda, Takashi . . . . . . Mon-Ses2-O1-5 48 Funakoshi, Kotaro . . . . Thu-Ses1-P4-8 161 Thu-Ses1-P4-9 161 Furui, Sadaoki . 
Author Index (continued)
Sections F (from Furuya, Ken'ichi) through Y (ending at Ye, Guoli). Each index entry lists the author's session codes (e.g. Mon-Ses3-P4-5, Tue-Ses1-O3-6) together with the page numbers of the corresponding abstracts in this book.
. . . . . . . . . . Wed-Ses3-P2-2 144 Yegnanarayana, B. . . . . Tue-Ses2-P1-5 92 Tue-Ses2-P1-6 92 Wed-Ses1-O2-5 113 Wed-Ses3-P1-5 142 Yeh, Yao-Ming . . . . . . . . Tue-Ses3-P4-10 110 Yi, Youngmin . . . . . . . . . Tue-Ses2-P3-3 96 Yip, Michael C.W. . . . . . Wed-Ses2-O1-2 125 Yoma, Nestor Becerra Tue-Ses2-P2-4 94 Wed-Ses2-O2-5 127 Yoon, Su-Youn . . . . . . . . Wed-Ses2-O2-4 127 Yoshimoto, Masahiko Tue-Ses3-P4-4 109 You, Hong . . . . . . . . . . . . Mon-Ses2-O1-3 48 Young, S. . . . . . . . . . . . . . . Thu-Ses1-P4-5 160 Yousafzai, Jibran . . . . . Wed-Ses3-P3-4 146 Yu, Dong . . . . . . . . . . . . . . Tue-Ses1-O1-5 75 Yu, K. . . . . . . . . . . . . . . . . . Thu-Ses1-P4-5 160 Yu, Kai . . . . . . . . . . . . . . . . Wed-Ses2-O3-3 127 Yu, Tao . . . . . . . . . . . . . . . Tue-Ses3-P1-14 104 Yuan, Jiahong . . . . . . . . Wed-Ses3-O2-5 139 Thu-Ses2-O2-5 165 Z Zahorian, Stephen A. . Tue-Ses2-P1-1 91 Zainkó, Csaba . . . . . . . . Mon-Ses3-P4-9 73 Zbib, Rabih . . . . . . . . . . . Tue-Ses1-O3-2 77 Zdansky, Jindrich . . . . Tue-Ses2-O1-6 88 Zechner, Klaus . . . . . . . Mon-Ses3-P4-5 72 Zeissler, V. . . . . . . . . . . . . Mon-Ses3-P4-4 72 Železný, Miloš . . . . . . . . Thu-Ses2-P1-5 169 Zellers, Margaret . . . . . Wed-Ses4-P4-14 149 Zen, Heiga . . . . . . . . . . . . Wed-Ses1-P3-3 120 Wed-Ses2-P3-12 135 Zhang, Bin . . . . . . . . . . . . Tue-Ses2-O3-4 90 Zhang, Caicai . . . . . . . . . Wed-Ses3-P1-8 143 Zhang, Chi . . . . . . . . . . . . Tue-Ses1-P3-7 82 Zhang, Jianping . . . . . . Wed-Ses2-P1-4 130 Zhang, Le . . . . . . . . . . . . . Wed-Ses2-P4-7 136 Zhang, Qingqing . . . . . . Thu-Ses2-P3-5 173 Zhang, R. . . . . . . . . . . . . . Mon-Ses3-O4-3 65 Zhang, Shi-Xiong . . . . . Tue-Ses3-O3-4 100 Zhao, Qingwei . . . . . . . . Wed-Ses2-P4-6 136 Zhao, Sherry Y. . . . . . . . Thu-Ses2-P2-2 170 Zheng, Jing . . . . . . . . . . . Mon-Ses3-O4-5 65 Wed-Ses2-P4-3 135 Zhou, Bowen . . . . . . . . . . Mon-Ses2-P3-8 57 Zhou, Haolang . . . . . . . . Tue-Ses1-P3-4 82 Zhou, Xi . . . . . . . . . . . . . . . Tue-Ses3-O3-6 100 186 Zhou, Yu . . . . . . . . . . . . . . Wed-Ses2-P1-4 130 Zhu, Donglai . . . . . . . . . . Wed-Ses3-O1-2 138 Zhu, Jie . . . . . . . . . . . . . . . Mon-Ses2-O3-1 49 Zhuang, Xiaodan . . . . . Thu-Ses1-S1-4 162 Žibert, Janez . . . . . . . . . . Thu-Ses1-O3-1 153 Zigel, Yaniv . . . . . . . . . . . Mon-Ses2-P2-6 54 Mon-Ses2-P2-7 54 Wed-Ses2-P2-8 132 Zimmermann, M. . . . . . Tue-Ses1-P3-8 83 Zubizarreta, M.L. . . . . . Tue-Ses1-O2-1 76 Zweig, Geoffrey . . . . . . . 
Venue Floorplan

Brighton Centre rooms: Rainbow Room; Ground Floor - Jones (East Wing 1) & Fallside (East Wing 2), Foyer, West Bar, Main Hall; First Floor - Hewison Hall, Sunrise Room, East Bar, Holmes (East Wing 3) & Ainsworth (East Wing 4); Third Floor - Brighton Centre Suites (BCS).

Interspeech 2009 Programme-at-a-Glance

Sunday, 06 September (Tutorials, Loebner Competition)
Rooms: Jones (East Wing 1), Fallside (East Wing 2), Holmes (East Wing 3), Ainsworth (East Wing 4), Rainbow Room
08:30 Registration for tutorials opens (closes at 14:30)
09:00 ISCA Board Meeting 1 (finish at 17:00) - BCS Room 3
09:15 Morning tutorials, first session (in parallel): T-1: Analysis by Synthesis of Speech Prosody, from Data to Models; T-2: Dealing with High Dimensional Data with Dimensionality Reduction; T-3: Language and Dialect Recognition; T-4: Emerging Technologies for Silent Speech Interfaces
10:45 Coffee break
11:15 Morning tutorials, second session
12:45 Lunch
14:00 General registration opens (closes at 18:00)
14:15 Afternoon tutorials, first session (in parallel): T-5: In-Vehicle Speech Processing & Analysis; T-6: Emotion Recognition in the Next Generation: an Overview and Recent Development; T-7: Fundamentals and Recent Advances in HMM-based Speech Synthesis; T-8: Statistical Approaches to Dialogue Systems
15:45 Tea break
16:15 Afternoon tutorials, second session
18:00 Elsevier Thank You Reception for Former Computer Speech and Language Editors (finish at 19:30) - BCS Room 1
Rainbow Room: Loebner Competition - the first Interspeech conversational systems challenge

Monday, 07 September
Rooms: Jones (East Wing 1), Main Hall, Fallside (East Wing 2), Holmes (East Wing 3) - oral sessions; Hewison Hall - poster sessions; Ainsworth (East Wing 4) - special sessions
09:00 Arrival and Registration
10:00 Opening Ceremony - Main Hall
11:00 Keynote: Sadaoki Furui (ISCA Medallist), "Selected Topics from 40 years of Research on Speech and Speaker Recognition" - Main Hall
12:00 Lunch; IAC (Advisory Council) Meeting - BCS Room 3
13:30 Parallel sessions, including ASR: Features for Noise Robustness
15:30 Tea Break
16:00 Parallel sessions, including ASR: Language Models I
19:30 Welcome Reception - Brighton Dome
Other Monday afternoon sessions (oral, poster and special): Production: Articulatory Modelling; Systems for LVCSR and Rich Transcription; Accent and Language Recognition; Speech Analysis and Processing I; Speech Perception I; ASR: Acoustic Model Training and Combination; Spoken Dialogue Systems; Phoneme-level Perception; Statistical Parametric Synthesis I; Systems for Spoken Language Translation; Human Speech Production I; ASR: Adaptation I; Prosody, Text Analysis, and Multilingual Models; Applications in Learning and Other Areas; Emotion Challenge; Silent Speech Interfaces

Tuesday, 08 September
08:30 Keynote: Tom Griffiths, "Connecting Human and Machine Learning via Probabilistic Models of Cognition" - Main Hall
09:30 Coffee Break
10:00 Parallel sessions, including ASR: Discriminative Training
12:00 Lunch; Elsevier Editorial Board Meeting for Computer Speech and Language - BCS Room 1; Special Interest Group Meeting - BCS Room 3
13:30 Standardising assessments for voice and speech pathology (finish at 14:30) - BCS Room 3
13:30 Parallel sessions, including Automotive and Mobile Applications
15:30 Tea Break
16:00 Parallel sessions, including Panel: Speech & Intelligence
18:15 ISCA General Assembly - Main Hall
19:30 Reviewers' Reception - Brighton Pavilion; Student Reception - Al Duomo Restaurant
Other Tuesday sessions (oral, poster and special): Language Acquisition; ASR: Lexical and Prosodic Models; Unit-Selection Synthesis; Human Speech Production II; Speech and Audio Segmentation and Classification; Speech Perception II; Advanced Voice Function Assessment; Speaker Recognition and Diarisation; ASR: Spoken Language Understanding; Prosody: Production I; Speaker Diarisation; Speech Processing with Audio or Audiovisual Input; Speech Analysis and Processing II; ASR: Decoding and Confidence Measures; Robust ASR I; ISCA Student Advisory Committee; Speaker Verification & Identification I; Text Processing for Spoken Language Generation; Single- and Multi-Channel Speech Enhancement; ASR: Acoustic Modelling; Assistive Speech Technology; Topics in Spoken Language Processing; Measuring the Rhythm of Speech; Prosody Perception and Language Acquisition; Statistical Parametric Synthesis II; Resources, Annotation and Evaluation; Lessons and Challenges Deploying Voice Search; Speech Synthesis Methods; LVCSR Systems and Spoken Term Detection; Active Listening & Synchrony

Wednesday, 09 September
08:30 Keynote: Deb Roy, "New Horizons in the Study of Language Development" - Main Hall
09:30 Coffee Break
10:00 Parallel sessions, including Speaker Verification & Identification II and Emotion and Expression I
12:00 Lunch; Interspeech Steering Committee - BCS Room 1; Elsevier Editorial Board Meeting for Speech Communication - BCS Room 3
13:30 Parallel sessions, including Word-level Perception
15:30 Tea Break
16:00 Parallel sessions, including Language Recognition
19:30 Revelry at the Racecourse
Other Wednesday sessions (oral, poster and special): ASR: Adaptation II; Voice Transformation I; Phonetics, Phonology, Cross-Language Comparisons, Pathology; Applications in Education and Learning; ASR: New Paradigms I; Single-Channel Speech Enhancement; Emotion and Expression II; Expression, Emotion and Personality Recognition; Phonetics & Phonology; Speech Activity Detection; Multimodal Speech (e.g. Audiovisual Speech, Gesture); Phonetics; Speaker Verification & Identification III; Robust ASR II; Machine Learning for Adaptivity in Dialogue Systems; Prosody: Production II; Voice Transformation II; Systems for Spoken Language Understanding; New Approaches to Modelling Variability for ASR

Thursday, 10 September
08:30 Keynote: Mari Ostendorf, "Transcribing Speech for Spoken Language Processing" - Main Hall
09:30 Coffee Break
10:00 Parallel sessions, including Robust ASR III
12:00 Lunch; Industrial Lunch - BCS Room 1
13:30 Parallel sessions, including User Interactions in Spoken Dialog Systems
15:30 Tea Break
16:00 Closing Ceremony - Main Hall
Other Thursday sessions (oral, poster and special): Prosody: Perception; Production: Articulation and Acoustics; Segmentation and Classification; Evaluation & Standardisation of SL Technology & Systems; Features for Speech and Speaker Recognition; Speech and Multimodal Resources & Annotation; Speech Coding; ASR: Language Models II; Speaker & Speech Variability, Paralinguistic & Nonlinguistic Cues; ASR: Tonal Language, Cross-Lingual and Multilingual ASR; ASR: Acoustic Model Features; ASR: New Paradigms II; Speech Analysis and Processing III