NIST 2004 Speech Recognition Results
Transcription
NIST 2004 Speech Recognition Results
Audrey Le for the NIST Gang RT-04F Workshop November 7-10, 2004 Palisades, NY STT Task Test Sets Scoring Procedures STT Participants Evaluation Results Multiple Applications Metadata Extraction (MDE) Speech-to-Text (STT) Human-human speech Words + Metadata Rich transcript Readable transcript UL Broadcasts 20x Conversations Metadata Extraction (MDE) 10x 1x Speech-to-Text (STT) Human-human speech English Get the words right! Words, Rich times, & Transcript confidences English Arabic Chinese Broadcast News Broadcast News Broadcast News Test Set Test Set Test Set English Arabic Chinese Conversational Conversational Conversational Telephone Telephone Telephone Speech Test Set Speech Test Set Speech Test Set English BN Test Set Arabic BN Test Set Chinese BN Test Set English CTS Test Set Arabic CTS Test Set Chinese CTS Test Set English EARS 2003 Collection 12 shows: 30 minute excerpts Consist of network news and local news shows Broadcast from the US for a US audience Cover international news, domestic and local US news English BN Test Set Arabic BN Test Set Chinese BN Test Set English CTS Test Set Arabic CTS Test Set Chinese CTS Test Set Arabic EARS 2003 Collection 3 shows: 20 minute excerpts Dubai TV, Al-Jazeera (2 shows, different dates) Target Middle East and world-wide audience Cover international and Middle East related news Main dialect - Modern Standard Arabic Format similar to US network news Mandarin EARS 2004 Collection 3 shows: 20 minute excerpts English BN Test Set Arabic BN Test Set Chinese BN Test Set English CTS Test Set Arabic CTS Test Set Chinese CTS Test Set China Central Television (CCTV-4) Financial and Economic Reports Broadcasts from China for overseas Chinese Covers international and Chinese financial and economic news Radio Free Asia (RFA) Asia-Pacific Headline News and Asia-Pacific Reports Broadcasts from the US for Chinese who live in China Covers news about China and China-related international news New Tang Dynasty TV (NTDTV) Hourly News Broadcasts from the US for a Chinese audience Covers international news Main dialect - mainland China standard Mandarin Format similar to US network news English BN Test Set Arabic BN Test Set Chinese BN Test Set English CTS Test Set Arabic CTS Test Set Chinese CTS Test Set English Fisher Collection 36 conversations: 72 speakers – 5 minute excerpts Equal number of male and female speakers All speakers are native English speakers living in the US Speakers came from four US dialect regions North, South, Midland, West Numbers of standard, cordless, and cellular phones used are ~ equally distributed All conversations were recorded in the US Topics from current events and social issues English BN Test Set Arabic BN Test Set Chinese BN Test Set English CTS Test Set Arabic CTS Test Set Chinese CTS Test Set Levantine Dialect Fisher Collection 12 conversations: 24 speakers – 5 minute excerpts ~ Equal number of male and female speakers All speakers are native Arabic speakers with majority living in the Middle East and a few in the US Majority of the speakers was raised in Jordan Majority of the phones used was landlines All conversations were recorded in the US Topics from current events and social issues English BN Test Set Arabic BN Test Set Chinese BN Test Set English CTS Test Set Arabic CTS Test Set Chinese CTS Test Set Hong Kong Univ. of Science & Tech. Collection 12 conversations: 24 speakers – 5 minute excerpts ~ Equal number of male and female speakers All speakers are native Mandarin speakers and live in China All speakers are in their twenties All phones used were landlines All conversations were recorded in China Topics from current events and social issues Reference data SCLITE Filtering GLM Processing Alignment Error Calculation System output WER = (# Deletions + # Insertions + # Substitutions) # Reference Words Transcription of colloquial Arabic is a challenge Modern Standard Arabic (MSA) orthographic conventions do not easily apply to dialectal Arabic speech New transcription conventions and methods were devised by LDC with community input New Arabic text normalization steps Word-initial alif+hamza normalization Rule: All word-initial alif+hamza characters are translated to bare alif The word-initial “alif” can be written omitting a “hamza” or with any of three vocalic variants Neither form changes the meaning and transcribers rarely agree Unresolved Issue: normalization was not applied to inflected forms Tanween character removal Rule: All tanween characters removed The three vowel diacritics, fathah tanween, kasrah tanween, and dhammah tanween are inconsistently transcribed Omitting or including the vowel does not affect the word’s meaning Pause filler characters List of pause filler characters expanded to 5 as an allowance for CTS training data Sometimes they occur as non-pause filler characters in broadcast news No lexical normalizations Character Error Rate (CER) No “word” analog in the native orthography Ordinary words Scored as is without case Speaker-attributed noises (e.g., cough, laugh) Stripped out Non-vocal noises (e.g., door slam) Stripped out Word fragments Counted as optionally deletable Counted as correct if substring of system output matches Uncertain or foreign words Counted as optionally deletable Pause-fillers Counted as optionally deletable Contractions Expanded to the most likely correct form Ordinary words Scored as is without case Speaker-attributed noises (e.g., cough, laugh) Stripped out Non-vocal noises (e.g., door slam) Stripped out Word fragments Stripped out Uncertain or foreign words (as marked by the system) Stripped out Pause-fillers Stripped out if identified as pause-fillers Transformed to a generic pause-fillers if not identified as such Allowed equivalent system output to be scored as correct Contained equivalence rules to transform data to a canonical form Rules to expand contractions (only applied to the system output) she’s => she is, she has, she was Rules to split hyphenated words processing-speed task => processing speed task Rules to split compounds with ambiguous representations servicemen => service men halftime => would not be split Rules to map backchannels and hesitations uh-huh => %BCACK Rules to allow legitimate alternative spellings Mr. => Mister REF: they want to give you (e-) give them all the things you never got ** %h HYP: *** going to give you ERR: D S give them all *** day D S you never got to %bc I Summary: two deletions, two substitutions, one insertion BBN BBN+LIMSI Cambridge Univ. IBM IBM+SRI ISL LIMSI SRI SRI + Univ. of Washington Super EARS (Collaborative effort involving BBN+CU+LIMSI+SRI) Primary systems ordered by increasing WER 100 10x No 1x significant WER (%) difference 15.3 14.6 10.7 11.6 10 1 ears cu bbn+limsi bbn sri best rt03 RT-04 dev set is different from RT-03 eval set Comparable subset of RT-04 with RT-03 Better sampling of broadcast news shows LDC selections News From CNN (anchor: Wolf Blitzer) PBS News Hour (anchor: Gwen Ifill) CNN Headline News (anchors: Steven Fraser, Sophia Choi) CNBC The News (anchor: Tom Costello) ABC6 WPVI Action News (12/17) CSPAN Book TV: Texas Book Festival NIST selections CNN Daybreak (anchor: Carol Costello) CNBC The News (anchor: Brian Williams) ABC World News Tonight (anchor: Peter Jennings) PBS BBC News (anchor: Alistair Yates) ABC6 WPVI Action News (12/03) WB17 News WPHL RT-03 Like Subset of RT-04 Shows Characteristics RT-03 Shows ABC World News Tonight Same show, different dates ABC World News Tonight CNBC The News (anchor: Tom Costello) Typical network news with both having male speakers speaking most of the time NBC Nightly News (anchor: Tom Brokaw) PBS News Hour Similar types of news content and speaking style VOA News Now CNN Daybreak Similar anchor styles PRI The World CNBC The News (anchor: Brian Williams) Both anchored by Brian Williams MSNBC News (anchor: Brian Williams) CNN Headline News Similar types of news contents, different dates CNN Headline News Hypothesis: A robust system should have little variation in WER for different sources Observation: Dramatic sensitivity to test sets / subsets Implication: Current technology is not adequately robust? Shows ordered by RT-03 Like Subset WER Primary English BN 10X Systems 25 WER (%) 20 RT-04F (RT-03 Like Subset) bbn+limsi cu ears sri RT-04F (Others) More difficult 15 10 5 0 Easy for all systems ABC World News Tonight CNBC Tom Costello PBS News CNN - Carol Hour Costello Wider spectrum of BN shows is more difficult RT-03 like shows are easier CNBC Brian Williams CNN Headline News WPHL CNN - Wolf WB17 News Blitzer WPVI Action News CSPAN Texas Book Festival WPVI Action News PBS BBC News Comparison with RT-03 English BN Shows Shows ordered by RT-03 Like Subset WER Primary English BN 10X Systems 25 WER (%) 20 15 RT-04F (RT-03 Like Subset) RT-03 Test Set bbn+limsi bbn limsi cu ears sri Similar but not identical trend 10 Easier than other Brian Williams show 5 0 ABC World New s Tonight CNBC Tom Costello PBS New s Hour CNN Carol Costello CNBC Brian William s CNN Headline New s ABC World New s Tonight NBC Nightly New s VOA New s Now PRI The World MSNBC Brian William s CNN Headline New s Primary systems ordered by increasing WER WER (%) 100 10 10x full set 1x full set 11.6 9.1 10x rt-03 like subset 1x rt-03 like subset There is improvement! 15.3 12.1 14.6 10.7 1 ears cu bbn+limsi bbn sri best rt03 Primary systems ordered by increasing WER Chinese BN Arabic BN 100 100 10x ul 10x WER (%) 26.3 CER (%) 18.5 10 17 22.4 30.8 19.1 isl best rt03 10 1 1 limsi bbn best rt03 bbn All differences at a given speed are statistically significant Primary systems ordered by increasing WER 100 ul 20x 10x 1x No significant difference 29 20.4 22 WER (%) 19 14.9 15.2 10 1 ibm +sri ibm bbn+lim si cu bbn sri best rt03 fs h_ 1 fs 19 h_ 3 1 1 fs 10 9_a h_ 3 1 4 fs 19 7_a h_ 2 1 5 fs 14 9_a h_ 2 1 4 fs 18 6_a h_ 9 1 9 fs 17 9_a h_ 9 1 1 fs 12 8_a h_ 0 1 4 fs 18 5_b h_ 6 1 9 fs 14 8_a h_ 6 1 5 fs 19 1_a h_ 2 1 5 fs 10 9_b h_ 8 1 6 fs 13 3_a h_ 4 1 0 fs 11 5_a h_ 9 1 0 fs 10 5_b h_ 3 1 4 fs 12 7_b h_ 8 1 2 fs 14 5_b h_ 2 1 4 fs 14 6_b h_ 8 1 4 fs 19 9_a h_ 3 1 7 fs 15 9_a h_ 1 1 3 fs 09 8_a h_ 8 1 4 fs 15 1_a h_ 1 1 5 fs 10 3_b h_ 8 1 6 fs 11 3_b h_ 1 1 0 fs 14 6_a h_ 4 1 2 fs 15 9_a h_ 1 1 3 fs 17 8_b h_ 9 1 1 fs 11 8_b h_ 5 1 8 fs 18 5_a h_ 1 1 7 fs 19 7_a h_ 1 1 9 fs 19 9_b h_ 3 1 1 fs 15 9_b h_ 1 1 5 fs 17 3_a h_ 3 1 1 fs 11 6_a h_ 9 1 0 fs 18 5_a h_ 4 1 1 fs 15 9_a h_ 2 11 31 38 _b 06 _a WER (%) 50 Speakers ordered by ascending absolute WER difference Primary English CTS 20X Systems bbn+limsi First maximum, difficult for all systems 45 40 35 cu ibm ibm+sri sri 30 25 20 15 10 5 0 First minimum, easy for all systems Systems don’t do well on rapid and disfluent speech Different system outputs Primary systems ordered by increasing WER Arabic CTS 100 20x 100 37.5 10 ul 20x 10x 29.3 CER (%) WER (%) 44.1 ul Chinese CTS 42.7 29.7 10 1 1 bbn sri+uw Unfair comparison, different dialects best rt03 bbn cu sri+uw The 3 20X systems are not different significantly best rt03 BN and CTS at Different Speeds WER (%) 100 10 Eng BN 1x Eng BN 10x Eng BN >10x Eng CTS 1x 1 Eng CTS 10x RT-02 RT-03 RT-04 RT-04 RT-03 Like Subset Phil Woodland The LDC Our Arabic tutors Mohamed Maamouri John Makhoul The rest of the NIST Gang!!!