NIST 2004 Speech Recognition Results

Audrey Le for the NIST Gang
RT-04F Workshop
November 7-10, 2004
Palisades, NY
STT Task
Test Sets
Scoring Procedures
STT Participants
Evaluation Results
[Diagram: Human-human speech -> Speech-to-Text (STT) -> Metadata Extraction (MDE) -> rich transcript (words + metadata) -> readable transcript -> multiple applications]
[Diagram: STT task. Input: human-human speech (broadcasts, conversations). Speed labels shown: UL, 20x, 10x, 1x. STT output: words, times, and confidences, passed on to Metadata Extraction (MDE) to build the rich transcript. Goal: get the words right!]
Test sets by language and domain:
English: Broadcast News (BN) Test Set, Conversational Telephone Speech (CTS) Test Set
Arabic: Broadcast News (BN) Test Set, Conversational Telephone Speech (CTS) Test Set
Chinese: Broadcast News (BN) Test Set, Conversational Telephone Speech (CTS) Test Set
English EARS 2003 Collection
12 shows: 30 minute excerpts
Consist of network news and local news shows
Broadcast from the US for a US audience
Cover international news as well as domestic and local US news
Arabic EARS 2003 Collection
3 shows: 20 minute excerpts
Dubai TV, Al-Jazeera (2 shows, different dates)
Target Middle East and world-wide audience
Cover international and Middle East related news
Main dialect - Modern Standard Arabic
Format similar to US network news
Mandarin EARS 2004 Collection
3 shows: 20 minute excerpts
China Central Television (CCTV-4) Financial and Economic Reports
Broadcasts from China for overseas Chinese
Covers international and Chinese financial and economic news
Radio Free Asia (RFA) Asia-Pacific Headline News and Asia-Pacific Reports
Broadcasts from the US for Chinese who live in China
Covers news about China and China-related international news
New Tang Dynasty TV (NTDTV) Hourly News
Broadcasts from the US for a Chinese audience
Covers international news
Main dialect - mainland China standard Mandarin
Format similar to US network news
English Fisher Collection
36 conversations: 72 speakers, 5-minute excerpts
Equal number of male and female speakers
All speakers are native English speakers living in the US
Speakers came from four US dialect regions: North, South, Midland, West
Standard, cordless, and cellular phones were used in approximately equal numbers
All conversations were recorded in the US
Topics from current events and social issues
Levantine Dialect Fisher Collection
12 conversations: 24 speakers, 5-minute excerpts
Approximately equal number of male and female speakers
All speakers are native Arabic speakers, with the majority living in the Middle East and a few in the US
The majority of the speakers were raised in Jordan
The majority of the phones used were landlines
All conversations were recorded in the US
Topics from current events and social issues
Hong Kong Univ. of Science & Tech. Collection
12 conversations: 24 speakers, 5-minute excerpts
Approximately equal number of male and female speakers
All speakers are native Mandarin speakers and live in China
All speakers are in their twenties
All phones used were landlines
All conversations were recorded in China
Topics from current events and social issues
[Diagram: scoring pipeline. Reference data and system output pass through filtering and GLM processing, followed by alignment and error calculation (SCLITE).]
WER = (# Deletions + # Insertions + # Substitutions) / # Reference Words
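The WER definition above can be illustrated with a minimal Python sketch. It assumes uniform edit costs and plain whitespace-tokenized word lists; the function names `count_errors` and `wer` are illustrative only, and the alignment SCLITE actually produces (with its own cost weights and optionally deletable tokens) may differ.

```python
from typing import List, Tuple


def count_errors(ref: List[str], hyp: List[str]) -> Tuple[int, int, int]:
    """Return (substitutions, deletions, insertions) for one minimum-edit alignment.

    Assumption: every error type costs 1; this is not SCLITE's weighting.
    """
    rows, cols = len(ref) + 1, len(hyp) + 1
    # Each cell holds (total_errors, subs, dels, ins) for ref[:i] vs hyp[:j].
    cost = [[(0, 0, 0, 0)] * cols for _ in range(rows)]
    for i in range(1, rows):
        cost[i][0] = (i, 0, i, 0)  # only deletions
    for j in range(1, cols):
        cost[0][j] = (j, 0, 0, j)  # only insertions
    for i in range(1, rows):
        for j in range(1, cols):
            t, s, d, n = cost[i - 1][j - 1]
            if ref[i - 1] == hyp[j - 1]:
                diag = (t, s, d, n)          # correct word
            else:
                diag = (t + 1, s + 1, d, n)  # substitution
            t, s, d, n = cost[i - 1][j]
            deletion = (t + 1, s, d + 1, n)
            t, s, d, n = cost[i][j - 1]
            insertion = (t + 1, s, d, n + 1)
            cost[i][j] = min(diag, deletion, insertion)
    _, subs, dels, inss = cost[-1][-1]
    return subs, dels, inss


def wer(ref: List[str], hyp: List[str]) -> float:
    """Word error rate in percent, relative to the number of reference words."""
    subs, dels, inss = count_errors(ref, hyp)
    return 100.0 * (subs + dels + inss) / len(ref)
```

For instance, `wer("a b c d".split(), "a x c".split())` counts one substitution and one deletion against four reference words.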
Transcription of colloquial Arabic is a challenge
Modern Standard Arabic (MSA) orthographic conventions do not easily apply to dialectal Arabic speech
New transcription conventions and methods were devised by LDC with community input
New Arabic text normalization steps
Word-initial alif+hamza normalization
Rule: All word-initial alif+hamza characters are translated to bare alif
The word-initial "alif" can be written without a "hamza" or with any of three vocalic variants
No form changes the meaning, and transcribers rarely agree on which to use
Unresolved issue: normalization was not applied to inflected forms
Tanween character removal
Rule: All tanween characters are removed
The three vowel diacritics (fathah tanween, kasrah tanween, and dhammah tanween) are inconsistently transcribed
Omitting or including the vowel does not affect the word's meaning
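The two normalization rules above can be sketched in a few lines of Python. This is an illustration only: it assumes Unicode Arabic text (the evaluation tooling may have operated on a transliteration), and the choice of which code points count as hamza-bearing alif variants is an assumption, as are the function names.

```python
BARE_ALIF = "\u0627"  # ARABIC LETTER ALEF

# Assumed set of word-initial alif variants to collapse:
# alef with hamza above, alef with hamza below, alef with madda above.
ALIF_VARIANTS = "\u0623\u0625\u0622"

# Tanween diacritics: fathatan, dammatan, kasratan.
TANWEEN = "\u064B\u064C\u064D"
_STRIP_TANWEEN = {ord(c): None for c in TANWEEN}


def normalize_arabic_word(word: str) -> str:
    """Remove tanween everywhere and map a word-initial alif variant to bare alif."""
    word = word.translate(_STRIP_TANWEEN)
    if word and word[0] in ALIF_VARIANTS:
        word = BARE_ALIF + word[1:]
    return word


def normalize_arabic(text: str) -> str:
    """Apply the word-level normalization to whitespace-separated text."""
    return " ".join(normalize_arabic_word(w) for w in text.split())
```

Only the first character of each word is rewritten, so a hamza-bearing alif in the middle of a word is left untouched, matching the "word-initial" wording of the rule.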
Pause filler characters
List of pause filler characters expanded to 5 as an allowance for CTS training data
Sometimes they occur as non-pause-filler characters in broadcast news
No lexical normalizations
Character Error Rate (CER)
No “word” analog in the native orthography
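Because there is no word unit to align on, the error rate is computed over characters. A minimal sketch, assuming uniform edit costs and that whitespace is simply ignored; `edit_distance` and `cer` are illustrative names, not the evaluation's tooling:

```python
from typing import Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance between two token sequences (unit costs)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # match or substitution
        prev = curr
    return prev[-1]


def cer(ref_text: str, hyp_text: str) -> float:
    """Character error rate in percent; each non-space character is one token."""
    ref_chars = [c for c in ref_text if not c.isspace()]
    hyp_chars = [c for c in hyp_text if not c.isspace()]
    return 100.0 * edit_distance(ref_chars, hyp_chars) / len(ref_chars)
```

Dropping whitespace before alignment means that differences in segmentation between the reference and the system output do not count as errors.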
Ordinary words
Scored as is without case
Speaker-attributed noises (e.g., cough, laugh)
Stripped out
Non-vocal noises (e.g., door slam)
Stripped out
Word fragments
Counted as optionally deletable
Counted as correct if the fragment matches a substring of the system output word
Uncertain or foreign words
Counted as optionally deletable
Pause-fillers
Counted as optionally deletable
Contractions
Expanded to the most likely correct form
Ordinary words
Scored as is without case
Speaker-attributed noises (e.g., cough, laugh)
Stripped out
Non-vocal noises (e.g., door slam)
Stripped out
Word fragments
Stripped out
Uncertain or foreign words (as marked by the system)
Stripped out
Pause-fillers
Stripped out if identified as pause-fillers
Transformed to a generic pause-filler if not identified as such
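The filtering rules in the list above (the one that refers to tokens "as marked by the system") might be sketched as follows. The token representation is hypothetical: each token is a `(word, kind)` pair, and the kind labels, the filler word list, and the generic filler symbol are all assumptions made for illustration.

```python
from typing import List, Tuple

# Hypothetical values, not taken from the evaluation specification.
FILLER_WORDS = {"uh", "um", "er", "ah"}
GENERIC_FILLER = "%hesitation"
STRIPPED_KINDS = {"noise", "frag", "uncertain", "filler"}


def filter_system_output(tokens: List[Tuple[str, str]]) -> List[str]:
    """Apply the stripping rules to a list of (word, kind) tokens.

    kind is one of "lex", "noise", "frag", "uncertain", "filler" (assumed labels).
    """
    kept = []
    for word, kind in tokens:
        if kind in STRIPPED_KINDS:
            continue  # noises, fragments, uncertain words, tagged fillers
        word = word.lower()  # ordinary words are scored without case
        if word in FILLER_WORDS:
            # Filler that the system did not tag: map to a generic symbol.
            word = GENERIC_FILLER
        kept.append(word)
    return kept
```

A filler tagged as such disappears, while the same word emitted as an ordinary lexical token survives as the generic symbol.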
Allowed equivalent system output to be scored as correct
Contained equivalence rules to transform data to a canonical form
Rules to expand contractions (only applied to the system output)
she’s => she is, she has, she was
Rules to split hyphenated words
processing-speed task => processing speed task
Rules to split compounds with ambiguous representations
servicemen => service men
halftime => would not be split
Rules to map backchannels and hesitations
uh-huh => %BCACK
Rules to allow legitimate alternative spellings
Mr. => Mister
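The rule types above can be illustrated with a small Python sketch. The rule table holds only the examples from the list, the data structure and function names are hypothetical, and the way alternatives are represented here (one candidate word sequence per combination) is an assumption rather than the GLM file format.

```python
from itertools import product
from typing import Dict, List

# Each key maps to one or more alternative expansions (examples from the list above).
GLM_RULES: Dict[str, List[List[str]]] = {
    "she's": [["she", "is"], ["she", "has"], ["she", "was"]],  # contraction
    "servicemen": [["service", "men"]],                        # ambiguous compound
    "uh-huh": [["%BCACK"]],                                    # backchannel
    "mr.": [["mister"]],                                       # alternative spelling
}


def expand_token(token: str) -> List[List[str]]:
    """Return the alternative canonical forms of one system-output token."""
    key = token.lower()
    if key in GLM_RULES:
        return GLM_RULES[key]
    if "-" in key:
        # Split hyphenated words, e.g. processing-speed => processing speed.
        return [[part for part in key.split("-") if part]]
    return [[key]]  # no rule applies (e.g. "halftime" is not split)


def expand_hypothesis(tokens: List[str]) -> List[List[str]]:
    """Return every candidate word sequence produced by the alternatives."""
    per_token = [expand_token(t) for t in tokens]
    return [[w for part in combo for w in part] for combo in product(*per_token)]
```

The explicit table is consulted before the hyphen rule so that `uh-huh` maps to `%BCACK` instead of being split. A scorer could then take the candidate with the fewest errors against the reference, so that any equivalent form is scored as correct.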
REF: they want  to give you (e-) give them all the things you never got ** %h
HYP: ***  going to give you      give them all *** day    you never got to %bc
ERR: D    S                                    D   S                    I
Summary: two deletions, two substitutions, one insertion
BBN
BBN+LIMSI
Cambridge Univ.
IBM
IBM+SRI
ISL
LIMSI
SRI
SRI + Univ. of Washington
Super EARS
(Collaborative effort involving BBN+CU+LIMSI+SRI)
[Figure: Primary systems ordered by increasing WER. Log-scale WER (%) axis; 10x and 1x conditions. Systems: ears, cu, bbn+limsi, bbn, sri, best rt03. Labeled values: 15.3, 14.6, 10.7, 11.6. Annotation: No significant difference.]
RT-04 dev set is different from RT-03 eval set
Comparable subset of RT-04 with RT-03
Better sampling of broadcast news shows
LDC selections
News From CNN (anchor: Wolf Blitzer)
PBS News Hour (anchor: Gwen Ifill)
CNN Headline News (anchors: Steven Fraser, Sophia Choi)
CNBC The News (anchor: Tom Costello)
ABC6 WPVI Action News (12/17)
CSPAN Book TV: Texas Book Festival
NIST selections
CNN Daybreak (anchor: Carol Costello)
CNBC The News (anchor: Brian Williams)
ABC World News Tonight (anchor: Peter Jennings)
PBS BBC News (anchor: Alistair Yates)
ABC6 WPVI Action News (12/03)
WB17 News WPHL
RT-03 Like Subset of RT-04 Shows | Characteristics | RT-03 Shows
ABC World News Tonight | Same show, different dates | ABC World News Tonight
CNBC The News (anchor: Tom Costello) | Typical network news, both with male speakers speaking most of the time | NBC Nightly News (anchor: Tom Brokaw)
PBS News Hour | Similar types of news content and speaking style | VOA News Now
CNN Daybreak | Similar anchor styles | PRI The World
CNBC The News (anchor: Brian Williams) | Both anchored by Brian Williams | MSNBC News (anchor: Brian Williams)
CNN Headline News | Similar types of news content, different dates | CNN Headline News
Hypothesis: A robust system should have little variation in WER for different sources
Observation: Dramatic sensitivity to test sets / subsets
Implication: Current technology is not adequately robust?
[Figure: Primary English BN 10X Systems, per-show WER (%), shows ordered by RT-03 Like Subset WER. Systems: bbn+limsi, cu, ears, sri.
RT-04F (RT-03 Like Subset) shows: ABC World News Tonight, CNBC Tom Costello, PBS News Hour, CNN - Carol Costello, CNBC Brian Williams, CNN Headline News.
RT-04F (Others) shows: WPHL WB17 News, CNN - Wolf Blitzer, WPVI Action News, CSPAN Texas Book Festival, WPVI Action News, PBS BBC News.
Annotations: "Easy for all systems"; "More difficult"; "RT-03 like shows are easier"; "Wider spectrum of BN shows is more difficult".]
Comparison with RT-03 English BN Shows
[Figure: Primary English BN 10X Systems, per-show WER (%), shows ordered by RT-03 Like Subset WER. Systems: bbn+limsi, bbn, limsi, cu, ears, sri.
RT-04F (RT-03 Like Subset) shows: ABC World News Tonight, CNBC Tom Costello, PBS News Hour, CNN Carol Costello, CNBC Brian Williams, CNN Headline News.
RT-03 Test Set shows: ABC World News Tonight, NBC Nightly News, VOA News Now, PRI The World, MSNBC Brian Williams, CNN Headline News.
Annotations: "Similar but not identical trend"; "Easier than other Brian Williams show".]
[Figure: Primary systems ordered by increasing WER. Log-scale WER (%) axis. Series: 10x full set, 1x full set, 10x rt-03 like subset, 1x rt-03 like subset. Systems: ears, cu, bbn+limsi, bbn, sri, best rt03. Labeled values: 11.6, 9.1, 15.3, 12.1, 14.6, 10.7. Annotation: "There is improvement!"]
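[Figure: Primary systems ordered by increasing error rate, two panels: Chinese BN (CER %) and Arabic BN (WER %), log-scale axes. Speed labels: 10x, UL. System labels as extracted: isl, best rt03; limsi, bbn, best rt03, bbn. Labeled values: 26.3, 18.5, 17, 22.4, 30.8, 19.1.]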
All differences at a given speed are statistically significant
[Figure: Primary systems ordered by increasing WER. Log-scale WER (%) axis. Speed conditions: UL, 20x, 10x, 1x. Systems: ibm+sri, ibm, bbn+limsi, cu, bbn, sri, best rt03. Labeled values: 29, 20.4, 22, 19, 14.9, 15.2. Annotation: No significant difference.]
[Figure: Primary English CTS 20X Systems, per-speaker WER (%) on a 0-50 axis, speakers ordered by ascending absolute WER difference. Systems: bbn+limsi, cu, ibm, ibm+sri, sri. Speaker ID labels are not recoverable from the extraction. Annotations: "First maximum, difficult for all systems"; "First minimum, easy for all systems".]
Systems don’t do well on rapid and disfluent speech
Different system outputs
[Figure: Primary systems ordered by increasing error rate, two panels: Arabic CTS (WER %) and Chinese CTS (CER %), log-scale axes. Speed labels: UL, 20x, 10x. System labels as extracted: bbn, sri+uw, best rt03; bbn, cu, sri+uw, best rt03. Labeled values: 37.5, 29.3, 44.1, 42.7, 29.7. Annotations: "Unfair comparison, different dialects"; "The 3 20X systems are not significantly different".]
BN and CTS at Different Speeds
[Figure: Log-scale WER (%) by evaluation: RT-02, RT-03, RT-04, and the RT-04 RT-03 Like Subset. Series: Eng BN 1x, Eng BN 10x, Eng BN >10x, Eng CTS 1x, Eng CTS 10x.]
Phil Woodland
The LDC
Our Arabic tutors
Mohamed Maamouri
John Makhoul
The rest of the NIST Gang!!!