Nonlinear Analysis 71 (2009) e2772–e2789
A novel Wake-Up-Word speech recognition system, Wake-Up-Word
recognition task, technology and evaluation
V.Z. Këpuska ∗ , T.B. Klein
Electrical & Computer Engineering Department, Florida Institute of Technology, Melbourne, FL 32901, USA
Article info
MSC:
68T10
Keywords:
Wake-Up-Word
Speech recognition
Hidden Markov Models
Support Vector Machines
Mel-scale cepstral coefficients
Linear prediction spectrum
Enhanced spectrum
HTK
Microsoft SAPI
Abstract
Wake-Up-Word (WUW) is a new paradigm in speech recognition (SR) that is not yet
widely recognized. This paper defines and investigates WUW speech recognition, and describes the details of this novel
solution and the technology that implements it. WUW SR is defined
as detection of a single word or phrase when spoken in the alerting context of requesting
attention, while rejecting all other words, phrases, sounds, noises and other acoustic events
and the same word or phrase spoken in non-alerting context with virtually 100% accuracy.
In order to achieve this accuracy, the following innovations were accomplished: (1) Hidden
Markov Model triple scoring with Support Vector Machine classification, (2) Combining
multiple speech feature streams: Mel-scale Filtered Cepstral Coefficients (MFCCs), Linear
Prediction Coefficients (LPC)-smoothed MFCCs, and Enhanced MFCC, and (3) Improved
Voice Activity Detector with Support Vector Machines.
WUW detection and recognition performance is 2514%, or 26 times better than HTK
for the same training & testing data, and 2271%, or 24 times better than Microsoft SAPI
5.1 recognizer. The out-of-vocabulary rejection performance is over 65,233%, or 653 times
better than HTK, and 5900% to 42,900%, or 60 to 430 times better than the Microsoft SAPI
5.1 recognizer.
This solution, which utilizes a new recognition paradigm, applies not only to the WUW task
but also to general speech recognition tasks.
© 2009 Elsevier Ltd. All rights reserved.
1. Introduction
Automatic Speech Recognition (ASR) is the ability of a computer to convert a speech audio signal into its textual transcription [1]. Some motivations for building ASR systems, presented in order of difficulty, are to improve human–computer
interaction through spoken language interfaces, to solve difficult problems such as speech-to-speech translation, and to build
intelligent systems that can process spoken language as proficiently as humans [2].
Speech as a computer interface has numerous benefits over traditional interfaces such as a GUI with mouse and keyboard:
speech is natural for humans, requires no special training, improves multitasking by leaving the hands and eyes free, and is
often faster and more efficient to transmit than using conventional input methods. Even though many tasks are solved with
visual, pointing interfaces, and/or keyboards, speech has the potential to be a better interface for a number of tasks where
full natural language communication is useful [2] and the recognition performance of the Speech Recognition (SR) system
is sufficient to perform the tasks accurately [3,4]. This includes hands-busy or eyes-busy applications in which the
user has objects to manipulate or equipment/devices to control; that is, circumstances where the user wants to multitask
while asking for assistance from the computer.
∗ Corresponding address: Electrical & Computer Engineering Department, Florida Institute of Technology, 150 W. University Blvd. Olin Engineering Bldg.,
353, Melbourne, FL 32901, USA. Tel.: +1 (321) 674 7183.
E-mail addresses: [email protected] (V.Z. Këpuska), [email protected] (T.B. Klein).
0362-546X/$ – see front matter © 2009 Elsevier Ltd. All rights reserved.
doi:10.1016/j.na.2009.06.089
A novel SR technology named Wake-Up-Word (WUW) [5,6] bridges the gap between natural-language and other voice
recognition tasks [7]. WUW SR is a highly efficient and accurate recognizer specializing in the detection of a single word or phrase
when spoken in the alerting or WUW context [8] of requesting attention, while rejecting all other words, phrases, sounds, noises
and other acoustic events with virtually 100% accuracy including the same word or phrase uttered in non-alerting (referential)
context.
The WUW speech recognition task is similar to key-word spotting. However, WUW-SR differs in one important
aspect: it is able to discriminate the specific word/phrase used only in the alerting context (and not in other, e.g.
referential, contexts). Specifically, the sentence "Computer, begin PowerPoint presentation" exemplifies the use of the word
"Computer" in the alerting context. On the other hand, in the sentence "My computer has dual Intel 64 bit processors each with
quad cores" the word computer is used in the referential (non-alerting) context. Traditional key-word spotters are not able
to discriminate between the two cases. The discrimination would only be possible by deploying a higher level natural language
processing subsystem. However, for applications deploying such solutions it is very
difficult (practically impossible) to determine in real time whether the user is speaking to the computer or about the computer.
Traditional approaches to key-word spotting are usually based on large vocabulary word recognizers [9], phone
recognizers [9], or whole-word recognizers that either use HMMs or word templates [10]. Word recognizers require tens
of hours of word-level transcriptions as well as a pronunciation dictionary. Phone recognizers require phone marked
transcriptions and whole-word recognizers require word markings for each of the keywords [11]. The word-based keyword recognizers are not able to find words that are out-of-vocabulary (OOV). To solve this problem of OOV keywords,
spotting approaches based on phone-level recognition have been applied to supplement or replace systems based upon large
vocabulary recognition [12]. In contrast, our approach to WUW-SR is independent of the basic unit of recognition
(word, phone or any other sub-word unit). Furthermore, it is capable of discriminating whether the user is speaking to the recognizer
or not without deploying language modeling and natural language understanding methods. This feature makes the WUW task
distinct from key-word spotting.
To develop natural and spontaneous speech recognition applications an ASR system must:
(1) be able to determine if the user is speaking to it or not, and
(2) have orders of magnitude higher speech recognition accuracy and robustness.
The current WUW-SR system demonstrates that the problem under (1) is solved with sufficient accuracy. The generality of the
solution has also been successfully demonstrated on the tasks of event detection and recognition applied to seismic sensor
signals (i.e., footsteps, hammer hits to the ground, hammer hits to a metal plate, and bowling ball drops). Further investigations
are underway to demonstrate the power of this approach to Large Vocabulary Speech Recognition (LVSR), solving problem (2)
above. This effort is done as part of NIST's Rich Transcription Evaluation program (http://www.nist.gov/speech/tests/rt/).
The presented work defines, in Section 2, the concept of Wake-Up-Word (WUW), a novel paradigm in speech recognition.
In Section 3, our solution to the WUW task is presented. The implementation details of the WUW recognizer are depicted in Section 4.
The WUW speech recognition evaluation experiments are presented in Section 5. A comparative study between our WUW
recognizer, HTK, and Microsoft's SAPI 5.1 is presented in Section 6. Concluding remarks and future work are provided in
Sections 7 and 8.
2. Wake-Up-Word paradigm
A WUW speech recognizer is a highly efficient and accurate recognizer specializing in the detection of a single word or
phrase when spoken in the context of requesting attention (e.g., alerting context), while rejecting the same word or phrase
spoken under referential (non-alerting) context, all other words, phrases, sounds, noises and other acoustic events with
virtually 100% accuracy. This high accuracy enables development of speech recognition driven interfaces that utilize dialogs
using speech only.
2.1. Creating 100% speech-enabled interfaces
One of the goals of speech recognition is to allow natural communication between humans and computers via speech [2],
where natural implies similarity to the ways humans interact with each other on a daily basis. A major obstacle to this is
the fact that most systems today still rely to large extent on non-speech input, such as pushing buttons or mouse clicking.
However, much like a human assistant, a natural speech interface must be continuously listening and must be robust enough
to recover from any communication errors without non-speech intervention.
Speech recognizers deployed in continuously listening mode are continuously monitoring acoustic input and do not
necessarily require non-speech activation. This is in contrast to the push to talk model, in which speech recognition is only
activated when the user pushes a button. Unfortunately, today’s continuously listening speech recognizers are not reliable
enough due to their insufficient accuracy, especially in the area of correct rejection. For example, such systems often respond
erratically, even when no speech is present. They sometimes interpret background noise as speech, and they sometimes
incorrectly assume that certain speech is addressed to the speech recognizer when in fact it is targeted elsewhere (context
misunderstanding). These problems have traditionally been solved by the push to talk model: requesting the user to push a
button immediately before or during talking, or by a similar prompting paradigm.
Another problem with traditional speech recognizers is that they cannot recover from errors gracefully, and often require
non-speech intervention. Any speech-enabled human–machine interface based on natural language relies on carefully
crafted dialogues. When the dialogue fails, currently there is no good mechanism to resynchronize the communication,
and typically the transaction between human and machine fails by termination. A typical example is an SR system which
is in a dictation state, when in fact the human is attempting to use command-and-control to correct a previous dictation
error. Often the user is forced to intervene by pushing a button or keyboard key to resynchronize the system. Current SR
systems that do not deploy the push to talk paradigm use implicit context switching. For example, a system that has the ability
to switch from ‘‘dictation mode’’ to ‘‘command mode’’ does so by trying to infer whether the user is uttering a command
rather than dictating text. This task is rather difficult to perform with high accuracy, even for humans. The push to talk
model uses explicit context switching, meaning that the action of pushing the button explicitly sets the context of the speech
recognizer to a specific state.
To achieve the goal of developing a natural speech interface, it is first useful to consider human to human communication.
Upon hearing an utterance of speech a human listener must quickly make a decision whether or not the speech is directed
towards him or her. This decision determines whether the listener will make an effort to ‘‘process’’ and understand the
utterance. Humans can make this decision quickly and robustly by utilizing visual, auditory, semantic, and/or contextual
clues.
Visual clues might be gestures such as waving of hands or other facial expressions. Auditory clues are attention grabbing
words or phrases such as the listener’s name (e.g. John), interjections such as ‘‘hey’’, ‘‘excuse me’’, and so forth. Additionally,
the listener may make use of prosodic information such as pitch and intonation, as well as identification of the speaker’s
voice.
Semantic and/or contextual clues are inferred from interpretation of the words in the sentence being spoken, visual input,
prior experience, and customs dictated by the culture. For example, if one knows that a nearby speaker is in the process of
talking on the phone with someone else, he/she will ignore the speaker altogether knowing that speech is not targeting
him/her. Humans are very robust in determining when speech is targeted towards them, and should a computer SR system
be able to make the same decisions its robustness would increase significantly.
Wake-Up-Word is proposed as a method to explicitly request the attention of a computer using a spoken word or phrase.
The WUW must be spoken in the context of requesting attention and should not be recognized in any other context. After
successful detection of WUW, the speech recognizer may safely interpret the following utterance as a command. The WUW
is analogous to the button in push to talk, but the interaction is completely based on speech. Therefore it is proposed to use
explicit context switching via WUW. This is similar to how context switching occurs in human to human communication.
2.2. WUW-SR differences from other SR tasks
Wake-Up-Word is often mistakenly compared to other speech recognition tasks such as key-word spotting or command
& control, but WUW is different from the previously mentioned tasks in several significant ways. The most important
characteristic of a WUW-SR system is that it should guarantee virtually 100% correct rejection of non-WUW and same words
uttered in non-alerting context while maintaining correct acceptance rate over 99%. This requirement sets apart WUW-SR
from other speech recognition systems because no existing system can guarantee 100% reliability by any measure without
significantly lowering correct recognition rate. It is this guarantee that allows WUW-SR to be used in novel applications that
previously have not been possible. Second, a WUW-SR system should be context dependent; that is, it should detect only
words uttered in alerting context. Unlike key-word spotting, which tries to find a specific keyword in any context, the WUW
should only be recognized when spoken in the specific context of grabbing the attention of the computer in real time. Thus,
WUW recognition can be viewed as a refined key-word spotting task, albeit significantly more difficult. Finally, WUW-SR should maintain high recognition rates in speaker-independent or speaker-dependent mode, and in various acoustic
environments.
2.3. Significance of WUW-SR
The reliability of WUW-SR opens up the world of speech recognition to applications that were previously impossible.
Today’s speech-enabled human–machine interfaces are still regarded with skepticism, and people are hesitant to entrust
any significant or accuracy-critical tasks to a speech recognizer.
Despite the fact that SR is becoming almost ubiquitous in the modern world, widely deployed in mobile phones,
automobiles, desktop, laptop, and palm computers, many handheld devices, telephone systems, etc., the majority of the
public pays little attention to speech recognition. Moreover, the majority of speech recognizers use the push to talk paradigm
rather than continuously listening, simply because they are not robust enough against false positives. One can imagine that
the driver of a speech-enabled automobile would be quite unhappy if his or her headlights suddenly turned off because the
continuously listening speech recognizer misunderstood a phrase in the conversation between driver and passenger.
The accuracy of speech recognizers is often measured by word error rate (WER), which uses three measures [1]:
Insertion (INS) – an extra word was inserted in the recognized sentence.
Deletion (DEL) – a correct word was omitted in the recognized sentence.
Substitution (SUB) – an incorrect word was substituted for a correct word.
WER is defined as:
$$\mathrm{WER} = \frac{\mathrm{INS} + \mathrm{DEL} + \mathrm{SUB}}{\text{true number of words in sentence}} \times 100\%. \qquad (1)$$
Substitution errors (or, equivalently, correct recognition in the WUW context) represent the accuracy of the recognition, while
insertion and deletion errors are caused by other factors, typically erroneous segmentation.
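To make Eq. (1) concrete, the sketch below (a minimal Python illustration, not part of the original system) counts insertions, deletions, and substitutions with a standard edit-distance alignment and reports the WER; the toy sentences are made up.

```python
def wer(reference, hypothesis):
    """Word error rate via Levenshtein alignment over word tokens.

    Returns WER in percent, following Eq. (1): (INS + DEL + SUB) / N * 100.
    """
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = minimum edit cost to turn ref[:i] into hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                          # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                          # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub,               # substitution (or match)
                          d[i - 1][j] + 1,   # deletion
                          d[i][j - 1] + 1)   # insertion
    return 100.0 * d[len(ref)][len(hyp)] / max(len(ref), 1)

# Example: one substitution and one insertion in a five-word sentence -> 40% WER.
print(wer("begin the power point presentation",
          "begin a power point slide presentation"))
```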
However, WER as an accuracy measurement is of limited usefulness to a continuously listening command and control
system. To understand why, consider a struggling movie director who cannot afford a camera crew and decides to install a
robotic camera system instead. The system is controlled by a speech recognizer that understands four commands: ‘‘lights’’,
‘‘camera’’, ‘‘action’’, and ‘‘cut’’. The programmers of the speech recognizer claim that it has 99% accuracy, and the director
is eager to try it out. When the director sets up the scene and utters "lights", "camera", "action", the recognizer makes no
mistake, and the robotic lights and cameras spring into action. However, as the actors begin performing the dialogue
in their scene, the computer misrecognizes "cut" and stops the film, ruining the scene and costing everyone their time
and money. The actors could only speak 100 words before the recognizer, which truly had 99% accuracy, triggered a false
acceptance.
This anecdote illustrates two important ideas. First, judging the accuracy of a continuously listening system requires using
a measure of ‘‘rejection’’. That is, the ability of the system to correctly reject out-of-vocabulary utterances. The WER formula
incorrectly assumes that all speech utterances are targeted at the recognizer and that all speech arrives in simple, consecutive
sentences. Consequently, performance of WUW-SR is measured in terms of correct acceptances (CA) and correct rejections
(CR). Because it is difficult to quantify and compare the number of CRs, rejection performance is also given in terms of ‘‘false
acceptances per hour’’.
The second note of interest is that 99% rejection accuracy is actually a very poor performance level for a continuously
listening command and control system. In fact, the 99% accuracy claim is a misleading figure. While 99% acceptance accuracy
is impressive in terms of recognition performance, 99% rejection implies one false acceptance per 100 words of speech. It
is not uncommon for humans to speak hundreds of words per minute, and such a system would trigger multiple false
acceptances per minute. That is the reason why today’s speech recognizers a) primarily use a ‘‘push to talk’’ design, and
b) are limited to performing simple convenience functions and are never relied on for critical tasks. On the other hand,
experiments have shown the WUW-SR implementation presented in this work to reach up to 99.99% rejection accuracy,
which translates to one false acceptance every 2.2 h.
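To make the arithmetic behind these figures explicit, assume (purely for illustration) a speaking rate of 200 words per minute and one recognition opportunity per word; at 99% rejection accuracy this gives
$$200\,\tfrac{\text{words}}{\text{min}} \times (1 - 0.99) = 2\ \text{false acceptances per minute},$$
which is exactly the kind of failure rate described above.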
3. Wake-Up-Word problem identification and solution
Wake-Up-Word recognition requires solving three problems: (1) detection of the WUW (e.g., alerting) context, (2) correct
identification of WUW, and (3) correct rejection of non-WUW.
3.1. Detecting WUW context
The Wake-Up-Word is proposed as a means to grab the computer’s attention with extremely high accuracy. Unlike keyword spotting, the recognizer should not trigger on every instance of the word, but rather only in certain alerting contexts.
For example, if the Wake-Up-Word is ‘‘computer’’, the WUW should not trigger if spoken in a sentence such as ‘‘I am now
going to use my computer’’.
Some previous attempts at WUW-SR avoided context detection by choosing a word that is unique and unlikely to be
spoken. An example of this is WildFire Communication’s ‘‘wildfire’’ word introduced in the early 1990s. However, even
though the solution was customized for the specific word ‘‘wildfire’’, it was neither accurate enough nor robust.
The implementation of WUW presented in this work selects context based on leading and trailing silence. Speech is
segmented from background noise by the Voice Activity Detector (VAD), which makes use of three features for its decisions:
energy difference from long term average, spectral difference from long term average, and Mel-scale Filtered Cepstral
Coefficients (MFCC) [13] difference from long term average. In the future, prosody information (pitch, energy, intonation
of speech) will be investigated as an additional means to determine the context as well as to set the VAD state.
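The VAD algorithm itself is not published in detail here; the following is only a rough sketch of the kind of per-frame decision logic described above, comparing energy and spectral distances against running long-term averages (the MFCC distance used by the actual system is omitted for brevity, and all thresholds and the smoothing constant are hypothetical).

```python
import numpy as np

def vad_decision(frame, background, thr_energy=6.0, thr_spec=4.0, alpha=0.995):
    """Rough per-frame speech/non-speech decision.

    Compares the frame's log energy and log spectrum against running
    long-term (background) averages; the real system additionally uses an
    MFCC distance, omitted here for brevity.  Thresholds are illustrative.
    """
    windowed = frame * np.hamming(len(frame))
    log_spec = np.log(np.abs(np.fft.rfft(windowed)) + 1e-10)
    log_energy = np.log(np.sum(frame ** 2) + 1e-10)

    if background is None:
        background = {"energy": log_energy, "spec": log_spec}

    d_energy = log_energy - background["energy"]
    d_spec = np.mean(np.abs(log_spec - background["spec"]))
    is_speech = (d_energy > thr_energy) or (d_spec > thr_spec)

    if not is_speech:
        # Track the stationary background only while no speech is detected.
        background["energy"] = alpha * background["energy"] + (1 - alpha) * log_energy
        background["spec"] = alpha * background["spec"] + (1 - alpha) * log_spec
    return is_speech, background

# Usage on a stream of 25 ms frames (assumed 8 kHz telephone audio).
fs, frame_len = 8000, 200
bg = None
for frame in np.random.randn(50, frame_len) * 0.01:   # stand-in for real audio
    speech, bg = vad_decision(frame, bg)
```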
3.2. Identifying WUW
The VAD is responsible for finding utterances spoken in the correct context and segmenting them from the rest of the
audio stream. After the VAD makes a decision, the next task of the system is to identify whether or not the segmented
utterance is a WUW.
The front end extracts the following features from the audio stream: MFCC [13], LPC smoothed MFCC [14], and enhanced
MFCC, a proprietary technology. The MFCC and LPC features are standard features deployed by the vast majority of recognizers.
The enhanced MFCCs are introduced to validate their expected superior performance over the standard features, specifically
under noisy conditions. Finally, although all three features individually may provide comparable performance, it
is hypothesized that the combination of all three features will provide significant performance gains, as demonstrated in
this paper. This in turn indicates that each feature captures slightly different short-time characteristics of the speech signal segment from which they are produced.

Fig. 4.1. WUW-SR overall architecture. The signal processing module accepts the raw samples of the audio signal and produces a spectral representation of a short-time signal. The feature extraction module generates features from the spectral representation of the signal. Those features are decoded with corresponding HMMs. The individual feature scores are classified using an SVM.
Acoustic modeling is performed with Hidden Markov Models (HMM) [15,16] with triple scoring [4], an approach
further developed throughout this work. Finally, classification of those scores is performed with Support Vector Machines
(SVM) [17–21].
3.3. Correct rejection of Non-WUW
The final and critical task of a WUW system is correctly rejecting Out-Of-Vocabulary (OOV) speech. The WUW-SR system
should have a correct rejection rate of virtually 100% while maintaining high correct recognition rates of over 99%. The task
of rejecting OOV segments is difficult, and there are many papers in the literature discussing this topic. Typical approaches
use garbage models, filler models, and/or noise models to capture extraneous words or sounds that are out of vocabulary.
For example, in [22], a large number of garbage models is used to capture as much of the OOV space as possible. In [23],
OOV words are modeled by creating a generic word model which allows for arbitrary phone sequences during recognition,
such as the set of all phonetic units in the language. Their approach yields a correct acceptance rate of 99.2% and a false
acceptance rate of 71% on data collected from the Jupiter weather information system.
The WUW-SR system presented in this work uses triple scoring, a novel scoring technique introduced by the author [4] and further
developed throughout this work, to assign three scores to every incoming utterance. The three scores produced by the
triple scoring approach are highly correlated, and furthermore the correlation between scores of INV utterances is distinctly
different than that of OOV utterances. The difference in correlations makes INV and OOV utterances highly separable in the
score space, and an SVM classifier is used to produce the optimal separating surface.
4. Implementation of WUW-SR
The WUW Speech Recognizer is a complex software system comprising three major parts: Front End, VAD, and Back
End. The Front End is responsible for extracting features from an input audio signal. The VAD (Voice Activity Detector)
segments the signal into speech and non-speech regions. Finally, the back end performs recognition and scoring. The diagram
in Fig. 4.1 below shows the overall procedure.
The WUW recognizer has been implemented entirely in C++, and is capable of running live and performing recognition in
real time. Functionally, it is broken into three major units: (1) Front End: responsible for feature extraction and VAD classification
of each frame. (2) Back End: performing word segmentation and classification of those segments for each feature stream,
and (3) INV/OOV Classification using individual HMM segmental scores with SVM.
The Front End Processor and the HMM software can also be run and scripted independently, and they are accompanied
by numerous MATLAB scripts to assist with batch processing necessary for training the models.
[Fig. 4.2 block diagram: input samples pass through DC-filtering, buffering, pre-emphasis, windowing, and FFT to produce the spectrogram and an energy feature; the MFCC path applies Mel-scale filtering, a discrete cosine transform (DCT), and cepstral mean normalization (CMN); the LPC-MFCC path adds autocorrelation and LPC smoothing; the enhanced-MFCC path applies Mel-scale filtering and DCT to an enhanced spectrum; a feature-based VAD logic block with estimation of the stationary background spectrum produces the VAD state.]

Fig. 4.2. Details of the signal processing in the Front End. It contains a common spectrogram computation module, a feature-based VAD, and modules that compute three features (MFCC, LPC-MFCC, and enhanced MFCC) for each analysis frame. The VAD classifier determines the state (speech or no speech) of each frame, indicating whether a segment contains speech. This information is used by the Back End of the recognizer.
4.1. Front end
The front end is responsible for extracting features out of the input signal. Three sets of features are extracted: Mel
Filtered Cepstral Coefficients (MFCC) [13], LPC (Linear Predictive Coding) [14] smoothed MFCCs, and Enhanced MFCCs. The
following diagram, Fig. 4.2, illustrates the architecture of the Front End and VAD.
The following image, Fig. 4.3, shows an example of the waveform superimposed with its VAD segmentation, its
spectrogram, and its enhanced spectrogram.
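The Front End is implemented in C++ and the enhanced-MFCC computation is proprietary, so it is not reproduced here; the following Python sketch only illustrates the standard MFCC pipeline named above (pre-emphasis, windowing, FFT spectrum, Mel-scale filtering, DCT, and cepstral mean normalization). The frame size, hop, filter count, and 8 kHz sampling rate are assumptions for illustration, not the paper's settings.

```python
import numpy as np
from scipy.fftpack import dct

def mel(f):            # Hz -> Mel
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_inv(m):        # Mel -> Hz
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, fs):
    """Triangular filters evenly spaced on the Mel scale, shape (n_filters, n_fft//2 + 1)."""
    edges = mel_inv(np.linspace(mel(0), mel(fs / 2), n_filters + 2))
    bins = np.floor((n_fft + 1) * edges / fs).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        fb[i - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fb[i - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    return fb

def mfcc_features(signal, fs=8000, frame_len=200, hop=80, n_fft=256, n_filters=23, n_ceps=13):
    """Frame-level MFCCs with cepstral mean normalization (CMN)."""
    emphasized = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])   # pre-emphasis
    window = np.hamming(frame_len)
    frames = [emphasized[i:i + frame_len] * window
              for i in range(0, len(emphasized) - frame_len + 1, hop)]
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2                      # power spectrum
    fbank = np.log(power @ mel_filterbank(n_filters, n_fft, fs).T + 1e-10)
    ceps = dct(fbank, type=2, axis=1, norm='ortho')[:, :n_ceps]          # DCT -> cepstrum
    return ceps - ceps.mean(axis=0)                                      # CMN

feats = mfcc_features(np.random.randn(8000))    # 1 s of stand-in audio
print(feats.shape)                              # (number of frames, 13)
```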
4.2. Back end
The Back End is responsible for classifying observation sequences as In-Vocabulary (INV), i.e. the sequence of frames
hypothesized as a segment is a Wake-Up-Word, or Out Of Vocabulary (OOV), i.e. the sequence is not a Wake-Up-Word. The
WUW-SR system uses a combination of Hidden Markov Models and Support Vector Machines for acoustic modeling, and as
a result the back end consists of an HMM recognizer and an SVM classifier. Prior to recognition, HMM and SVM models must
be created and trained for the word or phrase which is to be the Wake-Up-Word.
When the VAD state changes from VAD_OFF to VAD_ON, the HMM recognizer resets and prepares for a new observation
sequence. As long as the VAD state remains VAD_ON, feature vectors are continuously passed to the HMM recognizer, where
they are scored using the novel triple scoring method. If using multiple feature streams (a process explained in a later
section), recognition is performed for each stream in parallel. When VAD state changes from VAD_ON to VAD_OFF, multiple
scores are obtained from the HMM recognizer and are sent to the SVM classifier. SVM produces a classification score which
is compared against a threshold to make the final classification decision of INV or OOV.
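A minimal sketch of this control flow is given below: feature frames are accumulated while the VAD is on, and on the VAD_ON to VAD_OFF transition the accumulated segment is scored and classified. The per-stream scorer here is a stub and the classifier is a generic scikit-learn SVM; neither is the paper's implementation.

```python
import numpy as np
from sklearn.svm import SVC

VAD_OFF, VAD_ON = 0, 1

class SegmentRecognizer:
    """Accumulate feature frames per VAD segment, then score and classify."""

    def __init__(self, hmm_scorers, svm: SVC, threshold=0.0):
        self.hmm_scorers = hmm_scorers      # one scorer per feature stream
        self.svm = svm                      # trained on concatenated scores
        self.threshold = threshold
        self.state = VAD_OFF
        self.frames = []

    def push(self, feature_frame, vad_state):
        decision = None
        if self.state == VAD_OFF and vad_state == VAD_ON:
            self.frames = []                # new observation sequence
        if vad_state == VAD_ON:
            self.frames.append(feature_frame)
        if self.state == VAD_ON and vad_state == VAD_OFF and self.frames:
            scores = np.concatenate([s(np.array(self.frames)) for s in self.hmm_scorers])
            decision = "INV" if self.svm.decision_function([scores])[0] > self.threshold else "OOV"
        self.state = vad_state
        return decision                     # None until a segment completes

# Stub standing in for triple-scoring HMM decoding of one feature stream.
def fake_scorer(frames):
    return np.array([-80.0, -85.0, -90.0]) + 0.01 * len(frames)

# Toy usage: train an SVM on random 9-D scores, then feed frames through the recognizer.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-80, 3, (20, 9)), rng.normal(-100, 3, (20, 9))])
y = np.array([1] * 20 + [-1] * 20)
rec = SegmentRecognizer([fake_scorer] * 3, SVC(kernel='linear').fit(X, y))
for vad in [VAD_OFF] * 2 + [VAD_ON] * 10 + [VAD_OFF]:
    out = rec.push(rng.normal(size=13), vad)
    if out:
        print("segment classified as", out)
```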
5. Wake-Up-Word speech recognition experiments
For HMM training and recognition we used a number of corpora as described next.
[Fig. 4.3 panels: (1) speech waveform with VAD segmentation, amplitude vs. time (s); (2) spectrogram, frequency (Hz, 0–4000) vs. time (s); (3) enhanced spectrogram, frequency (Hz, 0–4000) vs. time (s).]

Fig. 4.3. Example of the speech signal (first plot, in blue) with VAD segmentation (first plot, in red), spectrogram (second plot), and enhanced spectrogram (third plot) generated by the Front End module. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)
5.1. Speech corpora
Speech recognition experiments made use of the following corpora: CCW17, WUW-I, WUW-II, Phonebook, and Callhome.
The properties of each corpus are summarized in the following table, Table 5.1.
5.2. Selecting INV and OOV utterances
The HMMs used throughout this work are predominantly whole word models, and HMM training requires many
examples of the same word spoken in isolation or segmented from the continuous speech. As such, training utterances
were only chosen from CCW17, WUW-I, and WUW-II, but OOV utterances were chosen from all corpora. The majority of
the preliminary tests were performed using the CCW17 corpus, which is well organized and very easy to work with. Of the
ten words available in CCW17, the word ‘‘operator’’ has been chosen for most experiments as the INV word. This word is
not especially unique in the English vocabulary and is moderately confusable (e.g., "operate", "moderator", "narrator", etc.),
making it a good candidate for demonstrating the power of WUW speech recognition.
5.3. INV/OOV decisions & cumulative error rates
The WUW speech recognizer Back End is essentially a binary classifier that, given an arbitrary length input sequence
of observations, classifies it as In-Vocabulary (INV) or Out-Of-Vocabulary (OOV). Although the internal workings are very
complex and can involve a hierarchy of Hidden Markov Models, Support Vector Machines, and other classifiers, the end
result is that any given input sequence produces a final score.

Table 5.1
Corpora used for training and testing. CCW17, WUW-I and WUW-II are corpora privately collected by ThinkEngine Networks. Phonebook and Callhome are publicly available corpora from the Linguistic Data Consortium (LDC) at the University of Pennsylvania.

CCW17
Speech type: Isolated words
Vocabulary size: 10
Unique speakers: More than 200
Total duration: 7.9 h
Total words: 4234
Channel type: Telephone
Channel quality: Variety of noise conditions
The CCW17 corpus contains 10 isolated words recorded over the telephone from multiple speakers. The calls were made from land lines, cellular phones, and speakerphones, under a variety of noise conditions. The data is grouped by telephone call, and each call contains one speaker uttering up to 10 words. The speakers have varying backgrounds, but are predominantly American from the New England region and Indian (many of whom carry a heavy Indian accent).

WUW-I & WUW-II
Speech type: Isolated words & sentences
WUW-I and WUW-II are similar to CCW17. They consist of isolated words and additionally contain sentences that use those words in various ways.

Phonebook
Speech type: Isolated words
Vocabulary size: 8002
Total words: 93,267
Unique speakers: 1358
Total duration: 37.6 h
Channel type: Telephone
Channel quality: Quiet to noisy
The Phonebook corpus consists of 93,267 files, each containing a single utterance surrounded by two small periods of silence. The utterances were recorded by 1358 unique individuals. The corpus has been split into a training set of 48,598 words (7945 unique) spanning 19.6 h of speech, and a testing set of 44,669 words (7944 unique) spanning 18 h of speech.

Callhome
Speech type: Free & spontaneous
Vocabulary size: 8577
Total words: 165,546
Unique speakers: More than 160
Total duration: 40 h
Channel type: Telephone
Channel quality: Quiet to noisy
The Callhome corpus contains 80 natural telephone conversations in English of length 30 min each. For each conversation both channels were recorded and included separately.

The score space is divided into two regions, $R_1$ and $R_{-1}$, based on a chosen decision boundary, and an input sequence can be marked as INV or OOV depending on which region it
falls in. The score space can be partitioned at any threshold depending on the application requirements, and the error rates
can be computed for that threshold. An INV utterance that falls in the OOV region is a False Rejection error, while an OOV
utterance that falls in the INV region is considered a False Acceptance error.
OOV error rate (False Acceptances or False Positives) and INV error rate (False Rejections or False Negatives) can be
plotted as functions of the threshold over the range of scores. The point where they intersect is known as the Equal Error
Rate (EER), and is often used as a simple measure of classification performance. The EER defines a score threshold for which
False Rejection is equal to False Acceptance rate. However, the EER is often not the minimum error, especially when the
scores do not have well-formed Gaussian or uniform distributions, nor is it the optimal threshold for most applications. For
instance, in Wake-Up-Word the primary concern is minimizing OOV error, so the threshold would be chosen so as to
minimize the False Acceptance rate at the expense of increasing the False Rejection rate.
For the purpose of comparison and evaluation, the following experiments include plots of the INV and OOV recognition
score distributions, the OOV and INV error rate curves, as well as the point of equal error rate. Where applicable, the WUW
optimal threshold and error rates at that threshold are also shown.
5.4. Plain HMM scores
For the first tests on speech data, an HMM was trained on the word "operator". The training sequences were taken from the
CCW17 and WUW-II corpora for a total of 573 sequences from over 200 different speakers. After features were extracted,
some of the erroneous VAD segments were manually removed. The INV testing sequences were the same as the training
sequences, while the OOV testing sequences included the rest of the CCW17 corpus (3833 utterances, 9 different words,
over 200 different speakers). The HMM was a left-to-right model with no skips, 30 states, and 6 mixtures per state, and was
trained with two iterations of the Baum–Welch algorithm [16,15].
The score is the result of the Viterbi algorithm over the input sequence. Recall that the Viterbi algorithm finds the state
sequence that has the highest probability of being taken while generating the observation sequence. The final score is that
probability normalized by the number of input observations, T. Fig. 5.1 below shows the results.
The distributions look Gaussian, but there is a significant overlap between them. The equal error rate of 15.5% essentially
means that at that threshold, 15.5% of the OOV words would be classified as INV, and 15.5% of the INV words would be
classified as OOV. Obviously, no practical applications can be developed based on the performance of this recognizer.
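The triple-scoring extension is proprietary, but the baseline score just described (the best-path Viterbi log-probability normalized by the number of frames T) can be sketched as follows for a generic left-to-right HMM; the model size and emission log-likelihoods below are placeholders rather than the trained 30-state, 6-mixture model.

```python
import numpy as np

def normalized_viterbi_score(log_emissions, log_trans, log_init):
    """Best-path log-probability divided by the number of frames T.

    log_emissions: (T, N) per-frame, per-state log-likelihoods
    log_trans:     (N, N) log transition matrix (left-to-right has -inf below diagonal)
    log_init:      (N,)   log initial-state probabilities
    """
    T, N = log_emissions.shape
    delta = log_init + log_emissions[0]
    for t in range(1, T):
        # For each state j, keep the best predecessor i of delta[i] + log_trans[i, j].
        delta = np.max(delta[:, None] + log_trans, axis=0) + log_emissions[t]
    return np.max(delta) / T

# Tiny left-to-right example (3 states, no skips), emission log-likelihoods made up.
N = 3
log_trans = np.full((N, N), -np.inf)
for i in range(N - 1):
    log_trans[i, i] = np.log(0.6)          # self-loop
    log_trans[i, i + 1] = np.log(0.4)      # advance to next state
log_trans[N - 1, N - 1] = 0.0              # final state self-loop, prob 1
log_init = np.log(np.array([1.0, 1e-12, 1e-12]))   # must start in state 0
log_emissions = np.log(np.random.rand(20, N))       # stand-in for GMM log-likelihoods
print(normalized_viterbi_score(log_emissions, log_trans, log_init))
```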
[Fig. 5.1 axes: recognition score (horizontal) vs. cumulative error percent (vertical).]
Fig. 5.1. Score distributions for the plain HMM on CCW17 corpus data for the WUW "operator".
[Fig. 5.2 axes: Score 1 (horizontal) and Score 2 (vertical), both roughly −130 to −60, with OOV and INV sequences and marginal histograms.]
Fig. 5.2. Scatter plots of score 1 vs. score 2 and their corresponding independently constructed histograms.
[Fig. 5.3 axes: Score 1 (horizontal) and Score 3 (vertical), with OOV and INV sequences and marginal histograms.]
Fig. 5.3. Scatter plots of score 1 vs. score 3 and their corresponding independently constructed histograms.
5.5. Triple scoring method
In addition to the Viterbi score, the algorithm developed in this work produces two additional scores for any given
observation sequence. When considering the three scores as features in a three-dimensional space, the separation between
INV and OOV distributions increases significantly. The next experiment runs recognition on the same data as above,
but this time the recognizer uses the triple scoring algorithm to output three scores. Figs. 5.2 and 5.3 show two-dimensional scatter plots of Score 1 vs. Score 2, and Score 1 vs. Score 3 for each observation sequence. In addition, a
histogram on the horizontal axis shows the distributions of Score 1 independently, and a similar histogram on the vertical
axis shows the distributions of Score 2 and Score 3 independently. The histograms are hollowed out so that the overlap
between distributions can be seen clearly. The distribution for Score 1 is exactly the same as in the previous section, as
Fig. 5.4. Linear SVM, of the scatter plot of scores 1–2.
Fig. 5.5. Linear SVM, of the scatter plot of scores 1–3.
the data and model have not changed. No individual score produces a good separation between classes, and in
fact the Score 2 distributions have almost complete overlap. However, the two-dimensional separation in either case is
remarkable.
5.6. Classification with Support Vector Machines
In order to automatically classify an input sequence as INV or OOV, the triple-score feature space, $\mathbb{R}^3$, can be partitioned
by a binary classifier into two regions, $R^3_1$ and $R^3_{-1}$. Support Vector Machines have been selected for this task for
the following reasons: they can produce various kinds of decision surfaces including radial basis function, polynomial, and
linear; and they employ Structural Risk Minimization (SRM) [18] to maximize the margin, which has been shown empirically to yield
good generalization performance.
Two types of SVMs have been considered for this task: linear and RBF. The linear SVM uses a dot product kernel function,
$K(\mathbf{x}, \mathbf{y}) = \mathbf{x} \cdot \mathbf{y}$, and separates the feature space with a hyperplane. It is very computationally efficient because no matter
how many support vectors are found, evaluation requires only a single dot product. Figs. 5.4 and 5.5 show that the
separation between distributions based on Score 1 and Score 3 is almost linear, so a linear SVM would likely give good results.
However, in the Score 1 / Score 2 space, the distributions have a curvature, so the linear SVM is unlikely to generalize well
for unseen data. Figs. 5.4 and 5.5 show the decision boundary found by a linear SVM trained on Scores 1 + 2, and
Scores 1 + 3, respectively.
The line in the center represents the contour of the SVM function at u = 0, and the outer two lines are drawn at u = ±1.
Using 0 as the threshold, the accuracy of Scores 1–2 is 99.7% Correct Rejection (CR) and 98.6% Correct Acceptance (CA), while
for Scores 1–3 it is 99.5% CR and 95.5% CA. If considering only two features, Scores 1 and 2 seem to have better classification
ability. However, combining the three scores produces the plane shown below, in Figs. 5.6 and 5.7 from two different
angles.
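As an illustration of this classification step, the sketch below trains a linear SVM on synthetic (Score 1, Score 2) pairs and classifies at the u = 0 contour; the score values and class sizes are placeholders, not the recognizer's output.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Synthetic stand-ins for (Score 1, Score 2) of INV (+1) and OOV (-1) utterances.
inv = rng.normal(loc=[-70, -75], scale=4.0, size=(200, 2))
oov = rng.normal(loc=[-95, -100], scale=6.0, size=(400, 2))
X = np.vstack([inv, oov])
y = np.hstack([np.ones(len(inv)), -np.ones(len(oov))])

clf = SVC(kernel='linear').fit(X, y)        # separating hyperplane in score space
u = clf.decision_function(X)                # signed SVM output
predicted_inv = u > 0                       # threshold at the u = 0 contour

ca = np.mean(predicted_inv[y == 1])         # correct acceptance rate on INV
cr = np.mean(~predicted_inv[y == -1])       # correct rejection rate on OOV
print(f"CA = {ca:.3f}, CR = {cr:.3f}")
```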
[Fig. 5.6 axes: Score 1, Score 2, Score 3, each roughly −140 to −40.]
Fig. 5.6. Linear SVM, three-dimensional discriminating surface view 1 as a function of scores 1–3.

[Fig. 5.7 axes: Score 1, Score 2, Score 3.]
Fig. 5.7. Linear SVM, three-dimensional discriminating surface view 2 as a function of scores 1–3.

[Fig. 5.8 axes: recognition score (horizontal) vs. cumulative error percent (vertical).]
Fig. 5.8. Histogram of score distributions for triple scoring using the linear SVM discriminating surface.
The plane split the feature space with an accuracy of 99.9% CR and 99.5% CA (just 6 of 4499 total sequences were
misclassified). The accuracy was better than any of the two-dimensional cases, indicating that Score 3 contains additional
information not found in Score 2 and vice versa. The classification error rate of the linear SVM is shown in Fig. 5.8.
The conclusion is that using the triple scoring method combined with a linear SVM decreased the equal error rate on this
particular data set from 15.5% to 0.2%, or in other words increased accuracy by over 77.5 times (i.e., error rate reduction of
7650%)!
In the next experiment, a Radial Basis Function (RBF) kernel was used for the SVM. The RBF function, $K(\mathbf{x}, \mathbf{y}) = e^{-\gamma \|\mathbf{x} - \mathbf{y}\|^{2}}$,
maps feature vectors into a potentially infinite-dimensional Hilbert space and is able to achieve complete separation
between classes in most cases. However, the γ parameter must be chosen carefully in order to avoid overtraining. As there
is no way to determine it automatically, a grid search may be used to find a good value. For most experiments γ = 0.008
gave good results. Shown below in Figs. 5.9 and 5.10 are the RBF SVM contours for both two-dimensional cases.
Fig. 5.9. RBF SVM kernel, of the scatter plot of scores 1–2.
Fig. 5.10. RBF SVM kernel, of the scatter plot of scores 1–3.
At the u = 0 threshold, the classification accuracy was 99.8% CR, 98.6% CA for Score 1–2, and 99.4% CR, 95.6% CA for
Score 1–3. In both cases the RBF kernel formed a closed decision region around the INV points (Recall that the SVM decision
function at u = 0 is shown by the green line).
Some interesting observations can be made from these plots. First, the INV outlier in the bottom left corner of the first
plot caused a region to form around it. SVM function output values inside the region were somewhere between −1 and
0, not high enough to cross into the INV class. However, it is apparent that the RBF kernel is sensitive to outliers, so the γ
parameter must be chosen carefully to prevent overtraining. Had the γ parameter been a little bit higher, the SVM function
output inside the circular region would have increased beyond 0, and that region would have been considered INV.
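As noted above, γ is typically selected by grid search; a small scikit-learn sketch of such a search with cross-validation is shown below. The candidate grid and the synthetic triple scores are illustrative only.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(1)
# Stand-in triple scores: 200 INV (+1) and 400 OOV (-1) three-dimensional score vectors.
X = np.vstack([rng.normal(-75, 5, (200, 3)), rng.normal(-100, 8, (400, 3))])
y = np.hstack([np.ones(200), -np.ones(400)])

search = GridSearchCV(
    SVC(kernel='rbf'),
    param_grid={'gamma': [0.001, 0.002, 0.004, 0.008, 0.016, 0.032],
                'C': [1, 15, 100]},
    cv=5)                                   # cross-validation guards against overtraining
search.fit(X, y)
print(search.best_params_, search.best_score_)
```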
Second, the RBF kernel’s classification accuracy showed almost no improvement over the linear SVM. However, a number
of factors should be considered when making the final decision about which kernel to use. If both models have similar
performance, it is usually the simplest one that is eventually chosen (according to Occam’s Razor). RBF kernels feature an
extra parameter (γ) and are therefore considered more complex than linear kernels. However, the criterion that matters
most in a real-time application such as this is not necessarily the ability to form curved decision boundaries but the
number of support vectors utilized by each model; the one with the fewest support vectors is the simplest and would be the
one finally chosen, whether RBF or linear kernel based. From the plots, the two distributions look approximately Gaussian
with a similar covariance structure, whose principal directions roughly coincide for both populations. This may be another
reason why a roughly linear decision boundary between the two classes provides good classification results. On the other
hand it is expected that due to the RBF kernel’s ability to create arbitrary curved decision surfaces, it will have better
generalization performance than the linear SVM's hyperplane (as demonstrated by the results described below) if the specific
score distribution does not follow a Gaussian distribution.
Figs. 5.11 and 5.12 below show an RBF kernel SVM trained on all three scores.
The RBF kernel created a closed three-dimensional surface around the INV points and had a classification accuracy of
99.95% CR, 99.30% CA. If considering u = 0 as the threshold, the triple score SVM with the RBF kernel function shows only little improvement over the linear SVM for this data set. However, as shown in Fig. 5.13 below, the SVM score distributions are significantly more separated, and the equal error rate is lower than that of the linear SVM; from 0.2% to 0.1%.

[Fig. 5.11 axes: Score 1, Score 2, Score 3.]
Fig. 5.11. RBF SVM kernel, three-dimensional discriminating surface view 1 as a function of scores 1–3.

[Fig. 5.12 axes: Score 1, Score 2, Score 3.]
Fig. 5.12. RBF SVM kernel, three-dimensional discriminating surface view 2 as a function of scores 1–3.

Fig. 5.13. Histogram of OOV (blue) and INV (red) score distributions for triple scoring using the RBF SVM kernel.
Table 5.2 summarizes all of the results up to this point.
5.7. Front end features
The Front End Processor produces three sets of features: MFCC (Mel Filtered Cepstral Coefficients), LPC (Linear Predictive
Coding) smoothed MFCC, and Enhanced MFCC (E-MFCC). The E-MFCC features have a higher dynamic range than regular
MFCC, so they have been chosen for most of the experiments. However, the next tests attempt to compare the performance
of E-MFCC, MFCC, and LPC features. In addition, a method of combining all three features has been developed and has shown excellent results.

Table 5.2
Score comparisons of conventional single scoring and the presented triple scoring using linear and RBF SVM kernels.

                                       Score 1–2 (%)         Score 1–3 (%)         Triple score (%)      EER (%)
Single scoring                         N/A                   N/A                   N/A                   15.5
Triple scoring, linear kernel          CR: 99.7 / CA: 98.6   CR: 99.5 / CA: 95.4   CR: 99.9 / CA: 99.5   0.2
Triple scoring, RBF kernel (γ = 0.008) CR: 99.8 / CA: 98.6   CR: 99.3 / CA: 95.6   CR: 99.9 / CA: 99.3   0.1

[Fig. 5.14 axes: recognition score (horizontal) vs. cumulative error percent (vertical).]
Fig. 5.14. Histogram of OOV (blue) and INV (red) score distributions for E-MFCC features.

[Fig. 5.15 axes: recognition score (horizontal) vs. cumulative error percent (vertical).]
Fig. 5.15. Histogram of OOV (blue) and INV (red) score distributions for MFCC features.
The first experiment verifies the classification performance of each feature stream independently. Three HMMs have
been generated with the following parameters: 30 states, 2 entry and exit silence states, 6 mixtures, 1 skip transition. The
HMMs were trained on utterances of the word ‘‘operator’’ taken from the CCW17 and WUW-II corpora. One HMM was
trained with E-MFCC features, one with MFCC features, and one with LPC features.
For OOV rejection testing, the Phonebook corpus, which contains over 8000 unique words, was divided into two sets,
Phonebook 1 and Phonebook 2. Each set was assigned roughly half of the utterances of every unique word, so both sets
covered the same range of vocabulary.
Recognition was then performed on the training data set, Phonebook 1 data and Phonebook 2 data. SVM models were
trained on the same training set and the Phonebook 1 set, with parameters γ = 0.001, C = 15. Classification tests were
performed on the training set and the Phonebook 2 set. Figs. 5.14–5.16 show the score distributions and error rates for each
set of front end features.
In this experiment LPC had an equal error rate slightly better than E-MFCC. However, the difference is negligible, and the
distributions and their separation look almost the same. MFCC had the worst equal error rate.
The next experiment attempts to combine information from all three sets of features. It is hypothesized that each feature
set contains some unique information, and combining all three of them would extract the maximum amount of information
from the speech signal. To combine them, the features are considered as three independent streams, and recognition and
triple scoring are performed on each stream independently. However, the scores from the three streams are aggregated and classified via SVM in a nine-dimensional space.

[Fig. 5.16 axes: recognition score (horizontal) vs. cumulative error percent (vertical).]
Fig. 5.16. Histogram of OOV (blue) and INV (red) score distributions for LPC features.

[Fig. 5.17 axes: recognition score (horizontal) vs. cumulative error percent (vertical).]
Fig. 5.17. Histogram of OOV (blue) and INV (red) score distributions for combined features.

Fig. 5.17 shows the results when combining the same scores from
the previous section using this method.
Combining the three sets of features produced a remarkable separation between the two distributions. The Phonebook
2 set used for OOV testing contains 49,029 utterances previously unseen by the recognizer, of which over 8,000 are unique
words. At an operating threshold of −0.5, just 16 of 49,029 utterances were misclassified, yielding an FA error of 0.03%. For
INV, just 1 of 573 utterances was misclassified.
6. Baseline and final results
This section establishes a baseline by examining the performance of HTK and Microsoft SAPI 5.1, a commercial speech
recognition engine. It then presents the final performance results of the WUW-SR system.
6.1. Corpora setup
Because Wake-Up-Word Speech Recognition is a new concept, it is difficult to compare its performance with existing
systems. In order to perform a fair, unbiased analysis of performance, all attempts were made to configure the test systems
to achieve the highest possible performance for WUW recognition. For the final experiments, once again the word "operator"
was chosen for testing. The first step entails choosing a good set of training and testing data.
The ideal INV data for building a speaker-independent word model would be a large number of isolated utterances from
a wide variety of speakers. As such, the INV training data was once again constructed from isolated utterances taken from
the CCW17 and WUW-II corpora, for a total of 573 examples.
Because the WUW-SR system is expected to listen continuously to all forms of speech and non-speech and never activate
falsely, it must be able to reject OOV utterances in both continuous free speech and isolated cases. The following two
scenarios have been defined for testing OOV rejection capability: continuous natural speech (Callhome corpus), and isolated
words (Phonebook corpus).
[Fig. 6.1 axes: recognition score (horizontal) vs. cumulative error percent (vertical).]
Fig. 6.1. Histogram of the overall recognition scores for the Callhome 2 test.
As described in Section 5, the Callhome corpus contains natural speech from phone call conversations in 30 min long
segments. The corpus spans 40 h of speech from over 160 different speakers, and contains over 8500 unique words. The
corpus has been split into two sets of 20 h each, Callhome 1 and Callhome 2.
The Phonebook corpus contains isolated utterances from over 1000 speakers and over 8000 unique words. The fact that
every utterance is isolated actually presents a worst case scenario to the WUW recognizer. In free speech, the signal is often
segmented by the VAD into long sentences, indicative of non-WUW context and easily classified as OOV. However, short
isolated words are much more likely to be confusable with the Wake-Up-Word. The Phonebook corpus was also divided into
two sets, Phonebook 1 and Phonebook 2. Each set was assigned roughly half of the utterances of every word. Phonebook 1
contained 48,598 words (7945 unique) and Phonebook 2 contained 44,669 words (7944 unique).
6.2. HTK
The first baseline comparison was against the Hidden Markov Model Toolkit (HTK) version 3.0 from Cambridge
University [24]. First, an HMM word model with similar parameters (number of states, mixtures, topology) as the WUW
model was generated. Next, the model was trained on the INV speech utterances. Note that while the utterances were the
same in both cases (i.e., HTK and e-WUW), the speech features (MFCCs) were extracted with the HTK feature extractor, not
the WUW front end. Finally, recognition was performed by computing the Viterbi score of test sequences against the model.
After examining the score of all sequences, a threshold was manually chosen to optimize FA and FR error rates.
6.3. Commercial SR engine—Microsoft SAPI 5.1
The second baseline comparison evaluates the performance of a commercial speech recognition system. The commercial
system was designed for dictation and command & control modes, and has no concept of "Wake-Up-Word". In order
to simulate a wake-up-word scenario, a command & control grammar was created with the wake-up-word as the only
command (e.g. ‘‘operator’’). There is little control of the engine parameters other than a generic ‘‘sensitivity’’ setting.
Preliminary tests were executed and the sensitivity of the engine was adjusted to the point where it gave the best results.
Speaker adaptation was also disabled for the system.
6.4. WUW-SR
For the WUW-SR system, three HMMs were trained; one with E-MFCC features, one with MFCC features, and one with
LPC smoothed MFCC features. All three used the following parameters: 30 states, 2 entry and exit silence states, 6 mixtures,
1 skip transition. SVMs were trained on the nine-dimensional score space obtained by HMM triple scoring on three feature
streams, with parameters γ = 0.001, C = 15.
Note that unlike the baseline recognizers, the WUW Speech Recognizer uses OOV data for training the SVM classifier. The
OOV training set consisted of Phonebook 1 and Callhome 1, while the testing set consisted of Phonebook 2 and Callhome 2.
Although it is possible to train a classifier using only INV examples, such as a one-class SVM, it has not been explored yet in
this work.
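A minimal sketch of how such a classifier could be trained and how false acceptances per hour might then be computed is shown below; the score arrays and corpus duration are placeholders, and SVC(kernel='rbf', gamma=0.001, C=15) simply mirrors the parameters quoted above.

```python
import numpy as np
from sklearn.svm import SVC

def train_wuw_classifier(inv_scores, oov_scores):
    """Train an RBF SVM on 9-D score vectors (3 feature streams x 3 scores each)."""
    X = np.vstack([inv_scores, oov_scores])
    y = np.hstack([np.ones(len(inv_scores)), -np.ones(len(oov_scores))])
    return SVC(kernel='rbf', gamma=0.001, C=15).fit(X, y)

def false_acceptances_per_hour(clf, oov_scores, corpus_hours, threshold=0.0):
    fa = np.sum(clf.decision_function(oov_scores) > threshold)
    return fa, fa / corpus_hours

# Placeholder data in lieu of real triple scores from Phonebook 1 / Callhome 1.
rng = np.random.default_rng(2)
clf = train_wuw_classifier(rng.normal(-75, 4, (573, 9)), rng.normal(-100, 8, (5000, 9)))
print(false_acceptances_per_hour(clf, rng.normal(-100, 8, (2000, 9)), corpus_hours=20.0))
```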
The plots in Figs. 6.1 and 6.2 show the combined feature results on the Callhome 2 and Phonebook 2 testing sets.
Table 6.1 summarizes the results of HTK, the commercial speech recognizer, and WUW-SR:
In Table 6.1, the Relative Error Rate Reduction is computed as:
$$\mathrm{RERR} = \frac{B - N}{N} \times 100\% \qquad (2)$$
where B is the baseline error rate and N is the new error rate. The reduction factor is computed as B/N.
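For example, taking the INV false rejection rates from Table 6.1 (HTK: 18.3%, WUW-SR: 0.70%):
$$\mathrm{RERR} = \frac{18.3 - 0.70}{0.70} \times 100\% \approx 2514\%, \qquad \text{reduction factor} = \frac{18.3}{0.70} \approx 26.$$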
[Fig. 6.2 axes: recognition score (horizontal) vs. cumulative error percent (vertical).]
Fig. 6.2. Histogram of the overall recognition scores for the Phonebook 2 test.
Table 6.1
Summary of final results presenting and comparing False Rejection (FR) and False Acceptance (FA) rates of WUW-SR vs. HTK and Microsoft SAPI 5.1, reported as Relative Error Rate Reduction and as error reduction factor.

                                 CCW17 + WUW-II (INV)   Phonebook 2 (OOV)            Callhome 2 (OOV)
Recognizer                       FR      FR (%)         FA      FA (%)    FA/h       FA      FA (%)   FA/h
HTK                              105     18.3           8793    19.6      488.5      2421    2.9      121.1
Microsoft SAPI 5.1 SR            102     16.6           5801    12.9      322.3      467     0.6      23.35
WUW-SR                           4       0.70           15      0.03      0.40       9       0.01     0.45

Reduction of Rel. Error Rate / Factor
WUW vs. HTK                      2514% / 26             65,233% / 653                28,900% / 290
WUW vs. Microsoft SAPI 5.1 SR    2271% / 24             42,900% / 430                5900% / 60
Table 6.1 demonstrates WUW-SR’s improvement over current state of the art recognizers. WUW-SR with three feature
streams was several orders of magnitude superior to the baseline recognizers in INV recognition and particularly in OOV
rejection. Of the two OOV corpora, Phonebook 2 was significantly more difficult to recognize for all three systems due to
the fact that every utterance was a single isolated word. Isolated words have similar durations to the WUW and differ just
in pronunciation, putting to the test the raw recognition power of an SR system. HTK had 8793 false acceptance errors on
Phonebook 2 and the commercial SR system had 5801 errors. WUW-SR demonstrated its OOV rejection capabilities by
committing just 15 false acceptance errors on the same corpus—an error rate reduction of 58,520% relative to HTK, and
38,573% relative to the commercial system.
On the Callhome 2 corpus, which contains lengthy telephone conversations, recognizers are helped by the VAD in the
sense that most utterances are segmented as long phrases instead of short words. Long phrases differ from the WUW in
both content and duration, and are more easily classified as OOV. Nevertheless, HTK had 2421 false acceptance errors and
the commercial system had 467 errors, compared to WUW’s 9 false acceptance errors in 20 h of unconstrained speech.
7. Conclusion
In this paper we have defined and investigated WUW Speech Recognition, and presented a novel solution. The most
important characteristic of a WUW-SR system is that it should guarantee virtually 100% correct rejection of non-WUW while
maintaining correct acceptance rate of 99% or higher. This requirement sets apart WUW-SR from other speech recognition
systems because no existing system can guarantee 100% reliability by any measure, and allows WUW-SR to be used in novel
applications that previously have not been possible.
The WUW-SR system developed in this work provides efficient and highly accurate speaker-independent recognition
at performance levels not achievable by current state-of-the-art recognizers.
Extensive testing demonstrates accuracy improvements of several orders of magnitude over the best-known academic
speech recognition system, HTK, as well as over a leading commercial speech recognition system. Specifically, the
WUW-SR system correctly detects the WUW with 99.3% accuracy. It correctly rejects non-WUW with 99.97% accuracy for
cases where they are uttered in isolation. In continuous and spontaneous free speech the WUW system makes 0.45 false
acceptance errors per hour, or one false acceptance in 2.22 h.
WUW detection performance is 2514%, or 26 times, better than HTK for the same training and testing data, and 2271%, or 24
times, better than Microsoft SAPI 5.1, a leading commercial recognizer. The non-WUW rejection performance is over 65,233%,
or 653 times, better than HTK, and 5900% to 42,900%, or 60 to 430 times, better than Microsoft SAPI 5.1.
8. Future work
In spite of the significant advancement of SR technologies, true natural language interaction with machines is not
yet possible with state-of-the-art systems. WUW-SR technology has demonstrated, through this work and through two
prototype software applications (a voice-only activated PowerPoint Commander and an Elevator simulator), that it can
enable spontaneous and truly natural interaction with a computer using only voice. For further information on these
software applications please contact the author ([email protected]). Currently there are no research or commercial
technologies capable of discriminating the same word or phrase with such accuracy while maintaining a practically 100%
correct rejection rate. Furthermore, the authors are not aware of any work or technology that discriminates the same word
based on its context using only acoustic features, as is done in WUW-SR.
Work is presently being conducted to further enhance the discriminability of a WUW spoken in the alerting context from
the same word spoken in other, referential contexts by using additional prosodic features. Intuitive and empirical evidence
suggests that the alerting context tends to be characterized by additional emphasis needed to obtain the desired attention.
This investigation is leading to the development of prosody-based features that capture this heightened emphasis, and
initial results using energy-based features show promise. Using appropriate data is paramount to the accuracy of this study;
an innovative procedure for collecting such data from video DVDs has been developed.
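For illustration only, and not the authors' prosodic feature set, the sketch below computes a short-time log-energy contour and a crude per-word emphasis measure of the sort that energy-based prosodic features might build on; the frame length, shift, and emphasis definition are assumptions.

import numpy as np

def log_energy_contour(signal, sample_rate, frame_ms=25.0, shift_ms=10.0):
    # Per-frame log energy (dB) of a mono waveform scaled to [-1, 1].
    frame_len = int(sample_rate * frame_ms / 1000.0)
    shift = int(sample_rate * shift_ms / 1000.0)
    energies = []
    for start in range(0, len(signal) - frame_len + 1, shift):
        frame = signal[start:start + frame_len]
        energies.append(10.0 * np.log10(np.sum(frame ** 2) + 1e-10))
    return np.asarray(energies)

def emphasis_score(signal, sample_rate):
    # Crude emphasis cue: how far the peak energy of the word rises above its median energy.
    e = log_energy_contour(signal, sample_rate)
    return float(np.max(e) - np.median(e))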
Although the current system performs admirably, there is room for improvement in several areas. Making SVM classification
independent of OOV data is one of the primary tasks; such a solution would enable various adaptation techniques to be
deployed in real or near-real time. The most important aspect of the presented solution is its generic nature and its
applicability to any basic speech unit being modeled, such as phonemes (context independent or context dependent, e.g.,
tri-phones), which is necessary for large vocabulary continuous speech recognition (LVCSR) applications. If even a fraction
of the accuracy reported here carries over to those models, as expected, this method has the potential to revolutionize
speech recognition technology. In order to demonstrate its expected superiority, the WUW-SR technology will be added to
an existing and publicly available recognition system such as HTK (http://htk.eng.cam.ac.uk/) or SPHINX
(http://cmusphinx.sourceforge.net/html/cmusphinx.php).
Acknowledgements
The authors acknowledge partial support from the AMALTHEA REU NSF Grant No. 0647120 and Grant No. 0647018,
and thank the members of the AMALTHEA 2007 team, Alex Stills, Brandon Schmitt, and Tad Gertz, for their dedication to the
project and their immense assistance with the experiments.
References
[1] Xuedong Huang, Alex Acero, Hsiao-Wuen Hon, Spoken Language Processing: A Guide to Theory, Algorithm and System Development, Prentice Hall
PTR, 2001.
[2] Ron Cole, Joseph Mariani, Hans Uszkoreit, Giovanni Batista Varile, Annie Zaenen, Antonio Zampolli, Victor Zue (Eds.), Survey of the State of the Art in
Human Language Technology, Cambridge University Press and Giardini, 1997.
[3] V. Këpuska, Wake-Up-Word Application for First Responder Communication Enhancement, SPIE, Orlando, 2006.
[4] T. Klein, Triple scoring of Hidden Markov Models in Wake-Up-Word speech recognition, Thesis, Florida Institute of Technology.
[5] V. Këpuska, Dynamic time warping (DTW) using frequency distributed distance measures, US Patent: 6983246, January 3, 2006.
[6] V. Këpuska, Scoring and rescoring dynamic time warping of speech, US Patent: 7085717, April 1, 2006.
[7] V. Këpuska, T. Klein, On Wake-Up-Word speech recognition task, technology, and evaluation results against HTK and Microsoft SDK 5.1, Invited Paper:
World Congress on Nonlinear Analysts, Orlando 2008.
[8] V. Këpuska, D.S. Carstens, R. Wallace, Leading and trailing silence in Wake-Up-Word speech recognition, in: Proceedings of the International
Conference: Industry, Engineering & Management Systems 2006, Cocoa Beach, FL, pp. 259–266.
[9] J.R. Rohlicek, W. Russell, S. Roukos, H. Gish, Continuous hidden Markov modeling for speaker-independent word spotting, vol. 1, 23–26 May 1989,
pp. 627–630.
[10] C. Myers, L. Rabiner, A. Rosenberg, An investigation of the use of dynamic time warping for word spotting and connected speech recognition, in:
ICASSP ’80. vol. 5, Apr 1980, pp. 173–177.
[11] A. Garcia, H. Gish, Keyword spotting of arbitrary words using minimal speech resources, in: ICASSP 2006, vol. 1, 14–19 May 2006, pp. I–I.
[12] D.A. James, S.J. Young, A fast lattice-based approach to vocabulary independent wordspotting, in: Proc. ICASSP ’94, vol. 1, 1994, pp. 377–380.
[13] S.P. Davis, P. Mermelstein, Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences, IEEE Trans.
ASSP 28 (1980) 357–366.
[14] John Makhoul, Linear prediction: A tutorial review, Proc. IEEE. 63 (1975) 4.
[15] Frederick Jelinek, Statistical Methods for Speech Recognition, The MIT Press, 1998.
[16] L.R. Rabiner, A tutorial on hidden Markov models and selected applications in speech recognition, Proc. IEEE. 77 (1989) 2.
[17] John C. Platt, Sequential minimal optimization: A fast algorithm for training support vector machines, Microsoft Research Technical Report, 1998.
[18] C.J.C. Burges, A tutorial on support vector machines for pattern recognition, Data Mining and Knowledge Discovery, 1998.
[19] Chih-Chung Chang, Chih-Jen Lin, LIBSVM: A Library for Support Vector Machines. Online 2001. http://www.csie.ntu.edu.tw/~cjlin/libsvm.
[20] Rong-En Fan, Pai-Hsuen Chen, Chih-Jen Lin, Working set selection using second order information for training support vector machines, J. Mach. Learn.
Res. 6 (2005) 1889–1918.
[21] Chih-Wei Hsu, Chih-Chung Chang, Chih-Jen Lin, A practical guide to support vector classification, Department of Computer Science and Information
Engineering, National Taiwan University, Taipei, Taiwan, 2007.
[22] C.C. Broun, W.M. Campbell, Robust out-of-vocabulary rejection for low-complexity speaker-independent speech recognition, Proc. IEEE Acoustics
Speech Signal Process. 3 (2000) 1811–1814.
[23] T. Hazen, I. Bazzi, A comparison and combination of methods for OOV word detection and word confidence scoring, in: IEEE International Conference
on Acoustics, Speech, and Signal Processing, ICASSP, Salt Lake City, Utah, May 2001.
[24] Cambridge University Engineering Department, HTK 3.0. [Online] http://htk.eng.cam.ac.uk.