A novel Wake-Up-Word speech recognition system, Wake-Up-Word recognition task, technology and evaluation

V.Z. Këpuska, T.B. Klein
Electrical & Computer Engineering Department, Florida Institute of Technology, Melbourne, FL 32901, USA

MSC: 68T10

Keywords: Wake-Up-Word; Speech recognition; Hidden Markov Models; Support Vector Machines; Mel-scale cepstral coefficients; Linear prediction spectrum; Enhanced spectrum; HTK; Microsoft SAPI

Abstract: Wake-Up-Word (WUW) is a new paradigm in speech recognition (SR) that is not yet widely recognized. This paper defines and investigates WUW speech recognition, and describes the details of this novel solution and the technology that implements it. WUW SR is defined as the detection of a single word or phrase when spoken in the alerting context of requesting attention, while rejecting all other words, phrases, sounds, noises and other acoustic events, as well as the same word or phrase spoken in a non-alerting context, with virtually 100% accuracy. In order to achieve this accuracy, the following innovations were accomplished: (1) Hidden Markov Model triple scoring with Support Vector Machine classification, (2) combining multiple speech feature streams: Mel-scale Filtered Cepstral Coefficients (MFCCs), Linear Prediction Coefficients (LPC)-smoothed MFCCs, and enhanced MFCCs, and (3) an improved Voice Activity Detector with Support Vector Machines. WUW detection and recognition performance is 2514%, or 26 times, better than HTK for the same training and testing data, and 2271%, or 24 times, better than the Microsoft SAPI 5.1 recognizer. The out-of-vocabulary rejection performance is over 65,233%, or 653 times, better than HTK, and 5900% to 42,900%, or 60 to 430 times, better than the Microsoft SAPI 5.1 recognizer. This solution, which utilizes a new recognition paradigm, applies not only to the WUW task but also to general speech recognition tasks.

1. Introduction

Automatic Speech Recognition (ASR) is the ability of a computer to convert a speech audio signal into its textual transcription [1]. Some motivations for building ASR systems, presented in order of difficulty, are to improve human–computer interaction through spoken language interfaces, to solve difficult problems such as speech-to-speech translation, and to build intelligent systems that can process spoken language as proficiently as humans [2]. Speech as a computer interface has numerous benefits over traditional interfaces such as a GUI with mouse and keyboard: speech is natural for humans, requires no special training, improves multitasking by leaving the hands and eyes free, and is often faster and more efficient than conventional input methods. Even though many tasks are solved with visual, pointing interfaces, and/or keyboards, speech has the potential to be a better interface for a number of tasks where full natural language communication is useful [2] and the recognition performance of the Speech Recognition (SR) system is sufficient to perform the tasks accurately [3,4]. This includes hands-busy or eyes-busy applications, such as those where the user has objects to manipulate or equipment/devices to control. Those are circumstances where the user wants to multitask while asking for assistance from the computer.
Novel SR technology named Wake-Up-Word (WUW) [5,6] bridges the gap between natural-language and other voice recognition tasks [7]. WUW SR is a highly efficient and accurate recognizer specializing in the detection of a single word or phrase when spoken in the alerting or WUW context [8] of requesting attention, while rejecting, with virtually 100% accuracy, all other words, phrases, sounds, noises and other acoustic events, including the same word or phrase uttered in non-alerting (referential) context. The WUW speech recognition task is similar to key-word spotting. However, WUW-SR differs in one important aspect: it is able to discriminate the specific word/phrase used only in alerting context (and not in other, e.g. referential, contexts). Specifically, the sentence ‘‘Computer, begin PowerPoint presentation’’ exemplifies the use of the word ‘‘Computer’’ in alerting context. On the other hand, in the sentence ‘‘My computer has dual Intel 64 bit processors each with quad cores’’ the word ‘‘computer’’ is used in referential (non-alerting) context. Traditional key-word spotters are not able to discriminate between the two cases; the discrimination is only possible by deploying a higher-level natural language processing subsystem. However, for such applications it is very difficult (practically impossible) to determine in real time whether the user is speaking to the computer or about the computer. Traditional approaches to key-word spotting are usually based on large vocabulary word recognizers [9], phone recognizers [9], or whole-word recognizers that either use HMMs or word templates [10]. Word recognizers require tens of hours of word-level transcriptions as well as a pronunciation dictionary. Phone recognizers require phone-marked transcriptions, and whole-word recognizers require word markings for each of the keywords [11]. Word-based keyword recognizers are not able to find words that are out-of-vocabulary (OOV). To solve this problem of OOV keywords, spotting approaches based on phone-level recognition have been applied to supplement or replace systems based upon large vocabulary recognition [12]. In contrast, our approach to WUW-SR is independent of the basic unit of recognition (word, phone or any other sub-word unit). Furthermore, it is capable of discriminating whether or not the user is speaking to the recognizer without deploying language modeling and natural language understanding methods. This feature makes the WUW task distinct from key-word spotting. To develop natural and spontaneous speech recognition applications an ASR system must: (1) be able to determine if the user is speaking to it or not, and (2) have orders of magnitude higher speech recognition accuracy and robustness. The current WUW-SR system demonstrates that the problem under (1) is solved with sufficient accuracy. The generality of the solution has also been successfully demonstrated on the task of event detection and recognition applied to seismic sensor signals (i.e., footsteps, hammer hits to the ground, hammer hits to a metal plate, and bowling ball drops).
Further investigations are underway to demonstrate the power of this approach to Large Vocabulary Speech Recognition (LVSR), solving problem (2) above. This effort is being done as part of NIST's Rich Transcription Evaluation program (http://www.nist.gov/speech/tests/rt/). The presented work defines, in Section 2, the concept of Wake-Up-Word (WUW), a novel paradigm in speech recognition. In Section 3, our solution to the WUW task is presented. The implementation details of the WUW recognizer are described in Section 4. The WUW speech recognition evaluation experiments are presented in Section 5. A comparative study between our WUW recognizer, HTK, and Microsoft's SAPI 5.1 is presented in Section 6. Concluding remarks and future work are provided in Sections 7 and 8.

2. Wake-Up-Word paradigm

A WUW speech recognizer is a highly efficient and accurate recognizer specializing in the detection of a single word or phrase when spoken in the context of requesting attention (e.g., alerting context), while rejecting, with virtually 100% accuracy, the same word or phrase spoken in referential (non-alerting) context as well as all other words, phrases, sounds, noises and other acoustic events. This high accuracy enables the development of speech-recognition-driven interfaces whose dialogs use speech only.

2.1. Creating 100% speech-enabled interfaces

One of the goals of speech recognition is to allow natural communication between humans and computers via speech [2], where natural implies similarity to the ways humans interact with each other on a daily basis. A major obstacle to this is the fact that most systems today still rely to a large extent on non-speech input, such as pushing buttons or mouse clicking. However, much like a human assistant, a natural speech interface must be continuously listening and must be robust enough to recover from any communication errors without non-speech intervention. Speech recognizers deployed in continuously listening mode are continuously monitoring acoustic input and do not necessarily require non-speech activation. This is in contrast to the push-to-talk model, in which speech recognition is only activated when the user pushes a button. Unfortunately, today's continuously listening speech recognizers are not reliable enough due to their insufficient accuracy, especially in the area of correct rejection. For example, such systems often respond erratically, even when no speech is present. They sometimes interpret background noise as speech, and they sometimes incorrectly assume that certain speech is addressed to the speech recognizer when in fact it is targeted elsewhere (context misunderstanding). These problems have traditionally been solved by the push-to-talk model: requesting the user to push a button immediately before or during talking, or a similar prompting paradigm. Another problem with traditional speech recognizers is that they cannot recover from errors gracefully, and often require non-speech intervention. Any speech-enabled human–machine interface based on natural language relies on carefully crafted dialogues. When the dialogue fails, there is currently no good mechanism to resynchronize the communication, and typically the transaction between human and machine fails by termination. A typical example is an SR system which is in a dictation state when in fact the human is attempting to use command-and-control to correct a previous dictation error.
Often the user is forced to intervene by pushing a button or keyboard key to resynchronize the system. Current SR systems that do not deploy the push-to-talk paradigm use implicit context switching. For example, a system that has the ability to switch from ‘‘dictation mode’’ to ‘‘command mode’’ does so by trying to infer whether the user is uttering a command rather than dictating text. This task is rather difficult to perform with high accuracy, even for humans. The push-to-talk model uses explicit context switching, meaning that the action of pushing the button explicitly sets the context of the speech recognizer to a specific state. To achieve the goal of developing a natural speech interface, it is first useful to consider human-to-human communication. Upon hearing an utterance of speech, a human listener must quickly make a decision whether or not the speech is directed towards him or her. This decision determines whether the listener will make an effort to ‘‘process’’ and understand the utterance. Humans can make this decision quickly and robustly by utilizing visual, auditory, semantic, and/or contextual clues. Visual clues might be gestures such as waving of hands, or facial expressions. Auditory clues are attention-grabbing words or phrases such as the listener's name (e.g. John), interjections such as ‘‘hey’’, ‘‘excuse me’’, and so forth. Additionally, the listener may make use of prosodic information such as pitch and intonation, as well as identification of the speaker's voice. Semantic and/or contextual clues are inferred from interpretation of the words in the sentence being spoken, visual input, prior experience, and customs dictated by the culture. For example, if one knows that a nearby speaker is in the process of talking on the phone with someone else, he/she will ignore the speaker altogether, knowing that the speech is not targeting him/her. Humans are very robust in determining when speech is targeted towards them, and should a computer SR system be able to make the same decisions, its robustness would increase significantly. Wake-Up-Word is proposed as a method to explicitly request the attention of a computer using a spoken word or phrase. The WUW must be spoken in the context of requesting attention and should not be recognized in any other context. After successful detection of the WUW, the speech recognizer may safely interpret the following utterance as a command. The WUW is analogous to the button in push-to-talk, but the interaction is completely based on speech. Therefore, it is proposed to use explicit context switching via WUW. This is similar to how context switching occurs in human-to-human communication.

2.2. WUW-SR differences from other SR tasks

Wake-Up-Word is often mistakenly compared to other speech recognition tasks such as key-word spotting or command & control, but WUW is different from the previously mentioned tasks in several significant ways. The most important characteristic of a WUW-SR system is that it should guarantee virtually 100% correct rejection of non-WUW, and of the same words uttered in non-alerting context, while maintaining a correct acceptance rate of over 99%. This requirement sets WUW-SR apart from other speech recognition systems because no existing system can guarantee 100% reliability by any measure without significantly lowering the correct recognition rate. It is this guarantee that allows WUW-SR to be used in novel applications that previously have not been possible.
Second, a WUW-SR system should be context dependent; that is, it should detect only words uttered in alerting context. Unlike key-word spotting, which tries to find a specific keyword in any context, the WUW should only be recognized when spoken in the specific context of grabbing the attention of the computer in real time. Thus, WUW recognition can be viewed as a refined key-word spotting task, albeit a significantly more difficult one. Finally, WUW-SR should maintain high recognition rates in speaker-independent or speaker-dependent mode, and in various acoustic environments.

2.3. Significance of WUW-SR

The reliability of WUW-SR opens up the world of speech recognition to applications that were previously impossible. Today's speech-enabled human–machine interfaces are still regarded with skepticism, and people are hesitant to entrust any significant or accuracy-critical tasks to a speech recognizer. Despite the fact that SR is becoming almost ubiquitous in the modern world, widely deployed in mobile phones, automobiles, desktop, laptop, and palm computers, many handheld devices, telephone systems, etc., the majority of the public pays little attention to speech recognition. Moreover, the majority of speech recognizers use the push-to-talk paradigm rather than continuous listening, simply because they are not robust enough against false positives. One can imagine that the driver of a speech-enabled automobile would be quite unhappy if his or her headlights suddenly turned off because the continuously listening speech recognizer misunderstood a phrase in the conversation between driver and passenger. The accuracy of speech recognizers is often measured by word error rate (WER), which uses three measures [1]:
Insertion (INS) – an extra word was inserted in the recognized sentence.
Deletion (DEL) – a correct word was omitted in the recognized sentence.
Substitution (SUB) – an incorrect word was substituted for a correct word.
WER is defined as:
WER = (INS + DEL + SUB) / (true number of words in the sentence) × 100%. (1)
Substitution errors (or, equivalently, correct recognition in the WUW context) represent the accuracy of the recognition, while insertion and deletion errors are caused by other factors, typically erroneous segmentation. However, WER as an accuracy measurement is of limited usefulness to a continuously listening command and control system. To understand why, consider a struggling movie director who cannot afford a camera crew and decides to install a robotic camera system instead. The system is controlled by a speech recognizer that understands four commands: ‘‘lights’’, ‘‘camera’’, ‘‘action’’, and ‘‘cut’’. The programmers of the speech recognizer claim that it has 99% accuracy, and the director is eager to try it out. When the director sets up the scene and utters ‘‘lights’’, ‘‘camera’’, ‘‘action’’, the recognizer makes no mistake, and the robotic lights and cameras spring into action. However, as the actors begin performing the dialogue in their scene, the computer misrecognizes ‘‘cut’’ and stops the film, ruining the scene and costing everyone their time and money. The actors could only speak 100 words before the recognizer, which truly had 99% accuracy, triggered a false acceptance. This anecdote illustrates two important ideas. First, judging the accuracy of a continuously listening system requires using a measure of ‘‘rejection’’, that is, the ability of the system to correctly reject out-of-vocabulary utterances.
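To make this concrete, the following minimal sketch evaluates Eq. (1) and the rejection arithmetic behind the anecdote; the speaking rate used below is an assumed illustrative figure, not a number from the paper.

```python
def wer(ins: int, dele: int, sub: int, n_ref_words: int) -> float:
    """Word error rate per Eq. (1): (INS + DEL + SUB) / N * 100%."""
    return 100.0 * (ins + dele + sub) / n_ref_words

# One substitution and one insertion against a 100-word reference sentence.
print(wer(ins=1, dele=0, sub=1, n_ref_words=100))   # 2.0 (%)

# Rejection arithmetic: 99% per-word rejection accuracy means roughly one
# false acceptance per 100 spoken words of out-of-vocabulary speech.
rejection_rate = 0.99
words_per_minute = 150                               # assumed speaking rate
fa_per_hour = (1.0 - rejection_rate) * words_per_minute * 60
print(fa_per_hour)                                   # 90.0 false acceptances/h
```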
The WER formula incorrectly assumes that all speech utterances are targeted at the recognizer and that all speech arrives in simple, consecutive sentences. Consequently, performance of WUW-SR is measured in terms of correct acceptances (CA) and correct rejections (CR). Because it is difficult to quantify and compare the number of CRs, rejection performance is also given in terms of ‘‘false acceptances per hour’’. The second note of interest is that 99% rejection accuracy is actually a very poor performance level for a continuously listening command and control system. In fact, the 99% accuracy claim is a misleading figure. While 99% acceptance accuracy is impressive in terms of recognition performance, 99% rejection implies one false acceptance per 100 words of speech. It is not uncommon for humans to speak hundreds of words per minute, and such a system would trigger multiple false acceptances per minute. That is the reason why today's speech recognizers (a) primarily use a ‘‘push-to-talk’’ design, and (b) are limited to performing simple convenience functions and are never relied on for critical tasks. On the other hand, experiments have shown the WUW-SR implementation presented in this work to reach up to 99.99% rejection accuracy, which translates to one false acceptance every 2.2 h.

3. Wake-Up-Word problem identification and solution

Wake-Up-Word recognition requires solving three problems: (1) detection of the WUW (e.g., alerting) context, (2) correct identification of the WUW, and (3) correct rejection of non-WUW.

3.1. Detecting WUW context

The Wake-Up-Word is proposed as a means to grab the computer's attention with extremely high accuracy. Unlike keyword spotting, the recognizer should not trigger on every instance of the word, but rather only in certain alerting contexts. For example, if the Wake-Up-Word is ‘‘computer’’, the WUW should not trigger if spoken in a sentence such as ‘‘I am now going to use my computer’’. Some previous attempts at WUW-SR avoided context detection by choosing a word that is unique and unlikely to be spoken. An example of this is Wildfire Communications' ‘‘wildfire’’ word introduced in the early 1990s. However, even though the solution was customized for the specific word ‘‘wildfire’’, it was neither accurate enough nor robust. The implementation of WUW presented in this work selects context based on leading and trailing silence. Speech is segmented from background noise by the Voice Activity Detector (VAD), which makes use of three features for its decisions: energy difference from the long-term average, spectral difference from the long-term average, and Mel-scale Filtered Cepstral Coefficients (MFCC) [13] difference from the long-term average. In the future, prosody information (pitch, energy, intonation of speech) will be investigated as an additional means to determine the context as well as to set the VAD state.

3.2. Identifying WUW

The VAD is responsible for finding utterances spoken in the correct context and segmenting them from the rest of the audio stream. After the VAD makes a decision, the next task of the system is to identify whether or not the segmented utterance is a WUW. The front end extracts the following features from the audio stream: MFCC [13], LPC-smoothed MFCC [14], and enhanced MFCC, a proprietary technology. The MFCC and LPC features are standard features deployed by the vast majority of recognizers.
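As an illustration of the standard MFCC stream, the following rough numpy sketch computes the coefficients for a single frame; the sampling rate, frame length, filterbank size and the suggested LPC-smoothing variant are assumptions for illustration only, not the parameters of the actual front end (described in Section 4).

```python
import numpy as np
from scipy.fftpack import dct

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sr):
    """Triangular mel-spaced filters over the FFT magnitude bins."""
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        fb[i - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fb[i - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    return fb

def mfcc_frame(frame, sr=8000, n_fft=256, n_filters=23, n_ceps=13):
    """Pre-emphasis -> window -> |FFT| -> mel filtering -> log -> DCT."""
    emph = np.append(frame[0], frame[1:] - 0.97 * frame[:-1])
    spec = np.abs(np.fft.rfft(emph * np.hamming(len(emph)), n_fft))
    mel_energies = mel_filterbank(n_filters, n_fft, sr) @ spec
    return dct(np.log(mel_energies + 1e-10), norm='ortho')[:n_ceps]

# For an LPC-smoothed variant, |FFT| would be replaced by the LPC model
# spectrum before mel filtering (an assumption based on Fig. 4.2 below).
frame = np.random.randn(200)        # one 25 ms frame at 8 kHz (synthetic)
print(mfcc_frame(frame).shape)      # (13,)
```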
The enhanced MFCCs are introduced to validate their expected superior performance over the standard features, specifically under noisy conditions. Finally, although all three features individually may provide comparable performance, it is hypothesized that the combination of all three features will provide significant performance gains, as demonstrated in this paper. This in turn indicates that each feature captures slightly different short-time characteristics of the speech signal segment from which it is produced. Acoustic modeling is performed with Hidden Markov Models (HMM) [15,16] with triple scoring [4], an approach further developed throughout this work. Finally, classification of those scores is performed with Support Vector Machines (SVM) [17–21].

3.3. Correct rejection of non-WUW

The final and critical task of a WUW system is correctly rejecting Out-Of-Vocabulary (OOV) speech. The WUW-SR system should have a correct rejection rate of virtually 100% while maintaining high correct recognition rates of over 99%. The task of rejecting OOV segments is difficult, and there are many papers in the literature discussing this topic. Typical approaches use garbage models, filler models, and/or noise models to capture extraneous words or sounds that are out of vocabulary. For example, in [22], a large number of garbage models is used to capture as much of the OOV space as possible. In [23], OOV words are modeled by creating a generic word model which allows for arbitrary phone sequences during recognition, such as the set of all phonetic units in the language. Their approach yields a correct acceptance rate of 99.2% and a false acceptance rate of 71% on data collected from the Jupiter weather information system. The WUW-SR system presented in this work uses triple scoring, a novel scoring technique introduced by the author [4] and further developed throughout this work, to assign three scores to every incoming utterance. The three scores produced by the triple scoring approach are highly correlated, and furthermore the correlation between scores of INV utterances is distinctly different from that of OOV utterances. The difference in correlations makes INV and OOV utterances highly separable in the score space, and an SVM classifier is used to produce the optimal separating surface.

4. Implementation of WUW-SR

The WUW Speech Recognizer is a complex software system comprising three major parts: Front End, VAD, and Back End. The Front End is responsible for extracting features from an input audio signal. The VAD (Voice Activity Detector) segments the signal into speech and non-speech regions. Finally, the Back End performs recognition and scoring. The diagram in Fig. 4.1 below shows the overall procedure.

[Fig. 4.1. WUW-SR overall architecture. The signal processing module accepts the raw samples of the audio signal and produces a spectral representation of a short-time signal. The feature extraction module generates features from the spectral representation of the signal. Those features are decoded with corresponding HMMs. The individual feature scores are classified using an SVM.]

The WUW recognizer has been implemented entirely in C++, and is capable of running live and performing recognitions in real time. Functionally, it is broken into three major units (a schematic sketch of this data flow follows the list):
(1) Front End: responsible for feature extraction and VAD classification of each frame;
(2) Back End: performs word segmentation and classification of those segments for each feature stream; and
(3) INV/OOV classification using the individual HMM segmental scores with an SVM.
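The class and function names below are hypothetical; this is only a schematic sketch of the data flow in Fig. 4.1 (per-stream features for one VAD segment, per-stream HMM triple scores, then SVM classification), not the actual C++ implementation.

```python
import numpy as np

STREAMS = ("mfcc", "lpc_mfcc", "enh_mfcc")   # the three feature streams

def classify_segment(frames_by_stream, hmms, svm, threshold=0.0):
    """Score one VAD segment and decide INV vs. OOV.

    frames_by_stream: dict mapping stream name -> feature matrix (T x D)
    hmms:             dict mapping stream name -> model exposing triple_score()
    svm:              classifier exposing an sklearn-style decision_function()
    """
    scores = []
    for name in STREAMS:
        # Each HMM returns three scores for the whole observation sequence.
        scores.extend(hmms[name].triple_score(frames_by_stream[name]))
    score_vector = np.asarray(scores).reshape(1, -1)        # shape (1, 9)
    svm_score = float(svm.decision_function(score_vector)[0])
    return ("INV" if svm_score >= threshold else "OOV"), svm_score
```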
The Front End Processor and the HMM software can also be run and scripted independently, and they are accompanied by numerous MATLAB scripts to assist with the batch processing necessary for training the models.

[Fig. 4.2. Details of the signal processing in the Front End. It contains a common spectrogram computation module, a feature-based VAD, and modules that compute three features (MFCC, LPC-MFCC, and enhanced MFCC) for each analysis frame. The VAD classifier determines the state (speech or no speech) of each frame, indicating whether a segment contains speech. This information is used by the Back End of the recognizer.]

4.1. Front end

The front end is responsible for extracting features out of the input signal. Three sets of features are extracted: Mel Filtered Cepstral Coefficients (MFCC) [13], LPC (Linear Predictive Coding) [14] smoothed MFCCs, and Enhanced MFCCs. The diagram in Fig. 4.2 illustrates the architecture of the Front End and VAD. Fig. 4.3 shows an example of the waveform superimposed with its VAD segmentation, its spectrogram, and its enhanced spectrogram.

[Fig. 4.3. Example of the speech signal (first plot, in blue) with its VAD segmentation (in red), its spectrogram (second plot), and its enhanced spectrogram (third plot), as generated by the Front End module.]

4.2. Back end

The Back End is responsible for classifying observation sequences as In-Vocabulary (INV), i.e. the sequence of frames hypothesized as a segment is a Wake-Up-Word, or Out-Of-Vocabulary (OOV), i.e. the sequence is not a Wake-Up-Word. The WUW-SR system uses a combination of Hidden Markov Models and Support Vector Machines for acoustic modeling, and as a result the back end consists of an HMM recognizer and an SVM classifier. Prior to recognition, HMM and SVM models must be created and trained for the word or phrase which is to be the Wake-Up-Word. When the VAD state changes from VAD_OFF to VAD_ON, the HMM recognizer resets and prepares for a new observation sequence. As long as the VAD state remains VAD_ON, feature vectors are continuously passed to the HMM recognizer, where they are scored using the novel triple scoring method. If multiple feature streams are used (a process explained in a later section), recognition is performed for each stream in parallel. When the VAD state changes from VAD_ON to VAD_OFF, multiple scores are obtained from the HMM recognizer and are sent to the SVM classifier. The SVM produces a classification score which is compared against a threshold to make the final classification decision of INV or OOV.
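The frame-level VAD decision logic that drives these VAD_ON/VAD_OFF transitions is not spelled out in the paper; the sketch below only illustrates the general idea stated in Section 3.1 (per-frame distances of energy, spectrum and MFCC from slowly updated long-term background averages). The thresholds, update rate and voting rule are assumptions.

```python
import numpy as np

class SimpleVAD:
    """Illustrative frame-level VAD: compares each frame against slowly
    adapted long-term background averages of energy, spectrum and MFCC."""

    def __init__(self, alpha=0.995, energy_thr=6.0, spec_thr=4.0, mfcc_thr=3.0):
        self.alpha = alpha                     # background update rate (assumed)
        self.thr = (energy_thr, spec_thr, mfcc_thr)
        self.bg = None                         # background energy/spectrum/MFCC
        self.state = "VAD_OFF"

    def step(self, energy, spectrum, mfcc):
        feats = [float(energy), np.asarray(spectrum, float), np.asarray(mfcc, float)]
        if self.bg is None:
            self.bg = feats                    # initialize background estimates
            return self.state
        dists = (abs(feats[0] - self.bg[0]),
                 np.linalg.norm(feats[1] - self.bg[1]),
                 np.linalg.norm(feats[2] - self.bg[2]))
        votes = sum(d > t for d, t in zip(dists, self.thr))
        self.state = "VAD_ON" if votes >= 2 else "VAD_OFF"
        if self.state == "VAD_OFF":            # adapt background only on non-speech
            for i in range(3):
                self.bg[i] = self.alpha * self.bg[i] + (1 - self.alpha) * feats[i]
        return self.state
```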
5. Wake-Up-Word speech recognition experiments

For HMM training and recognition we used a number of corpora, as described next.

5.1. Speech corpora

The speech recognition experiments made use of the following corpora: CCW17, WUW-I, WUW-II, Phonebook, and Callhome. The properties of each corpus are summarized in Table 5.1.

Table 5.1. Corpora used for training and testing. CCW17, WUW-I and WUW-II are corpora privately collected by ThinkEngine Networks. Phonebook and Callhome are publicly available corpora from the Linguistic Data Consortium (LDC) of the University of Pennsylvania.

CCW17
Speech type: Isolated words
Vocabulary size: 10
Unique speakers: More than 200
Total duration: 7.9 h
Total words: 4234
Channel type: Telephone
Channel quality: Variety of noise conditions
The CCW17 corpus contains 10 isolated words recorded over the telephone from multiple speakers. The calls were made from land lines, cellular phones, and speakerphones, under a variety of noise conditions. The data is grouped by telephone call, and each call contains one speaker uttering up to 10 words. The speakers have varying backgrounds, but are predominantly American from the New England region and Indian (many of whom carry a heavy Indian accent).

WUW-I & WUW-II
Speech type: Isolated words & sentences
WUW-I and WUW-II are similar to CCW17. They consist of isolated words and additionally contain sentences that use those words in various ways.

Phonebook
Speech type: Isolated words
Vocabulary size: 8002
Total words: 93,267
Unique speakers: 1358
Length: 37.6 h
Channel type: Telephone
Channel quality: Quiet to noisy
The Phonebook corpus consists of 93,267 files, each containing a single utterance surrounded by two small periods of silence. The utterances were recorded by 1358 unique individuals. The corpus has been split into a training set of 48,598 words (7945 unique) spanning 19.6 h of speech, and a testing set of 44,669 words (7944 unique) spanning 18 h of speech.

Callhome
Speech type: Free & spontaneous
Vocabulary size: 8577
Total words: 165,546
Unique speakers: More than 160
Total duration: 40 h
Channel type: Telephone
Channel quality: Quiet to noisy
The Callhome corpus contains 80 natural telephone conversations in English of length 30 min each. For each conversation, both channels were recorded and included separately.

5.2. Selecting INV and OOV utterances

The HMMs used throughout this work are predominantly whole-word models, and HMM training requires many examples of the same word spoken in isolation or segmented from continuous speech. As such, training utterances were only chosen from CCW17, WUW-I, and WUW-II, but OOV utterances were chosen from all corpora. The majority of the preliminary tests were performed using the CCW17 corpus, which is well organized and very easy to work with. Of the ten words available in CCW17, the word ‘‘operator’’ has been chosen for most experiments as the INV word. This word is not especially unique in the English vocabulary and is moderately confusable (e.g., ‘‘operate’’, ‘‘moderator’’, ‘‘narrator’’, etc.), making it a good candidate for demonstrating the power of WUW speech recognition.

5.3. INV/OOV decisions & cumulative error rates

The WUW speech recognizer Back End is essentially a binary classifier that, given an arbitrary-length input sequence of observations, classifies it as In-Vocabulary (INV) or Out-Of-Vocabulary (OOV). Although the internal workings are very complex and can involve a hierarchy of Hidden Markov Models, Support Vector Machines, and other classifiers, the end result is that any given input sequence produces a final score.
The score space is divided into two regions, R_1 and R_-1, based on a chosen decision boundary, and an input sequence can be marked as INV or OOV depending on which region it falls in. The score space can be partitioned at any threshold depending on the application requirements, and the error rates can be computed for that threshold. An INV utterance that falls in the OOV region is a False Rejection error, while an OOV utterance that falls in the INV region is considered a False Acceptance error. The OOV error rate (False Acceptances or False Positives) and the INV error rate (False Rejections or False Negatives) can be plotted as functions of the threshold over the range of scores. The point where they intersect is known as the Equal Error Rate (EER), and is often used as a simple measure of classification performance. The EER defines a score threshold for which the False Rejection rate is equal to the False Acceptance rate. However, the EER is often not the minimum error, especially when the scores do not have well-formed Gaussian or uniform distributions, nor is it the optimal threshold for most applications. For instance, in Wake-Up-Word the primary concern is minimizing OOV error, so the operating point would be biased so as to pick a threshold that minimizes the False Acceptance rate at the expense of increasing the False Rejection rate. For the purpose of comparison and evaluation, the following experiments include plots of the INV and OOV recognition score distributions and the OOV and INV error rate curves, as well as the point of equal error rate. Where applicable, the WUW optimal threshold and the error rates at that threshold are also shown.

5.4. Plain HMM scores

For the first tests on speech data, an HMM was trained on the word ‘‘operator’’. The training sequences were taken from the CCW17 and WUW-II corpora for a total of 573 sequences from over 200 different speakers. After features were extracted, some of the erroneous VAD segments were manually removed. The INV testing sequences were the same as the training sequences, while the OOV testing sequences included the rest of the CCW17 corpus (3833 utterances, 9 different words, over 200 different speakers). The HMM was a left-to-right model with no skips, 30 states, and 6 mixtures per state, and was trained with two iterations of the Baum–Welch algorithm [15,16]. The score is the result of the Viterbi algorithm over the input sequence. Recall that the Viterbi algorithm finds the state sequence that has the highest probability of being taken while generating the observation sequence. The final score is that probability normalized by the number of input observations, T. Fig. 5.1 below shows the results: the distributions look Gaussian, but there is a significant overlap between them.
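The equal error rate (EER) discussed next can be estimated directly from the two score populations; a minimal sketch is shown here, with synthetic scores standing in for the real INV/OOV distributions.

```python
import numpy as np

def equal_error_rate(inv_scores, oov_scores):
    """Sweep a threshold over the observed score range and return the point
    where the False Rejection rate (INV below threshold) and the False
    Acceptance rate (OOV at or above threshold) are closest."""
    best = (np.inf, None)
    for t in np.unique(np.concatenate([inv_scores, oov_scores])):
        fr = np.mean(inv_scores < t)        # INV classified as OOV
        fa = np.mean(oov_scores >= t)       # OOV classified as INV
        if abs(fr - fa) < best[0]:
            best = (abs(fr - fa), (t, fr, fa))
    return best[1]

rng = np.random.default_rng(0)
inv = rng.normal(-80, 8, 573)               # synthetic INV Viterbi-style scores
oov = rng.normal(-100, 10, 3833)            # synthetic OOV scores
print(equal_error_rate(inv, oov))           # (threshold, FR, FA) near the EER
```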
The equal error rate of 15.5% essentially means that, at that threshold, 15.5% of the OOV words would be classified as INV, and 15.5% of the INV words would be classified as OOV. Obviously, no practical applications can be developed based on the performance of this recognizer.

[Fig. 5.1. Score distributions for the plain HMM on CCW17 corpus data for the WUW ‘‘operator’’.]

[Fig. 5.2. Scatter plot of Score 1 vs. Score 2 and the corresponding independently constructed histograms.]

[Fig. 5.3. Scatter plot of Score 1 vs. Score 3 and the corresponding independently constructed histograms.]

5.5. Triple scoring method

In addition to the Viterbi score, the algorithm developed in this work produces two additional scores for any given observation sequence. When considering the three scores as features in a three-dimensional space, the separation between the INV and OOV distributions increases significantly. The next experiment runs recognition on the same data as above, but this time the recognizer uses the triple scoring algorithm to output three scores. Figs. 5.2 and 5.3 show two-dimensional scatter plots of Score 1 vs. Score 2 and Score 1 vs. Score 3 for each observation sequence. In addition, a histogram on the horizontal axis shows the distribution of Score 1 independently, and a similar histogram on the vertical axis shows the distributions of Score 2 and Score 3 independently. The histograms are hollowed out so that the overlap between distributions can be seen clearly. The distribution for Score 1 is exactly the same as in the previous section, as the data and model have not changed. No individual score produces a good separation between classes, and in fact the Score 2 distributions have almost complete overlap. However, the two-dimensional separation in either case is remarkable.

5.6. Classification with Support Vector Machines

In order to automatically classify an input sequence as INV or OOV, the triple score feature space, R^3, can be partitioned by a binary classifier into two regions, R^3_1 and R^3_-1. Support Vector Machines have been selected for this task for the following reasons: they can produce various kinds of decision surfaces, including radial basis function, polynomial, and linear; and they employ Structural Risk Minimization (SRM) [18] to maximize the margin, which has been shown empirically to give good generalization performance. Two types of SVMs have been considered for this task: linear and RBF. The linear SVM uses a dot product kernel function, K(x, y) = x · y, and separates the feature space with a hyperplane. It is very computationally efficient because no matter how many support vectors are found, evaluation requires only a single dot product.
Figs. 5.4 and 5.5 below show that the separation between distributions based on Score 1 and Score 3 is almost linear, so a linear SVM would likely give good results. However, in the Score 1 / Score 2 space the distributions have a curvature, so the linear SVM is unlikely to generalize well for unseen data. Figs. 5.4 and 5.5 show the decision boundary found by a linear SVM trained on Scores 1 + 2 and Scores 1 + 3, respectively. The line in the center represents the contour of the SVM function at u = 0, and the outer two lines are drawn at u = ±1.

[Fig. 5.4. Linear SVM decision boundary on the scatter plot of Scores 1–2.]

[Fig. 5.5. Linear SVM decision boundary on the scatter plot of Scores 1–3.]

Using 0 as the threshold, the accuracy for Scores 1–2 is 99.7% Correct Rejection (CR) and 98.6% Correct Acceptance (CA), while for Scores 1–3 it is 99.5% CR and 95.5% CA. When considering only two features, Scores 1 and 2 seem to have better classification ability. However, combining the three scores produces the plane shown below, in Figs. 5.6 and 5.7, from two different angles.

[Fig. 5.6. Linear SVM: three-dimensional discriminating surface, view 1, as a function of Scores 1–3.]

[Fig. 5.7. Linear SVM: three-dimensional discriminating surface, view 2, as a function of Scores 1–3.]

The plane split the feature space with an accuracy of 99.9% CR and 99.5% CA (just 6 of 4499 total sequences were misclassified). The accuracy was better than in any of the two-dimensional cases, indicating that Score 3 contains additional information not found in Score 2 and vice versa. The classification error rate of the linear SVM is shown in Fig. 5.8.

[Fig. 5.8. Histogram of score distributions for triple scoring using the linear SVM discriminating surface.]

The conclusion is that using the triple scoring method combined with a linear SVM decreased the equal error rate on this particular data set from 15.5% to 0.2%, or in other words increased accuracy by over 77.5 times (i.e., an error rate reduction of 7650%)!

In the next experiment, a Radial Basis Function (RBF) kernel was used for the SVM. The RBF function, K(x, y) = e^(−γ|x−y|²), maps feature vectors into a potentially infinite-dimensional Hilbert space and is able to achieve complete separation between classes in most cases. However, the γ parameter must be chosen carefully in order to avoid overtraining. As there is no way to determine it automatically, a grid search may be used to find a good value. For most experiments γ = 0.008 gave good results. Shown below in Figs. 5.9 and 5.10 are the RBF SVM contours for both two-dimensional cases.

[Fig. 5.9. RBF SVM kernel decision contours on the scatter plot of Scores 1–2.]

[Fig. 5.10. RBF SVM kernel decision contours on the scatter plot of Scores 1–3.]
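As a quick illustration of this kernel comparison, the following scikit-learn sketch grid-searches γ for an RBF SVM and contrasts it with a linear SVM on synthetic two-dimensional scores; the data, the γ grid and the C value are assumptions, not the paper's models.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

rng = np.random.default_rng(1)
# Synthetic stand-ins for (Score 1, Score 2) pairs: OOV labeled -1, INV +1.
oov = rng.normal([-100.0, -95.0], [10.0, 9.0], size=(800, 2))
inv = rng.normal([-75.0, -72.0], [5.0, 5.0], size=(200, 2))
X = np.vstack([oov, inv])
y = np.hstack([-np.ones(len(oov)), np.ones(len(inv))])

# Grid-search gamma for the RBF kernel, as suggested in the text.
rbf = GridSearchCV(SVC(kernel="rbf", C=15), {"gamma": [0.001, 0.008, 0.05]}, cv=3)
rbf.fit(X, y)
lin = SVC(kernel="linear", C=15).fit(X, y)

print("best gamma:", rbf.best_params_["gamma"])
# Support-vector counts: the model with fewer support vectors is cheaper to
# evaluate in real time, one of the selection criteria discussed below.
print("support vectors (rbf, linear):",
      rbf.best_estimator_.n_support_.sum(), lin.n_support_.sum())
```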
At the u = 0 threshold, the classification accuracy was 99.8% CR, 98.6% CA for Scores 1–2, and 99.4% CR, 95.6% CA for Scores 1–3. In both cases the RBF kernel formed a closed decision region around the INV points (recall that the SVM decision function at u = 0 is shown by the green line). Some interesting observations can be made from these plots. First, the INV outlier in the bottom left corner of the first plot caused a region to form around it. The SVM function output values inside the region were somewhere between −1 and 0, not high enough to cross into the INV class. However, it is apparent that the RBF kernel is sensitive to outliers, so the γ parameter must be chosen carefully to prevent overtraining. Had the γ parameter been a little higher, the SVM function output inside the circular region would have increased beyond 0, and that region would have been considered INV. Second, the RBF kernel's classification accuracy showed almost no improvement over the linear SVM. However, a number of factors should be considered when making the final decision about which kernel to use. If both models have similar performance, it is usually the simplest one that is eventually chosen (according to Occam's Razor). RBF kernels feature an extra parameter (γ) and are therefore considered more complex than linear kernels. However, the criterion that matters most in a real-time application such as this is not necessarily the ability to form curved decision boundaries but the number of support vectors utilized by each model; the one with the fewest support vectors is the simplest, and it is the one finally chosen, whether RBF or linear kernel based. From the plots, the two distributions look approximately Gaussian with a similar covariance structure, whose principal directions roughly coincide for both populations. This may be another reason why a roughly linear decision boundary between the two classes provides good classification results. On the other hand, it is expected that, due to the RBF kernel's ability to create arbitrarily curved decision surfaces, it will have better generalization performance than the linear SVM's hyperplane (as demonstrated by the results described below) if the specific score distribution does not follow a Gaussian distribution. Figs. 5.11 and 5.12 below show an RBF kernel SVM trained on all three scores.

[Fig. 5.11. RBF SVM kernel: three-dimensional discriminating surface, view 1, as a function of Scores 1–3.]

[Fig. 5.12. RBF SVM kernel: three-dimensional discriminating surface, view 2, as a function of Scores 1–3.]

The RBF kernel created a closed three-dimensional surface around the INV points and had a classification accuracy of 99.95% CR, 99.30% CA. Considering u = 0 as the threshold, the triple score SVM with the RBF kernel function shows only little improvement over the linear SVM for this data set. However, as shown in Fig. 5.13 below, the SVM score distributions are significantly more separated, and the equal error rate is lower than that of the linear SVM, dropping from 0.2% to 0.1%.

[Fig. 5.13. Histogram of OOV (blue) and INV (red) score distributions for triple scoring using the RBF SVM kernel.]

Table 5.2 summarizes all of the results up to this point.

Table 5.2. Score comparison of conventional single scoring and the presented triple scoring using linear and RBF SVM kernels.

Method | Score 1–2 (%) | Score 1–3 (%) | Triple score (%) | EER (%)
Single scoring | N/A | N/A | N/A | 15.5
Triple scoring, linear kernel | CR: 99.7, CA: 98.6 | CR: 99.5, CA: 95.4 | CR: 99.9, CA: 99.5 | 0.2
Triple scoring, RBF kernel (γ = 0.008) | CR: 99.8, CA: 98.6 | CR: 99.3, CA: 95.6 | CR: 99.9, CA: 99.3 | 0.1

5.7. Front end features

The Front End Processor produces three sets of features: MFCC (Mel Filtered Cepstral Coefficients), LPC (Linear Predictive Coding) smoothed MFCC, and Enhanced MFCC (E-MFCC). The E-MFCC features have a higher dynamic range than regular MFCC, so they have been chosen for most of the experiments. However, the next tests attempt to compare the performance
of E-MFCC, MFCC, and LPC features. In addition, a method of combining all three features has been developed and has shown excellent results. The first experiment verifies the classification performance of each feature stream independently. Three HMMs were generated with the following parameters: 30 states, 2 entry and exit silence states, 6 mixtures, 1 skip transition. The HMMs were trained on utterances of the word ‘‘operator’’ taken from the CCW17 and WUW-II corpora. One HMM was trained with E-MFCC features, one with MFCC features, and one with LPC features. For OOV rejection testing, the Phonebook corpus, which contains over 8000 unique words, was divided into two sets, Phonebook 1 and Phonebook 2. Each set was assigned roughly half of the utterances of every unique word, so both sets covered the same range of vocabulary. Recognition was then performed on the training data set, the Phonebook 1 data, and the Phonebook 2 data. SVM models were trained on the same training set and the Phonebook 1 set, with parameters γ = 0.001, C = 15. Classification tests were performed on the training set and the Phonebook 2 set. Figs. 5.14–5.16 show the score distributions and error rates for each set of front end features.

[Fig. 5.14. Histogram of OOV (blue) and INV (red) score distributions for E-MFCC features.]

[Fig. 5.15. Histogram of OOV (blue) and INV (red) score distributions for MFCC features.]

[Fig. 5.16. Histogram of OOV (blue) and INV (red) score distributions for LPC features.]

In this experiment LPC had an equal error rate slightly better than E-MFCC. However, the difference is negligible, and the distributions and their separation look almost the same. MFCC had the worst equal error rate. The next experiment attempts to combine information from all three sets of features. It is hypothesized that each feature set contains some unique information, and that combining all three of them would extract the maximum amount of information from the speech signal. To combine them, the features are considered as three independent streams, and recognition and triple scoring are performed on each stream independently. However, the scores from the three streams are aggregated and classified via SVM in a nine-dimensional space. Fig. 5.17 shows the results when combining the same scores from the previous section using this method.

[Fig. 5.17. Histogram of OOV (blue) and INV (red) score distributions for combined features.]

Combining the three sets of features produced a remarkable separation between the two distributions. The Phonebook 2 set used for OOV testing contains 49,029 utterances previously unseen by the recognizer, of which over 8000 are unique words. At an operating threshold of −0.5, just 16 of 49,029 utterances were misclassified, yielding an FA error of 0.03%. For INV, just 1 of 573 utterances was misclassified.
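To make the operating-point evaluation concrete, the sketch below computes FA and FR rates of a combined-stream classifier at a fixed threshold of −0.5; the SVM output scores are synthetic stand-ins, and only the set sizes come from the text.

```python
import numpy as np

def fa_fr_at_threshold(inv_scores, oov_scores, threshold):
    """False Acceptance / False Rejection rates at a fixed operating point:
    scores at or above the threshold are accepted as INV."""
    fa = np.mean(oov_scores >= threshold)    # OOV accepted -> false acceptance
    fr = np.mean(inv_scores < threshold)     # INV rejected -> false rejection
    return fa, fr

rng = np.random.default_rng(2)
inv = rng.normal(1.5, 0.6, 573)              # synthetic combined-SVM INV scores
oov = rng.normal(-2.5, 0.8, 49029)           # synthetic combined-SVM OOV scores

fa, fr = fa_fr_at_threshold(inv, oov, threshold=-0.5)
print(f"FA = {fa:.4%}, FR = {fr:.4%}")
```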
6. Baseline and final results

This section establishes a baseline by examining the performance of HTK and Microsoft SAPI 5.1, a commercial speech recognition engine. It then presents the final performance results of the WUW-SR system.

6.1. Corpora setup

Because Wake-Up-Word Speech Recognition is a new concept, it is difficult to compare its performance with existing systems. In order to perform a fair, unbiased analysis of performance, all attempts were made to configure the test systems to achieve the highest possible performance for WUW recognition. For the final experiments, once again the word ‘‘operator’’ was chosen for testing. The first step entails choosing a good set of training and testing data. The ideal INV data for building a speaker-independent word model would be a large number of isolated utterances from a wide variety of speakers. As such, the INV training data was once again constructed from isolated utterances taken from the CCW17 and WUW-II corpora, for a total of 573 examples. Because the WUW-SR system is expected to listen continuously to all forms of speech and non-speech and never activate falsely, it must be able to reject OOV utterances in both continuous free speech and isolated cases. The following two scenarios have therefore been defined for testing OOV rejection capability: continuous natural speech (Callhome corpus) and isolated words (Phonebook corpus). As described in Section 5, the Callhome corpus contains natural speech from phone call conversations in 30-min-long segments. The corpus spans 40 h of speech from over 160 different speakers, and contains over 8500 unique words. The corpus has been split into two sets of 20 h each, Callhome 1 and Callhome 2. The Phonebook corpus contains isolated utterances from over 1000 speakers and over 8000 unique words. The fact that every utterance is isolated actually presents a worst-case scenario to the WUW recognizer. In free speech, the signal is often segmented by the VAD into long sentences, indicative of non-WUW context and easily classified as OOV. However, short isolated words are much more likely to be confusable with the Wake-Up-Word. The Phonebook corpus was also divided into two sets, Phonebook 1 and Phonebook 2. Each set was assigned roughly half of the utterances of every word. Phonebook 1 contained 48,598 words (7945 unique) and Phonebook 2 contained 44,669 words (7944 unique).

6.2. HTK

The first baseline comparison was against the Hidden Markov Model Toolkit (HTK) version 3.0 from Cambridge University [24]. First, an HMM word model with similar parameters (number of states, mixtures, topology) as the WUW model was generated. Next, the model was trained on the INV speech utterances. Note that while the utterances were the same in both cases (i.e., HTK and e-WUW), the speech features (MFCCs) were extracted with the HTK feature extractor, not the WUW front end. Finally, recognition was performed by computing the Viterbi score of the test sequences against the model. After examining the scores of all sequences, a threshold was manually chosen to optimize the FA and FR error rates.

6.3. Commercial SR engine: Microsoft SAPI 5.1

The second baseline comparison evaluates the performance of a commercial speech recognition system.
The commercial system was designed for dictation and command & control modes, and has no concept of ‘‘Wake-Up-Word’’. In order to simulate a wake-up-word scenario, a command & control grammar was created with the wake-up-word as the only command (e.g. ‘‘operator’’). There is little control over the engine parameters other than a generic ‘‘sensitivity’’ setting. Preliminary tests were executed and the sensitivity of the engine was adjusted to the point where it gave the best results. Speaker adaptation was also disabled for the system.

6.4. WUW-SR

For the WUW-SR system, three HMMs were trained: one with E-MFCC features, one with MFCC features, and one with LPC-smoothed MFCC features. All three used the following parameters: 30 states, 2 entry and exit silence states, 6 mixtures, 1 skip transition. SVMs were trained on the nine-dimensional score space obtained by HMM triple scoring on the three feature streams, with parameters γ = 0.001, C = 15. Note that unlike the baseline recognizers, the WUW Speech Recognizer uses OOV data for training the SVM classifier. The OOV training set consisted of Phonebook 1 and Callhome 1, while the testing set consisted of Phonebook 2 and Callhome 2. Although it is possible to train a classifier using only INV examples, such as a one-class SVM, that has not yet been explored in this work. The plots in Figs. 6.1 and 6.2 show the combined feature results on the Callhome 2 and Phonebook 2 testing sets.

[Fig. 6.1. Histogram of the overall recognition scores for the Callhome 2 test.]

[Fig. 6.2. Histogram of the overall recognition scores for the Phonebook 2 test.]

Table 6.1 summarizes the results of HTK, the commercial speech recognizer, and WUW-SR. In Table 6.1, the Relative Error Rate Reduction is computed as:
RERR = (B − N) / N × 100%, (2)
where B is the baseline error rate and N is the new error rate. The reduction factor is computed as B/N.

Table 6.1. Summary of final results presenting and comparing the False Rejection (FR) and False Acceptance (FA) rates of WUW-SR vs. HTK and Microsoft SAPI 5.1, reported together with the Relative Error Rate Reduction and the error reduction factor.

Recognizer | CCW17 + WUW-II (INV): FR | FR (%) | Phonebook 2 (OOV): FA | FA (%) | FA/h | Callhome 2 (OOV): FA | FA (%) | FA/h
HTK | 105 | 18.3 | 8793 | 19.6 | 488.5 | 2421 | 2.9 | 121.1
Microsoft SAPI 5.1 SR | 102 | 16.6 | 5801 | 12.9 | 322.3 | 467 | 0.6 | 23.35
WUW-SR | 4 | 0.70 | 15 | 0.03 | 0.40 | 9 | 0.01 | 0.45

Reduction of relative error rate / factor:
WUW vs. HTK | 2514% / 26 | 65,233% / 653 | 28,900% / 290
WUW vs. Microsoft SAPI 5.1 SR | 2271% / 24 | 42,900% / 430 | 5900% / 60

Table 6.1 demonstrates WUW-SR's improvement over current state-of-the-art recognizers. WUW-SR with three feature streams was several orders of magnitude superior to the baseline recognizers in INV recognition and particularly in OOV rejection. Of the two OOV corpora, Phonebook 2 was significantly more difficult to recognize for all three systems due to the fact that every utterance was a single isolated word. Isolated words have similar durations to the WUW and differ just in pronunciation, putting to the test the raw recognition power of an SR system. HTK had 8793 false acceptance errors on Phonebook 2 and the commercial SR system had 5801 errors. WUW-SR demonstrated its OOV rejection capabilities by committing just 15 false acceptance errors on the same corpus, an error rate reduction of 58,520% relative to HTK and 38,573% relative to the commercial system.
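As a quick check of Eq. (2), the sketch below reproduces the headline Phonebook 2 comparison from Table 6.1.

```python
def rerr(baseline: float, new: float) -> float:
    """Relative Error Rate Reduction per Eq. (2)."""
    return (baseline - new) / new * 100.0

def reduction_factor(baseline: float, new: float) -> float:
    return baseline / new

# Phonebook 2 OOV false-acceptance rates from Table 6.1 (in percent).
htk_fa, wuw_fa = 19.6, 0.03
print(round(rerr(htk_fa, wuw_fa)))              # ~65233 (%)
print(round(reduction_factor(htk_fa, wuw_fa)))  # ~653
```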
On the Callhome 2 corpus, which contains lengthy telephone conversations, recognizers are helped by the VAD in the sense that most utterances are segmented as long phrases instead of short words. Long phrases differ from the WUW in both content and duration, and are more easily classified as OOV. Nevertheless, HTK had 2421 false acceptance errors and the commercial system had 467 errors, compared to WUW-SR's 9 false acceptance errors in 20 h of unconstrained speech.

7. Conclusion

In this paper we have defined and investigated WUW Speech Recognition, and presented a novel solution. The most important characteristic of a WUW-SR system is that it should guarantee virtually 100% correct rejection of non-WUW while maintaining a correct acceptance rate of 99% or higher. This requirement sets WUW-SR apart from other speech recognition systems, because no existing system can guarantee 100% reliability by any measure, and it allows WUW-SR to be used in novel applications that previously have not been possible. The WUW-SR system developed in this work provides efficient and highly accurate speaker-independent recognition at performance levels not achievable by current state-of-the-art recognizers. Extensive testing demonstrates accuracy improvements of several orders of magnitude over the best known academic speech recognition system, HTK, as well as a leading commercial speech recognition system. Specifically, the WUW-SR system correctly detects the WUW with 99.3% accuracy. It correctly rejects non-WUW with 99.97% accuracy for cases where they are uttered in isolation. In continuous and spontaneous free speech the WUW system makes 0.45 false acceptance errors per hour, or one false acceptance in 2.22 h. WUW detection performance is 2514%, or 26 times, better than HTK for the same training and testing data, and 2271%, or 24 times, better than Microsoft SAPI 5.1, a leading commercial recognizer. The non-WUW rejection performance is over 65,233%, or 653 times, better than HTK, and 5900% to 42,900%, or 60 to 430 times, better than Microsoft SAPI 5.1.

8. Future work

In spite of the significant advancement of SR technologies, true natural language interaction with machines is not yet possible with state-of-the-art systems. WUW-SR technology has been demonstrated, through this work and in two actual prototype software applications (a voice-only activated PowerPoint Commander and an Elevator simulator), to enable spontaneous and truly natural interaction with a computer using only voice. For further information on the software applications please contact the author ([email protected]). Currently there are no research or commercial technologies that are capable of discriminating the same word or phrase with such accuracy while maintaining a practically 100% correct rejection rate. Furthermore, the authors are not aware of any work or technology that discriminates the same word based on its context using only acoustic features, as is done in WUW-SR. An investigation to further enhance the discriminability of a WUW used in alerting context from other, referential, contexts using additional prosodic features is presently being conducted. Intuitive and empirical evidence suggests that alerting context tends to be characterized by the additional emphasis needed to get the desired attention. This investigation is leading to the development of prosody-based features that capture this heightened emphasis. Initial results using energy-based features show promise.
Using appropriate data is paramount to the accuracy of this study; an innovative procedure for collecting such data from video DVDs has been developed. Although the current system performs admirably, there is room for improvement in several areas. Making SVM classification independent of OOV data is one of the primary tasks. Development of such a solution will enable various adaptation techniques to be deployed in real or near-real time. The most important aspect of the presented solution is its generic nature and its applicability to any basic speech unit being modeled, such as phonemes (context-independent, or context-dependent such as tri-phones), which is necessary for large-vocabulary continuous speech recognition (LVCSR) applications. If even a fraction of the accuracy reported here carries over to those models, as expected, this method has the potential to revolutionize speech recognition technology. In order to demonstrate its expected superiority, the WUW-SR technology will be added to an existing and publicly available recognition system such as HTK (http://htk.eng.cam.ac.uk/) or SPHINX (http://cmusphinx.sourceforge.net/html/cmusphinx.php).

Acknowledgements

The authors acknowledge partial support from the AMALTHEA REU NSF Grant No. 0647120 and Grant No. 0647018, and thank the members of the AMALTHEA 2007 team, Alex Stills, Brandon Schmitt, and Tad Gertz, for their dedication to the project and their immense assistance with the experiments.

References

[1] Xuedong Huang, Alex Acero, Hsiao-Wuen Hon, Spoken Language Processing: A Guide to Theory, Algorithm and System Development, Prentice Hall PTR, 2001.
[2] Ron Cole, Joseph Mariani, Hans Uszkoreit, Giovanni Batista Varile, Annie Zaenen, Antonio Zampolli, Victor Zue (Eds.), Survey of the State of the Art in Human Language Technology, Cambridge University Press and Giardini, 1997.
[3] V. Këpuska, Wake-Up-Word Application for First Responder Communication Enhancement, SPIE, Orlando, 2006.
[4] T. Klein, Triple scoring of hidden Markov models in Wake-Up-Word speech recognition, Thesis, Florida Institute of Technology.
[5] V. Këpuska, Dynamic time warping (DTW) using frequency distributed distance measures, US Patent 6,983,246, January 3, 2006.
[6] V. Këpuska, Scoring and rescoring dynamic time warping of speech, US Patent 7,085,717, April 1, 2006.
[7] V. Këpuska, T. Klein, On Wake-Up-Word speech recognition task, technology, and evaluation results against HTK and Microsoft SDK 5.1, Invited paper: World Congress on Nonlinear Analysts, Orlando, 2008.
[8] V. Këpuska, D.S. Carstens, R. Wallace, Leading and trailing silence in Wake-Up-Word speech recognition, in: Proceedings of the International Conference: Industry, Engineering & Management Systems, Cocoa Beach, FL, 2006, pp. 259–266.
[9] J.R. Rohlicek, W. Russell, S. Roukos, H. Gish, Continuous hidden Markov modeling for speaker-independent word spotting, vol. 1, 23–26 May 1989, pp. 627–630.
[10] C. Myers, L. Rabiner, A. Rosenberg, An investigation of the use of dynamic time warping for word spotting and connected speech recognition, in: Proc. ICASSP ’80, vol. 5, April 1980, pp. 173–177.
[11] A. Garcia, H. Gish, Keyword spotting of arbitrary words using minimal speech resources, in: Proc. ICASSP 2006, vol. 1, 14–19 May 2006.
[12] D.A. James, S.J. Young, A fast lattice-based approach to vocabulary independent wordspotting, in: Proc. ICASSP ’94, vol. 1, 1994, pp. 377–380.
[13] S.B. Davis, P. Mermelstein, Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences, IEEE Trans. ASSP 28 (1980) 357–366.
[14] John Makhoul, Linear prediction: A tutorial review, Proc. IEEE 63 (4) (1975).
[15] Frederick Jelinek, Statistical Methods for Speech Recognition, The MIT Press, 1998.
[16] L.R. Rabiner, A tutorial on hidden Markov models and selected applications in speech recognition, Proc. IEEE 77 (2) (1989).
[17] John C. Platt, Sequential minimal optimization: A fast algorithm for training support vector machines, Microsoft Research Technical Report, 1998.
[18] C.J.C. Burges, A tutorial on support vector machines for pattern recognition, Data Mining and Knowledge Discovery, 1998.
[19] Chih-Chung Chang, Chih-Jen Lin, LIBSVM: A library for support vector machines, 2001. [Online] http://www.csie.ntu.edu.tw/~cjlin/libsvm.
[20] Rong-En Fan, Pai-Hsuen Chen, Chih-Jen Lin, Working set selection using second order information for training support vector machines, J. Mach. Learn. Res. 6 (2005) 1889–1918.
[21] Chih-Wei Hsu, Chih-Chung Chang, Chih-Jen Lin, A practical guide to support vector classification, Department of Computer Science and Information Engineering, National Taiwan University, Taipei, Taiwan, 2007.
[22] C.C. Broun, W.M. Campbell, Robust out-of-vocabulary rejection for low-complexity speaker-independent speech recognition, Proc. IEEE Acoustics Speech Signal Process. 3 (2000) 1811–1814.
[23] T. Hazen, I. Bazzi, A comparison and combination of methods for OOV word detection and word confidence scoring, in: IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP, Salt Lake City, Utah, May 2001.
[24] Cambridge University Engineering Department, HTK 3.0. [Online] http://htk.eng.cam.ac.uk.