Speech SDK for SK-FM4-176L-S6E2CC
Speech SDK for SK-FM4-176L-S6E2CC-SE
Speech Software Development Kit User's Guide
Publication Number: S6E2CC_AN709-00009
Revision: 1.0
Issue Date: February 1, 2015

Contents
Introduction
1. Overview of the Speech-Enabled MCU
   1.1 Hardware Features
   1.2 Software Features
   1.3 System Architecture
       1.3.1 Host Controller + Dedicated Speech-Enabled MCU (configuration 1)
       1.3.2 Running the User Application on the Speech-Enabled MCU (configuration 2)
2. Speech Evaluation Kit
3. Building a Speech Application
   3.1 Writing the Grammar
   3.2 Running the Grammar Compiler
   3.3 Speech Processing APIs
4. Sample Code
   4.1 Speech Recognition without VAD
   4.2 Speech Recognition with VAD
5. How to Compile and Link
6. System Resources
7. Audio Codec Driver
8. Phoneme Pronunciations
9. FAQs
10. Major Changes

Introduction

Speech recognition is an increasingly popular method for human-machine interaction.
As a result of improvements in processing speed, advances in algorithms, and the availability of large speech databases, usable real-time speech recognition systems are now a reality. It is now possible to build speaker-independent automatic speech recognition (ASR) systems capable of recognizing more than a hundred commands using an ARM M4 processor (200 MHz) with embedded flash memory and an external pSRAM.

This SDK provides a detailed description of the hardware and software used in Spansion's Speech-Enabled ARM M4 MCU, based on the S6E2CCAJHAGV20010. The goal of this document is to provide the information necessary to:

- Operate the Speech-Enabled M4 starter kit SK-FM4-176L-S6E2CC-SE
- Create new speech applications that run on Spansion's Speech-Enabled MCU
- Design MCU-based systems that have voice control capability

This document is arranged in the following sections:

- Section 1 describes the hardware and software features of the Speech-Enabled MCU based on the FM4 S6E2CCAJHAGV20010. It also provides a description of the supported system architectures.
- Section 2 describes the Speech Evaluation Kit, which is a development platform based on the SK-FM4-176L-S6E2CC-SE board.
- Section 3 outlines the steps for building a speech application that runs on the Speech-Enabled MCU.
- Section 4 provides sample source code for applications without a voice activity detector (VAD) and with VAD, respectively.
- Section 5 explains how to compile and link the application. The detailed directions refer to the IAR Embedded Workbench; similar flows are applicable to other development environments.
- Section 6 provides a description of the resources required for running the speech recognition engine.
- Section 7 contains audio codec driver routines that can be used to initialize and control the codec.
- Section 8 provides the phoneme pronunciations used for this SDK.
- Section 9 is a list of frequently asked questions to help you with your application development.

1. Overview of the Speech-Enabled MCU

This section describes the hardware and software features of the Speech-Enabled MCU based on the FM4 S6E2CCAJHAGV20010. It also provides a description of the supported system architectures.

1.1 Hardware Features

The feature list below describes Spansion's Speech-Enabled MCU S6E2CCAJHAGV20010. A block diagram is shown in Figure 1.1. Further details of the S6E2CCAJHAGV20010 are available on Spansion's website:
http://www.spansion.com/products/microcontrollers/32-bit-arm-core/fm4/Pages/S6E2CCAJHAGV20000.aspx

FM4 S6E2CCAJHAGV20010

- ARM Cortex-M4F CPU Core
  - Processor version: r0p1
  - FPU built-in, supports DSP instructions
- Clock
  - Maximum clock frequency: 200 MHz
- Memory
  - Main Flash: 2048 KB
  - RAM: 256 KB
- Multi-function Timer: 3 units (max.)
  - 16-bit free-run timer x 3 channels/unit
  - Input capture x 4 channels/unit
  - Output compare x 6 channels/unit
  - A/D activation compare x 6 channels/unit
  - Waveform generator x 3 channels/unit
  - 16-bit PPG timer x 3 channels/unit
- QPRC: 4 channels
- Dual Timer: 1 unit
- Watch Dog Timer: 1 channel (SW) + 1 channel (HW)
- Multi-function Serial Interface: 16 channels (max.)
  - Selectable from UART/CSIO/LIN/I2C
- I2S: 1 unit
- 12-bit A/D Converter: max. 32 channels (3 units)
- 12-bit D/A Converter: 2 channels (max.)
- External Bus Interface
- Real Time Clock: 1 unit
- DMA Controller: 8 channels
- DSTC: 256 channels (a)
- Ethernet-MAC (a)
- USB2.0 (Device/Host): 2 channels
- CAN: 2 channels
- CAN-FD: 1 channel
- Base Timer: 16 channels (max.)
- Watch Counter
- External Interrupt Controller Unit
  - External interrupt input pins: max. 32
  - Includes one non-maskable interrupt (NMI)
- CRC Accelerator
- Low Power Consumption Modes
  - Sleep mode/Timer mode/RTC mode/Stop mode/Deep standby RTC mode/Deep standby stop mode supported
- General Purpose I/O ports: 100 (max.)
- Built-in CR
- SD Card Interface: 1 unit
- Unique ID
- Debug
  - Serial Wire JTAG Debug Port (SWJ-DP)
  - Embedded Trace Macrocell (ETM)
- Low Voltage Detector
- Clock Supervisor
- Power Supply: 2.7 to 5.5 V

(a) See Section 6 for details on resource requirements.

Figure 1.1 Block Diagram of the Speech-Enabled MCU Based on Spansion S6E2CCAJHAGV20010

In systems using voice control, not all of the resources of the S6E2CCAJHAGV20010 are available to the user. This needs to be accounted for when developing application code and architecting a system. Specifically, the I2S port is reserved for connecting to the codec, and the external bus interface is used to connect to the pSRAM. In addition, the speech recognition software requires approximately 600 KB of internal flash memory, leaving approximately 1,400 KB for the user application. Detailed descriptions of the software and system architecture are given in Section 1.2 and Section 1.3.

1.2 Software Features

The software that is bundled with Spansion's Speech-Enabled M4 MCU is a state-of-the-art speaker-independent automatic speech recognition program. This speech recognition software has been optimized to fit in a limited memory footprint and to run in real time with high accuracy on the S6E2CCAJHAGV20010. It is ideally suited for command and control applications. The main features are:

- Speaker-independent voice control for Spansion's Speech-Enabled M4 MCU
  - Over 100 commands can be decoded in real time and with high accuracy
  - User-defined commands are entered in an intuitive manner using a text file (no audio input or training required)
  - Windows-based software is provided to convert the user commands (text file) into M4 MCU library objects
  - APIs are provided to configure the speech recognition engine and call the runtime libraries
  - Audio drivers are provided to interface with the codec
- Noise-robust front end
  - Built-in noise reduction
  - Beam forming (optional)
- Multiple language support
  - English, German, Chinese, and Japanese are currently available
  - Spanish and French are under development
- Voice activity detection
  - Speech recognition can be initiated using the voice activity detection API or using the push-to-talk option
- Dynamic grammar
  - The speech application can store multiple grammar files that can be used to implement a hierarchical command structure

1.3 System Architecture

There are two supported architectural configurations for the Speech-Enabled MCU. In the first configuration, the user application runs on a host processor and the Speech-Enabled MCU is used for voice recognition only. In the second configuration, both the host application and the speech recognition software run on the Speech-Enabled MCU.
1.3.1 Host Controller + Dedicated Speech-Enabled MCU (configuration 1)

The first configuration, where the user application runs on a host processor and the Speech-Enabled MCU is used for voice recognition only, is shown in Figure 1.2. Communication between the host processor and the Speech-Enabled MCU happens over an SPI interface. The advantage of this architecture is its modularity. With minimal effort, a system can be configured to run with or without a speech recognition interface. Additionally, resource contention between the host application and the speech recognition software is eliminated, as the speech recognition software runs on a dedicated MCU. This flexibility allows systems to easily add a voice control option.

Figure 1.2 System Architecture with a Speech-Enabled MCU (configuration 1)

The software architecture for the host and the Speech-Enabled MCU is shown in Figure 1.3. The host uses the speech client module to send commands (via the ASR API) to the MCU, which performs the speech recognition and returns the results. The only software component required to be installed on the host processor is the speech client module.

Figure 1.3 Software Architecture for a Speech-Enabled MCU (configuration 1)
- Host SW components: User Applications, Speech Client Module
- Speech-Enabled MCU ASR SW components: ASR Engine, Drivers, Speech Objects, Speech Server Module

Figure 1.4 shows the interface diagram for the Speech-Enabled FM4 MCU with the host processor. Spansion recommends using CLK Out from the host processor as the CLK input for the FM4 MCU. The audio codec interfaces with the MCU through the I2S port using the I2S protocol. Communications and data transfer between the host and the FM4 MCU are done using an SPI port. An external 2 MB pSRAM is used as the main memory of the ASR engine and is connected to the external memory bus of the MCU. As shown in Figure 1.4 (configuration 1), an interrupt signal is provided by the Speech-Enabled MCU to signal when a decoding result is ready.

Figure 1.4 Speech-Enabled FM4 MCU Interface (configuration 1)

1.3.2 Running the User Application on the Speech-Enabled MCU (configuration 2)

In the second configuration, both the host application and the speech recognition software run on the Speech-Enabled MCU (Figure 1.5). The advantage of this architecture is that it reduces the total number of system components and hence the BOM. The drawback is that the user application and the speech software now share the same MCU resources, which must be considered when designing a voice-controlled system.

Figure 1.5 System Architecture with a Speech-Enabled MCU (configuration 2)

The software architecture is shown in Figure 1.6. The left-hand side of the figure shows the Speech Objects (Acoustic Model, Dictionary, and Grammar), which are generated from the set of user-defined commands. The process of compiling the user-defined commands into the Speech Objects is done off-line using the software tools provided with the Speech-Enabled MCU. This process is described in more detail in Section 3, Building a Speech Application. The right-hand side of Figure 1.6 shows the software hierarchy that runs on the Speech-Enabled MCU. The audio driver receives data from the codec and stores it in memory.
The Speech Recognition software (ASR engine) then accesses this data and finds the user-defined command (from the Speech Objects) that best matches the audio data. This hypothesis is then passed to the application as the recognition result. The user application interacts with the ASR Engine and the Audio Driver through a set of APIs. The APIs are described in Section 3.3, Speech Processing APIs.

Figure 1.6 Software Architecture with a Speech-Enabled MCU (configuration 2)

2. Speech Evaluation Kit

The Speech Evaluation Kit is a development platform based on the SK-FM4-176L-S6E2CC-SE board. It contains a Speech-Enabled MCU (S6E2CCAJHAGV20010), a codec, a 2 MB pSRAM, a microphone, and a USB cable for connection to a PC. The Speech Evaluation Kit comes preloaded with a speech recognition application and the software tools required to create new speech applications. A block diagram of the Speech Evaluation Kit is shown in Figure 2.1, and a description of the SK-FM4-176L-S6E2CC-SE features is given below. A full description of the Speech Evaluation Kit can be found in the SK-FM4-176L-S6E2CC-SE Board Support Package document.

Figure 2.1 Block Diagram of the Speech Evaluation Kit

SK-FM4-176L-S6E2CC-SE Features

- Spansion FM4 Family S6E2CCAJHAGV20010 MCU
  - ARM Cortex-M4F (DSP/FPU) core
  - 2 MB Flash (a), 256 KB RAM on-chip
  - LQFP176 package
- IEEE 802.3 Ethernet (RJ45)
- USB Micro Type-B connector x 1
- On-board ICE (CMSIS-DAP)
- Flash memory up to 32 Mbit, S25FL032P (via Quad SPI)
- pSRAM memory up to 32 Mbit, SV6P1615UFC (via external bus)
- RGB LED (via GPIO or PWM dimming)
- Acceleration sensor (via I2C and INT)
- Phototransistor (via A/D converter)
- Push button (via NMIX)
- Arduino-compatible I/F
- Reset button
- Power supply LED
- User setting jumpers
- 6-pin JTAG I/F (supports SWD only)
- CMSIS-DAP USB bus power, USB device bus power (selected by jumper)
- Stereo codec WM8731 (via I2S)

(a) See Section 6 for details on resource requirements.

3. Building a Speech Application

The steps for building a speech application that runs on the Speech-Enabled MCU are outlined below. Spansion can provide support with each of these steps and can also build the entire speech application if needed.

1. Write the list of commands (grammar) specific to the application. Examples of grammars are shown below.
2. Run the Grammar Compiler, provided by Spansion, which takes the list of commands and generates the Speech Objects [.c].
3. Write the application program using the ASR APIs. A sample application is provided with the SDK software.
4. For configuration 1: build the host executable by compiling the application and linking with the speech client module. Build the executable for the Speech-Enabled MCU by linking the Speech Objects, the ASR engine, the speech server module, and the drivers.
5. For configuration 2: build the executable for the Speech-Enabled MCU by compiling the application and linking with the ASR engine, the Speech Objects, and the drivers. A sample IAR project folder is provided with the SDK software.

3.1 Writing the Grammar

The list of user-defined commands for an application is called the grammar. The grammar compiler tool is provided with the SDK software and supports the standard Java Speech Grammar Format (JSGF), except for tags, quoted tokens/terminals, and grammar import. The JSGF format is described in detail in the link below.
http://www.w3.org/TR/2000/NOTE-jsgf-20000605/

Below are two examples of grammar files: the first consists of a simple set of commands for calling a list of 10 contacts, and the second consists of a slightly more complicated set of commands for recognizing a telephone number in the '408' area code. As can be seen from these examples, the JSGF grammar provides a simple and compact formalism for writing command and control applications. Note: it is recommended to use the Windows text editor Notepad when writing the grammar file.

Grammar Example 1 - Call Contacts

/**Begin Grammar*/
#JSGF V1.0;
grammar SPSNCALL;
public <command> = CALL CONTACT <name>;
<name> = ( FREDERIC LAWHON | TRANG MARCOUX | MIREILLE CATALAN | CAMERON MANGO | STACIE PELLEGRINO | GAIL CURCIO | MERNA LAMBERSON | LIA STOTT | KATHERIN BRECK | KIMBERLI MUCK );
/**End Grammar*/

Grammar Example 2 - Telephone Numbers

/**Begin Grammar*/
#JSGF V1.0;
grammar SPSN_DIAL;
public <command> = DIAL <phonenumber>;
<phonenumber> = <areacode> <number>;
<areacode> = FOUR ZERO EIGHT;
<number> = <digit><digit><digit> <digit><digit><digit><digit>;
<digit> = ZERO | ONE | TWO | THREE | FOUR | FIVE | SIX | SEVEN | EIGHT | NINE;
/**End Grammar*/

3.2 Running the Grammar Compiler

The Grammar Compiler is a PC application provided with the Speech-Enabled MCU. It takes the user's JSGF grammar file as input and generates the Speech Object file that contains:

- Finite State Grammar
- Dictionary
- Acoustic Model

The Finite State Grammar contained in the Speech Object file is a compact and highly optimized version of the input JSGF grammar file, using the method of Finite State Transducers. It is used during the search phase of the speech recognition process to generate all possible sequences of words allowed by the grammar. A description of this technique is provided in:
http://www.cs.nyu.edu/~mohri/postscript/hbka.pdf

The Dictionary contained in the Speech Object file is a text file containing the phonetic representation of all the words used in the command set. It is generated using a variety of techniques, including look-up tables and grapheme-to-phoneme conversion algorithms. The Dictionary is also used during the search phase of the speech recognition process to generate all possible sequences of phonemes (sound units) allowed by the grammar. An example of the Dictionary corresponding to the Call Contacts grammar is shown in Dictionary Example 1.

Dictionary Example 1 - Call Contacts

CALL         K AO L
CONTACT      K AA N T AE K T
BRECK        B R EH K
CAMERON      K AE M ER AH N
CATALAN      K AE T AH L AH N
CURCIO       K UH R CH IY OW
FREDERIC     F R EH D R IH K
GAIL         G EY L
KATHERIN     K AE TH AH R AH N
KATHERIN(2)  K AE TH R IH N
KIMBERLI     K IH M B AH R L IY
LAMBERSON    L AE M B ER S AH N
LAWHON       L AO HH AH N
LIA          L IY AH
MANGO        M AE NG G OW
MARCOUX      M AA R K UW
MERNA        M EH R N AH
MIREILLE     M AH R IY L
MUCK         M AH K
PELLEGRINO   P EH L EH G R IY N OW
STACIE       S T AE S IY
STOTT        S T AA T
TRANG        T R AE NG

The Dictionary can support multiple pronunciations for each word, as is the case for KATHERIN and KATHERIN(2) in the example above. This file can also be customized by the user to add pronunciations that are not automatically generated. Manually editing the pronunciation dictionary can significantly improve recognition accuracy for unusual pronunciations and for different accents and dialects. A list of the phoneme tables that are used for English, Chinese, and German is shown in Section 8.
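As an illustration of adding an alternate pronunciation by hand, a word with two common pronunciations could simply be given two dictionary entries, using the phoneme symbols listed in Section 8. The word below is a hypothetical example and is not taken from any SDK dictionary:

TOMATO       T AH M EY T OW
TOMATO(2)    T AH M AA T OW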
The Acoustic Model contained in the Speech Object file is a set of parameters providing the mathematical descriptions of the sound units. These models are trained using hundreds of hours of audio recordings from many speakers in different settings. Each language has a unique Acoustic Model. Spansion currently provides Acoustic Models for English, Chinese, Japanese, and German. Additional languages such as Spanish, French, Italian, and Korean are under development.

After we define the grammar, we are ready to produce a speech object file that covers a specific voice recognition task. Spansion provides sample scripts for speech object production. Figure 3.1 shows the structure of a folder that contains the DOS batch scripts as well as the necessary binary files. For example, the speech object file for the English voice dialing application can be generated by executing the DOS batch script gen_enUS_voice_dialing.bat in the package. The sample script runs three executable programs:

- buzz_optfsg_gen.exe: converts the JSGF into the compact finite state grammar (FSG) file and generates a vocabulary list
- buzz_dict_compile.exe: generates the pronunciations
- buzz_speechobj_gen.exe: integrates the grammar, the pronunciation dictionary, and the set of acoustic model parameters into the speech object

Figure 3.2 shows a flow chart of the speech object generation process.

Figure 3.1 Folder Structure of the Speech Object Generation Package

gen_speech_object/
  acoustic_model/              Sound unit files for each language
  windows/                     Executable programs and libraries
  sample_grammar/
    enUS_speech_dial/          Sample JSGF file for the voice dialing application
    enUS_tvcmd/                Sample JSGF and pronunciation files for the TV control task
  out_enUS_tvdemo/             Speech object samples generated by the tool
  out_speech_dial/             Speech object samples generated by the tool
  gen_enUS_voice_dialing.bat   Sample DOS batch script for the voice dialing application
  gen_enUS_tvcmd.bat           Sample DOS batch script for the TV control task

Figure 3.2 Flow Chart of Speech Object Production

Here, we explain each program invoked in gen_enUS_voice_dialing.bat. The first program converts the JSGF file (sample_grammar/enUS_speech_dial/speech_dial.jsgf) into the compact FSG representation by executing the command in Table 3.1.

Table 3.1 Generating the Optimized Grammar File

Input file:
  sample_grammar/enUS_speech_dial/speech_dial.jsgf (defines the grammar; created by the user)
Output files:
  sample_grammar/enUS_speech_dial/speech_dial.list (list of vocabulary)
  sample_grammar/enUS_speech_dial/speech_dial.fsg (optimized grammar)
Command:
  windows/bin/buzz_optfsg_gen.exe \
    -list sample_grammar/enUS_speech_dial/speech_dial.list \
    sample_grammar/enUS_speech_dial/speech_dial.jsgf \
    sample_grammar/enUS_speech_dial/speech_dial.fsg

Note that all files must be in plain text format. Spansion's JSGF compiler produces an optimized FSG file (speech_dial.fsg) in order to achieve the smallest search space for voice recognition. It also creates a list of the vocabulary used in the JSGF file (speech_dial.list).
Table 3.2 Generating the Pronunciations

Input files:
  sample_grammar/enUS_speech_dial/speech_dial.list (list of vocabulary; generated by buzz_optfsg_gen.exe)
  windows/bin/english/reference.dict (English reference dictionary provided as part of the SDK)
Output file:
  sample_grammar/enUS_speech_dial/speech_dial.dict (pronunciation dictionary)
Command:
  windows/bin/buzz_dict_compile.exe \
    -d windows/bin/english/reference.dict \
    -i sample_grammar/enUS_speech_dial/speech_dial.list \
    -o sample_grammar/enUS_speech_dial/speech_dial.dict

The next program generates the pronunciations for the vocabulary by executing the command in Table 3.2. The dictionary file, speech_dial.dict, contains pairs of words and pronunciations. Please note that automatic pronunciation tools will not always produce the expected pronunciations. Developers therefore need to review the pronunciations in the speech_dial.dict file, make any necessary corrections, or add alternate pronunciations.

The last program takes the pronunciation dictionary (speech_dial.dict) and the optimized FSG file (speech_dial.fsg) and generates the speech object as follows:

Table 3.3 Generating the Speech Objects

Input files:
  acoustic_model/enUS (English acoustic model provided as part of the SDK)
  speech_dial.dict (pronunciation dictionary; produced by buzz_dict_compile.exe)
  speech_dial.fsg (optimized grammar; produced by buzz_optfsg_gen.exe)
Output file:
  out_enUS_speech_dial/speech_objects.c (Speech Objects file)
Command:
  windows/bin/buzz_speechobj_gen.exe \
    -hmm acoustic_model/enUS \
    -dict sample_grammar/enUS_speech_dial/speech_dial.dict \
    -fsg sample_grammar/enUS_speech_dial/speech_dial.fsg \
    -outdir out_enUS_speech_dial

The final result of the grammar compilation process is the speech_objects.c file. Replace the speech_objects.c file in the IAR project with your new speech_objects.c file and compile to produce a new .srec file that can be programmed into the flash.

In a small vocabulary application, the ASR may produce incorrect matches when a word that is outside the grammar is spoken. To reduce this kind of false match, a keyword (QQQ) is provided that loosely matches every phone but is not strongly associated with any phone. An example would be a thermostat that wakes up to the phrase "Hello Thermostat" but also incorrectly wakes up to "Hello Thomas". "Hello QQQ" can be added to the .jsgf file, as illustrated below. "Hello QQQ" will then produce a stronger match to "Hello Thomas" than "Hello Thermostat" does, and hence "Hello Thomas" no longer triggers a false detection. The main use for this is wake-up words that need very low false positive rates.
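For example, a wake-up grammar using the QQQ keyword might look like the following. This is an illustrative sketch written in the JSGF style of the examples in Section 3.1; the grammar name and phrases are hypothetical and not files shipped with the SDK:

/**Begin Grammar*/
#JSGF V1.0;
grammar SPSN_WAKEUP;
// HELLO QQQ absorbs near-misses such as "Hello Thomas" so that only the
// real wake-up phrase produces a strong match.
public <command> = HELLO ( THERMOSTAT | QQQ );
/**End Grammar*/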
3.3 Speech Processing APIs

At runtime, the user application interacts with the ASR engine through the set of ASR APIs. The basic sets of APIs for configuring the ASR engine and for processing speech are shown in Table 3.4 and Table 3.5, respectively. The speech processing APIs for the two architectural configurations are identical. An additional set of APIs, for configuration 1 only, is shown in Table 3.6. These additional APIs provide a way for the host processor to configure the Speech-Enabled MCU and external chips, to update the Speech Objects, and to transfer audio data between the two processors.

The first step in configuring the ASR engine is to create and initialize a parameter object by calling buzz_init_params(). This command configures the ASR engine with the default values of the parameter object. To achieve optimal performance for a particular application, however, the values of the parameter object need to be tuned. This can be done with the APIs listed in Table 3.4.

Many of the configuration parameters in Table 3.4 control the search algorithm used for matching the spoken phrase to the Speech Object. This is done by adjusting the pruning thresholds at various points of the search. Relaxing the pruning thresholds allows more hypotheses to be evaluated, increasing the recognition accuracy at the expense of more computation. Tightening the pruning thresholds, on the other hand, restricts the number of potential matches to be evaluated but speeds up the computation. Another way to control the search algorithm is to directly specify the number of hypotheses that are evaluated at each point. This puts a limit on the maximum number of hypotheses that are evaluated in each time frame and is a convenient way to control the maximum heap memory size. Other configuration parameters in Table 3.4 include: the bos_threshold and bos_timeout, which adjust when the beginning of speech is detected; the eos_threshold, eos_valid, and eos_timeout, which adjust when the end of speech is detected; and parameters for selecting which Speech Object is used during recognition.

For processing speech input, the set of APIs in Table 3.5 is used. The first step is to create a Buzz voice recognizer (VR) object, buzz_decoder_t *buzz. The VR object can be initialized by passing the VR and parameter objects to buzz_decode_init( &buzz, param ). After the initialization, buzz_decode_start( buzz ) has to be called to start speech decoding. An array containing audio samples is fed to the Buzz VR object by calling buzz_process_audio( buzz, buf, CB_AD_BUFSIZE ). This process has to be repeated for all the samples of audio or until the end of speech is detected by the buzz_eos_detected() function. The buzz_decode_finish(buzz) function needs to be called to finish the decoding process and generate the hypothesis. The recognition result can be retrieved with buzz_get_hyp( buzz ). The confidence score for the hypothesis can be obtained by calling buzz_get_conf_score(). This confidence score is a combination of acoustic scores and language scores for the recognized hypothesis and is always a negative value. Scores closer to zero are better. The application can use this confidence score to accept or reject a hypothesis by comparing it against an empirically derived threshold for a specific task. For close-talking microphones, it was found that a threshold of -6000 works reasonably well. Note that the recognized hypothesis might be NULL when the confidence score is too low. Sample code that shows how the various APIs are integrated together to create a speech application is given in Section 4.

To achieve good recognition performance, it is crucial to detect when voice activity starts and ends. The start of speech can be easily determined when a push-to-talk button is used. In applications where a push-to-talk button is not available, however, the ASR engine also provides an automatic voice activity detector (VAD). Section 4.2 shows another sample code listing that uses the voice activity detector (VAD). In this example, buzz_bos_detected() is used to determine when speech begins. When buzz_bos_detected() returns VA_DEFINITELY_SPEECH or VA_PERHAPS_SPEECH, processing can start. After buzz_eos_detected() returns 1 (one or more times), processing can be finished and the recognition result obtained.
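The following is a minimal sketch of the calling sequence described above, assembled from the APIs in Table 3.4 and Table 3.5 and the sample code in Section 4. It is not part of the SDK sources: the header name, the SAMPLE_SIZE value, and the way the audio buffer is filled are assumptions, so substitute the names used in your SDK package and audio driver.

#include "buzz_asr.h"        /* assumed name for the SDK header that declares the buzz_* APIs */

#define SAMPLE_SIZE 512      /* assumed audio block size, in samples */

int recognize_once(void)
{
    buzz_param_list_t *param = buzz_init_params();   /* parameter object with default values */
    buzz_set_language(param, "English");              /* optional tuning, see Table 3.4 */
    buzz_set_eos_threshold(param, 80);                /* default end-of-speech threshold */

    buzz_decoder_t *buzz = NULL;
    buzz_decode_init(&buzz, param);                   /* create the VR object */
    buzz_decode_start(buzz);                          /* must be called before each utterance */

    short buf[SAMPLE_SIZE];
    int is_eos = 0;
    while (!is_eos) {
        /* fill buf[] with SAMPLE_SIZE samples from the audio driver (see Section 4 and Section 7) */
        buzz_process_audio(buzz, buf, SAMPLE_SIZE, &is_eos);
    }

    buzz_decode_finish(buzz);                         /* finish decoding and generate the hypothesis */
    char const *hyp = buzz_get_hyp(buzz);             /* may be NULL if all hypotheses were pruned */
    int score = buzz_get_conf_score(buzz);            /* negative; closer to zero is better */

    /* act on hyp and score here */

    buzz_decode_free(buzz);
    buzz_free_params(param);
    return (hyp != NULL && score > -6000) ? 0 : -1;   /* -6000 is the close-talking example threshold */
}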
Table 3.4 Buzz API for Parameter Configuration

buzz_param_list_t *buzz_init_params()
  Create a parameter object to configure parameters of the Buzz voice recognizer.

void buzz_free_params(buzz_param_list_t *param)
  Free the parameter object.

void buzz_set_param_hmm_beam_init(buzz_param_list_t *param, double value)
  Set the beam pruning threshold (likelihood basis) to adjust the search space at the HMM level. (Default value: 1e-48)

void buzz_set_param_word_beam_init(buzz_param_list_t *param, double value)
  Set the beam pruning threshold (likelihood basis) to control the word search space. (Default value: 1e-48)

void buzz_set_param_max_hmms_per_frm(buzz_param_list_t *param, int num)
  Set the beam width at the HMM level. The default value is 500, which should cover 4000-word recognition tasks.

void buzz_set_param_max_wrds_per_frm(buzz_param_list_t *param, int num)
  Set the beam width at the word level. The default value is 100.

void buzz_set_eos_threshold(buzz_param_list_t *param, int zero2hundred_value)
  Set the threshold used to detect the end of speech. This parameter can be set from 0 to 100, where lower values detect the end of speech more easily. (Default value: 80)

void buzz_set_eos_valid(buzz_param_list_t *param, int millisecond)
  Set how long eos_detected needs to be valid before the end of the utterance is declared. (Default value: 200 ms)

void buzz_set_eos_timeout(buzz_param_list_t *param, int millisecond)
  Set the maximum length of an utterance, starting from when the beginning of speech is detected.

void buzz_set_bos_timeout(buzz_param_list_t *param, int millisecond)
  Set the maximum time to wait for speech after push-to-talk is pressed, before terminating. (Default value: 60 seconds)

void buzz_set_bos_threshold(buzz_param_list_t *param, float num)
  Set the threshold value used to detect the beginning of speech / voice activity. (Default value: 3.5) Lower values are more sensitive to audio input and should be used in quiet conditions; higher values should be used in noisy environments.

void buzz_set_vad_ramp_up(buzz_param_list_t *param, int samples)
  Set the ramp-up samples (default: 8192 samples, or 0.5 s). This parameter sets the number of samples to wait after the codec is reset before processing data.

void buzz_set_vad_previous_samples(buzz_param_list_t *param, int samples)
  Set the previous samples (default: 8192 samples, or 0.5 s). This parameter sets the number of samples to process before the beginning of speech is detected.

void buzz_set_language(buzz_param_list_t *param, char const *name)
  Set the language for speech recognition. (Default value: English) Current options include English, Mandarin, German, and Japanese.

void buzz_set_grammar(buzz_param_list_t *param, char const *name)
  Select a grammar object by name.

Table 3.5 Buzz API for Speech Processing

buzz_decoder_t *buzz_decode_init(buzz_decoder_t **pbuzz, const buzz_param_list_t *parameters)
  Initialize a Buzz voice recognizer (VR) object with the parameter object configured as described in Table 3.4. Note that you have to feed speech samples to buzz_process_audio() when you process audio samples.

int buzz_decode_start(buzz_decoder_t *buzz)
  This API has to be called before each utterance.

int buzz_process_audio(buzz_decoder_t *buzz, short *buffer, const int len, int *is_eos)
  Process a block of incoming speech samples stored in buffer. len denotes the number of samples in the buffer. The function also updates the is_eos value (1 if the end of speech is detected, 2 if the end-of-speech timeout has expired, and 0 if the end of speech is not detected).

int buzz_decode_finish(buzz_decoder_t *buzz)
  This API has to be called at the end of the utterance.

int buzz_bos_detected(buzz_decoder_t *buzz, short *buffer, const int len)
  Returns 1 if the beginning of speech (BOS) is detected, 2 if the BOS timeout has expired, and 0 otherwise. len denotes the number of samples in the buffer.

void buzz_vad_previous_samples(buzz_decoder_t *buzz, int *samples)
  Updates the current samples with the previous samples.

int buzz_eos_detected(buzz_decoder_t *buzz)
  Returns 1 if the end of speech is detected. Otherwise, returns 0.

char const *buzz_get_hyp(buzz_decoder_t *buzz)
  Returns a pointer to the recognition result.

int buzz_get_conf_score(buzz_decoder_t *buzz)
  Returns a confidence score associated with the recognition result obtained with buzz_get_hyp().

void buzz_decode_free(buzz_decoder_t *buzz)
  Free the Buzz VR object initialized with buzz_decode_init().

Table 3.6 Buzz API Extensions for Configuration 1 (Note: the APIs in this table are not implemented yet)

void buzz_codec_command(buzz_param_list_t *param, int num)
  Tells the Speech-Enabled MCU to send a command to the codec.

void buzz_sram_command(buzz_param_list_t *param, int num)
  Tells the Speech-Enabled MCU to send a command to the SRAM.

void buzz_data_transfer(buzz_param_list_t *param, short *buffer, const int len)
  Transfers a block of audio data from the host processor to the Speech-Enabled MCU.

void buzz_update_speech_object(buzz_decoder_t *buzz, const int len)
  Transfers and updates the Speech Object in the Speech-Enabled MCU.

4. Sample Code

Section 4.1 and Section 4.2 provide sample source code for applications without VAD and with VAD, respectively. For details, please refer to the source code.

4.1 Speech Recognition without VAD

do {
    if (Audio_GetBufAddr(rxsample, &rxbufaddr) < 0)
        break;
    buzz_process_audio(buzz, (short *)rxbufaddr, SAMPLE_SIZE, &is_eos);
    if (is_eos > 0)
        break;
    rxsample += SAMPLE_SIZE;            // sample data
    Audio_CircularBuf(&rxsample);
} while (rxsample < AUDIO_CIRBUFSIZE);  // circular buffer size

4.2 Speech Recognition with VAD

do {
    if (Audio_GetBufAddr(rxsample, &rxbufaddr) < 0)
        break;
    if (is_bos == 0) {
        is_bos = buzz_bos_detected(buzz, (short *)rxbufaddr, SAMPLE_SIZE);
        if (is_bos == 1) {
            buzz_vad_previous_samples(buzz, &rxsample);
        } else if (is_bos == 2) {
            break;
        }
    } else {
        buzz_process_audio(buzz, (short *)rxbufaddr, SAMPLE_SIZE, &is_eos);
        if (is_eos > 0)
            break;
    }                                   // end if is_bos
    rxsample += SAMPLE_SIZE;            // sample data
    Audio_CircularBuf(&rxsample);
} while (rxsample < AUDIO_CIRBUFSIZE);  // circular buffer size

5. How to Compile and Link

After the user commands have been compiled into the speech_objects.c file using the Grammar Compiler described in Section 3.2, this file must be compiled and linked with the user application. The following detailed directions refer to the IAR Embedded Workbench; similar flows are applicable to other development environments.

The first step is to open the S6E2CC_pdl project file using the IAR Embedded Workbench (see Figure 5.1).
The downloaded project file contains the English TV Demo. The Speech Objects for the English TV Demo and the Chinese TV Demo have been placed in the gen_speech_object folder. Speech Objects for additional tasks can be added to this folder by first running the Grammar Compiler script (Section 3.2).

Figure 5.1 Directory Structure of the Board Example
  S6E2CC_pdl        Common source files for FM4
  Speech_app
    Audio
    buzz_live_mcu
    template        MCU template structure

Open the template folder (see Figure 5.2) and select the template\IAR\s6e2cc.eww file by double-clicking. This template file corresponds to the English TV Demo. Additional template files can be added to this folder by copying and renaming existing template files. After replacing the speech_objects.c file in the IAR template with a newly generated file, the Rebuild All function needs to be used for compiling and linking with the ASR engine, the drivers, and the Speech Object.

Figure 5.2 Directory Structure of the Template Folder
  template
    IAR                          Example project for IAR Embedded Workbench, including the startup file and the linker and compiler settings
      s6e2cc_pdl.eww             IAR Embedded Workbench workspace file
      s6e2cc_release\exe\*.srec  Compiled and linked firmware
    Source                       Example source files
    Readme.txt

To burn the image into flash, these steps must be followed:

1. Download the FLASH MCU Programmer for FM0+/FM3/FM4 (serial version) from:
   http://www.spansion.com/support/microcontrollers/developmentenvironment/Pages/mcu-download.aspx
2. Connect the USB cable from the laptop to the s6e2cc demo board.
3. Connect the jumper J2 on the board to enter the programming mode. Press the reset button on the s6e2cc demo board.
4. Open the serial Flash MCU Programmer for FM0+/FM3/FM4, and start programming.
5. When the programming is complete, remove jumper J2 and press the reset button on the s6e2cc board.
6. The demo is now ready to run.

6. System Resources

The Speech-Enabled version of the FM4 MCU described in this SDK is the S6E2CCAJHAGV20010. Some of the resources of this MCU are used for running the speech recognition engine, such as the embedded flash that stores the firmware, as well as the CPU and RAM during runtime. Resources not used for the speech recognition engine are available for the user application. This section provides a description of the resources required for running the speech recognition engine.

Flash requirements: all firmware is stored in the embedded flash. The firmware consists of the Speech Recognition Engine, which uses 320 KB of flash, and the Speech Object, whose size depends on the number of recognition phrases. A plot of the Speech Object size vs. the number of commands is shown in Figure 6.1. Note: the exact size of the Speech Object will vary depending on the phrases used.

Figure 6.1 Speech Object Size vs. Number of Commands (an additional 320 KB of flash is required for the Speech Recognition Engine)

CPU requirements: the CPU load during decoding is shown in Figure 6.2. The CPU load is defined as (total processing time) / (duration of utterance). For a smaller number of recognition phrases (commands), the CPU requirements are reduced. This characterization data was taken with a CPU clock of 200 MHz. The pruning thresholds were optimized for the case of 100 commands, and these values were used for all measurements. Further reductions in the CPU load at smaller list sizes can be achieved by optimizing the pruning thresholds for these cases.
Running the Voice Activity Detector only requires a CPU load of 0.2.

Figure 6.2 CPU Load During Decoding (CPU running at 200 MHz; pruning thresholds constant over all list sizes)

Stack requirements: the Speech Recognition Engine uses 1 KB for CSTACK.

RAM requirements: 80 KB of internal SRAM and 1 MB of external pSRAM are currently required for all command list sizes.

External ports: running the Speech Recognition Engine requires a 1 MB external pSRAM (2 MB are recommended) and an audio codec. The pSRAM is connected to the FM4 MCU using the external bus interface. The codec is connected using the I2S port. Schematics are provided in the board support package.

7. Audio Codec Driver

The Speech Evaluation Kit uses the Wolfson WM8731 audio codec to input a mono microphone signal or a line-in input. The following routines can be used to initialize and control this codec:

1. Audio_Init(); // This function is used to initialize the FM4 I2C port connection for the audio codec parameter configuration, and also to initialize the FM4 I2S port connection for getting the audio data from the codec. It is also used to allocate memory for the audio data stream.
2. Audio_Activate(); // This function is used to activate the FM4 I2S controller and the codec to start receiving audio data after the push-to-talk button is pressed.
3. Audio_GetSample(); // This function gets the audio sample data from the codec and waits until the audio sample data is ready.
4. Audio_DeInit(); // This function is used to disable the FM4 I2S controller, deactivate the audio codec, and free the memory allocated for the audio codec.

The configuration settings used for the WM8731 can be found in the application code provided with the Speech Evaluation Kit. They are also shown below for reference:

Audio_Write(WM8731_REG_RESET, _WM8731_Reset);     // Reset module
Audio_Write(WM8731_REG_LLINE_IN, 0x19f);          // Left line in settings
Audio_Write(WM8731_REG_RLINE_IN, 0x19f);          // Right line in settings
Audio_Write(WM8731_REG_LHPHONE_OUT, 0x1ff);       // Left headphone out settings
Audio_Write(WM8731_REG_RHPHONE_OUT, 0x1ff);       // Right headphone out settings
Audio_Write(WM8731_REG_ANALOG_PATH, 0x25);        // Analog paths
Audio_Write(WM8731_REG_DIGITAL_PATH, 0x02);       // Digital paths
Audio_Write(WM8731_REG_PDOWN_CTRL, 0x00);         // Power down control
Audio_Write(WM8731_REG_DIGITAL_IF, 0xc1);         // Digital interface
Audio_Write(WM8731_REG_SAMPLING_CTRL, 0x58);      // Sampling control

The digital interface settings correspond to:
- MSB-first, left justified
- 16 bits
- Master mode
- Bit clock inverted

The sampling control settings correspond to:
- Normal mode (256 fs)
- Sampling rate 32 kHz

8. Phoneme Pronunciations

Phoneme   Example
AA        g(o)t
AE        c(a)t
AH        (a)llow, c(u)t
AO        f(a)ll
AW        f(ou)l
AY        f(i)le
B         (b)it
CH        cat(ch)
D         (d)ig
DH        (th)en
EH        f(e)ll
ER        c(u)rt
EY        f(ai)l
F         (f)at
G         (g)ot
HH        (h)at
IH        f(i)ll
IY        f(ee)l, (ea)t
JH        (j)ourney
K         (c)at
L         (l)ip, batt(le)
M         (m)an
N         (n)ut
NG        ri(ng)
OW        g(oa)l
OY        f(oi)l
P         (p)it
R         (r)ip
S         (s)eal
SH        (sh)ip
SIL       (silence)
T         (t)op
TH        (th)in
UH        f(u)ll
UW        f(oo)l
V         (v)at
W         (wh)y
Y         (y)es
Z         (z)eal
ZH        lei(s)ure

9. FAQs

What type of microphone should I use?
Three critical specs for choosing a microphone are: 1) frequency response, 2) sensitivity, and 3) SNR. For good ASR performance, the frequency response should be flat within the range of 50 Hz to 10 kHz. For analog microphones, the sensitivity needs to be chosen to generate the maximum output voltage for a given application without clipping. The required microphone sensitivity also depends on the gain of the codec or preamp. A good description of how everything fits together is given in:
http://www.analog.com/library/analogdialogue/archives/46-05/understanding_microphone_sensitivity.html
Digital microphones have similar specs and can also be used as long as the codec supports the output format.

What type of microphone is best for noisy environments?
In high noise environments, headset mics are recommended. These mics do a good job of picking up the speech signal while maintaining a reasonable SNR, since they are placed close to the mouth. In applications where headset mics cannot be used, a unidirectional mic or a two-microphone array with beamforming software can effectively reduce the noise.

What type of microphone is best for far field (>1 meter)?
For far-field applications, the combined microphone + codec system needs to have enough sensitivity to pick up distant speech. This will depend on the sensitivity of the mic and the gain of the codec.

What about beam forming? How many microphones are used?
It is possible to run beam forming with two microphones on the FM4. This has been shown to improve the accuracy of distant speech (>5 m) applications. Beam forming software will be released in Q1-2015.

What about AGC?
Automatic Gain Control is usually not recommended for ASR systems, since changes in the gain while an utterance is being spoken can cause problems with the recognition results. If AGC is required, however, it should be implemented so that the impact on the utterance is minimized.

How do I control the microphone gain or line-in gain?
The microphone and line-in gain settings are set in the audio driver code. The data sheet for the WM8731 audio codec can be found at:
http://www.wolfsonmicro.com/documents/uploads/data_sheets/en/WM8731.pdf

How do I use line-in instead of the microphone?
The Speech Evaluation Kit is usually wired to accept a single microphone input. However, it can be easily modified to accept a single line-in. This change also requires a modification to the audio driver. Please contact Spansion Technical Support directly for this option.

Which codec should I use?
The codec must be able to support 16-bit output at a sampling rate of 16 kHz (for each channel). Additional considerations, which depend on the application, are: the number of microphone or line inputs, the codec gain, master or slave operation, power requirements, analog or digital microphone, and SNR. Consult the codec data sheet to verify that the codec fits your application requirements.

Is it possible to use the MCU ADC to sample the microphone? (Is an amplifier circuit required for this?)
Best results are achieved by using an audio codec with 16-bit output resolution and good noise rejection (i.e., 24-bit delta-sigma modulation). The required gain of the audio codec depends on the microphone and the strength of the audio input. It should be chosen so as to provide the maximum signal without clipping.

Are there noise filters available to help eliminate the negative effects of background noise?
Basic noise filtering is built into the ASR engine. More advanced noise reduction techniques, such as beamforming, will be available in Q1-2015.

Is there a wake-up phrase capability?
Yes. The Voice Activity Detector (VAD) API can be used as the first stage of the wake-up process: buzz_bos_detected( buzz_decoder_t *buzz, short *buffer, const int len ). For the second stage of the wake-up process, a dedicated grammar can be run with just the wake-up phrase. When the confidence score is high enough, the full speech detector is started. Note: it is recommended to use additional phrases containing the QQQ model in the wake-up grammar in order to reduce the number of false detections. Spansion can provide technical support in designing robust wake-up phrases and grammars.

What is the expected accuracy?
The expected accuracy is <5% sentence error rate under ideal conditions, meaning a close-talking microphone with a native (unaccented) speaker.

How can I improve the recognition results?
There are a number of steps that can be taken to improve recognition accuracy: 1) optimize the audio path (microphone and codec), 2) add alternate pronunciations for problematic words, 3) reduce the size of the command set or use a hierarchical menu structure, 4) choose commands that sound different from one another.

Does the ASR system understand different accents and dialects?
Yes. The ASR system is very robust over different accents and dialects as a result of the large database used to train the recognizer. In addition, further improvements can be realized by customizing the pronunciation dictionary.

How do I change the pronunciation dictionary?
The Grammar Compiler software that runs on the PC generates a pronunciation dictionary file based on the input phrases. In many cases, improved recognition results can be obtained by adding alternate pronunciations to the dictionary for certain words. Alternate pronunciations can also be used to cover a range of accents or dialects. The pronunciation dictionary is a text file, so it can be modified using any text editor. A table of phoneme pronunciations is provided in Section 8 of the SDK to help in defining the alternate pronunciations. After the pronunciation dictionary has been modified, a new set of Speech Objects is generated by re-running the Grammar Compiler without the dictionary generation option.

Are there certain words or phrases that should be avoided due to similarity to other phrases that might cause either false positives or high error rates?
There is no general rule for this, so the command set should be tested to see if some words are easily confused with others. Words or commands that are longer typically contain more phonetic information and are easier to distinguish from one another.

What value of the "confidence score" parameter should be used? Is there a tuning procedure recommended to calibrate a system for optimal performance?
The confidence score is used to show how well an utterance matches the recognized command. The confidence score is always negative, with a best score of 0. To get a sense of how this parameter responds to various utterances, input multiple utterances into the ASR engine and record the confidence score of each utterance. Use different types of utterances, similar to those that would be found in the application: some in the command list, some outside of the command list, some spoken clearly, some spoken unclearly. Based on these results, set the confidence score threshold to reject the utterances outside of the command list.
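As an illustration, the resulting acceptance check might look like the following sketch. It assumes the decoder object and APIs from Table 3.5; the value -6000 is the close-talking example from Section 3.3 and must be recalibrated for each task as described above:

char const *hyp = buzz_get_hyp(buzz);
int score = buzz_get_conf_score(buzz);
/* Scores are negative; values closer to zero indicate a better match. */
if (hyp != NULL && score > -6000) {
    /* accept: execute the recognized command */
} else {
    /* reject: out-of-grammar, unclear, or low-confidence utterance */
}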
Is there self-learning built into the solution to allow frequently stated commands to be recognized for either faster confirmation or better false positive rejection?
Not in the current release.

What amount of testing has Spansion performed to ensure any word or phrase can be entered and used with a high degree of success?
In low noise environments, the sentence error rate for native speakers of American English is less than 5% on a typical set of 100 commands. The actual error rate will depend on the command list, the environment (i.e., noisy, reverberant, distant speech, close talking, etc.), the amount of front-end processing, and the accents of the speakers. The error rate should be characterized for each application under the target conditions. Spansion provides technical support to optimize the recognition accuracy under challenging conditions.

Why do I sometimes get a NULL recognition result?
A NULL recognition result means that all the potential hypotheses were pruned out. This can occur for a number of reasons, such as: 1) the microphone input is of poor quality (i.e., the signal is too low or clipped), 2) there is a lot of noise in the signal, 3) the spoken phrase is not in the grammar, 4) the VAD circuit picks up noise and starts the decoding, but nothing is said. In cases (1) and (2), the audio path needs to be debugged; a good first step is to record an utterance and play it back. In the last case (4), the bos_threshold should be empirically adjusted so that it is above the noise floor.

Why do some commands take longer to decode?
Some commands may be slow to decode because the recognizer has trouble detecting the end of the utterance. When this happens, the recognizer waits until the timeout is reached before returning a result. Reducing the eos_threshold will make it easier for the recognizer to detect the end of speech. If this value is reduced too far, however, it may detect the end of speech before it actually occurs. Reducing the eos_timeout value will also result in a faster response even when the end of speech is not detected; however, it needs to be set long enough to capture the longest phrase in the grammar. If neither of these fixes solves the problem, the phrase should be changed to make it easier to detect the end.
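For example, the end-of-speech behavior can be adjusted through the parameter object (Table 3.4) before the decoder is initialized. This is a sketch; param is the parameter object created with buzz_init_params(), and the values shown are purely illustrative, not recommendations:

buzz_set_eos_threshold(param, 70);   /* lower than the default of 80: end of speech is detected more easily */
buzz_set_eos_valid(param, 200);      /* ms the end-of-speech condition must hold (default 200 ms) */
buzz_set_eos_timeout(param, 5000);   /* ms; must stay long enough to capture the longest phrase in the grammar */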
What does Spansion recommend a customer do prior to going into mass production to ensure reliable operation in large volume?
Test, test, test. The speech application can initially be evaluated using prerecorded test data with the line-in option. The prerecorded data should be taken in conditions similar to the target environment. Tweaking the ASR parameters for optimal performance should be done at this stage. The final system, with the user microphone and codec, should then be evaluated in the target environment.

What are the power requirements?
The power requirement depends strongly on how the system is operated. In the case where the sleep mode is not used and decoding is performed continuously on a task of 50 commands, the active current is approximately: FM4 MCU (37 mA) + pSRAM (20 mA) + WM8731L codec (7 mA) + electret microphone (0.5 mA) = 65 mA. In the case where the FM4 MCU is running the Voice Activity Detector software only and the pSRAM is in standby, the required current is: FM4 MCU (20 mA) + pSRAM (0.1 mA) + WM8731L codec (7 mA) + microphone (0.5 mA) = 28 mA. In the case where the FM4 MCU is put in the deep sleep mode (VBAT operation, RTC stop), the pSRAM is put into self-refresh mode, and the other components (codec, microphone) are powered down, the standby current is less than 100 µA.

What if I don't use IAR?
Not a problem. Although the SDK uses IAR as an example, any development environment that supports the FM4 S6E2CCAJHAGV20010 will work. Please refer to the Board Support Package for further details.

What are the resource requirements?
See Section 6 of the SDK.

What if the system doesn't respond?
Check that there is power, check that the microphone is connected to the correct input, and press the push-to-talk button before speaking.

Does the Spansion voice solution support authentication?
Speaker identification is planned for release in Q1-2015.

Instead of using a text input method, is there a way to input commands by speaking into the system?
No. The ASR Engine is speaker independent, meaning that all the audio training has been performed by Spansion, so the application developer just needs to type the commands into a file. Other types of speech recognition systems are speaker dependent, meaning that the application developer needs to train the system by collecting many recordings of the command list from people with different accents and in different acoustic environments.

Is there a way for one firmware version to support two or more languages?
Currently the ASR Engine is designed to support one language at a time. Support of multiple languages would require a custom release.

How do I get technical support?
For technical support, visit: http://www.spansion.com/Support/Pages/SolutionsSupport.aspx

10. Major Changes

Page No.  Section  Description
-         N/A      Initial release

Colophon

The products described in this document are designed, developed and manufactured as contemplated for general use, including without limitation, ordinary industrial use, general office use, personal use, and household use, but are not designed, developed and manufactured as contemplated (1) for any use that includes fatal risks or dangers that, unless extremely high safety is secured, could have a serious effect to the public, and could lead directly to death, personal injury, severe physical damage or other loss (i.e., nuclear reaction control in nuclear facility, aircraft flight control, air traffic control, mass transport control, medical life support system, missile launch control in weapon system), or (2) for any use where chance of failure is intolerable (i.e., submersible repeater and artificial satellite). Please note that Spansion will not be liable to you and/or any third party for any claims or damages arising in connection with above-mentioned uses of the products.
Any semiconductor devices have an inherent chance of failure. You must protect against injury, damage or loss from such failures by incorporating safety design measures into your facility and equipment such as redundancy, fire protection, and prevention of over-current levels and other abnormal operating conditions.

If any products described in this document represent goods or technologies subject to certain restrictions on export under the Foreign Exchange and Foreign Trade Law of Japan, the US Export Administration Regulations or the applicable laws of any other country, the prior authorization by the respective government entity will be required for export of those products.

Trademarks and Notice

The contents of this document are subject to change without notice. This document may contain information on a Spansion product under development by Spansion. Spansion reserves the right to change or discontinue work on any product without notice. The information in this document is provided as is without warranty or guarantee of any kind as to its accuracy, completeness, operability, fitness for particular purpose, merchantability, non-infringement of third-party rights, or any other warranty, express, implied, or statutory. Spansion assumes no liability for any damages of any kind arising out of the use of the information in this document.

Copyright © 2015 Spansion. All rights reserved. Spansion®, the Spansion Logo, MirrorBit®, MirrorBit® Eclipse™, ORNAND™, HD-SIM™ and combinations thereof, are trademarks of Spansion LLC in the US and other countries. Other names used are for informational purposes only and may be trademarks of their respective owners.