Speech SDK for SK-FM4-176L-S6E2CC
Speech SDK for SK-FM4-176L-S6E2CC-SE
Speech Software Development Kit
User's Guide
Publication Number S6E2CC_AN709-00009
Revision 1.0
Issue Date February 1, 2015
Introduction
1. Overview of the Speech-Enabled MCU
   1.1 Hardware Features
   1.2 Software Features
   1.3 System Architecture
       1.3.1 Host Controller + Dedicated Speech-Enabled MCU (configuration 1)
       1.3.2 Running the User Application on the Speech-Enabled MCU (configuration 2)
2. Speech Evaluation Kit
3. Building a Speech Application
   3.1 Writing the Grammar
   3.2 Running the Grammar Compiler
   3.3 Speech Processing APIs
4. Sample Code
   4.1 Speech Recognition without VAD
   4.2 Speech Recognition with VAD
5. How to Compile and Link
6. System Resources
7. Audio Codec Driver
8. Phoneme Pronunciations
9. FAQs
10. Major Changes
Introduction
Speech Recognition is an increasingly popular method for human-machine interaction. As a result of
improvements in processing speed, advances in algorithms, and the availability of large speech databases,
usable real-time speech recognition systems are now a reality. It is now possible to build speaker-independent automatic speech recognition (ASR) systems capable of recognizing more than a hundred
commands using an ARM M4 processor (200 MHz) with embedded flash memory and an external pSRAM.
This document provides a detailed description of the hardware and software used in Spansion's Speech-Enabled
ARM M4 MCU, based on the S6E2CCAJHAGV20010.
The goal of this document is to provide the information necessary to:
 Operate the Speech-Enabled M4 starter kit SK-FM4-176L-S6E2CC-SE
 Create new speech applications that run on Spansion’s Speech-Enabled MCU
 Design MCU-based systems that have voice control capability
The remainder of this document is arranged as follows:
 Section 1. describes the hardware and software features of the Speech-Enabled MCU based on the FM4 S6E2CCAJHAGV20010. It also provides a description of the supported system architectures.
 Section 2. describes the Speech Evaluation Kit, which is a development platform based on the SK-FM4-176L-S6E2CC-SE board.
 Section 3. outlines the steps for building a speech application that runs on the Speech-Enabled MCU.
 Section 4. provides sample source code for applications without a voice activity detector (VAD) and with VAD, respectively.
 Section 5. gives detailed directions for compiling and linking; the directions refer to the IAR Embedded Workbench, but similar flows are applicable to other development environments.
 Section 6. provides a description of the resources required for running the speech recognition engine.
 Section 7. contains audio codec driver routines that can be used to initialize and control the codec.
 Section 8. provides the phoneme pronunciations used for this SDK.
 Section 9. is a list of frequently asked questions to help you with your application development.
1. Overview of the Speech-Enabled MCU
This section describes the hardware and software features of the Speech-Enabled MCU based on the FM4
S6E2CCAJHAGV20010. It also provides a description of the supported system architectures.
1.1 Hardware Features
The feature list below summarizes Spansion's Speech-Enabled MCU S6E2CCAJHAGV20010. A block diagram is shown in Figure 1.1. Further details of the S6E2CCAJHAGV20010 are available on Spansion's website:
http://www.spansion.com/products/microcontrollers/32-bit-arm-core/fm4/Pages/S6E2CCAJHAGV20000.aspx
FM4 S6E2CCAJHAGV20010
 ARM Cortex-M4F CPU Core
  – Processor version: r0p1
  – FPU built-in, Support DSP instruction
 Clock
  – Maximum clock frequency: 200 MHz
 Memory
  – Main Flash: 2048 KB (a)
  – RAM: 256 KB (a)
 DSTC: 256 channels
 USB2.0 (Device/Host): 2 channels
 CAN: 2 channels
 CAN-FD: 1 channel
 Watch Counter
 CRC Accelerator
 Multi-function Timer — 3 units (max.)
  – 16-bit free run timer x 3 channels/unit
  – Input capture x 4 channels/unit
  – Output compare x 6 channels/unit
  – A/D activation compare x 6 channels/unit
  – Waveform generator x 3 channels/unit
  – 16-bit PPG timer x 3 channels/unit
 QPRC: 4 channels
 Dual Timer: 1 unit
 Watch Dog Timer: 1 channel (SW) + 1 channel (HW)
 Multi-function Serial Interface: 16 channels (Max.)
  – Selectable from UART/CSIO/LIN/I2C
 I2S: 1 unit
 12-bit A/D Converter: Max. 32 channels (3 units)
 External Bus Interface
 Real Time Clock: 1 unit
 DMA Controller: 8 channels
 Ethernet-MAC
 Base Timer: 16 channels (max.)
 External Interrupt Controller Unit
  – External Interrupt input pin: Max. 32 pins
  – Include one non-maskable interrupt (NMI)
 12-bit D/A Converter: 2 channels (Max.)
 Low Power Consumption Mode
  – Sleep mode/Timer mode/RTC mode/Stop mode/Deep standby RTC mode/Deep standby stop mode supported
 General Purpose I/O port: 100 (Max.)
 Built-in CR
 SD Card Interface: 1 unit
 Unique ID
 Debug
  – Serial Wire JTAG Debug Port (SWJ-DP)
  – Embedded Trace Macrocells (ETM)
 Low Voltage Detector
 Clock Supervisor
 Power Supply: 2.7 to 5.5 V

a. See Section 6. for details on resource requirements
Figure 1.1 Block Diagram of the Speech-Enabled MCU Based on Spansion S6E2CCAJHAGV20010
In systems using voice control, not all of the resources of the S6E2CCAJHAGV20010 are available to the
user. This needs to be accounted for when developing application code and architecting a system.
Specifically, the I2S port is reserved for connecting to the codec, and the external bus interface is used to
connect to the pSRAM. In addition, the speech recognition software requires approximately 600 KB of internal
flash memory, leaving approximately 1,400 KB for the user application. Detailed descriptions of the software
and system architecture are given in Section 1.2 and Section 1.3.
1.2 Software Features
The software that is bundled with Spansion's Speech-Enabled M4 MCU is a state-of-the-art speaker-independent automatic speech recognition program. This speech recognition software has been optimized to
fit in a limited memory footprint and to run in real time with high accuracy on the S6E2CCAJHAGV20010. It is
ideally suited for command and control applications. The main features are:
 Provides Speaker Independent Voice Control for Spansion's Speech-Enabled M4 MCU
– Over 100 commands can be decoded in real time and with high accuracy
– User defined commands are entered in an intuitive manner using a text file (no audio input or training
required)
– Windows-based software is provided to convert the user commands (text file) into M4 MCU library
objects
– APIs are provided that are used to configure the speech recognition engine and call the runtime
libraries
– Audio drivers are provided to interface with the codec
 Noise Robust Front End
– Built-in noise reduction
– Beam forming (Optional)
 Multiple Language Support
– English, German, Chinese, and Japanese are currently available
– Spanish and French are under development
 Voice Activity Detection
– Speech recognition can be initiated using the voice activity detection API or using the push-to-talk
option
 Dynamic grammar
– The speech application can store multiple grammar files that can be used to implement a hierarchical
command structure
1.3 System Architecture
There are two supported architectural configurations for the Speech-Enabled MCU. In the first configuration,
the user application runs on a host processor and the Speech-Enabled MCU is used for voice recognition
only. In the second configuration, both the host application and the speech recognition software run on the
Speech-Enabled MCU.
1.3.1 Host Controller + Dedicated Speech-Enabled MCU (configuration 1)
The first configuration, where the user application runs on a host processor and the Speech-Enabled MCU is
used for voice recognition only, is shown in Figure 1.2. Communication between the host processor and the
Speech-Enabled MCU happens over an SPI interface. The advantage of this architecture is its modularity.
With minimal effort, a system can be configured to run with or without a speech recognition interface.
Additionally, resource contention between the host application and the speech recognition software is
eliminated, as the speech recognition software runs on a dedicated MCU. This flexibility allows systems to
easily add a voice control option.
Figure 1.2 System Architecture with a Speech-Enabled MCU (configuration 1)
The software architecture for the host and Speech-Enabled MCU is shown in Figure 1.3. The host uses the
speech client module to send commands (via the ASR API) to the MCU, which performs the speech
recognition and returns the results. The only software component required to be installed on the host
processor is the speech client module.
Figure 1.3 Software Architecture for a Speech-Enabled MCU (configuration 1)
    Host SW components: User Applications, Speech Client Module
    Speech-Enabled MCU SW components: ASR Engine, Drivers, Speech Objects, Speech Server Module
Figure 1.4 shows the interface diagram for the Speech-Enabled FM4 MCU with the Host processor. Spansion
recommends using CLK Out from the Host processor as the CLK input for the FM4 MCU.
The audio codec interfaces with the MCU through the I2S port. Communications and
data transfer between the Host and the FM4 MCU are done using an SPI port. An external 2 MB pSRAM is
used as the main memory of the ASR engine and is connected to the External Memory Bus of the MCU. As
shown in Figure 1.4 (configuration 1), an interrupt signal is provided by the Speech-Enabled MCU to signal
when a decoding result is ready.
Figure 1.4 Speech-Enabled FM4 MCU interface (configuration 1)
1.3.2 Running the User Application on the Speech-Enabled MCU (configuration 2)
In the second configuration, both the host application and the speech recognition software run on the
Speech-Enabled MCU (Figure 1.5). The advantage of this architecture is that it reduces the total number of
system components and hence the BOM. The drawback is that the user application and the speech software
now use the same MCU resources, which must be considered when designing a voice controlled system.
Figure 1.5 System Architecture with a Speech-Enabled MCU (configuration 2)
The software architecture is shown in Figure 1.6. The left hand side of the figure shows the Speech Objects
(Acoustic Model, Dictionary, and Grammar), which are generated from the set of user defined commands.
The process of compiling the user defined commands into the Speech Objects is done off-line using the
software tools provided with the Speech-Enabled MCU. This process is described in more detail in
Section 3., Building a Speech Application. The right hand side of Figure 1.6 shows the software
hierarchy that runs on the Speech-Enabled MCU. The audio driver receives data from the codec and stores it
in memory. The Speech Recognition software (ASR engine) then accesses this data and finds the user
defined command (from the Speech Objects) that best matches the audio data. This hypothesis is then
passed to the application as the recognition result. The user application interacts with the ASR Engine and
the Audio Driver through a set of APIs. The APIs are described in Section 3.3, Speech Processing APIs.
Figure 1.6 Software Architecture with a Speech-Enabled MCU (configuration 2)
2. Speech Evaluation Kit
The Speech Evaluation Kit is a development platform based on the SK-FM4-176L-S6E2CC-SE board. It
contains a Speech-Enabled MCU (S6E2CCAJHAGV20010), a codec, a 2MB pSRAM, a microphone, and a
USB cable for connection to a PC. The Speech Evaluation Kit comes preloaded with a speech recognition
application and the software tools required to create new speech applications. A block diagram of the Speech
Evaluation Kit is shown in Figure 2.1, and a description of the SK-FM4-176L-S6E2CC-SE features is shown
below. A full description of the Speech Evaluation Kit can be found in the SK-FM4-176L-S6E2CC-SE Board
Support Package document.
Figure 2.1 Block Diagram of the Speech Evaluation Kit SK-FM4-176L-S6E2CC-SE
Features of the SK-FM4-176L-S6E2CC-SE
 Spansion FM4 Family S6E2CCAJHAGV20010 MCU, ARM Cortex™-M4F (DSP/FPU) Core
 2 MB Flash (a), 256 KB RAM on-chip
 LQFP176 Package
 IEEE 802.3 Ethernet RJ45
 USB Micro Type-B connector x1
 On board ICE (CMSIS-DAP)
 Flash Memory up to 32 Mbit, S25FL032P (via Quad SPI)
 PSRAM Memory up to 32 Mbit, SV6P1615UFC (via Ext Bus)
 RGB LED (via GPIO or PWM dimming)
 Acceleration Sensor (via I2C and INT)
 Phototransistor (via A/D Converter)
 Push Button (via NMIX)
 Arduino Compatible I/F
 Reset Button
 Power Supply LED
 User Setting Jumpers
 6-pin JTAG I/F (supports SWD only)
 CMSIS-DAP USB bus power, USB Device bus power (selected by jumper)
 Stereo Codec WM8731 (via I2S)

a. See Section 6. for details on resource requirements
3. Building a Speech Application
The steps for building a speech application that runs on the Speech-Enabled MCU are outlined below.
Spansion can provide support with each of these steps and can also build the entire speech application if
needed.
1. Write the list of commands (grammar) specific to the application. Examples of grammars are
shown below.
2. Run the Grammar Compiler, provided by Spansion, which takes the list of commands and
generates the Speech Objects [.c].
3. Write the application program using the ASR APIs. A sample application is provided with the SDK
software.
4. For configuration 1: build the host executable by compiling the application and linking with the
speech client module. Build the executable for the Speech-Enabled MCU by linking the Speech
Objects, the ASR engine, the speech server module, and the drivers.
5. For configuration 2: build the executable for the Speech-Enabled MCU by compiling the application
and linking with the ASR engine, the Speech Objects, and the drivers.
A sample IAR project folder is provided with the SDK software.
3.1 Writing the Grammar
The list of user defined commands for an application is called the grammar. The grammar compiler tool is
provided with the SDK software and supports standard Java Speech Grammar Format except for tags,
quoted tokens / terminals and grammar import. The JSGF format is described in detail in the link below.
http://www.w3.org/TR/2000/NOTE-jsgf-20000605/
Below are two examples of grammar files, the first consisting of a simple set of commands for calling a list of
10 contacts, and the second consisting of a slightly more complicated set of commands for recognizing a
telephone number in the '408' area code. As can be seen from these examples, the JSGF grammar provides
a simple and compact formalism for writing command and control applications. Note: it is recommended to
use the windows text editor Notepad when writing the grammar file.
Grammar Example 1 – Call Contacts
/**Begin Grammar*/
#JSGF V1.0;
grammar SPSNCALL;
public <command> = CALL CONTACT <name>;
<name>=
(
FREDERIC LAWHON |
TRANG MARCOUX |
MIREILLE CATALAN |
CAMERON MANGO |
STACIE PELLEGRINO |
GAIL CURCIO |
MERNA LAMBERSON |
LIA STOTT |
KATHERIN BRECK |
KIMBERLI MUCK
);
/**End Grammar*/
Grammar Example 2 – Telephone Numbers
/**Begin Grammar*/
#JSGF V1.0;
grammar SPSN_DIAL;
public <command> = DIAL <phonenumber>;
<phonenumber> = <areacode><number>;
<areacode>= FOUR ZERO EIGHT;
<number>= <digit><digit><digit> <digit><digit><digit><digit>;
<digit> = ZERO | ONE | TWO | THREE | FOUR | FIVE | SIX | SEVEN | EIGHT | NINE;
/**End Grammar*/
3.2 Running the Grammar Compiler
The Grammar Compiler is a PC application provided with the Speech-Enabled MCU. It takes the user’s JSGF
grammar file as input and generates the Speech Object file that contains:
 Finite State Grammar
 Dictionary
 Acoustic Model
The Finite State Grammar contained in the Speech Object file is a compact and highly optimized version of
the input JSGF grammar file, using the method of Finite State Transducers. It is used during the search
phase of the speech recognition process to generate all possible sequences of words allowed by the
grammar. A description of this technique is provided in:
http://www.cs.nyu.edu/~mohri/postscript/hbka.pdf
The Dictionary contained in the Speech Object file is a text file containing the phonetic representation of all
the words used in the command set. It is generated using a variety of techniques including look up tables and
grapheme-to-phoneme conversion algorithms. The Dictionary is also used during the search phase of the
speech recognition process to generate all possible sequences of phonemes (sound units) allowed by the
grammar. An example of the Dictionary corresponding to the Call Contacts grammar is shown in the
Dictionary Example 1.
Dictionary Example 1 – Call Contacts

CALL           K AO L
CONTACT        K AA N T AE K T
BRECK          B R EH K
CAMERON        K AE M ER AH N
CATALAN        K AE T AH L AH N
CURCIO         K UH R CH IY OW
FREDERIC       F R EH D R IH K
GAIL           G EY L
KATHERIN       K AE TH AH R AH N
KATHERIN(2)    K AE TH R IH N
KIMBERLI       K IH M B AH R L IY
LAMBERSON      L AE M B ER S AH N
LAWHON         L AO HH AH N
LIA            L IY AH
MANGO          M AE NG G OW
MARCOUX        M AA R K UW
MERNA          M EH R N AH
MIREILLE       M AH R IY L
MUCK           M AH K
PELLEGRINO     P EH L EH G R IY N OW
STACIE         S T AE S IY
STOTT          S T AA T
TRANG          T R AE NG
The Dictionary can support multiple pronunciations for each word, as is the case for KATHERIN and
KATHERIN(2) in the example above. This file can also be customized by the user to add pronunciations that
are not automatically generated. Manually editing the pronunciation dictionary can significantly improve
recognition accuracy for unusual pronunciations and for different accents and dialects. A list of the phoneme
tables that are used for English, Chinese, and German is shown in Section 8.
The Acoustic Model contained in the Speech Object file is a set of parameters providing the mathematical
descriptions of the sound units. These models are trained using hundreds of hours of audio recordings from
many speakers in different settings. Each language has a unique Acoustic Model. Spansion currently
provides Acoustic Models for English, Chinese, Japanese and German. Additional languages such as
Spanish, French, Italian, and Korean are under development.
After we define the grammar, we are ready to produce a speech object file that covers only a specific voice
recognition task. Spansion provides sample scripts for speech object production. Figure 3.1 shows the
structure of a folder that contains the DOS batch scripts as well as necessary binary files. For example, the
speech object file for the English voice dialing application can be generated by executing the DOS batch
script, gen_enUS_voice_dialing.bat in the package. The sample script runs three executable programs:
 buzz_optfsg_gen.exe — converts the JSGF into the compact finite state grammar (FSG) file and generates
a vocabulary list,
 buzz_dict_compile.exe — generates the pronunciations,
 buzz_speechobj_gen.exe — integrates the grammar, pronunciation dictionary and the set of the acoustic
model parameters into the speech object.
Figure 3.2 shows a flow chart of the speech object generation process.
Figure 3.1 Folder Structure of Speech Object Generation Package

gen_speech_object
    acoustic_model                  Sound unit files for each language
    windows                         Executable programs and libraries
    sample_grammar
        enUS_speech_dial            Sample JSGF file for the voice dialing application
        enUS_tvcmd                  Sample JSGF and pronunciation files for the TV control task
    out_enUS_tvdemo                 Speech object samples generated by the tool
    out_speech_dial                 Speech object samples generated by the tool
    gen_enUS_voice_dialing.bat      Sample DOS batch script for the voice dialing application
    gen_enUS_tvcmd.bat              Sample DOS batch script for the TV control task
Figure 3.2 Flow Chart of Speech Object Production
Here, we explain each program invoked in gen_enUS_voice_dialing.bat. The first program converts the JSGF
file (sample_grammar/enUS_speech_dial/speech_dial.jsgf) into the compact FSG representation by
executing the commands in Table 3.1.
Table 3.1 Generating the optimized grammar file

Input File:
    sample_grammar/enUS_speech_dial/speech_dial.jsgf
        Defines the grammar. Created by the user.
Output Files:
    sample_grammar/enUS_speech_dial/speech_dial.list
        List of vocabulary
    sample_grammar/enUS_speech_dial/speech_dial.fsg
        Optimized grammar
Command:
    windows/bin/buzz_optfsg_gen.exe \
        -list sample_grammar/enUS_speech_dial/speech_dial.list \
        sample_grammar/enUS_speech_dial/speech_dial.jsgf \
        sample_grammar/enUS_speech_dial/speech_dial.fsg
Note that all files must be in plain text format.
Spansion's JSGF compiler produces an optimized FSG file (speech_dial.fsg) in order to achieve the smallest
search space for voice recognition. It also creates a list of vocabulary used in the JSGF file (speech_dial.list).
Table 3.2 Generating the Pronunciations

Input Files:
    sample_grammar/enUS_speech_dial/speech_dial.list
        List of vocabulary. Generated by buzz_optfsg_gen.exe
    windows/bin/english/reference.dict
        English reference dictionary provided as part of the SDK
Output File:
    sample_grammar/enUS_speech_dial/speech_dial.dict
        Pronunciation dictionary (pairs of words and pronunciations)
Command:
    windows/bin/buzz_dict_compile.exe \
        -d windows/bin/english/reference.dict \
        -i sample_grammar/enUS_speech_dial/speech_dial.list \
        -o sample_grammar/enUS_speech_dial/speech_dial.dict
The next program generates the pronunciations for the vocabulary by executing the commands in Table 3.2.
The dictionary file, speech_dial.dict, contains pairs of words and pronunciations. Note that automatic
pronunciation tools will not always produce the expected pronunciations, so developers need to review the
pronunciations in the speech_dial.dict file, correct them where necessary, or add alternate
pronunciations.
The last program takes the pronunciation dictionary (speech_dial.dict) and the optimum FSG file
(speech_dial.fsg) and generates the speech object as follows:
Table 3.3 Generating the Speech Objects

Input Files:
    acoustic_model/enUS
        English acoustic model (HMM) provided as part of the SDK
    speech_dial.dict
        Pronunciation dictionary. Produced by buzz_dict_compile.exe
    speech_dial.fsg
        Optimized grammar. Produced by buzz_optfsg_gen.exe
Output File:
    out_enUS_speech_dial/speech_objects.c
        Speech objects file
Command:
    windows/bin/buzz_speechobj_gen.exe \
        -hmm acoustic_model/enUS \
        -dict sample_grammar/enUS_speech_dial/speech_dial.dict \
        -fsg sample_grammar/enUS_speech_dial/speech_dial.fsg \
        -outdir out_enUS_speech_dial
The final result of the grammar compilation process is the speech_objects.c file. Replace the
speech_objects.c file in IAR with your new speech_objects.c file and compile to produce a new .srec file that
is to be programmed in the flash.
In a small vocabulary application, the ASR may produce incorrect matches when a word that is outside the
grammar is spoken. To reduce this kind of false match, there is a keyword provided (QQQ) that loosely
matches to every phone, but is not strongly associated with any phone. An example would be a thermostat
that wakes up to the phrase "Hello Thermostat", but also incorrectly wakes up to "Hello Thomas". "Hello
QQQ" can be added to the .jsgf file. "Hello QQQ" will now produce a stronger match to "Hello Thomas" than
"Hello Thermostat" does, and hence the false detection of "Hello Thermostat" no longer occurs when "Hello
Thomas" is spoken. The main use for this is wake-up words that need very low false positive rates.
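As an illustration of this technique, a wake-up grammar for the thermostat example might look like the sketch below; the grammar name SPSN_WAKEUP is illustrative and simply follows the format of the examples in Section 3.1.

/**Begin Grammar*/
#JSGF V1.0;
grammar SPSN_WAKEUP;
public <command> = HELLO ( THERMOSTAT | QQQ );
/**End Grammar*/

At runtime the application would then treat a result containing QQQ as a rejection rather than a valid wake-up.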
3.3 Speech Processing APIs
At runtime, the user application interacts with the ASR engine through the set of ASR APIs. The basic set of
APIs for configuring the ASR engine and for processing speech are shown in Table 3.4 and Table 3.5,
respectively. The speech processing APIs for the two architectural configurations are identical. An additional
set of APIs for configuration 1 only, is shown in Table 3.6. These additional APIs provide a way for the host
processor to configure the Speech-Enabled MCU and external chips, to update the Speech Objects, and to
transfer audio data between the two processors.
The first step in configuring the ASR engine is to create and initialize a parameter object by calling
buzz_init_params(). This command configures the ASR engine with the default values of the parameter
object. To achieve optimal performance for particular applications, however, the values of the parameter
object need to be tuned. This can be done with the APIs listed in Table 3.4.
Many of the configuration parameters in Table 3.4 control the search algorithm used for matching the spoken
phrase to the Speech Object. This is done by adjusting the pruning thresholds at various points of the search.
Relaxing the pruning thresholds allows more hypotheses to be evaluated, increasing the recognition accuracy
at the expense of more computation. Tightening the pruning thresholds, on the other hand, restricts the
number of potential matches to be evaluated but speeds up the computation. Another way to control the
search algorithm is to directly specify the number of hypotheses that are evaluated at each point. This puts a
limit on the maximum number of hypotheses that are evaluated in each time frame and is a convenient way to
control the maximum heap memory size. Other configuration parameters in Table 3.4 include the
bos_threshold and bos_timeout, which adjust when the beginning of speech is detected; the eos_threshold,
eos_valid, and eos_timeout, which adjust when the end of speech is detected; and parameters for selecting
which Speech Object is used during recognition.
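As a hedged illustration of this tuning step, the fragment below creates a parameter object and adjusts a few of the parameters from Table 3.4. The specific values and the grammar name "SPSNCALL" (taken from Grammar Example 1) are illustrative only, not recommendations.

buzz_param_list_t *param = buzz_init_params();  /* parameter object filled with default values */

buzz_set_param_max_hmms_per_frm(param, 500);    /* limit HMM-level hypotheses per frame   */
buzz_set_param_max_wrds_per_frm(param, 100);    /* limit word-level hypotheses per frame  */
buzz_set_bos_threshold(param, 3.5f);            /* beginning-of-speech sensitivity        */
buzz_set_eos_threshold(param, 80);              /* end-of-speech sensitivity (0 to 100)   */
buzz_set_eos_valid(param, 200);                 /* eos must hold for 200 ms               */
buzz_set_language(param, "English");
buzz_set_grammar(param, "SPSNCALL");            /* select a grammar object by name        */

The tuned object is then passed to buzz_decode_init(), as described below, and released with buzz_free_params() when it is no longer needed.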
For processing speech input, the set of APIs in Table 3.5 are used. The first step is to create a Buzz voice
recognizer (VR) object, buzz_decoder_t *buzz. The VR object can be initialized by passing the VR and
parameter objects to buzz_decode_init( &buzz, param ). After the initialization, buzz_decode_start( buzz )
has to be called to start speech decoding. An array containing audio samples is fed to the Buzz VR object by
calling buzz_process_audio( buzz, buf, CB_AD_BUFSIZE ). This process has to be repeated for all the
samples of audio or until end of speech is detected by the buzz_eos_detected() function.
The buzz_decode_finish(buzz) function needs to be called to finish the decoding process and generate the
hypothesis. The recognition result can be retrieved with buzz_get_hyp( buzz ). The confidence score for the
hypothesis can be obtained by calling buzz_get_conf_score(). This confidence score is a combination of
acoustic scores and language scores for the recognized hypothesis and is always a negative value. Scores
closer to zero are better. The application can use this confidence score to accept or reject a hypothesis by
comparing it against an empirically derived threshold for a specific task. For close talking microphones, it was
found that a threshold of -6000 works reasonably well. Note that the recognized hypothesis might be NULL
when the confidence score is too low. Sample code that shows how the various APIs are integrated together
to create a speech application is given in Section 4.
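The fragment below is a hedged, end-to-end sketch of the sequence just described. Here "param" is the parameter object configured above, the buffer handling (buf, CB_AD_BUFSIZE) is a placeholder for the audio driver code, and complete, buildable samples ship with the SDK (see Section 4).

buzz_decoder_t *buzz = NULL;
short buf[CB_AD_BUFSIZE];              /* placeholder; filled by the audio driver             */
int is_eos = 0;

buzz_decode_init(&buzz, param);        /* create the VR object with the configured parameters */
buzz_decode_start(buzz);               /* must be called before each utterance                */

while (!is_eos) {
    /* copy the next CB_AD_BUFSIZE audio samples into buf here */
    buzz_process_audio(buzz, buf, CB_AD_BUFSIZE, &is_eos);
}

buzz_decode_finish(buzz);              /* finish decoding and generate the hypothesis */

char const *hyp = buzz_get_hyp(buzz);  /* may be NULL if every hypothesis was pruned  */
int score = buzz_get_conf_score(buzz); /* always negative; closer to zero is better   */
if (hyp != NULL && score > -6000) {    /* -6000: close-talking threshold noted above  */
    /* accept the recognized command */
}

buzz_decode_free(buzz);                /* release the VR object */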
To achieve good recognition performance, it is crucial to detect when voice activity starts and ends. The start
of speech can be easily determined when a push-to-talk button is used. In applications where a push-to-talk
button is not available, however, the ASR engine also provides an automatic voice activity detector (VAD).
Section 4.2 shows another sample code using the voice activity detector (VAD). In this example
buzz_bos_detected() is used to determine when speech begins. When buzz_bos_detected() returns
VA_DEFINITELY_SPEECH or VA_PERHAPS_SPEECH, then processing can start. After
buzz_eos_detected() returns 1 (one or more times), processing can be finished and the recognition result
obtained.
Table 3.4 Buzz API for Parameter Configuration

buzz_param_list_t *buzz_init_params()
    Create a parameter object to configure parameters of the Buzz voice recognizer.

void buzz_free_params(buzz_param_list_t *param)
    Free the parameter object.

void buzz_set_param_hmm_beam_init(buzz_param_list_t *param, double value)
    Set the beam pruning threshold (likelihood basis) to adjust the search space at the HMM level. (Default value: 1e-48)

void buzz_set_param_word_beam_init(buzz_param_list_t *param, double value)
    Set the beam pruning threshold (likelihood basis) to control the word search space. (Default value: 1e-48)

void buzz_set_param_max_hmms_per_frm(buzz_param_list_t *param, int num)
    Set the beam width at the HMM level. The default value is 500, which should cover 4000-word recognition tasks.

void buzz_set_param_max_wrds_per_frm(buzz_param_list_t *param, int num)
    Set the beam width at the word level. The default value is 100.

void buzz_set_eos_threshold(buzz_param_list_t *param, int zero2hundred_value)
    Set the threshold used to detect end of speech. This parameter can be set from 0 to 100, where lower values detect end of speech more easily. (Default value is 80)

void buzz_set_eos_valid(buzz_param_list_t *param, int millisecond)
    Set how long eos_detected needs to be valid before the end of utterance is declared. (Default value is 200 ms)

void buzz_set_eos_timeout(buzz_param_list_t *param, int millisecond)
    Set the maximum length of an utterance, starting from when the beginning of speech is detected.

void buzz_set_bos_timeout(buzz_param_list_t *param, int millisecond)
    Set the maximum time to wait for speech after push-to-talk is pressed, before terminating. (Default value is 60 seconds)

void buzz_set_bos_threshold(buzz_param_list_t *param, float num)
    Set the threshold value to detect the beginning of speech / voice activity. (Default value is 3.5) Lower values are more sensitive to audio input and should be used in quiet conditions. Higher values should be used in noisy environments.

void buzz_set_vad_ramp_up(buzz_param_list_t *param, int samples)
    Set the ramp-up samples (default: 8192, approximately 0.5 sec). This parameter sets the number of samples to wait after the codec is reset before processing data.

void buzz_set_vad_previous_samples(buzz_param_list_t *param, int samples)
    Set the previous samples (default: 8192, approximately 0.5 sec). This parameter sets the number of samples to process before the beginning of speech is detected.

void buzz_set_language(buzz_param_list_t *param, char const *name)
    Set the language for speech recognition. (Default value is English) Current options include English, Mandarin, German, and Japanese.

void buzz_set_grammar(buzz_param_list_t *param, char const *name)
    Select grammar object by name.
Table 3.5 Buzz API for Speech Processing

buzz_decoder_t *buzz_decode_init(buzz_decoder_t **pbuzz, const buzz_param_list_t *parameters)
    Initialize a Buzz voice recognizer (VR) object with the parameters configured as described in Table 3.4. Note that you have to feed speech samples to buzz_process_audio() when you process audio samples.

int buzz_decode_start(buzz_decoder_t *buzz)
    This API has to be called before each utterance.

int buzz_process_audio(buzz_decoder_t *buzz, short *buffer, const int len, int *is_eos)
    Process a block of incoming speech signals stored in "buffer". "len" denotes the number of samples in the buffer. The function also updates the is_eos value (1 if the end of speech is detected, 2 if the eos detection timed out, and 0 if the eos is not detected).

int buzz_decode_finish(buzz_decoder_t *buzz)
    This API has to be called at the end of the utterance.

int buzz_bos_detected(buzz_decoder_t *buzz, short *buffer, const int len)
    The function returns 1 if the beginning of speech (bos) is detected, 2 if the bos wait has timed out, and 0 otherwise. "len" denotes the number of samples in the buffer.

void buzz_vad_previous_samples(buzz_decoder_t *buzz, int *samples)
    The function updates the current samples with the previous samples.

int buzz_eos_detected(buzz_decoder_t *buzz)
    The API returns 1 if the end of speech is detected. Otherwise, it returns 0.

char const *buzz_get_hyp(buzz_decoder_t *buzz)
    Return the pointer to a recognition result.

int buzz_get_conf_score(buzz_decoder_t *buzz)
    Return a confidence score associated with the recognition result obtained with buzz_get_hyp().

void buzz_decode_free(buzz_decoder_t *buzz)
    Free the Buzz VR object initialized with buzz_decode_init().
Table 3.6 Buzz API extensions for Configuration 1 (Note: APIs in this table are not implemented yet)

void buzz_codec_command(buzz_param_list_t *param, int num)
    Tells the Speech-Enabled MCU to send a command to the codec.

void buzz_sram_command(buzz_param_list_t *param, int num)
    Tells the Speech-Enabled MCU to send a command to the SRAM.

void buzz_data_transfer(buzz_param_list_t *param, short *buffer, const int len)
    Transfers a block of audio data from the host processor to the Speech-Enabled MCU.

void buzz_update_speech_object(buzz_decoder_t *buzz, const int len)
    Transfers and updates the Speech Object in the Speech-Enabled MCU.
4. Sample Code
Section 4.1 and Section 4.2 provide sample source code for applications without VAD and with VAD,
respectively. For details please refer to the source code.
4.1 Speech Recognition without VAD
do {
    if (Audio_GetBufAddr(rxsample, &rxbufaddr) < 0) break;
    buzz_process_audio(buzz, (short *)rxbufaddr, SAMPLE_SIZE, &is_eos);
    if (is_eos > 0) break;
    rxsample += SAMPLE_SIZE; // sample data
    Audio_CircularBuf(&rxsample);
} while (rxsample < AUDIO_CIRBUFSIZE); // circular buffer size
4.2 Speech Recognition with VAD
do {
    if (Audio_GetBufAddr(rxsample, &rxbufaddr) < 0) break;
    if (is_bos == 0) {
        is_bos = buzz_bos_detected(buzz, (short *)rxbufaddr, SAMPLE_SIZE);
        if (is_bos == 1) {
            buzz_vad_previous_samples(buzz, &rxsample);
        } else if (is_bos == 2) break;
    } else {
        buzz_process_audio(buzz, (short *)rxbufaddr, SAMPLE_SIZE, &is_eos);
        if (is_eos > 0) break;
    } // end if is_bos
    rxsample += SAMPLE_SIZE; // sample data
    Audio_CircularBuf(&rxsample);
} while (rxsample < AUDIO_CIRBUFSIZE); // circular buffer size
5. How to Compile and Link
After the user commands have been compiled into the speech_objects.c file, using the Grammar Compiler
described in Section 3.2, this file must be compiled and linked with the user application. The following
detailed directions refer to the IAR Embedded Workbench; however, similar flows are applicable to other
development environments.
The first step is to open the S6E2CC_pdl project file using the IAR Embedded Workbench (see Figure 5.1).
The downloaded project file contains the English TV Demo. The Speech Objects for the English TV Demo
and the Chinese TV Demo have been placed in the gen_speech_object folder. Speech Objects for additional
tasks can be added to this folder by first running the Grammar Compiler script (Section 3.2).
Figure 5.1 Directory Structure of Board Example
S6E2CC_pdl
    Common              Source files for FM4
    Speech_app
        Audio
        buzz_live_mcu
    template            MCU template structure
Open the template folder (see Figure 5.2) and select the template\IAR\s6e2cc.eww file by double clicking.
This template file corresponds to the English TV Demo. Additional template files can be added to this folder
by copying and renaming existing template files to this folder.
After replacing the speech_objects.c file in the IAR template with a newly generated file, the Rebuild All
function needs to be used for compiling and linking with the ASR engine, the drivers, and the Speech Object.
Figure 5.2 Directory Structure of the Template Folder
template
    IAR                            Example projects for IAR Embedded Workbench, including startup file, linker and compiler settings
        s6e2cc_pdl.eww             IAR Embedded Workbench Workspace File
        s6e2cc_release\exe\*.srec  Compiled and linked firmware
    Source                         Example Source Files
    Readme.txt
To burn the image into flash, these steps must be followed:
1. Download the FLASH MCU Programmer for FM0+/FM3/FM4 (serial version) from:
http://www.spansion.com/support/microcontrollers/developmentenvironment/Pages/mcu-download.aspx
2. Connect the USB cable from the laptop to the s6e2cc demo board.
3. Connect the jumper J2 on the board to enter the programming mode. Press the reset button on the
s6e2cc demo board.
4. Open the serial Flash MCU Programmer for FM0+/FM3/FM4, and start programming.
5. When the programming is complete remove jumper J2 and press the reset button on the s6e2cc
board.
6. The demo is now ready to run.
6. System Resources
The Speech-Enabled version of the FM4 MCU described in this SDK is the S6E2CCAJHAGV20010. Some of
the resources of this MCU are used for running the speech recognition engine, such as the embedded flash
that stores the firmware, as well as the CPU and RAM during runtime. Resources not used for the speech
recognition engine are available for the user application. This section provides a description of the resources
required for running the speech recognition engine.
Flash requirements: all firmware is stored in the embedded flash. The firmware consists of the Speech
Recognition Engine, which uses 320 KB of flash, and the Speech Object whose size depends on the number
of recognition phrases. A plot of the Speech Object size vs the number of commands is shown in Figure 6.1.
Note: the exact size of the Speech Object will vary depending on the phrases used.
Figure 6.1 Speech Object Size vs. Number of Commands. An additional 320 KB of flash is required for the
Speech Recognition Engine.
CPU Requirements: the CPU load during decoding is shown in Figure 6.2. The CPU load is defined as: (total
processing time) / (duration of utterance). For a smaller number of recognition phrases (commands) the CPU
requirements are reduced. This characterization data was taken with a CPU clock of 200 MHz. The pruning
thresholds were optimized for the case of 100 commands and these values were used for all measurements.
Further reductions in the CPU load at smaller list sizes can be achieved by optimizing the pruning thresholds
for these cases. Running the Voice Activity Detector only requires a CPU load of 0.2.
Figure 6.2 CPU Running at 200 MHz. Pruning threshold is constant over all list sizes
Stack Requirements: The Speech Recognition Engine uses 1 KB for CSTACK.
RAM Requirements: 80 KB of internal SRAM and 1 MB of external pSRAM are currently required for all command
list sizes.
External Ports: running the Speech Recognition Engine requires a 1 MB external pSRAM (2 MB are
recommended) and an audio codec. The pSRAM is connected to the FM4 MCU using the external bus
interface. The codec is connected using the I2S port. Schematics are provided in the board support package.
7. Audio Codec Driver
The Speech Evaluation Kit uses the Wolfson WM8731 audio codec to input a mono microphone signal or a
line-in input. The following routines can be used to initialize and control this codec:
1. Audio_Init(); // This function is used to initialize the FM4 I2C port connection for the audio codec
parameter configuration, and also to initialize the FM4 I2S port connection for getting the audio
data from the codec. It is also used to allocate memory for the audio data stream.
2. Audio_Activate(); // This function is used to activate the FM4 I2S controller and codec to start
receiving audio data after the push-to-talk button is pressed.
3. Audio_GetSample(); // This function gets the audio sample data from the codec and waits until
the audio sample data is ready.
4. Audio_DeInit(); // This function is used to disable the FM4 I2S controller, deactivate the audio
codec, and free the memory allocation used for the audio codec.
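A hedged usage sketch of this flow is shown below. It is only a fragment; the routines are called exactly as listed above, and their full declarations are in the audio driver sources provided with the Speech Evaluation Kit.

Audio_Init();       /* configure the codec over I2C, set up the I2S port, allocate audio buffers */
Audio_Activate();   /* start the I2S controller and codec, e.g. after push-to-talk is pressed    */

/* ...repeatedly fetch audio data and pass it to the ASR engine (see Section 4)... */
Audio_GetSample();  /* waits until the next block of audio sample data is ready */

Audio_DeInit();     /* disable the I2S controller, deactivate the codec, free the buffers */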
The configuration settings used for the WM8731 can be found in the application code provided with the
Speech Evaluation Kit. They are also shown below for reference:
Audio_Write(WM8731_REG_RESET, _WM8731_Reset);     // Reset module
Audio_Write(WM8731_REG_LLINE_IN, 0x19f);          // Left line in settings
Audio_Write(WM8731_REG_RLINE_IN, 0x19f);          // Right line in settings
Audio_Write(WM8731_REG_LHPHONE_OUT, 0x1ff);       // Left headphone out settings
Audio_Write(WM8731_REG_RHPHONE_OUT, 0x1ff);       // Right headphone out settings
Audio_Write(WM8731_REG_ANALOG_PATH, 0x25);        // Analog paths
Audio_Write(WM8731_REG_DIGITAL_PATH, 0x02);       // Digital paths
Audio_Write(WM8731_REG_PDOWN_CTRL, 0x00);         // Power down control
Audio_Write(WM8731_REG_DIGITAL_IF, 0xc1);         // Digital interface
Audio_Write(WM8731_REG_SAMPLING_CTRL, 0x58);      // Sampling control
The digital interface settings correspond to:
 MSB-first, left justified
 16 bits
 Master mode
 Bit clock inverted
The sampling control settings correspond to:
 Normal mode (256 fs)
 Sampling rate 32 kHz
8. Phoneme Pronunciations

Phoneme    Example
AA         g(o)t
AE         c(a)t
AH         (a)llow, c(u)t
AO         f(a)ll
AW         f(ou)l
AY         f(i)le
B          (b)it
CH         cat(ch)
D          (d)ig
DH         (th)en
EH         f(e)ll
ER         c(u)rt
EY         f(ai)l
F          (f)at
G          (g)ot
HH         (h)at
IH         f(i)ll
IY         f(ee)l, (ea)t
JH         (j)ourney
K          ( c )at
L          (l)ip, batt(le)
M          (m)an
N          (n)ut
NG         ri(ng)
OW         g(oa)l
OY         f(oi)l
P          (p)it
R          ( r )ip
S          (s)eal
SH         (sh)ip
SIL        -
T          (t)op
TH         (th)in
UH         f(u)ll
UW         f(oo)l
V          (v)at
W          (wh)y
Y          (y)es
Z          (z)eal
ZH         lei(s)ure
9. FAQs
What type of microphone should I use?
Three critical specs for choosing a microphone are: 1) frequency response, 2) sensitivity, and 3) SNR. For
good ASR performance, the frequency response should be flat within the range of 50 Hz to 10 kHz. For
analog microphones, the sensitivity needs to be chosen to generate the maximum output voltage for a given
application, without clipping. The required microphone sensitivity also depends on the gain of the codec or
preamp. A good description of how everything fits together is given in:
http://www.analog.com/library/analogdialogue/archives/46-05/understanding_microphone_sensitivity.html
Digital microphones have similar specs and can also be used as long as the codec supports the output
format.
What type of microphones are the best to be used for noisy environments?
In high noise environments, headset mics are recommended. These mics do a good job of picking up the
speech signal while maintaining a reasonable SNR since they are placed close to the mouth. In applications
where headset mics can not be used, a unidirectional mic or a two microphone array with beamforming
software can effectively reduce the noise.
What type of microphones are best for far field (>1 meter)?
For far field applications the combined microphone + codec system needs to have enough sensitivity to pick
up distant speech. This will depend on the sensitivity of the mic and the gain of the codec.
What about beam forming? How many microphones are used?
It is possible to run beam forming with two microphones on the FM4. This has been shown to improve the
accuracy of distant speech (>5 m) applications. Beam forming software will be released in Q1-2015.
What about AGC?
Automatic Gain Control is usually not recommended for ASR systems, since changes in the gain while an
utterance is being spoken can cause problems with the recognition results. If AGC is required, however, it
should be implemented so that the impact on the utterance is minimized.
How to control the microphone gain or line-in gain?
The microphone and line-in gain settings are set in the audio driver code. The data sheet for the WM8731
audio codec can be found at: http://www.wolfsonmicro.com/documents/uploads/data_sheets/en/WM8731.pdf
How to use line-in instead of the microphone?
The Speech Evaluation Kit is usually wired to accept a single microphone input. However, it can be easily
modified to accept a single line-in. This change also requires a modification to the audio driver. Please
contact Spansion Technical Support directly for this option.
Which codec should I use?
The codec must be able to support 16 bit output at a sampling rate of 16 kHz (for each channel). Additional
considerations, which depend on the application, are: the number of microphone or line inputs, the codec
gain, master or slave operation, power requirements, analog or digital microphone, SNR. Consult the codec
data sheet to verify that the codec fits your application requirements.
Is it possible to use the MCU ADC to sample the microphone? (Is there an amplifier circuit required for this?)
Best results are achieved by using an audio codec with 16 bit output resolution and good noise rejection (i.e.
24 bit delta-sigma modulation). The required gain of the audio codec depends on the microphone and the
strength of the audio input. It should be chosen so as to provide the maximum signal without clipping.
Are there noise filters available to help eliminate the negative effects of background noise?
Basic noise filtering is built into the ASR engine. More advanced noise reduction techniques such as
beamforming will be available in Q1'2015.
Is there a wake-up phrase capability?
Yes. The Voice Activity Detector (VAD) API can be used as the first stage of the wake-up process:
buzz_bos_detected( buzz_decoder_t *buzz, short *buffer, const int len ). For the second stage of the wake-up
process, a dedicated grammar can be run with just the wake-up phrase. When the confidence score is high
enough, then the full speech detector is started. Note: it is recommended to use additional phrases in the
wake-up grammar containing the QQQ model in order to reduce the number of false detections. Spansion
can provide technical support in designing robust wake-up phrases and grammars.
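A hedged sketch of this two-stage flow is shown below; the grammar name "wakeup", the threshold WAKEUP_CONF_THRESHOLD, and the buffer variables (rxbufaddr, SAMPLE_SIZE, following the samples in Section 4) are illustrative assumptions, not part of the SDK.

/* Stage 1: voice activity detection */
while (buzz_bos_detected(buzz, (short *)rxbufaddr, SAMPLE_SIZE) == 0) {
    /* advance rxbufaddr to the next block of audio data here */
}

/* Stage 2: decode against a dedicated wake-up grammar (wake-up phrase plus QQQ variants) */
buzz_set_grammar(param, "wakeup");
/* ...run the decode sequence from Section 3.3, then check the result... */
if (buzz_get_hyp(buzz) != NULL && buzz_get_conf_score(buzz) > WAKEUP_CONF_THRESHOLD) {
    /* wake-up phrase accepted: switch to the full command grammar and start listening */
}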
What is the expected accuracy?
The expected accuracy is <5% sentence error rate under ideal conditions, meaning a close talking
microphone with a native (unaccented) speaker.
How can I improve the recognition results?
There are a number of steps that can be taken to improve recognition accuracy: 1) optimize the audio path (microphone and codec), 2) add alternate pronunciations for problematic words, 3) reduce the size of the
command set or use a hierarchical menu structure, 4) choose commands that sound different from one
another.
Does the ASR system understand different accents and dialects?
Yes. The ASR system is very robust over different accents and dialects as a result of the large database used
to train the recognizer. In addition, further improvements can be realized by customizing the pronunciation
dictionary.
How to change the pronunciation dictionary?
The Grammar Compiler software that runs on the PC generates a pronunciation dictionary file based on the
input phrases. In many cases, improved recognition results can be obtained by adding alternate
pronunciations to the dictionary for certain words. Alternate pronunciations can also be used to cover a range
of accents or dialects. The pronunciation dictionary is a text file, so it can be modified using any text editor. A
table of phoneme pronunciations is provided in Section 8. of the SDK to help in defining the alternate
pronunciations. After the pronunciation dictionary has been modified, a new set of Speech Objects are
generated by re-running the Grammar Compiler without the dictionary generation option.
Are there certain words or phrases that should be avoided due to similarity to other phrases that might cause
either false positives or high error rates?
There is no general rule for this, so the command set should be tested to see if some words are easily
confused with others. Words or commands that are longer typically contain more phonetic information and
are easier to distinguish from one another.
What value of the “confidence score” parameter should be used? Is there a tuning procedure recommended
to calibrate a system for optimal performance?
The confidence score is used to show how well an utterance matches the recognized command. The
confidence score is always negative with a best score of 0. To get a sense of how this parameter responds to
various utterances, input multiple utterances into the ASR engine and record the confidence score of each
utterance. Use different types of utterances, similar to those that would be found in an application: some in
the command list, some outside of the command list, some spoken clearly, some spoken unclearly. Based on
these results, set the confidence score to reject the utterances outside of the command list.
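As a hedged sketch of this procedure, the fragment below logs the score for each test utterance and then applies an empirically chosen threshold; CONF_THRESHOLD and the logging call are illustrative, not part of the SDK.

char const *hyp = buzz_get_hyp(buzz);
int conf = buzz_get_conf_score(buzz);

printf("hyp=%s conf=%d\n", hyp ? hyp : "(null)", conf);  /* log for offline analysis */

/* after a threshold separating in-grammar from out-of-grammar scores has been chosen: */
if (hyp == NULL || conf < CONF_THRESHOLD) {
    /* treat the utterance as out-of-grammar or a poor match and ignore the result */
}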
Is there self-learning built into the solution to allow frequently stated commands to be recognized for either
faster confirmation or better false positive rejection?
Not in the current release.
What amount of testing has Spansion performed to ensure any word or phrase can be entered and used with
a high degree of success?
In low noise environments, the sentence error rate of native American speakers is less than 5% on a typical
set of 100 commands. The actual error rate will depend on the command list, the environment (i.e. noisy,
reverberant, distant speech, close talking, etc.), the amount of front end processing, and the accents of the
speakers. The error rate should be characterized for each application under the target conditions. Spansion
provides technical support to optimize the recognition accuracy under challenging conditions.
Why do I sometimes get a NULL recognition result?
A NULL recognition result means that all the potential hypotheses were pruned out. This can occur for a
number of reasons, such as: 1) the microphone input is of poor quality (i.e. signal too low or signal clipped), 2)
there is a lot of noise in the signal, 3) the spoken phrase is not in the grammar, 4) the VAD circuit picks up
noise and starts the decoding, but nothing is said. In cases (1,2) the audio path needs to be debugged. A
good first step is to record an utterance and play it back. In the last case (4), the bos threshold should be
empirically adjusted so that it's above the noise floor.
Why do some commands take longer to decode?
Some commands may be slow to decode because the recognizer has trouble detecting the end of the
utterance. When this happens, the recognizer waits until the timeout is reached before returning a result.
Reducing the eos_threshold will make it easier for the recognizer to detect the end of speech. If this value is
reduced too far, however, it may detect the end of speech before it actually occurs. Reducing the eos_timeout
value will also result in a faster response, even when the end of speech is not detected, however, it needs to
be set long enough to capture the longest phrase in the grammar. If neither of these fixes solves the problem,
the phrase should be changed to make it easier to detect the end.
What does Spansion recommend a customer do prior to going into mass production to ensure reliable
operation in large volume?
Test, test, test. The speech application can initially be evaluated using prerecorded test data with the line-in
option. The prerecorded data should be taken in conditions similar to the target environment. Tweaking the
ASR parameters for optimal performance should be done at this stage. The final system, with the user
microphone and codec, should then be evaluated in the target environment.
What are the power requirements?
The power requirement depends strongly on how the system is operated. In the case where the sleep mode
is not used and decoding is performed continuously on a task of 50 commands, then the active power is
approximately: FM4 MCU (37 mA) + pSRAM (20 mA) + WM8731L codec (7 mA) + electret microphone
(0.5 mA) = 65 mA. In the case that the FM4 MCU is running the Voice Activity Detector software only and the
pSRAM is in standby then the required power is: FM4 MCU (20mA) + pSRAM (0.1mA) + WM8731L codec
(7 mA) + microphone (0.5 mA) = 28 mA. In the case where the FM4 MCU is put in the deep sleep mode
(VBAT operation, RTC stop), the pSRAM is put into self-refresh mode, and the other components (codec,
microphone) are powered down, then the standby current is less than 100 µA.
What if I don't use IAR?
Not a problem. Although the SDK uses IAR as an example, any development environment that supports the
FM4 MB9BF568R will work. Please refer to the Board Support Package for further details.
What are the resource requirements?
See Section 6. of the SDK.
What if the system doesn't respond?
Check that there's power, check that the microphone is connected to the correct input, press the push-to-talk
button before speaking.
Does the Spansion Voice solution support authentication?
Speaker Identification is planned for release in Q1-2015.
Instead of using a text input method, is there a way to input commands via speaking into the system?
No. The ASR Engine is speaker independent, meaning that all the audio training has been performed by
Spansion, so the application developer just needs to type the commands into a file. Other types of speech
recognition systems are speaker dependent, meaning that the application developer needs to train the
system themselves by collecting many recordings of the command list by people with different accents and in different
acoustic environments.
Is there a way for one firmware version to support two or more languages?
Currently the ASR Engine is designed to support one language at a time. Support of multiple languages
would require a custom release.
How to get technical support?
For technical support: http://www.spansion.com/Support/Pages/SolutionsSupport.aspx
10. Major Changes
Page No.    Section    Description
—           N/A        Initial release
Colophon
The products described in this document are designed, developed and manufactured as contemplated for general use, including without
limitation, ordinary industrial use, general office use, personal use, and household use, but are not designed, developed and manufactured as
contemplated (1) for any use that includes fatal risks or dangers that, unless extremely high safety is secured, could have a serious effect to the
public, and could lead directly to death, personal injury, severe physical damage or other loss (i.e., nuclear reaction control in nuclear facility,
aircraft flight control, air traffic control, mass transport control, medical life support system, missile launch control in weapon system), or (2) for
any use where chance of failure is intolerable (i.e., submersible repeater and artificial satellite). Please note that Spansion will not be liable to
you and/or any third party for any claims or damages arising in connection with above-mentioned uses of the products. Any semiconductor
devices have an inherent chance of failure. You must protect against injury, damage or loss from such failures by incorporating safety design
measures into your facility and equipment such as redundancy, fire protection, and prevention of over-current levels and other abnormal
operating conditions. If any products described in this document represent goods or technologies subject to certain restrictions on export under
the Foreign Exchange and Foreign Trade Law of Japan, the US Export Administration Regulations or the applicable laws of any other country,
the prior authorization by the respective government entity will be required for export of those products.
Trademarks and Notice
The contents of this document are subject to change without notice. This document may contain information on a Spansion product under
development by Spansion. Spansion reserves the right to change or discontinue work on any product without notice. The information in this
document is provided as is without warranty or guarantee of any kind as to its accuracy, completeness, operability, fitness for particular purpose,
merchantability, non-infringement of third-party rights, or any other warranty, express, implied, or statutory. Spansion assumes no liability for any
damages of any kind arising out of the use of the information in this document.
Copyright © 2015 Spansion. All rights reserved. Spansion®, the Spansion Logo, MirrorBit®, MirrorBit® Eclipse™, ORNAND™, HD-SIM™ and
combinations thereof, are trademarks of Spansion LLC in the US and other countries. Other names used are for informational purposes only
and may be trademarks of their respective owners.