2010 5th International Symposium on Telecommunications (IST'2010)
Niusha, the first Persian speech-enabled IVR platform
M.H. Bokaei†, H. Sameti†, H. Eghbal-zadeh††, B. BabaAli†, KH. Hosseinzadeh††, M. Bahrani†, H. Veisi†, A. Sanian††
† Speech Processing Lab, Sharif University of Technology, Tehran, Iran
{Bokaei, Babaali, Bahrani, Veisi}@ce.sharif.edu, [email protected]
†† ASR-Gooyesh Pardaz Company, Tehran, Iran
{h.eghbalzadeh, kh.hosseinzadeh, a.sanian}@asr-gooyesh.com
Abstract—This paper introduces Niusha, the first Persian speech-enabled IVR platform. The platform uses a Persian speech recognizer engine and a Persian text-to-speech synthesizer engine to interact with users. It is designed so that it can easily be customized for various domains, and its components can be adapted with new words.

Keywords—speech-enabled IVR systems; VoiceXML; dialogue system

I. Introduction

Since the invention of the computer, human-computer interaction has been one of the most interesting areas from both academic and industrial viewpoints. Ease of communication is a basic need for the user of any computer system. With the explosion of available data, some of the most commonly used computer systems are information systems, such as information kiosks, which a user consults to obtain information in a specific domain. The simplest way to communicate with an information system is to use natural language. For this purpose, spoken dialogue systems have been developed that communicate with a user in an interactive environment in order to provide suitable information.

Traditionally, touch-tone IVR systems are used, where a menu is read to the user and he/she presses buttons on the phone keypad to interact with the system according to that menu. With the improvement of speech recognition modules, specifically in limited domains, a distinct kind of IVR system has emerged: the speech-enabled IVR system, where the user can say his/her choice and the system recognizes the speech and acts accordingly. With the development of this kind of system, IVR systems are getting closer to the ultimate dialogue system.

In this paper we aim to introduce Niusha, the first Persian speech-enabled IVR platform. The main module of an IVR system is its "interaction process manager". By using the VoiceXML (VXML) standard to implement this unit, the whole system can easily be adapted to different domains. The rest of this paper is organized as follows. In Section II, speech-enabled IVR systems and the VoiceXML standard are introduced. In Section III, Niusha is introduced and the distinct parts of the system are investigated. In Section IV, the main features of Niusha are presented, and finally in Section V the discussion is concluded and future work is outlined.

II. Concepts

In this section we briefly introduce the two most important concepts: interactive voice response systems and the VoiceXML standard.

A. Interactive Voice Response systems

Interactive voice response (IVR) is an automated telephony technology that interacts with callers, gathers information and provides the requested information to the caller. An IVR system accepts a combination of voice input and touch-tone keypad selection and provides appropriate responses in the form of voice, fax, callback, e-mail and perhaps other media. An IVR system interacts with its user according to a pre-defined scenario designed as a tree structure. The user is moved to different states according to his/her answers to the questions asked by the system.

A typical dialogue system consists of five distinct modules: an automatic speech recognizer, a spoken language understanding module, a dialogue manager, a text generator and a text-to-speech synthesizer. These modules are not perfect and make errors in generating their outputs. Because of these errors, no fully general commercial dialogue system has been developed yet, and academic studies are conducted to improve the accuracy of each module separately. To palliate the need for dialogue systems, Interactive Voice Response (IVR) systems have emerged instead; they consist of the same five modules, but each module is implemented at a more limited level, and thus its accuracy is improved.

The first generation of IVR systems is the touch-tone IVR system, which reads a menu and lets the caller select an appropriate choice by pressing a number on the phone keypad. Clearly, this kind of IVR system cannot handle some scenarios. An important limitation of touch-tone IVR systems is that the number of choices must be less than 9; moreover, listening to a menu with many choices exhausts the caller, so a menu with 3 or 4 choices is usually acceptable. Because of this limitation, and along with performance improvements in speech recognition modules, specifically in limited domains, the second generation of IVR systems has emerged, which uses an automatic speech recognition module to recognize user utterances. By incorporating a speech recognizer engine into the system, the caller can say his/her purpose as well as press the appropriate key in order to select a choice. These IVR systems are called speech-enabled IVR systems. A speech-enabled IVR system breaks the touch-tone limitation, and the user can freely make a choice using natural language and speech.

Since an IVR system is used in a limited domain, such as the banking domain, this restriction affects the implementation level of each component. For example, a speech recognizer used in a limited domain, like the banking domain, is expected to detect only a few words in each state. This limitation simplifies the training process of the speech recognition module and improves its accuracy at the same time.
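To make the contrast concrete, the following fragment, written in the VoiceXML notation introduced in the next subsection, sketches a hypothetical two-choice banking menu that accepts either a spoken choice or a keypress; the choice names and targets are illustrative and not taken from Niusha itself:

```xml
<?xml version="1.0" ?>
<vxml version="2.1" xmlns="http://www.w3.org/2001/vxml">
  <!-- Hypothetical example: each choice can be selected by voice or by DTMF key -->
  <menu>
    <prompt>Say deposit or withdrawal, or press 1 or 2.</prompt>
    <choice dtmf="1" next="#deposit">deposit</choice>
    <choice dtmf="2" next="#withdrawal">withdrawal</choice>
  </menu>
  <form id="deposit">
    <block>You selected deposit.</block>
  </form>
  <form id="withdrawal">
    <block>You selected withdrawal.</block>
  </form>
</vxml>
```

A touch-tone-only system would keep just the dtmf attributes; the speech-enabled version additionally matches the spoken words against the choice texts.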
B. VoiceXML standard

VoiceXML (VXML) is the W3C's standard XML format for specifying interactive voice dialogues between a human and a computer. It allows voice applications to be developed and deployed in a way analogous to HTML for visual applications. Just as HTML documents are interpreted by a visual web browser, VoiceXML documents are interpreted by a voice browser. The standard is designed for creating audio dialogues that feature synthesized speech, digitized audio, recognition of spoken and DTMF key input, recording of spoken input, telephony, and mixed-initiative conversations. In fact, VoiceXML is a description language that describes the procedure of a voice application such as a speech-enabled IVR system. Below is a short example of a VoiceXML application that simply uses the text-to-speech synthesizer (TTS) module to produce the utterance "Hello World" for the user:

<?xml version="1.0" ?>
<vxml version="2.1"
      xmlns="http://www.w3.org/2001/vxml">
  <form>
    <block>Hello World!</block>
  </form>
</vxml>

So far, VoiceXML 2.1 has been finalized and adopted as a W3C recommendation. VoiceXML 3.0 will be the next major release of VoiceXML, with major new features. It includes a new XML state chart description language called State Chart XML (SCXML).

III. Niusha

In this section we introduce Niusha, the first Persian speech-enabled IVR platform. The system architecture is described and each module is studied. As explained earlier, an IVR system is a dialogue system whose main purpose is to interact with users in order to learn the user's goal and to act accordingly. For example, in a "bank" system, different operations are often defined, such as cash deposit or withdrawal, payments, etc.

Figure 1. Overall schema of an IVR system

Generally speaking, Niusha is a platform that also provides tools for a designer to design the scenario of an IVR system. This scenario is then used within the Niusha platform to construct a complete IVR system to which a user can connect and with which he/she can interact. Two main features of our proposed system are listed below:

1. It supports the Persian language: Niusha is the first speech-enabled IVR platform that supports the Persian language. It uses a Persian speech recognizer engine (the Nevisa engine) and a Persian text-to-speech synthesizer engine (the Ariana engine) to interact with the user.

2. Niusha is a platform for creating and running a complete IVR system in different domains: As mentioned earlier, Niusha is a platform rather than a single system, in that it can be used in various domains. A scenario must be presented to the platform in order to construct a complete IVR system. The system designer designs a scenario according to the VoiceXML standard and adds it to the core of Niusha to construct an IVR system. Implementing the system in a new domain is as simple as designing a new scenario for that domain. In addition, both engines (Nevisa and Ariana) are statistical and can be adapted with new words for use in a new domain.

In this paper a "system designer" is someone who designs a scenario, and a "system user" is someone who connects to the designed IVR system and interacts with it. In the next subsection the various components of Niusha are introduced according to the overall schema illustrated in Figure 1.
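A scenario, then, is simply a VoiceXML document handed to the platform. As a minimal sketch of such a scenario fragment for the "bank" domain (the field name, the grammar file bank_services.grxml, and the state names are hypothetical, not taken from the paper):

```xml
<?xml version="1.0" ?>
<vxml version="2.1" xmlns="http://www.w3.org/2001/vxml">
  <!-- Hypothetical scenario fragment: one state that asks for input -->
  <form id="welcome">
    <field name="service">
      <!-- Per-state grammar listing the few words the recognizer must detect -->
      <grammar src="bank_services.grxml"/>
      <prompt>Welcome to the bank. Which service do you need?</prompt>
      <filled>
        <!-- The recognized value steers the transition to the next state -->
        <goto expr="'#' + service"/>
      </filled>
    </field>
  </form>
</vxml>
```

Swapping this document for one written for another domain is all that is required to retarget the platform.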
A. Niusha components

The overall schema of an IVR system resembles that of a dialogue system and is illustrated in Figure 1. Generally, an IVR system is a small-scale model of a dialogue system in which each component is implemented at a lower complexity level. As illustrated in Figure 1, a typical IVR system architecture is composed of five distinct modules. So far, we have only a simple spoken language understanding module, whose job is to detect certain pre-defined words: it holds a priority list of words and selects the highest-priority word among those detected. A more complicated statistical understanding module has also been designed by the authors [1] and is planned to be added to the system in the future. The other four modules, and the way we implement them in Niusha, are explained below.

1. Speech recognition module

Niusha uses the Nevisa engine as its speech recognizer module. Nevisa is the first and only large-vocabulary speech recognition system for the Persian language. This continuous speech recognition system uses state-of-the-art speech and language modeling techniques and performs adequately as the first product for automatic dictation and telephony-environment recognition applications in Persian. An MFCC representation with some modifications is used as the set of speech signal features, together with a VAD based on signal energy and zero-crossing rate. Nevisa is equipped with out-of-vocabulary capability for applications with medium or small vocabulary sizes. Powerful robustness techniques are also utilized in the system: model-based approaches like PMC, MLLR, and MAP; feature robustness methods such as CMS, PCA, RCC, and VTLN; and speech enhancement methods like spectral subtraction and Wiener filtering, along with their modified versions, were implemented and evaluated in the system. More about this engine can be found in [2].

2. Dialogue manager

The dialogue manager is the heart of a dialogue or IVR system. Its main goal is to direct the interaction so that the important information is gathered from the user, and to decide on the best response to the user based on this information. This unit acts according to a pre-defined scenario. The scenario can be viewed as a finite state machine whose states correspond to the IVR process states and whose transitions correspond to the conditions for changing states. A part of the whole "bank" application is illustrated in Figure 2. The welcome prompt is played for the user in the "welcome" state. An unconditional transition is then made to the next state, where an input is obtained from the user and the next state is chosen according to this input.

Figure 2. Part of the "bank" application scenario

As mentioned earlier, the VoiceXML standard is so far the best machine-readable language for describing an IVR scenario [3]. In order to use this standard, the system has to have an interpreter capable of reading VoiceXML tags and acting accordingly. For this purpose we use Bladeware VXML, an open-source VoiceXML interpreter. This interpreter supports VoiceXML 2.1, the latest stable version of the standard, but has its own defects. We fixed these problems by adding several new tags, customized the interpreter to support the Persian language, and prepared it to serve as the core of the dialogue manager unit.

3. Natural language generator

The goal of the NLG unit is to generate proper responses for the user. Usually this component is implemented in a template-based manner: some pre-defined templates are provided, and each template needs some parameters to become a complete sentence. For example, assume a greeting template like "Welcome <person name> to the system…". This template becomes a sentence when the person's name is added to it. In Niusha this component is implemented according to the VoiceXML standard. For example, we have:

<prompt> Welcome <value expr="personName"/> to the system</prompt>

The person's name is stored in a runtime variable named personName and used to complete the sentence. The completed sentence is sent to the text-to-speech synthesizer module in order to be played for the user.

4. Text-to-speech synthesizer

For the TTS component, the recently developed Ariana TTS engine is used. It is a Persian TTS engine consisting of two parts. The first part, named TTP, obtains phonetic information from the plain text. A combination of a 90k-word lexicon and a stemmer is applied to collect the various phonetic candidates of each word, and an HMM-based algorithm estimates the best candidate. The second part converts the extracted phonetic information into a synthesized speech waveform, using the cluster unit selection method implemented in the Festival speech synthesis system [4].

B. Niusha architecture

From an architectural perspective, Niusha is composed of four distinct components: the Niusha gateway, the Niusha designer, the Niusha simulator, and the Niusha TAPI wrapper. The connections between these components are made with socket programming; therefore each component can be placed on a separate computer and keep functioning without interruption using the defined protocol. Each component is described next.
1. Niusha gateway

The Niusha gateway is the main engine of the whole system. All Niusha units and their connectivity are shown in Figure 3 and described below.

• Niusha browser/simulator manager: This unit manages all the system settings, such as each component's IP address and ports, file path settings, etc.
• Niusha call handler: This unit is responsible for the following tasks:
  o Checking all line statuses and their connectivity.
  o Providing call-waiting service in case the lines are busy.
  o Organizing the calls in their designed order.
  o Switching calls over the lines.
  o Assigning different ports to different calls and different conversations on multiple lines.
• Niusha resource manager: It manages all the resources (such as fax, SMS, TTS, etc.) that must be authorized. Each resource may have certain limitations which have to be controlled in this unit.
• Niusha VoiceXML interpreter: This unit interprets all the tags in the designed scenario and acts accordingly.
• Resources: To perform the specified actions, Niusha is provided with some modules. These modules are considered resources that have to be controlled by the resource manager. Some of these resources are: fax, SMS, e-mail, database, the speech recognition engine, the text-to-speech engine, etc.

Figure 3. Niusha gateway architecture

2. Niusha designer

This part is a tool that enables the designer to create a scenario with a graphical interface. Complicated scenarios can be designed simply by dragging and dropping pre-defined modules. The resulting graph is then converted to VoiceXML tags and prepared for use within the Niusha platform. Each component is mapped to a series of VoiceXML tags. For example, a menu can be designed with the menu tag, where a prompt is played for the user and a choice is selected according to the user's decision. To put such a menu in a scenario, the designer only has to drag a menu component and drop it onto the grid. The corresponding tags are then produced, as illustrated in Figure 4.

Figure 4. A menu component and its corresponding VoiceXML tags

3. Niusha simulator

Before a scenario is put online, it must be checked that it works properly. This can be done with the Niusha simulator. This tool is a desktop application that takes a VoiceXML application and interacts with the user through a microphone and speaker.
4. Niusha TAPI wrapper

This wrapper is designed for telephony boards. Specifically, Niusha is customized to use the Dialogic Dive media board as its telephone interface, but it can easily be adapted to other media boards.
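Several scenario-level properties of this kind (barge-in, silence timeouts, built-in digit grammars) are exposed at the VoiceXML level. The sketch below assumes standard VoiceXML 2.x properties and built-in grammars rather than Niusha-specific tags, and uses a hypothetical accountNumber field for illustration:

```xml
<?xml version="1.0" ?>
<vxml version="2.1" xmlns="http://www.w3.org/2001/vxml">
  <form id="account">
    <!-- Standard VoiceXML 2.x properties; Niusha-specific tags may differ -->
    <property name="bargein" value="true"/>  <!-- let the caller interrupt the prompt -->
    <property name="timeout" value="3s"/>    <!-- final-silence interval to wait for -->
    <!-- Built-in digits grammar, constrained to 4 to 16 digits -->
    <field name="accountNumber" type="digits?minlength=4;maxlength=16">
      <prompt>Please say or key in your account number.</prompt>
      <filled>
        <prompt>You entered <value expr="accountNumber"/>.</prompt>
      </filled>
    </field>
  </form>
</vxml>
```

Because each state is a separate form or field, such properties can be set per state, exactly as the scenario design step requires.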
IV. Niusha capabilities

Niusha is designed to support the main capabilities of available IVR systems. The main capabilities are listed in this section. These features are accessible at the VoiceXML level, meaning that their properties can be set at the scenario design step.

1. Connecting to any available database and executing any valid query: We define a new tag that handles the execution of database queries. This tag has two properties: one for the database name and the other for the query. An engine such as SQL Server is used to perform the specified query.

2. Speech recognition module configuration: To obtain a more accurate speech recognition engine, we have to limit the number of considered words. This can be done separately for each state, since the words used in each state are different. Three properties are set for the speech recognition engine to use:
  a. the language model,
  b. the lexicon, and
  c. the phonetics of each word in the lexicon.
By specifying these values in each node separately, the speech recognition engine is customized for that node specifically.

3. Barge-in (with voice and DTMF): Assume a long prompt that has to be played for users. Barge-in gives the user the ability to select a choice before the prompt finishes playing. This can be done with both DTMF and voice; in other words, while the prompt is playing, the user can say his/her choice or press the corresponding phone key. The prompt is then skipped and the choice is selected. This property can be set for each state separately.

4. Final silence detection: An energy-based algorithm is implemented that differentiates between silence and speech signals. This module is used to detect silence in utterances. We assume that the user's utterance is finished when a specified interval is detected as silence, and the recording operation is then terminated. This feature can be set for each state separately; its value indicates the silence interval the system has to wait for before terminating the recording process.

5. Taking multi-digit numbers from the user: In some states the system has to get a multi-digit number from the user. The desired number's specification, such as maximum and minimum length, can be set in VoiceXML for the corresponding state.

6. Defining a skip character: A character can be designated as the skip character. Its function is to skip playing the prompt without making any choice; the prompt is skipped and the system waits for an input from the user.

7. Time and date indication: To announce the current time and date, or any other desired time or date, a tag is defined that gets the runtime date and time and plays it with the Ariana TTS engine.

V. Conclusion

In this paper we introduced the first Persian speech-enabled IVR system, called Niusha. Given the limitations of traditional touch-tone IVR systems, the need for this kind of speech-enabled system is growing.

The system platform was explained from two distinct viewpoints. First we introduced the overall schema of a typical IVR or dialogue system and explained how these modules are implemented in our Niusha platform. From the architectural viewpoint, we showed that our platform consists of four components, and we introduced them and described their responsibilities. Finally, we presented the main features that our platform supports. With these features, a designer can create complicated scenarios and use the dialogue engine in any specific domain.

In the future we want to embed our statistical understanding module in Niusha and go one step further toward an ultimate dialogue system.
REFERENCES

[1] Bokaei M. H., Sameti H., Bahrani M., Babaali B., "Segmental HMM-based part-of-speech tagger," International Conference on Audio, Language and Image Processing (ICALIP 2010), Shanghai, China, 2010.
[2] Sameti H., Veisi H., Bahrani M., Babaali B. and Hosseinzadeh K., "Nevisa, a Persian Continuous Speech Recognition System," in Communications in Computer and Information Science, Advances in Computer Science and Engineering, 13th International CSI Computer Conference, CSICC 2008, Kish Island, Iran, Vol. 6, pp. 485-492, Springer Berlin Heidelberg, 2008.
[3] Bruce L., "VoiceXML for Web-based distributed conversational applications," Communications of the ACM, vol. 43, no. 9, pp. 53-57, Sept. 2000.
[4] Black A., Taylor P., and Caley R., "The Festival Speech Synthesis System," University of Edinburgh, Centre for Speech Technology Research, edition 1.4, for Festival version 1.4.0, 1999.