2010 5th International Symposium on Telecommunications (IST'2010)
978-1-4244-8185-9/10/$26.00 ©2010 IEEE

Niusha, the First Persian Speech-Enabled IVR Platform

M.H. Bokaei†, H. Sameti†, H. Eghbal-zadeh††, B. BabaAli†, KH. Hosseinzadeh††, M. Bahrani†, H. Veisi†, A. Sanian††
† Speech Processing Lab, Sharif University of Technology, Tehran, Iran
{Bokaei, Babaali, Bahrani, Veisi}@ce.sharif.edu, [email protected]
†† ASR-Gooyesh Pardaz Company, Tehran, Iran
{h.eghbalzadeh, kh.hosseinzadeh, a.sanian}@asr-gooyesh.com

Abstract—This paper introduces Niusha, the first Persian speech-enabled IVR platform. The platform uses a Persian speech recognizer and a Persian text-to-speech synthesizer to interact with users. It is designed so that it can easily be customized for various domains, and its components can be adapted to new vocabularies.

Keywords—speech-enabled IVR systems; VoiceXML; dialogue systems

I. Introduction

Since the invention of the computer, human-computer interaction has been one of the most interesting areas from both academic and industrial viewpoints. The ease of this communication is a basic need for the user of any computer system. With the ongoing explosion of data, some of the most commonly used computer systems are information systems, such as information kiosks, which a user consults to obtain information in a specific domain. The simplest way to communicate with such a system is natural language. For this purpose, spoken dialogue systems have been developed that communicate with a user in an interactive environment in order to provide suitable information.

A typical dialogue system consists of five distinct modules: an automatic speech recognizer, a spoken language understanding module, a dialogue manager, a text generator, and a text-to-speech synthesizer. These modules are not perfect and introduce errors into their outputs. Because of these errors, a fully general commercial dialogue system has not yet been developed, and academic studies aim to improve the accuracy of each module separately. To partially meet the need for dialogue systems, interactive voice response (IVR) systems have emerged instead; they consist of the same five modules, but each module is implemented at a more limited level, and its accuracy is thereby improved.

Traditionally, touch-tone IVR systems have been used: a menu is read to the user, who interacts with the system through the buttons on the phone keypad according to that menu. With the improvement of speech recognition, specifically in limited domains, a distinct kind of IVR system has emerged: speech-enabled IVR systems, where the user can say his/her choice and the system recognizes the speech and acts accordingly. With the development of such systems, IVR systems are getting closer to the ultimate dialogue system.

In this paper we introduce Niusha, the first Persian speech-enabled IVR platform. The main module of an IVR system is its "interaction process manager"; by using the VoiceXML (VXML) standard to implement this unit, the whole system can easily be adapted to different domains.

The rest of this paper is organized as follows. Section II introduces speech-enabled IVR systems and the VoiceXML standard. Section III introduces Niusha and examines the distinct parts of the system. Section IV presents the main features of Niusha, and Section V concludes the discussion and outlines future work.

II. Concepts

In this section we briefly introduce the two most important underlying concepts: interactive voice response systems and the VoiceXML standard.

A. Interactive Voice Response Systems

Interactive voice response (IVR) is an automated telephony technology that interacts with callers, gathers information, and provides the requested information to the caller. An IVR system accepts a combination of voice input and touch-tone keypad selection and provides appropriate responses in the form of voice, fax, callback, e-mail, and possibly other media. An IVR system interacts with its user according to a pre-defined scenario designed as a tree structure: the user is moved to different states according to his/her answers to the questions asked by the system.

The first generation of IVR systems were touch-tone systems, which read a menu and let the caller select a choice by pressing a number on the phone keypad. This kind of IVR system is clearly incapable of handling some scenarios. An important limitation of touch-tone IVR systems is that the number of choices must be fewer than nine, and since listening to a menu with many choices exhausts the caller, a menu with three or four choices is usually the practical limit. Because of these limitations, and along with performance improvements in speech recognition modules, specifically in limited domains, a second generation of IVR systems has emerged that uses an automatic speech recognition module to recognize user utterances. With a speech recognizer incorporated in the system, the caller can say his/her purpose as well as press the appropriate key to select a choice. These systems are called speech-enabled IVR systems. A speech-enabled IVR system breaks the touch-tone limitation: the user can freely make a choice using natural language and speech.

Since an IVR system is used in a limited domain, such as banking, this restriction affects the implementation level of each component. For example, a speech recognizer used in the banking domain is expected to detect only a few words in each state. This restriction simplifies the training of the speech recognition module and improves its accuracy at the same time.
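The tree-structured, state-based interaction described above can be sketched as a small finite-state menu. The following is a minimal illustration only; the state names, prompts, and keypad digits are invented for this example and are not from the paper:

```python
# Hypothetical sketch of a tree-structured touch-tone IVR scenario:
# each state holds a prompt and a map from keypad digits to next states.
# State names, prompts, and digits are invented for illustration.
MENU = {
    "welcome":  ("Press 1 for balance, 2 for transfer.", {"1": "balance", "2": "transfer"}),
    "balance":  ("Your balance will be read out.", {}),
    "transfer": ("Enter the destination account number.", {}),
}

def next_state(state, key):
    """Move to the next state for a keypad digit; stay put on invalid input."""
    _, transitions = MENU[state]
    return transitions.get(key, state)
```

A speech-enabled system would replace the keypad digit with the recognizer's output, leaving the state machine itself unchanged.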
B. VoiceXML Standard

VoiceXML (VXML) is the W3C's standard XML format for specifying interactive voice dialogues between a human and a computer. It allows voice applications to be developed and deployed in a way analogous to HTML for visual applications: just as HTML documents are interpreted by a visual web browser, VoiceXML documents are interpreted by a voice browser. The standard is designed for creating audio dialogues that feature synthesized speech, digitized audio, recognition of spoken and DTMF key input, recording of spoken input, telephony, and mixed-initiative conversations. In essence, VoiceXML is a description language that describes the procedure of a voice application such as a speech-enabled IVR system. Below is a short example of a VoiceXML application that simply uses the text-to-speech synthesizer (TTS) module to play the utterance "Hello World" for the user:

<?xml version="1.0" ?>
<vxml version="2.1" xmlns="http://www.w3.org/2001/vxml">
  <form>
    <block>Hello World!</block>
  </form>
</vxml>

So far, VoiceXML 2.1 has been finalized and adopted as a W3C recommendation. VoiceXML 3.0 will be the next major release, with major new features, including a new XML state-chart description language called State Chart XML (SCXML).

III. Niusha

In this section we introduce Niusha, the first Persian speech-enabled IVR platform, describe the system architecture, and study each of its modules. As explained earlier, an IVR system is a dialogue system whose main purpose is to interact with users in order to determine the user's goal and act accordingly. For example, in a banking system, operations such as cash deposit, withdrawal, and payments are typically defined. The overall schema of an IVR system is like that of a dialogue system and is illustrated in Figure 1.

Figure 1. Overall schema of an IVR system

Generally speaking, Niusha is a platform that also provides tools for a designer to design the scenario of an IVR system. The scenario is then used within the Niusha platform to construct a complete IVR system to which users can connect and with which they can interact. Two main features of the proposed system are listed below:

1. It supports the Persian language: Niusha is the first speech-enabled IVR platform that supports Persian. It uses a Persian speech recognizer (the Nevisa engine) and a Persian text-to-speech synthesizer (the Ariana engine) to interact with the user.

2. Niusha is a platform for creating and running a complete IVR system in different domains: as mentioned earlier, Niusha is a platform rather than a single system, in that it can be used in various domains. A scenario must be presented to the platform in order to construct a complete IVR system: the system designer designs a scenario according to the VoiceXML standard and adds it to the core of Niusha. Implementing the system in a new domain is as simple as designing a new scenario for that domain. In addition, both engines (Nevisa and Ariana) are statistical and can be adapted to new words for use in a new domain.

In this paper, a "system designer" is someone who designs a scenario, and a "system user" is someone who connects to the designed IVR system and interacts with it. In the next subsection, the components of Niusha are introduced according to the overall schema illustrated in Figure 1.

A. Niusha Components

As illustrated in Figure 1, a typical IVR system architecture is composed of five distinct modules. So far, Niusha has only a simple spoken language understanding module, which detects certain pre-defined words.
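The word-spotting behaviour of such a simple understanding module, with its priority list of pre-defined words, can be illustrated with a short sketch. The word list and priorities below are invented banking-domain examples, not the platform's actual vocabulary:

```python
# Hypothetical sketch of a priority-list understanding module: detect
# pre-defined words in an utterance and return the highest-priority one.
# The word list is an invented example, not Niusha's real lexicon.
PRIORITY = ["withdraw", "deposit", "balance"]  # earlier entries win

def understand(utterance):
    tokens = utterance.lower().split()
    for word in PRIORITY:        # scan in priority order
        if word in tokens:
            return word
    return None                  # no pre-defined word detected
```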
The module keeps a priority list of words and selects the single highest-priority word among those it detects; in general, an IVR system is a small-scale model of a dialogue system in which each component is implemented at a lower complexity level, and the understanding module is no exception. A more sophisticated statistical understanding module has also been designed by the authors [1] and is planned to be added to the system in the future. The other four modules, and the way they are implemented in Niusha, are explained in the following.

1. Speech recognition module

Niusha uses the Nevisa engine as its speech recognizer. Nevisa is the first and only large-vocabulary speech recognition system for the Persian language. This continuous speech recognition system uses state-of-the-art acoustic and language modeling techniques and performs adequately as the first product for automatic dictation and telephony recognition applications in Persian. An MFCC representation with some modifications is used as the set of speech signal features, alongside a voice activity detector based on signal energy and zero-crossing rate. Nevisa is equipped with out-of-vocabulary handling for applications with medium or small vocabulary sizes. Powerful robustness techniques are also utilized in the system: model-based approaches such as PMC, MLLR, and MAP; feature robustness methods such as CMS, PCA, RCC, and VTLN; and speech enhancement methods such as spectral subtraction and Wiener filtering, along with their modified versions, have been implemented and evaluated. More about this engine can be found in [2].

2. Dialogue manager

The dialogue manager is the heart of a dialogue or IVR system. Its main goal is to direct the interaction so that the important information is gathered from the user, and to decide on the best response to the user based on this information. The unit acts according to a pre-defined scenario, which can be viewed as a finite state machine whose states correspond to the IVR process states and whose transitions correspond to the conditions for changing states. Part of the whole "bank" application is illustrated in Figure 2: the welcome prompt is played for the user in the "welcome" state, and an unconditional transition is then made to the next state, where an input is taken from the user and the next state is chosen according to this input.

Figure 2. Part of the "bank" application scenario

As mentioned earlier, the VoiceXML standard is so far the best machine-readable language for describing an IVR scenario [3]. In order to use this standard, the system has to have an interpreter capable of reading VoiceXML tags and acting accordingly. For this purpose we use BladewareVXML, an open-source VoiceXML interpreter. This interpreter supports VoiceXML 2.1, the latest stable version of the standard, but it has its own defects. We fixed these problems by adding several new tags, customized the interpreter to support the Persian language, and prepared it to be used as the core of the dialogue manager unit.

3. Natural language generator

The goal of the NLG unit is to generate proper responses for the user. This component is usually implemented in a template-based manner: some pre-defined templates are provided, and each template needs some parameters to become a complete sentence. For example, a greeting template such as "Welcome <person name> to the system…" becomes a sentence when the person's name is added to it. In Niusha this component is implemented according to the VoiceXML standard, for example:

<prompt>Welcome <value expr="personName"/> to the system</prompt>

The person's name is stored in a runtime variable named personName and used to complete the sentence. The completed sentence is sent to the text-to-speech synthesizer to be played for the user.

4. Text-to-speech synthesizer

For the TTS component, the recently developed Ariana TTS engine is used. It is a Persian TTS engine consisting of two parts. The first part, named TTP, obtains phonetic information from plain text: a combination of a 90k-word lexicon and a stemmer collects the phonetic candidates of each word, and an HMM-based algorithm estimates the best candidate. The second part converts the extracted phonetic information into a synthesized speech waveform using the cluster unit selection method, as implemented in the Festival speech synthesis system [4].

B. Niusha Architecture

From an architectural perspective, Niusha is composed of four distinct components: the Niusha gateway, the Niusha designer, the Niusha simulator, and the Niusha TAPI wrapper. The components communicate with each other through sockets, so each component can be placed on a separate computer and keep functioning without interruption using the defined protocol. Each component is described next.

1. Niusha gateway

The Niusha gateway is the main engine of the whole system. All of its units and their connections are shown in Figure 3 and described below.

Figure 3. Niusha gateway architecture

• Niusha browser/simulator manager: this unit manages all the system settings, such as each component's IP address and ports, file path settings, etc.
• Niusha call handler: this unit is responsible for the following tasks:
  o checking all line statuses and their connectivity;
  o providing a call-waiting service when the lines are busy;
  o organizing the calls in their designated order;
  o switching calls over the lines;
  o assigning different ports to different calls and different conversations on multiple lines.
• Niusha resource manager: this unit manages all the resources (such as fax, SMS, TTS, etc.) that must be authorized. Each resource may have certain limitations, which have to be controlled by this unit.
• Niusha VoiceXML interpreter: this unit interprets all the tags in the designed scenario and acts accordingly.
• Resources: to perform the specified actions, Niusha is provided with a set of modules. These modules are considered resources that have to be controlled by the resource manager; among them are fax, SMS, e-mail, database access, the speech recognition engine, and the text-to-speech engine.

2. Niusha designer

This component is a tool that enables the designer to build a scenario through a graphical interface. Complicated scenarios can be designed simply by dragging and dropping pre-defined modules. The resulting graph is then converted into VoiceXML tags and prepared for use within the Niusha platform. Each component is mapped to a series of VoiceXML tags. For example, a menu can be designed with the menu tag, where a prompt is played for the user and a choice is selected according to the user's decision. To put such a menu into a scenario, the designer only has to drag a menu component and drop it onto the grid; the corresponding tags are then produced, as illustrated in Figure 4.

Figure 4. A menu component and its corresponding VoiceXML tags

3. Niusha simulator

Before a scenario is put on line, it must be checked to ensure it works properly. This can be done with the Niusha simulator, a desktop application that takes a VoiceXML application and interacts with the user through a microphone and speakers.

4. Niusha TAPI wrapper

This wrapper is designed for telephony boards. Niusha is specifically customized to use the Dialogic Dive media board as its telephone interface, but it can easily be adapted to other media boards.

IV. Niusha Capabilities

Niusha is designed to support the main capabilities of available IVR systems, which are listed in this section. These capabilities are accessible at the VoiceXML level, meaning that their properties can be set at the scenario design step.
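The template-filling step described for the NLG unit can be mimicked in a few lines. The variable name mirrors the personName runtime variable used in the prompt example above, but this code is an illustration, not the platform's implementation:

```python
# Illustrative sketch (not the platform's code) of template-based NLG:
# a template with a placeholder is completed from runtime variables
# before the finished sentence is handed to the TTS engine.
def render(template, variables):
    """Fill a greeting-style template from a dict of runtime variables."""
    return template.format(**variables)

sentence = render("Welcome {personName} to the system",
                  {"personName": "Sara"})
```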
1. Connecting to any available database and executing any valid query: we define a new tag that handles the execution of database queries. The tag has two properties, one for the database name and one for the query. An engine such as SQL Server is used to perform the specified query.

2. Speech recognition module configuration: to make the speech recognition engine more accurate, the number of considered words has to be limited. This can be done separately for each state, since the words used in each state are different. Three properties are set for the speech recognition engine: (a) the language model, (b) the lexicon, and (c) the phonetics of each word in the lexicon. By specifying these values in each node separately, the speech recognition engine is customized for that node.

3. Barge-in (with voice and DTMF): assume a long prompt that has to be played for users. Barge-in gives the user the ability to select a choice before the prompt has finished playing, using either voice or DTMF: while the prompt is playing, the user can say his/her choice or press the corresponding phone key, the prompt is skipped, and the choice is selected. This property can be set for each state separately.

4. Final silence detection: an energy-based algorithm differentiates between silence and speech. This module is used to detect silence in utterances: the user's utterance is assumed to be finished when a specified period is detected as silence, and the recording operation is then terminated. This feature can be set for each state separately; its value indicates the silence interval the system waits for before terminating the recording.

5. Taking multi-digit numbers from the user: in some states the system has to obtain a multi-digit number from the user. The desired number specification, such as maximum and minimum length, can be set in VoiceXML for the corresponding state.

6. Defining a skip character: a character can be designated as the skip character. Its function is to skip playing the prompt without making any choice; the prompt is skipped and the system waits for an input from the user.

7. Time and date indication: to announce the current time and date, or any other desired time or date, a tag is defined that reads the runtime date and time and plays it with the Ariana TTS engine.

V. Conclusion

In this paper we introduced the first Persian speech-enabled IVR system, called Niusha. Given the limitations of traditional touch-tone IVR systems, the need for this kind of speech-enabled system is growing. The platform was explained from two distinct viewpoints. First, we introduced the overall schema of a typical IVR or dialogue system and explained how its modules are implemented in the Niusha platform. From the architectural viewpoint, we showed that the platform consists of four components, introduced them, and described their responsibilities. Finally, we presented the main features the platform supports; with these features, a designer can create complicated scenarios and use the dialogue engine in any specific domain. In the future we plan to embed our statistical understanding module in Niusha and move one step closer to an ultimate dialogue system.

References

[1] Bokaei M. H., Sameti H., Bahrani M., BabaAli B., "Segmental HMM-based part-of-speech tagger," International Conference on Audio, Language and Image Processing (ICALIP 2010), Shanghai, China, 2010.
[2] Sameti H., Veisi H., Bahrani M., BabaAli B., and Hosseinzadeh K., "Nevisa, a Persian continuous speech recognition system," in Communications in Computer and Information Science, Advances in Computer Science and Engineering, 13th International CSI Computer Conference (CSICC 2008), Kish Island, Iran, vol. 6, pp. 485-492, Springer Berlin Heidelberg, 2008.
[3] Bruce L., "VoiceXML for Web-based distributed conversational applications," Communications of the ACM, vol. 43, no. 9, pp. 53-57, Sept. 2000.
[4] Black A., Taylor P., and Caley R., "The Festival Speech Synthesis System," University of Edinburgh, Centre for Speech Technology Research, edition 1.4, for Festival version 1.4.0, 1999.