Speech Recognition of Highly Inflective Languages

BARTOSZ ZIÓŁKO

Ph.D. Thesis

This thesis is submitted in partial fulfilment of the requirements for the degree of Doctor of Philosophy.

Artificial Intelligence Group
Pattern Recognition and Computer Vision Group
Department of Computer Science
University of York
United Kingdom
2009

Abstract

This PhD thesis combines various topics in speech recognition. There are two central hypotheses. The first is that it is useful to incorporate phoneme segmentation information in speech recognition, and that this task can be achieved by applying the discrete wavelet transform. The second is that adding semantics to language models for speech recognition improves recognition accuracy. The research starts by analysing the differences between English and Polish from the speech recognition point of view. English is a very typical positional language, while Polish is highly inflective. Part of the research is focused on the aspects of well-known solutions for English which should be changed, due to the linguistic differences, in order to improve recognition of Polish. These are mainly phoneme segmentation and semantic analysis. Phoneme statistics for Polish were gathered by the author, and a toolkit designed for English was applied to Polish. Phoneme segmentation is more likely to be successful in Polish than in English because the phonemes are easier to distinguish. A method based on the discrete wavelet transform was designed and tested by the PhD candidate. Another part of the research is focused on finding new ways of modelling a natural language. Semantic analysis is crucial for Polish because syntax models are not very effective and are difficult to train, due to the non-positionality of Polish. This part of the thesis describes an unsuccessful approach of using part-of-speech taggers for language modelling in speech recognition and a much better bag-of-words model. The latter is inspired by the well-known latent semantic analysis. It is, however, easier to train and does not need calculations on big matrices. The difference lies in a completely new approach to smoothing the information in a word-topic matrix. Because of the morphological nature of the Polish language, this method captures not only semantic content, but also some grammatical structure.

Contents

1 Introduction
  1.1 Contribution
  1.2 Thesis Overview
    1.2.1 Introduction and Literature Review
    1.2.2 Linguistic Aspects of Highly Inflective Languages Using Polish as an Example
    1.2.3 Phoneme Segmentation and Acoustic Models
    1.2.4 Language Modelling
2 Literature Review
  2.1 History of Speech Recognition
  2.2 Linguistic Rudiments of Speech Analysis
  2.3 Speech Processing
    2.3.1 Spectrum
  2.4 Speech Segmentation
  2.5 Phoneme Segmentation
  2.6 Speech Parametrisation
    2.6.1 Parametrisation Methods Based on Linear Prediction Coefficients
    2.6.2 Parametrisation Methods Based on Filter Banks
    2.6.3 Test Corpora and Baselines
    2.6.4 Comparison of the Methods
  2.7 Speech Modelling
  2.8 Natural Language Modelling
  2.9 Semantic Modelling
  2.10 Academic Applications
3 Linguistic Aspects of Polish
  3.1 Analysis of Polish from the Speech Recognition Point of View
  3.2 Triphone Statistics of Polish Language
  3.3 Description of a problem solution
  3.4 Methods, software and hardware
    3.4.1 Grapheme to Phoneme Transcription
    3.4.2 Corpora Used
    3.4.3 Results
  3.5 Analysis of Phonetic Similarities in Wrong Recognitions of the Polish Language
  3.6 Experimental Results on Applying HTK to Polish
  3.7 Conclusion
4 Phoneme Segmentation
  4.1 Analysis Using the Discrete Wavelet Transform
  4.2 General Description of the Segmentation Method
  4.3 Phoneme Detection Algorithm
  4.4 Fuzzy Sets for Recall and Precision
  4.5 Algorithm of Speech Segmentation Evaluation
  4.6 Comparison to Other Evaluation Methods
  4.7 Experimental Results of DWT Segmentation Method
  4.8 Evaluation for Different Types of Phoneme Transitions
  4.9 LogitBoost WEKA Classifier Speech Segmentation
  4.10 Experimental Results for LogitBoost
  4.11 Conclusion
5 Language Models
  5.1 POS Tagging
  5.2 Applying POS Taggers for Language Modelling in Speech Recognition
  5.3 Experimental Results of Applying POS Tags in ASR
  5.4 Bag-of-words Modelling
  5.5 Experimental Setup
  5.6 Training Algorithm
  5.7 Process of Finding The Most Similar Topics
  5.8 Example in English
  5.9 Recognition Using Bag-of-words Model
  5.10 Preliminary Experiment
  5.11 K-means On-line Clustering
  5.12 Experiment on Parliament Transcripts
  5.13 Preprocessing of Training Corpora
  5.14 Experiment with Literature Training Corpus
  5.15 Word Prediction Model and Evaluation with Perplexity
  5.16 Conclusion
6 Conclusions and Future Research
Appendices
List of References

List of Tables

2.1 Phoneme transcription in English - BEEP dictionary
2.2 Phoneme transcription in Polish - SAMPA
2.3 Comparison of the efficiency of the described methods. Asterisks mark methods appended to baselines (they could be used with most of the other methods). The methods without asterisks are new sets of features, different to the baselines
2.4 Speech recognition applications available on the Internet
3.1 Phonemes in Polish (SAMPA, Demenko et al. (2003))
3.2 Most common Polish diphones
3.3 Most common Polish triphones
3.4 Word recognition correctness for different speakers (the model was trained on adult male speakers only)
3.5 Errors in different types of utterances (for all speakers)
3.6 Errors in sentences (speakers AK1C1 and AK2C1 respectively)
3.7 Errors in digits
3.8 Errors in the most often wrongly recognised names and commands
3.9 Errors in the most often wrongly recognised names and commands (2nd part)
3.10 Names which appeared most commonly as wrong recognitions in the above statistics
3.11 Errors in the pronounced alphabet
4.1 Characteristics of the discrete wavelet transform levels and their envelopes
4.2 Types of events associated with a phoneme boundary. Mathematical conditions are based on the power envelope $p^{en}_m(n)$, the rate-of-change information $r_m(n)$, a threshold on the distance between $r_m(n)$ and $p^{en}_m(n)$, a threshold $p^{en}_{min}$ of minimal $p^{en}_m(n)$, and $\beta = 1$. Values in the last four columns are for different DWT levels (the first one for the d1 level, the second one for the d2 level, the third for levels from d3 to d5 and the last one for the d6 level)
4.3 Comparison of fuzzy recall and precision with commonly used methods based on insertions and deletions for an exemplar word
4.4 Comparison of the proposed method using different wavelets
4.5 Comparison of some other segmentation strategies and the proposed method
4.6 Recall for different types of phoneme transitions
4.7 Precision for different types of phoneme transitions
4.8 F-score for different types of phoneme transitions. The scores above 0.5 are bolded
4.9 Experimental results for the LogitBoost classifier. The rows labelled boundary are for classifying segments representing boundaries. The rows labelled phoneme present grades for classifying segments inside phonemes which are not boundaries. From a practical point of view the boundary labels are important; the grades for the phoneme labels are given just for reference
5.1 Results of applying the POS tagger to language modelling. First, a sentence in Polish is given, then the position of the correct recognition in the 10-best list. The description of the tagger grade for the correct recognition follows
5.2 Results of applying the POS tagger to language modelling. First, a sentence in Polish is given, then the position of the correct recognition in the 10-best list. The description of the tagger grade for the correct recognition follows (2nd part)
5.3 Results of applying the POS tagger on its training corpus. The first version of a sentence is the correct one, the second is a recognition using just HTK and the third one uses HTK and POS tagging. Then the number of differences with respect to the correct sentence was counted and summarised
5.4 Matrix S for the example with 4 topics and a row of S' for topic 3
5.5 Matrix D for the presented example
5.6 Experimental results for the pure HTK audio model, the audio model with LSA and the audio model with our bag-of-words model
5.7 44 sentences in the exact transcription used for testing by HTK and the bag-of-words model, with English translations
5.8 44 sentences in the exact transcription used for testing by HTK and the bag-of-words model, with English translations (2nd part)
5.9 44 sentences in the exact transcription used for testing by HTK and the bag-of-words model, with English translations (3rd part)
5.10 SED script for text preprocessing
5.11 Experimental results for the pure HTK audio model, the audio model with LSA and the audio model with our bag-of-words model trained on literature
5.12 Experimental results for the pure HTK audio model, the audio model with LSA and the audio model with our bag-of-words model trained on the enlarged literature corpus
5.13 Text corpora

List of Figures

2.1 Toy dog Rex - the first working speech recognition system (USA 1920)
2.2 Scheme of a speech recognition system
2.3 Typical current services offered by call centres with ASR (above) and its future (below)
2.4 Speech audibility and average human hearing band (Tadeusiewicz, 1988)
2.5 An example of Fourier spectrum amplitude
2.6 Frequency spectrum of speech in a linear and a non-linear scale
2.7 The cepstrum is the Fourier transform of the log of the power spectrum
2.8 The types of speech segmentation
2.9 Comparison of the frames produced by constant segmentation and phoneme segmentation
2.10 The list of speech feature extraction method types, grouped in two avenues: based on linear prediction coefficients (with PLP as the main one) and filter bank analysis (with MFCC as the main one)
2.11 fMPE transformation matrix from the original low-dimensional feature vector into a high-dimensional one
2.12 Mel frequency cepstrum coefficients
3.1 Phonemes in Polish in the SAMPA alphabet
3.2 Frequency of diphones in Polish (each phoneme separately)
3.3 Space of triphones in Polish
3.4 Phoneme occurrences distribution
4.1 Wavelet transform outperforms STFT because it has higher resolution for higher frequencies
4.2 The discrete Meyer wavelet - dmey
4.3 Subband amplitude DWT spectra of the Polish word 'osiem' (eng. eight). The number of samples depends on the resolution level
4.4 Segmentation of the Polish word 'osiem' (eng. eight) based on DWT sub-bands. Dotted lines are hand segmentation boundaries; dashed lines are automatic segmentation boundaries, bold lines are envelopes and thin lines are smoothed rate-of-change
4.5 The event function versus time in ms for the word presented in Fig. 4.4. High event scores mean that a phoneme boundary is more likely
4.6 Simple examples of the four events described in Table 4.2. They are characteristic of phoneme boundaries. The images present the power envelope $p^{en}_m(n)$ and the rate-of-change information (derivative) $r_m(n)$
4.7 The general scheme of the set G with correct boundaries and the set A with detected ones. Elements of set A have a grade f(x) standing for the probability of being a correct boundary. In set G there can be elements which were not detected (in the left part of the set)
4.8 An example of the phoneme segmentation of a single word. In the lower part the hand segmentation is drawn. Boundaries are represented by two indexes close to each other (sometimes overlapping). The upper columns present the example segmentation of the word produced by a segmentation algorithm. All of the calculated boundaries are quite accurate but never perfect
4.9 Fuzzy membership
4.10 F-score of phoneme boundary detection for transitions between several types of phonemes. Phoneme types 1-10 are explained in section 4.8 (1 - stops, 2 - nasal consonants, etc.)
5.1 Histogram of POS tagger probabilities for hypotheses which are correct recognitions
5.2 Histogram of POS tagger probabilities for hypotheses which are wrong recognitions
5.3 Ratio of correct recognitions to all recognitions for different probabilities from the POS tagger
5.4 Undirected, complete graph illustrating similarities between sentences
5.5 Histogram of probabilities received from the bag-of-words model for hypotheses which are correct recognitions
5.6 Histogram of probabilities received from the bag-of-words model for hypotheses which are wrong recognitions
5.7 Ratio of correct recognitions to all of them for different probabilities received from the bag-of-words model

Acknowledgments

I would like to begin by thanking my parents, not only for their unstinting support throughout my educational career, but also for encouraging me to pursue my PhD. I feel very lucky that I had two supervisors to guide me through the research. I would like to thank Dr Suresh Manandhar and Dr Richard C. Wilson for their continued support, advice and constructive feedback. I am glad we published many papers together and that, thanks to them, I participated in several conferences. They were not only teachers but also good friends, helping me in my life in a new country, which was often less surprising and easier to understand because of them. Appreciation goes to my assessor Dr Adrian Bors for his regular feedback on the progress of my research. I had the privilege to meet many interesting people in the department. This provided an excellent environment for inventing my methods and algorithms. I am grateful for all the seminars and informal discussions in the corridors of our department. In particular, thanks go to Thimal Jasooriya for sitting in front of me for three long years and patiently answering all questions like 'Hey, how do you do this in LaTeX?' or 'Where is room 103?'. I also appreciate Ioannis Klapaftis's help regarding grammar parsers and collocation graphs. I would like to thank Pierre Andrews for improving my knowledge not only in NLP but also in photography. Many thanks to Marcelo Romero Huertas. And finally, I am very glad that I met Marek Grześ, with whom I had so many exciting conversations about travels all over the world and who was a strong support for me in the days I had private problems. Many thanks to all other members of the department I have met during my studies. My PhD would not have been completed without the help of many people outside the department. I would like to thank Professor Zdzisław Brzeźniak for our mathematical discussions over coffee. Appreciation goes to Professor Grażyna Demenko for providing the PolPhone software and to Dr Stefan Grocholewski for CORPORA. I would like to thank Dr Adam Przepiórkowski and Dr Maciej Piasecki for their help in the part of the research about POS taggers. I am also very glad for my close cooperation with Jakub Gałka in our research. Finally, many thanks to my father, Professor Mariusz Ziółko, for much useful feedback about my research papers and this thesis.

List of the candidate's publications. Parts of some of them were used in this thesis.

Conferences:

• M.P. Sellars, G.E. Athanasiadou, B. Ziółko, S.D. Greaves, A. Hopper, Simulation of Broadband FWA Networks in High-rise Cities with Linear Antenna Polarisation, The 14th IEEE 2003 International Symposium on Personal, Indoor and Mobile Radio Communications Proceedings - PIMRC, pp. 371-5, Beijing, China 2003.
• M. Ziółko, P. Sypka, B. Ziółko, Compression of Transmultiplexed Acoustic Signals, Proceedings of The 2004 International TICSP Workshop on Spectral and Multirate Signal Processing, pp. 81-6, Vienna 2004.
• B. Ziółko, M. Ziółko, M. Nowak, P. Sypka, A suggestion of multiple-access method for 4G system, Proceedings of 47th International Symposium ELMAR-2005, pp. 327-30, Zadar, Croatia 2005.
• M. Ziółko, B. Ziółko, A. Dziech, Transcription as a Speech Compression Method in Transmultiplexer System, 5th WSEAS International Conference on Multimedia, Internet and Video Technologies, Corfu, Greece 2005.
• B. Ziółko, M. Ziółko, M. Nowak, Design of Integer Filters for Transmultiplexer Perfect Reconstruction, Proceedings of 13th European Signal Processing Conference EUSIPCO, Antalya, Turkey 2005.
• M. Ziółko, M. Nowak, B. Ziółko, Transmultiplexer Integer-to-Integer Filter Banks, Proceedings of The First IFIP International Conference in Central Asia on Internet, The Next Generation of Mobile, Wireless and Optical Communications Networks, Bishkek, Kyrgyzstan 2005.
• P. Sypka, B. Ziółko, M. Ziółko, Integer-to-Integer Filters in Image Transmultiplexers, Proceedings of 2006 Second International Symposium on Communications, Control and Signal Processing, ISCCSP, Marrakech, Morocco 2006.
• P. Sypka, M. Ziółko and B. Ziółko, Lossy Compression Approach to Transmultiplexed Images, 48th International Symposium ELMAR-2006, Zadar, Croatia.
• B. Ziółko, S. Manandhar, R.C. Wilson, Phoneme segmentation of speech, Proceedings of ICPR 2006, Hong Kong, 2006.
• B. Ziółko, S. Manandhar, R.C. Wilson, M. Ziółko, Wavelet method of speech segmentation, Proceedings of EUSIPCO 2006, Florence, Italy.
• P. Sypka, M. Ziółko, B. Ziółko, Robustness of Transmultiplexed Images, International Conference Mixed Design of Integrated Circuits and Systems Mixdes, Gdynia, 2006.
• B. Ziółko, J. Gałka, S. Manandhar, R. C. Wilson, M. Ziółko, The use of statistics of Polish phonemes in speech recognition, Speech Signal Annotation, Processing and Synthesis, Poznań, 2006.
• P. Sypka, M. Ziółko and B. Ziółko, Lossless JPEG-Base Compression of Transmultiplexed Images, Proceedings of the 12th Digital Signal Processing Workshop, pp. 531-534, Wyoming 2006.
• M. Ziółko, P. Sypka, B. Ziółko, Application of 1-D Transmultiplexer to Images Transmission, Proceedings of the 32nd Annual Conference of the IEEE Industrial Electronics Society IECON, pp. 3564-3567, Paris, France, 2006.
• M. Kotti, C. Kotropoulos, B. Ziółko, I. Pitas, V. Moschou, A Framework for Dialogue Detection in Movies, Proceedings of Multimedia Content Representation, Classification and Security International Workshop, MRCS, Lecture Notes in Computer Science, vol. 4105, pp. 371-378, Istanbul, Turkey, 2006.
• P. Sypka, M. Ziółko, B. Ziółko, Approach of JPEG2000 Compression Standard to Transmultiplexed Images, Proceedings of the Visualization, Imaging, and Image Processing, VIIP, Palma De Mallorca, Spain, 2006.
• B. Ziółko, J. Gałka, S. Manandhar, R. C. Wilson, M. Ziółko, Triphone Statistics for Polish Language, Proceedings of 3rd Language and Technology Conference, Poznań, Poland, 2007.
• B. Ziółko, S. Manandhar, R. C. Wilson, Fuzzy Recall and Precision for Speech Segmentation Evaluation, Proceedings of 3rd Language and Technology Conference, Poznań, Poland, 2007.
• B. Ziółko, S. Manandhar, R. C. Wilson, M. Ziółko, LogitBoost Weka Classifier Speech Segmentation, Proceedings of 2008 IEEE International Conference on Multimedia and Expo, Hannover, Germany, 2008.
• B. Ziółko, S. Manandhar, R. C. Wilson, M. Ziółko, Language Model Based on POS Tagger, Proceedings of SIGMAP 2008, the International Conference on Signal Processing and Multimedia Applications, Porto, Portugal, 2008.
• B. Ziółko, S. Manandhar, R. C. Wilson, M. Ziółko, J. Gałka, Application of HTK to the Polish Language, Proceedings of IEEE International Conference on Audio, Language and Image Processing, Shanghai, 2008.
• B. Ziółko, S. Manandhar, R. C. Wilson, M. Ziółko, Semantic Modelling for Speech Recognition, Proceedings of Speech Analysis, Synthesis and Recognition. Applications in Systems for Homeland Security, Piechowice, Poland, 2008.
• B. Ziółko, S. Manandhar, R. C. Wilson, Bag-of-words Modelling for Speech Recognition, Proceedings of International Conference on Future Computer and Communication, Kuala Lumpur, Malaysia, 2009.
• B. Ziółko, M. Ziółko, Linguistic Calculations on Cyfronet High Performance Computers, Proceedings of Conference of the High Performance Computers' Users, Zakopane, Poland, 2009.
• B. Ziółko, J. Gałka, M. Ziółko, Phone, diphone and triphone statistics for Polish language, Proceedings of SPECOM 2009, St. Petersburg, Russia, 2009.
• B. Ziółko, J. Gałka, M. Ziółko, Phoneme ngrams based on a Polish newspaper corpus, Proceedings of WORLDCOMP'09, Las Vegas, USA, 2009.
• B. Ziółko, J. Gałka, M. Ziółko, Phonetic statistics from an Internet articles corpus of Polish language, Proceedings of Intelligent Information Systems, Kraków, Poland, 2009.

Journals:

• M.P. Sellars, G.E. Athanasiadou, B. Ziółko, S.D. Greaves, Opposite-sector uplink interference in broadband FWA networks in high-rise cities, The IEE Electronics Letters, vol. 40, no. 17, pp. 1070-1, 2004.
• M. Ziółko, A. Dziech, R. Baran, P. Sypka, B. Ziółko, Transmultiplexing System for Compression of Selected Signals, WSEAS Transactions on Communications, issue 12, vol. 4, pp. 1427-1434, December 2005.
• M. Dyrek, J. Gałka and B. Ziółko, Measures On Wavelet Segmentation of Speech, International Journal Of Circuits, Systems And Signal Processing, NAUN 2008.
• J. Gałka and B. Ziółko, Study of Performance Evaluation Methods for Non-Uniform Speech Segmentation, International Journal Of Circuits, Systems And Signal Processing, NAUN 2008.
List of Abbreviations

AMI - Augmented Multi-party Interaction
ANN - Artificial Neural Network
ASR - Automatic Speech Recognition
BEEP - British English Phonemic Transcription Dictionary
CML - Conditional Maximum Likelihood
CMU - Carnegie Mellon University
CUED - Cambridge University Engineering Department
DARPA - Defence Advanced Research Projects Agency
DBNs - Dynamic Bayesian Networks
DCT - Discrete Cosine Transform
DWT - Discrete Wavelet Transform
FBE - Filter Bank Energy
FFT - Fast Fourier Transform
fMPE - feature-space Minimum Phone Error
GSM - Global System for Mobile
HLDA - Heteroscedastic Linear Discriminant Analysis
HMM - Hidden Markov Model
HTK - Hidden Markov Model Toolkit
IIS - Improved Iterative Scaling
LFCCs - Linear Frequency Cepstrum Coefficients
LM - Language Models
LPCC - Linear Prediction Coefficients
LSA - Latent Semantic Analysis
MaxEnt - Maximum Entropy
MFCC - Mel Frequency Cepstrum Coefficients
MFMGDCCs - Mel Frequency Modified Group Delay Cepstral Coefficients
MFPSCCs - Mel Frequency Product Spectrum Cepstral Coefficients
MGDCCs - Modified Group Delay Cepstral Coefficients
MLLR - Maximum Likelihood Linear Regression
MMSE - Minimum Mean Square Error
MPE - Minimum Phone Error
PLP - Perceptual Linear Predictive
PMF - Probability Mass Function
POS - Part Of Speech
RASTA - Relative Spectral
RCs - Reflection Coefficients
SAT - Speaker Adaptive Training
SED - Stream Editor
SHLDA - Smoothed Heteroscedastic Linear Discriminant Analysis
SNR - Signal to Noise Ratio
SPLICE - Stereo-based Piecewise Linear Compensation for Environments
STFT - Short Time Fast Fourier Transform
SVD - Singular Value Decomposition
TIMIT - Texas Instrument/Massachusetts Institute of Technology
VTLN - Vocal Tract Length Normalisation
WER - Word Error Rate

Declaration

This thesis has not previously been accepted in substance for any degree and is not being concurrently submitted in candidature for any degree other than Doctor of Philosophy of the University of York. This thesis is the result of my own investigations, except where otherwise stated. Other sources are acknowledged by explicit references. I hereby give consent for my thesis, if accepted, to be made available for photocopying and for inter-library loan, and for the title and summary to be made available to outside organisations.

Chapter 1
Introduction

As information technology has an impact on more and more aspects of our lives with every year, the problem of communication between human beings and information processing devices becomes increasingly important. Up to now, such communication has almost entirely been through the use of keyboards and screens, but speech is the most widely used, most natural and fastest means of communication for people. Moreover, mobile computing devices are becoming increasingly small. The bottom limit lies not in integrated circuit design size but, simply, in the size a human can operate with their fingers. There are also more and more hands-free computer systems, such as in-car systems. We must redefine traditional methods of human-computer and human-machine interaction. Unfortunately, machine capabilities for interpreting speech are still poor in comparison to what a human can achieve, even though we can predict that automatic speech recognition (ASR) will become a very pervasive technology (Alewine et al., 2004).
1.1 Contribution

An aim of our research was to improve the accuracy of speech recognition and to find the elements which might be especially effective in the ASR of highly inflective and non-positional languages like Polish. English differs considerably in some aspects, and some of these differences have an impact on speech recognition systems. The part-of-speech (POS) structure is much more regular in English than in Polish, which means it is much more predictable. A word can change its POS meaning depending on its position; for example, we understand nouns located to the left of another noun as adjectives. In Polish such a change is marked by morphology rather than position. English has many short forms, including pronouncing many vowels weakly as /ə/ and skipping several letters in longer words. There are also some Polish phonemes which do not exist in English, and the other way around. As it is a wide field, research was conducted on chosen elements. As a part of our research we carried out practical, linguistic studies on the differences between Polish and English. Phonetic statistics for Polish were collected and analysed. These statistics helped in further work. Among other things, an ASR system for Polish was trained and tested with the hidden Markov model toolkit (HTK). The model we created was trained from real data for all biphones in Polish and, by HTK scripts, for all triphones in a synthesised way using the statistics that we collected. The system can be adapted to any vocabulary; however, it does not work efficiently for large vocabulary tasks.

One of the possible improvements in ASR lies in detecting phoneme boundaries. This information is typically skipped in existing solutions. Speech is usually analysed in frames of constant length; analysing separate phonemes would be much more accurate. One can quite easily set phoneme boundaries by observing spectrograms or discrete wavelet transform (DWT) spectra of speech; however, it is very difficult to give an exact algorithm to find them. Constant segmentation benefits from simplicity of implementation and the simple comparison of blocks of the same length. However, it is perceptually unnatural. Human phonetic categorisation is very poor for such short segments (Morgan et al., 2005). Constant segmentation is not natural, as phonemes have different lengths. Moreover, boundary effects introduce additional distortions, and framing creates more boundaries than phoneme segmentation. We have to consider these boundary effects, which can cause errors. Obviously, a smaller number of boundaries means smaller errors due to the mentioned effects. Constant segmentation therefore risks losing information about the phonemes due to merging different sounds into single blocks, losing phoneme length information and losing the complexity of individual phonemes. Phoneme duration can also be used as an additional parameter in speech recognition, improving the accuracy of the whole process (Stöber and Hess, 1998).

There has been very little interest in using POS tags in ASR. We investigated their application in ASR. POS tag trigrams, a matrix grading possible neighbourhoods or a probabilistic tagger can be created and used to predict a word being recognised based on the left context analysed by a POS tagger.
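As a rough illustration of this idea (not the exact formulation developed later in the thesis), the sketch below scores each recognition hypothesis by the probability of its POS tag sequence under a trigram model and keeps the best-scoring one; the tag inventory, probability values and hypotheses are invented for the example.

```python
from collections import defaultdict

# Hypothetical POS-trigram probabilities; in practice they would be estimated
# from a POS-tagged corpus (normalised counts of tag trigrams).
trigram_prob = defaultdict(lambda: 1e-6)  # small floor for unseen trigrams
trigram_prob[("ADJ", "NOUN", "VERB")] = 0.12
trigram_prob[("NOUN", "VERB", "NOUN")] = 0.09

def pos_sequence_score(tags):
    """Multiply trigram probabilities over the POS tag sequence of a hypothesis."""
    score = 1.0
    padded = ["<s>", "<s>"] + list(tags)
    for i in range(2, len(padded)):
        score *= trigram_prob[(padded[i - 2], padded[i - 1], padded[i])]
    return score

# Toy n-best list: each entry is the tag sequence of a hypothesis plus a label.
hypotheses = [
    (["NOUN", "NOUN", "VERB"], "hypothesis 1"),
    (["ADJ", "NOUN", "VERB"], "hypothesis 2"),
]
best = max(hypotheses, key=lambda h: pos_sequence_score(h[0]))
print(best[1])
```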
Another innovation in speech recognition is based on semantic analysis as the very last step of the process. It can be applied as an additional measure, choosing a non-first hypothesis from an n-best list of audio model recognition hypotheses if the first one does not fit the semantic content. It is not possible to recognise speech using acoustic information only. The human perception system is based upon catching context, structure and understanding, combined with recognition. It is much easier to recognise and repeat without any errors a heard sentence if it is in a language we understand, compared to a sentence in a language we are not familiar with. Language modelling can therefore greatly improve recognition. We decided to focus on using information which has not been used, or not commonly used until now, in speech recognition. POS tags have rarely been applied to English, as English can be modelled efficiently using context-free grammars. In the case of Polish, it is very difficult to provide tree structures which represent all possible sentences, as the order of words can vary significantly. We thought that Polish could be modelled using POS tags because some tags are much more probable in the context of some others. Unfortunately, experiments showed that POS information is too ambiguous to be used in the way we proposed. Semantic analysis is generally very difficult, due to information sparsity problems. We believe that this is why it has not been used very commonly in existing ASR systems: language models based on grammar structure were quite efficient for English, and there was no necessity of using semantic analysis. In the case of Polish, semantic information has to be included in a language model due to syntactic irregularities. A bag-of-words model was invented. It applies word-topic statistics to re-rank a list of hypotheses from models of lower levels.

1.2 Thesis Overview

We investigated several new elements of ASR systems with special interest in highly inflective and non-positional languages like Polish. This includes non-constant segmentation for acoustic modelling. We have analysed some aspects of Polish, as a representative of highly inflective languages, to choose the best approach to its ASR. Apart from this, we investigate introducing POS tagging and semantic information analysis into ASR systems.

1.2.1 Introduction and Literature Review

In the first chapter we will introduce the general aspects of the research areas that are involved in ASR. Specifically, we pay attention to previous work concerning signal processing methods like the DWT, speech segmentation and parametrisation, pattern recognition, language modelling (for example hidden Markov models (HMM) and n-grams) and natural language processing (NLP), mainly lexical semantics, POS tagging and latent semantic analysis (LSA). Some literature in linguistics, mathematical analysis, probability and information theory is also considered.

1.2.2 Linguistic Aspects of Highly Inflective Languages Using Polish as an Example

This chapter will focus on the linguistic background (Ostaszewska and Tambor, 2000) which is useful for ASR. Linguists have provided many basic assumptions in the methodology of recognising English. As we aim to create an ASR system for Polish, a similar analysis should be done, because these two languages vary in some aspects. This chapter will summarise phonological knowledge about sounds in Polish, pronunciation rules and grammatical phenomena related to rich morphology. A Polish text corpus was analysed to find information about phoneme statistics. We were especially interested in triphones, as they are commonly used in many speech processing applications like the HTK speech recogniser. An attempt to create the full list of triphones for the Polish language is presented. A vast amount of phonetically transcribed text was analysed to obtain the frequency of triphone occurrences. A distribution of the frequency of triphone occurrences and other phenomena are presented. The standard phonetic alphabet for Polish and methods of providing phonetic transcriptions are described as well. The ASR system for Polish based on HTK is described, with a detailed analysis of the errors it committed.
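The counting step behind such statistics is simple; below is a minimal sketch assuming the corpus is already phonetically transcribed into one phoneme list per utterance (the toy utterances and symbols are made up, not taken from the corpus used in the thesis).

```python
from collections import Counter

def triphone_counts(utterances):
    """Count overlapping phoneme triples over phonetically transcribed utterances."""
    counts = Counter()
    for phonemes in utterances:
        for i in range(len(phonemes) - 2):
            counts[tuple(phonemes[i:i + 3])] += 1
    return counts

# Toy example with SAMPA-like symbols; the real statistics come from a large corpus.
corpus = [["o", "s'", "e", "m"], ["j", "e", "d", "e", "n"]]
stats = triphone_counts(corpus)
total = sum(stats.values())
for triphone, count in stats.most_common(5):
    print(triphone, count, count / total)  # triphone, raw count, relative frequency
```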
1.2.3 Phoneme Segmentation and Acoustic Models

Speech has to be split into units to be analysed. The most common way is to use constant-time framing with overlapping. Phoneme segmentation is another approach, which may greatly improve acoustic models if phoneme boundaries are detected correctly. We will present our own segmentation method, an evaluation method and the way to apply it in ASR. The localisation of phoneme boundaries is useful in several speech analysis tasks and in particular for speech recognition. Here it enables the use of more accurate acoustic models, since the lengths of phonemes are known and more accurate information is provided for parametrisation. Our method compares the values of power envelopes and their first derivatives for six frequency subbands. Specific scenarios which are typical of phoneme boundaries are searched for. Discrete times with such events are noted and graded using a distribution-like event function. The final decision on the localisation of boundaries is taken by analysis of the event function. Boundaries are therefore extracted using information from all the subbands. The method was developed on a small set of Polish hand-segmented words and tested on another, large corpus containing 16425 utterances. A recall and precision measure specifically designed to measure the quality of speech segmentation was adapted by using fuzzy sets; from this, results with an f-score equal to 72.49% were obtained. A statistical classification method was also used to check which features are useful, and also served as a baseline for the comparison of the new method.

1.2.4 Language Modelling

Language models are necessary for any large vocabulary speech recogniser. There are two main types of information which can be used to support the modelling of a language: syntactic and semantic. One of the ways to apply syntactic modelling is to use POS taggers. Morphological information can be statistically analysed to provide the probability of a sequence of words using their POS tags. This chapter covers methods of POS tagging and the available POS-tagged data in Polish. We present our own method of applying taggers and POS tag statistics to ASR as a part of language modelling. Unfortunately, experiments showed that this type of modelling is not effective. Semantic analysis can be done in many different ways and has already been applied in ASR. However, this kind of modelling is difficult due to the data sparsity problem. The literature always mentions semantic analysis as a necessary step in ASR, but it is very difficult to find research papers that report the exact impact of applying semantic methods on recognition. We investigate LSA and present our own method, which was shown to be more effective in experiments. The invented model differs from LSA in the way the word-topic matrix is smoothed. Our method trains a model faster than the widely known LSA and is more efficient.
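To make the role of such a model concrete, here is a minimal sketch of re-ranking an n-best list with word-topic statistics; the word-topic counts, the per-topic scoring and the smoothing constant are illustrative placeholders, not the scheme developed in Chapter 5, and the Polish example words are invented.

```python
import math

# Hypothetical word-topic counts (one row per word, one column per topic),
# e.g. collected from a training corpus grouped into topics.
word_topic = {
    "parlament": [40, 2, 1],
    "ustawa":    [35, 1, 2],
    "mecz":      [1, 30, 3],
}

def semantic_score(words, topic_count=3, eps=1e-3):
    """Score a hypothesis by how well its words agree on a single topic."""
    best = 0.0
    for t in range(topic_count):
        log_p = 0.0
        for w in words:
            counts = word_topic.get(w, [0] * topic_count)
            total = sum(counts) + eps * topic_count
            log_p += math.log((counts[t] + eps) / total)  # add-eps smoothing
        best = max(best, math.exp(log_p / max(len(words), 1)))
    return best

# Combine with the acoustic score to re-rank an n-best list of hypotheses.
nbest = [("mecz ustawa", 0.52), ("parlament ustawa", 0.48)]  # (hypothesis, acoustic score)
reranked = sorted(nbest, key=lambda h: h[1] * semantic_score(h[0].split()), reverse=True)
print(reranked[0][0])  # the semantically consistent hypothesis wins
```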
Chapter 2
Literature Review

This chapter presents the history of research on speech recognition and some details of more up-to-date publications. ASR is a very wide area, so only a selection of topics from this field is presented, namely those studied during the author's PhD.

2.1 History of Speech Recognition

First, we should define what an ASR system is. Because of the variety of applied methods and approaches, it is difficult to define it by describing how it works. It is better to say that an ASR system is software which changes an acoustic signal into a sequence of symbols. Speech is the input, while a sequence of written words is the output. Obviously, this definition covers a vast area of applications. We can distinguish systems trained for a given user only and systems which are speaker independent. A system can be dedicated to continuous speech or to discrete word recognition. Some applications assume that speech is clear (or rather clear enough), while some are dedicated to working in a factory or at an airport, where noise is a crucial issue. Finally, the size of the vocabulary is a feature of a system. There are quite different approaches for speech recognition with a small, limited vocabulary and with a large vocabulary (especially with an unlimited dictionary).

To give a proper background, we would like to set speech recognition research in time. The invention of the phonograph in 1870 by Alexander Graham Bell can be considered as the very first step of creating an ASR system. More precisely, the phonograph is the first audio recording tool, which transferred acoustic waves into electrical waves, allowing further processing.

Figure 2.1: Toy dog Rex - the first working speech recognition system (USA 1920)
Figure 2.2: Scheme of a speech recognition system

Another important milestone was set by the Swiss linguist Ferdinand de Saussure, who described general rules of linguistics, which were collected and printed by his students and colleagues after his death in 1916 (de Saussure, 1916). His ideas became the rudiments of modern linguistics and NLP. Then, quite surprisingly, we can speak about the first working ASR system in 1920. It was a celluloid toy dog developed by Walker Balke and National Company Inc., presented in Fig. 2.1. The dog was attached to the turntable of a phonograph and could jump out of its kennel when detecting its own name 'Rex'. The mechanism was controlled by a resonant reed; in fact it was detecting the vowel /e/ by a metal bar arranged to form a bridge and sensitive to acoustic energy at 500 Hz, which vibrated it, interrupting the current and releasing the dog. In 1952 Bell Labs created a digit recogniser (Davis et al., 1952). It was based on analysis of the spectrum divided into 2 frequency bands (above and below 900 Hz). It recognised digits with an error of less than 2%, provided the user did not change the position of the head with respect to the microphone between training and testing. In the sixties there were two important inventions: the fast Fourier transform (FFT) (Tukey et al., 1963) and the HMM (Rabiner, 1989), which have had a crucial impact on current ASR systems. There was a growing interest in speech recognition, which resulted in running the ARPA Speech Understanding Project in 1971. This ambitious and well-funded project ($15M) started connected word recognition with a vocabulary size of around 1000 words.
It resulted in the CMU Harpy system (Lowerre, 1976) with a 5% sentence error rate. Thanks to the project, the seventies were a time of rapid improvements in ASR. The Viterbi algorithm for model training was developed between 1967 and 1973 (Viterbi, 1967; Forney, 1973). In 1975, linear predictive coding, the first successful speech parameterisation method, was invented (Makhoul, 1975). Further research in speech recognition has a larger impact on this dissertation, so it will be described in more detail in the following sections.

The general scheme of ASR was created in the eighties. It has survived until now with just small differences. All the most important steps are presented in Fig. 2.2, which is based on (Rabiner and Juang, 1993). Our research is focused on segmentation and semantic analysis, so these will be described in detail. Some other topics are connected very closely, so they have also been described. Some topics which are not crucial for our research have been skipped because of the limit on the size of the thesis. The whole large field of pre-processing is the first of them, including noise reduction, feature compensation and missing feature approaches. There are too many papers about these topics to describe that step of speech recognition even succinctly; many of them are very well summarised in (Raj and Stern, 2005).

Figure 2.3: Typical current services offered by call centres with ASR (above) and its future (below)

ASR can save around 60% of the time spent on work with a computer, through automatic transcription and dictation rather than typing, as we are able to speak 3 times faster than we can type. Sophisticated ASR systems are becoming more important, as customer services need to be more friendly, while the costs of running call centres need to be kept at a minimum level (Fig. 2.3). The ASR system may also introduce an incredibly efficient lossy compression for communications, if recognition is seen as coding and speech synthesis as decoding.

2.2 Linguistic Rudiments of Speech Analysis

It is essential to understand the rudiments of the speech generation process in order to do research on digital speech analysis. Speech signals consist of sound sequences, which we interpret as representations of information. Phonetics is the science which classifies these sounds. Most languages, including English and Polish, can be described in terms of a set of distinctive sounds - phonemes. Both languages consist of around 40 phonemes; however, some of them exist in English and do not exist in Polish, and the other way round. They are grouped into vowels and consonants (nasals, stops and fricatives).
British English phoneme transcription, presented in Table 2.1, is based on the BEEP dictionary (Beep dictionary, 2000), which is commonly used by speech recognisers like HTK (Young et al., 2005). It contains 20 vowels and 24 consonants. Polish phoneme transcription is typically presented in SAMPA notation (Ostaszewska and Tambor, 2000), as in Table 2.2, with 37 or 39 phonemes.

Table 2.1: Phoneme transcription in English - BEEP dictionary
  aa  odd         ae  at
  ah  hut         ao  ought
  aw  cow         ax  abaft (first vowel, schwa)
  ay  hide        ea  wear
  eh  Ed          er  hurt
  ey  ate         ia  fortieth
  ih  it          iy  teen
  oh  mob         ow  lobe
  oy  toy         ua  intellectual
  uh  nook        uw  two
  p   pick        b   be
  t   tip         d   dee
  f   fee         v   vise
  th  thick       dh  thee (eth)
  s   sick        z   zip
  sh  ship        zh  seizure
  ch  cheese      jh  jeep
  k   key         ng  rang (engma)
  g   green       m   me
  n   new         l   lee
  r   ream        w   win
  y   you         hh  he

Table 2.2: Phoneme transcription in Polish - SAMPA
  i I e a o u e~ o~ j l w r m n n' N v f x z s z' s' Z S dz ts dz' ts' dZ tS b p d t g k

Irregularities of pronunciation and linguistic rules are a real challenge for speech recognition. Many words sound similar, especially in English; they are called homophones (e.g. night and knight). What is more, there are even sentences which sound very similar (e.g. 'I helped Apple wreck a nice beach' and 'I helped Apple recognise speech'). Another problem is caused by the context dependency of phonemes. As we said, there are around 40 different phonemes, but actually all of them vary at the beginning and at the end, depending on the neighbouring phonemes. Such triples are so-called triphones. Around 40% of the possible phoneme combinations exist, which gives 25600 possible patterns to recognise (0.4 × 40³ = 25600). There are no trivial methods for such a number. Unfortunately, it is not the only problem. Phonemes overlap at their boundaries; there is co-articulation of phonemes and words. Intonation and sentence stress play an important role in the interpretation. The utterances 'go!', 'go?' and 'go.' can clearly be recognised by a human but are difficult for a computer. In naturally spoken language there are no pauses between words, and it is difficult for a computer to decide where the boundaries lie. This is why a general speech recognition system requires human knowledge and experience, as well as advanced pattern recognition and artificial intelligence.

2.3 Speech Processing

Speech carries information. This is quite obvious, but very often we forget that our brain has to decode speech on many different levels to produce real information. We have to do the same using computers. We understand speech processing as representing and transforming the waveform signal. For practical reasons we usually do it in the frequency domain, where the coded information is easier to find.

2.3.1 Spectrum

Originally, a spectrum was what is now called a spectre, for example a phantom or an apparition. In the 17th century the word spectrum was introduced into optics, referring to the range of colours observed when white light was dispersed through a prism. A sound spectrum is a representation of a sound in terms of the amount of vibration at each individual frequency. It is usually presented as a graph of either power or pressure as a function of frequency. The power or pressure is measured in decibels and the frequency is measured in vibrations per second - hertz [Hz]. It is important for any research on speech that speech is quite a specific audio signal, which can be distinguished by its pressure and frequency, as presented in Fig. 2.4, copied from (Tadeusiewicz, 1988). There is no point in analysing other frequencies. Similarly, a given range of acoustic pressure can be expected. We can limit the analysis to the subband of around 80-8000 Hz. This observation has already been used very successfully, for example in GSM mobile phones.
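As a small illustration of this band limitation (an assumption-laden sketch using SciPy, not part of the processing chain described later in the thesis), a speech signal can be band-passed to roughly this range before further analysis; the filter order and band edges are illustrative choices.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

def limit_speech_band(signal, fs, low=80.0, high=8000.0, order=4):
    """Band-pass a speech signal to roughly the 80-8000 Hz band."""
    high = min(high, 0.45 * fs)  # keep the upper edge safely below the Nyquist frequency
    sos = butter(order, [low, high], btype="bandpass", fs=fs, output="sos")
    return sosfiltfilt(sos, signal)

# Toy usage on one second of synthetic audio sampled at 16 kHz.
fs = 16000
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 50 * t) + np.sin(2 * np.pi * 440 * t)  # 50 Hz hum + 440 Hz tone
y = limit_speech_band(x, fs)  # the 50 Hz component is strongly attenuated
```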
In 1807, Jean Baptiste Joseph Fourier described his method of analysing heat propagation. It was very controversial and was received negatively by a committee of the Paris Institute which consisted of many famous mathematicians. The first objection, made by Lagrange and Laplace in 1808, was to Fourier's expansions of functions as trigonometrical series, what we now call the Fourier series. Other objections were connected to the equations of heat transfer. The Fourier spectrum (Fig. 2.5) is currently a basic and very common tool for analysing many types of stationary signals. A stationary signal is a signal that repeats into infinity with the same periodicity. The spectral representation of a signal is calculated as

$$\hat{s}(f) = \int_{-\infty}^{\infty} s(t)\,\exp(-2\pi j f t)\,dt. \qquad (2.1)$$

Figure 2.4: Speech audibility and average human hearing band (Tadeusiewicz, 1988)
Figure 2.5: An example of Fourier spectrum amplitude
Figure 2.6: Frequency spectrum of speech in a linear and a non-linear scale

The function $\hat{s}(f)$ defines the notion of a global frequency f in a signal. It is computed as inner products of the signal and the trigonometric functions $\cos(2\pi f t) - j\sin(2\pi f t)$ (from the Euler equation), as basis functions of infinite duration (2.1). Any non-stationarity is spread out over the whole frequency range in $\hat{s}(f)$. Therefore, non-stationary signals require changes in the analysis method. A non-stationary signal has to be windowed to be analysed by the Fourier transform. The original method was improved in 1965 by Cooley and Tukey (Cooley and Tukey, 1965), who found an algorithm to calculate the spectrum in fewer steps. It is known as the fast Fourier transform (FFT). The transform is then calculated locally for a given window, over which the signal is approximately stationary, by repeating that part and creating a periodic function. This approach is usually called the short time Fourier transform (STFT). Another way is to modify the basis functions used in the Fourier transform (trigonometric functions) into others, more concentrated in time and less in frequency. This way of thinking leads to wavelet transforms.

Human perception systems work on a non-linear scale; for example, it is much easier to perceive a candle in a dark room than in a lit one. Perception depends on background and reference. This is why we can say the natural scale for humans is the logarithmic one. The most common consequence of this fact is the use of decibels [dB]. For the same reason, we sometimes use the mel frequency scale in speech analysis, rather than the standard linear one in Hz. Frequency in mels is defined as

$$f_{mel} = 1000 \log_2\left(1 + \frac{f_{Hz}}{1000}\right). \qquad (2.2)$$

The comparison of the two frequency scales is presented in Fig. 2.6. The need for nonlinearity in ASR led to the creation of the expression 'cepstrum', formed from 'spectrum' by reversing the first four letters. This term was introduced by Tukey et al. in 1963 (Tukey et al., 1963). It has come to be the accepted terminology for the inverse Fourier transform of the logarithm of the power spectrum of a signal, $\int_{-\infty}^{\infty} \log|\hat{s}(f)|^2 \exp(2\pi j f t)\,df$. It was simplified by changing the inverse transform into a forward one, which does not change the basic idea (Rabiner and Schafer, 1978).
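A minimal NumPy sketch of the two definitions just given, the mel mapping of equation (2.2) and the simplified (forward-transform) cepstrum; the frame length and windowing used in the example are illustrative assumptions.

```python
import numpy as np

def hz_to_mel(f_hz):
    """Mel frequency as defined in equation (2.2)."""
    return 1000.0 * np.log2(1.0 + f_hz / 1000.0)

def cepstrum(frame):
    """Cepstrum of a speech frame: the (forward, as in the simplification above)
    Fourier transform of the logarithm of the power spectrum."""
    power = np.abs(np.fft.fft(frame)) ** 2
    return np.real(np.fft.fft(np.log(power + 1e-12)))  # small offset avoids log(0)

print(hz_to_mel(1000.0))  # 1000 Hz maps to 1000 mel in this definition
frame = np.hanning(400) * np.random.randn(400)  # a hypothetical 25 ms frame at 16 kHz
c = cepstrum(frame)
```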
Figure 2.7: The cepstrum is the Fourier transform of the log of the power spectrum
Figure 2.8: The types of speech segmentation (silence and speech, words, syllables, phonemes, phonetic features, speakers, and segmentation to fit a transcription)

2.4 Speech Segmentation

In the vast majority of approaches to speech recognition, the speech signals need to be divided into segments before recognition can take place. The properties of the signal contained in each segment are then assumed to be constant, or in other words to be characteristic of a single part of speech. Speech segmentation is easier than image segmentation (Nasios and Bors, 2005), as it has to be done in one dimension only. There are different meanings of segmentation, though (Fig. 2.8). Very often the term is used for word segmentation, which can be done by Viterbi and Forward-Backward Segmentation (Demuynck and Laureys, 2002). Another applied method (Subramanya et al., 2005) is based on the mean and variance of spectral entropy. Another issue covered by the same name, segmentation, is separating silence and speech in an audio recording (Zheng and Yan, 2004). The method uses so-called TRAPS-based segmentation and Gaussian mixture based segmentation (Nasios and Bors, 2006). Segmentation here means mainly removing non-speech events and additionally clustering according to speaker identities, environmental and channel conditions. Another possible segmentation is by phonetic features (not necessarily phonemes) (Tan et al., 1994), by applying wavelet analysis, which will be described in more detail in this dissertation. There also exists research on syllable segmentation (Villing et al., 2004). Another meaning is segmenting according to partially correct transcriptions (Cardinal et al., 2005); in this case segmentation is combined with recognition. Finally, we can understand segmentation as a process of breaking audio into phonemes (Grayden and Scordilis, 1994), where segmentation was conducted by filter bank energy contour analysis. In our research (Ziółko et al., 2006a,b), we find that phoneme segmentation is the most important, and this is why we will use the word 'segmentation' in the meaning of phoneme segmentation, if nothing else is mentioned. Phoneme segmentation and its usefulness in speech recognition will be described in more detail in the next chapter.

Naturally, if a frame contains the end of one phoneme and the beginning of another, it will cause recognition difficulties. Segmentation methods currently used in ASR are not particularly sophisticated. For example, they do not consider where phonemes begin and end; this causes conflicting information to appear at the boundaries of phonemes. Non-uniform phoneme segmentation can be useful in ASR for more accurate modelling (Glass, 2003).

2.5 Phoneme Segmentation

Constant-time segmentation or framing, for example into 23.2 ms blocks (Young, 1996), is commonly used to divide the speech signal for processing. This method benefits from simplicity of implementation and easy comparison of blocks, which are of the same length. However, it is perceptually unnatural, because of the variation in the duration of real phonemes. In fact, human phonetic categorisation is also very poor for such short segments (Morgan et al., 2005). Moreover, boundary effects introduce additional distortions (which are partially reduced by applying a Hamming window), and framing with such short segments creates many more boundaries than there are phonemes in the speech. These boundary effects can cause errors in speech recognition because of the mixing of two phonemes in a single frame. A smaller number of boundaries means a smaller number of errors due to the aforementioned effects. Constant segmentation therefore, while straightforward and efficient, risks losing valuable information about the phonemes, due to the merging of different sounds into a single block and because the complexity of individual phonemes cannot be represented in short frames. The length of a phoneme can also be used as an additional parameter in speech recognition, improving the accuracy of the whole process. A comparison of applying constant framing and phoneme segmentation is presented in Fig. 2.9.

Figure 2.9: Comparison of the frames produced by constant segmentation and phoneme segmentation
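For reference, constant framing itself is only a few lines of bookkeeping; the sketch below uses the 23.2 ms value quoted above, while the 50% overlap is an illustrative assumption rather than a recommendation.

```python
import numpy as np

def constant_frames(signal, fs, frame_ms=23.2, overlap=0.5):
    """Constant-time segmentation: cut the signal into fixed, overlapping blocks."""
    frame_len = int(round(frame_ms * 1e-3 * fs))
    hop = max(1, int(round(frame_len * (1.0 - overlap))))
    frames = [signal[i:i + frame_len]
              for i in range(0, len(signal) - frame_len + 1, hop)]
    return np.array(frames)

# Half a second of audio at 16 kHz yields far more constant frames than it has phonemes.
fs = 16000
x = np.random.randn(fs // 2)  # placeholder signal
frames = constant_frames(x, fs)
print(frames.shape)  # (number of frames, samples per frame)
```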
These boundary effects can cause errors in speech recognition because of the mixing of two phonemes in a single frame. A smaller number of boundaries means a smaller number of errors due to the aforementioned effects. Constant segmentation therefore, while straightforward and efficient, risks losing valuable information about the phonemes due to the merging of different sounds into a single block, and because the complexity of individual phonemes cannot be represented in short frames. The length of a phoneme can also be used as an additional parameter in speech recognition, improving the accuracy of the whole process. A comparison of applying constant framing and phoneme segmentation is presented in Fig. 2.9.

Figure 2.9: Comparison of the frames produced by constant segmentation and phoneme segmentation

Models based on processing information over long time ranges have already been introduced. The RASTA (RelAtive SpecTrAl) methodology (Hermansky and Morgan, 1994) is based on relative spectral analysis and the TRAPs (TempoRAl Patterns) approach (Morgan et al., 2005) is based on multilayer perceptrons with the temporal trajectory of logarithmic spectral energy as the input vector. It allows the generation of class posterior probability estimates.

A number of approaches have been suggested (Stöber and Hess, 1998; Grayden and Scordilis, 1994; Weinstein et al., 1975; Zue, 1985; Toledano et al., 2003) to find phoneme boundaries from the time-varying properties of the speech signal. These approaches utilise features derived from acoustic knowledge of the phonemes. For example, the solution presented in (Grayden and Scordilis, 1994) analyses a number of different subbands of the signal using its spectrum. Phoneme boundaries are extracted by comparing the percentage of signal power in different subbands. The approach of Toledano et al. (Toledano et al., 2003) is based on spectral variation functions. Such methods need to be optimised for particular phoneme data and cannot be performed in isolation from phoneme recognition itself. Neural networks (NN) (Suh and Lee, 1996) have also been tested, but they require time-consuming training.

Segmentation can be applied by segment models (SM) (Ostendorf et al., 1996; Russell and Jackson, 2005) instead of the HMM. The SM solution differs from the HMM by searching paths through sequences of frames of different lengths rather than through single frames. It means that segmentation and recognition are conducted at the same time and there is a set of possible observation lengths. In a general SM, the segmentation is associated with a likelihood and in fact describes the likelihood of a particular segmentation of an utterance. The SM for a given label is also characterised by a family of output densities which gives information about observation sequences of different lengths. These features of the SM solution allow the location of boundaries only at several fixed positions which are dependent on framing (on an integer multiple of the frame length).

The typical approach to phoneme segmentation for creating speech corpora is to apply dynamic programming (Rabiner and Juang, 1993; Holmes, 2001). Dynamic programming is a tool which guarantees finding the cumulative distance along the optimum path without having to calculate the distance along all possible paths. In speech segmentation it is used for time alignment of boundaries. The common practice is to provide a transcription done by professional phoneticians for one of the speakers in the given corpus.
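A minimal, illustrative dynamic-programming sketch of this idea (a classic DTW-style recursion over two one-dimensional feature sequences; the data are made up):

import numpy as np

def dtw_cost(ref, test):
    """Cumulative distance of the best monotonic alignment, without enumerating all paths."""
    n, m = len(ref), len(test)
    d = np.full((n + 1, m + 1), np.inf)
    d[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(ref[i - 1] - test[j - 1])
            d[i, j] = local + min(d[i - 1, j], d[i, j - 1], d[i - 1, j - 1])
    return d[n, m]

if __name__ == "__main__":
    reference = [1.0, 1.0, 5.0, 5.0, 2.0]       # e.g. features of a hand-segmented speaker
    other = [1.0, 5.0, 5.0, 5.0, 2.0, 2.0]      # the same utterance with different timing
    print(dtw_cost(reference, other))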
Then it is possible to automatically create phoneme segmentation of the same utterances for other speakers. This method is very accurate but demands a transcription and hand segmentation to start with. For this reason it is not very useful for any application other than creating a corpus.

There are several speech segmentation methods and several approaches to most of them, so it is natural to want to compare them. Surprisingly, evaluation methods for speech segmentation are quite simple and do not consider all scenarios. There are several suggested evaluation methods, but they are usually developed for particular solutions, are not very universal and lose some accuracy in their simplifications. Typically, evaluation is based on counting the number of insertions, deletions and substitutions of the automatic segmentations with respect to a hand-checked reference transcription. The automatic word segmentation (Demuynck and Laureys, 2002) was evaluated by counting the number of boundaries for which the deviation between automatic and manual segmentation exceeded thresholds of 35, 70 and 100 ms. The syllable segmentation (Villing et al., 2004) was evaluated by counting the number of insertion and deletion errors within a tolerance of 50 ms before and after a reference boundary. Some authors do not publish any details about such a tolerance, or do not give a tolerance at all, but use generally the same method (Grayden and Scordilis, 1994). This insertion and deletion approach has a few flaws. First of all, the value of the tolerance is questionable and cannot be set with any exact justification. It is rather chosen using experience, quite often experience with the results of a given speech segmentation method and experiments. What is more, such methods treat different inaccuracies simply as correct or wrong detections (or grade them on a slightly larger scale) without considering 'how wrong' a detection really is. Unfortunately, that is not the last of the problems. A tolerance is set, like 50 ms (Villing et al., 2004) for syllables, according to the statistical average length of a segment. The disadvantage of this approach is that speech segments, whatever they are, words, syllables or phonemes, vary greatly in length. This is why a shift of 50 ms in boundary location is not the same for a 100 ms long syllable as for a 300 ms long one. Different speech segmentation methods were compared by us in (Gałka and Ziółko, 2008).

2.6 Speech Parametrisation

Speech parametrisation is a representation of the spectral envelope of an audio signal which can be used in further processing. The two most common parametrisation methods are mel-frequency cepstral coefficients (MFCC) (Davis and Mermelstein, 1980) and perceptual linear predictive (PLP) analysis (Hermansky, 1990).

Figure 2.10: The types of speech feature extraction methods, grouped in two avenues: based on linear prediction coefficients (with PLP as the main one) and filter bank analysis (with MFCC as the main one)

2.6.1 Parametrisation Methods Based on Linear Prediction Coefficients

PLP (Rabiner, 1989) has become one of the standard speech parametrisation methods (Fig. 2.10), and is used as a baseline for a part of the new research. Because of its importance, there have been further improvements to the method, some of which are described below.

Figure 2.11: fMPE transformation matrix from the original low-dimensional feature vector into a high-dimensional one

Misra et al.
(Misra et al., 2004) suggest normalising the spectrum into a probability mass function (PMF), or more strictly speaking a PMF-like function. Such a representation allows the calculation of entropy. Voice and non-voice segments are easily detected, even with a low signal-to-noise ratio (SNR). A hidden Markov model / artificial neural network (HMM/ANN) hybrid system was used in the experiments. Because the PLP features are the only baseline provided and a novel hybrid system is used, it is difficult to compare the results with many other papers. The results suggest that the entropy features are less efficient than PLP, but it is possible to improve a system based on the PLP by using entropy to create extra parameters. Entropy is a good choice to measure the gross peakiness of a data spectrum.

Deng et al. (Deng et al., 2005) present and compare two feature extraction and compensation algorithms which improve the PLP, and possibly other methods. The first one is the feature-space minimum phone error (fMPE) (Fig. 2.11) and the second is the stereo-based piecewise linear compensation for environments (SPLICE). The fMPE is an improvement to the PLP. It is based on adding an additional high-dimensional feature vector containing conditional probabilities of each feature given the whole original low-dimensional feature vector. The high-dimensional feature vector is projected by a transformation matrix into the subspace of the same dimension as the original vector (Fig. 2.11). The transformation matrix is created by re-estimation via minimising the discriminative objective function known as the minimum phone error by gradient descent. The training is conducted by an iterative scheme of retraining the HMM parameters using the fMPE feature sets via maximum likelihood. There are different possible decomposition schemes of the fMPE. One of them may be interpreted as a compensation for the original features by adding a large number of bias vectors, each of which is computed as a full-rank rotation of a small set of posterior probabilities. Approximations can easily be made to remove the numerical problems in maximum-likelihood estimation. Another decomposition scheme is interpreted as compensating for the original PLP cepstral features by a frame-dependent bias vector. The fMPE can be understood as the compensation vector, which consists of the linear weighted sum of a set of frame-independent correction vectors. The weight is then the conditional probability associated with the corresponding correction vector. The fMPE algorithm is empirical in its nature.

Figure 2.12: Mel frequency cepstrum coefficients

The SPLICE is also a method of compensation. It assumes that an ideally clean speech feature vector is 'piecewise linearly' related to the corresponding analysed noisy one. Which 'piece' of the local approximation is used for the piecewise linear approximation to the non-linear relationship between the noisy and clean speech feature vectors is determined by an index. With such an assumption the SPLICE compensation is calculated using the minimum mean square error (MMSE). This gives conditional probabilities corresponding to those in the fMPE algorithm. In contrast to the fMPE, the compensation by addition is a natural consequence of the MMSE optimisation rule.

The PLP has found several applications. The transcription of conference room meetings is described in (Hain et al., 2005).
It is based on the augmented multi-party interaction (AMI) system, using the HTK for HMM modelling and N-gram based language models. Phonetic decision tree state-clustered triphone models with a standard left-to-right three-state topology are used for acoustic modelling. States are represented by mixtures of 16 Gaussians. Coefficients obtained by applying the PLP can be transformed into other types of parameters (cepstral coefficients) for further analysis. However, there is some ambiguity in the paper regarding the features. First, it is stated that 12 mel-frequency PLP coefficients, with first and second order derivatives, were used by front-ends as parameters to form a 39 dimensional feature vector. Then, it is said that the smoothed heteroscedastic linear discriminant analysis (SHLDA) reduces a 52 dimensional vector (the standard vector plus third derivatives) to 39 dimensions. Cepstral mean and variance normalisation are performed on complete channels. The vocal tract length normalisation (VTLN) gives speaker adaptation. The maximum likelihood criterion estimates warp factors. The UNISYN pronunciation lexicon was used. The method for feature extraction is not very novel but the complexity of the system and the results of experiments on a large amount of data are impressive. The AMI is a global approach to a large vocabulary ASR system.

2.6.2 Parametrisation Methods Based on Filter Banks

Davis and Mermelstein (Davis and Mermelstein, 1980) suggested a new approach to speech parametrisation in 1980. They described and compared two groups of parametric representations: one based on the Fourier spectrum (the MFCCs and the linear frequency cepstrum coefficients, LFCCs) and another based on the linear prediction spectrum (linear prediction coefficients LPCs, the reflection coefficients RCs and the cepstrum coefficients derived from the linear prediction coefficients, LPCCs). The MFCCs proved to be the best of them, and are computed using triangular bandpass filters organised in a bank to filter different frequencies. The filters' characteristics overlap each other in such a way that the next filter begins at the middle, best-passing frequency of the previous one (Fig. 2.12). The MFCCs are computed as sums over the filter outputs X_k,

MFCC_i = \sum_{k=1}^{12} X_k \cos\left( i \left( k - \frac{1}{2} \right) \frac{\pi}{20} \right), \quad i = 1, 2, \ldots, M .   (2.3)

The method was improved by setting 12 basic coefficients, energy, and first and second derivatives of these, which gives a set of 39 features (Young, 1996). This now seems to be the most common parametrisation and a baseline for new research in ASR. Some improvements of MFCCs and new approaches based on filter banks are described below.

Most researchers believe that the phase spectrum information is not useful in speech recognition. Zhu and Paliwal (Zhu and Paliwal, 2004) argue that this is a wrong assumption. The phase spectrum information is less important than the magnitude spectrum, but it can still be useful. They use the product of the power spectrum and the group delay function (GDF). They compared a standard set of 39 parameters based on the MFCCs (12 MFCCs + energy, and first and second derivatives of these) with three new approaches: modified-group-delay cepstral coefficients (MGDCCs), mel-frequency modified-group-delay cepstral coefficients (MFMGDCCs) and mel-frequency product spectrum cepstral coefficients (MFPSCCs). MFCCs are the best for an absolutely clean signal and MFPSCCs are the best for noisy signals. MFPSCCs are calculated in four steps (Zhu and Paliwal, 2004):

1.
Compute the FFT spectrum of the speech signal x(n) and of the speech signal values multiplied by their sample indices, nx(n).
2. Compute the product spectrum.
3. Apply a mel-frequency filter bank to the product spectrum in order to get the filter-bank energies (FBEs).
4. Compute the discrete cosine transform (DCT) (Ahmed et al., 1974) of the log FBEs to get the MFPSCCs.

MGDCCs and MFMGDCCs are calculated by applying the so-called modified GDF (MGDF) to a smoothed spectrum calculated using the FFT. Computing the DCT provides the features. In the case of MFMGDCCs, mel-frequency filter banks are additionally applied before computing the DCT. Both methods were evaluated by the authors as less efficient than MFCCs.

Zhu and Paliwal used an HMM as a model of the language. In the calculation of all the features, the speech signal was framed using a Hamming window every 10 ms with a 30 ms frame. A pre-emphasis filter was applied. The mel filter bank was designed with 23 frequency bands in the range from 64 Hz to 4 kHz.

Another interesting approach is given by Ishizuka and Miyazaki (Ishizuka and Miyazaki, 2004). Their method focuses on feature extraction that represents the aperiodicity of speech. The method is based on gammatone filter banks, framing, autocorrelation and comb filters. First the signal is filtered by the gammatone filter banks, which are designed by using the equivalent rectangular bandwidth scale to choose the centre frequencies and bandwidths of the filters. Each bank consists of 24 filters. Various comb filters are designed for the outputs of the gammatone filters. They support separation of the output into its periodic and aperiodic features in subbands. Aperiodicity and periodicity power vectors are calculated. The DCT is used to extract parametrisation features from the vectors. The method has the accuracy of the MFCCs without noise and is better in noisy conditions. The HTK (Young, 1996) is used as the HMM pattern classifier.

The Centre for Speech Technology Research at the University of Edinburgh has introduced an innovative method of parametrisation. King and Taylor (King and Taylor, 2000) describe a linguistically motivated structural approach to continuous speech recognition based on a symbolic representation of distinctive phonological features. As part of further research, syllable classification using articulatory-acoustic features was conducted (M. Wester, 2003). The speech is firstly analysed using MFCCs, but is then parametrised using so-called multivalued features, namely: front-back (front, back, nil, silence), place of articulation (labial, labiodental, dental, alveolar, velar, glottal, high, mid, low, silence), manner of articulation (approximant, fricative, nasal, stop, vowel, silence), roundness (rounded, unrounded, nil, silence), static (static, dynamic, silence) and voicing (voiced, voiceless, silence). This is a parametrisation based strictly on classical phonology. The speech is represented by a sequence of symbolic matrices, each identifying a phone in terms of its distinctive phonological features. An NN was used for language modelling. The phonological approach is described in many other papers of the group. Methods of language modelling are also described, for example comparing the use of NNs and dynamic Bayesian networks (DBNs) for phonological feature recognition.
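A rough sketch of the common filter-bank scheme behind Eq. (2.3) and the steps above - triangular mel filters, log filter-bank energies and a DCT. The filter placement and frequency range are assumptions for this illustration, not the exact settings of any cited system.

import numpy as np
from scipy.fftpack import dct

def mel_filterbank(n_filters, n_fft, fs, f_min=64.0, f_max=4000.0):
    """Triangular, overlapping filters spaced uniformly on the mel scale of Eq. (2.2)."""
    mel = lambda f: 1000.0 * np.log2(1.0 + f / 1000.0)
    inv_mel = lambda m: 1000.0 * (2.0 ** (m / 1000.0) - 1.0)
    mel_points = np.linspace(mel(f_min), mel(f_max), n_filters + 2)
    bins = np.floor((n_fft + 1) * inv_mel(mel_points) / fs).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, centre, right = bins[i - 1], bins[i], bins[i + 1]
        fbank[i - 1, left:centre] = (np.arange(left, centre) - left) / max(centre - left, 1)
        fbank[i - 1, centre:right] = (right - np.arange(centre, right)) / max(right - centre, 1)
    return fbank

def mfcc_like(frame, fs, n_filters=20, n_coeffs=12):
    """MFCC-style features: power spectrum -> mel filter bank -> log -> DCT."""
    power = np.abs(np.fft.rfft(frame)) ** 2
    energies = mel_filterbank(n_filters, len(frame), fs) @ power
    return dct(np.log(energies + 1e-12), norm='ortho')[:n_coeffs]

if __name__ == "__main__":
    fs, frame = 8000, np.random.randn(256)
    print(mfcc_like(frame, fs))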
Yapanel and Dharanipragada (Yapanel and Dharanipragada, 2003) present a method based on the minimum variance distortionless response (MVDR) spectrum estimation and a trajectory smoothing technique, applied to reduce the variance in the feature vectors. The method is based on specially designed FIR filters and it aims at the statistical stability of spectrum estimation rather than at the spectral resolution limit. Reduction of bias and variance is of particular interest. The method was first described in 2001 and it differs from the classical MFCC solution by applying the technique outlined above as an additional block following windowing. In (Yapanel and Dharanipragada, 2003) additional perceptually modified autocorrelation estimates are obtained based on the PLP technique (Hermansky, 1990). The MVDR coefficients are calculated from these autocorrelation estimates. Thanks to incorporating perceptual information, the autocorrelation estimates are more reliable, because of perceptual smoothing of the spectrum, and the MVDR estimation is more robust. But this is not the only advantage of using such smoothing; additionally, the dimensionality of the MVDR estimation is reduced. As a result, the MVDR method is faster with such a modification. The authors named the method perceptual MVDR-based cepstral coefficients (PMCCs).

Farooq and Datta (Farooq and Datta, 2004) describe the possibility of using the DWT instead of the STFT to parametrise speech. The paper compares Daubechies wavelets of order 2, 6 and 20, and two sets of subbands with 6 and 8 bands. The method analyses 32 ms frames using 28 or 36 features (depending on the number of subbands). Linear discriminant analysis (LDA) with a Mahalanobis distance measure classifier was used for phoneme classification. Evaluation of the method is done with 52 MFCC features as a baseline. The method was evaluated under noiseless conditions and with noise. Vowels were found to be more difficult to recognise than fricatives and stops. In most cases the DWT method is superior to the MFCCs even though it uses fewer features.

The Speech Research Group at the University of Cambridge describes a 2003 CU-HTK large vocabulary speech recognition system for conversational telephone speech (CTS) (Evermann et al., 2004) which uses MFCCs as feature vectors. The system has a multi-pass, multi-branch structure. The multi-branch architecture combines results from a few separate, similar systems with different parameters by separate lattice rescoring. Based on the Levenshtein distance metric, different word sequences are generated in the branches instead of one best hypothesis. The output of all branches is combined using a system combination based on a confusion network. The CU-HTK CTS system consists of two main stages: lattice generation with adapted models and lattice multi-pass rescoring in multiple branches. Lattices restrict the search space in the subsequent rescoring stage. Additionally, the generation of lattices provides control for adaptation in each of the branches of the rescoring stage. In the lattice generation, the gain from performing the VTLN by warping the filter bank is very substantial. A multi-pass scheme is used for lattice generation. The first pass generates a transcription using heteroscedastic linear discriminant analysis (HLDA), minimum phone error (MPE) trained triphones and a word 4-gram language model (LM). Speakers gain their VTLN warp factors in this step.
The second pass uses MPE VTLN HLDA triphones to create small lattices. In the third and last pass they are used in lattice maximum likelihood linear regression (MLLR). Word lattices are generated with the word 4-gram LM interpolated with a class trigram. Speaker adaptive training (SAT) and single pronunciation dictionaries were used. A word-based 4-gram language model was trained on the acoustic transcriptions. That system seems to be the most complete, if not the only ready, complex academic solution for large vocabulary speech recognition.

Hifny et al. (Hifny et al., 2005) extend the classical HMM and MFCC solution using the maximum entropy (MaxEnt) principle to estimate posterior probabilities more efficiently. Entropy-measure information about acoustic constraints is used in an unbiased distribution to replace Gaussian mixture models. They use discriminative MaxEnt models for modelling acoustic variability, trained using the conditional maximum likelihood (CML) criterion, which maximises the likelihood of the empirical model estimated from the training data with respect to the hypothesised MaxEnt model. Exact parameters are numerically estimated using a modified version of the improved iterative scaling (IIS) algorithm. The difference lies in supporting constraints that may take negative values. The idea of the IIS is to use an auxiliary function bounding the change in divergence after each iteration. Parametric constraints model the high variability of the observed acoustic signal and do not rely on the assumption of a Gaussian distribution of the data, which is not strictly true in practical applications; such an assumption is present if acoustic features are used directly. Currently, in many fields, researchers are trying to overcome model dependence on Gaussian assumptions. In the opinion of the authors, the hybrid MaxEnt/HMM method, which uses MaxEnt modelling to estimate the posterior probabilities over the states, may replace hybrid ANN/HMM solutions, which are currently very popular. The experiments were conducted using MFCC features. The conclusion might be that in standard speech recognition solutions (MFCCs and an ANN/HMM model) there is a lack of use of entropy information. This conclusion corresponds very well to the paper (Misra et al., 2004), described earlier, which also points to the lack of use of entropy in existing solutions as a flaw. Both papers show that adding entropy to existing solutions improves them, one (Misra et al., 2004) for PLP and the other (Hifny et al., 2005) for MFCCs.

2.6.3 Test Corpora and Baselines

The lack of a standard baseline method and test corpus for speech recognition is an important issue. Information about the evaluation experiments published in the described research papers is presented below. It is easy to observe that the databases and baselines are often different and the provided information about them often covers different issues. It is very difficult to compare different methods of parametrisation if they are evaluated using different baselines and modelling. The Aurora2 database was used to evaluate the performance in (Zhu and Paliwal, 2004). The source speech is TIDigits, consisting of a connected digits task spoken by American English speakers sampled at 8 kHz. It contains clean and multi-condition training sets and three test sets. 39 parameters based on the MFCCs are used as a baseline and an HMM, not described in detail, as a language model. Aurora2 was also used to test SPLICE (Deng et al., 2005).
PMCCs (Yapanel and Dharanipragada, 2003) were evaluated using Aurora2 as well, and in addition an automotive speech recognition application was used. They were compared to MFCCs, PLP and standard MVDR. An HMM was used as the model. The tests in (Ishizuka and Miyazaki, 2004) were carried out on vowels from Japanese newspaper sentences spoken by a male speaker, and on the Japanese noisy digit recognition database Aurora2J. The HTK was used for feature classification and the standard 39 MFCCs as the baseline. The method of Misra et al. (Misra et al., 2004) was tested on the Numbers95 database of US English connected digits telephone speech. There are 30 words in the database, represented by 27 phonemes. Training was conducted on clean data. Noise from the Noisex92 database was added to the testing data. The PLP features are used as the baseline. There are 3330 utterances for training and 1143 utterances for testing. The HMM/ANN hybrid system was used in the experiments.

A very impressive amount of training and test data was used by the Cambridge Speech Research Group (Evermann et al., 2004). The training data consist of 296 hours of speech from the LDC (Switchboard I, Call Home English and Switchboard Cellular) plus 67 hours of Switchboard (Cellular and Switchboard II phase 2). Transcriptions were provided by MSState University for the LDC (carefully) and by a BBN commercial transcription service (quickly) for the additional 67 hours. Additionally, Broadcast News data (427M words of text) and 62M words of 'conversational texts' were collected from the Internet (www.ldc.upenn.edu/Fisher/). The paper (Hain et al., 2005) presenting the development of the AMI meeting transcription system describes and uses many speech corpora for evaluation: SWBD/CHE, Fisher, BBC-THISL, HUB4-LM96, SDR99-Newswire, Enron email, ICSI meeting, NIST, ISL and AMI. The last four are typical meeting corpora. Results for different corpora and their sizes are compared. It uses elements of the HTK for training and decoding.

The experiments of the Centre for Speech Technology Research at the University of Edinburgh (King and Taylor, 2000; M. Wester, 2003) were carried out on the Texas Instruments/Massachusetts Institute of Technology (TIMIT) database (read continuous speech from North American speakers). 3696 training utterances from 462 different speakers and 1344 test utterances from 168 speakers were used. 39 phone classes are used, instead of the original 61. The same database was used to evaluate the MaxEnt/HMM model (Hifny et al., 2005), with the same reduction of phone classes; 420 speakers were used for the training set. Farooq and Datta (Farooq and Datta, 2004) also evaluated their methods using the TIMIT database, using vowels (/aa/, /ax/, /iy/), unvoiced fricatives (/f/, /sh/ and /s/) and unvoiced stops (/p/, /t/ and /k/) from the dialect regions of New England and the northern part of the USA. Data from 114 speakers (including 37 females) was used for training and from 37 speakers (including 12 females) for testing. The fMPE (Deng et al., 2005) is evaluated using the DARPA EARS Rich Transcription 2004 conversational telephone speech recognition task. The baseline in this case is just the set of coefficients to which the fMPE is appended, with an HMM used as the model.

As has been said, there are two typical baselines for feature evaluation: the MFCC and the PLP. The first one is more popular. It has to be mentioned that in several papers other baselines are used, especially incomplete MFCC sets.
It makes comparing currently researched parametrisation methods a difficult task. Unfortunately, it is not the only problem. The HTK is the most typical tool for speech modelling, but not the only one, and it should be stressed that the HTK is an ongoing project with new versions available quite frequently. It can easily be imagined that different researchers use different versions, which are better or worse according to their date of release, but authors do not give any details about the version they are using. What is more, quite a few experiments are based on HMMs and HMM/ANN hybrid solutions other than the HTK (or the authors just do not give all the details), or on ANNs alone. Differences in the results of experiments can be caused by worse or better parametrisation as well as by changes in the model. One of the reasons why there is no standard test corpus might be that all of them are commercial, and it seems there is no satisfactory, free evaluation data set for speech recognition. This is an issue which prevents standardisation of tests. Another point is that ASR research is conducted for different languages, so variety is inevitable because of the language preferences of researchers. Still, the different sizes, complexity and variety of words in test corpora cause difficulties in comparing different approaches. To avoid such problems, there should be two freely available corpora: one of small vocabulary, like digits, mainly for fast tests during research, and the other of large vocabulary for final results.

Table 2.3: Comparison of the efficiency of the described methods. Asterisks mark methods appended to baselines (they could be used with most of the other methods). The methods without asterisks are new sets of features, different to the baselines.

Method | Comparison to MFCC | Comparison to PLP
MFPSCCs (Zhu and Paliwal, 2004) | 2% | ?
Ishizuka* (Ishizuka and Miyazaki, 2004) | 17% | ?
Phonological (King and Taylor, 2000; M. Wester, 2003) | (no straightforward comparison) | (no straightforward comparison)
Spectral Entropy* (Misra et al., 2004) | ? | 15%
DWT (Farooq and Datta, 2004) | 2% (52 MFCC) | ?
fMPE* (Deng et al., 2005) | ? | 13%
SPLICE* (Deng et al., 2005) | ? | 29%
PMCCs* (Yapanel and Dharanipragada, 2003) | 20% | 11%

2.6.4 Comparison of the Methods

It is very difficult to compare different methods for the reasons presented in the previous section. However, we tried to do at least an approximation of it. We compare methods according to the baselines which the authors gave, by presenting the average improvement in comparison to the baseline (Table 2.3). We do not see any way to compare methods with different baselines. The methods can be grouped in two categories. One of them covers basic features which replace the baseline (Zhu and Paliwal, 2004; Farooq and Datta, 2004; Hain et al., 2005). The other consists of elements appended to classical ones (Ishizuka and Miyazaki, 2004; Misra et al., 2004; Deng et al., 2005; Yapanel and Dharanipragada, 2003); these are marked by asterisks in Table 2.3. The first group gives less improvement. It has to be stressed that methods in the second group are additional elements and as such they may be used in connection with methods of the first group to give even better results. The phonological approach (M. Wester, 2003; King and Taylor, 2000) has not been compared with any baseline. Work on the phonological features continues and results have improved, but no clear comparison with the MFCCs or the PLP was found.
As one of the authors explained in an email conversation, the system is not ready for word recognition, and because a main reason for using articulatory features to mediate between the acoustic signal and words is to get around the problem of 'beads on a string' (describing words as a simple concatenation of phones), using phone error rate would be pointless.

New sets of features are not much better than the baselines. The largest improvement comes from adding extra elements and improving existing parametrisation. The methods marked with asterisks could give outstanding results if combined. However, some of them might be dependent on each other and in fact use the same information. The highest improvement among the reviewed methods is given by the SPLICE (Deng et al., 2005). Based on Yapanel's results (Yapanel and Dharanipragada, 2003) (the only method compared with both baselines) we can calculate that the PLP method gives around 8% improvement compared to the MFCCs. This evaluation depends on the database used in the experiment and the exact value is questionable. Still, it allows us to assume that the PLP is a slightly better method than the MFCCs.

2.7 Speech Modelling

Speech and language modelling is based on stochastic processes. To define them, let us assume the existence of a probabilistic space and an infinite number of random variables in the space. E is the space of process states, and T stands for the domain of a stochastic process. The set of random variables S(t) ∈ E such that S = {A(t), t ∈ T} is a stochastic process. A stochastic process S = {A(t), t ∈ T} is called a Markov process if it fulfils

P\{S(t_{n+1}) = s_{n+1} \mid S(t_n) = s_n, \ldots, S(t_1) = s_1\} = P\{S(t_{n+1}) = s_{n+1} \mid S(t_n) = s_n\} .   (2.4)

It means that a Markov process keeps a memory of only the last event; the whole future run of the process depends only on the current event. A Markov chain is a Markov process with a discrete space of states. The domain may be continuous or discrete. The concept of Markov chains can be extended to include the case where the observation is a probabilistic function of a state. The HMM is a doubly embedded stochastic process with an underlying stochastic process that is hidden and can only be observed through another set of stochastic processes that produce the sequence of observations. The HMM (Rabiner, 1989; Li et al., 2005) is a statistical model where the system being modelled is assumed to be a Markov process with unknown parameters, and the challenge is to determine the hidden parameters from the observable parameters, based on this assumption. Speech recognition systems are generally based on HMMs (Young et al., 2005) or hybrid solutions with ANNs (Young, 1996; Holmes, 2001). The statistical model gives the probability of an observed sequence of acoustic data by the application of Bayes' rule,

P(word \mid acoustic) = \frac{P(acoustic \mid word) \, P(word)}{P(acoustic)} ,   (2.5)

where P(acoustic|word) comes from an acoustic model, P(word) is given by a language model (or a combination of several language models) and P(acoustic) is used for normalisation purposes only, so it can be skipped as long as we deliver normalisation in another way or accept the fact that the final result is not a probability function, as it may not take values from 0 to 1 and the sum over all hypotheses is not equal to 1. We can easily accept this if we are interested only in the argument of the maximum of the result and we do not need proper probability values.
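A toy illustration (numbers invented for this sketch) of how the normalisation term in Eq. (2.5) can be dropped when only the argmax is needed:

# P(acoustic) is common to all word hypotheses, so it does not change the argmax.
acoustic_likelihood = {"cat": 0.020, "cut": 0.025, "cart": 0.010}   # assumed P(acoustic|word)
language_model      = {"cat": 0.60,  "cut": 0.30,  "cart": 0.10}    # assumed P(word)

scores = {w: acoustic_likelihood[w] * language_model[w] for w in language_model}
best = max(scores, key=scores.get)
print(best, scores[best])   # unnormalised score, not a proper probability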
The Bayes rule can be similarly applied to phonemes, words, and syntactic and semantic information. Introducing an additional hidden dynamic state gives a model of spatial correlations and leads to better results (Frankel and King, in press). The HMM is very popular but there are some other approaches to language modelling. One of them is the support vector machine (SVM), a classifier that estimates decision surfaces directly rather than modelling a probability distribution across the training data. As the SVM cannot model temporal speech structure efficiently, it is best used in a hybrid solution with the HMM (Ganapathiraju et al., 2004). Another model which has started to be popular in speech recognition is based on dynamic Bayesian networks (DBNs) (Wester et al., 2004; Frankel and King, 2005). Typical Bayes nets are directed acyclic graphs where each node represents a random variable. Missing edges imply conditional independence, which is used to factor the joint distribution of all random variables into a set of simpler probability distributions. DBNs consist of instances of Bayesian networks repeated over time, with dependencies across time. DBNs were proposed as a model for articulatory feature recognition. In a classical HMM framework, parameters are obtained by the maximum likelihood approach. Variational Bayesian estimation and clustering (Watanabe et al., 2004) is another approach; it does not use maximum likelihood parameters but a posterior distribution. There are other models (Venkataraman, 2001; Ma and Deng, 2004; Wester, 2003) for modelling acoustic parameters or elements of language. In all models we have to make many assumptions, like statistical dependence and independence (King, 2003). One has to be very careful not to make a simplification which might result in a wrong model. Another issue is the training process of a model. The most popular algorithms are based on the forward-backward procedure (Rabiner, 1989; X. Huang, 2001) for evaluation of an HMM, the Viterbi algorithm (Rabiner, 1989; Viterbi, 1967; Forney, 1973) for decoding an HMM and Baum-Welch for estimating HMM parameters (Rabiner, 1989; X. Huang, 2001). All of them need human supervision and might be quite costly in time. There are also methods based on active learning (Riccardi and Hakkani-Tür, 2005) in which applying adaptive learning may cut down the need for supervision.

2.8 Natural Language Modelling

Analysing semantic and syntactic content is one of the topics of NLP (Manning, 1999). Words can be connected in a large number of ways, including: by relations to other words, in terms of decomposition into semantic primitives, and in terms of non-linguistic cognitive constructs (perception, action and emotion). There are hierarchical and non-hierarchical relations. Some hierarchical relations are: is-a (a tree is a plant), has-a (a computer has a screen), and relations for scales of degree. Non-hierarchical relations include synonyms and antonyms. There are some word affinities and disaffinities in the semantic relations regarding the expressed concept. They are difficult to describe in a mathematical way but may be exploited by speech recognition systems. A crucial problem is the context-dependent meaning of words. For example, 'bank' can mean the bank of a river or a bank in which to keep money. Authors of dictionaries try to identify distinct senses of entries, but it is very difficult to put an exact boundary between the senses of a word and to disambiguate senses in practical contexts. Another problem is that natural languages are not static.
Some additional meanings of words can change quite frequently (X. Huang, 2001). Language regularities are very often modelled by n-grams (X. Huang, 2001). Let us assume a word string W consisting of n words w_1, w_2, w_3, ..., w_n. P(W) is a probability distribution over word strings W that reflects how often W occurs. It can be decomposed as

P(W) = P(w_1) \, P(w_2 \mid w_1) \, P(w_3 \mid w_1, w_2) \cdots P(w_n \mid w_1, \ldots, w_{n-1}) .   (2.6)

For calculation time reasons, the dependence is limited to a few words backwards. Probably the most popular are trigram models, using P(w_i | w_{i-2}, w_{i-1}), as the dependence on the previous two words is very strong, while the model complexity is not very high. Such models need statistics collected over a vast amount of text. It means that many dependencies can be averaged out. Adaptive language models (Bellegarda, 1997; Jelinek et al., 1991; Mahajan et al., 1999) deal with this flaw by a semantic approach to n-grams. Several different models can be created for different topics and different types of texts, organised in a domain- or topic-clustered language model. A system then detects the topic of the recognised text and uses the cluster of the n-gram model associated with this topic. It is possible to combine several clusters at once and to change the topic during recognition of different parts of the same text. Latent semantic indexing (Bellegarda, 2000) improves the traditional n-gram model by searching for co-occurrences across much larger spans, regarding semantic roles rather than simple word distance.

We are mainly interested in lexical semantics, which is the study of systematic, meaning-related structures of individual words. This field shows how ambiguous natural language might be. We will start by defining typical semantic notions (Jurafsky and Martin, 2000). A lexeme is an individual entry in the lexicon. It corresponds to a word but has a stricter meaning - a pairing of a particular orthographic and phonological form with some form of symbolic meaning representation - a sense. In most traditional dictionaries lexeme senses are surprisingly circular - blood may be defined as the red liquid flowing in veins, and red as the colour of blood. The usage of such structures is possible only if a user has some basic knowledge about the world and meanings. Computers and artificial intelligence do not have it. This is why avoiding this circularity was one of the main issues in creating the lexical database WordNet (Fellbaum, 1999). It contains three separate databases: for nouns, for verbs, and a third for adjectives and adverbs. WordNet is based on relations among lexemes. Homonymy is a relation that holds between words that have the same form with unrelated meanings. The items with such a relation are homonyms. Words with the same pronunciation and different spelling are homophones. In contrast, homographs have the same orthographic form but different sounds. Polysemy is the occurrence of multiple related meanings within a single lexeme. So we can say that a bank of a river and a bank to keep money in are rather homonyms, while a blood bank and a bank to keep money in are rather polysemes. Obviously, the distinction between homonymy and polysemy is not always clear-cut. A Polish example of homonymy is the two meanings of the word 'zamek': the first one is a castle and the second one is a lock. We can typically separate them by investigating lexeme history and etymology (origin). A bank to keep something in has an Italian origin, while a bank of a river has a Scandinavian one.
Synonymy is defined as the coexistence of different lexemes with the same meaning, which also leaves many open questions. An example of synonyms is the Polish words 'kolor' and 'barwa'. The first one means colour and the second one might be translated as hue, but in Polish it can easily replace the first one. Hyponymy is a relation between a pair of lexemes with similar but not identical senses, in which the sense of one is subordinate to the sense of the other. There are several problems with applying semantic analysis. The first of them is the use of metaphors. They are especially common in literature, but also in spoken language and sometimes even in documents. Words and phrases used to present completely different kinds of concepts than their lexical senses are a serious challenge. Metonymy is a related issue; it means using lexemes to denote concepts by naming some other related concept. We can use the word 'kill' to describe stopping some process in a more dramatic way, as in 'killing' processes in Linux or 'killing the sale of a rival company'. Finally, a problem is that existing semantic algorithms are dedicated to written text, which is expected to be correct. Spoken language is characterised by a higher level of mistakes and abbreviations, while a user expects a transcription produced by a speech recogniser to be of written-text quality.

There is very little research on semantic analysis for ASR, but there are some other fields which might be useful in our research, like word sense disambiguation (Banerjee and Pedersen, 2003) and automatic hypertext construction (Green, 1999). One of the interesting issues is topic signatures. Experiments show that it is possible to accurately approximate the link distance between synsets (a semantic distance based on the internal structure of WordNet) with topic signatures (Agirre et al., 2001, 2004). Clean signatures can be constructed from the WWW using filtering techniques like ExRetriever and Infomap (Cuadros et al., 2005). There are several methods of measuring the relatedness of concepts in WordNet. The Similarity package provides six measures of similarity (Pedersen et al., 2004). The lch measure searches for the shortest path between two concepts and scales it. The wup measure finds the path length to the root (shared ancestor) node from the least common subsumer of the measured concepts. The path measure equals the inverse of the shortest path length between two concepts. The res, lin and jcn measures are based on information content - a corpus-based measure of the specificity of a concept. The package also contains three measures of relatedness (Pedersen et al., 2004). The hso measure classifies relations as having direction, so it is path based. The lesk and vector measures use the text of the gloss (definition) of a concept as its representation (Banerjee and Pedersen, 2003). This can be realised by counting shared words in glosses. Strings containing several words carry much more information, in line with entropy theory, so the score for an overlapping string is the number of its neighbouring words raised to the second power. If several strings are shared, their scores are summed. Glosses of related senses can also be used to improve accuracy. There are other semantic similarity measures as well, like (Seco et al., 2004), which is based on the hierarchical structure only. Semantic similarity can also be measured using Roget's Thesaurus instead of WordNet (Jarmasz and Szpakowicz, 2003). That method is based on calculating all paths between two words using Roget's taxonomy. Semantic analysis can improve the quality of ASR results.
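As a rough illustration only, similar path-based measures are exposed by NLTK's WordNet interface (using NLTK here is an assumption of this sketch; the cited works use the Perl WordNet::Similarity package):

from nltk.corpus import wordnet as wn    # requires the WordNet corpus data to be installed

bank_river = wn.synset('bank.n.01')      # sloping land beside a body of water
bank_money = wn.synset('bank.n.02')      # financial institution
blood = wn.synset('blood.n.01')

for a, b in [(bank_river, bank_money), (bank_money, blood)]:
    print(a.name(), 'vs', b.name(),
          'path:', a.path_similarity(b),   # inverse of the shortest path length
          'wup:', a.wup_similarity(b),     # based on the least common subsumer
          'lch:', a.lch_similarity(b))     # scaled shortest path (same POS only)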
This is the highest information level in the linguistic model. Semantics deals with the study of meaning, including the ways meaning is structured in language and changes in meaning and form over time. The majority of the latest papers describing a general speech recognition scheme include semantic analysis. But there is no working system (known to the author) using lexical semantics and there is little research on applying any semantic analysis to speech recognition. Semantic analysis is much more often used in written text analysis to retrieve information. There are two main approaches (X. Huang, 2001). The first is based on semantic roles:

• agent - cause or initiator of the action
• patient - undergoer of the action
• instrument - how the action is accomplished
• goal - to whom the action is directed
• result - result of the action
• location - location of the action

We can predict the localisation and order of different semantic roles in sentences. Some of them have to be present, others are optional. We can also associate exact words with a few roles. It allows us to detect a wrong structure of recognised text. Such semantic analysis can be used in speech recognition (Bellegarda, 2000) instead of n-gram models. The other approach is by lexical semantics. Some words very often go together in texts. Others appear close to each other very rarely (Agirre et al., 2001, 2004). Such statistics have already been collected, for example in the semantic dictionary WordNet (Fellbaum, 1999). Words create a set of trees and the number of branches between two nodes may stand for their semantic closeness. There are other possible measures as well. It is possible to detect words which do not fit the general semantic content of a recognised hypothesis.

2.9 Semantic Modelling

It is not efficient to recognise speech using acoustic information only. The human perception system is based on catching context, structure and understanding, combined with the recognition procedure. It is much easier for a human being to recognise and repeat without any errors a heard sentence if it is in a language we understand, compared to a sentence in a language we are not familiar with, which is just a sequence of sounds. Similarly, it is much easier to recognise sentences in a familiar domain or topic than sentences from an unfamiliar context. Language modelling can improve recognition greatly. Semantic analysis can be done in many different ways and has been applied to ASR already. However, this kind of modelling is difficult due to the data sparsity problem. The ASR literature always mentions semantic analysis as a necessary step, but it is very difficult to find any research papers which provide any exact recognition results when applying semantic methods.

Latent semantic analysis (LSA) (Bellegarda, 1997, 1998; T. Hofmann, 1999) is an NLP technique patented in 1988. It assumes that the meaning of a small part of text, like a paragraph or a sentence, can be approximated by the sum of the meanings of its words. LSA uses a word-paragraph matrix which describes the occurrences of words in topics. It is a sparse matrix whose rows correspond to topics and whose columns typically correspond to words that appear in the topics. The elements of the matrix are proportional to the number of times the words appear in each document, where rare words are upweighted to reflect their relative importance. LSA is performed by using singular value decomposition (SVD). LSA has already found a few applications.
One of them is automatic essay and answer grading (Kakkonen et al., 2006; Kanejiya et al., 2003). LSA can also be used in modelling global word relationships for junk e-mail filtering or pronunciation modelling (Bellegarda, 80). Another possible application is word completion (Miller and Wolf, 2006). LSA can be combined with the n-gram model (Coccaro and Jurafsky, 1998; Grönqvist, 2005) or a maximum entropy model (Deng and Khudanpur, 2003). LSA can also be applied to bigrams of words in topics rather than single words (Y.-C. Tam, 2008). It is more difficult to train such a model, but it can improve results if combined with a regular LSA model. There are other methods of analysing semantic information, like topic signatures (Agirre et al., 2001, 2004) and maximum entropy language models (Khudanpur and Wu, 1999; Wu and Khudanpur, 2000). The idea of topic signatures is to store concepts in context vectors. There are simple methods to acquire them automatically for any concept hierarchy. They were used to approximate link distances in WordNet. Maximum entropy language models combine dependency information from sources like syntactic relationships, topic cohesiveness and collocation frequency. They evolved from n-grams. The difference is that they store not only n words but also other information, like the n preceding exposed head-words of the syntactic partial parse, n non-terminal labels of the partial parse and a topic.

2.10 Academic Applications

There are a few academic applications of speech recognition. We list some of them in Table 2.4.

Table 2.4: Speech recognition applications available on the Internet
HTK (Young, 1996; Evermann et al., 2004) - htk.eng.cam.ac.uk
Edinburgh Speech Tools - www.cstr.ed.ac.uk/projects/speech tools
SPRACH (Hermansky, 1990) - www.icsi.berkeley.edu/dpwe/projects/sprach/sprachcore.html
AMI (Hain et al., 2005) - www.amiproject.org/business/index.htm
CMU Sphinx (Lamere et al., 2004) - cmusphinx.sourceforge.net/html/cmusphinx.php
CMU Let's go (Eskenazi et al., 2008) - http://www.speech.cs.cmu.edu/letsgo/
Snorri - www.loria.fr/ laprie
Snack Speech Toolkit - http://www.speech.kth.se/snack/
Praat (Boersma, 1996) - www.fon.hum.uva.nl/praat/
CSLU OGI Toolkit - http://cslu.cse.ogi.edu/toolkit/
Sonic ASR - cslr.colorado.edu/beginweb/speech recognition/sonic.html

Edinburgh Speech Tools is not a complete ASR system but rather a toolbox for speech analysis with many elements useful in speech recognition, for example an n-gram language model. SPRACH is a full package based on the PLP, including for example ANN training and recognition, feature calculation, sound file manipulation, plus all the GUI components and tools. The AMI project targets computer-enhanced multi-modal interaction in the context of meetings, including ASR. The CMU Sphinx Group (Lamere et al., 2004) offers packages for applications using speech, very useful for speech modelling in ASR. CMU also provides a spoken language system, Let's Go (Eskenazi et al., 2008), which includes ASR. Snorri is dedicated to assisting researchers in the fields of ASR, phonetics, perception and signal processing. Similar opportunities are provided by the Snack Sound Toolkit, which uses scripting languages like Python. Praat (Boersma, 1996) covers speech analysis, labelling, segmentation and learning algorithms. The CSLU OGI Toolkit helps in building interactive language systems for human-computer interaction. SONIC is the speech recogniser developed by the University of Colorado.
It is available only for registered and accepted persons. The HTK (Young et al., 2005) is a toolkit using HMMs, mainly for ASR research. Speech synthesis, character recognition and DNA sequencing research are its other applications. We used version 3.3 in our research. HTK consists of many modules and tools, all of which are available in C source form. The HTK provides facilities for speech analysis, HMM training, testing and results analysis. The system fits every recognition hypothesis to one of the elements of a dictionary provided by the user, comparing it with the phonetic transcriptions of words. The toolkit supports HMMs using both continuous density Gaussian mixtures and discrete distributions. HTK was originally developed at the Machine Intelligence Laboratory of the Cambridge University Engineering Department (CUED). It was sold to Entropic Research Laboratory Inc. and later to Microsoft. Currently it is licensed back to CUED and under permanent development.

Chapter 3

Linguistic Aspects of Polish

English is the most common language of ASR research, with Chinese and Japanese as two other common languages. This thesis is focused on ASR of Polish, which is the most commonly spoken Slavic language in the EU and one of the most common inflective languages. There has been rather little research and there is no working continuous Polish ASR system. To create such a system, successes in other languages have to be drawn upon. As Polish and English are languages of the same Indo-European group, we focused on existing solutions for English ASR. There are some differences between these languages which have a larger or smaller impact on ASR. These differences should result in some variations in the algorithms.

3.1 Analysis of Polish from the Speech Recognition Point of View

We searched for differences between English and Polish which seem to be important in ASR. It is important to consider linguistic aspects while designing an ASR system.

• English has a large number of homophones. What is more, many combinations of different words have similar pronunciation. Polish has fewer homophones.

• The pronunciation of vowels in English is very similar. If a vowel is not stressed it is usually reduced to a schwa-like vowel, and the reduced vowels have quite similar sounds and spectra. It means that unstressed vowels are almost indistinguishable in English. This contrasts with Polish.

• Modern English has emerged as a mixture of around thirty languages. This resulted in quite simple general rules (which was necessary for the language to be widely accepted by different people) but many irregularities (as a kind of residue), especially in pronunciation. Modern Polish is strongly based on Latin. Contrary to English, this resulted in very complicated grammar rules and morphology but quite few irregularities in pronunciation.

• English is a positional language, while Polish is an inflectional one. The meaning of a word in English depends strongly on the position of the word in a sentence. In Polish the position is of secondary importance; the exact meaning of a word depends mainly on morphology. For example, in English the sentences 'Mike hit Andrew' and 'Andrew hit Mike' mean quite different things. In Polish (using similar Polish names) 'Michał uderzył Andrzeja', 'Michał Andrzeja uderzył', 'Andrzeja Michał uderzył' and 'Andrzeja uderzył Michał' are all acceptable and mean almost the same. However, all but the first stress some part of the information and sound quite strange without a special context.
To identify the person who hit and the person who was hit, we have to use a different ending: 'Andrzej uderzył Michała'. It means that the usage of syntax modelling is very difficult for Polish and possibly not as necessary as for English. On the other hand, analysing morphology seems to be crucial in the case of ASR for Polish.

• In English, conjugation and declension are relatively simple and adjectives do not need any type of agreement. In Polish there are groups of different ways of conjugation and declension. Each verb typically has different forms for each combination of gender (there are three basic genders in Polish, although linguists distinguish 8 categories), person and singular or plural number. Each noun has 7 forms (cases) depending on its position and relation to other words in the sentence. Adjectives and numerals agree with the nouns they describe. There is no general rule of word agreement, like adding 's' or 'es' in English. Different groups of words have their own types of endings. Verbs have 47 inflection forms (excluding participles), adjectives 44, numerals up to 49, adverbs 3, nouns and pronouns 14. A single word in Polish may have even several hundred derived forms which are topically correlated (for example some verbs have almost 200 forms, including conjugation of participles and perfect and imperfect forms). This fact makes building a full dictionary of the Polish language for an ASR system very difficult. Even if it is possible, its size may cause very serious delays in the operation of the ASR system.

• English is well known to have a vast vocabulary. This is due to the large number of dialects and versions of English situated all around the world. Another reason is that English is a mixture of several languages, so there are words which mean almost the same but came from different sources. The Polish vocabulary seems to be smaller in this respect.

• Polish has a few phonemes which are rare in other languages and do not exist in English. They sound very different from other phonemes. More specifically, they contain much higher frequencies and sound to non-Polish speakers almost like rustles or hums. These phonemes are very easily detectable and, as such, can additionally be used as a kind of boundary between blocks of other phonemes.

3.2 Triphone Statistics of Polish Language

Statistical linguistics at the word and sentence level has been considered for several languages Agirre et al. (2001); Bellegarda (2000). However, similar research on phonemes is rare Denes (1962); Yannakoudakis and Hutton (1992); Basztura (1992). The frequency of appearance of phonetic units is an important topic in itself for every language. It can also be used in several speech processing applications, for example modelling in LVCSR, or coding and compression. Models of triphones which are not present in the training corpus of a speech recogniser can be prepared using phonetic decision trees Young et al. (2005). The list of possible triphones has to be provided for a particular language, along with a categorisation of phonemes. The triphone statistics can also be used to generate hypotheses used in recognition of out-of-dictionary words, including names and addresses.

3.3 Description of a problem solution

The problem is to find triphone statistics for the Polish language. Our first attempt at this task has already been published Ziółko et al. (2007). The task was conducted on a corpus containing mainly Parliament transcriptions, which amounts to around 50 megabytes of text.
It was repeated on Mars, a Cyfronet computer cluster, for data of around 2 gigabytes. Context-dependent modelling can significantly improve speech recognition quality. Each phoneme varies slightly depending on its context, namely neighbouring phonemes due to a natural phenomena of coarticulation. It means that there are no clear boundaries between phonemes and they overlap each other. It results in interference of acoustical properties. Speech recognisers based on triphone models rather than phoneme ones are much more complex but give better results Young (1996). Let us present examples of different ways of transcribing word above. Phoneme model is ax b ah v while the triphone one is *-ax+b ax-b+ah b-ah+v ah-v+*. In case a specific triphone is not present, it can be replaced by a phonetically similar triphone (phonemes of the same phonetic group interfere in similar way with their neighbours) using phonetic decision trees Young et al. (2005) or diphones (applying only left or right context) Rabiner and Juang (1993). 3.4 Methods, software and hardware Sophisticated rules and methods are necessary to obtain the phonetic information from an orthographic text-data. Simplifications could cause errors Ostaszewska and Tambor (2000). Transcription of text into phonetic data was applied first by PolPhone Demenko et al. (2003). The extended SAMPA phonetic alphabet was applied with 39 symbols (plus space) and pronunciation rules for cities Poznań and Kraków. We used our own digit symbols corresponding to SAMPA symbols, instead of typical ones, to distinguish phonemes easier while analysing received phonetic transcriptions. Stream editor (SED) was applied to change original phoneme transcriptions into digits with the following script: s/##/#/g s/w∼/2/g s/dˆz/6/g s/tˆs’/8/g s/s’/5/g s/tˆS/0/g s/dˆz’/X/g s/z’/4/g s/dˆZ/9/g s/j∼/1/g s/tˆs/7/g s/n’/3/g . Statistics can now be simply collected by counting the number of occurrences of each phoneme, phoneme pair, and phoneme triple in an analysed text, where each phoneme is just a symbol (single letter or a digit). Matlab was used to analyse the phonetic transcription of the text corpora. The CHAPTER 3. LINGUISTIC ASPECTS OF POLISH 49 Table 3.1: Phonemes in Polish (SAMPA Demenko et al. (2003)) SAMPA example # a e o t r n i j I v s u p m k d l n’ z w f g tˆs b x S s’ Z tˆS tˆs’ w∼ c dˆz’ N dˆz J z’ j∼ dˆZ pat test pot test ryk nasz PIT jak typ wilk syk puk pik mysz kit dym luk koń zbir łyk fan gen cyk bit hymn szyk świt żyto czyn ćma cia̧ża kiedy dźwig pȩk dzwoń giełda źle wiȩź dżem transcr. # pat test pot test rIk naS pit jak tIp vilk sIk puk pik mIS kit dIm luk kon’ zbir wIk fan gen tˆsIk bit xImn SIk s’vit ZIto tˆSIn tˆs’ma ts’ow∼Za cjedy dˆz’vik peNk dˆzvon’ Jjewda z’le vjej∼s’ dˆZem occurr. 
283 296 436 151 160 947 146 364 208 141 975 325 68 851 605 68 797 073 68 056 439 67 212 728 61 265 911 58 930 672 58 247 951 54 359 454 51 503 621 51 228 649 48 760 010 44 892 420 44 406 412 40 189 121 34 092 610 30 924 282 30 194 178 25 308 167 24 910 462 24 789 080 24 212 663 21 407 209 20 756 164 17 220 321 16 409 930 15 429 711 11 945 381 10 814 216 10 581 296 9 995 596 4 880 260 4 212 857 3 680 888 3 390 372 1 527 778 693 838 % 15.256 8.141 7.882 7.646 3.708 3.705 3.665 3.620 3.299 3.174 3.137 2.927 2.774 2.759 2.626 2.418 2.391 2.164 1.84 1.665 1.626 1.363 1.341 1.335 1.304 1.153 1.118 0.927 0.884 0.831 0.643 0.582 0.570 0.538 0.262 0.227 0.198 0.183 0.082 0.037 % Basztura (1992) 4.7 9.7 10.6 8.0 4.8 3.2 4.0 3.4 4.4 3.8 2.9 2.8 2.8 3.0 3.2 2.5 2.1 1.9 2.4 1.5 1.8 1.3 1.3 1.2 1.5 1.0 1.9 1.6 1.3 1.2 1.2 0.6 0.7 0.7 0.1 0.2 0.1 0.2 0.1 0.1 CHAPTER 3. LINGUISTIC ASPECTS OF POLISH 50 Phoneme classes Phonemes d^Z j~ z’ J d^z N d^z’ c w~ t^s’ t^S Z s’ S x b t^s g f w z n’ l d k m p u s v y j i n r t o e a # 0 2 4 6 8 Occurrences [%] 10 12 14 16 Figure 3.1: Phonemes in Polish in SAMPA alphabet calculations were conducted on Mars in Cyfronet, Krakow. We analysed more than 2 gigabytes of data. Text data for Polish are still being collected and will be included in the statistics in the future. Mars is a cluster for calculations with following specification: IBM Blade Center HS21 - 112 Intel Dual-core processors, 8GB RAM/core, 5 TB disk storage and 1192 Gflops. It operates using Red Hat Linux. Mars uses Portable Batch System (PBS) to queue tasks and split calculation power to optimise times for all users. A user have to declare expected time of every task. In example, a short time is up to 24 hours of calculations and a long one is up to 300 hours. Tasks can be submitted by simple commands with scripts and the cluster starts particular tasks when calculation resources are available. One process needs around 100 hours to analyse 45 megabytes text file. 3.4.1 Grapheme to Phoneme Transcription Two main approaches are used for the automatic transcription of texts into phonemic forms. The classical approach is based on phonetic grammatical rules specified by human Steffen-Batóg and Nowakowski (1993) or machine learning process Daelemans and van den Bosch (1997). The second solution utilises graphemic-phonetic dictionaries. Both methods were used in PolPhone to cover typical and exceptional transcriptions. Polish phonetic transcription rules are relatively easy CHAPTER 3. LINGUISTIC ASPECTS OF POLISH 51 to formalise because of their regularity. The necessity of investigating large text corpus pointed to the use of the Polish phonetic transcription system PolPhone Jassem (1996); Demenko et al. (2003). In this system, strings of Polish characters are converted into their phonetic SAMPA representations. Extended SAMPA (Table 3.1) is used, to deal with nuances of Polish phonetic system. The transcription process is performed by a table-based system, which implements the rules of transcription. Matrix T ∈ S m×n is a transcription table, where S is a set of strings and the cells meet the requirements listed precisely in Demenko et al. (2003). The first element t1,1 of each table contains currently processed character of the input string. For every character (or character substring) one table is defined. The first column of each table {ti,1 }m i=1 contains all possible character strings that could precede currently transcribed character. 
The first row {t1,j }nj=1 contains all possible character strings that can proceed a currently transcribed character. All possible phonetic transcription results are stored in the remaining cells {ti,j }m,n i=2,j=2 . A particular element ti,j is chosen as a transcription result, if ti,1 matches the substring preceding t1,1 and t1,j matches the substring proceeding t1,1 . This basic scheme is extended to cover overlapping phonetic contexts. If more then one result is possible, then longer context is chosen for transcription, which increases its accuracy. Exceptions are handled by additional tables in the similar manner. Specific transcription rules were designed by a human expert in an iterative process of testing and updating rules. Text corpora used in design process consisted of various sample texts (newspaper articles) and a few thousand words and phrases including special cases and exceptions. 3.4.2 Corpora Used Several newspaper articles in Polish were used as input data in our experiment. They are from Rzeczpospolita newspaper from years 1993-2002. They cover mainly political and economic issues, so they contain quite many names and places including foreign ones, what may influence the results slightly. In example, q appeared once, even though it does not exist in Polish. In total, 879 megabytes (103 655 666 words) were included in the process. Several hundreds of thousands of Internet articles in Polish made another corpus. They are all from a high quality website, where all content is reviewed and controlled by moderators. They are of encyclopedia type, so they also contain many names including foreign ones. In total, 754 megabytes (96 679 304 words) were included in the process. The third corpus consists of several literature books in Polish. Some of them are translations from other languages, so they also contain foreign words. The corpus includes 490 megabytes (68 144 446 words) of text. 3.4.3 Results The total number of around 1 856 900 000 phonemes were analysed. They are grouped into 40 categories (including space). Actually, one more, namely q, was detected, which appeared in a foreign name. Since q is not a part of the Polish alphabet, it was not included in the phoneme distribution presented in Table 3.1. Space (noted as #) frequency was 15.26 %. An average number CHAPTER 3. LINGUISTIC ASPECTS OF POLISH 52 The probability of transition [%] # a e o t r n i j y v s u p m k d l n’ z w f g t^s b x S s’ Z t^S ^s’ w~ c d^z’ N d^z J z’ j~ d^Z # a e o t r n i j y v s u p m k d l n’ z w f g t^s b x S s’ Z t^S t^s’w~ cd^z’N d^z J z’ j~d^Z Second phoneme classes Figure 3.2: Frequency of diphones in Polish (each phoneme separately) of phonemes in words is 6.6 including one space. Exactly 1 271 different diphones (Fig. 3.2 and Table 3.2) for 1 560 possible combinations were found, which constitutes 81%. 21 961 different triphones (see Table 3.3) were detected. Combinations like *#*, where * is any phoneme and # is a space were removed. These triples should not be considered as triphones because the first and the second * are in two different words. The list of the most common triphones is presented in Table 3.3. Assuming 40 different phonemes (including space) and subtracting mentioned *#* combinations, there are 62 479 possible triples. We found 21 961 different triphones. It leads to a conclusion that around 35% of possible triples were detected as triphones, the very most of them at least 10 times. 
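The statistics described above come down to counting symbol n-grams in the phonetic transcription. The following sketch illustrates this counting step in Python rather than the SED/Matlab pipeline actually used; it assumes the corpus has already been transcribed into SAMPA with the digit substitutions of the script above (one character per phoneme, '#' marking word boundaries), and the function name and example word are only illustrative.

from collections import Counter
import math

def phoneme_ngram_counts(phonemes, max_n=3):
    """Count phonemes, diphones and triphones in a phonetic transcription.
    `phonemes` is a string with one character per phoneme and '#' marking
    word boundaries."""
    counts = {n: Counter() for n in range(1, max_n + 1)}
    for n in range(1, max_n + 1):
        for i in range(len(phonemes) - n + 1):
            gram = phonemes[i:i + n]
            # Skip *#* triples: their outer phonemes belong to two different words.
            if n == 3 and gram[1] == '#':
                continue
            counts[n][gram] += 1
    return counts

# Example: the word 'osiem' (SAMPA o s' e m, i.e. 'o5em' after digit substitution).
counts = phoneme_ngram_counts('#o5em#')
total = sum(counts[1].values())
for phoneme, occ in counts[1].most_common():
    print(f'{phoneme}  {occ}  {100.0 * occ / total:.3f} %')
# The same unigram counts give the phoneme entropy defined in (3.1) at the end of this section.
entropy = -sum((c / total) * math.log2(c / total) for c in counts[1].values())

Run over the whole transcribed corpus, the three counters correspond directly to the phoneme, diphone and triphone statistics reported in Tables 3.1-3.3.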
Young Young (1996), estimates that in English, 60-70% of possible triples exist as triphones. However, in his estimation there is no space between words, what changes the distribution a lot. Some triphones may not occur inside words but may occur at combinations of an end of one word and the beginning of another. We started to calculate such statistics without an empty space as the next step of our research. It is also expected that there are different numbers of triphones for different languages. Some values are similar to statistics given by Jassem a few decades ago and reprinted in Basztura (1992). We applied computer clusters so our statistics were calculated for much more data and they are more represantative. Fig. 3.2 shows some symmetry but the probability of diphone αβ is usually different than CHAPTER 3. LINGUISTIC ASPECTS OF POLISH 53 Figure 3.3: Space of triphones in Polish probability of βα. The mentioned quasi symmetry results from the fact that high values of α probability and (or) β probability often gives high probability of products αβ and βα as well. Similar effects can be observed for triphones. Data presented in this paper illustrate the wellknown fact that probabilities of triphones (see Table 3.3) cannot be calculated from the diphone probabilities (see Table 3.2). The conditional probabilities between diphones have to be known. Besides the frequency of triphone occurrence, we are also interested in distributions of their frequencies. This is presented in logarithmic scale in Fig. 3.4. We received another distribution than in the previous experiment Ziółko et al. (2007) because larger number of words were analysed. We have found around 500 triphones which occurred once and around 300 which occurred two or three times. Then every occurrence up to 10 happened for 100 to 150 triphones. It supports a hypothesis that one can reach a situation, when new triphones do not appear and a distribution of occurrences is changing as a result of more data being analysed. Some threshold can be set and the rarest triphones can be removed as errors caused by unusual Polish word combinations, acronyms, slang and other variations of dictionary words, onomatopoeic words, foreign words, errors in phonisation and typographical errors in the text corpus. Entropy H=− 40 X p(i) log2 p(i), (3.1) i=1 where p(i) is a probability of a particular phoneme, is used as a measure of the disorder of a lin- CHAPTER 3. LINGUISTIC ASPECTS OF POLISH 54 Table 3.2: Most common Polish diphones log(occurrences of a triphone) diphone e# a# #p je i# o# #v y# na #s po #z ov st n’e #o #t ra #m ro #d m# no. of occurr. 43 557 832 38 690 469 31 014 275 28 499 593 24 271 474 23 552 591 20 678 007 19 018 563 18 384 584 17 321 614 16 870 118 16 619 556 16 206 857 15 895 694 14 851 771 14 104 742 13 910 147 13 713 928 13 657 073 13 597 891 13 103 398 12 968 346 % 2.346 2.084 1.671 1.535 1.307 1.269 1.114 1.024 0.990 0.933 0.909 0.895 0.873 0.856 0.800 0.760 0.749 0.739 0.736 0.732 0.706 0.698 diphone on #k ta #n va ko #i aw u# #f #b #r ja ar x# do er te #j v# #a to no. of occurr. 
12 854 255 12 529 124 12 449 178 12 316 393 11 413 878 11 168 294 10 515 253 10 514 514 10 379 234 10 265 162 10 167 482 10 137 129 10 097 444 9 818 127 9 811 211 9 779 666 9 724 692 9 618 998 9 398 210 9 251 288 9 143 021 9 043 529 % 0.692 0.675 0.671 0.663 0.615 0.602 0.566 0.566 0.559 0.553 0.548 0.546 0.544 0.529 0.528 0.527 0.524 0.518 0.506 0.498 0.492 0.487 8 7 6 5 4 3 2 1 0 0 0.5 1 1.5 Triphones Figure 3.4: Phoneme occurrences distribution 2 2.5 4 x 10 CHAPTER 3. LINGUISTIC ASPECTS OF POLISH triphone #po #na n’e# na# ow∼# #do #za ej# je# #pS go# #i# ego ova vje #v# #je #n’e sta #s’e yx# #vy s’e# pSe e#p #f# em# #pr #ko a#p ci# ne# cje n’a# #ro mje #st aw# ny# #te e#v Ze# ym# Table 3.3: Most common Polish triphones no. of occurr. % triphone no. of occurr. 12 531 515 0.675 wa# 3 262 204 9 587 483 0.516 do# 3 210 532 9 178 080 0.494 #ma 3 209 675 8 588 806 0.463 jon 3 082 879 6 778 259 0.365 e#z 3 054 967 6 751 495 0.364 a#v 3 028 787 6 429 379 0.346 #z# 2 928 164 6 390 911 0.344 ka# 2 871 230 6 388 032 0.344 #sp 2 818 515 6 173 458 0.333 ontˆs 2 754 934 5 990 895 0.323 e#s 2 737 210 5 945 409 0.320 i#p 2 725 414 5 742 711 0.309 o#p 2 719 121 5 560 749 0.300 #Ze 2 701 194 5 433 154 0.293 #ja 2 670 034 5 317 078 0.286 ta# 2 618 595 5 311 716 0.286 ent 2 612 166 5 292 103 0.285 #to 2 567 269 4 983 295 0.268 to# 2 557 630 4 861 117 0.262 pro 2 548 979 4 858 960 0.262 pra 2 539 424 4 763 697 0.257 #pa 2 503 153 4 746 280 0.256 #re 2 502 443 4 728 565 0.255 ost 2 490 304 4 727 840 0.255 #ty 2 452 830 4 660 745 0.251 tˆse# 2 436 864 4 514 478 0.243 #mj 2 397 741 4 428 341 0.239 ku# 2 383 231 4 216 459 0.227 e#m 2 379 510 4 155 732 0.224 ja# 2 353 638 3 965 693 0.214 e#o 2 343 622 3 958 262 0.213 a#s 2 336 272 3 916 595 0.211 #vj 2 329 962 3 888 279 0.209 #mo 2 320 091 3 785 754 0.204 nyx 2 299 719 3 760 340 0.203 os’tˆs’ 2 295 365 3 745 320 0.202 ovy 2 284 782 3 596 680 0.194 sci 2 282 887 3 580 425 0.193 ove 2 262 277 3 449 304 0.186 li# 2 255 403 3 313 798 0.178 ovj 2 251 294 3 309 352 0.178 mi# 2 243 432 3 300 273 0.178 uv# 2 236 507 55 % 0.176 0.173 0.173 0.166 0.165 0.163 0.158 0.155 0.152 0.148 0.147 0.147 0.146 0.145 0.144 0.141 0.141 0.138 0.138 0.137 0.137 0.135 0.135 0.134 0.132 0.131 0.129 0.128 0.128 0.127 0.126 0.126 0.125 0.125 0.124 0.124 0.123 0.123 0.122 0.121 0.121 0.121 0.120 CHAPTER 3. LINGUISTIC ASPECTS OF POLISH 56 guistic system. It describes how many bits in average are needed to describe phonemes. According to Jassem in Basztura (1992) entropy for Polish is 4.7506 bits/phoneme. From our calculations entropy for phonemes is 4.6335, for diphones 8.3782 and 11.5801 for triphones. 3.5 Analysis of Phonetic Similarities in Wrong Recognitions of the Polish Language A speech recognition system based on HTK for Polish is presented. It was trained on 365 utterances, all spoken by 26 males. Errors in recognition were analysed in detail in an attempt to find reasons and scenarios of wrong recognitions. We aim to provide a large vocabulary ASR system for Polish. There is very little research in this topic and there is no system which would work on sentence level for a relatively rich dictionary. Polish differs from the languages most commonly used in ASR like English, Japanese and Chinese in the same way as all Slavic languages. It is highly inflective and non-positional. These disadvantages are compensated by an important feature of Polish language. The relation between phonemes and the transcription is more distinct. 
We used the HTK (Rabiner, 1989; Young, 1996) as the basis of the recognition engine. While this solution seems to work well, it is necessary to add extra tools on the grammar and semantic levels if a large dictionary is going to be used while retaining very good recognition. The mel-frequency cepstral coefficients (MFCCs) (Davis and Mermelstein, 1980; Young, 1996) were calculated for parametrisation. 12 MFCCs plus an energy term, with first and second derivatives, were used, giving a standard set of 39 elements. We used 25 ms windows for audio framing and pre-emphasis filtering with coefficient 0.97. Segments were windowed using the Hamming method. All 37 different phonemes were distinguished using a phonetic transcription provided with the corpus. As was shown in the previous chapter, HTK is a standard for ASR, and its technical details are also considered the state of the art of ASR. HTK is widely used as a model (Hain et al., 2005; Zhu and Paliwal, 2004; Ishizuka and Miyazaki, 2004; Evermann et al., 2004). We used the HTK settings suggested in the tutorial in (Young et al., 2005), apart from the sentence model. We did not use it at all because of the linguistic differences between English and Polish; namely, the order of words in Polish is too irregular for this kind of model. In this experiment we simply treated sentences as if they were words, which means we put them in the dictionary. Obviously we used a different dictionary and list of phonemes than in the English example in the tutorial. All other settings were as suggested in (Young et al., 2005). Errors in speech recognition can have many different causes (Greenberg et al., 2000). Some of them appear because of phonetic similarities of different types, although there are errors which cannot be explained by acoustic similarities. We want to find other possible reasons for these errors. The results are presented with a very deep analysis of which utterances were wrongly recognised and what they were recognised as. This knowledge may help in future ASR system design and in preparing data for corpora and model training. There are three general types of errors: random, systematic and gross. Random (or indeterminate) errors are caused by uncontrollable fluctuations of the voice that affect parametrisation and experimental results. Systematic (or determinate) errors are instrumental, methodological, or personal mistakes causing lopsided data, which is consistently deviated in one direction from the true value. The detection of such errors is most important, because the model then has to be altered. Gross errors are caused by experimenter carelessness or equipment failure, which are quite unlikely here as we used professionally recorded data which had already been used by other researchers. Our system was trained on part of a set called CORPORA, created under the supervision of Stefan Grocholewski at the Institute of Computer Science, Poznań University of Technology in 1997 (Grocholewski, 1995). Speech files in CORPORA were recorded with the sampling frequency f0 = 16 kHz, equivalent to a sampling period t0 = 62.5 µs. Speech was recorded in an office, with a working computer in the background, which makes the corpus not perfectly clean. The signal to noise ratio (SNR) is not stated in the description of the corpus. It can be assumed that the SNR is very high for actual speech, but minor noise is detectable during periods of silence.
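The front end described above (12 MFCCs plus energy with first and second derivatives, 25 ms Hamming windows, 0.97 pre-emphasis) can be approximated outside HTK as well. The sketch below, using the librosa library, is only an illustration of this parameterisation, not the HCopy configuration actually used, so the exact coefficient values will differ; the file name is hypothetical.

import numpy as np
import librosa

def mfcc_39(path, sr=16000):
    """Approximate 39-element feature vectors: 12 MFCCs plus log energy,
    each with first and second derivatives."""
    y, _ = librosa.load(path, sr=sr)
    y = librosa.effects.preemphasis(y, coef=0.97)        # 0.97 pre-emphasis filtering
    win, hop = int(0.025 * sr), int(0.010 * sr)          # 25 ms window, 10 ms step
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, n_fft=win,
                                hop_length=hop, window='hamming', center=False)
    frames = librosa.util.frame(y, frame_length=win, hop_length=hop)
    log_energy = np.log(np.sum(frames ** 2, axis=0) + 1e-10)
    n = min(mfcc.shape[1], log_energy.shape[0])
    static = np.vstack([mfcc[1:13, :n], log_energy[np.newaxis, :n]])  # 13 static features
    return np.vstack([static,
                      librosa.feature.delta(static),                  # first derivatives
                      librosa.feature.delta(static, order=2)])        # second derivatives

feats = mfcc_39('corpora/AO1M1_utt001.wav')   # hypothetical file name
print(feats.shape)                            # (39, number of frames)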
The database contains 365 utterances (33 single letters, 10 digits, 200 names, 8 short computer commands and 114 simple sentences), each spoken by 11 females, 28 males and 6 children (45 people), giving 16425 utterances in total. One set spoken by a male and one by a female were hand segmented. The rest were segmented by a dynamic programming algorithm using a model trained on the hand-segmented ones. The optimisation was used to fit boundaries using the existing hand segmentation of the same utterance spoken by two different people. All available utterances of 26 male speakers were used for training, treating all of them as single words in the HTK model. We created a decision tree to find the contexts making the largest difference to the acoustics, which should distinguish clusters, using rules of Polish phonology and phonetics (Kȩpiński, 2005) to create tied-state triphones. In all our experiments involving HTK, some preprocessing of the data is necessary because of the special letters in Polish. The first step of this process is to change all upper case letters into lower case letters. Then all Polish special letters are replaced by corresponding standard capital letters. For example, ó is changed into O. 3.6 Experimental Results on Applying HTK to Polish As already mentioned, the system was trained on 9490 utterances, 365 for each of 26 male speakers. The orthographic dictionary contains 365 elements, but due to differences in pronunciation between different speakers, the final version of the dictionary, working on phonetic transcriptions, contains 1030 entries. We started the recognition evaluation using data of the only male speaker who was not used in the training (Table 3.4). Only 6 out of 365 utterances were substituted, giving a correctness of 98.36 %. Audio files of females, boys and girls were also recognised to check the correlation between the parameterisation of different ages and genders. These speakers were also used instead of adding noise to the male speaker. We received correctnesses of 79.73%, 95.34% and 92.05% for adult female speakers. Child male speakers were recognised with correctnesses of 60.55%, 95.07% and 75.62%. We noted correctnesses of 88.22% and 84.11% for girls. All non-adult male speakers gave clearly worse results; however, there is no obvious difference between degradation in results related to age or gender.

Table 3.4: Word recognition correctness for different speakers (the model was trained on adult male speakers only)

speaker   age     gender   substitutions   correctness
AO1M1     adult   male       6             98.36
AF1K1     adult   female    74             79.73
BC1K1     adult   female    17             95.34
BW1K1     adult   female    29             92.05
AK1C1     child   male     144             60.55
AK2C1     child   male      89             75.62
CK1C1     child   male      18             95.07
LK1D1     child   female    43             88.22
ZK1D1     child   female    58             84.11

Table 3.5: Errors in different types of utterances (for all speakers)

type                  errors   being recog.   % of errors
sentences               2        1026            0
digits                 21          90           23
alphabet              130         297           44
names and commands    312        1872           17

Even girl speakers, for whom both age and gender differed from the training speakers, were recognised with a similar number of errors as speakers with just a different gender or age. The types of errors were carefully analysed. First, we checked the percentage of correctly and wrongly recognised utterances, depending on the type of utterance (Table 3.5).
It can be clearly seen that smaller units are much more difficult to recognise: 44 % for one syllable units (spoken letters of alphabet), 23% and 17% for single words and almost no errors for sentences, even though we evaluated the system also on speakers of gender and age which were not used during the training. It suggests that recognition based on MFCC parameterisation only is not enough. The context has to be used for allowing HMM models work correctly (or much better parameterisation, if possible). All sentences were treated as single words during the training and the testing. The recognition of sentences is on an exceptional level, especially considering, that we used many speakers of gender and age not used during the training. The only two wrong recognitions are quite bizarre. In the first case the sentence which means ‘He cleans sparrows in zoo’ was recognised as a female name Helena. In the second case the sentence ‘Ups, it was more grey than yours’ was recognised as ‘A horse went on poor road’. In both cases the correct transcription and wrong recognition are Table 3.6: Errors in sentences (speakers AK1C1 and AK2C1 respectively) correct transcription wrong recognition On myje wróble w zoo Helena Oj bardziej niż wasz był szary Koń droga̧ marna̧ szedł CHAPTER 3. LINGUISTIC ASPECTS OF POLISH 59 Table 3.7: Errors in digits digit no. wrong recognitions 0 zero 3 trzy 1 jeden 4 cztery 2 dwa 8 osiem 5 piȩć 6 sześć 7 siedem 9 dziewiȩć 4 4 3 3 2 2 1 1 1 1 Zofia,Iwona,ce,Bożena ce(2),zero,Joanna Urban(2),Izabela, o, ge(2) Diana,Anna Franciszek, Alicja Rudolf zero Zenon Diana phonetically very different and very easily distinguishable for a human listener. There are several interesting detailed observations in patterns of wrong recognitions. Only one name was recognised as a sentence and quite few were recognised as spoken letters (Table 3.8 and 3.9). The majority of wrong hypotheses were simple words. It means that the efficiency of the model depends on a length of utterances. It works better for longer ones. The very interesting fact is that even if names are recognised wrongly, their gender is still correct most of the time. 79 female names were recognised as other female names (out of those presented in Table 3.8), with only 17 female names recognised as male names. Some clue might be that the very majority of female names in Polish end with ’a’. However, such phonological similarity is probably not strong enough for this effect. It is difficult to explain fully this phenomena. The similar pattern was found in case of male names. 50 male names were wrongly recognised as other male names and only 14 male names were recognised as female names. There are some pairs of phonologically similar names like Lucjan and Łucjan, or Mariola and Marian, which where quite commonly mistaken with each other. However, most of wrong recognitions seem to have no explanation like this. What is more, some wrong detections with large phonological differences appear quite frequently. For example, Barbara was recognised wrongly three times, and all of them as Marzena. It has to be stressed that many pairs of very similar words were recognised quite correctly, like name Maria was only twice recognised as Marian and Marian as Maria just once. We can conclude that phonological similarities can cause wrong detections but seem to be not a major source of them. Table 3.10 shows names which were used as wrong hypotheses for errors listed in other tables. 
There is an interesting tendency that these words were correctly recognised most of the time when the audio with their content was analysed. It suggests that some utterances are generally more probable than others for the recognition of the whole set, correct or not. We can say that they are represented more strongly in the language models. In a similar way, names which were wrongly recognised, rarely appear in Table 3.10, because they are weakly represented. It has to be stressed that all utterances were used 26 times (Table 3.10) during the training. The best example of this behaviour is a name Łucjan, which was recognised for virtually all test speakers as Lucjan. The CHAPTER 3. LINGUISTIC ASPECTS OF POLISH Table 3.8: Errors in the most often wrongly recognised names and commands word no. wrong recognitions Łucjan Nina Dorota Jan nie cofnij Dominik Ewa Maria Regina Wacław Ziuta Emilia Emil Gerard Julia Lech Łucja Sabina Teodor Alina Barbara Benon Bernard Cecylia Celina Damian Daria Eliza Felicja Hanna Henryk Irena Iwona Izydor Jerzy Janusz Karolina Monika 9 7 6 6 6 5 5 5 5 5 5 5 4 4 4 4 4 4 4 4 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 Lucjan(9) Lidia,Emilia,Anna(2),Łucja,Urszula,Julian Beata(4),Renata,Danuta Jerzy,Łucjan(2),Daniel,Diana,Leon źle,Lech(2),u(2),o Teofil(3),Rafał(2) Jan,Daniel(3),Jakub, Anna,Helena,Ole nka,Eliza,Helena Mariola,Marian(2),Klaudia,Marzenka Joanna,Romuald,el,Emilia,Aniela Lucyna(2),Jarosław Julita(2),Joanna,Jolanta,Olga Aniela(2),el,ku ku,el(3) Eugenia,Bożena,Leonard,de Urszula,Julian(2),Joanna zero,Joanna,u,te Lucjan(2),Urszula(2) Celina(2),Halina(2), Adam(3),Joanna, Emilia,Alicja,Urszula Marzena(3) Damian(2),Marian Gerard,Beata,Leonard Apolonia(2),Wacław Karol,źle,Mariola Daniel(2),Benon Marta,Daniel,Bożena Alina(2),Lucjan Łucja,Urszula,Alicja Helena,Marian,Halina Alfred,Romuald,Hubert Ireneusz,Urszula,Karolina Izabela,Maria,Zuzanna jeden,Romuald,Bogdan źle,u,Leszek Ireneusz,Lech,Rudolf Mariola,Pelagia,Alina Oleńka,Łukasz 60 CHAPTER 3. LINGUISTIC ASPECTS OF POLISH 61 Table 3.9: Errors in the most often wrongly recognised names and commands (2nd part) word no. wrong recognitions Marek Mariola Pelagia Paulina Sławomir Seweryn Wojciech Wanda Weronika źle Zenon 3 3 3 3 3 3 3 3 3 3 3 Romuald(2),Marta Marian(2),Maria Karolina(2),ten chór dusiłem licznie, Mariola(2),Karolina Hanna,Mariola,Karol Karolina,Cezary,Zenon Walenty,Monika,Alicja Halina,Marzena,Mariola Dorota,Renata,Danuta Julian,Joanna,Zofia Marian(2),Benon Table 3.10: Names which appeared the most commonly as wrong recognitions in above statistics name no. name no. name no. Lucjan Marian Urszula Daniel Joanna Mariola Beata Karolina Marzena Romuald Alicja Anna Halina Julian 14 8 8 7 7 7 5 5 5 5 4 4 4 4 Alina Bożena Diana Emilia Helena Ireneusz Rudolf Julita Karol Lech Leonard Leszek Łucja Łucjan 3 3 3 3 3 3 3 2 2 2 2 2 2 2 Aniela Apolonia Benon Celina Damian Danuta Izabela Maria Marta Oleńka Renata Urban Zenon Zofia 2 2 2 2 2 2 2 2 2 2 2 2 2 2 CHAPTER 3. LINGUISTIC ASPECTS OF POLISH 62 Table 3.11: Errors in pronounced alphabet letter errors letter errors letter errors en em er pe će a̧ eń te y esz źet 9 8 8 8 7 6 6 6 6 6 6 ce e ka be de ge i o zet eś 5 5 5 4 4 4 4 4 3 3 a es żet eł ku u wu el ȩ ef 2 2 2 1 1 1 1 1 1 1 name Lucjan was always correctly recognised. What is more Lucjan was provided as a hypothesis for several other names, including Jan which was recognised as Lucjan in case of two different speakers. 
In this example the name Lucjan was provided as a recognised word 23 times (including correct ones) and Łucjan twice, in both cases incorrectly. Table 3.11 presents wrongly recognised letters of alphabet. We already mentioned that this group is most likely to contain errors because its elements are very short and the HMM model cannot use all its advantages. We can also observe that sonorants (n, m, r) tend to be the most difficult for recognition. Letters ha and jot were recognised correctly for all speakers. 3.7 Conclusion Polish and English were compared considering approaches to ASR of these two languages. 250 000 000 words from different corpora: newspaper articles, Internet and literature were analysed. Statistics of Polish phonemes, diphones and triphones were created. They are not fully complete, but the corpora were large enough, that they can be successfully applied in NLP applications and speech processing. The collected statistics are the biggest for Polish of this type of linguistic computational knowledge. Polish is one of most common Slavic languages. It has several different phonemes than English and the statistics of phonemes are also different. The most popular and standard ASR - HTK - was trained for the Polish language and tested with a deep analysis of the errors that occurred. Chapter 4 Phoneme Segmentation Speech signals typically need to be divided into small frames before recognition can begin. Analysis of these frames can then determine the likelihood of a particular phoneme being present within the frame. Speech is non-stationary in the sense that frequency components change continuously over time, but it is generally assumed to be a stationary process within a single frame. Segmentation methods currently used in speech recognition usually do not consider where phonemes begin and end, which causes complications to appear at the boundaries of phonemes. However, nonuniform phoneme segmentation was already found useful in ASR for more accurate modelling (Glass, 2003). A phoneme segmentation method is presented in this chapter, which is a more sophisticated method than one described in (Ziółko et al., 2006b). More scenarios are covered and results are evaluated in a better way. Experiments were taken on much larger COPORA, which was described in the previous chapter. The method is based on analysing envelopes and the rate-of-change of the DWT subband power. 4.1 Analysis Using the Discrete Wavelet Transform The human hearing system uses frequency processing in the first step of sound analysis. While the details are still not fully understood, it is clear that a frequency based analysis of speech reveals important information. This encourages us to use DWT as a method of speech analysis, since the DWT may be more similar to the human hearing system than other methods (Wang and Narayanan, 2005; Daubechies, 1992). Details of the wavelet transformation are beyond the scope of this thesis, but here we present a brief overview of the method. The wavelet transformation provides a timefrequency spectrum. The original speech signal s(n) and its wavelet spectrum are of 16 bits accuracy. In order to obtain DWT (Daubechies, 1992), the coefficients of series sm+1 (n) = X cm+1,i φm+1,i (n) i 63 (4.1) CHAPTER 4. PHONEME SEGMENTATION 64 are computed, where φm+1,i is the ith wavelet function at the (m + 1)th resolution level. Due to the orthogonality of wavelet functions X cm+1,i = s(n)φm+1,i (n), (4.2) nDm+1,i where Dm+1,i = {n : ϕm+1,i (n) 6= 0} (4.3) are supports of φm+1,i . 
The coefficients of the lower level are calculated by applying the well-known (Daubechies, 1992; Rioul and Vetterli, 1991) formulae:

c_{m,k} = \sum_i h_{i-2k} \, c_{m+1,i},  (4.4)

d_{m,k} = \sum_i g_{i-2k} \, c_{m+1,i},  (4.5)

where h_i and g_i are the constant coefficients which depend on the scaling function φ and the wavelet ψ (e.g. the functions presented in Fig. 4.2, which characterise dmey, the discrete Meyer wavelet). The speech spectrum is decomposed using the digital filtering and downsampling procedures defined by (4.4) and (4.5). This means that, given the wavelet coefficients c_{m+1,i} of the (m+1)th resolution level, (4.4) and (4.5) are applied to compute the coefficients of the mth resolution level. The elements of the DWT for a particular level may be collected into a vector, for example d_m = (d_{m,1}, d_{m,2}, ...)^T. The coefficients of the other resolution levels are calculated recursively by applying formulae (4.4) and (4.5). The multiresolution analysis gives a hierarchical and fast scheme for the computation of the wavelet coefficients of a given speech signal s. In this way the values

DWT(s) = \{d_M, d_{M-1}, ..., d_1, c_1\}  (4.6)

of the DWT for M + 1 levels are obtained. Each signal

s_{m+1}(n) = s_m(n) + s^d_m(n) \quad \text{for all } n \in \mathbb{Z}  (4.7)

on the resolution level m+1 is split into the approximation (coarse signal)

s_m(n) = \sum_k c_{m,k} \, \varphi_{m,k}(n)  (4.8)

on the lower, mth resolution level and the high frequency details

s^d_m(n) = \sum_k d_{m,k} \, \psi_{m,k}(n).  (4.9)

The wavelet transformation can be viewed as a tree. The root of the tree consists of the coefficients of the wavelet series (4.1) of the original speech signal. The first level of the tree is the result of one application of (4.5). Subsequent levels in the tree are constructed by recursively applying (4.4) and (4.5) to split the spectrum into the low (approximation c_{m,n}) and high (detail d_{m,n}) parts.

Figure 4.1: Wavelet transform outperforms STFT because it has higher resolution for higher frequencies.

Experiments undertaken by us show that decomposition of the speech signal into six levels is sufficient (see Fig. 4.3) to cover the frequency band of a human voice (see Table 4.1). The energy of the speech signal above 8 kHz and below 125 Hz is very low and can be neglected. The same experiment was conducted using 7 subbands and worse results were obtained. There is a wide variety of possible basis functions from which a DWT can be derived. To determine the optimal choice of wavelet, we analysed six different wavelet functions: Meyer (Fig. 4.2), Haar, Daubechies wavelets of 3 different orders and symlets. Our results show that the discrete Meyer wavelet gives the best results. 4.2 General Description of the Segmentation Method Phonemes are characterised by differing frequency content, and so we would expect changes of the power in different wavelet resolution levels between phonemes. Clearly, it would be easiest to analyse the absolute value of the rate-of-change of power and expect it to be large at the beginning and at the end of phonemes. However, this does not uniquely define start and end points, for two reasons. Firstly, the power can rise over a considerable length of time at the start of a phoneme, leading to an ambiguous start time. Secondly, there may also be rapid changes in power in the middle of a segment. A better method of detecting the boundaries of phonemes relies on power transitions between the DWT subbands. Our approach (Ziółko et al., 2006b) is based on a six level DWT analysis (for example M = 6) of a speech signal (Fig. 4.3).
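The decomposition (4.4)-(4.9) can be sketched with the PyWavelets package, where 'dmey' denotes the discrete Meyer wavelet. This is an illustration only, not the Matlab code used in the experiments, and the synthetic test signal is just a placeholder for a CORPORA utterance.

import numpy as np
import pywt

def dwt_subbands(signal, wavelet='dmey', levels=6):
    """Decompose a speech signal into the detail subbands d1..d6 and the
    remaining approximation c1, following (4.4)-(4.6)."""
    signal = signal / np.max(np.abs(signal))        # normalise by the maximum value
    coeffs = pywt.wavedec(signal, wavelet, level=levels)
    approx, details = coeffs[0], coeffs[1:]
    # pywt lists the details from the deepest (lowest-frequency) level to the
    # finest, so details[0] is d1 and details[-1] is d6 in the naming of Table 4.1.
    subbands = {f'd{i + 1}': d for i, d in enumerate(details)}
    subbands['c1'] = approx
    return subbands

fs = 16000                                           # CORPORA sampling frequency
t = np.arange(0, 0.5, 1.0 / fs)
s = np.sin(2 * np.pi * 440 * t) + 0.3 * np.sin(2 * np.pi * 3000 * t)
for name, band in dwt_subbands(s).items():
    print(name, band.shape)                          # d6 has the most samples, d1 the fewest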
PHONEME SEGMENTATION 66 Meyer wavelet 2 1 0 −1 −8 −6 −4 −2 0 2 4 6 8 4 6 8 Meyer scaling function 1.5 1 0.5 0 −0.5 −8 −6 −4 −2 0 2 Figure 4.2: The discrete Meyer wavelet - dmey DWT level d6 DWT level d5 0.4 0.4 0.2 0.2 0 0 2000 4000 6000 DWT level d4 8000 0 2 2 1 1 0 0 500 1000 1500 DWT level d2 2000 0 2 2 1 1 0 0 100 200 300 400 0 0 1000 0 200 2000 3000 DWT level d3 400 4000 600 800 150 200 DWT level d1 0 50 100 Figure 4.3: Subband amplitude DWT spectra of the Polish word ’osiem’ (eng. eight). The number of samples depends on a resolution level CHAPTER 4. PHONEME SEGMENTATION 67 Table 4.1: Characteristics of the discrete wavelet transform levels and their envelopes Level Band (kHz) No. of samples Window d6 8−4 32 5 d5 4−2 16 5 d4 2−1 8 5 d3 1 − 0.5 4 3 d2 0.5 − 0.25 2 3 d1 0.25 − 0.125 1 3 The amount 2−M +m−1 N of wavelet spectrum samples in the mth level (where m = 1, . . . , M ) depends on the length N of the speech signal in time domain, assuming N is a power of 2. Table 4.1 presents their number at each level relative to the lowest resolution level. The power waveform pm (n) = m−1 2X d2m,j−1+n2m−1 where n = 0, . . . , 2−M N − 1, (4.10) j=1 is computed in a way to obtain the equal number of power samples for all subbands. The DWT subband power shows rapid variations (see Fig. 4.3) and despite smoothing (4.10) the power waveforms change rapidly. The first order differences in the power are inevitably noisy, and so we calculate the envelopes pen m (n) for power fluctuations in each subband by choosing the highest values of pm (n) in a window of given size ω (see Table 4.1) to obtain a power envelope (Fig.4.4). A smoothed differencing operator was used and the subband power pm is convolved with the mask [1, 2, −2, −1] to obtain smoothed rate-of-change information rm (n). In order to improve accuracy, a minimum threshold pmin was introduced for a subband DWT power. This threshold was chosen experimentally as 0.0002 for the test corpus. This prevents us from analysing noise where the power of the speech signal is very small (for example in areas of ‘silence’), even though noise is very low in the test corpus. The parameter pmin can be easily chosen for other corpora by analysing part of it with audio containing noise only. The threshold pmin can be set as 110% of power of noise. The start and end of a phoneme should be marked by an initially small, but rapidly rising power level in one or more of the DWT levels. In other words, the derivative can be expected to be approximately as large as the power. This is why phoneme boundaries can be detected searching for n-points for which the inequality p ≥ |β|rm (n)| − pen m (n)| (4.11) holds for the phoneme boundaries. Constant p is a value of threshold which accounts for the time scale and sensitivity of the crossing points. We found that setting the threshold p as 0.1 gave the best results. The rate-of-change function rm is multiplied by scaling factor β approximately equal to 1 which allows us to subtract the power from product β|rm (n)|. CHAPTER 4. PHONEME SEGMENTATION 68 d6 d5 1.5 3 1 2 0.5 1 0 0 50 100 150 200 0 0 50 d4 100 150 200 150 200 150 200 d3 15 25 20 10 15 10 5 5 0 0 50 100 150 200 0 0 50 d2 100 d1 4 1 0.8 3 0.6 2 0.4 1 0 0.2 0 50 100 150 200 0 0 50 100 Figure 4.4: Segmentation of the Polish word ’osiem’ (eng. eight) based on DWT sub-bands. 
Dotted lines are hand segmentation boundaries, dashed lines are automatic segmentation boundaries, bold lines are envelopes and thin lines are smoothed rate-of-change.

4.3 Phoneme Detection Algorithm

Without any additional refinement, the above method may not be able to detect the phoneme boundaries precisely. There are several reasons for this. First, the exact locations of the boundaries may vary slightly between subbands; for some phonemes only one frequency band may show significant variations in power, while for others several subbands may show such variations. Sometimes the analysis will detect slightly separated boundaries for different subbands. Secondly, despite smoothing the derivative, there may be a number of transitions which represent the same boundary. This problem was approached by noting the transitions and other situations which are likely to happen at phoneme boundaries using e(n), which will be referred to as an event function. Such an approach lets us consider several scenarios and aspects of potential phoneme boundaries. It also allows us to improve the method easily by adding new events to the existing list. The suggested events are presented in Table 4.2 and explained in detail later. Surprisingly, pre-emphasis filtering was found to degrade quality, so it was not used in the final version of the algorithm:

1. Normalise the speech signal by dividing it by its maximum value in the analysed fragment of speech.

2. Decompose the signal into six levels of the DWT.

3. Calculate (4.10) in all frequency subbands to obtain the power representations p_m(n) of the mth subband.

4. Calculate the envelopes p^en_m (Fig. 4.4) of the power fluctuations in each subband by choosing the highest values of p_m in a window of a given size ω, according to Table 4.1.

5. Calculate the rate-of-change function (Fig. 4.4) r_m(n) by filtering p_m(n) with the [1, 2, -2, -1] mask.

6. Create an event function e(n) = 0 for all n. In the next step its value will be increased to record events for which r_m(n) and p^en_m(n) look like a phoneme boundary for a given n.

7. Analyse r_m(n) and p^en_m(n) for each DWT subband to find the discrete times n for which the event conditions described in Table 4.2 hold. Add the value of the event importance (as per Table 4.2) to the event function e(n) (Fig. 4.5) for the given discrete time n. If several events occur for a single discrete time, then sum the event importances of all of them. Repeat this step for all discrete times n. In this way we obtain a boundary distribution-like function

e(n) = \begin{cases} 0 & \text{if no condition is fulfilled for } n, \\ \sum_i w_i & \text{otherwise,} \end{cases}  (4.12)

where w_i are the importance weights (see Table 4.2) of the events that occurred for n in all subbands.

8. Search for a discrete time n, starting from 1, for which the event function is higher than a decision threshold τ. A threshold value of τ = 4 was chosen experimentally.

9. Find all the discrete times t_i for which

e(t_i) > τ - 1, \quad t_i > n, \quad t_i - t_{i+1} < α,  (4.13)

where n is the last index analysed in the previous step and α is associated with the minimal phoneme length (α = 4 gives approximately 20 ms). Organise all the discrete times t_i fulfilling the above conditions into separate groups.

Table 4.2: Types of events associated with a phoneme boundary.
Mathematical conditions are based on power envelope pen m (n), rate-of-change information rm (n), a threshold p of the distance between rm (n) and pen (n) and a threshold pmin of minimal pen m m (n) and β = 1. Values in the last four columns are for different DWT levels (the first one for d1 level, the second one for d2 level, the third for levels from d3 to d5 and the last one for d6 level) Description Mathematical condition Importance en Quasi-crossing point |β|rm (n)| − pm (n)| < p and 1 3 4 1 en (|β|rm (n + 1)| − pm (n + 1)| > p or |β|rm (n − 1)| − pen m (n − 1)| > p) and en pm (n) > pmin Crossing point β|rm (n)| > pen 1 3 4 1 m (n) + p and en first case β|rm (n + 1)| < pm (n + 1) − p and pen m (n) > 5 pmin Crossing point β|rm (n)| < pen 1 3 4 1 m (n) − p and en second case β|rm (n + 1)| > pm (n + 1) + p and pen m (n) > 5 pmin Rate-of-change higher than β|rm (n)| > pen 1 2 2 1 m (n) and en power envelope pm (n) > 2 pmin 10. Calculate the weighted mean discrete time b from the discrete times grouped in the previous step. Index b is the detected phoneme boundary in the discrete timing of DWT level d1 , which was used in the algorithm for all other subbands by summing samples. P t i wi b = Pi . i wi (4.14) 11. Repeat previous three steps for next discrete time values n, until the largest n with non-zero value of event function e(n) will be processed. Table 4.2 describes the events which can be expected to occur in the power of DWT subbands. Some of them are more crucial than others. In our previously published work (Ziółko et al., 2006b) only the first of them was used. Additionally, different weights were given to events with respect to a subband in which they occur. It is a perceptually motivated idea which was very successfully used in the PLP (Hermansky, 1990). As per this study, information in relatively high and low frequency subbands is not so important for the human ear as information in the bands from 345 Hz to 2756 Hz. Briefly, the Hermansky solution (Hermansky, 1990; Hermansky and Morgan, 1994) used a window to modify speech, decreasing frequencies not crucial for the human ear and amplifying the most important ones. The same aim was followed in our solution by giving low weights for events occurring in detectable, but not the most important frequencies, and higher ones for the middle of human hearing bands. Six DWT subbands were used. The third, fourth and fifth were grouped together as the middle and most crucial ones. As a result in Table 4.2 four columns with importance values (weights) are presented (the first one for the d1 level, the second one for CHAPTER 4. PHONEME SEGMENTATION 71 event function e(i) 14 12 10 8 6 4 2 0 0 20 40 60 80 100 120 140 160 180 Figure 4.5: The event function versus time in ms of the word presented in Fig. 4.4. High event scores mean that a phoneme boundary is more likely the d2 level, the third for the levels from d3 to d5 and the last one for the d6 level). There are four possible events presented in Fig. 4.6 and described in Table 4.2. Some of them are quite similar. It has to be stressed that for some discrete times and subbands more than one event can occur (typically two and very rarely more). In this case weights of both events are taken into account to the event function e(n). In all cases, the values of rate-of-change information |rm (n)| are multiplied by scaling factor β equals to 1. The first event is called quasi-crossing point. It is the most general and common one. 
The mathematical condition for this event detects discrete times for which the power envelope p^en_m(n) and the absolute value of the rate-of-change information |r_m(n)| cross or approach each other very closely (within the threshold p). Additionally, the power envelope p^en_m(n) has to be higher than the threshold p_min. The second and third events are twin events and represent rarer cases, namely the crossing of the power envelope p^en_m(n) and the absolute value of the rate-of-change |r_m(n)| when p^en_m(n) is five times higher than the minimum threshold p_min. This means that the second and third cases are used to detect and note more specific situations than the first one, because typically fulfilling one of those conditions means fulfilling the first one as well. As we sum all event importances for a given n, this will cause a higher value of the event function e(n) than the first event alone. In these cases, one of the functions p^en_m(n) and |r_m(n)| starts at a higher level than the other and goes below the level of the second one, suggesting a phoneme boundary very clearly.

Figure 4.6: Simple examples of the four events described in Table 4.2. They are characteristic for phoneme boundaries. The images present the power envelope p^en_m(n) and the rate-of-change information (derivative) r_m(n).

The fourth event is also quite rare and covers situations where the DWT spectrum changes very rapidly, which happens for changes in speech content such as phoneme boundaries. In this situation the level of p^en_m(n) can be relatively low. We search for the absolute value of the rate-of-change information |r_m(n)| being higher than the power envelope p^en_m(n) and for p^en_m(n) being higher than double the minimum threshold. The fourth event is different because it does not describe anything similar to the crossings used in the general description of the method in the previous section. However, if |r_m(n)| is so high, it also indicates that a phoneme boundary may occur. It is less strict and more general, so a lower weight was given to it. The values of the thresholds in the first three events were chosen to make the second and third events more difficult to fulfil than the first one. The threshold in the fourth type of event was chosen experimentally. The method is designed so that it is easy to improve by introducing additional conditions. It is easy to introduce a new condition which will add or subtract (negative events, which imply that a boundary did not occur, are not included in this solution but are generally possible) additional values to e(n) for the discrete times where the new condition is fulfilled. Another aspect of the 'intelligence' of the method is that, even though it consists of several conditions, its sensitivity can be easily changed by setting a different decision threshold. The decision threshold is lowered by 1 for finding the following discrete times (compared to the first one in the group) due to a hysteresis rule. The application of hysteresis to the threshold produces better results. The algorithm is implemented in the Matlab environment and is not optimised for time efficiency.
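As an illustration only, the sketch below mirrors steps 4-8 of Section 4.3 for a single subband: the envelope is a sliding-window maximum of the smoothed power (4.10), the rate-of-change is obtained with the [1, 2, -2, -1] mask, and only the quasi-crossing and high rate-of-change events of Table 4.2 are scored. The thresholds p = 0.1 and p_min = 0.0002 are the values quoted in the text; the per-event weights for a single subband and the function name are assumptions, and this is not the original Matlab implementation.

import numpy as np

def subband_event_scores(power, window=5, p=0.1, p_min=0.0002, beta=1.0,
                         w_quasi=4, w_rate=2):
    """Score potential phoneme boundaries in one DWT subband.
    `power` is the smoothed subband power p_m(n) of (4.10)."""
    power = np.asarray(power, dtype=float)
    n = len(power)
    # Envelope: highest power value inside a window of size `window` (Table 4.1).
    env = np.array([power[max(0, i - window // 2): i + window // 2 + 1].max()
                    for i in range(n)])
    # Smoothed rate-of-change: convolution with the [1, 2, -2, -1] mask.
    rate = np.abs(np.convolve(power, [1, 2, -2, -1], mode='same'))
    diff = np.abs(beta * rate - env)
    e = np.zeros(n)
    for i in range(1, n - 1):
        # Quasi-crossing point: envelope and rate-of-change approach within p.
        if diff[i] < p and (diff[i + 1] > p or diff[i - 1] > p) and env[i] > p_min:
            e[i] += w_quasi
        # Rate-of-change higher than the power envelope.
        if beta * rate[i] > env[i] and env[i] > 2 * p_min:
            e[i] += w_rate
    return e

# Summing the scores of all six subbands gives the event function e(n);
# boundary groups are then formed where it exceeds the decision threshold of 4.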
In its current version the Matlab implementation needs 14 minutes to segment the whole corpus using the Haar wavelet (the lowest order of filters) and 20 minutes using the discrete Meyer wavelet (the highest order of filters, namely 50). The corpus has 16425 utterances (some of them sentences), which gives 0.05 s per utterance for the version with the Haar wavelet and 0.07 s for the Meyer one. Properly optimised code in C++ would be much more time efficient. The experiment was conducted on a computer with an AMD Athlon 64 3500+ processor at 990 MHz and 1.00 GB of RAM. The method was developed on a set of 50 hand-segmented Polish words with the sampling frequency f0 = 11025 Hz, equivalent to a sampling period t0 = 90.7 µs. In order to assess the quality of our results, the method was tested on CORPORA. None of the CORPORA utterances were in the original set used during development. Hand segmentation was done by different people for the small development set and for CORPORA.

Figure 4.7: The general scheme of the set G with correct boundaries and the set A with detected ones. Elements of the set A have a grade f(x) standing for the probability of being a correct boundary. In the set G there can be elements which were not detected (in the left part of the set).

4.4 Fuzzy Sets for Recall and Precision

Fuzzy logic is a tool for embedding structured human knowledge into workable algorithms. In a narrow sense, fuzzy logic is considered a logical system aimed at providing a model for modes of human reasoning that are approximate rather than exact. In a wider sense, it is treated as a fuzzy set theory of classes with unsharp boundaries (Kecman, 2001). Fuzzy logic has found many applications in artificial intelligence, because it allows numerical and symbolic processing of human-like knowledge. This kind of processing is needed for the proper evaluation of many types of segmentation. In our case we are interested in the location of speech boundaries (for example of phonemes) (Fig. 4.8). Detected boundaries may be shifted more or less with respect to a manual segmentation. This 'more or less' makes a crucial difference and cannot be described mathematically in Boolean logic. Fuzzy logic introduces the opportunity of grading detected boundary locations in a more sensitive and human-like way. Our segmentation evaluation method is based on the well-known recall and precision evaluation method. However, in our approach the calculated boundary locations are elements of a fuzzy set and a binary operation, a T-norm, describes their memberships. A T-norm is defined as a function T : [0, 1] × [0, 1] → [0, 1] which satisfies commutativity, monotonicity and associativity and for which 1 acts as an identity element. As usual in recall and precision, one set contains the relevant elements. The other is the set of retrieved boundaries. We calculate an evaluation grade using the number of elements in each of them and in their intersection. The comparison of the number of relevant boundaries with the number of elements in the intersection gives precision. In a Boolean version of the evaluation method this is information about how many correct boundaries were found. By using fuzzy logic we evaluate not only how many boundaries were detected, but how accurately they were detected. The comparison of the number of retrieved elements with the intersection gives recall, which is a grade of wrong detections. In this case fuzzy logic allows us to evaluate not only the number of wrong detections but also their incorrectness. Each retrieved boundary has a probability factor which represents the information about being correct.

Figure 4.8: An example of the phoneme segmentation of a single word. In the lower part the hand segmentation is drawn; boundaries are represented by two indexes close to each other (sometimes overlapping). The upper columns present the segmentation of the word produced by a segmentation algorithm. All of the calculated boundaries are quite accurate but never perfect.

4.5 Algorithm of Speech Segmentation Evaluation

In this section we present an example of applying the approach described in the previous section to phoneme speech segmentation (Fig. 4.8). Due to the described features, such segmentation and its evaluation are particularly useful in ASR. In this case we have to make three assumptions:

• Hand segmentation (ground truth) is given as a set of narrow ranges. Neighbouring phonemes overlap each other in these ranges.

• Detected boundaries are represented as a set of single indexes.

• We assume the perfect detection of silence. Silence segments may be of almost any length, and including them in the evaluation would therefore cause serious inaccuracy. This is why we skip silence segments in the evaluation.

The method proceeds as follows:

1. Assign the first and last detected boundaries the same values as the hand-segmented boundaries (typically the first and the last index). This has to be done because of the third assumption.

2. Match the closest detected and hand-segmented boundaries. They need to be matched in pairs; each boundary may have only one matched boundary from the other set. Perform the following steps for each ith detected boundary, starting from the first.

3. Calculate grades of being relevant and retrieved. All matched pairs are elements of two sets, of which one is fuzzy. All non-matched detected and hand-segmented boundaries are elements of one set only. Let G denote the set of relevant (correct) elements and let A denote the ordered set containing the retrieved (predicted) boundaries. For each segmentation boundary x in A we define a fuzzy membership function f(x) that describes the degree to which x has been accurately segmented. There are three different scenarios for calculating the membership function f(x):

• A hand-segmented boundary not matched with any detected boundary is an element of the set G only.

• A detected boundary x not matched with any hand-segmented boundary is an element of the set A and has f(x) = 0. The last detected boundary in Fig. 4.8 is such a case.

• If a detected boundary x is inside a hand-segmented boundary range, the boundary is an element of both sets A and G. The other probabilistic factor is Boolean and represents membership of the set of hand segmentation boundaries. We use the algebraic product of these two probabilistic grades as a T-norm to find a membership grade of the intersection. In the situation where x is inside the hand-segmented boundary range, f(x) = 1.

• Otherwise it is a fuzzy case and f(x) = (a − b)/a, where a stands for half of the length of the phoneme in which the detected boundary is situated and b stands for the distance between the hand-segmented boundary and the detected one (Fig. 4.9).

Figure 4.9: Fuzzy membership, ranging from 1 at a phoneme start/end point down to 0 at the phoneme midpoint.

All boundaries in Fig.
4.8, apart from the last one, are examples of this case, which shows how useful fuzzy logic can be in segmentation evaluation.
4. Fuzzy precision can be calculated as

P = \frac{\sum_{x \in A} f(x)}{|G|}.    (4.15)

5. Fuzzy recall equals

R = \frac{\sum_{x \in A} f(x)}{|A|}.    (4.16)

Recall and precision can be used to give a single evaluation grade in many different ways, depending on which of them is more important. The widely used way is calculating the f-score (van Rijsbergen, 1979)

F = \frac{(\beta^2 + 1) \, P \, R}{\beta^2 P + R},    (4.17)

where β is a parameter of the f-score. Often β = 1, that is, precision and recall are given equal weights. Higher β values would favour recall over precision.

4.6 Comparison to Other Evaluation Methods

Evaluation methods are always subjective and there is no way to grade them statistically. This is why it is difficult to compare evaluation methods and judge which one is better. Because it cannot be proved that our method outperforms the others, we present an example which might explain why we believe it does. There is no standard method, but all evaluations are based on insertions and deletions with some tolerances. Let us compare the use of such methods with the fuzzy recall and precision for the example presented in Fig. 4.8. The indexes come from the segmentation method (Ziółko et al., 2006b). One index unit corresponds to 5.8 ms. The very first and last boundaries are not included, due to the assumption that they are supposed to be perfectly detected. Table 4.3 lists the membership function f(x) for all boundaries. In the lower rows, insertions and deletions for all possible tolerances are marked. The symbol X marks a boundary counted as a deletion or insertion at a given tolerance, while boundaries accepted as correct at a given tolerance are left unmarked. The number of insertions and deletions is given in brackets in the first column. As we use only a single word, the results are the same for many tolerance levels. For a larger corpus this does not happen. It is clearly visible that counting insertions and deletions is less accurate, unless one uses tolerance levels with a resolution equal to the index resolution. In particular, using a single tolerance level smooths away information about boundary detections. Perfectly accurate detections are graded in the same way as imperfect ones which merely fall within the tolerance. Using several tolerance levels improves the quality of evaluation but is still just a step towards a high-resolution evaluation method such as the suggested fuzzy recall and precision. Another issue is the length of phonemes. A method based on tolerances gives a grade without comparing the tolerance to the length of a given phoneme. In other words, our method is better because the membership function f(x) is calculated from the percentage of the phoneme length by which a boundary was missed, and not from a constant tolerance value. In the presented example, phoneme lengths vary from 11 (64 ms) to 47 (273 ms). For example, the tolerance of 3 (17 ms) is effectively much higher for the shortest unit than for the longest one. There is no such flaw in our method. The algorithm was implemented in C++. The final grades for the given word are: precision 0.813901, recall 0.697629, f-score 0.751293.

4.7 Experimental Results of DWT Segmentation Method

Our first set of results looks at the usefulness of the six wavelet functions for analysing phoneme boundaries. The obtained results for different wavelets (see Table 4.4) show the differences in their efficiency. They suggest that the discrete Meyer wavelet (Fig.
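To make the evaluation concrete, the following short Python sketch applies equations (4.15)-(4.17) to the f(x) values of the exemplar word from Table 4.3 below. The function names are illustrative assumptions and the thesis implementation was written in C++; since the f(x) values in the table are rounded, the printed scores only approximate the exact grades quoted above (0.8139, 0.6976, 0.7513).

```python
# Minimal sketch of fuzzy precision, recall and f-score (eqs 4.15-4.17).
# Helper names are illustrative; the original implementation was in C++.

def membership(detected_idx, hand_idx, phoneme_len):
    """f(x) = (a - b)/a: a is half the length of the phoneme in which the
    boundary was detected, b the distance to the matched hand-made boundary."""
    a = phoneme_len / 2.0
    b = abs(detected_idx - hand_idx)
    return max(0.0, (a - b) / a)

def fuzzy_scores(f_values, n_relevant, n_retrieved, beta=1.0):
    """f_values: f(x) for every retrieved boundary (0 for unmatched detections);
    n_relevant = |G| hand-made boundaries, n_retrieved = |A| detected ones."""
    total = sum(f_values)
    precision = total / n_relevant                      # eq (4.15)
    recall = total / n_retrieved                        # eq (4.16)
    fscore = (beta ** 2 + 1) * precision * recall / (beta ** 2 * precision + recall)
    return precision, recall, fscore                    # eq (4.17)

# f(x) row of Table 4.3: seven detected boundaries, six hand-made ones
f_values = [0.78, 0.93, 0.36, 0.91, 0.95, 0.95, 0.0]
print(fuzzy_scores(f_values, n_relevant=6, n_retrieved=7))
# approximately (0.813, 0.697, 0.751)
```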
4.2) (Abry, 1997) performs best in this case, probably because of its symmetry in the time domain, which helps in synchronisation of the subbands. Asynchronisation in the time domain can be caused by ripples in the frequency domain. An experiment using two wavelets (Meyer and sym6), one after another, was also conducted. As might be expected, it improved results only a little, while it almost doubled the calculation time. Analysing seven subbands was also checked, where the seventh one covered the band from 62.5 Hz to 125 Hz. The accuracy of our phoneme detection technique was then compared with some standard framing techniques (see Table 4.5), such as constant segmentation methods, where the speech is broken into fixed-length segments, and with the speech signal being segmented randomly.

Table 4.3: Comparison of fuzzy recall and precision with commonly used methods based on insertions and deletions for an exemplar word

beg    9    56    89   113   156   196
end   10    58    90   114   158   198
auto  15    59    97   112   159   195   206

fuzzy recall and precision
f(x)  0.78  0.93  0.36  0.91  0.95  0.95  0

insertions and deletions without tolerance:
Ins(7) X X X X X X X;  Del(6) X X X X X X
with tolerance from 1 (5.8 ms) to 4 (23.2 ms) - same results:
Ins(3) X X X;  Del(2) X X
with tolerance 5 (29 ms) or 6 (34.8 ms):
Ins(2) X X;  Del(1) X
with tolerance 7 (40.6 ms) or higher:
Ins(1) X;  Del(0) -

Table 4.4: Comparison of the proposed method using different wavelets

Method            av. recall  av. precision  f-score
Meyer             0.7096      0.7408         0.7249
db2               0.6770      0.7562         0.7144
db6               0.7029      0.7414         0.7217
db20              0.7034      0.7408         0.7216
sym6              0.7015      0.7426         0.7215
haar              0.6377      0.8042         0.7113
Meyer+sym6        0.6825      0.7936         0.7339
Meyer 7 subbands  0.6449      0.6714         0.6579

Table 4.5: Comparison of some other segmentation strategies and the proposed method

Method         av. recall  av. precision  f-score
Const 23.2 ms  0.9651      0.1431         0.2493
Const 92.8 ms  0.7635      0.4659         0.5787
SVM            0.50        0.33           0.40
Wavelet        0.7096      0.7408         0.7249

The accuracy of constant segmentation for many multiples of 5.8 ms (the time length between neighbouring discrete times) was evaluated, but we present results only for 23.2 ms, as it corresponds to the typical frame length in speech recognition, and for 92.8 ms, for which the result is the best of all constant segmentations. We also trained the SVM using powers and derivatives from DWT subbands. The features for the SVM included the analysed part of the speech signal as well as its left and right context. No other phoneme segmentation method was available for comparison. While constant segmentation is able to find most of the boundaries with a 23.2 ms frame, this is only at the expense of very short segments and many irrelevant boundaries. The overall score of our method is much superior to the constant segmentation approach. Several researchers claim that syllables are better basic units for ASR than phonemes (Frankel et al., 2007). This is probably true in terms of their content, but it does not seem to hold for detecting unit boundaries. Our method is not perfect, but the observed DWT spectra of speech clearly show that boundaries between phonemes can be extracted. Boundaries between syllables do not seem to differ from phoneme boundaries in the observed DWT spectra, while obviously there are fewer syllable boundaries than phoneme ones. It is therefore difficult to detect syllable boundaries without also finding phoneme boundaries when analysing DWT spectra.
4.8 Evaluation for Different Types of Phoneme Transitions Errors in phoneme segmentation depend on what type of transitions are being detected. The evaluations differ regarding to groups of phonemes because some phonemes have similar spectra, while others differ a lot. These differences depend on acoustic properties of phonemes (Kȩpiński, 2005). The transitions which are more likely to cause errors should be analysed with more care, in example by applying more segmentation methods and considering all results. There are following types of phonemes in Polish (Kȩpiński, 2005): 1. Stops (/p/, /b/, /t/, /d/, /k/, /g/) 2. Nasal consonants (/m/, /n/, /ni/, /N/) 3. Mouth vowels (/i/, /y/, /e/, /a/, /o/, /u/) 4. Nasal vowels (/e /, /a /) 5. Palatal consonants (Polish ’Glajdy’)(/j/, /l /) 6. Unstables (Polish ’Płynne’)(/l/, /r/) 7. Fricatives (/w/, /f/, /h/, /z/, /s/, /zi/, /si/, /rz/, /sz/) 8. Closed fricatives (/dz/, /c/, /dzi/, /ci/, /drz/, /cz/) 9. Silence in the beginnings and ends of recordings 10. Silence inside words CHAPTER 4. PHONEME SEGMENTATION 81 1 1 0.9 2 0.8 3 0.7 4 0.6 5 0.5 6 0.4 7 0.3 8 0.2 9 0.1 10 1 2 3 4 5 6 7 8 9 0 10 Figure 4.10: F-score of phoneme boundaries detection for transitions between several types of phonemes. Phoneme types 1-10 are explained in section 4.8 (1 - stops, 2 - nasal consonants, etc.). This division is made on acoustic properties of phonemes. We do not have enough statistical data to calculate results for transitions between all 39 types of phonemes. It can be assumed that transitions between phonemes of two particular groups face similar problems due to co-articulation and other natural phonetic phenomena. Tables 4.6, 4.7, 4.8 and Fig. 4.10 present evaluation of phoneme segmentation regarding to the transitions of types listed above. Value 0 means that there was no transition of this type. Table 4.6: Recall for different types of phoneme transitions. Type 1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10 0.7204 0.6015 0.4886 0.5089 0.6403 0.5624 0.6148 0.6216 1.0000 0.0399 0.6101 0.5555 0.4493 0.4816 0.5790 0.5445 0.5320 0.5593 1.0000 0.1399 0.5114 0.4686 0.5069 0.5384 0.4534 0.4690 0.4389 0.4771 1.0000 0.4180 0.5776 0.5812 0.0821 0 0.5362 0.5553 0.5299 0.5424 1.0000 0 0.5818 0.5474 0.4605 0.4215 0.5942 0.5428 0.4641 0.4281 1.0000 0.0335 0.5007 0.5087 0.3776 0.4388 0.5520 0.4768 0.4708 0.5288 1.0000 0.0643 0.5877 0.5817 0.4218 0.4380 0.5829 0.5781 0.5203 0.5372 1.0000 0.0835 0.6456 0.6062 0.5872 0.5015 0.6072 0.5558 0.5911 0.6387 1.0000 0.0289 0.5210 0.5658 0.4741 0.3692 0.5563 0.5885 0.5784 0.5169 0 0 0.4194 0.2129 0.3712 0.2155 0.0702 0.2630 0.4661 0.1388 0.0227 0 All silences before speech are marked as perfectly detected due to the evaluation algorithm. Apart from that silences were not detected very well. The reason is that the segmentation method is tuned to phoneme boundaries and not speech-silence transitions. There are other very efficient methods for this task already established (Zheng and Yan, 2004). CHAPTER 4. PHONEME SEGMENTATION 82 Table 4.7: Precision for different types of phoneme transitions. 
Type 1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10 0.6927 0.5523 0.4171 0.4199 0.5943 0.4838 0.5762 0.5573 1.0000 0.0365 0.5788 0.4858 0.3692 0.4124 0.5465 0.4976 0.4875 0.4938 1.0000 0.1399 0.4783 0.4021 0.4433 0.4735 0.3731 0.3987 0.3732 0.4154 1.0000 0.4035 0.5299 0.4996 0.0771 0 0.4688 0.4811 0.4835 0.4926 1.0000 0 0.5465 0.4952 0.3963 0.3789 0.5554 0.4811 0.4150 0.3511 1.0000 0.0310 0.4741 0.4783 0.3033 0.4073 0.5252 0.4271 0.4208 0.4869 1.0000 0.0620 0.5599 0.5375 0.3470 0.3405 0.5488 0.5303 0.4798 0.4809 1.0000 0.0835 0.6094 0.5569 0.5207 0.4222 0.5443 0.5174 0.5324 0.5692 1.0000 0.0289 0.3115 0.3928 0.2899 0.1987 0.3811 0.4203 0.4158 0.3209 0 0 0.4108 0.2129 0.3423 0.1826 0.0645 0.2630 0.4452 0.1333 0.0227 0 Table 4.8: F-score for different types of phoneme transitions. The scores above 0.5 were bolded. Type 1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10 0.7063 0.5759 0.4500 0.4601 0.6164 0.5202 0.5949 0.5877 1.0000 0.0382 0.5940 0.5183 0.4053 0.4443 0.5623 0.5200 0.5088 0.5245 1.0000 0.1399 0.4943 0.4328 0.4730 0.5038 0.4093 0.4310 0.4034 0.4441 1.0000 0.4106 0.5528 0.5373 0.0795 0 0.5002 0.5155 0.5056 0.5163 1.0000 0 0.5636 0.5200 0.4260 0.3991 0.5742 0.5101 0.4382 0.3858 1.0000 0.0322 0.4870 0.4931 0.3364 0.4225 0.5383 0.4506 0.4444 0.5070 1.0000 0.0632 0.5734 0.5587 0.3807 0.3831 0.5654 0.5532 0.4992 0.5075 1.0000 0.0835 0.6270 0.5805 0.5519 0.4584 0.5740 0.5359 0.5602 0.6019 1.0000 0.0289 0.3899 0.4637 0.3598 0.2583 0.4523 0.4904 0.4838 0.3960 0 0 0.4150 0.2129 0.3562 0.1977 0.0672 0.2630 0.4555 0.1360 0.0227 0 DWT was also tested for parametrisation of speech (Farooq and Datta, 2004). The unvoiced stops (/p/, /t/, /k/) were found more difficult to be recognised than vowels (/aa/, /ax/, /iy/) and unvoiced fricatives (/p/, /t/, /k/). In our case stops did not cause difficult problems for locating them correctly. Actually, the highest f-score (0.7063) was obtained for the boundaries between two stops and the second grade (0.6270) between stops and closed fricatives. Also transitions from palatal consonants to stops were evaluated highly (0.6164). Transitions between two closed fricatives were another group of easy ones to be detected (0.6019). The most difficult for detection were transitions from mouth vowels to any type apart from closed fricatives, especially to nasal vowels (0.0795), unstables (0.3364) and fricatives (0.3807). Also transitions to mouth vowels were difficult to locate correctly. The only exception was from nasal vowels to mouth vowels (0.5038), which is surprisingly large, comparing to 0.0795 for a transition in the other way. Another group of boundaries with low F-scores were transitions from nasal vowels apart from the mentioned transition to mouth vowels. Especially difficult were transitions to fricatives (0.3831) and palatal consonants (0.3991). There are no transitions from one nasal vowel into another one. The transitions from closed fricatives to palatal consonants, from unstables to unstables and fricatives to palatal consonants, unstables and another fricatives were also difficult to be detected properly. According to our results it is relatively easy to find a boundary between phonemes of the CHAPTER 4. PHONEME SEGMENTATION 83 same group if such transition is possible. F-score for such boundaries is usually above 0.5. This is slightly surprising and counterintuitive because phonemes of the same group have typically similar spectra and it could be expected to be difficult to differentiate them. Tables 4.6, 4.7 and 4.8 are not symmetric. 
It is not very surprising because phoneme spectra are not symmetric. Their ends and starts can vary significantly. This is why, it might be easier to locate a beginning of a particular phoneme than its end. The gained statistical knowledge can improve the quality of segmentation. In case of large vocabulary continuous speech recognition, the recognition follows the segmentation. If a phoneme which is know to cause errors for segmentation is detected, its boundaries can be re-evaluated by another more sophisticated or simply other method. Then another segmentation decision can be taken, leading to a better final recognition. 4.9 LogitBoost WEKA Classifier Speech Segmentation WEKA is a graphical data mining and machine learning software providing many classifiers. The procedure called ‘boosting’ is the important classification methodology. The WEKA LogitBoost classifier is based on well known AdaBoost procedure (Friedman et al., 1999). The AdaBoost procedure trains the classifiers on weighted versions of the training samples. It gives higher weights for those which are misclassified. That part of procedure is conducted for a sequence of weighted samples. Afterwards the final classifier is defined to be a linear combination of the classifiers from each stage. Logistic boost (Friedman et al., 1999) uses the adaptative Newton algorithm to fit an additive multiple logistic regression model. So it calls a classifier repeatedly in series. A distribution of weights is updated each time. In this way it indicates the importance of examples in the data set for the classification. The main point of being adaptative is that, on each round, the weights of each incorrectly classified example are increased. The new classifier focuses more on those examples. Logistic regression fits data to a logistic curve to specify prediction of the probability of occurrence of an event. There were many more non-boundary points in feature space than those which really represent boundaries. This is why we cloned all sets of features representing phoneme boundaries for 30 times to keep a similar ratio of boundaries and non-boundaries. We used 70 % of all feature points as training data and 30 % for a test in every experiment. 4.10 Experimental Results for LogitBoost Seven different sets of features for the same classifier and same test data were tested to check which features are useful. The differences between following sets are described. The classification was evaluated using popular precision and recall measure (van Rijsbergen, 1979) which is presented in tables and by percentage of properly classified instances which are given in text for all cases. Two evaluations are provided for every set of features to help in grading the method because we did not manage to find any other similar system to use to present as a baseline. We started with one left and one right context subset of features to describe the surrounding CHAPTER 4. PHONEME SEGMENTATION 84 part of signal. We included first and second derivatives and both of them were smoothed. Different subbands were smoothed using different windows (see Tab. 4.1). We found that this method is the most efficient in our previous experiments (Ziółko et al., 2006a). That gives 54 features in total. 64 % of test instances were correctly classified. The more exact results using recall and precision evaluation are presented in Tab. 4.9. The final measure is f-score presented separately for sets of features describing frames with boundaries and without. 
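The feature sets quoted in this and the following paragraphs can be pictured with a short sketch. It assumes six DWT subbands with a smoothed power envelope plus its first and second derivatives per frame, which matches the quoted feature counts (54, 90 and 84), but the exact smoothing windows of Tab. 4.1 and the original Matlab/WEKA pipeline are not reproduced here.

```python
# Sketch of how the classifier feature vectors are assembled: six subband power
# envelopes with first and second derivatives for the analysed frame and its
# context frames. Assumes 6 subbands and per-frame {power, d1, d2}; this matches
# the quoted feature counts (3 frames -> 54, 5 frames -> 90, 7 frames without
# the 2nd derivative -> 84) but is an illustration, not the thesis code.
import numpy as np

N_SUBBANDS = 6

def frame_features(power, use_second_derivative=True):
    """power: (n_frames, 6) array of smoothed subband powers."""
    d1 = np.gradient(power, axis=0)
    d2 = np.gradient(d1, axis=0)
    parts = [power, d1, d2] if use_second_derivative else [power, d1]
    return np.concatenate(parts, axis=1)          # (n_frames, 18 or 12)

def context_vector(feats, t, left, right):
    """Stack the feature rows of frames t-left ... t+right into one vector."""
    rows = [feats[min(max(i, 0), len(feats) - 1)]
            for i in range(t - left, t + right + 1)]
    return np.concatenate(rows)

power = np.random.rand(100, N_SUBBANDS)           # dummy subband envelopes
feats = frame_features(power)
print(context_vector(feats, 50, 1, 1).shape)      # (54,)
print(context_vector(feats, 50, 2, 2).shape)      # (90,)
feats12 = frame_features(power, use_second_derivative=False)
print(context_vector(feats12, 50, 3, 3).shape)    # (84,)
```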
The second group is labelled in Tab. 4.9 as phonemes, as these are segments from inside phonemes, far from boundaries. From a practical point of view we are interested in detecting boundaries, so the evaluation of the classification of boundary frames is crucial. So for the first set of features the most important grade is the f-score of 0.45 (Tab. 4.9).

Table 4.9: Experimental results for the LogitBoost classifier. The rows labelled boundary are for classifying segments representing boundaries. The rows labelled phoneme present grades for classifying segments inside phonemes which are not boundaries. From a practical point of view the boundary labels are the important ones; the grades for the phoneme labels are just for reference.

set of features                                label     precision  recall  f-score
Basic                                          boundary  0.583      0.366   0.45
                                               phoneme   0.659      0.824   0.732
Without smoothing the second derivative        boundary  0.588      0.386   0.466
                                               phoneme   0.665      0.818   0.733
Normalisation by whole energy value            boundary  0.551      0.077   0.135
                                               phoneme   0.607      0.958   0.743
By max in a subband for a given utterance      boundary  0.59       0.317   0.413
                                               phoneme   0.649      0.851   0.737
With wider context                             boundary  0.618      0.447   0.519
                                               phoneme   0.682      0.811   0.741
Even wider context but without 2nd derivative  boundary  0.699      0.162   0.263
                                               phoneme   0.703      0.966   0.814
Asymmetric context                             boundary  0.609      0.2     0.302
                                               phoneme   0.712      0.939   0.81

We managed to improve results slightly by leaving the second derivative unsmoothed; there were no other changes in the set of features. 64% of test instances were correctly classified, as for the previous set of features, but the more exact evaluation presented in Tab. 4.9 indicates some improvement through a higher f-score, namely 0.466. In the next approach, we kept the same number and type of features but the subband features were normalised by dividing by the energy. In that way, 60.384% of test instances were correctly classified, with an f-score of only 0.135 (Tab. 4.9). We also tried another normalisation approach, dividing features by the maximum in a given subband for the analysed utterance. 63.6347% of test instances were correctly classified, but the f-score is also quite low, namely 0.413 (Tab. 4.9). Surprisingly, none of the normalisation methods improved results. Finally, we experimented with wider left and right contexts. We added more subsets of features for the signal around the analysed frame. We obtained 66% of test instances correctly classified by including two context frames to the left and two to the right. In that case we had a set of 90 features with a relatively high f-score of 0.519 (Tab. 4.9). To use a wider context, namely three frames to the left and three to the right, we had to skip the second derivative, because the number of features was too large for WEKA to handle. In that way we had a set of 84 features. 70% of test instances were correctly classified, but the recall for boundary frames was very low, just 0.162, which caused the f-score to be only 0.263 (Tab. 4.9). This means that, generally, this set of features is not effective. A context of three frames to the left and one to the right was also checked. In that experiment we used the second derivatives, so we had 90 features. We obtained a correctness of 70% but the f-score for boundaries was again quite low, only 0.302 (Tab. 4.9).

4.11 Conclusion

ASR systems could be improved if an efficient phoneme segmentation method was found. Innovative segmentation software was designed and implemented in Matlab. An f-score of 0.72 was achieved for the phoneme segmentation task by analysing envelopes of discrete Meyer wavelet subband powers and their derivatives.
This is a very good result compared to 0.4 for the SVM, 0.58 for constant segmentation and 0.46 for the LogitBoost WEKA classifier. DWT is a good tool for analysing speech and extracting segments for further analysis. It achieves better results than all the baselines, including the WEKA machine learning LogitBoost classifier, for which several sets of features were tested and compared. The segmentation evaluation was also analysed and some flaws of the typical approaches were identified. It was suggested that segmentation evaluation could be improved by the application of fuzzy logic. Segmentation is a subfield of speech analysis which has not been investigated enough in ASR. Our solution showed a new direction of possible improvements in ASR for any language. Segmentation allows modelling to be more precise. Systems based on framing and HMMs miss some of the information in the speech which could be used in recognition if efficient phoneme segmentation was done first. This information, once lost, cannot be recovered in the further steps, which results in worse efficiency of the whole system. There are types of phoneme transitions which are more difficult to detect than others. The average F-score for our segmentation method based on DWT varies from 0.0795 to 0.7063 for transitions between different acoustic types of phonemes. The experiments support the hypothesis that, in general, it is more difficult to locate boundaries of vowels than of other phonemes. One of the reasons may be that vowel spectra are often less distinctive than others. Another reason might be that vowels are relatively short compared to other types of phonemes. DWT is one of the most perceptually motivated analysis tools. It enables extraction of the subbands important to the human ear. It outperforms the SFT because the size of the DWT window changes depending on the frequency subband, as presented in Fig. 4.1. In the SFT, low and high frequencies are analysed with the same resolution. This is not efficient, because a relatively short frame is needed for analysing high frequencies, while it has to be proportionally longer for low frequencies. DWT modifies these lengths automatically, while in the case of the SFT it is necessary to calculate a mel-frequency based cepstrum rather than a regular spectrum from the FFT.

Chapter 5

Language Models

Language modelling is a weak point of ASR. Most of the time n-grams are still the most efficient models. Even though they are such a simple solution, it is difficult to train any better model because of data sparsity. Several experiments were conducted on n-best lists of hypotheses received from the HTK audio model to re-rank the list and improve recognition. The POS tagger model was presented in (Ziółko et al., 2008a) and the first results using a semantic model in (Ziółko et al., 2008b). So far the most popular and often most effective language model is the n-gram model (2.6) described in the literature review chapter. The n-gram is very simple in its nature, because it counts possible sequences of words and uses them to provide probabilities. It is quite unusual that there is no more sophisticated method which would perform better than n-grams by applying more complicated calculations. We did not find any published papers on applying POS tagging in ASR. This is why we decided to check whether it can be successfully used in language modelling instead of n-grams.
It was quite a promising idea, as the grammatical structure of sentences can be described using POS tags, while they provide a much smaller set of elements in a model because several words can be modelled by the same POS tag. One of the problems very often experienced when using n-grams is a lack of training data because there are too many possible words. The situation is even worse in inflective languages, as was described using the example of Russian (Whittaker and Woodland, 2003), where the authors claim that 430,000 words are needed for Russian to provide the same vocabulary coverage as 65,000 words for English. A similar situation can be expected for all inflective languages. Language models can be based on the order of words in sentences, like n-grams, where words are processed as a sequence. Another approach is to process words as a set, where the order is lost. This approach is often called bag-of-words, because we can imagine taking an ordered sequence of words, putting them in a bag and shaking it. This is a visualisation of modelling methods like LSA. In most cases it is used to capture semantic knowledge. In the case of inflective languages the order is not crucial, so losing the information about the order is not very destructive to the method, while it allows one to reduce the amount of data necessary for training. This chapter describes the language modelling part of the research. The methods presented here are designed for inflective languages and tested on Polish, but some of them could be applied to any other language as well. The first model is based on a probabilistic POS tagger. This approach was unsuccessful, but we present it to document the experiment and discuss why we believe it reduced recognition accuracy. Then most of the chapter focuses on a bag-of-words model designed by the candidate. The model has some similarities to LSA in its general concept but differs a lot in realisation, allowing calculations on much more data than LSA.

5.1 POS Tagging

POS tagging (Brill, 1995) is the process of marking up words as corresponding to particular parts of speech, based on both their definition and their context, i.e. their relationship with other words in a phrase, sentence or paragraph (Brill, 1994; Cozens, 1998). POS tagging is more than providing a list of words with their parts of speech, because many words represent more than one part of speech at different times. The first major corpus of English for computer analysis was the Brown Corpus (Kucera and Francis, 1967). It consists of about 1,000,000 words, made up of 500 samples from randomly chosen publications. In the mid 1980s, researchers in Europe began to use HMMs to disambiguate parts of speech when working to tag the Lancaster-Oslo-Bergen Corpus (Johansson et al., 1978). HMMs involve counting cases and making a table of the probabilities of certain sequences. For example, once an article has been recognised, the next word is a noun with a probability of 40%, an adjective with 40%, and a number with 20%. Markov models are a common method for assigning POS tags. The methods already discussed involve operations on a pre-existing corpus to find tag probabilities. Unsupervised tagging is also possible by bootstrapping. Those techniques use an untagged corpus for their training data and produce the tagset by induction. That is, they observe patterns in word structures and provide POS types. These two categories can be further subdivided into rule-based, stochastic, and neural approaches.
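As a small illustration of the HMM-style statistics mentioned above, the sketch below counts tag-to-tag transitions in a toy, hand-made tagged corpus and turns them into the conditional probabilities a stochastic tagger would use; the corpus and tag names are purely hypothetical.

```python
# Tiny illustration of tag-transition statistics for HMM-style tagging:
# count which tag follows which in a (hypothetical) tagged corpus and
# normalise the counts into conditional probabilities.
from collections import Counter, defaultdict

tagged = [("the", "ART"), ("dog", "NOUN"), ("runs", "VERB"),
          ("the", "ART"), ("old", "ADJ"), ("dog", "NOUN")]

transitions = defaultdict(Counter)
for (_, t1), (_, t2) in zip(tagged, tagged[1:]):
    transitions[t1][t2] += 1

p_next = {t1: {t2: c / sum(cnt.values()) for t2, c in cnt.items()}
          for t1, cnt in transitions.items()}
print(p_next["ART"])    # e.g. {'NOUN': 0.5, 'ADJ': 0.5}
```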
Some current major algorithms for POS tagging include the Viterbi algorithm (Viterbi, 1967; Forney, 1973), the Brill tagger (Brill, 1995), and the Baum-Welch algorithm (L. E. Baum and Weiss, 1970) (also known as the forward-backward algorithm). The HMM and visible Markov model taggers can both be implemented using the Viterbi algorithm. POS tagging of Polish was started by governmental research institute IPI PAN. They created a relatively large corpus which is partly hand tagged and partly automatically tagged (Przepiórkowski, 2004; A.Przepiórkowski, 2006; Dȩbowski, 2003; Przepiórkowski and Woliński, 2003). The tagging was later improved by focusing on hand-written and automatically acquired rules, rather than trigrams by Piasecki (Piasecki, 2006). The best and latest version of the tagger has accuracy 93.44%, which is not much comparing to other languages. It might be one of the reasons for the outcome of our experiment. 5.2 Applying POS Taggers for Language Modelling in Speech Recognition There is very little interest in using POS tags in ASR. Their usefulness was investigated. POS tags trigrams, a matrix grading possible neighbourhoods or probabilistic tagger can be created CHAPTER 5. LANGUAGE MODELS 89 and used to predict a word being recognised based on left context analysed by a tagger. It is very difficult to provide tree structures, necessary for context-free grammars, which represent all possible sentences in case of Polish, as the order of words can vary significantly. Some POS tags are much more probable in context of some others, which can be used in language modelling. Experiments on applying morphological information to ASR of Polish language were undertaken using the best available POS tagger for Polish (Piasecki, 2006; Przepiórkowski, 2004). The results were unsatisfactory, probably because of high ambiguity. An average word in Polish has two POS tags. It gives too many possible combinations for a sentence. Briefly speaking applying POS tagging for modelling of Polish is a process of guessing based on uncertain information. HTK (Young, 1996; Young et al., 2005) was used to provide 10 best list of acoustic hypotheses for sentences from CORPORA. The hypotheses were constructed as any combinations of any words from the corpus. The hypotheses are provided as an ordered lists of words. This model was trained in a way which allowed all possible combinations of all words in a dictionary to have more variations and to give opportunity for a language model to improve recognition. Then probabilities of those hypotheses using the POS tagger (Piasecki, 2006) were calculated. The acoustic model can be easily combined with language models using Bayes’ rule by multiplying both probabilities (2.5). 5.3 Experimental Results of Applying POS Tags in ASR Trigrams of tags were calculated using transcriptions of spoken language and existing tagging tools. Results were saved in XML. We received significant help from Dr Maciej Piasecki and his group from the Technical University of Wrocław in this step of research. The results were compared giving different weights for probabilities from the HTK acoustic model and the POS tagger language model. In all situations, the outcome probability gave worse results then pure HTK acoustic model. Histograms of probabilities for correct and wrong recognition were also calculated and they showed unuseful correlation. Some examples of sentences were also analysed and described by human supervisor. They are presented in Table 5.1. 
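The combination of the acoustic and tagger scores used in this experiment can be sketched as follows. The tagger call is a placeholder (the real Polish tagger is an external tool with its own interface), and the weighting by raising the acoustic probability to a power follows the experiment described in the next section.

```python
# Hedged sketch of the n-best rescoring described above: combine the HTK
# acoustic probability with a POS-tagger grade by multiplication (Bayes' rule),
# optionally raising the acoustic probability to a power to weight it more.
# tagger_grade() is a placeholder, not the API of the actual Polish tagger.

def tagger_grade(words):
    # placeholder: probability of the tag sequence assigned to the hypothesis
    return 0.5

def rescore_nbest(nbest, acoustic_weight=4):
    """nbest: list of (words, p_htk) pairs. Returns hypotheses ordered by the
    combined score p_htk**acoustic_weight * p_tagger."""
    scored = [(words, (p_htk ** acoustic_weight) * tagger_grade(words))
              for words, p_htk in nbest]
    return sorted(scored, key=lambda item: item[1], reverse=True)

nbest = [(["word1", "word2", "word3"], 0.12),
         (["word1", "word2", "word4"], 0.10)]
print(rescore_nbest(nbest)[0][0])
```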
In total 331 occurrences were analysed. Only 282 of them had correct recognition in the whole 10 best list. An average HTK probability of correct sentences was 0.1105. Exactly 244 of all occurrences had a correct hypothesis on the first position of the 10 best list. 73.72 % of occurrences were correctly recognised while using only HTK acoustic model. Only 53 occurrences were recognised applying probabilities from the POS tagger, even when HTK probabilities were 4 times more important than those from POS tagger. The weight was applied by raising HTK probability to power of 4. It gives 16.01 % of correct recognitions for a model with POS tag probabilities, which is a very disappointing result. The POS tagger was trained on a different corpus than the one used in an experiment described above. This is why we decided to conduct an additional experiment. We recorded 11 sentences from the POS tagger training corpus. They were recognised by HTK, providing 10 best list and used in a similar experiment, as the one described above. The amount of data is not enough to provide statistical results, but observations on exact sentences (Table 5.3) provide the same CHAPTER 5. LANGUAGE MODELS 90 Table 5.1: Results of applying the POS tagger to language modelling. First, a sentence in Polish is given, then a position of a correct recognition in 10 best list. The description of tagger grade for the correct recognition follows Lubić czardaszowy pla̧s 1, Tagger grade is very low. Cudzy brzuch i buzia w drzewie 4, Tagger grade is higher than for wrong recognitions. W ża̧dzy zejdȩ z gwoździa There is no correct sentence in the 10 best list. Krociowych sum nie żal mi 1, Tagger grade is higher or similar then other recognitions in top 6 but lower then 7th Móc czuć każdy odczynnik 6, Tagger grade is lower than for most of the wrong recognitions including first two hypotheses. However, the wrong recognition with highest probability is grammatically correct. On łom kładzie lampy i kołpak 7, Tagger grade is low. Rybactwo smutnieje on siȩ śmieje There is no correct sentence in the 10 best list. On liczne taśmy w cuglach da 2, Tagger grade is low, but still highest in the first 5 hypotheses. Ten chór dusiłem licznie There is no correct sentence in the 10 best list. Chciałbym wpaść nas sesjȩ There is no correct sentence in the 10 best list. Żółtko wlazło i co zrobić There is no correct sentence in the 10 best list. Wór rur żelaznych ważył 3, Tagger grade is lower than for the sentence on the first position. U nas ludzie zwa̧ to fuchy There is no correct sentence in the 10 best list. On myje wróble w zoo There is no correct sentence in the 10 best list. Boś cały w wiśniowym soku 3, Tagger grade is higher then for 7 top hypotheses. Na czczo chleby i pyry z dżemem There is no correct sentence in the 10 best list. Lech być podlejszym chce 1, Tagger grade is the lowest in top 5 hypotheses but most of them are grammatically correct. CHAPTER 5. LANGUAGE MODELS 91 Table 5.2: Results of applying the POS tagger to language modelling. First, a sentence in Polish is given, then a position of a correct recognition in 10 best list. The description of tagger grade for the correct recognition follows (2nd part) Żre jeż zioła jak dżem John 1, Tagger grade is higher than for top 4 hypotheses. Masz dzisiaj różyczkȩ zielona̧ 1, Tagger grade is lower than for the second hypothesis which has no sense but morphologically is correct. 
Weź daj im soli drogi dyzmo 2, Tagger grade is very close to the most probable hypothesis, which is also grammatically correct. Weź masz ramki opolskie 1, tagger grade is higher than for the second hypothesis but lower than for the third one. Dźgna̧ł nas cicho pod zamkiem 1, Tagger grade is highest of all. Tam śpi wojsko z bronia̧ 6, Tagger grade is second of all, the highest one is acoustically 5th. Nie odchodź bo żona idzie 3, tagger grade is highest but equal to three others, which has acoustical probability lower. Tym można atakować 5, Tagger grade is higher than for the acoustically most probable sentence but lower than for all other between 1 and 5, however all of them are grammatically correct. Zmyślny kot psotny ujdzie 1, Tagger grade is higher then second and third hypothesis. Niech pan sunie na wschód 4, Tagger grade is higher than for 7 most probable acoustically. conclusion as in the main experiment. The recognitions, which were found using HTK only, had fewer errors for 6 sentences. then 5 times the number of errors was the same. One sentence was correctly recognised for both models. One more was correctly recognised using just HTK acoustic model. 5.4 Bag-of-words Modelling A new method of language modelling for ASR is presented (Ziółko et al., 2008b). The method has some similarities to LSA, but it does not need so much memory and gave better experimental results, which are provided as percentage of correctly recognised sentences from a corpus. The main difference is a choice of similar topics influencing a matrix describing probability of words appearing in topics. Recently, graph based methods (Harary, 1969; Véronis, 2004; Agirre et al., 2006) have become more and more popular. In case of our algorithm, graphs are used instead of applying SVD in order to smooth information between different topics. Graphs help us to locate and grade similar topics. An important advantage of our method is that it does not need much memory at once to process CHAPTER 5. LANGUAGE MODELS 92 Table 5.3: Results of applying the POS tagger on its training corpus. First version of a sentence is a correct one, second is a recognition using just HTK and third one using HTK and POS tagging. 
Then the number of differences comparing to a correct sentence were counted and summarised i do licha coście mi wczoraj dali takiego że teraz ledwo wiem jak siȩ nazywam i do i w coście mi wczoraj dali takiego że teraz ledwo wiem nie siȩ nazywam i do i w coście w wczoraj dali takiego że teraz ledwo wiem nie siȩ nazywam htk is better nie mówia̧c o tym kim ja jestem skinȩła głowa̧ zawstydzona nie w wiem nocy nocy nie jestem skinȩła głowa̧ zawstydzona nie w wiem nocy nocy nie jestem skinȩła bo w w w zawstydzona same number of errors htk is better to okropne obudzić siȩ po nocy spȩdzonej z kimś czyjego imienia siȩ nie pamiȩta to okropne obudzić siȩ minut spȩdzonej z kimś czyjego imienia ciȩ nie pamiȩta to okropne obudzić w nocy spȩdzonej w kimś czyjego imienia ciȩ nie pamiȩta htk is better parȩ minut temu nie pamiȩtałam nawet że jestem w innym świecie parȩ minut temu nie pamiȩta nawet jestem innym świecie parȩ minut temu nie pamiȩta nawet w jestem innym świecie same number of errors poleż teraz spokojnie zasłoniȩ okno bo widzȩ że światło ciȩ razi poleż teraz spokojnie zasłoniȩ o okno bo widzȩ ciȩ światło ciȩ razi poleż z teraz spokojnie zasłoniȩ o okno bo widzȩ ciȩ światło ciȩ razi htk is better same number of errors zobaczysz wszystko bȩdzie dobrze pamiȩtasz że opuściła sanktuarium zobaczysz wszystko bȩdzie dobrze pamiȩta że opuściła sanktuarium zobaczysz wszystko bȩdzie dobrze pamiȩta że opuściła sanktuarium w same number of errors htk is better o tak pamiȩtała wszystko powróciło z pełna̧ wyrazistościa̧ o tak pamiȩta wszystko powróciło pełna̧ wyrazistościa̧ o tak w pamiȩta wszystko powróciło pełna̧ wyrazistościa̧ htk is better w końcu tyle razy o tym myślała i wcia̧ż nie mogła poja̧ć jak do tego doszło końcu ciȩ teraz nocy myślała w wcia̧ż nie tego tym ciȩ bȩdzie do doszło końcu ciȩ teraz nocy myślała wcia̧ż nie tego świecie bȩdzie do doszło same number of errors CHAPTER 5. LANGUAGE MODELS 93 Histogram of probabilities for correct hypotehsis 180 160 140 120 counts 100 80 60 40 20 0 0−0.1 0.1−0.2 0.2−0.3 0.3−0.4 04−0.5 0.5−0.6 0.6−0.7 0.7−0.8 0.8−0.9 0.9−1 probability Figure 5.1: Histogram of POS tagger probabilities for hypotheses which are correct recognitions any amount of data. It is in contrary to LSA, which is quite limited in real applications for this reason. SVD is conducted on the entire matrix in LSA which means that a model with a few thousands words and a few hundred topics might be a challenge for memory of a regular PC. Our method does not need to do operations on the entire matrix. There are other approaches to face this issue like by applying generalised Hebbian algorithm (Gorrell and Webb, 2005). The main aspect of modelling in our method is based on semantic analysis, which is an important innovation of ASR, as the very last step of the process. It can be applied as an additional measure to use the non-first choice word recognition hypothesis, if they do not fit semantic context. However, the method extracts some syntax information as well. It was designed for Polish, which is highly inflective and not a positional language. For this reason only particular endings can occur in a context of endings of other part of speech elements of a sentence. In example, we can expect female adjectives with female nouns. In the same way, in English we can expect I in a same sentence as am, and you in a same sentence as are, etc. In Polish all verbs have this kind of inflection, however, usually, differences between forms are only in endings, not like to be in English. CHAPTER 5. 
LANGUAGE MODELS 94

Figure 5.2: Histogram of POS tagger probabilities for hypotheses which are wrong recognitions

Figure 5.3: Ratio of correct recognitions to all recognitions for different probabilities from the POS tagger

5.5 Experimental Setup

Semantic analysis might be much more crucial in non-positional languages than in English, due to irregularities in the positional structure of words. Language models based on context-free grammars are quite unsuccessful for non-positional languages. Research on applying LSA in ASR has been done (Bellegarda, 1997) for English only. HTK (Young, 1996; Young et al., 2005) was used to provide 100-best lists of acoustic hypotheses for sentences from the test corpora. The MFCCs (Davis and Mermelstein, 1980; Young, 1996) were calculated for parametrisation with a standard set of 39 features. 37 different phonemes were distinguished using a phonetic transcription provided with CORPORA. Several experiments were conducted to evaluate the method. The first one was kept very simple in order to obtain a general view only. The audio model was trained on the male speakers of CORPORA (Grocholewski, 1995). The corpus was organised as follows: all single letters are combined into one topic, all digits into another, names and commands separately into two more. Every sentence is also treated as a topic. In this way 118 topics are provided. They consist of 659 different words in total. In the preliminary experiment we used as the testing set 114 simple sentences spoken by a male speaker not included in the training set. All other utterances are obviously too short to use them in language modelling. In the following experiments HTK was also used to provide 100-best lists. The main difference was the division between the training and testing corpora. Training data was collected from the internet and from e-books from several sources described later in detail. Testing sentences were created by the author and recorded on a desktop PC with a regular microphone and some, but very little, background noise.

5.6 Training Algorithm

The entire algorithm is illustrated on a simple English example in one of the following sections. Several versions of the algorithm were applied and tested. Some of the differences are presented in the following sections with experimental results. Here, we describe the final version, which performs best. The training algorithm starts with creating the matrix

S = [s_{ik}],    (5.1)

representing semantic relations, where rows i = 1, ..., I represent topics and columns k = 1, ..., K represent words. Each matrix value s_{ik} is the number of times word k occurs in topic i. Some words are so common that they appear in almost all topics. The appearance of these words carries little semantic information, following the entropy rule. Words which appear only in certain topics say more about the semantic content. This is why all values of (5.1) are divided by the sum for the given word over all topics to normalise them. In this way the importance of commonly appearing words is reduced for each topic. A measure of similarity between two topics is

d_{ij} = \sum_{k=1}^{K} s_{ik} s_{jk}.    (5.2)

Figure 5.4: Undirected, complete graph illustrating similarities between sentences

It has to be normalised according to the formula

d'_{ij} = d_{ij} / \max_{i,j} \{d_{ij}\}.    (5.3)
As a result, values 0 ≤ d'_{ij} ≤ 1 are obtained. These topic similarities are analysed as follows:
1. Create an undirected, complete graph (Fig. 5.4) with topics as nodes and d'_{ij} as the weights of the edges. Let us define the path weight

p_{ij} = \prod_{(a,b) \in P(i,j)} d'_{ab},    (5.4)

where P(i, j) is the sequence of edges in the path from i to j. In the simplest case of a single edge from i to j, the path weight is d'_{ij}. In the case of a multi-edge path, it is the product of the similarities of all edges on the path (5.4). If there are several paths, we always take the path with the largest similarity as the path weight (5.4).
2. For each node, we need to find the n nodes with the highest path weights between those nodes and the given, analysed topic node. This allows us to define a list N of semantically related topics, which consists of the n nodes with their measures. The exact implementation of this part is presented in the next section.
3. The matrix S has to be recalculated to include the impact of similar topics. Smoothed word-topic relations are expressed by the matrix

S' = [s'_{ik}].    (5.5)

For all topics in matrix (5.1), we add all values of the topics from the list of related topics, multiplied by the measure for a given pair of topics. The elements of S' are

s'_{ik} = s_{ik} + \alpha^{-1} \sum_{j \in N} p_{ij} s_{jk}.    (5.6)

The coefficient α is a smoothing factor which weights the influence of other topics on matrix S'. N is the list of similar topics found in step 2. The matrix element (5.6) is a measure of the likelihood that the kth word appears in the ith topic. Matrix (5.5) stores counts of words present in particular topics. They can be represented as

C(word_k, s_i) = c.    (5.7)

We should not assume that there can be zero probability of any word appearing in any topic. This is why we replace all zeros in (5.5) with the small value s'_min = 0.01. If (5.7) were normalised to have values between 0 and 1, it would be probabilistic information of the type

P(word_k | s_i) = p.    (5.8)

The sum of the values in (5.5) is not equal to 1, which is why the counts (5.7) are not probabilities according to the definition. However, (5.7) satisfies all other conditions of a probability and can often be treated as if it were (5.8). For this reason, the sum of all values in (5.5) is calculated and then every value in (5.5) is divided by it. In this way the s'_{ik} become probabilities, as their sum is equal to 1. In the further sections we will assume that (5.8), rather than (5.7), is stored in (5.5) and that the s'_{ik} are probabilities.

5.7 Process of Finding the Most Similar Topics

A group of the longest paths, where the distance is calculated using a product over edges rather than a sum, has to be found in step 2 of the algorithm described in the previous section. This can be achieved by implementing the following algorithm:
1. Find the n single-edge paths with the highest measures d'_{ij}.
2. Check whether the two-edge path P(i, m), starting from node i, going through the node j with the highest measure d'_{ij} found in the step above and continuing to any other node m, has a better measure p_{im} than the lowest of the n solutions found in the step above. If it does, then replace the lowest one with m in the list of n similar topics.
3. Conduct the step above for all other single-edge paths from the list apart from the lowest, nth element.
4. If there are any non-single-edge paths P(i, j) on the list at a position other than the nth, repeat a process similar to step 2. Check whether, after adding any other edge, the measure of the path p_{ij} is higher than the measure at the nth position.
Then replace the previous path with a new, longer path with a higher p_{ij}.

It can be proved that the process is exhaustive in one direction (from the analysed topic). Let us name the analysed topic i and the set of the n most similar topics to i, found in the first step of the process (using the measure d'_{ij}), N_1. Let l be the element of N_1 with the lowest measure of similarity d'_{ij}. As a result of the algorithm presented above, we obtain

d'_{i n_1} > d'_{ij}   ∀ n_1 ∈ N_1, ∀ j ∉ N_1.    (5.9)

Table 5.4: Matrix S for the example with 4 topics and a row of S' for topic 3

      big  John  has   house  black  aggr.  cat   small  mouse  is  mammal
1     1    1     1     1      0      0      0     0      0      0   0
2     1    1     1     0      1      1      1     0      0      0   0
3     0    0     1     0      1      1      1     1      1      0   0
4     0    0     0     0      0      0      0     1      1      1   1
3'    7/8  7/8   15/8  1/2    11/8   11/8   11/8  1      1      0   0

Table 5.5: Matrix D for the presented example

     1  2  3  4
1    4  3  1  0
2    3  6  4  0
3    1  4  6  2
4    0  0  2  4

Let us define a set N_2 = T \ ({i_a} ∪ N_1) of topics not included in the list of similar topics, where T is the set of all topics and {i_a} is the one-element set containing the analysed topic i_a. From definition (5.3)

0 ≤ d'_{ij} ≤ 1   ∀ i, j ∈ {1, . . . , I},    (5.10)

therefore

d'_{ij} d'_{jk} ≤ d'_{ij}   ∀ j ∈ N_2,    (5.11)

where k is any topic. From (5.9) and (5.11)

d'_{i n_1} > d'_{ij} d'_{jk}   ∀ j ∈ N_2.    (5.12)

As the same reasoning can be applied for further iterations (three-edge paths and so on), (5.11) and (5.12) prove that the process is exhaustive in one direction. It can skip some solutions leading from other topics to the analysed one. But this is even better from a linguistic point of view, because we do not want topics assigned as being similar to many other topics just because they have a very strong link to one other topic.

5.8 Example in English

Let us consider an example of a corpus consisting of 4 sentences, all of which are treated as separate topics. Big John has a house. Big John has a black, aggressive cat. The black aggressive cat has a small mouse. The small mouse is a mammal. The articles a and the were skipped, as they have no semantic content and do not exist in Polish, which was our experimental language. We count all other words, which creates the matrices S (Tab. 5.4) and D (Tab. 5.5). The following topic similarities are obtained: d'_{12} = 3/4, d'_{13} = 1/4, d'_{14} = 0, d'_{23} = 1, d'_{24} = 0, d'_{34} = 1/2. They construct the graph in Fig. 5.4. Then the list of topics similar to topic three, N_1 = {2, 4}, can be found by applying the first step of the process to the graph. Topic 4 is l in this example, the topic with the lowest measure in N_1, namely 1/2. In the next step, p_{ij} are calculated for two-edge paths starting at node 3 and going through 2. There are two of them. The first one is the path 3-2-4, where p_{34} = 1 · 0 = 0. The second one is the path 3-2-1, where p_{31} = 1 · 3/4 = 3/4 > d'_{34}. This is why topic 4 is replaced by topic 1 and the final list of topics similar to 3 is {2, 1}. Then, assuming α = 2, we can calculate the row for topic 3 of S' (Tab. 5.4, last row).

5.9 Recognition Using the Bag-of-words Model

The recognition task can be described as

s_i = \arg\max_s P(s | word_{k_1}, ..., word_{k_m}),    (5.13)

where s is any topic and word_{k_1}, ..., word_{k_m} is the set of recognised words which were in a sentence. It classifies the bag of words as a realisation of one of the topics in matrix (5.5). Recognition can be conducted by finding the most coherent topic for the set of words W in a provided hypothesis.
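The training procedure up to the choice of related topics can be reproduced on the four-sentence example with the short sketch below. Raw word counts are used throughout (the word-frequency normalisation of matrix S is omitted here, matching the numbers of the worked example), only single- and two-edge paths are explored, and the variable names are illustrative rather than taken from the thesis implementation.

```python
# Sketch of eqs (5.1)-(5.4) on the four-sentence example; raw counts, no
# word-frequency normalisation, and only one- and two-edge paths are explored.
sentences = [
    "big john has house",
    "big john has black aggressive cat",
    "black aggressive cat has small mouse",
    "small mouse is mammal",
]
words = sorted({w for s in sentences for w in s.split()})
S = [[s.split().count(w) for w in words] for s in sentences]        # eq (5.1)

I = len(S)
D = [[sum(S[i][k] * S[j][k] for k in range(len(words)))             # eq (5.2)
      for j in range(I)] for i in range(I)]
d_max = max(D[i][j] for i in range(I) for j in range(I) if i != j)
d = [[D[i][j] / d_max if i != j else 0.0 for j in range(I)]         # eq (5.3)
     for i in range(I)]

def similar_topics(i, n):
    """n most similar topics to topic i using path weights (eq 5.4)."""
    single = {j: d[i][j] for j in range(I) if j != i}
    chosen = sorted(single, key=single.get, reverse=True)[:n]
    weights = {j: single[j] for j in chosen}
    for j in list(chosen):                        # try two-edge paths i-j-m
        for m in range(I):
            if m == i or m in weights:
                continue
            p = d[i][j] * d[j][m]
            worst = min(weights, key=weights.get)
            if p > weights[worst]:
                del weights[worst]
                weights[m] = p
    return weights

# topics similar to topic 3 (index 2): {1: 1.0, 0: 0.75}, i.e. topics 2 and 1
print(similar_topics(2, n=2))
# eq (5.6) would now smooth row 3 of S with these path weights and alpha = 2
```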
It is carried out by finding, over the rows of (5.5), the maximum of the combined elements from the columns representing the words of the hypothesis,

P_{sem} = \max_i \sqrt[|W|]{\prod_{k \in W} s'_{ik}},    (5.14)

where |W| is the cardinality of the set of words W in the sentence. The row i for which the maximum is found is assumed to represent the topic of the sentence being recognised. The calculated value P_{sem} can be used as an additional weight in speech recognition through Bayes' theorem. The values of the probability p_{htk} obtained from the HTK model tend to be very similar for all hypotheses in the 100-best list of a particular utterance. This is why an extra weighting w was introduced to favour the probabilities from the audio model over p_{sem} received from the semantic model. The final measure can be obtained by applying Bayes' theorem:

p = p_{htk}^{w} \, p_{sem}.    (5.15)

5.10 Preliminary Experiment

The first experiment (Ziółko et al., 2008b) was conducted on CORPORA using the same data for training and testing, to evaluate the implementation and estimate the chances of the algorithm being successful without spending several days training a proper model. Because the model was small, it was easy to compare different values of the parameters n, α and w. Results for recognition based on the audio model only are also included. LSA was used as a baseline to evaluate the results of our method. Experiments with several different w for the semantic model based on LSA were conducted. Values in the range between 23 and 26 gave the best results, presented in Tab. 5.6. 45 utterances did not have a hypothesis with the correct sentence anywhere in their 100-best lists. This is why the maximal number of utterances which could be recognised was 69. The experiment shows that our semantic model is useful, even though the results might be so outstanding due to the small number of words in the corpus and the use of the same corpus for training and testing. The same corpus was used for both tasks because phoneme segmentation of the corpus is needed to use HTK, and CORPORA is the only Polish corpus which provides it. However, the comparison of 53% correct recognitions for the best configurations of our model with 36% for LSA and 29% for the audio model only is impressive. The analysed results for different configurations show that the choice of n, the length of the list of topics related to an analysed topic, is not as important as the ratio between n and α, where α is a smoothing factor weighting the impact of related topics. The ratio α/n should be kept around 2/3 in this case in order to provide the best results. The audio model importance weight w is also crucial, as the information from the HTK model is very important and can be effectively ignored if w has too small a value. It has to be stressed that this was a preliminary experiment. Our aim was to check whether it is worth investing more time in research on this model. This is why we used little data and the same set for training and testing. Some elements of the algorithm were not used in this experiment. For example, values in (5.1) were not normalised to be probabilities. We do not claim that the calculated model can be used for any practical task. One more reason for that is that it was trained on CORPORA, which has no semantic connotations. On the other hand, it has to be stressed that for Polish this model keeps some grammatical information as well, even though it was designed as a semantic one. For example, we can expect words with morphology related to one gender in a given sentence, which will be reflected in matrix S.
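A minimal sketch of the recognition step follows, using the reconstruction of (5.14) above and a tiny, purely illustrative two-topic matrix; in a real system S' has hundreds of topics and thousands of words, and the probabilities and weight w below are invented for the example.

```python
# Sketch of eqs (5.14)-(5.15): score n-best hypotheses with a word-topic matrix
# and combine with the HTK acoustic probability. All numbers are illustrative.

s_prime = {                      # s'_ik approximating P(word | topic)
    "parliament": {"sejm": 0.04, "ustawa": 0.03, "kot": 0.01},
    "animals":    {"sejm": 0.01, "ustawa": 0.01, "kot": 0.05},
}
S_MIN = 0.01                     # floor replacing zero counts, as in the text

def p_sem(hypothesis_words):
    """Eq (5.14): best topic score for the bag of words of one hypothesis."""
    best = 0.0
    for row in s_prime.values():
        prod = 1.0
        for w in hypothesis_words:
            prod *= row.get(w, S_MIN)
        best = max(best, prod ** (1.0 / len(hypothesis_words)))
    return best

def rescore(nbest, w=20.0):
    """Eq (5.15): pick the hypothesis maximising p_htk**w * p_sem."""
    return max(nbest, key=lambda h: (h["p_htk"] ** w) * p_sem(h["words"]))

nbest = [
    {"words": ["sejm", "ustawa"], "p_htk": 0.50},   # semantically coherent
    {"words": ["sejm", "kot"],    "p_htk": 0.51},   # acoustically best
]
print(rescore(nbest)["words"])   # the coherent hypothesis wins despite lower p_htk
```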
The results were promising, so more sophisticated experiments using transcriptions from the Polish Parliament, literature, a journal and wikipedia as training corpora were conducted and are described in following sections. Another way of proving usefulness of our bag-of-words model is through calculating histograms psemc of probabilities received from semantic model for hypotheses, which are correct recognitions (Fig. 5.5) and histogram psemw of probabilities received from semantic model for hypotheses, which are wrong recognitions (Fig. 5.6). The ratio psemc /(psemc + psemw ) is presented in Fig. 5.7. It clearly shows a correlation between high probability from the bag-of-words model and correctness of a recognition. 5.11 K-means On-line Clustering The number of topics is limited to around 1000. If a large choice of words in the model is expected then the number of topics has to be kept low to save memory. This is why it is necessary to overcome the limitation in the number of topics for any real applications. It was done by clustering them into representatives of several topics. K-means clustering algorithm was used for this aim. However, it was not possible to apply it directly on all topics at once because of huge amount of data (millions of sentences). This is why we invented an algorithm which we call on-line clustering. CHAPTER 5. LANGUAGE MODELS 101 Table 5.6: Experimental results for pure HTK audio model, audio model with LSA and audio model with our bag-of-words model n α w recognised sentences % LSA 25 41 0.36 HTK 33 0.29 3 1 50 48 0.42 3 2 50 46 0.40 3 3 50 46 0.40 7 1 50 35 0.31 7 3 50 45 0.39 7 5 50 46 0.40 5 1 20 44 0.39 5 2 20 55 0.48 5 3 20 60 0.53 5 4 20 59 0.52 5 5 20 59 0.52 3 2 20 61 0.53 3 1 20 50 0.44 7 6 20 59 0.52 7 5 20 61 0.53 7 4 20 59 0.52 8 4 20 57 0.5 8 5 20 61 0.53 8 6 20 60 0.53 9 1 20 28 0.25 9 3 20 49 0.43 9 5 20 57 0.5 9 6 20 61 0.53 9 7 20 59 0.52 11 5 20 54 0.47 11 7 20 60 0.53 11 8 20 60 0.53 11 9 20 58 0.51 9 6 10 58 0.51 9 6 15 60 0.53 9 6 17 60 0.53 9 6 18 61 0.53 9 6 19 61 0.53 9 6 20 61 0.53 9 6 22 59 0.52 9 6 25 58 0.51 CHAPTER 5. LANGUAGE MODELS 102 number of correct recognitions 40 30 20 10 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 probability 0.8 0.9 1 number of wrong hypotheses Figure 5.5: Histogram of probabilities received from the bag-of-words model for hypotheses which are correct recognitions 400 300 200 100 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 probability 0.8 0.9 1 Figure 5.6: Histogram of probabilities received from the bag-of-words model for hypotheses which are wrong recognitions ratio 0.5 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 probability 0.8 0.9 1 Figure 5.7: Ratio of correct recognitions to all of them for different probabilities received from the bag-of-words model CHAPTER 5. LANGUAGE MODELS 103 The general scheme is to collect n topics from training data. The algorithm is initialised heuristically. Then, they are clustered into n/2 topics using k-means clustering algorithm, which is iterating following two steps until convergence. The first one is to compute membership of each data point x in clusters by choosing the nearest centroid. The second is to recompute a location of each centroid, according to members. When the k-means converge and new topics are chosen, new n/2 topics can be added from new training data and clustering is repeated to reduce it again. This loop is applied as long as there is new training data to be included. In the very end additional clusterisation is conducted to limit the number of topics to n/4. 
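A sketch of the on-line clustering loop is given below. It assumes topic vectors are rows of word counts, uses scikit-learn's KMeans as the clustering step, and weights each centroid by the number of sentences it represents, as explained in the next paragraph; the initialisation heuristics and exact batch sizes of the thesis implementation are not reproduced.

```python
# Sketch of the on-line k-means clustering of topics: keep collecting topic
# vectors, halve them with weighted k-means whenever n are held, and reduce
# to n/4 at the very end. Parameters and batch sizes are illustrative.
import numpy as np
from sklearn.cluster import KMeans

def reduce_topics(vectors, weights, k):
    """Weighted k-means step: k centroids plus the summed sentence counts."""
    km = KMeans(n_clusters=k, n_init=10).fit(vectors, sample_weight=weights)
    merged = np.bincount(km.labels_, weights=weights, minlength=k)
    return km.cluster_centers_, merged

def online_clustering(batches, n):
    """batches: list of arrays of sentence (topic) word-count vectors."""
    vectors = np.empty((0, batches[0].shape[1]))
    weights = np.empty(0)
    for batch in batches:
        vectors = np.vstack([vectors, batch])
        weights = np.concatenate([weights, np.ones(len(batch))])
        if len(vectors) >= n:                      # halve once n topics are held
            vectors, weights = reduce_topics(vectors, weights, n // 2)
    return reduce_topics(vectors, weights, n // 4)  # final reduction to n/4

rng = np.random.default_rng(0)
batches = [rng.integers(0, 3, size=(15, 50)).astype(float) for _ in range(6)]
centroids, counts = online_clustering(batches, n=40)
print(centroids.shape, counts.sum())               # (10, 50) and 90 sentences
```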
Every time, the information on how many sentences are represented by a particular topic is stored and used as a weight when the means are calculated and topics are combined as a result of clustering. Thanks to that, the order in which sentences are fed into the training system does not affect the final clusters. Unfortunately, it is not possible to cluster all sentences at once because of data sparsity. Formula (5.8) holds for topics which represent several sentences in the same way as for those which represent just one sentence. However, it is not possible to calculate the probability of a word given a combined topic from the probabilities related to the topics represented by the combined topic. A new version of matrix (5.5), for the clustered topics, has to be calculated and used instead. This means that the process of collecting statistical data by creating (5.1) has to be finished before the described algorithm is run. Once (5.5) has been created, new statistical data cannot be added to it; if this has to be done, the new data should be added to (5.1) and then (5.5) has to be recalculated from scratch.

5.12 Experiment on Parliament Transcripts

A set of 44 sentences was created using words and language similar to what can be expected in a parliament. They were also designed so that the most common words from the training corpus are used and so that some of the words from the testing set appear in several sentences. They were recorded and the HTK recognition experiment was conducted on them using a triphone model trained on CORPORA, but with the vocabulary limited to the words in these 44 sentences. In this way HTK provided a 100-best list of hypotheses for each of the sentences. They were used in the same way as in the previously described experiment. Matrix (5.1) was created by analysing transcriptions of the Polish Parliament meetings from the years 2005-2007, which form the biggest corpus of transcribed Polish speech. There are differences in sentence construction between spoken and written language; this is one of the reasons why we decided to use this corpus for training. Another is that our model is likely to become part of an ASR system used by the police and the courts, so we are interested in research on very formal language. None of the testing sentences was intentionally taken from these transcriptions; however, it was not checked whether they appear there. The testing set consists of 198 words and those words were included in matrix (5.1). Because of data sparsity, the k-means on-line clustering algorithm described above was used to combine several topics. Every topic is a set of words between two dots in the training corpus. In the ideal case topics are sentences; in practice, dots are also used in Polish to mark abbreviations and ordinal numbers, which influenced the content of the topics.

Table 5.7: 44 sentences in the exact transcription used for testing by HTK and the bag-of-words model, with English translations

platforma obywatelska wymaga funkcjonowania klubu w czasie obrad sejmu
Civic Platform expects the club to operate during parliament proceedings.
dlaczego poseL wojciech polega na opinii zarzAdu
Why does MP Wojciech trust the board opinion?
Latwo skierowaC czynnoSci do sAdu
It is easy to move actions to court.
wniosek rolniczego zwiAzku znajduje siE w ministerstwie
The petition of the agricultural union is in the ministry.
projekt samorzAdu ma wysokie oczekiwania finansowe
The municipality project has high financial expectations.
fundusz spoLeczny podjAL dziaLania w ramach obecnego prawa cywilnego
The communal foundation took steps according to existing civil law.
koalicja chce komisji sejmowej do oceny dziaLalnoSci posLa jana
The coalition wants a parliament commission for evaluation of MP Jan activity.
dzisiaj piEC paN poprze ministra w waZnym gLosowaniu w sejmie
Five women will support the Minister in an important vote today.
poseL ludwik dorn byl na waZnym gLosowaniu po duZym posiLku
MP Ludwik Dorn participated in an important vote after a large meal.
bOg ocenia polskE za powaZne przestEpstwa sektora finansowego w kraju i za granicA
God judges Poland for crucial crimes of the financial sector in the country and abroad.
poseL tadeusz cymaNski faktycznie wyraziL sprzeciw wobec rozwoju paNstwa polskiego
MP Tadeusz Cymanski expressed a protest against development of the Polish country indeed.
tak mi dopomOZ bOg
God, help me. (traditional formula added after an oath)
poseL andrzej lepper zajmuje siE rzAdem jak nikt inny
MP Andrzej Lepper takes care of the government like no one else.
uchwaLa rzAdowa dotyczAca handlu i inwestycji przedsiEbiorstw paNstwowych w rynek nieruchomoSci
The government act on trade and investments of public enterprises in the estate market.
panie marszaLku wysoka izbo
Mr speaker, House. (common way to start a speech in the Polish Parliament)
poseL ludwik dorn chce podziEkowaC komisji
MP Ludwik Dorn wants to thank the commission.
bezpieczeNstwo jest bardzo waZne
The safety is very important.
minister Srodowiska powiedziaL waZne rzeczy
The Minister of Environment said important things.

Table 5.8: 44 sentences in the exact transcription used for testing by HTK and the bag-of-words model, with English translations (2nd part)

narOd rzeczpospolitej polskiej chce pieniEdzy
The nation of Republic of Poland wants money.
rodziny powinny byC najwaZniejsze
Families should be the most important.
resort bezpieczeNstwa ma wysokie uprawnienia
The department of security has high authority.
odpowiednie uprawnienia sA bardzo waZne
Proper authorities are very important.
kilkanaScie przedsiEbiorstw potrzebuje nowych dochodOw
Over a dozen of enterprises need new incomes.
poseL andrzej lepper zwrOciL dokumenty do sejmu
MP Andrzej Lepper returned documents to the Parliament.
krajowa komisja popiera nowA ustawE
The national commission supports the new act.
narOd rzeczpospolitej polskiej ma waZne oczekiwania od sejmu
The nation of the Republic of Poland has important expectations from the Parliament.
praktyka wskazuje co innego
Real life shows something else.
czterech posLow nie mogLo zostaC
Four MPs were not able to stay.
na sLuZbie siE pracuje
You work on a duty.
sprzeciwiam siE
I speak against.
wnoszE o przerwE w obradach
I ask for a break in the proceedings.
proszE o ciszE
I ask for silence.
wznowienie obrad nastApi po godzinnej przerwie
The proceedings will be reopened after an hour break.
to jest skandal
It is a scandal.

Table 5.9: 44 sentences in the exact transcription used for testing by HTK and the bag-of-words model, with English translations (3rd part)

nie pozwolimy na to
We will not allow it.
obrady przy zamkniEtych drzwiach
Closed proceedings.
matki potrzebujA becikowe
Mothers need a support.
przechodzimy do konkretOw na temat ustawy o ubezpieczeniach spoLecznych
We move to details on the act on public insurances.
duZA frekwencja w trakcie gLosowania
High attendance during a vote.
zgromadzenie narodowe zadecyduje o przyszLoSci tej ustawy
The National Assembly will decide about the future of this act.
komisja zbierze siE po przerwie
The commission will gather after a break.
proszE mOwiC wolniej
Speak slower please.
zacznijmy od budowania podstaw
Let's start from building the foundations.
zgLoszono wiele poprawek do tej ustawy
Many corrections to this act were declared.

The training corpus consisted of around 800,000 topics. At the end of the training process all topics were clustered into 500 final topics. Then the values of matrix (5.1) were normalised for each word over all topics, to increase the importance of words which appeared in few topics and decrease the importance of words which appeared in many topics. Then matrix (5.5) was created. The HTK hypotheses were rearranged using information from (5.5) in the same way as in the previous experiment. The results of this experiment were negative: the model did not improve recognition. The quality of the training data is blamed for this. The transcriptions contained many comments and other elements which are not sentences. Moreover, the transcriptions were copied from PDF files into a text file, which degraded the quality slightly: all occurrences of the syllable 'fi' in the corpus were changed into dots and some parts were rearranged inappropriately. What is more, a dot is quite frequently used in Polish to mark the end of an abbreviation and is put after numbers when they denote order, like 1st or 2nd in English. All these dots were treated by our algorithm in the same way as dots marking the ends of sentences. This is why the topics were quite often not the proper sentences expected by our algorithm. We decided to conduct another experiment using literature for training. The quality of e-books is better than that of the transcriptions, and they are available as txt and doc files. Abbreviations and numbers are also much rarer in literature than in the Parliament transcripts.

5.13 Preprocessing of Training Corpora

The experiment on the Parliament transcripts taught us that text data has to be preprocessed more thoroughly before it can be used for model training. There are three main issues which have to be faced. First, Matlab, which is used for model training, does not recognise the special Polish letters, so they have to be replaced by single ASCII characters. Secondly, several special characters should be erased to keep a corpus cleaner. Thirdly, some dots have to be removed from a corpus because they do not mark the end of a sentence. We started by replacing all capital letters in a corpus with lower case, as capitals are redundant for this experiment; capital letters can then be used to represent the special Polish characters. The second issue was addressed by removing (or, for some of them, replacing with a space) all characters from the list: , ” “ : ( ) ; + - \/ ’ # & =. Then question and exclamation marks ?! were changed into dots. Dots were removed if they followed certain abbreviations. A dot is put after an abbreviation in Polish if the abbreviation ends with a consonant. All short forms from the list were replaced by a full form, or by an abbreviation without a dot if several morphological forms are represented by one abbreviation. An empty space was put at the beginning of each string to be searched for, to avoid matching the ends of longer words. It is increasingly common in Polish to put a dot after a digit when it denotes order, like 'th' in English. This is why all dots following digits were also removed from the corpora.
Two consecutive dots were replaced with a single one, and the same was done with three dots. Finally, as a last cleaning step, all doubled and tripled spaces were replaced by single ones. In the beginning we performed these operations using Matlab and Word for Windows. Later, the process was automated using SED. Another preprocessing step was removing HTML and XML tags from some of the texts. This task was also accomplished in SED, a simple stream editor under Linux. It reads and filters the input row by row (the default input was a text file in our case), applies changes to the text according to commands in a specific order, and sends the result to the output. The script presented in Table 5.10 was used for all changes apart from removing HTML tags.

Table 5.10: SED script for text preprocessing

s/A/a/g s/B/b/g s/C/c/g s/D/d/g s/E/e/g s/F/f/g s/G/g/g s/H/h/g s/I/i/g s/J/j/g s/K/k/g s/L/l/g s/M/m/g
s/N/n/g s/O/o/g s/P/p/g s/R/r/g s/S/s/g s/T/t/g s/U/u/g s/W/w/g s/Y/y/g s/V/v/g s/X/x/g s/Z/z/g
s/ł/L/g s/ś/S/g s/ń/N/g s/ć/C/g s/ó/O/g s/ȩ/E/g s/ż/Z/g s/ź/X/g s/a̧/A/g
s/Ł/L/g s/Ś/S/g s/Ń/N/g s/Ć/C/g s/Ó/O/g s/Ȩ/E/g s/Ż/Z/g s/Ź/X/g s/A̧/A/g
s/,//g s/[-]//g s/[+]//g s/[/]//g s/[=]//g s/[\]//g s/[”]//g s/[:]//g
s/ [%]/ procent/g s/ [$]/ dolar/g s/nbsp/ /g s/[.] [.]/./g
s/ ust[.]/ ustawa/g s/ ub[.]/ ub/g
s/[(]//g s/[)]//g s/[;]//g s/[¡]//g s/[#]//g s/[&]//g s/[|]//g s/[*]//g s/[ ]//g s/[’]//g
s/[!]/./g s/[?]/./g s/[@]/ /g
s/0[.]/0/g s/1[.]/1/g s/2[.]/2/g s/3[.]/3/g s/3[.]/3/g s/4[.]/4/g s/5[.]/5/g s/6[.]/6/g s/7[.]/7/g s/8[.]/8/g s/9[.]/9/g
s/ godz[.]/ godz/g s/ art[.]/ art/g s/ tys[.]/ tys/g s/ ok[.]/ ok/g
s/ m[.]in[.]/ miEdzy innymi/g s/ m[.] in[.]/ miEdzy innymi/g s/ n[.]p[.]m[.]/ nad poziomem morza/g s/ p[.]p[.]m[.]/ pod poziomem morza/g
s/ p[.]n[.]e[.]/ przed naszA erA/g s/ n[.]e[.]/ naszej ery/g s/ przyp. tLum./ przypis tLumacza/g
s/ z o[.] o[.]/ z ograniczonA odpowiedzialnoSciA/g s/ z o[.]o[.]/ z ograniczonA odpowiedzialnoSciA/g
s/ orygin[.]/ oryginalnie/g s/ proc[.]/ procent/g s/ tj[.]/ to jest/g s/ szt[.]/ sztuk/g s/ np[.]/ na przykLad/g s/ ww[.]/ wyZej wym/g
s/ ds[.]/ do spraw/g s/ wLaSc[.]/ wLaSc/g s/ tzw[.]/ tzw/g s/ im[.]/ imienia/g s/ lit[.]/ litera/g s/ ang[.]/ ang/g s/ Lac[.]/ Lac/g s/ gr[.]/ gr/g
s/ poL[.]/ poLowa/g s/ zm[.]/ zmarLy/g s/ ur[.]/ urodzony/g s/ wyd[.]/ wyd/g s/ r[.]/ r/g s/ r [.]/ roku/g s/ sp[.]/ spOLka/g s/ ul[.]/ ulica/g s/ pkt[.]/ pkt/g
s/[.]jpg/ jpg/g s/[.]png/ png/g s/[.]exe/ exe/g s/[.]bmp/ bmp/g s/[.]pdf/ pdf/g s/[.]html/ htm/g s/[.]pl/ pl/g s/[.]com/ com/g
s/ w[.]/ w/g s/ a[.]/ a/g s/ b[.]/ b/g s/ c[.]/ c/g s/ d[.]/ d/g s/ e[.]/ e/g s/ f[.]/ f/g s/ g[.]/ g/g s/ h[.]/ h/g s/ i[.]/ i/g s/ j[.]/ j/g
s/ k[.]/ k/g s/ l[.]/ l/g s/ L[.]/ L/g s/ m[.]/ m/g s/ n[.]/ n/g s/ o[.]/ o/g s/ p[.]/ p/g s/ s[.]/ s/g s/ t[.]/ t/g s/ u[.]/ u/g s/ z[.]/ z/g
s/www[.]/www /g s/ / /g s/ / /g s/[.][.][.]/./g s/[.][.]/./g
5.14 Experiment with Literature Training Corpus

Another, larger-scale experiment was conducted using literature to train the model. This attempt was more successful than the previous one; however, the results are still unsatisfactory. The improvement compared with the transcripts might be caused by the fact that the language in literature is much more correct than in the transcripts, where spoken language was written down. It would be an interesting observation that written language should be used for training even though spoken language is being recognised. With some configurations an improvement of 3% was noted (Tab. 5.11). The low efficiency was probably caused by using too little data for training; the very bad results obtained with LSA support this hypothesis. The perplexity of the corpus is sufficiently large and equals 9,031.

Table 5.11: Experimental results for the pure HTK audio model, the audio model with LSA, and the audio model with our bag-of-words model trained on literature

        n    α    w    recognised sentences      %
LSA               26            8                18
HTK                            16                35
       30   20    20           17                38

As the next step to improve our model, we normalised all values in matrix (5.5) so that they are probabilities, and made the final grades probabilities as well, which we had not done in the previous experiment. We also added new text to the training data. Additionally, we decided that counting the number of properly recognised sentences is not the best way to evaluate the method. Instead, we started to look at the average position of the correct hypothesis in the n-best list before and after applying our model. This evaluates all sentences, not just those for which a correct hypothesis was moved to the first position from a lower one, as in the earlier evaluation method. We compared our model with LSA as the baseline and it performed better again (Tab. 5.12). This supports the conclusion that the model is at least better than LSA in that it needs less data to be trained. The different parameter values for which our model now performs best are probably caused by the fact that matrix (5.1) is calculated using more data. Thanks to that there are fewer zeros in (5.1) and there is no need to smooth it as much by including the impact of many similar topics; only the most similar ones were used in this case.

Table 5.12: Experimental results for the pure HTK audio model, the audio model with LSA, and the audio model with our bag-of-words model trained on the enlarged literature corpus

        n    α    w    ranking of the correct hypothesis    % improvement
LSA               30                 12.36                       -19
HTK                                  10.39                         0
        3    3    25                  8.95                        14

We also collected more training data using the Rzeczpospolita journal and the Polish Wikipedia. The first corpus can be downloaded from Dawid Weiss's website as a set of HTML files; the researcher states that the journal agreed to these resources being used for any academic research. The second was collected from the Internet using C++ software and has a very high perplexity, namely 16,436. However, adding this data did not improve the performance of the method. Table 5.13 shows the size and complexity of all corpora used in this research.

Table 5.13: Text corpora

Content                   MBytes   Mwords   Perplexity
Parliament transcripts      58        8        4,013
Literature                 490       68        9,031
Rzeczpospolita journal     879      104        8,918
Wikipedia                  754       97       16,436

5.15 Word Prediction Model and Evaluation with Perplexity

There are two main ways to evaluate language models. The first is to measure the recognition error. The second is perplexity, which, for a probability model, is defined via the cross entropy (Brown et al., 1992)

2^{-\sum_{x=1}^{N} p(x) \log_2 q(x)},     (5.16)

where p(x) is the probability of a correct recognition according to the ground-truth distribution. Here it is assumed to be uniform, which leads to p(x) = 1/N, where N is the number of test samples; q(x) is the probability of a correct recognition according to the probability distribution of the tested language model. The first measure, usually given as the word error rate (WER), is an accuracy; briefly, it describes how correct the highest-probability hypothesis is. Perplexity is a measure of how probable the observed data is according to the model. Our model is designed to be implemented in a working ASR system, which is why accuracy is a more important evaluation for us than perplexity, even though perplexity is a very popular measure and many NLP researchers recommend reporting both. It has to be stressed that the previously described bag-of-words model cannot provide perplexity as such. The reason is that our model does not provide the probability of an event, such as a word following a given history of words; it provides a grade of how coherent a set of words is. The perplexity of our model cannot be given, as the model uses the probability of a topic given all the words in a sentence, and there is no ground truth for this probability distribution to be used in (5.16). The topics in our model are not listed and named; they are not real topics but representations of sentences grouped in an unsupervised process.
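As a small check on (5.16), the snippet below computes the perplexity of a toy model from the probabilities q(x) it assigns to N test samples, under the uniform ground-truth assumption p(x) = 1/N. The probability values are invented purely for illustration.

```python
import math

def perplexity(q_values):
    """Perplexity 2**(-sum_x p(x) * log2 q(x)) with uniform p(x) = 1/N."""
    n = len(q_values)
    cross_entropy = -sum((1.0 / n) * math.log2(q) for q in q_values)
    return 2.0 ** cross_entropy

# Toy example: model probabilities assigned to four test samples.
print(perplexity([0.25, 0.25, 0.25, 0.25]))  # 4.0: as uncertain as 4 equal choices
print(perplexity([0.5, 0.5, 0.1, 0.1]))      # about 4.47
```

A model that always assigned probability 1 to the correct recognition would reach the minimum perplexity of 1.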
5.16 Conclusion

The POS tagger from Dr Piasecki (Piasecki, 2006) was applied as an extra language model to the problem of improving ASR. Although it is the most effective tagger for Polish, with an accuracy of 93.44%, the results were not good: it reduced the recognition rate by 57% when applied as a language model to an HTK-based ASR system. We believe this is because POS tag information for Polish is too ambiguous. Another language model, inspired by LSA, was designed, implemented in Matlab and applied to improve ASR recognition of sentences. It is mainly a semantic model, but because of the inflective nature of the Polish language it covers some syntax modelling as well. The semantic model uses statistics of how words appear together in sentences, stored in a word-topic matrix, where the topics can be seen as sentence examples and patterns. The order of words is not kept, though, which is why we call it the bag-of-words model. A corpus of almost 300,000,000 words was available for training the model; however, some texts decreased the efficiency of the model. After several experiments, an improvement in recognition of 14% was achieved compared to a system without a language model, and of 33% compared to LSA. The average ranking position of the correct recognition in the entire n-best list of hypotheses was used as the evaluation grade. We believe that the bag-of-words model is effective because of the non-positional nature of the Polish language. The overall conclusion from this part of our research is that POS taggers are not useful in ASR of Polish, but the bag-of-words model based on a word-topic matrix helps in the ASR task for Polish. The ineffectiveness of POS tagging for language modelling of Polish was supported experimentally. The main contribution presented in this chapter is the successful word-topic matrix model, which was invented, implemented and tested.
It can be trained with less data than the baseline and has better predictive power. The method could be improved by stemming the training corpora first. Stemming for Polish can be performed using Morfeusz (Woliński, 2004), a morphological analyser implemented by Marcin Woliński applying Zygmunt Saloni's rules. It would reduce data sparsity and improve results. The method can be applied to any other language for which LSA is useful; however, it is tuned to Polish and other Slavic languages because they are non-positional. The bag-of-words philosophy fits the logic of these languages very well. We plan to train the bag-of-words model on a larger training corpus. The more data one can use, the better the performance of a language model. We believe this is especially true in this case, because LSA is known to be effective in general, yet it reduced recognition here when trained on the available data. LSA is a challenging baseline, and this is why we believe our method will be very good when trained on sufficiently large data, which we plan to do. The work on the bag-of-words model will be extended. Several possible combinations will be tested on larger corpora than those described here. We are in the process of acquiring more literature books, newspaper articles and high-quality websites. We will optimise the bag-of-words algorithm, especially regarding how to save memory while working on matrix (5.1), and implement it in C. The method will be tested not only with sentences as topics but also with paragraphs and articles or chapters. In all cases we will compare versions trained on the original corpora and on stemmed ones. We will also combine the bag-of-words model with n-grams to capture extra information and achieve as high a recognition rate as possible.

Chapter 6
Conclusions and Future Research

It is difficult to predict success in research. In the case of ASR it is even more difficult, as revolutionary and effective solutions have been anticipated for approximately 25 years but have not, as yet, materialised. However, there is still important progress in all aspects of ASR. Our study of different parametrisation methods has highlighted a few aspects which might be especially successful in the near future. One obviously good avenue of research is the perceptual approach. The idea was conceived by Hermansky when improving the already popular LPC into PLP. Many other methods also give better results because they are perceptually motivated. Human hearing and speech production systems have been tuned to each other by millennia of evolution. This means that we have to simulate processes in the human ear and brain to recognise and understand signals created by the human speech system. In fact, all ASR methods are perceptually motivated to some extent, but some specifically model perceptual features. Wavelets, for example, offer good opportunities due to their non-uniform bandwidths. Phonological approaches also try to simulate processes in the human ear in more detail. Another issue which will definitely become more important is the difference in speech parametrisation between languages. The beginnings of ASR research were based on English. Currently it has become quite popular to try to recognise other languages, such as Japanese, Chinese, Arabic, German, French, Turkish, Finnish, the Slavic languages and many more. There are obvious differences between them, but the methods very often repeat the scheme applied to English. This might be an important encumbrance, because English is in fact quite an unusual language.
It has a few properties important for ASR which mark it out even from other western European languages, not to mention others. The large majority of unstressed vowels are pronounced in a very similar way, which causes a large number of homophones. Conjugation is relatively simple, and declension of nouns and adjectives almost does not exist. Languages also differ in the widths of the frequency bands they use; for example, there are phonemes in Polish with frequencies much higher than any in English. It is quite common that people find some phonemes especially difficult to produce while learning a new foreign language. This observation should be taken into consideration by researchers working on non-English languages. Table 2.3 shows clearly that it is very difficult to find a new parametrisation method which would outperform the baseline. It is usually much more successful to append new elements to, or further process, a commonly known parametrisation. This suggests that it might be impossible to find any new crucial parametrisation method, and that success is more likely to be obtained by additional processing of features or by better modelling. Statistics of phonemes, diphones and triphones were collected for Polish using a large corpus of mainly spoken formal language. A summary of the data was presented and interesting phenomena in the statistics were described. Triphone statistics play an important role in ASR; they are used to improve the transcription of the analysed speech segments. 28% of the possible triples were detected as triphones, but many of them appeared very rarely. A majority of the rare triphones came from foreign or distorted words. Most ASR systems do not use information about the boundaries of phonetic units such as phonemes. A method based on the DWT for finding such boundaries was presented. The method is language agnostic, as it does not rely on any phonetic models but purely on the analysis of the power spectrum, and hence is applicable to any language. For the same reason it can easily be introduced into most existing systems, as it does not depend on any particular configuration or training of the speech model. It can also be used to provide additional information or an initial hypothesis for model-based segmentation methods such as (Ostendorf et al., 1996). Our method is adaptable, as it can easily be improved or adjusted for specific applications, noisy data, etc., by introducing additional conditions or changing weights. The algorithm can find most of the boundaries with high accuracy. Several wavelet functions were compared and our results show that Meyer wavelets are better than the others. Fuzzy recall and precision measures were introduced for segmentation in order to evaluate the method with more sensitivity, grading errors more smoothly than commonly used evaluation methods. Our results give an f-score of approximately 0.72 for Meyer and most of the other wavelets. The precise evaluation method was described. It adapts the standard and very useful recall and precision scheme to applications where evaluation has to consider more detail. Speech segmentation is such a field, and so are many other types of segmentation, because the correctness of audio or image segmentation is typically not binary. This is why fuzzy sets proved useful in the task of segmentation evaluation.
General rules for applying fuzzy logic to recall and precision were presented, as well as an exact algorithm for using it in phoneme segmentation evaluation as an example. It seems that POS tags are too ambiguous to be used effectively in modelling Polish for ASR; according to our experiments, they actually reduce the number of correct recognitions. Even though POS information is important in the Polish language, the ambiguity of forms means that other language models have to be used. A new method inspired by LSA was presented. The advantage of the method is that smoothing of information in the matrix representing word-topic relations is based on a limited number of the most closely related topics for every topic, rather than on all of them as in LSA. Our model was better than LSA, which actually reduced recognition with the available training data. The bag-of-words model can be trained with less data than LSA. The performance was improved in comparison with the audio model. In the experiment with the best algorithm and most of the training data, we graded the method by the average position of the correct hypothesis in the n-best list. The improvement was 14% compared with using the HTK audio model only, while LSA with the same training data reduced recognition. The author's research on ASR will be continued. He now works as a research assistant on an ASR project for the AGH University of Science and Technology and the Polish Platform of Homeland Security. He is responsible for designing language models in the project, will apply his PhD experience there, and will experiment with the bag-of-words method on a larger scale. It will probably be combined with n-grams and applied to subword units provided by a POS tagger to reduce the size of the dictionary. The author's segmentation method has already been improved by other people in the project and is now being implemented in C++ for an ASR system which is going to be used in courts and during police interrogations. The paper on triphone statistics was rated highly by the 3rd LTC conference committee, which requested a revised version for a journal. The statistics will be collected again using a larger corpus and published in the new paper.

List of References

Abry, P. (1997). Ondelettes et turbulence (Eng. Wavelets and turbulence). Diderot ed., Paris.
Agirre, E., Alfonseca, E., and de Lacalle, O. L. (2004). Approximating hierarchy-based similarity for wordnet nominal synsets using topic signatures. Proceedings of the 2nd Global WordNet Conference. Brno, Czech Republic.
Agirre, E., Ansa, O., Martínez, D., and Hovy, E. (2001). Enriching wordnet concepts with topic signatures. Proceedings of the SIGLEX Workshop on WordNet and Other Lexical Resources: Applications, Extensions and Customizations.
Agirre, E., Martínez, D., de Lacalle, O. L., and Soroa, A. (2006). Two graph-based algorithms for state-of-the-art wsd. Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing, Sydney, pages 585–593.
Ahmed, N., Natarajan, T., and Rao, K. R. (1974). Discrete cosine transform. IEEE Transactions on Computers, Jan:90–93.
Alewine, N., Ruback, H., and Deligne, S. (October-December 2004). Pervasive speech recognition. Pervasive Computing, pages 78–81.
A. Przepiórkowski (2006). The potential of the IPI PAN corpus. Poznań Studies in Contemporary Linguistics, 41:31–48.
Banerjee, S. and Pedersen, T. (2003). Extended gloss overlaps as a measure of semantic relatedness.
Proceedings of the Eighteenth International Joint Conference on Artificial Intelligence, pages 805–810. Basztura, C. (1992). Rozmawiać z komputerem (Eng. To speak with computers). Format. Beep dictionary (2000). www.speech.cs.cmu.edu/comp.speech/Section1/Lexical/beep.html. Bellegarda, J. (1998). A multispan language modeling framework for large vocabulary speech recognition. IEEE Transactions on Speech and Audio Processing, 6(5):456–467. Bellegarda, J. R. (1997). A latent semantic analysis framework for large-span language modeling. Proceedings of Eurospeech, 3:1451–1454. Bellegarda, J. R. (2000). Large vocabulary speech recognition with multispan statistical language models. IEEE Transactions on Speech and Audio Processing, 8(1):76–84. Bellegarda, J. R. (70–80). Latent semantic mapping. IEEE Signal Processing Magazine, September:70–80. Boersma, P. (1996). Praat, a system for doing phonetics by computer. Glot International, 5(9/10):341–345. Brill, E. (1994). Some advances in transformation-based part of speech tagging. Proceedings of 115 LIST OF REFERENCES 116 the Twelfth National Conference on artificial Intelligence AAAI. Brill, E. (1995). Transformation-based error-driven learning and natural language processing: A case study in part of speech tagging. Computational Linguistics, December:543–565. Brown, P. F., Pietra, V. J. D., ans S. A. Della Pietra, R. L. M., and Lai, J. C. (1992). An estimate of an upper bound for the entropy of english. Computational Linguistics, 18(1):31–40. Cardinal, P., Boulianne, G., and Comeau, M. (2005). Segmentation of recordings based on partial transcriptions. Proceedings of Interspeech, pages 3345–3348. Coccaro, N. and Jurafsky, D. (1998). Towards better integration of semantic predictors in statistical language modeling. Proceedings of ICSLP-98, Sydney. Cooley, J. W. and Tukey, J. W. (1965). An algorithm for the machine calculation of complex fourier series. Math. Comput., 19:297–301. Cozens, S. (1998). Primitive part-of-speech tagging using word length and sentential structure. Computaion and Language. Cuadros, M., Padró, L., and Rigau, G. (2005). Comparing methods for automatic acquisition of topic signatures. Proceedings of the International Conference on Recent Advances on Natural Language Processing (RANLP). Daelemans, W. and van den Bosch, A. (1997). Language-independent data-oriented grapheme-tophoneme conversion. Progress in Speech Synthesis, New York: Springer-Verlag. Daubechies, I. (1992). Ten lectures on Wavelets. Society for Industrial and Applied Mathematics, Philadelphia, Pennsylvania. Davis, H., Biddulph, R., and Balashek, S. (1952). Automatic recognition of spoken digits. Journal of the Acoustical Society of America, (24(6)):637–642. Davis, S. B. and Mermelstein, P. (1980). Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. IEEE Transactions on Acoustics, Speech and Signal Processing, ASSP-28(4):357–366. Dȩbowski, Ł. (2003). A reconfigurable stochastic tagger for languages with complex tag structure. The Proceedings of the Workshop on Morphological Processing of Slavic Languages, EACL. de Saussure, F. (1916). Course de lingustique genérale. Lausanne and Paris: Payot. Demenko, G., Wypych, M., and Baranowska, E. (2003). Implementation of grapheme-to-phoneme rules and extended SAMPA alphabet in Polish text-to-speech synthesis. Speech and Language Technology, PTFon, Poznań, 7(17). Demuynck, K. and Laureys, T. (2002). A comparison of different approaches to automatic speech segmentation. 
Proceedings of the 5th International Conference on Text, Speech and Dialogue, pages 277–284. Denes, P. B. (1962). Statistics of spoken English. The Journal of the Acoustical Society of America, 34:1978–1979. Deng, L., Wu, J., Droppo, J., and Acero, A. (2005). Analysis and comparison of two speech feature extraction/compensation algorithms. IEEE Signal Processing Letters, 12(6):477–480. Deng, Y. and Khudanpur, S. (2003). Latent semantic information in maximum entropy language models for conversational speech recognition. Proceedings of the HLT-NAACL 03, pages 56–63. Eskenazi, M., Black, A., Raux, A., and Langner, B. (2008). Let’s go lab: a platform for evaluation LIST OF REFERENCES 117 of spoken dialog systems with real world users. Proceedings of Interspeech, Brisbane. Evermann, G., Chan, H. Y., Gales, M. J. F., Hain, T., Liu, X., Mrva, D., Wang, L., and Woodland, P. C. (2004). Develpment of the 2003 CU-HTK conversational telephone speech transcription system. Proceedings of ICASSP Interspeech, pages I–249–252. Farooq, O. and Datta, S. (2004). Wavelet based robust subband features for phoneme recognition. IEE Proceedings: Vision, Image and Signal Processing, 151(3):187–193. Fellbaum, C. (1999). Wordnet. An Electronic Lexical Database. Massachusetts Institute of Technology, US. Forney, G. D. (1973). The Viterbi algorithm. Proceedings IEEE, 61:268–273. Frankel, J. and King., S. (2005). A hybrid ANN/DBN approach to articulatory feature recognition. Proceedings of Eurospeech. Frankel, J. and King, S. (2007 (in press)). Speech recognition using linear dynamic models. IEEE Transactions on Speech and Audio Processing. Frankel, J., Wester, M., and King, S. (2007). Articulatory feature recognition using dynamic Bayesian networks. Computer Speech and Language, 21(4):620–640. Friedman, J., Hastie, T., and Tibshirani, R. (1999). Additive logistic regression: A statistical view of boosting. Technical report, Department of Statistics, Stanford University. Gałka, J. and Ziółko, B. (2008). Study of performance evaluation methods for non-uniform speech segmentation. International Journal Of Circuits, Systems And Signal Processing, NAUN. Ganapathiraju, A., Hamaker, J. E., and Picone, J. (2004). Applications of support vector machines to speech recognition. IEEE Transactions on Signal Processing, 52(8):2348–2355. Glass, J. (2003). A probabilistic framework for segment-based speech recognition. Computer Speech and Language, 17:137–152. Gorrell, G. and Webb, B. (2005). Generalized Hebbian algorithm for incremental latent semantic analysis. proceedings of Intespeech. Grayden, D. B. and Scordilis, M. S. (1994). Phonemic segmentation of fluent speech. Proceedings of ICASSP, Adelaide, pages 73–76. Green, S. J. (1999). Lexical semantics and automatic hypertext construction. ACM Computing Surveys (CSUR), 31. Greenberg, S., Chang, S., and Hollenback, J. (2000). An introduction to the diagnostic evaluation of switchboard- corpus automatic speech recognition systems. Proceedings of NIST Speech Transcription Workshop. Grocholewski, S. (1995). Założenia akustycznej bazy danych dla jȩzyka polskiego na nośniku cd rom (Eng. Assumptions of acoustic database for Polish language). Mat. I KK: Głosowa komunikacja człowiek-komputer, Wrocław, pages 177–180. Grönqvist, L. (2005). An evaluation of bi- and trigram enriched latent semantic vector models. ACM Proceedings of ELECTRA Workshop - Methodologies and Evaluation of Lexical Cohesion Techniques in Real-world Applications, Salvador, Brazil, pages 57–62. 
Hain, T., Dines, J., Garau, G., Karafiat, M., Moore, D., Wan, V., Ordelman, R., and S.Renals (2005). Transcription of conference room meetings: an investigation. Proceedings of ICSLP Interspeech. LIST OF REFERENCES 118 Harary, F. (1969). Graph Theory. Addison-Wesley. Hermansky, H. (1990). Perceptual linear predictive (PLP) analysis of speech. Journal of the Acoustical Society of America, 87(4):1738–1752. Hermansky, H. and Morgan, N. (1994). RASTA processing of speech. IEEE Transactions on Speech and Audio Processing, 2(4):578–589. Hifny, Y., Renals, S., and Lawrence, N. D. (2005). A hybrid MaxEnt/HMM based ASR system. Proceedings of ICSLP Interspeech. Holmes, J. N. (2001). Speech Synthesis and Recognition. London: Taylor and Francis. Ishizuka, K. and Miyazaki, N. (2004). Speech feature extraction method representing periodicity and aperiodicity in sub bands for robust speech recognition. Proceedings of ICASSP, pages I–141–144. Jarmasz, M. and Szpakowicz, S. (2003). Roget’s thesaurus and semantic similarity. Proceedings of Conference on Recent Advances in Natural Language Processing (RANLP), pages 212–219. Jassem, K. (1996). A phonemic transcription and syllable division rule engine. OnomasticaCopernicus Research Colloquium, Edinburgh. Jelinek, F., Merialdo, B., Roukos, S., and Strauss, M. (1991). A dynamic language model for speech recognition. Fourth DARPA Speech and Natural Language Workshop, pages 293–295. Johansson, S., Leech, G., and Goodluck, H. (1978). Manual of Information to Accompany the Lancaster-Olso/Bergen Corpus of British English, for Use with Digital Computers. Department of English, University of Oslo. Jurafsky, D. and Martin, J. H. (2000). Speech and Language Processing. Prentice-Hall, Inc., New Jersey. Kakkonen, T., Myller, N., and Sutinen, E. (2006). Applying part-of-speech enhanced LSA to automatic essay grading. Proceedings of the 4th IEEE International Conference on Information Technology:Research and Education (ITRE 2006). Tel Aviv, Israel, pages 500–504. Kanejiya, D., Kumar, A., and Prasad, S. (2003). Automatic evaluation of students’ answers using syntactically enchanced LSA. Proceedings of the HLT-NAACL 03 workshop on Building educational applications using natural language processing, 2:53–60. Kecman, V. (2001). Learning and Soft Computing. Massachusetts Institute of Technology, US. Kȩpiński, M. (2005). Kontekstowe zwia̧zki cech w sygnale mowy polskiej (Eng. Contextual feature relations in Polish speech signal), PhD Thesis. AGH University of Science and Technology, Kraków. Khudanpur, S. and Wu, J. (1999). A maximum entropy language model integrating n-grams and topic dependencies for conversational speech recognition. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Phoenix, AZ. King, S. (2003). Dependence and independence in automatic speech recognition and synthesis. Journal of Phonetics, 31(3-4):407–411. King, S. and Taylor, P. (2000). Detection of phonological features in continuous speech using neural networks. Computer Speech and Language, 14(4):333–353. Kucera, H. and Francis, W. (1967). Computational Analysis of Present Day American English. Brown University Press Providence. LIST OF REFERENCES 119 L. E. Baum, T. Petrie, G. S. and Weiss, N. (1970). A maximization technique occurring in the statistical analysis of probabilistic functions of markov chains. Ann. Math. Statist., 41(1):164– 171. Lamere, P., Kwok, P., Gouvea, E., Raj, B., Singh, R., Walker, W., and Wolf, P. (2004). 
The cmu sphinx-4 speech recognition system. Sun Microsystems. Li, H.-Z., Liu, Z.-Q., and Zhu, X.-H. (2005). Hidden markov models with factored gaussian mixtures densities. Elsevier Pattern Recognition, 38:2022–2031. Lowerre, B. T. (1976). The HARPY Speech Recognition System, PhD thesis. Carnegie-Mellon Univesity, Pittsburgh. Ma, J. Z. and Deng, L. (2004). Target - directed mixture dynamic models for spontaneous speech recognition. IEEE Transactions on Speech and Audio Processing, 12(1). Mahajan, M., Beeferman, D., , and Huang, D. (1999). Improved topic-dependent language modeling using information retrieval techniques. Proceedings of ICASSP, pages 541–544. Makhoul, J. (1975). Spectral linear prediction: properties and applications. IEEE Transcations, ASSP-23:283–296. Manning, C. D. (1999). Foundations of Statistical Natural Language Processing. MIT Press. Cambridge, MA. Miller, T. and Wolf, E. (2006). Word completion with latent semantic analysis. 18th International Conference on Pattern Recognition, ICPR, Hong Kong, 1:1252–1255. Misra, H., Ikbal, S., Bourlard, H., and Hermansky, H. (2004). Spectral entropy based feature for robust ASR. Proceedings of ICASSP, pages I–193–196. Morgan, N., Zhu, Q., Stolcke, A., Sonmez, K., Sivadas, S., Shinozaki, T., Ostendorf, M., Jain, P., Hermansky, H., Ellis, D., Doddington, G., Chen, B., Cretin, O., Bourlard, H., and Athineos, M. (2005). Pushing the envelope - aside. IEEE Signal Processing Magazine, 22:81–88. M. Wester (2003). Syllable classification using articulatory-acoustic features. Proceedings of Eurospeech. Nasios, N. and Bors, A. (2005). Finding the number of clusters for nonparametric segmentation. Lecture Notes in Computer Science, 3691:213–221. Nasios, N. and Bors, A. (2006). Variational learning for gaussian mixture models. IEEE Transactions on Systems, Man and Cybernetics - Part B: Cybernetics, 36(4):849–862. Ostaszewska, D. and Tambor, J. (2000). Fonetyka i fonologia współczesnego jȩzyka Polskiego (eng. Phonetics and phonology of modern Polish language). PWN. Ostendorf, M., Digalakis, V. V., and Kimball, O. A. (1996). From HMM’s to segment models: A unified view of stochastic modeling for speech recognition. IEEE Transactions on Speech and Audio Processing, 4:360–378. Pedersen, T., Patwardhan, S., and Michelizzi, J. (2004). Wordnet::similarity - measuring the relatedness of concepts. Proceedings of the Nineteenth National Conference on Artificial Intelligence (AAAI-2004), pages 1024–1025. Piasecki, M. (2006). Hand-written and automatically extracted rules for Polish tagger. Lecture Notes in Artificial Intelligence, Springer, W P. Sojka, I. Kopecek, K. Pala, eds. Proceedings of Text, Speech, Dialogue:205–212. LIST OF REFERENCES 120 Przepiórkowski, A. (2004). The IPI PAN Corpus: Preliminary version. IPI PAN. Przepiórkowski, A. and Woliński, M. (2003). The unbearable lightness of tagging: A case study in morphosyntactic tagging of Polish. Proceedings of the 4th International Workshop on Linguistically Interpreted Corpora (LINC-03), EACL 2003. Rabiner, L. and Juang, B. H. (1993). Fundamentals of speech recognition. PTR Prentice-Hall, Inc., New Jersey. Rabiner, L. R. (1989). A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2):257–286. Rabiner, L. R. and Schafer, R. W. (1978). Signal Processing of Speech Signals. Prentice Hall, Englewood-cliffs. Raj, B. and Stern, R. M. (September 2005). Missing-feature approaches in speech recognition. IEEE Signal Processing Magazine, pages 101–116. 
Riccardi, G. and Hakkani-Tür, D. (2005). Active learning: Theory and applications to automatic speech recognition. IEEE Transactions on Speech and Audio Processing, 13(4):504–511. Rioul, O. and Vetterli, M. (1991). Wavelets and signal processing. IEEE Signal Processing Magazine, 8:11–38. Russell, M. and Jackson, P. J. B. (2005). A multiple-level linear/linear segmental HMM with a formant-based intermediate layer. Computer Speech and Language, 19:205–225. Seco, N., Veale, T., and Hayes, J. (2004). An intrinsic information content metric for semantic similarity in wordnet. Proceedings of ECAI’2004, the 16th European Conference on Artificial Intelligence. Steffen-Batóg, M. and Nowakowski, P. (1993). An algorithm for phonetic transcription of ortographic texts in Polish. Studia Phonetica Posnaniensia, 3. Stöber, K. and Hess, W. (1998). Additional use of phoneme duration hypotheses in automatic speech segmentation. Proceedings of ICSLP, Sydney, pages 1595–1598. Subramanya, A., Bilmes, J., and Chen, C. P. (2005). Focused word segmentation for ASR. Proceedings of Interspeech 2005, pages 393–396. Suh, Y. and Lee, Y. (1996). Phoneme segmentation of continuous speech using multi-layer perceptron. In Proceedings of ICSLP, Philadelphia, pages 1297–1300. Tadeusiewicz, R. (1988). Sygnał mowy (eng. Speech Signal). Wydawnictwo Komunikacji i Ła̧czności. Tan, B. T., Lang, R., Schroder, H., Spray, A., and Dermody, P. (1994). Applying wavelet analysis to speech segmentation and classification. H. H. Szu, editor, Wavelet Applications, volume Proc. SPIE 2242, pages 750–761. T.Hofmann (1999). Probabilistic latent semantic analysis. Proceedings of Uncertainty in Artificial Intelligence, UAI’99, Stockholm. Toledano, D., Gómez, L., and Grande, L. (2003). Automatic phonetic segmentation. IEEE Transactions on Speech and Audio Processing, 11(6):617–625. Tukey, J. W., Bogert, B. P., and Healy, M. J. R. (1963). The quefrency analysis of time series for echoes: cepstrum, pseudo-autocovariance, cross-cepstrum, and saphe-cracking. Proceedings of the Symposium on Time Series Analysis (M. Rosenblatt, Ed), pages 209–243. LIST OF REFERENCES 121 van Rijsbergen, C. J. (1979). Information Retrieval. London: Butterworths. Venkataraman, A. (2001). A statistical model for word discovery in transcribed speech. Computational Linguistics, 27. Véronis, J. (2004). Hyperlex: lexical cartography for information retrieval. Computer Speech and Language, 18(3):223–252. Villing, R., Timoney, J., Ward, T., and Costello, J. (2004). Automatic blind syllable segmentation for continuous speech. Proceedings of ISSC 2004, Belfast. Viterbi, A. J. (1967). Error bounds for convolutional codes and an asymptotically optimum decoding algorithm. IEEE Transactions on Information Theory, 13(2):260–269. Wang, D. and Narayanan, S. (2005). Piecewise linear stylization of pitch via wavelet analysis. Proceedings of Interspeech, Lisboa, pages 3277–3280. Watanabe, S., Minami, Y., Nakamura, A., and Ueda, N. (2004). Variational Bayesian estimation and clustering for speech recognition. IEEE Transcations on Speech and Audio Processing, 12(4). Weinstein, C. J., McCandless, S. S., Mondshein, L. F., and Zue, V. W. (1975). A system for acoustic-phonetic analysis of continuous speech. IEEE Transactions on Acoustics, Speech and Signal Processing, 23:54–67. Wester, M. (2003). Pronunciation modeling for ASR - knowledge-based and data-derived methods. Computer Speech and Language, 17:69–85. Wester, M., Frankel, J., and King, S. (2004). 
Asynchronous articulatory feature recognition using dynamic Bayesian networks. Proceedings of IEICI Beyond HMM Workshop. Whittaker, E. and Woodland, P. (2003). Language modelling for russian and english using words and classes. Computer Speech and Language, 17:87–104. Woliński, M. (2004). System znaczników morfosyntaktycznych w korpusie ipi pan (Eng. The system of morphological tags used in IPI PAN corpus). POLONICA, XII:39–54. Wu, J. and Khudanpur, S. (2000). Efficient training methods for maximum entropy language modelling. Proceedings of 6th International Conference on Spoken Language Technologies (ICSLP-00). X. Huang, A. Acero, H.-W. H. (2001). Spoken Language Processing. Prentice Hall PTR, New Jersey. Y.-C. Tam, T. S. (2008). Correlated bigram LSA for unsupervised language model adaptation. Proc. of Neural Information Processing Systems (NIPS), Vancouver. Yannakoudakis, E. J. and Hutton, P. J. (1992). An assessment of n-phoneme statistics in phoneme guessing algorithms which aim to incorporate phonotactic constraints. Speech Communication, 11:581 – 602. Yapanel, U. and Dharanipragada, S. (2003). Perceptual MVDR-based cepstral coefficients (PMCCs) for robust speech recognition. Proceedings of ICASSP. Young, S. (1996). Large vocabulary continuous speech recognition: a review. IEEE Signal Processing Magazine, 13(5):45–57. Young, S., Evermann, G., Gales, M., Hain, T., Kershaw, D., Moore, G., Odell, J., Ollason, D., Povey, D., Valtchev, V., and Woodland, P. (2005). HTK Book. Cambridge University Engineering LIST OF REFERENCES 122 Department, UK. Zheng, C. and Yan, Y. (2004). Fusion based speech segmentation in DARPA SPINE2 task. Proceedings of ICASSP, Montreal, pages I–885–888. Zhu, D. and Paliwal, K. K. (2004). Product of power spectrum and group delay function for speech recognition. Proceedings of ICASSP. Ziółko, B., Gałka, J., Manandhar, S., Wilson, R., and Ziółko, M. (2007). Triphone statistics for Polish language. Proceedings of 3rd Language and Technology Conference, Poznań. Ziółko, B., Manandhar, S., and Wilson, R. C. (2006a). Phoneme segmentation of speech. Proceedings of 18th International Conference on Pattern Recognition. Ziółko, B., Manandhar, S., Wilson, R. C., and Ziółko, M. (2006b). Wavelet method of speech segmentation. Proceedings of 14th European Signal Processing Conference EUSIPCO, Florence. Ziółko, B., Manandhar, S., Wilson, R. C., and Ziółko, M. (2008a). Language model based on pos tagger. Proceedings of SIGMAP 2008 the International Conference on Signal Processing and Multimedia Applications, Porto. Ziółko, B., Manandhar, S., Wilson, R. C., and Ziółko, M. (2008b). Semantic modelling for speech recognition. Proceedings of Speech Analysis, Synthesis and Recognition. Applications in Systems for Homeland Security, Piechowice, Poland. Zue, V. W. (1985). The use of speech knowledge in automatic speech recognition. Proceedings of the IEEE, 73:1602–1615.