Speech Recognition of Highly Inflective Languages

Bartosz Ziółko

Ph.D. Thesis

This thesis is submitted in partial fulfilment of the requirements for the degree of Doctor of Philosophy.

Artificial Intelligence Group
Pattern Recognition and Computer Vision Group
Department of Computer Science
The University of York
United Kingdom

2009
Abstract
This PhD thesis combines various topics in speech recognition. There are two central hypotheses. The first is that it would be useful to incorporate phoneme segmentation information in speech recognition and that this task can be achieved by applying the discrete wavelet transform. The second is that adding semantics to language models for speech recognition improves recognition accuracy.

The research starts by analysing differences between English and Polish from the speech recognition point of view. English is a very typical positional language, while Polish is highly inflective. Part of the research focuses on the aspects of well-known solutions for English which should be changed, due to the linguistic differences, to improve recognition of Polish. These are mainly phoneme segmentation and semantic analysis. Phoneme statistics for Polish were gathered by the author, and a toolkit designed for English was applied to Polish.

Phoneme segmentation is more likely to be successful for Polish than for English because phonemes are easier to distinguish. A method based on the discrete wavelet transform was designed and tested by the PhD candidate.

Another part of the research focuses on finding new ways of modelling a natural language. Semantic analysis is crucial for Polish because syntax models are not very effective and are difficult to train due to the non-positionality of Polish. This part of the thesis describes an unsuccessful approach of using part-of-speech taggers for language modelling in speech recognition and a much better bag-of-words model. The latter is inspired by the well-known latent semantic analysis. It is, however, easier to train and does not need calculations on big matrices. The difference lies in a completely new approach to smoothing information in a word-topic matrix. Because of the morphological nature of the Polish language, this method captures not only semantic content but also some grammatical structure.
Contents

1 Introduction  16
  1.1 Contribution  16
  1.2 Thesis Overview  18
    1.2.1 Introduction and Literature Review  18
    1.2.2 Linguistic Aspects of Highly Inflective Languages Using Polish as an Example  18
    1.2.3 Phoneme Segmentation and Acoustic Models  18
    1.2.4 Language Modelling  19
2 Literature Review  20
  2.1 History of Speech Recognition  20
  2.2 Linguistic Rudiments of Speech Analysis  22
  2.3 Speech Processing  24
    2.3.1 Spectrum  24
  2.4 Speech Segmentation  27
  2.5 Phoneme Segmentation  28
  2.6 Speech Parametrisation  30
    2.6.1 Parametrisation Methods Based on Linear Prediction Coefficients  30
    2.6.2 Parametrisation Methods Based on Filter Banks  33
    2.6.3 Test Corpora and Baselines  36
    2.6.4 Comparison of the Methods  38
  2.7 Speech Modelling  39
  2.8 Natural Language Modelling  40
  2.9 Semantic Modelling  43
  2.10 Academic Applications  44
3 Linguistic Aspects of Polish  46
  3.1 Analysis of Polish from the Speech Recognition Point of View  46
  3.2 Triphone Statistics of Polish Language  47
  3.3 Description of a problem solution  48
  3.4 Methods, software and hardware  48
    3.4.1 Grapheme to Phoneme Transcription  50
    3.4.2 Corpora Used  51
    3.4.3 Results  51
  3.5 Analysis of Phonetic Similarities in Wrong Recognitions of the Polish Language  56
  3.6 Experimental Results on Applying HTK to Polish  57
  3.7 Conclusion  62
4 Phoneme Segmentation  63
  4.1 Analysis Using the Discrete Wavelet Transform  63
  4.2 General Description of the Segmentation Method  65
  4.3 Phoneme Detection Algorithm  68
  4.4 Fuzzy Sets for Recall and Precision  74
  4.5 Algorithm of Speech Segmentation Evaluation  75
  4.6 Comparison to Other Evaluation Methods  78
  4.7 Experimental Results of DWT Segmentation Method  78
  4.8 Evaluation for Different Types of Phoneme Transitions  80
  4.9 LogitBoost WEKA Classifier Speech Segmentation  83
  4.10 Experimental Results for LogitBoost  83
  4.11 Conclusion  85
5 Language Models  87
  5.1 POS Tagging  88
  5.2 Applying POS Taggers for Language Modelling in Speech Recognition  88
  5.3 Experimental Results of Applying POS Tags in ASR  89
  5.4 Bag-of-words Modelling  91
  5.5 Experimental Setup  95
  5.6 Training Algorithm  95
  5.7 Process of Finding The Most Similar Topics  97
  5.8 Example in English  98
  5.9 Recognition Using Bag-of-words Model  99
  5.10 Preliminary Experiment  99
  5.11 K-means On-line Clustering  100
  5.12 Experiment on Parliament Transcripts  103
  5.13 Preprocessing of Training Corpora  107
  5.14 Experiment with Literature Training Corpus  107
  5.15 Word Prediction Model and Evaluation with Perplexity  110
  5.16 Conclusion  110
6 Conclusions and Future Research  112
Appendices  114
List of References  115
List of Tables

2.1 Phoneme transcription in English - BEEP dictionary  23
2.2 Phoneme transcription in Polish - SAMPA  23
2.3 Comparison of the efficiency of the described methods. Asterisks mark methods appended to baselines (they could be used with most of the other methods). The methods without asterisks are new sets of features, different to the baselines  38
2.4 Speech recognition applications available on the Internet  44
3.1 Phonemes in Polish (SAMPA, Demenko et al. (2003))  49
3.2 Most common Polish diphones  54
3.3 Most common Polish triphones  55
3.4 Word recognition correctness for different speakers (the model was trained on adult male speakers only)  58
3.5 Errors in different types of utterances (for all speakers)  58
3.6 Errors in sentences (speakers AK1C1 and AK2C1 respectively)  58
3.7 Errors in digits  59
3.8 Errors in the most often wrongly recognised names and commands  60
3.9 Errors in the most often wrongly recognised names and commands (2nd part)  61
3.10 Names which appeared most commonly as wrong recognitions in the above statistics  61
3.11 Errors in the pronounced alphabet  62
4.1 Characteristics of the discrete wavelet transform levels and their envelopes  67
4.2 Types of events associated with a phoneme boundary. Mathematical conditions are based on the power envelope p^{en}_m(n), the rate-of-change information r_m(n), a threshold p of the distance between r_m(n) and p^{en}_m(n), and a threshold p_min of the minimal p^{en}_m(n), with β = 1. Values in the last four columns are for different DWT levels (the first one for the d1 level, the second one for the d2 level, the third for levels from d3 to d5 and the last one for the d6 level)  70
4.3 Comparison of fuzzy recall and precision with commonly used methods based on insertions and deletions for an exemplar word  79
4.4 Comparison of the proposed method using different wavelets  79
4.5 Comparison of some other segmentation strategies and the proposed method  79
4.6 Recall for different types of phoneme transitions  81
4.7 Precision for different types of phoneme transitions  82
4.8 F-score for different types of phoneme transitions. The scores above 0.5 are bolded  82
4.9 Experimental results for the LogitBoost classifier. The rows labelled boundary are for classifying segments representing boundaries. The rows named phoneme present grades for classifying segments inside phonemes which are not boundaries. From a practical point of view the boundary labels are important; the grades for phoneme labels are given just for reference  84
5.1 Results of applying the POS tagger to language modelling. First, a sentence in Polish is given, then the position of the correct recognition in the 10-best list. The description of the tagger grade for the correct recognition follows  90
5.2 Results of applying the POS tagger to language modelling. First, a sentence in Polish is given, then the position of the correct recognition in the 10-best list. The description of the tagger grade for the correct recognition follows (2nd part)  91
5.3 Results of applying the POS tagger on its training corpus. The first version of a sentence is the correct one, the second is a recognition using just HTK and the third uses HTK and POS tagging. Then the number of differences compared to the correct sentence were counted and summarised  92
5.4 Matrix S for the example with 4 topics and a row of S' for topic 3  98
5.5 Matrix D for the presented example  98
5.6 Experimental results for the pure HTK audio model, the audio model with LSA and the audio model with our bag-of-words model  101
5.7 44 sentences in the exact transcription used for testing by HTK and the bag-of-words model, with English translations  104
5.8 44 sentences in the exact transcription used for testing by HTK and the bag-of-words model, with English translations (2nd part)  105
5.9 44 sentences in the exact transcription used for testing by HTK and the bag-of-words model, with English translations (3rd part)  106
5.10 SED script for text preprocessing  108
5.11 Experimental results for the pure HTK audio model, the audio model with LSA and the audio model with our bag-of-words model trained on literature  109
5.12 Experimental results for the pure HTK audio model, the audio model with LSA and the audio model with our bag-of-words model trained on the enlarged literature corpus  109
5.13 Text corpora  109
List of Figures

2.1 Toy dog Rex - first working speech recognition system (USA 1920)  20
2.2 Scheme of speech recognition system  21
2.3 Typical current services offered by call centres with ASR (above) and its future (below)  22
2.4 Speech audibility and average human hearing band (Tadeusiewicz, 1988)  25
2.5 The example of Fourier spectrum amplitude  25
2.6 Frequency spectrum of speech in a linear and a non-linear scale  26
2.7 The cepstrum is the Fourier transform of the log of the power spectrum  27
2.8 The types of speech segmentation  27
2.9 Comparison of the frames produced by constant segmentation and phoneme segmentation  29
2.10 The list of speech feature extraction method types, grouped in two avenues: based on linear prediction coefficients (with PLP as the main one) and filter bank analysis (with MFCC as the main one)  30
2.11 fMPE transformation matrix from original low-dimensional feature vector into high-dimensional one  31
2.12 Mel frequency cepstrum coefficients  32
3.1 Phonemes in Polish in SAMPA alphabet  50
3.2 Frequency of diphones in Polish (each phoneme separately)  52
3.3 Space of triphones in Polish  53
3.4 Phoneme occurrences distribution  54
4.1 Wavelet transform outperforms STFT because it has higher resolution for higher frequencies  65
4.2 The discrete Meyer wavelet - dmey  66
4.3 Subband amplitude DWT spectra of the Polish word 'osiem' (eng. eight). The number of samples depends on a resolution level  66
4.4 Segmentation of the Polish word 'osiem' (eng. eight) based on DWT sub-bands. Dotted lines are hand segmentation boundaries, dashed lines are automatic segmentation boundaries, bold lines are envelopes and thin lines are smoothed rate-of-change  68
4.5 The event function versus time in ms of the word presented in Fig. 4.4. High event scores mean that a phoneme boundary is more likely  71
4.6 Simple examples of four events described in Table 4.2. They are characteristic for phoneme boundaries. Images present the power envelope p^{en}_m(n) and the rate-of-change information (derivative) r_m(n)  72
4.7 The general scheme of sets G with correct boundaries and A with detected ones. Elements of set A have a grade f(x) standing for the probability of being a correct boundary. In set G there can be elements which were not detected (in the left part of the set)  74
4.8 The example of phoneme segmentation of a single word. In the lower part the hand segmentation is drawn. Boundaries are represented by two indexes close to each other (sometimes overlapping). The upper columns present the example segmentation of the word done by a segmentation algorithm. All of the calculated boundaries are quite accurate but never perfect  75
4.9 Fuzzy membership  77
4.10 F-score of phoneme boundaries detection for transitions between several types of phonemes. Phoneme types 1-10 are explained in section 4.8 (1 - stops, 2 - nasal consonants, etc.)  81
5.1 Histogram of POS tagger probabilities for hypotheses which are correct recognitions  93
5.2 Histogram of POS tagger probabilities for hypotheses which are wrong recognitions  94
5.3 Ratio of correct recognitions to all for different probabilities from the POS tagger  94
5.4 Undirected, complete graph illustrating similarities between sentences  96
5.5 Histogram of probabilities received from the bag-of-words model for hypotheses which are correct recognitions  102
5.6 Histogram of probabilities received from the bag-of-words model for hypotheses which are wrong recognitions  102
5.7 Ratio of correct recognitions to all of them for different probabilities received from the bag-of-words model  102
Acknowledgments
I would like to begin by thanking my parents, not only for their unstinting support throughout my educational career, but also for encouraging me to pursue my PhD.

I feel very lucky that I had two supervisors to guide me through the research. I would like to thank Dr Suresh Manandhar and Dr Richard C. Wilson for their continued support, advice and constructive feedback. I am glad we published many papers together and that, thanks to them, I participated in several conferences. They were not only teachers but also good friends, helping me in my life in a new country, which was often less surprising and easier to understand because of them.

Appreciation goes to my assessor Dr Adrian Bors for his regular feedback on the progress of my research.

I had the privilege to meet many interesting people in the department. This provided an excellent environment for inventing my methods and algorithms. I am grateful for all the seminars and minor discussions in the corridors of our department. In particular, thanks go to Thimal Jasooriya for sitting in front of me for three long years and patiently answering all questions like 'Hey, how do you do this in LaTeX?' or 'Where is room 103?'. I also appreciate Ioannis Klapaftis's help regarding grammar parsers and graphs of collocations. I would like to thank Pierre Andrews for improving my knowledge not only in NLP but also in photography. Many thanks to Marcelo Romero Huertas. And finally, I am very glad that I met Marek Grześ, with whom I had so many exciting conversations about travels all over the world and who was a strong support for me in days when I had private problems. Many thanks to all other members of the department I have met during my studies.

My PhD would not have been completed without the help of many people outside the department. I would like to thank Professor Zdzisław Brzeźniak for our mathematical discussions over coffee. Appreciation goes to Professor Grażyna Demenko for providing the PolPhone software and to Dr Stefan Grocholewski for CORPORA. I would like to thank Dr Adam Przepiórkowski and Dr Maciej Piasecki for their help in the part of the research about POS taggers. I am also very glad for my close cooperation with Jakub Gałka in our research. Finally, many thanks to my father, Professor Mariusz Ziółko, for much useful feedback about my research papers and this thesis.
List of the candidate's publications. Parts of some of them were used in this thesis.
Conferences:
• M.P. Sellars, G.E. Athanasiadou, B. Ziółko, S.D. Greaves, A. Hopper, Simulation of Broadband FWA Networks in High-rise Cities with Linear Antenna Polarisation, The 14th IEEE
2003 International Symposium on Personal, Indoor and Mobile Radio Communications Proceedings - PIMRC, pp. 371-5. Beijing, China 2003.
• M. Ziółko, P. Sypka, B. Ziółko, Compression of Transmultiplexed Acoustic Signals, Proceedings of The 2004 International TICSP Workshop on Spectral and Multirate Signal Processing, pp.81-6. Vienna 2004.
• B. Ziółko, M. Ziółko, M. Nowak, P. Sypka, A suggestion of multiple-access method for
4G system, Proceedings of 47th International Symposium ELMAR-2005, pp. 327-30, Zadar,
Croatia 2005.
• M. Ziółko, B. Ziółko, A. Dziech, Transcription as a Speech Compression Method in Transmultiplexer System, 5th WSEAS International Conference on Multimedia, Internet and Video Technologies, Corfu, Greece 2005.
• B. Ziółko, M. Ziółko, M. Nowak, Design of Integer Filters for Transmultiplexer Perfect
Reconstruction, Proceedings of 13th European Signal Processing Conference EUSIPCO,
Antalya, Turkey 2005.
• M. Ziółko, M. Nowak, B. Ziółko, Transmultiplexer Integer-to-Integer Filter Banks, Proceedings of The First IFIP International Conference in Central Asia on Internet, The Next
Generation of Mobile, Wireless and Optical Communications Networks, Bishkek, Kyrgyzstan 2005.
• P. Sypka, B. Ziółko, M. Ziółko, Integer-to-Integer Filters in Image Transmultiplexers, Proceedings of 2006 Second International Symposium on Communications, Control and Signal
Processing, ISCCSP, Marrakech, Morocco 2006.
• P. Sypka, M. Ziółko and B. Ziółko, Lossy Compression Approach to Transmultiplexed
Images, 48th International Symposium ELMAR-2006, Zadar, Croatia.
• B. Ziółko, S. Manandhar, R.C. Wilson, Phoneme segmentation of speech, Proceedings of
ICPR 2006, Hong Kong, 2006.
• B. Ziółko, S. Manandhar, R.C. Wilson, M. Ziółko, Wavelet method of speech segmentation,
Proceedings of EUSIPCO 2006, Florence, Italy.
• P. Sypka, M. Ziółko, B. Ziółko, Robustness of Transmultiplexed Images, International
Conference Mixed Design of Integrated Circuits and Systems Mixdes, Gdynia, 2006.
• B. Ziółko, J. Gałka, S. Manandhar, R. C. Wilson, M. Ziółko, The use of statistics of Polish phonemes in speech recognition, Speech Signal Annotation, Processing and Synthesis,
Poznań, 2006.
• P. Sypka, M. Ziółko and B. Ziółko, Lossless JPEG-Base Compression of Transmultiplexed
Images, Proceedings of the 12th Digital Signal Processing Workshop, pp. 531-534. Wyoming 2006.
• M. Ziółko, P. Sypka, B. Ziółko, Application of 1-D Transmultiplexer to Images Transmission, Proceedings of the 32nd Annual Conference of the IEEE Industrial Electronics Society
IECON, pp. 3564-3567, Paris, France, 2006.
• M. Kotti, C. Kotropoulos, B. Ziółko, I. Pitas, V. Moschou, A Framework for Dialogue Detection in Movies, Proceedings of Multimedia Content Representation, Classification and
Security International Workshop, MRCS, Lecture Notes in Computer Science, vol. 4105, pp.
371-378, Istanbul, Turkey, 2006.
• P. Sypka, M. Ziółko, B. Ziółko, Approach of JPEG2000 Compression Standard to Transmultiplexed Images, Proceedings of the Visualization, Imaging, and Image Processing, VIIP,
Palma De Mallorca, Spain, 2006.
• B. Ziółko, J. Gałka, S. Manandhar, R. C. Wilson, M. Ziółko, Triphone Statistics for Polish Language, Proceedings of 3rd Language and Technology Conference, Poznań, Poland,
2007.
• B. Ziółko, S. Manandhar, R. C. Wilson, Fuzzy Recall and Precision for Speech Segmentation Evaluation, Proceedings of 3rd Language and Technology Conference, Poznań, Poland,
2007.
• B. Ziółko, S. Manandhar, R. C. Wilson, M. Ziółko, LogitBoost Weka Classifier Speech Segmentation, Proceedings of 2008 IEEE International Conference on Multimedia and Expo,
Hannover, Germany, 2008.
• B. Ziółko, S. Manandhar, R. C. Wilson, M. Ziółko, Language Model Based on POS Tagger, Proceedings of SIGMAP 2008 the International Conference on Signal Processing and
Multimedia Applications, Porto, Portugal, 2008.
• B. Ziółko, S. Manandhar, R. C. Wilson, M. Ziółko, J. Gałka, Application of HTK to the
Polish Language, Proceedings of IEEE International Conference on Audio, Language and
Image Processing, Shanghai, 2008.
• B. Ziółko, S. Manandhar, R. C. Wilson, M. Ziółko, Semantic Modelling for Speech Recognition, Proceedings of Speech Analysis, Synthesis and Recognition. Applications in Systems
for Homeland Security, Piechowice, Poland, 2008.
• B. Ziółko, S. Manandhar, R. C. Wilson, Bag-of-words Modelling for Speech Recognition,
Proceedings of International Conference on Future Computer and Communication, Kuala
Lumpur, Malaysia, 2009.
• B. Ziółko, M. Ziółko, Linguistic Calculations on Cyfronet High Performance Computers,
Proceedings of Conference of the High Performance Computers’ Users, Zakopane, Poland,
2009.
• B. Ziółko, J. Gałka, M. Ziółko, Phone, diphone and triphone statistics for Polish language,
Proceedings of SPECOM 2009, St. Petersburg, Russia, 2009.
• B. Ziółko, J. Gałka, M. Ziółko, Phoneme ngrams based on a Polish newspaper corpus, Proceedings of WORLDCOMP’09, Las Vegas, USA, 2009.
• B. Ziółko, J. Gałka, M. Ziółko, Phonetic statistics from an Internet articles corpus of Polish
language, Proceedings of Intelligent Information Systems, Kraków, Poland, 2009.
Journals:
• M.P. Sellars, G.E. Athanasiadou, B. Ziółko, S.D. Greaves, Opposite-sector uplink interference in broadband FWA networks in high-rise cities, The IEE Electronics Letters, vol. 40,
no. 17, pp. 1070-1, 2004.
• M. Ziółko, A. Dziech, R. Baran, P. Sypka, B. Ziółko, Transmultiplexing System for Compression of Selected Signals, WSEAS Transactions on Communications, issue 12, vol. 4, pp.
1427-1434, December 2005.
• M. Dyrek, J. Gałka and B. Ziółko, Measures On Wavelet Segmentation of Speech, International Journal Of Circuits, Systems And Signal Processing, NAUN 2008.
• J. Gałka and B. Ziółko, Study of Performance Evaluation Methods for Non-Uniform Speech
Segmentation, International Journal Of Circuits, Systems And Signal Processing, NAUN
2008.
List of Abbreviations
AMI - Augmented Multi-party Interaction
ANN - Artificial Neural Network
ASR - Automatic Speech Recognition
BEEP - British English Phonemic Transcription Dictionary
CML - Conditional Maximum Likelihood
CMU - Carnegie Mellon University
CUED - Cambridge University Engineering Department
DARPA - Defence Advanced Research Projects Agency
DBNs - Dynamic Bayesian Networks
DCT - Discrete Cosine Transform
DWT - Discrete Wavelet Transform
FBE - Filter Bank Energy
FFT - Fast Fourier Transform
fMPE - feature-space Minimum Phone Error
GSM - Global System for Mobile Communications
HLDA - Heteroscedastic Linear Discriminant Analysis
HMM - Hidden Markov Model
HTK - Hidden Markov Model Toolkit
IIS - Improved Iterative Scaling
LFCCs - Linear Frequency Cepstrum Coefficients
LM - Language Models
LPCC - Linear Prediction Coefficients
LSA - Latent Semantic Analysis
MaxEnt - Maximum Entropy
MFCC - Mel Frequency Cepstrum Coefficients
MFMGDCCs - Mel Frequency Modified Group Delay Cepstral Coefficients
MFPSCCs - Mel Frequency Product Spectrum Cepstral Coefficients
MGDCCs - Modified Group Delay Cepstral Coefficients
MLLR - Maximum Likelihood Linear Regression
MMSE - Minimum Mean Square Error
MPE - Minimum Phone Error
PLP - Perceptual Linear Predictive
PMF - Probability Mass Function
POS - Part Of Speech
RASTA - Relative Spectral
RCs - Reflection Coefficients
SAT - Speaker Adaptive Training
SED - Stream Editor
SHLDA - Smoothed Heteroscedastic Linear Discriminant Analysis
SNR - Signal to Noise Ratio
SPLICE - Stereo-based Piecewise Linear Compensation for Environments
STFT - Short Time Fourier Transform
SVD - Singular Value Decomposition
TIMIT - Texas Instruments/Massachusetts Institute of Technology
VTLN - Vocal Tract Length Normalisation
WER - Word Error Rate
Declaration
This thesis has not previously been accepted in substance for any degree and is not being concurrently submitted in candidature for any degree other than Doctor of Philosophy of the University
of York. This thesis is the result of my own investigations, except where otherwise stated. Other
sources are acknowledged by explicit references.
I hereby give consent for my thesis, if accepted, to be made available for photocopying and for
inter-library loan, and for the title and summary to be made available to outside organisations.
Chapter 1
Introduction
As information technology has an impact on more and more aspects of our lives with every year, the problem of communication between human beings and information processing devices becomes increasingly important. Up to now, such communication has almost entirely been through the use of keyboards and screens, but speech is the most widely used, most natural and fastest means of communication for people. Moreover, mobile computing devices are becoming increasingly small. The lower limit lies not in integrated circuit design size but, simply, in the size a human can operate with their fingers. There are also more and more hands-free computer systems, such as in-car systems. We must redefine traditional methods of human-computer and human-machine interaction. Unfortunately, machine capabilities for interpreting speech are still poor in comparison to what a human can achieve, even though we can predict that automatic speech recognition (ASR) will become a very pervasive technology (Alewine et al., 2004).
1.1 Contribution
An aim of our research was to improve the accuracy of speech recognition and to find the elements which might be especially effective in the ASR of highly inflective and non-positional languages like Polish. English is a very different language in some aspects, and some of these differences have an impact on speech recognition systems. The part-of-speech (POS) structure is much more regular in English than in Polish, which means it is much more predictable. A word can change its POS meaning depending on its position; for example, we understand all nouns located to the left of another noun as adjectives. In Polish such a change is marked by morphology rather than position. English has many short forms, including pronouncing many vowels weakly as /ə/ (schwa) and skipping several letters in longer words. There are also some Polish phonemes which do not exist in English, and the other way around. As it is a wide field, research was conducted on chosen elements.
As a part of our research we carried out practical, linguistic studies of the differences between Polish and English. Phonetic statistics for Polish were collected and analysed. These statistics helped in further work. Among other things, a recogniser for Polish was trained and tested using the Hidden Markov Model Toolkit (HTK). The model we created was trained from real data for all diphones in Polish and, by HTK scripts, for all triphones in a synthesised way using the statistics that we collected. The system can be adapted to any vocabulary; however, it does not work efficiently for large vocabulary tasks.
One of the possible improvements in ASR lies in detecting phoneme boundaries. This information is typically ignored in existing solutions: speech is usually analysed in frames of constant length. Analysing separate phonemes would be much more accurate. One can quite easily locate phoneme boundaries by observing spectrograms or discrete wavelet transform (DWT) spectra of speech; however, it is very difficult to give an exact algorithm to find them. Constant segmentation benefits from simplicity of implementation and the simple comparison of blocks of the same length. However, it is perceptually unnatural: human phonetic categorisation is very poor for such short segments (Morgan et al., 2005), and phonemes have different lengths. Moreover, boundary effects introduce additional distortions, and framing creates more boundaries than phoneme segmentation. We have to consider these boundary effects, which can cause errors; obviously, a smaller number of boundaries means smaller errors due to the mentioned effects. Constant segmentation therefore risks losing information about the phonemes by merging different sounds into single blocks, losing phoneme length information and losing the complexity of individual phonemes. Phoneme duration can also be used as an additional parameter in speech recognition, improving the accuracy of the whole process (Stöber and Hess, 1998).
There has been very little interest in using POS tags in ASR. We investigated their application in ASR. POS tag trigrams, a matrix grading possible neighbourhoods or a probabilistic tagger can be created and used to predict the word being recognised based on the left context analysed by a POS tagger. Another innovation in speech recognition is based on semantic analysis as the very last step of the process. It can be applied as an additional measure to choose a non-first hypothesis from an n-best list of audio model recognition hypotheses if the first one does not fit the semantic content. It is not possible to recognise speech using acoustic information only. The human perception system is based upon catching context, structure and understanding combined with recognition. It is much easier to recognise and repeat without any errors a heard sentence if it is in a language we understand, compared to a sentence in a language we are not familiar with. Language modelling can therefore improve recognition substantially.
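The left-context idea can be made concrete with a toy sketch. The tag set, the counts and the smoothing constant below are illustrative assumptions, not the tagger or corpus used in this work; the sketch only shows how smoothed POS-trigram probabilities could grade competing hypotheses.

    # Hedged sketch: grading an ASR hypothesis with POS-tag trigram statistics.
    # All counts and the tag set are toy values invented for illustration.
    from collections import Counter

    trigram_counts = Counter({
        ("ADJ", "NOUN", "VERB"): 40,
        ("NOUN", "VERB", "NOUN"): 55,
        ("VERB", "NOUN", "ADJ"): 5,
    })
    bigram_counts = Counter()
    for (t1, t2, t3), c in trigram_counts.items():
        bigram_counts[(t1, t2)] += c

    def trigram_prob(t1, t2, t3, alpha=1.0, tagset_size=30):
        """Add-alpha smoothed P(t3 | t1, t2)."""
        return ((trigram_counts[(t1, t2, t3)] + alpha) /
                (bigram_counts[(t1, t2)] + alpha * tagset_size))

    def score_hypothesis(tag_sequence):
        """Product of smoothed trigram probabilities over a tagged hypothesis."""
        score = 1.0
        for i in range(2, len(tag_sequence)):
            score *= trigram_prob(*tag_sequence[i - 2:i + 1])
        return score

    # The hypothesis with the more plausible tag sequence receives the higher grade.
    print(score_hypothesis(["ADJ", "NOUN", "VERB", "NOUN"]))
    print(score_hypothesis(["VERB", "NOUN", "ADJ", "NOUN"]))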
We decided to focus on using information which has not been used, or not commonly used, in speech recognition until now. POS tags were not applied before, as English can be modelled efficiently using context-free grammars. In the case of Polish, it is very difficult to provide tree structures which represent all possible sentences, as the order of words can vary significantly. We thought that Polish could be modelled using POS tags because some tags are much more probable in the context of some others. Unfortunately, experiments showed that POS information is too ambiguous to be used in the way we proposed.
Semantic analysis is generally very difficult due to information sparsity problems. We believe that this is why it has not been used very commonly in existing ASR systems: language models based on grammar structure were quite efficient for English, and there was no necessity of using semantic analysis. In the case of Polish, semantic information has to be included in a language model due to syntactic irregularities. A bag-of-words model was developed. It applies word-topic statistics to re-rank a list of hypotheses from models of lower levels.
1.2 Thesis Overview
We investigated several new elements of ASR systems, with special interest in highly inflective and non-positional languages like Polish. This includes non-constant segmentation for acoustic modelling. We have analysed some aspects of Polish, as a representative of highly inflective languages, to choose the best approach to ASR for this language. Apart from this, we investigate introducing POS tagging and semantic information analysis into ASR systems.
1.2.1 Introduction and Literature Review
In the first chapter we will introduce the general aspects of the research areas that are involved in ASR. Specifically, we pay attention to previous work concerning signal processing methods like the DWT, speech segmentation and parametrisation, pattern recognition, language modelling (for example hidden Markov models (HMM) and n-grams) and natural language processing (NLP), mainly lexical semantics, POS tagging and latent semantic analysis (LSA). Some literature in linguistics, mathematical analysis, probability and information theory is also considered.
1.2.2 Linguistic Aspects of Highly Inflective Languages Using Polish as an Example
This chapter will focus on the linguistic background (Ostaszewska and Tambor, 2000) which is useful for ASR. Linguists have provided many basic assumptions in the methodology of recognising English. As we aim at creating an ASR system for Polish, a similar analysis should be done, because these two languages vary in some aspects. This chapter will summarise phonological knowledge about sounds in Polish, pronunciation rules and grammatical phenomena related to rich morphology. A Polish text corpus was analysed to find information about phoneme statistics. We were especially interested in triphones, as they are commonly used in many speech processing applications like the HTK speech recogniser. An attempt to create the full list of triphones for the Polish language is presented. A vast amount of phonetically transcribed text was analysed to obtain the frequency of triphone occurrences. The distribution of the frequency of triphone occurrence and other phenomena are presented. The standard phonetic alphabet for Polish and methods of providing phonetic transcriptions are described as well. The ASR system for Polish based on HTK is described with a detailed analysis of the errors it made.
1.2.3 Phoneme Segmentation and Acoustic Models
Speech has to be split into some units to be analysed. The most common way is to use constant-time framing with overlapping. Phoneme segmentation is another approach, which may greatly improve acoustic models if phoneme boundaries are detected correctly. We will present our own segmentation method, evaluation method and the way to apply them in ASR.
The localisation of phoneme boundaries is useful in several speech analysis tasks and in particular for speech recognition. Here it enables the use of more accurate acoustic models, since the lengths of phonemes are known and more accurate information is provided for parametrisation. Our method compares the values of power envelopes and their first derivatives for six frequency subbands. Specific scenarios which are typical of phoneme boundaries are searched for. Discrete times with such events are noted and graded using a distribution-like event function. The final decision on the localisation of boundaries is taken by analysing the event function. Boundaries are therefore extracted using information from all the subbands. The method was developed on a small set of hand-segmented Polish words and tested on another, large corpus containing 16425 utterances. A recall and precision measure specifically designed to assess the quality of speech segmentation was adapted by using fuzzy sets; from this, results with an f-score equal to 72.49% were obtained. A statistical classification method was also used to check which features are useful and also served as a baseline for comparison with the new method.
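As a rough illustration of the subband analysis step, and not the exact algorithm or settings described later in Chapter 4, the following sketch decomposes a signal into DWT detail levels and computes a smoothed power envelope and its rate of change per level. The wavelet name, the smoothing window and the toy signal are assumptions, and the third-party PyWavelets (pywt) package is assumed to be available.

    # Hedged sketch: per-level power envelopes and their first differences.
    import numpy as np
    import pywt

    def subband_envelopes(signal, wavelet="dmey", levels=6, smooth=32):
        coeffs = pywt.wavedec(signal, wavelet, level=levels)
        # coeffs[0] is the approximation; coeffs[1:] are detail levels d6 .. d1.
        window = np.ones(smooth) / smooth
        envelopes, derivatives = [], []
        for d in coeffs[1:]:
            power = d ** 2
            env = np.convolve(power, window, mode="same")     # smoothed power envelope
            envelopes.append(env)
            derivatives.append(np.diff(env, prepend=env[0]))  # rate of change
        return envelopes, derivatives

    # Toy usage: a synthetic signal whose character changes half-way through,
    # mimicking a phoneme transition.
    fs = 16000
    t = np.arange(fs) / fs
    sig = np.concatenate([np.sin(2 * np.pi * 200 * t[: fs // 2]),
                          0.3 * np.random.randn(fs // 2)])
    envs, rates = subband_envelopes(sig)
    # A large rate of change relative to the envelope hints at a boundary event.
    print([float(np.abs(r).max()) for r in rates])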
1.2.4 Language Modelling
Language models are necessary for any large vocabulary speech recogniser. There are two main
types of information which can be used to support the modelling of a language: syntactic and
semantic. One of the ways to apply syntactic modelling is to use POS taggers. Morphological
information can be statistically analysed to provide the probability of a sequence of words using
their POS tags.
This chapter covers methods of POS tagging and the POS-tagged data available for Polish. We present our own method of applying taggers and POS-tag statistics to ASR as a part of language modelling. Unfortunately, experiments showed that this type of modelling is not effective.
Semantic analysis can be done in many different ways and has already been applied in ASR. However, this kind of modelling is difficult due to the data sparsity problem. The literature always mentions semantic analysis as a necessary step in ASR, but it is very difficult to find any research papers which provide results concerning the exact impact of applying semantic methods on recognition. We investigate LSA and present our own method, which was shown to be more effective in experiments. The invented model differs from LSA in the way the word-topic matrix is smoothed. Our method trains a model faster than the widely known LSA and is more efficient.
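A minimal sketch of the re-ranking idea follows. The toy vocabulary, word-topic counts and hypotheses are invented for illustration and do not reproduce the word-topic matrix or the smoothing scheme described later in Chapter 5.

    # Hedged sketch: re-ranking an n-best list with word-topic statistics.
    import numpy as np

    vocab = ["parliament", "vote", "law", "goal", "match", "player"]
    # Rows: words, columns: topics (0 = politics, 1 = sport); toy counts.
    word_topic = np.array([
        [9, 0],
        [7, 1],
        [8, 0],
        [0, 8],
        [1, 9],
        [0, 7],
    ], dtype=float)
    word_index = {w: i for i, w in enumerate(vocab)}

    def topic_vector(words):
        """Sum the topic rows of the words that are in the vocabulary."""
        vec = np.zeros(word_topic.shape[1])
        for w in words:
            if w in word_index:
                vec += word_topic[word_index[w]]
        return vec

    def semantic_score(hypothesis, context):
        """Cosine similarity between hypothesis and left-context topic vectors."""
        h, c = topic_vector(hypothesis.split()), topic_vector(context.split())
        denom = np.linalg.norm(h) * np.linalg.norm(c)
        return float(h @ c / denom) if denom else 0.0

    context = "parliament passed the law"
    nbest = ["the player scored a goal", "the members cast a vote"]
    # Pick the hypothesis that best fits the semantic content of the context.
    print(max(nbest, key=lambda hyp: semantic_score(hyp, context)))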
Chapter 2
Literature Review
This chapter presents the history of research on speech recognition and some of the details of more up-to-date publications. ASR is a very wide area, so only a selection of topics from this field, studied during the author's PhD, is presented.
2.1 History of Speech Recognition
To begin with, we should define what an ASR system is. Because of the variety of applied methods and approaches, it is difficult to define it by describing how it works. It is better to say that an ASR system is software which changes an acoustic signal into a sequence of symbols. Speech is the input, while a sequence of written words is the output. Obviously this definition covers a vast area of applications. We can distinguish systems trained for a given user only from those which are speaker independent. A system can be dedicated to continuous speech or to discrete word recognition. Some applications assume that speech is clear (or rather clear enough), while some are dedicated to working in a factory or at an airport, where noise is a crucial issue. Finally, the size of the vocabulary is a feature of a system. There are quite different approaches for speech recognition with a small, limited vocabulary and with a large vocabulary (especially with an unlimited dictionary).
To give a proper background, we would like to place speech recognition research in time. The invention of the phonograph by Thomas Edison in 1877 can be considered the very first step towards creating an ASR system. More precisely, the phonograph was the first audio recording tool, capturing acoustic waves in a form that allowed further processing. Another important milestone was set by the Swiss linguist Ferdinand de Saussure, who described general rules of linguistics, which were collected and printed by his students and colleagues after his death in 1916 (de Saussure, 1916). His ideas became the rudiments of modern linguistics and NLP. Then, quite surprisingly, we can speak about the first working ASR system in 1920. It was a celluloid toy dog developed by Walker Balke and National Company Inc., presented in Fig. 2.1. The dog was attached to the turntable of a phonograph and could jump out of its kennel when detecting its own name 'Rex'. The mechanism was controlled by a resonant reed: the vowel /e/ was in fact detected by a metal bar arranged to form a bridge and sensitive to acoustic energy around 500 Hz, which made it vibrate, interrupting the current and releasing the dog.

Figure 2.1: Toy dog Rex - first working speech recognition system (USA 1920)

Figure 2.2: Scheme of speech recognition system
In 1952 Bell Labs created a digit recogniser (Davis et al., 1952). It was based on an analysis of the spectrum divided into two frequency bands (above and below 900 Hz). It recognised digits with an error of less than 2%, provided the user did not change the position of the head with respect to the microphone between training and testing. In the sixties there were two important inventions: the fast Fourier transform (FFT) (Tukey et al., 1963) and the HMM (Rabiner, 1989), which have had a crucial impact on current ASR systems. There was a growing interest in speech recognition, which resulted in running the ARPA Speech Understanding Project in 1971. This ambitious and well-funded ($15M) project addressed connected word recognition with a vocabulary size of around 1000 words. It resulted in the CMU Harpy system (Lowerre, 1976) with a 5% sentence error rate. Thanks to the project, the seventies were a time of rapid improvements in ASR. The Viterbi algorithm for model training was developed between 1967 and 1973 (Viterbi, 1967; Forney, 1973). In 1975, linear predictive coding, the first successful speech parameterisation method, was invented (Makhoul, 1975). Further research in speech recognition has a larger impact on this dissertation, so it will be described in more detail in the following sections.
The general scheme of ASR was created in the eighties. It has survived to the present day with only small differences. All the most important steps are presented in Fig. 2.2, which is based on (Rabiner and Juang, 1993). Our research is focused on segmentation and semantic analysis, so these will be described in detail. Some other topics are connected very closely, so they have also been described. Some topics which are not crucial for our research have been omitted because of the limit on the thesis size. The first of these is the whole large field of pre-processing, including noise reduction, feature compensation and missing-feature approaches. There are too many papers about this topic to describe that step of speech recognition even succinctly. Many of them are very well summarised in (Raj and Stern, 2005).

Figure 2.3: Typical current services offered by call centres with ASR (above) and its future (below)

ASR can save around 60% of the time spent working with a computer, through automatic transcription and dictation rather than typing, as we are able to speak about three times faster than we can type. Sophisticated ASR systems are becoming more important, as customer services need to be more friendly while the costs of running call centres need to be kept at a minimum level (Fig. 2.3). An ASR system may also provide incredibly efficient lossy compression for communications, if recognition is seen as coding and speech synthesis as decoding.
2.2 Linguistic Rudiments of Speech Analysis
It is essential to understand the rudiments of the speech generation process in order to do research on digital speech analysis. Speech signals consist of sound sequences, which we interpret as representations of information. Phonetics is the science which classifies these sounds. Most languages, including English and Polish, can be described in terms of a set of distinctive sounds - phonemes. Both languages consist of around 40 phonemes; however, some of them exist in English and do not exist in Polish, and the other way round. They are grouped into vowels and consonants (nasals, stops and fricatives). The British English phoneme transcription presented in Table 2.1 is based on the BEEP dictionary (Beep dictionary, 2000), which is commonly used by speech recognisers like HTK (Young et al., 2005).
Table 2.1: Phoneme transcription in English - BEEP dictionary

transcription  example                    transcription  example
aa             odd                        ae             at
ah             hut                        ao             ought
aw             cow                        ax             abaft (first vowel, schwa)
ay             hide                       ea             wear
eh             Ed                         er             hurt
ey             ate                        ia             fortieth
ih             it                         iy             teen
oh             mob                        ow             lobe
oy             toy                        ua             intellectual
uh             nook                       uw             two
p              pick                       b              be
t              tip                        d              dee
f              fee                        v              vise
th             thick                      dh             thee (eth)
s              sick                       z              zip
sh             ship                       zh             seizure
ch             cheese                     jh             jeep
k              key                        ng             rang (engma)
g              green                      m              me
n              new                        l              lee
r              ream                       w              win
y              you                        hh             he
Table 2.2: Phoneme transcription in Polish - SAMPA

i  I  e  a  o  u  e~  o~  j  l  w  r  m  n  n'  N  v  f  x
z  s  z'  s'  Z  S  dz  ts  dz'  ts'  dZ  tS  b  p  d  t  g  k
It contains 20 vowels and 24 consonants. Polish phoneme transcription is typically presented in SAMPA notation (Ostaszewska and Tambor, 2000), as in Table 2.2, with 37 or 39 phonemes.
Irregularities of pronunciation and linguistic rules are a real challenge for speech recognition. Many words sound similar, especially in English; they are called homophones (e.g. night and knight). What is more, there are even sentences which sound very similar (e.g. 'I helped Apple wreck a nice beach' and 'I helped Apple recognise speech'). Another problem is caused by the context dependency of phonemes. As we said, there are around 40 different phonemes, but actually all of them vary at the beginning and at the end, depending on neighbouring phonemes. Such triples are the so-called triphones. Around 40% of the possible phoneme combinations exist, which gives 25600 possible patterns to recognise (0.4 × 40³ = 25600). There are no trivial methods for such a number. Unfortunately, this is not the only problem. Phonemes overlap at their boundaries, and there is co-articulation of phonemes and words. Intonation and sentence stress play an important role in interpretation. The utterances 'go!', 'go?' and 'go.' can clearly be recognised by a human but are difficult for a computer. In naturally spoken language there are no pauses between words, and it is difficult for a computer to decide where boundaries lie. This is why a general speech recognition system requires human knowledge and experience, as well as advanced pattern recognition and artificial intelligence.
2.3 Speech Processing
Speech carries information. This is quite obvious, but very often we forget that our brain has to decode speech on many different levels to extract the actual information. We have to do the same using computers. We understand speech processing as representing and transforming the waveform signal. For practical reasons, this is usually done in the frequency domain, where the coded information is easier to find.
2.3.1 Spectrum
Originally, a spectrum was what is now called a spectre, for example a phantom or an apparition. In the 17th century the word spectrum was introduced into optics, referring to the range of colours observed when white light was dispersed through a prism. A sound spectrum is a representation of a sound in terms of the amount of vibration at each individual frequency. It is usually presented as a graph of either power or pressure as a function of frequency. The power or pressure is measured in decibels and the frequency is measured in vibrations per second - Hertz [Hz].

It is important for any research on speech that speech is quite a specific audio signal, which can be distinguished by its pressure and frequency range, as presented in Fig. 2.4, copied from (Tadeusiewicz, 1988). There is no point in analysing other frequencies. Similarly, a given range of acoustic pressure can be expected. We can limit the analysis to the subband of around 80-8000 Hz. This observation has already been used very successfully, for example in GSM mobile phones.
In 1807, Jean Baptiste Joseph Fourier described his method of analysing heat propagation. It was very controversial and was judged negatively by a committee of the Paris Institute which consisted of many famous mathematicians. The first objection, made by Lagrange and Laplace in 1808, was to Fourier's expansions of functions as trigonometrical series, what we now call the Fourier series. Other objections were connected to the equations of heat transfer. The Fourier spectrum (Fig. 2.5) is currently a basic and very common tool for analysing many types of stationary signals. A stationary signal is a signal that repeats into infinity with the same periodicity. The spectral representation of a signal is calculated as

\hat{s}(f) = \int_{-\infty}^{\infty} s(t) \exp(-2\pi j f t)\, dt.   (2.1)
Figure 2.4: Speech audibility and average human hearing band (Tadeusiewicz, 1988)
Figure 2.5: The example of Fourier spectrum amplitude
Figure 2.6: Frequency spectrum of speech in a linear and a non-linear scale
The function \hat{s}(f) defines the notion of a global frequency f in a signal. It is computed as inner products of the signal and the trigonometric functions \cos(2\pi f t) - j \sin(2\pi f t) (from Euler's formula), used as basis functions of infinite duration in (2.1). Any non-stationarity is spread out over the whole frequency axis in \hat{s}(f). Therefore, non-stationary signals require changes in the analysis method. A non-stationary signal has to be windowed to be analysed by the Fourier transform. The original method was improved in 1965 by Cooley and Tukey (Cooley and Tukey, 1965), who found an algorithm to calculate the spectrum in fewer steps; it is known as the fast Fourier transform (FFT). The transform is then calculated locally for a given window over which the signal is approximately stationary, by repeating that part and creating a periodic function. This approach is usually called the short time Fourier transform (STFT). Another way is to replace the basis functions used in the Fourier transform (trigonometric functions) with ones more concentrated in time and less concentrated in frequency. This way of thinking leads to wavelet transforms.
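For illustration, a minimal numerical counterpart of (2.1) is sketched below: the FFT of a short, Hamming-windowed frame. The frame length and the test tone are assumptions made for this example only.

    # Hedged sketch: amplitude spectrum of a windowed frame via the FFT.
    import numpy as np

    fs = 16000                                               # sampling frequency in Hz
    frame = np.sin(2 * np.pi * 440 * np.arange(512) / fs)    # 512-sample test frame
    windowed = frame * np.hamming(512)                       # window to reduce boundary effects
    spectrum = np.fft.rfft(windowed)                         # discrete counterpart of Eq. (2.1)
    freqs = np.fft.rfftfreq(512, d=1 / fs)
    amplitude = np.abs(spectrum)

    # The amplitude peaks at the frequency bin closest to the 440 Hz tone.
    print(freqs[np.argmax(amplitude)])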
Human perception systems work on a non-linear scale; for example, it is much easier to perceive a candle in a dark room than in a lit one. Perception depends on background and reference. This is why we can say that the natural scale for humans is the logarithmic one. The most common consequence of this fact is the use of decibels [dB]. For the same reason, we sometimes use the mel frequency scale in speech analysis, rather than the standard linear one in Hz. Frequency in mels is defined as

f_{mel} = 1000 \log_2 \left( 1 + \frac{f_{Hz}}{1000} \right).   (2.2)
The comparison of two frequency scales is presented in Fig. 2.6.
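The mapping (2.2) is straightforward to compute; a short sketch of this particular log2 variant follows (other mel formulas exist in the literature, but only (2.2) is implemented here).

    # Direct implementation of Eq. (2.2) relating linear frequency in Hz to mels.
    import numpy as np

    def hz_to_mel(f_hz):
        return 1000.0 * np.log2(1.0 + f_hz / 1000.0)

    # 1000 Hz maps to 1000 mel by construction; higher frequencies are compressed.
    print(hz_to_mel(np.array([100.0, 1000.0, 4000.0, 8000.0])))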
The need for nonlinearity in ASR led to the creation of the term 'cepstrum', formed from 'spectrum' by reversing its first four letters. The term was introduced by Tukey et al. in 1963 (Tukey et al., 1963). It has come to be the accepted terminology for the inverse Fourier transform of the logarithm of the power spectrum of a signal, \int_{-\infty}^{\infty} \log |\hat{s}(f)|^2 \exp(2\pi j f t)\, df. It was simplified by changing the inverse transform into a forward one, which does not change the basic idea (Rabiner and Schafer, 1978).
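A minimal sketch of this simplified (forward-transform) cepstrum on a toy frame is given below; the test signal and the small constant added before taking the logarithm are assumptions for illustration.

    # Hedged sketch: cepstrum as the Fourier transform of the log power spectrum.
    import numpy as np

    fs = 16000
    frame = np.sin(2 * np.pi * 200 * np.arange(1024) / fs) * np.hamming(1024)
    power_spectrum = np.abs(np.fft.fft(frame)) ** 2
    # A small constant avoids taking the logarithm of zero in silent bins.
    cepstrum = np.fft.fft(np.log(power_spectrum + 1e-12))
    print(np.abs(cepstrum[:10]))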
Figure 2.7: The cepstrum is the Fourier transform of the log of the power spectrum
Figure 2.8: The types of speech segmentation
2.4 Speech Segmentation
In the vast majority of approaches to speech recognition, the speech signal needs to be divided into segments before recognition can take place. The properties of the signal contained in each segment are then assumed to be constant, or in other words to be characteristic of a single part of speech. Speech segmentation is easier than image segmentation (Nasios and Bors, 2005), as it has to be done in one dimension only.
There are different meanings of segmentation, though (Fig. 2.8). Very often the term is used for word segmentation, which can be done by Viterbi and forward-backward segmentation (Demuynck and Laureys, 2002). Another applied method (Subramanya et al., 2005) is based on the mean and variance of spectral entropy. Another issue covered by the same name, segmentation, is separating silence and speech in an audio recording (Zheng and Yan, 2004). That method uses so-called TRAPS-based segmentation and Gaussian mixture based segmentation (Nasios and Bors, 2006). Segmentation here means mainly removing non-speech events and additionally clustering according to speaker identities, environmental and channel conditions. Another possible segmentation is by phonetic features (not necessarily phonemes) (Tan et al., 1994), by applying wavelet analysis, which will be described in more detail in this dissertation. There also exists research on syllable segmentation (Villing et al., 2004). Another meaning is segmenting according to partially correct transcriptions (Cardinal et al., 2005); in this case segmentation is combined with recognition. Finally, we can understand segmentation as a process of breaking audio into phonemes (Grayden and Scordilis, 1994), where segmentation was conducted by analysing filter bank energy contours. In our research (Ziółko et al., 2006a,b), we find that phoneme segmentation is the most important, and this is why we will use the word 'segmentation' to mean phoneme segmentation unless stated otherwise. Phoneme segmentation and its usefulness in speech recognition will be described in more detail in the next chapter.
Naturally, if a frame contains the end of one phoneme and the beginning of another, it will cause recognition difficulties. Segmentation methods currently used in ASR are not particularly sophisticated. For example, they do not consider where phonemes begin and end; this causes conflicting information to appear at the boundaries of phonemes. Non-uniform phoneme segmentation can be useful in ASR for more accurate modelling (Glass, 2003).
2.5 Phoneme Segmentation
Constant-time segmentation or framing, for example into 23.2 ms blocks (Young, 1996), is commonly used to divide the speech signal for processing. This method benefits from simplicity of
implementation and easy comparison of blocks, which are of the same length. However, it is
perceptually unnatural, because of the variation in the duration of real phonemes. In fact, human
phonetic categorisation is also very poor for such short segments (Morgan et al., 2005). Moreover, boundary effects provide additional distortions (which are partially reduced by applying a Hamming window), and framing with such short segments creates many more boundaries than there are phonemes in the speech. These boundary effects can cause errors in speech recognition because of the mixing of two phonemes in a single frame. A smaller number of boundaries means
a smaller number of errors due to the aforementioned effects. Constant segmentation therefore,
while straightforward and efficient, risks losing valuable information about the phonemes due to
the merging of different sounds into a single block and because the complexity of individual phonemes cannot be represented in short frames. The length of a phoneme can also be used as an additional parameter in speech recognition, improving the accuracy of the whole process. A comparison of applying constant framing and phoneme segmentation is presented in Fig. 2.9. Models
based on processing information over long time ranges have already been introduced. The RASTA
(RelAtive SpecTrAl) methodology (Hermansky and Morgan, 1994) is based on relative spectral
analysis and the TRAPs (TempoRAl Patterns) approach (Morgan et al., 2005) is based on multilayer perceptrons with the temporal trajectory of logarithmic spectral energy as the input vector. It
allows the generation of class posterior probability estimates.
A number of approaches have been suggested (Stöber and Hess, 1998; Grayden and Scordilis,
1994; Weinstein et al., 1975; Zue, 1985; Toledano et al., 2003) to find phoneme boundaries from
the time-varying speech signal properties. These approaches utilise features derived from acoustic
knowledge of the phonemes. For example, the solution presented in (Grayden and Scordilis, 1994)
analyses a number of different subbands in the signal using its spectra. Phoneme boundaries are
extracted by comparing the percentage of signal power in different subbands. The Toledano et al.
(Toledano et al., 2003) approach is based on spectral variation functions. Such methods need to be
optimised for particular phoneme data and cannot be performed in isolation from phoneme recognition itself. Neural networks (NN) (Suh and Lee, 1996) have also been tested, but they require
time-consuming training. Segmentation can also be performed by segment models (SM) (Ostendorf et al., 1996; Russell and Jackson, 2005) instead of the HMM. The SM solution differs from the HMM by searching for paths through sequences of frames of different lengths rather than through single frames. It
means that segmentation and recognition are conducted at the same time and there is a set of possible observation lengths.

Figure 2.9: Comparison of the frames produced by constant segmentation and phoneme segmentation

In a general SM, the segmentation is associated with a likelihood and in
fact describes the likelihood of a particular segmentation of an utterance. The SM for a given label
is also characterised by a family of output densities which gives information about observation
sequences of different lengths. These features of the SM solution allow boundaries to be located only at several fixed positions which depend on the framing (at integer multiples of the frame length).
The typical approach to phoneme segmentation for creating speech corpora is to apply dynamic
programming (Rabiner and Juang, 1993; Holmes, 2001). Dynamic programming is a tool which
guarantees to find the cumulative distance along the optimum path without having to calculate
the distance along all possible paths. In speech segmentation it is used for time alignment of
boundaries. The common practice is to provide a transcription done by professional phoneticians
for one of the speakers in the given corpus. Then it is possible to automatically create phoneme
segmentation of the same utterances for other speakers. This method is very accurate but demands a transcription and hand segmentation to start with. For this reason it is not very useful for any
application other than creating a corpus.
There are several types of speech segmentation and several approaches to most of them, so it is natural to want to compare them. Surprisingly, evaluation methods for speech segmentation are quite simple and do not consider all scenarios. There are several suggested evaluation methods, but they are usually developed for particular solutions, are not very universal and lose some accuracy through their simplifications. Typically, evaluation is based on counting the number of insertions, deletions and substitutions of the automatic segmentation with respect to a hand-checked reference transcription. The automatic word segmentation (Demuynck and Laureys, 2002) was evaluated by counting the number of boundaries for which the deviation between automatic and manual segmentation exceeded thresholds of 35, 70 and 100 ms. The syllable segmentation (Villing et al., 2004) was evaluated by counting the number of insertion and deletion errors within a tolerance of 50 ms before and after a reference boundary. Some authors do not publish any details about such a tolerance or do not give a tolerance at all but use generally the same method (Grayden and Scordilis, 1994). This insertion and deletion approach has a few flaws. First of all, the value of the tolerance is questionable and cannot be set with any exact justification. It is rather chosen from experience, quite often experience with the results of a given speech segmentation method and its experiments. What is more, such methods treat different inaccuracies as
simply correct or wrong detections (or grades them on a larger scale) without considering ’how wrong’ the detection really is. Unfortunately, this is not the last of the problems. A tolerance is set, like 50 ms (Villing et al., 2004) for syllables, according to the statistically average length of a segment. The disadvantage of this approach is that speech segments, whether words, syllables or phonemes, vary greatly in length. This is why a shift of 50 ms in boundary location is not the same for a 100 ms long syllable as for a 300 ms long one. Different speech segmentation methods were compared by us in (Gałka and Ziółko, 2008).

Figure 2.10: The list of speech feature extraction method types, grouped in two avenues: one based on linear prediction coefficients (with PLP as the main one) and one based on filter bank analysis (with MFCC as the main one).
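The insertion/deletion evaluation scheme discussed above can be sketched as follows. This is a minimal illustration, not any of the cited methods: the boundary times and the 50 ms tolerance are hypothetical, and each detected boundary is allowed to match at most one reference boundary.

    def boundary_errors(reference, detected, tolerance=0.05):
        """Count hits, deletions (missed references) and insertions (spurious detections)."""
        unmatched = list(detected)
        hits = deletions = 0
        for ref in reference:
            match = next((d for d in unmatched if abs(d - ref) <= tolerance), None)
            if match is None:
                deletions += 1
            else:
                hits += 1
                unmatched.remove(match)
        return hits, deletions, len(unmatched)

    # Reference and detected boundaries in seconds (made-up values)
    print(boundary_errors([0.10, 0.31, 0.62], [0.12, 0.45, 0.60, 0.90]))  # (2, 1, 2)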
2.6 Speech Parametrisation
Speech parametrisation is a representation of a spectral envelope of an audio signal which can be
used in further processing. The two most common parametrisation methods are mel-frequency cepstral coefficients (MFCC) (Davis and Mermelstein, 1980) and perceptual linear prediction (PLP) (Hermansky, 1990).
2.6.1 Parametrisation Methods Based on Linear Prediction Coefficients
PLP (Hermansky, 1990) has become one of the standard speech parametrisation methods (Fig. 2.10),
and is used as a baseline for a part of the new research. Because of its importance, there have been
further improvements to the method, some of which are described below.
Figure 2.11: fMPE transformation matrix from the original low-dimensional feature vector into a high-dimensional one
Misra et al. (Misra et al., 2004) suggest normalising the spectrum into a probability mass function (PMF) or, more strictly speaking, a PMF-like function. Such a representation allows the calculation of entropy. Voice and non-voice segments are easily detected, even with a low signal-to-noise ratio (SNR). A hidden Markov model / artificial neural network (HMM/ANN) hybrid
system was used in the experiments. Because the PLP features are the only baseline provided and
a novel hybrid system is used, it is difficult to compare the results with many other papers. The
results suggest that the entropy features are less efficient than PLP, but it is possible to improve a
system based on the PLP by using entropy for creating extra parameters. Entropy is a good choice
to measure the gross peakiness of a data spectrum.
Deng et al. (Deng et al., 2005) present and compare two feature extraction and compensation algorithms which improve the PLP, and possibly other methods. The first one is the feature-space minimum phone error (fMPE) (Fig. 2.11) and the second is the stereo-based piecewise linear compensation for environments (SPLICE).
The fMPE is an improvement to the PLP. It is based on adding an additional high-dimensional feature vector containing conditional probabilities of each feature given the whole original low-dimensional feature vector. The high-dimensional feature vector is projected by a transformation
matrix into the subspace of the same dimension as the original vector (Fig. 2.11). The transformation matrix is created by reestimation via minimising the discriminative objective function known
as the minimum phone error by gradient descent. The training is conducted by an iterative scheme
of retraining the HMM parameters using the fMPE feature sets via maximum likelihood. There
are different possible decomposition schemes of the fMPE. One of them may be interpreted as a
compensation for the original features by adding a large number of bias vectors, each of which
is computed as a full-rank rotation of a small set of posterior probabilities. Approximations can
be easily made to remove the numerical problems in maximum-likelihood estimation. Another
decomposition scheme is interpreted as compensating for the original PLP cepstral features by a
frame-dependent bias vector. The fMPE can be understood as the compensation vector, which
consists of the linear weighted sum of a set of frame-independent correction vectors. The weight
is then the conditional probability associated with the corresponding correction vector. The fMPE algorithm is empirical in its nature.
Figure 2.12: Mel frequency cepstrum coefficients
The SPLICE is also a method of compensation. It assumes that an ideally clean speech feature
vector is ‘piecewise linearly’ related to the corresponding analysed noisy one. Which ‘piece’ of the local approximation is used for the piecewise linear approximation to the non-linear relationship between the noisy and clean speech feature vectors is determined by an index. With such an assumption the SPLICE compensation is calculated using the minimum mean square error (MMSE). This yields conditional probabilities corresponding to those in the fMPE algorithm. In contrast to the
fMPE, the compensation by addition is a natural consequence of the MMSE optimisation rule.
The PLP has found several applications. The transcription of conference room meetings is described in (Hain et al., 2005). It is based on the augmented multi-party interaction (AMI) system
using the HTK as the HMM toolkit for modelling and N-gram based language models. Phonetic decision tree state-clustered triphone models with a standard left-to-right three-state topology are used for acoustic modelling. States are represented by mixtures of 16 Gaussians. Coefficients obtained by applying the PLP can be transformed into other types of parameters (cepstral coefficients) for
further analysis. However, there is some ambiguity in the paper regarding the features. First, it is
stated that 12 mel-frequency PLP coefficients, with first and second order derivatives were used by
front-ends as parameters to form a 39-dimensional feature vector. Then, it is said that the smoothed heteroscedastic linear discriminant analysis (SHLDA) reduces a 52-dimensional (standard vector plus third derivatives) vector to 39 dimensions. Cepstral mean and variance normalisation
are performed on complete channels. The vocal tract length normalisation (VTLN) gives speaker
adaptation. The maximum likelihood criterion estimates warp factors. The UNISYN pronunciation lexicon was used. The method for feature extraction is not very novel but the complexity of
the system and the results of experiments on a large amount of data are impressive. The AMI is a global
approach to a large vocabulary ASR system.
2.6.2 Parametrisation Methods Based on Filter Banks
Davis and Mermelstein (Davis and Mermelstein, 1980) suggested a new approach to speech parametrisation in 1980. They described and compared two groups of parametric representations: one
based on Fourier spectrum (the MFCCs, the linear frequency cepstrum coefficients LFCCs) and
another based on the linear prediction spectrum (linear prediction coefficients LPCs, the reflection coefficients RCs and the cepstrum coefficients derived from the linear prediction coefficients
LPCCs). The MFCCs proved to be the best of them and are computed using triangular bandpass filters organised in a bank to filter different frequencies. The filters’ characteristics overlap each other in such a way that each filter begins at the centre (best-passing) frequency of the previous one
(Fig. 2.12). The MFCCs are computed as the sums over filters
MFCC_i = \sum_{k=1}^{12} X_k \cos \left[ i \left( k - \frac{1}{2} \right) \frac{\pi}{20} \right], \quad i = 1, 2, \ldots, M.   (2.3)
The method was improved by setting 12 basic coefficients, energy, first and second derivatives of
these, which gives a set of 39 features (Young, 1996). This now seems to be the most common
parametrisation and a baseline for new research in ASR. Some improvements of MFCCs and new
approaches based on filter banks are described below.
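A minimal sketch of the cosine transform in Eq. (2.3) follows. It assumes NumPy, takes log filter-bank energies as the X_k values (the usual MFCC practice) and uses made-up energies purely for illustration; a real front end would first apply framing, windowing and a mel filter bank.

    import numpy as np

    def cepstral_coefficients(log_fbe, num_ceps=12):
        """Cosine transform of filter-bank outputs, following the form of Eq. (2.3)."""
        n_filters = len(log_fbe)
        k = np.arange(1, n_filters + 1)
        return np.array([
            np.sum(log_fbe * np.cos(i * (k - 0.5) * np.pi / n_filters))
            for i in range(1, num_ceps + 1)
        ])

    log_fbe = np.log(np.linspace(1.0, 5.0, 20))  # 20 hypothetical log filter-bank energies
    print(cepstral_coefficients(log_fbe).shape)  # (12,)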
Most researchers believe that the phase spectrum information is not useful in speech recognition. Zhu and Paliwal (Zhu and Paliwal, 2004) argue that this is a wrong assumption. The phase spectrum information is less important than the magnitude spectrum, but it can still be useful. They
use the product of the power spectrum and the group delay function (GDF). They compared a standard set of 39 parameters based on the MFCCs (12 MFCCs + energy, first and second derivatives
of these) with three new approaches: modified-group-delay cepstral coefficients (MGDCCs), mel-frequency modified-group-delay cepstral coefficients (MFMGDCCs) and mel-frequency product
spectrum cepstral coefficients (MFPSCCs). MFCCs are the best for an absolutely clean signal and
MFPSCCs are the best for noisy signals. MFPSCCs are calculated in four steps (Zhu and Paliwal,
2004):
1. Compute the FFT spectrum of the speech signal x(n) and of the speech signal values multiplied by their indexes, nx(n).
2. Compute their product spectrum.
3. Apply a mel-frequency filter bank to the product spectrum in order to get filter-bank energies (FBEs).
4. Compute the discrete cosine transform (DCT) (Ahmed et al., 1974) of log FBEs to get the
MFPSCCs.
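The four steps can be sketched as below, assuming NumPy and SciPy. This is only an assumed reading of the procedure: the product spectrum is taken here as the real part of X(f)Y*(f) and its absolute value is used before filtering, and the mel filter bank matrix is a random placeholder rather than a real design.

    import numpy as np
    from scipy.fftpack import dct

    def mfpscc(frame, mel_fb, num_ceps=12):
        """Sketch of the four MFPSCC steps (Zhu and Paliwal, 2004)."""
        n = np.arange(len(frame))
        X = np.fft.rfft(frame)             # step 1: spectrum of x(n)
        Y = np.fft.rfft(n * frame)         # step 1: spectrum of n*x(n)
        product = np.real(X * np.conj(Y))  # step 2: product spectrum (assumed form)
        fbes = mel_fb @ np.abs(product)    # step 3: filter-bank energies
        return dct(np.log(fbes + 1e-12), norm='ortho')[:num_ceps]  # step 4: DCT of log FBEs

    frame = np.hamming(240) * np.random.randn(240)   # stands in for a 30 ms frame at 8 kHz
    mel_fb = np.abs(np.random.randn(23, 121))        # placeholder for a 23-band mel filter bank
    print(mfpscc(frame, mel_fb).shape)               # (12,)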
MGDCCs and MFMGDCCs are calculated by applying the so-called modified GDF (MGDF) to a smoothed spectrum calculated using the FFT. Computing the DCT provides the features. In the case of MFMGDCCs, mel-frequency filter banks are additionally applied before computing the DCT. Both methods were evaluated by the authors as less efficient than MFCCs.
Zhu and Paliwal used an HMM as the model. In the calculation of all the features, the speech signal was framed using a Hamming window every 10 ms with a 30 ms frame. A pre-emphasis filter was applied. The mel filter bank was designed with 23 frequency bands in the
range from 64 Hz to 4 kHz.
Another interesting approach is given by Ishizuka and Miyazaki (Ishizuka and Miyazaki,
2004). Their method focuses on feature extraction that represents aperiodicity of speech. The
method is based on the gammatone filter banks, framing, autocorrelation and comb filters. First
the signal is filtered by the gammatone filter banks, which are designed using the equivalent rectangular bandwidth scale to choose the centre frequencies and bandwidths of the filters. Each bank
consists of 24 filters. Various comb filters are designed for outputs of the gammatone filters. They
support separation of the output into its periodic and aperiodic features in subbands. Aperiodicity
and periodicity power vectors are calculated. The DCT is used to extract parametrisation features
from vectors. The method matches the accuracy of the MFCCs in clean conditions and is better in noisy conditions. The HTK (Young, 1996) is used as the HMM pattern classifier.
The Centre for Speech Technology Research at the University of Edinburgh has introduced
an innovative method of parametrisation. King and Taylor (King and Taylor, 2000) describe a
linguistically motivated structural approach to continuous speech recognition based on symbolic
representation of distinctive phonological features. As part of further research, syllable classification using articulatory-acoustic features was conducted (M. Wester, 2003). The speech is firstly analysed using MFCCs, but then it is parametrised using features which are based on so-called multivalued features, namely: front-back (front, back, nil, silence), place of articulation
(labial, labiodental, dental, alveolar, velar, glottal, high, mid, low, silence), manner of articulation
(approximant, fricative, nasal, stop, vowel, silence), roundness (rounded, unrounded, nil, silence),
static (static, dynamic, silence) and voicing (voiced, voiceless, silence). This is parametrisation
based strictly on classical phonology. The speech is represented by a sequence of symbolic matrices, each identifying a phone in terms of its distinctive phonological features. The NN was
used for language modelling. The phonological approach is described in many other papers of the
group. Methods of language modelling are also described, for example comparing the use of NNs and dynamic Bayesian networks (DBNs) for phonological feature recognition.
Yapanel and Dharanipragada (Yapanel and Dharanipragada, 2003) present a method based on the minimum variance distortionless response (MVDR) spectrum estimation and a trajectory smoothing technique. It was applied to reduce the variance in the feature vectors. The method is based on using specially designed FIR filters and it aims at the statistical stability of spectrum estimation rather than at the spectral resolution limit. Reduction of bias and variance is of particular interest. The method was first described in 2001 and it differs from the classical MFCCs solution by applying the briefly described technique as an additional block following window filtering. In (Yapanel and Dharanipragada, 2003) additional perceptually modified autocorrelation estimates are
obtained based on the PLP technique (Hermansky, 1990). The MVDR coefficients are calculated
from these autocorrelation estimates. Thanks to incorporating perceptual information, autocorrelation estimates are more reliable, because of perceptual smoothing of the spectrum. Then MVDR
estimation is more robust. But this is not the only advantage of using such smoothing; additionally,
the dimensionality of the MVDR estimation is reduced. As a result, the MVDR method is faster
with such a modification. The authors named the method perceptual MVDR-based cepstral coefficients (PMCCs).
Farooq and Datta (Farooq and Datta, 2004) describe the possibility of using the DWT instead of the STFT to parametrise speech. The paper compares Daubechies wavelets of orders 2, 6 and 20 and two sets of subbands with 6 and 8 bands. The method analyses 32 ms frames using 28 or 36 features (depending on the number of subbands). Linear discriminant analysis (LDA) using the Mahalanobis distance measure classifier was used for phoneme classification. Evaluation of the method is done with 52 MFCC features as a baseline. The method was evaluated under noiseless conditions and with noise. Vowels were found to be more difficult to recognise than fricatives and stops. In most cases the DWT method is superior to the MFCCs even though it uses fewer features.
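The idea of DWT-based subband features can be sketched as follows; this is not the authors’ exact feature set, just a minimal illustration assuming the PyWavelets library, an order-2 Daubechies wavelet, five decomposition levels (six subbands) and a random frame in place of real speech.

    import numpy as np
    import pywt

    def dwt_subband_features(frame, wavelet='db2', levels=5):
        """Log energy of each DWT subband of a frame (approximation + detail bands)."""
        coeffs = pywt.wavedec(frame, wavelet, level=levels)
        return np.array([np.log(np.sum(c ** 2) + 1e-12) for c in coeffs])

    frame = np.random.randn(512)         # stands in for a 32 ms frame at 16 kHz
    print(dwt_subband_features(frame))   # six subband log energies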
The Speech Research Group at University of Cambridge describes a 2003 CU-HTK large vocabulary speech recognition system for conversational telephone speech (CTS) (Evermann et al.,
2004) which uses MFCCs as feature vectors. The system has a multi-pass, multi-branch structure.
The multi-branch architecture combines the results from a few separate, similar systems with different parameters by separate lattice rescoring. Based on the Levenshtein distance metric,
different word sequences are generated in branches instead of one best hypothesis. The output of
all branches is combined using a system combination based on a confusion network. The CU-HTK CTS system consists of two main stages: lattice generation with adapted models and lattice
multi-pass rescoring in multiple branches. Lattices restrict the search space in the subsequent rescoring stage. Additionally, the generation of lattices provides control for adaptation in each of
the branches of the rescoring stage. In the lattice generation, the gain from performing the VTLN
by warping the filter bank is very substantial. The multi-passing scheme is used for lattice generation. The first pass generates a transcription using the heteroscedastic linear discriminant analysis
(HLDA), the minimum phone error (MPE) trained triphones and the word 4-gram language model
(LM). Speakers gain the VTLN warp factor in this step. The second pass uses MPE VTLN HLDA
triphones to create small lattices. In the third and last pass they are used in the lattice maximum
likelihood linear regression (MLLR). Word lattices are generated with the word 4-gram LM interpolated with the class trigram. Speaker adaptive training (SAT) and single pronunciation dictionaries were used. A word-based 4-gram language model was trained on the acoustic transcriptions. That system seems to be the most complete, if not the only, ready complex academic solution for large vocabulary speech recognition.
Hifny et al. (Hifny et al., 2005) extend the classical HMM and MFCCs solution using the
maximum entropy (MaxEnt) principle to estimate posterior probabilities more efficiently. Entropy-based information about acoustic constraints is used in an unbiased distribution to replace Gaussian mixture models. They use discriminative MaxEnt models for modelling acoustic variability trained
mixture models. They use discriminative MaxEnt models for modelling acoustic variability trained
using the conditional maximum likelihood (CML) criterion, which maximises the likelihood of the
empirical model estimated from the training data with respect to the hypothesised MaxEnt model.
Exact parameters are numerically estimated using a modified version of the improved iterative
scaling (IIS) algorithm. The difference lies in supporting constraints that may take negative values.
The idea of the IIS is to use an auxiliary function bounding the change in divergence after each
iteration. Parametric constraints model the high variability of the observed acoustic signal and do not rely on the assumption of a Gaussian distribution of the data, which is not strictly true in practical applications when acoustic features are used directly. Currently, in many fields, researchers are trying to overcome model dependence on Gaussian assumptions. In the authors’ opinion, the
hybrid MaxEnt/HMM method may replace hybrid ANN/HMM solutions, which are currently very
popular, using the MaxEnt modelling to estimate the posterior probabilities over the states. The
experiments were conducted using MFCC features. The conclusion might be that in standard speech recognition solutions (MFCCs and an ANN/HMM model) entropy information is not exploited. This conclusion corresponds very well to the paper (Misra et al., 2004), described earlier, which also points to the lack of use of entropy in existing solutions as a flaw. Both papers show that adding entropy to existing solutions improves them, one (Misra et al., 2004)
for PLP and the other (Hifny et al., 2005) for MFCCs.
2.6.3 Test Corpora and Baselines
The lack of a standard baseline method and test corpus for speech recognition is an important issue. Information about the evaluation experiments published in the research papers described above is presented here. It is easy to observe that the databases and baselines often differ and the information provided about them often covers different issues. It is very difficult to compare different methods of parametrisation if they are evaluated using different baselines and modelling.
The Aurora2 database was used to evaluate the performance in (Zhu and Paliwal, 2004). The
source speech is TIDigits, consisting of a connected-digits task spoken by American English speakers, sampled at 8 kHz. It contains clean and multi-condition training sets and three test sets. 39 parameters based on the MFCCs are used as a baseline, with an HMM (not described in detail) as the model. Aurora2 was also used to test SPLICE (Deng et al., 2005). PMCCs (Yapanel and Dharanipragada, 2003) were evaluated using Aurora2 as well, and in addition an automotive
speech recognition application was used. It was compared to MFCCs, PLP and standard MVDR.
HMM was used as a model.
Tests in (Ishizuka and Miyazaki, 2004) were carried out on vowels from Japanese sentences
from a newspaper spoken by a male speaker, and on the Japanese noisy digit recognition database Aurora2J. The HTK was used for feature classification and the standard 39 MFCCs as the baseline. The method of Misra et al. (Misra et al., 2004) was tested on the Numbers95 database of US English connected digits telephone speech. There are 30 words in the database represented by 27 phonemes. Training was conducted on clean data. Noise from the Noisex92 database has been
added into testing data. The PLP features are used as the baseline. There are 3330 utterances for
training and 1143 utterances for testing. The HMM/NN hybrid system was used in the experiments.
A very impressive amount of training and test data was used by the Cambridge Speech Research Group (Evermann et al., 2004). Training data consists of 296 hours of speech by LDC
(Switchboard I, Call Home English and Switchboard Cellular) plus 67 hours of Switchboard (Cellular and Switchboard II phase 2). Transcriptions were provided by MSState University for LDC (carefully) and by a BBN commercial transcription service (quickly) for an additional 67 hours. Additionally, Broadcast News data (427M words of text) and 62M words of ‘conversational texts’
were collected from the Internet (www.ldc.upenn.edu/Fisher/).
The paper (Hain et al., 2005), presenting the development of the AMI meeting transcription system, describes and uses many speech corpora for evaluation: SWBD/CHE, Fisher, BBC-THISL,
HUB4-LM96, SDR99-Newswire, Enron email, ICSI meeting, NIST, ISL and AMI. The last four
are typical meeting corpora. Results for different corpora and their sizes are compared. It uses
elements of the HTK for training and decoding.
The Centre for Speech Technology Research at the University of Edinburgh (King and Taylor,
2000; M. Wester, 2003) experiments were carried out on the Texas Instruments/Massachusetts Institute of Technology (TIMIT) database (read continuous speech from North American speakers).
3696 training utterances from 462 different speakers and 1344 test utterances from 168 speakers
were used. 39 phone classes are used, instead of the original 61. The same database was used to evaluate the MaxEnt/HMM model (Hifny et al., 2005). The same reduction of phone classes took place.
420 speakers were used for the training set. Farooq and Datta (Farooq and Datta, 2004) also evaluated their methods using the TIMIT database, using vowels (/aa/, /ax/, /iy/), unvoiced fricatives
(/f/, /sh/ and /s/) and unvoiced stops (/p/, /t/ and /k/) from the dialect regions of New England and the northern part of the USA. Data from 114 speakers (including 37 females) was used for training and from 37 speakers (including 12 females) for testing.
The fMPE (Deng et al., 2005) is evaluated using the DARPA EARS Rich Transcription 2004 conversational telephone speech recognition task. The baseline in this case is simply the set of coefficients to which the fMPE features are appended, with an HMM used as a model.
As has been said, there are two typical baselines for feature evaluation: the MFCC and the PLP. The first one is more popular. It has to be mentioned that in several papers other baselines are used, especially incomplete MFCCs. This makes comparing currently researched parametrisation methods a difficult task. Unfortunately, it is not the only problem. The HTK is the most typical tool for speech modelling. However, it is not the only one, and it should be stressed that the HTK is an ongoing project with new versions released quite frequently. It can easily be imagined that different researchers use different versions, which are better or worse according to their date of release, but authors do not give any details about the version they are using. What is more, quite a few experiments are based on HMMs and HMM/ANN hybrid solutions other than the HTK (or the authors just do not give all the details) or on ANNs alone. Differences in the results of experiments can be caused by worse or better parametrisation as well as by changes in the model.
One of the reasons why there is no standard test corpus might be that all of them are commercial, and there seems to be no satisfactory, free evaluation data set for speech recognition. This is an issue which prevents standardisation of tests. Another point is that ASR research is conducted for different languages, so variety is inevitable because of the language preferences of researchers. Still, the different sizes, complexity and variety of words in test corpora cause difficulties in comparing different approaches. To avoid such problems, there should be two freely available corpora: one with a small vocabulary, like digits, mainly for fast tests during research, and the other with a large vocabulary for final results.
Table 2.3: Comparison of the efficiency of the described methods. Asterisks mark methods appended to baselines (they could be used with most of the other methods). The methods without asterisks are new sets of features, different to the baselines

Method                                                | Comparison to MFCC              | Comparison to PLP
MFPSCCs (Zhu and Paliwal, 2004)                       | 2%                              | ?
Ishizuka* (Ishizuka and Miyazaki, 2004)               | 17%                             | ?
Phonological (King and Taylor, 2000; M. Wester, 2003) | (no straightforward comparison) |
Spectral Entropy* (Misra et al., 2004)                | ?                               | 15%
DWT (Farooq and Datta, 2004)                          | 2% (52 MFCC)                    | ?
fMPE* (Deng et al., 2005)                             | ?                               | 13%
SPLICE* (Deng et al., 2005)                           | ?                               | 29%
PMCCs* (Yapanel and Dharanipragada, 2003)             | 20%                             | 11%

2.6.4 Comparison of the Methods
It is very difficult to compare different methods for the reasons presented in the previous section. However, we have attempted at least an approximate comparison. We compare methods against the baselines which the authors provided, by presenting the average improvement over the baseline (Table 2.3). We do not see any way to compare methods with different baselines. The
methods can be grouped in two categories. One of them covers basic features which replace the
baseline (Zhu and Paliwal, 2004; Farooq and Datta, 2004; Hain et al., 2005). The other consists
of elements appended to classical ones (Ishizuka and Miyazaki, 2004; Misra et al., 2004; Deng
et al., 2005; Yapanel and Dharanipragada, 2003) and these are marked by asterisks in Table 2.3.
The first group gives less improvement. It has to be stressed that methods in the second group are
additional elements and as such they may be used in connection with methods of the first group
to give even better results. The phonological approach (M. Wester, 2003; King and Taylor, 2000) has not been compared with any baseline. Work on the phonological features continues and results have improved, but no clear comparison with the MFCCs or the PLP was found. As one of the authors explained in an email conversation, the system is not ready for word recognition, and because the main reason for using articulatory features to mediate between the acoustic signal and words is to get around the ‘beads on a string’ problem (describing words as a simple concatenation of phones), using the phone error rate would be pointless.
New sets of features are not much better than baselines. The largest improvement is based on
adding extra elements and improving existing parametrisation. The methods marked with an asterisk could give outstanding results if combined. However, some of them might be dependent on each other and in fact use the same information. The highest improvement of the reviewed methods is given
by the SPLICE (Deng et al., 2005).
Based on Yapanel’s results (Yapanel and Dharanipragada, 2003) (the only method compared with both baselines) we can calculate that the PLP method gives around 8% improvement over the MFCCs, since a 20% gain over the MFCC baseline against an 11% gain over the PLP baseline implies a ratio of 1.20/1.11 ≈ 1.08. This evaluation depends on the database used in the experiment and the exact value is questionable. Still, it allows us to assume that the PLP is a slightly better method than the MFCCs.
2.7 Speech Modelling
Speech and language modelling is based on stochastic processes. To define them, let us assume the existence of a probability space and an infinite number of random variables in the space. E is the
space of process states, and T stands for the domain of a stochastic process. The set of random
variables S(t) ∈ E such that S(t) = {A(t), t ∈ T } is a stochastic process.
A stochastic process S = \{A(t), t \in T\} is called a Markov process if it fulfils

P\{S(t_{n+1}) = s_{n+1} \mid S(t_n) = s_n, \ldots, S(t_1) = s_1\} = P\{S(t_{n+1}) = s_{n+1} \mid S(t_n) = s_n\}.   (2.4)
It means that a Markov process keeps a memory of the last event. The whole future run of the
process depends only on the current event. A Markov chain is a Markov process with a discrete
space of states. A domain may be continuous or discrete. The concept of Markov chains can be
extended to include the case where the observation is a probabilistic function of a state. The HMM
is a doubly embedded stochastic process with an underlying stochastic process that is hidden and
can only be observed through another set of stochastic processes that produce the sequence of
observations.
The HMM (Rabiner, 1989; Li et al., 2005) is a statistical model where the system being modelled is assumed to be a Markov process with unknown parameters, and the challenge is to determine
the hidden parameters from the observable parameters, based on this assumption. Speech recognition systems are generally based on HMMs (Young et al., 2005) or hybrid solutions with ANNs (Young, 1996; Holmes, 2001). A statistical model gives the probability of an observed sequence of acoustic data by the application of Bayes’ rule
P(\mathrm{word} \mid \mathrm{acoustic}) = \frac{P(\mathrm{acoustic} \mid \mathrm{word})\, P(\mathrm{word})}{P(\mathrm{acoustic})},   (2.5)
where P (acoustic|word) comes from an acoustic model, P (word) is given by a language model
(or a combination of several language models) and P(acoustic) is used for normalisation purposes only, so it can be skipped as long as we deliver normalisation in another way or accept the fact that the final result is not a probability function, as it may not take values from 0 to 1 and the sum of all results is not equal to 1. We can easily accept this if we are interested only in the argument of the maximum of the result and do not need proper probability values. The Bayes rule can be similarly applied to phonemes, words, syntactic and semantic information. Introducing an additional hidden dynamic state gives a model of spatial correlations and leads to better results (Frankel and King, in press).
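A toy sketch of the resulting decision rule is shown below; the word list and the acoustic and language model scores are invented for illustration, and P(acoustic) is ignored because it does not change the argument of the maximum.

    def most_likely_word(acoustic_scores, language_model):
        """Return argmax over words of P(acoustic|word) * P(word), following Eq. (2.5)."""
        return max(language_model,
                   key=lambda w: acoustic_scores.get(w, 0.0) * language_model[w])

    acoustic_scores = {'cat': 0.20, 'cap': 0.25, 'cut': 0.10}   # hypothetical P(acoustic|word)
    language_model = {'cat': 0.60, 'cap': 0.10, 'cut': 0.30}    # hypothetical P(word)
    print(most_likely_word(acoustic_scores, language_model))    # 'cat' (0.12 > 0.03 > 0.025)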
The HMM is very popular but there are some other approaches to language modelling. One
of them is a support vector machine (SVM), a classifier that estimates decision surfaces directly
rather than modelling a probability distribution across the training data. As the SVM cannot model temporal speech structure efficiently, it is best used in a hybrid solution with the HMM (Ganapathiraju
et al., 2004).
Another model which has become popular in speech recognition is based on dynamic Bayesian networks (DBNs) (Wester et al., 2004; Frankel and King, 2005). Typical Bayes nets are
directed acyclic graphs where each node represents a random variable. Missing edges imply conditional independence, which factors the joint distribution of all random variables into a set of simpler probability distributions. DBNs consist of instances of Bayesian networks repeated over
time, with dependencies across time. DBNs were proposed as a model for articulatory feature
recognition. In a classical HMM framework, parameters are obtained by the maximum likelihood
approach. The variational Bayesian estimation and clustering (Watanabe et al., 2004) is another
approach. It does not use maximum likelihood parameters but posterior distribution.
There are other models (Venkataraman, 2001; Ma and Deng, 2004; Wester, 2003) for modelling acoustic parameters or elements of language. In all models we have to make many assumptions, like statistical dependence and independence (King, 2003). One has to be very careful not to commit a simplification which might result in a wrong model.
Another issue is the training process of a model. The most popular algorithms are based on the forward-backward procedure (Rabiner, 1989; X. Huang, 2001) for evaluation of an HMM, the Viterbi algorithm (Rabiner, 1989; Viterbi, 1967; Forney, 1973) for decoding an HMM and Baum-Welch for estimating HMM parameters (Rabiner, 1989; X. Huang, 2001). All of them need human supervision and can be quite time-consuming. There are also methods based on active learning (Riccardi and Hakkani-Tür, 2005) in which applying adaptive learning may cut down the need for supervision.
2.8 Natural Language Modelling
Analysing semantic and syntactic content is one of the topics of NLP (Manning, 1999). Words can be
connected in a large number of ways, including: by relations to other words, in terms of decomposition of semantic primitives, and in terms of non-linguistic cognitive constructs (perception,
action and emotion). There are hierarchical and non-hierarchical relations. Some hierarchical
relations are: is-a (a tree is a plant), has-a (a computer has a screen), and for scales of degree.
Non-hierarchical relations include synonyms and antonyms. There are some word affinities and
disaffinities in the semantic relations regarding the expressed concept. They are difficult to describe mathematically but may be exploited by speech recognition systems. A crucial problem is the context-dependent meaning of words. For example, ’bank’ can mean the bank of a river or a bank that keeps money. Authors of dictionaries try to identify distinct senses of entries, but
it is very difficult to put an exact boundary between senses of a word and to disambiguate senses
in practical contexts. Another problem is that natural languages are not static. Some additional
meanings of words can change quite frequently (X. Huang, 2001).
The language regularities are very often modelled by n-grams (X. Huang, 2001). Let us assume a word string W consisting of n words w_1, w_2, w_3, \ldots, w_n. P(W) is a probability distribution over word strings W that reflects how often W occurs. It can be decomposed as

P(W) = P(w_1) P(w_2 \mid w_1) P(w_3 \mid w_1, w_2) \cdots P(w_n \mid w_1, \ldots, w_{n-1}).   (2.6)
For calculation time reasons, the dependence is limited to n words backwards. Probably the most popular are trigram models, using P(w_i \mid w_{i-2}, w_{i-1}), as the dependence on the previous two words is very strong, while the model complexity is not very high. Such models need statistics collected over a vast amount of text. It means that many dependencies can be averaged. Adaptive language models (Bellegarda, 1997; Jelinek et al., 1991; Mahajan et al., 1999) deal with this flaw by a semantic approach to n-grams. Several different models can be created for different topics and different types of texts, organised in a domain or topic-clustered language model. Then a system detects the topic of the recognised text and uses the n-gram model cluster associated with this topic. It is possible to combine several clusters at once and to change the topic during recognition of different parts of the same text. Latent semantic indexing (Bellegarda, 2000) improves the traditional n-gram model by searching for co-occurrences across much larger spans with regard to semantic roles rather than simple word distance.
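A minimal sketch of estimating the trigram probabilities used above is given below; the tiny token list is purely illustrative and no smoothing is applied, which is exactly why such models need vast amounts of text in practice.

    from collections import Counter

    def trigram_probabilities(words):
        """Maximum likelihood estimates of P(w_i | w_{i-2}, w_{i-1}) from a token list."""
        trigrams = Counter(zip(words, words[1:], words[2:]))
        bigrams = Counter(zip(words, words[1:]))
        return {(a, b, c): n / bigrams[(a, b)] for (a, b, c), n in trigrams.items()}

    tokens = "the cat sat on the mat the cat ran".split()
    probs = trigram_probabilities(tokens)
    print(probs[('the', 'cat', 'sat')])  # 0.5: 'the cat' is followed once by 'sat', once by 'ran'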
We are mainly interested in lexical semantics which is a study of systematic, meaning related
structures of individual words. This field proves how ambiguous the natural language might be.
We will start by defining typical semantic notions (Jurafsky and Martin, 2000). A lexeme is an individual entry in the lexicon. It corresponds to a word but has a more strict meaning - a
pairing of a particular orthographic and phonological form with some form of symbolic meaning
representation - a sense. In most traditional dictionaries lexeme senses are surprisingly circular - blood may be defined as the red liquid flowing in veins, and red as the colour of blood. The usage of
such structures is possible only if a user has some basic knowledge about the world and meanings.
Computers and artificial intelligence do not have it. This is why avoiding this circularity was one
of the main issues in creating a lexical database WordNet (Fellbaum, 1999). It contains three
separate databases for nouns, verbs and the third for adjectives and adverbs. WordNet is based
on relations among lexemes. Homonymy is a relation that holds between words that have the
same form with unrelated meanings. The items with such relation are homonyms. Words with
the same pronunciation and different spelling are homophones. In contrary, homographs have
same orthographic form but different sounds. Polysemy is an occurrence of multiple meanings
within a single lexeme. So we can say that a bank of a river and a bank to keep money are rather
homonyms, while a blood bank and a bank to keep money are rather polysems. Obviously, it is
not fully distinct what is a homonym and what is polysemy. The Polish example of homonyms
are two meanings of a word ‘zamek’. The first one is a castle and the second one is a lock. We
can separate them typically by investigating lexeme history and etymology (origin). A bank to
keep something has an Italian origin, while a bank of a river has a Scandinavian one. Synonymy is
defined as the coexistence of different lexemes with the same meaning, which also leaves many open questions. An example of synonyms is the Polish pair ‘kolor’ and ‘barwa’. The first one means
colour and the second one might be translated as hue, but in Polish it can easily replace the first
one. Hyponymy is a pair of lexemes with similar but not identical senses.
There are several problems with applying semantic analysis. The first of them is the use of metaphors.
They are especially common in literature, but also in spoken language and sometimes even in
documents. Words and phrases used to present completely different kinds of concepts than their
lexical senses are a serious challenge. Metonymy is a related issue: the use of lexemes to denote concepts by naming some other related concept. We can use the word ’kill’ to describe stopping some process in a more dramatic way, as in ’killing’ processes in Linux or ’killing a sale of a rival company’. Finally, the problem is that existing semantic algorithms are dedicated to
written text which is expected to be correct. Spoken language is characterised by a higher level of
mistakes and abbreviations, while a user expects a transcription produced by a speech recogniser
to be of written text quality.
There is very little research on semantic analysis for ASR but there are some other fields which
might be useful in our research like word disambiguation (Banerjee and Pedersen, 2003) and automatic hypertext construction (Green, 1999). One of the interesting issues is topic signatures. The
experiments show that it is possible to approximate accurately the link distance between synsets (a
semantic distance based on the internal structure of WordNet) with topic signatures (Agirre et al.,
2001, 2004). Clean signatures can be constructed from the WWW using filtering techniques like
ExRetriever and Infomap (Cuadros et al., 2005).
There are several methods of measuring the relatedness of concepts in WordNet. The Similarity package provides six measures of similarity (Pedersen et al., 2004). The lch measure searches for
the shortest path between two concepts, and scales it. The wup finds the path length to the root
(shared ancestor) node from the least common subsumer of the measured concepts. The measure path equals the inverse of the shortest path length between two concepts. The res, lin and jcn are based on information content - a corpus-based measure of the specificity of a concept. The package also contains three measures of relatedness (Pedersen et al., 2004). The hso classifies relations as
having direction, so it is path based. The lesk and vector measures use the text of gloss (definition)
of the concept as a representation for it (Banerjee and Pedersen, 2003). It can be realised by
counting shared words in the glosses. Strings containing several words bring much more information due to entropy theory, so the score of a shared string is the number of its neighbouring words raised to the second power. If several strings are shared, their scores are summed. Glosses of related
senses can be also used to improve accuracy. There are other semantic similarity measures as well,
like (Seco et al., 2004) which is based on hierarchical structure only. Semantic similarity can be
also measured using Roget’s Thesaurus instead of WordNet (Jarmasz and Szpakowicz, 2003). The
method is based on calculating all paths between two words using Roget’s taxonomy.
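The gloss-overlap idea can be illustrated with the simplified sketch below; unlike the full lesk measure, which sums the squared lengths of all shared strings, it scores only the single longest shared word sequence, and the two glosses are invented.

    def longest_gloss_overlap(gloss_a, gloss_b):
        """Length of the longest contiguous word sequence shared by two glosses."""
        a, b = gloss_a.lower().split(), gloss_b.lower().split()
        best = 0
        prev = [0] * (len(b) + 1)
        for i in range(1, len(a) + 1):
            curr = [0] * (len(b) + 1)
            for j in range(1, len(b) + 1):
                if a[i - 1] == b[j - 1]:
                    curr[j] = prev[j - 1] + 1
                    best = max(best, curr[j])
            prev = curr
        return best

    a = "a financial institution that accepts deposits"
    b = "an institution that accepts deposits and lends money"
    n = longest_gloss_overlap(a, b)
    print(n, n ** 2)  # 4 shared words score 16 under the squared-length rule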
Semantic analysis can improve the quality of ASR results. This is the highest information level in the linguistic model. Semantics deals with the study of meaning, including the ways meaning is structured in language and changes in meaning and form over time. The majority of the latest papers describing the general speech recognition scheme include semantic analysis. But there is no working system (known to the author) using lexical semantics and there is little research on applying any semantic analysis in speech recognition. Semantic analysis is much more often used in written
text analysis to retrieve information. There are two main approaches (X. Huang, 2001). The first
is based on semantic roles:
• agent - cause or initiator of the action
• patient - undergoer of the action
• instrument - how the action is accomplished
• goal - to whom the action is directed
• result - result of the action
• location - location of the action
We can predict the location and order of different semantic roles in sentences. Some of them have to be present, others are optional. We can also associate exact words with a few roles. This allows us to detect a wrong structure in recognised text. Such semantic analysis can be used in
speech recognition (Bellegarda, 2000) instead of n-gram models.
The other approach is by lexical semantics. Some words very often go together in texts. Some of them appear close to each other very rarely (Agirre et al., 2001, 2004). There are already collected statistics of this kind, for example the semantic dictionary WordNet (Fellbaum, 1999). Words create a set of trees and the number of branches between two nodes may stand for their semantic closeness. There are other possible measures as well. It is possible to detect words which do not fit the general semantic content of a recognised hypothesis.
2.9 Semantic Modelling
It is not efficient to recognise speech using acoustic information only. The human perception system is based on catching context, structure and understanding combined with the recognition procedure. It is much easier for a human being to recognise and repeat a heard sentence without any errors if it is in a language we understand, compared to a sentence in an unfamiliar language, which is just a sequence of sounds. Similarly, it is much easier to recognise sentences in a familiar domain or topic than sentences from an unfamiliar context. Language modelling can greatly improve recognition. Semantic analysis can be done in many different ways and has been applied to ASR already. However, this kind of modelling is difficult due to the data sparsity problem. The ASR literature always mentions semantic analysis as a necessary step, but it is very difficult to find research papers which provide exact recognition results when semantic methods are applied.
Latent semantic analysis (LSA) (Bellegarda, 1997, 1998; T.Hofmann, 1999) is an NLP technique patented in 1988. It assumes that the meaning of a small part of text, like a paragraph or a sentence, can be approximated by the sum of the meanings of its words. LSA uses a word-paragraph matrix which describes the occurrences of words in topics. It is a sparse matrix whose rows correspond to topics and columns correspond typically to words that appear in the topics. The elements of the matrix are proportional to the number of times the words appear in each document, where rare words are upweighted to reflect their relative importance. LSA is performed by using singular value decomposition (SVD). LSA has already found a few applications. One of them is automatic essay and answer grading (Kakkonen et al., 2006; Kanejiya et al., 2003). LSA can also be used in modelling global word relationships for junk e-mail filtering or pronunciation modelling (Bellegarda, 80). Another possible application is word completion (Miller and Wolf, 2006). LSA can be combined with the n-gram model (Coccaro and Jurafsky, 1998;
Table 2.4: Speech recognition applications available on the Internet
HTK (Young, 1996; Evermann et al., 2004) - htk.eng.cam.ac.uk
Edinburgh Speech Tools - www.cstr.ed.ac.uk/projects/speech tools
SPRACH (Hermansky, 1990) - www.icsi.berkeley.edu/dpwe/projects/sprach/sprachcore.html
AMI (Hain et al., 2005) - www.amiproject.org/business/index.htm
CMU Sphinx (Lamere et al., 2004)- cmusphinx.sourceforge.net/html/cmusphinx.php
CMU Let’s go (Eskenazi et al., 2008) - http://www.speech.cs.cmu.edu/letsgo/
Snorri - www.loria.fr/ laprie
Snack Speech Toolkit - http://www.speech.kth.se/snack/
Praat (Boersma, 1996) - www.fon.hum.uva.nl/praat/
CSLU OGI Toolkit - http://cslu.cse.ogi.edu/toolkit/
Sonic ASR - cslr.colorado.edu/beginweb/speech recognition/sonic.html
Grönqvist, 2005) or a maximum entropy model (Deng and Khudanpur, 2003). LSA can also be applied to bigrams of words in topics rather than single words (Y.-C. Tam, 2008). Such a model is more difficult to train but it can improve results if combined with a regular LSA model. There are
other methods of analysing semantic information, like topic signatures (Agirre et al., 2001, 2004)
and maximum entropy language models (Khudanpur and Wu, 1999; Wu and Khudanpur, 2000).
The idea of topic signatures is to store concepts in context vectors. There are simple methods to acquire them automatically for any concept hierarchy. They were used to approximate link distances
in WordNet. Maximum entropy language models combine dependency information from sources
like syntactic relationships, topic cohesiveness and collocation frequency. They evolved from n-grams. The difference is that they store not only n words but also other information, like the n preceding exposed head-words of the syntactic partial parse, the n non-terminal labels of the partial parse and a topic.
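The core LSA step described above can be sketched as follows, assuming NumPy; the small count matrix is invented and the weighting of rare words is omitted for brevity.

    import numpy as np

    def lsa_topic_space(word_topic_counts, rank=2):
        """Reduced-rank SVD of a word-topic count matrix, the core step of LSA."""
        u, s, vt = np.linalg.svd(word_topic_counts, full_matrices=False)
        return u[:, :rank] * s[:rank]   # low-dimensional representation of the rows

    # Rows correspond to topics, columns to words; counts are made up for illustration
    counts = np.array([[3., 0., 1., 0.],
                       [2., 1., 0., 0.],
                       [0., 4., 0., 2.],
                       [0., 3., 1., 3.]])
    print(lsa_topic_space(counts))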
2.10 Academic Applications
There are a few academic applications of speech recognition. We list some of them in Table 2.4. Edinburgh Speech Tools is not a complete ASR system but rather a toolbox for speech analysis with many elements useful in speech recognition, for example an n-gram language model. The SPRACH is a full package based on the PLP, including for example ANN training and recognition, feature calculation and sound file manipulation, plus all the GUI components and tools. The AMI targets computer-enhanced multi-modal interaction in the context of meetings, including ASR. The CMU Sphinx Group (Lamere et al., 2004) offers packages for speech-based applications, very useful for speech modelling in ASR. CMU also provides the spoken language system Let’s go (Eskenazi et al., 2008), which includes ASR. Snorri is dedicated to assisting researchers in the fields of ASR, phonetics, perception and signal processing. Similar opportunities are provided by the Snack Sound Toolkit, which uses scripting languages like Python. Praat (Boersma, 1996) covers speech analysis, labelling, segmentation and learning algorithms. The CSLU OGI Toolkit helps in building interactive
language systems for human-computer interaction. SONIC is the speech recogniser developed by the University of Colorado. It is available only to registered and approved users.
The HTK (Young et al., 2005) is a toolkit using HMMs, mainly for ASR research. Research into speech synthesis, character recognition and DNA sequencing are its other applications. We used
version 3.3 in our research. HTK consists of many modules and tools. All of them are available
in C source form. The HTK provides facilities for speech analysis, HMM training, testing and
results analysis. The system fits every recognition hypothesis to one of the elements of a user-provided dictionary by comparing it with the phonetic transcriptions of words. The toolkit
supports HMMs using both continuous density mixture Gaussians and discrete distributions. HTK
was originally developed at the Machine Intelligence Laboratory of the Cambridge University
Engineering Department (CUED). It was sold to Entropic Research Laboratory Inc. and later to Microsoft. Currently it is licensed back to CUED and is under continuous development.
Chapter 3
Linguistic Aspects of Polish
English is the most common language of ASR research, with Chinese and Japanese as two other common languages. This thesis is focused on ASR of Polish, which is the most commonly spoken Slavic language in the EU and one of the most common inflective languages. There is quite little research on it and no working continuous Polish ASR system. To create such a system, successes achieved for other languages have to be exploited. As Polish and English are languages of the same Indo-European
group, we focused on existing solutions for English ASR. There are some differences between
these languages which have a larger or smaller impact on ASR. These differences should result in
some variations in algorithms.
3.1 Analysis of Polish from the Speech Recognition Point of View
We searched for differences between English and Polish which seem to be important in ASR. It is important to consider linguistic aspects while designing an ASR system.
• English has a large number of homophones. What is more, many combinations of different
words have similar pronunciation. Polish has fewer homophones.
• Pronunciation of vowels in English is very similar. If a vowel is not stressed it is usually
pronounced as // or /*/. What is more, both of these phonemes have quite similar sounds
and spectra. It means that unstressed vowels are almost indistinguishable in English. This contrasts with Polish.
• Modern English has emerged as a mixture of around thirty languages. This resulted in quite simple general rules (which was necessary for a language to be widely accepted by different people) but many irregularities (as a kind of residue), especially in pronunciation. Modern Polish is strongly based on Latin. Contrary to English, this resulted in very complicated grammar rules and morphology but quite few irregularities in pronunciation.
• English is a positional language, while Polish is an inflectional one. A meaning of a word
in English depends strongly on the position of a word in a sentence. In Polish a position
is of secondary importance, the exact meaning of a word depends mainly on morphology.
For example, in English the sentences ’Mike hit Andrew’ and ’Andrew hit Mike’ mean something quite different. In Polish (using similar Polish names) ’Michał uderzył Andrzeja’, ’Michał Andrzeja uderzył’, ’Andrzeja Michał uderzył’ and ’Andrzeja uderzył Michał’ are all acceptable and mean almost the same. However, all but the first stress some part of the information and sound quite strange without a special context. To identify the person who hit and who is hit, we have to use a different ending: ’Andrzej uderzył Michała’. It means the
usage of syntax modelling is very difficult for Polish and possibly not as necessary as for
English. On the other hand, analysing morphology seems to be crucial in the case of ASR
for Polish.
• In English, conjugation and declension are relatively simple and adjectives do not need any type of agreement. In Polish there are groups of different ways of conjugation and declension. Each verb typically has different forms for each combination of gender (there are three basic genders in Polish; however, linguists distinguish 8 categories), person and singular or plural number. Each noun has 7 forms (cases) depending on its position and relation to other words in the sentence. Adjectives and numerals agree with the nouns they describe. There is no general rule of word agreement, like adding 's' or 'es' in English. Different groups of words have their own types of endings. Verbs have 47 inflection forms (excluding participles), adjectives 44, numerals up to 49, adverbs 3, nouns and pronouns 14. A single word in Polish may have even several hundred topically correlated derived forms (for example some verbs have almost 200 forms, including the conjugation of participles and perfect and imperfect forms). This makes building a full dictionary of the Polish language for an ASR system very difficult. Even if it is possible, its size may cause very serious delays in the operation of the ASR system.
• English is well known to have a vast vocabulary. This is due to the large number of dialects and versions of English situated all around the world. Another reason is that English is a mixture of several languages, so there are words which mean almost the same but came from different sources. The Polish dictionary seems to be smaller in this respect.
• Polish has a few phonemes which are rare in other languages and do not exist in English. They sound very different from other phonemes. More specifically, they have much higher frequencies and sound to non-Polish speakers almost like rustles or hums. These phonemes are very easily detectable and, as such, can additionally be used as a kind of boundary between blocks of other phonemes.
3.2 Triphone Statistics of Polish Language
Statistical linguistics at the word and sentence level has been considered for several languages Agirre et al. (2001); Bellegarda (2000). However, similar research on phonemes is rare Denes (1962); Yannakoudakis and Hutton (1992); Basztura (1992). The frequency of appearance of phonetic units is an important topic in itself for every language. It can also be used in several speech processing applications, for example modelling in LVCSR or coding and compression.
Models of triphones which are not present in the training corpus of a speech recogniser can be prepared using phonetic decision trees Young et al. (2005). The list of possible triphones has to be provided for a particular language along with a categorisation of phonemes. The triphone statistics can also be used to generate hypotheses for the recognition of out-of-dictionary words, including names and addresses.
3.3 Description of a problem solution
The problem is to find triphone statistics for the Polish language. Our first attempt at this task was already published in Ziółko et al. (2007). The task was conducted on a corpus containing mainly Parliament transcriptions, which amounts to around 50 megabytes of text. It was repeated on Mars, a Cyfronet computer cluster, for around 2 gigabytes of data.
Context-dependent modelling can significantly improve speech recognition quality. Each phoneme varies slightly depending on its context, namely the neighbouring phonemes, due to the natural phenomenon of coarticulation. This means that there are no clear boundaries between phonemes and they overlap each other, which results in interference of acoustical properties. Speech recognisers based on triphone models rather than phoneme ones are much more complex but give better results Young (1996). Let us present examples of different ways of transcribing the word 'above'. The phoneme model is ax b ah v, while the triphone one is *-ax+b ax-b+ah b-ah+v ah-v+*. In case a specific triphone is not present, it can be replaced by a phonetically similar triphone (phonemes of the same phonetic group interfere in a similar way with their neighbours) using phonetic decision trees Young et al. (2005) or by diphones (applying only the left or right context) Rabiner and Juang (1993).
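The mapping from a phoneme string to such triphone labels can be sketched as follows (a minimal illustration in Python; the function name is ours and is not part of HTK):

def to_triphones(phonemes):
    """Convert a phoneme sequence into HTK-style triphone labels L-X+R."""
    triphones = []
    for i, ph in enumerate(phonemes):
        left = phonemes[i - 1] if i > 0 else '*'
        right = phonemes[i + 1] if i < len(phonemes) - 1 else '*'
        triphones.append(f"{left}-{ph}+{right}")
    return triphones

# Example: the word 'above'
print(to_triphones(['ax', 'b', 'ah', 'v']))
# ['*-ax+b', 'ax-b+ah', 'b-ah+v', 'ah-v+*']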
3.4 Methods, software and hardware
Sophisticated rules and methods are necessary to obtain phonetic information from orthographic text data. Simplifications could cause errors Ostaszewska and Tambor (2000). The transcription of text into phonetic data was first performed with PolPhone Demenko et al. (2003). The extended SAMPA phonetic alphabet was applied, with 39 symbols (plus space) and pronunciation rules for the cities Poznań and Kraków. We used our own digit symbols corresponding to SAMPA symbols, instead of the typical ones, to distinguish phonemes more easily while analysing the received phonetic transcriptions. The stream editor (SED) was applied to change the original phoneme transcriptions into digits with the following script:
# Symbols that are prefixes of longer ones (e.g. d^z and d^z') must be
# replaced after the longer symbol, so the longer substitution comes first.
s/##/#/g
s/w~/2/g
s/d^z'/X/g
s/t^s'/8/g
s/s'/5/g
s/t^S/0/g
s/d^z/6/g
s/z'/4/g
s/d^Z/9/g
s/j~/1/g
s/t^s/7/g
s/n'/3/g
Statistics can now be simply collected by counting the number of occurrences of each phoneme, phoneme pair and phoneme triple in the analysed text, where each phoneme is just a symbol (a single letter or a digit). Matlab was used to analyse the phonetic transcription of the text corpora.
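The counting step can be sketched as follows (a minimal illustration, assuming the transcription has already been reduced to one character per phoneme; the toy string and variable names are ours):

from collections import Counter

def ngram_counts(text, n):
    """Count occurrences of phoneme n-grams in a transcription string."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

transcription = "#ala#ma#kota#"       # toy example, one character per phoneme, '#' marks a space
phonemes  = ngram_counts(transcription, 1)
diphones  = ngram_counts(transcription, 2)
triphones = ngram_counts(transcription, 3)
# Triples of the form x#y can then be removed, since they span a word boundary.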
Table 3.1: Phonemes in Polish (SAMPA, Demenko et al. (2003))

SAMPA   example   transcr.     occurrences    %       % Basztura (1992)
#       (space)   #            283 296 436   15.256   4.7
a       pat       pat          151 160 947    8.141   9.7
e       test      test         146 364 208    7.882  10.6
o       pot       pot          141 975 325    7.646   8.0
t       test      test          68 851 605    3.708   4.8
r       ryk       rIk           68 797 073    3.705   3.2
n       nasz      naS           68 056 439    3.665   4.0
i       PIT       pit           67 212 728    3.620   3.4
j       jak       jak           61 265 911    3.299   4.4
I       typ       tIp           58 930 672    3.174   3.8
v       wilk      vilk          58 247 951    3.137   2.9
s       syk       sIk           54 359 454    2.927   2.8
u       puk       puk           51 503 621    2.774   2.8
p       pik       pik           51 228 649    2.759   3.0
m       mysz      mIS           48 760 010    2.626   3.2
k       kit       kit           44 892 420    2.418   2.5
d       dym       dIm           44 406 412    2.391   2.1
l       luk       luk           40 189 121    2.164   1.9
n'      koń       kon'          34 092 610    1.84    2.4
z       zbir      zbir          30 924 282    1.665   1.5
w       łyk       wIk           30 194 178    1.626   1.8
f       fan       fan           25 308 167    1.363   1.3
g       gen       gen           24 910 462    1.341   1.3
t^s     cyk       t^sIk         24 789 080    1.335   1.2
b       bit       bit           24 212 663    1.304   1.5
x       hymn      xImn          21 407 209    1.153   1.0
S       szyk      SIk           20 756 164    1.118   1.9
s'      świt      s'vit         17 220 321    0.927   1.6
Z       żyto      ZIto          16 409 930    0.884   1.3
t^S     czyn      t^SIn         15 429 711    0.831   1.2
t^s'    ćma       t^s'ma        11 945 381    0.643   1.2
w~      ciąża     t^s'ow~Za     10 814 216    0.582   0.6
c       kiedy     cjedy         10 581 296    0.570   0.7
d^z'    dźwig     d^z'vik        9 995 596    0.538   0.7
N       pęk       peNk           4 880 260    0.262   0.1
d^z     dzwoń     d^zvon'        4 212 857    0.227   0.2
J       giełda    Jjewda         3 680 888    0.198   0.1
z'      źle       z'le           3 390 372    0.183   0.2
j~      więź      vjej~s'        1 527 778    0.082   0.1
d^Z     dżem      d^Zem            693 838    0.037   0.1
Figure 3.1: Phonemes in Polish in SAMPA alphabet (bar chart of occurrences [%] per phoneme class)
The calculations were conducted on Mars in Cyfronet, Kraków. We analysed more than 2 gigabytes of data. Text data for Polish are still being collected and will be included in the statistics in the future. Mars is a computing cluster with the following specification: IBM Blade Center HS21 with 112 Intel dual-core processors, 8 GB RAM per core, 5 TB of disk storage and 1192 Gflops. It operates using Red Hat Linux. Mars uses the Portable Batch System (PBS) to queue tasks and split calculation power in order to optimise times for all users. A user has to declare the expected time of every task. For example, a short task is up to 24 hours of calculations and a long one is up to 300 hours. Tasks can be submitted by simple commands with scripts and the cluster starts particular tasks when calculation resources are available. One process needs around 100 hours to analyse a 45-megabyte text file.
3.4.1 Grapheme to Phoneme Transcription
Two main approaches are used for the automatic transcription of texts into phonemic forms. The classical approach is based on phonetic grammatical rules specified by a human expert Steffen-Batóg and Nowakowski (1993) or obtained by a machine learning process Daelemans and van den Bosch (1997). The second solution utilises graphemic-phonetic dictionaries. Both methods were used in PolPhone to cover typical and exceptional transcriptions. Polish phonetic transcription rules are relatively easy to formalise because of their regularity.
The necessity of investigating a large text corpus pointed to the use of the Polish phonetic transcription system PolPhone Jassem (1996); Demenko et al. (2003). In this system, strings of Polish characters are converted into their phonetic SAMPA representations. Extended SAMPA (Table 3.1) is used to deal with the nuances of the Polish phonetic system. The transcription process is performed by a table-based system which implements the rules of transcription. A matrix T ∈ S^{m×n} is a transcription table, where S is a set of strings and the cells meet the requirements listed precisely in Demenko et al. (2003). The first element t_{1,1} of each table contains the currently processed character of the input string. For every character (or character substring) one table is defined. The first column of each table, {t_{i,1}}_{i=1}^{m}, contains all possible character strings that could precede the currently transcribed character. The first row, {t_{1,j}}_{j=1}^{n}, contains all possible character strings that can follow the currently transcribed character. All possible phonetic transcription results are stored in the remaining cells {t_{i,j}}_{i=2,j=2}^{m,n}. A particular element t_{i,j} is chosen as the transcription result if t_{i,1} matches the substring preceding t_{1,1} and t_{1,j} matches the substring following t_{1,1}. This basic scheme is extended to cover overlapping phonetic contexts. If more than one result is possible, then the longer context is chosen for transcription, which increases its accuracy. Exceptions are handled by additional tables in a similar manner.
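The table lookup itself can be sketched as follows (an illustrative simplification; the example contexts and outputs are invented, and the real PolPhone tables are far richer):

def transcribe_char(table, preceding, following):
    """Pick the transcription whose row/column contexts match the neighbours.

    `table` maps (preceding_context, following_context) -> phonetic string;
    the longest matching context is preferred, as in the rule described above.
    """
    best, best_len = None, -1
    for (pre, post), result in table.items():
        if preceding.endswith(pre) and following.startswith(post):
            if len(pre) + len(post) > best_len:
                best, best_len = result, len(pre) + len(post)
    return best

# Toy table for the letter 'n' (contexts and outputs are invented for illustration)
table_n = {("", ""): "n", ("", "i"): "n'"}
print(transcribe_char(table_n, preceding="ko", following="ia"))   # -> n'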
Specific transcription rules were designed by a human expert in an iterative process of testing and updating the rules. The text corpora used in the design process consisted of various sample texts (newspaper articles) and a few thousand words and phrases including special cases and exceptions.
3.4.2 Corpora Used
Several newspaper articles in Polish were used as input data in our experiment. They come from the Rzeczpospolita newspaper from the years 1993-2002. They cover mainly political and economic issues, so they contain quite many names and places, including foreign ones, which may influence the results slightly. For example, q appeared once, even though it does not exist in Polish. In total, 879 megabytes (103 655 666 words) were included in the process.

Several hundred thousand Internet articles in Polish formed another corpus. They all come from a high quality website, where all content is reviewed and controlled by moderators. They are of encyclopedia type, so they also contain many names, including foreign ones. In total, 754 megabytes (96 679 304 words) were included in the process.

The third corpus consists of several literature books in Polish. Some of them are translations from other languages, so they also contain foreign words. The corpus includes 490 megabytes (68 144 446 words) of text.
3.4.3 Results
A total of around 1 856 900 000 phonemes was analysed. They are grouped into 40 categories (including space). Actually, one more, namely q, was detected, which appeared in a foreign name. Since q is not a part of the Polish alphabet, it was not included in the phoneme distribution presented in Table 3.1. The space (noted as #) frequency was 15.26%. The average number of phonemes in a word is 6.6, including one space.
Figure 3.2: Frequency of diphones in Polish (each phoneme separately): a matrix of transition probabilities [%] between first and second phoneme classes
Exactly 1 271 different diphones (Fig. 3.2 and Table 3.2) out of 1 560 possible combinations were found, which constitutes 81%.

21 961 different triphones (see Table 3.3) were detected. Combinations like *#*, where * is any phoneme and # is a space, were removed. These triples should not be considered as triphones because the first and the second * are in two different words. The list of the most common triphones is presented in Table 3.3. Assuming 40 different phonemes (including space) and subtracting the mentioned *#* combinations, there are 62 479 possible triples (40³ = 64 000 minus 39 × 39 = 1 521). We found 21 961 different triphones. This leads to the conclusion that around 35% of the possible triples were detected as triphones, the vast majority of them at least 10 times.
Young (1996) estimates that in English, 60-70% of possible triples exist as triphones. However, in his estimation there is no space between words, which changes the distribution a lot. Some triphones may not occur inside words but may occur at the junction of the end of one word and the beginning of another. We have started to calculate such statistics without an empty space as the next step of our research. It is also expected that there are different numbers of triphones for different languages. Some values are similar to statistics given by Jassem a few decades ago and reprinted in Basztura (1992). We applied computer clusters, so our statistics were calculated for much more data and are more representative.
Figure 3.3: Space of triphones in Polish

Fig. 3.2 shows some symmetry, but the probability of a diphone αβ is usually different from the probability of βα. The mentioned quasi-symmetry results from the fact that high values of the probability of α and (or) β often give high probabilities of the products αβ and βα as well.
Similar effects can be observed for triphones. The data presented here illustrate the well-known fact that the probabilities of triphones (see Table 3.3) cannot be calculated from the diphone probabilities (see Table 3.2); the conditional probabilities between diphones have to be known.
Besides the frequency of triphone occurrence, we are also interested in the distribution of these frequencies. This is presented on a logarithmic scale in Fig. 3.4. We obtained a different distribution than in the previous experiment Ziółko et al. (2007) because a larger number of words was analysed. We found around 500 triphones which occurred once and around 300 which occurred two or three times. Then, each occurrence count up to 10 was observed for 100 to 150 triphones. This supports the hypothesis that one can reach a situation where new triphones no longer appear and only the distribution of occurrences changes as more data are analysed. Some threshold can be set and the rarest triphones can be removed as errors caused by unusual Polish word combinations, acronyms, slang and other variations of dictionary words, onomatopoeic words, foreign words, errors in phonisation and typographical errors in the text corpus.
Entropy

H = -\sum_{i=1}^{40} p(i) \log_2 p(i),    (3.1)

where p(i) is the probability of a particular phoneme, is used as a measure of the disorder of a linguistic system.
Table 3.2: Most common Polish diphones

diphone   no. of occurr.   %       diphone   no. of occurr.   %
e#        43 557 832       2.346   on        12 854 255       0.692
a#        38 690 469       2.084   #k        12 529 124       0.675
#p        31 014 275       1.671   ta        12 449 178       0.671
je        28 499 593       1.535   #n        12 316 393       0.663
i#        24 271 474       1.307   va        11 413 878       0.615
o#        23 552 591       1.269   ko        11 168 294       0.602
#v        20 678 007       1.114   #i        10 515 253       0.566
y#        19 018 563       1.024   aw        10 514 514       0.566
na        18 384 584       0.990   u#        10 379 234       0.559
#s        17 321 614       0.933   #f        10 265 162       0.553
po        16 870 118       0.909   #b        10 167 482       0.548
#z        16 619 556       0.895   #r        10 137 129       0.546
ov        16 206 857       0.873   ja        10 097 444       0.544
st        15 895 694       0.856   ar         9 818 127       0.529
n'e       14 851 771       0.800   x#         9 811 211       0.528
#o        14 104 742       0.760   do         9 779 666       0.527
#t        13 910 147       0.749   er         9 724 692       0.524
ra        13 713 928       0.739   te         9 618 998       0.518
#m        13 657 073       0.736   #j         9 398 210       0.506
ro        13 597 891       0.732   v#         9 251 288       0.498
#d        13 103 398       0.706   #a         9 143 021       0.492
m#        12 968 346       0.698   to         9 043 529       0.487
Figure 3.4: Phoneme occurrences distribution (log(occurrences of a triphone) plotted against triphones)
Table 3.3: Most common Polish triphones

triphone   no. of occurr.   %       triphone   no. of occurr.   %
#po        12 531 515       0.675   wa#        3 262 204        0.176
#na         9 587 483       0.516   do#        3 210 532        0.173
n'e#        9 178 080       0.494   #ma        3 209 675        0.173
na#         8 588 806       0.463   jon        3 082 879        0.166
ow~#        6 778 259       0.365   e#z        3 054 967        0.165
#do         6 751 495       0.364   a#v        3 028 787        0.163
#za         6 429 379       0.346   #z#        2 928 164        0.158
ej#         6 390 911       0.344   ka#        2 871 230        0.155
je#         6 388 032       0.344   #sp        2 818 515        0.152
#pS         6 173 458       0.333   ont^s      2 754 934        0.148
go#         5 990 895       0.323   e#s        2 737 210        0.147
#i#         5 945 409       0.320   i#p        2 725 414        0.147
ego         5 742 711       0.309   o#p        2 719 121        0.146
ova         5 560 749       0.300   #Ze        2 701 194        0.145
vje         5 433 154       0.293   #ja        2 670 034        0.144
#v#         5 317 078       0.286   ta#        2 618 595        0.141
#je         5 311 716       0.286   ent        2 612 166        0.141
#n'e        5 292 103       0.285   #to        2 567 269        0.138
sta         4 983 295       0.268   to#        2 557 630        0.138
#s'e        4 861 117       0.262   pro        2 548 979        0.137
yx#         4 858 960       0.262   pra        2 539 424        0.137
#vy         4 763 697       0.257   #pa        2 503 153        0.135
s'e#        4 746 280       0.256   #re        2 502 443        0.135
pSe         4 728 565       0.255   ost        2 490 304        0.134
e#p         4 727 840       0.255   #ty        2 452 830        0.132
#f#         4 660 745       0.251   t^se#      2 436 864        0.131
em#         4 514 478       0.243   #mj        2 397 741        0.129
#pr         4 428 341       0.239   ku#        2 383 231        0.128
#ko         4 216 459       0.227   e#m        2 379 510        0.128
a#p         4 155 732       0.224   ja#        2 353 638        0.127
ci#         3 965 693       0.214   e#o        2 343 622        0.126
ne#         3 958 262       0.213   a#s        2 336 272        0.126
cje         3 916 595       0.211   #vj        2 329 962        0.125
n'a#        3 888 279       0.209   #mo        2 320 091        0.125
#ro         3 785 754       0.204   nyx        2 299 719        0.124
mje         3 760 340       0.203   os't^s'    2 295 365        0.124
#st         3 745 320       0.202   ovy        2 284 782        0.123
aw#         3 596 680       0.194   sci        2 282 887        0.123
ny#         3 580 425       0.193   ove        2 262 277        0.122
#te         3 449 304       0.186   li#        2 255 403        0.121
e#v         3 313 798       0.178   ovj        2 251 294        0.121
Ze#         3 309 352       0.178   mi#        2 243 432        0.121
ym#         3 300 273       0.178   uv#        2 236 507        0.120
It describes how many bits on average are needed to describe a phoneme. According to Jassem in Basztura (1992), the entropy for Polish is 4.7506 bits/phoneme. From our calculations, the entropy for phonemes is 4.6335, for diphones 8.3782 and for triphones 11.5801.
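As an illustration of (3.1), the phoneme entropy can be computed directly from the relative frequencies (a minimal sketch; the list below is deliberately shortened, not the full 40-element distribution):

import math

def entropy(probabilities):
    """Shannon entropy in bits, H = -sum p*log2(p), skipping zero entries."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

counts = [283296436, 151160947, 146364208]   # first few rows of Table 3.1 only
total = 1856900000                            # approximate total from Section 3.4.3
probs = [c / total for c in counts]
print(entropy(probs))
# With the complete 40-phoneme distribution this yields about 4.63 bits/phoneme.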
3.5 Analysis of Phonetic Similarities in Wrong Recognitions of the Polish Language
A speech recognition system for Polish based on HTK is presented. It was trained on 365 utterances, each spoken by 26 male speakers. Errors in recognition were analysed in detail in an attempt to find the reasons and scenarios of wrong recognitions.
We aim to provide a large vocabulary ASR system for Polish. There is very little research on this topic and there is no system which would work at the sentence level with a relatively rich dictionary. Polish differs from the languages most commonly used in ASR, like English, Japanese and Chinese, in the same way as all Slavic languages: it is highly inflective and non-positional. These disadvantages are compensated for by an important feature of the Polish language: the relation between the phonemes and the orthographic transcription is more direct.
We used the HTK (Rabiner, 1989; Young, 1996) as the basis of the recognition engine. While this solution seems to work well, it is necessary to add extra tools on the grammar and semantic levels if a large dictionary is to be used while retaining very good recognition.
The mel-frequency cepstral coefficients (MFCCs) (Davis and Mermelstein, 1980; Young, 1996) were calculated for parametrisation. 12 MFCCs plus energy, with first and second derivatives, were used, giving a standard set of 39 elements. We used 25 ms windows for audio framing and a pre-emphasis filtering coefficient of 0.97. Segments were windowed using the Hamming method. All 37 different phonemes were distinguished using the phonetic transcription provided with the corpus. As was shown in the previous chapter, HTK is a standard for ASR and its technical details are also considered state of the art in ASR. HTK is widely used as a model (Hain et al., 2005; Zhu and Paliwal, 2004; Ishizuka and Miyazaki, 2004; Evermann et al., 2004). We used the HTK settings suggested in the tutorial in (Young et al., 2005), apart from the sentence model. We did not use it at all because of linguistic differences between English and Polish; namely, the order of words in Polish is too irregular to use this kind of model. In this experiment we simply treated sentences as if they were words, which means we put them in the dictionary. Obviously, we used a different dictionary and list of phonemes than in the English example in the tutorial. All other settings were like those suggested in (Young et al., 2005).
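For illustration, a comparable 39-dimensional feature vector can be produced outside HTK as follows (a sketch using the librosa package, which is our choice here, not the HTK implementation; the file name is a placeholder and the 10 ms frame shift is our assumption, as the text only states the 25 ms window):

import numpy as np
import librosa

y, sr = librosa.load('utterance.wav', sr=16000)            # CORPORA uses 16 kHz
frame, hop = int(0.025 * sr), int(0.010 * sr)               # 25 ms windows, assumed 10 ms shift
y = np.append(y[0], y[1:] - 0.97 * y[:-1])                  # pre-emphasis, coefficient 0.97

mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=12,
                            n_fft=frame, hop_length=hop, window='hamming')
energy = np.log(librosa.feature.rms(y=y, frame_length=frame, hop_length=hop) + 1e-10)
static = np.vstack([mfcc, energy])                          # 13 static features
features = np.vstack([static,
                      librosa.feature.delta(static),        # first derivatives
                      librosa.feature.delta(static, order=2)])   # second derivatives
print(features.shape)                                       # (39, number of frames)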
Errors in speech recognition can have many different causes (Greenberg et al., 2000). Some of them appear because of phonetic similarities of different types, although there are errors which cannot be explained by acoustic similarities. We want to find other possible reasons for these errors. The results are presented with a very deep analysis of which utterances were wrongly recognised and what utterances they were recognised as. This knowledge may help in future ASR system design and in preparing data for corpora and model training.
There are three general types of errors: random, systematic and gross. Random (or indeterminate) errors are caused by uncontrollable fluctuations of voice that affect parametrisation and
experimental results. Systematic (or determinate) errors are instrumental, methodological, or personal mistakes causing lopsided data, which deviate consistently in one direction from the true value. The detection of such errors is most important, because the model then has to be altered. Gross errors are caused by experimenter carelessness or equipment failure, which are quite unlikely here as we used professionally recorded data which had already been used by other researchers.
Our system has been trained on part of a set called CORPORA (Grocholewski, 1995), created under the supervision of Stefan Grocholewski at the Institute of Computer Science, Poznań University of Technology, in 1997 (Grocholewski, 1995). Speech files in CORPORA were recorded with the sampling frequency f0 = 16 kHz, equivalent to a sampling period t0 = 62.5 µs. Speech was recorded in an office, with a working computer in the background, which makes the corpus not perfectly clean. The signal to noise ratio (SNR) is not stated in the description of the corpus. It can be assumed that the SNR is very high for actual speech but minor noise is detectable in periods of silence. The database contains 365 utterances (33 single letters, 10 digits, 200 names, 8 short computer commands and 114 simple sentences), each spoken by 11 females, 28 males and 6 children (45 people), giving 16425 utterances in total. One set spoken by a male and one by a female were hand segmented. The rest were segmented by a dynamic programming algorithm using a model trained on the hand segmented ones. The optimisation was used to fit borders using the existing hand segmentation of the same utterance spoken by two different people. All available utterances for 26 male speakers were used for training, considering all of them as single words in the HTK model. We created the decision tree to find the contexts making the largest difference to the acoustics and which should distinguish clusters, using rules of phonology and phonetics in Polish (Kępiński, 2005), to create tied-state triphones.

In all our experiments involving HTK, some preprocessing of the data is necessary because of the special letters in Polish. The first step of this process is to change all upper case letters into lower case letters. Then all Polish special letters are replaced by corresponding standard capital letters; for example, ó is changed into O, as sketched below.
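A minimal sketch of this normalisation follows (the replacement map is ours and only illustrates the idea of mapping Polish diacritics to capital letters; the actual mapping used in the experiments is not fully listed in the text):

# Map Polish diacritics to capital letters that do not otherwise occur
# after lower-casing (illustrative subset only).
POLISH_MAP = str.maketrans({'ó': 'O', 'ą': 'A', 'ę': 'E', 'ł': 'L',
                            'ś': 'S', 'ź': 'X', 'ż': 'Z', 'ć': 'C', 'ń': 'N'})

def normalise(text):
    """Lower-case the text, then replace Polish special letters."""
    return text.lower().translate(POLISH_MAP)

print(normalise("Koń drogą marną szedł"))   # -> "koN drogA marnA szedL"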
3.6 Experimental Results on Applying HTK to Polish
As already mentioned, the system was trained on 9490 utterances, 365 for each of 26 male speakers. The orthographic dictionary contains 365 elements, but due to differences in pronunciation between speakers, the final version of the dictionary, working on phonetic transcriptions, contains 1030 entries.

We started the recognition evaluation using data of the only male speaker who was not used in the training (Table 3.4). Only 6 out of 365 utterances were substituted, giving a correctness of 98.36% (359 correctly recognised utterances out of 365). Audio files of females, boys and girls were also recognised to check the correlation between the parameterisation of different ages and genders. These speakers were also used instead of adding noise to the male speaker. We received correctnesses of 79.73%, 95.34% and 92.05% for adult female speakers. Child male speakers were recognised with correctnesses of 60.55%, 95.07% and 75.62%. We noted correctnesses of 88.22% and 84.11% for girls. All speakers other than adult males gave clearly worse results; however, there is no obvious difference between the degradation in results related to age or gender.
Table 3.4: Word recognition correctness for different speakers (the model was trained on adult male speakers only)

speaker   age     gender   substitutions   correctness [%]
AO1M1     adult   male       6             98.36
AF1K1     adult   female    74             79.73
BC1K1     adult   female    17             95.34
BW1K1     adult   female    29             92.05
AK1C1     child   male     144             60.55
AK2C1     child   male      89             75.62
CK1C1     child   male      18             95.07
LK1D1     child   female    43             88.22
ZK1D1     child   female    58             84.11
Table 3.5: Errors in different types of utterances (for all speakers)

type                  errors   no. being recognised   % of errors
sentences                2       1026                   0
digits                  21         90                  23
alphabet               130        297                  44
names and commands     312       1872                  17
Even girl speakers, for whom both age and gender differed from the training speakers, were recognised with a similar number of errors as speakers with just a different gender or age.

The types of errors were carefully analysed. First, we checked the percentage of correctly and wrongly recognised utterances depending on the type of utterance (Table 3.5). It can be clearly seen that smaller units are much more difficult to recognise: 44% errors for one-syllable units (spoken letters of the alphabet), 23% and 17% for single words, and almost no errors for sentences, even though we also evaluated the system on speakers of a gender and age which were not used during the training. This suggests that recognition based on MFCC parameterisation only is not enough; the context has to be used to allow the HMM models to work correctly (or a much better parameterisation, if possible).

All sentences were treated as single words during the training and the testing. The recognition of sentences is at an exceptional level, especially considering that we used many speakers of a gender and age not used during the training. The only two wrong recognitions are quite bizarre. In the first case the sentence which means 'He cleans sparrows in the zoo' was recognised as the female name Helena. In the second case the sentence 'Oh, it was more grey than yours' was recognised as 'A horse went along a poor road'. In both cases the correct transcription and the wrong recognition (Table 3.6) are phonetically very different and very easily distinguishable for a human listener.
Table 3.6: Errors in sentences (speakers AK1C1 and AK2C1 respectively)

correct transcription              wrong recognition
On myje wróble w zoo               Helena
Oj bardziej niż wasz był szary     Koń drogą marną szedł
Table 3.7: Errors in digits

digit        no.   wrong recognitions
0 zero        4    Zofia, Iwona, ce, Bożena
3 trzy        4    ce(2), zero, Joanna
1 jeden       3    Urban(2), Izabela
4 cztery      3    o, ge(2)
2 dwa         2    Diana, Anna
8 osiem       2    Franciszek, Alicja
5 pięć        1    Rudolf
6 sześć       1    zero
7 siedem      1    Zenon
9 dziewięć    1    Diana
There are several interesting detailed observations in the patterns of wrong recognitions. Only one name was recognised as a sentence and rather few were recognised as spoken letters (Tables 3.8 and 3.9). The majority of wrong hypotheses were simple words. This means that the efficiency of the model depends on the length of the utterances; it works better for longer ones.
A very interesting fact is that even if names are recognised wrongly, their gender is still correct most of the time. 79 female names were recognised as other female names (out of those presented in Table 3.8), with only 17 female names recognised as male names. Some clue might be that the vast majority of female names in Polish end with 'a'. However, such a phonological similarity is probably not strong enough to explain this effect, and it is difficult to explain this phenomenon fully. A similar pattern was found in the case of male names: 50 male names were wrongly recognised as other male names and only 14 male names were recognised as female names.

There are some pairs of phonologically similar names, like Lucjan and Łucjan, or Mariola and Marian, which were quite commonly mistaken for each other. However, most wrong recognitions do not seem to have an explanation of this kind. What is more, some wrong detections with large phonological differences appear quite frequently. For example, Barbara was recognised wrongly three times, and all of them as Marzena. It has to be stressed that many pairs of very similar words were recognised quite correctly; for example the name Maria was only twice recognised as Marian, and Marian as Maria just once. We can conclude that phonological similarities can cause wrong detections but do not seem to be a major source of them.
Table 3.10 shows the names which were used as wrong hypotheses for the errors listed in the other tables. There is an interesting tendency that these words were recognised correctly most of the time when audio with their content was analysed. This suggests that some utterances are generally more probable than others for the recognition of the whole set, correct or not. We can say that they are represented more strongly in the language models. In a similar way, names which were wrongly recognised rarely appear in Table 3.10, because they are weakly represented. It has to be stressed that all utterances were used 26 times (Table 3.10) during the training. The best example of this behaviour is the name Łucjan, which was recognised for virtually all test speakers as Lucjan.
Table 3.8: Errors in the most often wrongly recognised names and commands

word       no.   wrong recognitions
Łucjan      9    Lucjan(9)
Nina        7    Lidia, Emilia, Anna(2), Łucja, Urszula, Julian
Dorota      6    Beata(4), Renata, Danuta
Jan         6    Jerzy, Łucjan(2), Daniel, Diana, Leon
nie         6    źle, Lech(2), u(2), o
cofnij      5    Teofil(3), Rafał(2)
Dominik     5    Jan, Daniel(3), Jakub
Ewa         5    Anna, Helena, Oleńka, Eliza, Helena
Maria       5    Mariola, Marian(2), Klaudia, Marzenka
Regina      5    Joanna, Romuald, el, Emilia, Aniela
Wacław      5    Lucyna(2), Jarosław
Ziuta       5    Julita(2), Joanna, Jolanta, Olga
Emilia      4    Aniela(2), el, ku
Emil        4    ku, el(3)
Gerard      4    Eugenia, Bożena, Leonard, de
Julia       4    Urszula, Julian(2), Joanna
Lech        4    zero, Joanna, u, te
Łucja       4    Lucjan(2), Urszula(2)
Sabina      4    Celina(2), Halina(2)
Teodor      4    Adam(3), Joanna
Alina       3    Emilia, Alicja, Urszula
Barbara     3    Marzena(3)
Benon       3    Damian(2), Marian
Bernard     3    Gerard, Beata, Leonard
Cecylia     3    Apolonia(2), Wacław
Celina      3    Karol, źle, Mariola
Damian      3    Daniel(2), Benon
Daria       3    Marta, Daniel, Bożena
Eliza       3    Alina(2), Lucjan
Felicja     3    Łucja, Urszula, Alicja
Hanna       3    Helena, Marian, Halina
Henryk      3    Alfred, Romuald, Hubert
Irena       3    Ireneusz, Urszula, Karolina
Iwona       3    Izabela, Maria, Zuzanna
Izydor      3    jeden, Romuald, Bogdan
Jerzy       3    źle, u, Leszek
Janusz      3    Ireneusz, Lech, Rudolf
Karolina    3    Mariola, Pelagia, Alina
Monika      3    Oleńka, Łukasz
Table 3.9: Errors in the most often wrongly recognised names and commands (2nd part)

word        no.   wrong recognitions
Marek        3    Romuald(2), Marta
Mariola      3    Marian(2), Maria
Pelagia      3    Karolina(2), ten chór dusiłem licznie
Paulina      3    Mariola(2), Karolina
Sławomir     3    Hanna, Mariola, Karol
Seweryn      3    Karolina, Cezary, Zenon
Wojciech     3    Walenty, Monika, Alicja
Wanda        3    Halina, Marzena, Mariola
Weronika     3    Dorota, Renata, Danuta
źle          3    Julian, Joanna, Zofia
Zenon        3    Marian(2), Benon
Table 3.10: Names which appeared the most commonly as wrong recognitions in the above statistics

name       no.   name       no.   name       no.
Lucjan     14    Alina       3    Aniela      2
Marian      8    Bożena      3    Apolonia    2
Urszula     8    Diana       3    Benon       2
Daniel      7    Emilia      3    Celina      2
Joanna      7    Helena      3    Damian      2
Mariola     7    Ireneusz    3    Danuta      2
Beata       5    Rudolf      3    Izabela     2
Karolina    5    Julita      2    Maria       2
Marzena     5    Karol       2    Marta       2
Romuald     5    Lech        2    Oleńka      2
Alicja      4    Leonard     2    Renata      2
Anna        4    Leszek      2    Urban       2
Halina      4    Łucja       2    Zenon       2
Julian      4    Łucjan      2    Zofia       2
Table 3.11: Errors in the pronounced alphabet

letter   errors   letter   errors   letter   errors
en        9       ce        5       a         2
em        8       e         5       es        2
er        8       ka        5       żet       2
pe        8       be        4       eł        1
će        7       de        4       ku        1
ą         6       ge        4       u         1
eń        6       i         4       wu        1
te        6       o         4       el        1
y         6       zet       3       ę         1
esz       6       eś        3       ef        1
źet       6
The name Lucjan was always correctly recognised. What is more, Lucjan was provided as a hypothesis for several other names, including Jan, which was recognised as Lucjan in the case of two different speakers. In this example the name Lucjan was provided as a recognised word 23 times (including correct ones) and Łucjan twice, in both cases incorrectly.

Table 3.11 presents the wrongly recognised letters of the alphabet. We have already mentioned that this group is most likely to contain errors because its elements are very short and the HMM model cannot use all its advantages. We can also observe that sonorants (n, m, r) tend to be the most difficult to recognise. The letters ha and jot were recognised correctly for all speakers.
3.7 Conclusion
Polish and English were compared with respect to approaches to ASR of these two languages. 250 000 000 words from different corpora (newspaper articles, Internet and literature) were analysed. Statistics of Polish phonemes, diphones and triphones were created. They are not fully complete, but the corpora were large enough that they can be successfully applied in NLP applications and speech processing. The collected statistics are the largest of this type of linguistic computational knowledge for Polish. Polish is one of the most common Slavic languages. It has several phonemes that do not exist in English and the statistics of its phonemes are also different. The most popular and standard ASR toolkit, HTK, was trained for the Polish language and tested with a deep analysis of the errors that occurred.
Chapter 4
Phoneme Segmentation
Speech signals typically need to be divided into small frames before recognition can begin. Analysis of these frames can then determine the likelihood of a particular phoneme being present within the frame. Speech is non-stationary in the sense that frequency components change continuously over time, but it is generally assumed to be a stationary process within a single frame. Segmentation methods currently used in speech recognition usually do not consider where phonemes begin and end, which causes complications to appear at the boundaries of phonemes. However, non-uniform phoneme segmentation has already been found useful for more accurate modelling in ASR (Glass, 2003).

A phoneme segmentation method is presented in this chapter which is more sophisticated than the one described in (Ziółko et al., 2006b). More scenarios are covered and the results are evaluated in a better way. Experiments were conducted on the much larger CORPORA set, which was described in the previous chapter. The method is based on analysing the envelopes and the rate-of-change of the DWT subband power.
4.1 Analysis Using the Discrete Wavelet Transform
The human hearing system uses frequency processing in the first step of sound analysis. While the details are still not fully understood, it is clear that a frequency based analysis of speech reveals important information. This encourages us to use the DWT as a method of speech analysis, since the DWT may be more similar to the human hearing system than other methods (Wang and Narayanan, 2005; Daubechies, 1992). Details of the wavelet transformation are beyond the scope of this thesis, but here we present a brief overview of the method. The wavelet transformation provides a time-frequency spectrum. The original speech signal s(n) and its wavelet spectrum are of 16-bit accuracy. In order to obtain the DWT (Daubechies, 1992), the coefficients of the series
s_{m+1}(n) = \sum_i c_{m+1,i} \phi_{m+1,i}(n)    (4.1)
are computed, where \phi_{m+1,i} is the ith wavelet function at the (m+1)th resolution level. Due to the orthogonality of the wavelet functions,

c_{m+1,i} = \sum_{n \in D_{m+1,i}} s(n) \phi_{m+1,i}(n),    (4.2)

where

D_{m+1,i} = \{ n : \phi_{m+1,i}(n) \neq 0 \}    (4.3)
are the supports of \phi_{m+1,i}. The coefficients of the lower level are calculated by applying the well-known (Daubechies, 1992; Rioul and Vetterli, 1991) formulae:

c_{m,k} = \sum_i h_{i-2k} c_{m+1,i},    (4.4)

d_{m,k} = \sum_i g_{i-2k} c_{m+1,i},    (4.5)
where h_i and g_i are constant coefficients which depend on the scaling function \phi and wavelet \psi (e.g. the functions presented in Fig. 4.2, which characterise dmey, the discrete Meyer wavelet). The speech spectrum is decomposed using the digital filtering and downsampling procedures defined by (4.4) and (4.5). This means that, given the wavelet coefficients c_{m+1,i} of the (m+1)th resolution level, (4.4) and (4.5) are applied to compute the coefficients of the mth resolution level. The elements of the DWT for a particular level may be collected into a vector, for example d_m = (d_{m,1}, d_{m,2}, ...)^T. The coefficients of the other resolution levels are calculated recursively by applying formulae (4.4) and (4.5). The multiresolution analysis gives a hierarchical and fast scheme for the computation of the
wavelet coefficients for a given speech signal s. In this way the values
DWT(s) = \{ d_M, d_{M-1}, \ldots, d_1, c_1 \}    (4.6)
of the DWT for M + 1 levels are obtained. Each signal
s_{m+1}(n) = s_m(n) + s^d_m(n)  for all n \in \mathbb{Z}    (4.7)
on the resolution level m+1 is split into approximation (coarse signal)
s_m(n) = \sum_k c_{m,k} \phi_{m,k}(n)    (4.8)
on the lower, mth resolution level and the high frequency details
s^d_m(n) = \sum_k d_{m,k} \psi_{m,k}(n).    (4.9)
The wavelet transformation can be viewed as a tree. The root of the tree consists of the coefficients of the wavelet series (4.1) of the original speech signal. The first level of the tree is the result of one step of (4.5). Subsequent levels in the tree are constructed by recursively applying (4.4)
Figure 4.1: Wavelet transform outperforms STFT because it has higher resolution for higher frequencies.
and (4.5) to split the spectrum into the low (approximation c_{m,n}) and high (detail d_{m,n}) parts. Experiments undertaken by us show that a speech signal decomposition into six levels is sufficient (see Fig. 4.3) to cover the frequency band of the human voice (see Table 4.1). The energy of the speech signal above 8 kHz and below 125 Hz is very low and can be neglected. The same experiment was conducted using 7 subbands and worse results were obtained.
There is a wide variety of possible basis functions from which a DWT can be derived. To
determine the optimal choice of wavelet, we analysed six different wavelet functions: Meyer (Fig.
4.2), Haar, Daubechies wavelets of 3 different orders and symlets. Our results show that the
discrete Meyer wavelet gives the best results.
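A six-level decomposition of this kind can be sketched with an off-the-shelf wavelet library (a minimal illustration using PyWavelets, which is our choice here; the thesis experiments were run in Matlab):

import numpy as np
import pywt

signal = np.random.randn(2 ** 14)            # placeholder for a speech signal s(n)
signal = signal / np.max(np.abs(signal))     # normalise, as in step 1 of the algorithm below

# Six-level DWT with the discrete Meyer wavelet ('dmey')
coeffs = pywt.wavedec(signal, 'dmey', level=6)
approx, details = coeffs[0], coeffs[1:]
# Note: pywt orders the detail bands from the lowest-frequency to the
# highest-frequency band; the thesis labels the highest-frequency band d6.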
4.2 General Description of the Segmentation Method
Phonemes are characterised by differing frequency content, and so we would expect changes of power in different wavelet resolution levels between phonemes. Clearly, it would be easiest to analyse the absolute value of the rate-of-change of power and expect it to be large at the beginning and at the end of phonemes. However, this does not uniquely define start and end points, for two reasons. Firstly, the power can rise over a considerable length of time at the start of a phoneme, leading to an ambiguous start time. Secondly, there may also be rapid changes in power in the middle of a segment. A better method of detecting phoneme boundaries relies on power transitions between the DWT subbands. Our approach (Ziółko et al., 2006b) is based on a six-level DWT analysis (M = 6) of a speech signal (Fig. 4.3).
Figure 4.2: The discrete Meyer wavelet (dmey) and the Meyer scaling function
Figure 4.3: Subband amplitude DWT spectra (levels d6 to d1) of the Polish word 'osiem' (eng. eight). The number of samples depends on the resolution level
Table 4.1: Characteristics of the discrete wavelet transform levels and their envelopes

Level   Band (kHz)      No. of samples   Window
d6      8 - 4            32               5
d5      4 - 2            16               5
d4      2 - 1             8               5
d3      1 - 0.5           4               3
d2      0.5 - 0.25        2               3
d1      0.25 - 0.125      1               3
The number 2^{-M+m-1}N of wavelet spectrum samples in the mth level (where m = 1, ..., M) depends on the length N of the speech signal in the time domain, assuming N is a power of 2. Table 4.1 presents their number at each level relative to the lowest resolution level. The power waveform

p_m(n) = \sum_{j=1}^{2^{m-1}} d^2_{m, j-1+n 2^{m-1}},  where n = 0, ..., 2^{-M}N - 1,    (4.10)

is computed in a way to obtain an equal number of power samples for all subbands.
The DWT subband power shows rapid variations (see Fig. 4.3) and, despite the smoothing in (4.10), the power waveforms change rapidly. The first order differences of the power are inevitably noisy, and so we calculate the envelopes p^{en}_m(n) of the power fluctuations in each subband by choosing the highest values of p_m(n) in a window of a given size ω (see Table 4.1), which gives a power envelope (Fig. 4.4). A smoothed differencing operator was used: the subband power p_m is convolved with the mask [1, 2, -2, -1] to obtain the smoothed rate-of-change information r_m(n).
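The subband power, envelope and smoothed rate-of-change described above can be sketched as follows (an illustrative NumPy version; the function and variable names are ours):

import numpy as np

def subband_power(d_m, m, M, N):
    """Power waveform (4.10): sum of squared coefficients in blocks of 2**(m-1),
    giving the same number of power samples (N / 2**M) for every subband."""
    block = 2 ** (m - 1)
    n_out = N // 2 ** M
    return np.array([np.sum(d_m[k * block:(k + 1) * block] ** 2) for k in range(n_out)])

def envelope(p_m, window):
    """Envelope: the maximum of p_m within a sliding window of size `window`."""
    half = window // 2
    padded = np.pad(p_m, half, mode='edge')
    return np.array([padded[i:i + window].max() for i in range(len(p_m))])

def rate_of_change(p_m):
    """Smoothed differencing: convolve the subband power with the mask [1, 2, -2, -1]."""
    return np.convolve(p_m, [1, 2, -2, -1], mode='same')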
In order to improve accuracy, a minimum threshold p_min was introduced for the subband DWT power. This threshold was chosen experimentally as 0.0002 for the test corpus. It prevents us from analysing noise where the power of the speech signal is very small (for example in areas of 'silence'), even though noise is very low in the test corpus. The parameter p_min can easily be chosen for other corpora by analysing a part containing noise only; it can be set as 110% of the noise power. The start and end of a phoneme should be marked by an initially small, but rapidly rising power level in one or more of the DWT levels. In other words, the derivative can be expected to be approximately as large as the power. This is why phoneme boundaries can be detected by searching for points n for which the inequality

p \geq | \beta |r_m(n)| - p^{en}_m(n) |    (4.11)

holds. The constant p is a threshold value which accounts for the time scale and sensitivity of the crossing points. We found that setting the threshold p to 0.1 gave the best results. The rate-of-change function r_m is multiplied by a scaling factor β approximately equal to 1, which allows us to subtract the power envelope from the product β|r_m(n)|.
Figure 4.4: Segmentation of the Polish word 'osiem' (eng. eight) based on DWT sub-bands (levels d6 to d1). Dotted lines are hand segmentation boundaries; dashed lines are automatic segmentation boundaries; bold lines are envelopes and thin lines are smoothed rate-of-change
4.3 Phoneme Detection Algorithm
Without any additional refinement, the above method may not be able to detect the phoneme boundaries precisely. There are several reasons for this. First, the exact locations of the boundaries may vary slightly between subbands, and for some phonemes only one frequency band may show significant variations in power, while for others several subbands may show variations in power. Sometimes the analysis will detect slightly separated boundaries for different subbands. Secondly, despite smoothing the derivative, there may be a number of transitions which represent the same boundary. This problem was approached by noting the transitions and other situations which are likely to happen at phoneme boundaries using e(n), which will be referred to as an event function. Such an approach lets us consider several scenarios and aspects of potential phoneme boundaries. It also allows us to improve the method easily by adding additional events to the existing list. The suggested events are presented in Table 4.2 and explained in detail later.
Surprisingly, pre-emphasis filtering was found to degrade quality, so it was not used in the final version of the algorithm:
1. Normalise a speech signal by dividing by its maximum value in an analysed fragment of
speech.
2. Decompose a signal into six levels of the DWT.
3. Calculate (4.10) in all frequency subbands to obtain the power representations pm (n) of the
mth subband.
4. Calculate the envelopes p^{en}_m (Fig. 4.4) of the power fluctuations in each subband by choosing the highest values of p_m in a window of a given size ω, according to Table 4.1.

5. Calculate the rate-of-change function r_m(n) (Fig. 4.4) by filtering p_m(n) with the mask [1, 2, -2, -1].

6. Create an event function e(n) = 0 for all n. In the next step the function value will be increased to record events for which r_m(n) and p^{en}_m(n) look like a phoneme boundary for a given n.

7. Analyse r_m(n) and p^{en}_m(n) for each DWT subband to find the discrete times n for which the event conditions described in Table 4.2 hold. Add the value of the event importance (as per Table 4.2) to the event function e(n) (Fig. 4.5) for a given discrete time n. If several events occur for a single discrete time, then sum the event importances of all of them. Repeat the step for all discrete times n. In this way we obtain a boundary distribution-like function
e(n) = \begin{cases} 0 & \text{if no condition is fulfilled for } n \\ \sum_i w_i & \text{otherwise,} \end{cases}    (4.12)

where w_i are the importance weights (see Table 4.2) of the events that occurred for n in all subbands.
8. Search for a discrete time n starting from 1, for which the event function is higher than a
decision threshold. A threshold value of 4 was chosen experimentally.
9. Find all the discrete times t_i for which

e(t_i) > \tau - 1,   t_i > n,   t_i - t_{i+1} < \alpha,    (4.13)

where n is the last index analysed in the previous step and α is associated with the minimal phoneme length (α = 4 gives approximately 20 ms). Organise all the discrete times t_i fulfilling the above conditions into separate groups.
Table 4.2: Types of events associated with a phoneme boundary. Mathematical conditions are based on the power envelope p^{en}_m(n), the rate-of-change information r_m(n), a threshold p on the distance between r_m(n) and p^{en}_m(n), a threshold p_min on the minimal p^{en}_m(n), and β = 1. Values in the importance column are for different DWT levels (the first for level d1, the second for level d2, the third for levels d3 to d5 and the last for level d6)

Description                   Mathematical condition                                  Importance
Quasi-crossing point          |β|r_m(n)| - p^{en}_m(n)| < p and                        1 3 4 1
                              (|β|r_m(n+1)| - p^{en}_m(n+1)| > p or
                              |β|r_m(n-1)| - p^{en}_m(n-1)| > p) and
                              p^{en}_m(n) > p_min
Crossing point, first case    β|r_m(n)| > p^{en}_m(n) + p and                          1 3 4 1
                              β|r_m(n+1)| < p^{en}_m(n+1) - p and
                              p^{en}_m(n) > 5 p_min
Crossing point, second case   β|r_m(n)| < p^{en}_m(n) - p and                          1 3 4 1
                              β|r_m(n+1)| > p^{en}_m(n+1) + p and
                              p^{en}_m(n) > 5 p_min
Rate-of-change higher than    β|r_m(n)| > p^{en}_m(n) and                              1 2 2 1
power envelope                p^{en}_m(n) > 2 p_min
10. Calculate the weighted mean discrete time b from the discrete times grouped in the previous step,

b = \frac{\sum_i t_i w_i}{\sum_i w_i}.    (4.14)

Index b is the detected phoneme boundary in the discrete timing of DWT level d_1, which was used in the algorithm for all other subbands by summing samples.
11. Repeat the previous three steps for the next discrete time values n, until the largest n with a non-zero value of the event function e(n) has been processed. (A sketch of steps 6-10 in code is given below.)
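The accumulation of the event function and the grouping into boundaries can be sketched as follows (a simplified illustration; the event tests of Table 4.2 are abstracted into pre-computed (time, weight) pairs and the names are ours):

import numpy as np

def detect_boundaries(events, threshold=4.0, alpha=4):
    """events: list of (n, weight) pairs produced by the Table 4.2 tests over
    all subbands. Returns weighted-mean boundary positions as in (4.14)."""
    length = max(n for n, _ in events) + 1
    e = np.zeros(length)
    for n, w in events:                 # step 7: sum importances per discrete time
        e[n] += w

    boundaries, n = [], 0
    while n < length:
        if e[n] > threshold:            # step 8: time above the decision threshold
            group = [(n, e[n])]
            t = n + 1                   # step 9: hysteresis (threshold - 1) within alpha
            while t < length and t - group[-1][0] < alpha:
                if e[t] > threshold - 1:
                    group.append((t, e[t]))
                t += 1
            times = np.array([g[0] for g in group], dtype=float)
            weights = np.array([g[1] for g in group])
            boundaries.append(np.sum(times * weights) / np.sum(weights))   # step 10
            n = group[-1][0] + 1        # step 11: continue after the group
        else:
            n += 1
    return boundaries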
Table 4.2 describes the events which can be expected to occur in the power of the DWT subbands. Some of them are more crucial than others. In our previously published work (Ziółko et al., 2006b) only the first of them was used. Additionally, different weights were given to events with respect to the subband in which they occur. It is a perceptually motivated idea which was very successfully used in PLP (Hermansky, 1990). According to that study, information in relatively high and low frequency subbands is not as important for the human ear as information in the bands from 345 Hz to 2756 Hz. Briefly, the Hermansky solution (Hermansky, 1990; Hermansky and Morgan, 1994) used a window to modify speech, attenuating frequencies not crucial for the human ear and amplifying the most important ones. The same aim was followed in our solution by giving low weights to events occurring in detectable, but not the most important, frequencies, and higher ones for the middle of the human hearing bands. Six DWT subbands were used; the third, fourth and fifth were grouped together as the middle and most crucial ones. As a result, four columns with importance values (weights) are presented in Table 4.2 (the first one for the d1 level, the second one for the d2 level, the third for the levels from d3 to d5 and the last one for the d6 level).
Figure 4.5: The event function e(n) versus time in ms for the word presented in Fig. 4.4. High event scores mean that a phoneme boundary is more likely
There are four possible events, presented in Fig. 4.6 and described in Table 4.2. Some of them are quite similar. It has to be stressed that for some discrete times and subbands more than one event can occur (typically two and very rarely more). In this case the weights of all these events are added to the event function e(n). In all cases, the values of the rate-of-change information |r_m(n)| are multiplied by the scaling factor β equal to 1. The first event is called the quasi-crossing point. It is the most general and common one. The mathematical condition for this event detects discrete times for which the power envelope p^{en}_m(n) and the absolute value of the rate-of-change information |r_m(n)| cross or approach each other very closely (within the threshold p). Additionally, the power envelope p^{en}_m(n) has to be higher than the threshold p_min.

The second and third events are twin events and represent rarer cases, namely the crossing of the power envelope p^{en}_m(n) and the absolute value of the rate-of-change |r_m(n)| when p^{en}_m(n) is five times higher than the minimum threshold p_min. This means that the second and third cases are used to detect and note more specific situations than the first one, because typically fulfilling one of these conditions means fulfilling the first one as well. As we sum all event importances for a given n, this will cause a higher value of the event function e(n) than the first event alone. In these cases, one of the functions p^{en}_m(n) and |r_m(n)| starts at a higher level than the other and goes below the level of the second one, suggesting a phoneme boundary very clearly.
Figure 4.6: Simple examples of the four events described in Table 4.2. They are characteristic of phoneme boundaries. The images present the power envelope p^{en}_m(n) and the rate-of-change information (derivative) r_m(n)
The fourth event is also quite rare and covers situations where the DWT spectrum changes very rapidly, which happens at changes of speech content such as phoneme boundaries. In this situation the level of p^{en}_m(n) can be relatively low. We search for the absolute value of the rate-of-change information |r_m(n)| being higher than the power envelope p^{en}_m(n), with p^{en}_m(n) higher than double the minimum threshold. The fourth event is different, because it does not describe anything similar to the crossings used in the general description of the method in the previous section. However, if |r_m(n)| is so high, it also indicates that a phoneme boundary may occur. It is less strict and more general, so a lower weight was given.

The values of the thresholds in the first three events were chosen to make the second and third events more difficult to fulfil than the first one. The threshold in the fourth event type was chosen experimentally.
The method is designed so that it is easy to improve by introducing additional conditions. It is easy to introduce a new condition which will add or subtract (negative events, which imply that a boundary did not occur, are not included in this solution but are generally possible) additional values to e(n) for the discrete times where the new condition is fulfilled. Another aspect of the 'intelligence' of the method is that, even though it consists of several conditions, the sensitivity can easily be changed by setting a different decision threshold. The decision threshold is lowered by 1 for finding the following discrete times (compared to the first one in the group) due to a hysteresis rule. The application of hysteresis to the threshold produces better results.
The algorithm is implemented in the Matlab environment and is not optimised for time efficiency. In its current version it needs 14 minutes to segment the whole corpus using the Haar wavelet (the lowest order of filters) and 20 minutes for the discrete Meyer wavelet (the highest order of filters, namely 50). The corpus has 16425 utterances (some of them sentences), which gives 0.05 s per utterance for the version with the Haar wavelet and 0.07 s for the Meyer one. Properly optimised code in C++ would be much more time efficient. The experiment was conducted on a computer with an AMD Athlon 64 3500+ processor at 990 MHz and 1.00 GB of RAM.
The method was developed on a set of 50 hand segmented Polish words with the sampling frequency f0 = 11025 Hz, equivalent to a sampling period t0 = 90.7 µs. In order to assess the quality of our results, the method was tested on CORPORA. None of the CORPORA utterances were in the original set used during development. The hand segmentation was done by different people for the small development set and for CORPORA.
Figure 4.7: The general scheme of the set G with correct boundaries and the set A with detected ones. Elements of set A have a grade f(x) representing the probability of being a correct boundary. In set G there can be elements which were not detected (in the left part of the set)
4.4 Fuzzy Sets for Recall and Precision
Fuzzy logic is a tool for embedding structured human knowledge into workable algorithms. In a narrow sense, fuzzy logic is considered a logical system aimed at providing a model for modes of human reasoning that are approximate rather than exact. In a wider sense, it is treated as a fuzzy set theory of classes with unsharp boundaries (Kecman, 2001). Fuzzy logic has found many applications in artificial intelligence because it allows numerical and symbolic processing of human-like knowledge. This kind of processing is needed for properly evaluating many types of segmentation. In our case we are interested in the location of speech boundaries (for example phoneme boundaries) (Fig. 4.8). Detected boundaries may be shifted more or less with respect to a manual segmentation. This 'more or less' makes a crucial difference and cannot be described mathematically in Boolean logic. Fuzzy logic introduces an opportunity to grade detected boundary locations in a more sensitive and human-like way.
Our segmentation evaluation method (?) is based on the well-known recall and precision evaluation method. However, in our approach, the calculated boundary locations are elements of a fuzzy set and a binary operation, a T-norm, describes their memberships. A T-norm is defined as a function T : [0, 1] × [0, 1] → [0, 1] which satisfies commutativity, monotonicity and associativity, and for which 1 acts as an identity element. As usual in recall and precision, one set contains the relevant elements and the other is the set of retrieved boundaries. We calculate an evaluation grade using the number of elements in each of them and in their intersection. The comparison of the number of relevant boundaries and the number of elements in the intersection gives precision. In a Boolean version of the evaluation method this is information about how many correct boundaries were found. By using fuzzy logic we evaluate not only how many boundaries were detected, but how accurately they were detected. The comparison of the number of retrieved elements and the intersection gives recall, which grades the wrong detections. In this case fuzzy logic allows us to evaluate not only the
CHAPTER 4. PHONEME SEGMENTATION
75
1
0.5
0
−0.5
−1
0
50
100
150
200
Figure 4.8: The example of phoneme segmentation of a single word. In the lower part hand segmentation is drawn. Boundaries are represented by two indexes close to each other (sometimes
overlapping). Upper columns present the example of segmentation for the word done by a segmentation algorithm. All of calculated boundaries are quite accurate but never perfect
number of wrong detections but also their incorrectness. Each retrieved boundary has a probability
factor which represents being correct information.
4.5 Algorithm of Speech Segmentation Evaluation
In this section we present an example of applying the approach described in the previous section
for phoneme speech segmentation (Fig. 4.8). Due to the described features, such segmentation
and its evaluation are particularly useful in ASR. In this case we have to make three assumptions:
• Hand segmentation (ground truth) is given as a set of narrow ranges. Neighbouring phonemes overlap each other in these ranges.
• Detected boundaries are represented as a set of single indexes.
• We assume the perfect detection of silence. Silence segments may be of almost any length.
Due to this fact including them in evaluation would cause serious inaccuracy. This is why
we skip silence segments in evaluation.
The method proceeds as follows:
1. Assign the first and last detected boundaries the same values as the hand segmented boundaries (typically the first and the last index). This has to be done because of the third assumption.
2. Start by matching the closest detected and hand segmented boundaries. They need to be matched in pairs; each boundary may have only one matched boundary from the other set. Do the following steps for each ith detected boundary, starting from the first.
3. Calculate the grades of being relevant and retrieved. All matched pairs are elements of two sets, of which one is fuzzy. All non-matched detected and hand segmented boundaries are elements of one set only. Let G denote the set of relevant (correct) elements. Let A denote the ordered set containing retrieved (predicted) boundaries. For each segmentation boundary x in A we define a fuzzy membership function f(x) that describes the degree to which x has been accurately segmented. There are three different scenarios for calculating the membership function f(x):
• A hand segmented boundary not matched with any detected boundary is an element of set G only.
• A detected boundary x not matched with any hand segmented boundary is an element of set A and has f(x) = 0. The last detected boundary in Fig. 4.8 is such a case.
• If a detected boundary x is inside the hand segmented boundary range, the boundary is an element of both sets A and G. The other probabilistic factor is Boolean and represents membership of the set of hand segmentation boundaries. We use the algebraic product of these two probabilistic grades as a T-norm to find the membership grade of the intersection. In this situation, where x is inside the hand segmented boundary range, f(x) = 1.

Figure 4.9: Fuzzy membership, equal to 1 when the detected boundary coincides with the phoneme start/end point and falling to 0 at the phoneme midpoint
• Otherwise it is a fuzzy case and f(x) = (a − b)/a, where a stands for half the length of the phoneme in which the detected boundary is situated and b stands for the distance between the hand segmented boundary and the detected one (Fig. 4.9). All boundaries in Fig. 4.8 apart from the last one are examples of this case, which shows how useful fuzzy logic can be in segmentation evaluation.
4. Fuzzy precision can be calculated as
\[ P = \frac{\sum_{x \in A} f(x)}{|G|}. \tag{4.15} \]
5. Fuzzy recall equals
\[ R = \frac{\sum_{x \in A} f(x)}{|A|}. \tag{4.16} \]
Recall and precision can be used to give a single evaluation grade in many different ways, depending on which of them is more important. The widely used way is calculating the f-score (van Rijsbergen, 1979)
\[ F = \frac{(\beta^2 + 1) P R}{\beta^2 P + R}, \tag{4.17} \]
where β is a parameter of the f-score. Often β = 1, that is, precision and recall are given equal weights. Higher β values would favour recall over precision.
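A minimal sketch of equations (4.15)-(4.17) is given below. The function names are illustrative and the membership grades are the rounded values from Table 4.3, so the printed grades only approximately reproduce those quoted in the next section (the original evaluation tool was implemented in C++).

def fuzzy_membership(a, b):
    # Fuzzy case of f(x): a is half the length of the phoneme in which the detected
    # boundary lies, b is its distance to the matched hand segmented boundary.
    return max(0.0, (a - b) / a)

def fuzzy_scores(f_values, n_reference, beta=1.0):
    # f_values: membership grades f(x) of the detected boundaries (set A)
    # n_reference: number of relevant hand segmented boundaries, |G|
    total = sum(f_values)
    precision = total / n_reference                                   # Eq. (4.15)
    recall = total / len(f_values)                                    # Eq. (4.16)
    f_score = ((beta ** 2 + 1) * precision * recall
               / (beta ** 2 * precision + recall))                    # Eq. (4.17)
    return precision, recall, f_score

# Rounded grades from Table 4.3, six hand segmented boundaries, seven detected ones:
print(fuzzy_scores([0.78, 0.93, 0.36, 0.91, 0.95, 0.95, 0.0], 6))
# approximately (0.813, 0.697, 0.751)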
4.6 Comparison to Other Evaluation Methods
Evaluation methods are always subjective and there is no way to grade them statistically. This is why it is difficult to compare evaluation methods and judge which one is better. Because it cannot be proved that our method outperforms the others, we present an example which might explain why we believe so. There is no standard method, but all evaluations are based on insertions and deletions with some tolerances. Let us compare the use of such methods with the fuzzy recall and precision for the example presented in Fig. 4.8. The indexes come from the segmentation method (Ziółko et al., 2006b). One index unit corresponds to 5.8 ms. The very first and last boundaries are not included, due to the assumption that they are perfectly detected. Table 4.3 lists the membership function f(x) for all boundaries. In the lower rows, insertions and deletions with all possible tolerances are marked. The symbol X stands for a boundary counted as a deletion or insertion for a given tolerance, while an empty entry stands for a boundary accepted as correct with a given tolerance. The number of insertions and deletions is given in brackets in the first column.
As we use only a single word, results are the same for many tolerance levels. For larger corpora this does not happen. It is clearly visible that counting insertions and deletions is less accurate, unless one uses tolerance levels with a resolution equal to the index resolution. In particular, using a single tolerance level smooths out information about boundary detections: perfectly accurate detections are graded in the same way as imperfect ones that merely satisfy the tolerance level. Using several tolerance levels improves the quality of evaluation but is still just a step towards a high resolution evaluation method, such as the suggested fuzzy recall and precision. Another issue is the length of phonemes. A method based on tolerances gives a grade without comparing the tolerance with the length of a given phoneme. In other words, our method is better because the membership function f(x) is calculated from the percentage of the phoneme length by which a boundary was missed and not from a constant tolerance value. In the presented example, phoneme lengths vary from 11 (64 ms) to 47 (273 ms). For example, the tolerance of 3 (17 ms) is effectively much higher for the shortest unit than for the longest one. There is no such flaw in our method. The algorithm was implemented in C++. The final grades for the given word are: precision 0.813901, recall 0.697629, f-score 0.751293.
4.7 Experimental Results of DWT Segmentation Method
Our first set of results looks at the usefulness of the six wavelet functions for analysing phoneme
boundaries. The obtained results for different wavelets (see Table 4.4) show the differences in their
efficiency. They suggest that the discrete Meyer wavelet (Fig. 4.2) (Abry, 1997) performs best in this case, probably because of its symmetry in the time domain, which helps in synchronisation of the subbands. Asynchronisation in the time domain can be caused by ripples in the frequency domain. An experiment using two wavelets (Meyer and sym6), one after another, was also conducted. As might be expected, it improved results only a little, while almost doubling the time of calculations. The use of seven subbands was also checked, where the seventh one was from 125 Hz to 62.5 Hz.
The accuracy of our phoneme detection technique was then compared with some standard
framing techniques (see Table 4.5) like constant segmentation methods where the speech is broken
Table 4.3: Comparison of fuzzy recall and precision with commonly used methods based on insertions and deletions for an exemplar word

beg            9      56      89     113     156     196
end           10      58      90     114     158     198
auto          15      59      97     112     159     195     206

fuzzy recall and precision
f(x)        0.78    0.93    0.36    0.91    0.95    0.95       0

insertions and deletions without tolerance
Ins(7)         X       X       X       X       X       X       X
Del(6)         X       X       X       X       X       X

with tolerance from 1 (5.8 ms) to 4 (23.2 ms) - same results
Ins(3)         X               X                               X
Del(2)         X               X

with tolerance 5 (29 ms) or 6 (34.8 ms)
Ins(2)                         X                               X
Del(1)                         X

with tolerance 7 (40.6 ms) or higher
Ins(1)                                                         X
Del(0)         -
Table 4.4: Comparison of proposed method using different wavelets

Method              av. recall   av. precision   f-score
Meyer                   0.7096          0.7408    0.7249
db2                     0.6770          0.7562    0.7144
db6                     0.7029          0.7414    0.7217
db20                    0.7034          0.7408    0.7216
sym6                    0.7015          0.7426    0.7215
haar                    0.6377          0.8042    0.7113
Meyer+sym6              0.6825          0.7936    0.7339
Meyer 7 subbands        0.6449          0.6714    0.6579
Table 4.5: Comparison of some other segmentation strategies and proposed method

Method           av. recall   av. precision   f-score
Const 23.2 ms        0.9651          0.1431    0.2493
Const 92.8 ms        0.7635          0.4659    0.5787
SVM                  0.50            0.33      0.40
Wavelet              0.7096          0.7408    0.7249
into fixed length segments, and with the speech signal being segmented randomly. The accuracy of constant segmentation for many multiples of 5.8 ms (the time length between neighbouring discrete times) was evaluated, but we only present results for 23.2 ms, as it corresponds to the typical length of frames in speech recognition, and for 92.8 ms, for which the result is the best of all constant segmentations. We also trained an SVM using powers and derivatives from DWT subbands. Features for the SVM included the analysed part of the speech signal as well as its left and right context. No other phoneme segmentation method available for comparison was found. While constant segmentation is able to find most of the boundaries with a 23.2 ms frame, this is only at the expense of very short segments and many irrelevant boundaries. The overall score of our method is much superior to the constant segmentation approach.
Several researchers claim that syllables are better basic units for ASR than phonemes (Frankel
et al., 2007). It is probably true in terms of their content, but it seems not to be the same for
detecting unit boundaries. Our method is not perfect but the observed DWT spectra of speech
clearly show that boundaries between phonemes can be extracted. Boundaries between syllables
seem not to differ from phoneme boundaries in observed DWT spectra, while obviously there are
fewer syllable boundaries than phoneme ones. It is therefore difficult to detect syllable boundaries
without also finding phoneme boundaries when analysing DWT spectra.
4.8 Evaluation for Different Types of Phoneme Transitions
Errors in phoneme segmentation depend on what type of transition is being detected. The evaluations differ between groups of phonemes because some phonemes have similar spectra, while others differ a lot. These differences depend on the acoustic properties of phonemes (Kępiński, 2005). The transitions which are more likely to cause errors should be analysed with more care, for example by applying more segmentation methods and considering all their results.
The following types of phonemes exist in Polish (Kępiński, 2005):
1. Stops (/p/, /b/, /t/, /d/, /k/, /g/)
2. Nasal consonants (/m/, /n/, /ni/, /N/)
3. Mouth vowels (/i/, /y/, /e/, /a/, /o/, /u/)
4. Nasal vowels (/ę/, /ą/)
5. Palatal consonants (Polish 'Glajdy') (/j/, /ł/)
6. Unstables (Polish 'Płynne') (/l/, /r/)
7. Fricatives (/w/, /f/, /h/, /z/, /s/, /zi/, /si/, /rz/, /sz/)
8. Closed fricatives (/dz/, /c/, /dzi/, /ci/, /drz/, /cz/)
9. Silence at the beginnings and ends of recordings
10. Silence inside words
Figure 4.10: F-score of phoneme boundaries detection for transitions between several types of phonemes. Phoneme types 1-10 are explained in section 4.8 (1 - stops, 2 - nasal consonants, etc.).
This division is based on the acoustic properties of phonemes. We do not have enough statistical data to calculate results for transitions between all 39 phonemes individually. It can be assumed that transitions between phonemes of two particular groups face similar problems due to co-articulation and other natural phonetic phenomena. Tables 4.6, 4.7, 4.8 and Fig. 4.10 present the evaluation of phoneme segmentation with regard to transitions between the types listed above. The value 0 means that there was no transition of this type.
Table 4.6: Recall for different types of phoneme transitions (rows: transition from a given type, columns: transition to a given type).

from \ to       1       2       3       4       5       6       7       8       9      10
    1      0.7204  0.6101  0.5114  0.5776  0.5818  0.5007  0.5877  0.6456  0.5210  0.4194
    2      0.6015  0.5555  0.4686  0.5812  0.5474  0.5087  0.5817  0.6062  0.5658  0.2129
    3      0.4886  0.4493  0.5069  0.0821  0.4605  0.3776  0.4218  0.5872  0.4741  0.3712
    4      0.5089  0.4816  0.5384  0       0.4215  0.4388  0.4380  0.5015  0.3692  0.2155
    5      0.6403  0.5790  0.4534  0.5362  0.5942  0.5520  0.5829  0.6072  0.5563  0.0702
    6      0.5624  0.5445  0.4690  0.5553  0.5428  0.4768  0.5781  0.5558  0.5885  0.2630
    7      0.6148  0.5320  0.4389  0.5299  0.4641  0.4708  0.5203  0.5911  0.5784  0.4661
    8      0.6216  0.5593  0.4771  0.5424  0.4281  0.5288  0.5372  0.6387  0.5169  0.1388
    9      1.0000  1.0000  1.0000  1.0000  1.0000  1.0000  1.0000  1.0000  0       0.0227
   10      0.0399  0.1399  0.4180  0       0.0335  0.0643  0.0835  0.0289  0       0
All silences before speech are marked as perfectly detected due to the evaluation algorithm. Apart from that, silences were not detected very well. The reason is that the segmentation method is tuned to phoneme boundaries and not to speech-silence transitions. There are other, very efficient methods already established for this task (Zheng and Yan, 2004).
Table 4.7: Precision for different types of phoneme transitions (rows: transition from a given type, columns: transition to a given type).

from \ to       1       2       3       4       5       6       7       8       9      10
    1      0.6927  0.5788  0.4783  0.5299  0.5465  0.4741  0.5599  0.6094  0.3115  0.4108
    2      0.5523  0.4858  0.4021  0.4996  0.4952  0.4783  0.5375  0.5569  0.3928  0.2129
    3      0.4171  0.3692  0.4433  0.0771  0.3963  0.3033  0.3470  0.5207  0.2899  0.3423
    4      0.4199  0.4124  0.4735  0       0.3789  0.4073  0.3405  0.4222  0.1987  0.1826
    5      0.5943  0.5465  0.3731  0.4688  0.5554  0.5252  0.5488  0.5443  0.3811  0.0645
    6      0.4838  0.4976  0.3987  0.4811  0.4811  0.4271  0.5303  0.5174  0.4203  0.2630
    7      0.5762  0.4875  0.3732  0.4835  0.4150  0.4208  0.4798  0.5324  0.4158  0.4452
    8      0.5573  0.4938  0.4154  0.4926  0.3511  0.4869  0.4809  0.5692  0.3209  0.1333
    9      1.0000  1.0000  1.0000  1.0000  1.0000  1.0000  1.0000  1.0000  0       0.0227
   10      0.0365  0.1399  0.4035  0       0.0310  0.0620  0.0835  0.0289  0       0
Table 4.8: F-score for different types of phoneme transitions (rows: transition from a given type, columns: transition to a given type). The scores above 0.5 were bolded.

from \ to       1       2       3       4       5       6       7       8       9      10
    1      0.7063  0.5940  0.4943  0.5528  0.5636  0.4870  0.5734  0.6270  0.3899  0.4150
    2      0.5759  0.5183  0.4328  0.5373  0.5200  0.4931  0.5587  0.5805  0.4637  0.2129
    3      0.4500  0.4053  0.4730  0.0795  0.4260  0.3364  0.3807  0.5519  0.3598  0.3562
    4      0.4601  0.4443  0.5038  0       0.3991  0.4225  0.3831  0.4584  0.2583  0.1977
    5      0.6164  0.5623  0.4093  0.5002  0.5742  0.5383  0.5654  0.5740  0.4523  0.0672
    6      0.5202  0.5200  0.4310  0.5155  0.5101  0.4506  0.5532  0.5359  0.4904  0.2630
    7      0.5949  0.5088  0.4034  0.5056  0.4382  0.4444  0.4992  0.5602  0.4838  0.4555
    8      0.5877  0.5245  0.4441  0.5163  0.3858  0.5070  0.5075  0.6019  0.3960  0.1360
    9      1.0000  1.0000  1.0000  1.0000  1.0000  1.0000  1.0000  1.0000  0       0.0227
   10      0.0382  0.1399  0.4106  0       0.0322  0.0632  0.0835  0.0289  0       0
DWT was also tested for the parametrisation of speech (Farooq and Datta, 2004). The unvoiced stops (/p/, /t/, /k/) were found more difficult to recognise than vowels (/aa/, /ax/, /iy/) and unvoiced fricatives. In our case stops did not cause serious problems for locating them correctly. Actually, the highest f-score (0.7063) was obtained for boundaries between two stops and the second grade (0.6270) between stops and closed fricatives. Transitions from palatal consonants to stops were also evaluated highly (0.6164). Transitions between two closed fricatives were another group which was easy to detect (0.6019).
The most difficult to detect were transitions from mouth vowels to any type apart from closed fricatives, especially to nasal vowels (0.0795), unstables (0.3364) and fricatives (0.3807). Transitions to mouth vowels were also difficult to locate correctly. The only exception was from nasal vowels to mouth vowels (0.5038), which is surprisingly large compared to 0.0795 for the transition in the other direction. Another group of boundaries with low F-scores were transitions from nasal vowels, apart from the mentioned transition to mouth vowels. Especially difficult were transitions to fricatives (0.3831) and palatal consonants (0.3991). There are no transitions from one nasal vowel into another one. The transitions from closed fricatives to palatal consonants, from unstables to unstables, and from fricatives to palatal consonants, unstables and other fricatives were also difficult to detect properly.
According to our results it is relatively easy to find a boundary between phonemes of the
same group, if such a transition is possible. The F-score for such boundaries is usually above 0.5. This is slightly surprising and counterintuitive, because phonemes of the same group typically have similar spectra and it could be expected to be difficult to differentiate them.
Tables 4.6, 4.7 and 4.8 are not symmetric. This is not very surprising, because phoneme spectra are not symmetric either; their starts and ends can vary significantly. This is why it might be easier to locate the beginning of a particular phoneme than its end.
The gained statistical knowledge can improve the quality of segmentation. In the case of large vocabulary continuous speech recognition, the recognition follows the segmentation. If a phoneme which is known to cause segmentation errors is detected, its boundaries can be re-evaluated by another, more sophisticated, or simply different method. Then another segmentation decision can be taken, leading to a better final recognition.
4.9 LogitBoost WEKA Classifier Speech Segmentation
WEKA is a graphical data mining and machine learning package providing many classifiers. The procedure called 'boosting' is an important classification methodology. The WEKA LogitBoost classifier is based on the well known AdaBoost procedure (Friedman et al., 1999). The AdaBoost procedure trains classifiers on weighted versions of the training samples, giving higher weights to those which are misclassified. This is repeated for a sequence of weighted samples, and afterwards the final classifier is defined as a linear combination of the classifiers from each stage. Logistic boost (Friedman et al., 1999) uses an adaptive Newton algorithm to fit an additive multiple logistic regression model, so it calls a classifier repeatedly in series. A distribution of weights is updated each time; in this way it indicates the importance of examples in the data set for the classification. The main point of being adaptive is that, on each round, the weights of incorrectly classified examples are increased, so the new classifier focuses more on those examples. Logistic regression fits data to a logistic curve to predict the probability of occurrence of an event.
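The adaptive reweighting idea can be sketched as below. This is a generic AdaBoost-style update, not the exact Newton step used by the WEKA LogitBoost implementation, and the function name is only illustrative.

import numpy as np

def reweight_samples(predictions, labels, weights):
    # One boosting round: increase the weights of misclassified samples so that
    # the next weak classifier focuses on them.
    misclassified = predictions != labels
    error = np.sum(weights[misclassified]) / np.sum(weights)
    alpha = 0.5 * np.log((1.0 - error) / max(error, 1e-12))   # weight of this weak classifier
    weights = weights * np.exp(alpha * misclassified)          # only misclassified weights grow
    return weights / weights.sum(), alpha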
There were many more non-boundary points in feature space than points which really represent boundaries. This is why we cloned all sets of features representing phoneme boundaries 30 times, to keep a similar ratio of boundaries and non-boundaries. We used 70% of all feature points as training data and 30% for testing in every experiment.
4.10 Experimental Results for LogitBoost
Seven different sets of features for the same classifier and the same test data were tested to check which features are useful. The differences between the following sets are described. The classification was evaluated using the popular precision and recall measures (van Rijsbergen, 1979), which are presented in tables, and by the percentage of properly classified instances, which is given in the text for all cases. Two evaluations are provided for every set of features to help in grading the method, because we did not manage to find any other similar system to present as a baseline.
We started with one left and one right context subset of features to describe the surrounding part of the signal. We included first and second derivatives, and both of them were smoothed. Different subbands were smoothed using different windows (see Tab. 4.1). We found this method to be the most efficient in our previous experiments (Ziółko et al., 2006a). That gives 54 features in total. 64% of test instances were correctly classified. The more exact results using the recall and precision evaluation are presented in Tab. 4.9. The final measure is the f-score, presented separately for sets of features describing frames with boundaries and without. The second group is named in Tab. 4.9 as phonemes, as these are segments from inside phonemes, far from boundaries. From a practical point of view we are interested in detecting boundaries, so the evaluation of the classification of these frames is crucial. So for the first set of features the most important grade is the f-score of 0.45 (Tab. 4.9).
Table 4.9: Experimental results for the LogitBoost classifier. The rows labelled boundary are for classifying segments representing boundaries. The rows labelled phoneme present grades for classifying segments inside phonemes, which are not boundaries. From a practical point of view the boundary labels are important; the grades for phoneme labels are just for reference.

set of features                                 label      precision   recall   f-score
Basic                                           boundary       0.583    0.366     0.45
                                                phoneme        0.659    0.824     0.732
Without smoothing the second derivative         boundary       0.588    0.386     0.466
                                                phoneme        0.665    0.818     0.733
Normalisation by whole energy value             boundary       0.551    0.077     0.135
                                                phoneme        0.607    0.958     0.743
By max in a subband for a given utterance       boundary       0.59     0.317     0.413
                                                phoneme        0.649    0.851     0.737
With wider context                              boundary       0.618    0.447     0.519
                                                phoneme        0.682    0.811     0.741
Even wider context but without 2nd derivative   boundary       0.699    0.162     0.263
                                                phoneme        0.703    0.966     0.814
Asymmetric context                              boundary       0.609    0.2       0.302
                                                phoneme        0.712    0.939     0.81
We managed to improve results slightly by leaving the second derivative unsmoothed. There were no other changes in the set of features. 64% of test instances were correctly classified, as for the previous set of features, but the more exact evaluation presented in Tab. 4.9 indicates some improvement through a higher f-score, namely 0.466.
In the next approach, we kept the same number and type of features but the subband features were normalised by dividing by the energy. In that way, 60.384% of test instances were correctly classified, with an f-score of only 0.135 (Tab. 4.9).
We also tried another normalising approach, dividing features by the maximum in a given subband for the analysed utterance. 63.6347% of test instances were correctly classified, but the f-score is also quite low, namely 0.413 (Tab. 4.9). Surprisingly, none of the normalisation methods improved results.
Finally, we experimented with a wider left and right context. We added more subsets of features for the signal around the analysed frame. We got 66% of test instances correctly classified by including two contexts to the left and two to the right. In that case we had a set of 90 features with a relatively high f-score of 0.519 (Tab. 4.9).
To use a wider context, namely three to the left and three to the right, we had to skip the second derivative, because the number of features was too large to be handled by WEKA. In that way we had a set of 84 features. 70% of test instances were correctly classified, but recall for boundary frames was very low, just 0.162, which caused the f-score to be only 0.263 (Tab. 4.9). This means that, generally, this set of features is not effective.
The three-to-the-left and one-to-the-right context was also checked. In that experiment we used the second derivatives, so we had 90 features. We received a correctness of 70% but the f-score for boundaries was again quite low, only 0.302 (Tab. 4.9).
4.11 Conclusion
ASR systems could be improved if an efficient phoneme segmentation method was found. Innovative segmentation software was designed and implemented in Matlab. An f-score of 0.72 was achieved for the phoneme segmentation task by analysing envelopes of discrete Meyer wavelet subband powers and their derivatives. It is a very good result compared to 0.4 for the SVM, 0.58 for constant segmentation and 0.46 for the LogitBoost WEKA classifier. DWT is a good tool to analyse speech and extract segments for further analysis. It achieves better results than all baselines, including the WEKA machine learning LogitBoost classifier, for which several sets of features were tested and compared. The segmentation evaluation was also analysed and some flaws of typical approaches were identified. It was suggested that segmentation evaluation could be improved by the application of fuzzy logic.
Segmentation is a subfield of speech analysis which has not been investigated enough in ASR. Our solution showed a new direction of possible improvements in ASR for any language. Segmentation allows modelling to be more precise. Systems based on framing and HMMs miss some of the information in the speech which could be used in recognition if efficient phoneme segmentation was done first. This information, once lost, cannot be recovered in the further steps, which results in worse efficiency of the whole system.
There are types of phoneme transitions which are more difficult to detect than others. The average F-score for our segmentation method based on DWT varies from 0.0795 to 0.7063 for transitions between different acoustic types of phonemes. The experiments support a hypothesis that, in general, it is more difficult to locate boundaries of vowels than of other phonemes. One of the reasons can be that vowel spectra are often less distinctive than others. Another reason might be that vowels are relatively short compared to other types of phonemes.
DWT is one of the more perceptually motivated analysis tools. It enables the extraction of subbands important to the human ear. It outperforms SFT because the size of the DWT window changes depending on the frequency subband, as presented in Fig. 4.1. In SFT low and high frequencies are analysed with the same resolution. This is not efficient, because a relatively short frame is needed for analysing high frequencies and it has to be proportionally longer for low frequencies. DWT modifies these lengths automatically, while in the case of SFT it is necessary to calculate a mel-frequency based cepstrum rather than a regular spectrum from the FFT.
Chapter 5
Language Models
Language modelling is a weak point of ASR. Most of the time n-grams are still the most efficient models. Even though they are such a simple solution, it is difficult to train any better model because of data sparsity. Several experiments were conducted on n-best lists of hypotheses received from the HTK acoustic model to re-rank the lists and improve recognition. The POS tagger model was presented in (Ziółko et al., 2008a) and the first results using a semantic model in (Ziółko et al., 2008b).
So far the most popular and often most effective language model is the n-gram model (2.6) described in the literature review chapter. The n-gram is very simple in its nature, because it counts possible sequences of words and uses them to provide probabilities. It is quite unusual that there is no more sophisticated method which would perform better than n-grams by applying more complicated methods and calculations.
We did not find any published papers on applying POS tagging in ASR. This is why we decided to check if it can be successfully used in language modelling instead of n-grams. It was quite a promising idea, as the grammar structure of sentences can be described using POS tags, while they provide a much smaller set of elements in a model because several words can be modelled by the same POS tag. One of the problems very often experienced while using n-grams is a lack of data for training because of too many possible words. The situation is even worse in inflective languages, as was described for the example of Russian (Whittaker and Woodland, 2003), where the authors claim that 430,000 words are needed for Russian to provide the same vocabulary coverage as 65,000 for English. A similar situation can be expected for all inflective languages.
Language models can be based on the order of words in sentences, like n-grams, where words are processed as a sequence. Another approach is to process words as a set, where the order is lost. This approach is often called bag-of-words, because we can imagine taking an ordered sequence of words, putting them in a bag and shaking it. This is a visualisation of modelling methods like LSA. In most cases it is used to capture semantic knowledge. In the case of inflective languages the order is not crucial, so losing the information about the order is not very destructive to the method, while it allows one to reduce the amount of data necessary for training.
This chapter describes the language modelling part of the research. The methods presented
here are designed for inflective languages and tested on Polish but some of them could be applied
to any other language as well. The first model is based on a probabilistic POS tagger. This approach was unsuccessful, but we present it to document the experiment and discuss why we believe it reduced recognition accuracy. Then most of the chapter focuses on a bag-of-words model designed by the candidate. The model has some similarities to LSA in its general concept but differs a lot in realisation, allowing calculations on much more data than LSA.
5.1 POS Tagging
POS tagging (Brill, 1995) is the process of marking up words as corresponding to a particular part of speech, based on both their definition and their context, i.e. their relationship with other words in a phrase, sentence, or paragraph (Brill, 1994; Cozens, 1998). POS tagging is
more than providing a list of words with their parts of speech, because many words represent
more than one part of speech at different times. The first major corpus of English for computer
analysis was the Brown Corpus (Kucera and Francis, 1967). It consists of about 1,000,000 words,
made up of 500 samples from randomly chosen publications. In the mid 1980s, researchers in
Europe began to use HMMs to disambiguate parts of speech when working to tag the Lancaster-Oslo-Bergen Corpus (Johansson et al., 1978). HMMs involve counting cases and making a table
of the probabilities of certain sequences. For example, once an article has been recognised, the next word is a noun with a probability of 40%, an adjective with 40%, and a number with 20%. Markov models are a common method for assigning POS tags. The methods already discussed
involve operations on a pre-existing corpus to find tag probabilities. Unsupervised tagging is also
possible by bootstrapping. Those techniques use an untagged corpus for their training data and
produce the tagset by induction. That is, they observe patterns in word structures, and provide
POS types. These two categories can be further subdivided into rule-based, stochastic, and neural
approaches. Some current major algorithms for POS tagging include the Viterbi algorithm (Viterbi,
1967; Forney, 1973), the Brill tagger (Brill, 1995), and the Baum-Welch algorithm (L. E. Baum
and Weiss, 1970) (also known as the forward-backward algorithm). The HMM and visible Markov
model taggers can both be implemented using the Viterbi algorithm.
POS tagging of Polish was started by the governmental research institute IPI PAN. They created a relatively large corpus which is partly hand tagged and partly automatically tagged (Przepiórkowski, 2004; A. Przepiórkowski, 2006; Dębowski, 2003; Przepiórkowski and Woliński, 2003). The tagging was later improved by Piasecki, who focused on hand-written and automatically acquired rules rather than trigrams (Piasecki, 2006). The best and latest version of the tagger has an accuracy of 93.44%, which is not much compared to other languages. It might be one of the reasons for the outcome of our experiment.
5.2 Applying POS Taggers for Language Modelling in Speech Recognition
There is very little interest in using POS tags in ASR, so their usefulness was investigated here. POS tag trigrams, a matrix grading possible neighbourhoods, or a probabilistic tagger can be created and used to predict a word being recognised, based on the left context analysed by a tagger. It is very difficult to provide the tree structures necessary for context-free grammars which would represent all possible sentences in the case of Polish, as the order of words can vary significantly. Some POS tags are much more probable in the context of some others, which can be used in language modelling.
Experiments on applying morphological information to ASR of the Polish language were undertaken using the best available POS tagger for Polish (Piasecki, 2006; Przepiórkowski, 2004). The results were unsatisfactory, probably because of high ambiguity. An average word in Polish has two POS tags, which gives too many possible combinations for a sentence. Briefly speaking, applying POS tagging to the modelling of Polish is a process of guessing based on uncertain information.
HTK (Young, 1996; Young et al., 2005) was used to provide 10-best lists of acoustic hypotheses for sentences from CORPORA. The hypotheses were constructed as any combinations of any words from the corpus and are provided as ordered lists of words. The acoustic model was trained in a way which allowed all possible combinations of all words in the dictionary, to produce more variation and to give the language model an opportunity to improve recognition. Then probabilities of those hypotheses were calculated using the POS tagger (Piasecki, 2006). The acoustic model can easily be combined with language models using Bayes' rule, by multiplying both probabilities (2.5).
5.3 Experimental Results of Applying POS Tags in ASR
Trigrams of tags were calculated using transcriptions of spoken language and existing tagging
tools. Results were saved in XML. We received significant help from Dr Maciej Piasecki and his
group from the Technical University of Wrocław in this step of research.
The results were compared giving different weights to the probabilities from the HTK acoustic model and the POS tagger language model. In all situations, the combined probability gave worse results than the pure HTK acoustic model. Histograms of probabilities for correct and wrong recognitions were also calculated and they showed no useful correlation. Some example sentences were also analysed and described by a human supervisor. They are presented in Table 5.1.
In total 331 occurrences were analysed. Only 282 of them had a correct recognition anywhere in the 10-best list. The average HTK probability of correct sentences was 0.1105. Exactly 244 of all occurrences had a correct hypothesis in the first position of the 10-best list, so 73.72% of occurrences were correctly recognised using only the HTK acoustic model. Only 53 occurrences were recognised when probabilities from the POS tagger were applied, even when the HTK probabilities were made 4 times more important than those from the POS tagger. The weight was applied by raising the HTK probability to the power of 4. This gives 16.01% of correct recognitions for a model with POS tag probabilities, which is a very disappointing result.
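The combination used in this experiment can be illustrated with the sketch below: each hypothesis carries an HTK acoustic probability and a POS tagger probability, and the acoustic probability is raised to the power of 4 before multiplying, as described above. The data layout, the function name and the numbers in the usage example are illustrative assumptions.

def rerank_with_tagger(hypotheses, acoustic_power=4.0):
    # hypotheses: list of (sentence, p_htk, p_tagger) tuples from an n-best list
    scored = [(sentence, (p_htk ** acoustic_power) * p_tagger)
              for sentence, p_htk, p_tagger in hypotheses]
    return max(scored, key=lambda item: item[1])[0]

# Hypothetical 3-best list: the tagger prefers a wrong hypothesis, but the strongly
# weighted acoustic probability keeps the correct one on top.
print(rerank_with_tagger([("correct sentence", 0.20, 0.10),
                          ("wrong sentence 1", 0.15, 0.30),
                          ("wrong sentence 2", 0.05, 0.40)]))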
The POS tagger was trained on a different corpus than the one used in the experiment described above. This is why we decided to conduct an additional experiment. We recorded 11 sentences from the POS tagger training corpus. They were recognised by HTK, providing a 10-best list, and used in an experiment similar to the one described above. The amount of data is not enough to provide statistical results, but observations on the exact sentences (Table 5.3) lead to the same
Table 5.1: Results of applying the POS tagger to language modelling. First, a sentence in Polish is given, then the position of the correct recognition in the 10-best list. A description of the tagger grade for the correct recognition follows.
Lubić czardaszowy pla̧s
1, Tagger grade is very low.
Cudzy brzuch i buzia w drzewie
4, Tagger grade is higher than for wrong recognitions.
W ża̧dzy zejdȩ z gwoździa
There is no correct sentence in the 10 best list.
Krociowych sum nie żal mi
1, Tagger grade is higher than or similar to the other recognitions in the top 6, but lower than the 7th.
Móc czuć każdy odczynnik
6, Tagger grade is lower than for most of the wrong recognitions including first two hypotheses.
However, the wrong recognition with highest probability is grammatically correct.
On łom kładzie lampy i kołpak
7, Tagger grade is low.
Rybactwo smutnieje on siȩ śmieje
There is no correct sentence in the 10 best list.
On liczne taśmy w cuglach da
2, Tagger grade is low, but still highest in the first 5 hypotheses.
Ten chór dusiłem licznie
There is no correct sentence in the 10 best list.
Chciałbym wpaść nas sesjȩ
There is no correct sentence in the 10 best list.
Żółtko wlazło i co zrobić
There is no correct sentence in the 10 best list.
Wór rur żelaznych ważył
3, Tagger grade is lower than for the sentence on the first position.
U nas ludzie zwa̧ to fuchy
There is no correct sentence in the 10 best list.
On myje wróble w zoo
There is no correct sentence in the 10 best list.
Boś cały w wiśniowym soku
3, Tagger grade is higher than for the top 7 hypotheses.
Na czczo chleby i pyry z dżemem
There is no correct sentence in the 10 best list.
Lech być podlejszym chce
1, Tagger grade is the lowest in the top 5 hypotheses, but most of them are grammatically correct.
Table 5.2: Results of applying the POS tagger to language modelling. First, a sentence in Polish is given, then the position of the correct recognition in the 10-best list. A description of the tagger grade for the correct recognition follows (2nd part).
Żre jeż zioła jak dżem John
1, Tagger grade is higher than for top 4 hypotheses.
Masz dzisiaj różyczkȩ zielona̧
1, Tagger grade is lower than for the second hypothesis, which makes no sense but is morphologically correct.
Weź daj im soli drogi dyzmo
2, Tagger grade is very close to the most probable hypothesis, which is also grammatically correct.
Weź masz ramki opolskie
1, Tagger grade is higher than for the second hypothesis but lower than for the third one.
Dźgna̧ł nas cicho pod zamkiem
1, Tagger grade is highest of all.
Tam śpi wojsko z bronia̧
6, Tagger grade is the second highest of all; the highest one is acoustically 5th.
Nie odchodź bo żona idzie
3, Tagger grade is the highest but equal to three others, which have lower acoustic probability.
Tym można atakować
5, Tagger grade is higher than for the acoustically most probable sentence but lower than for all others between 1 and 5; however, all of them are grammatically correct.
Zmyślny kot psotny ujdzie
1, Tagger grade is higher than for the second and third hypotheses.
Niech pan sunie na wschód
4, Tagger grade is higher than for the 7 acoustically most probable hypotheses.
conclusion as in the main experiment. The recognitions found using HTK only had fewer errors for 6 sentences, and for 5 sentences the number of errors was the same. One sentence was correctly recognised by both models, and one more was correctly recognised using just the HTK acoustic model.
5.4 Bag-of-words Modelling
A new method of language modelling for ASR is presented (Ziółko et al., 2008b). The method has some similarities to LSA, but it does not need as much memory and gave better experimental results, which are provided as the percentage of correctly recognised sentences from a corpus. The main difference is the choice of similar topics influencing a matrix describing the probability of words appearing in topics.
Recently, graph based methods (Harary, 1969; Véronis, 2004; Agirre et al., 2006) have become more and more popular. In the case of our algorithm, graphs are used instead of applying SVD in order to smooth information between different topics. Graphs help us to locate and grade similar topics. An important advantage of our method is that it does not need much memory at once to process
Table 5.3: Results of applying the POS tagger on its training corpus. The first version of each sentence is the correct one, the second is the recognition using just HTK and the third one uses HTK and POS tagging. The number of differences compared to the correct sentence was counted and summarised.
i do licha coście mi wczoraj dali takiego że teraz ledwo wiem jak siȩ nazywam
i do i w coście mi wczoraj dali takiego że teraz ledwo wiem nie siȩ nazywam
i do i w coście w wczoraj dali takiego że teraz ledwo wiem nie siȩ nazywam
htk is better
nie mówia̧c o tym kim ja jestem
skinȩła głowa̧ zawstydzona
nie w wiem nocy nocy nie jestem
skinȩła głowa̧ zawstydzona
nie w wiem nocy nocy nie jestem
skinȩła bo w w w zawstydzona
same number of errors
htk is better
to okropne obudzić siȩ po nocy spȩdzonej z kimś czyjego imienia siȩ nie pamiȩta
to okropne obudzić siȩ minut spȩdzonej z kimś czyjego imienia ciȩ nie pamiȩta
to okropne obudzić w nocy spȩdzonej w kimś czyjego imienia ciȩ nie pamiȩta
htk is better
parȩ minut temu nie pamiȩtałam nawet że jestem w innym świecie
parȩ minut temu nie pamiȩta nawet jestem innym świecie
parȩ minut temu nie pamiȩta nawet w jestem innym świecie
same number of errors
poleż teraz spokojnie
zasłoniȩ okno bo widzȩ że światło ciȩ razi
poleż teraz spokojnie
zasłoniȩ o okno bo widzȩ ciȩ światło ciȩ razi
poleż z teraz spokojnie
zasłoniȩ o okno bo widzȩ ciȩ światło ciȩ razi
htk is better
same number of errors
zobaczysz wszystko bȩdzie dobrze pamiȩtasz że opuściła sanktuarium
zobaczysz wszystko bȩdzie dobrze pamiȩta że opuściła sanktuarium
zobaczysz wszystko bȩdzie dobrze pamiȩta że opuściła sanktuarium w
same number of errors
htk is better
o tak pamiȩtała wszystko powróciło z pełna̧ wyrazistościa̧
o tak pamiȩta wszystko powróciło pełna̧ wyrazistościa̧
o tak w pamiȩta wszystko powróciło pełna̧ wyrazistościa̧
htk is better
w końcu tyle razy o tym myślała i wcia̧ż nie mogła poja̧ć jak do tego doszło
końcu ciȩ teraz nocy myślała w wcia̧ż nie tego tym ciȩ bȩdzie do doszło
końcu ciȩ teraz nocy myślała wcia̧ż nie tego świecie bȩdzie do doszło
same number of errors
Figure 5.1: Histogram of POS tagger probabilities for hypotheses which are correct recognitions
any amount of data. This is in contrast to LSA, which is quite limited in real applications for this reason. In LSA, SVD is conducted on the entire matrix, which means that a model with a few thousand words and a few hundred topics might be a challenge for the memory of a regular PC. Our method does not need to do operations on the entire matrix. There are other approaches to this issue, such as applying the generalised Hebbian algorithm (Gorrell and Webb, 2005).
The main aspect of modelling in our method is based on semantic analysis, which is an important innovation in ASR, as the very last step of the process. It can be applied as an additional measure to promote non-first-choice word recognition hypotheses when the first choice does not fit the semantic context. However, the method extracts some syntax information as well. It was designed for Polish, which is highly inflective and not a positional language. For this reason only particular endings can occur in the context of the endings of other part-of-speech elements of a sentence. For example, we can expect feminine adjectives with feminine nouns. In the same way, in English we can expect I in the same sentence as am, and you in the same sentence as are, etc. In Polish all verbs have this kind of inflection; however, usually the differences between forms are only in the endings, unlike to be in English.
Figure 5.2: Histogram of POS tagger probabilities for hypotheses which are wrong recognitions
Figure 5.3: Ratio of correct recognitions to all for different probabilities from the POS tagger
5.5 Experimental Setup
Semantic analysis might be much more crucial in non-positional languages than in English, due to
irregularities in position structures of words. Language models, based on context free grammars,
are quite unsuccessful for non-positional languages. Research about applying LSA in ASR has
been done (Bellegarda, 1997) for English only.
HTK (Young, 1996; Young et al., 2005) was used to provide 100-best lists of acoustic hypotheses for sentences from the test corpora. The MFCCs (Davis and Mermelstein, 1980; Young,
1996) were calculated for parametrisation with a standard set of 39 features. 37 different phonemes
were distinguished using a phonetic transcription provided with CORPORA. Several experiments
were conducted to evaluate the method. The first one was very simple and intended to give a general view only. The audio model was trained on the male speakers of CORPORA (Grocholewski, 1995). The corpus was organised as follows: all single letters are combined in one topic, all digits in another, and names and commands separately in two more. Every sentence is also treated as a topic. In this way 118 topics are provided. They consist of 659 different words in total. In the preliminary experiment we used as a testing set 114 simple sentences spoken by a male speaker not included in the training set. All other utterances are obviously too short to use in language modelling.
In the following experiments HTK was also used to provide 100-best lists. The main difference was the division between training and testing corpora. Training data was collected from the internet and e-books from several sources described later in detail. Testing sentences were created by the author and recorded on a desktop PC with a regular microphone and some, but very little, background noise.
5.6 Training Algorithm
The entire algorithm is illustrated on a simple English example in one of the following sections.
Several versions of the algorithm were applied and tested. Some of the differences are presented
in the following sections with experimental results. Here, we describe the final version which
performs in the best way. The training algorithm starts with creating the matrix
\[ S = [s_{ik}], \tag{5.1} \]
representing semantic relations, where rows i = 1, ..., I represent topics and columns k = 1, ..., K represent words. Each matrix value s_{ik} is the number of times word k occurs in topic i. Some words are so common that they appear in almost all topics. Appearances of these words carry little semantic information, by an entropy argument. The words which appear only in certain topics say more about semantic content. This is why all values of (5.1) are divided by the sum for the given word over all topics, to normalise them. In this way the importance of commonly appearing words is reduced for each topic. A measure of similarity between two topics is
\[ d_{ij} = \sum_{k=1}^{K} s_{ik} s_{jk}. \tag{5.2} \]
Figure 5.4: Undirected, complete graph illustrating similarities between sentences
It has to be normalised according to the formula
\[ d'_{ij} = d_{ij} / \max_{i,j} \{ d_{ij} \}. \tag{5.3} \]
As a result, values 0 ≤ d'_{ij} ≤ 1 are obtained.
These topic similarities are analysed as follows:
1. Create an undirected, complete graph (Fig. 5.4) with topics as nodes and d'_{ij} as the weights of edges. Let us define the path weight
\[ p_{ij} = \prod_{(a,b) \in P(i,j)} d'_{ab}, \tag{5.4} \]
where P(i, j) is the sequence of edges in the path from i to j. In the simplest case of a single edge from i to j, the path weight is d'_{ij}. In the case of a multiple-edge path, it is the product of the similarities of all edges on the path (5.4). If there are several paths, we always take the path with the largest similarity for the path weight (5.4).
2. For each node, we need to find n nodes with highest path weights between the nodes and the
given, analysed topic node. It will allow us to define a list N of semantically related topics
which consists of the n nodes with their measures. The exact implementation of this part is
presented in the next section.
3. The matrix S has to be recalculated to include the impact of similar topics. Smoothed word-topic relations are expressed by the matrix
\[ S' = [s'_{ik}]. \tag{5.5} \]
For all topics in matrix (5.1), we add the values of the topics from the list of related topics, multiplied by the measure for the given pair of topics. The elements of S' are
\[ s'_{ik} = s_{ik} + \alpha^{-1} \sum_{j \in N} p_{ij} s_{jk}. \tag{5.6} \]
The coefficient α is a smoothing factor which controls the weight given to the influence of other topics on the matrix S'. N is the list of similar topics found in step 2. The matrix element (5.6) is a
measure of likelihood that kth word appears in ith topic.
Matrix (5.5) stores counts of words present in particular topics. They can be represented as
\[ C(\mathrm{word}_k, s_i) = c. \tag{5.7} \]
We should not assume that there can be a zero probability of any word appearing in any topic. This is why we replace all zeros in (5.5) with the small value s'_{min} = 0.01. If (5.7) were normalised to have values between 0 and 1, it would be probabilistic information of the type
\[ P(\mathrm{word}_k \mid s_i) = p. \tag{5.8} \]
The sum of the values in (5.5) is not equal to 1, which is why the values (5.7) are not probabilities according to the definition. However, (5.7) satisfies all other conditions of a probability and can often be treated as if it were (5.8). For this reason, the sum of all values in (5.5) is calculated and then every value in (5.5) is divided by it. In this way the s'_{ik} become probabilities, as their sum is equal to 1. In the further sections we will assume that (5.8), rather than (5.7), is stored in (5.5) and that the s'_{ik} are probabilities.
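A minimal sketch of the training step, following equations (5.1)-(5.6), is given below. The data layout and names are assumptions made for illustration; the list N of similar topics is passed in as an argument because its construction is described separately in the next section.

import numpy as np

def train_word_topic_matrix(counts, similar_topics, alpha=2.0, s_min=0.01):
    # counts: I x K array, counts[i, k] = occurrences of word k in topic i (Eq. 5.1)
    # similar_topics: for each topic i, a list of (j, p_ij) pairs found as in Section 5.7
    S = counts / counts.sum(axis=0, keepdims=True)   # reduce the weight of common words
    D = S @ S.T                                       # topic similarities, Eq. (5.2)
    np.fill_diagonal(D, 0.0)
    D_norm = D / D.max()                              # Eq. (5.3), so 0 <= d'_ij <= 1

    S_prime = S.copy()                                # Eq. (5.5)
    for i, related in enumerate(similar_topics):
        for j, p_ij in related:
            S_prime[i] += p_ij * S[j] / alpha         # Eq. (5.6), smoothing with related topics
    S_prime[S_prime == 0.0] = s_min                   # no zero likelihoods
    return S_prime / S_prime.sum(), D_norm            # normalise so the values sum to 1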
5.7 Process of Finding The Most Similar Topics
A group of the longest paths, where the distance is calculated as a product over edges rather than a sum, has to be found in the 2nd step of the algorithm described in the previous section. This can be achieved by implementing the following algorithm (a short illustrative sketch is given after the list):
1. Find the n single-edge paths with the highest measures d'_{ij}.
2. Check whether the two-edge path P(i, m), starting from node i with the highest measure d'_{ij} found in the step above and going through j to any other node m, has a better measure p_{im} than the lowest of the n solutions found in the step above. If it does, then replace the lowest one with m in the list of n similar topics.
3. Conduct the step above for all other single-edge paths from the list apart from the lowest, nth element.
4. If there are any non-single-edge paths P(i, j) on the list in a position other than the nth, repeat a process similar to step 2. Check whether, after adding any other edge, the measure of a path p_{ij} is higher than the measure of the nth position. Then replace the previous path with the new, longer path with higher p_{ij}.
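The sketch below is a simplified reading of the four steps above: starting from the n best single-edge paths, it repeatedly tries to extend a listed path by one edge and replaces the weakest entry whenever the extension has a larger path weight. Details such as tie handling and the iteration cap are assumptions. For the similarities of the worked example in Section 5.8 it returns topics 2 and 1 (with path weights 1 and 3/4) as the ones most similar to topic 3.

def most_similar_topics(i, d_prime, n):
    # d_prime: matrix of normalised similarities d'_ij; returns {topic: path weight p_ij}
    topics = [j for j in range(len(d_prime)) if j != i]
    # Step 1: the n single-edge paths with the highest d'_ij.
    best = {j: d_prime[i][j] for j in sorted(topics, key=lambda j: -d_prime[i][j])[:n]}

    for _ in range(100):                               # steps 2-4, with a safety cap
        improved = False
        lowest = min(best, key=best.get)
        for j in list(best):
            if j == lowest:
                continue                               # the weakest entry is never extended
            for m in topics:
                if m in best:
                    continue
                p_im = best[j] * d_prime[j][m]         # extend the path ending at j by edge j-m
                if p_im > best[lowest]:                # a longer path beats the weakest entry
                    del best[lowest]
                    best[m] = p_im
                    improved = True
                    break
            if improved:
                break
        if not improved:
            break
    return best

# Normalised similarities of the example in Section 5.8 (topics 1-4 as indexes 0-3):
d = [[0.00, 0.75, 0.25, 0.00],
     [0.75, 0.00, 1.00, 0.00],
     [0.25, 1.00, 0.00, 0.50],
     [0.00, 0.00, 0.50, 0.00]]
print(most_similar_topics(2, d, n=2))  # {1: 1.0, 0: 0.75}, i.e. topics 2 and 1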
It can be proved that the process is exhaustive in one way (from the analysed topic). Let us name the analysed topic i and the set of the n most similar topics to i, found in the first step of the process (using the measure d'_{ij}), N_1. Let l be the element of N_1 with the lowest measure of similarity d'_{ij}. As a result of the algorithm presented above, we obtain
\[ d'_{i n_1} > d'_{ij} \quad \forall n_1 \in N_1, \ \forall j \notin N_1. \tag{5.9} \]
Table 5.4: Matrix S for the example with 4 topics and a row of S' for topic 3

        big   John   has   house   black   aggr.   cat   small   mouse   is   mammal
  1       1      1     1       1       0       0     0       0       0    0        0
  2       1      1     1       0       1       1     1       0       0    0        0
  3       0      0     1       0       1       1     1       1       1    0        0
  4       0      0     0       0       0       0     0       0       0    1        1
  3'    7/8    7/8  15/8     1/2    11/8    11/8  11/8       1       1    0        0
Table 5.5: Matrix D for the presented example

        1   2   3   4
   1    4   3   1   0
   2    3   6   4   0
   3    1   4   6   2
   4    0   0   2   4

Let us define a set N_2 = T \ ({i_a} ∪ N_1) of topics not included in the list of similar topics, where
T is the set of all topics and {i_a} is a one-element set with the analysed topic i_a. From definition (5.3),
\[ 0 \le d'_{ij} \le 1 \quad \forall i, j \in \{1, \ldots, I\}, \tag{5.10} \]
therefore
\[ d'_{ij} d'_{jk} \le d'_{ij} \quad \forall j \in N_2, \tag{5.11} \]
where k is any topic. From (5.9) and (5.11),
\[ d'_{i n_1} > d'_{ij} d'_{jk} \quad \forall j \in N_2. \tag{5.12} \]
As the same reasoning can be applied to further iterations (three-edge paths and so on), (5.11) and (5.12) prove that the process is exhaustive in one way. It can skip some solutions leading from other topics to the analysed one. But this is even better from a linguistic point of view, because we do not want topics assigned as similar to many other topics just because they have a very strong link to one other topic.
5.8 Example in English
Let us consider an example of a corpus consisting of 4 sentences, each of them treated as a separate topic. Big John has a house. Big John has a black, aggressive cat. The black aggressive cat has a small mouse. The small mouse is a mammal.
All articles a and the were skipped, as they have no semantic content and they do not exist in Polish, which was our experimental language. We count all other words, which creates the matrices S (Tab. 5.4) and D (Tab. 5.5). The following topic similarities (d'_{12} = 3/4, d'_{13} = 1/4, d'_{14} = 0, d'_{23} = 1, d'_{24} = 0, d'_{34} = 1/2) are obtained. This constructs the graph in Fig. 5.4. Then a list of topics similar to topic three, N_1 = {2, 4}, can be found by applying the first step of the process on the graph. Topic 4 is l in this example, the topic with the lowest measure in N_1, namely 1/2. In the next step, the p_{ij} are calculated for two-edge paths starting at node 3 and going through 2. There are two of them. The first one is for the path 3-2-4, where p_{34} = 1 · 0 = 0. The second one is for the path 3-2-1, where p_{31} = 1 · 3/4 = 3/4 > d'_{34}. This is why topic 4 is replaced by topic 1 and the final list of topics similar to 3 is {2, 1}. Then, assuming α = 2, we can calculate the row for topic 3 of S' (Tab. 5.4, last row).
5.9 Recognition Using Bag-of-words Model
The recognition task can be described as
\[ s_i = \arg\max_s P(s \mid \mathrm{word}_{k_1}, \ldots, \mathrm{word}_{k_m}), \tag{5.13} \]
where s is any topic and word_{k_1}, ..., word_{k_m} is the set of recognised words which were in a sentence. It classifies the bag-of-words as one of the realisations of one of the topics in matrix (5.5).
Recognition can be conducted by finding the most coherent topic for the set of words W in a provided hypothesis. It is carried out by finding, over the rows, the maximum of the product of the elements of (5.5) from the columns representing the words of the hypothesis, normalised by the number of words,
\[ P_{sem} = \max_i \frac{\prod_{k \in W} s'_{ik}}{|W|}, \tag{5.14} \]
where |W| is the cardinality of the set of words W in the sentence. The row i for which the maximum is found is assumed to represent the topic of the sentence being recognised. The calculated value P_{sem} can be used as an additional weight in speech recognition due to Bayes' theorem. The values of the probability p_{htk} obtained from the HTK model tend to be very similar for all hypotheses in the 100-best list of a particular utterance. This is why an extra weighting w was introduced to favour probabilities from the audio model over p_{sem} received from the semantic model. The final measure can be obtained applying Bayes' theorem:
\[ p = p_{htk}^{w} \, p_{sem}. \tag{5.15} \]
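Assuming S_prime is the normalised word-topic matrix from the training step and word_ids are the column indexes of the recognised words, the scoring of a single hypothesis following the reconstruction of (5.14) and (5.15) above can be sketched as below; the default weight w = 50 is only an example value, matching one of the settings in Table 5.6.

import numpy as np

def semantic_score(word_ids, S_prime):
    # Eq. (5.14): best topic score for the bag of recognised words.
    per_topic = np.prod(S_prime[:, word_ids], axis=1) / len(word_ids)
    return per_topic.max()

def combined_score(p_htk, word_ids, S_prime, w=50.0):
    # Eq. (5.15): acoustic probability weighted by w, multiplied by the semantic score.
    return (p_htk ** w) * semantic_score(word_ids, S_prime)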
5.10 Preliminary Experiment
The first experiment (Ziółko et al., 2008b) was conducted on CORPORA, using the same data for training and testing, to evaluate the implementation and estimate the chances of the algorithm being successful without spending several days training a proper model. Because the model was small, it was easy to compare different values of the parameters n, α and w. Results for recognition based on the audio model only are also included. LSA was used as a baseline to evaluate the results of our method. Experiments with several different w for the semantic model based on LSA were
conducted. Values in the range between 23 and 26 gave the best results, presented in Tab. 5.6. 45 utterances did not have hypotheses with correct sentences anywhere in their 100-best lists. This is why the maximal number of utterances which could be recognised was 69.
The experiment shows that our semantic model is useful, even though the results might be this outstanding due to the small number of words in the corpus and the use of the same corpus for training and testing. The same corpus was used for both tasks because phoneme segmentation of the corpus is needed to use HTK, and CORPORA is the only Polish corpus which provides it. Nevertheless, the comparison of 53% correct recognitions for the best configurations of our model with 36% for LSA and 29% for the audio model only is impressive. The analysis of the results for different configurations showed that the choice of n, the length of the list of topics related to an analysed topic, is not as important as the ratio between n and α, which is a smoothing factor weighting the impact of related topics. The ratio n/α should be kept around 2/3 in this case in order to provide the best results. The audio model importance weight w is also crucial, as the information from the HTK model is very important and can be ignored if w is too small.
It has to be stressed that this was a preliminary experiment. Our aim was to check whether it was worth investing more time in research on this model. This is why we used little data and the same set for training and testing. Some elements of the algorithm were not used in this experiment; for example, the values in (5.1) were not normalised to be probabilities. We do not claim that the calculated model can be used for any practical task. One more reason for that is that it was trained on CORPORA, which has no semantic connotations. On the other hand, it has to be stressed that for Polish this model keeps some grammar information as well, even though it was designed as a semantic one. For example, we can expect words with morphology related to one gender in a given sentence, which will be noted in matrix S. The results were promising, so more sophisticated experiments using transcriptions from the Polish Parliament, literature, a journal and Wikipedia as training corpora were conducted; they are described in the following sections.
Another way of proving the usefulness of our bag-of-words model is through calculating the histogram p_{semc} of probabilities received from the semantic model for hypotheses which are correct recognitions (Fig. 5.5) and the histogram p_{semw} of probabilities received from the semantic model for hypotheses which are wrong recognitions (Fig. 5.6). The ratio p_{semc}/(p_{semc} + p_{semw}) is presented in Fig. 5.7. It clearly shows a correlation between a high probability from the bag-of-words model and the correctness of a recognition.
5.11 K-means On-line Clustering
The number of topics is limited to around 1000. If a large choice of words in the model is expected, then the number of topics has to be kept low to save memory. This is why it is necessary to overcome the limitation on the number of topics for any real application. It was done by clustering them into representatives of several topics. The k-means clustering algorithm was used for this aim. However, it was not possible to apply it directly to all topics at once because of the huge amount of data (millions of sentences). This is why we invented an algorithm which we call on-line clustering.
Table 5.6: Experimental results for pure HTK audio model, audio model with LSA and audio model with our bag-of-words model

 n    α    w    recognised sentences    %
LSA   -   25    41                      0.36
HTK   -    -    33                      0.29
 3    1   50    48                      0.42
 3    2   50    46                      0.40
 3    3   50    46                      0.40
 7    1   50    35                      0.31
 7    3   50    45                      0.39
 7    5   50    46                      0.40
 5    1   20    44                      0.39
 5    2   20    55                      0.48
 5    3   20    60                      0.53
 5    4   20    59                      0.52
 5    5   20    59                      0.52
 3    2   20    61                      0.53
 3    1   20    50                      0.44
 7    6   20    59                      0.52
 7    5   20    61                      0.53
 7    4   20    59                      0.52
 8    4   20    57                      0.50
 8    5   20    61                      0.53
 8    6   20    60                      0.53
 9    1   20    28                      0.25
 9    3   20    49                      0.43
 9    5   20    57                      0.50
 9    6   20    61                      0.53
 9    7   20    59                      0.52
11    5   20    54                      0.47
11    7   20    60                      0.53
11    8   20    60                      0.53
11    9   20    58                      0.51
 9    6   10    58                      0.51
 9    6   15    60                      0.53
 9    6   17    60                      0.53
 9    6   18    61                      0.53
 9    6   19    61                      0.53
 9    6   20    61                      0.53
 9    6   22    59                      0.52
 9    6   25    58                      0.51
Figure 5.5: Histogram of probabilities received from the bag-of-words model for hypotheses which are correct recognitions (number of correct recognitions vs. probability).
Figure 5.6: Histogram of probabilities received from the bag-of-words model for hypotheses which are wrong recognitions (number of wrong hypotheses vs. probability).
Figure 5.7: Ratio of correct recognitions to all of them for different probabilities received from the bag-of-words model (ratio vs. probability).
The general scheme is to collect n topics from the training data. The algorithm is initialised heuristically. Then, the topics are clustered into n/2 topics using the k-means clustering algorithm, which iterates the following two steps until convergence. The first is to compute the membership of each data point x in the clusters by choosing the nearest centroid. The second is to recompute the location of each centroid according to its members. When the k-means converges and the new topics are chosen, a further n/2 topics can be added from new training data and the clustering is repeated to reduce their number again. This loop is applied as long as there is new training data to be included. At the very end an additional clustering is conducted to limit the number of topics to n/4.
Every time, the information on how many sentences are represented by a particular topic is stored and used as weights when the means are calculated and topics are combined as a result of clustering. Thanks to that, the order in which sentences are fed into the training system does not influence the shape of the final clusters. Unfortunately, it is not possible to cluster all sentences at once because of data sparsity.
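The sketch below illustrates the weighted on-line loop described above. It is written in Python rather than the Matlab used for the actual experiments, and the function names, the vector representation of topics and the batch sizes are illustrative assumptions only.

import numpy as np

def weighted_kmeans(vectors, weights, k, iters=50, seed=0):
    # Cluster topic vectors into k centroids, weighting each vector by the
    # number of sentences it represents. Returns centroids and their weights.
    rng = np.random.default_rng(seed)
    centroids = vectors[rng.choice(len(vectors), size=k, replace=False)]
    for _ in range(iters):
        # assignment step: nearest centroid for every topic vector
        dists = np.linalg.norm(vectors[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # update step: weighted mean of the members of each cluster
        new_centroids = centroids.copy()
        for c in range(k):
            mask = labels == c
            if mask.any():
                w = weights[mask]
                new_centroids[c] = (vectors[mask] * w[:, None]).sum(0) / w.sum()
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    cluster_weights = np.array([weights[labels == c].sum() for c in range(k)])
    return centroids, np.maximum(cluster_weights, 1)

def online_clustering(batches, n):
    # Collect topics, reduce to n/2, add the next batch, repeat; finally n/4.
    topics, weights = None, None
    for batch in batches:                      # each batch: array of new topic vectors
        new_w = np.ones(len(batch))
        if topics is None:
            topics, weights = batch, new_w
        else:
            topics = np.vstack([topics, batch])
            weights = np.concatenate([weights, new_w])
        topics, weights = weighted_kmeans(topics, weights, k=n // 2)
    return weighted_kmeans(topics, weights, k=n // 4)

# Illustrative run on random data: 3 batches of 200 topic vectors of dimension 50.
rng = np.random.default_rng(1)
batches = [rng.random((200, 50)) for _ in range(3)]
centroids, counts = online_clustering(batches, n=400)
print(centroids.shape, counts.sum())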
Formula (5.8) holds in the same way for topics which represent several sentences as for those which represent just one sentence. However, it is not possible to calculate the probability of a word given a combined topic by using the probabilities related to the topics represented by the combined topic. The new version (for clustered topics) of matrix (5.5) has to be calculated and used instead. This means that the process of collecting statistical data by creating (5.1) has to be finished before the described algorithm is run. Once (5.5) has been created, new statistical data cannot be added to it. If this has to be done, the new data should be added to (5.1) and (5.5) has to be recalculated from the beginning.
5.12 Experiment on Parliament Transcripts
A set of 44 sentences was created using words and language similar to those expected to be used in parliament. They were also designed in such a way that the most common words from the training corpus are used and that some of the words from the testing set appear in a few sentences. They were recorded and the HTK recognition experiment was conducted on them using a triphone model trained on CORPORA, but with the vocabulary limited to the words in these 44 sentences. In this way, HTK provided a 100-best list of hypotheses for each of the sentences. They were used in the same way as in the previously described experiment.
Matrix (5.1) was created by analysing transcriptions of the Polish Parliament meetings from the years 2005-2007. They are the biggest corpus of transcribed Polish speech. There are differences in sentence construction between spoken and written language; this is one of the reasons why we decided to use this corpus for training. Another one is that our model is likely to become a part of an ASR system used by the police and courts, so we are interested in research on very formal language. None of the testing sentences was intentionally taken from these transcriptions; however, it was not checked whether they appear there. The testing set consists of 198 words and those words were included in matrix (5.1). Because of data sparsity, the k-means on-line clustering algorithm described above was used to combine several topics. Every topic is a set of words between two dots in the training corpus. In the ideal case, topics are sentences.
Table 5.7: 44 sentences in the exact transcription used for testing by HTK and bag-of-words model
with English translations
platforma obywatelska wymaga funkcjonowania klubu w czasie obrad sejmu
Civic Platform expects the club to operate during parliament proceedings.
dlaczego poseL wojciech polega na opinii zarzAdu
Why does MP Wojciech trust the board opinion?
Latwo skierowaC czynnoSci do sAdu
It is easy to move actions to court.
wniosek rolniczego zwiAzku znajduje siE w ministerstwie
The petition of the agricultural union is in the ministry.
projekt samorzAdu ma wysokie oczekiwania finansowe
The municipality project has high financial expectations.
fundusz spoLeczny podjAL dziaLania w ramach obecnego prawa cywilnego
The communal foundation took steps according to existing civil law.
koalicja chce komisji sejmowej do oceny dziaLalnoSci posLa jana
The coalition wants a parliament commission for evaluation of MP Jan activity.
dzisiaj piEC paN poprze ministra w waZnym gLosowaniu w sejmie
Five women will support the Minister in an important vote today.
poseL ludwik dorn byl na waZnym gLosowaniu po duZym posiLku
MP Ludwik Dorn participated in an important vote after a large meal.
bOg ocenia polskE za powaZne przestEpstwa sektora finansowego w kraju i za granicA
God judges Poland for crucial crimes of the financial sector in the country and abroad.
poseL tadeusz cymaNski faktycznie wyraziL sprzeciw wobec rozwoju paNstwa polskiego
MP Tadeusz Cymanski expressed a protest against development of the Polish country indeed.
tak mi dopomOZ bOg
God, help me. (traditional formula added after an oath)
poseL andrzej lepper zajmuje siE rzAdem jak nikt inny
MP Andrzej Lepper takes care of the government like no one else.
uchwaLa rzAdowa dotyczAca handlu i inwestycji przedsiEbiorstw paNstwowych
w rynek nieruchomoSci
The government act on trade and investments of public enterprises in the estate market.
panie marszaLku wysoka izbo
Mr speaker, House. (common way to start a speech in the Polish Parliament)
poseL ludwik dorn chce podziEkowaC komisji
MP Ludwik Dorn wants to thank the commission.
bezpieczeNstwo jest bardzo waZne
The safety is very important.
minister Srodowiska powiedziaL waZne rzeczy
The Minister of Environment said important things.
Table 5.8: 44 sentences in the exact transcription used for testing by HTK and bag-of-words model
with English translations (2nd part)
narOd rzeczpospolitej polskiej chce pieniEdzy
The nation of Republic of Poland wants money.
rodziny powinny byC najwaZniejsze
Families should be the most important.
resort bezpieczeNstwa ma wysokie uprawnienia
The department of security has high authority.
odpowiednie uprawnienia sA bardzo waZne
Proper authorities are very important.
kilkanaScie przedsiEbiorstw potrzebuje nowych dochodOw
Over a dozen of enterprises need new incomes.
poseL andrzej lepper zwrOciL dokumenty do sejmu
MP Andrzej Lepper returned documents to the Parliament.
krajowa komisja popiera nowA ustawE
The national commission supports the new act.
narOd rzeczpospolitej polskiej ma waZne oczekiwania od sejmu
The nation of the Republic of Poland has important expectations from the Parliament.
praktyka wskazuje co innego
Real life shows something else.
czterech posLow nie mogLo zostaC
Four MPs were not able to stay.
na sLuZbie siE pracuje
You work on a duty.
sprzeciwiam siE
I speak against.
wnoszE o przerwE w obradach
I ask for a break in the proceedings.
proszE o ciszE
I ask for silence.
wznowienie obrad nastApi po godzinnej przerwie
The proceedings will be reopened after an hour break.
to jest skandal
It is a scandal.
Table 5.9: 44 sentences in the exact transcription used for testing by HTK and bag-of-words model
with English translations (3rd part)
nie pozwolimy na to
We will not allow it.
obrady przy zamkniEtych drzwiach
Closed proceedings.
matki potrzebujA becikowe
Mothers need a support.
przechodzimy do konkretOw na temat ustawy o ubezpieczeniach spoLecznych
We move to details on the act on public insurances.
duZA frekwencja w trakcie gLosowania
High attendance during a vote.
zgromadzenie narodowe zadecyduje o przyszLoSci tej ustawy
The National Assembly will decide about the future of this act.
komisja zbierze siE po przerwie
The commission will gather after a break.
proszE mOwiC wolniej
Speak slower please.
zacznijmy od budowania podstaw
Let’s start from building the foundations.
zgLoszono wiele poprawek do tej ustawy
Many corrections to this act were declared.
In practice, however, dots are also used in Polish to mark abbreviations and ordinal numbers, which influenced the content of the topics. The training corpus consisted of around 800,000 topics.
At the end of the training process all topics were clustered into 500 final topics. Then the values of matrix (5.1) were normalised for each word over all topics, to increase the importance of words which appeared in few topics and decrease the importance of words which appeared in many topics. Then matrix (5.5) was created. The HTK hypotheses were rearranged using information from (5.5) in the same way as in the previous experiment.
The results of this experiment were negative: the model did not improve recognition. The quality of the training data is blamed for these results. The transcriptions contained many comments and other elements which are not sentences. Furthermore, the transcriptions were copied from pdf files into a text file, which degraded the quality slightly: all 'fi' syllables in the corpus were changed into dots and some parts were rearranged in an inappropriate way. What is more, a dot is quite frequently used in Polish to mark the end of an abbreviation and is put after numbers if they denote order, like 1st or 2nd in English. All these dots were treated by our algorithm in the same way as dots marking the ends of sentences. This is why the topics were quite often not the proper sentences expected by our algorithm. We decided to conduct another experiment using literature for training. The quality of ebooks is better than that of the transcriptions, they are available as txt and doc files, and abbreviations and numbers are much rarer in literature than in the Parliament transcripts.
5.13 Preprocessing of Training Corpora
The experiment on the Parliament transcripts taught us that text data has to be preprocessed more thoroughly before it can be used for model training. There are three main issues which have to be faced. First, Matlab, which is used for model training, does not recognise the special Polish letters, so they have to be replaced by single characters. Secondly, several special characters should be erased to keep the corpus cleaner. Thirdly, some dots have to be removed from the corpus as they do not mark the end of a sentence.
We started by replacing all capital letters in the corpus with lower case, as they are redundant for this experiment. Capital letters can then be used to represent the special Polish characters. The second issue was addressed by removing (or, for some of them, replacing with an empty space) all characters from the list: , ” “ : ( ) ; + - \/ ’ # & =.
Then question and exclamation marks ?! were changed into dots. Several dots were removed if they followed certain abbreviations. A dot is put after an abbreviation in Polish if it finishes with a consonant. All short forms from the list were replaced by a full form, or by an abbreviation without a dot if several morphological forms are represented by one abbreviation. An empty space was put at the beginning of each string to be searched for, to avoid detecting the ends of some words.
It is more and more common in Polish to put a dot after digits if they denote order, like 'th' in English. This is why all dots following digits were also removed from the corpora. Two dots following each other were replaced with just one, and the same was done with three dots. Finally, all doubled and tripled spaces were replaced by a single one as a final cleaning of the corpora. In the beginning we did these operations using Matlab and Word for Windows. Later, the process was automated using SED, which we also used to remove html and xml tags from some of the texts. SED is a simple stream editor under Linux: it reads and filters the input line by line (a text file in our case), applies changes to the text according to the commands in a specific order, and sends the result to the output. The script presented in Table 5.10 was used for all changes apart from removing the html tags.
5.14 Experiment with Literature Training Corpus
Another experiment, on a larger scale, was conducted using literature to train the model. This attempt was more successful than the previous one; however, the results are still unsatisfactory. The improvement compared to the transcripts might be caused by the fact that the language in literature is much more proper than in the transcripts, where spoken language was written down. It would be an interesting observation that written language should be used for training, even though spoken language is being recognised. With some configurations, an improvement of 3% was noted (Tab. 5.11). The low efficiency was probably caused by using too little data for training; the very bad results obtained when applying LSA support this hypothesis. The perplexity of the corpus is sufficiently large and equals 9 031.
As the next step to improve our model, we started to normalise all values in matrix (5.5) so that its values, and the final grades, are probabilities, which we had not done in the previous experiment.
Table 5.10: SED script for text preprocessing
s/A/a/g
s/B/b/g
s/C/c/g
s/D/d/g
s/E/e/g
s/F/f/g
s/G/g/g
s/H/h/g
s/I/i/g
s/J/j/g
s/K/k/g
s/L/l/g
s/M/m/g
s/N/n/g
s/O/o/g
s/P/p/g
s/R/r/g
s/S/s/g
s/T/t/g
s/U/u/g
s/W/w/g
s/Y/y/g
s/V/v/g
s/X/x/g
s/Z/z/g
s/ł/L/g
s/ś/S/g
s/ń/N/g
s/ć/C/g
s/ó/O/g
s/ȩ/E/g
s/ż/Z/g
s/ź/X/g
s/a̧/A/g
s/Ł/L/g
s/Ś/S/g
s/Ń/N/g
s/Ć/C/g
s/Ó/O/g
s/Ȩ/E/g
s/Ż/Z/g
s/Ź/X/g
s/A̧/A/g
s/,//g
s/[-]//g
s/[+]//g
s/[/]//g
s/[=]//g
s/[\]//g
s/[”]//g
s/[:]//g
s/ [%]/ procent/g
s/ [$]/ dolar/g
s/nbsp/ /g
s/[.] [.]/./g
s/ ust[.]/ ustawa/g
s/ ub[.]/ ub/g
s/[(]//g
s/[)]//g
s/[;]//g
s/[¡]//g
s/[#]//g
s/[&]//g
s/[|]//g
s/[*]//g
s/[ ]//g
s/[’]//g
s/[!]/./g
s/[?]/./g
s/[@]/ /g
s/0[.]/0/g
s/1[.]/1/g
s/2[.]/2/g
s/3[.]/3/g
s/3[.]/3/g
s/4[.]/4/g
s/5[.]/5/g
s/6[.]/6/g
s/7[.]/7/g
s/8[.]/8/g
s/9[.]/9/g
s/ godz[.]/ godz/g
s/ art[.]/ art/g
s/ tys[.]/ tys/g
s/ ok[.]/ ok/g
s/ m[.]in[.]/ miEdzy innymi/g
s/ m[.] in[.]/ miEdzy innymi/g
s/ n[.]p[.]m[.]/ nad poziomem morza/g
s/ p[.]p[.]m[.]/ pod poziomem morza/g
s/ p[.]n[.]e[.]/ przed naszA erA/g
s/ n[.]e[.]/ naszej ery/g
s/ przyp. tLum./ przypis tLumacza/g
s/ z o[.] o[.]/ z ograniczonA odpowiedzialnoSciA/g
s/ z o[.]o[.]/ z ograniczonA odpowiedzialnoSciA/g
s/ orygin[.]/ oryginalnie/g
s/ proc[.]/ procent/g
s/ tj[.]/ to jest/g
s/ szt[.]/ sztuk/g
s/ np[.]/ na przykLad/g
s/ ww[.]/ wyZej wym/g
s/ ds[.]/ do spraw/g
s/ wLaSc[.]/ wLaSc/g
s/ tzw[.]/ tzw/g
s/ im[.]/ imienia/g
s/ lit[.]/ litera/g
s/ ang[.]/ ang/g
s/ Lac[.]/ Lac/g
s/ gr[.]/ gr/g
s/ poL[.]/ poLowa/g
s/ zm[.]/ zmarLy/g
s/ ur[.]/ urodzony/g
s/ wyd[.]/ wyd/g
s/ r[.]/ r/g
s/ r [.]/ roku/g
s/ sp[.]/ spOLka/g
s/ ul[.]/ ulica/g
s/ pkt[.]/ pkt/g
s/[.]jpg/ jpg/g
s/[.]png/ png/g
s/[.]exe/ exe/g
s/[.]bmp/ bmp/g
s/[.]pdf/ pdf/g
s/[.]html/ htm/g
s/[.]pl/ pl/g
s/[.]com/ com/g
s/ w[.]/ w/g
s/ a[.]/ a/g
s/ b[.]/ b/g
s/ c[.]/ c/g
s/ d[.]/ d/g
s/ e[.]/ e/g
s/ f[.]/ f/g
s/ g[.]/ g/g
s/ h[.]/ h/g
s/ i[.]/ i/g
s/ j[.]/ j/g
s/ k[.]/ k/g
s/ l[.]/ l/g
s/ L[.]/ L/g
s/ m[.]/ m/g
s/ n[.]/ n/g
s/ o[.]/ o/g
s/ p[.]/ p/g
s/ s[.]/ s/g
s/ t[.]/ t/g
s/ u[.]/ u/g
s/ z[.]/ z/g
s/www[.]/www /g
s/ / /g
s/ / /g
s/[.][.][.]/./g
s/[.][.]/./g
Table 5.11: Experimental results for pure HTK audio model, audio model with LSA and audio model with our bag-of-words model trained on literature

 n    α    w    recognised sentences    %
LSA   -   26     8                      18
HTK   -    -    16                      35
30   20   20    17                      38
Table 5.12: Experimental results for pure HTK audio model, audio model with LSA and audio model with our bag-of-words model trained on the enlarged literature corpus

 n    α    w    ranking of the correct hypothesis    % improvement
LSA   -   30    12.36                                -19
HTK   -    -    10.39                                  0
 3    3   25     8.95                                 14
We also added new text to the training data. Additionally, we decided that counting the number of properly recognised sentences is not the best way to evaluate the method. We started to look at the average position of the correct hypothesis in the n-best list before and after applying our model. This gives us an evaluation based on all sentences, not just those for which a correct hypothesis was moved to the first position from a lower one, as in the earlier evaluation method. We compared our model with LSA as the baseline. It performed better again (Tab. 5.12), which supports the conclusion that this model is at least better than LSA, because it needs less data to be trained. The different parameter values with which our model performs best are probably a consequence of the fact that matrix (5.1) is calculated using more data. Thanks to that, there are fewer zeros in (5.1) and there is no need to smooth it as much by including the impact of many similar topics; only the most similar ones were used in that case.
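As an illustration only, the following Python sketch computes this evaluation measure (the average position of the correct hypothesis in the n-best list); the function name and the data layout are assumptions, not part of the thesis implementation.

def average_correct_rank(utterances):
    # Mean 1-based position of the correct transcription in each n-best list.
    # `utterances` is a list of (reference, ranked_hypotheses) pairs; utterances
    # whose list does not contain the reference at all are skipped here.
    ranks = []
    for reference, hypotheses in utterances:
        if reference in hypotheses:
            ranks.append(hypotheses.index(reference) + 1)
    return sum(ranks) / len(ranks) if ranks else float("nan")

# Illustrative comparison before and after rescoring with the semantic model.
before = [("a b c", ["a b d", "a b c", "x y z"]), ("d e", ["d e", "d f"])]
after  = [("a b c", ["a b c", "a b d", "x y z"]), ("d e", ["d e", "d f"])]
print(average_correct_rank(before), average_correct_rank(after))   # 1.5 1.0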
We also collected more data for training using the Rzeczpospolita journal and the Polish Wikipedia. The first corpus can be downloaded from Dawid Weiss's website as a set of html files; the researcher claims that the journal agreed to these resources being used for any academic research. The second was collected from the Internet using C++ software and has a very high perplexity, namely 16 436. However, adding this data did not improve the performance of the method. Table 5.13 shows the size and complexity of all the corpora we used in this research.
Table 5.13: Text corpora

Content                   MBytes   Mwords   Perplexity
Parliament transcripts      58        8        4 013
Literature                 490       68        9 031
Rzeczpospolita journal     879      104        8 918
Wikipedia                  754       97       16 436
5.15 Word Prediction Model and Evaluation with Perplexity
There are two main ways to evaluate language models. The first is to measure the recognition error. The second is perplexity, which, for a probability model, is defined with the cross entropy (Brown et al., 1992)

2^{-\sum_{x=1}^{N} p(x) \log_2 q(x)},                (5.16)

where p(x) is the probability of a correct recognition from a ground truth distribution. Here it is assumed to be uniform, which leads to p(x) = 1/N. N is the number of test samples and q(x) is the probability of a correct recognition from the probability distribution of the tested language model.
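A minimal sketch of computing (5.16) for a model that does assign probabilities to the test events, assuming a uniform ground truth p(x) = 1/N; the function name and example values are illustrative only.

import math

def perplexity(model_probs):
    # Perplexity as in (5.16) with a uniform ground-truth distribution:
    # 2 ** (-(1/N) * sum(log2 q(x))) over the N test events.
    n = len(model_probs)
    cross_entropy = -sum(math.log2(q) for q in model_probs) / n
    return 2 ** cross_entropy

# Example: a model assigning these probabilities to four test events.
print(perplexity([0.25, 0.1, 0.5, 0.05]))    # roughly 6.3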
The first, usually given as word error rate (WER), is an accuracy measure; briefly, it describes how correct the highest-probability hypothesis is. Perplexity is a measure of how probable the observed data is according to the model. Our model is designed to be implemented in a working ASR system, and this is why accuracy is a more important evaluation for us than perplexity. Nevertheless, perplexity is a very popular measure and many NLP researchers recommend reporting both evaluations. It has to be stressed that the previously described bag-of-words model cannot provide perplexity as such. The reason is that our model does not provide a probability of an event, like a word following a given history of words. The model provides us with a grade of how coherent a set of words is. Perplexity of our model cannot be given, as the model uses a probability of a topic given all words in a sentence. There is no ground truth for this probability distribution to be used in (5.16). The topics in our model are not listed and named; they are not real topics but representations of sentences grouped in an unsupervised process.
5.16 Conclusion
The POS tagger from Dr Piasecki (Piasecki, 2006) was applied as an extra language model to the problem of improving ASR. Although this is the most effective tagger for Polish, with an accuracy of 93.44%, the results were not good. It reduced the recognition rate by 57% when applied as an LM to an ASR system based on HTK. We believe this is because POS tag information for Polish is too ambiguous.
Another language model, inspired by LSA, was designed, implemented in Matlab and applied to improve ASR recognition of sentences. It was mainly a semantic model, but because of the inflective nature of the Polish language, it covers some syntax modelling as well. The semantic model uses statistics of how words appear together in sentences in a word-topic matrix, where the topics can be seen as sentence examples and patterns. The order of words is not kept, though; this is why we call it the bag-of-words model. A corpus of almost 300,000,000 words was available for the task of training the model. However, some texts decreased the efficiency of the model. After several experiments, an improvement in recognition of 14% was achieved compared to a system without a language model, and of 33% compared to LSA. The average ranking position of the correct recognition in the entire n-best list of hypotheses was used as the evaluation grade. We believe that the bag-of-words model is effective because of the non-positional nature of the Polish language. The overall conclusion from this part of our research is that POS taggers are not useful in ASR of
Polish, but the bag-of-words model based on a word-topic matrix helps in the ASR task for Polish.
The uselessness of applying POS tagging to language modelling of Polish was experimentally demonstrated. The main contribution presented in this chapter is the successful model based on a word-topic matrix, which was invented, implemented and tested. It can be trained with less data than the baseline and has better predictive power.
The method could be improved by stemming the training corpora first. Stemming for Polish can be performed using Morfeusz (Woliński, 2004), a morphological analyser implemented by Marcin Woliński applying Zygmunt Saloni's rules. It would reduce data sparsity and improve results. The method can be applied to any other language for which LSA is useful; however, it is tuned to Polish and other Slavic languages because they are non-positional, and the bag-of-words philosophy fits the logic of these languages very well. We plan to train the bag-of-words model on a larger training corpus. The more data one can use, the better the performance of a language model. We believe this is true especially in this case, because LSA is known to be effective, while it reduced recognition here when trained on the available data. LSA is a challenging baseline, and this is why we believe our method will be very good when trained on sufficiently large data, which we plan to do.
The work on the bag-of-words model will be extended. Several possible combinations will be tested on larger corpora than those described here. We are in the process of acquiring more literature books, newspaper articles and high quality websites. We will optimise the bag-of-words algorithm, especially its memory usage while working on matrix (5.1), and implement it in C. The method will be tested not only on sentences as topics but also on paragraphs and on articles or chapters. In all cases we will compare versions trained on the original corpora and on stemmed ones. We will also combine the bag-of-words model with n-grams to catch some extra information and achieve as high a recognition rate as possible.
Chapter 6
Conclusions and Future Research
It is difficult to predict success in research. In the case of ASR it is even more difficult as the
revolutionary and effective solutions have been anticipated for approximately 25 years but have
not, as yet, materialised. However, there is still important progress in all aspects of ASR. Our study
on different parametrisation methods has highlighted a few aspects which might be especially
successful in the near future. One of the obviously good avenues of research is the perceptual
approach. The idea was developed by Hermansky, who improved the already popular LPC into PLP. Many other methods also give better results because they are perceptually motivated. Human hearing and speaking systems have been tuned to each other by millennia of evolution. It means that we have to simulate processes in the human ear and brain to recognise and understand signals created by the human speech system. In fact, all ASR methods are perceptually motivated to some extent, but some specifically model perceptual features. Wavelets, for example, give good opportunities due to their non-uniform bandwidths. Phonological approaches also try to simulate processes in the human ear in more detail.
Another issue, which will definitely become more important, is the differences in the parametrisation of speech for different languages. The beginnings of research in ASR were based on English. Currently it has become quite popular to try to recognise other languages like Japanese, Chinese, Arabic, German, French, Turkish, Finnish, the Slavic languages and many more. There are obvious differences between them, but the methods very often repeat the scheme applied for English. This might be an important encumbrance, because English is in fact quite an unusual language. It has a few features important for ASR which mark it out even from other western European languages, not to mention others. The huge majority of unstressed vowels are pronounced in a very similar way, which causes a large number of homophones. Conjugation is relatively simple and declension of nouns and adjectives almost does not exist. Languages also differ in the widths of their possible frequency bands; for example, there are phonemes in Polish with frequencies much higher than any in English. It is quite common that most people find some phonemes especially difficult while learning a new foreign language. This observation should be taken into consideration by researchers working on non-English languages.
Table 2.3 shows clearly that it is very difficult to find a new parametrisation method which
would outperform the baseline. It is usually much more successful to append new elements, or
to further process a commonly known parametrisation. This suggests that it might be impossible to find any new crucial parametrisation method and that success can rather be obtained by additional processing of features or by better modelling.
The statistics of phonemes, diphones and triphones were collected for Polish using a large corpus of mainly spoken formal language. A summary of the data was presented and interesting phenomena in the statistics were described. Triphone statistics play an important role in ASR; they are used to improve the proper transcription of the analysed speech segments. 28% of the possible triples were detected as triphones, but many of them appeared very rarely. A majority of the rare triphones came from foreign or twisted words.
Most ASR systems do not use information about the boundaries of phonetic units such as phonemes. A method based on the DWT for finding such boundaries was presented. The method is language agnostic, as it does not rely on any phonetic models but purely on the analysis of the power spectrum, and hence is applicable to any language. For the same reason it can easily be introduced into most existing systems, as it does not depend on any exact configuration or training of the speech model. It can also be used to provide additional information or a primal hypothesis for segmentation methods based on models, as in (Ostendorf et al., 1996). Our method is intelligent in the sense that it can easily be improved or adapted for specific applications, noisy data, etc., by introducing additional conditions or changing weights. The algorithm can find most of the boundaries with high accuracy. Several wavelet functions were compared and our results show that Meyer wavelets are better than the others. Fuzzy recall and precision measures were introduced for segmentation in order to evaluate the method with more sensitivity, grading errors more smoothly than commonly used evaluation methods. Our results give approximately a 0.72 f-score for Meyer and most of the other wavelets.
A precise evaluation method was described. It adapts the standard and very useful recall and precision scheme for applications where the evaluation has to consider more details. Speech segmentation is such a field, but many other types of segmentation are as well, because the correctness of audio or image segmentation is typically not binary. This is why we found fuzzy sets useful in the task of segmentation evaluation. General rules of applying fuzzy logic to recall and precision were presented, as well as an exact algorithm for using it in phoneme segmentation evaluation, as an example.
It seems that POS tags are too ambiguous to be used effectively in modelling Polish for ASR. Actually, according to our experiments, they reduce the number of correct recognitions. Even though POS information is important in the Polish language, the ambiguity of forms means that other language models have to be used.
A new method inspired by LSA was presented. The advantage of the method is that the smoothing of information in the matrix representing word-topic relations is based on a limited number of the most closely related topics for every topic, rather than on all of them as in LSA. Our model was still better than LSA, which actually reduced recognition with the available training data. The bag-of-words model can be trained with less data than LSA. The performance was improved in comparison to the audio model. In the experiment with the best algorithm and most of the training data, we graded the method by the average position of the correct hypothesis in the n-best list. The
improvement was 14% compared to using the HTK audio model only; LSA trained on the same data reduced recognition.
The author's research on ASR will be continued. He now works as a research assistant on an ASR project for the AGH University of Science and Technology and the Polish Platform of Homeland Security. He is responsible for designing language models in the project, will apply his PhD experience there, and will experiment with the bag-of-words method on a larger scale. It will probably be combined with n-grams and applied to subword units provided by a POS tagger to reduce the size of the dictionary. The author's segmentation method has already been improved by other people in the project, and is now being implemented in C++ for an ASR system which is going to be used in courts and during police interrogations. The paper on triphone statistics was rated very highly by the 3rd LTC conference committee, who requested a revised version for a journal. The statistics will be collected again using a larger corpus and will be published in the new submission.
List of References
Abry, P. (1997). Ondelettes et turbulence (eng. Wavelets and turbulence). Diderot ed., Paris.
Agirre, E., Alfonseca, E., and de Lacalle, O. L. (2004). Approximating hierachy-based similarity
for wordnet nominal synsets using topic signatures. Proceedings of the 2nd Global WordNet
Conference. Brno, Czech Republic.
Agirre, E., Ansa, O., Martı́nez, D., and Hovy, E. (2001). Enriching wordnet concepts with topic
signatures. Procceedings of the SIGLEX Workshop on WordNet and Other Lexical Resources:
Applications, Extensions and Customizations.
Agirre, E., Martı́nez, D., de Lacalle, O. L., and Soroa, A. (2006). Two graph-based algorithms
for state-of-the-art wsd. Proceedings of the 2006 Conference on Empirical Methods in Natural
Language Processing, Sydney, pages 585–593.
Ahmed, N., Natarajan, T., and Rao, K. R. (1974). Discrete cosine transform. IEEE Transactions on Computers, Jan:90–93.
Alewine, N., Ruback, H., and Deligne, S. (October-December 2004). Pervasive speech recognition. Pervasive computing, pages 78–81.
Przepiórkowski, A. (2006). The potential of the IPI PAN corpus. Poznań Studies in Contemporary Linguistics, 41:31–48.
Banerjee, S. and Pedersen, T. (2003). Extended gloss overlaps as a measure of semantic relatedness. Proceedings of the Eighteenth International Joint Conference on Artificial Intelligence,
pages 805–810.
Basztura, C. (1992). Rozmawiać z komputerem (Eng. To speak with computers). Format.
Beep dictionary (2000). www.speech.cs.cmu.edu/comp.speech/Section1/Lexical/beep.html.
Bellegarda, J. (1998). A multispan language modeling framework for large vocabulary speech
recognition. IEEE Transactions on Speech and Audio Processing, 6(5):456–467.
Bellegarda, J. R. (1997). A latent semantic analysis framework for large-span language modeling.
Proceedings of Eurospeech, 3:1451–1454.
Bellegarda, J. R. (2000). Large vocabulary speech recognition with multispan statistical language
models. IEEE Transactions on Speech and Audio Processing, 8(1):76–84.
Bellegarda, J. R. (2005). Latent semantic mapping. IEEE Signal Processing Magazine, September:70–80.
Boersma, P. (1996). Praat, a system for doing phonetics by computer. Glot International,
5(9/10):341–345.
Brill, E. (1994). Some advances in transformation-based part of speech tagging. Proceedings of
the Twelfth National Conference on artificial Intelligence AAAI.
Brill, E. (1995). Transformation-based error-driven learning and natural language processing: A
case study in part of speech tagging. Computational Linguistics, December:543–565.
Brown, P. F., Della Pietra, V. J., Della Pietra, S. A., Mercer, R. L., and Lai, J. C. (1992). An estimate of an upper bound for the entropy of English. Computational Linguistics, 18(1):31–40.
Cardinal, P., Boulianne, G., and Comeau, M. (2005). Segmentation of recordings based on partial
transcriptions. Proceedings of Interspeech, pages 3345–3348.
Coccaro, N. and Jurafsky, D. (1998). Towards better integration of semantic predictors in statistical
language modeling. Proceedings of ICSLP-98, Sydney.
Cooley, J. W. and Tukey, J. W. (1965). An algorithm for the machine calculation of complex
fourier series. Math. Comput., 19:297–301.
Cozens, S. (1998). Primitive part-of-speech tagging using word length and sentential structure.
Computation and Language.
Cuadros, M., Padró, L., and Rigau, G. (2005). Comparing methods for automatic acquisition of
topic signatures. Proceedings of the International Conference on Recent Advances on Natural
Language Processing (RANLP).
Daelemans, W. and van den Bosch, A. (1997). Language-independent data-oriented grapheme-to-phoneme conversion. Progress in Speech Synthesis, New York: Springer-Verlag.
Daubechies, I. (1992). Ten lectures on Wavelets. Society for Industrial and Applied Mathematics,
Philadelphia, Pennsylvania.
Davis, H., Biddulph, R., and Balashek, S. (1952). Automatic recognition of spoken digits. Journal
of the Acoustical Society of America, (24(6)):637–642.
Davis, S. B. and Mermelstein, P. (1980). Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. IEEE Transactions on Acoustics,
Speech and Signal Processing, ASSP-28(4):357–366.
Dȩbowski, Ł. (2003). A reconfigurable stochastic tagger for languages with complex tag structure.
The Proceedings of the Workshop on Morphological Processing of Slavic Languages, EACL.
de Saussure, F. (1916). Course de lingustique genérale. Lausanne and Paris: Payot.
Demenko, G., Wypych, M., and Baranowska, E. (2003). Implementation of grapheme-to-phoneme
rules and extended SAMPA alphabet in Polish text-to-speech synthesis. Speech and Language
Technology, PTFon, Poznań, 7(17).
Demuynck, K. and Laureys, T. (2002). A comparison of different approaches to automatic speech
segmentation. Proceedings of the 5th International Conference on Text, Speech and Dialogue,
pages 277–284.
Denes, P. B. (1962). Statistics of spoken English. The Journal of the Acoustical Society of America,
34:1978–1979.
Deng, L., Wu, J., Droppo, J., and Acero, A. (2005). Analysis and comparison of two speech
feature extraction/compensation algorithms. IEEE Signal Processing Letters, 12(6):477–480.
Deng, Y. and Khudanpur, S. (2003). Latent semantic information in maximum entropy language
models for conversational speech recognition. Proceedings of the HLT-NAACL 03, pages 56–63.
Eskenazi, M., Black, A., Raux, A., and Langner, B. (2008). Let’s go lab: a platform for evaluation
of spoken dialog systems with real world users. Proceedings of Interspeech, Brisbane.
Evermann, G., Chan, H. Y., Gales, M. J. F., Hain, T., Liu, X., Mrva, D., Wang, L., and Woodland,
P. C. (2004). Development of the 2003 CU-HTK conversational telephone speech transcription
system. Proceedings of ICASSP Interspeech, pages I–249–252.
Farooq, O. and Datta, S. (2004). Wavelet based robust subband features for phoneme recognition.
IEE Proceedings: Vision, Image and Signal Processing, 151(3):187–193.
Fellbaum, C. (1999). Wordnet. An Electronic Lexical Database. Massachusetts Institute of Technology, US.
Forney, G. D. (1973). The Viterbi algorithm. Proceedings IEEE, 61:268–273.
Frankel, J. and King., S. (2005). A hybrid ANN/DBN approach to articulatory feature recognition.
Proceedings of Eurospeech.
Frankel, J. and King, S. (2007 (in press)). Speech recognition using linear dynamic models. IEEE
Transactions on Speech and Audio Processing.
Frankel, J., Wester, M., and King, S. (2007). Articulatory feature recognition using dynamic
Bayesian networks. Computer Speech and Language, 21(4):620–640.
Friedman, J., Hastie, T., and Tibshirani, R. (1999). Additive logistic regression: A statistical view
of boosting. Technical report, Department of Statistics, Stanford University.
Gałka, J. and Ziółko, B. (2008). Study of performance evaluation methods for non-uniform speech
segmentation. International Journal Of Circuits, Systems And Signal Processing, NAUN.
Ganapathiraju, A., Hamaker, J. E., and Picone, J. (2004). Applications of support vector machines
to speech recognition. IEEE Transactions on Signal Processing, 52(8):2348–2355.
Glass, J. (2003). A probabilistic framework for segment-based speech recognition. Computer
Speech and Language, 17:137–152.
Gorrell, G. and Webb, B. (2005). Generalized Hebbian algorithm for incremental latent semantic
analysis. proceedings of Intespeech.
Grayden, D. B. and Scordilis, M. S. (1994). Phonemic segmentation of fluent speech. Proceedings
of ICASSP, Adelaide, pages 73–76.
Green, S. J. (1999). Lexical semantics and automatic hypertext construction. ACM Computing
Surveys (CSUR), 31.
Greenberg, S., Chang, S., and Hollenback, J. (2000). An introduction to the diagnostic evaluation
of switchboard- corpus automatic speech recognition systems. Proceedings of NIST Speech
Transcription Workshop.
Grocholewski, S. (1995). Założenia akustycznej bazy danych dla jȩzyka polskiego na nośniku
cd rom (Eng. Assumptions of acoustic database for Polish language). Mat. I KK: Głosowa
komunikacja człowiek-komputer, Wrocław, pages 177–180.
Grönqvist, L. (2005). An evaluation of bi- and trigram enriched latent semantic vector models.
ACM Proceedings of ELECTRA Workshop - Methodologies and Evaluation of Lexical Cohesion
Techniques in Real-world Applications, Salvador, Brazil, pages 57–62.
Hain, T., Dines, J., Garau, G., Karafiat, M., Moore, D., Wan, V., Ordelman, R., and S.Renals
(2005). Transcription of conference room meetings: an investigation. Proceedings of ICSLP
Interspeech.
Harary, F. (1969). Graph Theory. Addison-Wesley.
Hermansky, H. (1990). Perceptual linear predictive (PLP) analysis of speech. Journal of the
Acoustical Society of America, 87(4):1738–1752.
Hermansky, H. and Morgan, N. (1994). RASTA processing of speech. IEEE Transactions on
Speech and Audio Processing, 2(4):578–589.
Hifny, Y., Renals, S., and Lawrence, N. D. (2005). A hybrid MaxEnt/HMM based ASR system.
Proceedings of ICSLP Interspeech.
Holmes, J. N. (2001). Speech Synthesis and Recognition. London: Taylor and Francis.
Ishizuka, K. and Miyazaki, N. (2004). Speech feature extraction method representing periodicity
and aperiodicity in sub bands for robust speech recognition. Proceedings of ICASSP, pages
I–141–144.
Jarmasz, M. and Szpakowicz, S. (2003). Roget’s thesaurus and semantic similarity. Proceedings
of Conference on Recent Advances in Natural Language Processing (RANLP), pages 212–219.
Jassem, K. (1996). A phonemic transcription and syllable division rule engine. Onomastica-Copernicus Research Colloquium, Edinburgh.
Jelinek, F., Merialdo, B., Roukos, S., and Strauss, M. (1991). A dynamic language model for
speech recognition. Fourth DARPA Speech and Natural Language Workshop, pages 293–295.
Johansson, S., Leech, G., and Goodluck, H. (1978). Manual of Information to Accompany the
Lancaster-Olso/Bergen Corpus of British English, for Use with Digital Computers. Department
of English, University of Oslo.
Jurafsky, D. and Martin, J. H. (2000). Speech and Language Processing. Prentice-Hall, Inc., New
Jersey.
Kakkonen, T., Myller, N., and Sutinen, E. (2006). Applying part-of-speech enhanced LSA to
automatic essay grading. Proceedings of the 4th IEEE International Conference on Information
Technology:Research and Education (ITRE 2006). Tel Aviv, Israel, pages 500–504.
Kanejiya, D., Kumar, A., and Prasad, S. (2003). Automatic evaluation of students’ answers using
syntactically enhanced LSA. Proceedings of the HLT-NAACL 03 workshop on Building educational applications using natural language processing, 2:53–60.
Kecman, V. (2001). Learning and Soft Computing. Massachusetts Institute of Technology, US.
Kȩpiński, M. (2005). Kontekstowe zwia̧zki cech w sygnale mowy polskiej (Eng. Contextual feature
relations in Polish speech signal), PhD Thesis. AGH University of Science and Technology,
Kraków.
Khudanpur, S. and Wu, J. (1999). A maximum entropy language model integrating n-grams and
topic dependencies for conversational speech recognition. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Phoenix, AZ.
King, S. (2003). Dependence and independence in automatic speech recognition and synthesis.
Journal of Phonetics, 31(3-4):407–411.
King, S. and Taylor, P. (2000). Detection of phonological features in continuous speech using
neural networks. Computer Speech and Language, 14(4):333–353.
Kucera, H. and Francis, W. (1967). Computational Analysis of Present Day American English.
Brown University Press Providence.
Baum, L. E., Petrie, T., Soules, G., and Weiss, N. (1970). A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains. Ann. Math. Statist., 41(1):164–171.
Lamere, P., Kwok, P., Gouvea, E., Raj, B., Singh, R., Walker, W., and Wolf, P. (2004). The CMU Sphinx-4 speech recognition system. Sun Microsystems.
Li, H.-Z., Liu, Z.-Q., and Zhu, X.-H. (2005). Hidden Markov models with factored Gaussian mixture densities. Elsevier Pattern Recognition, 38:2022–2031.
Lowerre, B. T. (1976). The HARPY Speech Recognition System, PhD thesis. Carnegie-Mellon
University, Pittsburgh.
Ma, J. Z. and Deng, L. (2004). Target - directed mixture dynamic models for spontaneous speech
recognition. IEEE Transactions on Speech and Audio Processing, 12(1).
Mahajan, M., Beeferman, D., and Huang, D. (1999). Improved topic-dependent language modeling using information retrieval techniques. Proceedings of ICASSP, pages 541–544.
Makhoul, J. (1975). Spectral linear prediction: properties and applications. IEEE Transactions,
ASSP-23:283–296.
Manning, C. D. (1999). Foundations of Statistical Natural Language Processing. MIT Press.
Cambridge, MA.
Miller, T. and Wolf, E. (2006). Word completion with latent semantic analysis. 18th International
Conference on Pattern Recognition, ICPR, Hong Kong, 1:1252–1255.
Misra, H., Ikbal, S., Bourlard, H., and Hermansky, H. (2004). Spectral entropy based feature for
robust ASR. Proceedings of ICASSP, pages I–193–196.
Morgan, N., Zhu, Q., Stolcke, A., Sonmez, K., Sivadas, S., Shinozaki, T., Ostendorf, M., Jain, P.,
Hermansky, H., Ellis, D., Doddington, G., Chen, B., Cretin, O., Bourlard, H., and Athineos, M.
(2005). Pushing the envelope - aside. IEEE Signal Processing Magazine, 22:81–88.
Wester, M. (2003). Syllable classification using articulatory-acoustic features. Proceedings of
Eurospeech.
Nasios, N. and Bors, A. (2005). Finding the number of clusters for nonparametric segmentation.
Lecture Notes in Computer Science, 3691:213–221.
Nasios, N. and Bors, A. (2006). Variational learning for Gaussian mixture models. IEEE Transactions on Systems, Man and Cybernetics - Part B: Cybernetics, 36(4):849–862.
Ostaszewska, D. and Tambor, J. (2000). Fonetyka i fonologia współczesnego jȩzyka Polskiego
(eng. Phonetics and phonology of modern Polish language). PWN.
Ostendorf, M., Digalakis, V. V., and Kimball, O. A. (1996). From HMM’s to segment models: A
unified view of stochastic modeling for speech recognition. IEEE Transactions on Speech and
Audio Processing, 4:360–378.
Pedersen, T., Patwardhan, S., and Michelizzi, J. (2004). Wordnet::similarity - measuring the relatedness of concepts. Proceedings of the Nineteenth National Conference on Artificial Intelligence (AAAI-2004), pages 1024–1025.
Piasecki, M. (2006). Hand-written and automatically extracted rules for Polish tagger. Lecture
Notes in Artificial Intelligence, Springer, W P. Sojka, I. Kopecek, K. Pala, eds. Proceedings of
Text, Speech, Dialogue:205–212.
Przepiórkowski, A. (2004). The IPI PAN Corpus: Preliminary version. IPI PAN.
Przepiórkowski, A. and Woliński, M. (2003). The unbearable lightness of tagging: A case study
in morphosyntactic tagging of Polish. Proceedings of the 4th International Workshop on Linguistically Interpreted Corpora (LINC-03), EACL 2003.
Rabiner, L. and Juang, B. H. (1993). Fundamentals of speech recognition. PTR Prentice-Hall,
Inc., New Jersey.
Rabiner, L. R. (1989). A tutorial on hidden Markov models and selected applications in speech
recognition. Proceedings of the IEEE, 77(2):257–286.
Rabiner, L. R. and Schafer, R. W. (1978). Signal Processing of Speech Signals. Prentice Hall,
Englewood-cliffs.
Raj, B. and Stern, R. M. (September 2005). Missing-feature approaches in speech recognition.
IEEE Signal Processing Magazine, pages 101–116.
Riccardi, G. and Hakkani-Tür, D. (2005). Active learning: Theory and applications to automatic
speech recognition. IEEE Transactions on Speech and Audio Processing, 13(4):504–511.
Rioul, O. and Vetterli, M. (1991). Wavelets and signal processing. IEEE Signal Processing Magazine, 8:11–38.
Russell, M. and Jackson, P. J. B. (2005). A multiple-level linear/linear segmental HMM with a
formant-based intermediate layer. Computer Speech and Language, 19:205–225.
Seco, N., Veale, T., and Hayes, J. (2004). An intrinsic information content metric for semantic
similarity in wordnet. Proceedings of ECAI’2004, the 16th European Conference on Artificial
Intelligence.
Steffen-Batóg, M. and Nowakowski, P. (1993). An algorithm for phonetic transcription of orthographic texts in Polish. Studia Phonetica Posnaniensia, 3.
Stöber, K. and Hess, W. (1998). Additional use of phoneme duration hypotheses in automatic
speech segmentation. Proceedings of ICSLP, Sydney, pages 1595–1598.
Subramanya, A., Bilmes, J., and Chen, C. P. (2005). Focused word segmentation for ASR. Proceedings of Interspeech 2005, pages 393–396.
Suh, Y. and Lee, Y. (1996). Phoneme segmentation of continuous speech using multi-layer perceptron. In Proceedings of ICSLP, Philadelphia, pages 1297–1300.
Tadeusiewicz, R. (1988). Sygnał mowy (eng. Speech Signal). Wydawnictwo Komunikacji i
Ła̧czności.
Tan, B. T., Lang, R., Schroder, H., Spray, A., and Dermody, P. (1994). Applying wavelet analysis
to speech segmentation and classification. H. H. Szu, editor, Wavelet Applications, volume Proc.
SPIE 2242, pages 750–761.
Hofmann, T. (1999). Probabilistic latent semantic analysis. Proceedings of Uncertainty in Artificial
Intelligence, UAI’99, Stockholm.
Toledano, D., Gómez, L., and Grande, L. (2003). Automatic phonetic segmentation. IEEE Transactions on Speech and Audio Processing, 11(6):617–625.
Tukey, J. W., Bogert, B. P., and Healy, M. J. R. (1963). The quefrency analysis of time series for
echoes: cepstrum, pseudo-autocovariance, cross-cepstrum, and saphe-cracking. Proceedings of
the Symposium on Time Series Analysis (M. Rosenblatt, Ed), pages 209–243.
van Rijsbergen, C. J. (1979). Information Retrieval. London: Butterworths.
Venkataraman, A. (2001). A statistical model for word discovery in transcribed speech. Computational Linguistics, 27.
Véronis, J. (2004). Hyperlex: lexical cartography for information retrieval. Computer Speech and
Language, 18(3):223–252.
Villing, R., Timoney, J., Ward, T., and Costello, J. (2004). Automatic blind syllable segmentation
for continuous speech. Proceedings of ISSC 2004, Belfast.
Viterbi, A. J. (1967). Error bounds for convolutional codes and an asymptotically optimum decoding algorithm. IEEE Transactions on Information Theory, 13(2):260–269.
Wang, D. and Narayanan, S. (2005). Piecewise linear stylization of pitch via wavelet analysis.
Proceedings of Interspeech, Lisboa, pages 3277–3280.
Watanabe, S., Minami, Y., Nakamura, A., and Ueda, N. (2004). Variational Bayesian estimation
and clustering for speech recognition. IEEE Transactions on Speech and Audio Processing,
12(4).
Weinstein, C. J., McCandless, S. S., Mondshein, L. F., and Zue, V. W. (1975). A system for
acoustic-phonetic analysis of continuous speech. IEEE Transactions on Acoustics, Speech and
Signal Processing, 23:54–67.
Wester, M. (2003). Pronunciation modeling for ASR - knowledge-based and data-derived methods.
Computer Speech and Language, 17:69–85.
Wester, M., Frankel, J., and King, S. (2004). Asynchronous articulatory feature recognition using
dynamic Bayesian networks. Proceedings of IEICI Beyond HMM Workshop.
Whittaker, E. and Woodland, P. (2003). Language modelling for russian and english using words
and classes. Computer Speech and Language, 17:87–104.
Woliński, M. (2004). System znaczników morfosyntaktycznych w korpusie ipi pan (Eng. The
system of morphological tags used in IPI PAN corpus). POLONICA, XII:39–54.
Wu, J. and Khudanpur, S. (2000). Efficient training methods for maximum entropy language
modelling. Proceedings of 6th International Conference on Spoken Language Technologies
(ICSLP-00).
Huang, X., Acero, A., and Hon, H.-W. (2001). Spoken Language Processing. Prentice Hall PTR, New
Jersey.
Tam, Y.-C. and Schultz, T. (2008). Correlated bigram LSA for unsupervised language model adaptation.
Proc. of Neural Information Processing Systems (NIPS), Vancouver.
Yannakoudakis, E. J. and Hutton, P. J. (1992). An assessment of n-phoneme statistics in phoneme
guessing algorithms which aim to incorporate phonotactic constraints. Speech Communication,
11:581 – 602.
Yapanel, U. and Dharanipragada, S. (2003).
Perceptual MVDR-based cepstral coefficients
(PMCCs) for robust speech recognition. Proceedings of ICASSP.
Young, S. (1996). Large vocabulary continuous speech recognition: a review. IEEE Signal Processing Magazine, 13(5):45–57.
Young, S., Evermann, G., Gales, M., Hain, T., Kershaw, D., Moore, G., Odell, J., Ollason, D., Povey, D., Valtchev, V., and Woodland, P. (2005). HTK Book. Cambridge University Engineering
Department, UK.
Zheng, C. and Yan, Y. (2004). Fusion based speech segmentation in DARPA SPINE2 task. Proceedings of ICASSP, Montreal, pages I–885–888.
Zhu, D. and Paliwal, K. K. (2004). Product of power spectrum and group delay function for speech
recognition. Proceedings of ICASSP.
Ziółko, B., Gałka, J., Manandhar, S., Wilson, R., and Ziółko, M. (2007). Triphone statistics for
Polish language. Proceedings of 3rd Language and Technology Conference, Poznań.
Ziółko, B., Manandhar, S., and Wilson, R. C. (2006a). Phoneme segmentation of speech. Proceedings of 18th International Conference on Pattern Recognition.
Ziółko, B., Manandhar, S., Wilson, R. C., and Ziółko, M. (2006b). Wavelet method of speech segmentation. Proceedings of 14th European Signal Processing Conference EUSIPCO, Florence.
Ziółko, B., Manandhar, S., Wilson, R. C., and Ziółko, M. (2008a). Language model based on pos
tagger. Proceedings of SIGMAP 2008 the International Conference on Signal Processing and
Multimedia Applications, Porto.
Ziółko, B., Manandhar, S., Wilson, R. C., and Ziółko, M. (2008b). Semantic modelling for speech
recognition. Proceedings of Speech Analysis, Synthesis and Recognition. Applications in Systems for Homeland Security, Piechowice, Poland.
Zue, V. W. (1985). The use of speech knowledge in automatic speech recognition. Proceedings of
the IEEE, 73:1602–1615.