P - DSP AGH
Department of Electronics, Akademia Górniczo-Hutnicza (AGH), Kraków
Speech technology

Voice control: advantages
Conveying information by speech frees the operator's hands, which can at the same time be used to manipulate objects or to enter data.

Vocal tract
Voiced sounds are produced when vibration of the vocal cords modulates the air flow from the lungs. The amplitude and the frequency characteristics of a speech signal change over time because the resonant chambers of the human vocal tract are continuously reconfigured.
[Figure: block diagram of the vocal tract. Blocks: control, pitch generation, throat tract, tract splitting, nose tract, mouth tract, impedance of nostril radiation, impedance of mouth radiation, summation, speech signal.]
Speech is produced by a specific mechanism that has many constraints (the human vocal tract), so we can exploit such constraints in speech compression and recognition.
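As a rough illustration of the source and resonant-chamber view described above, the sketch below drives two cascaded resonators with a pulse train. It is not taken from the slides: the function names, the 8 kHz sampling rate, the 100 Hz pitch and the two resonance frequencies (300 Hz and 2300 Hz) are assumptions chosen only for illustration.

```python
import math

def impulse_train(f0_hz, fs_hz, n_samples):
    """Stand-in for vocal cord excitation: one unit impulse per pitch period."""
    period = int(round(fs_hz / f0_hz))
    return [1.0 if n % period == 0 else 0.0 for n in range(n_samples)]

def resonator(x, freq_hz, bandwidth_hz, fs_hz):
    """Two-pole resonator, a stand-in for one resonant chamber."""
    r = math.exp(-math.pi * bandwidth_hz / fs_hz)   # pole radius from bandwidth
    theta = 2.0 * math.pi * freq_hz / fs_hz         # pole angle from frequency
    a1 = -2.0 * r * math.cos(theta)
    a2 = r * r
    y, y1, y2 = [], 0.0, 0.0
    for xn in x:
        yn = xn - a1 * y1 - a2 * y2                 # y[n] = x[n] - a1*y[n-1] - a2*y[n-2]
        y.append(yn)
        y2, y1 = y1, yn
    return y

def magnitude_at(x, freq_hz, fs_hz):
    """Magnitude of the DFT of x evaluated at a single frequency."""
    w = 2.0 * math.pi * freq_hz / fs_hz
    re = sum(v * math.cos(w * n) for n, v in enumerate(x))
    im = sum(v * math.sin(w * n) for n, v in enumerate(x))
    return math.hypot(re, im)

FS = 8000                                # assumed sampling rate [Hz]
source = impulse_train(100, FS, 800)     # assumed 100 Hz pitch, 0.1 s of signal
vowel_like = source
for formant_hz, bw_hz in [(300, 80), (2300, 120)]:  # illustrative values only
    vowel_like = resonator(vowel_like, formant_hz, bw_hz, FS)
```

Changing the resonator frequencies while keeping the same excitation changes the spectral envelope of the output, which mirrors the role the slide assigns to the reconfiguration of the resonant chambers.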

Real vocal tract
[Figure: two panels, sound /i/ and sound /u/. Horizontal axis: length [cm], 0 to 15. Vertical axis: cross section [mm2]. Marked regions: throat cavity, splitting of cavities, mouth cavity.]
Changes of the cross section of the human vocal tract from the throat cavity to the mouth opening for selected Polish sounds.

Model of vocal tract
[Figure: vocal tract model with labels l, A, input of acoustic wave, output of acoustic wave.]

Speech signal and equal-loudness contours
[Figure: sound pressure level [dB] versus frequency [kHz]. Marked: threshold of hearing (0 phons), threshold of pain (120 phons), speech region.]
Fragment of the grid of the objective decibel scale and the subjective phon scale: equal-loudness contours, threshold of hearing and threshold of pain.

Frequency analysis of speech
[Figure: amplitude versus frequency [kHz], 0 to 5.5 kHz, for the word „osiem” (Polish for "eight").]

Time-frequency analysis of speech

Computer speech recognition system
Speech signal → automatic speech recognition system → sequence of letters

Automatic recognition of continuous speech
• Conversion of the acoustic speech signal into written text
• Use in:
– the office
– science
– industry
– the home
• No effective solutions for the Polish language

Market
ScanSoft makes Dragon NaturallySpeaking and dominates the speech recognition market. IBM had ViaVoice. IBM claims to have put one hundred speech researchers on the problem of taking automatic speech recognition (ASR) beyond the level of human speech recognition by 2010.
Bill Gates is also making very large investments in speech recognition research at Microsoft and predicted that by 2011 the quality of ASR will catch up with human speech recognition.
http://vista.dobreprogramy.pl/

Applications
Computer users can create and edit documents and interact with a computer more quickly, because people can speak faster than anyone can type. People who are poor typists can drastically increase their productivity.
Speaking to a computer is much faster and easier than typing!

Approaches
Constrained recognition limits the phrases that can be recognized to a small set of possible responses.
Dictation transcribes speech word by word. It does not require semantic understanding; the goal is to identify the exact words.
Natural language recognition allows the speaker to use natural, sentence-length utterances.

Difficulties
• Co-articulation of phonemes and words makes the task of speech recognition difficult.
• Intonation and sentence stress play an important role in interpretation. The utterances "go!", "go?" and "go." are clearly distinguished by a human but are difficult for a computer.
• In naturally spoken language there are no pauses between words, so it is difficult for a computer to decide where word boundaries lie.

Pronunciation
Pronunciation of the letter "g" in similar words:
Polish language    Afganistan [g]     agencja [g]               wzmagać [g]
English language   Afghanistan [g]    agency [dż]               heighten [-]
German language    Afghanistan [g]    Agentur [g] ([agentur])   steigen [g] ([sztajgen])
Many words in English sound alike (e.g. sun and son, night and knight).
"I helped Apple wreck a nice beach" sounds like "I helped Apple recognize speech".
Phones are context dependent: a phone with different left and right contexts has different realizations.
A general solution requires human knowledge and experience, as well as advanced pattern recognition and artificial intelligence.

Speech synthesizer „IVONA”

Middle ear implant
How the middle ear implant (VS) works: the implant transmits the amplified sound directly to the ossicular chain, rather than through the external auditory canal and the eardrum.

The brainstem implant is a device that evokes an auditory sensation through electrical stimulation of the ventral cochlear nucleus in the brainstem. The electrode array of the brainstem implant is placed in the lateral recess of the fourth ventricle of the brain, in the region of the ventral cochlear nucleus.

Diagram of a speech recognition system
[Figure: block diagram. Blocks: noise removal, speech segmentation, segment parameterization, reference segments, selection of the best pattern, correction of recognized words, grammar, dictionary, statistical language models, syntactic correction, semantic correction, correction of letter connections.]

Noise removal and speech signal improvement (speech enhancement)
[Figure: two block diagrams.
Adaptive filter scheme: the speaker gives the noisy signal x(n) = s(n) + v(n); the interference gives the reference x(n) = v'(n), which is fed to an adaptive filter with output v''(n); e(n) is the denoised signal.
Spectral scheme: the speaker and the interference give the noisy signal x(n) = s(n) + v(n); spectral transformation gives X(jw); noise estimation and spectral modification give S(jw); inverse spectral transformation gives the denoised signal s(n).]

Speech enhancement methods
• Adaptive noise cancellation (ANC)
• Beamforming
• Blind separation of speech sources (BSS)
• Denoising based on spectral methods and noise estimation
• Removal of echo and other interference

Word partitioning
[Figure: amplitude versus time [s], 0 to 1.6 s, for the phrase „silne działanie uboczne” (Polish for "strong side effect"). Word durations: silne 450 ms, działanie 450 ms, uboczne 600 ms; pause 110 ms. Marked segments: /s’il/ /ne/ /dz'a/ /wa/ /n’e/ /u/ /bo/ /tSne/.]

Nonlinear scale
$f_{mel} = 1000 \log_2\left(1 + \frac{f_{Hz}}{1000}\right)$
[Figure: two spectrograms of the same signal, frequency [Hz] versus time [s] and frequency [Mel] versus time [s].]

Cepstrum
Verbally: the cepstrum is the FT of the log of the FT.
Signal → FFT → frequency spectrum → squaring → power spectrum → averaging → logarithm → FFT → cepstrum
Many texts incorrectly state that the process is FT → log → IFT, i.e. that the cepstrum is the "inverse Fourier transform of the log of the spectrum".

Cepstrum
The term cepstrum was introduced by Bogert et al. and has come to be accepted terminology for the inverse Fourier transform of the logarithm of the power spectrum of a signal.
(L. R. Rabiner and R. W. Schafer, Digital Processing of Speech Signals, Prentice Hall, Englewood Cliffs, NJ, 1978)
Etymology: "cepstrum" is an anagram of "spectrum", formed by reversing the first four letters.
A cepstrum (pronounced "kepstrum") is the result of taking the Fourier transform of the decibel spectrum as if it were a signal. There is a complex cepstrum and a real cepstrum.
The cepstrum was defined in a 1963 paper: Tukey, J. W., B. P. Bogert and M. J. R. Healy: "The quefrency alanysis of time series for echoes: cepstrum, pseudo-autocovariance, cross-cepstrum, and saphe-cracking". Proceedings of the Symposium on Time Series Analysis (M. Rosenblatt, Ed.), Chapter 15, 209-243. New York: Wiley.

Wavelet spectra
[Figure: wavelet spectrum, frequency [Hz] versus time [s]; Daubechies phi and psi of order 12 and their Fourier transforms F(d12_phi(w)) and F(d12_psi(w)); scale a versus time b; resolution m versus time 2^{-m} n.]

STFT versus continuous and discrete wavelet spectrum for the word „osiem”

Frequency bands of speech
Decomposition level   Frequency [Hz]   Discretization density
D1                    2756÷5512        2t
D2                    1378÷2756        4t
D3                    689÷1378         8t
D4                    345÷689          16t
D5                    172÷345          32t
D6                    86÷172           64t = 5.805 ms
The sampling frequency $f_0 = 11025$ Hz corresponds to a discretization density $t = 90.7$ μs.

Phoneme segmentation

Hidden Markov Model
A Hidden Markov Model (HMM) is a statistical model in which the system being modeled is assumed to be a Markov process with unknown parameters. The challenge is to determine the hidden parameters from the observable parameters, based on this assumption.
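One common way to recover hidden states from observations, once the model parameters are known, is the Viterbi algorithm. The sketch below is a minimal illustration on a hypothetical two-state model; the state names, the observation symbols and all probabilities are invented for the example and do not come from the slides.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Most likely hidden state sequence of a discrete HMM and its probability."""
    # best[t][s]: probability of the best path that ends in state s at time t
    best = [{s: start_p[s] * emit_p[s][observations[0]] for s in states}]
    back = [{}]  # back[t][s]: predecessor of s on that best path
    for t in range(1, len(observations)):
        best.append({})
        back.append({})
        for s in states:
            prev, p = max(
                ((q, best[t - 1][q] * trans_p[q][s]) for q in states),
                key=lambda item: item[1],
            )
            best[t][s] = p * emit_p[s][observations[t]]
            back[t][s] = prev
    # Trace the best path backwards from the most probable final state.
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for t in range(len(observations) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path, best[-1][last]

# Hypothetical toy model: hidden states are silence/speech,
# observations are coarse frame energies.
states = ("sil", "speech")
start_p = {"sil": 0.8, "speech": 0.2}
trans_p = {
    "sil": {"sil": 0.7, "speech": 0.3},
    "speech": {"sil": 0.2, "speech": 0.8},
}
emit_p = {
    "sil": {"low": 0.9, "high": 0.1},
    "speech": {"low": 0.3, "high": 0.7},
}

path, prob = viterbi(["low", "high", "high", "low"], states, start_p, trans_p, emit_p)
```

In a recognizer the states would typically correspond to parts of phonemes and the observations to parameter vectors of speech segments; the toy symbols here only keep the example short.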
The extracted model parameters can then be used for further analysis, for example in speech recognition applications.
Speech recognition systems are generally based on HMMs.
The statistical model scores a word hypothesis given an observed sequence of acoustic data by applying Bayes' rule:
$P(\text{word} \mid \text{acoustic}) = \frac{p(\text{acoustic} \mid \text{word})\, P(\text{word})}{p(\text{acoustic})}$
P(mushroom soup) > P(much rooms hope)

Phoneme probabilities
The data were obtained from speeches given in the Sejm (the Polish parliament).

Most frequent diphones

Diphones

Most frequent triphones

Triphones

„Text, Speech and Dialogue 2001”
Author: Jordan Cohen
Title: A Historical Perspective on Modern Speech
Abstract: Science is the study of nature through models, measurement, and prediction. When models become overly complicated, and predictions do not improve with further complication, it is often necessary to reconsider the basic assumptions of the models to make progress. This reconsideration happened in astronomy at the time of Copernicus and Kepler. I will draw a parallel to the current situation in speech recognition, and will argue that it is time for reconsideration of the basic models and methods.

TWIST: Trying Wacky (silly) Ideas for Speech Technology
Marc Blasband, ELSNET, April 1998, p. 9
It is becoming clear that the HMM model is reaching the limit of its possibilities.
The current state and quality of HMM-based work is the result of enormous amounts of time and effort spent by many researchers worldwide, and is therefore hard to abandon.
The chances of one group being able to create something that can compete with the results of the years of work that went into HMM are very small.
A breakthrough is necessary ...

Thank you for your attention