P - DSP AGH

Department of Electronics
Akademia Górniczo-Hutnicza (AGH), Kraków
Speech Technology
Voice control: advantages
• conveying information by speech frees the operator's hands, which can be
used at the same time to manipulate objects or to enter data
Vocal tract
Voiced sounds are produced when vibration of the vocal cords modulates the
air flow from the lungs. The amplitude and the frequency characteristics of
a speech signal change over time as the resonant chambers of the human vocal
tract are continuously reconfigured.
[Figure: block diagram of the vocal tract model. Blocks: control, pitch
generation, throat tract, tract splitting, nose tract, mouth tract, impedance
of nostril radiation, impedance of mouth radiation, summation. Output: speech
signal.]
Speech is produced by a specific mechanism with many constraints (the human
vocal tract), so we can exploit these constraints in speech compression and
recognition.
Real vocal tract
[Figure: two panels, sound /u/ and sound /i/, plotting cross section [mm²]
against length [cm]; the throat cavity, the splitting of cavities, and the
mouth cavity are marked on each panel.]
Changes in the cross section of the human vocal tract from the throat cavity
to the mouth opening for selected Polish sounds
Model of the vocal tract
[Figure: tube model labeled with l and A, with the input of the acoustic wave
and the output of the acoustic wave.]
Speech signal and equal-loudness contours
[Figure: sound pressure [dB] versus frequency [kHz], showing the hearing
threshold (0 phons), the pain threshold (120 phons), and the speech region.]
A fragment of the grid of the objective decibel scale and the subjective phon
scale (equal-loudness contours), with the hearing threshold and the pain
threshold
Frequency analysis of speech
[Figure: spectrum of the word "osiem" (Polish for "eight"), amplitude versus
frequency [kHz].]
Time-frequency analysis of speech
Computer speech recognition system
[Figure: speech signal → automatic speech recognition system → sequence of
letters]
Automatic recognition of continuous speech
• Conversion of the acoustic speech signal into written text
• Used in:
– the office
– science
– industry
– the home
• No effective solutions exist for the Polish language
Market
ScanSoft makes Dragon NaturallySpeaking and dominates
the speech recognition market.
IBM had ViaVoice.
IBM claims to have put one hundred speech researchers on the
problem of taking automatic speech recognition (ASR) beyond the
level of human speech recognition by 2010.
Bill Gates is also making very large investments in speech
recognition research at Microsoft and has predicted that by 2011 the
quality of ASR will catch up with human speech recognition.
http://vista.dobreprogramy.pl/
Applications
Computer users can create and edit documents and interact with a computer more
quickly, because people can speak faster than they can type.
People who are poor typists can drastically increase their productivity.
Speaking to a computer is much faster and easier than typing!
Approaches
Constrained recognition
restricts the recognizable phrases to a small set of possible responses.
Dictation
transcribes speech word by word; it does not require semantic understanding,
because the goal is to identify the exact words.
Natural language recognition
allows the speaker to use natural, sentence-length utterances.
Difficulties
• Co-articulation of phonemes and words makes speech recognition difficult.
• Intonation and sentence stress play an important role in interpretation.
The utterances "go!", "go?" and "go." are clearly distinguished by a human
but are difficult for a computer.
• In naturally spoken language there are no pauses between words, so it is
difficult for a computer to decide where word boundaries lie.
Pronunciation
Polish language: Afganistan [g], agencja [g], wzmagać [g]
English language: Afghanistan [g], agency [dż], heighten [-]
German language: Afghanistan [g], Agentur [g] ([agentur]), steigen [g] ([sztajgen])
Many words in the English language sound alike (e.g. sun and son, night and knight).
"I helped Apple wreck a nice beach" sounds like "I helped Apple recognize speech".
Phones are context dependent: a phone with different left and right contexts has
different realizations.
A general solution requires human knowledge and experience, as well as advanced
pattern recognition and artificial intelligence.
Speech synthesizer "IVONA"
Middle ear implant
How the middle ear implant (VS) works:
The implant delivers the amplified sound directly to the ossicular chain,
rather than through the external auditory canal and the eardrum.
The brainstem implant is a device that evokes an auditory sensation through
electrical stimulation of the ventral cochlear nucleus in the brainstem. The
electrode array of the brainstem implant is placed in the lateral recess of
the fourth ventricle of the brain, near the ventral cochlear nucleus.
Diagram of a speech recognition system
[Figure: block diagram. Blocks: noise removal, speech segmentation, segment
parameterization, reference segments, selection of the best pattern,
correction of recognized words, syntactic correction, semantic correction,
correction of letter combinations. Knowledge sources: grammar, dictionary,
statistical language models.]
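Noise removal and speech signal improvement (speech enhancement)
[Figure: adaptive noise cancellation. The speaker and the noise give the noisy
signal x(n)=s(n)+v(n); a second input x(n)=v'(n) carries the noise alone and
feeds an adaptive filter with output v''(n); e(n) is the denoised signal.]
[Figure: spectral method. The speaker and the noise give the noisy signal
x(n)=s(n)+v(n); the spectral transformation yields X(jw); spectral
modification, supported by noise estimation, yields S(jw); the inverse
spectral transformation yields s(n), the denoised signal.]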
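The adaptive noise cancellation scheme with the signals x(n)=s(n)+v(n), v'(n), v''(n) and e(n) can be sketched as follows. The slide does not say which adaptation rule the filter uses; the normalized LMS update below is an assumption, and the function names, the test signals, and the noise path coefficients are hypothetical.

```python
import math
import random


def nlms_noise_canceller(noisy, reference, taps=8, mu=0.05, eps=1e-8):
    """Return e(n) = x(n) - v''(n) for each sample.

    noisy     -- x(n) = s(n) + v(n), speech plus noise
    reference -- v'(n), a noise-only signal correlated with v(n)
    The normalized LMS rule is an assumed choice, not taken from the slide.
    """
    weights = [0.0] * taps
    history = [0.0] * taps  # recent reference samples, newest first
    cleaned = []
    for x_n, ref_n in zip(noisy, reference):
        history = [ref_n] + history[:-1]
        v_hat = sum(w * r for w, r in zip(weights, history))  # v''(n)
        e_n = x_n - v_hat                                     # e(n)
        norm = sum(r * r for r in history) + eps
        weights = [w + mu * e_n * r / norm for w, r in zip(weights, history)]
        cleaned.append(e_n)
    return cleaned


def make_demo_signals(n=6000, seed=0):
    """Build synthetic s(n), v(n), x(n) and v'(n) for a quick experiment."""
    rng = random.Random(seed)
    reference = [rng.gauss(0.0, 1.0) for _ in range(n)]       # v'(n)
    path = [0.6, -0.3, 0.1]  # hypothetical path from noise source to microphone
    noise = [
        sum(h * reference[i - k] for k, h in enumerate(path) if i - k >= 0)
        for i in range(n)
    ]                                                          # v(n)
    speech = [0.5 * math.sin(2 * math.pi * 0.01 * i) for i in range(n)]  # s(n)
    noisy = [s + v for s, v in zip(speech, noise)]             # x(n)
    return speech, noise, noisy, reference
```

After the filter has converged, e(n) should follow s(n) much more closely than x(n) does, because the filter learns to reproduce v(n) from v'(n) while s(n) stays uncorrelated with the reference.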
Speech enhancement methods
• Adaptive noise cancellation (ANC)
• Beamforming
• Speech source separation (BSS)
• Denoising based on spectral methods and noise estimation
• Removal of echo and other interference
Word partitioning
[Figure: waveform, amplitude versus time [s], of the phrase "silne działanie
uboczne" with the segments /s'il/, /ne/, /dz'a/, /wa/, /n'e/, /u/, /bo/,
/tSne/ marked. Word durations: silne 450 ms, działanie 450 ms, uboczne 600 ms;
pause 110 ms.]
Nonlinear scale
$$f_{mel} = 1000 \log_2\left(1 + \frac{f_{Hz}}{1000}\right)$$
[Figure: two time-frequency plots against time [s], one with frequency in Hz
and one with frequency in mel.]
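The mel mapping on this slide, f_mel = 1000 log2(1 + f_Hz / 1000), can be tried out with a minimal sketch. The function names are hypothetical, and the inverse mapping is derived here by solving the same formula for f_Hz; it does not appear on the slide.

```python
import math


def hz_to_mel(f_hz):
    """Mel value for a frequency in Hz, using the formula from the slide."""
    return 1000.0 * math.log2(1.0 + f_hz / 1000.0)


def mel_to_hz(f_mel):
    """Inverse mapping, obtained by solving the slide formula for f_hz."""
    return 1000.0 * (2.0 ** (f_mel / 1000.0) - 1.0)


if __name__ == "__main__":
    for f_hz in (0, 500, 1000, 3000, 5000):
        print(f_hz, "Hz ->", round(hz_to_mel(f_hz), 1), "mel")
```

With this variant 1000 Hz maps to 1000 mel, and the scale is compressed above that point, so equal steps in mel cover ever wider ranges in Hz.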
Cepstrum
Verbally: the cepstrum is the FT of the log of the FT.
[Figure: processing chain: signal → FFT → frequency spectrum → squaring →
power spectrum → averaging → logarithm → FFT → cepstrum]
Many texts incorrectly state that the process is FT → log → IFT, i.e. that the cepstrum is
the "inverse Fourier transform of the log of the spectrum".
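A small numerical sketch may help with the FT versus IFT question. For a real signal the log power spectrum is real and even, so the forward and the inverse transform of it differ only by the scale factor 1/N. The code assumes NumPy, uses hypothetical function names, and skips the averaging step of the diagram; the circular echo is only a convenient test signal.

```python
import numpy as np


def power_cepstrum(signal, eps=1e-12):
    """Transform of the log power spectrum of a real signal.

    eps is an assumed guard against log(0); it is not part of the definition.
    """
    spectrum = np.fft.fft(signal)
    log_power = np.log(np.abs(spectrum) ** 2 + eps)
    return np.fft.ifft(log_power).real


def add_circular_echo(signal, delay, gain):
    """Add a delayed, scaled copy of the signal (circular shift for simplicity)."""
    return signal + gain * np.roll(signal, delay)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    dry = rng.standard_normal(4096)
    wet = add_circular_echo(dry, delay=50, gain=0.6)
    cepstrum = power_cepstrum(wet)
    # The echo shows up as a peak at the quefrency equal to the delay.
    print(int(np.argmax(cepstrum[10:2048])) + 10)
```

Echo detection of this kind is the application named in the title of the 1963 paper that introduced the term.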
Cepstrum
The term cepstrum was introduced by Bogert et al. and has come to be accepted
terminology for the inverse Fourier transform of the logarithm of the power
spectrum of a signal (L. R. Rabiner and R. W. Schafer, Digital Processing
of Speech Signals, Prentice Hall, Englewood Cliffs, NJ, 1978).
Etymology: "cepstrum" is an anagram of "spectrum", formed by reversing the first
four letters.
A cepstrum (pronounced "kepstrum") is the result of taking the Fourier transform of the
decibel spectrum as if it were a signal. There is a complex cepstrum and a real cepstrum.
The cepstrum was defined in a 1963 paper:
B. P. Bogert, M. J. R. Healy and J. W. Tukey: "The quefrency alanysis of time series for
echoes: cepstrum, pseudo-autocovariance, cross-cepstrum, and saphe-cracking". Proceedings of the
Symposium on Time Series Analysis (M. Rosenblatt, Ed.), Chapter 15, 209-243. New York: Wiley.
Wavelet spectra
[Figure: Daubechies phi and Daubechies psi functions of order 12 with their
Fourier transforms F(d12_phi(w)) and F(d12_psi(w)); three time-frequency
plots: frequency [Hz] versus time [s], scale a versus time b, and resolution m
versus time 2^-m n.]
STFT versus continuous and discrete wavelet spectrum for the word "osiem"
Speech frequency bands
Decomposition level   Frequency [Hz]   Discretization density
D1                    2756÷5512        2t
D2                    1378÷2756        4t
D3                    689÷1378         8t
D4                    345÷689          16t
D5                    172÷345          32t
D6                    86÷172           64t = 5.805 ms
Sampling frequency: f0 = 11025 Hz
The third column gives the discretization density, with t = 90.7 μs.
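The band edges and discretization densities in the table follow a dyadic pattern that can be reproduced from the sampling frequency alone. The sketch below assumes that level Dm covers f0/2^(m+1) to f0/2^m with a time step of 2^m t, where t = 1/f0; this rule is inferred from the table values, and the function name is hypothetical.

```python
def dyadic_bands(sampling_hz, levels):
    """List (level, low_hz, high_hz, step_s) for decomposition levels D1..Dn.

    Assumes level m spans sampling_hz / 2**(m + 1) .. sampling_hz / 2**m
    and is sampled every 2**m * t seconds, with t = 1 / sampling_hz.
    """
    t = 1.0 / sampling_hz
    rows = []
    for m in range(1, levels + 1):
        low = sampling_hz / 2 ** (m + 1)
        high = sampling_hz / 2 ** m
        rows.append((f"D{m}", low, high, 2 ** m * t))
    return rows


if __name__ == "__main__":
    for level, low, high, step in dyadic_bands(11025, 6):
        print(f"{level}: {low:7.1f} to {high:7.1f} Hz, step {step * 1e3:.3f} ms")
```

For f0 = 11025 Hz the sampling interval is t = 1/11025 s, about 90.7 μs, and the coarsest level D6 is sampled every 64t, about 5.805 ms.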
Phoneme segmentation
Hidden Markov Model
A Hidden Markov Model (HMM) is a statistical model in which the system being
modeled is assumed to be a Markov process with unknown parameters; the
challenge is to determine the hidden parameters from the observable parameters,
based on this assumption. The extracted model parameters can then be used for
further analysis, for example in speech recognition applications.
Speech recognition systems are generally based on HMMs. The statistical model gives the
probability of a word sequence given the observed acoustic data by applying Bayes' rule:
$$P(\text{word} \mid \text{acoustic}) = \frac{p(\text{acoustic} \mid \text{word})\, P(\text{word})}{p(\text{acoustic})}$$
The prior P(word) favors word sequences that are likely in the language, for example:
P(mushroom soup) > P(much rooms hope)
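How Bayes' rule picks between acoustically similar candidates can be shown with a toy calculation. All numbers below are invented for illustration and are not measurements; the function name is hypothetical, and p(acoustic) is approximated by summing over the listed candidates only.

```python
def posterior(likelihoods, priors):
    """Return P(word | acoustic) for each candidate via Bayes' rule.

    likelihoods -- p(acoustic | word) from the acoustic model
    priors      -- P(word) from the language model
    The evidence p(acoustic) is approximated by the sum over the candidates.
    """
    joint = {w: likelihoods[w] * priors[w] for w in likelihoods}
    evidence = sum(joint.values())
    return {w: p / evidence for w, p in joint.items()}


# Hypothetical values: the acoustic model slightly prefers the wrong phrase,
# but the language model considers it far less likely.
likelihoods = {"mushroom soup": 2.0e-3, "much rooms hope": 2.5e-3}
priors = {"mushroom soup": 1.0e-6, "much rooms hope": 1.0e-9}

post = posterior(likelihoods, priors)
best = max(post, key=post.get)
print(best)
```

Because p(acoustic) is the same for every candidate, a recognizer can compare the products p(acoustic | word) P(word) directly without computing it.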
Phoneme probabilities
The data were obtained from parliamentary (Sejm) speeches.
Most frequent biphones
Biphones
Most frequent triphones
Triphones
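Phoneme, biphone and triphone statistics of the kind named on these slides can be gathered by counting runs of neighboring phonemes in a transcribed corpus. The sketch below is an assumption about the general approach, not the procedure used for the parliamentary speech data; the function name and the toy phoneme sequence are hypothetical.

```python
from collections import Counter


def ngram_frequencies(phonemes, n):
    """Relative frequency of every run of n consecutive phonemes."""
    grams = Counter(
        tuple(phonemes[i:i + n]) for i in range(len(phonemes) - n + 1)
    )
    total = sum(grams.values())
    return {gram: count / total for gram, count in grams.items()}


if __name__ == "__main__":
    # Toy sequence; a real study would use a phonetically transcribed corpus.
    phonemes = ["s", "i", "l", "n", "e", "s", "i", "l"]
    biphones = ngram_frequencies(phonemes, 2)
    triphones = ngram_frequencies(phonemes, 3)
    print(sorted(biphones.items(), key=lambda item: -item[1])[:3])
    print(sorted(triphones.items(), key=lambda item: -item[1])[:3])
```

Setting n to 1, 2 or 3 gives phoneme, biphone and triphone frequencies respectively.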
"Text Speech and Dialogue'2001"
Author: Jordan Cohen
Title:
A Historical Perspective on Modern Speech
Abstract:
Science is the study of nature through models, measurement, and prediction.
When models become overly complicated, and predictions do not improve with
further complication, it is often necessary to reconsider the basic assumptions of
the models to make progress. This reconsideration happened in astronomy at
the time of Copernicus and Kepler. I will draw a parallel to the current
situation in speech recognition, and will argue that it is time for
reconsideration of the basic models and methods.
TWIST
Trying Wacky (silly) Ideas for Speech Technology
Marc Blasband - ELSNET, April 1998, p. 9
It is becoming clear that the HMM model is reaching the
limit of its possibilities.
The current state and quality of HMM-based work is the
result of enormous amounts of time and effort spent by
many researchers worldwide, and is therefore hard to
abandon.
The chances of one group being able to create something
that can compete with the results of the years of work that
went into HMM are very small.
A breakthrough is necessary ...
Thank you for your attention