VQ Source Models:
Perceptual & Phase Issues
Dan Ellis & Ron Weiss
Laboratory for Recognition and Organization of Speech and Audio
Dept. Electrical Eng., Columbia Univ., NY USA
{dpwe,ronw}@ee.columbia.edu
http://labrosa.ee.columbia.edu/
1. Source Models for Separation
2. VQ with Perceptual Weighting
3. Phase and Resynthesis
4. Conclusions
Single-Channel Scene Analysis
• How to separate overlapping sounds?
[Figure: spectrograms of two isolated sources and their mixture; freq / kHz vs. time / s, level / dB]
underconstrained: infinitely many decompositions
time-frequency overlaps cause obliteration
.. no obvious segmentation of sources (?)
Scene Analysis as Inference
• Ideal separation is rarely possible
i.e. no projection can guarantee to remove overlaps
• Overlaps → Ambiguity
scene analysis = find “most reasonable” explanation
• Ambiguity can be expressed probabilistically
i.e. posteriors of sources {Si} given observations X:
P({Si} | X) ∝ P(X | {Si}) · P({Si})
              combination physics · source models
• Better source models → better inference
.. learn from examples?
Vector-Quantized (VQ) Source Models
• “Constraint” of source can be captured
explicitly in a codebook (dictionary):
x(t) ≈ c_{i(t)} where i(t) ∈ {1 … N}
defines the ‘subspace’ occupied by source
• Codebook minimizes distortion (MSE)
by k-means clustering
[Figure: VQ800 codebook, linear distortion measure (greedy ordering); freq / mel band vs. codebook index (1–800), level / dB]
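As a rough illustration (not from the slides), a codebook like this could be trained with off-the-shelf k-means; `S`, `train_codebook`, and `quantize` are invented names, and the data is assumed to be log-magnitude mel spectra as a frames × bands matrix:

```python
# A minimal sketch of training a VQ source model by k-means, assuming a
# matrix S of log-magnitude mel spectra (frames x bands) is precomputed.
import numpy as np
from scipy.cluster.vq import kmeans2, vq

def train_codebook(S, n_codewords=800, seed=0):
    """k-means codebook: codewords minimize MSE distortion over frames."""
    codebook, _ = kmeans2(S, n_codewords, minit='++', seed=seed)
    return codebook

def quantize(S, codebook):
    """Assign each frame to its nearest codeword index i(t)."""
    idx, dist = vq(S, codebook)
    return idx, dist
```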
Simple Source Separation
• Given models for sources,
find “best” (most likely) states for spectra:
combination model:
  p(x | i1, i2) = N(x; c_{i1} + c_{i2}, Σ)
inference of source state:
  {i1(t), i2(t)} = argmax_{i1,i2} p(x(t) | i1, i2)
can include sequential constraints...
different domains for combining c and defining …
• E.g. stationary noise:
[Figure: mel spectrograms of original speech, speech in speech-shaped noise (mel magsnr = 2.41 dB), and VQ inferred states (mel magsnr = 3.6 dB); freq / mel bin vs. time / s]
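A minimal sketch of this per-frame inference, assuming Σ = I so the argmax reduces to the nearest summed codewords in MSE; `cb1`, `cb2` (codewords × bands) and mixture frames `X` (frames × bands) are illustrative names:

```python
import numpy as np

def infer_states(X, cb1, cb2):
    # all pairwise combinations c_i1 + c_i2 -> shape (N1, N2, bands)
    combos = cb1[:, None, :] + cb2[None, :, :]
    states = np.empty((len(X), 2), dtype=int)
    for t, x in enumerate(X):
        err = ((combos - x) ** 2).sum(axis=-1)   # distortion for every pair
        states[t] = np.unravel_index(np.argmin(err), err.shape)
    return states
```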
Codebook Size
• Two (main) variables:
o number of codewords
o amount of training data
• Measure average accuracy (distortion):
[Figure: VQ magnitude fidelity (dB SNR) vs. codebook size (100–3000), with curves for 600 to 16000 training frames]
o main effect of …
o larger codebooks need/allow more data
o (large absolute distortion values)
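The fidelity measure plotted here can be sketched as the SNR of quantized magnitudes on held-out frames; this follows the codebook conventions of the earlier sketch and all names are illustrative:

```python
import numpy as np
from scipy.cluster.vq import vq

def vq_fidelity_db(S_test, codebook):
    """SNR (dB) of VQ-quantized spectra relative to the originals."""
    idx, _ = vq(S_test, codebook)
    S_hat = codebook[idx]
    signal = np.sum(S_test ** 2)
    noise = np.sum((S_test - S_hat) ** 2)
    return 10 * np.log10(signal / noise)
```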
Distortion Metric
• Standard MSE gives equal weight by channel
excessive emphasis on high frequencies
• Try e.g. Mel spectrum
approx. log spacing of frequency bins
[Figure: the same utterance on a linear frequency axis (freq / kHz, 0–8) and a Mel frequency axis (freq / Mel band) vs. time / s]
• Little effect (?):
[Figure: VQ800 codebooks trained with the linear distortion measure vs. the Mel/cube-root distance; freq / mel band vs. codebook index, level / dB]
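A sketch of one way such a Mel/cube-root distance could be computed, assuming linear-frequency power spectra `P` (frames × FFT bins); the filterbank comes from librosa, and the function names are invented:

```python
import numpy as np
import librosa

def mel_cube_root(P, sr=16000, n_fft=512, n_mels=80):
    """Mel-warp then cube-root compress spectra before measuring distortion."""
    mel_fb = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels)
    return (P @ mel_fb.T) ** (1.0 / 3.0)

def distance(p1, p2, **kw):
    """MSE in the Mel/cube-root domain."""
    return ((mel_cube_root(p1, **kw) - mel_cube_root(p2, **kw)) ** 2).mean()
```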
Resynthesis Phase
• Codewords quantize spectrum magnitude
phase has arbitrary offset due to STFT grid
• Resynthesis (ISTFT) requires phase info
use mixture phase? no good for filling-in
• Spectral peaks indicate common
instantaneous frequency (∂φ/∂t)
can quantize and accumulate in resynthesis
.. like the “phase vocoder”
[Figure: Code 924 magnitudes (21 training frames; level / dB) and instantaneous frequency (Inst Freq / Hz) vs. freq / Hz, 0–1000 Hz]
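A minimal sketch of phase-vocoder-style accumulation: each frame's quantized per-bin instantaneous frequency advances a running phase by one hop. `if_hz` (frames × bins), `hop` (samples), and `sr` (Hz) are illustrative names:

```python
import numpy as np

def accumulate_phase(if_hz, hop, sr):
    phase = np.zeros_like(if_hz)
    for t in range(1, len(if_hz)):
        # advance each bin's phase by its instantaneous frequency over one hop
        phase[t] = phase[t - 1] + 2 * np.pi * if_hz[t] * hop / sr
    return phase

# complex spectrum for the ISTFT:
# X = mags * np.exp(1j * accumulate_phase(if_hz, hop, sr))
```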
Resynthesis Phase (2)
• Can also improve phase iteratively
repeat:
  X^(1)(t,f) = |X̂(t,f)| · exp{ j φ^(1)(t,f) }
  x^(1)(t) = istft{ X^(1)(t,f) }
  φ^(2)(t,f) = ∠ stft{ x^(1)(t) }
goal: |X^(n)(t,f)| = |X̂(t,f)|
• Visible benefit:
[Figure: iterative phase estimation; SNR of reconstruction (dB) vs. iteration number (0–8): magnitude quantization only = 3.4 dB; starting from random phase reaches 4.7 dB; starting from quantized phase reaches 7.6 dB]
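A minimal sketch of this Griffin & Lim-style iteration using scipy's stft/istft, assuming `mag` is the target |X̂| (bins × frames) and `phase` the starting phase (quantized or random); parameters are illustrative:

```python
import numpy as np
from scipy.signal import stft, istft

def iterate_phase(mag, phase, fs=16000, nperseg=512, n_iter=8):
    for _ in range(n_iter):
        X = mag * np.exp(1j * phase)               # impose target magnitudes
        _, x = istft(X, fs=fs, nperseg=nperseg)    # back to time domain
        _, _, X2 = stft(x, fs=fs, nperseg=nperseg)
        n = min(mag.shape[1], X2.shape[1])         # guard against frame-count drift
        mag, phase = mag[:, :n], np.angle(X2)[:, :n]  # keep only re-estimated phase
    return istft(mag * np.exp(1j * phase), fs=fs, nperseg=nperseg)[1]
```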
Evaluating Model Quality
• Low distortion is not really the goal;
models are to constrain source separation
fit source spectra but reject non-source signals
• Include sequential constraints
e.g. transition matrix for codewords
.. or smaller HMM with distributions over codebook
• Best way to evaluate is via a task
e.g. separating speech from noise
[Figure: mel spectrograms of speech at 0 dB SNR in speech-shaped noise (magsnr = 1.8 dB), direct VQ (magsnr = -0.6 dB), and HMM-smoothed VQ (magsnr = -2.1 dB); freq / bins vs. time / s, level / dB]
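A minimal sketch of the "transition matrix for codewords" idea: Viterbi decoding over codeword indices. It assumes per-frame log-likelihoods `loglik` (frames × N) and log transitions `logA` (N × N); all names are illustrative:

```python
import numpy as np

def viterbi_codewords(loglik, logA):
    T, N = loglik.shape
    delta = loglik[0].copy()
    back = np.zeros((T, N), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + logA   # scores[i, j]: come from i, land in j
        back[t] = scores.argmax(axis=0)  # best predecessor for each state
        delta = scores.max(axis=0) + loglik[t]
    # trace back the smoothed codeword sequence
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return path[::-1]
```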
Future Directions
• Factorized codebooks
codebooks too large due to combinatorics
separate codebooks for type, formants, excitation?
[Figure: spectrum decomposed into low, mid, high components; level / dB vs. freq / Hz, 0–8000]
• Model adaptation
many speaker-dependent models, or...
single speaker-adapted model, fit to each speaker
• Using uncertainty
enhancing noisy speech for listeners:
use special tokens to preserve uncertainty
Summary
• Source models permit separation of
underconstrained mixtures
or at least inference of source state
• Explicit codebooks need to be large
.. and chosen to optimize perceptual quality
• Resynthesis phase can be quantized
.. using “phase vocoder” derivative
.. iterative re-estimation helps more
Extra Slides
Other Uses for Source Models
• Projecting into the model’s space:
Restoration / Extension
inferring missing parts
• Generation / Copying:
adaptation + fitting
Example 2: Mixed Speech Recog.
• Cooke & Lee’s Speech Separation Challenge
short, grammatically-constrained utterances:
<command:4><color:4><preposition:4><letter:25><number:10><adverb:4>
e.g. "bin white at M 5 soon" (t5_bwam5s_m5_bbilzp_6p1.wav)
• IBM’s “superhuman” recognizer:
o Model individual speakers (512 mix GMM)
o Infer speakers and gain
o Reconstruct speech
o Recognize as normal...
Kristjansson et al., Interspeech’06
[Figure 1: Word error rates (wer / %) for the a) Same Talker, b) Same Gender, and c) Different Gender cases, at SNRs from 6 dB to -9 dB, comparing SDL Recognizer, No dynamics, Acoustic dyn., Grammar dyn., and Human listeners]
• Grammar constraints a big help
Model-Based Separation
• Central idea:
Employ strong learned constraints
to disambiguate possible sources:
{Si} = argmax_{Si} P(X | {Si})
Varga & Moore ’90, Roweis ’03 ...
• e.g. MAXVQ: fit speech-trained Vector-Quantizer to mixed spectrum;
separate via T-F mask (again)
[Figure: MAXVQ denoising results, from Roweis’03 (frequency vs. time spectrograms). Training: 300 sec of isolated speech (TIMIT) to fit 512 codewords, and 100 sec of isolated noise (NOISEX) to fit 32 codewords; testing on new signals at 0 dB SNR]
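A sketch loosely in the spirit of this VQ-based T-F masking (not the MAXVQ algorithm itself): pick the nearest speech and noise codewords per frame, then keep mixture cells where the speech estimate dominates. `mix_mag`, `speech_cb`, and `noise_cb` are illustrative names:

```python
import numpy as np
from scipy.cluster.vq import vq

def tf_mask_separate(mix_mag, speech_cb, noise_cb):
    log_mix = np.log(mix_mag + 1e-8)
    si, _ = vq(log_mix, speech_cb)        # nearest speech codeword per frame
    ni, _ = vq(log_mix, noise_cb)         # nearest noise codeword per frame
    mask = speech_cb[si] > noise_cb[ni]   # binary mask: speech dominates
    return mix_mag * mask
```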
Separation or Description?
• Are isolated waveforms required?
clearly sufficient, but may not be necessary
not part of perceptual source separation!
• Integrate separation with application?
e.g. speech recognition:
[Diagram: separate-then-recognize — mix → separation (t-f masking + resynthesis; identify target energy) → ASR (speech models) → words; vs. joint — mix → identify / find best words model (speech models + source knowledge) → words]
output = abstract description of signal
Evaluation
• How to measure separation performance?
depends what you are trying to do
• SNR?
energy (and distortions) are not created equal
different nonlinear components [Vincent et al. ’06]
• Intelligibility?
rare for nonlinear processing to improve intelligibility
listening tests expensive
• ASR performance?
separate-then-recognize too simplistic;
ASR needs to accommodate separation
[Diagram: net effect vs. aggressiveness of processing — transmission errors and increased artefacts grow while interference is reduced, giving an optimum in between]