The Manifold Nature of Vowel Sounds
by Aren Jansen
Master’s Paper
Advisor: Partha Niyogi
Department of Computer Science
The University of Chicago
September 14, 2007
Abstract
Recently there has been great interest in geometrically motivated approaches to
data analysis and pattern recognition. Low-dimensional structure in higher-dimensional data can be exploited by manifold-based data reduction and learning algorithms to improve performance. The existence of such a structure
in speech has not been formally documented. Toward this end, I present a
derivation of the approximate discrete spectra of sustained vowel phonemes
using standard tube models of the vocal tract. Adopting a geometrical approach, each N-point discrete frequency spectrum produced using these models represents a point in R^N. Given a continuous range of vocal tract model
parameters, I demonstrate, either formally or graphically, that the subsets of
Euclidean space traced out by the positions of the resulting spectra form low-dimensional, extrinsically curved manifolds that span the ambient space. Tube model
parameters that approximate the vocal tract configurations for various vowel
phonemes determine the approximate manifold structure for several sustained
vowel sounds. Using the manifolds for the phonemes /a/ and /æ/ as input,
the manifold-based Laplacian eigenmap dimensionality reduction algorithm
of Belkin and Niyogi [1] outperforms traditional principal component analysis, both with and without the introduction of noise.
Contents

Introduction

1 The Physics of Acoustic Tubes
  1.1 Single Tube Analysis
      1.1.1 Continuity and Conservation of Mass
      1.1.2 The Wave Equation and Boundary Conditions
      1.1.3 The General Solution
      1.1.4 The Sinusoidal Source Solution
  1.2 Multiple Tube Analysis
      1.2.1 The N-Tube General Solution
      1.2.2 The N-Tube Solution for a Sinusoidal Source

2 Speech Sound Generator Software
  2.1 Simulation Mode
  2.2 Data Mode
  2.3 Implementation Details

3 The Manifold of Acoustic Tube Solutions
  3.1 The Single Tube Solution Manifold
  3.2 The 2-Tube Solution Manifold

4 Tube Models of Vowel Production
  4.1 Introducing the Glottal Source
  4.2 2-Tube Models of Vowel Production
  4.3 N-Tube Models of Vowel Production
  4.4 Fundamental Frequency Variation
  4.5 Frequency Sampling Normalization

5 Vowel Manifolds in the Graph Laplacian Eigenbasis
  5.1 The Laplacian Eigenbasis
  5.2 Dimensionality Reduction
  5.3 Noise on the Manifold

Conclusion
List of Figures

1.1 Schematic plot of Ks(f).

2.1 Speech sound generator software in simulation mode.
2.2 Speech sound generator software in data mode.

3.1 Sample plot of M1(L1, L2).
3.2 Sample plot of M1(L1, L2) (zoomed on origin).
3.3 Sample plot of M2(L1, L2).
3.4 Sample plot of M2(IRL, IRA).

4.1 (a) Glottal source spectrum and (b) corresponding waveform where f0 = 100 Hz.
4.2 Schematic plot of the vowel structure within the manifold of acoustic two-tube model solutions, M2.
4.3 Amplitude spectrum for the two-tube /a/ configuration given in Table 4.1.
4.4 Principal component plots of M2(P; f0, L1, L2) for each phoneme individually.
4.5 Principal component plot of M2^vowel(f0, L1, L2).
4.6 Principal component plot of M2(f0, L, A), where colors differentiate length ratios.
4.7 Principal component plot of M2(f0, L, A), where colors differentiate area ratios.
4.8 Principal component plots of MN(P; f0, L1, L2) for each phoneme individually.
4.9 Principal component plot of MN^vowel(f0, L1, L2).
4.10 Principal component plots of M2(P; F1, F2, L1, L2) for each phoneme.
4.11 Principal component plots of M2(P; T, fref, K, F1, F2, L1, L2) for each phoneme individually.

5.1 Principal component plot of M2^a ∪ M2^æ.
5.2 Graph Laplacian eigenfunction projection of M2^a ∪ M2^æ (c.f. Figure 5.1).
5.3 Amplitude spectrum of Figure 4.3 after the addition of noise with SNR = 18 dB.
5.4 Principal component projection of M2^a ∪ M2^æ with the introduction of noise at S/N = 18 dB.
5.5 Nearest neighbors graph Laplacian eigenfunction projection of M2^a ∪ M2^æ with the introduction of noise at S/N = 18 dB.
5.6 Principal component projection of M2^a ∪ M2^æ with the introduction of noise at S/N = 8 dB.
5.7 Nearest neighbors graph Laplacian eigenfunction projection of M2^a ∪ M2^æ with the introduction of noise at S/N = 8 dB.
5.8 The linear separability of four data representations, as a function of signal-to-noise ratio.
Introduction
In the past 45 years, the fields of speech acoustics and speech recognition have
for the most part developed independently. Focused presentation of speech
acoustics dates back to Gunnar Fant in his 1960 book Acoustic Theory of Speech
Production, where he develops several physical models of speech production.
Fant adopts the source-filter approach to speech production, which assumes
that a speech signal can be uniquely specified by independent source and filter
characteristics. The traditional sources include glottal excitations and turbulent flow at constrictions. The vocal tract acts as the acoustic filter that, when
applied to the source, determines the output signal.
The simplest filter model presented is the twin-tube resonator, which approximates the vocal tract as a pair of concatenated cylindrical tubes. The relative cross section and length proportions of the tubes determine the filter characteristics. Fant also presents models for several other production phenomena,
such as nasal coupling and tongue position. Since 1960, Fant and others have
developed more linguistically motivated acoustic models, covering all classes
of phonemes [2].
The application of acoustical physics to speech production models leads to
a wide range of insights into the acoustic nature of various classes of phonemes
and their relation to the vocal tract configurations that produce them. In particular, for the class of sustained vowel sounds (e.g. /a/ as in “pot” or /u/
as in “boot”), Fant’s crude twin-tube approximations to the vocal tract profile
are sufficient to capture the key acoustic features. The utility of these models
is to provide a map between a small set of articulatory parameters and a phoneme’s approximate acoustic signal. However, developments in this field have
traditionally found audience in the linguistic and applied physics/engineering
communities.
The development of speech recognition has traditionally been directed towards generic statistical methods for classification. These methods take as input a set of fully or partially labelled data vectors, whose components can be
anything from digital image pixel intensities to daily weather measurements.
While such generic approaches are useful in having a large domain of application, they do not factor in the possibly useful constraints imposed by the nature
of a particular input data set. In the case of speech, such constraints are determined by the physical mechanisms that produce it; humans cannot produce
every arbitrary acoustic signal.
Recently, several algorithms have been proposed that both exploit the underlying structure of natural data and remain application-independent. To accomplish this task, these algorithms share the assumption that high-dimensional input data sets are generated analytically from a much smaller number of degrees of freedom. These algorithms remain generic since the exact
form of the underlying structure is inconsequential; it is simply sufficient that
a lower-dimensional representation exists. It should also be noted that these
methods share a geometric interpretation of the data, where an n-dimensional
input vector is regarded as a point in an n-dimensional Euclidean space with a
standard Euclidean distance metric.
A particular formalization of this postulate is to assume that natural data resides on or near a low-dimensional manifold embedded in the ambient space.
A number of algorithms based on this formalization have been presented, including Laplacian eigenmaps [3, 1], locally linear embedding (LLE) [4, 5], Hessian locally linear embedding (hLLE) [6], and ISOMAP [7]. These algorithms
have shown some success on benchmark problems.
It is therefore relevant to consider the extent to which the manifold assumption may be true in natural data. Individual phenomena must be investigated
to provide an application-specific justification for the manifold assumption.
For example, Donoho and Grimes [8] characterize families of images in terms
of their manifold properties. The articulatory parameterizations of Fant and
others strongly indicate the existence of a low-dimensional structure of speech.
The interest of this paper is to formally consider speech sounds from this geometric point of view and determine the precise nature of the manifold structure
that exists at its foundations. Towards this end, I analyze sounds generated by
a series of concatenated tubes excited by a periodic source. This system serves
as a simple model for the vocal tract and glottal excitation and has been shown
to be useful in the tradition of acoustic modeling in speech production and
acoustic phonetics.
This paper presents a derivation of the frequency spectra generated by
these acoustic tube models, which are used to define solution manifolds. If
it is postulated that real vowel sounds are adequately approximated by the
acoustic tube model solutions, then the existence of an underlying manifold
structure in real speech follows. This has several implications, of which the
most relevant to this document is the successful application of manifold-based
algorithms, such as the Laplacian dimensionality reduction algorithm of Belkin
and Niyogi [1] and the semi-supervised Laplacian learning algorithm of Belkin,
Niyogi, and Sindhwani [3]. In addition, the consequences of this manifold interpretation have applications to the study of non-linear speech phenomena and
alternative spectrogram methods, two possible areas of future research.
Chapter 1 presents the relevant acoustic physics and a derivation of the
tube model solution spectra. Chapter 2 provides a brief overview of the Speech
Sound Generator software created to study and demonstrate the sounds produced using the tube models. Chapter 3 develops a general definition and
formal justification for the acoustic tube solution manifolds. In Chapter 4, this
discussion is extended to the solution manifold structure of the tube models
of vowel production. Chapter 5 investigates the performance of a Laplacianbased dimensionality reduction algorithm using the acoustic tube model data,
both with and without the introduction of noise. Finally, I conclude with a
summary of findings and a discussion of future research goals resulting from
the speech perspective presented in this paper.
Chapter 1
The Physics of Acoustic Tubes
Sound is transmitted through a medium in the form of a longitudinal pressure
wave. In free space, a given source signal will propagate through the air with
its acoustic nature fundamentally unchanged. However, if the space through
which a wave travels is in some way constricted, the frequency content of the
acoustic signal can shift dramatically.
One canonical system of constraint is that of acoustic tubes. When an input
signal is passed through a tube, some of the frequency content tends to reflect
back, resulting in resonance. The frequencies at which such resonance occurs
are determined solely by the shape and dimensions of the acoustic tube. The
ultimate output signal is therefore dominated by these resonant frequencies,
resulting in a modulated signal much different from the input. This property
of acoustic tubes is exploited by man-made instruments such as pipe organs
and flutes, as well as by natural systems such as the human voice.
Historically, there have been two primary data representations used in speech
analysis and recognition. The first uses the entire discrete Fourier spectrum of
a speech signal, typically resulting in high-dimensional data sets. The second
approach goes one step further, decomposing the frequency spectrum into a
relatively small number of resonant frequencies, known as formants. Since formant frequencies mark the positions of spectral maxima, they are also the most audible, as the
ear performs a sort of Fourier decomposition before perception begins.
The formant approach has two advantages. First, the formant representation is typically low-dimensional (3 or 4 formants), which can lead to short processing times. Second, formants tend to characterize certain classes of speech
sounds very well, leading to improved classification. However, when analyzing discretely-sampled speech data, the frequency spectrum tends to be more
chaotic, due to an implicit convolution with a sinc function. Therefore, the determination of these formant positions can introduce enough speed bottlenecks
and uncertainty to counteract the two benefits.
In this paper, the entire spectral representation is of primary interest. However, as a result of the formant tradition, presentation of the acoustics of vocal tract resonators has typically been limited to finding the poles of the filter
transfer functions that arise from particular articulatory configurations. This
chapter presents a complete derivation of the output waveform and spectrum
arising from an arbitrary configuration of concatenated tubes and an arbitrary
source function, starting with basic physical principles. Several approximations are made to contain the analysis of this system within the realm of linear
dynamics; these are noted when applied. We begin with the derivation for a
single uniform tube.
1.1 Single Tube Analysis
The acoustic analysis of the vocal tract resonator becomes a tractable problem
with the introduction of several approximations. First, the vocal tract walls
are modeled as rigid, assuming their impedance to be much greater than that
of air. This approximation is reasonable for the frequency range observed in
human speech [2]. Using this approximation, we can neglect energy loss in
the vocal tract as well as time-dependence of the cavity profile. Next, if we assume that the transverse vocal tract dimension is much smaller than the signal
wavelength, we can assume solutions are uniform in the transverse dimensions. Thus, the problem reduces to an analysis in one spatial dimension. Furthermore, this approximation allows us to assume the solutions are dependent
solely on the cross-sectional area of the vocal tract and not on the bounding
contour. This set of approximations is valid for sound frequencies under about
5 kHz [2].
1.1.1 Continuity and Conservation of Mass
To derive a working form of the acoustic equations, we must make the assumption that the acoustic signal passes through the air with pressure, density, and
velocity perturbations that are small compared with their equilibrium values
(i.e., p0 + p ≈ p0 , where p and p0 are the pressure perturbation and equilibrium
pressure, respectively). Under this assumption, the one-dimensional equation
of continuity for compressible fluid flow, ∂ρ/∂t + ∂(ρu)/∂x = 0, is approximately
\frac{\partial \rho}{\partial t} = \frac{\rho_0}{A(x)}\,\frac{\partial U}{\partial x},   (1.1)
where A(x) is the cross-sectional area of the tube as a function of position,
U = Au is the volume velocity, ρ is the density perturbation, and ρ0 is the equilibrium density. If we also assume that perturbations are sufficiently transient
to neglect heat transfer, we can make use of the adiabatic ideal gas relation
between pressure and density: pρ−γ = const, where γ = cp /cv = 5/3 for an
ideal gas. Discarding second order terms according to the assumption of small
perturbations, we have
\frac{dp}{p_0} = \gamma\,\frac{d\rho}{\rho_0}.   (1.2)
Combining Equations 1.1 and 1.2 we arrive at
\frac{\partial p}{\partial t} = \frac{\gamma p_0}{A(x)}\,\frac{\partial U}{\partial x},   (1.3)
the first of the acoustic equations.
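For reference, the elimination of ρ in this combination can be written out in one step; the following short derivation is only a sketch, based on Equations 1.1 and 1.2 as reconstructed above.

% From Eq. 1.2, dp/p_0 = \gamma\, d\rho/\rho_0, so the time derivatives satisfy
\frac{\partial \rho}{\partial t} = \frac{\rho_0}{\gamma p_0}\,\frac{\partial p}{\partial t}.
% Substituting this into Eq. 1.1 and multiplying through by \gamma p_0/\rho_0 recovers Eq. 1.3:
\frac{\rho_0}{\gamma p_0}\,\frac{\partial p}{\partial t} = \frac{\rho_0}{A(x)}\,\frac{\partial U}{\partial x}
\quad\Longrightarrow\quad
\frac{\partial p}{\partial t} = \frac{\gamma p_0}{A(x)}\,\frac{\partial U}{\partial x}.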
Neglecting the effect of gravity and assuming the propagating fluid is frictionless, conservation of momentum gives the relation (again, neglecting second order terms)
\frac{\partial p}{\partial x} = \frac{\rho_0}{A(x)}\,\frac{\partial U}{\partial t},   (1.4)
the second of the acoustic equations. Thus, mass continuity and conservation
of momentum provide expressions for both the spatial and time dependence
of the pressure and the volume velocity.
1.1.2 The Wave Equation and Boundary Conditions
Starting from Equations 1.3 and 1.4, we can arrive at a partial differential equation for the volume velocity U as a function of space and time. First, we differentiate Equation 1.3 with respect to t and Equation 1.4 with respect to x.
Equating mixed partials and assuming A is time-independent (i.e., the walls
are rigid) gives the wave equation in a surface of revolution with area profile
A(x):
\frac{\partial^2 U}{\partial x^2} - \frac{1}{A}\frac{dA}{dx}\frac{\partial U}{\partial x} = \frac{1}{c^2}\frac{\partial^2 U}{\partial t^2},   (1.5)

where c = √(γp0/ρ0), the speed of sound in air.
For our purposes, we consider uniform cylindrical tubes. Thus, A(x) =
const and dA/dx = 0, so Equation 1.5 reduces to the standard one-dimensional
wave equation in free space,
\frac{\partial^2 U}{\partial x^2} = \frac{1}{c^2}\frac{\partial^2 U}{\partial t^2}.   (1.6)
Since we could have opted to eliminate the volume velocity in favor of pressure
in the derivation, we have the complementary equation,
\frac{\partial^2 p}{\partial x^2} = \frac{1}{c^2}\frac{\partial^2 p}{\partial t^2}.   (1.7)
Given a second order differential equation, we need two boundary conditions to arrive at a unique solution. The first boundary condition incorporates
the source function corresponding to the input glottis-modulated volume velocity created by the lungs as a function of time, s(t).

[Figure 1.1: Schematic plot of Ks(f); axis values shown: Ks between 1.0 and 1.5, f between 300 and 2000 Hz.]

We model the glottis as
a vibrating piston at the x = 0 end of the tube. We then require as the first
boundary condition that
U (0, t) = s(t).
(1.8)
The second boundary equation makes use of the fact that a sound wave faces
the acoustic impedance of the surrounding environment at the open end of
the tube (x = L). The acoustic impedance (a frequency-dependent, complex
quantity) is defined by
Z(x, \omega) \equiv \frac{\hat{p}(x, \omega)}{\hat{U}(x, \omega)},   (1.9)
where p̂ and Û are the forward Fourier transforms of the pressure and volume
velocity. At the open end of the tube, the acoustic impedance encountered is
termed the radiation impedance. This quantity can be calculated from the geometry and baffling of the tube opening and is independent of the wave equation
solution. For our purposes, we use the approximate form given by Stevens [2],
based on a model of a circular piston on the surface of a sphere with radius 9
cm. This form of the radiation impedance, valid up to 6000 Hz, is given by
Z(L, \omega) = \frac{\rho c k^2 K_s(f)}{4\pi} + i\,\frac{4 \rho c k a}{5A} \equiv Z_r(\omega),   (1.10)
where ρ is the density of air, c is the speed of sound, A = πa² is the area of
the piston, k = ω/c is the wave number, and Ks is a real-valued frequency-dependent factor approximated by the form of Figure 1.1. The second boundary condition, which is given in the frequency domain, is then
\hat{U}(L, \omega) = \frac{\hat{p}(L, \omega)}{Z_r(\omega)}.   (1.11)
1.1.3 The General Solution
We are now equipped with all the components necessary to arrive at the general
solution to Equation 1.6. To begin, we write the solution U (x, t) in terms of its
time-Fourier transform Û (x, ω) as
U(x, t) = \frac{1}{2\pi}\int_{-\infty}^{\infty} \hat{U}(x, \omega)\, e^{i\omega t}\, d\omega.   (1.12)
Substituting this form into Equation 1.6 yields the ordinary differential equation
\left(\frac{d^2}{dx^2} + k^2\right)\hat{U}(x, \omega) = 0,   (1.13)
where k = ω/c is the wavenumber for the sound wave. This is simply the
equation of simple harmonic motion in x, and the general solution is given by
Û (x, ω) = U1 (ω)e−ikx + U2 (ω)eikx ,
(1.14)
where U1 and U2 are the complex amplitudes of the forward and backward
traveling waves, respectively. In general, the amplitudes are frequency dependent. Starting instead from Equation 1.7, we can write a similar expression for
the pressure wave:
p̂(x, ω) = p1 (ω)e−ikx + p2 (ω)eikx ,
(1.15)
where p1 and p2 are the complex amplitudes of the forward and backward
traveling pressure waves, respectively. Thus the general forms for the volume
velocity and pressure are given by
U(x, t) = \frac{1}{2\pi}\int_{-\infty}^{\infty} \left(U_1(\omega)e^{-ikx} + U_2(\omega)e^{ikx}\right) e^{i\omega t}\, d\omega   (1.16)

p(x, t) = \frac{1}{2\pi}\int_{-\infty}^{\infty} \left(p_1(\omega)e^{-ikx} + p_2(\omega)e^{ikx}\right) e^{i\omega t}\, d\omega.   (1.17)
There are two relevant methods for continuing from this point. The first
method, which provides the entire general solution, U (x, t) and p(x, t), is treated
for completeness and serves as a reference for any future model extension. The
chain matrix method is useful for efficient computation of the output signal
of the tube, p(L, t), and generalizes seamlessly to the case of multiple tube
systems. Both methods are presented below.
The Complete Solution
To arrive at a complete solution, we need to determine the specific form for the
complex Fourier amplitudes, U1 and U2 . To accomplish this, we must make
use of the boundary conditions discussed above in Section 1.1.2. The boundary
condition at x = 0 (Equation 1.8) can be incorporated by first considering the
source function s(t) written in terms of its Fourier transform ŝ(ω) as follows:
s(t) = \frac{1}{2\pi}\int_{-\infty}^{\infty} \hat{s}(\omega)\, e^{i\omega t}\, d\omega.   (1.18)
Imposing the boundary condition of Equation 1.8 and equating integrands
gives
U1 (ω) + U2 (ω) = ŝ(ω).
(1.19)
Thus, we have the first constraint on the functional form of U1 and U2 .
The second boundary condition of Equation 1.11 will provide the second
constraint required to fully determine the wave equation solution. We begin
by writing this expression in the time domain as the convolution
p(L, t) = Žr (t) ∗ U (L, t),
(1.20)
where Žr (t) is the inverse Fourier transform of Zr (ω). Next, we differentiate
this expression with respect to t, giving
\frac{\partial p}{\partial t}(L, t) = \check{Z}_r(t) * \frac{\partial U}{\partial t}(L, t).   (1.21)
We can now eliminate p from this expression by invoking Equation 1.3, giving
\gamma p_0 \frac{\partial U}{\partial x}(L, t) = A\,\check{Z}_r(t) * \frac{\partial U}{\partial t}(L, t).   (1.22)

Transforming back to the frequency domain, we have

\gamma p_0 \frac{\partial \hat{U}}{\partial x}(L, \omega) = A Z_r(\omega)\,\mathcal{F}\!\left[\frac{\partial U}{\partial t}\right] = i\omega A Z_r(\omega)\,\hat{U}(L, \omega).   (1.23)
Substituting the expression for Û of Equation 1.14, we arrive at the following
condition on the solution:
\frac{U_1(\omega)e^{-ikL} + U_2(\omega)e^{ikL}}{U_1(\omega)e^{-ikL} - U_2(\omega)e^{ikL}} = -\frac{\gamma p_0}{A Z_r(\omega) c}.   (1.24)
Solving Equations 1.19 and 1.24 results in the following expression for the time-Fourier transform of the general solution Û(x, ω):

\hat{U}(x, \omega) = \hat{s}(\omega)\left[ \frac{e^{-ikx}}{1 + \frac{B+1}{B-1}e^{-i2kL}} + \frac{e^{ikx}}{1 + \frac{B-1}{B+1}e^{i2kL}} \right],   (1.25)

where B is defined to be γp0/(AZr(ω)c) = ρ0c/(AZr(ω)). The corresponding pressure spectrum is then given by

\hat{p}(x, \omega) = \hat{s}(\omega) Z_r(\omega)\left[ \frac{e^{-ikx}}{1 + \frac{B+1}{B-1}e^{-i2kL}} + \frac{e^{ikx}}{1 + \frac{B-1}{B+1}e^{i2kL}} \right].   (1.26)
The Chain Matrix Solution
We are primarily interested in the output of the acoustic tube filter. Furthermore, the complete solution method presented above changes greatly for the
case of multiple tubes. Therefore, we present the alternative chain matrix solution method, which generalizes seamlessly to the case of multiple tubes. We
start by inserting the general solutions of Equations 1.16 and 1.17 into the
acoustic Equations 1.3 and 1.4 and equate integrands to arrive at the Fourier
coefficient relations
p_1 e^{-ikx} + p_2 e^{ikx} = -\frac{\rho_0 c}{A}\left(U_1 e^{-ikx} - U_2 e^{ikx}\right)   (1.27)

p_1 e^{-ikx} - p_2 e^{ikx} = -\frac{\rho_0 c}{A}\left(U_1 e^{-ikx} + U_2 e^{ikx}\right).   (1.28)

Combining these relations immediately gives

p_1 = -\frac{\rho_0 c}{A}\, U_1 \quad \text{and} \quad p_2 = \frac{\rho_0 c}{A}\, U_2.   (1.29)
We can now combine Equations 1.14 and 1.15 in matrix form in terms of
only U1 and U2 as follows:
\begin{pmatrix} \hat{U}(x, \omega) \\ \hat{p}(x, \omega) \end{pmatrix} =
\begin{pmatrix} e^{-ikx} & e^{ikx} \\ -\frac{\rho_0 c}{A} e^{-ikx} & \frac{\rho_0 c}{A} e^{ikx} \end{pmatrix}
\begin{pmatrix} U_1(\omega) \\ U_2(\omega) \end{pmatrix}.   (1.30)
Since we are only concerned with the input (x = 0) and output (x = L) of the
tube, we eliminate U1 and U2 from this vector expression in favor of Û (L, ω),
p̂(L, ω), Û (0, ω), and p̂(0, ω) giving
\begin{pmatrix} \hat{U}(L, \omega) \\ \hat{p}(L, \omega) \end{pmatrix} =
\begin{pmatrix} \cos kL & i\frac{A}{\rho_0 c}\sin kL \\ i\frac{\rho_0 c}{A}\sin kL & \cos kL \end{pmatrix}
\begin{pmatrix} \hat{U}(0, \omega) \\ \hat{p}(0, \omega) \end{pmatrix}.   (1.31)
We now have a system of two equations with four unknowns, so we must
again apply the two boundary conditions to uniquely determine the solution.
The first condition of Equation 1.8 is given in the frequency domain by
Û (0, ω) = ŝ(ω),
(1.32)
where ŝ(ω) is the Fourier transform of the source function s(t). The second
boundary condition of Equation 1.11 is
\hat{U}(L, \omega) = \frac{\hat{p}(L, \omega)}{Z_r(\omega)},   (1.33)
where Zr (ω) is the radiation impedance of Equation 1.10.
Using Equations 1.31, 1.32, and 1.33, we can then solve for the output volume velocity spectrum, Û (L, ω), which is given by
\hat{U}(L, \omega) = \hat{s}(\omega)\left[\cos kL - i\,\frac{A Z_r(\omega)}{\rho_0 c}\sin kL\right]^{-1}.   (1.34)
The corresponding pressure spectrum is then given by
\hat{p}(L, \omega) = \hat{s}(\omega) Z_r(\omega)\left[\cos kL - i\,\frac{A Z_r(\omega)}{\rho_0 c}\sin kL\right]^{-1}.   (1.35)
Note that the complete solutions of Equations 1.25 and 1.26 evaluated at x = L
reduce to these forms, but the chain matrix approach more easily generalizes
to multiple tube systems, as we will see later.
1.1.4 The Sinusoidal Source Solution
We now turn our attention to the solution for the case of a sinusoidal
source. Since we can express any odd periodic source function as a Fourier
series of sinusoidal sources, understanding the solution to a single driving frequency will prove integral to further analysis. Consider the single-frequency
sinusoidal source function
s(t) = U0 sin ω0 t.
(1.36)
This source function can be cast into exponential form using Euler’s equation
as follows:
s(t) = \frac{U_0}{2i}\left(e^{i\omega_0 t} - e^{-i\omega_0 t}\right).   (1.37)

Making use of the fact that the Fourier transform of a complex exponential is
simply the Dirac delta function, we have

\hat{s}(\omega) = \frac{U_0 \pi}{i}\left(\delta(\omega - \omega_0) - \delta(\omega + \omega_0)\right).   (1.38)

Using our chain matrix solution of Equation 1.35, we can immediately write down

\hat{p}(L, \omega) = \frac{U_0 \pi}{i}\left(\delta(\omega - \omega_0) - \delta(\omega + \omega_0)\right) f(\omega, L, A),   (1.39)

where f(ω, L, A) is defined to be the frequency-dependent factor

f(\omega, L, A) = Z_r(\omega)\left[\cos kL - i\,\frac{A Z_r(\omega)}{\rho_0 c}\sin kL\right]^{-1}.   (1.40)
Now that we have the Fourier transform of the solution, we can use the
sifting property of the Dirac delta function,
\int_{-\infty}^{\infty} f(\omega)\,\delta(\omega - \omega_0)\, d\omega = f(\omega_0),   (1.41)
when applying the inverse Fourier transform to arrive at the final solution in
terms of f ,
p(L, t) = \frac{U_0}{2i}\left[ f(\omega_0, L, A)e^{i\omega_0 t} - f(-\omega_0, L, A)e^{-i\omega_0 t} \right] = \mathrm{Im}\!\left( U_0\, f(\omega_0, L, A)\, e^{i\omega_0 t} \right).   (1.42)

Here, the second form is a consequence of Zr(ω) = Zr*(−ω) and thus f*(ω, L, A) = f(−ω, L, A).
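To make the single-tube solution concrete, here is a minimal numerical sketch of Equations 1.40 and 1.42 in Python. The air constants (ρ0 ≈ 1.2 kg/m³, c ≈ 343 m/s) and the treatment of Ks(f) as a constant near 1 are assumptions not fixed by the text; the code is illustrative, not the author's implementation.

# --- sketch: single-tube sinusoidal-source response (Eqs. 1.10, 1.40, 1.42) ---
import numpy as np

RHO0 = 1.2      # equilibrium air density (kg/m^3), assumed
C = 343.0       # speed of sound (m/s), assumed

def radiation_impedance(omega, A, Ks=1.0):
    """Approximate radiation impedance Zr(omega) of Eq. 1.10; Ks ~ 1 is an assumption."""
    k = omega / C
    a = np.sqrt(A / np.pi)          # piston radius from A = pi a^2
    return RHO0 * C * k**2 * Ks / (4 * np.pi) + 1j * 4 * RHO0 * C * k * a / (5 * A)

def f_factor(omega, L, A):
    """Frequency-dependent factor f(omega, L, A) of Eq. 1.40."""
    k = omega / C
    Zr = radiation_impedance(omega, A)
    return Zr / (np.cos(k * L) - 1j * (A * Zr / (RHO0 * C)) * np.sin(k * L))

def p_output(t, omega0, L, A, U0=1.0):
    """Output pressure p(L, t) for the source U0*sin(omega0*t), per Eq. 1.42."""
    return np.imag(U0 * f_factor(omega0, L, A) * np.exp(1j * omega0 * t))

# Example: a 17.6 cm tube with 15 cm^2 cross-section, driven at 150 Hz.
t = np.linspace(0.0, 0.02, 400)
waveform = p_output(t, 2 * np.pi * 150.0, L=0.176, A=15e-4)
print(waveform[:5])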
1.2 Multiple Tube Analysis
With the chain matrix single tube solution of Section 1.1.3 in hand, it is a simple
matter to extend the analysis to a series of N tubes with lengths {Li |i = 1, . . . , N }
and cross-sectional areas {Ai |i = 1, . . . , N }. Relying on continuity of pressure
and volume velocity at inter-tube boundaries, the solution for N concatenated
tubes is equivalent to determining N single tube solutions. Thus, if we discretely approximate an arbitrary vocal tract profile A(x) with the {Li } and
{Ai }, we can determine the output speech signal to desired accuracy within
the bounds admissible by this framework. Moreover, we can accomplish this
without having to numerically compute the solution to a much more complicated differential equation for each variation in articulatory configuration.
1.2.1 The N-Tube General Solution
Let Ci be the chain matrix for tube i given by
C_i = \begin{pmatrix} \cos kL_i & i\frac{A_i}{\rho_0 c}\sin kL_i \\ i\frac{\rho_0 c}{A_i}\sin kL_i & \cos kL_i \end{pmatrix},   (1.43)

which satisfies the vector equation

\begin{pmatrix} \hat{U}_i(L_i, \omega) \\ \hat{p}_i(L_i, \omega) \end{pmatrix} = C_i \begin{pmatrix} \hat{U}_i(0, \omega) \\ \hat{p}_i(0, \omega) \end{pmatrix}.   (1.44)
Here, Ui and pi are the solutions and Li the length for the i-th tube. Note that
det Ci = 1. Continuity at the inter-tube boundaries imposes the conditions
\begin{pmatrix} \hat{U}_i(0, \omega) \\ \hat{p}_i(0, \omega) \end{pmatrix} = \begin{pmatrix} \hat{U}_{i-1}(L_{i-1}, \omega) \\ \hat{p}_{i-1}(L_{i-1}, \omega) \end{pmatrix} \quad \text{for } i = 2, \ldots, N.   (1.45)
Therefore, it follows by induction that the output of the N th tube is given by
\begin{pmatrix} \hat{U}_N(L, \omega) \\ \hat{p}_N(L, \omega) \end{pmatrix} = \left(\prod_{i=1}^{N} C_i\right) \begin{pmatrix} \hat{U}_1(0, \omega) \\ \hat{p}_1(0, \omega) \end{pmatrix},   (1.46)

where L = ∑_{i=1}^{N} Li indicates the position of the open end of the multi-tube
system. Note that this matrix equation is of the same form as the single tube
case of Equation 1.31. Thus we can proceed exactly as before with the slightly
modified boundary conditions,
\hat{U}_1(0, \omega) = \hat{s}(\omega),   (1.47)

\hat{U}_N(L, \omega) = \frac{\hat{p}_N(L, \omega)}{Z_r(\omega)}.   (1.48)
Let us denote the composite 2 × 2 chain matrix as M = ∏_{i=1}^{N} Ci. Since the
determinants of the {Ci} are all 1, det M = 1 as well. Therefore, we can write
the general multiple tube volume velocity spectrum solution as

\hat{U}_N(L, \omega) = \frac{\hat{s}(\omega)}{M_{22} - Z_r(\omega) M_{12}}.   (1.49)
The corresponding pressure spectrum is then given by
\hat{p}_N(L, \omega) = \frac{\hat{s}(\omega) Z_r(\omega)}{M_{22} - Z_r(\omega) M_{12}}.   (1.50)
The task of computing the output spectrum of an N -tube system, then, reduces
to the multiplication of N 2 × 2 matrices.
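As a sketch of this computation, the following Python fragment assembles the composite chain matrix and evaluates g(ω, {Li}, {Ai}) of Equation 1.53. The air constants, the simplified Ks ≈ 1 radiation impedance, and the use of the final tube's area in Zr are assumptions made only for illustration.

# --- sketch: N-tube chain-matrix product (Eqs. 1.43-1.50) ---
import numpy as np

RHO0, C = 1.2, 343.0   # assumed air density (kg/m^3) and speed of sound (m/s)

def radiation_impedance(omega, A, Ks=1.0):
    k, a = omega / C, np.sqrt(A / np.pi)
    return RHO0 * C * k**2 * Ks / (4 * np.pi) + 1j * 4 * RHO0 * C * k * a / (5 * A)

def chain_matrix(omega, L_i, A_i):
    """C_i of Eq. 1.43 for a uniform tube of length L_i and area A_i."""
    k = omega / C
    return np.array([[np.cos(k * L_i), 1j * (A_i / (RHO0 * C)) * np.sin(k * L_i)],
                     [1j * (RHO0 * C / A_i) * np.sin(k * L_i), np.cos(k * L_i)]])

def g_factor(omega, lengths, areas):
    """g(omega, {L_i}, {A_i}) of Eq. 1.53, built from the composite matrix M."""
    M = np.eye(2, dtype=complex)
    for L_i, A_i in zip(lengths, areas):
        M = chain_matrix(omega, L_i, A_i) @ M     # accumulate the tube chain
    Zr = radiation_impedance(omega, areas[-1])    # impedance seen at the open end (assumed)
    return Zr / (M[1, 1] - Zr * M[0, 1])

# Example: a crude two-tube profile (wide back cavity, narrow front cavity).
lengths = [0.096, 0.080]          # metres
areas = [15e-4, 15e-4 / 8.0]      # square metres
print(abs(g_factor(2 * np.pi * 500.0, lengths, areas)))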
1.2.2 The N-Tube Solution for a Sinusoidal Source
We now turn to the sinusoidal source solution for the N -tube case. Generalizing the results of Section 1.1.4, for a single frequency source function
s(t) = U0 sin ω0 t,
(1.51)
the output spectrum at the open end is given by
\hat{p}(L, \omega) = \frac{U_0 \pi}{i}\left(\delta(\omega - \omega_0) - \delta(\omega + \omega_0)\right) g(\omega, \{L_i\}, \{A_i\}),   (1.52)

where g(ω, {Li}, {Ai}) is the frequency-dependent factor of Equation 1.49,

g(\omega, \{L_i\}, \{A_i\}) = \frac{Z_r(\omega)}{M_{22} - Z_r(\omega) M_{12}}.   (1.53)
Here M12 = M12 ({Li }, {Ai }) and M22 = M22 ({Li }, {Ai }). It follows in the
same manner as in Section 1.1.4 that the output waveform is given by
p_N(L, t) = \frac{U_0}{2i}\left[ g(\omega_0, \{L_i\}, \{A_i\})\, e^{i\omega_0 t} - g(-\omega_0, \{L_i\}, \{A_i\})\, e^{-i\omega_0 t} \right].   (1.54)
The second form of the single-tube solution given by Equation 1.42 relied
on the fact that f ∗ (ω) = f (−ω). For an N -tube system, it remains the case
that Z∗r (ω) = Zr (−ω). Also, it is easy to show that for all composite chain
matrices M , Im(M22 ) = 0 and Re(M12 ) = 0. Therefore, it again follows that
g ∗ (ω) = g(−ω), so we can cast the solution of Equation 1.54 into the simpler
form,
pN (L, t) = Im(U0 g(ω0 , {Li }, {Ai })eiω0 t ),
(1.55)
recovering an explicitly real output waveform.
In this chapter we have developed in great detail a set of solutions to the
physical system of concatenated acoustic tubes. These solutions will form the
input for the geometrical representation adopted in Chapter 3, which will be
extended to speech in Chapter 4. First, however, a brief overview of the solution demonstration software will be presented in Chapter 2.
Chapter 2
Speech Sound Generator
Software
The acoustic solutions of Sections 1.1 and 1.2 are implemented in a simulation and analysis software package Speech Sound Generator. The source code
is available on the web at http://www.cs.uchicago.edu/˜aren/. This
software allows for audio demonstration of solutions derived above for any
N -tube configuration and source function. This software also allows for configuration parameter adjustments that correspond to the various manifold parameters developed below in Chapters 3 and 4. The two modes, simulation
and data, are introduced separately below.
2.1 Simulation Mode
In simulation mode, the user inputs the vocal tract profile, glottal source frequency spectrum and fundamental frequency. The user can then stream the
filtered output signal to the sound device for playback. In addition, the output waveform and Fourier spectrum are displayable, as well as the vocal tract
profile. The graphical user interface for this mode is shown in Figure 2.1. The
simulation mode has the following features:
• Input parameters can be adjusted during playback to modify the output
sound in real-time by using the sliders and dials.
• Vocal tract configurations may be modified by adjusting several parameters corresponding to the various manifold configuration space variables
introduced in later chapters.
• Vocal tract filtering may also be deactivated, allowing examination of the
source signal.
• Complete configurations may be saved and loaded.
Figure 2.1: Speech sound generator software in simulation mode.
• Vocal tract profiles and source spectra can be loaded independently to
mix and match externally prepared settings.
• A two-second clip of sound produced by a given configuration may also
be exported to a standard WAV file.
2.2 Data Mode
The data mode allows the user to record speech sounds, and display either
the waveform or frequency spectra in the plot window. The graphical user
interface for this mode is shown in Figure 2.2. For periodic sources, the fundamental frequency will be computed using autocorrelation minimization, and
displayed on the frequency spectrum plot.
The fit option, if enabled before recording, will attempt to fit a two-tube
model frequency spectrum to the recording, finding the best length ratio, radius ratio, and total length match from theory. That is, by recording their voice
a user can get an estimate of their vocal tract profile. The results of this fit are
displayed in the Fourier transform plot window.
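As an illustration of the kind of search such a fit involves, the sketch below scans a parameter grid for the model spectrum closest (in Euclidean distance) to a measured one. The actual fitting procedure used by the software is not described beyond the parameters it reports, so the grid search, the distance measure, and the model_spectrum callback used here are all hypothetical.

# --- sketch: hypothetical grid-search fit of a two-tube model to a recording ---
import numpy as np

def fit_two_tube(measured, model_spectrum, length_ratios, radius_ratios, total_lengths):
    """Return the (RL, Rr, L) grid point whose model spectrum best matches `measured`."""
    best, best_err = None, np.inf
    for RL in length_ratios:
        for Rr in radius_ratios:
            for L in total_lengths:
                err = np.linalg.norm(model_spectrum(RL, Rr, L) - measured)
                if err < best_err:
                    best, best_err = (RL, Rr, L), err
    return best, best_err

# Toy usage with a stand-in model (a real use would plug in the chain-matrix spectrum).
toy_model = lambda RL, Rr, L: np.array([RL, Rr, L])
target = np.array([1.2, 0.35, 0.176])
print(fit_two_tube(target, toy_model, [1.0, 1.2], [0.3, 0.35], [0.17, 0.176]))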
2.3 Implementation Details
The software package is written for the Linux platform in C++ using the Gnome
Toolkit 2.0 (GTK2.0) graphical user interface development package and relies
Figure 2.2: Speech sound generator software in data mode.
heavily on the use of multiple threads to manage the interface, solution generation, and sound production concurrently. WAV sound file generation is implemented using the libsndfile library. The plot window is managed using the
developmental GTKExtra 2.0 widget extension library.
Chapter 3
The Manifold of Acoustic
Tube Solutions
Chapter 1 presented a derivation of the continuous output spectrum resulting
from an arbitrary source and N -tube configuration. If the source spectrum has
bounded support and the radiation impedance has a non-zero resistive term
(i.e., Re(Zr) ≠ 0), then it follows that the output pressure spectrum is in L2,
the infinite-dimensional space of square-integrable functions. However, if we
instead consider sources composed of an H-term linear combination of sinusoidal terms, we can alternatively view the output solution as contained in an
H-dimensional subset of the infinite-dimensional space l2, the set of square-summable series. Thus, each solution of this type, while still an element of
an infinite-dimensional space L2 , will also have a discrete, but exact, finite-dimensional representation. This readily allows for the adoption of a geometric
representation of the solutions, where the H-dimensional solution spectra coefficients represent points in H-dimensional Euclidean space.
In this chapter, we will adopt this geometrical representation and determine
the subsets of Euclidean space to which acoustic tube solutions are constrained
for various ranges of configuration parameters. We will see that these subsets
are indeed low-dimensional manifolds embedded in the ambient space.
3.1 The Single Tube Solution Manifold
We begin our investigation of the manifold structure with the simple yet expository case of a single uniform tube with length L and cross-sectional area
A, driven by the source function
s(t) = \sum_{n=1}^{H} \alpha_n \sin n\omega_0 t,   (3.1)
where ω0 is the fundamental angular frequency and {αn } are real-valued Fourier
21
coefficients. Here, H is the number of harmonics, which must be less than or
equal to b5000/f0 c due to the approximations used in the physics model. Since
we are guaranteed real-valued output from the tube, we can assume a solution
of the form
p(L, t) = \sum_{n=1}^{H} \beta_n \sin(n\omega_0 t + \phi_n),   (3.2)
where {βn } is the set of real-valued output Fourier coefficients and {φn } is a
set of real-valued phases. We know from Equation 1.42 that for each n,
βn sin(nω0 t + φn ) = Im(αn f (nω0 , L, A)einω0 t ).
(3.3)
Therefore, it follows that the output Fourier coefficients {βn } are given by
\beta_n(L, A) = \alpha_n\, |f(n\omega_0, L, A)|
             = \frac{\alpha_n |Z_n|}{\left[\cos^2 k_n L + \frac{A^2}{\rho_0^2 c^2}|Z_n|^2 \sin^2 k_n L + \frac{A}{\rho_0 c}\mathrm{Im}(Z_n)\sin 2k_n L\right]^{1/2}},   (3.4)
where kn = nω0 /c and Zn = Zr (nω0 ).
Now, consider the subset of RH defined for a given set {αi |i = 1, . . . , H} by
M1 (L1 , L2 ) = {(β1 , β2 , . . . , βH )|L ∈ (L1 , L2 )}.
(3.5)
This set traces out a one-dimensional curve in the ambient Fourier space, RH .
Here, the subscript “1” indicates the use of the single tube solution. Our immediate goal is to show that M1 (L1 , L2 ) is in fact a one-dimensional manifold.
Formally, this is a consequence of the following three properties:
1. There exists a diffeomorphism, φ : (L1 , L2 ) ⊂ R1 → M1 (L1 , L2 ) ⊂ RH , for
L1 and L2 in the range of human vocal tract lengths.
We can define a continuous map, φ, from points in the open interval
(L1 , L2 ) to M1 (L1 , L2 ) using the {βn } functions defined in Equation 3.4.
This mapping is surjective by definition of M1 (L1 , L2 ). Its injectivity follows from the fact that the set is not self-intersecting. For self-intersection
to occur, there must exist two lengths, L, L′ ∈ (L1, L2) such that k1 L =
k1 L′ + m2π for some m ∈ Z+. This means that the minimum length
difference that admits self-intersection is given by
\Delta L_{\min} = \frac{c}{f_0^{\max}} \approx 1\ \text{m},   (3.6)
where the maximum fundamental frequency occurring in human speech
is estimated to be f0max ≈ 300 Hz. Since vocal tracts typically range in
length from approximately 10 to 30 cm, self-intersection would be impossible for any natural scale interval.
Furthermore, if Re(Zn) ≠ 0 for all n, φ is a differentiable map since the
functions {βn } that determine it are infinitely differentiable (C ∞ ). Thus,
it follows by definition that φ is a diffeomorphism.
2. The diffeomorphism, φ−1 : M1 (L1 , L2 ) → R1 , is a coordinate chart on the set
M1 (L1 , L2 ).
Since φ is a diffeomorphism, its inverse φ−1 must exist and must also
be a diffeomorphism. It then follows that φ−1 is a coordinate chart on
M1 (L1 , L2 ), by definition.
3. The set M1 (L1 , L2 ) is open.
This fact follows from its definition, where L is chosen from the open interval (L1 , L2 ).
We can conclude that the set M1 (L1 , L2 ) is a smooth and open one-dimensional manifold. The manifold has two interesting properties:
• The manifold M1 (L1 , L2 ) is extrinsically curved in the ambient space.
Inspection of the functional form reveals that this manifold is extrinsically curved in the ambient Fourier space. That is, there exists a set of distinct lengths,
\{l_1, l_2, l_3 \mid l_i \in (L_1, L_2)\},   (3.7)

such that the points

\{\vec{\beta}_i = \langle \beta_1(l_i, A), \beta_2(l_i, A), \ldots, \beta_H(l_i, A)\rangle \mid i = 1, 2, 3\}   (3.8)

do not lie on a straight line in the ambient space. This follows from the
fact that the tangent vector, ∂β⃗/∂L, does not maintain a fixed direction.
This can be easily seen when we write down the tangent vector components,
\frac{\partial \beta_n}{\partial L} = \frac{\alpha_n k_n |Z_n|}{2} \times
\frac{\left(1 - \frac{A^2 |Z_n|^2}{\rho_0^2 c^2}\right)\sin 2k_n L - \frac{2A\,\mathrm{Im}(Z_n)}{\rho_0 c}\cos 2k_n L}
{\left[\cos^2 k_n L + \frac{A^2 |Z_n|^2}{\rho_0^2 c^2}\sin^2 k_n L + \frac{A}{\rho_0 c}\mathrm{Im}(Z_n)\sin 2k_n L\right]^{3/2}}.   (3.9)
Clearly, for values of the {li } that do not result in trigonometric arguments that are multiples of π, the tangent vector components will not
maintain constant proportions. Extrinsic curvature is demonstrated graphically simply by a non-linear form of the set.
[Figure 3.1: Sample plot of M1(L1, L2); panels show the 3-D view and its x–y, x–z, and y–z plane projections, with axes β1, β2, β3.]
• The manifold M1 (L1 , L2 ) spans the ambient space.
This follows from the existence of a distinct set of lengths,
{l1, l2, . . . , lH | li ∈ (L1, L2)}, such that the H-dimensional square matrix
with row vectors β⃗i = (β1(li, A), β2(li, A), . . . , βH(li, A)) has rank H. We
have verified the existence of such a set of lengths numerically.
Therefore, we can conclude that the single-tube solution manifold is a curved
one-dimensional manifold that spans the H-dimensional ambient Euclidean
space. This indicates that we are dealing with a complex non-linear subset of
a typically high-dimensional space. Even though this subset has a one-dimensional representation, it fills all H dimensions of the ambient space. Therefore,
simple linear machine learning and dimensionality reduction techniques may
fail when applied to the ambient representation. It is precisely these qualifications that promise improved classification when using manifold-based algorithms, which incorporate the exploitation of the underlying low-dimensional
structure into simpler techniques that function in the ambient space.
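The numerical checks mentioned above (the rank of a matrix of sampled spectra) are straightforward to reproduce; the sketch below samples M1(L1, L2) through Equation 3.4 and reports the rank. The air constants and Ks ≈ 1 in Zr are assumptions, as before.

# --- sketch: sampling M1(L1, L2) via Eq. 3.4 and checking that it spans R^H ---
import numpy as np

RHO0, C = 1.2, 343.0   # assumed air constants

def Zr(omega, A, Ks=1.0):
    k, a = omega / C, np.sqrt(A / np.pi)
    return RHO0 * C * k**2 * Ks / (4 * np.pi) + 1j * 4 * RHO0 * C * k * a / (5 * A)

def beta(L, A, f0, H, alphas=None):
    """Output Fourier amplitudes (beta_1, ..., beta_H) of Eq. 3.4."""
    alphas = np.ones(H) if alphas is None else alphas
    n = np.arange(1, H + 1)
    omega_n = 2 * np.pi * f0 * n
    k_n = omega_n / C
    Z_n = Zr(omega_n, A)
    D = (np.cos(k_n * L)**2
         + (A**2 / (RHO0**2 * C**2)) * np.abs(Z_n)**2 * np.sin(k_n * L)**2
         + (A / (RHO0 * C)) * np.imag(Z_n) * np.sin(2 * k_n * L))
    return alphas * np.abs(Z_n) / np.sqrt(D)

# Sample 100 points along M1 for L in (10 cm, 50 cm), as in Figure 3.1.
H, f0, A = 3, 150.0, 15e-4
samples = np.array([beta(L, A, f0, H) for L in np.linspace(0.10, 0.50, 100)])
print(np.linalg.matrix_rank(samples))   # rank H indicates the samples span R^H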
Figures 3.1 and 3.2 (zoomed) show the manifold of Equation 3.5 for f0 = 150
Hz, L1 = 10 cm, L2 = 50 cm, A = 15 cm2 , and H = 3. The extrinsic curvature in
the ambient space is clearly demonstrated. Moreover, the three-prong structure
that roughly coincides with the axes indicates that the manifold does in fact
span the ambient space R3 .
[Figure 3.2: Sample plot of M1(L1, L2), zoomed on the origin; same panel layout and axes as Figure 3.1.]
3.2 The 2-Tube Solution Manifold
As we increase the number of tubes in our model, the terms in the general
solution gain additional trigonometric factors from the extra chain matrix multiplications. As a result, analytic treatment of the solution geometry grows
increasingly complex, becoming prohibitively unmanageable even for the case
of just two tubes. Thus, from this point on analytical parameterizations of the
solution manifolds are no longer presented, replaced instead by numerical and
graphical treatments.
The 2-tube resonator configuration is described fully by four parameters:
the lengths and cross-sectional areas of each of the two tubes. Alternatively,
we can capture the same information in four related parameters, which we
will use to define a two-tube configuration four-tuple,
c = (RA , A, RL , L) ∈ C2 (IRA , IA , IRL , IL ).
(3.10)
Here, L is the total resonator length, A is the larger of the two tube areas, and
RA and RL are the cross-section area and length ratios between the tubes, respectively. We define C2 (IRA , IA , IRL , IL ) to be the space of all possible configurations with
RA ∈ IRA ≡ (RA,min , RA,max )
RL ∈ IRL ≡ (RL,min , RL,max )
A ∈ IA ≡ (Amin , Amax )
L ∈ IL ≡ (Lmin , Lmax ).
(3.11)
We again conduct the analysis with the harmonic source function used in
the single tube case,
s(t) = \sum_{n=1}^{H} \alpha_n \sin n\omega_0 t,   (3.12)
where ω0 is the fundamental angular frequency and H is the number of included harmonics. The sinusoidal source output given by Equation 1.55 determines the magnitudes of the output Fourier series coefficients, which are given
by
βn = αn |g(nω0 , c)|.
(3.13)
Here it should be noted that we have replaced the explicit {Li } and {Ai } dependence of the function g with the above-defined configuration four-tuple
c ∈ C2 , which we have established to encapsulate the same information.
Consider the subset of RH ,
M2 (S) = {(β1 (c), β2 (c), . . . , βH (c))|c ∈ S},
(3.14)
where S = C2 (IRA , IA , IRL , IL ) is the configuration space for some choice of
parameter ranges. The subscript “2” of M2 indicates we are dealing with two-tube solutions. This set possesses the same properties as M1 (L1 , L2 ) listed
in the previous section, so it follows that M2 (S) is a smooth open manifold
embedded in RH . The dimension of this manifold is the dimension of the configuration space S.
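For concreteness, one way to unpack the configuration four-tuple c = (RA, A, RL, L) into the per-tube lengths and areas required by the chain-matrix solution is sketched below. The text does not specify which tube carries the larger area or the longer length, so the assignment used here (ratios taken as tube 1 over tube 2) is an assumption.

# --- sketch: expanding a configuration four-tuple (Eq. 3.10) into tube parameters ---
def two_tube_profile(RA, A, RL, L):
    """Return ([L1, L2], [A1, A2]) with L1/L2 = RL, A1/A2 = RA, and max(A1, A2) = A."""
    L2 = L / (1.0 + RL)
    L1 = L - L2
    if RA >= 1.0:
        A1, A2 = A, A / RA
    else:
        A1, A2 = A * RA, A
    return [L1, L2], [A1, A2]

print(two_tube_profile(RA=8.0, A=15e-4, RL=1.2, L=0.176))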
As a first example, consider the configuration space where we only vary
the overall length of the two-tube system within the range L ∈ (L1, L2), keeping the other three parameters fixed. The resulting one-dimensional manifold,
which can be denoted M2 (L1 , L2 ) (c.f. Equation 3.5), is shown in Figure 3.3,
where I have chosen
(L1, L2) = (10, 50) cm
RL = 1.2
RA = 8
A = 15 cm²                                            (3.15)
f0 = 150 Hz
αn = 0 for n > 3
H = 3.

[Figure 3.3: Sample plot of M2(L1, L2); panels show the 3-D view and its x–y, x–z, and y–z plane projections, with axes β1, β2, β3.]
The extrinsic curvature of this manifold is obvious upon inspection. Also, the
100×3 matrix formed using a discretized set of 100 data points along this manifold is determined numerically to have rank three. Therefore, this manifold
spans the ambient space.
Finally, consider the 2-tube solution manifold where we hold the overall
area and length constant, but vary the area and length ratios. An example of
such a manifold, which we denote by M2 (IRL , IRa ), is shown in Figure 3.4,
where configuration parameters are chosen as
IRA = (1/8, 8)
IRL = (1/3, 3)
L = 17.6 cm
A = 15 cm²
f0 = 150 Hz
αn = 0 for n > 3
H = 3.

[Figure 3.4: Sample plot of M2(IRL, IRA); panels show the 3-D view and its x–y, x–z, and y–z projections, with axes β1, β2, β3.]
This is a two-dimensional manifold for which we can clearly observe extrinsic curvature. Moreover, using the matrix rank method described above, the
manifold was verified to span the ambient space.
The general manifold picture developed in this chapter demonstrates that
the single and twin-tube parameterizations determine a simple, low-dimensional underlying structure to the complex, high-dimensional acoustic signals
they produce. However, at this point, the acoustic tube model system is rather
abstract. We have seen that the solution spaces are indeed low-dimensional
manifolds, but a connection to speech remains to be demonstrated. To this
end, we extend this approach into vowel-approximating configuration spaces
in the following chapter.
Chapter 4
Tube Models of Vowel
Production
Now that we have introduced the concept of acoustic tube solution manifolds, we can continue our investigation into the structure of vowel manifolds.
Speech production begins with a fairly stable air flow produced by the lungs.
Located at the base of the vocal tract are a pair of tissue folds, known as the
glottis or vocal cords. These folds can constrict to modulate air flow into the
vocal tract, which determines the volume velocity source for the physical system.
The vocal tract, mouth, and lips comprise the filter of the system. Various
muscles in the neck, mouth, and face control the shape, known as an articulatory configuration. These configurations can be approximated by a sequence
of concatenated acoustic tubes. In the case of sustained vowel sounds, a particular configuration is maintained throughout production. Therefore, the time-independent physics model presented is sufficient for this analysis.
The general formalism presented in Chapter 3 is extended in this chapter,
where an approximate glottal source volume velocity spectrum and vowel-approximating vocal tract filter configurations are implemented.
4.1 Introducing the Glottal Source
The volume velocity source that results from glottal vibration is an odd periodic function, which we denote by s(t), and thus can be approximated to
desired accuracy by a Fourier series of the form
s(t) \approx s_{\mathrm{app}}(t) = \sum_{n=1}^{H} s_n \sin n\omega_0 t,   (4.1)
where H is the included number of harmonics (limH→∞ sapp (t) = s(t)) and
ω0 is the fundamental angular frequency, which is speaker-dependent.

[Figure 4.1: (a) Glottal source spectrum and (b) corresponding waveform where f0 = 100 Hz; amplitude (dB) versus frequency (Hz), and amplitude versus time (s).]

The
Fourier amplitudes {sn } are traditionally approximated by a fixed 12 dB drop
per octave [9], which corresponds on a linear scale to
s_n = n^{-12/(20 \log_{10} 2)},   (4.2)
where s1 = 1 is chosen as the reference amplitude (see Figure 4.1). Therefore, for a given N -tube vocal tract configuration and fundamental angular
frequency ω0 , we can generalize the multi-tube sinusoidal source solution of
Section 1.2.2 to write down the glottal-sourced output,
p_N(L, t) = \sum_{n=1}^{H} \mathrm{Im}\!\left( n^{-12/(20 \log_{10} 2)}\, g(n\omega_0)\, e^{in\omega_0 t} \right).   (4.3)
Here, g(ω) is the function defined in Equation 1.53. Note that the listing of the
explicit resonator configuration dependence has been omitted for the sake of
brevity. Since the analysis presented in Chapter 1 assumes frequencies under
5000 Hz, we must limit the number of harmonics to H < 5000 · 2π/ω0 .
The frequency spectrum of the output waveform of Equation 4.3 is given
by (discarding phase information)
\beta_n = n^{-12/(20 \log_{10} 2)}\, |g(n\omega_0)|.   (4.4)
This form of the amplitude spectrum is used in the remainder of this chapter
to study the vowel manifolds.
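A small Python sketch of Equation 4.4 for the two-tube /a/ configuration of Table 4.1 is given below. It reuses the chain-matrix factor g(ω) of Chapter 1; the air constants, the Ks ≈ 1 radiation impedance, and the assignment of the narrower tube to the glottal end are assumptions made only for illustration.

# --- sketch: glottal-sourced amplitude spectrum (Eq. 4.4) for a two-tube /a/ ---
import numpy as np

RHO0, C = 1.2, 343.0   # assumed air constants

def Zr(omega, A, Ks=1.0):
    k, a = omega / C, np.sqrt(A / np.pi)
    return RHO0 * C * k**2 * Ks / (4 * np.pi) + 1j * 4 * RHO0 * C * k * a / (5 * A)

def g(omega, lengths, areas):
    """Chain-matrix factor g(omega, {L_i}, {A_i}) of Eq. 1.53."""
    M = np.eye(2, dtype=complex)
    k = omega / C
    for L_i, A_i in zip(lengths, areas):
        Ci = np.array([[np.cos(k * L_i), 1j * (A_i / (RHO0 * C)) * np.sin(k * L_i)],
                       [1j * (RHO0 * C / A_i) * np.sin(k * L_i), np.cos(k * L_i)]])
        M = Ci @ M
    Z = Zr(omega, areas[-1])
    return Z / (M[1, 1] - Z * M[0, 1])

def vowel_spectrum(lengths, areas, f0):
    """beta_n of Eq. 4.4 with the 12 dB/octave glottal amplitudes of Eq. 4.2."""
    H = int(5000 // f0)                        # harmonic limit from the model
    n = np.arange(1, H + 1)
    s_n = n ** (-12.0 / (20.0 * np.log10(2)))  # glottal Fourier amplitudes
    return s_n * np.abs([g(2 * np.pi * f0 * m, lengths, areas) for m in n])

# Two-tube /a/: total length 17.6 cm, RL = L1/L2 = 1.2, RA = A1/A2 = 1/8 (assignment assumed).
L_tot, A_max, RL, RA, f0 = 0.176, 15e-4, 1.2, 1.0 / 8.0, 100.0
L2 = L_tot / (1.0 + RL)
L1 = L_tot - L2
A1, A2 = A_max * RA, A_max
spectrum = vowel_spectrum([L1, L2], [A1, A2], f0)
print(len(spectrum), spectrum[:3])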
4.2 2-Tube Models of Vowel Production
In Section 3.2 we examined the manifold structure formed by H-point output Fourier transforms and included figures resulting from a three-harmonic
source function. If we instead supply the two-tube resonators with the glottal
source function described above, and choose resonator configurations that approximate vocal tract profiles during vowel production, we can use the output
solutions to estimate the manifold structure of vowel sounds.

[Figure 4.2: Schematic plot of the vowel structure within the manifold of acoustic two-tube model solutions, M2, with labeled regions for /e/, /u/, /a/, /æ/, and /i/.]

Phoneme   RL     RA     A (cm²)   L (cm)
/@/       1      1      15        17.6
/u/       8      8      15        17.6
/a/       1.2    1/8    15        17.6
/y/       1      8      15        17.6
/i/       1.5    8      15        14.5
/æ/       1/3    1/8    15        17.6

Table 4.1: Two-tube configurations for five vowel phonemes (Fant [9]).
If we fix a pitch (i.e., fundamental frequency), the manifold of vowel sounds
will comprise a subset of the space of all two-tube solutions, M2 , as shown
schematically in Figure 4.2. A sound’s position in this space depends both on
the phoneme being produced and the vocal tract size of an individual speaker.
Each phoneme will occupy a distinct subset of the solution manifold, but all
phonemes together do not form a partition, as there exist solutions that do
not fall into a phonetic category.
Two-tube articulatory vocal tract approximations have been studied in great
detail [9]. Table 4.1 shows approximate two-tube configurations for five vowel
phonemes, in terms of the quantities defined in Section 3.2. The spectrum for
the /a/ configuration given is shown in Figure 4.3. Now the question arises:
using these vowel configurations, what manifolds result when we vary one or
several of the parameters?
Before we can answer this, we must first address the issue of visualization. This is a matter complicated by the fact that we are now dealing with a
⌊5000/f0⌋-dimensional ambient space where we have one dimension for each
harmonic. For the time being, we present projections of the manifolds onto the
three principal component axes defined to be those that contribute most to the
variance across the data points.
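The projection itself is ordinary principal component analysis; a minimal sketch (via the SVD of the centered data matrix) is shown below. It is illustrative only and stands in for whatever tooling produced the figures.

# --- sketch: projecting spectra onto the first three principal component axes ---
import numpy as np

def principal_component_projection(X, n_components=3):
    """Rows of X are spectra in R^H; returns their coordinates on the top PC axes."""
    Xc = X - X.mean(axis=0)                   # center the data
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T           # coordinates along the leading axes

# Example with random stand-in data (real inputs would be manifold samples).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))
print(principal_component_projection(X).shape)   # (200, 3)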
Using this visualization approach, we will consider the manifolds arising
from vocal tract length (L) variation and area/length ratio (RL and RA) variation.

[Figure 4.3: Amplitude spectrum for the two-tube /a/ configuration given in Table 4.1; amplitude (dB) versus frequency (Hz), 0–5000 Hz.]

Vocal tract length is correlated roughly to the height of the speaker. Variation in area and length ratio between the two tubes roughly corresponds to
traversing the space of possible sounds producible by a speaker of given size
and pitch. Therefore, these manifolds have direct connections to the variance
present in natural datasets, namely speaker variation with fixed content and
content variation with fixed speaker.
We begin by examining the effect of varying overall vocal tract length for a
given vowel phoneme configuration. Let M2 (P ; f0 , L1 , L2 ) be the manifold in
RH specified by the two-tube configuration of phoneme P , where we vary the
total resonator length L ∈ (L1 , L2 ). We stimulate this resonator with a glottal
source of fundamental frequency f0 = 100 Hz, which constrains the number
of harmonics and dimension of the ambient space to H = 50. Figure 4.4 shows
M2 (P ; f0 , L1 , L2 ) for each of the phonemes listed in Table 4.1, where L1 =11
cm and L2 = 19 cm. The principal components for each phoneme’s plot are determined independently. High extrinsic curvature is evident for all phonemes,
typically taking a spiral form. Moreover, each manifold was numerically verified to span its 50-dimensional ambient space.
Next we combine the individual vowel phoneme manifolds into a single
manifold given by
M_2^{\mathrm{vowel}}(f_0, L_1, L_2) = \bigcup_{P \in \mathcal{P}} M_2(P; f_0, L_1, L_2),   (4.5)
where P is the set of phonemes listed in Table 4.1. The principal component
projection for this manifold is shown in Figure 4.5, color-coded by phoneme.
Note that we observe overlaps between some of the phoneme submanifolds,
particularly between the /a/-/æ/ pair and the /i/-/y/ pair. This may come
as no surprise, given the perceptual similarity of the phonemes. Thus it is clear
that these pairs are not linearly separable in principal component projections.
They are, however, linearly separable in the 50-dimensional ambient space.

[Figure 4.4: Principal component plots of M2(P; f0, L1, L2) for each phoneme individually (panels for /@/, /a/, /æ/, /i/, /u/, /y/; axes: PC 1, PC 2, PC 3).]

[Figure 4.5: Principal component plot of M2^vowel(f0, L1, L2), color-coded by phoneme (3-D view and pairwise PC-axis projections).]
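That separability claim can be probed with a very simple linear classifier in the ambient space; the sketch below fits a least-squares decision function to two labeled point sets. The random stand-in arrays are placeholders for actual samples of the /a/ and /æ/ submanifolds.

# --- sketch: crude linear-separability check in the ambient space ---
import numpy as np

def linearly_separable(X_a, X_b):
    """Fit a linear decision function by least squares and report its training accuracy."""
    X = np.vstack([X_a, X_b])
    y = np.concatenate([np.ones(len(X_a)), -np.ones(len(X_b))])
    Xb = np.hstack([X, np.ones((len(X), 1))])       # append a bias column
    w, *_ = np.linalg.lstsq(Xb, y, rcond=None)
    return np.mean(np.sign(Xb @ w) == y)            # 1.0 means perfectly separated

rng = np.random.default_rng(0)
X_a = rng.normal(loc=0.0, size=(100, 50))           # stand-in for /a/ spectra
X_ae = rng.normal(loc=0.5, size=(100, 50))          # stand-in for /ae/ spectra
print(linearly_separable(X_a, X_ae))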
Finally, we examine the manifold created by using the configuration for the
phoneme /a/ and varying both the area and length ratios independently from
1/10 to 10. This manifold, which we denote by M2 (f0 , L, A), is the space of
all sounds that can be created by a glottal source with fundamental frequency
f0 filtered by a two-tube resonator with total length L and maximum area A.
This manifold should approximately contain the space of all vowel sounds producible by the equivalent real vocal tract with those overall dimensions at the
given pitch.
The principal component projection of M2 (f0 , L, A) is shown in Figures 4.6
and 4.7. In Figure 4.6, color is implemented such that the continuous light spectrum is linearly indexed by the length ratio values (i.e., regions of the manifold
with the same color result from configurations with the same length ratio). In
Figure 4.7, the color spectrum is instead indexed by area ratio. Clearly, the
area ratio contributes much more to the overall variance of the manifold in the
ambient space, as the principal component analysis favors its independent differentiation. This indicates that variation of area ratio has the greater effect on
the acoustic nature of the resulting signal.
[Figure 4.6: Principal component plot of M2(f0, L, A), where colors differentiate length ratios; axes: PC 1, PC 2, PC 3.]
[Figure 4.7: Principal component plot of M2(f0, L, A), where colors differentiate area ratios; axes: PC 1, PC 2, PC 3.]
Phoneme    N     A (cm^2)    L (cm)
/a/        34    8.0         17.0
/e/        33    10.5        16.5
/i/        33    10.5        16.5
/ɨ/        38    13.0        19.0
/o/        37    14.5        18.5
/u/        39    13.0        19.5

Table 4.2: N-tube configurations for six vowel phonemes (Fant [9]).
Note that the manifold M2(f0, L, A) (see Figures 4.6 and 4.7) also contains paths that roughly model the trajectory of Fourier amplitudes created as a speaker continuously changes vocal tract configuration from one vowel phoneme to another. The precise nature of these paths is not clear; two immediate possibilities are transition paths that minimize geodesic distance and paths that minimize parameter space distance. Answering this question definitively, however, would require functional imaging of actual human vocal tracts during non-stationary speech production.
The manifold determined by a geometric representation of two-tube solutions possesses extrinsic curvature and spans the ambient space. Therefore,
the above-described submanifolds arising from variation of subsets of configuration parameters inherit these properties. As we can see from the principal
component plots, this translates into complex interrelationships between the
subspaces corresponding to each vowel phoneme when we linearly reduce the
dimension of the data.
4.3 N-Tube Models of Vowel Production
Increasing the number of tube segments in the model can better approximate
the acoustic characteristics of a given vowel sound. Much work has been done
to measure the actual vocal tract profile during sustained production of vowel
sounds, both with X-ray and magnetic resonance imaging. The goal of this section is to use N -tube profiles gathered from these medical imaging techniques
to further examine the manifold structure of vowels.
As in the 2-tube vowel approximations, the N-tube resonators are stimulated with the glottal source of Equation 4.2. We use the tube profiles of Fant [9], collected using X-ray imaging. Since these profiles are defined in 0.5 cm segments, the number of tubes is determined by the overall tract length of each measured profile (e.g., the 17.0 cm /a/ profile yields N = 17.0/0.5 = 34 tube sections). Table 4.2 shows the phonemes used in our study along with their N-tube overall dimensions.
Consider the manifold MN(P; f0, L1, L2) determined by the acoustic solution using the glottal source with fundamental frequency f0 and the N-tube resonator configuration for phoneme P, where we vary the total length L ∈ (L1, L2).
[Figure 4.8: Principal component plots of MN(P; f0, L1, L2) for each phoneme individually; panels show the manifolds for /a/, /e/, /i/, /ɨ/, /o/, and /u/.]
The principal component projections of these manifolds are shown in Figure 4.8 for each of the phonemes listed in Table 4.2, where f0 = 100 Hz, L1 = 11 cm, and L2 = 19 cm (cf. Figure 4.4). Again, extrinsic curvature is evident, and each manifold was numerically determined to span the 50-dimensional ambient space.
Next, as in the two-tube case, we can combine the individual phoneme manifolds into a single manifold,

    M_N^{vowel}(f_0, L_1, L_2) = \bigcup_{P \in \mathcal{P}} M_N(P; f_0, L_1, L_2).    (4.6)
The principal component projection for this manifold is shown in Figure 4.9, color-coded by phoneme (cf. Figure 4.5).
Unlike the two-tube case, each of the individual N-tube phoneme manifolds is linearly separable in the principal component projection. However, in the two-tube case we had two pairs of similar phonemes, /a/-/æ/ and /i/-/y/. In the N-tube case, we are without /æ/ and /y/ specifications.
[Figure 4.9: Principal component plot of MN^vowel(f0, L1, L2), color-coded by phoneme (/a/, /e/, /i/, /ɨ/, /o/, /u/), shown in several principal component views.]
The principal component projections of these pairs would presumably not be linearly separable in the N-tube case either; however, without access to N-tube vocal tract profiles for these phonemes, we could not verify this claim. Also note that linear separability of the phoneme manifolds in the principal component projection guarantees linear separability in the ambient 50-dimensional space.
While the more accurate N-tube profiles can result in better vowel approximations, they are also higher-dimensional representations. The main utility of the two-tube model is to capture the majority of the acoustic signature using as simple a model as possible. For the purpose of manifold algorithm performance studies, we can capture any two-tube model inaccuracies by adding a sufficient amount of noise to the solution spectra. We will return to this approach in Chapter 5. However, it is worth noting that, even after varying the overall scale values, the choice of a specific N-tube vowel phoneme configuration isolates a more localized region of the acoustic space than a two-tube configuration does. This fact is reflected in the generally increased separability of the N-tube phoneme submanifolds. Therefore, the two-tube model phonemes act as a sort of average over speakers' N-tube profiles. In this sense, classification and data reduction performance using two-tube model manifolds is a worst-case measure, which makes the two-tube model a desirable one to work with.
4.4 Fundamental Frequency Variation
So far we have considered various tube model solution manifolds in which the fundamental frequency is kept fixed. This approximates speakers of different sizes and genders who nevertheless share the same pitch, or fundamental frequency. However, both vocal tract geometry and the pitch of natural speech vary greatly across speakers. Thus, to successfully approximate the manifold of sustained vowel sounds, we must also allow for fundamental frequency variation, which is not necessarily correlated with vocal tract dimensions.
We extend the vowel configuration space of Section 4.2 into a frequency
dimension by taking the manifold M2 (P ; f0 , L1 , L2 ) and varying f0 as well.
We can define such a two-tube solution manifold for phoneme P by
    M_2(P; F_1, F_2, L_1, L_2) \equiv \bigcup_{f_0 \in (F_1, F_2)} M_2(P; f_0, L_1, L_2).    (4.7)
Figure 4.10 shows the principal component projection of this manifold for each phoneme listed in Table 4.1, where F1 = 75 Hz, F2 = 275 Hz, L1 = 11 cm, and L2 = 19 cm. The result is a structure similar to the corresponding one-dimensional manifolds of Figure 4.4. However, each curve is now spread into a two-dimensional ribbon by the fundamental frequency variation.
4.5 Frequency Sampling Normalization
In the previous section, we defined a two-tube vowel manifold that incorporates fundamental frequency variation. We have not yet addressed the complication that the n-th coordinates of two points resulting from unequal fundamentals are not Fourier coefficients of the same frequency. That is, the coefficient βn for a solution with fundamental frequency f is the amplitude at frequency nf, while the coefficient β′n for a solution with fundamental f′ ≠ f is the amplitude at frequency nf′ ≠ nf. When dealing with Fourier transforms of real speech data across speakers with varying pitch, the positions of the frequency samples are determined by the time length of the waveform. Thus, for accurate simulation, the 50-point transforms defined by Equation 4.4 for a given fundamental f0 must be mapped to the corresponding Fourier amplitudes sampled at multiples of some reference frequency fref.
This frequency sampling normalization can be accomplished while working in the frequency domain by considering a finite time sample window, which converts our discrete Fourier coefficients into a continuous Fourier spectrum that can be sampled at any desired frequency interval. Consider the steady-state solution p(t) sampled for t ∈ [−T, T]. The resulting waveform can then be written m(t) = p(t) R_T(t), where

    R_T(t) = \begin{cases} 1 & \text{for } t \in [-T, T] \\ 0 & \text{otherwise.} \end{cases}    (4.8)
[Figure 4.10: Principal component plots of M2(P; F1, F2, L1, L2) for each phoneme (schwa, /a/, /æ/, /i/, /u/, /y/).]
Taking the Fourier transform, we have m̂(ω) = p̂(ω) ∗ R̂_T(ω), where

    \hat{R}_T(\omega) = \frac{2 \sin(\omega T)}{\omega}.    (4.9)

Now, we know from Section 1.2.2 that an N-tube Fourier series solution is determined by

    \hat{p}(\omega) = \sum_{n=1}^{H} \frac{s_n \pi}{i} \left[ \delta(\omega - n\omega_0) - \delta(\omega + n\omega_0) \right] g(\omega),    (4.10)

where ω0 is the fundamental angular frequency of the original steady-state solution, g(ω) is determined by the resonator geometry according to Equation 1.53, and the {sn} are determined according to Equation 4.2. We can then calculate m̂(ω) using the δ-function property of Equation 1.41, giving

    \hat{m}(\omega) = \sum_{n=1}^{H} \frac{s_n \pi}{i} \left[ g(n\omega_0) \frac{\sin((\omega - n\omega_0) T)}{\omega - n\omega_0} - g^*(n\omega_0) \frac{\sin((\omega + n\omega_0) T)}{\omega + n\omega_0} \right],    (4.11)

where we have used g(−nω) = g*(nω).
The resulting normalized frequency spectra are given by (cf. Equation 4.4)

    \gamma_n = |\hat{m}(n\omega_{ref})|, \qquad n = 1, \ldots, K,    (4.12)

where K is the number of frequency spectrum samples and ωref = 2πfref is the angular frequency sampling interval. Note that since the sinc function is C^∞, it follows that the {γn} still define a diffeomorphism, and thus the subsets of Euclidean space defined by these amplitudes remain smooth manifolds.
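A minimal numerical sketch of this normalization is given below. It assumes the source harmonic weights sn (Equation 4.2) and the complex resonator gains g(nω0) (Equation 1.53) have already been computed; the function name and default parameter values are illustrative only.

```python
import numpy as np

def normalized_spectrum(s, g, f0, T=0.050, f_ref=10.0, K=500):
    """Frequency sampling normalization of Equations 4.9-4.12 (sketch).

    s : source harmonic weights s_n, n = 1..H (assumed given, Equation 4.2)
    g : complex resonator gains g(n*omega_0)   (assumed given, Equation 1.53)
    Returns gamma_n = |m_hat(n * omega_ref)| for n = 1, ..., K (Equation 4.12).
    """
    H = len(s)
    w0, w_ref = 2 * np.pi * f0, 2 * np.pi * f_ref
    n = np.arange(1, H + 1)

    gamma = np.empty(K)
    for k in range(1, K + 1):
        w = k * w_ref
        # Equation 4.11: each harmonic contributes two frequency-shifted terms.
        # sin(x*T)/x is written as T*np.sinc(x*T/np.pi), since np.sinc(y) = sin(pi*y)/(pi*y).
        pos = g * T * np.sinc((w - n * w0) * T / np.pi)
        neg = np.conj(g) * T * np.sinc((w + n * w0) * T / np.pi)
        gamma[k - 1] = np.abs(np.sum((s * np.pi / 1j) * (pos - neg)))
    return gamma
```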
Now that we have a normalized geometrical representation of solutions
with varying fundamental frequencies, we can continue by defining the corresponding solution manifolds. We define the two-tube solution manifold of this
form for phoneme P as
    M_2(P; T, f_{ref}, K, F_1, F_2, L_1, L_2) = \{ (\gamma_1, \gamma_2, \ldots, \gamma_K) \mid L \in (L_1, L_2),\ f_0 \in (F_1, F_2) \},    (4.13)
where 2T is the time window, fref is the reference frequency interval, and K
is the number of samples in the spectrum. Figure 4.11 shows the 3D principal
component projection of this manifold for each phoneme in Table 4.1, where
T = 0.050 s, fref = 10 Hz, K = 500, F1 = 75 Hz, F2 = 275 Hz, L1 = 11 cm, and
L2 = 19 cm.
We can see that the normalized data has lost the ribbon-like manifold structure evident in Figure 4.10. We are left with a cluster of points with no clearly discernible structure, at least in the discrete approximation of the principal component projection shown here.
[Figure 4.11: Principal component plots of M2(P; T, fref, K, F1, F2, L1, L2) for each phoneme individually (schwa, /a/, /æ/, /i/, /u/, /y/).]
Thus, we can conclude that frequency sampling normalization greatly increases the structural complexity of the vowel manifold.
In all the manifolds presented in this chapter, we find curved and spanning
subsets of the acoustic space. In each case, we are dealing with a relatively
low-dimensional generating parameter space. Ideally, dimensionality reduction would be achieved by transforming the acoustic signal into this easily
managed low-dimensional parameter space. However, noise and model inaccuracies could conceivably limit the effectiveness of explicitly using the
above-derived maps. Therefore, it is desirable to use methods that exploit this
low-dimensional structure while not relying on knowledge of the exact form
of the map. We will examine such a method in the following chapter.
Chapter 5
Vowel Manifolds in the
Graph Laplacian Eigenbasis
In Chapter 4, we touched on the topic of linear separability of the manifolds for
each vowel phoneme. We found that for two-tube models, certain phoneme
manifolds, while separable in the ambient 50-dimensional space, are not separable in the 3-dimensional principal component projection. We also found that
for the phonemes that had N -tube vocal tract profiles, a separating linear hyperplane existed both in the ambient space and in the 3-dimensional principal
component projection space. From these facts arise the issues of dimensionality
reduction and the possible ramifications of noise in the dataset.
Principal component projection is a linear mapping and therefore cannot
improve separability of classes. Furthermore, as noise is increased, classes
in the 50-dimensional ambient space will be rendered inseparable, further increasing the cost of dimensionality reduction. It is therefore desirable to use
more complex mappings that preserve or even improve separability as well as
reduce data dimension, both with and without the presence of noise.
5.1 The Laplacian Eigenbasis
One approach to handling data reduction with linearly inseparable data (on
account of noise, etc.) is to use a projection onto a non-linear basis. An example of such a basis is the nearest neighbors graph Laplacian eigenbasis, in the
manner presented by Niyogi and Belkin [1]. Below is a brief summary of their
method, the results of which will be presented in the following sections.
Consider k data points x1, . . . , xk ∈ R^H. We can construct an adjacency graph with one vertex Vi per data point xi. Let Xn(Vi) be the set of the n nearest vertices to vertex Vi under the Euclidean distance metric. We connect vertices Vi and Vj with an edge of weight one if and only if Vi ∈ Xn(Vj) or Vj ∈ Xn(Vi).¹
¹ Niyogi and Belkin provide many variations on this condition. However, this simplest form is sufficient in this context.
This graph can be represented by the adjacency matrix W, which is symmetric and binary-valued in this case. From this, we can determine the so-called graph Laplacian L = D − W, where D is the diagonal matrix with elements D_{ii} = \sum_j W_{ji}.
Solving the eigenvalue problem Le = λe results in a set of eigenfunctions e1, . . . , ek ∈ R^k, which are sorted by their corresponding eigenvalues 0 = λ1 ≤ · · · ≤ λk. The projection of data point xi onto an m-dimensional subset of the graph Laplacian eigenbasis is then determined by

    P_m(x_i) = (e_2(i), \ldots, e_{m+1}(i)),    (5.1)

where en(i) indicates the i-th component of vector en. Note that the trivial zero-eigenvalue eigenvector e1 is excluded.
The sorted eigenbasis {ei} provides a projection that reflects the length of the path between points in the adjacency graph. That is, the simpler the connection between two points, the closer their projection values. Therefore, if a manifold is sampled sufficiently densely, the projections of any two points on the manifold will be relatively close in projection space. Points on two non-intersecting manifolds will likely have a long or even infinite connecting path and will thus have very disparate projections. This method can therefore serve both to improve the linear separability of data classes and to reduce the data dimension by choosing m < H.
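The construction summarized above can be written compactly as follows. This is a sketch of the procedure as described in this section rather than Belkin and Niyogi's own implementation; the neighborhood size n_neighbors is an illustrative parameter.

```python
import numpy as np
from scipy.linalg import eigh
from scipy.spatial.distance import cdist

def laplacian_projection(X, n_neighbors=8, m=3):
    """Project the rows of X (k points in R^H) onto the graph Laplacian eigenbasis,
    returning P_m(x_i) = (e_2(i), ..., e_{m+1}(i)) as a (k, m) array."""
    k = X.shape[0]
    dist = cdist(X, X)                               # pairwise Euclidean distances
    W = np.zeros((k, k))
    for i in range(k):
        nn = np.argsort(dist[i])[1:n_neighbors + 1]  # n nearest vertices, excluding i itself
        W[i, nn] = 1.0
    W = np.maximum(W, W.T)                           # edge if i is a neighbor of j or vice versa

    D = np.diag(W.sum(axis=1))
    L = D - W                                        # graph Laplacian
    _, vecs = eigh(L)                                # eigenvalues 0 = lambda_1 <= ... <= lambda_k
    return vecs[:, 1:m + 1]                          # drop the trivial constant eigenvector e_1
```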
5.2 Dimensionality Reduction
Dimensionality reduction, while useful for increasing the efficiency of algorithms, may obscure separations between data classes, in this case the various
vowel phonemes. In the case of the two-tube vowel solutions, the variation
in the ambient space over a given phoneme manifold may be greater than the
average variation between two phonemes. Therefore, a linear principal component projection that acts on the data of multiple phoneme classes will tend to spread out the solution points within each phoneme class, leaving some classes entangled. This pitfall is exhibited by the /a/-/æ/ pair
(see Figure 4.5).
In Figure 5.1, we show the union of the manifolds M2(/a/; f0, L1, L2) and M2(/æ/; f0, L1, L2) (see Section 4.2) projected onto the first three principal component axes determined using data from both manifolds. (From here on, the union of these manifolds is denoted by M2^a ∪ M2^æ.) As expected from Figure 4.5, there does not exist a linear hyperplane that separates the two manifolds in the 3D principal component space. They are, however, separable in the ambient 50-dimensional space, indicating that the PCA method introduces precisely the damaging effects one wishes to avoid.
In Figure 5.2, we show the projection of the same manifolds onto the nearest neighbors graph Laplacian eigenfunction with the smallest corresponding eigenvalue (in this case zero).
[Figure 5.1: Principal component plot of M2^a ∪ M2^æ, shown in several principal component views.]
Here, all solutions using /a/ configurations project onto zero, while those using /æ/ configurations project onto a collection of four relatively remote points. The classes are completely separable by a zero-dimensional hyperplane (a point) in this one-dimensional basis. In this simple yet instructive case, the graph Laplacian eigenbasis perfectly isolates the two phoneme classes. This occurs because the nearest neighbors adjacency graph contains two connected components, one for each class; that is, there does not exist an edge from any /a/ point to any /æ/ point. Therefore, the minimum eigenvalue has a corresponding eigenvector that functions primarily to separate the two classes. Mathematically, this is a direct consequence of the manifold structure of the class data.
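The connected-components argument can be checked on a toy graph; the following demonstration is illustrative only and is not taken from the thesis software.

```python
import numpy as np

# Two disconnected 3-cycles: the graph Laplacian has a two-dimensional null space
# (one zero eigenvalue per connected component), and every null-space vector is
# constant within each component, so thresholding such a coordinate separates
# the two "classes" of vertices.
A = np.zeros((6, 6))
for i, j in [(0, 1), (1, 2), (2, 0), (3, 4), (4, 5), (5, 3)]:
    A[i, j] = A[j, i] = 1.0
L = np.diag(A.sum(axis=1)) - A
vals, vecs = np.linalg.eigh(L)
print(np.round(vals, 3))         # the first two eigenvalues are zero
print(np.round(vecs[:, :2], 3))  # both columns are constant on {0,1,2} and on {3,4,5}
```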
5.3 Noise on the Manifold
So far in our study we have ignored the unavoidable issue of noise present in
actual recorded speech data. Deviations from the theoretical data presented above can arise systematically from the approximations used in our tube model analysis, as well as randomly from ambient room and electronics noise. Either contribution will function to spread our individual phoneme manifolds,
possibly in unpredictable ways. Therefore, our approximated theoretical data
that is linearly separable in the ambient and, in the case of the N -tube vowels,
principal component space may not be perfectly separable when these noise
sources are taken into account.
[Figure 5.2: Graph Laplacian eigenfunction projection (coordinate e2) of M2^a ∪ M2^æ (cf. Figure 5.1).]
We encapsulate all of the possible noise sources into the addition of a Gaussian distributed random variable to each frequency component of the manifold M2^a ∪ M2^æ. We begin by introducing a small amount of noise, resulting in an 18 dB signal-to-noise ratio averaged over the components. As an example, Figure 5.3 shows the standard /a/ spectrum (shown originally in Figure 4.3) after the addition of this level of noise. The results for the principal component projection using this noisy data are shown in Figure 5.4, and the results for the graph Laplacian eigenbasis are shown in Figure 5.5. In the principal component plot, there is a significant spreading of data points, resulting in near overlap of the two classes. However, in the graph Laplacian projection, the classes are still distinct and remain linearly separable in a single dimension.
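One simple way to realize this noise model is sketched below; the thesis does not specify the exact routine, so here the noise variance is set from the mean power of the stored spectrum vectors to reach the target average SNR.

```python
import numpy as np

def add_gaussian_noise(X, snr_db, seed=0):
    """Add i.i.d. zero-mean Gaussian noise to every frequency component of the
    spectra in X (one point of the manifold per row) at a target SNR in dB."""
    rng = np.random.default_rng(seed)
    signal_power = np.mean(X ** 2)
    noise_power = signal_power / 10.0 ** (snr_db / 10.0)
    return X + rng.normal(0.0, np.sqrt(noise_power), size=X.shape)
```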
We continue by introducing a larger amount of noise, this time with a
signal-to-noise ratio of 8 dB. The resulting principal component and graph
Laplacian plots are shown in Figures 5.6 and 5.7. Here we use a 3-dimensional Laplacian eigenfunction projection (i.e., m = 3 in Equation 5.1). The class
overlap increases in the principal component plot, while, even with this relatively high level of noise, the data in the graph Laplacian basis remains linearly
separable to a large extent.
To further expose this performance trend, Figure 5.8 shows the linear separability of four data representations of the manifold M2^a ∪ M2^æ as a function of signal-to-noise ratio (SNR). These representations are the original 50-dimensional transform data, the 3-dimensional principal component projection, the 1-dimensional graph Laplacian projection, and the 3-dimensional graph Laplacian projection.
[Figure 5.3: Amplitude spectrum of the phoneme /a/ from Figure 4.3 after the addition of noise with SNR = 18 dB.]
[Figure 5.4: Principal component projection of M2^a ∪ M2^æ with the introduction of noise at SNR = 18 dB.]
[Figure 5.5: Nearest neighbors graph Laplacian eigenfunction projection of M2^a ∪ M2^æ with the introduction of noise at SNR = 18 dB.]
[Figure 5.6: Principal component projection of M2^a ∪ M2^æ with the introduction of noise at SNR = 8 dB.]
[Figure 5.7: Nearest neighbors graph Laplacian eigenfunction projection (coordinates e2, e3, e4) of M2^a ∪ M2^æ with the introduction of noise at SNR = 8 dB.]
We can see that for low noise levels (i.e., high SNR), all representations except the principal component projection are 100% linearly separable. As the level of noise is increased, the one-dimensional graph Laplacian projection deviates from perfect separability before its 3D counterpart, as expected. Still, the 1D graph Laplacian projection outperforms the 3D PCA projection down to an SNR of 12 dB. For SNRs greater than 10 dB, the 3-dimensional graph Laplacian representation maintains class distinction as well as the original 50-dimensional data. The 3D graph Laplacian projection outperforms the principal component projection all the way down to an SNR of 4 dB, below which the two 3-dimensional representations are roughly equivalent. Below 4 dB, further increases in noise begin to largely degrade class separability for all representations, though the original 50-dimensional transform data maintains an accuracy above 90% down to nearly 0 dB SNR (i.e., equal levels of signal and noise).
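The separability measure plotted in Figure 5.8 is not spelled out in detail; one simple stand-in, assumed here, is the training accuracy of a lightly regularized linear classifier.

```python
import numpy as np
from sklearn.svm import LinearSVC

def linear_separability(X, y):
    """Percentage of labeled points that a linear classifier can fit exactly,
    used as a stand-in for the separability measure of Figure 5.8."""
    clf = LinearSVC(C=1e4, max_iter=50000)
    clf.fit(X, np.asarray(y))
    return 100.0 * clf.score(X, np.asarray(y))
```

A curve like the one in Figure 5.8 can then be traced by sweeping the SNR, adding noise as sketched above, computing each of the four representations, and recording the resulting separability.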
For low to moderate noise levels, the nearest neighbors graph Laplacian representation is more useful than PCA at preserving linear separability through dimensionality reduction. The reason is that the adjacency graph can remain largely unchanged at these noise levels, as each point's set of nearest neighbors remains the same. For very high levels of noise, the original 50-dimensional transforms are the most useful, while the principal component and graph Laplacian eigenbasis projections are equivalent. The performance degradation of the graph Laplacian method at these high noise levels indicates that the manifold structure has become too obscured, resulting in interclass edges in the adjacency graph.
[Figure 5.8: The linear separability (%) of the four data representations as a function of signal-to-noise ratio (dB).]
While the graph Laplacian method does eventually break down for very
high levels of noise, it otherwise consistently outperforms PCA at preserving
class separation through dimensionality reduction. As expected, the underlying manifold structure of the acoustic tube vowel approximations results in
successful application of the manifold-based dimensionality reduction algorithm.
Conclusion
This paper has presented a derivation of a class of manifolds defined by solutions of the traditional acoustic tube articulatory model. These manifolds are
extrinsically curved in, and span, the ambient acoustical space. Assuming real
vowel data is sufficiently approximated by these manifolds, the existence of
an underlying low-dimensional structure for spoken vowels follows. Furthermore, this structure is sufficiently complex to complicate the successful application of linear classification and dimensionality reduction techniques. Since humans
can differentiate between these classes with ease, we might assume that there
is something inherently non-linear at work.
Towards a possible reconciliation, the geometric point of view presented in
this paper justifies the application of the class of manifold-based algorithms,
useful in machine learning, speech recognition, and data representation. The
positive results using the Laplacian eigenmap dimensionality reduction method
that were presented are only an example of the possibilities this perspective
admits. Furthermore, a geometric representation provides an alternative entry point into understanding non-linear perceptual phenomena. The possible
avenues of future research are outlined below.
• Approximate Manifold Structure of Other Phonemes
The analysis presented in this paper was limited to tube models to approximate sustained vowel sounds. Modeling other classes of phonemes,
such as fricatives and nasals, involves introducing turbulent noise sources
at vocal tract constrictions and incorporating branching resonators. This
will involve the introduction of stochastic processes in determining the
output transforms, which will prevent the existence of a diffeomorphism
between the configuration and acoustic spaces. However, as we saw in
Chapter 5 with the introduction of noise, the Laplacian-based manifold
algorithm was stable to deviations from a precise manifold structure.
Thus, it is proposed that manifold-based algorithms will continue to provide a performance gain even on these more complicated data sets.
However, a systematic study would be required to justify this claim.
• Semi-supervised Learning
Recently, several manifold-learning algorithms have been presented [3,
1, 4, 5, 6, 7]. In particular, Niyogi, Belkin, and Sindhwani [3] proposed a
semi-supervised manifold learning algorithm that incorporates the method
described in Section 5.1 into a regularization term of the objective function for various classifiers. The positive performance results for the Laplacian data reduction method strongly indicate that the corresponding
semi-supervised learning method will be equally successful. A full performance study is necessary, using both the synthetic vowel manifolds
derived here and actual recorded phonemes from the TIMIT database.
• Non-linear Acoustic and Perceptual Phenomena
Much has been written in recent years about the perceptual magnet effect,
which refers to a warping of the perceptual space towards categorical
centers. This means that equally spaced points in acoustic space near a
categorical center will map closer in perceptual space than equally spaced
points further from this categorical center. Thus, non-linear perceptual
shifts can result from fixed displacements in acoustical space.
A widely discussed acoustic-phonetic framework is the quantal theory of speech [10]. The articulatory-acoustic half of the theory states
that there are regions of articulatory (configuration) space where large
changes lead to relatively small changes in acoustic character, and vice
versa. This implies a largely non-linear relationship between configuration space and acoustic space.
The extrinsic curvature of the vowel manifolds derived in this paper indicates an implicit non-linearity of acoustic space. The relationship between the above-mentioned acoustic-perceptual effects and this manifold
structure will need further study. However, the geometric approach presented here provides an alternate entry point into these subjects.
Acknowledgements
I would like to thank Partha Niyogi for suggesting this project topic and providing guidance in its completion. I would also like to thank Mikhail Belkin
and Vikas Sindhwani for providing their code and assistance in implementing
the Laplacian eigenmap algorithm.
Bibliography
[1] Mikhail Belkin and Partha Niyogi. Laplacian Eigenmaps for Dimensionality Reduction and Data Representation. Technical Report TR-2002-01,
University of Chicago, Computer Science Department, December 2001.
[2] Kenneth N. Stevens. Acoustic Phonetics. MIT Press, Cambridge, MA, 1998.
[3] Mikhail Belkin, Partha Niyogi, and Vikas Sindhwani. Manifold Regularization: A Geometric Framework for Learning from Examples. Technical Report TR-2004-06, University of Chicago, Computer Science Department, August 2004.
[4] Sam T. Roweis and Lawrence K. Saul. Nonlinear dimensionality reduction
by locally linear embedding. Science, 290:2323–2326, 2000.
[5] Lawrence K. Saul and Sam T. Roweis. Think globally, fit locally: Unsupervised learning of low dimensional manifolds. Journal of Machine Learning Research, 4:119–155, 2003.
[6] D. L. Donoho and C. Grimes. Hessian eigenmaps: Locally linear embedding techniques for high-dimensional data. Proceedings of the National Academy of Sciences, 100:5591–5596, 2003.
[7] Joshua B. Tenenbaum, Vin de Silva, and John C. Langford. A global
geometric framework for nonlinear dimensionality reduction. Science,
290:2319–2323, 2000.
[8] D. L. Donoho and C. Grimes. When does ISOMAP recover natural parameterization of families of articulated images? Technical Report TR-2002-27,
Department of Statistics, Stanford University, August 2002.
[9] Gunnar Fant. Acoustic Theory of Speech Production. Mouton and Co., Paris,
1970.
[10] Kenneth N. Stevens. On the quantal nature of speech. Journal of Phonetics,
17:91–97, 1989.