The Manifold Nature of Vowel Sounds
by Aren Jansen

Master's Paper
Advisor: Partha Niyogi
Department of Computer Science
The University of Chicago
September 14, 2007

Abstract

Recently there has been great interest in geometrically motivated approaches to data analysis and pattern recognition. Low-dimensional structure in higher-dimensional data can be exploited by manifold-based data reduction and learning algorithms to improve performance. The existence of such a structure in speech has not been formally documented. Toward this end, I present a derivation of the approximate discrete spectra of sustained vowel phonemes using standard tube models of the vocal tract. Adopting a geometrical approach, each N-point discrete frequency spectrum produced using these models represents a point in R^N. Given a continuous range of vocal tract model parameters, I demonstrate, either formally or graphically, that the subsets of Euclidean space traced out by the positions of the resulting spectra form low-dimensional, extrinsically curved manifolds that span the ambient space. Tube model parameters that approximate the vocal tract configurations for various vowel phonemes determine the approximate manifold structure for several sustained vowel sounds. Using the manifolds for the phonemes /a/ and /æ/ as input, the manifold-based Laplacian eigenmap dimensionality reduction algorithm of Belkin and Niyogi [1] outperforms traditional principal component analysis, both with and without the introduction of noise.

Contents

Introduction
1 The Physics of Acoustic Tubes
  1.1 Single Tube Analysis
    1.1.1 Continuity and Conservation of Mass
    1.1.2 The Wave Equation and Boundary Conditions
    1.1.3 The General Solution
    1.1.4 The Sinusoidal Source Solution
  1.2 Multiple Tube Analysis
    1.2.1 The N-Tube General Solution
    1.2.2 The N-Tube Solution for a Sinusoidal Source
2 Speech Sound Generator Software
  2.1 Simulation Mode
  2.2 Data Mode
  2.3 Implementation Details
3 The Manifold of Acoustic Tube Solutions
  3.1 The Single Tube Solution Manifold
  3.2 The 2-Tube Solution Manifold
4 Tube Models of Vowel Production
  4.1 Introducing the Glottal Source
  4.2 2-Tube Models of Vowel Production
  4.3 N-Tube Models of Vowel Production
  4.4 Fundamental Frequency Variation
  4.5 Frequency Sampling Normalization
5 Vowel Manifolds in the Graph Laplacian Eigenbasis
  5.1 The Laplacian Eigenbasis
  5.2 Dimensionality Reduction
  5.3 Noise on the Manifold
Conclusion

List of Figures

1.1 Schematic plot of Ks(f).
2.1 Speech sound generator software in simulation mode.
2.2 Speech sound generator software in data mode.
3.1 Sample plot of M1(L1, L2).
3.2 Sample plot of M1(L1, L2) (zoomed on origin).
3.3 Sample plot of M2(L1, L2).
3.4 Sample plot of M2(IRL, IRA).
4.1 (a) Glottal source spectrum and (b) corresponding waveform where f0 = 100 Hz.
4.2 Schematic plot of the vowel structure within the manifold of acoustic two-tube model solutions, M2.
4.3 Amplitude spectrum for the two-tube /a/ configuration given in Table 4.1.
4.4 Principal component plots of M2(P; f0, L1, L2) for each phoneme individually.
4.5 Principal component plot of M2^vowel(f0, L1, L2).
4.6 Principal component plot of M2(f0, L, A), where colors differentiate length ratios.
4.7 Principal component plot of M2(f0, L, A), where colors differentiate area ratios.
4.8 Principal component plots of MN(P; f0, L1, L2) for each phoneme individually.
4.9 Principal component plot of MN^vowel(f0, L1, L2).
4.10 Principal component plots of M2(P; F1, F2, L1, L2) for each phoneme.
4.11 Principal component plots of M2(P; T, fref, K, F1, F2, L1, L2) for each phoneme individually.
5.1 Principal component plot of M2^a ∪ M2^æ.
5.2 Graph Laplacian eigenfunction projection of M2^a ∪ M2^æ (c.f. Figure 5.1).
5.3 Amplitude spectrum of Figure 4.3 after the addition of noise with SNR = 18 dB.
5.4 Principal component projection of M2^a ∪ M2^æ with the introduction of noise at S/N = 18 dB.
5.5 Nearest neighbors graph Laplacian eigenfunction projection of M2^a ∪ M2^æ with the introduction of noise at S/N = 18 dB.
5.6 Principal component projection of M2^a ∪ M2^æ with the introduction of noise at S/N = 8 dB.
5.7 Nearest neighbors graph Laplacian eigenfunction projection of M2^a ∪ M2^æ with the introduction of noise at S/N = 8 dB.
5.8 The linear separability of four data representations, as a function of signal-to-noise ratio.

Introduction

In the past 45 years, the fields of speech acoustics and speech recognition have for the most part developed independently. The focused study of speech acoustics dates back to Gunnar Fant's 1960 book Acoustic Theory of Speech Production, in which he develops several physical models of speech production. Fant adopts the source-filter approach to speech production, which assumes that a speech signal can be uniquely specified by independent source and filter characteristics. The traditional sources include glottal excitations and turbulent flow at constrictions. The vocal tract acts as the acoustic filter that, when applied to the source, determines the output signal.
The simplest filter model presented is the twin-tube resonator, which approximates the vocal tract as a pair of concatenated cylindrical tubes. The relative cross section and length proportions of the tubes determine the filter characteristics. Fant also presents models for several other production phenomena, such as nasal coupling and tongue position. Since 1960, Fant and others have developed more linguistically motivated acoustic models, covering all classes of phonemes [2]. The application of acoustical physics to speech production models leads to a wide range of insights into the acoustic nature of various classes of phonemes and their relation to the vocal tract configurations that produce them. In particular, for the class of sustained vowel sounds (e.g. /a/ as in “pot” or /u/ as in “boot”), Fant’s crude twin-tube approximations to the vocal tract profile are sufficient to capture the key acoustic features. The utility of these models is to provide a map between a small set of articulatory parameters and a phoneme’s approximate acoustic signal. However, developments in this field have traditionally found audience in the linguistic and applied physics/engineering communities. The development of speech recognition has traditionally been directed towards generic statistical methods for classification. These methods take as input a set of fully or partially labelled data vectors, whose components can be anything from digital image pixel intensities to daily weather measurements. While such generic approaches are useful in having a large domain of application, they do not factor in the possibly useful constraints imposed by the nature of a particular input data set. In the case of speech, such constraints are determined by the physical mechanisms that produce it; humans cannot produce every arbitrary acoustic signal. 4 Recently, several algorithms have been proposed that both exploit the underlying structure of natural data and remain application-independent. To accomplish this task, these algorithms share the assumption that high-dimensional input data sets are generated in analytically from a much smaller number of degrees of freedom. These algorithms remain generic since the exact form of the underlying structure is inconsequential; it is simply sufficient that a lower-dimensional representation exists. It should also be noted that these methods share a geometric interpretation of the data, where an n-dimensional input vector is regarded as a point in an n-dimensional Euclidean space with a standard Euclidean distance metric. A particular formalization of this postulate is to assume that natural data resides on or near a low-dimensional manifold embedded in the ambient space. A number of algorithms based on this formalization have been presented, including Laplacian eigenmaps [3, 1], locally linear embedding (LLE) [4, 5], Hessian locally linear embedding (hLLE) [6], and ISOMAP [7]. These algorithms have shown some success on benchmark problems. It is therefore relevant to consider the extent to which the manifold assumption may be true in natural data. Individual phenomena must be investigated to provide an application-specific justification for the manifold assumption. For example, Donoho and Grimes [8] characterize families of images in terms of their manifold properties. The articulatory parameterizations of Fant and others strongly indicate the existence of a low-dimensional structure of speech. 
The interest of this paper is to formally consider speech sounds from this geometric point of view and determine the precise nature of the manifold structure that exists at its foundations. Towards this end, I analyze sounds generated by series of concatenated tubes excited by a periodic source. This system serves as a simple model for the vocal tract and glottal excitation and has been shown to be useful in the tradition of acoustic modeling in speech production and acoustic phonetics. This paper presents a derivation of the frequency spectra generated by these acoustic tube models, which are used to define solution manifolds. If it is postulated that real vowel sounds are adequately approximated by the acoustic tube model solutions, then the existence of an underlying manifold structure in real speech follows. This has several implications, of which the most relevant to this document is the successful application of manifold-based algorithms, such as the Laplacian dimensionality reduction algorithm of Belkin and Niyogi [1] and the semi-supervised Laplacian learning algorithm of Belkin, Niyogi, and Sindhwani [3]. In addition, the consequences of this manifold interpretation has applications to the study of non-linear speech phenomena and alternative spectrogram methods, two possible areas of future research. Chapter 1 presents the relevant acoustic physics and a derivation of the tube model solution spectra. Chapter 2 provides a brief overview to the Speech Sound Generator software created to study and demonstrate the sounds produced using the tube models. Chapter 3 develops a general definition and formal justification for the acoustic tube solution manifolds. In Chapter 4, this discussion is extended to the solution manifold structure of the tube models 5 of vowel production. Chapter 5 investigates the performance of a Laplacianbased dimensionality reduction algorithm using the acoustic tube model data, both with and without the introduction of noise. Finally, I conclude with a summary of findings and a discussion of future research goals resulting from the speech perspective presented in this paper. 6 Chapter 1 The Physics of Acoustic Tubes Sound is transmitted through a medium in the form of a longitudinal pressure wave. In free space, a given source signal will propagate through the air with its acoustic nature fundamentally unchanged. However, if the space through which a wave travels is in some way constricted, the frequency content of the acoustic signal can shift dramatically. One canonical system of constraint is that of acoustic tubes. When an input signal is passed through a tube, some of the frequency content tends to reflect back, resulting in resonance. The frequencies at which such resonance occurs are determined solely by the shape and dimensions of the acoustic tube. The ultimate output signal is therefore dominated by these resonant frequencies, resulting in a modulated signal much different from the input. This property of acoustic tubes is exploited by man-made instruments such as pipe organs and flutes, as well as by natural systems such as the human voice. Historically, there have been two primary data representations used in speech analysis and recognition. The first uses the entire discrete Fourier spectrum of a speech signal, typically resulting in high-dimensional data sets. The second approach goes one step further, decomposing the frequency spectrum into a relatively small number of resonant frequencies, known as formants. 
Since formant frequencies are maxima positions, they are also the most audible, as the ear performs a sort of Fourier decomposition before perception begins. The formant approach has two advantages. First, the formant representation is typically low-dimensional (3 or 4 formants), which can lead to short processing times. Second, formants tend to characterize certain classes of speech sounds very well, leading to improved classification. However, when analyzing discretely-sampled speech data, the frequency spectrum tends to be more chaotic, due to an implicit convolution with a sinc function. Therefore, the determination of these formant positions can introduce enough speed bottlenecks and uncertainty to counteract the two benefits. In this paper, the entire spectral representation is of primary interest. However, as a result of the formant tradition, presentation of the acoustics of vocal tract resonators has typically been limited to finding the poles of the filter 7 transfer functions that arise from particular articulatory configurations. This chapter presents a complete derivation of the output waveform and spectrum arising from an arbitrary configuration of concatenated tubes and an arbitrary source function, starting with basic physical principles. Several approximations are made to contain the analysis of this system within the realm of linear dynamics; these are noted when applied. We begin with the derivation for a single uniform tube. 1.1 Single Tube Analysis The acoustic analysis of the vocal tract resonator becomes a tractable problem with the introduction of several approximations. First, the vocal tract walls are modeled as rigid, assuming their impedance to be much greater than that of air. This approximation is reasonable for the frequency range observed in human speech [2]. Using this approximation, we can neglect energy loss in the vocal tract as well as time-dependence of the cavity profile. Next, if we assume that the transverse vocal tract dimension is much smaller than the signal wavelength, we can assume solutions are uniform in the transverse dimensions. Thus, the problem reduces to an analysis in one spatial dimension. Furthermore, this approximation allows us to assume the solutions are dependent solely on the cross-sectional area of the vocal tract and not on the bounding contour. This set of approximations is valid for sound frequencies under about 5 kHz [2]. 1.1.1 Continuity and Conservation of Mass To derive a working form of the acoustic equations, we must make the assumption that the acoustic signal passes through the air with pressure, density, and velocity perturbations that are small compared with their equilibrium values (i.e., p0 + p ≈ p0 , where p and p0 are the pressure perturbation and equilibrium pressure, respectively). Under this assumption, the one-dimensional equation of continuity for compressible fluid flow, ∂ρ/∂t + ∂(ρu)/∂x = 0, is approximately ρ0 ∂U ∂ρ = , ∂t A(x) ∂x (1.1) where A(x) is the cross-sectional area of the tube as a function of position, U = Au is the volume velocity, ρ is the density perturbation, and ρ0 is the equilibrium density. If we also assume that perturbations are sufficiently transient to neglect heat transfer, we can make use of the adiabatic ideal gas relation between pressure and density: pρ−γ = const, where γ = cp /cv = 5/3 for an ideal gas. Discarding second order terms according to the assumption of small perturbations, we have 8 dρ dp =γ . 
p0 ρ0 (1.2) Combining Equations 1.1 and 1.2 we arrive at γp0 ∂U ∂p = , ∂t A(x) ∂x (1.3) the first of the acoustic equations. Neglecting the effect of gravity and assuming the propagating fluid is frictionless, conservation of momentum gives the relation (again, neglecting second order terms) ∂p ρ0 ∂U = , ∂x A(x) ∂t (1.4) the second of the acoustic equations. Thus, mass continuity and conservation of momentum provide expressions for both the spatial and time dependence of the pressure and the volume velocity. 1.1.2 The Wave Equation and Boundary Conditions Starting from Equations 1.3 and 1.4, we can arrive at partial differential equation for the volume velocity U as functions of space and time. First, we differentiate Equation 1.3 with respect to t and Equation 1.4 with respect to x. Equating mixed partials and assuming A is time-independent (i.e., the walls are rigid) gives the wave equation in a surface of revolution with area profile A(x): 1 dA ∂U 1 ∂2U ∂2U − = ∂x2 A dx ∂x c2 ∂t2 (1.5) p where c = γp0 /ρ0 , the speed of sound in air. For our purposes, we consider uniform cylindrical tubes. Thus, A(x) = const and dA/dx = 0, so Equation 1.5 reduces to the standard one-dimensional wave equation in free space, ∂2U 1 ∂2U = 2 2. (1.6) 2 ∂x c ∂t Since we could have opted to eliminate the volume velocity in favor of pressure in the derivation, we have the complementary equation, ∂2p 1 ∂2p = 2 2. (1.7) 2 ∂x c ∂t Given a second order differential equation, we need two boundary conditions to arrive at a unique solution. The first boundary condition incorporates the source function corresponding to the input glottis-modulated volume velocity created by the lungs as a function of time, s(t). We model the glottis as 9 Ks 1.5 1.0 300 2000 f (Hz) Figure 1.1: Schematic plot of Ks (f ). a vibrating piston at the x = 0 end of the tube. We then require as the first boundary condition that U (0, t) = s(t). (1.8) The second boundary equation makes use of the fact that a sound wave faces the acoustic impedance of the surrounding environment at the open end of the tube (x = L). The acoustic impedance (a frequency-dependent, complex quantity) is defined by Z(x, ω) ≡ p̂(x, ω) Û (x, ω) , (1.9) where p̂ and Û are the forward Fourier transforms of the pressure and volume velocity. At the open end of the tube, the acoustic impedance encountered is termed the radiation impedance. This quantity can be calculated from the geometry and baffling of the tube opening and is independent of the wave equation solution. For our purposes, we use the approximate form given by Stevens [2], based on a model of a circular piston on the surface of a sphere with radius 9 cm. This form of the radiation impedance, valid up to 6000 Hz, is given by Z(L, ω) = 4ckρa ρck 2 Ks (f ) +i ≡ Zr (ω), 4π 5A (1.10) where ρ is the density of air, c is the speed of sound, A = πa2 is the area of the piston, k = ω/c is the wave number, and Ks is a real-valued frequencydependent factor approximated by the form of Figure 1.1. The second boundary condition, which is given in the frequency domain, is then Û (L, ω) = 10 p̂(L, ω) . Zr (ω) (1.11) 1.1.3 The General Solution We are now equipped with all the components necessary to arrive the general solution to Equation 1.6. To begin, we write the solution U (x, t) in terms of its time-Fourier transform Û (x, ω) as Z ∞ 1 Û (x, ω)eiωt dω. 
(1.12) U (x, t) = 2π −∞ Substituting this form into Equation 1.6 yields the ordinary differential equation 2 d 2 (1.13) + k Û (x, ω) = 0, dx2 where k = ω/c is the wavenumber for the sound wave. This is simply the equation of simple harmonic motion in x, and the general solution is given by Û (x, ω) = U1 (ω)e−ikx + U2 (ω)eikx , (1.14) where U1 and U2 are the complex amplitudes of the forward and backward traveling waves, respectively. In general, the amplitudes are frequency dependent. Starting instead from Equation 1.7, we can write a similar expression for the pressure wave: p̂(x, ω) = p1 (ω)e−ikx + p2 (ω)eikx , (1.15) where p1 and p2 are the complex amplitudes of the forward and backward traveling pressure waves, respectively. Thus the general forms for the volume velocity and pressure are given by Z ∞ 1 U (x, t) = (U1 (ω)e−ikx + U2 (ω)eikx )eiωt dω (1.16) 2π −∞ Z ∞ 1 (p1 (ω)e−ikx + p2 (ω)eikx )eiωt dω. (1.17) p(x, t) = 2π −∞ There are two relevant methods for continuing from this point. The first method that provides the entire general solution, U (x, t) and p(x, t), is treated for completeness and serves as a reference for any future model extension. The chain matrix method is useful for efficient computation of the output signal of the tube, p(L, t), and generalizes seamlessly to the case of multiple tube systems. Both methods are presented below. The Complete Solution To arrive at a complete solution, we need to determine the specific form for the complex Fourier amplitudes, U1 and U2 . To accomplish this, we must make use of the boundary conditions discussed above in Section 1.1.2. The boundary 11 condition at x = 0 (Equation 1.8) can be incorporated by first considering the source function s(t) written in terms of its Fourier transform ŝ(ω) as follows: Z ∞ 1 s(t) = ŝ(ω)eiωt dω. (1.18) 2π −∞ Imposing the boundary condition of Equation 1.8 and equating integrands gives U1 (ω) + U2 (ω) = ŝ(ω). (1.19) Thus, we have the first constraint on the functional form of U1 and U2 . The second boundary condition of Equation 1.11 will provide the second constraint required to fully determine the wave equation solution. We begin by writing this expression in the time domain as the convolution p(L, t) = Žr (t) ∗ U (L, t), (1.20) where Žr (t) is the inverse Fourier transform of Zr (ω). Next, we differentiate this expression with respect to t, giving ∂U ∂p (L, t) = Žr (t) ∗ (L, t). (1.21) ∂t ∂t We can now eliminate p from this expression by invoking Equation 1.4, giving ∂U ∂U (L, t) = AŽr (t) ∗ (L, t). ∂x ∂t Transforming back to the frequency domain, we have γp0 γp0 ∂ Û (L, ω) = AZr (ω)F[ ∂U ∂t ] = iωAZr (ω)Û (L, ω). ∂x (1.22) (1.23) Substituting the expression for Û of Equation 1.14, we arrive at the following condition on the solution: U1 (ω)e−ikL + U2 (ω)eikL γp0 . =− AZr (ω)c U1 (ω)e−ikL − U2 (ω)eikL (1.24) Solving Equations 1.19 and 1.24 results in the following expression for the timeFourier transform of the general solution Û (x, ω): −ikx ikx e e , + Û (x, ω) = ŝ(ω) B+1 i2kL 1 + B−1 e−i2kL 1 + B−1 e B+1 (1.25) where B is defined to be γp0 /AZr (ω)c = ρ0 c/AZr (ω). The corresponding pressure spectrum is then given by 12 ikx −ikx e e . + p̂(x, ω) = ŝ(ω)Zr (ω) B+1 i2kL 1 + B−1 e−i2kL 1 + B−1 B+1 e (1.26) The Chain Matrix Solution We are primarily interested in the output of the acoustic tube filter. Furthermore, the complete solution method presented above changes greatly for the case of multiple tubes. 
Therefore, we present the alternative chain matrix solution method, which generalizes seamlessly to the case of multiple tubes. We start by inserting the general solutions of Equations 1.16 and 1.17 into the acoustic Equations 1.3 and 1.4 and equate integrands to arrive at the Fourier coefficient relations p1 e−ikx + p2 eikx = − ρ0 c (U1 e−ikx − U2 eikx ) A ρ0 c (U1 e−ikx + U2 eikx ). A Combining these relations immediately gives p1 e−ikx − p2 eikx = − (1.27) (1.28) ρ0 c ρ0 c U1 and p2 = U2 . (1.29) A A We can now combine Equations 1.14 and 1.15 in matrix form in terms of only U1 and U2 as follows: e−ikx eikx U1 (ω) Û (x, ω) = . (1.30) U2 (ω) − ρA0 c e−ikx ρA0 c eikx p̂(x, ω) p1 = − Since we are only concerned with the input (x = 0) and output (x = L) of the tube, we eliminate U1 and U2 from this vector expression in favor of Û (L, ω), p̂(L, ω), Û (0, ω), and p̂(0, ω) giving cos kL i ρA0 c sin kL Û (0, ω) Û (L, ω) . (1.31) = p̂(0, ω) p̂(L, ω) cos kL i ρA0 c sin kL We now have a system of two equations with four unknowns, so we must again apply the two boundary conditions to uniquely determine the solution. The first condition of Equation 1.8 is given in the frequency domain by Û (0, ω) = ŝ(ω), (1.32) where ŝ(ω) is the Fourier transform of the source function s(t). The second boundary condition of Equation 1.11 is Û (L, ω) = 13 p̂(L, ω) , Zr (ω) (1.33) where Zr (ω) is the radiation impedance of Equation 1.10. Using Equations 1.31, 1.32, and 1.33, we can then solve for the output volume velocity spectrum, Û (L, ω), which is given by −1 AZr (ω) Û (L, ω) = ŝ(ω) cos kL − i sin kL . ρ0 c (1.34) The corresponding pressure spectrum is then given by −1 AZr (ω) , sin kL p̂(L, ω) = ŝ(ω)Zr (ω) cos kL − i ρ0 c (1.35) Note that the complete solutions of Equations 1.25 and 1.26 evaluated at x = L reduce to these forms, but the chain matrix approach more easily generalizes to multiple tube systems, as we will see later. 1.1.4 The Sinusoidal Source Solution We now turn our attention to the useful solution to the case of a sinusoidal source. Since we can express any odd periodic source function as a Fourier series of sinusoidal sources, understanding the solution to a single driving frequency will prove integral to further analysis. Consider the single-frequency sinusoidal source function s(t) = U0 sin ω0 t. (1.36) This source function can be cast into exponential form using Euler’s equation as follows: U0 iω0 t (e − e−iω0 t ). (1.37) 2i Making use of the fact that the Fourier transform of a complex exponential is simply the Dirac delta function, we have s(t) = U0 π (δ(ω − ω0 ) − δ(ω + ω0 )). (1.38) i Using our chain matrix solution of Equation 1.35, we can immediately write down ŝ(ω) = U0 π (δ(ω − ω0 ) − δ(ω + ω0 ))f (ω, L, A), i where f (ω, L, A) is defined to be the frequency-dependent factor p̂(L, ω) = −1 AZr (ω) f (ω, L, A) = Zr (ω) cos kL − i . sin kL ρ0 c (1.39) (1.40) Now that we have the Fourier transform of the solution, we can use the sifting property of the Dirac delta function, 14 Z ∞ f (ω)δ(ω − ω0 )dω = f (ω0 ), (1.41) −∞ when applying the inverse Fourier transform to arrive at the final solution in terms of f , U0 f (ω0 , L, A)eiω0 t − f (−ω0 , L, A)e−iω0 t 2i = Im(U0 f (ω0 , L, A)eiω0 t ). p(L, t) = (1.42) Here, the second form is a consequence of Zr (ω) = Z∗r (−ω) and thus f ∗ (ω, L, A) = f (−ω, L, A). 1.2 Multiple Tube Analysis With the chain matrix single tube solution of Section 1.1.3 in hand, it is a simple matter to extend the analysis to a series of N tubes with lengths {Li |i = 1, . 
. . , N } and cross-sectional areas {Ai |i = 1, . . . , N }. Relying on continuity of pressure and volume velocity at inter-tube boundaries, the solution for N concatenated tubes is equivalent to determining N single tube solutions. Thus, if we discretely approximate an arbitrary vocal tract profile A(x) with the {Li } and {Ai }, we can determine the output speech signal to desired accuracy within the bounds admissible by this framework. Moreover, we can accomplish this without having to numerically compute the solution to a much more complicated differential equation for each variation in articulatory configuration. 1.2.1 The N -Tube General Solution Let Ci be the chain matrix for tube i given by cos kLi i ρA0ic sin kLi , Ci = i ρA0ic sin kLi cos kLi which satisfies the vector equation Ûi (0, ω) Ûi (Li , ω) . = Ci p̂i (0, ω) p̂i (Li , ω) (1.43) (1.44) Here, Ui and pi are the solutions and Li the length for the i-th tube. Note that det Ci = 1. Continuity at the inter-tube boundaries imposes the conditions Ûi (0, ω) Ûi−1 (Li−1 , ω) = for i = 2, . . . , N. (1.45) p̂i (0, ω) p̂i−1 (Li−1 , ω) Therefore, it follows by induction that the output of the N th tube is given by 15 ÛN (L, ω) p̂N (L, ω) = N Y i=1 Ci ! Û1 (0, ω) p̂1 (0, ω) , (1.46) PN where L = i=1 Li indicates the position of the open end of the multi-tube system. Note that this matrix equation is of the same form as the single tube case of Equation 1.31. Thus we can proceed exactly as before with the slightly modified boundary conditions, Û1 (0, ω) = ŝ(ω), p̂N (L, ω) . Zr (ω) ÛN (L, ω) = (1.47) (1.48) QN Let us denote the composite 2 × 2 chain matrix as M = i=1 Ci . Since the determinant of the {Ci } are all 1, det M = 1 as well. Therefore, we can write the general multiple tube volume velocity spectrum solution as ÛN (L, ω) = ŝ(ω) . M22 − Zr (ω)M12 (1.49) The corresponding pressure spectrum is then given by p̂N (L, ω) = ŝ(ω)Zr (ω) . M22 − Zr (ω)M12 (1.50) The task of computing the output spectrum of an N -tube system, then, reduces to the multiplication of N 2 × 2 matrices. 1.2.2 The N -Tube Solution for a Sinusoidal Source We now turn to the sinusoidal source solution for the N -tube case. Generalizing the results of Section 1.1.4, for a single frequency source function s(t) = U0 sin ω0 t, (1.51) the output spectrum at the open end is given by U0 π (δ(ω − ω0 ) − δ(ω + ω0 ))g(ω, {Li }, {Ai }), (1.52) i where g(ω, {Li }, {Ai }) is the frequency-dependent factor of Equation 1.49, p̂(L, ω) = g(ω, {Li }, {Ai }) = Zr (ω) . M22 − Zr (ω)M12 (1.53) Here M12 = M12 ({Li }, {Ai }) and M22 = M22 ({Li }, {Ai }). It follows in the same manner of Section 1.1.4 that the output waveform is given by 16 pN (L, t) = U0 g(ω0 , {Li }, {Ai })eiω0 t − g(−ω0 , {Li }, {Ai })e−iω0 t . 2i (1.54) The second form of the single-tube solution given by Equation 1.42 relied on the fact that f ∗ (ω) = f (−ω). For an N -tube system, it remains the case that Z∗r (ω) = Zr (−ω). Also, it is easy to show that for all composite chain matrices M , Im(M22 ) = 0 and Re(M12 ) = 0. Therefore, it again follows that g ∗ (ω) = g(−ω), so we can cast the solution of Equation 1.54 into the simpler form, pN (L, t) = Im(U0 g(ω0 , {Li }, {Ai })eiω0 t ), (1.55) recovering an explicitly real output waveform. In this chapter we have developed in great detail a set of solutions to the physical system of concatenated acoustic tubes. These solutions will form the input for the geometrical representation adopted in Chapter 3, which will be extended to speech in Chapter 4. 
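Before moving on, the chain matrix computation can be made concrete with a short numerical sketch. The Python fragment below is illustrative only (it is not the Speech Sound Generator software of Chapter 2): it builds the composite matrix M of Equation 1.46 one tube at a time and evaluates the magnitude of the transfer factor g(ω, {Li}, {Ai}) of Equation 1.53 at each harmonic of a periodic source. The radiation impedance helper uses a simplified piston-style form with the factor Ks held constant, which is an assumption made only to keep the example self-contained; the physical constants and the example configuration are likewise placeholders.

    import numpy as np

    RHO0 = 1.2e-3   # approximate equilibrium density of air (g/cm^3)
    C = 3.5e4       # approximate speed of sound (cm/s)

    def radiation_impedance(omega, area, Ks=1.0):
        # Simplified radiation impedance in the spirit of Equation 1.10, with the
        # frequency-dependent factor Ks replaced by a constant (an illustrative
        # assumption, not the form used in the text).
        a = np.sqrt(area / np.pi)            # equivalent piston radius
        k = omega / C
        return RHO0 * C * k**2 * Ks / (4 * np.pi) + 1j * 4 * RHO0 * C * k * a / (5 * area)

    def output_spectrum(lengths, areas, f0, n_harmonics, source_amps=None):
        # Harmonic output amplitudes |s_n * g(n*omega_0)| at the open end of an
        # N-tube resonator, following Equations 1.43, 1.46, and 1.53.
        if source_amps is None:
            source_amps = np.ones(n_harmonics)
        beta = np.zeros(n_harmonics)
        for n in range(1, n_harmonics + 1):
            omega = 2 * np.pi * f0 * n
            k = omega / C
            M = np.eye(2, dtype=complex)
            for L, A in zip(lengths, areas):
                Ci = np.array([[np.cos(k * L), 1j * (A / (RHO0 * C)) * np.sin(k * L)],
                               [1j * (RHO0 * C / A) * np.sin(k * L), np.cos(k * L)]])
                M = Ci @ M                       # accumulate the composite chain matrix
            Zr = radiation_impedance(omega, areas[-1])
            g = Zr / (M[1, 1] - Zr * M[0, 1])    # Equation 1.53
            beta[n - 1] = source_amps[n - 1] * abs(g)
        return beta

    # Example: an arbitrary two-tube configuration (lengths in cm, areas in cm^2).
    # beta = output_spectrum([8.0, 9.6], [2.0, 16.0], f0=100.0, n_harmonics=50)
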
First, however, a brief overview of the solution demonstration software will be presented in Chapter 2. 17 Chapter 2 Speech Sound Generator Software The acoustic solutions of Sections 1.1 and 1.2 are implemented in a simulation and analysis software package Speech Sound Generator. The source code is available on the web at http://www.cs.uchicago.edu/˜aren/. This software allows for audio demonstration of solutions derived above for any N -tube configuration and source function. This software also allows for configuration parameter adjustments that correspond to the various manifold parameters developed below in Chapters 3 and 4. The two modes, simulation and data, are introduced separately below. 2.1 Simulation Mode In simulation mode, the user inputs the vocal tract profile, glottal source frequency spectrum and fundamental frequency. The user can then stream the filtered output signal to the sound device for playback. In addition, the output waveform and Fourier spectrum are displayable, as well as the vocal tract profile. The graphical user interface for this mode is shown in Figure 2.1. The simulation mode has the following features: • Input parameters can be adjusted during playback to modify the output sound in real-time by using the sliders and dials. • Vocal tract configurations may be modified by adjusting several parameters corresponding to the various manifold configuration space variables introduced in later chapters. • Vocal tract filtering may also be deactivated, allowing examination of the source signal. • Complete configurations may be saved and loaded. 18 Figure 2.1: Speech sound generator software in simulation mode. • Vocal tract profiles and source spectra can be loaded independently to mix and match externally prepared settings. • A two-second clip of sound produced by a given configuration may also be exported to a standard WAV file. 2.2 Data Mode The data mode allows the user to record speech sounds, and display either the waveform or frequency spectra in the plot window. The graphical user interface for this mode is shown in Figure 2.2. For periodic sources, the fundamental frequency will be computed using autocorrelation minimization, and displayed on the frequency spectrum plot. The fit option, if enabled before recording, will attempt to fit a two-tube model frequency spectrum to the recording, finding the best length ratio, radius ratio, and total length match from theory. That is, by recording their voice a user can get an estimate of their vocal tract profile. The results of this fit are displayed in the Fourier transform plot window. 2.3 Implementation Details The software package is written for the Linux platform in C++ using Gnome Toolkit 2.0 (GTK2.0) graphical user interface development package and relies 19 Figure 2.2: Speech sound generator software in data mode. heavily on the use of multiple threads to manage the interface, solution generation, and sound production concurrently. WAV sound file generation is implemented using the libsndfile library. The plot window is managed using the developmental GTKExtra 2.0 widget extension library. 20 Chapter 3 The Manifold of Acoustic Tube Solutions Chapter 1 presented a derivation of the continuous output spectrum resulting from an arbitrary source and N -tube configuration. 
If the source spectrum has bounded support and the radiation impedance has a non-zero resistive term (i.e., Re(Zr ) 6= 0), then it follows that the output pressure spectrum is in L2 , the infinite-dimensional space of square-integrable functions. However, if we instead consider sources composed of an H-term linear combination of sinusoidal terms, we can alternatively view the output solution as contained in an H-dimensional subset of the infinite-dimensional space l2 , the set of squaresummable series. Thus, each solution of this type, while still an element of an infinite-dimensional space L2 , will also have a discrete, but exact, finite-dimensional representation. This readily allows for the adoption of a geometric representation of the solutions, where the H-dimensional solution spectra coefficients represent points in H-dimensional Euclidean space. In this chapter, we will adopt this geometrical representation and determine the subsets of Euclidean space to which acoustic tube solutions are constrained for various ranges of configuration parameters. We will see that these subsets are indeed low-dimensional manifolds embedded in the ambient space. 3.1 The Single Tube Solution Manifold We begin our investigation of the manifold structure with the simple yet expository case of a single uniform tube with length L and cross-sectional area A, driven by the source function s(t) = H X αn sin nω0 t, (3.1) n=1 where ω0 is the fundamental angular frequency and {αn } are real-valued Fourier 21 coefficients. Here, H is the number of harmonics, which must be less than or equal to b5000/f0 c due to the approximations used in the physics model. Since we are guaranteed real-valued output from the tube, we can assume a solution of the form p(L, t) = H X βn sin(nω0 t + φn ), (3.2) n=1 where {βn } is the set of real-valued output Fourier coefficients and {φn } is a set of real-valued phases. We know from Equation 1.42 that for each n, βn sin(nω0 t + φn ) = Im(αn f (nω0 , L, A)einω0 t ). (3.3) Therefore, it follows that the output Fourier coefficients {βn } are given by βn (L, A) = αn |f (nω0 , L, A)| =h cos2 kn L + αn |Zn | A2 ρ20 c2 |Zn |2 sin2 kn L + i1/2 , (3.4) A ρ0 c Im(Zn ) sin 2kn L where kn = nω0 /c and Zn = Zr (nω0 ). Now, consider the subset of RH defined for a given set {αi |i = 1, . . . , H} by M1 (L1 , L2 ) = {(β1 , β2 , . . . , βH )|L ∈ (L1 , L2 )}. (3.5) This set traces out a one-dimensional curve in the ambient Fourier space, RH . Here, the subscript “1” indicates the use of the single tube solution. Our immediate goal is to show that M1 (L1 , L2 ) is in fact a one-dimensional manifold. Formally, this is a consequence of the following three properties: 1. There exists a diffeomorphism, φ : (L1 , L2 ) ⊂ R1 → M1 (L1 , L2 ) ⊂ RH , for L1 and L2 in the range of human vocal tract lengths. We can define a continuous map, φ, from points in the open interval (L1 , L2 ) to M1 (L1 , L2 ) using the {βn } functions defined in Equation 3.4. This mapping is surjective by definition of M1 (L1 , L2 ). Its injectivity follows from the fact that the set is not self-intersecting. For self-intersection to occur, there must exist two lengths, L, L0 ∈ (L1 , L2 ) such that k1 L = k1 L0 + m2π for some m ∈ Z+ . This means that the minimum length difference that admits self-intersection is given by ∆Lmin = c f0max ≈ 1 m, (3.6) where the maximum fundamental frequency occurring in human speech is estimated to be f0max ≈ 300 Hz. 
Since vocal tracts typically range in length from approximately 10 and 30 cm, self-intersection would be impossible for any natural scale interval. 22 Furthermore, if Re(Zn ) 6= 0 for all n, φ is a differentiable map since the functions {βn } that determine it are infinitely differentiable (C ∞ ). Thus, it follows by definition that φ is a diffeomorphism. 2. The diffeomorphism, φ−1 : M1 (L1 , L2 ) → R1 , is a coordinate chart on the set M1 (L1 , L2 ). Since φ is a diffeomorphism, its inverse φ−1 must exist and must also be a diffeomorphism. It then follows that φ−1 is a coordinate chart on M1 (L1 , L2 ), by definition. 3. The set M1 (L1 , L2 ) is open. This fact follows from its definition, where L is chosen from the open interval (L1 , L2 ). We can conclude that the set M1 (L1 , L2 ) is a smooth and open one-dimensional manifold. The manifold has two interesting properties: • The manifold M1 (L1 , L2 ) is extrinsically curved in the ambient space. Inspection of the functional form reveals that this manifold is extrinsically curved in the ambient Fourier space. That is, there exists distinct sets of lengths, {l1 , l2 , l3 |li ∈ (L1 , L2 )}, (3.7) ~ = hβ1 (li , A), β2 (li , A), . . . , βH (li , A)i|i = 1, 2, 3} {β (3.8) such that the points do not lie on a straight line in the ambient space. This follows from the ~ fact that the tangent vector, ∂ β/∂L, does not maintain a fixed direction. This can be easily seen when we write down the tangent vector components, ∂βn αn kn |Zn | = × ∂L 2 A2 |Zn |2 2AIm(Zn ) 1 − cos 2k L sin 2k L − n n ρ2o c2 ρ0 c 3/2 . 2 |Z |2 A A Im (Z ) 2 n n cos2 kn L + ρ2 c2 sin kn L + sin 2kn L ρ0 c (3.9) 0 Clearly, for values of the {li } that do not result in trigonometric arguments that are multiples of π, the tangent vector components will not maintain constant proportions. Extrinsic curvature is demonstrated graphically simply by a non-linear form of the set. 23 M1(L1, L2) M1(L1, L2) (x−y plane) 80 100 60 80 40 β2 β3 120 60 20 40 0 200 20 1.5 100 1 β2 0 0 0.5 β1 0 0 M1(L1, L2) (x−z plane) 0.5 β1 1 1.5 M1(L1, L2) (y−z plane) 60 50 50 40 40 β3 70 60 β3 70 30 30 20 20 10 0 10 0 0.5 β1 1 0 1.5 0 50 β2 100 150 Figure 3.1: Sample plot of M1 (L1 , L2 ). • The manifold M1 (L1 , L2 ) spans the ambient space. This fact follows from the fact that there exists a distinct set of lengths, {l1 , l2 , . . . , lH |li ∈ (L1 , L2 )}, such that the H-dimensional square matrix with row vectors β~i = (β1 (li , A), β2 (li , A), . . . , βH (li , A)) has rank H. We have verified the existence of such a set of lengths numerically. Therefore, we can conclude that the single-tube solution manifold is a curved one-dimensional manifold that spans the H-dimensional ambient Euclidean space. This indicates that we are dealing with a complex non-linear subset of a typically high-dimensional space. Even though this subset has a one-dimensional representation, it fills all H dimensions of the ambient space. Therefore, simple linear machine learning and dimensionality reduction techniques may fail when applied to the ambient representation. It is precisely these qualifications that promise improved classification when using manifold-based algorithms, which incorporate the exploitation of the underlying low-dimensional structure into simpler techniques that function in the ambient space. Figures 3.1 and 3.2 (zoomed) show the manifold of Equation 3.5 for f0 = 150 Hz, L1 = 10 cm, L2 = 50 cm, A = 15 cm2 , and H = 3. The extrinsic curvature in the ambient space is clearly demonstrated. 
Moreover, the three-prong structure that roughly coincides with the axes indicates that the manifold does in fact span the ambient space R3 . 24 M1(L1, L2) M1(L1, L2) (x−y plane) 16 14 10 12 10 6 β2 β3 8 4 2 8 6 4 15 10 2 0.3 β2 5 0.2 0.25 β1 0.2 10 10 8 8 6 6 4 0.3 β1 4 2 0.2 0.25 M1(L1, L2) (y−z plane) β3 β3 M1(L1, L2) (x−z plane) 2 0.25 β1 0.3 5 β2 10 15 Figure 3.2: Sample plot of M1 (L1 , L2 ) (zoomed on origin). 3.2 The 2-Tube Solution Manifold As we increase the number of tubes in our model, the terms in the general solution gain additional trigonometric factors from the extra chain matrix multiplications. As a result, analytic treatment of the solution geometry grows increasingly complex, becoming prohibitively unmanageable even for the case of just two tubes. Thus, from this point on analytical parameterizations of the solution manifolds are no longer presented, replaced instead by numerical and graphical treatments. The 2-tube resonator configuration is described fully by four parameters: the lengths and cross-sectional areas of each of the two tubes. Alternatively, we can capture the same information in four related parameters, which we will use to define a two-tube configuration four-tuple, c = (RA , A, RL , L) ∈ C2 (IRA , IA , IRL , IL ). (3.10) Here, L is the total resonator length, A is the larger of the two tube areas, and RA and RL are the cross-section area and length ratios between the tubes, respectively. We define C2 (IRA , IA , IRL , IL ) to be the space of all possible configurations with 25 RA ∈ IRA ≡ (RA,min , RA,max ) RL ∈ IRL ≡ (RL,min , RL,max ) A ∈ IA ≡ (Amin , Amax ) L ∈ IL ≡ (Lmin , Lmax ). (3.11) We again conduct the analysis with the harmonic source function used in the single tube case, s(t) = H X αn sin nω0 t, (3.12) n=1 where ω0 is the fundamental angular frequency and H is the number of included harmonics. The sinusoidal source output given by Equation 1.55 determine the magnitudes of the output Fourier series coefficients, which are given by βn = αn |g(nω0 , c)|. (3.13) Here it should be noted that we have replaced the explicit {Li } and {Ai } dependence of the function g with the above-defined configuration four-tuple c ∈ C2 , which we have established to encapsulate the same information. Consider the subset of RH , M2 (S) = {(β1 (c), β2 (c), . . . , βH (c))|c ∈ S}, (3.14) where S = C2 (IRA , IA , IRL , IL ) is the configuration space for some choice of parameter ranges. The subset “2” of M2 indicates we are dealing with twotube solutions. This set possesses the same properties as M1 (L1 , L2 ) listed in the previous section, so it follows that M2 (S) is a smooth open manifold embedded in RH . The dimension of this manifold is the dimension of the configuration space S. As a first example, consider the configuration space where we only vary the overall length of the two tube system within the range L = (L1 , L2 ), keeping the other three parameters fixed. The resulting one-dimensional manifold, which can be denoted M2 (L1 , L2 ) (c.f. Equation 3.5), is shown in Figure 3.3, where I have chosen 26 M2(L1,L2) M2(L1,L2) (x−y plane) 100 60 80 40 β2 β3 60 20 40 0 100 β2 20 100 50 50 0 0 β1 0 0 20 M2(L1,L2) (x−z plane) 40 β1 60 80 100 80 100 M2(L1,L2) (y−z plane) 40 30 30 β3 50 40 β3 50 20 20 10 10 0 0 20 40 β1 60 80 0 100 0 20 40 β2 60 Figure 3.3: Sample plot of M2 (L1 , L2 ). (L1 , L2 ) = (10, 50) cm RL = 1.2 RA = 8 (3.15) A = 15 cm2 f0 = 150 Hz αn = 0 for n > 3 H = 3. The extrinsic curvature of this manifold is obvious upon inspection. 
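A minimal sketch of how the discretization and numerical rank computation mentioned in the next sentence might be carried out is given below. It reuses the hypothetical output_spectrum helper sketched at the end of Chapter 1; the way the total length and the maximum area are split between the two tubes according to RL and RA is one convention chosen for illustration, not a prescription from the text.

    import numpy as np

    # Sweep the total length of a two-tube resonator with the parameters of
    # Equation 3.15 held fixed, collecting one H-dimensional spectrum per length.
    H = 3
    RL, RA, A, f0 = 1.2, 8.0, 15.0, 150.0
    samples = []
    for L in np.linspace(10.0, 50.0, 100):
        L2 = L / (1.0 + RL)                      # shorter tube (one splitting convention)
        L1 = L - L2                              # longer tube, so that L1 / L2 = RL
        samples.append(output_spectrum([L1, L2], [A, A / RA], f0, H))
    X = np.array(samples)                        # 100 x H matrix of manifold samples

    # The manifold spans the ambient space when this matrix has full column rank.
    print(np.linalg.matrix_rank(X))              # expected to print H
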
Also, the 100×3 matrix formed using a discretized set of 100 data points along this manifold is determined numerically to have rank three. Therefore, this manifold spans the ambient space. Finally, consider the 2-tube solution manifold where we hold the overall area and length constant, but vary the area and length ratios. An example of such a manifold, which we denote by M2 (IRL , IRa ), is shown in Figure 3.4, where configuration parameters are chosen as 27 M2(IR ,IR ) M2(IR ,IR ) (x−y proj.) A L L A 100 80 40 60 β3 β2 60 20 40 0 0 20 100 50 10 20 β1 0 0 β2 0 5 M2(IR ,IR ) (x−z proj.) L 10 β1 15 20 M2(IR ,IR ) (y−z proj.) A L 40 40 30 30 A β3 50 β3 50 20 20 10 10 0 0 5 10 β1 15 0 20 0 20 40 β2 60 80 100 Figure 3.4: Sample plot of M2 (IRL , IRA ). IRA = (1/8, 8) IRL = (1/3, 3) L = 17.6 cm A = 15 cm2 f0 = 150 Hz αn = 0 for n > 3 H = 3. This is a two-dimensional manifold for which we can clearly observe extrinsic curvature. Moreover, using the matrix rank method described above, the manifold was verified to span the ambient space. The general manifold picture developed in this chapter demonstrates that the single and twin-tube parameterizations determine a simple, low-dimensional underlying structure to the complex, high-dimensional acoustic signals they produce. However, at this point, the acoustic tube model system is rather abstract. We have seen that the solution spaces are indeed low-dimensional manifolds, but a connection to speech remains to be demonstrated. To this end, we extend this approach into vowel-approximating configuration spaces in the following chapter. 28 Chapter 4 Tube Models of Vowel Production Now that we have introduced the concept of acoustic tube solution manifolds, we can continue our investigation into the structure of vowel manifolds. Speech production begins with a fairly stable air flow produced by the lungs. Located at the base of the vocal tract are a pair of tissue folds, known as the glottis or vocal cords. These folds can constrict to modulate air flow into the vocal tract, which determines the volume velocity source for the physical system. The vocal tract, mouth, and lips comprise the filter of the system. Various muscles in the neck, mouth, and face control the shape, known as an articulatory configuration. These configurations can be approximated by a sequence of concatenated acoustic tubes. In the case of sustained vowel sounds, a particular configuration is maintained throughout production. Therefore, the timeindependent physics model presented is sufficient for this analysis. The general formalism presented in Chapter 3 is extended in this chapter, where an approximate glottal source volume velocity spectrum and vowelapproximating vocal tract filter configurations are implemented. 4.1 Introducing the Glottal Source The volume velocity source that results from glottal vibration is an odd periodic function, which we denote by s(t), and thus can be approximated to desired accuracy by a Fourier series of the form s(t) ≈ sapp (t) = H X sn sin nω0 t, (4.1) n=1 where H is the included number of harmonics (limH→∞ sapp (t) = s(t)) and ω0 is the fundamental angular frequency, which is speaker-dependent. The 29 (a) (b) 100 1.5 90 1 80 0.5 60 Amplitude Amplitude (dB) 70 50 40 0 −0.5 30 20 −1 10 0 0 1000 2000 3000 Frequency (Hz) 4000 −1.5 5000 0 0.005 0.01 Time (s) 0.015 0.02 Figure 4.1: (a) Glottal source spectrum and (b) corresponding waveform where f0 = 100 Hz. 
Fourier amplitudes {sn } are traditionally approximated by a fixed 12 dB drop per octave [9], which corresponds on a linear scale to sn = n−12/(20 log10 2) , (4.2) where s1 = 1 is chosen as the reference amplitude (see Figure 4.1). Therefore, for a given N -tube vocal tract configuration and fundamental angular frequency ω0 , we can generalize the multi-tube sinusoidal source solution of Section 1.2.2 to write down the glottal-sourced output, pN (L, t) = H X n=1 Im n−12/(20 log10 2) g(nω0 )einω0 t . (4.3) Here, g(ω) is the function defined in Equation 1.53. Note that the listing of the explicit resonator configuration dependence has been omitted for the sake of brevity. Since the analysis presented in Chapter 1 assumes frequencies under 5000 Hz, we must limit the number of harmonics to H < 5000 · 2π/ω0 . The frequency spectrum of the output waveform of Equation 4.3 is given by (discarding phase information) βn = n−12/(20 log10 2) |g(nω0 )|. (4.4) This form of the amplitude spectrum is used in the remainder of this chapter to study the vowel manifolds. 4.2 2-Tube Models of Vowel Production In Section 3.2 we examined the manifold structure formed by H-point output Fourier transforms and included figures resulting from a three-harmonic source function. If we instead supply the two-tube resonators with the glottal 30 e M2 u a ae i Figure 4.2: Schematic plot of the vowel structure within the manifold of acoustic two-tube model solutions, M2 . Phoneme /@/ /u/ /a/ /y/ /i/ /æ/ RL 1 8 1.2 1 1.5 1/3 RA 1 8 1/8 8 8 1/8 A (cm2 ) 15 15 15 15 15 15 L (cm) 17.6 17.6 17.6 17.6 14.5 17.6 Table 4.1: Two-tube configurations for five vowel phonemes (Fant [9]). source function described above, and choose resonator configurations that approximate vocal tract profiles during vowel production, we can use the output solutions to estimate the manifold structure of vowel sounds. If we fix a pitch (i.e., fundamental frequency), the manifold of vowel sounds will comprise a subset of the space of all two-tube solutions, M2 , as shown schematically in Figure 4.2. A sound’s position in this space depends both on the phoneme being produced and the vocal tract size of an individual speaker. Each phoneme will occupy a distinct subset of the solution manifold, but all phonemes together do not form a partition, as there exists solutions that do not fall into a phonetic category. Two-tube articulatory vocal tract approximations have been studied in great detail [9]. Table 4.1 shows approximate two-tube configurations for five vowels phonemes, in terms of the quantities defined in Section 3.2. The spectrum for the /a/ configuration given is shown in Figure 4.3. Now the question arises: using these vowel configurations, what manifolds result when we vary one or several of the parameters? Before we can answer this, we must first address the issue of visualization. This is a matter complicated by the fact that we are now dealing with a b5000/f0 c-dimensional ambient space where we have one dimension for each harmonic. For the time being, we present projections of the manifolds onto the three principal component axes defined to be those that contribute most to the variance across the data points. Using this visualization approach, we will consider the manifolds arising 31 Amplitude Spectrum for Phoneme /a/ 390 385 380 Amplitude (dB) 375 370 365 360 355 350 345 0 500 1000 1500 2000 2500 3000 Frequency (Hz) 3500 4000 4500 5000 Figure 4.3: Amplitude spectrum for the two-tube /a/ configuration given in Table 4.1. 
from vocal tract length (L) variation and area/length ratio (RL and RA ) variation. Vocal tract length is correlated roughly to the height of the speaker. Variation in area and length ratio between the two tubes roughly corresponds to traversing the space of possible sounds producible by a speaker of given size and pitch. Therefore, these manifolds have direct connections to the variance present in natural datasets, namely speaker variation with fixed content and content variation with fixed speaker. We begin by examining the effect of varying overall vocal tract length for a given vowel phoneme configuration. Let M2 (P ; f0 , L1 , L2 ) be the manifold in RH specified by the two-tube configuration of phoneme P , where we vary the total resonator length L ∈ (L1 , L2 ). We stimulate this resonator with a glottal source of fundamental frequency f0 = 100 Hz, which constrains the number of harmonics and dimension of the ambient space to H = 50. Figure 4.4 shows M2 (P ; f0 , L1 , L2 ) for each of the phonemes listed in Table 4.1, where L1 =11 cm and L2 = 19 cm. The principal components for each phoneme’s plot are determined independently. High extrinsic curvature is evident for all phonemes, typically taking a spiral form. Moreover, each manifold was numerically verified to span its 50-dimensional ambient space. Next we combine the individual vowel phoneme manifolds into a single manifold given by [ (f0 , L1 , L2 ) = Mvowel M2 (P ; f0 , L1 , L2 ), (4.5) 2 P ∈P 32 PC Projection of M2(P) for P=schwa PC Projection of M2(P) for P=a 5 PC Axis 3 PC Axis 3 10 0 −10 10 0 −5 10 10 0 PC Axis 2 −10 −10 PC Axis 2 PC Axis 1 PC Projection of M2(P) for P=ae −10 −10 PC Axis 1 5 PC Axis 3 PC Axis 3 0 PC Projection of M2(P) for P=i 5 0 −5 10 0 −5 10 10 0 PC Axis 2 −10 −10 10 0 0 PC Axis 2 PC Axis 1 PC Projection of M2(P) for P=u 0 −10 −10 PC Axis 1 PC Projection of M2(P) for P=y 5 PC Axis 3 5 PC Axis 3 10 0 0 0 −5 −10 10 0 −5 10 10 0 PC Axis 2 0 −10 −10 PC Axis 2 PC Axis 1 10 0 0 −10 −10 PC Axis 1 Figure 4.4: Principal component plots of M2 (P ; f0 , L1 , L2 ) for each phoneme individually. 33 PC Projection of Mvowel 2 PC Projection of Mvowel 2 5 0 6 4 PC Axis 2 10 PC Axis 3 8 schwa a ae i u y −5 2 0 −2 −10 10 0 PC Axis 2 −10 −10 −4 10 0 PC Axis 1 −6 −10 6 6 4 4 2 2 0 −2 −4 0 PC Axis 1 5 10 0 −2 −4 −6 −8 −10 −5 PC Projection of Mvowel 2 PC Axis 3 PC Axis 3 PC Projection of Mvowel 2 −6 −5 0 PC Axis 1 5 −8 −10 10 −5 0 PC Axis 2 5 10 Figure 4.5: Principal component plot of Mvowel (f0 , L1 , L2 ). 2 where P is the set of phonemes listed in Table 4.1. The principal component projection for this manifold is shown in Figure 4.5, color-coded by phoneme. Note that we observe overlaps between some of the phoneme submanifolds, particularly between the /a/-/æ/ pair and the /i/-/y/ pair. This may come as no surprise, given the perceptual similarity of the phonemes. Thus it is clear that these pairs are not linearly separable in principal component projections. They are, however, linearly separable in the 50-dimensional ambient space. Finally, we examine the manifold created by using the configuration for the phoneme /a/ and varying both the area and length ratios independently from 1/10 to 10. This manifold, which we denote by M2 (f0 , L, A), is the space of all sounds that can be created by a glottal source with fundamental frequency f0 filtered by a two-tube resonator with total length L and maximum area A. 
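The principal component projections shown throughout this chapter reduce each set of sampled spectra to the three directions of greatest variance. A minimal sketch of that reduction is given below, assuming X is a matrix whose rows are manifold samples such as those produced by the earlier sketches; it is not the plotting code used to generate the figures.

    import numpy as np

    def pca_project(X, n_components=3):
        # Center the data and project it onto the leading principal axes,
        # obtained here from a singular value decomposition.
        Xc = X - X.mean(axis=0)
        U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
        return Xc @ Vt[:n_components].T          # coordinates on the top principal axes

    # Y = pca_project(X)   # Y[:, 0], Y[:, 1], Y[:, 2] give the plotted coordinates
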
This manifold should approximately contain the space of all vowel sounds producible by the equivalent real vocal tract with those overall dimensions at the given pitch. The principal component projection of M2 (f0 , L, A) is shown in Figures 4.6 and 4.7. In Figure 4.6, color is implemented such that the continuous light spectrum is linearly indexed by the length ratio values (i.e., regions of the manifold with the same color result from configurations with the same length ratio). In Figure 4.7, the color spectrum is instead indexed by area ratio. Clearly, the area ratio contributes much more to the overall variance of the manifold in the ambient space, as the principal component analysis favors its independent differentiation. This indicates that variation of area ratio has the greater affect on the acoustic nature of the resulting signal. 34 PC Projection of M2(f0,L,A) 8 6 PC Axis 3 4 2 0 −2 −4 5 0 10 −5 5 0 −5 PC Axis 2 −10 −10 −15 PC Axis 1 Figure 4.6: Principal component plot of M2 (f0 , L, A), where colors differentiate length ratios. PC Projection of M2(f0,L,A) 8 6 PC Axis 3 4 2 0 −2 −4 5 0 10 −5 5 0 −5 PC Axis 2 −10 −10 −15 PC Axis 1 Figure 4.7: Principal component plot of M2 (f0 , L, A), where colors differentiate area ratios. 35 Phoneme /a/ /e/ /i/ /1/ /o/ /u/ N 34 33 33 38 37 39 A (cm2 ) 8.0 10.5 10.5 13.0 14.5 13.0 L (cm) 17.0 16.5 16.5 19.0 18.5 19.5 Table 4.2: N -tube configurations for six vowel phonemes (Fant [9]) Note that the manifold M2 (f0 , L, A) (e.g. Figures 4.6 and 4.7) also contains paths that roughly model the space of Fourier amplitudes created as a speaker continuously changes vocal tract configuration from one vowel phoneme to another. The precise nature of these paths is not clear. Two immediate possibilities include transition paths that minimize geodesic distance or paths that minimize parameter space distance. Answering these questions definitively, however, would require functional imaging of actual human vocal tracts during non-stationary speech production. The manifold determined by a geometric representation of two-tube solutions possesses extrinsic curvature and spans the ambient space. Therefore, the above-described submanifolds arising from variation of subsets of configuration parameters inherit these properties. As we can see from the principal component plots, this translates into complex interrelationships between the subspaces corresponding to each vowel phoneme when we linearly reduce the dimension of the data. 4.3 N-Tube Models of Vowel Production Increasing the number of tube segments in the model can better approximate the acoustic characteristics of a given vowel sound. Much work has been done to measure the actual vocal tract profile during sustained production of vowel sounds, both with X-ray and magnetic resonance imaging. The goal of this section is to use N -tube profiles gathered from these medical imaging techniques to further examine the manifold structure of vowels. As in the 2-tube vowel approximations, the N -tube resonators are stimulated with the glottal source of Equation 4.2. We use the tube profiles of Fant [9], collected using X-ray imaging. Since these profiles are are defined in 0.5 cm segments, the number of tubes are determined by the overall tract length of each profile measured. Table 4.2 shows the phonemes used in our study along with their N -tube overall dimensions. 
Consider the manifold MN(P; f0, L1, L2) determined by the acoustic solution using the glottal source with fundamental frequency f0 and the N-tube resonator configuration for phoneme P, where we vary the total length L ∈ (L1, L2). The principal component projections of these manifolds are shown in Figure 4.8 for each of the phonemes listed in Table 4.2, where f0 = 100 Hz, L1 = 11 cm, and L2 = 19 cm (cf. Figure 4.4). Again, extrinsic curvature is evident, and each manifold was numerically determined to span the 50-dimensional ambient space.

Figure 4.8: Principal component plots of MN(P; f0, L1, L2) for each phoneme individually. [Panels for P = /a/, /e/, /i/, /ɨ/, /o/, /u/; axes are the first three principal components.]

Next, as in the two-tube case, we can combine the individual phoneme manifolds into a single manifold,

    M_N^vowel(f0, L1, L2) = ∪_P MN(P; f0, L1, L2),    (4.6)

where the union runs over the phonemes of Table 4.2. The principal component projection for this manifold is shown in Figure 4.9, color-coded by phoneme (cf. Figure 4.5). Unlike the two-tube case, each of the individual N-tube phoneme manifolds is linearly separable in the principal component projection. However, in the two-tube case we had two pairs of similar phonemes, /a/-/æ/ and /i/-/y/. In the N-tube case we are without /æ/ and /y/ specifications; the principal component projections of these pairs would presumably not be linearly separable in the N-tube case either. Without access to N-tube vocal tract profiles for these phonemes, however, we could not verify this claim. Also note that linear separability of the phoneme manifolds in the principal component projection guarantees linear separability in the ambient 50-dimensional space.

Figure 4.9: Principal component plot of M_N^vowel(f0, L1, L2), color-coded by phoneme (/a/, /e/, /i/, /ɨ/, /o/, /u/). [Four views onto the first three principal components.]

While the more accurate N-tube profiles can result in better vowel approximations, they are also higher-dimensional representations. The main utility of the two-tube model is to capture the majority of the acoustic signature with as simple a model as possible. For the purpose of manifold algorithm performance studies, we can capture any two-tube model inaccuracies by adding a sufficient amount of noise to the solution spectra. We will return to this approach in Chapter 5. However, it is relevant to notice that, even after allowing for variation of the overall scale values, the choice of a specific N-tube vowel phoneme configuration isolates a more localized region of the acoustic space than a two-tube configuration does. This fact is reflected in the generally increased separability of the N-tube phoneme submanifolds. Therefore, the two-tube model phonemes act as a sort of average over speakers' N-tube profiles.
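The separability statements in this section, and those revisited in Chapter 5, can be checked numerically with any linear classifier. The sketch below uses a simple perceptron on two labeled manifold samples (built, for example, with the manifold_sample helper sketched in Section 4.2); it is one possible check, not necessarily the procedure used to produce the figures in this paper.

    import numpy as np

    def linearly_separable(X0, X1, epochs=2000, lr=0.1):
        # Heuristic separability check: run a perceptron on the two classes
        # and report whether it reaches zero training errors.  The perceptron
        # convergence theorem guarantees this happens (in finitely many
        # updates) when the classes are linearly separable; if the epoch
        # budget is exhausted first, the check is inconclusive.
        X = np.vstack([X0, X1])
        y = np.hstack([-np.ones(len(X0)), np.ones(len(X1))])
        Xb = np.hstack([X, np.ones((len(X), 1))])  # absorb the bias term
        w = np.zeros(Xb.shape[1])
        for _ in range(epochs):
            errors = 0
            for xi, yi in zip(Xb, y):
                if yi * (w @ xi) <= 0:             # misclassified point
                    w += lr * yi * xi
                    errors += 1
            if errors == 0:
                return True
        return False

    # e.g. linearly_separable(per_phoneme["a"], per_phoneme["ae"]) on the
    # ambient 50-dimensional data, or on its 3D principal component rows.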
Because the two-tube phonemes act as such an average, classification and data reduction performance measured on the two-tube model manifolds is a worst-case measure, which makes the two-tube model a desirable one to work with.

4.4 Fundamental Frequency Variation

So far we have considered various tube model solution manifolds while keeping the fundamental frequency fixed. This approximates speakers of different sizes and genders who nevertheless share the same pitch, or fundamental frequency. However, both vocal tract geometry and the pitch of natural speech vary greatly across speakers. Thus, to successfully approximate the manifold of sustained vowel sounds, we must also allow for fundamental frequency variation, which is not necessarily correlated with vocal tract dimensions.

We extend the vowel configuration space of Section 4.2 into a frequency dimension by taking the manifold M2(P; f0, L1, L2) and varying f0 as well. We can define such a two-tube solution manifold for phoneme P by

    M2(P; F1, F2, L1, L2) ≡ ∪_{f0 ∈ (F1, F2)} M2(P; f0, L1, L2).    (4.7)

Figure 4.10 shows the principal component projection of this manifold for each phoneme listed in Table 4.1, where F1 = 75 Hz, F2 = 275 Hz, L1 = 11 cm, and L2 = 19 cm. The result is a structure similar to the corresponding one-dimensional manifolds of Figure 4.4. However, the curve is spread into a two-dimensional ribbon, induced by the fundamental frequency variation.

Figure 4.10: Principal component plots of M2(P; F1, F2, L1, L2) for each phoneme. [Panels for P = schwa, /a/, /æ/, /i/, /u/, /y/; axes are the first three principal components.]

4.5 Frequency Sampling Normalization

In the previous section, we defined a two-tube vowel manifold that incorporates fundamental frequency variation. We have not yet addressed the complication that the n-th coordinates of two points resulting from unequal fundamentals are not Fourier coefficients of the same frequency. That is, the coefficient βn for a solution with fundamental frequency f is the amplitude at frequency nf, while the coefficient β′n for a solution with fundamental f′ ≠ f is the amplitude at frequency nf′ ≠ nf. When dealing with Fourier transforms of real speech data across speakers with varying pitch, the positions of the frequency samples are determined by the time length of the waveform. Thus, for accurate simulation, the 50-point transforms defined by Equation 4.4 for a given fundamental f0 must be mapped to the corresponding Fourier amplitudes sampled at multiples of some reference frequency f_ref.

This frequency sampling normalization can be accomplished while working in the frequency domain by considering a finite time sample window, which converts our discrete Fourier coefficients into a continuous Fourier spectrum that can be sampled at any desired frequency interval. Consider the steady-state solution p(t) sampled for t ∈ [−T, T]. The resulting waveform can be written m(t) = p(t) R_T(t), where

    R_T(t) = 1 for t ∈ [−T, T], and 0 otherwise.    (4.8)

Taking the Fourier transform, we have m̂(ω) = p̂(ω) ∗ R̂_T(ω), where

    R̂_T(ω) = 2 sin(ωT) / ω.    (4.9)
Now, we know from Section 1.2.2 that an N-tube Fourier series solution is determined by

    p̂(ω) = Σ_{n=1}^{H} (s_n π / i) [δ(ω − nω0) − δ(ω + nω0)] g(ω),    (4.10)

where ω0 is the fundamental angular frequency of the original steady-state solution, g(ω) is determined by the resonator geometry according to Equation 1.53, and the {s_n} are determined according to Equation 4.2. We can then calculate m̂(ω) using the δ-function property of Equation 1.41, giving

    m̂(ω) = Σ_{n=1}^{H} (s_n π / i) [ g(nω0) sin((ω − nω0)T)/(ω − nω0) − g*(nω0) sin((ω + nω0)T)/(ω + nω0) ],    (4.11)

where we have used g(−nω0) = g*(nω0). The resulting normalized frequency spectra are given by (cf. Equation 4.4)

    γn = |m̂(n ω_ref)|,   n = 1, . . . , K,    (4.12)

where K is the number of frequency spectrum samples and ω_ref = 2π f_ref is the angular frequency sampling interval. Note that since the sinc function is C^∞, the map to the coordinates {γn} remains a diffeomorphism onto its image, and thus the subsets of Euclidean space defined by these amplitudes remain smooth manifolds.

Now that we have a normalized geometrical representation of solutions with varying fundamental frequencies, we can define the corresponding solution manifolds. We define the two-tube solution manifold of this form for phoneme P as

    M2(P; T, f_ref, K, F1, F2, L1, L2) = {(γ1, γ2, . . . , γK) | L ∈ (L1, L2), f0 ∈ (F1, F2)},    (4.13)

where 2T is the time window, f_ref is the reference frequency interval, and K is the number of samples in the spectrum. Figure 4.11 shows the 3D principal component projection of this manifold for each phoneme in Table 4.1, where T = 0.050 s, f_ref = 10 Hz, K = 500, F1 = 75 Hz, F2 = 275 Hz, L1 = 11 cm, and L2 = 19 cm. We can see that the normalized data has lost the ribbon-like manifold structure evident in Figure 4.10. We are left with a cluster of points with no clearly discernible structure, at least in the discrete approximation shown here in the principal component projection. Thus, we can conclude that frequency sampling normalization greatly increases the structural complexity of the vowel manifold.

Figure 4.11: Principal component plots of M2(P; T, f_ref, K, F1, F2, L1, L2) for each phoneme individually. [Panels for P = schwa, /a/, /æ/, /i/, /u/, /y/; axes are the first three principal components.]

In all the manifolds presented in this chapter, we find curved, spanning subsets of the acoustic space. In each case, we are dealing with a relatively low-dimensional generating parameter space. Ideally, dimensionality reduction would be achieved by transforming the acoustic signal into this easily managed low-dimensional parameter space. However, noise and model inaccuracies could conceivably complicate the effectiveness of explicitly using the above-derived maps. Therefore, it is desirable to use methods that exploit this low-dimensional structure without relying on knowledge of the exact form of the map. We will examine such a method in the following chapter.
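Before moving on, here is a minimal sketch of the frequency sampling normalization of Equations 4.11 and 4.12, in the same Python style as the earlier sketches. The complex array harm stands in for the harmonic content s_n g(nω0) of the steady-state solution (assumed already computed, with real source amplitudes s_n); only the windowed-sinc resampling itself is shown.

    import numpy as np

    def normalized_spectrum(harm, f0, T=0.050, f_ref=10.0, K=500):
        # Resample a harmonic solution onto the fixed grid n * f_ref.
        #   harm  : complex array; harm[n-1] stands in for s_n * g(n*omega0)
        #   f0    : fundamental frequency of the solution (Hz)
        #   T     : half-width of the time window [-T, T] (s)
        #   f_ref : reference sampling interval (Hz)
        #   K     : number of output samples gamma_1, ..., gamma_K
        omega0 = 2.0 * np.pi * f0
        omega = 2.0 * np.pi * f_ref * np.arange(1, K + 1)   # evaluation grid

        def sinc_term(delta):
            # sin(delta*T)/delta, with the delta -> 0 limit handled explicitly
            small = np.abs(delta) < 1e-12
            safe = np.where(small, 1.0, delta)
            return np.where(small, T, np.sin(safe * T) / safe)

        # Equation 4.11: the windowed spectrum is a sum of shifted sinc lobes;
        # with real s_n, the conjugate term equals conj(harm[n-1]).
        m_hat = np.zeros(K, dtype=complex)
        for n, amp in enumerate(harm, start=1):
            m_hat += (np.pi / 1j) * (amp * sinc_term(omega - n * omega0)
                                     - np.conj(amp) * sinc_term(omega + n * omega0))
        return np.abs(m_hat)                                # Equation 4.12

Applying this to solutions with different fundamentals places all of them on the same K-point grid, as required by Equation 4.13.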
Chapter 5

Vowel Manifolds in the Graph Laplacian Eigenbasis

In Chapter 4, we touched on the topic of linear separability of the manifolds for each vowel phoneme. We found that for two-tube models, certain phoneme manifolds, while separable in the ambient 50-dimensional space, are not separable in the 3-dimensional principal component projection. We also found that for the phonemes that had N-tube vocal tract profiles, a separating hyperplane existed both in the ambient space and in the 3-dimensional principal component projection space. From these facts arise the issues of dimensionality reduction and the possible ramifications of noise in the dataset. Principal component projection is a linear mapping and therefore cannot improve the separability of classes. Furthermore, as noise is increased, classes in the 50-dimensional ambient space will eventually be rendered inseparable, further increasing the cost of dimensionality reduction. It is therefore desirable to use more complex mappings that preserve or even improve separability while reducing the data dimension, both with and without the presence of noise.

5.1 The Laplacian Eigenbasis

One approach to handling data reduction with linearly inseparable data (on account of noise, etc.) is to use a projection onto a non-linear basis. An example of such a basis is the nearest neighbors graph Laplacian eigenbasis, in the manner presented by Belkin and Niyogi [1]. Below is a brief summary of their method, the results of which will be presented in the following sections.

Consider k data points x1, . . . , xk ∈ R^H. We construct an adjacency graph with one vertex Vi per data point xi. Let Xn(Vi) be the set of the n nearest vertices to vertex Vi under the Euclidean distance metric. We connect vertices Vi and Vj with an edge of weight one if and only if Vi ∈ Xn(Vj) or Vj ∈ Xn(Vi).¹ This graph can be represented by the adjacency matrix W, which is symmetric and binary-valued in this case. From this, we determine the graph Laplacian L = D − W, where D is the diagonal degree matrix with elements D_ii = Σ_j W_ji.

¹ Niyogi and Belkin provide many variations on this condition; this simplest form is sufficient in this context.

Solving the eigenvalue problem Le = λe results in a set of eigenfunctions e1, . . . , ek ∈ R^k, which are sorted by their corresponding eigenvalues 0 = λ1 ≤ · · · ≤ λk. The projection of data point xi onto an m-dimensional subset of the graph Laplacian eigenbasis is then given by

    P_m(xi) = (e2(i), . . . , e_{m+1}(i)),    (5.1)

where e_n(i) denotes the i-th component of vector e_n. Note that the trivial constant eigenvector e1, with eigenvalue zero, is excluded.

The sorted eigenbasis {e_i} provides a projection that reflects path length between points in the adjacency graph: the shorter the path connecting two points, the closer their projection values. Therefore, if a manifold is sampled sufficiently densely, the projections of any two points on that manifold will be relatively close. Points on two non-intersecting manifolds will likely be connected only by a long path, or by no path at all, and will thus have very disparate projections. This method can therefore serve both to improve the linear separability of data classes and to reduce the data dimension by choosing m < H.
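As a concrete reference for the experiments that follow, the sketch below implements the projection just described: a symmetric n-nearest-neighbor graph with unit edge weights, the Laplacian L = D − W, and the projection of Equation 5.1. It is a plain NumPy sketch of the construction summarized above, not the code of Belkin and Niyogi used for the results in this chapter; the neighborhood size n_neighbors is an arbitrary illustrative choice.

    import numpy as np

    def laplacian_eigenmap(X, n_neighbors=5, m=3):
        # Project the rows of X (k points in R^H) onto the m graph Laplacian
        # eigenvectors with the smallest non-trivial eigenvalues (Equation 5.1).
        k = X.shape[0]
        # Pairwise squared Euclidean distances.
        sq = np.square(X).sum(axis=1)
        d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
        np.fill_diagonal(d2, np.inf)              # a point is not its own neighbor
        # Symmetric n-nearest-neighbor adjacency with unit weights: connect
        # V_i and V_j if either is among the other's n nearest vertices.
        W = np.zeros((k, k))
        nearest = np.argsort(d2, axis=1)[:, :n_neighbors]
        for i, nbrs in enumerate(nearest):
            W[i, nbrs] = 1.0
        W = np.maximum(W, W.T)
        D = np.diag(W.sum(axis=1))                # diagonal degree matrix
        L = D - W                                 # graph Laplacian
        _, vecs = np.linalg.eigh(L)               # eigenvalues in ascending order
        return vecs[:, 1:m + 1]                   # drop the trivial e_1

Note that when the adjacency graph splits into several connected components, the zero eigenvalue has multiplicity greater than one, so the leading retained eigenvectors act as (near-)indicator functions of the components; this is exactly the behavior exploited in the next section.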
5.2 Dimensionality Reduction

Dimensionality reduction, while useful for increasing the efficiency of algorithms, may obscure the separations between data classes, in this case the various vowel phonemes. In the case of the two-tube vowel solutions, the variation in the ambient space over a given phoneme manifold may be greater than the average variation between two phonemes. Therefore, a linear principal component projection that acts on the data of multiple phoneme classes will function to spread the solution points of each phoneme class relative to themselves, leaving some classes entangled.

This pitfall is exhibited by the /a/-/æ/ pair (see Figure 4.5). In Figure 5.1, we show the union of the manifolds M2(/a/; f0, L1, L2) and M2(/æ/; f0, L1, L2) (see Section 4.2) projected onto the first three principal component axes determined using data from both manifolds. (From here on, the union of these manifolds is denoted by M_2^a ∪ M_2^æ.) As expected from Figure 4.5, there does not exist a linear hyperplane that separates the two manifolds in the 3D principal component space. They are, however, separable in the ambient 50-dimensional space, indicating that the PCA method introduces precisely the damaging effects one wishes to avoid.

Figure 5.1: Principal component plot of M_2^a ∪ M_2^æ. [Four views onto the first three principal components, color-coded by phoneme.]

In Figure 5.2, we show the projection of the same manifolds onto the nearest neighbors graph Laplacian eigenfunction with the smallest non-trivial eigenvalue (in this case zero). Here, all solutions using /a/ configurations project onto zero, while those using /æ/ configurations project onto a collection of four relatively remote points. The classes are completely separable by a zero-dimensional hyperplane (a single threshold) in this one-dimensional basis.

Figure 5.2: Graph Laplacian eigenfunction projection of M_2^a ∪ M_2^æ (cf. Figure 5.1). [Horizontal axis: e2.]

In this simple but instructive case, the graph Laplacian eigenbasis perfectly isolates the two phoneme classes. This occurs because the nearest neighbors adjacency graph contains two connected components, one for each class. That is, there does not exist an edge from any /a/ point to any /æ/ point. Therefore, the minimum non-trivial eigenvalue has a corresponding eigenvector that functions primarily to separate the two classes. Mathematically, this is a direct consequence of the manifold structure of the class data.

5.3 Noise on the Manifold

So far we have ignored the unavoidable issue of noise present in actual recorded speech data. Deviations from the theoretical data presented above can arise systematically, from the approximations used in our tube model analysis, as well as randomly, from ambient room and electronics noise. Either contribution will spread the individual phoneme manifolds, possibly in unpredictable ways. Therefore, our approximated theoretical data, which is linearly separable in the ambient space and, in the case of the N-tube vowels, in the principal component space, may not be perfectly separable when these noise sources are taken into account. We encapsulate all of these possible noise sources into the addition of a Gaussian-distributed random variable to each frequency component of the manifold M_2^a ∪ M_2^æ.
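A minimal sketch of this noise model follows: zero-mean Gaussian noise is added to every frequency component, with its variance chosen so that the signal-to-noise ratio, averaged over components, hits a target value in decibels. This is one natural reading of the procedure described above, not necessarily the exact implementation used to generate the figures; the variable name union_a_ae is hypothetical and stands for the stacked M_2^a ∪ M_2^æ spectra.

    import numpy as np

    def add_noise(X, snr_db, rng=None):
        # Add i.i.d. zero-mean Gaussian noise to each frequency component of
        # the spectra in X (one spectrum per row) at the requested SNR in dB.
        rng = np.random.default_rng() if rng is None else rng
        signal_power = np.mean(X ** 2)                   # averaged over components
        noise_power = signal_power / (10.0 ** (snr_db / 10.0))
        return X + rng.normal(0.0, np.sqrt(noise_power), size=X.shape)

    # e.g. noisy = add_noise(union_a_ae, snr_db=18.0)   # setting of Figures 5.3-5.5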
We begin by introducing a small amount of noise, resulting in an 18 dB signal-to-noise ratio averaged over the components. As an example, Figure 5.3 shows the standard /a/ spectrum (shown originally in Figure 4.3) after the addition of this level of noise.

Figure 5.3: Amplitude spectrum of Figure 4.3 after the addition of noise with SNR = 18 dB. [Axes: frequency, 0–5000 Hz; amplitude, dB.]

The results for the principal component projection of this noisy data are shown in Figure 5.4, and the results for the graph Laplacian eigenbasis are shown in Figure 5.5. In the principal components plot, there is a significant spreading of the data points, resulting in near overlap of the two classes. However, in the graph Laplacian projection, the classes are still distinct and remain linearly separable in a single dimension.

Figure 5.4: Principal component projection of M_2^a ∪ M_2^æ with the introduction of noise at SNR = 18 dB.

Figure 5.5: Nearest neighbors graph Laplacian eigenfunction projection of M_2^a ∪ M_2^æ with the introduction of noise at SNR = 18 dB. [Horizontal axis: e2.]

We continue by introducing a larger amount of noise, this time with a signal-to-noise ratio of 8 dB. The resulting principal component and graph Laplacian plots are shown in Figures 5.6 and 5.7. Here we use a 3-dimensional Laplacian eigenfunction projection (i.e., m = 3 in Equation 5.1). The class overlap increases in the principal component plot, while, even with this relatively high level of noise, the data in the graph Laplacian basis remains linearly separable to a large extent.

Figure 5.6: Principal component projection of M_2^a ∪ M_2^æ with the introduction of noise at SNR = 8 dB.

Figure 5.7: Nearest neighbors graph Laplacian eigenfunction projection of M_2^a ∪ M_2^æ with the introduction of noise at SNR = 8 dB. [Axes: e2, e3, e4.]

To further expose this performance trend, Figure 5.8 shows the linear separability of four data representations of the manifold M_2^a ∪ M_2^æ as a function of signal-to-noise ratio (SNR). These representations are the original 50-dimensional transform data, the 3-dimensional principal component projection, the 1-dimensional graph Laplacian projection, and the 3-dimensional graph Laplacian projection.
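A sweep like the one summarized in Figure 5.8 can be scripted directly from the pieces sketched earlier (manifold_sample, pca_project, laplacian_eigenmap, and add_noise); the outline below shows the structure of such an experiment. The separability score used here is the training accuracy of a least-squares linear classifier, a simple stand-in for whatever separability measure underlies Figure 5.8.

    import numpy as np

    def linear_separability_pct(X, y):
        # Training accuracy (%) of a least-squares linear classifier fit to
        # labels in {-1, +1}; a simple proxy for a separability measure.
        Xb = np.hstack([X, np.ones((len(X), 1))])     # bias column
        w, *_ = np.linalg.lstsq(Xb, y, rcond=None)
        return 100.0 * np.mean(np.sign(Xb @ w) == y)

    def snr_sweep(X0, X1, snrs_db, seed=0):
        # Compare four representations of the noisy /a/-/ae/ data as a
        # function of SNR, using the helpers from the earlier sketches:
        # ambient 50-dim, 3D PCA, 1D graph Laplacian, 3D graph Laplacian.
        rng = np.random.default_rng(seed)
        y = np.hstack([-np.ones(len(X0)), np.ones(len(X1))])
        clean = np.vstack([X0, X1])
        rows = []
        for snr in snrs_db:
            noisy = add_noise(clean, snr, rng)
            rows.append((snr,
                         linear_separability_pct(noisy, y),
                         linear_separability_pct(pca_project(noisy, 3), y),
                         linear_separability_pct(laplacian_eigenmap(noisy, m=1), y),
                         linear_separability_pct(laplacian_eigenmap(noisy, m=3), y)))
        return rows  # columns: SNR (dB), 50-dim, 3D PCA, 1D Laplacian, 3D Laplacian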
From Figure 5.8 we can see that for low noise levels (i.e., high SNR), all representations except the principal component projection are 100% linearly separable. As the level of noise is increased, the 1-dimensional graph Laplacian projection deviates from perfect separability before its 3D counterpart, as expected. Still, the 1D graph Laplacian projection outperforms the 3D PCA projection down to an SNR of 12 dB. For SNR greater than 10 dB, the 3-dimensional graph Laplacian representation maintains class distinction as well as the original 50-dimensional data. The 3D graph Laplacian projection outperforms the principal component projection all the way down to an SNR of 4 dB. For signal-to-noise ratios less than 4 dB, the two 3-dimensional representations are roughly equivalent. Below 4 dB, further increases in noise begin to substantially degrade class separability for all representations, though the original 50-dimensional transform data maintains an accuracy above 90% down to nearly 0 dB SNR (i.e., equal levels of signal and noise).

Figure 5.8: The linear separability of four data representations as a function of signal-to-noise ratio. [Curves: original 50-dimensional data, 3D PCA projection, 1D Laplacian projection, 3D Laplacian projection; axes: SNR (dB), −20 to 60; separability (%), 50 to 100.]

For low to moderate noise levels, the nearest neighbor graph Laplacian representation is more useful than PCA at preserving linear separability through dimensionality reduction. The reason is that the adjacency graph can remain largely unchanged at these noise levels, as each point's set of nearest neighbors remains largely the same. For very high levels of noise, the original 50-dimensional transforms are the most useful, while the principal component and graph Laplacian eigenbasis projections are roughly equivalent. The performance degradation of the graph Laplacian method at these high noise levels indicates that the manifold structure has become too obscured, resulting in interclass edges in the adjacency graph. While the graph Laplacian method does eventually break down for very high levels of noise, it otherwise consistently outperforms PCA at preserving class separation through dimensionality reduction. As expected, the underlying manifold structure of the acoustic tube vowel approximations results in successful application of the manifold-based dimensionality reduction algorithm.

Conclusion

This paper has presented a derivation of a class of manifolds defined by solutions of the traditional acoustic tube articulatory model. These manifolds are extrinsically curved in, and span, the ambient acoustical space. Assuming real vowel data is sufficiently well approximated by these manifolds, the existence of an underlying low-dimensional structure for spoken vowels follows. Furthermore, this structure is sufficiently complex to complicate successful application of linear classification and dimensionality reduction techniques. Since humans can differentiate between these classes with ease, we might assume that there is something inherently non-linear at work. Towards a possible reconciliation, the geometric point of view presented in this paper justifies the application of the class of manifold-based algorithms, useful in machine learning, speech recognition, and data representation. The positive results presented here for the Laplacian eigenmap dimensionality reduction method are only one example of the possibilities this perspective admits.
Furthermore, a geometric representation provides an alternative entry point into understanding non-linear perceptual phenomena. The possible avenues of future research are outlined below.

• Approximate Manifold Structure of Other Phonemes
The analysis presented in this paper was limited to tube models that approximate sustained vowel sounds. Modeling other classes of phonemes, such as fricatives and nasals, involves introducing turbulent noise sources at vocal tract constrictions and incorporating branching resonators. This will involve the introduction of stochastic processes in determining the output transforms, which will prevent the existence of a diffeomorphism between the configuration and acoustic spaces. However, as we saw in Chapter 5 with the introduction of noise, the Laplacian-based manifold algorithm was stable under deviations from a precise manifold structure. Thus, it is proposed that manifold-based algorithms will continue to experience a performance gain even on these more complicated datasets. However, a systematic study would be required to justify this claim.

• Semi-supervised Learning
Recently, several manifold-learning algorithms have been presented [3, 1, 4, 5, 6, 7]. In particular, Belkin, Niyogi, and Sindhwani [3] proposed a semi-supervised manifold learning algorithm that incorporates the method described in Section 5.1 into a regularization term of the objective function for various classifiers. The positive performance results for the Laplacian data reduction method indicate strongly that the corresponding semi-supervised learning method will be similarly successful. A full performance study is necessary, using both the synthetic vowel manifolds derived here and actual recorded phonemes from the TIMIT database.

• Non-linear Acoustic and Perceptual Phenomena
Much has been written in recent years about the perceptual magnet effect, which refers to a warping of the perceptual space towards categorical centers. This means that equally spaced points in acoustic space near a categorical center will map closer together in perceptual space than equally spaced points farther from that center. Thus, non-linear perceptual shifts can result from fixed displacements in acoustic space. A widely discussed acoustic-phonetic framework is the quantal theory of speech [10]. The articulatory-acoustic half of the theory states that there are regions of articulatory (configuration) space where large changes lead to relatively small changes in acoustic character, and vice versa. This implies a largely non-linear relationship between configuration space and acoustic space. The extrinsic curvature of the vowel manifolds derived in this paper indicates an implicit non-linearity in the configuration-to-acoustic mapping. The relationship between the above-mentioned acoustic-perceptual effects and this manifold structure will need further study. However, the geometric approach presented here provides an alternative entry point into these subjects.

Acknowledgements

I would like to thank Partha Niyogi for suggesting this project topic and providing guidance in its completion. I would also like to thank Mikhail Belkin and Vikas Sindhwani for providing their code and assistance in implementing the Laplacian eigenmap algorithm.

Bibliography

[1] Mikhail Belkin and Partha Niyogi. Laplacian Eigenmaps for Dimensionality Reduction and Data Representation. Technical Report TR-2002-01, University of Chicago, Computer Science Department, December 2001.

[2] Kenneth N. Stevens. Acoustic Phonetics. MIT Press, Cambridge, MA, 1998.
[3] Mikhail Belkin, Partha Niyogi, and Vikas Sindhwani. Manifold Regularization: A Geometric Framework for Learning from Examples. Technical Report TR-2004-06, University of Chicago, Computer Science Department, August 2004.

[4] Sam T. Roweis and Lawrence K. Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 290:2323–2326, 2000.

[5] Lawrence K. Saul and Sam T. Roweis. Think globally, fit locally: Unsupervised learning of low dimensional manifolds. Journal of Machine Learning Research, 4:119–155, 2003.

[6] D. L. Donoho and C. Grimes. Hessian eigenmaps: Locally linear embedding techniques for high-dimensional data. Proceedings of the National Academy of Sciences, 100:5591–5596, 2003.

[7] Joshua B. Tenenbaum, Vin de Silva, and John C. Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290:2319–2323, 2000.

[8] D. L. Donoho and C. Grimes. When does ISOMAP recover natural parameterization of families of articulated images? Technical Report TR-2002-27, Department of Statistics, Stanford University, August 2002.

[9] Gunnar Fant. Acoustic Theory of Speech Production. Mouton and Co., Paris, 1970.

[10] Kenneth N. Stevens. On the quantal nature of speech. Journal of Phonetics, 17:91–97, 1989.