STATISTICAL APPROACH TO ENHANCING ESOPHAGEAL SPEECH BASED ON GAUSSIANお1IXTURE MODELS

Transcription

STATISTICAL APPROACH TO ENHANCING ESOPHAGEAL SPEECH BASED ON GAUSSIANお1IXTURE MODELS
STATISTICAL APPROACH TO ENHANCING ESOPHAGEAL SPEECH BASED ON
GAUSSIANお1IXTURE MODELS
Hironori Doi, Keigo N,αkωnura, TOll1oki Todα, Hiroslzi Sαrutl'atαri, and Kiヲohiro Shikano
Graduate School of Inforrnation Science, Nara lnstitute of Science al1l1 Technology, Japan
E-mail:
{ hironori-d.kei-naka. tomoki‘sawatari, shikano } @is.naist.jp
ABSTRACT
lhe trained GMMs so lhal il sounds like n0l111al speech.
This papcr prcscnts a novel method of cnhancing esophagcal specch
using statistical voicc convcrsion.
Esophageal spcech is one of
Because
the converted speech is generated from slatistics extracted from
nonnal speech, this conversion process is expeclcd 10 remove Ihe
Fo pallern uf Ihe
the alternative speaking methods for laryngeclom巴es. Allhough il
spec)自c noise sounds effcctively訓ld improve the
doesn't require any eXlernal devices, generaled voices sound unnat­
針。phageaJ speech. Howewr, the converted specch basically sounds
ural. To irnprove the intelligibility and naturaJness of esophageal
Iike voices ullered by a differenl speaker frol11 Ihe laryngecton】ee.
speech, we propose a voice conv巴rsion melhod from esophageal
To make it possible to日cxibly control thc convcrtcd voice quality,
A spectral pararneter and excitation
we also apply one-to-many cigcnvoice conversion lEVC) [5) to ES­
pararnelers of target norrnal speech are只eparately eslimaled白om
to-Spccch. Thc onc-IO-many EVC is a convcrsion mcthod from a
a speclral parameter of lhe esophageal speech based on Gaussian
single source sp巴akcr into any arbitrary target spcakcrs. This mcthod
speech inlo normal sp巴ech.
miÀture models. Th巴 experimenlal results demonstrate thal the pro・
allows us to control lhe convel1ed voice quality by manipulating a
posed method yields significant improvemenls in intelligibility and
small number of parametcrs or 10 flcxibJy adapt lhe conversion
naturalness. We also apply one-lo-many eigenvoice conversion 10
model for Ihe given spccch �amples. This mcthod secms very cffcc­
esoph暗eal speech enh‘n
l cement for nexibly conlTolling enhanced
tiv巴 for helping Jaryngectomees to speak in their favorite voices 01'
special voices sounding like their own voices thal have already been
voice qualily.
11ldex Termsーl訂yngeclomees. esophageal speech, speech en­
lost.
2. ESOPHAGEAL SPEECH
h,lncemenl. voice conversion, eigenvoice conversion.
J shows an example of speech wavefonns, spectrograms.
Figure
1. INTRODUCTION
People who have undergone a total laryngectomy due to an accid巴nt
or laryngeaJ cancer cannot produce speech sounds becau只e their vo­
cal folds have been rernoved. EsophageaJ speech iミone of the alter­
nalive speaking melhods for laryng巴clomees. Excilation sounds are
produced by releasing gases from or Ihrollgh lhe esophagus, and then
they are articulaled for generating esophageal speech. Esophageal
speech sounds more nalural than speech generaled by other alter­
native speak.ing methods such as eleclrolaryngeal speech.
How­
ever, degradalion of lhe int巴lligibilily and naluralness in esophageal
speech is caused by several faclors such as sp巴citic noises and rela悶
tively low fundamental frequency. Conseqllently. esophageal speech
sOllnds unnatural compared with nomlal sp巴ech.
and
ro contours of normal speech uttered by a non-laryngectomee
and esophageal speech lItlered by a laryngectomee in the same sen­
lence. We C.ln see that acouslic features of esophageaJ sp巴ech are
consid巴rably different frol11 those of normal speeじh.
Esophageal
speech often incllldes some specific noisy sounds. which can be eas­
ily obscrvcd in lhe silcnce pa口s in the自gure. Thesc noises are pro­
duced through a process of gcncrating excitation �ounds, i.e., pump­
ing air inlo thc csophagus and the stomach and rcJeasing air from
them. Wavcform envclope and spcctraJ componcnts of csophagcaJ
specch don 't vary ovcr an lItlcrance as smoothly as those of normal
spcech. These unstable and unnatural varialions caus巴 the unnatu­
ral sounds of esophagcaJ spccch. Moreovcr, the pitch of esophagcaJ
speech is generally lower and less stabl巴 than thal of normal只peech.
Some approaches have been proposed to enhance esophageal
speech based on th巴 modification of its acollslic features, e.g., lIS­
Fo analysis process for normal speech oflen
Fo extraction and the unvoicedJvoic巴d decision in esophageaJ
Consequently, a usuaJ
fails at
ing comb自1t巴ring [1) 01' smoolhing [2]. However, since the acoustic
speech.
features of esophageal speech exhibit quite differenl properties from
degradation of analysis-synthesized sp巴ech quality.
These charactelistics of esophageal speech cause severe
those of nonnal speech. it is basically diftìcult to compensale 1'01' the
The intelligibiJity and naturalness of esophageal speech slrongly
acoustic differences belween lhem lIsing those simple modilication
depend on the skill of indi\'idual laryngectomees in producing
processes. More sophisticated and complicated processes are ess叩ー
esophageal speech.
lial to dramatically enhance esophageal speech
diflìcult 10 remove because lhey are caused by Ihe production mech­
In this paper, we propose a statistical approach to enhanc­
ing esophageal speech using voice conversion (VC)
[3, 4].
anism of esophage.ù speech.
OUI
3. VOICE CONVERSION ALGORTTHM
proposed rnethod converts esophageal speech into normaJ speech
ES-to­
Wc dc,cribe a conversion method b出ed on maximum Iikelihood cs­
We train Gaussian mixture models (GMMs) of the joint
timation of speech paramcter tr勾cctorics considering a gJobal varト
in a probabi1istic manner (Esophagea1-Sp田ch-to-Speech:
Speech).
However, som巴 specifìc noises are essentially
probability densities between the acoustic features of esophageaJ
ance (GVl [4) as one of the �tate-of-the-art statistical VC 11lcthods.
speech and those of normal speech in advance using parallel data
consisting 01' utterance-pairs of the esophageal speech and lh巴 larget
This method consists of a training and convcrsion process. Morc­
normal speech. Any esophageal speech凶mple is converted using
ibly controlling the con、'Cl'1巴川j speech quality.
1'his work was suppo口ed in p‘Irt by MIC SCOI'E. The authors are grate­
fu1 to Prof. Hideki Kawahara of Wakayall1a L:nive"ity, Japan. for permission
to use the STRAIGHT analysis只ynthesis method
978-1-4244-4296-6/10/$25.00 @2010 IEEE
over. we also describe Ollc-to-many EVC [5] as a technique for flcx­
3.1. Training Process
Lel us assume an input static feature vector X,
Xt(D,.)],
4250
and an outpul slalic feature vector
Yt
[:1'1 (1)、
= [YI ( l ) ,
ICASSP2010
Qd
司i
内4
4品Z
' A
AU
ri151」
-卜1
-liriO
K
K
K8 0
O
K
0
0
0
0
nυ
AU
A
V
AO
内υ
nυ
nU
FO
η4
・A
-A
。&
6・z
η&
'A
出
z
s
g
u
h
Z
一N
尚
E
2
5
E
一
4
2
守
〈
日
三
喜
喧
ω
Fig. 1. Example
合= ['ÎI� "・197司‘ー ρJjT is determined by maximizing the fol­
担nnal生盟主
lowing objective function,
。=ö叩na.xP(YIX.入)ωP(叫ν)1入(ul)
y
suLjCCL LO Y = Wy
よ4i
where W is a window matrix to extend the static feature vector se­
quence into the joint feature vector sequence of static and dynamic
features [7]. A balance between P(YIX,入) and P(v(y)1入(VJ) is
controlled by the weight ω
ll川h|
of waveforms. spectrograms and
Fo contours
3.3. One-to・ManyEVC
An eigenvoice GMM (EV-GMM) mod巴ls the joint probability den­
sity in th巴 same manner as shown inEqs, (1) and (2). except for a
definition of the target mean v巴ctor wntten as
textual features of source speech, e,g., the joint static and dynamic
featur巴 vector 01' the concatenated feature vector from multiple
frames, As an output speech feature vector, we use Yt = [凶ν,
ム ν;rjT丁 c∞o∞ns凶I時of sta以山tic and dyr山nic features,
Using a pan廿lel training data set consisting of time-align己d in­
put and ouゆut parameter vectors [X!.Yi]T. [XI.YJjT.
[xi,Y引T, where l' denotes the 101al number of frames, lhe joinr
probabilitydensity of the input and output parameter vectorsis mod­
eled by a GMM [6] as follows
'
(
1I
whcrc bm and Am
[am(l),・ \αm(J)j are a bias vectOl
and e明nvectors am (j) for the m'h mixture component, respec­
tively, The number of eigenvectors is J. The target speaker
individuality is controlled by the J-dimensional weight vector
、'W(J)jT Consequently, th巴EV-GMM has target­
山= [即(1),
m耐r-independent p釘ameter見1入(EV)consisting ofμjf),Am
bm.and :E��\:.\ï and target-speaker-dependent parameter w
TheEV-GMM is 汀ained using multiple parallel data sets con­
sisting of a single input speech dala set and many outpul speech dala
S巴ts including various speakerピvoices, The trainedEV-GMM al­
lows us to control the converted voice quality by manipulating the
weight vectOl世J. Moreover, the GMM for the input speech and
new target speech are flexibly built by automatically dete口nining the
weight vector叩using only a few arbitrary utterances of the target
speech in a text-independent manner.
I
I
4. VOICE CONVERSION FROM ESOPHAGEAL SPEECH
(2)
where N(-;μ,E) denotes the Gaussian distl均utJon w油a mean
vector μand a covariancc matrix :E, The rnixture compon巴nt index
IS?凡The total number of mixture components is JH. A parameter
set of the GMM is入,which consists of weights Wm‘mean vectors
μ.�;.Y) and full coval;ance matr悶s :E�: .Yl for individual mixture
components.
Th巴 probabi口lηl町 d白臼E白I附1ザy ofthe GV v(ν) of the outpゆ uωr s引ra以tJωc f,己た a仕ν '[.
lu川 r陀eve氏叫cは1ωtor凶s ν = [ν 7
,yJjT over an u 旧rance is also
modeled by aGaussian dis廿ibution.
'@•@ • .
P(切(ν)1入(吋) = N(v(y);μい'J, :E(り)
(3)
wl町E山e GVv(ν) = [v(l),・・,V(Dy)jT is calculatcd by
拾い)一品川))
(4)
=
TO SPEECH (ES-TO-SPEECH)
1n order 10significantly enh,mce esophageal speech,it is essenti,ùto
remove the speci自c noise sounds and to gcnerate smoothly varying
speech parameters over an uttcrancc, Moreover, in order to syn­
thesize cnhanced speech with modi日cd spcech p紅白netcrs,it is also
cssential to deal with difficulties of fcature cx汀action of esophageal
spcech目To address thcs巴 issucs,wc propose statistical voice conver­
sion from esophagcal speech into normal specch, B巴cause conv巴此ed
speech paramctcrs are gcncratcd from thc statistics of thc nonnal
speech, the specifìc noise sounds and unstable variations are alle­
viated effectively by the conversion process, FurthemlOre. even if
some specch paramctcrs such as Fo and unvoicedlvoiced informa­
lion are diffìcult to extract合om esophageaJ spccch, thosc parame­
lers exhibiting properties similar to those of normal specch would be
estimated by the conversion from other speech parameters robustly
extracted from esophageal speech (巴。g" spectral cnvelope), Thises­
llmatlOn proce回may be regarded as a statistical feature ex廿actlon
process
4.1. Feature Extraction in ES-to-Speech [8]
A parameter set入(叫consists of a mean vcctor μ(,.) and a diagonal
covananc巴 matnx :E(V)
[Y!. .. .Y;r.
LetX = [X[,,,, ,xi.. ,XijT and Y
Yi jT be a tin問1
tωur陀e v巴ctωors, r陀esp巴ctively, Th巴 conv巴引rtほed s幻ta剖tJにc f,たeatωur陀e s鉛巴qu己nc巴
,
(6)
)
、、1te,J
I
2
2
r
a
μ
T
Tt
Y
Tt
x
〆't
、
t、
ν
M
Zぃ11
リ川
m
n
一
一 し
μμ
λ 「|lL
れ =
m
…
ぼp x
μ
i ゃ ( λ二Y l :E�m;�l')
7ft =- I中川
I E�,�X) :E�γY
ml
:E�;�' \')
3.2. Convcrsion Process
μゴ) = Amω+bm,
of
め(Dν)jT at frame 1:, wher巴T denotes transposition of the vector
As an input sp巴ech parameter vector. we use X t to capture con­
ド(d)
(5)
The spcc甘al components of esophagcal speech vary unstably as
mcntioned in Section 2, Moreover. spectral fcaturcs of some
phonemes are often collapsed due to dif自culties of producing them
in esophageal speech. To alleviate these issues. we use a spectral
segment feature extractcd frorn rnultiple frames. At each frame, a
spectral parameter vector at the current frame and those at several
preceding and succeeding frames are concat巴nated, Then.dimension
4251
nHV
口6
n〆UH
reduction with principaJ component analysis (PCA) is performed to
extract the spec佐al segm巴nt feature
Although it is difficult to ex汀act Fo from esophageal speech
(see Figl1re 1), we usuaJly perceive pitch infonnation of esophageal
speech. Assuming that r巴levant infonnation is incll1ded in spectral
parameter5. we use the spectral segment feature as an input featllre
for estimating Fo in the conversion process. Moreover, in order to
m法e the estimated Fo corrc�pond to the perccived pitch infonna­
tion of esophageal speech, as an Olltpllt featllrc we use Fo vaJlles
extractcd from nonnal speech uttered by a non-Iaryngcctomce 50 as
to makc its prosody similar to that of esophagcal spcech
TabIc 1. Estimafion accumcy vf me/-cepsfru川川TI削tT pOlVer wuj
aperiodic componenTs. MeL-ceps Tr,日/ disTυrTion wiTh jJower (i.e.. in.
sholl'l1 in pω'enthes出
Mel-cepstral
Aperiodic
d1Stβrlion [dB] I distortion [dB)
ι/uding the Oth co�fJicient) is
846 (11.95)
4.96 (6.16)
I
I
6.99
3.71
F o cor re lafion coefficiel1f (Ccm:)
beT\Veen TIIl' ex­
Fo and the target Fo exrracted.from normal speech
al1d ul1voicedlvoiced decisiol1引開仁For cxalllple, “VU" shvws rhe
rafe of es timatil l g a voiced .fmllleαs al1凶n'oiced.
UN dec凶ion error [別
I
I Corr. I
内rable 2.
rracted/converted
4.2. Training and ConversioJl
In order to convert esophageal speech into normal speech, we use
three different GMMs to estimat巴 three speech parameters 01' the
target normal speech, i.e., sp巴ctrurn. Fo, and aperiodic components
that captllre noise strength of an excitation signal on 巴ach frequency
band [9]. For the sp巴ctral estimation, we use a GMM to cOl1vert the
mput sp巴:ctral segment features into the coπesponding output spec­
tTal parameters. For the Fo estimation‘we use a GMM to convert
the input spectral segment features into the output Fù vaJues and urト
voiced/voiced infonnation. For the aperiodic estimation. we use a
Gl\1M to convert the input spectral segment features into the output
aperiodic components.
I
,I
Extracted
円
,
0.07
I
,、68 I
43.82 (V-U: 42.60. U\i:卜22)
8.36 (\/U: 4.06. UV 4.30)
8 frames in both the spectral estimation and the apeliodic estimation
and the current土16 frames in the Fo estimation. respectively.
5.2. Objcctive Evaluations
Tables J and 2 show estimation accuracy of spectrum, aperiodic
cOll1ponents, and Fo. It is observed that the acoustic features of
esophageaJ speech are very differ巴nt froll1 those of norll1al speech.
Tbese larg巴 differenc巴s of the acoustic features are significantly re­
duced by ES-to-Speech. We can see that the proposed conversion
ll1ethod is very effective for estimating any of the three acoustic fea­
tures.
ln synthesizing the converted speech, we design a mixed excita­
tion based on町estimated日and aperiodic components [9]. Then.
we synthesize the converted speech by filtering the mixed excitation
with the estimated spec町al parameters
5.3. Perceptual Evaluatiol1s
4.3. Applyil1g Ol1e-to・恥'1al1y EVC to ES-to-Speech
We conducted two op叩ion tests of inteJligibility and naturalness.
The foJlowing six types of speech sampJes were e六faluated by 10
Iisteners
We fUl1hcr apply onc-to-many EVC to ES-to-Speech for flexibly
controlling convel1巴d voicc quality. The one-to-many EV-GMM for
spectraJ conversion is traincd using muJtiple parall巴I data sets of
esophageaJ speech data uttcr巴d by the laryngcctomee and normaJ
spcech data uttered by many non-Iaryngectomees. The trained EV­
GMM allows laryngectomees to speak in th巴ir favorite voices, which
are created by manipulating the weight vector for eigenvoices or by
estill1ating proper weight vaJues using onJy a smaJl amount of those
voices' data as adaptation if they are given.
ES recorded esophageal speech
ES・AS analysis-synthesized csophagcal spccch
EstSpg synthetic sp巴巴ch using converted ll1el-ccpstrum, convcrlcd
apcriodic componcnts, and Fo extractcd f[om esophagcal
speech
EstFo synthetic speech using extracted ll1cl-ccps汀Ull1‘cxtractcd
apcriodic components, and convcrtcd Fo
CV synthetic speech using converted meJ-cepstrum唱∞nverted ape­
吋odic components, and converted Fo
NS・AS analysis-synthesized normaJ speech
5. EXPERlMENTAL EVALUATIONS
5.1. Expl.'rimental Conditiol1s
We recorded 50 phoneme-balanced sentences of esophageal speech
uttered by one Japanese male laryngectoll1ee. We aJso recordcd the
sall1e sentcnces of nOl1l1al spccch llttered by a Japanese maJe non­
laryngectomee. He tried to imitate thc prosody of山e laryngectoll1ce
utterance-by-utterance as closcly as possible. Sampling frequcncy
was set to 16 kHz. We conducted a 5-fold cross vaJidation test in
which 40 utterance-pairs were llsed for training, and the rCll1ω111ng
10 uttcr白lcc-pairs were uscd for evaJuation.
Th巴Oth through 24th meJ-cepstral co巴f自cients extracted with
STRAJGHT anaJysis [1OJ wer巴 used as the spectraJ parameter. As
the source excitation features of normal speech, we used 10含scaled
Fo巴xtracted with STRATGHT Fo analysis ll l] and aperiodic com­
ponents 19J on five frequency bands, i.e., 0- J, 1-2, 2-4, 4-6, and 68 kHz, which were used for designing mixed excitation. The shift
length was 5 ms.
We pl官liminarily optimiæd several parameters such as the num­
ber of mixture components of each GMM and the number of frames
used for extracting the spectral segment feature [8]. As a result.
we set th巴 number 01' mixture components to 32 for each of three
GMMs. For the segment feature extraction, we used the cu汀ent土
Each listener evaluatcd 120 sall1pJes' in each of thc two tcsts.
Figur官s 2 and 3 show thc result of the intelligibility test and that
of the naturaJness test, respectivcly. ES-AS causcs significant intcl­
ligibility degradation compared to ES duc 10 the difficulties of the
acoustic fcature extraction in esophageal speech. The specific noiscs
and unstable variations on the sp巴C住ogram 111巴sophageaJ speech
are signi自cantly aJJeviated by using the estimated spectral features
Moreover、the conv官rted spe巴:ch exhibiting pitch information sim­
ilar to pitch p巴rceiv官d in esophageaJ speech is generated by using
the estimated Fo contour. AJthough significant improvemenls in in­
t巴lligibility and naturalness are not observed when using only one
of th巴se e坑imated features, we can see that the ES-to-Speech (CV)
estimating all acoustic features yields signi日canlly lllore intelligibJe
and natural speech than esophageal speech
These results suggesl that the proposed ES-to-Speech is very ef­
fective for improving both the naturaIness and lhe intelligibility of
esopha.geal speech
IS引eraJ samples副e availabJe frol11
http://spalab.naist.jp/hironOll-cVJCASSP/ES2SPlindeλhtml
4252
TIE­
n6
っ“
95% CollÍïdence
interv,ùs
む宮吾 品E〈
- I IH
I
2f.. . . .1ぞ
ES-AS
EstSpg
EstFo
CY
5 2F邑h出
]h υ
N目白
{
ノ�I r/é�.ヰrょう
ES
NS-AS
Fig. 2. Mean Opillioll SCOI官Oll illtel1igibility
。
nu nu nu
aAZ nUAυ nU ハU
a仏τ 04 唱i
�
zωロr日片山
hu
N-12}
{
串 仁
i31
'且'ι
K -K
nunu nUAU ハU n6
21
4 3
5
50
0
U
0.8
Ti血e [secl
0.4
1.6
Fig. 4. Example 01' wavefonn, spectrogram. and }ò of the cOl1verted
speech by one�to-many EVC.
spef'ch illtO spectmm.
Fo, and aperiodic components of nonnal
speech independently using three different GMMs.
The experト
mental results have demonstrated that ES-to-Speech yields Sigllifi
cm1t improvements ill illtelligibility and naturalness of esophageal
speech.
Moreover, we have also applied olle�to�many eigenvoice
conversioll to ES-to-Speech for flexibly COll町ol1ing voice quality of
1
the converted spe巴ch.
NS-AS
7. REFERENCES
[1)
5.4. ElTectiveness of One-to・manvVC
lIsing a comb fìlter."lmernational Conference on Disability, Y inual Re­
ality and Associated Technologies, pp. 39-46. 2002
To makc voicc quality of the convc口cd spcech simiJar to the Jaryn­
[2)
gectomec‘5 OWll voice quality. we applied onc�to�many EVC to ES­
sets consisting of the esophageal speech and
1 834, Phoenix, Arizona. May 1999
30 speakers' nonnal
|β3J
speech. The EV-G恥1M was adapted to an esophageal sp巴巴ch sam­
ple shown in Figllre 1, and then the esophageaJ speech sample was
Spワeech
alld AI“Idio Pro­
Yo叫l.
maximum likelihood estimation of spcctral parametcr tr句ectory円IEEE
Fo
1) acoustic features of
7ト即日ASLP, Yol. 1 5. No. 8. pp. 2222-2235. Nov. 2007
[51
the converted speech vary more stably than those of the origillal
esophageal speech sample出ld
tran目sfor口rmη、 for voi町ce convers剖】on," IEEE 7罰ワrans.
6, No. 2, 1 'p. 131-1 42, 1998
[4) T. Toda, A、11. Black, and K. Tokuda, ‘Yoice conve円ion based on
Figure 4 shows an example of waveform. spectrogram, and
T.
Toda, Y. Ohtani, and K. Shikano守
‘One-to-many and many-to-one
voice conversion based on引genvoices," Proc. ICASSP, pp. 1249-1252,
2) the specific noise sounds afe sig­
nific,mtly ,ùleviated by the conversion process.
Y. SIりげylia叩n叩l(旧o川1I, O. C“叩ppe, and E.. Mo山ulin即1暗es.
c、es口sr附ng,
converted inlo nonnal speech using the adapted EV�G恥1M
We can see that
K. Matlli. N. Hara, N. Kobayashi. and H. Hirose, "Enhancement of
esophageal specch using formam synthesis," Pr.οc. ICASSP, pp. 1831-
to�Speech. We trained the one�to�many EV�GMM using parallel data
of the converled speech.
A.H】抽出and H. Sawada. "Real-time c1 arificalion of esophageal speech
Hawaii, USA, Apr. 2007
[6)
Namely, evell if
A. Kain and M.W. Macon、"Spectral voice conversion for texl-to-speech
synthesis." Proc. ICASSP, pp. 285-288, Seattle, l1SA, May 1 998
esophageal speech is used as the adaptation data. the adapted EV­
[7)
G孔仏1 provides the converted speech of which properti目立re Slm1-
K. Tokllcla.
T.
Yoshimura. T. Masuko, T. Kobayashi. and T. Kilamura.
'.Speech paramet町generation algorithms for HMM-based叩eech !:.yn
.
lar 10 those of llonnal speech. In addition. we have observed that
the日5 ." Prac. ICASSP, pp. 1315-1 31 8, I st anbuJ TlIrkey, June 2000
the adapted EV-G孔仏1 makes the conve11ed voice quality clos巴r to
18) H. Doi, K. N玖amu問、T. Toda. H. Saruwatari, and K. Shikano.“En
hancement of Esophage立1 S1'eech l1sing Statistical Yoice Conversion,"
the laryngcctomec's voice quality compared with the GMM lIscd in
APSIPA 2009.1'1.' 805-808, Sapporo. Japan, OCl. 2009.
[9j H. Kawahara, J. Estill, and O. FlIjimllra,“Aperiodicity extraction and
the previous cvaluations, which wcre trained using the single 110n­
laryngcctomec・s spcech. FlIrthermore, w巴 have also observed that
control using mixed mocle excitation and !,'TOUp delay m,mipulation for
the converted voice qllality is flexibly challg巴d by manipulating the
a high quality speech analysis, modifìcation and system STRAIGHT,"
weight valucs of the EV�GMM. The proposed ES-to-Sp巴cch with
MAVEBA 2001, Florcncc, Jtaly, Sept. 2001
one-to-mally EVC is cxpected to make possible a new speaking aid
[10) H. Kawahara.
system that allows laryngcctomees to control the converted voice
l.
Masucla-Katsuse, and A. Cheveigne.“Restructu口ng
sp�ech representations using a pitch-adaptive time-frequency smoothing
and an instantaneolls-frequency-based fò extraction: Possible role of a
quality as they want
repetitive structure in sounds,"
27. No. 3-4.
[1 1) H. Kawahara. H, Kalayose. A. Che\'cigne. and R. D. Patterson.“Fixed
This paper has presented a novel method for enhancing esophag巴al
spe巴ch using statistical voice conversion.
Spnch Commwtication, Vol.
p1'. 1 87-207.1999
6. CONCLUSION
point analysis of freqllency to instantaneous仕equency mapping for ac­
The propos巴d method
curate 出timation of
(ES-to�Sp巴ech) converts a sp巴ctral segment fealllre of esophageal
Fb
and periodicity," Proc. EVROSPEECH. pp.
2781-2784, Budapest. Hungary. SepL 1999
4253
円〆“
nku
つ'“