Order number: 2008telb0096
THESIS
presented at
TELECOM BRETAGNE
under the seal of the Université Européenne de Bretagne,
in joint accreditation with the Université de Rennes 1,
to obtain the degree of
DOCTOR of TELECOM BRETAGNE
Specialization: "Signal Processing and Telecommunications"
by
Emmanuel Rossignol THEPIE FAPI
Noise Reduction and Acoustic Echo Cancellation in the Parameter Domain of CELP-Type Coders Integrated in Mobile Networks
Defended on 9 January 2009 before the examination committee:
Composition of the Jury
Reviewers (rapporteurs):
Geneviève BAUDOIN, Professor, ESIEE Paris
Stéphane AZOU, Maître de Conférences HDR, Université de Brest
Examiners:
Régine LE BOUQUIN-JEANNES, Professor, Université de Rennes 1
Ramesh PYNDIAH, Directeur d'études, TELECOM Bretagne
Dominique PASTOR, Professor, TELECOM Bretagne
Christophe BEAUGEANT, Research Engineer, Infineon Technologies
Invited members:
Hervé TADDEI, Research Engineer, Huawei Technologies
Gang FENG, Professor, Université Stendhal - Grenoble 3
Acknowledgements
First of all, I would like to thank Dr. Christophe BEAUGEANT, Dr. Hervé TADDEI and Pr. Dominique PASTOR for offering me the chance to work on this thesis, for always ensuring good working conditions, and for proofreading my dissertation so many times.
I am thankful to the sponsors from Siemens Mobile, BenQ Siemens, Siemens Networks and Nokia Siemens Networks for financing my studies. Special thanks go to Dr. Jarmo HILLO, Dr. Marcel WAGNER and Ms Stephanie DAHLHOFF.
I would like to thank my brother Simon Prosper HAPPI for his affection, as well as my parents Noé and Marie-Claire FAPI.
I also want to express my gratitude to my supervisors Dr. Ramesh PYNDIAH and Pr. Dominique PASTOR from TELECOM Bretagne. I would also like to thank Dr. André GOALIC.
I am grateful to Pr. Geneviève BAUDOIN and Dr. Stéphane AZOU for agreeing to review my work as rapporteurs. Many thanks also go to Pr. Régine LE BOUQUIN-JEANNES for presiding over the examination committee.
I am also indebted to Dr. Mickael De MEULENEIRE and Nicolas DUESTCH for their contributions prior to my work. Many thanks go to the colleagues and students I have met over these four years at Siemens Mobile, BenQ Siemens, Siemens Networks and Nokia Siemens Networks, especially Ketra KANG, with whom I had a really good time, for their help and participation in the listening tests.
I would like to deeply thank my family, the YEMNGA family, as well as my fiancée Odette YEMNGA and my son Ludovic Fabrice, for their support and their love throughout all these years.
Finally, I dedicate this thesis to the memory of my big sister Brigitte-Chantal TOUKAM FAPI (Dec. 1966 - Feb. 2005).
Abstract
Voice Quality Enhancement (VQE) solutions are now moving from the mobile device into the network, driven by low-complexity and low-delay constraints and by the need for centralized control of the network. The deployment of incompatible standardized speech codecs raises interoperability issues between telecommunication networks. To ensure interconnection between networks, transcoding from one codec format to another is necessary. The common point between classical network VQE and standard transcoding is that both process the speech signal in PCM format.
An alternative way to perform network VQE is developed in this thesis: the CELP parameters themselves are modified. A noise reduction algorithm is implemented by modifying the fixed codebook gain and the LPC coefficients of the noisy speech signal, and an acoustic echo canceller is developed by filtering the fixed codebook gain of the microphone signal. These algorithms extrapolate existing time- or frequency-domain techniques into the CELP parameter domain.
During this thesis, the algorithms developed in the coded domain have been integrated into smart transcoding schemes. The smart transcoding strategy is applied to the fixed codebook gain, the LPC coefficients and the adaptive codebook gain. With this approach, the non-linearity introduced by the coders does not affect the performance of the network AEC. Many functions of the target encoder are skipped, leading to a significant computational load reduction of about 27% compared to the classical approach. The network VQE embedded in smart transcoding has been implemented. Objective metrics (the Signal-to-Noise Ratio Improvement (SNRI) and the Total Noise Level Reduction (TNLR)) indicate that the noise reduction integrated in smart transcoding outperforms the classical Wiener method when transcoding from the AMR-NB 7.4 kbps mode to the 12.2 kbps mode; performance is equivalent when transcoding from the 12.2 kbps mode to the 7.4 kbps mode. The Echo Return Loss Enhancement (ERLE) values of the proposed algorithms improve on the standard NLMS by up to 40 dB, and the 45 dB ERLE required in GSM is achieved.
Keywords: CELP, AMR-NB, VQE based on CELP parameters, Wiener filter, GSM network, smart transcoding.
Résumé
Speech quality enhancement is progressively being performed in the networks rather than in the mobile terminals. This new approach is motivated by delay and complexity constraints and by the wish for centralized control of the networks. The deployment of standardized speech coders raises interoperability problems between networks. To ensure interconnection between these networks, transcoding the bit-stream from one coder to the target coder is indispensable. Classical quality enhancement solutions and classical transcoding require the signal in PCM format, i.e. the signal samples.
An alternative concept for enhancing speech quality in the networks is proposed in this thesis. This approach relies on processing the parameters of CELP-type coders. A noise reduction system is implemented by modifying the fixed gain and the LPC coefficients. Two acoustic echo cancellation algorithms are developed that modify the fixed gain. These algorithms extrapolate and transpose existing time- or frequency-domain techniques into the parameter domain of CELP-type coders.
During this thesis, we have also integrated the algorithms mentioned above into smart transcoding schemes involving the fixed and adaptive gains as well as the LPC coefficients. With this approach, the system complexity is reduced by about 27%, and the problems related to the non-linearity introduced by the coders are significantly reduced. Regarding noise reduction, objective tests indicate that the performance is better than that of the classical Wiener filter when transcoding from AMR-NB 7.4 kbps to 12.2 kbps, and roughly equivalent when transcoding from 12.2 kbps to 7.4 kbps. The objective measures for acoustic echo cancellation (ERLE) show a gain of more than 40 dB of the proposed algorithms over the NLMS. The minimum threshold of 45 dB set for GSM is reached.
Keywords: CELP coder, AMR-NB, noise reduction and acoustic echo cancellation in the CELP parameter domain, Wiener filter, GSM network, smart transcoding.
Chapter Summaries
Chapter 1: Introduction and Context of the Thesis
This first chapter introduces the context of the thesis. We highlight the problems caused by the presence of environmental noise and acoustic echo during communications over mobile networks. The classical techniques for reducing the effects of noise or acoustic echo essentially operate on the signal in PCM format. They can be implemented in the mobile terminals or directly in the networks. Their major drawbacks are, among others, the increase in computational cost and the delay they may introduce into the communication.
The multiplication of networks, mobile or not, nowadays creates interoperability problems. Once again, the classical approach requires the signal in PCM format: the transcoding operation, also called 'tandeming' (decoding followed by re-encoding), is required. Its consequences are the degradation of speech quality, delay and a high computational cost.
Current architectures are progressively turning to noise reduction and acoustic echo cancellation implemented directly in the networks, also referred to as Centralized Voice Quality Enhancement. This introductory chapter puts forward the idea that it is advantageous to directly modify the parameters transmitted by the speech coders in order to enhance speech quality. To cope with inter-network interoperability problems, smart transcoding is tested as well. Smart transcoding relies on the quasi-similarity of the speech coders deployed in the networks. The central point of this thesis is the embedding of our 'new algorithms' into smart transcoding schemes. The expected result is a single module that solves the interoperability problems together with the quality and intelligibility problems due to the presence of noise and acoustic echo.
Chapter 2: Speech Coding and CELP Coding Techniques
The second chapter is devoted to speech coding by linear prediction, in particular to linear prediction with coded excitation sequences, better known under the English abbreviation CELP. CELP coders operate on consecutive, equal-length segments of the input signal, called frames. Depending on the coder, the frame length varies between 10 ms and 30 ms, so that the speech signal can be assumed stationary.
The parameters transmitted by CELP-type coders have a physical meaning closely related to the human speech production system. First, the short-term correlation within a frame is reduced by linear prediction: a sample of the frame is estimated as a linear combination of a finite number of previous samples. The set of prediction coefficients forms the synthesis filter, which in fact models the vocal tract. A prediction residual is obtained as the difference between the input signal and its linear prediction estimate. This residual is quantized by a linear combination of two codewords coming from two codebooks, the fixed codebook and the adaptive codebook.
The first codebook, called adaptive, models the long-term correlation present in the residual, which results from the vibration of the vocal cords. This codebook contains a set of quantized excitations of the last coded frames. Its codewords are indexed by a value called the pitch, which characterizes the periodicity of the signal in the current frame. Once the optimal codeword is found, its associated gain is also computed. The second codebook, called fixed, contains a set of predefined sequences and codes the non-predictable information, called the innovation. The coder determines the optimal codeword as well as its associated gain. In both cases, the codeword and its gain are obtained by minimizing the mean square error between the original signal and the reconstructed signal. This method is called analysis-by-synthesis.
The excitation consists of the sum of the two codewords, each weighted by its quantized gain. The adaptive codebook is updated by concatenating this excitation to the excitations of the previous frames. The masking properties of the auditory system can be taken into account by weighting the error with a function of the short-term prediction coefficients.
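As an illustration of this synthesis principle, the following minimal Python sketch reconstructs one subframe from CELP parameters; the dimensions, names and sign convention are chosen for this example only and are not taken from any particular standard.

    import numpy as np

    def celp_synthesize(adaptive_cw, fixed_cw, g_a, g_f, lpc):
        """Reconstruct one subframe of speech from CELP parameters.

        adaptive_cw : codeword from the adaptive codebook (past excitation)
        fixed_cw    : codeword from the fixed codebook (innovation)
        g_a, g_f    : quantized adaptive and fixed codebook gains
        lpc         : coefficients a_1..a_p of the synthesis filter 1/A(z),
                      with A(z) = 1 - sum_i a_i z^{-i}
        """
        excitation = g_a * np.asarray(adaptive_cw) + g_f * np.asarray(fixed_cw)
        speech = np.zeros(len(excitation))
        for n in range(len(excitation)):
            # short-term (vocal tract) IIR synthesis filtering
            past = sum(a * speech[n - i - 1]
                       for i, a in enumerate(lpc) if n - i - 1 >= 0)
            speech[n] = excitation[n] + past
        return speech, excitation  # excitation also updates the adaptive codebook

The same excitation, fed back into the adaptive codebook, provides the long-term (pitch) component for the following subframes.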
The AMR-NB coder, standardized by the 3GPP, is briefly described at the end of the chapter. This coder is the basis of all the simulations carried out in this thesis. The AMR-NB encodes signals sampled at 8 kHz and can produce a variable bit-rate, also called mode, depending on the channel resources. The bit-rates are the following: 4.75 kbps, 5.15 kbps, 5.90 kbps, 6.70 kbps (identical to PDC-Japan), 7.4 kbps (identical to DAMPS-IS136), 7.95 kbps, 10.2 kbps and 12.2 kbps (identical to GSM-EFR). The total number of allocated bits depends on the mode in which the coder operates, but it is the quantization of the fixed codeword that requires the most bits. The coder includes other functionalities, among others two voice activity detection options.
Chapter 3: Noise Reduction in the Parameter Domain of CELP-Type Coders
Algorithms dedicated to the reduction of additive noise are generally implemented in the frequency domain. The noisy signal is first transformed into the frequency domain, and a filter is applied to each frequency component of the noisy signal. The result of this filtering is an estimate of the useful signal, which is then converted back into the time domain via an inverse FFT. The parameters of the filter, or of the gain used, depend essentially on the noise and on the noisy signal.
The filtering applied in the frequency domain comes in several variants; one can cite spectral attenuation, the Wiener filter and the Ephraim-Malah rule, which are summarized in the first part of this chapter. In most cases the filtering requires an estimate of the Signal-to-Noise Ratio (SNR). This chapter proposes two algorithms that reduce the noise by modifying, on the one hand, the fixed gain and, on the other hand, the LPC coefficients of the noisy signal.
The algorithm dedicated to noise reduction via modification of the fixed gain exploits the Wiener filter. The algorithm implemented here follows work initiated before this thesis. The major innovation of the approach is the transposition and extrapolation of minimum statistics, generally used in the frequency domain. We have also introduced a parallel with the a priori Signal-to-Noise Ratio in the parameter domain. Since the coding operation is not linear, we have proposed a simple relation linking the fixed gain of the noisy signal to those of the useful signal and of the noise. The fixed gain of the useful signal estimated in this way is then inserted into the bit-stream. Listening tests have shown that this algorithm significantly reduces the noise at the decoder if the SNR is not too low.
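A minimal Python sketch of this idea is given below, under simplifying assumptions of our own (a decision-directed a priori "SNR" formed directly from the fixed codebook gains); the exact relation between the gains and the tuning used in the thesis are those derived in Chapter 3.

    def enhance_fixed_gain(g_noisy, g_noise, g_prev_clean, alpha=0.98, floor=0.1):
        """Wiener-style attenuation of the fixed codebook gain (one frame).

        g_noisy      : decoded fixed gain of the noisy speech
        g_noise      : estimated noise fixed gain (e.g. minimum statistics
                       transposed to the gain domain)
        g_prev_clean : enhanced fixed gain of the previous frame
        """
        eps = 1e-12
        snr_post = max(g_noisy ** 2 / max(g_noise ** 2, eps) - 1.0, 0.0)
        snr_prio = (alpha * g_prev_clean ** 2 / max(g_noise ** 2, eps)
                    + (1.0 - alpha) * snr_post)
        h = max(snr_prio / (1.0 + snr_prio), floor)  # Wiener-type gain in [floor, 1]
        return h * g_noisy  # requantized and written back into the bit-stream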
The second algorithm proposed in this chapter performs noise reduction via modification of the LPC coefficients of the noisy signal. This algorithm exploits the relation linking the LPC coefficients of the noisy signal to those of the noise and of the useful signal. The approach requires a Voice Activity Detection (VAD), since this relation can only be used when the useful signal is non-zero and/or the noise is non-zero. When speech is present, we use this relation to estimate the LPC coefficients of the useful signal; when there is no voice activity, we favour a spectral attenuation. Experimental results have shown that modifying the LPC coefficients improves the spectral characteristics of the speech signal in the presence of noise, especially in voiced sections. Compared to classical Wiener filtering, objective tests show a significant improvement.
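The relation in question can be illustrated as follows: if speech and noise are uncorrelated, the autocorrelation of the noisy signal is approximately the sum of the speech and noise autocorrelations, so an estimate of the clean LPC coefficients can be obtained by subtracting the noise autocorrelation and re-running a Levinson-Durbin recursion. The Python sketch below shows this generic scheme; it is a simplified stand-in for the estimators of Chapter 3, not the thesis algorithm itself.

    import numpy as np

    def clean_lpc_estimate(r_noisy, r_noise, order=10):
        """Estimate clean-speech LPC coefficients from autocorrelations.

        r_noisy : autocorrelation of the noisy signal, lags 0..order
        r_noise : estimated noise autocorrelation (from noise-only frames)
        """
        r = np.asarray(r_noisy, float) - np.asarray(r_noise, float)
        r[0] = max(r[0], 1e-6)            # keep the problem well conditioned
        a = np.zeros(order + 1)
        a[0] = 1.0
        err = r[0]
        for i in range(1, order + 1):     # Levinson-Durbin recursion
            k = -(r[i] + a[1:i] @ r[i - 1:0:-1]) / err
            a[1:i + 1] += k * np.concatenate((a[i - 1:0:-1], [1.0]))
            err *= 1.0 - k * k
        return a                          # coefficients of the estimated A(z)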
Chapter 4: Acoustic Echo Cancellation in the Parameter Domain of CELP-Type Coders
To reduce the undesirable effects of acoustic echo, one traditionally estimates the impulse response of the filter modelling the acoustic cavity. This estimation can be done in the time domain or in the frequency domain, and makes it possible to reconstruct the echo, which is subsequently subtracted from the microphone signal. The LMS, the NLMS and the Wiener filter belong to this category. Other techniques are limited to computing a gain as a function of the energies of the microphone and loudspeaker signals; once applied to the microphone signal, this gain reduces the impact of the acoustic echo. This approach is better known as Gain Loss Control. These different approaches are discussed at the beginning of the chapter as a state of the art. The chapter then proposes two algorithms that directly modify the fixed gain of the microphone signal in order to reduce the effects of the acoustic echo.
The first algorithm is inspired by the classical time-domain Gain Loss Control (GLC). A parallel is drawn between the amplitude of the signal and the fixed gain. In this method, the attenuation coefficients are computed by estimating the signal energy from the coder parameters. Experimental verifications showed that this algorithm behaves remarkably well during so-called 'single talk' periods (only one speaker is talking).
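A minimal Python sketch of such a gain-loss-control rule on coder parameters is shown below; the thresholds and the three-way decision are illustrative choices of ours, while the actual energy estimator built from the coder parameters is derived in Chapter 4.

    def gain_loss_control(e_mic, e_spk, margin=2.0, att_min=0.1):
        """Attenuation factor for the microphone path (one frame).

        e_mic : energy estimate of the microphone signal (from coder parameters)
        e_spk : energy estimate of the loudspeaker signal
        """
        if e_spk > margin * e_mic:    # far-end single talk: mostly echo
            return att_min
        if e_mic > margin * e_spk:    # near-end single talk: pass through
            return 1.0
        return 0.5                    # ambiguous / double talk: mild attenuation

    # The factor multiplies the fixed codebook gain of the microphone bit-stream:
    # g_f_enhanced = gain_loss_control(e_mic, e_spk) * g_f_mic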
The second algorithm corresponds to filtering the fixed gain. This approach is based on an analogy between classical Wiener filtering in the frequency domain and filtering of the coder parameters. It requires an estimate of the fixed gain of the acoustic echo, for which we used a correlation-based method. The filtering accounts for both 'double talk' and 'single talk' periods. For this pseudo-Wiener filter, the notion of Signal-to-Echo Ratio (SER) in the coder parameter domain, analogous to the Signal-to-Noise Ratio (SNR), was introduced. The proposed SER estimate is inspired by the recursive approach of Ephraim and Malah. The advantage of this method is that it behaves relatively well during 'double talk' periods compared to the GLC. The performance depends on the type of filter modelling the environment that generates the echo, as well as on the SER.
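The following Python sketch illustrates this pseudo-Wiener filtering of the fixed gain with a recursive SER estimate; the correlation-based echo-gain estimator and the exact recursion of Chapter 4 are replaced here by simple stand-ins.

    def filter_fixed_gain(g_mic, g_echo, ser_prev, beta=0.97, floor=0.05):
        """Pseudo-Wiener filtering of the microphone fixed codebook gain.

        g_mic    : decoded fixed gain of the microphone signal
        g_echo   : estimated fixed gain of the echo (correlation-based in Ch. 4)
        ser_prev : SER estimate of the previous frame (recursive smoothing,
                   in the spirit of Ephraim and Malah's a priori SNR)
        """
        eps = 1e-12
        ser_inst = max(g_mic ** 2 / max(g_echo ** 2, eps) - 1.0, 0.0)
        ser = beta * ser_prev + (1.0 - beta) * ser_inst
        g = max(ser / (1.0 + ser), floor)   # attenuation in [floor, 1]
        return g * g_mic, ser               # enhanced gain and updated state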
Chapter 5: Noise Reduction and Acoustic Echo Cancellation in the CELP Parameter Domain, and Smart Transcoding
Chapter 5 constitutes the central point of this thesis. It addresses the concept of integrating the algorithms implemented in Chapters 3 and 4 into smart transcoding schemes, i.e. centralized processing: acoustic echo cancellation and noise reduction in the network.
Smart transcoding is applied here to the LPC coefficients, the fixed gain and the adaptive gain. The LPC coefficients and the fixed gain extracted from the source decoder are passed to our speech quality enhancement modules. After processing, the enhanced LPC coefficients and fixed gain, together with the adaptive gain, are mapped directly into the target encoder. With this transcoding scheme, the linear prediction analysis at the target encoder is no longer performed, nor is it necessary to compute the fixed and adaptive gains. The computational cost of transcoding between the 12.2 kbps and 7.4 kbps modes, in either direction, is reduced by about 27%, and the delay is reduced by 5 ms when transcoding from the 12.2 kbps mode to the 7.4 kbps mode.
The noise reduction behaves similarly to the classical Wiener filter during transcoding from the 12.2 kbps mode to the 7.4 kbps mode. However, during transcoding from 7.4 kbps to 12.2 kbps, the method proposed in this thesis yields objective performance superior to that of the classical Wiener filter.
The most interesting result concerns acoustic echo cancellation. Processing the speech coder parameters makes it possible to bypass the non-linearity problems introduced by the coders, which degrade the performance of classical stochastic algorithms such as the NLMS. The results (ERLE analysis) show that processing the coder parameters easily achieves ERLE values of more than 45 dB, as recommended for GSM.
Chapter 6: General Conclusion
Chapter 6 is devoted to the general conclusion of the work carried out during this thesis. We first review the achievements of the thesis and the results obtained for noise reduction, acoustic echo cancellation and smart transcoding. The chapter ends with the presentation of several directions and research axes that would contribute, on the one hand, to improving the performance and, on the other hand, to generalizing the concept introduced in this thesis.
Contents

Contents
List of Figures
List of Tables
1 Introduction and Context of the Thesis
  Introduction
  1.1 Motivations
    1.1.1 Voice Quality Enhancement in Terminal
    1.1.2 Network Voice Quality Enhancement
    1.1.3 Transcoding
    1.1.4 Alternative Approach
  1.2 Objective of the PhD
  1.3 Organization of the Document
2 Speech Coding and CELP Techniques
  Introduction
  2.1 Speech Coding: General Overview
    2.1.1 Speech Coder Classification
    2.1.2 Speech Coding Techniques
  2.2 Analysis-by-Synthesis Principle
  2.3 CELP Coding Overview
    2.3.1 CELP Decoder
    2.3.2 Physical Description of CELP Parameters
    2.3.3 Speech Perception
  2.4 Standard
    2.4.1 The Standard ETSI-GSM AMR
  2.5 Conclusion
3 Noise Reduction
  Introduction
  3.1 Noise
  3.2 Noise Reduction in the Frequency Domain
    3.2.1 Overview: General Synoptic
    3.2.2 Spectral Attenuation Filters
    3.2.3 Techniques to Estimate the Noise Power Spectral Density
      3.2.3.1 Estimation of the Noise PSD based on the Voice Activity Detection
      3.2.3.2 The Minimum Statistic Technique
  3.3 Introduction to Noise Reduction in the Coded Domain
    3.3.1 Some Previous Works in the Codec Domain
  3.4 Noise Reduction Based on Weighting the Fixed Gain
    3.4.1 Estimation of the Noise Fixed Codebook Gain
    3.4.2 Relation between Fixed Codebook Gains
    3.4.3 Attenuation Function
    3.4.4 Noise Reduction Control: Post Filtering
  3.5 Noise Reduction through Modification of the LPC Coefficients
    3.5.1 Estimation during Voice Activity Periods
      3.5.1.1 Estimation of the Noise LPC Vector: ÂD
      3.5.1.2 Estimation of the Noise Autocorrelation Matrix: Γ̂D
      3.5.1.3 Estimation of the Speech Autocorrelation Matrix: Γ̂S
    3.5.2 Estimation during Noise Only Periods
    3.5.3 Some Experimental Results
  3.6 Conclusion
4 Acoustic Echo Cancellation
  4.1 Introduction
  4.2 Acoustic Echo Scenario
  4.3 Acoustic Echo Cancellation: State of the Art
    4.3.1 The Least Mean Square Algorithm
    4.3.2 The Gain Loss Control in the Time Domain
    4.3.3 The Wiener Filter Applied to Acoustic Echo Cancellation
    4.3.4 The Double Talk Problem
  4.4 Overview on Acoustic Echo Cancellation Approaches in the Coded Parameter Domain
  4.5 The Gain Loss Control in the Coded Parameter Domain
    4.5.1 Estimation of the Energy
    4.5.2 Computation of the Attenuation Gain Factors
    4.5.3 Experimental Results Analysis
  4.6 Acoustic Echo Cancellation by Filtering the Fixed Gain
    4.6.1 System Overview
    4.6.2 Approximation of the Joint Function f(., .) and the Filter G(m)
    4.6.3 Estimation of the Echo Signal Fixed Codebook Gain: ĝf,Z
    4.6.4 Experimental Results
  4.7 Conclusion
5 Voice Quality Enhancement and Smart Transcoding
  5.1 Introduction
  5.2 Network Interoperability Problems and Voice Quality Enhancement
    5.2.1 Classical Speech Transcoding Scenarios
    5.2.2 Classical Speech Transcoding and Voice Quality Enhancement
  5.3 Alternative Approach: the Speech Smart Transcoding
    5.3.1 The Speech Smart Transcoding Principle and Strategies
    5.3.2 Mapping Strategy of the LPC Coefficients
    5.3.3 Mapping Strategy of the Fixed and Adaptive Codebook Gains
  5.4 Network Voice Quality Enhancement and Smart Transcoding
  5.5 The Proposed Architecture
    5.5.1 Noise Reduction Integrated in the Smart Transcoding Algorithm
    5.5.2 Acoustic Echo Cancellation Integrated in the Smart Transcoding Algorithm
  5.6 Experimental Results
    5.6.1 Overall Computational Load and Algorithmic Delay
    5.6.2 Overall Voice Quality Improvement
    5.6.3 Noise Reduction
      5.6.3.1 The ITU-T Objective Measurement Standard for GSM Noise Reduction
      5.6.3.2 Noise Reduction: Simulation Results
    5.6.4 Acoustic Echo Cancellation
    5.6.5 Simulation Results
  5.7 Conclusion
6 General Conclusion
  6.1 Context
  6.2 Thesis Contribution
    6.2.1 Perspectives
A GSM Network and Interconnection
  A.1 GSM Networks Architecture

B CELP Speech Coding Tools
  B.1 The Recursive Levinson-Durbin Algorithm
    B.1.1 Steps of the Recursive Levinson-Durbin Algorithm
  B.2 The Inverse Recursive Levinson-Durbin Algorithm
  B.3 The ITU-T P.160
    B.3.1 Assessment of SNR Improvement (SNRI)
    B.3.2 Assessment of Total Noise Level Reduction (TNLR)

Bibliography
List of Figures

1.1 Acoustic Echo Scenario.
1.2 VQE in a Digital Wireless Network.
1.3 Codec Domain Speech Enhancement.

2.1 Generic Design of a Speech Coder.
2.2 Illustration of Coding Delay.
2.3 Encoder based on Analysis-by-Synthesis.
2.4 Typical CELP Decoder.
2.5 Human Speech Production Mechanism.
2.6 Voiced Sound.
2.7 Unvoiced Sound.
2.8 LPC Coefficients as Spectral Estimation.
2.9 Structure of the Analysis Window.
2.10 Rectangular Window.
2.11 Hamming Window.
2.12 Typical CELP Encoder.
2.13 Decoding Block of the AMR-NB.

3.1 Existing Noise Reduction Unit Location.
3.2 Spectral Attenuation Principle.
3.3 Simplified Block Diagram of the AMR VAD Algorithm, Option 1.
3.4 Example of VAD Decision, Option 1.
3.5 Experimental Setup for the Exchange of Parameters.
3.6 Coded Domain Scaling.
3.7 Fixed Codebook Gain Modification in Parameter Domain.
3.8 Example of Noise Fixed Codebook Gain Estimation.
3.9 (a) Clean speech signal; (b) noisy speech fixed gain (red), clean fixed gain (blue); (c) noisy speech fixed gain (red), noise fixed gain (blue).
3.10 (a) Clean speech; (b) noisy speech; (c) noisy fixed gain (red), estimated clean fixed gain (blue).
3.11 Estimated Fixed Codebook Gain.
3.12 Principle of NR based on LPC Coefficients.
3.13 Estimation Flowchart of the Clean Speech LPC Coefficients.
3.14 Lag Windowing Values.
3.15 Damping Factor Characteristics.
3.16 Typical Example of Spectrum Damping.
3.17 Typical Estimated Spectrum (SNRseg = 12 dB): (a) the proposed method displayed with the noisy spectrum; (b) the proposed method compared to the noisy, clean and Wiener spectra.

4.1 Acoustic Echo Scenario.
4.2 System Identification in AEC.
4.3 Control Characteristics of the Microphone in Gain Loss Control.
4.4 Combined AEC/CELP Predictor.
4.5 Gain Loss Control in the Codec Parameter Domain.
4.6 Example of Energy Estimation in Codec Parameter Domain.
4.7 Characteristics of the Attenuation Gains.
4.8 Example of AEC based on Gain Loss Control.
4.9 Typical Example of the Evolution of the Attenuation Factor.
4.10 Filtering of the Microphone Fixed Codebook Gain Principle.
4.11 Example of AEC by Filtering the Fixed Gain.

5.1 Generic GSM Interconnection Architecture.
5.2 Transcoding, Classical Approach.
5.3 Network VQE, Classical Solution.
5.4 Smart Transcoding Principle.
5.5 Transcoding Example from 7.4 kbps Mode to 12.2 kbps Mode: Spectrum of the Associated Synthesis Filters.
5.6 Transcoding Example from 12.2 kbps Mode to 7.4 kbps Mode: Spectrum of the Associated Synthesis Filters.
5.7 Adaptive Gains in Transcoding: Typical Example during Transcoding from 7.4 kbps to 12.2 kbps Mode.
5.8 Typical Example of Decoded Fixed Codebook Gains during Transcoding from 7.4 kbps Mode to 12.2 kbps Mode.
5.9 Fixed Codebook Gains Mapping, Transcoding from 7.4 kbps Mode to 12.2 kbps Mode.
5.10 Fixed Codebook Gains Mapping, Transcoding from 12.2 kbps Mode to 7.4 kbps Mode.
5.11 Structure of the Codec Domain VQE Embedded in Smart Transcoding.
5.12 Proposed Architecture.
5.13 Flowchart of Noise Reduction in Smart Transcoding.
5.14 Overview of the Gain Loss Control Integrated in Smart Transcoding.
5.15 Filtering of the Fixed Codebook Gain Integrated in Smart Transcoding.
5.16 Objective Metrics versus Segmented SNR, Transcoding from 12.2 kbps Mode to 7.4 kbps Mode: Proposed NR Method (blue dashed circle), Wiener NR Method (red dashed diamond).
5.17 Objective Metrics versus Segmented SNR, Transcoding from 7.4 kbps Mode to 12.2 kbps Mode: Proposed NR Method (blue dashed circle), Wiener NR Method (red dashed diamond).
5.18 Spectrogram of the Noisy Speech Signal: 6 dB Segmented SNR.
5.19 Spectrogram of the Noisy Speech Enhanced with the Standard Wiener Filter.
5.20 Spectrogram of Coded Domain Enhancement: Transcoding from 12.2 kbps Mode to 7.4 kbps Mode.
5.21 Spectrogram of Coded Domain Enhancement: Transcoding from 7.4 kbps Mode to 12.2 kbps Mode.
5.22 Time Evolution of the ERLE: from 12.2 kbps Mode to 7.4 kbps Mode, Case Filter h1.
5.23 Time Evolution of the ERLE: from 7.4 kbps Mode to 12.2 kbps Mode, Case Filter h1.

A.1 Generic GSM Interconnection Architecture.
List of Tables

2.1 Bit-rate Classification of Speech Coders.
2.2 Vocoder Relationship.
2.3 12.2 kbps Mode Algebraic Codebook Positions.

4.1 Mean Linear Coefficients in Double Talk Mode.
4.2 Mean and Standard Deviation of Opinion Scores.

5.1 Total Average of the Objective Metrics.
5.2 Echo Return Loss Enhancement Values.

B.1 Threshold Levels for Speech Classification.
B.2 Objective Metrics Requirements.
List of Abbreviations
AbS: Analysis-by-Synthesis
ACR: Absolute Category Rating
ACELP: Algebraic Code-Excited Linear Prediction
ADC: Analog-to-Digital Converter
ADPCM: Adaptive Differential Pulse Code Modulation
AE: Acoustic Echo
AEC: Acoustic Echo Cancellation
AMR-NB: Adaptive Multi-Rate Narrow-Band
AMR-WB: Adaptive Multi-Rate Wide-Band
ATH: Absolute Threshold of Hearing
BTS: Base Transceiver Station
BSC: Base Station Controller
BSS: Base Station Subsystem
CELP: Code-Excited Linear Prediction
CODEC: COder/DECoder
CNG: Comfort Noise Generator
CPU: Central Processing Unit
DAC: Digital-to-Analog Converter
dB: Decibel
dBov: Decibel-Overload
DCT: Discrete Cosine Transform
DFT: Discrete Fourier Transform
DPCM: Differential Pulse Code Modulation
DSN: Difference between SNRI and TNLR
DSP: Digital Signal Processor
DTD: Double Talk Detection
DTX: Discontinuous Transmission
FFT: Fast Fourier Transform
FFG: Filtering of the Fixed Gain
FIR: Finite Impulse Response
GLC: Gain Loss Control
GMSC: Gateway Mobile Switching Center
GSM: Global System for Mobile communications
HLR: Home Location Register
ISDN: Integrated Services Digital Network
ITU-T: International Telecommunication Union, Telecommunication standardization sector
IIR: Infinite Impulse Response
IP: Internet Protocol
KBPS: Kilo Bits Per Second
LMS: Least Mean Square
LP: Linear Prediction
LPC: Linear Prediction Coefficients
LSF: Line Spectral Frequencies
LSP: Line Spectral Pair
LTP: Long Term Prediction
MA: Moving Average
MOS: Mean Opinion Score
MSC: Mobile Switching Center
MD: Mobile Device
NLMS: Normalized Least Mean Square
NPLR: Noise Power Level Reduction
NSS: Network Switching Subsystem
NR: Noise Reduction
PCM: Pulse Code Modulation
PESQ: Perceptual Evaluation of Speech Quality
PSTN: Public Switched Telephone Network
PLMN: Public Land Mobile Network
QoS: Quality of Service
SER: Signal-to-Echo Ratio
SNR: Signal-to-Noise Ratio
SNRI: Signal-to-Noise Ratio Improvement
SSNR: Segmental Signal-to-Noise Ratio
STP: Short Term Prediction
TNLR: Total Noise Level Reduction
TRAU: Transcoder and Rate Adaptation Unit
UMTS: Universal Mobile Telecommunications System
VAD: Voice Activity Detection
VoIP: Voice over Internet Protocol
VQE: Voice Quality Enhancement
X-LMS: X-Filter Band Least Mean Square
Chapter 1
Introduction and Context of the Thesis
Introduction
In mobile communication, voice quality and intelligibility are two of the most important factors in customer satisfaction. Therefore, in wireless telecommunication scenarios (mobile to mobile, or mobile to other networks), as described in Sec. 1.1, the communication system should be designed so as to produce, on the listener side, a perceived sound impression as close as possible to a face-to-face conversation. In a telecommunication scenario, the speech signal received by the end user is generally affected by external impairments due to three main problems.
The first problem is that Mobile Device (MD) systems or terminals are usually used on the move, especially in noisy surroundings (busy offices, high street traffic, airport halls, restaurants, moving vehicles, etc.). As a consequence, both the receiver and transmitter ends are surrounded by background noise. Voice quality and intelligibility can be significantly affected by noise, resulting in listener fatigue and difficulty in understanding each other. Speech quality and intelligibility can be improved by a noise reduction algorithm. Such an algorithm should efficiently reduce the background noise by significantly increasing the speech signal-to-noise ratio, while having minimal effect on the speech signal itself (distortion, clicks, buzzing). In addition, the algorithm should be able to manage various requirements smartly: reducing the noise to a comfortable level while retaining its basic characteristics, maximum noise attenuation, and a controlled degree of noise reduction aggressiveness [Loizou 2007].
A second problem is the Acoustic Echo (AE), whose presence owes much to the low-quality amplification systems of mobile devices, to hands-free telephony and to the acoustics of the room where the speaker is located. AE materializes as the acoustic coupling between the loudspeaker and the microphone. As depicted in Fig. 1.1, the microphone at the sending side captures not only the near-end speech signal s(t) but also a delayed version of the far-end speech signal z(t), leading to the superposition of sound waves captured by the microphone: y(t) = s(t) + z(t). This phenomenon greatly degrades conversation quality, and the remote speaker (Receiver Speaker in Fig. 1.1) experiences the annoying effect of hearing his own voice with a delay. The delay is usually about 200-500 ms, due to the transmission time over the mobile network, which can be particularly high. Acoustic echo also increases the voice activity factor; this is particularly true with the Adaptive Multi-Rate (AMR) coder, which integrates a Discontinuous Transmission (DTX) module, so that radio efficiency in the uplink is reduced [3GPP 1999b]. Acoustic Echo Cancellation (AEC) is therefore strongly recommended [ITU-T 2004] to reduce echo effects, and it should not introduce excessive delay.
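Throughout this document, echo attenuation is quantified by the Echo Return Loss Enhancement (ERLE). As a reminder, its standard definition (written here in LaTeX, with notation chosen to match Fig. 1.1 rather than quoted from a later chapter) is:

    % ERLE: microphone-signal power over residual power after cancellation
    \mathrm{ERLE}\,(\mathrm{dB}) = 10 \log_{10} \frac{E\{y^2(t)\}}{E\{e^2(t)\}}

where y(t) is the microphone signal and e(t) the residual signal after echo cancellation; larger values mean better cancellation, the GSM recommendation being 45 dB.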
Figure 1.1: Acoustic Echo Scenario. (Block diagram: the far-end signal x(k) is decoded and played by the loudspeaker as x(t); through the echo path, the microphone at the sending speaker side captures y(t) = z(t) + s(t), which is digitized and encoded as y(k) back toward the receiver speaker side.)
The third problem is related to the fact that the world of telecommunication is becoming more and more heterogeneous. The proliferation of mobile devices and the development of several speech coders for different networks have led to the deployment of coders that are not interoperable with each other. To interoperate, bit-streams need to be converted in the gateways separating different networks: the bit-stream of one codec is decoded and re-encoded into the target codec bit-stream format. This process can also be performed inside the Base Station Subsystem (BSS). Such a solution, also called transcoding, requires computational load, decreases speech quality and increases algorithmic delay. Solutions to overcome transcoding problems are in development, but are still limited to configurations where the coders (sending and receiving) at every switching stage are similar. Due to these external impairments, voice quality enhancement algorithms are more critical than ever. Noise Reduction and Acoustic Echo Canceller algorithms can be used to overcome speech degradation. Current solutions are based on highly elaborate and complex algorithms, and their common principle, whether to reduce noise effects or to reduce or cancel acoustic echo, is to operate directly on the speech samples.
To describe where NR and AEC are actually implemented, a short overview of the GSM architecture is necessary (see Annex A, the next section, and [Halonen et al. 2002]).
Figure 1.2: VQE in a Digital Wireless Network. (Block diagram: speech, noise and echo enter the mobile device, where terminal-based speech enhancement (NR/AEC) can precede the speech and channel coders; network-based VQE can alternatively be placed in the radio access network (BTS, BSC, TRAU) or in the core network (MSC), on the transmission paths toward the PSTN/PLMN.)
1.1 Motivations
This thesis took place in the audio coding and speech processing departments of Siemens Mobile, BenQ Mobile and Nokia Siemens Networks, successively. These departments have been involved in the standardization activities of the International Telecommunication Union, Telecommunication Standardization Sector (ITU-T) related to embedded speech and audio coding and to speech enhancement.
The purpose of the thesis is to address the three problems mentioned above (background noise, acoustic echo and transcoding). For a better understanding of how these three problems will be addressed, a brief overview of a wireless network is useful. In Fig. 1.2, three modules of a typical wireless network (Universal Mobile Telecommunication System (UMTS) or GSM) are depicted: the mobile device (receiver side or transmitter side), the radio access network and the core network. Owing to the symmetry between the transmitter and receiver ends, only the transmitter end is shown in detail. The transmitter-end device captures noise d(n) and/or echo signals z(n) simultaneously with the clean speech signal s(n). The second module represents the radio access network, which controls the radio link with the mobile device. The third module shows the core network, where the Mobile Switching Center (MSC) and the gateway are located. Fig. 1.2 shows where Voice Quality Enhancement (VQE) algorithms, such as NR and AEC, are generally implemented or deployed: the first location is directly inside the terminal; the second possible location is within the network, where the enhancement can be performed either at the MSC or near the radio access network ([Eriksson 2006], [Cotanis 2003]).
1.1.1 Voice Quality Enhancement in Terminal
According to Fig. 1.2, enhancement is achieved as a pre-processing step before encoding, or after decoding near the loudspeaker. Existing techniques use the corrupted speech signal in PCM format to perform noise reduction and/or acoustic echo cancellation. The deployed algorithms are generally based on transform techniques (FFT, DCT, DFT, block processing, sub-band implementations, etc.). The complexity of such approaches is high and is constrained by the CPU of the DSP design. Real-time processing constraints also make it difficult and expensive to implement more advanced algorithms. Additionally, terminal solutions today depend on the terminal manufacturers; as a consequence, network providers cannot control the delivered quality across all terminals. Some solutions were initiated by incorporating speech enhancement inside speech coders, but they lead to the same problems as other terminal solutions.
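To make the classical transform-based approach concrete, here is a minimal Python sketch of per-bin Wiener-style spectral attenuation, assuming the noise power spectrum has already been estimated (e.g. during speech pauses); it illustrates the generic technique only and is not code from the thesis.

    import numpy as np

    def wiener_denoise_frame(noisy_frame, noise_psd, floor=0.1):
        """Classical frequency-domain noise reduction on one analysis frame.

        noisy_frame : time-domain samples of one frame
        noise_psd   : estimated noise power per FFT bin (len(frame)//2 + 1)
        floor       : minimum gain, limits musical noise
        """
        spectrum = np.fft.rfft(noisy_frame * np.hanning(len(noisy_frame)))
        noisy_psd = np.abs(spectrum) ** 2
        # a posteriori SNR per bin, then Wiener gain snr / (1 + snr)
        snr = np.maximum(noisy_psd / np.maximum(noise_psd, 1e-12) - 1.0, 0.0)
        gain = np.maximum(snr / (1.0 + snr), floor)
        # back to the time domain (overlap-add omitted for brevity)
        return np.fft.irfft(gain * spectrum, n=len(noisy_frame))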
1.1.2 Network Voice Quality Enhancement
Positioning speech enhancement in the network leads to a technique called network voice quality enhancement. The algorithms implemented (NR, AEC) are similar to those used in terminal devices. As PCM samples are not available, processing is achieved by decoding the bit-stream, performing noise reduction and/or acoustic echo cancellation (in the uplink) in the time or frequency domain, and re-encoding. Such solutions are implemented near the MSC or between the transcoder and the MSC.
Another interesting configuration is where user A from a (wireless or wireline) network A (coder A) is in conversation with user B from another (wireless or wireline) network B (coder B). Transcoding needs to be performed inside the network for bit-stream conversion and interoperability. In practice, transcoding is performed in the gateway: the bit-stream from coder A is decoded by decoder A and the decoded speech signal is re-encoded with encoder B. Such a solution always degrades speech quality and introduces distortion due to the superposition of multiple quantization noises (encoding-decoding-encoding-decoding). In the presence of noise and/or echo, 'classical' speech enhancement (NR and AEC) must be performed after the decoding stage and before the re-encoding process: see Fig. 1.2.
1.1.3 Transcoding
If coder A and coder B use similar technology, Code-Excited Linear Prediction (CELP) for example, the parameters that are transmitted are of the same kind. An interesting technique involves directly mapping some parameters of encoder A inside encoder B, leading to partial encoding; the encoding process is thus reduced. This approach is suitable when the speech signal is not corrupted by noise and/or acoustic echo. The technique is known as Smart Transcoding and has already been experimented with good results, as in [Kang et al. 2003], [Beaugeant, Taddei 2007] and [Ghenania 2005].
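A minimal Python sketch of this mapping idea follows; the parameter container, the toy quantizer and the optional enhancement hook are hypothetical names introduced for this illustration and do not come from the thesis or from any real codec API.

    from dataclasses import dataclass
    from typing import List

    @dataclass
    class CelpParams:
        """Per-frame CELP parameters shared by source and target coders."""
        lpc: List[float]      # LPC (or LSF) coefficients
        fixed_gain: float     # fixed codebook gain
        adaptive_gain: float  # adaptive codebook gain
        pitch_lag: int        # index of the adaptive codeword

    def quantize(value, step=0.05):
        """Toy scalar quantizer standing in for the target codec's tables."""
        return round(value / step) * step

    def smart_transcode(p, enhance=None):
        """Map source-coder parameters directly into the target format.

        LPC analysis and gain search at the target encoder are skipped;
        only (re)quantization to the target codebooks remains.
        """
        if enhance is not None:          # optional coded-domain NR / AEC
            p = enhance(p)
        return CelpParams(
            lpc=[quantize(a) for a in p.lpc],
            fixed_gain=quantize(p.fixed_gain),
            adaptive_gain=quantize(p.adaptive_gain),
            pitch_lag=p.pitch_lag,       # reused as-is between similar modes
        )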
If the speech signal is corrupted by noise and/or echo, the bit-stream from encoder A is decoded with decoder A and speech enhancement is performed on the PCM samples. The enhanced speech signal is then re-encoded using encoder B. Current voice quality enhancement methods in the network are computationally expensive and introduce algorithmic delay.
1.1.4 Alternative Approach
Developments of acoustic echo cancellation and noise reduction units are now moving from the terminals (MD) into the networks. There are several reasons for, and advantages to, this new placement of VQE algorithms. First of all, central control of network quality is desirable: network providers have a high diversity of devices in their networks, with various levels of speech quality. At the same time, the quality of mobile devices has not been particularly enhanced over the last decade; new challenges have appeared, like the miniaturization of devices, without speech quality being a high focus. Concrete industrial development therefore points to placing VQE in the network as an equally efficient or even better solution than those built into terminals. Enhancement of speech quality and intelligibility is related to the characteristics of the perturbation source, and even to the design of the platform for the dedicated algorithm [Loizou 2007]. The drawbacks of the 'classical' solutions mentioned in Sec. 1.1.1 and 1.1.2 lead to the idea that VQE could be performed directly by modifying the available bit-stream. Modifying the parameters composing the bit-stream avoids the entire decoding/encoding process necessary in the 'classical' solution: it reduces the computational load and avoids tandeming effects. Such solutions were initiated in [Chandran, Marchok 2000] and were recently extended to automatic level control [Pasanen 2006] and frame energy estimation [Doh-Suk et al. 2008]. The principle of this concept is depicted in Fig. 1.3.
Figure 1.3: Codec Domain Speech Enhancement. (Block diagram: at the near-end speaker side, the corrupted input speech is encoded into a corrupted bit-stream; in the network, the corrupted parameters are extracted, processed and mapped back into the bit-stream, the other parameters being passed through; the far-end decoder receives the enhanced bit-stream and outputs enhanced speech.)
Additionally, this new speech enhancement solution (modification of coded parameters) can easily be integrated into smart transcoding schemes. In combination with smart transcoding, our idea involves modifying some CELP parameters before mapping them into the target bit-stream. With such solutions, the delay is minimized, the computational demand is reduced, and the problems due to classical transcoding are controlled.
1.2 Objective of the PhD
This PhD investigation addresses both the interoperability problem in speech coding and the voice quality enhancement problem. Its purpose is the conception and implementation of flexible, single-microphone speech enhancement algorithms providing good speech quality and intelligibility. The proposed algorithms operate on CELP parameters. Above all, these algorithms can be located anywhere inside the network, since they are applied to the bit-stream; their main feature is that they do not require the decoded speech signal.
Based on the discussion of Sec. 1 and 1.1, this PhD first proposes algorithms to enhance CELP parameters (AMR-NB) degraded by background noise and/or acoustic echo, without requiring the PCM speech samples. The developed algorithms are compared to the two existing 'classical' placements of voice quality enhancement described in Sec. 1.1.1 and 1.1.2. The proposed speech enhancement algorithms should also satisfy constraints of computational load, implementation flexibility and real-time processing (algorithmic delay). Additionally, this study relies on knowledge of CELP coding techniques and of the parameters transmitted in the bit-stream.
This PhD also explores the smart transcoding principle and interoperability problems. This part includes investigations of different architectures and configurations of transrating between AMR-NB modes. Results regarding the mapping of parameters and its impact on speech quality and intelligibility are presented.
The final step of the PhD is the investigation of embedded solutions, that is, the integration of our new algorithms (noise reduction and/or acoustic echo cancellation) inside smart transcoding schemes. The chosen embedded architecture should be implemented within a transrating scheme between different AMR-NB modes.
Our main contributions in this domain of research are:
– The proposition and implementation of noise reduction algorithms that modify the coded parameters of the AMR-NB. Such algorithms can be located inside the network or in any area where the coded parameters can be recovered.
– The proposition and implementation of acoustic echo canceller algorithms located inside the network, again through the modification of the coded parameters of the AMR-NB. As above, these algorithms can be implemented in any area where the coded parameters can be recovered.
– NR and/or AEC solutions embedded inside smart transcoding schemes between different AMR-NB modes, the smart transcoding being applied during the mapping of parameters.
1.3 Organization of the Document
The present document is organized in six chapters, including the present one.
– Chap. 2 introduces the elementary principles of CELP coding techniques. It also highlights the physical description and signification of the parameters transmitted inside the bit-stream by CELP coders. An overview of the AMR-NB architecture is also given.
– Chap. 3 and 4 are dedicated to noise reduction and acoustic echo cancellation, respectively. Before developing new algorithms in the parameter domain, we start with a state of the art of existing techniques and present recent works based on the modification of coded parameters. New algorithms dealing with CELP parameters are then presented, for noise reduction in Chapter 3 and for acoustic echo cancellation in Chapter 4. Both chapters conclude with experimental results of objective and subjective tests, for comparison with classical speech enhancement approaches.
– Chap. 5 can be viewed as an application of the new algorithms introduced in Chapters 3 and 4. The problem of centralized or network VQE is introduced. We start this chapter with a description of smart transcoding and of the interoperability problems. The purpose of this chapter is to integrate the proposed noise reduction and acoustic echo cancellation algorithms inside a smart transcoding scheme. The performance of such an architecture is studied via objective tests.
– Chap. 6 concludes this work. The critical points of the PhD are highlighted, and some perspectives to improve this new voice quality enhancement approach are indicated.
Chapter 2
Speech Coding and CELP Techniques
Introduction
Digital technology is based on sampling theory, which states that a continuous signal can be reconstructed without distortion from a finite number of samples per second (the discrete signal). The rate at which samples are taken from the continuous signal determines the bandwidth of the discrete signal, i.e. the span of frequencies that the discrete signal can contain. This theorem, also known as the Nyquist-Shannon sampling theorem, stipulates that the sampling rate must be at least twice the highest frequency present in the signal, [Goldberg, Riek 2000], [Shannon 1949].
Traditional telephony provides a bandwidth limited to approximately 3.4 kHz, much less than what our ear can perceive. To improve the user experience, one could increase the bandwidth, i.e. the sampling frequency. But telecom operators focus rather on reducing the quantity of information to be transmitted than on widening the frequency bandwidth. It is at this point that coding theory is exploited for speech compression.
The purpose of speech coding is to compress a digitalized speech signal using as few bits
as possible, while keeping a reasonable listening quality level. Properties such as low bit-rate,
high speech quality, robustness across different speakers and languages, robustness against
transmission errors, good performance during noisy periods, low memory size, low computational complexity, and low algorithmic delay are the most desirable features for a speech coder [Goldberg, Riek 2000] - [Oppenheim, Schafer 1999]. The most widespread technique in speech coding nowadays is Code-Excited Linear Prediction (CELP). This technique tries to mimic the human speech production apparatus. The vocal tract is modeled by a set of LPC coefficients. The excitation signal is a linear combination of an adaptive excitation (vocal cords) and of a fixed excitation (noise-like signal). This approach produces a good approximation of the human speech production model.
In this chapter, computational details of the CELP coder are not presented. The objective here is to give a good understanding of the parameters transmitted by the CELP encoder. We will especially focus on the physical description and signification of the transmitted parameters received at the CELP decoder.
This chapter starts with a general overview of speech coding in Sec. 2.1. In Sec. 2.2, the
principle of analysis-by-synthesis is presented. The CELP coder technique is widely discussed
in Sec. 2.3. In this section, physical interpretation of the CELP parameters is also presented.
As an application of the CELP technique, Sec. 2.4 is dedicated to the description of the AMR-NB standard.
2.1 Speech Coding: General Overview
A basic structure of a speech encoder and decoder is depicted in Fig. 2.1 (see [CHU 2000]). The processing at the encoder is frame-wise, taking into account the quasi-stationarity of the speech signal. The input speech signal in PCM format (16-bit PCM at a sampling rate of 8 kHz, which would require a bit-rate of 128 kbps without compression) is analyzed to extract a number of pertinent parameters. These parameters characterize the speech frame under analysis. The computed parameters are quantized and sent together as a compressed bit-stream. As speech coding aims at reducing the bit-rate, the compressed bit-stream should have a bit-rate lower than 128 kbps. The channel coding processes the encoded digital speech data for error protection. Finally, the channel-protected bit-stream is transmitted.

This PhD does not address specific problems related to source coding and channel coding. Information on channel theory and channel coding can be found in [Bossert 1999]. At the decoder side, Fig. 2.1 (b), the bit-stream is unpacked and the quantized parameters are obtained. The synthetic speech is generated by synthesizing and processing the decoded parameters.
2.1.1 Speech Coder Classification
In speech coding, many encoding techniques have been developed. Current coders that are candidates for standardization should satisfy some particular attributes, which are used for classification [Kleijn, Paliwal 1995]. These attributes are the following ones.
Figure 2.1: Generic Design of a Speech Coder: (a) encoder, (b) decoder.
Bit-Rate
The bit-rate specifies the number of bits required to encode a speech signal. The minimum
bit-rate achievable by a speech coder is limited by the amount of information and redundancy
contained in the speech signal. A sampling frequency of 8 kHz is commonly used for speech encoding (e.g. in telephony) and the input samples are usually 2-byte (16-bit) samples. Therefore, the input bit-rate that the coder attempts to reduce is 8 kHz × 16 bits = 128 kbps. Tab. 2.1 indicates a classification of speech codecs according to the bit-rate.
Category            Bit-rate
High bit-rate       > 15 kbps
Medium bit-rate     5 to 15 kbps
Low bit-rate        2 to 5 kbps
Very low bit-rate   < 2 kbps
Table 2.1: Bit-rate Classification of Speech Coder.
Subjective Quality
This attribute refers to the perceived quality of the reconstructed speech signal at the receiver end, meaning the intelligibility and naturalness of the spoken words and their ability to be understood.
Complexity
Computational demand is one of the key issues; usually, a low bit-rate implies a high complexity. There is also an important requirement of memory storage, directly related to the algorithmic complexity. A sophisticated coder needs a large amount of fast memory to store intermediate coefficients and codebooks.
Delay
The algorithmic complexity generally implies algorithmic delay. As depicted in Fig. 2.2, the
overall coder delay is the sum of several delay components: encoder buffering delay, encoder
processing delay, transmission delay, decoder buffering delay and decoder processing delay.
Buffering for real time implementation entails some delay that should also be minimized.
Figure 2.2: Illustration of Coding Delay.
Bandwidth
This characteristic refers to the frequency range that the coder is able to reproduce.
2.1.2 Speech Coding Techniques
Another attribute that differentiates coders is their coding technique. Several speech coders
have been developed and can be classified in three groups:
– Waveform approximating coders:
Here, the speech signal is digitized and each sample is coded with a constant number of bits (G.711 or PCM, [ITU-T 1988]). These coders provide high quality at bit rates greater than 16 kbps. Below this limit, the quality degrades rapidly. The number of bits for the quantization can be reduced when the difference between a sample and its predicted version is coded. The G.726 or ADPCM, Adaptive Differential Pulse Code Modulation, is an example of a waveform coder [Daumer et al. 1984].
– Parametric coders:
Based on a frame-based processing of the input digital speech signal, this kind of coder uses a model to generate and estimate a set of parameters that are quantized and transmitted. The frame size is about 10 − 30 ms and the decoded speech signal is intelligible, but the perceptual quality of such coders depends on the model used. The most successful coder in this group is the LP vocoder, where a filter modeling the vocal tract is derived from linear prediction. The parameters sent to the decoder are the filter coefficients, the unvoiced/voiced state of the frame, the variance of the excitation signal and the pitch period for voiced frames. The bit-rate of such coders is within the range of 2 to 4 kbps, and these coders are especially used in military applications, permitting high data protection and encryption [Federal Standard 1015, Telecommunications: Analog to Digital Conversion of Radio Voice By 2400 Bit/Second Linear Predictive Coding 1984].
– Hybrid coders:
These coders can be regarded as a combination of parametric and waveform coders. Additional parameters of the model are fitted such that the decoded speech approximates the original waveform in the time domain as closely as possible. In this class, the commonly used technique is analysis-by-synthesis. This technique makes use of the same linear prediction as vocoders, but the excitations are computed in a different way, independently of the type of speech segment (voiced or unvoiced). The excitation signal is computed as a linear combination of a periodic part (adaptive excitation) and a noise-like part (fixed excitation). The bit-rate lies between 4 and 16 kbps [Schroeder, Atal 1985a].
2.2 Analysis-by-Synthesis Principle
The analysis-by-synthesis principle is used in CELP coding. This section gives a brief description of the technique (see [Kondoz 1994], [Atal, Remde 1982] and [Schroeder, Atal 1985b]).
In CELP coders, the speech is represented by a set of parameters. One way to select the
parameters, also called open-loop, is to analyze the speech and extract a group of parameters.
With the analysis-by-synthesis scheme, also called closed-loop, the speech signal s(n) is reconstructed by the encoder, giving a synthetic speech signal s̃(n). The reconstruction is performed by a model of speech production that depends on certain parameters. During the closed-loop procedure, the reconstructed signal is compared to the original input speech signal according to a defined criterion (typically, the error metric is a perceptually weighted mean squared error between the original and the synthesized speech). Based on this criterion, the best configuration of the quantized parameters is selected and its index or indices are transmitted to the
receiver. At the receiver side, the decoder uses techniques similar to those implemented at the
encoder side to re-synthesize the original speech signal.
As shown in Fig. 2.3, parameters are chosen conditionally to an error criterion. This
principle is generalized over many coders, and CELP coders use this technique to achieve
optimum excitation sequences. Additionally, quantization exploits spectral masking: the quantization noise is shaped such that its energy is located in spectral regions where the original signal has most of its energy. This effect will be discussed in Sec. 2.3, concerning CELP coders.

Figure 2.3: Encoder based on Analysis-by-Synthesis.
2.3 CELP Coding Overview
CELP coders can be seen as improved versions of LP vocoders, where an interesting mathematical model provides an equivalent of the human speech production system, Tab. 2.2. The vocoder model assumes that the digital speech is the output of a digital filter whose input excitation is either white noise or a train of impulses.
Human Speech Production          Mathematical Model
Vocal Tract                      LPC Synthesis Filter: H(z)
Air From Lungs                   Input Excitation: u(n)
Vocal Tract Vibration            Voiced Speech: v(n)
Vocal Cords Vibration Period     Pitch Periods: T
Fricatives and Plosives          Unvoiced Speech: uv(n)
Air Volume from the Lungs        The Gain Applied to the Excitation: G
Table 2.2: Vocoder Relationship.
To model the filter and its excitation, the LP vocoder analyzes the speech signal to extract its parameters. The speech synthesis is only performed at the decoder. With CELP coders, a different way to encode the speech signal was developed by introducing the principle of analysis-by-synthesis at the encoder. This principle was highlighted by Schroeder and Atal in [Atal, Remde 1982] and [Schroeder, Atal 1985a]. The main innovation at this point was that the excitation was no longer based on a strict voiced/unvoiced classification of the speech frame. A synthesis phase was added to the encoding process such that the excitation was modeled using short-term and long-term linear prediction of speech, combined with an excitation codebook. The analysis step was based on linear prediction analysis and the synthesis filter
was estimated similarly as in the LP vocoder. The synthesis filter is therefore excited by excitation vectors selected inside codebooks, which explains the terminology code-excited in CELP. The excitation signal is selected by matching the reconstructed speech waveform to the original signal as closely as possible. An easy way to understand a CELP coder is to start with the decoder, where quantized parameters are decoded from the bit-stream to synthesize the speech signal. In this work, we will not go too deep into the CELP encoding process. We will limit our study to the physical description of the CELP parameters and to a description of the decoder. A large overview of CELP coders and related standards can be found in ([Vary, Martin 2005], [Kleijn, Paliwal 1995] and [Kondoz 1994]).
2.3.1 CELP Decoder
The speech signal is processed frame by frame by the CELP encoder. The parameters are
then extracted and transmitted frame by frame. At the CELP decoder side, parameters from
each sub-frame are used to synthesize the speech signal. A sub-frame is entirely characterized
by two groups of parameters, namely the excitation parameters and the vocal tract parameters.
The CELP decoder as presented in Fig. 2.4 involves five different steps, represented with the
yellow boxes.
Figure 2.4: Typical CELP Decoder.
The vocal tract parameters are represented by the LPC coefficients. During step (i), the quantized LPC coefficient vector Â = (â_1, . . . , â_M) is used to build the synthesis filter Ĥ(z), given by:

Ĥ(z) = 1/Â(z) = 1 / (1 + Σ_{i=1}^{M} â_i · z^(−i))    (2.1)
where M is the order of the linear prediction analysis.
The quantized LPC coefficients âi are also used to construct the post filter. Generally, the
LPC coefficients are computed on a frame basis and are then interpolated to obtain the LPC
coefficients for each sub-frame. Excitation parameters are divided into fixed and adaptive
excitations. In step (ii), the received pitch delay T is used to select a section of the past
excitation. The selected section v(n) = u(n − T ) is called the adaptive codebook vector.
In step (iii), the received index j of the fixed codebook is used to select the optimum fixed
codebook vector cj (n).
In step (iv), the quantized indices of the fixed and adaptive codebook gains are decoded and
the quantized fixed codebook gain ĝf , and the adaptive codebook gain ĝa are obtained. The
quantized codebook gains are used to scale cj (n) and v(n) respectively. The final excitation
u(n) is constructed during step (v) as follows:
u(n) = ĝ_a · u(n − T) + ĝ_f · c_j(n)    (2.2)
The excitation parameters are generally computed for each sub-frame at the decoder. For each sub-frame, the final excitation is filtered through the synthesis filter. The output of this operation is enhanced by a post-processing filtering, which yields the decoded or synthesized speech signal s̃(n).
The post-processing in most CELP coders is achieved by a combination of a long-term post filter, a short-term post filter and a tilt compensation. Basically, the post filtering enhances the perceptual quality by lowering the perceived noise in the synthesized speech signal. This process is performed by attenuating the signal in the valleys of the spectrum. In addition, the output signal is filtered through a high-pass filter to remove low-frequency components. The speech samples are also up-scaled to recover an appreciable speech level. An adaptive gain control is used to ensure that the signal energy of the post-filtered speech signal is the same as that of the input speech signal. Step (v) is completed by storing the final excitation inside the adaptive codebook. This final excitation will be used during the computation of the next sub-frame.
Because this work aims at implementing a VQE system based on the CELP codec parameters, this chapter highlights the CELP parameters and their physical description and representation. We will not detail computational aspects that are not needed for the general purpose of this thesis. The following sections provide a general overview of the CELP encoder.
2.3.2 Physical Description of CELP Parameters
As described in Sec. 2.3.1, the coders based on the CELP technique transmit approximately the same kind of parameters. The differences generally relate to the number of parameters, the way they are computed and quantized, and the number of bits allocated for the transmission. Current speech coders such as CELP ones exploit the human speech production apparatus, as described in Fig. 2.5 below. The lungs generate air pressure that flows through the trachea, the vocal cords, the pharynx, and the oral and nasal cavities.
Figure 2.5: Human Speech Production Mechanism.
The vocal tract represents all the cavities above vocal folds. The shape of the vocal tract
determines the sound somebody makes. Changes of the vocal tract are relatively slow (10 ms
to 100 ms). The amount of air coming from the lungs characterizes the loudness of the sound
and acts as energy.
When somebody speaks, the speech sounds can be created according to the following scenarios:
– First the flow of air sets the vocal folds in an oscillating motion. Typical sounds can
be vowels (/a/, /o/ and /i/) and nasals (/m/, /n/). The vocal cords vibrate and the
rate at which they vibrate determines the pitch. These types of sound are called voiced
sounds. Women and young children tend to have high pitch (fast vibration), whereas
adult males tend to have low pitch (slow vibration).
– Second, the air flow is constricted, producing fricatives (/f/, /s/ and /h/), or completely stopped for a short interval. These sounds are named unvoiced sounds and have noise-like characteristics. The vocal cords do not vibrate, but remain constantly open.
– Third, sounds such as /z/ are produced by both exciting the vocal tract with a periodic excitation and forcing air through a constriction of the vocal tract. These sounds are called mixed sounds.
A typical waveform generated at the vocal folds (voiced phoneme) is represented in Fig. 2.6. The time representation in Fig. 2.6 (a) is characterized by the periodicity of the signal.
In the frequency representation of Fig. 2.6 (b), we can observe the harmonic structures of the
spectrum. The spectrum also indicates a dominant low-frequency content. In the [0, 1500] Hz range, one can observe four significant peaks. These peaks correspond to resonances in the vocal tract and are also called formants.

Figure 2.6: Voiced Sound.

The unvoiced phoneme time representation in Fig. 2.7 (a) is
noise-like, and there is no significant periodic component. We can also see in the spectrum of Fig. 2.7 (b) that there is a significant amount of high-frequency components. This effect basically corresponds to rapid signal changes, i.e. the random nature of unvoiced sounds.
The structure observed in the speech signal representations reflects the human speech production system. The source-filter model is the basis of the technique used in CELP coders to characterize the transmitted parameters. The power spectra of voiced and unvoiced speech segments are characterized by two attributes: the envelope of the power spectrum (LPC coefficients) and the fine structure of the power spectrum (pitch delay) (cf. [Kleijn, Paliwal 1995]).
Figure 2.7: Unvoiced Sound.
The Vocal Tract Parameters
According to the foregoing, the vocal tract can be considered as a time-varying filter. In each frame, the vocal tract is modeled by a linear filter, characterized by the LPC coefficient vector A = (a_1, . . . , a_M). The LPC coefficients are computed assuming that the speech signal in a given frame follows an Auto-Regressive model. The computation of the LPC coefficients is called linear prediction analysis.
The LPC coefficients can be viewed as a spectral envelope estimation of the signal over a
frame. Using the LPC coefficients of an original signal, it is possible to generate another signal
whose spectrum characteristics are approximately those of the original signal. As indicated
in Fig. 2.8 (b), the LPC coefficients are used to build the synthesis filter (cf. EQ. 2.1). The
spectrum of the synthesis filter corresponds to the speech signal envelope. As a consequence, in
a perturbed environment, if the LPC coefficients of the useful speech signal are well estimated
before the decoding steps, it will be possible to recover a reasonably good speech spectrum
envelope. This idea is used further in this thesis to enhance a signal in a noisy environment.
In speech coding, the linear prediction analysis can be defined as a procedure to remove redundancy, where short-term redundant information is eliminated.
Figure 2.8: LPC Coefficients as Spectral Estimation (speech spectrum vs. synthesis filter spectrum).

The synthesis filter associated with the computed LPC coefficients should be stable. An IIR
filter is said to be stable if all the poles of its transfer function are inside the unit circle. If there
is a pole outside the unit circle, then there will be an exponentially increasing component of
the impulse response [Haykin 2002a]. In other words, a filter H is stable if its impulse response
h(n) decreases to zero as n goes to infinity.
In most CELP coders, especially in the AMR-NB standard, the LPC coefficients are transmitted via their Line Spectral Frequencies (LSF), introduced in [Itakura 1975]. An efficient computation using Chebyshev polynomials was proposed in [Kabal, Ramachandran 1986]. The LSF representation is motivated by several advantages:
– The LSF are bounded: 0 ≤ LSF ≤ π and 0 ≤ LSF_1 ≤ · · · ≤ LSF_M ≤ π. With this property, the stability check of the LPC coefficients can be easily performed. The stability of the synthesis filter can also be taken into account during encoding by controlling the range of the LPC coefficients (cf. [Kabal, Ramachandran 1986] and [Kabal 2003]).
– The number M of LSF and the range of the values to quantize allow a better behavior at low bit rates for quantization. The LSF parameters of adjacent frames are highly correlated. This property leads to the adoption of interesting quantization techniques such as prediction and interpolation.
– The LSF parameters are directly linked to the spectral envelope of the speech signal. As a consequence, the formants can be described by the repartition of consecutive LSF parameters. For example, two or three consecutive LSF coefficients can describe a formant.
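As a small illustration of these properties, the sketch below converts LPC coefficients to LSFs by root-finding on the symmetric and antisymmetric polynomials P(z) = A(z) + z^(−(M+1))·A(z^(−1)) and Q(z) = A(z) − z^(−(M+1))·A(z^(−1)). This is a textbook formulation, not the Chebyshev search actually used in coders, and the function name is ours.

    import numpy as np

    def lpc_to_lsf(a):
        """Convert LPC coefficients (a_1 .. a_M) of A(z) = 1 + sum a_i z^-i to LSFs.

        Root-finding version for illustration only; real coders use the faster
        Chebyshev-polynomial search of [Kabal, Ramachandran 1986].
        """
        a_ext = np.concatenate(([1.0], np.asarray(a, dtype=float), [0.0]))
        p = a_ext + a_ext[::-1]        # P(z): symmetric polynomial
        q = a_ext - a_ext[::-1]        # Q(z): antisymmetric polynomial
        ang = np.angle(np.concatenate((np.roots(p), np.roots(q))))
        # Keep one angle per conjugate pair; drop the trivial roots at 0 and pi
        return np.sort(ang[(ang > 1e-8) & (ang < np.pi - 1e-8)])

For a stable filter, the returned values satisfy 0 < LSF_1 < · · · < LSF_M < π, so a decoder-side stability check reduces to verifying that the sequence is strictly increasing.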
Computation of LPC Coefficients: Linear Prediction Analysis
The linear prediction analysis, or LP analysis, is performed block-by-block (frame by frame) and starts by windowing the frame to be analyzed. The windowing process, as depicted in Fig. 2.9, serves to select the appropriate section (frame or sub-frame) of the speech signal. Simultaneously to the windowing, the individual blocks need to be overlapped to prevent loss of information at the edges of frames. The overlapping process includes portions of adjacent frames in the current frame. Windowing the speech also reduces the audible distortion of the reconstructed speech. The impact of the windowing can be assessed by analyzing the spectral representation of the window. The simplest analysis window is the rectangular window given by:
w_rect(n) = 1 for n = 0, . . . , N − 1, and 0 otherwise.    (2.3)
Figure 2.9: Structure of the Analysis Window.
The rectangular window has a disadvantage in the frequency domain. The Fourier transform of the window has a narrow main lobe, but it also has appreciable side lobes (secondary lobes), as seen in Fig. 2.10. The rectangular window gives the best frequency resolution but, due to the side lobes, it is not used much in applications.
In order to reduce the side lobes after the Fourier transform, the speech signal is windowed using particular windows. The Hamming window or hybrid windows (combinations of Hamming halves) are the most used in speech coding [Kleijn, Paliwal 1995]. The expression of a typical Hamming window is:

w_hamming(n) = 0.54 − 0.46 · cos(2πn/(N − 1)) for n = 0, . . . , N − 1, and 0 otherwise.    (2.4)
As shown in Fig. 2.11, the magnitude of the Hamming window spectrum is much greater at low frequencies than at high frequencies. Point-to-point multiplication with a frame makes the edges of the frame insignificant.
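As a small numerical illustration of Figs. 2.10 and 2.11, the Python sketch below builds both windows (EQ. 2.3 and EQ. 2.4) and compares their spectra; the frame length and helper name are our own choices.

    import numpy as np

    N = 240                                   # e.g. a 30 ms frame at 8 kHz
    n = np.arange(N)
    w_rect = np.ones(N)                                     # EQ. 2.3
    w_ham = 0.54 - 0.46 * np.cos(2 * np.pi * n / (N - 1))   # EQ. 2.4 (= np.hamming(N))

    def spectrum_db(w, nfft=8192):
        """Zero-padded magnitude spectrum of a window, normalized, in dB."""
        W = np.abs(np.fft.rfft(w, nfft))
        return 20.0 * np.log10(W / W.max() + 1e-12)

    # The rectangular window has the narrowest main lobe but side lobes near
    # -13 dB, while the Hamming window trades a wider main lobe for side
    # lobes around -43 dB, which is why Hamming-type windows are preferred.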
Figure 2.10: Rectangular Window.
Figure 2.11: Hamming Window.
Another issue is the window size. Its choice affects the time and frequency resolutions, following Heisenberg's uncertainty principle. By convention, n = 0 corresponds to the first sample of the current frame; the samples for n < 0 are those of the previous frame.
A linear prediction analysis of order M writes the current windowed speech signal sample s_w(n) as a linear combination of M past samples plus a residual signal e(n). The residual signal is also called the LP or LPC residual signal.

s_w(n) = − Σ_{k=1}^{M} a_s(k) · s_w(n − k) + e(n)    (2.5)
In the following, the linear filter coefficients, i.e. the vector A_S = (a_s(1), . . . , a_s(M)), will represent the LPC coefficient vector of the windowed speech frame s_w(n). To estimate the optimal LPC coefficients, the autocorrelation and covariance methods are generally used. These methods select the linear filter coefficients by minimizing the short-term energy (square error) of the residual signal E_ST = Σ_{n=0}^{N−1} e(n)² by least squares:

E_ST = Σ_{n=0}^{N−1} ( s_w(n) + Σ_{k=1}^{M} a_s(k) · s_w(n − k) )²    (2.6)
Setting the partial derivatives of the square error with respect to each coefficient a_s(j) to zero, we obtain the set of LPC coefficients that minimizes the square error. The autocorrelation method achieves this computation as follows:

∂E_ST/∂a_s(j) = 0 ⇔ Σ_{k=1}^{M} a_s(k) · Σ_{n=0}^{N−1} s_w(n − k) · s_w(n − j) = − Σ_{n=0}^{N−1} s_w(n) · s_w(n − j)    (2.7)
In EQ. 2.7, the term r_S(j) = Σ_{n=j}^{N−1} s_w(n) · s_w(n − j), j = 0, . . . , M, represents the autocorrelation function introduced previously. In matrix form, the system of EQ. 2.7 can be written as:

−Γ_S · A_S = R_S    (2.8)
where the M × M autocorrelation matrix is defined as:

Γ_S = [ r_S(0)       r_S(1)       · · ·   r_S(M − 1) ]
      [ r_S(1)       r_S(0)       · · ·   r_S(M − 2) ]    (2.9)
      [   ...          ...         ...       ...     ]
      [ r_S(M − 1)   r_S(M − 2)   · · ·   r_S(0)     ]
and the autocorrelation vector is given by:

R_S = (r_S(1), . . . , r_S(M))    (2.10)
The autocorrelation matrix Γ_S is a Toeplitz matrix, so the LPC coefficients can be computed efficiently. Iterative techniques are usually preferred, since direct matrix inversion is computationally expensive. In particular, a solution based on the recursive Levinson-Durbin algorithm is implemented in most speech coders. References [Kondoz 1994] and [Haykin 2002a] give an extensive explanation of the Levinson-Durbin algorithm. Furthermore, the Toeplitz structure of the autocorrelation matrix guarantees an estimate of A_S such that the associated synthesis filter H(z) = 1/A_S(z) is stable.
The linear prediction analysis is an important module in recent coders. It can be defined in three different manners:
– It is an identification technique where parameters of the system are found from the
observations: Speech in this situation is modeled as an Autoregressive (AR) signal,
which is appropriate in practice, [Kleijn, Paliwal 1995].
– The linear prediction analysis can be viewed as a spectral estimation method: Linear
prediction allows computation of LPC coefficients which characterize the PSD of the
signal itself. Based on the LPC coefficients of an original signal, it is possible to generate
another signal such that the spectra characteristics are close to the original ones, (see
Fig. 2.8).
– Above all, for speech coding application, the linear prediction analysis can be defined as
a procedure to remove redundancy, where short time repeated information is eliminated.
Perceptual Weighting Filter
The perceptual weighting filter is introduced in the CELP coder during the analysis-by-synthesis procedure at the encoder. The optimum parameters are computed in the CELP encoder by minimizing the error in the perceptual domain (the error is filtered through the perceptual filter before the minimization). Frequency masking experiments have shown that the effects of quantization noise are not audible in the frequency bands where the speech has more energy (formant regions). The quantization of the parameters exploits this effect by allocating the quantization noise to the bands with high energy. This technique is also called noise shaping, as introduced in [Schroeder et al. 1979]. The perceptual weighting filter, which derives from the LPC filter, exploits the characteristics of the human auditory system. It was formally designed to shape the noise so as to make its energy lower between formants and higher in formant zones (cf. [Kondoz 1994], [Spanias 1994]). Former perceptual weighting filters were computed based on the LPC coefficients as follows:
W(z) = A(z) / A(z/γ)    (2.11)
The poles of the weighting filter are those of the synthesis filter, but drawn towards the center of the unit circle. This filter is an all-pole filter and is in fact a bandwidth expansion filter [CHU 2000], since the constant γ introduces a dilatation effect [Kabal 2003]. By applying this filter to the speech signal, a listener will notice little (quantization) noise in the formant regions. Current CELP coders use a perceptual weighting filter of the form:
form:
W(z) =
A( γz1 )
A( γz2 )
(2.12)
where the perceptual factor values are in the range 0 < γ1 , γ2 < 1.
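Since A(z/γ) simply scales the i-th LPC coefficient by γ^i, EQ. 2.12 is straightforward to apply. The following sketch (our own helper; the γ values are the AMR-NB defaults for most modes, cf. Sec. 2.4.1) filters a signal through W(z).

    import numpy as np
    from scipy.signal import lfilter

    def perceptual_weighting(x, a, gamma1=0.94, gamma2=0.6):
        """Filter x through W(z) = A(z/gamma1) / A(z/gamma2), EQ. 2.12.

        a -- LPC coefficients (a_1 .. a_M); A(z/g) has coefficients a_i * g^i,
        i.e. its roots are drawn towards the centre of the unit circle.
        """
        a = np.asarray(a, dtype=float)
        i = np.arange(1, len(a) + 1)
        num = np.concatenate(([1.0], a * gamma1 ** i))   # A(z/gamma1)
        den = np.concatenate(([1.0], a * gamma2 ** i))   # A(z/gamma2)
        return lfilter(num, den, x)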
Pitch Delay: Adaptive Excitation
The speech signal is also characterized by its long-term dependency. The fine structure of the speech spectrum corresponds to the long-term autocorrelation. The dependency is caused by the fundamental frequency of the speaker, which is in the range [40 − 500] Hz, according to the speaker's gender and age. This fundamental frequency F_0 corresponds to the vibration of the vocal cords. A voiced speech segment, as shown in Fig. 2.6, is quasi-periodic in the time domain. It can be identified by the positions of the largest signal peaks and by analysis of the fine structure. The distance between the largest signal peaks is referred to as the pitch period or pitch lag. The spectrum of such a signal exhibits spectral peaks (in Fig. 2.6 for example, four spectral envelope peaks). These peaks are the manifestation of the resonances of the vocal tract at these frequencies. The adaptive codebook search exploits the quasi-periodic structure of the speech that occurs during voiced speech segments. The adaptive codebook vector v(n) is computed by simply weighting the past excitation at lag T = F_s/F_0, where F_s is the sampling rate, with the adaptive codebook gain g_a.
The LPC coefficients and the adaptive excitation represent the predictable contribution of the speech signal. After these contributions have been removed, the remaining non-predictable (noise-like) excitation, or final residual signal, needs to be modeled by the so-called fixed excitation.
Fixed Excitation
Signals with a slowly varying power spectrum can be used as an approximation of this final residual signal. A table containing pulses, also called the fixed codebook, is used in CELP coding to approximate this signal.
The fixed codebooks are generally characterized by their size, their dimension, and their search complexity. Among all these properties, complexity is a very important criterion. Complexity is significantly reduced by simplifying and imposing a structure on the codebook dictionary.
Depending on the type of CELP coder, fixed codebooks can be classified as deterministic or nondeterministic. In the nondeterministic class, Gaussian codebooks were the first to be used in CELP codecs. This kind of codebook was proposed in [Schroeder, Atal 1985b] and was made from Gaussian noise with 1024 entries. The complexity of this codebook is high. The Gaussian codebook also needs memory space for storing the codevectors at the encoder and at the decoder. Trained Gaussian codebooks [Moriya et al. 1993] can also be used, where the fixed codebook vectors are trained on a representative set of input signals.
Deterministic codebooks, in general, are those where the excitations are constituted by P non-zero pulses designed as follows (cf. [Atal, Remde 1982]):

c_j(n) = Σ_{i=0}^{P−1} β_i · δ(n − m_i)    (2.13)
where β_i denotes the pulse amplitude and m_i its position. Among deterministic codebooks, the regular pulse codebook and the algebraic codebook have been developed. The regular pulse codebook uses pulses with a constant distance between two consecutive pulses [Kroon et al. 1986]. The algebraic codebook was introduced in [Adoul et al. 1987], where the pulse amplitudes are given by β_i = ±1, ∀i ∈ {0, . . . , P − 1}. Most recent coders are based on this type of codebook: only P non-zero values are needed, and the complexity and memory demand are low. Algebraic codebooks allow the implementation of special codebooks for CELP (ACELP) that alleviate storage and reduce complexity. In this category, the ternary codebook is the most implemented. As in [Goldberg, Riek 2000] and in [Kleijn et al. 1990], the codebook components are either 1, 0 or −1. This kind of codebook offers interesting flexibility: only additions and subtractions are performed. The complexity of such a codebook during the filtering process can be significantly reduced by varying the number of zeros inside the codebook.
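The sketch below builds such a ternary codevector from transmitted signs and positions (EQ. 2.13); the function name and the example values are purely illustrative.

    import numpy as np

    def build_algebraic_codevector(positions, signs, L=40):
        """Build a ternary ACELP codevector c_j(n) = sum_i beta_i delta(n - m_i).

        positions -- pulse positions m_i within the sub-frame of length L
        signs     -- pulse amplitudes beta_i in {-1, +1}
        Only signs and positions need to be transmitted, no amplitude bits.
        """
        c = np.zeros(L)
        for m, beta in zip(positions, signs):
            c[m] += beta          # ternary entries: -1, 0 or +1
        return c

    # Example: a 4-pulse codevector on a 40-sample (5 ms at 8 kHz) sub-frame
    c = build_algebraic_codevector([3, 11, 22, 38], [+1, -1, +1, -1])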
Finally, the entire CELP coding process, as illustrated by Fig. 2.12, can be summarized as follows:
– The vocal tract parameters, represented by the LPC coefficients a_s(i), i = 1, . . . , M, are computed, quantized (generally using LSF quantization) and transmitted.
– The excitations are searched based on the analysis-by-synthesis principle in the perceptual domain, using the perceptual weighting filter W(z). An adaptive contribution and a fixed contribution are combined to compute the excitation signal.
– The adaptive excitation v(n) is characterized by an adaptive gain and a pitch lag: g_a and T respectively.
– The fixed codebook search provides the index (signs and positions) of the fixed codebook vector c_j(n) and the associated gain g_f.

Figure 2.12: Typical CELP Encoder.
The quantized parameters are transmitted to the decoder, and:
– The decoder reconstructs the final excitation by using the decoded parameters.
– The LPC coefficients are decoded to build the synthesis filter and the post filter.
– The final excitation is filtered through the synthesis filter.
– The output of the synthesis filter is enhanced through the post filtering (optional) to improve the quality of the decoded speech waveform.

2.3.3 Speech Perception
The human ear is made of various components; its inner part contains the basilar membrane, which has been well investigated, see [Goldberg, Riek 2000]. The basilar membrane can be considered as a non-uniform filter bank or spectral analyzer whose bands (also called critical bands) vary with frequency. Different points of the basilar membrane react differently depending on the frequency of the incoming sound wave.

The human ear can perceive sounds whose frequencies range over the interval 15 − 20000 Hz, provided their level is significant. The level of a sound must be above the absolute audition threshold, which depends on the frequency. The absolute audition threshold, also called Absolute Threshold of Hearing (ATH), is generally characterized by the amount of energy needed for a sound to be detected by a listener in a silent environment. The ATH varies from one person to another. The human ear tends to be most sensitive to frequencies in the range [1, 4] kHz (cf. [Painter, Spanias 2000]). In contrast, the Maximum Threshold of Hearing (MTH) characterizes the threshold over which the hearing sensation becomes painful.
The masking property
The masking effect is defined as the situation where the perceptibility of a sound (the maskee) is disturbed by the presence of another sound (the masker), (see [Vary, Martin 2005] and [Schroeder et al. 1979]). Two masking effects, known as temporal masking and frequency masking, are mostly encountered.

Temporal masking is difficult to exploit. In this type of masking, a sound will not be noticed if it directly follows a louder sound. Both temporal and frequency masking are used in audio coding to design the psychoacoustic model. Frequency masking is observed when two or more sounds are presented simultaneously to the auditory system. Depending on the shape of the magnitude spectrum, the presence of certain spectral energy will mask the presence of other spectral energy. Frequency masking is generally used in speech coding through the weighting synthesis filter. Masked frequencies do not need to be quantized and transmitted, hence the codec can focus on quantizing the most important information.
2.4 Standard
In this section, we present a short overview of the standardized 3GPP (3rd Generation Partnership Project) AMR ACELP coder [3GPP 1999b]. This CELP-based coder is used in this thesis as a reference for our simulations and applications. AMR-NB is the mandatory speech coder for narrow-band (8 kHz) communication over UMTS. This coder follows the legacy of the original CELP and uses algebraic techniques. The parameters of the AMR coder are exactly the same as those described in Sec. 2.3; some additional enhanced features are summarized in the following.
2.4.1 The Standard ETSI-GSM AMR
The AMR narrowband codec is the 3GPP mandatory standard codec for narrowband
speech and multimedia messaging services over 2.5G/3G wireless systems based on evolved
GSM core networks (WCDMA, EDGE, and GPRS). Initially developed for the GSM system,
AMR-NB was standardized by the European Telecommunications Standards Institute (ETSI)
in 1999. The AMR-NB uses an 8 kHz speech sampling rate to operate at various bit rates, also called modes: 4.75, 5.15, 5.90, 6.70 (same as PDC-Japan), 7.4 (same as D-AMPS IS-136), 7.95, 10.2 and 12.2 kbps (same as GSM-EFR). The input speech signal is divided into frames of 20 ms (160 samples), each with four sub-frames of 5 ms (40 samples). The exact processing of the speech depends on the AMR-NB encoder mode and is largely described in [3GPP 1999b].
As depicted in Fig. 2.13, the decoder block diagram of the AMR-NB shows various steps
similar to those of the initial CELP decoder of Fig. 2.4, but including new features.

Figure 2.13: Decoding Block of the AMR-NB.
LPC Analysis
A 10th order LPC analysis is performed once or twice per frame, on a windowed signal (hybrid window: two Hamming halves of different sizes). The LPC analysis is performed twice per frame in the 12.2 kbps mode, with no look-ahead, and only once per frame in the other modes, with 5 ms look-ahead. The Levinson-Durbin algorithm is used to compute the LPC coefficients. The LSF representation is used to transmit the LPC coefficients.
The perceptual weighting filter is given by W(z) = A(z/γ_1)/A(z/γ_2), where γ_2 = 0.6, whereas γ_1 = 0.9 (for the 10.2 kbps and 12.2 kbps modes) or γ_1 = 0.94 (for all other modes).
Excitation Parameters (Computed for each sub-frame)
The adaptive excitation is computed in two steps: an open-loop search coupled to a closed-loop search. The open-loop search is performed twice per frame, on 10 ms of signal samples, but only once per frame in the 4.75 kbps and 5.15 kbps modes. After the open-loop search, an intermediate open-loop pitch T_OP is obtained. The closed-loop search is then performed in each sub-frame. The closed-loop search is confined to a small number of lag values around the open-loop pitch T_OP. Finally, the pitch delay obtained at the end of the closed-loop search in each sub-frame has two components, an integer part T and a fractional part T_frac. The pitch computation is completed by the adaptive gain g_a, limited to the interval 0 ≤ g_a ≤ 1.2.
The fixed codebook is algebraic, with ternary ACELP vectors (see Tab. 2.3). The number of pulses inside the codebook depends on the mode, leading to several bit allocations. In the 12.2 kbps mode, the fixed codebook is encoded using 35 bits per sub-frame. The fixed codebook has five tracks, and each track can hold two pulses, which can be assigned to 8 different positions inside the sub-frame. A little trick is used for the transmission of the two pulses of a track: if the position number of the second pulse is lower than that of the first pulse, then the second pulse has the opposite sign of the first pulse; if the position number is higher, then both pulses have the same sign. Hence, only the sign of the first pulse in the track needs to be transmitted. As all the pulses have the same amplitude (equal to one), no bit is transmitted for the pulse amplitude. The transmitted indices are used at the decoder to choose the appropriate codebook vector.
Track   Pulses    Positions
1       i0, i5    0, 5, 10, 15, 20, 25, 30, 35
2       i1, i6    1, 6, 11, 16, 21, 26, 31, 36
3       i2, i7    2, 7, 12, 17, 22, 27, 32, 37
4       i3, i8    3, 8, 13, 18, 23, 28, 33, 38
5       i4, i9    4, 9, 14, 19, 24, 29, 34, 39
Table 2.3: 12.2 kbps Mode Algebraic Codebook Position.
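The sign convention described above can be captured in a few lines; the following sketch (hypothetical decoder-side helper) recovers the sign of the second pulse of a track from the transmitted sign of the first one.

    def decode_track_pulses(pos1, pos2, sign1):
        """Recover the two pulse signs of one track (12.2 kbps mode convention).

        Only the sign of the first pulse is transmitted: the second pulse has
        the opposite sign if its position index is lower than the first
        pulse's, and the same sign otherwise.
        """
        sign2 = -sign1 if pos2 < pos1 else sign1
        return [(pos1, sign1), (pos2, sign2)]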
In the 12.2 kbps and 7.95 kbps modes, the fixed codebook and the adaptive codebook gains are quantized separately. In the other modes, they are jointly quantized. In the 12.2 kbps mode, the adaptive gain g_a is quantized using four bits through a non-uniform scalar quantization. The fixed codebook gain is not directly quantized or transmitted; instead, a correction factor γ̂_gf is computed based on the predicted fixed codebook gain g_f′. The predicted gain is computed using a fourth-order MA prediction (see [3GPP 1999b]). The correction factor is quantized with five bits. The fixed codebook gain quantization is performed by minimizing:

E_Q,12.2kbps = g_f − γ̂_gf · g_f′    (2.14)

In EQ. 2.14, g_f′ is the predicted gain and γ̂_gf is the correction factor. With the 7.4 kbps mode, the correction factor and the adaptive gain are quantized by minimizing:

E_Q,7.4kbps = ‖X − g_a · Y − g_f · Z‖²    (2.15)
where X is the target vector, Y the filtered adaptive codebook vector and Z the filtered fixed codebook vector (see [3GPP 1999b]). To complete the processing, at the end of each sub-frame during encoding, the final excitation is computed and stored inside the adaptive codebook dictionary for the next sub-frame. The filter states are also saved so that they can be used in the next sub-frame.
Additional Properties
Several special features are integrated in AMR-NB: functionalities such as Frame Erasure Concealment, Voice Activity Detection (VAD) [26.094 2002], Comfort Noise Generation (CNG) [26.092 2002] and Discontinuous Transmission (DTX), [3GPP 1999b]. For Voice over Internet Protocol (VoIP), the eight available bit rates allow for Quality of Service (QoS) improvement. The AMR-NB flexibility enables the development of applications that control the bit rate according to the network characteristics.
2.5 Conclusion
This chapter has presented the general principles of CELP coding techniques, especially the decoder. The AMR-NB standard has been briefly introduced. This standard processes speech in 20 ms frames. Several groups of parameters are computed, quantized and transmitted to the decoder. At the decoder side, the transmitted parameters are reconstructed or approximated to synthesize the speech waveform.
As vocal tract parameters, one or two sets of LPC coefficients, depending on the coding mode, are decoded: a_i(m), i = 1, . . . , M. These LPC coefficients model the vocal tract and are transmitted via their LSF representation.
As far as the excitation is concerned, the following parameters are computed:
– Four adaptive codebook gains and four fixed codebook gains are computed per frame, i.e. one pair of gains (g_a(m), g_f(m)) for each sub-frame m. As some special processing techniques are integrated in AMR-NB, the gains are individually or jointly quantized, according to the mode.
– Four pitch delays, one for each sub-frame m. The pitch delay has an integer part T and a fractional part T_frac. The pitch delay is used to compute the adaptive excitation v(n), which is in fact the previous excitation at lag (T, T_frac).
– Four sets of fixed codebook vectors, also called stochastic codebook vectors, one for each sub-frame m: c_j(m). Algebraic structures have permitted the implementation, inside the encoder and the decoder, of flexible codebook dictionaries depending on the coding mode. Only the signs and positions of the codebook vectors are transmitted. These parameters are also called the index. This information is used at the decoder side to select, inside the stored codebook, the best matching fixed codevector. The quantization of the fixed codevectors is where most of the bits are allocated.
We should mention that the CELP encoding process is much more complex. Therefore, in real-time processing, the CELP encoding is segmented into smaller, more manageable sequential searches of parameters. If the speech signal to be encoded is corrupted by background noise and/or acoustic echo, the encoder performs poorly and the transmitted CELP parameters need to be restored. According to the context of this PhD as introduced in Chap. 1, Voice Quality Enhancement (VQE) will be performed by processing the CELP parameters. In the next chapters, we will present in detail the algorithms we have developed to modify the AMR-NB ACELP parameters.
Chapter 3
Noise Reduction
Introduction
In mobile communication scenarios, when the speech signal is corrupted by high background
noise signal, speech coders used for low bit-rate coding are drastically affected and do not
perform well. This induces a fatigue for the far-end listener and difficulties to understand
each other. Speech quality and intelligibility of the synthesized speech in such conditions
are degraded. To overcome this drawback, a noise reduction processing is necessary. Noise
reduction algorithms generally use Pulse Code Modulation (PCM) samples of the available
signal (see Fig. 3.1).
As depicted in Fig. 3.1 (a), if the noise reduction system is located inside the mobile device, as in existing systems, the algorithm is implemented as a pre-processing before encoding or a post-processing after decoding. As shown in Fig. 3.1 (b), if the noise reduction system is located inside the network, the bit-stream needs to be decoded; the decoded signal is enhanced and then re-encoded. This way of overcoming the noise problem has many disadvantages: increased computational load, increased delay, increased complexity and signal distortion.
The idea of de-noising the coded parameters so as to achieve noise reduction and speech enhancement was originally proposed in [Chandran, Marchok 2000]. The authors of this contribution successfully carried out a feasibility study of speech enhancement based on coded parameters, especially CELP parameters.
This chapter starts with a brief definition of noise in Sec. 3.1. Sec. 3.2 presents an overview of the techniques used to reduce the noise in the frequency domain, also known as spectral attenuation. Sec. 3.3 is an overview of previous approaches operating in the coded
domain. Two algorithms developed or enhanced during this work are described in Sec. 3.4 and Sec. 3.5, respectively. Experimental results are presented in Sec. 3.5.3, and a partial conclusion ends the chapter.

Figure 3.1: Existing Noise Reduction Unit Location: (a) in the mobile device, (b) in the network.

3.1 Noise
In the real world, noise can appear in different shapes and forms. Two important characteristics are generally taken into account [Loizou 2007]:
1. Noise can be stationary, meaning that its characteristics remain unchanged over time. Car noise, such as the one used in this work, is stationary over short periods, in general of about 0 to 30 ms. The energy of such car noise is concentrated in the low frequencies.
2. Noise can be non-stationary. This is the case for cafeteria noise and street noise, which can change as often as every 20 ms.
The implementation of algorithms dedicated to the suppression of non-stationary noise is more complicated than that of stationary noise. In practice, the assumption is that noise is typically stationary over short time frames of 0 − 30 ms.
3.2 Noise Reduction in the Frequency Domain
Among all the techniques used in noise reduction, spectral attenuation is historically the most used to reduce perturbations. Spectral attenuation is typically performed in the frequency domain through the Short-Time Fourier Transform (STFT) or the Discrete Cosine Transform (DCT). In the frequency domain, the filtering is performed by weighting the spectral components. The weighting factor is generally computed as a function of the Signal-to-Noise Ratio (SNR). After the corrupted speech signal has been enhanced in the frequency domain, the enhanced speech signal is transformed back to the time domain using the inverse STFT or DCT. Spectral attenuation is widely discussed in [Loizou 2007] and [Vary, Martin 2005].
3.2.1 Overview: General Synoptic
A noise reduction algorithm in the frequency domain generally follows four main steps, as described in Fig. 3.2. The first step is the transformation of the signal from the time domain into the frequency domain. The second step is the estimation of the noise signal features. An attenuation rule is applied to the corrupted signal in the third step and, finally, the enhanced speech signal is converted back to PCM samples. The corrupted speech signal y(n) is given by the relation:

y(n) = s(n) + d(n)    (3.1)
where s(n) is the clean speech signal, d(n) is additive noise and n the sample index.
Figure 3.2: Spectral Attenuation Principle.
The temporal and spectral characteristics of the speech signal change over time [Allen 1977]. A representation of the signal over short periods (10 − 30 ms), or frames, is suitable to cope with these speech properties. A window is used to select the corresponding segment of speech and to weight the speech samples for processing. Simultaneously to the windowing, the individual blocks need to be overlapped to prevent loss of information at the edges of frames. With the overlapping, some portions of the adjacent frames are included in the current windowed samples. The STFT is obtained by computing the DFT of each overlapping windowed frame as follows [Martin et al. 2004]:

Y(p, f_k) = Σ_{n=0}^{N−1} y(p · σ + n) · w(n) · e^(−j(2π/N)·k·n)    (3.2)

where σ represents the frame shift, p the frame index, f_k, k = 0, . . . , N − 1, the frequency bin index, related to the normalized center frequency Ω_k = 2πk/N, and w(n) denotes the window function. During this process, the effect of the phase is not taken into account: φ_Ŝ(p, f_k) = φ_Y(p, f_k). This is motivated by the fact that the human ear is quite insensitive to phase effects when the Signal-to-Noise Ratio is high enough, [Wang, Lim 1982] and [Pobloth, Kleijn 1999].
As depicted in Fig. 3.2, the STFT Y(p, f_k) of the noisy signal is used to estimate the noise spectrum D̂(p, f_k). Both Y(p, f_k) and D̂(p, f_k) are used to compute the filter G(p, f_k), also called the attenuation factor. The filter G(p, f_k) is then applied to Y(p, f_k), and the result of this filtering process is the estimated clean speech STFT Ŝ(p, f_k), given by:

Ŝ(p, f_k) = G(p, f_k) · Y(p, f_k)    (3.3)

Finally, to terminate the process, the enhanced speech signal is obtained by transforming Ŝ(p, f_k) back to the time domain.
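The four steps of Fig. 3.2 can be sketched as follows in Python, with scipy's STFT/ISTFT pair. The gain rule is left as a plug-in so that any of the filters of Sec. 3.2.2 can be used, and the crude initialization of the noise PSD on the first frames (assumed speech-free) stands in for a real estimator such as those of Sec. 3.2.3; the function name is ours.

    import numpy as np
    from scipy.signal import stft, istft

    def spectral_attenuation(y, gain_fn, fs=8000, frame=256):
        """Generic frequency-domain noise reduction loop of Fig. 3.2 (sketch).

        y       -- noisy PCM signal
        gain_fn -- rule computing G(p, f_k) from |Y|^2 and the noise PSD
        """
        f, t, Y = stft(y, fs=fs, nperseg=frame)         # STFT of EQ. 3.2
        D = np.mean(np.abs(Y[:, :5]) ** 2, axis=1)      # crude initial noise PSD
        G = gain_fn(np.abs(Y) ** 2, D[:, None])         # attenuation factor
        S_hat = G * Y                                   # EQ. 3.3 (noisy phase kept)
        _, s_hat = istft(S_hat, fs=fs, nperseg=frame)   # back to the time domain
        return s_hat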
3.2.2 Spectral Attenuation Filters
In the following, we will introduce some particular filters used in spectral attenuation.
Spectral Subtraction
The spectral subtraction attenuation (cf. [Berouti et al. 1979] and [Lim, Oppenheim 1979]) can be summarized by the formula below:

G_SS(p, f_k) = max{ [ max( 1 − ζ · |D̂(p, f_k)|^a / |Y(p, f_k)|^a , 0 ) ]^(1/a) , ε }    (3.4)

The noise floor factor ε is introduced so that some noise remains, to preserve the naturalness of the environment. The exponent a has an interesting effect on the level of reduction of the spectral subtraction filter, [Lim, Oppenheim 1979]. The factor ζ controls the amount of subtraction; typical values proposed in [Berouti et al. 1979] are between 3 and 5. Varying the noise over-estimation parameter ζ affects the musical noise level. With a = 1, EQ. 3.4 reduces to magnitude spectral subtraction [Boll 1979], and a = 2 leads to power spectral subtraction [McAulay, Malpass 1980].
Spectral subtraction has some drawbacks, such as phase distortion when the Signal-to-Noise Ratio is close to zero. Multiplications in the frequency domain result in circular convolution in the time domain and lead to artifacts. The most annoying and disturbing artifact for listeners is musical noise, due to isolated tone bursts distributed randomly over frequencies. These drawbacks have motivated research resulting in many methods derived from spectral subtraction, see [Loizou 2007] and [Vary, Martin 2005].
The Wiener Filter
In this work, the Wiener filter is considered as the reference classical solution for noise reduction. The Wiener method is also based on a short-term frequency analysis of the noisy signal. Considering a windowed analysis frame p of about 20 ms, the noisy signal is transformed into the frequency domain through the STFT. The resulting signal Y(p, f_k) in the frequency domain is filtered so that an estimation of the clean signal S(p, f_k), denoted Ŝ(p, f_k), is given by:
Ŝ(p, f_k) = G_W(p, f_k) · Y(p, f_k)    (3.5)
G_W is the Wiener filter and is computed as [Loizou 2007]:

G_W(p, f_k) = SNR(p, f_k) / ( 1 + SNR(p, f_k) )    (3.6)
where SNR(p, f_k) stands for the ratio between the Power Spectral Density (PSD) of the clean speech signal s(n) and the PSD of the noise d(n).
The Ephraim and Malah Rule
In [Ephraim, Malah 1984], the authors proposed a modified Maximum Likelihood Envelope Estimation and added an estimator for the a priori signal to noise ratio. They also introduced an exponential smoothing in the time domain, which leads to a performance improvement in comparison to spectral subtraction. The Ephraim and Malah rule can be summarized by the formula:
G_EMSR(p, f_k) = (√π / 2) · ( 1 / (1 + SNR_post) ) · √( SNR_prio / (1 + SNR_prio) ) · M( (1 + SNR_post) · SNR_prio / (1 + SNR_prio) )    (3.7)
where M(θ) = exp(-θ/2) · [ (1 + θ) · I₀(θ/2) + θ · I₁(θ/2) ]; the terms I₀ and I₁ are the modified Bessel functions of zero and first order [Bowman 1958].
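A sketch of EQ. 3.7 in Python; it relies on scipy's exponentially scaled Bessel functions i0e and i1e, whose built-in exp(-x) factor conveniently absorbs the exponential in M(θ) and avoids overflow for large arguments. The vectorized form is an assumption, not the thesis code.

```python
import numpy as np
from scipy.special import i0e, i1e  # exponentially scaled Bessel functions

def em_gain(snr_prio, snr_post):
    """Ephraim-Malah short-time spectral amplitude gain (Eq. 3.7).
    snr_prio, snr_post are linear (not dB) SNR arrays."""
    v = (1.0 + snr_post) * snr_prio / (1.0 + snr_prio)
    # M(v) = exp(-v/2) * ((1+v) I0(v/2) + v I1(v/2)); i0e/i1e already
    # carry the exp(-v/2) factor.
    M = (1.0 + v) * i0e(v / 2.0) + v * i1e(v / 2.0)
    return (np.sqrt(np.pi) / 2.0) * np.sqrt(snr_prio / (1.0 + snr_prio)) \
           * M / (1.0 + snr_post)
```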
Techniques for Estimating the Signal to Noise Ratio
An efficient estimation of the signal to noise ratio SNR(p, f_k) is based on the "decision directed" approach as proposed in [Scalart, Filho 1996]:

ŜNR(p, f_k) = β · ŜNR(p-1, f_k)² / D̂(p, f_k) + (1 - β) · SNR_post(p, f_k)    (3.8)
with

SNR_post(p, f_k) = |Y(p, f_k)|² / D̂(p, f_k) - 1    (3.9)
In these relations, D̂(p, f_k) represents an estimation of the noise PSD and β corresponds to the mixing factor between present and past estimations of the SNR; usually, β = 0.98 provides good performance of the estimator. The estimator D̂(p, f_k) is computed by smoothing the noisy signal PSD during non-voice-activity periods:
D̂(p, f_k) = δ · |Y(p, f_k)|² + (1 - δ) · D̂(p - 1, f_k)    (3.10)

The adaptation is frozen during speech activity periods.
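The following per-frame sketch combines the Wiener gain of EQ. 3.6 with a decision-directed a priori SNR; for the recursive term it uses the gained magnitude of the previous frame, in the spirit of EQ. 3.11 below, rather than the literal form of EQ. 3.8, and the flooring constants are implementation assumptions.

```python
import numpy as np

def decision_directed_step(Y_mag, D_psd, G_prev, beta=0.98):
    """One frame of a decision-directed SNR estimate plus Wiener gain.
    Y_mag: noisy magnitude spectrum; D_psd: noise PSD estimate (Eq. 3.10);
    G_prev: gain of the previous frame."""
    floor = np.maximum(D_psd, 1e-12)
    snr_post = np.maximum(Y_mag ** 2 / floor - 1.0, 0.0)      # Eq. 3.9
    snr_hat = beta * (G_prev * Y_mag) ** 2 / floor \
              + (1.0 - beta) * snr_post                       # decision-directed
    G = snr_hat / (1.0 + snr_hat)                             # Wiener gain, Eq. 3.6
    return snr_hat, G
```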
In the Ephraim and Malah approach, the a priori signal to noise ratio SNR_prio is the most significant parameter. It is computed from a smoothed value of the a posteriori signal to noise ratio SNR_post as follows:
SNR_prio(p, f_k) = (1 - β) · SNR_post(p, f_k) + β · |G_EMSR(p-1, f_k) · Y(p, f_k)|² / D̂(p, f_k)²    (3.11)
and SNR_post is computed according to EQ. 3.9. This approach leads to small variations of the signal attenuation over successive frames, which significantly reduces musical noise artifacts [Cappe 1994].
3.2.3 Techniques to Estimate the Noise Power Spectral Density
One of the most important steps in noise reduction is the estimation of the noise power spectrum D̂(p, f_k). The noise PSD is unknown during the noise reduction process and must be estimated. The noise signal is, in principle, not stationary, and tracking changes of the noise PSD is a difficult task, especially when noise and speech are simultaneously present. Fortunately, it is possible to observe the noise during speech pauses, and speech pause detection can be used to track the noise PSD. This solution is known as the Voice Activity Detection method (see [Vahatalo, Johansson 1999] and [3GPP 1999b]). Another estimation technique, introduced in [Martin 1994] and enhanced in [Martin 2001], called the Minimum Statistics technique, makes it possible to estimate the noise parameters without discriminating between speech presence and absence.
3.2.3.1 Estimation of the Noise PSD based on the Voice Activity Detection
Principle of the VAD
The voice activity detection module computes a logical decision (VAD) in each frame or sub-frame. The decision VAD = 1 corresponds to speech presence, whereas the decision VAD = 0 corresponds to a speech pause, thus the presence of noise only. On the basis of this decision, the noise PSD D(p, f_k) can be estimated as follows:
if VAD = 0: D̂²(p, f_k) = cst · D̂²(p - 1, f_k) + (1 - cst) · |Y(p, f_k)|²
if VAD = 1: D̂²(p, f_k) = D̂²(p - 1, f_k)    (3.12)
Example of Voice Activity Detection: The AMR VAD
An interesting tool integrated in AMR [3GPP 1999b] is the VAD. The VAD is used to control the discontinuous transmission (DTX) module. This VAD analyzes several features of the input speech frame to discriminate speech presence periods. Two options of the VAD can be used. In the following we present only VAD option 1, initially proposed in [Vahatalo, Johansson 1999], because both lead to the same result. As depicted in Fig. 3.3, the block diagram of the VAD involves four main functions. The first block analyzes the input speech signal over 9 sub-bands; the output of this module is the signal energy over each sub-band (level). The second block is dedicated to pitch and tone detection: it detects vowel sounds and other periodic signals, making use of the pitch values computed during the open-loop pitch search in the encoder (pitch-flag), and it also detects tone information and signals with very strong periodic components (tone-flag). The third block is used to detect highly correlated signals such as music, based on the open-loop pitch search performed at the encoder; a complex-warning flag and a complex-timer are returned. The fourth block performs an estimation of the background noise. The output of this module is combined with an intermediate voice decision and passed through a hangover generator to produce the final voice decision flag. If speech is not detected, then VAD-flag = 0; otherwise VAD-flag = 1.
[Figure: block diagram with the input signal s(i) feeding (1) a filter bank computing sub-band levels, (2) pitch and tone detection (pitch-flag, tone-flag), (3) complex signal analysis (complex-warning, complex-timer) and (4) background noise estimation; a VAD decision block combines these outputs and a hangover addition stage produces the VAD-flag.]
Figure 3.3: Simplified Block Diagram of the AMR VAD Algorithm, Option 1.
Performance of the AMR VAD
We clearly see in Fig. 3.4 (a) that the VAD (option 1 of the AMR-NB) on clean speech differs from that obtained on the noisy speech, Fig. 3.4 (b). Estimation based on the VAD can be disturbed by the presence of noise, leading to a non-optimal estimation of the noise power.
[Figure: two panels, (a) clean speech waveform with its VAD decision and (b) noisy speech waveform with its VAD decision; amplitude vs. time in seconds, 0-12 s.]
Figure 3.4: Example of VAD Decision, Option 1.
3.2.3.2 The Minimum Statistics Technique
In contrast to the VAD method, the minimum statistics approach does not make any assumption on the presence of speech. It permits estimating and updating the noise power during speech-plus-noise periods as well as during noise-only periods. The minimum statistics approach assumes that the speech and noise signals are statistically independent and that the power of the corrupted speech signal converges most of the time to the power of the noise, owing to the fact that speech communication contains several pauses.
The noise power is, under this condition, viewed as a spectral floor. It is then possible, by tracking the minimum of the noisy speech signal power, to estimate the noise spectral power. The algorithm is based on the minimization of the short-time power estimate in each frequency bin for each frame, as detailed in [Martin 1994]:

P(p, f_k) = α · P(p - 1, f_k) + (1 - α) · |Y(p, f_k)|²    (3.13)
Typically, α ∈ [0.9, 0.98], and the minimum of P(p, f_k) is searched within a window of U frames:

P_min(p, f_k) = min( P(p, f_k), P(p - 1, f_k), ..., P(p - U + 1, f_k) )    (3.14)
At this point, the estimate of the noise PSD D̂(p, f_k) is this minimum of P multiplied by a factor ς, which compensates the bias of the minimum estimate and reduces musical noise:

D̂(p, f_k) = ς · P_min(p, f_k)    (3.15)
The technique introduced by the author in [Martin 2001] to enhance the initial Minimum Statistics method relies on the computation of an optimal time-varying smoothing parameter. The author also proposed a more accurate bias compensation. Together, these refinements increase the noise tracking speed.
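A minimal sketch of the basic minimum statistics tracker of EQs. 3.13-3.15 follows; the smoothing factor, window length and bias compensation values are illustrative assumptions, without the time-varying refinements of [Martin 2001].

```python
import numpy as np

class MinimumStatistics:
    """Sketch of Eqs. 3.13-3.15: smooth the noisy periodogram and track
    its minimum over a sliding window of U frames."""
    def __init__(self, n_bins, alpha=0.95, U=100, bias=1.5):
        self.P = np.zeros(n_bins)
        self.history = []                   # last U smoothed frames
        self.alpha, self.U, self.bias = alpha, U, bias

    def update(self, Y_mag):
        self.P = self.alpha * self.P + (1 - self.alpha) * Y_mag ** 2  # Eq. 3.13
        self.history.append(self.P.copy())
        if len(self.history) > self.U:
            self.history.pop(0)
        P_min = np.min(self.history, axis=0)   # Eq. 3.14
        return self.bias * P_min               # Eq. 3.15: noise PSD estimate
```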
3.3 Introduction to Noise Reduction in the Coded Domain
Noise reduction in classical approaches involves high complexity and an additional computational load when applied within the network. An alternative approach proposed in this work involves modifying the parameters provided by the CELP encoder. CELP coders are based on the Analysis-by-Synthesis principle described in Chap. 2. If we consider the CELP decoder synthesis filter H_m(z) on a sub-frame basis m in the Z-domain [Chandran, Marchok 2000], we can approximately write:
H_m(z) = g_f(m) / [ ( 1 - g_a(m) · z^{-T(m)} ) · ( 1 + Σ_{i=1}^{M} a_i(m) · z^{-i} ) ]    (3.16)
In EQ. 3.16, M represents the order of the linear prediction filter, ai is the ith LPC
coefficient, ga is the adaptive gain, gf is the fixed codebook gain, T is the pitch delay and
m is the sub-frame index.
The synthesis filter H_m(z) can be viewed as the cascade of two filters: the Long-Term Prediction (LTP) filter H_LTP(z), which simulates the vocal source, and the LPC filter H_LPC(z), which models the vocal tract. They are given by:

H_LTP(z) = 1 / ( 1 - g_a(m) · z^{-T(m)} ),   H_LPC(z) = 1 / ( 1 + Σ_{i=1}^{M} a_i(m) · z^{-i} )    (3.17)
The fixed codebook gain g_f appears as a multiplicative factor in the expression of H_m(z) in EQ. 3.16. As a consequence, g_f has a great influence on the decoded speech signal amplitude: a weighting of g_f modifies the signal amplitude. Noise reduction applied to the fixed codebook gain is motivated by this remark and has already been investigated in [Chandran, Marchok 2000] and [Taddei et al. 2004]. This approach will be discussed in Sec. 3.4.
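To fix ideas, the following sketch decodes one sub-frame through the cascade of EQ. 3.16 and EQ. 3.17. The excitation model is deliberately simplified (integer pitch delay, no fractional interpolation or post-processing, unlike a real AMR decoder), and all names are illustrative assumptions.

```python
import numpy as np
from scipy.signal import lfilter

def celp_synthesis(fixed_exc, g_f, g_a, T, lpc, past_exc):
    """Sketch of Eq. 3.16: excitation -> LTP filter -> LPC filter.
    fixed_exc: fixed-codebook vector c(n); lpc: [a_1, ..., a_M];
    past_exc: at least T previous total excitation samples."""
    u = np.empty(len(fixed_exc))
    buf = list(past_exc)                 # past excitation, most recent last
    for n in range(len(fixed_exc)):
        # total excitation: fixed codebook plus adaptive codebook u(n-T)
        u[n] = g_f * fixed_exc[n] + g_a * buf[-T]
        buf.append(u[n])
    # LPC synthesis filter 1 / (1 + sum a_i z^-i)
    return lfilter([1.0], np.concatenate(([1.0], lpc)), u)
```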
The LTP and LPC filters can be considered as filters representing the spectral shape of the speech signal. As a result, modification of these filters has an impact on the spectral characteristics of the synthesized speech signal. Considering noise reduction, the analyses in [Chandran, Marchok 2000] and [Duetsch 2003] show that estimating the LPC and LTP filters of the clean speech signal provides the spectral shape of the clean speech signal. Compared to the modification of the fixed codebook gain, enhancement of the LPC filter as developed in [Thepie et al. 2008] influences not only the amplitude of the signal but also its spectral characteristics. Modifying the LPC filter thus has a potential positive effect on reducing the distortion of the decoded speech signal. This second approach will be presented in Sec. 3.5.
3.3.1 Some Previous Works in the Codec Domain
The authors in [Chandran, Marchok 2000] and [Duetsch 2003] have demonstrated the effectiveness of speech enhancement through modification of the CELP codec parameters. An experimental investigation with replacement of parameters was performed, as depicted in Fig. 3.5: the bit-streams, or CELP parameters, of two different noisy speech signals were exchanged. The AMR-NB codec was used in 12.2 kbps mode [3GPP 1999b], and the decoder was modified to use some parameters from another bit-stream.
A speech signal was corrupted with two different background noise levels. The resulting noisy speech signals had segmented signal to noise ratios (SNR_seg) of 20 dB and 10 dB. The segmented signal to noise ratio was computed in this experiment only where VAD = 1. These experiments consist in replacing the parameters of the 20 dB noisy speech by those of the 10 dB one, as shown in the figure below.
[Figure: on the near-end side, speech corrupted with 10 dB and 20 dB noise is encoded into bit-streams b10 and b20; in the network area, the LPC coefficients, the adaptive-codebook vector/gain and the fixed-codebook vector/gain are exchanged between the bit-streams before decoding on the far-end side.]
Figure 3.5: Experimental Setup for the Exchange of Parameters.
Listening tests show that exchanging the CELP parameters between bit-streams has a great influence on noise perception. The background noise level is more or less represented by the fixed codebook gain g_f. Another remark was that exchanging the LPC coefficients introduces distortions in the decoded signal. The noise perception is more noticeable if the 20 dB bit-stream is replaced by the 10 dB bit-stream; conversely, if the 10 dB bit-stream is replaced by the 20 dB one, the perception of the noise is reduced. The noise perception is also carried by the LPC coefficients. This experiment indicates that, in a noisy environment, a good estimation of the clean speech signal CELP parameters will achieve noise reduction.
In [Chandran, Marchok 2000] modification of both fixed and adaptive codebook gains was
motivated by the trade-off between efficiency and low computational cost. This approach leads
to a new transfer function defined as follows:
H_m(z) = γ_f(m) · g_f(m) / [ ( 1 - γ_a(m) · g_a(m) · z^{-T(m)} ) · ( 1 + Σ_{i=1}^{M} a_i(m) · z^{-i} ) ]    (3.18)
Since the adaptive codebook vector generally has a higher signal to noise ratio than the fixed codebook vector, especially during voiced speech segments, γ_f(m) should be computed such that the decoded speech signal power is optimal. Special care should be taken in this approach to ensure a trade-off between high noise attenuation and preservation of the original speech signal power. The technique based on EQ. 3.18 clearly reduces the noise effect. The steps and computational details of this approach are explained in [Chandran, Marchok 2000].
Noise Reduction, Acoustic Echo Cancellation and Gain Control can all be considered as forms of automatic dynamic amplitude scaling. The authors in [Sukkar et al. 2006] have proposed techniques to dynamically scale the coded parameters of the speech signal directly. They simultaneously scale the fixed codebook and adaptive codebook gains according to a predefined speech target contour and the CELP encoding process. As performed in [Sukkar et al. 2006], the entire coded domain scaling process is described in Fig. 3.6 below. The bit-stream is partially decoded to extract the required parameters; once the gains (fixed and adaptive codebook gains) are modified, the scaled gains are quantized and mapped back inside the bit-stream.
[Figure: the mobile device encoder produces x(k); in the network area, a partial decoder extracts v(n), c(n) and the gains g_a, g_f, the codebook gain scaling block applies the target scale contour, and the scaled gains g'_a, g'_f are re-quantized and mapped into the modified bit-stream x'(k).]
Figure 3.6: Coded Domain Scaling.
The principle of this experiment is to modify the coded parameters so as to obtain a decoded speech signal with a defined characteristic. Let x(n) be a signal; the signal characteristic x_d defined according to a contour G_x(n) is given by:

x_d(n) = x(n) · G_x(n)    (3.19)
The contour characteristic G_x(n) is used to scale x(k), the coded parameters of the signal x(n). The signal x_sd(n) stands for the encoded and then decoded version of the signal x_d(n). The result of this experiment is that the decoded signal x'(n) obtained from the modified bit-stream x'(k) approximates the signal x_sd(n).
Experimental results indicate that scaling the coded parameters according to the target contour restores the speech signal contour G_x(n) without affecting other speech quality aspects. Another consideration is that when speech calls are connected through two different end-to-end network devices, the volume levels are often unbalanced, one side being higher or lower than the other. Level control processing is needed to achieve a sufficient signal level. Classical speech level control is performed by multiplying the speech signal in PCM format by a suitable gain. The idea of controlling the level in the parameter domain was proposed in [Pasanen 2006], where the quantized speech parameters are directly modified. This approach reduces the system complexity, and the end-to-end delay is reduced as well. Experimental results have shown that the performance of level control in the coded domain is similar to that obtained by level control in the time domain. The great advantages are that this new approach preserves quality and reduces complexity, as no transcoding is involved.
Based on the studies previously mentioned, we follow the same principle by proposing an estimation of the speech coded parameters in a noisy environment. Specifically, we propose two noise reduction algorithms. First, a noise reduction based on the weighting of the fixed codebook gain is explained. Then a noise reduction algorithm based on modification of the LPC coefficients is described.
3.4 Noise Reduction Based on Weighting the Fixed Gain
Noise reduction in the speech codec parameter domain is performed following the same steps as noise reduction in the frequency domain. In comparison with noise reduction in the frequency domain, there is no need to transform the codec parameters. As highlighted in Fig. 3.7, the involved steps are as follows:
[Figure: the noisy speech signal y(n) is encoded; in the network, the fixed codebook gain g_f,Y is extracted from the noisy bit-stream, the noise fixed codebook gain g_f,D is estimated, the noisy fixed codebook gain is modified into ĝ_f,S and a post filter controls the result before decoding; the LPC coefficients, the adaptive codebook vector/gain and the fixed codebook vector are kept unchanged.]
Figure 3.7: Fixed Codebook Gain Modification in Parameter Domain.
1. In contrast to the DFT, the parameter domain transform refers to the quantized representation of the speech signal as a set of parameters.
2. The second step is where the noise fixed codebook gain g_f,D is estimated. This is because it is assumed that the noisy speech signal fixed codebook gain g_f,Y results from the contribution of the clean speech signal gain g_f,S and the noise gain g_f,D: g_f,Y is thus a function of g_f,S and g_f,D.
3. The third step is the effective noise reduction. The noise reduction is performed here by applying the weighting or modification rule to the corrupted speech signal gain, such that ĝ_f,S = γ_c · g_f,Y.
4. One important and last step is the control of the amount of noise reduction through post filtering, as depicted in Fig. 3.7.
3.4.1 Estimation of the Noise Fixed Codebook Gain
A transposition of the minimum statistics approach described in Sec. 3.2.3.2 is used: the technique mimics the minimum statistics method, moving it from the frequency domain to the parameter domain.
We start this estimation by assuming that the fixed codebook gain of the noise signal g_f,D characterizes the noise amplitude. Similarly to the classical minimum statistics, we consider that g_f,D is the floor of the fixed codebook gain g_f,Y of the noisy speech signal y(n). Based on this assumption, finding the minimum of g_f,Y over sub-frames leads to an estimate of g_f,D over these sub-frames. Applying the minimum statistics to the fixed gain parameter involves the following steps.
A smoothing factor is applied to the noisy speech fixed codebook gain gf,Y so as to compute
a smoother version gY of gf,Y , according to:
g_Y²(m) = α(m) · g_Y²(m - 1) + (1 - α(m)) · g_f,Y²(m)    (3.20)
where α(m) is defined below. The minimum of gY2 is then searched within a window of τ = 100
sub-frames (cf. Chap. 2) and we have:
g²_Y,min(m) = min( g_Y²(m), g_Y²(m - 1), ..., g_Y²(m - τ) )    (3.21)
To compensate the bias introduced by this estimation procedure (estimation of the noise fixed gain by minimizing g_Y² over several sub-frames), the estimated noise fixed gain is obtained by weighting the minimum noisy speech signal fixed gain with an overestimation factor β_overest:

ĝ_f,D(m) = β_overest · g_Y,min(m)    (3.22)
To relate the estimated noise fixed codebook gain ĝ_f,D(m) to the SNR, the smoothing factor α(m) is taken equal to:

α(m) = max( α_min , α_max / ( 1 + ( g_Y(m-1) / ĝ_f,D(m-1) - 1 )² ) )    (3.23)

where α_min = 0.3 and α_max = 0.98.
Once the noise fixed gain is estimated, the next step consists in designing the filter to apply to the noisy speech fixed codebook gain. We start by simulating a communication scenario between two talkers through a network using the same codec, the AMR-NB 12.2 kbps mode. The noisy speech signal is first encoded with encoder A, resulting in a noisy bit-stream. The noisy bit-stream is then decoded through decoder A, which is modified so that all the parameters needed in our noise reduction system can be extracted. After the enhancement of the fixed codebook gain, the enhanced fixed codebook gain is introduced inside the noisy bit-stream. This enhanced bit-stream is finally decoded at decoder A. The noisy speech signal is the sum of car noise and a speech signal. For each noisy signal, the signal to noise ratio is computed only during speech activity (VAD = 1) as follows:
SNR_seg = (1/L) · Σ_{l=0}^{L-1} SNR(l)    (3.24)
where L is the total number of sub-frames and SNR(l) is given by the relation:

SNR(l) = 10 · log₁₀( Σ_{n=0}^{N-1} s(l·N + n)² / Σ_{n=0}^{N-1} d(l·N + n)² )    (3.25)
where N is the number of samples per sub-frame. In this example, the resulting signal to noise ratio is SNR_seg = 6 dB. The first step of the noise reduction consists in estimating the noise signal fixed codebook gain ĝ_f,D.
In Fig. 3.8, an example of the performance of the estimation of the noise fixed codebook gain is given. It can be observed that during speech periods (2-4 s and 6-8 s), the estimated noise fixed gain ĝ_f,D is still updated: the method does not stop estimating the noise fixed codebook gain. The highest updated values are observed during speech activity; these values correspond to the fluctuations appearing during speech sections. The estimation is fairly constant during non-speech periods. It can be seen that the estimated noise fixed gain ĝ_f,D in noise-only periods approximates the mean value of the noisy speech signal fixed codebook gains g_f,Y.
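A sketch of the gain-domain tracker of EQs. 3.20-3.23 follows; τ, α_min and α_max take the values given in the text, while the overestimation factor β_overest is an assumed value.

```python
import numpy as np

class FixedGainNoiseTracker:
    """Sketch of Eqs. 3.20-3.23: minimum statistics transposed to the
    fixed codebook gain, with an SNR-driven smoothing factor."""
    def __init__(self, tau=100, beta_over=1.5, a_min=0.3, a_max=0.98):
        self.g2 = 0.0        # smoothed squared noisy gain g_Y^2
        self.g_d = 1.0       # estimated noise fixed gain
        self.hist = []
        self.tau, self.beta_over = tau, beta_over
        self.a_min, self.a_max = a_min, a_max

    def update(self, g_fy):
        ratio = np.sqrt(self.g2) / max(self.g_d, 1e-12)
        alpha = max(self.a_min, self.a_max / (1.0 + (ratio - 1.0) ** 2))  # Eq. 3.23
        self.g2 = alpha * self.g2 + (1 - alpha) * g_fy ** 2               # Eq. 3.20
        self.hist.append(self.g2)
        if len(self.hist) > self.tau:
            self.hist.pop(0)
        self.g_d = self.beta_over * np.sqrt(min(self.hist))  # Eqs. 3.21-3.22
        return self.g_d
```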
[Figure: top panel, noisy speech waveform (amplitude vs. time in seconds); bottom panel, noisy speech fixed gain and estimated noise fixed gain (gain amplitude in dB vs. sub-frames).]
Figure 3.8: Example of Noise Fixed Codebook Gain Estimation.
3.4.2 Relation between Fixed Codebook Gains
From the observation of the curves of the fixed codebook gains in Fig. 3.9, it can be stated that:
– In speech periods, the fixed codebook gains (those of the noisy speech signal and the clean speech signal) have approximately the same amplitude.
– In noise-only periods, the clean speech codebook gains have very low amplitude, while the noisy fixed codebook gains correspond exactly to the noise signal fixed codebook gains.
– These observations tell us that g_f,Y is a function of g_f,S and g_f,D, leading to:

g_f,Y(m) = f( g_f,S(m), g_f,D(m) )    (3.26)
In addition, the CELP coder used to compute those parameters is not a linear process. The function f is constructed intuitively, based on several observations. Let us take into account the two hypotheses H1 and H2, which denote the presence and the absence of the clean speech signal respectively. They correspond to high and low SNR respectively. The noisy speech signal can be considered as clean speech when the SNR is high, and g_f,Y is then set to g_f,S. When the SNR is low, the noisy speech signal can be approximated by the noise signal, and g_f,Y is set to g_f,D. With these assumptions, setting f as a linear function is the simplest choice. This choice is not restrictive. If the clean speech and the noise have the same energy, this estimation will introduce some bias. To reduce this bias, we use an exponent factor δ(m) depending on the sub-frame index m:

g_f,Y(m)^δ(m) = g_f,S(m)^δ(m) + g_f,D(m)^δ(m)    (3.27)

[Figure: three panels showing (a) the clean speech waveform, (b) the fixed codebook gains of the noisy speech (red) and the clean speech (blue) vs. sub-frames, (c) the fixed codebook gains of the noisy speech (red) and the noise (blue) vs. sub-frames.]

Figure 3.9: (a) clean speech signal, (b) noisy speech fixed gain (red), clean fixed gain (blue), (c) noisy speech fixed gain (red), noise fixed gain (blue).
This choice has good behavior since, according to the observation of several fixed gain curves, the approximation g_f,Y(m)^δ(m) ≈ g_f,S(m)^δ(m) can be made for low SNR if δ(m) > 1. In fact, if the noise fixed codebook gain is constant, it can be considered that g_f,Y(m)^δ(m) >> g_f,S(m)^δ(m) for smaller values of g_f,S, in comparison with the values needed to reach g_f,Y(m) >> g_f,S(m).
3.4.3 Attenuation Function
Similarly to the filtering process in the frequency domain, the modification filter is inspired by the Wiener filter. The noise reduction process achieved in the parameter domain is the estimation of the fixed codebook gain of the clean speech by:

ĝ_f,S(m) = γ_c(m) · g_f,Y(m)    (3.28)
The weighting function γ_c(m), like the standard Wiener filter, is computed on the basis of the signal to noise ratio; the SNR in this case is the a priori SNR:

γ_c(m) = SNR_δ / ( 1 + SNR_δ )    (3.29)

where

SNR_δ = ĝ_f,S(m)^δ(m) / ĝ_f,D(m)^δ(m) = ( g_f,Y(m)^δ(m) - ĝ_f,D(m)^δ(m) ) / ĝ_f,D(m)^δ(m)    (3.30)
For efficient results, we follow the suggestion made in [Ephraim, Malah 1984] by introducing the a priori Signal-to-Noise Ratio. The factor δ(m) introduced in Sec. 3.4.2 (cf. [Cappe 1994]) must be linked to the SNR and thus to the weighting function γ_c(m). We propose the following choice:

δ(m) = δ₁ if γ_c(m) > 0.5,  δ₂ if γ_c(m) ≤ 0.5    (3.31)
with δ₁ = 2 and δ₂ = 0.75. The final estimation of SNR_δ is drawn from the Ephraim and Malah decision-directed approach [Ephraim, Malah 1984]:

SNR_δ(m) = (1 - β_fg) · g_f,Y(m)^δ(m) / ĝ_f,D(m)^δ(m) + β_fg · ( γ_c(m-1) · g_f,Y(m-1) )^δ(m) / ĝ_f,D(m)^δ(m)    (3.32)
where β_fg is taken in the interval [0, 1] and affects the SNR updating rate. By analogy with the frequency domain, the a priori SNR, SNR_prio, is calculated according to:

SNR_prio = g_f,Y(m)^δ(m) / ĝ_f,D(m)^δ(m)    (3.33)
The term

ĝ_f,S(m)^δ(m) = ( γ_c(m) · g_f,Y(m) )^δ(m)    (3.34)

in EQ. 3.32 corresponds to

|G_EMSR(p - 1, f_k) · Y(p, f_k)|²    (3.35)

in EQ. 3.11. The last term ĝ_f,D(m)^δ(m) is equivalent to D̂(p, f_k)².
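The recursion of EQs. 3.29-3.32 can be sketched per sub-frame as follows. Since δ(m) is defined through γ_c(m) (EQ. 3.31), the sketch resolves the circularity by using the δ selected at the previous sub-frame, and β_fg = 0.9 is an assumed value in [0, 1].

```python
def gamma_c(g_fy, g_fd_hat, gamma_prev, g_fy_prev,
            beta_fg=0.9, d1=2.0, d2=0.75, delta_prev=2.0):
    """Sketch of Eqs. 3.29-3.32: weighting function for the noisy
    fixed codebook gain of the current sub-frame."""
    d = delta_prev                       # delta(m) from previous gamma, Eq. 3.31
    denom = max(g_fd_hat ** d, 1e-12)
    # decision-directed SNR estimate, Eq. 3.32
    snr_delta = (1 - beta_fg) * g_fy ** d / denom \
                + beta_fg * (gamma_prev * g_fy_prev) ** d / denom
    gamma = snr_delta / (1.0 + snr_delta)          # Eq. 3.29
    delta_next = d1 if gamma > 0.5 else d2         # Eq. 3.31 for next sub-frame
    return gamma, delta_next
```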
A typical example of estimation of the clean speech fixed codebook gain is presented in Fig. 3.10. The system maintains the estimated clean speech fixed codebook gain as close as possible to the noisy one during speech periods, where the SNR is high. This result can be explained by the assumption introduced in Sec. 3.4.2. The reduction of the corrupted fixed codebook gain amplitude is effective during noise-only periods. The reduction is about 5 dB in this example, where the signal to noise ratio is SNR_seg = 6 dB.
[Figure: (a) clean speech waveform and (b) noisy speech waveform (amplitude vs. time in seconds); (c) gain amplitude in dB vs. sub-frames.]

Figure 3.10: (a) clean speech, (b) noisy speech, (c) noisy fixed gain (red), estimated clean fixed gain (blue).
3.4.4 Noise Reduction Control: Post Filtering
To keep the fixed codebook gain of the clean speech signal as close as possible to the noisy speech fixed gain, especially in periods of high SNR, the attenuation function γ_c(m) has to be controlled. The control also helps to avoid abrupt variations of γ_c(m), and thus to reduce artifacts. We compute the energies in the parameter domain by evaluating the overall energy of both the noisy speech signal and the estimated clean speech signal; these represent the energy before and after the noise reduction process.
Let us consider E_u(m) and E'_u(m), the energies before and after the process respectively:

E_u(m) = Σ_{i=1}^{Q} ( g_a,Y(m) · v_i(m) + g_f,Y(m) · c_i(m) )²    (3.36)

E'_u(m) = Σ_{i=1}^{Q} ( g_a,Y(m) · v_i(m) + γ_c(m) · g_f,Y(m) · c_i(m) )²    (3.37)
where v_i(m) and c_i(m) stand for the adaptive codebook excitation and the fixed codebook excitation respectively, g_a,Y is the adaptive codebook gain and Q is the number of samples per sub-frame.
The control rule is as follows: depending on the values of E_u(m) and E'_u(m), the attenuation function is compensated so that the noisy speech signal fixed gain stays unchanged during high SNR, as illustrated by Fig. 3.10. According to the foregoing and EQ. 3.27, we propose to compute the smoothing function as follows:

γ_c(m) = γ_c(m) if 10 · log₁₀( E_u(m) / E'_u(m) ) ≥ Th_dB,  1 if 10 · log₁₀( E_u(m) / E'_u(m) ) < Th_dB    (3.38)
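A sketch of the control rule of EQs. 3.36-3.38; the threshold value Th_dB is an assumption.

```python
import numpy as np

def post_filter_gamma(gamma, g_a, v, g_f, c, th_db=3.0):
    """Sketch of Eqs. 3.36-3.38: compare sub-frame excitation energies
    before and after attenuation, and bypass the attenuation (gamma = 1)
    when the energy drop stays below the assumed threshold th_db."""
    e_before = np.sum((g_a * v + g_f * c) ** 2)          # Eq. 3.36
    e_after = np.sum((g_a * v + gamma * g_f * c) ** 2)   # Eq. 3.37
    drop_db = 10.0 * np.log10(e_before / max(e_after, 1e-12))
    return gamma if drop_db >= th_db else 1.0            # Eq. 3.38
```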
A CCR (Comparison Category Rating) test as specified in [ITU-T 1996] was performed by comparing the proposed algorithm against the standard Wiener filter approach. The simulations in [Taddei et al. 2004] and [DeMeuleneire 2003] show that modification of the fixed codebook gain based on the principle proposed in this work provides good results for medium SNR.
In Fig. 3.11 the curves of the noisy, clean and estimated fixed codebook gains are jointly compared. In comparison with the original clean fixed codebook gain during noise-only periods (see Fig. 3.10 (b)), the noisy speech fixed codebook gain is not completely eliminated during noise reduction. As a consequence, a slight amount of background noise remains in the enhanced speech signal. In noise-only periods (sub-frames 1 to 400, for example), the estimated fixed codebook gain can be regarded as an attenuated version of the noisy speech fixed codebook gain. In speech periods (sub-frames 400 to 800, for example), the attenuation is performed by taking into account the clean speech fixed codebook gain shape. This result follows from the formulation introduced in EQ. 3.27 and the assumptions on which this equation is based.
[Figure: noisy speech fixed gain, clean speech fixed gain and estimated clean speech fixed gain; amplitude vs. sub-frames.]
Figure 3.11: Estimated Fixed codebook Gain.
3.5 Noise Reduction through Modification of the LPC Coefficients
In CELP coders, the technique used to estimate the LPC coefficients is based on the
assumption that the speech signal follows an autoregressive model. This assumption leads to
the Yule-Walker equation that can be solved by using the Levinson-Durbin algorithm, [Haykin
2002a]. If the clean speech signal is encoded, the Yule-Walker equations are given by the
relation:
R_S = -Γ_S · A_S    (3.39)
where R_S is the vector of the clean autocorrelation coefficients, given by R_S = [r_S(1), ..., r_S(M)], with r_S(j) = Σ_{n=1}^{N-1} s_w(n) · s_w(n - j) and j = 0, ..., M. The number M represents the order of the linear prediction analysis, Γ_S is the Toeplitz autocorrelation matrix and A_S is the vector of LPC coefficients.
In the same way, if noise is encoded, then:

R_D = -Γ_D · A_D    (3.40)
where R_D is the vector of the noise autocorrelation coefficients, Γ_D is the noise autocorrelation matrix and A_D is the vector of noise LPC coefficients. If both the speech signal and noise are present, the Yule-Walker equation is:

R_Y = -Γ_Y · A_Y    (3.41)
In EQ. 3.41, RY , ΓY and AY stand for the vector of the noisy autocorrelation coefficients,
the noisy autocorrelation matrix and the noisy LPC vector respectively. The aim of this section
is to exhibit and analyze relations between the noisy speech LPC vector, the clean speech LPC
vector and the noise LPC vector.
In [Thepie et al. 2008], we have proposed a noise reduction system based on the modification of the LPC coefficients. This method relies on the VAD option 1 decision implemented in AMR-NB; the VAD decision indicates how to process the extracted noisy LPC coefficient vector A_Y.
A general overview of the approach is depicted in Fig. 3.12. The noisy speech signal is first coded and transmitted over the network. Inside the network, the LPC coefficient vector A_Y of the noisy speech is extracted, whereas the remaining parameters are kept unchanged. If speech is present, that is if VAD = 1, a modification function F is applied to A_Y to obtain an estimation of the clean speech LPC coefficients Â_S. If VAD = 0, meaning that only noise is present, A_Y is damped through G. The estimated clean speech LPC coefficient vector Â_S is then mapped into the bit-stream. The decoding of the modified bit-stream provides an enhanced speech signal. The design of the functions F and G is described in the next sections.
[Figure: at the mobile device, the noisy speech y(n) is encoded; in the network area, the noisy LPC coefficients A_Y are decoded from the noisy bit-stream while the other parameters (pitch delay index, adaptive gain index, fixed codebook index) are kept unchanged; depending on the VAD decision, A_Y is processed by F (VAD = 1) or G (VAD = 0), and the resulting Â_S is mapped into the modified bit-stream.]
Figure 3.12: Principle of NR based on LPC Coefficients.
3.5.1 Estimation during Voice Activity Periods
In this section, we exploit how the additive perturbation influences the autoregressive
model to design the function F. We assume that the noisy signal is the sum of the speech
signal and noise. Moreover we assume that the speech signal and the noise are not correlated,
so that the autocorrelation matrix and the vector of the noisy autocorrelation coefficients can
be decomposed according to:
R_Y = R_S + R_D ⇒ Γ_Y = Γ_S + Γ_D    (3.42)

Introducing EQ. 3.42 into EQ. 3.41, we obtain the relation:

R_S + R_D = -(Γ_S + Γ_D) · A_Y    (3.43)
By taking into account the equations obtained during the linear prediction analysis of the clean speech signal and the noise (EQ. 3.39 and EQ. 3.40), EQ. 3.43 can be formulated as follows:

-Γ_S · A_S - Γ_D · A_D = -(Γ_S + Γ_D) · A_Y    (3.44)

Finally, reorganizing EQ. 3.44, the LPC coefficient vector of the clean speech signal can be computed via the formula:

A_S = A_Y + (Γ_S)⁻¹ · Γ_D · (A_Y - A_D)    (3.45)
EQ. 3.45 can be interpreted as a filtering process of A_Y to obtain A_S through the function F, as shown in Fig. 3.12. In this formula, the term (Γ_S)⁻¹, which represents the inverse of the clean speech autocorrelation matrix, appears as a particularly critical term. First, the clean speech autocorrelation matrix Γ_S is unknown. Second, the estimate of Γ_S must be inverted when realizing EQ. 3.45. In fact, the filtering process given by EQ. 3.45 requires the estimation of the autocorrelation matrices Γ_S and Γ_D, as well as the estimation of the noise LPC coefficients A_D.
The flowchart used to estimate the clean speech LPC coefficients is described in Fig. 3.13. To realize such an estimation, we simulate the network area, where the PCM samples of the speech signal are not available. The noisy speech is first encoded using an AMR-NB encoder in 12.2 kbps mode. We should note that in 12.2 kbps mode two sets of LPC coefficients are normally computed by the encoder, and only one set in the other modes. For a given frame p of four sub-frames m = 1, ..., 4, the LPC coefficients are computed in 12.2 kbps mode for sub-frames m = 2 and m = 4. For the other modes, the LPC coefficients are computed only for sub-frame m = 4. During the encoding process, we extract from the encoder the VAD decision, the noisy autocorrelation coefficients R_Y (used to build the autocorrelation matrix Γ_Y) and the noisy LPC coefficients A_Y.
[Figure: flowchart in which the AMR encoder applied to the noisy speech y(n) provides Γ_Y, A_Y and the VAD decision; these feed the noise autocorrelation matrix estimation (Γ̂_D), the noise LPC estimation (Â_D) and the clean speech autocorrelation matrix estimation (Γ̂_S), which are combined in the clean speech LPC estimation followed by post filtering to produce Â_S.]
Figure 3.13: Estimation Flowchart of the Clean Speech LPC Coefficients.
These extracted parameters are then used to estimate, step by step, the parameters needed to compute the estimated clean LPC coefficient vector via EQ. 3.45.
First, the VAD decision and Γ_Y are used to compute an estimation of the noise signal autocorrelation matrix Γ̂_D. Then the VAD decision and A_Y are used to estimate the noise signal LPC coefficients Â_D. The estimated noise autocorrelation matrix Γ̂_D and the noisy signal autocorrelation matrix Γ_Y are used to compute the clean speech autocorrelation matrix Γ̂_S. Afterward, the process estimates the clean speech LPC coefficients Â_S according to EQ. 3.45, using A_Y, Â_D, Γ̂_S and Γ̂_D instead of the actual but unknown values.
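A sketch of the estimation step of EQ. 3.45 follows; it solves the Toeplitz system instead of forming the explicit inverse of Γ̂_S, with the sign of the (A_Y - A_D) term following the derivation above. All argument names are illustrative.

```python
import numpy as np
from scipy.linalg import solve_toeplitz

def estimate_clean_lpc(A_Y, A_D, R_S_col, Gamma_D_hat):
    """Sketch of Eq. 3.45: A_S = A_Y + inv(Gamma_S) @ Gamma_D @ (A_Y - A_D).
    R_S_col: first column of the Toeplitz matrix estimate of Gamma_S."""
    rhs = Gamma_D_hat @ (A_Y - A_D)
    # Solve Gamma_S x = rhs; solve_toeplitz exploits the Toeplitz
    # structure instead of computing an explicit matrix inverse.
    correction = solve_toeplitz(R_S_col, rhs)
    return A_Y + correction
```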
3.5.1.1 Estimation of the noise LPC vector: Â_D
Estimation of Â_D is based on averaging the noisy speech LPC vector during periods of noise alone. During speech periods, the smoothing is frozen:

if VAD = 0: Â_D(m) = α_lpc · Â_D(m - 1) + (1 - α_lpc) · A_Y(m)
if VAD = 1: Â_D(m) = Â_D(m - 1)    (3.46)

where m is the sub-frame number and α_lpc ≈ 0.8 is chosen experimentally. The estimation during speech activity is set to the latest estimation where the VAD was zero. This estimation has minimal effect on the estimation specified by EQ. 3.45 if all the remaining parameters are known.
3.5.1.2 Estimation of the Noise Autocorrelation Matrix: Γ̂_D
The noise signal characteristics can be used to estimate the noise autocorrelation matrix, which plays a central role in EQ. 3.45. If the noise is white, the non-diagonal components of the autocorrelation matrix are zero:

Γ_D = diag( r_D(0), r_D(0), ..., r_D(0) )    (3.47)

that is, an M × M diagonal matrix with r_D(0) on the diagonal.
Accordingly, we just need to estimate the signal energy, which alleviates the estimation problem. However, experimental results based on this technique achieve poor estimations with any noise different from white noise, and white noise is rarely met in real applications. A different method, exposed below, proposes an estimation of the noise autocorrelation matrix when the noise is not necessarily white.
Estimation of the noise autocorrelation matrix starts with the estimation of the noise
autocorrelation coefficients. We propose here to estimate the noise autocorrelation coefficients
by the Inverse Recursive Levinson-Durbin algorithm, [Haykin 2002a]. The implementation of
the procedures of this algorithm is similar to the Recursive Levinson-Durbin algorithm. It needs
knowledge of the final prediction error and of the LPC coefficients to compute the associated
autocorrelation coefficients. The Inverse Recursive Levinson-Durbin algorithm performs the
inverse operation of the Recursive Levinson-Durbin algorithm carried out by the encoder to
compute the LPC coefficients.
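A minimal sketch of this step-down (inverse) recursion follows, under the sign conventions used here (prediction error filter A(z) = 1 + Σ a_i z^{-i}, as in EQ. 3.39); it assumes all intermediate reflection coefficients have magnitude below one.

```python
import numpy as np

def inverse_levinson_durbin(a, final_err):
    """Sketch: recover the autocorrelation sequence r(0..M) from the LPC
    coefficients a = [a_1, ..., a_M] and the final prediction error power."""
    M = len(a)
    A = [None] * (M + 1)
    A[M] = np.concatenate(([1.0], a))
    k = np.zeros(M + 1)
    E = np.zeros(M + 1)
    E[M] = final_err
    # Step down: recover reflection coefficients and lower-order predictors.
    for m in range(M, 0, -1):
        k[m] = A[m][m]
        denom = 1.0 - k[m] ** 2
        A[m - 1] = (A[m][:m] - k[m] * A[m][m::-1][:m]) / denom
        E[m - 1] = E[m] / denom
    # Step the autocorrelation back up:
    # r(m) = -k_m E_{m-1} - sum_{i=1}^{m-1} a_i^{(m-1)} r(m-i).
    r = np.zeros(M + 1)
    r[0] = E[0]
    for m in range(1, M + 1):
        r[m] = -k[m] * E[m - 1] - np.dot(A[m - 1][1:m], r[m - 1:0:-1])
    return r
```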
As we are working without the PCM speech samples, estimation of the final prediction error is performed based on the coded parameters. During the computation of the noisy LPC coefficients, the Recursive Levinson-Durbin algorithm uses an autocorrelation sequence R_Y to compute the LPC coefficients A_Y, the reflection coefficients K_Y and the final prediction error power err_y (cf. Appendix B). The final prediction error err_d, if only noise were encoded, can be estimated from the noisy final prediction error err_y, which is computed at the decoder side as the energy of the total noisy excitation signal. With respect to the voice activity decision, the final prediction error err_d is computed as follows:

if VAD = 0: êrr_d(m) = µ_lpc · êrr_d(m - 1) + (1 - µ_lpc) · êrr_y(m)
if VAD = 1: êrr_d(m) = êrr_d(m - 1)    (3.48)
The smoothing parameter µlpc is set to 0.8.
The principle of the Inverse Recursive Levinson-Durbin algorithm is to compute an autocorrelation sequence knowing the associated LPC coefficients and the final prediction error power. The noise autocorrelation sequence can thus be estimated by applying the Inverse Recursive Levinson-Durbin algorithm to (Â_D(m), êrr_d(m)). The estimated noise autocorrelation matrix Γ̂_D(m) is directly built since it is a Toeplitz matrix whose first column is R̂_D(m) = [r̂_D(0), ..., r̂_D(M - 1)]:

Γ̂_D(m) = [ r̂_D(0)      r̂_D(1)    ···  r̂_D(M-1)
            r̂_D(1)      r̂_D(0)    ···  r̂_D(M-2)
              ···          ···      ···    ···
            r̂_D(M-1)    r̂_D(M-2)  ···  r̂_D(0) ]    (3.49)

3.5.1.3 Estimation of the Speech Autocorrelation Matrix: Γ̂_S
This estimation is the central point of the process: a matrix inversion is required by EQ. 3.45, and if the estimation is not good enough, it will be impossible to carry EQ. 3.45 out. The authors in [Un, Choi 1981] have proposed estimating the autocorrelation matrix Γ̂_S in the frequency domain, with a threshold introduced for voice detection. After Γ̂_S is computed, they use the Recursive Levinson-Durbin algorithm to compute the clean LPC coefficients.
Estimation of the clean speech autocorrelation matrix is achieved in two steps. The first step involves the computation of Γ̂_S(m) using R̂_S(m), whereas the second step enhances the estimation.
Step 1:
The noisy speech signal autocorrelation coefficients R_Y(m) are obtained by applying the Inverse Recursive Levinson-Durbin algorithm to (A_Y(m), err_y(m)); this is possible because the noisy speech parameters are all available inside the network. An estimation of the clean speech autocorrelation coefficients is performed by exploiting the non-correlation between noise and speech (see EQ. 3.42), leading to:

R̂_S(m) = R_Y(m) - R̂_D(m)    (3.50)
The estimated clean speech autocorrelation matrix Γ̂_S is obtained according to its Toeplitz structure using R̂_S(m):

Γ̂_S(m) = [ r̂_S(0)      r̂_S(1)    ···  r̂_S(M-1)
            r̂_S(1)      r̂_S(0)    ···  r̂_S(M-2)
              ···          ···      ···    ···
            r̂_S(M-1)    r̂_S(M-2)  ···  r̂_S(0) ]    (3.51)
The estimated autocorrelation matrix is a Toeplitz matrix and is positive definite, thus its inverse exists. As the matrix is symmetric with a positive first component, it can be inverted using techniques such as the Cholesky decomposition [Haykin 2002a].
Step 2:
In general, the amplitude of the first autocorrelation coefficient is large; this value represents the energy of the signal over the window used for the signal analysis. The lags are defined as the index locations of the autocorrelation vector: lag(R̂_S(m)) = [1, 2, ..., M]. When dealing with a noisy signal (speech plus noise), the coefficients at lags close to zero are generally more corrupted by noise than the coefficients at more distant lags [Haykin 2002a]. For example, in a sequence of autocorrelation coefficients R_Y(m) = [r_Y(1), ..., r_Y(M)] at sub-frame m, the coefficient r_Y(1) is much more corrupted by noise than r_Y(M).
In the presence of noise, the voice detection is sometimes wrong. This impacts the estimation of the noise parameters, as the parameters are not properly updated. A sequence of estimated autocorrelation coefficients that differs too much from the original one can cause ill-conditioning of the resulting autocorrelation matrix. Ill-conditioning of the estimated matrix is particularly harmful for the computation of the LPC coefficients [Kabal 2003]: if the estimated matrix is ill-conditioned, its inversion will diverge. To overcome problems due to transition periods and wrong VAD decisions, we introduce a threshold Th_VAD. If the noisy speech signal energy is higher than this threshold, the estimation is based on EQ. 3.45; if it is lower, a damping procedure is used to estimate the clean speech LPC coefficients. Therefore, the formulation in EQ. 3.45 is used to estimate the clean speech LPC coefficients only if:

if VAD = 1 and r_Y(1) > Th_VAD: A_S is estimated via EQ. 3.45
else: the damping procedure is applied    (3.52)
Filter Stability Problems:
The estimate Â_S obtained from EQ. 3.45 must lead to a stable filter. The condition is satisfied if and only if the poles of the associated synthesis polynomial are located inside the unit circle [El-Jaroudi, Makhoul 1991]. If the autocorrelation matrix is ill-conditioned, some poles can be located outside the unit circle [Kondoz 1994]. To solve this problem, a bandwidth expansion procedure is used.
A bandwidth expansion is used in AMR-NB [3GPP 1999b] to reduce ill-conditioning of the autocorrelation matrix. Ill-conditioning is generally a source of stability problems [Kabal 2003]. The bandwidth expansion involves multiplying the autocorrelation coefficients by a lag-window vector given by:

W_lag(i) = exp( -(1/2) · ( 2π f₀ i / f_s )² ),  i = 0, ..., M,  f₀ = 60 Hz, f_s = 8000 Hz    (3.53)
[Figure: multiplication factor vs. lag number for lag windows with f₀ = 60, 70, 80, 90 and 100 Hz.]

Figure 3.14: Lag Windowing Values.
The effect of the lag-windowing vector is to weight the autocorrelation coefficients according to their lag: the higher the lag, the larger the attenuation. We have tested several values of f₀ (60, ..., 100 Hz), see Fig. 3.14. Using f₀ = 100 Hz, we experimentally notice that stability problems do not occur. As seen in Fig. 3.14, the lag window with f₀ = 100 Hz significantly reduces the level of the autocorrelation coefficients at high lags. The result is that coefficients with high noise correlation contribute less to the computation of the autocorrelation matrix. The level of the coefficients at higher lags decreases when f₀ increases.
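A one-line sketch of the lag window of EQ. 3.53 applied to an autocorrelation sequence; f₀ = 100 Hz is the value retained experimentally above.

```python
import numpy as np

def lag_window(r, f0=100.0, fs=8000.0):
    """Sketch of Eq. 3.53: Gaussian lag window applied to the
    autocorrelation sequence r = [r(0), ..., r(M)] to reduce
    ill-conditioning of the Toeplitz autocorrelation matrix."""
    i = np.arange(len(r))
    w = np.exp(-0.5 * (2.0 * np.pi * f0 * i / fs) ** 2)
    return r * w
```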
3.5.2 Estimation during Noise Only Periods
Estimation of the clean speech LPC coefficients during noise-only periods uses a different technique, because EQ. 3.45 cannot be computed when there is no speech signal: during such periods Γ_S(m) = 0 and, accordingly, its inverse cannot be computed. The idea behind the modification of the LPC coefficients is to attenuate the noisy spectrum amplitude. This attenuation is efficient if it follows the noise amplitude variation. To do so, we compute in each sub-frame a damping factor λ(m) > 1 [Kabal 2003] as a linear function, with coefficients ν and η, of the estimated noise energy Ê_D(m):

λ(m) = ν · Ê_D(m) + η    (3.54)
In the Z-domain, when the damping factor is applied to the polynomial A_Y(z) associated with the noisy LPC coefficients, the estimated clean speech LPC polynomial in sub-frame m is given by:

Â_S(z) = A_Y(λ(m) · z)    (3.55)
In fact, EQ. 3.55 leads to modifying the k-th noisy LPC coefficient as follows:

â_S(k) = ( 1 / λ(m)^k ) · a_Y(k)    (3.56)
To avoid too strong or too weak attenuation, we introduce two thresholds T_min = 27 dB and T_max = 60 dB to control the attenuation level. If the noise energy Ê_D(m) is lower than the threshold T_min, the attenuation factor is set to the lower value λ(m) = λ_min = 2. Conversely, when the noise energy Ê_D(m) is above T_max, λ(m) is limited to λ_max = 10. It follows that the coefficients of the linear function introduced in EQ. 3.54 are computed as follows:

ν = ( λ_min - λ_max ) / ( T_min - T_max )  and  η = λ_min - ν · T_min    (3.57)
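A sketch of the damping procedure of EQs. 3.54-3.57, with the constants quoted in the text; the clipping to [λ_min, λ_max] implements the two thresholds.

```python
import numpy as np

# Constants from the text (Sec. 3.5.2).
LAMBDA_MIN, LAMBDA_MAX = 2.0, 10.0
T_MIN, T_MAX = 27.0, 60.0  # dB

NU = (LAMBDA_MIN - LAMBDA_MAX) / (T_MIN - T_MAX)   # Eq. 3.57
ETA = LAMBDA_MIN - NU * T_MIN

def damp_lpc(a_y, noise_energy_db):
    """Sketch of Eqs. 3.54-3.56: damp the noisy LPC coefficients
    a_y = [a_1, ..., a_M] during noise-only periods."""
    lam = np.clip(NU * noise_energy_db + ETA, LAMBDA_MIN, LAMBDA_MAX)  # Eq. 3.54
    k = np.arange(1, len(a_y) + 1)
    return a_y / lam ** k                                              # Eq. 3.56
```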
Fig. 3.15 shows the evolution of the damping factor in noise-only periods. We clearly see that the attenuation factor is constant at λ_max if the noise energy is larger than T_max; if the noise energy is below T_min, then λ(m) = λ_min. Such an approach has an interesting behavior, as the attenuation follows the noise energy characteristic. The estimation of the noise energy in the coded parameter domain is performed based on the idea developed in [Doh-Suk et al. 2008].
[Figure: damping factor λ(m) vs. estimated noise energy Ê_D(m): constant at λ_min below T_min, linear in between, constant at λ_max above T_max.]

Figure 3.15: Damping Factor Characteristics.

A typical example of a spectrum where the damping procedure has been applied to the noisy LPC coefficients is presented in Fig. 3.16. The original clean speech LPC spectrum in silence periods is particularly flat, whereas the noisy speech LPC spectrum has a significant high peak; the two spectra have no similarity. The high peak is erased in the estimated LPC spectrum. We can also see that the estimated spectrum better approximates the original spectrum than the noisy one does.

[Figure: LPC spectra (amplitude in dB vs. frequency bins) of the clean, estimated and noisy signals.]

Figure 3.16: Typical Example of Spectrum Damping.
3.5.3 Some Experimental Results
We studied several speech files in the presence of car noise. A classical noise reduction in a simulated network environment was implemented: the noisy speech signal is first encoded into an AMR-NB 12.2 kbps mode bit-stream. To simulate the VQE, the bit-stream is decoded into PCM format, processed by the standard Wiener filter in the frequency domain and coded back into an AMR-NB 12.2 kbps mode bit-stream. Finally, the signal is decoded into PCM, as it would be at the far-end terminal. The Wiener filter implemented in this simulation is similar to that of Sec. 3.2.2 (also described in [Beaugeant 1999]).
The properties of our proposed method are analyzed by comparing the spectra of the synthesis filters associated with the LPC coefficients. The synthesis filters, or transfer functions in the frequency domain, obtained from the LPC vectors Â_S, Â_Wiener, A_Y and A_S are denoted H_Ŝ(f), H_Wiener(f), H_Y(f) and H_S(f) respectively. We compare the spectra of the transfer functions during speech and non-speech presence. The mean spectral error between the spectra of the transfer functions is also computed for each sub-frame:
Error = ( 1 / N_FFT ) · Σ_f |H_S(f) - H_U(f)|    (3.58)

where N_FFT is the length of the FFT analysis and H_U(f) stands for any of H_Ŝ(f), H_Y(f) or H_Wiener(f).
Different spectra of a sub-frame in a speech period are presented in Fig. 3.17. In Fig. 3.17 (a), one can notice that the proposed method preserves and enhances the speech formants; the LPC modification implemented in this work allows reconstruction of the formants. The Wiener filter method, as presented in Fig. 3.17 (b), tends to amplify the formants at low frequencies, especially in the range [0-1000] Hz, and also increases the energy at these frequencies. In the frequency range [1000-3000] Hz, the Wiener filter method also tends to move the formants towards lower frequencies.
The spectral effect of our algorithm can be evaluated by the spectral error in this example. The spectral error obtained with the proposed method is about 0.7 dB, the spectral error without processing is 1.7 dB, and the spectral error achieved by the Wiener method is 2.5 dB. Therefore, the noise reduction method we propose achieves a gain of about 1 dB compared to the noisy spectrum. In most of the worst cases, the estimated spectrum remains close to the noisy one. The concrete achievements and improvements of this noise reduction algorithm will be discussed further in Chap. 5. The noise reduction system combining fixed gain weighting and LPC coefficient modification will be integrated inside the Smart Transcoding process.

[Figure: two panels of LPC spectra (amplitude in dB vs. frequency bins): (a) estimated and noisy spectra; (b) estimated, noisy, clean speech and Wiener spectra.]

Figure 3.17: Typical Estimation Spectrum (SNR_seg = 12 dB): (a) our proposed method is displayed with the noisy spectrum, (b) our proposed method is compared to the noisy, the clean and the Wiener method spectra.
3.6 Conclusion
This chapter has shown that techniques used to perform noise reduction in the frequency
domain can be successfully transposed into the coded parameters domain. We have presented
two algorithms dedicated to noise reduction in the parameter domain:
1. A noise reduction via modification of the fixed codebook gain. This noise reduction
system is a direct transposition of spectral attenuation or amplitude attenuation generally
performed in the frequency domain.
2. A noise reduction based on modification of the LPC coefficients. This second approach enhances the spectral characteristics of the corrupted speech signal. The noise reduction based on LPC coefficients is influenced by the properties of the associated filter, and we have proposed some techniques to obtain a stable estimated LPC filter.
We should mention that the CELP encoding process is computationally expensive. The two techniques proposed in this chapter avoid the complete decoding of the noisy speech signal into PCM format. These methods reduce the noise contribution and increase the SNR, while maintaining the original dynamics of the signal. One advantage is that these algorithms can be applied to any kind of codec using the CELP technique. In the following, the noise reduction system will be based on the combination of fixed codebook gain filtering and LPC coefficient modification.
Chapter 4
Acoustic Echo Cancellation
4.1 Introduction
Current wireless communication networks enable high mobility but are, in general, still affected by external impairments. Impairments such as acoustic echo, due to coupling between the loudspeaker and the microphone of the mobile device, can drastically impact the communication quality. In modern digital communication networks, a new challenge appears today with the integration of Voice Quality Enhancement units such as AEC inside the network. For providers and industry, a centralized AEC located at a specific point inside the network is very attractive.
Several reasons motivate this new approach. Integration of the AEC in the network avoids the implementation of AEC inside Mobile Devices, which is important since power supply and computational load are critical issues for Mobile Devices. This solution also reduces the cost of the communication system.
Speech transmission over current wireless networks is in general achieved using speech codecs based on CELP techniques. CELP codecs exhibit non-linear characteristics; in cascade with the acoustic echo path, the entire echo path in such applications presents strong non-linearities. The performance of standard AEC algorithms based on adaptive filters to estimate the acoustic echo path is drastically degraded by these non-linearities [Huang, Goubran 2000].
Based on these observations, we deal in this chapter with new AEC algorithms where the processing is performed through modification of the CELP coded parameters. The developed algorithms can easily be implemented as a centralized AEC for GSM networks, and may appear as a solution to the problems raised by the standard AEC approach.
The communication system studied in this PhD models an end-to-end conversation between two users (far-end speaker and near-end speaker) over a GSM network. This model exhibits a situation where acoustic echo problems are encountered; in particular, acoustic echo appears during communication in hands-free mode.
4.2 Acoustic Echo Scenario
In a mobile communication scenario, each speaker's voice at the mobile device is recorded by the microphone and digitized for source and channel coding. The coded signal at the near-end speaker's Mobile Device is transmitted over a channel to the counterpart of the conversation, where it is decoded and played out through the loudspeaker.
[Figure: at the near-end speaker side, the far-end signal x(k) is decoded to x(n), converted to the analog signal x(t) and played through the loudspeaker; x(t) propagates through the echo path H, the microphone picks up y(t) = z(t) + s(t) with z(t) = x(t) ∗ H(t), and y(t) is digitized and encoded into y(k) on the microphone path.]
Figure 4.1: Acoustic Echo Scenario.
As described in Fig. 4.1, the symmetry of the system allows us to focus on the near-end speaker side. The communication path from the far-end to the near-end speaker is the loudspeaker path, while the reverse link is the microphone path.
The sound wave x(t) is the input of the loudspeaker at the near-end speaker side and spreads along the echo path (Loudspeaker-Room-Microphone). The propagation of the sound within the near-end environment/room leads to a coupling between the loudspeaker and the microphone. Through the different propagation paths, the acoustic signal is delayed and attenuated. Such a phenomenon is modeled as the convolution of the input signal x(t) with the time-varying impulse response of the filter H(t), which characterizes the echo path. Therefore, the acoustic echo signal z(t) is given by:

z(t) = (x ∗ H)(t)    (4.1)
This type of echo is called acoustic echo. There is a superposition of sound waves at the microphone: the microphone at the near-end side records both the modified version of the loudspeaker sound, namely z(t), and the near-end speech signal s(t), so that:

y(t) = s(t) + z(t)    (4.2)
If the acoustic echo is not cancelled or attenuated, the far-end speaker experiences his own voice with a delay of about 200-500 ms. The delay is created by the accumulation of several delays caused by the speech coding and decoding, the channel coding and decoding, the network transport and possible transcoding. This phenomenon badly impacts the quality of the conversation. Acoustic echo most of the time appears in hands-free telephony, as sound spreads not only along the direct path (Loudspeaker - Microphone) but also spreads and reflects within the surrounding Loudspeaker-Room-Microphone environment. Standard solutions for acoustic echo cancellation imply estimation of the filter that models the acoustic echo path. This estimation becomes critical when the near-end speaker signal and the echo signal are simultaneously present, a situation called double talk.
Many techniques have been developed to handle the effect of acoustic echo (see [Haykin 2002b] - [Messerschmidt 1984]). Amongst all these techniques, the family of Stochastic Gradient algorithms and their derived versions are the most implemented solutions [Breining et al. 1999]. These families of algorithms use PCM samples to perform AEC. This chapter presents in Sec. 4.3 a state of the art in acoustic echo cancellation. The Least Mean Square (LMS) algorithm and the Normalized Least Mean Square (NLMS) algorithm are first introduced. Then the Gain Loss Control in the time domain is presented, followed by the Wiener filter method applied to AEC. Sec. 4.3 ends with a short overview of the double talk problem. Sec. 4.4 analyzes former AEC approaches in the coded parameter domain. The Gain Loss Control in the coded parameter domain is presented in Sec. 4.5. In Sec. 4.6, an AEC algorithm by filtering the fixed codebook gain is depicted. Sec. 4.7 concludes the chapter.
4.3 Acoustic Echo Cancellation: State of the Art
Stochastic gradient algorithms are the most implemented techniques to adaptively solve estimation problems. The most popular of these algorithms is the LMS algorithm. The LMS algorithm has been widely applied to all kinds of adaptive signal processing problems, and in particular to the adaptive estimation of the acoustic echo path.
4.3.1 The Least Mean Square Algorithm
The principle of the LMS algorithm is presented in Fig. 4.2. The residual signal error
e(n) is the difference between the corrupted signal y(n) and the output ŷ(n) of the estimated
impulse response:
e(n) = y(n) − ŷ(n) = y(n) − Ĥ T (n) · X(n)
(4.3)
[Figure: system identification setup; the far-end signal x(n) feeds both the true echo path H, which produces the echo z(n), and the estimated acoustic echo path filter; the microphone signal y(n) = s(n) + z(n) minus the filter output ŷ(n) yields the residual error e(n), which drives the adaptation algorithm.]
Figure 4.2: System Identification in AEC.
The estimated impulse response Ĥ of the filter that models the acoustic echo path of order
L is given by:
Ĥ(n) = (h0 , . . . , hL−1 )
(4.4)
The input vector X(n) contains the L past samples: X(n) = (x(n), . . . , x(n − L + 1)).
Since the filter H(n) is unknown, the LMS algorithm improves Ĥ(n) by adaptively estimating
its value until the error between H(n) and Ĥ(n) is minimal. A measure of the error during
this estimation is the residual signal error e(n). The LMS algorithm thus searches for the Ĥ(n) that minimizes the squared residual signal. One approach is to move Ĥ(n) in the direction opposite to the gradient of the expected squared error, using the correction ∇Ĥ(n) given by:
∇Ĥ(n) = −(µ/2) · ∇E[ |y(n) − ŷ(n)|² ] = −(µ/2) · ∇E[ e(n)² ]    (4.5)
Parameter µ is the step size and is used to control the rate of change, E is the mathematical
expectation and ∇ is the gradient with respect to Ĥ(n). The Stochastic gradient approach
replaces the expected value error by the instantaneous value. In practice, we compute ∇Ĥ(n)
as follows:
∇Ĥ(n) ≈ −(µ/2) · ∇e(n)² = −µ · e(n) · ∇e(n) = µ · e(n) · X(n)    (4.6)
At time index n + 1, the adaptive filter is updated as follows:
Ĥ(n + 1) = Ĥ(n) + µ · e(n) · X(n)
(4.7)
Generally, a sufficient condition for stability of the LMS algorithm is that step size µ lies
within the range: 0 < µ < 1/λmax , where λmax is the largest eigenvalue of the input signal
autocorrelation matrix ΓX . The autocorrelation matrix is computed as:
ΓX = ⎛ rX(0)      rX(1)      · · ·  rX(L−1) ⎞
     ⎜ rX(1)      rX(0)      · · ·  rX(L−2) ⎟
     ⎜   ...        ...      ...      ...   ⎟
     ⎝ rX(L−1)    rX(L−2)    · · ·  rX(0)   ⎠    (4.8)

where rX(j) = Σ_{n=j}^{L−1} x(n) · x(n − j), j = 0, . . . , L − 1.
The Normalized Least Mean Square Algorithm
A fast adaptation of the acoustic echo path can be achieved if the input signal is not
strongly correlated. The input speech signal x(n) is highly correlated and a whitening process
can be performed by dividing µ · e(n) · X(n) by a locally estimated power. This technique is
called the Normalized Least Mean Square (NLMS) algorithm where the adaptation procedure
is given by:
Ĥ(n + 1) = Ĥ(n) + µ · [e(n) / (X(n)ᵀ · X(n))] · X(n)    (4.9)
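As an illustration, the NLMS recursion of EQ. 4.9 can be sketched in a few lines of Python (a minimal sketch only; the inputs are assumed to be numpy arrays, and the filter length L, step size mu and regularization term eps are illustrative values, not those of any standardized canceller):

    import numpy as np

    def nlms_aec(x, y, L=256, mu=0.5, eps=1e-6):
        """Sketch of an NLMS echo canceller (EQ. 4.3, 4.7 and 4.9).
        x: far-end (loudspeaker) samples, y: microphone samples.
        Returns the residual signal e(n)."""
        h_hat = np.zeros(L)                          # estimated echo path
        e = np.zeros(len(y))
        for n in range(L - 1, len(y)):
            X = x[n - L + 1:n + 1][::-1]             # L past samples, newest first
            e[n] = y[n] - h_hat @ X                  # residual error, EQ. 4.3
            h_hat += mu * e[n] * X / (X @ X + eps)   # normalized update, EQ. 4.9
        return e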
4.3.2 The Gain Loss Control in the Time Domain
One of the oldest and simplest mechanisms to cancel acoustic echo is Gain Loss Control.
The principle of this method is to attenuate the microphone signal if the far-end speaker is
talking, and to reduce the far-end signal, if the near-end speaker is talking. In this section, the
principle of Gain Loss Control in the time domain, as developed in [Heitkamper, Walker 1993],
is first explained. The main idea in this technique is to decrease the corrupted (microphone)
signal y(n) by an attenuation gain. This attenuation gain depends on the short-term level of the
input signal (far-end signal) x(n) and the short term level of y(n). To compute the attenuation
gain, the short term averaging magnitude is used instead of the original magnitude. The short
term averaging magnitude ȳs (n) of the corrupted signal can be computed as the first order
non-linear recursive relation below:
ȳs(n) = (1 − αf) · |y(n)| + αf · ȳs(n − 1),  if |y(n)| ≤ ȳs(n − 1)
ȳs(n) = (1 − αr) · |y(n)| + αr · ȳs(n − 1),  otherwise    (4.10)
The parameters αf and αr determine the time constants of the averaging process and are
chosen in a way that a rising edge of the corrupted signal level can be followed faster than a
falling edge. The result is that, on the one hand, the control unit reacts faster in case of a
sudden increase of the energy level and, on the other hand, neglects short speech pauses. EQ. 4.10 can also be interpreted as a low-pass filtering.
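A minimal Python sketch may clarify this asymmetric smoother (the values of αf and αr are the experimental ones used later in Sec. 4.5.2; the function name is illustrative):

    def smoothed_magnitude(y, alpha_f=0.95, alpha_r=0.7):
        """Asymmetric first-order smoother of EQ. 4.10: rising edges are
        tracked quickly (alpha_r), falling edges decay slowly (alpha_f)."""
        y_bar, out = 0.0, []
        for sample in y:
            mag = abs(sample)
            a = alpha_f if mag <= y_bar else alpha_r
            y_bar = (1 - a) * mag + a * y_bar
            out.append(y_bar)
        return out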
Then, the attenuation factor gain ay (n) applied to the microphone signal is computed as
follows:

ay(n) = c1,                   if ȳs(n) ≤ y0max/ω̃
ay(n) = c2 · (ȳs(n))^(q−1),   if y0max/ω̃ < ȳs(n) < y0max
ay(n) = c3 / ȳs(n),           if y0max < ȳs(n)    (4.11)
The constants have been chosen such that ay (n) is a continuous function. The threshold
y0max separates the several possible modes, as indicated in Fig. 4.3:
1. Above this threshold, the level of the modified microphone signal (in log), ay(n) · ȳs(n), is almost constant at log(c3), characterizing speech periods.
2. Below the threshold, there is a small expansion region whose width, log(y0max) − log(ω̃), is determined by the factor ω̃. The degree of expansion is controlled by the parameter q, see Fig. 4.3.
3. Below the expansion region, the modified microphone signal is distinctly attenuated according to c1.
The characteristics of the microphone attenuation principle are shown in the figure below:
[Figure: control characteristic log(ay(n) · ȳs) versus log(ȳs), with the constant level log(c1) below the expansion region, an expansion region of degree q between log(y0max) − log(ω̃) and log(y0max), the constant level log(c3) above the threshold, and variations of y0max shifting the curve.]
Figure 4.3: Control Characteristics of the Microphone in Gain Loss Control.
Threshold y0max is adjustable and has to be chosen such that the background noise does
not reach the expansion area and that the average magnitude of the speech signal lies above
this threshold. Furthermore, the authors in [Heitkamper, Walker 1993] suggest computing
y0max by using the coupling factor between the input speech signal and the corrupted signal.
They also remark that the attenuation factor of the loudspeaker path should depend on the
speech intensity of the input signal.
4.3.3 The Wiener Filter Applied to Acoustic Echo Cancellation
The Wiener filter applied to echo cancellation, also called the Minimum Mean Squared
Error (MMSE) filter, is performed in the frequency domain, (see [Vaseghi 1996]). It aims
at minimizing the mean squared error J between the clean speech spectrum S(p, fk ) and its
estimated spectrum Ŝ(p, fk ) (cf. Chap. 3):
J = E[ |S(p, fk) − Ŝ(p, fk)|² ]    (4.12)
The estimated spectrum Ŝ(p, fk) is computed by filtering the microphone signal spectrum with the Wiener filter GWiener(p, fk), so that Ŝ(p, fk) = GWiener(p, fk) · Y(p, fk). The Wiener filter GWiener(p, fk) is obtained by minimizing the cost function J. The expression of Ŝ(p, fk) is used during the minimization process. The Wiener filtering technique assumes that the useful speech s(n) and the echo signal z(n) are statistically independent, leading to:

GWiener(p, fk) = SER(p, fk) / (1 + SER(p, fk))    (4.13)
where SER represents the Signal-to-Echo Ratio, computed as the ratio between the power density spectrum of the useful speech ΦSS(p, fk) and that of the echo signal ΦZZ(p, fk): SER(p, fk) = ΦSS(p, fk)/ΦZZ(p, fk).
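In practice, EQ. 4.13 amounts to a simple per-bin gain. A sketch, assuming the power spectral density estimates ΦSS and ΦZZ are available as arrays (the function name and eps guard are illustrative):

    import numpy as np

    def wiener_gain(phi_ss, phi_zz, eps=1e-12):
        """Per-bin Wiener gain of EQ. 4.13 from PSD estimates."""
        ser = phi_ss / (phi_zz + eps)    # Signal-to-Echo Ratio per bin
        return ser / (1.0 + ser)         # G_Wiener(p, f_k)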
4.3.4 The Double Talk Problem
A practical issue of importance in AEC is the situation where both the near-end speaker
and the far-end speaker talk simultaneously. This situation is also known as the double talk
period (cf. [Benesty et al. 2000] and [Ye, Wu 1991]), and is characterized by:
s(n) ≠ 0 and x(n) ≠ 0    (4.14)
where, as above, s(n) is the near-end speaker speech signal and x(n) is the far-end speaker signal. In practice, double talk periods disturb the identification of the echo path when using an adaptive filter, possibly leading to its divergence. An interesting way to overcome the double talk problem is to slow down or stop the adaptation when double talk is detected. Double Talk Detection (DTD) generally computes a detection statistic ξ. This metric is then compared with a constant threshold Tsx, independent of the input data. If double talk is detected, the filter adaptation process is disabled during this period of
time. The technique usually used to compute the metric is the normalized cross-correlation
method (see [Ye, Wu 1991]).
4.4 Overview of Acoustic Echo Cancellation Approaches in the Coded Parameter Domain
Acoustic echo cancellation algorithms are generally deployed as close as possible to the echo
source to mitigate their effects on speech quality. But existing acoustic echo cancellers do not
provide a sufficient level of cancellation. After enhancement in the terminal, the speech signal
is coded and transmitted over the network. At the receiver side, the decoded speech quality
depends on the network capability. Therefore, it can be interesting to directly implement the
Voice Quality Enhancement unit (cf. [Beaugeant 1999]) inside the network. Such an approach
enables harmonized VQE solutions. The centralized management of the network quality will be
possible by placing the VQE solution in the network, independently of the type of impairment.
Acoustic Echo Cancellation implemented in the network has been simulated in [Lu, Champagne 2003]. Based on this idea, several works have been conducted to adapt algorithms to this new concept. In [Gordy, Goubran 2006], the authors have proposed estimating the residual
echo power spectrum based on a frequency dependent scalar factor. The algorithm is incorporated into a psychoacoustic post-filter for residual echo suppression as developed in [Lu,
Champagne 2003]. The advantage is that the algorithm is placed inside the network. One
relationship to codec domain processing is that the LPC coefficients are integrated in their
model. Nevertheless, the system still uses the PCM speech samples.
In [Gordy, Goubran 2004], the authors have implemented an adaptive X-LMS (X-filter bank Least Mean Square) echo canceller located inside the network. Instead of a single de-correlation filter as in [Mboup et al. 1994], they use a filter bank of short de-correlation filters
whose coefficients are obtained from the LPC-based encoded representation of the loudspeaker
speech signal x(n).
The processing block diagram is depicted in Fig. 4.4. The LPC-based decoder provides a compressed representation of the loudspeaker signal x(k) inside the network. The loudspeaker signal can be reconstructed as follows:
x(n) = Σ_{i=1}^{M} a(n, i) · x(n − i) + r(n)    (4.15)
where at time n, a(n, i) is the ith LPC coefficient and r(n) stands for the total excitation signal
given by:
r(n) = ga,X · r(n − TX ) + gf,X · c(n)
(4.16)
[Figure: inside the network, the compressed speech frames feed an LPC-based decoder providing x(n), the total excitation r(n) and the LPC coefficients; x(n) passes through the echo path H(n) to produce the echo z(n), which adds to the near-end speech s(n) to form y(n); the LPC coefficients drive the de-correlation filter bank f(k) used to compute the filtered error e(n).]
Figure 4.4: Combined AEC/CELP Predictor.
In EQ. 4.16, ga,X , TX , gf,X , and c(n) are respectively the adaptive gain, the pitch delay,
the fixed codebook gain and the fixed codebook vector (see Chap.2). The total excitation
signal r(n) can be easily computed inside the network. The LPC coefficients at the decoder
are used to compute the de-correlation filter coefficients f (k). The estimated adaptive filter
is denoted by h̄n , and the original acoustic path is given by hn . The difference between the
target and the estimated impulse response at time n is given by:
∇hn (j) = hn (j) − h̄n (j)
(4.17)
The de-correlation filter bank is constructed using the inverse of the all-pole LPC synthesis filter at the decoder. At time n − l, the filter is computed as follows: f(0) = 1, f(k) = −a(n − l, k), 1 ≤ k ≤ M. The X-LMS update equations are computed as follows:
e_f(n, l) ≈ Σ_{j=0}^{N−1} ∇hn(j) · r(n − j) + Σ_{k=0}^{L−1} f(k) · ν(n − k)    (4.18)
where ν(n) represents the uncorrelated adaptive noise. The estimated adaptive filter is updated
based on the new formulation below:
h̄_{n+1}(j) = h̄_n(j) + [µ · e_f(n, l) · r(n − l)] / [Σ_{k=0}^{N−1} r²(n − k)]    (4.19)
The parameter µ is the step size of the standard LMS algorithm. Simulations using the G.729 standard show a more constant and faster convergence rate than the classic NLMS. This algorithm is easily applicable to other LPC-based coders, but it requires the decoding of the excitation signal.
The same idea was introduced in [Gnaba et al. 2003] by combining the acoustic echo
canceller and the CELP structure. This approach was based on the fact that the residual echo
in the case of NLMS is equal to the quantization noise of the speech codec. In [Gnaba et al.
2003], the NLMS was modified by introducing the quantization noise of CELP based speech
codec during estimations.
The specificity of these algorithms is that they still need the PCM speech samples. The codec parameters are incorporated inside the algorithm structure to increase the performance. The interest of these approaches lies in the fact that the algorithms are implemented inside the network. Such an approach is said to be centralized. In the following, we
propose algorithms dedicated to acoustic echo cancellation. The main difference between our
approach and those previously mentioned is that our algorithms do not require decoding of the
coded speech signal into PCM samples. Our algorithms deal with the CELP codec parameters
exclusively.
4.5 The Gain Loss Control in the Coded Parameter Domain
As introduced in Chap. 3, the modification of the fixed codebook gain gf results in the
modification of the decoded signal amplitude. This principle is used in this section to modify
the corrupted (microphone) signal fixed codebook gain gf,Y and the loudspeaker fixed codebook
gain gf,X in each sub-frame. Instead of attenuating the signals themselves, the attenuation factors ay and ax are applied to gf,Y and gf,X respectively. This proposed algorithm does not handle double talk periods separately.
[Figure: at the near-end side, the far-end bit-stream x(k) is decoded to the loudspeaker signal x(n); the acoustic echo path produces z(n); the microphone picks up y(n) = s(n) + z(n), which is encoded into y(k) towards the far-end speaker; the control module receives gf,X(m), ga,X(m) from the decoder and gf,Y(m), ga,Y(m) from the encoder, and returns the attenuation factors ax(m) and ay(m).]
Figure 4.5: Gain Loss Control in the Codec Parameter Domain.
In Fig. 4.5, the structure of the Gain Loss Control in the codec parameter domain is
depicted. The process takes place on each sub-frame. The bit-stream x(k) of the far-end
speaker is decoded at the near-end side, resulting in the decoded loudspeaker speech x(t). The
microphone signal y(t) = s(t) + (x ∗ H)(t) is the sum of the near-end speaker’s speech signal
s(t) and the acoustic echo z(t) = (x ∗ H)(t). The microphone signal is encoded, providing
the bit-stream y(k). The control module has two functions: it estimates the energies of the loudspeaker and microphone signals, and it computes the attenuation gain factors ax and ay as specified in Sec. 4.3.2.
As we use the AMR-NB speech coder [3GPP 1999b], the encoder and the decoder are
modified so that it is possible to extract the adaptive codebook gains (ga,Y , ga,X ) and the fixed
codebook gains (gf,Y , gf,X ). The encoder and the decoder are also modified so that they can
also receive the attenuation factors to be applied to the fixed codebook gains. We now describe the estimation of the signal energies and the computation of the attenuation gain factors.
4.5.1 Estimation of the Energy
As the fixed codebook gain and the adaptive gain are computed for each sub-frame, the
signal energy estimation is performed on a sub-frame basis. On a sub-frame m, the total speech
signal energy E(m) can be estimated by summing the adaptive codebook vector energy Ea (m)
and the fixed codebook vector energy Ef (m), weighted by their corresponding gain, ga and gf
respectively. The adaptive codebook vector is computed at the encoder side as a scaled version of the total excitation of the previous sub-frame, taken at a lag equal to the pitch period. A recursive formula
can then be used to simplify the computation of the signal energy. The energy of the adaptive
excitation at sub-frame m makes use of the total energy computed at sub-frame m − 1, leading
to:
ÊX (m) = Ef,X (m) · gf,X (m) + ÊX (m − 1) · ga,X (m)
ÊY (m) = Ef,Y (m) · gf,Y (m) + ÊY (m − 1) · ga,Y (m).
(4.20)
The encoder and the decoder have been modified so that the needed parameters can be extracted from, or introduced into, the bit-stream in each sub-frame. As described in Fig. 4.5, the fixed codebook gain and the adaptive codebook gain are extracted from the encoder (microphone side: gf,Y, ga,Y) and from the decoder (loudspeaker side: gf,X, ga,X). These parameters are then used inside the control module to compute estimates of the microphone and loudspeaker energies, ÊY(m) and ÊX(m) respectively.
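A sketch of the recursive estimation of EQ. 4.20 in Python, written per sub-frame (the argument names are illustrative, with E_f holding the fixed codebook vector energies):

    def estimate_energy(E_f, g_f, g_a):
        """Recursive sub-frame energy estimate of EQ. 4.20.
        E_f: fixed codebook vector energies, g_f: fixed codebook gains,
        g_a: adaptive codebook gains, all indexed by sub-frame m."""
        E_hat, out = 0.0, []
        for m in range(len(g_f)):
            # fixed contribution of sub-frame m plus recursive adaptive part
            E_hat = E_f[m] * g_f[m] + E_hat * g_a[m]
            out.append(E_hat)
        return out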
4.5.2 Computation of the Attenuation Gain Factors
In Sec. 4.3.2, it was indicated that the low-pass filtered magnitude of a speech signal is used to compute the attenuation gains. This approach eliminates the fast changes associated with the speech signal energy (see [Heitkamper 1995] and [Heitkamper 1997]); small speech pauses are neglected during the determination of the attenuation factors. Applying the same principle as in EQ. 4.10 to the energies considered here, the short-term energies Ê(m) are filtered as follows:
ẼX(m) = (1 − αf) · ÊX(m) + αf · ẼX(m − 1),  if ÊX(m) ≤ ẼX(m − 1)
ẼX(m) = (1 − αr) · ÊX(m) + αr · ẼX(m − 1),  otherwise    (4.21)

ẼY(m) = (1 − αf) · ÊY(m) + αf · ẼY(m − 1),  if ÊY(m) ≤ ẼY(m − 1)
ẼY(m) = (1 − αr) · ÊY(m) + αr · ẼY(m − 1),  otherwise    (4.22)
The smoothing factors were chosen experimentally as αf = 0.95 and αr = 0.7. With these values, an abrupt increase of the speech level is followed quickly, whereas if the amplitude of the speech signal decreases, the energies ẼX(m) and ẼY(m) decay more slowly; short speech pauses are thus neglected by EQ. 4.21 and EQ. 4.22. Applying EQ. 4.21 to the loudspeaker signal and EQ. 4.22 to the microphone signal corresponds to estimating the long-term energies ẼX(m) and ẼY(m) respectively. The long-term estimate follows the shape of the short-term energy: high variations of the estimated energy are tracked, and the curve of the long-term estimated energy matches the maximum contour of the computed short-term energy. During speech pauses, the long-term energy floor is not equal to the short-term energy floor. This is due to the fact that during periods detected as silence, the speech samples are not exactly zero; in the AMR coder, the codebook gains, especially the adaptive ones, are generally different from zero. The adaptive behavior of the estimation can be observed during transition phases, where the estimated long-term energy moves slowly towards the minimum in silence periods. In [Duetsch 2003], several experiments were performed to demonstrate the interest of using expressions 4.21 and 4.22 to characterize the signal energy in the coded parameter domain. As shown in Fig. 4.6, this estimation provides a good approximation of the short-term energy of the signal.
These estimated long-term energies are used to compute the attenuation gains ax and ay respectively. The attenuation gains are limited to the interval [0, 1], so the fixed codebook gains range from complete attenuation to no attenuation, depending on the case. A metric Ẽdiff(m) is then computed to characterize the energy level. This metric is a scaled logarithm of the ratio between the loudspeaker long-term energy ẼX(m) and the microphone long-term energy ẼY(m):
[Figure: top panel, speech signal amplitude versus time; bottom panel, short-term and long-term energy estimates in dB versus subframes.]
Figure 4.6: Example of Energy Estimation in Codec Parameter Domain.
Ẽdiff(m) = 10 · log10( ẼX(m) / ẼY(m) )    (4.23)
Specifically, the attenuation gains are computed as follows:
ax(m) = 0,                       if Ẽdiff(m) < −µGLC/2
ax(m) = Ẽdiff(m)/µGLC + 0.5,     if −µGLC/2 < Ẽdiff(m) < µGLC/2
ax(m) = 1,                       if µGLC/2 < Ẽdiff(m)

ay(m) = 1 − ax(m)    (4.24)
EQ. 4.24 guarantees that at least one of the loudspeaker and the microphone fixed codebook
gains is attenuated. To set µGLC, distinct processing has to be considered in silence periods and during speech activity periods. If the estimated energies of the loudspeaker and the microphone are both below a certain threshold (−50 dB), the parameter µGLC is modified to µGLC = cst · µGLC, where cst = 5. This situation corresponds to silence: the linear region of EQ. 4.24 is widened by the factor cst, so that the attenuation gains ax(m) and ay(m) remain close to 0.5, as shown in Fig. 4.7.
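The following sketch gathers EQ. 4.23 and EQ. 4.24 together with the silence handling just described (a sketch only, assuming strictly positive linear-domain energies; the linear part is written so that ax(m) is continuous at ±µGLC/2, and the thresholds are the experimental values quoted above):

    import math

    def attenuation_gains(E_x, E_y, mu_glc, silence_db=-50.0, cst=5):
        """Attenuation gains of EQ. 4.24 from the long-term energies."""
        e_diff = 10.0 * math.log10(E_x / E_y)              # EQ. 4.23
        # widen the linear region when both energies indicate silence
        if (10.0 * math.log10(E_x) < silence_db
                and 10.0 * math.log10(E_y) < silence_db):
            mu_glc = cst * mu_glc
        if e_diff < -mu_glc / 2:
            a_x = 0.0
        elif e_diff > mu_glc / 2:
            a_x = 1.0
        else:
            a_x = e_diff / mu_glc + 0.5                    # linear region
        return a_x, 1.0 - a_x                              # EQ. 4.24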
We can observe in Fig. 4.7 that when the attenuation factor of the loudspeaker characteristic is zero, the attenuation factor of the microphone characteristic is equal to one, and conversely.
These periods correspond to single talk periods, where either the far-end speaker or the near-end speaker is talking. The attenuation gains are characterized by the relation between the microphone signal energy and the loudspeaker signal energy.
[Figure: loudspeaker characteristic ax(m) and microphone characteristic ay(m) versus Ẽdiff(m); each gain is constant (0 or 1) outside ±µGLC/2 and linear in between, crossing 0.5 at Ẽdiff(m) = 0.]
Figure 4.7: Characteristics of the Attenuation Gains.
In Fig. 4.7, above µGLC/2 and below −µGLC/2 the attenuation factors are constant. This algorithm has no double talk detection mechanism. Our algorithm provides a dual processing between the loudspeaker and the microphone fixed codebook gains. The critical point is 0.5, where the fixed codebook gains are identically attenuated. If the estimated energies of the loudspeaker and microphone are such that Ẽdiff(m) lies between −µGLC/2 and µGLC/2, the attenuation is distinctly applied, as desired, to the two fixed codebook gains.
The Gain Loss Control in the codec parameter domain may fail when the near-end and the far-end speakers are speaking at the same time, that is, during double talk. There is no additional mechanism in this method to determine double talk periods. The output of the loudspeaker is cut off completely only if the near-end speaker is talking, and the microphone input is interrupted only if the far-end speaker is talking.
4.5.3 Experimental Results Analysis
The results presented in Fig. 4.8 include both single talk and double talk periods. The period from 0 s to 3 s represents the single talk of the far-end speaker. The period from 4 s to 6 s corresponds to the near-end speaker single talk. Finally, the period from 6.5 s to 8 s is the double talk period, where the near-end speaker and the far-end speaker are talking simultaneously. The echo signal was simulated by filtering the far-end signal through a car impulse response. The system implemented in this work simulates the network processing, where the needed parameters (fixed codebook gain and adaptive codebook gain) are extracted from the bit-stream.
[Figure: (a) unprocessed microphone signal y(n) and echo z(n); (b) unprocessed y(n) and enhanced signal; signal amplitude versus time.]
Figure 4.8: Example of AEC based on Gain Loss Control.
In the echo-only area of Fig. 4.8 (from t = 0 s to t = 3 s), the Gain Loss Control brings a clear improvement of the acoustic echo cancellation. The acoustic
echo is completely cancelled. During the near-end speaker Single Talk, the microphone signal
is not modified. Difficulties appear in Gain Loss Control during double talk periods. Analysis
of the estimated energy in Fig. 4.9 (from t = 6.5 s to t = 10 s) indicates that the energies of the
microphone and the loudspeaker signals have the same level. During such periods, it is rather
difficult to discriminate the amplitude of the loudspeaker signal from that of the microphone
signal. The attenuation gains are not well estimated since there is no significant difference
between the energy amplitudes. As shown in Fig. 4.9, double talk periods are characterized by
rapid changes of the attenuation factors. Both the microphone input signal and the loudspeaker
output signal are alternatively attenuated. Informal listening tests indicate that during double
talk periods, the processed microphone input signal is slightly distorted.
The characteristics of the control module are shown in Fig. 4.9. During single talk of the far-end speaker, that is, during echo-only periods, the estimated energy of the loudspeaker is clearly above the estimated energy of the microphone. In these periods, the attenuation factors are constant: 1 for the loudspeaker and 0 for the microphone. As a result, the loudspeaker output signal is cut off completely when the near-end speaker is talking, and the microphone input signal is cut off completely when the far-end speaker is talking. During
double talk periods, both the microphone and the loudspeaker signals are attenuated. The
Gain Loss Control algorithm makes use of microphone and loudspeaker energies to compute
the attenuation gains. It is particularly difficult in double talk to compare the signal energies.
[Figure: top panel 'Estimated Energy', ÊX(m) and ÊY(m) in dB versus subframes; bottom panel 'Attenuation Gains', ax(m) and ay(m) versus subframes.]
Figure 4.9: Typical Example of the Evolution of the Attenuation Factor.
4.6 Acoustic Echo Cancellation by Filtering the Fixed Gain
This section proposes a more sophisticated Acoustic Echo Cancellation algorithm. The
contents of this section were published in [Thepie et al. 2006]. The general principle consists
of using the codec parameters only to design a complete AEC algorithm. The Gain Loss Control proposed in the previous section has shown its limits during double talk periods. This new algorithm includes a double talk detector, leading to a better performance of the echo reduction.
Compared to classical solutions operating on the PCM signal, this new approach reduces complexity and allows the integration of echo cancellation in the network without adding any delay or creating the so-called tandeming effect. We also introduce an innovative estimation of the Signal-to-Echo Ratio (SER) based on a linear model of the fixed codebook gain parameters. Listening test results presented at the end of the section validate the good quality achieved, even though the complexity of the proposed algorithm is low.
4.6.1 System Overview
Taking into account the CELP synthesis filter introduced in Chap. 2, the idea is to design a filter G(m) such that, in the presence of an echo signal, the microphone fixed codebook gain gf,Y(m) is replaced by its weighted version given by:

ĝf,S(m) = G(m) · gf,Y(m)    (4.25)
[Figure: in the decoder block, the bit-stream x(k) is decoded and the fixed codebook gain gf,X(m) is extracted and used to estimate the echo signal fixed gain ĝf,Z(m); in the encoder block, the microphone signal y(n) = s(n) + z(n) is encoded, gf,Y(m) is extracted, filtered into ĝf,S(m) and mapped back into the bit-stream y(k).]
Figure 4.10: Filtering of the Microphone Fixed Codebook Gain Principle.
In our experiment, see Fig. 4.10, the encoder and the decoder blocks are modified to get access to the parameters needed for the echo reduction. Such a scheme mimics what would happen in the network, where only the bit-streams, and in particular the gains, are available. At the decoder side, the input loudspeaker speech bit-stream x(k) is decoded and the fixed codebook gain gf,X(m) is extracted. The remaining parameters are kept unchanged. This loudspeaker fixed gain is used to estimate the echo signal fixed codebook gain ĝf,Z(m). At the encoder block side, the microphone fixed codebook gain gf,Y(m) is extracted during the encoding process. The fixed gains gf,Y(m) and ĝf,Z(m) are used during the filtering process to get the estimated clean speech signal fixed codebook gain ĝf,S(m). The estimated speech
gain is then mapped inside the bit-stream y(k) of the microphone speech signal. The obtained
bit-stream corresponds in fact to an estimation of the clean speech bit-stream ŝ(k).
The CELP coding process is not a linear operation. Therefore, we consider that the microphone fixed codebook gain is a function of the near-end speech signal fixed codebook gain and of the echo signal fixed codebook gain:
gf,Y (m) = f (gf,S (m), gf,Z (m))
(4.26)
The joint function f (., .) is not clearly identified, because of the CELP encoding process.
In the following, we will propose an approximation of the joint function f (., .) and of the
weighting filter G(m).
4.6.2 Approximation of the Joint Function f(., .) and the Filter G(m)
In this work, we approximate the joint function with a linear combination based on three
parameters such as:
f (gf,S (m), gf,Z (m)) = a · gf,S (m) + b · gf,Z (m) + c
(4.27)
The parameter c is set to 0, as the fixed gain is null if no signal is encoded. EQ. 4.27 is
now reduced to:
f (gf,S (m), gf,Z (m)) = a · gf,S (m) + b · gf,Z (m)
(4.28)
Introducing the new measure SER(m) = gf,S(m)/gf,Z(m), described as a Signal-to-Echo Ratio, it follows from EQ. 4.25 and EQ. 4.28 that:

G(m) = SER(m) / (1 + SER(m))    (4.29)
This latter expression can be interpreted as a weighted Wiener filtering of the gain, showing
the similarity of our method to the filter developed in the ’classical’ frequency domain noise
reduction (cf. [Ephraim, Malah 1984]). The main problem lies in the estimation of the two
parameters (a and b). It is necessary to estimate a and b to obtain the joint function f (., .) and
the filter G(m) via EQ. 4.28 and EQ. 4.29 respectively. To compute a and b, discrimination
between single talk and double talk periods should be taken into account.
The first observation is that if only the near-end speaker is talking (Single Talk), then the echo signal fixed codebook gain is gf,Z = 0 and the near-end speech fixed codebook gain is gf,S ≠ 0. Similarly, if only the far-end speaker is talking (Single Talk), we must have gf,S = 0 and gf,Z ≠ 0. Therefore, during Single Talk of the near-end speaker, (a, b) = (1, 0). In the same way, one can show that during Single Talk of the far-end speaker, (a, b) = (0, 1).
To compute the parameters a and b during Double Talk periods, we encode a large database of acoustic echo scenarios and extract from it the fixed codebook gains of the echo signal, of the microphone speech signal and of the loudspeaker speech signal. We construct three vectors of size I, the total number of sub-frames per speech file, as follows:
ℑy = [gf,Y (1), . . . , gf,Y (I)] , ℑs = [gf,S (1), . . . , gf,S (I)] , ℑz = [gf,Z (1), . . . , gf,Z (I)]
(4.30)
In this work the parameters a and b are assumed to stay constant during the estimation and
during double talk periods. The purpose is now to search the pseudo-inverse of the following
system, leading to optimum a and b:
[ℑs ℑz ] · [a b]T = ℑy
(4.31)
The system of equations in EQ. 4.31 is an over-determined system. The matrix ℑ = [ℑs ℑz ]
is not a square matrix. We solve the system in the least-squares sense using the pseudo-inverse ℑ+ of the matrix ℑ, also called the Moore-Penrose generalized inverse [Golub, Loan 1996]. Using ℑ+ = (ℑᵀ · ℑ)^(−1) · ℑᵀ, the optimal parameters a and b are given by:

[a b]ᵀ = ℑ+ · ℑy    (4.32)
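This least-squares solution can be computed directly, for instance with numpy (a sketch; the gain vectors are assumed to have already been extracted from the encoded database, and the function name is illustrative):

    import numpy as np

    def fit_ab(g_s, g_z, g_y):
        """Least-squares fit of EQ. 4.31-4.32: g_y ≈ a·g_s + b·g_z."""
        F = np.column_stack((g_s, g_z))        # the matrix [I_s I_z]
        # lstsq applies the Moore-Penrose pseudo-inverse internally
        (a, b), *_ = np.linalg.lstsq(F, g_y, rcond=None)
        return a, b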
In order to evaluate the results obtained via EQ. 4.32, we simulate the acoustic echo
scenarios in six different car kits, see Tab. 4.1. The parameters obtained by EQ. 4.32 are
compared to the exact values by analyzing the normalized square error:
error = Σ_{m=1}^{I} [gf,Y(m) − (a · gf,S(m) + b · gf,Z(m))]² / Σ_{m=1}^{I} gf,Y(m)²    (4.33)
As the database scenarios were recorded in similar environments, the results of Tab. 4.1 suggest that the optimal values of a and b are approximately:

a ≈ 1 and b ≈ 4/3    (4.34)
This latter equation is used to compute a general version of the weighting filter of EQ. 4.29
as follows:
G(m) = SER(m) / (ς + SER(m))    (4.35)
where ς = 1 during single talk periods and ς = 4/3 during Double Talk periods.
Car Kit    Optimal a    Optimal b    error (%)
c1         1.3          1.33         9.30
c2         0.99         1.39         6.85
c3         0.97         1.36         3.35
c4         0.98         1.34         8.27
c5         0.96         1.38         5.75
c6         0.97         1.27         3.25
Table 4.1: Mean Linear Coefficients in Double Talk Mode.
We introduce a recursive estimation of the Signal-to-Echo Ratio (SER), in order to simplify
the algorithm as in [Taddei et al. 2004]:
SER(m) = βe · [G(m − 1) · gf,Y(m − 1)] / ĝf,Z(m) + (1 − βe) · gf,Y(m) / ĝf,Z(m)    (4.36)
where the smoothing factor is set to βe = 0.9, according to preliminary experiments. The interest of EQ. 4.36 is that it needs no estimate of the near-end signal fixed codebook gain gf,S, similarly to the Signal-to-Noise Ratio computed in [Ephraim, Malah 1984]. Only the estimation of the echo signal fixed codebook gain gf,Z is required. The acoustic echo cancellation problem is thus reduced to the discrimination between double talk and single talk, and to the estimation of the echo signal fixed codebook gain ĝf,Z.
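A per-sub-frame sketch of the resulting filtering (EQ. 4.25, 4.35 and 4.36), assuming the estimate ĝf,Z of the next section and a per-sub-frame double talk flag are available (names and the eps guard are illustrative):

    def filter_fixed_gain(g_y, g_z_hat, double_talk, beta_e=0.9, eps=1e-12):
        """Recursive SER (EQ. 4.36) and fixed-gain filtering (EQ. 4.25/4.35)."""
        g_prev, G_prev, out = 0.0, 0.0, []
        for m in range(len(g_y)):
            ser = (beta_e * G_prev * g_prev / (g_z_hat[m] + eps)
                   + (1 - beta_e) * g_y[m] / (g_z_hat[m] + eps))  # EQ. 4.36
            sigma = 4.0 / 3.0 if double_talk[m] else 1.0          # EQ. 4.35
            G = ser / (sigma + ser)
            out.append(G * g_y[m])                                # EQ. 4.25
            g_prev, G_prev = g_y[m], G
        return out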
4.6.3 Estimation of the Echo Signal Fixed Codebook Gain: ĝf,Z
The fixed codebook gain of the echo signal is estimated by mimicking the approach followed in [Heitkamper 1997]. The estimated echo fixed codebook gain is defined as a shifted and attenuated version of the far-end speech fixed codebook gain gf,X:

ĝf,Z(m) = κopt(m) · gf,X(m − τopt(m))    (4.37)

The parameter τopt(m) represents the number of shifted sub-frames and κopt(m) is the attenuation parameter. The parameters τopt(m) and κopt(m) are computed in two steps: the
first step is the echo mode detection (echo presence and double talk periods) and the second
step is the effective estimation of the parameters.
Echo Mode Detection:
The echo mode detection is performed in two stages: first, we detect whether an echo is present; second, we detect whether there is a double talk period.
The echo mode detection starts with the estimation of the signal energy in the codec parameter domain. A smoothed version of the fixed codebook gain is used as the signal energy. The smoothed versions of the microphone and loudspeaker fixed codebook gains, ĝsmooth,Y(m) and ĝsmooth,X(m) respectively, are computed as follows:
ĝsmooth,Y(m) = γe · ĝsmooth,Y(m − 1) + (1 − γe) · gf,Y(m)
ĝsmooth,X(m) = γe · ĝsmooth,X(m − 1) + (1 − γe) · gf,X(m)    (4.38)
where typically γe = 0.9. We decide that a possible echo mode is detected if:
ĝsmooth,X (m) > max (Tsilence , ĝsmooth,Y (m))
(4.39)
where the threshold parameter is experimentally set to Tsilence = 10. The relation in EQ.
4.39 is aimed at verifying whether the far-end speaker is talking or not. Moreover, EQ. 4.39
is always verified in the presence of echo periods and/or double talk periods. This relation
will be further used to verify if the coupling between the loudspeaker and the microphone is
effective.
The echo period is finally detected by analyzing the normalized cross-correlation function between the far-end speaker (loudspeaker) fixed codebook gain gf,X and the microphone fixed codebook gain gf,Y. The normalized cross-correlation in the coded parameter domain is given by:
φgf,X gf,Y(i) = [ Σ_{j=0}^{Ncc−i−1} gf,X(j + i) · gf,Y(j) ] / sqrt( Σ_{j=0}^{Ncc−1} |gf,X(j)|² · Σ_{j=0}^{Ncc−1} |gf,Y(j)|² ),  if i ≥ 0
φgf,X gf,Y(i) = φgf,X gf,Y(−i),  else    (4.40)
The term Ncc represents the length of the normalized cross-correlation analysis. The maximum of the normalized cross-correlation function cmax (m) as well as its corresponding lag
lmax (m) are searched as follows:
cmax(m) = max_i φgf,X gf,Y(i) and lmax(m) = arg max_i φgf,X gf,Y(i)    (4.41)
We decide that if cmax (m) is above ta = 0.75 (enough correlation), and EQ. 4.39 is verified,
echo is detected and the current parameters τopt (m) and κopt (m) are updated as described in
the next sections.
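The detection logic of EQ. 4.38 to EQ. 4.41 can be sketched as follows (a sketch only; the gain histories are assumed to be numpy arrays of length Ncc, and Tsilence, ta and γe are the experimental values given in the text):

    import numpy as np

    def detect_echo(g_x_hist, g_y_hist, g_sm_x, g_sm_y,
                    gamma_e=0.9, T_silence=10.0, t_a=0.75):
        """Echo mode detection from the last Ncc fixed codebook gains.
        g_x_hist/g_y_hist: loudspeaker/microphone gain histories,
        g_sm_x/g_sm_y: previous smoothed gains of EQ. 4.38."""
        n_cc = len(g_x_hist)
        # EQ. 4.38: smoothed gains as a codec-domain energy measure
        g_sm_x = gamma_e * g_sm_x + (1 - gamma_e) * g_x_hist[-1]
        g_sm_y = gamma_e * g_sm_y + (1 - gamma_e) * g_y_hist[-1]
        # EQ. 4.39: is the far-end speaker active at all?
        possible_echo = g_sm_x > max(T_silence, g_sm_y)
        # EQ. 4.40-4.41: normalized cross-correlation and its maximum
        denom = np.sqrt(np.sum(g_x_hist**2) * np.sum(g_y_hist**2)) + 1e-12
        ncc = [np.sum(g_x_hist[i:] * g_y_hist[:n_cc - i]) / denom
               for i in range(n_cc)]
        c_max, l_max = float(np.max(ncc)), int(np.argmax(ncc))
        return (possible_echo and c_max > t_a), c_max, l_max, g_sm_x, g_sm_y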
Determination of τopt (m) and κopt (m):
The determination of τopt(m) and κopt(m) is performed similarly to [Heitkamper 1997]. In contrast to [Heitkamper 1997], the fixed codebook gains are hereafter used as a sufficiently good representation of the energy. In general, the maximum cross-correlation lag characterizes the amount of correlation between the analyzed signals. In the same sense, the maximum normalized cross-correlation lag lmax(m) will be used to determine the optimum sub-frame shift τopt(m). To prevent rapid variations of the fixed gain during echo periods, a short-term lag lst(m) is first computed on the basis of lmax(m) as follows:
lst (m) = α̂(m) · lst (m − 1) + (1 − α̂(m)) · lmax (m)
(4.42)
The smoothing factor α̂(m) is computed adaptively as a function of the correlation coefficient cmax(m) and the echo detection threshold ta:
α̂(m) = −[(αe − δe)/(1 − ta)] · cmax(m) + (αe − δe · ta)/(1 − ta),  if cmax(m) > ta
α̂(m) = αe,  else    (4.43)
The smoothing parameters are taken equal to αe = 0.96 and δe = 0.25. The interest of such an approach is that the short-term lag lst(m) is adapted faster or slower, depending on the correlation between the fixed codebook gains of the microphone signal and the loudspeaker signal.
Therefore, the update of the short-term lag needs to be controlled differently during echo and non-echo periods. This control avoids critical variations between two consecutive updates. To achieve it, an intermediate short-term lag (also called long-term lag) l⊕st(m) is computed only during echo periods as follows:

l⊕st(m) = µe · l⊕st(m − 1) + (1 − µe) · lst(m)    (4.44)
where µe = 0.95 is a smoothing factor. This value is kept in memory and used to update the short-term lag during non-echo periods. During non-echo periods, the intermediate short-term lag l⊕st(m) is used as a convergence point of the short-term lag according to:

lst(m) = αe · lst(m − 1) + (1 − αe) · l⊕st(m)    (4.45)
The interpretation of this approach is that if a non-echo period is very short, lst(m) keeps a value close to the last computed one. In this case, for the next echo period, we consider that the echo path has not changed drastically and thus use approximately the same value as the previous short-term lag. If the non-echo period is long, lst(m) converges to the average short-term lag l⊕st(m). We assume during our experiments that the echo path does not change considerably, so that consecutive values can be used to estimate the next short-term lag when the next echo period starts. Finally, the optimum sub-frame shift during echo and non-echo periods is obtained by:

τopt(m) = round(lst(m))    (4.46)
The attenuation factor κopt(m) during echo-only periods is in fact the ratio between the microphone fixed codebook gain and the shifted loudspeaker fixed codebook gain:

κopt(m) = gf,Y(m) / gf,X(m − τopt(m))    (4.47)
The fixed codebook gain of the echo-only signal should be smaller than the fixed codebook gain of the loudspeaker. If the ratio in EQ. 4.47 is bigger than one, a double talk period is detected; the currently computed parameters are then set back to their previous values. During non-echo periods, the short-term attenuation factor is updated with its long-term value in a similar way as in the computation of τopt(m). The short-term attenuation factor during echo periods is given by:

κst(m) = α̂(m) · κst(m − 1) + (1 − α̂(m)) · gf,Y(m) / gf,X(m − τopt(m))    (4.48)
and the long-term attenuation is computed based on the relation below:

κ⊕st(m) = µe · κ⊕st(m − 1) + (1 − µe) · κst(m)    (4.49)
During non-echo periods, the attenuation factor is updated using the long-term version of the short-term attenuation factor and kept in memory as follows:

κst(m) = αe · κst(m − 1) + (1 − αe) · κ⊕st(m)    (4.50)
The attenuation factor during echo periods is finally computed, based on EQ. 4.48, through the relation below:

κopt(m) = κst(m)    (4.51)
To complete the process, the values of the parameters τopt (m) and κopt (m) are used to
compute an estimation of the fixed codebook gain of the echo signal according to EQ. 4.37.
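Putting EQ. 4.42 to EQ. 4.51 together, one echo-period update of τopt(m) and κopt(m) can be sketched as follows (a sketch only; state variables are carried between sub-frames in a dictionary with assumed keys, and the non-echo updates of EQ. 4.45 and EQ. 4.50 are omitted for brevity):

    def update_echo_params(state, g_y, g_x_shifted, c_max, l_max,
                           alpha_e=0.96, delta_e=0.25, t_a=0.75, mu_e=0.95):
        """One echo-period update of tau_opt / kappa_opt (EQ. 4.42-4.51).
        state: dict with keys l_st, l_lt, k_st, k_lt kept between sub-frames;
        g_x_shifted: loudspeaker fixed gain at lag tau_opt(m)."""
        # EQ. 4.43: smoothing factor, alpha_e at c_max = t_a down to delta_e at 1
        if c_max > t_a:
            a_hat = (-(alpha_e - delta_e) / (1 - t_a) * c_max
                     + (alpha_e - delta_e * t_a) / (1 - t_a))
        else:
            a_hat = alpha_e
        state["l_st"] = a_hat * state["l_st"] + (1 - a_hat) * l_max         # EQ. 4.42
        state["l_lt"] = mu_e * state["l_lt"] + (1 - mu_e) * state["l_st"]   # EQ. 4.44
        ratio = g_y / max(g_x_shifted, 1e-12)
        if ratio > 1.0:
            # the echo gain cannot exceed the loudspeaker gain:
            # double talk detected, keep the previous parameters
            return round(state["l_st"]), state["k_st"], True
        state["k_st"] = a_hat * state["k_st"] + (1 - a_hat) * ratio         # EQ. 4.48
        state["k_lt"] = mu_e * state["k_lt"] + (1 - mu_e) * state["k_st"]   # EQ. 4.49
        return round(state["l_st"]), state["k_st"], False                  # EQ. 4.46/4.51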
4.6.4 Experimental Results
To assess the performance of this Acoustic Echo Cancellation algorithm (filtering of the microphone fixed codebook gain), an Absolute Category Rating (ACR) listening test (see [ITU-T 2006c] and [Thepie et al. 2006]) was performed. Ten naïve and expert listeners participated. They scored using a Mean Opinion Score (MOS) [ITU-T 2006e], ranging from 1 (unacceptable) to 5 (excellent). Each scenario was defined as a conversation between a near-end speaker and a far-end speaker. The files played during the test contain single talk and double talk periods and were classified in 4 groups of 15 files each.
Group A is composed of clean speech files with no echo, group B of files with unprocessed echo, group C of files processed with our echo reduction, and group D of files enhanced using a standard NLMS method [Beaugeant et al. 2006]. The acoustic echo was simulated based on
three car impulse responses: h1 , h2 and h3 . Mean scores and standard deviations of the MOS
obtained with the listening test are displayed in Tab. 4.2 below.
Considering the mean scores of all scenarios, the results show that our relatively simple solution yields results very close to those of the intensively studied and more costly NLMS method. The obtained average MOS (3.5) indicates an absolute quality between fair and good. This result is far better than the assessment of the unprocessed scenarios, for which the MOS is 1.9.
The result of this test depends highly on the kind of impulse response that is used and on the Signal-to-Echo Ratio (SER). The measurements of the SER during double talk periods were as follows: 11 dB for h1, 15 dB for h2, and 10 dB for h3. We can observe that the codec parameter domain method was not as good as the standard NLMS for low SER, whereas it was rated as good or even better for SER ≥ 15 dB (cf. Tab. 4.2).
Groups    GrpA         GrpB        GrpC        GrpD
h1        4.66/0.36    1.6/0.93    2.68/0.87   4.10/0.78
h2        4.58/0.62    1.64/0.72   3.1/0.70    2.82/0.80
h3        4.56/0.57    1.84/0.88   3.68/0.76   3.72/0.82
Total     4.60/0.69    1.69/0.85   3.15/0.88   3.56/0.99
Table 4.2: Mean and Standard Deviation Opinion Score.
An example of processing based on the filtering of the microphone input fixed codebook
gain is presented in Fig. 4.11. We can observe that during single talk periods of the far-end
speaker, the echo signal is not completely cancelled. There is a slight amount of residual echo
after the processing, as well as with standard NLMS.
This residual echo is not particularly annoying with our method. During single talk periods of the near-end speaker, our proposed algorithm completely restores the clean speech envelope. With this approach based on the filtering of the fixed codebook gain, the double talk periods are significantly improved. The enhanced microphone signal in Fig. 4.11 shows that the signal envelope is recovered as well as with the NLMS method. During isolated echo-only periods (echo present between two consecutive periods of near-end speech activity), the proposed algorithm significantly reduces the echo effects. This capability can be observed between t = 7 s and t = 8 s.
[Figure: three panels of amplitude versus time; top: unprocessed y(n) (black) and echo z(n) (red); middle: unprocessed y(n) (black) and processed signal (red); bottom: unprocessed y(n) (black) and signal processed with NLMS (red).]
Figure 4.11: Example of AEC by Filtering the Fixed Gain.
4.7 Conclusion
This chapter has provided an overview of techniques used to model the acoustic echo path. The acoustic echo path is generally modeled as a multi-tap time-varying filter. In Sec. 4.3 of this chapter, we have described current algorithms that model the acoustic echo path. These techniques are implemented in the time domain or in the frequency domain. The issue for these classical solutions is to adaptively compute an approximation of the acoustic echo path. Whether in the time domain or in the frequency domain, these algorithms use the PCM speech samples of the far-end speaker and the near-end speaker. Moreover, during double talk periods, the classical AEC algorithms do not achieve a good estimation of the acoustic echo path. Finally, these algorithms require a significant computational load. We have also seen in this chapter that during double talk, existing acoustic echo cancellers do not provide reliable adaptation of the acoustic echo path. Several techniques exist to detect double talk periods; double talk detection algorithms based on cross-correlation are the most used in practice.
In order to overcome these shortcomings, this chapter has proposed two new algorithms to
reduce acoustic echoes. The common feature of these algorithms is that they do not need to
decode the speech signal in PCM format. Only a partial decoding is done to extract the relevant
parameters: the fixed codebook gain and the adaptive codebook gain. More specifically, these
two algorithms are:
1. The Gain Loss Control in the Codec Parameter Domain. This algorithm provides two attenuation factors used to control the fixed codebook gains of the microphone signal and the loudspeaker signal. In single talk periods, the algorithm performs well. It can be integrated in a system as a post-processing stage of another acoustic echo cancellation algorithm; the Gain Loss Control will then eliminate the residual echo that remains after the main process. In double talk periods, the Gain Loss Control has a drawback, in the sense that both the loudspeaker and the microphone are attenuated.
2. The filtering of the fixed codebook gain of the microphone signal, which appears as a more efficient algorithm. The codec parameters are used to build a double talk detector, which is important for the performance of the AEC. The performance of this algorithm depends on the kind of impulse response and on the Signal-to-Echo Ratio (SER); an accurate estimation of the SER improves the result. This second algorithm is specifically aimed at improving performance during double talk periods. Experimental results show that our algorithm achieves results similar to those attained by the standard NLMS, while its complexity is lower.
After proposing speech enhancement algorithms (Noise Reduction in Chap. 3 and Acoustic Echo Cancellation in this chapter), the purpose is now to deploy these algorithms together with smart transcoding strategies. Transcoding and Smart Transcoding were already introduced in Chap. 1. In recent years, investigations on Smart Transcoding have provided
interesting results. In the Smart Transcoding procedure, the coded parameters are used to reduce the complexity of the target encoder. Therefore, the objective of the next chapter is to propose an integration of our coded-domain Voice Quality Enhancement algorithms inside Smart Transcoding schemes.
Chapter 5
Voice Quality Enhancement and Smart Transcoding
5.1 Introduction
As introduced in Chap. 1, the telecommunication world is characterized by the successive development of new networks. In general, a particular international or regional speech/audio standard is dedicated to each network: UMTS works with the Adaptive Multi Rate (AMR) codec [3GPP 1999b], PSTN with G.711, G.726 or G.728, and services and applications over the Internet with G.723.1 or G.729. The networks with their associated speech codecs are usually not interoperable with each other [ITU-T 2004]. Due to their operational characteristics, the coders transmit parameters that have different formats. When these networks have to interoperate, a bit-stream conversion is necessary to achieve the interconnection. In such a case, the network systems must establish the speech communication between the sending network and the receiving network.
The usual way to handle this issue is to perform transcoding [3GPP 1999b]. Classical transcoding involves a full decoding [Kang et al. 2003]: it decodes one codec bit-stream and re-encodes it into the target codec bit-stream format. This is achieved by placing the decoder/encoder of one end point and the encoder/decoder of the other end point in a so-called gateway between the networks. Such a classical solution is far from optimum. The transcoding approach generally implies three major problems: increased computational load, decreased speech quality and intelligibility, and increased algorithmic delay.
An alternative approach to classical transcoding is to exploit the similarity of the speech
coders [Kang et al. 2003]. This method, also called smart transcoding, achieves interesting improvement over the classical transcoding approach by reducing the computational load and the
algorithmic delay. If there is no external impairment (background noise or/and acoustic echo),
the smart transcoding in addition provides better speech quality than standard transcoding
(see [Beaugeant, Taddei 2007] - [Ghenania 2005]).
External impairments such as background noise and/or acoustic echo generally impact the
speech quality and intelligibility in a mobile communication. The common way to reduce
their impact is to implement Acoustic Echo Cancellation (AEC) and/or Noise Reduction (NR)
algorithms or units in the mobile device [Eriksson 2006], [Beaugeant 1999]. This solution has failed to achieve optimal results for many years [Cotanis 2003]. Therefore, AEC and NR
units are now progressively being implemented in the network [Eriksson 2006]. There are
several reasons and advantages for such an integration of Voice Quality Enhancement (VQE)
algorithms in the network [Enzner et al. 2005].
First of all, implementation of VQE in the network is related to a desirable central network
quality control. Indeed, network providers have a high diversity of devices in their network,
leading to various levels of speech quality. At the same time, the quality of the mobile devices
has not been particularly enhanced over the last decade. Second, new challenges have appeared, such as device miniaturization, where speech quality has not been the main focus. For example, mobile terminals focus more on multimedia applications, whereas DECT phones concentrate more on price reduction. As a consequence, even if, from a technical viewpoint, voice enhancement a priori brings good results when solutions are implemented within the terminal, concrete industrial developments show that VQE placed in the network can in practice be as efficient in terms of speech quality as solutions built into the terminal. Furthermore, the low-complexity constraint of the terminal restricts the choice of algorithms and thus limits the system performance.
Transposition of VQE from terminal to network is a general trend and many network
providers have already deployed AEC and NR units, such as Tellabs 5500 (ref). The existing
systems are based on the analysis and processing of the decoded signal. This solution generally
introduces delay on the processed signal, as well as complexity due to decoding and re-encoding
into the same format. This solution also decreases the speech quality because of the codec
tandeming (decoding-encoding) effect. Especially for AEC, an important fact is that the performance of algorithms implemented inside the network can be severely disturbed by the non-linearity and the unpredictability of the effective acoustic echo path (see [Enzner et al. 2005], [Rages, Ho 2002], [Huang, Goubran 2000] and [Fermo et al. 2000]).
All these drawbacks lead to the idea that VQE could be made directly on the available
bit-stream [Chandran, Marchok 2000], also by integrating algorithms described in Chap. 3 (see
[Taddei et al. 2004], [Thepie et al. 2008]) and Chap. 4 ([Thepie et al. 2006]). Modifying the
parameters composing the bit-stream avoids the total decoding/encoding process necessary in
classical solutions. This new approach reduces the computational load, and the quality loss due to the tandeming effect is circumscribed. The purpose of this chapter is to integrate the VQE algorithms described in Chap. 3 and 4 inside a smart transcoding structure. It also discusses the performance obtained by integrating the AEC and NR units into a smart
transcoding algorithm. The organization of this chapter is as follows. Sec. 5.2 explains the problems due to network interoperability and VQE, and also presents former solutions. In Sec. 5.3, the smart transcoding principles and solutions are described. The integration of the VQE based on codec parameters into the smart transcoding module is discussed in Sec. 5.4 and 5.5, where an optimal architecture dedicated to the different AMR-NB coder modes is proposed. Experimental results are discussed in Sec. 5.6 and the chapter ends with a conclusion.
5.2 Network Interoperability Problems and Voice Quality Enhancement
To ensure mobility, continuity and interoperability, networks need to interoperate [Yoon et al. 2003]. Interoperation is usually achieved via transcoding. A typical diagram of wireless network interoperability can be seen in Fig. 5.1. Each GSM network involves three main components: the Mobile Device (MD), the Base Station Sub-system (BSS) and the Network Sub-system (NSS), generally called the core network. The GSM network is also connected to an Operation and Support System (OSS) for maintenance and control (see Annex A). The gateway is the point where a given network is interconnected with another network: connection of network A to network B, network A to a PLMN, network A to the PSTN, etc. The algorithms discussed in this chapter address solutions to high background noise and/or acoustic echo problems.
5.2.1 Classical Speech Transcoding Scenarios
Transcoding is a key element for network interconnection. Transcoding is the process of transforming a format representation A into a target format B. As presented in Fig. 5.2, if codecs A and B are different, the bit-stream of encoder A of the near-end speaker is first decoded through decoder A. The obtained decoded speech signal sA(n) is then encoded in the target format B by encoder B, giving bit-stream B. Bit-stream B is transmitted to decoder B at the far-end speaker side, where s′(n) is decoded via decoder B.
This transcoding solution implies three particular problems. First, the quality of the decoded speech signal at decoder B is degraded: encoder B uses the decoded signal from decoder A and re-encodes it, rather than using the original speech signal. The speech quality degradation is due to the succession of encoding/decoding operations, which accumulates quantization errors, twice in this example. Secondly, the computational load is increased, since two or more coders run simultaneously; this becomes all the more critical when a large number of users can be interconnected. The third impairment is the increase of the algorithmic delay: to obtain the target bit-stream B, an additional look-ahead delay for LPC analysis is necessary during the transcoding process.
Figure 5.1: Generic GSM Interconnection Architecture.

Figure 5.2: Transcoding, Classical Approach.
Such a process introduces delay into the communication, and it is well known that a long transmission delay in the communication link can be particularly annoying for the end-users.
5.2.2 Classical Speech Transcoding and Voice Quality Enhancement
Classical VQE solutions such as NR and/or AEC, implemented inside the network, can lead to transcoding effects. Indeed, with the classical approach, and in the presence of perturbations (noise and/or acoustic echo), the enhancement is entirely performed on speech samples in PCM format.
Current algorithms implemented in the network follow the principle described in Fig. 5.3. Let us consider a communication established between two wireless networks A and B. The corrupted speech signal is encoded by encoder A at the Mobile Device of the near-end speaker, and the corrupted bit-stream A is transmitted. Inside the network, the corrupted bit-stream A is decoded using decoder A and the decoded signal is sent to the VQE unit for speech enhancement. The output of the VQE unit is re-encoded by encoder B. At the far-end side, the VQE-processed bit-stream B is decoded and the enhanced speech signal is obtained.
Figure 5.3: Network VQE, Classical Solution.
The VQE unit (yellow box in Fig. 5.3) only deals with PCM speech samples. The algorithms deployed in such systems are those described in the state of the art of NR and AEC in Chap. 3 and 4 respectively. The network solution presented in Fig. 5.3 has a major disadvantage: this architecture cumulates the problems of the classical transcoding solution and those of VQE algorithms operating in PCM format, namely delay, computational load and quality degradation.
5.3 Alternative Approach: the Speech Smart Transcoding
The goal during interoperability is to translate a bit-stream of coder A into the format of coder B. Many research activities are being conducted on so-called smart transcoding techniques. If coders A and B are similar (CELP coders, for example, as in this thesis), they make use of similar sets of parameters. The transmitted CELP parameters are: the LPC coefficients, the pitch delay, the fixed codebook vector, and the fixed and adaptive codebook gains. In this thesis we restrict our attention to GSM networks and AMR-NB speech coders.
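To fix ideas, the per-frame parameter set manipulated throughout this chapter can be sketched as a small data structure. The following Python fragment is purely illustrative: the field layout is an assumption for readability and does not reproduce the exact AMR-NB bit allocation.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class CelpFrame:
    """Illustrative container for the CELP parameters of one 20 ms frame.

    AMR-NB splits each frame into four 5 ms sub-frames, so sub-frame
    parameters are stored as lists of four values.
    """
    lpc_sets: List[List[float]]   # quantized LPC sets (two in 12.2 kbps mode, one otherwise)
    pitch_delays: List[float]     # pitch delay, one per sub-frame
    fixed_indices: List[int]      # fixed codebook vector index, one per sub-frame
    fixed_gains: List[float]      # fixed codebook gain g_f, one per sub-frame
    adaptive_gains: List[float]   # adaptive codebook gain g_a, one per sub-frame
```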
Figure 5.4: Smart Transcoding Principle.
5.3.1 The Speech Smart Transcoding Principle and Strategies
The key idea in smart transcoding consists in avoiding the re-computation, at the target encoder B, of some parameters or groups of parameters. The difference between the parameters at encoder A and at encoder B is mostly related to the way these parameters are computed [Ghenania, Lamblin 2004]. Typically, the parameter estimation method, the resolution of the estimation and the quantization technique differentiate the parameters of codec A from those of codec B when both use the CELP technique. Based on these observations, up to three smart transcoding strategies can be applied to the relevant parameters.
– First, smart transcoding performed at the binary stage. The coded parameter is directly transferred from bit-stream A to bit-stream B; of course, a mapping operation is needed. This approach is suitable if the parameter at encoders A and B differs only by the quantization technique. It has already been applied to the pitch delay ([Tsai, Yang 2001], [Yasuji et al. 2002] and [Kang et al. 2000]). In [Yasuji et al. 2002] and [Kang et al. 2000], smart transcoding was applied to the fixed codebook index by restricting the search of the fixed codebook vector at encoder B.
– Second, in case the parameter computed at encoder A does not correspond to that of encoder B, a partial decoding of the relevant parameter is required. Partial decoding at decoder A corresponds to a decoding process which does not require PCM signal reconstruction. As an example, for the LPC coefficients, the decoding process can be performed up to the LSF representation. The LSF parameters from decoder A are then mapped into encoder B [Ghenania, Lamblin 2004].
– The last smart transcoding scheme addresses the case where the formats of the parameters at encoders A and B are different. In this type of smart transcoding, also called the parameter approach, a tandem is necessary and the speech signal is totally decoded at decoder A. What is specific to this approach is that, during the encoding at encoder B, the functions assigned to compute the relevant parameters are either not performed or only partially performed. The parameters decoded at decoder A are directly used as input to the quantization units of encoder B. This smart transcoding strategy was experimented in [Kim et al. 2001] and [Yoon et al. 2001] by mapping the fixed codebook index and gain. The mapping of the relevant parameters can also be followed by the suppression of redundant pre-processing and post-processing functions; as an example, the high-pass filtering performed in the AMR encoder can be suppressed. The computational load can thus be significantly reduced, cf. [Beaugeant, Taddei 2007].
The parameter based smart transcoding approach is retained in this thesis. In a noisy environment and/or in the presence of acoustic echo, the parameters decoded at decoder A are corrupted and need to be enhanced. As described in Fig. 5.4, the third smart transcoding approach is performed: at decoder A, the relevant parameters are extracted; at encoder B, the encoding process is not totally performed, some of the functions assigned to the computation of the relevant parameters being skipped. In general, a parameter extracted from decoder A is first modified such that it matches the characteristics of the corresponding parameter at encoder B. The smart transcoding is thus achieved by intelligently mapping the parameters available from bit-stream A into those of the bit-stream of encoder B. During our simulations, the C floating point platform of the AMR-NB is used [3GPP 1999b].
This work addresses solutions to transcode between different AMR-NB speech codec modes, especially between the 12.2 kbps mode and the 7.4 kbps mode and vice versa.
Before investigating transcoding in the presence of background noise and acoustic echo, it is useful to study the impact of mapping a parameter or a group of parameters on speech quality and intelligibility. This study helps us design the mapping functions used during our smart transcoding. In this part of the work, experiments were conducted using clean speech signals, with AMR-NB at 7.4 kbps mode as coder A and at 12.2 kbps mode as coder B, and vice versa. Similarly to the smart transcoding scheme of [Beaugeant, Taddei 2007], and based on several experiments, we focus on three codec parameters in this work: the LPC coefficients, the fixed codebook gain and the adaptive codebook gain. We should note that a parameter extracted at the decoder A side is the quantized version of the parameter computed at encoder A. Therefore, the parameters used in the smart transcoding algorithm
are the quantized versions, denoted ÃS, g̃f,S and g̃a,S, of the LPC coefficients, the fixed codebook gain and the adaptive codebook gain respectively.
5.3.2 Mapping Strategy of the LPC Coefficients
As described in Chap. 2 (cf. [3GPP 1999b]), on the encoder side, the LPC coefficients are computed twice per frame (every 10 ms) in 12.2 kbps mode (two sets) and only once per frame (every 20 ms) in the other modes (one set). They are then converted into the Line Spectral Frequency (LSF) representation. The LSFs are interpolated to provide four sets, one per sub-frame. These interpolated LSFs are then converted back to LPC, giving four sets of quantized LPC coefficients, one for each sub-frame. The LPC analysis window of the 12.2 kbps mode allows the computation of LPC coefficients localized at sub-frames 2 and 4; no look-ahead is required. In the other modes, the analysis window is concentrated at sub-frame 4, leading to the computation of only one set of LPC coefficients, and a 5 ms look-ahead is used. The LPC coefficients are extracted from decoder A and a mapping procedure is applied before their introduction into encoder B. These LPC coefficients are then directly used; no further computation is needed.
At decoder A or decoder B, four sets of quantized LPC coefficients are decoded in each frame: $\tilde{A}_{S,\mathrm{Decoder}}^{k}$, where the index $k = 1, 2, 3, 4$ denotes the sub-frame number. According to the transcoding in place, the mapping inside encoder B of the LPC coefficients extracted from decoder A follows the equations below. For transcoding from 12.2 kbps mode to 7.4 kbps mode, only one set of LPC coefficients is computed at the 7.4 kbps mode encoder each frame, leading to the following mapping at frame $p$:

$$(A_S)_{\mathrm{Encoder}}(p) = \tilde{A}_{S,\mathrm{Decoder}}^{4}(p) \qquad (5.1)$$

During transcoding from 7.4 kbps mode to 12.2 kbps mode, two sets of LPC coefficients are computed in the 12.2 kbps mode encoder each frame. The mapping is given by:

$$(A_S)^{k}_{\mathrm{Encoder}}(p) = \tilde{A}_{S,\mathrm{Decoder}}^{2k}(p), \quad k = 1, 2 \qquad (5.2)$$
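Equations (5.1) and (5.2) translate directly into the following sketch, assuming the four quantized LPC sets decoded at decoder A for frame p are available as a list indexed by sub-frame (hypothetical helper code, not the reference AMR-NB implementation):

```python
def map_lpc_sets(decoded_lpc, target_mode_kbps):
    """Map the LPC sets decoded at decoder A into encoder B (Eq. 5.1 / 5.2).

    decoded_lpc: list of four LPC coefficient sets, decoded_lpc[k-1] being
    the set of sub-frame k (k = 1..4) of the current frame.
    target_mode_kbps: bit rate of encoder B (12.2 or 7.4 here).
    """
    if target_mode_kbps == 7.4:
        # Eq. (5.1): the single LPC set of the 7.4 kbps encoder comes
        # from sub-frame 4 of decoder A.
        return [decoded_lpc[3]]
    if target_mode_kbps == 12.2:
        # Eq. (5.2): the two LPC sets of the 12.2 kbps encoder come from
        # sub-frames 2k = 2 and 4 of decoder A.
        return [decoded_lpc[1], decoded_lpc[3]]
    raise ValueError("transcoding only studied between 7.4 and 12.2 kbps modes")
```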
In Fig. 5.5 and 5.6, the spectra of the synthesis filters associated with the LPC coefficients are represented. The curve in blue is the original spectrum of the unquantized synthesis filter, the curve in black is obtained after smart transcoding, and the curve in red corresponds to the standard transcoding. On a frame basis, the spectrum obtained by smart transcoding presents the best approximation of the original spectrum. The spectrum obtained from the classical transcoding has been slightly amplified at low frequencies. We clearly see in Fig. 5.6 how the spectrum obtained with classical transcoding has been significantly distorted: there is a formant loss at high frequency with the standard transcoding approach. Formant degradation can induce speech distortion and quality degradation. The spectrum of the synthesis filter associated with the LPC coefficients obtained from the smart transcoding is very close to that of the original speech signal; the formants are preserved by the smart transcoding.
Figure 5.5: Transcoding Example from 7.4 kbps mode to 12.2 kbps mode: Spectrum of the associated synthesis filters (PSD in dB versus frequency in Hz).
5.3.3 Mapping Strategy of the Fixed and Adaptive Codebook Gains
For each frame, four adaptive gains are decoded in 12.2 kbps mode or 7.4 kbps mode. The adaptive gain from decoder A is simply mapped inside encoder B as follows:

$$(g_{a,S})_{\mathrm{Encoder}}(m) = (\tilde{g}_{a,S})_{\mathrm{Decoder}}(m) \qquad (5.3)$$

where $m$ is the sub-frame index.
Figure 5.6: Transcoding Example from 12.2 kbps mode to 7.4 kbps mode: Spectrum of the associated synthesis filters (PSD in dB versus frequency in Hz).

Figure 5.7: Adaptive Gains in Transcoding, typical example during transcoding from 7.4 kbps mode to 12.2 kbps mode (signal level versus time; adaptive gain amplitude versus sub-frames).

From the mapping of the adaptive codebook gain presented in Fig. 5.7, we can notice that the adaptive gains from the smart transcoding are less disturbed than those of the classical transcoding. It is clearly beneficial to perform a smart transcoding on this parameter, instead of transmitting the adaptive gains from the classical transcoding. We can see that the adaptive gains obtained from the classical transcoding (in red) fluctuate noticeably but always stay around the original adaptive gains (in black).
The estimation of the pitch delay, used together with the adaptive gain to compute the adaptive excitation in CELP coders, is particularly robust. In practice there is no significant difference between the pitch from smart transcoding and that from standard transcoding. The interest of the gain obtained from smart transcoding is that it improves the adaptive excitation.
The fixed codebook gain plays a major role in the decoded speech amplitude, see Fig. 5.8: the shape of the curve of the fixed codebook gains matches the shape of the speech signal amplitude. We have noticed experimentally that, at decoder B, the fixed codebook gain is delayed compared to the original fixed codebook gain, and this delay impacts the speech quality. Based on several experiments ([ITU-T 2001]) and informal listening tests, the optimal mapping of the fixed codebook gain is achieved by:

$$(g_{f,S})_{\mathrm{Encoder}}(m) = (\tilde{g}_{f,S})_{\mathrm{Decoder}}(m - 1) \qquad (5.4)$$

where $m$ is the sub-frame index.
This mapping takes into account the delay (one sub-frame) introduced by the 'encoding-decoding-encoding-decoding' chain.
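The gain mappings of Eq. (5.3) and (5.4) reduce to the sketch below, where the one sub-frame delay of the fixed gain is realized by keeping the last decoded value of the previous frame in memory (illustrative code only):

```python
def map_gains(adaptive_gains, fixed_gains, prev_fixed_gain):
    """Map the decoded codebook gains of one frame into encoder B.

    adaptive_gains, fixed_gains: four decoded sub-frame gains of the
    current frame. prev_fixed_gain: last fixed gain of the previous frame,
    used to implement the one sub-frame delay of Eq. (5.4).
    Returns the mapped gains and the updated fixed-gain memory.
    """
    # Eq. (5.3): the adaptive gain is copied unchanged, sub-frame by sub-frame.
    mapped_adaptive = list(adaptive_gains)
    # Eq. (5.4): the fixed gain is delayed by one sub-frame.
    mapped_fixed = [prev_fixed_gain] + list(fixed_gains[:-1])
    return mapped_adaptive, mapped_fixed, fixed_gains[-1]
```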
In Fig. 5.8 and Fig. 5.9, one can see that, in the case of a transcoding from 7.4 kbps mode to 12.2 kbps mode, the fixed codebook gains have been strongly attenuated. This results from the fact that the transcoding is performed from low quality (7.4 kbps mode) to very high quality (12.2 kbps mode). The reduction of the fixed gain level in classical transcoding is more noticeable during speech periods. As an example, in Fig. 5.8 from 1.4 s to 1.8 s, the amount of fixed gain attenuation in classical transcoding is particularly high. The transcoding from 12.2 kbps mode to 7.4 kbps mode is rather different, see Fig. 5.10: the fixed codebook gain from the classical transcoding is not attenuated.
In Fig. 5.10, we find no amplitude attenuation as in the 7.4 kbps to 12.2 kbps transcoding mode. The classical transcoding gains tend to be higher than those of the smart transcoding in the presence of high speech energy. The smart transcoding of the fixed and adaptive codebook gains does not particularly impact the computational load. Simulations with informal listening tests show that the mapping of the gains (fixed and adaptive) significantly improves the decoded speech quality in transcoding from 7.4 kbps mode to 12.2 kbps mode. Another remark is that the smart transcoding of the gains has an impact only if encoder B is in 12.2 kbps mode. This is due to the quantization technique used: the quantization in the other modes does not offer the possibility to directly replace the gains. The fixed gain and the adaptive gain are
distinctly quantized according to the codec mode: the gains are separately quantized in 12.2 kbps mode, whereas in the other modes they are jointly quantized. The joint quantization in those modes is based on Moving-Average prediction, which makes it rather difficult to replace the parameters directly; further studies are needed.

Figure 5.8: Typical Example of Decoded Fixed Codebook Gains during transcoding from 7.4 kbps mode to 12.2 kbps mode (speech signal and fixed codebook gain amplitudes versus sub-frames).

Figure 5.9: Fixed Codebook Gains mapping, transcoding from 7.4 kbps mode to 12.2 kbps mode (gain amplitude versus sub-frames).

Figure 5.10: Fixed Codebook Gains mapping, transcoding from 12.2 kbps mode to 7.4 kbps mode (gain amplitude versus sub-frames).
The results obtained by these smart transcoding algorithms can be interpreted as follows: the classical transcoding approach needs to synthesize speech to re-compute the LPC coefficients, and these recomputed coefficients diverge from the original ones. The synthesized speech at decoder B has been distorted twice by successive quantization, in the first encoding-decoding chain and in the second encoding-decoding chain.
5.4 Network Voice Quality Enhancement and Smart Transcoding
In the presence of external impairments such as high background noise and/or acoustic echo, the CELP coder alone does not deliver a clean speech signal [Eriksson 2006]. This is due to the fact that the transmitted parameters are contaminated by noise and/or by acoustic echo. It has been demonstrated in Chap. 3 and 4 that noise reduction and/or acoustic echo cancellation can be performed by directly modifying the coded parameters. This idea allows us to directly integrate our proposed algorithms in smart transcoding: the smart transcoding is performed simultaneously with the coded domain speech enhancement.
The general overview of the VQE unit embedded in the smart transcoding process is depicted in Fig. 5.11. This approach may be used as an alternative to classical transcoding: the scheme converts the bit-stream from compression format A to compression format B without fully decoding to PCM and then re-encoding the signal.
The complete VQE unit is now located in the network. At the near-end speaker side, the Mobile Device encodes the corrupted speech with encoder A. The corrupted bit-stream is transmitted over the network. At the transcoding stage, the corrupted bit-stream is decoded by decoder A. In parallel, the relevant corrupted parameters are extracted: the LPC coefficients, the fixed and the adaptive codebook gains. The parameters needed to estimate the noise and/or acoustic echo parameters are also buffered. The relevant parameters are sent to the VQE unit for enhancement and modification (see the yellow box in Fig. 5.11). During the encoding at encoder B, the enhanced parameters are mapped and used as input of the quantization units of encoder B. There is no need to re-compute the LPC coefficients, the fixed codebook gains and the adaptive codebook gains. The implementation of such an approach involves three main advantages in comparison with classical transcoding combined with frequency or time domain speech enhancement.
Figure 5.11: Structure of the Codec Domain VQE Embedded in Smart Transcoding.

– The reduction of the computational load: considering the process encoder A to decoder A, decoder A to encoder B and encoder B to decoder B, the overall computational load
is reduced. There is no need to compute, inside the AMR encoder B, the LPC coefficients, the fixed and the adaptive codebook gains. In addition, the enhancement algorithms use the codec parameters instead of the PCM speech samples, yielding a reduction of the computational load. Finally, according to [Beaugeant, Taddei 2007], up to 27% of computational load reduction can be achieved. For example, with noise reduction, the proposed algorithms process, on each frame, four fixed codebook gains and only one or two sets of LPC coefficients. In the traditional frequency domain approach, by contrast, the PCM samples of the corrupted speech frame are first converted using the Fourier transform, noise reduction is performed on each frequency bin, and the enhanced signal is then transformed back into PCM format for encoding.
– The enhancement of the decoded speech quality: it has been demonstrated that avoiding transcoding reduces speech distortion. Furthermore, the corrupted parameters (the LPC coefficients and the fixed codebook gain) in the presence of background noise and/or acoustic echo are replaced by their enhanced versions. It has also been verified in [Enzner et al. 2005] and [Rages, Ho 2002] that the performance of the standard AEC is strongly reduced inside the network. This degradation is due to the non-linearity introduced by the coder, created by the speech excitation approximation and quantization; indeed, the mobile speech signal recovered inside the network has passed through two speech codecs. The parameter domain algorithms avoid this non-linearity and unpredictability since they are directly applied to the coded parameters. No estimation of the linear acoustic echo path is required with these approaches.
– With such an implementation, the algorithmic delay is reduced. Many functions are skipped during the decoding and the encoding. A delay reduction compared to classical transcoding also appears when the look-ahead due to LPC windowing is avoided: a gain of 5 ms can be achieved, as the proposed system does not require the same amount of processing functions. Some classical VQE algorithms are based on time or frequency domain transforms, which introduce additional delay (typically 5 to 10 ms). The coded domain VQE integrated into smart transcoding does not require any transform; therefore, the delay can be substantially reduced.
5.5 The Proposed Architecture
In [Gnaba et al. 2003], the authors studied the integration of the CELP structures of the 12.2 kbps coder in the acoustic echo canceller for the GSM network. Smart transcoding was not taken into account, but the authors intended to minimize the quantization noise introduced by the coders. We propose in this section a transcoding architecture between the 12.2 kbps and 7.4 kbps modes. Several experiments as well as informal listening tests were conducted to detect the impact of each parameter on the speech quality, see Sec. 5.2.1. We tested several architectures and found during the simulations that the architecture described in Fig. 5.12 achieves a good compromise in terms of speech enhancement quality and objective tests. Due to the symmetry of the communication link, only one communication way (from the near-end speaker A to the far-end speaker B) is presented in Fig. 5.12; the opposite way (from the far-end speaker B to the near-end speaker A) is handled identically.
Figure 5.12: Proposed Architecture.
Fig. 5.12 presents a scheme of a smart transcoding strategy between two GSM networks. If bit-stream A is from the 7.4 kbps mode, then bit-stream B is from the 12.2 kbps mode, and vice versa. In the presence of perturbations (background noise and/or acoustic echo), during the decoding process at decoder A, the corrupted LPC coefficients and the corrupted fixed and adaptive codebook gains are first extracted. The corrupted speech signal is then totally decoded and the needed
parameters are recorded. These parameters are the CELP parameters used during the estimation of the clean speech parameters.
The LPC coefficients and the fixed codebook gain are processed inside the VQE unit developed in Chap. 3 and 4. The adaptive codebook gain remains unchanged. During the re-encoding of the decoded noisy signal, the enhanced LPC coefficients, the enhanced fixed codebook gain and the adaptive codebook gain are directly mapped inside encoder B. This procedure makes it possible to by-pass the functions used to compute these parameters, as shown with the yellow boxes in Fig. 5.12.
The fixed and adaptive gains are computed for each sub-frame. These gains are individually quantized in 12.2 kbps mode and jointly quantized in the other modes [3GPP 1999b]. Due to this constraint, as shown in Fig. 5.12, the modified fixed codebook gain is mapped in both decoder A and encoder B. During our investigations and tests, we saw that the performance varies significantly when the enhanced fixed codebook gain is also re-introduced inside decoder A, showing that the fixed codebook gain is a key parameter to achieve a good VQE system.
Two results are achieved with this proposed architecture. First, if encoder B is in 12.2 kbps mode, then the coded domain speech enhancement is performed both at decoder A and at encoder B. Second, if encoder B is in a mode other than 12.2 kbps, then the speech enhancement takes place only at decoder A, since the mapping of the fixed codebook gain inside a target encoder in a mode other than 12.2 kbps has minimal effect. We also found during simulations and informal listening tests that, to achieve interesting results, it is suitable to map the fixed codebook gain in encoder B and decoder A simultaneously. The enhanced bit-stream is finally sent to user B at Mobile Device B, where the enhanced speech signal is synthesized by decoder B. The proposed architecture can easily be applied to other mode transcoding algorithms if their characteristics are taken into account.
5.5.1 Noise Reduction Integrated in the Smart Transcoding Algorithm
In the presence of high background noise, the speech coder, especially a CELP coder, computes LPC coefficients and a fixed codebook gain that are affected by noise. The speech quality provided by decoder B after classical transcoding is decreased. We propose in this section to integrate, inside the smart transcoding algorithm, the noise reduction algorithms discussed in Chap. 3. Those algorithms are used to enhance the corrupted and quantized parameters extracted from the AMR decoder A. The VQE unit shown in Fig. 5.13 is the combination of the LPC NR and the fixed gain filtering.
Figure 5.13: Flowchart of the Noise Reduction in Smart Transcoding.

The mapping is applied to three groups of parameters: the enhanced LPC coefficients ÂS, the enhanced fixed codebook gain ĝf,S and the noisy adaptive codebook gain ga,Y. The quantized adaptive codebook gain from AMR decoder A is used as input to the adaptive gain
quantization unit in AMR encoder B. As there is no need to perform the LPC analysis, it is skipped at encoder B: the enhanced LPC coefficients are directly used to compute the LSF parameters at AMR encoder B. The enhanced fixed codebook gain ĝf,S is also used as input to the gain quantization module in AMR encoder B.
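Functionally, this integration amounts to re-routing the enhanced parameters to the quantization stages of encoder B while skipping the corresponding analysis functions. The following sketch assumes hypothetical quantizer interfaces (lpc_to_lsf, quantize_lsf, quantize_gains) standing in for the actual AMR-NB modules:

```python
def encode_frame_with_nr(encoder_b, enhanced_lpc_sets,
                         enhanced_fixed_gains, noisy_adaptive_gains):
    """Feed NR-enhanced parameters into encoder B (interfaces are hypothetical).

    The LPC analysis of encoder B is skipped: the enhanced LPC sets are
    converted to LSF and quantized directly. The enhanced fixed gains and
    the unmodified adaptive gains go straight to the gain quantization units.
    """
    # No Levinson-Durbin analysis: enhanced LPC -> LSF -> quantization.
    lsf_sets = [encoder_b.lpc_to_lsf(a) for a in enhanced_lpc_sets]
    lpc_bits = encoder_b.quantize_lsf(lsf_sets)
    # Per sub-frame gain quantization with the mapped gains as inputs.
    gain_bits = [encoder_b.quantize_gains(g_f, g_a)
                 for g_f, g_a in zip(enhanced_fixed_gains, noisy_adaptive_gains)]
    return lpc_bits, gain_bits
```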
5.5.2 Acoustic Echo Cancellation Integrated in the Smart Transcoding Algorithm
If the perturbation is an acoustic echo, only the fixed codebook gain is modified. The smart transcoding algorithm is identical to that of the NR. The corrupted LPC coefficients and the adaptive codebook gain from decoder A are mapped inside encoder B. The VQE module used to modify the fixed codebook gain can either be the Gain Loss Control or the filtering of the fixed codebook gain. Access to both directions of the communication is required, since parameters from both the microphone signal y(n) and the loudspeaker signal x(n) are needed.
The Gain Loss Control Integrated in the Smart Transcoding Algorithm
In this section, the GLC is integrated in the smart transcoding. The fixed codebook gains of the microphone signal gf,Y and of the loudspeaker signal gf,X are needed by the GLC module. The adaptive codebook gains are also used by the GLC module during the estimation of the signal energies. The GLC unit computes the attenuation gains ay(m) and ax(m), which are used to weight the microphone fixed codebook gain and the loudspeaker fixed codebook gain respectively. As shown in Fig. 5.14, the enhanced microphone fixed codebook gain ay(m) · gf,Y is mapped in encoder B and decoder A. The enhanced loudspeaker fixed codebook gain ax(m) · gf,X is mapped in encoder A and decoder B.
Figure 5.14: Overview of the Gain Loss Control Integrated in Smart Transcoding.
The process is terminated by encoding the PCM samples (yA(n) and xB(n)) and mapping the corresponding LPC coefficients and adaptive gains: (AY(m), ga,Y(m)) for the microphone path and (AX(m), ga,X(m)) for the loudspeaker path. The resulting bit-stream at the far-end speaker side leads to a decoded speech in which the acoustic echo has been attenuated or cancelled.
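As an illustration of the GLC principle, the following sketch attenuates, for each sub-frame, the fixed codebook gain of the path with the lower estimated activity. This is a minimal sketch with a crude gain-based energy proxy and a fixed attenuation factor; the actual decision logic and attenuation computation are those developed in Chap. 4:

```python
def gain_loss_control(gf_y, ga_y, gf_x, ga_x, loss=0.1):
    """Illustrative GLC decision for one sub-frame (not the Chap. 4 algorithm).

    gf_y, ga_y: microphone fixed and adaptive codebook gains.
    gf_x, ga_x: loudspeaker fixed and adaptive codebook gains.
    Returns the weighted fixed gains a_y * gf_Y and a_x * gf_X.
    """
    mic_activity = gf_y**2 + ga_y**2      # crude energy proxy built from the gains
    spk_activity = gf_x**2 + ga_x**2
    if spk_activity > mic_activity:
        a_y, a_x = loss, 1.0              # far-end dominates: attenuate the microphone path
    else:
        a_y, a_x = 1.0, loss              # near-end dominates: attenuate the loudspeaker path
    return a_y * gf_y, a_x * gf_x
```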
Fixed Gain Filtering Integrated in the Smart Transcoding Algorithm
The architecture of the filtering of the fixed codebook gain for AEC is similar to that of the GLC. The acoustic echo cancellation is considered to be performed only on the microphone signal y(n). In this technique, the microphone fixed codebook gain is replaced by an estimate of the clean speech fixed codebook gain, computed as G(m) · gf,Y. The enhanced fixed codebook gain is mapped both in encoder B and in decoder A, as described in Fig. 5.15. The LPC coefficients and adaptive codebook gains are not enhanced, but are directly mapped inside encoder B during the encoding of the decoded signal yA(n).

Figure 5.15: Filtering of the Fixed Codebook Gain Integrated in Smart Transcoding.
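A minimal sketch of this fixed gain filtering is given below, assuming a Wiener-like attenuation rule built from an estimate of the echo contribution to the microphone fixed gain; this rule is only an illustrative assumption, the actual filter G(m) being the one derived in Chap. 4:

```python
def filter_fixed_gain(gf_y, gf_echo_estimate, gain_floor=0.1):
    """Illustrative filtering of the microphone fixed codebook gain for AEC.

    gf_y: corrupted microphone fixed codebook gain of sub-frame m.
    gf_echo_estimate: assumed estimate of the echo contribution to gf_y,
    typically derived from the loudspeaker parameters.
    Returns G(m) * gf_Y, an estimate of the clean speech fixed gain.
    """
    # Wiener-like gain in the fixed codebook gain domain, lower-bounded to
    # limit near-end speech distortion during double talk.
    g = max(gain_floor, 1.0 - gf_echo_estimate / max(gf_y, 1e-12))
    return g * gf_y
```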
5.6 Experimental Results
In this section, we present the overall and detailed performance of our proposed architecture through several objective tests. We first analyze the computational load and the delay improvement of the smart transcoding algorithm. Then, we compute various objective measurements to compare our approach with former or classical ones. The proposed AEC algorithms are compared to the standard NLMS algorithm [Haykin 2002b], especially the one implemented
in [Beaugeant et al. 2006]. The proposed NR system is compared to the NR based on the
classical Wiener Filter approach, see [Beaugeant 1999].
5.6.1 Overall Computational Load and Algorithmic Delay
As demonstrated in [Beaugeant, Taddei 2007], the mapping of the LPC coefficients results in skipping the Levinson-Durbin function, giving a complexity reduction of about 20% of the entire encoding process at encoder B. Besides, if the down- and up-scaling at decoder A and encoder B are skipped, as well as the low-pass and high-pass filtering at encoder B, the proposed smart transcoding algorithm yields another complexity reduction of about 7%. These two simplifications permit a total computational load reduction of approximately 27% in comparison with the classical transcoding approach.
In smart transcoding from 12.2 kbps mode to 7.4 kbps mode, the LPC analysis is avoided. The look-ahead at the 7.4 kbps encoder due to the windowing is also skipped. Accordingly, this smart transcoding scheme provides a delay decrease of 5 ms. Conversely, during smart transcoding from 7.4 kbps mode to 12.2 kbps mode, there is no delay reduction, as the LPC analysis window of the 12.2 kbps mode encoder has no look-ahead. During noise reduction, we process four fixed codebook gains and one or two sets of LPC coefficients per frame. The LPC vector in practice has M + 1 components, A = [1, a1, · · · , aM]; the first coefficient is not modified, so only M or 2M LPC coefficients are used by the algorithms. For the AEC algorithm, four fixed codebook gains are used per frame. The coded domain VQE does not require any transformation such as an STFT or a filter-bank decomposition. In comparison to classical approaches such as the Wiener filter or the NLMS, coded domain VQE involves a low complexity.
5.6.2 Overall Voice Quality Improvement
The objective performance of the network VQE embedded in smart transcoding can be summarized as follows:
– The proposed noise reduction system tends to have objective performance similar to that of the classical Wiener approach. In transcoding from 7.4 kbps mode to 12.2 kbps mode, the performance of the proposed NR is above that of the standard Wiener filter.
– The proposed AEC algorithms are clearly better than the classical NLMS algorithm. With the GLC or the FFG, we obtain an ERLE ([Hansler, Schmidt 2004]) about 40 dB higher than with the classical NLMS. The coded domain network AEC discussed in this thesis is suitable for network VQE.
The statements above are supported by several experiments, described in Sec. 5.6.3 and 5.6.4.
5.6.3 Noise Reduction
In order to check the behavior and interest of our codec parameter domain network NR system (LPC coefficients and fixed codebook gain modification), we compare the objective measures provided by the classical NR algorithm (Wiener method in the frequency domain), as described in Chap. 3, Sec. 3.2.2, with those of our system. This is because the LPC coefficients enhancement and the fixed codebook gain modification are in fact complementary: the former impacts the spectral representation of the noisy signal, while the latter impacts the noisy signal amplitude, i.e. the signal to noise ratio. The objective measurement tool used in this work ([ITU-T 2006a] and [Knappe, Goubran 1994]) is a further development of a tool included in the 3GPP GSM Technical Specification 06.77 [3GPP-GSM 1999] and the TIA TR45 Specification TIA/EIA/IS/853 [3GPP 1999a].
5.6.3.1 The ITU-T Objective Measurement Standard for GSM Noise Reduction
One of the main targets of a NR system is to maintain the power level of the speech signal, so as not to affect the level of the speech signal together with the background noise signal. The ITU-T objective measurement in [ITU-T 2006a] presents in detail objective instruments characterizing the effect of a NR method. These objective instruments are the metrics defining the Signal to Noise Ratio Improvement (SNRI) and the Total Noise Level Reduction (TNLR). The SNRI is measured during speech activity and determines the impact of noise on the speech signal at the end of the processing. The TNLR estimates the level of noise reduction both during speech and non-speech periods. The analysis is completed by the computation of a Delta Measurement (DSN), which determines the balance between SNRI and TNLR. The DSN metric reveals a possible speech attenuation or an undesired speech amplification caused by the NR algorithm. The objective performance of the proposed NR system is evaluated as an average over all test conditions in dB overload (dBov), based on [ITU-T 2006d]. According to the recommendation in [ITU-T 2006a], the metrics should satisfy:

$$\mathrm{SNRI} \geq 4~\mathrm{dB}, \quad \mathrm{TNLR} \leq -5~\mathrm{dB}, \quad -4~\mathrm{dB} \leq \mathrm{DSN} \leq 3~\mathrm{dB} \qquad (5.5)$$
The Delta Measurement also indicates whether the effects of a proposed noise reduction algorithm match the recommendation, and thus the interest of the noise reduction system. A more detailed description of these objective measures is presented in Appendix B.
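The acceptance rule of Eq. (5.5) is trivially expressed in code; the following sketch simply checks averaged metric values against the recommended thresholds:

```python
def meets_itu_nr_targets(snri_dbov, tnlr_dbov, dsn_dbov):
    """Check the averaged NR metrics against the targets of Eq. (5.5).

    snri_dbov: Signal to Noise Ratio Improvement.
    tnlr_dbov: Total Noise Level Reduction (negative when noise is reduced).
    dsn_dbov:  Delta Measurement balancing SNRI and TNLR.
    """
    return (snri_dbov >= 4.0 and
            tnlr_dbov <= -5.0 and
            -4.0 <= dsn_dbov <= 3.0)
```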
5.6.3.2 Noise Reduction: Simulation Results
Objective Measurements
This section presents the objective performance of our proposed NR system when applied to two types of smart transcoding scheme: 7.4 kbps to 12.2 kbps and 12.2 kbps to 7.4 kbps. Clean speech signals were constructed from 24 utterances: 6 utterances from each of 4 speakers (2 female and 2 male). As for the noise signals, three sequences were considered: two car noises and one street noise. In total, 216 noisy files were generated, covering the background noise types and Signal-to-Noise Ratio conditions of 6 dB, 12 dB and 18 dB.
Metrics       | 7.4 kbps → 12.2 kbps           | 12.2 kbps → 7.4 kbps
              | Codec Domain | Wiener Method   | Codec Domain | Wiener Method
DSN (dBov)    |   −0.4415    |    0.1323       |    0.0061    |    0.1082
SNRI (dBov)   |   11.9968    |    9.9732       |    8.8064    |    9.6539
TNLR (dBov)   |  −12.4124    |   −9.9205       |   −8.7060    |   −9.9772

Table 5.1: Total Average of the Objective Metrics.
The overall observation, according to the results of Table 5.1, is that there is an obvious benefit in using our proposed method, as well as the Wiener method: the objective measurements match the recommendation requirements. In particular, the new approach proposed in this thesis reduces the noise level (TNLR = −12.4124 dBov) and amplifies the signal (SNRI ≈ 12 dBov) better than the Wiener method (TNLR = −9.9205 dBov and SNRI ≈ 10 dBov) in smart transcoding from 7.4 kbps mode to 12.2 kbps mode. This result is explained by the fact that, in this smart transcoding mode, the noise reduction is performed both in the decoder of the 7.4 kbps mode and in the encoder of the 12.2 kbps mode. The enhanced fixed
codebook gain is mapped both in the decoder in 7.4 kbps mode and in the encoder in 12.2 kbps mode.
In transcoding from 12.2 kbps mode to 7.4 kbps mode, the Wiener method reduces the noise level and amplifies the signal slightly more than our proposed method. This is explained by the fact that, in this transcoding mode, the mapping of the modified fixed codebook gain has minimal effect: as previously indicated, the fixed and the adaptive gains are jointly quantized in the 7.4 kbps mode, and our smart transcoding strategy does not modify the quantization.
Our proposed method is influenced by the AMR-NB modes: the gain mapping has a very low effect in smart transcoding from 12.2 kbps mode to 7.4 kbps mode. Based on this observation, we found during several experiments that the noise reduction algorithm is more efficient if the modified fixed gain is also mapped inside decoder A. As a consequence, the noise reduction is performed only at decoder A in this smart transcoding mode. The principle of the dual mapping of the fixed gain (inside encoder B and decoder A) is shown in Fig. 5.13 by the operation in red.
During smart transcoding from 12.2 kbps mode to 7.4 kbps mode, our proposed method achieves an interesting balance between the possible total noise reduction level and the possible speech amplification. In this smart transcoding scheme, the Delta Measurement is particularly low: DSN = 0.0061 dBov. Another remark is that the Wiener method generally provides fairly consistent results, independently of the AMR modes. This is due to the fact that the Wiener method analyzes the completely decoded speech signal, whereas our proposed method only uses coded parameters to perform noise reduction.
We can also establish that the objective measurements depend on the Signal-to-Noise Ratio level. Fig. 5.16 and 5.17 present the objective measurements when transcoding from 12.2 kbps mode to 7.4 kbps mode and from 7.4 kbps mode to 12.2 kbps mode respectively. On the one hand, the total SNRI decreases when the segmental SNR increases; on the other hand, the total TNLR increases with the segmental SNR. The delta measurement (DSN) is more difficult to evaluate, as this metric depends on the levels of the SNRI and the TNLR. In these simulations, the DSN measure lies between −1 and +1 dBov, revealing a good balance between the SNR improvement and the total noise level reduction. This DSN metric also reveals that the possible speech level reduction is well compensated by the possible speech amplification. The outcomes of the proposed noise reduction algorithms are confirmed in Fig. 5.16 and 5.17: the performance of the proposed noise reduction algorithm, in comparison with the classical Wiener method, depends on the transcoding strategy.
If the transcoding is performed from the 12.2 kbps mode to the 7.4 kbps mode, the Wiener approach achieves a signal amplification about 2 dBov higher than the coded domain approach. This 2 dBov gap decreases to less than 1 dBov when the segmental SNR moves to 18 dB. The remarks are the same if the total noise reduction level (TNLR) is analyzed.
Figure 5.16: Objective Metrics versus Segmental SNR. Transcoding from 12.2 kbps mode to 7.4 kbps mode. Proposed NR method (blue dashed circle), Wiener NR method (red dashed diamond). Panels: (A) DSN (dBov), (B) SNRI (dBov), (C) TNLR (dBov).
If the transcoding is performed from 7.4 kbps mode to 12.2 kbps mode, then our method performs better than the classical Wiener approach, see Fig. 5.17. The coded domain approach increases the SNR (SNRI) by around 2 dBov more than the classical Wiener method, and the total noise attenuation (TNLR) is almost 2 dBov greater than that of the classical Wiener approach.
Figure 5.17: Objective Metrics versus Segmental SNR. Transcoding from 7.4 kbps mode to 12.2 kbps mode. Proposed NR method (blue dashed circle), Wiener NR method (red dashed diamond). Panels: (A) DSN (dBov), (B) SNRI (dBov), (C) TNLR (dBov).
Spectrograms
As the noise attenuation and the speech amplification alone cannot describe the speech enhancement, we analyze in the following the spectrograms of the processed and unprocessed noisy speech signals.
Fig. 5.18 represents the spectrogram of a speech signal corrupted by car noise at a segmental SNR of 6 dB. The spectrogram allows a joint time and frequency analysis of the speech signal: the vertical axis stands for frequency while the horizontal axis is the time axis, and the signal amplitude is proportional to the darkness of the picture. The spectrogram in Fig. 5.19 represents the noisy speech processed by the standard Wiener filter. In Fig. 5.20 and Fig. 5.21, the spectrograms of the speech processed by the coded domain noise reduction when transcoding from 12.2 kbps
mode to 7.4 kbps mode and from 7.4 kbps mode to 12.2 kbps mode respectively are presented. One can see the effectiveness of the coded domain noise reduction in the interval 3 s to 5 s. The noise is more attenuated by the transcoding from 7.4 kbps mode to 12.2 kbps mode in Fig. 5.21 than by the transcoding from 12.2 kbps mode to 7.4 kbps mode in Fig. 5.20. This attenuation is due to the effect of the fixed codebook gain mapping inside the encoder in 12.2 kbps mode. The shape of the speech signal is relatively unmodified in transcoding from 12.2 kbps mode to 7.4 kbps mode, and slightly distorted in transcoding from 7.4 kbps mode to 12.2 kbps mode. These distortions in speech periods (1 s to 3 s and 5 s to 7 s) appear mainly during speech transitions and at the end of speech bursts. In these specific areas, SNR ≈ 0 dB, which leads to a poor estimation of the LPC coefficients.
Figure 5.18: Spectrogram of the Noisy Speech Signal: 6 dB Segmental SNR.
5.6.4 Acoustic Echo Cancellation
The corrupted files used in this section are those already tested in Chap. 4. Three different filters h1, h2 and h3 were considered to simulate the acoustic echo. The files were built such that each contains an echo-only period, a Single Talk period of the near-end speaker and a Double Talk period. The overall Signal-to-Echo Ratio (SER), computed during Double Talk periods, was respectively 11 dB, 15 dB and 17 dB.
Figure 5.19: Spectrogram of the Noisy Speech Enhanced With the Standard Wiener Filter.

In periods of echo only (remote single talk), the Echo Return Loss Enhancement (ERLE)
[Rages, Ho 2002] is a suitable performance criterion for a given AEC algorithm. The ERLE characterizes the ratio of the energy of the original echo to the energy of the residual echo: if the echo energy is completely attenuated, the acoustic echo will not be noticeable. In the following, we compare the ERLE of the classical NLMS to that of the coded domain processing (cf. Chap. 4). The overall ERLE is computed by averaging the ERLE values over all frames for each scenario:

$$\mathrm{ERLE} = \frac{1}{C(N_p)} \sum_{\ell=1}^{C(N_p)} \mathrm{ERLE}(\ell) \qquad (5.6)$$

where $C(N_p)$ represents the total number of frames and $N_p$ the frame length. For each file, $\mathrm{ERLE}(\ell)$ is computed as follows:

$$\mathrm{ERLE}(\ell) = -10 \cdot \log_{10}\left(\frac{\sum_{n=1}^{N_p} \hat{s}(\ell N_p + n)^2}{\sum_{n=1}^{N_p} y(\ell N_p + n)^2}\right) \qquad (5.7)$$
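Equations (5.6) and (5.7) correspond to the following straightforward computation (a minimal sketch with NumPy; signal alignment and frame segmentation conventions are assumed):

```python
import numpy as np

def average_erle_db(enhanced, microphone, frame_len):
    """Average ERLE over all frames, following Eq. (5.6) and (5.7).

    enhanced:   enhanced signal s_hat(n) containing the residual echo.
    microphone: original corrupted microphone signal y(n).
    frame_len:  frame length Np in samples.
    """
    n_frames = min(len(enhanced), len(microphone)) // frame_len
    erle_values = []
    for l in range(n_frames):
        seg = slice(l * frame_len, (l + 1) * frame_len)
        residual_energy = np.sum(enhanced[seg] ** 2) + 1e-12
        original_energy = np.sum(microphone[seg] ** 2) + 1e-12
        # Eq. (5.7): per-frame ERLE in dB.
        erle_values.append(-10.0 * np.log10(residual_energy / original_energy))
    # Eq. (5.6): average over the C(Np) frames.
    return float(np.mean(erle_values))
```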
5.6.5 Simulation Results
The overall ERLE measurements are presented in Tab. 5.2. The simulation environment
corresponds to a typical simulated GSM network. The results presented in that table characterize the mean ERLE over all test conditions, as well as the ERLE for each filter used. The reported values are the ERLE of the corrupted files enhanced with the Gain Loss Control (GLC), with the Filtering of the Fixed Gain (FFG) algorithm and with the standard Normalized Least-Mean Square (NLMS). ERLEref represents the ERLE obtained with the microphone signal when there is no echo.

Figure 5.20: Spectrogram of Coded Domain Enhancement: Transcoding from 12.2 kbps mode to 7.4 kbps mode.

Figure 5.21: Spectrogram of Coded Domain Enhancement: Transcoding from 7.4 kbps mode to 12.2 kbps mode.
Echo Only Periods
During echo-only periods, the ERLE is appropriate to characterize the performance of the AEC algorithms. The NLMS implemented inside the network is a linear AEC, and this linear AEC is not efficient enough to decrease the acoustic echo level; the voice quality is not improved. The NLMS produces an ERLE below 10 dB, and this performance only improves somewhat in transcoding from 12.2 kbps mode to 7.4 kbps mode. The overall ERLE results show that, independently of the SER, the coded domain AEC integrated in smart transcoding (FFG) achieves the ERLE required for 12.2 kbps networks, that is to say 45 dB [ITU-T 2006b]. The ERLE is particularly high (up to 50 dB) during transcoding from 7.4 kbps mode to 12.2 kbps mode. This result is justified by the fact that, in this transcoding, the acoustic echo cancellation is performed both at the decoder of the 7.4 kbps mode and in the encoder of the 12.2 kbps mode, see Tab. 5.2 (a). The behavior observed with the NLMS was already reported in [Huang, Goubran 2000]: the vocoder effect is responsible for the low performance of classical AEC approaches such as the NLMS.
The ERLE with the GLC algorithm increases when the SER increases in transcoding from 12.2 kbps mode to 7.4 kbps mode. In contrast, the ERLE of the classical NLMS decreases when the SER increases. The ERLE produced by the FFG algorithm tends to depend on the type of acoustic echo path.
Double Talk Periods
The GLC method computes and applies attenuation coefficients to both the microphone and the loudspeaker fixed codebook gains. At high SER, the perceived echo signal level is reduced. This effect is verified by the values of the ERLE: 12 dB in transcoding from 12.2 kbps mode to 7.4 kbps mode and 9 dB in transcoding from 7.4 kbps mode to 12.2 kbps mode.
The FFG shows good improvements during Double Talk periods: the acoustic echo level is reduced in those periods. As shown in Tab. 5.2 (a) and (b), the overall ERLE is about 9 dB when transcoding from 7.4 kbps mode to 12.2 kbps mode, and around 15 dB in the transcoding from 12.2 kbps mode to the 7.4 kbps mode. The NLMS did not perform any significant enhancement of the microphone signal in these double talk periods. The ERLE
is low: 2.7 dB during transcoding from 12.2 kbps mode to 7.4 kbps mode and only 1.5 dB
when transcoding from 7.4 kbps mode to 12.2 kbps mode.
Near-End Speaker Single Talk
The ERLE in Single Talk periods of the near-end speaker represents the distortion introduced by the AEC system. The overall results show that the FFG and the GLC introduce approximately the same distortion to the decoded speech. The NLMS algorithm behaves well in these periods: its ERLE is below 1 dB when transcoding from 7.4 kbps mode to 12.2 kbps mode, and about 1 dB in the transcoding from 12.2 kbps mode to 7.4 kbps mode.
Typical Evolution of the ERLE
Fig. 5.22 and Fig. 5.23 show a typical comparative evolution of the ERLE of the GLC, the FFG and the NLMS. These examples present processings where the acoustic echo was simulated using the impulse response h1 (11 dB SER). The time representation of the microphone signal y(n) from t = 0 s to t = 3 s corresponds to the echo-only period, or Single Talk of the far-end speaker. The single talk of the near-end speaker goes from t = 3 s to t = 6 s. Finally, the Double Talk period occurs from t = 6 s to t = 10 s.
During echo-only periods, the FFG technique (curve in blue) provides the highest ERLE. On average, the ERLE with the FFG algorithm is about 35 dB to 45 dB higher than with the standard NLMS when transcoding from 12.2 kbps mode to 7.4 kbps mode. In this transcoding, the ERLE during single talk of the near-end speaker is particularly low, leading to less distortion of the enhanced speech. As shown in Fig. 5.23, during transcoding from 7.4 kbps mode to 12.2 kbps mode, the FFG algorithm enhances much more than the GLC and the NLMS: the FFG ERLE is in this case 30 dB to 40 dB higher than that of the classical NLMS and 20 dB to 30 dB higher than that of the GLC.
5.7 Conclusion
In this chapter, we have demonstrated a smart transcoding algorithm applied to the LPC coefficients, the fixed codebook gain and the adaptive codebook gain. The coded domain VQE unit was embedded inside the smart transcoding strategies. The VQE algorithms were achieved by modifying the fixed codebook gain and the LPC coefficients. The effectiveness of the NR and/or AEC improvement was verified through several experiments and objective evaluations. The possibility to jointly perform AEC and NR inside a smart transcoding strategy was also explored.
Figure 5.22: Time Evolution of the ERLE: from 12.2 kbps mode to 7.4 kbps mode, case of filter h1 (signals y(n) and z(n) versus time; FFG, standard NLMS and GLC ERLE in dB versus frames).
Figure 5.23: Time Evolution of the ERLE: from 7.4 kbps mode to 12.2 kbps mode, case of filter h1 (signals y(n) and z(n) versus time; FFG, standard NLMS and GLC ERLE in dB versus frames).
Coded domain NR (modification of the fixed codebook gain and of the LPC coefficients) integrated inside smart transcoding provides objective results approximately similar to those of the standard Wiener filter. If smart transcoding is performed from 7.4 kbps mode to 12.2 kbps mode, our proposed method achieves objective results better than the classical Wiener filter method. The performance of our proposed NR system depends on the coders used in the transcoding modes.
With regard to the AEC, the Gain Loss Control achieves a good improvement, especially during single talk of the near-end speaker. This coded domain GLC algorithm is simple, and the requirement on GSM network AEC (an ERLE of at least 45 dB) is achieved. The filtering of the fixed codebook gain as an AEC algorithm, also called FFG in this thesis, provides the best objective results among all the simulations. The FFG is especially effective in echo-only periods, as shown in the ERLE tables (ERLE better than 45 dB). During Double Talk periods, the FFG reduces the acoustic echo effect while keeping the near-end speech understandable. In comparison, the classical NLMS introduces high distortion on the enhanced speech signal in double talk periods: at high SER levels, the useful speech becomes inaudible. Informal listening tests show that the Gain Loss Control slightly attenuates both the microphone and the loudspeaker signals during double talk.
Another critical aspect of AEC implemented in the network is that the performance of the classical solution (NLMS) is drastically reduced. The ERLE results show that the ERLE values of the classical approach (NLMS) are in general below 10 dB in echo-only periods. The NLMS used during our simulations is affected by the non-linearity and the unpredictability introduced by the speech codecs.
This chapter also points out the interest of such an implementation in terms of computational load and delay. Several processing functions during the second encoding are skipped. The AEC is performed by enhancing a single parameter, the fixed codebook gain; the NR is achieved by modifying two parameters, the fixed codebook gain and the set of LPC coefficients. The proposed system does not require any transformation, such as the FFT: the parameters are extracted and directly modified or filtered before the mapping. The algorithmic delay in such an approach is slightly reduced, especially if encoder B is not in the AMR 12.2 kbps mode.
(a) ERLE values during transcoding from AMR 7.4 kbps mode to AMR 12.2 kbps mode

Filter          AEC algorithm  s(n) single talk  Double talk  Echo-only  Total ERLE
h1 (11 dB SER)  ERLE ref       2.4329            4.3700       57.8149    20.4490
                NLMS           0.6011            2.7468        8.3659     3.8144
                GLC            1.9584            9.5187       44.7805    18.1106
                FFG            1.2982            8.5758       55.7542    20.9807
h2 (15 dB SER)  ERLE ref       2.6239            3.6132       49.7853    17.7191
                NLMS           0.8968            0.0032        4.5665     1.7150
                GLC            1.0704            6.0136       36.1725    13.8502
                FFG            0.2114            6.1459       39.2706    14.5912
h3 (17 dB SER)  ERLE ref       2.4288            3.7193       52.0316    18.3972
                NLMS           0.5456            1.2582        5.3884     2.3137
                GLC            4.1010           13.6213       51.3593    22.3873
                FFG            0.7930           10.6313       49.7075    19.6809
Overall         ERLE ref       2.4791            3.9368       53.6388    18.9952
processing      NLMS           0.6542            1.5027        6.2920     2.7268
                GLC            3.6475           12.8212       40.9326    18.6600
                FFG            0.8371            8.7354       49.3658    18.8959

(b) ERLE values during transcoding from AMR 12.2 kbps mode to AMR 7.4 kbps mode

Filter          AEC algorithm  s(n) single talk  Double talk  Echo-only  Total ERLE
h1 (11 dB SER)  ERLE ref       3.1036            4.9666       58.9073    21.2192
                NLMS           1.1332            3.0465       11.7010     5.1368
                GLC            1.6522            7.2345       21.5751     9.9251
                FFG            2.1057           13.1562       49.3460    20.9194
h2 (15 dB SER)  ERLE ref       3.0776            8.0276       53.9084    20.7731
                NLMS           1.2798            2.9885        9.1269     4.3584
                GLC            3.0964           16.4988       44.4998    20.9502
                FFG            2.4219           17.0227       40.3211    19.6211
h3 (17 dB SER)  ERLE ref       2.9290            5.4576       53.2751    19.5841
                NLMS           0.8912            1.9921        9.0305     3.8379
                GLC            2.7448           12.2837       49.0305    20.7054
                FFG            1.8950           15.4311       40.7390    18.9985
Overall         ERLE ref       3.0367            6.1506       55.3636    20.5255
processing      NLMS           1.1014            2.6757        9.9529     4.4443
                GLC            2.4978           12.0057       38.3684    17.1936
                FFG            2.1409           15.2033       43.4687    19.8464

Table 5.2: Echo Return Loss Enhancement Values (dB).
Chapter 6
General Conclusion
6.1 Context
External impairments such as high background noise and acoustic echo during communication reduce speech quality and intelligibility. The solutions to such problems (Noise Reduction (NR) and/or Acoustic Echo Cancellation (AEC) algorithms) are generally implemented in the Mobile Device. However, network operators have started deploying Voice Quality Enhancement (VQE) solutions directly inside the network. This transposition from Mobile Devices to the network is motivated by several advantages: in particular, centralized network quality control is achieved with such an implementation. In the Mobile Device, the low-complexity constraint due to the power supply restricts the choice of algorithms and thus limits the system performance.
In contrast, there is no such complexity restriction inside the network. With standard network-based VQE, the corrupted speech signal must be decoded, enhanced and then re-encoded. In practice, classical approaches for acoustic echo control, based on linear methods, can be drastically affected by the presence of CELP-based speech codecs in the communication chain. The drawbacks of standard approaches are the increased computational load and delay and the quality degradation.
The development of new networks generally comes with the deployment of new speech codecs which are not interoperable with existing ones. Bit-rate conversion via standard transcoding is necessary for network interoperability: the source bit-stream is decoded and re-encoded to produce the target bit-stream. This standard transcoding always degrades speech quality and can introduce additional delay. An alternative approach, called smart transcoding, exploits the similarity between speech codecs to perform the network interconnection. In a smart transcoding scheme, the parameters of the source coder are directly mapped into the target codec.
The common thread between the classical NR and AEC algorithms and transcoding is that the processing is carried out on the signal in PCM format. Many codecs deployed in modern digital networks follow the legacy of CELP. The 3GPP AMR-NB speech codec standard is used in this thesis as the platform for the experiments and applications. The challenge in this thesis is to maintain the Quality of Service (QoS) at low computational cost in modern digital networks.
6.2 Thesis Contribution
According to the context of this thesis as detailed in Chap. 1, this work is based on two key concepts:
– VQE algorithms that deal exclusively with the CELP codec parameters (coded domain VQE algorithms) and that are implemented inside the network.
– Integration of the coded domain VQE algorithms inside a smart transcoding strategy.
Four algorithms have been developed during this thesis. Two algorithms are dedicated to acoustic echo cancellation via the processing of the fixed codebook gain of the CELP codec. Two other algorithms have been developed for noise reduction: the first NR algorithm modifies the fixed codebook gain, and the second enhances the spectral characteristics of the speech through modification of the LPC coefficients. These concepts were implemented after the physical meaning of the CELP parameters was examined in Chap. 2.
Noise Reduction
A NR algorithm based on the filtering of the fixed codebook gain of the CELP codec has been proposed in Sec. 3.4 of Chap. 3. In each sub-frame, a filter is computed and applied to the noisy speech fixed codebook gain. The result of this filtering process is an estimate of the clean speech fixed codebook gain, which then replaces the current noisy speech fixed gain inside the bit-stream. The filter is Wiener-based. The a priori and a posteriori Signal-to-Noise Ratios have also been transposed into the codec parameter domain, their estimation being based on an extrapolation and transposition of the Ephraim and Malah rule. The Minimum Statistics method, generally implemented in the frequency domain to estimate the noise PSD, has been transposed and used to estimate the fixed codebook gain of the noise. The perception of the noise amplitude is improved thanks to this approach.
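To make the processing chain concrete, here is a minimal sketch of this sub-frame filtering, assuming the noisy fixed codebook gains have already been decoded from the bit-stream. The smoothing constants, the minimum-tracking window `win` and the gain floor are illustrative assumptions, not the exact rules derived in Chap. 3:

```python
import numpy as np

def nr_filter_fixed_gains(g_noisy, win=40, alpha=0.98, floor=0.1):
    """Sub-frame NR on decoded fixed codebook gains (illustrative sketch).

    The noise gain is tracked via the minimum of a smoothed power estimate
    (minimum-statistics idea); a decision-directed a priori SNR then drives
    a Wiener-like attenuation of each gain. All constants are illustrative."""
    g_noisy = np.asarray(g_noisy, dtype=float)
    g_hat = np.empty_like(g_noisy)
    smoothed = g_noisy[0] ** 2          # short-term power of the noisy gain
    prev_clean = smoothed
    history = []
    for n, g in enumerate(g_noisy):
        smoothed = 0.9 * smoothed + 0.1 * g ** 2
        history.append(smoothed)
        history = history[-win:]
        noise_pow = max(min(history), 1e-12)    # minimum-statistics estimate
        gamma = g ** 2 / noise_pow              # a posteriori SNR
        xi = alpha * prev_clean / noise_pow + (1 - alpha) * max(gamma - 1.0, 0.0)
        H = max(xi / (1.0 + xi), floor)         # Wiener-like gain with a floor
        g_hat[n] = H * g                        # enhanced fixed codebook gain
        prev_clean = g_hat[n] ** 2
    return g_hat
```

The enhanced gain would then be re-quantized with the codec's gain tables and written back into the bit-stream.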
A NR based on enhancing the spectral characteristics of the noisy speech signal has been presented in Sec. 3.5 of Chap. 3. This new algorithm has been implemented by modifying the corrupted LPC coefficients. The contribution of this section has been published at ITG-SCC 2008 [Thepie et al. 2008]. Two methods have been adopted, based on the voice activity detection integrated in AMR-NB. If no speech is present, the noisy LPC coefficients are damped; the damping is performed proportionally to the noise signal amplitude. In the presence of both noise and useful speech, a modification function is necessary. This function has been obtained by exploiting the LPC analysis procedure of the noisy speech signal, and is designed as the relation between the LPC coefficients of the useful speech, the noisy speech and the noise signal. As only the noisy speech parameters are available, it is necessary to estimate the noise autocorrelation matrix, the noise LPC coefficients, the clean speech autocorrelation matrix and the clean speech LPC coefficients. The Inverse Recursive Levinson-Durbin algorithm has been introduced to estimate the noise autocorrelation coefficients, and thus the autocorrelation matrix. The modification function introduced in this NR algorithm is a useful estimator of the clean speech LPC coefficients, especially in voiced speech segments.
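As an illustration of the noise-only branch, the sketch below damps the LPC coefficients with a bandwidth-expansion-style factor; the mapping from the noise level to the damping factor `gamma` is an illustrative assumption, not the rule derived in Sec. 3.5:

```python
import numpy as np

def damp_lpc(a, noise_level, gamma_min=0.90):
    """Damping of the LPC coefficients for a noise-only frame (sketch).

    a: array [1, a(1), ..., a(M)]. The coefficients are scaled by powers of
    a factor gamma < 1 (bandwidth-expansion style), which flattens the
    spectral envelope. Mapping noise_level (0..1) to gamma is an
    illustrative assumption, not the thesis rule."""
    gamma = 1.0 - (1.0 - gamma_min) * float(np.clip(noise_level, 0.0, 1.0))
    a = np.asarray(a, dtype=float)
    return a * gamma ** np.arange(a.size)   # a_k <- gamma^k * a_k, a_0 stays 1
```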
Acoustic Echo Cancellation
The first AEC discussed in this thesis appears in Sec. 4.5 of Chap. 4. This algorithm is the transposition of the Gain Loss Control, generally implemented in the time domain. The idea consists in computing two attenuation gains to be applied to the microphone and loudspeaker fixed gains. In the CELP parameter domain, the fixed codebook gain has been considered a good representation of the signal energy; the energies of the microphone and loudspeaker signals are therefore estimated based only on the fixed codebook gains and the adaptive gains. The metric used to compute the attenuation gains is the ratio between the microphone energy and the loudspeaker energy. Finally, the microphone and loudspeaker signals are distinctly attenuated according to the ratio of their long-term estimated energies. This proposed algorithm has no double talk detection, but its performance is very promising, as attenuation is performed even during double talk periods.
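A minimal sketch of this principle follows, assuming the long-term energies have already been estimated from the codebook gains; the gain-sharing rule and the floor are illustrative choices, not the exact GLC rule of Sec. 4.5:

```python
def glc_attenuations(e_mic, e_ls, floor=0.1):
    """Coded-domain Gain Loss Control attenuations (illustrative sketch).

    e_mic, e_ls: long-term energy estimates of the microphone and
    loudspeaker paths, built from the fixed and adaptive codebook gains.
    The returned factors would be applied to the corresponding fixed gains;
    the sharing rule and the floor are illustrative assumptions."""
    ratio = e_mic / max(e_ls, 1e-12)
    w = ratio / (1.0 + ratio)       # close to 1 when the near end dominates
    att_mic = max(floor, w)         # attenuate the inactive direction more
    att_ls = max(floor, 1.0 - w)
    return att_mic, att_ls
```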
The second AEC algorithm, proposed in Sec. 4.6 of Chap. 4, is a more complex approach based on the filtering of the fixed codebook gain of the microphone signal. The filter derives from an extrapolation of the standard Wiener filter and is built as a function of the Signal-to-Echo Ratio (SER) in the coded domain. The SER is computed in the parameter domain using a transposition of the Ephraim and Malah rule, as in noise reduction. Discrimination between echo presence and double talk periods has been introduced to design the filter applied to the fixed codebook gain. Echo presence and double talk periods are detected by analyzing the loudspeaker energy and the normalized cross-correlation function between the microphone and loudspeaker fixed codebook gains. The echo signal fixed codebook gain is estimated in each sub-frame by assuming that it is an attenuated and shifted version of the current loudspeaker fixed codebook gain. The sub-frame shift is derived from the maximum of the normalized cross-correlation, and the attenuation coefficient is computed using the ratio between the shifted loudspeaker fixed gain and the current microphone fixed codebook gain. The sub-frame shift and the attenuation coefficient are updated only if echo is detected. This coded domain AEC approach provides listening test results similar to those of the NLMS. This contribution has been presented at IWAENC 2006 [Thepie et al. 2006].
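The sketch below illustrates these steps for one sub-frame, under simplifying assumptions: the gain buffers, the `state` dictionary (initialized, e.g., to `{'shift': 0, 'atten': 0.5}`) and all constants are illustrative, and the detector driving the `echo_only` flag is assumed to exist elsewhere:

```python
import numpy as np

def ffg_subframe(g_mic, ls_gains, mic_gains, echo_only, state, max_shift=8):
    """One sub-frame of the FFG principle (illustrative sketch).

    ls_gains / mic_gains: recent loudspeaker and microphone fixed codebook
    gains, most recent last. The echo fixed gain is modelled as an
    attenuated, shifted copy of the loudspeaker gain; shift and attenuation
    are refreshed only in echo-only periods, then a Wiener-like gain built
    from the estimated SER filters the microphone gain."""
    ls = np.asarray(ls_gains, dtype=float)
    m = np.asarray(mic_gains, dtype=float)
    if echo_only and ls.size >= m.size + max_shift:
        best = -1.0
        for d in range(max_shift):          # candidate sub-frame shifts
            seg = ls[ls.size - d - m.size : ls.size - d]
            c = np.dot(m, seg) / max(np.linalg.norm(m) * np.linalg.norm(seg), 1e-12)
            if c > best:                    # maximum normalized cross-correlation
                best, state['shift'] = c, d
        state['atten'] = g_mic / max(ls[ls.size - 1 - state['shift']], 1e-12)
    g_echo = state['atten'] * ls[ls.size - 1 - state['shift']]
    # SER via an echo-power subtraction (illustrative), then Wiener-like gain
    ser = max(g_mic ** 2 - g_echo ** 2, 0.0) / max(g_echo ** 2, 1e-12)
    return (ser / (1.0 + ser)) * g_mic      # filtered fixed codebook gain
```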
Network Voice Quality Enhancement and Interconnection Solution
Smart transcoding refers to the ability to provide an effective and, quality-wise, transparent way to map parameters between two speech coders. In Chap. 5, a smart transcoding strategy has been implemented between the 12.2 kbps and 7.4 kbps modes of the 3GPP AMR-NB. After several experiments and informal listening tests, the smart transcoding algorithm proposed in this thesis is applied to the LPC coefficients and the fixed and adaptive codebook gains. This thesis thus proposes in Chap. 5 the integration of the VQE algorithms developed above inside network or smart transcoding algorithms. The overall improvement is as follows:
Computational Load Reduction: The retained smart transcoding strategy, as developed in this thesis, brings a computational load reduction of about 27 % compared to standard transcoding between the AMR-NB modes. This complexity gain is due to the fact that the computations of the LPC coefficients and of the fixed and adaptive gains are not performed at the target encoder.
Delay Reduction: The delay with this smart transcoding strategy is reduced only if the target encoder is not the 12.2 kbps mode. The look-ahead needed for LPC analysis is skipped, leading to a delay reduction of 5 ms. The processing delay with this approach may be particularly low: the proposed NR and AEC algorithms enhance only two sets of parameters (the LPC coefficients and the fixed codebook gain) in each sub-frame. In comparison with the standard approach, no transform such as the STFT is required.
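The parameter flow can be summarized by the following sketch; the `CelpParams` layout and function names are illustrative stand-ins, not the AMR-NB reference code API, and the re-quantization toward the target mode's tables and the fixed-codebook re-search are not shown:

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

@dataclass
class CelpParams:
    """Per-frame CELP parameters (simplified, illustrative layout)."""
    lpc: List[float]          # LPC (or LSF) coefficients
    pitch_lags: List[int]     # adaptive codebook indices per sub-frame
    fixed_idx: List[int]      # fixed codebook (pulse) indices per sub-frame
    g_fixed: List[float]      # fixed codebook gains per sub-frame
    g_adaptive: List[float]   # adaptive codebook gains per sub-frame

def smart_transcode(p: CelpParams,
                    nr: Optional[Callable] = None,
                    aec: Optional[Callable] = None) -> CelpParams:
    """Sketch of the retained smart transcoding flow: LPC coefficients and
    both gains are mapped directly, after optional coded-domain VQE, so the
    target encoder skips their computation."""
    if nr:                                  # NR: LPC + fixed gain
        p.lpc, p.g_fixed = nr(p.lpc, p.g_fixed)
    if aec:                                 # AEC: fixed gain only
        p.g_fixed = aec(p.g_fixed)
    return CelpParams(p.lpc, p.pitch_lags, p.fixed_idx, p.g_fixed, p.g_adaptive)
```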
The TIA TR45 specification TIA/EIA IS-853 objective measurement has been adopted to evaluate the performance of the proposed NR algorithm (combination of the fixed codebook gain and LPC coefficient modifications). The metrics used are the Total Noise Level Reduction (TNLR) and the Signal-to-Noise Ratio Improvement (SNRI). An additional metric (DSN) is computed to determine the balance between possible over-attenuation of the useful signal and amplification of the noise. The proposed noise reduction algorithm integrated inside the smart transcoding strategy has been compared to the classical Wiener filter approach. The overall conclusion is that our proposed algorithm performs better than the standard Wiener filter NR when transcoding from the AMR 7.4 kbps mode to the AMR 12.2 kbps mode. When transcoding from the 12.2 kbps mode to the AMR 7.4 kbps mode, the proposed NR algorithm performs similarly to the classical Wiener filter. The DSN measures reveal a good balance between the SNR improvement and the noise reduction: our solution yields −0.5 ≤ DSN ≤ 0.5.
The performance of an AEC algorithm located inside the network has been evaluated in this thesis using the Echo Return Loss Enhancement (ERLE). Our proposed algorithms (Gain Loss Control (GLC) and Filtering of the Fixed Gain (FFG)) have been tested against the standard NLMS algorithm located inside a GSM network. It has been demonstrated in [Huang, Goubran 2000] that the presence of CELP speech coders degrades the ERLE of AEC algorithms. The 45 dB ERLE required in GSM has been achieved with both the GLC and the FFG during echo-only periods, whereas the standard NLMS ERLE is in general below 10 dB. Based on the ERLE measures, the NLMS method has minimal effect during double talk periods, with an ERLE of less than 3 dB. The proposed algorithms, with their respective strategies, reduce the acoustic echo effects during double talk periods: the ERLE in double talk is about 15 dB with the FFG and up to 12 dB with the GLC. The ERLE performance of the proposed coded domain AEC when transcoding from the 12.2 kbps mode to the AMR 7.4 kbps mode is very similar to that obtained when transcoding from the AMR 7.4 kbps mode to the AMR 12.2 kbps mode.
This thesis has proposed a low-cost solution to centralized VQE, in particular to network NR, network AEC and network interoperability problems. The solutions developed during this thesis are efficient, not only because the computational demand is low, but also because the objective performances are interesting. Such an approach behaves well since the resources required for filtering with classical algorithms are no longer a limiting factor.
1. The NR performance achieved in this thesis is close to that of the standard Wiener approach. The advantage of our NR is that the computational load and the delay are significantly reduced compared to standard techniques.
2. The performance obtained with our proposed AEC located inside the network is better than that of classical adaptive approaches. The performance degradation due to the non-linearities introduced on the acoustic echo path by the CELP codecs is avoided.
3. The problem of interconnecting two incompatible networks is also reduced, through the decrease in computational load and delay: there is no need to compute the LPC coefficients and the fixed and adaptive gains during the transcoding.
6.2.1 Perspectives
The concepts and the algorithms proposed in this thesis can be improved in several directions. The first perspective, which appears critical for this work, is the algorithmic evaluation of the complexity (exact total number of multiplications and additions) of the proposed NR and AEC algorithms. This evaluation will enable a global complexity comparison with standard techniques. Appropriate tools for quality evaluation are also necessary for the new parameter domain algorithms; the explanation and interpretation of the results obtained with these algorithms need to be specifically defined.
This work can also be extended by enhancing the techniques used to estimate the unknown parameters. For example, the VAD decision used in this thesis is based on the analysis of PCM samples. The development of a VAD operating directly on the CELP coded parameters would be very beneficial to the NR algorithm based on modification of the LPC coefficients.
The techniques implemented in this thesis combine extrapolations/transpositions of standard filtering with many estimation tools (Minimum Statistics, Inverse Recursive Levinson-Durbin, VAD, normalized cross-correlation, LPC analysis, short-term and long-term energy estimation). An interesting perspective would be to study, in the CELP parameter domain, the interaction between the filtering process and the estimation techniques used. The algorithms proposed in this thesis have a broad range of applicability: the principle of modifying the LPC coefficients and the fixed codebook gain can be extended to any kind of CELP-based codec, principally to transcoding between AMR-WideBand modes and from AMR-WideBand modes to AMR-NarrowBand modes. The critical case is transcoding from AMR-NarrowBand to AMR-WideBand, where a bandwidth extension is necessary [ITU-T 2003].
It would also be useful to investigate other CELP parameters that could be modified or filtered. The fixed codebook gain and the LPC coefficients are modified in this thesis; the process could be applied to the LSF coefficients or to the fixed codebook vector. It would also be interesting to investigate different smart transcoding strategies. Mapping the entire set of CELP parameters, instead of only the three parameters proposed in this thesis, would further reduce the complexity of the system, since the target encoder would no longer need to compute these parameters.
A centralized dual processing, parameter domain NR and AEC inside the network, can be suggested as another line of research. The interaction between centralized CELP parameter domain NR and AEC can also be studied, as the fixed codebook gain is modified by both the NR and the AEC algorithms. Another perspective in the same direction is to study the possibility for the network operator to control both the acoustic echo and the line echo through the CELP parameters.
We can conclude that, with the multiplication of networks and codecs, the need for delay reduction and complexity minimization, the constraints of real-time processing, and the problem of the non-linearities introduced by CELP coders on the acoustic echo path, centralized VQE and network interconnection solutions based on coded parameters could become inescapable in the future.
Appendix A
GSM Network and Interconnection
A.1 GSM Networks Architecture
Fig. A.1 below describes a generic interconnection architecture between two GSM networks A and B [Cotanis 2003]. This presentation does not detail the GSM architecture: the protocols and other specific aspects of the GSM network are not needed in this work. The GSM network can be summarized in three main components: the Mobile Device (MD), the Base Station Sub-System (BSS) and the Core Network (also called Network Sub-System (NSS)).
The MD, or Terminal, connects a user to the GSM network over the air interface. The analog-to-digital conversion and the speech coding are performed in the MD. The different types of terminal are distinguished by their output power and their application. An MD is associated with a Subscriber Identity Module (SIM) card, which works in conjunction with the network Authentication Centre (AUC).
The BSS, also called the Radio Access Network, controls the radio link between the MD and the network. The BSS comprises the Base Transceiver Station (BTS), the Base Station Controller (BSC) and the Transcoder and Adaptation Unit (TRAU). The BTS is made of antennas or transceivers (TRX) and electronic equipment. Operations such as coding, encryption, modulation, synchronization and multiplexing are performed at the BTS side. The BSC is used to translate the 12.2 kbps voice onto a 64 kbps channel. Several BTSs are generally connected to one BSC, which joins the MSC via the TRAU. The TRAU handles the conversion between the speech codec bit-stream and PCM, and is an ideal place for transcoding during the communication. The operations performed near the BSC are frequency hopping, time and frequency synchronization, power management and time delay measurement.
The main component of the Core Network or NSS is the Mobile Services Switching Center (MSC). Any user of a given GSM network is registered with an MSC and stored inside one particular database called the Home Location Register (HLR). All calls from and to the user are controlled by the MSC. A GSM network can have one or more MSCs, geographically distributed, which synchronize the BSS. Another component is the Gateway Mobile Switching Center (GMSC). The GMSC controls the mobile-terminating calls and the routing between the GSM network and other networks. The Gateway can also be considered a potential area for transcoding, since this part of the network is the gate to other networks. The GMSC is the interconnection point from one GSM network to another GSM network, or from one GSM network to other networks such as a PLMN or the PSTN.
[Figure A.1: Generic GSM Interconnection Architecture — two GSM networks (A) and (B), each comprising Mobile Devices, a Base Station Sub-System (BTS + BSC + TRAU), a Network Sub-System (MSC + GMSC, with HLR, EIR, AUC and VLR) and an OSS, interconnected through their GMSCs and linked to the PSTN, PLMN or other networks.]
The Visitor Location Register (VLR) contains the information from a subscriber's HLR necessary to provide the subscribed services to visiting users. When a subscriber enters the coverage area of a new MSC, the VLR associated with this MSC requests information about the new subscriber from its corresponding HLR. The VLR then has enough data to provide the subscribed services without asking the HLR each time a communication is established. The VLR is always implemented together with an MSC; thus, the area under control of the MSC is also the area under control of the VLR.
The Authentication Centre (AuC) serves security purposes. The AuC provides the parameters needed for the authentication and encryption functions; these parameters allow verification of the subscriber's identity. The Equipment Identity Register (EIR) stores security-sensitive information about the MD. It maintains a list of all valid terminals, identified by their International Mobile Equipment Identity (IMEI). The EIR thus makes it possible to bar calls from stolen or unauthorized terminals (e.g., a terminal which does not respect the specifications concerning the output Radio Frequency (RF) power). In principle, a communication between two MDs A and B, from GSM network A to GSM network B, goes through the BTS nearest to MD A, the BSC and the associated MSC A. The call is then routed to MSC B via GMSC A and GMSC B. MSC B addresses the call to the BSC, which contacts the BTS nearest to MD B. Even a communication between two MDs of the same GSM network never follows a direct link; it goes through the BTS, the BSC and the MSC.
A GSM network is in addition connected to a so-called Operation and Support System (OSS), from which the network operator is able to monitor and control the system.
Appendix B
CELP Speech Coding Tools
B.1 The Recursive Levinson-Durbin Algorithm
The LPC analysis is an important step in the CELP encoding process. The LPC block in a CELP encoder aims to find the best prediction coefficients, also called LPC coefficients, which minimize the squared error between the current windowed speech samples and the predicted ones. The prediction coefficients $A_S = (1, a_s(1), \ldots, a_s(M))$ minimizing that squared error are the solution of a linear system known as the Yule-Walker equations:

$$-\Gamma_S \cdot A_S = R_S \qquad (B.1)$$

where $M$ is the order of the LPC analysis and the $M \times M$ autocorrelation matrix $\Gamma_S$ is defined as:

$$\Gamma_S = \begin{pmatrix} r_S(0) & r_S(1) & \cdots & r_S(M-1) \\ r_S(1) & r_S(0) & \ddots & \vdots \\ \vdots & \ddots & \ddots & r_S(1) \\ r_S(M-1) & \cdots & r_S(1) & r_S(0) \end{pmatrix} \qquad (B.2)$$

$$R_S = (r_S(1), \ldots, r_S(M)) \qquad (B.3)$$

with $r_S(j) = \sum_{n=j}^{N-1} s_w(n)\, s_w(n-j)$, $j = 0, \ldots, M$, where $N$ is the size of the analysis window. The solution of the linear system in Eq. B.1 can be obtained by inverting the autocorrelation matrix $\Gamma_S$. For real-time implementation, such an approach is too complex and an iterative solution is necessary.
Current CELP coders use the Recursive Levinson-Durbin algorithm to find the optimal LPC coefficients instead of the matrix inversion. On a windowed-frame basis, the algorithm uses the autocorrelation sequence $r_S(j)$, $j = 0, \ldots, M$, and the desired LPC analysis order $M$ to recursively estimate the LPC coefficients and the reflection coefficients.

B.1.1 Steps of the Recursive Levinson-Durbin Algorithm

The algorithm can be run only if the condition $r_S(0) > 0$ is satisfied. For convenience, when we write $a_m^{(k)}$, $m$ is the iteration step of the algorithm, ranging from 0 to $M$, and the index $(k)$ identifies the computed LPC coefficient. The algorithm computes $M+1$ coefficients: the first coefficient is set to 1 and the $M$ remaining ones are the LPC coefficients.
The algorithm can be summarized as follows:

1. If $m = 0$: $err_S(0) = r_S(0)$, also called the prediction error power.

2. If $m = 1$:
$$a_1^{(1)} = -\frac{r_S(1)}{r_S(0)}, \quad a_1^{(0)} = 1, \quad err_S(1) = err_S(0) \cdot \left(1 - (a_1^{(1)})^2\right) \qquad (B.4)$$

3. For $m = 2$ to $m = M$: $a_m^{(0)} = 1$ and the terms $a_m^{(m)} = K_m$, generally called reflection coefficients, are first computed as:
$$a_m^{(m)} = -\frac{1}{err_S(m-1)} \left( r_S(m) + \sum_{j=1}^{m-1} a_{m-1}^{(j)} \cdot r_S(m-j) \right) \qquad (B.5)$$

The LPC coefficients are then updated at iteration $m$ by:
$$\left(a_m^{(1)}, \ldots, a_m^{(m-1)}\right)^T = \left(a_{m-1}^{(1)}, \ldots, a_{m-1}^{(m-1)}\right)^T + a_m^{(m)} \cdot \left(a_{m-1}^{(m-1)}, \ldots, a_{m-1}^{(1)}\right)^T \qquad (B.6)$$

and
$$err_S(m) = err_S(m-1) \cdot \left(1 - (a_m^{(m)})^2\right) \qquad (B.7)$$

The LPC coefficients at the end of the iterative process are given by $A_S = (1, a_s(1), \ldots, a_s(M)) = \left(1, a_M^{(1)}, \ldots, a_M^{(M)}\right)$.
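For reference, a direct transcription of these steps in Python (a sketch; the array indexing convention is ours, and the recursion follows Eqs. B.4-B.7):

```python
import numpy as np

def levinson_durbin(r, M):
    """Recursive Levinson-Durbin (Eqs. B.4-B.7): solve the Yule-Walker
    equations for the LPC coefficients from autocorrelations r[0..M].
    Returns A = [1, a(1), ..., a(M)], the final prediction error power
    err_S(M) and the reflection coefficients K_1..K_M."""
    r = np.asarray(r, dtype=float)
    assert r[0] > 0, "the algorithm requires r_S(0) > 0"
    a = np.zeros(M + 1)
    a[0] = 1.0
    err = r[0]                               # err_S(0) = r_S(0)
    K = np.zeros(M)
    for m in range(1, M + 1):
        # reflection coefficient K_m (Eq. B.5)
        k = -(r[m] + np.dot(a[1:m], r[m - 1:0:-1])) / err
        K[m - 1] = k
        # order update of the coefficients (Eq. B.6)
        a[1:m] = a[1:m] + k * a[m - 1:0:-1]
        a[m] = k
        # prediction error power update (Eq. B.7)
        err *= 1.0 - k * k
    return a, err, K
```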
B.2 The Inverse Recursive Levinson-Durbin Algorithm
With explicit knowledge of the autocorrelation values $r_S(j)$, $j = 0, \ldots, M$, of the signal $s(n)$, the Recursive Levinson-Durbin algorithm estimates the LPC coefficients $A_S$ and the final prediction error power $err_S(M)$ from the Yule-Walker equations. Conversely, given the set of LPC coefficients $A_S$, is it possible to compute the corresponding autocorrelation values $r_S(j)$, $j = 0, \ldots, M$?

A solution to this problem is provided by the Inverse Recursive Levinson-Durbin algorithm. This algorithm takes as input the LPC coefficients and the final prediction error power, and proceeds in two steps: first, it computes the associated reflection coefficients $K_m$, $m = 1, \ldots, M$; then the autocorrelation values are derived from the reflection coefficients.

The Inverse Recursive Levinson-Durbin algorithm iterates downwards from $t = M$ to $t = 1$. Knowing the entire set of LPC coefficients $\left(1, a_M^{(1)}, \ldots, a_M^{(M)}\right)$, the algorithm proceeds as follows. From Eq. B.5 we can obtain:
$$\sum_{k=0}^{t-1} a_{t-1}^{(k)} \cdot r_S(t-k) = -K_t \cdot err_S(t-1) \qquad (B.8)$$

Using the fact that $r_S(k) = r_S^*(-k)$, we can compute the autocorrelation coefficients recursively as follows. Set $r_S(0) = err_S(0)$ and $a_t^{(0)} = 1$; then, for $t = 1, \ldots, M$, the autocorrelation coefficients are derived from:
$$r_S(t) = -K_t \cdot err_S(t-1) - \sum_{k=1}^{t-1} a_{t-1}^{(k)} \cdot r_S(t-k) \qquad (B.9)$$

The prediction error power $err_S(t-1)$ at each order is recovered by inverting Eq. B.7.
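A sketch of the two steps, together with a round-trip check against the routine of Sec. B.1 (indexing conventions are ours):

```python
import numpy as np

def inverse_levinson_durbin(A, err_M):
    """Inverse Recursive Levinson-Durbin (sketch): recover r_S(0..M) from
    A = [1, a(1), ..., a(M)] and the final prediction error power err_S(M).
    Step 1 recovers the reflection coefficients and lower-order predictors
    by inverting Eqs. B.6-B.7; step 2 rebuilds the autocorrelations via
    Eqs. B.8-B.9."""
    A = np.asarray(A, dtype=float)
    M = A.size - 1
    preds = [None] * (M + 1)       # preds[t]: order-t predictor [1, ...]
    preds[M] = A.copy()
    K = np.zeros(M + 1)
    err = np.zeros(M + 1)
    err[M] = float(err_M)
    for t in range(M, 0, -1):                      # step-down recursion
        K[t] = preds[t][t]
        err[t - 1] = err[t] / (1.0 - K[t] ** 2)    # invert Eq. B.7
        lower = np.zeros(t)
        lower[0] = 1.0
        for j in range(1, t):                      # invert Eq. B.6
            lower[j] = (preds[t][j] - K[t] * preds[t][t - j]) / (1.0 - K[t] ** 2)
        preds[t - 1] = lower
    r = np.zeros(M + 1)
    r[0] = err[0]                                  # r_S(0) = err_S(0)
    for t in range(1, M + 1):                      # Eq. B.9
        r[t] = -K[t] * err[t - 1] - np.dot(preds[t - 1][1:t], r[t - 1:0:-1])
    return r

# round trip with the routine of Sec. B.1:
# A, err, _ = levinson_durbin([1.0, 0.5, 0.2, 0.1], 3)
# inverse_levinson_durbin(A, err)   # ~ [1.0, 0.5, 0.2, 0.1]
```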
B.3 The ITU-T P.160

One of the targets of noise suppression is to maintain the power level of the speech signal, i.e. not to attenuate the speech level together with the noise during Noise Reduction (NR) processing. The methodology presented here is an objective way of characterizing the basic effect of NR methods. It assesses an NR solution in terms of Signal-to-Noise Ratio Improvement (SNRI) and Total Noise Level Reduction (TNLR). The SNRI is measured during speech activity, focusing on the effect of NR on the speech signal. The TNLR estimates the overall level of noise reduction, both during speech and speech pauses. In addition, a delta measurement (DSN) is computed to reveal speech attenuation or undesired speech amplification caused by the NR solution. This objective measurement tool can only be applied under specific conditions on the test material. The test signals should comprise at least:
– I = 24 original clean speech utterances $s_i$: 6 utterances from each of 4 speakers, 2 male and 2 female.
– J = 6 original noise utterances $d_j$, covering total SNRs of 6 dB, 12 dB and 18 dB. The noise sequences are a car interior noise at 100 km/h, with a fairly constant power level, and a street noise. The noise signals should have slowly varying power.
The noisy speech signal $yin_{i,j}$, taken as the reference signal, is then given by:
$$yin_{i,j} = s_i + \beta_{i,j}(SNR) \cdot d_j \qquad (B.10)$$
The analysis is performed on frames of 80 samples. The processed speech signal $yout$ is defined as:
$$yout_{i,j} = NR(yin_{i,j}) \qquad (B.11)$$
Based on the ITU-T Recommendation [ITU-T 2006d], the test signals should be normalized to an active speech level of −26 dBov and represented with 16-bit integers. The average power $P_x$ of a signal $x$ over an 80-sample frame is defined by:
$$P_x = \frac{1}{80} \sum_{n=1}^{80} x^2(n) \qquad (B.12)$$
The power level in decibels overload (dBov) is defined relative to a reference power level $P_0 = 327680$ as follows:
$$L_x = 10 \cdot \log_{10} \frac{P_x}{P_0} \qquad (B.13)$$
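For illustration, these two definitions translate directly into code (the guard against all-zero frames is our addition):

```python
import numpy as np

def level_dbov(x, P0=327680.0):
    """Frame power (Eq. B.12) and level in dBov (Eq. B.13), one value per
    complete 80-sample frame, using the reference power P0 given above."""
    x = np.asarray(x, dtype=float)
    frames = x[:x.size // 80 * 80].reshape(-1, 80)
    P = np.mean(frames ** 2, axis=1)                   # P_x per frame
    return 10.0 * np.log10(np.maximum(P, 1e-12) / P0)  # L_x in dBov
```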
The noise signal $d$ and the processed speech signal $y$ are classified by their average power levels ($L_d$ and $L_y$ respectively), in comparison with clean speech average power thresholds in dBov: $th_h$, $th_m$, $th_l$, $th_{nh}$, $th_{nl}$:
1. a signal $x$ belongs to the High speech class, or $K_{sph}$ class, if $L_x > th_h$
2. a signal $x$ belongs to the Medium speech class, or $K_{spm}$ class, if $th_m < L_x \le th_h$
3. a signal $x$ belongs to the Low speech class, or $K_{spl}$ class, if $th_l < L_x \le th_m$
4. a signal $x$ belongs to the $K_{nse}$ class if $th_{nl} \le L_x \le th_{nh}$
5. a signal $x$ belongs to the $K_{pse}$ class if $L_x \le th_{nl}$
The NR solution is evaluated in terms of Signal-to-Noise Ratio Improvement (SNRI) and Total Noise Level Reduction (TNLR). The SNRI is measured during speech activity, focusing on the effect of NR on the useful speech signal, mostly its amplification.
Threshold   Explanation                                   Value
th_h        Lower bound for high speech power class       −1 dBov
th_m        Lower bound for medium speech power class     −10 dBov
th_l        Lower bound for low speech power class        −16 dBov
th_nh       Higher bound for speech pause class           −25 dBov
th_nl       Lower bound for speech pause class            −40 dBov

Table B.1: Threshold Levels for Speech Classification.
B.3.1 Assessment of SNR Improvement (SNRI)

The SNRI metric measures the SNR improvement achieved by the NR algorithm, i.e. the amplification of the speech. This SNR is computed in the three speech power classes so as to evaluate the effect separately for strong, medium and weak speech in the input and output speech signals. The computations below are given for the high speech class; the other classes follow analogously. We start with the SNR of the input $yin_{i,j}$ and output $yout_{i,j}$ speech signals:

$$SNRout\_h_{i,j} = 10 \cdot \log_{10} \max\left(\xi,\; \frac{10^{\frac{1}{K_{sph}} \sum_{l=1}^{K_{sph}} \log_{10}\left(\xi + \sum_n yout_{i,j}^2(l,n)\right)}}{10^{\frac{1}{K_{nse}} \sum_{m=1}^{K_{nse}} \log_{10}\left(\xi + \sum_p yout_{i,j}^2(m,p)\right)}} - 1\right) \qquad (B.14)$$

$$SNRin\_h_{i,j} = 10 \cdot \log_{10} \max\left(\xi,\; \frac{10^{\frac{1}{K_{sph}} \sum_{l=1}^{K_{sph}} \log_{10}\left(\xi + \sum_n yin_{i,j}^2(l,n)\right)}}{10^{\frac{1}{K_{nse}} \sum_{m=1}^{K_{nse}} \log_{10}\left(\xi + \sum_p yin_{i,j}^2(m,p)\right)}} - 1\right) \qquad (B.15)$$

where $\xi = 10^{-5}$. The indices $n$ and $p$ run over frames of 80 samples: $n$ relates to speech frames, while $p$ relates to noise frames with frame power between the $th_h$ bound and the $th_l$ bound. The SNRI for the high speech class during a single scenario $(i, j)$ is given by:

$$SNRI\_h_{i,j} = SNRout\_h_{i,j} - SNRin\_h_{i,j} \qquad (B.16)$$
The total $SNRI_{i,j}$ over the high, medium and low speech classes for a single scenario is obtained by:
$$SNRI_{i,j} = \frac{K_{sph} \cdot SNRI\_h_{i,j} + K_{spm} \cdot SNRI\_m_{i,j} + K_{spl} \cdot SNRI\_l_{i,j}}{K_{sph} + K_{spm} + K_{spl}} \qquad (B.17)$$
Extending the SNRI processing to all the speech utterances, we obtain:
$$SNRI_j = \frac{1}{I} \sum_{i=1}^{I} SNRI_{i,j} \qquad (B.18)$$
Finally, the SNRI over the entire set of experiments is given by:
$$SNRI = \frac{1}{J} \sum_{j=1}^{J} SNRI_j \qquad (B.19)$$
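For illustration, Eqs. B.14-B.16 reduce to the following helper once the frames have been classified (the classification itself, per Table B.1, is assumed done elsewhere):

```python
import numpy as np

def snr_class(speech_frames, noise_frames, xi=1e-5):
    """Per-class SNR of Eqs. B.14-B.15 (sketch): log-domain averages of the
    frame powers over the speech-class and noise-class frames. Each
    argument is a 2-D array with one 80-sample frame per row."""
    num = np.mean(np.log10(xi + np.sum(np.asarray(speech_frames) ** 2, axis=1)))
    den = np.mean(np.log10(xi + np.sum(np.asarray(noise_frames) ** 2, axis=1)))
    return 10.0 * np.log10(max(xi, 10.0 ** num / 10.0 ** den - 1.0))

# Eq. B.16 for one scenario (i, j), high speech class:
# snri_h = snr_class(out_speech, out_noise) - snr_class(in_speech, in_noise)
```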
B.3.2 Assessment of Total Noise Level Reduction (TNLR)
The Total Noise Level Reduction measure, or TNLR, quantifies the capability of the noise reduction method to attenuate the background noise level, measured during both speech activity and speech pauses. Due to the difference between the number of frames during speech activity and during long speech pauses, the TNLR mainly measures the capability of an NR to reduce the noise during long speech pauses. The TNLR is computed as follows:
$$TNLR_{i,j} = \frac{10}{K_{pse}} \cdot \sum_{m=1}^{K_{pse}} \left[ \log_{10}\left(\xi + \sum_q yout_{i,j}^2(m,q)\right) - \log_{10}\left(\xi + \sum_q yin_{i,j}^2(m,q)\right) \right] \qquad (B.20)$$
The index $q$ runs over frames of 80 samples and relates to noise frames with frame power below the $th_{nh}$ bound.
$$TNLR_j = \frac{1}{I} \sum_{i=1}^{I} TNLR_{i,j} \qquad (B.21)$$
Finally, the TNLR over the entire set of experiments is given by:
$$TNLR = \frac{1}{J} \sum_{j=1}^{J} TNLR_j \qquad (B.22)$$
An improvement in the SNR of a noisy speech signal can be achieved by amplifying the high-energy portion of the signal, but both attenuation and amplification can degrade the noise reduction result. The balance between the SNR improvement (amplification) and the noise level reduction (attenuation) obtained during speech activity should therefore be monitored. For this purpose the Noise Power Level Reduction (NPLR) is introduced; it is computed like the TNLR, except that the NPLR is computed between the $th_{nh}$ and $th_{nl}$ bounds, i.e. over short speech pause periods:
$$NPLR_{i,j} = \frac{10}{K_{nse}} \cdot \sum_{m=1}^{K_{nse}} \left[ \log_{10}\left(\xi + \sum_q yout_{i,j}^2(m,q)\right) - \log_{10}\left(\xi + \sum_q yin_{i,j}^2(m,q)\right) \right] \qquad (B.23)$$
The index $q$ runs over frames of 80 samples and relates here to noise frames with frame power between the $th_{nh}$ bound and the $th_{nl}$ bound.
Finally, the NPLR metric provides the counterpart of the SNRI, and together these two metrics form the basis for evaluating the balance. The SNRI-to-NPLR Difference (DSN) is proposed as a measure giving an indication of possible speech attenuation or speech amplification produced by the NR method:
$$DSN = SNRI - NPLR \qquad (B.24)$$
The SNRI, TNLR and DSN metrics should satisfy the requirements described in Table B.2; the objective performance is defined through the average values of the metrics over all test conditions.

Objective Metric   Required Performance
SNRI               SNRI ≥ 4 dBov, as average over all test conditions
TNLR               TNLR ≤ −5 dBov, as average over all test conditions
DSN                −4 dBov ≤ DSN ≤ 3 dBov, as average over all test conditions

Table B.2: Objective Metrics Requirements.

Hence, the NPLR is typically negative and the DSN should be close to zero. If the NPLR is higher in absolute value than the SNRI, making the DSN clearly negative, then the NR solution produces speech level attenuation. If, on the contrary, the DSN is clearly positive, the SNRI indicates an SNR improvement without a decrease in the noise level: the speech level has been amplified.
Bibliography
3GPP TS 26.092 (2002). Mandatory Speech Codec speech processing functions, AMR Speech codec, Comfort noise aspects.
3GPP TS 26.094 (2002). Mandatory Speech Codec speech processing functions, AMR Speech codec, Voice Activity Detector.
3GPP (1999a). TIA/EIA-IS-853: Noise Suppression Minimum Performance for AMR, Technical report, 3GPP.
3GPP (1999b). TS 26.090: Mandatory Speech Codec speech processing functions; AMR Speech
codec; General Description, Technical report, 3GPP.
3GPP-GSM (1999). TS.06.77: Minimum Performance Requirement for Noise Suppresser Application to the AMR Speech Encoder, Release 7, Technical report, 3GPP.
Adoul, J.; Mabilleau, P.; Delprat, M.; Morissette, S. (1987). Fast CELP Coding Based on
Algebraic Codes, Proceedings of ICASSP’87, IEEE, pp. 1957–1960.
Allen, J. (1977). Short Term Spectral Analysis, Synthesis, and Modification by Discrete Fourier
Transform, IEEE Trans. Acoust., Speech, Signal Processing, vol. ASSP-25, no. 3, June,
pp. 235–239.
Atal, B.; Remde, J. (1982). A New Model of Excitation for Producing Natural-Sounding
Speech at Low Bit Rates, Proceedings of ICASSP’82, vol. 1, pp. 614–617.
Beaugeant, C. (1999). Réduction du Bruit et Contrôle d'Echo pour les Applications Radiomobiles, PhD thesis, Université de Rennes I.
Beaugeant, C.; Schoenle, M.; Loellman, H.; Sauert, B.; Steinert, K.; Vary, P. (2006). Hands-Free Audio and its Application to Telecommunication Terminals, In Proceedings of AES'06.
Beaugeant, C.; Taddei, H. (2007). Quality and Computation Load Reduction Achieved by Applying Smart Transcoding between CELP Speech Coders, EUSIPCO'07, pp. 1372–1376.
Benesty, J.; Morgan, D.; Cho, J. (2000). A New Double Talk Detection Based on Cross-Correlation, IEEE Transactions on Speech and Audio Processing, vol. 8, pp. 168–172.
Berouti, M.; Schwartz, R.; Makhoul, J. (1979). Enhancement of Speech Corrupted by Acoustic
Noise, Proceedings of ICASSP’79, vol. 4, pp. 208–211.
Boll, S. (1979). Suppression of Acoustic Noise in Speech Using Spectral Subtraction, IEEE
Trans. Acoust. Speech, Signal Processing, vol. ASSP-27, no. 2, April, pp. 113–120.
Bossert, M. (1999). Channel Coding for Telecommunication, John Wiley.
Bowman, F. (1958). Introduction to Bessel Functions, New York: Dover.
Breining, C.; Dreiseitel, P.; Hansler, E.; Mader, A.; Nitsch, B.; Puder, H.; Schertler, T.;
Schmidt, G.; Tilp, J. (1999). Acoustic Echo Control, an Application to Very High Order
Adaptive Filters, Proceedings of IEEE Signal Processing Magazine, IEEE, vol. 16, pp. 42–
69.
Cappe, O. (1994). Elimination of the Musical Noise Phenomenon with the Ephraim and Malah Noise Suppressor, IEEE Trans. on Speech and Audio Processing, vol. 2, no. 2, April, pp. 345–349.
Chandran, R.; Marchok, D. (2000). Compressed Domain Noise Reduction and Echo Suppression for Network Speech Enhancement, Proceedings of the 43rd IEEE Midwest Symposium
on Circuits and Systems, IEEE, Lansing, USA, vol. 1, pp. 10–13.
Chu, W. (2000). Speech Coding Algorithms: Foundation and Evolution of Standardized Coders, Wiley Interscience.
Cotanis, I. (2003). Speech in the VQE Device Environment, Wireless Communications and
Networking Conference, 2003 WCNC IEEE, vol. 2, no. 20, March, pp. 1102–1106.
Daumer, W.; Mermelstein, P.; Maitre, X.; Tokizawa, I. (1984). Overview of the ADPCM
Coding Algorithm, Proc. of GLOBECOM’84, pp. 23.1.–23.1.4.
DeMeuleneire, M. (2003). Noise Reduction on Codec Parameters, Master’s thesis, ENST Bretagne, Brest France.
Doh-Suk, K.; Cao, B.; Tarraf, A. (2008). Frame Energy Estimation Based on Speech Coded Parameters, IEEE International Conference on Acoustics, Speech and Signal Processing, vol. 1, no. 10, March, pp. 1641–1644.
Duetsch, N. (2003). Integrated Echo Cancellation in Speech Coding, Master’s thesis, Munich
University of Technology (TUM).
El-Jaroudi, A.; Makhoul, J. (1991). Discrete All-Pole Modeling, Proc. IEEE Trans. Signal
Processing, vol. 39, April, pp. 411–423.
Enzner, G.; Kruger, H.; Vary, P. (2005). On the Problem of Acoustic Echo Control in Cellular
Networks, Proceedings of the IWAENC’05, pp. 213–215.
Ephraim, Y.; Malah, D. (1984). Speech Enhancement Using a Minimum Mean Square Error
Short Time Amplitude Estimator, IEEE Trans. on Acoustics, Speech and Signal Processing, vol. 32, no. 6, December, pp. 1109–1121.
Eriksson, A. (2006). Speech Enhancement in Mobile Devices, Technical report, Eriksson.
Federal Standard 1015, Telecommunications: Analog to Digital Conversion of Radio Voice By
2400 Bit/Second Linear Predictive Coding (1984). Technical report, National Communication System - Office Technology and Standards.
Fermo, A.; Carini, A.; Sicuranza, G. (2000). Analysis of Different Low Complexity non Linear
Filters for Acoustic Echo Cancellation, Proc. of IEEE Workshop on Speech Coding, IEEE,
pp. 261–266.
Ghenania, M. (2005). Format Speech Conversion between Standardized CELP Coders, PhD thesis, Université de Rennes I.
Ghenania, M.; Lamblin, C. (2004). Low-Cost Smart Transcoding Algorithm between ITU-T
G.729 (8 Kbp/s) and 3GPP NB-AMR (12.2 Kbp/s), EUSIPCO.
Gnaba, H.; Turki, M.; Jaindane, M.; Scalart, P. (2003). Introduction of CELP Structure of
the GSM Coder in the Acoustic Canceller for GSM Network, Proceedings of Eurospeech,
Berlin, Germany, pp. 1389–1392.
Goldberg, L.; Riek, L. (2000). A Practical Handbook of Speech Coders, CRC Press.
Golub, G.; Loan, C. V. (1996). Matrix Computations, The Johns Hopkins University Press, Baltimore, 1996.
Gordy, J.; Goubran, R. (2004). A Combined LPC-Based Speech Coder and Filtered-X LMS Algorithm for Acoustic Echo Cancellation, Proceedings of ICASSP'04, IEEE, vol. 4, pp. 125–128.
Gordy, J.; Goubran, R. (2006). Post Filtering for Suppression of the Residual from Vocoder
Distortion on Packet Based Telephony, Proceedings of ICME’06, IEEE, pp. 1957–1960.
Halonen, T.; Romero, J.; Melero, J. (2002). GSM, GPRS and EDGE Performance: Evolution
toward 3G/UMTS, John Wiley.
Hansler, E.; Schmidt, G. (2004). Acoustic Echo and Noise Control. A Practical Approach,
Wiley Interscience.
Haykin, S. (2002a). Adaptive Filter Theory, Prentice Hall, Information and System Serie.
Haykin, S. (2002b). Adaptive Filter Theory, Chapter 5 Least Mean Square Adaptive Filter,
Prentice Hall, Information and System Serie.
Heitkamper, P. (1995). Optimization of an Acoustic Echo Canceller Combined with Adaptive
Gain Control, Proceedings of ICASSP’95, IEEE, vol. 5, pp. 3047–3050.
Heitkamper, P. (1997). An Adaptation Control for Acoustic Echo Cancellers, IEEE Signal Processing Letters, vol. 4, pp. 170–172.
Heitkamper, P.; Walker, M. (1993). Adaptive Gain Control for Speech Quality and Echo
Suppression, Proceedings of Eurospeech’93, Berlin, Germany, pp. 1077–1080.
Huang, Y.; Goubran, R. (2000). Effects of the Vocoder Distortion on Network Echo Cancellation, In Proceedings of ICME'00, IEEE, vol. 1, pp. 437–439.
Itakura, F. (1975). Line Spectrum Representation of Linear Predictive Coefficients of Speech Signals, J. Acoust. Soc. Am., vol. 57, p. S35.
ITU-T (1988). Recommendation G.711: Pulse Code Modulation (PCM) of voice frequencies.
ITU-T (1996). Recommendation P.800: Methods for Subjective Determination of Transmission
Quality.
ITU-T (2001). Recommendation P.862: Perceptual Evaluation of Speech Quality (PESQ): An
Objective Method for End-to-End Speech Quality Assessment of Narrow-Band Telephone
Networks and Speech Codecs.
ITU-T (2003). Recommendation G.722.2: Wideband coding of speech at around 16 kbit/s
using Adaptive Multi-Rate Wideband (AMR-WB).
ITU-T (2004). Recommendation G.161: Interaction Aspects of Signal Processing Network
Equipments.
ITU-T (2006a). Recommendation G.160: Voice Enhancement Devices for Mobiles Network:
Appendix II, Serie G: Transmission System and Media, Digital System and Network, International Telephone Connections and Circuit. Apparatus Asociated with Long Distance
Telephone Circuit.
ITU-T (2006b). Recommendation G.168: Digital Network Echo Cancellers.
ITU-T (2006c). Recommendation P.10: Vocabulary for Performance and Quality of Service.
ITU-T (2006d). Recommendation P.56: Objective Measurement of Active Speech Level.
ITU-T (2006e). Recommendation P.800.1: Mean Opinion Score (MOS) Terminology.
Kabal, P. (2003). Ill-Conditioning and Bandwidth Expansion in Linear Prediction of Speech, Proceedings of ICASSP'03, IEEE, vol. 1, April, pp. 824–827.
Kabal, P.; Ramachandran, R. (1986). The Computation of Line Spectral Frequencies Using
Chebyshev Polynomials, IEEE Transactions on Acoustics, Speech and Signal Processing,
vol. 6, December, pp. 1419–1425.
Kang, H. G.; Kim, H. K.; Cox, R. V. (2003). Improving the Transcoding Capability of Speech
Coders, IEEE Trans. on Mulitimedia, vol. 5, no. 1, March, pp. 23–33.
Kang, H.; Kim, H.; Cox, R. (2000). Improving Transcoding Capability of Speech Coders in
Clean and Frame Erasured Channel Environments, Proc. of IEEE Workshop on Speech
Coding, IEEE, pp. 78–80.
Kim, K.; Jung, S.; Park, Y.; Choi, Y.; Youn, D. (2001). An Effective Transcoding Algorithm for G.723.1 and EVRC Speech Coders, Proceedings of IEEE VTS'01, IEEE, pp. 1561–1564.
Kleijn, B.; Paliwal, K. (1995). Speech Coding and Synthesis, Elsevier.
Kleijn, W.; Krasinski, D.; Ketchum, R. (1990). Fast Method for the CELP Speech Coding
Algorithm, Proceedings of ICASSP’90, IEEE, pp. 1330–1342.
Knappe, M.; Goubran, R. (1994). Steady State Performance Limitation of full band AEC,
Proceedings of ICASSP’94, IEEE, vol. 2, pp. 73–76.
Kondoz, A. (1994). Digital Speech Coding for Low Bit Rate Communication Systems, John Wiley and Sons.
Kroon, P.; Deprettere, E.; Sluyter, R. (1986). Regular-Pulse Excitation - A Novel Approach
to Effective and Efficient Multipulse Coding of Speech, IEEE Transactions on Acoustics,
Speech, and Signal Processing, vol. 34, no. 5, pp. 1054–1063.
Lim, J.; Oppenheim, A. (1979). Enhancement and Bandwidth Compression of Noisy Speech,
Proceedings of the IEEE, vol. 67, pp. 1586–1604.
Loizou, P. (2007). Speech Enhancement, Theory and Practice, CRC Taylor and Francis Group.
Lu, X.; Champagne, B. (2003). A Centralized Acoustic Echo Canceller Exploiting Masking
Properties of the Human Ear, Proceedings of ICASSP’03, IEEE, vol. 5, pp. 377–380.
Martin, R. (1994). Spectral Subtraction Based on Minimum Statistics, Proceedings of EUSIPCO’94, pp. 1182–1185.
Martin, R. (2001). Noise Power Spectral Density Estimation Based on Optimal Smoothing
and Minimum Statistics, IEEE Trans. on Speech and Audio Processing, vol. 9, no. 5, July,
pp. 504–512.
Martin, R.; Malah, D.; Cox, R.; Accardi, J. (2004). A Noise Reduction Preprocessor for Mobile Voice Communication, EURASIP Journal on Applied Signal Processing, no. 8, pp. 1046–1058.
Mboup, M.; Bonnet, M.; Bershad, N. (1994). LMS Coupled Adaptive Prediction and System Identification: A Statistical Model and Transient Mean Analysis, IEEE Transactions on Signal Processing, vol. 42, pp. 2607–2614.
McAulay, R.; Malpass, M. (1980). Speech Enhancement Using a Soft Decision Noise Suppression Filter, IEEE Trans. Acoust. Speech, Signal Processing, vol. ASSP-28, no. 2, April,
pp. 137–145.
Messerschmitt, D. (1984). Echo Cancellation in Speech and Data Transmission, IEEE Journal on Selected Areas in Communications, vol. SAC-2, pp. 283–297.
Moriya, T.; Miki, S.; Mano, K.; Ohmuro, H. (1993). Training Method of the Excitation
Codebook for CELP, Proceedings of Eurospeech’93, pp. 1155–1158.
Oppenheim, A.; Schafer, R. (1999). Discrete-Time Signal Processing, Second Edition, Prentice Hall, Englewood Cliffs, New Jersey.
Painter, T.; Spanias, A. (2000). Perceptual Coding of Digital Audio, Proceedings of the IEEE,
IEEE, vol. 88, pp. 451–513.
Pasanen, A. (2006). Coded Domain Level Control for the AMR Speech Codec, IEEE International Conference on Acoustics, Speech and Signal Processing, vol. 1, no. 14, March, pp. I–I.
Pobloth, H.; Kleijn, W. (1999). On Phase Perception in Speech, Proceedings of ICASSP’99,
IEEE, vol. 1, pp. 29–30.
Rages, M.; Ho, K. (2002). Limits on Echo Return Loss Enhancement on a Voice Coded Speech Signal, Proceedings of the 45th IEEE MWSCAS, IEEE, vol. 2, pp. 152–155.
Scalart, P.; Filho, J. V. (1996). Speech Enhancement Based on A Priori Signal to Noise Ratio Estimation, Proceedings of ICASSP'96, IEEE, vol. 2, pp. 626–632.
Schroeder, M.; Atal, B. (1985a). Code-Excited Linear Prediction (CELP): High-Quality Speech
at Very Low Bit Rates, Proceedings of ICASSP’85, IEEE, vol. 10, pp. 937–940.
Schroeder, M.; Atal, B. (1985b). Stochastic Coding of Speech at Very Low Bit Rates: the
importance of Speech Perception, Speech Communication 1985, vol. 4, pp. 155–162.
Schroeder, M.; Atal, B.; Hall, J. (1979). Optimizing Digital Speech Coders by Exploiting Masking Properties of the Human Ear, The Journal of the Acoustical Society of America, vol. 66, December, pp. 1647–1652.
Shannon, C. (1949). Communication in the Presence of Noise, Proc. Institute of Radio Engineers, vol. 37, pp. 10–21. Reprint as classic paper in: Proc. IEEE, Vol. 86, No. 2, (Feb
1998).
Spanias, A. (1994). Speech Coding: A Tutorial Review, Proceedings of the IEEE, IEEE, vol. 82,
pp. 1541–1582.
Sukkar, R.; Younce, R.; Zhang, P. (2006). Dynamic Scaling of Encoded Speech Through the Direct Modification of Coded Parameters, Proceedings of ICASSP'06, IEEE, pp. 677–680.
Taddei, H.; Beaugeant, C.; DeMeuleneire, M. (2004). Noise Reduction on Speech Codec Parameters, Proceedings of ICASSP’04, IEEE, vol. 1, pp. 497–500.
Thepie, E.; Beaugeant, C.; Taddei, H.; Duetsch, N.; Pastor, D. (2006). Echo Reduction based
on Speech Codec Parameters, Proceedings of IWAENC’2006.
Thepie, E.; Beaugeant, C.; Taddei, H.; Pastor, D. (2008). Noise Reduction within Network through Modification of the LPC Parameters, Proc. of ITG-SCC'08.
Tsai, S.; Yang, J. (2001). GSM to G.729 Speech Transcoder, In Proceedings of ICECS’01,
IEEE, pp. 485–488.
Un, C.; Choi, K. (1981). Improving LPC analysis of Noisy Speech by Autocorrelation Subtraction Method, Proceedings of ICASSP’81, IEEE, vol. 6, pp. 1082–1085.
Vahatalo, A.; Johansson, I. (1999). Voice Activity Detection for GSM Adaptive Multirate
Codec, Proc. of IEEE Workshop Speech Coding, IEEE, Porvoo, Finland, pp. 55–57.
Vary, P.; Martin, R. (2005). Digital Speech Transmission: Enhancement, Coding and Error Concealment, John Wiley and Sons, Ltd.
Vaseghi, S. (1996). Advanced Signal Processing and Noise Reduction, Wiley-Teubner.
Wang, D.; Lim, J. (1982). The Unimportance of Phase in Speech Enhancement, Proc. IEEE
Trans. Acoust. Speech, Signal Processing, vol. 4, no. 4, August, pp. 679–681.
Yasuji, O.; Suzuki, M.; Tsuchinaga, Y.; Tanaka, M.; Sasaki, S. (2002). Speech Coding Translation for IP and 3G Mobile Integrated Network, In Proceedings of ICC'02, IEEE, pp. 114–118.
Ye, H.; Wu, X. (1991). A New Double Talk Detection Based on the Orthogonality Theorem,
Proceedings of IEEE Transaction on Communication, IEEE, vol. 39, pp. 1542–1545.
Yoon, S.; Kang, H.; Park, Y.; Youn, D. (2001). An Effective Transcoding Algorithm for G.723.1 and G.729A Speech Coders, Proceedings of Eurospeech'01, pp. 2499–2502.
Yoon, S.; Kang, H.; Park, Y.; Youn, D. (2003). Transcoding Algorithm for G.723.1 and AMR
Speech Coders: For Interoperability between VoIP and Mobile Networks, Proceedings of
Eurospeech’03, pp. 1101–1104.
Résumé

Voice quality enhancement is progressively moving into the networks, rather than the mobile terminals. The constraints of delay reduction and complexity reduction, and the wish for centralized control of the networks, motivate this new approach. The deployment of standardized speech coders raises interoperability problems between networks. To ensure the interconnection between these networks, transcoding of the bit-stream from one coder to the target coder is indispensable. Classical quality enhancement solutions and classical transcoding require the signal in PCM format, that is, the signal samples.

An alternative concept for enhancing speech quality in the networks is proposed in this thesis. This approach relies on the processing of the parameters of CELP-type coders. A noise reduction system is implemented in this thesis by modifying the fixed gain and the LPC coefficients. Two algorithms developed for acoustic echo cancellation modify the fixed gain. These algorithms use extrapolations and transpositions of existing techniques from the time or frequency domain into the parameter domain of CELP-type coders.

During this thesis, we have also integrated the algorithms mentioned above into smart transcoding schemes involving the fixed and adaptive gains, as well as the LPC coefficients. With this approach, the complexity of the system is reduced by about 27%. The problems linked to the non-linearity introduced by the coders are significantly reduced. Regarding noise reduction, objective tests indicate that the performance is better than that of the classical Wiener filter during transcoding from AMR-NB 7.4 kbps to 12.2 kbps, and roughly equivalent during transcoding from the AMR-NB 12.2 kbps mode to the 7.4 kbps mode. The objective measures for acoustic echo cancellation (ERLE) show a gain of more than 40 dB of the proposed algorithms over the NLMS. The minimal threshold of 45 dB set for GSM is reached.

Keywords: CELP coder, AMR-NB, noise reduction and acoustic echo cancellation in the CELP parameter domain, Wiener filter, GSM network, smart transcoding.