2007:260 CIV
MASTER'S THESIS

Speech Quality Investigation using PESQ
in a simulated Climax system for ATM
Alexander Storm
Luleå University of Technology
MSc Programmes in Engineering
Space Engineering
Department of Computer Science and Electrical Engineering
Division of Signal Processing
2007:260 CIV - ISSN: 1402-1617 - ISRN: LTU-EX--07/260--SE
Abstract
The demand for obtaining and maintaining a certain speech quality in
systems and networks has grown in recent years. It is becoming increasingly
common to specify the required quality instead of settling for "it sounds ok".
The most accurate way to perform a quality test is to let users of the services
express their opinion about the perceived quality. This subjective test method
is unfortunately expensive and very time consuming. Lately, new objective
methods have been developed to replace these subjective tests. These
objective methods correlate highly with subjective tests, are fast and
relatively cheap.
In this work the performance of one of these objective methods was
investigated, with the main focus on speech distorted by impairments
commonly occurring in air traffic control radio transmissions. The software
in which the objective method is implemented was evaluated with regard to
a recommendation for further use. Some of the test cases were also judged
subjectively by listeners, and the results were compared with the objective
results.
Keywords: Objective speech quality evaluation, PESQ, Mean Opinion Score (MOS),
MOS-LQO, Climax.
Preface
This is the report of the master's thesis that concludes my journey towards
my MSc degree in Space Engineering at Luleå University of Technology. The
thesis was carried out at Saab Communication in Arboga and it involved
examination of objective methods for grading speech quality in
communication links.
I would like to thank Saab Communication for this opportunity; a special
gratitude goes to my supervisors Alf Nilsson, Ronnie Olsson and Lars
Eugensson at Saab for their inspiration and knowledge. At the department of
Computer Science and Electrical Engineering I would like to thank my
examiners Magnus Lundberg Nordenvaad and James LeBlanc.
I would also like to express my gratitude to all the inspiring people that I
have had the opportunity to meet and work with during my five years at
LTU. You have made these years some of the best of my life and I wish you
all a delightful future.
Finally, my gratitude goes to my family and friends for your support.
Arboga, October 2007.
Alexander Storm
Content
1 Introduction
2 What is Speech Quality and how is it measured?
  2.1 Impairments
  2.2 Quality measurements
  2.3 PESQ – Perceptual Evaluation of Speech Quality
  2.4 Intelligibility
  2.5 Future measurement methods
3 Theory
  3.1 CLIMAX
  3.2 Impairments
  3.3 Objective measurements
  3.4 PESQ-verification
  3.5 Subjective measurements
4 Methods
  4.1 PESQ-verification
  4.2 Objective quality measurements
  4.3 Subjective measurements
5 Result
  5.1 PESQ-verification
  5.2 Objective measurements
  5.3 Subjective measurements
6 Conclusion and Discussion
  6.1 PESQ-verification
  6.2 Objective measurements
  6.3 Subjective measurements
  6.4 Error sources
  6.5 The GL-VQT Software®
  6.6 Additional measurements
References
Appendix 1 Glossary
Appendix 2 ITU-T P.862, Amendment 2. Conformance test 2(b)
Appendix 3 Subjektivt test av talkvalitet (Subjective test of speech quality)
Appendix 4 Results for case 9 of the objective measurements
Appendix 5 The MOS-LQS result of the subjective measurement
1 Introduction
The European organization for civil aviation equipment (Eurocae¹) is
developing a technical specification for a communication system using Voice
over IP (VoIP) for Air Traffic Management (ATM). The specification is
planned to be completed by the end of 2007 and it is expected to contain a
speech quality recommendation for ATM according to the International
Telecommunication Union (ITU) MOS scale. To measure and verify the
quality, it is proposed that the objective PESQ-algorithm should be used. To
get a feeling for what kind of quality demands are reasonable, some cases
of speech impairments typical for ATM were investigated and tested in
this work using the PESQ-algorithm. For comparison and software evaluation
purposes, some of the test cases were also tested on humans to get subjective
opinions.
The purpose of this work was to investigate what kind of objective methods
for speech quality assessment are available on the market and how they
perform. Another purpose was to simulate and investigate how different
impairments influence the speech quality using the objective PESQ-algorithm.
Evaluation and verification of the algorithm for these specific
impairments were made using subjective tests and recent research. The
software in which the algorithm is implemented was also investigated and
evaluated, and a small comparison between intelligibility and quality of
speech was performed.
1. Eurocae is an organization where European administrations, airlines and industry can
discuss technical problems. The members of Eurocae are European administrations, aircraft
manufacturers, equipment manufacturers and service providers and their objective is to
work out technical specifications and recommendations for the electrical equipment in the air
and on the ground [1].
2 What is Speech Quality and how is it measured?
With the introduction of IP services like Voice over IP (VoIP), methods to
measure the performance of these services are required. VoIP is introduced
to reduce expenses by using one type of network for both voice and data.
Over the years users have become accustomed to the quality that the
"ordinary" Public Switched Telephone Network (PSTN) provides, to the extent
that PSTN is nowadays the standard for quality and predictability. VoIP needs
to meet this standard to be widely accepted [2]. To cope with this
challenge it is important to understand the many differences between VoIP
and PSTN. The main differences are presented below:
PSTN was designed for time-sensitive delivery of voice traffic. It was
constructed with non-compression analog-to-digital encoding
techniques, always with the voice channel in mind to give the right
amount of bandwidth and frequency response [2]. The IP networks
were, on the other hand, designed for non-real-time applications like
file transfers and e-mails.
In PSTN the call setup and management are provided by the core of
the network while VoIP networks have put this management into the
endpoints such as personal computers and IP telephones [2]. Because
of this the network core is not equally controlled and regulated, which
can have a negative impact on the quality.
A telephone call in PSTN gets a dedicated channel with dedicated
bandwidth. This guarantees a certain quality, which is about the
same for every call. VoIP, on the other hand, can neither guarantee nor
predict voice quality. In VoIP the calls are divided into small frames
or packets, which can take different routes between the caller and the
receiver. The available bandwidth cannot be guaranteed; it depends
on the performance and load of the network [3].
In PSTN the codec ITU-T G.711 [4] is used. It is a linear waveform
codec that almost reproduces the waveform at decoding. G.711 works
at an 8 kHz sampling rate (8000 samples/s) and each encoded sample is
8 bits long (0.125 ms), which gives a data rate of 64 kbit/s (or bps). The
codecs G.729 (10 ms segments, ~80 bits/segment, ~8 kbps) and G.723.1
(30 ms segments, ~180 bits/segment, ~6 kbps) are non-linear because
they only try to process the parts of the waveform that are important
for perception, which leads to a smaller bandwidth requirement. The
drawbacks are longer segments and low bit-rates, which can lead to
higher end-to-end delay.
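The bit-rate figures above follow directly from the bits carried per encoded segment and the segment duration; a minimal sketch using the values given in the text:

```python
def codec_bitrate(bits_per_segment, segment_ms):
    """Data rate in bit/s from bits per encoded segment and segment duration."""
    return bits_per_segment / (segment_ms / 1000.0)

g711 = codec_bitrate(8, 0.125)   # 8-bit sample every 0.125 ms -> 64 kbit/s
g729 = codec_bitrate(80, 10)     # ~80 bits per 10 ms segment  -> ~8 kbit/s
g7231 = codec_bitrate(180, 30)   # ~180 bits per 30 ms segment -> ~6 kbit/s
```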
VoIP introduces factors, like headers, that increase the bandwidth
requirement. After encoding, the code words are accumulated into
frames, usually of 20 ms. The frames are then placed in packets before
transmission. For correct delivery, headers are added to the packets.
First the IP, UDP and RTP protocols add a header each, 320 bits in
total. The transmission medium layer, typically Ethernet, adds an
additional header of 304 bits. This adds up to a total of 95.2 kbps for the
VoIP transmission, eq. (1), using G.711 with a 64 kbps payload.
VoIP transmission, eq.(1), using G.711, 64kbps payload.
(1280
frames
bits
320bits 304bits ) 50
s
frame
95200bps
(1)
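Eq. (1) can be reproduced numerically. The numbers are those given in the text: 20 ms frames (50 packets/s), 1280 bits of G.711 payload per frame, 320 bits of IP/UDP/RTP headers and 304 bits of Ethernet header:

```python
def voip_bandwidth_bps(payload_bps=64000, frame_ms=20,
                       ip_udp_rtp_bits=320, ethernet_bits=304):
    """Gross VoIP bit-rate: per-frame payload plus per-packet header overhead."""
    frames_per_s = 1000 // frame_ms                  # 50 packets per second
    payload_bits = payload_bps * frame_ms // 1000    # 1280 bits per frame for G.711
    return (payload_bits + ip_udp_rtp_bits + ethernet_bits) * frames_per_s

print(voip_bandwidth_bps())  # 95200
```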
2.1 Impairments
There are many factors influencing the quality of speech transmitted over a
network. It is possible to measure most of these factors, but it is not assured
that these measures give a correct estimation of the quality. Quality is highly
subjective; it is the user who decides whether the quality is acceptable or not.
Voice quality can be described by three key parameters [3]:
end-to-end delay – the time it takes for the signal to travel from
speaker to listener.
echo – the sound of the talker’s voice returning to the talker’s ear.
clarity – a voice signal’s fidelity, clearness, lack of distortion and
intelligibility.
The first two are often considered to be the most important, but the
relationship between the factors is complex, and if any one of the three
becomes unacceptable the overall quality is unacceptable.
2.1.1 Delay
Delay only affects the conversational quality; it doesn't introduce any
distortions to the signal. The delay in PSTN depends on the distance the
signal travels: the longer the distance, the higher the delay. In VoIP the delay is
dependent on the handling of the packets: switching, signal processing
(encoding, compression), packet size, jitter buffers, etc. [2].
Delay becomes an issue when it reaches about 250 ms; between 300 ms and
500 ms conversation is difficult, and at delays over 550 ms a normal
conversation is impossible. In PSTN the end-to-end delay is usually under
10 ms, but in VoIP networks the delay can reach 50-100 ms due to the
operations (packetization, compression, etc.) of the codec [2].
2.1.2 Echo
Like delay, echo is mainly an issue for conversational quality. It
doesn't affect the sound quality, even though a talker can perceive it as being
as disturbing as other distortions. There are two kinds of echo, acoustic
and electrical. Acoustic echo can be heard when a portion of the
speech comes out of the loudspeaker at the far end, is picked up by the
microphone and sent back to the talker. Electrical echo is introduced where a
2-wire analog line is connected to a 4-wire system. These connections are made
by hybrids, and if there is an impedance mismatch between the 2-wire and the
4-wire sides the speech will leak back to the talker. If the echo returns less than
30 ms after it was sent the talker will usually not perceive it as annoying,
depending also on the level of the echo. If the echo returns a little more
than 50 ms after the transmission the conversation will be affected and the
talker will perceive the conversation as "hollow" or "cave-like". Echo is a
bigger problem in VoIP than it is in PSTN. VoIP does not introduce more
echo, but it introduces more delay, which makes the echo more noticeable and
annoying [3].
2.1.3 Clarity
Clarity is the parameter that is the most subjective. Clarity is dependent on
the amount of various distortions introduced to and by the network. There
are several kinds of distortions that influence the clarity. Some examples [5]:
Encoding and decoding of the signal: which codec is used and what
its features are.
Time-clipping. For example front end clipping (FEC) introduced by a
Voice Activity Detector (VAD).
Temporary signal loss caused by packet loss.
Jitter. Variance in delay of received packets.
Noise. Background noise for example.
Level Clipping. When an amplifier is driven beyond its voltage or
current capacity.
Of these the following impairments are introduced in a PSTN network:
Analog filtering and attenuation in a telephone handset and on line
transmissions.
Encoding via non-uniform PCM, which introduces quantization
distortion. This has minimal impact on the clarity and is accepted in
ordinary PSTN telephony.
Bit errors due to channel noise.
Echo due to many hybrid wire junctions.
Along with VoIP new impairments have been introduced because of the new
technology of transmitting speech signals:
Low bit-rate codecs are more often used to limit the need for
bandwidth. These introduce nonlinear distortion, filtering and delay.
Front-end clipping (FEC). To lower the bandwidth requirement even
more, silence suppression is used together with VADs.
Packet losses, which introduce dropouts and time-clipping.
Packet jitter, variance in packet arrival times at the receiver. This is
limited by jitter buffers.
Packet delay, which can cause packet loss and jitter.
In this work, only impairments affecting the clarity and the listening quality
are investigated.
2.2 Quality measurements
2.2.1 Subjective Assessment
Using people to grade the quality of speech is the most accurate way to
measure the quality, since the user of the services is human. People are also
used because it is hard for instruments and machines to mimic how humans
perceive speech quality. The drawbacks are that subjective tests are very
expensive and time consuming. They are usually used in the development
phase of systems and services and are not suitable for real-time monitoring. The
International Telecommunication Union (ITU) has standardized a method for a
subjective speech quality test. It is described in the standard ITU-T P.800 [6].
Subjective tests are performed in a carefully controlled environment with a
large number of people. A large number is required since the subjective
judgment is influenced by expectations, context/environment, physiology and
mood. The large number of subjects increases the accuracy and decreases the
influence of deviating results.
The participants listen to the transmitted/processed speech samples and
grade the perceived quality according to the scale stated in P.800, see table
2.1.
Table 2.1. Opinion scale according to ITU-T P.800.

    Score   Quality of the speech
    5       Excellent
    4       Good
    3       Fair
    2       Poor
    1       Bad
After the test, the individual scores are collected and the results are treated
statistically to produce the desired information. The most common result is
the mean value. This mean is reported as the quantity MOS (Mean Opinion
Score); a MOS score of 3.6-4.2 is widely accepted as being a good score for a
network. Letters are added to state the kind of test. For a listening-only test
the notation is either MOS-LQS (listening-quality-subjective) or just MOS. For
a conversational test the notation is MOS-CQS or MOSc (table 2.2) [7].
Table 2.2. MOS notation according to P.800.1.

                             Subjective   Objective   Estimated
    Listening-Quality        MOS-LQS      MOS-LQO     MOS-LQE
    Conversational-Quality   MOS-CQS      MOS-CQO     MOS-CQE
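The statistical treatment described above is, at its simplest, the arithmetic mean of the individual ratings; a minimal sketch with a hypothetical listener panel:

```python
def mos(scores):
    """Mean Opinion Score: the average of the individual 1-5 opinion ratings."""
    return sum(scores) / len(scores)

panel = [4, 5, 3, 4, 4, 3, 5, 4]   # hypothetical ratings from eight listeners
print(mos(panel))  # 4.0
```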
In every subjective test some references should be employed; usually
Modulated Noise Reference Units (MNRUs) are used. MNRU is
standardized in ITU-T P.810 [8], which describes how to distort speech
samples in a controlled, mathematical way. The amount of MNRU distortion
is measured in dBQ, where the Q-value is the ratio in decibels between the
signal and the added speech-modulated noise; for subjective tests several
Q-values are used. These extra reference speech samples are mixed with the
original samples. After the test the reference samples thus have both a MOS-value
and a Q-value. By plotting these it is possible to obtain a relationship between
MOS and Q. Figure 2.1 [9] shows an example of a regression of this
relationship; the curve usually has this S-shape. With this relationship it is
possible to translate every MOS-score to a Q-score. The Q-scores tend to be
more language and experiment independent, which makes it possible to
compare scores from different experiments at different laboratories, something
that is not possible using the MOS-scores alone.
Figure 2.1. An example of a regression of the relationship between MOS- and Q-values.
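The MNRU principle can be illustrated with a short sketch. It adds white noise that is amplitude-modulated by the speech itself at a chosen Q; this shows only the basic idea, and the exact procedure (filtering, levels) is the one defined in ITU-T P.810:

```python
import numpy as np

def mnru(x, q_db, seed=0):
    """Add speech-modulated noise so that the signal-to-noise ratio is Q dB."""
    noise = np.random.default_rng(seed).standard_normal(len(x))
    return x + x * noise * 10.0 ** (-q_db / 20.0)  # noise follows the speech envelope

fs = 8000
x = np.sin(2 * np.pi * 440 * np.arange(fs) / fs)   # 1 s test tone as a speech stand-in
y = mnru(x, q_db=20)
snr_db = 10 * np.log10(np.mean(x ** 2) / np.mean((y - x) ** 2))
# snr_db comes out close to the requested Q of 20 dB
```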
Together with listening and conversational tests there are also talking quality
tests [6]. Listening tests are by far the most widely used since they are the
easiest to perform. The judgment is also stricter in listening tests, since the
participants will be more focused and sensitive to small impairments that
wouldn't be caught during a conversational test. In ITU-T P.800 three different
listening tests are described: the ACR, DCR and CCR methods.
ACR – Absolute Category Rating
Here the subjects are presented with sentences with a length of 6-10 s. After
each sentence the listeners rate the perceived quality according to
table 2.1. The mean value of all ratings is the MOS-LQS. ACR is the most
frequently used listening test. It works well at low Q-values (Q < 20 dB) but
shows a reduction in sensitivity at higher Q-values (good quality circuits).
One reason for the low sensitivity can be that in ACR different sentences are
often used for different systems.
DCR – Degradation Category Rating
DCR shows higher sensitivity at high Q-values (Q>20dB) than ACR. In DCR
the listeners are presented with pairs of the same sentence where the first
sample is of high quality and the second one is processed by the system. After
each pair the listeners rate the degradation of the second sample compared to
the first, unprocessed sample according to a degradation opinion scale (table 2.3).
Afterwards the degradation MOS (DMOS) is calculated.
Table 2.3. Degradation opinion scale (ITU-T P.800).

    Score   The degradation is:
    5       Inaudible
    4       Audible but not annoying
    3       Slightly annoying
    2       Annoying
    1       Very annoying
CCR – Comparison Category Rating
The CCR method is similar to DCR, but the order in which the samples
are presented to the listener is random. In half of the pairs the processed
sample is the first one, and in the other half the second sample is the
processed one. After each pair the listeners rate the quality of the second
sample compared to the quality of the first sample. The rating
is according to the comparison opinion scale in table 2.4.
Table 2.4. Comparison opinion scale.

    Score   The quality of the 2nd sample compared to the 1st is:
    3       Much better
    2       Better
    1       Slightly better
    0       About the same
    -1      Slightly worse
    -2      Worse
    -3      Much worse
This leads to a Comparison MOS (CMOS). An advantage of CCR over DCR is
the possibility to assess processes that have either degraded or improved the
speech quality.
2.2.2 Objective Assessment
Even though subjective tests give the most accurate measurement of
speech quality, objective methods are still desired. Subjective tests are,
as stated earlier, both expensive and time consuming. There have been many
different techniques for objective assessment over the years, and they can be
divided into different groups [10]. First of all, the measurements can be either
passive or active.
Passive measurements
The passive measurements are divided into planning and monitoring tools.
Planning tools
The E-model, ITU-T G.107
The E-model is a method for estimating the performance of networks. It is
used as a transmission planning tool and it is described in ITU-T G.107 [15].
The foundation of the model is eq.(2).
R = Ro - Is - Id - Ie + A    (2)
Ro is the basic signal-to-noise ratio. Is represents all impairments which occur
simultaneously with speech, for example loudness, quantization distortion
and side tone level. Id is the “delay impairment factor” which includes all
impairments due to delay and echo effects. Ie is the “equipment impairment
factor” and represents all impairments caused by the equipment; low bit-rate
codecs for example. Finally the “advantage factor” A represents the user’s
expectation of quality. For example, using a mobile phone out in the woods,
people can be more forgiving on quality issues because they are satisfied with
just being able to establish a connection.
All this sums up to the Rating Factor, R, which ranges from 0 to 100 where 100
is the highest rating, i.e. best quality. The R-value can then be converted into a
MOS-CQE (conversational-quality-estimated) or a MOS-LQE score for
comparison with other objective measurements.
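Eq. (2) and the R-to-MOS conversion defined in G.107 can be sketched as follows; the default values (Ro = 93.2 with all impairment factors zero) are the usual G.107 baseline, and the cubic mapping below is the one given in the recommendation:

```python
def e_model_r(ro=93.2, i_s=0.0, i_d=0.0, i_e=0.0, a=0.0):
    """Rating factor R = Ro - Is - Id - Ie + A (eq. 2)."""
    return ro - i_s - i_d - i_e + a

def r_to_mos(r):
    """G.107 cubic mapping from the rating factor R to an estimated MOS."""
    if r <= 0:
        return 1.0
    if r >= 100:
        return 4.5
    return 1.0 + 0.035 * r + 7e-6 * r * (r - 60.0) * (100.0 - r)

print(round(r_to_mos(e_model_r()), 2))  # about 4.41 for the default R = 93.2
```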
Monitoring tools
In passive, or non-intrusive, monitoring measurements the actual traffic is
examined; there is no need for speech samples to be sent through the system.
It is possible to monitor the system 24 hours a day, and the monitoring doesn't
affect or intrude on the system. The drawback is that the accuracy and
correlation to subjective tests are lower than for intrusive measurements.
ITU-T P.563
The ITU-T P.563 [16] describes a new standard for non-intrusive
measurements. P.563 is a single-ended method for objective speech quality
assessment. It is based on models of voice production and perception. It
measures the effects of one-way distortions and noise on speech and delivers
a MOS-score that can be mapped to a MOS-LQO score for example.
Active measurements
Active measurements are divided into electroacoustic or psychoacoustic
measurements and the basic idea is to transmit a waveform from one end of
system and receive it at the other end. The received (degraded) waveform is
then compared to the original waveform resulting in a quality score based on
the difference between the two waveforms. The advantages of these methods
are that they have the highest correlation with subjective measurements and
that the original waveform can be constructed to match the objective of the
measurements, different languages, specific distortions etc. The drawback is
that the technique uses a specific speech sample which is transmitted, not live
traffic. Among these tests signal-to-noise ratio (SNR) and total harmonic
distortion (THD) can be mentioned [10].
Electroacoustic measurements
Electroacoustic measurements were among the first objective techniques to
measure the perceived quality of waveforms [10]. One example of the earlier
methods is the 21- and 23-Tone Multifrequency Test, where a complex waveform
containing several equally spaced frequencies is transmitted through the
system. At the receiving end the signal-to-distortion ratio (SDR) is calculated
as a power ratio in decibels (dB); the ratio is an indication of the quality. This
method was soon questioned since it gave very low SDR-values for some codecs
even though the users didn't perceive any degradation. The reason was that the
codecs in question affected parts of the transmission that were not very
important for human perception. Later (1989) proposals were made to
change the multifrequency waveform to digital files containing recorded
speech. The processing was basically the same, but the method was no success
due to poor correlation with subjective test results [10 p.120].
Psychoacoustic measurements
The problem with electroacoustic measurements was that they only consider
and measure different characteristics of the transmitted signal; the actual
content of what was being transmitted was not considered. The increasing
use of communication services raised the need for weighted measurements
that consider how humans perceive different kinds of impairments.
PSQM – Perceptual Speech Quality Measure (KPN, the Netherlands)
PSQM was one of the earliest standardized methods to measure speech
quality from a human perception point of view. PSQM was standardized in
1996 through the ITU-T Recommendation P.861 [11]. The purpose was to
objectively measure the quality of telephone narrow-band (300-3400 Hz)
speech signals transmitted through different codecs under different controlled
conditions. PSQM measured the perceptual distance between the input and the
output signal. The result was a score from 0 to infinity, where 0 corresponded
to a perfect match. The objective was to map this score onto a MOS scale, but
because the results varied with the language being used there was no good
mapping function, resulting in low correlation to subjective scores [11].
Another reason for the low correlation was the weak time alignment function.
The PSQM-algorithm was developed further in 1997 to cope with these
limitations. The new method, which was included in P.861, was called PSQM+
and it had solved problems like how to judge and handle severe distortions
and time clipping.
PAMS – Perceptual Analysis Measurement System (British Telecom)
The PAMS-algorithm is based on a different signal processing technique from
PSQM. They both compare a source signal with the same signal after
transmission, but PAMS gives a score between 0 and 5, which corresponds to
the scale used in subjective MOS testing. PAMS calculates and analyses the
Error Surface to get the score. The error surface is the difference between the
Sensation Surfaces of the output and input speech samples. The score is then
the average of the error surface at different frequencies. This process is
described in the next section.
Both PSQM+ and PAMS showed unsatisfactory correlation with subjective
tests for a couple of test cases. The solution was the combination of the
perceptual model of PSQM99 (an extension of PSQM+) and the powerful time
alignment function of PAMS. The new algorithm was called Perceptual
Evaluation of Speech Quality (PESQ) and became the new standard ITU-T P.862
[12] in February 2001. With the introduction of P.862 the PSQM standard
P.861 was withdrawn. As for the earlier methods, PESQ is intended
for narrow-band telephone signals. Since this work focuses on the
performance of the PESQ-algorithm, it will be given a more elaborate explanation.
2.3 PESQ – Perceptual Evaluation of Speech Quality
Figure 2.2. The basic functionality of the PESQ-method.
Figure 2.2 shows the basic block representation of the PESQ test procedure. A
speech sample is first inserted into the system under test and then collected at
the output of the system. The collected sample is then compared to the
original speech sample in the PESQ-algorithm resulting in a PESQ Raw-score
which is mapped to obtain the highest correlation with subjective MOS-scores.
The resolution of the inserted speech sample file should be 16 bits and the
sample rate 8 kHz (PESQ is also validated for a sample rate of 16 kHz).
Figure 2.3. Block representation of the PESQ algorithm.
The PESQ-algorithm is illustrated in figure 2.3 [13]. The first step of the
processing is to compensate for any gain or attenuation of the system under
test. The signals are aligned to the same constant power level in the level
alignment block; this level is the same as the normal listening level used in
subjective tests. In the input filter the algorithm models and compensates for
the filtering that takes place in the handset of the telephone in a listening test,
it is assumed that the handset’s frequency response follows the characteristics
of an IRF (Intermediate Reference System) receiver. Since the exact filtering is
hard to characterize the PESQ is rather insensitive to the filtering of the
handset.
To enable comparison between the two signals, time alignment is required. The
degraded signal is often delayed, sometimes with variable delays, and PESQ
uses the technique from PAMS to cope with this problem. The time alignment
process is divided into two main stages: an envelope-based crude delay
estimation and a histogram-based fine time alignment. The envelope-based
approach starts by calculating the envelopes of the whole length of the
degraded and reference signal respectively. This is achieved with the help of
a voice activity detector (VAD). These envelopes are then cross-correlated by
frames in order to find the delay. This procedure yields a resolution of about
±4 ms. Subsequently the signals are divided into utterances; an utterance is a
continuous speech burst with pauses shorter than a certain length (200 ms).
These utterances are examined using the same envelope-based delay
estimation. The first step in the histogram-based estimation is to divide the
signals into frames of 64 ms with 75% overlap. These frames are
Hann-windowed and cross-correlated, and the index of the maximum from the
cross-correlation gives the delay estimate for each frame. A weighted
histogram of the delay estimates is then constructed, normalized and
smoothed by convolution with a symmetric triangular kernel of a width of
1ms. The location of the maximum in the histogram is then combined with
the previous delay estimation yielding the final delay estimation for the
utterance. The maximum is also divided by the sum of the histogram before
convolution to give a confidence measure between 0 (no confidence) and 100
(full confidence). In many cases there can be delay changes within the
utterance. To test for this each utterance is split up into smaller parts on
which the envelope- and histogram-based delay estimations are performed.
The splitting process is repeated at several points and the confidence is
measured and compared to the confidence before the split. As long as the
confidence is higher than before the splitting the process continues to find the
right delay estimation.
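The two-stage alignment above can be illustrated in code. The following is a minimal sketch of the envelope cross-correlation idea only, using a plain frame-mean envelope instead of PESQ's VAD-based one; function and parameter names are illustrative, not from the P.862 reference code:

```python
import numpy as np

def crude_delay(reference, degraded, frame=64):
    """Estimate the delay of `degraded` relative to `reference` by
    cross-correlating the two signal envelopes. A simplified sketch:
    PESQ uses VAD-based envelopes, this uses frame-mean envelopes."""
    def envelope(x):
        # Crude envelope: mean absolute value per frame.
        n = len(x) // frame
        return np.abs(x[:n * frame]).reshape(n, frame).mean(axis=1)
    e_ref, e_deg = envelope(reference), envelope(degraded)
    n = min(len(e_ref), len(e_deg))
    corr = np.correlate(e_deg[:n], e_ref[:n], mode="full")
    lag = int(np.argmax(corr)) - (n - 1)   # lag in whole frames
    return lag * frame                     # delay in samples

# Toy check: delay a speech-like burst by 640 samples and recover it.
rng = np.random.default_rng(0)
ref = rng.standard_normal(8000) * np.r_[np.zeros(2000), np.ones(4000), np.zeros(2000)]
deg = np.r_[np.zeros(640), ref][:len(ref)]
print(crude_delay(ref, deg))  # → 640 (resolution limited to one frame)
```

Note that the resolution is one whole frame; the histogram stage described above is what refines this estimate.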
The auditory transform is a psychoacoustic model that mimics the properties of human hearing. In this model the signals are mapped into an internal representation in the time-frequency domain by a short-term Fast Fourier Transform (FFT) with a Hann-window over 32ms frames. The result is components called cells, see figure 2.4. During the FFT the frequency scale is warped into a modified Bark scale, called the pitch power density, which reflects the human sensitivity at lower frequencies. This Bark spectrum is then mapped to a loudness scale (Sone) to obtain the perceived loudness in each time-frequency cell. During this mapping, equalisation is performed to compensate for filtering in the tested system and for time-varying gain [12]. The resulting representation is called the Sensation Surface.
Figure 2.4. The time-frequency cells.

In the Disturbance Processing block the sensation surface of the degraded signal is subtracted from the sensation surface of the reference signal, resulting in the Error Surface, which contains the difference in loudness for every cell. An example of an error surface is shown in figure 4.2. Two different disturbance parameters are calculated: the absolute (symmetric) disturbance and the additive (asymmetric) disturbance. The absolute disturbance is a measure of
the absolute audible error and it is obtained by examining the error surface. If the difference in the error surface is positive, components like noise have been added; if the difference is negative, parts of the signal have been lost, for example due to coding distortion. For each cell the minimum of the original and degraded loudness is computed and divided by 4. This gives a threshold which is subtracted from the absolute loudness difference; values that are less than zero after this subtraction are set to zero. This models masking: the influence of small distortions that are inaudible in the presence of loud signals is neglected. The additive disturbance is a measure of the audible errors that are significantly louder than the reference. It is calculated for each
cell by multiplying the absolute disturbance with an asymmetry factor. This asymmetry factor is the ratio of the degraded and original pitch power densities raised to the power of 1,2. Factors less than 3 are set to zero and factors above 12 are clipped to 12. As a result, only those cells where the degraded pitch power density exceeds the reference pitch power density remain, i.e. the additive disturbance captures positive disturbances only. The two disturbance parameters are aggregated along the frequency
axis, resulting in two frame disturbances. If these frame disturbances are above a threshold of 45 they are identified as bad intervals. The delay is then recalculated for these intervals, which are once again cross-correlated in the time alignment block. If this correlation is below a threshold, it is concluded that the interval matches noise against noise, and the interval is no longer considered bad. For a correlation above the threshold a new frame disturbance is calculated and replaces the original disturbance if it is smaller.
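The per-cell masking and asymmetry rules described above can be sketched as follows. This is a simplified illustration with assumed array names and cell layout, not the exact P.862 implementation:

```python
import numpy as np

def disturbances(ref_loud, deg_loud, ref_ppd, deg_ppd):
    """Per-cell absolute and additive disturbances, following the
    masking and asymmetry rules. A sketch only: the names, shapes and
    scaling are illustrative."""
    # Masking: subtract a quarter of the softer loudness, clamp at zero.
    threshold = np.minimum(ref_loud, deg_loud) / 4.0
    absolute = np.maximum(np.abs(deg_loud - ref_loud) - threshold, 0.0)
    # Asymmetry factor from the pitch power density ratio, raised to
    # 1.2, zeroed below 3 and clipped at 12.
    factor = (deg_ppd / ref_ppd) ** 1.2
    factor = np.where(factor < 3.0, 0.0, np.minimum(factor, 12.0))
    additive = absolute * factor
    return absolute, additive

# A cell with added energy (left) and a barely changed cell (right):
abs_d, add_d = disturbances(np.array([1.0, 1.0]), np.array([1.5, 1.1]),
                            np.array([1.0, 1.0]), np.array([4.0, 1.0]))
```

In the first cell the loudness difference 0,5 survives the masking threshold 0,25 and is amplified by the asymmetry factor; in the second cell the small difference is masked away entirely.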
In the Cognitive Model the frame disturbance values and the asymmetrical
frame disturbance value are aggregated over intervals of 20 frames. These
summed values are then aggregated over the entire active interval of the
speech signal.
Finally, the PESQ score is a linear combination of the average disturbance value and the average asymmetrical disturbance value and ranges from -0,5 to 4,5, eq.(3):

PESQMOS = 4,5 - 0,1·dSYM - 0,0309·dASYM    (3)

where dSYM is the average disturbance value and dASYM is the average asymmetrical disturbance value.
This PESQ Raw-score shows in some cases poor correlation with MOS-LQS. To obtain higher correlation the PESQ Raw-score is usually mapped to the MOS-LQO (MOS-Listening Quality Objective) score (ITU-T P.862.1, 11/2003 [14]). The mapping function, shown in eq.(4) and in figure 2.5, gives a score from 1,02 to 4,55, which corresponds to the P.800 MOS-LQS, see table 2.1. The maximum value 4,5 for the PESQ-score was chosen because it is the same as for a clear and undistorted condition in a typical ACR-LQ test.

y = 0,999 + (4,999 - 0,999) / (1 + e^(-1,4945x + 4,6607))    (4)

where x represents the PESQ Raw-score and y the MOS-LQO score.
Figure 2.5. The MOS-LQO mapping function (MOS-LQO vs. P.862 PESQ Raw-score).
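Equations (3) and (4) are simple enough to state directly in code; the following sketch reproduces the published constants:

```python
import math

def pesq_raw(d_sym, d_asym):
    """PESQ raw score, eq.(3): a linear combination of the average
    symmetric and asymmetric disturbance values."""
    return 4.5 - 0.1 * d_sym - 0.0309 * d_asym

def mos_lqo(x):
    """ITU-T P.862.1 mapping from raw score to MOS-LQO, eq.(4)."""
    return 0.999 + (4.999 - 0.999) / (1.0 + math.exp(-1.4945 * x + 4.6607))

# Zero disturbance gives the raw-score ceiling 4,5, which maps near 4,55:
print(round(mos_lqo(pesq_raw(0.0, 0.0)), 2))  # → 4.55
```

The endpoints of the mapping reproduce the range quoted in the text: a raw score of -0,5 maps to about 1,02 and a raw score of 4,5 to about 4,55.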
The produced MOS-LQO score estimates the listening quality only; it takes no account of impairments that influence the conversational quality (MOS-CQO), such as delay, jitter, echo, sidetone and the level of the incoming speech.
Table 2.5. Comparison between some objective methods. The average and worst-case
correlation coefficients for 38 subjective tests are shown.

No. of tests  Type             Corr.coeff.   PESQ    PAMS    PSQM    PSQM+
19            Mobile network   average       0,962   0,954   0,924   0,935
                               worst-case    0,905   0,895   0,843   0,859
9             Fixed network    average       0,942   0,936   0,881   0,897
                               worst-case    0,902   0,805   0,657   0,652
10            VoIP/multitype   average       0,918   0,916   0,674   0,726
                               worst-case    0,810   0,758   0,260   0,469
Table 2.5 [13] shows a comparison between PESQ, PAMS, PSQM and PSQM+.
The table shows the correlation coefficients for the different algorithms
compared with 38 subjective tests. The conclusion is that PESQ has the
highest correlation in both average and worst-case. PESQ shows high
accuracy for a wide range of conditions. For some conditions PAMS is close
but it is less accurate for some other conditions. PSQM and PSQM+ show
lower correlation in conditions including VoIP, packet loss etc.
2.4 Intelligibility
In communications, quality is closely related to intelligibility (the degree to which speech can be understood); high quality usually means high intelligibility. However, it is important to distinguish between the two; in many cases intelligibility is crucial while high quality is a desirable bonus [17]. Even though they correlate well in many cases, the relationship is much less clear in others. For example, a small quality drop can have a big influence on the intelligibility. On the other hand, even at low quality scores it might be possible to apprehend and understand the transmitted information without great effort. It is also possible to improve the quality while decreasing the intelligibility, and vice versa. An example is the use of noise suppression schemes to lower the background noise and improve the perceived quality; these systems tend to decrease the intelligibility [18]. The PESQ-algorithm was not developed to assess speech intelligibility. However, since there is a relation between quality and intelligibility, it might be possible to extend the PESQ-algorithm to correlate well with subjective intelligibility tests. Research is being done to investigate this relation and how PESQ performs in intelligibility tests [18], [19] and [20].
2.4.1 Subjective measurements
Just like quality, intelligibility is a subjective judgment indicating how well a human listener can decode speech information [17]. It is measured using statistical methods where trained talkers speak standardized word lists through the system under evaluation. The words are received at the far end and trained listeners try to recognize which words have been spoken. There are a number of different standardized word lists to use; one is the Modified Rhyme Test (MRT) [21], [22]. It consists of 50 six-word lists of rhyming words. The whole list is presented to the listener and the talker pronounces one of the six words in each list. The listener marks the word he thinks was spoken. After the test has been done by at least five listeners, the results are collected and treated statistically to extract the desired information. Table 2.6 shows the first five rows of six rhyming words in the MRT.
Table 2.6. The first five rows of words in the MRT.

went   sent   bent   dent   tent   rent
hold   cold   told   fold   sold   gold
pat    pad    pan    path   pack   pass
lane   lay    late   lake   lace   lame
kit    bit    fit    hit    wit    sit
A similar method is the Diagnostic Rhyme Test (DRT). It consists of 96 rhyming pairs of words constructed from a consonant-vowel-consonant sound sequence. Examples of the word pairs are presented in table 2.7. The words differ only in the initial consonant and are chosen so that the result can be interpreted to show which kinds of consonants are hard to recognize, and thus to pinpoint what needs to be altered in the system to achieve adequate intelligibility. Consonants are chosen because they are more important for the intelligibility than vowels [10]. They are also more sensitive to additive impairments like noise, tones etc., as they contain on average 20 times less power than vowels. Since consonants are shorter in duration, 10-100ms, compared to vowels, 10-300ms, they are also more sensitive to losses and additive pulses.
Table 2.7. Examples of word pairs in the DRT. Specific features of speech are also shown.

Voicing     Nasality    Sustenation  Sibilation   Graveness   Compactness
veal feel   meat beat   vee bee      zee thee     reed weed   yield wield
bean peen   need deed   sheet cheat  cheep keep   peak teak   key tea
gin chin    bit mitt    bill vill    gilt jilt    bid did     hit fit
dint tint   dip nip     thick tick   thing sing   fin thin    gill dill
zoo sue     moot boot   pooh foo     goose juice  moon noon   coop poop
The subjective intelligibility tests result in a percentage, 0-100%, representing the proportion of words that were recognized correctly. These results are more straightforward to interpret than the corresponding subjective MOS (1-5). The MOS-score reflects more impressions than just intelligibility, and the scores can vary quite a lot among different listeners [17].

These subjective tests do not always reflect reality. In normal life speech is made up of sentences, which increases the intelligibility because of the flow of words. The MRT and the DRT consist of random words; even when equally distorted, real-life sentences are perceived as having higher overall intelligibility.
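The basic scoring of such a rhyme test is a percent-correct computation; a minimal sketch (the full statistical treatment, and any correction for guessing, is omitted):

```python
def intelligibility_score(responses):
    """Percent of words recognized correctly in an MRT/DRT-style test.
    `responses` is a list of (spoken, marked) pairs; the pair format is
    an assumption for this illustration."""
    correct = sum(1 for spoken, marked in responses if spoken == marked)
    return 100.0 * correct / len(responses)

# Three of four words recognized correctly:
print(intelligibility_score([("went", "went"), ("cold", "told"),
                             ("pat", "pat"), ("lane", "lane")]))  # → 75.0
```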
2.4.2 Objective measurements
There are a couple of indices for objective speech intelligibility. The two most
fundamental are the Speech Transmission Index (STI) [23] and the Speech
Intelligibility Index (SII) [24]. The STI gives a number between 0 and 1 where 1
represents good intelligibility and low influence of acoustical system
properties and/or background noise (compare with subjective tests, 0-100%).
The STI is based on the assumption that speech can be described as a
fundamental waveform which is modulated by low-frequency signals [23].
The STI-score is calculated from the Modulation Transfer Function (MTF) of the
system (figure 2.6). The MTF is the reduction in the modulation index of the signal between the transmitter, m1, and the receiver, m2, eq.(5).

Figure 2.6. The STI-method.

MTF = m2 / m1    (5)
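As a sketch, eq.(5) can be computed from measured intensity envelopes. The modulation index is taken here as the classic (max - min)/(max + min) depth of a sinusoidal modulation, which is a simplification of how m is measured in the full STI procedure:

```python
import numpy as np

def modulation_index(envelope):
    """Modulation depth of an intensity envelope, taken here as the
    classic (max - min) / (max + min) of a sinusoidal modulation."""
    return (envelope.max() - envelope.min()) / (envelope.max() + envelope.min())

def mtf(env_tx, env_rx):
    """eq.(5): reduction in modulation index from transmitter to receiver."""
    return modulation_index(env_rx) / modulation_index(env_tx)

t = np.linspace(0.0, 1.0, 1000)
tx = 1.0 + 0.8 * np.sin(2 * np.pi * 4 * t)   # m1 = 0.8 at the talker
rx = 1.0 + 0.4 * np.sin(2 * np.pi * 4 * t)   # m2 = 0.4 after the channel
print(round(mtf(tx, rx), 2))  # → 0.5
```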
The SII-method is described in [24]. It is a development of the STI-method; it works in a similar way and produces scores between 0 and 1.

Correlation

The SII-method correlates well with subjective tests. A problem is that this objective method, along with others, is limited to linear systems; testing modern applications such as low bit-rate coding does not produce well-correlated scores. Research is being done to find new objective methods that can deal with these non-linear systems [25].
2.5 Future measurement methods
The area of objective speech quality measurement is expanding fast. It is becoming more and more common for specifications of communication solutions to include demands on speech quality. A couple of years ago this was not the case; quality scores had to be measured subjectively, which was far too expensive and time consuming for everyday use. Today the objective tools have become accurate, fast and cheap enough for extensive usage. Subjective tests are still more accurate, but given the benefits of objective measurements they will continue to expand into new areas of usage.
The research of today is struggling with the following tasks:

- Higher correlation for non-intrusive measurements, like P.563, with subjective tests. Today intrusive measurements, like P.862, give more accurate results of the speech quality. A new ITU-T standard is under development with the working name P.VTQ. It is a tool for predicting the impact of IP-network impairments on quality and for monitoring the transmission quality. It uses metrics from the RTCP-XR (RTP Control Protocol-Extended Report) to calculate the quality and gives a MOS-score on the ACR Listening Quality Scale [26].

- Combining quality and intelligibility measurements: extending the PESQ-algorithm to include intelligibility measurements and give a common score for both quality and intelligibility.

- Making intelligibility measurements work in VoIP applications. The objective STI-method is inaccurate in non-linear and time-variant packetized networks.

- Modifying and extending in-service standards like ITU-T P.862 and ITU-T G.107 to have the same accuracy for VoIP systems as in “old” PSTN networks. ITU-T P.OLQA is a new standard under development. It will be the “universal” model for objective prediction of listening quality and will include not only speech but other new 3G-applications [26].

- Developing tools for predicting and monitoring conversational quality from mouth to ear. This includes both the electrical connection and the acoustical part. ITU-T P.CQO is a new standard being developed to deal with this task [26].
3 Theory
The main objective of this work is to investigate how different impairments degrade the quality of speech in ATM (Air Traffic Management) radio. The ATM system under consideration is the Climax system.
3.1 CLIMAX
Climax, or the offset carrier system, is an Air Traffic Control (ATC) communications system working in the VHF-band (30-300MHz). Deployment of Climax started in the United Kingdom in the 1960s and the system is now widespread in Europe. Sweden does not use the Climax system, but it becomes an issue for Swedish aircraft flying to countries where the system is used. This multi-carrier system is intended for ground-to-air communication and is based on the idea of having 2-5 transmitters transmitting on the same frequency with a slight offset. Climax offers greater ground coverage, higher redundancy and better coverage at low altitude and in harsh environments [27].

Climax is limited to a 25kHz channel-spacing environment. This can cause problems since 8.33kHz channel spacing is spreading in Europe because of the need for more available frequencies; an 8.33kHz receiver does not have enough bandwidth to operate correctly in a Climax environment.
To prevent audible beats (homodynes) caused by frequency differences, the multiple carriers are separated in frequency according to table 3.1 [28].
Table 3.1. Frequency arrangement for Climax channels (fc is the assigned channel
frequency).

No. of        Leg 1 Tx     Leg 2 Tx     Leg 3 Tx     Leg 4 Tx     Leg 5 Tx
Climax Legs   frequency    frequency    frequency    frequency    frequency
2             fc +5kHz     fc -5kHz
3             fc +7.5kHz   fc           fc -7.5kHz
4             fc +7.5kHz   fc -7.5kHz   fc +2.5kHz   fc -2.5kHz
5             fc -2.5kHz   fc -7.5kHz   fc           fc +2.5kHz   fc +7.5kHz
3.1.1 Operations
The Air Traffic Control Centre (ATCC) transmits the audio signal to all the transmitters, and the air plane receives the signal from all the transmitters within coverage (figure 3.1). The pilot then hears all the incoming transmissions simultaneously.
Figure 3.1. The basic structure of Climax. The air plane receives the signal from three antennas.
When the pilot receives multiple transmissions there is a great risk that the transmissions are mutually delayed due to different transmission paths. It is crucial that this delay does not reduce the quality and intelligibility of the transmitted speech. The European Organisation for Civil Aviation Equipment (EUROCAE) has proposed that this delay difference may vary between 0 and 10ms for Climax. For differences above 10ms, echo effects start to become disturbing and annoying [29].
For air-to-ground communication the system works differently. The transceiver on the air plane transmits at the centre frequency and the transmission is received at each aerial within range. The ATCC then selects the aerial with the best performance by Best Signal Selection (BSS), so that only one transmission reaches the air traffic controller (figure 3.2) [27].
Figure 3.2. The air plane transmits to ATCC, BSS is used for better performance.
3.2 Impairments
For the investigation, five main impairments are examined in this work:

I.   The Climax case (delay).
II.  Speech with added noise.
III. Speech with an added tone.
IV.  Packet (frame) losses.
V.   Speech with added noise pulses.
3.2.1 Case I
Here the unique feature of the Climax system was investigated: it was simulated that a pilot receives two transmissions with the same speech from two different transmitters. One of the speech samples was delayed to simulate the echo effect that the pilot perceives; longer delays give a more disturbing echo. How disturbing the echo gets also depends on the propagation loss, which may differ between the two paths, resulting in different levels of the two received signals. Only simulations were investigated; no real radio transmissions were made. Most of the delay originates in the equipment used for the transmission; the propagation time in air is negligible. The propagation loss, on the other hand, depends mostly on the transmission path in the air. Examples of measured delays are shown in table 3.2 [30].
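The simulation in Case I amounts to mixing a signal with a delayed, optionally attenuated copy of itself; a minimal sketch (function and parameter names are illustrative):

```python
import numpy as np

def climax_mix(speech, delay_samples, attenuation_db=0.0):
    """Mix a speech signal with a delayed, optionally attenuated copy of
    itself, simulating two Climax transmissions arriving over paths of
    different length and loss. A sketch; names are illustrative."""
    gain = 10.0 ** (-attenuation_db / 20.0)
    delayed = np.r_[np.zeros(delay_samples), speech] * gain
    padded = np.r_[speech, np.zeros(delay_samples)]
    return padded + delayed

# 1 ms delay at 8 kHz is 8 samples; the second path attenuated 3 dB.
speech = np.random.default_rng(1).standard_normal(8000)
mixed = climax_mix(speech, delay_samples=8, attenuation_db=3.0)
print(len(mixed))  # → 8008
```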
Table 3.2. Examples of measured delays from Sundsvall ATCC.

Station      Round-trip (ms)   One-way (ms)   Delay difference (ms)
Arvidsjaur   13,4              6,70           0,0
Gällivare    16,1              8,05           1,4
Måttsund     18,1              9,05           2,4
Storuman     18,7              9,35           2,7
The measurements in table 3.2 were made at the Sundsvall ATCC in Sweden. The one-way latency is obtained by assuming that the round-trip latency is twice the one-way latency. The delay difference is the delay relative to the shortest one-way latency. It should be noted that the stations operated at 125,60MHz and that these measurements were performed on TDM-connections, not VoIP.
3.2.2 Case II and III
For the cases with added noise or tones the Signal-to-Noise Ratio (SNR) was the measure being altered. The SNR is a measure of the level of the desired signal compared to the level of the background noise. The SNR is measured in decibels (dB) and is calculated as eq.(6):

SNR(dB) = 10 log(Psignal / Pnoise) = Psignal(dB) - Pnoise(dB)    (6)

For the cases investigated in this work, Psignal is the average RMS power of the entire clean signal and Pnoise is the same measure for the noise/tone.
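Eq.(6), together with the scaling used to reach a target SNR when mixing, can be sketched as (the helper names are ours, not from the thesis):

```python
import numpy as np

def snr_db(signal, noise):
    """eq.(6): SNR in dB from the average RMS powers of the clean
    signal and the noise/tone."""
    return 10.0 * np.log10(np.mean(signal ** 2) / np.mean(noise ** 2))

def scale_noise_to_snr(signal, noise, target_snr_db):
    """Scale `noise` so that mixing it with `signal` gives the target SNR."""
    return noise * 10.0 ** ((snr_db(signal, noise) - target_snr_db) / 20.0)

speech = np.ones(8000)          # stand-in "speech" with unit RMS power
noise = 0.5 * np.ones(8000)     # noise at a quarter of the power
print(round(snr_db(speech, noise), 2))  # → 6.02
```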
Where noise was added, the influence of both white and pink noise was investigated. White noise contains all frequencies with the same probability and mean energy; its power is evenly distributed over all frequencies. Pink noise emphasizes the lower part of the frequency spectrum: it distributes its energy evenly over all octaves, i.e. the power density decreases by 3dB/octave towards higher frequencies. This makes it, for example, more pleasant to listen to.
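One common way (among several) to generate pink noise is to shape white noise with a 1/√f amplitude spectrum, which gives the -3dB/octave power-density slope described above; a sketch:

```python
import numpy as np

def pink_noise(n, rng=None):
    """Pink noise by shaping white noise with a 1/sqrt(f) amplitude
    spectrum, i.e. power ~ 1/f, the -3dB/octave slope. One common
    construction among several."""
    rng = rng or np.random.default_rng()
    white = rng.standard_normal(n)
    spectrum = np.fft.rfft(white)
    f = np.fft.rfftfreq(n)
    f[0] = f[1]                    # avoid division by zero at DC
    spectrum /= np.sqrt(f)
    return np.fft.irfft(spectrum, n)

samples = pink_noise(8000, np.random.default_rng(2))
print(len(samples))  # → 8000
```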
3.2.3 Case IV
In a digital voice transmission the speech is divided into packets and frames, usually containing 20ms of speech each. Depending on the performance and load of the network, some of these packets can be lost during transmission. Losses can occur if the network is congested, i.e. components receive too many packets, making their buffers overflow and causing packets to be discarded. Congestion can lead to packet rerouting, which can result in packets arriving too late at the jitter buffers, again leading to packet discarding. Individual packets can also be discarded by applications because they are damaged by bit errors due to circuit noise or equipment malfunction [2], [3].
The effect of a packet loss on quality depends on many factors. First, what is the content of the lost packet? Lost packets containing speech naturally affect quality more than lost packets containing silence. The kind of speech the packets contain is also important: whether they contain vowel or consonant sounds, whether they occur at the beginning or end of a syllable, or whether a whole syllable is lost. The time at which the packet is lost also matters, especially when dealing with bursts of lost packets [31]. For example, bursts towards the end of a telephone call are subjectively perceived as worse for quality than bursts at the beginning of the call. Another factor influencing the effect of packet loss is which codec is used. Waveform codecs like G.711 encode the whole waveform, with no compression and a high bit-rate. Use of such a codec affects the perceived quality much less than perceptual codecs like G.729, G.723 and G.721, which encode only the relevant part of the voice signal.
The PESQ-algorithm has earlier been tested and verified for packet losses
with a normal distribution. Studies have also been made to investigate how
the PESQ measures the impact of specific packet losses [32].
3.2.4 Case V
The case with noise pulses occurs when an analog radio is disturbed by transmitters using frequency hopping. When the signals of the transmitters are mixed together, intermodulation products are created, and if these products coincide with the frequency of the radio a noise pulse is perceived by the radio, see table 3.3. The length of the noise pulse is the time the frequency hopper remains at that specific frequency. An example is the Bluetooth® technology, which uses frequency hopping; it changes channels 1600 times per second, i.e. it remains 0,625ms on a channel [33].
Table 3.3. Examples of intermodulation products.

Intermodulation products (fA)
f1 + f2 = fA
f1 - f2 = fA
2f1 - f2 = fA
2f2 - f1 = fA
3.3 Objective measurements
As PESQ is the most accurate and most used objective speech quality tool, it was used for the investigation of the five cases. Several files were prepared to cover most realistic real-life cases. The files were examined using software in which the PESQ-algorithm is implemented. Each tested file resulted in a MOS-LQO score.
3.4 PESQ-verification
Before the testing of the five cases, the software itself was tested to make sure that it worked as expected. Speech files with impairments that are neglected by the PESQ were examined and were expected to result in a maximum quality score. To fully verify the algorithm, a conformance test can be made according to ITU-T P.862, Amendment 2 [36]. This test contains three test cases for narrow-band operation where the test scores of the enclosed files should not diverge from the scores of a reference implementation by more than a certain value. Test case number 2 of the three specified cases was performed; case 2 validates P.862 with variable delay.
3.5 Subjective measurements
To get a hint of whether the PESQ judges the impaired files correctly, a subjective test was performed with some of the files from the objective measurements. The Absolute Category Rating method used delivers a MOS-LQS score, which was compared with the objective MOS-LQO score. The results should be treated very carefully though; to be able to compare the numeric values, the subjective test needs to be performed in a standardized way in a tightly controlled environment [6]. No calibration with MNRUs [8] was performed, for example; the test could therefore not be repeated somewhere else with accurate results. The test was only made to investigate how PESQ ranks the files compared to the subjects; does PESQ, for example, rank added noise vs. an added tone differently than humans?
4 Methods
4.1 PESQ-verification
4.1.1 Test with neglected impairments
To verify the accuracy of the software, four tests were made with impairments not considered by the PESQ. The degraded speech file was processed with (table 4.1):

1. added silence in the beginning to simulate end-to-end delay.
2. amplified and lowered level.
3. a combination of added silence and changed level.
4. no processing, just resaved.

All these cases should result in maximum quality if the software works correctly. The test was made with four Swedish voices, four English voices and four Russian voices, two males and two females of each language. Table 4.2 shows the delays and levels of case 3 and table 4.3 shows which files were used.
Table 4.1. Numerical values for the test cases for the PESQ-verification.

Case 1       Case 2       Case 3      Case 4
Delay (ms)   Level (dB)   Comb.(nr)   File (nr)
10           -15          1           1
50           -10          2           2
75           -7,5         3           3
140          -4,3         4           4
213          -2,1         5           5
538          -0,2         6           6
973          0,3          7           7
1688         2,2          8           8
2222         4,4                      9
             7,6                      10
             10,1                     11
             15,1                     12
Table 4.2. Tested values for case 3.

Combination number   Delay (ms)   Level (dB)
1                    50           +7,6
2                    140          -2,1
3                    538          -10
4                    1688         +4,4
5                    50           -15
6                    140          +10,1
7                    538          +0,3
8                    1688         -7,5
Table 4.3. Tested files/languages for case 4.

File number   File
1             swe_f2
2             swe_f7
3             swe_m5
4             swe_m8
5             B_eng_f1
6             B_eng_f3
7             B_eng_m3
8             B_eng_m6
9             Ru_f2
10            Ru_f5
11            Ru_m1
12            Ru_m7
4.1.2 Conformance test, P.862, Amendment 2
The enclosed 40 file pairs of the second test case were tested. To pass the test, the absolute difference between the PESQ Raw-score of the tested implementation and that of the reference implementation should not be greater than 0,05 for a single file pair, nor 0,5 over all pairs.
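The pass criterion can be checked mechanically; the sketch below encodes one interpretation of the stated limits (at most 0,05 deviation for any single file pair, at most 0,5 summed over all pairs):

```python
def conformance_passed(test_scores, reference_scores,
                       per_pair_limit=0.05, total_limit=0.5):
    """Check the P.862 Amendment 2 pass criterion as read from the
    text: every |test - reference| raw-score difference within
    per_pair_limit, and the differences summed over all pairs within
    total_limit. The summation reading is our interpretation."""
    diffs = [abs(t - r) for t, r in zip(test_scores, reference_scores)]
    return max(diffs) <= per_pair_limit and sum(diffs) <= total_limit
```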
4.2 Objective quality measurements
The PESQ-algorithm requires two samples of the speech: one reference file and a copy of the reference processed by the system or by hand. These two samples are compared by the algorithm, resulting in a PESQ Raw-score. The speech samples should be speech-like; naturally recorded speech may be used but artificial voice should be avoided. The samples should have the right level, typically -26dBov (dB relative to the overload point of the system), and should be around 8s in duration. The reference should be as distortion-free as possible. The sample rate should be 8kHz or 16kHz for both the reference and the degraded sample, and the sample resolution should be 16-bit. The degraded sample should be recorded at a level where amplitude clipping is avoided.
Speech samples from ITU-T P.50 [34] were used for these tests; P.50 consists of speech samples in several languages, with 8 female and 8 male voices for each language. In the rest of this report these files are denoted by an abbreviation of the language followed by the gender and the file number; for example, the fourth Swedish male is noted swe_m4. For these tests Swedish was used, with the 8 male and the 8 female voices. The samples were first saved as 16-bit Windows PCM wave. The samples were then copied and distorted with different impairments according to table 4.4. In almost all real transmissions the speech is transferred through a codec. To simulate this, all the distorted samples were saved in ITU-T G.711 A-law wave. G.711 [4] is an 8-bit waveform-preserving codec used in PSTN networks. The samples then needed to be resaved as 16-bit PCM wave to work with the PESQ-algorithm.
To examine the files with the PESQ-algorithm, the software “Voice Quality Testing” (VQT) from GL Communications Inc.® was used. Figure 4.1 shows a screen dump of the VQT main screen, viewing 2-D representations of the two sensation surfaces and the error surface. Figure 4.2 shows the 3-D sensation surface for a degraded file; the x-axis shows time, the y-axis frequency and the z-axis loudness. 15 cases plus a reference case were tested (table 4.4).
Figure 4.1. Screen dump of the GL VQT software.
Figure 4.2. The Sensation Surface for a degraded file.
Table 4.4. Overview of the different test cases.

Test case   Impairment                       Used file
Reference   G.711                            multiple
1           Delay                            swe_m3, swe_f4
2           Delay with changed level         swe_f4
3           Pink noise                       swe_m3
4           White noise                      swe_m3
5           400Hz tone                       swe_m3
6           1200Hz tone                      swe_m3
7           Packet losses                    swe_m3, swe_f1
8           Packet losses with concealment   swe_m3, swe_f1
9           Noise pulses                     swe_m3
10          Different speech                 swe_m1, swe_f4
11          Delay and pink noise             multiple
12          Delay and 400Hz tone             multiple
13          Delay and 1200Hz tone            multiple
14          Delay and packet loss            swe_f1
15          Delay and noise pulses           swe_m3
Reference case, G.711
The following files were used to investigate how much the G.711-codec itself
lowered the quality: swe_m1, swe_m2, swe_m3, swe_m5, swe_m7, swe_m8,
swe_f1, swe_f2, swe_f3, swe_f4, swe_f5, swe_f6, swe_f8, Ru_m1, Ru_f4,
B_eng_f7, B_eng_m2, Fr_m6, Fr_f2.
Case 1
Two identical files were mixed together to represent the degraded file. One of the files was delayed relative to the other before mixing, simulating what a pilot can hear in his headset.
The tested delays were (ms): 0,1 0,25 0,5 0,75 1,0 1,5 2,0 2,5 3,0 4,0 5,0 6,0 7,5 10
12,5 15 20 50.
Case 2
As case 1 but the level of the delayed file was first attenuated 3dB and then
6dB before mixing.
Case 3
Speech and pink noise were mixed together and investigated; the SNR was
the parameter that was altered.
SNR(dB) = speech average RMS power (dB) – noise average RMS power (dB).
Tested SNRs (dB): -20 -15 -10 -6 -5 -3 0 3 5 6 10 15 20 25 30 35 40.
Case 4
As case 3 but with white noise instead.
Case 5
A speech file and a sine tone were mixed together. The tone was at 400Hz and different SNRs were tested. The SNR was calculated as:
SNR(dB) = signal average RMS power (dB) – tone average RMS power (dB).
Tested SNRs (dB): -25 -20 -15 -10 -6 -5 -3 0 3 5 6 10 15 20 25 30 35 40.
Case 6
As case 5 but the tone was at 1200Hz instead.
Case 7
Speech with simulated packet losses. Each packet contained 20ms of speech and every lost packet was replaced with silence. The losses were evenly spread throughout the sentence.
Two sentences were tested and the loss densities (s-1) were:
0,1 (0,2%), 0,2 (0,4%), 0,3 (0,6%), 0,5 (1%), 0,7 (1,4%), 1 (2%), 2 (4%), 4 (8%), 10
(20%), 20 (40%).
The numbers in brackets represent how much of the sentence was lost.
Case 8
As case 7 but the lost packets were replaced by the preceding packet. Each
packet contained 20ms of speech and the tested loss densities (s -1) were:
0,1 (0,2%), 0,2 (0,4%), 0,3 (0,6%), 0,5 (1%), 0,7 (1,4%), 1 (2%).
The numbers in brackets represent how much of the whole sentence was lost.
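Cases 7 and 8 differ only in how a lost packet is filled; a minimal sketch of both (parameter names are illustrative):

```python
import numpy as np

def drop_packets(speech, rate, loss_density, conceal=False, packet_ms=20):
    """Remove evenly spread 20ms packets from `speech` (sampled at
    `rate`), replacing each lost packet with silence (Case 7) or with
    the preceding packet (Case 8). `loss_density` is losses per second,
    as in the case descriptions."""
    out = speech.copy()
    packet = int(rate * packet_ms / 1000)   # samples per packet
    step = int(rate / loss_density)         # samples between losses
    for start in range(step, len(out) - packet, step):
        if conceal and start >= packet:
            out[start:start + packet] = out[start - packet:start]
        else:
            out[start:start + packet] = 0.0
    return out

clean = np.ones(16000)                      # stand-in 2 s sentence at 8 kHz
lossy = drop_packets(clean, 8000, loss_density=1.0)
print(lossy[8000], lossy[0])  # → 0.0 1.0
```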
Case 9
Speech with noise pulses which replaced the speech. Different lengths and levels of the noise pulses and different pulse densities were tested. The pulses were evenly spread over the sentence, but their locations were chosen where there was speech and not silence. The same locations in the sentence were used for all pulse lengths and levels.
SNR(dB) = Peak value speech (dB) – Peak value noise (dB).
Tested SNRs (dB): -6 -3 0 3 6 9 12.
Tested pulse densities (s -1): 0,1 0,2 0,3 0,5 0,7 1,0 2,0 4,0.
a, pulse length = 1ms
b, pulse length = 3ms
c, pulse length = 5ms
d, pulse length = 7ms
e, pulse length = 10ms
f, pulse length = 15ms
g, pulse length = 20ms
Table 4.5. How much (in %) of the sentence was noise.

Pulse densities (s-1)   a,     b,     c,     d,     e,     f,     g,
0,1                     0,01   0,03   0,05   0,07   0,1    0,15   0,2
0,2                     0,02   0,06   0,1    0,14   0,2    0,3    0,4
0,3                     0,03   0,09   0,15   0,21   0,3    0,45   0,6
0,5                     0,05   0,15   0,25   0,35   0,5    0,75   1,0
0,7                     0,07   0,21   0,35   0,49   0,7    1,05   1,4
1,0                     0,1    0,3    0,5    0,7    1,0    1,5    2,0
2,0                     0,2    0,6    1,0    1,4    2,0    3,0    4,0
4,0                     0,4    1,2    2,0    2,8    4,0    6,0    8,0
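The entries of table 4.5 follow directly from the definitions: the noise percentage is the pulse density (s-1) times the pulse length. As a check:

```python
def noise_fraction(pulse_density_per_s, pulse_length_ms):
    """Percentage of the sentence that is noise, as tabulated in
    table 4.5: pulse density (s-1) times pulse length."""
    return pulse_density_per_s * pulse_length_ms / 1000.0 * 100.0

# Reproducing two entries of table 4.5:
print(noise_fraction(0.5, 10))   # → 0.5  (column e, density 0,5)
print(noise_fraction(4.0, 20))   # → 8.0  (column g, density 4,0)
```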
Case 10
Two different sentences were mixed together and tested. The file swe_m1 kept its level throughout the whole test, but the level of swe_f4 was changed to test different SNRs.
SNR(dB) = swe_m1 average RMS power (dB) – swe_f4 average RMS power (dB).
Tested SNRs (dB): -15 -10 -6 -5 -3 -2,5 0 2,5 3 5 6 10 15 20 25 30 35 40.
Case 11
Combinations of added delay and pink noise were investigated.
Delay (ms): 0,1 0,25 0,5 1,0 4,0 10 20.
SNRs (dB): -20 -15 -10 -6 -5 -3 0 3 5 6 10 15 20 25 30 35 40.
a, delay: 0,1ms, swe_m3
b, delay: 0,25ms, swe_f5
c, delay: 0,5ms, swe_m7
d, delay: 1,0ms, swe_f8
e, delay: 4,0ms, swe_m5
f, delay: 10ms, swe_f3
g, delay: 20ms, swe_m2
Case 12
Combinations of added delay and a 400Hz sine tone were tested.
Delay (ms): 0,1 0,25 0,5 1,0 4,0 10 20.
SNRs (dB): -25 -20 -15 -10 -6 -5 -3 0 3 5 6 10 15 20 25 30 35 40.
a, delay: 0,1ms, swe_m3
b, delay: 0,25ms, swe_f5
c, delay: 0,5ms, swe_m7
d, delay: 1,0ms, swe_f8
e, delay: 4,0ms, swe_m5
f, delay: 10ms, swe_f3
g, delay: 20ms, swe_m2
Case 13
Combinations of added delay and a 1200Hz sine tone were tested.
Delay (ms): 0,1 0,25 0,5 1,0 4,0 10 20.
SNR (dB): -25 -20 -15 -10 -6 -5 -3 0 3 5 6 10 15 20 25 30 35 40.
a, delay: 0,1ms, swe_m3
b, delay: 0,25ms, swe_f5
c, delay: 0,5ms, swe_m7
d, delay: 1,0ms, swe_f8
e, delay: 4,0ms, swe_m5
f, delay: 10ms, swe_f3
g, delay: 20ms, swe_m2
Case 14
Speech with packet losses and added delay. Packets of 20ms were lost. The delayed file did not contain any packet losses.
Delays (ms): 0,10 0,25 0,50 1,0 4,0 10.
The delayed file was tested at three levels: the same as the original file, attenuated 3dB and attenuated 6dB.
Tested losses (s-1): 0,5 (1%), 1,0 (2%), 2,0 (4%).
The number in brackets represents how much of the sentence was lost.
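The "added delay" impairment used throughout these cases — a second copy of the speech summed in after a delay and optionally attenuated, as when two Climax transmitters are received at once — can be sketched as below. The sample-wise summation and the function name are assumptions, not the thesis's actual file preparation.

```python
import numpy as np

def add_delayed_copy(speech, fs, delay_ms, attenuation_db=0.0):
    """Sum the speech with a delayed (and optionally attenuated) copy of
    itself, simulating overlapping Climax transmissions."""
    shift = int(round(fs * delay_ms / 1000.0))
    delayed = np.concatenate([np.zeros(shift), speech])[:len(speech)]
    return speech + delayed * 10 ** (-attenuation_db / 20.0)
```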
Case 15
Combinations of added delay and noise pulses of 20ms. The noise pulses were evenly spread throughout the sentence and added on top of the speech.
Delays (ms): 0,1 0,25 0,5 1,0 4,0 10.
Pulse densities (s-1): 0,1 (0,2%), 0,2 (0,4%), 0,3 (0,6%), 0,5 (1,0%), 0,7 (1,4%), 1,0 (2,0%), 2,0 (4,0%), 4,0 (8,0%), 10 (20%), 20 (40%).
The number in brackets represents how much of the whole sentence was noise.
4.3 Subjective measurements
To be able to compare with the objective results, a subjective test was performed based on the Absolute Category Rating (ACR) method [6]. 19 of the voice files investigated by the VQT were selected and used for the test. A copy of the score sheet is found in appendix 3. The subjects were asked to listen to the files and grade them according to table 2.1. The results were collected and an average MOS was calculated for each file by taking the sum of all subjective scores and dividing it by the number of subjects. The files used are shown in table 4.6.
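The MOS-LQS computation described above is a plain arithmetic mean per file; the standard deviation reported later can be computed alongside it. Whether the thesis used the population or sample formula is not stated; the population form is shown here as an assumption.

```python
def mean_opinion_score(scores):
    """MOS-LQS for one file: the sum of all subjective scores divided by
    the number of subjects."""
    return sum(scores) / len(scores)

def score_std_deviation(scores):
    """Population standard deviation of the subjective scores."""
    mos = mean_opinion_score(scores)
    return (sum((s - mos) ** 2 for s in scores) / len(scores)) ** 0.5
```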
Table 4.6. Overview of the test files used in the subjective test.

File number   Impairment                                        Talker
1             Delay 10ms                                        swe_f2
2             Delay 10ms                                        swe_m8
3             -6dB leveled delay 10ms                           swe_f4
4             Delay 10ms and pink noise SNR=25dB                swe_f3
5             Delay 10ms and a 400Hz tone SNR=20dB              swe_f3
6             Delay 10ms and a 1200Hz tone SNR=20dB             swe_f8
7             Tone at 400Hz, SNR=20dB                           swe_f5
8             Tone at 1200Hz, SNR=20dB                          swe_m1
9             Packet loss 1%, lost packets = silence            swe_m7
10            Packet loss 1%, lost packets = preceding packet   swe_f1
11            White noise, SNR=25dB                             swe_f7
12            Pink noise, SNR=25dB                              swe_m6
13            Speech over speech, SNR=25dB                      swe_m5(f3)
14            Noise pulses 1ms, 0dB, 0,05% noise                swe_m4
15            Noise pulses 3ms, 0dB, 0,15% noise                swe_f6
16            Noise pulses 3ms, 0dB, 1,2% noise                 swe_m3
17            Delay 50ms                                        swe_f6
18            Pink noise, SNR=10dB                              swe_m1
19            No impairments                                    swe_m5
To avoid any comparison between similar files, the file numbers were shuffled before testing. The score sheet with instructions and the voice files were e-mailed to colleagues and friends. For those subjects having a file size limitation in their e-mail inbox, the files were compressed to 70 kbit/s MP3 format. The compression led to a small and barely noticeable quality degradation.
This was not a standardized way to perform a subjective quality test; the results must therefore be treated very carefully. The ITU recommendation P.800 [6] describes how to perform a subjective test; it should be very controlled and involve many people.
5 Result
5.1 PESQ-verification
5.1.1 Test with neglected impairments
Table 5.1. Overview of the test cases in the PESQ-verification.

Case   Processing
1      Added silence
2      Changed level
3      Combination of case 1 and 2
4      No processing
Case 1

[Plot: MOS-LQO versus silence added (0-2500ms) for 12 files: swe_f2, swe_f7, swe_m5, swe_m8, B_eng_f1, B_eng_f3, B_eng_m3, B_eng_m6, Ru_f2, Ru_f5, Ru_m1, Ru_m7.]
Figure 5.1. MOS-LQO for 12 different speech files with added silence in the beginning.
Case 2

[Plot: MOS-LQO versus level change (-16 to +16dB) for the same 12 files.]
Figure 5.2. MOS-LQO for 12 different speech files with changed level of the degraded file.
Case 3

[Plot: MOS-LQO versus combination number (1-8) for the same 12 files.]
Figure 5.3. MOS-LQO for 12 different speech files with both added silence and changed level.
Case 4

[Plot: MOS-LQO versus file number (1-12).]
Figure 5.4. MOS-LQO for 12 different speech files unprocessed.
5.1.2 Conformance test P.862, Amendment 2
All the file pairs ended up within the test limits. In two cases (marked with * in appendix 2) the difference was 0,01; for the remaining 38 pairs no difference could be detected. The complete test sheet is attached in appendix 2.
5.2 Objective measurements
Reference G.711

[Plots: MOS-LQO per file number (1-19), and the difference in MOS-LQO from the maximum score.]
Figure 5.5. MOS-LQO for 19 speech files processed by the G.711-codec.
Figure 5.6. Difference in MOS-LQO, for 19 files, from the maximum score 4,55.
Case 1

[Plot: MOS-LQO versus delay (0-55ms) for swe_m3 and swe_f4.]
Figure 5.7. MOS-LQO for two speech samples with different delays added.
[Plot: MOS-LQO versus delay (0-10ms) for swe_m3 and swe_f4.]
Figure 5.8. As Fig.5.7 for delays 0-10ms.
Case 2

[Plot: MOS-LQO versus delay (0-55ms) for swe_f4 at 0dB, -3dB and -6dB.]
Figure 5.9. MOS-LQO for speech samples where the level of the delayed sample is attenuated 3dB and 6dB.
[Plot: MOS-LQO versus delay (0-10ms) for swe_f4 at 0dB, -3dB and -6dB.]
Figure 5.10. As Fig.5.9 for delays 0-10ms.
Case 3 and 4

[Plot: MOS-LQO versus SNR (-30 to +50dB) for pink noise and white noise.]
Figure 5.11. MOS-LQO for two speech samples with noise added.
Case 5 and 6

[Plot: MOS-LQO versus SNR (-30 to +50dB) for a 400Hz tone and a 1200Hz tone.]
Figure 5.12. MOS-LQO for speech samples with a sine tone added.
Case 7

[Plot: MOS-LQO versus lost packets (0-50% of the whole sentence) for swe_m3 and swe_f1.]
Figure 5.13. MOS-LQO for two speech samples with lost packets.
[Plot: MOS-LQO versus lost packets (0-2,5% of the whole sentence) for swe_m3 and swe_f1.]
Figure 5.14. As Fig.5.13 for losses 0-2%.
Case 8

[Plot: MOS-LQO versus lost packets (0-2,5% of the whole sentence) for swe_m3 and swe_f1.]
Figure 5.15. MOS-LQO for two speech samples with packet losses replaced with the preceding packet.
[Plot: MOS-LQO versus lost packets (0-2,5%) for swe_m3 and swe_f1, with losses replaced by silence, by the preceding packet, and by a noise pulse (case 9g, -6dB).]
Figure 5.16. Comparison of different replacements for packet losses.
Case 9

[Plot: MOS-LQO versus noise pulse density (0-4,5s-1) for pulse lengths 1,0-20ms.]
Figure 5.17. MOS-LQO for speech samples with added noise pulses of SNR=-6dB and with different lengths.
[Plot: MOS-LQO versus noise pulse density (0-4,5s-1) for pulse lengths 1,0-20ms.]
Figure 5.18. MOS-LQO for speech samples with added noise pulses of SNR=0dB and with different lengths.
[Plot: MOS-LQO versus noise pulse density (0-4,5s-1) for pulse lengths 1,0-20ms.]
Figure 5.19. MOS-LQO for speech samples with added noise pulses of SNR=+6dB and with different lengths.
[Plot: MOS-LQO versus noise pulse density (0-4,5s-1) for pulse lengths 1,0-20ms.]
Figure 5.20. MOS-LQO for speech samples with added noise pulses of SNR=+12dB and with different lengths.
The results for all SNRs are shown in appendix 4.
Case 10

[Plot: MOS-LQO versus SNR (-20 to +50dB).]
Figure 5.21. MOS-LQO for a speech sample with a different speech sample added.
[Plot: MOS-LQO versus SNR (-30 to +50dB) for 400Hz tone, 1200Hz tone, white noise, pink noise and speech.]
Figure 5.22. Comparison of five different additive impairments.
Case 11

[Plot: MOS-LQO versus SNR (-25 to +45dB) for delays 0,0-20ms.]
Figure 5.23. MOS-LQO for different delays at varying SNR (pink noise added).
Case 12

[Plot: MOS-LQO versus SNR (-30 to +45dB) for delays 0,0-20ms.]
Figure 5.24. MOS-LQO for different delays at varying SNR (400Hz tone added).
Case 13

[Plot: MOS-LQO versus SNR (-30 to +45dB) for delays 0,0-20ms.]
Figure 5.25. MOS-LQO for different delays at varying SNR (1200Hz tone added).
Case 14

[Plot: MOS-LQO versus delay (0-11ms) for loss densities 0,5s-1 (1%), 1,0s-1 (2%) and 2,0s-1 (4%).]
Figure 5.27. MOS-LQO for speech samples with added delay at 0dB and packet losses.
[Plot: MOS-LQO versus delay (0-11ms) for loss densities 0,5s-1 (1%), 1,0s-1 (2%) and 2,0s-1 (4%).]
Figure 5.28. MOS-LQO for speech samples with added delay at -3dB and packet losses.
[Plot: MOS-LQO versus delay (0-11ms) for loss densities 0,5s-1 (1%), 1,0s-1 (2%) and 2,0s-1 (4%).]
Figure 5.29. MOS-LQO for speech samples with added delay at -6dB and packet losses.
Case 15

[Plot: MOS-LQO versus delay (0-11ms) for noise percentages 0,2%-40%.]
Figure 5.30. MOS-LQO for speech samples with added delay and noise pulses of 20ms.
[Plot: MOS-LQO versus delay (0-1,2ms) for noise percentages 0,2%-40%.]
Figure 5.31. As fig.5.30 for delays 0-1ms.
5.3 Subjective measurements
The column with MOS-LQO shows the results produced by the PESQ software, the column with MOS-LQS shows the mean score of the 31 subjects used in this experiment, and the right column shows the standard deviation of the subjective results. The complete results can be viewed in appendix 5.
Table 5.2. The result of the subjective measurement.

File nr.   Impairment                               MOS-LQO   MOS-LQS   Standard deviation MOS-LQS
1          Delay 10ms                               2,65      3,06      0,93
2          Delay 10ms                               2,63      1,97      0,75
3          Attenuation -6dB and delay 10ms          3,69      3,39      0,92
4          Delay 10ms and pink noise SNR=25dB       2,42      2,55      0,72
5          Delay 10ms and a 400Hz tone SNR=20dB     2,59      2,42      0,96
6          Delay 10ms and a 1200Hz tone SNR=20dB    2,23      2,03      0,91
7          Tone at 400Hz, SNR=20dB                  3,31      2,81      0,87
8          Tone at 1200Hz, SNR=20dB                 3,21      2,71      1,01
9          Packet losses, 1% silence                3,54      2,48      0,72
10         Packet losses, 1% preceding packet       3,56      3,35      1,02
11         White noise, SNR=25dB                    3,22      2,45      0,85
12         Pink noise, SNR=25dB                     3,59      3,42      0,92
13         Speech over speech, SNR=25dB             3,74      3,26      0,93
14         Noise pulses 1ms, 0dB, 0,05% noise       3,80      3,45      0,93
15         Noise pulses 3ms, 0dB, 0,15% noise       3,27      3,16      0,97
16         Noise pulses 3ms, 0dB, 1,20% noise       1,86      1,84      0,86
17         Delay 50ms                               1,47      1,48      0,63
18         Pink noise, SNR=10dB                     1,92      1,94      0,93
19         No impairments                           4,49      4,35      0,71
6 Conclusion and Discussion
6.1 PESQ-verification
The PESQ-algorithm worked as expected in all four verification cases. The quality drops for some files in cases 2 and 3 were caused by the high level of the speech: some of the speech had reached the limit where amplitude clipping occurred, with lower quality as the result. There are still some features of the algorithm that need to be investigated. The time alignment function should be tested further, for example to examine how the algorithm deals with delays in the middle of sentences.
The algorithm passed the second case in the conformance test. The results of all but two file pairs agreed, after rounding, with the reference scores. The two were still within the error limit, and the 0,01 difference was probably due to rounding inaccuracy.
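The pass criterion described above — compare the computed score, rounded to two decimals, against the reference, allowing a 0,01 round-off difference — can be sketched as below. The exact tolerance rule is my reading of the text, not a quote of the Amendment.

```python
def conforms(pesq_score, reference_score, tol=0.01):
    """True if the two-decimal rounded scores agree within the tolerance."""
    return abs(round(pesq_score, 2) - round(reference_score, 2)) <= tol + 1e-9
```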
6.2 Objective measurements
Reference G.711
The G.711-codec lowered the quality by up to 0,15 on the MOS-LQO scale. This is quite low; the PESQ-developers state a drop of about 0,30 for an analog G.711 network [13]. The difference may stem from the fact that they used real measurements with more tightly controlled speech files and that they measured over an entire network, not just the codec itself.
Case 1
The delay limit proposed by Eurocae is 10ms; this simulation showed a MOS-LQO of about 2,8 at that delay. A score of 2,8 is considered acceptable, but one should remember that in real life more degrading factors than delay alone are likely to occur.
Case 2
In the cases where the delayed file was attenuated, the simulations showed an expected increase in quality. For the 10ms case the quality was about 0,8 higher than when both files were at equal level. This is usually the situation in real life; one file has a lower level than the other. The 7,5ms and 12,5ms delays showed unexpectedly lower quality. This is an indication that multiple tests should be performed to avoid random errors.
Case 3 and 4
Adding noise to the files produced smooth, consistent plots. The white noise showed a somewhat flatter curve than pink noise, but the difference was small across the whole SNR range. At very low SNRs the quality was almost constant, indicating the need to keep the SNR above a certain limit.
Case 5 and 6
When adding a tone instead of noise the results were somewhat different. The curves were flatter, and at 0dB the quality difference was almost 1,0 in favor of the cases with an added tone. Some remarkable results were obtained at SNRs below 0dB: the quality started to increase again, and at -20dB it had reached the same level as at about +10dB. At very low SNRs (-25dB) the quality was lower as expected, though still about 0,7 higher than for added noise. The reason for the quality increase at low SNRs is a case for further investigation.
Case 7
Simulating lost packets or frames showed expected results. Ref. [35] shows a MOS of about 3,6 at 1% loss, which correlates well with these results. One should remember that the content of the packet is very important: whether it contains speech or silence. The PESQ-algorithm considers whether the content is speech or silence, but it is unclear how well it can handle the importance of different parts of the speech.
Case 8
Replacing the simulated packet losses with the preceding packet instead of silence showed somewhat higher quality, an indication that error concealment is useful. Replacing the losses with noise should on the other hand be avoided.
Case 9
When the level of the pulses decreased, the length of the pulses became more important. At -6dB the quality difference was about 0,7 at the most, while it was about 1,8 at low noise levels. For most of the cases where 1% of the sentence was noise, the quality was around 3,0 for all pulse lengths, making noise worse than silence and the preceding packet as error concealment.
Case 10
Letting another human voice impair the original speech gave quality results lower than in the cases with an added tone. On the other hand, the quality was somewhat higher than for noisy speech at SNRs of 0-30dB. These results should be treated carefully as the PESQ-algorithm has not been validated for cases with multiple talkers [12].
Case 11
At low SNRs the amount of delay was negligible; the noise made the quality too poor for the delay to make any difference. At higher SNRs the delay had more influence on the quality, and at 40dB the result was comparable with case 1 where no noise was added.
Case 12
Adding a 400Hz tone to the delayed files gave curves with the same shape as without delay. Notably, below 0dB the results became more erratic and any clear trend was hard to discern.
Case 13
Changing to a 1200Hz tone showed no surprising results. The curves agreed better with each other than for the 400Hz tone, indicating that the delay had less influence in the 1200Hz case than in the 400Hz case.
Case 14
More packet losses led to lower quality, just as expected. Adding more delay gave less consistent results, especially for the 1ms and 4ms delays, where the quality tended to rise. The level change of the delayed file gave expected results, even though the difference was smaller than in case 2. The conclusion is that packet losses are more degrading than delays, even when the delayed file has the same level as the original sample.
Case 15
The results indicated once again that noise is the worst way to replace a lost packet; for the 1% case the score was around 2,5. As in many other cases there were some results that deviated from the expected trend, which indicates that further investigation is desired.
Cases 11-15 consisted of combinations of delays and other impairments. The objective was to investigate whether any combination showed remarkable results and whether any impairment was more important to avoid. None of them showed any surprising results; the resulting quality score was about what the scores from the measurements where only one impairment was considered would suggest when combined.
6.3 Subjective measurements
As mentioned earlier, the results from the subjective measurement should be treated very carefully. Comparing the results with this in mind shows a good correlation; only in three cases is the difference higher than 0,5. Files 1 and 2 differ quite a lot; the reason is probably that the male voice is much more slurred and not as well articulated as the female voice. Looking at files 9 and 10, PESQ scored them almost the same, while the subjective results show that the error concealment simulation is perceived as almost 1,0 higher than silence. Comparing this with file 16 shows that both PESQ and the subjects agree that noise pulses should be avoided when more than 1% of the speech is impaired. For the cases with added background noise, white noise is the worst according to the subjects, about 1,0 worse than pink noise; this is a somewhat bigger difference than what PESQ shows. Using different speech as the impairment is slightly worse than pink noise according to the subjects, while PESQ grades it higher than when noise or tones are added. Finally, there are some files (16-19) with very well correlated results. The reason for this is probably that the results are near the ends of the quality scale.
The standard deviation is almost 1 for most of the files, indicating that the individual results differ quite a lot. On the other hand, that is what one can expect when dealing with people with different opinions about quality.
6.4 Error sources
One of the most obvious error sources is that the original input speech files do not completely fulfill the input-file requirements of ITU-T P.862 [12]; the level and filtering, for example, should be controlled. To be able to fully compare the results, the same speech file should be used in every test. As seen in many cases, some of the scores differ quite a lot from the expected ones; multiple measurements of the same case should be made to avoid this. For the subjective measurement a number of errors can occur. The subjects performed the test on their own using different equipment, which contributes to the uncertainty and the deviation between individual scores. Some of this uncertainty could be avoided with more detailed instructions, for example in case 2 where the subjects might have taken the actual voice into consideration. The compression of the files to MP3 format did lower the quality, and even though the author had difficulty hearing the difference, it might have lowered the subjects' scores.
6.5 The GL-VQT Software®
The test of the software where the PESQ-algorithm is implemented gave satisfactory results. There are still functions that need to be investigated; in this work only the manual function has been used, with manually made files simulating some of the impairments that can occur in a network. For further investigation, tests should be performed on a real network, using both real speech and speech files similar to those used in this work for comparison.
To fully validate the software there is a conformance test in ITU-T P.862, Amendment 2 [36]. This test comes with about 1800 file pairs with known PESQ-scores for testing and confirmation of the algorithm's functions.
This tool is a good choice if manual offline measurements are desired. There is no need to bring the software out into the field; just bring a high quality recorder, record the transmitted speech and investigate the quality later.
To get a more complete understanding of the overall speech quality, PESQ should be used together with instruments measuring impairments like delay, echo and other factors for which PESQ has not been validated.
6.6 Additional measurements
For the Climax case, with delayed speech samples, it should be investigated how additional speech influences the quality. In this work only one delayed file has been investigated, but it is possible that two or even more voice samples can impair the radio transmission.
An ordinary transmission is usually degraded by many impairments. Here only combinations of at most two impairments have been tested; it is necessary to continue the investigation by combining more impairments.
It is important to perform tests in real networks with real transmissions. Only files simulating the real world have been investigated, and it is necessary to compare them with real transmissions and investigate the difference.
References
[1] http://www.eurocae.org/, "General Description".
[2] Hardman, Dennis, Noise and Voice Quality in VoIP Environments, Agilent Technologies, Inc – White Paper, 2003.
[3] Pracht, S. Hardman, D. Voice Quality in Converging Telephony and IP Networks, Agilent Technologies, Inc – White Paper, 2000-2001.
[4] ITU-T Rec. G.711, Pulse Code Modulation (PCM) of voice frequencies, Nov. 1988.
[5] Anderson, John, Methods for Measuring Perceptual Speech Quality, Agilent Technologies, Inc – White Paper, 2000-2001.
[6] ITU-T Rec. P.800, Methods for subjective determination of transmission quality, Aug. 1996.
[7] ITU-T Rec. P.800.1, Mean opinion score (MOS) terminology, Jul. 2006.
[8] ITU-T Rec. P.810, Modulated noise reference unit – MNRU, International Telecommunication Union, Geneva, Switzerland, Feb. 1996.
[9] Ericsson AB, Speech Quality Index in CDMA2000, Figure 1 in EAB06:009546 Uen Rev B, Technical Paper, May 2006.
[10] Hardy, William C, VoIP Service Quality, Blacklick, OH, USA: McGraw-Hill, 2003, ISBN: 0-07-141076-7.
[11] ITU-T Rec. P.861, Objective quality measurement of telephone band (300-3400 Hz) speech codecs, Feb. 1998.
[12] ITU-T Rec. P.862, Perceptual evaluation of speech quality (PESQ), an objective method for end-to-end speech quality assessment of narrow-band telephone networks and speech codecs, Feb. 2001.
[13] Psytechnics Limited, PESQ – Product description, www.psytechnics.com, Nov. 2002.
[14] ITU-T Rec. P.862.1, Mapping function for transforming P.862 raw result scores to MOS-LQO, Nov. 2003.
[15] ITU-T Rec. G.107, The E-model, a Computational Model for Use in Transmission Planning, Mar. 2005.
[16] ITU-T Rec. P.563, Single ended method for objective speech quality assessment in narrow-band telephony applications, International Telecommunication Union, Geneva, Switzerland, May 2004.
[17] Li, F.F. Speech Intelligibility of VoIP to PSTN Interworking – A Key Index for the QoS, Department of Computing and Mathematics, Manchester Metropolitan University, UK, 2004.
[18] Beerends, J.G. van Wijngaarden, S. van Buuren, R. Extension of ITU-T Recommendation P.862 PESQ towards Measuring Speech Intelligibility with Vocoders, The NATO Research & Technology Organisation, New Directions for Improving Audio Effectiveness (pp. 10-1 – 10-6), RTO-MP-HFM-123, 2005.
[19] Liu, W.M. Jellyman, K.A. Mason, J.S.D. Evans, N.W.D. Assessment of Objective Quality Measures for Speech Intelligibility Estimation, School of Engineering, University of Wales Swansea, 2006.
[20] Beerends, J.G. Larsen, E. Nandini, I. van Vugt, J.M. Measurement of Speech Intelligibility based on the PESQ approach,
[21] ISO standard, ISO/TR 4870:1991, Acoustics – The construction and calibration of speech intelligibility tests.
[22] ANSI S3.2-1989, Method for Measuring the Intelligibility of Speech over Communication Systems.
[23] IEC International standard, IEC 60268-16, 3rd edition, 2003.
[24] ANSI standard S3.5-1997, Methods for Calculation of the SII.
[25] Chernick, C.M. Leigh, S. Mills, K.L. Toense, R. Testing the Ability of Speech Recognizers to Measure the Effectiveness of Encoding Algorithms for Digital Speech Transmission, IEEE Int. Military Comm. Conf. (MILCOM), 1999.
[26] International Telecommunication Union, ITU-T Study Group 12, Voice Quality Assessment, information slides, ETSI STQ Workshop, Taiwan, 13 February 2006.
[27] Rihacek, Christoph. B-VHF Reference Environment, Project Report D08, Broadband VHF Aeronautical Communications System based on MC-CDMA, Ref: 04A02 E515.10, 2005.
[28] ICAO, Annex 10 – Volume III, Attachment A to Part II, Guidance material for communication systems, Nov. 1997.
[29] Eurocae Climax recommendations, Working Group 67, SG1, 2006.
[30] Nilsson, Alf. Latency in the radio network of the Swedish CAA, Ver 01.01, Saab Communication, 30 January 2006.
[31] Tan, X-C. Wänstedt, S. Heikkilä, G. Experiments and Modeling of Perceived Speech Quality of Long Samples, AWARE @ Ericsson Research, Luleå, Sweden, 2001.
[32] Hoene, C. Rathke, B. Wolisz, A. On the Importance of a VoIP Packet, Technical University of Berlin, 2003.
[33] http://www.bluetooth.com/Bluetooth/Learn/Basics/
[34] ITU-T Rec. P.50, Artificial Voices, Sep. 1999.
[35] Ding, Lijing. Goubran, Rafik A. Assessment of Effects of Packet Loss on Speech Quality in VoIP, Department of Systems and Computer Engineering, Carleton University, Canada, 2003.
[36] ITU-T Rec. P.862, Amendment 2, Revised Annex A – Reference implementations and conformance testing for ITU-T Recs P.862, P.862.1 and P.862.2, International Telecommunication Union, Nov. 2005.
Appendix 1
Glossary

ACR        Absolute Category Rating
ATC        Air Traffic Controller
ATM        Air Traffic Management
CCR        Comparison Category Rating
Climax     Offset Carrier System
dBov       the decibel value relative to the overload point of a digital system
DCR        Degradation Category Rating
DRT        Diagnostic Rhyme Test
Eurocae    European Organization of Civil Aviation Equipment
FEC        Front End Clipping
FFT        Fast Fourier Transform
ITU        International Telecommunication Union
MNRU       Modulated Noise Reference Unit
MOS        Mean Opinion Score
MOS-LQO    Mean Opinion Score – Listening Quality Objective
MOS-LQS    Mean Opinion Score – Listening Quality Subjective
MRT        Modified Rhyme Test
ms         millisecond
MTF        Modulation Transfer Function
PAMS       Perceptual Analysis Measurement System
PCM        Pulse Code Modulation
PESQ       Perceptual Evaluation of Speech Quality
PSQM       Perceptual Speech Quality Measure
PSTN       Public Switched Telephony Network
RTP        Real-time Transport Protocol
SII        Speech Intelligibility Index
SNR        Signal-to-Noise Ratio
STI        Speech Transmission Index
VAD        Voice Activity Detector
VoIP       Voice over IP
VQT        Voice Quality Tester
Appendix 2
ITU-T P.862, Amendment 2. Conformance test 2(b)

Reference            Degraded                  Fsample   PESQ_score   VQT-score
voip/or105.wav       voip/dg105.wav            8000      2.237        2,24
voip/or109.wav       voip/dg109.wav            8000      3.180        3,18
voip/or114.wav       voip/dg114.wav            8000      2.147        2,15
voip/or129.wav       voip/dg129.wav            8000      2.680        2,68
voip/or134.wav       voip/dg134.wav            8000      2.365        2,36*
voip/or137.wav       voip/dg137.wav            8000      3.670        3,67
voip/or145.wav       voip/dg145.wav            8000      3.016        3,02
voip/or149.wav       voip/dg149.wav            8000      2.558        2,56
voip/or152.wav       voip/dg152.wav            8000      2.768        2,77
voip/or154.wav       voip/dg154.wav            8000      2.694        2,69
voip/or155.wav       voip/dg155.wav            8000      2.606        2,61
voip/or161.wav       voip/dg161.wav            8000      2.608        2,61
voip/or164.wav       voip/dg164.wav            8000      2.850        2,85
voip/or166.wav       voip/dg166.wav            8000      2.527        2,53
voip/or170.wav       voip/dg170.wav            8000      2.452        2,45
voip/or179.wav       voip/dg179.wav            8000      1.828        1,83
voip/or221.wav       voip/dg221.wav            8000      2.774        2,77
voip/or229.wav       voip/dg229.wav            8000      2.940        2,94
voip/or246.wav       voip/dg246.wav            8000      2.205        2,20*
voip/or272.wav       voip/dg272.wav            8000      3.288        3,29
voip/u_am1s01.wav    voip/u_am1s01b1c1.wav     8000      3.483        3,48
voip/u_am1s01.wav    voip/u_am1s01b1c7.wav     8000      2.420        2,42
voip/u_am1s02.wav    voip/u_am1s02b1c9.wav     8000      4.042        4,04
voip/u_am1s01.wav    voip/u_am1s01b1c15.wav    8000      3.179        3,18
voip/u_am1s03.wav    voip/u_am1s03b1c16.wav    8000      2.872        2,87
voip/u_am1s03.wav    voip/u_am1s03b1c18.wav    8000      2.806        2,81
voip/u_am1s01.wav    voip/u_am1s01b2c1.wav     8000      4.300        4,30
voip/u_am1s02.wav    voip/u_am1s02b2c4.wav     8000      3.634        3,63
voip/u_am1s02.wav    voip/u_am1s02b2c5.wav     8000      3.369        3,37
voip/u_am1s03.wav    voip/u_am1s03b2c5.wav     8000      3.911        3,91
voip/u_am1s03.wav    voip/u_am1s03b2c6.wav     8000      2.905        2,91
voip/u_am1s03.wav    voip/u_am1s03b2c7.wav     8000      3.579        3,58
voip/u_am1s01.wav    voip/u_am1s01b2c8.wav     8000      2.198        2,20
voip/u_am1s03.wav    voip/u_am1s03b2c11.wav    8000      3.276        3,28
voip/u_am1s02.wav    voip/u_am1s02b2c14.wav    8000      3.316        3,32
voip/u_af1s01.wav    voip/u_af1s01b2c16.wav    8000      3.307        3,31
voip/u_af1s03.wav    voip/u_af1s03b2c16.wav    8000      3.592        3,59
voip/u_af1s02.wav    voip/u_af1s02b2c17.wav    8000      2.614        2,61
voip/u_af1s03.wav    voip/u_af1s03b2c17.wav    8000      2.806        2,81
voip/u_am1s03.wav    voip/u_am1s03b2c18.wav    8000      2.540        2,54
Appendix 3
Subjective test of speech quality
Part of a master thesis at Saab Communication in Arboga by Alexander Storm – March 2007.

The test is a speech quality test where you give your opinion on the quality of recorded speech. The test consists of 19 sound files of 7-9 seconds each. You shall listen to these and give a quality grade to each file. The whole test shall be done on one single occasion, in as undisturbed an environment as possible and without influence from any other person. The time required is 5-10 minutes. Headphones are preferred, but loudspeakers work if headphones are not available. First, a reference file is played to set the volume of your headphones/loudspeakers. The level shall be comfortable, and what is said in the file shall be heard clearly and distinctly with as little effort as possible. The reference file may be played any number of times, while the other files shall be played only once before the grade is given. After the test, e-mail this document, filled in, back to me. All information will be treated confidentially.

Grades are given between 1 and 5 as follows (only whole numbers):

Quality of the speech   Grade
Excellent               5
Good                    4
Fair                    3
Poor                    2
Bad                     1

Thank you very much for your help!
/Alexander

The test
I have used (mark with a cross): headphones:___  loudspeakers:___
Gender: male:___  female:___
Age: 0-30:___  31-50:___  51- :___

Grades:
File 1:____   File 2:____   File 3:____   File 4:____   File 5:____
File 6:____   File 7:____   File 8:____   File 9:____   File 10:____
File 11:____  File 12:____  File 13:____  File 14:____  File 15:____
File 16:____  File 17:____  File 18:____  File 19:____
Appendix 4
Results for case 9 of the objective measurements
[Figure: Seven plots of MOS-LQO (y-axis, 0 to 4.5) versus noise pulse rate in s⁻¹ (x-axis, 0 to 4.5), one plot per SNR level: -6 dB, -3 dB, +0 dB, +3 dB, +6 dB, +9 dB and +12 dB. Each plot shows curves for noise pulse lengths of 1.0, 3.0, 5.0, 7.0, 10, 15 and 20 ms.]
File   MOS-LQS   Standard deviation
1      3,16      0,97
2      3,26      0,93
3      2,42      0,96
4      3,35      1,02
5      1,94      0,93
6      3,06      0,93
7      3,45      0,93
8      2,03      0,91
9      3,42      0,92
10     3,39      0,92
11     2,81      0,87
12     1,84      0,86
13     1,97      0,75
14     2,45      0,85
15     2,48      0,72
16     1,48      0,63
17     2,71      1,01
18     2,55      0,72
19     4,35      0,71
Subject Gender Age H/S File 1 File 2 File 3 File 4 File 5 File 6 File 7 File 8 File 9 File 10 File 11 File 12 File 13 File 14 File 15 File 16 File 17 File 18 File 19
1    M  1  H   3 3 2 4 3 2 5 2 3 3 2 3 2 3 2 2 3 2 5
2    M  3  S   5 4 4 4 3 4 5 4 5 4 4 3 2 3 2 2 3 2 5
3    M  1  S   3 2 2 3 2 4 3 2 4 5 2 2 2 3 2 1 2 3 5
4    M  2  H   3 2 1 3 1 2 4 1 2 4 2 2 1 2 2 1 3 2 4
5    F  3  S   3 3 2 3 1 4 2 2 4 4 3 1 1 1 2 1 3 3 5
6    M  3  S   2 5 2 2 1 3 2 1 4 3 2 1 2 1 2 1 2 2 5
7    M  2  H   2 2 2 3 2 2 3 1 3 2 2 1 2 2 2 1 2 3 3
8    M  2  S   3 3 4 5 2 5 4 2 3 4 3 1 3 3 4 2 1 4 4
9    M  2  S   5 4 3 5 2 3 4 2 4 3 3 2 2 2 2 1 3 2 3
10   M  -  H   4 4 3 5 2 3 3 2 4 5 3 2 2 3 4 2 2 3 5
11   M  2  S   4 3 3 4 2 3 4 3 3 4 3 3 2 2 3 2 3 2 4
12   M  2  S   4 5 4 4 4 4 4 4 5 5 4 3 3 4 3 2 4 4 5
13   M  1  S   2 3 1 2 1 3 2 1 3 2 2 1 1 3 1 1 4 3 4
14   F  1  S   2 3 3 2 1 2 2 2 2 3 4 1 1 2 2 1 3 2 5
15   M  1  H   3 3 2 4 2 2 3 2 3 3 4 2 2 2 3 2 2 3 4
16   M  1  S   4 3 3 4 3 4 4 3 4 4 4 3 4 4 4 3 3 3 5
17   M  3  S   4 5 4 5 4 5 4 3 5 4 4 3 2 3 3 3 4 3 5
18   F  1  H   3 2 1 3 2 3 4 2 3 3 2 2 2 3 3 1 3 2 3
19   M  3  S   2 3 2 3 1 3 4 1 3 2 3 2 1 2 3 2 2 2 5
20   M  2  S   3 4 2 3 1 2 3 2 4 3 2 1 2 2 2 1 3 2 5
21   M  1  S   2 3 2 2 1 3 3 2 3 3 2 2 1 3 2 1 4 2 4
22   F  1  H   4 3 3 4 2 4 4 2 4 3 2 2 1 4 2 2 2 4 5
23   M  1  H   3 4 1 2 1 2 2 1 3 2 2 1 1 2 2 1 2 2 4
24   M  1  H   5 5 4 4 3 4 4 3 4 4 4 2 3 3 2 2 4 3 5
25   M  1  H   3 2 2 3 2 3 3 1 2 3 2 1 3 2 3 1 2 2 4
26   M  3  H   3 3 3 3 2 2 4 2 3 3 3 2 2 3 2 1 4 2 4
27   M  1  H   2 2 1 3 1 2 2 1 2 2 2 1 2 1 2 1 1 1 3
28   F  1  H   4 3 2 4 1 3 4 2 3 4 2 1 2 2 3 1 1 3 4
29   M  1  S   2 4 2 1 3 2 5 4 5 3 4 1 3 2 2 1 4 3 4
30   M  1  S   2 3 2 3 1 4 3 1 2 5 2 1 2 1 3 1 1 2 4
31   M  1  H   4 3 3 4 3 3 4 2 4 3 4 4 2 3 3 2 4 3 5
Appendix 5
The MOS-LQS result of the subjective measurement
Gender: M = male, F = female.
Age: 1 = 0-30, 2 = 31-50, 3 = 51- .
H/S: H = headphones, S = speakers (used for listening).
The file numbers in the table refer to the files in their order after mixing.
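The MOS-LQS row for each file is the arithmetic mean of the 31 subject grades for that file, together with a standard deviation. As a minimal sketch, assuming the sample (n - 1) standard deviation was used, the File 1 column of the subject table reproduces the tabulated 3,16 and 0,97:

```python
# Hedged sketch: reproducing one row of the MOS-LQS table from the
# per-subject grades. The list below is the File 1 column (subjects 1-31).
from statistics import mean, stdev

file1_grades = [3, 5, 3, 3, 3, 2, 2, 3, 5, 4, 4, 4, 2, 2, 3, 4,
                4, 3, 2, 3, 2, 4, 3, 5, 3, 3, 2, 4, 2, 2, 4]

mos_lqs = mean(file1_grades)   # arithmetic mean over the 31 subjects
spread = stdev(file1_grades)   # sample standard deviation (n - 1)

print(f"MOS-LQS: {mos_lqs:.2f}")   # 3.16, matching the table
print(f"Std dev: {spread:.2f}")    # 0.97, matching the table
```

The population (n) standard deviation would give 0,95 for File 1, so the tabulated figures appear to be sample standard deviations.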