Gradient Algorithms for Designing Predictive Vector Quantizers

Transcription

IEEE TRANSACTIONS ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, VOL. ASSP-34, NO. 4, AUGUST 1986

Gradient Algorithms for Designing Predictive Vector Quantizers

PAO-CHI CHANG AND ROBERT M. GRAY, FELLOW, IEEE
Abstract—A predictive vector quantizer (PVQ) is a vector extension of a predictive quantizer. It consists of two parts: a conventional memoryless vector quantizer (VQ) and a vector predictor. Two gradient algorithms for designing a PVQ are developed in this paper: the steepest descent (SD) algorithm and the stochastic gradient (SG) algorithm. Both have the property of improving the quantizer and the predictor in the sense of minimizing the distortion as measured by the average mean-squared error. The differences between the two design approaches are the period and the step size used in each iteration to update the codebook and predictor. The SG algorithm updates once for each input training vector and uses a small step size, while the SD updates only once for a long period, possibly one pass over the entire training sequence, and uses a relatively large step size.

Code designs and tests are simulated both for Gauss-Markov sources and for sampled speech waveforms, and the results are compared to codes designed using techniques that attempt to optimize only the quantizer for the predictor and not vice versa.
I. INTRODUCTION
A vector quantizer is a system for mapping a sequence of continuous or high rate discrete k-dimensional vectors into a digital sequence suitable for communication over or storage in a digital channel. While Shannon theory states that memoryless vector quantization is sufficient to achieve nearly optimal performance, such performance is guaranteed only for large vector dimension. Unfortunately, however, for a fixed rate in bits per sample, the codebook size grows exponentially with vector dimension and, hence, the complexity of minimum distortion searches required by the encoder also grows exponentially. For this reason, recent research has focused on design techniques for vector quantizers that have structures which yield a slower growth of encoder complexity with rate or dimension. Two vector quantizer structures that have proved promising are feedback vector quantizers with full searches of small codebooks and memoryless vector quantizers with large codebooks and suboptimal, but efficient, search algorithms. A general survey of vector quantization, including many examples of both structures, may be found in [1]. We here develop new design algorithms for predictive vector quantizers, a special case of the class of feedback quantizers.
Manuscript received December 29, 1984; revised August 12, 1985. This work was supported in part by the Joint Services Electronics Program at Stanford University and by the National Science Foundation under Grant ECS83-17981.
The authors are with the Information Systems Laboratory, Department of Electrical Engineering, Stanford University, Stanford, CA 94305.
IEEE Log Number 8608122.
A predictive vector quantizer (PVQ) or vector predictive quantizer is a vector extension of a predictive quantizer or DPCM system. In the encoding process, an error
vector formed as the difference between the input vector
and the prediction of this vector is coded
by a memoryless
vector quantizer. The vector quantizer chooses the minimum distortion codeword from a stored codebook, and
transmits the index of this codeword to the receiver. A
PVQ is a feedback VQ because the encoder output is fed
back to the predictor for use in approximating the new
input vector.
The general structure of PVQ was introduced by Cuperman and Gersho [2], [3], who developed a PVQ design algorithm for waveform coding with two main steps. First, a set of linear vector predictive coefficients is computed from the input training sequence by generalized LPC techniques. Second, a vector quantizer codebook is designed for the innovation sequence formed by subtracting the input vector from a linear predicted value based on the actual past inputs ("open-loop" design) or for the actual prediction error formed as the difference between the input vector and the linear predicted value based on the past quantized outputs ("closed-loop" design). The generalized Lloyd algorithm was used for the vector codebook design (see, e.g., [4]). As with traditional design techniques for scalar predictive quantization or DPCM, the predictor is designed under the assumption that the prediction is based on past input vectors rather than on their quantized values; that is, it is effectively assumed that the quantized reproduction is nearly perfect. This approximation may be quite poor if the quantizer has a low rate. This causes a potential problem in the system design: the predictor which is optimum given past true values will not be so for past quantized values. A second potential problem arises when the open-loop design technique is used: the vector quantizer designed to be good for the ideal innovations sequence may not be as good when applied to the actual prediction error sequence. The closed-loop design resolves this problem and was found to provide 1-2 dB improvement over the open-loop design for sampled speech at rates ranging from 1 to 2 bits/sample with vector dimensions ranging from 1 to 5.
This paper presents an approach to designing predictive vector quantizers by applying standard techniques of adaptive filtering to design both vector quantizers and vector predictors in PVQ. The quantizers and predictors are iteratively optimized for each other using gradient
search techniques. This is accomplished by simultaneously adjusting both the linear predictive coefficients and the vector quantizer codebook with each fixed number of training vectors, where the number of training vectors used for each update can range from one vector to the entire training sequence. The adjustments attempt to minimize the cumulative sample squared error between the input vectors and the reconstructed signals. No assumption of perfect reproduction is needed in this algorithm and, hence, it may yield better codes. The algorithm is used only in the design of the system, not as an on-line adaptation mechanism as in the adaptive gradient algorithms of, e.g., Gibson et al. [5] and Dunn [6]. Some preliminary work on a scalar version of this algorithm was developed in unpublished work of Y. Linde. Preliminary results of the research described here were reported in [1].
On the positive side, simulations on Gauss-Markov and sampled speech indicate that the SD and SG design techniques yield good codes when the parameters are chosen intelligently. On the negative side, the resulting codes yield overall performance quite close to those designed using the Cuperman-Gersho technique. These results are of interest not only because they show that popular adaptive filtering algorithms can be modified to design good predictive vector quantizers, but because they show that optimizing the quantizer for the vector predictor yields good overall performance even when the predictor is not optimized for the quantizer. In other words, predictive vector quantizers are robust against inaccuracies in the predictor provided the quantizer is matched to the predictor. As a final observation, the algorithms developed here have also been extended to the design of good predictive trellis encoding systems [7] and joint source and channel trellis encoding systems [8].
The basic structure of a PVQ system is presented in the
second section. The principles of the gradient design algorithm are discussed in the third section. Simulation results for two different sources follow. Finally, comments
and suggestions for future research are mentioned.
II. PREDICTIVE VECTOR QUANTIZER
A PVQ system is sketched in Fig. 1. Let {x_n} be a vector-valued random process or source with alphabet B, e.g., k-dimensional Euclidean space R^k. A PVQ consists of three functions: an encoder \gamma which assigns to each error vector e_n = x_n - \tilde{x}_n a channel symbol \gamma(e_n) in some channel symbol set M; a decoder \beta assigning to each channel symbol u_n in M a value in a reproduction alphabet \hat{B}; and a prediction function or next state function f which approximates the next input vector x_{n+1} as \tilde{x}_{n+1} = f(\hat{x}_n, \hat{x}_{n-1}, ...). Given a sequence of input vectors and an initial prediction \tilde{x}_0, the channel symbol sequence u_n, reproduction sequence \hat{x}_n, and prediction sequence \tilde{x}_n are defined recursively for n = 0, 1, 2, ..., as

u_n = \gamma(e_n) = \gamma(x_n - \tilde{x}_n),
\hat{x}_n = \tilde{x}_n + \beta(u_n),
\tilde{x}_{n+1} = f(\hat{x}_n, \hat{x}_{n-1}, ...).   (1)
Fig. 1. Block diagram of PVQ.
Since the prediction depends only on previous predictions and encoder outputs, given the initial prediction and channel sequence the decoder can duplicate the prediction. In fact, the receiver is a subsystem of the transmitter; both have the same predictor, whose input is the reproduction vector \hat{x}.
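As an illustration of the recursion in (1), the following Python sketch (illustrative only, not the authors' implementation; all names are hypothetical) runs the encoder loop with a nearest-neighbor search over an error codebook and an arbitrary prediction function f, and checks that a decoder driven by the channel indices alone reproduces the same \hat{x}_n sequence.

import numpy as np

def pvq_encode(x, codebook, predict, x_tilde0):
    # PVQ recursion (1): quantize e_n = x_n - x~_n by nearest-neighbor search
    # over the error codebook, then feed the reproduction back to the predictor.
    x_tilde = x_tilde0
    history, indices = [], []
    for x_n in x:
        e_n = x_n - x_tilde
        u_n = int(np.argmin(np.sum((codebook - e_n) ** 2, axis=1)))
        x_hat = x_tilde + codebook[u_n]        # x^_n = x~_n + beta(u_n)
        history.append(x_hat)
        indices.append(u_n)
        x_tilde = predict(history)             # x~_{n+1} = f(x^_n, x^_{n-1}, ...)
    return indices, np.array(history)

def pvq_decode(indices, codebook, predict, x_tilde0):
    # The receiver duplicates the prediction from the channel sequence alone.
    x_tilde = x_tilde0
    history = []
    for u_n in indices:
        x_hat = x_tilde + codebook[u_n]
        history.append(x_hat)
        x_tilde = predict(history)
    return np.array(history)

# toy check that encoder-side and decoder-side reproductions agree
rng = np.random.default_rng(0)
k = 2
codebook = rng.normal(size=(4, k))             # four error codewords
predict = lambda hist: 0.5 * hist[-1]          # a toy prediction function f
x = rng.normal(size=(10, k))
idx, xhat_enc = pvq_encode(x, codebook, predict, np.zeros(k))
assert np.allclose(xhat_enc, pvq_decode(idx, codebook, predict, np.zeros(k)))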
A linear vector predictor is used in this system for its
simple structure and well-known behavior. We consider,
however, a particular form of linear vector predictors.
Following Cuperman and Gersho [3] with some minor
modifications, we consider vector predictors that operate
internally as ordinary scalar predictors. To be specific,
the linear prediction function is
\tilde{x}_n = \sum_{i=1}^{p} a_i \hat{x}_{l-i+1},   n = 1, 2, ...,   l = k(n - 1),

where k is the vector dimension, p is the predictor order, and the a_i are the predictive coefficient vectors. The predictor can be expressed more compactly as

\tilde{x}_n = A \hat{x}^r_{n-1},   n = 1, 2, ...,   (2)

where A = [a_1 a_2 ... a_p] is the prediction coefficient matrix and \hat{x}^r_{n-1} = [\hat{x}_l \hat{x}_{l-1} ... \hat{x}_{l-p+1}]^T is a p-dimensional vector, where the superscript r means that the order in this vector is reversed in time. In other words, the predictor generates a vector \tilde{x} by appropriately weighting previous samples \hat{x}.
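A minimal sketch of this predictor, under the reading given above (each component of \tilde{x}_n is formed from the last p reconstructed scalar samples, taken most recent first); the variable names are illustrative.

import numpy as np

def predict_vector(A, xhat_vectors, p):
    # Linear vector predictor of (2): x~_n = A x^r_{n-1}, where x^r_{n-1} holds
    # the last p reconstructed scalar samples in time-reversed order.
    samples = np.concatenate(xhat_vectors) if xhat_vectors else np.zeros(0)
    window = np.zeros(p)                       # zero-padded during start-up
    take = min(p, samples.size)
    if take:
        window[:take] = samples[-take:][::-1]  # reversed order in time
    return A @ window                          # k-dimensional prediction x~_n

# example with k = 2, predictor order p = 3
A = np.array([[0.9, 0.0, 0.0],
              [0.5, 0.4, 0.0]])
past = [np.array([1.0, 2.0]), np.array([3.0, 4.0])]   # x^_1, x^_2
print(predict_vector(A, past, p=3))            # uses samples 4.0, 3.0, 2.0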
A distortion measure d is an assignment of a nonnegative cost d(x, \hat{x}) of reproducing a given input vector x as a reproduction vector \hat{x}. In this paper we consider weighted squared error distortion measures

d(x, \hat{x}) = (x - \hat{x})^T W (x - \hat{x}).
Note that d(x, \hat{x}) is a difference distortion measure in the sense that d(x, \hat{x}) = d(x - \hat{x}, 0).
The vector quantizer (which includes the encoder and decoder) operates as a minimum distortion or nearest neighbor rule, i.e.,

\gamma(e) = \min^{-1}_{u \in M} d(x, \hat{x}),

where the inverse minimum notation means that \gamma(e) is the index u for which the reproduction codeword \hat{x} yields the minimum possible distortion over all possible reproduction codewords. Putting (1) into the above equation, for any difference distortion measure the minimum distortion rule becomes

\gamma(e) = \min^{-1}_{u \in M} d(e + \tilde{x}, \beta(u) + \tilde{x}) = \min^{-1}_{u \in M} d(e, \beta(u)).

Therefore,

d(e, \beta(\gamma(e))) = \min_{u \in M} d(e, \beta(u)) = \min_{u \in M} d(x, \hat{x}),

which means that minimizing the distortion of the overall system is exactly equivalent to minimizing the distortion of the quantizer only. This property simplifies the design of a PVQ system.
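For the squared-error case the equivalence above is easy to check numerically; the short sketch below (illustrative only) confirms that searching the codebook against the error vector e picks the same index as searching against the full reconstruction.

import numpy as np

rng = np.random.default_rng(1)
k = 3
codebook = rng.normal(size=(8, k))     # reproduction codewords beta(u)
x = rng.normal(size=k)                 # input vector
x_tilde = rng.normal(size=k)           # current prediction
e = x - x_tilde                        # error vector

# minimize d(x, x~ + beta(u)) over u, and minimize d(e, beta(u)) over u
u_full = int(np.argmin(np.sum((x - (x_tilde + codebook)) ** 2, axis=1)))
u_err = int(np.argmin(np.sum((e - codebook) ** 2, axis=1)))
assert u_full == u_err                 # identical for any difference distortion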
The decoder is simply a table lookup. It can be implemented by a ROM storing the reproduction error vectors.
The final reproduction vector is obtained by adding the
outputs of the decoder and the predictor.
The performance of a PVQ is given by its long term average distortion

A = D(x, \hat{x}) = \lim_{N \to \infty} (1/N) \sum_{n=1}^{N} d(x_n, \hat{x}_n),

if the limit exists. In practice, we design a system by minimizing the sample average

A_L = D_L(x, \hat{x}) = (1/L) \sum_{n=1}^{L} d(x_n, \hat{x}_n),

for large L.
A PVQ is a sequential machine or a state machine with an infinite number of states. Unlike a finite state VQ [9], [10] which designs different codebooks for each state, a PVQ stores only one codebook for all states. This implies that the storage requirement and search complexity of PVQ are almost the same as for a memoryless VQ with the same rate and dimension. As in the scalar case, however, a PVQ outperforms a memoryless VQ since the correlation between vectors is used more effectively.
A major problem of many feedback systems is the channel error propagation effect. Like scalar predictive quantization, a PVQ has this problem. As with the scalar system, the effects of occasional errors should die out with time if the predictor is stable (e.g., the predictor gain is strictly less than 1, and hence, the system is not a delta modulator in the scalar case).
III. GRADIENT ALGORITHMS FOR DESIGNING PVQ

A PVQ system consists of three functions: an encoder, a decoder, and a predictor. Although the minimum distortion encoding rule is not necessarily optimum for a feedback system, it gives satisfying performance and it is easy to implement for a PVQ system. Hence, the encoder still obeys the minimum distortion rule, and only a decoder and a predictor have to be designed and stored for use.
For simplicity, we first isolate the decoder and the predictor from the feedback loop to derive their update formulas. In other words, we derive formulas to update the quantizer codebook for a given error signal and encoder, and formulas to update the predictor for a given predictor input sequence, separately. Then we combine these formulas to design the whole PVQ system.
In this section, we present two gradient algorithms for designing both the quantizer codebook and predictor coefficients of a PVQ system. The fundamentals and applications of adaptive signal processing can be found in [11], from which we borrow some notation and nomenclature.
A. Steepest Descent Algorithm
The method of steepest descent is one of the oldest and
most widely known methods for minimizing
a function of
several variables. The formulas to update the quantizer
and the predictor are given next.
Update the Quantizer Codebook: The goal of designing a PVQ system is to minimize the average distortion. Assume the average distortion D(x, \hat{x}) is differentiable; then the basic steepest descent formula to update the quantizer is

\hat{e}_{i,m+1} = \hat{e}_{i,m} - \mu_q \nabla_{\hat{e}_i} D(x_n, \hat{x}_n),   i = 1, ..., 2^R,

where \hat{e}_i is the reproduction codeword with index i and \mu_q is the step size (which affects the rate of convergence and stability). The choice of step size is discussed later. \nabla_{\hat{e}_i} is the gradient with respect to \hat{e}_i, and m is the step number. In words, new codewords are formed by searching along the direction of the negative gradient of the average distortion from old codewords.
By the definition of D(x_n, \hat{x}_n) and (1), we obtain

\nabla_{\hat{e}_i} D(x_n, \hat{x}_n) = \lim_{L \to \infty} (1/L_i) \sum_{j: \gamma(e_j) = u_i} \nabla_{\hat{e}_i} d(e_j, \hat{e}_{i,m}),   i = 1, ..., 2^R,

where L_i is the number of training vectors which are mapped into codeword i.
We consider weighted squared error distortion measures of the form

d(x, \hat{x}) = (x - \hat{x})^T W (x - \hat{x}),

where W is some positive-definite matrix. Then,

\nabla_{\hat{e}_i} d(e_j, \hat{e}_{i,m}) = \nabla_{\hat{e}_i} (e_j - \hat{e}_{i,m})^T W (e_j - \hat{e}_{i,m}) = -(W + W^T)(e_j - \hat{e}_{i,m}).

The gradient is proportional to the difference between the quantizer input and the mapped codeword.
Therefore, the steepest descent formula of the quantizer is

\hat{e}_{i,m+1} = \hat{e}_{i,m} + \mu_q (W + W^T) \lim_{L \to \infty} (1/L_i) \sum_{j: \gamma(e_j) = u_i} (e_j - \hat{e}_{i,m}),   i = 1, ..., 2^R.   (3)

If simple squared distortion is considered, (3) becomes

\hat{e}_{i,m+1} = \hat{e}_{i,m} + 2\mu_q \lim_{L \to \infty} (1/L_i) \sum_{j: \gamma(e_j) = u_i} (e_j - \hat{e}_{i,m}),   i = 1, ..., 2^R.   (4)

Practically, a long training sequence is used to design a quantization system. The limit of (4) can be approximated by a sum over a long training sequence

\hat{e}_{i,m+1} = \hat{e}_{i,m} + 2\mu_q (1/L_i) \sum_{j: \gamma(e_j) = u_i} (e_j - \hat{e}_{i,m}),   i = 1, ..., 2^R.   (5)

Note that (1/L_i) \sum_{j: \gamma(e_j) = u_i} e_j is just the centroid or center of gravity of all source vectors encoded into channel symbol u_i. For a given encoder \gamma, the optimum decoder is the one whose codewords are centroids of all training vectors mapped into each channel symbol [1].
The choice of 2\mu_q significantly affects the performance of this algorithm. To make the analysis easy, we only consider the quantizer itself, i.e., {e_n} is assumed fixed for every update step. Equation (5) can also be expressed as

\hat{e}_{i,m+1} = (1 - 2\mu_q) \hat{e}_{i,m} + 2\mu_q (1/L_i) \sum_{j: \gamma(e_j) = u_i} e_j,   i = 1, ..., 2^R.

The equation is defined to be "stable" if and only if

|1 - 2\mu_q| < 1,   i.e.,   0 < 2\mu_q < 2.

Observe that when 2\mu_q is less than 1, the rate of convergence increases as 2\mu_q increases, reaching the maximum rate at 2\mu_q = 1. At this maximum rate the optimal solution, that is, the replacement of old codewords by the centroids, is reached in a single step. For 0 < 2\mu_q < 1, there is no oscillation in the codeword updating and the process is said to be overdamped. For 1 < 2\mu_q < 2, the updating process is underdamped and converges in a decaying oscillation. When 2\mu_q = 1, the process is critically damped, and (5) becomes

\hat{e}_{i,m+1} = (1/L_i) \sum_{j: \gamma(e_j) = u_i} e_j,   i = 1, ..., 2^R,

which is exactly the generalized Lloyd algorithm [4], [12]. This also shows that the Lloyd algorithm, which achieves the optimal solution in one step, has the fastest convergence rate among the family of steepest descent algorithms for a given encoder and training sequence. An algorithm with a slower convergence rate, however, may avoid bad local optima by giving a smoother search [13], [14].
Update the Predictor: Assume the average distortion D(x, \hat{x}) is differentiable; the general steepest descent formula for updating the predictor is

A_{m+1} = A_m - \mu_p \nabla_{A_m} D(x, \hat{x}),

where A is the predictor coefficient matrix, m is the step number, and \mu_p is the step size for updating the predictor.
For a PVQ system with a linear predictor as in (2),

\tilde{x}_n = A \hat{x}^r_{n-1},   n = 1, 2, ...,

and the weighted squared distortion measure

d(x, \hat{x}) = (x - \hat{x})^T W (x - \hat{x}),

the gradient of the average distortion is

\nabla_{A_m} D(x, \hat{x}) = \nabla_{A_m} \lim_{L \to \infty} (1/L) \sum_{n=1}^{L} (x_n - A_m \hat{x}^r_{n-1} - \beta(u_n))^T W (x_n - A_m \hat{x}^r_{n-1} - \beta(u_n))
   = -(W + W^T) \lim_{L \to \infty} (1/L) \sum_{n=1}^{L} (x_n - \hat{x}_n)(\hat{x}^r_{n-1})^T.   (6)

Define P as a k x p cross-correlation matrix and R as a p x p correlation matrix. Then

\nabla_{A_m} D(x_n, \hat{x}_n) = -(W + W^T)(P - A_m R).   (7)

Let A* be a value of A yielding a zero gradient above, and hence, satisfying a necessary condition for optimality in the sense of yielding the minimum average distortion,

\nabla_{A_m} D(x_n, \hat{x}_n) = 0.

A solution is obviously

A* = P R^{-1},   (8)

if R is invertible. This is the Wiener-Hopf equation in matrix form for the vector predictor. From (7), the steepest descent formula of the predictor is

A_{m+1} = A_m + \mu_p (W + W^T)(P - A_m R).   (9)
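The predictor side of this derivation can be sketched as follows, under the assumption (the displayed definitions did not survive transcription) that P and R are estimated as sample cross- and auto-correlation matrices of the training vectors and the time-reversed past reproductions; the code and names are illustrative, not the authors' implementation.

import numpy as np

def wiener_predictor(x_vectors, xhat_windows):
    # A* = P R^{-1} as in (8): the least-squares predictor of x_n from the
    # reversed window of past reconstructed samples x^r_{n-1}.
    # x_vectors: (L, k) training vectors; xhat_windows: (L, p) windows.
    L = x_vectors.shape[0]
    P = x_vectors.T @ xhat_windows / L          # assumed k x p cross-correlation
    R = xhat_windows.T @ xhat_windows / L       # assumed p x p correlation
    return P @ np.linalg.inv(R), P, R

def sd_predictor_step(A, P, R, mu_p):
    # one steepest descent step (9) with W = I, so W + W^T = 2I
    return A + 2.0 * mu_p * (P - A @ R)

# toy data: the iteration approaches A* when 0 < 2*mu_p < 2/lambda_max(R)
rng = np.random.default_rng(2)
L, k, p = 500, 2, 3
windows = rng.normal(size=(L, p))
x = windows @ rng.normal(size=(k, p)).T + 0.1 * rng.normal(size=(L, k))
A_star, P, R = wiener_predictor(x, windows)
A = np.zeros((k, p))
for _ in range(200):
    A = sd_predictor_step(A, P, R, mu_p=0.1)
print(np.max(np.abs(A - A_star)))               # small after enough steps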
Practically, the correlation matrices are more difficult to get than simple differences of x - \hat{x}; from (6) we have

A_{m+1} = A_m + \mu_p (W + W^T) \lim_{L \to \infty} (1/L) \sum_{n=1}^{L} (x_n - \hat{x}_n)(\hat{x}^r_{n-1})^T.   (10)

Again, if the simple squared distortion measure is chosen, then

A_{m+1} = A_m + 2\mu_p (P - A_m R),   (11)

and the limit is dropped if we use a long training sequence in practice,

A_{m+1} = A_m + 2\mu_p (1/L) \sum_{n=1}^{L} (x_n - \hat{x}_n)(\hat{x}^r_{n-1})^T.   (12)

The stability condition of 2\mu_p is more difficult to analyze than that of 2\mu_q since R is generally not diagonal. However, by the translating and rotating operations

A_m = Y_m Q + A*,

where Y_m is the new prediction matrix and Q is a p x p eigenvector matrix, with some work the algorithm can be derived as

Y_{m+1} = Y_m (I - 2\mu_p \Lambda),   (13)

where \Lambda = Q R Q^{-1} is the eigenvalue matrix in which the eigenvalues appear on the diagonal and zeros elsewhere. Equation (13) is stable and convergent if and only if

|1 - 2\mu_p \lambda_{max}| < 1,   i.e.,   0 < 2\mu_p < 2/\lambda_{max},

where \lambda_{max} is the largest eigenvalue of R.
To get the optimal step size, where A* is reached in the minimum number of steps, we put (8) into (11), and get

A_{m+1} = A_m (I - 2\mu_p R) + 2\mu_p A* R.

Since the equation 2\mu_p R = I does not hold for a scalar 2\mu_p and a general p x p matrix R, this implies that a scalar 2\mu_p may not be capable of being the optimal step size. Thus, we generalize the step size to be a p x p matrix 2\mu_p, which has the same size as R. Starting from the basic formula A_{m+1} = A_m - \nabla_{A_m} D(x, \hat{x}) 2\mu_p and using similar derivations, we get the formulas

A_{m+1} = A_m + (1/L) \sum_{n=1}^{L} (x_n - \hat{x}_n)(\hat{x}^r_{n-1})^T 2\mu_p

and

A_{m+1} = A_m (I - 2 R \mu_p) + 2 A* R \mu_p.

Thus, the optimal value of 2\mu_p is 2\mu_p = R^{-1} if R is invertible, in which case A* is reached in a single step. This choice is just Newton's method, which has the fastest rate of convergence but a slightly more complicated calculation.
Observe that Newton's method is indeed solving the Wiener-Hopf equation to get the optimal solution A* for the predictor with fixed encoder and decoder. This solution, however, may not be optimal for the whole system while updating the decoder and the predictor simultaneously. In our simulations, this solution occasionally even resulted in an unstable system. The steepest descent algorithm provides the possibility of obtaining the optimal solution for the whole system by properly choosing step sizes.
Design Procedures: The formulas for optimizing the quantizer and the predictor have been derived separately. We now combine these formulas to design the complete PVQ system.
There are two approaches to optimize the decoder and the predictor of a PVQ system.
1) Optimize the decoder for a fixed predictor and optimize the predictor for a fixed decoder separately. Iterate these procedures until convergence.
2) Optimize the decoder and the predictor simultaneously and iterate until convergence.
In general, the performance improvement of each iteration in optimizing either the decoder or the predictor tends to decrease rapidly if it converges. Thus, approach 2) may have a faster rate of convergence and hence we choose this approach. The steepest descent algorithm with the simple mean squared distortion measure is described as follows.
Step 0—Initialization:
Given:
training sequence {x_n}_{n=1}^{L},
vector dimension = k,
order of predictor = p,
rate = R bits per vector = r bits per sample,
size of reproduction codebooks = 2^R,
initial quantizer codebook C_0 = {\beta(u), u \in M},
initial predictor A_0,
convergence threshold \delta;
set m = 0, D_{-1} = \infty, \hat{x}_0 = 0, \tilde{x}_0 = 0.
Step 1—Minimum Distortion Encoding:
Obtain the error sequence {e_n = x_n - A_m \hat{x}^r_{n-1}, n = 1, ..., L}.
Encode: u_n = \min^{-1}_{u \in M} d(e_n, \beta(u)).
Compute the average distortion D_m = (1/L) \sum_{n=1}^{L} d(e_n, \beta(u_n)).
If (D_{m-1} - D_m)/D_m < \delta, halt with final codebook and predictor C_m, A_m. Otherwise continue.
Step 2—Quantizer Update: update each codeword \hat{e}_{i,m} according to (5).
Step 3—Predictor Update:

A_{m+1} = A_m + (1/L) \sum_{n=1}^{L} (x_n - \hat{x}_n)(\hat{x}^r_{n-1})^T 2\mu_p.

Set m <- m + 1; go to step 1.
This algorithm is an iterative improvement algorithm and requires an initial codebook and predictor to start the process. The "splitting" technique is applied to generate big initial codebooks from small ones since it keeps the original codeword as a member of the new codebook, so that the new average distortion will not increase. The initial predictor of the whole process is simply set to 0, since this disconnects the feedback loop and ensures the stability of the system.
The choice of step sizes is the major factor affecting the performance of this algorithm. The optimal step sizes for updating the quantizer and the predictor (2\mu_q = 1 and 2\mu_p = R^{-1}, respectively) are chosen in all simulations unless stated otherwise.
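Pulling these pieces together, the sketch below shows one full design iteration (Steps 1-3) for the simple squared distortion with 2\mu_q = 1 (centroid replacement) and a scalar predictor step size; it is an illustrative reading of the procedure above, not the authors' code.

import numpy as np

def sd_design_pass(x, A, codebook, mu_p):
    # One steepest descent design iteration: closed-loop encode the whole
    # training sequence, then update the codebook (centroids) and the predictor.
    L, k = x.shape
    p = A.shape[1]
    samples = []                                  # reconstructed scalar samples
    errors, idx = [], []
    grad = np.zeros_like(A)
    D = 0.0
    for n in range(L):
        w = np.zeros(p)                           # x^r_{n-1}, reversed, zero-padded
        take = min(p, len(samples))
        if take:
            w[:take] = np.asarray(samples[-take:])[::-1]
        x_tilde = A @ w
        e = x[n] - x_tilde
        u = int(np.argmin(np.sum((codebook - e) ** 2, axis=1)))
        x_hat = x_tilde + codebook[u]
        samples.extend(x_hat)
        errors.append(e)
        idx.append(u)
        D += np.sum((x[n] - x_hat) ** 2)
        grad += np.outer(x[n] - x_hat, w)         # accumulates (x_n - x^_n) x^rT_{n-1}
    errors, idx = np.array(errors), np.array(idx)
    new_codebook = codebook.copy()
    for i in range(len(codebook)):                # Step 2 with 2*mu_q = 1: centroids
        if np.any(idx == i):
            new_codebook[i] = errors[idx == i].mean(axis=0)
    new_A = A + 2.0 * mu_p * grad / L             # Step 3, a scalar-step form of (12)
    return new_codebook, new_A, D / L             # iterate until D stops decreasing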
B. Stochastic Gradient Algorithms

In the preceding section, the quantizer and the predictor are updated once for the whole training sequence. Another algorithm widely used in adaptive systems is the so-called least-mean-square or LMS algorithm, which updates these parameters for each incoming vector. In this section, we present an algorithm that is similar to LMS, but differs in that its step sizes are not fixed but decrease with time or input signals. It is called the stochastic gradient (SG) algorithm.
Update the Quantizer Codebook: The quantizer is updated with each incoming vector in the SG algorithm. Hence, the gradient of the average distortion is replaced by the gradient of the current distortion, and the basic formula becomes

\hat{e}_{i,m+1} = \hat{e}_{i,m} - \mu_{q,n} \nabla_{\hat{e}_i} d(x_n, \hat{x}_n),   i = 1, ..., 2^R,

where the step number m is equal to the vector number n, and \hat{e}_{j,m} denotes the codeword chosen to represent e_n by the encoding rule. Putting it into the basic formula, we get the SG formula for updating the quantizer. For i = 1, ..., 2^R,

\hat{e}_{i,m+1} = \hat{e}_{i,m} + \mu_{q,n} (W + W^T)(e_n - \hat{e}_{i,m})   if i = j,
\hat{e}_{i,m+1} = \hat{e}_{i,m}   otherwise.   (14)

No matter how large the codebook is, only one codeword needs to be updated with each incoming vector, since only one codeword represents e_n. If the simple squared distortion is considered, (14) simplifies to

\hat{e}_{i,m+1} = \hat{e}_{i,m} + 2\mu_{q,n} (e_n - \hat{e}_{i,m})   if i = j,
\hat{e}_{i,m+1} = \hat{e}_{i,m}   otherwise.   (15)

This is a very simple formula. To analyze the stability condition of the step size, we follow a similar analysis to the steepest descent algorithm, and get a sufficient condition for stability

0 <= 2\mu_{q,n} < 2,   n = 1, ..., L.

The final codebook is obtained from the last update of the whole training sequence. Since the local statistical behavior of speech is time varying, the step size should be chosen very small to protect the final codebook from undue influence of the local behavior of individual samples near the end of the training sequence. A small step size may result in an inefficient adaptation, however, and lead to a nonoptimal solution. To solve this problem, a decreasing step size sequence is chosen so that the algorithm can achieve the range of optimal values rapidly with large step sizes; then it can minimize the error from optimal values with small step sizes.
Eweda and Macchi [15] gave a formula for the step size and proved that an adaptive linear estimator with this step size sequence is almost-sure (a.s.) and quadratic mean convergent. The step size \mu_n of the algorithm is a decreasing sequence of positive numbers satisfying the conditions given in [15].
This sequence is applied to all SG simulations in this paper. To reduce the complexity, \mu_n is not updated for every vector but for every block ranging from hundreds to thousands of training vectors. Although a PVQ system is slightly different from their system, this formula worked well in all simulations.
Update the Predictor: The SG algorithm to update the predictor is

A_{m+1} = A_m - \mu_{p,n} \nabla_{A_m} d(x_n, \hat{x}_n).

Following a similar analysis to the steepest descent algorithm, the SG algorithm for updating the predictor is derived as

A_{m+1} = A_m + \mu_{p,n} (W + W^T)(x_n - \hat{x}_n)(\hat{x}^r_{n-1})^T   (17)

for the general weighted squared distortion, and

A_{m+1} = A_m + 2\mu_{p,n} (x_n - \hat{x}_n)(\hat{x}^r_{n-1})^T   (18)

for the simple squared distortion.
A decreasing sequence satisfying these conditions is chosen as the step size to achieve both rapid convergence and minimal error. For simplicity, the 2\mu_{p,n} sequence is chosen as a decreasing sequence normalized by \sigma^2(\hat{x}), the variance of \hat{x}. The matrix form of 2\mu_p is not under consideration, since the SG algorithm needs an update for each training vector, and the computation of R^{-1} substantially increases the complexity of the design procedures.
Design Procedures: The SG design algorithm with the simple mean squared distortion measure is summarized below.
Step 0—Initialization:
Given:
training sequence {x_n}_{n=1}^{L},
vector dimension = k,
order of predictor = p,
rate = R bits per vector = r bits per sample,
size of reproduction codebooks = 2^R,
initial quantizer codebook C_0 = {\beta(u), u \in M},
initial predictor A_0;
set n = 0, \hat{x}_0 = 0, \tilde{x}_0 = 0.
Step 1—Minimum Distortion Encoding:
Set n <- n + 1.
Obtain the error vector e_n = x_n - A_{n-1} \hat{x}^r_{n-1}.
Encode: u_n = \min^{-1}_{u \in M} d(e_n, \beta(u)).
Step 2—Quantizer Update:

\hat{e}_{i,n+1} = \hat{e}_{i,n} + 2\mu_{q,n} (e_n - \hat{e}_{i,n})   if \beta(\gamma(e_n)) = \hat{e}_{i,n}; otherwise \hat{e}_{i,n+1} = \hat{e}_{i,n}.

Step 3—Predictor Update:

A_{n+1} = A_n + 2\mu_{p,n} (x_n - \hat{x}_n)(\hat{x}^r_{n-1})^T.

Step 4: Go to step 1 until the training sequence is exhausted.
The "splitting" technique is again applied to generate the initial codebooks. The training sequence is assumed to be sufficiently long so that the system will have converged when the training sequence is exhausted. However, if the training sequence is not long enough for convergence in one pass, it is repeated several times until the system converges. The convergence can be determined by either testing whether the changes of each codeword and the predictor coefficients are less than a threshold, or testing whether the change of average distortion is small enough for a period of time. We choose the latter method in our design algorithm because it is easier to implement.
C. Cuperman-Gersho Design Algorithm

In this section we discuss basic differences between the design algorithm developed by Cuperman and Gersho [2], [3] and the gradient algorithms developed here. For easy comparison, we consider nonadaptive PVQ systems only. The principal difference between these algorithms is the design of the predictor. The Cuperman-Gersho algorithm designs the predictor based on the assumption of perfect reproduction.
For a PVQ system, the optimal predictor for a fixed quantizer and training sequence has been found as (8),

A* = P R^{-1}.

The Cuperman-Gersho algorithm uses A' to approximate A*, where A' is obtained by

A' = P' R'^{-1},   (19)

where P' is a k x p autocorrelation matrix,

P' = E(x_n (x^r_{n-1})^T),

and R' is a p x p autocorrelation matrix,

R' = E(x^r_{n-1} (x^r_{n-1})^T).

Note that A' is determined only by the statistical properties of the input signals.
If the rate (or quantization SNR) is sufficiently large so that the quantization error is negligible, then

E(x_n (\hat{x}^r_{n-1})^T) \cong E(x_n (x^r_{n-1})^T)

and

E(\hat{x}^r_{n-1} (\hat{x}^r_{n-1})^T) \cong E(x^r_{n-1} (x^r_{n-1})^T),

and hence, A' \cong A*. In other words, the predictor is nearly optimal under this assumption, even though it does not use the true predictor inputs in the design.
P(r(e,,))=
~ i , ~
IV. SIMULATIONS
Gauss-Markov sources and sampled speech sequences
were used to design and testPVQ systems. Simple squared
.error distortion was chosen as the distortion measure for
IEEE TRANSACTIONSONACOUSTICS,SPEECH,ANDSIGNALPROCESSING,VOL.ASSP-34,NO.4,AUGUST
686
1986
TABLE I
VQ VERSUS PVQ FOR A GAUSS-MARKOV SOURCE. SIGNAL-TO-NOISE RATIOS INSIDE AND OUTSIDE THE TRAINING SEQUENCE (SNR) OF FULL SEARCH MEMORYLESS VQ (VQ); SIGNAL-TO-NOISE RATIOS INSIDE THE TRAINING SEQUENCE (SNRIN), SIGNAL-TO-NOISE RATIOS OUTSIDE THE TRAINING SEQUENCE (SNROUT), AND THE FIRST PREDICTION COEFFICIENT (a_1) OF PVQ USING THE STEEPEST DESCENT ALGORITHM (PVQ(SD)) VERSUS PVQ USING THE STOCHASTIC GRADIENT ALGORITHM (PVQ(SG)). RATE = 1 BIT/SAMPLE. k = VECTOR DIMENSION. GAUSS-MARKOV SOURCE WITH CORRELATION COEFFICIENT 0.9
The performances of the systems are given by the signal-to-quantization-noise ratio or SNR, defined as the inverse of the normalized average distortion on a logarithmic scale.
Gauss-Markov Sources: A Gauss-Markov source or a first-order Gauss autoregressive source {X_n} is defined by the difference equation X_{n+1} = a X_n + W_n, where a is the autoregression constant and {W_n} is a zero mean, unit variance, independent, identically distributed Gaussian source. This source is of interest since it is a popular guinea pig for data compression systems and since its optimal performance, the rate-distortion function, is known. We here consider only the highly correlated case of a = 0.9 with transmission rate r = 1 bit/sample. The maximum achievable SNR given by Shannon's distortion-rate function of this source and rate is 13.2 dB [16].
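The source and the quoted 13.2 dB figure are easy to check in a few lines; the SNR bound below uses the standard distortion-rate expression for a Gaussian autoregressive source, which is valid at this rate.

import numpy as np

# first-order Gauss-Markov source: X_{n+1} = a X_n + W_n, unit-variance i.i.d. Gaussian W
a, rate = 0.9, 1.0
rng = np.random.default_rng(4)
L = 60_000
w = rng.normal(size=L)
x = np.zeros(L)
for n in range(1, L):
    x[n] = a * x[n - 1] + w[n - 1]

# maximum achievable SNR from the distortion-rate function:
# SNR = 6.02*R + 10*log10(1/(1 - a^2)) dB, about 13.2 dB at R = 1 bit/sample
snr_bound = 6.02 * rate + 10.0 * np.log10(1.0 / (1.0 - a ** 2))
print(round(snr_bound, 1))                         # 13.2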
Both the steepest descent algorithm and the stochastic gradient algorithm were used to design first-order PVQ systems for a training sequence of 60 000 samples. All codes were tested by a separate sequence of 60 000 samples. Table I shows the results for various dimensions. As expected, the PVQ systems designed by both algorithms outperform the memoryless VQ in all cases. (Note that a memoryless VQ can be considered as a PVQ with a predictor that always produces an all-zero vector. Hence, memoryless VQ can be viewed as a special case of PVQ.) The difference in performance starts large at dimension 1 and decreases as the dimension increases. Observe that a simple scalar predictive quantizer achieves the performance of the memoryless VQ with dimension = 4, and the performance is the same as the analytically optimized predictive quantization system of Arnstein [17] run on the same data. (Arnstein optimized the quantizer for the predictor, but not vice versa.) The test SNR's (SNRout) are within 0.1 dB of the design SNR's (SNRin) in all cases. The good performance of PVQ systems for this source is probably due to the strong similarity between the Gauss-Markov source model and the PVQ structure.
The SD and SG algorithms yield almost the same performance for this source. The designed codebooks and predictors are also close for low dimensions. For large dimensions, however, the codebooks are quite different.
As shown in the table, both sets of prediction coefficients are very close to the autoregressive constant of the source. This implies that the autoregressive constant of this source is a good estimate of the prediction coefficient of a PVQ system.
Only first-order PVQ systems were considered since the source was a first-order Markov source.
To compare the convergence rates of the SD and SG algorithms, their learning curves of SNR and the prediction coefficient a_1 for designing a 2-dimensional 4-codeword PVQ are shown in Fig. 2. The SNR curves for the SG algorithm represent the signal-to-noise ratios for the partial training sequence from the beginning to the current samples. The learning curves of the SD algorithm tend to jump from one iteration to the next, while the SG algorithm moves more smoothly. These different converging paths are possibly the reason for the convergence to different local optima, and hence the generation of different codes. Both algorithms converged very fast, usually in fewer than 10 iterations or 10 passes.
Sampled Speech: A training sequence of 640 000 samples of speech from five male speakers sampled at 6.5 kHz was used to design PVQ systems. The designed systems were then tested on a test sequence of 76 800 samples from a sixth male speaker. The training and test sequences are the same as those used in [1] and [18] for easy comparison.
Table II shows the results of first-order and second-order PVQ systems designed by the SD and SG algorithms. Both algorithms yield similar results. The differences between the design distortion and test distortion increase as the dimension increases. Only PVQ systems with dimension up to 6 were considered. Since the autocorrelation function decreases with the lag, the accuracy of vector predictors with higher dimensions is reduced. From the table, first-order PVQ's outperform VQ by from 0.5 dB to more than 1 dB on the test sequence. To further improve the performance, second-order PVQ's were designed. The improvement over first-order PVQ's is 0.1 dB to 0.2 dB in general.
Fig. 2. Learning curves for a Gauss-Markov source. (a) Steepest descent
algorithm. (b) Stochastic gradient algorithm.
As with scalar predictive quantizers, the performance of PVQ's tends to saturate when the order is over 2, and hence, higher order PVQ's were not considered.
In addition to the performance, the computational complexity and storage requirement are also important properties of a coding system. The complexity of a PVQ can be measured by the number of multiplications per sample. The quantizer complexity, as given in [1], is equal to the number of codewords searched, e.g., 2^{kR} if it is full searched. The prediction complexity is the number of multiplications to generate the predictor output, which equals the order p. Therefore, the overall complexity of a PVQ is m = 2^{kR} + p. Considering the storage requirement, both the quantizer codebook and the predictor matrix of a PVQ have to be stored. The quantizer storage is k 2^{kR} real values as shown in [1]. The predictor storage is kp real values, since the predictor matrix is a k x p matrix. Thus, the overall storage requirement is k 2^{kR} + kp. Observe that both the complexity and the storage requirement of the quantizer grow exponentially with the dimension, while those of the predictor grow linearly with the predictor order.
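These two counting formulas are simple to tabulate; the sketch below reproduces the pattern of Table III (r is the rate in bits per sample, and p = 0 recovers the memoryless VQ).

def pvq_cost(k, p, r=1):
    # multiplications per sample and stored real values for a dimension-k,
    # rate-r, order-p PVQ with a full-search codebook of 2^{kr} codewords
    codewords = 2 ** (k * r)
    complexity = codewords + p                 # m = 2^{kR} + p
    storage = k * codewords + k * p            # k 2^{kR} + kp
    return complexity, storage

for k in range(1, 9):
    print(k, pvq_cost(k, 0), pvq_cost(k, 1), pvq_cost(k, 2))
# e.g., k = 6: VQ (64, 384), first-order PVQ (65, 390), second-order PVQ (66, 396)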
Table III shows the complexity and storage requirements of VQ and PVQ. Comparing the same rate and dimension, a PVQ requires only a small increase in complexity and storage over a memoryless VQ. If we consider
the same performance, a PVQ is more attractive. For example, the performance of a 6-dimensional PVQ is approximately equal to an 8-dimensional memoryless VQ.
However, the complexity and the storage of the former
are much less than those of the latter. Informal listening
tests show consistent results. The quality of the test sequence of a 6-dimensional PVQ is superior to that of a 6-dimensional memoryless VQ, and the difference between
a 6-dimensional PVQ and an 8-dimensional memoryless
VQ is inaudible.
The learning curves of the two algorithms used to design a first-order, 2-dimensional, 4-codeword PVQ are shown in Fig. 3. Both algorithms yield very close SNR and a_1, but their learning curves are quite different. Note that there is an overshoot phenomenon of a_1 in the SD algorithm. This implies that the optimal step sizes for designing the predictor itself may be too large for designing the whole PVQ system for sampled speech. A smaller step size may be more suitable to make the design process stable. In fact, we have chosen smaller step sizes to design second-order PVQ systems in our simulations.
To examine the effect of the assumption that the quantization error is small, we simulated the Cuperman-Gersho algorithm with the same training and test speech sequences. The simulation results are also shown in Table II.
TABLE II
VQ VERSUS PVQ FOR SAMPLED SPEECH. SIGNAL-TO-NOISE RATIOS INSIDE THE TRAINING SEQUENCE (SNRIN) OF 640 000 SPEECH SAMPLES AND SIGNAL-TO-NOISE RATIOS OUTSIDE THE TRAINING SEQUENCE (SNROUT) OF 76 800 SPEECH SAMPLES, FOR FULL SEARCH MEMORYLESS VQ (VQ) AND FOR FIRST-ORDER AND SECOND-ORDER PVQ USING THE CUPERMAN-GERSHO ALGORITHM (CG), THE STEEPEST DESCENT ALGORITHM (SD), AND THE STOCHASTIC GRADIENT ALGORITHM (SG). RATE = 1 BIT/SAMPLE. k = VECTOR DIMENSION
TABLE III
COMPLEXITY AND STORAGE REQUIREMENT OF VQ VERSUS PVQ. NUMBER OF MULTIPLICATIONS PER SAMPLE (COMPLEXITY) AND STORAGE REQUIREMENT (STORAGE) FOR FULL SEARCH MEMORYLESS VQ (VQ), FIRST-ORDER PVQ (PVQ1), AND SECOND-ORDER PVQ (PVQ2). RATE = 1 BIT/SAMPLE. k = VECTOR DIMENSION

       Complexity               Storage
k     VQ   PVQ1  PVQ2       VQ    PVQ1  PVQ2
1      2     3     4         2      3     4
2      4     5     6         8     10    12
3      8     9    10        24     27    30
4     16    17    18        64     68    72
5     32    33    34       160    165   170
6     64    65    66       384    390   396
7    128   129   130       896    903   910
8    256   257   258      2048   2064  2080
As expected, for the smaller SNR's this algorithm yields performance inferior to that of the SD and SG, but the differences are small and become negligible as dimension increases.
Comparing the prediction coefficients obtained by the different algorithms, they are different but have similar shapes. For example, the predictor matrix of a 6-dimensional first-order PVQ is as follows.
Cuperman-Gersho Algorithm: A^T = 0.78, 0.42, 0.09, -0.17, -0.36, -0.46.
Steepest Descent Algorithm: A^T = 0.70, 0.34, -0.01, -0.28, -0.50, -0.61.
Stochastic Gradient Algorithm: A^T = 0.73, 0.39, 0.08, -0.20, -0.42, -0.54.
Although the prediction coefficients are different, they
all yield very close performance.
Finally, we designed PVQ systems for fixed predictors in order to examine the sensitivity of the performance to the prediction coefficients. In these simulations, vector quantizers were designed by the generalized Lloyd algorithm for fixed predictors whose prediction coefficients were randomly chosen from a range around the optimal values. Surprisingly, all results show very good performance (within 0.2 dB of the best performance) even when their prediction coefficients are ±0.2 away from the optimal values. This shows that the performance of the system is not sensitive to the prediction coefficients; thus, the design of the predictor is not very critical. Even a poorly designed predictor can be compensated for by a well-designed quantizer, provided that the codebook size is large enough.
Fig. 3. Learning curves for sampled speech. (a) Steepest descent algorithm. (b) Stochastic gradient algorithm.
V. COMMENTS

We have introduced SD and SG algorithms for designing PVQ systems. Experimentally, the SG algorithm yields slightly better performance, but it is not consistently better. The step sizes of the SD algorithm are very easy to choose, since their optimal values for updating the quantizer and the predictor have been developed separately. For the SG algorithm, the choice of step sizes is still a problem. From the simulations, different step size sequences result in different codes and performance, and no formula consistently yields the best performance in all cases. Hence, the SD algorithm is at the moment the easiest to use in practice.
The simulation results show that PVQ provides improvements in performance over memoryless VQ for a given rate, complexity, and storage. For Gauss-Markov sources, the improvement ranges from 6 dB for scalar predictive quantizers to about 1 dB for dimension 6 predictive quantizers. For sampled speech waveforms, the improvement was 2.3 dB for scalar predictive quantizers, and ranged from 0.8 dB to 1.4 dB for higher dimensions. Alternatively, PVQ provided approximately the same performance as VQ with only a fraction of the complexity and storage. Although PVQ suffers from the channel error propagation problem, it is not as severe as with other feedback quantizers. Thus, it is an inexpensive way to improve the performance of VQ by expanding the VQ into a PVQ system.
For low-dimensional PVQ, gradient algorithms perform better than other existing algorithms. For higher dimensional PVQ, all algorithms give similar performance since the optimization of the quantizer adapts it well to a wide range of predictors. Nevertheless, gradient algorithms still yield slightly better performance.
Only nonadaptive PVQ was considered here. An adaptive VQ using one VQ, the model classifier, to adapt a PVQ by selecting one of a collection of predictors should provide better performance and better track locally stationary behavior. Preliminary results of this kind may be found in [3] and [1].
Although the stability conditions and optimal values of step sizes in updating the quantizer and the predictor separately have been derived, no optimality or convergence properties of the jointly designed algorithms have yet been found.
REFERENCES
[1] R. M. Gray, "Vector quantization," IEEE ASSP Mag., vol. 1, pp. 4-29, Apr. 1984.
[2] V. Cuperman and A. Gersho, "Adaptive differential vector coding of speech," in Conf. Rec., GlobeCom 82, Dec. 1982, pp. 1092-1096.
[3] ——, "Vector predictive coding of speech at 16 kb/s," IEEE Trans. Commun., vol. COM-33, pp. 685-696, July 1985.
[4] Y. Linde, A. Buzo, and R. M. Gray, "An algorithm for vector quantizer design," IEEE Trans. Commun., vol. COM-28, pp. 84-95, Jan. 1980.
[5] J. D. Gibson, S. K. Jones, and J. L. Melsa, "Sequentially adaptive prediction and coding of speech signals," IEEE Trans. Commun., vol. COM-22, pp. 1789-1797, Nov. 1974.
[6] J. G. Dunn, "An experimental 9600-bit/s voice digitizer employing adaptive prediction," IEEE Trans. Commun., vol. COM-19, pp. 1021-1032, Dec. 1971.
[7] E. Ayanoglu and R. M. Gray, "The design of predictive trellis waveform coders using the generalized Lloyd algorithm," 1986, submitted for publication.
[8] ——, "The design of joint source and channel trellis waveform coders," 1986, submitted for publication.
[9] J. Foster, R. M. Gray, and M. O. Dunham, "Finite-state vector quantization for waveform coding," IEEE Trans. Inform. Theory, vol. IT-31, pp. 348-359, May 1985.
[10] M. O. Dunham and R. M. Gray, "An algorithm for the design of labeled-transition finite-state vector quantizers," IEEE Trans. Commun., vol. COM-33, pp. 83-89, Jan. 1985.
[11] B. Widrow and S. Stearns, Adaptive Signal Processing. Englewood Cliffs, NJ: Prentice-Hall, 1985.
[12] R. M. Gray and Y. Linde, "Vector quantizers and predictive quantizers for Gauss-Markov sources," IEEE Trans. Commun., vol. COM-30, pp. 381-389, Feb. 1982.
[13] G. H. Freeman, "The design of time-invariant trellis source codes," in Abstracts 1983 IEEE Int. Symp. Inform. Theory, St. Jovite, P.Q., Canada, Sept. 1983, pp. 42-43.
[14] ——, "Design and analysis of trellis source codes," Ph.D. dissertation, Univ. Waterloo, Waterloo, Ont., Canada, 1984.
[15] E. Eweda and O. Macchi, "Convergence of an adaptive linear estimation algorithm," IEEE Trans. Automat. Contr., vol. AC-29, pp. 119-127, Feb. 1984.
[16] T. Berger, Rate Distortion Theory. Englewood Cliffs, NJ: Prentice-Hall, 1971.
[17] D. S. Arnstein, "Quantization error in predictive coders," IEEE Trans. Commun., vol. COM-23, pp. 423-429, Apr. 1975.
[18] H. Abut, R. M. Gray, and G. Rebolledo, "Vector quantization of speech and speech-like waveforms," IEEE Trans. Acoust., Speech, Signal Processing, vol. ASSP-30, pp. 423-435, June 1982.
Pao-Chi Chang was born in Taipei, Taiwan, on January 9, 1955. He received the B.S.E.E. and M.S.E.E. degrees from National Chiao Tung University, Taiwan, in 1977 and 1979, respectively.
From 1979 to 1981 he worked at the Chung Shan Institute of Science and Technology, Taiwan, as an Assistant Scientist. He is currently working toward the Ph.D. degree in the Department of Electrical Engineering, Stanford University, Stanford, CA. His main research interests are speech coding and data compression.
Robert M. Gray (S'68-M'69-SM'77-F'80) was born in San Diego, CA, on November 1, 1943. He received the B.S. and M.S. degrees from the Massachusetts Institute of Technology, Cambridge, in 1966 and the Ph.D. degree from the University of Southern California, Los Angeles, in 1969, all in electrical engineering.
Since 1969 he has been with Stanford University, Stanford, CA, where he is currently a Professor of Electrical Engineering and Director of the Information Systems Laboratory. His research interests are the theory and design of data compression and classification systems, speech coding and recognition, and ergodic and information theory. He is a coauthor, with L. D. Davisson, of Random Processes (Englewood Cliffs, NJ: Prentice-Hall, 1986).
Dr. Gray is a member of the Board of Governors of the IEEE Information Theory Group and served on that board from 1974 to 1980. He was an Associate Editor of the IEEE TRANSACTIONS ON INFORMATION THEORY from September 1977 through October 1980, and was the Editor of that TRANSACTIONS from October 1980 through September 1983. He has been on the program committee of several IEEE International Symposia on Information Theory, and was an IEEE delegate to the Joint IEEE/USSR Workshop on Information Theory in Moscow in 1975. He was Co-recipient with L. D. Davisson of the 1976 IEEE Information Theory Group Paper Award and Co-recipient with A. Buzo, A. H. Gray, Jr., and J. D. Markel of the 1983 IEEE ASSP Senior Award. He was a Fellow of the Japan Society for the Promotion of Science (1981) and the John Simon Guggenheim Memorial Foundation (1981-1982). In 1984 he was awarded an IEEE Centennial Medal. He is a member of Sigma Xi, Eta Kappa Nu, SIAM, IMS, AAAS, and the Société des Ingénieurs et Scientifiques de France. He holds an Advanced Class Amateur Radio License (KB6XQ).