RECURSIVE FILTERING ON SIMD ARCHITECTURES
Rainer Schaffer, Michael Hosemann, Renate Merker, and Gerhard Fettweis
Department of Electrical Engineering and Information Technology
Dresden University of Technology, Germany
{schaffer, merker}@ias.et.tu-dresden.de
{hosemann, fettweis}@ifn.et.tu-dresden.de
ABSTRACT
Recursive filters are used frequently in digital signal processing. They can be implemented in dedicated hardware or in software on a digital signal processor (DSP). Software solutions are often preferable for their speed of implementation and flexibility. However, contemporary DSPs are mostly not fast enough to perform filtering for high data rates or large filters. A method to increase the computational power of a DSP without sacrificing efficiency is to use multiple processor elements controlled by the single-instruction multiple-data (SIMD) paradigm.

The parallelization of recursive algorithms is difficult because of their data dependencies. We use design methods for parallel processor arrays to obtain implementations that can be used on a parallel DSP. Further, we focus on the partitioning of the algorithm so that the realization can be used for different architectures. Consequences for the architecture are considered, too.
1. INTRODUCTION
Recursive filters are used frequently in digital signal processing. They are particularly useful if steep filter responses shall be implemented with a low number of filter taps. Recursive structures are also found in adaptive filters or decision-feedback equalizers. Filters can be implemented in dedicated hardware or in software on a digital signal processor (DSP). Software solutions are often preferred for their speed of implementation and flexibility. However, contemporary DSPs are often not fast enough to perform filtering for high data rates or large filters. In order to increase the computational power, either the clock rate can be raised or multiple processor elements (data paths) can be used. A popular method to increase the computational power of a DSP is to use multiple processor elements controlled by the single-instruction multiple-data (SIMD) paradigm, as in our M3-DSP [1]. However, regular SIMD schemes are unable to cope with the data flows required by recursive filters. Hence we analyze these filters using design methods for parallel processor arrays. This means that we describe algorithms by affine recurrence equations (AREs) [2], which can be transformed into uniform recurrence equations (UREs) using known localization tools [3, 4, 5]. A focus in the design is on the partitioning of the algorithm so that the realization can be used for different architectures and parameters. Based on the results we outline control structures which enhance the SIMD control scheme to cope with recursive filters without requiring excessive overhead as in multiple-instruction multiple-data (MIMD) schemes. These control structures will be implemented in the M5-DSP currently being designed at our institution.

This research has been funded by the Deutsche Forschungsgemeinschaft, projects A1/SFB 358 and A6/SFB 358.
2. UNDERLYING DSP ARCHITECTURE

Fig. 1. Overall Architecture of the Platform-Based DSP (control part: program control, address generation, DMA; data manipulation part: scalable slices, each with data memory, register file, interconnectivity, and data path)
We are designing DSP architectures following the concepts presented in [6]. The architecture is derived from a platform by scaling the number of slices and tailoring the functionality of these slices. Additionally, the communication network between these slices has to be considered, since it can require a large amount of chip space and introduce long delays. Figure 1 shows the overall architecture of our platform-based DSP. The data manipulation part consists of a scalable number of slices, each containing data memory, a register file, a part of the interconnectivity unit (ICU), and a data path. The ICU and data path are tailored to the functionality required by the target algorithms. Such functionality could be an FFT-butterfly-tailored network in the ICU or special arithmetic like Galois-field operations in the ALU. The data paths are capable of performing the required multiply-accumulate (MAC) operations.

The control part performs program control, address generation, and direct memory access (DMA). All slices are controlled by just one program control unit in SIMD fashion. This means that the control overhead remains constant while the number of slices is adjusted to fulfill the computational requirements. However, this also implies limitations in the parallelism that can be exploited in the target application.
3. INFINITE IMPULSE RESPONSE FILTERING

The infinite impulse response (IIR) filter is the most familiar recursive filter. It shall be used in the following for the description of the design process. The IIR filter is given by
$$y_k = y_k^b + y_k^a = \sum_{l=0}^{L-1} b_l\, x_{k-l} + \sum_{j=1}^{J-1} a_j\, y_{k-j}, \qquad (1)$$
where j, k, l ∈ Z and 0 ≤ k < K. The upper bounds of
the indices j and l are 3 ≤ L, J ≤ 20 and for k the upper
bound is K >> 100.
The algorithm is split into two parts, y_k^b and y_k^a, which can be executed sequentially. In the remainder of this paper only the recursive component y_k^a shall be discussed; for the FIR component y_k^b solutions are available. The recursive component in the IIR filter makes a parallel implementation difficult, since each value of y_k^a depends on its predecessors. Hence, consecutive filter outputs cannot be calculated at the same time, as is possible for FIR filters.
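To make this dependence concrete, here is a minimal scalar reference implementation of Eq. (1) in C (our own sketch; the function and array names are illustrative, and the initial states y_{-1}, ..., y_{-J+1} are assumed to be zero):

```c
#include <stddef.h>

/* Direct evaluation of Eq. (1): y[k] = sum_l b[l]*x[k-l] + sum_j a[j]*y[k-j].
 * Negative indices (initial states) are taken as zero. */
void iir_filter(const double *x, double *y, size_t K,
                const double *b, size_t L,   /* FIR weights b[0..L-1] */
                const double *a, size_t J)   /* IIR weights a[1..J-1] */
{
    for (size_t k = 0; k < K; k++) {
        double acc = 0.0;
        for (size_t l = 0; l < L && l <= k; l++)   /* y_k^b */
            acc += b[l] * x[k - l];
        for (size_t j = 1; j < J && j <= k; j++)   /* y_k^a */
            acc += a[j] * y[k - j];
        y[k] = acc;   /* y[k] feeds the computation of y[k+1], ..., y[k+J-1] */
    }
}
```

The recursive inner loop reads y[k-1] in the very next outer iteration, which is exactly the dependence that prevents computing consecutive outputs in parallel.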
Fig. 2. Data dependencies for the IIR filter (index space with axes k = 0, ..., K-1 and j = 1, ..., J-1; calculation values y_k^c, propagation values y_k^p, and initial values y_{-1}, ..., y_{-J+1})
4. MAPPING ON THE ARCHITECTURE

For the calculation of y_k^a, a multiplication of the filter weight a_j with a previously determined y_{k-j} is needed, and these results have to be added. This MAC operation has to be performed in each index point i = (k, j)^T of the index space I = {i | 0 ≤ k < K ∧ 1 ≤ j < J}.
4.1. Data Dependencies

First, the data dependencies for the variable y have to be determined. These data dependencies are described by a dependence vector d: the relation i_2 = i_1 + d gives the distance between the index point i_1, where the variable is produced (source), and the index point i_2, where the variable is used (destination). For the variable y three scopes have to be distinguished:

Calculation of y = y^c: The calculation of the y values can be realized with increasing or decreasing index j. Thus two data dependencies d_{y^c} ∈ {(0,1)^T, (0,-1)^T} are possible. These are illustrated with the dark blue arrows in Figure 2.

Propagation of the y values: With the index k - j of the y^p values that have to be propagated, we obtain two data dependencies d_{y^p} ∈ {(1,1)^T, (-1,-1)^T}. A further analysis shows that only the data dependency d_{y^p} = (1,1)^T can be applied: if d_{y^p} = (-1,-1)^T were used, y_{k+1} would have to be calculated before y_k, which is in contradiction to the calculation of y_{k+1}, where y_k is needed. In Figure 2 these data dependencies are drawn as dark red arrows.

Transfer of the y values: The calculation of the y_k^c values can be finished in the index point i = (k, 1)^T or i = (k, J-1)^T, depending on d_{y^c}. These results have to be transferred to the starting point of the propagation, i = (k+1, 1)^T. We obtain two data dependencies d_{y^t} ∈ {(1,0)^T, (1,-J+2)^T}, depending on d_{y^c}. In Figure 2 both possible data dependencies are shown for y_1 with the thick cyan arrows.

In the following the data dependency d_{y^t} = (1,0)^T is used, which implies that d_{y^c} = (0,-1)^T is needed. Additionally, we have the data dependency d_a of the independent variable a, which is set to d_a = (1,0)^T. Thus we obtain the following description of the IIR filter as UREs [7]:

$$y^c(i) = y^c(i - d_{y^c}) + a(i)\cdot y^p(i), \quad i \in I, \qquad (2)$$
$$y^t(i) = y^c(i - d_{y^t}), \quad i \in I^t, \qquad (3)$$
$$y^p(i) = y^p(i - d_{y^p}), \quad i \in I^p, \qquad (4)$$
$$a(i) = a(i - d_a), \quad i \in I, \qquad (5)$$

with I^p, I^t ⊂ I, I^p = {i | 0 ≤ k < K ∧ 2 ≤ j < J}, and I^t = {i | 0 ≤ k < K ∧ j = 1}.
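The following C sketch (our own illustration; zero initial states are assumed and the FIR part y^b is omitted) evaluates the UREs sequentially over the index space and shows how the three scopes interact:

```c
/* Sequential evaluation of UREs (2)-(4) over I = {(k,j) | 0<=k<K, 1<=j<J}
 * with d_yc = (0,-1)^T, d_yp = (1,1)^T, d_yt = (1,0)^T.
 * yp[k][j] carries y_{k-j}; yc[k][j] accumulates with decreasing j. */
void eval_ures(int K, int J, const double *a, double yc[K][J], double yp[K][J])
{
    for (int k = 0; k < K; k++) {
        /* transfer (3) at j = 1: fetch the result finished one step before */
        yp[k][1] = (k > 0) ? yc[k - 1][1] : 0.0;          /* = y_{k-1} */
        /* propagation (4) on I^p: carry older values along d_yp = (1,1)^T */
        for (int j = 2; j < J; j++)
            yp[k][j] = (k > 0) ? yp[k - 1][j - 1] : 0.0;  /* = y_{k-j} */
        /* calculation (2): accumulate along d_yc = (0,-1)^T */
        yc[k][J - 1] = a[J - 1] * yp[k][J - 1];
        for (int j = J - 2; j >= 1; j--)
            yc[k][j] = yc[k][j + 1] + a[j] * yp[k][j];
        /* yc[k][1] now holds y_k^a */
    }
}
```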
4.2. Space-Time Transformation

The space-time (ST) transformation determines when (time step) and where (processor element) the calculation of an index point i is executed. Generally, the ST transformation [8, 9] for an n-dimensional index space is described by

$$i = R\,r = S\,t + L\,p, \quad R = (S, L), \quad r = \binom{t}{p}. \qquad (6)$$

The matrix R = (S, L) describes a coordinate transformation with S ∈ Z^{n×(n-m)} and L ∈ Z^{n×m}. The new coordinates consist of the time t ∈ Z^{n-m} and the processor p ∈ Z^m for the calculation of each instance of the UREs. For the IIR filter algorithm the time is t ∈ Z and the processor is p ∈ Z.
In the processor array design the constraint ∀d ∈ D: t_d > 0, d = S t_d + L p_d (D being the set of dependence vectors d of the UREs) ensures causality. This means that all data needed to evaluate an equation of the UREs are available at the evaluation time. This constraint is needed for the data dependencies d_{y^c} and d_{y^t}.

If a variable is independent, which means its data is only propagated through the index space, the causality constraint can be relaxed to ∀d_i ∈ D_i ⊂ D: t_{d_i} ≥ 0, d_i = S t_{d_i} + L p_{d_i}. The variables a and y^p are independent, thus D_i = {d_a, d_{y^p}} can be applied.
Various solutions can be found for the ST transformation; only the two solutions with minimal execution time t_max shall be discussed in the following. Their most important parameters can be found in Table 1.
              ST-1             ST-2
R             (1 1; 0 1)       (0 1; -1 1)
t_max         J + K - 2        J + K - 2
p_max         J - 1            K
r_{d,y^c}     (1, -1)^T        (1, 0)^T
r_{d,y^t}     (1, 0)^T         (1, 1)^T
r_{d,y^p}     (0, 1)^T         (0, 1)^T
r_{d,a}       (1, 0)^T         (1, 1)^T

Table 1. Space-Time Transformation for IIR Filter (R given row-wise)
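As a quick check (our own worked example, not from the paper), the ST-1 entries of Table 1 satisfy the decomposition d = S t_d + L p_d with S = (1, 0)^T and L = (1, 1)^T:

$$d_{y^c} = \begin{pmatrix}0\\-1\end{pmatrix} = 1\cdot\begin{pmatrix}1\\0\end{pmatrix} - 1\cdot\begin{pmatrix}1\\1\end{pmatrix} \Rightarrow r_{d,y^c} = \begin{pmatrix}1\\-1\end{pmatrix}, \qquad d_{y^p} = \begin{pmatrix}1\\1\end{pmatrix} = 0\cdot\begin{pmatrix}1\\0\end{pmatrix} + 1\cdot\begin{pmatrix}1\\1\end{pmatrix} \Rightarrow r_{d,y^p} = \begin{pmatrix}0\\1\end{pmatrix}.$$

The zero time component t_{d,y^p} = 0 is admissible because y^p is an independent variable.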
Solution ST-1 has two advantages compared to solution ST-2. The processor array is smaller (p_max = J - 1) and each processor element (PE) is used frequently (K times); only at the beginning and at the end some PEs are idle. The control of the data transfer resulting from the data dependency d_{y^t} is needed only for PE p = 1. In Figure 3 this ST transformation (ST-1) is illustrated graphically. Because of the data channels running in opposite directions for r_{d,y^c} and r_{d,y^p}, the ST-transformed index space cannot be partitioned later. In Section 4.3.1 the solution ST-1 shall be explored further for its applicability on the M5 architecture.

The potential for partitioning is the main advantage of solution ST-2 (see Table 1), since the realization of the algorithm is not efficient directly after the ST transformation: the number of PEs is high (p_max = K), but only J - 1 PEs are active. In Section 4.3.2 possible improvements with partitioning shall be discussed. The processor array and the execution order are illustrated in Figure 3.

In contrast to ST-1, where only PE p = 1 has to realize the data transfer, in ST-2 each PE (p = k) has to perform this task once. In Figure 3 the transfer control is realized with the multiplexers between two PEs.
4.3. Adaptation to the M5 Architecture

Both solutions of the ST transformation shall be adapted to a SIMD architecture. To compare the solutions, the degree of parallelism DOP(t) [10] shall be used as a measure for the parallelism. DOP(t) specifies the number of PEs used in a cycle t, which depends on the cycle. From the DOP, the average parallelism and the maximum parallelism

$$AP = \frac{1}{t_{max}}\sum_{t=0}^{t_{max}-1} DOP(t), \qquad MP = \max_{0\le t < t_{max}} DOP(t)$$

can be determined, which are measures for the entire algorithm. The sum of all DOP(t) equals the number of index points i of the index space I.
4.3.1. Limitation for Solution ST-1

The ST-transformed index space of ST-1 (see Table 1) cannot be partitioned; the reason is the data paths running in opposite directions. In Figure 3 (R-1) the ST-transformed index space is pictured as a parallelogram. The coordinate axes are the time t and the processor element p. For two y values the data transfer in the index space is illustrated, where the three scopes calculation, transfer, and propagation are distinguished. The processor array is drawn in simplified form.

From the representation of the ST-transformed index space, MP = J - 1 can be determined. On an architecture with more than J - 1 PEs, some PEs are not needed for the execution of the algorithm. The average parallelism is lower because of the starting and finishing phases, where some PEs are idle:

$$AP = \frac{(J-1)\cdot K}{K+J-2} = \frac{(J-1)\cdot K}{K\cdot\left(1+\frac{J-2}{K}\right)} \approx J-1 \quad \text{for } K \gg J-1,$$

i.e., the AP is nearly the MP. For our M5 architecture with 16 slices this realization is most effective for J = 17. If the filter has more than J - 1 = 16 weighting factors a_j, the algorithm cannot be executed on the M5 architecture with the R-1 realization.
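This behavior can be reproduced numerically. The following small C program (our own illustration, using the ST-1 mapping t = k - j, p = j implied by Table 1) counts DOP(t) over all time steps and prints AP and MP:

```c
#include <stdio.h>

int main(void)
{
    int J = 17, K = 1000, t_max = J + K - 2;
    long sum = 0;                                 /* sum of DOP(t) = |I| */
    int mp = 0;
    for (int t = -(J - 1); t <= K - 2; t++) {     /* t_max time steps */
        int dop = 0;
        for (int p = 1; p < J; p++)               /* PE p handles j = p */
            if (t + p >= 0 && t + p < K) dop++;   /* k = t + p inside I */
        sum += dop;
        if (dop > mp) mp = dop;
    }
    printf("AP = %.2f, MP = %d\n", (double)sum / t_max, mp);
    /* prints "AP = 15.76, MP = 16" for J = 17, K = 1000 */
    return 0;
}
```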
Fig. 3. Design flow on the IIR filter (ST-transformed index spaces for ST-1 and ST-2 over time t and processor p, and the derived realizations R-1 and R-2: linear arrays of PEs with weights a_1, ..., a_{J-1}, initial values y_{-1}, ..., y_{-J+1}, FIR inputs y_k^b, and results y_k; R-2 shown partitioned over κ^P)
4.3.2. Partitioning for Solution ST-2

The solution ST-2 with a processor array of K PEs cannot be used directly for the M5 architecture; a partitioning of the processor array is needed. For this the locally parallel, globally sequential (LPGS) partitioning [11, 12] shall be used, which preserves the data locality of the full-size processor array. The LPGS partitioning for an architecture with 16 PEs (slices) can be described as follows:

$$i = S\,t + L\,p = S\,t + L\,(\kappa^P + \Theta^P \hat{\kappa}^P) = \binom{0}{-1} t + \binom{1}{1}\left(\kappa^P + 16\,\hat{\kappa}^P\right) \qquad (7)$$

with Θ^P = ϑ^P and ϑ^P = 16. For the variables κ^P and κ̂^P we have 0 ≤ κ^P < ϑ^P, 0 ≤ κ̂^P < ⌈K/ϑ^P⌉, and κ^P, κ̂^P ∈ Z. In Figure 3 (R-2) the partitioning is illustrated, whereby the parameter ϑ^P was reduced to ϑ^P = 4.

The partitions κ̂^P have to be processed serially. If we wait for the last execution of a partition before the calculation of the next partition is started (see Figure 4, R-2a), the average parallelism

$$AP = \frac{(J-1)\,K}{(J+14)\left\lceil \frac{K}{\vartheta^P}\right\rceil} \approx \vartheta^P\,\frac{J-1}{J+14}$$

is quite low. At maximum, we obtain an AP of 8.94 for J = 20 and ϑ^P = 16, which is around half of the value that can be achieved. If we use the idle slices at the end of a partition κ̂^P for the first calculations of the next partition κ̂^P + 1, the utilization can be improved. That is possible only if the execution time t_p for the calculation of the index point i is greater than or equal to the execution time t of the full-size processor array (t_p(i) ≥ t(i)).
Fig. 4. Execution of Partitions (R-2a: sequential; R-2b: overlapped, ϑ^P ≥ J - 1; R-2c: overlapped, ϑ^P < J - 1)
For the IIR filter the number of weighting factors a_j limits the degree of overlapping of the partitions. If J - 1 < ϑ^P, the PEs have to be idle for ϑ^P - J + 1 calculations between the execution of partitions (see Figure 4, R-2b). The AP for such IIR filters is the same as for realization R-1 (AP ≈ J - 1).

All PEs will be active if J - 1 ≥ ϑ^P. In Figure 4, picture R-2c illustrates this processing. For the AP we obtain

$$AP = \frac{(J-1)\cdot K}{(J-1)\left\lceil\frac{K}{\vartheta^P}\right\rceil + (K-1)\bmod \vartheta^P} \approx \vartheta^P \quad \text{for } J-1 \ll K,$$

where the execution time t_max is determined by the number of partitions ⌈K/ϑ^P⌉, the filter size J - 1, and the non-overlapping last piece (K - 1) mod ϑ^P.
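The loop structure of the LPGS partitioning can be sketched as follows (our own illustration, not code from the paper; mac(k, j) is a hypothetical stand-in for the MAC operation of index point i = (k, j)^T, and the realization shown is the sequential R-2a under our reading of ST-2, where p = k and t = k - j):

```c
#define THETA_P 16                         /* slices per partition */

extern void mac(int k, int j);             /* hypothetical PE operation */

void lpgs_r2a(int K, int J)
{
    int n_part = (K + THETA_P - 1) / THETA_P;        /* ceil(K / theta) */
    for (int kh = 0; kh < n_part; kh++) {            /* kappa_hat, serial */
        int k0 = THETA_P * kh;
        /* local time window of length J + THETA_P - 2 (= J + 14 here) */
        for (int t = k0 - J + 1; t <= k0 + THETA_P - 2; t++) {
            for (int kp = 0; kp < THETA_P; kp++) {   /* slices in lockstep */
                int k = k0 + kp, j = k - t;
                if (k < K && j >= 1 && j < J)
                    mac(k, j);                       /* else slice is idle */
            }
        }
    }
}
```

Counting the active mac() calls per time step reproduces the AP formula above; the overlapped realizations R-2b and R-2c start the next partition's window before the current one has fully drained.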
5. IMPLEMENTATION ISSUES

The M5-DSP features an architecture where the data is processed in data vectors x_16 of 16 elements.
For realization R-1 the weighting factors a remain the same in a slice while performing the filtering, meaning the weighting data vector a_16 does not change during the calculation. The y^p value is the same for all slices in a cycle t. The value y_{k-j} must be written to all elements of the vector y^p_16 with a broadcast()¹ instruction. The value y_{k-j}, which was calculated in the first slice p = 1 one cycle before, is needed for the multiplication in the same slice. The element y_{k-j} = y^c_{0,16} has to be selected from vector y^c_16 (instruction select(0)) and stored in the memory. For the next calculation the elements of vector y^c_16 have to be shifted one position onward and element y^c_{15,16} is set to zero (shift1WRight()). This instruction is also known as Zurich Zip.

¹ In this paper the instructions are written in typewriter letters. An instruction describes only its functionality.
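The following C sketch mimics these three instructions on a 16-element vector (our own model with hypothetical names; the real M5 instructions operate on slice registers). Partial sums move toward element 0, where select(0) extracts the finished result, while a fresh zero enters at element 15:

```c
#define W 16                               /* vector width = slices */

typedef struct { double e[W]; } vec16;

/* broadcast(): write one value to all elements of a vector */
static vec16 broadcast(double v)
{
    vec16 r;
    for (int i = 0; i < W; i++) r.e[i] = v;
    return r;
}

/* select(0): extract element 0 (the finished result y_k) */
static double select0(const vec16 *v)
{
    return v->e[0];
}

/* shift1WRight() ("Zurich Zip"): move partial sums one slice onward
 * and start a fresh (zero) partial sum in element 15 */
static void shift1WRight(vec16 *v)
{
    for (int i = 0; i < W - 1; i++) v->e[i] = v->e[i + 1];
    v->e[W - 1] = 0.0;
}
```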
For all three realizations based on R-2, the elements of data vector a_16 have to be shifted to the next higher position (shift1WLeft()) and a new value has to be written to position a_{0,16}. In the following, the instructions for the other data vector manipulations shall be considered.

In the sequential realization (R-2a) the calculation of a y value is done in one PE; the data vector y^c_16 remains unchanged. At the beginning of the calculation of a new partition all elements are y^c_{i,16} = 0. To realize the triangular end of the realization we input 0 into a_16 for the last calculations of a partition. Hence, we eventually obtain a result vector y^c_16 with all results of the partition. This vector can be written completely to the memory (selectall()), as already provided for by the architecture. With the broadcast() instruction the y^p value has to be written into vector y^p_16. If the value is not needed in a PE, the factor a_i = 0 ensures that the multiplication yields zero.
The overlapping realization R-2b again requires the broadcast() of the y^p value into vector y^p_16. The calculation of a y value is also realized in one PE, but the result y_k has to be read out directly (at t = k + J - 1) from position k mod ϑ^P of the data vector y^c_16 (select(k%16)). On this position the starting value y^c_{k+16} = 0 for the next calculation y_{k+16} has to be written (input(0,k%16)).

            R-1              R-2a                 R-2b             R-2c
J           2, ..., 17       2, ..., +∞           2, ..., 17       18, ..., +∞
D           ≈ J - 1          ≈ 16·(J-1)/(J+14)    ≈ J - 1          ≈ 16
Data Vector Manipulations and Result Extraction
a_16        -                shift1WLeft()        shift1WLeft()    shift1WLeft()
y^p_16      broadcast()      broadcast()          broadcast()      broadcPart(i)
y^c_16      shift1WRight()   -                    input(0,k%16)    input(0,k%16)
y_k         select(0)        selectall()          select(k%16)     select(k%16)

Table 2. Parameters of the Realizations, ϑ^P = 16
In realization R-2c the result again has to be read out directly from the data vector y^c_16 (select(k%16)), and on this position the starting value (y^c_{k+16} = 0) must be written (input(0,k%16)). Differently from R-2b, the simple broadcast() instruction for the y^p broadcast cannot be used: with ϑ^P < J - 1, two different y values are needed in the PEs depending on the processed partition, as picture R-2c in Figure 4 illustrates. We call the instruction that realizes this partial broadcast broadcPart(i); it writes y^1 into y^p_{0,16} = ... = y^p_{i-1,16} and y^2 into y^p_{i,16} = ... = y^p_{15,16}.
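A sketch of this partial broadcast (our own model of the broadcPart(i) functionality, reusing the vec16 type from the R-1 sketch above):

```c
/* broadcPart(i): elements 0..i-1 receive y1, elements i..15 receive y2,
 * matching the definition above, so the two partitions overlapping in
 * the vector are served in one SIMD cycle. */
static vec16 broadcPart(int i, double y1, double y2)
{
    vec16 r;
    for (int n = 0; n < 16; n++)
        r.e[n] = (n < i) ? y1 : y2;   /* partial-broadcast boundary at i */
    return r;
}
```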
6. CONCLUSION

Four different realizations of an IIR filter on a parallel DSP were presented in this paper. Their main parameters are summarized in Table 2. For IIR filters with a number of weighting factors (filter taps) less than or equal to the number of PEs (slices), the realizations R-1 and R-2b yield the highest performance. Of those two, realization R-1 requires less functionality for data transfers.

For more weighting factors than slices (J - 1 > ϑ^P), solution R-2c is faster than R-2a with its sequential execution of the partitions. However, solution R-2c requires complex data transfer instructions where subsets of slices are controlled independently, hence increasing the control effort. Depending on the application, a tradeoff has to be found between performance and efficiency.
7. REFERENCES

[1] T. Richter, W. Drescher, F. Engel, S. Kobayashi, V. Nikolajevic, and G. Fettweis, "A Platform-Based Highly Parallel Digital Signal Processor," in Proceedings of CICC, 2001.

[2] J. Teich, A Compiler for Application Specific Processor Arrays, Ph.D. thesis, Verlag Shaker, Aachen, Germany, 1993.

[3] V. Roychowdhury, L. Thiele, S.K. Rao, and T. Kailath, "On the localisation of algorithms for VLSI processor arrays," VLSI Signal Processing III, pp. 459-470, 1989.

[4] U. Eckhardt and R. Merker, "Hierarchical algorithm partitioning at system level for an improved utilization of memory structures," IEEE Transactions on CAD, vol. 18, no. 1, pp. 14-24, Jan. 2000.

[5] J. Rosseel, F. Catthoor, and H. De Man, "An optimisation methodology for array mapping of affine recurrence equations in video and image processing applications," in Proc. Conf. on Appl.-Spec. Array Proc., Aug. 1994.

[6] M. Weiss, F. Engel, and G. P. Fettweis, "A New Scalable DSP Architecture for System on Chip (SOC) Domain," in Proceedings of ICASSP'99, Phoenix, AZ, April 1999, vol. 4, pp. 1945-1948.

[7] R.M. Karp, R.E. Miller, and S. Winograd, "The organization of computations for uniform recurrence equations," J. ACM, vol. 14, no. 3, July 1967.

[8] S.K. Rao, Regular Iterative Algorithms and their Implementations on Processor Arrays, Ph.D. thesis, Stanford University, 1985.

[9] U. Eckhardt, Algorithmus-Architektur-Codesign für den Entwurf digitaler Systeme mit eingebettetem Prozessorarray und Speicherhierarchie, Ph.D. thesis, Dresden University of Technology, Germany, June 2001.

[10] K. Hwang, Advanced Computer Architecture: Parallelism, Scalability, Programmability, McGraw-Hill, New York, 1993.

[11] J. Teich and L. Thiele, "Partitioning of processor arrays: A piecewise regular approach," INTEGRATION: The VLSI Journal, vol. 14, no. 3, pp. 297-332, 1993.

[12] U. Eckhardt and R. Merker, "Co-partitioning - a method for hardware/software codesign for scalable systolic arrays," in Reconfigurable Architectures, R. Hartenstein and V. Prasanna, Eds., pp. 131-138, IT Press, Chicago, 1997.
