Automatic Performance Tuning in the UHFFT Library
Dragan Mirković and Lennart Johnsson
University of Houston, Texas Learning and Computation Center
ICCS 2001, May 28, 2001
Introduction
• The FFT is one of the most popular algorithms in science and technology.
• Applications:
  • Digital signal processing
  • Image processing
  • Solution of partial differential equations
• Strong motivation for developing highly optimized implementations.
• Difficulties:
  • Growing complexity of modern microprocessors
    • Deep memory hierarchies
    • Out-of-order execution
    • Instruction-level parallelism
  • Inherent properties of FFT algorithms
    • Unfavorable data access patterns (large 2^n strides)
    • Recursive nature of the algorithm
    • High efficiency of the algorithm (O(n log n))
      • Low floating-point to load/store ratio
      • Imbalance between additions and multiplications
Solutions
• Adaptive FFT libraries
  • Code automatically adapts to the architecture
  • Requires a degree of freedom at the algorithm level
• Adaptability levels
  • Run-time
    • Change of the transform size (CWP)
    • Dynamic construction of the algorithm (FFTW, SPIRAL, UHFFT)
  • Installation
    • Automatic selection of optimal compiler options (UHFFT)
    • Adaptive generation of the FFT library modules (UHFFT)
Overview
• Background
• Performance tuning methodology in UHFFT
  • Low-level optimization: code generation
  • High-level optimization: execution plan generation
• UHFFT architecture
• Performance examples
Fast Fourier Transform
• Application of W_n requires O(n^2) operations.
• Fast algorithms ⇔ sparse factorizations of W_n:

    W_n = A_1 A_2 ⋯ A_k,

  where the A_i are sparse (O(n) operations each) and k = log(n).
• With such a factorization, W_n can be applied in O(n log n) operations (see the small example below).
• The sparse factorization is not unique; there are many different ways to factor the same W_n.
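For example (a small illustration, not from the original slides), with ω_4 = e^{−2πi/4} = −i the 4-point DFT matrix splits into two sparse factors, W_4 = A_1 A_2, where

    A_1 = [ 1   0   1   0
            0   1   0  -i
            1   0  -1   0
            0   1   0   i ],

    A_2 = [ 1   0   1   0
            1   0  -1   0
            0   1   0   1
            0   1   0  -1 ].

Each factor has only two nonzero entries per row, so applying it costs O(n); with k = log_2(n) such factors the total is O(n log n).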
Notation
• A simple way to describe highly structured matrices.
• Tensor (Kronecker) product (a small C sketch follows below):

    A ⊗ B = [ a_00 B       a_01 B       …   a_0,n-1 B
              a_10 B       a_11 B       …   a_1,n-1 B
              ⋮                              ⋮
              a_n-1,0 B    a_n-1,1 B    …   a_n-1,n-1 B ]

  A is the scaling matrix, B is the blocking matrix.
• Direct sum notation:

    A ⊕ B = [ A   0
              0   B ]
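A minimal C sketch of the Kronecker product (illustrative only, not part of UHFFT), applied to I_2 ⊗ W_2 with real entries:

#include <stdio.h>

/* Kronecker (tensor) product of an m x m matrix A with a p x p matrix B,
   stored row-major; C is (m*p) x (m*p). */
static void kron(int m, int p, const double *A, const double *B, double *C)
{
    int n = m * p;
    for (int i = 0; i < m; i++)
        for (int j = 0; j < m; j++)
            for (int k = 0; k < p; k++)
                for (int l = 0; l < p; l++)
                    C[(i * p + k) * n + (j * p + l)] = A[i * m + j] * B[k * p + l];
}

int main(void)
{
    /* I_2 (x) W_2: two independent size-2 DFTs on the block diagonal. */
    double I2[4] = {1, 0, 0, 1};
    double W2[4] = {1, 1, 1, -1};
    double C[16];
    kron(2, 2, I2, W2, C);
    for (int i = 0; i < 4; i++, puts(""))
        for (int j = 0; j < 4; j++)
            printf("%5.1f ", C[i * 4 + j]);
    return 0;
}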
UHFFT Factorizations
• Rader's Algorithm
• Prime Factor Algorithm (PFA)
• Split-Radix (SR)
• Mixed-Radix Cooley-Tukey (MR)
Mixed-Radix Splitting
When n = rq, W_n can be written as

    W_n = (W_r ⊗ I_q) D_{r,q} (I_r ⊗ W_q) Π_{n,r},

where D_{r,q} is a diagonal twiddle-factor matrix,

    D_{r,q} = I_q ⊕ Ω_{n,q} ⊕ ⋯ ⊕ Ω_{n,q}^{r-1},
    Ω_{n,q} = 1 ⊕ ω_n ⊕ ⋯ ⊕ ω_n^{q-1},

and Π_{n,r} is a mod-r sort permutation matrix. (An illustrative C sketch of this splitting follows below.)
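A minimal, unoptimized C sketch of this splitting (illustrative only; fft_mixed_radix and the naive dft base case are hypothetical names standing in for the generated codelets, and this is not the UHFFT implementation):

#include <complex.h>
#include <math.h>
#include <stdio.h>
#include <stdlib.h>

#define PI 3.14159265358979323846

/* Naive O(n^2) DFT used as the base case; dir = -1 forward, +1 inverse (unscaled). */
static void dft(int n, int dir, const double complex *in, double complex *out)
{
    for (int k = 0; k < n; k++) {
        double complex s = 0;
        for (int j = 0; j < n; j++)
            s += in[j] * cexp(dir * 2.0 * PI * I * j * k / n);
        out[k] = s;
    }
}

/* One mixed-radix splitting step for n = r*q:
   W_n = (W_r (x) I_q) D_{r,q} (I_r (x) W_q) Pi_{n,r}. */
static void fft_mixed_radix(int n, int r, int dir,
                            const double complex *in, double complex *out)
{
    int q = n / r;
    double complex *u = malloc(n * sizeof *u);

    /* Pi_{n,r} followed by (I_r (x) W_q): a length-q DFT of each residue
       class mod r, then the diagonal twiddle factors D_{r,q}. */
    for (int a = 0; a < r; a++) {
        double complex col[q], res[q];
        for (int b = 0; b < q; b++)
            col[b] = in[b * r + a];
        dft(q, dir, col, res);
        for (int b = 0; b < q; b++)            /* D_{r,q}: omega_n^(a*b) */
            u[a * q + b] = res[b] * cexp(dir * 2.0 * PI * I * a * b / n);
    }

    /* (W_r (x) I_q): a length-r DFT across the r blocks for each offset b. */
    for (int b = 0; b < q; b++) {
        double complex col[r], res[r];
        for (int a = 0; a < r; a++)
            col[a] = u[a * q + b];
        dft(r, dir, col, res);
        for (int a = 0; a < r; a++)
            out[a * q + b] = res[a];
    }
    free(u);
}

int main(void)
{
    double complex x[6] = {1, 2, 3, 4, 5, 6}, X[6];
    fft_mixed_radix(6, 3, -1, x, X);
    for (int k = 0; k < 6; k++)
        printf("X[%d] = %7.3f %+7.3fi\n", k, creal(X[k]), cimag(X[k]));
    return 0;
}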
Prime Factor Algorithm
When n = rq and gcd(r, q) = 1, the number of operations can be reduced by using the PFA (no twiddle-factor multiplication):

    W_n = Π_1 (W_r ⊗ I_q)(I_r ⊗ W_q) Π_2
        = Π_1 (W_r ⊗ W_q) Π_2,

where Π_1 and Π_2 are permutation matrices. (A toy illustration of the index mappings follows below.)
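A toy C illustration of the PFA index mappings for n = 6 = 3·2 (illustrative only, not UHFFT code); the inner sums are written naively, but note that no twiddle-factor multiplication appears:

#include <complex.h>
#include <math.h>
#include <stdio.h>

#define PI 3.14159265358979323846
#define R  3
#define Q  2
#define N  (R * Q)

int main(void)
{
    double complex x[N] = {1, 2, 3, 4, 5, 6}, X[N];

    for (int k1 = 0; k1 < R; k1++) {
        for (int k2 = 0; k2 < Q; k2++) {
            /* Output index k with k mod R == k1 and k mod Q == k2 (CRT map). */
            int k;
            for (k = 0; k < N; k++)
                if (k % R == k1 && k % Q == k2) break;

            double complex s = 0;
            for (int j1 = 0; j1 < R; j1++)
                for (int j2 = 0; j2 < Q; j2++) {
                    int j = (j1 * Q + j2 * R) % N;   /* input index map */
                    s += x[j] * cexp(-2.0 * PI * I *
                                     ((double)j1 * k1 / R + (double)j2 * k2 / Q));
                }
            X[k] = s;
        }
    }

    for (int k = 0; k < N; k++)
        printf("X[%d] = %6.2f %+6.2fi\n", k, creal(X[k]), cimag(X[k]));
    return 0;
}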
Split-Radix Algorithm
• When n = 2^k, this is the most efficient algorithm.
• Two levels of radix-2 splitting:

    W_n = (W_2 ⊗ I_{n/2}) D_{2,n/2} (I_2 ⊗ W_{n/2}) Π_{n,2}

• When n = 2q = 4p we can write

    W_n = B_SR (W_q ⊕ W_p ⊕ W_p) Π_{n,q,2},
    B_SR = B_a B_m,
    B_a = (W_2 ⊗ I_q) [I_q ⊕ (W̃_2 ⊗ I_p)],
    B_m = I_q ⊕ Ω_{n,p} ⊕ Ω_{n,p}^3,    W̃_2 = (1 ⊕ −i) W_2,
    Π_{n,q,2} = (I_q ⊕ Π_{q,2}) Π_{n,2}.
Rader’s Algorithm
For n prime, Rader's algorithm reduces the size-n FFT to size n−1:

    W_n = Q_{n,r}^T [ 1   1^T     ] Q_{n,r}^{-1}  =  Q_{n,r}^T [ 1   1^T     ] Q_{n,r},
                    [ 1   C_{n-1} ]                            [ 1   S_{n-1} ]

where 1 is a vector of all ones, C_{n-1} and S_{n-1} are circulant and skew-circulant matrices, and Q_{n,r}, Q_{n,r}^{-1} are permutations. Both C_{n-1} and S_{n-1} can be diagonalized by W_{n-1}. (A toy C illustration follows below.)
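A toy, self-contained C illustration of the idea (not UHFFT code; the prime p = 5 and its primitive root g = 2 are hard-coded, and the length-(p−1) cyclic convolution is done naively rather than with FFTs):

#include <complex.h>
#include <math.h>
#include <stdio.h>

#define PI 3.14159265358979323846
#define P  5
#define G  2   /* 2 is a primitive root mod 5 */

/* g^e mod p, with e taken mod (p-1) so negative exponents work. */
static int pow_mod(int g, int e, int p)
{
    int r = 1;
    e %= (p - 1);
    if (e < 0) e += p - 1;
    while (e--) r = (r * g) % p;
    return r;
}

int main(void)
{
    double complex x[P] = {1, 2, 3, 4, 5}, X[P];

    /* DC term: plain sum. */
    X[0] = 0;
    for (int j = 0; j < P; j++) X[0] += x[j];

    /* Reindex by the primitive root: a[q] = x[g^(-q)], b[q] = w^(g^q).
       Then X[g^m] = x[0] + (a cyclically convolved with b)[m]. */
    double complex a[P - 1], b[P - 1];
    for (int q = 0; q < P - 1; q++) {
        a[q] = x[pow_mod(G, -q, P)];
        b[q] = cexp(-2.0 * PI * I * pow_mod(G, q, P) / P);
    }
    for (int m = 0; m < P - 1; m++) {
        double complex s = x[0];
        /* Naive cyclic convolution; a real implementation would use
           FFTs of size P-1 here. */
        for (int q = 0; q < P - 1; q++)
            s += a[q] * b[(m - q + P - 1) % (P - 1)];
        X[pow_mod(G, m, P)] = s;
    }

    for (int k = 0; k < P; k++)
        printf("X[%d] = %7.3f %+7.3fi\n", k, creal(X[k]), cimag(X[k]));
    return 0;
}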
UHFFT Performance Tuning Methodology
[Diagram: At installation time, the UHFFT code generator takes input parameters (system specifics, user options) and produces the library of FFT modules. At run time, initialization takes the problem parameters (size, dimension, ...), consults the performance database, and selects the best plan; execution then calculates one or more FFTs using that plan.]
Code Generation
• Algorithm abstraction
• Optimization
• Generation of a DAG
• Scheduling of instructions
• Unparsing
Code generator
• The basic structure is an Expression (a hypothetical C sketch follows below)
  • Constant, variable, sum, product, sign change, ...
• Basic functions
  • Expression sum, product, assign, sign change, ...
• Derived structures
  • Expression vectors, matrices and lists
• Higher-level functions
  • Matrix-vector operations
  • FFT-specific operations
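A hypothetical C sketch of such an expression representation (these type and function names are illustrative, not the actual UHFFT definitions):

#include <stdlib.h>

typedef enum { EXPR_CONST, EXPR_VAR, EXPR_SUM, EXPR_PROD, EXPR_NEG } ExprKind;

typedef struct Expr {
    ExprKind kind;
    double value;            /* EXPR_CONST */
    const char *name;        /* EXPR_VAR   */
    struct Expr *lhs, *rhs;  /* EXPR_SUM, EXPR_PROD; rhs unused for EXPR_NEG */
} Expr;

static Expr *expr_new(ExprKind kind)
{
    Expr *e = calloc(1, sizeof *e);
    e->kind = kind;
    return e;
}

/* Build lhs + rhs, folding the constant case so generated code stays small. */
static Expr *expr_sum(Expr *lhs, Expr *rhs)
{
    if (lhs->kind == EXPR_CONST && rhs->kind == EXPR_CONST) {
        Expr *e = expr_new(EXPR_CONST);
        e->value = lhs->value + rhs->value;
        return e;
    }
    Expr *e = expr_new(EXPR_SUM);
    e->lhs = lhs;
    e->rhs = rhs;
    return e;
}

int main(void)
{
    Expr *x = expr_new(EXPR_VAR);   x->name = "x0";
    Expr *c = expr_new(EXPR_CONST); c->value = 0.5;
    Expr *s = expr_sum(x, c);       /* represents x0 + 0.5 */
    return s->kind == EXPR_SUM ? 0 : 1;
}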
Example: Mixed-radix splitting
The equation

    W_n = (W_r ⊗ I_m) D_{r,m} (I_r ⊗ W_m) Π_{n,r}

is implemented as:

/*
 * FFTMixedRadix()   Mixed-radix splitting.
 * Input:
 *   r        radix,
 *   dir, rot direction and rotation of the transform,
 *   u        input expression vector.
 */
ExprVec *FFTMixedRadix(int r, int dir, int rot, ExprVec *u)
{
    int m, n = u->n, *p;

    m = n/r;
    p = ModRSortPermutation(n, r);
    u = FFTxI(r, m, dir, rot,
              TwiddleMult(r, m, dir, rot,
                          IxFFT(r, m, dir, rot, PermuteExprVec(u, p))));
    free(p);
    return u;
}

Note how the nesting of the calls mirrors the right-to-left application of the factors in the equation: permutation, inner transforms, twiddle multiplication, outer transforms.
Code Generation
• The code generator is written in C
  • Speed, portability and installation tuning
• It generates FFT codelets of arbitrary size, direction, and rotation
• Algorithms used:
  • Rader (two versions)
  • Split-radix
  • Mixed-radix
  • PFA
• The output is highly optimized straight-line C code.
Factorization Logic for FFT Code Generation

if n <= 2, use the DFT
else if n is prime, use Rader's algorithm
else {
    choose a factor r of n
    if r and n/r are coprime, use the PFA
    else if n is divisible by r^2 and n > r^3, use the Split-Radix algorithm
    else use the Mixed-Radix algorithm
}
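A compilable sketch of this decision logic (hypothetical; the returned strings stand in for calls into the actual code-generation routines, and r is passed in rather than chosen internally):

#include <stdio.h>

static int is_prime(int n)
{
    for (int d = 2; d * d <= n; d++)
        if (n % d == 0) return 0;
    return n > 1;
}

static int gcd(int a, int b)
{
    while (b) { int t = a % b; a = b; b = t; }
    return a;
}

static const char *choose_algorithm(int n, int r)   /* r: chosen factor of n */
{
    if (n <= 2)                              return "DFT";
    if (is_prime(n))                         return "Rader";
    if (gcd(r, n / r) == 1)                  return "PFA";
    if (n % (r * r) == 0 && n > r * r * r)   return "Split-Radix";
    return "Mixed-Radix";
}

int main(void)
{
    printf("n=16, r=2 -> %s\n", choose_algorithm(16, 2));  /* Split-Radix */
    printf("n=6,  r=3 -> %s\n", choose_algorithm(6, 3));   /* PFA */
    printf("n=7,  r=7 -> %s\n", choose_algorithm(7, 7));   /* Rader */
    printf("n=8,  r=2 -> %s\n", choose_algorithm(8, 2));   /* Mixed-Radix: 8 > r^3 fails */
    return 0;
}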
Representation of Factorization
Example N = 6
FFTPrimeFactor n = 6, r = 3, dir = Forward, rot = 1
FFTRader n = 3, r = 2, dir = Forward, rot = 1
DFT n = 2, r = 2, dir = Forward, rot = 1
DFT n = 2, r = 2, dir = Inverse, rot = 1
FFTRader n = 3, r = 2, dir = Forward, rot = 1
DFT n = 2, r = 2, dir = Forward, rot = 1
DFT n = 2, r = 2, dir = Inverse, rot = 1
DFT n = 2, r = 2, dir = Forward, rot = 1
DFT n = 2, r = 2, dir = Forward, rot = 1
DFT n = 2, r = 2, dir = Forward, rot = 1
UHFFT Codelet Operation Counts
UHFFT Architecture
[Diagram: The UHFFT library consists of fixed library code (initialization routines, execution routines, utilities) and generated code (the library of FFT modules). The FFT code generator produces these modules via an initializer (algorithm abstraction) implementing the mixed-radix (Cooley-Tukey), prime factor, and split-radix algorithms, followed by an optimizer, a scheduler, and an unparser.]
Performance Modeling
• Analytic models
  • Cache influence on library codes
  • Performance measuring tools (PCL, PAPI)
  • Prediction of composed code performance
  • Updated from execution experience
• Database
  • Library codes: recorded at installation time
  • Composed codes: recorded and updated for each execution
UHFFT Performance
Processor characteristics for target hardware architectures: [table not reproduced in the transcription]
[Performance plots: Radix-16 forward FFT on the SGI R10000 (250 MHz), HP PA 8200 (180 MHz), IBM RS6000 (120 MHz), and Intel Pentium II (400 MHz); further plots not reproduced in the transcription.]
The SGI R10000
• Two-way set-associative caches at both levels.
• Level one: 32 KB; thrashing occurs when … (see the sketch below)
• Level two: 4 MB; thrashing occurs when …
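As an illustration of the kind of condition involved (the 32-byte line size and the stride value below are my assumptions, not from the slide), the following sketch maps a power-of-two strided access onto the sets of a 32 KB two-way cache; every element lands in the same set, so the two ways thrash:

#include <stdio.h>

#define CACHE_BYTES 32768
#define WAYS        2
#define LINE        32                       /* assumed line size */
#define SETS        (CACHE_BYTES / (WAYS * LINE))

int main(void)
{
    long stride = 4096;                      /* stride in doubles (2^12) */
    for (int i = 0; i < 8; i++) {
        long byte_addr = i * stride * (long)sizeof(double);
        printf("element %d -> set %ld\n", i, (byte_addr / LINE) % SETS);
    }
    return 0;
}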
Intel Pentium 4, 1.5 GHz
[Plots: n = 8 and n = 16 codelet performance; not reproduced in the transcription.]
Execution plan generation
• Algorithms used by different libraries:
  • Rader (FFTW, UHFFT)
  • Mixed-radix (FFTW, SPIRAL, UHFFT)
  • Split-radix (UHFFT)
  • PFA (UHFFT)
• Optimal plan search options (a toy sketch of the exhaustive option follows below):
  • Exhaustive
  • Recursive
  • Estimate
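A toy sketch of the exhaustive option (hypothetical names; codelet_cost stands in for a lookup of measured or estimated codelet times in the performance database, and the cost model is deliberately crude):

#include <stdio.h>

#define MAX_FACTORS 32

/* Toy stand-in for a performance-database lookup: estimated cost of one
   size-r codelet application (here a rough O(r^2) model). */
static double codelet_cost(int r) { return 0.5 * r * r; }

static double best_cost = 1e300;
static int best_plan[MAX_FACTORS], best_len;

/* Enumerate all ordered factorizations of n into radices >= 2 and keep the
   one with the smallest estimated cost. */
static void search(int n, int total, int *plan, int len, double cost)
{
    if (n == 1) {
        if (cost < best_cost) {
            best_cost = cost;
            best_len = len;
            for (int i = 0; i < len; i++) best_plan[i] = plan[i];
        }
        return;
    }
    for (int r = 2; r <= n; r++) {
        if (n % r != 0) continue;
        plan[len] = r;
        /* One radix-r stage applies total/r codelets of size r. */
        search(n / r, total, plan, len + 1, cost + (total / r) * codelet_cost(r));
    }
}

int main(void)
{
    int plan[MAX_FACTORS];
    search(64, 64, plan, 0, 0.0);
    printf("best plan for n = 64:");
    for (int i = 0; i < best_len; i++) printf(" %d", best_plan[i]);
    printf("  (estimated cost %.1f)\n", best_cost);
    return 0;
}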
IBM Power3 222 MHz Performance
n = 16 (MR Plan)
Rank   Plan      CPU Time (s)   MFLOPS   Relative
1      16        1.18E-06       272.07   1.000
2      28        3.68E-06        87.07   0.320
3      44        4.95E-06        64.65   0.238
4      82        6.84E-06        46.76   0.172
5      224       7.16E-06        44.70   0.164
6      242       9.55E-06        33.52   0.123
7      422       1.05E-05        30.61   0.113
8      2222      1.25E-05        25.60   0.094

(The digits of each plan give the sequence of radices used.)
[Bar chart of the MFLOPS rates for the n = 16 plans in the table above.]
IBM Power3 222 MHz Performance
for n = 2520 (PFA Plan)
[Bar chart: MFLOPS for different orderings of the PFA factors (5, 7, 8, 9) of n = 2520.]
IBM Power 3, 200 MHz (800 Mflops peak)
[Performance plots: overall performance, power-of-2 sizes, and PFA sizes; not reproduced in the transcription.]
Related Efforts
• Old:
• FFTPACK
• CMSSL
• CWP
• New (Current):
• FFTW
• SPIRAL
Conclusions
• The adaptive approach works very well for the FFT.
• Straight-line code is very efficient but difficult to write.
  • Solution: automatic code generation.
• A code generator used for one problem can easily be extended to other problems and languages.
