Automatic Performance Tuning in the UHFFT Library
Dragan Mirković and Lennart Johnsson
University of Houston
Texas Learning and Computation Center
ICCS 2001, May 28, 2001

Introduction
• The FFT is one of the most popular algorithms in science and technology.
• Applications:
  – Digital signal processing
  – Image processing
  – Solution of partial differential equations
• Strong motivation for the development of highly optimized implementations.

Introduction
• Difficulties:
  – Growing complexity of modern microprocessors:
    – Deep memory hierarchies
    – Out-of-order execution
    – Instruction-level parallelism
  – Inherent properties of FFT algorithms:
    – Unfavorable data access patterns (large 2^n strides)
    – Recursive nature of the algorithm
    – High efficiency of the algorithm (O(n log n)), hence a low floating-point to load/store ratio
    – Imbalance between additions and multiplications

Solutions
• Adaptive FFT libraries:
  – Code automatically adapts to the architecture.
  – Needs degrees of freedom at the algorithm level.
• Adaptability levels:
  – Run-time:
    – Change of the size of the transform (CWP)
    – Dynamic construction of the algorithm (FFTW, SPIRAL, UHFFT)
  – Installation:
    – Automatic selection of optimal compiler options (UHFFT)
    – Adaptive generation of the FFT library modules (UHFFT)

Overview
• Background
• Performance tuning methodology in UHFFT:
  – Low-level optimization: code generation
  – High-level optimization: execution plan generation
• UHFFT architecture
• Performance examples

Fast Fourier Transform
• Application of W_n requires O(n^2) operations.
• Fast algorithms ⇔ sparse factorizations of W_n:
    W_n = A_1 A_2 ⋯ A_k,
  where the A_i are sparse (O(n) operations each) and k = log n.
• Application of W_n can then be evaluated in O(n log n) operations.
• The sparse factorization is not unique; there are many different ways to factor the same W_n.

Notation
• The tensor (Kronecker) product is a simple way to describe highly structured matrices:
    A ⊗ B = [ a_{0,0}B     a_{0,1}B     …  a_{0,n−1}B
              a_{1,0}B     a_{1,1}B     …  a_{1,n−1}B
              …
              a_{n−1,0}B   a_{n−1,1}B   …  a_{n−1,n−1}B ]
  where A is the scaling matrix and B is the blocking matrix.
• Direct sum notation:
    A ⊕ B = [ A  0
              0  B ]

UHFFT Factorizations
• Rader's algorithm
• Prime Factor Algorithm (PFA)
• Split-Radix (SR)
• Mixed-Radix Cooley-Tukey (MR)

Mixed-Radix Splitting
When n = rq, W_n can be written as
    W_n = (W_r ⊗ I_q) D_{r,q} (I_r ⊗ W_q) Π_{n,r},
where D_{r,q} is a diagonal twiddle-factor matrix,
    D_{r,q} = I_q ⊕ Ω_{n,q} ⊕ Ω_{n,q}^2 ⊕ … ⊕ Ω_{n,q}^{r−1},
    Ω_{n,q} = 1 ⊕ ω_n ⊕ ω_n^2 ⊕ … ⊕ ω_n^{q−1},
and Π_{n,r} is a mod-r sort permutation matrix. (A small C sketch of this splitting step follows the Prime Factor Algorithm slide below.)

Prime Factor Algorithm
When n = rq and gcd(r, q) = 1, the number of operations can be reduced by using the PFA (no twiddle-factor multiplication):
    W_n = Π_1 (W_r ⊗ I_q)(I_r ⊗ W_q) Π_2 = Π_1 (W_r ⊗ W_q) Π_2,
where Π_1 and Π_2 are permutation matrices.
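The mixed-radix identity above maps directly to code. The following is a minimal C sketch, not UHFFT's generated code: the function names dft and mixed_radix_step are invented for illustration, and a naive O(n^2) DFT stands in for the W_r and W_q blocks. It applies one splitting step W_n = (W_r ⊗ I_q) D_{r,q} (I_r ⊗ W_q) Π_{n,r} to a complex vector.

#include <complex.h>
#include <math.h>
#include <stdlib.h>

/* Naive O(n^2) DFT with input/output strides, standing in for the W_r and W_q blocks. */
static void dft(const double complex *x, int xs, double complex *y, int ys, int n)
{
    const double pi = 3.14159265358979323846;
    for (int k = 0; k < n; k++) {
        double complex s = 0;
        for (int j = 0; j < n; j++)
            s += x[j * xs] * cexp(-2.0 * pi * I * j * k / n);
        y[k * ys] = s;
    }
}

/* One splitting step W_n = (W_r (x) I_q) D_{r,q} (I_r (x) W_q) P_{n,r}, with n = r*q.
 * Illustrative sketch only; a real library factors the sub-transforms recursively
 * and precomputes the twiddle factors. */
void mixed_radix_step(const double complex *x, double complex *y, int n, int r)
{
    const double pi = 3.14159265358979323846;
    int q = n / r;
    double complex *t = malloc((size_t)n * sizeof *t);

    /* P_{n,r} and I_r (x) W_q: a size-q DFT of every mod-r residue class of the input */
    for (int l = 0; l < r; l++)
        dft(x + l, r, t + l * q, 1, q);

    /* D_{r,q}: multiply entry (l, b) by the twiddle factor omega_n^(l*b) */
    for (int l = 0; l < r; l++)
        for (int b = 0; b < q; b++)
            t[l * q + b] *= cexp(-2.0 * pi * I * l * b / n);

    /* W_r (x) I_q: size-r DFTs across the residue classes, with stride q */
    for (int b = 0; b < q; b++)
        dft(t + b, q, y + b, q, r);

    free(t);
}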
Split-Radix Algorithm
• When n = 2^k, this is the most efficient algorithm.
• Two levels of radix-2 splitting:
    W_n = (W_2 ⊗ I_{n/2}) D_{2,n/2} (I_2 ⊗ W_{n/2}) Π_{n,2}.
• When n = 2q = 4p we can write
    W_n = B_SR (W_q ⊕ W_p ⊕ W_p) Π_{n,q,2},
    B_SR = B_a B_m,
    B_a = (W_2 ⊗ I_q)[I_q ⊕ (W̃_2 ⊗ I_p)],
    B_m = I_q ⊕ Ω_{n,p} ⊕ Ω_{n,p}^3,
    W̃_2 = (1 ⊕ −i) W_2,
    Π_{n,q,2} = (I_q ⊕ Π_{q,2}) Π_{n,2}.

Rader's Algorithm
For n prime, Rader's algorithm reduces the size-n FFT to size n − 1:
    W_n = Q_{n,r}^T [ 1, 1^T ; 1, C_{n−1} ] Q_{n,r⁻¹} = Q_{n,r}^T [ 1, 1^T ; 1, S_{n−1} ] Q_{n,r},
where 1 is a vector of all ones, C_{n−1} and S_{n−1} are circulant and skew-circulant matrices, and Q_{n,r}, Q_{n,r⁻¹} are permutations. Both C_{n−1} and S_{n−1} can be diagonalized by W_{n−1}.

UHFFT Performance Tuning Methodology
[Diagram: at installation time, the code generator takes system specifics and user options as input parameters and produces the library of FFT modules and a performance database; at run time, initialization takes the transform size, dimension, etc. as input parameters and selects the best plan, and execution calculates one or more FFTs.]

Code Generation
• Algorithm abstraction
• Optimization
• Generation of a DAG
• Scheduling of instructions
• Unparsing

Code Generator
• Basic structure is an Expression:
  – Constant, variable, sum, product, sign change, …
• Basic functions:
  – Expression sum, product, assign, sign change, …
• Derived structures:
  – Expression vectors, matrices and lists
• Higher-level functions:
  – Matrix-vector operations
  – FFT-specific operations
(A sketch of such an expression node is given after the last Code Generation slide below.)

Example: Mixed-Radix Splitting
The equation
    W_n = (W_r ⊗ I_m) D_{r,m} (I_r ⊗ W_m) Π_{n,r}
is implemented as:

/*
 * FFTMixedRadix()  Mixed-radix splitting.
 * Input:
 *   r         radix,
 *   dir, rot  direction and rotation of the transform,
 *   u         input expression vector.
 */
ExprVec *FFTMixedRadix(int r, int dir, int rot, ExprVec *u)
{
    int m, n = u->n, *p;

    m = n/r;
    p = ModRSortPermutation(n, r);
    u = FFTxI(r, m, dir, rot,
            TwiddleMult(r, m, dir, rot,
                IxFFT(r, m, dir, rot,
                    PermuteExprVec(u, p))));
    free(p);
    return u;
}

Code Generation
• The code generator is written in C:
  – Speed, portability, and installation tuning.
• Generates FFT codelets of arbitrary size, direction, and rotation.
• Algorithms used:
  – Rader (two versions)
  – Split-radix
  – Mixed-radix
  – PFA
• Produces highly optimized straight-line C code.
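To make the code-generator slides more concrete, here is a minimal C sketch of an expression node of the kind described on the Code Generator slide (constants, variables, sums, products, and sign changes). The type and function names are hypothetical and are not UHFFT's actual identifiers; building the transform symbolically from such nodes yields the DAG that is then optimized, scheduled, and unparsed.

#include <stdlib.h>

/* Hypothetical sketch of the Expression abstraction; the real UHFFT types may differ. */
typedef enum { EXPR_CONST, EXPR_VAR, EXPR_SUM, EXPR_PROD, EXPR_NEG } ExprKind;

typedef struct Expr {
    ExprKind kind;
    double value;        /* EXPR_CONST: the numeric constant                   */
    int var;             /* EXPR_VAR: index of an input or temporary variable  */
    struct Expr *op[2];  /* operands for EXPR_SUM/EXPR_PROD, op[0] for EXPR_NEG */
} Expr;

static Expr *expr_new(ExprKind k)
{
    Expr *e = calloc(1, sizeof *e);
    e->kind = k;
    return e;
}

/* Basic functions: expression sum, product and sign change. */
Expr *expr_sum(Expr *a, Expr *b)  { Expr *e = expr_new(EXPR_SUM);  e->op[0] = a; e->op[1] = b; return e; }
Expr *expr_prod(Expr *a, Expr *b) { Expr *e = expr_new(EXPR_PROD); e->op[0] = a; e->op[1] = b; return e; }
Expr *expr_neg(Expr *a)           { Expr *e = expr_new(EXPR_NEG);  e->op[0] = a; return e; }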
Factorization Logic for FFT Code Generation

    if n <= 2, use the DFT
    else if n is prime, use Rader's algorithm
    else {
        choose a factor r of n
        if r and n/r are coprime, use the PFA
        else if n is divisible by r^2 and n > r^3, use the split-radix algorithm
        else use the mixed-radix algorithm
    }

Representation of Factorization: Example n = 6
[Factorization tree:]
    FFTPrimeFactor  n = 6, r = 3, dir = Forward, rot = 1
        FFTRader    n = 3, r = 2, dir = Forward, rot = 1
            DFT     n = 2, r = 2, dir = Forward, rot = 1
            DFT     n = 2, r = 2, dir = Inverse, rot = 1
        FFTRader    n = 3, r = 2, dir = Forward, rot = 1
            DFT     n = 2, r = 2, dir = Forward, rot = 1
            DFT     n = 2, r = 2, dir = Inverse, rot = 1
        DFT         n = 2, r = 2, dir = Forward, rot = 1
        DFT         n = 2, r = 2, dir = Forward, rot = 1
        DFT         n = 2, r = 2, dir = Forward, rot = 1

UHFFT Codelet Operation Counts
[Table of operation counts for the generated codelets; not reproduced in the transcription.]

UHFFT Architecture
[Block diagram. The UHFFT Library comprises the Library of FFT Modules, Initialization Routines, Execution Routines, and Utilities. The FFT Code Generator comprises the Initializer (algorithm abstraction: Mixed-Radix (Cooley-Tukey), Prime Factor Algorithm, Split-Radix Algorithm), Optimizer, Scheduler, and Unparser. The key distinguishes fixed library code from generated code.]

Performance Modeling
• Analytic models
• Cache influence on library codes
• Performance-measuring tools (PCL, PAPI)
• Prediction of composed-code performance
• Updated from execution experience
• Database:
  – Library codes: recorded at installation time
  – Composed codes: recorded and updated for each execution
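The "select best plan" step fed by this performance database can be illustrated with a small empirical-search sketch. This is an assumption-laden illustration, not UHFFT's API: the Plan type, PlanExec callback, and pick_best_plan function are invented for the example and correspond to the exhaustive search option mentioned on the execution-plan-generation slide that follows; a recursive or estimate-based search would replace the full enumeration with timings of subproblems or a model lookup.

#include <complex.h>
#include <time.h>

/* Hypothetical plan descriptor and executor type; UHFFT's real interfaces differ. */
typedef struct { int nfactors; int factors[8]; } Plan;
typedef void (*PlanExec)(const Plan *p, double complex *x, int n);

/* Exhaustive search sketch: time every candidate plan on the actual problem size,
 * keep the fastest, and report its per-call time for the performance database. */
int pick_best_plan(const Plan *cand, int ncand, PlanExec run,
                   double complex *x, int n, double *best_seconds)
{
    int best = 0;
    double best_t = 1e300;
    for (int i = 0; i < ncand; i++) {
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int rep = 0; rep < 10; rep++)   /* repeat to reduce timer noise */
            run(&cand[i], x, n);
        clock_gettime(CLOCK_MONOTONIC, &t1);
        double t = (double)(t1.tv_sec - t0.tv_sec) + 1e-9 * (t1.tv_nsec - t0.tv_nsec);
        if (t < best_t) { best_t = t; best = i; }
    }
    if (best_seconds) *best_seconds = best_t / 10.0;
    return best;
}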
UHFFT Performance
Processor characteristics for target hardware architectures:
[Table of processor characteristics; not reproduced in the transcription.]

[Performance plots: radix-16 forward FFT on the SGI R10000 (250 MHz), HP PA-8200 (180 MHz), IBM RS/6000 (120 MHz), and Intel Pentium II (400 MHz).]

The SGI R10000
• Two-way set-associative caches at both levels.
• Level one: 32 KB; thrashing occurs when …
• Level two: 4 MB; thrashing occurs when …

Intel Pentium 4, 1.5 GHz
[Plot: n = 8 codelet performance.]

Intel Pentium 4, 1.5 GHz
[Plot: n = 16 codelet performance.]

Execution Plan Generation
• Algorithms used by different libraries:
  – Rader (FFTW, UHFFT)
  – Mixed-radix (FFTW, SPIRAL, UHFFT)
  – Split-radix (UHFFT)
  – PFA (UHFFT)
• Optimal-plan search options:
  – Exhaustive
  – Recursive
  – Estimate

IBM Power3 222 MHz Performance, n = 16 (MR Plan)

    Rank  Plan     CPU Time (s)  MFLOPS  Relative
    1     16       1.18E-06      272.07  1.000
    2     2 8      3.68E-06       87.07  0.320
    3     4 4      4.95E-06       64.65  0.238
    4     8 2      6.84E-06       46.76  0.172
    5     2 2 4    7.16E-06       44.70  0.164
    6     2 4 2    9.55E-06       33.52  0.123
    7     4 2 2    1.05E-05       30.61  0.113
    8     2 2 2 2  1.25E-05       25.60  0.094

IBM Power3 222 MHz Performance, n = 16 (MR Plan)
[Bar chart of MFLOPS (0 to 350) for the plans 16, 2·8, 4·4, 8·2, 2·2·4, 2·4·2, 4·2·2, and 2·2·2·2.]

IBM Power3 222 MHz Performance, n = 2520 (PFA Plan)
[Bar chart of MFLOPS (roughly 350 to 430) for 24 PFA plans, i.e., the different orderings of the factors 5, 7, 8, and 9.]

IBM Power 3, 200 MHz (800 Mflops peak)
[Performance plot.]

IBM Power 3, 200 MHz, power-of-2 sizes (800 Mflops peak)
[Performance plot.]

IBM Power 3, 200 MHz, PFA sizes (800 Mflops peak)
[Performance plot.]

Related Efforts
• Old:
  – FFTPACK
  – CMSSL
  – CWP
• New (current):
  – FFTW
  – SPIRAL

Conclusions
• The adaptive approach works very well for the FFT.
• Straight-line code is very efficient but difficult to write.
• Solution: automatic code generation.
• A code generator used for one problem can easily be extended to other problems and languages.