T Ishihara
Transcription
T Ishihara
High performance compu0ng of canonical high Reynolds number turbulent flows Takashi Ishihara (Nagoya University, JST CREST) Koji Morishita (Nagoya University, JST CREST) Yukio Kaneda (Nagoya University) Table of contents • Introduc0on to the DNS of Canonical Turbulence • High Performance Compu0ng of Canonical high Re turbulence – Box turbulence – Turbulent channel flow – (Turbulent boundary layer) Japanese supercomputers : ES, ES2, FX1, (K-‐computer) • Summary Turbulence phenomena http://hinode.nao.ac.jp/Movies/ (1) http://www.kcc.zaq.ne.jp/tsubosango/TS49.html (2) http://sekitori.web.infoseek.co.jp/Houses/ie_Bahrain_Gas_pip.html (3) http://n700.jp/know/index.html Equa0ons of mo0on • Navier Stokes Equa0ons ∂u 1 + (u ⋅∇)u = −∇p + Δu + f ∂t Re ∇ ⋅u = 0 – Determinis0c – Nonlinear – Unpredictable – Non-‐equilibrium – Dissipa0ve – ... u : velocity, p : pressure Re = UL ν Universality • Kolmogorov’s idea (1941) K41 – High Re – At small scale • <ε >, ν – η=(ν 3/<ε>)1/4 Kolmogorov length scale – E(k)=(<ε>ν 5) 1/4 φ(kη) – Iner0al subrange (1/L<<k<<1/η): E(k) = C<ε>2/3k-‐5/3 100 100 (4.4) (4.5) 101 102 103 Intermiaency 104 Rλ Small-scale turbulence 347 Figure 7. Flatness factor Fstatistics of theoflongitudinal Small-scale statistics velocity of turbulence gradient ∂u/ et al JFM (2007) at100t = tF Ishihara in Series 1 and 2, respecti (!10)0 and circles (") are the values 10 R = 1131 675 (a) (b) t = tF /2 to RtF= 429 in Run 2048 the time-averages of (R 732λ , −S) over t from ζ = ∂u/∂x ζ = ∂u/∂x 471 268 # , $ , + are from figure 6 in Sreenivasan & Antonia (1997), % symbols 257 173 –2 –2 10 10 94 167 and & from Gotoh et al. (2002); the crosses (+) are from experiments DNS. Larger Rλ 10–4 10–4 F 10 2 λ σP (ζ) λ 1 Series 1 Series 2 Kerr (1985) Jimenez et al. (1992) Wang et al. (1996) Gotoh et al. (2002) (4.4) (4.5) –6 A close inspection shows that 10the DNS data of Series 2 are sys those of Series 1, and a least-square fitting of the data of Serie –8 –8 10 10 10 10 0 –30 –20 –10 20 30 –3010 –20 10 –10 100 10 20 10 30 gives 10 10 R 0 0 10 10 (1.14 ∓ 0.19)R 0.34±0.03 . F ∼ RFigure = 675 7. Flatness factor F of the longitudinal velocity gradient 675 vs. R . Filled trian λ R =∂u/∂x (c) (d ) 429 429 and circles (") are the values at t =ζt =in large circles de (!)268 ωxSeries 1 and 2, respectively; 268 ζ = ∂v/∂x over t from t = t /2 to t in Run 2048-2 and Run 4096-2. the173 time-averages ofvery (R , −S)–2different 173power Both (4.5) and (4.6) are not from this law 10–2 10 % from Wang et al. (19 symbols 94 #, $, + are from figure 6 in Sreenivasan & Antonia (1997), 94 10–6 0 0 1 2 3 4 5 λ λ λ λ F F λ F & from Gotoh et al. the (2002); the crosses (+)more are from experiments, and andnumber Increasing the Reynolds makes turbulence intermittent. the others are f σP (ζ) 4.4. Fourth-order moments of velocity gradie –4 10 A close inspection shows that the DNS data of Series 2 are systematically larger t In homogeneous those isotropic turbulence, any fourth-order mo of Series 1, and a least-square fitting of the data of Series 2 to lnF = α + βl 10–4 10–6 DNS. gives 10–6 Direct Numerical Simula0on of turbulence • To explore the universality of turbulence in small scales • To understand the nature of high Re turbulence The direct numerical simula0on (DNS) of the incompressible Navier-‐Stokes equa0ons is very useful. A Huge Degree of Freedom The ra0o of the size of the largest eddy to the size of the smallest eddy L L η ~ Re 3/ 4 Cf. Kolmogorov’s theory 3 ⎛ L ⎞ DOF ~ ⎜ ⎟ ~ Re9 / 4 ⎝ η ⎠ η Computa0onal Cost ~ Re 3 Development of supercomputer K Computer 10.51PFlopes x1000 Re 10years hap://www.top500.org/lists/2011/06/performance_development Earth Simulator 35.86TFlopes 10years x10 Challenge for canonical turbulence Example:Turbulent Boundary Layer Shinkansen (1) (1) http://kojii.cocolog-nifty.com/blog/cat5249145/index.html (2) http://aofg.blogs.com/the_airing_of_grievances/2005/05/fallujah_sandst.html (3) http://fdrc.iit.edu/research/images/TurbulentBoundaryLayer.jpg Sand storms (2) • Recent DNS achieves high Reynolds numbers • TBL in nature is at high Re and has a huge degree of freedom • It is beaer to perform the large-‐scale DNS to understand a common feature of high Re turbulence • Thus the canonical turbulence should be sa0sfied the following condi0ons – Boundary condi0ons … as simple as possible – Reynolds number … as high as possible (3) Experiments Direct Numerical Simula0on (DNS) of canonical turbulence • Solve incompressible NS equa0ons – No model, no numerical viscosity • Resolve not only large scales but also small scales ( LES) • Simple geometry • High accuracy, high resolu0on and high precision – e.g. Spectral method • To avoid extra uncertain0es • High performance (and many steps) Computer resource is finite, We have to sacrifice some of these. – Reynolds number as high as possible To explore Universality of Turbulence To understand the nature of high Re turbulence To consider appropriate model (LES etc) of high Re turbulence Canonical turbulence • Box turbulence ßDynamics – forced/decay – rotation, shear, stratified • Wall bounded flows – Channel – TBL – Pipe • • • • • • • Turbulent mixing layer Jet Flow past body sphere/cylinder Turbulent diffusion/mixing/transport Turbulent combustion MHD turbulence Quantum turbulence Box turbulence is the simplest -‐-‐-‐ like 100-‐meter sprint in athle0cs Simple rule, Interna0onal Box Turbulence History of representative DNS Homogeneous Isotropic Turbulence under periodic BC Challenge for higher Re Jimenez et al.(1993) 10000 Number of grid points 20483 40963 Caltech Delta machine 1283 Kerr(1985) 1000 643 5123 VPP Cray-1S NCAR 1 Siggia(1981) Cray-1 NCAR 100 2 4 3 Yokokawa, et al(2002) Earth Simulator 10243 Ishihara&Kaneda (2001), Gotoh&Fukayama (2001) VPP5000/56NUCC Yamamoto et al(1994)NWT 10 323 Vincent&Meneguzzi (1991) Orszag(1969) 2403 IBM Model 360-95 Gordon Bell 1 1960 1970 1980 1990 Year 2000 2010 DNS of Box Turbulence • Fourier spectral method !! $ ! 2 # + ! k ! c(k) & û(k) = P(k) ! (u ! ! )(k) " !t % k ⋅ uˆ (k ) = 0 k ⎧c (k < 2.5) c(k ) = ⎨ ⎩0 otherwise • 4th order Runge-‐Kuaa method u! ! ! (k) ! P(k)!u " ! (k) Removal of Alias error uˆ (k ) = 0 for k > ( 2 / 3) N uˆ (k ), ik × uˆ (k ) uˆ '(k ) = eik ⋅Δuˆ (k ), ik × uˆ '(k ) -1 FFT u(x),! (x) u(x + !),! (x + !) FFT u(x) ! ! (x), 6 Δ = (π / N )(1,1,1) u! ! ! (k), u! ! ! '(k) u! !! (alias free) 3 u(x + ") ! ! (x + ") 1 % ! "ik#$ ! ( = 'u ! ! + e u ! ! '* for k < ( 2 / 3)N ) 2& 3D-FFT by domain decomposition Fourier Space Physical Space uˆk1, k 2,k3 u j 1 , j 2 , j3 FFTx V ( N+1, N, N/nd ) FFTy FFTz Transposition V ( N+1, N, N/nd ) V ( N+1, N/nd, N) FFTx FFT FFTy MPI MPI MPI FFTz Configuration of the Earth Simulator (ES1) • Peak performance/AP : 8Gflops • Peak performance/PN : 64Gflops • Shared memory/PN : 16GB • Total number of PNs : 640 • Total peak performance : 40Tflops • Total main memory : 10TB 3D-FFT by domain decomposition Fourier Space Physical Space uˆk1, k 2,k3 u j 1 , j 2 , j3 FFTx V ( N+1, N, N/nd ) FFTx Microtasking Vector processing MPI FFTy FFTz Transposition V ( N+1, N, N/nd ) V ( N+1, N/nd, N) Vector processing Vector processing FFT FFTy Microtasking MPI MPI FFTz Microtasking Performance Yokokawa et al (2002) for vector than for scalar processors, because IF statements in do loops in number of floating-point operations due to masking operations. Since ther the subroutines for calculating 3D-FFTs, the number measured by the hard calculation of sustained performance. Method (b) is based on the fact that the number of operations in a 1D-F the number of grid points N is3 N(5p 8.5q), is represented by 3 where 3 256 5123 + 1024 20483 N4096 N the value of N[5]. Considering that the DNS code has 72 real 3D-FFTs Ø Series 1(kmaxη =1) 167 257 471 732 1131 number of operations over the measurement range by hand, we R obtain the λ Ø Series 2(kmaxη =2) 94 173 268 429 675 the number of operations: Two series of DNS data Number of opera0ons per one step 459N 3 log2 N + (288 + 16π)N 3 , if p = For N=4096 459N 3 log2 N + (369 + 16π)N 3 , if p = 1step = 4.0x1014 18100 of steps (series 1)+ 15400 steps (series ) =33500steps = 1.3x1019 Results these expressions can be 2used as reference values on the numb (~2.1T) 278 hours x 512 nodes @13.4TFlops sustained(~2.4T) performance of code. = 11.6days Tables 1 and 2 show results obtained by methods (a) and (b), respective For N=8192 1step = 3.5x1015 the numbers of grid points and PNs used in each run, respectively. The num 36200 steps (series 1)+ 30800 steps (series 2) =67000steps = 2.3x1020 27days @100TFlops 5.4days @500TFlops 8 Rλ=3000 : 1020 Flop (Orszag) ANRV365-FL41-10 3.5 ARI a 3.0 12 November 2008 14:55 Iner0al sub-‐range k5/3E(k)/ ε 2/3 Rλ = 167 257 471 732 1131 2.5 b ∏(k)/ ε 10 Rλ = 167 257 471 732 1131 (δu Lr)2 /( ε r)2/3 – (δu Lr)3 /( ε r) Rλ = 167 257 471 732 1131 Rλ = 167 257 471 732 1131 2.0 1 1.5 1.0 165-180. Downloaded from arjournals.annualreviews.org nly at all four sites on 06/02/09. For personal use only. 0.5 0 0.001 0.01 kη 0.1 1 0.1 10 100 r/η 1000 Figure 3 (a) Compensated energy spectrum k5/3 E(k)/!ε"2/3 (solid colored lines) and energy-flux spectrum "(k)/!ε"(dashed colored lines) versus kη by direct numerical simulation at various Rλ ’s. The solid gray line is the approximation (Equation 8) with Ko measured in the run at Rλ = 1131. The dashed gray horizontal line indicates "(k)/!ε" = 1.0. (b) Normalized second- and third-order longitudinal structure functions !(δurL )2 "/(!ε" r)2/3 (solid colored lines) and −!(δurL )3 "/(!ε" r) (dashed colored lines) versus r/η. The solid gray line is the approximation (Equation 7) with C0 and Cf measured in the run at Rλ = 1131. The dashed gray horizontal line indicates −!(δurL )3 "/(!ε" r) = 4/5. close inspection also reveals that the slope is slightly tilted, suggesting that E(k) ∝ k−5/3+µ with Ishihara, Gotoh, Kaneda al. (2009) µ ∼ −0.1 (Kaneda et al. 2003). Equation 1 implies that the energy spectrum is universal not only in the inertial subrange, but also at higher wave numbers. In the near-dissipation range kη ∼ 1, DNS data analyses support 2563 Rλ=94 L/η=65 Ishihara et al JFM (2007) 40963 Rλ=1131 L/η=2137 Annu. Rev. Fluid Mech. 2009. 41 There are sharp boundaries between cluster regions and void regions. Configuration of ES2 x(1/4) • Peak performance/CPU : 102.4Gflops • Total number of PNs : 160 • Peak performance/PN : 819.2Gflops • Total number of CPUs : 1280 • Shared memory/PN : 128GB • Total peak performance : 131Tflops • CPUs/PN : 8 • Total main memory : 20TB x3.275 x13.2 Vector Processer + Fat-‐Tree interconnect The data transfer from one node to another node needs two steps. 3D-FFT by domain decomposition Fourier Space Physical Space uˆk1, k 2,k3 u j 1 , j 2 , j3 FFTx V ( N+1, N, N/nd ) FFTx Microtasking Vector processing MPI FFTy FFTz Transposition V ( N+1, N, N/nd ) V ( N+1, N/nd, N) Vector processing Vector processing FFT FFTy Microtasking MPI MPI FFTz Microtasking Box MHD(~3DFFT) on ES2 (by Kondo) • ES2 8nodes=8processes x 8 threads (Box MHD) – Peak :819.2Gflops(x8) = 6553.6GFlops – Sustain: 991.058GFlops(15.1%) – 1024x1024x1024 grid points • ES2 32nodes=32processes x 8 threads (Box MHD) – Peak :819.2Gflops(x32) = 26.21TFlops – Sustain: 1.893TFlops(7.22%) – 1024x1024x1024 grid points Note: the code has not been op0mized for ES2 • ES 64nodes=64processes x 8 threads (Box HD) – Peak: 64GFlops(x64) = 4.1TFlops – Sustain: 1.7TFlops(42%) – 1024x1024x1024 grid points FX1@nagoyaU SPARC64TM VII (~SPARC64TM VIIIfx on K computer) • 1 CPU = 40Gflops (= 4 cores x 10 Gflops/core) • 1 node = 1 CPU (40Gflops) / 32 GB • 768 nodes =768 CPUs (30.7Tflops) / 24 TB Interconnec0on Network: Fat tree ~ ES2 Scalar Processer + Fat-‐Tree interconnect 3D-FFT by domain decomposition Fourier Space Physical Space uˆk1, k 2,k3 u j 1 , j 2 , j3 FFTx V ( N+1, N, N/nd ) FFTx FFTy V ( N+1, N, N/nd ) V ( N+1, N/nd, N) Scalar processing Scalar processing FFT FFTy Scalar processing MPI FFTz Transposition MPI MPI FFTz 3DFFT(ds_v3drcf) on FX1@nagoyaU 30TFlops@768nodes • FX1 32nodes=32process x 4 threads – Peak: 1.28TFlops – Sustain: 0.0535TFlops(4.18%) – 2048x2048x2048 grid points • FX1 64nodes=64process x 4 threads – Peak: 2.56TFlops – Sustain: 0.1167TFlops(4.56%) – 2048x2048x2048 grid points hap://www.top500.org/lists/2011/11/press-‐release The meaning of 「京」(Kei) is 1016=10P Cf. 「千」(Sen)=103, 「万」(Man)=104 The data transfer from one node to the nearest neighbor node can be done at one step. Performance of K computer Rank Computer 1 Site Total Cores Rmax Rpeak Effeciency (Tflop/s) (Tflop/s) (%) Nmax RIKEN K computer, Advanced Institute for SPARC64 VIIIfx 2.0GHz, Computatio nal Science Tofu interconnect (AICS) 705024 Power (kW) Processor Speed Mflops/ Watt Processor (MHz) SPARC64 VIIIfx 8C 10510 11280 93.17 11870208 12659.7 830.182.00GHz 2000 Tofu (Torus fusion) Interconnect 10.51 PFlop/s on Linpack 93% (High efficiency) 830MFlops/Waa (Energy-‐efficient) 29.5 hours (High stability) 3D torus network 128GFlop/s 8 cores x 705024 = 11.28PFlop/s peak Nmax 11870208 ~ 1.1x106 PFlop HPC Challenge 2011 First place but the efficiency is less than 2% hap://www.4-‐traders.com/FUJITSU-‐6492460/news/FUJITSU-‐K-‐computer-‐No-‐1-‐in-‐Four-‐Benchmarks-‐at-‐HPC-‐Challenge-‐Awards-‐13893670/ Summary (box turbulence ) Achievement at present – HPC of DNS using a Fourier spectral method on ES • 20483 DNS at 16.4TFlops • 40963 DNS (at ~13TFlops) – Iner0al range at Rλ=1301 with kmaxη=1 Grand Challenge for K computer – HPC of DNS using a Fourier spectral method • 81923 DNS at 100TFlops (?) – HPC of DNS using a finite difference algorithm with a good Poisson solver • 122883 DNS at (?)TFlops Expected Results Real proper0es in the iner0al range of high Re turbulence Development of a new LES model Singularity forma0on in NS equa0ons Turbulent Channel Flow DNS method of Channel flow (KMM1987) Navier-Stokes equations ∂ ui ∂p 1 2 =− + Hi + ∇ ui ∂t ∂xi Reτ Equations ∂ 1 g = hg + ∇2 g ∂t Reτ ∂ 1 φ = hv + ∇2φ ∂t Reτ ∂ ui =0 ∂ xi BC u=v=w=0 at y = ± δ Normal direction y, v x, z : periodic Flow U(y) spanwise g = 0, v = 0, at y = ±δ ∂v =0 ∂y ∂H x ∂H z − , ∂z ∂x ∂ ⎛ ∂H x ∂H z ⎞ ⎛ ∂ 2 ∂ 2 ⎞ hv ≡ − + + ⎟ H y ⎜ ⎟ + ⎜ ∂y ⎝ ∂x ∂z ⎠ ⎝ ∂x 2 ∂z 2 ⎠ hg ≡ y = +δ z, w BC ∂u ∂w g ≡ ωy = − ∂z ∂x φ = ∇ 2v y = ‐δ Fourier Spectral x, u streamwise Reτ = uτ δ ν uτ : friction velocity ν : viscosity 2δ :wall distance Time Integra0on " ( x, y, z ) = !!"ˆ (k x , y, k z )ei ( k kx x x+kz z ) f NL = {hv , hg } kz ∂ 1 ψ = f NL + ∇ 2ψ ∂t Reτ Fourier Transform ∂ 1 ψˆ = fˆNL + ∂t Reτ 3step Runge Kutta+Implicit Euler for k =1, 2, 3 n + k3−1 ˆ Calculate : f NL 2 ψ = {φ , g} ⎡ ∂ 2 ⎤ 2 2 ⎢− (k x + k z ) + 2 ⎥ψˆ ∂y ⎦ ⎣ ⎧ 1 1 ⎫ ,1⎬ 3 2 ⎩ ⎭ α k = ⎨ , n + k3 ⎡ Reτ Reτ n d ψˆ n + k3−1 2 2 ⎤ n + k3 ˆ ˆ Solve : − ⎢ + (k x + k z )⎥ψ = − Reτ f NL − ψˆ 2 dy α k Δt ⎣α k Δt ⎦ given 1D Boundary Value Problem of 2nd order ODE Chebyshev-tau method History of the DNS of channel turbulence 出典:平行平板間乱流の大規模DNS 進展と今後の展開 河村洋 阿部浩幸 A series of DNS of turbulent channel K41 spectrum • Compensated one-‐dimensional longitudinal spectrum at the outer end of the log-‐law region K41 spectrum – K41 spectrum range becomes broader with the increase of the Reynolds number Turbulent channel flow Performance in TFlops • ES2 64node=64processes x 8 thread – Peak :819.2GFlops(x64)=52.43TFlops – Sustain: 5.9103TFlops(11.27%) – 1024x1536x1024 grid points • ES2 64node=64processes x 8 thread – Peak :819.2GFlops(x64)=52.43TFlops – Sustain: 6.0495TFlops(11.54%) – 2048x1536x2048 grid points Summary (Channel turbulence) Achievement at present – HPC of DNS using a Fourier Chebyshev-‐tau method on ES2 • 1024x1536x1024 DNS at 5.9TFlops • 2048x1534x2048 DNS at 6.0TFlops – An Iner0al range in a log-‐law region Grand Challenge using K comp. – HPC of DNS using a Fourier Chebyshev-‐tau method – HPC of DNS using a finite difference algorithm with a good Poisson solver Expected Results Rτ>10240 Iner0al proper0es in a log-‐law region Prac0cal modeling of wall bounded turbulence Drag reduc0on at high Re Turbulent boundary layer Method in Spalart&Wutmaff(1993) Fringe method TBL thickness flow Effective region Control TBL thickness by using external force in fringe region Fringe region Func0on expansion streamwise(x),spanwise(z)direc0on wall normal(y)direction Fourier Series O(N log N) Jacobi polynomials O(N2) cf. Skote et al(2002) use Chebyshev polynomials TBL • ES2 16node=128processes – Peak :102.4Gflops(x128)=13.107TFlops – Sustain: 6.041TFlops(46.1%) – 4608x512x768 grid points, 3072x340x512 modes TBL • FX1 32nodes=32process x 4thread – Peak :1.28TFlops – Sustain: 0.11TFlops(8.6%) – 720x128x192 grid points, 480x85x128 modes • FX1 64nodes=64process x 4thread – Peak :2.56TFlops – Sustain: 0.21TFlops(8.2%) – 1728x160x240 grid points, 1152x106x160 modes Summary • The efficiency of the 3D spectral code strongly depends on the network topology – Full crossbar > Fat tree > 3D torus • In order to efficiently use the K computer, we have to give up the spectral method(??)