T Ishihara

Transcription

T Ishihara
High performance compu0ng of canonical high Reynolds number turbulent flows
Takashi Ishihara (Nagoya University, JST CREST) Koji Morishita (Nagoya University, JST CREST) Yukio Kaneda (Nagoya University)
Table of contents
•  Introduc0on to the DNS of Canonical Turbulence •  High Performance Compu0ng of Canonical high Re turbulence –  Box turbulence –  Turbulent channel flow –  (Turbulent boundary layer) Japanese supercomputers : ES, ES2, FX1, (K-­‐computer) •  Summary Turbulence phenomena
http://hinode.nao.ac.jp/Movies/
(1) http://www.kcc.zaq.ne.jp/tsubosango/TS49.html
(2) http://sekitori.web.infoseek.co.jp/Houses/ie_Bahrain_Gas_pip.html
(3) http://n700.jp/know/index.html
Equa0ons of mo0on
•  Navier Stokes Equa0ons ∂u
1
+ (u ⋅∇)u = −∇p +
Δu + f
∂t
Re
∇ ⋅u = 0
–  Determinis0c –  Nonlinear –  Unpredictable –  Non-­‐equilibrium –  Dissipa0ve –  ... u : velocity, p : pressure
Re =
UL
ν
Universality
•  Kolmogorov’s idea (1941) K41 –  High Re –  At small scale •  <ε >, ν
–  η=(ν 3/<ε>)1/4 Kolmogorov length scale –  E(k)=(<ε>ν 5) 1/4 φ(kη) –  Iner0al subrange (1/L<<k<<1/η): E(k) = C<ε>2/3k-­‐5/3 100
100
(4.4)
(4.5)
101
102
103
Intermiaency
104
Rλ
Small-scale
turbulence
347
Figure 7. Flatness
factor Fstatistics
of theoflongitudinal
Small-scale statistics velocity
of turbulence gradient ∂u/
et al JFM (2007)
at100t = tF Ishihara in Series
1 and
2, respecti
(!10)0 and circles (") are the values
10
R = 1131
675
(a)
(b)
t = tF /2 to RtF= 429
in Run 2048
the time-averages
of
(R
732λ , −S) over t from
ζ = ∂u/∂x
ζ = ∂u/∂x
471
268
#
,
$
,
+
are
from
figure
6
in
Sreenivasan
&
Antonia
(1997), %
symbols
257
173
–2
–2
10
10
94
167
and & from Gotoh et al. (2002); the crosses (+) are from experiments
DNS.
Larger Rλ
10–4
10–4
F 10
2
λ
σP (ζ)
λ
1
Series 1
Series 2
Kerr (1985)
Jimenez et al. (1992)
Wang et al. (1996)
Gotoh et al. (2002)
(4.4)
(4.5)
–6
A close inspection shows that 10the
DNS data of Series 2 are sys
those
of Series 1, and a least-square
fitting of the data of Serie
–8
–8
10
10
10
10
0
–30 –20 –10
20
30
–3010 –20 10 –10 100
10
20 10 30
gives
10
10
R
0
0
10
10 (1.14 ∓ 0.19)R 0.34±0.03 .
F
∼
RFigure
= 675 7. Flatness factor F of the longitudinal velocity gradient
675 vs. R . Filled trian
λ R =∂u/∂x
(c)
(d )
429
429
and circles (") are the values at t =ζt =in
large circles de
(!)268
ωxSeries 1 and 2, respectively;
268
ζ = ∂v/∂x
over t from t = t /2
to t in Run
2048-2
and Run 4096-2.
the173
time-averages
ofvery
(R , −S)–2different
173power
Both
(4.5) and (4.6)
are
not
from
this
law
10–2
10
% from Wang et al. (19
symbols
94 #, $, + are from figure 6 in Sreenivasan & Antonia (1997), 94
10–6
0
0
1
2
3
4
5
λ
λ
λ
λ
F
F
λ
F
& from Gotoh
et al. the
(2002);
the crosses (+)more
are from
experiments, and
andnumber
Increasing the Reynolds
makes
turbulence
intermittent.
the others are f
σP (ζ)
4.4. Fourth-order
moments of velocity gradie
–4
10
A close inspection shows that the DNS data of Series 2 are systematically larger t
In homogeneous those
isotropic
turbulence, any fourth-order mo
of Series 1, and a least-square fitting of the data of Series 2 to lnF = α + βl
10–4
10–6
DNS.
gives
10–6
Direct Numerical Simula0on of turbulence
•  To explore the universality of turbulence in small scales •  To understand the nature of high Re turbulence The direct numerical simula0on (DNS) of the incompressible Navier-­‐Stokes equa0ons is very useful. A Huge Degree of Freedom
The ra0o of the size of the largest eddy to the size of the smallest eddy
L
L
η
~ Re
3/ 4
Cf. Kolmogorov’s theory
3
⎛ L ⎞
DOF ~ ⎜ ⎟ ~ Re9 / 4
⎝ η ⎠
η
Computa0onal Cost
~ Re
3
Development of supercomputer
K Computer 10.51PFlopes x1000
Re
10years
hap://www.top500.org/lists/2011/06/performance_development
Earth Simulator 35.86TFlopes 10years
x10
Challenge for canonical turbulence
Example:Turbulent Boundary Layer
Shinkansen (1)
(1) http://kojii.cocolog-nifty.com/blog/cat5249145/index.html (2) http://aofg.blogs.com/the_airing_of_grievances/2005/05/fallujah_sandst.html
(3) http://fdrc.iit.edu/research/images/TurbulentBoundaryLayer.jpg
Sand storms
(2)
•  Recent DNS achieves high Reynolds numbers •  TBL in nature is at high Re and has a huge degree of freedom •  It is beaer to perform the large-­‐scale DNS to understand a common feature of high Re turbulence •  Thus the canonical turbulence should be sa0sfied the following condi0ons –  Boundary condi0ons … as simple as possible –  Reynolds number … as high as possible (3)
Experiments
Direct Numerical Simula0on (DNS) of canonical turbulence •  Solve incompressible NS equa0ons –  No model, no numerical viscosity •  Resolve not only large scales but also small scales ( LES) •  Simple geometry •  High accuracy, high resolu0on and high precision –  e.g. Spectral method •  To avoid extra uncertain0es •  High performance (and many steps) Computer resource is finite, We have to sacrifice some of these. –  Reynolds number as high as possible To explore Universality of Turbulence To understand the nature of high Re turbulence To consider appropriate model (LES etc) of high Re turbulence Canonical turbulence
• 
Box turbulence ßDynamics
–  forced/decay
–  rotation, shear, stratified
• 
Wall bounded flows
–  Channel
–  TBL
–  Pipe
• 
• 
• 
• 
• 
• 
• 
Turbulent mixing layer
Jet Flow past body sphere/cylinder
Turbulent diffusion/mixing/transport
Turbulent combustion
MHD turbulence
Quantum turbulence
Box turbulence is the simplest -­‐-­‐-­‐ like 100-­‐meter sprint in athle0cs
Simple rule, Interna0onal
Box Turbulence History of representative DNS
Homogeneous Isotropic Turbulence under periodic BC
Challenge for higher Re
Jimenez et al.(1993)
10000 Number of grid points
20483 40963
Caltech Delta machine
1283
Kerr(1985)
1000 643
5123
VPP Cray-1S NCAR
1
Siggia(1981)
Cray-1 NCAR
100 2
4
3
Yokokawa, et al(2002) Earth Simulator 10243
Ishihara&Kaneda (2001),
Gotoh&Fukayama (2001)
VPP5000/56NUCC
Yamamoto et al(1994)NWT
10 323
Vincent&Meneguzzi (1991)
Orszag(1969)
2403
IBM Model 360-95
Gordon Bell
1 1960 1970 1980 1990 Year
2000 2010 DNS of Box Turbulence
•  Fourier spectral method !!
$
!
2
# + ! k ! c(k) & û(k) = P(k) ! (u ! ! )(k)
" !t
%
k ⋅ uˆ (k ) = 0
k
⎧c (k < 2.5)
c(k ) = ⎨
⎩0 otherwise
•  4th order Runge-­‐Kuaa method u!
! ! (k)
!
P(k)!u
" ! (k)
Removal of Alias error
uˆ (k ) = 0 for k > ( 2 / 3) N
uˆ (k ), ik × uˆ (k )
uˆ '(k ) = eik ⋅Δuˆ (k ), ik × uˆ '(k )
-1
FFT
u(x),! (x)
u(x + !),! (x + !)
FFT
u(x) ! ! (x),
6
Δ = (π / N )(1,1,1)
u!
! ! (k), u!
! ! '(k)
u!
!!
(alias free)
3
u(x + ") ! ! (x + ")
1 % ! "ik#$ ! (
= 'u ! ! + e u ! ! '* for k < ( 2 / 3)N
)
2&
3D-FFT by domain decomposition
Fourier Space
Physical Space
uˆk1, k 2,k3
u j 1 , j 2 , j3
FFTx
V ( N+1, N, N/nd )
FFTy
FFTz
Transposition
V ( N+1, N, N/nd )
V ( N+1, N/nd, N)
FFTx
FFT
FFTy
MPI
MPI
MPI
FFTz
Configuration of the Earth Simulator (ES1)
•  Peak performance/AP : 8Gflops
•  Peak performance/PN
: 64Gflops
•  Shared memory/PN
: 16GB
•  Total number of PNs
: 640
•  Total peak performance : 40Tflops
•  Total main memory : 10TB
3D-FFT by domain decomposition
Fourier Space
Physical Space
uˆk1, k 2,k3
u j 1 , j 2 , j3
FFTx
V ( N+1, N, N/nd )
FFTx
Microtasking
Vector processing
MPI
FFTy
FFTz
Transposition
V ( N+1, N, N/nd )
V ( N+1, N/nd, N)
Vector processing
Vector processing
FFT
FFTy
Microtasking
MPI
MPI
FFTz
Microtasking
Performance
Yokokawa et al (2002)
for vector than for scalar processors, because IF statements in do loops in
number of floating-point operations due to masking operations. Since ther
the subroutines for calculating 3D-FFTs, the number measured by the hard
calculation of sustained performance.
Method (b) is based on the fact that the number of operations in a 1D-F
the number of grid points N
is3 N(5p
8.5q),
is represented
by
3 where
3
256
5123 + 1024
20483 N4096
N
the value of N[5]. Considering
that the DNS code has 72 real 3D-FFTs
Ø Series 1(kmaxη =1) 167 257 471 732 1131
number of operations over the measurement range by hand, we R
obtain
the
λ
Ø Series 2(kmaxη =2) 94 173 268 429 675
the number of operations:
Two series of DNS data Number of opera0ons per one step 459N 3 log2 N + (288 + 16π)N 3 ,
if p =
For N=4096 459N 3 log2 N + (369 + 16π)N 3 ,
if p =
1step = 4.0x1014 18100 of
steps (series 1)+ 15400 steps (series ) =33500steps = 1.3x1019
Results
these
expressions
can
be 2used
as reference
values on the numb
(~2.1T) 278 hours x 512 nodes @13.4TFlops sustained(~2.4T)
performance of code.
= 11.6days
Tables
1 and 2 show results obtained by methods (a) and (b), respective
For N=8192 1step = 3.5x1015 the numbers of grid points and PNs used in each run, respectively.
The num
36200 steps (series 1)+ 30800 steps (series 2) =67000steps = 2.3x1020
27days @100TFlops 5.4days @500TFlops 8
Rλ=3000 : 1020 Flop (Orszag)
ANRV365-FL41-10
3.5
ARI
a
3.0
12 November 2008
14:55
Iner0al sub-­‐range k5/3E(k)/ ε
2/3
Rλ = 167
257
471
732
1131
2.5
b
∏(k)/ ε
10
Rλ = 167
257
471
732
1131
(δu Lr)2 /( ε r)2/3
– (δu Lr)3 /( ε r)
Rλ = 167
257
471
732
1131
Rλ = 167
257
471
732
1131
2.0
1
1.5
1.0
165-180. Downloaded from arjournals.annualreviews.org
nly at all four sites on 06/02/09. For personal use only.
0.5
0
0.001
0.01
kη
0.1
1
0.1
10
100
r/η
1000
Figure 3
(a) Compensated energy spectrum k5/3 E(k)/!ε"2/3 (solid colored lines) and energy-flux spectrum "(k)/!ε"(dashed colored lines) versus kη by
direct numerical simulation at various Rλ ’s. The solid gray line is the approximation (Equation 8) with Ko measured in the run at
Rλ = 1131. The dashed gray horizontal line indicates "(k)/!ε" = 1.0. (b) Normalized second- and third-order longitudinal structure
functions !(δurL )2 "/(!ε" r)2/3 (solid colored lines) and −!(δurL )3 "/(!ε" r) (dashed colored lines) versus r/η. The solid gray line is the
approximation (Equation 7) with C0 and Cf measured in the run at Rλ = 1131. The dashed gray horizontal line indicates
−!(δurL )3 "/(!ε" r) = 4/5.
close inspection also reveals that the slope is slightly tilted, suggesting that E(k) ∝ k−5/3+µ with
Ishihara, Gotoh, Kaneda al. (2009)
µ ∼ −0.1 (Kaneda et al. 2003).
Equation 1 implies that the energy spectrum is universal not only in the inertial subrange, but
also at higher wave numbers. In the near-dissipation range kη ∼ 1, DNS data analyses support
2563
Rλ=94
L/η=65
Ishihara et al JFM (2007)
40963
Rλ=1131
L/η=2137
Annu. Rev. Fluid Mech. 2009. 41
There are sharp boundaries between cluster regions and void regions.
Configuration of ES2
x(1/4)
•  Peak performance/CPU : 102.4Gflops •  Total number of PNs
: 160
•  Peak performance/PN : 819.2Gflops •  Total number of CPUs : 1280
•  Shared memory/PN
: 128GB
•  Total peak performance : 131Tflops
•  CPUs/PN : 8
•  Total main memory : 20TB
x3.275
x13.2
Vector Processer + Fat-­‐Tree interconnect
The data transfer from one node to another node needs two steps. 3D-FFT by domain decomposition
Fourier Space
Physical Space
uˆk1, k 2,k3
u j 1 , j 2 , j3
FFTx
V ( N+1, N, N/nd )
FFTx
Microtasking
Vector processing
MPI
FFTy
FFTz
Transposition
V ( N+1, N, N/nd )
V ( N+1, N/nd, N)
Vector processing
Vector processing
FFT
FFTy
Microtasking
MPI
MPI
FFTz
Microtasking
Box MHD(~3DFFT) on ES2
(by Kondo)
•  ES2 8nodes=8processes x 8 threads (Box MHD) –  Peak :819.2Gflops(x8) = 6553.6GFlops –  Sustain: 991.058GFlops(15.1%) –  1024x1024x1024 grid points •  ES2 32nodes=32processes x 8 threads (Box MHD) –  Peak :819.2Gflops(x32) = 26.21TFlops –  Sustain: 1.893TFlops(7.22%) –  1024x1024x1024 grid points Note: the code has not been op0mized for ES2 •  ES 64nodes=64processes x 8 threads (Box HD) –  Peak: 64GFlops(x64) = 4.1TFlops –  Sustain: 1.7TFlops(42%) –  1024x1024x1024 grid points
FX1@nagoyaU
SPARC64TM VII (~SPARC64TM VIIIfx on K computer)
•  1 CPU = 40Gflops (= 4 cores x 10 Gflops/core)
•  1 node = 1 CPU (40Gflops) / 32 GB •  768 nodes =768 CPUs (30.7Tflops) / 24 TB Interconnec0on Network: Fat tree
~ ES2
Scalar Processer + Fat-­‐Tree interconnect
3D-FFT by domain decomposition
Fourier Space
Physical Space
uˆk1, k 2,k3
u j 1 , j 2 , j3
FFTx
V ( N+1, N, N/nd )
FFTx
FFTy
V ( N+1, N, N/nd )
V ( N+1, N/nd, N)
Scalar processing
Scalar processing
FFT
FFTy
Scalar processing
MPI
FFTz
Transposition
MPI
MPI
FFTz
3DFFT(ds_v3drcf) on FX1@nagoyaU
30TFlops@768nodes
•  FX1 32nodes=32process x 4 threads –  Peak: 1.28TFlops –  Sustain: 0.0535TFlops(4.18%) –  2048x2048x2048 grid points •  FX1 64nodes=64process x 4 threads –  Peak: 2.56TFlops –  Sustain: 0.1167TFlops(4.56%) –  2048x2048x2048 grid points hap://www.top500.org/lists/2011/11/press-­‐release
The meaning of 「京」(Kei) is 1016=10P Cf. 「千」(Sen)=103, 「万」(Man)=104
The data transfer from one node to the nearest neighbor node can be done at one step. Performance of K computer
Rank Computer
1
Site
Total
Cores
Rmax
Rpeak
Effeciency
(Tflop/s) (Tflop/s) (%)
Nmax
RIKEN
K computer, Advanced
Institute for
SPARC64
VIIIfx 2.0GHz, Computatio
nal Science
Tofu
interconnect (AICS)
705024
Power
(kW)
Processor
Speed
Mflops/
Watt
Processor (MHz)
SPARC64
VIIIfx 8C
10510
11280
93.17 11870208
12659.7 830.182.00GHz
2000
Tofu (Torus fusion) Interconnect
10.51 PFlop/s on Linpack
93% (High efficiency)
830MFlops/Waa (Energy-­‐efficient)
29.5 hours (High stability)
3D torus network
128GFlop/s 8 cores
x 705024 = 11.28PFlop/s
peak
Nmax 11870208 ~ 1.1x106 PFlop
HPC Challenge 2011
First place but the efficiency is less than 2%
hap://www.4-­‐traders.com/FUJITSU-­‐6492460/news/FUJITSU-­‐K-­‐computer-­‐No-­‐1-­‐in-­‐Four-­‐Benchmarks-­‐at-­‐HPC-­‐Challenge-­‐Awards-­‐13893670/
Summary (box turbulence )
Achievement at present –  HPC of DNS using a Fourier spectral method on ES •  20483 DNS at 16.4TFlops •  40963 DNS (at ~13TFlops) –  Iner0al range at Rλ=1301 with kmaxη=1
Grand Challenge for K computer –  HPC of DNS using a Fourier spectral method •  81923 DNS at 100TFlops (?) –  HPC of DNS using a finite difference algorithm with a good Poisson solver •  122883 DNS at (?)TFlops
Expected Results Real proper0es in the iner0al range of high Re turbulence Development of a new LES model
Singularity forma0on in NS equa0ons Turbulent Channel Flow DNS method of Channel flow (KMM1987) Navier-Stokes equations
∂ ui
∂p
1 2
=−
+ Hi +
∇ ui
∂t
∂xi
Reτ
Equations
∂
1
g = hg +
∇2 g
∂t
Reτ
∂
1
φ = hv +
∇2φ
∂t
Reτ
∂ ui
=0
∂ xi
BC
u=v=w=0
at y = ± δ
Normal direction
y, v
x, z : periodic
Flow U(y)
spanwise
g = 0, v = 0,
at y = ±δ
∂v
=0
∂y
∂H x ∂H z
−
,
∂z
∂x
∂ ⎛ ∂H x ∂H z ⎞ ⎛ ∂ 2
∂ 2 ⎞
hv ≡ −
+
+
⎟ H y
⎜
⎟ + ⎜
∂y ⎝ ∂x
∂z ⎠ ⎝ ∂x 2 ∂z 2 ⎠
hg ≡
y = +δ
z, w
BC
∂u ∂w
g ≡ ωy =
−
∂z
∂x
φ = ∇ 2v
y = ‐δ
Fourier Spectral
x, u
streamwise
Reτ =
uτ δ
ν
uτ : friction velocity
ν : viscosity
2δ :wall distance
Time Integra0on " ( x, y, z ) = !!"ˆ (k x , y, k z )ei ( k
kx
x x+kz z )
f NL = {hv , hg }
kz
∂
1
ψ = f NL +
∇ 2ψ
∂t
Reτ
Fourier Transform
∂
1
ψˆ = fˆNL +
∂t
Reτ
3step Runge Kutta+Implicit Euler
for k =1, 2, 3
n + k3−1
ˆ
Calculate : f NL
2
ψ = {φ , g}
⎡
∂ 2 ⎤
2
2
⎢− (k x + k z ) + 2 ⎥ψˆ
∂y ⎦
⎣
⎧ 1 1 ⎫
,1⎬
3
2
⎩
⎭
α k = ⎨ ,
n + k3
⎡ Reτ
Reτ n
d ψˆ
n + k3−1
2
2 ⎤ n + k3
ˆ
ˆ
Solve :
− ⎢
+ (k x + k z )⎥ψ
= − Reτ f NL −
ψˆ
2
dy
α k Δt
⎣α k Δt
⎦
given
1D Boundary Value Problem of 2nd order ODE
Chebyshev-tau method
History of the DNS of channel turbulence 出典:平行平板間乱流の大規模DNS 進展と今後の展開 河村洋 阿部浩幸
A series of DNS of turbulent channel
K41 spectrum
•  Compensated one-­‐dimensional longitudinal spectrum at the outer end of the log-­‐law region K41 spectrum
–  K41 spectrum range becomes broader with the increase of the Reynolds number Turbulent channel flow Performance in TFlops
•  ES2 64node=64processes x 8 thread –  Peak :819.2GFlops(x64)=52.43TFlops –  Sustain: 5.9103TFlops(11.27%) –  1024x1536x1024 grid points •  ES2 64node=64processes x 8 thread –  Peak :819.2GFlops(x64)=52.43TFlops –  Sustain: 6.0495TFlops(11.54%) –  2048x1536x2048 grid points Summary (Channel turbulence)
Achievement at present –  HPC of DNS using a Fourier Chebyshev-­‐tau method on ES2 •  1024x1536x1024 DNS at 5.9TFlops •  2048x1534x2048 DNS at 6.0TFlops –  An Iner0al range in a log-­‐law region
Grand Challenge using K comp. –  HPC of DNS using a Fourier Chebyshev-­‐tau method –  HPC of DNS using a finite difference algorithm with a good Poisson solver Expected Results Rτ>10240 Iner0al proper0es in a log-­‐law region Prac0cal modeling of wall bounded turbulence Drag reduc0on at high Re Turbulent boundary layer Method in Spalart&Wutmaff(1993)
Fringe method
TBL thickness
flow
Effective region
Control TBL thickness by using external force in fringe region Fringe region
Func0on expansion
streamwise(x),spanwise(z)direc0on
wall normal(y)direction
Fourier Series O(N log N)
Jacobi polynomials O(N2) cf. Skote et al(2002) use Chebyshev polynomials
TBL
•  ES2 16node=128processes –  Peak :102.4Gflops(x128)=13.107TFlops –  Sustain: 6.041TFlops(46.1%) –  4608x512x768 grid points, 3072x340x512 modes
TBL
•  FX1 32nodes=32process x 4thread –  Peak :1.28TFlops –  Sustain: 0.11TFlops(8.6%) –  720x128x192 grid points, 480x85x128 modes •  FX1 64nodes=64process x 4thread –  Peak :2.56TFlops –  Sustain: 0.21TFlops(8.2%) –  1728x160x240 grid points, 1152x106x160 modes
Summary
•  The efficiency of the 3D spectral code strongly depends on the network topology –  Full crossbar > Fat tree > 3D torus •  In order to efficiently use the K computer, we have to give up the spectral method(??)