Parallel Computing and Intel® Xeon Phi™
coprocessors
Warren He, Ph.D. (何万青), [email protected]
Technical Computing Solution Architect
Datacenter & Connected System Group
Transition from Multicore to Manycore
Agenda
• Why Many-Core?
• Intel Xeon Phi Family & ecosystem
• Intel Xeon Phi HW and SW architecture
• HPC design and development considerations for
Xeon Phi
• Case Studies & Demo
Scaling workloads for Intel® Xeon Phi™ coprocessors
[Figure: performance as a function of the fraction of code that is vectorized and parallelized. Peak performance requires doing both: vectorize, parallelize, and scale to many cores.]
* Theoretical acceleration of a highly parallel processor over an Intel® Xeon® parallel processor.
[Block diagram: many vector IA cores, each paired with a coherent cache, linked by an interprocessor network, plus memory and I/O interfaces and fixed function logic. The Intel® Xeon Phi™ Coprocessor connects over PCIe and carries GDDR5 memory.]
Notice: This document contains information on products in the design phase of development. The
information here is subject to change without notice. Do not finalize a design with this information. Contact
your local Intel sales office or your distributor to obtain the latest specification before placing your product
order.
Knights Corner and other code names featured are used internally within Intel to identify products that are in
development and not yet publicly announced for release. Customers, licensees and other third parties are
not authorized by Intel to use code names in advertising, promotion or marketing of any product or services
and any such use of Intel's internal code names is at the sole risk of the user. All products, computer
systems, dates, and figures specified are preliminary based on current expectations, and are subject to
change without notice.
All SKUs, pricing and features are PRELIMINARY
and are subject to change without notice
Silicon Cores: 57-61
Silicon Max Freq: 1.05-1.25 GHz
Double Precision Peak Performance: 1003-1220 GFLOPS
GDDR5 Devices: 24-32
Memory Capacity: 6-8 GB
Memory Channels: 12-16
Memory BW Peak: 240-352 GB/s
Total Cache: 28.5-30.5 MB
Board TDP: 225-300 Watts
Form Factors: passive, active, dense form factor (DFF), no thermal solution (NTS); refer to picture
PCIe: Gen 2 x16; I/O speeds up to 7.0 GB/s KNC-to-Host, 6.7 GB/s Host-to-KNC
Misc.: product codename Knights Corner; based on 22nm process; ECC enabled; Turbo not currently available
Power/Mgmt: future node manager support; IPMI, including Intel Coprocessor Communications Link
Tools: Intel C++/C/Fortran compilers, Intel Math Kernel Library; debuggers, performance and correctness analysis tools; OpenMP, MPI, OFED messaging infrastructure (Linux only), OpenCL; programming models: offload, native, and mixed offload+native
OS: RHEL 6.x, SuSE 11.1, Microsoft Windows 8 Server, Win 7, Win 8
Intel® Many Integrated Core Architecture:
Processor Numbering Overview

Example: Intel® Xeon Phi™ coprocessor 7110P
• Brand: designates the family of products.
• Performance shelf: indication of relative pricing and performance within a generation; 7 = best performance, 5 = best performance/watt, 3 = best value.
• Generation: 1 = Knights Corner, 2 = Knights Landing.
• SKU number (here, 10).
• Suffix: where applicable, may denote form factor, thermal solution or usage model; P = passive thermal solution, A = active thermal solution, X = no thermal solution, D = dense form factor.
Software Development Ecosystem1 for Intel® Xeon Phi™ Coprocessor

• Compilers, run environments. Open source: gcc (kernel build only, not for applications), python*. Commercial: Intel® C++ Compiler, Intel® Fortran Compiler, MYO, CAPS* HMPP* 3.2.5 (Beta) compiler, PGI*, PGAS GPI (Fraunhofer ITWM*), ISPC.
• Debugger. Open source: gdb. Commercial: Intel Debugger, RogueWave* TotalView* 8.9, Allinea* DDT 3.3.
• Libraries. Open source: TBB2, MPICH2 1.5, FFTW, NetCDF. Commercial: NAG*, Intel® Math Kernel Library, Intel® MPI Library, Intel® OpenMP*, Intel® Cilk™ Plus (in Intel compilers), MAGMA*, Accelereyes* ArrayFire 2.0 (Beta), Boost C++ Libraries 1.47+.
• Profiling & analysis tools: Intel® VTune™ Amplifier XE, Intel® Trace Analyzer & Collector, Intel® Inspector XE, Intel® Advisor XE, TAU* – ParaTools* 2.21.4.
• Virtualization: ScaleMP* vSMP Foundation 5.0, Xen* 4.1.2+.
• Cluster, workload management, and manageability tools: Altair* PBS Professional 11.2, Adaptive* Computing Moab 7.2, Bright* Cluster Manager 6.1 (Beta), ET International* SWARM (Beta), IBM* Platform Computing* {LSF 8.3, HPC 3.2 and PCM 3.2}, MPICH2, Univa* Grid Engine 8.1.3.

1These are all announced as of November 2012; Intel has said more are actively being developed but not yet announced. 2Commercial support of Intel TBB is available from Intel.
Technical Compute Target Market Segments & Applications

Sub-segments and Intel® Xeon Phi™ candidate applications/workloads:
• Public Sector: HPL, HPCC, NPB, LAMMPS, QCD
• Energy (including Oil & Gas): RTM (Reverse Time Migration), WEM (Wave Equation Migration)
• Climate Modeling & Weather Simulation: WRF, HOMME
• Financial Analysis: Monte Carlo, Black-Scholes, binomial model, Heston model
• Life Sciences (Molecular Dynamics, Gene Sequencing, BioChemistry): LAMMPS, NAMD, AMBER, HMMER, BLAST, QCD
• Manufacturing (CAD/CAM/CAE/CFD/EDA): implicit and explicit solvers
• Digital Content Creation: ray tracing, animation, effects
• Software Ecosystem (tools, middleware): mix of ISV and end-user development
Agenda
• Why Many-Core?
• Intel Xeon Phi Family & ecosystem
• Intel Xeon Phi HW and SW architecture
• HPC design and development considerations for
Xeon Phi
• Case Studies & Demo
Intel® Xeon Phi™ coprocessor codenamed
“Knights Corner” Block Diagram
• Distributed tag directory enables faster snooping
• High-speed bi-directional ring interconnect
• Cores: lightweight, power-efficient, in-order, support 4 hardware threads
• Vector Processing Unit (VPU) on each core, with 32 512-bit SIMD registers
Speeds and Feeds
Per-core caches reference info:

Type          Size    Ways   Set conflict   Location
L1I (instr)   32KB    4      8KB            on-core
L1D (data)    32KB    8      4KB            on-core
L2 (unified)  512KB   8      64KB           connected via core/ring interface

Memory:
• 8 memory controllers, each supporting 2 32-bit channels
• GDDR5 channels theoretical peak of 5.5 GT/s
• Practical peak BW limited by ring and other factors
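As a sanity check, the theoretical peak memory bandwidth follows directly from these numbers: 8 controllers x 2 channels x 4 bytes per transfer x 5.5 GT/s = 352 GB/s, the top of the 240-352 GB/s range quoted in the SKU table earlier.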
Speeds and Feeds, Part 2
Per-core TLBs reference info:

Type            Entries   Page Size           Coverage
L1 Instruction  32        4KB                 128KB
L1 Data         64        4KB                 256KB
L1 Data         32        64KB                2MB
L1 Data         8         2MB                 16MB
L2              64        4KB, 64KB, or 2MB   up to 128MB

• Linux support for 64K pages on Intel64 is not yet available
Knights Corner Core

• x86-specific logic is < 2% of the core + L2 area
• 64-bit, in-order, 4 threads per core
• Instruction decode feeding a scalar unit (scalar registers) and a 512-bit-wide vector unit (vector registers)
• VPU: integer, SP, DP; 3-operand instructions; specialized instructions (new encoding); HW transcendentals
• L1 cache (data and instruction): 32 KB per core
• L2 cache: 512 KB slice per core, with fast access to the local copy; fully coherent; L2 hardware prefetching; connected to the interprocessor network
Knights Corner Silicon Core Architecture

• Improved ring interconnect delivering 3x bandwidth over the Aubrey Isle silicon ring
• Vector Processing Unit: register file with 3 read / 1 write ports (RF 3R, 1W), EMU, vector ALUs 16-wide x 32-bit and 8-wide x 64-bit, fused multiply-add, mask register file, scatter/gather support
[Figure: core pipeline stages (PPF, PF, D0, D1, D2, E, WB) alongside the VPU pipeline stages (VC1, VC2, V1-V4).]

Peak Performance = Clock Freq * # of Cores * 8 Lanes (DP) * 2 (FMA) FLOPS/cycle
e.g. 1.091 GHz * 61 cores * 8 lanes * 2 = 1064.8 GFLOPS
Wider vectorization opportunity

Effective use of an ever-increasing number of vector elements and new instruction sets is vital to Intel's success. Illustrated with the number of 32-bit data elements processed by one "packed" instruction:
• Intel® Pentium® processor (1993): 1
• MMX™ (1997): 2
• Intel® Streaming SIMD Extensions (Intel® SSE in 1999 to Intel® SSE4.2 in 2008): 4
• Intel® Advanced Vector Extensions (Intel® AVX in 2011 and Intel® AVX2 in 2013): 8
• Intel® Xeon Phi™ Coprocessor (2012): 16
And multi-cores and many-cores come on top of this!
Knights Corner PCI Express Card Block Diagram
Intel® Xeon Phi™ Coprocessors Software Architecture (MPSS)
Intel® Xeon Phi™ Coprocessors: They’re So
Much More
It's a supercomputer on a chip: general purpose IA hardware leads to less idle time for your investment. The Intel® Xeon Phi™ coprocessor can operate as a compute node, run a full OS, be programmed to MPI, and run x86 code.

Restrictive architectures (GPU, ASIC, FPGA) run restricted or offloaded code with custom HW acceleration, and limit the ability of applications to use arbitrary nested parallelism, function calls and threading models.

Source: Intel Estimates
MIC Software Stack
The stack is mirrored on the HOST and on the MIC card:
• User mode: offload tools/applications, the offload library, MYO and COI
• Kernel mode: the SCIF driver in the Linux kernel
The two sides communicate over the PCI-E communications HW.
Spectrum of Execution Models of
Heterogeneous Computing on Xeon and Xeon
Phi Coprocessor
From CPU-centric (Intel® Xeon processor) to Intel® MIC-centric (Intel® Many Integrated Core):

• Multi-core hosted: general purpose serial and parallel computing; main(), foo() and MPI_*() all run on the multi-core host.
• Offload: codes with highly parallel phases; main() runs on the host, and foo() is offloaded across PCIe to the many-core card.
• Symmetric: codes with balanced needs; main(), foo() and MPI_*() run on both the host and the card.
• Reverse offload: codes with serial phases; main() runs on the card and serial work is offloaded back to the host. (Not currently supported by the language-extension offload in Intel tools.)
• Many-core hosted: highly parallel codes; main(), foo() and MPI_*() all run on the card.

Productive programming models across the spectrum are supported with Intel tools.
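For the offload model, here is a minimal sketch of how a hot function can be marked for offload with the Intel compiler's language extensions (the function name and array size are illustrative):

#include <stdio.h>

#define N 1024

/* Compiled for both the host and the coprocessor. */
__attribute__((target(mic)))
void scale(float *a, int n) {
    for (int i = 0; i < n; i++)
        a[i] *= 2.0f;
}

int main(void) {
    float a[N];
    for (int i = 0; i < N; i++) a[i] = (float)i;

    /* Run scale() on MIC card 0; copy a[] in and back out. */
    #pragma offload target(mic:0) inout(a)
    scale(a, N);

    printf("a[1] = %f\n", a[1]);   /* expect 2.000000 */
    return 0;
}

Compiling this with icc produces a fat binary containing both host and coprocessor code.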
MPSS Setup Step by Step
Tools for both OEMs and End Users:
• micsmc – a control panel application for system management and viewing configuration status information specific to SMC operations on one or more MIC cards in a single platform.
• micflash – a command line utility to update both the MIC SPI flash and the SMC firmware, as well as grab various information about the flash, from either the ROM on the MIC card or a ROM file.
• micinfo – a command line utility that allows the OEM and the end user to get various information (version data, memory info, bus speeds, etc.) from one or more MIC cards in a single platform.
• hwcfg – a command line utility that can be used to view and configure various hardware aspects of one or more MIC cards in a single platform.

Tools for OEMs only:
• micdiag – a command line utility that performs various diagnostics on one or more MIC cards in a single platform.
• micerrinj – a command line utility that allows OEMs to inject one or more ECC errors (correctable, uncorrectable or fatal) into the GDDR on a MIC card to test platform error handling capabilities.
• KncPtuGen – a power/thermal command line utility that can be used to put various loads on one or more individual cores on a MIC card to simulate compromised performance on specific cores.
• micprun (OEM workloads) – a command line utility that provides an environment for executing various performance workloads (provided by Intel) that allow the OEM to test their platform's performance over time and compare to Intel's "Gold" configuration in the factory.
‘micsmc’ – MIC SMC Control Panel
• What does it do?
– A graphical display of SMC status items, such as
temperature, memory usage and core utilization for one or
multiple MIC cards on a single platform.
– Command Line Parameters available for scripting purposes.
• What does it look like?
‘micsmc’ Card View Windows
• Utilization View
• Core Histogram View
• Historical Utilization View
‘micprun’ – MIC Performance Tool
• What can it do?
– This is actually a wrapper that can be used to execute the OEM workloads.
– Neatly wraps up many of the parameters specific to the workload in addition to
setting up workload comparisons to previous executions and even to Intel “Gold”
system configurations.
• What does it look like?
– ./micprun [-k workload] [-p workloadParams] [-c category] [-x method] [-v level] [-o output] [-t tag] [-r pickle]
– Workload (-k for kernel) specifies which workload to run: SGEMM, DGEMM, etc.
– Categories (-c) include "optimal", "scaling", "test".
– Methods (-x) include "native", "scif", "coi" and "myo".
– Levels (-v) specify the level of detail in the output.
– Output (-o) specifies to create a "pickle" file for future comparisons (use Tag (-t) to specify a custom name).
– Pickle (-r) is used to compare the current workload execution to a previously saved workload execution, executing the current run exactly the same as the specified pickle file.
‘micprun’ continued…
• Show me some examples!
– Run all of the native workloads with optimal parameters:
    ./micprun
    ./micprun -k all -c optimal -x native -v 0
– Run SGEMM in scaling mode, and save to a pickle file:
    ./micprun -k sgemm -c scaling -o . -t mySgemm
– Run a comparison to the above example and create a graph:
    ./micprun -r micp_run_mySgemm -v 2
– View the SGEMM workload parameters:
    ./micprun -k sgemm --help
– Run SGEMM on only 4 cores:
    ./micprun -k sgemm -p '--num_core 4'
Agenda
• Why Many-Core?
• Intel Xeon Phi Family & ecosystem
• Intel Xeon Phi HW and SW architecture
• HPC design and development considerations for
Xeon Phi
• Case Studies & Demo
Considerations for Application Development
on MIC
Scientific parallel application considerations:
• Workload: dataset size, grid shape
• Algorithm: SPMD, fork/join, loop structure
• Implementation: data access patterns, load balancing
• Communication
Considerations in Selecting Workload Size

Applicability of Amdahl's law:
• Speedup = (s + p) / (s + p/N)   ...(1)
  where s is the serial fraction, p the parallel fraction and N the number of parallel processors
• Normalizing with the single-processor runtime, s + p = 1, so Speedup = 1 / (s + p/N)
• This indicates that the serial portion s will dominate on massively parallel hardware like Intel® Xeon Phi™, limiting the advantage of the many-core hardware.

Ref: "Scientific Parallel Computing" – L. Ridgway Scott et al.
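To make the serial-fraction effect concrete, here is a minimal sketch in plain C (illustrative values of s) that evaluates equation (1):

#include <stdio.h>

/* Amdahl speedup, equation (1), normalized so that s + p = 1 */
static double amdahl(double s, int n) {
    return 1.0 / (s + (1.0 - s) / n);
}

int main(void) {
    /* Even 1% serial code caps the speedup at 100x, however many cores. */
    for (int n = 1; n <= 256; n *= 4)
        printf("N = %3d   s = 0.01 -> %6.2f   s = 0.10 -> %5.2f\n",
               n, amdahl(0.01, n), amdahl(0.10, n));
    return 0;
}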
Scaled vs. Fixed Size Speedup: Amdahl's Law Implication

Amdahl's law can be considered harmful in designing parallel architecture. Gustafson's[1] observation:
• For scientific computing problems, problem size increases with parallel computing power.
• The parallel component of the problem scales with the problem size.
• Ask "how long would a given parallel program have taken on a serial processor?"
• Scaled Speedup = (s' + p'N) / (s' + p'),
  where s' and p' represent the serial and parallel time spent on the parallel system, s' + p' = 1
• Scaled Speedup = s' + (1 - s')N = N + (1 - N)s'   ...(2)

Gustafson's Law: when speedup is measured by scaling the problem size, the serial fraction tends to shrink as more processors are added.

Fig: John L. Gustafson, "Reevaluating Amdahl's Law"
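A companion sketch for equation (2); compare its near-linear growth with the Amdahl numbers above (plain C again, illustrative values of s'):

#include <stdio.h>

/* Gustafson scaled speedup, equation (2): s' is measured on the parallel system */
static double gustafson(double s_prime, int n) {
    return n + (1 - n) * s_prime;   /* equivalently s' + (1 - s')*n */
}

int main(void) {
    for (int n = 1; n <= 256; n *= 4)
        printf("N = %3d   s' = 0.01 -> %6.2f   s' = 0.10 -> %6.2f\n",
               n, gustafson(0.01, n), gustafson(0.10, n));
    return 0;
}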
Intel® Xeon Phi™ Coprocessors Workload Suitability

Three questions decide suitability:
• Can your workload scale to over 100 threads?
• Can your workload benefit from large vectors?
• Can your workload benefit from more memory bandwidth?

If the application scales with threads and with vectors or memory BW, it is a candidate for Intel® Xeon Phi™ coprocessors.

[Figure: performance vs. threads for a representative example workload (for illustration purposes only), showing the multi-core peak and the much higher many-core peak reached only when the answers above are yes.]

Click to see the animation video on Exploiting Parallelism on Intel® Xeon Phi™ Coprocessors. A picture speaks a thousand words.
The Keys to Taking Advantage of MIC
• Choose the right Xeon-centric or MIC-centric model for your application
• Vectorize your application
  – Use Intel's vectorizing compiler
• Parallelize your application
  – With MPI (or another multiprocess model)
  – With threads (via Pthreads, TBB, Cilk Plus, OpenMP, etc.; see the sketch below)
• Consider standard optimization techniques such as tiling, overlapping computation and communication, etc.
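A minimal sketch of the last two keys combined, threads via OpenMP plus compiler vectorization of each thread's chunk (the arrays and sizes are illustrative):

#include <stdio.h>

#define N (1 << 20)
static float a[N], b[N];

int main(void) {
    for (int i = 0; i < N; i++) b[i] = (float)i;

    /* Threads split the iteration space across cores; within each
       chunk the compiler can vectorize the independent iterations. */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        a[i] = 2.0f * b[i] + 1.0f;

    printf("a[10] = %f\n", a[10]);   /* expect 21.000000 */
    return 0;
}

Built natively for the coprocessor (e.g. icc -openmp -mmic), the same loop spreads over the card's hardware threads.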
Intel® Xeon Phi™ SW Imperatives

• For the Intel® Xeon Phi™ program to succeed, think differently and guide programmers to think differently.

[Figure: two porting strategies plotted as speed vs. time. A naive port followed by a long optimization phase is a losing strategy; an intelligent port followed by a shorter optimization phase is a winning strategy.]

Intelligent application porting to Intel® Xeon Phi™ platforms requires parallelization and vectorization.
Algorithm and Data Structure
One needs to consider
• Type of Parallel Algorithm to use for the problem
– Look carefully at every sequential part of the algorithm
• Memory Access Patterns
– Input and output data set
• Communication Between Threads
• Work division between threads/processes
– Load Balancing
– Granularity
Parallelism Pattern For Xeon Phi™
Memory Access Patterns
• The algorithm determines the data access patterns.
• Data access patterns can be categorized into:
  – Arrays/linear data structures: conducive to low latency and high bandwidth; look out for unaligned access
  – Random access: try to avoid; needs prefetching if possible (there are fewer threads to hide latency than in a Tesla implementation)
  – Recursive (tree)
• Data decomposition:
  – Geometric decomposition: regular or irregular
  – Recursive decomposition
Vectorization methodologies
• Auto-vectorization may be improved by additional pragmas like #pragma ivdep.
• A more aggressive pragma is #pragma simd. This pragma uses less strict heuristics and may produce wrong code, so results must be carefully checked.
• Experts may use compiler intrinsics. The caveat is the loss of portability; a portable layer between the intrinsics and the application may help (see the sketch below).
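As an illustration of such a portability layer, here is a minimal sketch assuming the Intel compiler (the wrapper name vec_scale is ours): the application calls one function, and the intrinsics are confined to a per-target body.

#include <stdio.h>
#include <stddef.h>
#ifdef __MIC__
#include <immintrin.h>   /* 512-bit KNC intrinsics */
#endif

/* Portable entry point: the application never touches intrinsics directly. */
static void vec_scale(float *a, size_t n, float s) {
#ifdef __MIC__
    size_t i = 0;
    __m512 vs = _mm512_set1_ps(s);
    /* For brevity this assumes a is 64-byte aligned; the tail loop
       handles n not being a multiple of 16. */
    for (; i + 16 <= n; i += 16) {
        __m512 v = _mm512_load_ps(a + i);
        _mm512_store_ps(a + i, _mm512_mul_ps(v, vs));
    }
    for (; i < n; i++) a[i] *= s;
#else
    for (size_t i = 0; i < n; i++)   /* scalar fallback; may auto-vectorize */
        a[i] *= s;
#endif
}

int main(void) {
    static float a[32] __attribute__((aligned(64)));
    for (int i = 0; i < 32; i++) a[i] = (float)i;
    vec_scale(a, 32, 2.0f);
    printf("a[1] = %f\n", a[1]);   /* expect 2.000000 */
    return 0;
}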
Compiler has to assume the worst case the language/flags allow.

float *A;
void vectorize()
{ int i;
  for (i = 0; i < 102400; i++)
  { A[i] *= 2.0f; }
}

• Loop body: load of A, load of A[i], multiply with 2.0f, store of A[i].
• Q: Will the store modify A? A: Maybe. A "no" is needed to make vectorization legal. Three ways to tell the compiler:

3) Recompile with -ansi-alias:
   icc -vec-report1 -ansi-alias test1.c
   test1.c(4): (col. 3) remark: LOOP WAS VECTORIZED.
4) Change "float *A" to "float *restrict A":
   icc -vec-report1 test1a.c
   test1a.c(4): (col. 3) remark: LOOP WAS VECTORIZED.
5) Add "#pragma ivdep" to the loop:
   icc -vec-report1 test1b.c
   test1b.c(5): (col. 3) remark: LOOP WAS VECTORIZED.
Writing Explicit Vector Code with Intel® Cilk™ Plus

8) Using the SIMD pragma:

float *A;
void vectorize()
{
  #pragma simd vectorlength(4)
  for (int i = 0; i < 102400; i++)
  { A[i] *= 2.0f; }
}

9) Using array notation:

float *A;
void vectorize()
{
  A[0:102400] *= 2.0f;
}

10) Using an elemental function:

float *A;
__declspec(noinline)
__declspec(vector(uniform(p), linear(i)))
void mul(float *p, int i){ p[i] *= 2.0f; }

void vectorize()
{
  for (int i = 0; i < 102400; i++)
  { mul(A, i); }
}
Memory Accesses and Alignment

SIZE: 64B for Xeon Phi™, 32B for AVX1/2, 16B for SSE4.2 and below.

Memory access patterns:
• Unit-stride: A[i], A[i+1], A[i+2], A[i+3], ...
• Strided (a special form of gather/scatter): A[2*i], A[2*(i+1)], A[2*(i+2)], ...
• Gather/scatter: A[B[i]], A[B[i+1]], A[B[i+2]], ...

Alignment optimization for unit-stride accesses:
• Aligned unit-stride: Addr % SIZE == 0
• Misaligned unit-stride: Addr % SIZE != 0
• Alignment-unknown unit-stride: Addr % SIZE == ???

If you write in Cilk™ Plus array notation, access patterns are obvious to the eye: unit-stride means A[:] or A[lb:#elems]. This helps you think more clearly.
Memory Access Patterns

float A[1000];
struct BB {
  float x, y, z;
} B[1000];
int C[1000];

for(i=0; i<n; i++){
  a) A[i]                        // unit-stride
  b) A[2*i]                      // strided
  c) B[i].x                      // strided
  d) j = C[i]                    // unit-stride
     A[j]                        // gather/scatter
  e) A[C[i]]                     // same as d)
  f) B[i].x + B[i].y + B[i].z    // 3 strided loads
}

• Unit-stride
  – Good, with one exception: out-of-order cores won't store-forward a masked (unit-stride) store.
  – Avoid storing under a mask (unit-stride or not), but unnecessary stores can also create a bandwidth problem. Ouch.
• Strided, gather/scatter
  – Performance varies with the micro-architecture and the actual index patterns.
  – Big latency is exposed if these are on the critical path.
  – F90 arrays tend to be strided if not carefully used; see http://software.intel.com/en-us/articles/fortran-array-data-and-arguments-and-vectorization and use the F2008 "contiguous" attribute.
Data Alignment Optimization

float *A, *B, *C, *D;
A[0:n] = B[0:n] * C[0:n] + D[0:n];

• Explain the compiler's assumption on data alignment and how that relates to the generated code.
• Identify different ways to optimize the data alignment of this code.

1) Understand the baseline assumption:
   icc -xSSE4.2 -vec-report1 -restrict test2.c
   ifort -xSSE4.2 -vec-report1 ftest2.f
   icc -mmic -vec-report6 test2.c
   ifort -mmic -vec-report6 ftest2.f
   Examine the ASM code.
2) Add "vector aligned":
   icc -xSSE4.2 -vec-report1 test2a.c
   ifort -xSSE4.2 -vec-report1 ftest2a.f
   icc -mmic -vec-report6 test2a.c
   ifort -mmic -vec-report6 ftest2a.f
   Examine the ASM code.
3) Compile with -opt-assume-safe-padding:
   icc -mmic -vec-report6 -opt-assume-safe-padding test2.c
   ifort -mmic -vec-report6 -opt-assume-safe-padding ftest2.f
   Examine the ASM code.
Align the Data at Allocation and Tell that to the Compiler at the Use

C/C++:
• Align at data allocation:
  __declspec(align(n, [off]))
  _mm_malloc(size, n)
• Inform the compiler at use sites:
  #pragma vector aligned
  __assume_aligned(ptr, n);
  __assume(var % x == 0);

FORTRAN:
• Align at data allocation:
  !DIR$ ATTRIBUTES ALIGN:n :: var
  -align arrayNbyte (N=16,32,64)
• Inform the compiler at use sites:
  !DIR$ vector aligned
  __assume_aligned(ptr, n)

Reading material: http://software.intel.com/en-us/articles/data-alignment-to-assist-vectorization
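Putting the allocation-side and use-site hints together in C (a minimal sketch for the Intel compiler; the array size and names are illustrative):

#include <stdio.h>
#include <immintrin.h>   /* _mm_malloc / _mm_free */

int main(void) {
    int n = 1024;
    /* Allocate on a 64-byte boundary, the Xeon Phi vector width. */
    float *a = (float *)_mm_malloc(n * sizeof(float), 64);
    float *b = (float *)_mm_malloc(n * sizeof(float), 64);
    for (int i = 0; i < n; i++) b[i] = (float)i;

    /* Tell the compiler the pointers used here are 64-byte aligned. */
    __assume_aligned(a, 64);
    __assume_aligned(b, 64);
    #pragma vector aligned
    for (int i = 0; i < n; i++)
        a[i] = 2.0f * b[i];

    printf("a[3] = %f\n", a[3]);   /* expect 6.000000 */
    _mm_free(a);
    _mm_free(b);
    return 0;
}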
Array of Struct is BAD

struct node {
  float x, y, z;
};
struct node NODES[1024];
float dist[1024];

for(i=0; i<1024; i+=16){
  float x[16], y[16], z[16], d[16];
  x[:] = NODES[i:16].x;
  y[:] = NODES[i:16].y;
  z[:] = NODES[i:16].z;
  d[:] = sqrtf(x[:]*x[:] + y[:]*y[:] + z[:]*z[:]);
  dist[i:16] = d[:];
}

• Use a struct of arrays:

struct node1 {
  float x[1024], y[1024], z[1024];
};
struct node1 NODES1;

• Or use a tiled array of structs:

struct node2 {
  float x[16], y[16], z[16];
};
struct node2 NODES2[];
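The same distance computation in plain C on the struct-of-arrays layout (a sketch using node1 from above): every access is now a unit-stride stream, so the loop vectorizes without gathers.

#include <math.h>
#include <stdio.h>

struct node1 {
    float x[1024], y[1024], z[1024];
};
static struct node1 NODES1;
static float dist[1024];

int main(void) {
    for (int i = 0; i < 1024; i++) {
        NODES1.x[i] = 1.0f; NODES1.y[i] = 2.0f; NODES1.z[i] = 2.0f;
    }
    /* x[i], y[i], z[i] are contiguous streams: unit-stride, vectorizable */
    for (int i = 0; i < 1024; i++)
        dist[i] = sqrtf(NODES1.x[i] * NODES1.x[i] +
                        NODES1.y[i] * NODES1.y[i] +
                        NODES1.z[i] * NODES1.z[i]);
    printf("dist[0] = %f\n", dist[0]);   /* expect 3.000000 */
    return 0;
}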
Example in Optimizing LBM for MIC

• Updates f_i(x, t), the particle distribution function of particles at position x at time t traveling in direction e_i:

  f_i(x + e_i dt, t + dt) = f_i(x, t) + Ω_i(f(x, t))
  Ω_i(f(x, t)) = (1/τ)(f_i^0 − f_i)

• D3Q19 model:
  – Each cell has |e_i| = 19 values
  – Uses 19 values at the current node, does ~12 flops per direction (~220 flops total) and writes 19 neighbors
  – Has special-case computation at boundaries (a flag array is used here)
The stream-and-collide kernel (abbreviated; NX = NY = NZ = 128, elided terms marked "..."):

DO k = 1, NZ
  DO j = 1, NY
    !DIR$ IVDEP
    DO i = 1, NX
      m    = i + NXG*( j + NYG*k )
      phin = phi(m)
      rhon = g0(m+now) + g1(m+now) + g2(m+now) + g3(m+now) + g4(m+now) &
           + g5(m+now) + g6(m+now) + ...       + g17(m+now) + g18(m+now)

      sFx  = muPhin*gradPhiX(m)
      sFy  = muPhin*gradPhiY(m)
      sFz  = muPhin*gradPhiZ(m)

      ux   = ( g1(m+now) - g2(m+now) + g7(m+now)  - ... + 0.5D0*sFx )*invRhon
      uy   = ( g3(m+now) - g4(m+now) + g7(m+now)  + ... + 0.5D0*sFy )*invRhon
      uz   = ( g5(m+now) - g6(m+now) + g11(m+now) + ... + 0.5D0*sFz )*invRhon

      f0(m+nxt) = invTauPhi1*f0(m+now) + invTauPhi*phin + Af0
      f1(m+now) = invTauPhi1*f1(m+now) + Af1 + Cf1*ux
      f2(m+now) = invTauPhi1*f2(m+now) + Af1 - Cf1*ux
      ...
      f6(m+now) = invTauPhi1*f6(m+now) + Af1 - Cf1*uz

      g0(m+nxt)                  = invTauRhoOne*g0(m+now)  + ...  !! g0'nxt(i  ,j  ,k  )
      g1(m+nxt + 1)              = invTauRhoOne*g1(m+now)  + ...  !! g1'nxt(i+1,j  ,k  )
      g2(m+nxt - 1)              = invTauRhoOne*g2(m+now)  + ...  !! g2'nxt(i-1,j  ,k  )
      g3(m+nxt + NXG)            = invTauRhoOne*g3(m+now)  + ...  !! g3'nxt(i  ,j+1,k  )
      g4(m+nxt - NXG)            = invTauRhoOne*g4(m+now)  + ...  !! g4'nxt(i  ,j-1,k  )
      g5(m+nxt + NXG*NYG)        = invTauRhoOne*g5(m+now)  + ...  !! g5'nxt(i  ,j  ,k+1)
      g6(m+nxt - NXG*NYG)        = invTauRhoOne*g6(m+now)  + ...  !! g6'nxt(i  ,j  ,k-1)
      g7(m+nxt + 1 + NXG)        = invTauRhoOne*g7(m+now)  + ...  !! g7'nxt(i+1,j+1,k  )
      g8(m+nxt - 1 - NXG)        = invTauRhoOne*g8(m+now)  + ...  !! g8'nxt(i-1,j-1,k  )
      g9(m+nxt + 1 - NXG)        = invTauRhoOne*g9(m+now)  + ...  !! g9'nxt(i+1,j-1,k  )
      g10(m+nxt - 1 + NXG)       = invTauRhoOne*g10(m+now) + ...  !! g10'nxt(i-1,j+1,k  )
      g11(m+nxt + 1 + NXG*NYG)   = invTauRhoOne*g11(m+now) + ...  !! g11'nxt(i+1,j  ,k+1)
      g12(m+nxt - 1 - NXG*NYG)   = invTauRhoOne*g12(m+now) + ...  !! g12'nxt(i-1,j  ,k-1)
      g13(m+nxt + 1 - NXG*NYG)   = invTauRhoOne*g13(m+now) + ...  !! g13'nxt(i+1,j  ,k-1)
      g14(m+nxt - 1 + NXG*NYG)   = invTauRhoOne*g14(m+now) + ...  !! g14'nxt(i-1,j  ,k+1)
      g15(m+nxt + NXG + NXG*NYG) = invTauRhoOne*g15(m+now) + ...  !! g15'nxt(i  ,j+1,k+1)
      g16(m+nxt - NXG - NXG*NYG) = invTauRhoOne*g16(m+now) + ...  !! g16'nxt(i  ,j-1,k-1)
      g17(m+nxt + NXG - NXG*NYG) = invTauRhoOne*g17(m+now) + ...  !! g17'nxt(i  ,j+1,k-1)
      g18(m+nxt - NXG + NXG*NYG) = invTauRhoOne*g18(m+now) + ...  !! g18'nxt(i  ,j-1,k+1)
    END DO
  END DO
END DO

Key challenges:
• 49 streams of data (some unaligned): 19 read streams g*, 19 write streams g*, 7 read/write streams f*, and 4 read streams (phi, gradPhiX, gradPhiY, gradPhiZ)
• Prefetch / TLB associativity issues
• Compiler not vectorizing some hot loops
Initial Results + Basic Optimizations

• Straight porting: KNC was 1/10 the performance of a 2-socket SNB-EP system.
• Basic optimizations improve both CPU and MIC performance:
  – Optimizations: compiler vector directives, 2M pages, loop fusion, change in OMP work distribution
  – CPU perf improved by 8x; KNC perf improved by ~150x

[Chart: speedup over the reference code. Reference code: SNB-EP 1, KNC 0.11; moderately optimized code (pragmas, compiler, etc.): SNB-EP 8, KNC 16.]

Data measured by Tanmay Gangwani, Shreeniwas Sapre and Sangeeta Bhattacharya on 2x8 SNB-EP 2.2GHz, KNC-ES2.
Advanced Optimizations

• Include modifications to data structures and tiling to make better use of vector units, caches and available memory bandwidth:
  – AoS-to-SoA transformation helps improve SIMD utilization by avoiding gather/scatter operations
  – 3.5D blocking to take advantage of large coherent caches
  – SW prefetching for MIC
• Results: CPU and MIC performance both further improve by 9x.

Similarity between Intel® architectures makes these optimizations (basic + advanced) work for both the Intel® Xeon® processor and the Intel® Xeon Phi™ coprocessor.

Ref: Anthony Nguyen et al., "3.5-D Blocking Optimization for Stencil Computations on Modern CPUs and GPUs," SC 2010.
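To illustrate the blocking idea, here is a simplified 2D spatial tiling sketch in C (not the paper's full 3.5D scheme; the stencil, grid and tile sizes are illustrative). Iterations are grouped into tiles so each tile's working set stays resident in cache for the whole update:

#include <stdio.h>

#define NX 128
#define NY 128
#define TI 16   /* tile sizes chosen so one tile's data fits in L2 */
#define TJ 16

static float src[NY][NX], dst[NY][NX];

static int min_int(int a, int b) { return a < b ? a : b; }

void sweep_tiled(void) {
    for (int jj = 1; jj < NY - 1; jj += TJ)
        for (int ii = 1; ii < NX - 1; ii += TI)
            /* Finish one cache-resident tile before moving to the next. */
            for (int j = jj; j < min_int(jj + TJ, NY - 1); j++)
                for (int i = ii; i < min_int(ii + TI, NX - 1); i++)
                    dst[j][i] = 0.25f * (src[j][i-1] + src[j][i+1] +
                                         src[j-1][i] + src[j+1][i]);
}

int main(void) {
    for (int j = 0; j < NY; j++)
        for (int i = 0; i < NX; i++) src[j][i] = 1.0f;
    sweep_tiled();
    printf("dst[1][1] = %f\n", dst[1][1]);   /* expect 1.000000 */
    return 0;
}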
Agenda
• Why Many-Core?
• Intel Xeon Phi Family & ecosystem
• Intel Xeon Phi HW and SW architecture
• HPC design and development considerations for
Xeon Phi
• Case Studies & Demo
ScaleMP

• vSMP Foundation aggregates multiple x86 systems into a single, virtual x86 system
  – SCALE your application's compute, memory, and I/O resources
• Simple to move an application to Intel® Xeon Phi™ coprocessor based systems using ScaleMP software
• Zero porting effort

Video: ScaleMP to Support Intel Xeon Phi Coprocessors. ScaleMP CEO Shai Fultheim describes how vSMP Foundation supports the new Intel Xeon Phi coprocessor. (From ISC'12, Hamburg)

Intel demo booth #: ScaleMP – SMP VM for Xeon & Xeon Phi
Programming Intel® MIC-based
Systems
[Figure: the CPU-centric to Intel® MIC-centric spectrum applied across a cluster. Configurations of CPUs, MIC cards and their data are shown for CPU-hosted, offload, symmetric, reverse offload, and MIC-hosted models, with MPI ranks communicating over the network.]
LBM running on a Xeon cluster with Xeon Phi coprocessors

[Figure: the offload configuration. On each node the data lives with the CPU, which offloads work to its MIC card over MPI ranks connected by the network, compared against a homogeneous CPU-only node.]
LES_MIC: Load Balance Between CPU & MIC

[Figure: LES CPU+MIC simultaneous computing design; relative speedup with different problem sizes.]

Demo: LES_MIC
Recommended reading: 4 books

Intel® Xeon Phi™ Coprocessor Developer site: http://software.intel.com/mic-developer
• Architecture, setup and programming resources
• Self-guided training
• Case studies
• Information on tools and ecosystem
• Support through community forum