vectors

ENES Workshop
Exascale Technologies & Innovation in HPC
for Climate Models
Hamburg, March 2014
© NEC Corporation 2014
Agenda
▌ 
  ◆  Brief overview
  ◆  First results and interpretation
  ◆  Lessons learned → future
▌  NEC’s future plans
  ◆  Basic directions
  ◆  Implications for users
Vector-Core
[Block diagram: vector core with scalar unit]
Vector Pipeline (VPP) x16
ADB (Assignable Data Buffer): 1 MByte
Load: 256GB/s * (16B x16)
Register labels: Mask Reg., Vector Reg., Scalar Reg.
Register capacity labels: 16KB * + 128KB *
Functional unit labels: Mask, Multi, Multi/Logical, Add, Add/Div/Sqrt, ALU/Mul, Add/Sub
Scalar Unit
* aggregate for 16 VPPs
SIMD = Vector ?
[Diagram: Input, Pipeline and Result compared for Scalar, SIMD, Vector and SX]
Vector is more efficient than SIMD
SX is a SIMD-vector
LSI Configuration
Architecture
[Block diagram: CPU with four cores, each containing an SPU (Scalar Processing Unit), a VPU (Vector Processing Unit) and an ADB (Assignable Data Buffer); a crossbar; an RCU (Remote access Control Unit) connected to the Interconnect (8GB/s x2, 8GB/s x2); 16 memory controllers (MC) connected to Memory (DDR3). Links labelled 256GB/s.]
Vector core
Performance: 64GFlops
ADB size: 1MB
ADB bandwidth: 256GB/s
Memory bandwidth: 64GB/s ~ 256GB/s
Memory Byte/Flop: 1.0 ~ 4.0
CPU
Cores: 4
Performance: 256GFlops
Memory bandwidth: 256GB/s
Byte/Flop: 1.0
Architectural Improvement
Enhancing vector instruction issue
Enhancing bypass chaining path
Shortening memory latency
Enhancing instruction reordering
Enlarging ADB capacity
Avoiding redundant loads (MSHR)
Avoiding redundant stores (store merge)
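The Byte/Flop figures quoted for the core (1.0 ~ 4.0) and for the CPU (1.0) can be reproduced from the peak numbers on the slides. The sketch below is only that arithmetic. The variable names are illustrative, and the even split of bandwidth across four active cores is an assumption used to explain the lower bound, not something the slides state.

```python
# Byte/Flop arithmetic from the slide figures (names are illustrative).

def bytes_per_flop(bandwidth_gb_s, peak_gflops):
    """Memory bandwidth in GB/s divided by peak performance in GFlops."""
    return bandwidth_gb_s / peak_gflops

CORE_GFLOPS = 64.0           # per-core peak, from the slide
CPU_CORES = 4                # cores per CPU, from the slide
CPU_BANDWIDTH_GB_S = 256.0   # CPU memory bandwidth, from the slide

cpu_gflops = CORE_GFLOPS * CPU_CORES

# Whole CPU: all cores share the full memory bandwidth.
cpu_ratio = bytes_per_flop(CPU_BANDWIDTH_GB_S, cpu_gflops)

# Single active core: it can draw the full CPU bandwidth.
core_best = bytes_per_flop(CPU_BANDWIDTH_GB_S, CORE_GFLOPS)

# Four active cores: assumes the bandwidth is split evenly (assumption).
core_worst = bytes_per_flop(CPU_BANDWIDTH_GB_S / CPU_CORES, CORE_GFLOPS)

print(cpu_gflops, cpu_ratio, core_worst, core_best)
```

Under these assumptions the per-core ratio ranges from 1.0 (all four cores busy) to 4.0 (one core using the whole 256GB/s), which matches the range on the slide.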
MSHR: Miss Status Handling Register
▪ New feature for NGV, lesson learned from SX-9
▪ Should improve performance for a lot of applications
do j = 1, n
  do i = 1, n
    d(i,j) = a(i-1,j) + a(i,j) + a(i+1,j) - a(i,j-1)
  end do
end do
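The loop above loads a(i-1,j), a(i,j) and a(i+1,j), three streams that cover almost the same memory. A toy model can illustrate why merging requests to memory blocks that are already in flight helps here. This is a sketch under stated assumptions, not a description of the NGV hardware: the block size of 16 elements, the vector length of 256, the function names, and the rule that merging only happens within one vector strip are all hypothetical. The a(i,j-1) stream is ignored because it comes from a different column.

```python
# Toy model: count memory block requests for the three overlapping loads
# a(i-1,j), a(i,j), a(i+1,j) of one column, with and without merging.
# Block size and vector length are hypothetical, not NEC figures.

def blocks_touched(start, length, block):
    """Block indices covered by elements start .. start+length-1."""
    first = start // block
    last = (start + length - 1) // block
    return set(range(first, last + 1))

def count_requests(n, vl, block, merge):
    """Requests sent to memory for i = 1..n, processed in strips of vl."""
    requests = 0
    for strip in range(1, n + 1, vl):
        length = min(vl, n + 1 - strip)
        outstanding = set()  # blocks already requested in this strip
        for offset in (-1, 0, 1):
            for b in blocks_touched(strip + offset, length, block):
                if merge and b in outstanding:
                    continue  # merged with a request already in flight
                outstanding.add(b)
                requests += 1
    return requests

N, VL, BLOCK = 256, 256, 16
without_merge = count_requests(N, VL, BLOCK, merge=False)
with_merge = count_requests(N, VL, BLOCK, merge=True)
print(without_merge, with_merge)
```

With these parameters the model issues 50 block requests without merging and 17 with merging. The ratio is a property of the toy model only and is not meant to reproduce any simulated or measured speed-up.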
MSHR: Miss Status Handling Register
[Diagram: four loads of A (LD A #1 to LD A #4) passing through Memory, ADB and ALU, shown for "No MSHR SX-9" and "With MSHR NGV", each with ADB Off and ADB On. Further labels: MSHR release; 4:4, 4:3, 4:2, 4:1]
Simulation: Factor 2.4 on the Himeno-Benchmark
NGV Node Card
CPU: 256GF, 256GB/s
Memory: 4GB x 16DIMMs, DDR3 2000MHz
Dimension labels: 37cm, 11cm
Configurations
64 nodes = 16TF, 16TB/s
4 cages = 32 modules = 64 nodes = 64 CPUs
8 modules = 16 nodes = 16 CPUs
2 nodes = 2 CPUs
1 CPU, 256GF, 256GB/s
Rack Specifications
0.75m x 1.5m x 2.0m
30kW
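The aggregate figures on the Configurations slide follow from the per-CPU numbers. The short sketch below redoes that arithmetic. The constant names are illustrative, and the hierarchy (2 nodes per module, 8 modules per cage) is read off the equalities on the slide.

```python
# Aggregating the per-CPU figures up to the 64-node configuration.
# Constant names are illustrative; values are taken from the slides.

CPU_GFLOPS = 256          # peak per CPU
CPU_BANDWIDTH_GB_S = 256  # memory bandwidth per CPU
CPUS_PER_NODE = 1         # "64 nodes = 64 CPUs"
NODES_PER_MODULE = 2      # "32 modules = 64 nodes"
MODULES_PER_CAGE = 8      # "4 cages = 32 modules"
CAGES = 4

nodes = CAGES * MODULES_PER_CAGE * NODES_PER_MODULE
cpus = nodes * CPUS_PER_NODE
total_gflops = cpus * CPU_GFLOPS
total_bandwidth_gb_s = cpus * CPU_BANDWIDTH_GB_S

print(nodes, cpus, total_gflops, total_bandwidth_gb_s)
```

The result is 16384 GFlops and 16384 GB/s for 64 nodes, which the slide quotes as 16TF and 16TB/s (exactly 16 if a factor of 1024 is used, or rounded from 16.384 with a factor of 1000).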
Sustained Memory Bandwidth
[Figure: sustained memory bandwidth [GB/s]]
A single core can exploit the whole memory bandwidth of a single CPU, 256GB/s
Real Code
[Figure: label "Exploiting # of cores"]
Power-Efficiency
Lessons learned (what a surprise!)
▌  Bandwidth matters
▌  Faster CPU → less loss of efficiency due to parallelization
  ◆  Amdahl’s law still holds! If not worse!
  ◆  There is no use in optimizing only scalability, be it in the application, system software or hardware.
▌  Consequently there is a scientific need for an HPC-targeted architecture.
  ◆  Do the business cases exist?
    •  Need for capability computing? Or rather throughput?
    •  Power consumption and TCO!
  ◆  Economic aspects on the vendors’ side!
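The point about faster CPUs and Amdahl's law can be made concrete with a small calculation. The sketch below uses the standard Amdahl formula, speedup(p) = 1 / (s + (1 - s) / p), and compares how many processes are needed to reach the same overall performance with slower and with faster cores. The serial fraction of 1%, the target of 50x and the factor of two between the cores are hypothetical values chosen for illustration; none of them comes from the slides.

```python
# Amdahl's law: fewer, faster processes reach the same performance with
# higher parallel efficiency. All numeric inputs below are hypothetical.
from fractions import Fraction

def speedup(p, serial):
    """Amdahl speedup on p processes for a given serial fraction."""
    return 1 / (serial + (1 - serial) / p)

def min_procs(target, serial):
    """Smallest process count whose Amdahl speedup reaches target."""
    if target >= 1 / serial:
        raise ValueError("target is beyond the Amdahl limit 1/serial")
    p = 1
    while speedup(p, serial) < target:
        p += 1
    return p

SERIAL = Fraction(1, 100)  # hypothetical serial fraction (1%)
TARGET = 50                # wanted performance, in units of one slow core
CORE_FACTOR = 2            # hypothetical: the fast core is twice as fast

p_slow = min_procs(TARGET, SERIAL)
p_fast = min_procs(Fraction(TARGET, CORE_FACTOR), SERIAL)

eff_slow = speedup(p_slow, SERIAL) / p_slow
eff_fast = speedup(p_fast, SERIAL) / p_fast

print(p_slow, float(eff_slow))
print(p_fast, float(eff_fast))
```

In this example the slow cores need 99 processes at about 51% parallel efficiency, while the cores that are twice as fast need 33 processes at about 76%. Exact fractions are used so that the process counts do not depend on floating-point rounding.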
Moving Forward
All information is subject to change without notice!
▌  Board of Directors: Provide a profitable business concept for a differentiated system targeted at high-performance processing, not only the traditional HPC market.
▌  Current direction
  ◆  New product line, not just a continuation of SX
  ◆  “It’s the economy …”: need to target a broader market, not only HPC → consequences even for the processor architecture
  ◆  Studies ongoing:
    •  technology, architecture, most important competition
    •  which markets, and what are the applications and the requirements?
▌  Thoughts and quite a few open questions:
  ◆  Multi-core vector, more registers
  ◆  Shared ADB? Multi-socket, or how much “SMP”? Coherency?
  ◆  Optimization for short(er) vectors (some ideas!)
  ◆  Interconnect: own one, 3rd party?
    •  Functionalities! Atomic updates, CAF, …
  ◆  Revised product lineup? From “student board” to Exa-Flops → LINUX!
What does it mean for the code-owner?
▌  Vectorization is here to stay and increasingly important
  ◆  depending on hardware vendor
  ◆  I believe the importance will grow, and so will the “SIMD- or vector-width”
▌  MPI will continue to be the leading paradigm for distributed memory, OpenMP for shared memory; the user has to master both.
▌  What about PGAS?
  ◆  Perhaps my personal problem: I have seen such things before, Intel Paragon and Cray T3D, HPF …
  ◆  Two aspects to it: ease of use and performance
  ◆  Latency? (dimensionless numbers!)
    •  PGAS can also be used for latency hiding … in my opinion the most important aspect!
▌  Well, yes, economy! A little complaint from a vendor
  ◆  A “proprietary architecture” (basically nowadays a “non-Intel architecture”) is expensive, but is “Intel-only HPC” a solution?
    •  probably not! → Nvidia!?
    •  Implications for users!
  ◆  We build the targeted product, but the customers have to give it a chance! → RFP
    •  A long time ago RFPs were very different!
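One possible reading of "Latency? (dimensionless numbers!)" is to express a latency as the number of floating-point operations a processor could have executed while waiting, that is latency multiplied by peak rate. The sketch below does this for the 256GFlops CPU from the earlier slides. The two latency values are hypothetical placeholders, not measurements, and this interpretation of the bullet is itself an assumption.

```python
# Latency as a dimensionless number: operations forgone while waiting.
# 1 ns at 1 GFlops corresponds to exactly one operation, so integer
# nanoseconds times integer GFlops gives an exact operation count.

def latency_in_flops(latency_ns, peak_gflops):
    """Peak operations that fit into one latency period."""
    return latency_ns * peak_gflops

CPU_PEAK_GFLOPS = 256  # per-CPU peak, from the slides

# Hypothetical latencies, for illustration only.
HYPOTHETICAL_LATENCIES_NS = {
    "100 ns": 100,
    "1 us": 1000,
}

flops_lost = {
    name: latency_in_flops(ns, CPU_PEAK_GFLOPS)
    for name, ns in HYPOTHETICAL_LATENCIES_NS.items()
}

for name, flops in flops_lost.items():
    print(name, flops)
```

Seen this way, a latency of 1 microsecond already corresponds to 256000 peak operations on one such CPU, which is why overlapping communication with computation (latency hiding) matters.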
Water Cooling
● NGV racks are cooled by both water and air
● Two-stage water cooling with primary and secondary coolant
[Diagram: four NGV racks, a heat exchanger and a chiller. Loop labels: Secondary coolant; Primary coolant (facility supplying chilled water). Flow and temperature labels: 20 deg C, 300L/min, 7-15 deg C, 300L/min, 21-24 deg C, 16-19 deg C]