Using dataflow as a high performance computing engine
Michael J. Flynn, Maxeler Technologies and Stanford University

Outline
• History
• Dataflow as a supercomputer technology
• Generalizing the dataflow programming model
• Optimizing the hardware for dataflow
• OpenSPL (open spatial programming language)

The great parallel processor debate of 1967
• Amdahl espouses the sequential machine and posits Amdahl's law
• Daniel Slotnick (father of ILLIAC IV) posits the parallel approach while recognizing the problem of programming

Slotnick's law
“The parallel approach to computing does require that some original thinking be done about numerical analysis and data management in order to secure efficient use. In an environment which has represented the absence of the need to think as the highest virtue this is a decided disadvantage.” - Daniel Slotnick (1967)

…Speedup in parallel processing is achieved by programming effort…

Dataflow as a supercomputer technology
• Looking for another way: some experience from Maxeler
• Dataflow conceptually creates an ideal machine to match the program, but the interconnect problem had been a limitation for large programs.
• Today's FPGAs have come a long way and enable an emulation of the dataflow machine.

Hardware and Software Alternatives
• Hardware: a reconfigurable heterogeneous dataflow array model
• Software: a spatial (2D) dataflow programming model complements a sequential model

Large compute intensive model
• Assumes host CPU + dataflow card
• Many compute applications consist of two parts:
  – Essential (high usage, >99%) part (kernel(s))
  – Bulk part (<1% dynamic activity)
• Essential part is executed on the dataflow card; bulk part on the host (sketched below)
• So Slotnick's law of effort now only applies to a small portion of the application
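As a rough illustration of this split, a minimal hypothetical Java sketch (not the actual Maxeler host API; the DataflowEngine interface, runKernel and the stub methods are illustrative names):

public class HostAndKernelSplit {

    // Stand-in for the binding a toolchain would generate for the DFE kernel.
    interface DataflowEngine {
        float[] runKernel(float[] inputStream);
    }

    public static void main(String[] args) {
        DataflowEngine dfe = loadEngine();       // hypothetical loader for the dataflow card
        float[] data = readInput("traces.dat");  // bulk part: ordinary host I/O and setup

        // Essential part (>99% of dynamic activity): streamed through the dataflow card.
        float[] result = dfe.runKernel(data);

        writeOutput(result, "results.dat");      // bulk part: host-side post-processing
    }

    // Host-side stubs standing in for the <1% bulk code.
    static DataflowEngine loadEngine() { return in -> in.clone(); }
    static float[] readInput(String path) { return new float[1024]; }
    static void writeOutput(float[] data, String path) { /* write results */ }
}
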
Automating programming for maximum speedup
• Emulate the production line in program execution.
• Prefer the most restrictive parallel model: static dataflow.
• Use 2D spatial programming to implement the dataflow.
• Support multicore for control flow, dataflow for compute-intensive kernels.

Typical Maxeler applications
Dataflow hardware model: server with dataflow engine cards

Computing with Dataflow Engines (DFEs)
Dataflow computing: ultra-long pipelines (order of 1000s of stages) generate one result every clock cycle => optimal performance per Watt and per m³.
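As a back-of-the-envelope reading of that claim (a worked example; the 200 MHz clock rate is an assumed figure, not from the slides), a pipeline of depth D clocked at f produces one result per cycle once it is full:

\[
\text{throughput} \approx f_{\mathrm{clk}}\ \text{results/s}, \qquad
\text{fill latency} \approx \frac{D}{f_{\mathrm{clk}}}
\]
\[
D = 1000,\quad f_{\mathrm{clk}} = 200\ \mathrm{MHz}
\;\Rightarrow\;
2\times 10^{8}\ \text{results/s after a } 5\ \mu\mathrm{s}\ \text{fill}.
\]
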
Static, Synchronous, Streaming DFMs
• Create a static DFM (unroll loops, etc.); generally the goal is throughput, not latency.
• Create a fully synchronous DFM synchronized to multiple memory channels. The time through the DFM is always the same.
• Stream computations across the long DFM array, creating MISD or pipelined parallelism.
• If silicon area and pin BW allow, create multiple copies of the DFM (as with SIMD or vector computations); see the sketch below.
• Iterate on the DFM aspect ratio to optimize speedup.

Acceleration with Static, Synchronous, Streaming DFEs
• Create a fully synchronous data flow machine synchronized to multiple memory channels, then stream computations across a long array
• A DFE (Engine) card is the Dataflow Machine plus memory and control

[Figure: data streams from node memory into the Dataflow Machine, through Computation #1 and Computation #2 with buffers for intermediate results; results are returned to memory]
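A minimal sketch of the "multiple copies of the DFM" point above, written in the SCSVar notation used by the kernel examples later in this deck (the x0/x1 stream names and the particular pipeline body are illustrative assumptions, not a statement about how the tools express replication):

// Two identical pipelines laid out side by side, so two stream elements are
// consumed and two results produced on every clock tick.
SCSVar x0 = io.input("x0", scsInt(32));
SCSVar x1 = io.input("x1", scsInt(32));

SCSVar y0 = x0 * x0 + 30;   // copy #1 of the pipeline
SCSVar y1 = x1 * x1 + 30;   // copy #2, identical hardware in parallel

io.output("y0", y0, scsInt(32));
io.output("y1", y1, scsInt(32));
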
Software
• Profiler
• Dataflow compiler: graphic interface with time and area annotations
• Manager compiler: manages the access to data streams
• OS: provides run-time coordination, device drivers
• Simulator: cycle accurate
• Vendor-provided place and route

Dataflow: x² + 30
SCSVar x = io.input("x", scsInt(32));
SCSVar result = x * x + 30;
io.output("y", result, scsInt(32));

Dataflow: Moving Average
Yn = (Xn-1 + Xn + Xn+1) / 3

SCSVar x = io.input("x", scsFloat(7,17));
SCSVar prev = stream.offset(x, -1);
SCSVar next = stream.offset(x, 1);
SCSVar sum = prev + x + next;
SCSVar result = sum / 3;
io.output("y", result, scsFloat(7,17));
Dataflow: Choices
SCSVar x = io.input("x", scsUInt(24));
SCSVar result = (x > 10) ? x + 1 : x - 1;
io.output("y", result, scsUInt(24));

Data flow graph as generated by the compiler: 4866 nodes, about 250 x 100. Each node represents a line of Java code with area and time parameters, so that the designer can change the aspect ratio to improve pin BW, area usage and speedup.

Dataflow Programming
• MaxCompiler – Java-driven dataflow compiler
• MaxIDE – graphical development environment
• MaxCompilerSim – seamless simulation environment

MPC-C500 for compute intensive apps
• 1U form factor
• 4x dataflow engines
• 12 Intel Xeon cores
• 192GB DFE RAM
• 192GB CPU RAM
• PCIe Gen2 x8
• MaxRing interconnect
• 3x 3.5" hard drives
• Infiniband

MPC-X1000
• 8 dataflow engines (192-384GB RAM)
• High-speed MaxRing
• Zero-copy RDMA between CPUs and DFEs over Infiniband
• Dynamic CPU/DFE balancing

E.g. compute intensive: Seismic Processing
• For Oil & Gas exploration: distribute a grid of sensors over a large area
• Sonic impulse the area and record reflections: frequency, amplitude, delay at each sensor
• Sea based surveys use 30,000 sensors to record data (120 dB range), each sampled at more than 2kbps, with a new sonic impulse every 10 seconds
• Order of terabytes of data each day!
[Figure: survey/model grid, 1200m on a side; generates >1GB every 10s]

Modelling Results
• Up to 240x speedup for 1 MAX2 card compared to a single CPU core
• Speedup increases with cube size
• 1 billion point modelling domain using a single FPGA card

Achieved computational speedup for the entire application (not just the kernel) compared to an Intel server
• RTM with Chevron: VTI 19x and TTI 25x
• Credit 32x and Rates 26x
• Sparse Matrix 20-40x
• Lattice Boltzmann Fluid Flow 30x
• Seismic Trace Processing 24x
• Conjugate Gradient Opt 26x

CFD Performance vs GPU
• For this 2D linear advection test problem we achieve ca. 450M degree-of-freedom updates per second
• For comparison, a GPU implementation (of a Navier-Stokes solver) achieves ca. 50M DOFs/s
• Max3A workstation with Xilinx Virtex-6 475T + 4-core i7

Networking and low latency apps
Maxeler MPC-N Series
• Rack-mounted CPU servers with DFE boards
• High-end Intel Xeon based server platform
• 1U – up to 240Gb/s total Ethernet bandwidth across 2 DFE boards
• 2U – up to 640Gb/s total Ethernet bandwidth across 4 DFE boards
(*) expected Q1 2015

Juniper MPC-Switch
• TOR switch with DFE acceleration module
• Switch management functions provided by Juniper JUNOS
• DFE application runs in a VM with MaxelerOS
• Total 320Gb/s DFE bandwidth
  – 160Gb/s DFE <-> Switch PFE
  – 160Gb/s DFE <-> direct QSFP

The appliance model
Finance market: High Speed Analytics
[Diagram: customer application using an N:1 Quote/Spreader API; quoting, market data handling, order routing, pricing and hedging run on the DFE fast-path (10GE), alongside the CPU, connected to the trading venue and market data sources]
• Ultra-fast quote updates based on hedge market movement
• Ultra-fast hedging of quote fills
• Configurable pricing, quoting and hedging strategy
• Interfaces directly with existing software application

High Frequency Trading
[Diagram: customer application using a strategy and market connectivity API; market data handling, the user-defined algorithm and the order router run on the DFE fast-path (10GE), alongside the CPU, connected to the trading venue and market data sources]
• Customer strategy running on DFE
• Rich API for market connectivity and strategy
• Ultra-low-latency access to market

Latency detail
[Diagram: Phys. Port -> 10GE MAC -> TCP/IP -> Kernel -> TCP/IP -> 10GE MAC -> Phys. Port]
• Ultra-low latency
• Ultra-low jitter
Measurement details:
• Dual passive-optical fiber taps, 50/50 – 50um
• Timestamping card on RX and TX fibers
• 50B payload in standard TCP/IP packets
• Timestamps measured after last bit of frame received
• TCP, IP and Ethernet checksums calculated and checked
• Measurements valid up to line rate

Option trading
So for HPC, how can dataflow emulation with FPGAs be better than multi core?
• FPGAs emulate the ideal data flow machine
• Success comes from their flexibility in matching the DFG with a synchronous DFM, streaming data through it, and their sheer size (>1 million cells)
• Effort and support tools provide significant application speedup
• With a really effective dedicated dataflow chip there is probably 2-3 orders of magnitude improvement possible in Area x Time x Power.

Hardware: FPGA Pros & Cons
• The FPGA, while quite suitable for emulation, is not an ideal hardware substrate
  – Too fine grain, wasted area
  – Expensive
  – Place and route times are excessive and will get longer
  – Slow cycle time
• FPGA advantages
  – Commodity part with the best process technology
  – Flexible interconnect
  – Transistor density scaling
  – In principle, possible to quickly reduce to an ASIC

OpenSPL Foundation www.openspl.org
The programming model is available to all as OpenSPL.

OpenSPL: the basics
• Control and data flows are decoupled
  – Both are fully programmable
• Operations exist in space and by default run in parallel
  – Their number is limited only by the available space
• All operations can be customized at various levels
  – e.g., from algorithm down to the number representation
• Multiple operations constitute kernels (see the sketch below)
• Data streams through the operations / kernels
• The data transport and processing can be balanced
• All resources work all of the time for max performance
• The in/out data rates determine the operating frequency
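A small sketch pulling these points together, reusing only the constructs already shown in the kernel examples above (io.input/io.output, stream.offset, SCSVar arithmetic and the ternary choice); the stream name and number formats are illustrative:

SCSVar x      = io.input("x", scsFloat(7,17));   // data streams through the kernel
SCSVar prev   = stream.offset(x, -1);            // operations exist in space...
SCSVar next   = stream.offset(x, 1);
SCSVar smooth = (prev + x + next) / 3;           // ...and run in parallel by default
SCSVar result = (smooth > 10) ? smooth : x;      // data-dependent choice, as in the earlier example
io.output("y", result, scsFloat(7,17));          // customized number representation on the output
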
Generalizing the programming model
OpenSPL http://www.openspl.org/
• Open spatial programming language: an orderly way to expose parallelism
• 2D dataflow is the programmer's model, Java the syntax
• Could target hardware implementations beyond the dataflow engine:
  – map onto CPUs (e.g. using OpenMP/MPI); see the loop sketch below
  – GPUs
  – other accelerators
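For contrast, the moving-average kernel shown earlier reads as an ordinary sequential loop when mapped onto a CPU (plain Java here; an OpenMP/MPI mapping would distribute this loop across cores or nodes). This is an illustrative hand translation, not the output of any Maxeler tool:

static float[] movingAverage(float[] x) {
    float[] y = new float[x.length];
    // One loop iteration does in time what the dataflow pipeline
    // does in space for every stream element as it passes.
    for (int n = 1; n < x.length - 1; n++) {
        y[n] = (x[n - 1] + x[n] + x[n + 1]) / 3.0f;
    }
    return y;
}
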
Speaking of research: Maxeler UP
• The university program has over 100 university members
• Membership is free
• Hardware can be bought at cost; software is free
• Possible to access via simulator and cloud
• Shared research among the members

Conclusions 1
• Parallel processing demands rethinking algorithms, programming approach, environment and hardware.
• The success of dataflow points to the weakness of evolutionary approaches to parallel processing: hardware (multi core) and software (C++, etc.), at least for some applications.
• The evolution of dataflow is still early on; still required: tools, analysis methodology and a new hardware basis.
• FPGA as a basis for dataflow has important limitations: hardware AxTxP, inefficient place and route, SW.

Conclusions 2
• In parallel processing: to find success, start with the problem, not the solution.
• There is a lot of research ahead to effectively create dataflow-based parallel translation technology.