LoHAM - Peter Heywood

Transcription

Alternative GPU-friendly assignment algorithms
Paul Richmond and Peter Heywood
Department of Computer Science
The University of Sheffield
Graphics Processing Units (GPUs)
Context: GPU Performance
[Figure: Serial computing (1 CPU core, ~40 GigaFLOPS) vs. parallel computing (16 cores, 384 GigaFLOPS) vs. accelerated computing (GPU, 4992 cores, 8.74 TeraFLOPS); 6 hours CPU time vs. 1 minute GPU time]
Accelerators
• Much of the functionality of CPUs is unused for HPC
  • Complex pipelines, branch prediction, out-of-order execution, etc.
• Ideally for HPC we want: simple, low-power and highly parallel cores
An accelerated system
[Diagram: CPU with DRAM and I/O connected via PCIe to a GPU/accelerator with GDRAM and I/O]
• Co-processor not a CPU replacement
Thinking Parallel
• Hardware considerations
  • High memory latency (PCIe)
  • Huge numbers of processing cores
• Algorithmic changes required
  • High level of parallelism required
  • Data parallel != task parallel
“If your problem is not parallel then think again”
Amdahl’s Law

$S(N) = \dfrac{1}{(1 - P) + \frac{P}{N}}$

[Chart: Speedup (S) vs. number of processors (N) for P = 25%, 50%, 90% and 95%]
• Speedup of a program is limited by the proportion that can be parallelised
• Adding processing cores gives diminishing returns
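A quick worked example with the P = 95% curve from the chart shows why the curves flatten: the serial 5% bounds the achievable speedup no matter how many processors are added.

$S(16) = \dfrac{1}{0.05 + \frac{0.95}{16}} \approx 9.1, \qquad S(\infty) = \dfrac{1}{0.05} = 20$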
SATALL Optimisation
Profile the Application
• 11 hour runtime
• Function A
  • 97.4% of runtime
  • 2000 calls
Hardware:
• Intel Core i7-4770K 3.50GHz
• 16GB DDR3
• Nvidia GeForce Titan X
[Chart: Time (s) per function for the largest network (LoHAM); Function A accounts for 97.4% of runtime, functions B–F for 0.3% or less each]
Function A
• Input
  • Network (directed weighted graph)
  • Origin-Destination matrix
• Output
  • Traffic flow per edge
• 2 distinct steps:
  1. Single Source Shortest Path (SSSP): the all-or-nothing path for each origin in the O-D matrix
  2. Flow Accumulation: apply the OD value for each trip to each link on the route
[Chart: Distribution of runtime (LoHAM) for the serial implementation, split between SSSP and Flow]
Single Source Shortest Path
For a single origin vertex (centroid), find the route to each destination vertex with the lowest cumulative weight (cost).
Serial SSSP Algorithm
• D’Esopo-Pape (1974)
  • Maintains a double-ended queue (deque) of vertices to explore
• Highly serial
  • Not a data-parallel algorithm
  • We must change the algorithm to match the hardware
Pape, U. "Implementation and efficiency of Moore-algorithms for the shortest route problem." Mathematical Programming 7.1 (1974): 212–222.
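For reference, a minimal host-side C++ sketch of a D’Esopo-Pape-style label-correcting SSSP; the adjacency-list layout and names are illustrative, not the SATALL code:

```cpp
#include <deque>
#include <limits>
#include <vector>

// Edge in an adjacency list: destination vertex and weight (cost).
struct Edge { int to; float w; };

// D'Esopo-Pape label-correcting SSSP from a single origin.
// adj[v] lists the outgoing edges of vertex v.
std::vector<float> desopo_pape(const std::vector<std::vector<Edge>>& adj, int origin) {
    const float INF = std::numeric_limits<float>::infinity();
    std::vector<float> dist(adj.size(), INF);
    std::vector<int> state(adj.size(), 0);  // 0 = never queued, 1 = queued, 2 = previously queued
    std::deque<int> q;

    dist[origin] = 0.0f;
    q.push_back(origin);
    state[origin] = 1;

    while (!q.empty()) {
        int u = q.front(); q.pop_front();
        state[u] = 2;
        for (const Edge& e : adj[u]) {
            if (dist[u] + e.w < dist[e.to]) {
                dist[e.to] = dist[u] + e.w;
                if (state[e.to] == 1) continue;           // already waiting in the queue
                if (state[e.to] == 2) q.push_front(e.to); // seen before: re-examine soon
                else                  q.push_back(e.to);  // new vertex: examine later
                state[e.to] = 1;
            }
        }
    }
    return dist;
}
```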
Parallel SSSP Algorithm
• Bellman-Ford algorithm (1956)
• Poor serial performance & time complexity
  • Performs significantly more work
• Highly parallel
  • Suitable for GPU acceleration
[Chart: Total edges considered, Bellman-Ford vs. D’Esopo-Pape]
Bellman, Richard. On a routing problem. 1956.
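To illustrate why Bellman-Ford maps well to the GPU, here is a minimal CUDA sketch of one edge-relaxation pass with one thread per edge; the edge arrays (src, dst, weight), dist and the changed flag are illustrative names, not the SATALL implementation:

```cuda
// Atomic min for non-negative floats: for values >= 0 the IEEE-754 bit pattern,
// read as a signed int, preserves float ordering, so the integer atomicMin works.
__device__ void atomicMinPositiveFloat(float* addr, float value) {
    atomicMin(reinterpret_cast<int*>(addr), __float_as_int(value));
}

// One Bellman-Ford relaxation pass: one thread per directed edge.
// src[e], dst[e] and weight[e] describe edge e; dist[] holds tentative costs
// from the current origin; *changed tells the host whether another pass is needed.
__global__ void relax_edges(const int* src, const int* dst, const float* weight,
                            float* dist, int num_edges, int* changed) {
    int e = blockIdx.x * blockDim.x + threadIdx.x;
    if (e >= num_edges) return;

    float candidate = dist[src[e]] + weight[e];
    if (candidate < dist[dst[e]]) {
        atomicMinPositiveFloat(&dist[dst[e]], candidate);
        *changed = 1;  // at least one distance improved this pass
    }
}
```

The host launches this kernel repeatedly (at most |V|−1 times) until no distance changes, which is exactly the flag that the A3 early-termination optimisation later exploits.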
Implementation
A2 – Naïve Bellman-Ford using CUDA
• Up to 369x slower
• Striped bars continue off the scale
  • Derby: 36.5s, CLoHAM: 1777.2s, LoHAM: 5712.6s
[Chart: Time (seconds) to compute all required SSSP results per model (Derby, CLoHAM, LoHAM) for Serial and A2]
Implementation
Followed an iterative cycle of performance optimisations:
• A3 – Early Termination (host-side loop sketched below)
• A4 – Node Frontier
• A8 – Multiple Origins Concurrently
  • SSSP for each origin in the OD matrix
  • Cooperative Thread Array
• A10 – Improved Load Balancing
• A11 – Improved Array Access
[Chart: Time (seconds) to compute all required SSSP results per model (Derby, CLoHAM, LoHAM) for Serial, A2, A3, A4, A8, A10 and A11]
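A3 relies on the host stopping the relaxation loop as soon as an iteration makes no updates. A minimal sketch of that host-side loop, reusing the hypothetical relax_edges kernel and changed flag from the earlier Bellman-Ford sketch:

```cuda
// Host-side driver for A3-style early termination (illustrative only).
// d_* pointers are device allocations; num_vertices bounds the iteration count.
void sssp_with_early_termination(const int* d_src, const int* d_dst,
                                 const float* d_weight, float* d_dist,
                                 int num_edges, int num_vertices, int* d_changed) {
    const int threads = 256;
    const int blocks = (num_edges + threads - 1) / threads;

    for (int iter = 0; iter < num_vertices - 1; ++iter) {
        int h_changed = 0;
        cudaMemcpy(d_changed, &h_changed, sizeof(int), cudaMemcpyHostToDevice);

        relax_edges<<<blocks, threads>>>(d_src, d_dst, d_weight,
                                         d_dist, num_edges, d_changed);

        cudaMemcpy(&h_changed, d_changed, sizeof(int), cudaMemcpyDeviceToHost);
        if (!h_changed) break;  // no distance improved: shortest paths have converged
    }
}
```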
Limiting Factor (Function A)
[Charts: Distribution of runtime for CLoHAM and LoHAM, Serial implementation, split between SSSP and Flow]
Limiting Factor (Function A)
• The limiting factor has now changed
• Need to parallelise Flow Accumulation
[Charts: Distribution of runtime for CLoHAM and LoHAM, Serial vs. A11, split between SSSP and Flow]
Flow Accumulation
• Shortest path + OD = flow-per-link
  • For each origin-destination pair, trace the route from the destination to the origin, increasing the flow value for each link visited
• Parallel problem
  • But requires synchronised access to a shared data structure for all trips (atomic operations; sketched after the tables below)

Example O-D matrix (trips):
            Dest 0   Dest 1   Dest 2   …
Origin 0       0        1        2     …
Origin 1       2        0        5     …
Origin 2       3        4        0     …
…              …        …        …     …

Flow per link:
Link   0   1   2   3   …
Flow   9   8   6   9   …
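A minimal CUDA sketch of the naive parallel flow accumulation, with one thread per O-D trip walking its route back to the origin; the array names and layout are illustrative assumptions, not the SATALL data structures:

```cuda
// Naive parallel flow accumulation: one thread per origin-destination trip.
// pred_link[origin * num_vertices + v] gives the link used to reach vertex v on
// that origin's shortest-path tree; link_from[l] is the upstream vertex of link l;
// demand[t] is the O-D trip value for trip t.
__global__ void accumulate_flow(const int* pred_link, const int* link_from,
                                const int* trip_origin, const int* trip_dest,
                                const float* demand, float* link_flow,
                                int num_trips, int num_vertices) {
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t >= num_trips) return;

    int origin = trip_origin[t];
    int v = trip_dest[t];
    // Walk from the destination back to the origin, adding this trip's demand to
    // every link on the route. Many trips share links, so the adds must be atomic.
    while (v != origin) {
        int link = pred_link[origin * num_vertices + v];
        atomicAdd(&link_flow[link], demand[t]);  // contended: serialises when trips share links
        v = link_from[link];
    }
}
```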
Flow Accumulation
Problem:
• A12 – lots of atomic operations serialise the execution
[Chart: Time (seconds) to compute flow values per link, Serial vs. A12, for Derby, CLoHAM and LoHAM]
Flow Accumulation
Problem:
• A12 – lots of atomic operations serialise the execution
Solutions:
• A15 – Reduce the number of atomic operations
  • Solve in batches using parallel reduction
• A14 – Use fast hardware-supported single-precision atomics
  • Minimise loss of precision using multiple 32-bit summations (sketched below)
  • 0.000022% total error
[Chart: Time (seconds) to compute flow values per link for Serial, A12, A15 (double) and A14 (single), for Derby, CLoHAM and LoHAM]
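One way A14's "multiple 32-bit summations" idea can be realised (a hedged sketch of the general technique, not the SATALL code): accumulate each batch of trips into a single-precision buffer with the fast hardware atomics, then fold the batch totals into a double-precision array so rounding error cannot grow across batches. The accumulate_flow sketch above would write its float atomicAdds into batch_flow between flushes.

```cuda
// Fold a batch of single-precision partial flows into double-precision totals
// and reset the partials for the next batch. Keeping each float accumulation
// short (one batch of trips) limits the rounding error of the 32-bit atomics.
__global__ void flush_batch(float* batch_flow, double* total_flow, int num_links) {
    int l = blockIdx.x * blockDim.x + threadIdx.x;
    if (l >= num_links) return;

    total_flow[l] += static_cast<double>(batch_flow[l]);  // exactly one thread per link: no atomics
    batch_flow[l] = 0.0f;                                 // reset for the next batch of trips
}
```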
Integrated Results

Relative Assignment Runtime Performance vs Serial
• Serial: LoHAM – 12h 12m
• Double precision (A15): LoHAM – 35m 22s
• Single precision (A14): LoHAM – 17m 32s
  • Reduced loss of precision
Hardware:
• Intel Core i7-4770K
• 16GB DDR3
• Nvidia GeForce Titan X
[Chart: Assignment speedup relative to serial for Derby, CLoHAM and LoHAM across Serial, Multicore, A15 (double) and A14 (single); LoHAM reaches 1.00 (Serial), 6.81 (Multicore), 20.70 (A15) and 41.77 (A14)]
Relative Assignment Runtime Performance vs Multicore
• Multicore: LoHAM – 1h 47m
• Double precision (A15): LoHAM – 35m 22s
• Single precision (A14): LoHAM – 17m 32s
  • Reduced loss of precision
Hardware:
• Intel Core i7-4770K
• 16GB DDR3
• Nvidia GeForce Titan X
[Chart: Assignment speedup relative to multicore for Derby, CLoHAM and LoHAM across Serial, Multicore, A15 (double) and A14 (single); LoHAM reaches 0.15 (Serial), 1.00 (Multicore), 3.04 (A15) and 6.14 (A14)]
GPU Computing at UoS
Expertise at Sheffield
• Specialists in GPU computing and performance optimisation
• Complex systems simulations via FLAME and FLAME GPU
• Visual simulation, computer graphics and virtual reality
• Training and education for GPU computing
Thank You

Largest Model (LoHAM) results
Version                  Runtime    Speedup vs Serial   Speedup vs Multicore
Serial                   12:12:24   1.00                0.15
Multicore                01:47:36   6.81                1.00
A15 (double precision)   00:35:22   20.70               3.04
A14 (single precision)   00:17:32   41.77               6.14

Paul Richmond
[email protected]
paulrichmond.shef.ac.uk

Peter Heywood
[email protected]
ptheywood.uk
Backup Slides

Benchmark Models
• 3 benchmark networks
• Range of sizes
  • Small to very large
• Up to 12 hour runtime

Model    Vertices (Nodes)   Edges (Links)   O-D trips
Derby    2700               25385           547²
CLoHAM   15179              132600          2548²
LoHAM    18427              192711          5194²

[Chart: Benchmark model performance, Time (s) for the Serial Salford and Multicore versions on Derby, CLoHAM and LoHAM]
Edges considered per algorithm
[Charts: Total edges considered and edges considered per iteration, Bellman-Ford vs. D’Esopo-Pape]
Vertex Frontier (A4)
• Only vertices which were updated in the previous iteration can result in an update
• Far fewer threads launched per iteration (a frontier-based pass is sketched below)
  • Up to 2500 instead of 18427 per iteration
[Chart: Frontier size per iteration for source 1, for Derby, CLoHAM and LoHAM]
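A minimal CUDA sketch of a frontier-based relaxation pass in the spirit of A4, vertex-centric rather than edge-centric; the CSR arrays and names are illustrative assumptions. Only vertices updated in the previous pass expand their outgoing edges, and the vertices they update form the next frontier.

```cuda
// Frontier-based relaxation: one thread per vertex in the current frontier.
// row_offset/col_index/weight form a CSR adjacency list; next_in_frontier is a
// per-vertex flag (0/1) marking the frontier of the following pass.
__global__ void relax_frontier(const int* frontier, int frontier_size,
                               const int* row_offset, const int* col_index,
                               const float* weight, float* dist,
                               int* next_in_frontier) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= frontier_size) return;

    int u = frontier[i];
    for (int e = row_offset[u]; e < row_offset[u + 1]; ++e) {
        int v = col_index[e];
        float candidate = dist[u] + weight[e];
        if (candidate < dist[v]) {
            atomicMinPositiveFloat(&dist[v], candidate);  // helper from the earlier sketch
            next_in_frontier[v] = 1;  // v may improve its neighbours next pass
        }
    }
    // The host (or a separate compaction kernel) turns next_in_frontier into the
    // compact frontier array for the next iteration.
}
```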
Multiple Concurrent Origins (A8)
[Charts: Frontier size per iteration for source 1 vs. frontier size for all concurrent sources, for Derby, CLoHAM and LoHAM]
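A8 processes many origins' SSSP problems at once so the GPU always has enough work. A hedged sketch of one way to express that, with the origin index as the second grid dimension and per-origin distance arrays laid out contiguously (again illustrative, not the SATALL kernels):

```cuda
// Edge relaxation for many origins concurrently: blockIdx.y selects the origin,
// so one launch covers (num_edges x num_origins) relaxations.
// dist is laid out as dist[origin * num_vertices + vertex].
__global__ void relax_edges_multi_origin(const int* src, const int* dst,
                                         const float* weight, float* dist,
                                         int num_edges, int num_vertices,
                                         int* changed) {
    int e = blockIdx.x * blockDim.x + threadIdx.x;
    int origin = blockIdx.y;
    if (e >= num_edges) return;

    float* d = dist + origin * num_vertices;  // this origin's distance array
    float candidate = d[src[e]] + weight[e];
    if (candidate < d[dst[e]]) {
        atomicMinPositiveFloat(&d[dst[e]], candidate);
        changed[origin] = 1;  // this origin still needs another pass
    }
}
```

A launch such as relax_edges_multi_origin<<<dim3(edge_blocks, num_origins), 256>>>(...) then covers every origin in one kernel call.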
Atomic Contention
• Atomic operations are guaranteed to occur
  • Multiple threads atomically modify the same address
  • Serialised!
• atomicAdd(double) not implemented in hardware
  • Not yet
• Solutions:
  1. Algorithmic change to minimise atomic contention
  2. Single precision
[Chart: Atomic contention per iteration for CLoHAM and LoHAM, iterations 0–100]
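For context, on GPUs of that generation a double-precision atomicAdd has to be emulated in software with a compare-and-swap loop (the well-known pattern from the CUDA programming guide), which is why the hardware single-precision atomics used by A14 are so much faster under contention:

```cuda
// Software double-precision atomicAdd via a compare-and-swap loop, needed on
// GPUs without native 64-bit floating-point atomics (compute capability < 6.0).
__device__ double atomicAddDouble(double* address, double val) {
    unsigned long long* address_as_ull = reinterpret_cast<unsigned long long*>(address);
    unsigned long long old = *address_as_ull, assumed;
    do {
        assumed = old;
        old = atomicCAS(address_as_ull, assumed,
                        __double_as_longlong(val + __longlong_as_double(assumed)));
        // Retry while another thread changed the value between our read and the CAS.
    } while (assumed != old);
    return __longlong_as_double(old);
}
```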
Assignment runtime per algorithm
Raw Performance
Hardware:
• Intel Core i7-4770K
• 16GB DDR3
• Nvidia GeForce Titan X
[Chart: Assignment runtime in seconds for Derby, CLoHAM and LoHAM across Serial, Multicore and the A8, A10, A11, A13 (single), A15 and A14 (single) versions]