# DAG EXECUTION MODEL, WORK AND DEPTH

## Transcription

_Slide deck dated 6/16/2010._

### Computational Complexity of (Sequential) Algorithms

- Model: each step takes unit time.
- Determine the time (and space) required by the algorithm as a function of input size.

### Sequential Sorting Example

- Given an array of size n:
  - MergeSort takes O(n log n) time.
  - BubbleSort takes O(n²) time.
- But a BubbleSort implementation can sometimes be faster than a MergeSort implementation. Why?
- The model is still useful:
  - It indicates the scalability of the algorithm for large inputs.
  - It lets us prove things like: any comparison-based sorting algorithm requires Ω(n log n) comparisons.

### We Need a Similar Model for Parallel Algorithms

### Sequential Merge Sort

16 MB input (32-bit integers). The timeline shows the phases Recurse(left), Recurse(right), Merge to scratch array, and Copy back to input array, executed one after another.

*[Figure: timeline of the sequential merge-sort execution.]*

### Parallel Merge Sort (as a Parallel Directed Acyclic Graph)

Same 16 MB input and the same phases, but Recurse(left) and Recurse(right) are independent and can run concurrently.

*[Figures: parallel merge-sort DAGs for 2-core, 4-core, and 8-core execution; the 2-core version is two sequential sorts followed by a merge.]*

### The DAG Execution Model of a Parallel Computation

- Given an input, dynamically create a DAG.
- Nodes represent sequential computation, weighted by the amount of work.
- Edges represent dependencies: an edge Node A → Node B means that B cannot be scheduled unless A is finished.

### Sorting 16 Elements on Four Cores

Four 4-element arrays are sorted in constant time, then merged pairwise. The node weights in the figure are 1, 8, 1, 1, 1, 16, 1, 1, 8, 1: four unit-cost sorts, two 8-element merges, one 16-element merge, and unit-cost copy-back steps.

*[Figure: the sorting DAG with the weights above.]*

### Performance Measures: Work and Depth

- T_P: the execution time on P processors.
- T_1 (work): the time to run sequentially, i.e. the total weight of all nodes.
- T_∞ (depth): the critical path length, i.e. the weight of the longest path through the DAG — the sequential bottleneck.

For the 16-element example: Work = 1 + 8 + 1 + 1 + 1 + 16 + 1 + 1 + 8 + 1 = 39; Depth = 1 + 8 + 1 + 16 + 1 = 27, following the path sort → merge → copy → merge → copy.

### Some Useful Theorems

### Work Law

- T_P ≥ T_1 / P: "you cannot avoid work by parallelizing." P cores can execute at most P units of work per time step, so T_1 units of work need at least T_1 / P steps.

### Depth Law

- T_P ≥ T_∞: the nodes on the critical path depend on one another, so no number of cores can run them concurrently.

### Amount of Parallelism

- Parallelism = T_1 / T_∞.

### Maximum Speedup Possible

- Speedup = T_1 / T_P ≤ T_1 / T_∞ = Parallelism: "speedup is bounded above by available parallelism."

### Greedy Scheduler

- If more than P nodes can be scheduled, pick any subset of size P.
- If fewer than P nodes can be scheduled, schedule them all.

### Performance of the Greedy Scheduler

- A greedy scheduler achieves T_P ≤ T_1 / P + T_∞: every time step is either a complete step (all P cores busy, working off the T_1 units of work) or an incomplete step (every ready node scheduled, shortening the critical path).

### Greedy Is Optimal Within a Factor of 2

- By the work and depth laws, any scheduler needs T_P ≥ max(T_1 / P, T_∞). Greedy achieves T_1 / P + T_∞ ≤ 2 · max(T_1 / P, T_∞).

### Work/Depth of Merge Sort (Sequential Merge)

- Work: W(n) = 2 W(n/2) + Θ(n) = Θ(n log n).
- Depth: D(n) = D(n/2) + Θ(n) = Θ(n), because the top-level sequential merge alone takes Θ(n).
- Parallelism = W/D = Θ(log n).

### Main Message

- Analyze the Work and Depth of your algorithm.
- Parallelism is Work/Depth.
- Try to decrease Depth:
  - the critical path,
  - a *sequential* bottleneck.
- If you increase Depth, you had better increase Work by a lot more!

### Amdahl's Law

- Sorting takes 70% of the execution time of a sequential program.
- You replace the sorting algorithm with one that scales perfectly on multi-core hardware.
- How many cores do you need to get a 4× speedup on the program?

With f = the parallel portion of execution, 1 − f = the sequential portion of execution, and c = the number of cores used:

Speedup = 1 / ((1 − f) + f / c)

*[Figure: speedup vs. #cores (1–16). The "desired 4× speedup" line sits above the "speedup achieved (perfect scaling on 70%)" curve everywhere.]*

*[Figure: the same chart annotated with the asymptote: the limit as c → ∞ is 1/(1 − f) = 3.33, so a 4× speedup is unreachable.]*

*[Figure: speedup vs. #cores for a mostly sequential program; even with perfect scaling of the parallel part, the Amdahl's-law limit is just 1.11×.]*

*[Figure: speedup vs. #cores for up to 127 cores.]*

### Lesson

- Speedup is limited by sequential code.
- Even a small percentage of sequential code can greatly limit potential speedup.

### Gustafson's Law

Any sufficiently large problem can be parallelized effectively. With f = the parallel portion of execution, 1 − f = the sequential portion of execution, and c = the number of cores used, the scaled speedup is

Scaled speedup = (1 − f) + f · c
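The fork-join shape of the parallel merge-sort DAG described in the slides can be sketched in a few lines. This is an illustrative sketch, not the lecture's code: it uses Python threads, which show the DAG structure (Recurse(left) and Recurse(right) as independent nodes, merge depending on both) but, because of CPython's GIL, not a real speedup; the cutoff value is an arbitrary choice.

```python
from concurrent.futures import ThreadPoolExecutor

def merge(left, right):
    # Sequential merge of two sorted lists into one sorted list.
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            out.append(left[i]); i += 1
        else:
            out.append(right[j]); j += 1
    out.extend(left[i:]); out.extend(right[j:])
    return out

def parallel_merge_sort(a, pool, cutoff=4):
    # Below the cutoff, sort sequentially: these are the DAG's leaf nodes.
    if len(a) <= cutoff:
        return sorted(a)
    mid = len(a) // 2
    # Recurse(left) runs as a separate task while this thread computes
    # Recurse(right); the merge node depends on both finishing.
    left = pool.submit(parallel_merge_sort, a[:mid], pool, cutoff)
    right = parallel_merge_sort(a[mid:], pool, cutoff)
    return merge(left.result(), right)

with ThreadPoolExecutor(max_workers=4) as pool:
    print(parallel_merge_sort([5, 3, 8, 1, 9, 2, 7, 4, 6, 0], pool))
```

Note that submitting recursive tasks to a bounded pool can deadlock if every worker blocks on a child task; keeping one half of each split on the submitting thread, as above, limits the number of blocked workers to the recursion depth.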
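Work and depth can be computed mechanically from any weighted DAG: work is the sum of the node weights, depth is the weight of the heaviest path. The DAG below is one plausible reading of the 16-element sorting example's weights (1, 8, 1, 1, 1, 16, 1, 1, 8, 1); the node names and the exact wiring are assumptions, not the slide's labels.

```python
# Hypothetical reading of the 16-element sorting DAG: four constant-time
# sorts of 4-element chunks, two 8-element merges with unit-cost
# copy-backs, then a final 16-element merge and copy-back.
weights = {
    "sort0": 1, "sort1": 1, "sort2": 1, "sort3": 1,
    "merge01": 8, "merge23": 8, "copy01": 1, "copy23": 1,
    "merge_all": 16, "copy_all": 1,
}
deps = {
    "sort0": [], "sort1": [], "sort2": [], "sort3": [],
    "merge01": ["sort0", "sort1"], "merge23": ["sort2", "sort3"],
    "copy01": ["merge01"], "copy23": ["merge23"],
    "merge_all": ["copy01", "copy23"], "copy_all": ["merge_all"],
}

def work(weights):
    # T_1: total weight of all nodes.
    return sum(weights.values())

def depth(weights, deps):
    # T_inf: weight of the heaviest path (critical path), via memoized DFS.
    memo = {}
    def longest(v):
        if v not in memo:
            memo[v] = weights[v] + max((longest(u) for u in deps[v]), default=0)
        return memo[v]
    return max(longest(v) for v in weights)

print(work(weights), depth(weights, deps))  # -> 39 27
```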
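The greedy-scheduler bound T_P ≤ T_1/P + T_∞, together with the work and depth laws, can be checked by simulating a list scheduler on the same hypothetical DAG (again, the node names and wiring are my reading of the slide, not its labels).

```python
import heapq

# Hypothetical 16-element sorting DAG (assumed weights and wiring).
weights = {
    "sort0": 1, "sort1": 1, "sort2": 1, "sort3": 1,
    "merge01": 8, "merge23": 8, "copy01": 1, "copy23": 1,
    "merge_all": 16, "copy_all": 1,
}
deps = {
    "sort0": [], "sort1": [], "sort2": [], "sort3": [],
    "merge01": ["sort0", "sort1"], "merge23": ["sort2", "sort3"],
    "copy01": ["merge01"], "copy23": ["merge23"],
    "merge_all": ["copy01", "copy23"], "copy_all": ["merge_all"],
}

def greedy_schedule(weights, deps, P):
    """Greedy (list) scheduling on P cores: whenever a core is free and a
    node is ready (all predecessors done), start it immediately."""
    indeg = {v: len(ds) for v, ds in deps.items()}
    succs = {v: [] for v in weights}
    for v, ds in deps.items():
        for u in ds:
            succs[u].append(v)
    ready = [v for v in weights if indeg[v] == 0]
    running, t, free, finished = [], 0, P, 0   # running: heap of (finish_time, node)
    while finished < len(weights):
        while ready and free:                  # greedy rule: fill all free cores
            v = ready.pop()
            heapq.heappush(running, (t + weights[v], v))
            free -= 1
        t, v = heapq.heappop(running)          # advance to the next completion
        free += 1
        finished += 1
        for w in succs[v]:
            indeg[w] -= 1
            if indeg[w] == 0:
                ready.append(w)
    return t

T1 = sum(weights.values())
TP = greedy_schedule(weights, deps, 4)
print(T1, TP)  # -> 39 27
```

With depth T_∞ = 27, the run satisfies both lower bounds (T_P ≥ max(T_1/P, T_∞) = 27) and the greedy upper bound (T_P ≤ 39/4 + 27 = 36.75); here the critical path dominates, so four cores already achieve T_P = T_∞.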
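The work/depth recurrences for merge sort with a sequential merge can be evaluated numerically. This sketch assumes concrete constants (a merge of n elements costs exactly n, a 1-element base case costs 1) just to make the asymptotics visible.

```python
def work(n):
    # W(n) = 2 W(n/2) + n  ->  Theta(n log n)
    return 1 if n <= 1 else 2 * work(n // 2) + n

def depth(n):
    # D(n) = D(n/2) + n  ->  Theta(n): the top-level merge is sequential
    return 1 if n <= 1 else depth(n // 2) + n

for n in (2**10, 2**20):
    print(n, work(n), depth(n), round(work(n) / depth(n), 1))
```

The work/depth ratio grows only like log n (about 5.5 at n = 2¹⁰, about 10.5 at n = 2²⁰), which is why the slides call the sequential merge a bottleneck: reducing depth requires parallelizing the merge itself.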
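The Amdahl's-law exercise above can be worked directly from the formula. With f = 0.7 parallelized perfectly, the speedup never reaches 4×: the limit as c → ∞ is 1/(1 − f) = 3.33, so the answer to the slide's question is that no number of cores suffices.

```python
def amdahl_speedup(f, c):
    # The sequential part (1 - f) is unchanged; the parallel part f
    # shrinks to f / c on c cores.
    return 1.0 / ((1.0 - f) + f / c)

f = 0.7  # sorting is 70% of the sequential execution time
for c in (1, 2, 4, 8, 16):
    print(c, round(amdahl_speedup(f, c), 2))

# Even infinitely many cores only approach the limit 1 / (1 - f):
print(round(1 / (1 - f), 2))  # -> 3.33
```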
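Gustafson's scaled speedup contrasts directly with the Amdahl curve: instead of fixing the problem and shrinking the time, it fixes the time and grows the problem, so speedup is linear in the core count. The code below just evaluates the standard formula for the same 70% parallel fraction.

```python
def gustafson_speedup(f, c):
    # Scaled speedup with parallel fraction f (measured on the parallel
    # machine) and c cores: the sequential part stays, the parallel part
    # would have taken f * c as long on one core.
    return (1.0 - f) + f * c

f = 0.7
for c in (1, 4, 16, 64):
    print(c, gustafson_speedup(f, c))
```

Where Amdahl caps the same workload at 3.33×, Gustafson's view gives 11.5× at 16 cores, because the parallel portion of a larger problem dominates.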