Slides - Indico
Status of Investigations in GPU-based Online Tracking Algorithms
DAQ/FEE Workshop Boppard 2014
31 March 2014, Andreas Herten
Member of the Helmholtz Association

Outline
• Algorithms
  – Hough Transform
  – Riemann Track Finder
  – Triplet Finder

Graphics Processing Units
• CPU: processes the chains one after another (a1 → b1 → c1; a2 → b2 → c2; a3 → …)
• GPU: processes the chains a1 → b1 → c1, a2 → b2 → c2, a3 → … side by side

ALGORITHMS #1: Hough Transform

Algorithm: Hough Transform
• Idea: Transform (x,y)_i → (α,r)_ij, find lines via (α,r) space
• Solve the line equation $r_{ij} = \cos\alpha_j \cdot x_i + \sin\alpha_j \cdot y_i + \rho_i$ for
  – Lots of hits (x,y,ρ)_i and
  – Many α_j ∈ [0°,360°) each
• Sizes: i: ~100 hits/event (STT); j: every 0.2°; r_ij: 180 000 values
• Fill histogram
• Extract track parameters: the bin with the highest multiplicity gives them
[Figure: principle sketch, hits in the (x, y) plane and their curves in (α, r) space. From: Andreas Herten, DPG Frühjahrstagung 2014, HK 57.2]

[Figure: Hough-transformed hits, 68 (x,y) points, PANDA STT+MVD, 1800 x 1800 grid. Horizontal axis: angle α / °; vertical axis: r.]
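The fill and extract steps of the Hough transform above can be sketched in a few lines. This is a minimal serial C++ sketch, not the Thrust or CUDA code of the talk: the types `Hit` and `HoughPeak`, the function `findHoughPeak`, and the binning parameters are hypothetical names introduced for illustration, and the peak finder is a plain global maximum.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical types for this sketch; not taken from the original code base.
struct Hit { double x, y, rho; };             // rho: isochrone radius, 0 for point-like hits
struct HoughPeak { double alphaDeg, r; int count; };

// Fills an (alpha, r) histogram with
//   r_ij = cos(alpha_j) * x_i + sin(alpha_j) * y_i + rho_i
// and returns the bin with the highest multiplicity.
HoughPeak findHoughPeak(const std::vector<Hit>& hits,
                        int nAlpha, int nR, double rMin, double rMax) {
    const double kPi = 3.14159265358979323846;
    const double rBinWidth = (rMax - rMin) / nR;
    std::vector<int> histogram(static_cast<std::size_t>(nAlpha) * nR, 0);

    // Fill step: every (hit i, angle j) pair is independent of all others,
    // which is what makes the transform attractive for a GPU.
    for (const Hit& hit : hits) {
        for (int j = 0; j < nAlpha; ++j) {
            const double alpha = 2.0 * kPi * j / nAlpha;   // alpha_j in [0, 2*pi)
            const double r = std::cos(alpha) * hit.x + std::sin(alpha) * hit.y + hit.rho;
            const int rBin = static_cast<int>(std::floor((r - rMin) / rBinWidth));
            if (rBin >= 0 && rBin < nR) {
                ++histogram[static_cast<std::size_t>(j) * nR + rBin];
            }
        }
    }

    // Extraction step: single global maximum (the first one wins on ties).
    std::size_t best = 0;
    for (std::size_t bin = 1; bin < histogram.size(); ++bin) {
        if (histogram[bin] > histogram[best]) best = bin;
    }
    const int jBest = static_cast<int>(best / nR);
    const int rBest = static_cast<int>(best % nR);
    return {360.0 * jBest / nAlpha, rMin + (rBest + 0.5) * rBinWidth, histogram[best]};
}
```

For point-like hits (ρ = 0) the same line also shows up at α + 180° with the sign of r flipped, so a single global maximum is only a starting point; finding several peaks reliably is a separate problem.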
Hough Transform: Remarks
Two implementations: Thrust and plain CUDA
• Performance: 3 ms/event (Thrust), 0.5 ms/event (plain CUDA)
  – Independent of angular granularity
Thrust
• Reduced to set of standard routines
  – Fast (uses Thrust's optimized algorithms)
  – Inflexible (has its limits, hard to customize)
• No peak finding included
  – Even possible?
  – Adds to time!
Plain CUDA
• Built completely for this task
  – Fitting to every problem
  – Customizable
  – A bit more complicated in parts
• Simple peak finder implemented (threshold)
• Using: Dynamic Parallelism, Shared Memory

Hough Transform: Summary
• Running code
• Big issue: Multipeak finder → Martin's idea
Advantages of HT algorithm
• Easy algorithm
• With grid granularity parallelism increases
• Flexibility of HT'ed equation
  – Isochrones (fully)
  – Time-based
To Dos / Plans
• Parallelism beyond Thrust: plain CUDA
• Isochrones (fully)
• Time-based events
• Integrate to PandaRoot
Drawbacks / Problems / Challenges
• Stuck in Thrust infrastructure
• Multipeak finder / aliasing (algorithm out of image processing, usually used to detect continuous lines)
• Code at:
  – https://subversion.gsi.de/trac/fairroot/browser/pandaroot/development/aherten/GpuHoughTransform
  – https://github.com/AndiH/CUDA/

ALGORITHMS #2: Riemann Track Finder

Riemann Track Finder: Method
• Idea: Don't fit lines (in 2D), fit planes (in 3D)!
• Create seeds
  – All possible three-hit combinations
• Grow seeds to tracks: continuously test whether the next hit fits
  – Use mapping to Riemann paraboloid
[Figure: hits in the (x, y) plane lifted along z' onto the Riemann paraboloid.]

Riemann Algorithm: Triplet Generation
CPU
• Three loops to generate triplets serially

    for (int i = 0; i < hitsInLayerOne.size(); i++) {
        for (int j = 0; j < hitsInLayerTwo.size(); j++) {
            for (int k = 0; k < hitsInLayerThree.size(); k++) {
                /* Triplet Generation */
            }
        }
    }

GPU
• Loops do not parallelize well!
• Needed: mapping of the inherent GPU indexing variable to the triplet index

    int ijk = threadIdx.x + blockIdx.x * blockDim.x;

$$ n_{\mathrm{Layer}_x} = \frac{1}{2}\left(\sqrt{8x + 1} - 1\right) $$

$$ \mathrm{pos}(n_{\mathrm{Layer}_x}) = \frac{\sqrt[3]{\sqrt{3}\,\sqrt{243x^2 - 1} + 27x}}{3^{2/3}} + \frac{1}{\sqrt[3]{3}\,\sqrt[3]{\sqrt{3}\,\sqrt{243x^2 - 1} + 27x}} - 1 $$

→ 100 × faster than CPU version: ~0.6 ms/event

GPU Riemann Algorithm: Summary
• Running port of CPU Riemann to GPU (J. Timcheck) + improvements / needed changes with respect to the CPU version
Advantages of Riemann algorithm
• Secondaries
• Also runs with the MVD only
• Basis for more sophisticated algorithms
• Uncertainties
• Fast track fitter (track parameters)
To Dos / Plans
• Measurement uncertainties
• Cuts (extension: hit too close, zero crossing)
• Parallelism (32 threads per seed, not 1)
• Track merger
• Integrate to PandaRoot
Drawbacks / Problems / Challenges
• Combinatorially explosive
  – Many combinations (if used bluntly)
  – Especially if used as finder
  – Pre-steps needed
• Currently paused
• Include more subdetectors
• Make time-based
• Code at: https://subversion.gsi.de/trac/fairroot/browser/pandaroot/development/aherten/GpuRiemann
• Extensive documentation at: http://panda-wiki.gsi.de/cgi-bin/view/Computing/RiemannTrackFinder (+ summary of theory behind Riemann algorithm)

ALGORITHMS #3: Triplet Finder

Triplet Finder
• Algorithm specifically designed for the PANDA Straw Tube Tracker (STT)
  – Original algorithm by Marius Mertens et al.
• Ported to GPU by Andrew Adinetz
  – CUDA, Dynamic Parallelism, Thrust
  – Quality of tracks comparable to CPU
  – http://www.fz-juelich.de/ias/jsc/
[Figure: STT, scale 1.5 m.]

Triplet Finder
• Idea: Use only subset of detector as seed
  – Combine 3 hits to a triplet
  – Calculate circle from 3 triplets (no fit)
• Features
  – Fast & robust algorithm, no t0
  – Many tuning possibilities

Triplet Finder: Display
[Figure: event display. Legend: isochrone early; isochrone early & skewed; isochrone close; isochrone late; MVD hit; triplet; track current; track timed out.]

Triplet Finder: Times
[Figure: timing plot.]

Triplet Finder: Optimizations
• Bunching Wrapper
  – Hits from one event have similar timestamps
  – Combine hits into sets (bunches) which occupy the GPU best
  – 𝒪(N²) → 𝒪(N)
[Figure: hits grouped into events, events grouped into bunches.]

Triplet Finder: Bunching Performance
[Figure: bunching performance plot.]

Triplet Finder: Optimizations
• Sector Row testing
  – After a track is found: hit association not with all hits of the current window, but only with a subset (first test rows of sector, then hits of row)

Triplet Finder: Sector Rows
[Figure: preliminary (in publication).]

Triplet Finder: Optimizations
• Compare kernel launch strategies (each runs TF Stage #1 to TF Stage #4 of the Triplet Finder)
  – Dynamic Parallelism: 1 thread/bunch, calling kernel (GPU)
  – Joined Kernel: 1 block/bunch, joined kernel (GPU)
  – Host Streams: 1 stream/bunch, calling stream (CPU)

Triplet Finder: Kernel Launches
[Figure: preliminary (in publication).]

Triplet Finder: Optimizations
• Impact of chipset

                             Tesla K40       Tesla K20X
    Peak double performance  1.46 TFLOPS     1.31 TFLOPS
    Peak single performance  4.29 TFLOPS     3.95 TFLOPS
    GPU chipset              GK110B          GK110
    # CUDA cores             2880            2688
    Memory size              12 GB           6 GB
    Memory bandwidth         288 GByte/s     250 GByte/s

Source: http://www.nvidia.com/content/tesla/pdf/NVIDIA-Tesla-Kepler-Family-Datasheet.pdf

Triplet Finder: Clock Speed / GPU
[Figure: preliminary (in publication).]

             Memory clock   Core clock   GPU Boost
    K40      3004 MHz       745 MHz      875 MHz
    K20X     2600 MHz       732 MHz      784 MHz

Triplet Finder: Summary
• Optimizations possible & needed
  – Speed, €: more float, less double; cards à la K10
  – ε: ~55% (see Marius' talks during the PANDA meetings)
• Best performance: 20 µs/event
  – 20⋅10⁻⁶ s/event × 2⋅10⁷ events/s → 400 GPUs (2014)
  – PANDA 2019: Multi-GPU system
  – 𝒪(100) GPUs
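The GPU-count estimate in the summary is a single product of processing time per event and event rate. A minimal sketch of that arithmetic follows; `requiredGpus` is a hypothetical helper, and it assumes perfect scaling across GPUs with no data-transfer overhead, which the talk does not claim.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper, assuming perfect scaling across GPUs and no transfer
// overhead: GPUs needed = processing time per event * event rate.
long requiredGpus(double secondsPerEvent, double eventsPerSecond) {
    // GPU-seconds of work arriving per second of data taking.
    const double load = secondsPerEvent * eventsPerSecond;
    // Round up to whole GPUs; the small tolerance absorbs floating-point noise.
    return static_cast<long>(std::ceil(load - 1e-9));
}
```

Called as `requiredGpus(20e-6, 2e7)`, this reproduces the 400 GPUs quoted for the 2014 hardware.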
Advantages, To Dos / Plans, Drawbacks / Problems / Challenges
[Three-column slide; the column of each item is not recoverable from the transcript.]
• Fast
• No isochrones needed
• t0
• Algorithmic tuning still to be done
• Already built for time-based hits (no events)
• Integrate to PandaRoot/ZMQ
• t0 deliverer?
• Code at: https://subversion.gsi.de/trac/fairroot/browser/pandaroot/development/aherten/GpuTripletFinder

Et cetera
• Data transfer to GPU
  – Not researched yet at PANDA
  – Contact with INFN Roma (P. Vicini, NA62)
  – Interesting technique: GPUDirect, 6 GB/s
[Embedded slide: Alessandro Lonardo, INFN, GTC 2013, March 20, 2013, slide 7. Text: "Replace custom hardware with a GPU-based system performing the same task but: programmable, upgradable, scalable, cost effective, increasing selection efficiency of interesting events implementing more demanding algorithms. The topic of this talk." Diagram: detectors RICH, MUV, CEDAR, STRAWS, LKR, LAV; trigger primitives to the L0 trigger processor (L0TP); data through a GigaEth switch to L1/L2 PCs and on to CDR; rates labelled 10 MHz, 1 MHz, 100 kHz, O(kHz).]

Et cetera
• Tegra Jetson Developer Board: 40 GFLOPS (single), 192 USD

Summary
• Algorithms in active evaluation and optimization
  – Triplet Finder very exciting
• New PhD student L. Bianchi

Thank you!
Andreas Herten
a.herten@fz-juelich.de

List of Resources Used
• #4: Earth icon by Francesco Paleari from The Noun Project
• #4: Einstein icon by Roman Rusinov from The Noun Project
• #6: FAIR vector logo from official FAIR website
• #6: FAIR rendering from official website
• #11: Flare Gun icon by Jop van der Kroef from The Noun Project
• #27: STT event animation by Marius C. Mertens
• #35: Graphics cards images by NVIDIA promotion
• #35: GPU Specifications
  – Tesla K20X Specifications: http://www.nvidia.com/content/PDF/kepler/TeslaK20X-BD-06397-001-v07.pdf
  – Tesla K40 Specifications: http://www.nvidia.com/content/PDF/kepler/Tesla-K40Active-Board-Spec-BD-06949-001_v03.pdf
  – Tesla Family Overview: http://www.nvidia.com/content/tesla/pdf/NVIDIA-Tesla-Kepler-Family-Datasheet.pdf

BACKUP

Hough Transform: Principle
[Figure: hits in the (x, y) plane; each hit becomes a curve in (r, α) space, and curves of hits on one track cross in one point.]
$r_{ij} = \cos\alpha_j \cdot x_i + \sin\alpha_j \cdot y_i + \rho_i$
→ Bin with highest multiplicity gives track parameters
Riemann Algorithm: Procedure
1. Create triplet of hit points
  – All possible three-hit combinations need to become triplets
2. Grow triplets to tracks: continuously test whether the next hit fits the triplet track
  – Use Riemann paraboloid to circle-fit the track
    • Test closeness of new hit: good → add hit; bad → dismiss hit
    • Continue with next hit
  – Helix fit: arc length s vs. z position

Riemann Algorithm: Triplets (step 1)
[Figure: hits on layers 1 to 5 (axis: layer number); triplets are built up one after another, labelled 11 21 31, 11 31 41, 11 31 32.]

Riemann Algorithm: Expansion (step 2)
[Figure: hits in the (x, y) plane are expanded to z' onto the Riemann surface (paraboloid).]
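The expansion step relies on a geometric fact: lifting hits to z' = x² + y² turns a circle in the (x, y) plane into a plane in 3D, so testing whether a new hit fits a track becomes a point-to-plane distance. The sketch below illustrates this with the exact plane through three seed hits; the real track finder fits a plane to all hits of a candidate, and every name here (`Vec3`, `Plane`, `Circle`, `liftToParaboloid`, `planeThrough`, `circleFromPlane`, `distanceToPlane`) is a hypothetical choice for illustration.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical types for this sketch; not taken from the original code base.
struct Vec3 { double x, y, z; };
struct Plane { Vec3 n; double c; };           // all points p with n . p = c
struct Circle { double cx, cy, radius; };

// Lift a hit from the (x, y) plane onto the Riemann paraboloid z' = x^2 + y^2.
Vec3 liftToParaboloid(double x, double y) { return {x, y, x * x + y * y}; }

// Exact plane through three lifted seed hits (normal = cross product).
Plane planeThrough(const Vec3& a, const Vec3& b, const Vec3& c) {
    const Vec3 u{b.x - a.x, b.y - a.y, b.z - a.z};
    const Vec3 v{c.x - a.x, c.y - a.y, c.z - a.z};
    const Vec3 n{u.y * v.z - u.z * v.y, u.z * v.x - u.x * v.z, u.x * v.y - u.y * v.x};
    return {n, n.x * a.x + n.y * a.y + n.z * a.z};
}

// Circle in the (x, y) plane that corresponds to the plane:
//   n.z (x^2 + y^2) + n.x x + n.y y = c.
// Requires n.z != 0, i.e. the three seed hits are not collinear.
Circle circleFromPlane(const Plane& p) {
    const double cx = -p.n.x / (2.0 * p.n.z);
    const double cy = -p.n.y / (2.0 * p.n.z);
    return {cx, cy, std::sqrt(p.c / p.n.z + cx * cx + cy * cy)};
}

// Closeness test for a further hit: distance of its lifted point to the plane.
double distanceToPlane(const Plane& p, const Vec3& q) {
    const double norm = std::sqrt(p.n.x * p.n.x + p.n.y * p.n.y + p.n.z * p.n.z);
    return std::fabs(p.n.x * q.x + p.n.y * q.y + p.n.z * q.z - p.c) / norm;
}
```

A hit is added to the track when `distanceToPlane` stays below a cut and dismissed otherwise; the value of that cut is a tuning parameter not given in the talk.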
Triplet Finder: Method
• STT hit in pivot straw
• Find surrounding hits → create virtual hit (triplet) at center of gravity (cog)
• Combine with
  1. Second STT pivot-cog virtual hit
  2. Interaction point
• Calculate circle through three points → track candidate
[Figure: STT cross section with pivot straws, virtual hits, and the interaction point.]

Triplet Finder: Optimizations
• Sector Row testing
  – Thicken track; shrink sector row layer to line
  – Find intersection
[Figure: sector-row testing, a track crossing sector rows.]

Triplet Finder: Kernel Launch Strategies
• Joined Kernel (JK): slowest
  – High # registers → low occupancy
• Dynamic Parallelism (DP) / Host Streams (HS): comparable performance
  – Performance
    • HS faster for small # processed hits, DP faster for > 45000 hits
    • HS stagnates there, while DP continues rising
  – Limiting factor
    • High # of required kernel calls
    • Kernel launch latency
    • Memcopy
  – HS more affected by this, because
    • More PCI-E transfers (launch configurations for kernels)
    • Less launch throughput, kernel launch latency gets more important
    • False dependencies of launched kernels
      – Single CPU thread handles all CUDA streams (multi-thread possible, but synchronization overhead too high for good performance)
      – Grid scheduling done on hardware (Grid Management Unit) (DP: software)
        » False dependencies when N(streams) > N(device connections) = 32 (compute capability 3.5)

Triplet Finder: Host Stream Connections
[Figure: preliminary (in publication).]

Triplet Finder: Bunch Sizes
[Figure: preliminary (in publication).]