Slides - Indico

Member of the Helmholtz Association
Status of Investigations in GPU-based Online Tracking Algorithms
DAQ/FEE Workshop Boppard 2014
31 March 2014, Andreas Herten
Outline
• Algorithms
– Hough Transform
– Riemann Track Finder
– Triplet Finder
Graphics Processing Units
CPU (one chain after another): a1 → b1 → c1; a2 → b2 → c2; a3 → …
GPU (chains side by side):
a1 → b1 → c1
a2 → b2 → c2
a3 → …
ALGORITHMS #1: Hough Transform
Algorithm: Hough Transform
• Idea: Transform (x,y)_i → (α,r)_ij, find lines via (α,r) space
$r_{ij} = \cos\alpha_j \cdot x_i + \sin\alpha_j \cdot y_i + \rho_i$
• Solve the r_ij line equation for
– Lots of hits (x,y,ρ)_i: ~100 hits/event (STT)
– Many α_j ∈ [0°,360°) each: one α_j every 0.2°
– r_ij: 180 000 values (~100 hits × 1800 angles)
• Fill histogram
• Extract track parameters
→ Bin with highest multiplicity gives track parameters
[Figure: Hough Transform principle, hits in the (x,y) plane and the corresponding (α,r) histogram; from Andreas Herten, DPG Frühjahrstagung 2014, HK 57.2]
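The histogram filling and peak extraction described above can be sketched on the host side as follows. This is a minimal serial C++ illustration, not the Thrust or CUDA code from the talk; the names `Hit`, `Peak` and `houghPeak`, the r range and the bin counts are assumptions made for the example.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// One detector hit: position (x, y) and isochrone radius rho (0 for point-like hits).
struct Hit { double x, y, rho; };

// Most populated histogram bin: angle in degrees, r at the bin centre, entries.
struct Peak { double alphaDeg; double r; int count; };

// Fill an nAlpha x nR histogram with
//   r_ij = cos(alpha_j) * x_i + sin(alpha_j) * y_i + rho_i
// for alpha_j spread evenly over [0, 360) degrees, then return the fullest bin.
Peak houghPeak(const std::vector<Hit>& hits, std::size_t nAlpha, std::size_t nR,
               double rMin, double rMax) {
    assert(nAlpha > 0 && nR > 0 && rMax > rMin);
    const double pi = std::acos(-1.0);
    const double alphaStep = 360.0 / static_cast<double>(nAlpha);
    const double rStep = (rMax - rMin) / static_cast<double>(nR);
    std::vector<int> histogram(nAlpha * nR, 0);

    for (const Hit& hit : hits) {
        for (std::size_t j = 0; j < nAlpha; ++j) {
            const double alpha = static_cast<double>(j) * alphaStep * pi / 180.0;
            const double r = std::cos(alpha) * hit.x + std::sin(alpha) * hit.y + hit.rho;
            if (r < rMin || r >= rMax) continue;  // outside the histogram range
            std::size_t bin = static_cast<std::size_t>((r - rMin) / rStep);
            if (bin >= nR) bin = nR - 1;          // guard against rounding at the upper edge
            ++histogram[j * nR + bin];
        }
    }

    // Simple peak finder: first bin with the highest multiplicity.
    Peak best{0.0, rMin + 0.5 * rStep, 0};
    for (std::size_t j = 0; j < nAlpha; ++j) {
        for (std::size_t bin = 0; bin < nR; ++bin) {
            const int count = histogram[j * nR + bin];
            if (count > best.count) {
                best = {static_cast<double>(j) * alphaStep,
                        rMin + (static_cast<double>(bin) + 0.5) * rStep, count};
            }
        }
    }
    return best;
}
```

Each (hit, angle) pair is independent of all others, which is what makes the filling step a natural fit for one GPU thread per pair. Over the full [0°,360°) range a line shows up twice, at (α, r) and at (α + 180°, -r), so this sketch reports only the first of the two peaks.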
Algorithm: Hough Transform
[Figure: Hough-transformed histogram for 68 (x,y) points, PANDA STT+MVD, 1800 x 1800 grid; horizontal axis: angle α / ° (0 to 180), vertical axis: r]
Hough Transform — Remarks
Two Implementations
Thrust
• Performance: 3 ms/event
– Independent of angular granularity
– Reduced to set of standard routines
• Fast (uses Thrust's optimized algorithms)
• Inflexible (has its limits, hard to customize)
– No peakfinding included
• Even possible?
• Adds to time!
Plain CUDA
• Performance: 0.5 ms/event
– Built completely for this task
• Fitting to every problem
• Customizable
• A bit more complicated at parts
– Simple peakfinder implemented (threshold)
• Using: Dynamic Parallelism, Shared Memory
Hough Transform — Summary
• Running code
• Big issue: Multipeak finder → Martin's idea
Advantages of HT algorithm
• Easy algorithm
• With grid granularity parallelism increases
• Flexibility of HT'ed equation
To Dos / Plans
• Thrust: Parallelism beyond events; Isochrones (fully); Time-based
• Plain CUDA: Isochrones (fully); Integrate to PandaRoot; Time-based
Drawbacks / Problems / Challenges
• Thrust: Stuck in Thrust infrastructure
• Multipeak finder / Aliasing (algorithm out of image processing, usually used to detect continuous lines)
• Code at:
– https://subversion.gsi.de/trac/fairroot/browser/pandaroot/development/aherten/GpuHoughTransform
– https://github.com/AndiH/CUDA/
ALGORITHMS #2: Riemann Track Finder
Riemann Track Finder — Method
• Idea: Don't fit lines (in 2D), fit planes (in 3D)!
• Create seeds
– All possible three hit combinations
• Grow seeds to tracks
– Continuously test whether the next hit fits
– Use mapping to Riemann paraboloid
[Figure: hits (x) in the x-y plane mapped along z' onto the Riemann paraboloid]
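The plane-instead-of-circle idea can be illustrated with a short C++ sketch. It assumes the common form of the mapping, z' = x² + y², under which hits on one circle in the x-y plane land on one plane in (x, y, z'). The names `liftToParaboloid`, `distanceToPlane` and `hitFitsSeed` and the tolerance parameter are hypothetical and not taken from the PandaRoot code.

```cpp
#include <array>
#include <cassert>
#include <cmath>

using Vec3 = std::array<double, 3>;

// Map a hit (x, y) onto the Riemann paraboloid z' = x^2 + y^2.
Vec3 liftToParaboloid(double x, double y) {
    return Vec3{{x, y, x * x + y * y}};
}

// Distance of point p from the plane through a, b and c.
double distanceToPlane(const Vec3& a, const Vec3& b, const Vec3& c, const Vec3& p) {
    const Vec3 u{{b[0] - a[0], b[1] - a[1], b[2] - a[2]}};
    const Vec3 v{{c[0] - a[0], c[1] - a[1], c[2] - a[2]}};
    // Plane normal n = u x v.
    const Vec3 n{{u[1] * v[2] - u[2] * v[1],
                  u[2] * v[0] - u[0] * v[2],
                  u[0] * v[1] - u[1] * v[0]}};
    const double norm = std::sqrt(n[0] * n[0] + n[1] * n[1] + n[2] * n[2]);
    assert(norm > 0.0);  // the three seed hits must be distinct
    const double dot = n[0] * (p[0] - a[0]) + n[1] * (p[1] - a[1]) + n[2] * (p[2] - a[2]);
    return std::fabs(dot) / norm;
}

// Growing step: accept a candidate hit if its lifted point lies within
// `tolerance` of the plane spanned by the three lifted seed hits.
bool hitFitsSeed(const std::array<Vec3, 3>& seed, double x, double y, double tolerance) {
    return distanceToPlane(seed[0], seed[1], seed[2], liftToParaboloid(x, y)) <= tolerance;
}
```

The distance here is measured in the lifted space, not in the detector plane, so a realistic cut would have to be tuned or converted accordingly.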
Riemann Algorithm — Triplet Generation
CPU
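• Three loops to generate triplets serially
// One nested loop per layer: every hit of layer one is combined
// with every hit of layer two and every hit of layer three.
for (std::size_t i = 0; i < hitsInLayerOne.size(); i++) {
    for (std::size_t j = 0; j < hitsInLayerTwo.size(); j++) {
        for (std::size_t k = 0; k < hitsInLayerThree.size(); k++) {
            /* Triplet Generation */
        }
    }
}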
GPU
• Loops do not parallelize well!
• Needed: Mapping of inherent GPU indexing variable to triplet index
int ijk = threadIdx.x + blockIdx.x * blockDim.x;
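$\mathrm{nLayer}_x = \frac{1}{2}\left(\sqrt{8x+1} - 1\right)$
$\mathrm{pos}(\mathrm{nLayer}_x) = \frac{\sqrt[3]{\sqrt{3}\,\sqrt{243x^2-1} + 27x}}{3^{2/3}} + \frac{1}{\sqrt[3]{3}\,\sqrt[3]{\sqrt{3}\,\sqrt{243x^2-1} + 27x}} - 1$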
→ 100 × faster than CPU version: ~0.6 ms/event
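A host-side C++ sketch of such an index mapping is given below. It assumes that triplets are unordered combinations of three distinct hits from one list, which is what the triangular-root and tetrahedral-root closed forms on the slide suggest; the actual kernel is not shown in the talk. The closed forms only seed the search and an integer correction absorbs floating-point rounding. All function names are hypothetical.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstdint>

using Index = std::uint64_t;

// Number of pairs / triplets that can be drawn from n items.
Index choose2(Index n) { return n < 2 ? 0 : n * (n - 1) / 2; }
Index choose3(Index n) { return n < 3 ? 0 : n * (n - 1) * (n - 2) / 6; }

// Largest n with choose2(n) <= x, seeded by the triangular root (sqrt(8x+1)-1)/2.
Index largestChoose2(Index x) {
    Index n = static_cast<Index>((std::sqrt(8.0 * static_cast<double>(x) + 1.0) - 1.0) / 2.0) + 1;
    while (choose2(n) > x) --n;        // undo floating-point overshoot
    while (choose2(n + 1) <= x) ++n;   // undo floating-point undershoot
    return n;
}

// Largest n with choose3(n) <= x, seeded by the closed-form tetrahedral root.
Index largestChoose3(Index x) {
    if (x == 0) return 2;              // the closed form needs x >= 1
    const double xd = static_cast<double>(x);
    const double c = std::cbrt(std::sqrt(3.0) * std::sqrt(243.0 * xd * xd - 1.0) + 27.0 * xd);
    Index n = static_cast<Index>(c / std::cbrt(9.0) + 1.0 / (std::cbrt(3.0) * c) - 1.0) + 2;
    while (choose3(n) > x) --n;
    while (choose3(n + 1) <= x) ++n;
    return n;
}

// Decode a flat index (e.g. threadIdx.x + blockIdx.x * blockDim.x) into hit
// indices (i, j, k) with i < j < k, such that
//   ijk = choose3(k) + choose2(j) + i.
std::array<Index, 3> tripletFromIndex(Index ijk) {
    const Index k = largestChoose3(ijk);
    const Index rest = ijk - choose3(k);
    const Index j = largestChoose2(rest);
    const Index i = rest - choose2(j);
    return {{i, j, k}};
}
```

For n hits, the flat indices 0 to choose3(n) - 1 cover every triplet exactly once, so each GPU thread can compute its own triplet without any loop over the hit lists.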
GPU Riemann Algorithm — Summary
• Running port of CPU Riemann to GPU (J. Timcheck)
+ Improvements / needed changes with respect to CPU version
Advantages of Riemann algorithm
• Secondaries
• Runs also only with MVD
• Basis for more sophisticated algorithms
• Uncertainties
• Fast track fitter (track parameters)
To Dos / Plans
• Measurement Uncertainties
• Cuts (Extension: Hit too close, Zero crossing)
• Parallelism (32 threads per seed, not 1)
• Track merger
• Integrate to PandaRoot
Drawbacks / Problems / Challenges
• Combinatorially explosive
- Many combinations (if used bluntly)
- Esp. if used as finder
- Pre-steps needed
• Currently paused
• Include more subdetectors
• Make time-based
• Code at: https://subversion.gsi.de/trac/fairroot/browser/pandaroot/development/aherten/GpuRiemann
• Extensive documentation at: http://panda-wiki.gsi.de/cgi-bin/view/Computing/RiemannTrackFinder (+ summary of theory behind Riemann algorithm)
ALGORITHMS #3: Triplet Finder
Triplet Finder
• Algorithm specifically designed for the PANDA Straw Tube Tracker (STT)
– Original algorithm by Marius Mertens et al.
• Ported to GPU by Andrew Adinetz (http://www.fz-juelich.de/ias/jsc/)
– CUDA, Dynamic Parallelism, Thrust
– Quality of tracks comparable to CPU
[Figure: STT, scale 1.5 m]
Triplet Finder
• Idea: Use only subset of detector as seed
– Combine 3 hits to Triplet
– Calculate circle from 3 Triplets (no fit)
• Features
– Fast & robust algorithm, no t0
– Many tuning possibilities
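The "calculate circle (no fit)" step amounts to the circle through three points, which has a closed-form solution. The sketch below uses the standard circumcentre formula; the `Point` and `Circle` types and the function name are assumptions for illustration and do not come from the Triplet Finder code.

```cpp
#include <cassert>
#include <cmath>

struct Point { double x, y; };
struct Circle { double cx, cy, radius; };

// Circle through three non-collinear points, from the perpendicular-bisector
// equations. No iteration or fit is involved.
Circle circleThrough(const Point& a, const Point& b, const Point& c) {
    const double d = 2.0 * (a.x * (b.y - c.y) + b.x * (c.y - a.y) + c.x * (a.y - b.y));
    assert(d != 0.0);  // collinear points do not define a circle
    const double a2 = a.x * a.x + a.y * a.y;
    const double b2 = b.x * b.x + b.y * b.y;
    const double c2 = c.x * c.x + c.y * c.y;
    const double cx = (a2 * (b.y - c.y) + b2 * (c.y - a.y) + c2 * (a.y - b.y)) / d;
    const double cy = (a2 * (c.x - b.x) + b2 * (a.x - c.x) + c2 * (b.x - a.x)) / d;
    return {cx, cy, std::hypot(a.x - cx, a.y - cy)};
}
```

With the interaction point at the origin as one of the three points, `circleThrough({0, 0}, {2, 0}, {1, 1})` gives a circle centred at (1, 0) with radius 1.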
Triplet Finder — Display
[Figure legend: Isochrone early; Isochrone early & skewed; Isochrone close; Isochrone late; MVD hit; Triplet; Track current; Track timed out]
Triplet Finder — Times
Triplet Finder — Optimizations
• Bunching Wrapper
– Hits from one event have similar timestamp
– Combine hits to sets (bunches) which occupy GPU best
– 𝒪(N²) → 𝒪(N)
[Figure: hits grouped into events, events grouped into bunches]
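One way to picture the bunching step is the C++ sketch below. It is an assumption-laden illustration, not the wrapper used in the talk: hits are taken to be sorted by timestamp, a bunch is closed once it holds at least `targetHits` hits and the next hit is separated by more than `minGap` (so that hits with similar timestamps stay together), and both parameters are hypothetical tuning knobs.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// A bunch is a half-open index range [begin, end) into the sorted hit list.
struct Bunch { std::size_t begin, end; };

// Split time-ordered hit timestamps into bunches. A bunch is closed when it
// already holds at least targetHits hits and the next hit follows after a
// time gap larger than minGap, or when the input ends.
std::vector<Bunch> makeBunches(const std::vector<double>& sortedTimes,
                               double minGap, std::size_t targetHits) {
    assert(targetHits > 0);
    std::vector<Bunch> bunches;
    std::size_t begin = 0;
    for (std::size_t i = 1; i <= sortedTimes.size(); ++i) {
        const bool atEnd = (i == sortedTimes.size());
        const bool gapAhead = !atEnd && (sortedTimes[i] - sortedTimes[i - 1] > minGap);
        const bool bigEnough = (i - begin >= targetHits);
        if (atEnd || (bigEnough && gapAhead)) {
            bunches.push_back({begin, i});
            begin = i;
        }
    }
    return bunches;
}
```

If the work per bunch grows quadratically with the bunch size but the bunch size stays roughly constant, the total work grows linearly with the number of hits, which is presumably the 𝒪(N²) → 𝒪(N) effect quoted on the slide.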
Triplet Finder — Bunching
Performance
Triplet Finder — Optimizations
• Sector Row testing
– After found track: Hit association not with all hits of current window, but only with subset (first test rows of sector, then hits of row)
Triplet Finder — Sector Rows
Preliminary (in publication)
Triplet Finder — Optimizations
• Compare kernel launch strategies
[Figure: three launch strategies for the Triplet Finder, each running TF Stages #1 to #4 per bunch, with GPU and CPU parts marked]
– Dynamic Parallelism: calling kernel, 1 thread/bunch
– Joined Kernel: joined kernel, 1 block/bunch
– Host Streams: calling/combining stream, 1 stream/bunch
Triplet Finder — Kernel Launches
Preliminary (in publication)
Triplet Finder — Optimizations
• Impact of chipset
                          Tesla K40      Tesla K20X
Peak double performance   1.46 TFLOPS    1.31 TFLOPS
Peak single performance   4.29 TFLOPS    3.95 TFLOPS
GPU Chipset               GK110B         GK110
# CUDA Cores              2880           2688
Memory size               12 GB          6 GB
Memory bandwidth          288 GByte/s    250 GByte/s
Source: http://www.nvidia.com/content/tesla/pdf/NVIDIA-Tesla-Kepler-Family-Datasheet.pdf
Triplet Finder — Clock Speed / GPU
Preliminary (in publication)
[Figure legend, given as memory clock, core clock / GPU Boost clock]
K40: 3004 MHz, 745 MHz / 875 MHz
K20X: 2600 MHz, 732 MHz / 784 MHz
Triplet Finder — Summary
• Optimizations possible & needed
– Speed, €: more float, fewer double cards, à la K10
– ε: ~55% (see Marius' talks during the PANDA meetings)
• Best performance: 20 µs/event
– $20 \cdot 10^{-6}$ s/event × $2 \cdot 10^{7}$ event/s → 400 GPUs (2014)
– PANDA (2019): Multi GPU system, 𝒪(100) GPUs
Advantages
• Fast
• No isochrones needed
• Already built for time-based hits (no events)
To Dos / Plans
• t0
• Algorithmic tuning
• Integrate to PandaRoot / ZMQ
• t0 deliverer?
Drawbacks / Problems / Challenges
• Still to be done
• Code at: https://subversion.gsi.de/trac/fairroot/browser/pandaroot/development/aherten/GpuTripletFinder
Et cetera
• Data transfer to GPU
– Not researched yet at PANDA
– Contact with INFN Roma (P. Vicini, NA62)
– Interesting technique: GPU direct (6 GB/s)
[Figure: NA62 trigger and data acquisition scheme with the detectors RICH, MUV, CEDAR, STRAWS, LKR and LAV, the L0 trigger processor (L0TP), a GigaEth switch and L1/L2 PCs; rate labels 10 MHz, 1 MHz, 100 kHz and O(kHz). Caption: "Replace custom hardware with a GPU-based system performing the same task but: programmable, upgradable, scalable, cost effective, increasing selection efficiency of interesting events implementing more demanding algorithms." Source: Alessandro Lonardo, INFN, GTC 2013, March 20, 2013]
Et cetera
• Tegra Jetson Developer Board
– 40 GFLOPS (single), 192 USD
Summary
• Algorithms in active evaluation and optimization
– Triplet Finder very exciting
• New PhD student L. Bianchi
Thank you!
Andreas Herten
a.herten@fz-juelich.de
List of Resources Used
• #4: Earth icon by Francesco Paleari from The Noun Project
• #4: Einstein icon by Roman Rusinov from The Noun Project
• #6: FAIR vector logo from official FAIR website
• #6: FAIR rendering from official website
• #11: Flare Gun icon by Jop van der Kroef from The Noun Project
• #27: STT event animation by Marius C. Mertens
• #35: Graphics cards images by NVIDIA promotion
• #35: GPU Specifications
– Tesla K20X Specifications: http://www.nvidia.com/content/PDF/kepler/TeslaK20X-BD-06397-001-v07.pdf
– Tesla K40 Specifications: http://www.nvidia.com/content/PDF/kepler/Tesla-K40Active-Board-Spec-BD-06949-001_v03.pdf
– Tesla Family Overview: http://www.nvidia.com/content/tesla/pdf/NVIDIA-Tesla-Kepler-Family-Datasheet.pdf
BACKUP
Hough Transform — Principle
$r_{ij} = \cos\alpha_j \cdot x_i + \sin\alpha_j \cdot y_i + \rho_i$
[Figure: lines through a hit in the (x*, y*) plane map to points (r, α)_1, (r, α)_2, … in the (α, r) histogram]
→ Bin with highest multiplicity gives track parameters
Riemann Algorithm — Procedure
1. Create triplet of hit points
– All possible three hit combinations need to become triplets
2. Grow triplets to tracks: Continuously test next hit if it fits to triplet track
– Use Riemann paraboloid to circle fit track
• Test closeness of new hit: good → add hit; bad → dismiss hit
• Continue with next hit
– Helix fit: arc length s vs. z position
Riemann Algorithm — 11 Triplets
[Figure: hits arranged by layer number (1 to 5); triplets are built up one after another, labelled 11 21 31, then 11 31 41, then 11 31 32]
Riemann Algorithm — 12 Expansion
[Figure: hits (x) in the x-y plane are expanded to z' and come to lie on the Riemann surface (paraboloid)]
Triplet Finder — Method
• STT hit in pivot straw
• Find surrounding hits
→ Create virtual hit (triplet) at center of gravity (cog)
• Combine with
1. Second STT pivot-cog virtual hit
2. Interaction point
• Calculate circle through three points
→ Track Candidate
[Figure: STT with pivot straws, virtual hits and the interaction point]
Triplet Finder — Optimizations
• Sector Row testing
– Thicken track; shrink sector row layer to line
– Find intersection
[Figure: sector-row testing, a thickened track crossing sector rows]
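The "thicken track, shrink row to a line, find intersection" test can be sketched in C++ as below. The sketch assumes the track is a circle thickened to a band of half-width `halfWidth` and the sector row is a straight segment between two end points; these modelling choices and all names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

struct Point { double x, y; };

// Shortest distance from p to the segment a-b.
double distanceToSegment(const Point& p, const Point& a, const Point& b) {
    const double dx = b.x - a.x;
    const double dy = b.y - a.y;
    const double len2 = dx * dx + dy * dy;
    double t = len2 > 0.0 ? ((p.x - a.x) * dx + (p.y - a.y) * dy) / len2 : 0.0;
    t = std::max(0.0, std::min(1.0, t));  // clamp the projection onto the segment
    return std::hypot(p.x - (a.x + t * dx), p.y - (a.y + t * dy));
}

// True if the segment a-b (the sector row shrunk to a line) touches the band
// of radii [radius - halfWidth, radius + halfWidth] around centre (the
// thickened track). Distances from the centre to points on the segment cover
// the whole interval [nearest, farthest], so an overlap test is sufficient.
bool rowCrossesTrack(const Point& centre, double radius, double halfWidth,
                     const Point& a, const Point& b) {
    assert(radius > 0.0 && halfWidth >= 0.0);
    const double nearest = distanceToSegment(centre, a, b);
    const double farthest = std::max(std::hypot(a.x - centre.x, a.y - centre.y),
                                     std::hypot(b.x - centre.x, b.y - centre.y));
    return nearest <= radius + halfWidth && farthest >= radius - halfWidth;
}
```

Only the hits of rows that pass this test would then need to be compared with the track, instead of all hits in the current window.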
Triplet Finder — Kernel Launch Strategies
• Joined Kernel (JK): slowest
– High # registers → low occupancy
• Dynamic Parallelism (DP) / Host Streams (HS): comparable performance
– Performance
• HS faster for small # processed hits, DP faster for > 45000 hits
• HS stagnates there, while DP continues rising
– Limiting factor
• High # of required kernel calls
• Kernel launch latency
• Memcopy
– HS more affected by this, because
• More PCI-E transfers (launch configurations for kernels)
• Less launch throughput, kernel launch latency gets more important
• False dependencies of launched kernels
– Single CPU thread handles all CUDA streams (multi-threading possible, but synchronization overhead too high for good performance)
– Grid scheduling done on hardware (Grid Management Unit) (DP: software)
» False dependencies when N(streams) > N(device connections) = 32 (compute capability 3.5)
Triplet Finder — Host Stream Connections
Preliminary (in publication)
Triplet Finder — Bunch Sizes
Preliminary (in publication)