A Scalable Computer Architecture for On-line Pulsar Search on the SKA
- Draft Version -
G. Knittel, A. Horneffer
MPI for Radio Astronomy, Bonn
with help from: M. Kramer, B. Klein, R. Eatough
GPU-Based Pulsar Timing
FFT – DeDisp – IFFT – Full Stokes – Folding
DISK
GPU
User‐Mem
DMA
From ADC
NIC
System Memory
GPU
Fast Bit-Reversal
CPU
GPU-Processing
FFT
Coherent Dedispersion
DM
IFFT
Full Stokes Parameters
Folding
Pulse Period
GPU-Based Pulsar Timing
Performance:
610M Samples/s
(2 GPUs)
Pulsar Search
FFT
Coherent Dedispersion
DM
IFFT
Full Stokes Parameters
Folding
Pulse Period
Pulsar Search
FFT
Loop
Coherent Dedispersion
Trial DM
IFFT
Stokes I Parameter
Folding
Trial Pulse Period
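The Trial DM loop exists because the arrival time of a pulse depends on the dispersion measure and the observing frequency. The following is a minimal sketch of the standard cold-plasma delay formula, not the authors' GPU code; the evenly spaced DM grid, the function names and the example frequencies are assumptions made for illustration.

```python
# Dispersion constant, about 4.148808 ms * GHz^2 * cm^3 / pc.
K_DM_MS = 4.148808


def dispersion_delay_ms(dm, f_lo_ghz, f_hi_ghz):
    """Delay in ms of frequency f_lo relative to f_hi for a DM in pc cm^-3."""
    return K_DM_MS * dm * (f_lo_ghz ** -2 - f_hi_ghz ** -2)


def trial_dms(dm_max, n_trials):
    """Hypothetical trial grid: n_trials evenly spaced DMs from 0 to dm_max."""
    step = dm_max / (n_trials - 1)
    return [i * step for i in range(n_trials)]


if __name__ == "__main__":
    # Band edges of 1.2 and 1.5 GHz are example values only.
    for dm in trial_dms(dm_max=300.0, n_trials=4):
        print(dm, dispersion_delay_ms(dm, 1.2, 1.5))
```

Each trial DM changes the delay across the band, which is why dedispersion has to be repeated inside the loop before folding.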
Binary Systems Search
FFT
Loop
Coherent Dedispersion
Trial DM
IFFT
Stokes I Parameter
Folding
Trial Pulse Period, Orbital Parameters
Pulsar Search
• Idea: Massively-Parallel Pulsar Search by Folding (in Time Domain)
• Binary Systems: Make Length of Phase Bins variable (similar to Time Sequence Resampling)
• High-Dimensional Search Space: Complete Coverage not possible.
Pulsar Search
• Frequency Domain: Massively-Parallel Pulsar Search by Harmonic Summation
• Binary Systems: Process Range of neighboring Frequency Bins
• Use same Hardware!
• Not completely worked out yet.
Pulsar Search on the SKA
• Add Coherent Beamforming
Pulsar Search on the SKA
Polyphase Filterbank
FFT
Coherent Beamforming
Coherent Dedispersion
IFFT
Stokes I Parameter
Power Spectrum
Folding
Harmonic Sum
Pulsar Search on the SKA
FPGA: Polyphase Filterbank
CPU: FFT
CPU: Coherent Beamforming
GPU: Coherent Dedispersion
GPU: IFFT
GPU: Stokes I Parameter, Power Spectrum
ASIC: Folding, Harmonic Sum
Mode of Operation (Time Domain)
Telescope 0
Telescope 1
...
Telescope 127
Parallelization: Timeslicing
Telescope 0
Telescope 1
...
Telescope 127
Timeslice to "Rank 0"
Timeslice to "Rank 1"
Timeslice to "Rank 0"
Parallelization
Telescope 0
Telescope 1
...
Telescope 127
Required Processing Time defines Number of Ranks
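As a rough illustration of how processing time could set the rank count: if timeslices are handed out round-robin, a rank must finish one slice before its next one arrives. The sketch below assumes exactly that scheme; the function names and the example timings are hypothetical and do not come from the slides.

```python
import math


def ranks_needed(processing_time_s, timeslice_s):
    """Smallest number of ranks so that each rank finishes its timeslice
    before the round-robin schedule hands it the next one."""
    return max(1, math.ceil(processing_time_s / timeslice_s))


def rank_for_slice(slice_index, n_ranks):
    """Round-robin assignment of timeslices to ranks (an assumption)."""
    return slice_index % n_ranks


if __name__ == "__main__":
    n = ranks_needed(processing_time_s=64.0, timeslice_s=1.0)
    print(n, [rank_for_slice(i, n) for i in range(3)])
```

With two ranks this reproduces the Rank 0, Rank 1, Rank 0 pattern of the timeslicing slide.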
Parallelization
Data from all Telescopes
Chain Network
Rank 0
Rank 1
Rank 2
...
Rank 63
Architecture scales endlessly
Parallelization
Telescope n
Polyphase Filterbank
16 Subbands
Compute Node 0
Compute Node 1
...
Compute Node 15
Parallelization
One Rank
Data Capture Phase
Compute Node 0
Compute Node 1
Compute Node 15
8 Telescopes each
Parallelization
One Rank
Filtering and
Subband Distribution
Ring Network
Compute Node 0
Compute Node 1
Compute Node 15
Coherent Beamforming
Compute Node k
System Mem
Beam 0 Subb k
Beam 1 Subb k
Beam 2 Subb k
...
Beam n Subb k
Local Mem
Tel 0 Subb k
Tel 1 Subb k
Tel 2 Subb k
...
Tel 127 Subb k
FPGA
Polyphase Filterbank, Data Exchange
CPU (x2)
FFT, Coherent Beamforming
Coherent Beamforming on CPUs
• Performance using AVX: 128 Input Spectra, single-precision float, 256k Elements
• 128 Beams of same Size: 2.1s per Core @ 3.5GHz (prel. Results)
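To make the workload concrete, coherent beamforming for one subband can be sketched as a weighted sum of the telescope spectra, one complex weight per telescope and beam. This is a plain-Python sketch, not the AVX implementation behind the figures above; using a single phase per telescope across the subband, reading "256k" as 256 x 1024, and all function names are assumptions.

```python
import cmath


def phase_weights(phases_rad):
    """Complex weights that undo the given per-telescope phases."""
    return [cmath.exp(-1j * p) for p in phases_rad]


def beamform(spectra, weights):
    """spectra[t][k]: complex spectrum of telescope t.
    weights[b][t]: complex weight of telescope t for beam b.
    Returns beams[b][k] = sum over t of weights[b][t] * spectra[t][k]."""
    n_chan = len(spectra[0])
    beams = []
    for beam_weights in weights:
        beam = [0j] * n_chan
        for w, spectrum in zip(beam_weights, spectra):
            for k in range(n_chan):
                beam[k] += w * spectrum[k]
        beams.append(beam)
    return beams


def complex_macs(n_beams, n_inputs, n_elements):
    """Complex multiply-accumulates for the full beamforming step."""
    return n_beams * n_inputs * n_elements


if __name__ == "__main__":
    macs = complex_macs(128, 128, 256 * 1024)
    print(macs, macs / 2.1)  # total work, and rate implied by 2.1 s per core
```

Under these assumptions the slide's case is 2^32 complex multiply-accumulates, or roughly 2 x 10^9 per second per core.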
GPU-Processing
Compute Node k
GPU0 ... GPUn
Video Mem
DeDisp, IFFT, SI
System Mem
Beam 0 Subb k
Beam 1 Subb k
Beam 2 Subb k
...
Beam n Subb k
Local Mem
Tel 0 Subb k
Tel 1 Subb k
Tel 2 Subb k
...
Tel 127 Subb k
FPGA
Polyphase Filterbank, Data Exchange
CPU (x2)
FFT, Coherent Beamforming
GPU-Processing
• Performance: 2 Spectra, horz/vert, single-precision float, 4M Elements
• Total Power Time Sequence of same Size: 5ms
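The last GPU step, forming the total power (Stokes I) sequence from the two polarizations, is the sum of the squared magnitudes of the horizontal and vertical samples. A minimal sketch follows; it is not the GPU kernel, and reading "4M" as 4 x 2^20 for the throughput estimate is an assumption.

```python
def stokes_i(horz, vert):
    """Total power per sample from two complex polarization sequences."""
    return [abs(h) ** 2 + abs(v) ** 2 for h, v in zip(horz, vert)]


def samples_per_second(n_elements, seconds):
    """Throughput implied by processing n_elements in the given time."""
    return n_elements / seconds


if __name__ == "__main__":
    print(stokes_i([3 + 4j, 1j], [0j, 1 + 0j]))
    print(samples_per_second(4 * 2 ** 20, 5e-3))
```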
GPU-Processing
• How to output the Results to the ASICs?
• All PCIe Slots are already taken (GPUs, FPGAs)
• Write to Screen Buffer, to be output via Monitor Cable
GPU-Processing
Mini‐DisplayPort
17.28 Gbit/s
~ 70 Gbit/s
Equiv. 1 PCIe x16 Slot
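One way to read these numbers is as a bandwidth budget. The sketch below assumes four Mini-DisplayPort outputs per GPU (the slide does not state the count; four links at 17.28 Gbit/s give about 69 Gbit/s) and compares that with a PCIe 2.0 x16 slot after 8b/10b encoding. Both the output count and the PCIe generation are assumptions.

```python
DP_LINK_GBIT_S = 17.28  # from the slide: one Mini-DisplayPort link
N_OUTPUTS = 4           # assumption, not stated on the slide


def aggregate_gbit_s(n_outputs=N_OUTPUTS, link_gbit_s=DP_LINK_GBIT_S):
    """Combined payload rate of several display links."""
    return n_outputs * link_gbit_s


def pcie_x16_gbit_s(gt_per_lane=5.0, encoding=8 / 10, lanes=16):
    """Payload rate of a x16 slot; the defaults are PCIe 2.0 figures."""
    return gt_per_lane * encoding * lanes


if __name__ == "__main__":
    print(aggregate_gbit_s(), pcie_x16_gbit_s())
```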
GPU-Processing
• Does it work? Yes, but...
GPU Kernel
Screen
Via DVI: 2.7 Gbit/s (Video)
Massively-Parallel Folding
GPU0
Video Mem
Monitor Cable
SI Time Sequence
ASIC0 ... ASICn
Local Mem
Massively-Parallel Folding
Compute Node 0
Compute Node 1
...
Compute Node 15
GPU-PC
Up to 16 Monitor Cables
ASIC-PC
Folding – Time Domain
Hypothetical Pulse Period P
time
Detects Solitary Pulsars having P ± small ΔP
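Folding at a hypothetical period can be sketched in a few lines: each sample is added to the phase bin it falls into, and a real pulsar at that period piles up in one bin. This is a software illustration only, not the Folding Processor logic; the period is given in samples and all names are hypothetical.

```python
def fold(samples, period_samples, n_bins):
    """Average the samples into n_bins phase bins at the trial period."""
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    for i, value in enumerate(samples):
        # Phase bin of sample i; the period may be fractional.
        b = int((i % period_samples) * n_bins / period_samples) % n_bins
        sums[b] += value
        counts[b] += 1
    return [s / c if c else 0.0 for s, c in zip(sums, counts)]


if __name__ == "__main__":
    # Synthetic data: one pulse every 8 samples, at offset 3.
    data = [1.0 if i % 8 == 3 else 0.0 for i in range(64)]
    print(fold(data, 8, 8))  # correct trial period: one sharp bin
    print(fold(data, 7, 7))  # wrong trial period: pulse smeared over bins
```

Folding the same data at a wrong trial period spreads the pulse over all bins, which is what makes the peak usable as a detection statistic.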
Folding – Acceleration Search
Hypothetical
Acceleration
time

Variable Bin Length
(# of Samples per Bin)
Equiv. to Time Sequence Resampling
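A variable bin length can be sketched by advancing the pulse phase with a period that drifts over the observation, so the number of samples per bin changes with time. The sketch assumes a constant line-of-sight acceleration and a first-order Doppler correction; the sign convention and all names are assumptions, and this is not the counter logic of the hardware.

```python
C_M_S = 299_792_458.0  # speed of light in m/s


def fold_accelerated(samples, dt_s, period_s, accel_m_s2, n_bins):
    """Fold with a period that changes linearly in time.
    Returns the per-bin sums."""
    profile = [0.0] * n_bins
    phase = 0.0  # pulse phase in turns, kept in [0, 1)
    for i, value in enumerate(samples):
        t = i * dt_s
        # Apparent period after first-order Doppler shift (assumed sign).
        p_now = period_s * (1.0 + accel_m_s2 * t / C_M_S)
        profile[int(phase * n_bins) % n_bins] += value
        phase = (phase + dt_s / p_now) % 1.0
    return profile


if __name__ == "__main__":
    data = [1.0 if i % 8 == 3 else 0.0 for i in range(64)]
    print(fold_accelerated(data, 1.0, 8.0, 0.0, 8))
```

With zero acceleration this reduces to ordinary folding at a fixed period.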
Harmonic Summation
Δf, 2Δf, 3Δf, 4Δf
f0, 2f0, 3f0, 4f0
freq
Detects Solitary Pulsars between f0 and f0 + Δf
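Harmonic summation adds the power found at a candidate fundamental and at its integer multiples (f0, 2f0, 3f0, 4f0), so a narrow pulse whose power is spread over several harmonics still yields one strong value. The sketch below works on a list of power-spectrum bins; it illustrates the general technique and not the ASIC design, and the function name is hypothetical.

```python
def harmonic_sum(power, n_harm):
    """For each fundamental bin k >= 1, sum power[k], power[2k], ...,
    power[n_harm * k], skipping harmonics beyond the end of the spectrum."""
    n = len(power)
    out = [0.0] * n
    for k in range(1, n):
        out[k] = sum(power[h * k] for h in range(1, n_harm + 1) if h * k < n)
    return out


if __name__ == "__main__":
    # Synthetic spectrum with a fundamental in bin 5 and three harmonics.
    spectrum = [0.0] * 64
    for b in (5, 10, 15, 20):
        spectrum[b] = 1.0
    summed = harmonic_sum(spectrum, 4)
    print(max(range(64), key=lambda k: summed[k]))
```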
Harmonic Sum - Acceleration Search
Δf, 2Δf, 3Δf, 4Δf
f0, 2f0, 3f0, 4f0
freq
Detects Binary Systems with max. Acceleration corresponding to Δf
Folding Processor
Broadcast Bus
SI Time Sequence or Power Spectrum
Accumulator
Set of Counters and Incrementers
Programmable
Memory
64 x 32 bits
To / from local Memory
Pulsar Detector ASIC
10^3 - 10^5 Folding Processors
ASIC Network
Ring Network
Rank 0
Rank 1
Rank 2
...
Rank 63
ASIC-PC
RFI Mitigation
• Subband-relative: Accumulation is per Subband
• Beam-relative
The Pulsar Search Machine
• 128 Telescopes
• 128 Beams
• 100 DMs
• 50,000 Orbits
• 6.4 x 10^8 hypothetical Pulsars
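The number of hypothetical pulsars follows from multiplying the search dimensions: beams times trial DMs times trial orbits. A short check of that arithmetic (the function name is hypothetical):

```python
def hypothetical_pulsars(n_beams, n_dms, n_orbits):
    """Size of the search space as the product of its dimensions."""
    return n_beams * n_dms * n_orbits


if __name__ == "__main__":
    print(hypothetical_pulsars(n_beams=128, n_dms=100, n_orbits=50_000))
```

The telescope count does not enter the product, since beamforming has already combined the telescopes into beams.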
The Pulsar Search Machine
• PC Cluster
• Switch-less Design, helps Scalability
• GPU-PC + ASIC-PC = Compute Node
• 64 Ranks of 16 Compute Nodes
• 2048 PCs, 4096 CPUs, 8192 GPUs, 16384 ASICs
• 32.5 M€
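The totals can be cross-checked against the rank layout: 64 ranks of 16 compute nodes give 1024 nodes, each made of a GPU-PC and an ASIC-PC. The per-node counts below are derived by division from the quoted totals; they are inferred here, not stated on the slide.

```python
RANKS = 64
NODES_PER_RANK = 16
PCS_PER_NODE = 2  # one GPU-PC plus one ASIC-PC

TOTALS = {"PCs": 2048, "CPUs": 4096, "GPUs": 8192, "ASICs": 16384}
COST_EUR = 32.5e6


def per_node_counts():
    """Divide the quoted totals by the number of compute nodes."""
    nodes = RANKS * NODES_PER_RANK
    return {name: total / nodes for name, total in TOTALS.items()}


def cost_per_node_eur():
    """Average cost per compute node implied by the quoted total."""
    return COST_EUR / (RANKS * NODES_PER_RANK)


if __name__ == "__main__":
    print(per_node_counts(), cost_per_node_eur())
```

The 16 ASICs per compute node that result are consistent with the "Up to 16 Monitor Cables" between GPU-PC and ASIC-PC.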
Thanks!
The Pulsar Search Machine
Folding Processor
Costs
Supercomputer Costs 2005
• Sandia National Laboratories Red Storm: $90 million
• Los Alamos National Laboratory ASCI Q: $215 million
• Earth Simulator Center, Japan: $250 million
• IBM Blue Gene/L: $290 million
various (unreliable) Internet Sources