A Scalable Computer Architecture for
On-line Pulsar Search on the SKA
- Draft Version -
G. Knittel, A. Horneffer
MPI for Radio Astronomy
with help from:
M. Kramer, B. Klein, R. Eatough
Bonn
GPU-Based Pulsar Timing
FFT – DeDisp – IFFT – Full Stokes – Folding
Diagram labels: From ADC, NIC, DMA, System Memory, User-Mem, CPU, Fast Bit-Reversal, GPU, DISK
GPU-Processing
FFT
Coherent Dedispersion
DM
IFFT
Full Stokes Parameters
Folding
Pulse Period
GPU-Based Pulsar Timing
Performance: 610M Samples/s (2 GPUs)
Pulsar Search
FFT
DM
IFFT
Full Stokes Parameters
Folding
Pulse Period
Pulsar Search
FFT
Loop
Trial DM
IFFT
Stokes I Parameter
Folding
Trial Pulse Period
Binary Systems Search
FFT
Loop
Trial DM
IFFT
Stokes I Parameter
Folding
Trial Pulse Period, Orbital Parameters
Pulsar Search
• Idea: Massively-Parallel Pulsar Search by Folding (in Time Domain)
• Binary Systems: Make Length of Phase Bins variable (similar to Time Sequence Resampling)
• High-Dimensional Search Space: Complete Coverage not possible.
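The folding idea can be illustrated with a minimal sketch, assuming a uniformly sampled total-power time series and a single trial period. The function name `fold`, its parameters, and the toy signal are hypothetical and are not taken from the presented system.

```python
def fold(samples, sample_time, period, n_bins):
    """Fold a time series at a trial period into n_bins phase bins.

    Returns the mean sample value per phase bin (empty bins give 0.0).
    """
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    for i, value in enumerate(samples):
        phase = (i * sample_time / period) % 1.0      # pulse phase in [0, 1)
        b = min(int(phase * n_bins), n_bins - 1)      # guard against rounding up
        sums[b] += value
        counts[b] += 1
    return [s / c if c else 0.0 for s, c in zip(sums, counts)]


# Toy signal (assumption): one bright sample every 8 samples.
signal = [1.0 if i % 8 == 0 else 0.0 for i in range(64)]
profile = fold(signal, sample_time=1.0, period=8.0, n_bins=8)
smeared = fold(signal, sample_time=1.0, period=7.0, n_bins=8)
```

With the matching trial period, all pulse energy lands in one phase bin. With a mismatched period, the pulses smear across bins, which is why each hypothetical pulsar needs its own fold.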
Pulsar Search
• Frequency Domain: Pulsar Search by Harmonic Summation
• Binary Systems: Process Range of neighboring Frequency Bins
• Use same Hardware!
• Not completely worked out yet.
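Harmonic summation can be sketched as follows, assuming a power spectrum stored as a plain list with bin 0 as DC. The name `harmonic_sum` and the toy spectrum are hypothetical. This is the textbook form without any normalization, not necessarily the variant planned for the hardware.

```python
def harmonic_sum(power, n_harmonics):
    """Sum a power spectrum over the first n_harmonics of each bin.

    power[k] is the power in frequency bin k (bin 0 = DC).
    Entry k of the result holds power[k] + power[2k] + ... for all
    harmonics that still fall inside the spectrum.
    """
    n = len(power)
    summed = [0.0] * n
    for k in range(1, n):
        for h in range(1, n_harmonics + 1):
            idx = h * k
            if idx >= n:
                break
            summed[k] += power[idx]
    return summed


# Toy spectrum (assumption): fundamental in bin 5, harmonics in bins 10, 15, 20.
spectrum = [1.0] * 32
for b in (5, 10, 15, 20):
    spectrum[b] = 4.0
result = harmonic_sum(spectrum, n_harmonics=4)
best_bin = max(range(len(result)), key=result.__getitem__)
```

The summed value peaks at the fundamental because power that a narrow pulse spreads over several harmonics is collected back into one bin.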
Pulsar Search on the SKA
• Add Coherent Beamforming
Processing chain: Polyphase Filterbank, FFT, Coherent Beamforming, IFFT, Stokes I Parameter, Power Spectrum, Folding, Harmonic Sum
Hardware: FPGA (Polyphase Filterbank), CPU (FFT, Coherent Beamforming), GPU (IFFT, Stokes I Parameter, Power Spectrum), ASIC (Folding, Harmonic Sum)
Mode of Operation (Time Domain)
Telescope 0, Telescope 1, ..., Telescope 127
Parallelization: Timeslicing
Telescope 0, Telescope 1, ..., Telescope 127
Timeslice to "Rank 0", Timeslice to "Rank 1", Timeslice to "Rank 0"
Parallelization
Telescope 0, Telescope 1, ..., Telescope 127
Required Processing Time defines Number of Ranks
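A minimal sketch of how the required processing time could fix the number of ranks. It assumes that timeslices are handed out round-robin and that a rank must finish one timeslice before its next one arrives. The function names and the timing values are made up for illustration and do not come from the slides.

```python
import math


def ranks_needed(processing_time, timeslice_length):
    """Smallest number of ranks so that no rank receives a new timeslice
    while still working on the previous one (round-robin assumption)."""
    return math.ceil(processing_time / timeslice_length)


def rank_for_timeslice(slice_index, n_ranks):
    """Round-robin assignment of consecutive timeslices to ranks."""
    return slice_index % n_ranks


# Illustrative numbers only: a 10 s timeslice that takes 25 s to process.
n_ranks = ranks_needed(processing_time=25.0, timeslice_length=10.0)
schedule = [rank_for_timeslice(i, n_ranks) for i in range(6)]
```

Under these assumptions, a longer processing time per timeslice is absorbed by adding ranks, and the per-rank hardware stays unchanged.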
Parallelization
Data from all Telescopes
Chain Network: Rank 0, Rank 1, Rank 2, ..., Rank 63
Architecture scales endlessly
Parallelization
Telescope n: 16 Subbands
Compute Node 0, Compute Node 1, ..., Compute Node 15
Parallelization: One Rank
Data Capture Phase
Compute Node 0, Compute Node 1, ..., Compute Node 15
8 Telescopes each
Parallelization: One Rank
Filtering and Subband Distribution
Ring Network: Compute Node 0, Compute Node 1, ..., Compute Node 15
Compute Node k
FPGA: Polyphase Filterbank, Data Exchange
CPU: FFT, Coherent Beamforming
Input: Tel 0 Subb k, Tel 1 Subb k, Tel 2 Subb k, ..., Tel 127 Subb k
Output: Beam 0 Subb k, Beam 1 Subb k, Beam 2 Subb k, ..., Beam n Subb k
Memories: System Mem, Local Mem
Coherent Beamforming on CPUs
• Performance using AVX: 128 Input Spectra, single-precision float, 256k Elements
• 128 Beams of same Size: 2.1s per Core @ 3.5GHz (prel. Results)
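Coherent beamforming in the frequency domain can be sketched as a weighted sum of the per-telescope spectra. The example below is a plain-Python illustration, not the AVX implementation measured above. The function name `beamform`, the weight layout, and the toy phases are assumptions.

```python
import cmath


def beamform(spectra, weights):
    """Form one coherent beam from per-telescope spectra.

    spectra[t][k]: complex spectrum of telescope t in frequency bin k.
    weights[t][k]: complex phase weight for telescope t in bin k.
    Returns beam[k] = sum over t of weights[t][k] * spectra[t][k].
    """
    n_bins = len(spectra[0])
    beam = [0j] * n_bins
    for spec, w in zip(spectra, weights):
        for k in range(n_bins):
            beam[k] += w[k] * spec[k]
    return beam


# Toy case (assumption): 4 telescopes see the same unit signal with
# telescope- and frequency-dependent phase offsets; the weights undo them.
n_tel, n_bins = 4, 8
phases = [[0.3 * t * k for k in range(n_bins)] for t in range(n_tel)]
spectra = [[cmath.exp(1j * p) for p in row] for row in phases]
weights = [[cmath.exp(-1j * p) for p in row] for row in phases]
beam = beamform(spectra, weights)
```

In this formulation, and assuming 256k means 2^18 = 262,144 elements, forming 128 beams from 128 input spectra takes 128 × 128 × 262,144 ≈ 4.3 × 10^9 complex multiply-adds.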
GPU-Processing
GPU0 ... GPUn, each with Video Mem: DeDisp, IFFT, SI
Compute Node k: FPGA (Polyphase Filterbank, Data Exchange), CPU (FFT, Coherent Beamforming), System Mem, Local Mem
Tel 0 Subb k, Tel 1 Subb k, Tel 2 Subb k, ..., Tel 127 Subb k
Beam 0 Subb k, Beam 1 Subb k, Beam 2 Subb k, ..., Beam n Subb k
GPU-Processing
• Performance: 2 Spectra, horz/vert, single-precision float, 4M Elements
• Total Power Time Sequence of same Size: 5ms
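The total power (Stokes I) step can be sketched as follows, assuming the two polarizations are available as complex voltage time series after the IFFT, and using the convention I = |h|^2 + |v|^2. The function name and the sample values are hypothetical.

```python
def stokes_i(horz, vert):
    """Total power per sample from two complex voltage series:
    I = |h|^2 + |v|^2 (horizontal and vertical polarization)."""
    return [
        h.real ** 2 + h.imag ** 2 + v.real ** 2 + v.imag ** 2
        for h, v in zip(horz, vert)
    ]


# Made-up sample values for illustration.
horz = [3 + 4j, 1 + 0j, 0j]
vert = [0j, 0 + 1j, 2 + 0j]
total_power = stokes_i(horz, vert)
```

The result is a real-valued time sequence with one value per input sample, which is the form the folding stage consumes.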
GPU-Processing
• How to output the Results to the ASICs?
• All PCIe Slots are already taken (GPUs, FPGAs)
• Write to Screen Buffer, to be output via Monitor Cable
GPU-Processing
Mini-DisplayPort: 17.28 Gbit/s
~ 70 Gbit/s
Equiv. 1 PCIe x16 Slot
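The quoted figures can be cross-checked with a few lines of arithmetic. The per-port rate follows from the DisplayPort 1.2 (HBR2) link parameters. That the ~70 Gbit/s figure is the sum of four ports, and that the comparison is with a PCIe 2.0 x16 slot, are assumptions, since the slide states neither.

```python
# DisplayPort 1.2 (HBR2): 4 lanes at 5.4 Gbit/s raw, 8b/10b line coding.
LANES = 4
LANE_RATE_GBIT = 5.4
CODING_EFFICIENCY = 0.8

port_payload = LANES * LANE_RATE_GBIT * CODING_EFFICIENCY   # Gbit/s per port

# Assumption: four ports together give the "~ 70 Gbit/s" on the slide.
ASSUMED_PORTS = 4
aggregate = ASSUMED_PORTS * port_payload

# For comparison (assumption: PCIe 2.0): 16 lanes at 5 GT/s with 8b/10b coding.
pcie2_x16 = 16 * 5.0 * 0.8                                   # Gbit/s per direction
```

Under these assumptions the aggregate display bandwidth is slightly above the payload rate of one PCIe 2.0 x16 slot.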
GPU-Processing
• Does it work? Yes, but...
Diagram labels: GPU Kernel, Screen
Via DVI: 2.7 Gbit/s (Video)
Massively-Parallel Folding
GPU0 with Video Mem: SI Time Sequence
Monitor Cable
ASIC0 ... ASICn, each with Local Mem: Massively-Parallel Folding
Massively-Parallel Folding
Compute Node 0, Compute Node 1, ..., Compute Node 15
Each Compute Node: GPU-PC, Up to 16 Monitor Cables, ASIC-PC
Folding – Time Domain
Hypothetical Pulse Period P (Axis: time)
Detects Solitary Pulsars having P ± small ΔP
Folding – Acceleration Search
Hypothetical Acceleration (Axis: time)
Variable Bin Length (# of Samples per Bin)
Equiv. to Time Sequence Resampling
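A minimal sketch of folding with a variable bin length. It models the hypothetical acceleration as a linear drift of the apparent pulse period, which is a simplification of real orbital motion. The name `fold_accelerated`, the `period_drift` parameter, and the toy signal are assumptions.

```python
def fold_accelerated(samples, sample_time, period, period_drift, n_bins):
    """Fold with a trial period that drifts linearly in time.

    period_drift is the change of period per unit time (dimensionless).
    Because the phase advances by sample_time / current_period per sample,
    each phase bin spans a varying number of samples.
    Returns (sums, counts) per phase bin.
    """
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    phase = 0.0
    for i, value in enumerate(samples):
        b = min(int(phase * n_bins), n_bins - 1)
        sums[b] += value
        counts[b] += 1
        current_period = period + period_drift * i * sample_time
        phase = (phase + sample_time / current_period) % 1.0
    return sums, counts


# Zero drift reduces to ordinary folding with a fixed bin length.
signal = [1.0 if i % 8 == 0 else 0.0 for i in range(64)]
sums, counts = fold_accelerated(signal, 1.0, 8.0, 0.0, 8)
# With drift, the same bins cover different numbers of samples over time.
drift_sums, drift_counts = fold_accelerated(signal, 1.0, 8.0, 0.05, 8)
```

Stretching or shrinking the bins has the same effect as resampling the time sequence onto a uniform phase grid and then folding with fixed bins.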
Harmonic Summation
Frequency axis: bands of width Δf, 2Δf, 3Δf, 4Δf starting at f0, 2f0, 3f0, 4f0
Detects Solitary Pulsars between f0 and f0 + Δf
Harmonic Sum - Acceleration Search
Frequency axis: bands of width Δf, 2Δf, 3Δf, 4Δf starting at f0, 2f0, 3f0, 4f0
Detects Binary Systems with max. Acceleration corresponding to Δf
Folding Processor
Broadcast Bus: SI Time Sequence or Power Spectrum
Accumulator
Set of Counters and Incrementers
Programmable
Memory: 64 x 32 bits
To / from local Memory
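A behavioural sketch of one folding processor, assuming it adds each broadcast sample into a 64 x 32-bit accumulator memory and steps to the next phase bin after a programmable number of samples. The slide only names the components, so the update rule, the class name, and all parameters below are assumptions.

```python
class FoldingProcessor:
    WORDS = 64            # accumulator memory: 64 words
    MASK = 0xFFFFFFFF     # 32-bit words, wrapping on overflow (assumption)

    def __init__(self, samples_per_bin, n_bins=WORDS):
        if not 1 <= n_bins <= self.WORDS:
            raise ValueError("n_bins must fit into the 64-word memory")
        self.samples_per_bin = samples_per_bin   # programmable bin length
        self.n_bins = n_bins
        self.memory = [0] * self.WORDS
        self.bin = 0      # phase-bin counter
        self.count = 0    # samples seen in the current bin

    def clock(self, sample):
        """Consume one integer sample from the broadcast bus."""
        self.memory[self.bin] = (self.memory[self.bin] + sample) & self.MASK
        self.count += 1
        if self.count == self.samples_per_bin:
            self.count = 0
            self.bin = (self.bin + 1) % self.n_bins


# Two processors with different trial periods (8 and 12 samples) see one stream.
processors = [FoldingProcessor(samples_per_bin=s, n_bins=4) for s in (2, 3)]
stream = [5, 0, 0, 0, 0, 0, 0, 0] * 4    # made-up pulse every 8 samples
for sample in stream:
    for p in processors:
        p.clock(sample)
```

Only the processor whose trial period matches the pulse spacing collects all pulses in a single bin. Every processor reads the same bus, so adding hypotheses means adding processors rather than bandwidth.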
Pulsar Detector ASIC
10^3 - 10^5 Folding Processors
ASIC Network
Ring Network: Rank 0, Rank 1, Rank 2, ..., Rank 63
ASIC-PC
RFI Mitigation
• Subband-relative: Accumulation is per Subband
• Beam-relative
The Pulsar Search Machine
• 128 Telescopes
• 128 Beams
• 100 DMs
• 50,000 Orbits
• 6.4 x 10^8 hypothetical Pulsars
• PC Cluster
• Switch-less Design, helps Scalability
• GPU-PC + ASIC-PC = Compute Node
• 64 Ranks of 16 Compute Nodes
• 2048 PCs, 4096 CPUs, 8192 GPUs, 16384 ASICs
• 32.5 M€
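The totals above can be reproduced with a short calculation. The inputs are taken from the slide. The per-PC breakdown (CPUs per PC, GPUs per GPU-PC, ASICs per ASIC-PC) is inferred by division and is not stated in the original.

```python
# Search space: one hypothetical pulsar per (beam, DM, orbit) combination.
beams, dms, orbits = 128, 100, 50_000
hypotheses = beams * dms * orbits

# Machine size: each Compute Node is one GPU-PC plus one ASIC-PC.
ranks, nodes_per_rank = 64, 16
compute_nodes = ranks * nodes_per_rank
pcs = 2 * compute_nodes

# Inferred per-PC counts (assumption: components are spread evenly).
cpus_per_pc = 4096 // pcs
gpus_per_gpu_pc = 8192 // compute_nodes
asics_per_asic_pc = 16384 // compute_nodes

# Data capture: 16 Compute Nodes per rank with 8 telescopes each.
telescopes = nodes_per_rank * 8
```

This gives 6.4 x 10^8 hypotheses and 2048 PCs, matching the slide, with 2 CPUs per PC, 8 GPUs per GPU-PC, and 16 ASICs per ASIC-PC under the even-spread assumption.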
Thanks!
Folding Processor
Costs
Supercomputer Costs 2005
• Sandia National Laboratories Red Storm: $90 million
• Los Alamos National Laboratory ASCI Q: $215 million
• Earth Simulator Center, Japan: $250 million
• IBM Blue Gene/L: $290 million
various (unreliable) Internet Sources
