A Scalable Computer Architecture for On-line Pulsar Search on the SKA
Transcription
A Scalable Computer Architecture for On-line Pulsar Search on the SKA (Draft Version)
G. Knittel, A. Horneffer, MPI for Radio Astronomy, Bonn
With help from: M. Kramer, B. Klein, R. Eatough

GPU-Based Pulsar Timing
Pipeline: FFT, dedispersion, IFFT, full Stokes, folding.
Data path: from the ADC through the NIC into system memory, then via user-mem DMA to the GPU (with fast bit-reversal); the CPU controls the flow and results go to disk.

GPU Processing
• FFT
• Coherent dedispersion (known DM)
• IFFT
• Full Stokes parameters
• Folding (known pulse period)
Performance: 610M samples/s (2 GPUs).

Pulsar Search
The same pipeline, but dedispersion and folding run in a loop over trial parameters:
• FFT
• Loop: coherent dedispersion (trial DM), IFFT, Stokes I parameter, folding (trial pulse period)
For binary systems, the loop additionally covers trial orbital parameters.

Pulsar Search: Idea
• Massively-parallel pulsar search by folding (in the time domain).
• Binary systems: make the length of the phase bins variable (similar to time-sequence resampling).
• High-dimensional search space: complete coverage is not possible.

Pulsar Search: Frequency Domain
• Massively-parallel pulsar search by harmonic summation.
• Binary systems: process a range of neighboring frequency bins.
• Use the same hardware!
• Not completely worked out yet.
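To make the time-domain folding idea concrete, here is a minimal NumPy sketch (not the authors' GPU/ASIC implementation; the sample spacing, pulse shape, and bin count are made-up illustration values):

```python
import numpy as np

def fold(time_series, dt, period, n_bins=32):
    """Fold a regularly sampled time series at a trial period.

    Each sample is assigned to a phase bin; bins accumulate the mean
    intensity, so a real pulse piles up at one phase while noise averages out.
    """
    t = np.arange(len(time_series)) * dt
    phase_bins = ((t % period) / period * n_bins).astype(int)
    profile = np.zeros(n_bins)
    counts = np.zeros(n_bins)
    np.add.at(profile, phase_bins, time_series)
    np.add.at(counts, phase_bins, 1)
    return profile / np.maximum(counts, 1)

# Synthetic example: Gaussian noise plus a narrow pulse every 0.5 s
dt, period = 1e-3, 0.5
t = np.arange(100_000) * dt
signal = np.random.default_rng(0).normal(0.0, 1.0, t.size)
signal[(t % period) < 0.01] += 5.0       # pulse at phase 0
profile = fold(signal, dt, period)
print(profile.argmax())                  # pulse appears in bin 0
```

The search repeats this for every trial period (and, for binaries, for every trial orbit via variable bin lengths), which is what makes massive parallelism attractive.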
Pulsar Search on the SKA
• Add coherent beamforming.
Two processing chains share the front end:
• Polyphase filterbank → FFT → coherent beamforming → coherent dedispersion
• Time domain: IFFT → Stokes I parameter → folding
• Frequency domain: Stokes I parameter → power spectrum → harmonic sum

Hardware mapping:
• Polyphase filterbank: FPGA
• FFT, coherent beamforming: CPU
• Coherent dedispersion, IFFT, Stokes I parameter, power spectrum: GPU
• Folding, harmonic sum: ASIC

Mode of Operation (Time Domain)
• Data stream in continuously from telescopes 0 to 127.
• Parallelization by timeslicing: successive timeslices go to "Rank 0", "Rank 1", ... and then back to "Rank 0".
• The required processing time per timeslice defines the number of ranks.
• Data from all telescopes pass along a chain network through Rank 0, Rank 1, Rank 2, ..., Rank 63; the architecture scales endlessly.

Parallelization: One Rank
• Each telescope's signal is split by a polyphase filterbank into 16 subbands, one per compute node (compute nodes 0 to 15).
• Data capture phase: each of the 16 compute nodes captures 8 telescopes.
• Filtering and subband distribution run over a ring network.

Coherent Beamforming (Compute Node k)
• System memory holds beam 0 to beam n, subband k; local memory holds telescope 0 to telescope 127, subband k.
• FPGA: polyphase filterbank, data exchange.
• CPU: FFT, coherent beamforming.

Coherent Beamforming on CPUs
• Performance using AVX: 128 input spectra, single-precision float, 256k elements each.
• 128 beams of the same size: 2.1 s per core @ 3.5 GHz (preliminary results).
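The coherent beamforming step benchmarked above can be illustrated with a simplified phase-and-sum model (a NumPy sketch, not the hand-vectorized AVX code; the telescope count, frequencies, and delays are made-up):

```python
import numpy as np

def beamform(spectra, freqs, delays):
    """Coherently beamform frequency-domain data from several telescopes.

    spectra: (n_tel, n_chan) complex spectra, one row per telescope
    freqs:   (n_chan,) channel frequencies in Hz
    delays:  (n_tel,) geometric delays in seconds toward the beam direction

    Each spectrum is phase-rotated to compensate its telescope's delay,
    then all telescopes are summed; signals from the beam direction add
    in phase while off-axis signals do not.
    """
    phases = np.exp(-2j * np.pi * np.outer(delays, freqs))  # (n_tel, n_chan)
    return (spectra * phases).sum(axis=0)

# Toy example: 8 telescopes see the same tone, each with its own delay
rng = np.random.default_rng(1)
n_tel, n_chan = 8, 1024
freqs = np.linspace(100e6, 200e6, n_chan)
delays = rng.uniform(0, 1e-7, n_tel)
tone = np.zeros(n_chan, complex)
tone[300] = 1.0
spectra = tone * np.exp(2j * np.pi * np.outer(delays, freqs))
beam = beamform(spectra, freqs, delays)
print(abs(beam[300]))   # ~8.0: coherent sum of 8 telescopes
```

One such sum must be formed per beam and per subband, which is why the per-node layout above keeps all telescopes' subband k in local memory.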
GPU Processing (Compute Node k)
• GPUs 0 to n run dedispersion, IFFT, and Stokes I out of video memory.
• System memory holds beam 0 to beam n, subband k; local memory holds telescope 0 to telescope 127, subband k; FPGA: polyphase filterbank, data exchange; CPU: FFT, coherent beamforming.
• Performance: 2 spectra (horizontal/vertical polarization), single-precision float, 4M elements; a total-power time sequence of the same size takes 5 ms.

GPU Processing: Output
• How to output the results to the ASICs?
• All PCIe slots are already taken (GPUs, FPGAs).
• Idea: write the results to the screen buffer, to be output via the monitor cable.
• Mini-DisplayPort: 17.28 Gbit/s each; ~70 Gbit/s in total, the equivalent of one PCIe x16 slot.
• Does it work? Yes, but... the GPU kernel must write into the screen buffer, and via DVI only 2.7 Gbit/s (video) are available.

Massively-Parallel Folding
• The Stokes I time sequence travels from GPU video memory over the monitor cable into the local memories of ASIC 0 to ASIC n.
• Each GPU-PC is connected to its ASIC-PC by up to 16 monitor cables (compute nodes 0 to 15).

Folding: Time Domain
• Fold with a hypothetical pulse period P; this detects solitary pulsars with periods within a small range around P.

Folding: Acceleration Search
• Fold with a hypothetical acceleration: the bin length (number of samples per bin) is made variable, equivalent to time-sequence resampling.

Harmonic Summation
• Sum the power spectrum at f0, 2f0, 3f0, 4f0; this detects solitary pulsars with frequencies between f0 and f0 + Δf.

Harmonic Sum: Acceleration Search
• Sum over a range of neighboring frequency bins; this detects binary systems up to a maximum acceleration corresponding to Δf.
Folding Processor
• A broadcast bus delivers the Stokes I time sequence or the power spectrum to all folding processors.
• Each folding processor contains an accumulator, a set of counters and incrementers, and a programmable memory of 64 x 32 bits, with a connection to/from local memory.

Pulsar Detector ASIC
• 10^3 to 10^5 folding processors per ASIC.

ASIC Network
• Ring network over the ASIC-PCs of Rank 0, Rank 1, Rank 2, ..., Rank 63.

RFI Mitigation
• Subband-relative: accumulation is per subband.
• Beam-relative.

The Pulsar Search Machine
• 128 telescopes, 128 beams, 100 DMs, 50,000 orbits.
• 128 beams x 100 DMs x 50,000 orbits = 6.4 x 10^8 hypothetical pulsars.
• PC cluster; the switch-less design helps scalability.
• GPU-PC + ASIC-PC = one compute node; 64 ranks of 16 compute nodes.
• 2048 PCs, 4096 CPUs, 8192 GPUs, 16384 ASICs.
• Cost: 32.5 M€.

Thanks!

Supercomputer Costs 2005
• Sandia National Laboratories Red Storm: $90 million
• Los Alamos National Laboratory ASCI Q: $215 million
• Earth Simulator Center, Japan: $250 million
• IBM Blue Gene/L: $290 million
(From various, unreliable, Internet sources.)
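The harmonic summation used in the frequency-domain search can be sketched as follows (a minimal NumPy model; the number of harmonics and the toy spectrum are illustration assumptions, not the ASIC's actual parameters):

```python
import numpy as np

def harmonic_sum(power, n_harm=4):
    """Sum the power spectrum at each trial frequency and its harmonics.

    For each fundamental bin k, add power[k] + power[2k] + ... + power[n_harm*k],
    so narrow pulses whose power is spread over many harmonics become
    detectable at their fundamental frequency.
    """
    n_out = len(power) // n_harm          # highest harmonic must stay in range
    out = np.zeros(n_out)
    for h in range(1, n_harm + 1):
        out += power[np.arange(n_out) * h]
    return out

# Toy spectrum: chi-squared noise plus a pulsar at bin 100 with 4 harmonics
rng = np.random.default_rng(2)
power = rng.chisquare(2, 4096)
for h in (1, 2, 3, 4):
    power[100 * h] += 20.0
summed = harmonic_sum(power)
print(summed.argmax())   # 100: the pulsar's fundamental bin
```

For the acceleration search, the same sum is taken over a small range of neighboring bins per harmonic, mirroring the variable-bin-length trick of time-domain folding.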