an energy and bandwidth efficient ray tracing architecture
Transcription
an energy and bandwidth efficient ray tracing architecture
AN ENERGY AND BANDWIDTH EFFICIENT RAY TRACING ARCHITECTURE Daniel Kopta, Konstantin Shkurko, Josef Spjut, Erik Brunvand, Al Davis The Goal 2 ! Can we reduce ray tracing energy consumption without hurting frame rate? TRaX – SPMD Ray Tracing GPU 3 Cycle accurate simulator ! Tiled “Thread Multiprocessors” (TMs) ! Instruction Cache Instruction Cache Threads Local Store Registers Integer/Branch L1 Data Cache Floating Point Units Where Does Energy Go? 4 Energy/Frame (Joules) ! Energy estimates from Cacti and Synopsys L1 0.22 L2 1.31 DRAM 2.11 0.57 0.94 1.2 ~200 Watts at 30 FPS Instruction Cache Register File Compute (XUs) What can we do? 5 ! Consider an ASIC ! Great ! Configurable persistent pipelines ! Almost ! energy/delay, but not flexible as good, but more flexible Coherent data access ! Stay in pipeline mode longer ! Reduce bandwidth requirements Macro Instruction Pipelines 6 General Purpose Reconfigurable Pipeline Instruction fetch Instruction fetch Register File Register File Instruction fetch XU XU XU XU Register File Register File Register File Ray - Box/Triangle Dataflow 7 TRaX TM 8 TPs Shared XUs TPs Reconfigured Pipeline 9 Special Regs Box Unit TPs Shared FUs TPs Pipeline Persistence 10 ! Opportunities in RT ! Box test (traversal) ! Triangle test (intersection) ! Others? ! Streaming, Ray sorting, etc… ! StreamRay: Gribble, Ramani, 2008, 2009 ! Treelet decomposition: Aila & Karras, 2010 Treelets 11 ! Sized to fit in L1 cache = box stream = triangle stream 1 2 3 4 STRaTA 12 ! Streaming Treelet Ray Tracing Architecture ! HW support for treelet ray streams ! Repurposes most of L2 cache ! L1 hit rate increases, off-chip access decreases ! Enables pipelines to persist longer ! Reduce instruction fetch and register access Test Scenes 13 Hairball ! R Sibenik Vegetation Where Does Energy Go? 14 Energy/Frame (Joules) L1 Treelets L2 DRAM Persistent pipelines 0.22 1.31 2.11 Instruction Cache 0.57 Register File Compute (XUs) 0.94 1.2 Results (ICache + RF) 15 I$ + Register File enery (Joules) 4.7 Baseline Treelets Only STRaTA 4.4 3.1 2.95 3.0 2.6 2.9 2.5 2.2 Hairball Sibenik Vegetation Results (L2 + DRAM energy) 16 L2 Dcache + DRAM energy (Joules) 2.18 Baseline STRaTA 1.64 1.54 1.18 .72 .43 Hairball Sibenik Vegetation Results 17 Energy/Frame (Joules) Hairball Sibenik 0.14 1.1 1.5 0.53 Vegetation 1.2 2.6 0.33 0.22 0.85 1.4 0.8 0.29 1.6 0.91 0.13 1.7 0.48 1.1 0.3 1.8 1 L1 L2/Stream DRAM Instruction Cache Register File Compute (XUs) Saved Conclusions 18 ! Up to 38% combined energy reduction ! (memory hierarchy, I$, RF) ! No significant change in performance ! Simple HW/SW modifications Thanks! 19 Thanks to: Samuli Laine for Hairball and Vegetation, Marko Dabrovic for Sibenik Anonymous reviewers Utah Hardware Ray Tracing Group