Swizzle Switch
Transcription
Swizzle Switch: A Self-Arbitrating High-Radix Crossbar for NoC Systems
Ronald Dreslinski, Korey Sewell, Thomas Manville, Sudhir Satpathy, Nathaniel Pinckney, Geoff Blake, Michael Cieslak, Reetuparna Das, Thomas Wenisch, Dennis Sylvester, David Blaauw, and Trevor Mudge
University of Michigan

Outline
Swizzle Switch—Circuit & Microarchitecture: Overview, Arbitration, Prototype
Swizzle Switch—Cache Coherent Manycore Interconnect: Motivation & Existing Interconnects, Swizzle Switch Interconnect, Evaluation

Swizzle Switch vs. Conventional Matrix Crossbar
Embeds arbitration within the crossbar—single-cycle arbitration
Re-uses the input/output data buses for arbitration
SRAM-like layout with priority bits at the crosspoints
Low-power optimizations
Excellent scalability

Data Routing
[Figure: 128-bit data routing. Source IPs drive pre-charged bit lines into the crossbar; a bit line is discharged when Data = "1" and the crosspoint = "1". Read buffers at each destination IP recover the data, and setting multiple crosspoints on an input row provides multicast and broadcast.]

Swizzle Switch Architecture
[Figure: The output data bus is re-used as the priority bus. Each source IP drives Req/Rel signals, priority vectors are stored at the crosspoints, and a sense amp + latch at each crosspoint feeds the read buffers of the destination IP.]
Data routing, arbitration, and priority-update control are embedded within the crosspoints.

Outline (next: Arbitration)

Inhibit-Based Arbitration
[Figure: One output column of the Swizzle Switch, with the output bus re-used as the priority bus; inputs l, m, and n hold priorities x, y, and z, with x < y and z > y.]
This diagram shows a single column (one output) of the Swizzle Switch; each output arbitrates and transfers data independently.
Each crosspoint has a sense amp/latch to indicate connectivity.
Each input samples a unique bit of the output bus to determine whether it has been granted the channel.
Priority vectors are stored at the crosspoints; when a request is issued, they discharge bits along the output column to inhibit lower-priority requests.
Finally, the priority vectors are updated when the data transfer completes.
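As a software analogy for the mechanism just described, the following minimal Python sketch models one output column: each input holds an inhibit vector over the other inputs (its stored priority vector), a requesting input discharges the priority lines of every input it outranks, and a requester wins if its own line is still charged. Function and variable names are illustrative, not taken from the design.

```python
def arbitrate_column(requests, inhibit):
    """Behavioral model of one output column of inhibit-based arbitration.

    requests: set of input indices requesting this output.
    inhibit[i][j] == 1 means input i has priority over input j
    (i discharges j's priority line when i requests).
    Returns the winning input index, or None if there are no requests.
    """
    n = len(inhibit)
    # All priority lines start pre-charged (logic 1).
    line = [1] * n
    # Each requester discharges the lines of the inputs it outranks.
    for i in requests:
        for j in range(n):
            if i != j and inhibit[i][j]:
                line[j] = 0
    # A requester is granted if its own priority line is still charged.
    winners = [i for i in requests if line[i] == 1]
    assert len(winners) <= 1, "a consistent priority matrix yields one winner"
    return winners[0] if winners else None


# Example: input 2 outranks 0 and 1, input 1 outranks 0.
inhibit = [
    [0, 0, 0],   # input 0 inhibits nobody (lowest priority)
    [1, 0, 0],   # input 1 inhibits input 0
    [1, 1, 0],   # input 2 inhibits inputs 0 and 1 (highest priority)
]
print(arbitrate_column({0, 1}, inhibit))  # -> 1 (input 1 wins over input 0)
```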
Least Recently Granted (LRG)
[Figure: The same output column, with the output bus re-used as the priority bus. An input's priority determines how many priority lines it discharges: the lowest priority discharges no priority lines, an intermediate priority discharges some, and the highest priority discharges all of them.]

Least Recently Granted (LRG)
Example arbitration:
(1) Req l and Req m request the bus.
(2) Req m discharges priority line l; priority lines m and n remain charged.
(3) Req l senses priority line l and is inhibited (not granted); Req m senses priority line m and is not inhibited.
(4) The crosspoint records the connectivity at input m.

Least Recently Granted (LRG)
Example priority update:
Input m signals that it is done with its data transfer by asserting Rel m, which SETs and RESETs the affected priority bits.
[Figure: After the update, input l holds priority x+1, input m holds priority 0, and input n still holds priority z. As before, the lowest priority discharges no priority lines, an intermediate priority discharges some, and the highest priority discharges all of them.]

LRG algorithm (when input m releases the channel):
Input m drops to the least priority (y → 0).
Inputs whose priority was below m's are upgraded by one (x → x+1 for input l).
Inputs whose priority was above m's are unchanged (z → z for input n).
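The LRG update rule above can likewise be sketched in a few lines of Python. This is a behavioral model of the x → x+1, y → 0, z → z transition shown in the slides, with hypothetical names; the hardware performs the same update by resetting and setting priority bits at the crosspoints.

```python
def lrg_update(priority, released):
    """Least-Recently-Granted priority update for one output column.

    priority: list mapping input index -> priority level (0 = lowest).
    released: index of the input that just finished its transfer.
    """
    old = priority[released]
    new_priority = list(priority)
    for i, p in enumerate(priority):
        if i == released:
            new_priority[i] = 0        # granted input becomes least priority
        elif p < old:
            new_priority[i] = p + 1    # inputs below it are upgraded by one
        # inputs above the released priority are unchanged
    return new_priority


# Inputs l, m, n with priorities x=0, y=1, z=2; m releases the channel.
print(lrg_update([0, 1, 2], released=1))  # -> [1, 0, 2]
```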
Outline (next: Prototype)

64×64 Prototype
[Figure: Die photo of the 64×64 Swizzle Switch prototype.]

Measurement Results
[Figure: Measured results for the prototype.]

Measurement Results
[Figure: Additional measured results for the prototype.]

Outline (next: Motivation & Existing Interconnects)

Scaling Interconnect for Many-Cores
Existing interconnects—buses, crossbars, rings—are limited to roughly 16 cores.
Others' interconnect proposals for many-cores: packet-switched, multi-hop networks-on-chip (NoCs); grids of routers—meshes, tori, and flattened butterflies.
Our proposal: Swizzle Switch Networks—a flat, single-stage, one-hop, "crossbar++" interconnect.

Mesh Network-on-Chip
[Figure: 64-tile mesh NoC floorplan. Each 1.73 mm × 1.73 mm tile (3.0 mm²) contains an ARM Cortex-A5 core with 32 kB I-cache and 32 kB D-cache, a 256 kB L2 cache bank, and a router; memory controllers sit along the chip edges. Die: 13.8 mm × 13.8 mm, area = 190 mm².]

Flattened Butterfly Network-on-Chip
[Figure: 64-tile flattened-butterfly NoC floorplan with concentrated routers and memory controllers along the chip edges. Die: 13.8 mm × 13.8 mm, area = 190 mm².]

Motivating Swizzle Switch Networks
Uniform access latency: eases programming, data placement, thread placement, ...
Low power.
Simplicity: packet-switched NoCs need routing, congestion management, flow control, wormhole switching, ...

Motivating Swizzle Switch Networks
[Figure: Accepted throughput (flits/node/cycle) per node under hotspot traffic, where all nodes send data to node (8,8). Mesh: average = 0.017 flits/node/cycle, unfairness = 42.2. SSN: average = 0.013 flits/node/cycle, unfairness = 1.008.]
Unfairness = (highest per-node throughput) / (lowest per-node throughput).
Under hotspot traffic, the crossbar has slightly lower throughput than the mesh but is about 40× more fair.

Motivating Swizzle Switch Networks
[Figure: Accepted throughput (flits/node/cycle) per node under uniform random traffic. Mesh: average = 0.36 flits/node/cycle, unfairness = 1.89. SSN: average = 0.47 flits/node/cycle, unfairness = 1.02.]
In the mesh, nodes closest to the center receive the highest throughput.
Under uniform random traffic, the crossbar has higher throughput than the mesh and is 87% more fair.
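For reference, the two metrics quoted in these comparisons can be computed as below; this is just the slide's definition of average throughput and unfairness applied to a per-node throughput list, and the example numbers are made up rather than taken from the measured data.

```python
def throughput_stats(throughput):
    """Compute the two metrics used in the traffic comparison.

    throughput: iterable of per-node accepted throughput (flits/node/cycle).
    Unfairness is defined on the slide as the highest per-node throughput
    divided by the lowest.
    """
    values = list(throughput)
    average = sum(values) / len(values)
    unfairness = max(values) / min(values)
    return average, unfairness


# Toy 4-node example (not the measured data): a hotspot-like pattern
# where one node is starved produces a large unfairness value.
print(throughput_stats([0.020, 0.018, 0.019, 0.001]))  # avg 0.0145, unfairness 20.0
```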
Outline (next: Swizzle Switch Interconnect)

Top-Level Floorplan
[Figure: Top-level floorplan of the 64-core Swizzle Switch Network design. Each core tile (ARM Cortex-A5 with 32 kB I-cache and 32 kB D-cache) is 0.86 mm × 0.86 mm, for a core + L1 area of 0.74 mm²; each 512 kB L2 cache bank is 1.38 mm × 3.27 mm, for an L2 area of 4.50 mm². Three Swizzle Switches sit at the center of the die, with memory controllers along the edges. Die: 14.3 mm × 14.3 mm, total area = 204 mm².]

Outline (next: Evaluation)

Evaluation Simulation Parameters
Processors (both): 64 in-order cores, 1 IPC, 1.5 GHz
L1 cache (both): 32 kB I/D caches, 4-way associative, 64-byte lines, 1-cycle latency
L2 cache (NoC, Mesh/FBFly): shared 16 MB, 64-way banked, 8-way associative, 64-byte lines, 10-cycle latency
L2 cache (SSN): shared 16 MB, 32-way banked, 16-way associative, 64-byte lines, 11-cycle latency
Interconnect (NoC, Mesh/FBFly): 3.0 GHz, 128-bit, 4-stage routers, 3 virtual networks with 3 virtual channels
Interconnect (SSN): 1.5 GHz, 64×32×128-bit Swizzle Switch Network
Main memory (both): 4096 MB, 50-cycle latency
Benchmarks: SPLASH-2, a scientific parallel application suite

Results—Performance & QoS
[Figure: Normalized execution time for each SPLASH-2 benchmark (Barnes, Cholesky, FFT, FMM, Lu Contig, Lu NonContig, Ocean Contig, Ocean NonContig, Radix, Raytrace, Water NSquared, Water Spatial) on the Mesh, FBFly, and SSN, broken into core-active time, memory stall, and synchronization stall; covers both overall performance and quality of service.]

Results—Power
[Figure: Interconnect power (mW) per benchmark and the weighted average for the Mesh, FBFly, and SSN, broken into wire dynamic, SSN/router dynamic, SSN/router leakage, clock, and buffer power; and total benchmark energy (pJ/inst), broken into interconnect, L2 dynamic, L2 leakage, core/L1 dynamic, and core/L1 leakage.]
On average the SSN uses 28% less power in the interconnect than a flattened butterfly, which results in an average reduction of 11% in the total system energy to complete the task.

Summary
Swizzle Switch prototype (45 nm): 64×64 crossbar with 128-bit buses and embedded LRG priority arbitration; achieved 4.4 Tbps at ~600 MHz while consuming only 1.3 W.
Swizzle Switch Network evaluation: improved performance by 21%, reduced power by 28%, and reduced latency variability by 3×.
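As a rough sanity check on the summary numbers, the sketch below computes the aggregate capacity of a 64-output crossbar with 128-bit buses at a given clock. The utilization argument is a hypothetical knob for illustration; the 4.4 Tbps figure is a measured result from the prototype, not something derived from this formula.

```python
def crossbar_peak_bandwidth(ports, bus_bits, freq_hz, utilization=1.0):
    """Aggregate bandwidth in bits/s if `utilization` of the output
    ports transfer `bus_bits` bits every cycle at `freq_hz`."""
    return ports * bus_bits * freq_hz * utilization


# 64 outputs x 128-bit buses at ~600 MHz gives ~4.9 Tbps of raw capacity;
# the measured 4.4 Tbps on the summary slide is roughly 90% of that
# ceiling (the utilization factor here is illustrative, not measured).
peak = crossbar_peak_bandwidth(64, 128, 600e6)
print(peak / 1e12)     # ~4.9 Tbps raw capacity
print(4.4e12 / peak)   # ~0.9 of the raw ceiling
```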
Additional Detailed Slides

Arbitration Mechanism (Matrix View)
[Figure: Matrix view of the arbitration mechanism. Each source IP drives Req/Rel signals and 128-bit data onto pre-charged bit lines; read buffers sit at each destination IP. An inhibit matrix X records, for every pair of inputs, whether one input inhibits (outranks) the other, and the number of set bits in an input's row gives its priority level.]

Least Recently Granted (LRG)
[Figure: Inhibit matrix before and after input In4 releases the channel. Before: In0 = X 1 0 0 0 (priority 1), In1 = 0 X 0 0 0 (priority 0), In2 = 1 1 X 0 0 (priority 2), In3 = 1 1 1 X 1 (priority 4), In4 = 1 1 1 0 X (priority 3). The release RESETs In4's row (its priority drops to 0) and SETs the In4 column bit in the other rows. After: In0 = X 1 0 0 1 (priority 2), In1 = 0 X 0 0 1 (priority 1), In2 = 1 1 X 0 1 (priority 3), In3 = 1 1 1 X 1 (priority 4, unchanged), In4 = 0 0 0 0 X (priority 0).]

Round Robin Arbitration
[Figure: The same output column with Req/Rel signals for inputs l, m, and n; asserting Rel SETs and RESETs the priority bits.]

Round Robin Arbitration
[Figure: After the round-robin update, input l holds priority x+1, input m holds priority y+1, and the input that held the highest priority wraps around to priority 0; the priority ordering rotates by one position on each release.]

QoS Arbitration
[Figure: QoS arbitration compared with LRG arbitration. Each requester carries a QoS value; the requesters with the highest QoS value are considered first, and the winner is chosen among them by priority.]

Timing Diagram
[Figure: Timing diagram of a Swizzle Switch transaction.]

Crosspoint Circuit
[Figure: Crosspoint(x,y) circuit, showing the word line, clock/clock_b, Req/Rel/Ack, Reset_b, the priority bit, the priority line on out<y>, and the sense amp, with data inputs in<0..127> and outputs out<0..127>.]
in<x> is used to index Req/Rel.
out<y> is sampled for arbitration.
in<x+64> is used for sending the ack.
out<y> is discharged for the LRG update.

Regenerative Bit-line Repeater
[Figure: Regenerative bit-line repeaters (Bitline_L/Bitline_R) placed between 16×16 macros of the SSN.]
Regeneration and decoupling improve speed.

Simulated Bit-line Delay Improvement
[Figure: Simulated bit-line delay with and without repeaters. Technology: 45 nm, supply: 1.1 V, temperature: 25°C.]

SSN Scaling: Simulation
[Figure: Simulated SSN scaling without repeaters, with repeaters and a 128-bit bus, and with repeaters and a 256-bit bus. Technology: 45 nm, supply: 1.1 V, temperature: 25°C.]
Regenerative repeaters improve SSN scalability.

Swizzle Switch Network-on-Chip
[Figure: Classification of coherence traffic by source and destination (L1 and L2): shared data, data forwarding, requests, writebacks, L2 responses, and invalidations, mapped onto the Swizzle Switch network.]

Results—64-core with A9 O3 Cores
[Figure: Results for a 64-core configuration with out-of-order ARM Cortex-A9 cores.]
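Finally, a minimal sketch of the QoS arbitration pictured in the detailed slides, under the assumption (taken from the figure) that only the requesters carrying the highest QoS value compete and that LRG priority then selects the winner among them. Names and data are illustrative.

```python
def qos_arbitrate(requests, qos, priority):
    """Pick a winner among requesting inputs.

    requests: iterable of requesting input indices.
    qos:      per-input QoS value (higher = more important).
    priority: per-input LRG priority level (higher = granted less recently).
    Assumption from the slide: only the requesters with the highest QoS
    value compete, and LRG priority breaks the tie among them.
    """
    requests = list(requests)
    if not requests:
        return None
    best_qos = max(qos[i] for i in requests)
    candidates = [i for i in requests if qos[i] == best_qos]
    return max(candidates, key=lambda i: priority[i])


# Inputs 0..4 with QoS values and LRG priorities; inputs 0, 2, and 4 request.
qos      = [3, 0, 3, 2, 1]
priority = [1, 4, 3, 0, 2]
print(qos_arbitrate([0, 2, 4], qos, priority))  # -> 2 (QoS 3, higher LRG priority than input 0)
```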