The PicoSecond is Dead, Long Live the PicoJoule

Transcription

The PicoSecond is Dead, Long Live the PicoJoule
Christos Kozyrakis
Stanford University
http://mast.stanford.edu
Scaling Systems without Technology Help
Systems Scaling
•  Higher performance
•  More data
•  Constant cost (power, $)
Scaling Past
[Source: CPUDB]
Transistors ↑ + frequency ↑ = performance ↑
Scaling Past
Power ≈ C × Vdd² × Freq
Transistors ↑ + frequency ↑
Capacitance ↓ + voltage ↓
Power density ≈ constant
Scaling Present
[Source: CPUDB]
Voltage scaling ✖
Transistors ↑ ⇒ power density ↑
Frequency ≈ flat
Scaling Present
Power = C × Vdd² × Freq
All chips are now power limited
The switch to multi-core does not change this
Scaling without Technology Help
[Figure: performance/cost (log scale) by decade, 80s through 40s; the region beyond the 2010s is labeled "Our Challenge"]
[Source: Hill & Kozyrakis'12]
Scaling without Technology Help
Power = Ops/sec × Energy/op
Reduce energy/op through better HW
Reduce ops/sec needed through better SW
Tradeoff performance – power
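To make the identity concrete, here is a tiny back-of-the-envelope sketch (the 100 W power budget is an assumed figure; the per-op energies come from the tables later in the talk) of how a fixed budget caps ops/sec:

```python
# Power = Ops/sec x Energy/op, so a fixed power budget caps throughput.
# The 100 W budget is an assumption; per-op energies are from the talk's tables.
POWER_BUDGET_W = 100.0
ENERGY_PER_OP_PJ = {
    "8-bit add": 0.03,
    "32-bit FP mult": 4.00,
    "RISC instruction": 70.00,  # includes fetch/decode/control overheads
}

for op, pj in ENERGY_PER_OP_PJ.items():
    ops_per_sec = POWER_BUDGET_W / (pj * 1e-12)
    print(f"{op:>16}: {ops_per_sec:.2e} ops/sec max")
```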
Better HW with Specialization
Assume temporal locality for now
Caches work
Energy Overheads
Operation         Energy    Scale (vs. 8-bit add)
8-bit add         0.03 pJ   1
32-bit add        0.10 pJ   3
8-bit mult        0.20 pJ   6
32-bit mult       3.00 pJ   100
16-bit FP mult    1.00 pJ   30
32-bit FP mult    4.00 pJ   133
Obvious thought: specialize data-types
But Wait…
Operation         Energy      Scale (vs. 8-bit add)
8-bit add         0.03 pJ     1
32-bit add        0.10 pJ     3
8-bit mult        0.20 pJ     6
32-bit mult       3.00 pJ     100
16-bit FP mult    1.00 pJ     30
32-bit FP mult    4.00 pJ     133
RISC instruction  >70.00 pJ   >2,300
Instruction overheads dominate!
Energy Overheads
[Figure: where the energy of one instruction goes: I-cache access 25 pJ, register file 6 pJ, control 20+ pJ, D-cache access 25 pJ, and finally the operation itself]
First
Amortize instruction and control overheads
Avoid accesses to register files and data caches
Then
Specialize data-types
SIMD & Vectors
[Figure: SIMD datapath: one I-cache access (25 pJ) and control (20+ pJ) shared across many ops; register file (6 pJ) and D-cache (25 pJ) still accessed per op]
SIMD amortizes I-cache and control overheads
10x energy efficiency → must use
Not enough on its own
Must also amortize register and D-cache accesses
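A rough cost model makes the point quantitatively (component energies from the figure above; treating register-file and D-cache energy as strictly per-lane is a simplifying assumption):

```python
# Per-op energy for a SIMD datapath: instruction fetch and control are paid
# once per instruction and amortized across the lanes; register-file, D-cache,
# and ALU energy are paid per lane. A sketch, not a calibrated model.
I_CACHE, CONTROL = 25.0, 20.0               # pJ, per instruction
REG_FILE, D_CACHE, ALU = 6.0, 25.0, 0.1     # pJ, per lane (simple add)

def energy_per_op_pj(lanes):
    return (I_CACHE + CONTROL) / lanes + REG_FILE + D_CACHE + ALU

for lanes in (1, 4, 16):
    print(f"{lanes:2d} lanes: {energy_per_op_pj(lanes):5.1f} pJ/op")
# Wide SIMD still leaves ~31 pJ/op of register-file + D-cache energy,
# which is why specialized units must also avoid those accesses.
```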
Specialization for Convolutions
Convolutional patterns
Signal/photo/video processing, computer vision, …
Map-reduce patterns over 1D/2D/3D stencils
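To see the pattern concretely, here is a 1D stencil written in map-reduce form (plain Python, purely illustrative):

```python
# A 1D convolution as a map-reduce over a sliding stencil: "map" scales each
# window element by a coefficient, "reduce" sums the products. Each window
# overlaps the last, shifted by one -- the reuse that shift registers exploit.
def convolve_1d(signal, taps):
    k = len(taps)
    return [
        sum(x * c for x, c in zip(signal[i:i + k], taps))  # map, then reduce
        for i in range(len(signal) - k + 1)
    ]

print(convolve_1d([1, 2, 3, 4, 5], [0.25, 0.5, 0.25]))  # -> [2.0, 3.0, 4.0]
```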
Specialization for Convolutions
Specialized unit
Register files that support shifts → no data reloading
Map-reduce SIMD instructions → amortize control, share data
Specialized data-types → avoid waste
Specialization Opportunity
[Figure: performance and energy savings (log scale, 1 to 1000x) relative to 4 cores, adding ILP, SIMD, custom instructions, and finally an ASIC]
[Source: ISCA'10]
>100x opportunity in performance & energy
Close to ASIC efficiency
Specialization Challenge #1
[Source: Xilinx]
Efficiency vs. generality
Solution #1: domain-specific accelerators
Solution #2: automatic generation of accelerators
Domain-specific Accelerators
[Source: ISCA’13]
Convolution engine
Generalized engine for convolution-like computations
Wider storage, general ALUs, general reduction, …
10x better than SIMD, ~2x worse than a custom unit
Automatic Generation of Accelerators
[Embedded excerpt from the LINQits paper (Source: Chung'13): Figure 7, the LINQits hardware template for partitioning, grouping, and hashing; Table 4, area of the template configured for different applications; Figures 12 and 13, power and energy of optimized C (1 and 2 CPUs) vs. LINQits. On the ZYNQ platform, LINQits achieves 10.7 to 38.1 times better performance and 8.9 to 30.6 times energy reduction relative to optimized multithreaded C on the ARM cores.]
LINQits engine
Configurable engine for LINQ map-reduce
Ops: select, selectMany, where, join, groupBy, aggregate
10x performance & 10x energy efficiency
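For readers unfamiliar with LINQ, those operators compose into declarative pipelines like the following Python analogue (illustrative of the pattern LINQits maps to hardware, not its actual API):

```python
# where -> select -> groupBy -> aggregate, written as a Python analogue of a
# LINQ query. LINQits compiles pipelines of this shape onto its HW template.
from itertools import groupby

orders = [("ada", 2), ("bob", 1), ("ada", 3)]   # (customer, qty)

kept      = [o for o in orders if o[1] > 0]                          # where
projected = sorted((cust, qty) for cust, qty in kept)                # select
grouped   = groupby(projected, key=lambda p: p[0])                   # groupBy
totals    = {cust: sum(q for _, q in grp) for cust, grp in grouped}  # aggregate
print(totals)  # {'ada': 5, 'bob': 1}
```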
Specialization for I/O
40Gbps NIC
[Source: NVMW'14]
Data-serving accelerators (KVS, noSQL, …)
Common case in HW, complex cases in SW
Example: 20M IOPS @ 4–50usec (DRAM + SSD)
10x throughput & 10x energy efficiency vs. x86 servers
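The common-case/complex-case split is essentially a dispatch pattern; here is a minimal sketch (all names hypothetical, with a Python dict standing in for the accelerator's table):

```python
# Fast path / slow path split of a data-serving accelerator: simple GET hits
# are answered from the "hardware" table; misses and complex commands fall
# back to software. Purely illustrative -- names and structure are assumed.
hw_table = {b"user:1": b"alice"}          # stands in for the DRAM/flash table

def software_slow_path(cmd, key):
    return b"<handled by the software path>"  # eviction, big values, admin, ...

def serve(cmd, key):
    if cmd == b"GET" and key in hw_table:     # common case, in "hardware"
        return hw_table[key]
    return software_slow_path(cmd, key)       # everything else, in software

print(serve(b"GET", b"user:1"))     # fast path
print(serve(b"DELETE", b"user:1"))  # slow path
```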
Domain-specific Specialization
[Figure: throughput (Mqps) on cache hits vs. key + value size (0 to 500 bytes) for the HW flash KVS (GET and SET), the best published HW, and an 8-core Xeon x86 server; the accelerator reaches 20M qps at 40Gbps line rate]
[Source: NVMW'14]
Accelerator (DRAM): 20M IOPS, 3.8 usec, 170 Kqps/W
Flash: 10M IOPS, 50 usec, 85 Kqps/W
x86 server: 2M IOPS, 300 usec, 14 Kqps/W
Specialization Challenge #2
ASICs
Expensive and inflexible
FPGAs
High overheads (bit-level config, I/O interface)
Up to 10x efficiency loss
Solutions
Coarse-grain FPGAs
2.5D and 3D integration
CPU + FPGA integration
[Figure: a traditional SoC (one large chip) vs. a multi-die SoC with CPU, cache, xPU, and FPGA dies on an interposer]
Specialization Challenge #3
What if there is limited temporal locality?
Graphs, in-memory analytics, …
Energy Overheads
Operation         Energy        Scale (vs. 8-bit add)
8-bit add         0.03 pJ       1
32-bit add        0.10 pJ       3
32-bit FP mult    4.00 pJ       133
RISC instruction  70.00 pJ      2,300
8KB cache access  10.00 pJ      300
1MB cache access  100.00 pJ     3,000
DRAM access       2,000.00 pJ   60,000
Memory overheads dominate!
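Plugging the table into a streaming kernel with no reuse shows how completely DRAM dominates (a sketch; it assumes every operand is fetched from DRAM):

```python
# Energy estimate for a dot-product with no temporal locality: every operand
# comes from DRAM. Per-op energies (pJ) are from the table above.
DRAM_PJ, FP_MULT_PJ, RISC_PJ = 2000.0, 4.0, 70.0

n = 1_000_000
dram_pj    = 2 * n * DRAM_PJ               # fetch x[i] and y[i]
compute_pj = n * (FP_MULT_PJ + RISC_PJ)    # multiply + instruction overhead
total_pj   = dram_pj + compute_pj
print(f"total ≈ {total_pj / 1e6:.0f} µJ, DRAM share ≈ {dram_pj / total_pj:.0%}")
# -> DRAM is ~98% of the energy; making the multiplier cheaper barely matters.
```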
Full-system Energy Analysis
[Figure: energy breakdown (core, LLC, memory, other) for histogram, word count, kmeans, pagerank, sssp, bc, spmv, mergesort; 6 Xeon cores, HT, 3 DDR3-1333 channels]
~50% of energy due to memory
>50% of CPU energy due to idling
Full-system Runtime Analysis
[Figure: execution-time breakdown (instructions issued, memory stalls, other stalls) for the same benchmarks; 6 Xeon cores, HT, 3 DDR3-1333 channels]
>50% of the time waiting for memory
Memory-side Computing
✖ Data → compute
✔ Compute → data
Avoid interconnect overheads
Move the computation function to the data
Streaming computations
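A back-of-the-envelope comparison of shipping data to the CPU vs. running the computation next to the memory (the near-memory access cost is an assumption, chosen only to show the shape of the argument):

```python
# Scanning 1 GB: CPU-side pays a full off-chip DRAM access per 64B line
# (~2000 pJ, from the table earlier); a memory-side unit pays only the
# internal array access -- assumed here at 300 pJ/line for illustration.
LINE = 64
OFFCHIP_PJ, NEARMEM_PJ = 2000.0, 300.0   # per 64B line; the latter is assumed

def scan_uj(nbytes, pj_per_line):
    return nbytes / LINE * pj_per_line / 1e6

gib = 1 << 30
print(f"CPU-side scan:    {scan_uj(gib, OFFCHIP_PJ):8.0f} µJ")
print(f"memory-side scan: {scan_uj(gib, NEARMEM_PJ):8.0f} µJ")
```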
Memory-side Computing
[Figure: delay, energy, and energy×delay (log scale, normalized) for DDR-based, HMC-based, and memory-side computing (MSC) designs]
10x performance & 10x energy improvements
Compute capability scales with memory capacity
Didn’t We Try this Before?
EXECUBE, PIM, IRAM, DIVA, FlexRAM, Active Pages, …
Reasons for past failures
Conventional designs scale well
Apps have temporal locality
No good parallel programming models
All compute close to memory
Poor technology options?
Technology Options
3D integration
Buffer-on-board
Stacking w/ edge-bonding
Better Software
Better Software
Better algorithms
Match SW to modern HW
Reduce bloat
Specialization
Elasticity
Matching SW to Modern HW
[Figure: a modern server SoC: cores with SIMD and VT-x/d, shared cache, integrated NIC with network offload, crypto and zip engines, PCIe, DRAM; "New?" marks room for new features]
Time for SW to exploit modern HW
Example: System Software
Hardware is fast
10 GbE widely deployed
>10 cores per server
Large distributed apps should be fast
Millions of QPS for small messages
10-20 us RTT
But Software is the Bottleneck
[Figure: throughput (millions of QPS) and 99th-percentile latency (us) for memcached on Linux vs. the HW limit, mostly unloaded and at peak load; Linux reaches a fraction of the HW limit]
Example: Linux + memcached w/ 200-byte values
Single-socket Sandy Bridge + 10GbE NIC
Conventional Wisdom
Bypass the kernel
Move TCP to user-space
Avoid protection domain crossings
Replace TCP
Offload to hardware (TOE)
Use a different transport protocol (UDP, new)
Replace Ethernet hardware
Use a different fabric (InfiniBand)
Offload I/O processing to HW (RDMA)
How about looking into system SW?
Network I/O in Linux
[Figure: the Linux networking stack: system calls and VFS, a complex interface, packet scheduling, TCP/IP, Ethernet + ARP, scheduling and buffering, interrupts and deferred work]
[Source: http://www.linuxfoundation.org/collaborate/workgroups/networking/kernel_flow]
Rethinking System Software
Efficiency mechanisms
Run to completion
Adaptive batching
Flow-consistent hashing
Zero copy
Scalable + practical APIs
POSIX sockets ✖, events/light-weight threads/futures ✔
[Figure: run-to-completion dataplane: packets move from the RX queue through Recv, the app, and Send to the TX queue; timers and buffer freeing happen in the same loop]
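A minimal sketch of the run-to-completion loop with adaptive batching (illustrative only; IX implements this inside a protected dataplane, not in Python):

```python
# Run to completion with adaptive batching: grab whatever packets arrived
# (capped to bound latency), carry each one through recv -> app -> send on
# the same core with no handoffs, then go back for more work.
from collections import deque

MAX_BATCH = 64  # batch grows with load up to this cap

def run_once(rx, tx, app):
    batch = [rx.popleft() for _ in range(min(MAX_BATCH, len(rx)))]  # adaptive
    for pkt in batch:
        tx.append(app(pkt))          # run to completion, zero handoffs
    return len(batch)

rx, tx = deque(f"req{i}" for i in range(5)), deque()
run_once(rx, tx, str.upper)
print(list(tx))  # ['REQ0', 'REQ1', 'REQ2', 'REQ3', 'REQ4']
```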
System SW Implementation
3-way protection
[Figure: IX architecture: apps (httpd, memcached, your app with a custom runtime) link against libIX and run at VMX non-root CPL 3; IX dataplanes (driver + TCP/IP, or a custom transport) run at VMX non-root CPL 0 with their own cores and NICs; the Linux kernel plus the Dune module run at VMX root]
Domain-specific I/O stack
Control plane for coarse-grain resource assignment
Full API compatibility
[Source: OSDI'14]
Better Software Impact
[Figure: latency (us) vs. throughput (millions of QPS) for IX and Linux at the average, 99th, and 99.9th percentiles; ~10x improvement at a 100us 99.9th-percentile SLA]
[Source: OSDI'14]
10x throughput & 1/3 latency over Linux
Scalable, elastic, secure
Utilization
Why Does Utilization Matter?
The cloud makes specialized HW/SW practical
But it does not make them free
Cloud Economics
Total Cost of Ownership
[Figure: TCO breakdown: servers 61%, energy 16%, cooling 14%, networking 6%, other 3%]
[Source: James Hamilton]
Hardware dominates TCO
Must use it as well as possible
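Since servers dominate TCO, utilization maps almost directly to cost per unit of work; a toy calculation (both inputs assumed) makes the point:

```python
# Toy TCO arithmetic: with hardware the dominant cost, the cost per unit of
# useful work scales inversely with utilization. Both inputs are assumptions.
MONTHLY_TCO = 1_000_000.0        # $/month for a cluster
CAPACITY_QPS = 50e6              # throughput at 100% utilization

for util in (0.15, 0.45, 0.70):
    cost_per_mqps = MONTHLY_TCO / (CAPACITY_QPS * util / 1e6)
    print(f"{util:.0%} utilized: ${cost_per_mqps:,.0f} per Mqps-month")
# Moving from <20% (typical) to >70% (Quasar's target) cuts unit cost ~4-5x.
```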
Cloud Utilization
[Figure: cluster utilization at Twitter (Source: ASPLOS'14) and at Google (Source: Barroso'09)]
Low utilization in private & public clouds (<20%)
Despite the aggressive use of multi-tenancy
Specialized systems can make it worse
Why is Utilization Low?
[Source: ASPLOS'14]
Overprovisioning @ Twitter
Similar challenges across the industry
Provisioning in the Cloud is Hard
[Figure: four provisioning decisions, each a performance curve: scale-up (performance vs. cores), scale-out (performance vs. servers), heterogeneity (performance across server types), and interference (performance vs. input load and input size); the result is overprovisioned reservations. Repeat whenever code or HW changes!]
The Path to High Utilization
Predictable systems
Reduce interference
Elastic software
Reservations ✖, QoS goals ✔
Machine learning to manage complex systems
High Utilization in the Cloud
[Figure: the Quasar cluster manager (Source: ASPLOS'14)]
Quasar: app QoS goals + ML to guide resource selection
>70% cluster utilization
Predictable app performance
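Quasar's classification step is based on collaborative filtering; the sketch below shows the flavor of the idea with a rank-1 model (numpy, made-up data, not the actual system):

```python
# Collaborative filtering for resource selection: profile a new app briefly,
# then use low-rank structure learned from previously-run apps to estimate
# its performance everywhere else. Illustrative sketch with made-up numbers.
import numpy as np

# rows = known apps, cols = server configurations, entries = performance
known = np.array([[1.0, 2.0, 4.0, 8.0],
                  [2.0, 4.0, 8.0, 16.0],
                  [1.0, 1.9, 4.2, 7.9]])

new_app = np.array([3.0, 6.0, np.nan, np.nan])  # profiled on 2 configs only

_, _, vt = np.linalg.svd(known, full_matrices=False)
d = vt[0]                                       # dominant config profile
m = ~np.isnan(new_app)                          # which configs were measured
scale = (new_app[m] @ d[m]) / (d[m] @ d[m])     # least-squares fit
print(np.round(scale * d, 1))  # ≈ [3. 6. 12. 24.] -> choose the best config
```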
Summary
The bad news
All systems are power limited
The good news
We can build much better systems
Lower energy/op + fewer ops/task
The challenges
Design complexity, heterogeneity, utilization, …
Think vertical: HW + system + app + management