Energy-efficient GPUs from mobile to supercomputers

Transcription

Energy-efficient GPUs from mobile to supercomputers
Parallelism and Energy-efficient GPUs
from Mobile to Supercomputers
Jem Davies
ARM Fellow,VP of Technology
Media Processing Division
24-Feb-13
First, some background…
2
The Eras of Computing
The Internet of Things
100 Billion
Mobile Internet
Units
Desktop
Internet
PC
Mini
Mainframe
1st Era
1M
1960
3
1970
1 Billion
100M
2nd Era
10M
1980
10 Billion
1990
2000
2010
2020
The Eras of Computing
The Internet of Things
100 Billion
Mobile Internet
Units
Desktop
Internet
PC
Mini
Mainframe
1st Era
1M
1960
4
1970
1 Billion
100M
2nd Era
10M
1980
10 Billion
1990
2000
2010
2020
Merging Our Digital and Physical Worlds
ARM® technology and the ARM ecosystem scale from sensors to servers,
connecting our world, from the Internet of Things to supercomputers
ARM’s partners make SoCs containing heterogeneous compute engines
5
 8.7 billion ARM-based chips in 2012, from <50 cents to >$200
 We’ve always been in the multicore business
From 1mm3 to 1km3
Battery
1mm3 platform
Solar Cells
8.75mm3 platform
solar cell 0.18µm
Cortex™-M3
12µAh Li-ion battery
University of Michigan
Processor, SRAM and PMU
4200 ARM powered Neutrino Detectors
70 bore holes 2.5km deep
60 detectors per bore hole
Work supported by the National Science Foundation and University of Wisconsin-Madison
6
1km3 platform
Mobility Is Reshaping Behaviour
79%
Of iPad users
use exclusively
for Internet
36%
60%
Facebook users
access from
mobile
Tablets
outsell
laptop PCs
71%
of enterprises
plan to establish
own store
Chinese
Android™ users
spend 2+ hrs a
day on mobile
apps
1 billion smartphones will ship in 2013
IDG, Pew Research
7
50% of mobile computing devices in 2013 will be ARM-based
Symantec “2012 State of Mobility
Survey,” BI intelligence 2012
Guohe, 2012
Pushing the Boundaries of Mobile Computing
Delivering tomorrow’s features and capabilities
Facial
recognition
and unlock
Multi-tasking
and multiwindow
productivity
8
2.5k/4k displays,
multiple
simultaneous
displays
Video &
image editing,
manipulation,
and rendering
Consolequality gaming
experience
Live image and
scene recognition
for augmented
reality
Intuitive gesture
and natural
language input
1080p HD
video and
beyond, multiple
streams,
HEVC
Interactive Mobile Devices Drive Graphics
 Graphics capability is a key factor in consumer
purchasing decisions
 Rich graphics is a priority for anything with a screen
 Smartphones, DTVs, STBs, Tablets,
hand-held games consoles, netbooks
In-car infotainment
 In the post-PC era, data from the cloud and
graphically-rich, interactive, connected devices means a
growing market for GPUs
 4 billion internet-connected screens in 2016, most with
embedded graphics
9
Solving the Graphics Market Challenges
10
Relative Power
Performance/Power is the Big Challenge
ARM GPU and
System savings
of 35% annually
 Requirements on the GPU continue to grow exponentially but still have to fit within constant power
boundaries
 Mali GPU power already in mobile power budget; 35% additional energy efficiency improvements required
every year to fit new performance requirements within SoC thermal limits
11
ARM Mali GPU Momentum
Delivering Tomorrow’s Features and Capabilities
Over 150M Mali™-based
GPUs reported in 2012
#1 graphics in
Android tablets (>50%)
>20% Android
smartphones
12
>230 OEM products
shipping with Mali GPUs
75 licences across 54
partners, making
Mali-based SoCs
#1 in graphicsenabled DTV (>70%)
ARM is not just about Mobile…
…but Mobile has been driving a lot of innovation
and enabling diversity
13
Mobile Revolution Built On Diversity
40LP
32
HKMG
14/16
FinFET
Cortex
-A53
Cortex
-A5
20nm
Cortex
-A57
Cortex
-A7
Cortex
-A9
Cortex
-A15
Mali
-400
Mali
-450
Mali
-T628
SOI
Connectivity
Video
Memory
Modem
Android
Windows
Phone
Firefox
Mobile
Chrome
OS
Windows
RT
Aliyun
OS
Mali
-T604
Mali
-T678
Reference
Design
Superphone
Camera
Massmarket
smartphone
14
Clamshell
Tablet
Post-PC World Built on Diversity
 Rapid pace of development has embraced numerous changes,

built on diverse, nimble, fast-responding ecosystem of partners
 One size does NOT fit all (nor should it!)
 Not just desktops and laptops any more
New form factors and new usage models:
 Phones, tablets, phablets, netbooks, hybrid tablets, TVs,
consoles, Chromebooks, watches!
 Always-on, always-connected, long battery-life
 Different form factors in different regions
Diverse Silicon foundry ecosystem: bulk CMOS, FDSOI, FinFET

 ARM’s partners are driven by innovation and user requirements, not by ARM
 ARM shows technical leadership but embraces diversity – we do not pick winners!
15
big Performance, LITTLE Power
Highest Performance
Performance
 Smartphone needs to be
very responsive
big
Mobile
performance
needs are
highly elastic
 Constant connectivity and usage
 Voice, SMS, trickle feeds are lowintensity tasks
LITTLE
16
Highest Efficiency
GPU Compute Making the Difference
Heterogeneous computing
Portability
Parallel computation
Hardware acceleration
GPU Computing
Trends
Benefits
More energy-efficient processing
BOM reduction
Improved accuracy/quality
Improved existing use cases
Unlock new use cases
Multi-Perspective Vision
2D to 3D
Multi-User Interaction
Computer Vision
Real Time Still and
Moving Image Perfection
Computational Photography
Information Extraction
Light-Field Photography
Upscaling
17
Unifying CPU and GPU Compute
 High efficiency approaches for:
 Computational photography
 Immersive visual computing
 Augmented Reality
 Improved user interfaces
 GPU for highly-threaded tasks and
Right-sized
computing
CPU for low-threaded tasks
 Efficient ARM IP synergy makes it worthwhile to
move even small tasks to the GPU
 Full Profile GPU Compute is key
to enabling portability
18
Continued
Performance
& Efficiency Gains
Energy Efficiency Underlies It All
Efficiency is a requirement for a connected world
19
ARM has always understood
this – others are now
catching on
Energy-efficiency Has Always Mattered
 Segment differentiation matters less than you think
 High-end Mobile driven by thermal constraints
 And “You can never be too thin”
Sensors need 10 years from a button cell battery

 Everybody hates fans
 Desktops struggle to:

20
 Dissipate more than ~150W out of a single chip package
 Supply more than ~300W from a PSU onto a PCI card
Servers struggle:
 To get more than ~10kW into a rack
 To get more than ~500kW of heat out of a shipping container
 Spending more on power and cooling than on hardware
 To buy more power from the grid
GPUs from Mobile to Supercomputer
 Mont-Blanc project selects Samsung Exynos 5 Processor - 13 Nov 2012
The project continues its research effort towards an energy efficient HPC prototype using low-power embedded technology
 Salt Lake City, 13th November 2012.- The Mont-Blanc European project has selected the
Samsung Exynos platform as the building block for powering its first integrated low power- High
Performance Computing (HPC) prototype. The aim of Mont-Blanc project is to design a new type of
computer architecture capable of setting future global HPC standards, built from today’s energy efficient
solutions used in embedded and mobile devices
The Samsung Exynos 5 Dual is built on 32nm low-power
HKMG (High-K Metal Gate), and features a dual-core
1.7GHz mobile CPU built on ARM® Cortex™-A15
architecture plus an integrated ARM Mali™-T604
GPU for increased performance density and energy
efficiency. It has been featured and market proven in
consumer and mobile devices such as Samsung
Chromebook and Google’s Nexus 10
21
The Search for Parallelism
 ARM (like the industry) utilizes parallelism across all segments, across multiple devices


to drive down energy use
 CPUs use instruction-level parallelism: modern ARM CPUs have IPC well over 3
 CPUs and OSs exploit task-level parallelism: fast context swap, MMUs, virtual memory, hypervisors
Graphics is one of the few problem spaces that have very high levels of parallelism:
 Do <foo> on every vertex in the frame (10s - 100s of kvertices/frame)
 Do <bar> on every pixel in the frame (millions of pixels/frame)
 Data-level parallelism through thread-level parallelism
GPU compute (throughput architectures/implementations) has great advantages:
 Performance and Energy efficiency
CPUs remain “easier” to program

 Do GPUs merge with CPUs?
 Maybe not yet, but connections between them become ever more important
22
Amdahl’s Law is Alive and Well
100% parallel
32
28
Speedup on parallel processors is limited by the
sequential portion of the program
Maximum speedup
24
20
Sequential portion need not be large
to constrain speedup significantly !
16
12
95% parallel
8
90% parallel
4
75% parallel
50% parallel
0
0
4
8
12
16
Number of cores
23
20
24
28
32
ARM’s Heterogeneous Computing Advantages
 ARM has always been involved in
heterogeneous systems
 For 23 years, mobile SoCs have contained
CPUs, DSPs and other compute elements
 big.LITTLE™ and GPU Computing

 ARM’s first multicore CPU was
produced 10 years ago
 ARM has produced several generations of
controllers, baseband, WiFi
coherent interconnect
 ARM introduced the world’s first
embedded multicore GPU
 Now have TFLOP GPUs
24
 Right task, right processor
Today’s modern mobile SoC contains ISPs,
DSPs, GPUs, baseband accelerators,
crypto accelerators, GPUs and VPUs
 “Helper” ARM CPUs, security, power

 Heterogeneous, multicore applications CPUs
ARM is introducing multicore VPUs
 All within today’s energy budget…
Improving the Granule of Offload
 Hardware: sharing data
 Traditional CPU -> GPU was unidirectional offload
 GPU is I/O-coherent with CPU
◦ GPU reads snoop CPU data

25
 Frame buffer is not cached
 Too big, especially with multi-buffering
 Typically large, pre-arranged buffers
 Set up by a driver
GPU Compute is much more bidirectional
 GPU needs to be fully coherent with the CPU
 Smaller-scale sharing of data
 Exchanging pointers to data structures
 Directly referred to by the application
 Avoid driver interaction
CPU
Display
processor
GPU
Object Database
Texture Database
Frame
Buffer
CPU
GPU
CPU + GPU Heterogeneous benefits
 ARM leadership in system coherency is key to enable heterogeneous computing and
the future of graphics
 GPU participating in system coherency
enables a broad set of benefits and values
 Avoids unnecessary cache flush/invalidate operations
Up to Quad
big CPUs
L2
Up to Quad
LITTLE CPUs
L2
L2
 Maintain warm caches: lower power, higher performance
CoreLink CCN-504 Cache Coherent Network
with AMBA 4 ACE Interfaces
Avoids DRAM access to data present in CPU caches
 Lower power by elimination of off-chip memory accesses
NIC Network
CoreLink DMC 520
System and I/O
Eliminates software-managed cache coherency
interconnect
 Greater efficiency, reduced complexity
 Faster TTM by removing source of complex synchronization bugs
Reduced cost of sharing data between CPU and GPU
 Widens the set of algorithms ripe for GPU acceleration
™



26
Up to 8 MaliT678
®
™
What is ARM doing to help?
27
Building Smarter Systems
 Visual computing is about designing the best compute
systems to match the competing demands of power and
performance
 ARM provides the four fundamental IP building blocks
needed to enable high-performance, energy efficient,
coherent System-on-Chips
 ARM’s unique experience and portfolio of processors,
memory systems, physical IP and interconnect designed together
 ARM’s leading system IP extends coherency for
optimized multicore solutions
 Heterogeneous computing for optimum energy efficiency
28
ARMv8: 64-bit Processing Now Available In Mobile Power
 64-bit processing and full
backward compatibility with 32-bit
 Mobile devices will need more
than 4GBytes
CRYPTO
Scalar FP
ARM v7A
Software
Advanced SIMD
 Scalable power and performance
with big.LITTLE
32-bit
 Sub 1mm2 processors
 Co-designed with
Mali-T600 GPUs
ARMv8
Designed to deliver all your computing needs
29
64-bit
Complexity
 Out-of-order superscalar CPUs introduces some complexity
 Multicore CPUs/GPUs – more complexity
 Memory consistency model
 Threading models
Heterogeneous computing - more complexity

 Some GPU compute engines are:

30
 Complex
 Badly described
 Difficult to reason about
Most graphics developers want to create visually stunning
content, not argue about the finer points of computer science
Complexity drives the need for standards
 If we do not abstract away from this complexity
and present a simpler world to the developer…
 Either they won’t use these new systems
 Or, they’ll use it and get it wrong
 Either way, it will be our fault,
and they will hate us for it 
 If we all provide different abstractions…
 They will hate us for it
 We need standard(s)
 And a very small number of them
31
ARM and HSA
 Heterogeneous System Architecture (HSA) Foundation
 Not-for-profit consortium founded by industry leaders to create
hardware and software standards for heterogeneous computing to:
 Simplify the programming environment
 Make compute at low power pervasive
 Introduce new capabilities in modern computing devices
 ARM is a founding member and board member
 Heterogeneous computing is the essence of efficient computation:
the right processor for the right task
 More research required – not a solved problem
 Better models, compilation technologies, debug, programming methodologies
 We need to make it easier to use – still seen as a “black art”
32
Diverse Partners Driving Future of Heterogeneous Computing
Founders
Promoters
Supporters
Contributors
Academic
33
© Copyright 2012 HSA Foundation. All Rights Reserved.
A Call to the ARM Partnership for Action
34

Towards the Future
 Don’t miss opportunities. Push the boundaries!
 Innovation in the post-PC era requires:
 Focus on energy-efficiency across diverse market
segments with diverse devices
 Continue to utilize parallelism (of all forms)
 Compute on the right device
 Highest efficiency CPU for required performance
 Use the right programmable device for the right task,
such as GPU computing
 Domain-specific processors for specific tasks
 Not just hardware problem, need more software investment:

35
 Operating systems, hypervisors, managed environments, APIs, tools
 Application developers across the ecosystem
We need new standards, but there will be different paths to follow
 One size does NOT fit all
ARM committed to heterogeneous, parallel computing
36
Thank you
谢谢
37
Questions?
38