Micro-architecture of Godson-3 Multi-Core Processor

Weiwu Hu, Jian Wang, Xiang Gao, and Yunji Chen
Institute of Computing Technology (ICT)
Chinese Academy of Sciences
Hotchips 2008, San Francisco, August 26
Contents
 A brief introduction to Godson processors
 The architecture of Godson-3 multi-core processor
 Physical implementation
 PetaFLOPS and TeraFLOPS
Godson is the academic name of Loongson™
National Project
 High performance CPUs are of national strategic importance
 Chinese ICT industry is growing to a significant scale
 2007 Demand: ICT market = 5.6 trillion RMB
 2007 Supply: only 22% by domestic companies, 3.75% profits
 Godson CPU is supported by
 National 863 project
 National 973 project
 National Science Foundation of China
 National key project
 Key project of Chinese Academy of Sciences
Godson CPU Briefs
 ICT started Godson CPU design in 2001
 The 32-bit Godson-1 CPU in 2002 is the first general purpose CPU in China
 The 64-bit Godson-2B in 2003.10
 The 64-bit Godson-2C in 2004.12
 The 64-bit Godson-2E in 2006.03
 Each tripled the performance of its predecessor
Godson Development
[Figure: SPEC cpu2000 rate by year, 1999 to 2006, on a log scale from 10 to 10000, comparing the Godson rate with the Intel/AMD/HP/IBM/SGI/Sparc rate.]
Godson-2E SPEC CPU2000 Rate
SPEC int2000 <503>
Program        Ref time   Run time   Ratio
164.gzip           1400        403     347
175.vpr            1400        273     512
176.gcc            1100        221     497
181.mcf            1800        307     586
186.crafty         1000        167     598
197.parser         1800        472     382
252.eon            1300        188     690
253.perlbmk        1800        354     508
254.gap            1100        240     458
255.vortex         1900        263     722
256.bzip2          1500        365     411
300.twolf          3000        645     465

SPEC fp2000 <503>
Program        Ref time   Run time   Ratio
168.wupwise        1600        238     672
171.swim           3100        660     469
172.mgrid          1800        579     311
173.applu          2100        549     382
177.mesa           1400        221     634
178.galgel         2900        412     704
179.art            2600        416     624
183.equake         1300        208     624
187.facerec        1900        300     632
188.ammp           2200        432     509
189.lucas          2000        396     506
191.fma3d          2100        531     395
200.sixtrack       1100        345     319
301.apsi           2600        528     493
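The ratio column and the <503> composites on this slide can be checked with a few lines. This is a minimal sketch, assuming each ratio is 100 times the reference time divided by the run time and that each composite is the geometric mean of the per-benchmark ratios (the usual SPEC CPU2000 convention). The function names are illustrative; the ratios are the values listed on the slide.

```python
import math

def spec_ratio(ref_time, run_time):
    """SPEC-style ratio: 100 * reference time / measured run time."""
    return 100.0 * ref_time / run_time

def geometric_mean(values):
    """Geometric mean, computed in log space."""
    return math.exp(sum(math.log(v) for v in values) / len(values))

# Ratios as listed on the slide.
INT_RATIOS = [347, 512, 497, 586, 598, 382, 690, 508, 458, 722, 411, 465]
FP_RATIOS = [672, 469, 311, 382, 634, 704, 624, 624, 632, 509, 506, 395, 319, 493]

if __name__ == "__main__":
    print(f"168.wupwise ratio: {spec_ratio(1600, 238):.0f}")
    print(f"SPEC int2000 composite: {geometric_mean(INT_RATIOS):.0f}")
    print(f"SPEC fp2000 composite:  {geometric_mean(FP_RATIOS):.0f}")
```

Both composites come out within one point of the quoted 503; small differences are expected because the listed ratios are already rounded.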
Godson-2E and Godson-2F
Godson-2E
 1.0GHz@90nm CMOS, 5-7W
 47M transistors, area 36mm^2
 Godson-2 CPU core
 64-bit MIPS III compatible
 Four-issue, OOO
 64KB+64KB L1 (four-way)
 512KB L2 (four-way)
 On-chip DDR controller
 SysAD front-end bus
Godson-2F
 1.0GHz@90nm CMOS, 3-5W
 51M transistors, area 43mm^2
 Godson-2 CPU core
 64-bit MIPS III compatible
 Four-issue, OOO
 64KB+64KB L1 (four-way)
 512KB L2 (four-way)
 On-chip DDR2 controller
 PCI/PCIX, local IO, GPIO, etc.
 Volume production
Low end roadmap: From CPU to SOC
[Figure: chipset integration roadmap across 2E (2006), 2F (2007), 2G (2008), and 2H (2009). Block labels: CPU; North Bridge; South Bridge; Graphic; PCI; PCI/HT; CPU+NB; GPU+South Bridge; CPU+GPU+NB+SB.]
Godson-3 Briefs
 Scalable architecture
 Reconfigurable CPU core and L2
 X86 binary translation optimization
 Low power consumption
 >1.0GHz@65nm
Scalable Architecture Design
 Scalable interconnection network
 Crossbar + Mesh
 Single crossbar connects cores, L2s, and four directions
 Directory-based cache coherence protocol
 Distributed L2 caches are globally addressed
 Each cache block has a directory entry
 Both data cache and instruction cache are recorded in directory
[Figure: one node. Labels: cores P0-P3, four L2 blocks, and the E, S, W, N directions, all attached to an 8x8 Xbar.]
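One way to picture a directory entry that records both data-cache and instruction-cache copies is a pair of sharer bit vectors kept with each L2 block. This is a minimal sketch under assumptions the slides do not confirm: four cores per node, one presence bit per core per L1 cache, and a simple invalidate-on-write policy. All names are hypothetical and this is not the actual Godson-3 protocol.

```python
from dataclasses import dataclass

NUM_CORES = 4  # assumption: one node with cores P0-P3

@dataclass
class DirectoryEntry:
    """Hypothetical directory entry kept with one L2 cache block."""
    dcache_sharers: int = 0  # bit i set => core i's data cache holds a copy
    icache_sharers: int = 0  # bit i set => core i's instruction cache holds a copy
    owner: int = -1          # core holding the block exclusively, or -1

    def read(self, core, is_instruction=False):
        """Record a shared copy; any exclusive owner is downgraded to a sharer."""
        self.owner = -1
        if is_instruction:
            self.icache_sharers |= 1 << core
        else:
            self.dcache_sharers |= 1 << core

    def write(self, core):
        """Grant exclusive access; return the (core, cache) copies to invalidate."""
        invalidate = []
        for c in range(NUM_CORES):
            # Every instruction-cache copy is stale after a write.
            if (self.icache_sharers >> c) & 1:
                invalidate.append((c, "icache"))
            # Data-cache copies in other cores are stale as well.
            if c != core and (self.dcache_sharers >> c) & 1:
                invalidate.append((c, "dcache"))
        self.icache_sharers = 0
        self.dcache_sharers = 1 << core
        self.owner = core
        return invalidate
```

For example, after `read(0)` and `read(1, is_instruction=True)`, a `write(2)` returns the two copies that must be invalidated and leaves core 2 as the only holder. Tracking instruction-cache copies in the same entry is what lets a store invalidate stale code without a separate software flush.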
Reconfigurable Architecture
 General Purpose Core: 64-bit, 4-issue, OOO, AXI interface
 Multiple Purpose Core: LINPACK, biology, signal processing, AXI interface
 8 configurable address windows on each master port allow page migration across L2 and memory
 DMA engine supports prefetch and matrix transpose
 Shared L2 can be configured as internal RAM; DMA to internal RAM directly
[Figure: node block diagram. Labels: P0-P3; master ports m0-m5 and slave ports s0-s5 on a 6*6 AXI Switch; S0-S3; 5*4 AXI Switch; MC0, MC1; two DMA Controllers; HT (two); PCI...]
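To illustrate how configurable address windows on a master port could steer requests, the sketch below routes an address to the first matching window. It assumes a base/mask/target encoding, which the slides do not specify, and the class, field, and target names are hypothetical. Retargeting a single window is enough to redirect an address range from a memory controller to an L2 block, which is the kind of remapping that page migration needs.

```python
from dataclasses import dataclass
from typing import List, Optional

MAX_WINDOWS = 8  # from the slide: 8 configurable windows per master port

@dataclass
class AddressWindow:
    base: int    # start of the window (assumed aligned to its size)
    mask: int    # address bits that must match base
    target: str  # hypothetical destination label, e.g. "S0" or "MC0"

    def matches(self, addr: int) -> bool:
        return (addr & self.mask) == self.base

class MasterPort:
    def __init__(self) -> None:
        self.windows: List[AddressWindow] = []

    def add_window(self, window: AddressWindow) -> None:
        if len(self.windows) >= MAX_WINDOWS:
            raise ValueError("all 8 windows are in use")
        self.windows.append(window)

    def route(self, addr: int) -> Optional[str]:
        """Return the target of the first matching window, or None."""
        for window in self.windows:
            if window.matches(addr):
                return window.target
        return None

if __name__ == "__main__":
    port = MasterPort()
    port.add_window(AddressWindow(0x0000_0000, 0xF000_0000, "MC0"))
    print(port.route(0x0123_4567))   # routed to memory controller 0
    port.windows[0].target = "S1"    # "migrate" the range to an L2 block
    print(port.route(0x0123_4567))
```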
GS464 general purpose core
GS464 Architecture
 MIPS64, 200+ more instructions for X86 binary translation and media acceleration
 Four-issue superscalar OOO pipeline
 Two fixed-point, two FP, one memory units
 Two FP units, each supporting fully pipelined double/paired-single MAC operation
 48-bit VA and PA, 128-bit memory access
 64KB icache and 64KB dcache, 4-way
 64-entry fully associative TLB, 16-entry ITLB, variable page size
 Non-blocking accesses, load speculation
 Directory-based cache coherence for CMP
 Parity check for icache, ECC for dcache
 EJTAG for debugging
 Standard 128-bit AXI interface
[Figure: GS464 block diagram. Labels: PC, PC+16, BHT, BTB, ITLB, ICache, Predecoder, Decoder, Register Mapper, Fix Queue, Float Queue, Integer Register File, Floating Point Register File, ALU1, ALU2, AGU, FPU1, FPU2, DCACHE, DTLB, Tag Compare, CP0 Queue, Reorder Queue (ROQ), BRQ, missqueue, wtbkqueue, ucqueue. Buses: Decode Bus, Map Bus, Branch Bus, Write back Bus, Commit Bus, Refill Bus. Interfaces: Processor Interface (imemread, dmemread, duncache, dmemwrite), AXI Interface, Test Interface, Test Controller, JTAG Interface, EJTAG TAP Controller, clock, reset, int, …]
GStera multiple purpose core
 Target for LINPACK, biology computation, signal processing
 8-16 MACs per node
 Big multi-port register file
 Reconfigurable based on applications
 Standard 128-bit AXI interface
[Figure: GStera block diagram. Labels: XBAR, MicroController, Micro-Code Store, eight 1024*64 4W4R register file banks, DMA+AXI Controller, AXI interface.]
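The GS464 floating-point configuration (two FP units, each with a fully pipelined double or paired-single MAC) implies a peak throughput figure. This is a back-of-envelope sketch, assuming a MAC counts as two floating-point operations, a 1GHz clock, and four GS464 cores per chip, and ignoring the multiple purpose cores. The chip count for one PetaFLOPS is a theoretical-peak estimate, not a figure from the slides.

```python
import math

FP_UNITS = 2        # two FP units per GS464 core
FLOPS_PER_MAC = 2   # one multiply plus one add

def peak_gflops(clock_ghz, cores=1, simd_lanes=1):
    """Peak GFLOPS; simd_lanes=2 models paired-single operation."""
    return clock_ghz * cores * FP_UNITS * simd_lanes * FLOPS_PER_MAC

def chips_for(target_gflops, chip_gflops):
    """Minimum number of chips whose combined peak reaches the target."""
    return math.ceil(target_gflops / chip_gflops)

if __name__ == "__main__":
    core_dp = peak_gflops(1.0)                # double precision, one core
    core_ps = peak_gflops(1.0, simd_lanes=2)  # paired-single, one core
    chip_dp = peak_gflops(1.0, cores=4)       # 4-core chip, double precision
    print(core_dp, core_ps, chip_dp)
    print(chips_for(1_000_000, chip_dp))      # chips for 1 PFLOPS of peak
```

Sustained performance on real workloads would be lower than this peak, so a real machine would need more chips than the estimate suggests.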
Matrix Transposing Performance
 15+ times faster
[Figure: bar chart comparing dsp and cpu for 256x256, 512x512, and 1024x1024 matrices; vertical axis from 0 to 250000000.]
Hardware Support for X86 Binary Translation
 Define new instructions
 X86 ISA function and MIPS ISA format
 Binary translation mechanism supporting
 >200 instructions are defined with 5% additional silicon cost
 Speed up X86-to-MIPS binary translation
 10 times faster than software-only QEMU
[Figure 2. The architecture of Godson-2 virtual machine. Labels: MS windows; Linux apps. on X86; Linux apps. on MIPS; System level X86 VM; Process level X86 VM; Linux on MIPS; Enhanced MIPS decode; Enhanced Godson internal operations.]
[Figure: bar chart, vertical axis from 0 to 16000. Labels: IDCT, FFT(FX), FFT(FP), GP, EFLAG. Series: No HO, No SO; HO Only; Both; Ideal.]
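One reason software-only translation is slow is that MIPS has no condition-flags register, so a translator has to compute the x86 EFLAGS bits explicitly after arithmetic instructions. The sketch below shows that bookkeeping for a 32-bit ADD. It only illustrates the kind of work that instructions with X86 function and MIPS format could absorb; it does not describe the actual Godson-3 instructions, which the slides do not list, and the function name is hypothetical.

```python
MASK32 = 0xFFFFFFFF

def x86_add32(a, b):
    """Emulate a 32-bit x86 ADD: return (result, flags dict)."""
    a &= MASK32
    b &= MASK32
    full = a + b
    result = full & MASK32
    flags = {
        "CF": (full >> 32) & 1,                              # carry out of bit 31
        "ZF": int(result == 0),                              # result is zero
        "SF": (result >> 31) & 1,                            # sign bit of the result
        "OF": (((a ^ result) & (b ^ result)) >> 31) & 1,     # signed overflow
        "AF": ((a ^ b ^ result) >> 4) & 1,                   # carry out of bit 3
        "PF": int(bin(result & 0xFF).count("1") % 2 == 0),   # even parity, low byte
    }
    return result, flags

if __name__ == "__main__":
    result, flags = x86_add32(0x7FFFFFFF, 1)
    print(hex(result), flags)
```

Each translated ADD costs several extra shifts, XORs, and compares when done this way in software, which is why flag handling is a natural target for hardware support.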
Physical implementation
 65nm CMOS LP/GP technology
 Cell-based design methodology
DC + ICC
Manual P&R for critical cells
 2008: 4-core (4GP + 0MP) + 4MB L2
 10W@1GHz
 2009: 8-core (4GP + 4MP) + 4MB L2
 20W@1GHz
GP: General purpose core
MP: Multiple purpose core
4-core
 2008
4-core (general purpose core)
65nm, 1GHz, 10W
[Figure: 4-core block diagram. Labels: P0-P3; master ports m0-m5 and slave ports s0-s5 on a 6*6 X1 Switch; S0-S3; 5*4 X2 Switch; MC0, MC1; two DMA Controllers; HT, PCIE (two); PCI...]
8-core
 2009
8-core (4GP + 4MP)
65nm, 1GHz, 20W
[Figure: 8-core block diagram with two groups. Labels in each group: P0-P3; ports m0-m5 and s0-s5 on an X1 Switch; S0-S3; X2 Switch; DMA Controller; HT, PCIE. Other labels: MC0, MC1, PCI, LPC.]
Full Custom Register File and CAM
Physical register file
 Size: 321um x 262um
 Power: 50mW @ 1GHz
 Delay: 470ps
TLB CAM
 Size: 224um x 235um
 Power: 55mW @ 1GHz
 Delay: 550ps
HyperTransport PHY
HT1.0
Driver & Receiver
FlipChip Compatible 2Row design
800mW @ 1.6Gbps
Size: 250um x 300um
Power: < 10mW
Freq: 1.6GHz
Jitter: 20 ps
Test Chip for Custom Blocks
TEST CHIP
ST 65nm
1206um x 1206um
Function:
CAM1W1R - BIST
CAM1W1R - Scan
RAM4W4R - BIST
RAM4W4R - Scan
ICT_PLL - Freq. test
HT1.0 - BIST
HT1.0 - Error rate test
Cell-based High Performance Physical Design
 The Full Hierarchical Design Methodology
 Manual placement & route for critical paths
 Manual placement of all FFs and clock buffers, manual clock gating
 Architecture optimization with physical feedback
[Figure: floorplan. Labels: ictreg0, ictreg1, regfile0, cam0, fxqueue, alu1, alu2, alu_buf_vsrc, Mul, Div, rissbus, rissuebus2, qissbus, mapbus0/1/2/3, resbus0, resbus1, resbus2, resbus2/3, fwdbus2, fwdbus2/3, br_pc_value, jr_target.]
Clock Tree
 H-Tree + Mesh
 Manual placement of FFs
 Manual clock gating
Layout of 4-core Godson-3
[Figure: die layout. Labels: four GS464 cores, four L2 blocks, two Xbar blocks, two DMA blocks, two HT blocks, and two DDR2/3 blocks.]
PetaFLOPS and TeraFLOPS
 PetaFLOPS for Large Scale Applications
To build PetaFLOPS HPC with Godson-3 in 2010
 TeraFLOPS for Personal HPC
Putting desktop to pockets
Putting TeraFLOPS to desktop
High-performance computing for the masses
Scaling down TeraFLOPS
TeraFLOPS in 1997
$100K/2007: “refrigerator”
$50K/2008: “washing machine”
$10K/2009: “microwave oven”
References
 The Godson-2 superscalar architecture is described in:
 Weiwu Hu, Fuxin Zhang, Zusong Li, “Microarchitecture of the Godson-2 Processor”, Journal of Computer Science and Technology, 20(2):243-249, Mar. 2005
 Weiwu Hu, Jiye Zhao, Shiqiang Zhong, Xu Yang, Elio Guidetti, Chris Wu, “Implementing a 1GHz Four-issue Out-of-Order Execution Microprocessor in a Standard Cell ASIC Methodology”, Journal of Computer Science and Technology, 22(1):1-14, Jan. 2007
 Lessons learned from the Godson processor designs are described in:
 Weiwu Hu, Jian Wang, “Making Effective Decisions in Computer Architects’ Real-World: Lessons and Experiences with Godson-2 Processor Designs”, Journal of Computer Science and Technology, 23(4), July 2008
 The architecture of the Godson-3 multi-core processor is available at:
 Hot Chips 2008
Concluding Remarks
 CPU R&D is of national strategic importance
 Godson-3 has a low-power, scalable architecture
 Godson-3 will be used to build
client side systems
Petaflops machines
Teraflops systems for the masses
 The Godson team at ICT: open cooperation
Thanks