Hardware – Software Coverification with Acceleration systems

June 2003
Jörg Kayser
eServer Verification
IBM Corp Böblingen
Content
• Motivation
• Where acceleration fits
• Scheduling basics
• Special features
• History
• Acceleration Products
• Outlook
System verification objective
• Reduce time to market
• Optimize development cost
• Cover complexity challenges
• by reduction of
– hardware failures prior to tape-out
– number of ECs
– code failures prior to power-on
– bring-up hardware
Microcode bugs
Finding Bugs After Coding is Costly
[Figure: percentage of defects introduced and percentage of defects found in each phase (85% marked), together with the cost to repair a defect in each phase. Phases: Coding, Unit Test, Funct Test, Field Test, Post Release. Cost values shown: $25, $130, $250, $1000, $14,000 (APAR $15-40,000).]
Source: Applied Software Measurement, Capers Jones,1996
Hierarchical Design
System > Chip > ... > Unit > Macro
Allows the design team to break the system down into logical and comprehensible components.
Also allows for repeatable components.
© Bruce Wile
Current Practices for Verifying a System
• Designer Level Sim
– Verification of a macro (or a few small macros)
• Unit Level Sim
– Verification of a group of macros
• Element Level Sim
– Verification of an entire logical function such as a processor, storage controller or I/O control
– Currently synonymous with a chip
• System Level Sim
– Multiple chip verification
– Often utilizes a mini operating system
© Bruce Wile
Shift left from bringup into simulation
[Figure: early system integration along a time axis. Unit simulation of CEC chips and I/O chips feeds HW subsystem and system simulation (CECSIM); Service Element office mode and CECSIM serve as bring-up vehicle for bring-up and integration. Hardware levels shown: intermediate level hardware, verified RIT level hardware. A virtual power-on precedes the real power-on (shift left).]
© Stefan Körner
How to verify big iron?
© Stefan Körner
Acceleration versus Emulation
• Acceleration
– Can increase simulation speed by a factor of 10..1000 vs. a software simulator
• Hyper-Acceleration
– Can increase simulation speed by a factor of 1k..100k vs. a software simulator
• Emulation
– Not only increases sim speed, but also allows direct physical interconnect to a target system (real hardware)
– Example: emulated processor connected to a real system motherboard
Model build
Principle: compile into bool operators, then schedule communication
HDL: A = (B & C) | (D & E);
=> Synthesis into 4-input/1-output gates: inputs B, C, D, E feed a single AND-OR gate (4in/1out gate) with output A
=> Partitioner/Scheduler
=> Model
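As a rough illustration of the "compile into bool operators" step, a 4-input/1-output gate can be held as a 16-entry truth table. The sketch below is only one plausible encoding and not a description of any vendor's compiler; the names `compile_gate` and `evaluate` are hypothetical.

```python
from itertools import product

def compile_gate(func):
    """Return a 16-bit truth table (as an int) for a 4-input boolean function."""
    table = 0
    # product() varies the last input fastest, so index = 8*b + 4*c + 2*d + e.
    for index, (b, c, d, e) in enumerate(product((0, 1), repeat=4)):
        if func(b, c, d, e):
            table |= 1 << index
    return table

def evaluate(table, b, c, d, e):
    """Look up the gate output for one input combination."""
    index = (b << 3) | (c << 2) | (d << 1) | e
    return (table >> index) & 1

# The AND-OR gate from the slide: A = (B & C) | (D & E)
and_or = compile_gate(lambda b, c, d, e: (b & c) | (d & e))
```

Once a gate is reduced to such a table, evaluating it is a single lookup, which is what makes a uniform array of small processors practical.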
Principle of Operation
[Figure: two clocked registers (D/Q) with a block of combinational logic between them.]
• 10 ns clock step
• 3-5 steps per logic level
• Compiler transforms combinational logic into boolean operations
• Compiler schedules interprocessor communications using a fast broadcast technique
• Emulation performance dictated by
– number of processors
– number of levels in the design
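A back-of-the-envelope estimate of how the number of logic levels limits speed. It assumes that one model cycle costs (logic levels x steps per level) steps of 10 ns each and ignores communication overhead and processor count; the 20-level design is a made-up example, and `cycles_per_second` is a hypothetical helper.

```python
STEP_TIME_S = 10e-9  # 10 ns per step, as quoted on the slide

def cycles_per_second(logic_levels, steps_per_level):
    """Estimated model cycles per second under the simplifying assumptions above."""
    steps_per_cycle = logic_levels * steps_per_level
    return 1.0 / (steps_per_cycle * STEP_TIME_S)

# Hypothetical design with 20 logic levels, at 3 and at 5 steps per level.
fast = cycles_per_second(20, 3)
slow = cycles_per_second(20, 5)
print(f"{slow:,.0f} .. {fast:,.0f} cycles per second")
```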
Simulation -> Acceleration
[Figure: three gates A, B and C. Software simulation evaluates them sequentially in 3 steps. Hardware acceleration evaluates them on parallel processors in 2 steps: step 1: A and C, step 2: B.]
Another example
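[Figure: a netlist of 12 gates (a to l) between registers, scheduled onto four emulation processors EP1..EP4. Step 1: b, a, d, c. Step 2: g, f. Step 3: i, h. Step 4: k. Step 5: l. Gates e and j are also scheduled; their slots are not recoverable from the transcript.]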
12 steps serial, 5 steps parallel
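The serial versus parallel step count can be reproduced with a simple greedy list scheduler. This is a minimal sketch and not the scheduler of any real product; the netlist below is hypothetical (the connectivity of the slide's example cannot be recovered from the transcript) and was chosen so that 12 gates on 4 processors also take 5 steps.

```python
# Hypothetical netlist: gate -> gates whose outputs it needs.
NETLIST = {
    "a": [], "b": [], "c": [], "d": [],
    "e": ["a"], "f": ["a", "b"], "g": ["c", "d"],
    "h": ["f"], "i": ["g"], "j": ["e"],
    "k": ["h", "i"],
    "l": ["k", "j"],
}

def schedule(deps, num_procs):
    """Greedy schedule: each step runs up to num_procs gates whose inputs are ready."""
    done = set()
    remaining = sorted(deps)  # sorted for a deterministic result
    steps = []
    while remaining:
        ready = [g for g in remaining if all(d in done for d in deps[g])]
        if not ready:
            raise ValueError("combinational loop in netlist")
        batch = ready[:num_procs]
        steps.append(batch)
        done.update(batch)
        remaining = [g for g in remaining if g not in done]
    return steps

serial = schedule(NETLIST, 1)    # one processor: one gate per step
parallel = schedule(NETLIST, 4)  # four processors, like EP1..EP4
print(len(serial), "steps serial,", len(parallel), "steps parallel")
```

The parallel step count cannot drop below the number of logic levels on the longest path, no matter how many processors are added.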
FPGA versus Processor based
FPGA based systems
• ~50c/gate
• Fewer one-time cost
Processor based systems
• ~10c/gate
• Limited interconnections
• Higher gate utilization
• Faster compile
• Faster runtime
• Higher capacity
Simulation methods
• Acceleration:
– Self-contained mode: model and test vectors are kept inside the emulator. Highest speed, no workstation overhead.
– HDL co-sim mode: part of the design runs in the emulator, the other part in a software simulator. Speed depends on testbench, design and simulator speed.
– C/C++ testbench: a C program is connected to the emulator. Faster than HDL co-sim mode, no simulator overhead. Interactions can be packaged (transaction-based interface).
• Emulation:
– In-circuit emulation: the emulator model runs together with real hardware. Smaller model than in self-contained mode.
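Why packaging interactions helps can be shown with a toy model that counts host-to-accelerator round trips. `FakeAccelerator`, `signal_level` and `transaction_level` are hypothetical stand-ins invented for this sketch; no real accelerator API is implied.

```python
class FakeAccelerator:
    """Stand-in for an accelerator link; it only counts host round trips."""

    def __init__(self):
        self.round_trips = 0
        self.memory = {}

    def send(self, writes):
        """Deliver a batch of (address, value) writes in one round trip."""
        self.round_trips += 1
        self.memory.update(writes)

def signal_level(acc, writes):
    """One round trip per individual write (signal-level interaction)."""
    for addr, value in writes:
        acc.send([(addr, value)])

def transaction_level(acc, writes, batch_size):
    """Package writes into transactions of batch_size writes each."""
    for start in range(0, len(writes), batch_size):
        acc.send(writes[start:start + batch_size])

writes = [(addr, addr * 2) for addr in range(100)]
per_signal, packaged = FakeAccelerator(), FakeAccelerator()
signal_level(per_signal, writes)
transaction_level(packaged, writes, batch_size=32)
print(per_signal.round_trips, "vs", packaged.round_trips, "round trips")
```

Both runs leave the model in the same state; only the number of workstation interactions differs.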
Multivalue Simulation
• The Acceleration hardware supports 2 values only
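• To simulate "X" or "Hi-Z" you have to make changes to the model

3-value to 2-value (signals):

3value    2value
Signal    Signal   Signal_is_X
0         0        0
1         1        0
X         1        1

4-value to 2-value (buses), with the 3-value result seen by the receiver:

4value    2value               3value
Bus       Bus      Bus_is_X    Recv
0         0        0           0
1         1        0           1
H         0        1           0
X         1        1           X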
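The value mapping on this slide can be written down as a small lookup. This is a sketch that follows the table as read from the transcript (each value becomes a pair of 2-value signals, and Hi-Z "H" is read as 0 by the receiver); the function names are hypothetical.

```python
# (value rail, is_X rail) for each multi-value state, as in the slide's tables.
ENCODE = {"0": (0, 0), "1": (1, 0), "H": (0, 1), "X": (1, 1)}

def encode(value):
    """Map a multi-value state to its pair of 2-value signals."""
    return ENCODE[value]

def receive(bus, bus_is_x):
    """3-value result seen by a receiver for the 2-value pair on a bus."""
    if bus and bus_is_x:
        return "X"
    return "1" if bus else "0"

for state in "01HX":
    print(state, "->", encode(state), "->", receive(*encode(state)))
```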
Multivalue Simulation (continued)
[Figure: 2-value implementations of an OR gate (inputs A, B, C), an AND gate (inputs A, B, C), a latch (D, CLK) and an inverter. Each gate is accompanied by additional OR/AND logic that derives OUT_is_X from the inputs' _is_X signals (A_is_X, B_is_X, C_is_X, D_is_X, CLK_IS_X, IN_IS_X).]
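One way to read the gate transformations on this slide is as pessimistic X propagation: the output is X whenever any input is X, and an X output is forced to the canonical pair (1, 1). This reading is an assumption, since the transcript flattens the diagrams; the function names are hypothetical, and a full 3-value simulator would be less pessimistic (for example, 1 OR X = 1).

```python
def or_gate(*inputs):
    """inputs are (value, is_x) pairs; returns the (value, is_x) pair of the output."""
    is_x = int(any(x for _, x in inputs))
    value = int(any(v for v, _ in inputs)) | is_x  # force value rail to 1 on X
    return value, is_x

def and_gate(*inputs):
    """Pessimistic AND: any X input makes the output X."""
    is_x = int(any(x for _, x in inputs))
    value = int(all(v for v, _ in inputs)) | is_x
    return value, is_x

def inv_gate(inp):
    """Inverter: X stays X, otherwise the value rail is inverted."""
    value, is_x = inp
    return (1 - value) | is_x, is_x

ZERO, ONE, X = (0, 0), (1, 0), (1, 1)
print(or_gate(ZERO, ONE), and_gate(ONE, X), inv_gate(X))
```

Every original gate gains a shadow gate for the _is_X rail plus a merge gate, which is consistent with the model growth that multivalue acceleration causes.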
Multivalue Simulation (continued)
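[Figure: 2-value implementation of a 3-state bus. Driver: Data, En, Data_is_X, En_is_X and Other_En are combined through AND/OR gates into Bus (0/1) and Bus_is_X. Receiver: Bus and Bus_is_X are combined through AND/OR gates into Data and Data_is_X, covering the X and H cases.]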
=> Multivalue acceleration increases model size by ~4x
Other features
• event / signal tracing
– thousands of signals
– dynamic probing
– infinite cycles
– compression
– post processing
• fast communication to workstation
– 100 Mb/s Ethernet
– 100 MB/s FIBER
– direct attach, pin multiplexed
• worldwide remote control
– multi user support
– multiple models at the same time
– 24h usage
• cross platform checkpoint/restart with other simulators
• Multiple clock domains
– Design partitioned into domains with different clocks (clock separation)
– Can also be done with clock oversampling and 1 domain
Acceleration history
• 1960: NASA funded Boeing for Apollo verification
– Advanced LSI chips of that time contained < 500 gates, 5usec cycle
– Vision was to have space missions lasting 8-12 years
– Goal:
• Architectural Study on navigation processor, Evaluation of faults
• Built a hardware emulation engine to do the verification job
– Result:
• 4-processor Boeing Computer Simulator (McKay)
• Study showed more than 8 processors was "not adding value"
• Event based communications issues with architecture
• 48k gate model max, 48 bit instructions, 650nsec
• Model slowdown 800x (vs. ET4x4 1000M/3M)
• Patent lawyers and management stopped future work in early 70s, too complex/expensive
• Asked IBM to build a product
© IBM Corporation
Yorktown Simulation Engine
• EVE prototype
• LS-TTL chips
• Multiwire boards, 22x24 sq.in
© IBM Corporation
1982: EVE 1/1.5
• 100ns per cycle
• 32k gates, 4 cards 7x9 sq.in per proc.
• >1000 sq.ft
• 25cents per gate
© IBM Corporation
1987: EVE2
• 80ns per cycle
• 1 chip, 512 gates per proc
• 100 sq.ft
• 67 cents/gate
• Too expensive => crushed
© IBM Corporation
Acceleration history (continued)
[Figure: timeline from 1980 to 2000 showing the acceleration engines LSM, YSE, EVE1, EVE1.5, EVE2, Evette, Corvette, ET and Awan.]
© IBM Corporation
Cadence acceleration
• CoBALT (ET3)
– CoBALT = Concurrent Broadcast Array Logic Technology
– 64 processors per chip, 65 chips (1M gates) per board
– Fast compile time: 2M gates/hour
– Fast download
– Leader in capacity (1997)
– 6 modes of emulation
– Multi-user capable (only system at that time)
– 3 levels of memory, automatic selection
Quickturn Advertisement
[Figure: project schedule, months starting 3/98 to 4/99, comparing a project with Quickturn simulation/emulation used against one with no Quickturn simulation/emulation used. Phases: High Level Architecture; Behavioral Development; Develop Module Level Designs; System Level Regression/Accel; System Level/Netlist Completion; Tape-Out & Fabrication; System/Software Integration; First Customer Ship (Alpha); Silicon Re-Spin #1; Silicon Re-Spin #2.]
3 - 4 Month Schedule Improvement
ND_Rev 980419
Cadence acceleration (continued)
• CoBALT Plus (ET3.5)
– 64 processors per chip, 65 chips (2.5M gates) per board
– 2GB embedded memory
– Big designs run at 65k cps
– Direct Attach Stimulus card (DAS), PCI based host adapter with sustained 50MB/s data rate
– Used at IBM since 1998 with 16 boards (40M gates, 65k processors)
Cadence acceleration (continued)
• Palladium (ET4)
– 256 processors per chip, 65 chips (10M gates) per board
– 4.1GB embedded memory
– Fast compile time: 5M gates/hour
– IBM uses 16 boards since 2001 (160M gates, 266k proc, 65GB mem)
– Designs run at 300k cps
ET4 module layout
Parallelism on ET4
• 256 processors per module, connected to each other, sharing 1MB SRAM and 64MB DRAM
• 65 modules per board, connected to each other
• 16 boards connected with high speed cables
=> up to 266k parallel processors
=> up to 64M 4way gates in 1 design
=> up to 65GB memory
=> up to 1M cps
1 gate evaluation in 7.5 ns (1 step)
~ 35 x 10^12 evaluations per second
10 M evaluations in < 2 us
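The headline numbers on this slide follow from simple arithmetic, assuming every processor completes one gate evaluation per 7.5 ns step (an idealization that ignores scheduling gaps). The variable names are illustrative.

```python
PROCESSORS_PER_MODULE = 256
MODULES_PER_BOARD = 65
BOARDS = 16
STEP_TIME_S = 7.5e-9  # one gate evaluation per processor per step

processors = PROCESSORS_PER_MODULE * MODULES_PER_BOARD * BOARDS
evals_per_second = processors / STEP_TIME_S
time_for_10m_evals_s = 10e6 / evals_per_second

print(f"{processors} processors")
print(f"{evals_per_second:.2e} evaluations per second")
print(f"{time_for_10m_evals_s * 1e6:.2f} us for 10 M evaluations")
```

Under that assumption the processor count is 266,240 (the slide's "266k") and the peak rate is roughly 35 x 10^12 evaluations per second.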
IBM inhouse systems
• AWAN
– 300..3000 cps
– 29M 4w gates (80M 2w gates)
– 256 LP, 16 AP
– Sparse array mapping
– Model build (Modbld) faster than ET4
• AWAN 4X
– 115M 4w gates
• AWAN NG
Awan Logic Board
[Figure: Awan logic board with four rows of four logic processors (LP), an array processor, a switch, a daughter card and the backplane connection.]
© Harrell Hoffman
AWAN Processor board
• Logic (Gate) Processor: Toshiba Gate Array, Cmos-4 (.4 micron)
• Instruction (LP/SW) Memory: Mosys Dram, DDR + common addr/data bus + 32 word burst
• Backplane Connectors
• Daughter Board: handles inputs from backplane
• Switch (Interconnect) Chip: Toshiba Gate Array
• Sram Proc + Memory: Chip Express Gate Array (8 MBytes memory)
• Dram Proc + Memory: Chip Express Gate Array (128 MBytes memory); also handles sparse arrays using associative mapping
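The idea behind sparse array mapping can be sketched in software: only addresses that were actually written occupy storage, and everything else reads back a default value. This assumes "associative mapping" means lookup keyed by address; how the AWAN hardware implements it is not described in the slides, and `SparseArray` is a hypothetical name.

```python
class SparseArray:
    """Model of a large memory that stores only the cells actually written."""

    def __init__(self, depth, default=0):
        self.depth = depth
        self.default = default
        self._cells = {}  # address -> value, filled on demand

    def _check(self, address):
        if not 0 <= address < self.depth:
            raise IndexError(f"address {address} out of range")

    def write(self, address, value):
        self._check(address)
        self._cells[address] = value

    def read(self, address):
        self._check(address)
        return self._cells.get(address, self.default)

    def used(self):
        """Number of cells that occupy real storage."""
        return len(self._cells)

# A 2**40-word array is far larger than any physical memory on the board,
# yet a test case that touches two words needs only two stored cells.
mem = SparseArray(depth=2**40)
mem.write(0, 0xCAFE)
mem.write(2**39, 0xBEEF)
```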
© Harrell Hoffman
AWAN
• High capacity:
– 10 Million Gates (2 boards)
– 256/512MB memory
• High speed:
– Compiler performance: 12M gates/hour
– Simulation performance: 500..5000 Hz
• Highly flexible:
– Processor based acceleration
– SW simulator model
• Low Cost:
– Pennies per gate, 1.1M$ (DAC 2000)
© Harrell Hoffman
Other Acceleration vendors
• Mentor Graphics (including Ikos)
– Celaro Pro, VStation, both 30M gates, cascadable
• Aptix
– MP4, 5-10MHz, 1.8M gates, FPGA, no trace memory, 185k$
• Axis Systems
– Xtreme-II, up to 100M gates
• Tharas Systems
– Hammer 32M, Processor based, 32M gates, cascadable
– 10M gates/hour compile time, incremental compile
Main Drawbacks to Emulation
Percent of teams naming each drawback:
• "Time to emulation": 73%
• Effort required to use: 65%
• Emulation at-speed: 62%
• Cost per gate: 62%
• Connection to other EDA tools: 50%
Source: 1997 Collett International, Inc
Key barriers: Time-to-emulation, ease-of-use and cost
Outlook
• Acceleration systems are still expensive
• Hardware-software coverification pays off after a project or two
• Tool handling and integration with other simulators are getting easier
• DAC 2003 trend: FV and acceleration are on the rise